Kernel source version: Linux 4.4
Introduction: anyone programming on Linux eventually runs into memory management. From my first contact with Linux I kept reading about the copy-on-write mechanism and wondered how it was actually implemented. Later, while working with DPDK, I used hugepages to reduce TLB misses for better performance, and learned that a user-space malloc() returns an address without actually allocating physical memory. As working experience accumulates, this knowledge can no longer stay at the level of concepts and knowing which API to call; it is time to dig into the kernel source. Let's start from the arm64 code.
I. MMU background
Anyone who has touched Linux knows that the addresses the CPU issues are virtual addresses: the MMU translates them into physical addresses, and it does so by walking page tables. If no valid page-table entry exists, an exception is raised; dereferencing NULL, for example, triggers one. In user space the application dies; in kernel space it is worse, the whole system goes down. In short, the MMU has two main jobs:
1. Address translation (driven by the page tables)
2. Permission checking (a few bits in each page-table entry mark the page readable, writable and executable)
So whenever a translation is missing or a permission check fails, the CPU takes an exception. That sounds like something that only happens when things go wrong, yet Linux exploits exactly this behaviour to implement perfectly normal features such as copy on write.
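Before moving on to the kernel side, here is a small user-space demo of my own (not kernel code, assumes Linux/glibc) showing the permission-check half of the MMU's job: writing to a page mapped read-only takes a permission fault, and the kernel turns it into SIGSEGV with si_code set to SEGV_ACCERR, the same sig/code pair that appears in the fault table below:
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    const char *msg = (info->si_code == SEGV_ACCERR)
        ? "SIGSEGV: permission fault (SEGV_ACCERR)\n"
        : "SIGSEGV: no mapping (SEGV_MAPERR)\n";
    (void)sig; (void)ctx;
    write(STDOUT_FILENO, msg, strlen(msg));
    _exit(0);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = on_segv, .sa_flags = SA_SIGINFO };
    char *p;

    sigaction(SIGSEGV, &sa, NULL);

    /* anonymous page mapped read-only */
    p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = 1;       /* write to a read-only page: a permission fault */
    return 0;       /* never reached */
}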
The exception entry point is a short stretch of assembly, which then calls do_mem_abort:
asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
struct pt_regs *regs)
{
const struct fault_info *inf = fault_info + (esr & 63);
struct siginfo info;
if (!inf->fn(addr, esr, regs))
return;
pr_alert("Unhandled fault: %s (0x%08x) at 0x%016lx\n",
inf->name, esr, addr);
info.si_signo = inf->sig;
info.si_errno = 0;
info.si_code = inf->code;
info.si_addr = (void __user *)addr;
arm64_notify_die("", regs, &info, esr);
}
Three parameters are passed in:
1. addr – the faulting address
2. esr – the exception syndrome value, which encodes the fault type
3. regs – the CPU registers at the moment the exception was taken
The low 6 bits of esr (the fault status code, hence esr & 63) index into fault_info to pick a handler; if that handler cannot deal with the fault, an "Unhandled fault" message is printed.
Here is fault_info:
static const struct fault_info {
int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
int sig;
int code;
const char *name;
} fault_info[] = {
{ do_bad, SIGBUS, 0, "ttbr address size fault" },
{ do_bad, SIGBUS, 0, "level 1 address size fault" },
{ do_bad, SIGBUS, 0, "level 2 address size fault" },
{ do_bad, SIGBUS, 0, "level 3 address size fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 0 translation fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 1 translation fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 2 translation fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation fault" },
{ do_bad, SIGBUS, 0, "unknown 8" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 access flag fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 access flag fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 access flag fault" },
{ do_bad, SIGBUS, 0, "unknown 12" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
{ do_bad, SIGBUS, 0, "synchronous external abort" },
{ do_bad, SIGBUS, 0, "unknown 17" },
{ do_bad, SIGBUS, 0, "unknown 18" },
{ do_bad, SIGBUS, 0, "unknown 19" },
{ do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error" },
{ do_bad, SIGBUS, 0, "unknown 25" },
{ do_bad, SIGBUS, 0, "unknown 26" },
{ do_bad, SIGBUS, 0, "unknown 27" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "unknown 32" },
{ do_bad, SIGBUS, BUS_ADRALN, "alignment fault" },
{ do_bad, SIGBUS, 0, "unknown 34" },
{ do_bad, SIGBUS, 0, "unknown 35" },
{ do_bad, SIGBUS, 0, "unknown 36" },
{ do_bad, SIGBUS, 0, "unknown 37" },
{ do_bad, SIGBUS, 0, "unknown 38" },
{ do_bad, SIGBUS, 0, "unknown 39" },
{ do_bad, SIGBUS, 0, "unknown 40" },
{ do_bad, SIGBUS, 0, "unknown 41" },
{ do_bad, SIGBUS, 0, "unknown 42" },
{ do_bad, SIGBUS, 0, "unknown 43" },
{ do_bad, SIGBUS, 0, "unknown 44" },
{ do_bad, SIGBUS, 0, "unknown 45" },
{ do_bad, SIGBUS, 0, "unknown 46" },
{ do_bad, SIGBUS, 0, "unknown 47" },
{ do_bad, SIGBUS, 0, "TLB conflict abort" },
{ do_bad, SIGBUS, 0, "unknown 49" },
{ do_bad, SIGBUS, 0, "unknown 50" },
{ do_bad, SIGBUS, 0, "unknown 51" },
{ do_bad, SIGBUS, 0, "implementation fault (lockdown abort)" },
{ do_bad, SIGBUS, 0, "implementation fault (unsupported exclusive)" },
{ do_bad, SIGBUS, 0, "unknown 54" },
{ do_bad, SIGBUS, 0, "unknown 55" },
{ do_bad, SIGBUS, 0, "unknown 56" },
{ do_bad, SIGBUS, 0, "unknown 57" },
{ do_bad, SIGBUS, 0, "unknown 58" },
{ do_bad, SIGBUS, 0, "unknown 59" },
{ do_bad, SIGBUS, 0, "unknown 60" },
{ do_bad, SIGBUS, 0, "section domain fault" },
{ do_bad, SIGBUS, 0, "page domain fault" },
{ do_bad, SIGBUS, 0, "unknown 63" },
};
Only three handler functions actually appear in the table: do_translation_fault, do_page_fault and do_bad.
To summarize:
1. Translation faults go to do_translation_fault (which may in turn call do_page_fault).
2. Access-flag and permission faults go to do_page_fault.
3. Everything else goes to do_bad, which does nothing and simply returns, so do_mem_abort reports an unhandled fault.
II. The do_translation_fault function
static int __kprobes do_translation_fault(unsigned long addr,
unsigned int esr,
struct pt_regs *regs)
{
/* user-space address */
if (addr < TASK_SIZE)
return do_page_fault(addr, esr, regs);
/* kernel address or otherwise invalid address */
do_bad_area(addr, esr, regs);
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->active_mm;
/*
* If we are in kernel mode at this point, we have no context to
* handle this fault with.
*/
/* was the fault taken in user mode? */
if (user_mode(regs))
/* a user program touched an address it has no mapping for: deliver SIGSEGV */
__do_user_fault(tsk, addr, esr, SIGSEGV, SEGV_MAPERR, regs);
else
__do_kernel_fault(mm, addr, esr, regs);
{
/*
* Are we prepared to handle this kernel fault?
* We are almost certainly not prepared to handle instruction faults.
*/
/* search the exception table and try a fixup (e.g. a uaccess routine handed a bogus user-supplied pointer) */
if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
return;
/*
* No handler, we'll have to terminate things with extreme prejudice.
*/
/* genuine kernel fault: print the Oops information */
bust_spinlocks(1);
pr_alert("Unable to handle kernel %s at virtual address %08lx\n",
(addr < PAGE_SIZE) ? "NULL pointer dereference" :
"paging request", addr);
show_pte(mm, addr);
die("Oops", regs, esr);
bust_spinlocks(0);
do_exit(SIGKILL);
}
}
return 0;
}
From the code above:
1. If the faulting address is a user-space address, do_page_fault is called directly.
2. If the faulting address is a kernel (or otherwise invalid) address and the fault came from user mode, the program is sent a fatal SIGSEGV. If it came from kernel mode, a fixup is attempted; if that fails, a kernel Oops is reported, i.e. the kernel context dies.
A note on fixup: uaccess.h defines a handful of cases such as copy_from_user whose instruction addresses are recorded in the exception tables; fixup means searching those tables for the faulting instruction. Documentation/x86/exception-tables.txt in the kernel source explains this in detail; an excerpt follows:
When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.
In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).
This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_read had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.
To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.
How does this work?
Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.
do_page_fault first obtains the unaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred, because the page
was not swapped in, write protected or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.
There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
Where does fixup point to?
Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.
To paraphrase: kernel code frequently has to dereference user-space addresses handed in by untrusted programs, so older kernels validated every such address with verify_area(), which walked the vma list. In the overwhelmingly common (correct) case the check always passed, yet this normally unneeded verification cost a measurable amount of time. Linux therefore handed the job to the virtual-memory hardware: the instruction addresses of the uaccess.h accessors are recorded in an exception table, and when one of them faults the handler searches that table and, on a hit, fixes up execution instead of treating the fault as a kernel bug.
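To make the mechanism concrete, here is a rough, self-contained sketch (hypothetical structure and function names, not the kernel's real implementation) of what an exception table and the fixup search boil down to:
#include <stddef.h>

struct extable_entry {
    unsigned long insn;   /* address of the instruction allowed to fault */
    unsigned long fixup;  /* where to resume if it does fault            */
};

/* In the kernel this table is generated at build time from the uaccess
 * macros and sorted by insn; here it is just a placeholder. */
static const struct extable_entry extable[] = {
    { 0x1000, 0x2000 },
    { 0x1010, 0x2040 },
};

static const struct extable_entry *search_extable(unsigned long pc)
{
    size_t lo = 0, hi = sizeof(extable) / sizeof(extable[0]);

    while (lo < hi) {                       /* binary search on insn */
        size_t mid = lo + (hi - lo) / 2;
        if (extable[mid].insn == pc)
            return &extable[mid];
        if (extable[mid].insn < pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Called by the fault handler when a kernel-mode fault hits a bad user
 * address: if the faulting pc is listed, redirect execution to the fixup
 * code (which typically makes copy_from_user() report how many bytes it
 * could not copy) and report success; otherwise it is a genuine kernel bug. */
static int try_fixup(unsigned long *pc)
{
    const struct extable_entry *e = search_extable(*pc);

    if (!e)
        return 0;
    *pc = e->fixup;
    return 1;
}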
III. The do_page_fault function
At first glance do_page_fault only has to handle two kinds of exceptions: missing translations and permission errors. In practice it is more involved; roughly it breaks down into the following cases (probably not exhaustive; a small user-space demo follows the list):
Translation (missing-page) faults:
1. anonymous page fault
2. file-backed mapping fault
3. swap-in fault
4. stack expansion
5. invalid address
Permission faults:
1. copy on write
2. invalid access
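To see the anonymous-fault case from user space, here is another small demo of my own (not kernel code, assumes Linux/glibc): freshly mmap'd anonymous memory has no physical pages behind it, and getrusage() reports one minor fault per page on first write:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long before, after;

    if (p == MAP_FAILED)
        return 1;

    before = minor_faults();
    memset(p, 0x5a, len);              /* first touch of every page */
    after = minor_faults();

    printf("touched %zu MiB, minor faults: %ld (p[0]=%d)\n",
           len >> 20, after - before, p[0]);
    munmap(p, len);
    return 0;
}
With that picture in mind, here is do_page_fault itself: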
static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
struct pt_regs *regs)
{
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
/* if kprobes is enabled and the fault came from kernel mode, let kprobes handle it and return */
if (notify_page_fault(regs, esr))
return 0;
/* the current task and its mm */
tsk = current;
mm = tsk->mm;
/* Enable interrupts if they were enabled in the parent context. */
/* if interrupts were enabled when the fault was taken, re-enable them: the handling below may itself be interrupted */
if (interrupts_enabled(regs))
local_irq_enable();
/*
* If we're in an interrupt or have no user context, we must not take
* the fault.
*/
/* mm == NULL means we interrupted a kernel thread; the other case is an atomic context, also kernel-side */
if (faulthandler_disabled() || !mm)
goto no_context;
/* the fault was triggered from user mode */
if (user_mode(regs))
mm_flags |= FAULT_FLAG_USER;
/* instruction (execute permission) abort from EL0 */
if (is_el0_instruction_abort(esr)) {
vm_flags = VM_EXEC;
/* write access (and not a cache-maintenance operation) */
} else if ((esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM)) {
vm_flags = VM_WRITE;
mm_flags |= FAULT_FLAG_WRITE;
}
/* a permission fault on a user-space address taken in kernel mode: treated as a kernel bug and reported via die() */
if (addr < USER_DS && is_permission_fault(esr, regs)) {
/* regs->orig_addr_limit may be 0 if we entered from EL0 */
/* the thread's addr_limit was KERNEL_DS, i.e. running under set_fs(KERNEL_DS) */
if (regs->orig_addr_limit == KERNEL_DS)
die("Accessing user space memory with fs=KERNEL_DS", regs, esr);
/* the kernel tried to execute user-space memory */
if (is_el1_instruction_abort(esr))
die("Attempting to execute userspace memory", regs, esr);
/* not found in the exception table: the kernel touched user space outside the uaccess.h helpers */
if (!search_exception_tables(regs->pc))
die("Accessing user space memory outside uaccess.h routines", regs, esr);
}
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
* we can bug out early if this is from code which shouldn't.
*/
/* try to take mm->mmap_sem for reading */
if (!down_read_trylock(&mm->mmap_sem)) {
/* the fault did not come from user mode and the pc is not in the exception table: go down the kernel-error path */
if (!user_mode(regs) && !search_exception_tables(regs->pc))
goto no_context;
retry:
down_read(&mm->mmap_sem);
} else {
/*
* The above down_read_trylock() might have succeeded in which
* case, we'll have missed the might_sleep() from down_read().
*/
might_sleep();
#ifdef CONFIG_DEBUG_VM
if (!user_mode(regs) && !search_exception_tables(regs->pc))
goto no_context;
#endif
}
/* kernel-side errors were filtered off to no_context above; getting here means the fault was produced by a user-space program */
fault = __do_page_fault(mm, addr, mm_flags, vm_flags, tsk);
{
struct vm_area_struct *vma;
int fault;
/* find the vma that should contain this address */
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
/* no vma found: invalid address */
if (unlikely(!vma))
goto out;
/* the address lies below the vma start: possibly a stack that needs to grow */
if (unlikely(vma->vm_start > addr))
goto check_stack;
/*
* Ok, we have a good vm_area for this memory access, so we can handle
* it.
*/
good_area:
/*
* Check that the permissions on the VMA allow for the fault which
* occurred. If we encountered a write or exec fault, we must have
* appropriate permissions, otherwise we allow any permission.
*/
/* permission check: accessing an address without the required rights just returns here; later handling kills the process with SIGSEGV */
if (!(vma->vm_flags & vm_flags)) {
fault = VM_FAULT_BADACCESS;
goto out;
}
return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
{
int ret;
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
mem_cgroup_count_vm_event(mm, PGFAULT);
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
/*
* Enable the memcg OOM handling for faults triggered in user
* space. Kernel faults are handled more gracefully.
*/
/* fault triggered from user space: enable memcg OOM handling */
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
ret = __handle_mm_fault(mm, vma, address, flags);
/* handling done: disable memcg OOM again */
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
* VM_FAULT_OOM), there is no need to kill anything.
* Just clean up the OOM state peacefully.
*/
if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
mem_cgroup_oom_synchronize(false);
}
return ret;
}
check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
goto good_area;
out:
return fault;
}
/*
* If we need to retry but a fatal signal is pending, handle the
* signal first. We do not need to release the mmap_sem because it
* would already be released in __lock_page_or_retry in mm/filemap.c.
*/
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
if (!user_mode(regs))
goto no_context;
return 0;
}
/*
* Major/minor page fault accounting is only done on the initial
* attempt. If we go through a retry, it is extremely likely that the
* page will be found in page cache at that point.
*/
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
if (fault & VM_FAULT_MAJOR) {
tsk->maj_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
addr);
} else {
tsk->min_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
addr);
}
if (fault & VM_FAULT_RETRY) {
/*
* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
* starvation.
*/
mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
mm_flags |= FAULT_FLAG_TRIED;
goto retry;
}
}
up_read(&mm->mmap_sem);
/*
* Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
*/
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
VM_FAULT_BADACCESS))))
return 0;
/*
* If we are in kernel mode at this point, we have no context to
* handle this fault with.
*/
if (!user_mode(regs))
goto no_context;
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
* userspace (which will retry the fault, or kill us if we got
* oom-killed).
*/
pagefault_out_of_memory();
return 0;
}
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to successfully fix up
* this page fault.
*/
sig = SIGBUS;
code = BUS_ADRERR;
} else {
/*
* Something tried to access memory that isn't in our memory
* map.
*/
sig = SIGSEGV;
code = fault == VM_FAULT_BADACCESS ?
SEGV_ACCERR : SEGV_MAPERR;
}
__do_user_fault(tsk, addr, esr, sig, code, regs);
return 0;
no_context:
__do_kernel_fault(mm, addr, esr, regs);
return 0;
}
The flow above first filters out faults raised in kernel mode:
1. an invalid kernel address: the kernel Oopses;
2. the uaccess.h cases such as copy_to_user/copy_from_user hitting a bad user address: the exception table is searched and the access is fixed up, so the system call simply fails (typically with EFAULT) instead of bringing the kernel down.
What remains are faults raised by user-space programs, handled by __do_page_fault:
1. Look up the vma. If there is none, the address is invalid; if there is one, the access permissions are still checked. Both an invalid address and a permission error end with the process being killed by SIGSEGV.
2. User-stack growth: expand_stack() extends the stack vma (see the small demo after this list).
3. Otherwise the address is valid and the permissions are right, and handle_mm_fault takes over.
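The stack-growth case (point 2) can also be watched from user space. In this small demo of my own (assumes the default Linux stack limit is large enough; build without aggressive optimization if the numbers look off), each recursion pushes the stack pointer below the current stack vma, and the resulting faults are resolved by growing the stack instead of killing the process:
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

static long touch_stack(int depth)
{
    char buf[64 * 1024];       /* a big frame pushes the stack pointer down */
    long sum = 0;
    size_t i;

    memset(buf, depth + 1, sizeof(buf));
    for (i = 0; i < sizeof(buf); i += 4096)
        sum += buf[i];         /* read one byte per page so nothing is optimized away */
    if (depth > 0)
        sum += touch_stack(depth - 1);
    return sum;
}

int main(void)
{
    long before = minor_faults();
    long sum = touch_stack(32);                 /* roughly 2 MiB of fresh stack */

    printf("minor faults while growing the stack: %ld (sum=%ld)\n",
           minor_faults() - before, sum);
    return 0;
}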
handle_mm_fault:
handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
{
int ret;
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
mem_cgroup_count_vm_event(mm, PGFAULT);
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
/*
* Enable the memcg OOM handling for faults triggered in user
* space. Kernel faults are handled more gracefully.
*/
/* fault triggered from user space: enable memcg OOM handling */
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
ret = __handle_mm_fault(mm, vma, address, flags);
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
/* hugetlbfs handling, not covered here */
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
/* walk the pgd, pud, pmd and pte levels: the pgd always exists; pud, pmd and pte are allocated if missing */
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
return VM_FAULT_OOM;
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
/* transparent huge page handling, not covered here */
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
int ret = create_huge_pmd(mm, vma, address, pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
pmd_t orig_pmd = *pmd;
int ret;
barrier();
if (pmd_trans_huge(orig_pmd)) {
unsigned int dirty = flags & FAULT_FLAG_WRITE;
/*
* If the pmd is splitting, return and retry the
* the fault. Alternative: wait until the split
* is done, and goto retry.
*/
if (pmd_trans_splitting(orig_pmd))
return 0;
if (pmd_protnone(orig_pmd))
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
if (dirty && !pmd_write(orig_pmd)) {
ret = wp_huge_pmd(mm, vma, address, pmd,
orig_pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
huge_pmd_set_accessed(mm, vma, address, pmd,
orig_pmd, dirty);
return 0;
}
}
}
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
if (unlikely(pmd_none(*pmd)) &&
unlikely(__pte_alloc(mm, vma, pmd, address)))
return VM_FAULT_OOM;
/*
* If a huge pmd materialized under us just retry later. Use
* pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
* didn't become pmd_trans_huge under us and then back to pmd_none, as
* a result of MADV_DONTNEED running immediately after a huge pmd fault
* in a different thread of this mm, in turn leading to a misleading
* pmd_trans_huge() retval. All we have to ensure is that it is a
* regular pmd that we can walk with pte_offset_map() and we can do that
* through an atomic read in C, which is what pmd_trans_unstable()
* provides.
*/
if (unlikely(pmd_trans_unstable(pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
* from under us anymore at this point because we hold the mmap_sem
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().
*/
pte = pte_offset_map(pmd, address);
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
{
pte_t entry;
spinlock_t *ptl;
/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
* so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
* The code below just needs a consistent view for the ifs and
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
entry = *pte;
barrier();
/* the page is not present in memory */
if (!pte_present(entry)) {
/* the pte is empty: this address has never been mapped in */
if (pte_none(entry)) {
/* if vma->vm_ops is not set this is an anonymous mapping, otherwise a file-backed one */
if (vma_is_anonymous(vma))
return do_anonymous_page(mm, vma, address,
pte, pmd, flags);
else
/* handle a fault on a file-backed mapping */
return do_fault(mm, vma, address, pte, pmd,
flags, entry);
{
pgoff_t pgoff = (((address & PAGE_MASK)
- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
pte_unmap(page_table);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
/* read fault on a file page */
if (!(flags & FAULT_FLAG_WRITE))
return do_read_fault(mm, vma, address, pmd, pgoff, flags,
orig_pte);
/* write fault on a private file mapping (copy-on-write of the file page) */
if (!(vma->vm_flags & VM_SHARED))
return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
orig_pte);
/* write fault on a shared file mapping */
return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
}
/* swap-in handling: the accessed memory was swapped out to the swap area */
return do_swap_page(mm, vma, address,
pte, pmd, flags, entry);
}
if (pte_protnone(entry))
return do_numa_page(mm, vma, address, entry, pte, pmd);
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
/* write fault on a present but read-only pte: the copy-on-write case */
if (flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
pte, pmd, ptl, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
update_mmu_cache(vma, address, pte);
} else {
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (flags & FAULT_FLAG_WRITE)
flush_tlb_fix_spurious_fault(vma, address);
}
unlock:
pte_unmap_unlock(pte, ptl);
return 0;
}
}
/* handling done: disable memcg OOM again */
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
* VM_FAULT_OOM), there is no need to kill anything.
* Just clean up the OOM state peacefully.
*/
if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
mem_cgroup_oom_synchronize(false);
}
return ret;
}
Once inside, the processing goes roughly like this:
1. Walk the page tables: the pgd always exists; pud, pmd and pte are allocated if they are missing.
2. Page not present and never accessed before: an anonymous page (vma->vm_ops not set) is handled by do_anonymous_page, a file-backed page by do_fault (covering read faults, writes to private file mappings and writes to shared file mappings).
3. Page not present but accessed before: it was swapped out, so do_swap_page brings it back in.
4. Write permission fault on a present page: this is the copy-on-write case, handled by do_wp_page (a small demo follows this list).
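The copy-on-write path is easy to watch from user space. In this small demo of my own (not kernel code, assumes Linux/glibc), the child inherits the parent's pages read-only, and its first write to each page takes a permission fault that is resolved by copying the page, counted as a minor fault in the child:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN (16 * 1024 * 1024)

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *p = malloc(LEN);
    long before;

    memset(p, 1, LEN);                /* the parent owns the pages for now */

    if (fork() == 0) {                /* child */
        before = minor_faults();
        memset(p, 2, LEN);            /* first write per page triggers copy-on-write */
        printf("child COW minor faults: %ld (p[0]=%d)\n",
               minor_faults() - before, p[0]);
        _exit(0);
    }
    wait(NULL);
    free(p);
    return 0;
}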
Looking back over all of this, Linux clearly makes heavy use of MMU exceptions, and the fault path is quite long: every fault costs CPU time. For ordinary applications that is fine, but for workloads chasing extreme performance it pays to understand it. If memory has been pushed out to swap it has to be read back in, which is far too slow, so you trade RAM for performance. And since the memory returned by malloc() is not actually backed by physical pages yet, you can simply memset() it right after allocation to force the physical allocation, rather than paying for this whole fault path in the middle of real work (see the sketch below).
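A sketch of that idea (illustrative only, error handling omitted): besides memset() after malloc(), MAP_POPULATE and mlock() are standard ways to pay the fault cost up front, and mlock() additionally prevents the pages from ever being swapped out:
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (64 * 1024 * 1024)

int main(void)
{
    /* Option 1: allocate, then touch every page once up front
     * (make sure the buffer is really used later, or the compiler
     * may drop the memset as a dead store). */
    char *a = malloc(LEN);
    memset(a, 0, LEN);

    /* Option 2: ask the kernel to populate the mapping at mmap() time. */
    char *b = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    /* Option 3: lock the pages; this faults them in and pins them so
     * they can never be pushed out to swap later. */
    mlock(b, LEN);

    /* ... latency-critical work now runs without page faults on a and b ... */

    munmap(b, LEN);
    free(a);
    return 0;
}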
It also shows that every address a user program hands to the kernel is checked, so there is no way to "trick" the kernel into accessing data illegally.
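One last illustrative demo of that point (my own, assumes Linux/glibc): handing the kernel a bogus pointer does not crash anything; the uaccess fixup path just makes the system call fail with EFAULT:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/zero", O_RDONLY);
    /* (void *)1 is not mapped in this process: the kernel's uaccess copy
     * faults, the exception table fixes it up, and read() fails cleanly. */
    ssize_t n = read(fd, (void *)1, 16);

    printf("read() returned %zd, errno=%d (%s)\n", n, errno, strerror(errno));
    close(fd);
    return 0;
}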