struct page
A page is the smallest unit in which the Linux kernel manages physical memory: the kernel splits all of physical memory into page-aligned pages and describes each of them with a struct page, which records the page's state and other attributes. For 4GB of memory that means on the order of a million struct page instances. The structure itself costs memory, so the larger struct page is, the more memory goes to bookkeeping and the less is left for the system and its users. Its size is therefore extremely sensitive; adding even a single byte has a system-wide impact, so the community designed the structure very carefully and does not add fields lightly:
One of these structures exists for every physical page in the system; on a 4GB system, there will be one million page structures. Given that every byte added to struct page is amplified a million times, it is not surprising that there is a strong motivation to avoid growing this structure at any cost. So struct page contains no less than three unions and is surrounded by complicated rules describing which fields are valid at which times. Changes to how this structure is accessed must be made with great care.
To keep struct page small, a number of tricks have been used from the start. One of them is the union: in 5.8.10 the structure contains two large unions to save memory, and the structure breaks down into the blocks analyzed below.
On a 64-bit system the two unions are 40 bytes and 4 bytes respectively; the whole point of this layout is to minimize the structure's footprint.
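As a quick sanity check, the size can be printed from a trivial kernel module (a hypothetical sketch; on a typical x86_64 5.8 build struct page works out to 64 bytes):
#include <linux/module.h>
#include <linux/mm_types.h>

/* Print the size of struct page when the module is loaded. */
static int __init page_size_demo_init(void)
{
	pr_info("sizeof(struct page) = %zu\n", sizeof(struct page));
	return 0;
}

static void __exit page_size_demo_exit(void)
{
}

module_init(page_size_demo_init);
module_exit(page_size_demo_exit);
MODULE_LICENSE("GPL");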
Unions are not the only space-saving technique. Another one is how the flags field is used:
Unions are not the only technique used to shoehorn as much information as possible into this small structure. Non-uniform memory access (NUMA) systems need to track information on which node each page belongs to, and which zone within the node as well. Rather than add fields to struct page
On NUMA systems, rather than adding new fields, part of the flags word is carved out for the node id and the zone; how flags is divided up is covered in the page flags section below.
Another important technique is field reuse. The classic example is the list_head lru: depending on what the page is used for at a given time it is linked onto different lists, again to save space.
struct page is defined in include/linux/mm_types.h; in 5.8.10 the definition is:
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
/*
* Five words (20/40 bytes) are available in this union.
* WARNING: bit 0 of the first word is used for PageTail(). That
* means the other users of this union MUST NOT use the bit to
* avoid collision and false-positive PageTail().
*/
union {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* pgdat->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
struct { /* page_pool used by netstack */
/**
* @dma_addr: might require a 64-bit value even on
* 32-bit architectures.
*/
dma_addr_t dma_addr;
};
struct { /* slab, slob and slub */
union {
struct list_head slab_list;
struct { /* Partial pages */
struct page *next;
#ifdef CONFIG_64BIT
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
#else
short int pages;
short int pobjects;
#endif
};
};
struct kmem_cache *slab_cache; /* not slob */
/* Double-word boundary */
void *freelist; /* first free object */
union {
void *s_mem; /* slab: first object */
unsigned long counters; /* SLUB */
struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
};
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
/* First tail page only */
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
atomic_t hpage_pinned_refcount;
/* For both global and memcg */
struct list_head deferred_list;
};
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
* mapped so the next 3 words hold the mapping, index,
* and private fields from the source anonymous or
* page cache page while the page is migrated to device
* private memory.
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
* use the mapping, index, and private fields when
* pmem backed DAX files are mapped.
*/
};
/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};
union { /* This union is 4 bytes in size. */
/*
* If the page can be mapped to userspace, encodes the number
* of times this page is referenced by a page table.
*/
atomic_t _mapcount;
/*
* If the page is neither PageSlab nor mappable to userspace,
* the value stored here may help determine what this page
* is used for. See page-flags.h for a list of page types
* which are currently stored here.
*/
unsigned int page_type;
unsigned int active; /* SLAB */
int units; /* SLOB */
};
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
#endif
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
} _struct_page_alignment;
The structure is large, so the analysis below follows the four blocks mentioned above: the flags word, the first union, the second union, and the reference count.
page flags
The page flags word uses individual bits to describe the state of a physical page:
"Page flags" are simple bit flags describing the state of a page of physical memory. They are defined in <linux/page-flags.h>. Flags exist to mark "reserved" pages (kernel memory, I/O memory, or simply nonexistent), locked pages, those under writeback I/O, those which are part of a compound page, pages managed by the slab allocator, and more. Depending on the target architecture and kernel configuration options selected, there can be as many as 24 individual flags defined.
The flags word is not only used for flag bits; parts of it are also given to the SECTION, NODE id and ZONE fields. How it is divided depends on the memory model and the kernel configuration; include/linux/page-flags-layout.h describes the five main layouts.
The first layout applies when the memory model is not sparsemem, or is sparsemem with vmemmap. Counting up from bit 0, the low bits hold FLAGS (the actual page status bits), the middle is reserved, and the ZONE and NODE fields sit at the top: ZONE records which zone the page belongs to, and NODE is the id of the NUMA node the page belongs to (0 on non-NUMA systems).
If last_cpupid is enabled on top of this layout, a LAST_CPUPID field is added next to ZONE.
With classic (non-vmemmap) sparsemem, a SECTION field is added at the top to record which mem_section the page lives in; again, enabling last_cpupid adds a LAST_CPUPID field.
Besides those four layouts, classic sparsemem also supports a layout without a NODE field, used when the node id does not fit into page->flags (the node is then derived from the section).
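The comment at the top of include/linux/page-flags-layout.h summarizes these five layouts (reproduced here with approximate spacing):
 * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
 *      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
 * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
 *      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
 * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |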
The width and offset of each field differ from one architecture to another, so the kernel provides a PGOFF macro per field to compute the offsets uniformly; they are defined in include/linux/mm.h:
/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
Besides the PGOFF macros there are PGSHIFT definitions; if a field's width is 0, its PGSHIFT is 0 as well:
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
#define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
#define KASAN_TAG_PGSHIFT (KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0))
The MASK for each field is defined as follows:
#define ZONEID_PGSHIFT (ZONEID_PGOFF * (ZONEID_SHIFT != 0))
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
#define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_SHIFT) - 1)
#define KASAN_TAG_MASK ((1UL << KASAN_TAG_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
All of these macros ultimately depend on the per-field WIDTH macros, which say how many bits each field occupies; if a field is not present, its WIDTH is 0.
SECTIONS field operations
The section field width SECTIONS_WIDTH is defined as:
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTIONS_WIDTH SECTIONS_SHIFT
#else
#define SECTIONS_WIDTH 0
#endif
SECTIONS_WIDTH is non-zero only when the memory model is classic sparsemem without vmemmap, in which case it equals SECTIONS_SHIFT:
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
MAX_PHYSMEM_BITS and SECTION_SIZE_BITS are architecture-specific.
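As a concrete worked example (assuming x86_64 with 4-level page tables, where the arch code defines SECTION_SIZE_BITS as 27 and MAX_PHYSMEM_BITS as 46):
/*
 * x86_64, 4-level paging, classic sparsemem (values from arch/x86):
 *   SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS
 *                  = 46 - 27 = 19
 * so each section covers 2^27 bytes (128 MB) and the section number
 * needs up to 19 bits of page->flags in this configuration.
 */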
The kernel wraps getting and setting a page's section in helper functions. Setting the section:
static inline void set_page_section(struct page *page, unsigned long section)
{
page->flags &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
page->flags |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
}
Reading the section back out of the page:
static inline unsigned long page_to_section(const struct page *page)
{
return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}
ZONES field
ZONES_WIDTH is defined as:
#define ZONES_WIDTH ZONES_SHIFT
ZONES_SHIFT depends on the maximum number of zones, MAX_NR_ZONES:
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3
#else
#error ZONES_SHIFT -- too many zones configured adjust calculation
#endif
Setting the zone in a page:
static inline void set_page_zone(struct page *page, enum zone_type zone)
{
page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
page->flags |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
}
Getting the zone a page belongs to:
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}
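page_zone() relies on page_zonenum(), which extracts the zone index from flags with the same shift-and-mask pattern used above:
static inline enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}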
And the interface that returns the zone id:
static inline int page_zone_id(struct page *page)
{
return (page->flags >> ZONEID_PGSHIFT) & ZONEID_MASK;
}
NODES field operations
The node field width NODES_WIDTH is defined as:
#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#else
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#error "Vmemmap: No space for nodes field in page flags"
#endif
#define NODES_WIDTH 0
#endif
The check makes sure the combined fields do not exceed the BITS_PER_LONG bits of an unsigned long; if they fit, NODES_SHIFT is used:
#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT 0
#endif
NODES_SHIFT is configured through the kernel option CONFIG_NODES_SHIFT.
Setting the node field in a page:
static inline void set_page_node(struct page *page, unsigned long node)
{
page->flags &= ~(NODES_MASK << NODES_PGSHIFT);
page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
}
Getting the node field from a page:
static inline int page_to_nid(const struct page *page)
{
struct page *p = (struct page *)page;
return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
}
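During memmap initialization these setters are applied together; set_page_links() in include/linux/mm.h combines them, roughly as follows in 5.8:
static inline void set_page_links(struct page *page, enum zone_type zone,
	unsigned long node, unsigned long pfn)
{
	set_page_zone(page, zone);
	set_page_node(page, node);
#ifdef SECTION_IN_PAGE_FLAGS
	set_page_section(page, pfn_to_section_nr(pfn));
#endif
}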
LAST_CPUPID field
There is no dedicated WIDTH macro for last_cpupid, only LAST_CPUPID_SHIFT, defined as:
#ifdef CONFIG_NUMA_BALANCING
#define LAST__PID_SHIFT 8
#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
#define LAST__CPU_SHIFT NR_CPUS_BITS
#define LAST__CPU_MASK ((1 << LAST__CPU_SHIFT)-1)
#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
#else
#define LAST_CPUPID_SHIFT 0
#endif
This requires CONFIG_NUMA_BALANCING, and LAST_CPUPID_SHIFT depends on NR_CPUS_BITS.
Getting the last cpupid:
static inline int page_cpupid_last(struct page *page)
{
return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
}
Resetting the last cpupid:
static inline void page_cpupid_reset_last(struct page *page)
{
page->flags |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
}
flags field
The flag bits themselves are essentially fixed, one bit per flag, and are managed centrally in include/linux/page-flags.h. That header carries a detailed description of the flags word, including this passage:
* The page flags field is split into two parts, the main flags area
* which extends from the low bits upwards, and the fields area which
* extends from the high bits downwards.
*
* | FIELD | ... | FLAGS |
* N-1 ^ 0
* (NR_PAGEFLAGS)
*
The flags word is thus split into two parts, with NR_PAGEFLAGS as the dividing line: the bits from NR_PAGEFLAGS upward hold the fields described above, while the bits below it are the flags proper:
Page flag | Description |
PG_locked | The page is locked; while the bit is set, other kernel paths must not access the page |
PG_referenced | The page has recently been referenced; set when it is accessed. Used by the LRU algorithm |
PG_uptodate | The page's contents have been read in successfully from the backing store |
PG_dirty | The page is dirty and must be written back to the backing store. Modified data is not flushed immediately but kept in memory and written back later; a dirty page must be written back before it can be freed |
PG_lru | The page is on an LRU (least recently used) list |
PG_active | The page is considered active (on the active LRU list) |
PG_workingset | The page is part of a process's working set; see https://www.brendangregg.com/blog/2018-01-17/measure-working-set-size.html |
PG_waiters | There are processes waiting on this page |
PG_error | An I/O error occurred while operating on this page |
PG_slab | The page is used by the slab allocator |
PG_owner_priv_1 | Owned by the page's owner; for page cache pages the filesystem may use it |
PG_arch_1 | An architecture-specific state bit |
PG_reserved | The page is reserved and cannot be swapped out. Pages reserved at boot, such as the kernel image (including the vDSO), BIOS/firmware areas, initrd, hardware tables and vmemmap, as well as pages that commonly need reserving such as DMA buffers, are marked this way |
PG_private | Set when page->private is in use; pages used for I/O can use the field to subdivide the page into multiple buffers |
PG_private_2 | An extension of PG_private, often used for auxiliary (aux) data |
PG_writeback | The page is currently being written back to disk |
PG_head | The page is a head page: when several pages are grouped into a compound page, this marks the first page of the group |
PG_mappedtodisk | The page has a mapping to blocks on disk |
PG_reclaim | The page has been marked for reclaim |
PG_swapbacked | The page is backed by swap/RAM; normally only anonymous pages can be written back to a swap area |
PG_unevictable | The page cannot be reclaimed and appears on the LRU_UNEVICTABLE list |
PG_mlocked | The VMA containing the page is locked, typically via the mlock() system call |
PG_uncached | The page is marked as uncachable; requires CONFIG_ARCH_USES_PG_UNCACHED |
PG_hwpoison | Hardware poisoned page, do not touch; requires CONFIG_MEMORY_FAILURE |
PG_young | Requires CONFIG_IDLE_PAGE_TRACKING and CONFIG_64BIT |
PG_idle | Requires CONFIG_IDLE_PAGE_TRACKING and CONFIG_64BIT |
To make setting, clearing and testing these bits convenient, the kernel generates the accessor functions from a family of macros, which look rather intimidating at first:
/*
* Macros to create function definitions for page flags
*/
#define TESTPAGEFLAG(uname, lname, policy) \
static __always_inline int Page##uname(struct page *page) \
{ return test_bit(PG_##lname, &policy(page, 0)->flags); }
#define SETPAGEFLAG(uname, lname, policy) \
static __always_inline void SetPage##uname(struct page *page) \
{ set_bit(PG_##lname, &policy(page, 1)->flags); }
#define CLEARPAGEFLAG(uname, lname, policy) \
static __always_inline void ClearPage##uname(struct page *page) \
{ clear_bit(PG_##lname, &policy(page, 1)->flags); }
#define __SETPAGEFLAG(uname, lname, policy) \
static __always_inline void __SetPage##uname(struct page *page) \
{ __set_bit(PG_##lname, &policy(page, 1)->flags); }
#define __CLEARPAGEFLAG(uname, lname, policy) \
static __always_inline void __ClearPage##uname(struct page *page) \
{ __clear_bit(PG_##lname, &policy(page, 1)->flags); }
#define TESTSETFLAG(uname, lname, policy) \
static __always_inline int TestSetPage##uname(struct page *page) \
{ return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); }
#define TESTCLEARFLAG(uname, lname, policy) \
static __always_inline int TestClearPage##uname(struct page *page) \
{ return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); }
#define PAGEFLAG(uname, lname, policy) \
TESTPAGEFLAG(uname, lname, policy) \
SETPAGEFLAG(uname, lname, policy) \
CLEARPAGEFLAG(uname, lname, policy)
#define __PAGEFLAG(uname, lname, policy) \
TESTPAGEFLAG(uname, lname, policy) \
__SETPAGEFLAG(uname, lname, policy) \
__CLEARPAGEFLAG(uname, lname, policy)
#define TESTSCFLAG(uname, lname, policy) \
TESTSETFLAG(uname, lname, policy) \
TESTCLEARFLAG(uname, lname, policy)
#define TESTPAGEFLAG_FALSE(uname) \
static inline int Page##uname(const struct page *page) { return 0; }
#define SETPAGEFLAG_NOOP(uname) \
static inline void SetPage##uname(struct page *page) { }
#define CLEARPAGEFLAG_NOOP(uname) \
static inline void ClearPage##uname(struct page *page) { }
#define __CLEARPAGEFLAG_NOOP(uname) \
static inline void __ClearPage##uname(struct page *page) { }
#define TESTSETFLAG_FALSE(uname) \
static inline int TestSetPage##uname(struct page *page) { return 0; }
#define TESTCLEARFLAG_FALSE(uname) \
static inline int TestClearPage##uname(struct page *page) { return 0; }
#define PAGEFLAG_FALSE(uname) TESTPAGEFLAG_FALSE(uname) \
SETPAGEFLAG_NOOP(uname) CLEARPAGEFLAG_NOOP(uname)
#define TESTSCFLAG_FALSE(uname) \
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
__PAGEFLAG(Locked, locked, PF_NO_TAIL)
PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
/* Xen */
PAGEFLAG(Pinned, pinned, PF_NO_COMPOUND)
TESTSCFLAG(Pinned, pinned, PF_NO_COMPOUND)
PAGEFLAG(SavePinned, savepinned, PF_NO_COMPOUND);
PAGEFLAG(Foreign, foreign, PF_NO_COMPOUND);
PAGEFLAG(XenRemapped, xen_remapped, PF_NO_COMPOUND)
TESTCLEARFLAG(XenRemapped, xen_remapped, PF_NO_COMPOUND)
PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
__CLEARPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
__SETPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
PAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
__CLEARPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
__SETPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
/*
* Private page markings that may be used by the filesystem that owns the page
* for its own purposes.
* - PG_private and PG_private_2 cause releasepage() and co to be invoked
*/
PAGEFLAG(Private, private, PF_ANY) __SETPAGEFLAG(Private, private, PF_ANY)
__CLEARPAGEFLAG(Private, private, PF_ANY)
PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
* risky: they bypass page accounting.
*/
TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
/* PG_readahead is only used for reads; PG_reclaim is only for writes */
PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
Expanding this family of macros yields three main kinds of helpers (a sample expansion is sketched after this list):
- SetPageXXX sets a flag, where XXX corresponds to the flag's suffix: SetPageLRU sets PG_lru, SetPageDirty sets PG_dirty
- ClearPageXXX clears the corresponding flag
- PageXXX tests whether the flag is set.
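As an illustration, expanding PAGEFLAG(Dirty, dirty, PF_HEAD) produces helpers roughly equivalent to the following (PF_HEAD is the policy that first resolves the page to its compound head and checks for poisoning):
/* Rough result of expanding PAGEFLAG(Dirty, dirty, PF_HEAD). */
static __always_inline int PageDirty(struct page *page)
{
	return test_bit(PG_dirty, &PF_HEAD(page, 0)->flags);
}

static __always_inline void SetPageDirty(struct page *page)
{
	set_bit(PG_dirty, &PF_HEAD(page, 1)->flags);
}

static __always_inline void ClearPageDirty(struct page *page)
{
	clear_bit(PG_dirty, &PF_HEAD(page, 1)->flags);
}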
The first union
On a 64-bit system the first union of struct page is 40 bytes and has eight alternatives; it carries the main per-use data of the page, and each alternative is described by its own struct.
Page cache and anonymous pages
The first struct covers anonymous pages and page cache pages; its members are:
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* pgdat->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
Main members:
- struct list_head lru: the LRU list head. The page is linked onto different lists depending on its use: while free and owned by the buddy allocator it sits on a buddy free list; once allocated it is placed on the active/inactive LRU lists according to its activity.
- struct address_space *mapping: points to the address space the page is mapped into; the low bits encode PAGE_MAPPING_FLAGS (see the sketch after this list).
- pgoff_t index: the page's offset within the mapping when it maps a file.
- unsigned long private: mapping-private data.
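The low bits of mapping are what PAGE_MAPPING_FLAGS refers to: for an anonymous page the pointer actually points to an anon_vma with PAGE_MAPPING_ANON set in bit 0. The check in page-flags.h looks roughly like this in 5.8:
static __always_inline int PageAnon(struct page *page)
{
	page = compound_head(page);
	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}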
page_pool used by netstack
When the page is used by the network stack's page_pool and is DMA-mapped, dma_addr holds the bus address of the mapping:
struct { /* page_pool used by netstack */
/**
* @dma_addr: might require a 64-bit value even on
* 32-bit architectures.
*/
dma_addr_t dma_addr;
};
slab, slob and slub
This struct is used when the page is managed by slab/slob/slub, i.e. it has already been handed out by the buddy allocator and is being carved up further for small allocations:
struct { /* slab, slob and slub */
union {
struct list_head slab_list;
struct { /* Partial pages */
struct page *next;
#ifdef CONFIG_64BIT
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
#else
short int pages;
short int pobjects;
#endif
};
};
struct kmem_cache *slab_cache; /* not slob */
/* Double-word boundary */
void *freelist; /* first free object */
union {
void *s_mem; /* slab: first object */
unsigned long counters; /* SLUB */
struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
};
};
Main members (a small lookup sketch follows the list):
- struct list_head slab_list: links the page onto a slab list (for example a partial list).
- struct page *next: used by SLUB for its per-CPU partial page lists.
- struct kmem_cache *slab_cache: points back to the slab cache descriptor that owns the page.
- void *freelist: points to the first free object. When a page handed out by the buddy allocator is managed as a slab, it is divided into equally sized objects; freelist points at the first free one.
- void *s_mem: the address of the first object in the slab (SLAB).
- unsigned long counters: used by SLUB as a packed counter word (aliasing inuse/objects/frozen).
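A common use of slab_cache is going from an object pointer back to the cache that owns it; a minimal sketch (hypothetical helper name, using the usual virt_to_head_page() lookup) might look like:
/* Hypothetical helper: map an object address back to its kmem_cache via the
 * struct page that backs it. */
static struct kmem_cache *cache_of_object(const void *obj)
{
	struct page *page = virt_to_head_page(obj);

	if (!PageSlab(page))
		return NULL;	/* not a slab-managed page */
	return page->slab_cache;
}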
Tail pages of compound page
This struct describes the tail pages of a compound page, that is, every page of the group except the head. LWN describes compound pages as follows:
A compound page is simply a grouping of two or more physically contiguous pages into a unit that can, in many ways, be treated as a single, larger page. They are most commonly used to create huge pages
A compound page groups several physically contiguous pages into one larger page; its biggest use is building huge pages. See https://lwn.net/Articles/619514/ for details.
The struct here describes a compound page's tail pages:
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
/* First tail page only */
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
};
Main members (a compound_head() sketch follows the list):
- unsigned long compound_head: points to the first (head) page of the compound page, with bit 0 set to mark this page as a tail page.
- unsigned char compound_dtor: the destructor id; stored in the first tail page only.
- unsigned char compound_order: the order of the compound page; stored in the first tail page only.
- atomic_t compound_mapcount: how many times the compound page is mapped as a whole.
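Because bit 0 of compound_head doubles as the tail marker, finding the head page is just a test and a subtraction; compound_head() in page-flags.h looks roughly like this in 5.8:
static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return (struct page *) (head - 1);
	return page;
}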
Second tail page of compound page
The second tail page of a compound page carries further state: hpage_pinned_refcount counts pins on the huge page, and deferred_list links the compound page onto the deferred-split queue:
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
atomic_t hpage_pinned_refcount;
/* For both global and memcg */
struct list_head deferred_list;
};
Page table pages
This struct is used when the page holds a page table; its members are:
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
ZONE_DEVICE pages
When the page belongs to ZONE_DEVICE:
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
* mapped so the next 3 words hold the mapping, index,
* and private fields from the source anonymous or
* page cache page while the page is migrated to device
* private memory.
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
* use the mapping, index, and private fields when
* pmem backed DAX files are mapped.
*/
};
rcu_head
rcu_head is used to free the page through RCU, i.e. after a grace period:
struct rcu_head rcu_head;
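A minimal sketch of that pattern (hypothetical function names; it assumes the caller holds the last reference and the first union is no longer in use):
/* Hypothetical RCU callback: runs after a grace period and returns the page
 * to the page allocator. */
static void demo_free_page_rcu(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	__free_page(page);
}

static void demo_free_page_deferred(struct page *page)
{
	/* Defer the actual free until all RCU readers are done. */
	call_rcu(&page->rcu_head, demo_free_page_rcu);
}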
The second union
The second union of struct page is 4 bytes; its members are:
union { /* This union is 4 bytes in size. */
/*
* If the page can be mapped to userspace, encodes the number
* of times this page is referenced by a page table.
*/
atomic_t _mapcount;
/*
* If the page is neither PageSlab nor mappable to userspace,
* the value stored here may help determine what this page
* is used for. See page-flags.h for a list of page types
* which are currently stored here.
*/
unsigned int page_type;
unsigned int active; /* SLAB */
int units; /* SLOB */
};
- atomic_t _mapcount: counts how many times the page is mapped into user-space page tables. It is biased: -1 means the page is not mapped by any PTE, 0 means exactly one mapping, and values above 0 mean the page is shared by further mappings (a small check is sketched after this list).
- unsigned int page_type: if the page is neither a slab page nor mappable to user space, this records what the page is being used for (see page-flags.h for the page types).
- unsigned int active: the number of active objects; used by SLAB.
- int units: used by SLOB.
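Because _mapcount is biased to start at -1, "mapped by at least one page table" is simply a non-negative value. For an order-0 page the check reduces to something like the sketch below (the real page_mapped() also has to consider compound_mapcount for compound pages):
/* Illustrative check for an order-0 page only; not the kernel's full
 * page_mapped() implementation. */
static inline bool page_has_user_mapping(struct page *page)
{
	return atomic_read(&page->_mapcount) >= 0;
}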
_refcount
_refcount is the page's reference count and tracks whether the page is in use; it must not be manipulated directly (see page_ref.h). A free page has a count of 0, and allocating the page or taking an additional reference increments it.
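The count is only ever touched through helpers; in 5.8 get_page() and put_page() look roughly like this (simplified: the devmap/ZONE_DEVICE special case in put_page() is omitted):
static inline void get_page(struct page *page)
{
	page = compound_head(page);
	/* Taking a reference requires the caller to already hold one. */
	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
	page_ref_inc(page);
}

static inline void put_page(struct page *page)
{
	page = compound_head(page);

	if (put_page_testzero(page))
		__put_page(page);	/* last reference gone: release the page */
}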
References
https://lwn.net/Articles/335768/
https://lwn.net/Articles/619514/
https://lwn.net/Articles/787388/