Linux内存管理 (16)内存规整

专题:Linux内存管理专题

关键词:内存规整、页面迁移、pageblock、MIGRATE_TYPES。

内存碎片的产生:伙伴系统以页为单位进行管理,经过大量申请释放,造成大量离散且不连续的页面。这时就产生了很多碎片。

内存规整也即内存碎片整理,内存碎片也是以页面为单位的。实现基础是内存页面按照可移动性进行分组。内存规整的实现基础是页面迁移。

Linux内核以pageblock为单位来管理页的迁移属性。

为什么需要内存规整?

有些情况下,物理设备需要大段连续物理内存。虽然此时空闲内存足够,但是哟与无法找到连续的物理内存,仍然造成内存分配失败。

1. 内存规整的触发

下面是内存页面分配,以及分配失败之后采取的措施,以便促成分配成功。

可以看出采取的措施,越来越重。首先采用kswapd来进行页面回收,然后尝试页面规整、直接页面回收,最后是OOM杀死进程来获取更多内存空间。

alloc_pages-------------------------------------页面分配的入口
->__alloc_pages_nodemask
->get_page_from_freelist--------------------直接从zonelist的空闲列表中分配页面
->__alloc_pages_slowpath--------------------在初次尝试分配失败后,进入slowpath路径分配页面
->wake_all_kswapds------------------------唤醒kswapd内核线程进行页面回收
->get_page_from_freelist------------------kswapd页面回收后再次进行页面分配
->__alloc_pages_direct_compact------------进行页面规整,然后进行页面分配
->__alloc_pages_direct_reclaim------------直接页面回收,然后进行页面分配
->__alloc_pages_may_oom-------------------尝试触发OOM

另一条路径是在kswapd的balance_pgdat中会判断是否需要进行内存规整。

kswapd
->balance_pgdat-------------------------------遍历内存节点的zone,判断是否处于平衡状态即WMARK_HIGH。
->compact_pgdat-----------------------------针对整个内存节点进行内存规整

其中compact_pddat->__compact_pgdat->compact_zone,最终的实现和__alloc_pages_direct_compact调用compact_zone一样。

1.1 内存规整相关节点

内存规整相关有两个节点,compact_memory用于触发内存规整;extfrag_threshold影响内核决策是采用内存规整还是直接回收来满足大内存分配。

节点入口代码:

static struct ctl_table vm_table[] = {
...
#ifdef CONFIG_COMPACTION
{
.procname = "compact_memory",
.data = &sysctl_compact_memory,
.maxlen = sizeof(int),
.mode = ,
.proc_handler = sysctl_compaction_handler,
},
{
.procname = "extfrag_threshold",
.data = &sysctl_extfrag_threshold,
.maxlen = sizeof(int),
.mode = ,
.proc_handler = sysctl_extfrag_handler,
.extra1 = &min_extfrag_threshold,
.extra2 = &max_extfrag_threshold,
}, #endif /* CONFIG_COMPACTION */
...
{ }
}

1.1.1 /proc/sys/vm/compact_memory

打开compaction Tracepoint:echo 1 > /sys/kernel/debug/tracing/events/compaction/enable

触发内存规整:sysctl -w vm.compact_memory=1

查看Tracepoint:cat /sys/kernel/debug/tracing/trace

1.1.2 /proc/sys/vm/extfrag_threshold

在compact_zone中调用函数compaction_suitable->__compaction_suitable进行判断是否进行内存规整。

和extfrag_threshold相关部分如下,如果当前fragindex不超过sysctl_extfrag_threshold,则不会继续进行内存规整。

所以这个参数越小越倾向于进行内存规整,越大越不容易进行内存规整。

static unsigned long __compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
...
fragindex = fragmentation_index(zone, order);
if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
return COMPACT_NOT_SUITABLE_ZONE; return COMPACT_CONTINUE;
}

设置extfrag_threshold:sysctl -w vm.extfrag_threshold=500

1.1.3 其它Debug信息

/sys/kernel/debug/extfrag/extfrag_index

/sys/kernel/debug/extfrag/unusable_index

2. 内存规整实现

在进入细节前,先看看内存规整函数框架。

__alloc_pages_direct_compact
->try_to_compact_pages-----------------直接内存规整来满足高阶分配需求
->compact_zone_order-----------------遍历zonelist对每个zone进行规整
->compact_zone---------------------对zone进行规整
->compaction_suitable------------检查是否继续规整,COMPACT_PARTIAL/COMPACT_SKIPPED都跳过。
->compact_finished---------------在while中判断是否可以停止内存规整
->isolate_migratepages-----------查找可以迁移页面
->migrate_pages------------------进行页面迁移操作
->get_free_page_from_freelist------在规整完成后进行页面分配操作

__alloc_pages_direct_compact首先执行规整操作,然后进行页面分配。

static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
bool *deferred_compaction)
{
unsigned long compact_result;
struct page *page; if (!order)-----------------------------------------------------------------order为0情况,不用进行内存规整。
return NULL; current->flags |= PF_MEMALLOC;
compact_result =try_to_compact_pages(gfp_mask, order, alloc_flags, ac,-----进行内存规整,当前进程会置PF_MEMALLOC,避免进程迁移时发生死锁。
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC; switch (compact_result) {
case COMPACT_DEFERRED:
*deferred_compaction = true;
/* fall-through */
case COMPACT_SKIPPED:
return NULL;
default:
break;
}
...
page = get_page_from_freelist(gfp_mask, order,-----------------------------进行内存分配
alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
...
count_vm_event(COMPACTFAIL); cond_resched(); return NULL;
}

try_to_compact_pages执行内存规整,以pageblock为单位,选择pageblock中可迁移页面。

unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
{
int may_enter_fs = gfp_mask & __GFP_FS;
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
int rc = COMPACT_DEFERRED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ *contended = COMPACT_CONTENDED_NONE; /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return COMPACT_SKIPPED; trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode); /* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,-----------根据掩码遍历特定zone
ac->nodemask) {
int status;
int zone_contended; if (compaction_deferred(zone, order))
continue; status =compact_zone_order(zone, order, gfp_mask, mode,-----------------------针对特定zone进行规整
&zone_contended, alloc_flags,
ac->classzone_idx);
rc = max(status, rc);
/*
* It takes at least one zone that wasn't lock contended
* to clear all_zones_contended.
*/
all_zones_contended &= zone_contended; /* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
ac->classzone_idx, alloc_flags)) {--------------------------------当前zoen水位是否高于WMARK_LOW,如果是则退出当前循环。
/*
* We think the allocation will succeed in this zone,
* but it is not certain, hence the false. The caller
* will repeat this with true if allocation indeed
* succeeds in this zone.
*/
compaction_defer_reset(zone, order, false);
/*
* It is possible that async compaction aborted due to
* need_resched() and the watermarks were ok thanks to
* somebody else freeing memory. The allocation can
* however still fail so we better signal the
* need_resched() contention anyway (this will not
* prevent the allocation attempt).
*/
if (zone_contended == COMPACT_CONTENDED_SCHED)
*contended = COMPACT_CONTENDED_SCHED; goto break_loop;
}
...
continue;
break_loop:
/*
* We might not have tried all the zones, so be conservative
* and assume they are not all lock contended.
*/
all_zones_contended = ;
break;
} /*
* If at least one zone wasn't deferred or skipped, we report if all
* zones that were tried were lock contended.
*/
if (rc > COMPACT_SKIPPED && all_zones_contended)
*contended = COMPACT_CONTENDED_LOCK; return rc;
}

compact_zone_order调用compact_zone,最主要的就是将参数填入struct compact_control结构体,然后和zone一起作为参数传递给compact_zone。

struct compact_control数据结构记录了被迁移的页面,以及规整过程中迁移到的页面列表。

static unsigned long compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum migrate_mode mode, int *contended,
int alloc_flags, int classzone_idx)
{
unsigned long ret;
struct compact_control cc = {
.nr_freepages = ,
.nr_migratepages = ,
.order = order,------------------------------------------需要规整的页面阶数
.gfp_mask = gfp_mask,------------------------------------页面规整的页面掩码
.zone = zone,
.mode = mode,--------------------------------------------页面规整模式-同步、异步
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
};
INIT_LIST_HEAD(&cc.freepages);-------------------------------初始化迁移目的地的链表
INIT_LIST_HEAD(&cc.migratepages);----------------------------初始化将要迁移页面链表 ret = compact_zone(zone, &cc); VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages)); *contended = cc.contended;
return ret;
} static int compact_zone(struct zone *zone, struct compact_control *cc)
{
int ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
const bool sync = cc->mode != MIGRATE_ASYNC;
unsigned long last_migrated_pfn = ; ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);-------------------------------根据当前zone水位来判断是否需要进行内存规整,COMPACT_CONTINUE表示可以做内存规整。
switch (ret) {
case COMPACT_PARTIAL:
case COMPACT_SKIPPED:
/* Compaction is likely to fail */
return ret;
case COMPACT_CONTINUE:
/* Fall through to compaction */
;
} /*
* Clear pageblock skip if there were failures recently and compaction
* is about to be retried after being deferred. kswapd does not do
* this reset as it'll reset the cached information when going to sleep.
*/
if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
__reset_isolation_suitable(zone); /*
* Setup to move all movable pages to the end of the zone. Used cached
* information on where the scanners should start but check that it
* is initialised by ensuring the values are within zone boundaries.
*/
cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];-----------------表示从zone的开始页面开始扫描和查找哪些页面可以被迁移。
cc->free_pfn = zone->compact_cached_free_pfn;-----------------------------从zone末端开始扫描和查找哪些空闲的页面可以用作迁移页面的目的地。
if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {-----------------下面对free_pfn和migrate_pfn进行范围限制。
cc->free_pfn = end_pfn & ~(pageblock_nr_pages-);
zone->compact_cached_free_pfn = cc->free_pfn;
}
if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
cc->migrate_pfn = start_pfn;
zone->compact_cached_migrate_pfn[] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[] = cc->migrate_pfn;
} trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync); migrate_prep_local(); while ((ret = compact_finished(zone, cc, migratetype)) ==
COMPACT_CONTINUE) {-----------------------------------while中从zone开头扫描查找合适的迁移页面,然后尝试迁移到zone末端空闲页面中,直到zone处于低水位WMARK_LOW之上。
int err;
unsigned long isolate_start_pfn = cc->migrate_pfn; switch (isolate_migratepages(zone, cc)) {-----------------------------用于扫描和查找合适迁移的页,从zone头部开始找起,查找步长以pageblock_nr_pages为单位。
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = ;
goto out;
case ISOLATE_NONE:
/*
* We haven't isolated and migrated anything, but
* there might still be unflushed migrations from
* previous cc->order aligned block.
*/
goto check_drain;
case ISOLATE_SUCCESS:
;
} err = migrate_pages(&cc->migratepages, compaction_alloc,--------------migrate_pages是页面迁移核心函数,从cc->migratepages中摘取页,然后尝试去迁移。
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION); trace_mm_compaction_migratepages(cc->nr_migratepages, err,
&cc->migratepages); /* All pages were either migrated or will be released */
cc->nr_migratepages = ;
if (err) {------------------------------------------------------------没处理成功的页面会放回到合适的LRU链表中。
putback_movable_pages(&cc->migratepages);
/*
* migrate_pages() may return -ENOMEM when scanners meet
* and we want compact_finished() to detect it
*/
if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
ret = COMPACT_PARTIAL;
goto out;
}
}
...
} out:
...
trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync, ret); return ret;
}

compaction_suitable根据当前zone水位决定是否需要继续内存规整,主要工作由__compaction_suitable进行处理。

主要依据zone低水位和extfrag_threshold两个参数进行判断。

unsigned long compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
unsigned long ret; ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
trace_mm_compaction_suitable(zone, order, ret);
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED; return ret;
} static unsigned long __compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
int fragindex;
unsigned long watermark; /*
* order == -1 is expected when compacting via
* /proc/sys/vm/compact_memory
*/
if (order == -)
return COMPACT_CONTINUE; watermark = low_wmark_pages(zone);
/*
* If watermarks for high-order allocation are already met, there
* should be no need for compaction at all.
*/
if (zone_watermark_ok(zone, order, watermark, classzone_idx,
alloc_flags))--------------------------------------COMPACT_PARTIAL:如果满足低水位,则不需要进行内存规整。
return COMPACT_PARTIAL; /*
* Watermarks for order-0 must be met for compaction. Note the 2UL.
* This is because during migration, copies of pages need to be
* allocated and for a short time, the footprint is higher
*/
watermark += (2UL << order);---------------------------------------------------增加水位高度为watermark+2<<order。
if (!zone_watermark_ok(zone, , watermark, classzone_idx, alloc_flags))--------COMPACT_SKIPPED:如果达不到新水位,说明当前zone中空闲页面很少,不适合作内存规整,跳过此zone。
return COMPACT_SKIPPED; /*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
*
* index of -1000 would imply allocations might succeed depending on
* watermarks, but we already failed the high-order watermark check
* index towards 0 implies failure is due to lack of memory
* index towards 1000 implies failure is due to fragmentation
*
* Only compact if a failure would be due to fragmentation.
*/
fragindex = fragmentation_index(zone, order);
if (fragindex >= && fragindex <= sysctl_extfrag_threshold)-----------------由extfrag_threshold控制的内存规整流程
return COMPACT_NOT_SUITABLE_ZONE; return COMPACT_CONTINUE;
}

compact_finished判断内存规整流程是否可以结束,结束的条件有两个:

一是cc->migrate_pfn和cc->free_pfn两个指针相遇;二是以order为条件判断当前zone的水位在低水位之上。

static int compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
int ret; ret = __compact_finished(zone, cc, migratetype);
trace_mm_compaction_finished(zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE; return ret;
} static int __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
unsigned long watermark; if (cc->contended || fatal_signal_pending(current))
return COMPACT_PARTIAL; /* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn) {-----------------------------------------扫描可迁移页面和空闲页面,从zone的头尾向中间运行。当两者相遇,可以停止规整。
/* Let the next compaction start anew. */
zone->compact_cached_migrate_pfn[] = zone->zone_start_pfn;
zone->compact_cached_migrate_pfn[] = zone->zone_start_pfn;
zone->compact_cached_free_pfn = zone_end_pfn(zone); /*
* Mark that the PG_migrate_skip information should be cleared
* by kswapd when it goes to sleep. kswapd does not set the
* flag itself as the decision to be clear should be directly
* based on an allocation request.
*/
if (!current_is_kswapd())
zone->compact_blockskip_flush = true; return COMPACT_COMPLETE;--------------------------------------------------停止内存规整
} /*
* order == -1 is expected when compacting via
* /proc/sys/vm/compact_memory
*/
if (cc->order == -)----------------------------------------------------------order为-1表示强制执行内存规整,继续内存规整
return COMPACT_CONTINUE; /* Compaction run is not finished if the watermark is not met */
watermark = low_wmark_pages(zone); if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
cc->alloc_flags))--------------------------------------不满足低水位条件,继续内存规整。
return COMPACT_CONTINUE; /* Direct compactor: Is a suitable page free? */
for (order = cc->order; order < MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[order]; /* Job done if page is free of the right migratetype */
if (!list_empty(&area->free_list[migratetype]))----------------------------空闲页面为空,无法进行迁移,停止内存规整。
return COMPACT_PARTIAL; /* Job done if allocation would set block type */
if (order >= pageblock_order && area->nr_free)
return COMPACT_PARTIAL;
} return COMPACT_NO_SUITABLE_PAGE;
}

isolate_migratepages扫描并寻找zone中可迁移页面,结果回添加到cc->migratepages链表中。

扫描的一个重要参数是页的迁移属性参考MIGRATE_TYPES有详细解释。

Linux内核以pageblock为单位来管理页的迁移属性,一个pageblock大小为4MB大小,即2^10个页面。

pageblock_nr_pages即为1024个页面。

static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
struct page *page;
const isolate_mode_t isolate_mode =
(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : ); /*
* Start at where we last stopped, or beginning of the zone as
* initialized by compact_zone()
*/
low_pfn = cc->migrate_pfn; /* Only scan within a pageblock boundary */
end_pfn = ALIGN(low_pfn + , pageblock_nr_pages); /*
* Iterate over whole pageblocks until we find the first suitable.
* Do not cross the free scanner.
*/
for (; end_pfn <= cc->free_pfn;---------------------------------------从cc->migrate_pfn开始以pageblock_nr_pages为步长向zone尾部进行扫描。
low_pfn = end_pfn, end_pfn += pageblock_nr_pages) { /*
* This can potentially iterate a massively long zone with
* many pageblocks unsuitable, so periodically check if we
* need to schedule, or even abort async compaction.
*/
if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
&& compact_should_abort(cc))
break; page = pageblock_pfn_to_page(low_pfn, end_pfn, zone);
if (!page)
continue; /* If isolation recently failed, do not retry */
if (!isolation_suitable(cc, page))
continue; /*
* For async compaction, also only scan in MOVABLE blocks.
* Async compaction is optimistic to see if the minimum amount
* of work satisfies the allocation.
*/
if (cc->mode == MIGRATE_ASYNC &&
!migrate_async_suitable(get_pageblock_migratetype(page)))----migrate_async_suitable判断pageblock是否是MIGRATE_MOVABLE和MIGRATE_CMA两种类型,这两种类型可以迁移。
continue; /* Perform the isolation */
low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
isolate_mode);---------------------------扫描和分离pageblock中的页面是否是和迁移。 if (!low_pfn || cc->contended) {
acct_isolated(zone, cc);
return ISOLATE_ABORT;
} /*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
* continue or not.
*/
break;
} acct_isolated(zone, cc);
/*
* Record where migration scanner will be restarted. If we end up in
* the same pageblock as the free scanner, make the scanners fully
* meet so that compact_finished() terminates compaction.
*/
cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn; return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

compaction_alloc()从zone的末尾开始查找空闲页面,并把空闲页面添加到cc->freepages链表中。然后从cc->freepages中摘除页面,返回给migrate_pages作为迁移使用。

compaction_free是规整失败的处理函数,将空闲页面返回给cc->freepages。

static struct page *compaction_alloc(struct page *migratepage,
unsigned long data,
int **result)
{
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage; /*
* Isolate free pages if necessary, and if we are not aborting due to
* contention.
*/
if (list_empty(&cc->freepages)) {
if (!cc->contended)
isolate_freepages(cc);--------------------------------------查找可以用来作为迁移目的页面 if (list_empty(&cc->freepages))---------------------------------如果没有页面可被用来作为迁移目的页面,返回NULL。
return NULL;
} freepage = list_entry(cc->freepages.next, struct page, lru);
list_del(&freepage->lru);-------------------------------------------将空闲页面从cc->freepages中摘除。
cc->nr_freepages--; return freepage;----------------------------------------------------找到可以被用作迁移目的的页面
} static void compaction_free(struct page *page, unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data; list_add(&page->lru, &cc->freepages);-------------------------------失败情况下,将页面放回cc->freepages。
cc->nr_freepages++;
}
上一篇:python 深拷贝和浅拷贝


下一篇:HTML&CSS基础知识点