The CRUSH algorithm
CRUSH is the core algorithm of Ceph. This post is only a summary from the code's point of view; I will keep updating it as my understanding deepens.
In Ceph, data is located through CRUSH and the cluster map, not by querying a central metadata server for the location of each data piece. Getting rid of central metadata brings many benefits.
For the design of the algorithm and its complexity analysis, see the paper "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data".
Generating the cluster map
Construct a topology like the one shown in the figure:
crushtool --num_osds 24 -o crushmap --build host uniform 2 rack tree 2 raw tree 2 root straw 0
crushtool -d crushmap -o flat.txt
An excerpt from flat.txt:
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

type 0 device
type 1 host
type 2 rack
type 3 raw
type 4 root

host host0 {
    id -1           # do not change unnecessarily
    alg uniform     # do not change bucket size (2) unnecessarily
    hash 0          # rjenkins1
    item osd.0 weight 1.000 pos 0
    item osd.1 weight 1.000 pos 1
}
...
rack rack0 {
    id -13          # do not change unnecessarily
    alg tree        # do not change pos for existing items unnecessarily
    hash 0          # rjenkins1
    item host0 weight 2.000 pos 0
    item host1 weight 2.000 pos 1
}
...
root root {
    id -22          # do not change unnecessarily
    alg straw
    hash 0          # rjenkins1
    item raw0 weight 8.000
    item raw1 weight 8.000
    item raw2 weight 8.000
}
...
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    step chooseleaf firstn 0 type rack
    step emit
}
You can see the hierarchy going from the leaf nodes up, level by level, to the root node.
replicated_ruleset describes the steps used to select replicas, for example:
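Once the map is compiled, its behavior can be exercised without a running cluster via crushtool's test mode (option names as in recent crushtool releases; exact flags may vary by version):

crushtool -i crushmap --test --rule 0 --num-rep 3 --min-x 0 --max-x 9 --show-mappings

Each output line lists the OSDs chosen for one input x, which is exactly what crush_do_rule, analyzed below, computes.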
step take root
step chooseleaf firstn 0 type rack
step emit
The CRUSH algorithm implements the semantics of these steps.
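As a contrast, a rule that keeps replicas on distinct hosts rather than distinct racks only needs a different type in the chooseleaf step. The following is a hypothetical variant, not part of the map generated above:

rule replicated_by_host {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take root
    step chooseleaf firstn 0 type host
    step emit
}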
Data structures
The figure shows a five-level storage cluster layout. In Ceph, this hierarchy is organized into the cluster map data structure.
bucket
The leaf nodes are the devices that actually store data; the intermediate nodes are called buckets. Both are represented in the code by crush_bucket.
Every bucket has a type field; its possible values correspond to the roles present in the actual storage cluster, each representing one kind of entity.

struct crush_bucket {
    // Bucket id. All buckets of the cluster map are stored in one contiguous
    // array; the tree structure is formed by storing child ids in items.
    // id < 0  : this is an inner bucket
    // id >= 0 : this is a leaf, i.e. an actual device
    __s32 id;
    // Kind of bucket; each kind has its own selection algorithm.
    __u8 alg;
    // Type of the bucket (device, host, rack, ...).
    __u16 type;
    // Hash algorithm, i.e. how a child is selected from the inputs (bucket, x, r).
    __u8 hash;
    // Weight; devices with different capacities or IO rates can be given
    // different weights.
    __u32 weight;
    // The two members below describe the children:
    // size  : number of children
    // items : dynamically allocated array holding the ids of the children
    __u32 size;
    __s32 *items;
};
crush_bucket abstracts the properties common to all buckets. The paper describes four kinds of bucket; each suits different scenarios and has its own select algorithm.
The input of a bucket's select algorithm is (x, r), where x is the hash of an object or the id of a PG, and r is the replica number.
enum {
    CRUSH_BUCKET_UNIFORM = 1,
    CRUSH_BUCKET_LIST    = 2,
    CRUSH_BUCKET_TREE    = 3,
    CRUSH_BUCKET_STRAW   = 4,
    CRUSH_BUCKET_STRAW2  = 5,
};
Every select algorithm must be fair, and when devices are added or removed the cost of redistribution should be as small as possible; in other words, the same input should, as far as possible, still produce the same output (which is not a trivial thing to achieve).
Naturally, the concrete bucket types extend crush_bucket: each embeds it as its first member h.
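The subsections below include small illustrative sketches of the four selection algorithms. They are not the real implementations: they use plain arrays instead of the crush_bucket_* structs, and a toy mixing function hash3 (assumed here) in place of Ceph's rjenkins1-based crush_hash32_3()/crush_hash32_4() from src/crush/hash.c:

#include <stdint.h>

/* Toy stand-in for crush_hash32_3(); all that matters for the sketches is
 * that the result is deterministic in (a, b, c) and well mixed. */
static uint32_t hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
    h ^= h >> 15;
    return h * 2654435761u;
}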
crush_bucket_uniform
struct crush_bucket_uniform {
    struct crush_bucket h;
    __u32 item_weight;
};

The uniform type treats every child as equal. Selecting a child has the lowest complexity of the four, but adding or removing a device is the most costly.
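With equal weights, selection essentially reduces to a pseudo-random pick keyed on (x, r, bucket id). A minimal sketch using the toy hash3 above (the real bucket_perm_choose() in mapper.c instead walks a cached pseudo-random permutation of the items):

/* Uniform bucket sketch: every child has the same weight, so hash
 * (x, bucket_id, r) and index into items.  Assumes size > 0. */
static int uniform_choose_sketch(const int *items, int size, int bucket_id,
                                 int x, int r)
{
    return items[hash3((uint32_t)x, (uint32_t)bucket_id, (uint32_t)r)
                 % (uint32_t)size];
}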
crush_bucket_list
struct crush_bucket_list {
    struct crush_bucket h;
    __u32 *item_weights;
    __u32 *sum_weights;
};

The list type: sum_weights[i] holds the cumulative weight of the item and the items behind it in the list. Adding a new device only requires inserting it at the head of the list.
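Selection walks the list: each item draws a hash value, the draw is scaled by the cumulative weight at that position, and the item is kept if the draw falls within its own weight. A sketch along the lines of bucket_list_choose() in mapper.c, again with the toy hash (the real code hashes four inputs with crush_hash32_4(); here r and the bucket id are folded together):

/* List bucket sketch: items[0..size-1], with sum_weights[i] being the
 * cumulative weight of items[0..i].  Walk from the head of the list
 * (the end of the array) and stop when the scaled draw lands inside the
 * current item's own weight. */
static int list_choose_sketch(const int *items, const uint32_t *item_weights,
                              const uint32_t *sum_weights, int size,
                              int bucket_id, int x, int r)
{
    for (int i = size - 1; i >= 0; i--) {
        uint64_t w = hash3((uint32_t)x, (uint32_t)items[i],
                           (uint32_t)(r ^ bucket_id)) & 0xffff;
        w = (w * sum_weights[i]) >> 16;
        if (w < item_weights[i])
            return items[i];
    }
    return items[0];    /* degenerate case: fall back to the oldest item */
}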
crush_bucket_tree
struct crush_bucket_tree {
    struct crush_bucket h;
    __u8 num_nodes;
    __u32 *node_weights;
};

The tree type: every node in the tree carries a label. A node's hash is computed from r, x and that label, and the result is then weighed against the left and right subtrees.
As long as labels stay unchanged when nodes are added or removed, the same (x, r) keeps selecting the same device node.
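A conceptual sketch of the descent: at every interior node, draw a value from (x, r, label), scale it into [0, subtree weight), and go left if it lands inside the left subtree's weight. The real bucket_tree_choose() in mapper.c stores the binary tree flat in node_weights[] and uses the node's position as its label; the explicit-pointer version below is only an illustration:

/* Tree bucket sketch: interior nodes carry a stable label and the total
 * weight of their subtree.  Because the label never changes, adding or
 * removing nodes elsewhere does not disturb choices along untouched paths. */
struct tree_node {
    uint32_t label;                  /* stable label used as hash input      */
    uint32_t weight;                 /* total weight of this subtree         */
    struct tree_node *left, *right;  /* NULL for leaves                      */
    int item;                        /* device id, meaningful only at leaves */
};

static int tree_choose_sketch(const struct tree_node *n, int x, int r)
{
    while (n->left && n->right) {
        /* scale a 32-bit draw into [0, weight) */
        uint64_t t = (uint64_t)hash3((uint32_t)x, n->label, (uint32_t)r)
                     * n->weight >> 32;
        n = (t < n->left->weight) ? n->left : n->right;
    }
    return n->item;
}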
crush_bucket_straw
struct crush_bucket_straw {
    struct crush_bucket h;
    __u32 *item_weights;
    __u32 *straws;
};
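In the straw type, every child "draws a straw": a hash of (x, item, r) stretched by a precomputed straw length derived from the item's weight (the straws array), and the longest straw wins. A sketch along the lines of bucket_straw_choose() in mapper.c, using the toy hash3:

/* Straw bucket sketch: each item draws a 16-bit value, scales it by its
 * straw length, and the item with the largest scaled draw is selected. */
static int straw_choose_sketch(const int *items, const uint32_t *straws,
                               int size, int x, int r)
{
    int best = 0;
    uint64_t best_draw = 0;
    for (int i = 0; i < size; i++) {
        uint64_t draw = (hash3((uint32_t)x, (uint32_t)items[i], (uint32_t)r)
                         & 0xffff) * (uint64_t)straws[i];
        if (i == 0 || draw > best_draw) {
            best = i;
            best_draw = draw;
        }
    }
    return items[best];
}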
The crush_do_rule function
The semantics of crush_do_rule are: from the cluster's topology tree, fairly select the required number of replica devices associated with x. This is carried out by executing the rule's steps one by one.
The steps in our example are:
step take root
step chooseleaf firstn 0 type rack
step emit
A brief explanation:
1) step take root: start the downward search from the root node.
2) step chooseleaf firstn 0 type rack: fairly choose replica-many distinct rack subtrees beneath root, and within each of them descend to pick one leaf node (a device).
3) step emit: the algorithm finishes and the result is emitted.
So in this cluster, the replicas of one object end up on different racks.
CRUSH implements 12 step opcodes; replica selection flexibly realizes all kinds of replication schemes by combining these instructions.
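Two notes on the calling convention before reading the code: result must have room for result_max entries, and scratch must hold 3 * result_max ints, because the function carves it into the three working arrays a, b and c seen below. A hypothetical call site (assuming a populated struct crush_map *map, an input x, and the weight array are already in scope) might look like this:

/* Hypothetical caller: compute 3 replicas for input x with rule 0. */
int result[3];
int scratch[3 * 3];    /* crush_do_rule needs 3 * result_max ints of scratch */
int n = crush_do_rule(map, 0 /* ruleno */, x,
                      result, 3 /* result_max */,
                      weight, weight_max, scratch);
for (int i = 0; i < n; i++)
    printf("replica %d -> osd.%d\n", i, result[i]);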
src/crush/mapper.c: crush_do_rule

int crush_do_rule(const struct crush_map *map,
                  int ruleno, int x, int *result, int result_max,
                  const __u32 *weight, int weight_max,
                  int *scratch)
{
    int result_len;
    int *a = scratch;
    int *b = scratch + result_max;
    int *c = scratch + result_max*2;
    int recurse_to_leaf;
    int *w;
    int wsize = 0;
    int *o;
    int osize;
    int *tmp;
    struct crush_rule *rule;
    __u32 step;
    int i, j;
    int numrep;
    int out_size;
    int choose_tries = map->choose_total_tries + 1;
    int choose_leaf_tries = 0;
    int choose_local_retries = map->choose_local_tries;
    int choose_local_fallback_retries = map->choose_local_fallback_tries;
    int vary_r = map->chooseleaf_vary_r;

    if ((__u32)ruleno >= map->max_rules) {
        dprintk(" bad ruleno %d\n", ruleno);
        return 0;
    }

    rule = map->rules[ruleno];
    result_len = 0;
    w = a;
    o = b;

    // Iterate over the rule's steps; in our example there are three:
    // 1) step take root
    // 2) step chooseleaf firstn 0 type rack
    // 3) step emit
    for (step = 0; step < rule->len; step++) {
        int firstn = 0;
        struct crush_rule_step *curstep = &rule->steps[step];

        switch (curstep->op) {
        // Matches the first step: "step take root"
        case CRUSH_RULE_TAKE:
            if ((curstep->arg1 >= 0 &&
                 curstep->arg1 < map->max_devices) ||
                (-1-curstep->arg1 >= 0 &&
                 -1-curstep->arg1 < map->max_buckets &&
                 map->buckets[-1-curstep->arg1])) {
                // w[0] now holds the id of the root node
                w[0] = curstep->arg1;
                wsize = 1;
            } else {
                dprintk(" bad take value %d\n", curstep->arg1);
            }
            break;

        case CRUSH_RULE_SET_CHOOSE_TRIES:
            if (curstep->arg1 > 0)
                choose_tries = curstep->arg1;
            break;

        case CRUSH_RULE_SET_CHOOSELEAF_TRIES:
            if (curstep->arg1 > 0)
                choose_leaf_tries = curstep->arg1;
            break;

        case CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES:
            if (curstep->arg1 >= 0)
                choose_local_retries = curstep->arg1;
            break;

        case CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES:
            if (curstep->arg1 >= 0)
                choose_local_fallback_retries = curstep->arg1;
            break;

        case CRUSH_RULE_SET_CHOOSELEAF_VARY_R:
            if (curstep->arg1 >= 0)
                vary_r = curstep->arg1;
            break;

        // Matches the second step: "step chooseleaf firstn 0 type rack".
        // Set the firstn flag, then walk the working array w and run
        // chooseleaf for each entry; this corresponds to the outermost
        // loop in the paper:
        //   for i in w do
        //       select(n, t)
        case CRUSH_RULE_CHOOSELEAF_FIRSTN:
        case CRUSH_RULE_CHOOSE_FIRSTN:
            firstn = 1;
            /* fall through */
        case CRUSH_RULE_CHOOSELEAF_INDEP:
        case CRUSH_RULE_CHOOSE_INDEP:
            if (wsize == 0)
                break;

            recurse_to_leaf =
                curstep->op == CRUSH_RULE_CHOOSELEAF_FIRSTN ||
                curstep->op == CRUSH_RULE_CHOOSELEAF_INDEP;

            osize = 0;

            for (i = 0; i < wsize; i++) {
                int bno;
                /*
                 * see CRUSH_N, CRUSH_N_MINUS macros.
                 * basically, numrep <= 0 means relative to
                 * the provided result_max
                 */
                numrep = curstep->arg1;
                if (numrep <= 0) {
                    numrep += result_max;
                    if (numrep <= 0)
                        continue;
                }
                j = 0;
                /* make sure bucket id is valid */
                bno = -1 - w[i];
                if (bno < 0 || bno >= map->max_buckets) {
                    // w[i] is probably CRUSH_ITEM_NONE
                    dprintk(" bad w[i] %d\n", w[i]);
                    continue;
                }
                if (firstn) {
                    int recurse_tries;
                    if (choose_leaf_tries)
                        recurse_tries = choose_leaf_tries;
                    else if (map->chooseleaf_descend_once)
                        recurse_tries = 1;
                    else
                        recurse_tries = choose_tries;
                    // Enter the selection routine:
                    //   map           : the cluster map structure
                    //   x             : hash of the object to store, or the pg id
                    //   curstep->arg2 : the bucket type to select
                    osize += crush_choose_firstn(
                        map,
                        map->buckets[bno],
                        weight, weight_max,
                        x, numrep,
                        curstep->arg2,
                        o+osize, j,
                        result_max-osize,
                        choose_tries,
                        recurse_tries,
                        choose_local_retries,
                        choose_local_fallback_retries,
                        recurse_to_leaf,
                        vary_r,
                        c+osize,
                        0);
                } else {
                    out_size = ((numrep < (result_max-osize)) ?
                                numrep : (result_max-osize));
                    crush_choose_indep(
                        map,
                        map->buckets[bno],
                        weight, weight_max,
                        x, out_size, numrep,
                        curstep->arg2,
                        o+osize, j,
                        choose_tries,
                        choose_leaf_tries ?
                            choose_leaf_tries : 1,
                        recurse_to_leaf,
                        c+osize,
                        0);
                    osize += out_size;
                }
            }

            if (recurse_to_leaf)
                /* copy final _leaf_ values to output set */
                memcpy(o, c, osize*sizeof(*o));

            /* swap o and w arrays */
            tmp = o;
            o = w;
            w = tmp;
            wsize = osize;
            break;

        case CRUSH_RULE_EMIT:
            for (i = 0; i < wsize && result_len < result_max; i++) {
                result[result_len] = w[i];
                result_len++;
            }
            wsize = 0;
            break;

        default:
            dprintk(" unknown op %d at step %d\n",
                    curstep->op, step);
            break;
        }
    }
    return result_len;
}
crush_choose_firstn
The job of crush_choose_firstn is to select buckets of the requested type from the map and then, inside each of their subtrees, choose one device. In our example the requested type is rack.

static int crush_choose_firstn(const struct crush_map *map,
                               struct crush_bucket *bucket,
                               const __u32 *weight, int weight_max,
                               int x, int numrep, int type,
                               int *out, int outpos,
                               int out_size,
                               unsigned int tries,
                               unsigned int recurse_tries,
                               unsigned int local_retries,
                               unsigned int local_fallback_retries,
                               int recurse_to_leaf,
                               unsigned int vary_r,
                               int *out2,
                               int parent_r)
{
    int rep;
    unsigned int ftotal, flocal;
    int retry_descent, retry_bucket, skip_rep;
    struct crush_bucket *in = bucket;
    int r;
    int i;
    int item = 0;
    int itemtype;
    int collide, reject;
    int count = out_size;

    dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d tries %d recurse_tries %d local_retries %d local_fallback_retries %d parent_r %d\n",
            recurse_to_leaf ? "_LEAF" : "",
            bucket->id, x, outpos, numrep,
            tries, recurse_tries, local_retries, local_fallback_retries,
            parent_r);

    // numrep: the number of replicas to select.
    // Level-1 loop: one iteration per replica.
    for (rep = outpos; rep < numrep && count > 0 ; rep++) {
        ftotal = 0;
        skip_rep = 0;
        // Level-2 loop: responsible for selecting one device, so `in` is
        // reset to the bucket this function was called with (the root node
        // in our example).
        do {
            retry_descent = 0;
            in = bucket;
            flocal = 0;

            // Level-3 loop: the "local search" described in the paper.
            do {
                collide = 0;
                retry_bucket = 0;
                r = rep + parent_r;    // r is the replica index.
                // Before each local search, ftotal must be added to r.
                // Why?  This is awkward to explain but crucial:
                // normally the level-1 for loop runs numrep times and picks
                // numrep replicas; because r differs on every iteration, a
                // different replica is chosen each time.  But if the chosen
                // node happens to have failed, a new choice must be made for
                // the same replica slot, and to prevent the same r from
                // landing on the failed node again, the failure count is
                // folded into r.  Hence r += ftotal.
                r += ftotal;

                if (in->size == 0) {
                    reject = 1;
                    goto reject;
                }
                if (local_fallback_retries > 0 &&
                    flocal >= (in->size>>1) &&
                    flocal > local_fallback_retries)
                    item = bucket_perm_choose(in, x, r);
                else
                    item = crush_bucket_choose(in, x, r);
                // Pick one item from in's children; the selection algorithm
                // is the one matching in's bucket kind (each of the four
                // bucket kinds has its own algorithm).
                if (item >= map->max_devices) {
                    dprintk(" bad item %d\n", item);
                    skip_rep = 1;
                    break;
                }

                // If item < 0, the chosen child is not a leaf; look its
                // bucket up in map->buckets to obtain its type.
                if (item < 0)
                    itemtype = map->buckets[-1-item]->type;
                else
                    itemtype = 0;
                dprintk(" item %d type %d\n", item, itemtype);

                // If itemtype does not match the target type, keep searching
                // downward -- this is a depth-first search!
                // An interesting property of this DFS: once a leaf is
                // reached there is no backtracking to the parent to try
                // sibling nodes.  For one thing, the goal of this search is
                // to find a single device somewhere in the subtree rooted at
                // `in`; for another, the child taken at each level is chosen
                // pseudo-randomly, so in a probabilistic sense the siblings
                // have been covered even though they are never visited.
                if (itemtype != type) {
                    if (item >= 0 ||
                        (-1-item) >= map->max_buckets) {
                        dprintk(" bad item type %d\n", type);
                        skip_rep = 1;
                        break;
                    }
                    // Replace `in` with the chosen child so the depth-first
                    // descent continues.
                    in = map->buckets[-1-item];
                    // Run the level-3 loop once more.
                    retry_bucket = 1;
                    continue;
                }

                // Reaching here means a bucket matching the target type has
                // been found.  Check for collisions with already-chosen
                // replicas.
                for (i = 0; i < outpos; i++) {
                    if (out[i] == item) {
                        collide = 1;
                        break;
                    }
                }

                reject = 0;
                // No collision, and recurse_to_leaf is set (we must keep
                // descending inside this bucket until a leaf is found).
                if (!collide && recurse_to_leaf) {
                    if (item < 0) {
                        int sub_r;
                        if (vary_r)
                            sub_r = r >> (vary_r-1);
                        else
                            sub_r = 0;
                        // Recursive call:
                        //  - the root of the recursion is the type-matching
                        //    bucket just found, map->buckets[-1-item]
                        //  - the target type of the recursion is 0, i.e. a device
                        if (crush_choose_firstn(map,
                                 map->buckets[-1-item],
                                 weight, weight_max,
                                 x, outpos+1, 0,
                                 out2, outpos, count,
                                 recurse_tries, 0,
                                 local_retries,
                                 local_fallback_retries,
                                 0,
                                 vary_r,
                                 NULL,
                                 sub_r) <= outpos)
                            /* didn't get leaf */
                            reject = 1;
                    } else {
                        /* we already have a leaf! */
                        out2[outpos] = item;
                    }
                }

                if (!reject) {
                    /* out? */
                    if (itemtype == 0)
                        reject = is_out(map, weight,
                                        weight_max,
                                        item, x);
                    else
                        reject = 0;
                }

reject:
                if (reject || collide) {
                    ftotal++;
                    flocal++;

                    if (collide && flocal <= local_retries)
                        retry_bucket = 1;
                    else if (local_fallback_retries > 0 &&
                             flocal <= in->size + local_fallback_retries)
                        retry_bucket = 1;
                    else if (ftotal < tries)
                        retry_descent = 1;
                    else
                        skip_rep = 1;
                    dprintk(" reject %d collide %d "
                            "ftotal %u flocal %u\n",
                            reject, collide, ftotal, flocal);
                }
            } while (retry_bucket);
        } while (retry_descent);

        if (skip_rep) {
            dprintk("skip rep\n");
            continue;
        }

        dprintk("CHOOSE got %d\n", item);
        out[outpos] = item;
        outpos++;
        count--;
#ifndef __KERNEL__
        if (map->choose_tries && ftotal <= map->choose_total_tries)
            map->choose_tries[ftotal]++;
#endif
    }

    dprintk("CHOOSE returns %d\n", outpos);
    return outpos;
}