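CRUSH stands for Controlled Replication Under Scalable Hashing. It is the algorithm Ceph uses to decide where data is placed across the cluster, and it sits at the core of the Ceph storage engine.
When a Ceph client reads from or writes to the cluster, it computes the storage location on the fly. Ceph therefore does not need to maintain a central lookup table (location metadata) for objects, which removes a bottleneck and improves performance.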
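To make "computing the location" concrete, here is a toy sketch. It is not the CRUSH algorithm: it hashes an object name to a placement group, then ranks OSDs with rendezvous (highest-random-weight) hashing. What it shares with CRUSH is the key property that every client holding the same inputs arrives at the same answer without asking a directory. The function names, the choice of SHA-256, and the default of 64 placement groups are assumptions made for illustration.

```python
import hashlib
from typing import List, Tuple


def stable_hash(*parts: str) -> int:
    """Deterministic 64-bit hash of the given strings (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(obj_name: str, pg_num: int) -> int:
    """Map an object name to a placement group index in [0, pg_num)."""
    return stable_hash("obj", obj_name) % pg_num


def pg_to_osds(pg: int, osds: List[int], replicas: int) -> List[int]:
    """Rank every OSD by a per-(pg, osd) hash and keep the top `replicas` entries."""
    ranked = sorted(
        osds,
        key=lambda osd: stable_hash("pg", str(pg), str(osd)),
        reverse=True,
    )
    return ranked[:replicas]


def locate(obj_name: str, osds: List[int], pg_num: int = 64,
           replicas: int = 3) -> Tuple[int, List[int]]:
    """Return (pg, [osd ids]) for an object, computed purely from its name."""
    pg = object_to_pg(obj_name, pg_num)
    return pg, pg_to_osds(pg, osds, replicas)


if __name__ == "__main__":
    # Any client running this with the same arguments prints the same result.
    print(locate("test.txt", [0, 1, 2, 3, 4, 5]))
```

Real CRUSH additionally honours weights and the bucket hierarchy (host, rack, root), which this toy ignores.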
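Ceph's distributed storage rests on three Rs: Replication, Recovery, and Rebalancing. When a component fails, the affected OSD is first marked down. If it has not come back after a waiting period (300 seconds in the setup described here; check the default for your release), Ceph also marks it out and starts recovery. The waiting period is set with the mon_osd_down_out_interval parameter in the cluster configuration file.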
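When a new host or disk joins the cluster, CRUSH starts rebalancing: data migrates from the existing hosts and disks to the new ones. Rebalancing tries to make use of every disk, which improves overall cluster performance. On a heavily loaded cluster, the recommended practice is to add the new disk with a weight of 0 and raise the weight in small steps, so that data migrates gradually and client I/O is not disrupted. The same gradual approach is sensible when expanding most distributed storage systems.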
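The gradual ramp can be planned ahead of time. The sketch below only builds the command strings; it does not run them. `ceph osd crush reweight` is the standard command for changing an OSD's CRUSH weight, while `osd.6`, the target weight, and the number of steps are hypothetical values chosen for illustration. Waiting for the cluster to settle between steps is left to the operator.

```python
from typing import List


def reweight_plan(osd_id: int, target_weight: float, steps: int) -> List[str]:
    """Build `ceph osd crush reweight` commands for a linear ramp up to target_weight."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    commands = []
    for i in range(1, steps + 1):
        # Evenly spaced weights: target/steps, 2*target/steps, ..., target.
        weight = target_weight * i / steps
        commands.append(f"ceph osd crush reweight osd.{osd_id} {weight:.5f}")
    return commands


if __name__ == "__main__":
    # Hypothetical new OSD (osd.6) ramped to 0.00980 in four steps.
    for command in reweight_plan(6, 0.0098, 4):
        print(command)
```

Each printed command would be issued only after the previous migration has finished and the cluster reports a healthy state.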
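In practice the cluster layout often needs adjusting. The default CRUSH layout is very simple: run ceph osd tree and you will see only two bucket types, host and OSD, under the root. This default layout offers poor fault isolation, because it has no notion of rack, row, or room. Below we add one more bucket type, rack, and place every host under a rack.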
I. Experiment: modifying the CRUSH map
(1) Run ceph osd tree to see the current cluster layout:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 0   hdd 0.00980         osd.0      up  1.00000 1.00000
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
-5       0.01959     host node2
 1   hdd 0.00980         osd.1      up  1.00000 1.00000
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
-7       0.01959     host node3
 2   hdd 0.00980         osd.2      up  1.00000 1.00000
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
(2) Add the racks:
[root@node3 ~]# ceph osd crush add-bucket rack03 rack
added bucket rack03 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
(3) Move each host under a rack:
[root@node3 ~]# ceph osd crush move node1 rack=rack01
moved item id -3 name 'node1' to location {rack=rack01} in crush map
[root@node3 ~]# ceph osd crush move node2 rack=rack02
moved item id -5 name 'node2' to location {rack=rack02} in crush map
[root@node3 ~]# ceph osd crush move node3 rack=rack03
moved item id -7 name 'node3' to location {rack=rack03} in crush map
(4) Move the racks under the default root:
[root@node3 ~]# ceph osd crush move rack01 root=default
moved item id -9 name 'rack01' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack02 root=default
moved item id -10 name 'rack02' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack03 root=default
moved item id -11 name 'rack03' to location {root=default} in crush map
(5) Run ceph osd tree again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -1       0.05878 root default
 -9       0.01959     rack rack01
 -3       0.01959         host node1
  0   hdd 0.00980             osd.0      up  1.00000 1.00000
  3   hdd 0.00980             osd.3      up  1.00000 1.00000
-10       0.01959     rack rack02
 -5       0.01959         host node2
  1   hdd 0.00980             osd.1      up  1.00000 1.00000
  4   hdd 0.00980             osd.4      up  1.00000 1.00000
-11       0.01959     rack rack03
 -7       0.01959         host node3
  2   hdd 0.00980             osd.2      up  1.00000 1.00000
  5   hdd 0.00980             osd.5      up  1.00000 1.00000
The new layout is in place: every host now sits under a specific rack. That completes the adjustment of the CRUSH layout.
For any given object, you can ask Ceph where CRUSH places it. For example, store a file test.txt as an object in the pool named data:
[root@node3 ~]# echo "this is test! ">>test.txt
[root@node3 ~]# rados -p data ls
[root@node3 ~]# rados -p data put test.txt test.txt
[root@node3 ~]# rados -p data ls
test.txt
Show where it is stored:
[root@node3 ~]# ceph osd map data test.txt
osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) -> up ([3,4,2], p3) acting ([3,4,2], p3)
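The output line packs several facts together: the osdmap epoch, the pool name and id, the object name, the placement group (1.8, that is, PG 8 of pool 1), and the up and acting sets with their primary OSD (p3). A small parser makes these fields explicit. The regular expression below is written against the output format shown in this article only; other Ceph releases may print the line differently, so treat it as an assumption rather than a stable interface.

```python
import re

# Pattern derived from the sample line in this article (not an official format spec).
OSD_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pool '(?P<pool>[^']+)' \((?P<pool_id>\d+)\) "
    r"object '(?P<object>[^']+)' -> pg (?P<raw_pg>\S+) \((?P<pg>[^)]+)\) "
    r"-> up \(\[(?P<up>[\d,]*)\], p(?P<up_primary>-?\d+)\) "
    r"acting \(\[(?P<acting>[\d,]*)\], p(?P<acting_primary>-?\d+)\)"
)


def _ids(text):
    """Turn '3,4,2' into [3, 4, 2]; an empty string becomes []."""
    return [int(part) for part in text.split(",") if part]


def parse_osd_map(line):
    """Extract pool, object, PG and OSD sets from one `ceph osd map` output line."""
    match = OSD_MAP_RE.search(line)
    if match is None:
        raise ValueError("line does not look like `ceph osd map` output")
    return {
        "epoch": int(match.group("epoch")),
        "pool": match.group("pool"),
        "pool_id": int(match.group("pool_id")),
        "object": match.group("object"),
        "pg": match.group("pg"),
        "up": _ids(match.group("up")),
        "up_primary": int(match.group("up_primary")),
        "acting": _ids(match.group("acting")),
        "acting_primary": int(match.group("acting_primary")),
    }


if __name__ == "__main__":
    sample = ("osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) "
              "-> up ([3,4,2], p3) acting ([3,4,2], p3)")
    print(parse_osd_map(sample))
```

In the sample, the up and acting sets are identical ([3,4,2]), which is what a healthy, settled placement group looks like.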
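The crushmap describes the cluster's storage topology, and in practice it may need frequent adjustment. Below we first dump it, then decompile it into plain text for inspection.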
[root@node3 ~]# ceph osd getcrushmap -o crushmap
22
[root@node3 ~]# crushtool -d crushmap -o crushmap
[root@node3 ~]# cat crushmap
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node1 {
    id -3           # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.010
    item osd.3 weight 0.010
}
rack rack01 {
    id -9            # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0           # rjenkins1
    item node1 weight 0.020
}
host node2 {
    id -5           # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.1 weight 0.010
    item osd.4 weight 0.010
}
rack rack02 {
    id -10           # do not change unnecessarily
    id -13 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0           # rjenkins1
    item node2 weight 0.020
}
host node3 {
    id -7           # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.2 weight 0.010
    item osd.5 weight 0.010
}
rack rack03 {
    id -11           # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0           # rjenkins1
    item node3 weight 0.020
}
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
    item rack02 weight 0.020
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
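The file consists of several sections. In brief:
Devices: the content after # devices in the file above. This is the list of OSDs in the cluster. It is updated automatically whenever an OSD is added or removed, so you normally do not need to edit it; Ceph maintains it for you.
Bucket types: the content after # types. This defines the available bucket types, including root, datacenter, room, row, rack, host, and osd. The default types are enough for most Ceph clusters, but you can add your own.
Bucket definitions: the content after # buckets. This defines the bucket hierarchy and, for each bucket, the selection algorithm it uses.
Rules: the content after # rules. A rule determines which buckets the data stored in a pool is placed into. Larger clusters have several pools, and each pool can have its own rule.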
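The section layout described above is regular enough to split mechanically. The sketch below groups the lines of a decompiled crushmap by the `# devices`, `# types`, `# buckets`, and `# rules` markers seen in the dump. It relies on those marker comments appearing exactly as in this article's output; the function name and the extra "tunables" key for the lines before the first marker are illustrative choices.

```python
SECTION_MARKERS = ("# devices", "# types", "# buckets", "# rules")


def split_sections(crushmap_text):
    """Group the non-empty lines of a decompiled crushmap by section marker."""
    sections = {"tunables": []}
    current = "tunables"  # lines before the first marker are the tunables
    for raw in crushmap_text.splitlines():
        line = raw.strip()
        if not line or line == "# begin crush map":
            continue
        if line == "# end crush map":
            break
        if line in SECTION_MARKERS:
            current = line[2:]  # "# devices" -> "devices"
            sections[current] = []
            continue
        sections[current].append(line)
    return sections


if __name__ == "__main__":
    sample = """# begin crush map
tunable choose_total_tries 50
# devices
device 0 osd.0 class hdd
# types
type 0 osd
type 1 host
# buckets
host node1 {
    id -3
    item osd.0 weight 0.010
}
# rules
rule replicated_rule {
    id 0
    step take default
    step emit
}
# end crush map
"""
    for name, lines in split_sections(sample).items():
        print(name, len(lines))
```

Splitting first keeps later checks simple, since each one can look at just the section it cares about.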
As a practical use of the crushmap, we can define a pool named SSD that uses SSD disks for performance, and another pool named SATA that uses SATA disks for better economy. Assume three Ceph storage nodes, each running its own OSD services.
First, in the crushmap file, change root default to:
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
}
The change is to its items: the lines item rack02 weight 0.020 and item rack03 weight 0.020 are removed.
Then add the two new roots and the two new rules, so that the rest of the file reads as follows (replicated_rule is unchanged):
root ssd {
    id -15
    alg straw
    hash 0
    item rack02 weight 0.020
}
root sata {
    id -16
    alg straw
    hash 0
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-pool {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type osd
    step emit
}
rule sata-pool {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take sata
    step chooseleaf firstn 0 type osd
    step emit
}
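In the rule with ruleset 2, step take sata starts the selection at the sata bucket, so only OSDs beneath that root are candidates.
In the rule with ruleset 1, step take ssd does the same for the ssd bucket.
The one thing to watch is that bucket ids must be unique.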
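A quick check for duplicate bucket ids can be run on the edited text before compiling it. The sketch below relies on a property visible in the decompiled map in this article: bucket ids (including the per-class ones such as `id -4 class hdd`) are negative, while rule ids are non-negative, so matching only negative ids skips the rules. The function name is illustrative, and this is a plain text scan, not a replacement for the validation crushtool performs.

```python
import re
from collections import Counter

# Matches lines such as "id -15" or "id -4 class hdd"; rule ids like "id 0" are skipped.
BUCKET_ID_RE = re.compile(r"^[ \t]*id[ \t]+(-\d+)\b", re.MULTILINE)


def duplicate_bucket_ids(crushmap_text):
    """Return the sorted list of bucket ids that appear more than once."""
    counts = Counter(int(value) for value in BUCKET_ID_RE.findall(crushmap_text))
    return sorted(bucket_id for bucket_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    edited = """
root ssd {
    id -15
    item rack02 weight 0.020
}
root sata {
    id -15
    item rack03 weight 0.020
}
rule ssd-pool {
    id 1
}
"""
    # The sample deliberately reuses -15 to show what a collision looks like.
    print(duplicate_bucket_ids(edited))
```

An empty list means no id is reused; anything else names the ids to change before running crushtool -c.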
Compile the file and upload it to the cluster:
[root@node3 ~]# crushtool -c crushmap -o crushmap.new
[root@node3 ~]# ceph osd setcrushmap -i crushmap.new
23
Check the cluster layout again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-16       0.01999 root sata
-11       0.01999     rack rack03
 -7       0.01999         host node3
  2   hdd 0.00999             osd.2      up  1.00000 1.00000
  5   hdd 0.00999             osd.5      up  1.00000 1.00000
-15       0.01999 root ssd
-10       0.01999     rack rack02
 -5       0.01999         host node2
  1   hdd 0.00999             osd.1      up  1.00000 1.00000
  4   hdd 0.00999             osd.4      up  1.00000 1.00000
 -1       0.01999 root default
 -9       0.01999     rack rack01
 -3       0.01999         host node1
  0   hdd 0.00999             osd.0      up  1.00000 1.00000
  3   hdd 0.00999             osd.3      up  1.00000 1.00000
Next, use ceph -s to confirm that the cluster health is OK. If it is, create two pools:
[root@node3 ~]# ceph osd pool create sata 64 64
pool 'sata' created
[root@node3 ~]# ceph osd pool create ssd 64 64
pool 'ssd' created
Assign a CRUSH rule to each of the two new pools:
[root@node3 ~]# ceph osd pool set sata crush_rule sata-pool
set pool 2 crush_rule to sata-pool
[root@node3 ~]# ceph osd pool set ssd crush_rule ssd-pool
set pool 3 crush_rule to ssd-pool
Check that the rules have taken effect:
[root@node3 ~]# ceph osd dump |egrep -i "ssd|sata"
pool 2 'sata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 55 flags hashpspool stripe_width 0
pool 3 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 60 flags hashpspool stripe_width 0
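Objects written to the sata pool are now mapped to OSDs under the sata root, and objects written to the ssd pool to OSDs under the ssd root.
Test this with the rados command. Note that rados put takes the object name first and the input file second, so the object is named after the file here:
[root@node3 ~]# touch file.ssd
[root@node3 ~]# touch file.sata
[root@node3 ~]# rados -p ssd put file.ssd file.ssd
[root@node3 ~]# rados -p sata put file.sata file.sata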
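Finally, use ceph osd map to check where they are stored:
[root@node3 ~]# ceph osd map ssd file.ssd
osdmap e69 pool 'ssd' (3) object 'file.ssd' -> pg 3.46b33220 (3.20) -> up ([4,1], p4) acting ([4,1,0], p4)
[root@node3 ~]# ceph osd map sata file.sata
osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)
The up sets contain only OSDs under the matching root: osd.4 and osd.1 for ssd, osd.5 and osd.2 for sata. Each root holds just two OSDs while the pools have size 3, so the up sets have only two members, and the acting sets additionally list osd.0, which sits under root default.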
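Whether a set of OSDs stays inside the intended root can be checked by walking the hierarchy. The sketch below hard-codes the layout from the ceph osd tree output above as a plain dictionary (bucket name to children, with OSDs as integers) and reports any OSD that falls outside a given root. The dictionary and function names are illustrative; nothing here queries a live cluster.

```python
# Hierarchy copied by hand from the `ceph osd tree` output after the crushmap edit.
TREE = {
    "default": ["rack01"],
    "ssd": ["rack02"],
    "sata": ["rack03"],
    "rack01": ["node1"],
    "rack02": ["node2"],
    "rack03": ["node3"],
    "node1": [0, 3],
    "node2": [1, 4],
    "node3": [2, 5],
}


def osds_under(bucket, tree):
    """Collect every OSD id beneath `bucket` by descending through child buckets."""
    found = []
    for child in tree.get(bucket, []):
        if isinstance(child, int):
            found.append(child)  # leaf: an OSD id
        else:
            found.extend(osds_under(child, tree))
    return found


def outside_root(osd_ids, root, tree):
    """Return the OSD ids from `osd_ids` that are not located under `root`."""
    return sorted(set(osd_ids) - set(osds_under(root, tree)))


if __name__ == "__main__":
    print(outside_root([4, 1], "ssd", TREE))     # up set of file.ssd
    print(outside_root([4, 1, 0], "ssd", TREE))  # acting set of file.ssd
```

Applied to the output above, the up sets pass the check, while the acting sets are flagged because of osd.0.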
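II. Experiment: CRUSH device classes
The Luminous release of Ceph added a feature called CRUSH class, which can be thought of as smart disk grouping: it automatically associates each disk with an attribute based on its type and groups disks accordingly. The crushmap no longer has to be edited by hand, which greatly reduces manual work. The first experiment above shows how laborious the manual procedure is.
Every OSD device in Ceph can be associated with one class. By default, the device type is detected automatically when the OSD is created, and the device is assigned to the matching class. There are usually three classes: hdd, ssd, and nvme.
The test environment here has no ssd or nvme devices, so we change the class labels to pretend that some devices are SSDs and run the experiment that way.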
1) Test environment
[root@node3 ~]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@node3 ~]# ceph -v
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
2) Changing the CRUSH class
1. View the current cluster layout:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 0   hdd 0.00980         osd.0      up  1.00000 1.00000
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
-5       0.01959     host node2
 1   hdd 0.00980         osd.1      up  1.00000 1.00000
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
-7       0.01959     host node3
 2   hdd 0.00980         osd.2      up  1.00000 1.00000
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
The second column, CLASS, shows only the hdd type.
Listing the CRUSH classes confirms that hdd is the only one:
[root@node3 ~]# ceph osd crush class ls
[
    "hdd"
]
2. Remove the class from osd.0, osd.1, and osd.2:
[root@node3 ~]# for i in 0 1 2;do ceph osd crush rm-device-class osd.$i;done
done removing class of osd(s): 0
done removing class of osd(s): 1
done removing class of osd(s): 2
Run ceph osd tree again to check the class of osd.0, osd.1, and osd.2:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 0       0.00980         osd.0      up  1.00000 1.00000
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
-5       0.01959     host node2
 1       0.00980         osd.1      up  1.00000 1.00000
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
-7       0.01959     host node3
 2       0.00980         osd.2      up  1.00000 1.00000
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
The class of osd.0, osd.1, and osd.2 is now empty.
3. Set the class of osd.0, osd.1, and osd.2 to ssd:
[root@node3 ~]# for i in 0 1 2;do ceph osd crush set-device-class ssd osd.$i;done
set osd(s) 0 to class 'ssd'
set osd(s) 1 to class 'ssd'
set osd(s) 2 to class 'ssd'
Run ceph osd tree once more to check the class of osd.0, osd.1, and osd.2:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
 0   ssd 0.00980         osd.0      up  1.00000 1.00000
-5       0.01959     host node2
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
 1   ssd 0.00980         osd.1      up  1.00000 1.00000
-7       0.01959     host node3
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
 2   ssd 0.00980         osd.2      up  1.00000 1.00000
The class of osd.0, osd.1, and osd.2 has changed to ssd.
List the CRUSH classes again:
[root@node3 ~]# ceph osd crush class ls
[
    "hdd",
    "ssd"
]
A new class named ssd now appears in the list.
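4. Create a CRUSH rule that selects ssd devices:
The command below creates a rule named rule-ssd under the root named default, with host as the failure domain and ssd as the device class:
[root@node3 ~]# ceph osd crush rule create-replicated rule-ssd default host ssd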
List the cluster's rules:
[root@node3 ~]# ceph osd crush rule ls
replicated_rule
rule-ssd
A new rule named rule-ssd now appears.
Download the cluster crushmap with the commands below to see what has changed:
[root@node3 ~]# ceph osd getcrushmap -o crushmap
20
[root@node3 ~]# crushtool -d crushmap -o crushmap
[root@node3 ~]# cat crushmap
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node1 {
    id -3           # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    id -9 class ssd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.010
    item osd.3 weight 0.010
}
host node2 {
    id -5            # do not change unnecessarily
    id -6 class hdd  # do not change unnecessarily
    id -10 class ssd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0           # rjenkins1
    item osd.1 weight 0.010
    item osd.4 weight 0.010
}
host node3 {
    id -7            # do not change unnecessarily
    id -8 class hdd  # do not change unnecessarily
    id -11 class ssd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0           # rjenkins1
    item osd.2 weight 0.010
    item osd.5 weight 0.010
}
root default {
    id -1            # do not change unnecessarily
    id -2 class hdd  # do not change unnecessarily
    id -12 class ssd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0           # rjenkins1
    item node1 weight 0.020
    item node2 weight 0.020
    item node3 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule rule-ssd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
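Under root default there is one new line, id -12 class ssd, and each host bucket has gained a matching class ssd id as well (-9, -10, -11). Under rules there is a new rule, rule-ssd, with id 1, whose take step reads step take default class ssd.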
5. Create a pool that uses the rule-ssd rule:
[root@node3 ~]# ceph osd pool create ssdpool 64 64 rule-ssd
pool 'ssdpool' created
The details of ssdpool show crush_rule 1, which is rule-ssd:
[root@node3 ~]# ceph osd pool ls detail
pool 1 'ssdpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 39 flags hashpspool stripe_width 0
6. Create an object to test ssdpool:
Create an object named test and put it into ssdpool:
[root@node3 ~]# rados -p ssdpool ls
[root@node3 ~]# echo "hahah" >test.txt
[root@node3 ~]# rados -p ssdpool put test test.txt
[root@node3 ~]# rados -p ssdpool ls
test
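Look up the OSD set for the object:
[root@node3 ~]# ceph osd map ssdpool test
osdmap e46 pool 'ssdpool' (1) object 'test' -> pg 1.40e8aab5 (1.35) -> up ([1,2,0], p1) acting ([1,2,0], p1)
Every OSD in the object's set (osd.1, osd.2, osd.0) carries the ssd class, so the verification succeeds. In effect, a CRUSH class is a label that identifies the disk type.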
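This final check can also be done from the text of the decompiled crushmap. The sketch below reads the device lines of the `# devices` section, groups OSD ids by class, and tests whether a given OSD set lies entirely within one class. The regular expression follows the `device N osd.N class X` form shown in this article's dump, and the function names are illustrative.

```python
import re

# Matches e.g. "device 0 osd.0 class ssd"; the class part is optional.
DEVICE_RE = re.compile(
    r"^device[ \t]+(\d+)[ \t]+osd\.\d+(?:[ \t]+class[ \t]+(\S+))?[ \t]*$",
    re.MULTILINE,
)


def devices_by_class(crushmap_text):
    """Return {class name (or None if unset): [osd ids]} from the device lines."""
    classes = {}
    for osd_id, device_class in DEVICE_RE.findall(crushmap_text):
        classes.setdefault(device_class or None, []).append(int(osd_id))
    return classes


def all_in_class(osd_ids, device_class, crushmap_text):
    """True if every OSD in `osd_ids` carries the given device class."""
    members = set(devices_by_class(crushmap_text).get(device_class, []))
    return set(osd_ids) <= members


if __name__ == "__main__":
    devices = """# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
"""
    print(devices_by_class(devices))
    print(all_in_class([1, 2, 0], "ssd", devices))  # the up set of object 'test'
```

With the device list from this article, the up set [1,2,0] falls entirely within the ssd class, matching the conclusion above.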
References:
https://www.cnblogs.com/sisimi/p/7799980.html
https://www.cnblogs.com/sisimi/p/7804138.html
https://www.bladewan.com/2017/08/05/crush/