btrfs 使用指南 - 1 概念,创建,块设备管理,性能优化

一、btrfs概念
在btrfs中存在三种类型的数据,data, metadata和system。它们表示:
       DATA
           store data blocks and nothing else。数据块。

       METADATA
           store internal metadata in b-trees, can store file data if they fit into the inline limit。
       b-trees格式存储的btrfs内部源数据,例如文件inode信息,文件大小,修改时间等等。

       SYSTEM
           store structures that describe the mapping between the physical devices and the linear logical space representing the filesystem。
       块设备和文件系统线性逻辑空间之间的映射信息,类似寻址映射关系,还包括RAID的关系(profile)。

block group或chunk的概念,这两个术语可以通用。它们表示:
           a logical range of space of a given profile, stores data, metadata or both; sometimes the terms are used interchangably。
       block group或chunk术语用来表示以上几种数据类型data,metadata,system的一个空间逻辑范围,一次性分配的最小空间。(为了保持好的数据连续性?)

           A typical size of metadata block group is 256MiB (filesystem smaller than 50GiB) and 1GiB (larger than 50GiB), for data it’s 1GiB. The system block group size is a few megabytes.
       例如metadata数据类型一次分配的空间为256MB(当文件系统小于50GB时)或1GB(当文件系统大于50GB时)。
       data数据类型一次分配的空间是1GB。
       system数据块则一次分配很少的MB。
       你可以用btrfs filesystem show观察到这些信息。

       RAID
           a block group profile type that utilizes RAID-like features on multiple devices: striping, mirroring, parity
       RAID是profile的一种描述,包括条带(raid0, raid10),mirror(raid1),奇偶校验(raid 5,6)。

       profile
           when used in connection with block groups refers to the allocation strategy and constraints, see the section PROFILES for more details
       profile和block group结合起来,用来描述数据的分配策略或约束。例如:
       single表示只存一份数据,即每个block group都是独一无二的。
           DUP表示在一个块设备中存双份数据,即每个block group在 同一个块设备 中有一个一样的block group副本。
       RAID0表示条带,单个block group可能跨块设备存储。
       RAID10表示镜像加条带,单个block group可能跨块设备存储,其中每个部分都会在两个块设备中存成镜像。

PROFILES
       There are the following block group types available:

       ┌────────┬─────────────────────┬────────────┬─────────────────┐
       │Profile │ Redundancy          │ Striping   │ Min/max devices │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │single  │ 1 copy              │ n/a        │ 1/any           │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │DUP     │ 2 copies / 1 device │ n/a        │ 1/1             │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │RAID0   │ n/a                 │ 1 to N     │ 2/any           │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │RAID10  │ 2 copies            │ 1 to N     │ 4/any           │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │RAID5   │ 2 copies            │ 3 to N - 1 │ 2/any           │
       ├────────┼─────────────────────┼────────────┼─────────────────┤
       │        │                     │            │                 │
       │RAID6   │ 3 copies            │ 3 to N - 2 │ 3/any           │
       └────────┴─────────────────────┴────────────┴─────────────────┘
二、创建一个btrfs文件系统
man mkfs.btrfs
       -d|--data <profile>
           Specify the profile for the data block groups. Valid values are raid0, raid1, raid5, raid6, raid10 or single, (case does not matter).
       指定data数据类型的profile,需要结合块设备,如果底层块设备没有冗余措施,建议这里使用冗余存储。否则存单份即可,single。
       如果有多个块设备,可以选择是否需要条带,条带话可以带来好的负载均衡性能。
       -m|--metadata <profile>
           Specify the profile for the metadata block groups. Valid values are raid0, raid1, raid5, raid6, raid10, single or dup, (case does not matter).

           A single device filesystem will default to DUP, unless a SSD is detected. Then it will default to single. The detection is based on the value of /sys/block/DEV/queue/rotational, where DEV is the short name of the device.
           This is because SSDs can remap the blocks internally to a single copy thus deduplicating them which negates the purpose of increased metadata redunancy and just wastes space.

           Note that the rotational status can be arbitrarily set by the underlying block device driver and may not reflect the true status (network block device, memory-backed SCSI devices etc). Use the options --data/--metadata
           to avoid confusion.
       指定metadata数据类型的profile,需要结合块设备,如果底层块设备没有冗余措施,建议这里使用冗余存储。否则存单份即可,single。
       如果有多个块设备,可以选择是否需要条带,条带话可以带来好的负载均衡性能。
       -n|--nodesize <size>
           Specify the nodesize, the tree block size in which btrfs stores metadata. The default value is 16KiB (16384) or the page size, whichever is bigger. Must be a multiple of the sectorsize, but not larger than 64KiB (65536).
           Leafsize always equals nodesize and the options are aliases.

           Smaller node size increases fragmentation but lead to higher b-trees which in turn leads to lower locking contention. Higher node sizes give better packing and less fragmentation at the cost of more expensive memory
           operations while updating the metadata blocks.

               Note
               versions up to 3.11 set the nodesize to 4k.
       对于数据库应用,建议使用4K,减少冲突。
       -f|--force
           Forcibly overwrite the block devices when an existing filesystem is detected. By default, mkfs.btrfs will utilize libblkid to check for any known filesystem on the devices. Alternatively you can use the wipefs utility to
           clear the devices.

有多个块设备时,可以直接指定多个块设备进行格式化。
并且可以为metadata和data指定不同的profile级别。
例如:
[root@digoal ~]# mkfs.btrfs -m raid10 -d raid10 -n 4096 -f /dev/sdb /dev/sdc /dev/sdd /dev/sde
btrfs-progs v4.3.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               00036b8e-7914-41a9-831a-d35c97202eeb
Node size:          4096
Sector size:        4096
Filesystem size:    80.00GiB
Block group profiles:  可以看到已分配的block group,三种数据类型,分别分配了多少容量。
  Data:             RAID10            2.01GiB
  Metadata:         RAID10            2.01GiB
  System:           RAID10           20.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1    20.00GiB  /dev/sdb
    2    20.00GiB  /dev/sdc
    3    20.00GiB  /dev/sdd
    4    20.00GiB  /dev/sde
下面这个,metadata使用raid1,不使用条带。而data使用raid10,使用条带。可以看到system和metadata一样,使用了raid1。
不过建议将metadata和data设置为一致的风格。
[root@digoal ~]# mkfs.btrfs -m raid1 -d raid10 -n 4096 -f /dev/sdb /dev/sdc /dev/sdd /dev/sde
btrfs-progs v4.3.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               4eef7b0c-73a3-430c-bb61-028b37d1872b
Node size:          4096
Sector size:        4096
Filesystem size:    80.00GiB
Block group profiles:
  Data:             RAID10            2.01GiB
  Metadata:         RAID1             1.01GiB
  System:           RAID1            12.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1    20.00GiB  /dev/sdb
    2    20.00GiB  /dev/sdc
    3    20.00GiB  /dev/sdd
    4    20.00GiB  /dev/sde

[root@digoal ~]# btrfs filesystem show /dev/sdb
Label: none  uuid: 4eef7b0c-73a3-430c-bb61-028b37d1872b
        Total devices 4 FS bytes used 28.00KiB
        devid    1 size 20.00GiB used 2.00GiB path /dev/sdb
        devid    2 size 20.00GiB used 2.00GiB path /dev/sdc
        devid    3 size 20.00GiB used 1.01GiB path /dev/sdd
        devid    4 size 20.00GiB used 1.01GiB path /dev/sde

三、mount btrfs文件系统
如果你的btrfs管理了多个块设备,那么你有两种选择来mount它,第一种是直接指定多个块设备,第二种是先scan,再mount,因为某些系统重新启动或者btrfs模块重新加载后,需要重新scan来识别。
例如:
[root@digoal ~]# btrfs device scan
Scanning for Btrfs filesystems
[root@digoal ~]# mount /dev/sdb /data01
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 00036b8e-7914-41a9-831a-d35c97202eeb
        Total devices 4 FS bytes used 1.03MiB
        devid    1 size 20.00GiB used 2.01GiB path /dev/sdb
        devid    2 size 20.00GiB used 2.01GiB path /dev/sdc
        devid    3 size 20.00GiB used 2.01GiB path /dev/sdd
        devid    4 size 20.00GiB used 2.01GiB path /dev/sde
或者
[root@digoal ~]# mount -o device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,device=/dev/sde /dev/sdb /data01
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 00036b8e-7914-41a9-831a-d35c97202eeb
        Total devices 4 FS bytes used 1.03MiB
        devid    1 size 20.00GiB used 2.01GiB path /dev/sdb
        devid    2 size 20.00GiB used 2.01GiB path /dev/sdc
        devid    3 size 20.00GiB used 2.01GiB path /dev/sdd
        devid    4 size 20.00GiB used 2.01GiB path /dev/sde
或者
# vi /etc/fstab

UUID=00036b8e-7914-41a9-831a-d35c97202eeb /data01 btrfs ssd,ssd_spread,discard,noatime,nodiratime,compress=no,space_cache,recovery,defaults 0 0
或者
UUID=00036b8e-7914-41a9-831a-d35c97202eeb /data01 btrfs device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,device=/dev/sde,ssd,ssd_spread,discard,noatime,nodiratime,compress=no,space_cache,recovery,defaults 0 0
四、mount参数建议
https://btrfs.wiki.kernel.org/index.php/Mount_options
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/btrfs-mount.html
4.1 ssd相关参数建议
discard,ssd,ssd_spread

discard
    Use this option to enable discard/TRIM on freed blocks.
ssd
    Turn on some of the SSD optimized behaviour within btrfs. This is enabled automatically by checking /sys/block/sdX/queue/rotational to be zero. This does not enable discard/TRIM!

ssd_spread
    Mount -o ssd_spread is more strict about finding a large unused region of the disk for new allocations, which tends to fragment the free space more over time. It is often faster on the less expensive SSD devices. 廉价ssd硬盘建议开启ssd_spread

nossd
The ssd mount option only enables the ssd option. Use the nossd option to disable it.

4.2 性能相关参数建议
noatime,nodiratime,space_cache

noatime,nodiratime
    as discussed in the mailing list noatime mount option might speed up your file system, especially in case you have lots of snapshots. Each read access to a file is supposed to update its unix access time. COW will happen and will make even more writes. Default is now relatime which updates access times less often.
space_cache
    Btrfs stores the free space data on-disk to make the caching of a block group much quicker. It's a persistent change and is safe to boot into old kernels.

4.3 其他建议参数建议
defaults,compress=no,recovery

compress=no
recovery
    Enable autorecovery upon mount; currently it scans list of several previous tree roots and tries to use the first readable. The information about the tree root backups is stored by kernels starting with 3.2, older kernels do not and thus no recovery can be done.
thread_pool=number 
    The number of worker threads to allocate.

4.4 Linux块设备IO调度策略建议
    deadline

五、resize btrfs文件系统
btrfs文件系统整合了块设备的管理,正如前面所述,btrfs存储了data, metadata, system三种数据类型。当任何一种数据类型需要空间时,btrfs会为对应的数据类型分配空间(block group),这些分配的空间就来自btrfs管理的块设备。
所以,resize btrfs,实际上就是resize 块设备的使用空间。对于单个块设备的btrfs,resize btrfs root挂载点和resize block dev的效果是一样的。

5.1 扩大
单位支持k,m,g。
# btrfs filesystem resize amount /mount-point
# btrfs filesystem show /mount-point
# btrfs filesystem resize devid:amount /mount-point
# btrfs filesystem resize devid:max /mount-point

对于单个块设备的btrfs,不需要指定块设备ID
# btrfs filesystem resize +200M /btrfssingle
Resize '/btrfssingle' of '+200M'

对于多个块设备的btrfs,需要指定块设备ID
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 00036b8e-7914-41a9-831a-d35c97202eeb
        Total devices 4 FS bytes used 2.12GiB
        devid    1 size 19.00GiB used 4.01GiB path /dev/sdb
        devid    2 size 20.00GiB used 4.01GiB path /dev/sdc
        devid    3 size 20.00GiB used 4.01GiB path /dev/sdd
        devid    4 size 20.00GiB used 4.01GiB path /dev/sde
[root@digoal ~]# btrfs filesystem resize '1:+1G' /data01
Resize '/data01' of '1:+1G'
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 00036b8e-7914-41a9-831a-d35c97202eeb
        Total devices 4 FS bytes used 2.12GiB
        devid    1 size 20.00GiB used 4.01GiB path /dev/sdb
        devid    2 size 20.00GiB used 4.01GiB path /dev/sdc
        devid    3 size 20.00GiB used 4.01GiB path /dev/sdd
        devid    4 size 20.00GiB used 4.01GiB path /dev/sde
可以指定max,表示使用块设备的所有容量。
[root@digoal ~]# btrfs filesystem resize '1:max' /data01
Resize '/data01' of '1:max'

5.2 缩小
# btrfs filesystem resize amount /mount-point
# btrfs filesystem show /mount-point
# btrfs filesystem resize devid:amount /mount-point
类似:
# btrfs filesystem resize -200M /btrfssingle
Resize '/btrfssingle' of '-200M'

5.3 设置固定大小
# btrfs filesystem resize amount /mount-point
# btrfs filesystem resize 700M /btrfssingle
Resize '/btrfssingle' of '700M'

# btrfs filesystem show /mount-point
# btrfs filesystem resize devid:amount /mount-point

同样支持max:
[root@digoal ~]# btrfs filesystem resize 'max' /data01
Resize '/data01' of 'max'
[root@digoal ~]# btrfs filesystem resize '2:max' /data01
Resize '/data01' of '2:max'
[root@digoal ~]# btrfs filesystem resize '3:max' /data01
Resize '/data01' of '3:max'
[root@digoal ~]# btrfs filesystem resize '4:max' /data01
Resize '/data01' of '4:max'

六、btrfs文件系统卷管理
btrfs文件系统多个块设备如何管理
MULTIPLE DEVICES
       Before mounting a multiple device filesystem, the kernel module must know the association of the block devices that are attached to the filesystem UUID.

       There is typically no action needed from the user. On a system that utilizes a udev-like daemon(自动识别, 不需要scan, centos 7是这样的), any new block device is automatically registered. The rules call btrfs device scan.

       The same command can be used to trigger the device scanning if the btrfs kernel module is reloaded (naturally all previous information about the device registration is lost).

       Another possibility is to use the mount options device to specify the list of devices to scan at the time of mount.

           # mount -o device=/dev/sdb,device=/dev/sdc /dev/sda /mnt

           Note
           that this means only scanning, if the devices do not exist in the system, mount will fail anyway. This can happen on systems without initramfs/initrd and root partition created with RAID1/10/5/6 profiles. The mount
           action can happen before all block devices are discovered. The waiting is usually done on the initramfs/initrd systems.
否则,在操作系统重启或者btrfs模块重载后,需要先scan 一下,才能mount使用了多个块设备的btrfs。

七、负载均衡
使用raid0, raid10, raid5, raid6时,支持条带,一个block group将横跨多个块设备,所以有负载均衡的作用。

八、单到多转换
如果一开始btrfs只用了一个块设备,要转换成raid1,如何转换?
[root@digoal ~]# mkfs.btrfs -m single -d single -n 4096 -f /dev/sdb
btrfs-progs v4.3.1
See http://btrfs.wiki.kernel.org for more information.
Label:              (null)
UUID:               165f59f6-77b5-4421-b3d8-90884d3c0b40
Node size:          4096
Sector size:        4096
Filesystem size:    20.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1    20.00GiB  /dev/sdb

[root@digoal ~]# mount -o ssd,ssd_spread,discard,noatime,nodiratime,compress=no,space_cache,recovery,defaults /dev/sdb /data01
添加块设备
[root@digoal ~]# btrfs device add /dev/sdc /data01 -f
使用balance在线转换,其中-m指metadata, -d指data
[root@digoal ~]# btrfs balance start -dconvert=raid1 -mconvert=raid1 /data01
Done, had to relocate 3 out of 3 chunks
这里的chunks指的就是block group.

[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 2 FS bytes used 360.00KiB
        devid    1 size 20.00GiB used 1.28GiB path /dev/sdb
        devid    2 size 20.00GiB used 1.28GiB path /dev/sdc

查看balance任务是否完成
[root@digoal ~]# btrfs balance status -v /data01
No balance found on '/data01'

还可以继续转换,例如data我想用raid0,可以这样。
[root@digoal ~]# btrfs balance start -dconvert=raid0 /data01
Done, had to relocate 1 out of 3 chunks
这里的chunks指的就是block group.

九、添加块设备,数据重分布。
和前面的转换差不多,只是不改-d -m的profile。
[root@digoal ~]# btrfs device add /dev/sdd/data01 -f
[root@digoal ~]# btrfs device add /dev/sde/data01 -f

[root@digoal ~]# btrfs filesystem show /dev/sdb
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 616.00KiB
        devid    1 size 20.00GiB used 1.28GiB path /dev/sdb
        devid    2 size 20.00GiB used 1.28GiB path /dev/sdc
        devid    3 size 20.00GiB used 0.00B path /dev/sdd
        devid    4 size 20.00GiB used 0.00B path /dev/sde
数据重分布
[root@digoal ~]# btrfs balance start /data01
Done, had to relocate 3 out of 3 chunks
[root@digoal ~]# btrfs filesystem show /dev/sdb
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 1.29MiB
        devid    1 size 20.00GiB used 1.03GiB path /dev/sdb
        devid    2 size 20.00GiB used 1.03GiB path /dev/sdc
        devid    3 size 20.00GiB used 2.00GiB path /dev/sdd
        devid    4 size 20.00GiB used 2.00GiB path /dev/sde
将metadata转换为raid10存储,重分布。
[root@digoal ~]# btrfs balance start -mconvert=raid10 /data01
Done, had to relocate 2 out of 3 chunks
[root@digoal ~]# btrfs filesystem show /dev/sdb
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 1.54MiB
        devid    1 size 20.00GiB used 1.53GiB path /dev/sdb
        devid    2 size 20.00GiB used 1.53GiB path /dev/sdc
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    4 size 20.00GiB used 1.53GiB path /dev/sde
查看重分布后的三种类型的使用量。
[root@digoal ~]# btrfs filesystem df /data01
Data, RAID0: total=4.00GiB, used=1.25MiB
System, RAID10: total=64.00MiB, used=4.00KiB
Metadata, RAID10: total=1.00GiB, used=36.00KiB
GlobalReserve, single: total=4.00MiB, used=0.00B

十、删除块设备(必须确保达到该profile级别最小个数的块设备)
[root@digoal ~]# btrfs filesystem df /data01
Data, RAID10: total=2.00GiB, used=1.00GiB
System, RAID10: total=64.00MiB, used=4.00KiB
Metadata, RAID10: total=1.00GiB, used=1.18MiB
GlobalReserve, single: total=4.00MiB, used=0.00B
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 1.00GiB
        devid    1 size 20.00GiB used 1.53GiB path /dev/sdb
        devid    2 size 20.00GiB used 1.53GiB path /dev/sdc
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    4 size 20.00GiB used 1.53GiB path /dev/sde
因为raid10至少需要4个块设备,所以删除失败
[root@digoal ~]# btrfs device delete /dev/sdb /data01
ERROR: error removing device '/dev/sdb': unable to go below four devices on raid10

先转换为raid1,再演示
[root@digoal ~]# btrfs balance start -mconvert=raid1 -dconvert=raid1 /data01
Done, had to relocate 3 out of 3 chunks
[root@digoal ~]# btrfs filesystem df /data01
Data, RAID1: total=2.00GiB, used=1.00GiB
System, RAID1: total=32.00MiB, used=4.00KiB
Metadata, RAID1: total=1.00GiB, used=1.11MiB
GlobalReserve, single: total=4.00MiB, used=0.00B
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 1.00GiB
        devid    1 size 20.00GiB used 1.03GiB path /dev/sdb
        devid    2 size 20.00GiB used 2.00GiB path /dev/sdc
        devid    3 size 20.00GiB used 2.00GiB path /dev/sdd
        devid    4 size 20.00GiB used 1.03GiB path /dev/sde
raid1最少只需要2个块设备,所以可以删除两个。
[root@digoal ~]# btrfs device delete /dev/sdb /data01
[root@digoal ~]# btrfs device delete /dev/sdc /data01
[root@digoal ~]# btrfs filesystem df /data01
Data, RAID1: total=2.00GiB, used=1.00GiB
System, RAID1: total=32.00MiB, used=4.00KiB
Metadata, RAID1: total=256.00MiB, used=1.12MiB
GlobalReserve, single: total=4.00MiB, used=0.00B
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 2 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 2.28GiB path /dev/sdd
        devid    4 size 20.00GiB used 2.28GiB path /dev/sde
继续删除则失败
[root@digoal ~]# btrfs device delete /dev/sdd /data01
ERROR: error removing device '/dev/sdd': unable to go below two devices on raid1
再加回去
[root@digoal ~]# btrfs device add /dev/sdb /data01
[root@digoal ~]# btrfs device add /dev/sdc /data01
[root@digoal ~]# btrfs balance start /data01
Done, had to relocate 4 out of 4 chunks
转换为raid5
[root@digoal ~]# btrfs balance start -mconvert=raid5 -dconvert=raid5 /data01
Done, had to relocate 4 out of 4 chunks
可以删除1个,因为raid5最少需要3个块设备
[root@digoal ~]# btrfs device delete /dev/sde /data01

[root@digoal ~]# btrfs filesystem df /data01
Data, RAID5: total=2.00GiB, used=1.00GiB
System, RAID5: total=64.00MiB, used=4.00KiB
Metadata, RAID5: total=1.00GiB, used=1.12MiB
GlobalReserve, single: total=4.00MiB, used=0.00B
[root@digoal ~]# btrfs filesystem show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    5 size 20.00GiB used 1.53GiB path /dev/sdb
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc

十、处理坏块设备。
假设当前btrfs管理了3个块设备,其中data profile=raid5, metadata profile=raid5, system profile=raid1
设置好这样的状态:
[root@digoal ~]# btrfs balance start -sconvert=raid1 -f /data01
Done, had to relocate 1 out of 3 chunks

[root@digoal ~]# btrfs fi show
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    5 size 20.00GiB used 1.50GiB path /dev/sdb
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc

[root@digoal ~]# btrfs fi df /data01
Data, RAID5: total=2.00GiB, used=1.00GiB
System, RAID1: total=32.00MiB, used=4.00KiB
Metadata, RAID5: total=1.00GiB, used=1.12MiB
GlobalReserve, single: total=4.00MiB, used=0.00B

删除一个块设备文件,模拟坏设备
[root@digoal ~]# rm -f /dev/sdb

[root@digoal ~]# btrfs fi df /data01
Data, RAID5: total=2.00GiB, used=1.00GiB
System, RAID1: total=32.00MiB, used=4.00KiB
Metadata, RAID5: total=1.00GiB, used=1.12MiB
GlobalReserve, single: total=4.00MiB, used=0.00B

现在btrfs显示有一些设备处于missing状态。
[root@digoal ~]# btrfs fi show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc
        *** Some devices missing

umount掉之后,就不能挂载上来了。必须使用degraded模式挂载。
[root@digoal ~]# umount /data01

[root@digoal ~]# mount /dev/sdc /data01
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

dmesg|tail -n 5
[ 1311.617838] BTRFS: open /dev/sdb failed
[ 1311.618763] BTRFS info (device sdc): disk space caching is enabled
[ 1311.618767] BTRFS: has skinny extents
[ 1311.623540] BTRFS: failed to read chunk tree on sdc
[ 1311.648198] BTRFS: open_ctree failed

你可以看到超级块在sdc sdd是好的。
[root@digoal ~]# btrfs rescue super-recover -v /dev/sdc
All Devices:
        Device: id = 3, name = /dev/sdd
        Device: id = 6, name = /dev/sdc

Before Recovering:
        [All good supers]:
                device name = /dev/sdd
                superblock bytenr = 65536

                device name = /dev/sdd
                superblock bytenr = 67108864

                device name = /dev/sdc
                superblock bytenr = 65536

                device name = /dev/sdc
                superblock bytenr = 67108864

        [All bad supers]:

All supers are valid, no need to recover
所以可以使用degraded挂载。
[root@digoal ~]# mount -t btrfs -o degraded /dev/sdc /data01

[root@digoal ~]# btrfs fi show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc
        *** Some devices missing

[root@digoal ~]# btrfs fi df /data01
Data, RAID5: total=2.00GiB, used=1.00GiB
System, RAID1: total=32.00MiB, used=4.00KiB
Metadata, RAID5: total=1.00GiB, used=1.12MiB
GlobalReserve, single: total=4.00MiB, used=0.00B

删除missing的块设备,同样需要保证profile对应的级别,至少要满足最少的数据块格式,因为用了raid5,所以至少要3个块设备。删除失败。
[root@digoal ~]# btrfs device delete missing /data01
ERROR: error removing device 'missing': unable to go below two devices on raid5

你可以先添加块设备进来,然后再删除missing的设备。
[root@digoal ~]# btrfs device add /dev/sde /data01

[root@digoal ~]# btrfs fi show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 4 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc
        devid    7 size 20.00GiB used 0.00B path /dev/sde
        *** Some devices missing

[root@digoal ~]# btrfs device delete missing /data01

[root@digoal ~]# btrfs fi show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    6 size 20.00GiB used 1.53GiB path /dev/sdc
        devid    7 size 20.00GiB used 1.50GiB path /dev/sde

重新平衡。
[root@digoal ~]# btrfs balance start /data01
Done, had to relocate 3 out of 3 chunks

[root@digoal ~]# btrfs fi show /data01
Label: none  uuid: 165f59f6-77b5-4421-b3d8-90884d3c0b40
        Total devices 3 FS bytes used 1.00GiB
        devid    3 size 20.00GiB used 1.53GiB path /dev/sdd
        devid    6 size 20.00GiB used 1.50GiB path /dev/sdc
        devid    7 size 20.00GiB used 1.53GiB path /dev/sde

[小结]
1. 建议的mkfs参数
多个块设备时,建议
-n 4096 -m raid10 -d raid10
或
-n 4096 -m raid10 -d raid5
...
单个块设备建议(非SSD)
-n 4096 -m DUP -d single
单个块设备建议(SSD)
-n 4096 -m single -d single

2. 建议的mount参数
discard,ssd,ssd_spread,noatime,nodiratime,space_cache,defaults,compress=no,recovery

3. 建议的IO调度策略
deadline

4. btrfs 架构
搞清几个概念:
1. block group, chunk
2. profile
3. 三种数据类型
4. block dev

5. 添加块设备后,记得执行重分布。

6. 搞清楚man btrfs以及所有子命令所有的内容.

[参考]
1. man mkfs.btrfs
2. man btrfs
3. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch-btrfs.html
4. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/index.html
5. https://www.suse.com/events/susecon/sessions/presentations/SUSECon-2012-TT1301.pdf
6. https://www.suse.com/documentation/
7. https://wiki.gentoo.org/wiki/Btrfs
8. https://wiki.gentoo.org/wiki/ZFS
上一篇:从技术思维角度聊一聊,『程序员』摆地摊的正确姿势


下一篇:阿里云容器服务差异化 SLO 混部技术实践