Ceph Troubleshooting Log, Part 1
1. Fault Description
The Ceph cluster consists of three servers: k8s01, k8s03, and k8s04.
The cluster status was as follows:
[root@k8s01 ops]# ceph -s
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            clock skew detected on mon.k8s03, mon.k8s04
            1 monitors have not enabled msgr2
            Reduced data availability: 81 pgs stale
            Degraded data redundancy: 11395/34185 objects degraded (33.333%), 81 pgs degraded, 81 pgs undersized

  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 7h)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 3 up (since 2m), 3 in (since 9h)

  task status:
    scrub status:
        mds.k8s01: idle

  data:
    pools:   3 pools, 81 pgs
    objects: 11.39k objects, 1.8 GiB
    usage:   6.1 GiB used, 154 GiB / 160 GiB avail
    pgs:     11395/34185 objects degraded (33.333%)
             81 stale+active+undersized+degraded
The main problem: CephFS clients could not mount the filesystem!
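For context, a CephFS kernel mount against this cluster looks roughly like the line below; the mount point and secret-file path are illustrative assumptions, not values from the incident. While no MDS is available, the command simply hangs:
# Illustrative mount (assumed paths); hangs while the MDS is unavailable.
mount -t ceph k8s01:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret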
Initial analysis: there is only one MDS, on k8s01, and k8s01 happened to be the node that was rebooted. It stayed down for quite a while, perhaps half an hour to an hour, which left the data inconsistent; the main question is how to repair it.
MDS high availability needs to be considered going forward!
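As a sketch of that follow-up, assuming the cluster was deployed with ceph-deploy and the target hosts already have the Ceph packages installed, standby MDS daemons can be added so that a single node reboot no longer takes CephFS offline:
ceph-deploy mds create k8s03 k8s04    # run from the deploy/admin node; target hosts are assumptions
ceph fs status cephfs                 # verify: the new daemons should appear as standby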
2. Troubleshooting Approach
① 81 PGs reported as stale
From the output of ceph -s we can see:
Reduced data availability: 81 pgs stale
The PG state in question is "stale".
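Before touching any daemons, the affected PGs can be enumerated. This is a sketch of the usual queries (not captured in the original session); the PG id 1.0 below is purely hypothetical:
ceph health detail | grep stale    # lists each stale PG individually
ceph pg map 1.0                    # hypothetical PG id; shows which OSDs it maps to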
Stuck PGs
After a failure, PGs enter states such as "degraded" or "peering". This happens from time to time, and these states usually mean the normal failure-recovery process is underway. However, if a PG stays in one of these states for a long time, it indicates a bigger problem, which is why the monitors warn when PGs get stuck in a non-optimal state. Specifically, we check for the following (a query sketch follows this list):
inactive — the PG has not been active for too long, i.e. it cannot service reads or writes;
unclean — the PG has not been clean for too long, e.g. it failed to fully recover from a previous failure;
stale — the PG's status has not been updated by ceph-osd, suggesting that all nodes storing this PG may be down.
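Ceph can list the PGs stuck in each of these states directly. The commands below are a sketch for illustration; they were not run during this incident:
ceph pg dump_stuck stale      # PGs whose status is no longer reported by their OSDs
ceph pg dump_stuck inactive   # PGs that cannot service reads or writes
ceph pg dump_stuck unclean    # PGs that have not fully recovered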
Solution:
From the above, the PG states should be recoverable by restarting the OSD services.
Methods tried:
systemctl restart ceph-osd.target
systemctl restart ceph-osd@1.service
We restarted the OSDs on all three cluster nodes in turn, but the status still did not change!
[root@k8s03 ~]# systemctl restart ceph-osd.target
[root@k8s03 ~]# ceph pg stat    # could not query PG status; the command hung until interrupted
^CInterrupted
[root@k8s03 ~]# systemctl restart ceph-osd@1.service
[root@k8s03 ~]# ceph -s
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            clock skew detected on mon.k8s03, mon.k8s04
            1 monitors have not enabled msgr2
            1 osds down
            1 host (1 osds) down
            Reduced data availability: 81 pgs stale
            Degraded data redundancy: 11395/34185 objects degraded (33.333%), 81 pgs degraded, 81 pgs undersized

  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 7h)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 2 up (since 1.90458s), 3 in (since 9h)

  task status:
    scrub status:
        mds.k8s01: idle

  data:
    pools:   3 pools, 81 pgs
    objects: 11.39k objects, 1.8 GiB
    usage:   6.1 GiB used, 154 GiB / 160 GiB avail
    pgs:     11395/34185 objects degraded (33.333%)
             81 stale+active+undersized+degraded
In the end, rebooting the servers brought the cluster state back to normal.
[root@k8s04 ~]# ceph -s
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            mon k8s01 is low on available space
            1 monitors have not enabled msgr2

  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 8m)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 3 up (since 8m), 3 in (since 10h)

  task status:
    scrub status:
        mds.k8s01: idle

  data:
    pools:   3 pools, 81 pgs
    objects: 11.42k objects, 1.7 GiB
    usage:   8.9 GiB used, 231 GiB / 240 GiB avail
    pgs:     81 active+clean

  io:
    client: 5.9 KiB/s wr, 0 op/s rd, 0 op/s wr
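Two follow-ups worth noting. For future planned reboots, setting the noout flag keeps CRUSH from marking the rebooting node's OSDs out (a standard Ceph practice, not used during this incident), and the remaining warnings have known fixes; the commands below are a sketch:
ceph osd set noout       # before a planned reboot: down OSDs will not be marked out
# ... reboot the node and wait for its OSDs to come back up ...
ceph osd unset noout
ceph mon enable-msgr2    # clears "1 monitors have not enabled msgr2"
# For the earlier clock-skew warning, ensure chrony/ntp is running and synced on
# k8s03/k8s04; free disk space on k8s01 to clear the mon low-space warning.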
Reference: https://www.jianshu.com/p/9d740d025034
Reference: https://blog.csdn.net/pansaky/article/details/86700301