When checking the status of the Ceph cluster, HEALTH_ERR was reported. The output looked like this:
[root@node1 ceph]# ceph -s
    cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
     health HEALTH_ERR
            clock skew detected on mon.node2, mon.node3
            54 pgs are stuck inactive for more than 300 seconds
            121 pgs peering
            54 pgs stuck inactive
            85 pgs stuck unclean
            Monitor clock skew detected
     monmap e1: 3 mons at {node1=192.168.209.100:6789/0,node2=192.168.209.101:6789/0,node3=192.168.209.102:6789/0}
            election epoch 266, quorum 0,1,2 node1,node2,node3
     osdmap e5602: 12 osds: 11 up, 11 in; 120 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v16259: 128 pgs, 1 pools, 0 bytes data, 0 objects
            1421 MB used, 54777 MB / 56198 MB avail
                 120 remapped+peering
                   7 active+clean
                   1 peering
You can see that the pgmap state is wrong: in a healthy cluster, every PG should be active+clean.
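Before changing anything, it is worth drilling into exactly which PGs are stuck and how large the monitor clock skew is. These commands are not part of the original session, but are standard in Jewel:

[root@node1 ceph]# ceph health detail            # per-PG breakdown plus the measured skew on each mon
[root@node1 ceph]# ceph pg dump_stuck inactive   # list the PGs stuck inactive
[root@node1 ceph]# ceph pg dump_stuck unclean    # list the PGs stuck unclean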
Fixing the problem
After some digging, the cause turned out to be that the clocks on the three Ceph nodes were out of sync (the NTP server had not been set to start on boot), which threw the PG map into disarray. The fix is simply to correct the time on each node, or to configure the NTP service properly.
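A quick way to confirm the suspicion on each node is to check the NTP daemon itself. This assumes CentOS 7 with ntpd (consistent with the Jewel-era output above); the unit name is an assumption:

[root@node1 ceph]# systemctl status ntpd         # is the daemon running right now?
[root@node1 ceph]# systemctl is-enabled ntpd     # will it come back after a reboot?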
Check the time on each node:
[root@node1 ceph]# date
Sun Sep 9 21:44:39 EDT 2018
[root@node2 ~]# date
Tue Sep 4 21:37:10 EDT 2018
[root@node3 ~]# date
Sun Sep 9 21:51:39 EDT 2018
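node2 is five days behind the other two nodes. A minimal fix, again assuming CentOS 7 with ntpd and that node1 serves time to the other nodes (an assumption; adjust to your topology):

# On the time server (node1): start ntpd and enable it at boot
[root@node1 ceph]# systemctl start ntpd && systemctl enable ntpd
# On node2 and node3: step the clock once against node1, then run ntpd
[root@node2 ~]# ntpdate 192.168.209.100
[root@node2 ~]# systemctl start ntpd && systemctl enable ntpd
# Verify that an upstream source has been selected (marked with *)
[root@node2 ~]# ntpq -p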
Once the NTP server was started and the NTP service was running on every node, the clocks synchronized. Checking the Ceph status again shows the cluster back to normal:
[root@node3 ~]# ceph -s
    cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
     health HEALTH_OK
     monmap e1: 3 mons at {node1=192.168.209.100:6789/0,node2=192.168.209.101:6789/0,node3=192.168.209.102:6789/0}
            election epoch 278, quorum 0,1,2 node1,node2,node3
     osdmap e5647: 12 osds: 11 up, 11 in
            flags sortbitwise,require_jewel_osds
      pgmap v16402: 128 pgs, 1 pools, 0 bytes data, 0 objects
            1374 MB used, 54824 MB / 56198 MB avail
                 128 active+clean
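For reference, the monitors raise the clock skew warning whenever the drift between them exceeds mon_clock_drift_allowed, which defaults to 0.05 seconds in Jewel. The threshold can be raised in ceph.conf, but that only masks the symptom; reliable NTP is the real fix:

[mon]
mon clock drift allowed = 0.1    # stopgap only; default is 0.05 s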