The various states of Ceph OSDs and PGs

up set: the OSDs that the CRUSH algorithm computes for a PG.
acting set: the OSDs that are currently serving reads and writes for the PG.
Normally the two sets are identical. When an OSD goes down or data has to migrate, a temporary acting set is assigned: data is migrated toward the new up set while the acting set keeps serving reads and writes, and once migration completes the acting set converges back to the up set.
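A minimal sketch (Python, assuming the ceph CLI and an admin keyring are available on the host) of how to look at one PG's up set and acting set with ceph pg <pgid> query; the pgid 1.0 is a placeholder, and the top-level state / up / acting fields are what recent releases return.

```python
import json
import subprocess

def pg_sets(pgid: str):
    """Return (state, up, acting) for one placement group.

    Shells out to `ceph pg <pgid> query`, whose output is a JSON object;
    assumes the ceph CLI and an admin keyring are usable on this host.
    """
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    info = json.loads(out)
    # Recent releases expose these keys at the top level of the query output.
    return info.get("state"), info.get("up"), info.get("acting")

if __name__ == "__main__":
    state, up, acting = pg_sets("1.0")  # "1.0" is a placeholder pgid
    print(f"state={state} up={up} acting={acting}")
    if up != acting:
        print("up != acting: data is still migrating; the acting set serves I/O")
```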

Peering being complete does not mean that every replica holds identical data; it only means that all replicas have reached agreement (on versions, the primary/replica relationship, and so on).

ACTIVE
Peering has completed and the primary OSD has valid data; the PG can serve reads and writes.

CLEAN
Peering has completed and there are no stray replicas: every copy is where it should be, with the required number of replicas.
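To see at a glance how many PGs are active+clean versus something else, the per-state counts in ceph status can be tallied. A rough sketch; the pgmap / pgs_by_state layout is what recent Ceph releases emit and may differ on older versions.

```python
import json
import subprocess

def pgs_by_state():
    """Return {state_name: count} parsed from `ceph status --format json`.

    Assumes the status JSON carries a pgmap.pgs_by_state list, as recent
    Ceph releases do; field names may differ on older versions.
    """
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    pgmap = json.loads(out).get("pgmap", {})
    return {e["state_name"]: e["count"] for e in pgmap.get("pgs_by_state", [])}

if __name__ == "__main__":
    for state, count in sorted(pgs_by_state().items()):
        flag = "" if state == "active+clean" else "   <-- not active+clean"
        print(f"{count:6d}  {state}{flag}")
```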

DEGRADED
After a write has reached the primary OSD but has not yet reached the other OSDs, the PG is in the degraded state.
If some OSDs in the PG are down, the PG is in the active+degraded state.
If an OSD stays down and the PG remains degraded for a long time, the OSD is marked out and its data is remapped. The delay from down to out is controlled by mon osd down out interval.
When some objects in a PG cannot be found or cannot be read or written, the PG is also marked degraded; the other objects in the PG remain accessible.
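A small sketch of how one might check the down-to-out delay mentioned above. It reads mon_osd_down_out_interval via ceph config get (available on Mimic and later; older releases would need the monitor admin socket or ceph.conf instead), so take the exact command as an assumption about your release.

```python
import subprocess

def down_out_interval() -> int:
    """Seconds an OSD may stay down before the monitors mark it out.

    Uses `ceph config get mon mon_osd_down_out_interval` (Mimic and later);
    older clusters have to read the option another way.
    """
    out = subprocess.check_output(
        ["ceph", "config", "get", "mon", "mon_osd_down_out_interval"]
    )
    return int(float(out.decode().strip()))

if __name__ == "__main__":
    secs = down_out_interval()
    print(f"an OSD that stays down for {secs}s will be marked out and remapped")
```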

RECOVERING
When an OSD in the PG comes back up after being down (without a remap having happened), its data is behind the other replicas and has to be brought up to date; while this catch-up is in progress the PG is marked recovering.

BACKFILLING
After a new OSD joins the cluster, some PGs have to be reassigned to it and their data copied over (backfilled); once backfilling completes, the new OSD can serve requests.

REMAPPED
The PG has been remapped and data is migrating from the old acting set to the new one. While the migration is in progress, the old primary OSD keeps serving requests; once migration completes, the new primary takes over.
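A rough sketch of how remapped PGs could be spotted by comparing each PG's up set with its acting set in the output of ceph pg dump; the pg_stats / pgid / up / acting field names (and the pg_map wrapper used by newer releases) are assumptions about the JSON layout of your Ceph version.

```python
import json
import subprocess

def remapped_pgs():
    """Yield (pgid, state, up, acting) for every PG whose up set != acting set."""
    out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    dump = json.loads(out)
    # Newer releases nest the stats under "pg_map"; older ones keep them at the top level.
    stats = dump.get("pg_map", dump).get("pg_stats", [])
    for pg in stats:
        if pg["up"] != pg["acting"]:
            yield pg["pgid"], pg["state"], pg["up"], pg["acting"]

if __name__ == "__main__":
    for pgid, state, up, acting in remapped_pgs():
        print(f"{pgid}: {state}  up={up} acting={acting}")
```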

Question: how do these states relate to one another, and what are the transitions between them?

STALE
The primary OSD of the PG has not reported to the monitors for some time.

A placement group is not necessarily problematic just because its state is not active+clean. Generally, though, Ceph's ability to self-repair may not be working when placement groups get stuck. The stuck states (which can be listed with ceph pg dump_stuck, as sketched after this list) include:
Unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come back up.
Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster in a while (configured by mon osd report timeout).
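The three stuck conditions above map directly onto ceph pg dump_stuck. A minimal sketch, assuming the ceph CLI and an admin keyring are available on the host; it prints the raw command output rather than relying on any particular JSON layout.

```python
import subprocess

# The three stuck conditions described above, in the order Ceph documents them.
STUCK_TYPES = ("inactive", "unclean", "stale")

def report_stuck_pgs():
    """Print the PGs Ceph considers stuck, one dump_stuck call per condition."""
    for stuck in STUCK_TYPES:
        result = subprocess.run(
            ["ceph", "pg", "dump_stuck", stuck],
            capture_output=True, text=True, check=True,
        )
        print(f"--- stuck {stuck} ---")
        print(result.stdout.strip() or "none")

if __name__ == "__main__":
    report_stuck_pgs()
```

ceph health detail also lists problematic PGs together with the reason they are unhealthy, which is often enough for a quick look.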

Questions:
Why must read and write operations go through the primary OSD? Which scenarios can lead to data inconsistency?
