MongoDB Oplog Stones 实现分析及启动加载优化

本文基于 4.2 代码分析

背景

Oplog Collection 首先是作为一个 Capped Collection 来实现的,但是单纯的 Capped Collection 会在每一次的写操作之后,如果发现集合大小超出配置的最大值,就会同步的去进行删除文档的操作

删除文档的步骤大约是,

  • 计算:设置一个 end cursor 指向 capped collection 头部,然后不断的调用 cursor->next(),通过累积 value 大小来确定需要删除的文档数
  • 删除:

    • 需要删除的文档数量小于等于 3 个,直接循环调用 cursor->remove() 删除文档
    • 否则,设置两个 cursor,一个 start,一个 end,start 直接设置在 capped collection 的头部,然后调用 session->truncate(session, start, end) 来批量删除文档,session->truncate() 实际最终也是调用 cursor->remove()

具体代码可参考:WiredTigerRecordStore::_cappedDeleteAsNeeded_inlock

可以看到确定需要删除的文档数是这里面比较耗时的部分,cursor->next() 来统计 value 大小,实际就是在不断的做读取。对于 Oplog Collection 来说,所有的用户写操作都会记录 Oplog,当 Oplog 写满后,每一次的用户写操作都会触发同步的删除操作,显然效率很低。

所以 MongoDB 采用了一种标记删除位点,然后批量删除的策略来解决这个问题。

Oplog Stones 初始化

所谓的 Oplog Stone,实际上就是用 truncate point(删除位点) 在 oplog 上分隔的逻辑区域,而 truncate point 本质上就是 oplog entry 的 ts 字段,同时作为 RecordID,实际也是对应的 WiredTiger Table 中一条记录的 key。

Oplog Stone 的信息 MongoDB 并没有做持久化,而是选择每次重启的时候重新初始化

Oplog Stones 整体初始化的逻辑还是比较简单的,首先是根据 cappedMaxSize 计算需要多少个 stone,

    // The minimum oplog stone size should be BSONObjMaxInternalSize.
    const unsigned int oplogStoneSize =
        std::max(gOplogStoneSizeMB * 1024 * 1024, BSONObjMaxInternalSize);

    // IDL does not support unsigned long long types.
    const unsigned long long kMinStonesToKeep = static_cast<unsigned long long>(gMinOplogStones);
    const unsigned long long kMaxStonesToKeep =
        static_cast<unsigned long long>(gMaxOplogStonesDuringStartup);

    unsigned long long numStones = maxSize / oplogStoneSize;
    size_t numStonesToKeep = std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));

这里有很多默认值参与计算,我们需要知道的是,oplog stone 最多 100 个,最少 10 个,如果 oplogSizeMB 配置值超过 2GB,在默认情况下,基本上就需要 100 个 stone(这个是根据 Capped Collection配置的最大值算出来的基准,如果 Capped Collection 实际还没有写满,会根据实际大小来换算,stone 会更少,但是无论如何,上限不会超过 100)。

确定了 Oplog Stone 的个数,下面要做的就是确定每个 Oplog Stone 的边界,分两种情况

  • Oplog 集合当前的 record 数量太少,小于 20 倍的需要 sample 的数量(对于 100 个 stone 来说,每个 stone 默认 sample 10 条记录,所以 Collection record 数量低于 2w 条,就走全表的逻辑),直接通过全表扫描的方式来确定 Oplog Stone 边界,这个逻辑很简单,就是对扫描到的 oplog entry 累加大小,超过单个 oplog stone 大小上限就生成一个 stone,保存下来,直到扫描结束。(代码:WiredTigerRecordStore::OplogStones::_calculateStonesByScanning
  • 否则,就不能通过全表扫描了,效率太低。MongoDB 借助于 WiredTiger 提供的 random cursor 来进行采样,从而快速确定每个 oplog stone 的边界。(代码:WiredTigerRecordStore::OplogStones::_calculateStonesBySampling

    1. 正常来说,100 个 oplog stone,采样 100 次,似乎就可以确定所有 stone 的边界了,但是为了保证边界尽可能准确,MongoDB 采用了 oversampling 的方式,即,对于每一个 stone 采样 10 次(默认),100 个 stone 就采样 1000 次。然后,把这些采集到的 sample 按 key(opTime)排序,每个 oplog stone 使用每 10 个 sample 中最后一个的 key 作为其边界,即,从 0 开始数,那么 9th, 19th, 29th, ……号 sample 是顺序对应的 oplog stone 的边界。
    2. 另外,MongoDB 在后面的版本迭代中还使用了优化的 random cursor,可以保证更精确的采样,通过 next_random_sample_size 告知 WT 需要 random 采样的个数,WT 会把底层的 table 划分为 next_random_sample_size 份,分别从其中获取一个采样,显然这样划分后,每第 10 个 sample 可以更贴近预期的 oplog stone 边界。

这个是进程启动时 oplog stone 的初始化的方式,随着有新的写入,还会创建新的 oplog stone,这个时候 oplog stone 的大小是可以保证精确的,因为在写入的时候,可以很方便的统计到当前已经在最新的 stone 里面写了多少数据,这个数值是精确的。所以,如果初始启动的时候因为 oplog stone 边界不精确导致 oplog 删除的过多或过少,并不是个大问题,这个会在后续的更新中把误差抹掉。

Oplog 回收

有了 Oplog Stone 后,oplog 的回收就不需要再每次写入时去计算要删除的文档数,再同步去删除,只需要在当前 oplog stone 写满后,创建新的 oplog stone 时,把最老的 oplog stone 删除掉即可

这个按 stone 删除通常会一次性删除比较多的文档,所以oplog 的删除动作是放在后台的 OplogTruncaterThread来做的,删除时会直接调用 session->truncate 方法,使用 oldest oplog stone 的边界作为 range truncate 的上界。

//WiredTigerRecordStore::reclaimOplog
            setKey(cursor, stone->lastRecord);
            invariantWTOK(session->truncate(session, nullptr, nullptr, cursor, nullptr));

删除的时候还需要考虑 stable checkpoint 对 oplog 的依赖,具体逻辑后面再发文分析。

Oplog Stones 初始化的时间开销分析

单纯从代码逻辑看,可以看到 Oplog Stones 的初始化的绝大部分时间都会花在 random cursor 的采样上,因为其他的步骤都是简单的 in-memory compute,几个 ms 足以完成。总的初始化时长会和采样的个数成正比,采样个数最多是 1000 个,所以这个初始化时间是有上限的,并不是随 oplog 集合大小无限增长

为了验证上述结论,我们构造一个测试场景来证明。

先构造数据集,创建副本集 1,使用 mongo shell 自带的 benchmark 工具生成数据,

benchRun(
  {
  "host": "127.0.0.1:9111",
  "ops": [
    {
      "ns": "test.foo",
      "op": "insert",
      "doc": {
        "x": {
          "#RAND_STRING": [
            64
          ]
        },
        "y": {
          "#RAND_STRING": [
            128
          ]
        }
      }
    }
  ],
  "parallel": 16,
  "seconds": $seconds
  }
)

生成大约 1.5 亿条 oplog,33GB,(有压缩,实际是在 50G左右)

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   33.363GB
test    35.284GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
150531637

在代码上加了一些日志用于查看 cursor random 的时间开销,以及每一次 random 采样的开销,

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
index f2c3d1c220..7f029b788d 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
@@ -460,8 +460,14 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
     // each logical section.
     auto cursor = _rs->getRandomCursorWithOptions(opCtx, extraConfig);
     std::vector<RecordId> oplogEstimates;
+    const std::uint64_t startWaitTime = curTimeMicros64();
     for (int i = 0; i < numSamples; ++i) {
+        const std::uint64_t startWaitTime = curTimeMicros64();
         auto record = cursor->next();
+        auto waitTime = curTimeMicros64() - startWaitTime;
+        LOG(1) << "WT cursor random sample " << i << ", "
+               << Timestamp(record->id.repr()).toStringPretty() << ", took "
+               << waitTime / 1000.0 << "ms";
         if (!record) {
             // This shouldn't really happen unless the size storer values are far off from reality.
             // The collection is probably empty, but fall back to scanning the oplog just in case.
@@ -471,6 +477,8 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
         }
         oplogEstimates.push_back(record->id);
     }
+    auto waitTime = curTimeMicros64() - startWaitTime;
+    LOG(1) << "WT cursor random sampling total took " << waitTime/1000.0 << "ms";
     std::sort(oplogEstimates.begin(), oplogEstimates.end());

     for (int i = 1; i <= wholeStones; ++i) {

把副本集 1 重启,观察日志,

020-10-27T15:34:09.058+0800 I  STORAGE  [initandlisten] Taking 755 samples and assuming that each section of oplog contains approximately 1991955 records totaling to 687194924 bytes
2020-10-27T15:34:09.058+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 26 19:54:57:7347, took 0.415ms
2020-10-27T15:34:09.063+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 26 19:57:47:6215, took 4.488ms
2020-10-27T15:34:09.067+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 26 15:37:47:1030, took 4.608ms
2020-10-27T15:34:09.072+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 26 15:44:15:4619, took 4.471ms
2020-10-27T15:34:09.076+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 26 15:46:51:2640, took 4.597ms
2020-10-27T15:34:09.081+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 26 15:49:22:10335, took 4.556ms
2020-10-27T15:34:09.086+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 26 15:52:03:10684, took 4.746ms
2020-10-27T15:34:09.090+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 26 15:54:14:4494, took 4.586ms
2020-10-27T15:34:09.095+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 26 15:56:46:1960, took 4.889ms
2020-10-27T15:34:09.100+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 26 15:59:18:7246, took 4.695ms
2020-10-27T15:34:09.105+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 26 16:02:05:4727, took 4.895ms
2020-10-27T15:34:09.110+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 26 16:04:30:5742, took 4.673ms
2020-10-27T15:34:09.115+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 26 16:06:45:1917, took 4.881ms
2020-10-27T15:34:09.119+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 26 16:08:50:5188, took 4.786ms
2020-10-27T15:34:09.124+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 26 16:11:13:7634, took 4.449ms
2020-10-27T15:34:09.129+0800 D1 STORAGE  [initandlisten] WT cursor random sample 15, Oct 26 16:13:25:6775, took 5.204ms
...
2020-10-27T15:34:12.463+0800 D1 STORAGE  [initandlisten] WT cursor random sample 752, Oct 26 15:56:15:232, took 4.923ms
2020-10-27T15:34:12.467+0800 D1 STORAGE  [initandlisten] WT cursor random sample 753, Oct 26 15:58:47:1953, took 4.399ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sample 754, Oct 26 16:01:28:5317, took 4.598ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 3414.51ms

可以看到这个实例采样了 755 次,总共耗时 3414ms,每次采样的时间都比较固定,在 4ms - 5ms 之间。

然后 MongoDB 的 serverStatus 命令本身也提供了一个 section 的输出用于查看启动时,初始化 oplog stones 的总时长,和初始化方法(是 scan 还是 sampling),

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(3418164),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}

可以看到,其他部分的时间开销在 4ms 左右,不到总时长的 1%。

为了验证初始化时间和 sample 的个数成正比,同样根据上述方法构造另外一个数据集,25GB,1.13 亿条 oplog,

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   25.145GB
test    26.517GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
113211477

重启之后查看日志输出,

2020-10-27T15:43:02.121+0800 I  STORAGE  [initandlisten] Taking 568 samples and assuming that each section of oplog contains approximately 1991875 records totaling to 687195044 bytes
2020-10-27T15:43:02.121+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 27 12:33:29:5201, took 0.216ms
2020-10-27T15:43:02.125+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 27 12:36:06:5577, took 4.489ms
2020-10-27T15:43:02.130+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 27 12:38:30:1191, took 4.417ms
2020-10-27T15:43:02.134+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 27 12:40:51:1654, took 4.526ms
2020-10-27T15:43:02.139+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 27 12:43:12:9085, took 4.51ms
2020-10-27T15:43:02.144+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 27 12:45:36:3523, took 4.465ms
2020-10-27T15:43:02.148+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 27 12:48:09:6883, took 4.63ms
2020-10-27T15:43:02.153+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 27 12:50:09:6716, took 4.484ms
2020-10-27T15:43:02.157+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 27 12:52:24:1495, took 4.531ms
2020-10-27T15:43:02.162+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 27 12:54:39:3871, took 4.705ms
2020-10-27T15:43:02.167+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 27 12:57:15:3946, took 4.661ms
2020-10-27T15:43:02.171+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 27 12:59:36:5033, took 4.74ms
2020-10-27T15:43:02.176+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 27 13:01:52:6908, took 4.424ms
2020-10-27T15:43:02.181+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 27 13:04:22:2838, took 4.637ms
2020-10-27T15:43:02.186+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 27 13:06:42:6574, took 5.21ms
...
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sample 567, Oct 27 12:17:32:2820, took 4.397ms
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 2650.65ms

进行了 568 次 sample,总时间开销 2650ms,而 2650ms 基本上等于 ( 568.0/755) * 3414ms = 2568ms ,和 sample 个数成正比的结论可成立。

综上,考虑到单次 random cursor sample 的开销大约是 4-5ms,总 sample 上限是 1000,那么 oplog stones 初始化时间的上限在 5s 左右(NVMe SSD)。


update:上面的分析忽略了一个事情,即每次 random cursor next,是需要把获取 record id 的,因为这个就是 oplog 的 truncate point,但是同时也会把 value 读上来,而且 WiredTiger 的读取都是以 page(extent)为粒度的,如果平均每条 oplog 的 value 都很大,必然造成单个 disk page 也很大,重启的时候,显然所有 page 的读取都要自己读 disk,那么就会造成单次 random cursor next 耗时提升。

所以,经过进一步测试,最终的结论是,当单条 oplog 平均 size 不大时(几百个字节 - 几 KB), oplog stones 初始化时间的上限在 5s 左右。当单条 oplog 平均 size 太大时(几十 KB - 几 MB),那么因为 page size 太大,冷启动读取 disk page 开销变大, oplog stones 初始化时间可能达到几十秒甚至更高,具体时间和单条 oplog 平均 size 强相关。

Oplog Stones reload 优化

阿里云数据库平台有一套完善的日志采集系统,通过对线上运行日志分析,仍然发现有不少实例写入时的 Value 比较大,导致的结果是 oplog entry 的大小超过几百 KB,设置达到几 MB,这种情况下,如果重启,初始化 oplog stones 的时间开销就要达到几十秒甚至更多,所以仍然有必要对这个问题进行优化。

优化前后对比

构造 1MB 大小的 oplog entry,重启 100 个 stone,994 个 sample,加载时间在 27.7s 左右

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(27791069),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}

2020-11-13T14:47:52.391+0800 I  STORAGE  [initandlisten] The size storer reports that the oplog contains 20378 records totaling to 21355524826 bytes
2020-11-13T14:47:52.391+0800 D1 COMMAND  [WT-OplogStonesSaverThread-local.oplogstones.rs] BackgroundJob starting: WT-OplogStonesSaverThread-local.oplogstones.rs
2020-11-13T14:47:52.391+0800 D2 STORAGE  [WT-OplogTruncaterThread-local.oplog.rs] no global storage engine yet
2020-11-13T14:47:52.391+0800 I  STORAGE  [initandlisten] Sampling the oplog to determine where to place markers for truncation
2020-11-13T14:47:52.391+0800 D2 STORAGE  [WT-OplogStonesSaverThread-local.oplogstones.rs] WT-OplogStonesSaverThread-local.oplogstones.rs: no global storage engine yet
2020-11-13T14:47:52.393+0800 I  STORAGE  [initandlisten] Sampling from the oplog between Nov 13 14:43:08:618 and Nov 13 14:47:21:3 to determine where to place markers for truncation
2020-11-13T14:47:52.393+0800 I  STORAGE  [initandlisten] Taking 994 samples and assuming that each section of oplog contains approximately 205 records totaling to 214833771 bytes
2020-11-13T14:47:52.572+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Nov 13 14:43:28:232, took 178.246ms
2020-11-13T14:47:52.620+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Nov 13 14:45:13:143, took 48.437ms
2020-11-13T14:47:52.661+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Nov 13 14:45:35:242, took 40.899ms
2020-11-13T14:47:52.868+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Nov 13 14:43:19:114, took 206.935ms
...
2020-11-13T14:48:20.182+0800 I  STORAGE  [initandlisten] WiredTiger record store oplog processing took 27791ms

优化后,加载时间在 7ms 左右,减少了 3-4 个数量级

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(7261),
        "processingMethod" : "reloadingLocal",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}

优化思路简述

  1. 把内存中的 oplog stones 保存在本地
  2. 后台线程负责在内存 oplog stones 变化时更新本地文件。
  3. 重启加载从本地文件加载保存的 oplog stones 信息,需要处理一些边界条件。

最后,打一个小广告,阿里云 MongoDB 目前上线了 Serverless 版本,让您方便的以超低成本试用 MongoDB,活动期间首购 1 元包月,续费也有 5 折优惠,欢迎试用:https://www.aliyun.com/product/mongodb

上一篇:predis连接问题(connection refused)排查


下一篇:使用elk+redis搭建nginx日志分析平台(转)