版本: 2.2.0
操作: 从 分片shard3_2 挪chunk到 分片shard1_1
现象:
1) mongos 上move chunk命令执行后卡住, 发现日志:
Tue Mar 4 20:34:12 [conn25194] CMD: movechunk: { moveChunk: "archive.archive", to: "shard1_1", find: { uid: -2130706432 } } Tue Mar 4 20:34:12 [conn25194] moving chunk ns: archive.archive moving ( ns:archive.archive at: shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 lastmod: 1|2||000000000000000000000000 min: { uid: -2130706432 } max: { uid: MaxKey }) shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 -> shard1_1:shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002 Tue Mar 4 20:34:20 [LockPinger] cluster 192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001 pinged successfully at Tue Mar 4 20:34:20 2014 by distributed lock pinger ‘192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001/yz-uc-core-a08:27227:1392462352:1804289383‘, sleeping for 30000ms
2)在shard3_2的primary日志里面发现这种记录,并且这种日志持续10个小时:
Wed Mar 5 05:34:57 [conn4754510] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
最后有个失败日志:
Wed Mar 5 06:40:13 [conn4754510] distributed lock ‘archive.archive/yz-uc-core-a03:15001:1349773025:1657874366‘ unlocked. Wed Mar 5 06:40:13 [conn4754510] about to log metadata event: { _id: "yz-uc-core-a03-2014-03-04T22:40:13-163", server: "yz-uc-core-a03", clientAddr: "192.168.130.18:21606", time: new Date(1393972813700), what: "moveChunk.from", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 6: 1, step2 of 6: 442, step3 of 6: 9, step4 of 6: 36000896, note: "aborted" } } Wed Mar 5 06:40:13 [conn4754510] command admin.$cmd command: { moveChunk: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", to: "shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002", fromShard: "shard3_2", toShard: "shard1_1", min: { uid: -2130706432 }, max: { uid: MaxKey }, maxChunkSizeBytes: 134217728, shardId: "archive.archive-uid_-2130706432", configdb: "192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001", secondaryThrottle: false, $auth: { local: { __system: 2 } } } ntoreturn:1 keyUpdates:0 locks(micros) r:66 w:28 reslen:398 36361617ms
3)在shard1_1的primary日志里面发现这种记录, 并且这种日志持续10个小时:
Tue Mar 4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for ‘archive.archive‘ { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
最后个失败日志:
Wed Mar 5 06:40:13 [migrateThread] about to log metadata event: { _id: "yz-uc-core-a01-2014-03-04T22:40:13-161", server: "yz-uc-core-a01", clientAddr: ":27017", time: new Date(1393972813571), what: "moveChunk.to", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 5: 9, step2 of 5: 0, step3 of 5: 0, step4 of 5: 0, note: "aborted" } }
4)在shard1_1的secondary里面没有任何日志, 说明move shard对于secondary其实是透明的.
综上,(下面称shardx_y的primary为shardx_y_pri, 称其secondary为shardx_y_bjs)
流程是这样的,chunk要从shard3_2挪到chunk1_1, 根据mongodb的设计, 数据由shard1_1_pri 向shard3_2_pri拉取数据, shard1_1_pri会等数据同步到若干个secondary之后才返回. 在我们的迁移中, shard1_1的primary一直在等待oplog完成同步到其他secondary的工作,如日志:warning:
migrate commit waiting for 3 slaves for. 另一方面shard3_2 也在不停检查shard1_1的迁移状态.
疑问:
1) 在日志
Tue Mar 4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for ‘archive.archive‘ { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
中最后有一个时间戳, 是一个mongodb的oplog的时间戳, primary在根据这个时间戳检查同步. 但是去secondary的local.rs.oplog里面检查这个时间的oplog,发现已经存在secondary里面了.因为mongodb是先执行,再写oplog,那么可以断定这个oplog已经执行了. 另外, 还做了其他实验, 在mongos上面做了会增改这个replica的操作,发现很快就同步到shard2_1_bjs了,说明并不存在延时问题.
2) 对于这个现象, 在google group上面得到10gen工程师(是一个北邮毕业的程序媛)的回复: 这个故障估计是触发mongodb2.2.0的一个bug, 建议我们升级到2.2.7. bug见链接: https://jira.mongodb.org/browse/SERVER-5351
3) 我搜了之前成功move chunk的操作日志, 如下
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for ‘post.post‘ { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
奇怪了,我们明明有3个secondary, 但是为什么这里显示2呢?
“waiting for 2 slaves for ‘post.post‘”(正常)
“waiting for 3 slaves for ‘archive.archive‘ ” (异常)
4) 同样的集群, 同样的replica set, 为啥一个2一个3呢, 于是跟了一些代码
//d_migrate.cpp class MigrateStatus { void _go() { //...... // pause to wait for replication Timer t; while ( t.minutes() < 600 ) { //10个小时的循环检查同步进度 if ( flushPendingWrites( lastOpApplied ) break; sleepsecs(1); } } bool flushPendingWrites( const ReplTime& lastOpApplied ) { if ( ! opReplicatedEnough( lastOpApplied ) ) { OpTime op( lastOpApplied ); //这里就是打waiting for 3 slaves 日志的地方 OCCASIONALLY warning() << "migrate commit waiting for " << slaveCount << " slaves for ‘" << ns << "‘ " << min << " -> " << max << " waiting for: " << op << migrateLog; return false; } //... } //下面这两个函数检查同步进度,注意不是引用传值, //也就是说, slaveCount是由之前的代码计算好的,传进去opReplicatedEnough之后不会被修改 bool opReplicatedEnough( const ReplTime& lastOpApplied ) { return mongo::opReplicatedEnough( lastOpApplied , slaveCount ); } bool opReplicatedEnough( OpTime op , BSONElement w ) { return slaveTracking.opReplicatedEnough( op , w ); } };
好了,在flushPendingWrites里面可以看到, slave的数量是打印的这个变量, 关键是 mongo::opReplicatedEnough传进来的 slaveCount ,他是如何计算的?. 我们可以找到这个变量的计算方式:
slaveCount = ( getSlaveCount() / 2 ) + 1; (在2.2.7版本,计算slaveCount不再使用这个函数)
我们的replica set有3个secondary, 也就是说, 正常的日志打印出来应该是 3/2+1=2 , 但是故障发生的时候, 打印出来是3, 也就是4/2+1或者5/2+1, 也就是说这个mongod实例认为他有有4或者5个secondary, 可以猜测是数据库内部发生异常了.
处理:
这么看来,或许只有按10gen的工程师说的那样,升级数据库了,但是升级数据库不能很快进行,于是找一找其他办法,可以对付这个bug
1) 看一下这个slaveCount是怎么来的:
mongo::opReplicatedEnough( lastOpApplied , slaveCount );这个函数的功能就是, 如果有slaveCount个secondary的opTime大于lastOpApplied,那么返回true. 深入这个函数内部, 找到这个代码:
bool _replicatedToNum_slaves_locked(OpTime& op, int numSlaves ) { for ( map<Ident,OpTime>::iterator i=_slaves.begin(); i!=_slaves.end(); i++) { OpTime s = i->second; if ( s < op ) { continue; } if ( --numSlaves == 0 ) return true; } return numSlaves <= 0; }
到这里就发现了,其实就是这个变量_slaves存储了seconday的信息, 找到其定义: map<Ident,OpTime> _slaves;
2) 到这里,基本就可以确定,在故障的mongod里面,_slaves里面有4或者5个元素,于是我想到,如果能找到什么办法,让_slaves清空或者重置则是再好不过,于是找到了修改_slaves的地方:
//repl_block.cpp:111 void reset() { if ( _currentlyUpdatingCache ) return; scoped_lock mylk(_mutex); _slaves.clear(); //清除 } //repl_block.cpp:118 void update( const BSONObj& rid , const string& host , const string& ns , OpTime last ) { REPLDEBUG( host << " " << rid << " " << ns << " " << last ); Ident ident(rid,host,ns); scoped_lock mylk(_mutex); _slaves[ident] = last; //修改 _dirty = true; //..... }
有且只有这两个地方直接更改 _slaves. 于是接着看代码, 可以发现,在数据库关闭的时候和replica的配置修改的时候会重置_slaves. 因为_slaves.clear()的上层代码还是比较多,并且还想看看有没有其他更简便的方法, 并且确认重启或者修改replica的配置可以重置_slaves, 所以不犹豫直接在源码上面多个地方打log, 重新编译mongodb到测试环境集群去测试:
src/mongo/db/repl_block.cpp:115:log(LL_ERROR) << "[yyy] clear _slave in " << " void reset() _slave.clear()\n"; src/mongo/db/repl_block.cpp:126:log(LL_ERROR) << "[yyy] ident:" << rid.toString(false, false) << " " << host << " " << ns << " | last:" << last; src/mongo/db/repl/replset_commands.cpp:215:log(LL_ERROR) << "[yyy] clear _slave in " << " CmdReplSetReconfig"; src/mongo/db/repl/rs_config.cpp:72:log(LL_ERROR) << "[yyy] clear _slave in " << "saveConfigLocally" << rsLog; src/mongo/db/oplog.cpp:50:log(LL_ERROR) << "[yyy] clear _slave in " << " close database";
最后确认了会重置_slaves的操作: mongod启动和replica配置的修改
2) 综合了上面的推论和测试, 认为重启mongod可以让mongod根据实际情况重置_slaves, 否则处于日常的运转中, 只有修改replica的配置. 鉴于修改配置风险更高,所以打算通过重启mongod的方式处理.
3) 因为我们的集群请求量很大, 保险起见, 在凌晨操作. 切主 - 等待换主和同步 - 关闭原来的primary - 重启 - 等待同步 - 把主切回来 - 等待换主和同步 - move chunk - 成功{ "millis" : 2977, "ok" : 1 }!
后续: 这两天跟了很多mongodb的代码,后面还写点代码分析.