常见问题
注意:
- 任何情况下都不要直接 kill Pump或Drainer
- 一但数据同步失败 Drainer就会进入 paused (暂停状态)
TiDB3.0.2同步 MariaDB 数据超长引发 binglog同步失败,解决方案如下:
-
修改TiDB错误
例如:删除某个超长的列,然后重新创建一个与MySQL长度合理的列。 删除表时不要忘记备份数据 -
查看
drainer.log
找到发生异常的时间点
[tidb@dev11 ~]$ tail -1000f /home/tidb/deploy/log/drainer.log
......
[2019/08/22 11:27:21.586 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410634011353284609]
[2019/08/22 11:27:26.397 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410634012205252609]
[2019/08/22 11:27:30.608 +08:00] [ERROR] [executor.go:85] ["exec fail"] [query="DELETE FROM `dev2_pfizer_activiti`.`ACT_RU_JOB` WHERE `ID_` = ? LIMIT 1"] [args="[\"ZTA4MjMyZDAtYzNmYS0xMWU5LWJkZjEtMDY4YWEwOWI2NzQy\"]"] [error="Error 1213: Deadlock found when trying to get lock; try restarting transaction"]
......
# 异常的发生时间为 [2019/08/22 11:27:30.608 +08:00]
- 根据 异常的发生时间 [2019/08/22 11:27:30.608 +08:00] 到
pump.log
找到这个时间相近的下一个 MaxCommitTS的值
[tidb@dev11 ~]$ vim /home/tidb/deploy/log/pump.log
# 在vim中查找
/[2019/08/22 11:27:3
# 根据时间的定位,最相近的下一个 TS值是 [MaxCommitTS=410634015285444609]
[2019/08/22 11:27:36.411 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":5790744,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/22 11:27:36.445 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=8932] [alivePullerCount=3] [MaxCommitTS=410634015285444609]
[2019/08/22 11:27:46.411 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":5791236,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/22 11:27:46.445 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=8932] [alivePullerCount=3] [MaxCommitTS=410634017644740612]
- 修改下游数据库中的CommitTS
MariaDB [(none)]> show databases;
+-----------------------------+
| Database |
+-----------------------------+ |
| information_schema |
| mysql |
| performance_schema |
| tidb_binlog |
| tidb_loader |
+-----------------------------+
22 rows in set (0.00 sec)
MariaDB [(none)]>
MariaDB [(none)]> use tidb_binlog;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [tidb_binlog]>
MariaDB [tidb_binlog]>
MariaDB [tidb_binlog]> show tables;
+-----------------------+
| Tables_in_tidb_binlog |
+-----------------------+
| checkpoint |
+-----------------------+
1 row in set (0.00 sec)
MariaDB [tidb_binlog]> select * from checkpoint;
+---------------------+---------------------------------------------+
| clusterID | checkPoint |
+---------------------+---------------------------------------------+
| 6697082055159617667 | {"commitTS":410636899695198209,"ts-map":{}} |
+---------------------+---------------------------------------------+
1 row in set (0.00 sec)
MariaDB [tidb_binlog]>
- 启动 drainer
[tidb@test1 tidb-ansible]$ ansible-playbook start_drainer.yml
Congrats! All goes well. :-)
[tidb@test1 tidb-ansible]$
注意:
如果因为操作pump不得当导致 TiDB客户端无法连接,需要重新执行如下语句进行无感知更新
ansible-playbook rolling_update.yml --tags=tidb
注意:
任何情况下都不要直接 kill Pump或Drainer
注意:
drainer发生异常后会自动切换为 paused 状态,要恢复为 online状态只需要重新执行一次语句即可
nohup ./bin/drainer -config config/drainer.toml &
重新部署容器后异常 这个问题是 pump 没启动 就设置了 开启binlog导致的
[tidb@test1 tidb-ansible]$ ansible-playbook start.yml
fatal: [192.168.180.46]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}
PLAY [grafana_servers] **************************************************************************************************************************************************************************
to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry
PLAY RECAP **************************************************************************************************************************************************************************************
192.168.180.46 : ok=27 changed=0 unreachable=0 failed=1
192.168.180.47 : ok=10 changed=0 unreachable=0 failed=1
192.168.180.48 : ok=13 changed=0 unreachable=0 failed=1
localhost : ok=7 changed=4 unreachable=0 failed=0
ERROR MESSAGE SUMMARY ***************************************************************************************************************************************************************************
[192.168.180.47]: Ansible FAILED! => playbook: start.yml; TASK: wait until the PD port is up; message: {"changed": false, "elapsed": 300, "msg": "the PD port 2379 is not up"}
[192.168.180.48]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}
[192.168.180.46]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}
tidb log 异常信息 listener stopped, waiting for manual kill
[ERROR] [server.go:364] ["listener stopped, waiting for manual kill."] # 原因是 开启binlog必须先开启 pump, 这个错误是因为 pump下线了导致的
[stack="github.com/pingcap/tidb/server.(*Server).Run\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/server/server.go:364\nmain.runServer\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:568\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:174\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
drainer log 异常信息 got signal to exit
[2019/08/20 20:56:27.850 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597664302825480]
[2019/08/20 20:56:32.301 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597665574223873]
[2019/08/20 20:56:36.537 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597666688335873]
[2019/08/20 20:56:40.817 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597667802447873]
[2019/08/20 20:56:45.456 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597669047631873]
[2019/08/20 20:56:49.539 +08:00] [INFO] [main.go:63] ["got signal to exit."] [signal=hangup] # 原因是使用 nohup 启动,但是客户连接是非正常使用 exit 断开,导致系统杀掉了 drainer 进程。
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:406] ["begin to close drainer server"]
[2019/08/20 20:56:49.539 +08:00] [ERROR] [pump.go:147] ["pump receive binlog failed"] [id=dev10:8250] [error=EOF]
[2019/08/20 20:56:49.543 +08:00] [INFO] [server.go:371] ["has already update status"] [id=dev10:8249]
[2019/08/20 20:56:49.543 +08:00] [INFO] [server.go:410] ["commit status done"]
[2019/08/20 20:56:49.543 +08:00] [INFO] [collector.go:130] ["publishBinlogs quit"]
[2019/08/20 20:56:49.544 +08:00] [INFO] [util.go:66] [Exit] [name=heartbeat]
[2019/08/20 20:56:49.544 +08:00] [INFO] [pump.go:72] ["pump is closing"] [id=dev10:8250]
[2019/08/20 20:56:49.544 +08:00] [INFO] [syncer.go:254] ["write save point"] [ts=410597669244239873]
[2019/08/20 20:56:49.544 +08:00] [INFO] [util.go:66] [Exit] [name=collect]
[2019/08/20 20:56:49.544 +08:00] [INFO] [load.go:434] ["Run()... in Loader quit"]
[2019/08/20 20:56:49.544 +08:00] [INFO] [mysql.go:114] ["Successes chan quit"]
[2019/08/20 20:56:49.633 +08:00] [INFO] [syncer.go:246] ["handleSuccess quit"]
[2019/08/20 20:56:49.633 +08:00] [INFO] [util.go:66] [Exit] [name=syncer]
[2019/08/20 20:56:49.633 +08:00] [INFO] [main.go:73] ["drainer exit"]
pump log 异常信息 got signal to exit
[2019/08/20 20:56:05.316 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=10016] [alivePullerCount=1] [MaxCommitTS=410597658037583873]
[2019/08/20 20:56:15.309 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":2225967,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/20 20:56:15.317 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=10016] [alivePullerCount=1] [MaxCommitTS=410597660396879873]
[2019/08/20 20:56:25.309 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":2226617,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/20 20:56:25.316 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=10018] [alivePullerCount=1] [MaxCommitTS=410597663542607875]
[2019/08/20 20:56:35.309 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":2229972,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/20 20:56:35.317 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=10054] [alivePullerCount=1] [MaxCommitTS=410597665587331079]
[2019/08/20 20:56:45.309 +08:00] [INFO] [storage.go:381] [DBStats] [DBStats="{\"WriteDelayCount\":0,\"WriteDelayDuration\":0,\"WritePaused\":false,\"AliveSnapshots\":0,\"AliveIterators\":0,\"IOWrite\":2230701,\"IORead\":0,\"BlockCacheSize\":0,\"OpenedTablesCount\":0,\"LevelSizes\":null,\"LevelTablesCounts\":null,\"LevelRead\":null,\"LevelWrite\":null,\"LevelDurations\":null}"]
[2019/08/20 20:56:45.317 +08:00] [INFO] [server.go:522] ["server info tick"] [writeBinlogCount=10058] [alivePullerCount=1] [MaxCommitTS=410597667815555073]
[2019/08/20 20:56:49.539 +08:00] [INFO] [main.go:65] ["got signal to exit."] [signal=hangup] # 原因是使用 nohup 启动,但是客户连接是非正常使用 exit 断开,导致系统杀掉了 pump 进程。
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:839] ["begin to close pump server"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:496] ["detect drainer checkpoint routine exit"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:537] ["gcBinlogFile exit"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:519] ["printServerInfo exit"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:472] ["genFakeBinlog exit"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [server.go:848] ["background goroutins are stopped"]
[2019/08/20 20:56:49.539 +08:00] [INFO] [node.go:216] ["Heartbeat goroutine exited"]
[2019/08/20 20:56:49.543 +08:00] [INFO] [server.go:832] ["update state success"] [NodeID=dev10:8250] [state=paused]
[2019/08/20 20:56:49.543 +08:00] [INFO] [server.go:851] ["commit status done"]
[2019/08/20 20:56:49.544 +08:00] [INFO] [server.go:855] ["grpc is stopped"]
[2019/08/20 20:56:49.544 +08:00] [INFO] [chaser.go:76] ["Slow chaser quits"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [server.go:860] ["storage is closed"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [server.go:865] ["pump node is closed"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [server.go:871] ["has closed pdCli"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [server.go:876] ["has closed tiStore"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [main.go:68] ["pump is closed"]
[2019/08/20 20:56:49.546 +08:00] [INFO] [main.go:80] ["pump exit"]
二进制部署的pump/drainer 转 配置文件部署的pump/drainer 注意事项
要按照如下步骤切换部署
- 加入pump/drainer集群
- 暂停二进制pump/drainer
- 下线进进制pump/drainer(注:状态必须是offline状态否则,切换失败)
tail -1000 drainer.log 异常信息如下:
[tidb@dev11 log]$ pwd
/home/tidb/deploy/log
[tidb@dev11 log]$ tail -1000 drainer.log
[2019/08/21 08:46:07.005 +08:00] [INFO] [pump.go:133] ["pump create pull binlogs client"] [id=dev10:8250]
[2019/08/21 08:46:07.006 +08:00] [ERROR] [pump.go:217] ["pump create PullBinlogs client failed"] [id=dev10:8250] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.180.32:8250: connect: connection refused\""]
[2019/08/21 08:46:07.006 +08:00] [ERROR] [pump.go:135] ["pump create pull binlogs client failed"] [id=dev10:8250] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.180.32:8250: connect: connection refused\""] [errorVerbose="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.180.32:8250: connect: connection refused\"\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb-binlog/drainer.(*Pump).createPullBinlogsClient\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/pump.go:221\ngithub.com/pingcap/tidb-binlog/drainer.(*Pump).PullBinlog.func1\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb-binlog/drainer/pump.go:134\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"]
Drainer 集群同步数据到 MariaDB 产生重复数据