一大早的,某从库突然报出故障:SQL线程中断!
查看从库状态:
mysql> show slave status\G
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin.026023
Read_Master_Log_Pos: 230415889
Relay_Log_File: relay-bin.058946
Relay_Log_Pos: 54632056
Relay_Master_Log_File: mysql-bin.026002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1452
Last_Error: Error 'Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`trigger_discovery`, CONSTRAINT `c_trigger_discovery_2` FOREIGN KEY (`parent_triggerid`) REFERENCES `triggers` (`triggerid`) ON DELETE CASCADE)' on query. Default database: 'zabbix'. Query: 'insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,'Free inodes is less than 20% on volume {#FSNAME}'), (1678,26250,22532,'Free inodes is less than 20% on volume {#FSNAME}'), (1679,26251,22532,'Free inodes is less than 20% on volume {#FSNAME}')'
Exec_Master_Log_Pos: 54631910
重点关注报错信息,定位问题,问题是:Cannot add or update a child row:a foreign key constraint fails ,涉及到的外键是:c_trigger_discovery_2
那这个外键的定义是什么呢?
报错信息中也有列出:
trigger_discovery`, CONSTRAINT `c_trigger_discovery_2` FOREIGN KEY (`parent_triggerid`) REFERENCES `triggers` (`triggerid`) ON DELETE CASCADE
那明白了,是表trigger_discovery中的列`parent_triggerid`和表triggers中的列`triggerid`有外键关联,现在这里的数据插入出现了问题
那为什么会出现问题?
继续看报错,错误是从这里开始的:
insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,'Free inodes is less than 20% on volume {#FSNAME}')
上述外键对应的列parent_triggerid的值是22532,难道这个值在表triggers中有问题?
我们去表triggers中查看:
从库
mysql> select * from triggers where triggerid=22532;
Empty set (0.00 sec) 主库
mysql> select * from triggers where triggerid=22532;
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| triggerid | expression | description | url | status | value | priority | lastchange | comments | error | templateid | type | value_flags | flags |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| 22532 | {23251}<20 | Free inodes is less than 20% on volume {#FSNAME} | | 0 | 0 | 2 | 0 | | | 13272 | 0 | 1 | 2 |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
1 row in set (0.00 sec)
果然,从库中没有这个值对应的信息,但主库中是有的,原来是主从不一致导致的,从库中缺失这个值,主库中顺利插入了,但数据传到从库后,从库的外键约束限制了这一插入操作,所以SQL线程阻塞。
问题找到了,那怎么解决?
首先,为了让从库尽快恢复运行,就先把这个错误跳过吧:
mysql>SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; #跳过一个事务
mysql>start slave;
接下来就是主从数据不一致的问题,可以使用pt-table-checksum来检查下不一致的数据,再进行同步,具体步骤如下:
在主库执行: mysql>GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE,CREATE,DELETE,INSERT,UPDATE ON *.* TO 'USER'@'MASTER_HOST' identified by 'PASSWORD';
注:创建用户,这些权限都是必须的,否则会报错
shell> ./pt-table-checksum --host='master_host' --user='user' --password='password' --port='port' --databases=zabbix --ignore-tables=ignore_table --recursion-method=processlist
注:(1)因为涉及到的表太多,查看后发现很多表都有外键关联,错综复杂,而且因为是监控表,即使丢失一些也没什么关系,所以查出较大的且没有外键关联的表用ignore-tables选项排除,对其他表进行比对,如果表比较少的话直接指定--TABLES
(2)recursion-method如果不设的话,会报错:Diffs cannot be detected because no slaves were found. 其参数有四:processlist/hosts/dsn/no,用来决定查找slave的方式是show full processlist还是show slave hosts还是直接给出slave信息,具体用法在另一随笔pt-table-checksum介绍中详述
shell>./pt-table-sync --print --replicate=percona.checksums h=master_host,u=user,p=password,P=port h=slave_host,u=user,p=password,P=port --recursion-method=processlist >pt.log
注:最好使用--print,不要直接使用--execute,否则如果弄出问题,就更麻烦了,打印出直接执行的语句,去从库执行就好了
将pt.log传到从库,直接执行,然后再次在主库上进行一致性检查,如果还有不一致的数据,记得登录mysql去把checksums表清空,然后再次进行检查同步,直到没有不一致的数据。
当然,如果主从数据反复出现不一致的话,那就要先去检查造成不一致的原因了,釜底抽薪才是硬道理。