生产环境新上线集群小概率gtid空洞问题

背景:生产环境新上集群会产生gtid空洞问题

例子:

群集名

服务器

端口

master UUID

复制异常点

ibupub1group01

10.60.52.168

55944

9cd48d9c-6e29-11ec-92d8-98039ba567ea

存在GAP: 间隔10秒两次动态GTID比较:1-402:404-5103#1-402:404-5109

 

线索1:查看binlog发现空洞gtid的binlog不存在,且问题发生于集群已经搭建完毕,已经开始写心跳的阶段后

 

[op1@SVR16599HP360 tmp]$ cat log.txt |grep "9cd48d9c-6e29-11ec-92d8-98039ba567ea:40" -A 10 | head -n 50

SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:402'/*!*/;

# at 159796

#220105 22:45:58 server id 52153  end_log_pos 159880 CRC32 0xdc731cf7    Query     thread_id=6068       exec_time=0   error_code=0

SET TIMESTAMP=1641393958/*!*/;

BEGIN

/*!*/;

# at 159880

#220105 22:45:58 server id 52153  end_log_pos 159957 CRC32 0x49efcff4      Rows_query

# replace into repl_heartbeat

# values (@@hostname,now())

# at 159957

--

SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:404'/*!*/;

# at 160194

 

继续查:

线索2:线上已有集群写心跳不会出现问题,为什么新上集群才有小概率会发生?

猜测:和写心跳这个操作无关,写心跳只是gtid增长的手段,新上集群和已有集群的差别在于新上集群在搭建集群环节有reset master操作,怀疑到语句reset master

验证

经过一系列测试,接下来的操作可100%稳定复现:

第一步:在测试集群上部署一个小脚本,while 1 不停的在master上写心跳,然后在master上手工reset master。然后观察Executed_Gtid_Set:

mysql> show master status;
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| FATM00003214-bin.000001 | 12771 | | | 744c6c9a-766d-11eb-9807-fa163e6a649a:1-31:135 |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+

发现会稳定出现一个不存在的幽灵gtid 135,且binlog中搜不到这个gtid的相关信息:

mysqlbinlog -vvv FATM00003214-bin.000001|grep 744c6c9a-766d-11eb-9807-fa163e6a649a:135   ---》 返回空

 

第二步:模拟搭建集群操作,在slave上stop slave;reset master;change master to xxx;start slave;show slave status;

mysql> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.6.26.233

                  Master_User: usvr_replication

                  Master_Port: 55111

                Connect_Retry: 60

              Master_Log_File: FATM00003214-bin.000001

          Read_Master_Log_Pos: 15585

               Relay_Log_File: FATM00003215-relay-bin.000002

                Relay_Log_Pos: 15812

        Relay_Master_Log_File: FATM00003214-bin.000001

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 15585

              Relay_Log_Space: 16082

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 10001

                  Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp:

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38

            Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38

                Auto_Position: 1

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

1 row in set (0.00 sec)

 

Slave在向master拉取gtid时,拉取到了理论上应该存在的744c6c9a-766d-11eb-9807-fa163e6a649a:1-38,以及幽灵gtid 135,然后因为binlog里没有gtid135,导致slave上不会执行,不会记载在Executed_Gtid_Set中。

此时集群已经搭建完成,在心跳不断的写入的情况下,master的gtid :1-38:135会被填平至例如1-169,但因为master的135是不存在的,slave上就会被跳过,至此产生空洞:

 

mysql> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.6.26.233

                  Master_User: usvr_replication

                  Master_Port: 55111

               Connect_Retry: 60

              Master_Log_File: FATM00003214-bin.000001

          Read_Master_Log_Pos: 68475

               Relay_Log_File: FATM00003215-relay-bin.000002

                Relay_Log_Pos: 68702

        Relay_Master_Log_File: FATM00003214-bin.000001

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 68475

              Relay_Log_Space: 68972

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 10001

                  Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp:

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169

            Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169

                Auto_Position: 1

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

1 row in set (0.00 sec)

 

结论:集群上线的过程中可能在mysql实例部署完毕后某个脚本就开始运行,然后这个采集脚本的insert动作和master的reset master操作撞车,产生那个幽灵gtid,导致后续slave/dr上产生gtid空洞。至于为什么撞车这个动作会生成一个幽灵gtid尚且不知

解决方法:集群搭建流程中master在reset master之前flush tables with read lock; 等最后再打开,杜绝撞车的可能性

 

上一篇:换了IP地址,原来的SSL证书是否还能使用?


下一篇:haproxy ssl 支持