记一次Lock wait timeout异常的排查过程

#########################

业务反映某一天这个Lock wait timeout异常特别严重,于是找到我来排查,这不很明显的吗?就是等待锁超时异常了,但是呢,业务不知道哪个地方出现了锁等待,也不清楚什么sql导致的,拉一堆人马在那里等待我救命

于是,我迅雷不及掩耳之势查找了mysql错误日志和中间件日志,业务也说了什么时候出现的,因此直接定位到对应时间段周围:

 

percona版本的mysql服务的错误日志如下:

2021-07-13T14:21:27.442042+08:00 3262227 [Note] InnoDB: Lock wait timeout info:
Requested thread id: 3262227
Requested trx id: 48830402
Requested query: update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790
Total blocking transactions count: 1
Blocking transaction number: 1
Blocking thread id: 3232905
Blocking query id: 1092250130
Blocking trx id: 48830301

根据错误日志得出:

1)时间是:2021-07-13T14:21:27.442042+08:00,

2)问题是:mysql的3262227会话线程被3232905会话线程给阻塞了

3)3262227会话线程上的什么sql被阻塞了?3262227会话线程上的“update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790”这个sql语句被阻塞住了,

      但是呢,这里只知道3232905会话线程的id,其余信息对解决问题没啥帮助,于是排查kingshard上的业务sql,中间件日志记录了业务发起sql的时间,执行时间,来源ip,目的ip,会话线程id,SQL语句,

      因此根据3232905会话线程这个id去找到这个会话在该时间段周围的所有sql,于是如下:

 

数据库中间件kingshard的日志如下:

2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465,conn_id=7479433|COMMIT
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select current_timestamp
2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|insert into apple  (ext_id, 
                                   int_id, txnid, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, mac_last_heartbeat, mac_user, mac_host) 
                                   values (105790, 1,0, dw_business, dwd_ord_ord_df, null, w, e, 1626157265000, work, zjy-hadoop-prc-st2203.bj)
2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, 
                                   int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where ext_id = 105790
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|SAVEPOINT `__66a6cb85_17a856cffba__3b3e`
2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id,
                                   int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where mac_db in
                                   (dw_business) and (mac_table is null or mac_table in(dwd_ord_ord_df)) and ext_id <= 105790
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|COMMIT
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465->0.0.0.0:6853,conn_id=7479433|SET autocommit=1
2021/07/13 14:21:07 - Connected - conn_id=7543980, autopilot@127.0.0.1:15509 () capability: 238213
2021/07/13 14:21:07 - Quit - conn_id=7543980, 127.0.0.1:15509
2021/07/13 14:21:11 - Connected - conn_id=7543981, autopilot@127.0.0.1:15667 () capability: 238213
2021/07/13 14:21:11 - Quit - conn_id=7543981, 127.0.0.1:15667
2021/07/13 14:21:14 - OK - 0.0ms - 10.10.10.20:43325->0.0.0.0:6853,conn_id=7309803|SET autocommit=0

### 上面日志表示:

# 3232905这个会话线程先执行了select current_timestamp语句,

# 然后执行了insert into apple语句,

# 然后又鬼使神差地执行了SAVEPOINT `__66a6cb85_17a856cffba__3b3e`语句,

# 最后执行了一个select语句

# 然后就没了然后

................

####下面这句是3262227会话线程的执行结果:ERROR,耗时10s多,来源ip是10.10.10.20,这个是业务方的服务所在机器的ip,目的ip为10.10.10.10,这个是mysql集群的主库ip地址,sql语句如下所示:

2021/07/13 14:21:27 - ERROR - 10981.2ms - 10.10.10.20:45954->10.10.10.10:3306,mysql_connect_id=3262227|update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790

 

 

 

 

 

 

 

 

########################

记一次Lock wait timeout异常的排查过程

上一篇:线上OOM排查步骤总结


下一篇:Fluent中的压力表与物理意义