问题现象
##### 统一监控告警平台-DB: [触发]
> `告警集群:` jq-prod-db-prometheus
> `告警描述:` 数据库Oracle堵塞会话超过5分钟,会话ID:6055
> `告警环境:` prod
> `告警类型:` oracle_prod
> `告警实例:` nbadb02
> `告警IP:` 11.11.11.11
> `实例别名:` nbadb02
> `当前数值:` 2
> `告警级别:` 严重
> `触发时间:` 2021-03-16 14:11:59
> `持续时间:` 10s
> `告警次数:` 1
> `告警应用:`nba_oracle
> `应用负责人:`
处理过程
- 登录
11.11.11.11
进入oracle,并设置列宽(如果使用plsql则不用设置)
[bzops@nbadb02 ~]$ sudo -i
[root@nbadb02 ~]# su - oracle
[oracle@nbadb02 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Tue Mar 2 15:05:36 2021
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL>
SQL> col sample_time for a30
col event for a30
col program for a30
col machine for a40
set lines 200 pages 999
- 查看当前的活动事务,发现会话
6055
和1573
是14:07就已经开始了,到现在还没有结束,
同时last_call_et
是0,说明会话还在不停的执行sql中,并没有空闲
其中6055
占用的undo已经达到了11M,说明事务比较大
SQL> select a.inst_id inst,A.sid,a.serial#,a.blocking_session blk,a.last_call_et call,a.status,a.sql_id,a.PLSQL_ENTRY_OBJECT_ID plsql,PREV_SQL_ID,b.start_date,a.event,b.USED_UBLK*8/1024 m,a.seconds_in_wait wait,a.osuser,a.program,a.logon_time,b.xid,a.machine
from gv$session a,gv$transaction b where a.taddr=b.addr and b.addr is not null and a.inst_id=b.inst_id order by 5 desc;
INST SID SERIAL# BLK CALL STATUS SQL_ID PLSQL PREV_SQL_ID START_DATE EVENT M WAIT OSUSER PROGRAM LOGON_TIME XID MACHINE
1 2 1545 7417 0 ACTIVE crwj8kgsbsghz crwj8kgsbsghz 2021/3/16 14:14:59 SQL*Net message to client 0.046875 0 nbauser JDBC Thin Client 2021/3/16 14:11:16 58001500481BE003 nba-service-daemon-58f676dcf6-bk8rp
2 2 6059 34681 0 INACTIVE 2xuyhc81va9jk 2021/3/16 14:11:49 SQL*Net message from client 0.8984375 0 nbauser JDBC Thin Client 2021/3/16 14:03:07 F6000400896A0A01 nba-service-mq-6c654b54df-dzw62
3 2 6055 46555 0 INACTIVE dxupr3kq3k911 2021/3/16 14:07:10 SQL*Net message from client 11.4609375 0 nbauser JDBC Thin Client 2021/3/16 14:06:26 3F001C004161E604 nba-service-web-64cd6dcc69-w45f2
4 2 1573 24181 0 INACTIVE 2uy2z5g8ky2ab 2021/3/16 14:07:01 SQL*Net message from client 1.578125 0 nbauser JDBC Thin Client 2021/3/16 13:07:09 D3000D00D01C6801 nba-service-mq-6c654b54df-pqgp6
- 查看会话历史,可以看到很多会话都被
6055
阻塞了,印证了上面查的6055
的事务比较大
SQL> select sample_time,session_id as sid,sql_id,event,plsql_entry_object_id as plsql,sql_plan_hash_value as plan,sql_plan_line_id as line,sql_exec_id as execid,blocking_session blk,program as program,xid,user_id ,machine from gv$active_session_history
where blocking_session is not null ;
SAMPLE_TIME SID SQL_ID EVENT PLSQL PLAN LINE EXECID BLK PROGRAM XID USER_ID MACHINE
1 16-MAR-21 02.14.26.440 PM 1731 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065086 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-w45f2
2 16-MAR-21 02.14.26.440 PM 3742 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064614 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
3 16-MAR-21 02.14.26.440 PM 1773 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064803 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
4 16-MAR-21 02.14.26.440 PM 6108 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064714 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
5 16-MAR-21 02.14.26.440 PM 7261 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065133 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-5h6zf
6 16-MAR-21 02.14.26.440 PM 9541 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064892 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-wcctn
7 16-MAR-21 02.14.26.440 PM 10690 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064975 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
8 16-MAR-21 02.14.26.440 PM 11724 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065049 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-76hs8
9 16-MAR-21 02.14.25.440 PM 1773 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064803 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
10 16-MAR-21 02.14.25.440 PM 1731 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065086 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-w45f2
11 16-MAR-21 02.14.25.440 PM 3742 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064614 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
12 16-MAR-21 02.14.25.440 PM 6108 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064714 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
13 16-MAR-21 02.14.25.440 PM 7261 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065133 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-5h6zf
14 16-MAR-21 02.14.25.440 PM 9541 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064892 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-wcctn
15 16-MAR-21 02.14.25.440 PM 10690 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064975 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
16 16-MAR-21 02.14.25.440 PM 11724 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065049 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-76hs8
17 16-MAR-21 02.14.24.430 PM 9541 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064892 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-wcctn
18 16-MAR-21 02.14.24.430 PM 11724 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065049 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-76hs8
19 16-MAR-21 02.14.24.430 PM 10690 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064975 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
20 16-MAR-21 02.14.24.430 PM 7261 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065133 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-5h6zf
21 16-MAR-21 02.14.24.430 PM 6108 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064714 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
22 16-MAR-21 02.14.24.430 PM 3742 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064614 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
23 16-MAR-21 02.14.24.430 PM 1773 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064803 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
24 16-MAR-21 02.14.24.430 PM 1731 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065086 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-w45f2
25 16-MAR-21 02.14.23.430 PM 10690 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064975 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
26 16-MAR-21 02.14.23.430 PM 1731 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065086 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-w45f2
27 16-MAR-21 02.14.23.430 PM 11724 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065049 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-76hs8
28 16-MAR-21 02.14.23.430 PM 1773 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064803 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
29 16-MAR-21 02.14.23.430 PM 3742 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064614 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-dtkz6
30 16-MAR-21 02.14.23.430 PM 6108 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064714 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
31 16-MAR-21 02.14.23.430 PM 7261 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49065133 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-5h6zf
32 16-MAR-21 02.14.23.430 PM 9541 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064892 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-wcctn
33 16-MAR-21 02.14.22.400 PM 10690 08qkr18cnk9rk enq: TX - row lock contention 585082728 1 49064975 6055 JDBC Thin Client 101 nba-service-web-64cd6dcc69-64v4t
................
- 既然知道了阻塞者是
6055
,同时也不仅6055
一个会话到现在没结束,会话也不是空闲中,
接下来只能找出6055
正在跑的sql发给开发评估这个事务是不是正常
oracle中可以通过sql_trace来记录会话一段时间内都跑的什么sql
(如果sql执行时间大于一秒,其实通过活动会话历史v$active_session_history也可以定位,但是6055
的last_call_et
是0,说明执行的sql都是小于一秒的)
SQL> exec dbms_system.set_sql_trace_in_session(6055,46555,true);
PL/SQL procedure successfully completed.
间隔1分钟
SQL> exec dbms_system.set_sql_trace_in_session(6055,46555,false);
PL/SQL procedure successfully completed.
- 收集好sql_trace后,进入oracle的trace目录,按照时间排序可以看到最新的ora_开头的日志
-rw-r----- 1 oracle asmadmin 74K Mar 16 14:15 nba2_ora_149894.trm
-rw-r----- 1 oracle asmadmin 2.3M Mar 16 14:15 nba2_ora_149894.trc
[oracle@nbadb02 trace]$ tkprof nba2_ora_149894.trc
output = bk.log
TKPROF: Release 11.2.0.4.0 - Development on Tue Mar 16 14:16:08 2021
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
[oracle@nbadb02 trace]$ sz bk.log
- 过滤掉bk.log中的sys用户的递归sql,可以看到
6055
执行的sql逻辑
SQL ID: dzns8cts4u1a4 Plan Hash: 383260619
update tab1 set col1=:1 , col2=:2 , col3=
:3 , col4=:4 , col5=:5 , col6=:6 , col7=:7 , col8=:8 ,
col9=:9 , col10=:10 , col11=:11 , col12=:12 , col13=:13
where
ID=:14 and VERSION=:15
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1878 0.03 0.05 0 0 0 0
Execute 1878 0.28 0.46 202 9475 17465 1878
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3756 0.31 0.51 202 9475 17465 1878
....
SQL ID: dxupr3kq3k911 Plan Hash: 743517995
select sysinterfa0_.ID as ID400_, sysinterfa0_.BRAND_ID as BRAND2_400_,
sysinterfa0_.CONTENT as CONTENT400_, sysinterfa0_.CREATE_TIME as
CREATE4_400_, sysinterfa0_.ERROR_COUNT as ERROR5_400_, sysinterfa0_.OP_TIME
as OP6_400_, sysinterfa0_.REMARK as REMARK400_, sysinterfa0_.SHOP_ID as
SHOP8_400_, sysinterfa0_.SLIP_CODE as SLIP9_400_, sysinterfa0_.SOURCE_SYS
as SOURCE10_400_, sysinterfa0_.STATUS as STATUS400_,
sysinterfa0_.TARGET_SYS as TARGET12_400_, sysinterfa0_.TYPE as TYPE400_,
sysinterfa0_.VERSION as VERSION400_
from
tab2 sysinterfa0_ where sysinterfa0_.SLIP_CODE=:1 and
sysinterfa0_.TYPE=:2 and sysinterfa0_.SOURCE_SYS=:3 and
sysinterfa0_.TARGET_SYS=:4 and sysinterfa0_.BRAND_ID=:5
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1913 0.03 0.04 0 0 0 0
Execute 1913 0.02 0.03 0 0 0 0
Fetch 1913 0.04 0.09 55 9854 0 1878
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 5739 0.10 0.17 55 9854 0 1878
....
select shop.id
from tab1 sku,
tab2 pro,
tab3 br,
tab4 shop
where sku.product_id = pro.id
and pro.brand_id = br.brand_id
and br.shop_id = shop.id
and sku.code = :1
and shop.IS_SYNC_BMS = 1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 382 0.00 0.00 0 0 0 0
Execute 382 0.00 0.00 0 0 0 0
Fetch 382 0.11 0.14 1 15280 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 1146 0.12 0.16 1 15280 0 0
.....
解决办法
- 把以上sql发给开发分析后,反馈这个事务逻辑有优化空间,现在在做批量库存的操作,有阻塞可以忽略,未来会优化逻辑