A Brief Analysis of Oracle's Dead Connection Detection

2023-07-24 11:35:22

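In complex application environments we regularly run into problems that are both intricate and interesting: network failures (packet loss, a dropped wireless link), client program failures (an application crash, for example), an operating system blue screen, or a client machine that loses power, hangs, or reboots. In all of these cases the database connection is broken without being closed properly and without the transaction being committed. When that happens, is your uncommitted transaction rolled back in the database? If so, what triggers the rollback, and how long is it before the rollback is triggered (as opposed to how long the rollback itself takes)? What if the session was only running a query? Does Oracle provide a mechanism for dealing with these situations? If you can already answer all of these questions, you can skip the rest of this article. Before getting to the theory, we will build some test cases and try things out: practice brings real understanding, and abstract theory is easier to absorb and to explain in full once an experiment has made it concrete.

Let's first test a session that exits normally. On the client I connect to the database with SQL*Plus, run an UPDATE statement without committing, and then exit (note: the exit is issued only after some information has been queried on the server side), as shown below: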

SQL> select * from v$mystat where rownum=1;

       SID STATISTIC#      VALUE
---------- ---------- ----------
       196          0          0

SQL> select sid,serial# from v$session where sid=196;

       SID    SERIAL#
---------- ----------
       196          9

SQL> update scott.dept set loc='CHICAGO' where deptno=40;

1 row updated.

SQL> exit    -- run only after the server-side checks below have been made

On the server side we look at some information about session (196,9), as shown below:

SQL> set linesize 1200
SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;

       SID SECONDS_IN_WAIT EVENT
---------- --------------- -------------------------------------------------
       196              33 SQL*Net message from client

SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID
  6         ,A.LOCKED_MODE
  7  FROM   V$LOCKED_OBJECT A,
  8         V$SESSION B
  9  WHERE  A.SESSION_ID = B.SID
 10  ORDER  BY B.LOGON_TIME;

USERNAME                 SID    SERIAL# LOGON_TIM  OBJECT_ID LOCKED_MODE
----------------- ---------- ---------- --------- ---------- -----------
TEST                     196          9 01-DEC-16      73199           3

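As shown above, session 196 holds a lock on table SCOTT.DEPT (locked mode 3, Row-X, i.e. row exclusive, RX), with object ID 73199. We then run exit on the client without committing the UPDATE and check the session again on the server. As the output below shows, after a normal exit the transaction ends and its lock is released immediately, and the session's process and resources are reclaimed right away. (One caveat: by default SQL*Plus commits pending changes on a plain EXIT, so to see a rollback here the session must leave with EXIT ROLLBACK or run with SET EXITCOMMIT OFF; either way the resources are freed at once.)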

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;

no rows selected

SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID
  6  FROM   V$LOCKED_OBJECT A,
  7         V$SESSION B
  8  WHERE  A.SESSION_ID = B.SID
  9  ORDER  BY B.LOGON_TIME;

no rows selected

SQL>

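Next we construct a network failure (this needs several machines or virtual machines). As shown below, we first connect to the server from a virtual machine with SQL*Plus (the account is test; in addition, do not set the SQLNET.EXPIRE_TIME parameter in sqlnet.ora on the server, so that DCD is not enabled; DCD is introduced later), then run an UPDATE statement without committing: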

SQL> show user;
USER is "TEST"
SQL> select * from v$mystat where rownum =1;

       SID STATISTIC#      VALUE
---------- ---------- ----------
       914          0          1

SQL> select sid,serial# from v$session where sid=914;

       SID    SERIAL#
---------- ----------
       914       3944

SQL> update scott.emp set sal=8000 where empno=7369;

1 row updated.

SQL>

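We then cut the virtual machine's network to simulate a network failure (by running service network stop on the client machine), and on the server side use SQL*Plus to check on session (914,3944), as shown below: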

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;

       SID SECONDS_IN_WAIT EVENT
---------- --------------- ----------------------------------------------------------------
       914              93 SQL*Net message from client

SQL>  SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID
  6  FROM   V$LOCKED_OBJECT A,
  7         V$SESSION B
  8  WHERE  A.SESSION_ID = B.SID
  9  ORDER  BY B.LOGON_TIME;

USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460

SQL>

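If we keep running the statements above, we see that session 914 stays INACTIVE, keeps holding its Row-X (row exclusive, RX) lock on the table, and that seconds_in_wait keeps growing: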

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;

       SID SECONDS_IN_WAIT EVENT
---------- --------------- -----------------------------------------------------
       914            4928 SQL*Net message from client

SQL>  SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID
  6  FROM   V$LOCKED_OBJECT A,
  7         V$SESSION B
  8  WHERE  A.SESSION_ID = B.SID
  9  ORDER  BY B.LOGON_TIME;

USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460

SQL>  select sid, seconds_in_wait, event from v$session_wait where sid=914;

       SID SECONDS_IN_WAIT EVENT
---------- --------------- ------------------------------------------------
       914            5853 SQL*Net message from client

SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID
  6  FROM   V$LOCKED_OBJECT A,
  7         V$SESSION B
  8  WHERE  A.SESSION_ID = B.SID
  9  ORDER  BY B.LOGON_TIME;

USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460

SQL>

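In the end we simply wait for the PMON process to reclaim the resources. Over several tests, PMON reclaimed them only after a little more than 7860 seconds each time.

Is this a fixed value, and does it follow some rule? I wrote the following script and ran it on the server (change the sid and serial# values to match your own session) to measure how long the database takes before PMON reclaims the process, releases its resources, and rolls back the transaction.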

-- Empty tables (WHERE 1=0) that will hold the samples.
CREATE TABLE TEST.SESSION_WAIT_RECORD
AS
SELECT sid,
       seconds_in_wait,
       event,
       sysdate AS curr_datetime
FROM   v$session_wait
WHERE  1=0;

CREATE TABLE TEST.LOCK_OBJECT_RECORD
AS
SELECT B.username,
       B.sid,
       B.serial#,
       B.logon_time,
       A.object_id,
       sysdate AS curr_datetime
FROM   v$locked_object A,
       v$session B
WHERE  A.session_id = B.sid
AND    1=0;

-- Sample the session every 10 seconds until it disappears from v$session_wait.
-- Replace 916 and 415 with the sid and serial# of the session being tested.
DECLARE
    v_index NUMBER := 1;
BEGIN
    WHILE v_index != 0 LOOP
        INSERT INTO TEST.SESSION_WAIT_RECORD
        SELECT sid,
               seconds_in_wait,
               event,
               sysdate
        FROM   v$session_wait
        WHERE  sid = 916;

        INSERT INTO TEST.LOCK_OBJECT_RECORD
        SELECT B.username,
               B.sid,
               B.serial#,
               B.logon_time,
               A.object_id,
               sysdate
        FROM   v$locked_object A,
               v$session B
        WHERE  A.session_id = B.sid
        AND    A.session_id = 916
        AND    B.serial# = 415;

        COMMIT;

        dbms_lock.sleep(10);

        -- The loop ends once the session is gone from v$session_wait.
        SELECT Count(*)
        INTO   v_index
        FROM   v$session_wait
        WHERE  sid = 916;
    END LOOP;
END;
/

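1: Are the results consistent across repeated tests? Is there a pattern to the value?

In my tests (admittedly few, and not covering every scenario), the results, read from the SESSION_WAIT_RECORD table, were all roughly in the range 7872 to 7876. Since the script above sleeps for 10 seconds between samples, we can infer that the database cleans up a broken session after a fixed length of time.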

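Test run 1: [figure: SESSION_WAIT_RECORD query output]

Test run 2: [figure: SESSION_WAIT_RECORD query output]

Test run 3: the script's sleep interval was reduced to 2 seconds, to limit the error introduced by a long sleep; the result was 7876.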

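The results look slightly inconsistent, but that is just measurement error, caused by the sleep interval in the script (10 seconds in runs 1 and 2, 2 seconds in run 3) and by other small inaccuracies. The pattern is tied to Linux's TCP keepalive. Let's first look at the TCP keepalive concept: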

The keepalive concept is very simple: when you set up a TCP connection, you associate a set of timers. Some of these timers deal with the keepalive procedure. When the keepalive timer reaches zero, you send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.

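As the name suggests, TCP keepalive is used to keep a TCP connection alive and to check that it still is; note that it applies only to TCP connections. The system maintains a timer for you; when the timer expires, it sends a probe packet carrying no data to the remote peer, and the peer returns an acknowledgment, which tells you the channel is still healthy. Three parameters control TCP keepalive: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes.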

[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@mylnx01uat ~]#

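/proc/sys/net/ipv4/tcp_keepalive_time: with keepalive enabled, how long a connection must be idle before TCP starts sending keepalive probes. The default is 7200 seconds (2 hours).

/proc/sys/net/ipv4/tcp_keepalive_intvl: the interval between successive keepalive probes when a probe goes unacknowledged. The default is 75 seconds.

/proc/sys/net/ipv4/tcp_keepalive_probes: the number of keepalive probes sent without a reply before the connection is considered dead. The default is 9.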
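The three kernel parameters are system-wide defaults; on Linux the same timers can also be set per socket. The Python sketch below is only an illustration of that idea and is not how Oracle configures its own sockets. The helper name make_keepalive_socket is hypothetical, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are platform-specific options, so the code skips any that the platform does not define.

```python
import socket


def make_keepalive_socket(idle_s=7200, interval_s=75, probes=9):
    """Create a TCP socket with keepalive enabled (hypothetical helper).

    The three arguments mirror tcp_keepalive_time, tcp_keepalive_intvl and
    tcp_keepalive_probes, but apply to this one socket only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Keepalive is off by default on a new socket; the timers only matter once it is on.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides; skipped where the platform does not define the option.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = make_keepalive_socket(idle_s=60, interval_s=10, probes=3)
    enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    print("keepalive enabled:", enabled)
    s.close()
```

Shorter per-socket values like the ones in the usage example make a dead peer visible in minutes rather than hours, at the cost of a little extra traffic on idle connections.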

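So with DCD disabled in Oracle, how do the operating system and the database decide that a connection is broken and should be closed? The time works out as follows: the connection first sits idle for 7200 seconds, then probes are sent every 75 seconds; after 9 probes go unanswered (7200 + 75*9 = 7875), the connection is judged dead and can be closed. The value is therefore fixed, 7875 seconds here, though it can differ between operating systems because it depends on the three tcp_keepalive parameters above. After 7875 seconds, PMON reclaims all the resources associated with the connection (rolling back the transaction and releasing locks, latches, and memory). This is very close to what I measured (bear in mind that we sampled wait times and that the test script sleeps between samples, so the collected data is slightly off).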
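The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the fallback values 7200, 75 and 9 are the defaults quoted in this article, used only when the /proc files cannot be read (for example on a non-Linux machine).

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")


def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Idle time before the first probe, plus one interval per unanswered probe."""
    return idle_s + interval_s * probes


def read_kernel_value(name, default):
    """Read one tcp_keepalive_* value from /proc, or fall back to the given default."""
    try:
        return int((PROC_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    idle = read_kernel_value("tcp_keepalive_time", 7200)
    interval = read_kernel_value("tcp_keepalive_intvl", 75)
    probes = read_kernel_value("tcp_keepalive_probes", 9)
    print(keepalive_detection_seconds(idle, interval, probes))
```

With the default values the function returns 7875, which matches the 7872 to 7876 seconds observed in the tests.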

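2: What about a query? What happens then?

For a query the outcome is the same; try it yourself if you are interested.

3: Does this depend on dedicated server mode versus shared server mode?

Testing shows that dedicated server mode and shared server mode behave the same. The timing depends only on the Linux kernel parameters tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes.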

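That raises a problem: a session that keeps holding a Row-X (row exclusive, RX) lock for 7875 seconds can easily cause performance trouble, which is unacceptable on an important system. So, back to our main topic: how does Oracle deal with this? It ought to have a mechanism for it, or it would be a rather weak product. Oracle does in fact provide DCD (Dead Connection Detection) to address this, which is introduced below:

The Dead Connection Detection Concept

DCD is short for Dead Connection Detection; it is used to detect sessions that are dead but have not been disconnected. It works as follows:

When a new database connection is established, SQL*Net reads the SQLNET.EXPIRE_TIME setting from the parameter file (if it is set) and initializes DCD on the server side. DCD creates a timer for the connection; once the timer passes the interval specified by SQLNET.EXPIRE_TIME, the server sends the client a probe packet. The probe is essentially an empty SQL*Net packet that carries no useful data and only generates traffic on the underlying protocol. If the client connection is still healthy, the client simply discards the probe and the Oracle server resets the connection's timer. If the client has terminated abnormally, then when the probe is passed from the client's IP layer up to its TCP layer, TCP finds that the original connection no longer exists and returns an error. Once the Oracle server receives that error it knows the connection is unusable, and SQL*Net signals the operating system to release the connection's resources.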
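The timer-and-probe cycle described above can be pictured with a toy model in Python. This is a deliberately idealized sketch and not Oracle's implementation: the function name and its arguments are hypothetical, time is counted in whole minutes, and a probe is assumed to go out exactly every SQLNET.EXPIRE_TIME minutes.

```python
def first_failed_probe_minute(expire_minutes, client_died_at, horizon_minutes):
    """Return the minute of the first probe sent after the client died.

    Idealized model: a probe goes out every expire_minutes, and the first
    probe sent after client_died_at is the one that exposes the dead
    connection. Returns None if no such probe falls within the horizon.
    """
    if expire_minutes <= 0:
        raise ValueError("expire_minutes must be positive")
    minute = expire_minutes
    while minute <= horizon_minutes:
        if minute > client_died_at:
            return minute
        minute += expire_minutes
    return None


if __name__ == "__main__":
    # Probes at minutes 10, 20, 30, ...; the client vanished at minute 25.
    print(first_failed_probe_minute(10, 25, 120))
```

In this model a client that disappears at minute 25 is first probed at minute 30. Actual cleanup comes later than the model suggests, because the server only learns of the failure once TCP reports the error, and PMON releases the session on its next housekeeping pass.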

For the official description of Dead Connection Detection, see the note "Dead Connection Detection (DCD) Explained (Doc ID 151972.1)"; part of it is excerpted below:

DEAD CONNECTION DETECTION
=========================

OVERVIEW
--------

Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including
Oracle Net8 and Oracle NET. DCD detects when a partner in a SQL*Net V2 client/server
or server/server connection has terminated unexpectedly, and flags the dead session
so PMON can release the resources associated with it.

DCD is intended primarily for environments in which clients power down their
systems without disconnecting from their Oracle sessions, a problem
characteristic of networks with PC clients.

DCD is initiated on the server when a connection is established. At this
time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an
alarm.  The timer interval is set by providing a non-zero value in minutes for
the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.

When the timer expires, SQL*Net on the server sends a "probe" packet to the
client. (In the case of a database link, the destination of the link
constitutes the server side of the connection.)  The probe is essentially an
empty SQL*Net packet and does not represent any form of SQL*Net level data,
but it creates data traffic on the underlying protocol.

If the client end of the connection is still active, the probe is discarded,
and the timer mechanism is reset.  If the client has terminated abnormally,
the server will receive an error from the send call issued for the probe, and
SQL*Net on the server will signal the operating system to release the
connection's resources.

On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or
$ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid.

It should be also be noted that in SQL*Net 2.1.x, an active orphan process
(one processing a query, for example) will not be killed until the query
completes. In SQL*Net 2.2, orphaned resources will be released regardless of
activity.

This is a server feature only.  The client may be running any supported
SQL*Net V2 release.

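How to Enable DCD

Enabling DCD (Dead Connection Detection) is very simple: just set SQLNET.EXPIRE_TIME in sqlnet.ora on the server. The client also has to support SQL*Net V2 or later. How to check and confirm that DCD is enabled is covered in detail in the official note "Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g", so it is not expanded on here.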
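As a quick sanity check on the configuration side, the setting can be looked up in the text of sqlnet.ora. The Python sketch below is an assumption-laden convenience, not the procedure from Note 395505.1: the function names are hypothetical, only the simple one-line form of the parameter is recognized, the first match is taken, and the sample file content is invented for the example. It shows what the file says, not what a running server process actually loaded.

```python
import re

# Matches a line such as "SQLNET.EXPIRE_TIME = 10" (case-insensitive).
_EXPIRE_RE = re.compile(r"^\s*SQLNET\.EXPIRE_TIME\s*=\s*(\d+)\s*$", re.IGNORECASE)


def read_expire_time(sqlnet_ora_text):
    """Return SQLNET.EXPIRE_TIME in minutes, or None if the file does not set it."""
    for line in sqlnet_ora_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comment line
        match = _EXPIRE_RE.match(line)
        if match:
            return int(match.group(1))
    return None


def dcd_configured(sqlnet_ora_text):
    """True when the file sets a non-zero SQLNET.EXPIRE_TIME."""
    minutes = read_expire_time(sqlnet_ora_text)
    return minutes is not None and minutes > 0


# Invented sample content, for illustration only.
SAMPLE = """\
# sqlnet.ora on the database server
NAMES.DIRECTORY_PATH = (TNSNAMES, EZCONNECT)
SQLNET.EXPIRE_TIME = 10
"""

if __name__ == "__main__":
    print(read_expire_time(SAMPLE), dcd_configured(SAMPLE))
```

On a real server the text would come from the sqlnet.ora under $TNS_ADMIN or $ORACLE_HOME/network/admin, the two locations named in the excerpt above.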

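DCD Problems and Anomalies

DCD has had quite a few bugs on certain versions and platforms; a search on Oracle Metalink turns up many of them. In addition, in my own testing with SQLNET.EXPIRE_TIME=5, the dead connections were cleaned up not after 5 minutes but after more than 20 minutes.

I searched through a lot of material and never got to the bottom of this; all I know is that it is related to TCP/IP's timeout-and-retransmission mechanism. Networking is my weak spot, and after several fruitless attempts I had to give up. In any case, the database does not reclaim a dead connection at exactly the time given by SQLNET.EXPIRE_TIME (the official note below states explicitly "not at the exact time of the DCD value"). The value can also be affected by firewalls; see the article on firewalls, DCD and TCP keep alive listed in the references.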

To answer common questions about Dead Connection Detection (DCD).

Common Questions about Dead Connection Detection
------------------------------------------------

Q: What is Dead Connection Detection?

A: Dead Connection Detection (DCD) allows SQL*Net/Net8 to identify connections
   that have been left hanging by the abnormal termination of a client. This
   feature minimizes the waste of resources by connections that are no longer
   valid.  It also automatically forces a rollback of uncommitted transactions
   and locks held by the user of the broken connection.

Q: How does Dead Connection Work?

A: On a connection with DCD enabled, a small probe packet is sent from server
   to client at a user defined interval (usually several minutes).  If the
   connection is invalid (usually due to the client process or machine being
   unreachable), the session is "flagged" as dead, and PMon cleans up that session
   when next doing housekeeping (not at the exact time of the DCD value).

   The DCD mechanism does NOT terminate any sessions (idle or active). It merely
   marks the "dead" session for deletion by PMon.

Q: How do you set the Dead Connection Detection feature?

A: DCD is enabled on the server side of the connection by defining a parameter
   in the sqlnet.ora file in $ORACLE_HOME/network/admin called
   SQLNET.EXPIRE_TIME. This parameter determines the time period between
   successive probe packets across a connection between client and server.

   SQLNET.EXPIRE_TIME= <# of minutes>

   The sqlnet.expire_time parameter is defined in minutes and can have any value
   between 1 and an infinite number.  If it is not defined in the sqlnet.ora
   file, DCD is not used.  A time of 10 minutes is probably optimum for most
   applications.

   DCD probe packets originate on the server side, so it must be enabled on the
   server side. If you define sqlnet.expire_time on the client side, it will be
   ignored.

Q: Will this work with the Oracle Multi-Threaded Server?

A: DCD will work and is very useful with Multi-Threaded Server (MTS)
   configurations. MTS alone does not solve the problem, as a client that is
   powered down when connected to a MTS will also leave a defunct connection
   within the MTS (at least until the underlying protocol detects the loss of
   the client, at which time it will inform MTS, which will then free the
   resources). The resources used per client with MTS are less than those used
   by dedicated server, however, so the net gain per connection within MTS is
   less than that with dedicated server. Having said that, DCD has a distinct
   advantage within MTS configurations - as each server process is managing
   multiple clients simultaneously, the DBA has no option of killing a single
   process as a result of the termination of a single client. DCD therefore
   increases database uptime by allowing resources to be managed more
   effectively.

Q: Can I use DCD on all of my connections over all protocols?

A: You can use DCD over all protocols except for APPC/LU6.2, which prevents DCD
   from working due to its half-duplex nature. It also does not work over
   bequeathed connections. You should carefully consider whether to use DCD
   before you use it, however, as it creates additional processing work on the
   server and can also increase network traffic. Furthermore, some protocols
   already implement a form of DCD already, so it may not necessarily be needed
   on all protocols.

Q: Are there any differences if I am using DCD on connections that go through
   the Oracle Multi-Protocol Interchange (MPI) or Connection Manager (CMAN)?

A: No. DCD works through MPI and CMAN in the same way as direct client/server.
   If your connection spans across half-duplex and full-duplex protocols (for
   example APPC/LU6.2 and TCP/IP), DCD will be disabled by the server.

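Benefits and Drawbacks of DCD

The benefits of DCD have largely been covered above, but DCD does have drawbacks. For example, it performs poorly on Windows (bug#303578); on SCO Unix it triggers a bug that consumes a lot of CPU; and DCD is much more resource-intensive than comparable mechanisms at the protocol level, so relying on DCD to clean up dead processes adds load to the server. Exiting applications cleanly always comes first. As the English text puts it: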

DCD is much more resource-intensive than similar mechanisms at the protocol level, so if you depend on DCD to clean up all dead processes, that will put an undue load on the server. Clearly it is advantageous to exit applications cleanly in the first place.

References:

http://www.laoxiong.net/firewall-dcd-and-tcp-keep-alive.html

Note.601605.1 A discussion of Dead Connection Detection, Resource Limits, V$SESSION, V$PROCESS and OS processes
Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g
Connections on Windows Platform Timout after 2 Hours, Why ? (Doc ID 1073461.1)
Concurrent Manager Functionality Not Working And PCP Failover Takes Long Inspite of Enabling DCD With Database Server (Doc ID 438921.1)
Common Questions About Dead Connection Detection (DCD) (Doc ID 1018160.6)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)
