程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
 程式師世界 >> 數據庫知識 >> Oracle數據庫 >> Oracle教程 >> ORACLE的Dead Connection Detection淺析,deaddetection

ORACLE的Dead Connection Detection淺析,deaddetection

編輯:Oracle教程

ORACLE的Dead Connection Detection淺析,deaddetection


    在復雜的應用環境下,我們經常會遇到一些非常復雜並且有意思的問題,例如,我們會遇到網絡異常(網絡掉包、無線網絡斷線)、客戶端程序異常(例如應用程序崩潰Crash)、操作系統藍屏、客戶端電腦掉電、死機重啟等異常情況,此時數據庫連接可能都沒有正常關閉(Colse)、事務都沒有提交,連接(connections)就斷開了。如果遇到這些情況,你未提交的一個事務在數據庫中是否會回滾? 如果回滾,什麼條件才會觸發回滾?需要多久才會觸發回滾(不是回滾需要多少時間)?如果是一個查詢呢,那麼情況又是怎麼樣呢?ORACLE數據庫是否提供某些機制來解決這些問題呢?如果這些問題你都能回答,那麼可以不用看下文了,在介紹理論知識之前,我們先通過構造測試案例,測試一下,畢竟實踐出真知,抽象的理論需要實驗來加深理解、全面詳細闡述。

我們首先來測試一下數據庫會話正常退出的情況吧,我在客戶端使用(SQL*Plus)連接到數據庫,執行一個UPDATE語句後不提交,然後退出(注意:實驗步驟是在服務器端查詢一些信息後才退出)。如下所示:

 

SQL> select * from v$mystat where rownum=1;
 
       SID STATISTIC#      VALUE
---------- ---------- ----------
       196          0          0
 
SQL> select sid,serial# from v$session where sid=196;
 
       SID    SERIAL#
---------- ----------
       196          9
 
SQL> update scott.dept set loc='CHICAGO' where deptno=40;
 
1 row updated.
 
SQL> exit    --在服務器查詢一些信息後才執行該命令

 

服務器端我們查看會話(196,9)的一些相關信息,如下所示:

SQL> set linesize 1200
SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;
 
       SID SECONDS_IN_WAIT EVENT
---------- ----------- -------------------------------------------------
       196              33 SQL*Net message from client
 
SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID 
  6         ,A.LOCKED_MODE
  7  FROM   V$LOCKED_OBJECT A, 
  8         V$SESSION B 
  9  WHERE  A.SESSION_ID = B.SID 
 10  ORDER  BY B.LOGON_TIME;
 
USERNAME              SID    SERIAL# LOGON_TIM  OBJECT_ID   LOCKED_MODE 
----------------- ---------- ---------- --------- ----------   -----------
TEST                   196          9 01-DEC-16      73199       3

 

 

從上面可以看到196會話對表SCOTT.DEPT持有鎖(Row-X 行獨占(RX)),對象ID為73199,然後我們在客戶端不提交UPDATE語句就執行exit命令退出會話後,然後在服務器端檢查會話是否回滾。如下所示,測試結果我們可以看到,正常exit後,會話會立即回滾。(pmon進程立即回收相關進程,回收資源)

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;
 
no rows selected
 
SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID 
  6  FROM   V$LOCKED_OBJECT A, 
  7         V$SESSION B 
  8  WHERE  A.SESSION_ID = B.SID 
  9  ORDER  BY B.LOGON_TIME;
 
no rows selected
 
SQL> 

 

接下來,我們來構造網絡異常的案例(需要多台機器或虛擬機),如下所示,我們首先在虛擬機上使用SQL*Plus連接到服務器端(賬號為test,另外服務器上sqlnet.ora 不要設置SQLNET.EXPIRE_TIME參數,不啟用DCD,後面介紹至這個),然後執行一個UPATE語句不提交

SQL> show user;
USER is "TEST"
SQL> select * from v$mystat where rownum =1;
 
       SID STATISTIC#      VALUE
---------- ---------- ----------
       914          0          1
 
SQL> select sid,serial# from v$session where sid=914;
 
       SID    SERIAL#
---------- ----------
       914       3944
 
SQL> update scott.emp set sal=8000 where empno=7369;
 
1 row updated.
 
SQL> 

 

然後我們斷開虛擬機的網絡,構造網絡異常案例(在客戶端機器上執行service network stop命令斷開網絡,我們在服務器端使用SQL*Plus查看會話(914,3944)的情況,如下所示

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;
 
       SID SECONDS_IN_WAIT EVENT
---------- --------------- ----------------------------------------------------------------
       914              93 SQL*Net message from client
 
SQL>  SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID 
  6  FROM   V$LOCKED_OBJECT A, 
  7         V$SESSION B 
  8  WHERE  A.SESSION_ID = B.SID 
  9  ORDER  BY B.LOGON_TIME;
 
USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460
 
SQL> 

 

我們繼續執行上面語句,你會看到看到會話914一直是INACTIVE,對表一直持有Row-X 行獨占(RX),而且seconds_in_wait也一直在增長

 

SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;
 
       SID SECONDS_IN_WAIT EVENT
---------- --------------- -----------------------------------------------------
       914            4928 SQL*Net message from client
 
SQL>  SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID 
  6  FROM   V$LOCKED_OBJECT A, 
  7         V$SESSION B 
  8  WHERE  A.SESSION_ID = B.SID 
  9  ORDER  BY B.LOGON_TIME;
 
USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460
 
SQL>  select sid, seconds_in_wait, event from v$session_wait where sid=914;
 
       SID SECONDS_IN_WAIT EVENT
---------- --------------- ------------------------------------------------
       914            5853 SQL*Net message from client
 
SQL> SELECT B.USERNAME
  2         ,B.SID
  3         ,B.SERIAL#
  4         ,LOGON_TIME
  5         ,A.OBJECT_ID 
  6  FROM   V$LOCKED_OBJECT A, 
  7         V$SESSION B 
  8  WHERE  A.SESSION_ID = B.SID 
  9  ORDER  BY B.LOGON_TIME;
 
USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
------------------------------ ---------- ---------- --------- ----------
TEST                                  914       3944 01-DEC-16     782460
 
SQL> 

 

最後一直等待pmon進程回收資源,經過多次測試,發現這個時間都是在7860多秒後才會被PMON進程回收資源。

 

 

那麼這個是否有一個固定的值?這個值是否有規律呢?我構造了這樣一個腳本在服務器端運行(根據實際情況修改sid, serial#的值),測試數據庫需要耗費多久時間,PMON進程才會回收進程,釋放資源,回滾事務。

CREATE TABLE TEST.SESSION_WAIT_RECORD
AS
SELECT sid, 
       seconds_in_wait, 
       event,
   sysdate as curr_datetime
FROM   v$session_wait 
       where 1=0;
        
 
CREATE TABLE TEST.LOCK_OBJECT_RECORD AS 
        SELECT B.username, 
               B.sid, 
               B.serial#, 
               logon_time, 
               A.object_id ,
               sysdate as curr_datetime
        FROM   v$locked_object A, 
               v$session B 
        WHERE  A.session_id = B.sid 
       AND 1=0;
 
 
 
 
DECLARE 
    v_index NUMBER := 1; 
BEGIN 
    WHILE v_index != 0 LOOP 
    
          INSERT INTO SESSION_WAIT_RECORD
        SELECT sid, 
               seconds_in_wait, 
               event ,
         sysdate
        FROM   v$session_wait 
        WHERE  sid = 916; 
 
                INSERT INTO LOCK_OBJECT_RECORD
        SELECT B.username, 
                   B.sid, 
               B.serial#, 
               logon_time, 
               A.object_id ,
               sysdate
        FROM   v$locked_object A, 
               v$session B 
        WHERE  A.session_id = B.sid 
                AND A.session_id=916 AND B.serial#=415
        ORDER  BY B.logon_time; 
        
        commit;
 
        dbms_lock.Sleep(10); 
 
        SELECT Count(*) 
        INTO   v_index 
        FROM   v$session_wait 
        WHERE  sid = 916; 
    END LOOP; 
END; 

 

1:多次的測試結果是否一直一致?這個值是否有什麼規律?

從我的幾次測試結果來看(當然沒有大量測試和考慮各種場景),幾次測試的結果如下(查詢SESSION_WAIT_RECORD表),基本上都是在7872~7876. 由於上面SQL會休眠10秒,所以可以推斷數據庫會在一個固定的時間後清理斷開的會話。

測試實驗1

測試實驗2:

 

測試實驗3:

將上面腳本休眠的時間改為2秒,避免修改時間過長引起的誤差,測試結果是7876

 

看似結果有點不一致,其實是因為誤差,因為腳本裡面休眠的時間(實驗1、2的休眠時間為10秒,實驗3改為2秒),以及其他方面的一些誤差導致,規律就是這個跟Linux系統的TCP keepalive有關系,我們先來看看TCP keepalive概念,如下

 

The keepalive concept is very simple: when you set up a TCP connection, you associate a set of timers. Some of these timers deal with the keepalive procedure. When the keepalive timer reaches zero, you send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.

 

顧名思義,TCP keepalive它是用來保存TCP連接的,注意它只適用於TCP連接。系統會替你維護一個timer,時間到了,就會向remote peer發送一個probe package,當然裡面是沒有數據的,對方就會返回一個應答,這時你就知道這個通道保持正常。與TCP keepalive有關的三個參數tcp_keepalive_time、tcp_keepalive_intvl、tcp_keepalive_probes

[root@myln01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time

7200

[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl

75

[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes

9

[root@getlnx01uat ~]#

 

/proc/sys/net/ipv4/tcp_keepalive_time    當keepalive起用的時候,TCP發送keepalive消息的頻度。默認是2小時。

/proc/sys/net/ipv4/tcp_keepalive_intvl   當探測沒有確認時,keepalive探測包的發送間隔。缺省是75秒。

/proc/sys/net/ipv4/tcp_keepalive_probes  如果對方不予應答,keepalive探測包的發送次數。缺省值是9。

 

那麼在Oracle沒有啟用DCD時,系統和數據庫如何判斷一個連接是否異常,需要關閉呢?這個時間是這樣計算的,首先它等待了7200,然後每隔75發送探測包,一共發送了9次後(7200+ 75*9 = 7875 ),都沒有收到客戶端應答,那麼它就判斷這個連接死掉了,可以關閉了。 這個時候PMON進程就會回收與它相關的所有資源(例如回滾事務,釋放lock、latch、memory)。這個值與我測試的時間非常接近了(考慮我們是采集的等待時間,以及測試腳本裡面有休眠時間,這樣采集的數據有些許偏差)。

 

2: 如果是一個查詢操作呢?結果又是什麼情況。

  如果是查詢操作,結果依然是如此,有興趣的可以自行測試。

 

3:這個是否跟專用連接服務器模式與共享連接服務器模式有關?

測試結果發現,專用連接服務器模式與共享連接服務器模式都一樣。只是跟Linux的系統內核參數tcp_keepalive_time、tcp_keepalive_intvl、tcp_keepalive_probes有關系。

那麼問題來了,如果會話持續持有Row-X 行獨占(RX)長達7875秒,那麼很有可能導致系統出現一些性能問題,重要的系統裡面這個是不可接受的,好了,現在回到我們討論的正題,ORACLE是怎麼處理這些問題的?它應該有一套機制來解決這個問題,否則它也太弱了。 其實ORACLE提供了DCD(Dead Connection Detection 即死連接檢測)機制來解決這個問題,下面來介紹這個:

 

Dead Connection Detection概念

DCD是Dead Connection Detection的縮寫,用於檢查那些死掉但沒有斷開的session。它的具體原理如下所示:

當一個新的數據庫連接建立後,SQL*Net讀取參數文件的SQLNET.EXPIRE_TIME設置(如果設置了的話),在服務端初始化DCD,DCD會為這個連接創建一個定時器,當該定時器超過SQLNET.EXPIRE_TIME指定時間間隔後,就會向客戶端發送一個probe package(偵測包),該包實質上是一個空的SQL*NET包,不包括任何有用數據,它僅在底層協議上創建了數據流。如果此時客戶端連接還是正常的話,那麼這個probe package就會被客戶端直接丟棄,然後Oracle服務器就會把該連接對應的定時器重新復位。如果客戶異常退出的話,偵測包由客戶端的IP層交到TCP層時,就會發現原先的連接已經不存在了,然後TCP層就會返回錯誤信息,該信息被ORACLE服務端接收到後,ORACLE就會知道該連接已經不可用了,於是SQL*NET就會向操作系統發送消息,釋放該連接的相關資源。

官方文檔關於Dead Connection Detection的介紹請參考文檔“Dead Connection Detection (DCD) Explained (文檔 ID 151972.1)”,摘抄部分如下所示

                      DEAD CONNECTION DETECTION 
                        =========================

OVERVIEW 
-------- 
 
Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including
Oracle Net8 and Oracle NET. DCD detects when a partner in a SQL*Net V2 client/server
or server/server connection has terminated unexpectedly, and flags the dead session
so PMON can release the resources associated with it.
 
DCD is intended primarily for environments in which clients power down their 
systems without disconnecting from their Oracle sessions, a problem
characteristic of networks with PC clients.

DCD is initiated on the server when a connection is established. At this 
time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an 
alarm.  The timer interval is set by providing a non-zero value in minutes for 
the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.

When the timer expires, SQL*Net on the server sends a "probe" packet to the 
client. (In the case of a database link, the destination of the link
constitutes the server side of the connection.)  The probe is essentially an 
empty SQL*Net packet and does not represent any form of SQL*Net level data, 
but it creates data traffic on the underlying protocol. 
 
If the client end of the connection is still active, the probe is discarded, 
and the timer mechanism is reset.  If the client has terminated abnormally, 
the server will receive an error from the send call issued for the probe, and 
SQL*Net on the server will signal the operating system to release the 
connection's resources. 
 
On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or 
$ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid. 
 
It should be also be noted that in SQL*Net 2.1.x, an active orphan process 
(one processing a query, for example) will not be killed until the query 
completes. In SQL*Net 2.2, orphaned resources will be released regardless of 
activity.

This is a server feature only.  The client may be running any supported 
SQL*Net V2 release.
 

如何開啟/啟用DCD

 

開啟DCD(Dead Connection Detection)非常簡單,只需要在服務器端的sqlnet.ora裡面設置SQLNET.EXPIRE_TIME即可,當然客戶端也需要支持SQL*Net V2以及後面版本。如何檢查、確認是否開啟了DCD,官方文檔有詳細介紹:Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g。 此處不做展開。

 

DCD的問題與異常

 

DCD在一些版本和平台還是有蠻多Bug的,你在Oracle Metalink上搜索一下,都能查到很多,另外我在測試過程中,設置SQLNET.EXPIRE_TIME=5,測試發現,清理這些Dead Connection的時間不是5分鐘,而是20多分鐘,

搜索了大量資料,也沒有完全徹底弄清楚這個問題,只是知道這個跟TCP/IP有超時重傳機制有關系,網絡知識是我的薄弱項啊(嘗試了多次無果後,只能放棄),當然,數據庫回收Dead Connection也不會完全跟SQLNET.EXPIRE_TIME指定的時間一致的(例如,下面官方文檔就明確指出not at the exact time of the DCD value)。 另外這個值還有可能被防火牆影響,可以參考防火牆、DCD與TCP Keep alive這篇文章。

To answer common questions about Dead Connection Detection (DCD). 


Common Questions about Dead Connection Detection
------------------------------------------------

Q: What is Dead Connection Detection?

A: Dead Connection Detection (DCD) allows SQL*Net/Net8 to identify connections
   that have been left hanging by the abnormal termination of a client. This
   feature minimizes the waste of resources by connections that are no longer
   valid.  It also automatically forces a rollback of uncommitted transactions
   and locks held by the user of the broken connection.


Q: How does Dead Connection Work?

A: On a connection with DCD enabled, a small probe packet is sent from server
   to client at a user defined interval (usually several minutes).  If the
   connection is invalid (usually due to the client process or machine being
   unreachable), the session is "flagged" as dead, and PMon cleans up that session
   when next doing housekeeping (not at the exact time of the DCD value).
   The DCD mechanism does NOT terminate any sessions (idle or active). It merely
   marks the "dead" session for deletion by PMon.

Q: How do you set the Dead Connection Detection feature?

A: DCD is enabled on the server side of the connection by defining a parameter
   in the sqlnet.ora file in $ORACLE_HOME/network/admin called
   SQLNET.EXPIRE_TIME. This parameter determines the time period between
   successive probe packets across a connection between client and server.

   SQLNET.EXPIRE_TIME= <# of minutes>
 
   The sqlnet.expire_time parameter is defined in minutes and can have any value
   between 1 and an infinite number.  If it is not defined in the sqlnet.ora
   file, DCD is not used.  A time of 10 minutes is probably optimum for most
   applications. 
 
   DCD probe packets originate on the server side, so it must be enabled on the
   server side. If you define sqlnet.expire_time on the client side, it will be
   ignored.


Q: Will this work with the Oracle Multi-Threaded Server?

A: DCD will work and is very useful with Multi-Threaded Server (MTS)
   configurations. MTS alone does not solve the problem, as a client that is
   powered down when connected to a MTS will also leave a defunct connection
   within the MTS (at least until the underlying protocol detects the loss of
   the client, at which time it will inform MTS, which will then free the
   resources). The resources used per client with MTS are less than those used
   by dedicated server, however, so the net gain per connection within MTS is
   less than that with dedicated server. Having said that, DCD has a distinct
   advantage within MTS configurations - as each server process is managing
   multiple clients simultaneously, the DBA has no option of killing a single
   process as a result of the termination of a single client. DCD therefore
   increases database uptime by allowing resources to be managed more
   effectively.

Q: Can I use DCD on all of my connections over all protocols? 
 
A: You can use DCD over all protocols except for APPC/LU6.2, which prevents DCD
   from working due to its half-duplex nature. It also does not work over
   bequeathed connections. You should carefully consider whether to use DCD
   before you use it, however, as it creates additional processing work on the
   server and can also increase network traffic. Furthermore, some protocols
   already implement a form of DCD already, so it may not necessarily be needed
   on all protocols.

Q: Are there any differences if I am using DCD on connections that go through
   the Oracle Multi-Protocol Interchange (MPI) or Connection Manager (CMAN)?
 
A: No. DCD works through MPI and CMAN in the same way as direct client/server.
   If your connection spans across half-duplex and full-duplex protocols (for
   example APPC/LU6.2 and TCP/IP), DCD will be disabled by the server.

 

DCD的好處與弊端

 

其實,DCD的好處上面已經基本闡述清楚了,其實DCD還是有一些弊端的。例如,在Windows平台性能很差(bug#303578);在SCO Unix下它會觸發Bug,消耗大量CPU資源; DCD 在協議層是很消耗資源的, 所以如果要用DCD來清除死進程, 會加重系統的負擔, 任何時候, 干淨的退出系統,這是首要的. 如下英文所述:

DCD is much more resource-intensive than similar mechanisms at the protocol level, so if you depend on DCD to clean up all dead processes, that will put

an undue load on the server. Clearly it is advantageous to exit applications cleanly in the first place.

 

參考資料:

http://www.laoxiong.net/firewall-dcd-and-tcp-keep-alive.html

Note.601605.1 A discussion of Dead Connection Detection, Resource Limits, V$SESSION, V$PROCESS and OS processes:
Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g:
Connections on Windows Platform Timout after 2 Hours, Why ? (文檔 ID 1073461.1)
Concurrent Manager Functionality Not Working And PCP Failover Takes Long Inspite of Enabling DCD With Database Server (文檔 ID 438921.1)
Common Questions About Dead Connection Detection (DCD) (文檔 ID 1018160.6)
Dead Connection Detection (DCD) Explained (文檔 ID 151972.1)

  1. 上一頁:
  2. 下一頁:
Copyright © 程式師世界 All Rights Reserved