程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
 程式師世界 >> 數據庫知識 >> Oracle數據庫 >> Oracle教程 >> ParallelQuery導致的ORA-04031

ParallelQuery導致的ORA-04031

編輯:Oracle教程

ParallelQuery導致的ORA-04031


一個朋友遇到ORA-04031問題。雖然這個錯誤是非常常見的,然而這裡的Case 也有點讓人為之震驚!

Tue Aug 26 11:51:13 2014
Errors in file /oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_p485_28873.trc  (incident=1589637)
ORA-04031: 無法分配 32792 字節的共享內存 ("shared pool","unknown object","sga heap(2,0)","PX msg pool")
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Aug 26 11:51:15 2014
。。。。。。。。
DDE: Problem Key 'ORA 4031' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
Tue Aug 26 11:57:12 2014
PMON failed to acquire latch, see PMON dump
Tue Aug 26 11:57:15 2014
DIAG (ospid: 14345) has not called a wait for 49 secs.
Tue Aug 26 11:57:58 2014
Errors in file /oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_qmnc_15686.trc  (incident=1608420)
ORA-00445: background process "q00T" did not start after 120 seconds
Tue Aug 26 11:58:17 2014
。。。。。。。。
Tue Aug 26 12:07:31 2014
WARNING: inbound connection timed out (ORA-3136)
Tue Aug 26 12:07:31 2014
WARNING: inbound connection timed out (ORA-3136)
Tue Aug 26 12:07:46 2014
PMON failed to acquire latch, see PMON dump

上述信息為Rac 節點1的信息,對於節點2,信息也基本類似,這裡就不貼了。從日志來看,是p621出異常,根據這一點
我們可以判斷為是並行進程出現了異常,接著從下面的ORA-04031錯誤也可以證實這一點。
ORA-04031: 無法分配 32792 字節的共享內存 (“shared pool”,”unknown object”,”sga heap(2,0)”,”PX msg pool”)

從這個錯誤來看,可以判斷P621進程無法分別32792 byte的內存了。這部分內存是需要為Px msg pool進程分配。

這裡首先說說PX msg pool是干什麼的?

Px msg pool其實是用於Parallel 操作的。通常情況下這部分內存區域應該是從large pool中分配。然而很多業務系統中
由於設置不到,導致parallel 操作仍然使用shared pool的memory。這一點,我們稍後討論。

接著我們看後面的日子發現PMON進程都無法獲得latch了,然後這個時候系統已經不行了。 很明顯這個問題的根源就是前面
的ORA-04031錯誤,這裡我們需要分析下trace,來確認下,為什麼進程會報錯。

該trace文件(xx1_p485_28873.trc)的內容如下:

Allocation Request Summary Informaton
=====================================
Current information setting:  04014fff
  SGA Heap Dump Interval=3600 seconds
  Dump Interval=300 seconds
  Last Dump Time=08/26/2014 11:51:12
  Dump Count=1
Allocation request for:      PX msg pool
  Heap: c00000012fe3ae08, size: 32792
******************************************************
HEAP DUMP heap name="sga heap(2,0)"  desc=c00000012fe3ae08
 extent sz=0xfe0 alt=248 het=32767 rec=9 flg=-126 opc=0
 parent=0000000000000000 owner=0000000000000000 nex=0000000000000000 xsz=0x10000000 heap=0000000000000000
 fl2=0x20, nex=0000000000000000, dsxvers=1, dsxflg=0x0
 dsx first ext=0x30000000
 latch set 2 of 2
 durations disabled for this heap
 reserved granules for root 0 (granule size 268435456)

可以看到,PX msg pool請求獲得32792 byte的內存大小,從什麼地方申請呢? heap 2,也就是shared pool的第2個subpool。
但是卻無法分配這麼多,以至於報錯,下面我們來看下subpool 2的實際情況:

Memory Utilization of Subpool 2
================================
      Allocation Name            Size
___________________________  ____________
"free memory              "    1431685456         ++++整個subpool 2 還有1.4G free memory左右
"miscellaneous            "          1680
"vem_user_actlog1         "             0
"SWEEPERR                 "             0
"File Space Usage         "        112144
"trace_knlasg             "             0
"ges big msg buffers      "      56801208
"dsktab_kfgsg             "         57128
"hot latch diagnostics    "             0
"gcs res hash bucket      "       4194304
"Wait event pointers      "             0
"gcs res latch table      "             0
"file state object        "       4698096
"vips_package_size        "             0
"SEQ S.O.                 "       1057920
。。。。
"gcs resources            "     223317648
"gc lock state            "             0
"kglsim alloc latch area  "             0
"gcs affinity             "             0
"gcs opaque in            "         49248
"gcs pnp histo            "             0
"ges resource             "     605005184
"knlp free small rcv buf  "             0
"SQLA                     "        280160
"PRTI                     "             0
"SQLP                     "         53608
"SQLK                     "         28336
"gcs dynamic r            "       2556288
"work area tab            "        236640
"plwpil:wa                "             0
"kebm run-once actions    "             0
"grptab_kfgsg             "             0
"AW SGA                   "             0
"dbktb: trace buffer      "      53673984
"v_swperrcount            "             0
"ksunfy : SSO free list   "      27834240
"ksxp IPC state object    "        346448
"ksdhng: cache history    "             0
"DFW_CONFIG_CAPTURE       "             0
"enqueue resources        "        862232
"kzekm heap descriptor    "             0
"kelt translation table   "             0
"IPS_PACKAGE_UNPACK_HISTOR"             0
"kglsim hash table bkts   "       2097152
"KSXR global channels     "             0
"ges enqueues             "     344980144
"trace buf hdr xtend      "          2080
"KSXR receive buffers     "       1061888
"kglsim Java timestamps   "        131072
"VM OSD context           "             0
"KCB Table Scan Buffer    "       4194816
"knlp receive buffers     "             0
"PLS chunk                "             0
"knlu_init_btree:init     "             0
"KEWS aggregation objs    "             0
"DDE_USER_ACTION_DEF      "             0
"keswxNotify:tabPlans     "        131072
"EM_USER_ACTIVITY         "             0
"KSXR large reply queue   "        168296
"event-class map          "             0
"KSQ event description    "             0
"KSXP CONN Stats buffer   "        728432
"KSXP CTX Stats buffer    "        578128
"KSXP PORT Stats buffer   "        142768
"kglsim alloc latches     "             0
"kglsim used list arr     "             0
"kglsim free obj list     "             0
"KTCN global operators    "             0
"KESTB existence bitvec   "        131072
"KFD extent enqueue obj   "         42792
"ksdhng: el wtr cache     "        401280
"ksb cic process list     "             0
"gcs shadows              "     130592800
"value block f            "         43832
"ksdhng: blkers cache     "         72960
"FileOpenBlock            "     255011680
"qmtb_init_xxta           "             0
"sort segment handle      "         54936
"latchnum to latch map    "             0
"LISTEN ADDRESS ENTRY     "             0
"vips_pkg_file            "             0
"PX msg pool              "      47109312      +++++ 當前PX msg pool已經分配了47M左右
。。。。。。
"Checkpoint queue         "     116765184
"sim segment hits         "         72960
"sim lru segments         "         72960
"pesom.c:Global           "             0
"peshm.c:Global           "             0
"paralltp txn r           "      22476896
"ASH buffers              "     119537664
"ASM KFFD SO              "             0
"statement queuing        "             0
"v_actinc                 "             0
"message pool freequeue   "        523240
"ADR_CONTROL              "             0
"vshowincb                "             0
"vproblem2                "             0
"PMON blockers            "             0
"procs: ksunfy            "       6408000
"sga listelement          "             0
"primem_kfmdsg            "             0
"agent name               "             0
"SYSTEM PARAMETERS        "        232944
"Online xxtafile Move sess"             0
"sched job queue          "             0
"vproblem                 "             0
"object queue hash buckets"       8404992
"object queue hash table d"        229824
"obj stats allocation chun"       1290240
"qmnsf_heap               "             0
"sim cache nbufs          "             0
"qmu subheap              "         53976
"MTTR advisory context    "             0
"LGWR-network Server info "         34032
"LGWR debug module memory "             0
"logout storm management  "             0
"log_simultaneous_copies  "         49248
"sys event stats          "        552960
"parameter handle         "       4273688
"Parameter Handle         "             0
"name-service recovery    "         80776
"channel handle           "         48360
"parameter blocks         "         22016
"qmrcInitSGA:qmrcSGA      "             0
"PARAMETER ENTRY          "             0
"ASM map operations       "         30552
"Security Class           "         33280
"GCR                      "             0
"post/wait queues         "         55872
"bloom filter             "             0
"Auto tune undo info      "             0
"pesldT03_Instance        "         33848
"cp pool array            "             0
"cp cmon array            "             0
"UNDO INFO                "             0
"Log Archive Config Contex"             0
"state objects            "     107440104
"ArchLog Dest Array       "         22784
"parallel kcbibr          "             0
"archive_lag_target       "             0
"poisoned memory          "         32768
"name-service request     "             0
"latch recovery structures"             0
"latches for protecting fs"             0
"latches for protecting ls"             0
"Label Cache Heap descrptr"             0
"cp connection            "             0
"sched job slv            "             0
"bt_subentry              "             0
"Saved job ids            "             0
"object level stat table  "             0
"pso tbs: ksunfy          "        780000
"dummy                    "             0
"qm_init_sga:qmdpsg       "             0
"analytic workspace       "             0
"sim trace buf            "             0
"qm_init_sga:oidctx       "             0
"Sort Segment             "      16852160
"PRESENTATION ENTRY       "             0
"cp srv array             "             0
"recov_kgqbtctx           "         26512
"dbkea sga                "             0
"pesom.c:latch array      "             0
"Banner Storage           "             0
"qmps connections         "         65280
"quiescing session        "             0
"qm_init_uga:qmsg         "             0
"enqueue                  "             0
"buffer handles           "       5472000
"process group array      "         69776
"DML lock                 "       7066304
"property service SO      "             0
"broker globals           "             0
"sga node map             "             0
"prmtzdini tz region      "        801416
"VIEW                     "             0
"savepoints               "             0
"Monitor Modification Hash"             0
"RESUMABLE NAME           "             0
"obj stats hash table     "             0
"max allowable # log files"        506400
"OS proc request holder   "        293424
"message pool context area"         22400
"media recovery state obje"             0
"db_files                 "       1783968
"base-kglhdusr            "       1218016
"constraints              "             0
"ASM scan context         "             0
"PX server msg stats      "             0
"PX server deq stats      "             0
"cinfo_kfnsg              "             0
"cco_subnode              "             0
"cco_pubnode              "             0
"KQR X PO                 "      50414728
"KQR L PO                 "       5977688
"parameter table block    "      31158248
"KQR M SO                 "        192808
"KQR L SO                 "        128104
"KQR S SO                 "         90928
"KSFV SGA                 "             0
"KSIR SGA                 "             0
"ASM rollback operations  "             0
"kccsgehtt                "             0
"kglsgivl2                "         28160
"plugin xxtafile array    "         36016
"plwxx:PLW_STR_NEW_RVAL   "             0
"plwspv:PLW_STR_NEW_VAL   "             0
"lm heartbeat             "             0
"KGSK sga                 "             0
"BRANCH TABLE SEGMENTED AR"        105288
"ksv slave                "             0
"RVWR post requested array"             0
"parameter value memory   "             0
"RTA Bounxxry Info thread "         71808
"RTA read control file ind"             0
"k2q_lock                 "       4606648
"ASM extent pointer array "     318197080
。。。。。。。
"kglsim object batch      "     338733696
"ktlbk state objects      "       5417280
。。。。。
"kglsim heap              "     197796864
"ksz parent               "        204960
"kgnfs authp              "         16464
"event statistics per sess"      29840640
"event statistics ptr arra"         36480
"event list to post commit"             0
"KFG SO child             "             0
"KGKP randnum             "         40000
Library Cache Dump

可以看到gcs/ges resource/enqueue這部分內存都消耗了1.4g 左右,還是比較大的,另外ASM extent pointer array
也有300多m。這個可能跟asm的extent分配有關系,對於DW緩解,建議不要使用默認的uniform AU 1m,太小了。

總體來看,從這個subpool的使用情況來看,無法看出是不是碎片嚴重了。但是從dump下面的library 數據來看,也是
能看出一些端倪的,如下:

LIBRARY CACHE STATISTICS:
namespace           gets hit ratio      pins hit ratio    reloads   invalids
-------------- --------- --------- --------- --------- ---------- ----------
SQL AREA       325150175     0.984 2560976170     0.992    2656085    3942426
TABLE/PROCEDURE  73717472     0.996 786560445     0.998     481299       1362
BODY             11738443     1.000  86301250     1.000        521          0
TRIGGER           3247968     1.000   3248823     0.999        153          2
INDEX              576294     0.993    577639     0.974       8448          0
........

大家可以看到,SQL area 這裡的reloads和invalids太高了。說明這個時間點shared pool的可用內存可能不多。雖然
從前面的dump看free memory還不少。 我們需要明白一點,Oracle消耗shared pool的內存,是需要連續的一段內存區域。

從這裡來看如果如不是shared pool碎片嚴重那麼可能就是有大的SQL操作。我們看該進程的信息發現該進程一直在等待

某個latch Child parallel query alloc buffer,如下:

SO: 0xc000001842ba9750, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0xc000001842ba9750, name=process, file=ksu.h LINE:12616 ID:, pg=0
(process) Oracle pid:664, ser:2, calls cur/top: 0xc0000015e1989c10/0xc000001711b743a8
          flags : (0x0) -
          flags2: (0x0),  flags3: (0x10)
          intr error: -2147473260, call error: 0, sess error: 0, txn error 0
          intr queue: 2147494036
  ksudlp FALSE at location: 0
  (post info) last post received: 2147494036 136 12
              last post received-location: ksu.h LINE:13927 ID:ksusig
              last process to post me: c000001832b0b620 242 0
              last post sent: 0 0 356
              last post sent-location: kxfp.h LINE:4561 ID:kxfprienq: slave
              last process posted by me: c000001832b0b620 242 0
  (latch info) wait_event=0 bits=50
    holding    (efd=10) c000001804617de8 Child parallel query alloc buffer level=6 child#=23
        Location from where latch is held: kxfpb.h LINE:496 ID:kxfpbalo: allocate a buffer
        Context saved from call: 13835058152010268280
        state=busy [holder orapid=664] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         746 (456, 1409025072, 2)
         1068 (456, 1409025072, 3)
         747 (456, 1409025072, 2)
         857 (456, 1409025072, 1)
         1067 (456, 1409025072, 3)
         748 (456, 1409025072, 1)
         983 (434, 1409025072, 4)
         1070 (384, 1409025072, 3)
         749 (384, 1409025072, 1)
         1069 (384, 1409025072, 2)
         912 (346, 1409025072, 4)
         662 (323, 1409025072, 3)
         871 (308, 1409025072, 4)
         906 (270, 1409025072, 4)
         735 (209, 1409025072, 4)
         786 (157, 1409025072, 1)
         972 (74, 1409025072, 1)
         waiter count=17
    holding    (efd=10) c00000168571bdd0 Child process queue reference level=4 child#=10989
        Location from where latch is held: kxfp.h LINE:4438 ID:kxfprialo2: process qref child init
        Context saved from call: 13835058150030153416
        state=busy [holder orapid=664] wlstate=free [value=0]
  Process Group: DEFAULT, pseudo proc: 0xc0000018330dea98
  O/S info: user: oracle, term: UNKNOWN, ospid: 28873
  OSD pid info: Unix process pid: 28873, image: oracle@xx01 (P485)
PSO child state object changes :

這個latch Child parallel query alloc buffer 也就是在進行Px msg pool 內存分配的時候需要獲得的,當內存分配完畢之後latch會立刻
釋放,通常情況下,Oracle 對於Latch的申請和釋放是非常之快的。很明顯這裡這個進程由於申請內存失敗,導致latch也沒有釋放,還阻塞
了一堆的會話。

由於朋友這裡沒有其他的信息,所以很難准確的定位到是不是碎片或者其他的問題。但是從trace 裡面也不難發現一些信息。前面提到進程
申請內存,勢必就要看看下這個會話是什麼操作了,不看不知道,一看嚇一跳,如下:

MERGE INTO xx.xxxx_A A
USING (SELECT /*+ parallel(a,512) parallel(b,8)  */
        :B2 STATIS_MONTH,
        NVL(B.PROV_ID, '000') PROV_ID,
        COUNT(DISTINCT A.SERV_NUM) CNT
         FROM (SELECT DISTINCT SERV_NUM
                 FROM xxx.TDW_XXXXX_MIGU_MEMBER_M A
                WHERE STATIS_MONTH = :B1
                  AND MEMBER_LVL = '1'
                  AND MEMBER_STAT IN ('1', '0')
                  AND EXISTS (SELECT /*+ parallel(b,64)*/
                        1
                         FROM xxx.TKR_DEV_XXXXX_CORE_M B
                        WHERE A.SERV_NUM = B.SERV_NUM)) A,
              xxx.VDW_NUMBER_SEGMENT B
        WHERE SUBSTR(A.SERV_NUM, 1, 7) = B.SEGMENT(+)
        GROUP BY B.PROV_ID) B
ON (A.STATIS_MONTH = B.STATIS_MONTH AND A.PROV_ID = B.PROV_ID)
WHEN MATCHED THEN
  UPxxTE SET CORE_NRML = CNT

大家可以看到,這個sql居然開了512個並行,如果把幾個表的並行都加起來,高達584個。問題應用這裡還不止這一個SQL。

對於並行查詢,我們知道,即使你的並行度設置的再高,也不一定能用這麼多,因為這是受限制數據庫參數的。不過,悲劇的是,
他這裡默認參數都是比較大的。cpu_count=114,默認的dop都高達228。parallel_max_server更是高達好幾千了。

還有一點,他的參數parallel_force_local是ture,這也導致parallel進程只能在單個實例進行分配。

其實上述信息都不是重要的,更重要也是更致命的一點的是,並行查詢使用的Px msg pool居然從shared pool了進行分配了。

我看信息,他這裡的large pool設置高達10gb啊,其實是完全沒用派上用場。

對於Px msg pool,如果你的數據庫PARALLEL_MIN_SERVERS比較大(這裡就是比較大),那麼當數據庫啟動時候,可能Px msg pool
就會消耗不少的內存的。

對於Parallel操作的內存消耗Px memory,可以從large pool分配,也可以從shared pool進行分配。這裡需要注意,即使你設置了
large pool,如果不滿足下面的條件,那麼Px memory仍然會從shared pool去分配。

1) parallel_automatic_tuning 參數設置為auto
2) _PX_use_large_pool 隱含參數設置為true
3) 使用ASMM ,即使設置了sga_target 或 memory_target.

很遺憾的是,他這裡3點都不滿足。

  1. 上一頁:
  2. 下一頁:
Copyright © 程式師世界 All Rights Reserved