程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
 程式師世界 >> 數據庫知識 >> Oracle數據庫 >> Oracle教程 >> oom_kill_process造成數據庫掛起並出現found dead shared server,oomkillprocess

oom_kill_process造成數據庫掛起並出現found dead shared server,oomkillprocess

編輯:Oracle教程

oom_kill_process造成數據庫掛起並出現found dead shared server,oomkillprocess


這篇博客是上一篇博客Oracle shutdown immediate遭遇ORA-24324 ORA-24323 ORA-01089的延伸(數據庫掛起hang時,才去重啟的),其實這是我們海外一工廠的遇到的案例,把內容拆開是因為這個case分開講述顯得主題明確一些。正式進入主題:

 

服務器數據庫版本Oracle Database 10g Release 10.2.0.4.0,操作系統為Red Hat Enterprise Linux Server release 5.7,虛擬機。當時告警日志裡面出現大量的found dead shared server這裡信息。數據庫也出現連接不上的情況

 

found dead shared server 'S016', pid = (35, 23)
found dead shared server 'S023', pid = (42, 1)
Fri Aug  5 10:28:48 2016
found dead shared server 'S013', pid = (32, 110)
found dead shared server 'S021', pid = (40, 1)
Fri Aug  5 10:33:53 2016
found dead shared server 'S012', pid = (31, 132)
found dead shared server 'S023', pid = (38, 3)
Fri Aug  5 10:38:55 2016
found dead shared server 'S013', pid = (32, 111)
found dead shared server 'S022', pid = (42, 3)
Fri Aug  5 10:40:53 2016
found dead shared server 'S020', pid = (39, 4)
found dead shared server 'S021', pid = (40, 3)
failed to start shared server, oer=0

 

通過檢查發現系統內存耗盡,出現了oom_kill 。OOMkiller,即out of memory killer,是linux下面當內存耗盡時的的一種處理機制。當內存較少時,OOM會遍歷整個進程鏈表,然後根據進程的內存使用情況以及它的oom score值最終找到得分較高的進程,然後發送kill信號將其殺掉。

 

# grep -i kill /var/log/messages | more
Aug  5 10:12:10 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:12:10 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug  5 10:12:11 xxxxx kernel: Killed process 21687 (oracle)
Aug  5 10:12:11 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:12:11 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug  5 10:12:11 xxxxx kernel: Killed process 21668 (oracle)
Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Killed process 21756 (oracle)
Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug  5 10:23:09 xxxxx kernel: Killed process 21732 (oracle)
Aug  5 10:28:08 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:28:08 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug  5 10:28:09 xxxxx kernel: Killed process 21752 (oracle)
Aug  5 10:28:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:28:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug  5 10:28:09 xxxxx kernel: Killed process 21722 (oracle)
Aug  5 10:32:24 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug  5 10:32:24 xxxxx kernel: Killed process 21718 (oracle)
Aug  5 10:32:24 xxxxx kernel: gdm-rh-security invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug  5 10:32:24 xxxxx kernel: Killed process 22053 (oracle)
Aug  5 10:37:54 xxxxx kernel: beremote invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug  5 10:37:54 xxxxx kernel: Killed process 22238 (oracle)
Aug  5 10:37:54 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
--More--

 

從上面可以看到大量的ORACLE的進程被kill掉,從而導致ORACLE出現"found dead shared server 'S016', pid = (35, 23)"這類錯誤,在官方文檔Found Dead Shared Server Messages Reported In Alert.Log (文檔 ID 760872.1) 有如下介紹(這個文檔較老舊,不過原理依然適用於此處環境):

 

SYMPTOMS


The following is being reported in the alert.log file

Mon Dec 22 16:48:31 2008
found dead shared server 'S004', pid = (13, 1)
found dead shared server 'S001', pid = (10, 1)

No further errors accompany those messages in the alert.log file.

Listener.log shows no errors.

CAUSE

When a session is killed, in some circumstances, the shared server could die as well leading to those messages being reported in the alert.log file.
This has been verified by << Bug 4270723 >> FOUND DEAD SHARED SERVER

 

也就是說當一個會話被殺,在某些情況下, shared sever進程會死掉,導致上面“found dead shared server”出現在告警日志中。 這個案例中,由於系統大量kill掉會話進程,導致shared server進程死掉。所以Oracle數據庫出現無法訪問的情況。突然掛起。

 

$grep "Out of memory" messages
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
Aug  5 10:38:46 xxxxx kernel: Out of memory: kill process 22227 (oracle) score 2132996 or a child
Aug  5 10:39:14 xxxxx kernel: Out of memory: kill process 22229 (oracle) score 2134280 or a child
Aug  5 10:40:57 xxxxx kernel: Out of memory: kill process 22286 (oracle) score 2135299 or a child
Aug  5 10:41:24 xxxxx kernel: Out of memory: kill process 22245 (oracle) score 2135302 or a child
Aug  5 10:41:25 xxxxx kernel: Out of memory: kill process 22485 (oracle) score 2133009 or a child
Aug  5 10:41:56 xxxxx kernel: Out of memory: kill process 21779 (oracle) score 2132880 or a child
Aug  5 10:42:08 xxxxx kernel: Out of memory: kill process 22068 (oracle) score 2132873 or a child
Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22249 (oracle) score 2132873 or a child
Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22278 (oracle) score 2132873 or a child
Aug  5 10:42:31 xxxxx kernel: Out of memory: kill process 21662 (oracle) score 2132872 or a child
Aug  5 10:42:47 xxxxx kernel: Out of memory: kill process 22045 (oracle) score 2132872 or a child
Aug  5 10:42:57 xxxxx kernel: Out of memory: kill process 22314 (oracle) score 2132872 or a child
Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22336 (oracle) score 2132872 or a child
Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22435 (oracle) score 2132870 or a child
Aug  5 10:43:55 xxxxx kernel: Out of memory: kill process 21666 (oracle) score 2132869 or a child
Aug  5 10:44:02 xxxxx kernel: Out of memory: kill process 22263 (oracle) score 2132869 or a child
Aug  5 10:44:19 xxxxx kernel: Out of memory: kill process 22405 (oracle) score 2132866 or a child
Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22438 (oracle) score 2132866 or a child
Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22453 (oracle) score 2132865 or a child
Aug  5 10:44:23 xxxxx kernel: Out of memory: kill process 22466 (oracle) score 2132737 or a child
Aug  5 10:44:26 xxxxx kernel: Out of memory: kill process 22499 (oracle) score 2132607 or a child
Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21716 (oracle) score 1078417 or a child
Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21670 (oracle) score 1066455 or a child
Aug  5 10:48:02 xxxxx kernel: Out of memory: kill process 22829 (oracle) score 2134273 or a child
Aug  5 10:49:47 xxxxx kernel: Out of memory: kill process 22900 (oracle) score 2133007 or a child
Aug  5 10:50:36 xxxxx kernel: Out of memory: kill process 22842 (oracle) score 2133095 or a child
Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 22990 (oracle) score 2134285 or a child
Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 23054 (oracle) score 2132994 or a child
Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 22933 (oracle) score 2134277 or a child
Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 23103 (oracle) score 2132996 or a child
Aug  5 10:52:52 xxxxx kernel: Out of memory: kill process 23211 (oracle) score 2134267 or a child

 

在官方文檔Oracle VM Server hangs after Invoking the OOM Killer and having hundreds of kpartx processes spawned and "state S non-preferred supports toluSnA" reported on the FC LUNs (文檔 ID 2123877.1)介紹了這麼一個案例

 

APPLIES TO:

Oracle VM - Version 3.3.2 and later
Linux x86-64

SYMPTOMS

On Oracle VM Server, during normal server operation, the server suddenly hangs, with the following kind of messages being logged about invoking the Out Of Memory Killer :

$ grep "Out of memory" messages
Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:51 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:56 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:57 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:18:38 ovs01 kernel: Out of memory: Kill process 7557 (xend) score 2 or sacrifice child
Mar 19 09:19:21 ovs01 kernel: Out of memory: Kill process 17926 (mounted.ocfs2) score 1 or sacrifice child

Stack traces like this one are also being printed on the console :

Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:50 ovs01 kernel: Killed process 7422 (sh) total-vm:5512kB, anon-rss:40kB, file-rss:108kB
Mar 19 09:17:51 ovs01 kernel: snmpd invoked oom-killer: gfp_mask=0x2800d0, order=0, oom_score_adj=0
Mar 19 09:17:51 ovs01 kernel: snmpd cpuset=/ mems_allowed=0
Mar 19 09:17:51 ovs01 kernel: Pid: 6602, comm: snmpd Tainted: G W 3.8.13-55.1.8.el6uek.x86_64 #2
Mar 19 09:17:51 ovs01 kernel: Call Trace:
Mar 19 09:17:51 ovs01 kernel: [<ffffffff810d6bba>] ? cpuset_print_task_mems_allowed+0xba/0xe0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff81133c43>] dump_header+0x93/0x1f0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8128d330>] ? ___ratelimit+0xa0/0x120
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8159a43a>] ? error_exit+0x2a/0x60
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811343da>] oom_kill_process+0x20a/0x350
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8122b735>] ? security_capable_noaudit+0x15/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bcc5>] ? has_ns_capability_noaudit+0x15/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bce7>] ? has_capability_noaudit+0x17/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811347a6>] out_of_memory+0x286/0x2d0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a1a2>] __alloc_pages_slowpath+0x742/0x7c0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a51b>] __alloc_pages_nodemask+0x2fb/0x320
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117ec27>] kmem_getpages+0x67/0x1c0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117fa99>] fallback_alloc+0x189/0x250
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117f85a>] ____cache_alloc_node+0x9a/0x150
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8118102f>] kmem_cache_alloc+0x15f/0x2f0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fa370>] proc_alloc_inode+0x20/0xa0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adec7>] alloc_inode+0x27/0xa0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adf5b>] new_inode_pseudo+0x1b/0x70
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adfcd>] new_inode+0x1d/0x40
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc648>] proc_pid_make_inode+0x28/0x100
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdbbb>] proc_pid_instantiate+0x1b/0xd0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc2d6>] proc_fill_cache+0x126/0x150
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdba0>] ? pid_revalidate+0x120/0x120
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fe25e>] proc_pid_readdir+0xfe/0x170
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fae3a>] proc_root_readdir+0x4a/0x60
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5ab8>] vfs_readdir+0xb8/0xd0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5c65>] sys_getdents+0x95/0x110
Mar 19 09:17:51 ovs01 kernel: [<ffffffff815a2599>] system_call_fastpath+0x16/0x1b

In the processes dump before the Out Of Memory Killer starts killing processes, there are hundreds of kpartx instances :

$ grep -c kpartx messages
124438

Error messages like the following one are also being reported :

Mar 13 03:59:20 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 03:59:40 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 03:59:55 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:13 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:24 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:58 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:09 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:34 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:46 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA

 

CAUSE

Issue(s) on the Fiber Channel Storage Server which is hosting the LUNs used by the Oracle VM Server.
 

SOLUTION

To resolve this issue, please resolve the issue on the Fiber Channel Storage Array - The issue on the VM Server will then resolve as the issues on the Storage is fixed.

 

 

跟這個案例非常類似,猜測就有可能就是這個原因造成的,需要海外那邊的系統管理員去檢查、處理。哎,但是一直沒給我反饋。有時候要幫他們看著這些屬於他們職責范圍內的事情,還得不到有效支持,真心覺得累!

  1. 上一頁:
  2. 下一頁:
Copyright © 程式師世界 All Rights Reserved