Revisiting the Oracle OPROCD Process

Editor: Oracle教程

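Here is a rundown of the key facts about oprocd.

Oracle introduced oprocd in RAC to perform I/O fencing.

On Unix systems, oprocd exists only when no third-party (non-Oracle) clusterware is in use.

On Linux, oprocd exists only from release 10.2.0.4 onward.

On Windows there is no oprocd process. Instead, a service named oraFenceService provides the same functionality. It is built on Windows-specific technology and works differently from oprocd.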

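oprocd can run in two modes: fatal and non-fatal. In fatal mode, if the system hangs or something else triggers oprocd, oprocd reboots the server automatically. In non-fatal mode, the same conditions make oprocd write a warning to its log, but it does not reboot the system.

oprocd takes two parameters: timeout, the interval at which oprocd is woken up, and margin, the allowed timing deviation. If the deviation exceeds margin, oprocd reboots the system or writes an error to its log, depending on the mode.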

The oprocd log files are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd.

oprocd is spawned by the cssd process and runs as root:

[root@node2 init.d]# ps -ef | grep oprocd
root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd
root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
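
The running parameters can be read back from the command line in the ps output above. The following is a minimal Python sketch, not an Oracle tool. It assumes that `-t` carries the timeout, `-m` the margin (both in milliseconds) and `-f` selects fatal mode; the helper name `parse_oprocd_args` is hypothetical, and the `-hsi` option is simply ignored.

```python
import shlex


def parse_oprocd_args(cmdline):
    """Extract timeout, margin and the fatal flag from an oprocd.bin command line.

    Assumption (not confirmed by the article): -t is the timeout in ms,
    -m is the margin in ms, and -f means fatal mode.
    """
    tokens = shlex.split(cmdline)
    result = {"timeout_ms": None, "margin_ms": None, "fatal": False}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-t" and i + 1 < len(tokens):
            result["timeout_ms"] = int(tokens[i + 1])
            i += 1  # skip the value we just consumed
        elif tok == "-m" and i + 1 < len(tokens):
            result["margin_ms"] = int(tokens[i + 1])
            i += 1
        elif tok == "-f":
            result["fatal"] = True
        i += 1
    return result


if __name__ == "__main__":
    line = "/u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f"
    print(parse_oprocd_args(line))
```

For the process shown above, this would report a timeout of 1000, a margin of 10000 and fatal mode enabled.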

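If a node hangs for a long time, the other nodes in the cluster evict it. In that situation the hung node has to be rebooted so that its I/O is fenced. oprocd is configured with two parameters, timeout and margin. The process is woken up once every timeout interval. If the time between the current wakeup and the previous one exceeds timeout + margin, oprocd concludes that the Oracle node is hung and either reboots the node or writes a warning to its log.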
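
The wakeup check described above can be sketched in a few lines of Python. This is only an illustrative model of the rule "gap between wakeups > timeout + margin", not oprocd's actual implementation; the function names `find_hangs` and `react` and the use of millisecond timestamps are assumptions made for the example.

```python
def find_hangs(wakeup_times_ms, timeout_ms, margin_ms):
    """Return (previous, current) wakeup pairs whose gap exceeds timeout + margin."""
    limit = timeout_ms + margin_ms
    hangs = []
    for prev, cur in zip(wakeup_times_ms, wakeup_times_ms[1:]):
        if cur - prev > limit:
            hangs.append((prev, cur))
    return hangs


def react(hang_detected, fatal):
    """Model the two modes: fatal reboots, non-fatal only logs a warning."""
    if not hang_detected:
        return "ok"
    return "reboot" if fatal else "log-warning"


if __name__ == "__main__":
    # Hypothetical wakeup timestamps: the last gap is 2000 ms.
    wakeups = [0, 1000, 2000, 4000]
    hangs = find_hangs(wakeups, timeout_ms=1000, margin_ms=500)
    print(hangs, react(bool(hangs), fatal=True))
```

With a 1000 ms timeout and a 500 ms margin, only the 2000 ms gap is over the 1500 ms limit, so only that pair is flagged.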

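In general, the reasons oprocd reboots a system fall into four categories:

1. Operating system scheduling problems

2. Hardware or driver problems at the operating system level

3. Heavy system load that prevents the scheduler from running oprocd in time

4. Oracle bugs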

Bug 5015469 – OPROCD may reboot the node whenever the system date is moved backwards.
Fixed in 10.2.0.3+

Bug 4206159 – Oprocd is prone to time regression due to current API used (AIX only)
Fixed in 10.1.0.3 + One off patch for Bug 4206159.

Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 – Oprocd logfile is cleared after a reboot
Fixed in 10.2.0.4+

Bug 5037858 – Increase the warning levels if a reboot is approaching
Fixed in 10.2.0.3+

The default values of oprocd's two parameters, timeout and margin, are set in the init.cssd file, for example:

[root@node2 init.d]# cat init.cssd | grep ^OPROCD_DEFAULT_
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=

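So by default, if the interval between two oprocd wakeups exceeds 1.5 s (1000 ms timeout + 500 ms margin), oprocd reboots the system. That is often too aggressive. Editing the defaults in init.cssd by hand should only be done with the agreement of Oracle Support.

To get past the 1.5 s limit, we can change two parameters instead, reboottime and diagwait, which are set with crsctl as in the session below. If diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, all clusterware processes on every node in the cluster must be stopped; otherwise data corruption may result. The change only needs to be made on one node of the RAC. The recommended value for diagwait is 13.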

[root@node2 bin]# ./crsctl get css reboottime
3
[root@node2 bin]# ./crsctl get css diagwait
13
[root@node2 bin]# ./crsctl set css diagwait 13 -force
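
The relation margin = diagwait - reboottime can be checked with a little arithmetic. The Python sketch below is an illustration under stated assumptions, not Oracle's code: it assumes diagwait and reboottime are given in seconds while margin is in milliseconds, and that the default margin stays in effect when diagwait is not greater than reboottime (the article does not say what happens in that case). The function name `derived_margin_ms` is hypothetical.

```python
def derived_margin_ms(diagwait_s, reboottime_s, default_margin_ms=500):
    """Margin implied by diagwait and reboottime.

    Assumptions: inputs are in seconds, the result is in milliseconds, and
    the default margin applies when diagwait <= reboottime.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms


if __name__ == "__main__":
    margin = derived_margin_ms(diagwait_s=13, reboottime_s=3)
    timeout = 1000  # OPROCD_DEFAULT_TIMEOUT from init.cssd
    print(margin, timeout + margin)
```

With diagwait 13 and reboottime 3 this gives a margin of 10000 ms, which agrees with the `-m 10000` seen on the oprocd.bin command line earlier, and raises the hang threshold from 1.5 s to 11 s.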

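From 11.2.0.1 onward, diagwait no longer needs to be changed, because the architecture has changed.

On Windows diagwait can also be changed, but unlike on Linux, changing it does not produce the effect described above.

Next, some information on hangcheck_timer. hangcheck_timer and oprocd provide the same kind of functionality, but the two are not inherently related.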

Hangcheck-Timer Module
Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting in release 9.2.0.2 and later, Oracle RAC environments required using a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time, and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. This is done by setting a timer, then checking when the timer fires as to whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin seconds), the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.
Hangcheck-timer requires three configuration parameters:
hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version. In the 2.4 kernel, the default is 1. In 2.6 kernels, the default is 0.
Hangcheck-timer will provide message logging to the system messages log when a failure is detected, and a node restart is initiated by the module:
When Hangcheck-timer reboots the node, it may leave a "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages.
If you see the following message in /var/log/messages: "Hangcheck: hangcheck value past margin!" this means a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
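
The decision described above can be modelled as a small Python function. This is only a sketch of the documented behaviour, not the kernel module's code: the function name `hangcheck_action` and the idea of passing the observed delay in seconds are assumptions for illustration. The defaults mirror the values quoted above (tick 60, margin 180, reboot 0 as on 2.6 kernels).

```python
RESTART_MSG = "Hangcheck: hangcheck is restarting the machine"
PAST_MARGIN_MSG = "Hangcheck: hangcheck value past margin!"


def hangcheck_action(delay_s, hangcheck_tick=60, hangcheck_margin=180, hangcheck_reboot=0):
    """Return the log message the described rules would lead to, or "ok".

    delay_s is the observed time between two timer firings, in seconds.
    """
    if delay_s <= hangcheck_tick + hangcheck_margin:
        return "ok"  # within the allowed window, nothing is logged
    if hangcheck_reboot >= 1:
        return RESTART_MSG  # the module restarts the node
    return PAST_MARGIN_MSG  # hang detected, but no reboot is performed


if __name__ == "__main__":
    print(hangcheck_action(100))
    print(hangcheck_action(300))
    print(hangcheck_action(300, hangcheck_reboot=1))
```

The second call shows why the "value past margin" message matters: with hangcheck_reboot left at 0, a 300-second delay is detected but the node keeps running.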
Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.