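On a 10.2.0.5 two-node RAC, one node (the slave) kept restarting.
While checking various things I found the error log below, looked it up, and still wasn't sure what it meant,
so for now I'm reposting what I found to keep a record.
Below is crsd.log (from the master node).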
2015-10-10 12:49:30.874: [ COMMCRS][9526]clsc_receive: (114771870) error 2
2015-10-10 21:54:31.062: [ COMMCRS][9526]clsc_receive: (114771870) error 2
2015-10-13 10:41:30.444: [ CRSEVT][11008]32CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2015-10-13 10:41:30.449: [ CRSEVT][11008]32CAAMonitorHandler :: 0:Action Script /oracle/crs/bin/racgwrap(check) timed out for ora.glvndbp05.vip! (timeout=60)
2015-10-13 10:41:30.449: [ CRSAPP][11008]32CheckResource error for ora.glvndbp05.vip error code = -2
2015-10-13 10:43:03.482: [ CRSEVT][11011]32CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2015-10-13 10:43:03.482: [ CRSEVT][11011]32CAAMonitorHandler :: 0:Action Script /oracle/crs/bin/racgwrap(check) timed out for ora.glvndbp05.vip! (timeout=60)
2015-10-13 10:43:03.482: [ CRSAPP][11011]32CheckResource error for ora.glvndbp05.vip error code = -2
2015-10-13 10:44:36.509: [ CRSEVT][11017]32CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
Summarizing the log above, the key messages are:
[CRSEVT] CAAMonitorHandler :: 0:Could not join .../crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
[CRSEVT] CAAMonitorHandler :: 0:Action Script ../crs/bin/racgwrap(check) timed out for <oracle>.vip! (timeout=60)
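These are the messages that stand out. A search (in Korean) turned up exactly one person who had posted about them.
I don't know for certain that this is the cause, so for now I'm just sharing it.
This is reposted content, so I will take it down myself if that is a problem.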
This is a bug in 10g RAC environments: "racgmain check" processes are forked abnormally, memory usage keeps climbing, and eventually the system becomes unusable.
oracle 26024 1 0 Dec 6 ? 0:00 /oracle/crs/bin/racgmain check
oracle 23218 1 0 Dec 6 ? 0:00 /oracle/crs/bin/racgmain check
oracle 23179 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 27277 1 0 Dec 6 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 1028 1 0 Dec 5 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 7991 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 15324 1 0 Dec 3 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 14314 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 10895 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 404 1 0 Dec 3 ? 0:00 /oracle/ora10/bin/racgmain check
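In the listing above the third column (the parent PID) is 1 for every process, which is what marks them as orphans. A minimal sketch for pulling out only those PIDs, assuming `ps -ef` prints the usual UID PID PPID ... CMD columns; the function name `orphan_racgmain_pids` is made up for illustration and is not an Oracle tool.

```shell
# Print the PID of every "racgmain check" process whose parent is PID 1,
# i.e. a process that lost its wrapper and was adopted by init.
# Reads `ps -ef` style output on stdin: UID PID PPID C STIME TTY TIME CMD
orphan_racgmain_pids() {
    # $2 = PID, $3 = PPID; the command is matched on the last two fields so
    # that a multi-word STIME such as "Dec 6" does not shift the match.
    awk '$3 == 1 && $NF == "check" && $(NF-1) ~ /\/racgmain$/ { print $2 }'
}
```

Something like `ps -ef | orphan_racgmain_pids | wc -l` then gives the number of orphans, which is easier to watch over time than the raw listing.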
The fix is to apply the CRS bundle #2 patch as described below, or to use the workaround.
=====================================================================================
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.6
Information in this document applies to any platform.
Oracle Server Enterprise Edition - Version: 10.1.0.2 to 10.2.0.4
Symptoms
System slows down and many "racgmain check" processes may appear in ps output. CRS log would show the following messages.
oracle@HA5-ZW05:[/home/oracle] ps -ef|grep "racgmain check"|wc -l
1290
~~~~
CAAMonitorHandler :: 0:Action Script /opt/oracle/product/crs/bin/racgwrap(check) timed out for ora.harac1.vip! (timeout=60)
CheckResource error for ora.harac1.vip error code = -2
CAAMonitorHandler :: 0:Could not join /opt/oracle/product/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0,
other: Abnormal termination of the child
~~~~
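To see which resources are hitting the check timeout and how often, the "timed out for" lines can be tallied per resource. A rough sketch using sed, sort and uniq; the function name `check_timeouts_by_resource` is hypothetical, and it only assumes the message format shown in the log excerpts here.

```shell
# Count "racgwrap(check) timed out" events per resource.
# Reads the log files given as arguments, or stdin when none are given.
# Output: one "<count> <resource>" line per resource.
check_timeouts_by_resource() {
    # Keep only the resource name between "timed out for " and the "!".
    sed -n 's/.*(check) timed out for \([^ !]*\)!.*/\1/p' "$@" |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Usage would be along the lines of `check_timeouts_by_resource /path/to/crsd.log`, where the path is a placeholder for wherever crsd.log lives on the node.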
Cause
crsd.bin invokes the racgmain to check the status of the resources that are managed by CRS. The racgmain is invoked through the wrapper script racgwrap.
If the resource action timed out, crsd kills the action script, which is racgwrap, while the racgmain process will not be killed. Over time, this might create a lot of orphan racgmain processes in the system. This would eventually slow down the system due to resource contention at the OS level.
Internal bug:6196746 addresses this issue.
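The mechanism can be reproduced without Oracle at all. The sketch below is only an illustration under stated assumptions: a small script ending in `sleep 30` stands in for racgmain, an `sh -c` one-liner stands in for racgwrap, and `kill -9` stands in for what crsd does on timeout. All names (`demo_wrapper_kill`, `child.sh`) are hypothetical.

```shell
# Prints "survived" if the child outlives a kill -9 of its wrapper, else "gone".
demo_wrapper_kill() {
    mode=$1                        # "plain" = fork and wait, "exec" = replace the wrapper
    dir=$(mktemp -d) || return 1
    # The stand-in child records its own PID, then becomes a long-running sleep.
    printf '%s\n' 'echo $$ > "$1"' 'exec sleep 30' > "$dir/child.sh"

    if [ "$mode" = "exec" ]; then
        # The wrapper replaces itself with the child: one process, one PID.
        sh -c 'exec sh "$1" "$2"' wrapper "$dir/child.sh" "$dir/pid" > /dev/null &
    else
        # The wrapper forks the child and waits for it: two separate processes.
        sh -c 'sh "$1" "$2"; status=$?; exit $status' wrapper "$dir/child.sh" "$dir/pid" > /dev/null &
    fi
    wrapper_pid=$!

    sleep 1                        # give the child time to write its PID
    child_pid=$(cat "$dir/pid")
    kill -9 "$wrapper_pid"         # only the wrapper's PID is signalled
    wait "$wrapper_pid" 2>/dev/null || true   # reap it so it is not left as a zombie

    if kill -0 "$child_pid" 2>/dev/null; then
        echo survived
        kill -9 "$child_pid"       # clean up the orphan this demo just created
    else
        echo gone
    fi
    rm -rf "$dir"
}
```

Running `demo_wrapper_kill plain` should leave the child alive after the wrapper is killed, while `demo_wrapper_kill exec` should not, because with `exec` the wrapper and the child are the same process.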
Solution
This is fixed in the 11.1.0.7 patchset. If you are running into this issue in 10gR2, please go ahead and apply the 10.2.0.4 patchset and the latest CRS bundle patch. This fix is included in the CRS bundle patch from bundle #2 onwards.
Following option could be used as a temporary workaround until the patch is applied.
1. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin on ALL Nodes
2. Edit the file racgwrap and modify the last 3 lines from:
~~~
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
to:
# Line added to fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
~~~
3. Kill all the orphan racgmain processes running.
$ ps -ef|grep "racgmain check"
oracle 18701 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 14653 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 24517 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
$ kill -9 <PID of racgmain>
Source: http://pat98.tistory.com/376
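Steps 1 and 2 of the racgwrap workaround quoted above could be scripted roughly as follows. This is a sketch, not something taken from the Oracle note: the function name `patch_racgwrap` and the `.orig` backup suffix are my own, and it deliberately refuses to touch a file whose last three lines are not exactly the ones quoted, since the real racgwrap may differ between versions.

```shell
# Back up a racgwrap file and replace its last three lines with the exec form.
# Usage: patch_racgwrap /path/to/racgwrap   (path is supplied by the caller)
patch_racgwrap() {
    file=$1
    expected='$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status'
    # Safety check: patch only if the tail matches the lines from the note.
    if [ "$(tail -n 3 "$file")" != "$expected" ]; then
        echo "unexpected last lines in $file, not patching" >&2
        return 1
    fi
    cp -p "$file" "$file.orig" || return 1       # step 1: keep a copy
    total=$(wc -l < "$file")
    {
        head -n $((total - 3)) "$file"           # everything except the last 3 lines
        echo '# Line added to fix for Bug 6196746'
        echo 'exec $ORACLE_HOME/bin/racgmain "$@"'
    } > "$file.new" || return 1
    # Overwrite in place so the original owner and mode are preserved.
    cat "$file.new" > "$file" && rm -f "$file.new"
}
```

Per the note, this would have to be repeated for the racgwrap under both $ORACLE_HOME/bin and $CRS_HOME/bin on every node. Running it a second time on an already patched file does nothing, because the tail no longer matches.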
I'm also sharing some additional documents I found.
10g/11gR1: Many Orphaned Or Hanging "racgmain" Processes Running (Doc ID 732086.1) <-- same as the one above
10g RAC: One node VIP status always shows "UNKNOWN" and "CRS-0223: Resource 'ora.rac-test1.vip' has placement error" when try to startup the VIP. (Doc ID 1993024.1)
Symptoms
Symptom 1:
In 10g RAC on a Unix platform, the VIP on one node always shows "UNKNOWN".
Symptom 2:
When you try to start it up, it reports the following error:
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac-test1.vip' has placement error.
Symptom 3:
The CRSD log shows the following errors:
2015-03-19 10:51:09.772: [ CRSRES][3737213248][ALERT]0`ora.rac-test1.vip` on member `rac-test1` has experienced an unrecoverable failure.
2015-03-19 10:51:09.772: [ CRSRES][3737213248]0Human intervention required to resume its availability.
2015-03-19 10:51:09.772: [ CRSEVT][3741415744]0CAAMonitorHandler :: 0:Could not execute /u01/app/oracle/product/10.2.0/crs_1/bin/racgwrap(stop) for ora.rac-test2.vip
category: 1234, operation: scls_canexec, loc: , OS error: 0, other: no exe permission, file /u01/app/oracle/product/10.2.0/crs_1/bin/racgwrap ===> No execute permission for this file.
Solution
1. Shut down all the resources on this node: instance/asm/nodeapps/crs
2. Change the permissions of these 2 files to 751 on the node with the issue
chmod 751 /u01/app/oracle/product/10.2.0/crs_1/racg/admin/racgwrap
chmod 751 /u01/app/oracle/product/10.2.0/crs_1/bin/racgeut
3. Then start all the resources again and check whether the VIP is online.
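Before and after the chmod it is easy to confirm whether the files are actually executable. A small sketch; the function name `report_not_executable` is made up, and the result of `[ -x ]` depends on which OS user runs it, so I assume it is run as the user that CRS uses to execute the action script.

```shell
# Report every given file that the current user cannot execute.
# Returns 1 if at least one file is not executable, 0 otherwise.
report_not_executable() {
    rc=0
    for f in "$@"; do
        if [ ! -x "$f" ]; then
            echo "not executable: $f"
            ls -l "$f" 2>/dev/null    # show the current mode and owner, if the file exists
            rc=1
        fi
    done
    return $rc
}
```

It would be called with the two paths from step 2, for example `report_not_executable <path to racgwrap> <path to racgeut>`, and should print nothing once the permissions are correct.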