Monday, February 7, 2011

Oracle 10g CRS troubleshooting tips and tricks

A very good article at http://www.ritzyblogs.com/OraTalk/PostID/113/Oracle-10g-CRS-troubleshooting-tips-and-tricks is reproduced below:




Whenever a node has trouble rejoining the cluster after a reboot, here is a quick checklist I would suggest:

 /var/log/messages

 ifconfig

 ip route

 /etc/hosts

 /etc/sysconfig/network-scripts/ifcfg-eth*

 ethtool

 mii-tool

 cluvfy

 $ORA_CRS_HOME/log



Let us now take a closer look at specific issues, with examples and the steps taken to resolve them.

These were all tested on an Oracle 10.2.0.4 database on RHEL4 U8 x86_64.



1. srvctl is unable to start the Oracle instance, but sqlplus can start it

a. Check racg log for actual error message.

% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log



b. Check that srvctl is configured to use the correct parameter file (pfile/spfile)

% srvctl config database -d {DBNAME} -a

You can also validate the parameter file by starting the instance with it from sqlplus, which shows the exact error message.



c. Check ownership for $ORACLE_HOME/log

If this directory is owned by root, srvctl won't be able to start the instance as the oracle user.

# chown -R oracle:dba $ORACLE_HOME/log
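Before changing ownership, it can help to see exactly which files are affected. The following is a minimal sketch; the function name list_not_owned_by is made up for illustration, and the oracle owner is taken from the chown example above.

```shell
# list_not_owned_by DIR USER
# Prints every file or directory under DIR whose owner is not USER.
# Empty output means the ownership is consistent.
list_not_owned_by() {
    dir=$1
    user=$2
    find "$dir" ! -user "$user" -print
}

# Example:
#   list_not_owned_by "$ORACLE_HOME/log" oracle
```

Anything this prints is a candidate for the chown above.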



2. VIP has failed over to another node but is not coming back to the original node

Fix: On the node that the VIP has failed over to, bring the VIP down manually as root

Example: ifconfig eth0:2 down

PS: Be careful to bring down only VIP. A small typo may bring down your public interface:)
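One way to guard against that typo is to refuse any interface name that does not look like an alias. This is only a sketch: it assumes the VIP is plumbed as an alias interface of the form name:number (as eth0:2 in the example), and the function names is_alias_if and down_vip are hypothetical. It does not check that the alias actually carries the VIP address, so confirm that with ifconfig first.

```shell
# is_alias_if NAME
# Succeeds only if NAME looks like an alias interface such as eth0:2
# (a base name, one colon, then a numeric alias index).
is_alias_if() {
    printf '%s\n' "$1" | grep -Eq '^[^:]+:[0-9]+$'
}

# down_vip NAME
# Brings NAME down only if it passes the alias check above. Run as root.
down_vip() {
    if is_alias_if "$1"; then
        ifconfig "$1" down
    else
        echo "refusing: '$1' is not an alias interface" >&2
        return 1
    fi
}

# Example:
#   down_vip eth0:2
```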



3. Moving OCR to a different location

PS: This can be done as root while CRS is up.

While trying to change the OCR or the OCR mirror to a new location, ocrconfig complains if the new file does not exist yet.

The fix is to touch the new file first.

Example:

# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile

PROT-21: Invalid parameter



# touch /crs_new/cludata/ocrfile

# chown root:dba /crs_new/cludata/ocrfile

# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile



Verify:

a. Validate using "ocrcheck". Device/File Name should point to the new location and the integrity check should succeed.

b. Ensure OCR inventory is updated correctly

# cat /etc/oracle/ocr.loc

ocrconfig_loc and ocrmirrorconfig_loc should point to correct locations.
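The check on ocr.loc can be scripted. The sketch below assumes that ocr.loc is a plain key=value file; the helper name ocr_loc_value is hypothetical.

```shell
# ocr_loc_value FILE KEY
# Prints the value stored for KEY in an ocr.loc-style key=value file.
ocr_loc_value() {
    awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2) }' "$1"
}

# Example:
#   ocr_loc_value /etc/oracle/ocr.loc ocrconfig_loc
#   ocr_loc_value /etc/oracle/ocr.loc ocrmirrorconfig_loc
```

Compare the printed paths with the locations passed to ocrconfig -replace.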



4. Moving Voting Disk to a different location

PS: CRS must be down while moving the voting disk.



The idea is to add new voting disks and delete the older ones.

Find below sample errors and their fix.

# crsctl add css votedisk /crs_new/cludata/cssfile_new

Cluster is not in a ready state for online disk addition



We need to use the force option. However, before using it, ensure CRS is down.

If CRS is up, DO NOT use the force option, or it may corrupt your OCR.



# crsctl add css votedisk /crs_new/cludata/cssfile_new -force

Now formatting voting disk: /crs_new/cludata/cssfile_new

successful addition of votedisk /crs_new/cludata/cssfile_new.



Verify using "crsctl query css votedisk" and then delete the old voting disks.

Deleting also requires the force option.



Also verify the ownership of the voting disk files. It should be oracle:dba.

If the voting disks were added as root, change the ownership to oracle:dba.



5. Manually registering listener resource to OCR

The listener was registered manually with OCR, but srvctl was unable to bring it up.

Let us first see an example of how to register it manually.

From an existing available node, print the listener resource:

% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res

% cat /tmp/res

NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr

TYPE=application

ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap

ACTIVE_PLACEMENT=0

AUTO_START=1

CHECK_INTERVAL=600

DESCRIPTION=CRS application for listener on node

FAILOVER_DELAY=0

FAILURE_INTERVAL=0

FAILURE_THRESHOLD=0

HOSTING_MEMBERS=test-server2

OPTIONAL_RESOURCES=

PLACEMENT=restricted

REQUIRED_RESOURCES=ora.test-server2.vip

RESTART_ATTEMPTS=5

SCRIPT_TIMEOUT=600

START_TIMEOUT=0

STOP_TIMEOUT=0

UPTIME_THRESHOLD=7d

USR_ORA_ALERT_NAME=

USR_ORA_CHECK_TIMEOUT=0

USR_ORA_CONNECT_STR=/ as sysdba

USR_ORA_DEBUG=0

USR_ORA_DISCONNECT=false

USR_ORA_FLAGS=

USR_ORA_IF=

USR_ORA_INST_NOT_SHUTDOWN=

USR_ORA_LANG=

USR_ORA_NETMASK=

USR_ORA_OPEN_MODE=

USR_ORA_OPI=false

USR_ORA_PFILE=

USR_ORA_PRECONNECT=none

USR_ORA_SRV=

USR_ORA_START_TIMEOUT=0

USR_ORA_STOP_MODE=immediate

USR_ORA_STOP_TIMEOUT=0

USR_ORA_VIP=



Modify the relevant parameters in the resource file so that they point to the correct node and instance.
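For a listener profile like the one above, most of the edit is swapping the node name. The following is a rough sketch, assuming the only node-specific values are the lower-case and upper-case forms of the hostname; the function name retarget_resource is made up for illustration. Review the result by hand, since paths such as ACTION_SCRIPT may also differ between nodes.

```shell
# retarget_resource SRC_NODE DST_NODE < profile > new_profile
# Replaces SRC_NODE with DST_NODE in a crs_stat -p profile read from stdin,
# in both lower-case and upper-case forms.
# Node names are used as sed patterns, so a "." in a name matches any character.
retarget_resource() {
    src=$1
    dst=$2
    src_uc=$(printf '%s' "$src" | tr '[:lower:]' '[:upper:]')
    dst_uc=$(printf '%s' "$dst" | tr '[:lower:]' '[:upper:]')
    sed -e "s/$src/$dst/g" -e "s/$src_uc/$dst_uc/g"
}

# Example:
#   retarget_resource test-server2 test-server1 < /tmp/res > /tmp/res.new
```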

Rename the file as resourcename.cap

% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap



Register with OCR

% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/



Start listener

% srvctl start listener -d testdb -n test-server1



While trying to start the listener, srvctl throws errors like "Unable to read from listener log file".

The listener log file does exist.

If the resource is registered as root, srvctl won't be able to start it as the oracle user.

So all of the steps above for registering the listener manually should be done as the oracle user.



6. Services

While checking the status of a service, it says "not running".

If we try to start it using srvctl, the error message is "No such service exists" or "already running".

If we try to add a service with the same name, it says "already exists".

This happens because the service is in an "Unknown" state in the OCR.

Using crs_stat, check whether any related resources for the service (resource names ending with .srv and .cs) are still lying around.

In this case, srvctl remove service -f had already been tried and the issue persisted.

Here is the fix:

# crs_stop -f {resourcename}

# crs_unregister {resourcename}

Now service can be added and started correctly.
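To find the {resourcename} values, the crs_stat output can be filtered for names ending in .srv or .cs. This is a sketch that assumes crs_stat prints one NAME=... line per resource; the function name leftover_service_resources is hypothetical.

```shell
# leftover_service_resources
# Reads crs_stat output on stdin and prints the resource names
# that end in .srv or .cs.
leftover_service_resources() {
    awk -F= '$1 == "NAME" && $2 ~ /\.(srv|cs)$/ { print $2 }'
}

# Example:
#   crs_stat | leftover_service_resources
```

Only stop and unregister the names that belong to the broken service; the filter lists the resources of every service.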



7. Post host reboot, CRS is not starting

After a host reboot, CRS was not coming up, and there were no CRS logs in $ORA_CRS_HOME.

Check /var/log/messages:

"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"

No logs were seen in /tmp/crsctl.* either.



Run cluvfy to identify the issue

$ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}



It turned out that /tmp was not writable.



/etc/fstab was incorrect; fixing it made /tmp available again.



If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then

check that /etc/oracle/ocr.loc is the same on all nodes of the cluster.
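Once each node's ocr.loc has been copied to one place, the comparison is a one-liner. This is a minimal sketch; the function name all_same and the /tmp/ocr.loc.* file names in the example are assumptions.

```shell
# all_same FILE...
# Succeeds if every file given has the same checksum and size.
all_same() {
    [ "$#" -gt 0 ] || return 2
    cksum "$@" | awk '
        { seen[$1 " " $2] = 1 }
        END {
            n = 0
            for (k in seen) n++
            exit (n == 1 ? 0 : 1)
        }'
}

# Example, with one copy per node collected under /tmp:
#   all_same /tmp/ocr.loc.node1 /tmp/ocr.loc.node2 || echo "ocr.loc differs"
```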



8. CRS binary restored by copying from existing node in the cluster

CRS is not starting, with the following message in /var/log/messages:

"Id "h1" respawning too fast: disabled for 5 minutes"



The CRSD log shows "no listener".



If the CRS binaries are restored by copying them from an existing node in the cluster, you need to ensure that:

a. The hostname-specific directories under $ORA_CRS_HOME/log are renamed to match the local hostname

b. Stale socket files are cleaned up from /var/tmp/.oracle, if needed



PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files; otherwise a reboot may be inevitable.



9. Node rebooting frequently because of oprocd

Check the logs under /etc/oracle/oprocd/ and grep for "Rebooting".

Check /var/log/messages and grep for "restart".

If the timestamps match, this confirms that the reboots are being initiated by the oprocd process.



% ps -ef | grep oprocd

root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f



-t 1000 means oprocd wakes up every 1000ms

-m 500 means it allows up to a 500ms margin of error

Basically, with these options, if oprocd wakes up after more than 1.5 seconds (1000ms + 500ms), it forces a reboot.

This is conceptually analogous to what the hangcheck timer used to do in pre-10.2.0.4 Oracle releases on Linux.



The fix is to set CSS diagwait to 13:

# crsctl set css diagwait 13 -force



# /oracle/product/crs/bin/crsctl get css diagwait

13



This actually changes the parameters oprocd runs with:

% ps -ef | grep oprocd

root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f



Note that the margin has now changed to 10000ms, i.e. 10 seconds, in place of the default 0.5 seconds.
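The arithmetic behind the two oprocd command lines can be checked with a small helper. This sketch follows the article's reading that a reboot is forced once a wakeup is later than interval plus margin; the function name oprocd_reboot_threshold_ms is made up for illustration.

```shell
# oprocd_reboot_threshold_ms "COMMAND LINE"
# Extracts the -t (interval) and -m (margin) values from an oprocd
# command line and prints their sum in milliseconds.
# Fails if either option is missing.
oprocd_reboot_threshold_ms() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "-t") t = $(i + 1)
            if ($i == "-m") m = $(i + 1)
        }
        if (t == "" || m == "") exit 1
        print t + m
    }'
}

# Default settings:
#   oprocd_reboot_threshold_ms "oprocd.bin run -t 1000 -m 500 -f"      # 1500
# After setting diagwait to 13:
#   oprocd_reboot_threshold_ms "oprocd.bin run -t 1000 -m 10000 -f"    # 11000
```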



PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.



10. Cluster hung. All SQL queries on GV$ views are hanging.

The alert logs from all instances have messages like the ones below:

INST1: IPC Send timeout detected. Receiver ospid 1650



INST2:IPC Send timeout detected.Sender: ospid 24692

Receiver: inst 1 binc 150 ospid 1650



INST3: IPC Send timeout detected.Sender: ospid 12955

Receiver: inst 1 binc 150 ospid 1650



The receiver ospid reported on all instances belongs to LCK0, the lock process.

In case of inter-instance lock issues, it's important to identify the instance where the problem originates.

As seen above, every sender is waiting on a receiver on instance 1, so INST1 is the one that needs to be fixed.

Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
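With several alert logs to read, counting the "Receiver: inst N" lines makes the culprit instance obvious. This is a sketch that assumes the message format shown in the excerpts above; the function name tally_ipc_receivers is hypothetical.

```shell
# tally_ipc_receivers [FILE...]
# Counts "Receiver: inst N" occurrences in the given alert logs
# (or stdin) and prints one line per receiving instance.
tally_ipc_receivers() {
    awk '/Receiver: inst [0-9]+/ {
        for (i = 1; i < NF; i++)
            if ($i == "Receiver:" && $(i + 1) == "inst") count[$(i + 2)]++
    }
    END { for (inst in count) print "inst " inst ": " count[inst] }' "$@"
}

# Example, with the alert logs of all instances copied to one directory:
#   tally_ipc_receivers alert_*.log
```

The instance with the highest count is the one the other instances are waiting on. Its own alert log uses a different message form and is not counted.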



11. Inconsistent OCR with invalid permissions

% srvctl add db -d testdb -o /oracle/product/10.2

PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]



crs_stat doesn't have any trace of it so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.



ocrdump shows:

[DATABASE.LOG.testdb]

UNDEF :

SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}



[DATABASE.LOG.testdb.INSTANCE]

UNDEF :

SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}



These log keys are owned by root, and that's the problem.

This means that the resource was perhaps added to the OCR as root.

Although it has since been removed by root, it cannot be added by the oracle user until we get rid of the leftover keys shown above.



Shut down the entire cluster and restore from a previous good backup of the OCR using:

ocrconfig -restore backupfilename



You can get the list of backups using:

ocrconfig -showbackup



If you are not sure of the last good backup, you can instead do the following:

Take an export backup of the OCR using:

ocrconfig -export /tmp/export -s online



Edit /tmp/export and remove the two entries for DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE that are owned by root



Now import it back:

ocrconfig -import /tmp/export



After starting the cluster, verify using ocrdump.

The OCRDUMPFILE should no longer have any trace of those leftover log entries owned by root.
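The verification can be narrowed to the keys that matter. This is a sketch that assumes the dump layout shown in the ocrdump excerpt above, where a [KEY] line is followed by its SECURITY line; the function name root_owned_ocr_keys is made up for illustration.

```shell
# root_owned_ocr_keys [FILE...]
# Reads an ocrdump text file (or stdin) and prints every key whose
# SECURITY line names root as USER_NAME.
root_owned_ocr_keys() {
    awk '
        /^\[/ { key = $0; sub(/^\[/, "", key); sub(/\].*$/, "", key); next }
        /^SECURITY/ && /USER_NAME : root[,}]/ && key != "" { print key }
    ' "$@"
}

# Example:
#   root_owned_ocr_keys OCRDUMPFILE | grep testdb
```

Many OCR keys are legitimately owned by root, so filter on the database name as in the example; no output means the leftover entries are gone.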




Tuesday, January 11, 2011

Missing logs, out of sync (physical standby)

Sometimes automatic synchronization does not happen and, as a result, the standby lags behind the primary. The steps below can bring the standby back in sync with the primary:


1. Roll the standby forward with an incremental backup:

On standby check :

1. select process, status from v$managed_standby; This query should return MRP as a process. If MRP is not listed, the recovery process is not active.

2. Try recover managed standby database cancel; and then again recover managed standby database using current logfile disconnect from session;

3. select current_scn from v$database; (note this SCN on the standby; the incremental backup has to start from the standby's SCN, not the primary's)

At primary :

4. RMAN> backup incremental from scn scn_number database format '/tmp/forSTANDBY_%U' tag FORSTANDBY;

At standby :

Copy the backup pieces to the standby host, then do the recovery at standby as :

5. rman nocatalog target /

6. RMAN> catalog backuppiece '/tmp/forSTANDBY_%U'; (give the actual backup piece name in place of %U, once per piece)

7. RMAN> recover database noredo;

8. RMAN> delete backup tag 'FORSTANDBY';

9. SQL> alter database recover managed standby database using current logfile disconnect from session;

10. SQL> select process, status from v$managed_standby;

11. SQL> select sequence#, applied from v$archived_log;

12. SQL> select max(sequence#) from v$log; and check that it matches the primary.





2. Copy controlfile from primary to standby :

At primary :

1. RMAN> backup current controlfile for standby;

or

2. RMAN> backup current controlfile for standby format '/tmp/controlfile.ctl';

At standby :

3. RMAN> restore standby controlfile from '/tmp/controlfile.ctl';

4. SQL> shutdown immediate

5. SQL> startup nomount

6. SQL> alter database mount standby database;

7. SQL> recover managed standby database disconnect from session;

8. Check that everything is running fine. If yes, you are done; otherwise continue:

9. SQL> select name from v$datafile; (check all datafiles)

10. RMAN> set newname for datafile 'datafile_name' to 'standby_location'; or

11. RMAN> catalog start with '+diskgroup//datafile/';

12. SQL> alter database add standby logfile group 4 '+standby_location' size 50M; (add the same number of standby logfiles as there are online redo log files)

13. SQL>shutdown immediate

14. SQL>startup nomount

15. SQL>alter database mount standby database;

16. RMAN> recover database; (the incarnation and SCN should be the same as on the primary)

17. SQL> recover managed standby database using current logfile disconnect from session;

18. Check whether MRP is running.

19. Check the log sequence# on both databases.



3. Register logfiles:

1. Locate the missing log files using SQL> select sequence#, applied from v$archived_log;

2. Copy all the missing log files from the primary archive destination to the standby archive destination.

3. SQL> alter database register logfile 'log_file_path';

4. Now check the status of the MRP process.
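Spotting the missing sequence numbers by eye gets tedious on a long list. The sketch below takes the sequence# values, one per line, and prints the gaps. The function name missing_sequences is hypothetical, and it assumes a single redo thread; on RAC, run it once per thread.

```shell
# missing_sequences [FILE...]
# Reads sequence numbers, one per line, from the given files (or stdin).
# Prints every number missing between the lowest and highest value.
# Non-numeric lines such as headers and blank lines are ignored.
missing_sequences() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' "$@" |
        sort -n -u |
        awk 'NR > 1 { for (s = prev + 1; s < $1; s++) print s } { prev = $1 }'
}

# Example, with the sequence# column spooled to a file:
#   missing_sequences /tmp/standby_sequences.txt
```

No output means there are no gaps between the first and last sequence in the list.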

Monday, January 10, 2011

Max Protection vs Max Availability (Data Guard)

1. Regardless of whether log_archive_dest_2 is set to LGWR, AFFIRM or SYNC, the protection mode is maximum performance by default.

2. Maximum availability behaves like maximum performance mode when no standby archive destination is available.

3. Maximum availability remains in maximum availability mode when standby redo logs are not available; in this case the archiver process comes into the picture.

4. Maximum protection mode shuts the primary down automatically if no standby destination is available for log transfer. To avoid a primary instance shutdown, we need at least 2 log_archive_dest destinations.

5. Maximum protection mode shuts down if LGWR, ASYNC and NOAFFIRM are used.

6. Maximum availability mode behaves like maximum performance mode if LGWR, ASYNC and NOAFFIRM are used; it also uses the archiver (ARCH) process instead of LGWR.

7. Maximum availability mode uses the archiver process to at least guarantee the sync.

8. Maximum performance uses LGWR when LGWR, ASYNC, NOAFFIRM is set, instead of ARCH (as in maximum availability).