May 16 2010

Oracle 10.2.0.3 Gap Resolution of Physical Standby Appears to Hang

Category: 10g,Data Guard,Database,Standbyittichai @ 5:39 pm

Last week we’ve encountered the archived-log-gap-missing issue on Oracle 10.2.0.3 physical standby. Looking from V$MANAGED_STANDBY, it showed waiting for gap of the archived log sequence #53713.

SQL>  SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, BLOCK#, BLOCKS
FROM  V$MANAGED_STANDBY;

PROCESS   STATUS          THREAD#   SEQUENCE#     BLOCK#     BLOCKS
--------- ------------ ----------  ---------- ---------- ----------
ARCH      CONNECTED              0          0          0          0
ARCH      CLOSING                2      58067     102401       1360
ARCH      CONNECTED              0          0          0          0
ARCH      CONNECTED              0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   2      58068          2       1823
RFS       IDLE                   0          0          0          0
MRP0      WAIT_FOR_GAP           1      53713          0          0

The alert log showed similarly with additional information saying that it was trying to open the missing log and the Fetch Archive Log (FAL) was requesting the gap sequence from primary, but all failed.

Media Recovery Log  /opt/oracle/admin/PROD/standby/arch_1_53713_645984751
Error opening  /opt/oracle/admin/PROD/standby/arch_1_53713_645984751
Attempting  refetch
Media Recovery Waiting for thread 1 sequence 53713
Fetching  gap sequence in thread 1, gap sequence 53713-53713
FAL[client]:  Error fetching gap sequence
Wed May 12 15:26:44 2010
FAL[client]:  Failed to request gap sequence
GAP - thread 1 sequence 53713-53713
DBID  2748812654 branch 645984751
FAL[client]: All defined FAL servers  have been attempted.

Interestingly, the specific archived log file (arch_1_53713_645984751) existed already on the standby server with proper permission. Not only that, Data Guard was aware of it when querying the v$archived_log.

select sequence#, name, archived, applied from  v$archived_log
where sequence# like '%53713%';

SEQUENCE#  NAME                                                    ARC APP
---------- ------------------------------------------------------- --- ---
53713      /opt/oracle/admin/PROD/standby/arch_3_53713_645984751   YES NO

An attempt to register it again obviously failed.

SQL> ALTER DATABASE REGISTER LOGFILE '/opt/oracle/admin/PROD/standby/arch_1_53713_645984751';
ALTER DATABASE REGISTER LOGFILE '/opt/oracle/admin/PROD/standby/arch_1_53713_645984751'
*
ERROR  at line 1:
ORA-16089: archive log has already been registered

At this point, it did not seem like anything we could do. Even trying to copy the archived file to another location and re-register with a different path did not help either.

Reading blog post by Jason, this might be a bug as described in the document ID 5576816.8 – “Bug 5576816  FAL gap resolution does not work with max_connection set in some scenario.” The bug’s description says that “Gap resolution appears to hang after exhausting disk space on the standby system. The hang persists even after additional disk space is made available.” This seems to fit our incident because before having this issue, the disk space on the standby was filled up. And even after cleanup, the gap resolution appeared to hang.

The suggested workaround from the document is to disable parallel archival and bounce the primary instance. Um…

The restart of the primary instance was something we’re hesitant to do especially when without a clear description of what it will do. Even though a suggestion of using “kill -9″ on the archiver (ARC) OS processes in the post’s comments with no downtime was very temping, it was still obviously too risky to do in the production environment.

Fortunately, while looking into other options, we’ve found this blog from Andy. It’s interesting that once we tried “alter database register” again but now with “or replace” keyword, it immediately fixed this issue.

ALTER DATABASE REGISTER OR REPLACE PHYSICAL LOGFILE '/opt/oracle/admin/BMCPD/standby/arch_1_53713_645984751';

Tags: , , , , , ,


Feb 12 2008

What is the current status of Data Guard?

Category: Data Guard,Standbyittichai @ 11:35 pm

While working in one of Data Guard projects, I was asked about the current sync status of standby. I started talking about gap, SCN, or applied and received archive log numbers, and so on, which obviously did not make sense to business users. The only thing they’d like to know is whether or not data at standby is up-to-date with that of primary, or if not, what date/time of standby data it is at right now.

All existing scripts we do have in house are just checking the received or applied SCNs or thread# of archive logs. In order to make sense out of these numbers for business users or managers, I will need to convert them into date and time. Fortunately, I recalled the “scn_to_timestamp” function which allows me to convert the SCN number to its corresponding timestamp.

I can get the current_scn from v$database of the standby. However, in the case of physical standby, when database is being mounted (recovery mode), this function does not work. It should work fine with the logical standby which is opened all the time.

SQL> select scn_to_timestamp(current_scn) from v$database;
select scn_to_timestamp(current_scn) from v$database
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01219: database not open: queries allowed on fixed tables/views only

To resolve this, once getting the current_scn on standby, I can log on to the primary and get the time stamp corresponding to the SCN number.

Sample below is the KSH script to obtain the date/time based on a SCN. Note that the syntax to check the gap is added as well. The account connecting to primary just needs SELECT_CATALOG_ROLE.

# Get SCN number from DR
DRSCN=/tmp/drscn$$.log
${SQLPLUS} -s /nolog <<EOF
connect / as sysdba
set heading off
set feedback off
col current_scn format 999999999999999999999999
spool ${DRSCN}
select current_scn from v\$database;
spool off
exit
EOF
DR_SCN=`cat ${DRSCN}`

rm ${DRSCN}

# Get Timestamp from Primary
TSDR=/tmp/tsdr$$.log
TSSTAT=/tmp/tsstat$$.log
${SQLPLUS} -s /nolog &lt&ltEOF
connect user/password@PRIMARY
set heading off
set feedback off
spool ${TSDR}
select
to_char(scn_to_timestamp(${DR_SCN}),'MM/DD/YYYY HH24:MI')
from dual;
spool off
spool ${TSSTAT}
select
case
when
scn_to_timestamp(${DR_SCN}) > systimestamp - interval '1' hour
  then 'OK'
when
scn_to_timestamp(${DR_SCN}) > systimestamp - interval '2' hour
  then 'WARNING'
else 'CRITICAL'
end "STATUS"
from dual;
spool off
exit
EOF

Tags: ,