May 16 2010

Oracle 10.2.0.3 Gap Resolution of Physical Standby Appears to Hang

Category: 10g,Data Guard,Database,Standbyittichai @ 5:39 pm

Last week we’ve encountered the archived-log-gap-missing issue on Oracle 10.2.0.3 physical standby. Looking from V$MANAGED_STANDBY, it showed waiting for gap of the archived log sequence #53713.

SQL>  SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, BLOCK#, BLOCKS
FROM  V$MANAGED_STANDBY;

PROCESS   STATUS          THREAD#   SEQUENCE#     BLOCK#     BLOCKS
--------- ------------ ----------  ---------- ---------- ----------
ARCH      CONNECTED              0          0          0          0
ARCH      CLOSING                2      58067     102401       1360
ARCH      CONNECTED              0          0          0          0
ARCH      CONNECTED              0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   0          0          0          0
RFS       IDLE                   2      58068          2       1823
RFS       IDLE                   0          0          0          0
MRP0      WAIT_FOR_GAP           1      53713          0          0

The alert log showed similarly with additional information saying that it was trying to open the missing log and the Fetch Archive Log (FAL) was requesting the gap sequence from primary, but all failed.

Media Recovery Log  /opt/oracle/admin/PROD/standby/arch_1_53713_645984751
Error opening  /opt/oracle/admin/PROD/standby/arch_1_53713_645984751
Attempting  refetch
Media Recovery Waiting for thread 1 sequence 53713
Fetching  gap sequence in thread 1, gap sequence 53713-53713
FAL[client]:  Error fetching gap sequence
Wed May 12 15:26:44 2010
FAL[client]:  Failed to request gap sequence
GAP - thread 1 sequence 53713-53713
DBID  2748812654 branch 645984751
FAL[client]: All defined FAL servers  have been attempted.

Interestingly, the specific archived log file (arch_1_53713_645984751) existed already on the standby server with proper permission. Not only that, Data Guard was aware of it when querying the v$archived_log.

select sequence#, name, archived, applied from  v$archived_log
where sequence# like '%53713%';

SEQUENCE#  NAME                                                    ARC APP
---------- ------------------------------------------------------- --- ---
53713      /opt/oracle/admin/PROD/standby/arch_3_53713_645984751   YES NO

An attempt to register it again obviously failed.

SQL> ALTER DATABASE REGISTER LOGFILE '/opt/oracle/admin/PROD/standby/arch_1_53713_645984751';
ALTER DATABASE REGISTER LOGFILE '/opt/oracle/admin/PROD/standby/arch_1_53713_645984751'
*
ERROR  at line 1:
ORA-16089: archive log has already been registered

At this point, it did not seem like anything we could do. Even trying to copy the archived file to another location and re-register with a different path did not help either.

Reading blog post by Jason, this might be a bug as described in the document ID 5576816.8 – “Bug 5576816  FAL gap resolution does not work with max_connection set in some scenario.” The bug’s description says that “Gap resolution appears to hang after exhausting disk space on the standby system. The hang persists even after additional disk space is made available.” This seems to fit our incident because before having this issue, the disk space on the standby was filled up. And even after cleanup, the gap resolution appeared to hang.

The suggested workaround from the document is to disable parallel archival and bounce the primary instance. Um…

The restart of the primary instance was something we’re hesitant to do especially when without a clear description of what it will do. Even though a suggestion of using “kill -9″ on the archiver (ARC) OS processes in the post’s comments with no downtime was very temping, it was still obviously too risky to do in the production environment.

Fortunately, while looking into other options, we’ve found this blog from Andy. It’s interesting that once we tried “alter database register” again but now with “or replace” keyword, it immediately fixed this issue.

ALTER DATABASE REGISTER OR REPLACE PHYSICAL LOGFILE '/opt/oracle/admin/BMCPD/standby/arch_1_53713_645984751';

Tags: , , , , , , , ,