May 15 2009

RAC Assessment from Oracle

Category: 11g, Database | ittichai @ 7:30 am

We began our first large-scale Oracle 11g RAC with ASM deployment by engaging Oracle. The primary goal of the engagement was to get as much technical input as possible, especially on best practices. Our Oracle account manager advised us to use the free RAC assessment service provided by a special unit called the Oracle RAC Assurance Team. That was the first time I had heard of this service, even though we have worked with Oracle since our first Oracle 9i RAC many years back. Please note that this assessment is not available to all customers as a general offering; a business justification is required.

The focus of the review is to validate system components, pertinent to designed functionality and implementation, in order to assess capabilities needed to sustain and support an Oracle RAC environment. The goal is to determine whether the configuration has the potential to cause availability or integrity problems.

The assessment does not imply certification by Oracle of any hardware or software in the infrastructure.

We were asked to provide the following:

  • Architecture diagrams depicting RAC and application components
  • RDA from each node (Metalink Note 314422.1)
  • Oracle Clusterware information
       $ env
       $ id
       $ cluvfy stage -post crsinst -n all -verbose
       $ crs_stat -t
       $ crsctl query css votedisk
       $ ocrcheck
  • Cluster diagnostic information (using diagcollection.pl from Metalink Note 330358.1)
  • System logs
  • Public and Private interconnection definitions
       $ oifcfg getif
       $ oifcfg iflist
  • Interconnection traffic
       select * from gv$configured_interconnects;
  • Service definition
       set pages 60 space 2 num 8 lines 132 verify off feedback off
       column name format a15
       column network_name format a12
       column failover_method format a10
       column failover_type format a10

       select service_id, name, network_name, failover_method,
        failover_type, goal,dtp, enabled, AQ_HA_NOTIFICATIONS, CLB_GOAL
       from dba_services;
  • Opatch information
       opatch lsinventory -oh $RDBMS_HOME
       opatch lsinventory -oh $ASM_HOME
       opatch lsinventory -oh $CRS_HOME
  • System definitions (/etc/system)
  • Layer 3/Network Configuration
       # dladm show-link
  • Adaptor summary statistics for public, VIP and interconnect
       kstat -n <NIC name>
       netstat -in
       ifconfig -a
  • UDP parameter settings
       # ndd /dev/udp udp_xmit_hiwat
       # ndd /dev/udp udp_recv_hiwat
  • ASM instance information
       set linesize 1500
       set pagesize 1000
       column name format a20
       column path format a40
       column failgroup format a20

       select GROUP_NUMBER, NAME, SECTOR_SIZE, BLOCK_SIZE,
         ALLOCATION_UNIT_SIZE, STATE, TYPE, TOTAL_MB, FREE_MB
       from v$asm_diskgroup
         order by GROUP_NUMBER;

       select GROUP_NUMBER, DISK_NUMBER, NAME, MOUNT_STATUS,
         MODE_STATUS, STATE, REDUNDANCY, FAILGROUP, PATH,
         TOTAL_MB, FREE_MB
       from v$asm_disk
       order by GROUP_NUMBER, DISK_NUMBER;
  • AWR/Statspack report for the peak load period
  • OSWatcher or OS statistics (top, mpstat, iostat, vmstat, sar) for the same period as the AWR/Statspack data – see Metalink Note 301137.1
  • Alert logs and other logs
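
Gathering the command output listed above by hand on every node is tedious. The following is a minimal sketch, not anything Oracle supplied: collect_outputs is a hypothetical helper that runs each command string and appends its output to a single report file, noting failures rather than stopping. The report path in the example call is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Oracle tooling).
# Usage: collect_outputs <report-file> <command> [<command> ...]
collect_outputs() {
    report=$1
    shift
    : > "$report"    # start with an empty report file
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$report"
        # Run via sh -c so a command and its arguments can be passed as one
        # string; record a failure instead of aborting the whole collection.
        sh -c "$cmd" >> "$report" 2>&1 ||
            printf '(command failed: exit %s)\n' "$?" >> "$report"
    done
}

# Example call on one node, using commands from the list above
# (the report path is an assumption):
# collect_outputs "/tmp/rac_info_$(hostname).txt" \
#     "env" "id" "cluvfy stage -post crsinst -n all -verbose" \
#     "crs_stat -t" "crsctl query css votedisk" "ocrcheck" \
#     "oifcfg getif" "oifcfg iflist"
```

Run it as the Oracle software owner on each node so that env and id reflect the account the assessment team is interested in.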

After a few weeks, Oracle came back with recommendations from evaluating our configuration against best practices compiled from its experience with its global RAC customer base. I found them very helpful; I would not have known about some of them without this analysis. Below are some of the recommendations given for our specific environment (Oracle 11.1.0.7 with ASM on Solaris 10). Please keep in mind that they may not apply to your environment, so consult Oracle Support and/or the referenced Metalink Notes if you have any questions.

These are some best practices and recommendations:

  • Whenever possible use a non-shared Oracle Home for ASM and the RDBMS. In most cases the non-shared Oracle Home is the preferred solution due to the ability to patch in a rolling upgrade fashion with zero downtime and the elimination of the single point of failure and binary dependency issues that the shared Oracle Home approach introduces.
  • Do not install the CRS Oracle Home using a shared file system location.
  • Ensure that run-time files in the /var/tmp/.oracle or /tmp/.oracle directories are not removed while Oracle background or user processes are active. These directories contain a number of “special” socket files that are used by local clients to connect via the IPC protocol (sqlnet) to various Oracle processes, including the TNS listener, the CSS, CRS & EVM daemons, and even the database instance.
  • Add the lines "slewalways yes" and "disable pll" to /etc/inet/ntp.conf to avoid a node reboot due to a leap second event. Refer to Metalink Note: 759143.1.
  • On Solaris, set the UDP buffer sizes (udp_xmit_hiwat and udp_recv_hiwat) to 65536; this should improve cluster interconnect throughput. Refer to Metalink Note: 181489.1. Create an RC script that sets udp_xmit_hiwat and udp_recv_hiwat to 65536 every time the server reboots.
        ndd -set /dev/udp udp_xmit_hiwat 65536
        ndd -set /dev/udp udp_recv_hiwat 65536
  • Ensure that both OCR and Voting disks are backed up periodically. The OCR backups are taken automatically, and stored on the Master RAC node. Voting disks can be backed up manually using dd with a 4k block size (hot backup).
  • Do not use crs_unregister to remove default CRS resources (nodeapps) unless instructed by Oracle Support.
  • Ensure that the CSS DIAGWAIT parameter is set to 13 seconds on all platforms. DIAGWAIT is a CSS parameter built into 10.1.0.5+, 10gR2 and 11g that allows enough time for a failed node to flush its final trace files, which helps in debugging a node failure. Always keep CSS misscount greater than the DIAGWAIT setting. Refer to Metalink Note: 559365.1.
  • Increase PARALLEL_EXECUTION_MESSAGE_SIZE from the default (2k), to 4k or 8k in order to provide better recovery slave performance.
  • Set PARALLEL_MIN_SERVERS to CPU_COUNT-1, in order to pre-spawn recovery slaves at startup, instead of when recovery is required.
  • On each RDBMS instance supported by ASM, add the following SHARED_POOL allocations, based on each Disk Group type in use:
    - For disk groups using external redundancy, every 100GB of space needs 1MB of extra shared pool plus 2MB
    - For disk groups using normal redundancy, every 50GB of space needs 1MB of extra shared pool plus 4MB
    - For disk groups using high redundancy, every 33GB of space needs 1MB of extra shared pool plus 6MB
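
As a worked example of the shared pool sizing rule above, here is a small sketch. The function name asm_extra_shared_pool_mb is hypothetical, and rounding a partial block of space up to a full 1MB is my assumption; the rule as given does not say how to round.

```shell
#!/bin/sh
# Hypothetical helper illustrating the sizing rule.
# Usage: asm_extra_shared_pool_mb <external|normal|high> <disk-group-size-in-GB>
asm_extra_shared_pool_mb() {
    case $1 in
        external) gb_per_mb=100; fixed_mb=2 ;;
        normal)   gb_per_mb=50;  fixed_mb=4 ;;
        high)     gb_per_mb=33;  fixed_mb=6 ;;
        *) echo "unknown redundancy: $1" >&2; return 1 ;;
    esac
    # Assumption: round up, so a partial block of space still costs 1MB.
    echo $(( ($2 + gb_per_mb - 1) / gb_per_mb + fixed_mb ))
}

# Example: a 500GB disk group under each redundancy type.
asm_extra_shared_pool_mb external 500   # prints 7  (5MB + 2MB)
asm_extra_shared_pool_mb normal 500     # prints 14 (10MB + 4MB)
```

Sum the results for every disk group the instance uses and add the total to the existing shared pool size.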

  • Avoid bad writes to ASM disks by avoiding RAID 5. See NOTE: 30286.1 – I/O Tuning with Different RAID Configurations.
  • Ensure the Network Listener is configured to use an address-list that specifies an IPC connection before the TCP connections. This will avoid fail-over delays associated with TCP timeouts. Refer to Metalink Note: 403743.1.
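
For the UDP buffer recommendation above, the RC script could look something like the following. This is a sketch under assumptions: the script name and its location (for example /etc/init.d/udptune, linked into /etc/rc2.d) are hypothetical, and only the two ndd commands come from the recommendation itself.

```shell
#!/bin/sh
# Hypothetical RC script, e.g. /etc/init.d/udptune linked into /etc/rc2.d
# (the name and location are assumptions). Must run as root.
UDP_HIWAT=65536

# Thin wrapper so the ndd invocation lives in one place.
set_udp_param() {
    ndd -set /dev/udp "$1" "$2"
}

# Apply both settings; stop at the first failure.
apply_udp_settings() {
    set_udp_param udp_xmit_hiwat "$UDP_HIWAT" &&
    set_udp_param udp_recv_hiwat "$UDP_HIWAT"
}

case "${1:-}" in
    start) apply_udp_settings ;;
    *)     : ;;    # nothing to do for stop or any other action
esac
```

After a reboot, the two ndd /dev/udp queries shown earlier should report 65536 for both parameters.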

In conclusion, I’m very impressed with the information provided and the professionalism shown by the Oracle RAC assessment team. This has definitely given us (as well as management) more confidence in our new setup.



May 04 2009

Informatica 8.6.1 Hot Fix 2 – Object Check-In is Slow

Category: Database, ETL | ittichai @ 7:51 am

If you’re using Informatica as your ETL tool, as we do, this may be useful if you’re upgrading to the latest version, 8.6.1 Hot Fix 2.

During testing, our ETL developers complained that object check-in took a long time (sometimes up to 2 minutes, compared to only a few seconds in the existing environment). We could see no significant load or waits on the repository database. My gut feeling was that something was missing. We opened a support case with Informatica and, sure enough, the support team asked us to try adding new indexes as follows:

CREATE INDEX OPB_VALIDAT_IDX_SR
ON OPB_VALIDATE (
OBJECT_ID, OBJECT_TYPE,
VERSION_NUMBER
);

CREATE INDEX OPB_COMPONE_IDX_SR
ON OPB_COMPONENT (
REF_OBJ_ID, OBJECT_TYPE
);

With these new indexes, the check-in is now instant!



May 01 2009

Oracle RAT’s Workload Capture With Duration Set Does Not Stop Automatically

Category: 11g, Database | ittichai @ 8:18 am

One of the most exciting features of Oracle 11g is Oracle Real Application Testing (RAT). Fortunately, Oracle has extended support for this feature back to previous versions. Even though workload replay is only possible on Oracle 11g, workload capture is now available in Oracle 9i and 10g. See Metalink Note: 560977.1 – Real Application Testing for Earlier Releases for more details.

I tried the capture on Oracle 9.2.0.8 with all the required patches mentioned in the above Metalink note, using the scripts provided by Oracle.

Everything worked except for one minor issue, which Oracle Support later identified as an unpublished bug: a workload capture initiated by DBMS_WORKLOAD_CAPTURE.START_CAPTURE does not stop automatically, even when the duration parameter is specified. For your reference, the bug number is 6068696 – “Gen V111 (74) CAPTURE WITH DURATION SET DOES NOT STOP AUTOMATICALLY.” As expected, no fix will be backported to 9.2.0.8. The only workaround is to stop the capture manually, as the session below shows.

-- Check Date/Time before start
SQL> !date
Fri Apr 17 09:43:39 CDT 2009

-- Check for any existing capture
SQL> select NAME, DBNAME, DBVERSION, STATUS, START_TIME from DBA_WORKLOAD_CAPTURES
where STATUS <> 'COMPLETED';

no rows selected

-- Start capture
SQL> BEGIN
DBMS_WORKLOAD_CAPTURE.START_CAPTURE (name => 'TSDW_CAPTURE_TEST',
dir => 'CAPTURE_DIR_FA_TSDW',
duration => 30); -- duration in seconds
END;
/

PL/SQL procedure successfully completed.

-- Verify that capture is running
SQL> select NAME, DBNAME, DBVERSION, STATUS, START_TIME from DBA_WORKLOAD_CAPTURES
where STATUS <> 'COMPLETED';

NAME               DBNAME DBVERSION   STATUS        START_TIME
------------------ -----  ----------- ------------- --------------------
TSDW_CAPTURE_TEST  TSDW   9.2.0.8.0   IN PROGRESS   Apr 17 2009 09:43:47

-- Check point after about one minute
SQL> !date
Fri Apr 17 09:44:42 CDT 2009

-- Capture is still running - ok, let's give some more time...
SQL> select NAME, DBNAME, DBVERSION, STATUS, START_TIME from DBA_WORKLOAD_CAPTURES
where STATUS <> 'COMPLETED';

NAME               DBNAME DBVERSION   STATUS        START_TIME
------------------ -----  ----------- ------------- --------------------
TSDW_CAPTURE_TEST  TSDW   9.2.0.8.0   IN PROGRESS   Apr 17 2009 09:43:47

-- Next check point - almost 2 minutes past
SQL> !date
Fri Apr 17 09:45:48 CDT 2009

-- Still running
SQL> select NAME, DBNAME, DBVERSION, STATUS, START_TIME from DBA_WORKLOAD_CAPTURES
where STATUS <> 'COMPLETED'; 

NAME               DBNAME DBVERSION   STATUS        START_TIME
------------------ -----  ----------- ------------- --------------------
TSDW_CAPTURE_TEST  TSDW   9.2.0.8.0   IN PROGRESS   Apr 17 2009 09:43:47

-- Use finish_capture manually
SQL> BEGIN
DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE ();
END;
/

PL/SQL procedure successfully completed.

-- Gone. No more capture.
SQL> select NAME, DBNAME, DBVERSION, STATUS, START_TIME from DBA_WORKLOAD_CAPTURES
where STATUS <> 'COMPLETED';

no rows selected
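
Since the capture has to be stopped by hand, one way to avoid forgetting it is to schedule the FINISH_CAPTURE call yourself. The following is only a sketch of that idea: the function names are hypothetical, and it assumes sqlplus is on the PATH and that OS authentication as SYSDBA works on the capture host.

```shell
#!/bin/sh
# run_sql reads SQL on stdin and runs it as SYSDBA.
# Assumptions: sqlplus is on the PATH and OS authentication is configured.
run_sql() {
    sqlplus -s "/ as sysdba"
}

# Hypothetical helper: wait the intended capture duration (in seconds),
# then finish the capture manually.
stop_capture_after() {
    sleep "$1"
    run_sql <<'EOF'
WHENEVER SQLERROR EXIT FAILURE
BEGIN
  DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE ();
END;
/
EXIT
EOF
}

# Example: right after START_CAPTURE returns, run this in a second terminal:
# stop_capture_after 30
```

Afterwards, rerun the DBA_WORKLOAD_CAPTURES query from the session above to confirm that no capture is still in progress.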
