Tuesday, February 23, 2010

RMAN Performance Tuning Diagnostics

RMAN Performance Tuning Diagnostics [ID 311068.1]



PURPOSE
-------

Provide diagnostic reference for collecting RMAN backup
performance diagnostic information.

SCOPE & APPLICATION
-------------------

You should have some skills in database performance evaluation.

RMAN Performance Tuning Diagnostics
-----------------------------

Collecting performance information regarding RMAN backup
performance problems, bottlenecks come from somewhere. Not all tracing
shown here is required. It is up to you to determine which will help
you better define and isolate your performance issue. Collecting the
right information can help to quickly identify and resolve any Oracle
bottlenecks and provide evidence when it a 3rd party issue.

What should be collected?

1. RMAN debug output
2. Channel debug=5 trace=1 or 5 output trace files.
3. STATSPACK reports
4. If the CPU and slow sql (rather than IO) is the issue, the 10046 trace
and tkprof.
5. The resulting trace can be used with tkprof to get the explain plan
6. Collecting data from v$backup_async_io
7. Backup to tape is slow, how to determine if it's an Oracle or a media issue?

Prerequisites:

Make sure timed_statistics = true for all performance testing.
Always set the NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS' to get precision
time in the RMAN output.


1. RMAN debug output
This will always provide information to help the issue if RMAN is at fault.
Oracle will produce a useful error as sqlcode= or rc= in the debug. If you
do not see any errors and the channel released due to errors it is likely
the failure occurred outside of RMAN and failed in Oracle (channels) or the
device output used for backup.

2. Channel debug=5 trace=5 output trace files.
trace 1 - general trace info for skg and sbt
trace 5 - verbose channel tracing of skg and sbt calls

CONFIGURE CHANNEL DEVICE TYPE DISK DEBUG=5 TRACE 1;
or
CONFIGURE CHANNEL DEVICE TYPE SBT PARMS "ENV=(....) DEBUG=5 TRACE 5;
Trace is only useful for tape backup and restore tracing.

Example debug=5 output:
-----------------------

krbbpc: 04/29/2005 18:03:02: backup piece 1 started,
outfname=test2_ts_STXDB01Q_2agj3kam_T20050429_s74_s74_p1

-- cut to reduce output --

krbbpc: 04/29/2005 18:03:12: backup piece 1 finished,
handle=test2_ts_STXDB01Q_2agj3kam_T20050429_s74_s74_p1
elapsed time: 0:10
performance statistics:
count of waits for each file:
38: /odb/stx02/oradata07/stxdb01q/ts_impwrctr1_data_f1.dbf
output file throughput graph:
samples taken at 1 second intervals
bytes/second: min=131072, max=13107200, mean=4915200, stddev=4586958


3. STATSPACK reports
Statspack reports should be generated for the time during the backups to
capture the top wait events, the top sql, and pl/sql consuming the
resources. You can quickly see if the RMAN sql is taking the system resources
or is competing for resources. The reports should be at level 10 and split
into 15 minute intervals. This example shows a backup spending 80% of the
time waiting for sbtwrite2. So Oracle is waiting on tape IO.

Example output:
--------------

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
sbtwrite2 1,002 22 80.68
DFS lock handle 121 1 4.00
CPU time 1 3.60
control file sequential read 1,290 1 3.26
process startup 2 1 3.07

More can be determined from statspack and/or AWR output reports. The first main
indicator to look for bottlenecks are the top waits events.

Common Issue:
-------------

sbtwrite2 --> waiting for tape
sbtbackup --> waiting for tape
CPU Time --> sql or pl/sql to compile or runtime loop issue
IO Done --> waiting for IO slaves


4. If the CPU and slow sql (rather than IO) is the issue, the 10046 event trace
and tkprof. In the RMAN script add the sql 10046 event:

RMAN> sql "alter session set event ''10046 trace name context, level 12''";

Usually, slower performance sql will be related to the catalog, the 10046 trace
is best performed on the catalog database unless you are specifically looking
at sql performance in the target database. If there is a catalog in use, make
sure to review note:247611.1 "Known RMAN Performance Problems". This
provides new indexes and other workarounds that can be used to remedy known
issues for a variety of performance problems.


5. The resulting trace can be used with tkprof to get the explain plan
and cost of the sql execution.

- tkprof tracefile outfile [explain=user/password] [options...]
- TKPROF filename_source filename_output EXPLAIN=[username/password]
SYS=[YES/No] TABLE=[tablename]
$ tkprof ora_xxxx.trc tkprof.out explain=use/pwd sys=yes table=table_name

We always want sys=yes because the target is backed up by SYS AS SYSDBA so all
commands will run under the sys schema.

TABLE
Specifies the schema and name of the table into which TKPROF temporarily places
execution plans before writing them to the output file. If the specified table
already exists, TKProf deletes its rows then uses it for the EXPLAIN PLAN
command. If this table does not exist, TKProf creates, uses, then drops it.
This parameter is ignored if the EXPLAIN parameter isn't used.

6. Collecting data from V$BACKUP_ASYNC_IO

Determine how much time Oracle took to open the file to backup then close. In
this example the backup of datafile "/data17/mds_sm_odr_apr2004_tab01.dbf"
took 1 hour and 28 minutes. The archivelogs from sequence# 45411 until sequence#
45420 were all placed in a single backup that completed in 11 seconds. See Note:237083.1.

Example output
--------------

V$BACKUP_ASYNC_IO:

FNAME OPEN CLOSE H:MM
------------------------------------ -------------------- -------------------- ----
/data17/mds_sm_odr_apr2004_tab01.dbf 23-MAR-2005 02:53:52 23-MAR-2005 04:21:49 1:28
/arch/mdsp_arch1_45415_518895446.log 23-MAR-2005 04:01:19 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45417_518895446.log 23-MAR-2005 04:01:19 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45412_518895446.log 23-MAR-2005 04:01:19 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45418_518895446.log 23-MAR-2005 04:01:19 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45420_518895446.log 23-MAR-2005 04:01:19 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45419_518895446.log 23-MAR-2005 04:01:21 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45416_518895446.log 23-MAR-2005 04:01:21 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45414_518895446.log 23-MAR-2005 04:01:21 23-MAR-2005 04:09:53 :08
/arch/mdsp_arch1_45411_518895446.log 23-MAR-2005 04:09:53 23-MAR-2005 04:11:58 :02
/arch/mdsp_arch1_45413_518895446.log 23-MAR-2005 04:09:53 23-MAR-2005 04:11:58 :02
rigg1rac_1_1 23-MAR-2005 04:01:19 23-MAR-2005 04:12:05 :11


7. Backup to tape is slow, how to determine if it's an Oracle or a media issue?
Backup a tablespace that represents a good amount of data. But make sure
there is enough room to write to local disk also.
a. backup tablespace to tape # Time to tape?
RMAN> run {
allocate channel t1 device type 'sbt_tape'
parms 'SBT_LIBRARY='
debug=5 TRACE 2;
backup tablespace ;
}

b. backup tablespace to disk api # Time to disk?
RMAN> run {
allocate channel t1 device type 'sbt_tape'
parms 'SBT_LIBRARY=oracle.disksbt,ENV=(BACKUP_DIR=/path/backup)'
debug=5 TRACE 2;
backup tablespace ;
}

c. backup validate tablespace # Time to just read the data no output
RMAN> backup validate tablespace ;

With the debugging on the channels you can compare the difference in time it
takes to read the data during validate and to backup the same data to disk and
tape using the same sbt calls that are used by RMAN in either case.

Oracle9i-->

CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS
'SBT_LIBRARY=oracle.disksbt,ENV=(BACKUP_DIR=/path/backup)' debug=5 TRACE 2;

allocate channel t1 device type 'sbt_tape' parms
'SBT_LIBRARY=oracle.disksbt,ENV=(BACKUP_DIR=/path/backup)' debug=5 TRACE 2;

REFERENCES
==========

Note.94224.1 FAQ- Statspack Complete Reference
Note 237083.1 Using V$BACKUP_ASYNC_IO / V$BACKUP_SYNC_IO to Monitor RMAN Performance
Note 247611.1 "Known RMAN Performance Problems".

No comments: