Oracle 911 - Emergency Oracle Support

by Donald K. Burleson

I work in a worldwide Oracle emergency support center (www.remote-dba.net) where I receive calls from panicked clients all over the world, each experiencing a serious loss of performance on their mission-critical Oracle databases. I have worked on systems where poor performance shut-down entire factories, leaving thousands of workers sitting on their hands waiting for me to do something magical to re-start their assembly line. I have worked with hospital image delivery systems where patients were cut wide open, on the operating table, the doctors waiting for the information they need to save lives. I have worked on financial systems where downtime is measured in tens of thousands of dollars per minute.

Working late nights as an emergency support DBA comes with huge stress levels. Most of these databases are brand new to me, and I have only a few minutes to access the situation and devise a plan to quickly relieve their bottleneck. I work with emergency Oracle performance problems every week, and I only get called when the easy remedies have already been exhausted. When I ask why their DBA is not helping, I generally hear one of the following stories:

We outsourced our DBA staff to India. They can’t figure it out and suggested that we call you.
Our DBA is on a three-week hiking vacation in the Rocky Mountains.
Our DBA quit six months ago and we haven’t had time to find a replacement.
We hired a guy and got him OCP certified, but we don’t think he knows what he is doing.
We called MetaLink Support and we don’t understand their instructions. They said that we could call you.

When a production database is in crisis, costs are never an issue, but minimizing downtime is of the essence. The clients demand a fast-fix, and this often requires unconventional methods.

The Unconventional Methods of Emergency Support

Unlike academic and scientific Oracle DBAs who demand proofs and reproducible results before making an Oracle change, the emergency Oracle DBA has no such luxury. The emergency DBA must use every weapon at their disposal to get the client running as quickly as possible. These unconventional methods are almost always driven by the client who does not appreciate the benefits of an elegant, long-term fix for the root cause of the problem.

Clients are impatient and they often insist on symptom-treating stop-gap remedies that are neither elegant nor comprehensive. In many cases, the client does not want to hear the time-consuming tasks that are required to address the root cause of the problem.

Oracle Silver Bullets

Contrary to the pontifications of theoreticians and ivory-tower academics, there are many silver bullets for Oracle performance tuning. By silver bullet, I mean a small set of commands that quickly relieves an acute performance bottleneck. Some of these techniques of just-in-time tuning have been codified in Oracle10g with the Automatic Memory Management (AMM) facility where the SGA regions are changed dynamically to meet changing demands in processing.

Let’s take a close look at the real-world silver bullets that I have encountered over the past 12 years of emergency Oracle support. The following stories are true, verifiable accounts (I have witnesses!) of Oracle databases where a fast fix was used to relieve an acute performance problem:

Fix missing CBO statistics
Replace an obsolete statistics gathering method
Initialize missing Oracle instance parameters
Add missing indexes
Implement cursor_sharing=force
Implement the KEEP pool for small-table full-scanned tables
Change the CBO optimizer parameters
Add additional SGA RAM
Employ Materialized Views
Create Bitmap indexes
Add freelists
Windows Oracle issues

To those scientist DBAs who demand “proofs” that these methods worked, you are not going to find it in this article. Most of the production systems that I tune have hundreds of segments and thousands of concurrent users. Trying to test a hypothesis on a large running database is like trying to tune your car while it’s flying down the freeway at 75 miles per hour. It is impossible to reproduce the conditions of a complex performance breakdown, and the emergency support DBA is forced to rely on experience and antidotal evidence to guide their actions. (If you want proof that these techniques work, just send me an e-mail.)

Anyway, let’s examine each of these silver bullets and see how a well-placed silver bullet can save the day.

Fix Missing CBO Statistics

The call came in from a client in Florida who had just moved their system into production and was experiencing a serious performance problem. Upon inspection, we found optimizer_mode=choose and only one table with statistics. The DBA told me that she was running cost-based and she seemed totally unaware of the requirement to analyze the schema for CBO statistics. Here is how I recall the conversation:

DB – “How are you collecting statistics?”

DBA – “We have BMC Patrol.”

DB – “No, no, how are you getting SQL optimizer statistics?”

DBA – “The Oracle sales rep said that the CBO was intelligent, so I assumed it was getting its own statistics.”

You know, in a way, she was right. The problem started when she wanted to know the average row length for a table. She did a Google search and discovered that it was in the dba_tables.avg_row_len column. When she found it null, she went to MetaLink and learned that an analyze table command would fill in the avg_row_len column.

As we know, when using optimizer_mode=choose with only one table analyzed, any SQL that touches the table will be optimized as a cost-based query and the CBO will dynamically estimate statistics for all tables with missing statistics. In this case, a multi-step silver bullet did the trick:

alter table xxx delete/drop??? statistics;           
exec dbms_stats (…);

The system immediately returned to an acceptable performance level and the DBA learned about the importance of providing complete and timely statistics for the CBO using the dbms_stats utility.

Repair Obsolete CBO Statistics Gathering

A shop called from Australia complaining about a serious degradation in SQL performance after implementing partitioned tablespaces in a 16-CPU Solaris 64-bit Oracle 9.0.4 system. They said that they thoroughly tested the change in their development and QA instances and they could not understand why they system was grinding to a halt.

Upon inspection, it turned out that they were using analyze table and analyze index commands to gather their CBO statistics. As we may know, the dbms_stats utility gathers partition-wise statistics. There was not time to full a deep-sample collection, so a dbms_stats was issued with a 10% sample size. Note that I parallelized it with 15 parallel processes to speed-up the statistics collection:

exec dbms_stats.gather_schema_stats( -
   ownname          => 'SAPR4', -
   options          => 'GATHER AUTO', -
   estimate_percent => 10, -
   method_opt       => 'for all columns size repeat', -
   degree           => 15 -
)

This took less than 30 minutes and the improved CBO statistics tripled the performance of the entire database.

Initialize Missing Oracle Instance Parameters

I got a call from a client in California who said that their performance was getting progressively worse as more customers accessed the Oracle database. Upon inspection I discovered that their db_cache_size parameter was not present in the init.ora file. A quick instance bounce (to re-set sga_max_size and db_cache_size) resulted in a 400% performance improvement.

In another memorable case, I received a call from a data warehouse client in California who complained that their performance degraded as the database grew. I quick look revealed that the sort_area_size parameter was missing and defaulting to a tiny value. Again, a change of sort_area_size=1048575, a quick bounce of the instance, and overall database performance improved by more than 50%.

Adding Missing Indexes

An Oracle Financial application shop in New York called and said that their performance degraded as more data was entered into the tables. A quick check of v$sql_plan using the plan9i.sql script (see code depot for script) looked like this:

                     Full table scans and counts

OWNER      NAME                      NUM_ROWS C K   BLOCKS NBR_FTS
---------- ------------------------ -------- - - -------- -------

APPLSYS    FND_CONC_RELEASE_DISJS       14,293 N       4,293 498,864
APPLSYS    FND_CONC_RELEASE_PERIODS    384,173 N      67,915 134,864
DONALD     PERSON_LOGON_ID          18,263,390 N     634,272   96,212
DONALD     SITE_AMDMNT               2,371,232 N      51,020   50,719
DONALD     CLIN_PTCL_VIS_MAP        23,123,384 N     986,395   11,273

Here we see a huge number of large-table full-table scans. A quick look into v$sql revealed that the rows returned by each query was small, and a common WHERE clause for many queries looked like this:

WHERE customer_status = ‘:v1’ and customer_age > :v2;

A quick creation of a concatenated index on customer_status and customer_age resulted in a 50x performance improvement and reduced disk I/O by more than 600%.

In another memorable case on an 8.1.6 database, my access.sql script revealed suspect large-table full-table scans:

                  Full table scans and counts
                                        
OWNER      NAME                     NUM_ROWS    C K   BLOCKS  NBR_FTS     
---------- -------------------- ------------    - - -------- --------     
APPLSYS    FND_CONC_RELEASE_DISJS     1,293,292 N K  65,282   498,864     
APPLSYS    FND_CONC_RELEASE_PERIODS   4,373,362 N K  62,282   122,764     
APPLSYS    FND_CONC_RELEASE_STATES      974.193 N K   9,204    98,122     
APPLSYS    FND_CONC_PP_ACTIONS          715,021 N     6,309    52,036     
APPLSYS    FND_CONC_REL_CONJ_MEMBER      95,292 N K   4,409    23,122

In this case I noted immediately that the full-scans were not necessary because all queries had this highly-selective entry in the WHERE clause:

WHERE to_char(order_date,’MON-DD’)=:v1;

The DBA had created an index on the order_date column and was surprised that their order_date index was not being used, primarily because their boss was too cheap to pay for him to attend an Oracle8 new features class. Creating the function-based index on to_char(order_date,’MON-DD’) resulted in an immediate 5x performance improvement.

Changing CBO Optimizer Parameters

In one memorable emergency, an Oracle 9.0.2 client from Phoenix called complaining about steadily degrading performance. A quick look into v$sql_plan view using my plan9i.sql script revealed loads of suspected unnecessary large-table full-table scans.

In this case, the top SQL was extracted from v$sql and timed as-is and with an index hint. The query with the index hint ran almost 20x faster, but it was unclear why the CBO was not choosing the index. This was a production emergency and I did not have the luxury of investigating the root cause of the CBO issue. I had to act fast, so I ran a script against v$bh and user_indexes and discovered that approximately 65% of the indexes were currently inside the data buffer cache.

alter system set optimizer_index_cost_adj=20;
alter system set optimizer_index_caching=65.

This quick fix changed the execution plans for more than 350 SQL statements and overall system response time was halved. The client was elated and I was then able to take my time to investigate the root cause of the problem.

Implement cursor_sharing=force

I had a 9.0.2.4 database in Toronto, Canada, where the end users were complaining about poor performance right after a new manufacturing plant was added to the existing database. A quick look at the STATSPACK top-5 timed events looked like this:

Top 5 Wait Events

~~~~~~~~~~~~~~~~~                                             Wait     % Total
Event                                               Waits Time (cs)   Wt Time
-------------------------------------------- ------------ ------------ -------
enqueue                                            25,901      479,654   46.71
db file scattered read                         10,579,442      197,205   29.20
db file sequential read                           724,325      196,583    9.14
latch free                                      1,150,979       51,084    4.97
log file parallel write                           148,932       39,822    3.88

My first look was into the SQL section of the STATSPACK report where I noted that almost all of the SQL was using “literals” in the WHERE clause of all queries.

WHERE customer_state = ‘Alabama’ and customer_type = ‘REDNECK’;

This was a vendor package with dynamically generated SQL, so cursor_sharing was the only fast solution. Setting cursor_sharing=force greatly reduced the contention on the library cache and reduce CPU consumption. The end users reported a 75% improvement in overall performance.

Implement the KEEP Pool for Small-table Full-scanned Tables

I had a database just last month in New Zealand (running 9.2.0.4) that had a 16 CPU Solaris server with 8 gig of RAM. The complaint was that performance had been degrading since the last production change. A STATSPACK top-5 timed events reports showed that more than 80% of system waits related to “db file scattered reads.” A quick review of v$sql_plan using plan9i.sql showed lots of small-table full table scans, with many of the table not assigned to the KEEP pool (as denoted by the “K” column in the listing below):

                  Full table scans and counts
                                         
OWNER      NAME                     NUM_ROWS C K   BLOCKS  NBR_FTS      
---------- -------------------- ------------ - - -------- --------      
APPLSYS    FND_CONC_RELEASE_DISJS         39 N         44   98,864      
APPLSYS    FND_CONC_RELEASE_PERIODS       39 N K       21   78,232      
APPLSYS    FND_CONC_RELEASE_STATES         1 N K        2   66,864      
APPLSYS    FND_CONC_PP_ACTIONS         7,021 N      1,262   52,036      
APPLSYS    FND_CONC_REL_CONJ_MEMBER        0 N K      322   50,174      
APPLSYS    FND_FILE_TEMP                   0 N        544   48,611      
APPLSYS    FND_RUN_REQUESTS               99 N         98   48,606      
INV        MTL_PARAMETERS                  6 N K       16   21,478      
APPLSYS    FND_PRODUCT_GROUPS              1 N         23   12,555      
APPLSYS    FND_CONCURRENT_QUEUES_TL       13 N K       10   12,257      
AP         AP_SYSTEM_PARAMETERS_ALL        1 N K        6    4,521

As you may know, rows fetched into the db_cache_size from full-table scans are not pinged to the most recently used (MRU) end of the data buffer. Running my buf_blocks.sql script (see code depot for script) confirmed that the FTS blocks were falling off the least recently used end of the buffer, and had to be frequently re-loaded into the buffer.

Contents of Data Buffers

                                             Number of Percentage
                                             Blocks in of object
             Object            Object        Buffer    Buffer Buffer    Block
Owner        Name              Type          Cache     Blocks Pool       Size
------------ -------------------------- ----------- ---------- ------- -------
DW01         WORKORDER         TAB PART      94,856          6 DEFAULT   8,192
DW01         HOUSE             TAB PART      50,674          7 DEFAULT 16,384
ODSA         WORKORDER         TABLE         28,481          2 DEFAULT 16,384
DW01         SUBSCRIBER        TAB PART      23,237          3 DEFAULT   4,096
ODS          WORKORDER         TABLE         19,926          1 DEFAULT   8,192
DW01         WRKR_ACCT_IDX     INDEX          8,525          5 DEFAULT 16,384
DW01         SUSC_SVCC_IDX     INDEX          8,453         38 KEEP     32,768

In this case I ran my buf_keep_pool.sql script (see code depot for script) to re-assign all tables that experienced small-table full-table scans into the KEEP pool. The output looks like this, and can be fed directly into SQL*Plus:

alter TABLE BOM.BOM_OPERATIONAL_ROUTINGS storage (buffer_pool keep);
alter INDEX BOM.CST_ITEM_COSTS_U1 storage (buffer_pool keep);
alter TABLE INV.MTL_ITEM_CATEGORIES storage (buffer_pool keep);
alter TABLE INV.MTL_ONHAND_QUANTITIES storage (buffer_pool keep);
alter TABLE INV.MTL_SUPPLY_DEMAND_TEMP storage (buffer_pool keep);
alter TABLE PO.PO_REQUISITION_LINES_ALL storage (buffer_pool keep);
alter TABLE AR.RA_CUSTOMER_TRX_ALL storage (buffer_pool keep);
alter TABLE AR.RA_CUSTOMER_TRX_LINES_ALL storage (buffer_pool keep);
alter INDEX WIP.WIP_REQUIREMENT_OPERATIONS_N3 storage (buffer_pool keep);

With more efficient buffer caching I fixed the problem in less than one hour and overall database performance more than doubled.

Add additional SGA RAM

One of the most common silver bullets are databases where the “working set” of frequently referenced data cannot fit into the data buffer cache. This used to be a huge problem for 32-bit Oracle server where the total SGA size was difficult to grow beyond 1.7 gig without special tricks like AWE and NUMA. However, I still routinely see databases on dedicated server with 8 gig RAM and the SGA size is less than 500 meg. A quick increase to db_block_buffers or db_cache_size, and performance improves dramatically.

Employ Materialized Views

This was a call from a point-of-sale data warehouse in Germany. The IT manager spoke very little English and most of the correspondence was done using Babelfish. The system was largely read-only with a short batch window for nightly updates. Once I got connected I immediately noticed that virtually every query in the system was performing a sum() or avg() function against several key tables. The v$sql_plan view (via plan9i.sql) shows loads of very-large-table full-table scans, and the system was crippled with “db file scattered read” waits.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                          % Total
Event                                 Waits     Time (s)   Ela Time
-------------------------------- ------------ ----------- --------
db file scattered read                 325,519     3,246    82.04
library cache load lock                  4,673     1,363     9.26
db file sequential read                534,598     7,146     4.54
CPU time                                 1,154       645     3.83
log file parallel write                 19,157       837     1.68

In this case I was easily able to create three materialized views and employ query rewrite to reduce physical disk I/O by over 2,000% and improve performance by more than 30x. A real silver bullet!

Implement Bitmap Indexes

I was called upon to assist with a State Police query system that was experiencing slow query performance. The system was read-only except for a 30-minute window at night for data loading. Upon inspection of the SQL I noted complex combinational WHERE clauses:

WHERE color=’BLU’ and make=’CHEVY’ and year=1997 and doors=2;

The distinct values for each of these columns were less than 200, and concatenated indexes were employed. Replacing the b-tree indexes with bitmap indexes resulted in a stunning performance improvement for the entire system, taking queries from 3 seconds down to under 1/10th of a second.

Add Freelists

The client called from Michigan with a complaint that their order processing center was unable to keep up with adding new orders into Oracle. The client had just expanded their telephone order processing department and had doubled the order processing staff to meet a surge in market interest. The VP was frantic, saying that 400 order-entry clerks were getting 30-second response time and they were forced to manually write down order information.

I checked v$session and found 450 connected users, and a quick review of v$sql revealed at virtually all the DML was inserts into a customer_order table. The top timed event was buffer busy wait, and it was clear that there were enqueues on the segment header blocks for the table and its indexes.

The “proper” fix for this issue is to create a new tablespace for the table and index using Automatic Segment Space Management (ASSM), also known as bitmap freelists. I could then reorganize the table online with the dbms_redefinition utility and alter index cust_pk and rebuild the index into the new tablespace. However, it would take me several hours to build and execute the jobs, and the VP said that he was losing more than $500 per minute.

The system was on release 9.2.0.4, so I was able to immediately relieve the segment header contention with these commands:

alter table customer_order freelists 5;
alter index cust_pk freelists 5;

(Note: I did not know the length of the enqueues on the segment header, so I added the additional freelists, one at a time, until the buffer busy waits disappeared.)

Additional freelists did the trick and the segment header contention disappeared. However, I knew that this was only a stop-gap fix and as soon as they ran their weekly purge (a single process) that only one of the five freelists would get the released blocks, causing the table to extend unnecessarily.

Windows Oracle issues

Windows Oracle databases are always the most fun to tune because they are often implemented by someone with a very limited knowledge of Oracle. Here are my favorite Windows Oracle Silver Bullets.

Norton Anti-Virus - I got a call from a new client in England who said that their database had slowed down to the point where sub-second queries now take 15 minutes. A review of a STATSPACK report shows giant waits on read I/O, sometimes up to 10 seconds. A review of the external Windows environment revealed that a well-intentioned support person was told to install Norton Antivirus on all Windows servers. Upon every block read, Norton was conducting a virus check!

A really cool screen saver – I had a Windows Oracle system that was experiencing sporadic CPU shortages, periods when more than half of the PU was being consumed by some external process. Because I had dialed in, I could not see the obvious. An on-site person informed me that the screensaver was called 3D Flowerbox. This was a hugely CPU-intensive screensaver that performed thousands of calculations per second to generate the cool display.

Conclusion

For adrenaline junkies, being an Emergency Oracle Support DBA is great fun. As I have illustrated, the rules for Emergency Oracle support often require a unique set of techniques:

Fix the symptom first – You can come back and address the root cause later.
Time is of the essence – You must provide a quick fix, and instance-wide adjustments are often your best hope.
Be creative – Traditional (i.e. time-consuming) tuning methods don’t apply in an emergency.

For more information and listing of the scripts referenced I this article, see my book “Creating a Self-Tuning Oracle Database” by Rampant TechPress.

Donald K. Burleson is one of the world’s top Oracle Database experts with more than 20 years of full-time DBA experience. He specializes in creating database architectures for very large online databases and he has worked with some of the world’s most powerful and complex systems. A former Adjunct Professor, Don Burleson has written 15 books, published more than 100 articles in national magazines, serves as Editor-in-Chief of Oracle Internals and edits for Rampant TechPress. Don is a popular lecturer and teacher and is a frequent speaker at Oracle Openworld and other international database conferences. Don’s Web sites include DBA-Oracle, Remote-DBA, Oracle-training, remote support and remote DBA.

Contributors : Donald K. Burleson
Last modified 2005-06-22 12:20 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?