Inside Oracle10g Dynamic Sampling

by Donald K. Burleson

Introduction

One of the greatest problems with the Oracle Cost-based Optimizer (CBO) was not a problem with the CBO at all, but with the failure of the Oracle DBA to gather accurate schema statistics. Even with the dbms_stats package, schema statistics were often stale, and the DBA did not always create histograms for skewed columns or for the columns used to estimate the size of SQL intermediate result sets.

This resulted in a “bum rap” for Oracle’s CBO, and beginner DBAs often falsely accused it of failing to generate optimal execution plans when the real cause of the sub-optimal execution plan was the DBA’s failure to collect complete schema statistics.

Hence, Oracle has automated the collection and refreshing of schema statistics in Oracle10g. This automates a very important DBA task and helps ensure that Oracle gathers good statistics and, from them, chooses good execution plans. Using the enhanced dbms_stats package, Oracle will automatically estimate the sample size, detect skewed columns that would benefit from histograms, and refresh the schema statistics when they become stale.

begin
   dbms_stats.gather_schema_stats(
      ownname          => 'SCOTT',                         -- schema to analyze
      estimate_percent => dbms_stats.auto_sample_size,     -- let Oracle choose the sample size
      method_opt       => 'for all columns size skewonly', -- histograms only where data is skewed
      degree           => 7                                -- degree of parallelism
   );
end;
/

However, there was always a nagging problem with the CBO. Even with good statistics, the CBO would sometimes determine a sub-optimal table-join order, causing unnecessarily large intermediate result sets. For example, consider the complex WHERE clause in the query below. Even with the best schema statistics, it can be impossible to predict a priori the optimal table-join order (the one that has the smallest intermediate baggage). Reducing the size of the intermediate row-sets can greatly improve the speed of the query.

select
   stuff
from
   customer
natural join
   orders
natural join
   item
natural join
   product
where
   credit_rating * extended_credit > .07
and
   (qty_in_stock * velocity) /.075 < 30
or
   (sku_price / 47) * (qty_in_stock / velocity) > 47;

In this example, the four-way table join returns only 18 rows, but the query carries 9,000 rows in intermediate result sets, slowing down SQL execution (refer to figure 1).

Figure 1: Sub-optimal intermediate row sets.

If we were somehow able to predict the sizes of the intermediate results, we could re-sequence the table-join order to carry less “intermediate baggage” during the four-way table join; in this example, only 3,000 intermediate rows would be carried between the table joins (refer to figure 2).

Figure 2: Optimal intermediate row sets.

Let’s take a closer look at this issue. Assume that we have a three-way table join against tables that each contain thousands of rows: 50,000 student rows, 10,000 course rows, and 5,000 professor rows (refer to figure 3).

Figure 3: Number of rows in each table.

If the number of rows in each table determined the best table-join order, we would expect any three-way table join to start by joining the two smallest tables, professor and course, and then join that RESULT set to the student table.

Ah, but whenever there is a WHERE clause, the total number of rows in each table does not matter (if you are using index access). Here is the query:

select
   student_name
from
   professor
natural join
   course
natural join
   student
where
   professor = 'jones'
and
   course = 'anthropology 610';

 

Stan Nowakowski
Bob Crane
James Bakke
Patty O'Furniture

4 rows selected.

Despite the huge number of rows in each table, the final result set is only four rows. If the CBO could guess a priori the size of the final result, it could use sampling techniques to examine the WHERE clause of the query and determine which two tables should be joined together first.

There are only two table-join choices in our simplified example:

      1. Join (student to course), then join (RESULT to professor)
      2. Join (professor to course), then join (RESULT to student)

So, then, which is better? The best solution is the one in which RESULT is smallest. Because the query is filtered with a WHERE clause, the number of rows in each table is incidental; what we really care about is the number of rows where professor = 'jones' and the number of rows where course = 'anthropology 610'.

If we know these counts, the best table-join order becomes obvious. Assume that Professor Jones is very popular and teaches 50 courses, and that Anthropology 610 has a total of eight students. Knowing this, we can see that the sizes of the intermediate row sets are very different:

Join (professor to course), then join (RESULT to student).

Figure 4: A sub-optimal intermediate row size.

If the CBO were to join the student table to the course table first, the intermediate result set would only be eight rows, far less baggage to carry over to the final join:

Join (student to course), then join (RESULT to professor).

Figure 5: An optimal intermediate row size.

Now that only eight rows are returned from the first join, it is easy to join the tiny eight-row result set to the professor table to get the final answer.
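
These driving cardinalities can be confirmed with simple counting queries. The following is a hypothetical sketch; the column names are taken from the example query and may differ in a real schema:

select
   count(*)
from
   course
where
   professor = 'jones';     -- returns 50 in our example

select
   count(*)
from
   student
where
   course = 'anthropology 610';     -- returns 8 in our example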

How Do We Estimate Join Cardinality?

As we can see, in the absence of column histograms, the Oracle CBO must “guess” this information, and it sometimes gets it wrong. This is one reason why the ORDERED hint is one of the most popular SQL tuning hints; the ORDERED hint specifies that the tables be joined in the same order that they appear in the FROM clause, like this:

select /*+ ORDERED */
   student_name
from
   student
natural join
   course
natural join
   professor
where
   professor = 'jones'
and
   course = 'anthropology 610';

Remember, if the values in the professor and course columns are not skewed, it is unlikely that the 10g automatic statistics collection would have created histogram buckets in the dba_histograms view for these columns.
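
To see whether histogram buckets exist for these columns, you can query dba_histograms directly. A minimal sketch, assuming the example tables live in the SCOTT schema (a column without a histogram shows only its two default min/max endpoint rows):

select
   table_name,
   column_name,
   count(*)  buckets
from
   dba_histograms
where
   owner = 'SCOTT'
and
   table_name in ('PROFESSOR', 'COURSE')
group by
   table_name,
   column_name;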

As we can see, the Oracle CBO needs to accurately estimate the number of rows returned by each step of the query and then use schema metadata (from running dbms_stats) to choose the table-join order that carries the least “baggage” (intermediate rows) through the table join operations.
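
One way to inspect these per-step estimates is through the Rows column of the execution plan display. A quick sketch against the example query:

explain plan for
select
   student_name
from
   professor
natural join
   course
natural join
   student
where
   professor = 'jones'
and
   course = 'anthropology 610';

select * from table(dbms_xplan.display);

The Rows column of the plan output shows the optimizer’s estimated cardinality at each join step, which you can compare against actual row counts.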

But this is a daunting task. When a SQL query has a complex WHERE clause, it can be very difficult to estimate the sizes of the intermediate result sets, especially when the WHERE clause transforms column values with mathematical functions. Yet Oracle is committed to making the CBO choose good plans even when complete information does not exist. Hence, Oracle9i introduced the dynamic sampling method for gathering run-time schema statistics, and it is now enabled by default in Oracle10g.

Note that dynamic sampling is not for every database. Let’s take a closer look.

Dynamic Sampling

The main objective of dynamic sampling is to create more accurate selectivity and cardinality estimates, which, in turn, helps the CBO generate faster execution plans. Dynamic sampling is normally used to estimate single-table predicate selectivity when collected statistics cannot be used or are likely to lead to significant errors in estimation. It is also used to estimate table cardinality for tables without statistics or for tables whose statistics are too out of date to trust.

The optimizer_dynamic_sampling initialization parameter controls the number of blocks read by the dynamic sampling query. The parameter can be set to a value from 0 to 10. In 10g, the default is 2, which automatically enables dynamic sampling. Beware that the optimizer_features_enable parameter will turn off dynamic sampling if it is set to a version earlier than 9.2.0.

A value of 0 means dynamic sampling will not be performed. Increasing the value of the parameter results in more aggressive dynamic sampling, in terms of both the type of tables sampled (analyzed or un-analyzed) and the amount of I/O spent on sampling.
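
For reference, the parameter can be adjusted at the session or the system level; a quick sketch (the level of 4 is an arbitrary illustration):

alter session set optimizer_dynamic_sampling = 4;

alter system set optimizer_dynamic_sampling = 4 scope = both;   -- assumes an spfile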

Dynamic Sampling Is Not for Everyone

When dynamic_sampling was first introduced in Oracle9i, it was used primarily for data warehouse systems with complex queries. Because it is enabled by default in Oracle10g, you may want to turn off dynamic_sampling to remove unnecessary overhead if any of the following are true:

      • You have an online transaction processing (OLTP) database with small, single-table queries.
      • Your queries are not frequently re-executed, as shown by the executions column in v$sql and the executions_delta column in dba_hist_sqlstat (see the query sketch after this list).
      • Your multi-table joins have simple WHERE clause predicates with single-column values and no built-in or mathematical functions.

Conversely, dynamic sampling is ideal whenever a query will execute many times, because the one-time sampling cost is small compared to the overall query execution time.
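
Here is a minimal sketch for spotting rarely re-executed statements in the library cache; the threshold of a single execution is arbitrary:

select
   sql_id,
   executions,
   substr(sql_text, 1, 60)  sql_text
from
   v$sql
where
   executions <= 1
order by
   sql_id;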

By sampling data from the table at runtime, Oracle10g can quickly evaluate complex WHERE clause predicates and determine the selectivity of each predicate, using this information to determine the optimal table-join order. Let’s use the Oracle SQL sample clause to see how this works.

Sample Table Scans

A sample table scan retrieves a random sample of data of whatever size you choose. The sample can come from a simple table or from a complex SELECT statement, such as a statement involving multiple joins and complex views.

To peek inside dynamic sampling, we can run some simple SQL queries. The following SQL statements run sample scans against the customer table, which contains 50,000 rows. The first statement shows a sample block scan and the second shows a sample row scan:

select
   count(*)
from
   customer
   sample block(20);

  COUNT(*)
----------
     12268


select
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer
   sample (0.02);


    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10

10 rows selected.

We can use the new dynamic_sampling hint, which accepts a table name and a sampling level from 0 to 10, to sample rows from the table:

select /*+ dynamic_sampling(customer 10) */
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer;

    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10
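
To confirm that dynamic sampling actually fired for a statement, check the Note section of the execution plan; a minimal sketch:

explain plan for
select /*+ dynamic_sampling(customer 10) */
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer;

select * from table(dbms_xplan.display);

When sampling was used, the plan output typically ends with a note such as “dynamic sampling used for this statement.”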
   

Conclusion

Dynamic sampling addresses an innate problem in SQL, one that is common to all relational databases: estimating the optimal join order means guessing the sequence that produces the smallest intermediate row sets, and it is impossible to collect a priori statistics for every possible combination of WHERE clause predicates.

Dynamic sampling is a godsend for databases that have large n-way table joins that execute frequently. By sampling a tiny subset of the data, the Oracle10g CBO gleans clues as to the fastest table-join order.

As we have noted, dynamic_sampling does not take long to execute, but it can be unnecessary overhead for some Oracle10g databases. Dynamic sampling is just another example of Oracle’s commitment to making Oracle10g an intelligent, self-optimizing database.

--

Donald K. Burleson is one of the world’s top Oracle Database experts with more than 20 years of full-time DBA experience. He specializes in creating database architectures for very large online databases and he has worked with some of the world’s most powerful and complex systems. A former Adjunct Professor, Don Burleson has written 15 books, published more than 100 articles in national magazines, serves as Editor-in-Chief of Oracle Internals and edits for Rampant TechPress. Don is a popular lecturer and teacher and is a frequent speaker at Oracle Openworld and other international database conferences. Don’s Web sites include DBA-Oracle, Remote-DBA, Oracle-training, remote support and remote DBA.

