Subselect Performance Enhancements

by Gabrielle Wiorkowski

Most users have been cautious with the use of subselects since DB2 became generally available. The DB2 developers have made a number of performance enchancements to subselect processing over the years. This article concentrates primarily on the enhancements made in V4, V5 and V6 as well as how and under what conditions these enhancements can be a benefit.

Returning Few or Many Values

Subselect of a SELECT, UPDATE, and DELETE statement can qualify a few or many rows when using IN, =ANY, =SOME, =ALL (with or without NOT and WHERE CURRENT OF). Consider the processing required to identify those parts that are being used in a quantity of 200 on jobs.

SELECT PN,          PNAME,  WEIGHT, CITY 
FROM        P       
WHERE       PN      IN     
            SELECT  PN     
            FROM    SPJ     
            WHERE   QTY     =       200);

Whether the subselect returns a few or many values significantly effects how the statement is processed and its performance. In any case, a matching index scan can be used to satisfy predicates in the subselect (WHERE QTY = 200 in the example). The goal is to minimize the number of values that must be processed in the subselect.

Transformation of Subselect to IN (list) Processing

It wasn't possible to use a index on the column preceding the subslect before the subselect enchancements. If few rows qualify in the subselect and many rows must be processed for the outer select, the qualifying values from the subselect are sorted, duplicates are eliminated and written to a work file (V5 APAR PQ23243). If P7, P7, P7, P5, P3, P8, P8, P9, P9, P1, P1, P1, P3, P3, P4, P8, P9, P2, P2, P5, P6, P1 qualify with a QTY = 200 in the subselect , the statement is executed like:

SELECT PN, PNAME, WEIGHT, CITY
FROM P
WHERE PN IN ('P1','P2',P3','P4','P5','P6','P7','P8','P9');

This allows for nine matching index scans on a P.PN clustering index. The processing can be identified in the PLAN_TABLE with columns ACCESSTYPE = 'N' and MATCHCOL > 0. OPTIMIZE FOR 1 ROW encourages the transformation to IN (list) processing. The same principle applies to multi-level subselects.

It is important to remain cautious of many qualifying rows in the subselect, which in some cases are based on the value requested by the user (QTY = 10 can cause many SPJ.PNs to qualify, for example). One alternative is the use of reoptimization to allow the optimizer to see the value requested by the user at run time.

Another cause of concern is the possibility that the characteristics of the data might change in such a way that many rows qualify in the subselect after the access path is chosen. But this problem can easily be solved by doing a rebind or thorough the use of reoptimization. The optimizer will then recognize that many rows qualify in the subselect and will choose to avoid IN (list) processing.

There are, however, a few restrictions. This processing applies to non-correlated subselects, not correlated ones. As a result, the column preceding the subselect must match the data type and length of the selected column in the subselect. The column preceding the subselect cannot be part of an expression (P.PN + 4, for example).

Returning Many Rows

If there are many SPJ.PN being used with a QTY of 200, it turns out to be inefficient to perform many matching index scans on P.PN table using the IN (list) processing, particularly if there is a low cluster ratio. The optimizer wants to avoid "death by random I/O" by doing a tablespace scan on the P table, comparing each P.PN with those identified in the inner select. The number of comparisons is reduced by sorting the inner select result, eliminating duplicates and building an in-memory sparse index in a temporary work area (V4).

The purpose of the in-memory sparse index is to improve the performance of searching the work area containing the results of the subselect for every qualifying row in the outer select. DB2 can create an in-memory sparse index that points to subsections of the qualifying result set from the subselect. It can then do a binary search on the values in each subsection for each qualifying row in the outer select. Figure 1 gives a conceptual picture of how this works. The rows that are crossed out are eliminated from the temporary work area before the in-memory sparse index is built.

PN in-memory sparse index	Binary Search of Values
P1000	P1000, P1001, P1002, P1003, P1004, P1005, ...
P2000	P2000, P2001, P2002, P2003, P2004, P2005, ...
P3000	P3000, P3001, P3002, P3003, P3004, P3005, ...
~~P3000~~	~~P3000, P3001, P3002, P3003, P3004, P3005, ...~~
~~P3000~~	~~P3000, P3001, P3002, P3003, P3004, P3005, ...~~
~~P3000~~	~~P3000, P3001, P3002, P3003, P3004, P3005, ...~~
~~P3000~~	~~P3000, P3001, P3002, P3003, P3004, P3005, ...~~
P6000	P6000, P6001, P6002, P6003, P6004, P6005, ...

Figure 1. In-memory sparse index

DB2 can locate P1004, for example, from the outer SELECT, use the in-memory sparse index to find the subsection that contains the P1000s, and do a binary search of the parts to locate P1004. When P2002 is encountered during a tablespace of the outer table, the in-memory sparse index is used to locate the P2000's subsection and a binary search performed to locate the P2002 efficiently. This processing continues for each row in the outer table as long as you are fetching rows. Thus far, we have considered transformation of positive IN subselects.

A negated predicate (NOT IN), for example, is evaluated at stage 1, but matching index scans are not used on negated predicates. However, NOT EXISTS is an exception in which a matching index scan can be used on the column in the inner table of the pseudo join predicate in the subselect.

Transformation of NOT IN to NOT EXISTS

If the optimizer estimates that many rows qualify in the inner table and there is a usable index for locating rows in the subselect, it can transform a NOT IN non-correlated subselect to a NOT EXISTS subselect (V5 APAR PQ24284) for improved performance. This avoids the cost of in-memory sparse index processing. Rather, a matching index scan can be used to determine if the value exists in the inner table with the NOT EXISTS formulation similar to a nested loop join.

If the value does not exist in the inner table, a qualifying row from the outer table can be returned to the program immediately without first processing a work file. This is particularly important if many rows qualify in the subselect and few rows are fetched. For example, if one million rows qualify in the subselect, about 1,000 rows may need to be scanned with an in-memory sparse index for each qualifying row in the outer table.

Writing a Subselect as a Join

Joins have advantages over subselects in that there is a greater opportunity to use matching index scans on both tables, P.PN and SPJ.PN columns, processed as in the following example.

SELECT P.PN, PNAME, COLOR, WEIGHT, CITY
FROM P, SPJ
WHERE P.PN = SPJ.PN
AND QTY = 200;

Unfortunately, however, the join's results does not match those of the subselect exactly because the subselect does not return duplicate rows, while the join does. The subselect returns:

PN	PNAME	COLOR	WEIGHT	CITY
P1	Nut	Red	12	London
P2	Bolt	Green	17	Paris
P3	Screw	Blue	17	Rome
P6	Cog	Red	19	London

and the join returns:

PN	PNAME	COLOR	WEIGHT	CITY
P1	Nut	Red	12	London
P2	Bolt	Green	17	Paris
P3	Screw	Blue	17	Rome
P3	Screw	Blue	17	Rome
P3	Screw	Blue	17	Rome
P3	Screw	Blue	17	Rome
P3	Screw	Blue	17	Rome
P6	Cog	Red	19	London

Duplicates can be eliminated from the join results using the DISTINCT operator:

SELECT DISTINCT P.PN,PNAME,COLOR,WEIGHT,CITY
FROM P, SPJ
WHERE P.PN = SPJ.PN
AND QTY = 200;

The DISTINCT operator here requires a sort to eliminate the duplicate rows from the join result. And it may be a more costly sort than the one needed to satisfy the inner select in the subselect formulation. That is because the rows returned by the join are longer than those returned by the subselect. In this particular example, the subselect may provide better performance since the rows to be sorted are shorter. However, if the data is such that there are no duplicates, you can remove the DISTINCT and the join performs best.

The earlier IN subselect has performance advantages if few rows qualify in the inner select and an index is used on the column preceding the subselect to do few matching index scans into the outer table.

Transformation of a Subselect to a Join

Most subselects can be rewritten as a join. (The opposite is not true.) Usually the join is the more efficient of the two methods if the subselect qualifies many values for the following reasons:

A join's WHERE clause can reduce the number of rows to be processed. The inner select of a subselect frequently qualifies a number of rows that are written to a temporary work area before processing of the outer select can begin. In contrast, a join can process one row from each of the tables to be joined and return them immediately to the program without processing perhaps a large number of rows from the inner select in an intermediate work area.
DB2 does not use a matching index scan on the column preceding a non-correlated subselect prior to V5 APAR PQ23243. However, it can use a matching index scan on the column with an equivalent join.

The Automatic Transformation

The optimizer can transform a non-correlated or correlated subselect using the IN, =ALL, =ANY, or =SOME operator into an equivalent join and take advantage of any indexes the join would prior to V4. So the optimizer can be certain that the subselect and join are equivalent -- that they will return identical results in all cases -- there are restrictions on when it will make these transformations:

The column in the inner select must have a unique index to ensure that no duplicates are returned as a result of the transformation.
The subselect must access only one table.
Potential join columns must have the same data type and length.
The subselect cannot contain a GROUP BY or HAVING clause or use a column function.
The operator on the inner select cannot be negated (such as NOT IN).

After the transformation, a nested loop join is used in most cases, although a hybrid or merge join is possible. If the unique index is dropped, the plan or package is marked invalid. The next time the plan or package is executed, it is rebound without the transformation to a join.

Example: If there is a unique index on the SN column in the S table, this subselect:

SELECT SN, PN, JN, QTY
FROM SPJ
WHERE SN IN
(SELECT SN
FROM S
WHERE CITY = 'London');

can be transformed into a join, like the following:

SELECT S.SN, PN, JN, QTY
FROM SPJ, S
WHERE CITY = 'London'
AND SPJ.SN = S.SN;

Transformation can occur with composite indexes as well. If there is a composite index on CITY and SN in that order, the transformation can occur, provided the leading column in the composite index is specified with an equal predicate as is the case in the above example.

If a subselect is automatically transformed into a join, the access path can differ from that chosen when the join is explicitly coded. In most cases, it is best to code the join. It much clearer to the next person looking at the code to determine what is happening.

IN Versus EXISTS

An IN (list) subselect using an in-memory sparse index compared to a join or EXISTS subselect has advantages in some cases:

Many rows qualify in outer select
Most rows are fetched or must be processed for a COUNT, for example
Inner table is small
Low cluster ratio index used for matching index scans into the inner table when IN subselect is not used

The reason that the IN subselect has a performance advantage is that many matching index scans into widely dispersed rows are not required as is the case with a join or EXISTS formulation. This avoids the excessive seek time required for many matching index scans used by an EXISTS subselect and a nested loop join.

Michael Hannan in Melbourne, Australia, discovered this when 968,961 matching index scans were required for a join and EXISTS formulation. The IN subselect resulted in a non-matching index scan, sort and build of an in-memory sparse index on 3,000 values qualifying in the subselect. The CPU and elapsed seconds for the three formulations are shown in Figure 2.

Figure 2 Comparison of IN (list) subselect, join and EXISTS subselect

Description	CPU Seconds	Elapsed Seconds
IN subselect	31.07	87
Nested loop join	124.96	348
EXISTS subselect	124.99	239.64

We have discussed another case in which an IN subselect has performance advantages after transformation to IN (list) processing. If few rows qualify in the inner select and an index is used on the column preceding subselect, the IN subselect formulation has performance advantages. Indeed, we have discussed some of the many factors that must be considered when tuning challenging subselects. Hopefully, these factors will prove useful in analyzing subselects that performs less than optimally.

---

Gabrielle Wiorkowski is a senior DB2 consultant and an IBM gold consultant. In addition to consulting for many major organizations since V1.1, she has lectured widely on DB2 around the world and authored DB2 for OS/390 Development for Performance, now in its third edition. She can be reached via her website at http://www.GabrielleDB2.com.

Contributors : Gabrielle Wiorkowski
Last modified 2005-04-12 06:21 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?