Set Operations

by Joe Celko

Part 2 in a series

Introduction

SQL is a language that is supposed to be based on sets. Dr. Codd even defined the classic set operations as part of his eight basic operators for a relational database. Yet we did not have a full collection of basic set operations until the SQL-92 Standard.

By set operations, I mean union, intersection, and set difference -- the basic operators used in elementary set theory, which has been taught in the United States public school systems for decades.

Perhaps the problem in SQL that you did not have in pure set theory is that SQL tables are multisets (also called bags), which means that, unlike sets, they allow duplicate elements (rows or tuples). Dr. Codd's relational model is stricter and uses only true sets. SQL handles these duplicate rows with an ALL or DISTINCT modifier in different places in the language; ALL preserves duplicates and DISTINCT removes them.

Another more subtle problem is that set operations only make sense when the two sets are made up of the same kind of elements. In good database model, each table has one and only one type of elements. That is, you don't have more than one Inventory table, more than one Personnel table, etc.

But when the INCITS H2 (nee ANSI X3) Database Standards Committee added these operators, the model in the SQL-92 standard was to pair off the two tables on a row-per-row basis for set operations.

(note: In SQL-92, we introduced the shorthand TABLE <table name> for the query or subquery SELECT * FROM <table name>, which lets us refer to a table as a whole without referring to its columns. I will use this notation to save space)

Set Operations: Union

Microsoft introduced its ACCESS database product in 1992, after five years and tens of millions of dollars' worth of development work. The first complaints they got on their CompuServe user support forum involved the lack of a UNION operator. UNIONs are supported in SQL-86, SQL-89, and SQL-92, but the other set operations have to be constructed by the programmer in SQL-89. The syntax for the UNION statement is:

<query> UNION [ALL] <query>

Technically, this BNF is not right, but I will get back to that later. The UNION statement takes two tables and builds a new table from them. The two tables must be "union compatible", which means that they have the same number of columns, and that each column in the first table has the same datatype (or automatically cast to it) as the column in the same position in the second table.

That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some datatype conversions to create the result table, but this is very implementation-dependent and you should check it out for yourself.

What is interesting is that the result of a UNION has no name, and its columns are not named. If you want to have names, then you have to use an AS operator to create those names, thus.

((SELECT a, b, c FROM TableA WHERE city = 'Boston')
 UNION
 (SELECT x, y, z FROM TableB WHERE city = 'New York'))
 AS Cities (tom, dick, harry)

However, in actual products will find a multitude of other ways of doing this:

The columns have the names of the first table in the UNION statement.
The columns have the names of the last table in the UNION statement.
The columns have the names generated by the SQL engine.
The columns are referenced by a position number. This was the SQL-89 convention.

There are two forms of the UNION statement: the UNION and the UNION ALL. There was never a UNION DISTINCT option in the language. The UNION is the same operator you had in school; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table.

In most older SQL implementations, this removal is done by merge-sorting the two tables and discarding duplicates during the merge. This has the side effect that the result table is sorted, but you cannot depend on that. This also explains why the ORDER BY clause is a common feature on UNION operators. As long as the engine is sorting on all the columns anyway, why not let the programmer decide the sort keys?

The UNION ALL preserves the duplicates from both tables in the result table. In most implementations, this statement is done appending one table to the other, giving you a predictable ordering.

The UNION and UNION ALL operators are of the same precedence and are executed from left to right unless parentheses change the order.

In theory, the order of execution of UNIONs is not important, but it can be in practice. Even today, many optimizers generate separate results for each table expression in the UNION [ALL] first, then bring them together. And likewise, few products re-order the execution based on table sizes. Consider this expression:

(TABLE SmallTable1)
 UNION
(TABLE BigTable)
 UNION
(TABLE SmallTable2);

It will probably merge SmallTable1 into BigTable, then merge SmallTable2 into that first result. If the rows of SmallTable1 are spread out in the first result table, locating duplicates from SmallTable2 will take longer than if we had written the query thus:

(TABLE SmallTable1)
 UNION
(TABLE SmallTable2))
 UNION
(TABLE BigTable);

There are many reason that products lack UNION optimizations. First, UNIONs are not a common operation, so it is not worth the effort. And secondly, the order of execution becomes important if UNION and UNION ALL are mixed together:

 TABLE X
 UNION
 TABLE Y
 UNION ALL
 TABLE Z

Is executed as

(TABLE X UNION TABLE Y)
 UNION ALL
 TABLE Z

and that is not the same as:

 TABLE X
 UNION
 (TABLE Y UNION ALL TABLE Z)

Optimization of UNIONs is highly product-dependent, so you should experiment with it.

As a general statement, if you know that there are no duplicates, or that duplicates are not a problem in your situation, use the UNION ALL operator instead of UNION for speed.

There is no attempt to merge the table expressions and use OR-ed predicates.
For example:

  SELECT * FROM Personnel WHERE sex = 'm'
 UNION ALL
 SELECT * FROM Personnel WHERE sex = 'f'

can be replaced by:

  SELECT * FROM Personnel WHERE sex IN ('m', 'f');

A useful trick for building the union of different columns from the same table is to use a CROSS JOIN, a table of sequential integers from 1 thru (n) and a CASE expression, thus

  SELECT employee,
    CASE WHEN S1.seq = 1 THEN P1.primary_lanuage
       WHEN S1.seq = 2 THEN P1.secondary_lanuage
       ELSE NULL END
  FROM Personnel AS F1
    CROSS JOIN
    Sequence AS S1
 WHERE S1.seq IN (1,2)

This acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes one pass thru the table. Given a large table, that can be important for good performance.

---

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:25 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Set Operations

Part 2 in a series

Introduction

Set Operations: Union