Thinking in SQL

by Joe Celko

Learning to think in terms of SQL is a jump for most programmers. Most of your career is spent writing procedural code and suddenly, you have to deal with non-procedural code. The thought pattern has to change from sequences to sets of data elements.

As an example of what I mean, consider a posting made on 1999 December 22 by J.R. Wiles to a Microsoft SQL Server website: "I need help with a statement that will return distinct records for the first three fields where all values in field four are all equal to zero."

What do you notice about this program specification? It is very poorly written. But this is very typical of what people put out on the Internet when they ask for SQL help.

There are no fields in a SQL database; there are columns. The minute that someone calls a column a field, you that he is not thinking in the right terms.

A field is defined within the application program. A column is defined in the database, independently of the application program. This is why a call to some library routine in a procedural language like "READ a, b, c, d FROM My_File;" is not the same as "READ d, c, b, a FROM My_File;" while "SELECT a, b, c, d FROM My_Table;" and "SELECT d, c, b, a FROM My_Table;" are the same thing in a different order.

The next problem is that he does not give any DDL (Data Definition Language) for the table he wants us to use for the problem. This means we have to guess what the column datatypes are, what the constraints are and everything else about the table. However, he did give some sample data in the posting which lets us guess that the table looks like this:

CREATE TABLE Foobar
(col1 INTEGER NOT NULL,
 col2 INTEGER NOT NULL,
 col3 INTEGER NOT NULL,
 col4 INTEGER NOT NULL);

INSERT INTO Foobar
VALUES (1, 1, 1, 0),
       (1, 1, 1, 0),
       (1, 1, 1, 0),
       (1, 1, 2, 1),
       (1, 1, 2, 0),
       (1, 1, 2, 0),
       (1, 1, 3, 0),
       (1, 1, 3, 0),
       (1, 1, 3, 0);

Then he tells us that the query should return these two rows:

  (1, 1, 1, 0)
  (1, 1, 3, 0)

Did you notice that this table had no name and no key specified? While it is a bad practice not to have a declared PRIMARY KEY on a table, just ignore it for the moment.

At this point, people started sending in possible answers. Tony Rogerson at Torver Computer Consultants Ltd came up with this answer:

SELECT *
  FROM (SELECT col1, col2, col3, SUM(col4)
          FROM Foobar
         GROUP BY col1, col2, col3)
       AS F1(col1, col2, col3, col4)
 WHERE F1.col4 = 0;

Using the assumption, which is not given anywhere in the specification, Tony decided that col4 has a constraint -- ...

           col4 INTEGER NOT NULL CHECK(col4 IN (0, 1)));

Notice how doing this INSERT INTO statement would ruin his answer:

  INSERT INTO Foobar (col1, col2, col3, col4)
 VALUES (4, 5, 6, 1), (4, 5, 6, 0), (4, 5, 6, -1);

But there is another problem. This is a procedural approach to the query, even though it looks like SQL! The innermost query builds groups based on the first three columns and gives you the summation of the fourth column within each group. That result, named F1, is then passed to the containing query which then keeps only groups with all zeros, under his assumption about the data.

Now, students, what do we use to select groups from a grouped table? The HAVING clause! Mark Soukup noticed this was a redundant construction and offered this answer:

SELECT col1, col2, col3, 0 AS col4zero
  FROM Foobar
 GROUP BY col1, col2, col3
HAVING SUM(col4) = 0;

Why is this an improvement? The HAVING clause does not have to wait for the entire subquery to be built before it can go to work. In fact, with a good optimizer, it does not have to wait for an entire group to be built before dropping it from the results.

However, there is still that assumption about the values in col4. Roy Harvey came up with answer that gets round that problem:

  SELECT col1, col2, col3, 0 AS col4zero
   FROM Foobar
  GROUP BY col1, col2, col3
 HAVING COUNT(*)
       = SUM(CASE WHEN col4 = 0
                  THEN 1 ELSE 0 END);

Using the CASE expression inside an aggregation function this way is a handy trick. The idea is that you count the number of rows in each group and count the number of zeros in col4 of each group and if they are the same, then the group is one we want in the answer.

However, when most SQL compilers see an expression inside an aggregate function like SUM(), they have trouble optimizing the code.

I came up with two approaches. Here is the first:

SELECT col1, col2, col3
  FROM Foobar
 GROUP BY col1, col2, col3
HAVING MIN(col4) = MAX(col4)  -- one value in table
   AND MIN(col4) = 0;         -- has a zero

The first predicate is to guarantee that all values in column four are the same. Think about the characteristics of a group of identical values. Since they are all the same, the extremes will also be the same. The second predicate assures us that col4 is all zeros in each group. This is the same reasoning; if they are all alike and one of them is a zero, then all of them are zeros.

However, these answers make assumptions about how to handle NULLs in col4. The specification said nothing about NULLs, so we have two choices: (1) discard all NULLs and then see if the known values are all zeros (2)Keep the NULLs in the groups and use them to disqualify the group. To make this easier to see, let's do this statement:

 INSERT INTO Foobar (col1, col2, col3, col4)
 VALUES (7, 8, 9, 0), (7, 8, 9, 0), (7, 8, 9, NULL);

Tony Rogerson's answer will drop the last row in this statement from the SUM() and the outermost query will never see it. This group passes the test and gets to the result set.

Roy Harvey's will convert the NULL into a zero in the SUM(), the SUM() will not match COUNT(*) and thus this group is rejected.

My first answer will give the "benefit of the doubt" to the NULLs, but I can add another predicate and reject groups with NULLs in them.

SELECT col1, col2, col3
  FROM Foobar
 GROUP BY col1, col2, col3
HAVING MIN(col4) = MAX(col4)
   AND MIN(col4) = 0
   AND COUNT(*) = COUNT(col4);   -- No NULL in the column

The advantages of using simple aggregate functions is that SQL engines are tuned to produce them quickly and to optimize code containing them. For example, the MIN(), MAX() and COUNT(*)functions for a base table can often be determined directly from an index or from a statistics table used by the optimizer, without reading the base table itself.

As an exercise, what other predicates can you write with aggregate functions that will give you a group characteristic? I will offer a copy of SQL FOR SMARTIES (second edition) for the longest list. Send me an email at mailto:71062.1056@compuserve.com with your answers.

---

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:29 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Thinking in SQL