Thinking in Aggregates

by Joe Celko

Set theory was one of the major advances in mathematics. Before Gregor Cantor invented set theory, collections of numbers were handled as sequences. A set is a completed whole with properties that belong the set, but not always to each individual element of the set. This is a big conceptual leap.

SQL is a set-oriented language and it is often possible to get the information you want without actually knowing anything about the individual elements in the set represented in the tables.

In one job I had, we were looking at an existing SQL database. We wanted to find the stored procedures that used the same subset of tables, so we could consider building VIEWs or consolidating the procedures. The table we built looks like this:

CREATE TABLE ProcTabs
(procname CHAR(25) NOT NULL,
 tablename CHAR(25) NOT NULL,
 PRIMARY KEY (procname, tablename));

If you remember a previous column on relational division, you might first think about building a table with one combination of tables, then dividing the ProcTabs table by this divisor table to get the procedure names.

Unfortunately, there are 60 tables in the schema. The formula for all possible combinations involves factorials. In short, you don’t have enough storage to hold that table.

But how would a mathematician determine if two sets are equal? If set A and set B are the same size and there is a one-to-one mapping between them that is the same size as either of the original sets, then they are the same. We can do a mapping by doing an INNER JOIN between the two tables.

SELECT P1.procname AS proc1,
       P2.procname AS proc2,
       COUNT(*) AS tally_of_procs
  FROM ProcTabs AS P1, ProcTabs AS P2
 WHERE P1.procname < P2.procname -- different procs
 GROUP BY P1.procname, P2.procname 
HAVING (SELECT COUNT(*)   -- size of set A
          FROM ProcTabs AS P11
         WHERE P11.procname = P1.procname)
        = (SELECT COUNT(*)  -- size of mapping
             FROM ProcTabs AS P111, ProcTabs AS P222
            WHERE P111.procname = P1.procname
               AND P222.procname = P2.procname
               AND P111.tablename = P222.tablename)
    AND (SELECT COUNT(*)  -- size of set B
           FROM ProcTabs AS P22
          WHERE P22.procname = P2.procname)
        = (SELECT COUNT(*)  -- size of mapping
             FROM ProcTabs AS P111, ProcTabs AS P222
            WHERE P111.procname = P1.procname
               AND P222.procname = P2.procname
               AND P111.tablename = P222.tablename);

The weakness of this approach is that we don’t exactly know are what the collections of tables; we only know how many tables there are in the procedures. We are working at a higher level of aggregation.

You can use the aggregate functions and the HAVING clause to determine certain characteristics of the groups formed by the GROUP BY clause. For example, given a simple grouped table with three columns like this:

SELECT col1, col2
  FROM Foobar
 GROUP BY col1, col2
HAVING ..

You can determine the following properties of the groups with these HAVING clauses:

HAVING COUNT (DISTINCT col_x) = COUNT (col_x)	— col_x has all distinct values
HAVING COUNT(*) = COUNT(col_x);	— There are no NULLs in the column
HAVING MIN(col_x - <const>) = -MAX(col_x - <const>)	— col_x deviates above and below the constant value by the same amount
HAVING MIN(col_x) * MAX(col_x) < 0	— MAX is positive and MIN is negative
HAVING MIN(col_x) * MAX(col_x) = 0	— either one or both of MIN or MAX is zero
HAVING MIN(col_x) * MAX(col_x) > 0	— col_x is either all positive or all negative
HAVING MIN(col_x) = -MAX(col_x)	— col_x deviates above and below zero by the same amount
HAVING MIN(SIGN(col_x)) = MAX(SIGN(col_x))	— col_x is all positive, all negative or all zero
HAVING MIN(ABS(col_x)) = 0;	— col_x has at least one zero
HAVING MIN(ABS(col_x)) = MIN(col_x)	— col_x >= 0 (although the where clause can handle this, too)
HAVING MIN(col_x) = -MAX(col_x)	— col_x deviates above and below zero by the same amount
HAVING MIN(col_x) * MAX(col_x) = 0	— either one or both of MIN or MAX is zero
HAVING MIN(col_x) < MAX(col_x)	— col_xhas more than one value (may be faster than count (*) > 1)
HAVING MIN(col_x) = MAX(col_x)	— col_x has one value or NULLs

Let me remind you again, that if there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group according to the SQL-89 and SQL-92 standards. This means that if you wish to apply one of the tests given above to the whole table, you will need to use a constant in the SELECT list.

This will be easier to see with an example. You are given a table with a column of unique sequential numbers that start at “one” and increase. When you insert an new row, you must either use a sequence number that is not currently in the column — that is, fill the gaps. If there are no gaps, then and only then can you use the next highest integer in the sequence.

CREATE TABLE Foobar
(seq_nbr INTEGER NOT NULL PRIMARY KEY
         CHECK (seq > 0),
 name CHAR(5) NOT NULL);

INSERT INTO Foobar
VALUES (1, 'Tom'), (2, 'Dick'), (4, 'Harry'), (5, 'Moe');

How do I find if I have any gaps?

 EXISTS (SELECT 'gap'  -- any constant will work
           FROM Foobar
         HAVING COUNT(*) = MAX(seq_nbr))

You could not use “SELECT seq_nbr” because the column values will not be identical within the single group made from the table, so the subquery fails with a cardinality violation. Likewise, “SELECT *” fails because the asterisk is converted into a column name picked by the SQL engine. Here is the insertion statement:

INSERT INTO Foobar (seq_nbr, name)
VALUES (CASE WHEN EXISTS      -- no gaps
                  (SELECT 'no gaps'
                     FROM Foobar
                   HAVING COUNT(*) = MAX(seq_nbr))
             THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
             ELSE (SELECT MIN(seq_nbr) -- gaps
                     FROM Foobar
                    WHERE (seq_nbr - 1)
                          NOT IN (SELECT seq_nbr
                                     FROM Foobar)
                      AND seq_nbr > 0) - 1, 'Celko');

The ELSE clause has to handle a special situation when 1 is in the seq_nbr column, so that it does not return an illegal zero. The only tricky part is waiting for the entire scalar subquery expression to compute before subtracting one; writing “MIN(seq_nbr -1)” or “MIN(seq_nbr) -1” in the SELECT list could disable the use of indexes in many SQL products.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.

Contributors : Joe Celko
Last modified 2005-09-08 02:29 PM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Thinking in Aggregates