Thinking in Aggregates
Set theory was one of the major advances in mathematics. Before Gregor Cantor invented set theory, collections of numbers were handled as sequences. A set is a completed whole with properties that belong the set, but not always to each individual element of the set. This is a big conceptual leap.
SQL is a set-oriented language and it is often possible to get the information you want without actually knowing anything about the individual elements in the set represented in the tables.
In one job I had, we were looking at an existing SQL database. We wanted to find the stored procedures that used the same subset of tables, so we could consider building VIEWs or consolidating the procedures. The table we built looks like this:
CREATE TABLE ProcTabs (procname CHAR(25) NOT NULL, tablename CHAR(25) NOT NULL, PRIMARY KEY (procname, tablename));
If you remember a previous column on relational division, you might first think about building a table with one combination of tables, then dividing the ProcTabs table by this divisor table to get the procedure names.
Unfortunately, there are 60 tables in the schema. The formula for all possible combinations involves factorials. In short, you don’t have enough storage to hold that table.
But how would a mathematician determine if two sets are equal? If set A and set B are the same size and there is a one-to-one mapping between them that is the same size as either of the original sets, then they are the same. We can do a mapping by doing an INNER JOIN between the two tables.
SELECT P1.procname AS proc1, P2.procname AS proc2, COUNT(*) AS tally_of_procs FROM ProcTabs AS P1, ProcTabs AS P2 WHERE P1.procname < P2.procname -- different procs GROUP BY P1.procname, P2.procname HAVING (SELECT COUNT(*) -- size of set A FROM ProcTabs AS P11 WHERE P11.procname = P1.procname) = (SELECT COUNT(*) -- size of mapping FROM ProcTabs AS P111, ProcTabs AS P222 WHERE P111.procname = P1.procname AND P222.procname = P2.procname AND P111.tablename = P222.tablename) AND (SELECT COUNT(*) -- size of set B FROM ProcTabs AS P22 WHERE P22.procname = P2.procname) = (SELECT COUNT(*) -- size of mapping FROM ProcTabs AS P111, ProcTabs AS P222 WHERE P111.procname = P1.procname AND P222.procname = P2.procname AND P111.tablename = P222.tablename);
The weakness of this approach is that we don’t exactly know are what the collections of tables; we only know how many tables there are in the procedures. We are working at a higher level of aggregation.
You can use the aggregate functions and the HAVING clause to determine certain characteristics of the groups formed by the GROUP BY clause. For example, given a simple grouped table with three columns like this:
SELECT col1, col2 FROM Foobar GROUP BY col1, col2 HAVING ..
You can determine the following properties of the groups with these HAVING clauses:
HAVING COUNT (DISTINCT col_x) = COUNT (col_x) |
— col_x has all distinct values |
HAVING COUNT(*) = COUNT(col_x); |
— There are no NULLs in the column |
HAVING MIN(col_x - <const>) = -MAX(col_x - <const>) |
— col_x deviates above and below the constant value by the same amount |
HAVING MIN(col_x) * MAX(col_x) < 0 |
— MAX is positive and MIN is negative |
HAVING MIN(col_x) * MAX(col_x) = 0 |
— either one or both of MIN or MAX is zero |
HAVING MIN(col_x) * MAX(col_x) > 0 |
— col_x is either all positive or all negative |
HAVING MIN(col_x) = -MAX(col_x) |
— col_x deviates above and below zero by the same amount |
HAVING MIN(SIGN(col_x)) = MAX(SIGN(col_x)) |
— col_x is all positive, all negative or all zero |
HAVING MIN(ABS(col_x)) = 0; |
— col_x has at least one zero |
HAVING MIN(ABS(col_x)) = MIN(col_x) |
— col_x >= 0 (although the where clause can handle this, too) |
HAVING MIN(col_x) = -MAX(col_x) |
— col_x deviates above and below zero by the same amount |
HAVING MIN(col_x) * MAX(col_x) = 0 |
— either one or both of MIN or MAX is zero |
HAVING MIN(col_x) < MAX(col_x) |
— col_xhas more than one value (may be faster than count (*) > 1) |
HAVING MIN(col_x) = MAX(col_x) |
— col_x has one value or NULLs |
Let me remind you again, that if there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group according to the SQL-89 and SQL-92 standards. This means that if you wish to apply one of the tests given above to the whole table, you will need to use a constant in the SELECT list.
This will be easier to see with an example. You are given a table with a column of unique sequential numbers that start at “one” and increase. When you insert an new row, you must either use a sequence number that is not currently in the column — that is, fill the gaps. If there are no gaps, then and only then can you use the next highest integer in the sequence.
CREATE TABLE Foobar (seq_nbr INTEGER NOT NULL PRIMARY KEY CHECK (seq > 0), name CHAR(5) NOT NULL); INSERT INTO Foobar VALUES (1, 'Tom'), (2, 'Dick'), (4, 'Harry'), (5, 'Moe');
How do I find if I have any gaps?
EXISTS (SELECT 'gap' -- any constant will work FROM Foobar HAVING COUNT(*) = MAX(seq_nbr))
You could not use “SELECT seq_nbr” because the column values will not be identical within the single group made from the table, so the subquery fails with a cardinality violation. Likewise, “SELECT *” fails because the asterisk is converted into a column name picked by the SQL engine. Here is the insertion statement:
INSERT INTO Foobar (seq_nbr, name) VALUES (CASE WHEN EXISTS -- no gaps (SELECT 'no gaps' FROM Foobar HAVING COUNT(*) = MAX(seq_nbr)) THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1 ELSE (SELECT MIN(seq_nbr) -- gaps FROM Foobar WHERE (seq_nbr - 1) NOT IN (SELECT seq_nbr FROM Foobar) AND seq_nbr > 0) - 1, 'Celko');
The ELSE clause has to handle a special situation when 1 is in the seq_nbr column, so that it does not return an illegal zero. The only tricky part is waiting for the entire scalar subquery expression to compute before subtracting one; writing “MIN(seq_nbr -1)” or “MIN(seq_nbr) -1” in the SELECT list could disable the use of indexes in many SQL products.
--
Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.
Contributors : Joe Celko
Last modified 2005-09-08 02:29 PM