Scrubbing Data with Non-1NF Tables - Part 1

by Joe Celko

SQL does not require that a table have unique constraints, a primary key, or anything else that would ensure data integrity. In short, you can use a table pretty much like a file if you wish. Is this a bad thing?

Well, mostly yes and a little no. You should never have such a beast in your final schema, but one common programming trick is to use a table without any constraints as a staging area. You load data from an external source into one of these non_TABLEs, scrub it and pass it along to a real table in the actual schema. The trouble is that a lot of the time, the non-TABLE is denormalized as well as full of bad data. You can do some normalization from the staging table into another set of scrubbing tables, but you can also do some work with the table as it stands.

This example is based on material posted by a Newbie on a SQL newsgroup, but his situation is not uncommon. He gets raw data from a source that can have duplicate rows and repeating groups in violation of First Normal Form (1NF). His scrub tables look like this:

CREATE TABLE PeopleSkills
(name VARCHAR(10) NOT NULL,
  code1 INTEGER NOT NULL,
  code2 INTEGER NOT NULL,
  code3 INTEGER NOT NULL,
  code4 INTEGER NOT NULL,
  code5 INTEGER NOT NULL);

INSERT INTO PeopleSkills VALUES ('mary', 1, 7, 8, 9, 13);
INSERT INTO PeopleSkills VALUES ('mary', 1, 7, 8, 9, 13);
INSERT INTO PeopleSkills VALUES ('mary', 1, 7, 7, 7, 13);
INSERT INTO PeopleSkills VALUES ('mary', 1, 7, 8, 9, 13);
INSERT INTO PeopleSkills VALUES ('joe', 1, 7, 8, 9, 3);
INSERT INTO PeopleSkills VALUES ('bob', 1, 7, 8, 9, 3);
INSERT INTO PeopleSkills VALUES ('larry', 22, 17, 18, 19, 113); -- notarget codes
INSERT INTO PeopleSkills VALUES ('mary', 1, 3, 2, 9, 13);
INSERT INTO PeopleSkills VALUES ('melvin', 1, 3, 2, 9, 13); -- 2 targetcodes
INSERT INTO PeopleSkills VALUES ('irving', 1, 8, 2, 9, 13); -- 1 targetcodes

Part of the scrubbing is to find which people have some or all of a particular code. The list can change, so we put it in a table of its own like this:

CREATE TABLE TargetCodes
(code INTEGER NOT NULL PRIMARY KEY,
  description VARCHAR(50) NOT NULL);
INSERT INTO TargetCodes VALUES (1, 'code1')
INSERT INTO TargetCodes VALUES (3, 'code3')
INSERT INTO TargetCodes VALUES (7, 'code7')

The first goal is to return a report with the name of the person and the number of target codes they have in their skills inventory.

The first thought of an experienced SQL programmer is to normalize the repeated group. The obvious way to do this is with a derived table, thus:

SELECT P1.name, COUNT(*)
  FROM (SELECT name, code1 FROM PeopleSkills
         UNION
         SELECT name, code2 FROM PeopleSkills
         UNION
         SELECT name, code3 FROM PeopleSkills
         UNION
         SELECT name, code4 FROM PeopleSkills
         UNION
         SELECT name, code5 FROM PeopleSkills)
         AS P1 (name, code) -- normalized table!
        LEFT OUTER JOIN
        TargetCodes AS T1
        ON T1.code = P1.code
 GROUP BY P1.name;

The reason that this fools experienced SQL programmers is that they know that a schema should be in First Normal Form and they immediately fix that problem without looking a bit further. They want to correct the design problem first.

That chain of UNIONs can be replaced by a chain of OR-s, hidden in an IN() predicate. This one is not so bad to write.

SELECT P1.name, COUNT (DISTINCT T1.code) AS Tally
  FROM PeopleSkills AS P1
       LEFT OUTER JOIN
       TargetCodes AS T1
       ON T1.code IN (code1, code2, code3, code4, code5)
GROUP BY name;

Results
name    tally
=============
'bob' 	3
'irving'	1
'joe' 	3
'larry'	0
'mary'	3
'melvin'	2

The trick is the use of an IN () predicate when you have a repeating group. This will give you just the names of those who have one or more target codes.

SELECT DISTINCT name
  FROM PeopleSkills AS P1
WHERE code1 IN (SELECT code FROM TargetCodes)
   OR code2 IN (SELECT code FROM TargetCodes)
   OR code3 IN (SELECT code FROM TargetCodes)
   OR code4 IN (SELECT code FROM TargetCodes)
   OR code5 IN (SELECT code FROM TargetCodes);

This next modification will shown you which codes each person has, with 1/0 flags. This has a neat trick with little-used SUM(DISTINCT <exp>) construction, but you have to know what the target codes are in advance.

SELECT name,
        SUM(DISTINCT CASE
                    WHEN 1 IN (code1, code2, code3, code4, code5)
                    THEN 1 ELSE 0 END) AS code1,
        SUM(DISTINCT CASE 
                    WHEN 3 IN (code1, code2, code3, code4, code5)
                    THEN 1 ELSE 0 END) AS code3,
        SUM(DISTINCT CASE 
                    WHEN 7 IN (code1, code2, code3, code4, code5)
                    THEN 1 ELSE 0 END) AS code7
  FROM PeopleSkills AS P1
GROUP BY name;

Results
 Name     code1 code3 code7
===========================
'bob' 	1	1	1
'irving'	1	0	0
'joe'	      1	1	1
'larry'	0	0	0
'mary'	1	1	1
'melvin'	1	1	0

Another trick for scrubbing such data is the Bose-Nelson sort (“A Sorting Problem” by R. C. Bose and R. J. Nelson; Journal of the ACM, vol. 9 pages 282-296, and my article in DR. DOBB'S JOURNAL back in 1985). This is a recursive procedure that takes an integer and then generates swap pairs for a vector of that size. A swap pair is a pair of position numbers from 1 to (n) in the vector that need to be exchanged if they are out of order. Swap pairs are also related to Sorting Networks in the literature (see The Art of Computer Programming, by Donald Knuth, vol. 3).

You are probably thinking that this method is a bit weak because the results are only good for sorting a fixed number of items. But a table only has a fixed number of columns, so that is not a problem in denormalized SQL.

You can set up a sorting network that will sort five items, with the minimal number of exchanges, nine swaps, like this:

swap (c1, c2);
swap (c4, c5);
swap (c3, c5);
swap (c3, c4);
swap (c1, c4);
swap (c1, c3);
swap (c2, c5);
swap (c2, c4);
swap (c2, c3);

You might want to deal yourself a hand of five playing cards in one suit to see how it works. Put the cards face down on the table and pick up the pairs, swapping them if required, then turn over the row to see that it is in sorted order when you are done.

In theory, the minimum number of swaps needed to sort (n) items is CEILING(log2 (n!)) and as (n) increases, this approaches O(n*log2(n)). Computer Science majors will remember that “Big O” expression as the expected performance of the best sorting algorithms, such as Quicksort. The Bose-Nelson method is very good for small values of (n). If (n < 9), then it is perfect, actually. But as things get bigger, Bose-Nelson approaches O(n ^ 1.585). In English, this method is good for a fixed-size list of 16 or fewer items and goes to hell after that.

You can write a version of the Bose-Nelson procedure that will output the SQL code for a given value of (n). The obvious direct way to do a swap(x, y) is to write a chain of UPDATE statements. Remember that in SQL, the SET clause assignments happen in parallel, so you can easily write a SET clause that exchanges the two items when are out of order. Using the above swap chain, we get this block of code:

BEGIN ATOMIC
-- swap (code1, code2);
UPDATE PeopleSkills
   SET code1 = code2, code2 = code1
WHERE code1 > code2;

-- swap (code4, code5);
UPDATE PeopleSkills
   SET code4 = code5, code5 = code4
WHERE code4 > code5;

-- swap (code3, code5);
UPDATE PeopleSkills
SET code3 = code5, code5 = code3
WHERE code3 > code5;

-- swap (code3, code4);
UPDATE PeopleSkills
   SET code3 = code4, code4 = code3
WHERE code3 > code4;

-- swap (code1, code4);
UPDATE PeopleSkills
   SET code1 = code4, code4 = code1
WHERE code1 > code4;

-- swap (code1, code3);
UPDATE PeopleSkills
   SET code1 = code3, code3 = code1
WHERE code1 > code3;

-- swap (code2, code5);
UPDATE PeopleSkills
   SET code2 = code5, code5 = code2
WHERE code2 > code5;

-- swap (code2, code4);
UPDATE PeopleSkills
   SET code2 = code4, code4 = code2
WHERE code2 > code4;

-- swap (code2, code3);
UPDATE PeopleSkills
   SET code2 = code3, code3 = code2
WHERE code2 > code3;

SELECT * FROM PeopleSkills;
END;

This is fully portable, standard SQL code and it can be machine generated. But that parallelism is useful. It is worthwhile to combine some of the UPDATE statements, but you have to be careful not to change the effective sequence of the swap operations.

If you look at the first two UPDATE statements, you can see that they do not overlap. This means you could roll them into one statement like this:

-- swap (code1, code2) AND swap (code4, code5);
UPDATE Foobar
   SET code1 = CASE WHEN code1 <= code2 THEN code1 ELSE code2 END,
       code2 = CASE WHEN code1 <= code2 THEN code2 ELSE code1 END,
       code4 = CASE WHEN code4 <= code5 THEN code4 ELSE code5 END,
       code5 = CASE WHEN code4 <= code5 THEN code5 ELSE code4 END
WHERE code4 > code5 OR code1 > code2

The advantage of doing this is that you have to execute only one UPDATE statement and not two. Updating a table, even on non-key columns, usually locks the table and prevents other users from getting to the data. If you could roll the statements into one single UPDATE, you would have the best of all possible worlds, but I doubt that the code would be easy to read. I’ll leave that as an exercise to the reader.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.

Contributors : Joe Celko
Last modified 2006-01-04 01:21 PM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Scrubbing Data with Non-1NF Tables - Part 1