Skip to content

DBAzine.com

Sections
Personal tools
You are here: Home » Of Interest » Articles of Interest » Scrubbing Data with Non-1NF Tables - Part 3
Seeking new owner for this high-traffic DBAzine.com site.
Tap into the potential of this DBA community to expand your business! Interested? Contact us today.
Who Are You?
I am a:
Mainframe True Believer
Distributed Fast-tracker

[ Results | Polls ]
Votes : 3558
 

Scrubbing Data with Non-1NF Tables - Part 3

by Joe Celko

Part 1  |  Part 2  |  Part 3

In the first two parts of this series, I showed you some queries on the non-First Normal Form (NFNF or non-1NF) table and updates for scrubbing raw data. At the end of part 2, the data had been moved to a table with the correct data types, but no constraints or validations.

At the point at which you have the raw data scrubbed this far, there is a temptation to simply load it into the “real tables” in the database. Resist the temptation. The syntax of the data might be acceptable, but that does not mean it is right.

We can classify errors as single-column or multi-column errors. A single-column error might be a gender code of “B” when only “M” or “F” is allowed. A multiple-column error involves individual columns that are valid, but the combination of which is invalid. For example, pregnancy is a valid medical condition; male is a valid gender; but a pregnant male is an invalid combination.

The first test is to see if your key is actually a key by running a test for NULLs and counting the occurrences of unique values:

  SELECT key_1, key_2, ... key_n
FROM ScrubTank
GROUP BY key_1, key_2, ... key_n
HAVING COUNT(*) > 1 -- dups
OR (SIGN(key_1) + .. + SIGN(key_n) IS NULL

You can also use SUBSTRING(), CASE, or other functions with concatenation so that any NULL will propagate.

Let’s assume we have a column with a code that is five characters long and we have trimmed and edited the original raw data until all the rows of that column are indeed CHAR(5). But there is a syntax rule that the code is of this format (using SQL-99 predicates):

CHECK (Foo_code SIMILAR TO
'[:UPPER:][:UPPER:][:DIGIT:][:DIGIT:][:DIGIT:]')

If you add this to your scrub table with an ALTER TABLE statement, you need to know if your SQL product will immediately test existing data for compatibility, or if the constraint will go into effect only for inserted or updated data.

Instead of adding the check constraints all at once, write case expressions that will do the testing for you. The format is simple and can be done with a text editor. Pull off the predicates from the CHECK()constraints in the target table and put them into a query like this:

SELECT
CASE WHEN NOT <> THEN 'err_###'
WHEN NOT <> THEN 'err_###'
...
ELSE ' ' END AS <>,
...
FROM ScrubTank;

A CASE expression will test each WHEN clause in the order written, so when you see one error message, you will need to correct it and then pass the data through the query again. The goal is to get a query with all blanks in the columns to show that all the rows have passed.

Rules that apply to more than one column can be tested with another query that looks for the table constraints in the same way. It is a good idea to do this as a separate step after the single column validations. A correction in one column will often fix the multi-column errors, too.

Hopefully, we are now ready to finally put the scrubbed data into one or more of the target tables in the actual database schema. That ought to be a simple “INSERT INTO.. SELECT.. FROM ScrubTank” statement.

Frankly, there are better tools for data scrubbing than pure SQL; this series of articles was more of a “proof of concept” than a recommendation. If you have the logical constraints in the text of your database schema, then pulling them out is a matter of a text edit, not completely new programming. While this approach is a bit of work, it gives you a script that you can reuse and does not cost you any extra money for new software.

--

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.


Contributors : Joe Celko
Last modified 2006-01-04 01:22 PM
Transaction Management
Reduce downtime and increase repeat sales by improving end-user experience.
Free White Paper
Database Recovery
Feeling the increased demands on data protection and storage requirements?
Download Free Report!
 
 

Powered by Plone