Scrubbing Data with Non-1NF Tables - Part 2

by Joe Celko

Part 1  |  Part 2  |  Part 3

In the first part of this series, I showed you some queries and tricks for getting data from a non-First Normal Form (NFNF or non-1NF) table. Let’s assume that you are moving data from a file into such a table. What should the target table look like?

The usual answer is to make all the columns NVARCHAR (n) where (n) is the maximum size allowed by your particular SQL product. This is the most general data type, and it can hold all kinds of garbage. The real shame about this schema design is that people do use it in their actual database and not just as a staging area for scrubbing bad data. For the record, these are quick and dirty tricks for data scrubbing when you don’t have any other tools. If you do have an ETL tool or other data scrubbing tools, use them instead.
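
As a sketch, here is what such a staging table might look like for a hypothetical mailing-list file; the column names are illustrative, and 4000 is the NVARCHAR limit in the SQL Server family (other products differ):

-- Every column is an oversized NVARCHAR so the raw file loads without rejections.
-- Column names are hypothetical examples, not from any real schema.
CREATE TABLE Scrub_Staging
(customer_name NVARCHAR(4000),
 address_line  NVARCHAR(4000),
 city_name     NVARCHAR(4000),
 state_code    NVARCHAR(4000),
 postal_code   NVARCHAR(4000),
 order_date    NVARCHAR(4000),
 order_amt     NVARCHAR(4000));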

The first question to ask is whether you should be using NVARCHAR (n) or simply VARCHAR (n). If you allow a national character set, you can catch some errors that might not be seen in a simple Latin-1 alphabet. But most of the time, you can be sure that the file was in ASCII or EBCDIC by the time you moved it to the staging table with a utility program, such as BCP in the SQL Server family.

The best way to do this is with a Comma Separated Values (CSV) file. You can modify such a file with a text editor. If worst comes to worst, you can even wrap each line in individual “INSERT INTO <table name> (<column list>) VALUES (<csv string>);” code and run the file as an SQL transaction with save points.
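
As a hedged sketch against the hypothetical staging table above (savepoint syntax varies by product; the standard spells it SAVEPOINT, while SQL Server uses SAVE TRANSACTION), the wrapped file might look like this:

START TRANSACTION;  -- SQL Server spells this BEGIN TRANSACTION

INSERT INTO Scrub_Staging
       (customer_name, address_line, city_name, state_code,
        postal_code, order_date, order_amt)
VALUES ('John Longstreet', '123 Main  Street', 'Austin', 'TX',
        '78701', '2003-11-12', '19.95');
SAVEPOINT row_0001;  -- SQL Server spells this SAVE TRANSACTION row_0001

INSERT INTO Scrub_Staging
       (customer_name, address_line, city_name, state_code,
        postal_code, order_date, order_amt)
VALUES ('Mary Smith', '456 Oak Street', 'Dallas', 'tx',
        '', '2003-02-30', '123XX');   -- deliberately dirty values for later examples
SAVEPOINT row_0002;

COMMIT;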

The second question is what value of (n) to use. Setting it to the max is fine for the first scrubbing. The next thing you are going to do is run a query that looks for the minimum, maximum, and average length of each of the columns. If a column is supposed to be a fixed length, then all three of these should be the same.
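
A sketch of that profiling query for one column, repeated for each column in the staging table (the length function name varies: CHAR_LENGTH() in the standard, LEN() in SQL Server, LENGTH() elsewhere):

-- Length profile for one column of the hypothetical staging table.
SELECT MIN(CHAR_LENGTH(postal_code)) AS min_len,
       MAX(CHAR_LENGTH(postal_code)) AS max_len,
       AVG(CHAR_LENGTH(postal_code)) AS avg_len
  FROM Scrub_Staging;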

If a column is supposed to be of varying length, then all three of these should fall within a reasonable range. How do you define reasonable? A length of zero is the obvious sign that a column is too short. This can happen when a field was skipped on an input form or if there were errors in converting it into a CSV file. As a recent personal example, I moved an ACT file into SQL Server using the ACT utility program to get a CSV file and found several rows where the data had been shifted over one position, leaving blank or empty columns.
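
A quick way to find those rows, again as a sketch against the hypothetical staging table:

-- Rows where a column arrived empty or blank, which is the usual symptom of shifted data.
-- TRIM() is the standard spelling; older SQL Server releases need LTRIM(RTRIM()).
SELECT *
  FROM Scrub_Staging
 WHERE postal_code IS NULL
    OR TRIM(postal_code) = '';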

You generally have some idea if a varying column is too long. For example, the United States Postal Service standards for labels use CHAR(35), so any address line longer than that is suspect (and cannot be used on bulk mailings).
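
Flagging the suspect address lines is then a one-line predicate:

-- USPS label limit of 35 characters; CHAR_LENGTH() is LEN() in SQL Server.
SELECT customer_name, address_line
  FROM Scrub_Staging
 WHERE CHAR_LENGTH(address_line) > 35;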

If you have columns that are longer than expected, the first action should be to UPDATE the scrub table using TRIM() and REPLACE() functions to remove extra blanks. Extra white space is the usual culprit. You might find it is faster to do this quick cleanup in the original CSV file with a text editor.
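
A sketch of that cleanup pass; note that a single REPLACE() of two spaces with one only shrinks runs of blanks by half, so re-run it until it reports zero rows changed:

-- Collapse doubled blanks and strip leading/trailing blanks in one pass.
UPDATE Scrub_Staging
   SET address_line = TRIM(REPLACE(address_line, '  ', ' '));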

However, other simple edits are probably best done in SQL since a text editor does not see the individual fields. You might want to change “Street” to “St” to keep mailing addresses short, but a text editor will cheerfully make “John Longstreet” into “John Longst” as well.
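
In SQL, the abbreviation touches only the address column, so a customer_name column is never at risk; the leading space in the search pattern is extra insurance against street names such as “Longstreet Road”:

-- Abbreviate 'Street' to 'St' in the address column only.
UPDATE Scrub_Staging
   SET address_line = REPLACE(address_line, ' Street', ' St');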

In the same UPDATE, you can use UPPER() or LOWER() to be sure that your data is in the right case. Proper capitalization for text is a bit harder, and if you have to do this often, it is a good idea to write a stored procedure or user-defined function in the 4GL language that came with your SQL product.
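
For example, the hypothetical state and postal code columns can be forced into upper case in one pass:

-- Codes are conventionally upper case; names need the harder proper-case treatment.
UPDATE Scrub_Staging
   SET state_code  = UPPER(state_code),
       postal_code = UPPER(postal_code);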

Finally, look at the data itself. Many SQL products offer functions that test to see if a string is a valid numeric expression or to cast it into a numeric. But you have to be careful since some of these functions stop parsing as soon as they have a numeric string; that is, given the string “123XX,” your library function might return 123 and ignore the invalid characters at the end.
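
One hedge against that behavior is a pattern test that rejects any non-digit character outright, rather than trusting a cast that quietly stops at the first bad character. The bracketed character class shown here is a SQL Server extension of LIKE, not standard SQL, and signs or currency symbols would need extra handling:

-- Find amounts containing anything other than digits or a decimal point.
SELECT customer_name, order_amt
  FROM Scrub_Staging
 WHERE order_amt LIKE '%[^0-9.]%'
    OR order_amt IS NULL;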

Most SQL products have some kind of regular expression predicate that works like the SQL:1999 SIMILAR TO predicate or the grep utility in Unix. This is a great tool for validating the scrubbed data, but it has some limits. It only tells you that the data is in a valid format, not whether it is valid data.
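
A sketch with the SIMILAR TO predicate, validating the hypothetical ZIP code column (SQL Server has no SIMILAR TO; you would fall back on LIKE patterns or an external regular expression library):

-- A US ZIP code is five digits, optionally followed by a hyphen and four more.
SELECT customer_name, postal_code
  FROM Scrub_Staging
 WHERE postal_code NOT SIMILAR TO '[0-9]{5}(-[0-9]{4})?';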

For example, given a date of “12/11/03” you have no idea if it was supposed to be “2003-11-12” or “2003-12-11” without outside information. This is why we have the ISO-8601 Standards for displaying temporal data. Likewise, “2003-02-30” will pass a simple regular expression parse, but there is no such date.
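
A cast-based check catches impossible dates that a pattern cannot. TRY_CAST() shown here is SQL Server syntax; other products have similar “safe cast” functions, and this is only a sketch against the hypothetical order_date column:

-- Fails for '2003-02-30' and for plain garbage, but not for ambiguous formats.
SELECT customer_name, order_date
  FROM Scrub_Staging
 WHERE order_date IS NOT NULL
   AND TRY_CAST(order_date AS DATE) IS NULL;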

One of the most common errors in file systems was to load the same raw data into the file more than once. Sometimes it was literally the same data — an operator hung a magnetic tape, loaded a file, and then forgot to mark the job as having been done. The next shift comes to work and repeats the operation. Other times, a data entry clerk simply input the same data twice or sent a correction without removing the erroneous data. Given an impatient user with a fast mouse button, you can get the same problem in a new technology.

At this point, you are ready to move the raw data to a new table with columns that have appropriate data types, but no constraints just yet. The move should be done with an “INSERT INTO … SELECT DISTINCT … ” to get rid of the redundant duplicates.
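
Continuing the sketch, the typed target and the move might look like this; the data types are illustrative guesses, and the casts assume the string columns have already passed the checks above:

-- Typed target table, still with no constraints.
CREATE TABLE Scrub_Typed
(customer_name VARCHAR(35),
 address_line  VARCHAR(35),
 city_name     VARCHAR(35),
 state_code    CHAR(2),
 postal_code   CHAR(10),
 order_date    DATE,
 order_amt     DECIMAL(12,2));

-- Move the scrubbed rows, dropping exact duplicates along the way.
INSERT INTO Scrub_Typed
       (customer_name, address_line, city_name, state_code,
        postal_code, order_date, order_amt)
SELECT DISTINCT
       customer_name, address_line, city_name, state_code,
       postal_code,
       CAST(order_date AS DATE),
       CAST(order_amt AS DECIMAL(12,2))
  FROM Scrub_Staging;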

In Part 3 of this series, we will scrub this “less dirty” data a bit further.

--

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.

