Molecular, Atomic and Sub-atomic Data Elements - Part 1

by Joe Celko

Part 1  |  Part 2

If you were a kid in the 1950s, you will remember those wonderful science fiction movies that always had the word “atomic” in the title, like “Atomic Werewolf from Mars,” or worse. We were still in awe of the atomic bomb and were assured that we would soon be driving atomic cars and airplanes. It was sort of like the adjective “extreme” is today. Nobody knows quite what it means, but it sounds really, really cool.

Technically, the ancient Greek root meaning of atomic is “without parts.” The idea was that if you were to continue to divide a physical entity into smaller and smaller pieces, you would eventually hit some elemental baseline. If you went beyond that baseline, you would destroy that entity.

When we describe First Normal Form (1NF), we say that a data element should hold atomic or scalar values. What we mean is that if one tries to pull out “sub-atomic parts” from the value in a column, the value loses meaning.

“Scalar” is used as a synonym for atomic, but its meaning is actually a little trickier. To be scalar requires that there be a scale of measurement from which the value is drawn and from which it takes meaning. It is a bit more strict, and a good database designer will try to establish the scales of measurement in his data model.

Most newbies assume that having a column in a SQL table automatically makes the value atomic. A column cannot hold a data structure like an array, linked list, or another table; it has to be of a simple data type. Ergo, it must be an atomic value. That reasoning held through Standard SQL-92, since the language had no support for such structures. But it is no longer true in SQL-99, which introduced several very non-relational “features,” and several vendors have added their own support for arrays, nested tables, and variant datatypes.

The worst way to design a schema is probably to split an attribute across tables. If I were to design a schema with a “Male_Personnel” and a “Female_Personnel” table, or one table per department, you would see the fallacy instantly. Here, an attribute — gender or department — is turned into metadata for defining tables.

In the old punch-cards-and-tape-file-system days, we physically moved data to such selective files to make processing easier. It was how we got parallelism. Okay guys, this is the 21st century, and we need to stop doing this.
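
Just to make the contrast concrete, here is a minimal sketch of the single-table design, with gender as an ordinary column; the names and the CHECK list are my own illustration, not from the original article:

-- One Personnel table; gender is a value in a column, not part of a table name.
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,
 emp_name CHAR(30) NOT NULL,
 gender CHAR(1) NOT NULL
    CHECK (gender IN ('M', 'F')));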

A partitioned table is not the same thing. It is one logical, semantic unit of data, maintained by the system and not by the applications. The fact that it is split across physical file structures has nothing to do with its semantics.

Perhaps the fact that DDL often has a mix of logical data descriptions combined with physical implementations in vendor extensions confuses us. As an aside, I often wonder if SQL should have had a separate syntax for referential integrity, relational cardinality, membership, domain constraints, and so forth, rather than allowing them in the DDL.

Atomic Data Elements

The other mistake one can make is to split an atomic attribute into columns. As we all know from those 1950s science fiction movies, nothing good comes from splitting atoms — it could turn your brother into an atomic werewolf!

A phone number in the United States is displayed as three sections (area code, exchange, and number). Each part is useless by itself. In fact, you should include the international prefixes to make it more exact, but usually, context is enough. You would not split this data element over three columns because you search and use this value in the order in which it is presented, and you use it as a whole unit. This is an atom, not a molecule.
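
As a quick sketch of what an atomic value with a scale of measurement looks like in DDL, a single column with a format constraint does the job; the table name, column name, and pattern here are my own illustration:

-- One column holds the whole phone number; the CHECK enforces the display format.
CREATE TABLE Customers
(cust_nbr INTEGER NOT NULL PRIMARY KEY,
 phone_nbr CHAR(12) NOT NULL
    CHECK (phone_nbr LIKE '___-___-____'));   -- e.g., '212-555-0100'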

You can also split a single data element across rows. Consider this absurd table:

CREATE TABLE Personnel
(worker_name CHAR(20) NOT NULL,
 attribute_name CHAR(15) NOT NULL
    CHECK (attribute_name IN ('weight', 'height', 'bowling score')),
 attribute_value INTEGER NOT NULL,
 PRIMARY KEY (worker_name, attribute_name));

The bad news is that you will see this kind of thing in the real world — one column shows metadata and the other shows a value.
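
For contrast, here is a minimal sketch of the same facts rewritten as ordinary columns, each on its own scale of measurement; the units implied by the column names are my own assumption:

CREATE TABLE Personnel
(worker_name CHAR(20) NOT NULL PRIMARY KEY,
 weight_lbs INTEGER NOT NULL,       -- pounds
 height_inches INTEGER NOT NULL,    -- inches
 bowling_score INTEGER NOT NULL);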

Let’s look at a subtler version of the same thing. Consider the following table, which mimics a clipboard on which we record the start and finish of a task by an employee:

CREATE TABLE TaskList
(worker_name CHAR(20) NOT NULL,
 task_nbr INTEGER NOT NULL,
 task_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
 task_status CHAR(1) DEFAULT 'S' NOT NULL
    CHECK (task_status IN ('S', 'F')),
 PRIMARY KEY (worker_name, task_nbr, task_status));

To know whether a task is finished (task_status = 'F'), we first need to know that it was started (task_status = 'S') — that means the constraint requires a self-join. A good heuristic: when a constraint needs a self-join, the schema is probably bad, because something has been split and has to be re-assembled inside the constraint.
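
To see how painful that is, here is roughly what the constraint would have to say. CREATE ASSERTION is Standard SQL but rarely implemented, so treat this as a sketch of the logic, not production code:

-- Every finished row must have a matching started row for the same worker and task.
CREATE ASSERTION finished_tasks_were_started
CHECK (NOT EXISTS
       (SELECT *
          FROM TaskList AS F
         WHERE F.task_status = 'F'
           AND NOT EXISTS
               (SELECT *
                  FROM TaskList AS S
                 WHERE S.worker_name = F.worker_name
                   AND S.task_nbr = F.task_nbr
                   AND S.task_status = 'S')));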

Let’s rewrite the DDL with the idea that a task is a data element:

CREATE TABLE TaskList
(worker_name CHAR(20) NOT NULL,
 task_nbr INTEGER NOT NULL,
 task_start_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
 task_end_time TIMESTAMP, -- null means in process
  PRIMARY KEY (worker_name, task_nbr));
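
With the whole task in one row, the status falls out of a simple predicate; a quick illustration, not from the original article:

-- Finished tasks are the rows with a non-NULL end time; tasks still in
-- process are the rest. No self-join is needed.
SELECT worker_name, task_nbr, task_start_time, task_end_time
  FROM TaskList
 WHERE task_end_time IS NOT NULL;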

I deliberately picked a temporal split to lead into the next section. But there are other ways to split a data element over rows in the same table.

Non-atomic Data Elements

The most common newbie error is splitting a temporal data element into (year, month, day) columns, or (year, month) columns, or (year) columns. The problem with temporal data is that, by its nature, it is not atomic; it is a continuum. A continuum has no atomic parts; it can be infinitely subdivided. Thus, the year 2005 is shorthand for the interval from “2005-01-01 00:00:00” to “2005-12-31 23:59:59.999..”, carried out to whatever precision our SQL product allows at the open end. It includes every point in between — every uncountably infinite one of them.
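
If an application needs only the year or the month, it can extract them from a full temporal value instead of storing the pieces; a minimal sketch, with table and column names of my own invention:

-- Store the full date; derive the parts as needed.
CREATE TABLE Sales
(sale_nbr INTEGER NOT NULL PRIMARY KEY,
 sale_date DATE NOT NULL);

SELECT EXTRACT(YEAR FROM sale_date) AS sale_year,
       COUNT(*) AS sales_cnt
  FROM Sales
 GROUP BY EXTRACT(YEAR FROM sale_date);   -- some products want an alias here instead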

The Greeks did not have the concept of a continuum, and this led to Zeno’s famous paradoxes. (Hey, this is a database article, but you can Google Greek philosophy for yourself.)

We will spend more time on this in Part II of the series.

Faking It

So, how do programmers “fake it” within the syntax of SQL when they want non-1NF data semantics? One way is to use a group of columns in which all the members of the group have the same semantic value; that is, they represent the same data element. Consider the following table of an employee and his children:

CREATE TABLE Employees
(emp_nbr INTEGER NOT NULL,
 emp_name CHAR(30) NOT NULL,
 ...
 child1 CHAR(30), birthday1 DATE, sex1 CHAR(1),
 child2 CHAR(30), birthday2 DATE, sex2 CHAR(1),
 child3 CHAR(30), birthday3 DATE, sex3 CHAR(1),
 child4 CHAR(30), birthday4 DATE, sex4 CHAR(1));

This looks like the layouts of many existing file system records in COBOL and other 3GL languages. The birthday and sex information for each child is part of a repeated group and therefore violates 1NF. This is faking a four-element array in SQL; the index just happens to be part of the column name!

Very clearly, the dependents should have been in their own table. There would be no upper limit on family size, aggregation would be much easier, the schema would have fewer NULLs, and so on.
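
A hedged sketch of that repair, with names of my own choosing:

-- One row per dependent; Employees keeps only facts about the employee.
-- (Assumes emp_nbr is the key of Employees.)
CREATE TABLE Dependents
(emp_nbr INTEGER NOT NULL
    REFERENCES Employees (emp_nbr),
 child_name CHAR(30) NOT NULL,
 birthday DATE NOT NULL,
 sex CHAR(1) NOT NULL
    CHECK (sex IN ('M', 'F')),
 PRIMARY KEY (emp_nbr, child_name));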

Suppose I have a table showing the quantity of a product sold in each month of a particular year. I originally built the table to look like the following:

 CREATE TABLE Abnormal
 (product CHAR(10) NOT NULL PRIMARY KEY,
  month_01 INTEGER, -- null means no quantity recorded for that month
  month_02 INTEGER,
  ...
  month_12 INTEGER);

and I wanted to flatten it out into a more normalized form:

 CREATE TABLE Normal
 (product CHAR(10) NOT NULL,
  month_nbr INTEGER NOT NULL,
  qty INTEGER NOT NULL,
  PRIMARY KEY (product, month_nbr));

I can use the statement

INSERT INTO Normal (product, month_nbr, qty)
SELECT product, 1, month_01
FROM Abnormal
WHERE month_01 IS NOT NULL
UNION ALL
SELECT product, 2, month_02
FROM Abnormal
WHERE month_02 IS NOT NULL
...

UNION ALL
SELECT product, 12, month_12
FROM Abnormal
WHERE month_12 IS NOT NULL;

While a UNION ALL expression is usually slow, this has to be run only once to load the normalized table, and then the original table can be dropped.
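
As an aside, the same load can be written as one pass over Abnormal by cross joining it to the twelve month numbers and letting a CASE expression pick out the matching column. This is my own sketch, not part of the original article, and the table value constructor in the FROM clause is not supported by every product:

INSERT INTO Normal (product, month_nbr, qty)
SELECT product, month_nbr, qty
  FROM (SELECT A.product, M.month_nbr,
               CASE M.month_nbr
                    WHEN  1 THEN A.month_01
                    WHEN  2 THEN A.month_02
                    WHEN  3 THEN A.month_03
                    WHEN  4 THEN A.month_04
                    WHEN  5 THEN A.month_05
                    WHEN  6 THEN A.month_06
                    WHEN  7 THEN A.month_07
                    WHEN  8 THEN A.month_08
                    WHEN  9 THEN A.month_09
                    WHEN 10 THEN A.month_10
                    WHEN 11 THEN A.month_11
                    WHEN 12 THEN A.month_12 END AS qty
          FROM Abnormal AS A
               CROSS JOIN (VALUES (1), (2), (3), (4), (5), (6),
                                  (7), (8), (9), (10), (11), (12))
                          AS M(month_nbr)) AS Unpivoted
 WHERE qty IS NOT NULL;   -- drop the empty months, since Normal.qty is NOT NULL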

Parsing a List in a String

Another popular method is to fill a string column with a comma-separated list. This results in a lot of string-handling procedures to work around the kludge. The right way would be to put each value of the list into its own row of a single column in a table. Consider this example:

CREATE TABLE InputStrings
(key_col CHAR(10) NOT NULL PRIMARY KEY,
input_string VARCHAR(255) NOT NULL);

INSERT INTO InputStrings VALUES ('first', '12,34,567,896');
INSERT INTO InputStrings VALUES ('second', '312,534,997,896');
...

This will be the table that receives the output, in the form of the original key column, the position of the value within the list, and one parameter per row.

CREATE TABLE Parmlist
(key_col CHAR(10) NOT NULL,
 place INTEGER NOT NULL,
 parm INTEGER NOT NULL,
 PRIMARY KEY (key_col, place));

It makes life easier if the lists in the input strings start and end with a comma. You will also need a table called “Sequence,” which is a set of integers from 1 to (n).
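
If you do not already have one, the Sequence table is trivial to build; its upper limit just has to cover the longest comma-padded input string. A minimal sketch:

CREATE TABLE Sequence
(seq INTEGER NOT NULL PRIMARY KEY
    CHECK (seq > 0));

INSERT INTO Sequence VALUES (1);
INSERT INTO Sequence VALUES (2);
INSERT INTO Sequence VALUES (3);
-- ... and so on, up to (n) ...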

SELECT key_col,
       CAST (SUBSTRING (',' || I1.input_string || ','
                        FROM MAX(S1.seq) + 1
                        FOR S2.seq - MAX(S1.seq) - 1)
          AS INTEGER) AS parm,
       COUNT(S2.seq) AS place
  FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
 WHERE SUBSTRING (',' || I1.input_string || ',' FROM S1.seq FOR 1) = ','
   AND SUBSTRING (',' || I1.input_string || ',' FROM S2.seq FOR 1) = ','
   AND S1.seq < S2.seq
   AND S2.seq <= CHAR_LENGTH(I1.input_string) + 2
 GROUP BY I1.key_col, I1.input_string, S2.seq;

The S1 and S2 copies of Sequence are used to locate bracketing pairs of commas, and the substrings located between them are extracted and cast as integers in one non-procedural step.

The trick is to be sure that the left-hand comma of the bracketing pair is the closest one to the second comma. The place column tells you the relative position of the value in the input string.
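
Loading Parmlist is then just a matter of wrapping the query above in an INSERT. As a quick check of the results (my own illustration, using the sample rows shown earlier):

-- INSERT INTO Parmlist (key_col, parm, place) followed by the query above;
-- 'first' then yields the parms 12, 34, 567, 896 in places 1 through 4.
SELECT key_col, place, parm
  FROM Parmlist
 ORDER BY key_col, place;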

Ken Henderson can be credited with creating a very fast version of this trick. Instead of using a comma to separate the fields within the list, put each value into a fixed-length substring and extract them by using a simple multiplication of the length by the desired array index number. This is a direct imitation of how many compilers handle arrays at the hardware level.
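
A hedged sketch of that idea, assuming every value has been right-aligned into a five-character slot; the slot width and names are my own, not Henderson's published code:

-- Slot (i) starts at position (i - 1) * 5 + 1, so simple arithmetic replaces
-- the comma hunting; assumes the string length is an exact multiple of 5.
SELECT I1.key_col,
       S1.seq AS place,
       CAST (SUBSTRING (I1.input_string
                        FROM (S1.seq - 1) * 5 + 1
                        FOR 5) AS INTEGER) AS parm
  FROM InputStrings AS I1, Sequence AS S1
 WHERE S1.seq <= CHAR_LENGTH(I1.input_string) / 5;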

For the record, I do not advocate any of these programming approaches. I am presenting them as common kludges used by programmers.

--

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 750 magazine columns and seven books, the best known of which is SQL for Smarties, now in its third edition. For more articles, puzzles, resources, and more, go to www.celko.com.

