Molecular, Atomic, and Sub-atomic Data Elements - Part 2

by Joe Celko

Part 1 | Part 2

SQL and most programming languages are based on syntax and have very little to do with semantics. A data element is different from its representation, and so is its atomicity. Now that we have discussed atomic data elements in part 1 of this article, let’s consider the next unit of matter — molecules!

Molecular Data Elements

A molecule is a unit of matter made up of atoms in a particular arrangement. So, let me define a unit of data made up of scalar or atomic values in a particular arrangement. The principle characteristic of such a unit is that the whole loses precise meaning when any part is removed. Note that I said precise meaning — it can still have some meaning, but it now refers to a set — possibly, an infinite set — of values.

One example of such a molecular unit would be (longitude, latitude) pairs in separate columns. Together, they provide a precise location, a point; apart, they describe a line or a circle with an infinite number of points.

Yes, you could model a location as a single column with the pair inside of it, but the arithmetic would be a screaming pain. You would have to write a special parser to read that column, effectively making it a user-defined data type. Making it a “two-atom molecule” makes much more sense. But the point is this: semantically it is one data element — namely, a geographical location.

Likewise, the most common newbie error is to put a person’s last, first, and middle name into one column rather than utilizing last_name, first_name, and middle_name columns. The error is easy to understand; a name is a (relatively) unique identifier for a person, and identifiers are semantically atomic. But in practice, sorting, searching and matching are best done with the atoms exposed.

Isomer Data Elements

The worst situation involves isomer data elements. An isomer is a molecule that has the same atoms as another, but is arranged a little differently. The most common examples are right- and left-handed versions of the same item.

The simple example of this scenario is a table with a mix of scales — say, temperatures in both Celsius and Fahrenheit. This requires two columns, one for the number and one for the scale. I can then write VIEWs to display the numbers on either scale, depending on the user. Here, the same semantic value is modeled dynamically by a VIEW. (The correct design would have picked one and only one scale, but bear with me; things get worse.)

Consider mixed currencies. On a given date, I get a deposit in one of many currencies, which I need to convert to other currencies, all based on the daily exchange rate:

CREATE TABLE Deposits
(..
 deposit_amt DECIMAL (20,2) NOT NULL,
 currency_code CHAR(3) NOT NULL, -- use ISO code
 deposit_date DATE DEFAULT CURRENT_DATE NOT NULL,
 ..);

CREATE TABLE ExchangeRates
(..
 currency_code CHAR(3) NOT NULL, -- use ISO code
 exchange_date DATE DEFAULT CURRENT_DATE NOT NULL,
 exchange_rate DECIMAL (8,4)NOT NULL,
 ..);

Semantically, the deposit had one, and only one, value at that time. But suppose I express that value in U.S. Dollars, but my friend thinks in Euros. There is no single, hard formula for converting the currencies, so you have to use a join:

CREATE VIEW DepositsDollars (.., dollar_amt, )
AS
SELECT .., (D1.deposit_amt * E1.exchange_rate), 
  FROM Deposits AS D1, ExchangeRates AS E1
 WHERE D1.deposit_date = E1.exchange_date;

And likewise, there will be a “DepositsEuros” with a euro-amt column, and whatever else we need. The VIEWs are good, atomic scalar designs, but the underlying base tables are not!

Another approach would have been to find one unit of currency, and then to use only that one, making the conversion at the front end. The bad news is that such an approach would have lost information about the relative positions among the currencies and would have been subject to rounding errors. This is not an easy problem.

Validating a Molecule

The major advantage for keeping each atomic data element in its own column is that you can easily set up rules among them to validate the whole. For example, an address is a molecular unit of data. Within it, I can see whether the city and state codes match the ZIP code.

Thus, instead of putting such constraints into one CHECK() constraint, break it into separate constraints that have meaningful names that will show up in errors messages.

Faking It

Faking a molecular data element is truly awful. Shoving all of the atomic parts into a string, then writing user-defined functions to parse the string and convert it into one format is a nightmare. In fact, it is so bad, I am not going to show you code for it.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). For more articles, puzzles, resources, and more, go to www.celko.com.

Contributors : Joe Celko
Last modified 2006-01-06 10:42 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Molecular, Atomic, and Sub-atomic Data Elements - Part 2

Molecular Data Elements

Isomer Data Elements

Validating a Molecule

Faking It