The Ghost of Sequential Processing

by Joe Celko

When we were first creating relational database products, we really did not understand at a fundamental level what we were doing. As a result, we made a lot of mistakes then and have to live with them now. The biggest mistakes come from exposing the physical representation of the logical model to the programmer.

This is a holdover from the early programming language while we were very close to the hardware. For example, the fields in a COBOL or FORTRAN program were assumed to be physically located in the order in which they were declared. This meant that you could define a template that overlaid the same physical space and read the representation in several different ways. In COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union in 'C.'

From a logical viewpoint, this redefinition makes no sense at all. It is confusing the numeral with the number that the numeral represents.

Early SQL and Contiguous Storage

The early SQLs were based on existing file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns — in short, just like a deck of punch cards or a magnetic tape. You located data by counting its position in the deck, starting at the front.

Physically contiguous storage is only one way of building a relational database and it is not always the best one. But aside from that, the whole idea of a relational database is that user is not supposed to know how things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product.

One significant error is the IDENTITY column in the Sybase family (SQL Server and Sybase). If you are not familiar with this "feature," it is assigned to a column as its data type with the limitation that a table can have only one such column. The database engine assigns a sequential integer in this column to every row in the table as it is inserted.

People actually program with this "feature" and even use it as the primary key for the table! Now, let's go into painful details as to why this thing is bad.

IDENTITY Crisis

The practical considerations are that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or products. It also has some very strange bugs in both Sybase and SQL Server.

But let's look at the logical problems. First, try to create a table with two columns and try to make them both IDENTITY columns. If you cannot declare more than one column to be of a certain datatype, then that thing is not a datatype at all, by definition.

Next, create a table with one column and make it an IDENTITY column. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.

Finally, create a simple table with one IDENTITY column and a few other columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

to put a few rows into the table and notice that the IDENTITY column sequentially numbered them in the order in which they were presented. If you delete a row, the gap in the sequence is not filled in, and the sequence continues from the highest number that has ever been used in that column in that particular table.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set that has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the physical order of the result set happened to be — that non-relational phrase, "physical order" again. But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order.

Oh, why did duplicate rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.

There are better ways of creating identifiers, but that is the subject for another column. In the meantime, stop writing bad code, until I can teach you how to write good code.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:30 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

The Ghost of Sequential Processing

Early SQL and Contiguous Storage

IDENTITY Crisis