XML Data Management: Caveat Emptor

by Fabian Pascal

Those who follow my writings (here, at Database Debunkings, and elsewhere) know that the main thrust of my work has been demonstrating the negative effects of the lack of foundation knowledge on data management practice. They are familiar with the Weekly Quotes I post regularly on my site as evidence of the sad state of affairs. At the beginning of a new year, I collect the “best” pearls from among the preceding year’s quotes for a yearly presentation called, “To Laugh or To Cry? Fallacies in Data Management” (in fact, I have just completed the 2004 edition, which will be offered on May 3rd at the DAMA conference in Los Angeles).

If they do not know, coming in, why they ought to laugh, cry, or both, those who attend the presentation get to figure it out by the time they leave (or at least, so I hope). It has occurred to me, though, that given the state of knowledge, those who do not attend may be puzzled as to what’s wrong with those quotes (the feedback I get is evidence to that effect), and that even for some of those who do attend, things may not be entirely clear. Because the quotes pertain to foundation knowledge, it is probably a good idea to make sure that as many people as possible understand what’s wrong with them and why.

To test your foundation knowledge, you are urged to try and figure out what’s wrong with the quotes before reading the explanation.

Here’s the first quote:

"... although hierarchical ... databases no longer pose a serious threat to RDBMSs, the jury is still out on whether or not Web services will spur RDBMS vendors to adopt native XML databases ... Solutions that consolidate multirelational data structures into a single XML view, such as Software AG's Tamino, could become a tempting approach to simplify development and management issues."

— Mario Apicella, “Relational Databases Extend their Reach,” Infoworld

We have no idea what “multirelational data structures” means, but the use of such nonsensical terms is a clear indication of how commonly trade journalists lack knowledge of the subject matter they cover (refer to “On Intellectual Laziness,” The Ignorance Mechanism); another indication is the use of the terms “database” and “DBMS” interchangeably. But we do know that Apicella contradicts himself without realizing it: XML DBMSs (and the databases they manage, which are not the same thing) are hierarchic, as discussed below.

The most important aspect of XML that practitioners ought to realize, but do not, is that XML is not a data management technology at all. XML was invented for inter-system data exchange purposes, as admitted by at least one of its authors in the second quote:

“1. [T]he only normative definitions of XML, and of Namespaces, operate almost completely at a syntactic level.

2. I’ve been in software for 20 years and I’ve seen lots of interoperable cross-platform syntax and very rarely an interoperable cross-platform data structure or API.

Obviously, once you’re dealing with some XML inside of a program, you think in terms of the structure. But XML’s interoperability is strongly linked to the fact that its definition is syntactic. [emphasis added]”

— Tim Bray (co-inventor of XML)

In order for data exchange between systems to work, the sending and recipient systems must first agree on what data — that is, meaning — will be exchanged. Note very carefully that because meaning is agreed on up front, there is no need to include information about it with the data being transmitted — that would be unnecessary and inefficient.

The only other thing needed for exchange is an agreed-upon physical format, some form of delimited serialization of the data that the recipient system can be set to digest; this is what Bray means by “syntactic.” The criterion here is not semantics, but performance: the best format is the one that is most efficient. And it should be quite obvious that XML does not, to put it politely, fare well with respect to efficiency, precisely because it repeats the tags in each and every record/document, and the tags often overwhelm the data (refer to “I’ve Glimpsed the Future and It’s XML”). That is why the XML standard is subverted to make performance acceptable (refer to “The Horrors of XML”), as we predicted early on would happen.
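To make the overhead point concrete, here is a rough sketch in Python, using only the standard library. The record layout (`id`, `name`, `city`), the element names, and the sample values are all invented for illustration, and byte counts are only a crude proxy for exchange efficiency. It serializes the same records once as XML and once as a delimited file whose column meaning is assumed to have been agreed on up front.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the field names and values are illustrative only.
FIELDS = ["id", "name", "city"]
ROWS = [
    (1, "Smith", "London"),
    (2, "Jones", "Paris"),
    (3, "Blake", "Athens"),
]


def to_xml(rows):
    """Serialize rows as XML, repeating every tag name in every record."""
    root = ET.Element("suppliers")
    for row in rows:
        rec = ET.SubElement(root, "supplier")
        for field, value in zip(FIELDS, row):
            ET.SubElement(rec, field).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def to_delimited(rows):
    """Serialize rows as comma-delimited text; column meaning is agreed up front."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode("utf-8")


def payload_bytes(rows):
    """Bytes taken by the data values alone, excluding markup and delimiters."""
    return sum(len(str(v).encode("utf-8")) for row in rows for v in row)


xml_bytes = to_xml(ROWS)
csv_bytes = to_delimited(ROWS)

print("data values:", payload_bytes(ROWS), "bytes")
print("delimited:  ", len(csv_bytes), "bytes")
print("XML:        ", len(xml_bytes), "bytes")
```

With short values like these, the markup accounts for most of the XML message, while the delimited form adds only one separator per value. Longer text values would narrow the gap, so the ratio depends on the data.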

The problem is that the people who invented XML were specialists in text publishing, not in data management. Text/word formatting is done with tags; hence SGML, HTML, and XML. But formatting and meaning are hardly the same thing and cannot be handled the same way (refer to Tags Do Not a Language Make, “To a Hammer, Everything Looks Like Nails”). This is, in fact, a critical difference between data management and text processing. As properly educated data management practitioners know, there is much more to meaning than a bunch of nested tags. The best approximation a DBMS can have of the user-understood meaning of a database is the database schema, the sum total of integrity constraints declared and known to the DBMS (refer to Database Foundations paper #4, “Un-muddling Modeling”). And the fact is that in the initial XML version there was no notion of integrity in that sense whatsoever, which is clear evidence that it could not be used for data management (refer to “The Myth of Self-describing XML”).
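The contrast between nested tags and declared integrity constraints can be sketched as follows. This is an illustration only, under stated assumptions: it uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules, SQLite merely stands in for a SQL DBMS (it is not a true relational one), and the `supplier` table, its columns, and the rule that `status` must be positive are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A well-formed document whose content violates a business rule
# (hypothetical rule: supplier status must be a positive integer).
doc = "<supplier><id>1</id><status>-20</status></supplier>"

# The parser checks syntax only; the tags tell it nothing about the rule.
parsed = ET.fromstring(doc)
xml_accepted = parsed.find("status").text == "-20"

# The same rule declared as a constraint is known to, and enforced by, the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE supplier (
        id     INTEGER PRIMARY KEY,
        status INTEGER NOT NULL CHECK (status > 0)
    )
    """
)


def try_insert(supplier_id, status):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO supplier (id, status) VALUES (?, ?)",
                (supplier_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `try_insert(1, 20)` succeeds, while `try_insert(2, -20)` and a second insert with `id` 1 are both rejected by the DBMS itself, without any application code having to know the rules. The XML parser, by contrast, accepts the negative status without complaint, because well-formedness is all it checks.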

Otherwise put, the notion of meaning in XML, as represented by tags, was neither necessary for data exchange nor sufficient for data management. That notion stems essentially from a lack of knowledge of data fundamentals and a poor understanding of the difference between semantics and format, and between data management and data exchange.

This fundamental error notwithstanding, it was only a matter of time (and we predicted this quite early on, too) before somebody somewhere with the same lack of foundation knowledge, seeing data XML-formatted for exchange, came up with the idea of XML databases and DBMSs. Sure enough, that is exactly what happened, as revealed in the third quote:

“XML will become the dominant format for data interchange: It's flexible and self-describing. Many applications will want to query the data in the same format that they interchange it. If we're exchanging data in XML, I'll want to write queries against XML sources.”

— Don Chamberlin, IBM representative to W3C

Hence, my argument about The Exchange Tail and the Management Dog, which is also the title of a seminar.

Data management requires a data model that provides structure, integrity, and manipulation features (again, see “Un-muddling Modeling”). At best, XML has structure (the hierarchy of nested tags) but no integrity and no manipulation. So, to do XML data management, integrity and manipulation had to be reinvented for XML’s tree structure, which is exactly what W3C has been doing. Unfortunately, what few practitioners, particularly younger ones, realize, due to poor education, is that the hierarchic data model was extensively used and discredited more than three decades ago, because it was not cost-effective (see Database Foundations papers #1, “What First Normal Form Really Means,” and #2, “What First Normal Form Means Not”). It was made obsolete by the relational model (hierarchic systems are still used because it is difficult to migrate away from them, because of ignorance, or because of SQL’s failure to deliver all the relational benefits).
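One commonly cited weakness of hierarchic structures, offered here as an illustration rather than as the author's own argument, is that a tree privileges one access path over all others. The sketch below is a simplification under stated assumptions: the supplier and part identifiers are hypothetical, a Python dictionary stands in for the nested tree, and SQLite (via the standard `sqlite3` module) stands in for a SQL DBMS.

```python
import sqlite3

# Hypothetical supplier/part data, nested the way an XML tree would nest it:
# parts appear underneath the supplier that ships them.
tree = {
    "S1": ["P1", "P2"],
    "S2": ["P1"],
    "S3": ["P2", "P3"],
}


def parts_of(supplier):
    """Follows the hierarchy: a direct lookup."""
    return sorted(tree.get(supplier, []))


def suppliers_of(part):
    """Goes against the hierarchy: every branch must be visited."""
    return sorted(s for s, parts in tree.items() if part in parts)


# The same facts as a relation: the structure privileges neither question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipment (supplier TEXT, part TEXT, PRIMARY KEY (supplier, part))"
)
conn.executemany(
    "INSERT INTO shipment VALUES (?, ?)",
    [(s, p) for s, parts in tree.items() for p in parts],
)


def select(wanted, known, value):
    """Both questions are the same restriction and projection, columns swapped."""
    if {wanted, known} != {"supplier", "part"}:
        raise ValueError("column names must be 'supplier' and 'part'")
    sql = f"SELECT {wanted} FROM shipment WHERE {known} = ? ORDER BY {wanted}"
    return [row[0] for row in conn.execute(sql, (value,))]
```

In the tree, `parts_of` and `suppliers_of` need different code because only the first follows the nesting. In the relational form, `select("part", "supplier", "S1")` and `select("supplier", "part", "P1")` are the same query with the column roles exchanged.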

The relational model was never truly and fully implemented by the industry, which produced much inferior SQL-based products instead; that is exactly what can be expected in the absence of foundation knowledge (refer to “They Who Don’t Remember the Past Are Doomed to Relive It”). (Incidentally, Chamberlin is a co-author of IBM’s SQL, so judge for yourself what this says about the XML query language based on his proposal.) Consider the fourth quote:

“Three decades past, the relational empire conquered the hierarchical hegemony. Today, an upstart challenges the relational empire's dominance, threatening the return of hierarchy. XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria. This talk examines the design of XQuery, the W3C standard query language for XML, and related standards such as XML Schema.”

— Philip Wadler, “Keynote,” VLDB Conference, Rome, September 2001

The correct solution is not to reinvent an obsolete technology that would throw data management back decades (the Library of Alexandria may fit the bill all right), but rather to produce genuine implementations of the relational model. As long as the industry flouts the scientific foundation of data management, we will continue to reinvent the past, oblivious to the fact that we regress rather than progress.

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date, and for 20 years he held various analytical and management positions in the private and public sectors, taught and lectured at the business and academic levels, and advised vendor and user organizations on data management technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCS, and IRS. He is founder, editor and publisher of Database Debunkings, a Web site dedicated to dispelling persistent fallacies, flaws, myths and misconceptions prevalent in the IT industry. Together with Chris Date he has recently launched the Database Foundations Series of papers. Author of three books, he has published extensively in most trade publications, including DM Review, Database Programming and Design, DBMS, Byte, Infoworld and Computerworld. He is author of the contrarian columns Against the Grain and Setting Matters Straight, and of a column for The Journal of Conceptual Modeling. His third book, Practical Issues in Database Management, serves as text for his seminars.

Special Offer: Author Fabian Pascal is offering DBAzine.com readers subscriptions to the Database Foundations Series of papers at a discount. To receive your discount, just let him know you’re a DBAzine reader before you subscribe! Contact information is available on the “About” page of his site.

Contributors : Fabian Pascal
Last modified 2005-04-12 06:21 AM
