The XML Bug

by Fabian Pascal

When early on I was warning that XML is an unnecessary regression to the bad old days of hierarchic databases (see several articles in my Against the Grain series), one of its proponents argued that I was missing XML's point:

"XML is just a nice, little low-level technique which has some nice properties at the current state of technology for transmitting data ... [it] was developed for entirely physical purposes: to provide a fairly rich and adaptable format for sending small collections of data between systems in a way that has some nice performance characteristics (readable, lo-tech, integrates with URLs, etc.)."

But whatever the original intent of XML inventors, absent foundational knowledge, it was just a matter of time before the industry would do what it usually does -- extend technology or products to unintended areas, for which they are ill suited. So the current resident fad is XML databases and DBMSs, a bandwagon that the press, always eager for new advertising sources, promptly jumped on. In a January Network World Fusion article "Working Out the Bugs in XML Databases," for example, John Cox sees "a growing belief that XML-based information needs its own database."

"As network executives begin to experiment with Web services, they're likely to find that they need a new kind of data store: the XML database. These software products are designed to efficiently store and manage the growing numbers of XML documents that users are creating, especially in Web interactions with business partners and customers. Advocates cite several advantages of XML databases compared with traditional databases: simplicity, ease of application development, ability to search and query XML documents, and fast document retrieval."

"These software products" are DBMSs, not databases. Anyway, Cox has it backwards, of course: XML DBMSs are more "traditional" than he realizes. Like most trade journalists, who have no understanding of data fundamentals, he is unaware that XML DBMSs are a throwback to the old hierarchic DBMSs, predominant decades ago and discarded precisely because they were complex, inflexible and difficult to develop applications for, the exact opposite of what is now being claimed for their XML reincarnation:

Note: Here's an excerpt from the manual for IBM's old IMS hierarchic DBMS (brought to my attention by Chris Date): "Logically deleting a logical child prevents further access to the logical child using its logical parent. Unidirectional logical child segments are assumed to be logically deleted. A logical parent is considered logically deleted when all its logical children are physically deleted. For physically paired logical relationships, the physical child paired to the logical child must also be physically deleted before the logical parent is considered logically deleted."

"There's no formal, standard definition of an XML database, although the XML:DB Initiative describes such a database as one that defines a logical model for an XML document (not for the data in the document), and manages documents based on that model. The key point is the database 'thinks and acts' based on XML - XML goes in, and XML comes out, even though these products can physically store the documents in an object or relational database or a proprietary storage model, such as indexed files."

I wonder how a DBMS (not database!) can "think and act based on XML" without a precise, formal definition of an XML data model? Nevertheless, Cox considers relational DBMSs, which do have a precise formal logical foundation, just physical storage managers for XML documents, and XML products, which don't have one, DBMSs. Go figure. Obviously, Cox does not understand the important distinction between files and databases, and is unaware that the latter are the technological progress that replaced the former. Neither does he appreciate the difference between data and documents from a logical perspective (see next).

"The lack of formal definition is just one issue that raises the hackles of critics. They also point to the immaturity of the products and of XML standards; the absence of a standard, reliable query language to match the SQL used in relational databases; and possible data integrity problems. Relational vendors are also adding better support for XML. For example, Microsoft is developing the Yukon release of SQL Server. Oracle demonstrated to customers in December a technology called Project XDB. The goal of both projects is to let the databases treat XML documents as a new data type and manage them as they now work with relational data and objects."

The hierarchic approach underlying XML does, in fact, have a formal foundation: graph theory. But as the XML 1.0 specification explicitly states, it does not adhere to the theory. The reason is the same as that for which old hierarchic database management (e.g., IBM's IMS) eschewed the theory too: it is extremely complex.

What is more, the real world is not thoroughly hierarchic, but the hierarchic approach can handle only hierarchies. This means that hierarchy and its complexity must be foisted upon any and all database representations, whether justified (e.g., organizational or bill-of-material structures) or, as is more often the case, not. Since the relational approach can handle both non-hierarchic and hierarchic data in a formal and much simpler way (see Chapter 7 in my Practical Issues in Database Management), what exactly is the advantage of the hierarchic approach underlying XML?
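To make the bill-of-material case concrete, here is a minimal sketch, assuming a hypothetical part_structure table with made-up parts and quantities. SQLite's SQL stands in for a relational language purely for convenience, with the usual caveat that a SQL DBMS is not a truly relational one. A single table holds the whole structure, and the same table answers both the hierarchic question (what goes into a bike?) and the non-hierarchic one (where is a rim used?).

```python
import sqlite3

# Hypothetical bill-of-materials data: (assembly, component, quantity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part_structure ("
    " assembly TEXT, component TEXT, qty INTEGER,"
    " PRIMARY KEY (assembly, component))"
)
conn.executemany(
    "INSERT INTO part_structure VALUES (?, ?, ?)",
    [("bike", "wheel", 2), ("bike", "frame", 1),
     ("wheel", "spoke", 32), ("wheel", "rim", 1)],
)

def explode(conn, root):
    """Top-down: every component of `root`, with total quantity needed."""
    return conn.execute(
        """
        WITH RECURSIVE parts(component, total) AS (
            SELECT component, qty FROM part_structure WHERE assembly = ?
            UNION ALL
            SELECT ps.component, parts.total * ps.qty
            FROM part_structure ps JOIN parts ON ps.assembly = parts.component
        )
        SELECT component, SUM(total) FROM parts
        GROUP BY component ORDER BY component
        """,
        (root,),
    ).fetchall()

def where_used(conn, component):
    """Bottom-up: assemblies that directly contain `component`."""
    return [row[0] for row in conn.execute(
        "SELECT assembly FROM part_structure WHERE component = ? ORDER BY assembly",
        (component,),
    )]

print(explode(conn, "bike"))
print(where_used(conn, "rim"))
```

Neither query is privileged by the representation; in a hierarchic document, one of the two directions is built into the structure and the other has to be worked around.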

And what kind of product "maturity" and query language "reliability" can be expected in the absence of a theoretical foundation? Here's Hugh Darwen on the so-called "XML query algebra": "Now, my eyes light up at the word "algebra" ... Originally, I understood it to mean a set of operations that are closed over some type. That is, every operation in X Algebra operates on zero or more values of type X and returns a value of type X. Hence, set algebra, Boolean algebra, relational algebra and the algebra of numbers that gives us arithmetic. Over what is the XML Query Algebra closed? Nobody has ever given me an answer that makes sense (apart from the occasional, honest "I don't know")."
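Darwen's notion of closure can be sketched in a few lines. The following is an illustrative toy, not a full relational algebra: the Relation type and the restrict and project helpers are hypothetical names, and only two operators are shown. The point is that each operator takes a relation and returns a relation, so results can be fed to further operators without leaving the type.

```python
from typing import Callable, FrozenSet, NamedTuple, Sequence, Tuple

class Relation(NamedTuple):
    heading: Tuple[str, ...]   # attribute names
    body: FrozenSet[tuple]     # set of tuples, so no duplicates

def restrict(r: Relation, pred: Callable[[dict], bool]) -> Relation:
    """Keep the tuples satisfying `pred`; the result is again a Relation."""
    kept = frozenset(t for t in r.body if pred(dict(zip(r.heading, t))))
    return Relation(r.heading, kept)

def project(r: Relation, attrs: Sequence[str]) -> Relation:
    """Keep only `attrs`; duplicates vanish because the body is a set."""
    idx = [r.heading.index(a) for a in attrs]
    return Relation(tuple(attrs), frozenset(tuple(t[i] for i in idx) for t in r.body))

# Made-up sample data.
emp = Relation(
    ("name", "dept"),
    frozenset({("Ann", "R&D"), ("Bob", "R&D"), ("Cy", "Sales")}),
)

# Closure in action: the output of restrict is a valid input to project.
result = project(restrict(emp, lambda t: t["dept"] == "R&D"), ["dept"])
print(result)
```

Darwen's question is what plays the role of Relation in the XML query algebra.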

But it's integrity (which Cox mentions only in passing, without elaborating) that makes XML databases a highly questionable proposition. The meaning of a database is in its schema (what Cox refers to as a formal logical definition), and a schema is nothing but the sum total of integrity constraints on the data (see Chapter X in my book). The fact is that XML did not initially provide any support for integrity, without which database management is meaningless (see below).
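As a rough illustration of a schema being a set of declared integrity constraints, consider the following sketch. The tax_return and return_line tables, their columns and the specific rules are all hypothetical, and SQLite is again only a convenient stand-in. The DBMS, not the application, rejects data that violates the declared constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE tax_return (
    return_id INTEGER PRIMARY KEY,
    tax_year  INTEGER NOT NULL CHECK (tax_year >= 1900)
);
CREATE TABLE return_line (
    return_id INTEGER NOT NULL REFERENCES tax_return(return_id),
    line_no   INTEGER NOT NULL CHECK (line_no > 0),
    amount    INTEGER NOT NULL CHECK (amount >= 0),
    PRIMARY KEY (return_id, line_no)
);
""")

def try_insert(sql, params):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

LINE = "INSERT INTO return_line VALUES (?, ?, ?)"
outcomes = [
    try_insert("INSERT INTO tax_return VALUES (?, ?)", (1, 2001)),  # valid
    try_insert(LINE, (1, 1, 5000)),   # valid
    try_insert(LINE, (1, 1, 700)),    # duplicate key (return_id, line_no)
    try_insert(LINE, (1, 2, -50)),    # violates CHECK (amount >= 0)
    try_insert(LINE, (99, 1, 100)),   # no such return: foreign key violation
]
print(outcomes)
```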

"If I had an Oracle (relational) database, I'd want to really know what's going in the background to handle XML," says Larry Hanson, data architect for the California Board of Equalization (BOE) ... "if you store these documents as objects, for example, can you query them, and tag them?" Oracle claims that these actions will be possible with XDB but how well the technology performs when processing lots of data or very large data sets remains to be seen."

Note how terminology is thrown around with disregard as to meaning:

Oracle is not a RDBMS, only a SQL DBMS -- big difference!
Where do "objects" fit in a (supposedly relational) SQL DBMS? Are "objects" a physical storage construct rather than a logical model construct? If so, what is the logical level in object database management?
What kind of integrity and queries do tags per se permit? For example, how does the DBMS know whether the data -- the content of a document -- is valid? And if it does not know, how reliable is the data and how meaningful are the answers it produces?
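The question about tags can be illustrated with a small sketch. The document below is made up, and the find_problems function is a hypothetical, hand-written check. Python's standard xml.etree.ElementTree parser checks only that the markup is well formed and does no DTD or schema validation, so it accepts the document even though its content is nonsense. Every rule then has to be coded by hand in the application.

```python
import xml.etree.ElementTree as ET

# Well-formed markup, meaningless content: a bad year, a non-numeric amount,
# a repeated line number and a negative amount.
doc = '<return year="20x1"><line no="1">banana</line><line no="1">-50</line></return>'

root = ET.fromstring(doc)  # succeeds: the tags are balanced, nothing more is checked

def find_problems(root):
    """Application-side checks the tags themselves do not provide."""
    problems = []
    if not root.get("year", "").isdigit():
        problems.append("year is not a number")
    seen = set()
    for line in root.findall("line"):
        no = line.get("no")
        if no in seen:
            problems.append(f"duplicate line number {no}")
        seen.add(no)
        try:
            amount = int(line.text)
        except (TypeError, ValueError):
            problems.append(f"line {no}: amount is not an integer")
            continue
        if amount < 0:
            problems.append(f"line {no}: negative amount")
    return problems

for problem in find_problems(root):
    print(problem)
```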

As to performance, text files are not exactly a recipe for maximizing it.

Hanson's point, echoed by others, is that XML data is fundamentally different from relational data. "XML data are extremely well-suited to hierarchical storage," says Hanson, who is a former database administrator. "In XML databases, an online tax return can be stored in its entirety. In a relational database, each line of the return would have to be a different table [of data in rows and columns]." Trying to "force fit" an XML document into the rigid relational structure can waste storage space and lead to inefficiencies in queries and retrievals."

Data is not intrinsically relational or XML. Any data can be represented either relationally, or in XML documents (see Chapter 1 in my book). In either case the underlying data model -- relational or hierarchic -- is purely logical. Logical representations that users see should be insulated from physical storage details.
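A small sketch of that claim, assuming made-up tax-return lines and hypothetical element and attribute names: the same facts are held once as a set of tuples (one relation, with a row per line rather than a table per line) and once as an XML document, and converting from one to the other and back loses nothing. Neither form says anything about physical storage.

```python
import xml.etree.ElementTree as ET

# One relation for all lines of all returns: (return_id, line_no, amount).
lines = {(1, 1, 5000), (1, 2, 1200), (2, 1, 800)}

def to_xml(rows):
    """Render the relation as a document with <return> and <line> elements."""
    root = ET.Element("returns")
    by_return = {}
    for return_id, line_no, amount in sorted(rows):
        parent = by_return.get(return_id)
        if parent is None:
            parent = ET.SubElement(root, "return", id=str(return_id))
            by_return[return_id] = parent
        ET.SubElement(parent, "line", no=str(line_no)).text = str(amount)
    return ET.tostring(root, encoding="unicode")

def from_xml(text):
    """Recover the relation from the document."""
    root = ET.fromstring(text)
    return {
        (int(ret.get("id")), int(line.get("no")), int(line.text))
        for ret in root.findall("return")
        for line in ret.findall("line")
    }

doc = to_xml(lines)
print(doc)
print(from_xml(doc) == lines)
```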

SQL, ODBMS and XML fail to provide adequate insulation, contributing to the logical-physical confusion so rampant in the industry (including academia: see Denormalization for Performance - Et Tu Academia?). It is almost impossible to find arguments not corrupted by it (see The Logical Physical Confusion). The table representation of lines in a tax return (logical) has nothing to do with how the data is stored (physical), so the notion of relational "force-fitting" is pure bunk. (Ironically, as mentioned above, it's actually XML that force-fits hierarchy on all data, although this has nothing to do with physical storage either.) If there are any "inefficiencies in queries and retrievals", they have nothing to do with relational technology at the logical level and everything to do with physical implementation details of SQL databases and DBMSs.

"Companies are finding that new applications such as Web services, which are built on XML, tend to have data models [sic] that don't map well to traditional relational structures," says Philippe Gelinas, CEO of software developer Xiasoft, which developed the TextML Server for XML documents. "Often customers try to make these applications work first with an existing [relational] database and find it doesn't work," he says. "Then they shop for an XML database."

That is only because customers do not possess the necessary foundational knowledge, and succumb to industry hype, particularly that by XML vendors with a text/publishing background who know very little, if anything, about database management.

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author, and lecturer specializing in data management. He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of Database Debunkings, a Web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld, and Computerworld and is author of the contrarian column "Against the Grain." His third book, Practical Issues in Database Management - a Guide for the Thinking Practitioner (Addison Wesley, June 2000), serves as text for a seminar bearing the same name.

Contributors : Fabian Pascal
Last modified 2005-04-12 06:21 AM

XML Documents not Data

Posted by elarson at 2007-04-30 10:12 AM

I realize this article is pretty old, but I wanted to leave a quick comment nonetheless.

XML as data does have some value, but the impetus for XML was actually for documents. This small but important distinction makes the technologies surrounding XML (i.e., XPath, XSLT, XLink, XPointer, XInclude, Namespaces, etc.) look much different. Having worked on binary document formats quite a bit, I find XML a breath of fresh air in that it allows and enforces recursive thought and pattern matching. I mention it because whenever I have tried to use XML as a storage mechanism it always seemed to get in the way. Instead, when using XML from a document context, the technologies become natural.

Regarding XML databases, conceptually they could be considered something closer to a content management system. Once a document is loaded, the programmer's tools are more than enough to take care of validating the data because, as you mention, it is near impossible without a set schema. I think XML databases are more a complement to traditional databases in that they use pieces such as efficient I/O and basic keys to store documents that can then be queried for information. On a higher level, if you consider the amount of time one spends transitioning data between different formats (databases to objects to output and back), having the ability to keep the data in an original format and transform it via XSLT means application code becomes minimal.

Again, I realize this article is rather dated so all this could be obvious.

Replies to this comment

XNL is nonsense in every respect (Posted by mikharakiri at 2007-05-07 06:57 PM)

DBAzine.com
