Skip to content

DBAzine.com

Sections
Personal tools
You are here: Home » Of Interest » Articles of Interest » If You Liked SQL,You'll Love XQUERY
Seeking new owner for this high-traffic DBAzine.com site.
Tap into the potential of this DBA community to expand your business! Interested? Contact us today.
Who Are You?
I am a:
Mainframe True Believer
Distributed Fast-tracker

[ Results | Polls ]
Votes : 3448
 

If You Liked SQL,You'll Love XQUERY

by Fabian Pascal

Recently I’ve come across an excerpt from Don Chamberlin’s Chapter 2 in an Addison-Wesley book, XQUERY FROM THE EXPERTS: INFLUENCES ON THE DESIGN OF XQUERY with contributions from various others. For those who do not know who Chamberlin is, here’s the bio from the article:

“Don Chamberlin is one of IBM’s representatives in the W3C XML Query Working Group. He is also a coauthor of the Quilt language proposal, which formed the basis for the XQuery design. Don is best known as co-inventor of the SQL database language and as author of two books on the DB2 database system. He holds a Ph.D. from Stanford University and is a staff member at IBM’s Almaden Research Center. He is an ACM Fellow and a member of the National Academy of Engineering. Don is an editor of the working drafts of XML Use Cases and XQuery 1.0 and XPath 2.0 Data Model.”

In the excerpt Chamberlin states:

“Creating a new query language is a serious business. Many person-years have been spent in defining XQuery, and many more will be spent on its implementation. If the language is successful, developers of Web-based applications will use it for many years to come. A successful query language can enhance productivity and serve as a unifying influence in the growth of an industry. On the other hand, a poorly designed language can inhibit the acceptance of an otherwise promising technology. The designers of XQuery took their responsibilities very seriously, not only in the interest of their individual companies, but also in order to make a contribution to the industry as a whole.”

One does not know whether to laugh or cry. It has been quite obvious that the designers of SQL had little understanding of data fundamentals in general, and the relational model in particular; and SQL was hardly developed in accordance with good language design principles. If there were any doubts about that to this point, XML and XQuery make it perfectly clear that things are becoming worse, not better. It is not the seriousness with which XQuery specifications were undertaken that is in question here, but rather that of the knowledge and understanding of the specifiers, given that they are vendors.

We often invoke Date’s Incoherence PrincipleIt’s not possible to treat coherently that which is incoherent — when debunking industry pronouncements and, as in so many of the cases, it would take us several-fold more space, not to mention patience, to address all the confusion, inconsistencies, and outright errors in the article. We will, therefore, expose just enough lapses to justify not bothering with the rest.

The following argument that Chamberlin was inspired by actually inspired the title of my seminar, “The Exchange Tail and The Management Dog”:

“Since XML is emerging as a universal format for data interchange among disparate applications, it is natural for queries that cross application boundaries to be framed in terms of the XML representation of data. In other words, if an application is viewed as a source of information in XML format, it is logical to pose queries against that XML format. This is the basic reason why a query language for XML data is extremely important in a connected world.”

“Natural” perhaps for those without a grasp of data fundamentals. It is precisely because a majority in the industry is in that state that we warned early on that somebody somewhere would try to extend XML to data management, for which it was not originally intended, and where it does not belong. The industry has been doing this sort of thing ever since I can remember, with predictable consequences.

“Realizing this promise [of the Web] required some organizing principle for the exchange of information. This principle had to be independent of any particular language or application and easily extensible to new and unanticipated kinds of information. At present, the leading candidate for this organizing principle is the Extensible Markup Language, XML. XML provides a neutral notation for labeling the parts of a body of information and representing the relationships among these parts. Since XML does not attach any semantic meaning to its labels, applications are free to interpret them as they see fit. Applications that agree on a common vocabulary can use XML for data interchange. Since XML does not mandate any particular storage technique, it can be used as a common interchange format among systems that store data in file systems, relational databases, object repositories, and many other storage formats.”

Observe very carefully:

      • Indeed, data/information management requires “some organizing principle”; that is, structure; anything “unstructured” — and many in the industry promote XML for that purpose — is not data, but meaningless random noise that carries no information. Data exchange, however, requires a serializable format, something altogether different (see “The Myth of Self-Describing XML” - PDF).
      • As an inter-system, serialized data-exchange format, XML tags are “syntactic”; interpretation and manipulation — semantics, that is — were supposed to come from outside XML, namely from existing and future database systems and applications. Hence, the objective was a format “independent of any particular language or application and easily extensible to new and unanticipated kinds of information.

We point out, in passing, that relational databases are not storage systems, to be lumped together with file systems. They are organized collections of data from which mechanized logical inferences are made.

“… the World Wide Web Consortium (W3C) organized a query language workshop called QL ’98 … [led to] creation of a W3C working group for XML Query … whose initial charter called for the specification of a formal data model and query language for XML, to be coordinated with existing W3C standards such as XML Schema, XML Namespaces, XML Information Set, and XSLT. The purpose of the new query language was to provide a flexible facility to extract information from real and virtual XML documents.

One of the earliest activities of the Query working group was to draw up a formal statement of requirements for an XML query language. This document was quickly followed by a set of use cases that described diverse usage scenarios for the new language, including specific queries and expected results.”

Querying is a semantic data management function, not data interchange (syntactic) function, but One wonders:

      • If XML’s point is to be language-independent, why an XML-specific language?
      • If XML’s point is to be database-independent, why reinvent the data management wheel (and, we shall argue, a “square wheel” at that?)
      • If XML is for syntactic interchange, can it be used for semantic data management?
      • If it can, should it be? Is it cost-effective?

These are the questions that anybody who considers extending XML to data management should have addressed upfront. W3C obviously did not.

The notion of a formal data model (not the same as formal requirements!) was proposed by Ted Codd in the late 60s. He observed that the then-current hierarchic data model was abstracted from existing practices and products (IBM’s IMS), a backwards and usually futile exercise: you build products to, and practices on, a formal foundation, not the other way around. In the case of the hierarchic model, there was an additional problem: its formal foundation — graph theory—proved too complex in terms of representation, integrity and particularly manipulation for both implementation and use, and had to be scrapped altogether. Codd’s relational model was an effort to get rid of the complexities plaguing those models and made hierarchic (and network) model implementations cost-in effective. The latter were replaced by SQL products, an attempt — albeit flawed — to implement the relational model whose formal foundation had all the desirable properties mentioned by Chamberlin in the excerpt, and proved much more cost-effective, SQL warts and all.

Having spent his entire career at IBM, and developed what was supposed to be a relational data language at least in intention (if not in outcome), the futility of (a) post-fitting a formal foundation to existing products/practices, let alone to products/practices not even intended for data management (b) specifying a formal hierarchic data model in particular shouldn’t have escaped Chamberlin. But apparently it did. Neither Chamberlin, nor the industry, seem to have learned anything from the experience, as they are not only repeating the same mistakes, but adding even more flaws, some of which Codd was wise enough to recognize and avoid (as they say,“They Who Do Not Remember the Past, Are Condemned to Repeat It”).

We have little doubt as to the outcome this time around, even if it will take a considerable time and waste of resources to realize it. There is, in fact, already evidence suggesting that such expectations are justified: as soon as a formal foundation was considered for XML, the core XML “object” — the document — had to be abandoned and replaced with an abstraction called a “sequence.” What does this say about the whole endeavor?

“The XML Query Working Group undertook to define a language with two alternative syntaxes: a keyword-based syntax called XQuery, optimized for human reading and writing, and an XML-based syntax called XQueryX, optimized for machine generation. This chapter describes only the keyword-based XQuery syntax, which has been the major focus of the working group.”

Given that XML was invented expressly for inter-system exchanges, one must also wonder what’s the point of “human reading and writing”? Chamberlin doesn’t seem to distinguish between a serialized data exchange format for systems and a database structure for logic inferences. Had he understood the distinction, perhaps he would have given us a truly relational, well-designed data language for humans, not SQL or XQuery.

“Early in its history, the XML Query Working Group confronted the question of whether XML is sufficiently different from other data formats to require a query language of its own. The SQL language is a very well established standard for retrieving information from relational databases and has recently been enhanced with new facilities called "structured types" that support nested structures similar to the nesting of elements in XML. If SQL could be further extended to meet XML query requirements, developers could leverage their considerable investment in SQL implementations, and users could apply the features of these robust and mature systems to their XML databases without learning a completely new language.”

Correction: the SQL language retrieves information from SQL databases, not relational ones (the former, Chamberlin’s own contribution) due to failure to understand the latter. And here’s the evidence, black on white.

“Relational data is "flat" -- that is, organized in the form of a two-dimensional array of rows andcolumns. In contrast, XML data is "nested,” and its depth of nesting can be irregular and unpredictable. Relational databases can represent nested data structures by using structured types or tables with foreign keys but it is difficult to search these structures for objects at an unknown depth of nesting. In XML, on the other hand, it is very natural to search for objects whose position in a document hierarchy is unknown. An example of such a query might be “Find all the red things", represented in the XPath language by the expression //*[@color = "Red"]. This query would be much more difficult to represent in a relational query language.”

Unbelievable. Any wonder that SQL fails so abysmally at relational fidelity? We may not expect the average practitioner to distinguish between pictures of relations, which are “flat” due to the presentation medium, and relations of N degree themselves, which are N-dimensional logical structures. But we sure expect “industry experts” to be aware of the difference.

What makes the “nesting” argument even more ridiculous is that any data, even natural hierarchies, can be represented and manipulated relationally. Ironically, it is Chamberlin’s SQL that has failed for a long time to provide support for hierarchic operations (see chapter 7, “Climbing Trees in SQL”, in Practical Issues in Database Management). Either this failure is itself due to the misconception that a relation is “flat”, or the SQL deficiency is now misleading him Chamberlin into believing that, but whichever is the case (probably both), it’s not pretty.

Going back to W3C, QWG focused on the wrong question. We have specified the ones it should have focused on above, the gist of which is why use a data interchange format for data management, something for which it was not intended? Inter-system data exchange requires an agreed-on efficient machine-readable delimited file format. Aside from the fact that XML is a poor format even from that perspective (see “The Horror of XML”) — and unnecessary, because more efficient ones were readily available — it has nothing to do with a query language, and it does not warrant one. On the other hand, neither does SQL have anything to do with exchange formats, and should not be bastardized even more than it already is with XML extensions.

“Because of its regular structure, relational data is “dense”-- that is, every row has a value in every column. This gave rise to the need for a “null value” to represent unknown or inapplicable values in relational databases. XML data, on the other hand, may be “sparse”. Since all the elements of a given type need not have the same structure, information that is unknown or inapplicable can simply not appear. This gives an XML query language additional degrees of freedom for dealing with missing data.”

NULLs are another major flaw in SQL and one of the many relational model infidelities (although, to be fair, this one is at least in part due to Ted Codd himself, who is not, however, responsible for the many flaws of NULLs in SQL and its commercial implementations).

The relational model is predicate logic applied to databases. Predicate logic is the real-world’s two-valued logic (true/false). As I explain in chapter 10 of Practical Issues in Database Management (a Database Foundations paper, “Can’t Assert What You Don’t Know,” is forthcoming), logic guarantees correctness — defined as consistency — of query results. It is to preserve logical correctness, therefore, that Codd’s Information Principle requires that all information in relational databases be represented as values in relations. The term “NULL values” suggests that Chamberlin does not realize that part of the problem with NULLs is that they are not values — indeed, they are supposed to be markers for the absence of values. Whatever a database table with NULLs is, neither is it a relation, nor do NULLs represent anything in the real world and, consequently, correctness and the rest of the relational benefits are lost.

Incidentally, “inapplicable values” are a red herring. They are an artifact of bad database design. There is only one kind of missing value — unknown — and as I demonstrate in the above-mentioned chapter, it can be handled relationally, without the huge problems of SQL’s NULLs.

Be that as it may, the way XML “handles” missing data is not an advantage, but a liability for data exchange. In fact, much of the XML being transmitted is tags, not data. And given that tags are actually unnecessary, the inefficiency should be obvious. The fact is that in order for any data interchange to work, the parties must first agree on what data will be exchanged — semantics — and once they do that, there is no need to repeat the tags in each and every record/document being transmitted. Any agreed-upon delimited format will do, and the criterion here is efficiency, on which XML fares rather poorly (again, see “The Myth of Self-Describing XML” - PDF and “The Horror of XML”).

“In a relational database, the rows of a table are not considered to have an ordering other than the orderings that can be derived from their values. XML documents, on the other hand, have an intrinsic order that can be important to their meaning and cannot be derived from data values. This has several implications for the design of a query language. It means that queries must at least provide an option in which the original order of elements is preserved in the query result. It means that facilities are needed to search for objects on the basis of their order, as in “Find the fifth red object” or “Find objects that occur after this one and before that one”. It also means that we need facilities to impose an order on sequences of objects, possibly at several levels of a hierarchy. The importance of order in XML contrasts sharply with the absence of intrinsic order in the relational data model.”

In the rest of the excerpt Chamberlin discusses the “basic principles” of XQuery design, among them compositionality, closure, simplicity, and completeness. As we already mentioned, these are the very aspects in which the relational model excels, and SQL and the hierarchic/XML model fail. Judging from SQL and the pronouncements in the excerpt, and we have every reason to believe that XQuery will be worse, not better, than SQL.

Caveat emptor.

--

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for 20 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on data management technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCS, and IRS. He is founder, editor and publisher of Database Debunkings, a Web site dedicated to dispelling persistent fallacies, flaws, myths and misconceptions prevalent in the IT industry. Together with Chris Date he has recently launched the Database Foundations Series of papers. Author of three books, he has published extensively in most trade publications, including DM Review, Database Programming and Design, DBMS, Byte, Infoworld and Computerworld. He is author of the contrarian columns Against the Grain, Setting Matters Straight, and for The Journal of Conceptual Modeling. His third book, Practical Issues in Database Management serves as text for his seminars.


Special Offer: Author Fabian Pascal is offering DBAzine.com readers subscriptions to the Database Foundations Series of papers at a discount. To receive your discount, just let him know you’re a DBAzine reader before you subscribe! Contact information is available on the “About” page of his site.


Contributors : Fabian Pascal
Last modified 2005-04-12 06:21 AM
Transaction Management
Reduce downtime and increase repeat sales by improving end-user experience.
Free White Paper
Database Recovery
Feeling the increased demands on data protection and storage requirements?
Download Free Report!
 
 

Powered by Plone