What XML Means
Published in TDAN.com, April 1, 2004
Publisher's Note: This article previously appeared in The Journal of Conceptual Modeling, May 2003.
The following two quotes were denigrated in a recent article: 1) “There really is no difference between a document and a database.” 2) “XML data is fundamentally different from relational data … [relational structure] can lead to inefficiencies in queries and retrievals.” While both of these claims were denounced, they both contain substantial truth. This article will consider to what extent they are true. It will also present XML as a reasonable data model, with characteristics of special interest.
Natural language is more general, expressive and difficult than relational databases. Nevertheless, texts certainly do store data, and they require a model (background knowledge) to be information: try reading in a foreign language, or in an unknown theoretical field in your own language. Since natural language is a proliferating machine-readable data source, thanks to the web, it is of legitimate interest as a data object. Unfortunately, the problem of extracting data from general text is not solved, perhaps not even solvable.
XML offers an approach in which the users of language can mark the significant factual content of text. The raison d'être of XML is marking, not data management. However, to mark content implies that some level of data management must be inherent. Moreover, the simple, hierarchical structure of XML, the expressiveness of tags and the flexible range of constraints available through the DTD (a BNF equivalent) create a data model adaptable to many sources: text, formula, database or programmatic data.
This article proceeds by examining Pascal’s definition of data model. A simple data model for XML is given and is examined against Pascal’s definition of and requirements for a data model. Some characteristics unique to XML are shown and an argument is made that it is simple in a meaningful sense.
It is a tenet, since von Neumann, that anything may be data. In this regard, there is no difference between a conceptual model, a logical model and a physical model. All maintain data. The differences are simply what the models mean to represent. These models are at three levels because: 1) they represent three levels of abstraction in one process, and 2) what is modeled at each level is different.
All three models support data structuring, data 'access' and inference. Even a simple diagrammatic notation such as a semantic network diagram does these. It is certainly fair to insist on a narrower definition of 'data model'.
Among proponents of XML there are two distinguishable viewpoints, related to the purposes of the viewers. The first is a data-centric view. Persons with this view are interested in more or less the same set of operations as a traditional DBMS. XML seems to be, with a suitable query language, adequate for this. This view is fundamental in XML-QL and in semi-structured databases. [Abiteboul] The second (and the original) view of XML is a text-centric view. Persons with this view are concerned with capturing the data content of texts without the violence of re-writing the source. XML marking, while useful, is not perfect at capturing the data in even the most common texts. [Riggs]
Technically, a model is just a set with relations such that there is an interpretation function that makes the standard commutative diagram work. XML is a model of data in text, as complete with respect to data as relational theory. XML maps far more directly to text data sources. That this ready mapping extends to many other program systems may be no more than a reflection of the nature of programs themselves. It is, however, a fact of real-world practice, shown by the proliferation of XML as an underlying means of logical storage in more and more software systems.
A data model is not a map
A data model is not “a general theory of data used to map enterprise-specific business model…to enterprise specific logical models that are understood by DBMS's.” A compiler is a map from a conceptual model to a physical model. Compiler theories (such as LR(k) grammars and algorithms) are maps from language classes to compilers. Relational theory is a general data model, in which specific models may be constructed. The paradigm here is first-order predicate calculus, in which a theory may be embedded.
As a practical matter, creating a database (or information base, etc.) requires a judicious choice among theoretically possible representations. This is quite as true for relational models as for any other. Fictitious (non-domain) entities may be created (as for m-n relationships). Constraints must be selected or omitted. The entire design may become subject to efficiency and purposive constraints as well.
The XML Data Model
While XML is new and evolving, the fundamental data representation is the well-known tree. For the sake of argument, the following is the model for XML:
1) An ordered tree of named nodes with ids, with two kinds of named leaf nodes:
a) a 'text' node that holds a string;
b) a 'reference' node that holds a pointer to a node existing in the tree.
2) A set of manipulations (XSLT, XML-QL, XQL and XIRQL are candidates).
It is similar to other models such as semi-structured data. [Abiteboul]
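The model above can be sketched concretely. A minimal sketch in Python, using the standard library's ElementTree; the document, the element names and the `id`/`ref` attribute names are invented for the illustration, not part of any standard:

```python
import xml.etree.ElementTree as ET

# An ordered tree of named nodes with ids.  'ref' plays the role of
# a reference leaf that points at a node existing elsewhere in the tree.
doc = """
<people>
  <person id="p1"><name>John Smith</name></person>
  <person id="p2"><name>Mary Jones</name><spouse ref="p1"/></person>
</people>
"""
root = ET.fromstring(doc)

# Every node has a name (its tag); ids are arbitrary but unique.
first = root.find("person")
print(first.tag, first.get("id"))   # person p1

# A 'text' leaf holds a string.
print(first.findtext("name"))       # John Smith

# A 'reference' leaf is resolved by finding the node with the matching id.
ref = root.find(".//spouse").get("ref")
target = next(n for n in root.iter() if n.get("id") == ref)
print(target.findtext("name"))      # John Smith
```

Note that node order is preserved throughout, as the model requires: an ordered tree, unlike a relation, distinguishes between two documents that differ only in the sequence of their children.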
XML as a data model
Rather than offer a specific definition of 'data model', this article examines Pascal's criteria for one. XML seems to meet them, provided they are taken to define the storing and retrieving of data and are not construed so narrowly as to simply define 'relational database'.
The tree is one of the most basic models of data. It is no less formal for not generally being described by first-order predicate calculus and set theory. It does not have a specific semantics, as has been recognized for a long time. [Link] We are not aware of an essentially clearer semantics for relational databases.
The desire for a richer representation of link semantics for XML is being addressed by several communities. RDF is an example of this. [RDF]
The XML model has four types: node names, node ids, node references and content as data. The semantics of the first three is as clear as table names, column headers or foreign keys. Content is more general, and less detailed, than relational domains. Content and node references clearly are data in XML. Node ids are 'pseudo-data': they are arbitrary, but constrained to uniqueness. Node names are not subject to data operations (although we may generate them in XML processors). In a sense, then, XML has more data types than the relational model, where we mean types of data, not domains.
The utility of general data types is to map values to sets of operations; a specific list is not a definition of 'data model'. Strings, of course, support only weak data operations. Yet in normal discourse, words have types. They are 'soft-typed', i.e. the reader can interpret them by context. A theory of such soft typing (more probably, a set of such theories) would be of interest.
The lack of a broader range of content data types is a concern for some potential uses of XML. Extensions to XML such as XML Schema address this. [Schema] XML Schema is much richer in data types than most relational systems.
The normal forms are a masterpiece of relational theory. They are, however, not universally applied in building relational databases. They are constraints on the relational theory's inference methods, with the Aristotelian goal of not saying of what is not that it is. First normal form is distinct from the others in that regard. XML will have to address keys in order to develop a similar area of theory.
Pascal explicitly mentions constraints as representing business rules. The DTD, which is a specific formulation of BNF, seems to be a reasonable means to express these constraints. It is not clear that these are part of the relational model proper.
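For illustration, a DTD fragment of the kind meant here; the element and attribute names are invented for the example. It expresses ordering and cardinality rules ('an order names exactly one customer followed by at least one line item') and a uniqueness-and-reference rule via the ID/IDREF attribute types:

```xml
<!-- Hypothetical business rules for an 'order' document. -->
<!-- Structure: one customer, then one or more items, in that order. -->
<!ELEMENT order (customer, item+)>
<!ELEMENT customer EMPTY>
<!-- Customer ids must be unique within the document (ID). -->
<!ATTLIST customer id ID #REQUIRED>
<!ELEMENT item (#PCDATA)>
<!-- Each item must refer to an id that exists in the document (IDREF). -->
<!ATTLIST item for IDREF #REQUIRED>
```

A validating parser rejects any document that violates these constraints, which is precisely the role integrity constraints play in a DBMS.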
Keys are one integrity constraint not normally considered in XML articles. Thus data referring to an individual can occur in several places. This models textual reality, but may be a difficulty in practice. This concern is indirectly implied in "XML and Free Text". [Riggs]
XML data operations, based on tree traversal, are in no way theoretically deficient. Whether they are practically sufficient is a separate question. Again, one may define 'data model' as narrowly as one wishes, but only by removing sense. XML separates its representation model from its inference model. There are several candidate inference models, among them XSLT, XML-QL, XQL and XIRQL.
Select and project are simple enough (although obviously not exactly the 'same'). Many with the database view of XML do insist that the set of operations must be broadened to include at least join. [XML-QL] Others, with the text-centric view, wish to add operations required for text retrieval. [XIRQL] Why these operations on a data (information-, text-, knowledge-) base are not legitimate is not clear to me.
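The relational analogues can be shown with plain tree traversal. A sketch in Python using the standard library's ElementTree; the documents and element names are invented, and the 'join' is hand-written, since the base traversal operations do not supply one:

```python
import xml.etree.ElementTree as ET

emps = ET.fromstring("""
<employees>
  <emp><name>Smith</name><dept>D1</dept></emp>
  <emp><name>Jones</name><dept>D2</dept></emp>
</employees>
""")
depts = ET.fromstring("""
<departments>
  <dept><id>D1</id><title>Sales</title></dept>
  <dept><id>D2</id><title>Research</title></dept>
</departments>
""")

# 'Select': the emp nodes whose dept child equals D1.
selected = [e for e in emps.findall("emp") if e.findtext("dept") == "D1"]

# 'Project': keep only the name component of each row-like node.
names = [e.findtext("name") for e in emps.findall("emp")]

# 'Join': pair employees with department titles by matching key values.
titles = {d.findtext("id"): d.findtext("title") for d in depts.findall("dept")}
joined = [(e.findtext("name"), titles[e.findtext("dept")])
          for e in emps.findall("emp")]

print(names)    # ['Smith', 'Jones']
print(joined)   # [('Smith', 'Sales'), ('Jones', 'Research')]
```

The point is not that this is a query language, only that select, project and join are readily definable over the tree representation; the candidate languages above package such traversals declaratively.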
Data Model Characteristics
If XML is a data model (the old hierarchical model, more or less), perhaps it is just a poor model. This is belied prima facie by its increasing use. Pascal offers four necessary characteristics of a data model; they are considered below.
It is well-known that a relational database can be represented as a tree and vice-versa [Abiteboul]. The formalization above is enough for this purpose. Thus in the formal sense of what can be represented, neither is more general than the other.
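That equivalence is easy to see in miniature. A sketch, assuming the usual table-to-tree encoding of one element per row and one child per column (the table and names are invented for the example):

```python
import xml.etree.ElementTree as ET

table = [{"name": "Smith", "dept": "D1"},
         {"name": "Jones", "dept": "D2"}]

# Table -> tree: one element per row, one child per column.
root = ET.Element("rows")
for row in table:
    r = ET.SubElement(root, "row")
    for col, val in row.items():
        ET.SubElement(r, col).text = val

# Tree -> table: invert the encoding.
back = [{c.tag: c.text for c in r} for r in root.findall("row")]
print(back == table)  # True
```

The reverse direction, flattening an arbitrary tree into relations, requires generated keys to record parent-child links, which is where the representations begin to diverge in practice.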
XML is superior to relational databases in at least one sense: it can often be imposed on free text sources without disturbing them. Relational databases do not maintain locality of data. [Sowa] This is exactly the real sense of the quote, dismissed by Pascal, that “XML data is fundamentally different from relational data … [relational structure] can lead to inefficiencies in queries and retrievals.” The tree representation does something that relational representations cannot do: represent the data as it is, without need of reconstruction.
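The point about locality can be made concrete: tags can be laid over a sentence so that the data becomes addressable, yet stripping the tags returns the source exactly. A minimal sketch in Python (the sentence and tag names are invented):

```python
import xml.etree.ElementTree as ET

original = "Dr. Smith saw the patient on Tuesday."
marked = ("<s><who>Dr. Smith</who> saw the patient on "
          "<when>Tuesday</when>.</s>")

# The marked data is now addressable by tag...
tree = ET.fromstring(marked)
print(tree.findtext("who"))    # Dr. Smith

# ...yet the source text is undisturbed: removing the markup
# recovers it character for character.
print("".join(tree.itertext()) == original)  # True
```

A relational capture of the same facts would discard the sentence and keep only the extracted values; reconstruction of the source would then be impossible.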
Generality of inference is another aspect of a model. In XML we have to pick from the competing inference models. The range of data operations is narrower, the same, or broader, as we choose.
The use of XML as a serialization of data between software systems includes RDBMSs but extends to many other sorts of applications. The desired generality would seem to be to communicate data between the widest assortment of applications of all types. Besides text, the structure of data in most programs is more nearly tree-like than relational. The value of a single representation for interchange is obvious.
Formality
Formality does not depend on expression in predicate calculus plus set theory. The representational model of XML is adequately formal. The inference models for traditional data processes depend on tree traversal and are equally well defined. If we extend inference to include search, as in the information retrieval community, there are also formal models. (The most fundamental, the probabilistic model, is however difficult.)
No model is simply 'complete'. A model is complete with respect to some domain, abstract or real, and some set of operations. Soundness is similarly determined by that comparison, not by the mathematical theory chosen to express the model. It is quite possible for a graphical theory to be sound and complete. To turn the normal relationship on its head, tableaux are a sound and complete theory for the sentential calculus.
Since representational equality of XML and relations is guaranteed, only inferential completeness is at question here. The inferences required in the domain of discourse are the important criterion. Information retrieval is a ubiquitous example in which non-crisp methods (vector or probabilistic retrieval) are typically of more value than crisp (Boolean) methods.
Simplicity is likewise not a single measure. Relational databases are reasonably economical; that is, they do not have too many axiom schemata or inference rules. The same is true of trees and tree traversal. Representational simplicity is also of concern, as is operational simplicity.
The formal issue is one of more or less, not of an optimum. No one I know likes to write their logical formulas using the Sheffer stroke (the minimal set of operators for logic), although it is more economical than using 'and', 'or' and 'not' (the familiar set). Indeed, most people are happy to add implication and logical equivalence to their models.
Representational simplicity is extrinsic to the model and concerns the ease with which model objects are interpreted to domain objects. XML has both pluses and minuses here as data theory. As said earlier: it models documents and some other types of data better but perhaps individuals worse. (So far at least.)
Pascal claims that meaning in the sense of “how things are related and how to deal with them” is lost in XML. That claim is hollow. How things are related is the essence of structure (and pointers). How to infer with things is a more open question and depends on what you need to infer. Surely the XML data model stores and retrieves data; how much narrower must the definition be?
The lack of a rich set of domains and domain operations limits some operations, but also simplifies and generalizes XML. Although XML Schema attempts to address classical database concerns, it is not impossible that an alternative ('soft typing') could provide the same data functions.
As a data model XML does not have a theory of keys and dependencies known to me. Whether that is vicious or not may be worth discussing. In any case the data operations which can be easily modeled by XML will not go away.
Tag names are a two-sided device meant to mediate between text and data values. They should resonate with a reader, and they should mark data as a meaningful (computational) unit. Relational attribute names do no more. In fact, attempts to link XML with ontologies [OIL] can be seen as attempts to formalize something ubiquitous but unanalyzed in the relational model. (Unless we consider the data dictionary a suitably formal device.)
In regard to the use of XML as a data interchange format, the problem is not one of interchange between RDBMSs. The problem XML addresses is much more general. Also note that interchange is a problem of representation, not inference. We only need to exchange representations between equivalent inference systems. Thus XML works here too.
It is certain that there are interesting questions about XML. Its origin from a committee, myriad uses and rapid development complicate formalizing it authoritatively. However, it clearly does have a formal basis, if as yet an incomplete and multi-faceted one. Its core is already enough to recognize it as a data model, and one with unique applicability. The proliferation of auxiliary notations, such as XML Schema, XML links and RDF, indicates its fecundity, not its frailty.
[Abiteboul] Data on the Web, Serge Abiteboul, Peter Buneman, Dan Suciu, Morgan Kaufmann, 2000
[Date] "Models, Models, Everywhere, nor any Time to Think", C. J. Date, www.dbdebunk.com/cjd3a.htm
[Link] "What's in a Link: Foundations for Semantic Networks", William A. Woods, in Readings in Knowledge Representation, p217, Morgan Kaufmann, 1984
[OIL] Welcome to OIL, www.ontoknowledge.org/oil
[RDF] Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Working Draft 23 January 2003, http://www.w3.org/TR/rdf-concepts
[Pascal1] "What Meaning Means", Fabian Pascal, http://www.inconcept.com/JCM
[Pascal2] "Something to Call One's Own", Fabian Pascal, http://www.dbdebunk.com/fp6a.htm
[Riggs] "XML and Free Text", K. R. Riggs, Journal of the American Society for Information Science and Technology, V53, N 6, 2002, 526-528
[Schema] XML Schema, W3C org, www.w3.org/XML/Schema
[Sowa] Knowledge Representation, John F. Sowa, Brooks Cole, 2000
[XIRQL] XIRQL: A Query Language for Information Retrieval in XML Documents, N. Fuhr, K. Großjohann, http://www.is.informatik.uni-duisburg.de/bib/xml/Fuhr_Grossjohann_01.html.en
[XML-QL] XML-QL: A Query Language for XML, Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, Dan Suciu, http://www.w3.org/TR/NOTE-xml-ql/#issues
Ken Roger Riggs, Ph.D.
Ken Roger Riggs, Ph.D. is a computer scientist and a professor of CIS at Florida A&M University. His degrees are in Philosophy (Indiana University), Computer Science (U. of Central Florida) and Electrical and Computer Engineering (U. of Miami). He has been involved in a mix of practice, research and teaching since 1976. His interest in models is both practical and pedagogical. His recent published work includes papers on AI, databases, data mining and XML marking. Other recent publications address conceptual modeling issues in software engineering (refactoring) and programming languages.