Ok, So What is this XML Thing?
Published: September 1, 1999
You've heard about it. It doesn’t mean "extra medium large" on a shirt. It has something to do with the web. It has something to do with meta data. But what is it?
The "Extensible Markup Language" (XML) is a document description language, much like the "Hypertext Markup Language" (HTML) used to construct web pages. It is much more versatile than HTML, however, and as such it has profound implications for how we view what the web is and what it can do.
Books describing XML tend to weigh in at five pounds or more, which is a shame, since the basic structure and purpose of the language isn't nearly that big. This article is a much briefer presentation (in English) of the essential concepts involved.
It's More Than HTML
HTML is a language used to create web pages, using a series of "tags" which instruct the software reading it how to present the material.
Like HTML, XML is a system of tags that describe components of a document. In its simplest incarnation, it could be viewed simply as an advanced version of HTML. In fact it is not: XML is a simplified subset of, and HTML an application of, something called "Standard Generalized Markup Language", or SGML. This is a sophisticated tag language, which, "due to [its] complexity, and the complexity of the tools required," as the Object Management Group has so delicately put it, "has not achieved widespread uptake." (1)
HTML consists of a set of predefined "tags" that instruct a piece of software called a "browser" to do certain things with the document. Typically these tags describe aspects of presentation, such as font style and size, line spacing, and so forth. Some tags, however, also identify links to other pages, drawings, artwork and so forth. The point is that every browser used by everyone on the internet knows how to interpret these tags and what to do with them. Since these tags are primarily concerned with presentation of the data, however, it is not possible to use tags to describe data structure or in any other way to describe the contents of a document.
XML allows tags to be defined by users. This gives users tremendous power to describe the structure and nature of the information presented in a document. This means, however, that standard browsers will not be able to do anything with these extensions. This makes the software environment for XML more complex, as described below.
Unlike HTML, XML does not allow description of the presentation of data. An associated language, the "Extensible Stylesheet Language" (XSL), must be used to address this.
What is it?
Here is an example of XML used to describe a data record that might be presented in a document:
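As a minimal sketch, a record matching the description that follows could be written as below. The element names are the ones discussed in this article; the data values are invented purely for illustration.

```xml
<?xml version="1.0"?>
<!-- Illustrative record; all data values are made up -->
<PRODUCT>
  <product_id>1234</product_id>
  <product_name>Widget</product_name>
  <unit_of_measure>each</unit_of_measure>
  <specification>
    <variable>color</variable>
    <value>blue</value>
  </specification>
  <specification>
    <variable>weight</variable>
    <value>2 kg</value>
  </specification>
</PRODUCT>
```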
Note a few interesting things about this example.
First of all, as with HTML, each tag is surrounded by less-than and greater-than signs (<>), and is usually followed by text. The text is in turn followed by an end tag, in the form </...>. A tag may have no content, in which case either the end tag follows immediately upon the tag (as in <specification></specification>), or the tag itself ends with a forward slash (as in <specification/>). Unlike with HTML, however, every element must be explicitly closed in one of these two ways; the end tag can never simply be left off.
A second thing to note is that, in this case, following the tag for product, a set of related tags follows, describing characteristics (columns, in this case) of product. In this particular case, the tag <PRODUCT> has been defined such that it must be followed by exactly one tag for <product_id> and one for <product_name>. You can't see this from the example, but <unit_of_measure> is optional. The tag <specification> is also optional, and there may be one or more occurrences of it.
Although it is optional, all XML documents should begin with <?xml version="1.0"?> (or whatever version number is appropriate). Note that the structure is hierarchical, so that an element can be under only one other element, and there can be only one hierarchy in a document.
Comments are in the form <!-- . . . --> Note that the double hyphens must be part of the comment. Note also that, unlike HTML, XML lets you use a comment to surround lines of code that you want to disable.
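As a small sketch of disabling a line this way (the element and its value are illustrative only), a comment can be wrapped around an element so that the parser ignores it. The one restriction to keep in mind is that the text inside a comment may not itself contain a double hyphen.

```xml
<PRODUCT>
  <product_id>1234</product_id>
  <product_name>Widget</product_name>
  <!-- Disabled for now; remove the comment markers to restore it:
  <unit_of_measure>each</unit_of_measure>
  -->
</PRODUCT>
```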
The meaning of a tag is defined in a "document type definition" (DTD). This is a body of code that defines tags through a set of "ELEMENTS".
The DTD for the above example looks like this:
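A DTD consistent with the element descriptions given in this article might be sketched as follows. The exact declarations are an assumption reconstructed from the surrounding text, shown here embedded in the document through a DOCTYPE statement.

```xml
<!DOCTYPE PRODUCT [
  <!-- A product has one id, one name, an optional unit of measure,
       and zero or more specifications -->
  <!ELEMENT PRODUCT (product_id, product_name, unit_of_measure?, specification*)>
  <!ELEMENT product_id (#PCDATA)>
  <!ELEMENT product_name (#PCDATA)>
  <!ELEMENT unit_of_measure (#PCDATA)>
  <!-- Each specification contains exactly one variable and one value -->
  <!ELEMENT specification (variable, value)>
  <!ELEMENT variable (#PCDATA)>
  <!ELEMENT value (#PCDATA)>
]>
```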
The DTD for an XML document can be either part of the document or in an external file. If it is external, the DOCTYPE statement still occurs in the document, with the argument "SYSTEM -filename-", where "-filename-" is the name of the file containing the DTD. For example, if the above DTD were in an external file called "xxx.dtd", the DOCTYPE statement would read:
<!DOCTYPE PRODUCT SYSTEM "xxx.dtd">
The file xxx.dtd then contains only the ELEMENT definitions themselves; the DOCTYPE statement is not repeated there. Note that the name specified in the DOCTYPE statement must be the same as the name of the highest level ELEMENT.
The definition for the element product includes a list of other elements that must follow – in this case, product_id, product_name, unit_of_measure, and specification. The "?" after unit_of_measure means that one occurrence may or may not follow. It's optional. The "*" after specification means that it is optional, but one or more occurrences may follow.
If there were a "+" after any element in the list, it would mean the element is not optional, and that there may be more than one occurrence of it.
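As a hypothetical variation on the product definition, if every product were required to carry at least one specification, the declaration might read:

```xml
<!-- specification+ : at least one occurrence is required, more are allowed -->
<!ELEMENT PRODUCT (product_id, product_name, unit_of_measure?, specification+)>
```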
Each of the elements in the list is then defined in turn in one of the lines that follow. "#PCDATA" means that the tag will contain text that can be parsed by browsing software. Specification is further elaborated upon as being followed by variable and value.
XML is case sensitive. XML keywords are in all uppercase. The case of a tag name must be the same as in its DTD definition. By convention, entity/table names in the above example are all in uppercase, while attribute/column names are all in lowercase. Conventions will vary.
Tags can have attributes. For example, instead of listing associated tags in defining <!ELEMENT specification (variable, value)>, above, the following line could be added to the DTD:
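A declaration along these lines would do it. Treating both attributes as required character data (CDATA #REQUIRED) and declaring the element itself EMPTY are assumptions made for this sketch; other choices are possible.

```xml
<!-- specification no longer contains child elements (assumed EMPTY) -->
<!ELEMENT specification EMPTY>
<!-- variable and value are declared as attributes of specification -->
<!ATTLIST specification
          variable CDATA #REQUIRED
          value    CDATA #REQUIRED>
```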
This creates "variable" and "value" as two attributes of specification, so they do not have to appear as elements in their own right. The data from the above example would then look like this:
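With attributes in place of child elements, a product record might be sketched as follows (the data values are again invented for illustration):

```xml
<PRODUCT>
  <product_id>1234</product_id>
  <product_name>Widget</product_name>
  <unit_of_measure>each</unit_of_measure>
  <!-- variable and value are now attributes on an empty element -->
  <specification variable="color" value="blue"/>
  <specification variable="weight" value="2 kg"/>
</PRODUCT>
```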
Note that this puts yet another design decision in the lap of the XML designer. There are advantages and disadvantages to each way of doing this.
Three levels of correctness are associated with an XML document:
A "well-formed" XML document is one where the elements are properly structured as a tree, with the opening and closing tags correctly nested. Well-formed documents are essential for information exchange.
A "valid" XML document is well formed and has tags that correspond to the document type declaration. It contains only elements and attribute values that conform to the DTD. While an XML document can be prepared and read without a DTD, a DTD is essential for establishing validity.
A "semantically correct" XML document is beyond the control of XML. It is incumbent upon the preparer of the document to ensure that it is logically structured and makes sense. (2)
Implications
The question remains, what does all this mean? The answer to that question is not obvious. Clearly web screens that display data from a database can be designed to do so more easily and with more control.
Not in the language, however, is the mechanism by which data will actually be retrieved from a database and placed in this page. If web pages are to be created with database data, software must be written to retrieve those data and create the pages. Presumably this would be in some combination of Java and SQL.
In addition, a standard browser, by definition, cannot properly interpret customized tags.
This can be addressed in one of three ways:
Software "applets" may be written and attached to the page. These would understand the data structure and respond accordingly to each tag.
Generic software may read the DTD and respond to tags accordingly. In this case, the response would be limited to what can be inferred from the DTD.
A community may define a set of tags for its purposes, agree to use them, and develop community-specific software to respond to them.
Presumably the first two options will be in Java or a similar language, but the standard tools for doing this remain to be written. The third option has already begun to take effect. For example, the chemical industry has set up an XML-based Chemical Markup Language, and astronomers, mathematicians and the like have similarly defined sets of tags for describing things in their respective fields.
Used to Describe Data
One feature of XML that has captured the industry's imagination is its ability to describe data structures and hold data. As was seen in the above example, with XML, you can define new tags specifically to describe the equivalent of tables and columns in a relational database structure. More significantly, the tags for a set of columns or attributes can be related to the tags for their parent table or entity.
While the tag structure does seem to be a good vehicle for describing and communicating database structure, the requirement for discipline in the way we organize data is more present than ever. XML doesn't care if we have repeating groups, monstrous data structures, or whatever. If we are to use XML to express a data structure, it is incumbent upon us to do as good a job with the tool as we can.
Following in the tradition of the chemists and astronomers described above, the Object Management Group (OMG) has settled on a set of XML tags they call the XML Meta data Interchange (XMI) as a way to describe in standard terms the structure of data about data ("meta data"). This is useful in communicating between CASE tools, and in describing a "meta data repository". Along the same lines, a group of companies is in the process of defining a Common Warehouse Meta data Interchange (CWMI) that comprises a subset of the XMI tags to support data warehouses.
This means that there are actually two ways that a database structure can be described in XML:
First, an application database can be described in the DTD of an XML document. In this case the operational data contained in the described database could be placed between sets of the described tags. The DTD could, for example, be generated by one CASE tool and read by another one as a way of communicating data structure from one to the other.
A second approach is to make the table and column definitions data that appear between tags of an XMI metamodel. This is a little more arcane, since the XMI metamodel is very abstract, but using the XMI metamodel allows for description of much more than tables and columns.
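To make the contrast concrete, here is a rough sketch of the second approach, in which the names of a table and its columns appear as data rather than as tags. The tag names below are hypothetical and chosen only for readability; they are not the actual XMI vocabulary, which is considerably more abstract.

```xml
<!-- Hypothetical tags for illustration only; NOT the real XMI vocabulary -->
<table>
  <name>PRODUCT</name>
  <column><name>product_id</name></column>
  <column><name>product_name</name></column>
  <column><name>unit_of_measure</name></column>
</table>
```

In the first approach, by comparison, PRODUCT and product_id would themselves be tags defined in the DTD.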
Note, however, that the issue in defining a meta data repository or communicating between CASE tools is not the use of XML or any other particular language. The issue is the database structure and its semantics. The important question is not how a universal meta data repository will be represented. It could as easily be represented by a set of relational tables or an entity/relationship diagram. The questions are, what's in it and what does it mean? XML by itself does not answer that question. Which objects are significant and should be described? That is the harder question. Having a new language for describing them doesn't seem to contribute to that conversation.
Indeed, in recognizing that XML is a good vehicle for describing database structure, the issue that seems most obvious is that this will put greater responsibility on data administrators to define data correctly. XML will not do that. XML will only record whatever data design (good or bad) human beings come up with.
As Clive Finkelstein has said, the advent of XML is going to make data modelers and designers even more important than they are now. "After fifteen years of obscurity, data modelers can finally become overnight successes." (3)
(1) Object Management Group, XML Meta data Interchange (XMI) Proposal to the OMG OA&DTF RFP3: Stream-based Model Interchange Format, page 4-33.
(2) Ibid, page 4-36.
(3) Clive Finkelstein, Lecture, Data Resource Management Association, Seattle, Washington, May, 1999.
David C. Hay - In the information industry since it was called “data processing,” Dave Hay has been producing data models to support strategic and requirements planning for more than twenty-five years. As President of Essential Strategies, Inc. for nearly twenty of those years, Dave has worked in a variety of industries including, among others, banking, clinical pharmaceutical research, broadcasting, and all aspects of oil production and processing. Projects entailed various aspects of defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems.
Dave’s recently published book, Enterprise Model Patterns: Describing the World, is an “upper ontology” consisting of a comprehensive model of any enterprise from several levels of abstraction. It is the successor to his groundbreaking 1995 book, Data Model Patterns: Conventions of Thought – the original book describing standard data model configurations for standard business situations.
In between, he has written Requirements Analysis: From Business Views to Architecture (2003) and Data Model Patterns: A Metadata Map (2006). Since he took the unusual step of using UML in the Enterprise Model Patterns… book, a follow-on book, UML and Data Modeling: A Reconciliation was published later in 2011. This book both shows data modelers how to adapt the UML notation to their purposes, and UML modelers how to adapt UML to produce business-oriented architectural models.
Dave has spoken at numerous international and local DAMA, semantics, and other conferences as well as at various user group meetings. He can be reached at firstname.lastname@example.org, (713) 464-8316, or via his company's website.