Data Structure: Data Modeling or XML?
Published: July 10, 2007
XML is a perfectly good vehicle for describing data to be transmitted from one place to another. It is not so good for describing the semantics – the nature of – the underlying data. It cannot replace data modeling and sound database design.
XML, data modeling, and database design are all ways to structure data. Each has its place. Unfortunately, our industry is somewhat confused as to what those places are. This article attempts to sort that out.
So says the introduction to an XML model published by the Justice Department to promote data sharing. In response to this the Justice Department has developed the “Global Justice XML Data Model” as a “collaborative effort among local, state, tribal, and federal visionaries.”2
The problem is that, while XML is a perfectly good vehicle for describing data to be transmitted from one place to another, it is not so good for describing the semantics – the nature of – the underlying data. It cannot replace data modeling and sound database design.
This article will present this issue from two different perspectives:
DIFFERENT KINDS OF DATA MODELS
Three Ways to Look at Data
In 1975, the American National Standards Institute (ANSI) published its “3-schema approach” to viewing data.3 As they the Institute saw it, there were at least three different perspectives on data. (Actually, in various drafts, they saw many more perspectives than that, but these have survived.)
Everyone in an organization looks at its data slightly differently. Each person holds in mind a “schema” or structure that is the representation of his/her surroundings from a particular point of view. Each of these schemas describing the world as one person sees it is called by ANSI an external schema. While, ideally, these are similar for the many people working in a company, as often as not they are in fact very different. Misunderstandings about what something means are at the heart of many systems (and management) problems in companies today.
Because of their diversity, it is not practical to draw a picture of each person’s view of their data, but it is appropriate when surveying these points of view to assemble an ontology or glossary of the language used throughout the organization. Where terms have different meanings or there are different terms for the same thing, these differences should be made explicit.
All of these participants in an organization are, after all, fundamentally looking at the same business, even if their perspectives are different. The objective of a good systems analyst is to integrate the set of external schemas to produce a single conceptual schema which represents the fundamental structure of the organization’s data.
The data model that portrays a conceptual schema consists of boxes representing the things of significance to it, along with lines describing how they are related to each other. (This is often referred to as an “entity/relationship” or “E/R” model.)
These conceptual schemas can be rendered at various levels of abstraction.
ANSI originally described the internal schema as the way data are represented in a physical storage medium. This included the use of random-access storage, tracks and cylinders, and so forth. With the advent of database management systems, however, this was broken into two schemas:
Note that along with relational database design, XML is one of several options for implementing a conceptual data structure. XML scripts are examples of logical schemas. As with relational database designs and object-oriented UML diagrams, these should be derived from a conceptual data model – not a replacement for it.
Note that in each of these technologies, trade-offs must be made in translating from the conceptual model. Relational database design does not recognize super-/sub-type structures, so adjustments must be made. XML is fundamentally hierarchical, so the translation from a two-dimensional data model will involve de-normalization that will sacrifice some flexibility.
The Semantics of an Entity/Relationship Model
It is the conceptual entity/relationship model that describes the business whose issues are being addressed by any system implementation. This is where the semantics of an organization, as identified by its external schemas, are resolved into a coherent structure. It is this coherent structure that should be the basis for any system design effort, regardless of the technology.
A conceptual data model is fundamentally a set of assertions about the nature of the organization. It is composed of entity classes4, attributes describing the data they represent, and lines representing the relationships between pairs of them.
Figure 1 shows a sample data model describing the structure of an Excel workbook. Included is a description of its authors and the companies it applies to, as well as its create and last update date and version number. These would apply to any document, but this model is concerned in particular with Excel spreadsheets. Specifically, an EXCEL WORKBOOK (a kind of DOCUMENT) may be composed of one or more WORKSHEETS, where each WORKSHEET may be composed of one or more CELLS. Each CELL is at the intersection of one ROW and one COLUMN. In addition, each CELL must be in a particular STYLE.
Note that here, relationship names are assigned rigorously, so that a relationship asserts a significant fact about the organization: “Each EXCEL WORKBOOK may becomposed ofone or more WORKSHEETS.” This approach is intended for presentation to non-technical business people so it sounds more like reasonable English, unlike the more commonly used structure, “Each EXCEL WORKBOOK has zero, one or more WORKSHEETS.”
Note that the conceptual model describes a real situation, completely independent of any technology that might be used to implement a system. (This particular example is a special case because the subject of the model is a technology, but the point still holds. This model describes that subject, without concern for how this descriptive information might be manipulated by technology.) Indeed, the same conceptual model may be used to implement a multitude of completely different kinds of systems: the entity classes can be rendered as either relational tables or object-oriented classes. The relationships in a relational world become foreign keys; in an object-oriented world, they become behaviors by which each of the classes gains access to other classes.
And in XML, entity classes become tags, while relationships become implied in hierarchical tag structures. (If you are reading this section, contact your author at email@example.com. The first five people to do so will receive a free autographed copy of his latest book, Data Model Patterns: A Metadata Map.)
The data model in Figure 1 was actually inferred from an XML sample provided by Peter Aiken.5 An example of a document described by this model is shown in Figure 2. After an introductory line, it has a tag defining a which is the overall DOCUMENT in the above data model. After lines describing the “namespaces” that define the tag, the workbook contains a tag for . In the model above, these are attributes of the entity class DOCUMENT.
After the come some lines defining , with one block for each . In the model above, these are instances of the entity class STYLE.
After that comes a . These are shown in the data model by the fact that each EXCEL WORKBOOK may be composed of one or more WORKSHEETS. You will note by the ellipses (...) that this is not the entire XML document. Space constraints prevent showing the rest of the script, in which are portrayed a set of rows identified by , , and . In the much more compact data model in Figure 1, each WORKSHEET may be composed of one or more CELLS. Each CELL, in turn, must be in one and only one STYLE and must be located in both one COLUMN and one ROW.
Moreover, of course, this is just a set of instances of the model. The “model” itself is written in a variant of XML called “XML Schema.” This is an XML script using predefined tags that are for the purpose of defining other XML tags. A portion of the schema used to describe the document in Figure 2 is shown in Figure 3.
Figure 3: A Sample XML Schema
As you can see, there are significant differences between these two ways to represent data structure:
Note that a relational database design is the technique of choice for storing the data before they are sent and after they are received. This is yet another kind of logical model.
To compare these two kinds of logical models, note that the relational database design has the two-dimensional organization of the data model. As such, it is possible to retrieve data from various directions. The XML script is fundamentally hierarchical. It represents the person who is the primary author and the secondary author, but only with great difficulty can it represent all the spreadsheets written by an author.
While an XML script (and a relational database design) can be derived directly from a conceptual data model, it is extremely difficult to derive a conceptual data model from an XML script. Your author can testify as to how hard this is by virtue of that fact that this indeed is what was done here.
Understand, however, that the model shown in Figure 1 was not directly derived from the XML. Liberties were taken and assumptions were made. The actual derivative from the XML above is shown in Figure 4, below. Note that the hierarchical structure made it impossible to show any of the configurations above where entity classes had multiple parents. In the XML, “Author,” “Last Author,” and “Company” are de-normalized as attributes of WORKBOOK. CELL could only be shown as being in a ROW, not as being the intersection of a ROW and COLUMN. STYLE could only be shown as applying to the entire WORKBOOK, not as being different from cell to cell.
Now, if the data model had come first, the XML could at least have been made more correct. STYLE could be a de-normalized attribute of CELL. Indeed, a TABLE could be shown as being composed of one or more CELLS, with the “Row number,” “Column letter,” “Autofit width,” and “Width” as attributes. Again, this is de-normalized, but that is acceptable when all you are doing is communicating between two (it is to be hoped) normalized databases.
Because the semantics of the data are not visible in an XML script, it is extremely dangerous to create one without having done a data model first.
Storing and Sharing Data
In addition to semantic considerations, there is another reason to be wary of over-enthusiastic XML projects.
A collection of interested federal employees and contractors6 have banded together to create something called the “Federal Enterprise Architecture” (FEA). It consists of a set of “Reference Models” that go way beyond just data architecture:
The first three deal with the business of the federal government:
One is concerned with technology:
And one goes to the heart of data architecture:
The FEA DRM
The Data Reference Model is significant in that it recognizes three different views on how to describe and use data:
These are shown schematically in Figure 5.
From the point of view of the Federal Data Reference Model, then, it is clear that XML is about data sharing. While some people have tried to create XML “databases,” it seems clear that this is not the technology to use for managing data, controlling access to it, and retrieving it. For data description, then, the technology of choice is currently the relational database. Relational database technology has been proving itself for over twenty years
While it is true that a relational database can be used for sharing data across organizational boundaries, many organizations are not ready to make their underlying database available to outsiders. A means of exporting data and sending it (in a controlled fashion) is clearly desirable, and XML fits this bill.
For years, the worlds of data modeling and database design have had an uneasy relationship. People whose world is logical database design often don’t understand the value of conceptual data models, while data modelers often do not understand the constraints under which database designers work.
XML is a new kind of logical model technology, so this category of misunderstandings is relatively new in this field. It is the same set of misunderstandings, however, even if it has a new flavor. Programming languages must reflect the semantics and mission of an organization if they are to be successful. They are not, however, adequate to represent these fully. If you want to understand the business problems being addressed, create a data model. Then use that data model as the basis for defining whatever technology you want to use to build systems.
Recent articles by David C. Hay
David C. Hay - In the information industry since it was called “data processing,” Dave Hay has been producing data models to support strategic and requirements planning for more than twenty-five years. As President of Essential Strategies, Inc. for nearly twenty of those years, Dave has worked in a variety of industries including, among others, banking, clinical pharmaceutical research, broadcasting, and all aspects of oil production and processing. Projects entailed various aspects of defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems.
Dave’s recently published book, Enterprise Model Patterns: Describing the World, is an “upper ontology” consisting of a comprehensive model of any enterprise from several levels of abstraction. It is the successor to his groundbreaking 1995 book, Data Model Patterns: Conventions of Thought – the original book describing standard data model configurations for standard business situations.
In between, he has written Requirements Analysis: From Business Views to Architecture (2003) and Data Model Patterns: A Metadata Map (2006). Since he took the unusual step of using UML in the Enterprise Model Patterns… book, a follow-on book, UML and Data Modeling: A Reconciliation was published later in 2011. This book both shows data modelers how to adapt the UML notation to their purposes, and UML modelers how to adapt UML to produce business-oriented architectural models.
Dave has spoken at numerous international and local DAMA, semantics, and other conferences as well as at various user group meetings. He can be reached at firstname.lastname@example.org, (713) 464-8316, or via his company's website.