|
Data Model Quality: Where Good Data Begins
Published: January 1, 2005
Published in TDAN.com January 2005
The old chestnut goes that the only difference between a data modeler and a terrorist is that you can negotiate with a terrorist. In spite of the fact that there are many notations available and many ways to approach the effort using each of those notations, data modeling is such an intense, personal experience that it is very easy to become emotionally attached to a particular way to do it. On top of that, the difficulty that goes into making a modeling tool work just the way we want it to also tends to makes us uncomfortable with the prospect of change. It's a wonder any of us can talk to each other at all. Still, data models are products to be delivered to one's consumers, and it is clear, on the surface at least, that there is a difference between a good data model and a bad one. Unfortunately, however, that difference between good and bad is rarely articulated, and more often it is expressed in arguments carried out with an almost religious fervor. This paper attempts to cut through such arguments and present as dispassionately as possible some basic principles of data model quality. At the very least, it hopes to get the issues out on the table where we can discuss them. To be sure, it is inevitable that the "principles" will reflect the prejudices of your author. They are, however, drawn not only from his own sixteen years' experience, but also from that of other people as well. So, they are presented for your consideration with some confidence that there is merit to them. The first order of business of course is to clarify exactly what kind of data model we are talking about. The term "data model" has taken on several quite different meanings in recent years, reflecting the very different uses that collections of boxes and lines can be put to.
This article will address the conceptual-convergent-view of data modeling. This is the view that seems the most misunderstood-not the least because of its often being confused with the other two. The purpose of the conceptual model is not only to describe an organization in a way that represents fundamental structures, but also to describe it in terms that can be understood by the business stakeholders who will be responsible for any systems developed from it. For this reason, it is not only a tool for stimulating and disciplining thought processes but also a tool for communication. The issues described in this article, then, have to do with aesthetics as well as with clarity and coherence of thought. It is important to understand that the conceptual modeling described in this article is not about capturing the concrete terms of business clients, nor is it about database (or object class) design. Any notation can be used to create a conceptual model, but it is important to understand that that is what you are doing. A UML conceptual model composed of business classes is not the same as a UML design model describing the bits of program code that are design classes. Conceptual data model quality is particularly important for an organization that is beginning to focus on data quality. You cannot collect quality data if your database is poorly organized. You cannot organize your database properly if you do not understand the real structure and organization of the data. And you cannot understand the structure of the data if you cannot discuss it with the people who define and use it. This requires all the graphic and linguistic help you can get. By the way, the first prerequisite to data model quality is simply a passion for data modeling. If you are not a fanatic about making sure that each attribute and entity type has a precise definition, that each box is properly aligned, and that each symbol is in exactly the right position, you will never produce a great model, no matter what technique or approach you use. This article addresses data model quality in terms of three dimensions: a model's content, its names, and its appearance. At the end of the article, we will discuss how to go about producing a quality data model. Content: Six PrinciplesIn 1994, Matthew West, of the Shell Oil Company, prepared a set of three white papers called "Developing High Quality Data Models". The first of these specifically addressed data model quality.[3] In it, he laid out six principles for high quality data models:
1. Entity types: The Underlying Nature of an ObjectThe first principle goes to the heart of the meaning of an entity type: An entity type in a conceptual model is a thing of significance to the organization, about which it wishes to keep information. It is the thing itself, not the thing playing a role. Unfortunately, if you look through your interview notes looking for nouns as candidate entity types, many of the nouns will combine references to things with the roles they play. Figure 1 shows two common models, describing customer and vendor: Each customer may be a buyer in one or more sales orders; Each vendor may be a seller in one or more purchase orders. These models raise a question, however: what if the same person or organization is both a customer and a vendor? Or, what if a division of your own company is a customer? By folding the "customerness" and the "vendorness" of a person or an organization into its underlying definition, you are no longer able to deal with that definition separately.
Figure 1: Order Roles
Figure 2 shows an alternative. This model actually addresses two issues. First, it recognizes that we will often have relationships that apply to either a person or an organization. By defining the super-type party, we can talk about the groups of people and organizations together. Second, we have generalized the concept of sales order and purchase order to simply, order. This recognizes that the two kinds of orders are in fact the same thing. The only difference is whether the viewer of the model is buying or selling. Thus "customerness" and "vendorness" are not characteristics of entity types but rather roles played in the relationships between party and order. More than that, by explicitly recognizing that every order has both a buyer and a seller in it, we have also recognized that we are a party, just like all those others we deal with. This turns out to be a useful thing. Indeed, the values of our attributes can be stored just like those for everyone else.
Figure 2: Parties and Orders
By the way, note that if you are successful in creating a model of true entity types, and have correctly assigned single-valued attributes to them, you will have by definition created a fully normalized model. To say that all attributes of an entity type are a function of the key, the whole key, and nothing but the key is to say that they are in fact attributes of this thing. 2. Sub-types and Super-typesFollowing on the above point, the second principle asserts that "Entity types should be part of a sub-type/super-type hierarchy ('class hierarchy' if you're familiar with object oriented terms) in order to define a universal context for the model." As it happens, the sub-type/super-type structure is not a relational concept. By definition, a relational database consists of flat, two-dimensional tables only. But in his formulation of information engineering, Clive Finkelstein picked up the concept of "inheritance" from the object-oriented community and it has turned out to be a very powerful way to represent the world. Among other things, it allows you to show that the concrete things most people see are examples of more general concepts. There are actually two justifications for setting up a sub-type/super-type structure-one formal and one informal. The formal justification for a sub-type comes when it inherits all relationships and attributes from its super-type, but it has at least one unique relationship and/or attribute. The informal justification for a sub-type is to illustrate the kinds of things represented by the super-type, even if there are no differences in their attributes or relationships. This can be very useful in the presentation of a model, to clarify the meaning of the super-type. Figure 3 shows examples of both of these uses. It represents the kinds of facilities that constitute an oil field. The kinds of subsurface facilies (the stuff in an oil well) have distinct attributes, as well as specific relationships among them that could not be expressed if they were not broken out into sub-types. Specifically, each well may be composed of one or more completions, and each well may be composed of one or more wellbores. Each of these three entity types has a different set of attributes: "Surface Elevation" for well; "Top Depth" and "Bottom Depth" for wellbore; none for completion. On the other hand, however, there seems to be no formal justification for the sub-types of surface facility. As nearly as we can tell from the figure, the attributes of a steam generator, a dehydration plant or any other surface facility are all the same. Even so, it is useful to show them as sub-types, to clarify just what is meant by a "surface facility". In fact, further analysis may in fact show different attributes for the different sub-types, but even if it did not, this would still be a reasonable way to present surface facility.
Figure 3: Sub-types and Super-types
3. Activities and Associations are Entity TypesThe first draft of a model is likely to contain many-to-many relationships. Indeed even many of the one-to-many relationships, upon further reflection, turn out to be many-to-many relationships. Normalization will require you then to define associative (intersect) entities to account for them. (The tangible entity types thus associated are called "reference entity types".) Principle 3 says that, from the beginning, "activities and associations should be represented by entity types (not relationships)." Figure 4 shows employment as a relationship between person and organization. Each person may be an "employee" by virtue of being employed by one and only one organization. (Note that the commonly specified "employee" is an example of the spurious entity types described in Principle 1, above.)
Figure 4: A Draft
The only problem with this is that it represents considerably more company loyalty than is normally the case. In the "real" world, each person is likely to be employed by more than one organization over time, necessitating records to be kept of each employment. This results in Figure 5. As it happens, over time, nearly all one-to-many relationships between reference entity types are really many-to-many relation-ships.
Figure 5: A Final Draft
4. Relationships as InvolvementOk, if associations are to be represented as entity types, what do relationships represent? Simply, a relationship is an entity type's involvement with one of those associative entities described above. In Figure 5, above, there are two relationships between any associative entity and its reference entities. Each employment may be of one person and in one organization. 5. All Attributes are Candidate RelationshipsThe fourth principle says simply to examine each attribute of an entity type and ask whether it isn't really a relationship to something else. As an example, consider data about "households" received by a bank from a market research company. First, we draw an entity type for household, as shown in Figure 6. The attributes shown are those delivered by the market research company.
Figure 6: Household
Let's look at the attributes of household: "Income", "number of cars", and "sociological group", are examples of potential relationships to other things. "Average balance" could stay as an attribute, but if we want to keep it over time, it needs its own entity type as well. This means that the model can be better represented as shown in Figure 7.
Figure 7: Household Membership
Here, what had been the attribute "Sociological group" is now a sociological group membership for a household in a sociological group. Using this structure allows you, as you could not before, define in advance a specific list of sociological groups that are of interest. You can also now see that a car ownership is of a car model. (If it were of interest, you could add a relationship to assert that each car model must be manufactured by an organization-a car company.) Income ranges can similarly be grouped into pre-defined categories. And finally, bank balance contains the household's "Average balance" for multiple points in time, as designated by the "Effective date" of each occurrence. If it is not clear what kinds of categories will be of interest in the future, this model can be further generalized, as shown in Figure 8. For example, a household parameter group could be "Car Model", and a household parameter could be "Number of Jaguars". Thus, the household parameter value of "Number of Jaguars" for "Jay Leno" could be "5".
Figure 8: Generalized Households
A word of caution: Generalizing a model makes it much more flexible and robust. It insulates it against the effects of future changes to the business. At the same time, however, it removes the model from the language of your customers. It is important to use judgment in determining how general to become. In your author's experience, most business people can be taught the meaning and implications of more general models, and once so taught, they appreciate the elegance. It is important to do this teaching carefully, however. In presenting this model, begin with Figure 6 and work your way to Figures 7 and 8. 6. Only Use Surrogate IdentifiersThe sixth principle in the Shell paper asserts that only surrogate identifiers should be used as unique identifiers for entity types. A surrogate identifier is a system generated number that is uniquely assigned to each row. It has no inherent meaning and is used only to distinguish one row from another. Your author respectfully disagrees with part of this one. It is the case that for reference entity types (those that are not required to be related to any other entity types), surrogate identifiers are best. It is extremely rare to find any attributes for product, person, activity, etc. that could never change value. Moreover, the temptation (left over from manual identification systems), to encode all manner of information into the identifier is genuinely a bad idea. Surrogate identifiers are simpler, easier to implement, and a better guarantor of the integrity of the data. In the case of associative entity types, however, the identity of an occurrence is, by definition, derived from the things it is associating. If the related things are cleanly identified with surrogate identifiers, there is no harm in using the relationships to those things as the identifier for the association. By definition, you cannot change the values of such a composite identifier for an occurrence and keep the same occurrence. If it is necessary to allow for more than one occurrence of a pair, it is a simple extension to add a "Sequence number" to the compound identifier of the associative entity type. Even though this is using a surrogate for part of the identifier, it is useful to keep the natural identifier of the entity type visible. Figure 9 shows this. party and product type are each identified by a surrogate "ID". vendor, on the other hand is identified by a combination of the party that is in the role of the vendor and the product type that is sold by the vendor. (In this notation, an attribute's participation in a unique identifier is shown by the octothorpe (#) to the left of the attribute. A relationship's participation is shown by a short line across the relationship line, next to the entity type being identified.) If one of the relationships is optional, then of course this argument doesn't fly. In that case a surrogate identifier is better. By the way, this is a very gentle argument. To assign a surrogate identifier to all entity types, including associative ones does minimal harm. To do so, however, does cloud the meaning of each entity type.
Figure 9: Identifiers
NamesThe second important dimension of data model quality is the use of adequate names for entity types, attributes, and relationships. Entity type NamesAs cited in the description of Principle 1, above, an entity type is a thing of significance to the organization. As such, it should have a common, recognized natural language name. In Figure 9, for example, the names party, vendor, and product type are all names of real things. An entity type name should not be:
Attribute NamesAn attribute is the definition of a kind of data about an entity type. This also should be a simple natural language term. There is some discussion in the industry about standardizing the structure of attribute names, but this is problematic. The idea of classifying attribute names is not a bad one, but it must be carried out with care. A structure that produces names that are not simple or intuitive is counter-productive if the model is to be understood by the public at large. Some argue that the entity type name should be included in the attribute name. In a conceptual data model, however, an attribute is never shown or described outside the context of its entity type, so this is redundant. In deference to those who use CASE tool utilities to generate default database designs, however, one place where the entity type name can reasonably be added is to the identifier, "ID" ("Person ID", "Product Type ID", and so forth). It does not make the name excessively long, and when the attributes get converted to foreign keys in a system implementation, the foreign key columns are easier to understand. In most cases, however, descriptive attributes gain nothing but length when the entity type name is appended to it. More problematic is the desire to standardize the qualifiers ("amount", "text", "quantity", etc.) to attribute names. This can be of value if the list of qualifiers is defined with care. Attribute names like "description text", however, are simply silly. Relationship NamesYour author studied under Richard Barker and is a strong believer in the Oracle/SSADM[5] standard structure for relationship names. It provides discipline, and it permits the relationships to be read as easy-to-understand but precise English sentences. When models are being presented to the public, this is extremely valuable. Each of the two roles in a relationship should be named with this structure:
For example, in Figure 10, the first relationship consists of the following two roles:
The second relationship consists of the following two roles:
(Note that these sentences are much more graceful than something like "A party buys in zero, one, or more orders". Non-data modelers don't talk like that.) Sentences in approximately this structure have been used throughout this article, and it is likely that you didn't even notice. This is a very natural way to describe a relationship.
Figure 10: Relationship Example
Appearance: AestheticsEdward Tufte has written, "Confusion and clutter are failures of [drawing] design, not attributes of information. And so the point is to find design strategies that reveal detail and complexity ¾ rather than to fault the data for an excess of complication. Or, worse, to fault viewers for a lack of understanding."[6] (Emphasis added.) In other words, if someone has trouble understanding your model, it is not necessarily because the underlying meaning of the model is too complex, nor is it because the viewer isn't smart enough to understand it. It may simply be the case that the model wasn't good enough. A conceptual data model, developed as part of a requirements analysis project is fundamentally for the purpose of sharing the analyst's understanding of the business with the business people. For this reason, it is extremely important that the data model which represents that analysis be as clear and easy to read as possible. This means that aesthetics are a very important dimension of data model quality. The first aspect of a model's aesthetics is the notation used to create the model. Among the notations available are:
Criteria for selecting a conceptual modeling notation include:
Organization Many of you have been presented with a data model, only to be overwhelmed by the profusion of boxes and lines on the page. The invention of the plotter was not an unmixed blessing. Size If a model is to be readable, it must be constrained to a reasonable number of entity types. As a rule of thumb, it should not have more than about fifteen entity types on an 8 ½" by 11" page. (A few more are permissible if they are sub-types.) A page should represent a subject area, ideally consisting of the "logical horizon" of a key transaction entity type.* Straight Lines An important aesthetic principle is that all lines between pairs of entity types must be straight. To put an elbow in a line is to add a symbol. But it is a symbol that has no meaning, so it is a visual distraction. Moreover, to have a relationship between two entity types that wanders all over the drawing is to have one that is extremely difficult to follow and understand. If necessary, stretch entity boxes. Following this rule makes it much less important whether or not lines cross. The eye doesn't see the crossing because it's focused on the relationship. Orientation One problem many models have is that there is no sense of organization. Figure 11 shows an example of a typical model. What is the model about? Measurements? Test Types? It's hard to say.
Figure 11: Disorganization
The Oracle/SSADM answer to this is to have all the crow's feet point either to the left or the top of the diagram. Figure 12 shows an alternative to Figure 11, following this rule. Note also how straightening the lines helped as well. In this diagram, we can see that its subjects are people, test types, and samples. The reason the diagram exists at all is to portray the collection of measurements. Note also that a measurement is similar in structure to expected measurement. Organization makes a big difference.
Figure 12: Organization
And yes there are heretics out there who would have the toes of the crow's feet point south and east. Fie on them! Well, ok, as long as the practice is followed consistently, the effect is the same. Alignment The art of data modeling is in the aesthetics of positioning entity type boxes on the page. It is difficult to give hard and fast rules in this area. It is generally a good idea, however, to align the left or right and top or bottom edges of logical groups of entity types. In general, have relationships (or groups of relationships) intersect the edge of an entity type at its center. Be sure to tidy up the relationship names, so that they don't cross the relationship lines. The Modeling ProcessOk, if these are the characteristics of a good model, how do you go about achieving them? Is this a skill that can be mastered by ordinary workers or can it only be done by a superior expert? The fact of the matter is that there are ways that anyone can go about developing a model that greatly increase the chances of being successful. First, it is important to recognize that modeling is a process like any other part of the system development effort. It requires time and resources and these have to be planned for. At the very least, a project plan could look something like this:
The big question of course is what does the "prepare models" step look like? How much time should be allocated to it? When your author did his first data models, this was a very painful process of collecting all the nouns from the interviews, posting them up (one per Post-it note) on a whiteboard and trying to infer structure from them. This required many iterations over many months. Arriving at a model required a lot of "Ah ha! Now I see it!" With experience, however, came the discovery that, if we are really going after fundamental things of significance, those are the same for any organization. Out of this discovery came, among other things, the book, Data Model Patterns: Conventions of Thought.[8] By using patterns, the process of developing a model no longer requires great intuitive leaps. It is simply a matter of recognizing these patterns when they occur, and elaborating on them as necessary to accommodate the particulars of this organization. The patterns exist in six broad categories:
Now, instead of trying to infer these things from notes, you can begin by asking the questions, such as "What people and organizations are of interest to you?" A person may be an employee, a customer, or an agent for a vendor. An organization may be a customer, a department, or a vendor. Each of these are roles played by a person and they can be modeled in a straightforward manner. (What? A customer can be either a person or an organization? Then perhaps we should define a super-type of person and organization that can play the role of "vendor". By convention, that super-type is known as party.) Similarly, what kinds of products are of interest? Are we only concerned with products for sale? What about equipment used in manufacturing? Are chemicals included? Other questions home in on facilities, procedures, and contracts. Just as data models provide structure to the enterprise, data model patterns provide structure to the process of modeling. To be sure, the particular sub-types of these elements are different from company to company. But the five categories are fundamental. Using them as a starting point makes the details much easier to find. ConclusionAn important component in the search for data quality is quality in the design of the tools that will capture, store, and manipulate data. This design must be derived from a sound understanding of the nature of the data and the organization manipulating them. This understanding can come from an effective effort to describe that nature in ways that everyone can see. Hence, data model quality is an important pre-requisite to data quality. [1] For a description of Mr. Zachman's "Framework for information systems architecture" go to http://www.essentialstrategies.com/publications/methodology/zachman.htm. [2] For a discussion of the differences between John Zachman's and David Hay's framework, see David C. Hay, Requirements Analysis: From Business Views to Architecture (Englewood Cliffs, NJ: Prentice Hall, 2003), 5-6. [3] Matthew West "Developing High Quality Data Models, Volume 1, Principles and Techniques", The Data Management Guide. (London: Shell International Petroleum Company Limited, 1994). These ideas are further expanded at http://www.matthew-west.org.uk/documents/princ03.pdf. [4] Ibid, p. 30. [5] "Structured Systems Analysis Methodology", commonly used in Europe. [6] Edward R. Tufte, Envisioning Information. (Cheshire, Connecticut: Graphics Press, 1990), 53. [7] For a full comparison of the most common notations, see Hay, Requirements Analysis, 343-387. * The logical horizion of an entity type is the entity type itself, plus all the entity types it requires-either directly or indirectly. To define one, start with the entity type in question and follow the trail of mandatory relationships to find all required entity types. [8] David C. Hay, Data Model Patterns: Conventions of Thought. (New York: Dorset House, 1995). This article was originally published in Cutter IT Journal, Vol. 16, No. 1, January 2003. It is reprinted with permission of the publisher, Cutter Information, LLC, Arlington, MA USA www.cutter.com. Go to Current Issue | Go to Issue Archive Recent articles by David C. Hay
David C. Hay - In the information industry since the days of punched cards, paper tape and teletype machines, Dave has been producing data models to support strategic and requirements planning for more than twenty
years. He has worked in a variety of industries, including, among others, banking, clinical pharmaceutical research, and all aspects of oil production and processing.
He is the founder and President of Essential Strategies, Inc., a fourteen-year-old consulting firm dedicated to helping clients define corporate information architecture, identify requirements, and plan strategies for the implementation of new systems. Dave is the author of the book, Data Model Patterns: Conventions of Thought, and Requirements Analysis: From Business Views to Architecture. His new book Data Model Patterns: A Metadata Map is a comprehensive schema of metadata from many different perspectives. He has also spoken at numerous international and local DAMA conferences, Oracle user group conferences, and many others.
He can be reached at dch@essentialstrategies.com, (713) 464-8316, or via his company's website at http://www.essentialstrategies.com. |