Better Data Models - Today: Understanding Data Model Quality
Published in TDAN.com, October 1, 2005
Written by Graeme Simsion, Simsion & Associates / University of Melbourne
Introduction - An Approach to Data Modeling Problems
One of the consequences of writing a book on a technical topic is that you are asked for advice on that topic, not only in the course of consulting assignments but also as an instructor, panel member or conference participant.
Generally, this is a pleasant duty, but the challenge is to be able to give some useful direction in a very short time, without detailed knowledge of the problem. "It's in the book" is concise and often the soundest response, but it's unlikely to win friends or add much to a panel discussion. What is wanted is a few words of advice that will offer insight into the problem and point the way towards a solution.
Data modeling questions from experienced practitioners generally fall into one of three categories:
I have offered some thoughts on the first two categories in an earlier article. In answering more technical questions about models and modeling, I have found that the single most useful approach is to frame the problem in terms of the factors which determine the quality of a data model. Indeed, in a recent class for advanced data modelers, we developed a routine of returning to the slide that summarized these factors whenever a participant had a technical question. Later I found that one attendee had pinned a copy of the slide to his wall, and was using it as a reference point when reviewing models with his staff. For an "ask the expert" session, I printed copies, and found that they provided a good basis for addressing two-thirds of the questions.
So, where is this all-purpose slide? It's coming, but we first need to understand the context in which it is intended to be used, and note an important assumption (if you're not with me on this, the slide won't be helpful).
After that, we'll look at each of the factors in a little more detail, with the aid of a "simple" case study.
Behind the Quality Factors
The idea of quality factors, at least as used here, arises from the fact that there is more than one workable solution to most data modeling problems, and hence we need some means of comparing alternative solutions. The "one right answer" assumption, often implicit but reinforced by both academic and industry teaching and language, is probably the single greatest impediment to good data modeling. It stifles progress at both ends of the development process. Getting started is difficult because we feel it necessary to produce the ultimate model rather than something basic which works but can be improved. And when we do produce something workable, we are not motivated to explore serious enhancements or alternatives.
When someone tells me that they are having trouble modeling a situation, I encourage them to produce something, keeping in mind Ernest Hemingway's advice that "the first draft of anything is s***", and the corollary: "good writing is re-writing". With the tyranny of the blank page behind them, they can then work on making the model better. The quality factors help define what is meant by "better".
Once someone has produced a model, I suggest that they evaluate it in terms of the quality factors. Where is there room for improvement? How could we modify the model to achieve it? What would we sacrifice in doing so?
If the modeler is faced with choices - "which is the better model?" or "which is the better representation of this data?" - the quality criteria provide a basis for comparison. The purpose is not merely to answer the question "which model is better?" but to understand clearly how and why the options differ.
The "one right answer" assumption often lurks in the background of debate between modelers and other stakeholders, particularly DBAs. If the data modeler asserts that a model is simply wrong, but the project team goes on to build a working application based on it, then the modeler will have a credibility problem.
What the data modeler should do in such circumstances is to observe that the model is seriously weak in certain areas, and point out the impact: perhaps reduced ability to accommodate change or more complex programming to maintain duplicated data. The DBA might well respond that these sacrifices are necessary in order to meet performance objectives, and a debate should ensue, focusing on the impact of the various options on the quality factors, and the relative importance of these factors. Where one factor has to be traded off against others, as is usually the case, the application owner, properly briefed, is the logical person to make the final decision. Moving away from absolutes of "right" and "wrong" is a good start towards that decision.
Ten Quality Factors
Say "quality" and the definition "fit for purpose" is likely to be fired back. The purpose of a data model is typically to specify a database to support a computer application, and (it is to be hoped) future applications that may require some or all of the data. The ultimate measure of quality is how well it supports those applications.
Each of the ten quality factors shown in Fig 1 addresses some component of that fitness for purpose. They are based on those proposed in Data Modeling Essentials, with a couple of minor additions and changes.
Scope / Coverage
Non-redundancy
Enforcement of Business Rules
Stability and Flexibility
Elegance
Communication (Ease of Understanding)
Ease of Programming / Enquiry
Data Re-usability
Integration
Performance
Fig 1. Data Model Quality Factors
Before we look at each factor in detail, a few qualifications are required.
With the above qualifications in mind, let's now look at each of the factors.
To add some consistency to the illustrative examples, I have taken as a case study a small database that I built myself in order to keep track of wine which I had purchased in advance of it being bottled.
Scope / Coverage
Decisions about what to include are not always straightforward, particularly when the data in question serves to support decisions rather than tightly-defined transactions. Should my wine database include the ratings of the wines by prominent reviewers - an important factor in deciding whether to buy more if the opportunity arises? If so, which reviewers should I include? Should I allow for wines beyond my original purchases from Bordeaux - recognizing that the additional characteristics applying to those wines will need to be accommodated? What would I need to add to allow my database to serve as a wine inventory once the wine was delivered?
This factor (which I originally called "Completeness") is not often the focus of debate in the context of assessing overall quality; it tends to be treated as a stand-alone requirement, which merits only a tick when the model is reviewed. When there are tradeoffs to be made, Scope is often treated as a given. However, as a data modeler, you should be aware of the temptation to extend the scope of a model (it's often so easy!) and that doing so may impact on other factors, typically Ease of Programming. You may not explicitly recognize the negative reaction to a model as being a consequence of an earlier decision to expand its scope.
Non-redundancy
Having taught normalization, hammered home its importance, and admonished students for producing models in anything less than fifth normal form without a well-argued reason, I have to admit that my wine database was deficient in this area. Wine comes in different sized bottles, with correspondingly different prices (no, the formula isn't simple), and my database included a separate row for each size. Common data, such as the wine's ratings which are not dependent on bottle size, was repeated for each row.
Why did I do this? The truth is that it seemed simpler than creating an extra table for those relatively few situations in which I ordered the same wine in different bottle sizes. In retrospect, and in terms of the quality framework, I was buying (perceived) Ease of Programming and even some Elegance in the form of simplicity. Time has proven it to be a bad decision, which I would not have made if I had been forced to review the quality tradeoffs objectively. So my attitude to normalization has in this case been vindicated, but I have a little more empathy with those who make compromises!
The importance of non-redundancy is itself largely determined by the relative importance of update and enquiry transactions, together with performance requirements.
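The normalization repair described above can be sketched in Python's built-in sqlite3. All table and column names here are my own illustrative inventions, not those of the actual database: size-independent facts are stored once per wine, and size-dependent facts once per (wine, bottle size).

```python
import sqlite3

# Illustrative repair (all names invented): the repeated per-size rows
# become a wine table plus a wine_bottling table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE wine (
        wine_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        vintage INTEGER NOT NULL,
        rating  INTEGER              -- no longer repeated per bottle size
    );
    CREATE TABLE wine_bottling (
        wine_id     INTEGER NOT NULL REFERENCES wine,
        bottle_size TEXT NOT NULL,   -- e.g. '750ml', 'magnum'
        price       REAL NOT NULL,   -- prices differ by size, non-trivially
        PRIMARY KEY (wine_id, bottle_size)
    );
""")
con.execute("INSERT INTO wine VALUES (1, 'Ch. Example', 2003, 92)")
con.executemany("INSERT INTO wine_bottling VALUES (?, ?, ?)",
                [(1, '750ml', 30.0), (1, 'magnum', 65.0)])

# A join restores the convenient one-row-per-size view when needed.
rows = con.execute("""
    SELECT w.name, w.rating, b.bottle_size, b.price
    FROM wine w JOIN wine_bottling b ON b.wine_id = w.wine_id
    ORDER BY b.price
""").fetchall()
```

The rating is now stored once per wine, so correcting it touches one row however many bottle sizes were ordered; the price of the repair is the join.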
Enforcement of Business Rules vs Stability and Flexibility
It makes sense to look at Enforcement of Business Rules and Stability and Flexibility together, as they are naturally in opposition. More of one usually means less of the other. And in my experience, this is the trade-off that data modelers most frequently struggle with, often without recognizing it explicitly. (To labor the point, recognizing the quality factors and trade-offs is the most important step towards resolving them).
The reason that these two factors are opposed is pretty simple: the more rules you enforce in the model, the more likely it is that one or more of them will be rendered invalid by business changes. So the model will be less stable in the face of business change.
We can vary the degree to which a model implements business rules by varying the level of generalization of entities and attributes. This is a major topic in its own right - but it's worth noting that generalization decisions (e.g. what level of subtype shall we implement?) can generally be framed in terms of the Business Rules / Stability trade-off. Generalization decisions are also likely to impact on Elegance and Ease of Programming / Enquiry.
I make a distinction between Stability and Flexibility, though both are about the ability of the model to accommodate business change. Stability is the ability of the model to accept business changes without modification; flexibility relates to ease of modification.
My wine database had separate columns for each of the different reviewers' ratings. It was flexible in being able to accommodate additional reviewers (just add another column) but not stable (adding a column does change the database structure, and any programs using that data would need to be modified accordingly). By generalizing the columns to a single (repeating) "reviewer rating" column, and normalizing them out to a separate table, I would have created a database that could accommodate new reviewers without change, at the cost (perhaps) of some simplicity of programming. I would also have lost some business rules - perhaps not terribly important ones in this instance. The model would no longer enforce a specific set of pre-selected reviewers, and it would be difficult to apply different constraints to the data from different reviewers (they use different rating scales).
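The trade-off just described can be made concrete with a small sqlite3 sketch, again using invented names: the column-per-reviewer design enforces the set of reviewers (and could carry per-column constraints), while the generalized rating table accepts new reviewers without structural change but gives those rules up.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Original style (names invented): one column per pre-selected reviewer.
# Adding a reviewer means ALTER TABLE and program changes.
con.execute("""
    CREATE TABLE wine_fixed_columns (
        wine_id          INTEGER PRIMARY KEY,
        parker_rating    INTEGER CHECK (parker_rating BETWEEN 50 AND 100),
        spectator_rating INTEGER CHECK (spectator_rating BETWEEN 50 AND 100)
    )
""")

# Generalized alternative: reviewers become data. No DDL change for a new
# reviewer, but per-reviewer rating-scale rules are no longer enforced by
# the structure.
con.executescript("""
    CREATE TABLE wine (wine_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviewer_rating (
        wine_id  INTEGER NOT NULL REFERENCES wine,
        reviewer TEXT NOT NULL,
        rating   INTEGER NOT NULL,
        PRIMARY KEY (wine_id, reviewer)
    );
""")
con.execute("INSERT INTO wine VALUES (1, 'Ch. Example')")
con.executemany("INSERT INTO reviewer_rating VALUES (1, ?, ?)",
                [("Parker", 92), ("Spectator", 90), ("NewReviewer", 17)])

ratings = dict(con.execute(
    "SELECT reviewer, rating FROM reviewer_rating WHERE wine_id = 1"
))
```

Note that the generalized table happily accepts a rating of 17 from a reviewer using a 20-point scale; any scale-specific checking now has to live in code rather than in the model.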
Elegance
Elegance is a hard concept to pin down - but most data modelers will admit to a few "aha!" moments when a particular representation seems particularly neat. Often they interpret that feeling (incorrectly of course) as a sign that they have arrived at the single right answer. For this reason, elegance can be a trap as well as a desirable objective; the important thing is to keep it in perspective as just one dimension of quality.
Elegance is associated with consistency, conciseness, and comprehensiveness. All things being equal, these qualities should make programming cheaper and easier, and less prone to errors. Unfortunately they are often achieved through high levels of generalization which may mean poor enforcement of business rules and a requirement for sophisticated generic code.
My wine database had a serious elegance problem: I built it for the first vintage I purchased, and just set up a copy for the second vintage - taking the opportunity to make a few improvements to the structure. Scope was improved but at a severe cost in elegance. Any work which required pulling data together from two vintages was going to be messy.
Communication and Ease of Programming / Enquiry
Communication (the ease with which the model is understood by others) and Ease of Programming / Enquiry are closely related. Understanding is a pre-requisite to proper use of the model by programmers and end-users.
One of the most frequent complaints about data models developed by sophisticated data modelers is that they are... sophisticated. The usual culprit is high levels of generalization, implemented in the interests of scope, stability and elegance. With generalization comes unfamiliar language as the modeler seeks terms to embrace multiple business concepts. "Party", "Tradable Object" and "Schedulable Item" make their appearance, and more is asked of programmers and users if they are to use the database correctly.
A recursive relationship (often resulting from generalizing the original participant entities) is easy to draw, but can be a nightmare to code. "Table-driven" applications may offer great stability and business flexibility - but require expert programmers and sophisticated use of programming tools to achieve the desired results.
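To illustrate where the coding effort hides, here is a sketch (invented names, Python sqlite3) of a generalized Party entity with a self-referencing relationship: one line on the diagram, but a recursive query to traverse.

```python
import sqlite3

# Invented example: a generalized Party entity with a self-referencing
# "reports to" relationship - a single line on the diagram.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE party (
        party_id  INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES party   -- the recursive relationship
    )
""")
con.executemany("INSERT INTO party VALUES (?, ?, ?)", [
    (1, "Group", None),
    (2, "Division", 1),
    (3, "Team", 2),
])

# Walking the chain needs a recursive query (or a loop in application
# code) - this is where the programming cost of the generalization lands.
chain = [name for (name,) in con.execute("""
    WITH RECURSIVE ancestors(party_id, name, parent_id) AS (
        SELECT party_id, name, parent_id FROM party WHERE party_id = 3
        UNION ALL
        SELECT p.party_id, p.name, p.parent_id
        FROM party p JOIN ancestors a ON p.party_id = a.parent_id
    )
    SELECT name FROM ancestors
""")]
```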
Databases intended for direct enquiry by "unsophisticated" business users are a particularly strong example of the need for ease of enquiry. In a data mart setting, this factor is likely to be given very strong weight, if necessary at the expense of other dimensions.
It was in the interests of ease of enquiry that I designed my original wine purchases database as a single table (a spreadsheet in fact) - no joins or coding required! To facilitate communication, column names were simple and natural (Pre-tax Purchase Price, Post-Tax Purchase Price). My assistant could use it without help and simple things were simple to do. But I paid a price for this familiar representation of data: finding out how much I had spent meant bringing together pre-tax prices and post-tax prices in both Euros and Australian dollars. Of course, a proper multi-currency design with the inclusion of tax rules would have made such enquiries easier - but at the expense of communication...
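A toy illustration of the enquiry code this design forces. All names, figures and the exchange rate are invented: with no multi-currency structure in the model, the conversion knowledge ends up hard-coded in whatever program asks the question.

```python
# Hypothetical rows from the single-table design: each purchase carries
# its price in whichever currency the supplier quoted.
purchases = [
    {"wine": "Ch. Example 2003", "post_tax_price": 360.0, "currency": "EUR"},
    {"wine": "Ch. Other 2003",   "post_tax_price": 500.0, "currency": "AUD"},
]

AUD_PER_EUR = 1.5  # assumed illustrative rate: the enquiry code, not the
                   # model, carries the multi-currency knowledge

def total_spend_aud(rows):
    """Total post-tax spend, normalized to Australian dollars."""
    total = 0.0
    for row in rows:
        price = row["post_tax_price"]
        if row["currency"] == "EUR":
            price *= AUD_PER_EUR
        total += price
    return total
```

Every enquiry that crosses currencies has to repeat this logic, which is exactly the maintenance cost the simple representation hides.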
"Data, once captured, should be readily available to all legitimate users." So goes the data management mantra: the difficulty, as data managers know, is encouraging individual project teams to look beyond their own goals to those of the wider organization. Their issue is (or should be) the cost that they incur in doing so: in our terms, the trade-off between Data Re-usability and other quality factors. Of course, if they cannot identify such sacrifices, there should be no reason not to design for maximum data reusability.
In practice, the trade-off is typically against Performance and / or Ease of Programming / Enquiry; we organize data in a way to favor our own transactions over those which may appear in the future. Denormalization is a classic technique.
My wine database worked well for reconciling purchases with receipts - the "buy wine" transaction was neatly entered in a single line. But as I considered using the data as the basis for an inventory, I had the problem of dealing with multiple purchases of the same wine (at different prices). Where was the simple "stock on hand" figure?
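A small sketch (invented names and figures) of deriving that missing figure: the purchase-oriented rows suit the "buy wine" transaction, but stock on hand must be computed by aggregation, and a valuation choice made along the way.

```python
import sqlite3

# Invented purchase-oriented rows: one line per buy, which suits the
# "reconcile purchases" use but stores no stock figure anywhere.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE purchase (wine TEXT, bottles INTEGER, unit_price REAL)"
)
con.executemany("INSERT INTO purchase VALUES (?, ?, ?)", [
    ("Ch. Example 2003", 6, 30.0),
    ("Ch. Example 2003", 6, 42.0),  # same wine bought again at a new price
    ("Ch. Other 2003", 12, 25.0),
])

# "Stock on hand" has to be derived; total cost is used for valuation
# here, but average or latest price would be equally defensible choices.
stock = {wine: (bottles, cost) for wine, bottles, cost in con.execute("""
    SELECT wine, SUM(bottles), SUM(bottles * unit_price)
    FROM purchase GROUP BY wine
""")}
```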
Integration
The Integration factor is related to Data Re-usability, insofar as it involves a wider perspective than the original application. The issue usually boils down to a need for consistency with other data representations: having to accept someone else's data model, if only in part. As such, Integration may impact any of the other factors by removing the ability to optimize that factor.
I have to admit that integration was never a consideration when I designed my wine database. Unfortunately, the purchases were not tax deductible, so the only potential stakeholder I could think of (my accountant) was not interested. But my wine merchant recently computerized, and I found myself with the latest offer nicely organized in a spreadsheet with all the ratings and pricing data that I needed - in an entirely different format. Here was proof if I needed it that there is more than one way to model the same scenario. Given a choice of re-designing or re-keying, I've been reduced to the latter.
Performance
I have left Performance until last because it is generally tackled last. Common practice is to develop a data model independent of performance considerations, then to make any necessary changes (which may potentially impact any of the other quality factors). The hope is that adequate performance can be achieved without impacting the logical structures, or that any such impact will be minimal. Behind this, too, is an echo of the "one right answer" philosophy: get it right, then compromise only if necessary. Of course, the fact that the compromised model will still work emphasizes that the issue is one of relative goodness rather than simple right or wrong.
Broadly, I believe the approach is reasonable, but with two major caveats:
Firstly, the data modeler needs to stay involved. If there are to be changes, they should be weighed against the other factors, and the data modeler is the best placed person to facilitate the process and communicate the results. Performance should not, by default, be allowed to trump all of the other factors.
Secondly, if it is clear from the outset that performance is going to be a critical factor that may require some tough design decisions, then it should be considered from the outset. The performance expert (typically a DBMS expert) should be involved, and different designs should be evaluated with performance in mind. It's much easier to build performance into the design than to try to build it on.
It is with some relief that I can state that performance was not an issue with my wine database.
Pulling it Together
I chose my wine database as a case study primarily to illustrate that almost all of these quality factors will be relevant even in the simplest of data models. My original database was a single table with 13 columns, yet it demonstrates issues in terms of all of the quality factors but Performance. It also (embarrassingly for me) illustrates what happens when one doesn't take the time to review a model rigorously.
It should be clear from the above that the quality dimensions interact with one another. The art of data modeling is to reach the best balance. In reviewing a model, I suggest that you evaluate it in terms of each factor, noting ideas for improvement and which factors they will impact. Going into a discussion with other stakeholders so prepared will demonstrate a much greater empathy with their possible priorities, and enable you to communicate the price that they will pay for having them met.
If all this sounds like overkill, remember that the data model is generally the single most important determinant of the shape, capabilities and flexibility of the applications that use it; data model quality is the foundation of application quality.
Quality factors are discussed directly at a more introductory level in Data Modeling Essentials (pp10-15). However, they provide the foundation for much of the remainder of the book, as they do for my teaching and consulting on data modeling. The fact that the specific factors used in the book are slightly different from the version in this paper is of minor importance; the critical issue is to recognize that a model must address multiple, inter-related quality criteria and to model with these constantly in mind.
There are a few academic papers that address data model quality. Most are targeted at the models produced by students in response to simple, contrived exercises, and focus on semantic and syntactic completeness: did the modeler get it all and represent it correctly? They often implicitly support the "one right answer" view, and do not offer much to the experienced practitioner. Of the more sophisticated papers, I would recommend one by Moody and Shanks as the most interesting for practitioners. It looks at quantifying data model quality factors - an interesting direction, but one which I think is less important than getting the best possible set of factors and recognizing them qualitatively. Again, the issue is not so much to pick a winner, but to understand why models differ, what their weaknesses and strengths are, and how they can be improved.
 Since you asked, it's Data Modeling Essentials 3rd Edition, Morgan Kaufmann, 2005, co-authored with Graham Witt.
 Tackling Data Modeler's Toughest Challenge, TDAN Newsletter, April 2005.
 Case studies taken from the real world are almost never "simple" in the way that artificial teaching examples are.
 I've stopped saying "contention". I'm well aware that the 'one right answer' school of thought has many adherents, but am satisfied that the position I take here can be both theoretically and empirically demonstrated. I discussed the issue in some detail in an earlier article: You're Making it Up: Data Modeling: Analysis or Design?, TDAN, October 2004.
 We are not looking at data models for data planning or documentation of existing databases.
 The first edition of Data Modeling Essentials (1994) listed seven factors roughly corresponding to seven of those listed here. The current edition lists eight, treating Performance separately, and incorporating Ease of Programming / Enquiry in Elegance.
 For the information of wine-loving readers, the wine was (red and white) Bordeaux purchased en primeur from various suppliers. I trust such readers will forgive me a few simplifications which I have made so as not to overwhelm / bore the uninitiated.
 Rules not implemented in the model can still be implemented elsewhere - in code, in data values or external to the application.
 Moody, D and Shanks, G (2003): Improving the Quality of Entity Relationship Models: Experience in Research and Practice, Information Systems Journal, 619-650.
Graeme Simsion - Graeme is the author of Data Modeling Essentials and Data Modeling Theory and Practice. Since mid-2007 he has no longer been involved in data management and data modeling, but he continues to draw on his experience as a consultancy manager to advise and teach on the management and delivery of consulting services.