Better Data Models – Today:

Written by Graeme Simsion, Simsion & Associates / University of Melbourne[1]

Published in TDAN.com October 2005

Introduction – An Approach to Data Modeling Problems

One of the consequences of writing a book [2] on a technical topic is to be asked for advice on that topic, not only in the course of consulting assignments,
but as an instructor, panel member or conference participant.

Generally, this is a pleasant duty, but the challenge is to be able to give some useful direction in a very short time, without detailed knowledge of the problem. “It’s in the book” is concise
and often the soundest response, but it’s unlikely to win friends or add much to a panel discussion. What is wanted is a few words of advice that will offer insight into the problem and point the
way towards a solution.

Data modeling questions from experienced practitioners generally fall into one of three categories:

the politics of data modeling (in particular, how to persuade others of its value, and ensure that it is given adequate weight in applications planning and development);
the organization of the data modeling task (in particular the relationships between data modelers, business people, other analysts and DBAs)
technical questions about modeling (“how do I model this situation?”)

I have offered some thoughts on the first two categories in an earlier article.[3] In answering more technical questions about models and modeling, I have found
that the single most useful approach is to frame the problem in terms of the factors which determine the quality of a data model. Indeed in a recent class for advanced data modelers, we developed a
routine of returning to the slide that summarized these factors whenever a participant had a technical question. Later I found that one attendee had pinned a copy of the slide to his wall, and was
using it as a reference point when reviewing models with his staff. For an “ask the expert” session, I printed copies, and found that they provided a good basis for addressing two thirds of the
questions.

So, where is this all-purpose slide? It’s coming, but we first need to understand the context in which it is intended to be used, and note an important assumption (if you’re not with me on this,
the slide won’t be helpful).

After that, we’ll look at each of the factors in a little more detail, with the aid of a “simple[4]” case study.

Behind the Quality Factors

The idea of quality factors, at least as used here, arises from the fact[5] that there is more than one workable solution to most data modeling problems, and
hence we need some means of comparing alternative solutions. The “one right answer” assumption, often implicit but reinforced by both academic and industry teaching and language, is probably the
single greatest impediment to good data modeling. It stifles progress at both ends of the development process. Getting started is difficult because we feel it necessary to produce the ultimate
model rather than something basic which works but can be improved. And when we do produce something workable, we are not motivated to explore serious enhancements or alternatives.

When someone tells me that they are having trouble modeling a situation, I encourage them to produce something, keeping in mind Ernest Hemingway’s advice that “the first draft of anything is
s***”, and the corollary: “good writing is re-writing”. With the tyranny of the blank page behind them, they can then work on making the model better. The quality factors help define what is
meant by “better”.

Once someone has produced a model, I suggest that they evaluate it in terms of the quality factors. Where is there room for improvement? How could we modify the model to achieve it? What would we
sacrifice in doing so?

If the modeler is faced with choices – “which is the better model?” or “which is the better representation of this data?” – the quality criteria provide a basis for comparison. The purpose is
not merely to answer the question “which model is better?” but to understand clearly how and why the options differ.

The “one right answer” assumption often lurks in the background of debate between modelers and other stakeholders, particularly DBAs. If the data modeler asserts that a model is simply wrong, but
the project team goes on to build a working application based on it, then the modeler will have a credibility problem.

What the data modeler should do in such circumstances is to observe that the model is seriously weak in certain areas, and point out the impact: perhaps reduced ability to accommodate change or
more complex programming to maintain duplicated data. The DBA might well respond that these sacrifices are necessary in order to meet performance objectives, and a debate should ensue, focusing on
the impact of the various options on the quality factors, and the relative importance of these factors. Where one factor has to be traded off against others, as is usually the case, the application
owner, properly briefed, is the logical person to make the final decision. Moving away from absolutes of “right” and “wrong” is a good start towards that decision.

Ten Quality Factors

Say “quality” and the definition “fit for purpose” is likely to be fired back. The purpose of a data model is typically to specify a database to support a computer application[6], and (it is to be hoped) future applications that may require some or all of the data. The ultimate measure of quality is how well it supports those applications.

Each of the ten quality factors shown in Fig 1 addresses some component of that fitness for purpose. They are based on those proposed in Data Modeling Essentials, with a couple of minor additions
and changes[7].

Scope / Coverage

Non-redundancy

Enforcement of Business Rules

Stability and Flexibility

Elegance

Communication (Ease of Understanding)

Ease of Programming / Enquiry

Data Re-usability

Integration

Performance

Fig 1. Data Model Quality Factors

Before we look at each factor in detail, a few qualifications are required.

The order does not reflect relative importance, but is a useful default sequence for evaluating a model for a transaction processing application. If a particular factor is paramount or
problematic, it makes sense to look at it earlier.
The factors do not cover every aspect of data model quality, nor do they represent the only way that quality might be assessed. They simply represent an evolving collection of dimensions that
have proved useful for addressing practical situations.
I have not included “cost” as a separate factor. The cost of developing the model is not a quality factor in the same sense as the factors listed in Figure 1 – in particular it is not
embodied in the model itself. The costs associated with using the model in developing and maintaining applications are a function of the other factors; indeed they represent the notional common
currency for evaluating them. (“So what is this reduction in Ease of Programming going to cost us in additional programming resources?”)
The factors are not independent – a problem if you are trying to build a rigorous mathematical model, but much less of an issue in practice. Elegance, Ease of Programming / Enquiry and
Communication address similar ideas, but it is helpful to have the three perspectives. More importantly, some of the factors are in natural opposition, and data modeling decisions will often
involve trading one off against another.
The relative importance of the factors, and the difficulty in achieving satisfactory results in each dimension, will vary from situation to situation.
The framework was originally designed with transaction processing (rather than decision support) databases in mind. However, it translates well into a data warehouse / data mart environment,
with some changes to the typical relative importance of the factors. In fact, discussing and understanding these changes is a helpful way of making the transition from modeling for transaction
processing systems to data warehouses and marts.

With the above qualifications in mind, let’s now look at each of the factors.

To add some consistency to the illustrative examples, I have taken as a case study a small database that I built myself in order to keep track of wine which I had purchased in advance of it being
bottled [8].

Scope / Coverage

Decisions about what to include are not always straightforward, particularly when the data in question serves to support decisions rather than tightly-defined transactions. Should my wine database
include the ratings of the wines by prominent reviewers – an important factor in deciding whether to buy more if the opportunity arises? If so, which reviewers should I include? Should I allow for
wines beyond my original purchases from Bordeaux – recognizing that the additional characteristics applying to those wines will need to be accommodated? What would I need to add to allow my
database to serve as a wine inventory once the wine was delivered?

This factor (which I originally called “Completeness”) is not often the focus of debate in the context of assessing overall quality; it tends to be treated as a stand-alone requirement, which
merits only a tick when the model is reviewed. When there are tradeoffs to be made, Scope is often treated as a given. However, as a data modeler, you should be aware of the temptation to extend
the scope of a model (it’s often so easy!) and that doing so may impact on other factors, typically Ease of Programming. You may not explicitly recognize the negative reaction to a model as being
a consequence of an earlier decision to expand its scope.

Non-Redundancy

Having taught normalization, hammered home its importance, and admonished students for producing models in anything less than fifth normal form without a well-argued reason, I have to admit that my
wine database was deficient in this area. Wine comes in different sized bottles, with correspondingly different prices (no, the formula isn’t simple), and my database included a separate row for
each size. Common data, such as the wine’s ratings which are not dependent on bottle size, was repeated for each row.

Why did I do this? The truth is that it seemed simpler than creating an extra table for those relatively few situations in which I ordered the same wine in different bottle sizes. In retrospect,
and in terms of the quality framework, I was buying (perceived) Ease of Programming and even some Elegance in the form of simplicity. Time has proven it to be a bad decision, which I would not have
made if I had been forced to review the quality tradeoffs objectively. So my attitude to normalization has in this case been vindicated, but I have a little more empathy with those who make
compromises!

The importance of non-redundancy is itself largely determined by the relative importance of update and enquiry transactions, together with performance requirements.

Enforcement of Business Rules vs Stability and Flexibility

It makes sense to look at Enforcement of Business Rules and Stability and Flexibility together, as they are naturally in opposition. More of one usually means less of the other. And in my
experience, this is the trade-off that data modelers most frequently struggle with, often without recognizing it explicitly. (To labor the point, recognizing the quality factors and trade-offs is
the most important step towards resolving them).

The reason that these two factors are opposed is pretty simple: the more rules you enforce in the model, the more likely it is that one or more of them will be rendered invalid by business changes.
So the model will be less stable in the face of business change.

We can vary the degree to which a model implements business rules[9] by varying the level of generalization of entities and attributes. This is a major topic in
its own right – but it’s worth noting that generalization decisions (e.g. what level of subtype shall we implement?) can generally be framed in terms of the Business Rules / Stability trade-off.
Generalization decisions are also likely to impact on Elegance and Ease of Programming / Enquiry.

I make a distinction between Stability and Flexibility, though both are about the ability of the model to accommodate business change. Stability is the ability of the model to accept business
changes without modification; flexibility relates to ease of modification.

My wine database had separate columns for each of the different reviewers’ ratings. It was flexible in being able to accommodate additional reviewers (just add another column) but not stable
(adding a column does change the database structure, and any programs using that data would need to be modified accordingly). By generalizing the columns to a single (repeating) “reviewer rating”
column, and normalizing them out to a separate table, I would have created a database that could accommodate new reviewers without change, at the cost (perhaps) of some simplicity of programming. I
would also have lost some business rules – perhaps not terribly important ones in this instance. The model would no longer enforce a specific set of pre-selected reviewers, and it would be
difficult to apply different constraints to the data from different reviewers (they use different rating scales).

Elegance

Elegance is a hard concept to pin down – but most data modelers will admit to a few “aha!” moments when a particular representation seems particularly neat. Often they interpret that feeling
(incorrectly of course) as a sign that they have arrived at the single right answer. For this reason, elegance can be a trap as well as a desirable objective; the important thing is to keep it in
perspective as just one dimension of quality.

Elegance is associated with consistency, conciseness, and comprehensiveness. All things being equal, these qualities should make programming cheaper and easier, and less prone to errors.
Unfortunately they are often achieved through high levels of generalization which may mean poor enforcement of business rules and a requirement for sophisticated generic code.

My wine database had a serious elegance problem: I built it for the first vintage I purchased, and just set up a copy for the second vintage – taking the opportunity to make a few improvements to
the structure. Scope was improved but at a severe cost in elegance. Any work which required pulling data together from two vintages was going to be messy.

Communication and Ease of Programming / Enquiry

Communication (the ease with which the model is understood by others) and Ease of Programming / Enquiry are closely related. Understanding is a pre-requisite to proper use of the model by
programmers and end-users.

One of the most frequent complaints about data models developed by sophisticated data modelers is that they are… sophisticated. The usual culprit is high levels of generalization, implemented in
the interests of scope, stability and elegance. With generalization comes unfamiliar language as the modeler seeks terms to embrace multiple business concepts. “Party”, “Tradable Object” and
“Schedulable Item” make their appearance, and more is asked of programmers and users if they are to use the database correctly.

A recursive relationship (often resulting from generalizing the original participant entities) is easy to draw, but can be a nightmare to code. “Table-driven” applications may offer great
stability and business flexibility – but require expert programmers and sophisticated use of programming tools to achieve the desired results.

Databases intended for direct enquiry by “unsophisticated” business users are a particularly strong example of the need for ease of enquiry. In a data mart setting, this factor is likely to be
given very strong weight, if necessary at the expense of other dimensions.

It was in the interests of ease of enquiry that I designed my original wine purchases databases as a single table (a spreadsheet in fact) – no joins or coding required! To facilitate communication,
column names were simple and natural (Pre-tax Purchase Price, Post-Tax Purchase Price). My assistant could use it without help and simple things were simple to do. But I paid a price for this
familiar representation of data: finding out how much I had spent meant bringing together pre-tax prices and post-tax prices in both Euros and Australian dollars. Of course, a proper multi-currency
design with the inclusion of tax rules would have made such enquiries easier – but at the expense of communication…

Data Re-usability

“Data, once captured, should be readily available to all legitimate users.” So goes the data management mantra: the difficulty, as data managers know, is encouraging individual project teams to
look beyond their own goals to those of the wider organization. Their issue is (or should be) the cost that they incur in doing so: in our terms, the trade-off between Data Re-usability and other
quality factors. Of course, if they cannot identify such sacrifices, there should be no reason not to design for maximum data reusability.

In practice, the trade-off is typically against Performance and / or Ease of Programming / Enquiry; we organize data in a way to favor our own transactions over those which may appear in the
future. Denormalization is a classic technique.

My wine database worked well for reconciling purchases with receipts – the “buy wine” transaction was neatly entered in a single line. But as I considered using the data as the basis for an
inventory, I had the problem of dealing with multiple purchases of the same wine (at different prices). Where was the simple “stock on hand” figure?

Integration

The Integration factor is related to Data Re-usability, insofar as it involves a wider perspective than the original application. The issue usually boils down to a need for consistency with other
data representations: having to accept someone else’s data model, if only in part. As such Integration may impact any of the other factors by removing the ability to optimize that factor.

I have to admit that integration was never a consideration when I designed my wine database. Unfortunately, the purchases were not tax deductible, so the only potential stakeholder I could think of
(my accountant) was not interested. But my wine merchant recently computerized, and I found myself with the latest offer nicely organized in a spreadsheet with all the ratings and pricing data that
I needed – in an entirely different format. Here was proof if I needed it that there is more than one way to model the same scenario. Given a choice of re-designing or re-keying, I’ve been reduced
to the latter.

Performance

Performance I’ve left until last, because it is generally tackled last. Common practice is to develop a data model independent of performance considerations, then to make any necessary changes
(which may potentially impact any of the other quality factors). The hope is that adequate performance can be achieved without impacting the logical structures, or that any such impact will be
minimal. Behind this, also, is an echo of the one right answer philosophy: get it right, then compromise only if necessary. Of course, the fact that the compromised model will still work emphasizes
that the issue is about relative goodness rather than simple right or wrong.

Broadly, I believe the approach is reasonable, but with two major caveats:

Firstly, the data modeler needs to stay involved. If there are to be changes, they should be weighed against the other factors, and the data modeler is the best placed person to facilitate the
process and communicate the results. Performance should not, by default, be allowed to trump all of the other factors.

Secondly, if it is clear from the outset that performance is going to be a critical factor that may require some tough design decisions, then it should be considered from the outset. The
performance expert (typically a DBMS expert) should be involved, and different designs should be evaluated with performance in mind. It’s much easier to build performance into the design than to
try to build it on.

It is with some relief that I can state that performance was not an issue with my wine database.

Pulling it Together

I chose my wine database as a case study primarily to illustrate that almost all of these quality factors will be relevant even in the simplest of data models. My original database was a single
table with 13 columns, yet it demonstrates issues in terms of all of the quality factors but Performance. It also (embarrassingly for me) illustrates what happens when one doesn’t take the time to
review a model rigorously.

It should be clear from the above that the performance dimensions interact with one another. The art of data modeling is to reach the best balance. In reviewing a model, I suggest that you evaluate
it in terms of each factor, noting ideas for improvement and which factors they will impact. Going into a discussion with other stakeholders so prepared will demonstrate a much greater empathy with
their possible priorities, and enable you to communicate the price that they will pay for having them met.

If all this sounds like overkill, remember that the data model is generally the single most important determinant of the shape, capabilities and flexibility of the applications that use it; data
model quality is the foundation of application quality.

MenuMenu

Share this post

Graeme Simsion