TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Subscribe to TDAN

   > home > newsletter > article
 Printer-friendly
 E-mail to friend

Real Men Don't Read Instructions

by Alex Friedgan, Ph.D.
Published: April 1, 2009
Alex Friedgan provides a rebuttal to the recent article entitled "Why Data Models Cannot Work."
The article Why Data Models Cannot Work published in February edition of TDAN stirred my emotions. It stated requirements for a model from the knowledge management perspective and provided an example to highlight where data models fell short. The intention was likely to stimulate a discussion. It definitely achieved that goal: it provoked me into writing this response.

Points of Concurrence

There are problems and issues within any discipline. As a practitioner, I am painfully aware of ours. A logical data model should represent the business domain while following all relational normalization rules; a physical model is its implementation under constraints of the given DBMS. Very often, the real-world data model is a hybrid between logical and physical ones. There is not enough time and resources to create two separate models and, more importantly, to maintain both and keep them in sync. Out of two, the physical one is essential for an IT organization; the scripts for creating and modifying database objects are generated from it. By natural selection, the more dominant species survive, and the logical model disappears:
  • Performance-related denormalized structures propagate from physical to the logical side;

  • New mandatory attributes are added as optional to simplify processing;

  • Keys are not fully identified or include extraneous attributes to reflect physical indexes;

  • Sub-types are not identified or reflect physical partitioning instead;

  • Etc.
Therefore, the data model should not be used alone but be augmented by other tools. These tools might include data profiling, as was done by my colleague. In a similar fashion, I verify new data sources in my data modeling efforts.

Perceived Data Model Shortcomings

The perceived shortcomings arose from the three-fold assertion put forward. Models should “tell us (1) the truth, (2) the whole truth and (3) nothing but the truth about what it is they are purporting to describe” [Chisholm 09]. Let us go over each part of the statement. A relational model should give correct information and should not mislead. I am in complete agreement on the first and third points. I would also argue that these qualities are attainable. In the places where I have worked, they are more of the norm rather than the exception. The situation with the second point is different. It is impossible for any model to contain the whole truth. In order to achieve that goal, the model would be as voluminous as the underlying data.

Definition of a Good Model

The first question would be what the principles are for creating a model. George Box, the industrial statistician, is credited with the quote “all models are wrong, some are useful” [Box 79]. The author applied his maxim to statistical analyses, which are notoriously ambiguous. The relational models associate attributes on a deterministic basis; the relationships once established are always true (at least they are supposed to be). So our models are less ambiguous. Nevertheless, the quote rings true to models in general and to relational models in particular.

The same ideas are detailed by Marshall Clemens on his website [Clemens 02]. The term “model” refers to any structured knowledge that accurately reflects – and helps us to make sense of – the world. What are the desirable properties of a good model?

“A good model should be:
  • Salient: Since no model can represent everything, it must selectively represent those things most relevant to the task at hand.

  • Accurate: The model should precisely encode the actual state of affairs and not an erroneous or biased view.

  • Complete yet Parsimonious: The model should be as simple as possible, but no simpler. It should concisely capture all the relevant dimensions of the problem without squeezing out the opportunity for serendipitous or creative insight.”
The properties list includes a total of 14 requirements. I showed the first three, which are the most relevant to the discussion. Nowhere was it stated that a model should be able to capture everything up to the tiniest detail. And models built on such principles have been used for years. They provided insight and spread knowledge. Therefore, models are the de facto basis of knowledge management.

Example Revisited

The original article had an example that compared a data model to data profiling analyses to highlight where data models fell short. The example used the Financial Instrument entity. The attribute level report only showed the attribute name and definition. Let us take a closer look at the data modeling language. It includes many more attribute properties, such as domain, optionality (Null/not Null), and participation in candidate and primary keys. Thus, the hypothetical data model in expanded form is presented in the Table 1 below.

alt

Table 1: Expanded Attribute Report for Financial Instrument Entity

Each of the just “uncovered” attribute properties enhances the model. The domain allows interpreting data values and suggests possible operations. This information is impossible to glean from the database alone. It is just one example of the model’s value.

Next is the optionality. Based on the incomplete attribute report, the CUSIP attribute was assumed to be mandatory. It led to a proposition that contradicted data: Every Financial Instrument has a CUSIP. Now it is clear that the CUSIP is an optional attribute. And the propositions are: some Financial Instruments have a CUSIP, and some Financial Instruments do not have a CUSIP. That exactly matches the database profiling results.

Last are the keys. The key structure for this entity is very complicated. I only depicted two keys: primary key (PK) and candidate key (CK1). Moreover, the candidate key might not reflect the true business domain and is shown for illustration purposes only. Anyway, let us finish with the disclaimers, and get to the point. In this example, the primary key does not reflect the business entity identity. It is only a surrogate key that substitutes a multi-part natural key of the entity. The natural key is relegated to the role of a candidate key. Thus, it is important to identify candidate keys for understanding the entity identity.

The Financial Instrument is a generalized construct representing a variety of investment products. A set of attributes vary by the product. This is apparent from the existence of optional attributes. According to best practices [Reingruber 94] the model should be refined until there are no more optional attributes. So a more sophisticated IT shop would create entities for each subtype. It might be a simple three-subtype structure based on the Financial Instrument Type with values of Equity, Bond and Option, or it might be a more complicated hierarchy. The attributes would be moved into the subtypes, where they would become mandatory. This would fill another purported gap, because the sub-typing hierarchy would provide propositions “particular” to a subtype that has a given attribute and “negative” propositions to subtypes that lack that attribute.

Some of the data profiling propositions in the original text showed dependency of an attribute on other attributes. One of them was “Every Financial Instrument that has a Maturity Date must have either a Coupon or a Strike Price.” An attribute should depend only on a key. The dependence on other attributes is an anomaly and a violation of normalization rules. That anomaly would go away once the candidate keys and the complete sub-typing hierarchy are established.

While a cursory look showed data modeling inadequacy, a more thorough analysis uncovers a treasure trove of useful metadata. The additional metadata covers most of the perceived gaps and discrepancies. Therefore, in our hypothetical example, the root of the problem was either the data modeler not using the tool to its capacity, and/or the knowledge worker not requesting all of the information.

Conclusions

Models in general are an integral part of the body of knowledge. Data modeling language is quite expressive when used in full. It is based on a solid foundation of relational theory. Data models are capable of adequately reflecting the underlying business. They can and should be used for knowledge management. As in any discipline, real-life implementations have issues. Knowledge workers should verify and, if need be, augment models by other tools such as data profiling. They should demand logical data models instead of physical/logical hybrids.

References
  1. Malcolm Chisholm (February 2009). Why Data Models Cannot Work.

  2. Marshall Clemens, 2002

  3. Box, G.E.P.(1979), Robustness in the strategy of scientific model building, in Robustness in Statistics, R.L. Launer and G.N. Wilkinson, Editors. Academic Press: New York.

  4. Reingruber Michael, William W. Gregory (1994).The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models.  A Wiley-QED Publication: New York.


Go to Current Issue | Go to Issue Archive


Recent articles by Alex Friedgan, Ph.D.

Alex Friedgan, Ph.D. - Alex Friedgan, Ph.D., is a Principal with Data Cartography. Alex has worked in multiple roles, including: research engineer, developer, analyst, college professor, database administrator, and data architect. He succeeded in solving problems of reverse engineering, agile development, data warehousing, distributed data architecture, object modeling, enterprise data management, metadata repository and metadata stand-alone solutions. He can be reached at alex.friedgan@gmail.com.