TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Subscribe to TDAN

   > home > newsletter > article
 Printer-friendly
 E-mail to friend

Data Models and Data Profiling
Complementary Techniques

by Michael Smilg, CDMP, MBA
Published: May 1, 2009
Data models and data profiling are complementary techniques. Although data models do not tell us the whole truth, profiling the database does not provide the whole truth either. In fact, both may be misleading. However, used together, they can provide better insight into the data.


As both a data modeler and a data quality analyst helping knowledge workers understand their data, I read with great interest the Why Data Models Cannot Work article in the February edition of TDAN.com and the subsequent Real Men Don’t Read Instructions  in the April edition. The Why Data Models Cannot Work article highlighted the shortcomings of data models and promoted the examination of the data itself via techniques such as data profiling to obtain knowledge management. The Real Men Don’t Read Instructions article argued that “data models can and should be used for knowledge management. … Knowledge workers should verify and, if need be, augment models by other tools such as data profiling.”

Although data models do not tell us the whole truth, profiling the database does not provide the whole truth either. In fact, both may be misleading. However, used together, they can provide better insight into the data.

This article discusses four situations that may cause the data profiling results, as well as the data model, to be misleading.


Misleading Results – Data Models and Data Profiling

  1. Life Cycle Stages of an Entity

    At the point in time that a record is created in the database to capture a new instance of an entity, certain information may not be known even though it exists in the real world. For example, in the prior articles, the “formal” Financial Instrument Description might not be known at the time of initial data entry. The business process rules might allow a Financial Instrument to be entered into the database without a Financial Instrument Description but require one when the Financial Instrument is actually issued.


    Therefore, even though in the real world a Financial Instrument might always have a Description:

    • If the data model were purely logical, that attribute would be marked as mandatory under the belief that a “valid” Financial Instrument would always have a Description.

    • If the data model were being used to create the actual database, that attribute would have to be marked as optional.

    • Data profiling results would show that some records contain Financial Description and others don’t.

    Note that the design of the data entry screens and the data entry procedures could affect the data profiling results. The data element could be made mandatory in the database and the data entry personnel instructed to enter “unknown” when the information is not available at the time of initial data entry. Conversely, the data element could have been made optional in the database. In the design of the data entry screens/update processes, there could be a validation rule to check that Financial Instrument Description must be populated (and not with default values) when an instance of a Financial Instrument reaches the issuance stage in its life cycle.

    Frequently, the designer decides to implement optionality rules in the design of the front-end systems rather in the database definition itself. Often, a column in a table will be defined as optional, regardless of its true optionality.

    Data profiling can offer valuable hints to these life cycle stage dependencies. For example, in a data model for the insurance industry that I was reviewing, there was a straightforward definition of Issuance Date, implying that it was a historical date. Data profiling showed, however, that some dates were in the future. Investigation determined that expected issuance dates were being entered. The initial data design did not distinguish between actual and expected dates; the names and definitions in the model did not capture that future date concept either. Using the two in combination – data model and data profiling – improved the overall understanding and allowed the reports and business metrics to be calculated and interpreted correctly.

  2. Multiple Sources that Populate the Database

    The data might be populated from more than one source. Information that exists in the real world may not be provided in one of the feeds due to contractual constraints or data sensitivity/privacy concerns.

    Therefore, the data model could show a data element as optional and data profiling results could show that some records contain it and others don’t, despite it always existing in the real world.

  3. Population Segmentation

    The optionality might be a function of other data elements. For example, in the insurance industry, the optionality of a data element might depend on the specific company, state and product line for which the Financial Instrument was being issued. Those rules are not easily captured in a data model and might be too voluminous and too dynamic to document via data profiling.

  4. Data Quality

    In the real world, the information might always exist and even be available at the time of data entry. However, the data entry person might enter it inaccurately using short cuts to minimize keystrokes or might not bother to enter it at all. That is the focus of data governance – the well known disconnect between the data gatherer and the knowledge worker.

    Also, it may be a matter of data entry training and the design of the user interface. There may be certain business rules as to the conditions (for example, the specific combinations of company, state and product line) under which a field should be populated. However, unless the rules are enforced within the system, the data entry person could become confused and enter a value in a field when it should not be populated or conversely not enter a value when it should be populated.

    Data profiling documents how a field is actually populated rather than how it was intended to be populated.

Conclusion

As illustrated above, propositions of real-world relationships based on the model and even based on the data population itself can be misleading. That is not to say that data models and data profiling are not useful. Both are useful and provide complementary information. In fact, when the propositions mined from each appear to disagree, it indicates that there is an opportunity to clarify a misunderstanding. That investigation can lead to a better understanding of the data as it exists in the real world and as it is held in the database.

Go to Current Issue | Go to Issue Archive

Michael Smilg, CDMP, MBA - Michael is a systems consultant at a leading insurance company and has worked in information technology for more than 30 years in multiple roles, including systems consultant, data administrator architect, database administrator, systems analyst, programmer and manager. His current areas of interest are information strategy, data quality, data governance, enterprise data management and metadata management. He can be reached at msmilg@sbcglobal.net.