Published in TDAN.com October 2003
Organizations are increasingly relying on complex and interrelated information systems to support business operations, interface with customers, and make decisions. Accordingly, the information
products and services provided by these systems should be of high quality. Unfortunately, organizations often encounter systems that fail to deliver on the promise of high-quality data.
For example, a 2001 study by PricewaterhouseCoopers in New York found that 75% of the 599 companies surveyed experienced financial pain from defective data. It is estimated that poor data
management is costing global businesses more than $1.4 billion per year in billing, accounting, and inventory snafus. One third of the companies surveyed say that "dirty data" forced them to
delay or scrap a new system. Only 37% of the companies surveyed were "very confident" in the quality of their own data, and only 15% were "very confident" in the quality of the data of their
trading partners. The study concluded that "poor data quality is threatening to undermine massive investment being made elsewhere", such as customer relationship management and supply chain
management systems (Betts, 2001).
As more organizations realize that high quality data is necessary for competitive advantage, data administrators are under increasing pressure to add data quality assurance to their list of job
responsibilities. Whether taking the lead or participating as a member of a data improvement initiative, many data administrators may find they need training in the following areas:
- Developing data quality goals.
- Developing, promulgating and maintaining data quality standards.
- Monitoring compliance with data quality assurance standards.
- Identifying areas for data improvement.
- Measuring data quality levels and reporting results to management.
- Training others in data quality standards and procedures.
In seeking guidance as to how to accomplish these tasks, data administrators will find that the traditional Total Quality Management literature applies mainly to manufacturing rather than
information delivery. To address this need, Stuart Madnick and Richard Wang began the Total Data Quality Management (TDQM) Research Program at MIT in the early 1990s (web.mit.edu/TDQM). The long-term objective of the program is to create a theory of data quality based on
reference disciplines such as computer science, statistics, accounting, total quality management and organizational behavior, thus serving as a unifying body of knowledge on data quality
management. The second goal is to serve as a center of excellence among practitioners of data quality techniques and to act as a clearinghouse for effective methods and project experiences (Madnick
& Wang, 1992). Through the annual International Conference on Information Quality (www.iqconference.org) sponsored by the TDQM Research Program at
MIT, academics and practitioners can come together to exchange ideas on how to improve data quality. These collaborations over the past decade have resulted in several important developments for
the data quality field.
Data Quality is a multidimensional concept
Before one can measure data quality, one must first define it. Conventional wisdom holds that data quality is something akin to accuracy or reliability. However, researchers in
data quality have found that data quality is multidimensional in nature. Wang & Strong (1996) used a two-stage survey and a two-phase sorting study to develop a hierarchical framework that
consolidates 118 data quality attributes collected from data consumers into fifteen dimensions, which in turn are grouped into four categories.
- Intrinsic Data Quality: This includes the dimensions of believability, accuracy, objectivity, and reputation. It implies that information has quality in its own right.
- Contextual Data Quality: This includes the dimensions of value-added, relevancy, timeliness, completeness, and appropriate amount of data. It highlights the requirement that information quality must be considered within the context of the task at hand.
- Representational Data Quality: This includes the dimensions of interpretability, ease of understanding, representational consistency, and concise representation. It addresses the way the computer system stores and presents information.
- Accessibility Data Quality: This includes the dimensions of accessibility and access security. It emphasizes that the computer system must be accessible but secure.
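The four categories and fifteen dimensions listed above can be written down as a simple lookup table, which may be convenient when survey results or metrics need to be rolled up by category. The following is a minimal Python sketch; the names `DQ_FRAMEWORK`, `category_of`, and `dimension_count` are illustrative assumptions and do not come from Wang & Strong (1996).

```python
# Illustrative encoding of the categories and dimensions listed above.
# The variable and function names are assumptions made for this sketch.
DQ_FRAMEWORK = {
    "Intrinsic": ["believability", "accuracy", "objectivity", "reputation"],
    "Contextual": ["value-added", "relevancy", "timeliness", "completeness",
                   "appropriate amount of data"],
    "Representational": ["interpretability", "ease of understanding",
                         "representational consistency",
                         "concise representation"],
    "Accessibility": ["accessibility", "access security"],
}


def category_of(dimension):
    """Return the category a dimension belongs to, or None if it is unknown."""
    for category, dimensions in DQ_FRAMEWORK.items():
        if dimension in dimensions:
            return category
    return None


def dimension_count():
    """Return the total number of dimensions across all categories."""
    return sum(len(dimensions) for dimensions in DQ_FRAMEWORK.values())
```

For example, `category_of("timeliness")` returns `"Contextual"`, and `dimension_count()` returns 15, matching the count reported in the text.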
Although the many dimensions associated with data quality have now been identified, one still faces the difficulty of obtaining rigorous definitions for each dimension so that measurements may be
taken and compared over time. Work by Wand and Wang (1996) and Pipino et al. (2002) has addressed this issue by developing precise definitions and formulas for calculating data quality metrics.
New research continues to further clarify how best to define and measure the different dimensions associated with data quality.
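As an illustration of what such a formula can look like, one of the forms discussed by Pipino et al. (2002) is a simple ratio: one minus the number of undesirable outcomes divided by the total number of outcomes. The sketch below applies that ratio to completeness. The function names and the rule that `None` or an empty string counts as "missing" are assumptions made for this example, not definitions taken from the paper.

```python
def simple_ratio(total, undesirable):
    """Return 1 - (undesirable outcomes / total outcomes)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= undesirable <= total:
        raise ValueError("undesirable must be between 0 and total")
    return 1 - undesirable / total


def completeness(records, field):
    """Completeness of one field across a list of dict records.

    Assumption: a value of None or an empty string counts as missing.
    """
    missing = sum(1 for record in records if record.get(field) in (None, ""))
    return simple_ratio(len(records), missing)


# Hypothetical sample data: two of the four phone numbers are missing.
sample = [
    {"phone": "555-0100"},
    {"phone": ""},
    {"phone": None},
    {"phone": "555-0199"},
]
phone_completeness = completeness(sample, "phone")  # 0.5
```

Because the result always falls between 0 and 1, the same function can be run at regular intervals and the values compared over time.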
Manage information as a product
The conceptualization of information as a product provides an important departure from the conventional view of information as a system by-product. The information product approach requires that
organizations focus first on whether or not the right data products (e.g., invoices, orders, reports, and records) are being produced by a system. Wang et al. (1998) identified four principles
that an organization must follow in order to treat information as a product.
- Understand consumers' information needs.
- Manage information as the product of a well-defined production process.
- Manage information as a product with a life cycle.
- Appoint an information product manager to manage the information processes and resulting product.
This idea of managing information as a product leads to a methodology for total data quality management that follows this cycle of tasks (Wang, 1998).
- Define the information product (IP): This means defining the characteristics of an information product in terms of its functionalities for data consumers as well as its basic units and
components and their relationships. It also means defining the requirements of the information product from the perspectives of IP suppliers, manufacturers and managers and identifying the
information manufacturing system that produces the IP.
- Measure the information product: By tracking information metrics based on the definition of the information product, one can monitor the quality of the IP over time.
- Analyze the information product: From the measurement results, one can investigate the root causes for existing data quality problems.
- Improve the information product: Once the analysis phase is complete, work can begin on eliminating the root causes for data quality problems to produce a better quality information product.
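The define, measure, and analyze steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "invoice" product, its required fields, and the function names are hypothetical, and a real definition of an information product would cover far more than a list of required fields.

```python
from collections import Counter

# Define: a hypothetical specification of an "invoice" information product.
REQUIRED_FIELDS = ["invoice_id", "customer", "amount"]


def defects(record):
    """Return the required fields that are missing or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def measure(records):
    """Measure: share of records with no defects, from 0.0 to 1.0."""
    if not records:
        raise ValueError("no records to measure")
    clean = sum(1 for record in records if not defects(record))
    return clean / len(records)


def analyze(records):
    """Analyze: count defects per field to show where to look for root causes."""
    return Counter(f for record in records for f in defects(record))


# Hypothetical batch of invoices.
batch = [
    {"invoice_id": "A1", "customer": "Acme", "amount": 120.0},
    {"invoice_id": "A2", "customer": "", "amount": 75.5},
    {"invoice_id": "A3", "customer": None, "amount": None},
    {"invoice_id": "A4", "customer": "Globex", "amount": 10.0},
]
quality = measure(batch)                                    # 0.5
worst_field, worst_count = analyze(batch).most_common(1)[0]  # ("customer", 2)
```

The improve step is an organizational action rather than a function call: once the process that leaves `customer` blank has been corrected, running `measure` on a later batch shows whether the change had the intended effect.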
Application of quality tools and principles to information products
The paradigm of treating information as a product has led data quality researchers and practitioners to develop a new set of tools and methods for measuring data quality, modeling data quality,
improving data quality, and instituting data quality principles. Here is a summary of some of the tools and methods now available to help data administrators address data quality issues in their organizations.
Data Clean Up Tools: A quick search of the Internet reveals that a number of companies now offer consulting services and software tools for helping organizations cleanse
their data of duplicates, missing values, and invalid entries. Firstlogic (www.firstlogic.com), Vality (now known as Ascential Software at www.vality.com), and Evoke Software (www.evokesoftware.com) are just a few examples. In addition, the U.S. Postal Service
offers help for companies trying to improve address quality through its web site (www.usps.com/ncsc).
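To make the three kinds of defect concrete, here is a minimal Python sketch that separates duplicates, missing values, and invalid entries in a list of customer records. It does not represent how any of the commercial tools named above work. The field names, the function names, and the ZIP code pattern (five digits with an optional four-digit extension) are assumptions chosen for the example.

```python
import re

# Assumed format for this sketch: US ZIP or ZIP+4.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def normalize(record):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def cleanse(records, key_fields, required_fields, zip_field="zip"):
    """Split records into (clean, rejected); rejected items carry a reason."""
    seen = set()
    clean, rejected = [], []
    for raw in records:
        rec = normalize(raw)
        # Case-insensitive key used to detect duplicates of accepted records.
        key = tuple(str(rec.get(f, "")).lower() for f in key_fields)
        if key in seen:
            rejected.append((rec, "duplicate"))
        elif any(rec.get(f) in (None, "") for f in required_fields):
            rejected.append((rec, "missing value"))
        elif not ZIP_PATTERN.match(str(rec.get(zip_field, ""))):
            rejected.append((rec, "invalid zip"))
        else:
            seen.add(key)
            clean.append(rec)
    return clean, rejected
```

A call such as `cleanse(customers, key_fields=["name", "zip"], required_fields=["name", "zip"])` returns the accepted records together with a list of rejected records and the reason for each rejection, which can then be routed back to the people who own the source data.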
IQ Survey and Analysis Tools available from Cambridge Research Associates: Cambridge Research Associates (www.crg2.com) has several software tools to aid organizations in
assessing and analyzing data quality. The Integrity Analyzer™ is a survey tool that assesses organizational information quality levels, organizational readiness for information quality
initiatives, and organizational knowledge of information quality. The Integrity Analyzer software tool is designed for rigorous data integrity analysis including entity integrity, referential
integrity, column integrity, and user-defined integrity. The Integrity Analyzer also provides various useful functions such as frequency checks to further analyze data files for their conformance
to business requirements.
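The integrity categories named above follow standard relational database terminology, and each can be expressed as a short check. The Python sketch below is not the Integrity Analyzer's implementation; the function names and the choice to represent rows as dictionaries are assumptions made for illustration.

```python
from collections import Counter


def entity_integrity(rows, key):
    """Return rows whose key is missing or repeats an earlier key."""
    seen, violations = set(), []
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            violations.append(row)
        else:
            seen.add(value)
    return violations


def referential_integrity(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key matches no parent key."""
    parent_keys = {row[pk] for row in parent_rows if row.get(pk) is not None}
    return [row for row in child_rows if row.get(fk) not in parent_keys]


def column_integrity(rows, column, allowed):
    """Return rows whose column value falls outside the allowed domain."""
    return [row for row in rows if row.get(column) not in allowed]


def frequency_check(rows, column):
    """Count how often each value occurs in a column."""
    return Counter(row.get(column) for row in rows)
```

A user-defined integrity rule can be handled the same way, by filtering rows with whatever business-specific condition applies. Each function returns the offending rows rather than a pass or fail flag, so the results can feed directly into a root cause investigation.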
IP-Maps: An IP-Map is a graphical tool designed to help people describe and evaluate how an information product is assembled (Shankaranarayan et al., 2000). The IP-Map
is based on data flow diagrams, but includes additional symbols and metadata for capturing the details associated with the manufacture of an information product. The IP-Map is designed to
facilitate the visualization of the important phases of the data manufacturing process, to pinpoint bottlenecks, to identify ownership, and to resolve issues concerning the quality of the information product.
Root Cause Analysis of Data Quality Problems: Several works have been published in recent years to help organizations uncover the root causes behind poor quality data and
to suggest better ways to improve the systems that collect, process, and disseminate data. One of the best examples of this type of research is work done by Strong et al. (1997) that identified 10
key conditions that often lead to data quality problems. They are the following:
- Multiple Data Sources
- Subjective Judgment and Techniques in Data Production
- Bypassing Input Rules and Too Strict Input Rules
- Large Volumes of Data
- Distributed Heterogeneous Systems
- Complex Data Representations such as Text and Image
- Coded Data From Different Functional Areas
- Changing Data Needs from Information Consumers
- Security-Accessibility Tradeoff
- Limited Computing Resources
Other sources of information:
For data administrators hoping to learn more about data quality, there are several excellent resources. The web site at MIT (http://mitiq.mit.edu) provides the latest information on data quality
research. The Information Integrity Coalition is a not-for-profit organization that promotes the awareness and understanding of Information
Integrity. In addition, there are several good books on data quality. These include Data Quality for the Information Age and Data Quality: The Field Guide by Tom Redman (www.dataqualitysolutions.com), Improving Data Warehouse and Business Information Quality by Larry English (www.infoimpact.com), Data Quality: The Accuracy Dimension by Jack E. Olson (www.evokesoftware.com), Enterprise Knowledge Management by David Loshin (www.knowledge-integrity.com)
and Quality Information and Knowledge by Huang, Lee, and Wang (Prentice Hall, 1999).
Betts, M. Data quality should be a boardroom issue.
Betts, M. Dirty Data.
Madnick, S. and Wang, R. Introduction to the TDQM Research Program. TDQM Working Paper Series. MIT: TDQM-92-01 (May 1992).
Pipino, L.; Lee, Y. W.; and Wang, R. Y. Data Quality Assessment. Communications of the ACM. 45, 4 (April 2002), 211-218.
Shankaranarayan, G.; Wang, R. Y.; and Ziad, M. Modeling the Manufacture of an Information
Product with IP-MAP. Proceedings of Conference on Information Quality.
Massachusetts Institute of Technology, (2000), 1-16.
Strong, D. M.; Lee, Y. W.; and Wang, R. Y. 10 Potholes in the Road to Information Quality.
Computer. 30, 8 (August 1997), 38-46.
Wang, R. Y. A Product Perspective on Total Data Quality Management. Communications of the ACM.
41, 2 (February 1998), 58-65.
Wand, Y. and Wang, R. Y. Anchoring Data Quality Dimensions in Ontological Foundations.
Communications of the ACM. 39, 11 (November 1996), 86-95.
Wang, R. Y. and Strong, D. Beyond Accuracy: What Data Quality Means to Data Consumers.
Journal of Management Information Systems, 12, 4 (Spring 1996), 5-34.
Wang, R. Y.; Lee, Y. W.; Pipino, L.; and Strong, D. M. Manage Your Information as a Product.
Sloan Management Review. 39, 4 (Summer 1998), 95-105.