Achieving Enterprise Data Quality
Published: September 1, 1998
Introduction

The role of information in creating competitive advantage for businesses and other enterprises has been well documented and is now a business axiom: whoever controls critical information can leverage that knowledge for profitability. The difficulties associated with dealing with the mountains of data produced in businesses brought about the concept of information architecture, which has spawned projects such as Operational Data Stores (ODS), Data Warehousing and Data Marts. Along with these came a set of complementary technologies that help companies collect, massage, process, analyze and deliver useful information from this mass of raw, unconnected data. The growth of Data Warehousing into a $6 billion market demonstrates the degree to which organizations have taken a proactive role in managing their data.

Enterprise Data Quality Management

After some years of attempting to deal with the issue of data quality, a new discipline has emerged within information architecture development to address the need for appropriately managing data quality. This discipline, known as Enterprise Data Quality Management (EDQM), is intended to ensure the accuracy, timeliness, relevance and consistency of data throughout an organization, or across multiple business units within an organization, and therefore to ensure that decisions are made on consistent and accurate information.

Clean, useful and accurate data translate directly to the bottom line for most companies. They represent the added revenues that are realized when businesses correctly model and track their customer relationships and product or service preferences. With reliable data, a major credit company was able to assign risk assessments for loans by reading free-format text describing automobile year, make and model. Within weeks of implementation, 27 million records were processed and the company was able to offer a new product line to its customers.
Similarly, an insurance company was able to cleanse and standardize the names and addresses from its customer information files, resulting in a 62% reduction in names and an 80% reduction in addresses once duplicates were eliminated. This translated into huge savings in processing time, storage and mailing costs, into greater confidence among users in their own data, analysis and conclusions and, most importantly, into a lower cost of contacting customers and managing ongoing customer relationships.

Clearly, information is of value only if it is accurate. In today's more complex information technology environment, where internal and external data are blended together in data warehouses and more advanced OLAP (on-line analytical processing) applications, new technology processes are required to ensure the accuracy of information. Today, more than ever, it is imperative to tackle the data quality issue through prevention as well as by cleansing existing data stores. While many organizations realize the dollar value in clean data, most are still leaving money on the table.

The Challenge - Where and How to Begin

According to the Gartner Group, "Most information reengineering initiatives will fail due to a lack of data quality". Just as Total Quality Management (TQM) required a level of pain within an organization to take hold in the manufacturing sector, permanent data quality management occurs only when companies feel sufficient pain from poor data quality to be willing to build new practices that solve the problems on an enterprise basis. Projects to cleanse data are often created when a "crisis" occurs and a key project is in danger of failure. Unfortunately, many of these special projects are neither permanent nor consistent across the organization, and may consume several months of effort and hundreds of thousands of dollars on solutions that do not last.
Traditionally, data reengineering projects lacked the key factor for success in enterprise data quality management projects: a set of consistent technology processes that institutionalize data quality as a strategic asset, and business processes that make it a consistent competitive advantage.

Effective EDQM approaches can significantly lower the costs of data-cleansing. In a recent article, Larry English, an international expert on data-cleansing processes and the "Data-cleansing" feature writer for DM Review magazine, succinctly outlined the costs of handling data errors, anomalies and inconsistencies at three separate points within the information technology infrastructure of an organization. "The costs of data quality impacts organizations when existing systems fail to provide the data in the format necessary to profitably conduct business and results in scrap and rework remedies. Additionally, costs occur during assessment or the inspection phase of the process. Lastly, are the costs associated with prevention." He summarizes the need for EDQM in that article with a question: "If the data were correct at the source and in an enterprise-defined format, would we need to spend so much on data clean-up?"

Developing programs to convert data from one format to another is not difficult. Designing processes to clean and standardize data on an enterprise-wide scale, including data values that may not be obvious, presents a greater challenge. Fortunately, today's new generation of data management solutions provides data reengineering and process tools along with conversion programs, to assist companies in implementing EDQM programs.

Why is EDQM valuable and what problems does it solve?

Perhaps the best way to illustrate the value of data quality is to review the roots of bad data, and some of the ways corrupt data can impact an organization.

Mistakes: The origin of mistakes in data is the simplest of problems to understand.
These include misspellings, typographical errors, out-of-range values and incorrect data types. While typographical errors are difficult to correct, validation routines within applications typically handle out-of-range values and incorrect data types. Examples are a 13 in a field for month, or an alphabetical character in a numeric field such as interest rate.

Homonyms: The English language contains many words and abbreviations with identical spellings that have multiple, and often unrelated or conflicting, meanings, and it relies on the context of usage to determine the correct meaning. Improper interpretation of the context in which a homonym is used can have a significant impact on data accuracy. For instance, the contextual use of St. in the example below illustrates how context-sensitive processing is built into our language, and that proper interpretation of data requires recognition of the format and condition in which words and abbreviations are used.
Catherine B. St. James, MD.

Lack of Standards: When data entry responsibilities are spread among different people and business units, variations are bound to arise, as in the example below from information gathered for inventory purposes. Within fields as simple as Product and Location, data may be represented in several different ways:
Legal Entities: In many instances, the addition or subtraction of naming conventions may alter the actual legal definition of a document. Many banking and financial institutions require complex naming conventions that are unrecognizable by most applications, but that must remain intact in order to protect the legal purity of the document.

Missing/Invisible Data: Often the data that is present contains the proper structure and values, and may in fact appear to be correct, while data that has inadvertently been omitted causes identification and linkage mechanisms to unknowingly "grow" a mountain of poor-quality data. This problem usually occurs without an organization's knowledge. For instance, "35 Avenue of the Americas" is syntactically correct. What goes undetected are the thousands of apartments, suites and mail stops within the same address. Similarly, the name "Leslie Brown" is correct, but without a title from which gender can be derived, matching is accomplished with a lesser degree of certainty.

Phantom Data: In many applications, phony data (e.g., the date 99/99/99) may be used to flag a record or to signify that there is no valid data for a particular field. Equally perplexing, the flag inserted into a field may have nothing to do with the data in that field; for instance, a phantom date may serve as an indicator that the record in question is no longer valid.

To be effective, EDQM as a process must also meet a number of technical challenges. The process must work across multiple platforms and information architectures, must be adaptable and capture knowledge from an organization, and must not scare away users by being difficult to use. When all of these challenges are met, EDQM can be leveraged into an Enterprise "Business Intelligence" Asset. Some of the critical technical challenges are:
A data-cleansing tool with these three technical capabilities will facilitate deployment and consistent utilization of EDQM techniques throughout an organization. Once processes and procedures to ensure data quality are in place, the organization can begin to leverage its data resources into a "business intelligence" asset.

Data-cleansing - An Emerging Field

Once data warehousing architects and practitioners discovered the need for data quality, the question became: how to achieve it? Initially, data reengineering consisted of manually written code interposed between the data extraction and the data loading phases of the Data Warehousing implementation. Each project had specific needs, tailored to specific target and legacy data structures and context, and therefore each project required custom-built edits to achieve the data quality required for the warehouse. Data-cleansing has grown from this editing process in the early days of information systems through a series of first- and second-generation tools to help manage data quality.

Proactive data quality initiatives start at the data entry phase. Data entry validation is the first line of defense against bad data, with validation routines checking data ranges and ensuring that all required fields are filled during the data entry process. Validation checks are commonplace in many newer systems. Newer-generation solutions often contain more sophisticated conditional logic that may narrow the range of acceptable data based on entries in previous fields. Most importantly, solutions that enable organizations to develop data reengineering processes independent of particular projects, and to execute those processes either on-line at the point of entry or in batch mode within legacy systems, are closest to achieving enterprise-wide data quality management.
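The data entry checks described above (required fields, data types, ranges, and conditional logic that narrows acceptable values based on earlier fields) might be sketched as follows. This is a minimal illustration in Python under stated assumptions, not a description of any particular product: the field names, the `REQUIRED_FIELDS` tuple and the `validate_entry` function are all hypothetical, and the conditional rule shown (the valid range for day depends on the year and month already entered) is only one simple example of such logic.

```python
import calendar

# Hypothetical field names, chosen only for illustration.
REQUIRED_FIELDS = ("customer_name", "year", "month", "day", "interest_rate")


def field_text(record, name):
    """Return the field as stripped text; a missing or None field becomes ''."""
    value = record.get(name)
    return "" if value is None else str(value).strip()


def to_int(text):
    """Return text as an int, or None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


def validate_entry(record):
    """Return a list of error messages; an empty list means no errors were found."""
    errors = []

    # Required-field check: every required field must be filled in.
    for name in REQUIRED_FIELDS:
        if not field_text(record, name):
            errors.append(f"{name}: required field is empty")

    # Data type check: an alphabetical character in a numeric field is rejected.
    rate = field_text(record, "interest_rate")
    if rate:
        try:
            float(rate)
        except ValueError:
            errors.append("interest_rate: not a number")

    # Range check: a value such as 13 is not a valid month.
    month_text = field_text(record, "month")
    month = to_int(month_text)
    month_ok = month is not None and 1 <= month <= 12
    if month_text and not month_ok:
        errors.append("month: must be a whole number from 1 to 12")

    # Conditional check: the acceptable range for day depends on the
    # year and month entered in the previous fields.
    year = to_int(field_text(record, "year"))
    day_text = field_text(record, "day")
    day = to_int(day_text)
    if day_text and month_ok and year is not None and 1 <= year <= 9999:
        last_day = calendar.monthrange(year, month)[1]
        if day is None or not 1 <= day <= last_day:
            errors.append(f"day: must be a whole number from 1 to {last_day}")

    return errors
```

Calling `validate_entry` on a record with month `13` and interest rate `7.x` reports both problems, while a record dated February 30 passes the month check and fails only the conditional day check. Returning a list of messages rather than stopping at the first failure lets an entry screen show every problem at once.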
The advantages of checking data quality at the data entry stage are fairly obvious: mistakes are nipped in the bud while the information is still fresh, thereby avoiding downstream rework that is often performed by someone unfamiliar with the source data. Data entry validations, however, are not foolproof. Just as a word processing spell checker will not catch grammatical errors involving properly spelled words, data entry personnel can still enter incorrect codes in the right fields, in the correct format and range, and the error will go undetected. This is why the tools must be used in conjunction with enterprise standards, which allow certain accepted mechanisms for entering data and reject others. These standards must be implemented at an enterprise level in order to ensure that all departments involved in data entry (accounts receivable, order entry, sales) use the conventions consistently.

Need for an Enterprise Approach

Managing data quality throughout an organization requires an enterprise approach. Such an approach, which focuses on prevention and standards as well as error correction, can provide significant benefits to users, to information technologists and, most importantly, to the bottom line. Just as TQM focuses on the prevention of scrap and rework, EDQM focuses on ensuring the accuracy of data throughout the enterprise. EDQM requires changes to business processes and the development of standards that ensure data are entered and standardized in accordance with a set of rules that adapts to changes in business needs.

In a future article to be included on the TDAN web-site, I will discuss the benefits of a data quality approach as well as the differences between the first- and second-generation technologies available to help organizations achieve their goals.
Len Dubois - Len Dubois is the head of marketing for Trillium Software. Trillium is the leading provider of data cleansing and reengineering solutions for on-line and legacy system projects. This is his first authored article. The Trillium Software System has been architected to meet the needs of companies looking to implement enterprise data quality management solutions. You can obtain more information about the Trillium Software System on the company's website: www.trilliumsoftware.com.