business intelligence resources

TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Subscribe to TDAN

TDWI World Conference

Business Intelligence Resources

business intelligence resources

TDAN.com - The Data Administration Newsletter

TDAN.com - The Data Administration Newsletter

   > home > newsletter > article
Setting the Record Straight: Introduction to Data Rules

by Joseph Hudicka
Published: July 1, 2003

Published in TDAN.com July 2003

 

DATA EXCHANGE

Data exchange is simply the transfer of data from one system to another. It is usually called “data migration,” but that term limits the scope of data movement efforts. While data migration infers that data will move from a source system to a target system, data exchange includes bi-directional data movement in addition to unilateral movement between two systems. This paper will address the many issues associated with data exchange. Specific topics include:

The Reasons for Performing Data Exchange. This section further discusses why data exchange is a much larger effort than data migration.

  • Data Exchange Problems. This section identifies problems that the IT industry has recognized with respect to data migration. It then identifies problems that the IT industry has not recognized.
  • Solving Today’s Problems. This section fully describes industry-recognized problems, shows how these problems have not been totally resolved, and makes suggestions for solving them.
  • Solving Tomorrow’s Problems. This section fully describes the problems that industry has failed to recognize and describes how to address them.

At first glance, data exchange sounds rather simple, right? Not so fast. The issues involved with data exchange must not be underestimated. A precise, disciplined approach is required to overcome these formidable, technical and non-technical issues. Our approach describes how to do this. Read on.

DATA EXCHANGE-WHY DO IT

Data exchange is almost always a subset of a much larger project. When this is the case, it must be accomplished in parallel with that other project. Data exchange can be a major component of one of six different system initiatives:

  1. Implementation of a Business Process Re-engineering (BPR) project
  2. Implementation of Electronic Data Interfacing (EDI)
  3. The result of a merger or acquisition
  4. Implementation of a decision support (data storage) project
  5. Implementation of a data farming project
  6. Implementation of an information exchange project

BPRS AND MERGERS/ACQUISITIONS INITIATIVES

A BPR might involve the replacement of one or more legacy systems for the purpose of increasing functionality. During a merger or acquisition, organizations often need to combine the data from multiple, disparate systems. These two initiatives focus on moving data from one place to another, most often unilaterally, although sometimes a best-of-breed approach requires interfacing data between unrelated systems (EDI). These types of projects make up a large portion of today’s data exchange market. These initiatives usually involve the implementation of one or more better systems; therefore, if successful, these data exchange initiatives will reduce the cost of doing business because they result in increasing efficiency.

INFORMATION INITIATIVES

The remaining three initiatives focus on a new concept: turning data into information, and then turning that information into knowledge. Organizations can use this knowledge to expand their business in a cost-effective manner. For example, the knowledge provided by the data in decision support and data farming projects might involve static data that will be analyzed or referenced at some future time:

  • An organization uses a decision support system to examine its own data. That data is then used to expose existing clients to new products based on what their clients buy and when they are mostly likely to make purchases (throughout the year, or mostly in December?).
  • An organization uses the knowledge gleaned from a data farming project to find new clients who, based on previous spending habits, might be interested in their products.

Information exchange initiatives focus on the sharing of information between two or more, often competing organizations, i.e., Company A’s data will be periodically shared with Company B; likewise, Company B’s data will be periodically shared with Company A. Practically speaking, an auto insurance company might want to partner with a life insurance company. It is assumed that people who buy one type of insurance are more likely to buy another type of insurance. The data owned by these similar businesses compliment each other; their data is used to improve the competitive advantage of both businesses.

COMMONALTIES OF THE SIX INITIATIVES

All of these initiatives can reduce the cost of doing business by increasing efficiency, whether it be by upgrading existing systems, or by helping an organization target customers who are more likely than the general public to buy their product(s).

DATA EXCHANGE PROBLEMS

Data exchange problems abound. Some have been acknowledged by the industry, although they have not been sufficiently corrected, while others have escaped notice completely. Let us summarize both.

TODAY’S RECOGNIZED PROBLEMS

We have identified six distinct data exchange problems that are recognized by the industry. Although these have all been acknowledged to some degree, they have not been adequately addressed. Left unresolved, these problems can sink the data exchange project along with the main project:

  1. Legacy systems lack data integrity, making it difficult to translate data to a more restrictive architecture
  2. The theoretical design differences between legacy and relational systems are numerous and complex
  3. The personnel assigned to the task lack enthusiasm
  4. Planning for the task is enormously insufficient
  5. The personnel assigned to the task are often not the right people for the task
  6. The performance of the task-the amount of time it will take-is not considered

TOMORROW’S UNRECOGNIZED PROBLEMS

We have identified three distinct data exchange problems that the industry has failed to recognize. If acknowledged and properly addressed, these problems can turn into solutions that will greatly increase the efficiency of the data exchange project:

  1. The importance of process mapping is downplayed or ignored
  2. The idea of data patterning-fi nding patterns in the legacy data-is virtually unknown
  3. The importance of instituting validation metrics is overlooked and rarely employed

HOW WE FIX TODAY’S RECOGNIZED PROBLEMS

This section will detail the recognized problems of today. It will further explore industry’s solutions for them. To recap, recognized problems include:

  1. Lack of data integrity in legacy systems
  2. Complex design differences between legacy and relational systems
  3. Personnel’s lack of enthusiasm
  4. Enormously insufficient planning
  5. Improper personnel assignment
  6. Negligence of performance issues

LACK OF DATA INTEGRITY

One would think that any two systems that maintain the same sort of data must perform very similar tasks; information from one system should map to the other with ease. This is rarely true. Legacy systems are typically quite lenient with respect to enforcing data integrity; i.e., fields that should be populated from a list of valid values require that the user enter a value, but seldom is the value validated.

For example, a field for STATE ordinarily requires two characters. Ideally, when a user enters a value in this field, the system should check that the two characters form the abbreviation for one of the 50 US states. However, if the system does not validate this field, there may be values entered that do not correspond to any state. After all, we all make typos. Also, not everyone knows all the two-digit state abbreviations, which means that a user might enter an invalid state code, thinking that it is the right code.

Furthermore, legacy systems are infamous for containing de-normalized and/or redundant data, which does not make for an easy transition to relational or object oriented architectures. Legacy systems were designed around a logical procedure that accepts data, processes it, and outputs data. Newer systems look at the data as objects that need to be manipulated, rather than looking at the logic required to manipulate the data. These architectures were designed to ensure that any given bit of data appears only once in the system. To find information, relational architectures join two structures in a dependant fashion, while object oriented architectures have one “main” structure to which other structures are joined.

DESIGN DIFFERENCES

The many complex, theoretical design differences between legacy systems and the systems of today can frustrate those involved in a data exchange project. Traditionally, performance was considered very important, and designers would seek to enhance performance even if it would limit flexibility. Newer architectures, like relational and object-oriented systems, inherently eliminate many performance concerns. These newer systems combine performance with flexibility.

ATTITUDES OF PERSONNEL

Another issue of data exchange projects involves the attitudes of the people assigned to the task. Most technical people view data exchange as unglamorous. They also believe that they will not learn anything new that can further their careers (with the exception of learning about tools used in data exchange). On the surface, programmers think that they will simply be writing a bunch of complex insert-update-delete scripts. Our new approach to the tasks of data exchange shows the people involved that they will learn a lot, and that the work can be intriguing.

PLANNING

People simply do not understand the intricacies of data exchange-that is, until they have braved a number of data exchange projects. Because of this lack of understanding, the planning of data exchanges often become a one-line task on the main project plan, which downplays its complexity. The first pass at data exchange reveals its complexity, and by then, it may be too late. Much time has passed, and the users are clamoring for the new system. A lack of proper planning can cause the entire project to fail.

Most data exchange plans start with a rushed analysis, followed by designing the data exchange routines, data mapping, and codewriting, followed by testing. Then the project explodes. Next, it’s back to more analysis, mapping, coding, and testing. It becomes a spiral, the end of which cannot be seen.

PERSONNEL ASSIGNMENT

Project leaders fail to dedicate adequate access to appropriate personnel. The staff members who are most familiar with the data must be involved in its exchange, yet these people are usually assigned to more high-profile tasks, such as designing and gathering requirements for the main system. Project leaders think that they can push all of the data exchange tasks to either lower-level staff members, or to contractors who are unfamiliar with the data. This usually is a mistake-unless process mapping and data patterning are used (these concepts will be explored later in this paper).

If staffs are unfamiliar with the source system are assigned to the data exchange, they need to perform extensive process mapping in order to learn the processes. Spending a lot of time on mapping data to data will prove to be time wasted. It cannot be stressed enough! Key personnel must be identified and assigned to the data exchange project at the earliest phases of the main project. It must be very clear that these key players must dedicate a specified number of hours per week to this task. The staff members who have been supporting the source system are the ones who should be working on the data exchange. To make this work, staff members must present the results of the various mapping deliverables to management. This fosters ownership of the project and brings recognition to those working on the data exchange.

PERFORMANCE ISSUES

Let’s face it-data exchange takes time. Systems were never designed to push large, historical volumes of data in a short window of time. In many cases, it took systems years to generate the volumes of data that you now wish to transform and move to another environment, all in less than one day, in most cases. Knowing how much time and planning for it is important, especially if you want to migrate a large 24x7 system that cannot afford any significant downtime.

INDUSTRY’S RESPONSE TO TODAY’S PROBLEMS

The most recognized problem in data exchange is writing the code, and, true to form, the technical industry has turned to a technical solution for this problem. ETL (extract, transform, load) tools look good because they are good for iterative code generation. However, they are not the panacea that many had hoped they would be.

ETL DRAWBACKS

The major drawback to ETL tools is that they don’t sufficiently address analysis and performance issues. These two issues absolutely must not be minimized. By their very nature, ETL tools have difficulty supporting these issues because they are technology-independent. They are designed this way for two reasons: (1) It is simply not cost-effective for ETL designers to continually redesign their product to keep up with the latest technology. (2) Basing the ETLs on a given technology would reduce their flexibility. However, being technology-independent means that ETL tools cannot leverage the performance savings features inherent in specific platforms. These code-generators offer a framework that forces organizations to develop custom code beyond the reach of the generated code.

DATA PARTITIONING

To improve performance, a method called partitioning is often employed. For example, in the retail industry, much of their historic data is static, i.e., old orders. To use partitioning, we pick a point in time and declare that all data assembled before that time is historical, and any data compiled after that point in time is dynamic. To meet the project launch date, a copy of the historical data is moved to the target system before it has to go live. Dynamic data, which will be the newest data, is moved right before the new system comes online. (A delta strategy will be employed to account for the chance that some parts of the historical data did change.) ETL tools do not support partitioning inherently; it’s up to the migration architect to plan for it (if one is assigned to the project, that is).

SUMMARY

ETL tools are not designed to take analysis results into consideration. A thorough analysis forms the cornerstone of a successful data exchange project; you must apply everything you learn from your analysis to ensure success. A key component of the analysis phase will be the definition and application of validation metrics (described later), which is another item unsupported by ETL tools. Our method does use ETL tools, but it takes these shortcomings into account and compensates for them.

HOW WE FIX TOMORROW’S UNRECOGNIZED PROBLEMS

This section will discuss the unrecognized problems of today and explore how we propose to resolve them. The solutions to the unrecognized problems include:

  • Process Mapping
  • Data Patterning
  • Validation Metrics

As you read this section, you might think that these solutions will take too much valuable time away from the task at hand. Try to remember the times that you spent extra time at the start of a task, and how it paid off during the execution of the task, especially near the end.

PROCESS MAPPING

Process mapping gives the people involved a better understanding of how the data relates to itself and to the system(s) involved. This might best be described via an example. Let us say that we are a retail establishment that wants to move from a legacy system to a state-of-the-art system. Both systems provide support for promotions, but the source (legacy) system may only support promotions at the order level, whereas the target system supports promotions at the line item level. It is probable that the business itself supported promotions at the line item level in the past, but the legacy system was just not designed to capture this information. The line item-to-promotion relationship was lost. If not for process mapping, this knowledge would probably never be discovered until the data exchange occurred.

KNOWLEDGE MANAGEMENT VS. INFORMATION MANAGEMENT

Knowledge management has to do with how we use and interpret data for the purpose of improving how we do business. Information management has to do with how we access, manipulate, and supply the supporting data. Information management is actually a subset of knowledge management.

NEW APPROACH

Historically, few people looked at data collection systems as containing knowledge. This is a relatively new phenomenon. Despite this, it is not too late for organizations to think of it this way. We must be willing to break out our data into logical subsets called knowledge areas.

KNOWLEDGE AREAS

A knowledge area might best be described through an example. In the world of retail, knowledge areas might include Customer, Order, and Product. These areas do not necessarily correspond directly with specific departments in an organization. For instance, the Product knowledge area may contain data from the product department, the promotions department, and the marketing department; the knowledge area is a mix of these three departments. Each has its own context.

For each knowledge area, a team of area experts will be assembled. Area experts should include both technical and business experts from the departments involved. Having separate teams for each knowledge area will help spread out the workload. Each team can work on its own specialty.

EVENT

After defining each knowledge area team, you will start to document the high-level events that take place within each area. Back to our retail example, when a customer places an order, this is an event. Although both the source and target systems will support customers placing orders, they mostly likely will perform this event differently. Identifying the event is the first step in process mapping.

SCENARIO

After documenting all high-level events, you can write scenarios for each event that might encompass multiple events. By documenting these scenarios, and associating events, we can begin to articulate the true context of how we do business. From here, the data exchange team has a strong foundation to seriously review data.

PROCESS MAPPING

The biggest challenge in data exchange is translating existing processes to the new architecture. Back to the retail example, the old system might support back orders inefficiently, while the new system might not support them at all. This is a new business rule. As a result, we need rules to define how to handle back ordered items at the time of conversion. This arises as a result of transporting data to a new system. This rule is not inherent to the source or target system. Without understanding the context, we would not be able to detect or resolve issues like this one.

SUMMARY

Process mapping allows us to break processes down into knowledge areas. From there, we create teams and give them responsibility for leading the data mapping. Data mapping should be the end-product of this task, not the starting point. It might seem that data mapping should begin right after process mapping concludes, but a step called data patterning should be performed first. This leads us to the next section of this paper.

DATA PATTERNING

Data patterning gives us a jump on data mapping. The time spent on this step will translate into time saved during the analysis phase. As discussed earlier, a solid analysis increases the chances of the data exchange project being successful. For example, the experts claim that the data column containing a name always has a first name, a middle initial, and a last name. While conducting data patterning, we find that this is untrue. Sometimes there is no middle initial, and in some cases, the entire middle name is spelled out. Data patterning allows us to discover this early, when we still have time to challenge that rule and alter the code accordingly.

IDENTIFY ANOMALIES

In any industry, project leaders seek to identify and correct anomalies before they impact the project. Spending the time up-front, before developing code, will eventually save you more money and resources (money and manpower). Figure 1 depicts how the cost of data exchange increases when anomalies are not discovered until after the code has been developed. Let us explore this figure in detail. Whether you use the iterative approach or the analytical approach, the same number of data errors will exist, because the project is constant. The time axis also corresponds to the volume of data errors (i.e., everything above the optimal line means there are more errors that must be resolved, while everything below the optimal line means there are more errors that can be resolved, but it is not as critical for project success that they are resolved). The analytical approach spends more of its time in analysis resolving more errors than the iterative approach does-hence the analytical curve drops more rapidly, and intersects the optimal error rate curve sooner.

Given this situation, the team is afforded more time to make the exchange more error free than anticipated, whereas the iterative approach only aspires to achieve acceptance.

PROCESS MAPPING VS. DATA MAPPING

Data mapping looks only at the data, not at the context of the data. Looking at a row of data without the context makes it almost impossible to know why that data exists. Expert knowledge of data and processes is vital to data exchange success. When inexperienced staff members focus on the data without understanding the context from which it originates, they eventually find themselves struggling to understand the data leading to struggles to complete the project. Their approach isn’t entirely wrong; however, our approach allows inexperienced staff members to successfully perform a data exchange.

SYSTEM STRUCTURE DEFINITIONS

System structures explain the source and target data structures involved in the exchange. Defining system structures involves determining the following items:

  • Data Type. Is the data alpha only, alpha-numeric, numeric only?
  • Length. How long is the data string?
  • Precision. If the data is numeric, is there a decimal point? How many places are used?
  • Optionality. Is the data mandatory or optional?

Documenting this information gives us a baseline to begin our data patterning. We’ll know that a certain level of data integrity exists, i.e., a numeric column really is all numeric. In legacy systems, columns or data elements that are supposed to contain only alpha characters will contain other characters as well. In this case, further validation of these alpha values is only possible through customized code. Unfortunately, the source code that conducts this validation is typically not well documented, if it even exists. Therefore, we’ve found a way to extend the structure definition concept: through value sets and combination sets.

VALUE SETS

Value sets are made up of both character sets and reference sets. A character set identifies all valid characters for a given field. A reference set identifies the valid combinations of values for this field. For example, a column of data contains all abbreviations for the 50 states of the US. The character set would be the alpha characters a, c-i, k-p, r-t, v-z (in other words, all 26 letters of the alphabet except for b, j, q, u). The reference set would contain only valid combinations of the 24 valid characters, meaning that NJ is valid, but JN is not.

COMBINATION SETS

While value sets look at column patterning, combination sets look at row-level patterning. For instance, they will show us what a valid customer looks like (name, address, telephone number, fax number), what a valid order looks like (customer, ship-to address, bill-to address, terms of payment, item ordered, quantity), etc.

VALIDATION METRICS INTRODUCTION

Validation metrics allow us to measure the success of the data exchange before it actually takes place. Typically, no one really knows if the data exchange project has been a success or a failure until the exchange actually occurs. If the exchange project fails, there may not be enough time to find out why it failed, make corrections, and hope that it works the second time around. Validation metrics combine process mapping and data patterning. Process mapping is based at the row or transaction set level, i.e., an address on an order or a header on an order. Data patterning is based at the column level; it is used to help execute analysis. Validation metrics comprise a set of multi-faceted tasks that look to discover the number of errors that are associated with the data. These errors are then given an assignment based on severity. Validation metrics are percentage based and priority based.

PERCENTAGE BASED

This metrics tells us how many of the total records fall into the top few patterns. These patterns will guide us as to how many records upon which to concentrate. Uncovering the five top patterns is a good rule of thumb. First we need to establish an appropriate set of patterns from which to conduct an analysis. If the column of data was STATES, there can be only one valid pattern: XX. You will also find that data that is of a higher quality will produce fewer patterns. A long column of “good” data that represents titles might easily fit into two patterns only: alphas followed by a period, or only alphas, i.e., Dr., Miss, Ms., Mr., Mrs.

A better example for explaining this concept might be a column of data that contains last names of employees. We would assume that this would consist of a string of alpha characters with no spaces. Upon inspection, we find this to be true only 60 percent of the time. Ten percent of the data contains alphas plus spaces, i.e., Del Rosa, Johnson III. Another 10 percent contains alphas separated by a dash, i.e., Smith-Taylor. Another 10 percent contains alphas, spaces, and punctuation, i.e., Bronson Sr., Bronson Jr. The last 10 percent may contain a variety of combinations, i.e., a space followed by alphas, or numerics. In this example, we may be able to recognize four patterns that fi t 90 percent of all data.

Continuing the last name example, we find that 10 percent of the data falls outside of these four patterns and therefore must be analyzed independently. These metrics helps us estimate how much time we need to spend on this data and plan our time accordingly.

PRIORITY BASED

This tells us how many errors (exceptions to the top five patterns) exist. We then must decide how many errors we can tolerate and still consider this project successful. With this, we can identify which records need to be fixed right away, which can be fixed later on, and which can be forgotten about totally.

BENCHMARKS

Benchmarks are samples of production data. We select five records that match the top five most common records (patterns). We then manually transform these five records and show them to our developers. Our developers use this as an example as to what all records should look like after the routines are run, i.e., as target records (after transformation). By giving this to our developers, we eliminate a vast number of errors prior to the development phase of the project.

CONCLUSION

It should be obvious by now that there is a desperate need for a reliable, methodological approach with which organizations can tackle data exchange projects. Although it is impossible to avoid unpleasant surprises in data exchanges, you can handle them if you are properly prepared.

In this paper, we have uncovered the most significant obstacles to successful data exchange projects-process mapping for better understanding the context of the data to be mapped, and data patterning for rapidly identifying and resolving data errors.

By using our methodology to combine a devoted team, an early start, a clearly defined project plan, validation metrics to accurately evaluate progress and automated tools (where applicable), you will tremendously help your organization achieve its goal of a successful data exchange.

Go to Current Issue | Go to Issue Archive

Joseph Hudicka - Joseph is the founder of the Information Architecture Team, an organization which specializes in data quality, data migration and ETL. Winner of the ODTUG Best Speaker award for the Spring 1999 conference, Joseph is a regular speaker at ODTUG, OOW, IOUG-A, and a variety of local user groups. Joseph has extensive knowledge in the areas of information architecture and data exchange, with substantial experience in pharmaceutical, retail, public, and e-commerce applications. Joseph co-authored Oracle8 Design Using UML Object Modeling for Osborne/McGraw-Hill & Oracle Press along with Paul Dorsey. He has also written several articles for a variety of publications.