|
Data Archeology
Published: September 1, 1999
When I first heard the term "data archeology," many things went through my head. Was this the process of searching for old data to preserve in a museum somewhere? Or, maybe the pharaohs were buried with data storage devices. Or, maybe this is just another term created by consultants to confuse everyone. I went about asking my colleagues what it was. The answers were all interesting, but all different. Hmmm, so we consultants are finally confusing ourselves. With all of this in mind I decided to create a definition that can be used by everyone and will unconfuse the confused. Data archeology is the process of identifying critical data sources (both internal and external) and their respective data for analysis and migration to a separate database and/or data warehouse. It identifies and classifies both "good" and "bad" data. It feeds the data migration effort and helps to ensure that only "good" data is migrated to the new data repository. It is also very useful in helping determine if the analysts and modelers discovered all entities and attributes. A simplified example of the process would be:
Obviously, you can proceed with data archeology while the logical model is being created. This will help you get a quick start to this tedious process. Just coordinate carefully with your DBA and modelers. At this point there is a need to purchase special tools and create special programs, or scripts, to migrate the original data over to its new home. These tools and programs will eliminate, modify or identify the invalid data. An example of bad data that would be eliminated would be "test" data that was inserted into production files so programmers could easily test their changes. An example of data that would be identified might be the use of certain fields within the source files that are used by ingenious users to help them do their jobs. You may not want to eliminate this data. You may just identify it and create a separate attribute for it in the new database. You may wish to modify data at times. Let's say you have a customer file that contains data which lacks standardization. For instance, a post office box may have been entered in numerous ways (i.e., PO Box, P.O. Box, Box, etc.). You may wish to standardize it to a single format (i.e., PO Box). A parsing utility would help greatly in this instance. Remember that many data warehousing projects have failed due to problems originating from "bad" data. If success is to be assured, every database and/or data warehouse project should include this process in its work plan. Now that I think about it, I suppose "data archeology" is as good a term as any to identify this very important process. Go to Current Issue | Go to Issue Archive
Michael Bingle -
Mr. Bingle is a principal OSMC Consulting Services. He has over 30 years of Information Systems experience. He can be contacted at: mbingle@osmc-web.com or 908-301-9262. |