|
Metadata Improvements – A Case Study
Published: May 1, 2008 The article describes problems encountered in integrating metadata, the approach taken and the results achieved.
Multi-year content-rich documentation deemed unreliable and stopped being used. An immediate effort required to bring it in shape. The generally accepted approach to handle metadata is to load it from all sources into a single repository for integration. The premise is to avoid the “spaghetti” of point-to-point synchronizations. Are there any alternatives if that approach is not feasible? What if the metadata is distributed amongst multiple documents and the company has no plans for a full-scale metadata repository tool? What if, in addition, documents don’t follow a strict layout? The article describes the approach taken and discusses the results achieved. Importance of DocumentationDevelopment continuity for a typical data warehousing effort is disrupted by time gaps, geographic and organizational boundaries:
As a result, any significant data warehouse needs a trail of documentation to survive. Within the development workflow, an analyst produces documents that are used by next-level analysts, and so on. Only the final deliverables are programs and database structures. Documentation FlawsDocumentation is always flawed. It is imprecise and incomplete to begin with and easily gets out of date. We can only wish that documentation had the rigor of programs with syntax checking of individual modules, resolving their mutual references by a linker, and finally multiple levels of testing. Documents are checked only visually and many typographical and other errors are never caught. The original requirements are usually incomplete; multiple changes occur, but are not always reflected in the documents. Some information is omitted because it is assumed obvious by the development team. At each workflow step, an analyst interprets information from the earlier documents; the errors are compounded along the document chain. Multiple versions are needed to reflect all projects within the development pipeline. The versions should be tagged and archived properly. Documents for cancelled and on-hold projects should be distinguishable as well. Issues come in two different flavors. Sometimes versions are not tagged properly, and the resulting duplicates cause confusion. The opposite problem exists when only a single instance of a document is maintained. It is then difficult to tell where each individual artifact is in its life cycle. For example, a data model might have obsolete tables and early drafts of the new ones coexisting on equal rights with the tables currently implemented in production. Case Study ProblemIn our case study, a strong Data Administration team developed and maintained content-rich metadata. All relevant information about a given table was combined together into a single document. Document layout was thought through and easy to understand. The documentation had been widely used for years by multiple development teams and business partners. Time, however, took its toll, and the discrepancies accumulated. The documents stopped being used, because they were deemed unreliable. For example, only 60% of the data model tables were found in the source-to-target spreadsheets. At first glance, it is difficult to comprehend how any work can be done under such circumstances. However, bear in mind that humans are adaptive while inventorying programs can’t handle any discrepancy, however minute. (See also Flexible Authoring Tool Challenges below.) An effort was initiated with the goal of improving quality of the documentation and maintaining it going forward. In order for it to be sustainable over time, it needed to be automated and integrated in the development process. Distributed Metadata ChallengesThe generally accepted way to integrate metadata is to load it from all sources into a single repository. This is the best practice recommendation and the author’s preference as well. However, a full-scale metadata repository tool was not a feasible option. Results needed to be achieved within a distributed metadata framework. The immediate need was limited to synchronizing just data models and source-to-target maps, but the selected solution should have been extendable. Let’s assume the metadata is distributed; it is produced by multiple authoring tools and stored within numerous individual documents. These documents in their entirety constitute the system of record, and each document serves as a source of metadata. With n nodes (sources), the number of links is defined by the formula n(n-1)/2. If each link requires a synchronization process, the number of processes grows as the square of sources. This is the dangerous “spaghetti” effect of point-to-point synchronization. For the complex environment of a real-life organization, this number becomes prohibitively large. This potential scalability issue can be subdivided into two: multiple types of documents being integrated, and multiple individual documents being integrated. Let’s address each one separately.
Multiple Types of Documents Being Integrated Let us consider the metadata example in Figure 1. The organization is interested to keep the following five documents in sync:
Figure 1: Sample Metadata Entity-Relationship Diagram The square-power growth predicts 10 interfaces, and the linear growth – 4. The reality is somewhere between the predictions; 5 relationships cross subject area boundaries. This is still a manageable number and cannot prevent expansion to a more comprehensive metadata integration.
Multiple Individual Documents Being Integrated I remember once at a seaside lodge only cold water was running out of faucets in the morning. So you could not have HOT RUNNING water. But you heated some, and made do, by constantly switching between RUNNING water and HOT water. It was not the optimal, but still a workable solution. Similarly, we provide a workable metadata integration solution by combining two partial methods that complement each other:
Here is a typical scenario. A data modeler designs a new table for a project. She is only interested in a handful of project-related tables, not the whole data warehouse. She identifies the subset of documents required by using the Full Inventory method. Before starting her own work, she verifies and reconciles existing documents (from now on using the Full Synchronization method). She generates the first draft of a table in a data model out of the source-to-target spreadsheet. She then periodically updates the source-to-target spreadsheet based on the data model changes (and vice versa). Her documents are in sync throughout the development process. Flexible Authoring Tool ChallengesA flexible authoring tool, like Microsoft ™ Excel or Word is usually utilized to capture source-to-target maps and transformation rules. A standard document layout is usually established from the start. However, enforcement of the guidelines is difficult, and deviations creep in; new spreadsheet columns are added; columns are moved around; headers change. In addition, standards evolve over time, creating a series of accepted templates. There is a strong (and understandable) desire to squeeze more information on a viewable page. The axiom of one atomic datum per element is often violated. A single spreadsheet cell gets overloaded with multiple pieces of data.
For example, in our case, the “Source Column Name” header had over 50 different variations. As another example, there were four permutations for table name placement as depicted in Figure 2. Variation might happen to accommodate specific project requirements or personal taste of the document author. Regardless of whether reasons are well grounded or whimsical, all this complicates extracting information universally across the full portfolio of documents.
Figure 2: Table Name Placement Algorithm
Document Recognition
Overall flexibility was achieved by extending and tuning the above lists. The rules, which were difficult to fit into this scheme, were hard-coded. Still, some documents evaded any straightforward logic. The decisions were simple, if a variation was used in a lot of documents, handle programmatically, if only by few – modify documents instead.
Inventory Metrics
Figure 3: Inventory Metrics The goal was to reduce the number of missing, extraneous, and duplicate maps and, as a result, improve the match ratio. ImplementationThe success of the effort hinged on the data analysts doing research and correcting documentation. They had to do it in addition to their heavy project load. Of course, they were motivated to eliminate the perceived inadequacy of their collective results, but that did not diminish their accomplishments. And the author is grateful to their dedication. The inventorying program only facilitated the process by monitoring the progress and identifying the remaining issues. It also provided “immediate gratification” to the analysts. It generated a hyperlinked cross-reference of source-to-target documents and tables within data models. The document became an essential part of the development process by providing an easy searching and navigation tool. At first, the inventory was run on a weekly basis. All problem areas were attacked: typos were corrected, exception rules were tuned, old versions were pruned, obsolete documents and artifacts archived, missing documents recovered, etc. The progress during that initial period was mostly achieved by improving quality of existing documentation, not by creating new documents. In a couple of months, the results started to show:
Apparently, the confidence of developers and business partners in the data modeling documentation was restored, as all the crucial tables were matched. After the initial exploding growth, progress stabilized with only small incremental changes. The inventory frequency was cut down to once a month. Subsequently, the effort was transformed into an ongoing metadata administration function. It was assumed that the quality of the existing documentation portfolio was acceptable, and the attention is focused on the changes only. This prevents future crises, while keeping work effort low. Further improvements can be achieved by restoring the documentation, for example, by reverse-engineering from ETL programs. This is an expensive proposition. One way to justify the work is by limiting it to the immediate needs of the new projects. Such an overhead might be acceptable to project sponsors. ConclusionsThe metadata improvement effort targeted immediate needs, but was based on a solid foundation that allowed its extension in the future. A point-to-point metadata synchronization was used to accommodate the existing distributed metadata framework. Two partial methods complementing each other were developed and implemented programmatically. Document recognition techniques were used to handle flexible layouts. A new cross-referencing document was added to the analysts’ toolset, making it an essential part of the development process. The initial effort improved the quality of the documentation, and restored the confidence of developers and business partners in the data modeling deliverables. Subsequently, the effort was transformed into an ongoing metadata administration function as a low maintenance measure against future crises. Reference:
Go to Current Issue | Go to Issue Archive Recent articles by Alex Friedgan, Ph.D.
Alex Friedgan, Ph.D. - Alex Friedgan, Ph.D., is a Principal with Data Cartography. Alex has worked in multiple roles, including: research engineer, developer, analyst, college professor, database administrator, and data
architect. He succeeded in solving problems of reverse engineering, rapid application development, data warehousing, distributed data architecture, object modeling, enterprise data management,
meta-data repository and meta-data stand-alone solutions. He can be reached at alex.friedgan@usa.net.
|