A New Way of Thinking - July 2006
Master Data Management, Coherence, and Performance
Published: July 1, 2006
Published in TDAN.com July 2006

My interest in the area of Master Data Management (MDM) is largely driven by my experience in data cleansing: parsing, standardization, and matching. As any data management professional who has not been living in a cave for the past year knows, MDM is one of the hottest topics around among analysts, media channels, conference sessions, and marketers. The concept: engineer a central repository that consolidates variant, replicated copies of (what should be) shared data objects, in order to establish, under well-defined governance policies and service-level agreements, a unified “best record” representation of each identifiable entity within the enterprise. Quite a mouthful, huh?

Clearly, data quality tools play a large part in this. These tools are needed to take different data sets, shake out all the unique entities, and merge them together. And when you do a literature scan on the Web, you will find reams of articles and white papers (including some of my own!) touting the benefits of MDM: high-quality data, consistent views, reduced complexity, etc. Typical claims are that MDM will ensure that all applications have a consistent, accurate, and timely view of all master data objects.

Yet there is one lingering issue that I seem unable to resolve, which is the question of performance. In my web searching, I have not been able to find a significant amount of information regarding performance. Let’s look at this a little more closely. The conventional wisdom is that there are three styles of MDM models: the central master, the registry, and the transaction hub.
Each of these styles must be able to support traditional database operations: create, read, update. The concept of deletes is a little trickier, and can be ignored for this thought experiment. In addition, these systems must support a lookup operation that finds a “best match” for an entity, which is necessary to ensure that duplicate data is not inserted into the repository.

Consider the record creation operation for customer data. First, any application needing a customer record must acquire enough identifying information and then consult the master index for a lookup. If matching records are returned, either one of them is selected as the appropriate match, or a new record is created and returned as the appropriate match. Next, any modifications to the record need to be made and posted to the central repository.

In the central master style, each application has a local copy of the master data, so these activities can be done locally, right? New records are created by the application itself, and that data must be propagated back to the central master and then on to the other applications. Hold on a minute there: if the application can create new records locally, then between the time that action takes place and the time the information propagates back to the master and out to the other applications, master data exists in one local copy and not in the others… and doesn’t that break the whole “consistent, unified record” concept? It is also possible that other applications are creating new records for the same customer at the same time. All of a sudden we are bound by transaction semantics, which, if enforced, create a performance bottleneck at the central repository.

Well, let’s look at the registry.
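Before turning to the registry, the propagation window in the central master style can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the class names `CentralMaster` and `LocalApp`, the crude `normalize` matching rule (a stand-in for real parsing, standardization, and matching), and the "first writer wins" policy at the master. It shows two applications creating the same customer locally before either has propagated, so the master sees nothing during the window and detects the duplicate only afterwards.

```python
from dataclasses import dataclass, field


def normalize(name: str) -> str:
    """Crude stand-in for standardization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


@dataclass
class CentralMaster:
    """Hypothetical central repository keyed by normalized customer name."""
    records: dict = field(default_factory=dict)

    def lookup(self, name):
        return self.records.get(normalize(name))

    def post(self, name, source):
        """Accept a propagated record; return False if it duplicates one already held."""
        key = normalize(name)
        if key in self.records:
            return False
        self.records[key] = {"name": name, "source": source}
        return True


@dataclass
class LocalApp:
    """Hypothetical application holding a local copy of the master data."""
    app_id: str
    local: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def get_or_create(self, name):
        key = normalize(name)
        if key not in self.local:          # lookup runs against the local copy only
            self.local[key] = {"name": name, "source": self.app_id}
            self.pending.append(name)      # not yet visible to any other application
        return self.local[key]

    def propagate(self, master):
        """Push pending creations to the master; return names rejected as duplicates."""
        rejected = [n for n in self.pending if not master.post(n, self.app_id)]
        self.pending.clear()
        return rejected


master = CentralMaster()
billing = LocalApp("billing")
support = LocalApp("support")

# Both applications create the same customer before either one propagates.
billing.get_or_create("Jane  Doe")
support.get_or_create("jane doe")
in_window = master.lookup("Jane Doe")          # None: the master has seen neither copy

billing_rejected = billing.propagate(master)   # []: the first writer is accepted
support_rejected = support.propagate(master)   # ["jane doe"]: duplicate found only now
```

Closing that window would mean consulting the master (and holding a lock or transaction there) on every creation, which is exactly where the bottleneck described above comes from.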
Since the central repository only maintains an index and cross-references, data reads are a little hairier; the central registry must issue a series of queries to each application that holds a piece of each virtual master record, and then assemble that master record on demand. While the performance penalty for creation goes down, the performance penalty for reads goes way up.

Next is the transaction hub, which is just a more restrictive form of the central master. In this case we have the bottleneck associated with the transaction semantics at the central repository, which must also contend with reads now occurring at the single copy instead of the local copies. Still more potential performance hits. No matter what, it seems that MDM system performance is one of those nagging questions begging to be answered.

How are these performance questions addressed? One approach is the traditional system engineer’s answer: buy more powerful hardware. Throwing massively parallel appliances at the problem will probably help, especially when they carry multiple I/O channels. Another approach is caching copies of the data along the system geography. Of course, this boils down to a memory hierarchy management and cache coherence management problem, which is a whole other kettle of fish (albeit one with a nice history of research behind it). Another approach is to explicitly allow some inconsistency, bounded by service-level agreements guided by the governance component. Essentially, you can allow for some level of variance within a certain time frame for propagation, or restrict creation of new records by a set of policies for coherence.

The articles and papers that I did find making reference to MDM performance issues seemed to imply that vendors are not adequately addressing them. My desire to see MDM succeed is tempered by the pervasive presence of the 800 lb. performance gorilla hovering around the back of the room.

Copyright © 2006 Knowledge Integrity, Inc.
David Loshin - David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management
solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Enterprise Knowledge
Management – The Data Quality Approach (Morgan Kaufmann, 2001) and Business Intelligence – The Savvy
Manager's Guide and is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.
Editor's note: More David Loshin articles, resources, news and events are available in the Business Intelligence Network's David Loshin Channel. Be sure to visit today!