|
In Praise of Intelligent Keys
Published: January 1, 2004
Published in TDAN.com January 2004
Intelligent keys are something that most data analysts abhor. Unfortunately, they are quite common, and there is often great attachment to them among communities of business users. This makes them tricky to deal with. The problem is compounded by the fact that data analysts seem to be divided into two groups: one group with a near-religious belief that all entities should have surrogate keys; and another group with an equally strong belief that all keys should be natural. Both groups are quick to condemn anything at variance with their core convictions, and intelligent keys are typically treated as something close to heresy. Yet, are intelligent keys really all that bad? It is possible to build a strong case to include them in database designs, and there is good evidence that they can be more useful than prejudiced data analysts often suppose. What is Intelligence?What is an intelligent key? It is an attribute that is used to identify an entity, but which encodes some form of meaning. One form of intelligent key is a number that conforms to some kind of mathematical rule. An example I have worked with is a 7-digit employee number where the seventh digit was derived by dividing the first six digits by 7 and finding the remainder. This provided a rough way of determining whether an employee number was authentic or not. Although this may seem rather a mild use of intelligence it is still too much for many zealots who declare that nothing with even the faintest whiff of “intelligence” can be put in a primary key, or indeed, as an attribute in an entity. The more common form of intelligent key is where more than one piece of information is encoded in a data value. A good example is the familiar Social Security Number (SSN). This has the format XXX-YY-NNNN, which can be interpreted as follows:
This is not the whole story about SSN’s, but it does illustrate multiple pieces of information in one data value. Account numbers often follow similar patterns, and can often encode over 10 different pieces of information – which is one reason they can be so long. More Than KeysThere is one other fundamental property of an intelligent key – it is a key. That is, it is used to identify instances of an entity. However, it is a key designed for use by humans, not for use by computers, and as such does more than provide uniqueness. When we humans search for information and try to understand its significance, we do not have the capacity to process large amounts of data. Our brains, eyes and hands are the only tools we have for this task and they do not have the efficiency of computerized databases in terms of speed of data processing. Intelligent keys are a strategy that humans can use to overcome these shortcomings. If someone searching for information understands what is encoded in the value of the key, they can often determine whether to or not a particular instance of an entity is relevant to their search simply by looking at the value of the intelligent key representing that particular instance. People experienced with working with a particular intelligent key can spot patterns in its use, especially patterns that may be “red flags” indicating potential error situations. There is another aspect of intelligent keys, which is that they permit cross-checks with other redundant information about the instance of the entity under consideration. Such cross-checks are an important way in which humans detect data quality problems. The usefulness of intelligent keys is one reason why communities of business users become so attached to them. This is not merely for purely manual information systems, such as filing cabinets full of folders with the intelligent key typed onto each folder tab. Even in computerized applications, intelligent keys help users to do their work. The Law of AtomicityThe features of intelligent keys that make them so useful to business users violate one of the fundamental tenets of data modeling, the law of atomicity. This states that one attribute should represent only one piece of information. Having an intelligent key means that if the underlying nature of the instance of the entity changes, then the value of the key may also have to change because one of the pieces of information encoded in the key value has changed. Furthermore, programs have to parse intelligent key values to extract the different pieces of information encoded in them. This is known to be error-prone. Yet, beyond this there is simply a prejudice that intelligent keys are “wrong” as a matter of principle. Many data zealots think no further than this, and actually have only hazy notions of why they are really “bad”. The result can be conflict with the user community and a database design that is less useful than it might otherwise be. Physician Heal ThyselfSurprisingly, in one respect, data analysts are among the most ardent proponents and users of intelligent keys. In fact, they often try to force entire enterprises to use elaborate schemes for intelligent keys. Of course, data analysts do not call these artifacts “intelligent keys”, rather they call them “naming conventions”. We are still in a time where repositories (databases that store metadata) are rather primitive. Such repositories, including the system catalogs of databases, usually store only a little information about the structure of a database, e.g. the tables, columns and relationships. Each of these items is given a name, but DBA’s, programmers, and other people that use tables, columns and relationships need to understand as much as possible what these names mean from a business standpoint. The answer is to introduce naming conventions. One example of a naming convention for a physical database column name could be that column names consist of three nodes separated by underscores. The first node represents an entity name, the second node represents a business concept in a standard way, and the third node represents a general datatype. Creating and enforcing these kinds of naming conventions is a major part of information resource management in all enterprises. Naming conventions are wonderful examples of intelligent keys, and how useful they are. Nobody would ever suggest creating a random number as a table or column name. How could programmers, for instance, remember the meaning of such things? Surely they would cause all kinds of errors to be introduced into program code. Indeed, how could anyone else who needs to deal directly with tables and columns work with such meaningless names? Dealing with Intelligent KeysData analysts cannot both be for naming conventions as a matter of principle and against intelligent keys as a matter or principle. The contradiction is simply too great. Approaching database design with a set of ideas about what is “wrong” and what is “right” with no thought about why this is so is a recipe for disaster. Design means truly understanding the trade-offs we make when we choose a particular option. Every design decision optimizes some functionality, and forecloses other kinds of functionality. It cannot be an exercise carried out on automatic pilot according to some predetermined set of rules. This is not to say that rules do not exist, but rather that they have to be fully understood. In the case of intelligent keys, throwing them out of a database design is often a mistake. Finding a good way to include them may need some extra design work, but it can be done. For instance, there is no need to actually make them into primary keys. The fact that intelligent keys often encode several pieces of information can be mitigated by creating additional attributes to hold these pieces of information, while keeping the intelligent key in the database. Then, programs can be written that do not need to parse key values to extract the values of these attributes (though it can be quite a chore to get programmers to do this). The intelligent key is thus available, albeit not as a primary key, and provides additional value to users. Perhaps the most important lesson is that if business users are attached to intelligent keys, data analysts should try to incorporate this requirement in database designs. It may not be easy, and it may require additional effort by the IT organization as a whole. However, absolute prohibitions of intelligent keys based on nothing more than the idea that this is “good practice” does not serve the enterprise well, and is indeed hypocritical when it comes from groups that enforce naming conventions for physically implemented database structures. Go to Current Issue | Go to Issue Archive Recent articles by Malcolm Chisholm
Malcolm Chisholm -
Malcolm Chisholm, Ph.D., has over 25 years of experience in enterprise information management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management and business rules. Malcolm has authored two books: Managing Reference Data in Enterprise Databases (Morgan Kaufmann, 2000) and How to Build a Business Rules Engine (Morgan Kaufmann, 2003). He can be contacted at mchisholm@refdataportal.com. Editor’s note: More Malcolm Chisholm articles, resources, news and events are available in the Business Intelligence Network's Malcolm Chisholm Channel. Be sure to visit today!
|