Dealing with De-Normalization of Lookup Tables
Published: April 1, 2001
Published in TDAN.com April 2001
I am asked several times a year to create a single look-up table rather than having 30 or so small 2-column tables. Each time I suggest that this is not the best idea for the future, nor is it going to be easier to manage, nor is it going to be faster and smaller. As an early teaser: Some DBMS cannot change data type or format without removing the data, dropping all foreign key relationships, replacing the newly defined table, putting the foreign keys back and the putting the data back. If this is against the common 30 in 1 table, that is 30 relationships to manage and you only needed to increase the description column by five characters. An old saying (I am an old guy) - "If it looks hard it probably is". Even so, it is often argued that fewer tables translate into "easier" model.
Here is an example of a typical look-up definition in ORM:
The constraints are over each role preventing a duplicate ID and preventing a duplicate description. The ID can only have one description and the description can have only one code. Usually, one would select the ID as primary key since it is the smallest. The logical model would appear as:
The typical database would contain many such small tables with relationships with the major tables. The need to join the tables in order to get the description is the result of normalization. It is also typical for the look-up to only have a relationship with a single major table.
The "common lookup table" solution involves a transformation of the basic ideas. The first act in the transformation is to change the table name into a column. This transformation typically takes on one of two styles:
Resulting in these two logical models:
The first variation creates the common look-up table with a primary identification of Table Name and Table Code. In order to maintain the uniqueness on the User Type Description, we now have to create a uniqueness constraint on the Table Name and Table Code Description. The rules are maintained, but the physical database now includes:
The second variation adds another column (Special Key) to reduce the foreign key back to only one column. Now we have three indexing issues: the primary key is one column, alternate key one is table name and table code, and the second alternate key is the table name and the table code description.
There is alternate way to do this without the all of these indexes that involves using some trigger code to manage the table. As one solution to indexing, Joe Celko (see references) suggests a large block of code (a CASE statement) would be required to maintain such a large table. He also noted that this is a maintenance issue whenever a new type of look-up is added. This solution reduces the number indexes but, as he points out, increases the complexity of the database with CHECK code. Celko is also concerned with the loss of the domain of the codes.
Dr. Terry Halpin is also very concerned about the identification and importance of domains, and this leads us to a small side trip; often data modelers assume that a domain is a data type. Therefore, they create a domain name such as ID and make it a data type of, say, integer. The problem is that it is not just an ID; it is User ID or User Type ID, for example. They are both IDs but not the same ID domain. It also seems a bit meaningless to create a domain name for each possible size of a character string:
I would not expect varchar20 to suddenly change to varchar(19) as it's data type, thus creating such implicit domains is meaningless. Further, what happens when 20 things are of the domain varchar20 and you want to change one of them to varchar(22)? Well, you change its domain which is really just changing the data type. Most design tools do not automatically carry this domain change over into all other tables using the same attribute, mostly because they are "seen" as equivalent elements.
Consider this: I have a Mountain and it has a height of 3000 meters. Is the domain integer? Meters are the domain with integer as the data type. Celko uses the Dewey Decimal as a domain and then shows several data type implementations that are possible (floating point, character, etc.), clearly showing domain and data type are not the same thing.
ORM uses reference modes on object types to indicate domain keeping data type as a totally different consideration. Even for value types, the domain is different than the data type. For the value type of "Last Name", the domain is "Last Name" not "varchar20". In addition, there are any numbers of attributes that are the same data type but in different domains. It makes no sense to compare "Last Name" to "Address Line" just because they are the same data type.
ORM solves this problem by creating exactly one instance of any entity. This single instance has a reference mode (domain); therefore, when the mapping occurs to a logical model this domain is always the same. The entity (object) in ORM is not a table or an attribute; it is an object playing a role with another object. How it all turns out in the logical model is determined by the roles the object plays. A "Date" object would have one definition applied everywhere in the system. The knowledge that "Date" is always exactly the same is valuable and assures consistency. It is interesting that ORM modelers tend to deal with domain in a natural way as they verbalize the fact and look at the resulting sentence. Again, "Mountain (name) reaches Height (meters). As small side trip of the small side trip, I like to think of the data type as a constraint on the data population. Last Name consists of a maximum of 20 characters and only characters.
And now back to the main topic.
The common table approach causes a loss of domain knowledge. The domain knowledge is gone since all columns have the same name and data type. Therefore, you choose some arbitrarily large text size for both the table code and the table code description. The resulting table using variation #2 is in First-Normal Form (1NF). Everything does not depend upon the key, the represented data types are not really the same domain - they are merely forced into appearing in a domain.
The data storage requirements are also larger. If you are using the alternate key technique the total size is larger since the number of indexes required are larger and there are more indexes. If you are using the CHECK technique the size is slightly larger because you repeat the table name as a column.
In order to control access and update to the subsets of look-ups, you will have to create multiple views controlling the look-up data returned. All these views constitute the same as having a view of each smaller table. Also, not all look-up tables are in same format. Some may required a display sequence in order to present the information in a useful order for humans looking at drop-down list boxes. Some may require a short and a long description, rather than just one. You could add all these options to the common table, but many look-ups do not require these options. You may decide to put these in specific loop-up tables, so you get a 50/50 effect (not a suggestion by the way).
To wrap this up: using the "common lookup table" approach, you will not save space, you will not save performance, you will have just as many relationship lines, you will lose domain (and, for that matter, data type as codes that are numbers will have to be represented in character form - causing interesting presentation sequences (1, 11, 2, 21) ), you will either create a special primary key or cause extra columns in foreign keys, and you will not save on the number of views. You will only save on those little rectangles on the drawing - heh, congratulations!
This article was previously published in the
April, 2000 issue of the Journal of Conceptual Modeling
Patrick Hallock - Patrick Hallock is a Senior Partner and Principal Consultant for InConcept. He has over 15 years of ORM/NIAM experience and is a certified ORM consultant and trainer and a certified Visio trainer. This article was previously published in the Journal of Conceptual Modeling. Contact Information: Patrick Hallock President and Co-Founder InConcept 8171 Hidden Bay Trail N Lake Elmo, MN 55042 (651) 777-8484 fax: (651) 777-9634 http://www.inconcept.com