|
Denormalization and the Rules of Reconstruction
Published: October 1, 2000
Published in TDAN.com October 2000 Introduction Software and Database engineering are complex activities that require planning and control to be successful. By the time the DBA is called up to tune the indices of a database it is probably already too late. Efficiency should be designed into the data structure before the data is actually put on disk. Since the invention of CASE tools there is usually a missing step in the database design. The logical database design is set up in the modeling tool and then the DDL is generated. The table design of the physical database is the entity design of the logical database. Then when tuning is required, data is moved around on disk, indices are applied, freespace is modified, and more CPU memory is assigned. The DBMS level tuning steps are valid and will continue to be used. But, there has been a missing step in the database design process. Physical Design of Databases The word ‘denormalization’ is used to describe changes to the table design that cause the physical tables to differ from the normalized entity relationship diagram. ‘Denormalization’ does not mean that anything goes. Denormalization does not mean chaos. The development of properly denormalized data structures follows software engineering principles that insure that information will not be lost. If the table is read-only (periodically refreshed from the system-of-record) then the rules are looser. Star schemas and hyper-cubes are read-only denormalizations. If the data is to be distributed and/or segmented and added-to, changed, or deleted from then the reconstruction described below must be followed. Fundamentally a single principal must be followed. If the individual table is updated in more than one system, it should be possible to reconstruct the original table as if the data was never reformatted or taken apart. Denormalization There are many techniques for denormalizing a relational database design. These include –
Rule of ReconstructionWhen the Rule of Reconstruction is ignored and the data updated, the data is corrupted. Codd’s Rule of Reconstruction is a corollary to ‘Lossless-Join Decomposition’. Lossless Decomposition is a method for creating well-formed normalizations from non-normalized database designs. The Rule of Reconstruction is basically the same idea in reverse. Well-formed non-normalized structures are created from normalized tables. Read-only database (data warehouses, data marts, etc.) gain flexibility and quality if they also approximate this approach. The designer should think in terms of a series of transformations via SQL. If the physical database design is not based on a well-formed (Third Normal Form) logical database design you cannot know if the Rule of Reconstruction is being followed or not. There seems to be some kind of data entropic law here. If data is not carefully designed and managed through time, it slides into chaos. Constructions based upon algebra and set theory are fragile. In my experience, if you don’t know and can’t prove the design is correct, it is very likely incorrect. In reality, it is more subtle than this. For example, if the business changes and the logical data model is not updated and the corresponding physical database design is not updated, you will more than likely get the same effect as a poorly formed denormalization. Add, Changes, and Deletes will corrupt the data relative to the present real business. Future queries will then get the wrong answer relative to the present real business. This is the “Silent Killer.” As a designer, you start out right, change nothing, and now you are wrong. In the textbook theoretical literature written on the Relational Model, the operators used on databases are Insert, Replace, Delete, Retrieve, Select, Project, and Join (and to be complete – Product, Union, Intersection, Difference, and Divide). This causes some confusion with those of us that know SQL. The Retrieve, Select, Project, and Join functions are all performed by the SQL SELECT operator. Just to reduce the amount of re-statement and/or translation from Codd’s original source, the following uses the Relational Model terminology. The following is the Relational Model’s definition of Select, Project, and Join:
Any combination of relational operators can be applied to re-form how the data is distributed, provided that the total transformation is reversible. The issue is verifying this reversibility. To not cause inaccuracies and anomalies, each updateable physical data store must be reversible to the well-formed normalized data model. The following is not intended to be exhaustive and consists of only a few of examples of well-formed denormalizations. Over Normalization This is achieved via the project relational operator. Each projection to be stored in some table, possibly another database, must include the primary key of the relation from which the projection is made. Then, each and every projection is well-formed with no duplicate rows. A table can be split into any number of projections. Note that, at any time, the original table can be recovered using equi-join with respect to the primary keys of the new denormalized tables being joined. In applying relational technology to the management of a denormalized database, it is essential that it be possible the relational operators can be used to decompose relations in the global database into relations for the various denormalized target tables. In other words you should be able to use SQL for the decomposition. This doesn’t mean that you must use SQL for every decomposition, only that you could. You might have a non-SQL Extract Transform and Load (ETL) engine interfacing one or more OLTP systems with an Operational Data Store.
If the key in the above example was very long and compound, a surrogate key could be substituted. And one of the vertical segments could be stored with the surrogate key only. Horizontal Segmentation The select relational operator is used to insert some rows of a table into another table, and other rows to other tables. The selection of rows cannot be arbitrary. The selection of rows must be made using the select operator. When a table is partitioned by rows to be stored in other tables, union is used to recover the original table.
Stored Joins The use of stored joins as a denormalization will formally offend the rule of reconstruction if the joined-table is updated. Since this is a very common if not the most common denormalization some discussion is appropriate. If you must update a stored join, non- relational programmatic measures must be taken to see that data integrity is maintained. A simple example of this is to not update the duplicate columns. It is always dangerous to have the data integrity rules outside the data domain. When other update programs are added to the system they must also contain the non-relational data integrity processes. It is better software engineering to have all the data integrity rules in the DBMS in the form of constraints, triggers, and stored procedures. Quality almost always implies simplicity in usage and a ‘minimum number of moving parts.’ If you have to write complex code with a relational database to get the right answer you are allowed to leap to the conclusion that the database design does not map to the business. In Summary While the project and select operators can be used to split a single table from a database into several tables, the join, union, relational intersection, relation difference, and relation divide operators can be used to combine several relations from a database into one table. If the transformations performed on tables from the database are to be reversible, these operators must be applied with considerable care. Any combination of relational operators can be applied sequentially provided that the total transformation is reversible. One can imagine over normalization and horizontal segmentation being applied to the same table. Of course the same restriction applies. If the reversing relational operators are applied, and you cannot tell they table was not always in Third Normal Form, the denormalization was well-formed. We should start the physical tuning of our databases at design time with well-formed denormalizations. To delay tuning until after implementation is a mistake. References Codd, E.F. (1990) The Relational Mode for Database Management: Version 2, Addison-Wesley Kirkwood, John (1992) High Performance Relational Database Design, Ellis Horwood, Ltd. |