Data Vault Series 2 - Data Vault Components
Published: January 1, 2003
Published in TDAN.com January 2003
There is a new home for the Data Vault: http://www.danlinstedt.com
The purpose of this paper is to present and discuss a patent-pending technique called a Data Vault™ – the next evolution in data modeling for enterprise data warehousing. This is the second paper in a series of papers that will be published on a Data Vault. This paper explores a specific example: converting the Microsoft SQLServer 2000 Northwind Database to a Data Vault representation. One of the most interesting concepts surrounding a Data Vault is the incorporation of S.E.I. CMM Level 5 ideas. This poses the question: Can a Data Vault be reverse engineered mechanically from an existing model? The answer is yes – it can. Will it be perfect? No; the algorithm that has been constructed builds a base-line Data Vault data model from an existing 3rd Normal Form operational model. The point is that the process is repeatable and consistent; therefore portions of it can be automated.
The intended audience of this paper is data modelers who wish to construct a Data Vault data model. This paper focuses on a specific example: the Microsoft SQLServer 2000 Northwind Database. For the purposes of this discussion, it is suggested that the reader obtain, at a minimum, a trial copy of the SQLServer 2000 database engine. Please read Series 1, the original paper defining the Data Vault architecture; it provides the context on what the data model is and how it fits into the business. The topics covered in this paper are: examining an OLTP 3NF model for conversion, the process of modeling a Data Vault (Hubs, Links, and Satellites), and populating a Data Vault.
From this paper you may learn how to evaluate an OLTP 3NF model for conversion, how to derive the Hub, Link, and Satellite entities from it, and how to populate the resulting structures.
Let’s consider this for a moment: suppose it is possible to reverse engineer a data model into a warehouse. What would that mean for a data warehousing project? Suppose it could be done in an automated fashion – would that help or hurt? What if the only remaining consideration were how to integrate the different generated data models? These and many more questions come to mind when considering the automation of data modeling for data warehousing, particularly when mechanized engineering is involved.
For our purposes, having this functionality to produce a baseline Data Vault would be of tremendous help. The Northwind data model was converted both by hand and in an automated fashion. When the two data models were compared, there were only minor differences. Further examination showed that hand-conversion opened up the possibility of errors in Link tables, where the automated converter kept the Links clean. The most important items for mechanizing the process are: naming conventions, abbreviation conventions, and specification of primary/foreign keys.
What’s important here is that this is a baby step into the application of “dynamic data warehousing,” or dynamic model changes (please see my other article: Bleeding Edge Data Warehousing – due out in the Journal of Data Warehousing Fall 2002). It also produced a data model in ten minutes (for this particular example), where converting it by hand took roughly two hours. It then took an additional twenty minutes to adjust the model slightly and implement it. Keep in mind this is a small data model, and all that is proposed is auto-vaulting of one OLTP data model at a time. The automated process isn’t smart enough yet to integrate the resulting Data Vault data models.
The DDL is available on a web link: http://www.coreintegration.com. Sign in to our free online community, the Inner Core, click on Downloads, select Data Warehousing, and find the zip file titled DataVault2DDL.zip (for this series). The DDL and the views are built for Microsoft SQLServer 2000; feel free to convert them to a database of your choice. The automated mechanism is not available today – it’s still in an experimental phase. The DDL contains the tables for a Data Vault and the views to populate the structure, both initially and with changes.
Please keep in mind: this is not a “perfect Vault” and has not been conditioned to the same quality of data model as one delivered to a customer. It is meant as an example only, for trial purposes. Feel free to contact me directly with questions or comments.
“You must strive to do that which you think you cannot do.” Eleanor Roosevelt.
2.0 Examining an OLTP 3NF model for conversion
Some OLTP data models in 3NF are easier to convert than others. There are some distinctive properties that make the conversion process easy. A few items to look for include: consistent naming and abbreviation conventions, explicitly defined primary and foreign keys, identifiable business keys (whether surrogate or natural), and the absence of independent (unconnected) tables.
These are suggested items to consider before converting the data model; they are by no means a complete list. First and foremost, the data warehouse data model should always follow the business requirements, regardless of how the base-line or initial model is generated. It is suggested that a scorecard approach be developed. Over time, these items can be placed on a scale of difficulty; when that happens, the scorecard will provide a good guideline as to the “convertibility” of a particular data model. Future series will cover migrating Conformed Data Marts and other types of adapted 3NF EDW to a Data Vault schema.
2.1 The Northwind Database
Northwind is built by Microsoft and is installed with every Microsoft SQLServer 2000 database. It is freely accessible with sample data. The data model is shown in Figure 2-1 below.
In this model, the first thing to notice is the use of non-standard data types: bit, ntext, image, money. These don’t port well to other relational databases. This is important to resolve because most data warehouses are not built on the same database engine as their OLTP counterparts; in this case, however, the Data Vault will be built on the same RDBMS engine. Another item that stands out is the recursive relationship on Employees. Immediately this should signal a necessary change to the data model.
The naming conventions appear consistent across the model. “ID” is used consistently in primary key names, primary and foreign keys are defined, there are no independent tables, and the model appears to use a mix of surrogate and natural keys. For the sake of discussion, the business requirements are to house all of the data in the warehouse and to store only incremental changes to the data over time.
The attributes could be classed out (normalized) further if desired; items such as address, city, region, and postal code can all be grouped. Do certain attributes change faster than others? From looking at the model, the two tables with the most changes might be Orders and Order Details. There really isn’t a method that will help discover rapidly changing elements in this model. Normally, rapidly changing elements are either indicated by business users or discovered through audit trails, usage logs, or time-stamps on the data itself; in this case, none of these are present.
3.0 The process of modeling a Data Vault
In order to keep the design simple yet elegant, there are a minimum number of components – specifically the Hub, Link, and Satellite entities – combined with traditional data modeling skill sets. These were defined in Series 1; please refer to the first article for definitions and table structure setup. This section discusses the process of converting the above data model to an effective Data Vault. The steps for a single model conversion without integration are: identify the business keys and model the Hubs, model the Links that tie the business keys together, and model the Satellites from the remaining descriptive attributes.
To address more than one model, start with the business-identified “master system”. Build the first data model, then incrementally map other data models and data elements into a single unified view of the information.
There are three styles of load-dates in the EDW Data Vault architecture, and before modeling can begin it is wise to choose the style that suits your needs. Style 1, used throughout this paper, carries a standard load date only; the other styles (including Style 3, which carries a load end-date on records) are covered in a future article.
Select the style that best suits the business needs and implement it across the model. Part of the Data Vault modeling success is consistency. Stay consistent with the style that’s chosen and the model will be solid from a maintenance perspective.
3.1 Hub Entities
Since the Hubs are lists of business keys, it is important to keep the business keys together with any surrogate keys (where surrogates are available). Upon examination of the model (the examination included the unique indexes and a query of the data), we find business key/surrogate key groupings such as CategoryName with CategoryID, ProductName with ProductID, and CompanyName with SupplierID.
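As a rough illustration of that examination, the following is a minimal sketch (an assumption for illustration, not the original analysis) of how the declared keys and a confirming data query can be pulled from SQLServer 2000:

-- A minimal sketch: list primary key and unique-constraint columns per table.
-- Run inside the Northwind database; the INFORMATION_SCHEMA views exist in SQLServer 2000.
SELECT tc.TABLE_NAME,
       tc.CONSTRAINT_TYPE,
       kcu.COLUMN_NAME
FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN   INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
       ON tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
WHERE  tc.CONSTRAINT_TYPE IN ('PRIMARY KEY', 'UNIQUE')
ORDER BY tc.TABLE_NAME, tc.CONSTRAINT_TYPE, kcu.ORDINAL_POSITION

-- A data query to confirm that a candidate business key is unique in practice:
SELECT CategoryName, COUNT(*) AS cnt
FROM   Categories
GROUP BY CategoryName
HAVING COUNT(*) > 1

If the second query returns no rows, the candidate column holds up as a business key in the data, not just in the metadata.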
Once the analysis has been done for each of the table structures, we can assemble the list of hub tables to be built: Hub_Category, Hub_Product, Hub_Supplier, Hub_Shippers, Hub_Customer, Hub_CustomerDemographics, Hub_Employee, Hub_Territories. There are a couple of questionable items which, depending on the business rules, may have their structures integrated. Remember that the hub structures are all very similar; Hub_Category, for example, looks like this:
Create Table Hub_Category (
    CategoryID int NOT NULL,
    CategoryName nvarchar(15) NOT NULL,
    LOAD_DATE datetime NOT NULL,
    RECORD_SOURCE nvarchar(12) NOT NULL,
    Primary Key (CategoryID)
)

Create unique index hub_category_i1
    on Hub_Category (CategoryName)
Now that we have the hub structures, we can move on to the links. The function of the hubs is to integrate and centralize the business around the business keys.
3.2 Link Entities
The Links represent the business processes – the glue that ties the business keys together. They describe the interactions and relationships between the keys. It is important to realize that the business keys, and the relationships they participate in, are the most important elements in the warehouse; without this information, the data is difficult to relate. Typically, transactions and many-to-many tables constitute good Link tables. Along with that, any table that doesn’t have a respective business key becomes a good Link entity. Tables with a single-attribute primary key usually make a good Hub table, but the requirement is still a business key; in the case of Orders, a business key does not exist. The link tables of our model therefore come from the transactions and many-to-many tables: Orders, Order Details, EmployeeTerritories, and CustomerCustomerDemo.
Did we get all the linkages? No – look again. There are some parent/child foreign key relationships in tables slated to become Hubs, and Hubs don’t carry parent/child relationships or resolve granularity issues. Examining the Products table, we see both a CategoryID and a SupplierID. This will constitute a LNK_Product table containing the ProductID, SupplierID, and CategoryID. In a true data warehouse we would construct a surrogate key for this link table; however, in this case the data model states that ProductID is sufficient to represent the supplier and category (as indicated by Order Details), so no surrogate key is necessary.
In cases of integration (across other sources), it may be necessary to put a surrogate key into multiple link tables. Are there other parent/child relationships that need a linkage? Yes: Employees has a recursive relationship. To draw this out, we will construct a LNK_EMPLOYEE table so that the “reportsto” relationship can be handled through a link table (a sketch of LNK_EMPLOYEE follows the LNK_PRODUCTS example below). There are no more relationships that need to be resolved, so after the Links we can move on to Satellite entities. An example of a link table is below:
Create Table LNK_PRODUCTS (
    ProductID int NOT NULL,
    CategoryID int NOT NULL,
    SupplierID int NOT NULL,
    LOAD_DATE datetime NOT NULL,
    RECORD_SOURCE nvarchar(12) NOT NULL,
    Primary Key (ProductID),
    Foreign Key (SupplierID) references HUB_Supplier,
    Foreign Key (CategoryID) references HUB_Category
)
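For the recursive “reportsto” relationship noted above, a LNK_EMPLOYEE table might look like the following. This is a hedged sketch, not the delivered model: the column names follow the Northwind Employees table (EmployeeID, ReportsTo), and both columns reference Hub_Employee.

Create Table LNK_EMPLOYEE (
    EmployeeID int NOT NULL,            -- the reporting employee
    ReportsTo int NOT NULL,             -- the manager; also an employee business key
    LOAD_DATE datetime NOT NULL,
    RECORD_SOURCE nvarchar(12) NOT NULL,
    Primary Key (EmployeeID, ReportsTo),
    Foreign Key (EmployeeID) references HUB_Employee,
    Foreign Key (ReportsTo) references HUB_Employee
)

The composite primary key allows the same employee to appear again in the link if the reporting line changes over time.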
3.3 Satellite Entities
The rest of the fields are subject to change over time; therefore they will be placed into Satellites. The following tables will be carried as Satellite structures: Categories, Products, Suppliers, Order Details, Orders, Customers, Shippers, and Employees. The Satellites contain only non-foreign-key attributes. The primary key of a Satellite is the primary key of its Hub with a LOAD_DATE incorporated; it is a composite key, as described in Series 1. In the interest of time and space, only one example of a Satellite is listed below:
Create Table SAT_Products (
    ProductID int NOT NULL,
    -- the non-foreign-key descriptive attributes of Products would be listed here
    LOAD_DATE datetime NOT NULL,
    RECORD_SOURCE nvarchar(12) NOT NULL,
    Primary Key (ProductID, LOAD_DATE),
    Foreign Key (ProductID) references HUB_Products
)
The physical data model now appears as follows:
Figure 3-3 Physical Northwind Data Vault Model
If this is difficult to read, the full image is available in a PDF (in the ZIP file in the Inner Core Downloads section) at: www.coreintegration.com (sign up for the Inner Core – it’s free – then go to the Downloads section). This is the entire data model, with the hubs in light gray/blue, the links in red, and the satellites in white. This is Style 1, with just a standard load date being utilized; in the interest of space, the other styles will be presented in a future article.
4.0 Populating a Data Vault
If the Auto Vault generation process is used, the views to populate the data structures are generated right along with the structures themselves. In this case the views have been generated; a sample of one each of the Hub, Link, and Satellite views is provided below.
The Hubs are insert-only: they record the business keys the first time the data warehouse sees them, and they do not record subsequent occurrences; only new keys are inserted. The Links work the same way – inserts only, for only those rows that do not already exist in the link. The Satellites are also delta-driven: they insert any row that has changed from the source-system perspective, providing an audit trail of changes.
Another purpose of a Data Vault structure is to house 100% of the incoming data, 100% of the time. It will be up to the reporting environments, and the data marts, to determine what data is “in error” according to business rules. A Data Vault makes it easy to construct repeatable, consistent processes, including load processes. The architecture provides another baby step in the direction of allowing dynamic structure changes.
To load the Hubs: select a distinct list of business keys with their surrogate keys, where the keys do not already exist in the hub.
CREATE VIEW V_INS_HUB_CATEGORIES AS
SELECT DISTINCT
       A.CATEGORYID,
       A.CATEGORYNAME,
       GETDATE()   AS LOAD_DATE,      -- Style 1: load date stamped at load time
       'NORTHWIND' AS RECORD_SOURCE   -- column list and correlation reconstructed;
                                      -- the generated view in the DDL download is authoritative
FROM NORTHWIND..[CATEGORIES] A with (NOLOCK)
WHERE NOT EXISTS
      (SELECT 1 FROM HUB_CATEGORIES B WITH (NOLOCK)
       WHERE B.CATEGORYID = A.CATEGORYID)
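These views are then run as straight inserts (as noted in the Satellite discussion below), for example:

INSERT INTO HUB_CATEGORIES
SELECT * FROM V_INS_HUB_CATEGORIES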
To load the Links: select a distinct list of composite keys with their surrogates (if provided), where the data does not already exist in the link.
CREATE VIEW V_INS_LNK_ORDERS AS
SELECT DISTINCT
       A.ORDERID,
       A.CUSTOMERID,                  -- reconstructed column list: the keys that the
       A.EMPLOYEEID,                  -- Orders transaction ties together
       A.SHIPVIA,
       GETDATE()   AS LOAD_DATE,
       'NORTHWIND' AS RECORD_SOURCE
FROM NORTHWIND..[ORDERS] A with (NOLOCK)
WHERE NOT EXISTS
      (SELECT 1 FROM LNK_ORDERS B WITH (NOLOCK)
       WHERE B.ORDERID = A.ORDERID)
To load the Satellites: select the set of records, matched (joined) on the business key or composite key where possible, in which at least one column differs between the source and the target. Match only against the “latest” picture of the satellite row in the satellite table for comparison purposes.
CREATE VIEW V_UPD_SAT_EMPLOYEES AS SELECT
    -- (the full column list and change-detection predicate are omitted here;
    --  the complete view is included in the DDL download)

The view is built to handle null comparisons, as well as to chop the comparison of text and image columns to only 2000 characters. The comparison is extremely fast and is a short-circuit Boolean evaluation. These views are run as Insert Into… Select * From the view. The satellite view is fast as long as partitioning is observed along with the primary key.
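To make the comparison mechanics concrete, here is a minimal sketch of such a delta view, written against SAT_Products rather than the generated V_UPD_SAT_EMPLOYEES. The column list (ProductName, UnitPrice) and the ISNULL-based null handling are assumptions for illustration; the generated views in the DDL download are authoritative.

CREATE VIEW V_UPD_SAT_PRODUCTS AS
SELECT A.PRODUCTID,
       GETDATE()   AS LOAD_DATE,
       'NORTHWIND' AS RECORD_SOURCE,
       A.PRODUCTNAME,
       A.UNITPRICE
FROM NORTHWIND..[PRODUCTS] A with (NOLOCK)
WHERE NOT EXISTS
      (SELECT 1
       FROM SAT_PRODUCTS S WITH (NOLOCK)
       WHERE S.PRODUCTID = A.PRODUCTID
         -- compare only against the latest satellite row for this key
         AND S.LOAD_DATE = (SELECT MAX(S2.LOAD_DATE)
                            FROM SAT_PRODUCTS S2 WITH (NOLOCK)
                            WHERE S2.PRODUCTID = S.PRODUCTID)
         -- short-circuit column comparison with null handling
         AND ISNULL(S.PRODUCTNAME, '') = ISNULL(A.PRODUCTNAME, '')
         AND ISNULL(S.UNITPRICE, -1) = ISNULL(A.UNITPRICE, -1))

A row is returned (and inserted) only when no current satellite row matches on every compared column – that is, when the key is new or at least one attribute has changed.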
Views work well when the source and the Data Vault are in the same instance of the relational database engine. If different instances are used, there are two suggested solutions: 1) stage the source data into the warehouse target so that the views can be used, or 2) utilize an ETL tool to perform the transfer and comparison of the information. Staging the information in the warehouse and utilizing the views keeps the data local to the database engine, and in some cases takes advantage of highly parallelized operations in the RDBMS engine (such as Teradata, for instance).
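A minimal sketch of the staging option, assuming a linked server named SRC_OLTP has been configured to point at the source instance (the linked-server and staging-table names are illustrative only):

-- Pull the source table into a local staging table; the V_INS_* views can then
-- select from the staging table instead of NORTHWIND.. directly.
SELECT *
INTO   STG_CATEGORIES
FROM   SRC_OLTP.Northwind.dbo.Categories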
This series provides a look at implementing and building a Data Vault, along with a sample Data Vault structure that almost everyone has access to. This simple example is meant to show that a Data Vault can be built in an iterative fashion, and that it is not necessary to build the entire EDW in one sitting. It is also meant to serve as an example for the previous series, showing that this modeling technique is effective and efficient. The next series will dive into querying this style of data model and will discuss Style 3 – a load end-date on records versus the point-in-time satellite structure.
© Copyright 2002-2003, Core Integration Partners, All Rights Reserved. Unless otherwise indicated, all materials are the property of Core Integration Partners, Inc. No part of this document may be reproduced in any form, or by any means, without written permission from Core Integration Partners, Inc.
Dan E. Linstedt – Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's Blog at http://www.b-eye-network.com/blogs/linstedt/.