Aggregates Location: A Rule of Thumb
Published in TDAN.com, April 1, 2004
In the Information Age, more and more business and socio-political organizations are realizing the advantages to be drawn from a mostly untapped resource: the enterprise information pool. Customer orders, product shipments, invoices, customer and employee behaviour, and many other information items can enable a company to better integrate into its market, predict its trends and proactively react to future market needs or directions. All these informational resources already exist in the enterprise, so there is no acquisition cost; organizations only have to invest in organizing and meaningfully using them to adjust their business behaviour, directly and positively impacting their bottom line.
The particular class of systems targeted at organizing and using large sets of enterprise information to enhance the business perspective and decision-making process is broadly known as Business Intelligence (BI) solutions, such as On-Line Analytical Processing (OLAP), Multidimensional OLAP (MOLAP), Decision Support Systems (DSS) and Enterprise Information / Reporting Systems (EIS / ERS). These can be deployed as non-integrated stand-alone solutions, commonly named Data Marts. But it is a truism that someone who knows the past can predict the future, so the most useful deployment of a BI solution is on top of an Enterprise Data Warehouse that holds an integrated long-term history of the organization's information elements.
There is a large industry debate concerning the level of detail to be stored in the Data Warehouse (its "granularity"), and how meaningful a ten-year-old atomic data element is for the organization. The only definite answer to this issue is that there is no universal answer, as each organization has different data retention policies for electronic and written information. However, the Data Warehouse should have a consistent history depth to provide meaningful cross-references, so many organizations use an Operational Data Store to maintain atomic data elements relationally organized, and the Data Warehouse to dimensionally store mostly aggregated data and only business-critical atomic data elements.
Data Warehousing is an intricate environment, and may include multiple components quite different in functionality and complexity. There are two opposed methodologies in Data Warehousing theory, introduced by Ralph Kimball and Bill Inmon, and many combinations in between, such as the approach promoted by Open Data Systems and introduced in Data Warehousing Environment - An Introduction (Open Data Systems Inc., www.opendatasys.com, Publications section, December 2001). This approach is based on the axiom that there is no universal "DW solution"; rather, each organization will implement a subset of the potential environment, shown in the figure below.
Figure 1 - Complete Business Intelligence Environment
The back-end layer is usually enterprise-wide and will include one or more of the following enabling structures:
Typically, very few users have direct access to the Data Warehouse, as they need advanced knowledge of the data warehouse structure and data (for example, to avoid issuing an all-tables, all-data join that would seriously impact DW performance). The regular user layer may typically include the following classes of client systems fed from the Data Warehouse:
The following table briefly compares the main classes of Data Warehouse clients and their main purpose (which defines the type of data required and processing performed):
Table 1 - Data Warehouse Client Systems
Each of these types of client systems has different data requirements in order to satisfy its specific end-user analytical needs. Although they can receive data from the Operational Data Store, the Data Warehouse is usually the better-suited source, at least for OLAP, Data Mining and MOLAP clients.
The basic data support in the Data Warehouse Database for any BI client is achieved by storing the atomic data at the highest granularity level (the most detail) required to satisfy any potential business demand for information. The client systems can then extract and process that elementary-level data to create derived and/or aggregated data, either pre-computed or on demand.
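To make the client-side option concrete, here is a minimal sketch of on-demand aggregation over atomic fact rows, as a client tool might perform it after extracting detail data from the Data Warehouse. The row layout (product, region, month, sales amount) and all values are hypothetical.

```python
# Hypothetical atomic fact rows as a client tool might extract them
# from the Data Warehouse: (product, region, month, sales_amount).
atomic_facts = [
    ("widget", "east", "2004-01", 120.0),
    ("widget", "west", "2004-01", 80.0),
    ("gadget", "east", "2004-01", 200.0),
    ("widget", "east", "2004-02", 150.0),
]

def aggregate(rows, key_fn):
    """Roll atomic rows up to the granularity selected by key_fn."""
    totals = {}
    for row in rows:
        key = key_fn(row)
        totals[key] = totals.get(key, 0.0) + row[3]
    return totals

# On-demand aggregation along the product dimension.
by_product = aggregate(atomic_facts, lambda r: r[0])
print(by_product)  # {'widget': 350.0, 'gadget': 200.0}
```

Because the atomic detail is kept at full granularity, the same rows can be rolled up along any other dimension (region, month) without touching the warehouse schema.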
Another option is to pre-aggregate some of the required derived information in the Data Warehouse Database, and feed it as a "new" fact to the client tools. While this should definitely be done when the atomic data is moved off the on-line media (reaches its retention limit), many voices argue that it should also be done even while the detail data still exists in the Data Warehouse.
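The retention-limit case can be sketched as follows; the cutoff date, row layout and values are hypothetical. Detail rows older than the retention limit are retired from on-line storage, and only their pre-aggregated summary remains available to the client tools as a "new" fact.

```python
# Hypothetical atomic fact rows: (month, amount).
atomic_facts = [
    ("2001-03", 100.0),
    ("2001-04", 110.0),
    ("2004-01", 120.0),
]

# Assumed retention policy: detail older than this leaves on-line media.
RETENTION_CUTOFF = "2002-01"

retired = [r for r in atomic_facts if r[0] < RETENTION_CUTOFF]
kept_detail = [r for r in atomic_facts if r[0] >= RETENTION_CUTOFF]

# The pre-aggregated summary of the retired detail is what the
# Data Warehouse keeps on-line and feeds to client tools.
pre_aggregated_total = sum(amount for _, amount in retired)
print(pre_aggregated_total)  # 210.0
```

The debate in the text is whether the same pre-aggregation is also worthwhile while the detail rows are still on-line, trading storage for query-time processing.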
There are many arguments for and against each approach. The following table summarizes the most important differences between the two options mentioned above:
Table 2 - Aggregation Options Comparison
The balance is clearly tilted toward performing the aggregations in the Data Warehouse Database. This is especially true for aggregates involving complex transformations of the detailed data based on business-driven algorithms that are, in fact, "new" business measures derived from a set of underlying data elements.
However, caution must be exercised not to overdo it and build every potential aggregate in permanent storage. As a basic rule, processing that is not part of standard BI tool functionality (in other words, that is custom built to satisfy specific business rules) should be materialized as a permanent store in the Data Warehouse. Straight summary, average and similar aggregations along one or more dimensions should be entrusted to the client layer, as both MOLAP and ROLAP class solutions handle these standard BI operations very well.
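The rule of thumb above can be sketched with a hypothetical example: a business-rule measure (here an invented "quality-adjusted margin") is pre-computed and stored in the warehouse as a new fact, while a plain summary along a dimension is left to the client tools. All names, the formula and the data are assumptions for illustration only.

```python
# Hypothetical atomic facts: (product, revenue, cost, returns).
atomic_facts = [
    ("widget", 1000.0, 600.0, 50.0),
    ("gadget", 2000.0, 1500.0, 300.0),
]

def quality_adjusted_margin(revenue, cost, returns):
    """Hypothetical business-driven measure: margin penalized by returns.
    Custom logic like this is the kind of derived 'new' fact worth
    pre-computing and storing in the Data Warehouse Database."""
    return (revenue - cost) - 2.0 * returns

# Pre-computed in the warehouse and fed to clients as a "new" fact:
stored_aggregates = {
    product: quality_adjusted_margin(rev, cost, ret)
    for product, rev, cost, ret in atomic_facts
}
print(stored_aggregates)  # {'widget': 300.0, 'gadget': -100.0}

# A straight summary is left to the client layer, since standard
# MOLAP / ROLAP tools handle sums and averages natively:
total_revenue = sum(rev for _, rev, _, _ in atomic_facts)
```

The dividing line is whether the computation encodes business rules the client toolset does not know about; if it does, store the result once rather than re-implement the rule in every client.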
Moreover, many RDBMS engines now offer advanced OLAP and data mining functionality as integral part of the database server software, enabling variations from the traditional aggregation approach. For example, Oracle9i can store OLAP data in a relational star schema (consisting of fact and dimension tables), but also multi-dimensionally in an Analytic Workspace (AW), or in a combination of both.
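For readers less familiar with the relational star schema mentioned above, here is a minimal sketch of its structure and of an OLAP-style roll-up against it. The dimension tables, surrogate keys and measures are hypothetical; this illustrates the schema shape, not any vendor's API.

```python
# Dimension tables: surrogate key -> descriptive attributes.
product_dim = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gadget", "category": "hardware"},
}
time_dim = {10: {"month": "2004-01"}, 11: {"month": "2004-02"}}

# Fact table: measures keyed by dimension surrogate keys.
fact_sales = [
    # (product_key, time_key, amount)
    (1, 10, 120.0),
    (2, 10, 200.0),
    (1, 11, 150.0),
]

# Roll-up: join each fact row to the product dimension, then
# aggregate the measure along the dimension's "category" attribute.
by_category = {}
for product_key, _, amount in fact_sales:
    category = product_dim[product_key]["category"]
    by_category[category] = by_category.get(category, 0.0) + amount
print(by_category)  # {'hardware': 470.0}
```

A multidimensional store such as an Analytic Workspace keeps the same logical content, but physically organized as an array indexed by the dimensions rather than as joinable tables.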
Also, with the lowering cost of on-line storage media (hard drives), in recent years it has become much more cost-effective to pre-process data and permanently store the ready-to-use results, versus on-demand processing that requires significantly increased CPU power to deliver acceptable performance levels.
In conclusion, creating business-driven database-level aggregates enables a certain level of client / toolset independence, data storage and protection within a specialized RDBMS, and a lower overall Total Cost of Ownership. Without minimizing the role of the client systems in a Data Warehousing environment, the first candidate for aggregate deployment is the Data Warehouse Database, regardless of the type of client system that will ultimately present the data to the end-user.
George Jucan, MSc in Computer Science, is the founder and acting CEO of Open Data Systems Inc. He has over 10 years of IT experience, most of it as a consultant in the public and private sectors. He specializes primarily in Information Architecture, but his background also includes project management and data / database architecture. He can be contacted through the Open Data Systems web site at: www.opendatasys.com