|
Practically-Oriented Conceptual Model for OLAP
Published: April 1, 2002
Published in TDAN.com April 2002 ForewordIn 1998, our company was contacted by one of our customers, a Social office of a Swedish municipality, with a request to build a system that could help them to see the trends in the distribution of their resources. The information on which this system would work was gathered by a software system that supported the everyday operations. The operational system was of a transactional type, which means that it stored all transactions made, e.g., payments, but not the history of changes in “static” objects, like people. The prospective-users of a new analytical system would be ordinary office workers without any special analytical education. Getting this request, we screened the market for analytical tools that existed at the time. We found the following:
Thus, a decision had been made to develop an OLAP system on our own. The next step was screening the OLAP literature in order to find practical recommendations on how to build such a system. Unfortunately, though there were quite a lot of books on OLAP and Datawarehouse on the shelves, we had not found the information we were looking for. All books we found were about how to use analytical tools more or less equal to the ones we found on the market. Not finding practically useful information in an application field was not completely new for us. We started to work in a usual way:
This document represents the results of the first step of this work. It was first written as an internal memo for our development group. The text below represents the slightly edited version of the internal memo. We apologize for conciseness of the text and lack of examples. Still, we hope that the reader could find it interesting to have a look on the description of the OLAP and Datawarehouse concepts that somewhat differs from the one normally found in the literature. The system was successfully built based on the approach above. It had a web-based user interface, and it allowed the end-user to choose from tenths of predefined reports. The results could be presented as lists or graphs. The ordinary Oracle database was used to store analytical data. As a system development tool, we used JAM from Prolifics, Inc. (New York, NY, US). 1. IntroductionA conceptual model is a (minimal) set of abstract notions that describes a particular application domain, or class of tasks or systems. The conceptual model serves as a basis for both database design, and system construction. Each new object in the database should be classified according to the notions of the conceptual model. Each operation should be also classified according to it. If it is difficult or impossible to represent some element of the application domain in the conceptual model, it is desirable that the conceptual model is modified to determine the right place for this element in the overall picture of the application world. If, because of the time limitations, the programming fixed is used as a work around the limits of the conceptual model, the revision of the model should be undertaken at a later time. In this document, a conceptual model is presented for the application domain of so-called OLAP systems, where OLAP stands for Online Analytical Processing. This is not a conceptual model to be used when talking to the end-users. The goal is to have a common language in the development group that allows to formally represent functional specifications requested by the customer, e.g., reports. 2. GoalThe goal with OLAP systems is to measure some essential properties of a dynamic world, e.g., a company, organization, society, etc. Examples of values to be measured include: total of payments per month, an average number of days a patient remains in the intensive ward after an operation, etc. 3. Time Based FactsTo be able to measure the world’s characteristics, the dynamic world of interest needs to be represented in the system in some formal way. One way of representing such world is as a set of time-based facts. A time-based fact (TBF) is a logical statement about the external world that is true at a certain time. A time-based fact may be of two types:
Each time-based fact, as a minimum, has a predicate, and a timestamp.
If a stamp is not precise, e.g., the date precision, and a TBF is of event type, the meaning of the stamp is that the event happened sometime in the given interval.
If a stamp is not precise, e.g., the date precision, and a TBF is of state type, the meaning of the timestamp is that the state was preserved during some or (most) of the time in the given
interval.
Beside the predicate, and timestamp, a TBF may include attributes, object references, and measurements (measured values):
The unique ID, ObjId, is a string or a number that is unique for each object of a given type.
The current state of the object is described by a set of attributes, like name, phone number, address, age, etc. Each object type, like PERSON, COMPANY, PATIENT CASE, has its own set of attributes.
However, for each particular object, some attribute values may not be known.
The object’s history is a sequence of states that the object had in past. Each state, the current including, is supplied with the timestamp that shows when the object has reached this state.
Note: In our model, we simplify the reality by not allowing one object to refer directly to another object. All connections between objects are done via TBF’s.
An object reference is a pointer to the object that the given TBF concerns. Each reference has a name and type; the latter shows which type of objects the reference can point to (persons,
hospitals, hospital cases, etc.). All TBFs with the same predicate shares the same set of object references (more precisely, the names of object references) However, for a particular TBF, not all
of them may be known.
As the object is a dynamic entity defined by its history, the reference should point not only to the object, but also to the state of the object (in the object’s history) that it had when the
given TBF was true in the real world.
There is no conceptual difference between attributes and measurements. Both reflect some characteristics of TBFs. The difference lies in the way they are used in a particular application.
Measurements are used to evaluate integrated parameters of the dynamic world, e.g. an average amount of investment planned per month, the total sales per month for a certain group of products, etc.
Attributes are used to group TBFs during the aggregation process, e.g. all investment (attribute value) decisions made this month, all patients placed in the intensive ward (attribute value), etc.
Note: Sometimes the same parameter may be used as both a measurement and an attribute. For example, assume we want to group all investments decisions according to the size of investments made, like small investments (< $10000), middle-size investments ($10000-$100000), etc. Suppose also that we want to know the totals of planned investments for each group. To cover this situation, characteristic size of investment should be included in TBFs as both an attribute, and measurement. The precision of the attribute and the measurement based on the same characteristic may differ. Normally, an attribute does not require the same precision as a measurement. In a more complicated case, even characteristics of the objects referred by the TBFs may need to be included as measurements in a particular TBF. For example, if we are interested in the average age of patients who undergo a specific operation, the person age should be included in the TBF as a measurement, while it continues to serve as an attribute in the description of object state. 4. Space Transformation4.1 Simple Dimensions
Let P(t,m1,m2, ...,mi,,a1, a2,..., aj,r1,...,rk) be a TBF, where Let us consider different ways of representing this fact as a point in a multidimensional space. First, we need to have some order over components referred by the TBFs. Measurements have numeric values; thus for each measurement, values are ordered in a natural way. The same is true for the timestamps. As far as attributes are concerned, for some attributes (e.g. numeric attributes), the values have some natural order, for others, there is no natural order. However, we can always introduce some artificial order over the attribute values. An attribute with some order introduced over its values will be called an attribute dimension. The same trick can be made with references. Objects that a particular TBF’s reference can point to are, normally, not ordered, but we can introduce some arbitrary order in them. A predicate reference with some order introduced over the objects it points to will be called a reference dimension. With the dimensions introduced above, we can construct a multidimensional space in a number of different ways. One way is when an (i+j+k+1)-dimensional space is constructed from the time dimension (timestamp t), i measurement dimensions (m1,m2, ...,mi), j attribute dimensions (a1, a2,..., aj), and k reference dimensions (r1,...,rk). This way is standard for OLAP literature. However, the above space may have undesirable dependencies between dimensions. This happens when some measurement and some attribute represent the same characteristic of the real world. Another way of defining a multidimensional space is to construct it from the time dimension (timestamp t), j attribute dimensions (a1, a2,..., aj ), and k reference dimensions (r1,...,rk); the resulting space will have (j+k+1) dimensions This construct gives us a “more orthogonal” space than the previous one, and we will use it for defining a multidimensional space to represent TBFs of the given type (i.e., with the given predicate). Under assumption that the timestamp, attributes and references totally differentiate one TBF from another, measurements may be treated as values (i.e., weights) assigned to some points of thus defined space. 4.2 Extension and ReductionLet rc, 1 >= c <= k, be an object reference from TBF P(t,m1,m2, ...,mi,,a1, a2,..., aj,r1,...,rk), and the state of the object to which this reference points is defined by a set of attributes: . Suppose, we have defined a dimension over each attribute of objects to which reference c can point. Now, we can consider any TBF with predicate P as a point in a (j+k+n+1)-dimensional space constructed from the time dimension (timestamp t), j attribute dimensions (a1, a2,..., aj), k reference dimensions (r1,...,rk), and n object attribute dimensions (b1, ..., bn). An operation of space transformation where a reference dimension is complemented by the object attribute dimensions is called extension of the original space via reference rc. Another useful operation to transform the space is by eliminating one of the dimensions that belong to the original space (attribute, reference, or object attribute dimension). This transformation is called reduction via the specified dimension. The coordinates of a TBF in the reduced space are the same as in the original space for all axes but the one that has been eliminated. Note that in the reduced space, there is no guarantee that each TBF of the original space will be mapped into a unique point in a new space. Consequently, we need a way of defining measured values assigned to a point in a new space that assert several TBFs from the old space. This process is called aggregation and it will be described later. A complex transformation that includes several extensions may be defined by naming a set of reference dimensions that should be extended. A complex transformation that includes several reductions could be defined by naming a set of dimensions (be it attribute, reference, or object attribute) that remains in the transformed space (all other are eliminated). The latter is just a usual projection. Note: If the reduction concerns an object attribute (e.g. a product category), and the object reference is not part of the excluded dimensions, no aggregation will happen along the dropped dimension. 4.3 Hierarchical DimensionsAs we saw in the previous section, a simple attribute dimension is defined by introducing a strict order over the attribute values. A hierarchical dimension introduces several levels of attribute values to make it possible to specify the attribute with less precision than the exact value. Basically, a hierarchical dimension defined over an attribute specifies:
Let us also presume that each hierarchical dimension includes a so-called root level. The root level is the highest level in the hierarchy, and it has only one element (unit) denoted as all. According to the above definition of hierarchical dimension, all embraces all units of the previous level. With the root-level concept, each simple dimension can be treated as a hierarchical dimension with two levels: zero, and one, where level one is the root. Note: Our definition of a hierarchical dimension is not done in the form usually used in the OLAP literature. 4.4 ShrinkingSuppose we have defined a multidimensional space that includes a hierarchical dimension x. Then substituting the original values along axes x by the higher-level units, say of level n, will produce a new space that specifies the x values with less precision. We will refer to this kind of transformation as to shrinking along axes x up to level n. Again, when using this kind of transformation, several TBFs that occupies different points in the original space can be mapped into the same point in the “shrunk” space. If we treat all dimensions as being hierarchical, then the reduction operation discussed in the previous section may be regarded as a particular case of shrinking, i.e. shrinking up to the root level. This definition may be expanded to the reference dimensions too. Now, we can define a complex operation of reduction/shrinking by specifying the level of “shrinkage” for each dimension involved. 4.5 AggregationWhen a shrinking/reduction operation is undertaken over a multidimensional space, a number of points from the original space with measurement assigned to them may be projected to the same point in the reduced space. This effect is called aggregation, and it requires some way of calculating new measured values to assign to the aggregated points in the new space. New measurements are assigned by “integrating” the values originally assigned to the points being aggregated. Actually, the new space does not reflect the meaning of TBFs that served as a basis for building the original space. The new TBFs predicate can be defined, for example, as characteristics of an average monthly sale, total investment per business segment per month, etc. Consequently, the integration formula depends on the meaning we want to assign to the new predicate and, it cannot be defined generally. Such formula should specify what measurements are to be assigned to the points in the new space, and how they can be calculated from the measurements assigned to the points of the original space. Though the general integration formula does not exist, we would like to introduce one standard way of integration that can be useful as an intermediate step for other types of integration. Let m1,m2, ...,mi be measurements defined for the TBFs being aggregated. Add to them i+1 additional measurements c, c1, …, ci.. Define an integration procedure that assigns values to the measurements m1,m2, ...,mi,, c, c1, …, ci as follows:
The above integration rule is associative and commutative in respect to the order in which the space is transformed, if the latter is done step by step. The results will be the same if a complex transformation of shrinking/reduction along several axes is done in one step, or several reductions each concerning only one axis are done one at a time. The standard integration procedure has all data to calculate some other types of values, e.g. totals over all aggregated points. The total over the p-th measured value is derived from cp, and mp according to formula cp * mp. In some cases, the integrated values can be treated not as measurements, but as attributes in the transformed space (sometimes, we need both). This allows us to define a new hierarchical dimension, for example, on average values, and perform shrink (or restrict, see the next section) operations over this dimension. 4.6 Restriction and ShiftSometimes, only part of the attribute dimension values are of interest in the current analysis. Let s be a subset of all possible values in the attribute dimension x. A new space can be created from the original one by eliminating all points for which the value of coordinate x does not belong to s. This operation of space transformation is called restriction of dimension x on set s. The way to defined a subset of values for restriction is chosen from practical considerations. The set can be defined by: (a) listing all values that are included (b) specifying the range limits (minimum, and maximum) (c) listing several ranges (d) listing ranges and standalone values (e) specifying values that shouldn’t be included in the set Unlike the shrinking/reduction operation, restriction does not result in any aggregation, it just removes uninteresting parts of the space. After the restriction has been made, the initial range of values from 0 and up to some limit may become free of points with measurements. The shift operation moves point 0 in a dimension to the right. The operation is defined by the shift value. For example if we are interesting in TBFs that are valid for 1997, we can make restriction over this year and shift the beginning of the time axis to the beginning of 1997. Restriction with shift that follows allows comparative analysis of measurements that belong to two different time intervals, e.g., 1996 and 1997. When we perform two separate restrictions over the two years and make a shift in each of the resulting spaces, the time axes in both spaces will become compatible, i.e. they will get the same set of values. The join operation described below helps two create a new space to compare values for different years. 4.7 JoinIn the previous sections, we discussed operations that transform a multidimensional space based on TBFs, all of which have the same predicate. Some of these operations, shrinking/reduction, may result in a space that corresponds to TBFs with another predicate than the original one. This is an example of so called derived TBFs, e.g. time based facts that originally are not included in the database, but can be derived from the facts that are included in the database. There are other ways of defining derived TBFs. For example, suppose we have a hospital database that for each patient case stores a TBF that register the event when a patient was accepted to the hospital and a TBF that register the event when he/she was discharged from the hospital. These TBF’s are connected to the same person object (physical object), and the same hospital case (process object), and there is a rule that a case may have only one Accept and one Discharge event. Even if the original database does not have TBFs that shows who was at the hospital at a particular date, this information can be derived. To do that, we need for each date:
The above example presents a relatively complicated case of derived TBFs that is difficult, if possible, to define via space transformation. However, in some practically useful cases, a derived space can be defined in a general way. Joining two spaces is one of them. Consider two multidimensional spaces M1, M2 each with n dimensions but with completely different sets of measurements. Let each i-th dimension from M1 be compatible with the i-th dimension of M2, i.e. they have the same set of values ordered in the same way. Then we can join M1 with M2 along the common axes getting a new space M3 . To do that, we need to assign measurements to each point in M3. This can be done only if the corresponding point in M1, or/and M2 has some measurement assigned. There can be different ways to assign measurements to the points in the new space, but we can at least define a standard one:
Continue the example from the previous section, we can use the join operation to create a space which compares measurements made over time in two different time intervals, e.g. in 1996 and 1997. 5. Space Scanning and Presentation5.1 What is a QueryAn end-user query in an OLAP system is a request to present time-base facts in the form of one or more multidimensional spaces. The query has two parts: what spaces(s) should be presented, and in what form, e.g., list (report), graph, 3-d picture, etc. When more than one space is to be presented, there should be some likeliness between the spaces, e.g., they should share at least one dimension. Here, in contrast with the join operation, sharing may not mean that the dimensions are fully compatible. Weaker compatibility is allowed, i.e. the actual axes used in the two spaces belong to the same hierarchical dimension, but different levels of units are used in each of the spaces. Typical example when several spaces are presented together is a report with totals and subtotals for different time ranges (week, month, and year), categories of products sold, etc. For these reports, the spaces merged can be produced by stepwise shrinking/reduction of the original space along various axes (time, product, etc). 5.2 Query DecompositionAfter the problem of what space(s) to present is solved, the next question to answer is if this space is stored in the database, or should be derived from the database in some way. In the latter case, we need to understand if the derived space can be obtained with standard space transformation operations (extension, shrinking/reduction, restriction, union), or nonstandard operations should be applied. Generally, a target space should be presented as a sequence of operations (standard or nonstandard) over one or more source spaces. Each operation that can lead to aggregation should be supplied with a formula for measurements calculation. A formula can be added to non-aggregating operations too, for example, if the measurements stored in the database are not the ones that are needed according to the query, but they can be calculated. Moreover, a calculation formula may be needed even when no space transformation is required, i.e. when the database already has the space requested in the query, but with another set of measurements than requested. Often, there are several ways for defining an order of operations that lead to the given target space. A new order can be obtained, for example, by switching the places of two standard operations in an already defined order. This is allowed only if the calculation formula is invariant to such changes in order of operations. This is true, for example, for the standard integration formula discussed in section 4.5. When the order of operations can be changed, the following rules are useful to follow:
5.3 ScanningScanning is a process of traversing a multidimensional space and collecting all points that have got some measurements. The result of scanning is a stream of weighted points, each of which is defined by its coordinates in the space, and measurements. Scanning is needed for both:
When scanning is being done, each point should be visited only once. In practice, this rule is not sufficient for the purposes of presentation and integration, more strict traversal order may be required. A scanning order can be defined as a sequence of space dimensions, say: D1, .., Dn. Scanning starts with the lowest value of dimension D1 and goes through all the points of the space that have this lowest value as the coordinate in dimension D1. After that, the next value in dimension D1 is chosen and all points with this value are scanned. The order of scanning the points with D1 value fixed is defined by dimension D2, etc. Suppose we scan an n-dimensional space in order D1, .., Dn. Then the resulting stream may be considered as consisting of a number of series, defined as follows:
If dimension Dk is a hierarchical dimension with h levels and f is a nonzero level (f <=h) , then the scanned stream may be considered as consisting of a number of series such that:
5.4 Stream Processing and MergingScanning can be used as a way of space transformation on the fly. Applying special processing rules to the scanned stream, we can get a stream that could be produced by scanning a space that differs from the original one. Naturally, this other space can be defined as a transformation of the original space. Let space M2 be derivable from an n-dimensional space M1 = D1 x ...x Dn via the shrink operation along dimension Dk up to level f. Assume farther that some integration formula to assign measurements to aggregated points is defined. Let us scan M1 in the order defined as D1, ..., Dk-1, Dk+1, ..., Dn,, Dk. Consider the resulting stream as a number of series, each series lying between two breaks of level f in dimensions (the rightmost break being included in the series). By applying the integration formula to all points in each series, and substituting the Dk value of precision 0 with the value of precision f, we will get a stream that would result from scanning space M2. This procedure can be generalized to cover cases where several shrink operations are applied one after another. Beside of being useful for space transformations, breaks are useful for merging two or more spaces that should be presented to the end-user together. More precisely, breaks of a level l greater than 0 can be used for merging two streams, where the second one defines some integrated values, like subtotal, average, etc., for the units of level l. In this case, the next point from the second stream is placed directly before the next break in the first stream (to summarize the previous series). Moreover, instead of merging two independent streams, the stream processing rules described above may be applied to the current stream to get the second stream while scanning the original space. This will result in space transformation and merging made at the same time. 5.5 VisualizationVisualization is a way of presenting the stream of points to the end-user. This can be done in various forms, e.g., as a list (report), graphical diagram (pie, bar), graph, two dimensional matrix, etc. Go to Current Issue | Go to Issue Archive
Ilia Bider - Ilia Bider is a cofounder and Director R&D of IbisSoft, Stockholm, Sweden. He has a combined experience of over twenty years of research (in the fields of computational linguistics, databases,
OO, business modeling), and practical work (business analysis, design, coding, software sales, and marketing) in 5 countries (Norway, Russia, Sweden, UK, US). For more information on IbisSoft please
see http://www.ibissoft.com
|