Data Modeling & Enterprise Project Management: The Data-based Approach - Part 3
Published: April 1, 2004
Published in TDAN.com April 2004
This is the third in a series of articles from Amit Bhagwat.
Data modeling is no doubt one of the most important and challenging aspects of developing, maintaining, augmenting and integrating typical enterprise systems. More than 90% of functionality of enterprise systems is centered round creating, manipulating and querying data. It therefore stands to reason that individuals managing enterprise projects should leverage on data modeling to execute their projects successfully and deliver not only capable and cost effective but also maintainable and extendable systems. A project manager is involved in a variety of tasks including estimation, planning, risk evaluation, resource management, monitoring & control, delivery management, etc. Virtually all of these activities are influenced by evolution of the data model and may benefit by taking it as the primary reference. This series of articles by Amit Bhagwat will go through the links between data modeling and various aspects of project management. Having explained the importance of a data model in the estimation process, and having provided an overview of various estimation approaches and presented illustrative example for the function-based approach, this article focuses on explaining the Data-based estimation approach.
In the first article[i] of this series, we established the central function of most enterprise projects as data-operation. We therefore concluded that data structure associated with a system would prove an effective starting point to arrive at the efforts associated with creation and upkeep of the system.
In the last article[ii] we took a simple case to illustrate the function-based estimation approach. We presented the case; itemized functions involved and then atomized them into transactions. We next analyzed how each transaction manipulated entities. Finally, we checked the transactions and associated entities with system function to ensure that no function was missed and therefore no transaction or entity was omitted. We also verified that there was no duplication of transactions. We then converted the transaction data to Unadjusted Function Point count (UFP), that can stand as the basis for various estimation and resourcing calculations. We also briefly discussed a shortcut FB approach.
It is worth noting that whereas the detailed Function-based approach, in all its majesty, may prove to be among the more accurate of Function Point based estimates, it is somewhat demanding on analysts’ time. Further, more importantly, it takes the project through a substantial inception, before any numbers become available. This delay naturally necessitates another technique of obtaining a ‘nearby number’, should the project require this a bit earlier. Such a technique may be a bit off the mark, but should put the numbers in the proximity of the more accurate value. It can also give analysts advantage of verification, for if this number varies by significant proportion from the more accurate estimate, then the analyst gets to check both calculations again and, as a general rule, finds some gross blunder in one of them, which is then correctable in a timely fashion.
It is this ‘other’ quick but less accurate approach, the Data-based estimation approach, that we are going to focus on in this article. We’ll briefly go through the approach and understand why it approximates the function-based approach. We’ll then use it to work on the case we took in the last article.
The Data-based Estimation Approach
While studying the Function-based approach, we counted the entities involved, inputs taken and outputs given by each transaction, summed these numbers to get cumulative count of entities, inputs and outputs, and then used these figures to get UFP. The final formula looked like 0.58 x inputs + 1.66 x entities + 0.26 x outputs. The exact multipliers are a result of substantial numerical and empirical work and are valid only in the context of UFP, which then needs to be converted into FPI by taking a product with technical complexity adjustment (TCA), a figure itself arrived at through a careful and extensive analysis of the non-functional factors. The way that the multipliers are used in UFP calculation is beyond the scope of this work. It is important however to note how they are proportioned with respect to each other. Inputs are given just over twice the weight as outputs. This is because of the validation functionality associated with inputs. Entities are given nearly thrice the weight as for inputs, this is because on one hand they have functionality associated with them to maintain inter-entity relationships and referential integrity, which involves factor of overhead similar to input validation, on the other, they are susceptible to 3 broad type of operations: create / update, read and delete, raising their weight further by a factor of around 3.
The Data-based approach attempts to guess these 3 numbers (input, output, entities involved) by making certain assumptions, based on a view of the data structure. It is important to note that the data approach concentrates on data that the system ‘Holds’. This is important and will be illustrated through the example. I have briefly discussed the assumption and treatment associated with the Data-based approach in the first article of this series. However an expansion on it will be in good order here.
Our ultimate goal is to find the inputs I, outputs O and entities ET associated with the transactions. The first assumption we make, as we did in the shortcut Function-based method, is that a transaction is essentially classifiable cleanly as a create / update, read or delete transaction. This having been assumed, the next step is to find I, O and ET for each type of transactions and cumulate these up.
To begin with, we only have the data model, which can give us the total number of entities (E), attributes (A) and binary relationships (R) [note that we are following classical entity relationship approach thus substituting higher n-nary logical relationships by nearest equivalent binary relations]. Under this approach therefore, we assume reflexive binary relations, leading to mean connectivity (C) of 2R / E.
We next assume that each entity is involved in one instance of creation, update, read and delete, thereby representing total number of transitions (T) = 4E. We further assume that a transaction involves a principal entity plus the average of entities connected (mean connectivity, C), i.e. 1+C = 1+ (2R/E). Let’s call this EPT (entities per transaction), thus,
EPT = 1 + (2R /E) = (2R+E) / E
We likewise estimate the number of fields (these will manifest as inputs or outputs) per transaction, assuming that each transaction involves a major entity that contributes all its attributes, plus subsidiary entities in the form of all connected entities (averaged to mean connectively) contributing half their attributes. This assumption ignores derived data, unless we expand the primary entity structure given, to include pseudo-entities, which exist as part of the business process but not as part of final data structure. This ignorance is considered pardonable given the approximate nature of this approach and modest level of intermediate / derived business entities that are logical but have no equivalent in the database. Since total attributes are A, fields associated (on the average) with the principal entity of the transaction are (A/E). Likewise, those contributed by connected entities are (C(A/E)/2) = ((2R/E)(A/E)/2) = RA / E2
Therefore, Fields per transaction (FPT) = A/E + RA / E2
We make the following further assumptions:
1. Each create / update transaction involves FPT inputs and nominal (1) output
2. Each read transaction involves nominal (1) input and FPT outputs
3. Each delete transaction involves nominal (1) input and output
4. Half of total transactions are create / update
5. Quarter of total transactions are read
6. Quarter of total transactions are delete
And of course, each transaction, by definition has EPT entities.
Therefore cumulative input count may be arrived at as sum of input count from various types of transactions
= (FPT)(T/2) + (T/4) + (T/4) = (FPT)(T/2) + (T/2) = (FPT +1 )(T/2)
Substituting FPT = A/E + RA / E2 and T = 4E, we get:
I = (A/E + RA / E2 + 1) (2E)
= 2(A + E + RA/E)
O = (T/2) + (FPT) (T/4) + (T/4)
= (4E/2) + (A/E + RA / E2 ) (4E/4) + (4E/4)
= 2E + A + (RA/E) + E
= 3E + A + (RA/E)
ET = (EPT) (T)
= ((2R+E) / E)(4E)
The final formula for unadjusted function points,
UFP = 0.58 x I + 1.66 x ET + 0.26 x O
therefore transforms into
0.58 x (2(A + E + RA/E)) + 1.66 x (8R +4E) + 0.26 x (3E + A + (RA/E))
= (0.58 x 2) (A + E + RA/E) + (1.66 x 4) (2R + E) + 0.26 x (3E + A + (RA/E))
= 1.16 (A + E + RA/E) + 6.64 (2R + E) + 0.26 (3E + A + (RA/E))
= A (1.16 + 0.26) + E ( 1.16 + 6.64 + 0.26 x 3) + R (6.64 x 2) + (RA/E) (1.16 + 0.26)
= 1.42 A + 8.58 E + 13.28 R + 1.42 (RA/E)
= 1.42 A (1 + (R/E)) + 8.58 E + 13.28 R
In the last article, we took case of the book lending facility at a public library for illustration purposes. I mentioned that this simple example does not go further than subsystem level in the product hierarchy. We noted that it was a bad idea to take a subsystem in isolation and that we did so to simplify the illustration and to allow me to make a point or two regards definition of attributes and scope & control of interrelated subsystems. I’ll comment on this in the course of discussion that follows.
Here I do not intend to re-narrate the entire case, and the reader, who is expected to have assimilated the last article, can refer back to it for details of the case. I’ll however repeat the view of important entities, taken from the last article. Expressed in UML-like notation and without deliberate denormalization, except in one instance, it looks as shown in fig. 1.
Fig. 1. : A view of important data elements
Having got a formula based on entities (E), attributes (A) and relationships (R), your first impulse will be to count all of these as they appear in the entity view. Do it just for the sake of curiosity and see what happens. You may be startled by the generalization relationship, so note that the abstract parent entity should be considered invisible thus viewing the diagram as in Fig. 2.
Fig. 2. : Important data elements after removing generalization and abstract parent
At first glance, we may count the following 7 entities with a total of 8 relations between them and each entity carrying attributes as shown in table 1.
Table 1. : Entities and attributes at first glance
So if we proceed without a second thought,
We’ll take A = 40, E = 7, R= 8
This will give us UFP
=1.42 A (1 + (R/E)) + 8.58 E + 13.28 R
= (1.42 x 40 x (1 + 8/7)) + (8.58 x 7) + (13.28 x 8)
= (56.8 x 15 / 7) + 60.06 + 106.24
= 121.71 + 60.06 + 106.24
=288.01 ~ 288
How does this compare with our earlier estimate arrived from the more accurate Function based approach?
288 v/s 79, i.e. more than 3.5 times, or in other words, over 250% off target. Not good at all. Your first thought may be on justification of counting Present and Past Borrowings as two entities, when, in the Function based discussion, we have mentioned these as an update on the entity Borrowing. Some of us may likewise question the justification of counting foreign key identifiers as attributes, when all they are doing is helping maintain the relationships. So let’s stop counting foreign keys and replace the two types of borrowings with one entity Borrowing with an additional attribute IsPast. This will transform Table 1 into Table 2.
Table 2. : Entities and attributes after normalizing Borrowing and ignoring foreign keys
Therefore A=29, E=6, R= 6
UFP = (1.42 x 29 x (1 + 6/6)) + (8.58 x 6) + (13.28 x 6)
= 82.36 + 51.48 + 79.68
=213.52 ~ 214
i.e. some 2.7 times.
Even if we were to remove all identifiers, our attributes will come down by 6 and UFP will come down by ~ 18, not much of a dent as the end result will still be ~ 2.5 times. The difference clearly emerges from somewhere else.
This is where I would like you to recall my passing remarks on attributes and on scope of system (in our case, subsystem) & its control over data.
In our case, we have carved out an awkward piece of the bigger system, taking many entities in to make the data representation intelligible. But do we have control over these?
For a starter, consider Borrower with its 12 attributes. Does our system manipulate the borrower? Does it have use of individual data, such as DoB, stored as attributes of Borrower? What is the Borrower in context of our system? It is just an input in the form of Borrower ID. All our system does is to query borrower ID passed from borrower recognition system to verify that it corresponds to a valid borrower. So at the most, the Borrower would be treated as a Single Attribute Entity if our subsystem is to perform that interrogation on the borrower database. A good library management system will have a separate identification and verification system for Borrower and Borrowable Item that will be outside the scope of our subsystem. Likewise the entity Book is quite immaterial to the lending facility. May be the details could come on the console for the issuing person to manually check, but it is really not a necessary entity to our subsystem. In any case, our subsystem ‘Holds’ and controls only Borrowing, Fine and Total Fine, and in this respect may be concentrated on entities in Fig. 3.
Fig. 3. : Entities Held and controlled by the lending facility subsystem
Between them, they have E=3, R=2, A=10 excluding foreign keys.
UFP = (1.42 x 10 x (1 + 2/3)) + (8.58 x 3) + (13.28 x 2)
= 23.67 + 25.74 + 26.56
= 75.97 ~ 76
Good match with 79.
Further, say the analyst, who is required to have some knowledge of how things work, realizes that similarity and mutual exclusiveness of the two physically distinct Borrowing entities allows us to treat them as a single entity with an additional field, the additional field simulating slight addition to development efforts on these two alike entities as compared to a single one. So let’s add 1 to A, making it 11.
UFP = (1.42 x 11 x (1 + 2/3)) + (8.58 x 3) + (13.28 x 2)
= 26.03 + 25.74 + 26.56
= 78.33 ~ 78
Even closer match! The tiny gap left (which is of less than 1% if you take the two numbers before rounding off, 78.33 and 78.78) is attributable to not considering the rudimentary functionality of validating borrower id and item id. Here, both entities, in context of our subsystem, are one attribute entities, and they are subjected to query only, i.e. the least intensive transaction type that accounts for only a fourth of total transactions possible and which, in any case, needs to be replicated in other subsystems that own this data.
Of course, the data approach does not claim such a close match and even a variance of 10% would have been a good rough estimate or a validator to avoid gross estimation blunder.
In this article we have seen the fundamentals and logic associated with the Data-based estimation approach. It is recommended that this approach is used as a preliminary (and not primary, in the sense of ‘most important’) estimator or to validate that no gross blunder is committed in executing the Function-based approach. The Data-based approach owes both its speed and inaccuracy to the liberal level of assumptions that it entertains. It patently does not directly observe data flow nor account for derived or intermediate data. It can, at the same time, assume a more central role in maintenance activity, given the stability of data structure and typically higher knowledge of domain held by the analyst. If there is just enough time to squeeze-in a not too detailed estimate (which is unfortunate, as a project that fails to complete, achieves nothing for most things that it does right) and if the scenario is one of Greenfield development, the shortcut FB approach, commented on in the last article, may prove somewhat more accurate compared to Data-based approach.
In the next article, we’ll tie up a few loose ends related to correlation between estimation and data modeling. In particular, the last article will cover the issues of derived data and the logical business model. Finally, the article will conclude by creating a hybrid out of the function-based and data-based approaches by expanding the data model in context of system functionality.
Recent articles by Amit Bhagwat
Amit Bhagwat - Amit Bhagwat is an information architect and visual modeling enthusiast, in the thick of object oriented modeling and its application to information systems. He has developed a specialised interest in applying techniques from pure sciences to data modeling to get the best out of MIS / BIS. He is an active member of the precise UML group and happens to be the brain father of Projection Analysis - http://www.inconcept.com/JCM/June2000/bhagwat.html - and Event Progress Analysis - http://www.tdan.com/special003.htm - techniques. He also maintains contents for the celebrated Cetus Links - http://www.cetus-links.org - in Architecture and Design areas. He shares a variety of other interests including photography, poetry and sociological studies. Explore some of his work at: http://www.geocities.com/amit_bhagwat/