Modeling Data and Designing Databases
Published: January 1, 2007
Published in TDAN.com January 2007
The objective of this paper is to present an approach to the comprehensive, efficient and effective process of creating data models through mission, database domain, and localized entity-relationship modeling. The paper then proceeds with an overall process and check list for designing databases.
At first, it might seem strange to separate the process of modeling data from the process of designing databases. Not only are these topics really different, but if the first is not done before the second then not only will you will not only have a bad database design but you will also have a high probability of databases that are not-integrated, that are redundant, and that definitely possess conflicting semantics. These are all the inverse set of desired characteristics.
A school system needs to build a database application (an application where the database is the core component). The problem domain is to track individuals who are being trained to be Certified Medical Technicians (CMT). The school system has about 140,000 students.
There are about 210 health rooms across the system, and whenever a "nurse" is absent from a school, then a "certified medical technician" has to handle the dispensing of medicine, medical record keeping, and the like. "Nurse" in quotes because a nurse can be a CNA (certified nursing assistant), an LPN (License Practical Nurse), or an RN (Registered Nurse). A CMT can only be supervised by an RN.
Schools are responsible for nominating CMT candidates. They must pass a reading, writing, and arithmetic test. If the test is failed the CMT candidate cannot come back to the class for 90 days. The CMT nominee must "sit" for a 40 hour class. There are three tests during the class, and a practical demonstration-based exam at the end. If any test is failed the CMT candidate is "expelled" and they have to retake the entire course starting with the entrance test.
There is also a follow-up review and direct observation of every CMT every 45 school days. After the first 45 day period, the CMT candidate can then apply to the State to then become a Certified Medical Technician. If thereafter, the CMT fails the follow-up review and/or any direct observation, the supervising RN must report that failure to the State. The State may then de-certified the CMT. If de-certified, the process begins again from the entrance test. Giving the wrong drugs or dosage to a student, or failing to properly document medical treatments is a serious matter, and our State takes it seriously.
The task is to build the database application against these requirements. A Nurse Manager of the school health organization chose MS/Access to build the application. Since the Nurse Manager was told that MS/Access is really simple, and since she is a subject matter expert, she just fired up Access on her computer and started to type in the database's design. One table, of course. As she discovered something she forgot, she just modified the database's one table design, one column at a time.
After about nine months, the one-table database in MS/Access was complete. The Nurse Manager was very proud of her accomplishment. She then started to type in data for about 300 records of CMTs in various stages and states. Of course "she" herself is required to be present to understand and explain each and every row of data. There are status codes everywhere, some of the status codes are related one to the other. There is no history. All the obvious questions were then asked. That is, those related to the underlying process, test re-takes, test failures, observations, re-certifications, CMT moves from one school to the next, changes in RN supervisors, refresh of names, addresses, schools, from the school system's databases, and the like. When the Nurse Manger was unable to deal with these questions she started to feel "inadequate."
A question was then asked, "Where's your model of the data?" She stated, "Oh, right here." She showed the MS/Access database table. It had never occurred to her that a model of the data is different than a database's design. The next two hours were then spent going over how to figure out the model of the data. At the end of that time, about 15-20 entities were identified and related. "Yikes," she exclaimed, "How am I going to be able to do that in Access?" She went away sad because she then knew that her requirements were becoming what she clearly did not know how to do. Stating that a database's design is not a data model would have been like pouring salt on a open wound. What should she have done? That's the subject of this short paper. This paper is written as if she had all the necessary tools to accomplish all the steps. Will she have these tools? Will the school system will expend the resources to make "simple, little applications" like these practically possible. Good, sophisticated tool sets like Clarion for Windows make both client/server and Internet based applications like these very practical and cost effective to accomplish.
2. Topics Covered
The topics is this paper include:
3. How to Discover the Data Model
The overall process of modeling data is provided in Figure 1. Instantly you may be asking, "Why are functions and organizations involved in modeling data?" Well, if you don't know, in a general way, what your going to do with the data, nor who's going to be using it, then how can you know if you've got the right model? Some suggest that you are just supposed to cover a whole wall in a 50x 30 room with butcher block paper and then, put a rectangle in the upper left corner, write "Customer," and then wait for divine intercession. Some have grown old and retired waiting for the revelations.
A careful review of the process model in Figure 1 shows that neither "schema" nor "database" (other than database domains) is specified. That's because it's a process to model data. Thereafter, the data model can be employed to help create database schemas that are, of course, an essential requirement for database-centric applications. Provided here are summary descriptions of these data modeling processes, and reasons why they are performed.
Missions are the idealized descriptions of what the enterprise is all about, not just what is done on a day to day basis with respect to a specific narrow function. Identification and specification of the relevant missions is critical. In the case study, the focus was on the database's design to accomplish known processes. There was no overall understanding of how these collections of processes fit into the overall mission of the school system.
Because she was focused only on the database's design, she missed the opportunity to identify all the different objects (in the broadest sense of the word) that should have been involved. Her focus was on the CMT only. Mission isn't the sole answer, however. Rather, missions are the context within which the data model answer is derived.
Organizations accomplish aspects of missions with databases, information systems and functions. In the case study, the enumeration of the various organizations serve to identify the groups that have to be involved in getting, maintaining, and using the information. For example, CMTs are transferred from one school to the next. Knowing that, and knowing that probably the last thing on the mind of the transferred CMT is to notify someone in the school system's Health Services of the transfer, how to know the current CMT's school is a real problem. The school system's HR organization should have the most current assignment for a person. Consequently, a feature of the resulting CMT system would be to access, on a weekly basis, the HR system's database, and validate, CMT by CMT whether the current CMT assigned location is the correct one. If it is then OK. Otherwise that school's Assigned CMT records would have to be flagged as no longer having an available CMT.
Again, organization, like mission, is not the answer to knowing the full set of requirements. Rather, it too is a contextual element and provide the identification of groups that have to be involved in the full set of requirements for a complete solution.
Functions are the procedures that are performed by groups as they accomplish the various missions of the enterprise from within different enterprise organizations. This is where the actual activities are specified that need to be performed. Setting out and defining these functions cause the identification of areas of data, data elements, states, exceptions and the like involved in the CMT effort. Functions are, of course, all identified within the scope of the mission-organizations that result from the previous step. These functions directly lead to the set of processes, automated and manual, that are needed in some way for a complete CMT solution.
3.4 Prior Data and Reports
Prior data and reports generally refer to the set of all data that must be involved in the application area, and by inference in the data model's design. In the case study, there were State requirements for CMT applications, transactions, and reporting. There were also data requirements to support the individual CMT person such as biographic data, assignments, tests, scores, CMT progress states, and the like. All these data need to be identified and analyzed within the context of the missions, organizations, and functions as they directly lead to subjects, entities, attributes, and of course, relationships.
3.5 Scope Model
The above four: missions, organizations, functions and prior data are all brought together to convey one overall understanding of just what the mission is, which organizations are involved, what activities are performed, and just what data is involved in accomplishing the effort.
Once the scope model is complete, then from a non-IT point of view, the specification is not only complete, it should also be able to be reviewed for missing items and functionality. If any are found then they should be corrected before proceeding. Correcting these errors here is very inexpensive as no databases are yet designed nor has any software been built. Once the scope model is as complete as it can be, then, the next step, database domains, can proceed quickly.
3.6 Database Domains
Database domains are text or list based noun-intensive statements. Database domain paragraphs are the precursor step to entity-relationship models. Well done, each sentence in a Database Domain text becomes a multi-entity and relationship statement. Here, there's no intended granularity to an entity. Some may end up being complex (i.e., database objects), other simpler, and some even just data elements.
Once database domains are complete, the various entity relationship diagrams are created, subject by subject. If there's enough information, then a full set of attributes are created for each entity. It's best that the entities are in third normal form so that there is a crispness to the specificity of the set of entities within each subject. Relationships can be made between entities in different subjects to provide a more uniform model of the data.
From the point of view of the case study, here are examples of database domain statements:
From these examples, entities, attributes, and relationships are very obvious. Entity-relationship models are easily drawn and validated by reviewing the subject-verb-object relationship and cardinality with subject matter experts. If any statement is missing, now is the time to add it.
3.7 Database Objects
Database objects are collections of entities (actually collections of tables within a database's schema). Full database object specification includes its data structure (i.e., a table collection); table-based add, delete, and update processes; states; and state transform specifications. But for the purposes of this effort, the database object models are employed to provide a coherent understanding of the processes that are required to operate against multiple entities so as to ensure an overall integrity.
In the case study, school system is clearly a database objects as it has multiply contained entities. The CMT class is probably a simple database object, and the Certification Date is certainly a data element. Because the entity-relationship diagrams exist that were created from the database domains, identifying the database objects, simple entities and data elements is fast, and most often correct on their first recognition.
3.8 Data Elements
During the process of creating attributes within entities, each attribute should be mapped to enterprise-wide data elements as a way to maximize semantic uniformity across differently named attributes that represent the same data element, and also to ensure a uniformity of value domains. Once data elements are created and mapped to attributes, "where-used" reports can document maximum semantic integrity.
In the case study, Certification Date would be a data element. Whether it is the First Certification Date, or a Renewal Certification Date is immaterial. They are columns or attributes that map to the data element, Certification Date.
3.9 Data Models
If the steps above are accomplished, then the data model for the CMTs will largely be complete. If the effort is part of an overall larger modeling effort, then many of the subjects, entities, attributes, and relationships will already have been created. Having these available from a metadata repository will greatly speed these steps. The remaining step is to actually create a database schema of just the relevant data specifications and then proceed through prototyping to then validate the data model.
In the case study, a very quick set of entities were discovered, and included for example:
3.10 Validation through Prototyping
This step, validation through prototyping, is critical because up to this point the solution has been almost entirely paper-based. Even if data models are in a metadata repository, they are still untested designs. Data model design testing requires a process model through which the data model can be exercised to know that it represents the necessary data in an easy-to-use way.
The first step is to make a database schema. This is accomplished by importing collections, or subsets of collections of entities into a database's schema. Once represented as a database schema, a code generator that ingests the schema can materialize an actual database-centric application.
Once the application is generated, it can be electronically "shaped" to match a desired behavior model, and then demonstrated to users. User feedback is essential because that's when you hear, "Where's the XYZ data?" Reviewing something on paper is just not sufficient. Once a review is done, and the comments are all logged, just throw the generated application away. If there are changes to the data models, then make the changes and re-generate the prototype application. Once a series of cycles are performed, eventually the application and the data model will become stable. If there are six such cycles, all within a two-month period, then that's like implementing at version 6 versus implementing at version 1 and then spending two or three years achieving the same design. For sure, this strategy is faster, cheaper, and far more effective.
In the case study, the database's schema was able to be quickly created because all the entities, attributes and relationships existed in the metadata repository. A schema with all the tables was created in one afternoon. It was reviewed. The application was generated about two hours later. Another day was spend pruning the application prior to demonstration. Then there were six sessions with relevant subject matter experts set up. One every two weeks. That was more than enough time to cycle back in the changes to the data model, the database's design, and the generated application.
4. Where to Put Your Data Model
The data models should be placed into a metadata repository that is integrated and non-redundant. The model of such a metadata repository is similar to the one in Figure 2. Each box represents a category of metadata and is sometimes called a meta-entity. This is a very high level subset representation of the metadata repository data model. The diagram shows that the obvious places where the results from steps 3.1 through 3.8 are stored. Step 3.9 produces the metadata for the subject, entity, and attribute meta-entities. Relationships among entities are not shown. Step 3.10 causes the creation of the metadata for the business information system, module, database table, and column meta-entities. Relationships among tables are not shown.
It is sometimes stated that metadata repositories are too expensive and take too long to implement. Thus, they are an impractical solution. Not true. If you are an IT data management professional, then a metadata repository capable of storing all the metadata identified in this
paper and performing all the functions this paper describes can be purchases for less than one-day's consulting fee. A fully sophisticated code generator can be purchases for less than three-days consulting fees. The products are there, the way and the strategy is there. Is the will there?
5. How to Build a Database Design
The database's design starts after Step 3.10 is complete. Existing at that point, is a functionally acceptable database design that conforms to business policy requirements. It however is most likely not complete from a Information Technology point of view. The complete iterations of database design include:
Once a database schema has been created and the database application completed, there maybe other iterations such as:
Finally special iterations of a database's design may be needed for:
These iterations are all addressed in a Whitemarsh paper, Iterations of Database Design that is available from the Whitemarsh website, www.wiscorp.com.
In terms of the case study, software systems like Clarion for Windows can interact with MS/Access databases through ODBC and JDBC. So, the original objective, to have a simple, efficient, and effective database-centered application to manage the CMT area is very practical. For sure, with the right tool-set the CMT application could have been accomplished within two months of the need being cited. But because the correct process was not known nor were the proper tools used, about 12 months of a professional's time was wasted. That wasted time, worth about $125,000 is many times more that the actual cost of doing the CMT application.
The practical application of the points made in this paper include:
Recent articles by Michael M. Gorman
Michael M. Gorman -
Michael, the President of Whitemarsh Information Systems Corporation, has been involved in database and DBMS for more than 40 years. Michael has been the Secretary of the ANSI Database Languages
Committee for more than 30 years. This committee standardizes SQL. A full list of Whitemarsh's clients and products can be found on the website. Whitemarsh has developed a very comprehensive Metadata CASE/Repository tool, Metabase, that supports enterprise architectures, information systems planning,
comprehensive data model creation and management, and interfaces with the finest code generator on the market, Clarion ( www.SoftVelocity.com). The Whitemarsh website makes available data management books, courses, workshops, methodologies, software, and metrics. Whitemarsh prices
are very reasonable and are designed for the individual, the information technology organization and professional training organizations. Whitemarsh provides free use of its materials for
universities/colleges. Please contact Whitemarsh for assistance in data modeling, data architecture, enterprise architecture, metadata management, and for on-site delivery of data management
workshops, courses, and seminars. Our phone number is (301) 249-1142. Our email address is: firstname.lastname@example.org.