UML as a Data Modeling Notation, Part 2
Published: October 1, 2008
This article, Part 2 of a three-part series, addresses sub-types, domains and unique identifiers, as well as what in UML should not be used in a data model.
This series of articles is in three parts. Part 1 set the stage, describing the basic differences between UML and the various entity/relationship modeling notations – and how they can be reconciled. This article, Part 2 of the series, goes into more detail, addressing sub-types, domains and unique identifiers, along with what in UML should not be used in a data model. And, since the whole point of preparing a data model (regardless of notation) is to present it to management for validation, Part 3 will discuss the aesthetics of preparing and presenting data models – no matter what notation is used.
This series has two audiences: data modelers who have been convinced that UML has nothing to do with them, and UML experts who don’t realize that data modeling really is different from object modeling (and that the differences are important).
SUPER-TYPES AND SUB-TYPES
When instances of an entity class can be divided into two or more entity classes and each of the subordinate classes “inherit” the attributes and relationships of the original entity class, these subordinate classes are called sub-types. The parent class is called a super-type. For example, in Figure 1 (in the Barker-Ellis notation), PERSON and ORGANIZATION are sub-types of PARTY. That is, every instance of PARTY is an instance of either a PERSON or an ORGANIZATION. ORGANIZATION, in turn, is a super-type of INTERNAL ORGANIZATION, GOVERNMENT, COMPANY, GOVERNMENT AGENCY, POLITICAL ORGANIZATION and HOUSEHOLD.
Note that an entity class is designated as a sub-type of another if its box is inside the other’s box.
Figure 1: Sub-Types in Barker-Ellis Notation
Figure 2 shows the way UML (and other entity/relationship modeling notations, for that matter) represents sub-types: outside the super-type box, attached via specialized arrows.
Figure 2: Sub-Types in Conventional UML
The problem with this notation is that if you have many subtypes, your diagram quickly fills up, distracting from other, often more important, structures. Moreover, as the nesting gets deeper and deeper, it is progressively more difficult to see that an instance of the sub-type is in fact an instance of the super-type several levels up. The attributes and relationships at the upper levels are not obviously attributes and relationships of the lower level sub-types.
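The point about inheritance across several levels can be made concrete with a small sketch. This is only an illustration in Python, not part of either notation: the attribute names ("party_id", "name", "stock_symbol") are hypothetical, and only PARTY, ORGANIZATION and COMPANY come from the figures.

```python
from dataclasses import dataclass, fields


@dataclass
class Party:
    # Attribute defined at the top of the hierarchy (hypothetical name).
    party_id: int


@dataclass
class Organization(Party):
    # Attribute added one level down (hypothetical name).
    name: str


@dataclass
class Company(Organization):
    # Attribute specific to the lowest-level sub-type (hypothetical name).
    stock_symbol: str


acme = Company(party_id=1, name="Acme", stock_symbol="ACME")

# A COMPANY is an ORGANIZATION, and therefore also a PARTY two levels up.
is_party = isinstance(acme, Party)

# The attributes of every super-type are attributes of the sub-type as well.
inherited = [f.name for f in fields(acme)]
```

Nothing in the declaration of Company mentions "party_id", which is exactly the difficulty described above: the reader has to walk up the hierarchy to discover it.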
Taking a hint from the Barker-Ellis notation, most UML tools do in fact allow you to display sub-type boxes inside the super-type ones. To do this, first create the lines showing the structure, and then move the sub-type boxes inside the super-type boxes. You can now delete the lines that, as graphic objects, represent the sub-type relationship. (Do not delete the underlying generalization relationship.) The result for our example is shown in Figure 3.
Figure 3: Sub-Types in ER UML
In the approach to entity/relationship modeling described in this article, your authors add three constraints on the treatment of super-/sub-types:
Note that these constraints are not followed by all modelers in the entity/relationship modeling community, either. Some would permit an instance of a super-type to be an instance of more than one sub-type. Beyond such simple cases, some modelers also make use of “multiple-type hierarchies” (called “generalization sets” in UML), where a super-type has more than one set of sub-types. This is better described by the “categorization model,” described below.
Similarly, it is common to display only some of the sub-types that actually exist. This is acceptable in the course of a presentation to build up the model slowly, but eventually all sub-types have to be accounted for. (Okay, you can finesse this rule by including an entity class OTHER … .)
Multiple inheritance is controversial in both the object-oriented and data modeling worlds. Your authors contend, however, that every time it looked as though multiple inheritance was necessary, looking at the model differently removed the need.
But these constraints are important. They ensure that the classification shown in sub-types is fundamental. In this example, it is not possible for someone to be both a GOVERNMENT and a COMPANY.
There are more flexible ways to categorize things, of course, but these can be represented in a data model without using sub-types. Figure 4 adds the entity classes PARTY CATEGORY and PARTY CATEGORIZATION. A PARTY CATEGORIZATION is the fact that a particular PARTY falls into a particular PARTY CATEGORY for a period of time. That is, each PARTY CATEGORIZATION must be of exactly one PARTY into exactly one PARTY CATEGORY. The PARTY CATEGORIZATION is effective on an “Effective date,” and ceases to be effective on an “Until date.” Each PARTY CATEGORIZATION must also be by exactly one PARTY, the one making the categorization.
For example, a PARTY CATEGORY that would apply to PERSON could be “Income Level.” This might be defined by the INTERNAL ORGANIZATION “Market Research Department.” The PERSON “David Letterman” then would be subject to PARTY CATEGORIZATION into the PARTY CATEGORY with the “name” “Over $500,000.” This would be according to the PERSON “Sam Sneed,” who happens to be Mr. Letterman’s gardener.
Figure 4: Categorization
Note that this structure allows a PARTY to be categorized into multiple PARTY CATEGORIES, and further allows for that PARTY CATEGORIZATION to change over time.
Also note that PARTIES may be subject to different PARTY CATEGORIZATIONS by different PARTIES, each for its own purpose. For example, the INTERNAL ORGANIZATION “Market Research Department” might place the HOUSEHOLD “Hay family” in a different PARTY CATEGORY than the INTERNAL ORGANIZATION “Sales” does. Moreover, each PARTY CATEGORY must be defined by exactly one PARTY. The set of PARTY CATEGORIES that is of interest to Market Research may be very different from the set of PARTY CATEGORIES that is of interest to Accounting. A PARTY must be appointed as a steward for every PARTY CATEGORY.
This is a very different approach to categorization than is sub-typing, but if you are looking for multiple inheritance or multiple type hierarchies, this is the way to go.
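For readers who think more easily in code, the categorization structure might be sketched as follows. This is a rough Python illustration under stated assumptions, not a rendering of Figure 4: the field names, the "Preferred Customer" category, and all of the dates are hypothetical, and the function categories_on is invented here to show how the two dates would be used.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Party:
    name: str


@dataclass(frozen=True)
class PartyCategory:
    name: str
    defined_by: Party  # each PARTY CATEGORY is defined by exactly one PARTY


@dataclass(frozen=True)
class PartyCategorization:
    of: Party                # the PARTY being categorized
    into: PartyCategory      # the PARTY CATEGORY it falls into
    by: Party                # the PARTY making the categorization
    effective_date: date
    until_date: Optional[date] = None  # None: still in effect


def categories_on(party, categorizations, on):
    """Return the names of the categories a party falls into on a date."""
    return [
        c.into.name
        for c in categorizations
        if c.of == party
        and c.effective_date <= on
        and (c.until_date is None or on < c.until_date)
    ]


market_research = Party("Market Research Department")
sales = Party("Sales")
letterman = Party("David Letterman")
sneed = Party("Sam Sneed")

over_500k = PartyCategory("Over $500,000", defined_by=market_research)
preferred = PartyCategory("Preferred Customer", defined_by=sales)  # hypothetical

categorizations = [
    # Dates are invented for illustration only.
    PartyCategorization(letterman, over_500k, by=sneed,
                        effective_date=date(2008, 1, 1)),
    PartyCategorization(letterman, preferred, by=sales,
                        effective_date=date(2007, 1, 1),
                        until_date=date(2008, 6, 1)),
]

in_march = categories_on(letterman, categorizations, date(2008, 3, 1))
in_october = categories_on(letterman, categorizations, date(2008, 10, 1))
```

In March the party is in two categories at once; by October one of them has lapsed. Neither situation requires a change to the sub-type structure of PARTY.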
Domain is a concept with an evolving definition, a history of weak implementation in relational systems, and it addresses issues that are addressed differently in the object-oriented vocabulary. Barker defines a domain as follows:
“A set of business validation rules, format constraints, and other properties that apply to a group of attributes: for example:
“Note that attributes and columns in the same domain are subject to the same validation checks.” [Barker 1989, p. Gl-3]
In addition to the three things above, a domain can also be described by:
Some entity/relationship-oriented CASE tools have explicit support for documenting domains behind the scenes, as part of an attribute’s documentation – others, less so.
A datatype is, in effect, a simple domain. If we require that a value of an attribute be an integer, we are assigning it to a datatype. If we require that a value of an attribute be a positive integer between 1 and 10 (inclusive), then we are assigning it to a domain.
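The distinction can be sketched in a few lines of Python. The function names are hypothetical, and "rating" is simply an assumed name for an attribute whose domain is the integers 1 through 10.

```python
def is_integer(value):
    """Datatype check: the value only has to be an integer."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def in_rating_domain(value):
    """Domain check: an integer between 1 and 10, inclusive."""
    return is_integer(value) and 1 <= value <= 10
```

The value 42 satisfies the datatype but not the domain; the value 10 satisfies both.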
UML does not have what entity/relationship modelers are accustomed to calling “domains.” Its concept of “datatype” however, is extensible and can be used in the same way.
The only caution is that in a conceptual entity/relationship model, a value set for a domain is a list of meanings. This is different from the code set that constrains the values a column in a database may take. A value set, for example, could be the States in the United States, which is then effectively implemented via several code sets in a database. Corresponding code sets could consist of names, two-character post office abbreviations, the older set of four-letter abbreviations, sequence numbers and so forth. One of the code sets (like “state names”) is designated “primary,” to be used for revealing the value set, but it is still not the same thing as the value set.
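A small sketch may help separate the two ideas. It assumes, purely for illustration, a value set of two states and two code sets; the names State, STATE_NAMES, POSTAL_ABBREVIATIONS and decode are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    """Value set: each member stands for a meaning, not for a stored code."""
    TEXAS = auto()
    OHIO = auto()


# Code sets: alternative encodings of the same value set.
STATE_NAMES = {State.TEXAS: "Texas", State.OHIO: "Ohio"}  # the "primary" set
POSTAL_ABBREVIATIONS = {State.TEXAS: "TX", State.OHIO: "OH"}


def decode(code, code_set):
    """Find the meaning that a code stands for within one code set."""
    for state, stored in code_set.items():
        if stored == code:
            return state
    raise ValueError(f"{code!r} is not in this code set")


same_meaning = decode("TX", POSTAL_ABBREVIATIONS) is decode("Texas", STATE_NAMES)
```

Two different codes resolve to one meaning, which is why the primary code set reveals the value set without being the same thing as the value set.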
In both entity/relationship modeling and UML, an alternative to specifying a domain is to represent it as a “reference” entity type. For example, INTERNAL ORGANIZATION might have had the attribute “Internal Organization Type” with a domain that is a list of values such as “Division,” “Department,” “Section,” etc. This could be documented in the definition of the attribute, or it could be shown as an entity class, as in Figure 5. This is a solution for both entity/relationship and UML modeling.
Figure 5: Entity Classes as Domains
One difficulty with this solution, however, is that, while the list of values is data, the list of values is in fact fundamental to the meaning of the attribute. UML actually has a nifty feature to address this. Instead of defining it as a simple entity class, the list of values can be defined as an enumeration. Figure 6 shows INTERNAL ORGANIZATION TYPE as an enumeration. Instead of attributes being displayed in the entity class box, the “name” of the instances is shown. Note that an attribute (in this case “Internal organization type”) has to be present in INTERNAL ORGANIZATION to refer to it, but where attributes (“Name,” “Description”) would be shown on the class box, the list of values (“Department,” “Division,” “Section”) is shown instead.
This doesn’t translate well to the relational database world, but it is excellent for displaying the concepts, which is, after all, what we are here to do.
Note, by the way, that “Department,” “Division” and “Section,” shown as values of INTERNAL ORGANIZATION TYPE, could as easily be sub-types of INTERNAL ORGANIZATION, but the approach shown here provides more flexibility should the organization structure change significantly.
Figure 6: Enumeration
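Many programming languages have a construct close to the UML enumeration. The following Python sketch is one possible reading of Figure 6, not a translation of it: the member names and the attribute name "internal_organization_type" are assumptions based on the text above.

```python
from dataclasses import dataclass
from enum import Enum


class InternalOrganizationType(Enum):
    # The list of values is part of the definition, not data to be loaded.
    DEPARTMENT = "Department"
    DIVISION = "Division"
    SECTION = "Section"


@dataclass
class InternalOrganization:
    name: str
    # The attribute that refers to the enumeration.
    internal_organization_type: InternalOrganizationType


market_research = InternalOrganization(
    "Market Research Department", InternalOrganizationType.DEPARTMENT
)
allowed_values = [t.value for t in InternalOrganizationType]
```

A value outside the list, such as InternalOrganizationType("Bureau"), raises a ValueError, which reflects the idea that the list of values is fundamental to the meaning of the attribute.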
Unnecessary UML Elements
Okay, you now have the basic elements required for an entity/relationship model: the entity class, the relationship and the attribute.
UML has a number of features to describe concepts important to object-oriented designers and programmers. These are features that have no place in a conceptual entity/relationship diagram. They include:
Something Missing from UML
In an entity/relationship model, a unique identifier distinguishes each instance of an entity class from every other instance. It may be the value of one or more attributes, or it may be a combination of attribute values and roles attached to instances of other entity classes. For example, in Figure 7, PROJECT and PERSON are called reference entity classes, since they have no mandatory relationships with any other classes. They are also called independent entity classes. For these, unique identifiers are always attributes. That is, for PROJECT, the judgment was made that in this organization all project names are unique, so the attribute “Name” can be used to identify instances of PROJECT. On the other hand, PERSON is simply identified by an attribute that is a “surrogate” (automatically generated) identifier: “Person ID.”
Figure 7: Projects
Identifying attributes are indicated on the drawing by the octothorpe (#) next to the attribute name.
Dependent entity classes are those which have mandatory relationships with one or more other entity classes. Typically, their unique identifiers include roles with other entity classes. Instances of CONSTRAINED PROJECT ASSIGNMENT in Figure 7, for example, are not identified by attributes at all. Instead, each instance of CONSTRAINED PROJECT ASSIGNMENT is identified by the PERSON it is of and the PROJECT it is to. Each relationship involved is designated as an identifying relationship (okay, “role”) by a line across the relationship line (–| – ) at the end next to the identified entity class, as shown in Figure 7.
In the object-oriented world, there is no concept of a “natural” or meaningful identifier. In an object-oriented program, every object in the system is uniquely identified by a generated surrogate key called an “object identifier” (known as an “OID”). Indeed, even in entity/relationship models, reference entity classes, like PRODUCT and PERSON, usually use “surrogate” identifiers (like “Person ID” in our example). In dependent entity classes, however, it is important to know whether instances are identified by all of the roles involved, just some of them, or some of them plus an attribute. The meaning of the entity class is based on that.
For example, look at the identifiers of CONSTRAINED PROJECT ASSIGNMENT. If each assignment is uniquely identified by the PERSON and the PROJECT, no PERSON can work for the same PROJECT twice. There can be no more than one instance of “Charlie” working on the “I-10 upgrade project.” If he leaves and wants later to return to the job, he is out of luck.
This may be exactly what the organization wants to say. Using natural identifiers allows this rule to be expressed, while object identifiers would not. If this is not the enterprise’s intent, of course, the model is incorrect, and an entity class such as OPEN PROJECT ASSIGNMENT is needed instead. Instances of OPEN PROJECT ASSIGNMENT are identified not only by the roles they play, but also by the value of the attribute “Scheduled start date.” Thus if Charlie worked for one period that was scheduled to start on January 25, 2007, actually stopped working (“Actual stop date”) on June 15, 2007, and then returned to the project November 3, 2007, this would be no problem. Again, this rule is much more explicit using the entity/relationship approach than it could be with conventional UML.
It may be the case that “Scheduled start date” is not always known, so it cannot practically be used in the unique identifier. In that case, another kind of surrogate key, a “Sequence number” could be added to the entity class, and the unique ID would then be the roles plus the “Sequence number.” This opens up other possibilities. It may be that the business is mostly concerned with who works on the project and wants to set up a sequence number series for each project. In that case, the role “to one PROJECT” and the attribute “Sequence number” would be sufficient. The second assignment of Charlie to this PROJECT could be identified as the “fifth OPEN PROJECT ASSIGNMENT” to this PROJECT.
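The effect of choosing one unique identifier over another can be sketched in Python. This is an illustration under assumptions, not anything prescribed by the notation: the class AssignmentRegistry and the dictionary keys are hypothetical, and the second assignment is assumed to have been scheduled to start on the day Charlie returned.

```python
from datetime import date


class AssignmentRegistry:
    """Stores assignments, rejecting any that repeat a unique identifier."""

    def __init__(self, identifier):
        self._identifier = identifier  # function: assignment -> tuple
        self._rows = {}

    def add(self, assignment):
        key = self._identifier(assignment)
        if key in self._rows:
            raise ValueError(f"duplicate identifier {key}")
        self._rows[key] = assignment

    def __len__(self):
        return len(self._rows)


def by_roles(a):
    # CONSTRAINED PROJECT ASSIGNMENT: identified by the two roles only.
    return (a["person"], a["project"])


def by_roles_and_start(a):
    # OPEN PROJECT ASSIGNMENT: the roles plus "Scheduled start date".
    return (a["person"], a["project"], a["scheduled_start"])


first = {"person": "Charlie", "project": "I-10 upgrade project",
         "scheduled_start": date(2007, 1, 25)}
second = {"person": "Charlie", "project": "I-10 upgrade project",
          "scheduled_start": date(2007, 11, 3)}  # assumed scheduled date

constrained = AssignmentRegistry(by_roles)
constrained.add(first)
try:
    constrained.add(second)
    charlie_can_return = True
except ValueError:
    charlie_can_return = False  # same PERSON and PROJECT: rejected

open_assignments = AssignmentRegistry(by_roles_and_start)
open_assignments.add(first)
open_assignments.add(second)  # accepted: the start dates differ
```

The data are identical in both cases. Only the identifier function differs, and with it the business rule being asserted. An identifier built from the PROJECT role and a "Sequence number" would be a third function of the same shape.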
Note that the concept of unique identifier is carried through to relational database design as a primary key. This designates one or more columns to identify rows in a table. Roles are carried into relational tables by converting each many-to-one relationship into foreign key columns in the table for the entity class that is the subject of the role. Each of these columns corresponds to a primary key column in the object entity class.
This means that a primary key can consist both of columns native to the table and columns that are foreign keys to other tables. This exactly implements the entity/relationship concepts of a unique identifier consisting of both attributes and roles.
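As a hedged illustration of that mapping, the following Python sketch uses the standard sqlite3 module to build one possible relational design for OPEN PROJECT ASSIGNMENT. The table and column names are assumptions derived from the entity class and attribute names in the text, and dates are stored as ISO text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces these only on request
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY              -- surrogate identifier
);
CREATE TABLE project (
    name TEXT PRIMARY KEY                      -- natural identifier
);
CREATE TABLE open_project_assignment (
    person_id            INTEGER NOT NULL REFERENCES person (person_id),
    project_name         TEXT    NOT NULL REFERENCES project (name),
    scheduled_start_date TEXT    NOT NULL,
    actual_stop_date     TEXT,
    -- Two foreign key columns (the roles) plus one native column.
    PRIMARY KEY (person_id, project_name, scheduled_start_date)
);
""")

conn.execute("INSERT INTO person VALUES (1)")
conn.execute("INSERT INTO project VALUES ('I-10 upgrade project')")
conn.execute(
    "INSERT INTO open_project_assignment VALUES (?, ?, ?, ?)",
    (1, "I-10 upgrade project", "2007-01-25", "2007-06-15"),
)
conn.execute(
    "INSERT INTO open_project_assignment VALUES (?, ?, ?, ?)",
    (1, "I-10 upgrade project", "2007-11-03", None),
)
assignment_count = conn.execute(
    "SELECT COUNT(*) FROM open_project_assignment"
).fetchone()[0]
```

A third row repeating the same person, project and scheduled start date would be rejected with sqlite3.IntegrityError, as would a row naming a person or project that does not exist.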
Because of its object-oriented heritage, UML does not have a symbol for unique identifier. But to create an entity/relationship model using this notation, such identifiers are needed. Fortunately, UML has a facility for extending itself called the stereotype. This allows symbols to be created (from text) to encompass any concept needed. In this case, what is needed is a symbol for “identifier” that can be applied to either attributes or roles. For our purposes, your authors have invented “<<ID>>”, as shown in Figure 8.
Figure 8: Projects in UML
In Part 3 of this series, we will approach the aesthetic principles to be followed in presenting models in any notation.
Barker, R. 1989. CASE*Method: Entity Relationship Modeling. (Wokingham, England: Addison Wesley).
Box, G. E. P., and Norman R. Draper. 1987. Empirical Model-Building and Response Surfaces. (Wiley). P. 424.
Hay, D. 1999. “UML Misses the Boat,” East Coast Oracle Users' Group: ECO 99 (Conference Proceedings / HTML File). Apr 1, 1999.
Hay, D. 2003. Requirements Analysis: From Business Views to Architecture. (Upper Saddle River, NJ: Prentice Hall PTR).
Miller, G. A. 1956. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,” The Psychological Review, Vol. 63, No 2 (March, 1956).
Martin, J., and James Odell. 1995. Object-Oriented Methods. (Englewood Cliffs, NJ: Prentice Hall).
Page-Jones, M. 2000. Fundamentals of Object-Oriented Design in UML. (New York: Dorset House). Pp. 233-240.
Rumbaugh, J., Ivar Jacobson, Grady Booch. 1999. The Unified Modeling Language Reference Manual.
David C. Hay - In the information industry since it was called “data processing,” Dave Hay has been producing data models to support strategic and requirements planning for more than twenty-five years. As President of Essential Strategies, Inc. for nearly twenty of those years, Dave has worked in a variety of industries including, among others, banking, clinical pharmaceutical research, broadcasting, and all aspects of oil production and processing. Projects entailed various aspects of defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems.
Dave’s recently published book, Enterprise Model Patterns: Describing the World, is an “upper ontology” consisting of a comprehensive model of any enterprise from several levels of abstraction. It is the successor to his groundbreaking 1995 book, Data Model Patterns: Conventions of Thought – the original book describing standard data model configurations for standard business situations.
In between, he has written Requirements Analysis: From Business Views to Architecture (2003) and Data Model Patterns: A Metadata Map (2006). Since he took the unusual step of using UML in the Enterprise Model Patterns… book, a follow-on book, UML and Data Modeling: A Reconciliation was published later in 2011. This book both shows data modelers how to adapt the UML notation to their purposes, and UML modelers how to adapt UML to produce business-oriented architectural models.
Dave has spoken at numerous international and local DAMA, semantics, and other conferences as well as at various user group meetings. He can be reached at firstname.lastname@example.org, (713) 464-8316, or via his company's website.
Michael J. Lynott - Mike has been doing data modeling and database design since his introduction to the world of databases in the early '80s. He was part of Oracle Corporation's introduction into Computer-Aided Systems Engineering (CASE) and has been a leading expert in conceptual data modeling ever since. After that, he was a consultant with eTransitions of New Jersey, working with renowned author and consultant Ulka Rodgers. In recent years, he has been senior enterprise information architect for a large retailer in Boise, Idaho. Mike has written a number of papers for various publications and conferences.