TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER


UML as a Data Modeling Notation, Part 2
The Details

by David C. Hay, Michael J. Lynott
Published: October 1, 2008
This article, Part 2 of a three-part series, addresses sub-types, domains and unique identifiers, as well as what in UML should not be used in a data model.

The series of articles is in three parts. Part 1 set the stage, describing the basic differences between UML and the various entity/relationship modeling notations – and how they can be reconciled. This article, Part 2 of the series, goes into more detail, addressing sub-types, domains and unique identifiers, along with what in UML should not be used in a data model. And, since the whole point of preparing a data model (regardless of notation) is to present it to management for validation, Part 3 will discuss the aesthetics of preparing and presenting data models – no matter what notation is used.

This series has two audiences: The data modelers who have been convinced that UML has nothing to do with them; and UML experts who don’t realize that data modeling really is different from object modeling (and the differences are important).
The objective of this series is to provide all modelers with guidance as to how to produce a high-quality conceptual entity/relationship model using UML class diagram notation.

SUPER-TYPES AND SUB-TYPES

When instances of an entity class can be divided into two or more entity classes and each of the subordinate classes “inherits” the attributes and relationships of the original entity class, these subordinate classes are called sub-types. The parent class is called a super-type. For example, in Figure 1 (in the Barker-Ellis notation), PERSON and ORGANIZATION are sub-types of PARTY. That is, every instance of PARTY is an instance of either a PERSON or an ORGANIZATION. ORGANIZATION, in turn, is a super-type of INTERNAL ORGANIZATION, GOVERNMENT, COMPANY, GOVERNMENT AGENCY, POLITICAL ORGANIZATION and HOUSEHOLD.

Note that an entity class is designated as a sub-type of another if its box is inside the other’s box.


Figure 1: Sub-Types in Barker-Ellis Notation

Figure 2 shows the way UML (and other entity/relationship modeling notations, for that matter) represents sub-types: outside the super-type box, attached via specialized arrows.


Figure 2: Sub-Types in Conventional UML

The problem with this notation is that if you have many sub-types, your diagram quickly fills up, distracting from other, often more important, structures. Moreover, as the nesting gets deeper and deeper, it is progressively more difficult to see that an instance of the sub-type is in fact an instance of the super-type several levels up. The attributes and relationships at the upper levels are not obviously attributes and relationships of the lower-level sub-types.

Taking a hint from the Barker-Ellis notation, most UML tools do in fact allow you to display sub-type boxes inside the super-type ones (see End Note 1). To do this, first create the lines showing the structure, and then move the sub-type boxes inside the super-type boxes (see End Note 2). You can now delete the lines that, as graphic objects, represent the sub-type relationship. (Do not delete the underlying generalization relationship.) The result for our example is shown in Figure 3.


Figure 3: Sub-Types in ER UML

In the approach to entity/relationship modeling described in this article, your authors add three constraints on the treatment of super-/sub-types:

  • Completeness – Each instance of the super-type must be an instance of one of the sub-types. This is equivalent in UML to calling the super-type “abstract.” That is, in UML you can impose this constraint or not. In data modeling, the constraint always applies, but it can be finessed by adding a sub-type OTHER.

  • Exclusivity – No instance of the super-type may be an instance of more than one of the sub-types.

  • No multiple inheritance – Each sub-type may have only one super-type.

Note that these constraints are not followed by all modelers in the entity/relationship modeling community, either. Some would permit an instance of a super-type to be an instance of more than one sub-type. Beyond such simple cases, some modelers also make use of “multiple-type hierarchies” (called “generalization sets” in UML), where a super-type has more than one set of sub-types. This is better described by the “categorization model,” described below.

Similarly, it is common to display only some of the sub-types that actually exist. This is acceptable in the course of a presentation to build up the model slowly, but eventually all sub-types have to be accounted for. (Okay, you can finesse this rule by including an entity class OTHER … .)

Multiple inheritance is controversial in both the object-oriented and data modeling worlds. Your authors contend, however, that every time it looked as though multiple inheritance was necessary, looking at the model differently removed the need.

But these constraints are important. They ensure that the classification shown in sub-types is fundamental. In this example, it is not possible for someone to be both a GOVERNMENT and a COMPANY (see End Note 3).
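The three constraints can be sketched as simple checks. This is only an illustration, not part of the authors' method: the `SUPER_TYPE_OF` table, the function names and the "claimed sub-types" input are all assumptions made for the example, and only part of the hierarchy from Figure 1 is shown.

```python
# Hypothetical declaration of part of the hierarchy: sub-type -> its super-type.
# A dict key has exactly one value, so "no multiple inheritance" holds by construction.
SUPER_TYPE_OF = {
    "PERSON": "PARTY",
    "ORGANIZATION": "PARTY",
    "COMPANY": "ORGANIZATION",
    "GOVERNMENT": "ORGANIZATION",
}


def sub_types_of(super_type):
    """Return the direct sub-types declared for a super-type."""
    return {sub for sub, sup in SUPER_TYPE_OF.items() if sup == super_type}


def check_instance(super_type, claimed_types):
    """List the constraints violated by one instance of super_type.

    claimed_types is the set of sub-types the instance is said to belong to.
    """
    claimed = set(claimed_types) & sub_types_of(super_type)
    problems = []
    if not claimed:
        problems.append("completeness")  # belongs to none of the sub-types
    if len(claimed) > 1:
        problems.append("exclusivity")   # belongs to more than one sub-type
    return problems


ok = check_instance("ORGANIZATION", {"COMPANY"})
both = check_instance("ORGANIZATION", {"COMPANY", "GOVERNMENT"})
neither = check_instance("PARTY", set())
```

Here `ok` is empty, `both` reports an exclusivity violation, and `neither` reports a completeness violation.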

There are more flexible ways to categorize things, of course, but these can be represented in a data model without using sub-types. Figure 4 adds the entity classes PARTY CATEGORY and PARTY CATEGORIZATION. A PARTY CATEGORIZATION is the fact that a particular PARTY falls into a particular PARTY CATEGORY for a period of time. That is, each PARTY CATEGORIZATION must be of exactly one PARTY and into exactly one PARTY CATEGORY. It is effective on an “Effective date,” and ceases to be effective on an “Until date.” Each PARTY CATEGORIZATION must also be by exactly one PARTY: the one that made the categorization.

For example, a PARTY CATEGORY that would apply to PERSON could be “Income Level.” This might be defined by the INTERNAL ORGANIZATION “Market Research Department.” The PERSON “David Letterman” then would be subject to PARTY CATEGORIZATION into the PARTY CATEGORY with the “name” “Over $500,000.” This would be according to the PERSON “Sam Sneed,” who happens to be Mr. Letterman’s gardener.


Figure 4: Categorization

Note that this structure allows a PARTY to be categorized into multiple PARTY CATEGORIES, and further allows for that PARTY CATEGORIZATION to change over time.

Also note that PARTIES may be subject to different PARTY CATEGORIZATIONS by different PARTIES, each for its own purpose. For example, the INTERNAL ORGANIZATION “Market Research Department” might place the HOUSEHOLD “Hay family” in a different PARTY CATEGORY than the INTERNAL ORGANIZATION “Sales” does. Moreover, each PARTY CATEGORY must be defined by exactly one PARTY. The set of PARTY CATEGORIES that is of interest to Market Research may be very different from the set of PARTY CATEGORIES that is of interest to Accounting. A PARTY must be appointed as a steward for every PARTY CATEGORY.

This is a very different approach to categorization than is sub-typing, but if you are looking for multiple inheritance or multiple type hierarchies, this is the way to go.
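The categorization structure can be sketched with three small classes. This is a minimal illustration under stated assumptions: the class and field names paraphrase the entity classes and roles in the text, the dates and the "Late-night viewer" category are invented for the example, and treating "Until date" as inclusive is an assumption the article does not settle.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Party:
    name: str


@dataclass(frozen=True)
class PartyCategory:
    name: str
    defined_by: Party  # each PARTY CATEGORY is defined by exactly one PARTY


@dataclass(frozen=True)
class PartyCategorization:
    of: Party                 # the PARTY being categorized
    into: PartyCategory       # the PARTY CATEGORY it falls into
    by: Party                 # the PARTY that made the categorization
    effective_date: date
    until_date: Optional[date] = None  # None means still in effect

    def in_effect(self, on: date) -> bool:
        # Assumption: "Until date" is treated as the last effective day.
        return self.effective_date <= on and (
            self.until_date is None or on <= self.until_date
        )


def categories_of(party, categorizations, on):
    """Names of the categories a party falls into on a given day."""
    return {c.into.name for c in categorizations if c.of == party and c.in_effect(on)}


market_research = Party("Market Research Department")
letterman = Party("David Letterman")
sneed = Party("Sam Sneed")
high_income = PartyCategory("Over $500,000", defined_by=market_research)
night_owl = PartyCategory("Late-night viewer", defined_by=market_research)  # invented

categorizations = [
    PartyCategorization(letterman, high_income, sneed, date(2008, 1, 1)),
    PartyCategorization(letterman, night_owl, market_research,
                        date(2007, 1, 1), date(2007, 12, 31)),
]
in_2007 = categories_of(letterman, categorizations, date(2007, 6, 1))
in_2008 = categories_of(letterman, categorizations, date(2008, 6, 1))
```

Because each categorization is its own instance, one PARTY can sit in several categories, and its categories can change over time without any change to the sub-type structure.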

DOMAINS

Domain is a concept with an evolving definition and a history of weak implementation in relational systems, and it addresses issues that are handled differently in the object-oriented vocabulary. Barker defines a domain as follows:

“A set of business validation rules, format constraints, and other properties that apply to a group of attributes: for example:

  • A list of values

  • A range

  • A qualified list or range

  • Any combination of these

“Note that attributes and columns in the same domain are subject to the same validation checks.” [Barker 1989, p. Gl-3]

In addition to the items listed above, a domain can also be described by:

  • Data type

  • Length

  • A list of illegal values

  • Edit rules

  • Precision factor

Some entity/relationship-oriented CASE tools have explicit support for documenting domains behind the scenes, as part of an attribute’s documentation – others, less so.

A datatype is, in effect, a simple domain. If we require that a value of an attribute be an integer, we are assigning it to a datatype. If we require that a value of an attribute be a positive integer between 1 and 10 (inclusive), then we are assigning it to a domain.
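The distinction between a datatype and a domain can be sketched as follows. The `Domain` class and its `accepts` method are hypothetical names chosen for this illustration. The sketch covers only a data type plus an optional range, not the full list of domain properties above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    data_type: type
    minimum: Optional[int] = None  # inclusive lower bound, if any
    maximum: Optional[int] = None  # inclusive upper bound, if any

    def accepts(self, value) -> bool:
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, self.data_type):
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


INTEGER = Domain(int)                           # a datatype: the simplest domain
ONE_TO_TEN = Domain(int, minimum=1, maximum=10)  # a domain: datatype plus a range
```

With these definitions, `INTEGER.accepts(42)` is true while `ONE_TO_TEN.accepts(42)` is false, since 42 is an integer outside the permitted range.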

UML does not have what entity/relationship modelers are accustomed to calling “domains.” Its concept of “datatype,” however, is extensible and can be used in the same way.

The only caution is that in a conceptual entity/relationship model, a value set for a domain is a list of meanings. This is different from the code set that constrains the values a column in a database may take. A value set, for example, could be the States in the United States, which is then effectively implemented via several code sets in a database. Corresponding code sets could consist of names, two-character post office abbreviations, the older set of four-letter abbreviations, sequence numbers and so forth. One of the code sets (like “state names”) is designated “primary,” to be used for revealing the value set, but it is still not the same thing as the value set.
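One way to picture the difference between a value set and its code sets is sketched below. All names are illustrative assumptions, only two states are shown, and the sequence numbers are invented. The enumeration members stand for the meanings themselves, and each code set maps those meanings to the codes a database column might hold.

```python
from enum import Enum, auto


class State(Enum):
    """The value set: one member per meaning, independent of any code."""
    TEXAS = auto()
    IDAHO = auto()


# Several code sets implementing the same value set.
CODE_SETS = {
    "state name": {State.TEXAS: "Texas", State.IDAHO: "Idaho"},
    "postal abbreviation": {State.TEXAS: "TX", State.IDAHO: "ID"},
    "sequence number": {State.TEXAS: 1, State.IDAHO: 2},  # invented numbers
}
PRIMARY_CODE_SET = "state name"  # the code set used to reveal the value set


def encode(meaning, code_set=PRIMARY_CODE_SET):
    """Return the code that stands for a meaning in the given code set."""
    return CODE_SETS[code_set][meaning]


def decode(code, code_set):
    """Return the meaning that a code stands for in the given code set."""
    for meaning, candidate in CODE_SETS[code_set].items():
        if candidate == code:
            return meaning
    raise KeyError(code)
```

Two different codes, such as "TX" and "Texas", decode to the same member of the value set, which is the sense in which the primary code set is still not the value set itself.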

In both entity/relationship modeling and UML, an alternative to specifying a domain is to represent it as a “reference” entity type. For example, INTERNAL ORGANIZATION might have had the attribute “Internal Organization Type” with a domain that is a list of values such as “Division,” “Department,” “Section,” etc. This could be documented in the definition of the attribute, or it could be shown as an entity class, as in Figure 5. This is a solution for both entity/relationship and UML modeling.


Figure 5: Entity Classes as Domains

One difficulty with this solution, however, is that, while the list of values is data, the list of values is in fact fundamental to the meaning of the attribute. UML actually has a nifty feature to address this. Instead of defining it as a simple entity class, the list of values can be defined as an enumeration. Figure 6 shows INTERNAL ORGANIZATION TYPE as an enumeration. Instead of attributes being displayed in the entity class box, the “name” of the instances is shown. Note that an attribute (in this case “Internal organization type”) has to be present in INTERNAL ORGANIZATION to refer to it, but where attributes (“Name,” “Description”) would be shown on the class box, the list of values (“Department,” “Division,” “Section”) is shown instead.

This doesn’t translate well to the relational database world, but it is excellent for displaying the concepts, which is, after all, what we are here to do.

Note, by the way, that “Department,” “Division” and “Section,” shown as values of INTERNAL ORGANIZATION TYPE, could as easily be sub-types of INTERNAL ORGANIZATION, but the approach shown here provides more flexibility should the organization structure change significantly.


Figure 6: Enumeration
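For readers who think in code, the enumeration in Figure 6 corresponds roughly to the following sketch. The class and attribute names are transliterations of the names in the text, and `is_valid_type` is a hypothetical helper added for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class InternalOrganizationType(Enum):
    """The list of values is part of the definition, not ordinary data."""
    DEPARTMENT = "Department"
    DIVISION = "Division"
    SECTION = "Section"


@dataclass
class InternalOrganization:
    name: str
    # The attribute that refers to the enumeration.
    internal_organization_type: InternalOrganizationType


def is_valid_type(label):
    """True if the label names one of the enumerated values."""
    try:
        InternalOrganizationType(label)
        return True
    except ValueError:
        return False


market_research = InternalOrganization(
    "Market Research Department", InternalOrganizationType.DEPARTMENT
)
```

Adding a new kind of internal organization means changing the enumeration itself, which reflects the point that the list of values is fundamental to the meaning of the attribute.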

NOTATION DIFFERENCES

Unnecessary UML Elements

Okay, you now have the basic elements required for an entity/relationship model: the entity class, the relationship and the attribute.

UML has a number of features to describe concepts important to object-oriented designers and programmers. These are features that have no place in a conceptual entity/relationship diagram. They include:

  • Composition and aggregation – Some will say that we could add a “composition symbol” (a diamond at the parent end of the relationship line) on the model to denote “composed of.”

    But in an entity/relationship model, we don’t need an extra symbol, since we have the ancient symbols c, o, m, p, s, e, d and f at our disposal. Yes, in this case, the older approach uses many more symbols (and, for example, here reuses the “o” several times), but experience has shown that most observers of the models have had more experience with – and are therefore more comfortable with – the ancient symbols than with the UML symbol that is new and has no intuitive meaning. Adding a new symbol is unnecessary in this context. Most business-oriented viewers prefer the ancient symbols.

    There is an interesting additional meaning to these symbols, though: If it is a solid diamond (called “composition”), it means that the referential integrity rule “cascade delete” applies. That is, if the parent is deleted, all the children are deleted. If it is an open diamond (called “aggregation”), the referential integrity rule “nullify” applies. That is, if the parent is deleted, the children are left as orphans, not connected to anything.

    While it would be valuable to be able to designate referential integrity rules in entity/relationship diagrams, this isn’t an adequate approach, since these symbols reflect only two of the three referential integrity rules. There is no symbol for the third, the “cascade restricted” rule – the rule that says that, if children exist, the parent cannot be deleted.

    Without the complete set of referential integrity rules, there is no point to placing the composition and aggregation symbols on an entity/relationship diagram.

    Note that some UML modelers use the symbol instead of labeling the roles. That helps, but in an entity/relationship environment, the roles must be labeled anyway.

  • Ordered – UML aficionados will also suggest that we do not need “Sequence number” as an attribute since, for example, in the case of PARTY CATEGORIZATION in Figure 4, we could simply characterize the entity class as ordered. The sequence number would then be implicit. The problem with this approach is that it presupposes that the “ordered” approach will be implemented by a technology that can create sequence numbers and manipulate them behind the scenes. But that implies the use of a particular technology. In this case, the “Sequence number” may well be generated automatically, but this is a model for the business community, so all such things must be made explicit. Moreover, if we were to use the “ordered” approach, the sequence number would not be available for designation as an identifier.

  • Visibility – in an object-oriented program, a class and/or its attributes may be accessible to other classes or not. This characteristic is called “visibility.” It is meaningless in an entity/relationship model.

  • Navigation direction – In object-oriented programs, there is no declarative structural component to associations. That is, typically, a programmer must write program code to navigate a relationship in each direction. In a relational database (and by implication, in an entity/relationship model), on the other hand, the relationship exists as a structure with two ends. You cannot talk about half a bridge. In UML – in deference to the object-oriented designers and programmers – it is permitted to designate that the primary path of navigation is in one direction or another. In Figure 3, above, for example, you could add an arrowhead from ORDER to LINE ITEM to indicate that it is expected for someone to want all the line items in an order, but it is never expected to ask for order information about a line item.

    In an entity/relationship diagram, this is not shown. We make no such assertions. The diagram is about structure, and there is no reason to limit the direction in which it can be navigated.

  • Behavior – The advertised advantage of the object-oriented approach is that it addresses behavior together with class structure. But the behavior included is often simply the name of object-oriented program modules. While it might be interesting to know how entity class instances are created, what may be done with them, and how they disappear, this is not something that can be described in a small compartment of an entity class box. (This is the subject of a completely different – and quite sophisticated – modeling technique called “entity life histories” [Hay 2003, pp. 262-282].) So, no, behavior is not an appropriate subject for a UML entity/relationship diagram.

    The object-oriented movement has, of course, accentuated the point that entity/relationship modeling cannot be done in isolation from activity and event modeling. This is absolutely the case. But the notation (and even the UML Class diagram notation with its “behavior”) is not an appropriate tool for doing this. Other notations, from process modeling to entity life history modeling, provide much more expressiveness for the concepts to be presented.

  • Abstract Entities – In UML, every instance of an “abstract entity” must be of at least one of its sub-types. That is, the super-type is an abstraction, with no physical existence apart from its sub-types. Based on the constraints described above, in an entity/relationship model every instance of a super-type must – by definition – be an instance of exactly one of its sub-types. The additional designation is unnecessary. And in some tools, it affects the way the entity is portrayed graphically, which can be a distraction.
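The three referential integrity rules mentioned under composition and aggregation can be demonstrated with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `parent` and `child` tables and the `delete_parent` function are invented for the illustration, and mapping “cascade delete,” “nullify” and “cascade restricted” to SQLite's `ON DELETE CASCADE`, `SET NULL` and `RESTRICT` actions is our reading of those rules.

```python
import sqlite3


def delete_parent(on_delete_rule):
    """Create a parent with one child under the given rule, delete the
    parent, and report what happened to the child."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    db.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    db.execute(
        "CREATE TABLE child (id INTEGER PRIMARY KEY, "
        f"parent_id INTEGER REFERENCES parent(id) ON DELETE {on_delete_rule})"
    )
    db.execute("INSERT INTO parent (id) VALUES (1)")
    db.execute("INSERT INTO child (id, parent_id) VALUES (10, 1)")
    try:
        db.execute("DELETE FROM parent WHERE id = 1")
    except sqlite3.IntegrityError:
        return "delete refused"  # children exist, so the parent stays
    rows = db.execute("SELECT parent_id FROM child").fetchall()
    if not rows:
        return "child deleted"
    return "child orphaned" if rows[0][0] is None else "child unchanged"


cascade = delete_parent("CASCADE")    # solid diamond ("composition")
nullify = delete_parent("SET NULL")   # open diamond ("aggregation")
restrict = delete_parent("RESTRICT")  # the rule with no UML symbol
```

The third call is the case the diamonds cannot express: the delete is refused because a child still refers to the parent.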

Something Missing from UML

In an entity/relationship model, a unique identifier distinguishes each instance of an entity class from every other instance. It may be the value of one or more attributes, or it may be a combination of attribute values and roles attached to instances of other entity classes. For example, in Figure 7, PROJECT and PERSON are called reference entity classes, since they have no mandatory relationships with any other classes. They are also called independent entity classes. For these, unique identifiers are always attributes. That is, for PROJECT, the judgment was made that in this organization all project names are unique, so the attribute “Name” can be used to identify instances of PROJECT. On the other hand, PERSON is simply identified by an attribute that is a “surrogate” (automatically generated) identifier: “Person ID.”


Figure 7: Projects

Identifying attributes are indicated on the drawing by the octothorpe (#) next to the attribute name (see End Note 4).

Dependent entity classes are those which have mandatory relationships with one or more other entity classes. Typically, their unique identifiers include roles with other entity classes. Instances of CONSTRAINED PROJECT ASSIGNMENT in Figure 7, for example, are not identified by attributes at all. Instead, each instance of CONSTRAINED PROJECT ASSIGNMENT is identified by the PERSON it is of and the PROJECT it is to. Each relationship involved is designated as an identifying relationship (okay, “role”) by a line across the relationship line (–|–) at the end next to the identified entity class, as shown in Figure 7.

In the object-oriented world, there is no concept of a “natural” or meaningful identifier. In an object-oriented program, every object in the system is uniquely identified by a generated surrogate key called an “object identifier” (known as an “OID”). Indeed, even in entity/relationship models, reference entity classes, like PRODUCT and PERSON, usually use “surrogate” identifiers (like “Person ID” in our example). In dependent entity classes, however, it is important to know whether instances are identified by all of the roles involved, just some of them, or some of them plus an attribute. The meaning of the entity class is based on that.

For example, look at the identifiers of CONSTRAINED PROJECT ASSIGNMENT. If each assignment is uniquely identified by the PERSON and the PROJECT, no PERSON can work for the same PROJECT twice. There can be no more than one instance of “Charlie” working on the “I-10 upgrade project.” If he leaves and wants later to return to the job, he is out of luck.

This may be exactly what the organization wants to say. Using natural identifiers allows this rule to be expressed, while object identifiers would not. If this is not the enterprise’s intent, of course, the model is incorrect, and OPEN PROJECT ASSIGNMENT is the better fit. Instances of OPEN PROJECT ASSIGNMENT are identified not only by the roles they play, but also by the value of the attribute “Scheduled start date.” Thus, if Charlie worked for one period that was scheduled to start on January 25, 2007, actually stopped working (“Actual stop date”) on June 15, 2007, and then returned to the project on November 3, 2007, this would be no problem. Again, this rule is much more explicit using the entity/relationship approach than it could be with conventional UML.

It may be the case that “Scheduled start date” is not always known, so it cannot practically be used in the unique identifier. In that case, another kind of surrogate key, a “Sequence number” could be added to the entity class, and the unique ID would then be the roles plus the “Sequence number.” This opens up other possibilities. It may be that the business is mostly concerned with who works on the project and wants to set up a sequence number series for each project. In that case, the role “to one PROJECT” and the attribute “Sequence number” would be sufficient. The second assignment of Charlie to this PROJECT could be identified as the “fifth OPEN PROJECT ASSIGNMENT” to this PROJECT.
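The per-project sequence number idea can be sketched as follows. `AssignmentRegistry` is a hypothetical name, and the other people on the project are invented. The identifier of each assignment is the pair of the role “to one PROJECT” and the attribute “Sequence number.”

```python
from collections import defaultdict


class AssignmentRegistry:
    """Issues one sequence number series per project."""

    def __init__(self):
        self._next = defaultdict(lambda: 1)  # next sequence number for each project
        self.assignments = {}                # (project, sequence number) -> person

    def assign(self, person, project):
        """Record an assignment and return its unique identifier."""
        sequence_number = self._next[project]
        self._next[project] += 1
        identifier = (project, sequence_number)
        self.assignments[identifier] = person
        return identifier


registry = AssignmentRegistry()
project = "I-10 upgrade project"
first_stint = registry.assign("Charlie", project)
for other in ("Ann", "Bob", "Dee"):  # invented colleagues
    registry.assign(other, project)
second_stint = registry.assign("Charlie", project)
```

Charlie's return is simply the fifth assignment to this project. The person is not part of the identifier, so the same person can be assigned any number of times.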

Note that the concept of unique identifier is carried through to relational database design as a primary key. This designates one or more columns to identify rows in a table. Roles are implemented in relational tables by converting each many-to-one relationship to foreign key columns in the table for the entity class that is the subject of the role. Each of these columns corresponds to a primary key column in the object entity class.

This means that a primary key can consist both of columns native to the table and columns that are foreign keys to other tables. This exactly implements the entity/relationship concepts of a unique identifier consisting of both attributes and roles.
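A small `sqlite3` sketch shows how the two kinds of identifier might look as primary keys. The table and column names are our own transliterations of the entity classes and attributes in Figure 7, the `try_insert` helper is hypothetical, and the person ID value is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE person  (person_id INTEGER PRIMARY KEY);  -- surrogate identifier
CREATE TABLE project (name TEXT PRIMARY KEY);          -- natural identifier

-- Identified by its two roles only: both key columns are foreign keys.
CREATE TABLE constrained_project_assignment (
    person_id    INTEGER NOT NULL REFERENCES person(person_id),
    project_name TEXT    NOT NULL REFERENCES project(name),
    PRIMARY KEY (person_id, project_name)
);

-- Identified by its two roles plus a native attribute.
CREATE TABLE open_project_assignment (
    person_id            INTEGER NOT NULL REFERENCES person(person_id),
    project_name         TEXT    NOT NULL REFERENCES project(name),
    scheduled_start_date TEXT    NOT NULL,
    PRIMARY KEY (person_id, project_name, scheduled_start_date)
);

INSERT INTO person  VALUES (1);
INSERT INTO project VALUES ('I-10 upgrade project');
""")


def try_insert(sql, params):
    """Return True if the row was accepted, False if a key rule rejected it."""
    try:
        db.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


charlie, project = 1, "I-10 upgrade project"
constrained_sql = "INSERT INTO constrained_project_assignment VALUES (?, ?)"
open_sql = "INSERT INTO open_project_assignment VALUES (?, ?, ?)"

first_constrained = try_insert(constrained_sql, (charlie, project))
second_constrained = try_insert(constrained_sql, (charlie, project))
first_open = try_insert(open_sql, (charlie, project, "2007-01-25"))
second_open = try_insert(open_sql, (charlie, project, "2007-11-03"))
```

The second constrained assignment is rejected, which is the rule that Charlie cannot return to the same project. Both open assignments are accepted because the scheduled start date is part of the key.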

Because of its object-oriented heritage, UML does not have a symbol for unique identifier. But to create an entity/relationship model using this notation, such identifiers are needed. Fortunately, UML has a facility for extending itself called the stereotype. This allows symbols to be created (from text) to encompass any concept needed. In this case, what is needed is a symbol for “identifier” that can be applied to either attributes or roles. For our purposes, your authors have invented “<<ID>>”, as shown in Figure 8.


Figure 8: Projects in UML

In Part 3 of this series, we will approach the aesthetic principles to be followed in presenting models in any notation.

 

References:

Barker, R. 1989. CASE*Method: Entity Relationship Modeling. (Wokingham, England: Addison Wesley).

Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley.

Hay, D. 1999 “UML Misses the Boat,” East Coast Oracle Users' Group: ECO 99 (Conference Proceedings / HTML File). Apr 1, 1999.

Hay, D. 2003. Requirements Analysis: From Business Views to Architecture. (Upper Saddle River, NJ: Prentice Hall PTR).

Miller, G. A. 1956. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,” The Psychological Review, Vol. 63, No. 2 (March, 1956).

Martin, J., and James Odell. 1995. Object-Oriented Methods. (Englewood Cliffs, NJ: Prentice Hall).

Page-Jones, M. 2000. Fundamentals of Object-Oriented Design in UML. (New York: Dorset House). Pp. 233-240.

Rumbaugh, J., Ivar Jacobson, Grady Booch. 1999. The Unified Modeling Language Reference Manual.

 

End Notes:

  1. Before using this approach, be sure that your tool does not assign any other semantic implications to placing one box inside the other. It appears not to be commonly used in class diagrams, but UML can use this nesting to represent “containment.” If we’re to use it for an entity/relationship diagram, though, that option is no longer available.

  2. The MagicDraw tool from No Magic, Inc. requires you to drag the super-type box around the sub-type boxes, but the effect is the same.

  3. Sometimes a GOVERNMENT may be the owner of a COMPANY, but that is a different kind of relationship.

  4. Blame Bell Labs for “octothorpe.” It was their name for that key on a telephone keypad. We know it as the “hash sign” in the UK, or the “number sign” or “pound sign” in the U.S. (It is not called the “pound sign” in the UK, of course, since that term means something else: £.) A neutral, international word like “octothorpe” seems like a good idea.


David C. Hay - In the information industry since it was called “data processing,” Dave Hay has been producing data models to support strategic and requirements planning for more than twenty-five years. As President of Essential Strategies, Inc. for nearly twenty of those years, Dave has worked in a variety of industries including, among others, banking, clinical pharmaceutical research, broadcasting, and all aspects of oil production and processing.  Projects entailed various aspects of defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems.  

Dave’s recently published book, Enterprise Model Patterns: Describing the World, is an “upper ontology” consisting of a comprehensive model of any enterprise from several levels of abstraction. It is the successor to his groundbreaking 1995 book, Data Model Patterns: Conventions of Thought – the original book describing standard data model configurations for standard business situations. 

In between, he has written Requirements Analysis: From Business Views to Architecture (2003) and Data Model Patterns: A Metadata Map (2006). Since he took the unusual step of using UML in the Enterprise Model Patterns… book, a follow-on book, UML and Data Modeling: A Reconciliation was published later in 2011.  This book both shows data modelers how to adapt the UML notation to their purposes, and UML modelers how to adapt UML to produce business-oriented architectural models.

Dave has spoken at numerous international and local DAMA, semantics, and other conferences as well as at various user group meetings. He can be reached at dch@essentialstrategies.com, (713) 464-8316, or via his company's website.
Michael J. Lynott - Mike has been doing data modeling and database design since his introduction to the world of databases in the early '80s. He was part of Oracle Corporation's introduction into Computer-Aided Systems Engineering (CASE) and has been a leading expert in conceptual data modeling ever since. After that, he was a consultant with eTransitions of New Jersey, working with renowned author and consultant Ulka Rodgers. In recent years, he has been senior enterprise information architect for a large retailer in Boise, Idaho. Mike has written a number of papers for various publications and conferences.