Modeling Baseball Cards: A Case Study
Published: July 1, 2005
Published in TDAN.com July 2005
Ok, when did you first become aware of data management? No, it wasn't in your database class in college. It wasn't when you learned SQL or data modeling. It was back when you were a kid, collecting baseball cards. You became fascinated by all the statistics. You had to figure out how to sort the cards. By year? By player? By manufacturer?
I found my son's cards made excellent examples for normalization exercises when I was teaching data modeling some years ago, in spite of the fact that, well, OK, I don't really know as much about baseball as I am supposed to as a loyal American.
Recently, as I was trying to encourage my now grown son to work with me in the data modeling field, I tried to model the game itself. To his credit, my son was very good at pointing out where the model was flawed and showed me how to fix it. What I found interesting was that the kinds of errors I made initially and the kinds of fixes he proposed were very representative of the process we all go through in creating our first data model. Since the example isn't anything like the usual commercial example, it seemed worthwhile to present the process as a kind of case study about the modeling process itself.
The Baseball Card
We begin with baseball cards, as shown in Figure 1. Here you see that there are two kinds of cards: a batter card, here represented by the card for Boston Red Sox player Dwight Evans, and a pitcher card, here represented by the card for New York Yankee Lee Guetterman. (Yes, these cards are from my son's youth, back in 1989. And no, we haven't kept up...)
Figure 1: Baseball Cards - Front
Figure 2 shows the back of two different cards for the same players. The fronts, above, are from Topps, while the backs are from Donruss. The statistics for the two kinds of cards are different, since one is evaluating the performance of a pitcher in getting batters out, and the other is evaluating the performance of a batter who is trying to hit the balls thrown by the pitcher.
Figure 2: Baseball Cards - Back
For the batter, then, you have measurements of his performance at the plate for each year. For example, from the Donruss card:
Note that the definitions of these statistics are not always as simple as the above definitions would suggest. For example, a hit is only a "Hit" if there were no errors, and if the time at bat was not a "fielder's choice".*
Note also that these are only the statistics captured by the Donruss brand card. The Topps card also has "Slugging percentage" and "games started", but not "steal". Others can be calculated from these. For example, the number of singles = hits - doubles - triples - home runs.
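The singles calculation above is an example of a derived statistic, one that need not be stored because it follows from the others. A minimal sketch in Python, where the class name, field names, and sample numbers are all illustrative assumptions rather than values from either card:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BattingLine:
    """One year of batting totals as printed on a card (hypothetical field names)."""
    hits: int
    doubles: int
    triples: int
    home_runs: int

    @property
    def singles(self) -> int:
        # Derived, not stored: singles = hits - doubles - triples - home runs
        return self.hits - self.doubles - self.triples - self.home_runs


# Made-up totals, used only to exercise the formula.
line = BattingLine(hits=150, doubles=30, triples=5, home_runs=20)
print(line.singles)
```

Keeping singles as a computed property rather than a stored field means it can never disagree with the totals it is derived from.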
So, let's build a data model. By data model, we are talking about a conceptual data model, or entity/relationship model. It is a description of the game of baseball. It is not a database design. That is, it is about organizing the data, not about creating something that will be easy to report on. Indeed, as we will see, some of these statistics are really challenging to describe in data structure terms.
The Model - Version 1
A data model describes the things of significance to the organization, about which we wish to hold information. These are called "Entity Classes". In the baseball card example, the first thing of significance is, of course, a player, as is shown in Figure 3. Other entity classes of interest include the team that he is on and the position that he is the holder of.
That is, each player must be on one and only one team and may be holder of one and only one position.
Figure 3: A Player
Note that the above sentence was taken directly from the relationship names on the diagram. Other sentences that can be derived from the diagram are: "Each team may be composed of one or more players", and "Each position may be held by one or more players". Relationships are named according to the rules shown in Figure 4. Relationship names are properly prepositions, not verbs. It is the preposition that is the part of speech that describes relationships. Verbs (commonly used) describe activities, which are more properly represented on a different kind of model.
Figure 4: How to Read a Data Model
Attributes of a player (the information to be held about him) of course include "Player number", "First name", and "Last name". In addition, the attribute "Throws handed indicator" can be "Left handed" or "Right handed". The attribute "Bats handed indicator" can be "Left handed", "Right handed", or "Switch".
In an early draft of this model, the line next to player was solid, asserting that each player must be holder of exactly one position. This turns out not to be true. In some cases, a player may not have a position permanently assigned. Hence, the model had to be corrected. The whole point of producing a data model, after all, is to create a set of assertions that can be validated by subject matter experts. Our job is to be wrong, so we can learn what is right.
In this case, there is actually more work to do. Take a look at the model in Figure 3 again. Is it really true that a player can play on only one team? Or that he can play only one position? Especially over time? But if we allow more than one in each case, these will become "many-to-many" relationships that will not go over well with our system designers. In addition, it is useful to understand exactly what is going on each time a player plays a position for a team. That, in fact, is where statistics will be collected. Figure 5 shows player membership, the fact that a player was on a team, and potentially holding one position, at a particular time. That is, each occurrence of a player membership must be held by one player, with a team, and as a player of a position.
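One way to picture how player membership resolves the many-to-many relationships is as a record that points at one player, one team, and optionally one position. The sketch below is only an illustration: the class and field names, the begin and end dates, and the sample data ("Team A", "Team B") are assumptions, not taken from the figures.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Player:
    player_number: int
    first_name: str
    last_name: str


@dataclass(frozen=True)
class Team:
    name: str


@dataclass(frozen=True)
class Position:
    name: str


@dataclass(frozen=True)
class PlayerMembership:
    """The fact that a player was on a team, possibly holding a position, for a time."""
    player: Player                        # must be held by one player
    team: Team                            # must be with one team
    begin_date: date                      # assumed attribute
    position: Optional[Position] = None   # optional: no position may be assigned
    end_date: Optional[date] = None       # assumed attribute; None means still current


def teams_of(player: Player, memberships: List[PlayerMembership]) -> List[str]:
    """Team names for one player, in order of membership begin date."""
    own = sorted((m for m in memberships if m.player == player),
                 key=lambda m: m.begin_date)
    return [m.team.name for m in own]


# Entirely made-up data: one player, two teams over time.
pitcher = Player(1, "Example", "Pitcher")
memberships = [
    PlayerMembership(pitcher, Team("Team B"), date(1988, 1, 1)),
    PlayerMembership(pitcher, Team("Team A"), date(1986, 1, 1),
                     position=Position("Pitcher"), end_date=date(1987, 12, 31)),
]
print(teams_of(pitcher, memberships))
```

Each membership still refers to exactly one player and one team, so the "many" on both sides lives in the list of memberships rather than in either entity class.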
Figure 5: Fixing Player
Now, let's add the baseball card, as shown in Figure 6. At first glance, it appears that each baseball card must be a report on one and only one player, and must be made by one and only one card manufacturer. (The front sides of the cards shown above are by Topps, while the back sides are by Donruss.) Specifically, each baseball card must be to describe one player membership.
Figure 6: A Baseball Card
The sub-type structure shows that each baseball card must be either a pitcher card or a batter card. That is, an occurrence of a batter card, by definition, is also an occurrence of a baseball card. All attributes ("Card number", "Year") of baseball card are also attributes of pitcher card and of batter card. Similarly, all relationships with baseball card ("To describe" player membership and "Made by" card manufacturer) also are relationships to pitcher card and batter card. The attributes specific to pitcher card and batter card, however, are only attributes of those entity classes and none other.
This still isn't right. If you look at the Lee Guetterman card in Figure 2, above, you'll see that in fact he played on two different teams over the years, so the card cannot simply show one player membership. It's time for another entity class, this time called baseball card line, as shown in Figure 7. Note that each occurrence of baseball card line is identified both by the "Playing year" (represented by the octothorpe (#) next to "Playing year") and the baseball card it is part of (represented by the small line across the "part of" relationship role). Thus, the 1988 statistics will appear as separate baseball card lines on both the 1988 and 1989 Topps cards and on the 1988 and 1989 Donruss cards.
Figure 7: Baseball Card Lines
You can see the statistics we described previously as attributes of each of the sub-types. Actually, that is not quite true: the attributes above are from the Topps card, not the Donruss card as was previously described. As we can see, the two different manufacturers don't capture quite the same statistics. In addition, the Major League Baseball web site lists 37 batting statistics and 38 pitching statistics. It may be true that most 11-year-olds are not interested in the number of times a batter is hit by a pitch or his "on-base percentage", but then, who knows? Having the statistics "hard-coded" as attributes of baseball card line is simply not practical.
This leads us to the version shown in Figure 8. In this, a statistic is something to be measured, like "Earned run average", or "Number of games played". The fact that a statistic is captured on a particular baseball card line (either a pitcher card line or a batter card line) is a line statistic. A statistic, then, can appear on multiple baseball card lines (from multiple manufacturers), or, it doesn't have to appear on any of them. Note also that an actual player statistic value is for a single player membership, regardless of the number of baseball cards it may appear on.
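To make the difference from "hard-coded" attributes concrete, here is a rough sketch in which each statistic is a row of data and a line statistic records only that a statistic is printed on a given card line. The identifiers, the link from card line to player membership by id, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Statistic:
    """Something to be measured, such as "Games" or "Steals"."""
    name: str


@dataclass(frozen=True)
class PlayerStatisticValue:
    """The value of one statistic for one player membership."""
    membership_id: str      # stand-in for a full player membership (assumption)
    statistic: Statistic
    value: float


@dataclass(frozen=True)
class LineStatistic:
    """The fact that a statistic is printed on a particular baseball card line."""
    card_line_id: str       # stand-in for a baseball card line (assumption)
    statistic: Statistic


def render_line(card_line_id: str, membership_id: str,
                line_statistics: List[LineStatistic],
                values: List[PlayerStatisticValue]) -> Dict[str, float]:
    """Look up the stored values for just the statistics this card line shows."""
    by_key = {(v.membership_id, v.statistic): v.value for v in values}
    return {
        ls.statistic.name: by_key[(membership_id, ls.statistic)]
        for ls in line_statistics
        if ls.card_line_id == card_line_id
        and (membership_id, ls.statistic) in by_key
    }


games = Statistic("Games")
steals = Statistic("Steals")
slugging = Statistic("Slugging percentage")

# Made-up values, stored once per player membership.
values = [
    PlayerStatisticValue("m1", games, 100),
    PlayerStatisticValue("m1", steals, 10),
    PlayerStatisticValue("m1", slugging, 0.45),
]
# Two manufacturers print different subsets of the same stored values.
line_statistics = [
    LineStatistic("topps-1988", games),
    LineStatistic("topps-1988", slugging),
    LineStatistic("donruss-1988", games),
    LineStatistic("donruss-1988", steals),
]
topps = render_line("topps-1988", "m1", line_statistics, values)
donruss = render_line("donruss-1988", "m1", line_statistics, values)
print(topps)
print(donruss)
```

Adding a thirty-eighth pitching statistic then means adding a row, not changing the structure.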
Figure 8: Baseball Statistics
The Model - Version 2
The problem with the above model is that it doesn't tell us anything about the nature of the statistics. Some of them listed above were shown as "complex", which means that they are not derived from a simple manipulation of other statistics. An "Earned run", for example, is a run scored by a player who got on base without benefit of an error. This means not only does it not count if an error is made in fielding a ball that he hit, but if an error prevents an out that would have retired the side, no runs after that point are "earned". This definition is not reflected in Figure 8, above.
What is required is for us to model the game itself. This begins with Figure 9. In this, our player membership is on a team that is now shown to be located in a stadium. In this case stadium refers to a home stadium. The Houston Astros play in Houston, Texas at Minute Maid Park, while the Texas Rangers play in Arlington, Texas at Ameriquest Field.
Normally, a game is played at one of these stadiums between exactly two teams, with the team located in that stadium by definition being the home team for the game, while the opposing team is the away team for that game. That is, there is a business rule that if a team is the home team for a particular game, it must be located in the stadium that is the site of the same game.
There are two kinds of exceptions to this. First of all, the All Star Game is played once a year between teams representing all teams in each league. In this case, neither team is located in one stadium. Hence the dotted line on that relationship, although a business rule states that teams that are members of the leagues must be located at one stadium. As it happens, one team's stadium is chosen for the location of the game (a game is played at one stadium), and the team of the league that the team belongs to is designated the home team for that game. The relationship that each game may be played at one stadium is for circumstances like this where you cannot assume that the game is played at the stadium that is the location of the team that is the home team in the game.
Also, sometimes games are played outside the United States, in which case a decision is made as to which team is home team for the purposes of that game. Again, you cannot assert that the stadium of that team is where the game is being played. Again, it is useful to be able to independently assert that the game is played at a (for example Tokyo) stadium.
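The home-team business rule and its two exceptions could be checked along the following lines. This is a sketch under assumptions: the `neutral_site` flag is a hypothetical way of marking games played away from either team's stadium, and the All Star team is modeled simply as a team with no stadium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stadium:
    name: str


@dataclass(frozen=True)
class Team:
    name: str
    located_in: Optional[Stadium] = None   # None for, e.g., an All Star team


@dataclass(frozen=True)
class Game:
    home_team: Team
    away_team: Team
    played_at: Stadium
    neutral_site: bool = False   # hypothetical flag for games such as one in Tokyo


def violates_home_rule(game: Game) -> bool:
    """True if the home team has a stadium and the game is unexpectedly elsewhere."""
    if game.neutral_site or game.home_team.located_in is None:
        return False   # the exceptions: the game site is asserted independently
    return game.home_team.located_in != game.played_at


minute_maid = Stadium("Minute Maid Park")
ameriquest = Stadium("Ameriquest Field")
astros = Team("Houston Astros", minute_maid)
rangers = Team("Texas Rangers", ameriquest)

normal = Game(home_team=astros, away_team=rangers, played_at=minute_maid)
wrong_site = Game(home_team=astros, away_team=rangers, played_at=ameriquest)
abroad = Game(home_team=astros, away_team=rangers,
              played_at=Stadium("a Tokyo stadium"), neutral_site=True)

print(violates_home_rule(normal), violates_home_rule(wrong_site),
      violates_home_rule(abroad))
```

Because the game carries its own "played at" stadium, the rule is something to be validated rather than something built into the structure.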
Figure 9: A Game
Each game is composed of one or more half innings, where each half inning is the fact that a particular team is at bat. If the "Top/bottom indicator" is "Top", then the team which is the away team for the game is at bat. If it is "Bottom", then the home team is at bat.
The game is played when a succession of players from the team that is at bat step up to "home plate" to attempt to hit a ball thrown by the pitcher. That is, one player is the batter for one or more plate appearances, and each of these plate appearances is the occasion for one or more pitches by the player who is the pitcher for that pitch. Played position records the fact that various of the players (typically those in the field) are assigned to play a particular position during that pitch. That is, each played position during a pitch must be played by a player and must be the playing of a position. In addition, played position can record who the opposing pitcher is during that pitch. Note that the position embodied in a particular played position during one pitch may not be the normal position played by a player as recorded in his player membership.
Figure 10: Plate Appearances and Pitches
Now, what can happen as a result of each pitch?
The plate appearance can have the following outcomes:
Figure 11 shows how these outcomes are represented in the model. A pitch outcome is from a single pitch (Strike, Ball, etc.). A plate appearance outcome is the overall result (Hit, Walk, Out, etc.) of the player's being up to bat. It is the result of a plate appearance. Each plate appearance results in a single plate appearance outcome.
Figure 11: Outcomes
In addition to the actual outcome of the pitch, other things will be going on around the field. Specifically, either someone may be put out, or a player may advance by one or more bases. Each field event (a field out, a player advance, or an error) is one of these other happenings. The field event must be during a pitch (or immediately following one), and it must be by a player. It also must be an example of one field event type, which redundantly expresses the sub-type structure of field event. (The first three field event types must be "Field out", "Player advance", or "Error".)
Specifically, "Player advance" is the super-type of the following field event types:
The field event type "Field out" may be the super-type of the following field event types:
A field event type "Error" has no sub-types, but an error may be the cause of a player advance.
A field event type "Multiple Play" is the super-type of the following field event types:
In addition, a multiple play must be composed of one or more field outs.
A field event type "Other" may be the super-type of the following field event types:
Each field event must be by one player, either the player advancing or the player responsible for the field out, error, etc. In addition, other players may assist in the play. That is, each field event may be helped by one or more assist roles, each of which is played by a player.
In recording the history of a game, it is also important to know the sequence of events, specifically in terms of the path of the ball. If the ball was hit to center field, then thrown to second base, and then thrown to first base, these are three instances of ball travel which are part of a particular field event. In the case of a multiple play, the ball travel is recorded for each component field out.
For example, if the ball is hit to the shortstop and thrown to first base for an out, the field event is a field out by the player who is playing first base, with the "Base" of the field out being "First". One instance of ball travel records that the ball went from "Shortstop" to "First Base", and the player who is playing shortstop is the player in an assist role in that field out.
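The shortstop-to-first example can be written down as data to show how field event, ball travel, and assist role fit together. The class and field names below, including the `sequence` number used to order the ball's path, are assumptions for illustration and are not read off Figure 12.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BallTravel:
    """One leg of the ball's path during a field event."""
    sequence: int          # assumed attribute, used to order the legs
    from_position: str
    to_position: str


@dataclass(frozen=True)
class AssistRole:
    """A player other than the one credited who helped in the play."""
    played_by: str


@dataclass
class FieldEvent:
    event_type: str        # e.g. "Field out", "Player advance", "Error"
    by_player: str
    base: str = ""
    ball_travel: List[BallTravel] = field(default_factory=list)
    assists: List[AssistRole] = field(default_factory=list)

    def path(self) -> List[str]:
        """Positions the ball visited, in order."""
        legs = sorted(self.ball_travel, key=lambda leg: leg.sequence)
        if not legs:
            return []
        return [legs[0].from_position] + [leg.to_position for leg in legs]


# The example from the text: ground ball to shortstop, thrown to first for the out.
out_at_first = FieldEvent(
    event_type="Field out",
    by_player="player playing first base",
    base="First",
    ball_travel=[BallTravel(1, "Shortstop", "First Base")],
    assists=[AssistRole("player playing shortstop")],
)
print(out_at_first.path())
```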
Figure 12: Field Events
Now let's return to that definition of "Earned run" discussed earlier. An earned run is a run for which the pitcher is held accountable, and shall be charged every time a runner scores after he originally got on base without help from an error. That is, he got on base through a hit, a walk, he was hit by a pitch, or a fielder chose to tag or force out someone else on a different base.
4th game of the 2004 American League Playoffs: New York Yankees vs. Boston Red Sox
It was the bottom of the 14th inning in the fourth game of the playoffs. The Yankees had won three games already and were expecting to wrap it up tonight. Here's what happened:
Esteban Loaiza pitches to Mark Bellhorn
Plate appearance outcome: M Bellhorn struck out swinging.
Esteban Loaiza pitches to Johnny Damon
Plate appearance outcome: J Damon walked.
Esteban Loaiza pitches to Orlando Cabrera
Plate appearance outcome: O Cabrera struck out swinging.
Esteban Loaiza pitches to Manny Ramirez
Plate appearance outcome: M Ramirez walked.
Player Advance: Movement to another base on a hit: Advance to second base by J. Damon
Esteban Loaiza pitches to David Ortiz
Plate appearance outcome: D Ortiz singled to center
Player Advance: Movement to another base on a hit: Advance to second base by M. Ramirez
Player Advance: Run Scored! by J Damon
1 run, 1 hit, 0 errors
NY Yankees 4, Boston 5
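The half inning above can be encoded as plate appearances and a run scored, and the earned-run definition given earlier applied to it. This is a deliberately simplified sketch: the outcome labels and function names are assumptions, and it checks only how the runner reached base. It does not handle the case of an error that should have ended the inning.

```python
from dataclasses import dataclass
from typing import List

# Ways of reaching base that do not involve an error, per the definition in the text.
EARNED_WAYS = {"Hit", "Walk", "Hit by pitch", "Fielder's choice"}


@dataclass(frozen=True)
class PlateAppearance:
    batter: str
    outcome: str   # plate appearance outcome (labels are illustrative)


@dataclass(frozen=True)
class RunScored:
    runner: str


def reached_by(runner: str, appearances: List[PlateAppearance]) -> str:
    """Outcome of the runner's most recent plate appearance in this half inning."""
    for pa in reversed(appearances):
        if pa.batter == runner:
            return pa.outcome
    raise ValueError(f"{runner} did not bat in this half inning")


def is_earned(run: RunScored, appearances: List[PlateAppearance]) -> bool:
    # Simplification: ignores errors that prolong the inning.
    return reached_by(run.runner, appearances) in EARNED_WAYS


inning = [
    PlateAppearance("M Bellhorn", "Strikeout"),
    PlateAppearance("J Damon", "Walk"),
    PlateAppearance("O Cabrera", "Strikeout"),
    PlateAppearance("M Ramirez", "Walk"),
    PlateAppearance("D Ortiz", "Hit"),
]
runs = [RunScored("J Damon")]

hits = sum(1 for pa in inning if pa.outcome == "Hit")
print(len(runs), "run,", hits, "hit")
print("earned:", [is_earned(r, inning) for r in runs])
```

Damon reached base on a walk, so under this simplified check his run is charged to the pitcher as earned.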
In another example, the Yankees were playing Minnesota. It was the top of the third inning and Derek Jeter was on first base:
Johan Santana pitches to Gary Sheffield
Plate appearance outcome: Gary Sheffield grounded out
Field Out: Force out: First base by player playing first base
Ball Travel: Second base->First base
Assist Role: player playing second base
Player Advance: Movement to another base on a hit: Advance to second base by Derek Jeter
Again With the Statistics
Now we can revisit the collection of statistics from the beginning of this paper. In Figure 13 we again see player statistic value of a statistic and for a player membership. Typically, each player statistic value is an aggregate for a year, but the attributes "Begin date" and "End date" allow specification of any time period.
What we don't see in the model is the navigation that is required to capture each statistic, although we do at least have a place to do the navigation:
Many of the other statistics are more complicated and are left to the reader as a homework assignment. All information necessary is contained in the model.
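As a hint of what such navigation might look like for one of the simpler cases, the sketch below counts plate appearances with a given outcome for one player membership over a date range and produces a player statistic value. The record layouts, the link to a membership by id, and the sample data are all assumptions; real statistics such as "Hits" carry the extra conditions noted earlier.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class PlateAppearance:
    membership_id: str   # stand-in for the batter's player membership (assumption)
    game_date: date
    outcome: str


@dataclass(frozen=True)
class PlayerStatisticValue:
    membership_id: str
    statistic: str
    begin_date: date
    end_date: date
    value: int


def count_outcomes(appearances: List[PlateAppearance], membership_id: str,
                   outcome: str, statistic: str,
                   begin: date, end: date) -> PlayerStatisticValue:
    """Aggregate matching plate appearances into one statistic value for the period."""
    n = sum(
        1 for pa in appearances
        if pa.membership_id == membership_id
        and pa.outcome == outcome
        and begin <= pa.game_date <= end
    )
    return PlayerStatisticValue(membership_id, statistic, begin, end, n)


# Made-up plate appearances for two memberships.
appearances = [
    PlateAppearance("m1", date(2004, 4, 10), "Walk"),
    PlateAppearance("m1", date(2004, 6, 2), "Walk"),
    PlateAppearance("m1", date(2004, 6, 2), "Out"),
    PlateAppearance("m1", date(2005, 4, 5), "Walk"),
    PlateAppearance("m2", date(2004, 5, 1), "Walk"),
]
walks_2004 = count_outcomes(appearances, "m1", "Walk", "Walks",
                            date(2004, 1, 1), date(2004, 12, 31))
print(walks_2004.value)
```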
Figure 13 also shows team statistic value. This is the value of a statistic for the whole team in one game.
Figure 13: Statistics
A Final Thought
Developing this model has been a true exercise in the difficulty of extracting information from subject-matter experts. Where my experience with patterns has brought me to the point where doing a model in a commercial environment is relatively easy, I was in unfamiliar territory here. As I stated above, my particular upbringing has left me (let's see, what is the current PC term?) disadvantaged, when it came to my understanding of the game. I knew the basics, of course, and I've been to several games around the country, but I never really understood what went into compiling these statistics. The rules of the game are far more complex and subtle than I had ever imagined. It has taken several drafts for my son to clarify my thinking and my understanding to the degree represented by this paper.
But the exercise has been exactly what is required to produce any data model in any field of endeavor.
Discussing baseball is much like working with computers. If you don't know something, ask your kids.
*Fielder's choice - A play made on a ground ball in which the fielder chooses to put out an advancing base runner, thus allowing the batter to reach first base safely.
David C. Hay - In the information industry since it was called “data processing,” Dave Hay has been producing data models to support strategic and requirements planning for more than twenty-five years. As President of Essential Strategies, Inc. for nearly twenty of those years, Dave has worked in a variety of industries including, among others, banking, clinical pharmaceutical research, broadcasting, and all aspects of oil production and processing. Projects entailed various aspects of defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems.
Dave’s recently published book, Enterprise Model Patterns: Describing the World, is an “upper ontology” consisting of a comprehensive model of any enterprise from several levels of abstraction. It is the successor to his groundbreaking 1995 book, Data Model Patterns: Conventions of Thought – the original book describing standard data model configurations for standard business situations.
In between, he has written Requirements Analysis: From Business Views to Architecture (2003) and Data Model Patterns: A Metadata Map (2006). Since he took the unusual step of using UML in the Enterprise Model Patterns… book, a follow-on book, UML and Data Modeling: A Reconciliation was published later in 2011. This book both shows data modelers how to adapt the UML notation to their purposes, and UML modelers how to adapt UML to produce business-oriented architectural models.
Dave has spoken at numerous international and local DAMA, semantics, and other conferences as well as at various user group meetings. He can be reached at firstname.lastname@example.org, (713) 464-8316, or via his company's website.