TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Subscribe to TDAN

TDWI
Dataversity
Business Analysis Conference Europe 2014
Data Governance Financial Services Conference
Enterprise Dataversity
Data Modeling Zone
Data Governance Winter Conference

   > home
 Printer-friendly
 E-mail to friend

SELECT * FROM Celko - March 2009
Splitting Data

by Joe Celko
Published: March 1, 2009
Joe Celko looks at his favorite bad design - attribute splitting, something that shows up in so many forms at the column, table and schema levels.

When something is really good or really bad and appears a lot in common usage, it gets a name. A friend of mine once worked at a COBOL shop where they had the “end of month totals” bug. Everyone knew that you needed to do a sort before running the end of the month reports. The logic in the report program assumed that the records were sorted; and when it found one out of sort order, it would go to the summary logic and print totals. It would skip over the remaining records that were out of order and go to the next set of records for the next control break.

In the database world, we also have names for good and bad design practices. Everyone knows about “normalization” as a good thing, even if they don’t use it. But we don’t hear much about the bad designs. My favorite bad design is called “Attribute Splitting” because it shows up in so many forms at the column, table and schema levels.

Column Splitting

Let’s go back to definitions. A table is in First Normal Form (1NF) when all its columns are scalar values and the table has a key. Scalar valued columns means that you do not put your hat size and shoe size in one column. Repeating groups are another form of non-1NF, but it is not attribute splitting.

Please note that “scalar” and “atomic” are not the same things. Scalar values are one-dimensional values on a scale of some kind. I am not going to go into scales and measurements here, but you can read about it in my book Thinking Sets where I have a few chapters on these topics.

The term “atomic” means that it cannot be broken up without destroying the data element. Scalar values are automatically atomic, which is very handy. The pair (longitude, latitude) is two columns, but it is only one data element (location) by this definition. I could also model a location with a single Hierarchical Triangular Mesh (HTM) code. The Hierarchical Triangular Mesh is a method to subdivide the spherical surface of the globe into triangles of nearly equal shape and size. The HTM gives us a very efficient indexing method for objects localized on the sphere. You can Google it.

The idea of a data element is more abstract than just a physical unit of storage, like records in files. I can have a virtual data element that I compute as needed and it has no existence at all.

Is (longitude, latitude) a split attribute? Technically, yes. Postgres and other SQLs have extensions for geographical data, so locations are encapsulated in special data types; the rest of us have to “fake it” with two columns. But it is not a serious design flaw.

A more serious example of column attribute splitting is the use of BIT flags in SQL. Old COBOL and OO programmers love to do this because it looks like what they had in punch card file systems before RDBMS. Consider the BIT flag column “is_male” in a table. If this was in SQL Server, you would have to worry about NULLs, but ignore that for now.

This looks nice, but then you run into your first corporate entity. Is the XYZ Corporation male or female? Organizations are called “lawful persons” in legal speak and can do most of the same things as a person can in a business situation. Is it male or female? Well, neither. If you want to get ridiculous, you could try to assign gender to corporate bodies – Playboy Enterprises is male and Mary Kay Cosmetics is female. We have no idea what to do with Sears or Macy’s.

The ISO sex codes are {0= unknown, 1= male, 2= female, 9= lawful persons}, they are standard and they make sense. The BIT flag guy has to add another flag – “is-corporate” and now he needs a CASE expression to handle the (is_male, is_corporate) pair. There are (2^2) = 4 binary combinations, so what do we do? How about (0, 0) = unknown because all zeros look good; (1, 0) = male, not corporate; (0, 0) = female, not corporate; (1, 1) = male lawful person and (0, 1) = female lawful person? We could add a CHECK() constraint that disallows (0, 1) to give us one and only one lawful person encoding.

This gets worse as you add more and more bits.

Table Splitting

Another form of attribute splitting is to take an attribute’s values and make a table for each value. The example I pull up all the time to demonstrate the problem is taking “Personnel” and splitting it into “MalePersonnel” and “FemalePersonnel” tables. The split attribute is the sex code.

The first problem is what to do with corporate employees if your company hires staff from Kelly or Manpower (lawful persons). What if we hire “Alex Mack” and he/she did not fill in the sex code and we have no idea what the right gender is (note to SF fans: Google “Secret World of Alex Mack” reference here; the gender of the lead was deliberately ambiguous).

Oops! Maybe we needed “MalePersonnel”, “FemalePersonnel”, “CorporatePersonnel” and MiscPersonnel” tables in this screwed up schema. Split tables quickly expand like this.

No, we needed a valid data model with a column for the sex code. Splitting a table on the sex code is probably not likely. But if the Martians land and we need another sex code, that is easier than another table. This is so obviously silly, programmers will not do it. Well, most programmers; never underestimate stupidity.

But look at temporal columns. You will often find a table for each year or month within a year. Why does this look okay? Because that is what we did with magnetic tapes! The old IBM magnetic tape labels used “YYDDD” format that led to some Y2K problems. If you get on SQL Newsgroups, you will find tables that have this or some variant as their names.

You will also find tables split on geographical values, organizational units, etc. This is not the same as partitioning a table. Partitioning is a physical storage management method; the table is one logical unit of work. The SQL engine will maintain the partitioned data and not allow duplicated or dropped data. You do not have to write extra code.

Schema Splitting

The extreme case is putting data into multiple databases. This is done out of necessity most of the time. It is not always physically or practically possible to put all of the enterprise data into one centralized database. This is fine as long as it is logical and well controlled. We want a single data model of an enterprise. This is where a federated database is similar to a partitioned database.

In a federated database, the system maintains the relationships among the databases and (we hope) keeps the data consistent (i.e., every DB is on UTC, a set of measures, etc.) so that the data belongs to one central data model and not a bunch of local data models that do not match.

Conclusions

We need heuristics, knowing that they are heuristics and not mathematical axioms. All the basic good RDBMS rules apply; do not mimic a file system in SQL, etc.

Remember the Laws of the Wolf Pack in Kipling’s Jungle Book stories? The DB versions are also important for the same reason – survival.

Go to Current Issue | Go to Issue Archive


Recent articles by Joe Celko

Joe Celko -

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.