TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Stages Of Data Utility & Value

by Michael Scofield
Published: October 1, 2005

Published in TDAN.com October 2005

The world is filled with data and information. Some of it is unknowable (such as what is lurking behind the Andromeda galaxy). Some of it is knowable, but unknown--unknown mainly because it was never deliberately observed, and properly recorded. Of the data and information which are recorded with some accuracy, how do we find what we need for specific uses or decisions? The utility of data is a very specific characteristic, and depends upon the anticipated usage.

Utility characteristics of information

There are several significant characteristics of information which must be understood, and understood as distinct from each other, when evaluating utility. Any piece of information, in order to be useful, should be...

  1. Knowable. Nearly everything (but not all, as Heisenberg[1] taught us) is knowable, although sometimes very difficult to learn or discern.
  2. Recorded. In some sharable, objective medium and not just in some human brain.
  3. Accessible (with the right resources and technology)
  4. Navigable (it may be there but is it easy to find?)
  5. Understandable (language, culture, technology, etc.)
  6. Of sufficient quality (for the intended use)
  7. Topically relevant to needs (perceived needs and unknown needs) (otherwise, it is noise)

These characteristics apply to a piece of data, information, or potential information. These characteristics apply to both a single item of data, and any meaningful grouping of data items.

Please notice that these characteristics are also tests, and seem naturally sequential (1 through 7). Subsequent tests are irrelevant if the previous tests are not passed. For example, it is no use worrying about navigability if the data is not knowable, recorded, and accessible.

Another important characteristic of information is whether it is structured (or tabular), or unstructured. The tabular-unstructured dimension is orthogonal to these seven, meaning that there are really 14 possible tests (in two columns).

An overview of the stages of data utility is shown here.

Fig. 1. Seven utility characteristic questions about data.

So these seven characteristics can also be seen as seven tests, which can be applied in sequence. The exact sequence of the final five can be debated. For example, if a fact (whose value is known or not) is judged to not be relevant, then we probably wouldn't worry about its quality. But the sequence here seems somewhat intuitive, so we use it.
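The idea of applying the tests in sequence, and stopping at the first one that fails, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage names, the `first_failed_stage` function, and the flag-based tests are hypothetical and do not come from the article; real tests would involve human judgment rather than boolean flags.

```python
from typing import Callable, Dict, Optional

# The seven characteristics, in the article's order (names are illustrative).
STAGES = [
    "knowable",
    "recorded",
    "accessible",
    "navigable",
    "understandable",
    "sufficient_quality",
    "relevant",
]


def first_failed_stage(
    item: dict, tests: Dict[str, Callable[[dict], bool]]
) -> Optional[str]:
    """Apply the tests in order; return the first stage that fails, or None."""
    for stage in STAGES:
        if not tests[stage](item):
            return stage  # later tests are moot once one fails
    return None


# Hypothetical tests: each one simply reads a boolean flag off the item.
flag_tests = {
    stage: (lambda item, s=stage: bool(item.get(s, False))) for stage in STAGES
}

# A letter that exists on paper but sits in a private attic.
letter_in_attic = {"knowable": True, "recorded": True, "accessible": False}
print(first_failed_stage(letter_in_attic, flag_tests))  # accessible
```

Because the loop returns at the first failure, navigability, quality, and relevance are never evaluated for the inaccessible letter, which mirrors the point that later tests are irrelevant until the earlier ones pass.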

1. Knowable

There is a wealth of potentially knowable data and information in the universe. Only a very small portion of it has ever been observed or known by humans (or the technology which humans use to extend mankind's observations and senses). And little of that was ever recorded anyway. Slightly more was remembered, but not recorded. Of course, memory fades...and fails. Hence, the need for recording stuff.

2. Recorded

Most of what is knowable is never recorded. People don't feel it is worth recording. Some is recorded by hand (generally requiring paper and a writing instrument), but since 1850, some has been recorded through automated or technological means (photography, phonographs, magnetic media, etc.).

What we choose to record depends upon our expectations of later utility or interest. Students take notes in academic lectures, but not on what goes on at a football game (unless they are sports writers for the college newspaper). More captains of industry and government write memoirs than do postal clerks.

When we discover a need for data, we can adjust behavior and systems to start recording it. But much of the most crucially needed data is not recorded until it is too late. Recording of immigration activity is now more meticulous. I wonder why. Surveillance tapes may be saved longer now[2].

3. Accessible

The very availability of, and access to, data can be a major issue. Of all the knowledge and information ever recorded, much (especially in ancient times) was confined to letters and personal journals. There were newspapers, but only late in the history of civilization.

Thus, much of what has been known and recorded is not in the public domain. It is in the personal property of families. In the modern industrial age, a great deal of knowledge is held in the private files of corporations and government agencies (though many of these are destroyed on a regular basis, sometimes to prevent their availability for discovery to aid litigation). These generally are not available to the public. For investigative purposes they can only be subpoenaed if their existence is known.

The public library was a major step forward in making data, knowledge, and information available to a wider readership. The internet is the latest significant mechanism for lowering the cost of access, speeding navigation (see below), and lowering the cost of "publishing".

We often underestimate the value and power of the human brain. Much of contemporary knowledge and information is held solely in human brains. Most people do not even understand or realize how much they know. Mechanisms for recalling that information are sometimes faulty. We may require triggers or clues to remember things. We may remember the route to a particular store if we were driving it ourselves (relying on visual clues along the way which work for us, but which we could not articulate during an interview), but it is more difficult to remember, describe, and organize those clues to give verbal or written directions to someone who wishes to drive the same route.

4. Navigable (it may be there but is it easy to find?)

Orthogonal to the concepts of structured and unstructured data are issues of navigation; how do we find what we are looking for? Or what is important? ...to us. Or what we need to know?

There is a wide range of "navigability" in data and information--particularly in unstructured information. The worst case is a collection of personal letters, on paper stationery, (not stacked chronologically) of some individual--especially after they are dead. How do you find the author's mention of a particular distant cousin? You usually need to read it all.

Books are bound pages, with some physical organization (usually chronology); they are designed (except for reference books) to be read serially, front to back. Reference books have topics generally organized alphabetically, or by some other meaningful categorization. Then, some non-fiction books have tables of contents which are significant (for example, textbooks). In these kinds of books (compared to narratives), topics become easier to find. Then, topical indices were added to the back of the bound non-fiction book. These are helpful, but reflect the judgement of some editor who decides which topics are important enough to get a reference.

Tabular data was generally stored (at least on paper) in entry sequence. The beauty of automated (electronic, digital) tabular data was that records could be sorted, and we could use several alternate keys or indices to find records. This is a very significant enhancement in navigability of data. Indeed, tabular data has become quite easily managed.
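As a minimal sketch of what sorting and alternate indices buy us, consider a handful of records kept in entry sequence. The records, field names, and the `build_index` helper below are hypothetical, invented only to illustrate the point.

```python
from collections import defaultdict

# Hypothetical records, kept in entry sequence like a paper ledger.
records = [
    {"id": 3, "name": "Ortiz", "state": "CA"},
    {"id": 1, "name": "Baker", "state": "NY"},
    {"id": 2, "name": "Chen", "state": "CA"},
]

# Sorting gives a new reading order without disturbing the entry sequence.
by_name = sorted(records, key=lambda r: r["name"])


def build_index(rows, field):
    """Map each value of `field` to the rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return index


# Two alternate indices over the same records.
state_index = build_index(records, "state")
id_index = build_index(records, "id")

print([r["name"] for r in by_name])            # ['Baker', 'Chen', 'Ortiz']
print([r["name"] for r in state_index["CA"]])  # ['Ortiz', 'Chen']
print(id_index[2][0]["name"])                  # Chen
```

The same three rows can now be reached by name order, by state, or by identifier, none of which was practical with a paper ledger written in entry sequence.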

But what about unstructured data? Document imaging systems allowed a document's image to be referenced by a few indices (perhaps name, date, and something else). But the content was not reference-able in its graphic form. Then came the web, and search engines. Wherever unstructured textual information is digitized, and placed on uniquely-identifiable pages, search engines could help us find it. Google has done much to revolutionize the finding of text.

And there appear to be some engines which can find images similar to each other. Whether the next generation of search engines can find meaning (rather than specific words or text strings) remains to be seen.

5. Understandable

Data and information can be recognized as existing, and the meta-data (source, time of observation, etc.) may be known, but the content of the information may not be understandable. Drawing usable meaning from data requires lingual, technical, and cultural familiarity. The more cryptic (or coded) the data is, the more culture must be "wrapped" around it, often supplied by the analyst or interpreter.

The very existence of data (or communication) may itself be useful even if it is not understood. A friend of mine told of his military experience working at a "listening post" facility high on a mountain in Turkey, with a good radio "view" into the Soviet Union. They listened to VHF radio signals, often voice, and though he spoke no Russian, he could note the time, frequency, duration, number of voices, etc. This kind of "metadata" was useful (known as traffic analysis) even if the content was not understood--although he probably tape recorded what he heard over the radio channel.

I frequently use the formula, "data plus context yields information". The context can make raw data understandable. Two lanterns hanging in a Boston church steeple are merely data; their meaning (profound meaning!) was only understood in the context of previously established values, plans, and codes by the "rebels" in the Revolutionary War.
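The formula can be illustrated with a toy lookup, in which the same raw datum yields meaning only when a previously agreed codebook is available. The `SIGNAL_CODE` table and the `interpret` function are hypothetical names for this sketch; the codebook follows the popular "one if by land, two if by sea" account of the lantern signal.

```python
# Hypothetical codebook agreed in advance: lantern count -> meaning.
SIGNAL_CODE = {
    1: "British troops moving by land",
    2: "British troops moving by sea",
}


def interpret(datum, context):
    """Return the meaning of a raw datum under a context, or None if the context does not cover it."""
    return context.get(datum)


print(interpret(2, SIGNAL_CODE))  # British troops moving by sea
print(interpret(2, {}))           # None
```

To an observer without the codebook, the two lanterns remain data; the second call returns nothing because no context is supplied.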

The information and data held in the human brain are stored with context--sometimes vital context which makes the fact, out of context, of little value. The human manages to integrate and relate all this information in a variety of intuitive ways--not at all in a tabular manner, thus going far beyond the ways we can relate or process tabular data in the relational model. The sole proprietor entrepreneur is far better at integrating his personal knowledge about his products and customers (and thus making profitable business decisions) than any CRM system could do based upon tabular data (said data being woefully incomplete, compared with the nuances of the sole proprietor's memory).

6. Of sufficient quality

There are a variety of specific measures of quality. They include...

  • Presence of record. Is there a record present for an instance in reality? If not, then the entire row of data is missing.
  • Presence of data in cell. While the row may be present, the cell may not be populated when it ought to be.
  • Validity of a fact. Does the value of a fact (a cell in a row) conform to some rules? E.g. if the field is called "STATE" is the value a valid state code? This concept is sometimes achieved through referential integrity. In tabular data, this is relatively easy to determine through Boolean tests.
  • Reasonableness of a fact. While the value in a cell may be valid, it may not be deemed reasonable, in context of peer data, or other facts in the same record. E.g. the zip code is not consistent with the state code on an address.
  • Accuracy. A fact may be valid (a proper code), reasonable (in keeping with peer data), and still be flat out wrong--inaccurate. Generally, testing for accuracy can be very expensive.
  • Precision. Precision is different from accuracy. A numeric amount can be accurate to the dollar, but not precise to the penny. It is still accurate, in a sense, and useful in some analysis.
  • Consistent definition over time. A set of facts (a column) in a set of records (a table) should have consistent definition over the span and scope of the table. If they do not, then the definitional accuracy of the data (or more precisely, the meta-data) is lacking.
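Several of these measures lend themselves to simple automated tests on tabular data. The sketch below covers presence of data in a cell, validity, and reasonableness for one address row. Everything in it is an assumption made for illustration: the `check_row` function, the field names, and especially the reference tables, which are deliberately tiny and simplified rather than real state or zip code lookups.

```python
# Hypothetical reference data; a real system would use complete lookup tables.
VALID_STATES = {"CA", "NY", "TX"}
ZIP_PREFIX_TO_STATE = {"9": "CA", "1": "NY", "7": "TX"}  # deliberately simplified


def check_row(row):
    """Return a list of quality problems found in one address row."""
    problems = []
    state = row.get("state")
    zip_code = row.get("zip")

    # Presence of data in cell: is the cell populated when it ought to be?
    if not state:
        problems.append("state missing")
    if not zip_code:
        problems.append("zip missing")

    # Validity: does the value conform to a rule (here, a list of codes)?
    if state and state not in VALID_STATES:
        problems.append("state invalid")

    # Reasonableness: is the value consistent with other facts in the row?
    if state in VALID_STATES and zip_code:
        expected = ZIP_PREFIX_TO_STATE.get(zip_code[0])
        if expected is not None and expected != state:
            problems.append("zip inconsistent with state")

    return problems


print(check_row({"state": "CA", "zip": "90210"}))  # []
print(check_row({"state": "ZZ", "zip": "90210"}))  # ['state invalid']
print(check_row({"state": "NY", "zip": "90210"}))  # ['zip inconsistent with state']
```

Note what the sketch cannot do: a row that passes all three checks may still be inaccurate, because accuracy can only be confirmed by comparing the recorded value with reality, which is why that test tends to be expensive.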

Ideally, the quality of data is something that can be objectively measured, without reference to an intended use. And in designing systems and assessing data quality of latent and moving data assets, we do need to strive for that.

But the quality of data or information (as objectively measured) may or may not meet the specific needs of a decision or analysis.

7. Topically relevant to needs

A final issue must be mentioned, and that is the relevance and usability of data and/or information to a particular need. Data which is not relevant can be distracting, or actually be considered "noise". Advertising is a culturally sanctioned, structured form of noise. (Graffiti, another form of advertising, is generally not sanctioned by society.) Ideally, we know what is an ad when we see it, and we can choose to "tune it out"[3]. But noise can be used deliberately to obscure significant data, as can dis-information. There were several such situations in World War Two (such as the fake army radio messages just prior to the Normandy invasion), and there have probably been many since then that we don't know about.

Needs for information may be known or unknown. "Write this down; you are going to need this someday!" is our way of adapting to one assessment of our anticipated needs. Many bureaucracies have people who save everything ("We might need this someday."). They may be considered pack rats, but many a crime mystery is supplied with essential clues by the pack rat who says, "Well, actually, I did save that back here." Of course, saved data is of no utility if it cannot be found--back to navigability!

These seven characteristics of data and information, when understood and applied as sequential tests, allow us to determine its utility and value for particular business and social needs and expectations.


[1] The physicist, Werner Karl Heisenberg, postulated that there are limits to what we can measure and know about the movement and behavior of some subatomic particles. His Uncertainty Principle was influential in the development of quantum mechanics and ultimately the development of nuclear energy. He won the Nobel Prize for his accomplishments.

[2] Interestingly, surveillance videos in the London Underground system were reviewed again after the tickets (which carry magnetic coding) used by the 2005 bombers were traced back to previous uses of those same tickets. By establishing the prior dates and times when the terrorists had used the Underground, authorities were able to find more pictures of them, and determine that they had rehearsed their actions.

[3] If the appearance of the ad, in text and layout, is not readily distinguishable from editorial content, many magazines must label the ad ("Advertisement") at the top of the page.

© Copyright 2005 Neil Michael Scofield All rights reserved.

Michael Scofield -

Michael is a widely known speaker and author on data quality and semantic data integration. He has held data architecture and data quality management positions in banking, finance and education. He has taught workshops for numerous organizations including the information quality conferences, numerous DAMA chapters, The Data Warehousing Institute, the Institute of Internal Auditors, chapters of the Quality Assurance Association, the Enterprise Data Forum, European Meta-data Conferences, the Association for Computing Machinery, numerous DBMS user groups, and business intelligence tool conferences. His articles appear in numerous professional journals, and he writes occasional humor for the Los Angeles Times and other magazines. Michael can be reached at NMScofield@aol.com.