TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Taking Inventory of the Unstructured World

by Bill Inmon, Krish Krishnan
Published: May 1, 2009
The corporate document catalog is a good start for getting your hands around all of the important unstructured information in your corporation.

In most companies, there is a wealth of unstructured textual information. There are documents of many kinds found in many places. There are reports. There are articles. There are spreadsheets. There are contracts. In short, there are many documents of many types in many places in the corporation.

 
Intuitively, the organization knows that it ought to be doing something with these documents. Trying to find a document six months after it has been written is no small task. Trying to gather documents for a cost justification or for litigation support is not trivial. Yet documents are like small minnows in the water. They keep multiplying and they are slippery to catch.

Managing Your Corporate Documents

Trying to manage corporate documents is like trying to catch the wind. Most corporations have never even attempted to manage their corporate documents. Yet some of the most valuable information the corporation has is found in documents.

Not all corporate documents need to be managed. Many informal documents and presentations do not warrant the attention of management. But many documents do need management. Many corporate documents represent official pronouncements and statements of obligations and expectations by the corporation.

The Document Inventory

A good first step for an organization that wants to proactively manage its documents is to create a corporate document inventory. In creating an inventory, the organization looks at and catalogs its existing documents. In some organizations, there are literally hundreds of thousands of documents. Building a “card catalog” of the documents that belong to the organization is an excellent start to managing the corporate collection of documents.

Libraries have long used a card catalog to great effect. Libraries know that looking through an entire library with all of its books is a colossal waste of time. Realistically, without the card catalog, a large library would be close to unusable. When a person is looking for a book in the library, the most efficient way to find it is to use the card catalog. With the card catalog, the reader can quickly scan through all the possibilities. Upon finding the one or two books that look the most promising, the reader is then directed by the card catalog to the location of the book. It is no different with the documents that belong to the corporation.

So what should an inventory of corporate documents, a corporate card catalog, contain? The likely contents of the corporate card catalog include:

  • A title or brief description of the document,
  • A measurement of the size of the document,
  • The date the document was created,
  • The date the document was last changed,
  • The date the document was last accessed,
  • The system path of the document, and
  • A classification of the document type.

All of these components of the card catalog are useful. Indeed, some of the elements of the card catalog, such as size, dates and system path, are found in the metadata of the document. But not all card catalog elements are found in the metadata of the document. Perhaps the most useful of the card catalog elements is the document classification, which has to be determined from the content of the document.
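As a rough illustration of how such a catalog record might be assembled, the following Python sketch fills the size, date and path fields from file system metadata and leaves the title and classification to be supplied separately. The names `CatalogEntry` and `catalog_entry` are hypothetical and chosen only for this example, and the sketch assumes the documents are ordinary files on a file system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class CatalogEntry:
    """One card in the (hypothetical) corporate card catalog."""
    title: str
    size_bytes: int
    created: datetime
    modified: datetime
    accessed: datetime
    path: str
    classification: str


def _to_datetime(seconds):
    # Convert a POSIX timestamp to a timezone-aware UTC datetime.
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def catalog_entry(path, title=None, classification="unclassified"):
    """Build a catalog entry from file metadata.

    The title defaults to the file name without its extension, and the
    classification is a placeholder until the document has been read.
    """
    p = Path(path)
    st = p.stat()
    # A true creation time (st_birthtime) is only available on some
    # platforms. Elsewhere, fall back to st_ctime, which on Unix is the
    # time of the last metadata change rather than the creation time.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return CatalogEntry(
        title=title or p.stem,
        size_bytes=st.st_size,
        created=_to_datetime(created),
        modified=_to_datetime(st.st_mtime),
        accessed=_to_datetime(st.st_atime),
        path=str(p.resolve()),
        classification=classification,
    )
```

Note that the last-accessed date is only as reliable as the file system that records it; some systems are configured not to update it on every read.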

Document Classification

Documents can be classified in many ways. Consider an oil company. The business of the oil company can be roughly divided into the sectors of “upstream,” “midstream” and “downstream.” Upstream refers to exploration and production. Midstream refers to transportation and storage, such as pipelines. Downstream refers to refining and distribution. Each document that belongs to the oil company can be read and classified according to the general category of information it refers to. The document can be an “upstream” document, a “midstream” document or a “downstream” document.

Or consider manufacturing. In manufacturing, there is the process of handling raw goods, assembly, managing work in process, finishing a product, and shipping or storing the product. Documents for manufacturers can be classified according to the aspect of manufacturing to which they best apply.

Classifying the content of the document is a jump-start for the analyst looking through the many documents that belong to the corporation.
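One simple way to picture classification is as keyword matching against a list of terms per category. The Python sketch below is only an illustration under that assumption: the keyword lists and the `classify` function are hypothetical, and a real classification effort would likely rely on a curated taxonomy or on a person reading the document.

```python
import re
from collections import Counter

# Hypothetical keyword lists for the oil company example.
CATEGORY_KEYWORDS = {
    "upstream": {"exploration", "drilling", "seismic"},
    "midstream": {"pipeline", "transport", "storage"},
    "downstream": {"distribution", "retail", "marketing"},
}


def classify(text, keywords=CATEGORY_KEYWORDS, default="unclassified"):
    """Return the category whose keywords occur most often in the text."""
    # Count every lowercase alphabetic word in the document.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each category by the total occurrences of its keywords.
    scores = {
        category: sum(words[k] for k in terms)
        for category, terms in keywords.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, leave the document unclassified.
    return best if scores[best] > 0 else default
```

For example, `classify("Seismic survey results for the new drilling site")` yields `"upstream"`, while a document that mentions none of the listed terms is left as `"unclassified"` for a person to review.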

Creating the inventory of corporate documents is the first step in managing the unstructured environment. Stated differently, without a corporate card catalog, the world of unstructured data is a massive blob of ambiguity.

After the inventory is made, the next step is to read the documents and create a corporate index of those documents. The index doesn’t just reflect the document classification; the index goes into the details of every word in every document. There are many and varied aspects to the creation of an index. Some of the aspects are:

  • Looking at and managing documents in different languages,
  • Classifying the content of documents so that there is a “higher” level of abstraction for each word and each concept in each document,
  • Taking information found in documents and organizing that information so that textual analytics can be supported, and
  • Organizing the information found in documents and structuring it so that it can be queried along with structured information.
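To make the idea of an index that covers every word in every document concrete, here is a minimal inverted-index sketch in Python. It is an assumption about one possible shape for such an index, not a description of any particular product: the function names `build_index` and `search` are hypothetical, and the simple `\w+` tokenization would need language-specific handling for documents in many languages.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the documents, and word positions, where it occurs.

    `documents` is a mapping of document id (for example, a path) to text.
    The result has the shape {word: {doc_id: [position, ...]}}.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, *words):
    """Return the ids of documents that contain every one of the words."""
    doc_sets = [set(index.get(word.lower(), {})) for word in words]
    return set.intersection(*doc_sets) if doc_sets else set()
```

With two small documents, `{"a.txt": "Pipeline maintenance contract", "b.txt": "Contract for drilling services"}`, a search for `"contract"` returns both ids, and a search for `"contract", "drilling"` returns only `"b.txt"`.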

Indeed, there are many different aspects to creating the corporate card catalog.

One of the challenges is that of dealing with different document types. Some documents are short (emails). Some documents are long (patents). Some documents are full of technical jargon (medical or legal documents). Some documents are full of slang (chat logs). The corporate card catalog needs to be able to accommodate ALL the different kinds of documents.

The corporate document catalog is a good start for getting your hands around all of the important unstructured information in your corporation.

Bill Inmon -

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!

Krish Krishnan -

Krish is a recognized expert worldwide in the strategy, architecture and implementation of high-performance data warehousing solutions. He is a visionary data warehouse thought leader and an independent analyst who writes for trade publications and speaks at industry-leading conferences and user groups. He has authored two eBooks and more than 75 articles, viewpoints and case studies on business intelligence, data warehousing, and data warehouse appliances and architectures. In his 19-plus years of professional experience, he has been solving complex architecture problems spanning all aspects of data warehousing and business intelligence for Fortune 1000 clients. He has designed and tuned some of the world’s largest data warehouses.

The Vice President of Strategy at Chicago Business Intelligence Group, Krish teaches regularly at TDWI, DAMA, IRM UK and other conferences, and is helping drive and mature the data warehouse appliance market. Krish also serves as Associate Vice President of Programs for DAMA Chicago and is Ethics and Governance Advisor to DAMA International.

Editor's Note: More articles and resources are available in Krish's BeyeNETWORK Expert Channel. Be sure to visit today!