Self Organizing Maps (SOM’s) – Visualizing Textual Data
Published: July 1, 2006
Published in TDAN.com July 2006
Numeric data has been visualized for years. Business Intelligence has pie charts, colored graphs, graphs over time, multidimensional analysis, Pareto charts, and so forth. It is visualization that brings out the reality and contrast in a large collection of numbers.
But numeric data is not the only kind of data there is. The counterpart to numeric data is textual data. Textual data is found in many places, but nowhere more prominently than in the world of unstructured data and processing. Unstructured data consists of emails, email attachments, .pdf files, spreadsheets, PowerPoint files, text files, document files, and many more file types. Unstructured data runs the informal part of the organization while structured data runs the formal part. It is a good bet that as many business decisions are made in the unstructured environment as in the structured environment.
But getting a handle on unstructured data is difficult. Business Intelligence technology is a misfit because it is best suited to the display of numeric data, while unstructured data is made up of text. Trying to feed text to Business Intelligence is like trying to plug an AC electrical device into a DC wall connection. If it works at all, it will probably fry your device once current starts to flow. In a word, AC does not mix with DC, and textual visualization does not mix with numerically based Business Intelligence.
So where is there a need for assimilation, organization, and ultimately visualization of textual data? There are actually many places where there is such a need.
THE CHALLENGES
So how does the organization cope with the challenges of visualization of unstructured data? Consider the issues facing the end user in addressing unstructured data:
In many environments the end user faces a massive number of unstructured documents. There may be MILLIONS of them. The end user cannot read them all; there simply are not enough hours in the day. And even if the end user could read them all, there is no way the end user could remember everything that was read. The limits of the human brain rule out manual reading as an effective way to process a large library of unstructured documents. That is exactly what some organizations are up against when it comes to making sense of a mass of unstructured documents.
Not only do organizations face the challenge presented by a massive number of unstructured documents, but they also face the need for speed in processing those documents. On occasion, some information must be found quickly. If there were only a few unstructured documents, the end user could perhaps quickly find what was being sought. But when there is a massive number of unstructured documents, looking through them is laborious, exactly what you don't want when you need speed. When speed of retrieval is needed, there has to be an automated way of examining a large body of unstructured documents.
Where there are many unstructured documents, accuracy can become an issue in the retrieval of data. If a manual approach is used, it is simply a fact of life that human memory is both fuzzy and limited. The more documents a human tries to ingest, the fuzzier the memory becomes for any one document. After a large number of documents have been read, it is a wonder that anything is retained about any one document. Accuracy fades quickly in the face of the need to understand a large body of unstructured information.
Another important aspect of understanding a body of unstructured documents is relating the documents to one another. It is one thing to understand a number of individual unstructured documents. It is quite another to be able to relate those documents to each other. In many cases the relationships among unstructured documents form a different and much more powerful picture than the documents taken individually. And the more unstructured documents there are, the more challenging it becomes to see the larger picture formed by the documents and their relationships.
As if there were not enough challenges in finding information in a large body of unstructured documents, there is also the challenge that much of the work done against such a body is heuristic in nature. In a heuristic mode, the next step in processing is determined by the results obtained in the current analysis. The net result of heuristic processing is jumping all over the body of unstructured documents. The first analysis concentrates in one place. The next analysis concentrates somewhere else. The third iteration of analysis goes somewhere else yet again, and so forth. Doing heuristic analysis manually for a large body of unstructured documents is very, very difficult.
No wonder, then, that business analysts facing a large body of unstructured documents are so frustrated. For these reasons and more, manual analysis is simply not feasible.
SELF ORGANIZING MAPS - SOM'S
With modern visualization tools, you can now produce SOMs (self organizing maps) of unstructured data and documents. The SOMs that are produced solve ALL of the problems of visualizing unstructured documents and data. Now, in one place, you can look at unstructured data and documents as you have never looked at them before. With a properly constructed SOM, you can look at -
There are several very valuable things that a SOM can do for you. One is to show correlation of data: SOMs show text that is correlated to other text. In the medical field, where the work involves medical records, this ability to correlate is very attractive.
Another thing SOMs can do is enable textual drill-down processing. In textual drill-down processing, the analyst goes from one level of analysis to a lower level of analysis until the specific detail is found.
Depending on how the data is arranged and integrated, SOMs also support qualified analysis of data. First the analyst looks at records for employees. Then the analyst looks at records for women employees. Then the analyst looks at records for women college graduates. Then the analyst looks at women college graduates who are older than 50, and so forth. If the unstructured data is properly conditioned and edited, SOMs can yield insightful analysis for the selection of qualified data.
Founded in Colorado by Bill Inmon, Guy Hildebrand and Dan Meers, Inmon Data Systems (IDS) is a software company dedicated to the proposition that there needs to be a bridge between the worlds of structured data and unstructured data. IDS has foundation technology that allows unstructured data to be brought into the structured environment and, once there, integrated into it.
Applications - unstructured visualization
IDS is located in Castle Rock, Colorado. Contact IDS at 303-973-3788 for further information.
Bill Inmon -
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772. Editor's note: More Bill Inmon articles, resources, news and events are available in the Business Intelligence Network's Bill Inmon Channel. Be sure to visit today!