Textual Analytics: Business Intelligence from a Textual Foundation
Published: April 1, 2007
Published in TDAN.com April 2007
Analytics have been around from the time the first computer program was written. Once corporations began to generate data, there were financial analysts, sales analysts, marketing analysts and others eagerly waiting to use that data in novel and creative ways. In the early days, data from applications was hard to come by, and the tools the analysts used to access and analyze the data were crude. As time passed and the volume of data grew, so grew the opportunity to use analytics to compete in the business arena.
Over time, the world discovered the data warehouse as a foundation for analytic processing. The data warehouse contained integrated, historical and granular data that was gathered from a host of legacy systems. The data warehouse proved to be an ideal foundation for the analysis of data. Data from the data warehouse was predictable and easy to access; and because data in the data warehouse was granular, it could be reshaped for many different purposes.
Numerical Data - A Fundamental Limitation
Over time, it was recognized that business analysis - analytics - had a very fundamental limitation: analytics operated only on numerical data. While analysis of numerical data was quite useful, corporations in fact have massive amounts of data that are not numerical. There exist massive amounts of unstructured textual data - from e-mails, medical records, contracts, warranties, reports, call centers, and so forth. In fact, most estimates show that 80% of the data in the corporation is in the form of text, not numbers.
There is a wealth of information in that textual data, but there are some problems with this unstructured, textual data. Textual data is not as neatly organized or as accessible as numerical data, and it does not lend itself to easy analysis because the software and technology used for business analytics are almost 100% dedicated to handling well-structured numeric data.
The very disorder of the textual data defeats (or at least greatly hampers!) any attempt at accessing and analyzing it in any sort of meaningful manner. However, there is technology that is designed for textual analysis, such as Inmon Data Systems Foundation software. The following discussion of textual analytics makes free use of the many patents IDS (Inmon Data Systems) has on the process of doing textual analytics.
Textual Analysis and Search Engines
When the subject of textual analytics arises, it is natural to think of search engines such as Google and Yahoo, among others. While a simple search of raw text can be considered a crude form of textual analytics, there are, in fact, many limitations to a simple textual search. In order to do textual analytics in a sophisticated manner, the unstructured textual data must first be integrated. If raw text is not integrated before it is analyzed, the search of the raw text will produce sketchy and questionable results. Therefore, the first step in textual analytics is the integration of the raw text. Several steps are required for raw text to be integrated and made fit for analysis; a number of them are described in the sections that follow.
(For an in-depth treatment of the technology needed for textual integration, please refer to the white papers on the subject available from Inmon Data Systems.)
Integrating Raw Text
In short, in order to do analytics on text, the raw text must first be integrated. After the raw text is integrated, textual analytics can be done. Searches are done against raw textual data, whereas textual analytic processing is done against integrated text. A search can be something as simple as "Tell me where the term Katherine Heigl is mentioned." In this case, the search goes to the source, or to an index created from the source, and looks for the term or part of the term that has been specified. An analytical treatment of text might be "Tell me about all the places where terms and information relating to Sarbanes Oxley can be found."
The need for textual integration may not be obvious at all. To illustrate its importance, consider the following. Suppose a medical file needs to be analyzed. In the medical file is the term "ha." If the raw data is searched on "ha," there are many entries. Because "ha" means little or nothing to the layman, a search on "ha" is questionable. However, if the raw data is integrated before being searched, then for all cardiologists the term "ha" is converted to "heart attack." For all endocrinologists, the term "ha" is converted to "hepatitis A," and for all general practitioners, the term "ha" is converted to "headache." After the conversion is completed, there is no questionable term "ha" to be dealt with, and patients with heart attacks, headaches and hepatitis A are not grouped together. From this simple example (and there are plenty more cases of textual data needing to be clarified before being analyzed), it is seen that integration of text unlocks the text so that effective textual analytics can be performed.
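The "ha" conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table `EXPANSIONS`, the function name `expand_abbreviations`, and the idea of passing the physician's specialty as a plain string are all hypothetical and are not taken from any Inmon Data Systems product.

```python
import re

# Hypothetical lookup table: (specialty, abbreviation) -> expanded term.
# The three entries mirror the "ha" example in the text.
EXPANSIONS = {
    ("cardiologist", "ha"): "heart attack",
    ("endocrinologist", "ha"): "hepatitis A",
    ("general practitioner", "ha"): "headache",
}


def expand_abbreviations(text, specialty):
    """Replace whole-word abbreviations using the author's specialty as context."""

    def replace(match):
        token = match.group(0)
        # Unknown tokens (or unknown specialties) are returned unchanged.
        return EXPANSIONS.get((specialty, token.lower()), token)

    # \b\w+\b visits each whole word, so "ha" inside "that" is never touched.
    return re.sub(r"\b\w+\b", replace, text)


note = "Patient reports ha after exercise."
print(expand_abbreviations(note, "cardiologist"))
print(expand_abbreviations(note, "general practitioner"))
```

The same note yields different integrated text depending on who wrote it, which is why a later query on "heart attack" does not also pull in headaches.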
However, the example provided is not the only reason for the need for textual integration as a foundation for analytics.
Searching for Categories of Text
Suppose there is a body of text about ranching. Part of the body of text relates to horses. In some cases, the type of horse is discussed. In other cases, the age and maturity of the horse are discussed. In other cases, the gender of the horse is addressed. Now suppose that there is a desire to do analytical processing against this document or set of documents on ranching, and in particular to see information about horses. One way - the search engine way - to look at horses is to look for colts, then to look for ponies, then to look for studs, and so forth. The searcher must know beforehand what is being sought. Then the searcher must be able to gather all the information about horses together. Searching for such a wide variety of information is tedious.
A better approach - the integrated text approach - is to gather all information about horses into a common category. The integration process then identifies all the places in the text where those pieces of information about horses exist. The kind of information that is returned when looking at integrated textual information about horses might include palomino, stud, mare, bridle, saddle, types of hay, fencing, horse whispering, gait, racing, gelding, and so forth. Now when the textual analyst wants to know information about horses, the analyst simply queries on the category - horses - and all information relating to horses is returned. Note just how different the results of a query are when done using textual analytic processing rather than a simple search. By integrating the raw data, the textual analyst has prepared the data for effective textual analytical processing.
Recasting Textual Data
There are many forms of textual integration that can set the stage for effective textual analytical processing.
A simple example of another form of integration that raw unstructured text needs before text analytics can be done is the recognition that there are multiple spellings of words, especially names. By recognizing that there are multiple spellings of the same name, the text analytical processor will not miss mentions when the name is spelled differently. When a simple search engine is used, the search may fail to pick up important information about the name entered in the search because a variation of the name is used.
Stemming Raw Text
The need for integrated text only begins with the simple examples that have been described. Another way that textual data needs to be integrated is by operating at the level of the Latin or Greek stems of words. Latin-based words tend to have similar but not quite identical spellings. If a search is made literally, the search will not recognize that one word is related to another when they are not spelled exactly the same. As an example, consider the word "move." Some of the different forms of the word "move" are moved, moving, mover, moves, remove and removed. If an effective analysis of the text is to be done, it must be recognized that words that have the same stem need to be considered as the same word.
Indeed, there are many other considerations in the discipline of integrating text. Some of them include screening text to see if it is business relevant, punctuation removal, case sensitivity (or insensitivity), and so forth.
The Scope of the Search and Analysis
One of the challenges of a search engine is the scope of the material accessed and analyzed by the query. A search engine is capable of drawing on vast amounts of source material (such as the Internet). A textual analytical tool, on the other hand, must draw upon data that it has access to and can manipulate.
In other words, because textual analytics requires a serious amount of preprocessing in order to integrate the data, textual analytics is performed on a much smaller amount of data than searches are. It does not make sense for a search engine to integrate data before doing a search because the search engine does not have ownership and control of the data that is being searched. Textual analytical tools, on the other hand, typically operate on data from the corporation, so there is the opportunity to access and integrate corporate data before textual analytics occurs.
A Simple Query
In addition to the standard search queries that the textual analyst needs to do, there are many different kinds of queries that the textual analyst submits. One of the simplest is the query by class of data. Consider that a textual analyst has submitted a query for the category of financial information. The query for financial information includes many different terms, each of which relates to finance. Some terms that relate to finance include stock, share, equity, warrant, profit and so forth. The query is submitted by a reference to finance. The results of the query are a reference back to each place where a term related to the query is found. This type of query is sometimes called an indirect query or a query by category.
Sophisticated Queries
An important type of query submitted by a textual analyst is a query looking for basic occurrences of information. Consider that a query has been made looking for all occurrences of the word "water." Upon finding a reference to water, the next step is to retrieve the specific text preceding and following "water." These textual references to water, along with their immediate surrounding text, are called "snippets."
A Snippet Search
By looking at each of the snippets, the analyst can determine the context of the word that has been sought.
Snippets are most useful for determining the context of a particular word. The term "water" can refer to quite different things - for example, a water table, a watermark, sea water that is menacing, and Waterford crystal.
A Proximity Search
Another type of query that the textual analyst sometimes needs to submit is a proximity query. In a proximity query, one or more documents are searched for two or more words that reside in the same document within a predetermined proximity of each other. Of course, proximity analysis can be done for lists of words as well as individual words.
Relating Textual Data to Structured Data
Another form of textual analytics relates textual data to structured data. Consider demographic data from a customer as it relates to the communications from the customer. The e-mails that a customer has sent can be attached to the customer. By merging textual information with structured information, a true 360-degree view of the customer is achieved. Stated differently, when an organization has only demographic information about a customer, that is hardly a 360-degree view of the customer. Customer communications are required in addition to the demographic information.
Textual Visualization
Another form of textual analytics that is extremely valuable is the visualization of text. In a visualization of text, integrated text is ingested and clustered in order to find correlations and relationships between words and phrases. The text from the documents is integrated and then lifted into a work area. In the work area, the integrated text is clustered into what can be termed themes. The themes are then displayed in a visualization called a SOM, or self-organizing map. The clustering of data in a SOM has many uses.
Some of those uses are identifying correlations of data, identifying the major themes of data, organizing data so that major themes of data are obvious, and so forth. SOMs can be created for very large amounts of data and for smaller amounts of data. Furthermore, SOMs can be used to look across whole vistas of information - looking at thousands of documents at a time. It is seen then that textual analytics is a very different subject from search engine processing. Very different results are achieved by textual analytics.
Bridging the Gap
One of the keys to creating an effective textual analytics environment is being able to access unstructured data in a structured format. In other words, if you want to use BusinessObjects or Cognos against unstructured text, you have to put the unstructured data in a form that is useful to BusinessObjects or Cognos. This means that the unstructured data - after it is integrated - must be restructured into a relational format: the textual information is placed into a structured format where there are recognizable relational fields in a predictable layout. Once unstructured data has been transformed into the relational format, the standard analytical tools can be applied.
But there are some subtleties which are important. Consider what happens when more than one record is converted into a relational format. Consider that the drug Metformin has been specified for Carol Teal. Yet when the unstructured record for Carol Teal is read, there is no such drug specified. Instead, it is seen that Carol takes Glucotrol. The software - under the guidance of the analyst - has translated Glucotrol to Metformin as part of the transformation process. The ability to recognize and translate text is an important capability in preparing for textual analytics. In addition, the analyst has specified that generalizations (or categorizations) be made on the raw text.
For example, whether or not a patient is being treated for Type II diabetes is analyzed. Based on the textual data that is found, a patient can be classified as being or not being a Type II diabetic. By translating data and by classifying it, then putting the data in a relational format, the end user is prepared to do analytical processing on text.
Accessing Integrated Textual Data Placed in a Relational Database
The first step in integrating raw data for textual analytics is to create the infrastructure that supports textual analytics. However, once that infrastructure is built, it then remains to put the infrastructure to good use. This section of the paper covers the usage of the integrated textual infrastructure once that infrastructure is built. Assume that you have a relational database that has been built from unstructured text and that the text has been integrated. The database is in a relational format and can be accessed by standard industry analytical tools such as BusinessObjects, Cognos, MicroStrategy, Crystal Reports and others. The access to the database is through standard SQL. There are some basic ways the data can be accessed. These ways are:
A simple search. A word or phrase is given to the software and the database is examined. Take the word "water." A search of this type would find every occurrence of water.
A simple search of context surrounding a word (a "snippet"). Take the word "water." A context search gathers the text before and after the word being sought. Suppose a context search was done for "water." The results might look like: "...she held the Waterford crystal in her hands...," "...the football players welcomed the waterboy, as Gatorade was passed..." and "...was it a mirage or real water? He couldn't see beyond the..."
An indirect search. A search is done for items that belong to a class or category of information. For example, an indirect search on Sarbanes Oxley might return these results: "...revenue recognition...," "...promise to deliver...," "...conditional sale..." and "...delayed delivery..."
Proximity search for words. Are the two words "water" and "television" found in a document within 200 bytes of each other? An example of a result might be: "...Waterworld was advertised on television last night..." and "...she spilled water on the television set accidentally..."
Alternate spellings search. As an example, finding all the places where "Osama bin Laden" is mentioned would yield: "...lead me to Usama bin laden or else...," "...huddled in a cave, Osama ben ladeen drank tea and said prayers..." and "...the Muslims adore Abu ben laden, more every day..."
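Three of the access paths listed above (the snippet search, the indirect search and the proximity search) can be sketched with Python's standard sqlite3 and re modules. This is a minimal sketch under stated assumptions, not a description of any vendor's implementation: the `term_occurrence` table, its columns, the sample rows and all function names are hypothetical, and the sketch assumes the integration step has already written one row per term occurrence with its byte offset and (optionally) its category.

```python
import re
import sqlite3


def build_demo_db():
    """Create an in-memory table with one row per term occurrence (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE term_occurrence ("
        "doc_id INTEGER, byte_offset INTEGER, term TEXT, category TEXT)"
    )
    rows = [
        (1, 10, "water", None),
        (1, 150, "television", None),
        (2, 5, "water", None),
        (2, 900, "television", None),
        (3, 40, "revenue recognition", "Sarbanes Oxley"),
        (3, 300, "conditional sale", "Sarbanes Oxley"),
    ]
    conn.executemany("INSERT INTO term_occurrence VALUES (?, ?, ?, ?)", rows)
    return conn


def indirect_search(conn, category):
    """Indirect search: return every occurrence of any term filed under a category."""
    cur = conn.execute(
        "SELECT doc_id, byte_offset, term FROM term_occurrence "
        "WHERE category = ? ORDER BY doc_id, byte_offset",
        (category,),
    )
    return cur.fetchall()


def proximity_search(conn, term_a, term_b, max_bytes=200):
    """Proximity search: documents where both terms start within max_bytes of each other."""
    cur = conn.execute(
        "SELECT DISTINCT a.doc_id FROM term_occurrence AS a "
        "JOIN term_occurrence AS b ON a.doc_id = b.doc_id "
        "WHERE a.term = ? AND b.term = ? "
        "AND ABS(a.byte_offset - b.byte_offset) <= ? "
        "ORDER BY a.doc_id",
        (term_a, term_b, max_bytes),
    )
    return [row[0] for row in cur.fetchall()]


def snippets(text, word, width=30):
    """Snippet search: each case-insensitive match plus `width` characters on either side."""
    found = []
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        found.append(text[start:end])
    return found


conn = build_demo_db()
print(indirect_search(conn, "Sarbanes Oxley"))
print(proximity_search(conn, "water", "television"))
print(snippets("She held the Waterford crystal in her hands.", "water"))
```

Because the occurrences sit in an ordinary relational table, the same two SQL statements could in principle be issued from any reporting tool that speaks SQL; the snippet function works on the document text itself, since the surrounding words are not stored in this simplified table.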
These are merely some of the analytical forms that can be taken based on unstructured data placed in a relational database. Textual analytics can be done by searching whole masses of documents or looking at just one document. Textual analytics can be as simple as looking for one word or looking for whole categories of words and phrases. Textual analytics can look for the context surrounding words.
The Value Proposition
How do these forms of textual analysis lead to business advantage? The general answer is that an infrastructure of integrated unstructured data placed in a database and accessed by analytical tools gives the corporation advantages that it never had before in that information coming from the textual environment is now readily available. Now, decision makers can ask questions that were never before possible. In order to posit some of these questions, consider the following industries and functions within industries. (Note: a term in quotation marks is taken to mean a generic term.)
E-Mail/Call Center Administration
Contract Administration
Warranties Administration
Over the past three years, have there been any noticeable patterns in the exercise of warranties? Any pattern in products failing? In the type of customer exercising a warranty? Any seasonality?
Medical Healthcare Administration
Insurance Claims Processing
Documentum (for anyone who has Documentum)
Scientific
In Summary
This paper has addressed the subject of textual integration and business intelligence (BI) operating on textual data. Raw text must first be integrated. The process of integration has many facets. Once integrated, the raw text is placed in a relational database. Once the raw text has been placed in a relational database, it can be accessed and analyzed by standard BI tools. The analysis can take many forms. Forms of textual analytics include visualizations, in the form of a SOM where integrated text is clustered.
Bill Inmon -
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.