Unsupervised Adaptive Clustering for Data Prospecting and Data Mining
Published: March 1, 1999
THE EMERGENCE OF DATA MINING
There was a time, well within living memory, when the dream of market analysts was to have as much data about their prospective customers as they could possibly get. It seemed clear that if one knew everything possible about a customer or group of customers, the correct marketing approach would become evident. Now that dream has not only come true, but the mass of data concerning us all has become overwhelming. Indeed, it can be said that the marketer's dream has become something of a nightmare. What has perhaps happened is that the ability to collect data rapidly has outpaced the ability to analyze it. What may have begun as an analytic backlog quickly became a veritable flood, resulting in the present-day amorphous mass of both meaningful and meaningless facts. What we now have is a data glut.
It is now clear that automated tools must be developed to help extract meaningful information from this morass of data. Moreover, these tools must be sophisticated enough to search for correlations that the user has not specified, as the potential for unforeseen relationships among the data is very high. A successful tool set will locate useful nuggets of information in the otherwise chaotic data space and present them to the user in a contextual format.
As an example, consider a marketing director trying to associate demographics in a database of one million records. Since multiple dependencies surely exist in the data, the marketer will quickly get lost in one or both of two ways: (1) overwhelmed by the sheer amount of data, the marketer finds the number of profiles and associations that emerge incomprehensible; (2) to avoid being overwhelmed, the marketer narrows the objectives, and critical relationships are lost as a result. This effect brings to mind the old adage of "being up to your eyebrows in alligators, with no ability to drain the swamp". Fortunately, a group of techniques now exists that finds real, usable information in the data glut and can present it to the decision-maker in a useful format. This set of tools uses a technique called "data mining" and supports the goal of knowledge discovery (finding sometimes unexpected information in the data). Data mining consists of a variety of both statistical and non-statistical techniques, and uses logical methods, neural networks, and some new unsupervised adaptive clustering techniques.
THE IMPORTANCE OF NON-LINEARITY
The fact is that we humans generated the data glut, and it continues to grow at an alarming rate. The data glut is indeed the source of the chaos and the masking of those elusive useful nuggets of information. Finding those nuggets is complicated by the fact that many of the relationships hidden within the data glut are non-linear. Humans do not behave linearly, but rather exhibit discrete patterns of behavior. Therefore, any attempt to extract information about human behavior using linear techniques will ultimately miss some, if not most, of the information contained in the data. The old joke about the average American family with 2.3 children is not without substance: none of us has fractional children. The problem is compounded when one realizes that distinct demographic groups, e.g. lifestyles, may exhibit similar external behaviors. A double-income, no-kids (DINK) family may eat at a pizza parlor as frequently as a single mother with several children. The motivating factors for these two distinct types of families are different, but the behavior is the same. In this case, a linear modeling technique would miss the appropriate correlation. Statistical methods, for the most part, depend on linearity. There are some statistical methods that account for and even take advantage of non-linearity, but they are highly sophisticated and only work well, at least at this juncture, in the hands of professional statisticians.
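The pizza parlor example can be made concrete with a small sketch. The household figures below are invented purely for illustration, and the helper names `pearson` and `mean_visits_by_children` are assumptions of this sketch rather than part of any particular tool. A linear correlation between number of children and pizza visits comes out at essentially zero, even though the group averages show a clear, discrete pattern.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_visits_by_children(rows):
    """Average pizza visits for each distinct number of children."""
    totals = {}
    for num_children, num_visits in rows:
        total, count = totals.get(num_children, (0, 0))
        totals[num_children] = (total + num_visits, count + 1)
    return {c: total / count for c, (total, count) in totals.items()}

# Hypothetical households: (number of children, pizza visits per month).
# Households with no children and with four children both visit often;
# households with two children rarely do.
households = [
    (0, 8), (0, 7), (0, 9),
    (2, 1), (2, 2), (2, 0),
    (4, 8), (4, 9), (4, 7),
]

children = [c for c, _ in households]
visits = [v for _, v in households]

r = pearson(children, visits)                      # linear view of the data
group_means = mean_visits_by_children(households)  # discrete-pattern view

print("linear correlation:", round(r, 6))
print("mean visits by number of children:", group_means)
```

The near-zero correlation would suggest that family size tells the marketer nothing, while the grouped view shows two quite different kinds of household behaving alike.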
ENTER NEURAL NETWORKS
A successful approach to modeling non-linear relationships has been the so-called neural network. These algorithms are the result of cognitive science's attempts to understand and mimic learning and memory in the human brain. Humans are pattern recognizers by design. We are naturally able to recognize discrete patterns in our environment, correlate them with events, and alter our behavior accordingly. Neural networks can, in a limited sense, do the same thing.
One particular neural network type, the back-propagation algorithm, has performed very well in this regard, and it is now accepted as a reliable method for data mining. However, back-propagation algorithms have their shortcomings. The major difficulty lies in the fact that the relationships between specific variables and the neural network results are, at best, difficult to explain. Validating unexplainable results can be a significant challenge. For example, it would be beneficial to understand something of the pattern/outcome correlation to assist in an overall marketing approach. Additionally, the scope of the neural network is limited by the range of data with which it was trained, so new patterns in the data are frequently classified incorrectly. Finally, the neural network is a supervised technique, requiring training by the user on a given set of data. A cause and effect relationship and historical outcomes must be known in advance.
Neural networks are of no help to the analyst when the relationships are undefined or may not even exist. How can one tell if a database is worth mining in the first place? Supervised techniques cannot address this issue. The search of data for undisclosed relationships is often referred to as knowledge discovery. In fact, knowledge discovery is the result of a successful data mining search without a priori knowledge of such relationships. Such a successful data mining search, in keeping with the mining metaphor, could be described by the term data prospecting. The goal of data prospecting is to gain an understanding of the actual patterns of behavior embedded in the data glut. After these patterns emerge, correlation of the patterns of behavior with specific events can take place in a straightforward fashion.
INTRODUCING UNSUPERVISED ADAPTIVE PATTERN RECOGNITION
There are now several useful tools that fall into the category of data prospecting. Unlike neural networks, these data prospecting tools do not require a priori knowledge of a relationship or historical outcomes. The only specification required prior to runtime is the definition of the variables that may influence a pattern. In the case of marketing analysis, these may be a set of demographic variables, or data relating to buying history, or both. Unsupervised algorithms then sort through the data and classify records in the database according to similarities in the patterns. These data prospecting techniques are, in fact, a method of clustering.
Data prospecting techniques are also adaptive or vigilant in that they can recognize novel patterns as they appear in the data. A major advantage of all of these methods is that the weights assigned to each variable are easily translated into real world values. If relationships exist between variables, they are easily expressed in normal, understandable terms. Data prospecting algorithms include Fuzzy Adaptive Resonance Theory, Lead Clustering or Feature Mapping, and American Heuristics' Adaptive Fuzzy Feature Map (AFFM, patent applied for). All of these techniques provide for easily understood pattern recognition and allow for mapping into a supervised technique.
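The general idea of unsupervised, adaptive clustering can be illustrated with a minimal sketch. The code below is a generic one-pass, leader-style clustering written for this illustration; it is not the AFFM algorithm or Fuzzy Adaptive Resonance Theory, and it does not claim to match any commercial tool. The function name `leader_cluster`, the distance threshold (which plays a role loosely analogous to vigilance), and the customer records are all assumptions. A record joins the nearest existing cluster if it is similar enough; otherwise it founds a new cluster, which is how a novel pattern gets recognized.

```python
from math import sqrt

def leader_cluster(records, threshold):
    """One-pass leader-style clustering (illustrative sketch only).

    Each record joins the nearest existing cluster if it lies within
    `threshold` of that cluster's prototype; otherwise it starts a new
    cluster. Prototypes are running means, so they remain in the same
    units as the input variables.
    """
    prototypes = []  # one mean vector per cluster
    counts = []      # number of records in each cluster
    labels = []      # cluster index assigned to each record
    for rec in records:
        best, best_dist = None, None
        for i, proto in enumerate(prototypes):
            d = sqrt(sum((a - b) ** 2 for a, b in zip(rec, proto)))
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            # Similar enough: absorb the record and update the running mean.
            counts[best] += 1
            n = counts[best]
            prototypes[best] = [p + (a - p) / n
                                for a, p in zip(rec, prototypes[best])]
        else:
            # Novel pattern: the record becomes the prototype of a new cluster.
            prototypes.append(list(rec))
            counts.append(1)
            best = len(prototypes) - 1
        labels.append(best)
    return labels, prototypes

# Hypothetical customer records: (age in years, household income in $1000s).
# No outcomes or labels are supplied; only the variables are defined.
records = [
    (30, 90), (32, 94), (28, 86),   # younger, higher income
    (35, 30), (37, 34), (33, 26),   # similar age, lower income
    (68, 45),                       # a pattern unlike the others
]

labels, prototypes = leader_cluster(records, threshold=10)

for i, proto in enumerate(prototypes):
    print(f"cluster {i}: age ~ {proto[0]}, income ~ ${proto[1]}k, "
          f"members = {labels.count(i)}")
```

Because each prototype is simply the average of its members, it can be read directly in real-world terms such as age and income. In practice the variables would usually be rescaled to comparable ranges before distances are computed, and the result of a one-pass method like this depends on the order in which records arrive.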
A detailed technical description of data prospecting algorithms is beyond the scope of this article. However, those who choose to prospect and mine data for information should be aware of these techniques and look for tools that incorporate them or their capabilities. Proper application of data prospecting algorithms will reward the data prospector/miner with more comprehensive information to support the decision process, as well as the capability for knowledge discovery of previously unanticipated patterns.
Ben A. Hitt, Ph.D.
Ben A. Hitt, Ph.D., is the Director of the Schenk Center for Informatic Sciences (SCIS) at Wheeling Jesuit University. He founded the SCIS in 2006 as part of an initiative to emphasize the role of information in the conduct of business in today’s fast-paced global economy. The SCIS is engaged in research and development of information discovery and dissemination systems, including advanced concept searching, next generation instructional systems, vehicular safety systems, and personal health and fitness systems. Dr. Hitt is also actively engaged in developing and teaching courses in information theory and practice.
Dr. Hitt is a co-founder of Correlogic Systems, Inc. Along with Peter Levine, he is co-inventor of the pattern recognition approach to disease detection. He is also the inventor of Correlogic's proprietary Proteome Quest® software that is designed to analyze patterns in human blood proteins to detect disease. He is a nationally recognized expert in data mining and pattern recognition solutions. He is the inventor and patent holder of numerous algorithms and computer programs. Dr. Hitt’s works include inventions for analyzing disparate data streams, near real-time analysis of audio and text information streams, applications for the detection of credit card fraud, and optimization solutions for direct marketing problems.
Prior to his work at Correlogic, Dr. Hitt served as Senior Principal Software Engineer for Raytheon Systems. He later held positions with NeuralWare Inc., Advanced Software Applications, and American Heuristics, which he also co-founded. During this period, he developed and expanded his concepts of employing algorithms in data mining and other complex problem-solving applications. Progressing from his initial work with neural networks, Dr. Hitt incorporated genetic algorithms and related analytical techniques into his later inventions.