Taking the Temperature of Your Data
Published: January 1, 2008
Enhanced model performance comes from extracting as much information content as possible… relative to the specific performance metrics you are using to measure success.
There is a large variety of quantitative techniques available to assist in the development of mathematical models, but the seasoned practitioner understands that they all do basically the same thing: they help us search for a set of variables, weights and operators in the form of an equation. When that equation is applied to a set of decision data, it enhances the performance of our decision making. The algorithms behind our model development effort are seeking those variables that have information content relative to the goals we have defined. Our data, and the information content it contains, is the source of enhanced performance.
Successful practitioners typically spend 75% to 80% of their overall modeling effort preparing data. These efforts deal with issues such as understanding the context of the available data fields, handling missing data, identifying and correcting data errors, identifying and representing interaction effects between variables, mathematically transforming data to obtain different perspectives on the information content, and choosing data representation schemes appropriate for the type of data being utilized.
Practitioners new to predictive analytics often overlook this last issue. The physical representation of the data in their data set can have a significant impact on the information content presented to the modeling technique. This article presents a brief discussion comparing two approaches: common data representation, and an enhanced approach for certain types of data.
Data Types
Just as quantitative techniques have strengths and weaknesses, so does our data. When considering the context of our data, it is also important to understand the mathematical capabilities of our data. It is trivial to point out that the mean and standard deviation of a variable such as ZIP code are meaningless at best. However, many practitioners overlook more serious considerations and miss important data representation issues as a result.
Each variable in your data set should be clearly identified as being either quantitative or qualitative in nature. The characteristic of importance here is "order." There is no inherent order in a qualitative variable. Quantitative variables, on the other hand, have an underlying order. It is beyond the scope of this article to consider the types of mathematics that are appropriate for the various types of quantitative variables (ordinal, interval and continuous). Rather, we will focus on the implications of the characteristic of "order" and on data representation schemes that enhance the extraction of information content.
Qualitative Variables
A qualitative variable is simply a variable that describes a set of categories. The variable will have two or more values, each representing a category meeting a particular set of conditions. An example of a qualitative variable is marital_status. For this discussion, let’s assume that marital_status has the following values:
The values of the variable marital_status have no relative order. We can easily rearrange them in any other order with no impact on the information content. However, from a predictive analytics perspective, we still have many questions that need to be addressed about a field of this type.
Collapsing Values
For marital_status, we have identified six values. Is this the appropriate number of categories? There is no generally “right” answer to this question; the answer always depends on the context of use. For some decision environments, this is going to be the most appropriate representation.
These are empirical questions. They can only be answered in the context of the particular decision environment we are exploring. How many values to use, and how to collapse the values, is best answered by testing each of the combinations and measuring the impact that the representation has on performance.
Data Representation Alternatives
We must also consider the impact of different data representation schemes. In this case there are two alternatives:
The 1 of N representation allows for more flexibility. Some of our modeling techniques may identify relationships differently than others. Some may focus only on one of the values. Others may use more than one, but not all of the values. Still others may use all six values. This inherent flexibility makes the 1 of N representation appropriate for virtually all qualitative variables.
Quantitative Data
Let’s explore another example… Education_Level.
Education_Level is an example of quantitative data. While it isn’t represented by numeric values, "order" is a significant characteristic. This is, in fact, an ordinal variable. It would be inappropriate to perform arithmetic on the values, even if the data were represented numerically, because the intervals between the values are not consistent.
Just as we considered collapsing the values in the variable Marital_Status, above, the same considerations apply here. The number of values appropriate for Education_Level is determined purely by empirical testing in the decision environment we are working in. The data representation issues are also similar. We can obtain a number of advantages by using a 1 of N representation for Education_Level.
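As a minimal sketch of what a 1 of N representation looks like in practice, the Python below assigns each level its own 0/1 indicator. The list of levels is an assumption made for illustration: the names <High_School, High_School and Some_College appear in this article, while College_Degree is a hypothetical addition, and the function name one_of_n is likewise invented here.

```python
# Hypothetical ordered list of levels, used only for illustration.
# "College_Degree" is an assumed value, not taken from the article.
EDUCATION_LEVELS = ["<High_School", "High_School", "Some_College", "College_Degree"]


def one_of_n(value, levels=EDUCATION_LEVELS):
    """Return one 0/1 indicator per level, with a single 1 marking `value`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in levels]


# Print the encoding of every level, one row per level.
for level in EDUCATION_LEVELS:
    print(f"{level:15s} {one_of_n(level)}")
```

Each row contains exactly one 1, so a modeling technique is free to use any subset of the indicator columns.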
While this 1 of N representation allows for the flexibility advantages discussed above, it does not capture the "order" characteristics of the variable Education_Level. If this representation were used as an output variable, for instance, your answers would either be correct or incorrect. You would be unable to assess the degree of incorrectness, as the data representation scheme does not capture that information. On the other hand, consider a different representation scheme, a Thermometer Representation.
The logic of a Thermometer Representation is very straightforward. An individual in the category High_School has all of the attributes of someone in the category <High_School... plus something else. An individual in the category Some_College has all of the attributes of someone in the category High_School... plus something else. And so on. The Thermometer Representation allows us to capture "order" in our values and, as a result, allows us to consider degree of incorrectness.
While it would be physically possible to use a Thermometer Representation on the Marital_Status variable, discussed above, it would not make sense to do so. A qualitative variable has no "order." On the other hand, restricting our data representation method for a quantitative variable to a 1 of N representation misses an important characteristic of the information content available.
It is worth noting that a Thermometer Representation also allows us to control the direction of error. In the representation above, the logic reinforces the building of levels. As a result, this representation scheme will have a tendency to underestimate the value. Is this what we want? Again, it depends. If we are in a decision environment where we would prefer to have overestimation when we are incorrect, we simply need to invert the Thermometer Representation to achieve that result.
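The difference between the two schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the level College_Degree and all function names are hypothetical, counting differing bits is just one simple way to express "degree of incorrectness", and inverted_thermometer is one plausible reading of what inverting the representation means.

```python
# Hypothetical ordered levels; "College_Degree" is an assumed value.
LEVELS = ["<High_School", "High_School", "Some_College", "College_Degree"]


def one_of_n(value, levels=LEVELS):
    """A single 1 marks the level; order is not captured."""
    rank = levels.index(value)  # raises ValueError for an unknown level
    return [1 if i == rank else 0 for i in range(len(levels))]


def thermometer(value, levels=LEVELS):
    """Turn on the bit for `value` and for every level below it."""
    rank = levels.index(value)
    return [1 if i <= rank else 0 for i in range(len(levels))]


def inverted_thermometer(value, levels=LEVELS):
    """Assumed inversion: turn on the bit for `value` and every level above it."""
    rank = levels.index(value)
    return [1 if i >= rank else 0 for i in range(len(levels))]


def bit_errors(a, b):
    """Count the positions in which two encodings differ."""
    return sum(x != y for x, y in zip(a, b))


# Compare a near miss with a far miss under both representations.
actual = "<High_School"
for predicted in ["High_School", "College_Degree"]:
    print(
        predicted,
        "1 of N errors:", bit_errors(one_of_n(actual), one_of_n(predicted)),
        "thermometer errors:", bit_errors(thermometer(actual), thermometer(predicted)),
    )
```

Under 1 of N, any two different levels differ in exactly two positions, so a near miss and a far miss look the same. Under the thermometer encoding, the number of differing positions equals the number of levels separating the two values.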