
Using the Mining Accuracy Chart


The Mining Accuracy Chart pane provides tools to help gauge the quality and accuracy of the models you create. The accuracy chart performs predictions against your model and compares the results to data for which you already know the answer. The profit chart performs the same task, but allows you to specify cost and revenue information to determine the exact point of maximum return. The classification matrix (also known as a confusion matrix) shows you exactly how many times the algorithm predicts results correctly, and what it predicts when it is wrong. In practice, it is better to hold some data aside when you train your models, to use for testing. Using the same data for testing that you trained your models with may make the models seem to perform better than they actually do.
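The holdout idea is easy to express in code. Here is a minimal sketch in Python; the function name and the 70/30 split are illustrative assumptions, not anything prescribed by the tool:

```python
import random

def split_holdout(cases, test_fraction=0.3, seed=42):
    """Randomly partition cases into a training set and a testing set.

    Training on one partition and measuring accuracy on the other
    avoids the optimistic bias of testing on the training data.
    """
    rng = random.Random(seed)
    shuffled = cases[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Example: 1,000 cases -> 700 for training, 300 for testing.
train, test = split_holdout(list(range(1000)))
print(len(train), len(test))     # 700 300
```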

To use the accuracy chart, you need to select source tables from your DSV or other data sources and bind them to your mining structure. If the columns from the tables have the same names, this binding happens automatically upon table selection. Once you have selected the case and nested tables and performed the binding, you can optionally filter the cases. You might do this when you have a specific column that indicates whether a case is for training or testing, or simply to verify how the model performs for certain populations: for example, does the model perform differently for customers over 40? Last, you choose which target you are testing and, optionally, the value you are testing for. By default, the accuracy chart selects the same column and value for each model in the structure. However, you can also test different columns at the same time.

This is useful, for instance, if you have different discretizations in different models; you might want to see how well predicting Age with five buckets compares to doing the same with seven buckets.

The type of chart you receive depends on whether the target you chose is continuous or discrete, and whether or not you chose a target value to predict.

The latter case is the most common, so we will explain it first. When you select a discrete target and specify a target value, you receive a standard lift chart. A standard lift chart always contains a single line for each model you have selected, plus two additional lines: an ideal line and a random line. The coordinates at each point along a line indicate what percentage of the target audience you would capture if you used that model against the specified percentage of the audience.

For example, in Figure 3.20 the top line shows that an ideal model would capture 100% of the target using 36% of the data. This simply implies that 36% of the data exhibits the desired target; there is no magic here. The bottom line is the random line, which is always a 45-degree line across the chart. It indicates that if you were to guess the result for each case at random, you would capture 50% of the target using 50% of the data. The other lines on the chart represent your mining models. Hopefully, all your models will be above the random line. When a model's line hovers around the random guess line, there wasn't sufficient information in the training data to learn patterns about the target. In Figure 3.20, both models are about equal, and we can capture about 90% of our target using only 50% of our data. In other words, if we had $5,000 to hold a mailing campaign, each mailing cost $10, and we had 1,000 customers on our list, we could reach 90% of all the possible respondents using the model, whereas we would reach only 50% by sending the mailings at random. Note that this does not mean that 90% of the people we send to will respond. We have already determined that only 36% of the population matches the target. Using the model, we will get 90% of that 36%, or 32.4% of the total population, to respond. Random guessing would net us only 18% of the total.
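To make the lift computation concrete, here is a minimal sketch in Python; the scoring data is made up for illustration, since the tool computes this chart for you:

```python
def lift_curve(scored_cases):
    """Compute lift-chart coordinates from (probability, is_target) pairs.

    Cases are ranked by predicted probability; each point reports the
    fraction of all targets captured after contacting that fraction of
    the population, which is exactly what the lift chart plots.
    """
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    total_targets = sum(is_target for _, is_target in ranked)
    captured, points = 0, []
    for i, (_, is_target) in enumerate(ranked, start=1):
        captured += is_target
        points.append((i / len(ranked), captured / total_targets))
    return points

# Tiny demo: 10 cases with made-up probabilities and outcomes.
demo = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
        (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]
for pct, captured in lift_curve(demo):
    print(f"{pct:.0%} of data -> {captured:.0%} of target")
```

The mailing arithmetic from the example falls out directly: 0.90 × 0.36 = 32.4% of the total population responding with the model, versus 0.50 × 0.36 = 18% at random.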

Figure 3.20 Standard lift chart

By changing the chart type to Profit Chart, you get a much better idea of the quality of the model. This chart prompts you to enter the initial cost, cost per item, and revenue per successful return, and plots the profit you would receive using each of the models you've created. This can help you decide which model to use and how many people to send mail to. For example, from the chart in Figure 3.21, you can tell that if you only had enough money to mail to less than 25% of the population, you should go with the Movie Bayes model. If you had enough to go up to about 37%, the Movie Trees model would be your best bet. Most importantly, it tells you that regardless of how much money you can spend, you will maximize your profits by sending mail to about 50% of the population, using the Movie Bayes model. Additionally, it tells you how to determine the people to send to. Clicking on the chart causes a vertical line to appear, with statistics about each line at that point displayed in the Mining Legend. In this case, the chart says that sending a mailing to everyone with a propensity to buy of 10.51% or better, using the Movie Bayes model, will maximize your profit.
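The profit computation simply applies the chart's three inputs at each point of the lift curve. The sketch below shows the idea; the function and all dollar figures are illustrative assumptions, not values from the product:

```python
def profit_curve(lift_points, population, fixed_cost,
                 cost_per_item, revenue_per_response, total_responders):
    """Turn lift-chart points into profit-chart points.

    Each lift point (fraction contacted, fraction of targets captured)
    becomes: revenue from captured responders minus mailing costs.
    """
    curve = []
    for pct_contacted, pct_captured in lift_points:
        mailed = pct_contacted * population
        responders = pct_captured * total_responders
        profit = (responders * revenue_per_response
                  - fixed_cost - mailed * cost_per_item)
        curve.append((pct_contacted, profit))
    return curve

# Illustrative inputs: 1,000 customers, $500 setup, $10 per mailing,
# $50 revenue per response, 360 actual responders in the population.
points = [(0.25, 0.70), (0.50, 0.90), (0.75, 0.97), (1.00, 1.00)]
curve = profit_curve(points, 1000, 500, 10, 50, 360)
best = max(curve, key=lambda p: p[1])
print(f"profit peaks at {best[0]:.0%} of the population: ${best[1]:,.0f}")
```

With these made-up numbers, profit peaks at 50% of the population, mirroring the shape described above: past the peak, each extra mailing costs more than the expected revenue it returns.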

Another type of accuracy chart is produced when you select a discrete target variable but do not specify which value of the target you are looking for. In this case, you get a modified lift chart that looks a bit like an upside-down standard chart. This chart shows the overall performance of the model across all possible target states. In this version, a line coordinate indicates how many guesses you would have gotten correct had you used that model. The ideal line here is at a 45-degree angle, indicating that if you had used 50% of the data, you would have been correct for about 50% of the population, and if you had used 100% of the data, you would have been correct all the time. The random guess line is based on the most likely state discovered in the training set.

For example, if you were predicting gender and 57% of your training data was female, you could presume that you would get the best results by guessing female for every case. The random guess line therefore ends at the percentage of the testing set that matches the most likely state from the training set. That is, for our gender example, if the testing set had the same distribution as our training set, the line would end at 57%, but if only 30% of the testing set was female, the line would end at 30%.
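One plausible way to compute such a curve, assuming the vertical axis accumulates correct top predictions over the confidence-ranked cases (an interpretation of the description above, not the product's documented algorithm), looks like this:

```python
def multistate_lift(cases):
    """Lift points when no single target value is specified.

    Each case is (confidence, predicted_state, actual_state); cases are
    ranked by confidence, and each point reports the fraction of correct
    predictions accumulated after using that fraction of the data.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    correct, points = 0, []
    for i, (_, predicted, actual) in enumerate(ranked, start=1):
        correct += (predicted == actual)
        points.append((i / len(ranked), correct / len(ranked)))
    return points

# Made-up gender predictions; guessing "F" everywhere would be right
# 57% of the time if 57% of the cases are female, which is where the
# random-guess line would end.
demo = [(0.95, "F", "F"), (0.90, "F", "F"), (0.80, "M", "F"),
        (0.70, "M", "M"), (0.60, "F", "M")]
for pct, correct in multistate_lift(demo):
    print(f"{pct:.0%} of data -> {correct:.0%} correct overall")
```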

The Classification Matrix tab shows you how many times a model made a correct prediction and what answers were given when the answers were wrong. This can be important in cases where there is a cost associated with each wrong decision. For example, if you were to predict which class of member card to assign to a customer based on his or her creditworthiness, it would be less costly to misclassify someone who should have received a bronze card as a normal card than it would be to issue that person a platinum card. This view simply shows you a matrix per model, illustrating the counts of each pairwise combination of actual and predicted values.
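Counting these pairwise combinations is simple to do by hand. A minimal sketch, using made-up card classes from the example above:

```python
from collections import Counter

def classification_matrix(pairs):
    """Count each (actual, predicted) combination, as the
    Classification Matrix tab does, one matrix per model."""
    return Counter(pairs)

# Made-up prediction results for the member-card example.
results = [("Bronze", "Bronze"), ("Bronze", "Normal"),
           ("Bronze", "Platinum"), ("Normal", "Normal"),
           ("Platinum", "Platinum"), ("Platinum", "Normal")]
matrix = classification_matrix(results)
for (actual, predicted), count in sorted(matrix.items()):
    print(f"actual={actual:<8} predicted={predicted:<8} count={count}")
```

Off-diagonal cells are the costly ones: here, the single (Bronze, Platinum) entry is the kind of misclassification the text warns about.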

The last type of accuracy chart is strictly for continuous values. This chart is a scatter plot comparing actual values against predicted values for each case. In a perfect model, every point would fall exactly on the 45-degree line, indicating that the predicted values matched the actuals. For any other model, the closer the points fall to the 45-degree line, the better.

Figure 3.22 shows a scatter accuracy plot. You can see that this model performed well for most cases, with only one point that was significantly off. The scatter accuracy plot is automatically displayed instead of the lift chart when a continuous target is selected.
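To draw an equivalent plot yourself, a few lines of matplotlib suffice; the sample values below are made up for illustration:

```python
import matplotlib.pyplot as plt

# Made-up (actual, predicted) pairs for a continuous target.
actual =    [10, 20, 30, 40, 50, 60]
predicted = [12, 19, 31, 38, 75, 61]   # one point significantly off

plt.scatter(actual, predicted)
lo, hi = min(actual), max(actual)
plt.plot([lo, hi], [lo, hi])           # the 45-degree "perfect" line
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.title("Scatter accuracy plot")
plt.show()
```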

Figure 3.21 Profit chart with legend

Figure 3.22 Scatter accuracy plot
