

8.4 Getting It Wrong: Mistakes Every Data Miner Has Made

Eleven fundamental principles of data mining were introduced in chapter 2; they are the keys to getting it right. There are also some keys to getting it wrong, and it is very important to know about these, too.

A lot can be learned about what works by knowing what doesn’t work, and why it doesn’t work. This is particularly so in experimental and creative domains such as data mining. The following eleven short treatments describe the worst of the worst: mistakes that will doom your project or perhaps even get you fired. I know them all too well, because I have seen them all multiple times (usually as the perpetrator). If you are going to fail, try to find some less spectacular way than those contained in this list.

Mistake One: Bad Data Conditioning

This mistake covers everything from making arithmetic mistakes and formatting errors to selecting the wrong settings in a feature extractor. Sometimes it’s even a stupid mistake: I once gave a file to an engineer to condition, and he finished his analytic process by rounding everything in the entire file down to zero.

A common instance of this mistake is to repair broken data in some way that makes things worse, for example by filling gaps in records with some valid data value. Next month you will be wondering whether the zero you see in field 17 is a real zero or a zero standing in for a missing value, and by that time it's too late.

Visualization and reporting tools applied by an analyst familiar with the data are often effective in detecting information damage. Sanity checking by a domain expert is the best defense against this mistake.
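As a concrete illustration of keeping gaps honest, here is a minimal sketch in Python with pandas; the field names, the -999 sentinel, and the shape of the sanity-check report are all hypothetical choices, not something prescribed in this book.

```python
import numpy as np
import pandas as pd

# Tiny illustrative extract; field names and the -999 sentinel are hypothetical.
df = pd.DataFrame({
    "account_age": [12, 3, -999, 27],      # -999 means "unknown" upstream
    "balance":     [0.0, 154.2, 0.0, -999.0],
})

# Keep gaps explicit as NaN instead of silently filling them with a "valid"
# value such as 0, so a real zero remains distinguishable from a missing value.
df = df.replace(-999, np.nan)

# A simple sanity-check report a domain expert can eyeball.
print(pd.DataFrame({
    "missing": df.isna().sum(),
    "min": df.min(),
    "max": df.max(),
}))
```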

Mistake Two: Failure to Validate

It is exciting when some data mining experiment produces a result that advances the effort. But you have to resist the temptation to trumpet your success until the results have been validated by a thorough process audit and a demonstration that you can replicate the results.

A common instance of this mistake is testing a model on the data used to create it. Of course it will give reasonable, even great, results on that set; but will it generalize? Before you trumpet your success, make certain it won't evaporate when the VP of Engineering hustles an ecstatic customer down to your office to witness the miracle first-hand.
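To make the validation point concrete, here is a minimal sketch of holding out a test set, using scikit-learn and a synthetic data set as stand-ins; the library, the model choice, and the 70/30 split are illustrative assumptions, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in practice this is your conditioned feature set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The training-set score is flattering; the held-out score is the one to
# report before anyone is hustled down to witness the miracle.
print("training accuracy:", model.score(X_train, y_train))  # often near 1.0
print("held-out accuracy:", model.score(X_test, y_test))
```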

Mistake Three: GIGoO (Garbage In, Gold Out?)

Reality is a harsh critic; but it is a fair critic. Sometimes the data you are given just do not support the application you have been asked to build. The fact that you can’t use a customer’s shoe sizes to predict her hair color doesn’t mean you have failed. Be prepared to admit the limitations of the data, and be ready to make thoughtful suggestions about less grandiose goals. What is real ought to be good enough—because that’s all the good there is.

Mistake Four: Ignoring Population Imbalance

Real world data mining is often looking for the needle in the haystack that everyone would like to find. But the importance that derives from rarity has a side-effect: you probably won't have many examples to use for model development. In building fraud models, for example, keep in mind that the data you are given will probably not provide you with thousands of examples of fraud. This should be taken into account when you train.

For example, let’s suppose that you are asked to create some kind of a recognizer for fraud for a business where only 1 out of every 100 cases is fraudulent. If you just call every case you see non-fraudulent, you will be right 99% of the time. But what have you actually accomplished? Nothing.

Many adaptive algorithms train by maximizing or minimizing some performance score. Unless they are carefully designed, they might be able to push their score up by doing exactly what the call-everything-non-fraudulent strategy above does.
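Here is a minimal sketch of that arithmetic on a synthetic 1-in-100 population; the per-class recalls and their geometric mean (one simple class-balanced score, in the spirit of the geometric accuracy idea referenced below) show how empty a 99% overall accuracy can be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly 1 case in 100 is fraudulent (label 1), as in the example above.
y_true = (rng.random(10_000) < 0.01).astype(int)

# The trivial strategy: call every case non-fraudulent.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall_nonfraud = (y_pred[y_true == 0] == 0).mean()
recall_fraud = (y_pred[y_true == 1] == 1).mean()

print(f"overall accuracy : {accuracy:.3f}")         # about 0.99
print(f"non-fraud recall : {recall_nonfraud:.3f}")  # 1.000
print(f"fraud recall     : {recall_fraud:.3f}")     # 0.000 -- nothing accomplished
print(f"geometric mean   : {np.sqrt(recall_nonfraud * recall_fraud):.3f}")
```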

Classes for which only a few instances are available (residual classes) are sometimes lost in the information glare of an overwhelming majority of common (and therefore uninteresting) cases. Techniques to overcome this include boosting; enriching the residual class by including multiple copies of each instance (replication); removing members of other classes (decimation); and adjusting the objective score used to train the algorithm using the geometric accuracy method described in chapter 7.

Populations to be used for mining activities can be balanced using techniques such as replication and decimation (chapter 4). Sampling, segmentation, coding, quantization, and statistical normalization may also be used to address imbalance. Imbalance can be a difficult problem whose solution may require special expertise.
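A hedged sketch of replication and decimation using plain NumPy index manipulation follows; the population sizes and sampling choices are illustrative only, not a recipe from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled population: 990 common cases, 10 residual-class cases.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Replication: enrich the residual class by repeating its instances.
replicated = np.concatenate(
    [majority, rng.choice(minority, size=len(majority), replace=True)])

# Decimation: thin the majority class down to the size of the residual class.
decimated = np.concatenate(
    [rng.choice(majority, size=len(minority), replace=False), minority])

for name, idx in [("replicated", replicated), ("decimated", decimated)]:
    Xb, yb = X[idx], y[idx]
    print(name, "set:", Xb.shape[0], "rows, class counts", np.bincount(yb))
```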

Mistake Five: Trojan-Horsing Ground-Truth

This is one of the easiest mistakes to make—and one of the most dangerous. Fortunately, it is usually easy to detect, if you are looking for it.

Let’s suppose you are trying to construct a classifier that recognizes when a picture contains one or more cats. You have been provided with an assortment of pictures, some containing cats, and some not. Each photograph has a date stamp added to the image by the camera. You carefully digitize these, and extract various kinds of shape descriptors, texture measures, etc. from the pictures. You train up and get good results.

What you didn’t notice is that all the pictures that contained cats were taken on the same day. What your detector has actually done is learn that a particular squiggly shape in the corner of the picture (the date) is perfectly correlated with the desired answer.

Further, learning to recognize that specific date stamp is not all that hard. You have unwittingly Trojan-Horsed the ground truth into the data.

There are lots of ways this happens. Suppose you are working on financial spreadsheet data, trying to estimate next month's revenue in dollars. What you might not know is that someone in the European branch office needed the revenue in Euros per week, and put that in column 47 of the spreadsheet with a helpful name, written in Swedish. You have Trojan-Horsed the ground truth.

Trojan Horsing can often be detected by computing the correlation coefficient of each feature with the ground truth. If any have an absolute value very close to 1, they could be ground truth masquerading as usable features, and should be checked manually as possible Trojan Horses.
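A minimal sketch of that screening step, assuming a pandas feature table with a deliberately leaked column; the 0.95 threshold and the column names (including column_47) are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical feature table: two honest features plus one leaked column
# that is really the ground truth in disguise (the date stamp, column 47, ...).
y = rng.integers(0, 2, size=500)
features = pd.DataFrame({
    "texture":   rng.normal(size=500),
    "shape":     rng.normal(size=500) + 0.3 * y,
    "column_47": y + rng.normal(scale=0.01, size=500),  # Trojan horse
})

# Correlation of each candidate feature with the ground truth.
corr = features.corrwith(pd.Series(y)).abs().sort_values(ascending=False)
print(corr)

# Flag anything suspiciously close to a perfect correlation for manual review.
print("check manually:", list(corr[corr > 0.95].index))
```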

Mistake Six: Temporal Infeasibility

This mistake is easy to make unless you are watching out for it. It arises because data mining researchers are generally given data that contain all the fields and records that might be useful in understanding the problem. This does not mean, though, that they can all necessarily be used for modeling.

When looking at a record in a data file, don't assume this is how it exists in the database your model will use operationally: clean and complete, with all fields arriving at the same time along the same data paths, and so on.

If there are multiple data sources, it is quite possible that the fields in the records you were given for data mining are never brought together in the operational system.

Even if they are, they might have arrived hours, days, or months apart, particularly if some came from correcting records, audits, roll-ups from later analyses, and so on. At the time your application has to be executed, will all of the fields be there, in this format? Unless you have asked, why would you think so?

Building a model that works in theory but can't be used because the data are not available when the model must run is the temporal infeasibility trap. Avoiding temporal infeasibility is best accomplished by having a domain expert fully conversant with the business process review the feature set for consistency with the business cycle.
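One way to make the question concrete is to record, for every candidate field, how long after the event it actually lands in the operational system, and compare that lag with the moment the model must produce its answer. A minimal sketch, with purely hypothetical field names and lags:

```python
from datetime import timedelta

# How long after the event each field actually arrives (hypothetical values,
# gathered by asking the people who own each data feed).
feature_lag = {
    "transaction_amount": timedelta(seconds=1),
    "merchant_category":  timedelta(hours=2),
    "chargeback_flag":    timedelta(days=45),  # comes from a much later audit
}

# The model must score each case this long after the event occurs.
decision_lag = timedelta(hours=24)

infeasible = [name for name, lag in feature_lag.items() if lag > decision_lag]
print("temporally infeasible features:", infeasible)
```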

Mistake Seven: Being a “One-Widget Wonder”

They say that if all you have is a hammer, everything looks like a nail. This is true in data mining, but the truth of this doesn't always sink in. The obvious facts are that every analysis tool is an implementation of some process; processes are formalizations of specific algorithms; and algorithms are devised by thinkers with some particular idea in mind. It is silly to believe that there was one idea back there that is the solution to every problem you will ever work on.

Even wonderful tools work well on some problems, and poorly on others. This is the reason ensemble methods were invented, and even these must be tweaked out in challenging situations. If all you have is one tool/method, there might be only one problem you can address properly.

Remember: the solution must be applied to the problem, and not the other way around. The successful data miner will not only employ an eclectic set of tools; they will know when and how to use each tool.
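As a sketch of what an eclectic toolbox looks like in practice, the fragment below compares several unrelated model families on the same data using cross-validation; the specific scikit-learn models and the synthetic data are illustrative assumptions, not an endorsement of any one tool.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

# Very different modeling ideas behind each entry in the toolbox.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "naive Bayes":         GaussianNB(),
    "nearest neighbors":   KNeighborsClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean CV accuracy = {scores.mean():.3f}")
```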

Mistake Eight: Poor Configuration/Audit/Version Control

This is one of the hardest lessons I have ever had to learn. I was into day two of a five-week predictive modeling task for a tricky problem. During a calibration test, I stumbled across an unusual collection of settings that created a machine with remarkable performance on the problem.

Rather than immediately capturing the context and settings for this architecture, I tried to make it a little better by applying some small adjustments. That resulted in a small change, so I did it again. Two hours later, with hundreds of output files from failed tweaks scattered across the working directory, I realized that I didn’t know how to reproduce the original great result.

Five weeks later, at around 11 PM on the day before the results were due, I managed to recreate the miracle. It had taken days and nights and weekends of work to arduously track down what had been a gift of providence.

As with every principled experimental process, everything needed for repeatability of the results must be documented. The best way to do this is to create scripts of some sort to automate data transformation, feature extraction, and model construction/testing. These scripts can be saved, rerun, shared with others, or even delivered to customers or embedded in applications. If that isn't practical (e.g., automation doesn't yet exist for some new process), keep a little text window open on your desktop and log steps/settings as you work. At the end of an experiment, append a line or two describing the results, and save this log entry to a directory with the date embedded in its name.
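A minimal sketch of such a log, assuming a simple JSON-lines file in a dated directory; the settings and results shown are placeholders.

```python
import json
import time
from pathlib import Path

def log_experiment(settings: dict, results: dict, root: str = "experiment_logs"):
    """Append one experiment's settings and results to a dated log directory."""
    day_dir = Path(root) / time.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": time.strftime("%H:%M:%S"),
        "settings": settings,
        "results": results,
    }
    with open(day_dir / "log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# Capture the context *before* tweaking it further.
log_experiment(
    settings={"learning_rate": 0.01, "hidden_nodes": 12, "seed": 42},
    results={"holdout_accuracy": 0.87, "notes": "surprisingly good; reproduce me"},
)
```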

Mistake Nine: Neglecting to Define Success

Enterprises with deep pockets sometimes undertake data mining projects as purely speculative research and development efforts, with the full understanding that they might or might not yield a return. Some great things have come about in exactly this way. The definition of success for these efforts is, "Spend this money doing the best you can, and tell us what you get."

For most customers, an outcome like this doesn't smell like success. They funded the project with the expectation of a tangible return. In such cases, it behooves the data miner to work with the customer to get a "description of success" committed to writing.

This is a project manifesto, not a project plan, and should not unduly constrain the effort. At the very least, it needs to state some measurable goals for the effort. (Note again: measurable.) I have found that a list of goals with performance targets (accuracy improvements, labor savings, etc.) works best.

Even R&D projects are at risk of merely wandering through technology land if they are not focused on a goal statement of some kind. For R&D projects, it is usually best not to put performance numbers in a goal statement. These numbers will end up becoming project drivers that discourage the principled risk taking that should be part of a research effort.

Realize, too, that goals that are not written down and agreed to early on will morph in people's minds over the course of time, and can become unrecognizable. This results in lots of unpleasant confusion and surprise when the final report is inconsistent with what people have come to think you meant.

Mistake Ten: Ignoring Legacy Protocols

Data mining applications are often developed to enhance the operational performance or utility of existing computing systems (i.e., legacy systems). If these systems have been in use for a while, users will have developed work styles that mesh with the system's concept of operation (CONOP). These work styles will even include exploitation of system quirks and bugs, should this prove to be useful.

New applications integrated into legacy environments change things. From the perspective of power users who have streamlined their processes, change is bad unless it is helpful, in which case it is merely tolerable.
