

5.2 Characterizing and Resolving Data Problems

Outlier detection and mitigation. The term outlier can refer either to data that are invalid due to some error (e.g., sensor error, corrupted during storage), or to data that are valid but assume unexpected values. The former we call bad; the latter we call anomalies. The problem of missing data is addressed below.

Telling the difference between bad data and anomalous data is challenging, because the determination is really an assessment of cause rather than effect. For example, “the observed data are not nominal” is an observed effect. If the cause of this effect is not determined, it will be difficult to handle outliers in a principled way.

Detection of outliers. For our purposes, an outlier is a datum having a value that is not consistent with the established pattern. Outliers are pattern breakers. This definition is a bit vague, suggesting that the term outlier is subjective . . . and so it is. One data miner’s outlier is another’s nominal datum. This subjectivity arises because what is consistent and established in one context need not be so in another; it really depends upon how the data are being used.

Outlier detection is important in data mining because the presence of data that violate the pattern makes the pattern harder to detect, characterize, and exploit. Part of proper data conditioning is deciding whether certain records are invalid and should be removed so that authentic patterns are not obscured.

Definitive outlier tests do exist in some problem domains, by virtue of having gained acceptance among domain experts. This is certainly true for domains requiring unambiguous definitions, such as medicine and law.

For most data mining efforts, though, the data miner must establish their own definition of an outlier, and construct their own outlier tests. There is no universally applicable outlier test, but there are principled outlier tests. Here are five different methods (objective and subjective) that together cover many data mining situations.

Objective Methods:

Range checking. Establish reasonable minimum and maximum values for each feature. Values that fall outside this range are outliers.
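A minimal sketch of a range check in Python; the feature names and bounds are illustrative placeholders, not values from the text:

```python
# Range-check outlier test: flag values outside hand-set minimum/maximum bounds.
# The feature names and bounds below are made up for illustration.
bounds = {"age": (0, 120), "heart_rate": (20, 250)}

def is_range_outlier(feature, value):
    lo, hi = bounds[feature]
    return not (lo <= value <= hi)

print(is_range_outlier("age", 300))        # True: outside the plausible range
print(is_range_outlier("heart_rate", 72))  # False: within range
```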

Z-scores. Using the Calibration set, compute the mean and standard deviation for each feature. To determine whether an instance of that feature is an outlier, use these to compute its z-score. If the z-score is less than the minimum acceptable z-score, or greater than the maximum acceptable z-score, the value is an outlier. A standard choice for the minimum and maximum allowable z-scores is -3 and +3, respectively.
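A minimal sketch of the z-score test, assuming NumPy is available; the calibration sample is invented, and the ±3 thresholds follow the text:

```python
import numpy as np

# Z-score outlier test: calibrate on a reference sample, then score new values.
calibration = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3])
mu, sigma = calibration.mean(), calibration.std(ddof=1)

def z_outlier(x, z_min=-3.0, z_max=3.0):
    z = (x - mu) / sigma
    return z < z_min or z > z_max

print(z_outlier(10.1))  # False: consistent with the calibration set
print(z_outlier(15.0))  # True: far outside the established pattern
```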

Subjective Methods:

Visualization. Plot the data. Data that don’t fit the visual pattern are possible outliers.
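For example, a quick scatter plot (a sketch assuming NumPy and Matplotlib, with synthetic data) makes isolated pattern-breakers easy to spot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Visual screening: most points follow a linear trend; a few planted
# pattern-breakers stand apart from the cloud.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 0.5, 200)
x[:3] = [5, -4, 6]     # planted outliers
y[:3] = [-8, 9, -7]

plt.scatter(x, y, s=10)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("Visual outlier screening")
plt.show()
```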

Clustering. Use a clustering algorithm to aggregate the data into clusters. Clusters that have very few members (e.g., one) might consist of outliers.
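One possible sketch, assuming scikit-learn is available and using DBSCAN on synthetic data; points left unclustered (label -1), or members of very small clusters, become outlier candidates:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Clustering-based screening: a dense nominal cluster plus two isolated points.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (100, 2)),           # nominal cluster
                  np.array([[8.0, 8.0], [-9.0, 7.0]])])  # isolated points

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(data)
candidates = np.where(labels == -1)[0]   # points DBSCAN could not assign
print("candidate outlier rows:", candidates)
```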

Contextualization. Establish outlier rules that can be applied to check for known inconsistencies. For example, suppose a medical record says a patient’s gender is male, and their diagnosis is gestational diabetes (i.e., the patient is a pregnant man). These feature values cannot both be correct, so an outlier has been detected.
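Such rules are straightforward to encode directly. A sketch using the rule from the example above, with hypothetical record fields:

```python
# Contextual consistency rules: each rule names a known-impossible combination
# of feature values; any record matching a rule is flagged as an outlier.
rules = [
    ("male with gestational diabetes",
     lambda r: r.get("gender") == "male"
               and r.get("diagnosis") == "gestational diabetes"),
]

def contextual_outliers(records):
    return [(i, name) for i, r in enumerate(records)
            for name, rule in rules if rule(r)]

records = [{"gender": "male", "diagnosis": "gestational diabetes"},
           {"gender": "female", "diagnosis": "gestational diabetes"}]
print(contextual_outliers(records))  # [(0, 'male with gestational diabetes')]
```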

The objective methods are nice because they don’t really require an understanding of the domain, and can be performed quickly and automatically. The subjective methods are more powerful and discriminating, but require time and domain knowledge to apply.

Anomaly Detection

For our purposes, an anomaly is a valid datum having a value that is assumed very rarely. Notice that a distinction is being made here between outliers and anomalies.

Outliers are bad data; their frequency of occurrence is irrelevant. Anomalies are good data that are rare; they are not outliers. Deciding which you have in a particular case is a judgment call (domain experts often have heuristics for this). Knowing the difference is important if outliers and anomalies must be handled differently.

There are two types of anomaly detection problems: Closed Corpus and Open Corpus. Closed Corpus problems are those for which there is a known, a priori list of anomalous patterns. Software virus detection is an example: an anomalous version of a program is detected by scanning it for the presence of known bad code.

Closed Corpus anomaly detectors keep an explicit list of anomalous patterns. This approach characterizes abnormal patterns, and creates detectors for similarity to these patterns.

Strengths:

• Good track record (e.g., virus and spam detection).

• Supervised learning can be used, because examples of every target pattern can be generated.

Weaknesses:

• The corpus must be regularly updated.

• Patterns that are not (yet) in the corpus will not be detected.

Open Corpus problems are those which must detect anomalies that have never been seen before: there is no a priori list to check. In this situation, anomalous data must be detected by examination of their attributes. An example is a bot detector on a web page. You are asked to prove that you are human by typing in a partially garbled word presented in a thumbnail image. The web site doesn’t have a list of all humans, but it does know what humans can do that bots can’t.

Open Corpus anomaly detectors have a known, a priori collection of historical patterns constituting normalcy. This approach characterizes normal patterns, and creates detectors for deviation from these patterns.

Strengths:

• Good track record (e.g., change detection, control systems).

• Previously unseen patterns might be detected.

Weaknesses:

• More complex and therefore more difficult to build and use.

• Unsupervised learning must be used, because it is not known a priori what anomalous patterns must be detected.

Open Corpus anomaly detection is usually regarded as the more difficult of the two problems. It can be performed in a number of ways. One method is an application of unsupervised learning combined with continuous regression.

The concept is simple. If a pattern consisting of several parts is not unusual, then it should be possible to hide some of its parts, and use pattern matching to infer these hidden parts from those that are not hidden. In a certain sense, parts that can be inferred in this way conform to what is expected, and are not novel. However, when some part of a pattern cannot be inferred from the others, it must in some way be unusual in the context of the whole pattern.

This suggests a method for using pattern matching to detect novel items. For each part of a pattern, a learning engine is created to infer that part from the others. Items are run through the engine to determine whether all of their parts make sense in context. Items that contain many parts that cannot be inferred by the engine are deemed novel.

Using scores computed during processing by the learning engine, items are ranked by novelty. The most novel items are flagged for manual review.
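One possible sketch of this scheme in Python, fitting one simple regressor per feature on "normal" historical data and ranking items by how poorly their features can be inferred from the others. The linear models and synthetic data are illustrative stand-ins for the learning engine, not the book's specific method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Open Corpus novelty sketch: learn to predict each feature from the others
# using only normal data; rows with large unexplained residuals rank as novel.
rng = np.random.default_rng(2)
normal = rng.normal(0, 1, (300, 4))
normal[:, 3] = normal[:, 0] + 0.5 * normal[:, 1]   # learnable structure

models = []
for j in range(normal.shape[1]):
    X = np.delete(normal, j, axis=1)               # all features except j
    models.append(LinearRegression().fit(X, normal[:, j]))

def novelty_scores(items):
    scores = np.zeros(len(items))
    for j, model in enumerate(models):
        X = np.delete(items, j, axis=1)
        scores += (model.predict(X) - items[:, j]) ** 2   # unexplained part
    return scores

test = np.vstack([normal[:5], [[0.1, 0.2, 0.0, 5.0]]])    # last row breaks the pattern
print(novelty_scores(test).round(2))   # largest score flags the novel item
```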

5.2.1 Outlier Case Study

This experiment shows that it is sometimes possible to detect weak features and remove (winnow) them without degrading the information content of the data as a whole. In some cases, feature winnowing can actually improve the performance of the classifier.

5.2.2 Winnowing Case Study: Principal Component Analysis for Feature Extraction

This case study describes a laboratory experiment designed to illustrate the use of PCA for dimension reduction and feature extraction. We describe what PCA is, and its strengths and weaknesses as a dimension reduction and feature extraction technique.

The most important issue to consider when using PCA for dimension reduction is its effect on the ability of the new features to distinguish between ground truth classes.
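A brief sketch of PCA-based dimension reduction, assuming scikit-learn and synthetic data; note that a high explained-variance ratio does not by itself guarantee that the new features separate the ground-truth classes, so classifier performance should be re-checked after the transform:

```python
import numpy as np
from sklearn.decomposition import PCA

# PCA dimension-reduction sketch: project ten features onto three components
# and inspect how much variance each component retains.
rng = np.random.default_rng(3)
X = rng.normal(0, 1, (200, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(0, 1, 200)   # redundant feature

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
# Caveat from the text: directions of high variance are not necessarily the
# directions that distinguish ground-truth classes.
```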

Addendum [1] at the end of this case study suggests a methodology for developing nonlinear transforms manually; and Addendum [2] describes a real-world use of nonlinear encoding for feature enhancement.
