
From PRACTICAL DATA MINING, Hancock (pages 132-136)

Feature Extraction and Enhancement (Step 3)

Question 8: What data problems does the user report (outliers, missing data, timeliness, precision, etc.)?

Why is this question important? The users are the best source of reliable information on data errors that cause operational problems.

What is the question seeking? Add the problems reported by users to the list of problems to be addressed.

Likely responses and their meanings Sometimes error phenomenology is documented, but this is rare. You will probably not get a comprehensive list.

Follow-up questions Ask for estimates of rates of occurrence for the various types of known errors.

Special considerations Some errors are more serious than others.

The fact that an error is present does not necessarily mean it should be fixed; it might make more sense to discard the record.

5.4 Synthesis of Features

Feature synthesis is the merging of several existing features to obtain one or more new features. This is done by applying a mathematical transform to the raw features, or combining features in some intuitively meaningful way. Synthesis of features is usually undertaken when the existing features are individually weak. By combining them with other features, two benefits might be realized:

1. Synthesized features can be more powerful because they combine the information content of several features.

2. If the original, weak features are removed and only the synthesized feature is kept, the dimension of the data is lower.

Good Features

It is desirable to have features that are correlated with the ground truth to be predicted/estimated, and uncorrelated with each other. When this is the case, each feature is itself highly informative, and provides an information stream that is not redundant with other features.
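These two properties can be checked directly with Pearson correlations. The following sketch uses synthetic data; all names and values are illustrative, not drawn from any real feature set:

```python
import numpy as np

rng = np.random.default_rng(0)               # toy data for illustration only
n = 500
good = rng.normal(size=n)                    # feature correlated with the truth
noise = rng.normal(size=n)                   # feature unrelated to the truth
truth = good + 0.5 * rng.normal(size=n)      # ground truth to be predicted
redundant = good + 0.1 * rng.normal(size=n)  # nearly a copy of `good`

def corr(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

print(corr(good, truth))       # high: an informative feature
print(corr(good, redundant))   # high: a redundant pair of features
print(corr(noise, truth))      # near zero: an uninformative feature
```

In a real effort one would compute the full feature-versus-truth and feature-versus-feature correlation matrices and prune features that are weak on the first or strongly redundant on the second.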

Making Good Feature Sets From Bad Ones

It is rare for data mining practitioners to be able to specify the data provided for their effort. They usually receive data that was collected for some other purpose, but happens to be available for a data mining effort. In practice, this can put the data mining researcher in the challenging position of trying to get useful information out of not so useful data. A process that can help is feature synthesis.

Whether a particular type of feature synthesis will be helpful depends strongly on the particulars of the data itself, so it doesn't make sense to attempt an extended treatment here; the reader is referred to the excellent detailed treatment found in "Data Preparation for Data Mining," Dorian Pyle,3 and is encouraged to read the case study below.

Feature synthesis can include things like replacing the two features total revenue and advertising expenditures by their ratio to obtain the synthesized feature revenue dollars per advertising dollar, which might be more meaningful in some applications. Or, it might replace six features of monthly sales with the slope of the trend line for those six months, which in one number indicates whether things are getting better or getting worse, and how rapidly.
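Both transforms are simple to compute. A minimal sketch, using made-up numbers that are purely illustrative:

```python
import numpy as np

# Hypothetical raw features for one record (values are illustrative only)
total_revenue = 120_000.0
advertising_spend = 8_000.0

# Ratio synthesis: two raw features collapse into one, more meaningful, feature
revenue_per_ad_dollar = total_revenue / advertising_spend

# Trend synthesis: six monthly sales figures collapse into one slope
monthly_sales = np.array([10.0, 11.5, 12.0, 13.2, 14.1, 15.0])  # units: $1000
months = np.arange(len(monthly_sales))
slope, intercept = np.polyfit(months, monthly_sales, deg=1)  # least-squares line

print(revenue_per_ad_dollar)  # 15.0 revenue dollars per advertising dollar
print(slope)                  # positive slope: sales are trending upward
```

Either synthesized feature can then replace the raw features it was built from, reducing the dimension of the data as described above.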

5.4.1 Feature Synthesis Case Study

This case study examines the details of a predictive modeling project to predict the worker’s compensation liability incurred by a state government. The goal was to build a model that could ingest the claims submitted during a month, and estimate the total payout on each claim over the following six months.

The available data consisted of a disparate collection of mostly nominal features.

Initial predictive modeling experiments using numerical codes for the nominal features did not give good results.

The features were taken from the worker's compensation claim forms submitted for payment. In an attempt to perform relatively high-precision estimation (within $1000 on each claim), it was decided to synthesize some additional features from those supplied. In particular, it was believed that transforms that computed a variety of weighted sums, ratios, and nonlinear combinations of the existing discrete features might expose some subtle information encoded in the distribution of feature values within each claimant's record.

Further, for this case study, three ground truth variables were selected by the customer; one of these had to be synthesized as a difference of claimant data fields.

The Original Data

A commercial source provided 12,130 worker’s compensation claim records (comp records) for the pre-study effort. Each record contained 85 fields, any or all of which could be used as features. These data were sampled from claims filed in a single state over a five-year period. Supporting documentation was also provided which included a data concordance, coding tables, and a description of the sampling methodology.

Several Problems Were Addressed:

1. Is it possible to predict from comp record phenomenology the total incurred medical expense?

2. Is it possible to predict from comp record phenomenology the duration of disability?

3. Is it possible to predict from comp record phenomenology which claims will result in litigation/adjudication?

Preliminary Analysis of the Data

Preliminary analysis was performed to determine general population parameters. The distribution of outcomes (in terms of medical dollars) is heavily skewed to the low end: over 99% of the claims had valuations under $100,000.

Single-factor Bayesian analysis, scientific visualization techniques, and covariance measures showed that most of the features were not correlated with each other.

Many vacant fields were found in the supplied data. In order to ensure that any results obtained could be applied in practice, attention was restricted to only those fields present in virtually all records.

Preliminary cluster analysis indicated that the supplied estimates of future medical costs were very poor. Further, the system being developed should aim to improve on current estimation techniques rather than replicate their shortcomings. In subsequent work, attention was restricted to closed cases only.

Feature Extraction

The features satisfying the above conditions, which were intuitively correlated with total medical cost, were selected. Some of these were synthesized from multiple comp record fields by weighted summation, differencing, or other transformations:

1. Type of injury

2. Part of body

3. Person's age at time of injury

4. Gender

5. Marital status

6. Age of policy at time of injury

7. Employment status at time of injury

8. Attorney been retained?

9. Claim ever contested by carrier?

10. Type of employment

11. Traumatic, occupational, or cumulative injury?

12. Pre-injury weekly wage

These features were z-scored and sorted by correlation with outcome valuation, and PCA was applied. The data set was divided into four smaller sets: A, B, C, and D. Each of these smaller sets held approximately 1100 normalized case records.
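The preprocessing pipeline described here (z-scoring, PCA, splitting into subsets) can be sketched as follows. The matrix is random synthetic data standing in for the 12 claim features, and the sizes are illustrative, not the study's actual counts:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the claim data: 400 records, 12 features, mixed scales
X = rng.normal(size=(400, 12)) * rng.uniform(1, 10, size=12)

# 1. z-score each feature: zero mean, unit variance per column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. PCA via SVD of the standardized matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
components = Z @ Vt.T              # principal-component scores per record
explained = s**2 / np.sum(s**2)    # fraction of variance per component

# 3. divide the records into four roughly equal subsets A, B, C, and D
A, B, C, D = np.array_split(components, 4)

print(A.shape, B.shape, C.shape, D.shape)  # four equal-sized subsets
```

Splitting after normalization, as here, is the simplest arrangement; in a deployed pipeline the z-score and PCA parameters would be fit on the training subsets only and then applied to the blind test subsets.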

Training Methods

A neural network was applied to the feature sets A and C. This was done in both a supervised and unsupervised training mode. For sets A and C (those used in training), individual case valuations could consistently be predicted to within $1000 over 98% of the time. In other words, the ground truth assignments can be learned for a given data set. For both sets A and C, it was possible to predict total medical expense for the training sets themselves to within 10%. When applied as blind tests against sets B and D (which were not part of the training data), the prediction of total population claim value was within 15% on the whole sets.

Data analysis indicated that the involvement of an attorney strongly de-correlated the data. A simple stratification was performed to remove cases involving attorneys and contested claims, and the procedures above repeated. (Note: the trained machine could correctly predict the involvement of an attorney over 80% of the time.)
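A stratification of this kind amounts to a boolean filter on the flag features. A minimal sketch with simulated flags; the rates and names are hypothetical, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Hypothetical nominal flags on each claim record (rates are made up)
attorney = rng.random(n) < 0.3    # attorney been retained?
contested = rng.random(n) < 0.2   # claim ever contested by carrier?
cost = rng.lognormal(mean=7.0, sigma=1.0, size=n)  # skewed claim costs

# Stratify: keep the "simple" stratum with no attorney and no contest
simple = ~attorney & ~contested
cost_simple = cost[simple]

print(simple.sum(), "of", n, "claims remain after stratification")
```

Modeling then proceeds separately on each stratum, so that the de-correlating effect of the attorney-involved cases no longer pollutes the simple stratum.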

After stratification, the blind test set results improved to estimation of total cost to within 0.5% on set B, and 7% on set D. However, individual case estimates were still poor.

Conclusions

Results based upon the limited effort applied were promising for population estimation. Additional stratification and feature enhancement were considered, but indications were that the data provided do not support prediction of individual cases.

The Need for Additional Data

It appeared that further progress would be constrained by the limitations of the supplied data set. Missing fields in the supplied data set forced consideration of only a few of the 85 collected features. Aggregate results were consistently much better than individual results. The data set had to be subdivided to obtain blind test sets. These facts, coupled with the high resolution ($1000 bins) desired on the output side, indicate that the supplied data set does not adequately cover the universe of discourse.

5.4.2 Synthesis of Features Checklist

Run this checklist to ensure that issues related to feature synthesis have been considered.
