JSM 2-day Course SF August 2-3, 2003 1
Successful Data Mining in Practice:
Where do we Start?
Richard D. De Veaux
Department of Mathematics and Statistics
Williams College
Williamstown MA, 01267
deveaux@williams.edu
http://www.williams.edu/Mathematics/rdeveaux
Outline
• What is it?
• Why is it different?
• Types of models
• How to start
• Where do we go next?
• Challenges
Reason for Data Mining
Data = $$
Data Mining Is…
“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad
“finding interesting structure (patterns, statistical models, relationships) in data bases” --- Fayyad, Chaudhuri and Bradley
“a knowledge discovery process of extracting
previously unknown, actionable information from very large data bases”--- Zornes
“ a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.”
What is Data Mining?
Paralyzed Veterans of America
• KDD 1998 cup
• Mailing list of 3.5 million potential donors
• Lapsed donors
– Made their last donation to PVA 13 to 24 months prior to June 1997
– 200,000 records (training and test sets)
• Who should get the current mailing?
• Cost effective strategy?
Results for PVA Data Set
• If the entire list (100,000 donors) is mailed, the net donation is $10,500
• Using data mining techniques, this was increased by 41.37%
KDD CUP 98 Results
KDD CUP 98 Results 2
Why Is This Hard?
• Size of Data Set
• Signal/Noise ratio
• Example #1 – PVA on
Why Is It Taking Off Now?
• Because we can
– Computer power
– The price of digital storage is near zero
• Data warehouses already built
– Companies want return on data investment
What’s Different?
• Users
– Domain experts, not statisticians
– Have too much data
– Want automatic methods
– Want useful information
• Problem size
– Number of rows
– Number of variables
Data Mining Data Sets
• Massive amounts of data
• UPS
– 16 TB -- roughly a Library of Congress
– Mostly tracking data
• Low signal-to-noise ratio
– Many irrelevant variables
– Subtle relationships
– Variation
Financial Applications
• Credit assessment
– Is this loan application a good credit risk?
– Who is likely to declare bankruptcy?
• Financial performance
– What should the portfolio product mix be?
Manufacturing Applications
• Product reliability and quality control
• Process control
– What can I do to improve batch yields?
• Warranty analysis
– Product problems
– Fraud
– Service assessment
Medical Applications
• Medical procedure effectiveness
– Who are good candidates for surgery?
• Physician effectiveness
– Which tests are ineffective?
– Which physicians are likely to over-prescribe treatments?
– What combinations of tests are most effective?
E-commerce
• Automatic web page design
• Recommendations for new purchases
• Cross selling
Pharmaceutical Applications
• Clinical trial databases
• Combine clinical trial results with extensive medical/demographic databases to explore:
– Prediction of adverse experiences
– Who is likely to be non-compliant or drop out?
– What are alternative (i.e., non-approved) uses supported by the data?
Example: Screening Plates
• Biological assay
– Samples are tested for potency
– 8 x 12 arrays of samples
– Reference compounds included
• Questions:
– How do we correct for drift?
– How do we recognize clogged dispensing tips?
Pharmaceutical Applications
• High throughput screening
– Predict actions in assays
– Predict results in animals or humans
• Rational drug design
– Relating chemical structure with chemical properties
– Inverse regression to predict chemical properties from desired structure
• DNA SNPs
Pharmaceutical Applications
• Genomics
– Associate genes with diseases
– Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
– Find individuals most susceptible to placebo effect
Fraud Detection
• Identify false:
– Medical insurance claims
– Accident insurance claims
• Which stock trades are based on insider information?
• Whose cell phone number has been stolen?
• Which credit card transactions are from stolen cards?
Case Study I
• Ingot cracking
– 953 30,000-lb. ingots
– 20% cracking rate
– $30,000 per recast
– 90 potential explanatory variables
  · Water composition (reduced)
  · Metal composition
  · Process variables
  · Other environmental variables
Case Study II – Car Insurance
• 42,800 mature policies
• 65 potential predictors
– Tree model found industry, vehicle age, number of vehicles, usage, and location
Data Mining and OLAP
• On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses
– Descriptive, not predictive
• Data mining: software uses data to inductively find patterns
– Predictive or descriptive
• Synergy
– OLAP helps users understand data before mining
– OLAP helps users evaluate the significance and value of patterns
Data Mining vs. Statistics

                                 Data Mining                       Statistics
Large amount of data             1,000,000 rows, 3,000 columns     1,000 rows, 30 columns
Data collection                  Happenstance data                 Designed surveys, experiments
Sample?                          Why bother? We have big,          You bet! We even get
                                 parallel computers                error estimates.
Reasonable price for software    $1,000,000 a year                 $599 with coupon from Amstat News
Presentation medium              PowerPoint, what else?            Overhead foils, of course!
Nice place for a meeting                                           Indianapolis in August; Dallas
Data Mining Vs. Statistics
Data Mining:
• Flexible models
• Prediction often most important
• Computation matters
• Variable selection and overfitting are problems

Statistics:
• Particular model and error structure
• Understanding, confidence intervals
• Computation not critical
• Variable selection and model selection are still problems
What’s the Same?
• George Box
– All models are wrong, but some are useful
– Statisticians, like artists, have the bad habit of falling in love with their models
• The model is no better than the data
• Twyman’s law
– If it looks interesting, it’s probably wrong
• De Veaux’s corollary
– If it’s not wrong, it’s probably obvious
Knowledge Discovery Process
1. Define business problem
2. Build data mining database
3. Explore data
4. Prepare data for modeling
5. Build model
6. Evaluate model
7. Deploy model and results

Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining
Data Mining Myths
• Find answers to unasked questions
• Continuously monitor your data base for interesting patterns
• Eliminate the need to understand your business
• Eliminate the need to collect good data
• Eliminate the need to have good data analysis skills
Beer and Diapers
• Made up story?
• Unrepeatable -- Happened once.
• Lessons learned?
• Imagine being able to see nobody coming down the road, and at such a distance
• De Veaux’s theory of evolution
Successful Data Mining
• The keys to success:
– Formulating the problem
– Using the right data
– Flexibility in modeling
– Acting on results
• Success depends more on the way you mine the data than on the specific tool
Types of Models
• Descriptions
• Classification (categorical or discrete values)
• Regression (continuous values)
– Time series (continuous values)
• Clustering
• Association
Data Preparation
• Build data mining database
• Explore data
• Prepare data for modeling
60% to 95% of the time is spent preparing the data
Data Challenges
• Data definitions
– Types of variables
• Data consolidation
– Combine data from different sources
– NASA Mars lander
• Data heterogeneity
– Homonyms
– Synonyms
• Data quality
Data Quality
Missing Values
• Random missing values
– Delete row?
  · Paralyzed Veterans
– Substitute a value
  · Imputation
  · Multiple imputation
• Systematic missing data
– Now what?
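The two treatments for random missing values above can be sketched in a few lines. This is a minimal illustration (the column layout and numbers are made up, not from the course data): drop any row with a missing value, or fill the hole with the column mean (single imputation).

```python
# Sketch: two simple treatments for randomly missing values.
# The data below are made-up (age, weight) pairs; None marks a missing value.

def delete_rows(rows):
    """Drop any row containing a missing (None) value."""
    return [r for r in rows if None not in r]

def mean_impute(rows, col):
    """Replace missing values in one column with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        r[:col] + (mean,) + r[col + 1:] if r[col] is None else r
        for r in rows
    ]

data = [(25, 140.0), (40, None), (31, 150.0)]
print(delete_rows(data))       # the row with the missing value is gone
print(mean_impute(data, 1))    # None replaced by the column mean, 145.0
```

Multiple imputation (filling each hole several times and pooling the results) is the better-behaved relative of this single fill-in.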
Missing Values -- Systematic
• Ann Landers: 90% of parents said they wouldn’t do it again!!
• Wharton Ph.D. Student questionnaire on survey attitudes
• Bowdoin College applicants have a mean SAT verbal score above 750
The Depression Study
• Designed to study antidepressant efficacy
– Measured via the Hamilton Rating Scale
• Side effects
– Sexual dysfunction
– Misc. safety and tolerability issues
• Late '97 and early '98
• 692 patients
• Two antidepressants + placebo
The Data
• Background info
– Age
– Sex
• Each patient received either
– Placebo
– Antidepressant 1
– Antidepressant 2
• Dosages
• At time points 7 and 14 days we also have:
– Depression scores
– Sexual dysfunction indicators
– Response indicators
Example #2
• Depression Study data
• Examine data for missing values
Build Data Mining Database
• Collect data
• Describe data
• Select data
• Build metadata
• Integrate data
• Clean data
• Load the data mining database
• Maintain the data mining database
Data Warehouse Architecture
• Reference: Data Warehouse: From Architecture to Implementation by Barry Devlin, Addison-Wesley, 1997
• Three-tier data architecture
– Source data
– Business data warehouse (BDW): the reconciled data that serves as a system of record
– Business information warehouse (BIW): the data warehouse you use
Data Mining BIW
[Diagram: data sources feed the business data warehouse (BDW), which in turn feeds the subject, geographic, and data mining BIWs]
Metadata
• The data survey describes the data set contents and characteristics
– Table name
– Description
– Primary key/foreign key relationships
– Collection information: how, where, conditions
– Timeframe: daily, weekly, monthly
– Cosynchronous: every Monday or Tuesday
Relational Data Bases
• Data are stored in tables
Items
ItemID   ItemName    Price
C56621   top hat     34.95
T35691   cane         4.99
RS5292   red shoes   22.95

Shoppers
PersonID   PersonName   ZIPCode   ItemBought
135366     Lyle         19103     T35691
135366     Lyle         19103     C56621
259835     Dick         01267     RS5292
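The two tables above are linked through ItemID. A minimal sketch in SQLite (column names tidied for illustration) shows the join that recovers who bought what, and at what price:

```python
import sqlite3

# Build the two slide tables in an in-memory database and join them.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Items (ItemID TEXT PRIMARY KEY, ItemName TEXT, Price REAL)")
con.execute("CREATE TABLE Shoppers (PersonID TEXT, PersonName TEXT, ZIPCode TEXT, ItemBought TEXT)")
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    ("C56621", "top hat", 34.95),
    ("T35691", "cane", 4.99),
    ("RS5292", "red shoes", 22.95),
])
con.executemany("INSERT INTO Shoppers VALUES (?, ?, ?, ?)", [
    ("135366", "Lyle", "19103", "T35691"),
    ("135366", "Lyle", "19103", "C56621"),
    ("259835", "Dick", "01267", "RS5292"),
])

# ItemBought is a foreign key into Items: the join matches each purchase
# to its item description and price.
rows = con.execute("""
    SELECT s.PersonName, i.ItemName, i.Price
    FROM Shoppers s JOIN Items i ON s.ItemBought = i.ItemID
    ORDER BY s.PersonID, i.ItemID
""").fetchall()
for name, item, price in rows:
    print(name, item, price)
```

This primary-key/foreign-key join is exactly the flexibility (and the per-query cost) discussed on the next slide.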
RDBMS Characteristics
• Advantages
– All major DBMSs are relational
– Flexible data structure
– Standard language
– Many applications can directly access RDBMSs
• Disadvantages
– May be slow for data mining
– Physical storage required
– Database administration overhead
Data Selection
• Compute time is determined by the number of cases (rows), the number of variables (columns), and the number of distinct values for categorical variables
– Reduce the number of variables
– Sample rows
• An extraneous column can result in overfitting your data
– Employee ID as a predictor of credit risk
Sampling Is Ubiquitous
• The database itself is almost certainly a sample of some population
• Most model building techniques require separating the data into training and testing samples
Model Building
• Model building
– Train
– Test
• Evaluate
Overfitting in Regression
Classical overfitting:
– Fit a 6th-order polynomial to 6 data points
[Figure: a high-order polynomial threading exactly through all 6 data points]
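A quick numerical sketch of the classical overfit (the data here are made up: roughly linear with noise). A polynomial with as many parameters as data points drives the training error to essentially zero, but that says nothing about how it behaves off the data:

```python
import numpy as np

# 6 made-up points: roughly linear plus noise.
rng = np.random.default_rng(0)
x = np.arange(6.0)                          # x = 0, 1, ..., 5
y = 0.5 * x + rng.normal(0, 0.3, size=6)

line = np.polyfit(x, y, 1)      # 2 parameters: slope and intercept
wiggle = np.polyfit(x, y, 5)    # 6 parameters: one per data point

train_err_line = np.sum((np.polyval(line, x) - y) ** 2)
train_err_wiggle = np.sum((np.polyval(wiggle, x) - y) ** 2)
print(train_err_line, train_err_wiggle)   # the wiggly fit is (near) perfect

# Between the data points, though, the wiggly fit can stray from the trend:
print(np.polyval(wiggle, 4.5), np.polyval(line, 4.5))
```

The degree-5 fit "wins" on the training data by construction; the test set is what exposes it.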
Overfitting
• Fitting non-explanatory variables to data
• Overfitting is the result of
– Including too many predictor variables
– Failing to regularize the model
  · Neural net run too long
  · Decision tree too deep
Avoiding Overfitting
• Avoiding overfitting is a balancing act
– Fit fewer variables rather than more
– Have a reason for including a variable (other than that it is in the database)
– Regularize (don’t overtrain)
– Know your field

All models should be as simple as possible but no simpler than necessary
--- Albert Einstein
Evaluate the Model
• Accuracy
– Error rate
– Proportion of explained variation
• Significance
– Statistical
– Reasonableness
– Sensitivity
– Compute value of decisions
  · The “so what” test
Simple Validation
• Method: split the data into a training set and a testing set. A third set for validation may also be used
• Advantages: easy to use and understand. Good estimate of prediction error for reasonably large data sets
• Disadvantages: lose 20%-30% of the data from model building
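The split itself is a one-liner plus bookkeeping. A minimal sketch (the 25% test fraction and seed are arbitrary choices, not from the course):

```python
import random

# Simple validation: shuffle, then hold out a fraction of the rows for testing.
def train_test_split(rows, test_frac=0.25, seed=42):
    rows = rows[:]                         # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # seeded for reproducibility
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]    # (train, test)

data = list(range(100))                    # stand-in for 100 data rows
train, test = train_test_split(data)
print(len(train), len(test))               # 75 25
```

Every row lands in exactly one of the two sets; the model never sees the test rows until evaluation.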
Train vs. Test Data Sets
[Figure: the data set partitioned into training and test subsets]
N-fold Cross Validation
• If you don’t have a large amount of data, build a model using all the available data
– What is the error rate for the model?
• Divide the data into N equal-sized groups and build a model on the data with one group left out
[Figure: the data divided into 10 groups, with one group held out]
N-fold Cross Validation
• The held-out group is predicted and a prediction error rate is calculated
• This is repeated for each group in turn, and the average over all N repeats is used as the model error rate
• Advantages: good for small data sets; uses all the data to calculate the prediction error rate
• Disadvantages: lots of computing
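The procedure above can be sketched generically. Here `fit` and `error` are placeholders for any modeling procedure (the toy "model" below is just the training mean, an assumption for illustration):

```python
# N-fold cross-validation: each group sits out once; a model is fit on the
# remaining groups and scored on the held-out group; the N scores are averaged.
def n_fold_cv(rows, n, fit, error):
    folds = [rows[i::n] for i in range(n)]       # n roughly equal groups
    errs = []
    for i in range(n):
        held_out = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        errs.append(error(model, held_out))
    return sum(errs) / n

# Toy use: the "model" is the mean of the training y's,
# and the error is the mean squared deviation on the held-out fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fit = lambda rows: sum(rows) / len(rows)
mse = lambda m, rows: sum((y - m) ** 2 for y in rows) / len(rows)
print(n_fold_cv(data, 5, fit, mse))
```

With n equal to the number of rows this becomes leave-one-out, the expensive extreme the "lots of computing" bullet warns about.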
Regularization
• A model can be built to closely fit the training set but not the real data.
• Symptom: the errors in the training set are reduced, but increased in the test or validation sets.
• Regularization minimizes the residual sum of squares adjusted for model complexity.
• Accomplished by using a smaller decision tree or by pruning it; in neural nets, by avoiding over-training.
Example #3
• Depression Study data
• Fit a tree to DRP using all the variables
– Continue until the model won’t let you fit any more
• Predict on the test set
Transparent Data Mining Tools
• Visualization
• Regression
– Logistic regression
• Decision trees
• Clustering methods
Black Box Data Mining Tools
• Neural networks
• K nearest neighbor
• K-means
• Support vector machines
• Genetic algorithms (not a modeling tool)
“Toy” Problem
[Figure: eight scatterplots of the response train2$y against individual predictors train2[, i]]
Linear Regression
Term Estimate Std Error t Ratio Prob>|t|
Intercept -0.900 0.482 -1.860 0.063
x1 4.658 0.292 15.950 <.0001
x2 4.685 0.294 15.920 <.0001
x3 -0.040 0.291 -0.140 0.892
x4 9.806 0.298 32.940 <.0001
x5 5.361 0.281 19.090 <.0001
x6 0.369 0.284 1.300 0.194
x7 0.001 0.291 0.000 0.998
x8 -0.110 0.295 -0.370 0.714
x9 0.467 0.301 1.550 0.122
x10 -0.200 0.289 -0.710 0.479
R-squared: 73.5% Train 69.4% Test
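A hedged reconstruction of what a regression like the one above looks like in code. The true generating model of the toy data isn't shown in the slides, so the setup below is hypothetical: only x1, x2, x4, and x5 matter (coefficients chosen to echo the table), and the other six predictors are pure noise. Ordinary least squares then recovers the signal coefficients and leaves the noise variables near zero:

```python
import numpy as np

# Hypothetical stand-in for the toy problem (not the actual course data):
# 10 uniform predictors, of which only x1, x2, x4, x5 carry signal.
rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.uniform(0, 1, size=(n, p))
beta = np.zeros(p)
beta[[0, 1, 3, 4]] = [4.7, 4.7, 9.8, 5.4]        # x1, x2, x4, x5
y = X @ beta + rng.normal(0, 1, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef[1:], 2))   # estimates cluster near the true betas;
                               # the six noise variables land near zero
```

This is the pattern the t-ratios in the table encode: four large, clearly significant coefficients, six indistinguishable from zero.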
Stepwise Regression
Term Estimate Std Error t Ratio Prob>|t|
Intercept -0.625 0.309 -2.019 0.0439
x1 4.619 0.289 15.998 <.0001
x2 4.665 0.292 15.984 <.0001
x4 9.824 0.296 33.176 <.0001
x5 5.366 0.28 19.145 <.0001
R-squared 73.3% on Train 69.8% Test
Stepwise 2
2nd-Order Model
Term Estimate Std Error t Ratio Prob>|t|
Intercept -2.074 0.248 -8.356 0.000
x1 4.352 0.182 23.881 0.000
x2 4.726 0.183 25.786 0.000
x3 -0.503 0.182 -2.769 0.006
(x3-0.48517)*(x3-0.48517) 20.450 0.687 29.755 0.000
x4 9.989 0.186 53.674 0.000
x5 5.185 0.176 29.528 0.000
x9 0.391 0.188 2.084 0.038
(x9-0.51161)*(x9-0.51161) -0.783 0.743 -1.053 0.293
(x1-0.51811)*(x2-0.48354) 8.815 0.634 13.910 0.000
(x1-0.51811)*(x3-0.48517) -1.187 0.648 -1.831 0.067
(x1-0.51811)*(x4-0.49647) 0.925 0.653 1.416 0.157
(x2-0.48354)*(x3-0.48517) -0.626 0.634 -0.988 0.324
R-squared 89.7% Train 88.9% Test
Next Steps
• Higher order terms?
• When to stop?
• Transformations?
• Too simple: underfitting (bias)
• Too complex: inconsistent predictions, overfitting (high variance)
• Selecting models is Occam’s razor
– Keep the goals of interpretation vs. prediction in mind
Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
[Figure: a straight line fit to 0/1 data plotted against Income; the fitted line strays outside [0, 1]]
Example #4
• Depression Study data
• Fit a linear regression to DRP using HAMA 14
Logistic Regression II
• Points on the line can be interpreted as probability, but don’t stay within [0,1]
• Use a sigmoidal function instead of linear function to fit the data
  f(I) = 1 / (1 + e^(-I))

[Figure: the sigmoid function, rising from 0 to 1 as I runs from -10 to 10]
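The sigmoidal function above is one line of code, and its behavior at the extremes is what keeps fitted probabilities inside [0, 1]:

```python
import math

# The logistic sigmoid from the slide: f(I) = 1 / (1 + e^(-I)).
# It maps any real input to a value strictly between 0 and 1.
def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

print(sigmoid(0))      # 0.5 -- the midpoint
print(sigmoid(10))     # close to 1
print(sigmoid(-10))    # close to 0
```

In logistic regression, I is the usual linear predictor (intercept plus coefficients times the x's), so the fitted curve is an S-shape rather than a straight line.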
Logistic Regression III
[Figure: logistic curve fit to Accept (0/1) vs. Income]
Example #5
• Depression Study data
• Fit a logistic regression to DRP using HAMA 14
Regression - Summary
• Often works well
• Easy to use
• Theory gives prediction and confidence intervals
• Key is variable selection with interactions and transformations
• Use logistic regression for binary data
Smoothing – What’s the Trend?
[Figure: bivariate fit of Euro/USD exchange rate vs. time, 1999-2002]
Scatterplot Smoother
[Figure: smoothing spline fit (lambda = 10.44, R-square = 0.907) to Euro/USD vs. time]
Less Smoothing
Usually these smoothers offer a choice of how much smoothing to do.
[Figure: smoothing spline fit with lambda = 0.001478 (R-square = 0.987, SSE = 0.116); a much rougher fit to Euro/USD vs. time]
Example #6
• Fit a linear regression to Euro Rate over time
• Fit a smoothing spline
Draft Lottery 1970
[Figure: draft number vs. day of year, 1970 draft lottery]
Draft Data Smoothed
[Figure: smoothed draft number vs. day of year]
More Dimensions
• Why not smooth using 10 predictors?
– Curse of dimensionality
– With 10 predictors, if we use 10% of each as a neighborhood, how many points do we need to get 100 points in the cube?
– Conversely, to get 10% of the points, what percentage of each predictor do we need to take?
– We need a new approach
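The arithmetic behind those two questions, with each predictor scaled to [0, 1]:

```python
# Curse of dimensionality with p = 10 predictors on the unit cube.
p = 10

# A neighborhood taking 10% of each predictor covers this fraction of the
# cube -- so getting 100 points inside it takes on the order of 10^12 points.
frac = 0.1 ** p
print(frac)          # 1e-10 of the cube's volume
print(100 / frac)    # ~1e12 points needed overall

# Conversely, a sub-cube holding 10% of uniformly spread points must stretch
# almost 80% of the way along every single axis.
side = 0.1 ** (1 / p)
print(side)          # ~0.794
```

In 10 dimensions a "local" neighborhood is either starved of data or not local at all, which is exactly why the next slide retreats to additive models.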
Additive Model
• Can’t get
  ŷ = f(x1, ..., xp)
• So, simplify to:
  ŷ = f1(x1) + f2(x2) + ... + fp(xp)
• Each of the fi is easy to find
– Scatterplot smoothers
Create New Features
• Instead of the original x’s, use linear combinations
– Principal components
– Factor analysis
– Multidimensional scaling

  z_i = α + b1 x1 + ... + bp xp
How To Find Features
• If you have a response variable, the question may change.
• What are interesting directions in the predictors?
– High-variance directions in X: PCA
– High covariance with Y: PLS
– High correlation with Y: OLS
– Directions whose smooth is correlated with Y: PPR
Principal Components
• First direction has maximum variance
• Second direction has maximum variance among all directions perpendicular to the first
• Repeat until there are as many directions as original variables
• Reduce to a smaller number
– Multiple approaches to eliminate directions
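The directions described above are the eigenvectors of the covariance matrix, ordered by eigenvalue. A minimal sketch on made-up data (two strongly correlated variables, so the first component should soak up nearly all the variance):

```python
import numpy as np

# Made-up data: two variables that are both noisy copies of a common signal t.
rng = np.random.default_rng(2)
t = rng.normal(size=200)
X = np.column_stack([t + 0.1 * rng.normal(size=200),
                     t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                   # center each column
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())            # share of variance per component
scores = Xc @ eigvecs[:, 0]               # data projected onto the first PC
```

Dropping the low-variance directions is the dimension reduction: here one score per row replaces two correlated measurements with little loss.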
First Principal Component
[Figure: scatterplot of Disp. vs. Weight with the first principal component direction drawn through it]
When Does This Work Well?
• When you have a group of highly correlated predictor variables
– Census information
– History of past giving
– 10 temperature sensors
Biplot
[Figure: biplot of the first two principal components for 100 observations; the ten temperature sensors (Temp.1-Temp.10) load together in one direction, while Flow1 and Flow2 point in other directions]
Advantages of Projection
• Interpretation
• Dimension reduction
• Able to have more predictors than observations
Disadvantages
• Lack of interpretation
• Linear
Going Non-linear
• The features are all linear:
  ŷ = b0 + b1 z1 + ... + bk zk
• But you could also use them in an additive model:
  ŷ = f1(z1) + f2(z2) + ... + fp(zp)
Examples
• If the f’s are arbitrary, we have projection pursuit regression
• If the f’s are sigmoidal, we have a neural network
– The z’s are the hidden nodes
– The s’s are the activation functions
– The b’s are the weights

  ŷ = α + b1 s1(z1) + b2 s2(z2) + ... + bp sp(zp)
Neural Nets
• Don’t resemble the brain
– They are a statistical model
– Closest relative is projection pursuit regression
History
• Biology
– Neurode (McCulloch & Pitts, 1943)
– Theory of learning (Hebb, 1949)
• Computer science
– Perceptron (Rosenblatt, 1958); Adaline (Widrow, 1960)
– Perceptrons (Minsky & Papert, 1969)
– Neural nets (Rumelhart and others, 1986)
A Single Neuron
[Diagram: inputs x0, x1, ..., x5 feed a single node with weights 0.8, 0.3, 0.7, -0.2, 0.4, -0.5]
  z1 = 0.8 + 0.3x1 + 0.7x2 - 0.2x3 + 0.4x4 - 0.5x5, output h(z1)
Single Node
Input to “hidden node” l:
  z_l = Σ_j w_jl x_j + θ_l
Output (input to the outer layer):
  ŷ_kl = h_k(z_l)
Layered Architecture
[Diagram: input layer (x1, x2) feeds hidden layer (z1, z2, z3), which feeds the output layer (y)]
Neural Networks
Create lots of features – hidden nodes:
  z_l = Σ_j w_jl x_j + θ_l
Use them in an additive model:
  ŷ = w21 h(z1) + w22 h(z2) + ... + θ2
Put It Together
  ŷ = h2( Σ_l w̃_kl h1( Σ_j w_jl x_j + θ_l ) + θ_k )
The resulting model is just a flexible non- linear regression of the response on a set of predictor variables.
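That "flexible non-linear regression" view is easy to make concrete. A sketch of the forward pass for one hidden layer, matching the layered diagram (2 inputs, 3 hidden nodes, 1 output); all the weights below are made up, and the output activation h2 is taken as the identity:

```python
import math

# One-hidden-layer net: hidden node l computes sigmoid(sum_j w_jl x_j + theta_l),
# and the output is a linear combination of the hidden values plus a bias.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_net(x, hidden_w, hidden_b, out_w, out_b):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b

# Made-up weights: 2 inputs -> 3 hidden nodes -> 1 output.
hidden_w = [[0.3, 0.7], [-0.2, 0.4], [0.5, -0.5]]
hidden_b = [0.8, 0.0, -0.1]
out_w = [1.0, -2.0, 0.5]
out_b = 0.2
print(neural_net([1.0, 2.0], hidden_w, hidden_b, out_w, out_b))
```

Training is then nothing more exotic than adjusting these weights to reduce the residual sum of squares, which is what the next slides walk through.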
Running a Neural Net
Predictions for Example
R-squared 89.5% Train, 87.7% Test
What Does This Get Us?
• Enormous flexibility
• Ability to fit anything
– Including noise
– Not just the elephant – the whole herd!
• Interpretation?
Example #7
• Fit a neural net to the “toy problem” data.
• Look at the profiler.
• How does it differ from the full regression model?
Running a Training Session
• Initialize weights
– Set range
– Random initialization
– With weights from a previous training session
• An epoch is one pass through every row in the data set
– Can be in random or fixed order
Training the Neural Net
[Figure: training set and test set error (RSS) as a function of training epochs]
Stopping Rules
• Error (RMS) threshold
• Limit (early stopping rule)
– Time
– Epochs
• Error rate-of-change threshold
– E.g., no change in RMS error in 100 epochs
• Minimize error + complexity (weight decay)
– De Veaux et al., Technometrics, 1998
Neural Net Pro
• Advantages
– Handles continuous or discrete values
– Complex interactions
– In general, highly accurate for fitting due to flexibility of the model
– Can incorporate known relationships
  · So-called grey box models
  · See De Veaux et al., Environmetrics, 1999
Neural Net Con
• Disadvantages
– Model is not descriptive (black box)
– Difficult, complex architectures
– Slow model building
– Categorical data explosion
– Sensitive to input variable selection
[Diagram: credit-risk tree splitting on Household Income > $40,000, Debt > $10,000, and On Job > 1 Yr, with leaf probabilities .01, .03, .07, and .04]
Decision Trees
Determining Credit Risk
• 11,000 cases of loan history
– 10,000 cases in training set
  · 7,500 good risks
  · 2,500 bad risks
– 1,000 cases in test set
• Data available
– Predictor variables
  · Income: continuous
  · Years at job: continuous
  · Debt: categorical (High, Low)
– Response variable
  · Good risk: categorical (Yes, No)
Find the First Split
[Diagram: root node (7,500 Y / 2,500 N) splits on Income at $40K into 1,500 Y / 1,500 N (< $40K) and 6,000 Y / 1,000 N (> $40K)]
Find an Unsplit Node and Split It
[Diagram: the < $40K node splits on Job Years at 5, separating 1,400 Y / 100 N from 100 Y / 1,400 N]
Find an Unsplit Node and Split It
[Diagram: the > $40K node splits on Debt into 6,000 Y / 0 N (Low) and 0 Y / 1,000 N (High)]
Class Assignment
• The tree is applied to new data to classify it
• A case or instance is assigned to the largest (or modal) class in the leaf to which it goes
• Example: all cases arriving at the leaf with 1,400 Y / 100 N would be given the value “Yes”
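Applying a fitted tree is just nested if/else rules. A sketch of the credit-risk tree from these slides, with one assumption made explicit: the slides don't say which Job Years branch holds the 1,400-Y leaf, so I've assumed it is the "> 5 years" branch.

```python
# The credit-risk tree as nested rules. Split values come from the slides;
# the "> 5 years is the good-risk branch" assignment is an assumption.
def classify(income, job_years, debt):
    if income > 40_000:
        return "Yes" if debt == "Low" else "No"    # leaves: 6,000Y/0N vs 0Y/1,000N
    else:
        return "Yes" if job_years > 5 else "No"    # leaves: 1,400Y/100N vs 100Y/1,400N

print(classify(50_000, 2, "Low"))    # Yes
print(classify(50_000, 2, "High"))   # No
print(classify(30_000, 10, "High"))  # Yes
```

Each new case simply falls down the splits until it lands in a leaf and takes that leaf's majority class.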
Tree Algorithms
• CART (Breiman, Friedman, Olshen, Stone)
• C4.5, C5.0, Cubist (Quinlan)
• CHAID
• SLIQ (IBM)
• QUEST (SPSS)
Decision Trees
• Find the split in a predictor variable that best separates the data into homogeneous groups
• Build the tree inductively, basing future splits on past choices (greedy algorithm)
• Classification trees (categorical response)
• Regression trees (continuous response)
• Size of tree often determined by cross-validation
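The greedy split search in the first bullet can be sketched for one numeric predictor. Gini impurity is used here as the homogeneity measure (one common choice, e.g. in CART; the data below are made up):

```python
# Greedy split search for one numeric predictor: try every threshold and
# keep the one minimizing the weighted Gini impurity of the two child groups.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    best = (None, float("inf"))              # (threshold, weighted impurity)
    for t in sorted(set(xs))[:-1]:           # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up example: income in $K vs. good-risk label.
# Splitting at 38 separates the classes perfectly (impurity 0).
income = [25, 30, 35, 38, 45, 50, 60, 80]
risk   = ["N", "N", "N", "N", "Y", "Y", "Y", "Y"]
print(best_split(income, risk))
```

A full tree builder just applies this search to every predictor at every node and recurses; that repetition is the "greedy algorithm" of the bullet above.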
Geometry of Decision Trees
[Figure: Y’s and N’s plotted in the (Household Income, Debt) plane; the tree’s splits partition the plane into rectangles]
Two Way Tables -- Titanic

                     Ticket Class
             Crew   First   Second   Third   Total
Survival
  Lived       212     202      118     178     710
  Died        673     123      167     528    1491
  Total       885     325      285     706    2201

[Figure: bar chart of survivors and non-survivors by class]
Mosaic Plot and Tree Diagram
[Figure: mosaic plot and tree diagram of Titanic survival, splitting on sex, adult vs. child, and ticket class; leaf survival rates range from 14% to 100%]
Regression Tree
[Figure: regression tree for the cars data, splitting on Price < 9446.5, Weight < 2280, Disp. < 134, Weight < 3637.5, Price < 11522, Reliability, and HP < 154; leaf mean mileages range from 18.6 to 34.0]