JSM 2-day Course SF August 2-3, 2003 1
Successful Data Mining in Practice:
Where do we Start?
Richard D. De Veaux
Department of Mathematics and Statistics
Williams College
Williamstown MA, 01267
deveaux@williams.edu
http://www.williams.edu/Mathematics/rdeveaux
Outline
• What is it?
• Why is it different?
• Types of models
• How to start
• Where do we go next?
• Challenges
Reason for Data Mining
Data = $$
Data Mining Is…
“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad
“finding interesting structure (patterns, statistical models, relationships) in data bases” --- Fayyad, Chaudhuri and Bradley
“a knowledge discovery process of extracting
previously unknown, actionable information from very large data bases”--- Zornes
“ a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.”
What is Data Mining?
Paralyzed Veterans of America
• KDD 1998 cup
• Mailing list of 3.5 million potential donors
• Lapsed donors
– Made their last donation to PVA 13 to 24 months prior to June 1997
– 200,000 records (training and test sets)
• Who should get the current mailing?
• Cost effective strategy?
Results for PVA Data Set
• If the entire list (100,000 donors) is mailed, the net donation is $10,500
• Using data mining techniques, this was increased by 41.37%
KDD CUP 98 Results
KDD CUP 98 Results 2
Why Is This Hard?
• Size of Data Set
• Signal/Noise ratio
• Example #1 – PVA on
Why Is It Taking Off Now?
• Because we can
– Computer power
– The price of digital storage is near zero
• Data warehouses already built
– Companies want return on data investment
What’s Different?
• Users
– Domain experts, not statisticians
– Have too much data
– Want automatic methods
– Want useful information
• Problem size
– Number of rows
– Number of variables
Data Mining Data Sets
• Massive amounts of data
• UPS
– 16 TB -- roughly a Library of Congress
– Mostly tracking data
• Low signal-to-noise ratio
– Many irrelevant variables
– Subtle relationships
– Variation
Financial Applications
• Credit assessment
– Is this loan application a good credit risk?
– Who is likely to declare bankruptcy?
• Financial performance
– What should the portfolio product mix be?
Manufacturing Applications
• Product reliability and quality control
• Process control
– What can I do to improve batch yields?
• Warranty analysis
– Product problems
– Fraud
– Service assessment
Medical Applications
• Medical procedure effectiveness
– Who are good candidates for surgery?
• Physician effectiveness
– Which tests are ineffective?
– Which physicians are likely to over-prescribe treatments?
– What combinations of tests are most effective?
E-commerce
• Automatic web page design
• Recommendations for new purchases
• Cross selling
Pharmaceutical Applications
• Clinical trial databases
• Combine clinical trial results with extensive medical/demographic databases to explore:
– Prediction of adverse experiences
– Who is likely to be non-compliant or drop out?
– What are alternative (i.e., non-approved) uses supported by the data?
Example: Screening Plates
• Biological assay
– Samples are tested for potency
– 8 x 12 arrays of samples
– Reference compounds included
• Questions:
– How do we correct for drift?
– How do we recognize clogged dispensing tips?
Pharmaceutical Applications
• High throughput screening
– Predict actions in assays
– Predict results in animals or humans
• Rational drug design
– Relating chemical structure with chemical properties
– Inverse regression to predict chemical properties from desired structure
• DNA SNPs
Pharmaceutical Applications
• Genomics
– Associate genes with diseases
– Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
– Find individuals most susceptible to placebo effect
Fraud Detection
• Identify false:
– Medical insurance claims
– Accident insurance claims
• Which stock trades are based on insider information?
• Whose cell phone number has been stolen?
• Which credit card transactions are from stolen cards?
Case Study I
• Ingot cracking
– 953 30,000-lb. ingots
– 20% cracking rate
– $30,000 per recast
– 90 potential explanatory variables
  · Water composition (reduced)
  · Metal composition
  · Process variables
  · Other environmental variables
Case Study II – Car Insurance
• 42,800 mature policies
• 65 potential predictors
– Tree model found industry, vehicle age, number of vehicles, usage, and location
Data Mining and OLAP
• On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses
– Descriptive, not predictive
• Data mining: software uses data to inductively find patterns
– Predictive or descriptive
• Synergy
– OLAP helps users understand data before mining
– OLAP helps users evaluate the significance and value of patterns
Data Mining vs. Statistics

                                 Data Mining                       Statistics
Large amount of data             1,000,000 rows, 3,000 columns     1,000 rows, 30 columns
Data collection                  Happenstance data                 Designed surveys, experiments
Sample?                          Why bother? We have big,          You bet! We even get
                                 parallel computers                error estimates.
Reasonable price for software    $1,000,000 a year                 $599 with coupon from Amstat News
Presentation medium              PowerPoint, what else?            Overhead foils, of course!
Nice place for a meeting                                           Indianapolis in August; Dallas
Data Mining Vs. Statistics
Data Mining:
• Flexible models
• Prediction often most important
• Computation matters
• Variable selection and overfitting are problems

Statistics:
• Particular model and error structure
• Understanding, confidence intervals
• Computation not critical
• Variable selection and model selection are still problems
What’s the Same?
• George Box
– All models are wrong, but some are useful
– Statisticians, like artists, have the bad habit of falling in love with their models
• The model is no better than the data
• Twyman’s law
– If it looks interesting, it’s probably wrong
• De Veaux’s corollary
– If it’s not wrong, it’s probably obvious
Knowledge Discovery Process
1. Define business problem
2. Build data mining database
3. Explore data
4. Prepare data for modeling
5. Build model
6. Evaluate model
7. Deploy model and results

Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining
Data Mining Myths
• Find answers to unasked questions
• Continuously monitor your data base for interesting patterns
• Eliminate the need to understand your business
• Eliminate the need to collect good data
• Eliminate the need to have good data analysis skills
Beer and Diapers
• Made up story?
• Unrepeatable -- Happened once.
• Lessons learned?
• Imagine being able to see nobody coming down the road, and at such a distance
• De Veaux’s theory of evolution
Successful Data Mining
• The keys to success:
– Formulating the problem
– Using the right data
– Flexibility in modeling
– Acting on results
• Success depends more on the way you mine the data than on the specific tool
Types of Models
• Descriptions
• Classification (categorical or discrete values)
• Regression (continuous values)
– Time series (continuous values)
• Clustering
• Association
Data Preparation
• Build data mining database
• Explore data
• Prepare data for modeling
60% to 95% of the time is spent preparing the data
Data Challenges
• Data definitions
– Types of variables
• Data consolidation
– Combine data from different sources
– NASA Mars lander
• Data heterogeneity
– Homonyms
– Synonyms
• Data quality
Data Quality
Missing Values
• Random missing values
– Delete row?
  · Paralyzed Veterans
– Substitute a value
  · Imputation
  · Multiple imputation
• Systematic missing data
– Now what?
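The two treatments for random missing values above can be sketched in a few lines. This is a minimal illustration (the column layout and numbers are made up, not from the course data): drop any row with a missing value, or fill the hole with the column mean (single imputation).

```python
# Sketch: two simple treatments for randomly missing values.
# The data below are made-up (age, weight) pairs; None marks a missing value.

def delete_rows(rows):
    """Drop any row containing a missing (None) value."""
    return [r for r in rows if None not in r]

def mean_impute(rows, col):
    """Replace missing values in one column with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        r[:col] + (mean,) + r[col + 1:] if r[col] is None else r
        for r in rows
    ]

data = [(25, 140.0), (40, None), (31, 150.0)]
print(delete_rows(data))       # the row with the missing value is gone
print(mean_impute(data, 1))    # None replaced by the column mean, 145.0
```

Multiple imputation (filling each hole several times and pooling the results) is the better-behaved relative of this single fill-in.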
Missing Values -- Systematic
• Ann Landers: 90% of parents said they wouldn’t do it again!!
• Wharton Ph.D. Student questionnaire on survey attitudes
• Bowdoin College applicants have a mean SAT verbal score above 750
The Depression Study
• Designed to study antidepressant efficacy
– Measured via the Hamilton Rating Scale
• Side effects
– Sexual dysfunction
– Misc. safety and tolerability issues
• Late '97 and early '98
• 692 patients
• Two antidepressants + placebo
The Data
• Background info
– Age
– Sex
• Each patient received either
– Placebo
– Antidepressant 1
– Antidepressant 2
• Dosages
• At time points 7 and 14 days we also have:
– Depression scores
– Sexual dysfunction indicators
– Response indicators
Example #2
• Depression Study data
• Examine data for missing values
Build Data Mining Database
• Collect data
• Describe data
• Select data
• Build metadata
• Integrate data
• Clean data
• Load the data mining database
• Maintain the data mining database
Data Warehouse Architecture
• Reference: Data Warehouse: From Architecture to Implementation by Barry Devlin, Addison-Wesley, 1997
• Three-tier data architecture
– Source data
– Business data warehouse (BDW): the reconciled data that serves as a system of record
– Business information warehouse (BIW): the data warehouse you use
Data Mining BIW
[Diagram: data sources feed the business data warehouse (BDW), which in turn feeds the subject, geographic, and data mining BIWs]
Metadata
• The data survey describes the data set contents and characteristics
– Table name
– Description
– Primary key/foreign key relationships
– Collection information: how, where, conditions
– Timeframe: daily, weekly, monthly
– Cosynchronous: every Monday or Tuesday
Relational Data Bases
• Data are stored in tables
Items
ItemID   ItemName    Price
C56621   top hat     34.95
T35691   cane         4.99
RS5292   red shoes   22.95

Shoppers
PersonID   PersonName   ZIPCode   ItemBought
135366     Lyle         19103     T35691
135366     Lyle         19103     C56621
259835     Dick         01267     RS5292
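The two tables above are linked through ItemID. A minimal sketch in SQLite (column names tidied for illustration) shows the join that recovers who bought what, and at what price:

```python
import sqlite3

# Build the two slide tables in an in-memory database and join them.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Items (ItemID TEXT PRIMARY KEY, ItemName TEXT, Price REAL)")
con.execute("CREATE TABLE Shoppers (PersonID TEXT, PersonName TEXT, ZIPCode TEXT, ItemBought TEXT)")
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    ("C56621", "top hat", 34.95),
    ("T35691", "cane", 4.99),
    ("RS5292", "red shoes", 22.95),
])
con.executemany("INSERT INTO Shoppers VALUES (?, ?, ?, ?)", [
    ("135366", "Lyle", "19103", "T35691"),
    ("135366", "Lyle", "19103", "C56621"),
    ("259835", "Dick", "01267", "RS5292"),
])

# ItemBought is a foreign key into Items: the join matches each purchase
# to its item description and price.
rows = con.execute("""
    SELECT s.PersonName, i.ItemName, i.Price
    FROM Shoppers s JOIN Items i ON s.ItemBought = i.ItemID
    ORDER BY s.PersonID, i.ItemID
""").fetchall()
for name, item, price in rows:
    print(name, item, price)
```

This primary-key/foreign-key join is exactly the flexibility (and the per-query cost) discussed on the next slide.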
RDBMS Characteristics
• Advantages
– All major DBMSs are relational
– Flexible data structure
– Standard language
– Many applications can directly access RDBMSs
• Disadvantages
– May be slow for data mining
– Physical storage required
– Database administration overhead
Data Selection
• Compute time is determined by the number of cases (rows), the number of variables (columns), and the number of distinct values for categorical variables
– Reduce the number of variables
– Sample rows
• An extraneous column can result in overfitting your data
– Employee ID as a predictor of credit risk
Sampling Is Ubiquitous
• The database itself is almost certainly a sample of some population
• Most model building techniques require separating the data into training and testing samples
Model Building
• Model building
– Train
– Test
• Evaluate
Overfitting in Regression
Classical overfitting:
– Fit a 6th-order polynomial to 6 data points
[Figure: a high-order polynomial threading exactly through all 6 data points]
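A quick numerical sketch of the classical overfit (the data here are made up: roughly linear with noise). A polynomial with as many parameters as data points drives the training error to essentially zero, but that says nothing about how it behaves off the data:

```python
import numpy as np

# 6 made-up points: roughly linear plus noise.
rng = np.random.default_rng(0)
x = np.arange(6.0)                          # x = 0, 1, ..., 5
y = 0.5 * x + rng.normal(0, 0.3, size=6)

line = np.polyfit(x, y, 1)      # 2 parameters: slope and intercept
wiggle = np.polyfit(x, y, 5)    # 6 parameters: one per data point

train_err_line = np.sum((np.polyval(line, x) - y) ** 2)
train_err_wiggle = np.sum((np.polyval(wiggle, x) - y) ** 2)
print(train_err_line, train_err_wiggle)   # the wiggly fit is (near) perfect

# Between the data points, though, the wiggly fit can stray from the trend:
print(np.polyval(wiggle, 4.5), np.polyval(line, 4.5))
```

The degree-5 fit "wins" on the training data by construction; the test set is what exposes it.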
Overfitting
• Fitting non-explanatory variables to data
• Overfitting is the result of
– Including too many predictor variables
– Failing to regularize the model
  · Neural net run too long
  · Decision tree too deep
Avoiding Overfitting
• Avoiding overfitting is a balancing act
– Fit fewer variables rather than more
– Have a reason for including a variable (other than that it is in the database)
– Regularize (don’t overtrain)
– Know your field

All models should be as simple as possible but no simpler than necessary
--- Albert Einstein
Evaluate the Model
• Accuracy
– Error rate
– Proportion of explained variation
• Significance
– Statistical
– Reasonableness
– Sensitivity
– Compute value of decisions
  · The “so what” test
Simple Validation
• Method: split the data into a training set and a testing set. A third set for validation may also be used
• Advantages: easy to use and understand. Good estimate of prediction error for reasonably large data sets
• Disadvantages: lose 20%-30% of the data from model building
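The split itself is a one-liner plus bookkeeping. A minimal sketch (the 25% test fraction and seed are arbitrary choices, not from the course):

```python
import random

# Simple validation: shuffle, then hold out a fraction of the rows for testing.
def train_test_split(rows, test_frac=0.25, seed=42):
    rows = rows[:]                         # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # seeded for reproducibility
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]    # (train, test)

data = list(range(100))                    # stand-in for 100 data rows
train, test = train_test_split(data)
print(len(train), len(test))               # 75 25
```

Every row lands in exactly one of the two sets; the model never sees the test rows until evaluation.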
Train vs. Test Data Sets
[Figure: the data set partitioned into training and test subsets]
N-fold Cross Validation
• If you don’t have a large amount of data, build a model using all the available data
– What is the error rate for the model?
• Divide the data into N equal-sized groups and build a model on the data with one group left out
[Figure: the data divided into 10 groups, with one group held out]
N-fold Cross Validation
• The held-out group is predicted and a prediction error rate is calculated
• This is repeated for each group in turn, and the average over all N repeats is used as the model error rate
• Advantages: good for small data sets; uses all the data to calculate the prediction error rate
• Disadvantages: lots of computing
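The procedure above can be sketched generically. Here `fit` and `error` are placeholders for any modeling procedure (the toy "model" below is just the training mean, an assumption for illustration):

```python
# N-fold cross-validation: each group sits out once; a model is fit on the
# remaining groups and scored on the held-out group; the N scores are averaged.
def n_fold_cv(rows, n, fit, error):
    folds = [rows[i::n] for i in range(n)]       # n roughly equal groups
    errs = []
    for i in range(n):
        held_out = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        errs.append(error(model, held_out))
    return sum(errs) / n

# Toy use: the "model" is the mean of the training y's,
# and the error is the mean squared deviation on the held-out fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fit = lambda rows: sum(rows) / len(rows)
mse = lambda m, rows: sum((y - m) ** 2 for y in rows) / len(rows)
print(n_fold_cv(data, 5, fit, mse))
```

With n equal to the number of rows this becomes leave-one-out, the expensive extreme the "lots of computing" bullet warns about.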
Regularization
• A model can be built to closely fit the training set but not the real data.
• Symptom: the errors in the training set are reduced, but increased in the test or validation sets.
• Regularization minimizes the residual sum of squares adjusted for model complexity.
• Accomplished by using a smaller decision tree or by pruning it; in neural nets, by avoiding over-training.
Example #3
• Depression Study data
• Fit a tree to DRP using all the variables
– Continue until the model won’t let you fit any more
• Predict on the test set
Transparent Data Mining Tools
• Visualization
• Regression
– Logistic regression
• Decision trees
• Clustering methods
Black Box Data Mining Tools
• Neural networks
• K nearest neighbor
• K-means
• Support vector machines
• Genetic algorithms (not a modeling tool)
“Toy” Problem
[Figure: eight scatterplots of the response train2$y against individual predictors train2[, i]]
Linear Regression
Term Estimate Std Error t Ratio Prob>|t|
Intercept -0.900 0.482 -1.860 0.063
x1 4.658 0.292 15.950 <.0001
x2 4.685 0.294 15.920 <.0001
x3 -0.040 0.291 -0.140 0.892
x4 9.806 0.298 32.940 <.0001
x5 5.361 0.281 19.090 <.0001
x6 0.369 0.284 1.300 0.194
x7 0.001 0.291 0.000 0.998
x8 -0.110 0.295 -0.370 0.714
x9 0.467 0.301 1.550 0.122
x10 -0.200 0.289 -0.710 0.479
R-squared: 73.5% Train 69.4% Test
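A hedged reconstruction of what a regression like the one above looks like in code. The true generating model of the toy data isn't shown in the slides, so the setup below is hypothetical: only x1, x2, x4, and x5 matter (coefficients chosen to echo the table), and the other six predictors are pure noise. Ordinary least squares then recovers the signal coefficients and leaves the noise variables near zero:

```python
import numpy as np

# Hypothetical stand-in for the toy problem (not the actual course data):
# 10 uniform predictors, of which only x1, x2, x4, x5 carry signal.
rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.uniform(0, 1, size=(n, p))
beta = np.zeros(p)
beta[[0, 1, 3, 4]] = [4.7, 4.7, 9.8, 5.4]        # x1, x2, x4, x5
y = X @ beta + rng.normal(0, 1, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef[1:], 2))   # estimates cluster near the true betas;
                               # the six noise variables land near zero
```

This is the pattern the t-ratios in the table encode: four large, clearly significant coefficients, six indistinguishable from zero.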
Stepwise Regression
Term Estimate Std Error t Ratio Prob>|t|
Intercept -0.625 0.309 -2.019 0.0439
x1 4.619 0.289 15.998 <.0001
x2 4.665 0.292 15.984 <.0001
x4 9.824 0.296 33.176 <.0001
x5 5.366 0.28 19.145 <.0001
R-squared 73.3% on Train 69.8% Test
Stepwise 2
2nd-Order Model
Term Estimate Std Error t Ratio Prob>|t|
Intercept -2.074 0.248 -8.356 0.000
x1 4.352 0.182 23.881 0.000
x2 4.726 0.183 25.786 0.000
x3 -0.503 0.182 -2.769 0.006
(x3-0.48517)*(x3-0.48517) 20.450 0.687 29.755 0.000
x4 9.989 0.186 53.674 0.000
x5 5.185 0.176 29.528 0.000
x9 0.391 0.188 2.084 0.038
(x9-0.51161)*(x9-0.51161) -0.783 0.743 -1.053 0.293
(x1-0.51811)*(x2-0.48354) 8.815 0.634 13.910 0.000
(x1-0.51811)*(x3-0.48517) -1.187 0.648 -1.831 0.067
(x1-0.51811)*(x4-0.49647) 0.925 0.653 1.416 0.157
(x2-0.48354)*(x3-0.48517) -0.626 0.634 -0.988 0.324
R-squared 89.7% Train 88.9% Test
Next Steps
• Higher order terms?
• When to stop?
• Transformations?
• Too simple: underfitting (bias)
• Too complex: inconsistent predictions, overfitting (high variance)
• Selecting models is Occam’s razor
– Keep the goals of interpretation vs. prediction in mind
Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
[Figure: a straight line fit to 0/1 data plotted against Income; the fitted line strays outside [0, 1]]
Example #4
• Depression Study data
• Fit a linear regression to DRP using HAMA 14
Logistic Regression II
• Points on the line can be interpreted as probability, but don’t stay within [0,1]
• Use a sigmoidal function instead of linear function to fit the data
  f(I) = 1 / (1 + e^(-I))

[Figure: the sigmoid function, rising from 0 to 1 as I runs from -10 to 10]
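The sigmoidal function above is one line of code, and its behavior at the extremes is what keeps fitted probabilities inside [0, 1]:

```python
import math

# The logistic sigmoid from the slide: f(I) = 1 / (1 + e^(-I)).
# It maps any real input to a value strictly between 0 and 1.
def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

print(sigmoid(0))      # 0.5 -- the midpoint
print(sigmoid(10))     # close to 1
print(sigmoid(-10))    # close to 0
```

In logistic regression, I is the usual linear predictor (intercept plus coefficients times the x's), so the fitted curve is an S-shape rather than a straight line.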
Logistic Regression III
[Figure: logistic curve fit to Accept (0/1) vs. Income]
Example #5
• Depression Study data
• Fit a logistic regression to DRP using HAMA 14
Regression - Summary
• Often works well
• Easy to use
• Theory gives prediction and confidence intervals
• Key is variable selection with interactions and transformations
• Use logistic regression for binary data
Smoothing – What’s the Trend?
[Figure: bivariate fit of Euro/USD exchange rate vs. time, 1999-2002]
Scatterplot Smoother
[Figure: smoothing spline fit (lambda = 10.44, R-square = 0.907) to Euro/USD vs. time]
Less Smoothing
Usually these smoothers offer a choice of how much smoothing to do.
[Figure: smoothing spline fit with lambda = 0.001478 (R-square = 0.987, SSE = 0.116); a much rougher fit to Euro/USD vs. time]
Example #6
• Fit a linear regression to Euro Rate over time
• Fit a smoothing spline
Draft Lottery 1970
[Figure: draft number vs. day of year, 1970 draft lottery]
Draft Data Smoothed
[Figure: smoothed draft number vs. day of year]
More Dimensions
• Why not smooth using 10 predictors?
– Curse of dimensionality
– With 10 predictors, if we use 10% of each as a neighborhood, how many points do we need to get 100 points in the cube?
– Conversely, to get 10% of the points, what percentage of each predictor do we need to take?
– We need a new approach
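The arithmetic behind those two questions, with each predictor scaled to [0, 1]:

```python
# Curse of dimensionality with p = 10 predictors on the unit cube.
p = 10

# A neighborhood taking 10% of each predictor covers this fraction of the
# cube -- so getting 100 points inside it takes on the order of 10^12 points.
frac = 0.1 ** p
print(frac)          # 1e-10 of the cube's volume
print(100 / frac)    # ~1e12 points needed overall

# Conversely, a sub-cube holding 10% of uniformly spread points must stretch
# almost 80% of the way along every single axis.
side = 0.1 ** (1 / p)
print(side)          # ~0.794
```

In 10 dimensions a "local" neighborhood is either starved of data or not local at all, which is exactly why the next slide retreats to additive models.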
Additive Model
• Can’t get
  ŷ = f(x1, ..., xp)
• So, simplify to:
  ŷ = f1(x1) + f2(x2) + ... + fp(xp)
• Each of the fi is easy to find
– Scatterplot smoothers
Create New Features
• Instead of the original x’s, use linear combinations
– Principal components
– Factor analysis
– Multidimensional scaling

  z_i = α + b1 x1 + ... + bp xp
How To Find Features
• If you have a response variable, the question may change.
• What are interesting directions in the predictors?
– High-variance directions in X: PCA
– High covariance with Y: PLS
– High correlation with Y: OLS
– Directions whose smooth is correlated with Y: PPR
Principal Components
• First direction has maximum variance
• Second direction has maximum variance among all directions perpendicular to the first
• Repeat until there are as many directions as original variables
• Reduce to a smaller number
– Multiple approaches to eliminate directions
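The directions described above are the eigenvectors of the covariance matrix, ordered by eigenvalue. A minimal sketch on made-up data (two strongly correlated variables, so the first component should soak up nearly all the variance):

```python
import numpy as np

# Made-up data: two variables that are both noisy copies of a common signal t.
rng = np.random.default_rng(2)
t = rng.normal(size=200)
X = np.column_stack([t + 0.1 * rng.normal(size=200),
                     t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                   # center each column
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())            # share of variance per component
scores = Xc @ eigvecs[:, 0]               # data projected onto the first PC
```

Dropping the low-variance directions is the dimension reduction: here one score per row replaces two correlated measurements with little loss.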
First Principal Component
[Figure: scatterplot of Disp. vs. Weight with the first principal component direction drawn through it]
When Does This Work Well?
• When you have a group of highly correlated predictor variables
– Census information
– History of past giving
– 10 temperature sensors
Biplot
[Figure: biplot of the first two principal components for 100 observations; the ten temperature sensors (Temp.1-Temp.10) load together in one direction, while Flow1 and Flow2 point in other directions]
Advantages of Projection
• Interpretation
• Dimension reduction
• Able to have more predictors than observations
Disadvantages
• Lack of interpretation
• Linear
Going Non-linear
• The features are all linear:
  ŷ = b0 + b1 z1 + ... + bk zk
• But you could also use them in an additive model:
  ŷ = f1(z1) + f2(z2) + ... + fp(zp)
Examples
• If the f’s are arbitrary, we have projection pursuit regression
• If the f’s are sigmoidal, we have a neural network
– The z’s are the hidden nodes
– The s’s are the activation functions
– The b’s are the weights

  ŷ = α + b1 s1(z1) + b2 s2(z2) + ... + bp sp(zp)
Neural Nets
• Don’t resemble the brain
– They are a statistical model
– Closest relative is projection pursuit regression
History
• Biology
– Neurode (McCulloch & Pitts, 1943)
– Theory of learning (Hebb, 1949)
• Computer science
– Perceptron (Rosenblatt, 1958); Adaline (Widrow, 1960)
– Perceptrons (Minsky & Papert, 1969)
– Neural nets (Rumelhart and others, 1986)
A Single Neuron
[Diagram: inputs x0, x1, ..., x5 feed a single node with weights 0.8, 0.3, 0.7, -0.2, 0.4, -0.5]
  z1 = 0.8 + 0.3x1 + 0.7x2 - 0.2x3 + 0.4x4 - 0.5x5, output h(z1)
Single Node
Input to “hidden node” l:
  z_l = Σ_j w_jl x_j + θ_l
Output (input to the outer layer):
  ŷ_kl = h_k(z_l)
Layered Architecture
[Diagram: input layer (x1, x2) feeds hidden layer (z1, z2, z3), which feeds the output layer (y)]
Neural Networks
Create lots of features – hidden nodes:
  z_l = Σ_j w_jl x_j + θ_l
Use them in an additive model:
  ŷ = w21 h(z1) + w22 h(z2) + ... + θ2
Put It Together
  ŷ = h2( Σ_l w̃_kl h1( Σ_j w_jl x_j + θ_l ) + θ_k )
The resulting model is just a flexible non- linear regression of the response on a set of predictor variables.
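That "flexible non-linear regression" view is easy to make concrete. A sketch of the forward pass for one hidden layer, matching the layered diagram (2 inputs, 3 hidden nodes, 1 output); all the weights below are made up, and the output activation h2 is taken as the identity:

```python
import math

# One-hidden-layer net: hidden node l computes sigmoid(sum_j w_jl x_j + theta_l),
# and the output is a linear combination of the hidden values plus a bias.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_net(x, hidden_w, hidden_b, out_w, out_b):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b

# Made-up weights: 2 inputs -> 3 hidden nodes -> 1 output.
hidden_w = [[0.3, 0.7], [-0.2, 0.4], [0.5, -0.5]]
hidden_b = [0.8, 0.0, -0.1]
out_w = [1.0, -2.0, 0.5]
out_b = 0.2
print(neural_net([1.0, 2.0], hidden_w, hidden_b, out_w, out_b))
```

Training is then nothing more exotic than adjusting these weights to reduce the residual sum of squares, which is what the next slides walk through.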
Running a Neural Net
Predictions for Example
R-squared 89.5% Train, 87.7% Test
What Does This Get Us?
• Enormous flexibility
• Ability to fit anything
– Including noise
– Not just the elephant – the whole herd!
• Interpretation?
Example #7
• Fit a neural net to the “toy problem” data.
• Look at the profiler.
• How does it differ from the full regression model?
Running a Training Session
• Initialize weights
– Set range
– Random initialization
– With weights from a previous training session
• An epoch is one pass through every row in the data set
– Can be in random or fixed order
Training the Neural Net
[Figure: training set and test set error (RSS) as a function of training epochs]
Stopping Rules
• Error (RMS) threshold
• Limit (early stopping rule)
– Time
– Epochs
• Error rate-of-change threshold
– E.g., no change in RMS error in 100 epochs
• Minimize error + complexity (weight decay)
– De Veaux et al., Technometrics, 1998
Neural Net Pro
• Advantages
– Handles continuous or discrete values
– Complex interactions
– In general, highly accurate for fitting due to flexibility of the model
– Can incorporate known relationships
  · So-called grey box models
  · See De Veaux et al., Environmetrics, 1999
Neural Net Con
• Disadvantages
– Model is not descriptive (black box)
– Difficult, complex architectures
– Slow model building
– Categorical data explosion
– Sensitive to input variable selection
[Diagram: credit-risk tree splitting on Household Income > $40,000, Debt > $10,000, and On Job > 1 Yr, with leaf probabilities .01, .03, .07, and .04]
Decision Trees
Determining Credit Risk
• 11,000 cases of loan history
– 10,000 cases in training set
  · 7,500 good risks
  · 2,500 bad risks
– 1,000 cases in test set
• Data available
– Predictor variables
  · Income: continuous
  · Years at job: continuous
  · Debt: categorical (High, Low)
– Response variable
  · Good risk: categorical (Yes, No)
Find the First Split
[Diagram: root node (7,500 Y / 2,500 N) splits on Income at $40K into 1,500 Y / 1,500 N (< $40K) and 6,000 Y / 1,000 N (> $40K)]
Find an Unsplit Node and Split It
[Diagram: the < $40K node splits on Job Years at 5, separating 1,400 Y / 100 N from 100 Y / 1,400 N]
Find an Unsplit Node and Split It
[Diagram: the > $40K node splits on Debt into 6,000 Y / 0 N (Low) and 0 Y / 1,000 N (High)]
Class Assignment
• The tree is applied to new data to classify it
• A case or instance is assigned to the largest (or modal) class in the leaf to which it goes
• Example: all cases arriving at the leaf with 1,400 Y / 100 N would be given the value “Yes”
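Applying a fitted tree is just nested if/else rules. A sketch of the credit-risk tree from these slides, with one assumption made explicit: the slides don't say which Job Years branch holds the 1,400-Y leaf, so I've assumed it is the "> 5 years" branch.

```python
# The credit-risk tree as nested rules. Split values come from the slides;
# the "> 5 years is the good-risk branch" assignment is an assumption.
def classify(income, job_years, debt):
    if income > 40_000:
        return "Yes" if debt == "Low" else "No"    # leaves: 6,000Y/0N vs 0Y/1,000N
    else:
        return "Yes" if job_years > 5 else "No"    # leaves: 1,400Y/100N vs 100Y/1,400N

print(classify(50_000, 2, "Low"))    # Yes
print(classify(50_000, 2, "High"))   # No
print(classify(30_000, 10, "High"))  # Yes
```

Each new case simply falls down the splits until it lands in a leaf and takes that leaf's majority class.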
Tree Algorithms
• CART (Breiman, Friedman, Olshen, Stone)
• C4.5, C5.0, Cubist (Quinlan)
• CHAID
• SLIQ (IBM)
• QUEST (SPSS)
Decision Trees
• Find the split in a predictor variable that best separates the data into homogeneous groups
• Build the tree inductively, basing future splits on past choices (greedy algorithm)
• Classification trees (categorical response)
• Regression trees (continuous response)
• Size of tree often determined by cross-validation
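The greedy split search in the first bullet can be sketched for one numeric predictor. Gini impurity is used here as the homogeneity measure (one common choice, e.g. in CART; the data below are made up):

```python
# Greedy split search for one numeric predictor: try every threshold and
# keep the one minimizing the weighted Gini impurity of the two child groups.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    best = (None, float("inf"))              # (threshold, weighted impurity)
    for t in sorted(set(xs))[:-1]:           # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up example: income in $K vs. good-risk label.
# Splitting at 38 separates the classes perfectly (impurity 0).
income = [25, 30, 35, 38, 45, 50, 60, 80]
risk   = ["N", "N", "N", "N", "Y", "Y", "Y", "Y"]
print(best_split(income, risk))
```

A full tree builder just applies this search to every predictor at every node and recurses; that repetition is the "greedy algorithm" of the bullet above.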
Geometry of Decision Trees
[Figure: Y’s and N’s plotted in the (Household Income, Debt) plane; the tree’s splits partition the plane into rectangles]
Two Way Tables -- Titanic

                     Ticket Class
             Crew   First   Second   Third   Total
Survival
  Lived       212     202      118     178     710
  Died        673     123      167     528    1491
  Total       885     325      285     706    2201

[Figure: bar chart of survivors and non-survivors by class]
Mosaic Plot and Tree Diagram
[Figure: mosaic plot and tree diagram of Titanic survival, splitting on sex, adult vs. child, and ticket class; leaf survival rates range from 14% to 100%]
Regression Tree
[Figure: regression tree for the cars data, splitting on Price < 9446.5, Weight < 2280, Disp. < 134, Weight < 3637.5, Price < 11522, Reliability, and HP < 154; leaf mean mileages range from 18.6 to 34.0]