
(1)

JSM 2-day Course, San Francisco, August 2-3, 2003

Successful Data Mining in Practice:

Where do we Start?

Richard D. De Veaux

Department of Mathematics and Statistics

Williams College

Williamstown, MA 01267

deveaux@williams.edu

http://www.williams.edu/Mathematics/rdeveaux

(2)

Outline

•What is it?

•Why is it different?

•Types of models

•How to start

•Where do we go next?

•Challenges

(3)


Reason for Data Mining

Data = $$


(4)

Data Mining Is…

“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad

“finding interesting structure (patterns, statistical models, relationships) in databases.” --- Fayyad, Chaudhuri and Bradley

“a knowledge discovery process of extracting previously unknown, actionable information from very large databases” --- Zornes

“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.”

(5)


What is Data Mining?

(6)

Paralyzed Veterans of America

KDD 1998 cup

Mailing list of 3.5 million potential donors

Lapsed donors

¾Made their last donation to PVA 13 to 24 months prior to June 1997

¾200,000 (training and test sets)

Who should get the current mailing?

Cost effective strategy?
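One way to make the “cost-effective strategy” question concrete is an expected-value rule: mail a donor only when the predicted probability of a gift times the predicted gift size exceeds the cost of the mail piece. A minimal Python sketch with made-up probabilities, gift sizes, and an assumed $0.68 mailing cost (none of these figures come from the slides):

```python
import numpy as np

# Hypothetical per-donor predictions; in practice these come from models fit on
# the 200,000-record training/test data.
rng = np.random.default_rng(0)
p_respond = rng.uniform(0.01, 0.10, size=100_000)   # predicted P(donate)
exp_gift = rng.uniform(5, 20, size=100_000)         # predicted gift size if they donate
mail_cost = 0.68                                     # assumed cost per mail piece

expected_profit = p_respond * exp_gift - mail_cost
mail = expected_profit > 0                           # mail only when expected value is positive

print(f"mail {mail.sum():,} of {mail.size:,} donors; "
      f"expected net donation ${expected_profit[mail].sum():,.0f}")
```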

(7)


Results for PVA Data Set

If the entire list (100,000 donors) is mailed, the net donation is $10,500

Using data mining techniques, this was increased by 41.37%

(8)

KDD CUP 98 Results

(9)


KDD CUP 98 Results 2

(10)

Why Is This Hard?

Size of Data Set

Signal/Noise ratio

Example #1 – PVA on

(11)


Why Is It Taking Off Now?

Because we can

¾Computer power

¾The price of digital storage is near zero

Data warehouses already built

¾Companies want return on data investment

(12)

What’s Different?

Users

¾Domain experts, not statisticians

¾Have too much data

¾Want automatic methods

¾Want useful information

Problem size

¾Number of rows

¾Number of variables

(13)


Data Mining Data Sets

Massive amounts of data

UPS

¾16 TB -- comparable to the Library of Congress

¾Mostly tracking

Low signal to noise

¾Many irrelevant variables

¾Subtle relationships

¾Variation

(14)

Financial Applications

Credit assessment

¾Is this loan application a good credit risk?

¾Who is likely to declare bankruptcy?

Financial performance

¾What should the portfolio product mix be?

(15)


Manufacturing Applications

Product reliability and quality control

Process control

¾What can I do to improve batch yields?

Warranty analysis

¾Product problems

¾Fraud

¾Service assessment

(16)

Medical Applications

Medical procedure effectiveness

¾Who are good candidates for surgery?

Physician effectiveness

¾Which tests are ineffective?

¾Which physicians are likely to over-prescribe treatments?

¾What combinations of tests are most effective?

(17)


E-commerce

Automatic web page design

Recommendations for new purchases

Cross selling

(18)

Pharmaceutical Applications

Clinical trial databases

Combine clinical trial results with an extensive medical/demographic database to explore:

¾ Prediction of adverse experiences

¾ Who is likely to be non-compliant or drop out?

¾ What are alternative (i.e., non-approved) uses supported by the data?

(19)


Example: Screening Plates

Biological assay

¾Samples are tested for potency

¾8 x 12 arrays of samples

¾Reference compounds included

Questions:

¾Correct for drift

¾Recognize clogged dispensing tips

(20)

Pharmaceutical Applications

High throughput screening

¾Predict actions in assays

¾Predict results in animals or humans

Rational drug design

¾Relating chemical structure with chemical properties

¾Inverse regression to predict chemical properties from desired structure

DNA SNPs (“snips”)

(21)


Pharmaceutical Applications

Genomics

¾Associate genes with diseases

¾Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)

¾Find individuals most susceptible to placebo effect

(22)

Fraud Detection

Identify false:

¾Medical insurance claims

¾Accident insurance claims

Which stock trades are based on insider information?

Whose cell phone number has been stolen?

Which credit card transactions are from stolen cards?

(23)


Case Study I

Ingot cracking

¾953 30,000-lb. ingots

¾20% cracking rate

¾$30,000 per recast

¾90 potential explanatory variables

9Water composition (reduced)

9Metal composition

9Process variables

9Other environmental variables

(24)

Case Study II – Car Insurance

42,800 mature policies

65 potential predictors

¾Tree model found industry, vehicle age, numbers of vehicles, usage and location

(25)


Data Mining and OLAP

On-line analytical processing (OLAP): users deductively analyze data to verify hypothesis

¾ Descriptive, not predictive

Data mining: software uses data to inductively find patterns

¾ Predictive or descriptive

Synergy

¾ OLAP helps users understand data before mining

¾ OLAP helps users evaluate significance and value of patterns

(26)

Data Mining vs. Statistics

Large amount of data

¾Data mining: 1,000,000 rows, 3,000 columns

¾Statistics: 1,000 rows, 30 columns

Data collection

¾Data mining: happenstance data

¾Statistics: designed surveys, experiments

Sample?

¾Data mining: why bother? We have big, parallel computers

¾Statistics: you bet! We even get error estimates

Reasonable price for software

¾Data mining: $1,000,000 a year

¾Statistics: $599 with coupon from Amstat News

Presentation medium

¾Data mining: PowerPoint, what else?

¾Statistics: overhead foils, of course!

Nice place for a meeting

¾Indianapolis in August, Dallas

(27)


Data Mining vs. Statistics

Data mining

¾Flexible models

¾Prediction often most important

¾Computation matters

¾Variable selection and overfitting are problems

Statistics

¾Particular model and error structure

¾Understanding, confidence intervals

¾Computation not critical

¾Variable selection and model selection are still problems

(28)

What’s the Same?

George Box

¾All models are wrong, but some are useful

¾Statisticians, like artists, have the bad habit of falling in love with their models

The model is no better than the data

Twyman’s law

¾If it looks interesting, it’s probably wrong

De Veaux’s corollary

¾If it’s not wrong, it’s probably obvious

(29)


Knowledge Discovery Process

Define business problem

Build data mining database

Explore data

Prepare data for modeling

Build model

Evaluate model

Deploy model and results

Note: This process model borrows from CRISP-DM: the CRoss Industry Standard Process for Data Mining

(30)

Data Mining Myths

Find answers to unasked questions

Continuously monitor your data base for interesting patterns

Eliminate the need to understand your business

Eliminate the need to collect good data

Eliminate the need to have good data analysis skills

(31)


Beer and Diapers

Made up story?

Unrepeatable -- Happened once.

Lessons learned?

Imagine being able to see nobody coming down the road, and at such a distance

De Veaux’s theory of evolution

(32)

Beer and Diapers

(33)


Successful Data Mining

The keys to success:

¾Formulating the problem

¾Using the right data

¾Flexibility in modeling

¾Acting on results

Success depends more on the way you mine the data than on the specific tool

(34)

Types of Models

Descriptions

Classification (categorical or discrete values)

Regression (continuous values)

¾Time series (continuous values)

Clustering

Association

(35)


Data Preparation

Build data mining database

Explore data

Prepare data for modeling

60% to 95% of the time is spent preparing the data


(36)

Data Challenges

Data definitions

¾ Types of variables

Data consolidation

¾ Combine data from different sources

¾ NASA Mars lander

Data heterogeneity

¾ Homonyms

¾ Synonyms

Data quality

(37)


Data Quality

(38)

Missing Values

Random missing values

¾Delete row?

9Paralyzed Veterans

¾Substitute value

9Imputation

9Multiple Imputation

Systematic missing data

¾Now what?
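As a rough illustration of the options above, here is a minimal pandas/scikit-learn sketch: drop incomplete rows, impute a single value, or keep a missingness flag when the holes may be systematic. The tiny data frame is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny invented data frame; 'income' has holes.
df = pd.DataFrame({"age": [34, 51, 28, 45, 62],
                   "income": [40_000, np.nan, 55_000, np.nan, 80_000]})

complete_cases = df.dropna()                              # delete rows (risky if not random)

imputer = SimpleImputer(strategy="mean")                  # single imputation with the mean
df["income_imputed"] = imputer.fit_transform(df[["income"]]).ravel()

df["income_missing"] = df["income"].isna().astype(int)    # flag possibly systematic missingness
print(df)
```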

(39)


Missing Values -- Systematic

Ann Landers: 90% of parents said they wouldn’t do it again!!

Wharton Ph.D. Student questionnaire on survey attitudes

Bowdoin College applicants have a mean SAT verbal score above 750

(40)

The Depression Study

Designed to study antidepressant efficacy

¾Measured via Hamilton Rating Scale

Side effects

¾Sexual dysfunction

¾Misc safety and tolerability issues

Late '97 and early '98.

692 patients

Two antidepressants + placebo

(41)


The Data

Background info

¾ Age

¾ Sex

Each received either

¾ Placebo

¾ Antidepressant 1

¾ Antidepressant 2

Dosages

At time points 7 and 14 days we also have:

¾ Depression scores

¾ Sexual dysfunction indicators

¾ Response indicators

(42)

Example #2

Depression Study data

Examine data for missing values

(43)


Build Data Mining Database

Collect data

Describe data

Select data

Build metadata

Integrate data

Clean data

Load the data mining database

Maintain the data mining database

(44)

Data Warehouse Architecture

Reference: Data Warehouse: From Architecture to Implementation, by Barry Devlin, Addison-Wesley, 1997

Three tier data architecture

¾Source data

¾Business data warehouse (BDW): the

reconciled data that serves as a system of record

¾Business information warehouse (BIW): the data warehouse you use

(45)


Data Mining BIW

[Diagram: source data feeds the business data warehouse (Business DW), which in turn feeds a geographic BIW, a subject BIW, and the data mining BIW.]

(46)

Metadata

The data survey describes the data set contents and characteristics

¾Table name

¾Description

¾Primary key/foreign key relationships

¾Collection information: how, where, conditions

¾Timeframe: daily, weekly, monthly

¾Cosynchronous: every Monday or Tuesday

(47)


Relational Data Bases

Data are stored in tables

Items

ItemID   ItemName    Price
C56621   top hat     34.95
T35691   cane         4.99
RS5292   red shoes   22.95

Shoppers

PersonID   PersonName   ZIPCode   ItemBought
135366     Lyle         19103     T35691
135366     Lyle         19103     C56621
259835     Dick         01267     RS5292

(48)

RDBMS Characteristics

Advantages

¾All major DBMSs are relational

¾Flexible data structure

¾Standard language

¾Many applications can directly access RDBMSs

Disadvantages

¾May be slow for data mining

¾Physical storage required

¾Database administration overhead

(49)


Data Selection

Compute time is determined by the number of cases (rows), the number of variables (columns), and the number of distinct values for categorical variables

¾Reducing the number of variables

¾Sampling rows

An extraneous column can result in overfitting your data

¾e.g., employee ID as a predictor of credit risk

(50)

Sampling Is Ubiquitous

The database itself is almost certainly a sample of some population

Most model building techniques require separating the data into training and

testing samples

(51)


Model Building

Model building

¾Train

¾Test

Evaluate

(52)

Overfitting in Regression

Classical overfitting:

¾Fit 6th order polynomial to 6 data points

[Plot: six data points on roughly a straight line, with a high-order polynomial passing exactly through all of them.]
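A small numpy sketch of the same idea, with invented points: a degree-5 polynomial already interpolates six points exactly, driving the training error to zero while behaving wildly between the points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 3.5, 6)
y = 1 + 1.2 * x + rng.normal(scale=0.3, size=x.size)   # the truth is a noisy straight line

line = np.polyfit(x, y, deg=1)     # sensible model
wiggle = np.polyfit(x, y, deg=5)   # interpolates all 6 points: zero training error

grid = np.linspace(0, 3.5, 200)
gap = np.abs(np.polyval(wiggle, grid) - np.polyval(line, grid)).max()
print(f"largest gap between the two fits, away from the data points: {gap:.2f}")
```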

(53)


Overfitting

Fitting non-explanatory variables to data

Overfitting is the result of

¾Including too many predictor variables

¾Lack of regularization

9Neural net run too long

9Decision tree too deep

(54)

Avoiding Overfitting

Avoiding overfitting is a balancing act

¾Fit fewer variables rather than more

¾Have a reason for including a variable (other than it is in the database)

¾Regularize (don’t overtrain)

¾Know your field.

All models should be as simple as possible but no simpler than necessary

Albert Einstein


(55)


Evaluate the Model

Accuracy

¾Error rate

¾Proportion of explained variation

Significance

¾Statistical

¾Reasonableness

¾Sensitivity

¾Compute value of decisions 9The “so what” test

(56)

Simple Validation

Method: split the data into a training set and a testing set. A third data set may also be held out for validation

Advantages: easy to use and understand. Gives a good estimate of prediction error for reasonably large data sets

Disadvantages: up to 20%-30% of the data is lost from model building
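A minimal scikit-learn sketch of this split, using simulated data rather than anything from the course:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5_000, n_features=10, noise=5.0, random_state=0)

# Hold out 25% of the rows for testing; a further split could serve as validation data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 train:", round(model.score(X_train, y_train), 3),
      " R^2 test:", round(model.score(X_test, y_test), 3))
```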

(57)


Train vs. Test Data Sets

Test

Train

(58)

N-fold Cross Validation

If you don’t have a large amount of data, build a model using all the available data.

¾What is the error rate for the model?

Divide the data into N equal sized groups and build a model on the data with one group left out.

1 2 3 4 5 6 7 8 9 10

(59)


N-fold Cross Validation

The missing group is predicted and a prediction error rate is calculated

This is repeated for each group in turn, and the average over all N repeats is used as the model error rate

Advantages: good for small data sets. Uses all data to calculate prediction error rate

Disadvantages: lots of computing
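The same idea with scikit-learn's cross_val_score, again on simulated data; each of the 10 folds is held out once and the scores are averaged:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Each of the 10 folds is left out once, predicted, and scored; the average
# over the folds estimates the model's prediction error.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("mean R^2 over 10 folds:", scores.mean().round(3))
```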

(60)

Regularization

A model can be built to closely fit the training set but not the real data.

Symptom: the errors in the training set are reduced, but increased in the test or validation sets.

Regularization minimizes the residual sum of squares adjusted for model complexity.

This is accomplished by growing a smaller decision tree or by pruning it; in neural nets, by avoiding over-training.
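A small sketch of regularizing a tree by limiting its size (simulated data; the depth limit stands in for pruning):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2_000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)                # grown until leaves are pure
small = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)  # regularized by depth

for name, m in [("unrestricted tree", deep), ("depth-4 tree", small)]:
    print(f"{name}: train R^2 {m.score(X_tr, y_tr):.2f}, test R^2 {m.score(X_te, y_te):.2f}")
```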

(61)


Example #3

Depression Study data

Fit a tree to DRP using all the variables

¾Continue until the model won’t let you fit any more

Predict on the test set

(62)

Transparent Data Mining Tools

Visualization

Regression

¾ Logistic regression

Decision trees

Clustering methods

(63)


Black Box Data Mining Tools

Neural networks

K nearest neighbor

K-means

Support vector machines

Genetic algorithms (not a modeling tool)

(64)

“Toy” Problem

[Panel of scatterplots: train2$y (roughly 5 to 25) plotted against each predictor train2[, i] (0 to 1) for the toy training data.]

(65)


Linear Regression

Term Estimate Std Error t Ratio Prob>|t|

Intercept -0.900 0.482 -1.860 0.063

x1 4.658 0.292 15.950 <.0001

x2 4.685 0.294 15.920 <.0001

x3 -0.040 0.291 -0.140 0.892

x4 9.806 0.298 32.940 <.0001

x5 5.361 0.281 19.090 <.0001

x6 0.369 0.284 1.300 0.194

x7 0.001 0.291 0.000 0.998

x8 -0.110 0.295 -0.370 0.714

x9 0.467 0.301 1.550 0.122

x10 -0.200 0.289 -0.710 0.479

R-squared: 73.5% Train 69.4% Test

(66)

Stepwise Regression

Term Estimate Std Error t Ratio Prob>|t|

Intercept -0.625 0.309 -2.019 0.0439

x1 4.619 0.289 15.998 <.0001

x2 4.665 0.292 15.984 <.0001

x4 9.824 0.296 33.176 <.0001

x5 5.366 0.28 19.145 <.0001

R-squared 73.3% on Train 69.8% Test
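Classical stepwise selection is not in scikit-learn, but greedy forward selection scored by cross-validation plays the same role. A sketch on invented data whose true signal sits in columns 0, 1, 3, and 4, loosely mimicking the toy problem:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))
y = 4.6 * X[:, 0] + 4.7 * X[:, 1] + 9.8 * X[:, 3] + 5.4 * X[:, 4] + rng.normal(size=500)

# Greedy forward selection scored by cross-validation, a stand-in for stepwise regression.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected columns:", np.flatnonzero(sfs.get_support()))   # expect [0 1 3 4]
```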

(67)


Stepwise 2nd-Order Model

Term Estimate Std Error t Ratio Prob>|t|

Intercept -2.074 0.248 -8.356 0.000

x1 4.352 0.182 23.881 0.000

x2 4.726 0.183 25.786 0.000

x3 -0.503 0.182 -2.769 0.006

(x3-0.48517)*(x3-0.48517) 20.450 0.687 29.755 0.000

x4 9.989 0.186 53.674 0.000

x5 5.185 0.176 29.528 0.000

x9 0.391 0.188 2.084 0.038

(x9-0.51161)*(x9-0.51161) -0.783 0.743 -1.053 0.293

(x1-0.51811)*(x2-0.48354) 8.815 0.634 13.910 0.000

(x1-0.51811)*(x3-0.48517) -1.187 0.648 -1.831 0.067

(x1-0.51811)*(x4-0.49647) 0.925 0.653 1.416 0.157

(x2-0.48354)*(x3-0.48517) -0.626 0.634 -0.988 0.324

R-squared 89.7% Train 88.9% Test

(68)

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting – bias

Too complex: inconsistent predictions, overfitting – high variance

Selecting models is Occam’s razor

¾Keep goals of interpretation vs. prediction in mind

(69)


Logistic Regression

What happens if we use linear regression on 1-0 (yes/no) data?

[Plot: 0/1 responses plotted against Income, $20,000 to $80,000.]

(70)

Example #4

Depression Study data

Fit a linear regression to DRP using HAMA 14

(71)


Logistic Regression II

Points on the line can be interpreted as probability, but don’t stay within [0,1]

Use a sigmoidal function instead of linear function to fit the data

f(I) = 1 / (1 + e^(-I))

[Plot: the sigmoidal curve rising from 0 to 1 as I runs from -10 to 10.]

(72)

Logistic Regression III

[Plot: Accept (0/1) against Income ($20,000 to $80,000), with the fitted logistic curve staying between 0 and 1.]

(73)


Example #5

Depression Study data

Fit a logistic regression to DRP using HAMA 14

(74)

Regression - Summary

Often works well

Easy to use

Theory gives prediction and confidence intervals

Key is variable selection with interactions and transformations

Use logistic regression for binary data
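A minimal logistic-regression sketch on invented income/accept data (the income scale and the acceptance curve are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
income = rng.uniform(20, 80, size=1_000)                 # income in $1,000s
p_accept = 1 / (1 + np.exp(-(income - 50) / 8))          # assumed true acceptance curve
accept = rng.binomial(1, p_accept)

model = LogisticRegression(max_iter=1_000).fit(income.reshape(-1, 1), accept)
print("P(accept | income = $60,000) =",
      model.predict_proba([[60]])[0, 1].round(2))
```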

(75)


Smoothing – What's the Trend?

[Plot: bivariate fit of Euro/USD (roughly 0.85 to 1.2) by Time, 1999-2002.]

(76)

Scatterplot Smoother

[Plot: smoothing spline fit to Euro/USD by Time, lambda = 10.44, R-square 0.907.]

(77)


Less Smoothing

Usually these smoothers have choices on how much smoothing

[Plot: smoothing spline fit to Euro/USD by Time with a much smaller lambda (0.0015); R-square 0.987, sum of squared errors 0.116. The curve follows the data far more closely.]

(78)

Example #6

Fit a linear regression to Euro Rate over time

Fit a smoothing spline
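A sketch of Example #6 in Python, with scipy's UnivariateSpline standing in for JMP's smoothing spline; the exchange-rate series is simulated, and the smoothing factor s plays the role of lambda:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Simulated daily exchange-rate series standing in for the Euro/USD data.
t = np.linspace(1999, 2003, 1_000)
rate = (1.05 - 0.15 * np.sin(2 * np.pi * (t - 1999) / 4)
        + np.random.default_rng(0).normal(scale=0.02, size=t.size))

trend = np.polyfit(t, rate, deg=1)            # straight-line trend
smooth = UnivariateSpline(t, rate, s=1.0)     # heavy smoothing (large "lambda")
wiggly = UnivariateSpline(t, rate, s=0.05)    # light smoothing, follows the data more closely

print("fitted slope per year:", round(trend[0], 3))
print("smoothed value at mid-2001:", round(float(smooth(2001.5)), 3))
```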

(79)


Draft Lottery 1970

[Scatterplot: 1970 draft lottery number against day of year.]

(80)

Draft Data Smoothed

[The same scatterplot with a scatterplot smoother overlaid.]

(81)


More Dimensions

Why not smooth using 10 predictors?

¾Curse of dimensionality

¾With 10 predictors, if we use 10% of the range of each as a neighborhood, how many points do we need to get 100 points in the cube?

¾Conversely, to get 10% of the points, what percentage of each predictor's range do we need to take? (See the arithmetic sketch below.)

¾Need a new approach
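The arithmetic behind those two questions, in a few lines of Python:

```python
p = 10                              # number of predictors
side = 0.10                         # use 10% of each predictor's range as the neighbourhood

fraction_covered = side ** p        # fraction of the unit cube the neighbourhood covers
print(f"points needed for 100 inside the cube: {100 / fraction_covered:.0e}")   # 1e+12

needed_side = 0.10 ** (1 / p)       # side length that captures 10% of uniform points
print(f"to capture 10% of the points, take {needed_side:.0%} of each predictor's range")
```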

(82)

Additive Model

Can't get

ŷ = f(x1, ..., xp)

So, simplify to:

ŷ = f1(x1) + f2(x2) + ... + fp(xp)

Each of the fi is easy to find

¾Scatterplot smoothers

(83)


Create New Features

Instead of original x’s use linear combinations

¾ Principal components

¾ Factor analysis

¾ Multidimensional scaling

zi = α + b1 x1 + ... + bp xp

(84)

How To Find Features

If you have a response variable, the question may change.

What are interesting directions in the predictors?

¾High variance directions in X - PCA

¾High covariance with Y -- PLS

¾High correlation with Y -- OLS

¾Directions whose smooth is correlated with y - PPR

(85)


Principal Components

First direction has maximum variance

Second direction has maximum variance of all directions perpendicular to first

Repeat until there are as many directions as original variables

Reduce to a smaller number

¾Multiple approaches to eliminate directions
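A minimal scikit-learn PCA sketch on invented data in the spirit of the “10 temperature sensors” example: ten readings that share one underlying signal collapse onto a single dominant direction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Ten invented "temperature sensor" readings that share one underlying signal.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.1 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
print("variance explained by each direction:", pca.explained_variance_ratio_.round(2))
# The first direction carries nearly all the variance, so the ten correlated
# sensors reduce to a single feature with little loss.
```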

(86)

First Principal Component

[Scatterplot: Disp. against a second variable, with the first principal component direction overlaid.]

(87)


When Does This Work Well?

When you have a group of highly correlated predictor variables

¾Census information

¾History of past giving

¾10 temperature sensors

(88)

Biplot

[Biplot of the first two principal components (Comp. 1, Comp. 2) for 100 observations: the ten temperature variables Temp.1-Temp.10 point in essentially the same direction, while Flow1 and Flow2 point elsewhere.]

(89)


Advantages of Projection

Interpretation

Dimension reduction

Able to have more predictors than observations

(90)

Disadvantages

Lack of interpretation

Linear

(91)


Going Non-linear

The features are all linear:

ŷ = b0 + b1 z1 + ... + bk zk

But you could also use them in an additive model:

ŷ = f1(z1) + f2(z2) + ... + fp(zp)

(92)

Examples

If the f’s are arbitrary, we have projection pursuit regression

If the f’s are sigmoidal we have a neural network

¾The z’s are the hidden nodes

¾The s’s are the activation functions

¾The b’s are the weights

ŷ = α + b1 s1(z1) + b2 s2(z2) + ... + bp sp(zp)

(93)


Neural Nets

Don’t resemble the brain

¾ Are a statistical model

¾ Closest relative is projection pursuit regression

(94)

History

Biology

¾ Neurode (McCulloch & Pitts, 1943)

¾ Theory of learning (Hebb, 1949)

Computer science

¾ Perceptron (Rosenblatt, 1958)

¾ Adaline (Widrow, 1960)

¾ Perceptrons (Minsky & Papert, 1969)

¾ Neural nets (Rumelhart and others, 1986)

(95)


A Single Neuron

[Diagram: inputs x0 through x5, with weights 0.8, 0.3, 0.7, -0.2, 0.4, -0.5, feeding a single node z1 whose output is h(z1).]

z1 = 0.8 + 0.3 x1 + 0.7 x2 - 0.2 x3 + 0.4 x4 - 0.5 x5

(96)

Single Node

Input to the outer layer from “hidden node” l:

z_l = Σ_j w_jl x_j + θ_l

Output:

ŷ_k = h(z_kl)

(97)


Layered Architecture

[Diagram: inputs x1 and x2 in the input layer feed hidden nodes z1, z2, z3 in the hidden layer, which feed the output y in the output layer.]

(98)

Neural Networks

Create lots of features – hidden nodes:

z_l = Σ_j w_jl x_j + θ_l

Use them in an additive model:

ŷ = w_21 h(z_1) + w_22 h(z_2) + ... + θ_2

(99)


Put It Together

ŷ_k = h̃( Σ_l w_kl h( Σ_j w_jl x_j + θ_1l ) + θ_2k )

The resulting model is just a flexible non-linear regression of the response on a set of predictor variables.
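A minimal scikit-learn sketch of such a network: one hidden layer of sigmoidal nodes fit as a flexible nonlinear regression (simulated data; early stopping anticipates the over-training discussion below).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=2_000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer of sigmoidal nodes; early_stopping holds back part of the
# training data and stops when its error no longer improves.
net = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                   early_stopping=True, max_iter=2_000, random_state=0)
net.fit(X_tr, y_tr)
print("R^2 train:", round(net.score(X_tr, y_tr), 2),
      " R^2 test:", round(net.score(X_te, y_te), 2))
```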

(100)

Running a Neural Net

(101)


Predictions for Example

R2 89.5% Train 87.7% Test

(102)

What Does This Get Us?

Enormous flexibility

Ability to fit anything

¾Including noise

¾Not just the elephant – the whole herd!

Interpretation?

(103)


Example #7

Fit a neural net to the “toy problem” data.

Look at the profiler.

How does it differ from the full regression model?

(104)

Running a Training Session

Initialize weights

¾Set range

¾Random initialization

¾With weights from previous training session

An epoch is one time through every row in data set

¾Can be in random order or fixed order

(105)


Training the Neural Net

Error as a function of training

[Plot: training-set error and test-set error (RSS) against the number of training epochs, 0 to 1000.]

(106)

Stopping Rules

Error (RMS) threshold

Limit -- early stopping rule

¾Time

¾Epochs

Error rate of change threshold

¾e.g., no change in RMS error in 100 epochs

Minimize error + complexity -- weight decay

¾De Veaux et al Technometrics 1998

(107)


Neural Net Pro

Advantages

¾Handles continuous or discrete values

¾Complex interactions

¾In general, highly accurate for fitting due to flexibility of model

¾Can incorporate known relationships

9So-called grey box models

9See De Veaux et al., Environmetrics, 1999

(108)

Neural Net Con

Disadvantages

¾Model is not descriptive (black box)

¾Difficult, complex architectures

¾Slow model building

¾Categorical data explosion

¾Sensitive to input variable selection

(109)


Decision Trees

[Example tree: the root splits on Household Income > $40,000; one branch then splits on Debt > $10,000, the other on On Job > 1 Yr; the four leaves carry estimated probabilities of .01, .03, .07, and .04.]

(110)

Determining Credit Risk

11,000 cases of loan history

¾ 10,000 cases in training set

9 7,500 good risks

9 2,500 bad risks

¾ 1,000 cases in test set

Data available

¾ Predictor variables

9 Income: continuous

9 Years at job: continuous

9 Debt: categorical (High, Low)

¾ Response variable

9 Good risk: categorical (Yes, No)

(111)


Find the First Split

[Tree: the root node (7,500 Y / 2,500 N) splits on Income; the < $40K branch holds 1,500 Y / 1,500 N and the > $40K branch holds 6,000 Y / 1,000 N.]

(112)

Find an Unsplit Node and Split It

[Tree: the Income < $40K node (1,500 Y / 1,500 N) is split on Job Years; more than 5 years gives 1,400 Y / 100 N, fewer than 5 years gives 100 Y / 1,400 N. The Income > $40K node (6,000 Y / 1,000 N) remains unsplit.]

(113)


Find an Unsplit Node and Split It

[Tree: the Income > $40K node (6,000 Y / 1,000 N) is split on Debt; Low debt gives 6,000 Y / 0 N and High debt gives 0 Y / 1,000 N.]

(114)

Class Assignment

The tree is applied to new data to classify it

A case or instance will be assigned to the largest (or modal) class in the leaf to which it goes

Example:

All cases arriving at this node would be given a value of “yes”

1,400 Y 100 N

(115)


Tree Algorithms

CART (Breiman, Friedman, Olshen, Stone)

C4.5, C5.0, Cubist (Quinlan)

CHAID

SLIQ (IBM)

QUEST (SPSS)

(116)

Decision Trees

Find split in predictor variable that best splits data into heterogeneous groups

Build the tree inductively, basing future splits on past choices (greedy algorithm)

Classification trees (categorical response)

Regression tree (continuous response)

Size of tree often determined by cross-validation
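A small scikit-learn sketch in the spirit of the credit-risk example; the data and the rule generating “good risk” are invented, and export_text prints the fitted splits:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented loan-history data with the three predictors from the example.
rng = np.random.default_rng(0)
n = 10_000
loans = pd.DataFrame({"income": rng.uniform(10_000, 100_000, n),
                      "years_at_job": rng.uniform(0, 20, n),
                      "high_debt": rng.integers(0, 2, n)})
good_risk = ((loans["income"] > 40_000) & (loans["high_debt"] == 0)) | \
            ((loans["income"] <= 40_000) & (loans["years_at_job"] > 5))

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(loans, good_risk)
print(export_text(tree, feature_names=list(loans.columns)))
```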

(117)


Geometry of Decision Trees

[Plot: Y and N cases plotted by Debt against Household Income; the tree's splits partition the plane into rectangular regions dominated by one class or the other.]

(118)

Two Way Tables -- Titanic

Ticket Class

Survival    Crew   First   Second   Third   Total
Lived        212     202      118     178     710
Died         673     123      167     528   1,491
Total        885     325      285     706   2,201

[Bar chart: survivors and non-survivors by class (Crew, First, Second, Third).]
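The same table rebuilt in pandas, turning the counts into survival rates by class:

```python
import pandas as pd

# Rebuild the two-way table and turn the counts into survival rates by class.
counts = pd.DataFrame({"Crew": [212, 673], "First": [202, 123],
                       "Second": [118, 167], "Third": [178, 528]},
                      index=["Lived", "Died"])
print((counts.loc["Lived"] / counts.sum()).round(2))
```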

(119)


Mosaic Plot

[Mosaic plot of Titanic survival (Died/Survived) by sex (F/M), age (Adult/Child), and ticket class (1, 2, 3, Crew).]

(120)

Tree Diagram

[Classification tree for Titanic survival: the first split is on sex (M vs. F), with further splits on ticket class (3 vs. 1, 2, Crew; 1st vs. Crew; 1 or Crew vs. 2 or 3) and on age (Child vs. Adult). Leaf survival rates shown include 14%, 23%, 27%, 33%, 46%, 93%, and 100%.]

(121)


Regression Tree

[Regression tree for the car data: splits on Price < 9446.5, Weight < 2280, Disp. < 134, Weight < 3637.5, Price < 11522, Reliability, and HP < 154; the leaf predictions are 34.00, 30.17, 26.22, 24.17, 22.57, 21.67, 20.40, and 18.60.]
