Assessing Model Accuracy
Datamining | MecEn 1
Julie Scholler | B 246 | February 2021
Prediction accuracy
- find the best predictive model
- accurately predict unseen test cases
- machine learning, statistical learning

Interpretability
- find the true model
- understand which inputs affect the outcome, and how
- econometrics, statistical learning
Model assessment

The prediction error is useful to
- choose the best version of a model (inside a family of models)
- help choose between the best models of each family
- give, ultimately, a measure of the quality/performance of the predictions
Context

Starting point
- Outcome measurement $Y$, also called dependent variable, response, or target. In a classification problem, $Y$ takes values in a finite, unordered set (yes/no, survived/died, digit 0-9, etc.)
- Vector of $p$ predictor measurements $X = (X_1, X_2, \dots, X_p)$, also called inputs, regressors, covariates, features, or independent variables
- Model to predict: $Y = f(X) + \varepsilon$, where $\varepsilon$ captures measurement errors and other discrepancies: $E(\varepsilon) = 0$, $\varepsilon$ independent of $X$

Fitting a model
$n$ observations or training data $(x_1, y_1), \dots, (x_n, y_n) \Rightarrow$ prediction model $\hat{f}$
Assessing Model Accuracy

Recall
Suppose we have fit a model $\hat{f}$ to some training data, and let $(x_0, y_0)$ be a test observation drawn from the population. If the true model is $Y = f(X) + \varepsilon$, then

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \operatorname{Var}(\hat{f}(x_0)) + \left[\operatorname{Bias}(\hat{f}(x_0))\right]^2 + \operatorname{Var}(\varepsilon)$$

Typically, as the flexibility of $\hat{f}$ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
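A short sketch of where this decomposition comes from, using $E(\varepsilon) = 0$ and the independence of $\varepsilon$ from $X$ and from the training data (here $y_0 = f(x_0) + \varepsilon$):

```latex
\begin{aligned}
E\big[(y_0 - \hat{f}(x_0))^2\big]
  &= E\big[(f(x_0) + \varepsilon - \hat{f}(x_0))^2\big] \\
  &= E\big[(f(x_0) - \hat{f}(x_0))^2\big] + \operatorname{Var}(\varepsilon)
     \quad \text{(the cross term vanishes since } E(\varepsilon) = 0\text{)} \\
  &= \operatorname{Var}(\hat{f}(x_0)) + \big[E(\hat{f}(x_0)) - f(x_0)\big]^2 + \operatorname{Var}(\varepsilon).
\end{aligned}
```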
Assessing Model Accuracy

Misclassification error rate
Typically, in classification, we measure the performance of $\hat{f}$ using the misclassification error rate:

$$\mathrm{Err} = E\left(\operatorname{Ind}_{\{Y \neq \hat{f}_n(X)\}}\right) \quad \text{or} \quad \frac{1}{N} \sum_{\omega \in \Omega_{\mathrm{pop}}} \operatorname{Ind}_{\{Y(\omega) \neq \hat{f}_n(X(\omega))\}}$$

Problem
- We do not have access to the distribution of $(Y, X)$ or to the whole population.
- What can we do to estimate this prediction error?
First idea: Training error rate

Observations: $(x_i)_{i=1,\dots,n}$ and $(y_i)_{i=1,\dots,n}$

Training error rate

$$\widehat{\mathrm{Err}}(\mathrm{train}) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{Ind}_{\{y_i \neq \hat{f}_n(x_i)\}}$$

It can be easily calculated by applying the statistical learning method to the observations used in its training.

Problem
- biased estimate: the training error can dramatically underestimate the real error rate
- the size of the bias depends on the model's characteristics: complexity, tendency to overfit
- the more an instance influences its own predicted class, the more substantial the underestimation will be (1-NN: 0% training error, as illustrated below)
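A minimal sketch (assuming scikit-learn and simulated data) of the 1-NN case: the training error rate is exactly 0%, however poorly the model may generalize.

```python
# The training error of 1-NN is always 0%: each point is its own
# nearest neighbor, so the training error rate tells us nothing
# about the true error rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_error = np.mean(y != knn.predict(X))
print(f"1-NN training error rate: {train_error:.0%}")  # 0%
```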
Solutions

Best solution
- a large designated test set (often not available)

Other strategies
- penalize models that are too complex when computing the error rate
- hold out a subset of the training observations from the fitting process, then apply the statistical learning method to those held-out observations
- use resampling methods to create new data

Decision criteria
- size of the initial sample
- parametric model or not
- algorithm complexity, computing performance
Penalization

Mathematical adjustment to the training error rate in order to estimate the test error rate

Ideas
- evaluate the optimistic bias of the training error rate
- correct this bias with a penalization
- the penalization depends on the model's variability and complexity

Examples
- Mallows' $C_p$ statistic
- Akaike information criterion: AIC
- Bayesian information criterion: BIC

Drawback
- only usable with parametric models
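A minimal sketch (assuming statsmodels and simulated data): for nested parametric models, AIC and BIC are reported directly by the fitted model, and the model with the lowest value is preferred.

```python
# Compare nested OLS models of growing complexity via AIC and BIC.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)  # 3rd predictor useless

for k in (1, 2, 3):  # models using the first k predictors
    res = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(f"{k} predictors: AIC = {res.aic:.1f}, BIC = {res.bic:.1f}")
```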
Validation-set approach

Idea
- dissociate the data used to fit the model from the data used to compute the error rate
- compute the error with fresh, unused data

How
- randomly divide the available set of samples into two parts: a training set and a validation (or hold-out) set
- the model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set
- the resulting validation-set error provides an estimate of the test error

Random splitting into two parts: $D = D_{\mathrm{train}} \cup D_{\mathrm{test}}$
- $D_{\mathrm{train}}$ to fit $\hat{f}$ and $D_{\mathrm{test}}$ to assess the value of the prediction error via $\widehat{\mathrm{Err}}(\mathrm{test})$
- usually a $2/3$ / $1/3$ or $70\%$ / $30\%$ split
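A minimal sketch (assuming scikit-learn and simulated data) of the validation-set approach with a 70%/30% split:

```python
# Fit on the training part, estimate the error on the held-out part.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_error = np.mean(y_test != model.predict(X_test))
print(f"validation-set error estimate: {test_error:.1%}")
```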
Drawbacks of the validation-set approach

Validation estimate of the test error
- unbiased estimate of the error rate of the fitted model
- but biased estimate of the error rate of the final model on the whole data: only a subset of the observations (those included in the training set rather than in the validation set) are used to fit the model, so the validation-set error may tend to overestimate the test error for the model fit on the entire data set
- highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

Bias-variance trade-off
- the bigger the test sample, the more precise the estimate (low variance)
- but the bigger the test sample, the more biased the estimate
Train/validation/test sets

Split the data into 3 random samples: $D = D_{\mathrm{train}} \cup D_{\mathrm{valid}} \cup D_{\mathrm{test}}$
- $D_{\mathrm{train}}$ to fit the models $\hat{f}$
- $D_{\mathrm{valid}}$ to choose the best model inside a family of models
- $D_{\mathrm{test}}$ to choose the best of the chosen ones
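A minimal sketch (assuming scikit-learn): two successive random splits produce a 60%/20%/20% train/validation/test partition.

```python
# First split off the training set, then split the remainder
# into validation and test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```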
Resampling Methods
Resampling Methods

Ideas
- create new samples from the initial data
- fit a model on each of these samples

Purpose
- obtain more information, or more precise information, about the fitted model
- for example, resampling methods provide estimates of test-set prediction error, and of the standard deviation of our estimate

Benefits
- decrease the variance of the estimate
- improve accuracy when the original sample is small

Drawback
- computationally more expensive

Use
- widely used approach for estimating test error
- these methods can also be used to build confidence intervals in a nonparametric context

Methods and history
- Cross-validation: 1948
- Jackknife: Quenouille (1949) and Tukey (1958)
- Bootstrap: Efron (1979)
Cross-validation

Algorithm
1. Divide the data into $K$ roughly equal-sized parts $C_1, C_2, \dots, C_K$. $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$: if $n$ is a multiple of $K$, then $n_k = n/K$.
2. For $k$ in $\{1, \dots, K\}$:
   - hold out the $k$-th part
   - fit the model to the other $K - 1$ parts (combined)
   - compute the prediction error on the left-out $k$-th part
3. Compute the average error of the fitted models:

$$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \mathrm{Err}_k \quad \text{where} \quad \mathrm{Err}_k = \frac{1}{n_k} \sum_{i \in C_k} \operatorname{Ind}_{\{y_i \neq \hat{y}_i\}}$$

and $\hat{y}_i$ is the fit for observation $i$, obtained from the data with part $k$ removed.
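A minimal sketch (assuming scikit-learn and simulated data): 5-fold cross-validation of a classifier, with the CV error rate computed as one minus the mean fold accuracy.

```python
# cross_val_score returns the accuracy of each of the K fitted models
# on its held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
cv_error = 1 - scores.mean()
print(f"5-fold CV error estimate: {cv_error:.1%}")
```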
Choice of $K$

Since each training set is only $(K-1)/K$ as big as the original training set, the estimates of prediction error will typically be biased upward.
This bias is minimized when $K = n$ (leave-one-out cross-validation: LOOCV).

LOOCV
- small bias: each training sample contains $n - 1$ observations
- but high variance: the estimates from each fold are highly correlated
- and it can be computationally intensive

Bias-variance trade-off
- the smaller $K$ is, the smaller the variance
- the smaller $K$ is, the bigger the bias

A good compromise is $K = 5$ or $10$ (compared below).
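A minimal sketch (assuming scikit-learn) comparing the estimates for K = 5, K = 10 and LOOCV on the same data:

```python
# LOOCV is the special case K = n; with n = 100 it fits 100 models.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)
for cv, label in [(5, "K = 5"), (10, "K = 10"), (LeaveOneOut(), "LOOCV")]:
    err = 1 - cross_val_score(model, X, y, cv=cv).mean()
    print(f"{label}: estimated error rate {err:.1%}")
```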
Classical use of CV

Selection of the best parameter $\lambda$ in a model family $\hat{f}_\lambda$:

$$\hat{\lambda} = \operatorname{argmin}_{\lambda} \; \widehat{\mathrm{Err}}_{CV}(\lambda)$$
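A minimal sketch (assuming scikit-learn): selecting the number of neighbors of a k-NN classifier, a hypothetical stand-in for $\lambda$, by grid search over the cross-validated score.

```python
# GridSearchCV computes the CV score for every candidate value of the
# parameter and keeps the best one.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 21))},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print(f"CV error at best k: {1 - search.best_score_:.1%}")
```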
Cross-validation: right and wrong

Data
- consider a simple classifier applied to some two-class data (classes of the same size), sample of size 50
- 500 predictors with standard normal distribution, mutually independent and independent of the target

Procedure
1. find the 100 predictors having the largest correlation with the class labels
2. estimate the prediction error via cross-validation of the 1-nearest-neighbor model using only these 100 predictors

Results
- over 50 simulations, average estimated error rate: 3%
- but real error rate: 50%

Cross-validation: right and wrong

This process ignores the fact that in Step 1, the procedure has already seen the labels of the training data and made use of them. This is a form of training and must be included in the validation process.
This error has been made in many high-profile genomics papers.

The Wrong and Right Way
- Wrong: apply cross-validation only in step 2
- Right: apply cross-validation to steps 1 and 2

When a procedure requires multiple steps, all the steps must be included in the CV process.
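A minimal sketch (assuming scikit-learn; SelectKBest with the ANOVA F-score stands in for the correlation screening) of the right way: wrapping the selection step in a Pipeline means it is re-run inside every CV fold, so the estimated error stays close to the true 50%.

```python
# 50 observations, 500 pure-noise predictors, two balanced classes.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))   # predictors independent of the labels
y = np.repeat([0, 1], 25)        # two classes of the same size

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),   # screening refit per fold
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"honest CV error estimate: {1 - scores.mean():.0%}")  # near 50%
```

Running the selection once on the full data and only cross-validating the 1-NN step would instead report a misleadingly low error, as in the simulation above.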
Bootstrap

- flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
- can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient

Where does the name come from?
It derives from the phrase "to pull oneself up by one's bootstraps", from the eighteenth-century "The Surprising Adventures of Baron Munchausen" by Rudolph Erich Raspe:
"The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps."

Principle
- use a computer to mimic the process of obtaining new data sets
- each of these bootstrap data sets is created by sampling with replacement, and is the same size as the original data set
Bootstrap

Algorithm 1
- generate $B$ bootstrap samples from the original sample $\Omega_n$
- fit a model on each bootstrap sample
- compute the prediction error of each model on the original data set and take the average

Issue
- each bootstrap sample has significant overlap with the original data: about two-thirds of the original data points appear in each bootstrap sample
- this will cause the bootstrap to seriously underestimate the true prediction error
Bootstrap

Algorithm 2
- generate $B$ bootstrap samples from the original sample $\Omega_n$
- fit a model on each bootstrap sample
- compute the prediction error of each model using the observations that do not occur in the current bootstrap sample, and take the average (a sketch follows below)

Drawback
- the method gets complicated, and in the end cross-validation provides a simpler, more attractive approach for estimating prediction error
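A minimal sketch (assuming scikit-learn and NumPy) of Algorithm 2: each model is fit on a bootstrap sample and evaluated on the observations left out of that sample.

```python
# "Out-of-sample" evaluation of each bootstrap model: predictions are made
# only on the observations absent from the current bootstrap sample.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)
n, B, errors = len(y), 100, []
for _ in range(B):
    idx = rng.integers(0, n, size=n)        # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # left-out observations
    if len(oob) == 0:
        continue
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    errors.append(np.mean(y[oob] != model.predict(X[oob])))
print(f"bootstrap error estimate: {np.mean(errors):.1%}")
```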
Cross-validation vs. Bootstrap

- conceptually, the bootstrap is more complicated than cross-validation
- we will see the bootstrap play a key role in some algorithms
- cross-validation is widely used due to its easy implementation

For the same computational performance ($K = B$):
- the bootstrap estimate has a lower variance
- the CV estimate is less biased
Conclusion

- estimating a prediction error is a tricky operation and has important consequences
- there is no perfect method

Advice
- outside any system of probabilistic hypotheses, be careful about the absolute value of an estimate
- when choosing between models within the same family, we may assume that the induced bias is identical from one model to another and use a less computationally expensive method
- use the same error-estimation method when comparing the efficiency of several methods