Assessing Model Accuracy
Datamining | MecEn 1
Julie Scholler | B 246 | February 2021
Prediction accuracy
- find the best predictive model
- accurately predict unseen test cases
- machine learning, statistical learning

Interpretability
- find the true model
- understand which inputs affect the outcome, and how
- econometrics, statistical learning
Model assessment

The prediction error is useful to
- choose the best version of a model (inside a family of models)
- help choose between the best models of each family
- give, ultimately, a measure of the quality/performance of the predictions
Context

Starting point
- Outcome measurement $Y$, also called dependent variable, response, or target. In a classification problem, $Y$ takes values in a finite, unordered set (yes/no, survived/died, digit 0-9, etc.)
- Vector of $p$ predictor measurements $X = (X_1, X_2, \dots, X_p)$, also called inputs, regressors, covariates, features, or independent variables
- Model to predict: $Y = f(X) + \varepsilon$, where $\varepsilon$ captures measurement errors and other discrepancies: $E(\varepsilon) = 0$, $\varepsilon$ independent of $X$

Fitting a model
$n$ observations or training data $(x_1, y_1), \dots, (x_n, y_n) \Rightarrow$ prediction model $\hat{f}$
Assessing Model Accuracy

Recall
Suppose we have fit a model $\hat{f}$ to some training data, and let $(x_0, y_0)$ be a test observation drawn from the population. If the true model is $Y = f(X) + \varepsilon$, then

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \operatorname{Var}(\hat{f}(x_0)) + \left[\operatorname{Bias}(\hat{f}(x_0))\right]^2 + \operatorname{Var}(\varepsilon)$$

Typically, as the flexibility of $\hat{f}$ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
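A short sketch of where this decomposition comes from, using $E(\varepsilon) = 0$ and the independence of $\varepsilon$ from $X$ and from the training data (here $y_0 = f(x_0) + \varepsilon$):

```latex
\begin{aligned}
E\big[(y_0 - \hat{f}(x_0))^2\big]
  &= E\big[(f(x_0) + \varepsilon - \hat{f}(x_0))^2\big] \\
  &= E\big[(f(x_0) - \hat{f}(x_0))^2\big] + \operatorname{Var}(\varepsilon)
     \quad \text{(the cross term vanishes since } E(\varepsilon) = 0\text{)} \\
  &= \operatorname{Var}(\hat{f}(x_0)) + \big[E(\hat{f}(x_0)) - f(x_0)\big]^2 + \operatorname{Var}(\varepsilon).
\end{aligned}
```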
Assessing Model Accuracy

Misclassification error rate
Typically, in classification, we measure the performance of $\hat{f}$ using the misclassification error rate:

$$\mathrm{Err} = E\left(\operatorname{Ind}_{\{Y \neq \hat{f}_n(X)\}}\right) \quad \text{or} \quad \frac{1}{N} \sum_{\omega \in \Omega_{\mathrm{pop}}} \operatorname{Ind}_{\{Y(\omega) \neq \hat{f}_n(X(\omega))\}}$$

Problem
- We do not have access to the distribution of $(Y, X)$ or to the whole population.
- What can we do to estimate this prediction error?
First idea: Training error rate

Observations: $(x_i)_{i=1,\dots,n}$ and $(y_i)_{i=1,\dots,n}$

Training error rate

$$\widehat{\mathrm{Err}}(\mathrm{train}) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{Ind}_{\{y_i \neq \hat{f}_n(x_i)\}}$$

It can be easily calculated by applying the statistical learning method to the observations used in its training.

Problem
- biased estimate: the training error can dramatically underestimate the real error rate
- the size of the bias depends on the model's characteristics: complexity, tendency to overfit
- the more an instance influences its own predicted class, the more substantial the underestimation will be (1-NN: 0% training error, as illustrated below)
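A minimal sketch (assuming scikit-learn and simulated data) of the 1-NN case: the training error rate is exactly 0%, however poorly the model may generalize.

```python
# The training error of 1-NN is always 0%: each point is its own
# nearest neighbor, so the training error rate tells us nothing
# about the true error rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_error = np.mean(y != knn.predict(X))
print(f"1-NN training error rate: {train_error:.0%}")  # 0%
```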
Solutions

Best solution
- a large designated test set (often not available)

Other strategies
- penalize models that are too complex when computing the error rate
- hold out a subset of the training observations from the fitting process, then apply the statistical learning method to those held-out observations
- use resampling methods to create new data

Decision criteria
- size of the initial sample
- parametric model or not
- algorithm complexity, computing performance
Penalization

Mathematical adjustment to the training error rate in order to estimate the test error rate

Ideas
- evaluate the optimistic bias of the training error rate
- correct this bias with a penalization
- the penalization depends on the model's variability and complexity

Examples
- Mallows' $C_p$ statistic
- Akaike information criterion: AIC
- Bayesian information criterion: BIC

Drawback
- only usable with parametric models
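A minimal sketch (assuming statsmodels and simulated data): for nested parametric models, AIC and BIC are reported directly by the fitted model, and the model with the lowest value is preferred.

```python
# Compare nested OLS models of growing complexity via AIC and BIC.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)  # 3rd predictor useless

for k in (1, 2, 3):  # models using the first k predictors
    res = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(f"{k} predictors: AIC = {res.aic:.1f}, BIC = {res.bic:.1f}")
```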
Validation-set approach

Idea
- dissociate the data used to fit the model from the data used to compute the error rate
- compute the error with fresh, unused data

How
- randomly divide the available set of samples into two parts: a training set and a validation (or hold-out) set
- the model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set
- the resulting validation-set error provides an estimate of the test error

Random splitting into two parts: $D = D_{\mathrm{train}} \cup D_{\mathrm{test}}$
- $D_{\mathrm{train}}$ to fit $\hat{f}$ and $D_{\mathrm{test}}$ to assess the value of the prediction error via $\widehat{\mathrm{Err}}(\mathrm{test})$
- usually a $2/3$ / $1/3$ or $70\%$ / $30\%$ split
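A minimal sketch (assuming scikit-learn and simulated data) of the validation-set approach with a 70%/30% split:

```python
# Fit on the training part, estimate the error on the held-out part.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_error = np.mean(y_test != model.predict(X_test))
print(f"validation-set error estimate: {test_error:.1%}")
```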
Drawbacks of the validation-set approach

Validation estimate of the test error
- unbiased estimate of the error rate of the fitted model
- but biased estimate of the error rate of the final model on the whole data: only a subset of the observations (those included in the training set rather than in the validation set) are used to fit the model, so the validation-set error may tend to overestimate the test error for the model fit on the entire data set
- highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

Bias-variance trade-off
- the bigger the test sample, the more precise the estimate (low variance)
- but the bigger the test sample, the more biased the estimate
Train/validation/test sets

Split the data into 3 random samples: $D = D_{\mathrm{train}} \cup D_{\mathrm{valid}} \cup D_{\mathrm{test}}$
- $D_{\mathrm{train}}$ to fit the models $\hat{f}$
- $D_{\mathrm{valid}}$ to choose the best model inside a family of models
- $D_{\mathrm{test}}$ to choose the best of the chosen ones
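A minimal sketch (assuming scikit-learn): two successive random splits produce a 60%/20%/20% train/validation/test partition.

```python
# First split off the training set, then split the remainder
# into validation and test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```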
Resampling Methods
Resampling Methods

Ideas
- create new samples from the initial data
- fit a model on each of these samples

Purpose
- obtain more information, or more precise information, about the fitted model
- for example, resampling methods provide estimates of test-set prediction error, and of the standard deviation of our estimate

Benefits
- decrease the variance of the estimate
- improve accuracy when the original sample is small

Drawback
- computationally more expensive

Use
- widely used approach for estimating test error
- these methods can also be used to build confidence intervals in a nonparametric context

Methods and history
- Cross-validation: 1948
- Jackknife: Quenouille (1949) and Tukey (1958)
- Bootstrap: Efron (1979)
Cross-validation

Algorithm
1. Divide the data into $K$ roughly equal-sized parts $C_1, C_2, \dots, C_K$. $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$: if $n$ is a multiple of $K$, then $n_k = n/K$.
2. For $k$ in $\{1, \dots, K\}$:
   - hold out the $k$-th part
   - fit the model to the other $K - 1$ parts (combined)
   - compute the prediction error on the left-out $k$-th part
3. Compute the average error of the fitted models:

$$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \mathrm{Err}_k \quad \text{where} \quad \mathrm{Err}_k = \frac{1}{n_k} \sum_{i \in C_k} \operatorname{Ind}_{\{y_i \neq \hat{y}_i\}}$$

and $\hat{y}_i$ is the fit for observation $i$, obtained from the data with part $k$ removed.
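A minimal sketch (assuming scikit-learn and simulated data): 5-fold cross-validation of a classifier, with the CV error rate computed as one minus the mean fold accuracy.

```python
# cross_val_score returns the accuracy of each of the K fitted models
# on its held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
cv_error = 1 - scores.mean()
print(f"5-fold CV error estimate: {cv_error:.1%}")
```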
Choice of $K$

Since each training set is only $(K-1)/K$ as big as the original training set, the estimates of prediction error will typically be biased upward.
This bias is minimized when $K = n$ (leave-one-out cross-validation: LOOCV).

LOOCV
- small bias: each training sample contains $n - 1$ observations
- but high variance: the estimates from each fold are highly correlated
- and it can be computationally intensive

Bias-variance trade-off
- the smaller $K$ is, the smaller the variance
- the smaller $K$ is, the bigger the bias

A good compromise is $K = 5$ or $10$ (compared below).
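A minimal sketch (assuming scikit-learn) comparing the estimates for K = 5, K = 10 and LOOCV on the same data:

```python
# LOOCV is the special case K = n; with n = 100 it fits 100 models.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)
for cv, label in [(5, "K = 5"), (10, "K = 10"), (LeaveOneOut(), "LOOCV")]:
    err = 1 - cross_val_score(model, X, y, cv=cv).mean()
    print(f"{label}: estimated error rate {err:.1%}")
```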
Classical use of CV

Selection of the best parameter $\lambda$ in a model family $\hat{f}_\lambda$:

$$\hat{\lambda} = \operatorname{argmin}_{\lambda} \; \widehat{\mathrm{Err}}_{CV}(\lambda)$$
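A minimal sketch (assuming scikit-learn): selecting the number of neighbors of a k-NN classifier, a hypothetical stand-in for $\lambda$, by grid search over the cross-validated score.

```python
# GridSearchCV computes the CV score for every candidate value of the
# parameter and keeps the best one.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 21))},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print(f"CV error at best k: {1 - search.best_score_:.1%}")
```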
Cross-validation: right and wrong

Data
- consider a simple classifier applied to some two-class data (classes of the same size), sample of size 50
- 500 predictors with standard normal distribution, mutually independent and independent of the target

Procedure
1. find the 100 predictors having the largest correlation with the class labels
2. estimate the prediction error via cross-validation of the 1-nearest-neighbor model using only these 100 predictors

Results
- over 50 simulations, average estimated error rate: 3%
- but real error rate: 50%

Cross-validation: right and wrong

This process ignores the fact that in Step 1, the procedure has already seen the labels of the training data and made use of them. This is a form of training and must be included in the validation process.
This error has been made in many high-profile genomics papers.

The Wrong and Right Way
- Wrong: apply cross-validation only in step 2
- Right: apply cross-validation to steps 1 and 2

When a procedure requires multiple steps, all the steps must be included in the CV process.
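A minimal sketch (assuming scikit-learn; SelectKBest with the ANOVA F-score stands in for the correlation screening) of the right way: wrapping the selection step in a Pipeline means it is re-run inside every CV fold, so the estimated error stays close to the true 50%.

```python
# 50 observations, 500 pure-noise predictors, two balanced classes.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))   # predictors independent of the labels
y = np.repeat([0, 1], 25)        # two classes of the same size

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),   # screening refit per fold
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"honest CV error estimate: {1 - scores.mean():.0%}")  # near 50%
```

Running the selection once on the full data and only cross-validating the 1-NN step would instead report a misleadingly low error, as in the simulation above.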
Bootstrap

- flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
- can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient

Where does the name come from?
It derives from the phrase "to pull oneself up by one's bootstraps", from the eighteenth-century "The Surprising Adventures of Baron Munchausen" by Rudolph Erich Raspe:
"The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps."

Principle
- use a computer to mimic the process of obtaining new data sets
- each of these bootstrap data sets is created by sampling with replacement, and is the same size as the original data set
Bootstrap

Algorithm 1
- generate $B$ bootstrap samples from the original sample $\Omega_n$
- fit a model on each bootstrap sample
- compute the prediction error of each model on the original data set and take the average

Issue
- each bootstrap sample has significant overlap with the original data: about two-thirds of the original data points appear in each bootstrap sample
- this will cause the bootstrap to seriously underestimate the true prediction error
Bootstrap

Algorithm 2
- generate $B$ bootstrap samples from the original sample $\Omega_n$
- fit a model on each bootstrap sample
- compute the prediction error of each model using the observations that do not occur in the current bootstrap sample, and take the average (a sketch follows below)

Drawback
- the method gets complicated, and in the end cross-validation provides a simpler, more attractive approach for estimating prediction error
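A minimal sketch (assuming scikit-learn and NumPy) of Algorithm 2: each model is fit on a bootstrap sample and evaluated on the observations left out of that sample.

```python
# "Out-of-sample" evaluation of each bootstrap model: predictions are made
# only on the observations absent from the current bootstrap sample.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)
n, B, errors = len(y), 100, []
for _ in range(B):
    idx = rng.integers(0, n, size=n)        # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # left-out observations
    if len(oob) == 0:
        continue
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    errors.append(np.mean(y[oob] != model.predict(X[oob])))
print(f"bootstrap error estimate: {np.mean(errors):.1%}")
```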
Cross-validation vs. Bootstrap

- conceptually, the bootstrap is more complicated than cross-validation
- we will see the bootstrap play a key role in some algorithms
- cross-validation is widely used due to its easy implementation

For the same computational performance ($K = B$):
- the bootstrap estimate has a lower variance
- the CV estimate is less biased
Conclusion

- estimating a prediction error is a tricky operation and has important consequences
- there is no perfect method

Advice
- outside any system of probabilistic hypotheses, be careful about the absolute value of an estimate
- when choosing between models within the same family, we may assume that the induced bias is identical from one model to another and use a less computationally expensive method
- use the same error-estimation method when comparing the efficiency of several methods