Model Assessment and Selection We assume i.i.d. data

(1)

Model Assessment and Selection

We assume i.i.d. data

independently (independent samples)

distributed

identically (the same distribution) Assume many iid datasets

1 line= train/test curve for 1 set of samples (data)

data used to fit the model Training erroris

err=_N¹PN

i=1L(yi,ˆf(xi)), data not used to fit the model Test erroris

err=_N¹PN

i=1L(yi,ˆf(xi)), Test error is a point estimate of the generalization error.

Dark blue line is an average of test errors, a more robust estimate.

Elements of Statistical Learning (2nd Ed.) cHastie, Tibshirani & Friedman 2009 Chap 7

0 5 10 15 20 25 30 35

0.00.20.40.60.81.01.2

Model Complexity (df)

Prediction Error

High Bias Low Bias

High Variance Low Variance

FIGURE 7.1.Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training errorerr, while the light red curves show the conditional test errorErrT for100training sets of size50each, as the model complexity is increased. The solid curves show the expected test errorErrand the expected training errorE[err].

Definition (Generalization error) Generalization erroris the expected prediction error over an independent test sample

Err =E[L(Y,ˆf(X))]

where bothX andY are drawn randomly from their joint distribution (population).

Machine Learning Model Assessment and Selection 6 1 - 35 April 4, 2021 147 / 1

(2)

Loss functions for regression

Definition

Square error lossL(y,yˆ) = (y−ˆy)² Absolute error lossL(y,yˆ) =|y−yˆ| Huber error loss

L_δ(y,ˆy) = ₁

2(y−yˆ)² for |y−yˆ| ≤δ, δ(|y−ˆy| − ¹₂δ), otherwise.

−3 −2 −1 0 1 2 3

02468

Squared Error Absolute Error Huber

Loss

y−f

FIGURE 10.5.A comparison of three loss functions for regression, plotted as a function of the marginy−f. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when|y−f|is large.

(3)

Classifier Assessment and Selection

Qualitative responseG taking one ofK values labeled as 1, . . . ,K. Typically

G(Xˆ ) = arg maxkpˆk(X).

−2 −1 0 1 2

0.00.51.01.52.02.53.0 Misclassification

Exponential Binomial Deviance Squared Error Support Vector

Loss

y·f

FIGURE 10.4. Loss functions for two-class classi- ﬁcation. The response isy =±1; the prediction is f, with class predictionsign(f). The losses are mis- classiﬁcation:I(sign(f)=y); exponential:exp(−yf);

binomial deviance: log(1 + exp(−2yf)); squared error:(y−f)²; and support vector:(1−yf)+(see Sec- tion 12.3). Each function has been scaled so that it passes through the point(0,1).

Definition (Loss functions for classification) 0-1 loss, misclassification

L(G,G(X)) =ˆ I(G 6= ˆG(X)) log-likelihood, cross-entropy, deviance

L(G,p(X)) =ˆ −2

N

X

k=1

I(G=k) log ˆpk(X) two classes encoded {−1,+1}

exponential

e^−yf support vector

max(0,1−yf(x))

(4)

Data Rich Situation

If we have enough data, we split the dataset Traintrain the model

Validateselect appropriate model parameterα,λ, usually the model complexity, cost penalty

Testestimate the test error on an independent sample.

Recommended ratios:

1

2: ¹₄ : ¹₂ with the validation set

2

3: ¹₃ without the validation need.

(5)

Stratified Selection

Assume small disease prevalence.

We split the data healthy/sick and split each group separately.

Consider a model over different branches of your company:

to estimate a new branch, all data from some branches should be selected as test ones

not a random sample from each branch.

Other methods only since we almost never have enough of the data.

(6)

Error Estimation with a few data

Amount of data needed depends on:

true function complexity the noise ration.

Model selection

absolute value is not necessary, the difference is crucial

any method can be used (AIC, BIC, cross-validation,. . . )

Subset Size p

Misclassification Error

5 10 15 20

0.00.10.20.30.40.50.6

••

•

• •• • • • • • • • •

•

• •

••

• •

• • • • • • ••• • • •

FIGURE 7.9.Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.

Test error estimation

absolute value is necessary direct estimation (cross-validation, one-leave-out) preferred.

Number of Basis Functions Log-likelihood 0.51.01.52.02.5

Log-likelihood Loss

2 4 8163264128

O O

O O OO

O O O

O O O OO O

O

O O

O O OO O O Train Test AIC

Number of Basis Functions Misclassification Error 0.100.150.200.250.300.35

0-1 Loss

2 4 8163264128

O

O O

O O O

O O

O O O

O O

FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Sec- tion 5.2.3. The logistic regression coeﬃcient function β(f) =PM

m=1hm(f)θmis modeled as an expansion in Mspline basis functions. In the left panel we see the AIC statistic used to estimateErrinusing log-likelihood loss. Included is an estimate ofErrbased on an independent test sample. It does well except for the ex- tremely over-parametrized case (M= 256parameters forN = 1000observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.

(7)

Bias Variance Decomposition

AssumeY =f(X) +where E() = 0 andVar() =σ².

we derive the expected prediction error of the regression fit ˆf(X) at an input pointX =x₀using squared-error loss:

Err(x0) = E[(Y −ˆf(x0))²|X =x0]

= σ²+ [Eˆf(x0)−f(x0)]²+E[ˆf(x0)−Eˆf(x0)]²

= σ²+Bias²(ˆf(x0)) +Var(ˆf(x0))

= Irreducible Error + Bias²+ Variance.

NMR Signal

0 200 400 600 800 1000

0204060

spline wavelet

Smooth Function (Simulated)

n

0.0 0.2 0.4 0.6 0.8 1.0

-4-2024

spline wavelet true

•

••

•

••

•

••

•••

•

•••

••••

•

••

•

••

•

••

•

•••••

•

••

•

•••

••

•

••

•

••

•

•••

••

•

••

•

••

•

••

•

••

•

••

•

•••••

•••

•

•••

•

•••

•

••

•

FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples. Each panel com- pares the SURE-shrunk wavelet ﬁt to the cross-vali- dated smoothing spline ﬁt.

(8)

Bias Variance Decomposition

K-Nearest neighbour

Err(x0) = E[(Y −ˆf(x0))²|X =x0]

= σ²+ [Eˆf(x₀)−f(x₀)]²+E[ˆf(x₀)−Eˆf(x₀)]²

= σ²+ [f(x₀)− 1 k

k

X

`=1

f(x_(`))]²+σ² k . Linear fit

Err(x₀) = E[(Y −ˆf(x₀))²|X =x₀]

= σ²+ [Eˆfp(x0)−f(x0)]²+kh(x0)kσ² h(x0)y = (x₀^T(X^TX)⁻¹X^T)y= ˆfp(x0).

henceVar[ˆf_p(x₀)] =kh(x₀)kσ² and its average is _N^pσ², hence 1

N

X

i=1

Err(xi) = σ²+ 1 N

N

X

i=1

[f(xi)−Eˆf(xi)]²+ p Nσ²,

(9)

Penalized model complexity

Consider ridge regression fit ˆfα(x0) with the fit h(x₀)y=x₀^T(X^TX+αI)⁻¹X^Ty we break the bias more finely. Letβ∗ be the fit withα= 0, β∗= arg minβE(f(X)−β^TX)².

Ex₀[f(x0)−Eˆfα(x0)]² = Ex₀[f(x0)−β_∗^Tx0]²+Ex₀[β_∗^Tx0−Eβˆ_α^Tx0]²

= Ave[Model Bias]²+Ave[Estimation Bias]².

Realization

Closest fit in population

Estimation Bias

SPACE

Variance Estimation

Closest fit

Truth

Model bias

RESTRICTED Shrunken fit

MODEL SPACE MODEL

FIGURE 7.2.Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the “closest fit” labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled “closest fit in population.” A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

(10)

Example: Bias-Variance Tradeoff

50 observations, 20 predictors, uniformly distributed in the hypercube [0,1]²⁰. Left Y is 0 ifX₁≤¹₂ and 1

ifX₁> ¹₂ and we apply k-NN

Right Y is 1 ifP10

j=1Xj >5 and 0 otherwise, and we use best subset linear regression of size p.

0.00.10.20.30.4

Number of Neighbors k

50 40 30 20 10 0

k−NN − Regression

5 10 15 20

0.00.10.20.30.4

Subset Size p Linear Model − Regression

0.00.10.20.30.4

Number of Neighbors k

50 40 30 20 10 0

k−NN − Classification

5 10 15 20

0.00.10.20.30.4

Subset Size p Linear Model − Classification

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classiﬁcation with0–1loss.

The models arek-nearest neighbors (left) and best subset regression of sizep(right). The variance and bias

Fig. 7.3.

(11)

Optimism of the Training Error Rate

training errorerr = _N¹ PN

i=1L(yi,ˆf(xi))

usually less than the true errorErr =E[L(Y,ˆf(X))].

we keepxi points fixed, we take new sample of yat these points.

in-sample errorErrin =_N¹PN

i=1EyEY^newL(Y_i^new,ˆf(xi)).

optimism op≡Err_in−Ey(err).

for squared error, 0-1, and other loss functions: op=_N²PN

i=1Cov(ˆyi,yi) For a linear fit withd=pinputs of basis functions and additive error model Y =f(X) +it simplifies

N

X

i=1

cov(ˆyi,yi) = dσ² Errin = Eyerr+ 2d

Nσ² =Cp=AICgauss..

NMR Signal

0 200 400 600 800 1000

0204060

spline wavelet

Smooth Function (Simulated)

n

0.0 0.2 0.4 0.6 0.8 1.0

-4-2024

spline wavelet true

•

••

•

••

•

•••

•

•••

••••

•

••

•

••

•

••

•

•••••

•

••

•

•••

••

•

••

•

••

•

•••

•

••

•

••

•

••

•

••

•

••

•

•••••

•••

•

•••

•

•••

•

••

•

FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples. Each panel com- pares the SURE-shrunk wavelet ﬁt to the cross-vali- dated smoothing spline ﬁt.

(12)

The Optimism (Just for fun)

yi−ˆyi = (yi−f(xi)) + (f(xi)−Eˆf(xi)) + (Eˆf(xi)−yˆi).

We get six terms of the sum of squares:

A1=P

i(yi−f(xi))² D1= 2P

i(yi−f(xi))(f(xi)−Eˆf(xi)) B=P

i(f(x_i)−Eˆf(x_i))² E = 2P

i(f(x_i)−Eˆf(x_i))(Eˆf(x_i)−ˆy_i) C=P

i(Eˆf(xi)−ˆyi)² F1= 2P

i(yi−f(xi))(Eˆf(xi)−ˆyi) For the new sampleY, taking the expectation.

A2=P

i(EY⁰Y_i⁰−f(xi))² and similarlyD2,F2.

N(Errin−err) = (A2+B+C+D2+E+F2)−(A1+B+C+D1+E+F1)

= (A2−A1) + (D2−D1) + (F2−F1) E(A1) =E(A2) =Nσ²,

E(D1) = 2P

i(E(yi)−f(xi))(f(xi)−Eˆf(xi)) = 0 sinceE(yi) =f(xi) E(D2) = 0 as well.

F₂= 2P

i(EY⁰

h

(Y_i⁰−f(x_i))(Eˆf(x_i)−ˆy_i)i

= 0 asE(Y⁰) =f(xi) andYi⁰and ˆyi are independent.

E(F1) =−2PN

i cov(yi,ˆyi).

(13)

The Covariance (Just for fun)

cov(ˆy,y) = cov((X(X^TX)⁻¹X^T)y,y) = (X(X^TX)⁻¹X^T)cov(y,y) cov(y,y) = σ².

The valuescov(ˆy_i,y_i) are the diagonal values of the above matrixcov(ˆy,y).

Thus

N

X

i=1

cov(ˆy_i,y_i) = trace(X(X^TX)⁻¹(X^T))σ²

= trace((X^TX)⁻¹(X^TX))σ²

= trace(Id)σ²=dσ². Similarly,

N

X

i=1

cov(Sλyi,yi) = trace(Sλ).

https://www.waxworksmath.com/Authors/G_M/Hastie/WriteUp/Weatherwax_Epstein_Hastie_Solution_Manual.pdf

(14)

AIC Criterion for Gaussian error

N

X

i=1

Cov(ˆyi,yi) = dσ²

AIC(α) = Eyerr(α) + 2d(α) N σ². The effective number of parameters

for a linear fit^y=Syisd(S) =trace(S), the sum of the diagonal elements.

For 0-1 loss does not hold in general, only as approximation.

Number of Basis Functions Log-likelihood 0.51.01.52.02.5

Log-likelihood Loss

2 4 8 16 32 64 128

O O

O

O O O

O O

O O O O

O

O O

O

O O O O O Train

Test AIC

Number of Basis Functions Misclassification Error 0.100.150.200.250.300.35

0-1 Loss

2 4 8 16 32 64 128

O

O O

O O O

O O

FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Sec- tion 5.2.3. The logistic regression coeﬃcient function β(f) =PM

m=1hm(f)θmis modeled as an expansion in M spline basis functions. In the left panel we see the AIC statistic used to estimateErrinusing log-likelihood loss. Included is an estimate of Err based on an independent test sample. It does well except for the ex- tremely over-parametrized case (M = 256 parameters for N = 1000 observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.

(15)

BIC Bayesian Information Criterion

P(Mm|Z) = P(Z|Mm)·P(Mm) P(Z)

∝ P(Z|Mm)·P(Mm)

∝ P(M_m)· Z

P(Z|θ_m,M_m)P(θ_m|M_m)dθ_m

Laplace approximation to the integral gives (and other simplification):

logP(Z|M_m) = logP(Z|θ_m,M_m)−dm

2 ·logN+O(1)

= loglik−dm

2 ·logN+O(1) BIC_m=−2loglik_m+ (logN)·d_m BIC may be used to compare the model posterior probabilities

e¹²^·BIC^m PM

`=1e¹² ·BIC`

.

(16)

Crossvalidation

Split the data intoK roughly equal-sized parts. Usually,K = 5 or 10 orK =N

ForK= 10 it is calledtenfoldcrossvalidation.

ForK=N it is calledone-leave-outcrossvalidation.

κ:{1, . . . ,N} → {1, . . . ,K}is the partition function ˆf^−k is the fitted function withkth part removed.

Fork = 1, . . . ,K

For thekth part we fit the model to the otherK−1 parts of the data, and calculate the prediction error of the fitted model when predicting thekth part of the data.

Average the error estimates.

CV = 1 N

N

X

i=1

L(y_i,ˆf^−κ(i)(x_i)).

10 times 10 Crossvalidation

(17)

Parameter Tuning by Crossvalidation

Parameter Tuning by Crossvalidation Given a set of modelsf(x, α) indexed by a tuning parameter α, we define

CV(α) = 1 N

N

X

i=1

L(yi,ˆf^−κ(i)(xi, α)).

TheCV(α) provides an estimate of the test error curve and we find the tuning parameter ˆαthat minimizes it.

Our final model isf(x,α).ˆ

Subset Size p

Misclassification Error

5 10 15 20

0.00.10.20.30.40.50.6

•

••

• •

•

• •• • • • • • • • •

•

• •

•

• •

• • • • • • ••

• • • •

FIGURE 7.9.Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.

Definition (One standard error rule) We choose the most parsimonious model whose error is no more than one standard error above the error of the best model.

Here,p= 9.

(18)

Generalized Crossvalidation

For a linear fitting method we can write ˆ y=Sy.

For many linear fitting methods, 1

N

X

i=1

[yi−ˆf⁻ⁱ(xi)]²= 1 N

N

X

i=1

[y_i−ˆf⁻ⁱ(x_i) 1−Sii

]², Thegeneralized crossvalidation GCV approximation is

GCV = 1 N

N

X

i=1

"

yi−ˆf(xi) 1−trace(S)/N

#2

.

(19)

Bad and good Crossvalidation

p= 500 dimensions N= 20

random data,y independent ofx.

We search for the best ’decision stump’

(decision rule based on one single attribute).

0 100 200 300 400 500

23456789

Predictor

Error on Full Training Set

1 2 3 4 5 6 7 8

01234

Error on 4/5

Error on 1/5

−1 0 1 2

Predictor 436 (blue) Class Label 01

full 4/5

0.00.20.40.60.81.0

CV Errors

FIGURE 7.11.Simulation study to investigate the per- formance of cross validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classiﬁers on the full training set (20observations).

The top right panel shows the errors made by individual stumps trained on a random split of the dataset into4/5ths (16observations) and tested on the remaining1/5th (4ob- servations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the eﬀect of

Wrong way

choose 100 good predictors Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in foldk. Use the classifier to predict the class labels for the samples in foldk.

Correlations of Selected Predictors with Outcome

Frequency

−1.0 −0.5 0.0 0.5 1.0

0102030

Wrong way

Correlations of Selected Predictors with Outcome

Frequency

−1.0 −0.5 0.0 0.5 1.0

0102030

Right way

FIGURE 7.10. Cross-validation the wrong and right way: histograms shows the correlation of class labels, in10 randomly chosen samples, with the100predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.

(20)

One–leave–out odhad chyby

Jeden vzorek schováme, na ostatních naučíme model, zjistíme, zda je schovaný vzorek predikován správně.

Opakujeme pro každý vzorek a průměrujeme.

Výhody:

Máme velkou množinu na učení.

Výpočet je deterministický (a nemá cenu opakovat víckrát).

Nevýhody:

Výpočetně náročné.

Občas může selhat – např. náhodně 50 na 50 generovaný cíl, máme chybu one–leave–out 100%.

Size of Training Set

1-Err

0 50 100 150 200

0.00.20.40.60.8

FIGURE 7.8.Hypothetical learning curve for a clas- siﬁer on a given task: a plot of1−Errversus the size of the training setN. With a dataset of200observations, 5-fold cross-validation would use training sets of size 160, which would behave much like the full set. How- ever, with a dataset of50observations ﬁvefold cross–

validation would use training sets of size40, and this would result in a considerable overestimate of prediction error.

(21)

Bootstrap

Vzorky trénovací množiny vybíráme s opakováním.

MámeNdat, vybereme s opakovánímN vzorků – některé vyberu vícekrát, takže některé nevyberu vůbec. Ty použiji k testování.

Pravděpodobnost nevybrání vzorku je 1−_N¹N

≈e⁻¹= 0.368.

Odhad chyby je dost pesimistický, protože učíme naN datech, která ale reálně pocházejí jen z 0.632 dat, proto se konečný odhad chyby upravuje chybou na

trénovacích datech:

chyba= 0.632·etestovaci+0.368·etrenovaci data Opět může být svedeno s cesty, např.

stejnými daty jako one–leave–out.

Bootstrap

Bootstrap replications

samples

sample Training Z= (z1, z2, . . . , zN)

Z^∗1 Z^∗2 Z^∗B S(Z^∗1) S(Z^∗2) S(Z^∗B)

FIGURE 7.12. Schematic of the bootstrap process.

We wish to assess the statistical accuracy of a quan- tityS(Z)computed from our dataset.Btraining sets Z^∗b, b= 1, . . . , Beach of sizeNare drawn with re- placement from the original dataset. The quantity of interestS(Z)is computed from each bootstrap training set, and the valuesS(Z^∗1), . . . , S(Z^∗B)are used to assess the statistical accuracy ofS(Z).

(22)

Porovnání různých typů modelů

Řekněme, že jsme učili rozhodovací stromy a k–NN klasifikátory, provedli 10x10 krosvalidaci a dostali dvě čísla. Můžeme z toho usuzovat, že je jeden typ modelu na daných datech lepší, než ten druhý?

Jsou–li hodnoty blízko, tak ne, rozdíl může být náhodný.

Pro klasifikaci např. Mc Nemars’s test ^(err_err^A^−err^B⁻¹⁾²

A−err_B withχ² with 1 degree of freedom

errA denotes samples where A is wrong and B is correct.

Pro regresi

Statistika volí použítt–test, pokud můžeme určovat chybu na stejných rozděleních na trénovací a testovací data, takpárový t–test, který je přesnější.

Rozdíl jednotlivých odhadů má Studentovo rozložení oN−1 stupni volnosti, po normalizaci rozptylem: q^dⁱ

σ2 d (N−1)

používáme dvoustranný odhad testu, je–lid_i nula, protože předem nevíme, který model je lepší, tj. na kterou stranu z intervalu vypadneme.

Nepárový test má odhad rozptylu:_(k^σ²^x₎+^σ

2 y

(l).

(23)

Ohodnocení numerické predikce – regrese

střední kvadratická chyba mean–squared error

P

i(pi−ai)² n

odmocnina střední kvad. chyby root mean–squared error

q P

i(p_i−a_i)² n

střední absolutní chyba mean absolute error

P

i|pi−ai| n

předchozí – dají se použít buď na absolutní rozdíly nebo na relativní chyby, následující kromě korelace – relativně k předpovědi toho nejpravděpodobnějšího

relativní kvadratická chyba relative squared error

P

i(p_i−ai)²

P

i(a_i−a)²,a= ¹_nP

iai

odmocnina ... root relative squared error,

r P

i(pi−ai)²

P

i(a_i−a)²

relativní absolutní chyba relative absolute error

P

i|pi−ai|

P

i|ai−a|

korelační koeficient correlation coefficient ^√^S^PA

S_PS_A

whereSPA= P

i(pi−p)(ai−a)

n−1 ,SP =

P

i(pi−p)² n−1 ,SA=

P

i(ai−a)²

n−1 . Korelační koeficient (čím vyšší, tím lepší, narozdíl od předchozích)

(24)

Ohodnocení úspěšnosti klasifikace

Základní statistiky uváděné při ohodnocování modelů pro klasifikaci (např. ve Weka) vycházejí z tzv. matice záměn (confusion matrix):

správná třída\klasifikace + -

+ TP – true positive FN – false negative - FP – false positive TN true negative Česky se říká správně/falešně positivní/negativní.

Základní míry:

celková správnost accurancy Acc =_TP+TN^TP+TN_+FP_+FN

chyba error Err = TP+TN+FP+FN^FP+FN

přesnost precision Prec =_TP+FP^TP

úplnost, sensitivita recall, sensitivity Rec =_TP+FN^TP specificita specificity Specificity = _TN+FP^TN

F míra F measure F = ^2·Prec·Rec_Prec+Rec =_2·TP+FP^2·TP_+FN

TP rate _TP+FN^TP

FP rate (=1–Specificity) _FP+TN^FP

Různé obory uvádějí různé míry, v medicíně sensitívita a specificita, u vyhledávání dokumentů přesnost, úplnosti a F míra, případně 3–bodový průměr úspěšnosti (recall) při přesnosti 20,50, a 80 procent nebo 11–ti bodový průměr pro 0,10,20,. . . ,90,100 procent

marketing: TP, TP+TN+FP+FN^TP+FP ×100%

(25)

ROC křivka

ROC = receiver operating characteristic = užívané pro charakteristiku detekce přijímaného signálu

osa x = TP (správně positivních) v procentech osa y = FP (falešně positivních) v procentech

tj. každý positivně klasifikovaný příklad posune křivku o 1, buď nahoru (správně positivní) nebo doprava (falešně positivní). V reálu počítáme průměr přes několik testů.

Pokud máme více testů, u kterých se ROC křivky protínají, pak lze testy zkombinovat tak, abychom dosáhli hranice konvexního obalu ROC křivek.

!!Popisky os jsou vyměněné!.

(26)

Precision recall křivka

osa x jako dřív, osa y má precision, tj. TP ku TP+FP v procentech. Nejlepší místo je pak vpravo nahoře.

!!Popisky os jsou vyměněné!.

(27)

Cena za falešně positivní a falešně negativní

Příklad: predikce ovulace u krav – 97% dní neplodných, proto predikuji vždy neplodná, pokud nemám jinou cenu za promeškání plodnosti proti falešnému označení za plodnou.

Jiný příklad: reklamy– nechtějí promeškat zákazníka, zbytečně poslaný leták tak nevadí.

Čtyřpolní tabulka, obecně matice záměn (confusion matrix) pro vícehodnotový cílový atribut.

Učení s různou cenou chyby (Cost sensitive learning)

Datům, na které máme být citlivější, dáme větší váhu (např. je několikrát zopakujeme) a pak učíme normální model bez citlivosti na ceny.