Introduction to the course

Academic year: 2022


(1)

Classification

Introduction to Supervised Learning

Ana Karina Fermin

M2 ISEFAR

http://fermin.perso.math.cnrs.fr/

(2)

Motivation

Credit Default, Credit Score, Bank Risk, Market Risk Management

Data: Client profile, Client credit history...

Input: Client profile
Output: Credit risk

(3)

Motivation

Spam detection (Text classification)

Data: email collection
Input: email
Output: Spam or No Spam

(4)

Motivation

Face Detection

Data: Annotated database of images
Input: Sub-window in the image
Output: Presence or absence of a face

(5)

Motivation

Number Recognition

Data: Annotated database of images (each image is represented by a vector of 28×28 = 784 pixel intensities)
Input: Image
Output: Corresponding number

(6)

Machine Learning

A definition by Tom Mitchell (http://www.cs.cmu.edu/~tom/): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

(7)

Supervised Learning

Supervised Learning Framework

Input measurement X = (X^(1), X^(2), …, X^(d)) ∈ 𝒳; output measurement Y ∈ 𝒴.

(X, Y) ∼ P with P unknown.

Training data: D_n = {(X_1, Y_1), …, (X_n, Y_n)} (i.i.d. ∼ P).

Often X ∈ ℝ^d and Y ∈ {−1, 1} (classification), or X ∈ ℝ^d and Y ∈ ℝ (regression).

A classifier is a function in F = {f : 𝒳 → 𝒴}.

Goal

Construct a good classifier f̂ from the training data.

Need to specify the meaning of "good".

Formally, classification and regression are the same problem!
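A minimal simulation of this framework may help fix ideas. The distribution P below (uniform square, anti-diagonal decision rule, 10% label noise) is a hypothetical choice for illustration, not part of the course material:

```python
import random

random.seed(0)

# Hypothetical P: X uniform on [0, 1]^2; Y = +1 above the anti-diagonal,
# flipped with probability 0.1 so that P{Y | X} is not degenerate.
def sample_pair():
    x = (random.random(), random.random())
    y = 1 if x[0] + x[1] > 1 else -1
    if random.random() < 0.1:
        y = -y
    return x, y

# Training data D_n: n i.i.d. draws from P
n = 200
D_n = [sample_pair() for _ in range(n)]

# One element of F = {f : X -> Y}: a hand-picked classifier
def f(x):
    return 1 if x[0] + x[1] > 1 else -1
```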

(8)

Loss and Probabilistic Framework

Loss function

Loss function: ℓ(f(x), y) measures how well f(x) "predicts" y.

Examples:

Prediction loss: ℓ(Y, f(X)) = 1{Y ≠ f(X)}

Quadratic loss: ℓ(Y, f(X)) = |Y − f(X)|²

Risk of a generic classifier

Risk measured as the average loss for a new couple:

R(f) = E[ℓ(Y, f(X))]

Examples:

Prediction loss: E[ℓ(Y, f(X))] = P{Y ≠ f(X)}

Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|²]

Beware: as f̂ depends on D_n, its risk R(f̂) is itself a random quantity.
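Since P is unknown in practice, the risk cannot be computed exactly; but on a synthetic example where P is known, a Monte Carlo sketch makes the definition concrete. The threshold distribution below is an assumption chosen for illustration:

```python
import random

random.seed(0)

# Known toy P: X ~ Uniform(0, 1), Y = +1 iff X > 0.5 (no noise)
def sample_pair():
    x = random.random()
    return x, (1 if x > 0.5 else -1)

# A deliberately imperfect classifier: it errs exactly when X is in (0.5, 0.6]
def f(x):
    return 1 if x > 0.6 else -1

def zero_one(y, pred):          # prediction loss 1{y != pred}
    return 0 if y == pred else 1

# Monte Carlo estimate of R(f) = E[l(Y, f(X))] = P{Y != f(X)} = 0.1
m = 100_000
risk_hat = sum(zero_one(y, f(x)) for x, y in (sample_pair() for _ in range(m))) / m
print(round(risk_hat, 2))       # close to 0.1
```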

(9)

Supervised Learning

Experience, Task and Performance measure

Training data: D_n = {(X_1, Y_1), …, (X_n, Y_n)} (i.i.d. ∼ P)

Predictor: f : 𝒳 → 𝒴

Cost/Loss function: ℓ(f(X), Y)

Risk: R(f) = E[ℓ(Y, f(X))]

Often ℓ(f(X), Y) = |f(X) − Y|² or ℓ(f(X), Y) = 1{Y ≠ f(X)}

Goal

Learn a rule to construct a classifier f̂ ∈ F from the training data D_n such that the risk R(f̂) is small, on average or with high probability with respect to D_n.

(10)

Goal

Machine Learning

Learn a rule to construct a classifier f̂ ∈ F from the training data D_n such that the risk R(f̂) is small, on average or with high probability with respect to D_n.

Canonical example: Empirical Risk Minimizer

One restricts f to a subset of functions S = {f_θ, θ ∈ Θ}.

One replaces the minimization of the average loss by the minimization of the empirical loss:

f̂ = f_θ̂ with θ̂ = argmin_{θ ∈ Θ} (1/n) Σ_{i=1}^n ℓ(Y_i, f_θ(X_i))

Example:

Linear discrimination with S = {x ↦ sign(βᵀx + β₀), β ∈ ℝ^d, β₀ ∈ ℝ}
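Empirical risk minimization over a linear family can be sketched by brute force on a toy dataset. The tiny parameter grid below is purely an illustrative assumption; practical methods optimize a surrogate loss instead of searching a grid:

```python
import random

random.seed(1)

# Toy data in R^2, separable by the line x1 = x2
data = []
for _ in range(200):
    x = (random.random(), random.random())
    data.append((x, 1 if x[0] - x[1] > 0 else -1))

def sign(t):
    return 1 if t >= 0 else -1

def emp_risk(beta, beta0):      # (1/n) sum of 0/1 losses over the sample
    return sum(1 for x, y in data
               if sign(beta[0] * x[0] + beta[1] * x[1] + beta0) != y) / len(data)

# Crude ERM: exhaustive search over a small grid of (beta, beta0)
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
b1, b2, b0 = min(((b1, b2, b0) for b1 in grid for b2 in grid for b0 in grid),
                 key=lambda p: emp_risk((p[0], p[1]), p[2]))
print((b1, b2, b0), emp_risk((b1, b2), b0))
```

Since the true separator (β = (1, −1), β₀ = 0) lies on the grid, the minimizer reaches zero empirical risk here.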

(11)

Example: TwoClass Dataset

Synthetic Dataset

Two features/covariates.

Two classes.

Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer.

Numerical experiments with R and the caret package.

(12)

Example: Linear Discrimination

(13)

Example: More Complex Model

(14)

Under-fitting / Over-fitting Issue

Different behavior for different model complexity

Under-fit: low complexity models are easily learned, but too simple to explain the truth.

Over-fit: high complexity models memorize the data they have seen and are unable to generalize to unseen examples.

(15)

Under-fitting / Over-fitting Issue

We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training data and the test data.

How can we estimate the test error?
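A sketch of this train/test diagnostic on synthetic data, using a plain k-nearest-neighbors rule. The 1-D distribution and the 20% label noise are assumptions chosen so that a 1-NN rule visibly overfits:

```python
import random

random.seed(2)

def labeled_point():
    x = random.random()
    y = 1 if x > 0.5 else -1
    if random.random() < 0.2:   # 20% label noise
        y = -y
    return x, y

train = [labeled_point() for _ in range(200)]
test = [labeled_point() for _ in range(1000)]

def knn_predict(x, data, k):
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) >= 0 else -1

def error(data, k, eval_set):
    return sum(1 for x, y in eval_set if knn_predict(x, data, k) != y) / len(eval_set)

# 1-NN memorizes the training set: zero training error, but a large
# train/test gap reveals over-fitting. A larger k smooths the rule.
for k in (1, 25):
    print(k, error(train, k, train), error(train, k, test))
```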

(16)

Binary Classification Loss Issue

Empirical Risk Minimizer

f̂ = argmin_{f ∈ S} (1/n) Σ_{i=1}^n ℓ_{0/1}(Y_i, f(X_i))

Classification loss: ℓ_{0/1}(y, f(x)) = 1{y ≠ f(x)}

Not convex and not smooth!

(17)

Statistical Point of View Ideal Solution and Estimation

The best solution f* (which is independent of D_n) is

f* = argmin_{f ∈ F} R(f) = argmin_{f ∈ F} E[ℓ(Y, f(X))]

Bayes Predictor (Ideal solution)

In binary classification with the 0−1 loss:

f*(X) = +1 if P{Y = +1 | X} ≥ P{Y = −1 | X}, −1 otherwise

Issue: the explicit solution requires knowing E[Y | X] for all values of X!
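When P is known, the Bayes predictor can be written down explicitly. A sketch with a hypothetical regression function η(x) = P{Y = +1 | X = x} = x on X ∼ Uniform(0, 1); in that model the Bayes risk E[min(η(X), 1 − η(X))] equals 1/4:

```python
# Hypothetical known model: X ~ Uniform(0, 1), eta(x) = P{Y = +1 | X = x} = x
def eta(x):
    return x

def bayes_predictor(x):         # f*(x) = +1 iff P{Y=+1|x} >= P{Y=-1|x}
    return 1 if eta(x) >= 0.5 else -1

# Bayes risk R(f*) = E[min(eta(X), 1 - eta(X))], here the integral of
# min(x, 1 - x) over [0, 1], approximated by a midpoint Riemann sum.
m = 100_000
bayes_risk = sum(min(eta((i + 0.5) / m), 1 - eta((i + 0.5) / m))
                 for i in range(m)) / m
print(round(bayes_risk, 3))     # close to 0.25
```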

(18)

Conditional prob. and Bayes Predictor

(19)

Classification Loss and Convexification

Classification loss: ℓ_{0/1}(y, f(x)) = 1{y ≠ f(x)}

Not convex and not smooth!

Classical convexification

Logistic loss: ℓ(y, f(x)) = log(1 + e^{−y f(x)}) (Logistic regression / Neural networks)

Hinge loss: ℓ(y, f(x)) = (1 − y f(x))₊ (SVM)

Exponential loss: ℓ(y, f(x)) = e^{−y f(x)} (Boosting…)
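The three surrogates can be written as functions of the margin m = y·f(x). A short sketch: all three are convex and decreasing in the margin, so a confident wrong prediction (m < 0) is charged much more than a confident correct one:

```python
import math

# The classical convex surrogates as functions of the margin m = y * f(x)
def logistic(m):
    return math.log(1 + math.exp(-m))

def hinge(m):
    return max(0.0, 1 - m)

def exponential(m):
    return math.exp(-m)

# Compare the three losses at a wrong (m = -2), borderline (m = 0)
# and correct (m = 2) prediction.
for m in (-2, 0, 2):
    print(m, round(logistic(m), 3), round(hinge(m), 3), round(exponential(m), 3))
```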

(20)

Machine Learning

(21)

Methods (Today):

1. k-Nearest Neighbors
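A minimal sketch of the k-nearest-neighbors rule for labels in {−1, +1}: majority vote among the k closest training points under squared Euclidean distance (ties and efficiency concerns are ignored; the tiny training set is made up for illustration):

```python
def knn_classify(x, train, k=3):
    """Predict the majority label among the k nearest training points."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda p: sq_dist(p[0], x))[:k]
    vote = sum(y for _, y in nearest)
    return 1 if vote >= 0 else -1

# Tiny hand-made training set: a -1 cluster near the origin, +1 near (1, 1)
train = [((0.0, 0.0), -1), ((0.1, 0.1), -1),
         ((1.0, 1.0), 1), ((0.9, 1.0), 1), ((1.0, 0.9), 1)]
print(knn_classify((0.95, 0.95), train))   # -> 1
print(knn_classify((0.05, 0.05), train))   # -> -1
```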

(22)

Bibliography

T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Springer Series in Statistics.

G. James, D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer.

B. Schölkopf and A. Smola (2002). Learning with Kernels. The MIT Press.
