Classification
Penalization, Optimisation and SVM
Ana Karina Fermin
ISEFAR
Penalization
Penalized Loss Minimization

Minimization of
$$\operatorname*{argmin}_{\theta \in \Theta}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f_\theta(X_i)) + \mathrm{pen}(\theta)$$
where $\mathrm{pen}(\theta)$ is a penalty.
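As a concrete instance, a minimal numerical sketch, assuming a squared loss and the ridge penalty $\mathrm{pen}(\theta) = \lambda\|\theta\|_2^2$ (both choices, and the data, are illustrative); for this pair the penalized minimizer has a closed form.

```python
import numpy as np

# Minimal sketch of penalized loss minimization, assuming squared loss
# and the ridge penalty pen(theta) = lam * ||theta||_2^2 (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # n = 50 samples, 3 covariates
theta_star = np.array([1.0, 0.0, -2.0])
Y = X @ theta_star + 0.1 * rng.normal(size=50)

lam = 0.5
n, d = X.shape
# For this loss/penalty pair the minimizer solves the linear system
# (X^t X / n + lam I) theta = X^t Y / n.
theta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
print(theta_hat)                             # shrunk toward 0 by the penalty
```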
Penalties
A penalty upper-bounds the optimism of the empirical loss; it depends on the loss and the framework!
Variable Selection
Setting: generalized linear model, i.e. prediction of $Y$ by $h(X^t\beta)$.
Model coefficients
Model entirely specified by β.
Coefficientwise:
$\beta_i = 0$ means that the $i$th covariate is not used.
$\beta_i \sim 0$ means that the $i$th covariate has a low influence...
If some covariates are useless, better to use a simpler model...
Submodels
Simplify the model through a constraint on $\beta$!
Examples:
Support: impose that $\beta_i = 0$ for $i \notin I$.
Support size: impose that $\|\beta\|_0 = \sum_{i=1}^{d} 1_{\beta_i \neq 0} < C$.
Norm: impose that $\|\beta\|_p < C$ with $1 \le p$ (often $p = 2$ or $p = 1$).
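A quick numeric illustration of the three constraint types (the vector $\beta$ below is made up for the example):

```python
import numpy as np

# Illustrative beta with a small support: only two nonzero coefficients.
beta = np.array([0.0, 2.5, 0.0, -0.3, 0.0])

support = np.flatnonzero(beta)          # indices i with beta_i != 0
l0 = np.count_nonzero(beta)             # ||beta||_0 (support size)
l1 = np.abs(beta).sum()                 # ||beta||_1
l2 = np.sqrt((beta ** 2).sum())         # ||beta||_2
print(support, l0, l1, l2)              # [1 3], 2, 2.8, ~2.518
```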
Constraint and Penalization
Constrained Optimization

Choose a constant $C$. Compute $\hat\beta$ as
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d,\; \|\beta\|_p \le C}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i))$$

Lagrangian Reformulation

Choose $\lambda$ and compute $\hat\beta$ as
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i)) + \lambda \|\beta\|_p^{p'}$$
with $p' = p$ except if $p = 0$ where $p' = 1$.
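One way to see the correspondence in practice: in the Lagrangian form, increasing $\lambda$ shrinks $\|\hat\beta\|_p$, so each $\lambda$ matches some constraint level $C$. A hedged sketch with scikit-learn's Lasso ($p = 1$; data and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta_star = np.zeros(10)
beta_star[:3] = [2.0, -1.0, 0.5]                 # 3 useful covariates
Y = X @ beta_star + 0.1 * rng.normal(size=100)

# Larger lambda (alpha in scikit-learn) => smaller ||beta_hat||_1,
# i.e. the penalized solution matches a tighter constraint level C.
for lam in [0.01, 0.1, 0.5]:
    beta_hat = Lasso(alpha=lam).fit(X, Y).coef_
    print(lam, np.abs(beta_hat).sum())
```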
Penalization
Penalized Linear Model

Minimization of
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i)) + \mathrm{pen}(\beta)$$
Variable selection if $\hat\beta$ is sparse.
Classical Penalties
AIC: $\mathrm{pen}(\beta) = \lambda \|\beta\|_0$ (non convex / sparsity)
Ridge: $\mathrm{pen}(\beta) = \lambda \|\beta\|_2^2$ (convex / no sparsity)
Lasso: $\mathrm{pen}(\beta) = \lambda \|\beta\|_1$ (convex / sparsity)
Elastic net: $\mathrm{pen}(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$ (convex / sparsity)
Easy optimization if pen (and the loss) is convex...
Need to specify $\lambda$!
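For illustration, the three convex penalties as implemented in scikit-learn (alpha plays the role of $\lambda$; l1_ratio mixes the two elastic-net terms; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
beta_star = np.zeros(10)
beta_star[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.normal(size=100)

models = {
    "ridge":       Ridge(alpha=1.0),                    # lam ||beta||_2^2
    "lasso":       Lasso(alpha=0.1),                    # lam ||beta||_1
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of both
}
for name, model in models.items():
    model.fit(X, Y)
    print(name, (model.coef_ != 0).sum())   # lasso/elastic net are sparse
```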
Logistic Revisited
Ideal solution:
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{S}}\; \frac{1}{n} \sum_{i=1}^{n} \ell^{0/1}(y_i, f(x_i))$$
Logistic regression
Use $f(x) = \langle \beta, x \rangle + b$.
Use the logistic loss $\ell(y, f) = \log_2(1 + e^{-yf})$, i.e. the negative log-likelihood.
Different vision than the statistician's, but the same algorithm!
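A small sketch of this optimisation view, assuming synthetic 2D data: the fitted score $f(x) = \langle \beta, x \rangle + b$ gives a convex surrogate loss that upper-bounds the 0/1 loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

clf = LogisticRegression(C=1.0).fit(X, y)
f = X @ clf.coef_.ravel() + clf.intercept_[0]

logistic_loss = np.log2(1 + np.exp(-y * f)).mean()   # convex surrogate
zero_one_loss = (np.sign(f) != y).mean()             # what we really want
print(logistic_loss, zero_one_loss)   # surrogate >= 0/1 loss pointwise
```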
Methods
Statistical point of view
1. k Nearest-Neighbors ✓
2. Generative Modeling (Naive Bayes, LDA, QDA) ✓
3. Logistic Modeling ✓
Optimisation point of view
1. Logistic Modeling ✓
2. SVM
3. ...
Ideal Separable Case
Linear classifier: $\mathrm{sign}(\langle \beta, x \rangle + b)$
Separable case: $\exists (\beta, b),\; \forall i,\; y_i(\langle \beta, x_i \rangle + b) > 0$!
How to choose $(\beta, b)$ so that the separation is maximal?
Strict separation: $\exists (\beta, b),\; \forall i,\; y_i(\langle \beta, x_i \rangle + b) \ge 1$.
Maximize the distance between the hyperplanes $\langle \beta, x \rangle + b = 1$ and $\langle \beta, x \rangle + b = -1$.
Equivalent to the minimization of $\|\beta\|^2$.
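A sketch of the separable case with scikit-learn: a linear SVC with a very large C approximates the hard-margin solution minimizing $\|\beta\|^2$ under $y_i(\langle \beta, x_i \rangle + b) \ge 1$ (the well-separated data is made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# Very large C ~ hard margin: violations become prohibitively expensive.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
beta, b = clf.coef_.ravel(), clf.intercept_[0]
margins = y * (X @ beta + b)
print(margins.min())   # ~1: every point satisfies y_i(<beta,x_i>+b) >= 1
```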
Non Separable Case
What about the non separable case?
Relax the assumption that $\forall i,\; y_i(\langle \beta, x_i \rangle + b) \ge 1$.
Naive attempt:
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} 1_{y_i(\langle \beta, x_i \rangle + b) \le 1}$$
Non convex minimization!
SVM: a better convex relaxation!
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)$$
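A hedged sketch of the resulting soft-margin problem with scikit-learn's LinearSVC, which minimizes essentially this hinge-loss objective (its C absorbs the $1/n$ factor; the overlapping data is illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=+1.0, size=(50, 2)),
               rng.normal(loc=-1.0, size=(50, 2))])   # overlapping classes
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = LinearSVC(C=C).fit(X, y)
    f = X @ clf.coef_.ravel() + clf.intercept_[0]
    hinge = np.maximum(1 - y * f, 0).mean()           # convex relaxation
    err = (np.sign(f) != y).mean()                    # 0/1 training error
    print(C, hinge, err)
```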
SVM as a Penalized Convex Relaxation
Convex relaxation:
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)
= \operatorname*{argmin}_{\beta, b}\; \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0) + \frac{1}{C} \|\beta\|^2$$

Prop: $\ell^{0/1}(y_i, \mathrm{sign}(\langle \beta, x_i \rangle + b)) \le \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)$

Penalized convex relaxation (Tikhonov!):
$$\frac{1}{n} \sum_{i=1}^{n} \ell^{0/1}(y_i, \mathrm{sign}(\langle \beta, x_i \rangle + b)) \le \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0) + \frac{1}{C} \|\beta\|^2$$
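A pointwise check of the proposition: the hinge loss $\max(1 - yf, 0)$ dominates the 0/1 loss of $\mathrm{sign}(f)$ for every margin value (the grid of margins is arbitrary):

```python
import numpy as np

yf = np.linspace(-2.0, 2.0, 9)          # candidate margins y_i * f(x_i)
zero_one = (yf <= 0).astype(float)      # 0/1 loss of sign(f)
hinge = np.maximum(1 - yf, 0)           # hinge loss
assert np.all(zero_one <= hinge)        # the bound holds everywhere
for m, z, h in zip(yf, zero_one, hinge):
    print(f"yf = {m:+.1f}   0/1 = {z:.0f}   hinge = {h:.1f}")
```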
SVM
The Kernel Trick
Non linear separation: just replace $x$ by a non linear $\Phi(x)$...
Kernel trick
Computing $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ may be easier than computing $\Phi(x)$, $\Phi(x')$ and then the scalar product!
$\Phi$ can be specified through its positive definite kernel $k$. Examples:
linear kernel $k(x, x') = \langle x, x' \rangle$
polynomial kernel $k(x, x') = (1 + \langle x, x' \rangle)^d$
Gaussian kernel $k(x, x') = e^{-\|x - x'\|^2 / 2}$, ...
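A sketch of the trick for the degree-2 polynomial kernel in dimension 2, where the feature map $\Phi$ can still be written out explicitly (the point values are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map with <phi(x), phi(x')> = (1 + <x, x'>)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_direct  = (1 + x @ xp) ** 2     # kernel evaluation: O(d) work
k_feature = phi(x) @ phi(xp)      # via the 6-dimensional feature space
print(k_direct, k_feature)        # identical (here both equal 4.0)
```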