Classification
Penalization, Optimisation and SVM
Ana Karina Fermin
ISEFAR
Penalization
Penalized Loss Minimization

Minimization of
$$\operatorname*{argmin}_{\theta \in \Theta}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f_\theta(X_i)) + \mathrm{pen}(\theta)$$
where $\mathrm{pen}(\theta)$ is a penalty.
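As a concrete instance, a minimal numerical sketch, assuming a squared loss and the ridge penalty $\mathrm{pen}(\theta) = \lambda\|\theta\|_2^2$ (both choices, and the data, are illustrative); for this pair the penalized minimizer has a closed form.

```python
import numpy as np

# Minimal sketch of penalized loss minimization, assuming squared loss
# and the ridge penalty pen(theta) = lam * ||theta||_2^2 (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # n = 50 samples, 3 covariates
theta_star = np.array([1.0, 0.0, -2.0])
Y = X @ theta_star + 0.1 * rng.normal(size=50)

lam = 0.5
n, d = X.shape
# For this loss/penalty pair the minimizer solves the linear system
# (X^t X / n + lam I) theta = X^t Y / n.
theta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
print(theta_hat)                             # shrunk toward 0 by the penalty
```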
Penalties
A penalty upper-bounds the optimism of the empirical loss; it depends on the loss and the framework!
Variable Selection
Setting: generalized linear model, i.e. prediction of $Y$ by $h(X^t\beta)$.
Model coefficients
Model entirely specified by β.
Coefficientwise:
$\beta_i = 0$ means that the $i$th covariate is not used.
$\beta_i \sim 0$ means that the $i$th covariate has a low influence...
If some covariates are useless, better to use a simpler model...
Submodels
Simplify the model through a constraint on $\beta$!
Examples:
Support: impose that $\beta_i = 0$ for $i \notin I$.
Support size: impose that $\|\beta\|_0 = \sum_{i=1}^{d} 1_{\beta_i \neq 0} < C$.
Norm: impose that $\|\beta\|_p < C$ with $1 \le p$ (often $p = 2$ or $p = 1$).
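A quick numeric illustration of the three constraint types (the vector $\beta$ below is made up for the example):

```python
import numpy as np

# Illustrative beta with a small support: only two nonzero coefficients.
beta = np.array([0.0, 2.5, 0.0, -0.3, 0.0])

support = np.flatnonzero(beta)          # indices i with beta_i != 0
l0 = np.count_nonzero(beta)             # ||beta||_0 (support size)
l1 = np.abs(beta).sum()                 # ||beta||_1
l2 = np.sqrt((beta ** 2).sum())         # ||beta||_2
print(support, l0, l1, l2)              # [1 3], 2, 2.8, ~2.518
```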
Constraint and Penalization
Constrained Optimization

Choose a constant $C$. Compute $\hat\beta$ as
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d,\; \|\beta\|_p \le C}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i))$$

Lagrangian Reformulation

Choose $\lambda$ and compute $\hat\beta$ as
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i)) + \lambda \|\beta\|_p^{p'}$$
with $p' = p$ except if $p = 0$ where $p' = 1$.
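One way to see the correspondence in practice: in the Lagrangian form, increasing $\lambda$ shrinks $\|\hat\beta\|_p$, so each $\lambda$ matches some constraint level $C$. A hedged sketch with scikit-learn's Lasso ($p = 1$; data and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta_star = np.zeros(10)
beta_star[:3] = [2.0, -1.0, 0.5]                 # 3 useful covariates
Y = X @ beta_star + 0.1 * rng.normal(size=100)

# Larger lambda (alpha in scikit-learn) => smaller ||beta_hat||_1,
# i.e. the penalized solution matches a tighter constraint level C.
for lam in [0.01, 0.1, 0.5]:
    beta_hat = Lasso(alpha=lam).fit(X, Y).coef_
    print(lam, np.abs(beta_hat).sum())
```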
Penalization
Penalized Linear Model

Minimization of
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d}\; \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, h(\beta^t X_i)) + \mathrm{pen}(\beta)$$
Variable selection if $\hat\beta$ is sparse.
Classical Penalties
AIC: $\mathrm{pen}(\beta) = \lambda \|\beta\|_0$ (non convex / sparsity)
Ridge: $\mathrm{pen}(\beta) = \lambda \|\beta\|_2^2$ (convex / no sparsity)
Lasso: $\mathrm{pen}(\beta) = \lambda \|\beta\|_1$ (convex / sparsity)
Elastic net: $\mathrm{pen}(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$ (convex / sparsity)
Easy optimization if pen (and the loss) is convex...
Need to specify $\lambda$!
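For illustration, the three convex penalties as implemented in scikit-learn (alpha plays the role of $\lambda$; l1_ratio mixes the two elastic-net terms; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
beta_star = np.zeros(10)
beta_star[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.normal(size=100)

models = {
    "ridge":       Ridge(alpha=1.0),                    # lam ||beta||_2^2
    "lasso":       Lasso(alpha=0.1),                    # lam ||beta||_1
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of both
}
for name, model in models.items():
    model.fit(X, Y)
    print(name, (model.coef_ != 0).sum())   # lasso/elastic net are sparse
```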
Logistic Revisited
Ideal solution:
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{S}}\; \frac{1}{n} \sum_{i=1}^{n} \ell^{0/1}(y_i, f(x_i))$$
Logistic regression
Use $f(x) = \langle \beta, x \rangle + b$.
Use the logistic loss $\ell(y, f) = \log_2(1 + e^{-yf})$, i.e. the negative log-likelihood.
Different vision than the statistician's, but the same algorithm!
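A small sketch of this optimisation view, assuming synthetic 2D data: the fitted score $f(x) = \langle \beta, x \rangle + b$ gives a convex surrogate loss that upper-bounds the 0/1 loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

clf = LogisticRegression(C=1.0).fit(X, y)
f = X @ clf.coef_.ravel() + clf.intercept_[0]

logistic_loss = np.log2(1 + np.exp(-y * f)).mean()   # convex surrogate
zero_one_loss = (np.sign(f) != y).mean()             # what we really want
print(logistic_loss, zero_one_loss)   # surrogate >= 0/1 loss pointwise
```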
Methods
Statistical point of view
1. k Nearest-Neighbors ✓
2. Generative Modeling (Naive Bayes, LDA, QDA) ✓
3. Logistic Modeling ✓
Optimisation point of view
1. Logistic Modeling ✓
2. SVM
3. ...
Ideal Separable Case
Linear classifier: $\mathrm{sign}(\langle \beta, x \rangle + b)$
Separable case: $\exists (\beta, b),\; \forall i,\; y_i(\langle \beta, x_i \rangle + b) > 0$!
How to choose $(\beta, b)$ so that the separation is maximal?
Strict separation: $\exists (\beta, b),\; \forall i,\; y_i(\langle \beta, x_i \rangle + b) \ge 1$.
Maximize the distance between the hyperplanes $\langle \beta, x \rangle + b = 1$ and $\langle \beta, x \rangle + b = -1$.
Equivalent to the minimization of $\|\beta\|^2$.
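A sketch of the separable case with scikit-learn: a linear SVC with a very large C approximates the hard-margin solution minimizing $\|\beta\|^2$ under $y_i(\langle \beta, x_i \rangle + b) \ge 1$ (the well-separated data is made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# Very large C ~ hard margin: violations become prohibitively expensive.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
beta, b = clf.coef_.ravel(), clf.intercept_[0]
margins = y * (X @ beta + b)
print(margins.min())   # ~1: every point satisfies y_i(<beta,x_i>+b) >= 1
```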
Non Separable Case
What about the non separable case?
Relax the assumption that $\forall i,\; y_i(\langle \beta, x_i \rangle + b) \ge 1$.
Naive attempt:
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} 1_{y_i(\langle \beta, x_i \rangle + b) \le 1}$$
Non convex minimization!
SVM: a better convex relaxation!
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)$$
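A hedged sketch of the resulting soft-margin problem with scikit-learn's LinearSVC, which minimizes essentially this hinge-loss objective (its C absorbs the $1/n$ factor; the overlapping data is illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=+1.0, size=(50, 2)),
               rng.normal(loc=-1.0, size=(50, 2))])   # overlapping classes
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = LinearSVC(C=C).fit(X, y)
    f = X @ clf.coef_.ravel() + clf.intercept_[0]
    hinge = np.maximum(1 - y * f, 0).mean()           # convex relaxation
    err = (np.sign(f) != y).mean()                    # 0/1 training error
    print(C, hinge, err)
```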
SVM as a Penalized Convex Relaxation
Convex relaxation:
$$\operatorname*{argmin}_{\beta, b}\; \|\beta\|^2 + C \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)
= \operatorname*{argmin}_{\beta, b}\; \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0) + \frac{1}{C} \|\beta\|^2$$

Prop: $\ell^{0/1}(y_i, \mathrm{sign}(\langle \beta, x_i \rangle + b)) \le \max(1 - y_i(\langle \beta, x_i \rangle + b), 0)$

Penalized convex relaxation (Tikhonov!):
$$\frac{1}{n} \sum_{i=1}^{n} \ell^{0/1}(y_i, \mathrm{sign}(\langle \beta, x_i \rangle + b)) \le \frac{1}{n} \sum_{i=1}^{n} \max(1 - y_i(\langle \beta, x_i \rangle + b), 0) + \frac{1}{C} \|\beta\|^2$$
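A pointwise check of the proposition: the hinge loss $\max(1 - yf, 0)$ dominates the 0/1 loss of $\mathrm{sign}(f)$ for every margin value (the grid of margins is arbitrary):

```python
import numpy as np

yf = np.linspace(-2.0, 2.0, 9)          # candidate margins y_i * f(x_i)
zero_one = (yf <= 0).astype(float)      # 0/1 loss of sign(f)
hinge = np.maximum(1 - yf, 0)           # hinge loss
assert np.all(zero_one <= hinge)        # the bound holds everywhere
for m, z, h in zip(yf, zero_one, hinge):
    print(f"yf = {m:+.1f}   0/1 = {z:.0f}   hinge = {h:.1f}")
```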
SVM
The Kernel Trick
Non linear separation: just replace $x$ by a non linear $\Phi(x)$...
Kernel trick
Computing $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ may be easier than computing $\Phi(x)$, $\Phi(x')$ and then the scalar product!
$\Phi$ can be specified through its positive definite kernel $k$. Examples:
linear kernel $k(x, x') = \langle x, x' \rangle$
polynomial kernel $k(x, x') = (1 + \langle x, x' \rangle)^d$
Gaussian kernel $k(x, x') = e^{-\|x - x'\|^2 / 2}$, ...
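A sketch of the trick for the degree-2 polynomial kernel in dimension 2, where the feature map $\Phi$ can still be written out explicitly (the point values are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map with <phi(x), phi(x')> = (1 + <x, x'>)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_direct  = (1 + x @ xp) ** 2     # kernel evaluation: O(d) work
k_feature = phi(x) @ phi(xp)      # via the 6-dimensional feature space
print(k_direct, k_feature)        # identical (here both equal 4.0)
```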