
Logistic regression and convex analysis

Alessandro Rudi, Pierre Gaillard
February 26, 2021

In this class, we will see logistic regression, a widely used classification algorithm. Contrary to linear regression, there is no closed-form solution, and one has to rely on iterative convex optimization algorithms. We will then see the basics of convex analysis.

1 Logistic regression

We will consider the binary classification problem in which one wants to predict outputs in $\{0,1\}$ from inputs in $\mathbb{R}^d$. We consider a training set $D_n := \{(X_i, Y_i)\}_{1 \le i \le n}$. The data points $(X_i, Y_i)$ are i.i.d. random variables and follow a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. Here, $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0,1\}$.

Goal  We would like to use an algorithm similar to linear regression. However, since the outputs $Y_i$ are binary and belong to $\{0,1\}$, we cannot predict them by a linear transformation of the inputs $X_i$ (which belong to $\mathbb{R}^d$). We will thus classify the data with classification rules $f : \mathbb{R}^d \to \mathbb{R}$ such that
$$ f(X_i) \ \begin{cases} > 0 \ \Rightarrow \ Y_i = 1 \\ < 0 \ \Rightarrow \ Y_i = 0, \end{cases} $$
to separate the data into two groups. In particular, we will consider linear functions $f$ of the form $f_\beta : x \mapsto x^\top \beta$. This assumes that the data are well explained by a linear separation (see the figure below).

[Figure: linearly separable data (left) vs. non-linearly separable data (right), with the separating hyperplane $f(X) = 0$ and the regions $f(X) > 0$ and $f(X) < 0$.]

Of course, if the data do not seem to be linearly separable, we can use tricks similar to those mentioned for linear regression (polynomial regression, kernel regression, splines, ...). We look for a feature map $x \mapsto \varphi(x)$ into a higher-dimensional space in which the data are linearly separable. This will be the topic of the class on Kernel methods.
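As a concrete illustration (not from the original notes), here is a minimal Python sketch of this idea on synthetic data: the two classes are not linearly separable in $\mathbb{R}^2$, but become separable after the hand-crafted feature map $\varphi(x) = (x_1, x_2, x_1^2 + x_2^2)$. The data, the map and the chosen hyperplane are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic 2-D data: label 1 inside the unit disk, label 0 outside.
    # These two classes are not linearly separable in R^2.
    n = 200
    X = rng.uniform(-2, 2, size=(n, 2))
    y = (np.sum(X**2, axis=1) < 1.0).astype(int)

    # Feature map phi(x) = (x1, x2, x1^2 + x2^2): in this 3-D feature space
    # the classes are separated by the affine hyperplane {z : z_3 = 1}.
    def phi(X):
        return np.column_stack([X, np.sum(X**2, axis=1)])

    Z = phi(X)
    y_hat = (1.0 - Z[:, 2] > 0).astype(int)   # predict 1 iff ||x||^2 < 1
    print("accuracy with the feature map:", np.mean(y_hat == y))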


Figure 1: Binary, logistic and Hinge loss incurred for a prediction $\hat y := x^\top \beta$ when the true observation is $y = 0$ (x-axis: the prediction $\hat y$; y-axis: the error).

Loss function  To minimize the empirical risk, it remains to choose a loss function to assess the performance of a prediction. A natural loss is the binary loss: 1 if there is a mistake ($f(X_i) \neq Y_i$) and 0 otherwise. The empirical risk is then
$$ \widehat R_n(\beta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{Y_i \neq \mathbf{1}_{X_i^\top \beta > 0}}. $$
This loss function is however not convex in $\beta$. The minimization problem $\min_\beta \widehat R_n(\beta)$ is extremely hard to solve. The idea of logistic regression consists in replacing the binary loss with another, similar loss function which is convex in $\beta$. This is the case of the Hinge loss and of the logistic loss $\ell : \mathbb{R} \times \{0,1\} \to \mathbb{R}_+$. The latter assigns to a linear prediction $\hat y = x^\top \beta$ and an observation $y \in \{0,1\}$ the loss
$$ \ell(\hat y, y) := y \log\bigl(1 + e^{-\hat y}\bigr) + (1 - y) \log\bigl(1 + e^{\hat y}\bigr). \qquad (1) $$

The binary loss, Hinge loss and logistic loss are plotted in Figure 1.
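To make Figure 1 concrete, here is a small Python sketch (not part of the original notes) evaluating the three losses as functions of the prediction $\hat y$ when the observation is $y = 0$; writing the Hinge loss with the usual $\pm 1$ encoding of the labels is an assumption on the convention used in the figure.

    import numpy as np

    def binary_loss(y_hat, y):
        # 0-1 loss: 1 if the sign-based prediction 1_{y_hat > 0} differs from y.
        return (y != (y_hat > 0).astype(int)).astype(float)

    def logistic_loss(y_hat, y):
        # Logistic loss of Equation (1).
        return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

    def hinge_loss(y_hat, y):
        # Hinge loss max(0, 1 - t * y_hat) with the +/-1 encoding t = 2y - 1.
        t = 2 * y - 1
        return np.maximum(0.0, 1.0 - t * y_hat)

    y_hat = np.linspace(-3, 3, 7)
    for loss in (binary_loss, logistic_loss, hinge_loss):
        print(loss.__name__, np.round(loss(y_hat, 0), 3))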

Definition 1.1 (Logistic regression estimator). The logistic regression estimator is the solution of the following minimization problem:
$$ \hat\beta^{(\mathrm{logit})} = \arg\min_{\beta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell(X_i^\top \beta, Y_i), $$
where $\ell$ is the logistic loss defined in Equation (1).

The advantage of the logistic loss over the Hinge loss is that it has a probabilistic interpretation: it models $P(Y = 1 \mid X)$, where $(X, Y)$ is a pair of random variables following the law of the $(X_i, Y_i)$. We will see more on this in the lecture on Maximum Likelihood.

Computation of $\hat\beta^{(\mathrm{logit})}$  Similarly to OLS, we may try to solve the minimization problem analytically by canceling the gradient of the empirical risk. Since $\frac{\partial \ell(\hat y, y)}{\partial \hat y} = \sigma(\hat y) - y$, where $\sigma : z \mapsto \frac{1}{1 + e^{-z}}$ is the logistic function, we have
$$ \nabla \widehat R_n(\beta) = \frac{1}{n} \sum_{i=1}^n X_i \bigl(\sigma(X_i^\top \beta) - Y_i\bigr) = -\frac{1}{n} X^\top \bigl(Y - \sigma(X\beta)\bigr), $$
where $X := (X_1, \dots, X_n)^\top$, $Y := (Y_1, \dots, Y_n)$, and $\sigma(X\beta)_i := \sigma(X_i^\top \beta)$ for $1 \le i \le n$. Bad news: the equation $\nabla \widehat R_n(\beta) = 0$ has no closed-form solution. It needs to be solved with iterative algorithms (gradient descent, Newton's method, ...). Fortunately, this is possible because the logistic loss is convex in its first argument. Indeed,
$$ \frac{\partial^2 \ell(\hat y, y)}{\partial \hat y^2} = \sigma(\hat y)\, \sigma(-\hat y) > 0. $$
The loss is strictly convex, so the solution is unique. In this class and the next one, we will see tools and methods to solve convex optimization problems.
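As a preview of these iterative methods, here is a minimal NumPy sketch of plain gradient descent on the empirical logistic risk, using the gradient formula above. The step size, number of iterations and synthetic data are illustrative assumptions, not prescriptions from the notes.

    import numpy as np

    def sigma(z):
        # Logistic function sigma(z) = 1 / (1 + e^{-z}).
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gradient(beta, X, Y):
        # Gradient of the empirical risk: (1/n) X^T (sigma(X beta) - Y).
        return X.T @ (sigma(X @ beta) - Y) / X.shape[0]

    def gradient_descent(X, Y, step=0.5, n_iter=2000):
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            beta -= step * logistic_gradient(beta, X, Y)
        return beta

    # Illustrative synthetic data drawn from a logistic model.
    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    beta_star = np.array([1.0, -2.0, 0.5])
    Y = (rng.uniform(size=n) < sigma(X @ beta_star)).astype(float)

    print("estimated beta:", np.round(gradient_descent(X, Y), 2))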

Regularization  Similarly to linear regression, logistic regression may over-fit the data (especially when the dimension $d$ is larger than $n$). One then needs to add a regularization term such as $\lambda \|\beta\|_2^2$ to the empirical logistic risk.
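In practice, the $\ell_2$-regularized estimator is available off the shelf, for instance in scikit-learn. Note that scikit-learn's LogisticRegression is parameterized by C, the inverse of the regularization strength, so a small C corresponds to a large $\lambda$; the data and the value C = 1.0 below are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Small illustrative synthetic dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    Y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

    # L2-penalized logistic regression (penalty lambda * ||beta||_2^2, lambda ~ 1/C).
    clf = LogisticRegression(penalty="l2", C=1.0)
    clf.fit(X, Y)
    print(clf.coef_, clf.intercept_)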

2 Convex analysis

We will now see notions of convex analysis that are useful to solve convex optimization problems such as the one arising in logistic regression. For more details on this topic, we refer to the monograph of Boyd and Vandenberghe, 2004. In this class and the next one, we will see two aspects:

– convex analysis: properties of convex functions and convex optimization problems

– convex optimization: algorithms (gradient descent, Newton's method, stochastic gradient descent, ...)

Convexity is a crucial notion in many fields of mathematics and computer science. In machine learning, convexity allows one to get well-defined problems with efficient solutions. A typical example is the problem of empirical risk minimization:
$$ \hat f_n \in \arg\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n \ell\bigl(f(X_i), Y_i\bigr) + \lambda\, \Omega(f), \qquad (*) $$
where $D_n = \{(X_i, Y_i)\}_{1 \le i \le n}$ is the data set, $\mathcal{F}$ is a convex set of predictors $f : \mathcal{X} \to \mathbb{R}$, $\hat y \mapsto \ell(\hat y, y)$ is a convex loss function for all $y \in \mathcal{Y}$, and $\Omega$ is a convex penalty ($\|\cdot\|_2$, $\|\cdot\|_1$, ...).

Convexity will be useful to
– analyze the statistical properties of the solution $\hat f_n$ and its generalization error (i.e., its risk)
$$ \mathcal{R}(\hat f_n) := \mathbb{E}\bigl[\ell(\hat f_n(X), Y) \,\big|\, D_n\bigr]; $$
– get efficient algorithms to solve the minimization problem ($*$) and find $\hat f_n$.


2.1 Convex sets

In this class, we will only consider finite-dimensional Euclidean spaces (typically $\mathbb{R}^d$).

Definition 2.1 (Convex set). A set $K \subseteq \mathbb{R}^d$ is convex if and only if for all $x, y \in K$, $[x, y] \subset K$ (or equivalently, for all $\alpha \in (0,1)$, $\alpha x + (1-\alpha) y \in K$).

Example 2.1. Here are a few examples of convex sets:
– Hyperplanes: $K = \{x \in \mathbb{R}^d : a^\top x = b\}$, with $a \neq 0$, $b \in \mathbb{R}$
– Half-spaces: $K = \{x \in \mathbb{R}^d : a^\top x \ge b\}$, with $a \neq 0$, $b \in \mathbb{R}$
– Affine subspaces: $K = \{x \in \mathbb{R}^d : Ax = b\}$, with $A \in \mathcal{M}_d(\mathbb{R})$, $b \in \mathbb{R}^d$
– Balls: $K = \{x \in \mathbb{R}^d : \|x\| \le R\}$, with $R > 0$
– Cones: $K = \{(x, r) \in \mathbb{R}^{d+1} : \|x\| \le r\}$
– Convex polytopes: intersections of half-spaces.

Properties of convex sets:
– stability by intersection (not necessarily countable)
– stability by affine transformation
– convex separation: if $C, D$ are disjoint convex sets ($C \cap D = \emptyset$), then there exists a hyperplane which separates $C$ and $D$:
$$ \exists\, a \neq 0,\ b \in \mathbb{R} \ \text{ such that } \ C \subset \{a^\top x \ge b\} \ \text{ and } \ D \subset \{a^\top x \le b\}. $$
The inequalities are strict if $C$ and $D$ are compact. Exercise: show this property when $C$ and $D$ are compact (hint: consider $(x, y) \in \arg\min_{x \in C,\, y \in D} \|x - y\|$).

Definition 2.2 (Convex hull). Let $A \subseteq \mathbb{R}^d$. The convex hull of $A$, denoted $\mathrm{Conv}(A)$, is the smallest convex set that contains $A$. In other words,
$$ \mathrm{Conv}(A) = \bigcap \bigl\{ B \subseteq \mathbb{R}^d : A \subseteq B \text{ and } B \text{ convex} \bigr\} = \Bigl\{ x \in \mathbb{R}^d : \exists\, p \ge 1,\ \alpha \in \mathbb{R}_+^p \text{ with } \sum_{i=1}^p \alpha_i = 1, \text{ and } z_1, \dots, z_p \in A \text{ such that } x = \sum_{i=1}^p \alpha_i z_i \Bigr\}. $$

2.2 Convex functions

Definition 2.3 (Convex function). A function $f : D \subset \mathbb{R}^d \to \mathbb{R}$ with $D$ convex is
– convex iff for all $x, y \in D$ and $0 \le \alpha \le 1$,
$$ f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y); $$
– strictly convex iff for all $x \neq y \in D$ and $0 < \alpha < 1$,
$$ f(\alpha x + (1-\alpha) y) < \alpha f(x) + (1-\alpha) f(y); $$
– $\mu$-strongly convex if there exists $\mu > 0$ such that for all $x, y \in D$ and $0 \le \alpha \le 1$,
$$ f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y) - \frac{\mu}{2}\, \alpha (1-\alpha)\, \|x - y\|^2. $$

Proposition 2.1. $f$ is $\mu$-strongly convex if and only if $x \mapsto f(x) - \frac{\mu}{2} \|x\|^2$ is convex.
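The equivalence in Proposition 2.1 follows from a short computation. Writing $g(x) := f(x) - \frac{\mu}{2}\|x\|^2$ and using the identity $\alpha \|x\|^2 + (1-\alpha)\|y\|^2 - \|\alpha x + (1-\alpha) y\|^2 = \alpha(1-\alpha)\|x - y\|^2$, the convexity inequality for $g$ reads
$$ f(\alpha x + (1-\alpha) y) - \frac{\mu}{2}\|\alpha x + (1-\alpha) y\|^2 \le \alpha f(x) + (1-\alpha) f(y) - \frac{\mu}{2}\bigl(\alpha\|x\|^2 + (1-\alpha)\|y\|^2\bigr), $$
which, after rearranging, is exactly the $\mu$-strong convexity inequality for $f$.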


A few examples of useful convex functions:
– dimension $d = 1$: $x$, $x^2$, $-\log x$, $e^x$, $\log(1 + e^{-x})$, $|x|^p$ for $p \ge 1$, $-x^p$ for $0 \le p \le 1$ and $x \ge 0$
– higher dimension $d \ge 1$: linear functions $x \mapsto a^\top x$, quadratic functions $x \mapsto x^\top Q x$ for $Q$ a positive semi-definite (symmetric) matrix (i.e., all eigenvalues are nonnegative, or equivalently $x^\top Q x \ge 0$ for all $x$), norms, $\max\{x_1, \dots, x_d\}$, $\log\bigl(\sum_{i=1}^d e^{x_i}\bigr)$

Characterization of convex functions:
– if $f$ is $C^1$: $f$ convex $\Leftrightarrow$ $\forall x, y \in D$, $f(x) \ge f(y) + \nabla f(y)^\top (x - y)$
– if $f$ is twice differentiable: $f$ convex $\Leftrightarrow$ $\forall x \in D$, its Hessian is positive semi-definite ($f''(x) \succcurlyeq 0$)
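As a small numerical illustration of the second-order characterization (an illustrative sketch, not from the notes), one can check that the Hessian of the log-sum-exp function $x \mapsto \log\sum_{i=1}^d e^{x_i}$, which equals $\mathrm{diag}(p) - p p^\top$ with $p_i = e^{x_i} / \sum_j e^{x_j}$, has nonnegative eigenvalues at a random point:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)

    # Softmax probabilities p_i = exp(x_i) / sum_j exp(x_j).
    p = np.exp(x - x.max())
    p /= p.sum()

    # Hessian of log-sum-exp at x: diag(p) - p p^T.
    H = np.diag(p) - np.outer(p, p)

    # All eigenvalues are >= 0, i.e. the Hessian is positive semi-definite.
    print(np.linalg.eigvalsh(H))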

Operations which preserve convexity:
– supremum of a family: $x \mapsto \sup_{i \in I} f_i(x)$
– linear combination with non-negative coefficients
– partial minimization: $f$ convex on $C \times D$ $\Rightarrow$ $y \mapsto \inf_{x \in C} f(x, y)$ is convex on $D$

Proposition 2.2. If $f$ is convex on $D$, then $f$ is continuous on the interior of $D$. Furthermore, the epigraph of $f$, $\{(x, t) \in D \times \mathbb{R} : f(x) \le t\}$, is convex.

Proposition 2.3 (Jensen's inequality). Let $f$ be convex. For all $x_1, \dots, x_n \in D$ and $\alpha_1, \dots, \alpha_n \in \mathbb{R}_+$ such that $\sum_{i=1}^n \alpha_i = 1$,
$$ f\Bigl(\sum_{i=1}^n \alpha_i x_i\Bigr) \le \sum_{i=1}^n \alpha_i f(x_i). $$

Jensen's inequality can be extended to infinite sums, integrals and expected values. If $f$ is convex:
– integral formulation: if $p(x) \ge 0$ on $S \subset D$ is such that $\int_S p(x)\,dx = 1$, then
$$ f\Bigl(\int_S p(x)\, x\, dx\Bigr) \le \int_S p(x)\, f(x)\, dx; $$
– expected value formulation: if $X$ is a random variable such that $X \in D$ almost surely and $\mathbb{E}[X]$ exists, then
$$ f\bigl(\mathbb{E}[X]\bigr) \le \mathbb{E}\bigl[f(X)\bigr]. $$
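For instance, taking $f(x) = x^2$ (convex) in the expected value formulation recovers the familiar fact that the variance is nonnegative:
$$ \mathbb{E}[X]^2 \le \mathbb{E}[X^2], \qquad \text{i.e.} \qquad \mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \ge 0. $$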

2.3 Unconstrained optimization problems

Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex and finite on $\mathbb{R}^d$. We consider the problem
$$ \inf_{x \in \mathbb{R}^d} f(x). $$
First, remark that we use the notation $\min_x f(x)$ only when the minimum is reached. If no point achieves the minimum, we use the notation $\inf_x f(x)$.

There are three possible cases:
– $\inf_{x \in \mathbb{R}^d} f(x) = -\infty$: there is no minimum. For instance, $x \mapsto x$.
– $\inf_{x \in \mathbb{R}^d} f(x) > -\infty$ and the infimum is not reached. This is the case for instance for $x \mapsto \log(1 + e^{-x})$ or $x \mapsto e^{-x}$.
– $\inf_{x \in \mathbb{R}^d} f(x) > -\infty$ and the minimum is reached and equals $\min_{x \in \mathbb{R}^d} f(x)$. This is the case for instance for coercive functions, i.e., such that $\lim_{\|x\| \to \infty} f(x) = +\infty$.

Definition 2.4 (Local minimum). Let $f : D \to \mathbb{R}$ and $x \in D$. $x$ is a local minimum if and only if there exists an open set $V \subset D$ such that $x \in V$ and $f(x) = \min_{x' \in V} f(x')$.

Properties:
– $f$ convex $\Rightarrow$ any local minimum is a global minimum.
– $f$ strictly convex $\Rightarrow$ there is at most one minimum.
– $f$ convex and $C^1$: $x$ is a minimum of $f$ on $\mathbb{R}^d$ if and only if $\nabla f(x) = 0$.

As we saw for linear regression, and as we will use again in the next class, canceling the gradient can provide an efficient way to solve the minimization problem in closed form.
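For logistic regression no such closed form exists, but the same gradient, together with the Hessian, can be plugged into Newton's method, one of the iterative algorithms of the next class. Here is a minimal sketch; the number of iterations and the small ridge term keeping the Hessian invertible are illustrative assumptions.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, Y, n_iter=20, ridge=1e-8):
        # Newton's method on the empirical logistic risk.
        # Gradient: g = (1/n) X^T (sigma(X beta) - Y)
        # Hessian:  H = (1/n) X^T diag(w) X, with w_i = sigma(x_i^T beta) sigma(-x_i^T beta)
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(n_iter):
            p = sigma(X @ beta)
            g = X.T @ (p - Y) / n
            w = p * (1.0 - p)
            H = (X.T * w) @ X / n + ridge * np.eye(d)
            beta -= np.linalg.solve(H, g)
        return beta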

References

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
