
(1)

A Streaming Algorithm for Logistic Regression

Thomas Bonald

Joint work with Till Wohlfarth (Veepee)

DIG Seminar June 2021

(2)

Motivation

How to rank items according to user preferences?

[Figure: screenshot of the Veepee website]

(3)

Binary classification

Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)

learn some mapping $f : \mathbb{R}^d \to \{0, 1\}$ such that
$$\forall i, \quad y_i \approx f(x_i)$$

[Figure: samples (left) and decision (right)]

(4)

Regression

Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)

learn some mapping $f : \mathbb{R}^d \to [0, 1]$ such that
$$\forall i, \quad y_i \approx f(x_i)$$

[Figure: samples (left) and decision (right)]

(5)

Logistic regression

Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)

learn some parameter $\theta$ such that
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$

[Figure: samples (left) and decision (right)]

(6)

Loss function

Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)

learn some parameter $\theta$ such that
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$

Regularized cross-entropy
$$L(\theta) = -\sum_{i=1}^n \left[ y_i \log p_i(\theta) + (1 - y_i) \log(1 - p_i(\theta)) \right] + \frac{\lambda}{2} \|\theta\|^2$$

(7)

Outline

1. Offline learning
2. Online learning
3. Experiments
4. Summary

(8)

A convex optimization problem

Objective:
$$\arg\min_{\theta \in \mathbb{R}^d} L(\theta)$$

Gradient
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^n x_i \,(p_i(\theta) - y_i)$$

Hessian matrix
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^n p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T > 0$$
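As a concrete reference, here is a minimal NumPy sketch of these three quantities; the helper names (sigmoid, loss, gradient, hessian) are mine, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y, lam):
    # Regularized cross-entropy L(theta)
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + lam / 2 * (theta @ theta)

def gradient(theta, X, y, lam):
    # grad L(theta) = lam * theta + sum_i x_i (p_i - y_i)
    p = sigmoid(X @ theta)
    return lam * theta + X.T @ (p - y)

def hessian(theta, X, y, lam):
    # Hess L(theta) = lam * I + sum_i p_i (1 - p_i) x_i x_i^T  (positive definite)
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X
```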

(9)

Gradient descent

Algorithm: iterate
$$\theta \leftarrow \theta - \gamma \nabla L(\theta)$$
where $\gamma$ is the learning rate.

[Figure: convergence for small $\gamma$ (left) vs. large $\gamma$ (right)]
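A batch gradient-descent loop built on the gradient defined above; the default values of $\gamma$ and the iteration count are illustrative choices, not taken from the slides.

```python
def gradient_descent(X, y, lam=1.0, gamma=0.01, n_iter=1000):
    # Batch gradient descent: theta <- theta - gamma * grad L(theta)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= gamma * gradient(theta, X, y, lam)
    return theta
```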

(10)

Newton’s method

Algorithm: iterate
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$

Note: this follows from minimizing, in $t$, the second-order approximation
$$L(\theta + t) \approx L(\theta) + t^T \nabla L(\theta) + \frac{1}{2}\, t^T \nabla^2 L(\theta)\, t$$
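The corresponding offline Newton loop, again as a sketch reusing the gradient and hessian helpers above; a linear solve replaces the explicit inverse.

```python
def newton(X, y, lam=1.0, n_iter=10):
    # Newton's method: theta <- theta - Hess^{-1} grad
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= np.linalg.solve(hessian(theta, X, y, lam),
                                 gradient(theta, X, y, lam))
    return theta
```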


(12)

Outline

1. Offline learning
2. Online learning
3. Experiments
4. Summary

(13)

Gradient descent

From
$$\nabla L(\theta) = \lambda \theta + \underbrace{\sum_{i=1}^n x_i \,(p_i(\theta) - y_i)}_{S}$$
we get the update
$$\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$$

Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$
For each new sample $(x, y)$:
- $p \leftarrow \dfrac{1}{1 + e^{-\theta^T x}}$
- $S \leftarrow S + x\,(p - y)$
- $\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$
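A direct transcription of this online algorithm in Python; the stream is assumed to yield $(x, y)$ pairs, and sigmoid is the helper defined earlier.

```python
def online_gradient_descent(stream, d, lam=1.0, gamma=0.01):
    # theta and the running sum S start at zero
    theta = np.zeros(d)
    S = np.zeros(d)
    for x, y in stream:
        p = sigmoid(theta @ x)   # p <- 1 / (1 + exp(-theta^T x))
        S += x * (p - y)         # S <- S + x (p - y)
        theta = (1 - lam * gamma) * theta - gamma * S
    return theta
```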

(14)

Newton’s method

The update is
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$
with
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^n x_i \,(p_i(\theta) - y_i)$$
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^n p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T$$

How to compute the inverse of $\nabla^2 L(\theta)$ in a streaming fashion?

(15)

The Sherman-Morrison formula

Proposition
For any invertible matrix $A$ and vectors $u, v$,
$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u\, v^T A^{-1}}{1 + v^T A^{-1} u}$$

Sherman & Morrison, 1949
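A quick numerical check of the identity with NumPy, on a random well-conditioned matrix (the specific sizes and values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # close to identity, hence invertible
u, v = rng.normal(size=d), rng.normal(size=d)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))   # True
```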

(16)

Newton’s method

Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$, $\Gamma \leftarrow I / \lambda$
For each new sample $(x, y)$:
- $p \leftarrow \dfrac{1}{1 + e^{-\theta^T x}}$
- $v \leftarrow p\,(1 - p)$
- $S \leftarrow S + x\,(y - p + v\, \theta^T x)$
- $\Gamma \leftarrow \Gamma - \dfrac{v\, \Gamma x x^T \Gamma}{1 + v\, x^T \Gamma x}$
- $\theta \leftarrow \Gamma S$
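A sketch of this streaming Newton update, transcribing the rules above line by line; Gamma plays the role of $\Gamma$, the running inverse of the regularized Hessian maintained via Sherman-Morrison.

```python
def streaming_newton(stream, d, lam=1.0):
    theta = np.zeros(d)
    S = np.zeros(d)
    Gamma = np.eye(d) / lam            # Gamma <- I / lambda
    for x, y in stream:
        p = sigmoid(theta @ x)         # p <- 1 / (1 + exp(-theta^T x))
        v = p * (1 - p)
        S += x * (y - p + v * (theta @ x))
        Gx = Gamma @ x
        Gamma -= v * np.outer(Gx, Gx) / (1 + v * (x @ Gx))
        theta = Gamma @ S              # theta <- Gamma S
    return theta
```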

(17)

Outline

1. Offline learning
2. Online learning
3. Experiments
4. Summary

(18)

Experiments

Classification: $y, \hat{y} \in \{0, 1\}$
$$F_1 = \frac{2}{\text{recall}^{-1} + \text{precision}^{-1}}$$

Ranking: $p, \hat{p} \in [0, 1]$
$$\mathrm{FCP} \propto \sum_{i,j} \mathbf{1}_{\{(p_i - p_j)(\hat{p}_i - \hat{p}_j) > 0\}}$$
(Fraction of Concordant Pairs)
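A possible implementation of FCP; the slides only give the sum up to a proportionality constant, so normalizing by the total number of pairs is an assumption.

```python
def fcp(p_true, p_pred):
    # Fraction of pairs (i, j) ranked in the same order by p_true and p_pred
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    i, j = np.triu_indices(len(p_true), k=1)
    concordant = (p_true[i] - p_true[j]) * (p_pred[i] - p_pred[j]) > 0
    return concordant.mean()
```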

(19)

Preprocessing

Samples are centered and rescaled (in stream) before learning, using a standard scaler.

[Figure: a raw feature stream and its rescaled version over time]
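The slides do not say how the running statistics are maintained; one standard choice is Welford's online mean/variance update, sketched below as an assumption.

```python
class StreamingScaler:
    # Running per-feature mean and variance (Welford's algorithm)
    def __init__(self, d):
        self.n, self.mean, self.m2 = 0, np.zeros(d), np.zeros(d)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        if self.n == 0:
            return x
        std = np.sqrt(self.m2 / self.n)
        return (x - self.mean) / np.where(std > 0, std, 1.0)
```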

(20)

Scenarios

1. All samples
At each time $t$, predict sample $x_t$, then learn using label $y_t$.
[Figure: sample stream over time, 0 to 300]

2. Cold start
Learn until time $t = 20$, then test over the next 200 samples.
[Figure: a short training prefix followed by a 200-sample test window]
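Both scenarios above can be expressed as a few lines of evaluation code; the model interface (predict/learn) is hypothetical, meant to wrap any of the online algorithms sketched earlier.

```python
def all_samples(stream, model):
    # Scenario 1: at each time t, predict x_t, then learn from (x_t, y_t)
    scores, labels = [], []
    for x, y in stream:
        scores.append(model.predict(x))
        model.learn(x, y)
        labels.append(y)
    return scores, labels

def cold_start(stream, model, n_train=20, n_test=200):
    # Scenario 2: learn on the first n_train samples, test on the next n_test
    data = list(stream)
    for x, y in data[:n_train]:
        model.learn(x, y)
    return [(model.predict(x), y) for x, y in data[n_train:n_train + n_test]]
```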

(21)

OpenML datasets

All OpenML datasets with:
- binary labels
- no missing data
- $2 \le d \le 100$ features
- $n \ge 100$ samples
- accuracy $< 0.99$ (offline)

→ 36 datasets

[Figure: the 36 datasets by number of samples ($10^2$ to $10^5$) and number of features ($10^1$ to $10^2$), log scale]

See www.openml.org

(22)

Classification

Distribution of F1 score over the 36 OpenML datasets

[Figure: F1 distributions (0 to 1) for Newton, GD, and Random under the "all samples" and "cold start" scenarios, and running time for learning (0 to 0.6 ms)]

(23)

Synthetic data

Let $S$ be the unit sphere in dimension $d$.

Model
- $\theta \sim \mathcal{U}(S)$
- $x \sim \mathcal{U}(S)$
- $y \sim \mathcal{B}(p)$ with
$$p = \frac{1}{1 + e^{-\alpha\, \theta^T x}}$$

Example: distribution of $p$ for $n = 100$, $d = 10$, $\alpha \in \{1, 10\}$
[Figure: histograms of $p$ over $[0, 1]$ for the two values of $\alpha$]
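A generator for this synthetic model; points uniform on the unit sphere are obtained by normalizing Gaussian vectors, and the function name and defaults are mine.

```python
import numpy as np

def synthetic_data(n=100, d=10, alpha=10.0, seed=0):
    # theta and the x_i are uniform on the unit sphere,
    # y_i ~ Bernoulli(p_i) with p_i = 1 / (1 + exp(-alpha * theta^T x_i))
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    p = 1.0 / (1.0 + np.exp(-alpha * X @ theta))
    y = rng.binomial(1, p)
    return X, y, theta, p
```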

(24)

Ranking (α = 10)

Distribution of FCP over 100 independent instances

[Figure: FCP distributions (0.4 to 1.0) for Newton, GD, and Random under the "all samples" and "cold start" scenarios, and running time for learning (0 to 0.2 ms)]

(25)

Ranking (α = 1)

Distribution of FCP over 100 independent instances

[Figure: FCP distributions (0.4 to 1.0) for Newton, GD, and Random under the "all samples" and "cold start" scenarios, and running time for learning (0 to 0.2 ms)]

(26)

Application to Veepee data

Data collected in April 2021:
- 10 days
- 105,170 new users
- $d = 132$ binary features

[Figure: FCP (between 0.50 and 0.70) as a function of the number of clicks (0 to 100) for the online, offline, and random methods, and running time (0 to 0.15 ms) per method]

(27)

Summary

A novel streaming algorithm for logistic regression, based on
- Newton's method
- the Sherman-Morrison formula

Efficient and parameter-free.

Future work:
- Benchmarking
- Convergence
- High dimension
- Control (bandit algorithm)
