A Streaming Algorithm for Logistic Regression
Thomas Bonald
Joint work with Till Wohlfarth (Veepee)
DIG Seminar June 2021
Motivation
How to rank items according to user preferences?
[Figure: the Veepee website]
Binary classification
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some mapping $f : \mathbb{R}^d \to \{0, 1\}$ such that:
$$\forall i, \quad y_i \approx f(x_i)$$
[Figure: samples (left) and decision (right)]
Regression
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some mapping $f : \mathbb{R}^d \to [0, 1]$ such that:
$$\forall i, \quad y_i \approx f(x_i)$$
[Figure: samples (left) and decision (right)]
Logistic regression
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some parameter $\theta$ such that:
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$
[Figure: samples (left) and decision (right)]
Loss function
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some parameter $\theta$ such that:
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$
Regularized cross-entropy:
$$L(\theta) = -\sum_{i=1}^{n} \left[ y_i \log p_i(\theta) + (1 - y_i) \log(1 - p_i(\theta)) \right] + \frac{\lambda}{2} \|\theta\|^2$$
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
A convex optimization problem
Objective:
$$\arg\min_{\theta \in \mathbb{R}^d} L(\theta)$$
Gradient
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^{n} x_i (p_i(\theta) - y_i)$$
Hessian matrix
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^{n} p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T \succ 0$$
The Hessian is positive definite, so $L$ is strictly convex and has a unique minimizer.
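For reference, a minimal NumPy sketch of these three quantities; the function names (loss, grad, hessian) are ours, not from the talk:

    import numpy as np

    def sigmoid(z):
        # p(theta) = 1 / (1 + exp(-z)), applied elementwise
        return 1.0 / (1.0 + np.exp(-z))

    def loss(theta, X, y, lam):
        # Regularized cross-entropy L(theta)
        p = sigmoid(X @ theta)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + 0.5 * lam * (theta @ theta)

    def grad(theta, X, y, lam):
        # lam * theta + sum_i x_i (p_i - y_i)
        p = sigmoid(X @ theta)
        return lam * theta + X.T @ (p - y)

    def hessian(theta, X, lam):
        # lam * I + sum_i p_i (1 - p_i) x_i x_i^T
        p = sigmoid(X @ theta)
        v = p * (1 - p)
        return lam * np.eye(X.shape[1]) + (X * v[:, None]).T @ X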
Gradient descent
Algorithm
Iterate:
$$\theta \leftarrow \theta - \gamma \nabla L(\theta)$$
where $\gamma$ is the learning rate.
[Figure: convergence for small $\gamma$ (left) and large $\gamma$ (right)]
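A minimal offline loop, reusing grad from the sketch above (the step size and iteration count are illustrative choices, not from the talk):

    def gradient_descent(X, y, lam=1.0, gamma=0.01, n_iter=1000):
        # Batch gradient descent on the regularized cross-entropy
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            theta -= gamma * grad(theta, X, y, lam)
        return theta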
Newton’s method
Algorithm
Iterate:
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$
Note: this follows from the minimization in $t$ of
$$L(\theta + t) \approx L(\theta) + t^T \nabla L(\theta) + \frac{1}{2}\, t^T \nabla^2 L(\theta)\, t$$
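The corresponding offline sketch, solving a linear system rather than forming the inverse (a standard implementation choice, not prescribed by the talk), reusing grad and hessian from above:

    def newton(X, y, lam=1.0, n_iter=20):
        # Newton's method: theta <- theta - H^{-1} grad, via a linear solve
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            H = hessian(theta, X, lam)
            theta -= np.linalg.solve(H, grad(theta, X, y, lam))
        return theta

Few iterations are needed, but each one costs $O(nd^2 + d^3)$, which motivates the streaming versions below.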
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
Gradient descent
From
$$\nabla L(\theta) = \lambda \theta + \underbrace{\sum_{i=1}^{n} x_i (p_i(\theta) - y_i)}_{S}$$
we get
$$\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$$
Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$
For each new sample $(x, y)$:
- $p \leftarrow \frac{1}{1 + e^{-\theta^T x}}$
- $S \leftarrow S + x(p - y)$
- $\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$
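A direct transcription in Python, reusing sigmoid from above (the class and method names, echoing scikit-learn's partial_fit convention, are ours):

    class OnlineGD:
        # Streaming gradient descent for logistic regression
        def __init__(self, d, lam=1.0, gamma=0.01):
            self.theta = np.zeros(d)
            self.S = np.zeros(d)  # running sum of per-sample gradient terms
            self.lam, self.gamma = lam, gamma

        def partial_fit(self, x, y):
            p = sigmoid(self.theta @ x)
            self.S += x * (p - y)
            self.theta = (1 - self.lam * self.gamma) * self.theta - self.gamma * self.S

        def predict_proba(self, x):
            return sigmoid(self.theta @ x)

Each update costs $O(d)$, but the learning rate $\gamma$ still has to be tuned.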
Newton’s method
The update is
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$
with
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^{n} x_i (p_i(\theta) - y_i)$$
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^{n} p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T$$
How to compute the inverse of $\nabla^2 L(\theta)$ in a streaming fashion?
The Sherman-Morrison formula
Proposition
For any invertible matrix $A$ and vectors $u, v$ with $1 + v^T A^{-1} u \neq 0$:
$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$$
Sherman & Morrison 1949
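A quick numerical sanity check of the identity (illustrative only):

    import numpy as np

    # Verify Sherman-Morrison on a random 5x5 example
    rng = np.random.default_rng(0)
    A = np.eye(5) + 0.1 * rng.standard_normal((5, 5))
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    A_inv = np.linalg.inv(A)
    left = np.linalg.inv(A + np.outer(u, v))
    right = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1 + v @ A_inv @ u)
    assert np.allclose(left, right)

The key point for streaming: if $A^{-1}$ is already known, the inverse of the rank-one update $A + uv^T$ costs only $O(d^2)$ instead of $O(d^3)$.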
Newton’s method
Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$, $\Gamma \leftarrow I/\lambda$
For each new sample $(x, y)$:
- $p \leftarrow \frac{1}{1 + e^{-\theta^T x}}$
- $v \leftarrow p(1 - p)$
- $S \leftarrow S + x\,(y - p + v\,\theta^T x)$
- $\Gamma \leftarrow \Gamma - \frac{v\, \Gamma x x^T \Gamma}{1 + v\, x^T \Gamma x}$
- $\theta \leftarrow \Gamma S$
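A direct transcription in Python, reusing sigmoid from above (class and method names are ours); note that $\Gamma x x^T \Gamma = (\Gamma x)(\Gamma x)^T$ since $\Gamma$ is symmetric:

    class OnlineNewton:
        # Streaming Newton's method with Sherman-Morrison rank-one updates
        def __init__(self, d, lam=1.0):
            self.theta = np.zeros(d)
            self.S = np.zeros(d)          # running right-hand side
            self.Gamma = np.eye(d) / lam  # running inverse of the Hessian

        def partial_fit(self, x, y):
            p = sigmoid(self.theta @ x)
            v = p * (1 - p)
            self.S += x * (y - p + v * (self.theta @ x))
            Gx = self.Gamma @ x
            self.Gamma -= v * np.outer(Gx, Gx) / (1 + v * (x @ Gx))
            self.theta = self.Gamma @ self.S

        def predict_proba(self, x):
            return sigmoid(self.theta @ x)

Each update costs $O(d^2)$, and there is no learning rate to tune.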
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
Experiments
Classification
$y, \hat{y} \in \{0, 1\}$
$$F_1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}$$
Ranking
$p, \hat{p} \in [0, 1]$
$$\mathrm{FCP} \propto \sum_{i,j} 1_{(p_i - p_j)(\hat{p}_i - \hat{p}_j) > 0}$$
(Fraction of Concordant Pairs)
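A possible implementation of FCP; the slide only states proportionality, so normalizing by the number of pairs with distinct true scores is our assumption:

    import numpy as np

    def fcp(p, p_hat):
        # Fraction of concordant pairs between true and predicted scores
        p, p_hat = np.asarray(p), np.asarray(p_hat)
        dp = np.subtract.outer(p, p)          # p_i - p_j for all pairs
        dq = np.subtract.outer(p_hat, p_hat)  # p_hat_i - p_hat_j
        return np.sum(dp * dq > 0) / np.sum(dp != 0)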
Preprocessing
Samples are centered and rescaled (in stream) before learning.
[Figure: standard scaler applied to a sample stream]
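The talk does not spell out the scaler; a standard in-stream version maintains running per-feature means and variances (Welford's algorithm), e.g.:

    import numpy as np

    class StreamingScaler:
        # Per-feature running mean and variance (Welford's algorithm)
        def __init__(self, d):
            self.n = 0
            self.mean = np.zeros(d)
            self.m2 = np.zeros(d)

        def partial_fit(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def transform(self, x):
            std = np.sqrt(self.m2 / max(self.n, 1))
            return (x - self.mean) / np.where(std > 0, std, 1.0)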
Scenarios
1. All samples
At each time $t$, predict sample $x_t$, then learn using label $y_t$
[Figure: sample stream over time]
2. Cold start
Learn until time $t = 20$ and test over the next 200 samples
[Figure: train / test split of the stream over time]
OpenML datasets
All OpenML datasets with:
- binary labels
- no missing data
- $2 \le d \le 100$ features
- $n \ge 100$ samples
- accuracy $< 0.99$ (offline)
→ 36 datasets
[Figure: the 36 datasets, number of samples ($10^2$ to $10^5$) vs. number of features ($10^1$ to $10^2$)]
See www.openml.org
Classification
Distribution of $F_1$ score over the 36 OpenML datasets
[Figure: box plots of $F_1$ for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Synthetic data
Let $S$ be the unit sphere in dimension $d$
Model
- $\theta \sim \mathcal{U}(S)$
- $x \sim \mathcal{U}(S)$
- $y \sim \mathcal{B}(p)$ with
$$p = \frac{1}{1 + e^{-\alpha \theta^T x}}$$
Example: distribution of $p$ for $n = 100$, $d = 10$, $\alpha \in \{1, 10\}$
[Figure: histograms of $p$ for the two values of $\alpha$]
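A sketch of this generator, reusing sigmoid from above (uniform points on the sphere obtained by normalizing Gaussian vectors; the function name is ours):

    def synthetic_data(n=100, d=10, alpha=10.0, seed=0):
        # theta and x uniform on the unit sphere, y ~ Bernoulli(p)
        rng = np.random.default_rng(seed)
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        p = sigmoid(alpha * X @ theta)
        y = rng.binomial(1, p)
        return X, y, p, theta

Larger $\alpha$ pushes $p$ towards 0 and 1, making the labels less noisy.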
Ranking (α = 10)
Distribution of FCP over 100 independent instances
[Figure: box plots of FCP for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Ranking (α = 1)
Distribution of FCP over 100 independent instances
[Figure: box plots of FCP for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Application to Veepee data
Data collected in April 2021:
- 10 days
- 105,170 new users
- $d = 132$ binary features
[Figure: FCP as a function of the number of clicks, for Online, Offline and Random; running time (ms) for each method]
Summary
A novel streaming algorithm for logistic regression, based on:
- Newton's method
- the Sherman-Morrison formula
Efficient and parameter-free: unlike gradient descent, there is no learning rate $\gamma$ to tune.
Future work:
- Benchmarking
- Convergence
- High dimension
- Control (bandit algorithms)