A Streaming Algorithm for Logistic Regression
Thomas Bonald
Joint work with Till Wohlfarth (Veepee)
DIG Seminar June 2021
Motivation
How to rank items according to user preferences?
[Figure: the Veepee website]
Binary classification
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some mapping $f : \mathbb{R}^d \to \{0, 1\}$ such that:
$$\forall i, \quad y_i \approx f(x_i)$$
[Figure: samples (left) and decision (right)]
Regression
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some mapping $f : \mathbb{R}^d \to [0, 1]$ such that:
$$\forall i, \quad y_i \approx f(x_i)$$
[Figure: samples (left) and decision (right)]
Logistic regression
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some parameter $\theta$ such that:
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$
[Figure: samples (left) and decision (right)]
Loss function
Given
- samples $x_1, \dots, x_n \in \mathbb{R}^d$ (items)
- labels $y_1, \dots, y_n \in \{0, 1\}$ (click / no click)
learn some parameter $\theta$ such that:
$$\forall i, \quad y_i \approx p_i(\theta) = \frac{1}{1 + e^{-\theta^T x_i}}$$
Regularized cross-entropy:
$$L(\theta) = -\sum_{i=1}^{n} \left[ y_i \log p_i(\theta) + (1 - y_i) \log(1 - p_i(\theta)) \right] + \frac{\lambda}{2} \|\theta\|^2$$
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
A convex optimization problem
Objective:
$$\arg\min_{\theta \in \mathbb{R}^d} L(\theta)$$
Gradient
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^{n} x_i (p_i(\theta) - y_i)$$
Hessian matrix
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^{n} p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T \succ 0$$
The Hessian is positive definite, so $L$ is strictly convex and has a unique minimizer.
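For reference, a minimal NumPy sketch of these three quantities; the function names (loss, grad, hessian) are ours, not from the talk:

    import numpy as np

    def sigmoid(z):
        # p(theta) = 1 / (1 + exp(-z)), applied elementwise
        return 1.0 / (1.0 + np.exp(-z))

    def loss(theta, X, y, lam):
        # Regularized cross-entropy L(theta)
        p = sigmoid(X @ theta)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + 0.5 * lam * (theta @ theta)

    def grad(theta, X, y, lam):
        # lam * theta + sum_i x_i (p_i - y_i)
        p = sigmoid(X @ theta)
        return lam * theta + X.T @ (p - y)

    def hessian(theta, X, lam):
        # lam * I + sum_i p_i (1 - p_i) x_i x_i^T
        p = sigmoid(X @ theta)
        v = p * (1 - p)
        return lam * np.eye(X.shape[1]) + (X * v[:, None]).T @ X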
Gradient descent
Algorithm
Iterate:
$$\theta \leftarrow \theta - \gamma \nabla L(\theta)$$
where $\gamma$ is the learning rate.
[Figure: convergence for small $\gamma$ (left) and large $\gamma$ (right)]
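A minimal offline loop, reusing grad from the sketch above (the step size and iteration count are illustrative choices, not from the talk):

    def gradient_descent(X, y, lam=1.0, gamma=0.01, n_iter=1000):
        # Batch gradient descent on the regularized cross-entropy
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            theta -= gamma * grad(theta, X, y, lam)
        return theta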
Newton’s method
Algorithm
Iterate:
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$
Note: this follows from the minimization in $t$ of
$$L(\theta + t) \approx L(\theta) + t^T \nabla L(\theta) + \frac{1}{2}\, t^T \nabla^2 L(\theta)\, t$$
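The corresponding offline sketch, solving a linear system rather than forming the inverse (a standard implementation choice, not prescribed by the talk), reusing grad and hessian from above:

    def newton(X, y, lam=1.0, n_iter=20):
        # Newton's method: theta <- theta - H^{-1} grad, via a linear solve
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            H = hessian(theta, X, lam)
            theta -= np.linalg.solve(H, grad(theta, X, y, lam))
        return theta

Few iterations are needed, but each one costs $O(nd^2 + d^3)$, which motivates the streaming versions below.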
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
Gradient descent
From
$$\nabla L(\theta) = \lambda \theta + \underbrace{\sum_{i=1}^{n} x_i (p_i(\theta) - y_i)}_{S}$$
we get
$$\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$$
Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$
For each new sample $(x, y)$:
- $p \leftarrow \frac{1}{1 + e^{-\theta^T x}}$
- $S \leftarrow S + x(p - y)$
- $\theta \leftarrow (1 - \lambda\gamma)\,\theta - \gamma S$
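A direct transcription in Python, reusing sigmoid from above (the class and method names, echoing scikit-learn's partial_fit convention, are ours):

    class OnlineGD:
        # Streaming gradient descent for logistic regression
        def __init__(self, d, lam=1.0, gamma=0.01):
            self.theta = np.zeros(d)
            self.S = np.zeros(d)  # running sum of per-sample gradient terms
            self.lam, self.gamma = lam, gamma

        def partial_fit(self, x, y):
            p = sigmoid(self.theta @ x)
            self.S += x * (p - y)
            self.theta = (1 - self.lam * self.gamma) * self.theta - self.gamma * self.S

        def predict_proba(self, x):
            return sigmoid(self.theta @ x)

Each update costs $O(d)$, but the learning rate $\gamma$ still has to be tuned.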
Newton’s method
The update is
$$\theta \leftarrow \theta - \nabla^2 L(\theta)^{-1} \nabla L(\theta)$$
with
$$\nabla L(\theta) = \lambda \theta + \sum_{i=1}^{n} x_i (p_i(\theta) - y_i)$$
$$\nabla^2 L(\theta) = \lambda I + \sum_{i=1}^{n} p_i(\theta)(1 - p_i(\theta))\, x_i x_i^T$$
How to compute the inverse of $\nabla^2 L(\theta)$ in a streaming fashion?
The Sherman-Morrison formula
Proposition
For any invertible matrix $A$ and vectors $u, v$ with $1 + v^T A^{-1} u \neq 0$:
$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$$
Sherman & Morrison 1949
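A quick numerical sanity check of the identity (illustrative only):

    import numpy as np

    # Verify Sherman-Morrison on a random 5x5 example
    rng = np.random.default_rng(0)
    A = np.eye(5) + 0.1 * rng.standard_normal((5, 5))
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    A_inv = np.linalg.inv(A)
    left = np.linalg.inv(A + np.outer(u, v))
    right = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1 + v @ A_inv @ u)
    assert np.allclose(left, right)

The key point for streaming: if $A^{-1}$ is already known, the inverse of the rank-one update $A + uv^T$ costs only $O(d^2)$ instead of $O(d^3)$.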
Newton’s method
Online algorithm
$\theta \leftarrow 0$, $S \leftarrow 0$, $\Gamma \leftarrow I/\lambda$
For each new sample $(x, y)$:
- $p \leftarrow \frac{1}{1 + e^{-\theta^T x}}$
- $v \leftarrow p(1 - p)$
- $S \leftarrow S + x\,(y - p + v\,\theta^T x)$
- $\Gamma \leftarrow \Gamma - \frac{v\, \Gamma x x^T \Gamma}{1 + v\, x^T \Gamma x}$
- $\theta \leftarrow \Gamma S$
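A direct transcription in Python, reusing sigmoid from above (class and method names are ours); note that $\Gamma x x^T \Gamma = (\Gamma x)(\Gamma x)^T$ since $\Gamma$ is symmetric:

    class OnlineNewton:
        # Streaming Newton's method with Sherman-Morrison rank-one updates
        def __init__(self, d, lam=1.0):
            self.theta = np.zeros(d)
            self.S = np.zeros(d)          # running right-hand side
            self.Gamma = np.eye(d) / lam  # running inverse of the Hessian

        def partial_fit(self, x, y):
            p = sigmoid(self.theta @ x)
            v = p * (1 - p)
            self.S += x * (y - p + v * (self.theta @ x))
            Gx = self.Gamma @ x
            self.Gamma -= v * np.outer(Gx, Gx) / (1 + v * (x @ Gx))
            self.theta = self.Gamma @ self.S

        def predict_proba(self, x):
            return sigmoid(self.theta @ x)

Each update costs $O(d^2)$, and there is no learning rate to tune.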
Outline
1. Offline learning
2. Online learning
3. Experiments
4. Summary
Experiments
Classification
$y, \hat{y} \in \{0, 1\}$
$$F_1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}$$
Ranking
$p, \hat{p} \in [0, 1]$
$$\mathrm{FCP} \propto \sum_{i,j} 1_{(p_i - p_j)(\hat{p}_i - \hat{p}_j) > 0}$$
(Fraction of Concordant Pairs)
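A possible implementation of FCP; the slide only states proportionality, so normalizing by the number of pairs with distinct true scores is our assumption:

    import numpy as np

    def fcp(p, p_hat):
        # Fraction of concordant pairs between true and predicted scores
        p, p_hat = np.asarray(p), np.asarray(p_hat)
        dp = np.subtract.outer(p, p)          # p_i - p_j for all pairs
        dq = np.subtract.outer(p_hat, p_hat)  # p_hat_i - p_hat_j
        return np.sum(dp * dq > 0) / np.sum(dp != 0)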
Preprocessing
Samples are centered and rescaled (in stream) before learning.
[Figure: standard scaler applied to a sample stream]
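The talk does not spell out the scaler; a standard in-stream version maintains running per-feature means and variances (Welford's algorithm), e.g.:

    import numpy as np

    class StreamingScaler:
        # Per-feature running mean and variance (Welford's algorithm)
        def __init__(self, d):
            self.n = 0
            self.mean = np.zeros(d)
            self.m2 = np.zeros(d)

        def partial_fit(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def transform(self, x):
            std = np.sqrt(self.m2 / max(self.n, 1))
            return (x - self.mean) / np.where(std > 0, std, 1.0)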
Scenarios
1. All samples
At each time $t$, predict sample $x_t$, then learn using label $y_t$
[Figure: sample stream over time]
2. Cold start
Learn until time $t = 20$ and test over the next 200 samples
[Figure: train / test split of the stream over time]
OpenML datasets
All OpenML datasets with:
- binary labels
- no missing data
- $2 \le d \le 100$ features
- $n \ge 100$ samples
- accuracy $< 0.99$ (offline)
→ 36 datasets
[Figure: the 36 datasets, number of samples ($10^2$ to $10^5$) vs. number of features ($10^1$ to $10^2$)]
See www.openml.org
Classification
Distribution of $F_1$ score over the 36 OpenML datasets
[Figure: box plots of $F_1$ for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Synthetic data
Let $S$ be the unit sphere in dimension $d$
Model
- $\theta \sim \mathcal{U}(S)$
- $x \sim \mathcal{U}(S)$
- $y \sim \mathcal{B}(p)$ with
$$p = \frac{1}{1 + e^{-\alpha \theta^T x}}$$
Example: distribution of $p$ for $n = 100$, $d = 10$, $\alpha \in \{1, 10\}$
[Figure: histograms of $p$ for the two values of $\alpha$]
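A sketch of this generator, reusing sigmoid from above (uniform points on the sphere obtained by normalizing Gaussian vectors; the function name is ours):

    def synthetic_data(n=100, d=10, alpha=10.0, seed=0):
        # theta and x uniform on the unit sphere, y ~ Bernoulli(p)
        rng = np.random.default_rng(seed)
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        p = sigmoid(alpha * X @ theta)
        y = rng.binomial(1, p)
        return X, y, p, theta

Larger $\alpha$ pushes $p$ towards 0 and 1, making the labels less noisy.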
Ranking (α = 10)
Distribution of FCP over 100 independent instances
[Figure: box plots of FCP for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Ranking (α = 1)
Distribution of FCP over 100 independent instances
[Figure: box plots of FCP for Newton, GD and Random, in the all-samples and cold-start scenarios, together with the running time for learning (ms)]
Application to Veepee data
Data collected in April 2021:
- 10 days
- 105,170 new users
- $d = 132$ binary features
[Figure: FCP as a function of the number of clicks, for Online, Offline and Random; running time (ms) for each method]
Summary
A novel streaming algorithm for logistic regression, based on:
- Newton's method
- the Sherman-Morrison formula
Efficient and parameter-free: unlike gradient descent, there is no learning rate $\gamma$ to tune.
Future work:
- Benchmarking
- Convergence
- High dimension
- Control (bandit algorithms)