Regularized Stochastic Learning with Dual Averaging Methods
An Application to Linear SVM
Maxime Jumelle
Context
In this note, we are interested in evaluating the efficiency of dual averaging methods for regularized stochastic learning.
• First, we present the algorithm in its general form, state its main properties, and discuss the aims of the studied research article.
• Second, we propose a special case with mixed $\ell_1$- and $\ell_2$-regularization.
• Finally, we apply this special case to the MNIST dataset with an $\ell_2$-regularized hinge loss and compare sparsity patterns, regret bounds, and convergence results with other instances of online convex optimization algorithms.
Linear SVM
Throughout this presentation, we focus on the linear SVM with hinge loss and $\ell_2$-regularization, whose loss function can be written as

$$\frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

with a fixed $\lambda > 0$.
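As a quick illustration, here is a minimal NumPy sketch of this objective and one of its subgradients; the names `A` (the $n \times d$ data matrix), `b` (labels in $\{-1,+1\}$) and both function names are our own conventions, not from the original note.

```python
import numpy as np

def svm_objective(w, A, b, lam):
    """(lam/n) * sum_i (1 - b_i w^T a_i)_+ + 0.5 * ||w||_2^2."""
    margins = 1.0 - b * (A @ w)
    return (lam / len(b)) * np.maximum(margins, 0.0).sum() + 0.5 * (w @ w)

def svm_subgradient(w, A, b, lam):
    """One subgradient of the objective at w: the hinge term contributes
    -(lam/n) * b_i * a_i for every example violating the margin."""
    active = 1.0 - b * (A @ w) > 0.0
    return -(lam / len(b)) * (A[active].T @ b[active]) + w
```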
Regularized Dual Averaging
Regularization
Let $f$ be a convex loss function and $\Psi$ a closed convex function, referred to as the "regularization function". Regularized stochastic learning then aims at solving the following problem:

$$\min_w L(w; Z) = \min_w \left\{ \mathbb{E}_Z[f(w; Z)] + \Psi(w) \right\}$$

Popular examples of regularization functions are:
• $\ell_1$-regularization with $\Psi(w) = \rho\|w\|_1$, $\rho \ge 0$.
• $\ell_2$-regularization with $\Psi(w) = \rho\|w\|_2^2$.
• Mixed regularization with $\Psi(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$.
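For concreteness, the three choices above translate directly into code; this tiny sketch (our own illustration, with hypothetical function names) evaluates each $\Psi$:

```python
import numpy as np

def psi_l1(w, rho):
    """l1-regularization: rho * ||w||_1."""
    return rho * np.abs(w).sum()

def psi_l2(w, rho):
    """l2-regularization: rho * ||w||_2^2."""
    return rho * (w @ w)

def psi_mixed(w, rho):
    """Mixed regularization: 0.5 * ||w||_2^2 + rho * ||w||_1."""
    return 0.5 * (w @ w) + rho * np.abs(w).sum()
```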
Dual Averaging
The key idea behind this algorithm is to minimize, at each iteration, the sum of three terms:

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t} h(w) \right\}$$

where $\bar{g}_t$ is the running average of the subgradients of the losses $f_\tau$ at $w_\tau$, $h$ is an auxiliary strongly convex function, and $(\beta_t)_{t \ge 1}$ is a nonnegative and nondecreasing sequence. The auxiliary function is meant to ensure more aggressive truncation thresholds than in the usual setting of a single $\ell_1$ or $\ell_2$ regularization term, thereby significantly improving sparsity selection.
Regularized Dual Averaging (RDA)
Algorithm 1: Regularized Dual Averaging (RDA) method
Input: auxiliary function $h$ and a nonnegative, nondecreasing sequence $(\beta_t)_{t \ge 1}$.
Set $w_1 = \operatorname*{argmin}_w h(w)$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Compute the next weight: $w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t} h(w) \right\}$.
end
return $w_{T+1}$
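The loop structure translates almost line for line into code. Below is a minimal sketch of the generic method, where `subgradient` and `solve_argmin` are placeholder callables we introduce for illustration: the first returns some $g_t \in \partial f_t(w_t)$, the second solves the regularized argmin (available in closed form for the simple choices of $\Psi$ and $h$ used later).

```python
import numpy as np

def rda(subgradient, solve_argmin, w1, T):
    """Generic RDA loop.

    subgradient(w, t)   -> some g_t in the subdifferential of f_t at w
    solve_argmin(gb, t) -> argmin_w <gb, w> + Psi(w) + (beta_t / t) * h(w)
    """
    w = w1
    g_bar = np.zeros_like(w1)
    for t in range(1, T + 1):
        g = subgradient(w, t)
        g_bar = (t - 1) / t * g_bar + g / t   # running average of subgradients
        w = solve_argmin(g_bar, t)            # regularized minimization step
    return w
```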
Special case: $\ell_1$-regularization
Soft $\ell_1$-regularization
The previous algorithm provides a general setting for the regularized dual averaging method with arbitrary regularization and auxiliary functions. In particular, let us choose

$$\Psi(w) = \sigma\|w\|_1, \qquad h(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$$
Recalling the minimization of the loss function from the beginning, we can define a specific case of the RDA method with these regularization and auxiliary functions. Taking

$$f(w; a, b) = \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

we can express the RDA update as

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \sigma\|w\|_1 + \frac{\beta_t}{t} \left( \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1 \right) \right\}$$
Soft $\ell_1$-regularization
Choosing $\beta_t = \gamma\sqrt{t}$ with $\gamma > 0$ and writing $\lambda_t^{\mathrm{RDA}} = \sigma + \gamma\rho/\sqrt{t}$, this soft $\ell_1$-regularization enables us to compute a closed-form solution for every $w_t$, coordinate-wise:

$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \lambda_t^{\mathrm{RDA}} \\ -\frac{\sqrt{t}}{\gamma} \left( \bar{g}_t^{(i)} - \lambda_t^{\mathrm{RDA}} \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
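A minimal sketch of this coordinate-wise truncation, vectorized with NumPy (the function name is ours):

```python
import numpy as np

def enhanced_l1_rda_update(g_bar, t, gamma, sigma, rho):
    """Closed-form RDA update for Psi = sigma * ||w||_1 and
    h = 0.5 * ||w||_2^2 + rho * ||w||_1, with beta_t = gamma * sqrt(t)."""
    lam_t = sigma + gamma * rho / np.sqrt(t)   # truncation threshold lambda_t^RDA
    w = np.zeros_like(g_bar)
    keep = np.abs(g_bar) > lam_t               # coordinates that escape truncation
    w[keep] = -(np.sqrt(t) / gamma) * (g_bar[keep] - lam_t * np.sign(g_bar[keep]))
    return w
```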
Enhanced $\ell_1$-RDA
Algorithm 2: Enhanced $\ell_1$-RDA method
Input: $\gamma > 0$ and $\rho \ge 0$.
Set $w_1 = 0$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Let $\lambda_t^{\mathrm{RDA}} = \sigma + \gamma\rho/\sqrt{t}$ and compute the next weight coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \lambda_t^{\mathrm{RDA}} \\ -\frac{\sqrt{t}}{\gamma} \left( \bar{g}_t^{(i)} - \lambda_t^{\mathrm{RDA}} \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
end
return $w_{T+1}$
$\ell_1$-regularization with linear SVM
In our case, we consider a slightly different setting: since we focus on the linear SVM, the term $\frac{1}{2}\|w\|_2^2$ is already present. However, we can still act on this term by defining

$$\Psi(w) = \rho\|w\|_1 + \frac{\sigma}{2}\|w\|_2^2$$

to enable a sparse model. In this case, it can be shown that
$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \rho \\ -\frac{1}{\sigma} \left( \bar{g}_t^{(i)} - \rho \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$

Setting $\sigma = 1$ brings us back to the linear SVM setup.
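This update is a plain soft-thresholding of $\bar{g}_t$ at level $\rho$, scaled by $1/\sigma$; a minimal sketch (the function name is ours):

```python
import numpy as np

def mixed_rda_update(g_bar, sigma, rho):
    """Closed-form minimizer of <g_bar, w> + rho*||w||_1 + (sigma/2)*||w||_2^2:
    coordinate-wise soft-thresholding at rho, scaled by 1/sigma."""
    w = np.zeros_like(g_bar)
    keep = np.abs(g_bar) > rho
    w[keep] = -(g_bar[keep] - rho * np.sign(g_bar[keep])) / sigma
    return w
```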
Mixed $\ell_1$/$\ell_2$-RDA
Algorithm 3: Mixed $\ell_1$/$\ell_2$-RDA
Input: $\rho \ge 0$.
Set $w_1 = 0$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Compute the next weight coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \rho \\ -\frac{1}{\sigma} \left( \bar{g}_t^{(i)} - \rho \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
end
return $w_{T+1}$
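Putting the pieces together, here is a minimal end-to-end sketch of Algorithm 3 on the hinge loss, sampling one example per iteration; `A`, `b`, the function name, and the hyper-parameter defaults are illustrative assumptions, not values prescribed by the note.

```python
import numpy as np

def mixed_l1_l2_rda_svm(A, b, T, lam=1.0, sigma=1.0, rho=0.25, seed=0):
    """Mixed l1/l2-RDA on the hinge loss, one random example per iteration."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w, g_bar = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # subgradient of lam * (1 - b_i w^T a_i)_+ at w
        g = -lam * b[i] * A[i] if b[i] * (A[i] @ w) < 1.0 else np.zeros(d)
        g_bar = (t - 1) / t * g_bar + g / t
        # closed-form argmin with Psi(w) = rho*||w||_1 + (sigma/2)*||w||_2^2
        w = np.zeros(d)
        keep = np.abs(g_bar) > rho
        w[keep] = -(g_bar[keep] - rho * np.sign(g_bar[keep])) / sigma
    return w
```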
Experiments
Settings
We use the MNIST dataset (LeCun et al. 1999) restricted to the digits 0 and 1. Our goal is to build a classifier in a 784-dimensional space able to correctly identify whether an input digit is a 0 (positive class) or a 1 (negative class).
Figure 1: Example of a few digits in the dataset.
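A possible data-preparation sketch for this setting, assuming scikit-learn's `fetch_openml` access to MNIST (the exact loading mechanism is our assumption, not specified in the note):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Keep only the digits 0 and 1 and relabel them +1 / -1 for the hinge loss.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = (y == "0") | (y == "1")
A = X[mask] / 255.0                       # n x 784 matrix, pixels scaled to [0, 1]
b = np.where(y[mask] == "0", 1.0, -1.0)   # 0 -> positive class, 1 -> negative class
```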
Settings
In addition to the RDA algorithm, we also train our model with other popular ones:
• Stochastic Gradient Descent
• Stochastic Mirror Descent
• Stochastic Randomized Exponentiated Gradients
Sparsity patterns for $T = 2000$, $\lambda = 1$ and $\rho = 0.25$
[Figure: sparsity patterns of the learned weights.]
Convergence
[Figure: convergence curves of the compared algorithms.]
Error rates
Since we trained our classifier on a training set, we would like to detect any over-fitting (although it is less likely for a linear model than for a non-parametric one). We therefore use a test set $(\tilde{a}, \tilde{b}) = (\tilde{a}_i, \tilde{b}_i)_{1 \le i \le m}$, containing no observation from the training set, and compute at different iterations $t$, for each algorithm with prediction function $\hat{g}$, the error rate:

$$\frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{\hat{g}(\tilde{a}_i) \ne \tilde{b}_i\}$$

where $\hat{g}(x) = \mathbb{1}\{\hat{w}^\top x > 0\}$, with $\hat{w}$ the learned parameters (or weights).
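A minimal sketch of this metric, mapping the sign of $\hat{w}^\top x$ to the $\pm 1$ labels (a helper of our own, not from the note):

```python
import numpy as np

def error_rate(w, A_test, b_test):
    """Fraction of test examples misclassified by the linear rule w."""
    preds = np.where(A_test @ w > 0.0, 1.0, -1.0)   # predicted +/-1 labels
    return np.mean(preds != b_test)
```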
Error rates
[Figure: test error rates over iterations for each algorithm.]
RDA with $\rho \in \{10^{-6}, \ldots, 1\}$
[Figure: RDA results for different values of $\rho$.]
Convergence of RDA with $\rho \in \{10^{-6}, \ldots, 1\}$
[Figure: convergence of RDA for different values of $\rho$.]
Going further and improvements
There are still many possibilities since Nesterov's introduction of dual averaging (2009). In this situation, we could:
• Extend the current experiment to all 10 digits, and study models other than the linear SVM.
• Fine-tune hyper-parameters with cross-validation techniques.
• Use other regularization functions and potentially discover closed-form solutions.