
(1)

Regularized Stochastic Learning with Dual Averaging Methods

An Application to Linear SVM

Maxime Jumelle

(2)

Context

In this note, we are interested in evaluating the efficiency of dual averaging methods for stochastic learning with regularization.

• First, we present the algorithm in its general form with its associated properties, and discuss the aims of the studied research article.

• Second, we propose a special case with mixed l1/l2-regularization.

• Finally, we apply this special case to the MNIST dataset with an l2-regularized hinge loss, and compare sparsity patterns, regret bounds, and convergence results with other online convex optimization algorithms.

1

(3)

Linear SVM

During this presentation, we will only focus on linear SVM with hinge loss and l2-regularization, whose loss function can be written as:

$$\frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

with a fixed $\lambda > 0$.

2
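To make this objective concrete, a minimal NumPy sketch evaluating it on a batch could look as follows (the function name and array conventions are our own, not from the article):

```python
import numpy as np

def svm_objective(w, A, b, lam):
    """l2-regularized hinge loss: (lam / n) * sum_i (1 - b_i <w, a_i>)_+ + 0.5 * ||w||_2^2.

    A is the (n, d) matrix of features a_i, b the (n,) vector of labels in {-1, +1}.
    """
    margins = 1.0 - b * (A @ w)                  # 1 - b_i <w, a_i> for every example
    hinge = np.maximum(margins, 0.0).sum()       # positive part (.)_+ summed over i
    return lam / len(b) * hinge + 0.5 * np.dot(w, w)
```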

(4)

Regularized Dual Algorithm

(5)

Regularization

Let $f$ be a convex loss function and $\Psi$ a closed convex function, referred to as the "regularization function". Regularized stochastic learning then aims at solving the following problem:

$$\min_w L(w; Z) = \min_w \left\{ \mathbb{E}_Z\left[f(w; Z)\right] + \Psi(w) \right\}$$

Popular examples of regularization functions are (a short code sketch follows the list):

• $l_1$-regularization, with $\Psi(w) = \rho\|w\|_1$ and $\rho \geq 0$.

• $l_2$-regularization, with $\Psi(w) = \rho\|w\|_2^2$.

• Mixed regularization, with $\Psi(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$.

3
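For reference, these three regularizers take only a line each in NumPy (an illustrative sketch with our own function names):

```python
import numpy as np

def l1_reg(w, rho):
    """Psi(w) = rho * ||w||_1 (promotes sparsity)."""
    return rho * np.abs(w).sum()

def l2_reg(w, rho):
    """Psi(w) = rho * ||w||_2^2."""
    return rho * np.dot(w, w)

def mixed_reg(w, rho):
    """Psi(w) = 0.5 * ||w||_2^2 + rho * ||w||_1 (elastic-net style)."""
    return 0.5 * np.dot(w, w) + rho * np.abs(w).sum()
```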

(6)

Dual Averaging

The key idea behind this algorithm is to minimize three terms at each iteration:

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t}\, h(w) \right\}$$

with $h$ an auxiliary strongly convex function, $(\beta_t)_{t \geq 1}$ a nonnegative and nondecreasing sequence, and $\bar{g}_t$ the average of the subgradients observed so far. The auxiliary function is meant to enforce more aggressive truncation thresholds than in the usual setting of a single $l_1$ or $l_2$ regularization term, thereby significantly improving sparsity selection.

4

(7)

Regularized Dual Algorithm (RDA)

Algorithm 1: Regularized Dual Algorithm (RDA) method

Input: auxiliary function $h$ and a nonnegative, nondecreasing sequence $(\beta_t)_{t \geq 1}$.

Set $w_1 = \operatorname*{argmin}_w h(w)$ and $\bar{g}_0 = 0$.

for t = 1, ..., T do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ of the given $f_t$.
    Update the averaged dual subgradient: $\bar{g}_t = \frac{t-1}{t}\,\bar{g}_{t-1} + \frac{1}{t}\,g_t$
    Compute the next weight: $w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t}\, h(w) \right\}$
end

return $w_{T+1}$

5
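As a minimal illustration of this generic loop (all names are our own, and the inner argmin is solved numerically here only for generality; closed-form solutions for specific choices of Ψ and h are derived on the following slides):

```python
import numpy as np
from scipy.optimize import minimize

def rda(subgradient, psi, h, dim, T, beta=None):
    """Generic RDA loop (Algorithm 1).

    subgradient(w, t) -> a subgradient g_t of f_t at w
    psi(w), h(w)      -> regularization and (strongly convex) auxiliary functions
    beta(t)           -> nonnegative, nondecreasing sequence (default: sqrt(t))
    """
    beta = beta or (lambda t: np.sqrt(t))
    g_bar = np.zeros(dim)
    # w_1 = argmin_w h(w), solved numerically in this sketch
    w = minimize(h, np.zeros(dim), method="Powell").x
    for t in range(1, T + 1):
        g_t = subgradient(w, t)
        g_bar = (t - 1) / t * g_bar + g_t / t          # averaged dual subgradient
        objective = lambda v, gb=g_bar, bt=beta(t): gb @ v + psi(v) + bt / t * h(v)
        w = minimize(objective, w, method="Powell").x  # inner argmin of Algorithm 1
    return w
```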

(8)

Special case: l1-regularization

(9)

Soft l1-regularization

The previous algorithm provides a general setting for the regularized dual averaging method, with any regularization and auxiliary functions. In particular, let us choose

$$\Psi(w) = \sigma\|w\|_1 \qquad h(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$$

Recalling the loss function minimization from the beginning, we can define a specific case of the RDA method with such regularization and auxiliary functions as

$$f(w; a, b) = \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

and we can then express the RDA update as

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \sigma\|w\|_1 + \frac{\beta_t}{t}\left(\frac{1}{2}\|w\|_2^2 + \rho\|w\|_1\right) \right\}$$

6

(10)

Soft l1-regularization

This soft l1-regularization enables us to compute a closed-form solution for every $w_{t+1}$, coordinate-wise:

$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \leq \lambda \\ -\frac{\sqrt{t}}{\gamma}\left(\bar{g}_t^{(i)} - \sigma\,\mathrm{sgn}(\bar{g}_t^{(i)})\right) & \text{otherwise} \end{cases}$$

7
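This coordinate-wise rule is a soft-thresholding (shrinkage) step on the averaged subgradient; a minimal NumPy sketch, vectorized over coordinates and with our own names, could be:

```python
import numpy as np

def soft_l1_rda_update(g_bar, t, lam, sigma, gamma):
    """Closed-form RDA update for the soft l1 case.

    Coordinates with |g_bar_i| <= lam are truncated to exactly 0; the others are
    set to -(sqrt(t) / gamma) * (g_bar_i - sigma * sign(g_bar_i)).
    """
    shrunk = -np.sqrt(t) / gamma * (g_bar - sigma * np.sign(g_bar))
    return np.where(np.abs(g_bar) <= lam, 0.0, shrunk)
```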

(11)

Enhanced l1-RDA

Algorithm 2: Enhanced l1-RDA method

Input: $\gamma > 0$ and $\rho \geq 0$.

Set $w_1 = 0$ and $\bar{g}_0 = 0$.

for t = 1, ..., T do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ of the given $f_t$.
    Update the averaged dual subgradient: $\bar{g}_t = \frac{t-1}{t}\,\bar{g}_{t-1} + \frac{1}{t}\,g_t$
    Let $\lambda_t^{RDA} = \sigma + \frac{\gamma \rho}{\sqrt{t}}$ and compute the next weight, coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \leq \lambda_t^{RDA} \\ -\frac{\sqrt{t}}{\gamma}\left(\bar{g}_t^{(i)} - \sigma\,\mathrm{sgn}(\bar{g}_t^{(i)})\right) & \text{otherwise} \end{cases}$$
end

return $w_{T+1}$

8

(12)

l1-regularization with linear SVM

In our case, we consider a slightly different setting: since we focus on linear SVM, the term $\frac{1}{2}\|w\|_2^2$ is already present in the loss. However, we can still play on this term by defining

$$\Psi(w) = \rho\|w\|_1 + \frac{\sigma}{2}\|w\|_2^2$$

to enable a sparse model. In this case, it can be shown that

$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \leq \rho \\ -\frac{1}{\sigma}\left(\bar{g}_t^{(i)} - \rho\,\mathrm{sgn}(\bar{g}_t^{(i)})\right) & \text{otherwise} \end{cases}$$

Setting $\sigma = 1$, we are back in the linear SVM setup.

9

(13)

Mixed l1/l2-RDA

Algorithm 3: Mixed l1/l2-RDA

Input: $\rho \geq 0$.

Set $w_1 = 0$ and $\bar{g}_0 = 0$.

for t = 1, ..., T do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ of the given $f_t$.
    Update the averaged dual subgradient: $\bar{g}_t = \frac{t-1}{t}\,\bar{g}_{t-1} + \frac{1}{t}\,g_t$
    Compute the next weight, coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \leq \rho \\ -\frac{1}{\sigma}\left(\bar{g}_t^{(i)} - \rho\,\mathrm{sgn}(\bar{g}_t^{(i)})\right) & \text{otherwise} \end{cases}$$
end

return $w_{T+1}$

10
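Putting the pieces together, below is a small illustrative sketch of this mixed l1/l2-RDA loop applied to the l2-regularized hinge loss, sampling one training example per iteration; the per-example scaling, function names, and hyper-parameter defaults are our own choices, not code from the article.

```python
import numpy as np

def hinge_subgradient(w, a, b, lam):
    """Subgradient of lam * (1 - b <w, a>)_+ at w (the l2 term is handled inside Psi)."""
    return -lam * b * a if b * np.dot(w, a) < 1.0 else np.zeros_like(w)

def mixed_rda_svm(A, b, T, lam=1.0, rho=0.25, sigma=1.0, seed=0):
    """Mixed l1/l2-RDA (Algorithm 3) on the hinge loss, one random example per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w, g_bar = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                                  # pick a random training example
        g_t = hinge_subgradient(w, A[i], b[i], lam)
        g_bar = (t - 1) / t * g_bar + g_t / t                # averaged dual subgradient
        shrunk = -(g_bar - rho * np.sign(g_bar)) / sigma     # soft-thresholded update
        w = np.where(np.abs(g_bar) <= rho, 0.0, shrunk)      # exact zeros -> sparsity
    return w
```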

(14)

Experiments

(15)

Settings

We use the MNIST dataset (LeCun et al. 1999), restricted to the 0 and 1 digits. Our goal is to build a classifier in a 784-dimensional space able to correctly identify whether an input digit is likely to be a 0 (positive class) or a 1 (negative class).

Figure 1: Example of a few digits in the dataset.

11
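One possible way to reproduce this setup (the article does not describe its data pipeline, so the following is our own sketch using scikit-learn) is:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Download MNIST (784 pixel features per image) and keep only the 0 and 1 digits.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = (y == "0") | (y == "1")
A = X[mask] / 255.0                      # scale pixel values to [0, 1]
b = np.where(y[mask] == "0", 1.0, -1.0)  # 0 -> positive class (+1), 1 -> negative class (-1)

A_train, A_test, b_train, b_test = train_test_split(A, b, test_size=0.2, random_state=0)
```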

(16)

Settings

In addition to the RDA algorithm, we will also train our model with other popular algorithms (a minimal baseline sketch follows the list):

• Stochastic Gradient Descent

• Stochastic Mirror Descent

• Stochastic Randomized Exponentiated Gradients

12
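As a rough illustration of the kind of baseline involved, here is a plain stochastic subgradient descent sketch for the same l2-regularized hinge loss; this is our own reference code, and the article's mirror descent and exponentiated gradient variants are not reproduced here.

```python
import numpy as np

def sgd_svm(A, b, T, lam=1.0, seed=0):
    """Stochastic subgradient descent on lam * hinge + 0.5 * ||w||_2^2, step size 1/sqrt(t)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        g = w.copy()                            # subgradient of the 0.5 * ||w||_2^2 term
        if b[i] * np.dot(w, A[i]) < 1.0:
            g -= lam * b[i] * A[i]              # subgradient of the hinge term
        w -= g / np.sqrt(t)
    return w
```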

(17)

Sparsity patterns for T = 2000, λ = 1 and ρ = 0.25

13

(18)

Convergence

14

(19)

Error rates

Since we have trained our classifier on a training set, we would like to detect any overfitting (although it is less likely for a linear model than for a non-parametric one). We therefore use a test set $(\tilde{a}, \tilde{b}) = (\tilde{a}_i, \tilde{b}_i)_{1 \leq i \leq m}$ containing no observation from the training set and compute, at different iterations $t$ and for each algorithm with prediction function $\hat{g}$, the error rate:

$$\frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{\hat{g}(\tilde{a}_i) \neq \tilde{b}_i\}$$

where $\hat{g}(x) = \mathbb{1}\{\hat{w}^\top x > 0\}$, with $\hat{w}$ the learned parameters (or weights).

15
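For completeness, this error rate takes a couple of NumPy lines; here we assume labels encoded as +1 for the positive class and -1 for the negative class, which is our own convention:

```python
import numpy as np

def error_rate(w_hat, A_test, b_test):
    """Fraction of test examples whose predicted class differs from the true label.

    Prediction rule: +1 if <w_hat, x> > 0, else -1 (labels assumed in {-1, +1}).
    """
    predictions = np.where(A_test @ w_hat > 0.0, 1.0, -1.0)
    return float(np.mean(predictions != b_test))
```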

(20)

Error rates

16

(21)

RDA with ρ ∈ {10⁻⁶, ..., 1}

17

(22)

Convergence of RDA with ρ ∈ {10⁻⁶, ..., 1}

18

(23)

Going further and improvements

There are still many possibilities since Nesterov's introduction of dual averaging methods (2009). In this situation, we could:

• Extend the current experiment to all 10 digits, and study other models than linear SVM.

• Fine-tune hyper-parameters with cross-validation techniques.

• Use other regularization functions and potentially discover closed-form solutions.

19
