Regularized Stochastic Learning with Dual Averaging Methods
An Application to Linear SVM
Maxime Jumelle
Context
In this note, we are interested in evaluating the efficiency of dual averaging methods for regularized stochastic learning.
• First, we present the algorithm in its general form, state its main properties, and discuss the aims of the studied research article.
• Second, we propose a special case with mixed $\ell_1$- and $\ell_2$-regularization.
• Finally, we apply this special case to the MNIST dataset with an $\ell_2$-regularized hinge loss and compare sparsity patterns, regret bounds, and convergence results with other instances of online convex optimization algorithms.
Linear SVM
Throughout this presentation, we focus on the linear SVM with hinge loss and $\ell_2$-regularization, whose loss function can be written as

$$\frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

with a fixed $\lambda > 0$.
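As a quick illustration, here is a minimal NumPy sketch of this objective and one of its subgradients; the names `A` (the $n \times d$ data matrix), `b` (labels in $\{-1,+1\}$) and both function names are our own conventions, not from the original note.

```python
import numpy as np

def svm_objective(w, A, b, lam):
    """(lam/n) * sum_i (1 - b_i w^T a_i)_+ + 0.5 * ||w||_2^2."""
    margins = 1.0 - b * (A @ w)
    return (lam / len(b)) * np.maximum(margins, 0.0).sum() + 0.5 * (w @ w)

def svm_subgradient(w, A, b, lam):
    """One subgradient of the objective at w: the hinge term contributes
    -(lam/n) * b_i * a_i for every example violating the margin."""
    active = 1.0 - b * (A @ w) > 0.0
    return -(lam / len(b)) * (A[active].T @ b[active]) + w
```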
Regularized Dual Averaging
Regularization
Let $f$ be a convex loss function and $\Psi$ a closed convex function, referred to as the "regularization function". Regularized stochastic learning then aims at solving the following problem:

$$\min_w L(w; Z) = \min_w \left\{ \mathbb{E}_Z[f(w; Z)] + \Psi(w) \right\}$$

Popular examples of regularization functions are:
• $\ell_1$-regularization with $\Psi(w) = \rho\|w\|_1$, $\rho \ge 0$.
• $\ell_2$-regularization with $\Psi(w) = \rho\|w\|_2^2$.
• Mixed regularization with $\Psi(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$.
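For concreteness, the three choices above translate directly into code; this tiny sketch (our own illustration, with hypothetical function names) evaluates each $\Psi$:

```python
import numpy as np

def psi_l1(w, rho):
    """l1-regularization: rho * ||w||_1."""
    return rho * np.abs(w).sum()

def psi_l2(w, rho):
    """l2-regularization: rho * ||w||_2^2."""
    return rho * (w @ w)

def psi_mixed(w, rho):
    """Mixed regularization: 0.5 * ||w||_2^2 + rho * ||w||_1."""
    return 0.5 * (w @ w) + rho * np.abs(w).sum()
```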
Dual Averaging
The key idea behind this algorithm is to minimize, at each iteration, the sum of three terms:

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t} h(w) \right\}$$

where $\bar{g}_t$ is the running average of the subgradients of the losses $f_\tau$ at $w_\tau$, $h$ is an auxiliary strongly convex function, and $(\beta_t)_{t \ge 1}$ is a nonnegative and nondecreasing sequence. The auxiliary function is meant to ensure more aggressive truncation thresholds than in the usual setting of a single $\ell_1$ or $\ell_2$ regularization term, thereby significantly improving sparsity selection.
Regularized Dual Averaging (RDA)
Algorithm 1: Regularized Dual Averaging (RDA) method
Input: auxiliary function $h$ and a nonnegative, nondecreasing sequence $(\beta_t)_{t \ge 1}$.
Set $w_1 = \operatorname*{argmin}_w h(w)$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Compute the next weight: $w_{t+1} = \operatorname*{argmin}_w \left\{ \langle \bar{g}_t, w \rangle + \Psi(w) + \frac{\beta_t}{t} h(w) \right\}$.
end
return $w_{T+1}$
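The loop structure translates almost line for line into code. Below is a minimal sketch of the generic method, where `subgradient` and `solve_argmin` are placeholder callables we introduce for illustration: the first returns some $g_t \in \partial f_t(w_t)$, the second solves the regularized argmin (available in closed form for the simple choices of $\Psi$ and $h$ used later).

```python
import numpy as np

def rda(subgradient, solve_argmin, w1, T):
    """Generic RDA loop.

    subgradient(w, t)   -> some g_t in the subdifferential of f_t at w
    solve_argmin(gb, t) -> argmin_w <gb, w> + Psi(w) + (beta_t / t) * h(w)
    """
    w = w1
    g_bar = np.zeros_like(w1)
    for t in range(1, T + 1):
        g = subgradient(w, t)
        g_bar = (t - 1) / t * g_bar + g / t   # running average of subgradients
        w = solve_argmin(g_bar, t)            # regularized minimization step
    return w
```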
Special case: $\ell_1$-regularization
Soft $\ell_1$-regularization
The previous algorithm provides a general setting for the regularized dual averaging method with arbitrary regularization and auxiliary functions. In particular, let us choose

$$\Psi(w) = \sigma\|w\|_1, \qquad h(w) = \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1$$
Recalling the minimization of the loss function from the beginning, we can define a specific case of the RDA method with these regularization and auxiliary functions. Taking

$$f(w; a, b) = \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \frac{1}{2}\|w\|_2^2$$

we can express the RDA update as

$$w_{t+1} = \operatorname*{argmin}_w \left\{ \frac{\lambda}{n} \sum_{i=1}^{n} \left(1 - b_i w^\top a_i\right)_+ + \sigma\|w\|_1 + \frac{\beta_t}{t} \left( \frac{1}{2}\|w\|_2^2 + \rho\|w\|_1 \right) \right\}$$
Soft $\ell_1$-regularization
Choosing $\beta_t = \gamma\sqrt{t}$ with $\gamma > 0$ and writing $\lambda_t^{\mathrm{RDA}} = \sigma + \gamma\rho/\sqrt{t}$, this soft $\ell_1$-regularization enables us to compute a closed-form solution for every $w_t$, coordinate-wise:

$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \lambda_t^{\mathrm{RDA}} \\ -\frac{\sqrt{t}}{\gamma} \left( \bar{g}_t^{(i)} - \lambda_t^{\mathrm{RDA}} \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
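A minimal sketch of this coordinate-wise truncation, vectorized with NumPy (the function name is ours):

```python
import numpy as np

def enhanced_l1_rda_update(g_bar, t, gamma, sigma, rho):
    """Closed-form RDA update for Psi = sigma * ||w||_1 and
    h = 0.5 * ||w||_2^2 + rho * ||w||_1, with beta_t = gamma * sqrt(t)."""
    lam_t = sigma + gamma * rho / np.sqrt(t)   # truncation threshold lambda_t^RDA
    w = np.zeros_like(g_bar)
    keep = np.abs(g_bar) > lam_t               # coordinates that escape truncation
    w[keep] = -(np.sqrt(t) / gamma) * (g_bar[keep] - lam_t * np.sign(g_bar[keep]))
    return w
```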
Enhanced $\ell_1$-RDA
Algorithm 2: Enhanced $\ell_1$-RDA method
Input: $\gamma > 0$ and $\rho \ge 0$.
Set $w_1 = 0$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Let $\lambda_t^{\mathrm{RDA}} = \sigma + \gamma\rho/\sqrt{t}$ and compute the next weight coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \lambda_t^{\mathrm{RDA}} \\ -\frac{\sqrt{t}}{\gamma} \left( \bar{g}_t^{(i)} - \lambda_t^{\mathrm{RDA}} \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
end
return $w_{T+1}$
$\ell_1$-regularization with linear SVM
In our case, we consider a slightly different setting: since we focus on the linear SVM, the term $\frac{1}{2}\|w\|_2^2$ is already present. However, we can still act on this term by defining

$$\Psi(w) = \rho\|w\|_1 + \frac{\sigma}{2}\|w\|_2^2$$

to enable a sparse model. In this case, it can be shown that
$$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \rho \\ -\frac{1}{\sigma} \left( \bar{g}_t^{(i)} - \rho \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$

Setting $\sigma = 1$ brings us back to the linear SVM setup.
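This update is a plain soft-thresholding of $\bar{g}_t$ at level $\rho$, scaled by $1/\sigma$; a minimal sketch (the function name is ours):

```python
import numpy as np

def mixed_rda_update(g_bar, sigma, rho):
    """Closed-form minimizer of <g_bar, w> + rho*||w||_1 + (sigma/2)*||w||_2^2:
    coordinate-wise soft-thresholding at rho, scaled by 1/sigma."""
    w = np.zeros_like(g_bar)
    keep = np.abs(g_bar) > rho
    w[keep] = -(g_bar[keep] - rho * np.sign(g_bar[keep])) / sigma
    return w
```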
Mixed $\ell_1$/$\ell_2$-RDA
Algorithm 3: Mixed $\ell_1$/$\ell_2$-RDA
Input: $\rho \ge 0$.
Set $w_1 = 0$ and $\bar{g}_0 = 0$.
for $t = 1, \ldots, T$ do
    Compute a subgradient $g_t \in \partial f_t(w_t)$ for the given $f_t$.
    Update the dual average: $\bar{g}_t = \frac{t-1}{t} \bar{g}_{t-1} + \frac{1}{t} g_t$.
    Compute the next weight coordinate-wise:
    $$w_{t+1}^{(i)} = \begin{cases} 0 & \text{if } |\bar{g}_t^{(i)}| \le \rho \\ -\frac{1}{\sigma} \left( \bar{g}_t^{(i)} - \rho \operatorname{sgn}(\bar{g}_t^{(i)}) \right) & \text{otherwise} \end{cases}$$
end
return $w_{T+1}$
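Putting the pieces together, here is a minimal end-to-end sketch of Algorithm 3 on the hinge loss, sampling one example per iteration; `A`, `b`, the function name, and the hyper-parameter defaults are illustrative assumptions, not values prescribed by the note.

```python
import numpy as np

def mixed_l1_l2_rda_svm(A, b, T, lam=1.0, sigma=1.0, rho=0.25, seed=0):
    """Mixed l1/l2-RDA on the hinge loss, one random example per iteration."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w, g_bar = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # subgradient of lam * (1 - b_i w^T a_i)_+ at w
        g = -lam * b[i] * A[i] if b[i] * (A[i] @ w) < 1.0 else np.zeros(d)
        g_bar = (t - 1) / t * g_bar + g / t
        # closed-form argmin with Psi(w) = rho*||w||_1 + (sigma/2)*||w||_2^2
        w = np.zeros(d)
        keep = np.abs(g_bar) > rho
        w[keep] = -(g_bar[keep] - rho * np.sign(g_bar[keep])) / sigma
    return w
```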
Experiments
Settings
We use the MNIST dataset (LeCun et al. 1999) restricted to the digits 0 and 1. Our goal is to build a classifier in a 784-dimensional space able to correctly identify whether an input digit is a 0 (positive class) or a 1 (negative class).
Figure 1: Example of a few digits in the dataset.
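A possible data-preparation sketch for this setting, assuming scikit-learn's `fetch_openml` access to MNIST (the exact loading mechanism is our assumption, not specified in the note):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Keep only the digits 0 and 1 and relabel them +1 / -1 for the hinge loss.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = (y == "0") | (y == "1")
A = X[mask] / 255.0                       # n x 784 matrix, pixels scaled to [0, 1]
b = np.where(y[mask] == "0", 1.0, -1.0)   # 0 -> positive class, 1 -> negative class
```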
Settings
In addition to the RDA algorithm, we also train our model with other popular ones:
• Stochastic Gradient Descent
• Stochastic Mirror Descent
• Stochastic Randomized Exponentiated Gradients
Sparsity patterns for $T = 2000$, $\lambda = 1$ and $\rho = 0.25$
[Figure: sparsity patterns of the learned weights.]
Convergence
[Figure: convergence curves of the compared algorithms.]
Error rates
Since we trained our classifier on a training set, we would like to detect any over-fitting (although it is less likely for a linear model than for a non-parametric one). We therefore use a test set $(\tilde{a}, \tilde{b}) = (\tilde{a}_i, \tilde{b}_i)_{1 \le i \le m}$, containing no observation from the training set, and compute at different iterations $t$, for each algorithm with prediction function $\hat{g}$, the error rate:

$$\frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{\hat{g}(\tilde{a}_i) \ne \tilde{b}_i\}$$

where $\hat{g}(x) = \mathbb{1}\{\hat{w}^\top x > 0\}$, with $\hat{w}$ the learned parameters (or weights).
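A minimal sketch of this metric, mapping the sign of $\hat{w}^\top x$ to the $\pm 1$ labels (a helper of our own, not from the note):

```python
import numpy as np

def error_rate(w, A_test, b_test):
    """Fraction of test examples misclassified by the linear rule w."""
    preds = np.where(A_test @ w > 0.0, 1.0, -1.0)   # predicted +/-1 labels
    return np.mean(preds != b_test)
```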
Error rates
[Figure: test error rates over iterations for each algorithm.]
RDA with $\rho \in \{10^{-6}, \ldots, 1\}$
[Figure: RDA results for different values of $\rho$.]
Convergence of RDA with $\rho \in \{10^{-6}, \ldots, 1\}$
[Figure: convergence of RDA for different values of $\rho$.]
Going further and improvements
There are still many possibilities since Nesterov's introduction of dual averaging (2009). In this situation, we could:
• Extend the current experiment to all 10 digits, and study models other than the linear SVM.
• Fine-tune hyper-parameters with cross-validation techniques.
• Use other regularization functions and potentially discover closed-form solutions.