
From the document Benjamin Dubois, pour obtenir le grade de (pages 163-168)

Fast algorithms for Sparse Reduced Rank Regression

5.5 Local convergence analysis

In this section, we prove linear convergence rates in a neighborhood of the global minima for (RRR) and, under a condition on the regularization parameter λ, for (SRRR).

Precisely, we first study the geometry around the optima of (RRR) via a change of variables. Then, a continuity argument shows that the structure remains approximately the same for (SRRR) with a small λ > 0. Finally, we introduce and leverage Polyak-Łojasiewicz inequalities to prove local linear convergence.

5.5.1 A key reparameterization for RRR

The relation between RRR and PCA and the form of the analytical solution given by Velu and Reinsel [2013] allow us to reduce the study of the objective of (RRR) to the particular case in which X and Y are full-rank diagonal matrices, via a linear change of variables based on the singular value decomposition P S Q^T, introduced in Section 5.3.2, of the matrix (X^T X)^{-1/2} X^T Y. From now on, we assume that the rank parameter r is smaller than the rank of X^T Y, i.e. r ≤ ℓ := rank(X^T Y). It makes sense to assume that the imposed rank is less than the rank of the optimum of the unconstrained problem; otherwise the rank constraint is essentially useless. We also assume¹ that s_1 > . . . > s_ℓ and that X^T X is invertible.
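As an illustration, the change of variables can be instantiated numerically. The sketch below is a hypothetical setup (random Gaussian data, NumPy assumed) that computes the SVD P S Q^T of (X^T X)^{-1/2} X^T Y and checks that the assumptions r ≤ ℓ and s_1 > … > s_ℓ hold, as they do almost surely for data drawn from a continuous density:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 50, 6, 5, 2
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

# (X^T X)^{-1/2} via the eigendecomposition of X^T X (invertible a.s. for n > p)
w, V = np.linalg.eigh(X.T @ X)
XtX_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# SVD P S Q^T of (X^T X)^{-1/2} X^T Y, as introduced in Section 5.3.2
M = XtX_inv_sqrt @ X.T @ Y
P, s, Qt = np.linalg.svd(M, full_matrices=False)

ell = np.linalg.matrix_rank(M)
assert r <= ell                      # imposed rank below rank(X^T Y)
assert np.all(np.diff(s[:ell]) < 0)  # s_1 > ... > s_ell, distinct a.s.
```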

Since τ is invertible, the minimization in (RRR) w.r.t. U is equivalent to the minimization of f ∘ τ w.r.t. (A, C). We can therefore study the original optimization problem by studying f_a.

Similarly to Baldi and Hornik [1989], we characterize the minima of f_a using the connection between PCA and RRR, with a proof given in Appendix G.7.2.

Lemma 9. The set of minima of f_a is: Ω_a := { ĨR | R ∈ O_r }.

¹These assumptions are also reasonable and will hold in particular if (X, Y) are drawn from a continuous density. We discuss the case where X^T X is not invertible in Appendix G.7 and, in Appendix G.8.2, we show why these assumptions are needed.

FIGURE 5.1: Graph of f_a for A ∈ R^{2,1}. In this particular case, Ω_a = {(1; 0), (−1; 0)} and O_1 = {−1, 1}.

In words, Ω_a is the image of the Stiefel manifold O_r := { R ∈ R^{r,r} | R^T R = I_r } by the linear transformation R ↦ ĨR. In particular, Ω_a has two connected components. We also classify the critical points of f_a in Appendix G.7.3:

Lemma 10. Rank-deficient matrices cannot be critical points of f_a. Critical points of f_a among full-rank matrices are differentiable points and either global minima or saddle points. Therefore, all local minima of f_a are global.

5.5.2 Local strong convexity on cones

Although f_a is not convex, even in a neighborhood of its minima, we will show that it is locally convex around them in the subspace orthogonal to the set of minima.

For any A ∈ R^{ℓ,r}, let

    Π_a(A) := argmin_{B ∈ Ω_a} ‖B − A‖²_F

be the set of closest minima to A, and for any R ∈ O_r, let C_a(R) be defined as follows:

    C_a(R) := { A ∈ R^{ℓ,r} | ĨR ∈ Π_a(A) }.

C_a(R) is the set of points that are associated with the same minimum, parameterized by ĨR. As shown in the following lemma, the sets C_a(R) are actually convex cones that are images of each other by orthogonal matrices; this result is essentially a consequence of the polar decomposition and of the orthogonal invariance of f_a. Let S_r^+ ⊂ R^{r,r} denote the set of positive-semidefinite matrices.
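Because ‖ĨR − A‖²_F = ‖R − A_1‖²_F + ‖A_2‖²_F only involves the top r × r block A_1 of A through its first term, computing Π_a(A) reduces to an orthogonal Procrustes problem, solved by the orthogonal polar factor of A_1. A minimal numerical sketch of this reduction (NumPy assumed, small ℓ and r chosen for illustration):

```python
import numpy as np

def proj_omega_a(A, r):
    """Pi_a(A): closest point of Omega_a = {I~R : R in O_r} to A.
    Since ||I~R - A||_F^2 = ||R - A1||_F^2 + ||A2||_F^2, a minimizer is
    R = U V^T, the orthogonal polar factor of A1 = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(A[:r, :])
    I_tilde = np.vstack([np.eye(r), np.zeros((A.shape[0] - r, r))])
    return I_tilde @ (U @ Vt)

rng = np.random.default_rng(1)
ell, r = 4, 2
I_tilde = np.vstack([np.eye(r), np.zeros((ell - r, r))])

A = rng.standard_normal((ell, r))
B = proj_omega_a(A, r)
assert np.allclose(B.T @ B, np.eye(r))  # B is indeed of the form I~R

# Sanity check of optimality: B beats random competitors I~Q, Q orthogonal
for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
    comp = np.vstack([Q, np.zeros((ell - r, r))])
    assert np.linalg.norm(B - A) <= np.linalg.norm(comp - A) + 1e-9

# If A1 is symmetric PSD, then A lies in the cone Ca(I_r): Pi_a(A) = I~
S = rng.standard_normal((r, r))
A_psd = np.vstack([S @ S.T, rng.standard_normal((ell - r, r))])
assert np.allclose(proj_omega_a(A_psd, r), I_tilde)
```

The last check previews the characterization of C_a(I_r) in the next lemma.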

Lemma 11. For each R ∈ O_r, C_a(R) is a cone in R^{ℓ,r} and:

    C_a(I_r) = { [A_1^T A_2^T]^T | A_1 ∈ S_r^+, A_2 ∈ R^{ℓ−r,r} },   (5.8)

    C_a(R) = { AR | A ∈ C_a(I_r) }   and   ⋃_{R ∈ O_r} C_a(R) = R^{ℓ,r}.

FIGURE 5.2: Schematic 2D graph of f_a, and of its restriction f_a|_{C_a(R)}, around one of the connected components of Ω_a when r ≥ 2. Here, the component of Ω_a is a circle and the cones C_a(R) are half-lines with extreme points at the origin.

Note that the cones C_a(R) do not form a partition of R^{ℓ,r}: if A_1 is not invertible, its polar decomposition is not unique, so [A_1^T A_2^T]^T lies in several cones.

However, the relative interiors of all the cones partition the set of matrices [A_1^T A_2^T]^T such that A_1 is invertible (cf. Fact 59 in Appendix G.8.1). The decomposition on these cones is motivated by the fact that, for r ≥ 2, the function f_a in a neighborhood of each of the two connected components of Ω_a can informally be thought of as having the shape of the base of a glass bottle with a punt. This is illustrated in Figure 5.2.

Thus, given R ∈ O_r, we focus on the restriction f_a|_{C_a(R)} of f_a to the cone C_a(R). The next result states in particular that f_a|_{C_a(R)} is smooth and strongly convex² in a neighborhood of ĨR.

Theorem 12. For any 0 < μ_a < s_ℓ²(1 − s_{r+1}²/s_r²), there exists a non-empty sublevel set V_a ⊂ R^{ℓ,r} of f_a such that f_a is s_1²-smooth in V_a and, for any R ∈ O_r, the restriction f_a|_{C_a(R)} is μ_a-strongly convex in V_a ∩ C_a(R).

Via τ, these properties of f_a carry over to f. Let ν_X and L_X be respectively the smallest and largest eigenvalues of X^T X, and let C(R) := τ(C_a(R) × R^{p−ℓ,r}) with τ defined in Equation (5.6).

Corollary 13. For any 0 < μ < ν_X(1 − s_{r+1}²/s_r²), there exists a non-empty sublevel set V_0 of the function f that can be partitioned into disjoint convex elements {C(R) ∩ V_0}_{R ∈ O_r}, such that f is L_X-smooth on V_0 and μ-strongly convex on every V_0 ∩ C(R).

To partially extend the previous result to (SRRR), we apply Theorem 6.4 of Bonnans and Shapiro [1998]: given that (a) the objective F_λ of (SRRR) is locally strongly convex on the cone C(I_r) around the minimum, (b) for every fixed λ in some interval [0, λ̃), f is locally Lipschitz with a constant that does not depend on λ and (c) F_λ − F_0 = λ‖·‖_{1,2} is locally Lipschitz with a constant √p λ, which is O(λ), then there exists λ̌ > 0 such that, for all 0 ≤ λ < λ̌, the minimum of F_λ in C(I_r) is a continuous function of λ. This is the content of Corollary 14.

²The definitions of μ-strong convexity, L-smoothness and sublevel sets are recalled in Appendix G.2.

These characterizations of the geometry in a neighborhood of the optima immediately lead to Polyak-Łojasiewicz inequalities that entail the linear convergence of first-order algorithms.

5.5.3 P-Ł inequalities and proofs for linear convergence rates

Polyak-Łojasiewicz (PŁ) and Kurdyka-Łojasiewicz (KŁ) inequalities were introduced to generalize, to nonconvex (or merely not strongly convex) functions, proofs of rates of convergence for first-order methods [Attouch and Bolte, 2009; Karimi et al., 2016, and references therein]. In particular, (PŁ) generalizes the fact that, for a differentiable and μ-strongly convex function f with optimal value f_∗,

    f(x) − f_∗ ≤ (1/(2μ)) ‖∇f(x)‖².   (PŁ)
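This implication can be observed numerically. The sketch below (a hypothetical quadratic instance, NumPy assumed) checks (PŁ) at random points for a μ-strongly convex quadratic, where μ is the smallest eigenvalue of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
# f(x) = 0.5 x^T H x - b^T x with H positive definite: mu = lambda_min(H)
S = rng.standard_normal((d, d))
H = S @ S.T + np.eye(d)
b = rng.standard_normal(d)
mu = np.linalg.eigvalsh(H)[0]

f = lambda x: 0.5 * x @ H @ x - b @ x
grad = lambda x: H @ x - b
f_star = f(np.linalg.solve(H, b))

# (PL) holds at every point for a mu-strongly convex differentiable function
for _ in range(1000):
    x = rng.standard_normal(d)
    assert f(x) - f_star <= grad(x) @ grad(x) / (2 * mu) + 1e-9
```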

Karimi et al. [2016] and Csiba and Richtarik [2017] proposed a generalization to a proximal PŁ inequality, of relevance for forward-backward algorithms applied to non-differentiable functions. In this section, we summarize an immediate extension, allowing a line search procedure, of results established for first-order algorithms, to prove a locally linear rate of convergence. Consider d ∈ N and a function³ F_λ = f + λh defined on R^d and with optimal value F_λ,∗, where f is an L-smooth function and h is a proper lower semi-continuous convex function. We define the t-approximations f̃_{t,x} and F̃^λ_{t,x} of f and F_λ at x as in Section 5.3.3. The t-decrease function γ_t is then defined as:

    γ_t(x) := (1/t) [ F_λ(x) − min_{x′} F̃^λ_{t,x}(x′) ].   (5.9)

Given x, assume that the minimum in Equation (5.9) is attained at a point x⁺ for some t > 0 such that the line search condition (LS), F̃^λ_{t,x}(x⁺) ≥ F_λ(x⁺), is satisfied. Then the decrease in the objective value, F_λ(x) − F_λ(x⁺), is lower bounded by t γ_t(x), hence the name t-decrease function (see Fact 41 in Appendix G.5.1). We make use of a natural generalization of the proximal PŁ inequality proposed by Karimi et al. [2016] and Csiba and Richtarik [2017]. For x such that F_λ(x) > F_λ,∗, with F_λ,∗ the minimum of F_λ, we define the t-proximal forcing function:

    α_t(x) := γ_t(x) / (F_λ(x) − F_λ,∗).

We can now state the following theorem, which bounds the optimality gap of our algorithm iteration by iteration.

Theorem 15 [from Lemma 13 in Csiba and Richtarik, 2017]. Let x ∈ R^d and x⁺ be defined by x⁺ = argmin_{x′} [F̃^λ_{t,x}(x′) − F_λ(x)], where t is chosen so that the line search condition (LS) is satisfied. Then we have:

    F_λ(x⁺) − F_λ,∗ ≤ [1 − t α_t(x)] (F_λ(x) − F_λ,∗).
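The mechanics behind this bound can be exercised on a toy composite problem. The sketch below is a hypothetical ℓ1-regularized least-squares instance (a plain ℓ1 penalty stands in for the ‖·‖_{1,2} norm of (SRRR), and F̃^λ_{t,x} is assumed to be the standard forward-backward quadratic model, whose minimizer is a proximal step); it runs the backtracking line search until (LS) holds and checks the guaranteed decrease F_λ(x) − F_λ(x⁺) ≥ t γ_t(x) at every step:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 30, 10, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

f = lambda u: 0.5 * np.sum((X @ u - y) ** 2)
grad = lambda u: X.T @ (X @ u - y)
F = lambda u: f(u) + lam * np.sum(np.abs(u))
prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

def model(u_new, u, t):
    # Quadratic t-approximation of F at u, minimized by the proximal step
    step = u_new - u
    return (f(u) + grad(u) @ step + step @ step / (2 * t)
            + lam * np.sum(np.abs(u_new)))

u = rng.standard_normal(d)
F_init = F(u)
t, beta = 1.0, 0.5
for _ in range(50):
    while True:  # backtracking until the (LS) condition F~(u+) >= F(u+) holds
        u_plus = prox(u - t * grad(u), t)
        if model(u_plus, u, t) >= F(u_plus):
            break
        t *= beta
    gamma_t = (F(u) - model(u_plus, u, t)) / t     # t-decrease function (5.9)
    assert F(u) - F(u_plus) >= t * gamma_t - 1e-9  # guaranteed decrease
    u = u_plus
assert F(u) < F_init
```

The backtracking loop always terminates because the model dominates F once t ≤ 1/L, with L the largest eigenvalue of X^T X.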

Given t > 0, we say that F_λ satisfies the (t-strong proximal PŁ) inequality in a set V ⊂ R^d if there exists α(t) > 0 such that, for any x ∈ V where F_λ(x) > F_λ,∗, we have α_t(x) ≥ α(t).

We now return to the functions f and F_λ defined for (RRR) and (SRRR), with minimal values f_∗ and F_λ,∗, and we establish the (PŁ) and (t-strong proximal PŁ) inequalities in a neighborhood of their respective global minima.

Corollary 16. Let 0 < μ < ν_X(1 − s_{r+1}²/s_r²) and V_0 as in Corollary 13. For all U ∈ V_0, we have:

    f(U) − f_∗ ≤ (1/(2μ)) ‖∇f(U)‖²_F.

In light of Corollary 14, we can also prove the (t-strong proximal PŁ) inequality for F_λ with small values of λ. To this end, we consider λ̄ > 0 as in Corollary 14.

Corollary 17. Let 0 < μ < ν_X(1 − s_{r+1}²/s_r²) and 0 ≤ λ < λ̄. For any t > 0, F_λ satisfies the (t-strong proximal PŁ) inequality with α(t) := min(1/(2t), μ). In other words, for any t > 0 and U ∈ V_λ, we have α_t(U) ≥ min(1/(2t), μ).

So, leveraging Theorem 15 and Corollary 16/17 for (RRR)/(SRRR), we obtain a linear rate of convergence, which is proved in Appendix G.10.3. Indeed, if L_X denotes the largest eigenvalue of X^T X and β the step-size decrease factor in Algorithm 2, then we have the following result.

Corollary 18. Let 0 ≤ λ < λ̄ and k ≥ 0. Assume that t_k > β/L_X.

As explained in Fact 43 in Appendix G.5.2, there is only a finite number of steps at the beginning of Algorithm 1 for which the assumption t_k > β/L_X may fail. The convergence is therefore linear. We propose a direct proof of Corollary 18 based on Corollary 17 and Theorem 15. It should be noted that the geometric structure leveraged for Corollary 17 can also be used to obtain Corollary 18 as a consequence of the Kurdyka-Łojasiewicz inequality (cf. Appendix L).
