
A RIEMANNIAN NEWTON ALGORITHM FOR NONLINEAR EIGENVALUE PROBLEMS

ZHI ZHAO, ZHENG-JIAN BAI, AND XIAO-QING JIN

Abstract. We give the formulation of a Riemannian Newton algorithm for solving a class of nonlinear eigenvalue problems by minimizing a total energy function subject to the orthogonality constraint. Under some mild assumptions, we establish the global and quadratic convergence of the proposed method. Moreover, the positive definiteness condition of the Riemannian Hessian of the total energy function at a solution is derived. Some numerical tests are reported to illustrate the efficiency of the proposed method for solving large-scale problems.

Key words. nonlinear eigenvalue problem, Riemannian Newton algorithm, Stiefel manifold, Grassmann manifold

AMS subject classifications. 15A18, 65F15, 49M15, 47J10

DOI. 10.1137/140967994

1. Introduction. We consider the following total energy minimization problem:

(1.1) $$\min_{X\in\mathbb{R}^{n\times k}} E(X) := \frac{1}{2}\operatorname{tr}(X^TLX) + \frac{\alpha}{4}\rho(X)^TL^{-1}\rho(X) \quad \text{s.t.} \quad X^TX = I_k,$$

where $X^T$ denotes the transpose of $X$, $L$ is a discrete Laplacian operator, $\alpha > 0$ is a given constant, $\rho(X) := \operatorname{diag}(XX^T)$, "s.t." means "subject to", and $I_k$ is the identity matrix of order $k$. We point out that the matrix $L$ may be singular with different boundary conditions (see [45]). In this case, we may replace $L^{-1}$ by the Moore–Penrose generalized inverse $L^{\dagger}$. The symbol $\operatorname{diag}(M) := (m_{11}, m_{22}, \ldots, m_{nn})^T$ denotes a vector containing the diagonal elements of an $n\times n$ matrix $M = [m_{ij}]$. Obviously, the first-order necessary conditions for the total energy minimization problem (1.1) are given by [33]

$$H(X)X = X\Lambda_k, \qquad X^TX = I_k,$$

where the $k$-by-$k$ real symmetric matrix $\Lambda_k$ is a Lagrange multiplier. We note that the global minimizer of the constrained minimization problem (1.1) is not unique. If $X$ is a solution, then $XQ$ is also a solution for any $k\times k$ real orthogonal matrix $Q$.

Thus a necessary condition for the global minimum of problem (1.1) takes the form of a nonlinear eigenvalue problem (NEP) [46]:

(1.2) $$H(X)X = X\Lambda_k, \qquad X^TX = I_k,$$

where the diagonal matrix $\Lambda_k \in \mathbb{R}^{k\times k}$ contains the $k$ smallest eigenvalues of the symmetric matrix $H(X) = L + \alpha\,\mathrm{Diag}(L^{-1}\rho(X)) \in \mathbb{R}^{n\times n}$. The symbol $\mathrm{Diag}(x)$ is a diagonal matrix with a vector $x$ on its diagonal. Note that the meaning of the notation $\mathrm{diag}(\cdot)$ is different from that of the notation $\mathrm{Diag}(\cdot)$.

Received by the editors May 5, 2014; accepted for publication (in revised form) by F. Tisseur March 30, 2015; published electronically June 11, 2015.

http://www.siam.org/journals/simax/36-2/96799.html

Department of Mathematics, University of Macau, Macau, People's Republic of China (zhaozhi231@163.com, xqjin@umac.mo). The research of the third author was supported by research grant MYRG098(Y2-L3)-FST13-JXQ from the University of Macau.

Corresponding author. School of Mathematical Sciences, Xiamen University, Xiamen 361005, People's Republic of China (zjbai@xmu.edu.cn). The research of this author was partially supported by National Natural Science Foundation of China grant 11271308, by NCET, and by the Fundamental Research Funds for the Central Universities (20720150001).
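To make the objects in (1.1) and (1.2) concrete, the following NumPy sketch evaluates $\rho(X)$, $H(X)$, and the total energy $E(X)$. The one-dimensional Laplacian, the value of `alpha`, and the use of a dense linear solve in place of $L^{-1}$ are illustrative assumptions on our part, not the authors' implementation.

```python
import numpy as np

def discrete_laplacian(n):
    # Assumed 1-D discrete Laplacian with Dirichlet boundary; for illustration only.
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def rho(X):
    # rho(X) = diag(X X^T): the vector of diagonal entries of X X^T.
    return np.sum(X * X, axis=1)

def H(X, L, alpha):
    # H(X) = L + alpha * Diag(L^{-1} rho(X)).
    return L + alpha * np.diag(np.linalg.solve(L, rho(X)))

def total_energy(X, L, alpha):
    # E(X) = 1/2 tr(X^T L X) + alpha/4 rho(X)^T L^{-1} rho(X).
    r = rho(X)
    return 0.5 * np.trace(X.T @ L @ X) + 0.25 * alpha * r @ np.linalg.solve(L, r)
```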

The total energy minimization problem (1.1) is a simplified version of the Hartree–Fock (HF) total energy minimization problem and the Kohn–Sham (KS) total energy minimization problem in electronic structure calculations (see, for instance, [30, 38, 44, 45]). Moreover, the NEP (1.2) is a simplified version of the associated HF and KS equations. The self-consistent field (SCF) iteration is widely used for solving the HF and KS equations, which calculates the $k$ smallest eigenvalues and associated eigenvectors of the NEP (1.2) iteratively: Given the current iterate $X_j$, compute $X_{j+1}$ such that

$$H(X_j)X_{j+1} = X_{j+1}\Lambda_k^{j+1} \quad \text{and} \quad (X_{j+1})^TX_{j+1} = I_k,$$

where $\Lambda_k^{j+1}$ contains the $k$ smallest eigenvalues of $H(X_j)$. However, the original version of the SCF iteration often fails to converge [11]. In past decades, different heuristics were developed to accelerate and stabilize the SCF iteration [24, 25]. On the convergence of the SCF iteration, one may refer to [13, 27, 46]. In [45], the SCF iteration is used as an indirect way to solve problem (1.1) by minimizing a sequence of quadratic surrogate functions.
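As a minimal sketch of the SCF iteration just described (reusing the helpers `H` and `discrete_laplacian` from the snippet above, which are our own illustrative assumptions), each step takes the eigenvectors of $H(X_j)$ associated with its $k$ smallest eigenvalues as $X_{j+1}$; as noted in the text, this plain iteration need not converge.

```python
def scf_iteration(L, k, alpha, maxit=100, tol=1e-8, seed=0):
    # SCF: X_{j+1} spans the invariant subspace of the k smallest eigenvalues of H(X_j).
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # random starting point on St(k, n)
    for _ in range(maxit):
        _, V = np.linalg.eigh(H(X, L, alpha))          # eigenvalues in ascending order
        X_new = V[:, :k]                               # eigenvectors of the k smallest eigenvalues
        # Stop when the spanned subspace has (nearly) stopped changing.
        if np.linalg.norm(X - X_new @ (X_new.T @ X)) < tol:
            return X_new
        X = X_new
    return X
```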

There are several recent optimization methods for solving the minimization problem (1.1) directly [5, 7, 25, 26, 32, 34, 35, 40, 41]. Because of the orthogonality constraint $X^TX = I_k$, those methods only use the gradient of the total energy and often converge slowly. In [44], a constrained optimization algorithm is proposed for minimizing the total energy by projecting the total energy into a sequence of subspaces and seeking the minimum point of the total energy over each subspace. In [42], a projected gradient-type method is given for minimizing a general function with the orthogonality constraint. In [16], Newton's method and the conjugate gradient (CG) method are developed on the Grassmann and Stiefel manifolds. In [28], a modified steepest descent–type method with Armijo's line search and a modified Newton method are presented on the Grassmann and Stiefel manifolds. Also, in [31], line-search, trust region, and Newton algorithms are well studied on matrix manifolds.

The SCF iteration with various trust region techniques is employed to minimize the total energy [17, 18, 39, 45]. In [19], a Newton method is presented for solving a class of NEPs arising from electronic structure calculation, which is efficient only for small-scale problems.

In this paper, we propose a Riemannian Newton algorithm for solving the total energy minimization problem (1.1) over the Grassmann manifold related to the Stiefel manifold $\mathrm{St}(k, n) := \{X \in \mathbb{R}^{n\times k} \mid X^TX = I_k\}$. This is sparked by two recent papers, [27] and [19]. In [27], the convergence condition of the SCF iteration is related to the Hessian of the total energy. In [19], the NEP is viewed as a system of nonlinear equations, and then a Newton method is used for solving it. Therefore, in this paper, we first construct the Grassmann manifold from the Stiefel manifold $\mathrm{St}(k, n)$ based on an orthogonal equivalence relation and a Riemannian metric. Then we propose a Riemannian Newton algorithm for solving problem (1.1) over the Grassmann manifold. In particular, we combine the Riemannian Newton algorithm with the Riemannian line search technique. Sparked by [2, 16, 28], we use the CG method [20, Algorithm 10.2.1] to solve each Newton equation inexactly, so that we do not need the inverse of the Riemannian Hessian of the total energy function and thus the computational complexity is reduced. Also, the Riemannian line search guarantees that the proposed method will converge to a local minimum [28].


Under some mild conditions, we show that the proposed Riemannian Newton algorithm converges globally and quadratically. Moreover, we give the positive definiteness condition of the Riemannian Hessian of the total energy function at a solution. Some numerical experiments are reported to demonstrate the efficiency of our method for solving large-scale problems.

The rest of this paper is organized as follows. In section 2 we review some preliminary results on Riemannian manifolds. In section 3 we present a Riemannian Newton algorithm for solving the minimization problem (1.1) over the Grassmann manifold related to the Stiefel manifold $\mathrm{St}(k, n)$. In section 4 we give a convergence analysis. In section 5 we investigate the positive definiteness condition of the Riemannian Hessian of the total energy function in problem (1.1) over the Grassmann manifold. In section 6 we report some numerical results, and finally we give some concluding remarks in section 7.

2. Preliminaries. In this section, we recall some basic concepts and results on Riemannian manifolds [1, 2]. Let $\mathcal{M}$ be a $d$-dimensional manifold. Let $\mathcal{R}_x(\mathcal{M})$ be the set of all smooth real-valued functions defined on a neighborhood of a point $x \in \mathcal{M}$. A tangent vector $\xi_x$ to $\mathcal{M}$ at $x$ is defined as a mapping from $\mathcal{R}_x(\mathcal{M})$ to $\mathbb{R}$ such that

$$\xi_x f = \dot{\gamma}(0)f := \frac{d}{dt} f(\gamma(t))\Big|_{t=0} \quad \forall f \in \mathcal{R}_x(\mathcal{M})$$

for some smooth curve $\gamma$ on $\mathcal{M}$ with $\gamma(0) = x$. The tangent space $T_x\mathcal{M}$ to $\mathcal{M}$ at $x$ consists of all tangent vectors to $\mathcal{M}$ at $x$. Denote by $T\mathcal{M}$ the tangent bundle of $\mathcal{M}$:

$$T\mathcal{M} := \bigcup_{x\in\mathcal{M}} T_x\mathcal{M}.$$

A vector field on $\mathcal{M}$ is a smooth function $\xi: \mathcal{M} \to T\mathcal{M}$ such that $\xi(x) = \xi_x \in T_x\mathcal{M}$ for all $x \in \mathcal{M}$. A Riemannian metric $g$ on $\mathcal{M}$ is a family of inner products,

$$g_x : T_x\mathcal{M} \times T_x\mathcal{M} \to \mathbb{R}, \quad x \in \mathcal{M},$$

where the inner product $g_x(\cdot,\cdot)$ varies smoothly and induces a norm $\|\xi_x\|_x = \sqrt{g_x(\xi_x, \xi_x)}$ on $T_x\mathcal{M}$. Thus, $(\mathcal{M}, g)$ is a Riemannian manifold [2, p. 45].

Let $\mathcal{M}$ and $\mathcal{L}$ be two manifolds. Let $G:\mathcal{M} \to \mathcal{L}$ be a smooth mapping. Then the differential $\mathrm{D}G(x)$ of $G$ at $x \in \mathcal{M}$ is a mapping from $T_x\mathcal{M}$ to $T_{G(x)}\mathcal{L}$ such that

$$\mathrm{D}G(x)[\xi_x] \in T_{G(x)}\mathcal{L} \quad \forall \xi_x \in T_x\mathcal{M},$$

where $\mathrm{D}G(x)[\xi_x]$ is a tangent vector to $\mathcal{L}$ at $G(x) \in \mathcal{L}$, which is a mapping from $\mathcal{R}_{G(x)}(\mathcal{L})$ to $\mathbb{R}$ defined by

$$\mathrm{D}G(x)[\xi_x]f = \xi_x(f\circ G) \quad \forall f \in \mathcal{R}_{G(x)}(\mathcal{L}).$$

Given a Riemannian manifold $(\mathcal{M}, g)$ with a Riemannian connection $\nabla$ (see, for instance, [2, 10]), let $f :\mathcal{M} \to \mathbb{R}$ be a smooth function. Then the Riemannian gradient $\operatorname{grad} f(x)$ of $f$ at $x \in \mathcal{M}$ is defined as the unique element in $T_x\mathcal{M}$ such that

$$g_x(\operatorname{grad} f(x), \xi_x) = \mathrm{D}f(x)[\xi_x] \quad \forall \xi_x \in T_x\mathcal{M}.$$

The Riemannian Hessian of $f$ at $x \in \mathcal{M}$ is defined as the linear mapping from $T_x\mathcal{M}$ to $T_x\mathcal{M}$ such that [2, Definition 5.5.1]

$$\operatorname{Hess} f(x)[\xi_x] = \nabla_{\xi_x}\operatorname{grad} f(x) \quad \forall \xi_x \in T_x\mathcal{M}.$$


The concept of retraction originally appeared in the field of algebraic topology [21]. Here, we adopt the following definition of retraction [2, 4, 37].

Definition 2.1. Let $\mathcal{M}$ be a manifold. Let $R$ be a mapping from $T\mathcal{M}$ onto $\mathcal{M}$. Let $R_x$ denote the restriction of $R$ to $T_x\mathcal{M}$. We say that $R$ is a retraction on $\mathcal{M}$ if

(i) $R$ is smooth,

(ii) $R_x(0_x) = x$, where $0_x$ is the origin of $T_x\mathcal{M}$,

(iii) $\mathrm{D}R_x(0_x) = \mathrm{id}_{T_x\mathcal{M}}$, where $\mathrm{id}_{T_x\mathcal{M}}$ is the identity mapping on $T_x\mathcal{M}$, with the canonical identification $T_{0_x}T_x\mathcal{M} \simeq T_x\mathcal{M}$.

For a real-valued function $f$ on the manifold $\mathcal{M}$ and a retraction $R$ on $\mathcal{M}$, we define the pullback $\widehat{f}$ of $f$ as the mapping from $T\mathcal{M}$ to $\mathbb{R}$ such that

(2.1) $$\widehat{f}(\xi) = f(R(\xi)) \quad \forall \xi \in T\mathcal{M},$$

and let $\widehat{f}_x$ denote the restriction of $\widehat{f}$ to $T_x\mathcal{M}$, which is defined by $\widehat{f}_x(\xi_x) = f(R_x(\xi_x))$ for all $\xi_x \in T_x\mathcal{M}$.

On the Riemannian distance to a nondegenerate local minimizer $x^*$ of a smooth real-valued function $f$ on $(\mathcal{M}, g)$, we have the following lemma [2, Lemma 7.4.8].

Lemma 2.2. Let $x^* \in \mathcal{M}$ and let $f :\mathcal{M} \to \mathbb{R}$ be a $C^2$ function (its first and second derivatives are continuous) such that $\operatorname{grad} f(x^*) = 0$ and $\operatorname{Hess} f(x^*)$ is positive definite with maximal and minimal eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$. Then, given two positive scalars $\tau_0, \tau_1$ with $\tau_0 < \lambda_{\min}$ and $\tau_1 > \lambda_{\max}$, there exists a neighborhood $\mathcal{N}(x^*)$ of $x^*$ such that

$$\tau_0\,\mathrm{dist}(x, x^*) \le \|\operatorname{grad} f(x)\| \le \tau_1\,\mathrm{dist}(x, x^*) \quad \forall x \in \mathcal{N}(x^*),$$

where $\mathrm{dist}(\cdot,\cdot)$ means the Riemannian distance on $(\mathcal{M}, g)$ [2, p. 46].

On a relation between the Riemannian gradient of a smooth function $f$ on $\mathcal{M}$ at $R_x(\xi)$ and the gradient of $\widehat{f}_x$ at $\xi \in T_x\mathcal{M}$ with $\|\xi\| \le \delta$ for some $\delta > 0$, we have the following special result [2, Lemma 7.4.9].

Lemma 2.3. Let $R$ be a retraction on $\mathcal{M}$ and let $f$ be a continuously differentiable cost function on $\mathcal{M}$. Then for any given $\bar{x} \in \mathcal{M}$ and a scalar $\tau_2 > 1$, there exist a neighborhood $\mathcal{N}(\bar{x})$ of $\bar{x}$ and $\delta > 0$ such that

$$\|\operatorname{grad} f(R_x(\xi))\| \le \tau_2\|\operatorname{grad}\widehat{f}_x(\xi)\|$$

for all $x \in \mathcal{N}(\bar{x})$ and all $\xi \in T_x\mathcal{M}$ with $\|\xi\| \le \delta$, where $\widehat{f}$ is defined as in (2.1).

3. Riemannian Newton algorithm. In this section, we propose a Riemannian Newton algorithm for solving the total energy minimization problem (1.1). We first construct a Grassmann manifold from the Stiefel manifold St(k, n). Then, based on the induced Grassmann manifold, we give a matrix-form Riemannian Newton algorithm for solving problem (1.1).

3.1. The Grassmann manifold. We observe the fact that the function $E :\mathrm{St}(k, n) \to \mathbb{R}$ defined in problem (1.1) is such that for any given $X \in \mathrm{St}(k, n)$, $E(X) = E(XQ)$ for all $Q \in \mathcal{O}_k$, where $\mathcal{O}_k$ is the set of all $k\times k$ orthogonal matrices. Thus, the global minimizer of problem (1.1) is not unique and is not isolated. The Riemannian Hessian of $E$ must be singular, which causes trouble for applying a Riemannian Newton algorithm to problem (1.1). To overcome this difficulty, we construct a Grassmann manifold $\mathcal{Q}$ from the Stiefel manifold $\mathrm{St}(k, n)$ under the operation of the orthogonal group $\mathcal{O}_k$. We define a quotient manifold by

(3.1) $$\mathcal{Q} := \mathrm{St}(k, n)/\mathcal{O}_k,$$


based on the following equivalence relation on $\mathrm{St}(k, n)$:

$$X \sim Y \iff \{XQ \mid Q \in \mathcal{O}_k\} = \{YQ \mid Q \in \mathcal{O}_k\}.$$

Then we have $\mathcal{Q} := \{[X] : X \in \mathrm{St}(k, n)\}$, where

$$[X] := \{Y \in \mathrm{St}(k, n) \mid Y = XQ,\ Q \in \mathcal{O}_k\}$$

is the equivalence class containing $X$. The natural projection is defined as the mapping from $\mathrm{St}(k, n)$ to $\mathcal{Q}$ such that

$$\pi(X) = [X] \quad \forall X \in \mathrm{St}(k, n).$$

Moreover, we have $[X] = \pi^{-1}(\pi(X))$ and $\dim\pi^{-1}(\pi(X)) = \dim\mathcal{O}_k = \tfrac{1}{2}k(k-1)$. Since $\mathrm{St}(k, n)$ is the total space of $\mathcal{Q}$, we have [2, Proposition 3.4.4]

$$\dim\mathcal{Q} = \dim\mathrm{St}(k, n) - \dim\pi^{-1}(\pi(X)) = nk - \tfrac{1}{2}k(k+1) - \tfrac{1}{2}k(k-1) = k(n-k).$$

The function $E:\mathrm{St}(k, n) \to \mathbb{R}$ induces a unique function $E:\mathcal{Q} \to \mathbb{R}$, denoted by the same symbol, such that

$$E([X]) = E\big(\pi^{-1}([X])\big) \quad \text{and} \quad E(X) = E\big(\pi(X)\big).$$

Hence, problem (1.1) can be written as the following minimization problem:

(3.2) $$\min E(\mathcal{X}) \quad \text{s.t.} \quad \mathcal{X} \in \mathcal{Q}.$$

For any $\mathcal{X} \in \mathcal{Q}$, let $\xi_{\mathcal{X}}$ be an element of $T_{\mathcal{X}}\mathcal{Q}$, and let $X$ be an element in the equivalence class $\pi^{-1}(\mathcal{X})$, which is an embedded submanifold of $\mathrm{St}(k, n)$. Any element $\xi_X \in T_X\mathrm{St}(k, n)$ that satisfies $\mathrm{D}\pi(X)[\xi_X] = \xi_{\mathcal{X}}$ can be considered a representation of $\xi_{\mathcal{X}}$. For any smooth function $f :\mathcal{Q} \to \mathbb{R}$, the function $\bar{f} := f\circ\pi:\mathrm{St}(k, n) \to \mathbb{R}$ is smooth [2, Proposition 3.4.5]. Moreover,

$$\mathrm{D}\bar{f}(X)[\xi_X] = \mathrm{D}f\big(\pi(X)\big)\big[\mathrm{D}\pi(X)[\xi_X]\big] = \mathrm{D}f(\mathcal{X})[\xi_{\mathcal{X}}].$$

Since there are infinitely many valid representations $\xi_X$ of $\xi_{\mathcal{X}}$ at $X$, we need to define the vertical space and horizontal space at the point $X$ [2, p. 48]. Note that the tangent space to $\mathrm{St}(k, n)$ at $X \in \mathrm{St}(k, n)$ is given by [2, p. 42]

$$T_X\mathrm{St}(k, n) = \{Z \in \mathbb{R}^{n\times k} : X^TZ + Z^TX = 0\} = \{X\Omega + X_\perp K : \Omega^T = -\Omega,\ K \in \mathbb{R}^{(n-k)\times k}\},$$

where $X_\perp \in \mathbb{R}^{n\times(n-k)}$ is such that $\mathrm{span}(X_\perp)$ is the orthogonal complement of $\mathrm{span}(X)$.

Also, a Riemannian metric $g$ on $\mathrm{St}(k, n)$ is defined by

$$g_X(Z_1, Z_2) := \operatorname{tr}(Z_1^TZ_2) \quad \forall Z_1, Z_2 \in T_X\mathrm{St}(k, n),\ X \in \mathrm{St}(k, n),$$

with its induced Frobenius norm $\|\cdot\|_X$. Thus, the vertical space at $X$ is defined as

$$\mathcal{V}_X := T_X\big(\pi^{-1}(\mathcal{X})\big) = \{X\Omega : \Omega^T = -\Omega,\ \Omega \in \mathbb{R}^{k\times k}\}.$$


We can set the horizontal space $\mathcal{H}_X$ at $X$ to be

$$\mathcal{H}_X := \mathcal{V}_X^{\perp} = \{\xi_X \in T_X\mathrm{St}(k, n) : g_X(\xi_X, \nu_X) = 0\ \forall \nu_X \in \mathcal{V}_X\} = \{\xi_X \in T_X\mathrm{St}(k, n) : X^T\xi_X = 0\} = \{X_\perp K : K \in \mathbb{R}^{(n-k)\times k}\}.$$

Then for any $\xi_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}$, there exists a unique element $\bar{\xi}_X \in \mathcal{H}_X$ such that $\mathrm{D}\pi(X)[\bar{\xi}_X] = \xi_{\mathcal{X}}$, and $\bar{\xi}_X$ is called the horizontal lift of $\xi_{\mathcal{X}}$ at $X$. Moreover, the orthogonal projection of any element $\eta_X \in T_X\mathrm{St}(k, n)$ onto $\mathcal{H}_X$ at $X$ is given by

$$P^h_X\eta_X = (I_n - XX^T)\eta_X.$$
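In matrix terms the horizontal projection is a rank-$k$ update; the following one-line sketch (our own, written to avoid forming the $n\times n$ matrix $I_n - XX^T$ explicitly) is reused by the later snippets.

```python
def proj_horizontal(X, eta):
    # P^h_X(eta) = (I_n - X X^T) eta, computed as eta - X (X^T eta).
    return eta - X @ (X.T @ eta)
```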

Now, we define a Riemannian metric $g$ on the quotient manifold $\mathcal{Q}$ by

(3.3) $$g_{\mathcal{X}}(\xi_{\mathcal{X}}, \zeta_{\mathcal{X}}) := g_X(\bar{\xi}_X, \bar{\zeta}_X), \quad \xi_{\mathcal{X}}, \zeta_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q},\ \mathcal{X} \in \mathcal{Q},$$

where $\bar{\xi}_X, \bar{\zeta}_X \in \mathcal{H}_X$ are the unique horizontal lifts of $\xi_{\mathcal{X}}, \zeta_{\mathcal{X}}$ at $X$, respectively. Since $X \in \pi^{-1}(\mathcal{X})$ implies that $XQ$ is in $\pi^{-1}(\mathcal{X})$ for any $Q \in \mathcal{O}_k$, we need to show that

(3.4) $$g_{XQ}(\bar{\xi}_{XQ}, \bar{\zeta}_{XQ}) = g_X(\bar{\xi}_X, \bar{\zeta}_X) \quad \forall Q \in \mathcal{O}_k.$$

To verify (3.4), we first establish the following result. The proof is similar to that of Proposition 3.6.1 in [1], and we therefore omit it.

Proposition 3.1. Let $X \in \mathrm{St}(k, n)$, $\mathcal{X} = \pi(X)$, and $\xi_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}$. Then it holds that

$$\bar{\xi}_{XQ} = \bar{\xi}_X\cdot Q$$

for all $Q \in \mathcal{O}_k$, where the center dot denotes matrix multiplication, and

$$g_{XQ}(\bar{\xi}_{XQ}, \bar{\zeta}_{XQ}) = g_X(\bar{\xi}_X, \bar{\zeta}_X)$$

for all $\xi_{\mathcal{X}}, \zeta_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}$.

Thus the quotient manifold $\mathcal{Q}$ endowed with the Riemannian metric $g$ defined in (3.3) is a Grassmann manifold.

Next, we define a second-order retraction $R$ on $(\mathcal{Q}, g)$ as follows:

(3.5) $$R_{\mathcal{X}}(\xi_{\mathcal{X}}) := \pi\big(R_X(\bar{\xi}_X)\big) \quad \forall \xi_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q},$$

where $\mathcal{X} = \pi(X) \in \mathcal{Q}$, $\bar{\xi}_X \in \mathcal{H}_X$ is the horizontal lift of $\xi_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}$ at $X$, and $R$ is a second-order retraction on $\mathrm{St}(k, n)$, which is defined by [1, 3]

(3.6) $$R_X(Z) = \sum_{i=1}^{k} \bar{u}_i\bar{v}_i^T \quad \forall Z \in T_X\mathrm{St}(k, n),$$

where $\{\bar{u}_i\}_{i=1}^k$ and $\{\bar{v}_i\}_{i=1}^k$ are the left and right singular vectors corresponding to the largest $k$ singular values of $X + Z$, which admits the singular value decomposition [20]

$$X + Z = U\Sigma V^T, \quad \Sigma = \mathrm{Diag}\big(\bar{\sigma}_1(X+Z), \ldots, \bar{\sigma}_k(X+Z)\big) \in \mathbb{R}^{n\times k}.$$

Here, $\bar{\sigma}_1(X+Z) \ge \bar{\sigma}_2(X+Z) \ge \cdots \ge \bar{\sigma}_k(X+Z) > 0$, and $U = [\bar{u}_1, \ldots, \bar{u}_n] \in \mathbb{R}^{n\times n}$ and $V = [\bar{v}_1, \ldots, \bar{v}_k] \in \mathbb{R}^{k\times k}$ are orthogonal matrices. The retraction $R$ on $\mathrm{St}(k, n)$


may reduce the computational complexity and accelerate convergence [28, 29]. In our numerical experiments, we use the retraction $R$. Obviously, for the retraction $R$ defined in (3.6), we have $\pi\big(R_{X_a}(\bar{\xi}_{X_a})\big) = \pi\big(R_{X_b}(\bar{\xi}_{X_b})\big)$ for all $X_a, X_b \in \pi^{-1}(\mathcal{X})$. Thus $R$ defined by (3.5) is a retraction on $\mathcal{Q}$ [2, Proposition 4.1.3].
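The retraction (3.6) can be evaluated from a thin singular value decomposition of $X + Z$; the sketch below (NumPy, illustrative only) returns $\sum_{i=1}^k \bar{u}_i\bar{v}_i^T$.

```python
def retraction(X, Z):
    # R_X(Z) = sum_{i=1}^k u_i v_i^T, with u_i, v_i the dominant singular vectors of X + Z.
    U, _, Vt = np.linalg.svd(X + Z, full_matrices=False)   # thin SVD: U is n-by-k
    return U @ Vt
```

When $X + Z$ has full column rank, this product of the singular vector factors is the closest matrix with orthonormal columns to $X + Z$ in the Frobenius norm, which is one way to view this retraction.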

For the function $E :\mathrm{St}(k, n) \to \mathbb{R}$ defined in problem (1.1), we have $E = E\circ\pi$, where $E :\mathcal{Q} \to \mathbb{R}$ is defined in problem (3.2). By the smoothness of $E$ on $\mathrm{St}(k, n)$, we know that $E$ is smooth on $\mathcal{Q}$.

3.2. Riemannian gradient and Riemannian Hessian of E. We give explicit formulas for the Riemannian gradient and the Riemannian Hessian of the cost function $E$ defined in problem (3.2). To do so, we define the extended function $\overline{E}:\mathbb{R}^{n\times k} \to \mathbb{R}$ by

$$\overline{E}(X) = \frac{1}{2}\operatorname{tr}(X^TLX) + \frac{\alpha}{4}\rho(X)^TL^{-1}\rho(X) \quad \forall X \in \mathbb{R}^{n\times k}.$$

Then $E$ is the restriction of $\overline{E}$ onto $\mathrm{St}(k, n)$, i.e., $E = \overline{E}|_{\mathrm{St}(k,n)}$. By simple calculation, the gradient of $\overline{E}$ at $X \in \mathbb{R}^{n\times k}$ is given by [2, p. 48]

$$\operatorname{grad}\overline{E}(X) = H(X)X.$$

Since $\mathrm{St}(k, n)$ is a Riemannian submanifold of $\mathbb{R}^{n\times k}$, the Riemannian gradient of $E$ at $X \in \mathrm{St}(k, n)$ is given by

$$\operatorname{grad} E(X) = \mathrm{P}_X\big(\operatorname{grad}\overline{E}(X)\big) = P^h_X\big(H(X)X\big),$$

where $\mathrm{P}_X$ denotes the orthogonal projection onto $T_X\mathrm{St}(k, n)$, which is given by

$$\mathrm{P}_XZ = (I_n - XX^T)Z + X\,\mathrm{skew}(X^TZ) = Z - X\,\mathrm{sym}(X^TZ) \quad \forall Z \in \mathbb{R}^{n\times k}.$$

Here, $\mathrm{skew}(A) := (A - A^T)/2$ and $\mathrm{sym}(A) := (A + A^T)/2$. Therefore, for any $\mathcal{X} \in \mathcal{Q}$ and $X \in \pi^{-1}(\mathcal{X})$, the unique horizontal lift at $X \in \mathrm{St}(k, n)$ of the Riemannian gradient $\operatorname{grad} E(\mathcal{X})$ of $E$ at $\mathcal{X}$ is given by

(3.7) $$\overline{\operatorname{grad} E(\mathcal{X})}_X = \operatorname{grad} E(X) = P^h_X\big(H(X)X\big) = (I_n - XX^T)H(X)X.$$
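A direct transcription of (3.7), reusing the helpers `H` and `proj_horizontal` defined in the earlier sketches (illustrative assumptions of ours, not the authors' code):

```python
def riemannian_gradient(X, L, alpha):
    # Horizontal lift of grad E at X: (I_n - X X^T) H(X) X, cf. (3.7).
    return proj_horizontal(X, H(X, L, alpha) @ X)
```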

Let $\nabla$ and $\widetilde{\nabla}$ be the Riemannian connections on $\mathcal{Q}$ and $\mathrm{St}(k, n)$, respectively. The Riemannian Hessian of $E$ at $\mathcal{X} \in \mathcal{Q}$ is given by

$$\operatorname{Hess} E(\mathcal{X})[Z_{\mathcal{X}}] = \nabla_{Z_{\mathcal{X}}}\operatorname{grad} E(\mathcal{X}) \quad \forall Z_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}.$$

Since $P^h_X\mathrm{P}_XZ = P^h_XZ$ for all $Z \in T_X\mathrm{St}(k, n)$, we have [2, equation (5.15) and Proposition 5.3.3]

$$\overline{\operatorname{Hess} E(\mathcal{X})[Z_{\mathcal{X}}]}_X = \overline{\nabla_{Z_{\mathcal{X}}}\operatorname{grad} E(\mathcal{X})}_X = P^h_X\big(\widetilde{\nabla}_{\bar{Z}_X}\operatorname{grad} E(X)\big) = P^h_X\Big(\mathrm{P}_X\big(\mathrm{D}\operatorname{grad} E(X)[\bar{Z}_X]\big)\Big) = P^h_X\big(\mathrm{D}\operatorname{grad} E(X)[\bar{Z}_X]\big),$$

where $\mathrm{D}\operatorname{grad} f(x)[\xi_x]$ means the classical directional derivative. By (3.7), we get

$$\mathrm{D}\operatorname{grad} E(X)[\bar{Z}_X] = -\big(X\bar{Z}_X^T + \bar{Z}_XX^T\big)H(X)X + 2\alpha\big(I_n - XX^T\big)\mathrm{Diag}\big(L^{-1}\mathrm{diag}(X\bar{Z}_X^T)\big)X + \big(I_n - XX^T\big)H(X)\bar{Z}_X.$$


Thus

(3.8) $$\overline{\operatorname{Hess} E(\mathcal{X})[Z_{\mathcal{X}}]}_X = P^h_X\Big(-\bar{Z}_XX^TH(X)X + 2\alpha\,\mathrm{Diag}\big(L^{-1}\mathrm{diag}(X\bar{Z}_X^T)\big)X + H(X)\bar{Z}_X\Big),$$

where the fact that $P^h_XX = 0$ is used.
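The Hessian formula (3.8) translates into the following action on a horizontal direction $Z$ (a sketch reusing the earlier helpers; the dense solve with $L$ is again an illustrative shortcut):

```python
def hessian_action(X, Z, L, alpha):
    # Horizontal lift of Hess E(X)[Z] for horizontal Z (X^T Z = 0), cf. (3.8).
    HX = H(X, L, alpha)
    W = (-Z @ (X.T @ (HX @ X))                                          # -Z X^T H(X) X
         + 2.0 * alpha * (np.linalg.solve(L, np.sum(X * Z, axis=1))[:, None] * X)
         + HX @ Z)                                                      # + H(X) Z
    return proj_horizontal(X, W)
```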

We remark that the Newton equation on the Grassmann manifold $\mathcal{Q}$ at the point $\mathcal{X} \in \mathcal{Q}$ is given by [2, p. 113]

$$\operatorname{Hess} E(\mathcal{X})[Z_{\mathcal{X}}] = -\operatorname{grad} E(\mathcal{X}), \quad Z_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}.$$

Taking the horizontal lift yields

$$\overline{\operatorname{Hess} E(\mathcal{X})[Z_{\mathcal{X}}]}_X = -\overline{\operatorname{grad} E(\mathcal{X})}_X,$$

or

$$P^h_X\Big(-\bar{Z}_XX^TH(X)X + 2\alpha\,\mathrm{Diag}\big(L^{-1}\mathrm{diag}(X\bar{Z}_X^T)\big)X + H(X)\bar{Z}_X\Big) = -P^h_X\big(H(X)X\big)$$

for $\bar{Z}_X \in \mathcal{H}_X$.

3.3. Riemannian Newton algorithm. Without causing any confusion, we use $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ to denote the Riemannian metrics and their induced norms on $\mathrm{St}(k, n)$ and $\mathcal{Q}$, respectively. Based on the discussion in section 3.2, we describe a matrix-form Riemannian Newton algorithm for solving the minimization problem (3.2).

Algorithm 3.2 (a matrix-form Riemannian Newton algorithm).

Step 0. Given $X_0 \in \mathrm{St}(k, n)$, $\beta, \eta \in (0,1)$, $\sigma \in (0, 1/2]$, and $j := 0$.

Step 1. Apply the CG method [20, Algorithm 10.2.1] to solve

(3.9) $$P^h_{X_j}\big(\mathrm{D}\operatorname{grad} E(X_j)[\Delta X_j]\big) + \operatorname{grad} E(X_j) = 0$$

for $\Delta X_j \in \mathcal{H}_{X_j}$ such that

(3.10) $$\big\|P^h_{X_j}\big(\mathrm{D}\operatorname{grad} E(X_j)[\Delta X_j]\big) + \operatorname{grad} E(X_j)\big\| \le \eta_j\|\operatorname{grad} E(X_j)\|$$

and

(3.11) $$\langle\operatorname{grad} E(X_j), \Delta X_j\rangle \le -\eta_j\langle\Delta X_j, \Delta X_j\rangle,$$

where $\eta_j := \min\{\eta, \|\operatorname{grad} E(X_j)\|\}$. If (3.10) and (3.11) are not attainable, then let $\Delta X_j := -\operatorname{grad} E(X_j)$.

Step 2. Let $l_j$ be the smallest nonnegative integer $l$ such that

(3.12) $$E\big(R_{X_j}(\beta^l\Delta X_j)\big) - E(X_j) \le \sigma\beta^l\langle\operatorname{grad} E(X_j), \Delta X_j\rangle.$$

Set $X_{j+1} := R_{X_j}(\beta^{l_j}\Delta X_j)Q_j$ for some $Q_j \in \mathcal{O}_k$.

Step 3. Replace $j$ by $j+1$ and go to Step 1.
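The following sketch assembles the earlier pieces into a rough implementation of Algorithm 3.2, reusing `total_energy`, `riemannian_gradient`, `hessian_action`, `proj_horizontal`, and `retraction` from the previous snippets. The parameter values, stopping tolerance, CG iteration cap, and the choice $Q_j = I_k$ are our own illustrative assumptions, not prescriptions from the paper.

```python
def riemannian_newton(L, k, alpha, beta=0.5, eta=0.1, sigma=1e-4,
                      maxit=50, tol=1e-8, cg_maxit=100, seed=0):
    # Sketch of Algorithm 3.2: inexact Newton steps via CG plus Armijo backtracking.
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))          # X_0 on St(k, n)
    for _ in range(maxit):
        G = riemannian_gradient(X, L, alpha)
        gnorm = np.linalg.norm(G)
        if gnorm < tol:
            break
        eta_j = min(eta, gnorm)
        # Step 1: CG on the horizontal space for Hess E(X)[D] = -grad E(X).
        D = np.zeros_like(X)
        R = -G.copy()
        P = R.copy()
        rs = np.sum(R * R)
        for _ in range(cg_maxit):
            HP = hessian_action(X, P, L, alpha)
            curv = np.sum(P * HP)
            if curv <= 0.0:                                    # stop on nonpositive curvature
                break
            a = rs / curv
            D += a * P
            R -= a * HP
            rs_new = np.sum(R * R)
            if np.sqrt(rs_new) <= eta_j * gnorm:               # inexactness test (3.10)
                break
            P = R + (rs_new / rs) * P
            rs = rs_new
        # Fall back to steepest descent if (3.10) or (3.11) fails.
        res = np.linalg.norm(hessian_action(X, D, L, alpha) + G)
        if res > eta_j * gnorm or np.sum(G * D) > -eta_j * np.sum(D * D):
            D = -G
        # Step 2: Armijo backtracking along the retraction, cf. (3.12), with Q_j = I_k.
        E0, gd, t = total_energy(X, L, alpha), np.sum(G * D), 1.0
        while total_energy(retraction(X, t * D), L, alpha) - E0 > sigma * t * gd:
            t *= beta
            if t < 1e-12:
                break
        X = retraction(X, t * D)
    return X
```

Note that CG only requires applications of the Hessian via `hessian_action`, never its inverse, which reflects the point made in the introduction about reducing the computational complexity.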

We remark that Algorithm 3.2 is a numerically realizable Riemannian Newton algorithm for solving the minimization problem (3.2). Suppose that $\{X_j\}$ and $\{Y_j\}$ are two sequences generated by Algorithm 3.2. If $[X_0] = [Y_0]$, then $[X_j] = [Y_j]$ for all $j$. Thus, Algorithm 3.2 returns a sequence $\{[X_j]\} \subset \mathcal{Q}$ by taking $\mathcal{X}_0 = [X_0] \in \mathcal{Q}$, where $X_0 \in \mathrm{St}(k, n)$. We also point out that our method has some advantages over classical equality-constrained optimization methods: (1) A nice feature is that the generated iterates are all feasible. (2) As shown in section 4, our method converges globally and quadratically as an unconstrained optimization on a constrained set. (3) No additional Lagrange multipliers or penalty functions are required. Finally, the numerical tests in section 6 show the efficiency of our method over the classical interior-point method [12].

4. Convergence analysis. In this section, we establish the global and quadratic convergence of Algorithm 3.2. As in (2.1), we have the following equality on the Riemannian gradient of $E$ and its pullback function $\widehat{E}$ through the retraction $R$ defined in (3.5) [2, p. 56]:

(4.1) $$\operatorname{grad} E(\mathcal{X}) = \operatorname{grad}\widehat{E}_{\mathcal{X}}(0_{\mathcal{X}}) \quad \forall \mathcal{X} \in \mathcal{Q},\ 0_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}.$$

For the second-order retraction $R$ on $\mathcal{Q}$ defined in (3.5), we have [2, Proposition 5.5.5]

(4.2) $$\operatorname{Hess} E(\mathcal{X}) = \operatorname{Hess}\widehat{E}_{\mathcal{X}}(0_{\mathcal{X}}) \quad \forall \mathcal{X} \in \mathcal{Q},\ 0_{\mathcal{X}} \in T_{\mathcal{X}}\mathcal{Q}.$$

4.1. Global convergence. On the global convergence of Algorithm 3.2, we have the following result. The proof follows that of Theorem 11(a) in [15].

Theorem 4.1. Any accumulation point $X^*$ of the sequence $\{X_j\}$ generated by Algorithm 3.2 produces a stationary point $\mathcal{X}^* := [X^*]$ of the cost function $E$ defined in problem (3.2).

Proof. Suppose that $\{X_j\} \to X^*$, renumbering if necessary. If there exists a subsequence $\{\Delta X_j\}_{\mathcal{J}}$ such that $\Delta X_j = -\operatorname{grad} E(X_j)$ for all $j \in \mathcal{J}$, then $X^*$ is a stationary point of $E$. We note that $\mathrm{D}\pi(X^*)[\operatorname{grad} E(X^*)] = \operatorname{grad} E([X^*])$. Hence, $[X^*]$ is a stationary point of $E$. Therefore, without loss of generality, to prove the theorem we only need to consider the case in which the direction is always given by (3.9). To verify that $\operatorname{grad} E([X^*]) = 0$, we only need to show that $\operatorname{grad} E(X^*) = 0$. By contradiction, we assume that $\operatorname{grad} E(X^*) \ne 0$. Let $\mathcal{X}_j := [X_j]$ for all $j$. By (3.3) and (3.10), we have

(4.3) $$\|\operatorname{grad} E(X_j)\| \le \|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\| + \|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j] + \operatorname{grad} E(X_j)\| \le \|\operatorname{Hess} E(\mathcal{X}_j)\|\cdot\|\Delta X_j\| + \eta_j\|\operatorname{grad} E(X_j)\| \le \|\operatorname{Hess} E(\mathcal{X}_j)\|\cdot\|\Delta X_j\| + \eta\|\operatorname{grad} E(X_j)\|,$$

where $0 < \eta_j \le \eta < 1$ and $\|\operatorname{Hess} E(\mathcal{X}_j)\|$ denotes the operator norm defined by

$$\|\operatorname{Hess} E(\mathcal{X}_j)\| := \sup\big\{\|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\| : \Delta X_j \in T_{X_j}\mathrm{St}(k, n),\ \|\Delta X_j\| = 1\big\}.$$

It follows from (4.3) that

(4.4) $$\|\Delta X_j\| \ge \frac{(1-\eta)\|\operatorname{grad} E(X_j)\|}{\|\operatorname{Hess} E(\mathcal{X}_j)\|},$$

where $\|\operatorname{Hess} E(\mathcal{X}_j)\| > 0$ for all $j$. Otherwise, if $\|\operatorname{Hess} E(\mathcal{X}_j)\| = 0$ for some $j$, then by (4.3) we have $\operatorname{grad} E(X_j) = 0$. Thus $X_j$ is a stationary point of $E$ and the algorithm stops.

Now, we note that there exist two constants $c_1, c_2 > 0$ such that $0 < c_1 \le \|\Delta X_j\| \le c_2$ for all $j$. In fact, if there exists some subsequence $\{\Delta X_j\}_{\mathcal{K}} \to 0$, then we have by (4.4) that $\{\|\operatorname{grad} E(X_j)\|\}_{\mathcal{K}} \to 0$, since $\|\operatorname{Hess} E(\mathcal{X}_j)\|$ is bounded for the bounded sequence $\{X_j\}_{\mathcal{K}}$. By continuity, we get $\operatorname{grad} E(X^*) = 0$, a contradiction. On the other hand, $\{\Delta X_j\}$ cannot be unbounded because, taking into account the boundedness of $\{\operatorname{grad} E(X_j)\}$, this would contradict (3.11).

We observe from (3.12) that the sequence $\{E(X_j)\}_{j\ge 0}$ is monotonically nonincreasing and thus is convergent. Hence,

(4.5) $$\lim_{j\to\infty}\big(E(X_j) - E(X_{j+1})\big) = 0.$$

By (3.11), (3.12), and (4.4), we have

$$E(X_j) - E(X_{j+1}) \ge -\sigma\beta^{l_j}\langle\operatorname{grad} E(X_j), \Delta X_j\rangle \ge \frac{\sigma(1-\eta)^2\beta^{l_j}\eta_j\|\operatorname{grad} E(X_j)\|^2}{\|\operatorname{Hess} E(\mathcal{X}_j)\|^2} \ge 0,$$

which, together with (4.5), implies

$$\lim_{j\to\infty}\beta^{l_j}\eta_j\|\operatorname{grad} E(X_j)\|^2 = 0.$$

This implies that $\liminf \beta^{l_j} = 0$. Otherwise, if $\liminf \beta^{l_j} > 0$, then, by the definition of $\eta_j$, we have $\operatorname{grad} E(X^*) = 0$, a contradiction. Therefore, we may assume that $\lim \beta^{l_j} = 0$, taking a subsequence if necessary. Then we get by (3.12),

$$E\Big(R_{X_j}\Big(\frac{\beta^{l_j}\|\Delta X_j\|}{\beta}\,\frac{\Delta X_j}{\|\Delta X_j\|}\Big)\Big) - E(X_j) > \sigma\,\frac{\beta^{l_j}\|\Delta X_j\|}{\beta}\Big\langle\operatorname{grad} E(X_j), \frac{\Delta X_j}{\|\Delta X_j\|}\Big\rangle,$$

and it follows that

(4.6) $$\frac{\widehat{E}_{X_j}\Big(\frac{\beta^{l_j}\|\Delta X_j\|}{\beta}\,\frac{\Delta X_j}{\|\Delta X_j\|}\Big) - \widehat{E}_{X_j}(0_{X_j})}{\frac{\beta^{l_j}\|\Delta X_j\|}{\beta}} > \sigma\Big\langle\operatorname{grad} E(X_j), \frac{\Delta X_j}{\|\Delta X_j\|}\Big\rangle,$$

where $\widehat{E} = E\circ R$ means the pullback of $E$ through the retraction $R$ on $\mathrm{St}(k, n)$. Since $\Delta X_j/\|\Delta X_j\|$ has unit norm, we may assume that $\{\Delta X_j/\|\Delta X_j\|\}$ converges to $\xi^*$ with $\|\xi^*\| = 1$, taking a subsequence if necessary. By continuity of the Riemannian metric $\langle\cdot,\cdot\rangle$ and (4.6), we obtain

$$\langle\operatorname{grad} E(X^*), \xi^*\rangle \ge \sigma\langle\operatorname{grad} E(X^*), \xi^*\rangle,$$

and then

(4.7) $$\langle\operatorname{grad} E(X^*), \xi^*\rangle \ge 0,$$

since $0 < \sigma < 1$. By (3.11) and (4.4), we have

(4.8) $$\Big\langle\operatorname{grad} E(X_j), \frac{\Delta X_j}{\|\Delta X_j\|}\Big\rangle \le -\eta_j\|\Delta X_j\|.$$

Note that the sequence $\{\|\Delta X_j\|\}$ is bounded below and $\operatorname{grad} E(X^*) \ne 0$ by assumption. Hence, we may assume that $\{\Delta X_j\} \to \Delta X^*$, taking a subsequence if necessary. Then, letting $j \to \infty$ in (4.8), we get

$$\langle\operatorname{grad} E(X^*), \xi^*\rangle \le -\min\{\eta, \|\operatorname{grad} E(X^*)\|\}\|\Delta X^*\| < 0,$$

which contradicts (4.7). Therefore, $\operatorname{grad} E(X^*) = 0$. The proof is complete.

4.2. Quadratic convergence. We establish the quadratic convergence of Algorithm 3.2. To do so, we need the following positive definiteness assumption on the Riemannian Hessian of $E$.

Assumption 4.2. The Riemannian Hessian operator $\operatorname{Hess} E([X^*]) : T_{[X^*]}\mathcal{Q} \to T_{[X^*]}\mathcal{Q}$ is positive definite, where $X^*$ is an accumulation point of the sequence $\{X_j\}$ generated by Algorithm 3.2.

Assumption 4.2 guarantees that a stationary point $\mathcal{X}^* := [X^*]$ of $E$ is an isolated local minimum point of $E$. In section 5 we provide a sufficient condition such that Assumption 4.2 is satisfied.

To establish the quadratic convergence of Algorithm 3.2, we need the following result.

Lemma 4.3. Let $X^*$ be an accumulation point of the sequence $\{X_j\}$ generated by Algorithm 3.2, i.e., $X^*$ is a limit point of a subsequence $\{X_j\}_{\mathcal{K}}$. Suppose that Assumption 4.2 is satisfied. Then there exist two constants $d_1, d_2 > 0$ such that for all $j \in \mathcal{K}$ sufficiently large, it holds that

$$d_1\|\operatorname{grad} E([X_j])\| \le \|\Delta X_j\| \le d_2\|\operatorname{grad} E([X_j])\|.$$

Proof. Let $\mathcal{X}^* := [X^*]$ and $\mathcal{X}_j := [X_j]$ for all $j$. As $\{X_j\}_{\mathcal{K}} \to X^*$, we get $\{\mathcal{X}_j\}_{\mathcal{K}} \to \mathcal{X}^*$. By Assumption 4.2, there exist two scalars $\kappa_0, \kappa_1 > 0$ such that for all $j \in \mathcal{K}$ sufficiently large, $\operatorname{Hess} E(\mathcal{X}_j)$ is nonsingular, and

(4.9) $$\|\operatorname{Hess} E(\mathcal{X}_j)\| \le \kappa_0, \qquad \|[\operatorname{Hess} E(\mathcal{X}_j)]^{-1}\| \le \kappa_1.$$

By (4.9), we have for all $j \in \mathcal{K}$ sufficiently large,

$$\|\Delta X_j\| = \big\|[\operatorname{Hess} E(\mathcal{X}_j)]^{-1}\big(\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j] + \operatorname{grad} E(X_j) - \operatorname{grad} E(X_j)\big)\big\| \le \big\|[\operatorname{Hess} E(\mathcal{X}_j)]^{-1}\big\|\big(\|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j] + \operatorname{grad} E(X_j)\| + \|\operatorname{grad} E(X_j)\|\big) \le \kappa_1(1 + \eta_j)\|\operatorname{grad} E(X_j)\| \le \kappa_1(1 + \eta)\|\operatorname{grad} E(X_j)\| \equiv d_2\|\operatorname{grad} E(X_j)\|$$

and

$$\|\operatorname{grad} E(X_j)\| = \|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j] + \operatorname{grad} E(X_j) - \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\| \le \|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j] + \operatorname{grad} E(X_j)\| + \|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\| \le \eta_j\|\operatorname{grad} E(X_j)\| + \|\operatorname{Hess} E(\mathcal{X}_j)\|\cdot\|\Delta X_j\| \le \eta\|\operatorname{grad} E(X_j)\| + \kappa_0\|\Delta X_j\|.$$

Thus for all $j \in \mathcal{K}$ sufficiently large,

$$\|\Delta X_j\| \ge \frac{1-\eta}{\kappa_0}\|\operatorname{grad} E(X_j)\| \equiv d_1\|\operatorname{grad} E(X_j)\|.$$

The proof is complete.

On the local convergence of Algorithm 3.2 related to the nondegenerate local minima, we have the following result. The proof follows that of Theorem 11(b) in [15].

Lemma 4.4. Let $X^*$ be an accumulation point of the sequence $\{X_j\}$ generated by Algorithm 3.2. Suppose that Assumption 4.2 holds. Then $\{[X_j]\}$ converges to $[X^*]$ on $\mathcal{Q}$ defined by (3.1).

Proof. By Theorem 4.1, we have $\operatorname{grad} E([X^*]) = 0$. Also, the Riemannian Hessian operator $\operatorname{Hess} E([X^*])$ is positive definite by assumption. Then $\mathcal{X}^* := [X^*]$ is an isolated local minimum point of $E$. Let $\mathcal{S}$ be the set of limit points of the sequence $\{\mathcal{X}_j := [X_j]\}$, which is nonempty since $\mathcal{X}^* \in \mathcal{S}$. Suppose that $\mathcal{X}^*$ is not the only limit point of the sequence $\{\mathcal{X}_j\}$. Then

$$\varsigma := \begin{cases} \displaystyle\inf_{\mathcal{X}\in\mathcal{S}\setminus\mathcal{X}^*} \mathrm{dist}(\mathcal{X}, \mathcal{X}^*) & \text{if } \mathcal{S}\setminus\mathcal{X}^* \ne \emptyset, \\ 1 & \text{otherwise.} \end{cases}$$

Since $\mathcal{X}^*$ is an isolated local minimizer of $E$, it follows that $\varsigma > 0$. Define

$$\mathcal{S}_1 := \{\mathcal{Y} \in \mathcal{Q} \mid \mathrm{dist}(\mathcal{Y}\,|\,\mathcal{S}) \le \varsigma/4\}, \qquad \mathcal{S}_2 := \{\mathcal{Y} \in \mathcal{Q} \mid \mathrm{dist}(\mathcal{Y}, \mathcal{X}^*) \ge \varsigma\},$$

where $\mathrm{dist}(\mathcal{Y}\,|\,\mathcal{S}) := \inf_{\mathcal{X}\in\mathcal{S}}\mathrm{dist}(\mathcal{Y}, \mathcal{X})$. Then for all $j$ sufficiently large, $\mathcal{X}_j$ belongs to at least one of the sets $\mathcal{S}_1$ and $\mathcal{S}_2$. Next, let $\{\mathcal{X}_j\}_{\mathcal{K}}$ be a subsequence of $\{\mathcal{X}_j\}$ such that $\mathrm{dist}(\mathcal{X}_j, \mathcal{X}^*) \le \varsigma/4$ for all $j \in \mathcal{K}$ sufficiently large. Thus, every limit point of $\{\mathcal{X}_j\}_{j\in\mathcal{K}}$ lies in the compact set $B_{\varsigma/4}(\mathcal{X}^*)$, which is also an accumulation point of the sequence $\{\mathcal{X}_j\}$. Hence, $\{\mathcal{X}_j\}_{j\in\mathcal{K}}$ converges to $\mathcal{X}^*$, which is the unique accumulation point of $\{\mathcal{X}_j\}$ in $B_{\varsigma/4}(\mathcal{X}^*)$. By Theorem 4.1 again, $\{\|\operatorname{grad} E(X_j)\|\}_{\mathcal{K}} \to 0$. This, together with Lemma 4.3, yields that $\{\Delta X_j\}_{\mathcal{K}} \to 0$. Since $\mathcal{Q}$ is a compact manifold, for the retraction $R$ on $\mathcal{Q}$, there exist two scalars $\mu > 0$ and $\delta_\mu > 0$ such that [2, p. 149]

(4.10) $$\|\Delta\mathcal{X}\| \ge \mu\,\mathrm{dist}\big(\mathcal{X}, R_{\mathcal{X}}(\Delta\mathcal{X})\big) \quad \forall \mathcal{X} \in \mathcal{Q},\ \Delta\mathcal{X} \in T_{\mathcal{X}}\mathcal{Q},\ \|\Delta\mathcal{X}\| \le \delta_\mu.$$

Notice that $\|\Delta X_j\| \le \min\{\varsigma/4, \delta_\mu, \mu\varsigma/4\}$ for all $j \in \mathcal{K}$ sufficiently large. Let $\hat{j} \in \mathcal{K}$ be sufficiently large. Then, by using $\mathcal{X}_{\hat{j}+1} := [X_{\hat{j}+1}] = [R_{X_{\hat{j}}}(\beta^{l_{\hat{j}}}\Delta X_{\hat{j}})Q_{\hat{j}}] = R_{\mathcal{X}_{\hat{j}}}(\beta^{l_{\hat{j}}}\Delta\mathcal{X}_{\hat{j}})$ and (4.10), we obtain

$$\mathrm{dist}(\mathcal{X}_{\hat{j}+1}\,|\,\mathcal{S}\setminus\mathcal{X}^*) \ge \inf_{\mathcal{Y}\in\mathcal{S}\setminus\mathcal{X}^*}\{\mathrm{dist}(\mathcal{Y}, \mathcal{X}^*)\} - \mathrm{dist}(\mathcal{X}_{\hat{j}+1}, \mathcal{X}_{\hat{j}}) - \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) = \inf_{\mathcal{Y}\in\mathcal{S}\setminus\mathcal{X}^*}\{\mathrm{dist}(\mathcal{Y}, \mathcal{X}^*)\} - \mathrm{dist}\big(R_{\mathcal{X}_{\hat{j}}}(\beta^{l_{\hat{j}}}\Delta\mathcal{X}_{\hat{j}}), \mathcal{X}_{\hat{j}}\big) - \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) \ge \inf_{\mathcal{Y}\in\mathcal{S}\setminus\mathcal{X}^*}\{\mathrm{dist}(\mathcal{Y}, \mathcal{X}^*)\} - \frac{1}{\mu}\|\Delta X_{\hat{j}}\| - \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) \ge \varsigma - \varsigma/4 - \varsigma/4 = \varsigma/2,$$

which shows $\mathcal{X}_{\hat{j}+1} \notin \mathcal{S}_1\setminus B_{\varsigma/4}(\mathcal{X}^*)$.


By using $\|\Delta X_{\hat{j}}\| \le \min\{\varsigma/4, \delta_\mu, \mu\varsigma/4\}$, $\mathcal{X}_{\hat{j}+1} := R_{\mathcal{X}_{\hat{j}}}(\beta^{l_{\hat{j}}}\Delta\mathcal{X}_{\hat{j}})$, and (4.10) again, we get

$$\mathrm{dist}(\mathcal{X}_{\hat{j}+1}, \mathcal{X}^*) \le \mathrm{dist}(\mathcal{X}_{\hat{j}+1}, \mathcal{X}_{\hat{j}}) + \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) \le \mathrm{dist}\big(R_{\mathcal{X}_{\hat{j}}}(\beta^{l_{\hat{j}}}\Delta\mathcal{X}_{\hat{j}}), \mathcal{X}_{\hat{j}}\big) + \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) \le \frac{1}{\mu}\|\Delta X_{\hat{j}}\| + \mathrm{dist}(\mathcal{X}_{\hat{j}}, \mathcal{X}^*) \le \varsigma/4 + \varsigma/4 = \varsigma/2,$$

which implies $\mathcal{X}_{\hat{j}+1} \notin \mathcal{S}_2$. Hence, $\mathcal{X}_{\hat{j}+1} \in B_{\varsigma/4}(\mathcal{X}^*)$. By definition, we derive that $\hat{j}+1 \in \mathcal{K}$. Therefore, by induction, we conclude that $j \in \mathcal{K}$ for all $j$ sufficiently large, and then the whole sequence $\{\mathcal{X}_j\}$ converges to $\mathcal{X}^*$.

On the stepsize $\beta^{l_j}$ in (3.12), we have the following result, which is similar to Proposition 5 in [36].

Lemma 4.5. Let $X^*$ be an accumulation point of the sequence $\{X_j\}$ generated by Algorithm 3.2. Suppose that Assumption 4.2 holds; then for $j$ sufficiently large, $l_j = 0$ satisfies (3.12).

Proof. Let $\mathcal{X}^* := [X^*]$ and $\mathcal{X}_j := [X_j]$ for all $j$. Let $\Delta X_j^N$ be the exact solution of the Newton equation (3.9). Then we have

$$\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j - \Delta X_j^N] = \operatorname{grad} E(X_j) + \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j],$$

and thus

(4.11) $$\|\operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j - \Delta X_j^N]\| = \|\operatorname{grad} E(X_j) + \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\|.$$

According to (4.1) and (4.2), we have

(4.12) $$\operatorname{grad}\widehat{E}_{\mathcal{X}_j}(0_{\mathcal{X}_j}) + \operatorname{Hess}\widehat{E}_{\mathcal{X}_j}(0_{\mathcal{X}_j})[\Delta X_j^N] = \operatorname{grad} E(\mathcal{X}_j) + \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j^N] = 0_{\mathcal{X}_j}.$$

By Lemma 4.4, we have $\mathcal{X}_j \to \mathcal{X}^*$. Thus, by Lemma 4.3, (3.10), (4.9), and (4.11), we have for all $j$ sufficiently large,

(4.13) $$\|\Delta X_j - \Delta X_j^N\| = \big\|[\operatorname{Hess} E(\mathcal{X}_j)]^{-1}\big(\operatorname{grad} E(X_j) + \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\big)\big\| \le \big\|[\operatorname{Hess} E(\mathcal{X}_j)]^{-1}\big\|\cdot\big\|\operatorname{grad} E(X_j) + \operatorname{Hess} E(\mathcal{X}_j)[\Delta X_j]\big\| \le \kappa_1\eta_j\|\operatorname{grad} E(X_j)\| \le \kappa_1\|\operatorname{grad} E(X_j)\|^2 \le \frac{\kappa_1}{d_1^2}\|\Delta X_j\|^2.$$

In addition, $\operatorname{Hess}\widehat{E}_{\mathcal{X}}$ is Lipschitz continuous at $0_{\mathcal{X}}$ uniformly in a neighborhood of $\mathcal{X}^*$; i.e., there exist scalars $\kappa_2 > 0$, $\delta_1 > 0$, and $\delta_2 > 0$ such that for all $\mathcal{X} \in B_{\delta_1}(\mathcal{X}^*)$ and all $\xi \in B_{\delta_2}(0_{\mathcal{X}})$, it holds that

(4.14) $$\big\|\operatorname{Hess}\widehat{E}_{\mathcal{X}}(\xi) - \operatorname{Hess}\widehat{E}_{\mathcal{X}}(0_{\mathcal{X}})\big\| \le \kappa_2\|\xi\|.$$

By Taylor's theorem, there exists some constant $\theta \in [0, 1]$ such that

$$\widehat{E}_{\mathcal{X}_j}(\Delta X_j) = \widehat{E}_{\mathcal{X}_j}(0_{\mathcal{X}_j}) + \big\langle\operatorname{grad}\widehat{E}_{\mathcal{X}_j}(0_{\mathcal{X}_j}), \Delta X_j\big\rangle + \frac{1}{2}\big\langle\operatorname{Hess}\widehat{E}_{\mathcal{X}_j}(\theta\Delta X_j)[\Delta X_j], \Delta X_j\big\rangle.$$
