
A GEOMETRIC NONLINEAR CONJUGATE GRADIENT METHOD FOR STOCHASTIC INVERSE EIGENVALUE PROBLEMS

ZHI ZHAO, XIAO-QING JIN, AND ZHENG-JIAN BAI§

Abstract. In this paper, we focus on the stochastic inverse eigenvalue problem of reconstructing a stochastic matrix from the prescribed spectrum. We directly reformulate the stochastic inverse eigenvalue problem as a constrained optimization problem over several matrix manifolds to minimize the distance between isospectral matrices and stochastic matrices. Then we propose a geometric Polak–Ribière–Polyak-based nonlinear conjugate gradient method for solving the constrained optimization problem. The global convergence of the proposed method is established. Our method can also be extended to the stochastic inverse eigenvalue problem with prescribed entries. An extra advantage is that our models yield new isospectral flow methods. Finally, we report some numerical tests to illustrate the efficiency of the proposed method for solving the stochastic inverse eigenvalue problem and the case of prescribed entries.

Key words. inverse eigenvalue problem, stochastic matrix, geometric nonlinear conjugate gradient method, oblique manifold, isospectral flow method

AMS subject classifications. 65F18, 65F15, 15A18, 65K05, 90C26, 90C48

DOI. 10.1137/140992576

1. Introduction. An $n$-by-$n$ real matrix $A$ is called a nonnegative matrix if all its entries are nonnegative, i.e., $A_{ij} \ge 0$ for all $i, j = 1, \ldots, n$, where $A_{ij}$ means the $(i,j)$th entry of $A$. An $n$-by-$n$ real matrix $A$ is called a (row) stochastic matrix if it is nonnegative and the sum of the entries in each row equals one, i.e., $A_{ij} \ge 0$ and $\sum_{j=1}^{n} A_{ij} = 1$ for all $i, j = 1, \ldots, n$. Stochastic matrices arise in various applications such as probability theory, statistics, economics, computer science, physics, chemistry, and population genetics. See, for instance, [6, 13, 15, 20, 25, 32, 35] and the references therein.

In this paper, we consider the stochastic inverse eigenvalue problem (StIEP) and the stochastic inverse eigenvalue problem with prescribed entries (StIEP-PE) as follows:

StIEP. Given a self-conjugate set of complex numbers $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, find an $n$-by-$n$ stochastic matrix $C$ such that its eigenvalues are $\lambda_1, \lambda_2, \ldots, \lambda_n$.

StIEP-PE. Given a self-conjugate set of complex numbers $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, an index subset $\mathcal{L} \subset \mathcal{N} := \{(i, j) \mid i, j = 1, \ldots, n\}$, and a certain $n$-by-$n$ nonnegative matrix $C_a$ with $(C_a)_{ij} = 0$ for $(i, j) \notin \mathcal{L}$ (i.e., the set of nonnegative numbers $\{(C_a)_{ij} \mid (i, j) \in \mathcal{L}\}$ is prescribed), find an $n$-by-$n$ stochastic matrix $C$ such that its eigenvalues are $\lambda_1, \lambda_2, \ldots, \lambda_n$ and $C_{ij} = (C_a)_{ij}$ for all $(i, j) \in \mathcal{L}$.

Received by the editors October 22, 2014; accepted for publication (in revised form) March 31, 2016; published electronically July 6, 2016.

http://www.siam.org/journals/sinum/54-4/99257.html

Department of Mathematics, School of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People’s Republic of China (zzhao@hdu.edu.cn).

Department of Mathematics, University of Macau, Taipa, Macau, People’s Republic of China (xqjin@umac.mo). The research of this author was supported by the research grant MYRG098(Y3-L3)-FST13-JXQ from University of Macau.

§Corresponding author. School of Mathematical Sciences, Xiamen University, Xiamen 361005, People’s Republic of China (zjbai@xmu.edu.cn). The research of this author was partially supported by the National Natural Science Foundation of China (11271308), the Natural Science Foundation of Fujian Province of China (2016J01035), and the Fundamental Research Funds for the Central Universities (20720150001).



As stated in [11], the StIEP aims to construct a Markov chain (i.e., a stochastic matrix) from the given spectral property. The StIEP is a special nonnegative inverse eigenvalue problem which aims to find a nonnegative matrix from the given spectrum.

These two problems are related to each other by the following result [19, p. 142].

Theorem 1.1. If $A$ is a nonnegative matrix with positive maximal eigenvalue $r$ and a positive maximal eigenvector $a$, then $r^{-1}D^{-1}AD$ is a stochastic matrix, where $D := \mathrm{Diag}(a)$ means a diagonal matrix with the vector $a$ on its diagonal.

Theorem 1.1 shows that one may first construct a nonnegative matrix with the prescribed spectrum, and then obtain a stochastic matrix with the desired spectrum by a diagonal similarity transformation. In many applications (see, for instance, [6, 32, 35]), the underlying structure of a desired stochastic matrix is often characterized by the prescribed entries. The corresponding inverse problem can be described as the StIEP-PE. However, the StIEP-PE may not be solved by first constructing a nonnegative solution followed by a diagonal similarity transformation.
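The following short NumPy sketch (ours, not from the paper; the example matrix is an arbitrary assumption) illustrates the diagonal scaling of Theorem 1.1 numerically.

```python
import numpy as np

# Sketch illustrating Theorem 1.1: scale a nonnegative matrix with Perron
# eigenpair (r, a) into the stochastic matrix r^{-1} D^{-1} A D, D = Diag(a).
A = np.array([[1.0, 2.0],
              [3.0, 2.0]])                 # arbitrary nonnegative example

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                # Perron eigenvalue (r = 4 here)
r = eigvals[k].real
a = np.abs(eigvecs[:, k].real)             # Perron eigenvector, taken positive

D = np.diag(a)
C = (1.0 / r) * np.linalg.inv(D) @ A @ D

print(np.allclose(C.sum(axis=1), 1.0))     # True: rows of C sum to one
print(np.sort(np.linalg.eigvals(C).real))  # spectrum of A divided by r: [-0.25, 1]
```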

There is a large literature on conditions for the existence of solutions to the StIEP in many special cases. The reader may refer to [8, 10, 11, 17, 19, 23, 24, 36, 37] for various necessary or sufficient conditions. However, there exist few numerical methods for solving the StIEP, especially for the large-scale StIEP (say $n \ge 1000$).

In [34], Soules gave a constructive method where additional inequalities on the Perron eigenvector are required. In [12], based on a matrix similarity transformation, Chu and Guo presented an isospectral flow method for constructing a stochastic matrix with the desired spectrum. An inverse matrix involved in the isospectral flow method was further characterized by an analytic singular value decomposition (ASVD). Recently, in [21] Orsi proposed an alternating projection-like scheme for the construction of a stochastic matrix with the prescribed spectrum. The main computational cost is an eigenvalue-eigenvector decomposition or a Schur matrix decomposition. In [18], Lin presented a fast recursive algorithm for solving the StIEP, where the prescribed eigenvalues are assumed to be real and satisfy additional inequality conditions. We see that all the numerical methods in [12, 18, 21] are based on Theorem 1.1 and may not be extended to the StIEP-PE.

Recently, there appeared a few Riemannian optimization methods for eigenproblems. For instance, in [1], a truncated conjugate gradient method was presented for the symmetric generalized eigenvalue problem. In [5], a Riemannian trust-region method was proposed for the symmetric generalized eigenproblem. Newton’s method and the conjugate gradient method proposed in [33] are applicable to the symmetric eigenvalue problem. In [40], a Riemannian Newton method was proposed for a kind of nonlinear eigenvalue problem.

In this paper, we directly focus on the StIEP. Based on the real Schur matrix decomposition, we reformulate the StIEP as a constrained optimization problem over several matrix manifolds, which aims to minimize the distance between isospectral matrices and stochastic matrices. We propose a geometric Polak–Ribière–Polyak-based nonlinear conjugate gradient method for solving the constrained optimization problem. This is motivated by Yao et al. [38] and Zhang, Zhou, and Li [39]. In [38], a Riemannian Fletcher–Reeves conjugate gradient approach was proposed for solving the doubly stochastic inverse eigenvalue problem. In [39], Zhang, Zhou, and Li gave a modified Polak–Ribière–Polyak conjugate gradient method for solving the


unconstrained optimization problem, where the generated direction is always a descent direction for the objective function. We investigate some basic properties of these matrix manifolds (e.g., tangent spaces, Riemannian metrics, orthogonal projections, retractions, and vector transports) and derive an explicit expression of the Riemannian gradient of the cost function. The global convergence of the proposed method is established. We also extend the proposed method to the StIEP-PE. An additional advantage of our models is that they also yield new geometric isospectral flow methods for both the StIEP and the StIEP-PE. Finally, we report some numerical tests to show that the proposed geometric methods perform very effectively and are applicable to large-scale problems. Moreover, the new geometric isospectral flow methods work acceptably for small- and medium-scale problems.

The rest of this paper is organized as follows. In section 2 we propose a geometric nonlinear conjugate gradient approach for solving the StIEP. In section 3 we establish the global convergence of the proposed approach. In section 4 we extend the proposed approach to the StIEP-PE. Finally, some numerical tests are reported in section 5, and some concluding remarks are given in section 6.

2. A geometric nonlinear conjugate gradient method. In this section, we propose a geometric nonlinear conjugate gradient approach for solving the StIEP.

We first reformulate the StIEP as a nonlinear minimization problem defined on some matrix manifolds. We also discuss some basic properties of these matrix manifolds and derive the Riemannian gradient of the cost function. Then we present a geometric nonlinear conjugate gradient approach for solving the nonlinear minimization problem.

2.1. Reformulation. As noted in [19, p. 141], we only assume that the set of prescribed complex numbers $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ satisfies the following necessary condition: One element is 1 and $\max_{1\le i\le n}|\lambda_i| = 1$. Note that the set of complex numbers $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ is closed under complex conjugation. Without loss of generality, we can assume
$$\lambda_{2j-1} = a_j + b_j\sqrt{-1}, \quad \lambda_{2j} = a_j - b_j\sqrt{-1}, \quad j = 1, \ldots, s; \qquad \lambda_j \in \mathbb{R}, \quad j = 2s+1, \ldots, n,$$
where $a_j, b_j \in \mathbb{R}$ with $b_j \ne 0$ for $j = 1, \ldots, s$. Then we define a block diagonal matrix by
$$\Lambda := \mathrm{blkdiag}\big(\lambda^{[2]*}_1, \ldots, \lambda^{[2]*}_s, \lambda_{2s+1}, \ldots, \lambda_n\big),$$
with diagonal blocks $\lambda^{[2]*}_1, \ldots, \lambda^{[2]*}_s, \lambda_{2s+1}, \ldots, \lambda_n$, where
$$\lambda^{[2]*}_j = \begin{bmatrix} a_j & b_j \\ -b_j & a_j \end{bmatrix}, \quad j = 1, \ldots, s.$$
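As an illustration only (our own sketch, not part of the paper), the block matrix $\Lambda$ can be assembled from a self-conjugate spectrum by a small helper, here called build_Lambda.

```python
import numpy as np

def build_Lambda(eigs, tol=1e-12):
    """Assemble the real block-diagonal matrix Lambda: a 2x2 block
    [[a, b], [-b, a]] for each conjugate pair a ± b i and a 1x1 block for
    each real eigenvalue.  Assumes the input is closed under conjugation."""
    eigs = np.asarray(eigs, dtype=complex)
    real_part = eigs[np.abs(eigs.imag) <= tol].real
    # keep one representative (positive imaginary part) of each conjugate pair
    pairs = eigs[eigs.imag > tol]
    blocks = [np.array([[z.real, z.imag], [-z.imag, z.real]]) for z in pairs]
    blocks += [np.array([[x]]) for x in real_part]
    n = sum(b.shape[0] for b in blocks)
    Lam = np.zeros((n, n))
    k = 0
    for b in blocks:
        m = b.shape[0]
        Lam[k:k+m, k:k+m] = b
        k += m
    return Lam

# example: spectrum {1, 0.2 ± 0.3i, -0.5}
Lam = build_Lambda([1.0, 0.2 + 0.3j, 0.2 - 0.3j, -0.5])
print(np.sort_complex(np.linalg.eigvals(Lam)))   # recovers the prescribed spectrum
```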

Denote by $\mathcal{O}(n)$ the set of all $n$-by-$n$ orthogonal matrices, i.e.,
$$\mathcal{O}(n) := \{P \in \mathbb{R}^{n\times n} \mid P^TP = I_n\},$$
where $P^T$ means the transpose of $P$ and $I_n$ is the identity matrix of order $n$. In addition, we define the set $\mathcal{V}$ by
$$\mathcal{V} := \{V \in \mathbb{R}^{n\times n} \mid V_{ij} = 0,\ (i, j) \in \mathcal{I}\},$$
where $\mathcal{I}$ is the index subset
$$\mathcal{I} := \{(i, j) \mid i \ge j \text{ or } \Lambda_{ij} \ne 0,\ i, j = 1, \ldots, n\}.$$


Let $\mathcal{S}$ be the set of all $n$-by-$n$ stochastic matrices, which can be defined as the following matrix manifold:
$$\mathcal{S} := \{S \odot S \mid \mathrm{diag}(SS^T) = I_n,\ S \in \mathbb{R}^{n\times n}\},$$
where $\odot$ denotes the Hadamard product [7] and $\mathrm{diag}(A)$ is a diagonal matrix with the same diagonal entries as a square matrix $A$. The set of isospectral matrices defined by

(2.1) $\mathcal{M}(\Lambda) := \{Z \in \mathbb{R}^{n\times n} \mid Z = P(\Lambda + V)P^T,\ P \in \mathcal{O}(n),\ V \in \mathcal{V}\}$

is a smooth manifold. Then the StIEP has a solution if and only if $\mathcal{M}(\Lambda) \cap \mathcal{S} \ne \emptyset$. Suppose that the StIEP has at least one solution. Then the StIEP is equivalent to solving the following nonlinear matrix equation:
$$H(S, P, V) = 0_{n\times n}$$
for $(S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, where $0_{n\times n}$ is the zero matrix of order $n$ and the set $\mathcal{OB}$ is the oblique manifold defined by [3, 4]
$$\mathcal{OB} := \{S \in \mathbb{R}^{n\times n} \mid \mathrm{diag}(SS^T) = I_n\}.$$
The mapping $H : \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \to \mathbb{R}^{n\times n}$ is defined by

(2.2) $H(S, P, V) = S \odot S - P(\Lambda + V)P^T, \quad (S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}.$

If we find a solution $(\bar{S}, \bar{P}, \bar{V}) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ to $H(S, P, V) = 0_{n\times n}$, then $C = \bar{S} \odot \bar{S}$ is a solution to the StIEP. Alternatively, one may solve the StIEP by finding a global solution to the following nonlinear minimization problem:

(2.3) $\min\ h(S, P, V) := \dfrac{1}{2}\|H(S, P, V)\|_F^2 = \dfrac{1}{2}\|S \odot S - P(\Lambda + V)P^T\|_F^2 \quad \text{s.t. } S \in \mathcal{OB},\ P \in \mathcal{O}(n),\ V \in \mathcal{V},$

where “s.t.” means “subject to” and $\|\cdot\|_F$ means the matrix Frobenius norm [16].
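For concreteness, a minimal NumPy sketch of the cost function (2.3) and of a random starting point on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ follows (our own illustration; the 0/1 mask `W`, equal to 1 off the index set $\mathcal{I}$ and 0 on it, is assumed to be given, and `V` is assumed to lie in $\mathcal{V}$ already).

```python
import numpy as np

def cost_h(S, P, V, Lam):
    """h(S, P, V) = 0.5 * || S ⊙ S - P (Λ + V) Pᵀ ||_F²  from (2.3)."""
    H = S * S - P @ (Lam + V) @ P.T
    return 0.5 * np.linalg.norm(H, "fro") ** 2

def random_point(n, W, rng=np.random.default_rng(0)):
    """A random point on OB × O(n) × V: unit-norm rows for S, the Q factor of
    a random matrix for P, and a masked random matrix for V."""
    S = rng.standard_normal((n, n))
    S /= np.linalg.norm(S, axis=1, keepdims=True)      # diag(S Sᵀ) = I_n
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    P = Q @ np.diag(np.sign(np.diag(R)))               # P ∈ O(n)
    V = W * rng.standard_normal((n, n))                 # V_ij = 0 on I
    return S, P, V
```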

2.2. Basic properties. In the following, we establish some basic properties for the product manifold $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. It is easy to check that the dimensions of $\mathcal{OB}$, $\mathcal{O}(n)$, and $\mathcal{V}$ are given by [4, p. 27]
$$\dim \mathcal{OB} = n(n-1), \quad \dim \mathcal{O}(n) = \frac{1}{2}n(n-1), \quad \dim \mathcal{V} = |\mathcal{J}|,$$
where $|\mathcal{J}|$ is the cardinality of the index subset $\mathcal{J} = \mathcal{N} \setminus \mathcal{I}$, which is the relative complement of $\mathcal{I}$ in $\mathcal{N}$. Then we have
$$\dim(\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}) = n(n-1) + \frac{n(n-1)}{2} + |\mathcal{J}| > \dim \mathbb{R}^{n\times n}, \quad n \ge 3.$$
This shows that the nonlinear equation $H(S, P, V) = 0_{n\times n}$ is an under-determined system defined from $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ to $\mathbb{R}^{n\times n}$ for $n \ge 3$.

The tangent spaces of $\mathcal{OB}$, $\mathcal{O}(n)$, and $\mathcal{V}$ at $S \in \mathcal{OB}$, $P \in \mathcal{O}(n)$, and $V \in \mathcal{V}$ are, respectively, given by (see [3] and [4, p. 42])
$$T_S\mathcal{OB} = \{T \in \mathbb{R}^{n\times n} \mid \mathrm{diag}(ST^T) = 0_{n\times n}\},$$
$$T_P\mathcal{O}(n) = \{P\Omega \mid \Omega^T = -\Omega,\ \Omega \in \mathbb{R}^{n\times n}\},$$
$$T_V\mathcal{V} = \mathcal{V}.$$


Hence, the tangent space of $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ at $(S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is given by
$$T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} = T_S\mathcal{OB} \times T_P\mathcal{O}(n) \times T_V\mathcal{V}.$$

For the isospectral manifold $\mathcal{M}(\Lambda)$ defined in (2.1), it follows that
$$T_Z\mathcal{M}(\Lambda) = \big\{[Z, \Omega] + PUP^T \mid \Omega^T = -\Omega,\ U \in \mathcal{V},\ \Omega \in \mathbb{R}^{n\times n}\big\},$$
where $Z = P(\Lambda + V)P^T \in \mathcal{M}(\Lambda)$ and $[Z, \Omega] := Z\Omega - \Omega Z$ means the Lie-bracket product.

Let the Euclidean space $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$ be equipped with the following natural inner product:
$$\langle (X_1, Y_1, Z_1), (X_2, Y_2, Z_2) \rangle := \mathrm{tr}(X_1^TX_2) + \mathrm{tr}(Y_1^TY_2) + \mathrm{tr}(Z_1^TZ_2)$$
for all $(X_1, Y_1, Z_1), (X_2, Y_2, Z_2) \in \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$ and its induced norm $\|\cdot\|$, where $\mathrm{tr}(A)$ denotes the sum of the diagonal entries of a square matrix $A$. Since $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is an embedded submanifold of $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$, we can equip $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ with an induced Riemannian metric,

(2.4) $g_{(S,P,V)}\big((\xi_1, \zeta_1, \eta_1), (\xi_2, \zeta_2, \eta_2)\big) := \langle (\xi_1, \zeta_1, \eta_1), (\xi_2, \zeta_2, \eta_2) \rangle$

for all $(S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and $(\xi_1, \zeta_1, \eta_1), (\xi_2, \zeta_2, \eta_2) \in T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. Without causing any confusion, in what follows we use $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ to denote the Riemannian metric on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and its induced norm. Then, the orthogonal projections of any $\xi, \zeta, \eta \in \mathbb{R}^{n\times n}$ onto $T_S\mathcal{OB}$, $T_P\mathcal{O}(n)$, and $T_V\mathcal{V}$ are, respectively, given by (see [3] and [4, p. 48])

(2.5) $\Pi_S\xi = \xi - \mathrm{diag}(S\xi^T)S, \quad \Pi_P\zeta = P\,\mathrm{skew}(P^T\zeta), \quad \Pi_V\eta = W \odot \eta,$

where $\mathrm{skew}(A) := \frac{1}{2}(A - A^T)$ for a matrix $A \in \mathbb{R}^{n\times n}$ and the matrix $W \in \mathbb{R}^{n\times n}$ is defined by
$$W_{ij} := \begin{cases} 0 & \text{if } (i, j) \in \mathcal{I}, \\ 1 & \text{otherwise.} \end{cases}$$
Therefore, the orthogonal projection of a point $(\xi, \zeta, \eta) \in \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$ onto $T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ has the following form:
$$\Pi_{(S,P,V)}(\xi, \zeta, \eta) = \big(\xi - \mathrm{diag}(S\xi^T)S,\ P\,\mathrm{skew}(P^T\zeta),\ W \odot \eta\big).$$
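A small NumPy sketch of the projections (2.5) (ours; the 0/1 mask `W` encoding $\mathcal{I}$ is assumed given):

```python
import numpy as np

def skew(A):
    return 0.5 * (A - A.T)

def project(S, P, W, xi, zeta, eta):
    """Orthogonal projections (2.5) onto T_S OB, T_P O(n), and T_V V."""
    pr_S = xi - np.diag(np.diag(S @ xi.T)) @ S     # ξ - diag(Sξᵀ)S
    pr_P = P @ skew(P.T @ zeta)                    # P skew(Pᵀζ)
    pr_V = W * eta                                 # W ⊙ η
    return pr_S, pr_P, pr_V
```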

In addition, the retractions on the manifolds $\mathcal{OB}$, $\mathcal{O}(n)$, and $\mathcal{V}$ can be chosen as (see [3] and [4, p. 58])
$$R_S(\xi_S) = \big(\mathrm{diag}\big((S+\xi_S)(S+\xi_S)^T\big)\big)^{-1/2}(S+\xi_S) \quad \text{for } \xi_S \in T_S\mathcal{OB},$$
$$R_P(\zeta_P) = \mathrm{qf}(P+\zeta_P) \quad \text{for } \zeta_P \in T_P\mathcal{O}(n),$$
$$R_V(\eta_V) = V + \eta_V \quad \text{for } \eta_V \in T_V\mathcal{V},$$
where $\mathrm{qf}(A)$ denotes the $Q$ factor of the QR decomposition of a nonsingular matrix $A \in \mathbb{R}^{n\times n}$ in the form of $A = QR$. Here, $Q \in \mathcal{O}(n)$ and $R$ is an upper triangular matrix with strictly positive diagonal elements. Thus, a retraction $R$ on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ takes the form of
$$R_{(S,P,V)}(\xi_S, \zeta_P, \eta_V) = \big(R_S(\xi_S),\ R_P(\zeta_P),\ R_V(\eta_V)\big)$$

Downloaded 01/18/18 to 166.111.71.48. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

(6)

for all $(\xi_S, \zeta_P, \eta_V) \in T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$.
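These retractions admit a direct NumPy sketch (ours): row renormalization on $\mathcal{OB}$, the $Q$ factor of a QR decomposition on $\mathcal{O}(n)$, and plain addition on $\mathcal{V}$.

```python
import numpy as np

def retract_OB(S, xi):
    Y = S + xi
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)   # renormalize each row

def retract_On(P, zeta):
    Q, R = np.linalg.qr(P + zeta)
    return Q @ np.diag(np.sign(np.diag(R)))               # enforce R_ii > 0

def retract_V(V, eta):
    return V + eta
```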

Since all the manifolds $\mathcal{OB}$, $\mathcal{O}(n)$, and $\mathcal{V}$ are embedded Riemannian submanifolds of $\mathbb{R}^{n\times n}$, we can define vector transports on $\mathcal{OB}$, $\mathcal{O}(n)$, and $\mathcal{V}$ by orthogonal projections [4, p. 174]:

(2.6)
$$\mathcal{T}_{\eta_S}\xi_S := \Pi_{R_S(\eta_S)}\xi_S = \xi_S - \mathrm{diag}\big(R_S(\eta_S)\xi_S^T\big)R_S(\eta_S) \quad \text{for } \xi_S, \eta_S \in T_S\mathcal{OB},$$
$$\mathcal{T}_{\eta_P}\xi_P := \Pi_{R_P(\eta_P)}\xi_P = R_P(\eta_P)\,\mathrm{skew}\big(R_P(\eta_P)^T\xi_P\big) \quad \text{for } \xi_P, \eta_P \in T_P\mathcal{O}(n),$$
$$\mathcal{T}_{\eta_V}\xi_V := \xi_V \quad \text{for } \xi_V, \eta_V \in T_V\mathcal{V}.$$

Thus the vector transport on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ has the following form:

(2.7) $\mathcal{T}_{(\eta_S,\eta_P,\eta_V)}(\xi_S, \xi_P, \xi_V) = \big(\mathcal{T}_{\eta_S}\xi_S,\ \mathcal{T}_{\eta_P}\xi_P,\ \mathcal{T}_{\eta_V}\xi_V\big)$

for any $(\xi_S, \xi_P, \xi_V), (\eta_S, \eta_P, \eta_V) \in T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$.
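In code, the projection-based transport (2.6)-(2.7) amounts to projecting the given tangent vector onto the tangent space at the retracted point; a sketch of ours reusing `retract_OB`, `retract_On`, and `skew` from the blocks above:

```python
import numpy as np

def transport(S, P, eta_S, eta_P, xi_S, xi_P, xi_V):
    """Transport (ξ_S, ξ_P, ξ_V) along (η_S, η_P, η_V) by projection, as in (2.6)-(2.7)."""
    RS = retract_OB(S, eta_S)
    RP = retract_On(P, eta_P)
    t_S = xi_S - np.diag(np.diag(RS @ xi_S.T)) @ RS
    t_P = RP @ skew(RP.T @ xi_P)
    t_V = xi_V                        # V is a linear space: identity transport
    return t_S, t_P, t_V
```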

We now establish explicit formulas for the differential of $H$ defined in (2.2) and the Riemannian gradient of the cost function $h$ defined in (2.3). To derive the Riemannian gradient of $h$, we define the mapping $\overline{H} : \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \to \mathbb{R}^{n\times n}$ and the cost function $\overline{h} : \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \to \mathbb{R}$ by

(2.8) $\overline{H}(S, P, V) := S \odot S - P(\Lambda + V)P^T \quad \text{and} \quad \overline{h}(S, P, V) := \dfrac{1}{2}\|\overline{H}(S, P, V)\|_F^2$

for all $(S, P, V) \in \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$. Then the mapping $H$ and the function $h$ can be seen as the restrictions of $\overline{H}$ and $\overline{h}$ onto $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, respectively. That is, $H = \overline{H}|_{\mathcal{OB}\times\mathcal{O}(n)\times\mathcal{V}}$ and $h = \overline{h}|_{\mathcal{OB}\times\mathcal{O}(n)\times\mathcal{V}}$. The Euclidean gradient of $\overline{h}$ at a point $(S, P, V) \in \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$ is given by
$$\mathrm{grad}\,\overline{h}(S, P, V) = \Big(\frac{\partial}{\partial S}\overline{h}(S, P, V),\ \frac{\partial}{\partial P}\overline{h}(S, P, V),\ \frac{\partial}{\partial V}\overline{h}(S, P, V)\Big),$$

where
$$\frac{\partial}{\partial S}\overline{h}(S, P, V) = 2S \odot \overline{H}(S, P, V),$$
$$\frac{\partial}{\partial P}\overline{h}(S, P, V) = -\overline{H}(S, P, V)P(\Lambda + V)^T - \overline{H}^T(S, P, V)P(\Lambda + V),$$
$$\frac{\partial}{\partial V}\overline{h}(S, P, V) = -P^T\overline{H}(S, P, V)P.$$

Then the Riemannian gradient of $h$ at a point $(S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is given by [4, p. 48]

(2.9) $\mathrm{grad}\,h(S, P, V) = \Pi_{(S,P,V)}\mathrm{grad}\,\overline{h}(S, P, V) = \Big(\Pi_S\dfrac{\partial}{\partial S}\overline{h}(S, P, V),\ \Pi_P\dfrac{\partial}{\partial P}\overline{h}(S, P, V),\ \Pi_V\dfrac{\partial}{\partial V}\overline{h}(S, P, V)\Big).$

It follows from (2.5) that
$$\Pi_S\frac{\partial}{\partial S}\overline{h}(S, P, V) = 2S \odot \overline{H}(S, P, V) - 2\,\mathrm{diag}\big(S(S \odot \overline{H}(S, P, V))^T\big)S,$$
$$\Pi_P\frac{\partial}{\partial P}\overline{h}(S, P, V) = \frac{1}{2}\Big([P(\Lambda + V)P^T, \overline{H}^T(S, P, V)] + [P(\Lambda + V)^TP^T, \overline{H}(S, P, V)]\Big)P,$$
$$\Pi_V\frac{\partial}{\partial V}\overline{h}(S, P, V) = -W \odot \big(P^T\overline{H}(S, P, V)P\big).$$
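These formulas translate directly into a few lines of NumPy (our own sketch, with a helper we call `riemannian_grad`; `Lam` and the mask `W` are assumed given, and `V` is assumed to lie in $\mathcal{V}$):

```python
import numpy as np

def riemannian_grad(S, P, V, Lam, W):
    """Riemannian gradient (2.9) via projected Euclidean partial derivatives."""
    H = S * S - P @ (Lam + V) @ P.T                    # H(S, P, V)
    dS = 2.0 * S * H                                   # ∂h̄/∂S = 2 S ⊙ H
    dP = -H @ P @ (Lam + V).T - H.T @ P @ (Lam + V)    # ∂h̄/∂P
    dV = -P.T @ H @ P                                  # ∂h̄/∂V
    gS = dS - np.diag(np.diag(S @ dS.T)) @ S           # Π_S
    gP = P @ (0.5 * (P.T @ dP - dP.T @ P))             # Π_P = P skew(Pᵀ dP)
    gV = W * dV                                        # Π_V = W ⊙ dV
    return gS, gP, gV
```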

Downloaded 01/18/18 to 166.111.71.48. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

(7)

By using a similar idea from [9] and [12], we can define the steepest descent flow for the cost function $h$ over $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ as follows:

(2.10) $\dfrac{d(S, P, V)}{dt} = -\mathrm{grad}\,h(S, P, V).$

This, together with an initial value $(S(0), P(0), V(0)) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, leads to an initial value problem for problem (2.3). Thus we can apply existing ODE solvers to the above differential equation and get a solution for problem (2.3). From the numerical tests in section 5, one can see that this extra geometric isospectral flow method is acceptable for small- and medium-scale problems.
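A rough illustration (ours, not the authors' implementation) of feeding the flow (2.10) to a generic ODE solver, reusing `riemannian_grad` from the sketch above; a practical code would also re-project the state onto the manifold.

```python
import numpy as np
from scipy.integrate import solve_ivp

def flow(S0, P0, V0, Lam, W, t_final=50.0):
    """Integrate d(S, P, V)/dt = -grad h(S, P, V) in ambient coordinates."""
    n = S0.shape[0]

    def rhs(_, y):
        S = y[:n*n].reshape(n, n)
        P = y[n*n:2*n*n].reshape(n, n)
        V = y[2*n*n:].reshape(n, n)
        gS, gP, gV = riemannian_grad(S, P, V, Lam, W)
        return -np.concatenate([gS.ravel(), gP.ravel(), gV.ravel()])

    y0 = np.concatenate([S0.ravel(), P0.ravel(), V0.ravel()])
    sol = solve_ivp(rhs, (0.0, t_final), y0, rtol=1e-8, atol=1e-10)
    y = sol.y[:, -1]
    return y[:n*n].reshape(n, n), y[n*n:2*n*n].reshape(n, n), y[2*n*n:].reshape(n, n)
```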

On the other hand, the differential $DH(S, P, V) : T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \to T_{H(S,P,V)}\mathbb{R}^{n\times n} \simeq \mathbb{R}^{n\times n}$ of $H$ at a point $(S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is determined by
$$DH(S, P, V)[(\Delta S, \Delta P, \Delta V)] = 2S \odot \Delta S + [P(\Lambda + V)P^T, \Delta P P^T] - P\Delta V P^T$$
for all $(\Delta S, \Delta P, \Delta V) \in T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. Then we have [4, p. 185]

(2.11) $\mathrm{grad}\,h(S, P, V) = (DH(S, P, V))^*[H(S, P, V)],$

where $(DH(S, P, V))^*$ denotes the adjoint of the operator $DH(S, P, V)$.

Let $T\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ denote the tangent bundle of $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. As in [4, p. 55], the pullback mappings $\widehat{H} : T\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \to \mathbb{R}^{n\times n}$ and $\widehat{h} : T\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \to \mathbb{R}$ of $H$ and $h$ through the retraction $R$ on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ can be defined by

(2.12) $\widehat{H}(\xi) = H(R(\xi)) \quad \text{and} \quad \widehat{h}(\xi) = h(R(\xi)), \quad \xi \in T\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V},$

where $\widehat{H}_{(S,P,V)}$ and $\widehat{h}_{(S,P,V)}$ are the restrictions of $\widehat{H}$ and $\widehat{h}$ to $T_{(S,P,V)}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, respectively. By using the local rigidity of $R$ [4, p. 55], one has

(2.13) $DH(X) = D\widehat{H}_X(0_X), \quad X \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V},$

where $0_X$ is the origin of $T_X\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. For the Riemannian gradient of $h$ and the Euclidean gradient of its pullback function $\widehat{h}$, we have [4, p. 56]

(2.14) $\mathrm{grad}\,h(S, P, V) = \mathrm{grad}\,\widehat{h}_{(S,P,V)}\big(0_{(S,P,V)}\big).$

2.3. A geometric nonlinear conjugate gradient method. In this section, we propose a geometric modified Polak–Ribière–Polyak conjugate gradient method for solving the StIEP. This is motivated by the modified Polak–Ribière–Polyak conjugate gradient method in [39], where the generated direction is always a descent direction for the objective function and the modified method with Armijo-type line search is globally convergent. For the original Polak–Ribière–Polyak conjugate gradient method, the reader may refer to [26, 27]. A geometric Polak–Ribière–Polyak-based nonlinear conjugate gradient algorithm for solving problem (2.3) can be stated as follows.

Algorithm 2.1 (a geometric nonlinear conjugate gradient method).

Step 0. Given $X_0 := (S_0, P_0, V_0) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, $\alpha \ge 1$, $\rho, \delta \in (0, 1)$. $k := 0$.

Step 1. If $h(X_k) = 0$, then stop. Otherwise, go to Step 2. Here $X_k := (S_k, P_k, V_k)$.

Step 2. Set

(2.15) $\Delta X_k = -\mathrm{grad}\,h(X_k) + \beta_k\mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1} - \theta_kY_{k-1},$

where $\Delta X_k := (\Delta S_k, \Delta P_k, \Delta V_k)$ and

(2.16) $Y_{k-1} := \mathrm{grad}\,h(X_k) - \mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\mathrm{grad}\,h(X_{k-1}),$

(2.17) $\beta_k = \dfrac{\langle \mathrm{grad}\,h(X_k), Y_{k-1} \rangle}{\|\mathrm{grad}\,h(X_{k-1})\|^2},$

(2.18) $\theta_k = \dfrac{\langle \mathrm{grad}\,h(X_k), \mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1} \rangle}{\|\mathrm{grad}\,h(X_{k-1})\|^2}.$

Step 3. Determine $\alpha_k = \max\{\alpha\rho^j,\ j = 0, 1, 2, \ldots\}$ such that

(2.19) $h\big(R_{X_k}(\alpha_k\Delta X_k)\big) \le h(X_k) - \delta\alpha_k^2\|\Delta X_k\|^2.$

Set

(2.20) $X_{k+1} := R_{X_k}(\alpha_k\Delta X_k).$

Step 4. Replace $k$ by $k+1$ and go to Step 1.

We have several remarks for the above algorithm as follows:

In Algorithm 2.1, we set
$$\Delta X_0 = -\mathrm{grad}\,h(X_0).$$
From (2.15), (2.16), (2.17), and (2.18), we get for all $k \ge 1$,

(2.21) $\langle \Delta X_k, \mathrm{grad}\,h(X_k) \rangle = \langle -\mathrm{grad}\,h(X_k), \mathrm{grad}\,h(X_k) \rangle - \theta_k\langle Y_{k-1}, \mathrm{grad}\,h(X_k) \rangle + \beta_k\langle \mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1}, \mathrm{grad}\,h(X_k) \rangle = -\|\mathrm{grad}\,h(X_k)\|^2.$

This shows that the search direction $\Delta X_k$ is a descent direction of $h$.

According to (2.6), the vector transport in (2.15) is given by
$$\mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1} = \Big(\Delta S_{k-1} - \mathrm{diag}\big(S_k(\Delta S_{k-1})^T\big)S_k,\ P_k\,\mathrm{skew}\big((P_k)^T\Delta P_{k-1}\big),\ \Delta V_{k-1}\Big).$$

Since the function $h$ is bounded from below, it follows from (2.19) and (2.20) that the sequence $\{h(X_k)\}$ is monotone decreasing and thus is convergent. Hence, we have
$$\sum_{k=0}^{\infty}\alpha_k^2\|\Delta X_k\|^2 < \infty.$$
Then it follows that

(2.22) $\lim_{k\to\infty}\alpha_k\|\Delta X_k\| = 0.$

In Step 3 of Algorithm 2.1, the initial step length is set to be $\alpha$. As mentioned in [39], the line-search process may not be very efficient. In the following, we propose a good initial guess for the initial step length. For the nonlinear mapping $\widehat{H}_{X_k}$ defined between the tangent space $T_{X_k}\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and $T_{H(X_k)}\mathbb{R}^{n\times n}$, we consider the first-order approximation of $\widehat{H}_{X_k}$ at the origin $0_{X_k}$ along the search direction $\Delta X_k$:

(2.23) $\widehat{H}_{X_k}(t\Delta X_k) \approx \widehat{H}_{X_k}(0_{X_k}) + t\,D\widehat{H}_{X_k}(0_{X_k})[\Delta X_k]$


for all $t$ close to 0. Based on (2.12) and (2.13), this is reduced to
$$H\big(R_{X_k}(t\Delta X_k)\big) \approx H(X_k) + t\,DH(X_k)[\Delta X_k]$$
for all $t$ close to 0. We have by (2.12) and (2.23),
$$\widehat{h}_{X_k}(t\Delta X_k) = \frac{1}{2}\big\|\widehat{H}_{X_k}(t\Delta X_k)\big\|^2 \approx \frac{1}{2}\big\|\widehat{H}_{X_k}(0_{X_k})\big\|^2 + t\big\langle \big(D\widehat{H}_{X_k}(0_{X_k})\big)^*\big[\widehat{H}_{X_k}(0_{X_k})\big],\ \Delta X_k \big\rangle + \frac{1}{2}t^2\big\|D\widehat{H}_{X_k}(0_{X_k})[\Delta X_k]\big\|^2$$
for all $t$ close to 0. Based on (2.12) and (2.13), this becomes

(2.24) $h\big(R_{X_k}(t\Delta X_k)\big) \approx h(X_k) + t\big\langle (DH(X_k))^*[H(X_k)],\ \Delta X_k \big\rangle + \dfrac{1}{2}t^2\big\|DH(X_k)[\Delta X_k]\big\|^2$

for all $t$ close to 0. By (2.11), we have the following local approximation of $h$ at $X_k$ along the direction $\Delta X_k$:
$$m_k(t) := h(X_k) + t\langle \mathrm{grad}\,h(X_k), \Delta X_k \rangle + \frac{1}{2}t^2\big\|DH(X_k)[\Delta X_k]\big\|^2$$
for all $t$ close to 0. Let
$$m_k'(t) = \langle \mathrm{grad}\,h(X_k), \Delta X_k \rangle + t\big\|DH(X_k)[\Delta X_k]\big\|^2 = 0.$$
We have
$$t = -\frac{\langle \mathrm{grad}\,h(X_k), \Delta X_k \rangle}{\|DH(X_k)[\Delta X_k]\|^2}.$$
Hence, a reasonable initial step length can be set to

(2.25) $t_k = -\dfrac{\langle \mathrm{grad}\,h(X_k), \Delta X_k \rangle}{\|DH(X_k)[\Delta X_k]\|^2}.$

Motivated by [39], the line-search step, i.e., Step 3 of Algorithm 2.1, can be modified such that if
$$\|DH(X_k)[\Delta X_k]\| > 0 \quad \text{and} \quad h\big(R_{X_k}(t_k\Delta X_k)\big) \le h(X_k) - \delta t_k^2\|\Delta X_k\|^2,$$
then we set $\alpha_k = t_k$; otherwise, the step length $\alpha_k$ can be selected by Step 3 of Algorithm 2.1. The numerical tests in section 5 show that the initial step length (2.25) is very effective.

Finally, we point out that there also exist other nonlinear conjugate gradient methods, which were generalized to Riemannian manifolds (see, for instance, [14, 29, 30, 31]). The global convergence of these methods is guaranteed under the strong or weak Wolfe conditions and specially constructed vector transports defined by parallel translation or scaled differentiated retraction. In general, parallel translation or scaled differentiated retraction is computationally expensive. For the StIEP, we note that $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is an embedded submanifold of $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$; it is natural to define the vector transport by orthogonal projection. As shown in section 3, the proposed geometric nonlinear conjugate gradient method with Armijo-type line search is globally convergent. From the numerical tests in section 5, one can see that Algorithm 2.1 is easy to implement and computationally effective for solving the StIEP.
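To close this section, the following compact sketch strings the pieces above together into one possible implementation of Algorithm 2.1 with the Armijo-type backtracking (2.19); it reuses `cost_h`, `riemannian_grad`, the retractions, and `transport` from the earlier sketches and is an illustration of ours rather than the authors' reference code.

```python
import numpy as np

def inner(a, b):
    """Riemannian metric (2.4): sum of entrywise products over the three blocks."""
    return sum(np.sum(x * y) for x, y in zip(a, b))

def geometric_cg(S, P, V, Lam, W, alpha=1.0, rho=0.5, delta=1e-4, maxit=500, tol=1e-12):
    g = riemannian_grad(S, P, V, Lam, W)
    d = tuple(-x for x in g)                                  # ΔX_0 = -grad h(X_0)
    for _ in range(maxit):
        if cost_h(S, P, V, Lam) <= tol:                       # Step 1
            break
        a_k, nd2 = alpha, inner(d, d)                         # Step 3: backtracking (2.19)
        while cost_h(retract_OB(S, a_k * d[0]),
                     retract_On(P, a_k * d[1]),
                     V + a_k * d[2], Lam) > cost_h(S, P, V, Lam) - delta * a_k**2 * nd2:
            a_k *= rho
        td = transport(S, P, a_k * d[0], a_k * d[1], *d)      # transported ΔX_{k-1}
        tg = transport(S, P, a_k * d[0], a_k * d[1], *g)      # transported grad h(X_{k-1})
        g_old2 = inner(g, g)
        S = retract_OB(S, a_k * d[0])                          # Step 3: update (2.20)
        P = retract_On(P, a_k * d[1])
        V = V + a_k * d[2]
        g = riemannian_grad(S, P, V, Lam, W)
        Y = tuple(gi - ti for gi, ti in zip(g, tg))            # (2.16)
        beta = inner(g, Y) / g_old2                            # (2.17)
        theta = inner(g, td) / g_old2                          # (2.18)
        d = tuple(-gi + beta * ti - theta * yi                 # Step 2: (2.15)
                  for gi, ti, yi in zip(g, td, Y))
    return S, P, V
```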


3. Global convergence. In this section, we establish the global convergence of Algorithm 2.1. We first derive the following result.

Lemma 3.1. For any point $X_0 := (S_0, P_0, V_0) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, the level set

(3.1) $\mathcal{Z} := \big\{X := (S, P, V) \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \mid h(X) \le h(X_0)\big\}$

is compact.

Proof. For any point $(S, P, V) \in \mathcal{Z}$, we have

(3.2) $\|S\|_F \le \sqrt{n}, \quad \|S \odot S\|_F \le \sqrt{n}, \quad \|P\|_F = \sqrt{n},$

and
$$\|S \odot S - P(\Lambda + V)P^T\|_F \le \|S_0 \odot S_0 - P_0(\Lambda + V_0)(P_0)^T\|_F = \sqrt{2h(X_0)}.$$
Note that the Frobenius norm is invariant under orthogonal transformation. It follows from (3.2) that
$$\|V\|_F = \|PVP^T\|_F \le \sqrt{2h(X_0)} + \|P\Lambda P^T\|_F + \|S \odot S\|_F \le \sqrt{2h(X_0)} + \|\Lambda\|_F + \sqrt{n}.$$
This, together with (3.2), shows that the level set $\mathcal{Z}$ is bounded. In addition, since $h$ is a continuous function, $\mathcal{Z}$ is also closed. Thus the level set $\mathcal{Z}$ is compact.

On the relation between the sequences $\{Y_{k-1}\}$ and $\{\Delta X_{k-1}\}$ defined in (2.15)-(2.16), we have the following result.

Proposition 3.2. Let $\{Y_{k-1}\}$ and $\{\Delta X_{k-1}\}$ be the sequences generated by Algorithm 2.1; then there exists a constant $\nu > 0$ such that for all $k$ sufficiently large,
$$\|Y_{k-1}\| \le \nu\alpha_{k-1}\|\Delta X_{k-1}\|,$$
where $\alpha_{k-1}$ is the stepsize defined in Algorithm 2.1.

Proof. By (2.7), (2.16), and (2.20), we get

(3.3) $Y_{k-1} = \mathrm{grad}\,h(X_k) - \mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\mathrm{grad}\,h(X_{k-1}) = \mathrm{grad}\,h(X_k) - \Pi_{R_{X_{k-1}}(\alpha_{k-1}\Delta X_{k-1})}\mathrm{grad}\,h(X_{k-1}) = \mathrm{grad}\,h(X_k) - \Pi_{X_k}\mathrm{grad}\,h(X_{k-1}) = \Pi_{X_k}\big(\mathrm{grad}\,h(X_k) - \mathrm{grad}\,h(X_{k-1})\big).$

Since $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ is an embedded Riemannian submanifold of $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$, one may rely on the natural inclusion $T_X\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V} \subset \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$ for all $X \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$. Thus, the Riemannian gradient $\mathrm{grad}\,h$ defined in (2.9) can be seen as a continuous mapping between $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$. For any points $X_1, X_2 \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, the operation $\mathrm{grad}\,h(X_2) - \mathrm{grad}\,h(X_1)$ is meaningful since both $\mathrm{grad}\,h(X_1)$ and $\mathrm{grad}\,h(X_2)$ can be treated as vectors in $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$. We note that $\mathrm{grad}\,h$ is Lipschitz continuous on the compact region $\mathcal{Z}$ defined in (3.1); i.e., there exists a constant $\beta_{L_1} > 0$ such that

(3.4) $\|\mathrm{grad}\,h(X_2) - \mathrm{grad}\,h(X_1)\| \le \beta_{L_1}\,\mathrm{dist}(X_1, X_2)$


for all $X_1, X_2 \in \mathcal{Z} \subset \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, where “dist” defines the Riemannian distance on $(\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}, g)$ [4, p. 46]. Since $\mathcal{V}$ is a Euclidean space, for the retraction $R$ defined on $\mathcal{V}$, we have

(3.5) $\|\gamma_V\| = \mathrm{dist}\big(R_V(\gamma_V), V\big) \quad \forall V \in \mathcal{V},\ \gamma_V \in T_V\mathcal{V}.$

We observe that the manifolds $\mathcal{OB}$ and $\mathcal{O}(n)$ are both compact manifolds. By (3.5), for the retraction $R$ defined on $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$, there exist two constants $\mu > 0$ and $\delta_\mu > 0$ such that [4, p. 149]

(3.6) $\mu\|\xi_X\| \ge \mathrm{dist}\big(R_X(\xi_X), X\big)$

for all $X \in \mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and $\xi_X \in T_X\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ with $\|\xi_X\| \le \delta_\mu$. We get by (2.22) for all $k$ sufficiently large,

(3.7) $\alpha_{k-1}\|\Delta X_{k-1}\| \le \delta_\mu.$

We have by (3.3), (3.4), (3.6), and (3.7), for all $k$ sufficiently large,
$$\|Y_{k-1}\| = \big\|\Pi_{X_k}\big(\mathrm{grad}\,h(X_k) - \mathrm{grad}\,h(X_{k-1})\big)\big\| \le \|\mathrm{grad}\,h(X_k) - \mathrm{grad}\,h(X_{k-1})\| \le \beta_{L_1}\,\mathrm{dist}(X_k, X_{k-1}) = \beta_{L_1}\,\mathrm{dist}\big(R_{X_{k-1}}(\alpha_{k-1}\Delta X_{k-1}), X_{k-1}\big) \le \beta_{L_1}\mu\alpha_{k-1}\|\Delta X_{k-1}\| \equiv \nu\alpha_{k-1}\|\Delta X_{k-1}\|,$$
where $\nu = \beta_{L_1}\mu$. The proof is complete.

In Algorithm 2.1, we observe that for all $k \ge 1$,

(3.8) $\|\mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1}\| = \|\Pi_{X_k}\Delta X_{k-1}\| \le \|\Delta X_{k-1}\|.$

From (2.17), (2.18), and (3.8), for all $k \ge 1$,

(3.9) $|\beta_k| = \dfrac{|\langle \mathrm{grad}\,h(X_k), Y_{k-1} \rangle|}{\|\mathrm{grad}\,h(X_{k-1})\|^2} \le \dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|Y_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2}, \qquad |\theta_k| = \dfrac{|\langle \mathrm{grad}\,h(X_k), \mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1} \rangle|}{\|\mathrm{grad}\,h(X_{k-1})\|^2} \le \dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|\mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2} \le \dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|\Delta X_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2}.$

By Lemma 3.1, $\mathcal{Z}$ is compact. In addition, $h$ is continuously differentiable and $\mathrm{grad}\,h$ can be seen as a continuous nonlinear mapping defined between $\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ and $\mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n} \times \mathbb{R}^{n\times n}$. Then there exists a constant $\gamma > 0$ such that

(3.10) $\|\mathrm{grad}\,h(X)\| \le \gamma \quad \forall X \in \mathcal{Z}.$

We have known that the sequence $\{h(X_k)\}$ is monotone decreasing. Thus the sequence $\{X_k\}$ generated by Algorithm 2.1 is contained in $\mathcal{Z}$.

Similar to the proof of Lemma 3.1 in [39], we can establish the following result.

Lemma 3.3. If there exists a constant $\epsilon > 0$ such that

(3.11) $\|\mathrm{grad}\,h(X_k)\| \ge \epsilon \quad \forall k,$

then there exists a constant $\Theta > 0$ such that $\|\Delta X_k\| \le \Theta$ for all $k$.


Proof. By Proposition 3.2, (2.15), (3.8), (3.9), (3.10), and (3.11), there exists an integer $k_1 > 0$ such that for all $k > k_1$,

(3.12) $\|\Delta X_k\| \le \|\mathrm{grad}\,h(X_k)\| + |\beta_k|\,\|\mathcal{T}_{\alpha_{k-1}\Delta X_{k-1}}\Delta X_{k-1}\| + |\theta_k|\,\|Y_{k-1}\| \le \gamma + \dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|Y_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2}\|\Delta X_{k-1}\| + \dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|\Delta X_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2}\|Y_{k-1}\| = \gamma + 2\,\dfrac{\|\mathrm{grad}\,h(X_k)\| \cdot \|Y_{k-1}\|}{\|\mathrm{grad}\,h(X_{k-1})\|^2}\|\Delta X_{k-1}\| \le \gamma + \dfrac{2\gamma\nu\alpha_{k-1}\|\Delta X_{k-1}\|}{\epsilon^2}\|\Delta X_{k-1}\|.$

By (2.22), we get
$$\lim_{k\to\infty}\frac{2\gamma\nu}{\epsilon^2}\alpha_{k-1}\|\Delta X_{k-1}\| = 0.$$
Then there exist a constant $r \in (0, 1)$ and an integer $k_2 > 0$ such that for all $k > k_2$, we have the following inequality:

(3.13) $\dfrac{2\gamma\nu}{\epsilon^2}\alpha_{k-1}\|\Delta X_{k-1}\| \le r.$

From (3.12) and (3.13), it follows that for all $k > k_0 := \max\{k_1, k_2\}$,
$$\|\Delta X_k\| \le \gamma + r\|\Delta X_{k-1}\| \le \gamma(1 + r + r^2 + \cdots + r^{k-k_0-1}) + r^{k-k_0}\|\Delta X_{k_0}\| \le \frac{\gamma}{1-r} + \|\Delta X_{k_0}\|.$$
Let $\Theta_1 := \max\big\{\|\Delta X_1\|, \|\Delta X_2\|, \ldots, \|\Delta X_{k_0}\|\big\}$. We have
$$\|\Delta X_k\| \le \max\Big\{\Theta_1,\ \frac{\gamma}{1-r} + \|\Delta X_{k_0}\|\Big\} =: \Theta \quad \forall k.$$

We now establish the global convergence of Algorithm 2.1. The proof can be viewed as a generalization of [39, Theorem 3.2].

Theorem 3.4. Let $\{X_k\}$ be the sequence generated by Algorithm 2.1. Then we have
$$\liminf_{k\to\infty}\|\mathrm{grad}\,h(X_k)\| = 0.$$

Proof. For the sake of contradiction, we assume that there exists a constant $\epsilon > 0$ such that

(3.14) $\|\mathrm{grad}\,h(X_k)\| \ge \epsilon \quad \forall k.$

We have by (2.21) that for all $k \ge 1$,
$$\alpha_k\|\mathrm{grad}\,h(X_k)\|^2 = -\alpha_k\langle \Delta X_k, \mathrm{grad}\,h(X_k) \rangle \le \alpha_k\|\Delta X_k\| \cdot \|\mathrm{grad}\,h(X_k)\|,$$
which implies that
$$0 \le \alpha_k\|\mathrm{grad}\,h(X_k)\| \le \alpha_k\|\Delta X_k\| \quad \forall k \ge 1.$$
We get by (2.22),
$$\lim_{k\to\infty}\alpha_k\|\mathrm{grad}\,h(X_k)\| = 0.$$


Hence, if $\liminf_{k\to\infty}\alpha_k > 0$, then $\liminf_{k\to\infty}\|\mathrm{grad}\,h(X_k)\| = 0$. This contradicts (3.14).

We now suppose that $\liminf_{k\to\infty}\alpha_k = 0$. By taking a subsequence if necessary, we may assume that

(3.15) $\lim_{k\to\infty}\alpha_k = 0.$

According to Step 3 of Algorithm 2.1, we have for all $k$ sufficiently large,

(3.16) $h\big(R_{X_k}(\rho^{-1}\alpha_k\Delta X_k)\big) - h(X_k) \ge -\delta\rho^{-2}\alpha_k^2\|\Delta X_k\|^2.$

We note that the pullback function $\widehat{h}$ is continuously differentiable since both the cost function $h$ and the retraction mapping $R$ are continuously differentiable. Then there exist two constants $\delta_2 > 0$ and $\beta_{L_2} > 0$ such that

(3.17) $\big\|\mathrm{grad}\,\widehat{h}_X(\eta_X) - \mathrm{grad}\,\widehat{h}_X(\xi_X)\big\| \le \beta_{L_2}\|\eta_X - \xi_X\|$

for any $X \in \mathcal{Z}$ and $\xi_X, \eta_X \in T_X\mathcal{OB} \times \mathcal{O}(n) \times \mathcal{V}$ with $\|\eta_X - \xi_X\| < \delta_2$. By using the mean-value theorem, Lemma 3.3, (2.14), (2.21), (2.22), (3.16), and (3.17), there exists a constant $\omega_k \in (0, 1)$ such that for all $k$ sufficiently large,

(3.18) $h\big(R_{X_k}(\rho^{-1}\alpha_k\Delta X_k)\big) - h(X_k) = h\big(R_{X_k}(\rho^{-1}\alpha_k\Delta X_k)\big) - h\big(R_{X_k}(0_{X_k})\big) = \widehat{h}_{X_k}\big(\rho^{-1}\alpha_k\Delta X_k\big) - \widehat{h}_{X_k}\big(0_{X_k}\big) = \rho^{-1}\alpha_k\big\langle \mathrm{grad}\,\widehat{h}_{X_k}\big(\omega_k\rho^{-1}\alpha_k\Delta X_k\big),\ \Delta X_k \big\rangle = \rho^{-1}\alpha_k\big\langle \mathrm{grad}\,\widehat{h}_{X_k}(0_{X_k}),\ \Delta X_k \big\rangle + \rho^{-1}\alpha_k\big\langle \Delta X_k,\ \mathrm{grad}\,\widehat{h}_{X_k}\big(\omega_k\rho^{-1}\alpha_k\Delta X_k\big) - \mathrm{grad}\,\widehat{h}_{X_k}(0_{X_k}) \big\rangle \le \rho^{-1}\alpha_k\big\langle \mathrm{grad}\,\widehat{h}_{X_k}(0_{X_k}),\ \Delta X_k \big\rangle + \beta_{L_2}\omega_k\rho^{-2}\alpha_k^2\|\Delta X_k\|^2 \le -\rho^{-1}\alpha_k\|\mathrm{grad}\,h(X_k)\|^2 + \beta_{L_2}\rho^{-2}\alpha_k^2\|\Delta X_k\|^2.$

Combining (3.16) with (3.18) yields for all $k$ sufficiently large,
$$\|\mathrm{grad}\,h(X_k)\|^2 \le (\beta_{L_2} + \delta)\rho^{-1}\alpha_k\|\Delta X_k\|^2.$$
By Lemma 3.3, we know that the sequence $\{\Delta X_k\}$ is bounded. This, together with (3.15), gives rise to
$$\lim_{k\to\infty}\|\mathrm{grad}\,h(X_k)\| = 0.$$
This is also a contradiction. Thus the proof is complete.

Next, we establish the relationship between a stationary point of $h$ in (2.3) and a solution to the StIEP via Algorithm 2.1. According to (2.2) and (2.9), we have
$$\Pi_P\frac{\partial}{\partial P}\overline{h}(S, P, V) = \frac{1}{2}\Big([S \odot S - H(S, P, V), H^T(S, P, V)] + [(S \odot S - H(S, P, V))^T, H(S, P, V)]\Big)P = \frac{1}{2}\Big([S \odot S, H^T(S, P, V)] + [(S \odot S)^T, H(S, P, V)]\Big)P.$$
