
HAL Id: hal-01093907

https://hal.inria.fr/hal-01093907v2

Submitted on 28 Dec 2015


Adding a single state memory optimally accelerates symmetric linear maps

Alain Sarlette

To cite this version:

Alain Sarlette. Adding a single state memory optimally accelerates symmetric linear maps. IEEE Transactions on Automatic Control, Institute of Electrical and Electronics Engineers, 2016, 61 (11), pp. 3533. DOI: 10.1109/TAC.2016.2516247. hal-01093907v2.


Adding a single state memory optimally accelerates symmetric linear maps

A. Sarlette — alain.sarlette@inria.fr

QUANTIC project team, Paris Sciences Lettres University, INRIA Rocquencourt, Paris, France

& SYSTeMS, Ghent University, Belgium.

June 2015

Abstract

Previous papers have proposed to add memory registers to the dynamics of discrete-time linear systems in order to accelerate their convergence. In particular, it has been proved that adding one memory slot per agent allows faster convergence towards average consensus. We here prove that this situation cannot be improved by adding more memory slots, when the knowledge about the self-adjoint linear map to be accelerated reduces to bounds on its extreme eigenvalues.

1 Main result & related work

1.1 Motivating question: consensus

A consensus algorithm in discrete time [1] writes, for a column vector x = [x_1; x_2; ...; x_N] ∈ R^N:

x(t + 1) = x(t) − αL x(t) .   (1)

Here x_k ∈ R is the state of agent k, α > 0 is a gain, and L is the Laplacian matrix characterizing interactions among agents: component j of Lx equals Σ_{k=1}^{N} w_{j,k} (x_j − x_k) with weights w_{j,k} ≥ 0. Usually L contains many zeros, as each agent interacts with few fellows. For the algorithm to preserve the average of the initial values x_k(0), we assume w_{j,k} = w_{k,j} for all j, k. The Laplacian L is then symmetric nonnegative definite, and if the interactions form a connected graph it has a single eigenvalue λ_1 = 0 with eigenvector v_1 = [1; 1; ...; 1]; we call v_1 a consensus situation. When (I_N − αL) has nonnegative entries, with I_N the N × N identity matrix, it can equivalently be viewed as the transition matrix of a Markov chain; symmetric L implies that the limiting distribution is uniform.

For time-invariant L, the convergence of (1) is dictated by the largest eigenvalue, in modulus, of (I_N − αL) — excluding the trivial eigenvalue 1 associated to λ_1 = 0 and eigenvector v_1, which spans the target subspace. In the orthonormal basis corresponding to the eigenvectors of L (so-called “modes”), the system decouples into

x̃_k(t + 1) = (1 − αλ_k) x̃_k(t) ,   k = 1, 2, ..., N,   (2)

with x̃_k the coefficient of mode k and λ_k the associated eigenvalue. If we only know that the eigenvalues λ_k of L belong to an interval [λ̲, λ̄] ⊂ R_{>0} for k = 2, 3, ..., N, then the convergence speed — in terms of bounds on eigenvalues of (2) guaranteed over all λ_k ∈ [λ̲, λ̄] — is optimal when α is selected to satisfy (1 − αλ̲) = −(1 − αλ̄). This gives α = 2/(λ̲ + λ̄) and worst eigenvalue µ = (λ̄ − λ̲)/(λ̄ + λ̲).
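To make the memoryless tuning concrete, here is a minimal numerical sketch in Python (ours, not taken from the paper), assuming numpy; the path graph, its unit weights and the number of agents are illustrative assumptions, only the formulas for α and µ come from the text above.

    # Iteration (1) with the optimal memoryless gain alpha = 2/(lam_min + lam_max).
    # The path graph on N = 5 agents with unit weights w_{j,k} = 1 is illustrative.
    import numpy as np

    N = 5
    L = np.zeros((N, N))
    for j in range(N - 1):
        L[j, j] += 1.0
        L[j + 1, j + 1] += 1.0
        L[j, j + 1] -= 1.0
        L[j + 1, j] -= 1.0

    eigs = np.sort(np.linalg.eigvalsh(L))
    lam_min, lam_max = eigs[1], eigs[-1]            # nonzero eigenvalues lambda_2 .. lambda_N
    alpha = 2.0 / (lam_min + lam_max)               # optimal gain for (1)
    mu = (lam_max - lam_min) / (lam_max + lam_min)  # worst contraction factor per step

    x = np.random.randn(N)
    avg = x.mean()
    for _ in range(100):
        x = x - alpha * (L @ x)                     # iteration (1)
    print(mu, np.abs(x - avg).max())                # disagreement decays roughly like mu^t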

In practice, the time step of discrete-time consensus is mostly limited by communication speed, much more rarely by local computation power. Hence [2] and later [3, 4, 5] propose to improve convergence speed by adding local dynamics to each agent, namely a memory slot with associated gain β_1 ∈ R:

x_k(t+1) = x_k(t) − α Σ_{j=1}^{N} w_{k,j} (x_k(t) − x_j(t)) − β_1 (x_k(t) − x_k(t−1)) .   (3)

The papers analyze in detail only this case of one memory slot and prove that with β_1 < 0 it allows faster convergence. This spurs the natural question:

How much can be gained by adding more memory slots?

The answer is strictly nothing, as we prove next in a more general setting.
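As a hedged illustration of update (3), the following Python sketch (ours, not code from [2, 3, 4, 5]) implements the one-memory recursion in vector form; the function name and the initialization x(−1) = x(0) are our own choices.

    # x(t+1) = x(t) - alpha*L x(t) - beta1*(x(t) - x(t-1)), i.e. update (3) for all agents at once.
    import numpy as np

    def accelerated_consensus(L, alpha, beta1, x0, steps):
        """Run (3) with the convention x(-1) = x(0) and return the final state."""
        x_prev = x0.copy()
        x = x0.copy()
        for _ in range(steps):
            x_next = x - alpha * (L @ x) - beta1 * (x - x_prev)
            x_prev, x = x, x_next
        return x

Driven with the Laplacian of the previous sketch and the tuning derived in Section 1.2 below, this reaches the same average faster than (1).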

1.2 Formal result

Consider a general linear iteration

x(t + 1) = x(t) + α(b − Ax(t)) (4)

where x = [x_1; x_2; ...; x_N] ∈ R^N is a dynamic variable, α ∈ R is a constant gain to be chosen, b ∈ R^N is a given constant bias and A is a given constant, self-adjoint positive semidefinite linear map on R^N. We are interested in applications where (4) is applied iteratively in order to converge towards the set of fixed points S = {s ∈ R^N : As = b}. For consensus, with A = L and b = 0, we have S = {λv_1 : λ ∈ R}; in other computational applications S might reduce to a single point; we always assume S to be nonempty.
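For concreteness, a minimal Python sketch (ours, not from the paper) of iteration (4) on a toy positive definite A, assuming numpy; the matrix, bias and gain are illustrative choices.

    # Iteration (4) converging to the unique fixed point s with A s = b.
    import numpy as np

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # illustrative self-adjoint A > 0
    b = np.array([1.0, -1.0])
    lam = np.linalg.eigvalsh(A)
    alpha = 2.0 / (lam[0] + lam[-1])           # memoryless gain as in Section 1.1

    x = np.zeros(2)
    for _ in range(200):
        x = x + alpha * (b - A @ x)            # iteration (4)
    print(np.allclose(x, np.linalg.solve(A, b)))   # True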

Motivated by the consensus setting, we consider a “memory-accelerated” version of (4):

x(t + 1) = x(t) + α(b − Ax(t)) + Σ_{m=0}^{M−1} β_m x(t−m) ,

with chosen gains β_m ∈ R for m = 0, 1, 2, ..., M − 1. This yields stationary solutions x satisfying α(b − Ax) + Σ_{m=0}^{M−1} β_m x = 0. Hence we require Σ_{m=0}^{M−1} β_m = 0 to ensure that S remains a set of stationary solutions under the “memory-accelerated” dynamics. The latter can then be rewritten as:

x(t + 1) = x(t) + α(b − Ax(t)) + Σ_{m=1}^{M−1} β_m (x(t−m) − x(t)) ,   (5)

with freely chosen gains β_m ∈ R. For analysis, we can rewrite (5) in the basis of eigenvectors of A and this just leads to a set of N independent scalar equations

x̃_k(t + 1) = (1 − αλ_k) x̃_k(t) + α b̃_k + Σ_{m=1}^{M−1} β_m (x̃_k(t−m) − x̃_k(t))   (6)

where λ_k are the associated eigenvalues of A, for k = 1, 2, ..., N. Hence it is clear that convergence of (5) is governed by the separate convergence of each modal component x̃_k associated to its eigenvalue λ_k in (6). Note that for S to be nonempty, we must have b̃_k = 0 for every k for which λ_k = 0. Then the modes with λ_k = 0 of both (4) and (5) (initialized with x̃_k(t) = x̃_k(0) for t < 0) become trivial, such that convergence towards S is governed by the nonzero eigenvalues of A only. We consider the setting where the choice of α, β_1, ..., β_{M−1} is allowed to exploit the following knowledge. Let |z| and ∠(z) respectively denote the modulus and the phase of z ∈ C.

Assumption 1: It is known a priori that the nonzero eigenvalues λ_k of A take values in the interval [λ̲, λ̄] ⊂ (0, +∞).

Definition 2: Denote by γ_m(λ), m = 1, 2, ..., M, the roots of the characteristic polynomial of the linear difference equation (6) with λ_k = λ. Then we define the convergence speed guarantee ν of (5) for fixed α, β_1, ..., β_{M−1} by

ν := max_{λ ∈ [λ̲, λ̄]}  max_{m ∈ {1, 2, ..., M}}  |γ_m(λ)| ,

i.e. ν denotes the worst possible eigenvalue of (6) over all λ_k satisfying Assumption 1.

We then prove the following result.

Theorem 3: Consider dynamics (5) with A self-adjoint and satisfying Assumption 1. Then the convergence speed guarantee is optimized, i.e. ν ∈ (0, 1) is minimized over all α, β_1, β_2, ..., β_{M−1} ∈ R, when using just a single memory slot optimally tuned as in [2], i.e. taking:

β_m = 0 for all m > 1 ,   β_1 = β_1* := −( 1/µ − √(1/µ² − 1) )² ,   α = α* := 2(1 − β_1*)/(λ̲ + λ̄) ,

where µ = (λ̄ − λ̲)/(λ̄ + λ̲). The corresponding convergence speed guarantee is

ν = 1/µ − √(1/µ² − 1) = √(−β_1*) .

This ν increases with µ, which was the optimal convergence speed guarantee for M = 1, i.e. no added memory slot. For µ = 1 − ε, with ε frequently called the spectral gap, we have ν < 1 − √ε and the bound gets tight as ε → 1.
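The optimal tuning of Theorem 3 is easy to evaluate numerically. The following Python sketch (ours, not from the paper) computes β_1*, α* and the guarantee from the two eigenvalue bounds, then checks on a grid of λ that both roots of the M = 2 characteristic polynomial keep modulus √(−β_1*); the numerical interval is borrowed from the example of Section 4.

    import numpy as np

    def theorem3_tuning(lam_min, lam_max):
        mu = (lam_max - lam_min) / (lam_max + lam_min)
        nu = 1.0 / mu - np.sqrt(1.0 / mu**2 - 1.0)   # guarantee of Theorem 3
        beta1 = -nu**2                               # beta_1* = -(1/mu - sqrt(1/mu^2 - 1))^2
        alpha = 2.0 * (1.0 - beta1) / (lam_min + lam_max)
        return alpha, beta1, nu

    lam_min, lam_max = 0.0122, 0.9878                # values of the Section 4 example
    alpha, beta1, nu = theorem3_tuning(lam_min, lam_max)
    print(alpha, beta1, nu)                          # approx. 3.28, -0.64, 0.80

    worst = 0.0
    for lam in np.linspace(lam_min, lam_max, 1001):
        # characteristic polynomial of (6) for M = 2: z^2 - (1 - alpha*lam - beta1) z - beta1
        roots = np.roots([1.0, -(1.0 - alpha * lam - beta1), -beta1])
        worst = max(worst, np.abs(roots).max())
    print(worst)                                     # stays at nu (about 0.80), up to rounding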

Remark 4: [2] and related work mention that β_1 < 0 is necessary to get acceleration over the memoryless case. As an auxiliary result, it is not difficult to show for (5) that if we are restricted to β_m ≥ 0 for all m, then no acceleration with respect to the memoryless case can be achieved.

1.3 Related acceleration approaches

Before proving Theorem 3 on the basis of complex polynomial analysis, we mention a few related acceleration schemes in the literature. We hope that in the future these viewpoints might lead to a more elegant explanation and intuitive understanding of the limitation to one memory slot.

Optimization techniques   The scheme studied here is not unlike techniques proposed earlier by the optimization community, where acceleration methods are a major focus point. For instance the Nesterov method [6] accelerates the gradient descent optimization of a convex function f, i.e. it accelerates

x(t + 1) = x(t) − α ∇_x f(x(t))

by applying, in its most basic version:

m(t) = x(t) − α ∇_x f(x(t))
x(t + 1) = m(t) + g(t) (m(t) − m(t − 1)) .

For f = (1/2) x^T A x − b^T x, this corresponds exactly to the acceleration (5) of (4) with M = 2, except that the Nesterov method involves a time-varying step size g(t), whose details go beyond the scope of the present paper. Our result seems to suggest that just adding more memory slots would not further accelerate the Nesterov method.
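As a hedged sketch (ours, and a simplification of the method cited as [6]), the following Python code runs the recursion above for the quadratic f with a constant extrapolation factor g in place of the time-varying g(t); the step size and momentum values are illustrative assumptions.

    import numpy as np

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # illustrative self-adjoint A > 0
    b = np.array([1.0, -1.0])
    alpha, g = 0.4, 0.5                     # illustrative step size and momentum

    x = np.zeros(2)
    m_prev = x.copy()
    for _ in range(200):
        m = x - alpha * (A @ x - b)         # gradient step: grad f(x) = A x - b
        x = m + g * (m - m_prev)            # extrapolation step
        m_prev = m
    print(np.allclose(x, np.linalg.solve(A, b), atol=1e-8))   # True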

Robust control   The problem setting (6) with b̃_k = 0 can be viewed as an ensemble of closed-loop systems resulting from proportional feedback with different gains λ_k, i.e.:

y(z) = H(z) u(z) ,   u(z) = −λ_k y(z)   with   H(z) = α / ( z − 1 + Σ_{m=1}^{M−1} β_m (1 − z^{−m}) )

where y(z) and u(z) are the output and input respectively and H(z) is the plant transfer function. We then want to design the plant α, β_1, ..., β_{M−1} to get the fastest possible worst-case performance over the ensemble of feedback gains λ_k ∈ [λ̲, λ̄]. In view of Theorem 3, a plant with M = 2 would be best in terms of convergence speed. As a major caveat though, the present paper proves optimality over all H(z) with no zeros and with at least one pole at z = 1; the possible practical relevance of this setting in control applications would have to be checked.

Various reformulations of where the uncertainty sits can of course be envisioned, e.g. the uncertain plant could be modeled by A whereas α, β_1, ..., β_{M−1} would characterize the controller. From that perspective, (6) can be viewed as an optimal control problem for a so-called (non-symmetric) interval matrix or linear parametric uncertainty system, along the lines of Kharitonov’s theorem. Unfortunately the latter does not generalize easily to discrete-time systems [7, 8]. An interesting result when there is a single uncertain parameter is proposed in [9]. Nevertheless, the author found no simple way to use this towards proving Theorem 3.

Accelerated consensus As mentioned in the introduction, our initial motivation was the study of consensus algorithms accelerated through “local” memories; see [2, 3, 4, 5] which propose the scheme (similar to) (5) and analyze the case M = 2. Very recently [10] has proposed a scheme with “extrapolating” step, similar to the case M = 2 (and to the Nesterov method, see above) but where the limited information known about the network is a bound on the total number of nodes, instead of on the extremal eigenvalues of L.

Researchers have also considered accelerations based on different resources than local memory, for instance allowing periodically time-varying gain α in (4). Optimal tuning rules for α(t) can then build on polynomial-based filtering [12, 13]. The optimal acceleration strategy of [11] can also be viewed as an instance of this framework, provided one limits the algorithm to a finite number of Chebyshev polynomials which are then applied periodically. In this latter context, when λ_k can take all values in [λ̲, λ̄], it is easy to check that the memory-based strategy (6) is faster than the optimal periodically time-varying one. However if more is known about the λ_k, then polynomial-based filtering is straightforward to adapt towards better speed-up. In the extreme case where L has M different eigenvalues whose values are exactly known, it is possible to construct a polynomial of order M that achieves deadbeat convergence in M steps. Similar “finite-time consensus” strategies have been proposed and studied in e.g. [14, 15], with an interpretation of the resulting algorithms as gathering initial state information at a central node, computing its average and redistributing the average through the network.

It seems not straightforward to compare the resources “local memory” and “time-varying gain” outside an applicational context. In particular a standard time-varying implementation requires a stronger type of synchronization among all the nodes, namely they must not only agree on the frequency of updates but also on which gain to apply at each step. This may or may not be restrictive. The Nesterov method combines both memory and time-dependent step lengths. These approaches, along with accelerated Markov chains (see next paragraph), offer alternatives towards further acceleration but apparently only when more is known than λ_k ∈ [λ̲, λ̄].

Accelerated Markov chains   The optimal acceleration of the dual of consensus, namely Markov chains, has also been extensively studied in the literature, see e.g. the “lifting” method of [16] and an even faster “pseudo-lifting” in [17]. Those approaches exploit knowledge of the particular network, hence they apply to the case where more is known than only λ_k ∈ [λ̲, λ̄] (see previous paragraph). In fact [17] proposes an accelerated consensus version which directly connects with the “gather-and-distribute” strategy of [14, 15], although with a priori more complex communication requirements: it requires each node to communicate a vector of several values at each time step.

Yet lifted Markov chains cannot be viewed straightforwardly as the dual of accelerated consensus.¹ Whether the result of [2] and/or Theorem 3 can be used to investigate Markov chain acceleration, possibly without requiring detailed network knowledge, remains to be studied. The lifted Markov chain [16] using detailed network knowledge is limited to quadratic speed-up.

Information theory   Consider the consensus application from the viewpoint of an individual node k. The messages received by node k when symmetrically exchanging data with a network reflect in part the other nodes’ initial values {x_j(0) : j ≠ k} and in part the influence of x_k(0) on the other nodes in the network. It seems not unreasonable to suspect that local memory can help disentangle these influences. Yet Theorem 3 says that if the network is poorly characterized, in the precise sense λ ∈ [λ̲, λ̄], then having more than one additional memory is not helpful. This somehow seems to say: taking into account the direct feedback loop from k through its neighbors j and directly back to k does allow improved convergence; but speculating about longer feedback loops does not pay off.

¹ In fact, although this is a minor point, even in the unaccelerated case, Markov chains unlike consensus are restricted to systems with positive coefficients in the matrix I − αL.

Finally, let us emphasize that Theorem 3 starts with a positive semidefinite self-adjoint matrix A, which can be viewed as a diffusive operator, or in consensus as an undirected communication matrix. In contrast, the state matrix characterizing the accelerated system (5) is not symmetric. Hence it seems to introduce transport, or hidden directed communication (effects), which is known to improve convergence speed in several contexts [16, 18]. This also implies that the acceleration of [2], exactly like Markov chain liftings, cannot be applied iteratively. I.e. the new state matrix, characterizing the system that results from an acceleration according to [2], is not symmetric, which precludes applying the technique of [2] once more.

In the following we first prove Theorem 3, before discussing an example in Section 4.

2 Proof, first part: reformulation

We consider the dynamics with memory slots in the form (6). The convergence speed for each mode is governed by the roots of

P(z; λ_k) = z^M − (1 − αλ_k) z^{M−1} + Σ_{m=0}^{M−2} β_{M−m−1} (z^{M−1} − z^m) ,   (7)

viewed as a function of z parameterized by λ_k. In accordance with Definition 2, for the sequel we replace the set of λ_k by a generic λ which runs through [λ̲, λ̄].
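Numerically, the roots of (7) and the guarantee ν of Definition 2 can be evaluated directly. The following Python sketch (ours, not from the paper) builds the coefficient vector of P(z; λ) for given α, β_1, ..., β_{M−1} and grids λ over the interval of Assumption 1; the helper names and the grid resolution are our own choices.

    import numpy as np

    def char_poly_coeffs(alpha, betas, lam):
        """Coefficients (highest degree first) of P(z; lam) in (7), betas = [beta_1, ..., beta_{M-1}]."""
        M = len(betas) + 1
        # z^M - (1 - alpha*lam - sum(betas)) z^{M-1} - beta_1 z^{M-2} - ... - beta_{M-1}
        coeffs = np.zeros(M + 1)
        coeffs[0] = 1.0
        coeffs[1] = -(1.0 - alpha * lam - sum(betas))
        for m, beta in enumerate(betas, start=1):
            coeffs[1 + m] = -beta
        return coeffs

    def guarantee_nu(alpha, betas, lam_min, lam_max, grid=2001):
        """Approximate nu of Definition 2 by gridding lam over [lam_min, lam_max]."""
        return max(np.abs(np.roots(char_poly_coeffs(alpha, betas, lam))).max()
                   for lam in np.linspace(lam_min, lam_max, grid))

For instance, with the tuning of Theorem 3 on the interval of Section 4, guarantee_nu(3.28, [-0.64], 0.0122, 0.9878) returns approximately 0.8.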

2.1 Optimal solution with one memory slot

To make the paper self-contained, we outline a proof of optimal tuning for the case M = 2, which is covered in [2].

Proposition 5 [adapted from [2]]: For M = 2 the tuning β_1 = β_1*, α = α*, as proposed in Theorem 3, minimizes the value of the convergence speed guarantee ν. Moreover, the roots of

P*(z; λ) = z² − (1 − α*λ) z + β_1* (z − 1)

take the values z_±*(λ) = ν e^{±iθ_λ}, where the map λ ↦ θ_λ is a continuous bijection from [λ̲, λ̄] to [0, π].

Proof: First note that β_1 ≥ 1 always leads to an unstable eigenvalue. Then the roots of P(z; λ) are equal to the roots z_±(λ̃) of

f(z; λ̃) = z² − (1 − β_1) λ̃ z − β_1   with   λ̃ = (1 − αλ − β_1)/(1 − β_1) .

We first fix a value of β_1 and study the roots of f(z; λ̃) as a function of λ̃, which translates into an optimal value for α. Standard function analysis yields:

Property 5: The worst root max{ |z_+(λ̃)|, |z_−(λ̃)| } of f(z; λ̃) is a monotonically (non-strictly) increasing function of |λ̃|.

Hence for given β_1 the choice of α should strive to minimize max_{λ ∈ [λ̲, λ̄]} |λ̃|. This is obtained by choosing α such that (1 − αλ̲ − β_1) = −(1 − αλ̄ − β_1), since |λ̃| takes its extremal value(s) when λ is at the boundary of the interval [λ̲, λ̄]. We hence get the condition α = 2(1 − β_1)/(λ̲ + λ̄). Moreover, this implies that λ̃ runs through the interval [−µ, µ] when λ spans [λ̲, λ̄], with µ = (λ̄ − λ̲)/(λ̄ + λ̲) independent of β_1.

It remains to optimize β_1, assuming that it is associated to its respective optimal α; by Property 5 and the previous sentence, this reduces to:

min_{β_1 ∈ R}  max_{λ̃ ∈ {−µ, +µ}}  max { |z_±(λ̃)| : f(z_±(λ̃); λ̃) = 0 } .

A standard function analysis shows that it is optimal to take β_1 such that z_+(µ) = z_−(µ) = −z_+(−µ) = −z_−(−µ). This yields the expressions of α*, β_1* and ν given in Theorem 3.

The just mentioned optimality condition requires that the discriminant ∆ = (1 − β_1*)² λ̃² + 4β_1* of f(z; λ̃) = 0 takes its zeros for λ̃ = ±µ. Then for all λ̃ ∈ [−µ, µ], i.e. all λ ∈ [λ̲, λ̄], we have ∆ ≤ 0. Therefore 4 |z_±*(λ)|² = (1 − β_1*)² λ̃² − ∆ (sum of squared real and imaginary parts) = −4β_1*, independently of λ ∈ [λ̲, λ̄]. Continuity of the roots of f(z; λ̃) between λ̃ = µ and λ̃ = −µ yields the last part of the Proposition.

2.2 General polynomial property

We now reformulate P(z; λ) using Proposition 5.

Claim 6 [to be proved]: For any ν ∈ (0, 1) and P̃_{M−1}(y) = Σ_{k=0}^{M−1} a_k y^k with a_0, a_1, ..., a_{M−1} ∈ R and a_{M−1} ≠ −1, there exists θ ∈ [0, π] for which

P̃(y; θ) = (y − e^{iθ})(y − e^{−iθ}) y^{M−2} + (y − 1/ν) P̃_{M−1}(y)   (8)

has a root of modulus ≥ 1.

Proposition 7: Theorem 3 is true if Claim 6 holds.

Proof [of Proposition 7]: Let

P̂(z; θ) = (z − νe^{iθ})(z − νe^{−iθ}) z^{M−2} + (z − 1) Σ_{m=0}^{M−1} β̃_m z^m .

Using z² − (1 − α*λ) z + β_1* (z − 1) = (z − νe^{iθ_λ})(z − νe^{−iθ_λ}) from Proposition 5, it is straightforward to check that P̂(z; θ_λ) = (α/α*) P(z; λ) provided

β̃_k = (α/α*) Σ_{m=0}^{k} β_{M−m−1}   for k ≤ M − 3 ,
β̃_{M−2} = (α/α*) ( Σ_{m=0}^{M−2} β_{M−m−1} ) − β_1* ,
β̃_{M−1} = α/α* − 1 .

Thus by taking appropriate β̃_0, β̃_1, ..., β̃_{M−1} ∈ R, the roots of P̂(z; θ_λ) can be made equal to the roots of P(z; λ) from Theorem 3 for any α, β_1, ..., β_{M−1} ∈ R, except for α = 0. The latter case is trivially uninteresting since for α = 0 any x ∈ R^N would be a stationary point of (5), hence no convergence can be obtained. Moreover, the case β̃_{M−1} = −1 will never appear since this would require infinite α. The tuning proposed in Theorem 3, with roots ν e^{±iθ_λ} and 0, corresponds to β̃_0 = β̃_1 = ... = β̃_{M−1} = 0. The parameter λ ∈ [λ̲, λ̄] in P(z; λ) is transferred to θ_λ ∈ [0, π] in P̂(z; θ_λ), according to Proposition 5.

Hence a sufficient condition to prove Theorem 3 is to show that for any β̃_0, β̃_1, ..., β̃_{M−1} ∈ R with β̃_{M−1} ≠ −1, there exists some θ ∈ [0, π] such that P̂(z; θ) has at least one root with modulus ≥ ν. The statement of the Proposition is obtained by defining y = z/ν and a_m = β̃_m ν^{m−M+1} for m = 0, 1, ..., M − 1.

Claim 6 looks like a discrete-time robust (in)stability property. Surprisingly, we know no standard result that would straightforwardly establish it. Examples can be constructed where the relevant roots are never real, or appear only for intermediate values of θ (see Section 4), which seems to rule out simple variants of polynomial roots properties. Hence the remainder of this paper explicitly analyzes P̃(y; θ) in the complex plane. Note that for any polynomial P(z), the modulus |P(z)| is continuous for all z ∈ C, while e^{i∠(P(z))} is continuous for all z ∈ C \ {z : P(z) = 0}.

3 Proof, second part: analyzing P̃(y; θ)

We split up P̃(y; θ) into

P_1(y; θ) := (y − e^{iθ})(y − e^{−iθ}) y^{M−2} ,   P_2(y) := −(y − 1/ν) P̃_{M−1}(y) .

Any y at which P_1(y; θ) = P_2(y) is a root of P̃(y; θ). We first dispose of two special cases.

Proposition 8:
(a) If P̃_{M−1} in Claim 6 has a root on the unit circle, then P̃(y; θ) has a root y_1 with |y_1| ≥ 1 for some θ ∈ [0, π].
(b) If a_{M−1} < −1 in Claim 6, then P̃(y; θ) has a root y_1 with |y_1| ≥ 1 for all θ ∈ [0, π].

Proof: (a) Let y* be the root of P̃_{M−1} on the unit circle and take e^{iθ} = y* if Imaginary(y*) ≥ 0, else e^{−iθ} = y*. Then P̃(y*; θ) = 0 with |y*| ≥ 1.

(b) We separately analyze (i) the sign and (ii) the magnitude of both P_1 and P_2 for y belonging to the real positive axis.
(i) Let y* ≥ 1/ν > 1 be the largest real root of P_2(y). Both P_2(y) and P_1(y; θ) are positive for y ∈ (y*, +∞), for any θ.
(ii) |P_2(y)| < |P_1(y; θ)| for y close to y*, while for very large |y| we have |P_2(y)| ≃ |a_{M−1}| |y|^M > |P_1(y; θ)| ≃ |y|^M.
Hence for any θ, there exists some y_1 ∈ (y*, +∞) where |P_1(y_1; θ)| = |P_2(y_1)|, which with (i) implies P_1(y_1; θ) = P_2(y_1). This y_1 is a root of P̃(y; θ) with |y_1| > 1.

From Prop. 8(a), we can reduce our investigation to the case where the roots of P_2 are disjoint from the roots of P_1. Indeed, we have excluded roots of P_2 on the unit circle, whereas a situation with P_2 and P_1 having the common root y = 0 can be reduced to an equivalent algorithm with a lower value of M.

The analysis of P̃(y; θ) towards Claim 6 for the remaining cases is essentially a generalization of the proof of Prop. 8(b), from the real line towards the complex plane.

3.1 Partitioning the complex plane with P_1 and P_2

For given P_1(·; θ) and P_2(·), we can partition the complex plane into a collection of open connected sets where |P_1| > |P_2| (we call these type 1 sets), a collection of open connected sets where |P_2| > |P_1| (we call these type 2 sets), separated by sets where |P_2| = |P_1|. Note that, as the roots of P_1 and P_2 can be assumed disjoint (see Prop. 8(a)), any root of P_2 unambiguously belongs to a type 1 set and any root of P_1 belongs to a type 2 set.

Figure 1: Illustrating the definition of the type 1 sets (shaded), type 2 sets (plain white) and Γ_R(y*) (darker shading). Small blue crosses represent roots of P_2, small blue circles the roots of P_1. The boundary ∂Γ_R(y*) (thick plain and dotted curves) and its subset ∂Γ°(y*) (thick plain curves) are also represented. The complete picture is meant to be symmetric around the real axis. The sets drawn here are purely schematic, they are not meant to correspond to any actual polynomials P_1, P_2 and may possibly be more general than what polynomial properties would allow.

For any root y* of P_2 with |y*| > 1, let Γ(y*) be the type 1 set containing y*. Then for any R > |y*|, let

Γ̃_R(y*) := Γ(y*) ∩ {y ∈ C : 1 < |y| < R}

and define Γ_R(y*) to be the connected component of Γ̃_R(y*) such that y* ∈ Γ_R(y*) (see Figure 1). There is at least one such set, corresponding to y* = 1/ν. We denote the boundary of Γ_R(y*) by ∂Γ_R(y*). Note that for any y ∈ ∂Γ_R(y*) we have either |P_2(y)| = |P_1(y; θ)|, or |P_2(y)| < |P_1(y; θ)|; the latter case involves points where either |y| = 1, or |y| = R. We finally define ∂Γ°(y*) := ∂Γ_R(y*) ∩ {y : |y| < R} and reduce the proof of Claim 6 to the following.

Claim 9 [to be proved]: For any P_2 — excluding the cases already handled by Proposition 8 — there exists θ̃ ∈ [0, π] and y* a root of P_2 such that e^{i∠(P_2(ỹ))} = e^{i∠(P_1(ỹ; θ̃))} for some ỹ ∈ ∂Γ°(y*).

Proposition 10: Claim 6 is true if Claim 9 holds.

Proof: We must show that for any situation satisfying Claim 9, we can find y_1 ∈ C with |y_1| ≥ 1 and θ ∈ [0, π] such that P_1(y_1; θ) = P_2(y_1). For given P_2, excluding the cases covered by Proposition 8, take ỹ, θ̃ according to Claim 9.

If ỹ belongs to the part of ∂Γ°(y*) where |P_2(y)| = |P_1(y; θ̃)|, we have P_1(ỹ; θ̃) = P_2(ỹ) with |ỹ| ≥ 1 by construction. Hence the case is closed with y_1 = ỹ, θ = θ̃.

If |P_2(ỹ)| ≠ |P_1(ỹ; θ̃)|, then necessarily |P_2(ỹ)| < |P_1(ỹ; θ̃)| and |ỹ| = 1. Let ỹ = e^{iφ} and assume φ ∈ [0, π]; the case φ ∈ [−π, 0] is the same modulo a few notation changes. As for Prop. 8(b), we separately investigate (i) the phase and (ii) the modulus of P_1, P_2.

(i) By computing

(ỹ − e^{iθ})(ỹ − e^{−iθ}) ỹ^{M−2} e^{−i(M−1)φ} = 2 (cos(φ) − cos(θ)) ,

we see that e^{i∠(P_1(ỹ; θ))} = e^{i∠(P_1(ỹ; θ̃))} = e^{i∠(P_2(ỹ))} for all θ for which (cos(φ) − cos(θ)) takes the same sign as (cos(φ) − cos(θ̃)), i.e. for all θ ∈ [θ̃, φ) =: I_1 or all θ ∈ (φ, θ̃] =: I_1, depending on the ordering of θ̃ and φ.

(ii) f(θ) := |P_2(ỹ)| − |P_1(ỹ; θ)| is a continuous function of θ, with f(θ̃) < 0 and f(φ) > 0 by construction. Hence there exists θ_1 ∈ I_1 such that f(θ_1) = 0.

Combining (i) and (ii), we have P_1(y_1; θ) = P_2(y_1) with y_1 = ỹ and θ = θ_1.

3.2 Proving Claim 9

It remains to prove that Claim 9 is true, which in fact is a consequence of Cauchy’s argument principle in complex analysis. Adapted to the current case, the argument principle states the following fact.

Property 11 [Cauchy’s Argument Principle]: Let D ⊂ C be a bounded, simply connected open set whose boundary ∂D is the image of a closed simple curve γ : [0, 1] → C. Assume that D contains n roots of P_2(y) and p roots of P_1(y; θ) for some fixed θ, while ∂D contains no roots. Then there exists a continuous function f(t) : [0, 1] → R such that

exp( i ( ∠(P_1(γ(t); θ)) − ∠(P_2(γ(t))) ) ) = exp( i f(t) ) ,

and any such f(t) satisfies |f(1) − f(0)| = 2π |p − n|.
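A quick numerical illustration of Property 11 (our own sketch, not from the paper): for a toy pair of polynomials with known numbers of roots inside the unit disc, the accumulated phase of P_1/P_2 along the unit circle winds exactly |p − n| times; numpy's unwrap is used to track the continuous function f(t).

    import numpy as np

    p1 = np.poly1d([1.0, 0.0, -0.25])        # roots +-0.5, so p = 2 roots inside |y| = 1
    p2 = np.poly1d([1.0, 0.0, 0.0, -8.0])    # roots of y^3 = 8 have modulus 2, so n = 0 inside

    t = np.linspace(0.0, 1.0, 20001)
    gamma = np.exp(2j * np.pi * t)           # the unit circle as the closed curve gamma(t)
    f = np.unwrap(np.angle(p1(gamma) / p2(gamma)))
    print(abs(f[-1] - f[0]) / (2 * np.pi))   # close to |p - n| = 2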

We will apply this principle with ∂D defined by elements of ∂Γ_R(y*). Note that the boundary of any Γ_R(y*) is indeed sufficiently regular since the locus in the complex plane where |P_1|² − |P_2|² = 0 is by definition a planar algebraic curve. We will also use the following related property.

Property 12: Consider the setting of Property 11 with |p − n| =: m > 0. Then there are at least m points y = γ(t) on ∂D where e^{i∠(P_2(y))} = e^{i∠(P_1(y; θ))}.

Proof: The interval (f(0), f(0) + 2πm] ⊂ R contains m different multiples of 2π and the function f(t) must take each of these values at least once as it continuously evolves from f(0) to f(1) = f(0) + 2πm. The γ(t) associated to these values of f(t) satisfy e^{i∠(P_2(y))} = e^{i∠(P_1(y; θ))}.

The remainder of the proof relies on the following observations (see Fig. 2). For some fixed θ, let D_R(y*) be the smallest simply connected set containing Γ_R(y*), where the latter contains m ≥ 1 roots of P_2. Note that ∂D_R(y*) ⊆ ∂Γ_R(y*).

Case A: If D_R(y*) does not contain the open unit disc (see Fig. 2 left), then it contains at least m roots of P_2 but no root of P_1. Hence applying Properties 11 and 12 with D = D_R(y*), there must be m points on ∂D_R(y*) ⊆ ∂Γ_R(y*) where e^{i∠(P_2(y))} = e^{i∠(P_1(y; θ))}.

Case B: If D_R(y*) contains the open unit disc, then ∂Γ°(y*) contains a closed curve γ_0 that separates Γ_R(y*) from the open unit disc and we can define D_0(y*) ⊂ (D_R(y*) \ Γ_R(y*)) as the simply connected set whose boundary is γ_0 (see Fig. 2 right). This set contains the M roots of P_1 but at most M − m roots of P_2. Hence by Properties 11 and 12 there must be m points on γ_0 ⊂ ∂Γ°(y*) where e^{i∠(P_2(y))} = e^{i∠(P_1(y; θ))}.

Claim 9 is readily true for Case B. For Case A, we need to ensure that at least one of the m points in ∂Γ_R(y*) also belongs to ∂Γ°(y*). Towards this, we investigate e^{i∠(P_1(y; θ)) − i∠(P_2(y))} when |y| = R with R very large.

Figure 2: Schematic illustration of Case A (left) and Case B (right) to which we apply Properties 11 and 12. The sets and roots are represented similarly to Fig. 1. For Case A the set in red is D_R(y*). For Case B, the striped set is D_R(y*), the thick curve is γ_0 and the set enclosed by γ_0 is D_0(y*).

Lemma 13: If in P_2 the coefficient a_{M−1} > 0, then there exists R_1 > 0 such that e^{i∠(P_1(y; θ)) − i∠(P_2(y))} ≠ 1 for all y with |y| > R_1 and for all θ ∈ [0, π].

Proof: Note that e^{i∠(P_1(y; θ)) − i∠(P_2(y))} = exp( i ∠( P_1(y; θ) / P_2(y) ) ). Let us evaluate the phase of the ratio of two general polynomials P_g(y) = Σ_{k=0}^{M} g_k y^k and P_h(y) = Σ_{k=0}^{M} h_k y^k with real coefficients g_k, h_k and assuming g_M h_M < 0. Applying the formula

(a + bi)/(c + di) = (1/(c² + d²)) ( (ac + bd) + (bc − ad) i ) ,

with a, b, c, d real and i the imaginary unit, we get with y = R e^{iφ}:

P_g(R e^{iφ}) / P_h(R e^{iφ}) = ( Σ_{k=0}^{M} g_{M−k} R^{−k} (cos(kφ) − i sin(kφ)) ) / ( Σ_{k=0}^{M} h_{M−k} R^{−k} (cos(kφ) − i sin(kφ)) ) = η ( g_M h_M + O(1/R) + i O(1/R) )

for some η > 0. Thus there exists R_1 > 0 such that

Real( P_g(R e^{iφ}) / P_h(R e^{iφ}) ) = η ( g_M h_M + O(1/R) ) < 0

for all R > R_1. The Lemma follows by applying the above result with P_g(y) = P_1(y; θ) and P_h(y) = P_2(y).

We are now ready to prove Claim 9 in a few steps.

Proof of Claim 9 for a_{M−1} ≤ 0: From Prop. 8(b) we only need to consider the case a_{M−1} ∈ (−1, 0]. Then there exists R_1 > 1 such that |y|^M ≃ |P_1(y; θ)| > |P_2(y)| ≃ |a_{M−1}| |y|^M for all y for which |y| > R_1. Take any root y* of P_2 with |y*| > 1. Then for R > R_1, the points y with R_1 < |y| < R cannot belong to ∂Γ_R(y*), since this would require |P_1(y; θ)| = |P_2(y)|. This implies that ∂Γ_R(y*) contains either all points or no point of the circle C_R := {y ∈ C : |y| = R}.

In case ∂Γ_R(y*) ∩ C_R = ∅, we have ∂Γ_R(y*) = ∂Γ°(y*) and the observations after Proposition 12 allow to directly conclude. In case C_R ⊂ ∂Γ_R(y*), we are necessarily in Case B of the observations after Proposition 12, hence the conclusion is also immediate.

Proof of Claim 9 for a_{M−1} > 0: Take R > R_1 to satisfy Lemma 13. The set ∂Γ_R(y*) \ ∂Γ°(y*) is a subset of {y ∈ C : |y| = R}, and Lemma 13 implies e^{i∠(P_2(y))} ≠ e^{i∠(P_1(y; θ))} for all y for which |y| = R. Hence all the points identified in the observations after Proposition 12, where e^{i∠(P_2(y))} = e^{i∠(P_1(y; θ))} on ∂Γ_R(y*), must belong to ∂Γ°(y*).

With this we have covered all situations and hence concluded the proof of Claim 9, which proves Theorem 3.

4 Example

Consider a linear map A with nonzero eigenvalues λ_k ∈ [0.0122, 0.9878]. The optimal acceleration without added memory slot, i.e. with M = 1, yields µ = 0.9756 and a spectral gap 1 − µ = 0.0244. With M − 1 = 1 added memory slot, an improved convergence speed guarantee ν = 0.8000 is obtained, with 1 − ν = 0.2 > √0.0244 = 0.1562, using the optimal parameters α* = 3.2800 and β_1* = −0.6400.

If more is known about the eigenvalues of A then additional memory does help accelerate.

For instance consider A with nonzero eigenvalues λ_k ∈ Λ = [0.0122, 0.0182] ∪ {0.9878}, i.e. the largest eigenvalue is in fact isolated far away from the others. This does not allow to improve convergence speed with M = 2, because for M = 2 the bound ν ≥ 0.8000 holds as soon as A features both eigenvalues 0.0122 and 0.9878 (see the proof of Prop. 5).

In contrast, M = 4 and parameter values α = 3.6908, β_1 = −0.9083, β_2 = 0.006662, β_3 = 0.06785 do yield an improved convergence speed guarantee ν̃ = 0.7560 over all λ_k ∈ Λ (with ν̃ defined by straightforward extension of Def. 2). Note that these parameters have been obtained numerically by local search around α = α*, β_1 = β_1* and β_k = 0 for k > 1; we do not exclude the existence of better parameter values. Figure 3 shows |z*|, the corresponding largest root in modulus of (7), as a function of λ_k. It highlights how the improved |z*| for λ_k ∈ Λ comes to the detriment of (much) worse |z*| for λ_k ∈ [0.0291, 0.9788].
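The numbers quoted in this example can be checked with a short Python sketch (ours, not the author's code); the characteristic polynomial of (6) is rebuilt inline and the worst root modulus is taken over finite grids of the eigenvalue sets, which only approximates the exact worst case.

    import numpy as np

    def worst_root(alpha, betas, lams):
        """Largest root modulus of (7) over the eigenvalues in lams (coefficients from z^M down to z^0)."""
        worst = 0.0
        for lam in np.atleast_1d(lams):
            coeffs = [1.0, -(1.0 - alpha * lam - sum(betas))] + [-b for b in betas]
            worst = max(worst, np.abs(np.roots(coeffs)).max())
        return worst

    full = np.linspace(0.0122, 0.9878, 2001)                      # grid on [lam_min, lam_max]
    Lam = np.append(np.linspace(0.0122, 0.0182, 500), 0.9878)     # grid on the set Lambda

    print(worst_root(3.2800, [-0.6400], full))                    # about 0.800 (Theorem 3)
    print(worst_root(3.2800, [-0.6400], Lam))                     # still about 0.800
    print(worst_root(3.6908, [-0.9083, 0.006662, 0.06785], full)) # above 0.800, cf. Fig. 3
    print(worst_root(3.6908, [-0.9083, 0.006662, 0.06785], Lam))  # about 0.756, as reported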

Finally, let us illustrate the proof of Claim 9 and Prop. 10 with those values, i.e. M = 4 and α = 3.6908, β_1 = −0.9083, β_2 = 0.006662, β_3 = 0.06785 for [λ̲, λ̄] = [0.0122, 0.9878] and hence ν = 0.8000. For three values of θ ∈ {π/40, π/10, π/4}, Figure 4 shows in white the set in C where |P_1(y; θ)| < |P_2(y)| and in grey the set where |P_1(y; θ)| > |P_2(y)|. Small circles are the roots of P_1 and small crosses the roots of P_2. In the middle is the unit circle and the colored lines denote the locus where e^{i∠(P_1)} = e^{i∠(P_2)}. For all values of θ, we are in Case B of Section 3.2. Correspondingly, we have highlighted in black the curve γ_0 delimiting the simply connected set D_0(y*). For all θ, there are two points along γ_0 where e^{i∠(P_1)} = e^{i∠(P_2)}. For θ = π/40 the crossing occurs at some y_1 on the unit circle (red dot on Fig. 4), in a region where |P_1| > |P_2|. For e^{iθ} closer to y_1, e.g. θ = π/10, we still have e^{i∠(P_1(y_1; θ))} = e^{i∠(P_2(y_1))} but y_1 now belongs to a region where |P_2| > |P_1|. Hence somewhere in between there must be some θ̃ for which |P_1(y_1; θ̃)| = |P_2(y_1)|. This precisely corresponds to the point where the plot on Fig. 3 crosses |z*| = ν = 0.8000, and suffices to prove Theorem 3. We pursue this example towards further insight on bad or good θ. Moving e^{iθ} beyond y_1 finally changes ∠(P_1(y_1; θ)) by π and the points along the unit circle where e^{i∠(P_1)} = e^{i∠(P_2)} start moving; for a large interval of θ values, the points on γ_0 where e^{i∠(P_1)} = e^{i∠(P_2)} also satisfy |P_2| = |P_1|, as on the third graph, hence the polynomial is worse than the optimal one with M = 2. As θ approaches π a behavior symmetric to the neighborhood of θ = 0 takes place (not shown).

Figure 3: Largest root in modulus |z*| of (7) with M = 4, α = 3.6908, β_1 = −0.9083, β_2 = 0.006662, β_3 = 0.06785, as a function of λ_k ∈ [0.0122, 0.9878]. Reducing the relevant eigenvalue set to λ_k ∈ [0.0122, 0.0182] ∪ {0.9878} (vertical lines) yields an improved convergence speed guarantee ν̃ = 0.7560 (horizontal full line) with respect to the optimal situation ν = 0.8000 with M = 2 (horizontal dashed line). However this happens to the detriment of |z*| > 0.8000 for all λ_k ∈ [0.0291, 0.9788].

5 Conclusion

We have proved that the maximal achievable acceleration of a linear iterative map by adding M − 1 memory slots to each state vector component is already achieved with M − 1 = 1.

More precisely, this holds for the performance guarantee of a constant map about which we only know bounds [λ̲, λ̄] ⊂ (0, +∞) on the nonzero eigenvalues of the self-adjoint state update matrix A. The fact that the nonzero eigenvalues of A can take any values in [λ̲, λ̄] is important for the property to hold. Better accelerations can be devised if more is known about A, see e.g. Section 4.

Figure 4: Characteristics of P_1(y; θ) and P_2(y) for three values of θ ∈ {π/40, π/10, π/4}, illustrating the proof of Claim 9 and Prop. 10 on our example polynomial. See main text for details.

A direct extension of our result would allow each subsystem to follow general linear dynamics; thinking of the consensus application with rational input-output transfer function at each node, this would mean characteristic polynomials of the form

(z − 1) P(z) + αλ_k Q(z)

with P(z) and Q(z) two polynomials. The memory slot setting (5) is restricted to Q(z) = z^{M−1}, i.e. the consensus algorithm uses only the latest information coming from the network. It is not clear if convergence can speed up with other Q(z).

The basic consensus algorithm (1) has a proven robustness to network incidents². For the case (5) with optimally tuned memory slot, one can construct examples where packet drops lead to instability. Thus the benefit of more memory slots might have to be reevaluated towards robustness.

² Stability under switching networks holds: for any directed graphs, as long as I − αL has all components positive, thanks to the common Lyapunov function V = max_i x_i − min_i x_i [1]; for undirected graphs, as long as I − αL has all eigenvalues inside the unit circle, thanks to the common Lyapunov function Σ_i (x_i)². The conditions on I − αL always hold if we start from a stable nominal network and “packet drops” occur, i.e. failures can at times modify L by setting to zero the gain associated to some links.

Acknowledgment

This research has been carried out within the Interuniversity Attraction Poles network DYSCO.

The author wants to thank F. Ticozzi and S. Zampieri for stimulating discussions.

References

[1] J. N. Tsitsiklis, “Problems in decentralized decision making and computation,” PhD Thesis, MIT, 1984.

[2] S. Muthukrishnan, B. Ghosh, and M. H. Schultz, “First- and second-order diffusive methods for rapid, coarse, distributed load balancing,” Theory Comput. Systems, vol. 31, pp. 331–354, 1998.

[3] B. Johansson and M. Johansson, “Faster linear iterations for distributed averaging,” Proceedings of the 17th IFAC World Congress, pp. 2861–2866, 2008.

[4] E. Ghadimi, I. Shames, and M. Johansson, “Accelerated gradient methods for networked optimization,” arxiv, vol. 1211.2132, 2012.

[5] J. Liu, B. Anderson, M. Cao, and S. Morse, “Analysis of accelerated gossip algorithms,” Automatica, vol. 49, pp. 873–883, 2013.

[6] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k²),” Soviet Mathematics Doklady, vol. 27, no. 2, pp. 372–376, 1983.

[7] C. V. Hollot and A. C. Bartlett, “Some discrete-time counterparts to Kharitonov’s stability criterion for uncertain systems,” IEEE Trans. Automatic Control, vol. 31, no. 4, pp. 355–356, 1986.


[8] P. Vaidyanathan, “Derivation of new and existing discrete-time Kharitonov theorems based on discrete-time reactances,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 38, no. 2, pp. 277–285, 1990.

[9] B. R. Barmish, “A generalization of Kharitonov’s four-polynomial concept for robust stability problems with linearly dependent coefficient perturbations,” IEEE Trans. Automatic Control, vol. 34, no. 2, pp. 157–165, 1989.

[10] A. Olshevsky, “Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control,” http://arxiv.org/pdf/1411.4186v5.pdf, 2015.

[11] E. Montijano, J. Montijano, and C. Sagüés, “Chebyshev polynomials in distributed consensus applications,” IEEE Trans. Signal Proc., vol. 61, no. 3, pp. 693–706, 2013.

[12] E. Kokiopoulou and P. Frossard, “Polynomial filtering for fast convergence in distributed consensus,” arxiv, vol. 0802.3992, 2008.

[13] S. Sundaram and C. Hadjicostis, “Finite-time distributed consensus in graphs with time-invariant topologies,” Proc. American Control Conference, pp. 711–716, 2007.

[14] J. Hendrickx, R. Jungers, A. Olshevsky, and G. Vankeerberghen, “Graph diameter, eigenvalues, and minimum-time consensus,” Automatica, vol. 50, no. 2, pp. 635–640, 2014.

[15] L. Georgopoulos, “Definitive consensus for distributed data inference,” PhD thesis, EPFL, Lausanne, 2011.

[16] F. Chen, L. Lovasz, and I. Pak, “Lifting Markov chains to speed up mixing,” Proc. STOC’99, 1999.

[17] K. Jung, D. Shah, and J. Shin, “Distributed averaging via lifted Markov chains,” IEEE Trans. Information Theory, vol. 56, no. 1, pp. 634–647, 2010.

[18] P. Barooah, P. Mehta, and J. Hespanha, “Mistuning-based control design to improve closed-loop stability margin of vehicular platoons,” IEEE Trans. Automatic Control, vol. 54, no. 9, pp. 2100–2113, 2009.
