HAL Id: hal-03238781

https://hal.archives-ouvertes.fr/hal-03238781

Submitted on 31 May 2021


Distributed under a Creative Commons Attribution 4.0 International License

Some Remarks on Replicated Simulated Annealing

Vincent Gripon, Matthias Löwe, Franck Vermet

To cite this version:

Vincent Gripon, Matthias Löwe, Franck Vermet. Some Remarks on Replicated Simulated Annealing. Journal of Statistical Physics, Springer Verlag, 2021, 182 (3), ⟨10.1007/s10955-021-02727-z⟩. ⟨hal-03238781⟩

https://doi.org/10.1007/s10955-021-02727-z

Some Remarks on Replicated Simulated Annealing

Vincent Gripon¹ · Matthias Löwe² · Franck Vermet³

Received: 7 October 2020 / Accepted: 13 February 2021 / Published online: 2 March 2021

© The Author(s) 2021

Abstract

Recently, authors have introduced the idea of training discrete weights neural networks using a mix between classical simulated annealing and a replica ansatz known from the statistical physics literature. Among other points, they claim their method is able to find robust configurations. In this paper, we analyze this so-called "replicated simulated annealing" algorithm. In particular, we give criteria to guarantee its convergence, and study when it successfully samples from the desired configurations. We also perform experiments using synthetic and real databases.

Keywords Simulated annealing · Replicas · Perceptron · Neural networks · Optimization

Mathematics Subject Classification 60J10 · 82C32 · 65C05

1 Introduction

In the past few years, there has been a growing interest in finding methods to train discrete weights neural networks. As a matter of fact, when it comes to implementations,

Communicated by Hal Tasaki.

The research of the second author was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2044-390685587, Mathematics Münster: Dynamics–Geometry–Structure.

Matthias Löwe (corresponding author)
maloewe@math.uni-muenster.de

Vincent Gripon
vincent.gripon@imt-atlantique.fr

Franck Vermet
Franck.Vermet@univ-brest.fr

1 IMT Atlantique, UMR CNRS Lab-STICC, Technopole Brest Iroise, 29238 Brest, France
2 Fachbereich Mathematik und Informatik, Universität Münster, Einsteinstraße 62, 48149 Münster, Germany
3 Laboratoire de Mathématiques, UMR CNRS 6205, Université de Bretagne Occidentale, 6 Avenue Victor Le Gorgeu, CS 93837, 29238 Brest Cedex 3, France

discrete weights allow one to reach better efficiency, as they considerably simplify the multiply-accumulate operations, with the extreme case where weights become binary and there is no need to perform any multiplication anymore. Unfortunately, training discrete weights neural networks is complex in practice, since it basically boils down to an NP-hard optimization problem. To circumvent this difficulty, many works have introduced techniques that aim at finding reasonable approximations [6,7,13,24].

Among these works, in a recent paper Baldassi et al. [2] discuss the learning process in artificial neural networks with discrete weights and try to explain why these networks work so efficiently. Their approach is based on an analysis of the learning procedure in artificial neural networks. In this process a huge number of connection weights are adjusted using some stochastic optimization algorithm for a given target function. Some of the resulting optima of the target or energy function have better computational performance and generalization properties, others worse. The authors in [2] propose that the better and more robust configurations of weights lie in dense regions with many maxima or minima (depending on the sign) of the target function, while the optimal configurations that are isolated, i.e. far away from the next optimum of the energy function, have poor computational performance.

They propose a new measure, called the robust ensemble, that suppresses such configurations with bad computational performance. On the other hand, the robust ensemble amplifies the dense regions with many good configurations. In [2] the authors present various algorithms to sample from this robust ensemble, one of them being Replicated Simulated Annealing or Simulated Annealing with Scoping. This algorithm combines the replica approach from statistical physics with the simulated annealing algorithm that is supposed to find the minima (or maxima) of a target function. The replica technique is used to regularize the highly non-convex target or energy function (another regularization idea was introduced recently in [4]), while the simulated annealing algorithm is used afterwards to minimize this new energy. We will define Replicated Simulated Annealing in Sect. 2.

To give a first impression of this algorithm, assume we have Σ := {−1,+1}^N as our state space and N is large. On Σ we have a very rugged energy function E : Σ → ℝ, and assume that min_{σ∈Σ} E(σ) = 0. To find the minima of E one could run a Metropolis algorithm for the Gibbs measure at inverse temperature β > 0:

π_β(σ) := exp(−βE(σ)) / Z_β,   σ ∈ Σ.

Here Z_β := Σ_{σ∈Σ} exp(−βE(σ)) is the partition function of the model, a normalizing factor that makes π_β a probability measure. If one carefully lowers the temperature, i.e. if one increases β slowly enough during the process, the corresponding Markov chain will get stuck in one of the maxima of π_β, which are easily seen to be the minima of E. This is the classical Simulated Annealing algorithm, cf. [25] or [15] for the seminal papers. The question of how to choose the optimal dependence of β on time t and its convergence properties have been extensively discussed. We just mention [5,8,18,20,22] for a short and by far not complete list of references. The upshot is that a good "cooling schedule" is of the form β_t = (1/m) log(1+Ct), where m can be roughly described as the largest hill to be climbed to get from an arbitrary state to one of the global minima of the target function.
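To make this procedure concrete, here is a minimal sketch (in Python; not code from the paper) of such a Metropolis chain with the logarithmic cooling schedule above. The energy callable E, the constants m and C, and the number of steps T are placeholders the user has to supply.

```python
import numpy as np

def simulated_annealing(E, N, m, C, T, rng=None):
    """Metropolis-based simulated annealing on {-1,+1}^N.

    E : callable mapping a +-1 vector to its energy
    m : rough estimate of the largest hill to climb (sets the schedule)
    C : schedule constant in beta_t = (1/m) * log(1 + C*t)
    T : number of single-spin-flip steps
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.choice([-1, 1], size=N)          # random initial configuration
    energy = E(sigma)
    for t in range(1, T + 1):
        beta = np.log(1.0 + C * t) / m           # logarithmic cooling schedule
        i = rng.integers(N)                      # propose a single spin flip
        sigma[i] *= -1
        new_energy = E(sigma)
        # Metropolis rule: accept with prob min(1, exp(-beta * (E_new - E_old)))
        if new_energy <= energy or rng.random() < np.exp(-beta * (new_energy - energy)):
            energy = new_energy
        else:
            sigma[i] *= -1                       # reject: undo the flip
    return sigma, energy
```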

However, sometimes not all the global minima are equally important; in particular, one may be interested in regions with many global minima (or almost global minima), so-called dense regions. An obstacle may be that E exhibits many global minima, but only relatively few of them are in dense regions. Let us motivate this question by a central example which we will often have in mind in this context and which is also one of the central objects in [2].

Example 1 Assume we have patterns ξ^1, ..., ξ^M, ξ^μ ∈ {±1}^N, μ = 1, ..., M, where M = αN for some α > 0. Each of these patterns belongs to one of two groups, which we indicate by ϑ(ξ^μ) = ϑ^μ ∈ {±1}.

The task is to classify these patterns. One of the standard methods to do this in machine learning is the perceptron, introduced by Rosenblatt [34] in the late 1950s. It is one of the first mathematical models of neural networks, inspired by the biological phenomenon of vision. In its simplest form, the one we are studying here, it is a single neuron, corresponding to the mathematical model of McCulloch and Pitts [30]. It is then a binary classifier, which separates two sets of points that are linearly separable by a hyperplane. In mathematical terms it maps its input x to f(x), which is either 1 or 0, and thus puts it into one of two classes. This decision is made with the help of a vector of weights W = (W_1, ..., W_N) ∈ {−1,+1}^N. More precisely,

f(x) = 1 if ⟨W,x⟩ > 0,  and  f(x) = 0 otherwise,

where ⟨W,x⟩ = Σ_i W_i x_i is the dot product. These weights W = (W_1, ..., W_N) ∈ {−1,+1}^N have to be learned, and we want the classification to be perfect, i.e. we want that

Θ(ϑ^μ ⟨W, ξ^μ⟩) := Θ(ϑ^μ Σ_{i=1}^N W_i ξ_i^μ) = 1   for all μ = 1, ..., M.

Here

Θ(x) = 1 if x > 0,  and  Θ(x) = 0 otherwise

denotes the Heaviside function. Hence our classification task is fulfilled if

Σ_{μ=1}^M Θ(−ϑ^μ ⟨W, ξ^μ⟩) = 0

(where we assume that N is odd to avoid the specification of tie-breaking rules) or, equivalently,

Π_{μ=1}^M Θ(ϑ^μ ⟨W, ξ^μ⟩) = 1.    (1)

Note that in Rosenblatt's initial model the weights (W_1, ..., W_N), called synaptic weights, are real-valued and not restricted to take their values in {−1,+1}^N. The objective now is to find weights W such that (1) is true, i.e. we are searching for weights W such that

E(W) := −Σ_{μ=1}^M Θ(ϑ^μ ⟨W, ξ^μ⟩)

is minimal. Obviously, this optimization problem is of the above mentioned form. However, one prefers weights W in so called dense regions, i.e. weights that are surrounded by weights W̃ that are also minima of E. The idea is that these states have good generalization properties, or a small generalization error. This means that we want to find weights that still correctly classify input patterns which we have not seen in our training set ξ^1, ..., ξ^M. It is at least plausible that weights with a small generalization error lie in dense regions of {±1}^N.
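As an illustration of this energy (our own sketch, not taken from the paper), the following code evaluates E(W) for binary weights on synthetic ±1 patterns; the pattern generation, the sizes and the variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 101, 0.3                       # N odd to avoid ties
M = int(alpha * N)

xi = rng.choice([-1, 1], size=(M, N))     # patterns xi^mu
theta = rng.choice([-1, 1], size=M)       # labels  vartheta^mu

def energy(W, xi, theta):
    """E(W) = - sum_mu Theta(vartheta^mu <W, xi^mu>): minus the number of
    correctly classified patterns; its minimum -M means perfect classification."""
    margins = theta * (xi @ W)            # vartheta^mu <W, xi^mu>, one entry per pattern
    return -np.sum(margins > 0)

W = rng.choice([-1, 1], size=N)           # random binary weight vector
print(energy(W, xi, theta))               # typically around -M/2 for random W
```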

In this work we are interested in making explicit the convergence properties of the algorithm of Replicated Simulated Annealing. We also perform experiments using synthetic and real databases. The outline is as follows: in Sect. 2 we mathematically formalize and describe the algorithm of Replicated Simulated Annealing, and in Sect. 3 we study its convergence properties.

Naturally, this convergence will be studied on an infinite time horizon. This is a slightly different set-up than in [2], where experiments are performed for finite time. However, the question whether or not Replicated Simulated Annealing converges is the first question that should be analyzed, before studying which choice of parameters yields the best results. This latter question is addressed in Sect. 4, where we perform experiments using synthetic and real databases. Of course, the time horizon for such experiments is finite. On the other hand, in finite time we can control the influence of the choice of the parameters on the performance, which is hard to control theoretically. Finally, Sect. 5 is a conclusion.

2 Replicated Simulated Annealing

Recall that we are searching for the minima of a function E : Σ → ℝ, where Σ := {−1,+1}^N. To find minima of E in dense regions of the state space the authors in [2] propose a new measure given by

P_{y,β,γ}(σ) := exp(y Φ_{β,γ}(σ)) / Z(y,β,γ)    (2)

where

Z(y,β,γ) := Σ_{σ∈Σ} exp(y Φ_{β,γ}(σ)).    (3)

P_{y,β,γ} has, at least formally, the structure of a Gibbs measure at inverse temperature y. Its "energy function" is given by

Φ_{β,γ}(σ) := log Σ_{σ'∈Σ} exp(−βE(σ') − γ d(σ,σ')),    (4)

where d(·,·) is some monotonically increasing function of a distance on Σ. This distance will be chosen below, but there are not too many reasonable, essentially different distance functions on Σ anyway.

Since Φ_{β,γ} weights each configuration σ' by a function of its distance to σ and the σ' again by an exponential of their energy, it is, indeed, plausible that Φ_{β,γ} is much smoother than E and that its maxima, and hence the most likely configurations under P_{y,β,γ}, lie in dense regions. We will come back to this question in the next section.

However, a serious problem is how one could simulate from the measure P_{y,β,γ}. Indeed, computing the "energy" Φ_{β,γ}(σ) of a single configuration σ involves, among others, computing E(σ') for all σ' ∈ Σ. Computing these values is almost as hard as finding the minima of E (even though one might not be immediately able to tell which of these minima are in dense regions). To find a promising algorithm that does not rely on computing all the values of E(σ'), Baldassi et al. [2] propose the following:

First of all, assume that y ≥ 2 is an integer. Second, take as function of the distance between two spin configurations σ and σ' the (negative) inner product: d(σ,σ') = −⟨σ,σ'⟩. As a matter of fact, this is a natural choice, since two natural distance functions, the Hamming distance and the square of the Euclidean distance, are functions of the inner product:

Σ_i (σ_i − σ'_i)² = 2N − 2⟨σ,σ'⟩   as well as   d_H(σ,σ') = (N − ⟨σ,σ'⟩)/2,

and the N-dependent terms cancel, because they also occur

in Z(y,β,γ). Using the fact that y is an integer, we can now compute the partition function Z(y,β,γ) of this model (replacing the summation variable σ in (3) by σ*):

Z(y,β,γ) = Σ_{σ*∈Σ} exp(y Φ_{β,γ}(σ*)) = Σ_{σ*∈Σ} exp(Σ_{a=1}^y Φ_{β,γ}(σ*))

         = Σ_{σ*∈Σ} Π_{a=1}^y Σ_{σ^a∈Σ} exp(−βE(σ^a) − γ d(σ*,σ^a))

         = Σ_{σ*∈Σ} Σ_{σ^1∈Σ} ... Σ_{σ^y∈Σ} exp(−βE(σ^1) − γ d(σ*,σ^1)) ··· exp(−βE(σ^y) − γ d(σ*,σ^y))

         = Σ_{σ*∈Σ} Σ_{{σ^a}∈Σ^y} exp(−β Σ_{a=1}^y E(σ^a) − γ Σ_{a=1}^y d(σ*,σ^a)).

Here Σ_{{σ^a}} is the sum over all σ^1, ..., σ^y. Hence Z(y,β,γ) can be considered as a partition function on the space of all (σ*,{σ^a}) ∈ Σ^{y+1} of the measure

Q((σ*,{σ^a})) := exp(−β Σ_{a=1}^y E(σ^a) − γ Σ_{a=1}^y d(σ*,σ^a)) / Z(y,β,γ).    (5)

Its marginal with respect to the second coordinate {σ^a} is given by

Q({σ^a}) := Σ_{σ*∈Σ} exp(−β Σ_{a=1}^y E(σ^a) − γ Σ_{a=1}^y d(σ*,σ^a)) / Z(y,β,γ).    (6)

Making use of our choice d(σ,σ') = −⟨σ,σ'⟩ we obtain for the numerator in Q:

Z(y,β,γ) Q({σ^a}) = Σ_{σ*} exp( −β Σ_{a=1}^y E(σ^a) + γ Σ_{a=1}^y ⟨σ*,σ^a⟩ )

                  = 2^N · (1/2^N) Σ_{σ*} exp( −β Σ_{a=1}^y E(σ^a) + γ Σ_{i=1}^N σ*_i Σ_{a=1}^y σ_i^a )

                  = 2^N exp( −β Σ_{a=1}^y E(σ^a) + Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ_i^a) ).

Putting the 2^N into the normalizing constant we thus obtain that

Q({σ^a}) = exp( −β Σ_{a=1}^y E(σ^a) + Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ_i^a) ) / Z(y,β,γ).

This form of the measure Q is now accessible to a Simulated Annealing algorithm: being in {σ^a} one picks one of the σ^a at random and one coordinate σ_i^a of σ^a at random and flips it to become −σ_i^a. This new configuration is then accepted with the usual Simulated Annealing probabilities.
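To make such a move concrete, here is a small sketch of one Metropolis step for Q (our own illustration, not code from [2]). It works with the unnormalized log-weight −β Σ_a E(σ^a) + Σ_i log cosh(γ Σ_a σ_i^a); the energy callable E and all parameter values are placeholders.

```python
import numpy as np

def replicated_sa_step(sigmas, energies, E, beta, gamma, rng):
    """One Metropolis step for the replicated measure Q({sigma^a}).

    sigmas   : (y, N) array with entries in {-1, +1}, one row per replica
    energies : length-y array caching E(sigma^a) per replica (updated in place)
    """
    y, N = sigmas.shape
    a = rng.integers(y)                      # pick a replica ...
    i = rng.integers(N)                      # ... and one of its coordinates
    s_old = sigmas[:, i].sum()               # sum_a sigma_i^a before the flip
    sigmas[a, i] *= -1                       # propose the flip
    s_new = s_old + 2 * sigmas[a, i]         # column sum after the flip
    new_energy = E(sigmas[a])
    # log Q(new) - log Q(old) = -beta*(E_new - E_old) + log cosh(gamma*s_new) - log cosh(gamma*s_old)
    log_ratio = (-beta * (new_energy - energies[a])
                 + np.log(np.cosh(gamma * s_new)) - np.log(np.cosh(gamma * s_old)))
    if log_ratio >= 0 or rng.random() < np.exp(log_ratio):
        energies[a] = new_energy             # accept
    else:
        sigmas[a, i] *= -1                   # reject: undo the flip
```

For large γ one would replace log(cosh(x)) by |x| + log1p(exp(−2|x|)) − log 2 to avoid overflow; the sketch keeps the direct form for readability.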

Example 2 (Example 1 continued) In our perceptron example we so far proposed the energy function

E(W) = −Σ_{μ=1}^M Θ(ϑ^μ ⟨W, ξ^μ⟩) = −Σ_{μ=1}^{αN} Θ(ϑ^μ ⟨W, ξ^μ⟩).

This function, however, may be a bit unwieldy when using Simulated Annealing, since it just tells how many patterns have been classified correctly, but not whether we are moving in a "good" or a "bad" direction when the proposed configuration {W̃^a} has the same energy as the old configuration {W^a}. We therefore propose (as e.g. [2]) to use the energy function

E(W) = Σ_{μ=1}^{αN} E_μ(W)   with   E_μ(W) = R(−ϑ^μ ⟨W, ξ^μ⟩)

instead. Here R(x) = ((x+1)/2) Θ(x), and we again assume that N is odd; otherwise we would need to take R(x) = (x/2) Θ(x). In other words, E_μ is the number of bits that we need to change in order to classify ξ^μ correctly.
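A direct transcription of this smoother energy (again our own illustrative sketch, assuming N odd as above and reusing the synthetic xi and theta from the earlier example) could look as follows:

```python
import numpy as np

def smooth_energy(W, xi, theta):
    """E(W) = sum_mu R(-theta^mu <W, xi^mu>) with R(x) = ((x+1)/2) * Theta(x):
    for each misclassified pattern, the number of weight bits one has to flip
    to classify it correctly (N odd, so all margins are odd integers)."""
    margins = -theta * (xi @ W)            # x = -theta^mu <W, xi^mu>, positive iff misclassified
    return np.sum((margins + 1) // 2 * (margins > 0))
```

With the synthetic xi, theta and W from the earlier sketch, smooth_energy(W, xi, theta) equals 0 exactly when all patterns are classified correctly.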

3 Convergence of the Annealing Process

In this section we want to discuss the convergence properties of the annealing procedure introduced above. The two major questions are: Does the process converge to an invariant measure, and if so, does this measure have the desired property of favoring dense regions?

These questions are not addressed in [2]. However, we feel that they are the first problems that need to be analyzed. Indeed, if the process does not converge to the desired distribution, the question is rather when to stop it than what is the optimal choice of parameters.

We will distinguish two cases: the first is when γ in the definition of the measures Q in (5) and Q in (6) does not depend on time, while the second is when it does.

Before analyzing these two cases, we will slightly modify the annealing procedure to make it accessible to the best results that are available for discrete time, see [1]. As a matter of fact, we find discrete time slightly more appropriate for computer simulations than the continuous time set-up in e.g. [22], [18], or [8]. To this end, we will study cooling schedules where the inverse temperature β_n is fixed for T_n consecutive steps of the annealing process.

Denote by ν_n the distribution of the annealing process X_n at time

L_n := T_1 + ... + T_n.

Note that ν_n can be computed recursively: if S_β denotes the transition matrix of the Metropolis–Hastings chain (see [19] or [21]) at inverse temperature β (see (7) below), then

ν_n = ν_{n−1} S_{β_n}^{T_n},

where ν_0 is a fixed probability measure on Σ^y. Here, of course, S_{β_n}^{T_n} is the T_n-th power of the transition matrix S_{β_n} (which is constant for the last T_n steps, as described above).

3.1 Fixed γ

If γ is fixed, it is convenient to split the Simulated Annealing algorithm introduced above into a γ-dependent part and a β-dependent part. To this end, let us introduce the following probability measure μ_0 on Σ^y:

μ_0({σ^a}) := exp( Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ_i^a) ) / Γ

with

Γ := Σ_{{σ̃^a}∈Σ^y} exp( Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ̃_i^a) ).

Next define a transition matrix Π on Σ^y. Π will only allow transitions from {σ^a} to {σ̃^a} if there are exactly one a ∈ {1,...,y} and one i ∈ {1,...,N} such that σ_i^a = −σ̃_i^a, and for all other b and j we have σ_j^b = σ̃_j^b. In this case, we define

Π_γ({σ^a},{σ̃^a}) := Π({σ^a},{σ̃^a}) · min( 1, cosh(γ Σ_{a=1}^y σ̃_i^a) / cosh(γ Σ_{a=1}^y σ_i^a) ).

For all other configurations {σ̂^a} ≠ {σ^a} we have Π_γ({σ^a},{σ̂^a}) = 0, and we set

Π_γ({σ^a},{σ^a}) := 1 − Σ_{{σ̃^a}≠{σ^a}} Π_γ({σ^a},{σ̃^a}).

Note that Π_γ is nothing but the Metropolis–Hastings algorithm for the measure μ_0 (see [21]). In particular, Π_γ is reversible with respect to the measure μ_0, i.e.

μ_0({σ^a}) Π_γ({σ^a},{σ̃^a}) = μ_0({σ̃^a}) Π_γ({σ̃^a},{σ^a}).

Now consider the Metropolis–Hastings chain on Σ^y with proposal chain Π_γ and transition probabilities

S_β({σ^a},{σ̃^a}) := exp( −β ( Σ_{a=1}^y E(σ̃^a) − Σ_{a=1}^y E(σ^a) )_+ ) Π_γ({σ^a},{σ̃^a})   if {σ^a} ≠ {σ̃^a},

S_β({σ^a},{σ^a}) := 1 − Σ_{{σ̂^a}≠{σ^a}} exp( −β ( Σ_{a=1}^y E(σ̂^a) − Σ_{a=1}^y E(σ^a) )_+ ) Π_γ({σ^a},{σ̂^a}).    (7)

Here (x)_+ := max{x,0}. For an appropriate normalizing constant Γ̂ this chain has as its invariant measure

exp( −β Σ_{a=1}^y E(σ^a) ) μ_0({σ^a}) / Γ̂ = Q({σ^a}) =: Q_β({σ^a}).    (8)

So indeed, for each fixed β > 0, γ > 0, we have found a Metropolis chain for Q.
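To make the splitting into a γ-dependent proposal Π_γ and a β-dependent acceptance explicit, here is a sketch of one step of S_β (our own illustration; E, beta and gamma are placeholders, and the state is stored as in the earlier replica sketch):

```python
import numpy as np

def s_beta_step(sigmas, energies, E, beta, gamma, rng):
    """One step of the chain S_beta: propose with Pi_gamma (Metropolis for mu_0),
    then accept or reject with the beta-dependent energy difference."""
    y, N = sigmas.shape
    a, i = rng.integers(y), rng.integers(N)
    s_old = sigmas[:, i].sum()
    s_new = s_old - 2 * sigmas[a, i]
    # stage 1: Pi_gamma accepts the flip with prob min(1, cosh(gamma*s_new)/cosh(gamma*s_old))
    # (for large gamma one would compare log-cosh values instead, to avoid overflow)
    if rng.random() >= min(1.0, np.cosh(gamma * s_new) / np.cosh(gamma * s_old)):
        return                                   # proposal rejected by Pi_gamma: stay put
    # stage 2: Metropolis acceptance for the energy part, exp(-beta * (E_new - E_old)_+)
    old_energy = energies[a]
    sigmas[a, i] *= -1
    new_energy = E(sigmas[a])
    if new_energy <= old_energy or rng.random() < np.exp(-beta * (new_energy - old_energy)):
        energies[a] = new_energy                 # accepted
    else:
        sigmas[a, i] *= -1                       # rejected: undo the flip
```

The two stages together give exactly the transition probability Π_γ({σ^a},{σ̃^a}) · exp(−β(ΔE)_+) of (7), with all rejections accumulating in the holding probability.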

If we now let β = β_n depend on n in the form described at the beginning of the section, we arrive at a Simulated Annealing algorithm with piecewise constant temperature.

We will quickly introduce some of Azencott's notation [1]. The invariant measure of S_{β_n} is Q_{β_n}. Recall that we assumed that min_{σ∈Σ} E(σ) = 0 and define

B := min_{σ: E(σ)≠0} E(σ).    (9)

Next we bound

||Q_{β_n} − Q_{β_{n+1}}|| ≤ κ_1 exp(−β_n B)

for some constant κ_1. Indeed, such an estimate is true for any difference of Gibbs measures with respect to the same energy function. To see this, let

ρ_{β_n}(x) := exp(−β_n U(x)) / Z_n

be a sequence of Gibbs measures with respect to an energy function U on a discrete space of size K. We assume min_x U(x) = 0 (without loss of generality, otherwise we subtract the minimum from U) and min_{x: U(x)≠0} U(x) = B. Let k := |{x : U(x) = 0}|. Then, for any x with U(x) = 0 we simply have

|ρ_{β_n}(x) − ρ_{β_{n+1}}(x)| = |1/Z_n − 1/Z_{n+1}| = | Σ_{y: U(y)≠0} ( exp(−β_n U(y)) − exp(−β_{n+1} U(y)) ) | / (Z_n Z_{n+1}) ≤ 2(K−k) e^{−β_n B},

since β_{n+1} ≥ β_n and Z_n ≥ 1 for all n. Otherwise, if U(x) > 0, we trivially can compute

|ρ_{β_n}(x) − ρ_{β_{n+1}}(x)| ≤ e^{−β_n U(x)} + e^{−β_{n+1} U(x)} ≤ 2 e^{−β_n B}.

To describe the spectral gap of S_{β_n}, for any two {σ^a},{τ^a} ∈ Σ^y let P({σ^a},{τ^a}) be the set of all paths in Σ^y from {σ^a} to {τ^a}. For p ∈ P({σ^a},{τ^a}) with vertices {ν^{a,r}}_{r=0}^M define

Elev(p) := max_{r=0,...,M} Σ_{a=1}^y E(ν^{a,r}).

Moreover define

H({σ^a},{τ^a}) := min_{p∈P({σ^a},{τ^a})} Elev(p)

and

m := max_{{σ^a},{τ^a}} ( H({σ^a},{τ^a}) − Σ_{a=1}^y E(σ^a) − Σ_{a=1}^y E(τ^a) ).    (10)

The quantity m is related to the optimal cooling schedule for Simulated Annealing as well as to the spectral gap of the associated Metropolis–Hastings algorithm S_β. To understand this, define the operator L_β for {σ^a} ∈ Σ^y and f : Σ^y → ℝ by

L_β f({σ^a}) := Σ_{{τ^a}} ( f({τ^a}) − f({σ^a}) ) S_β({σ^a},{τ^a}).

Let E_β be the associated Dirichlet form, i.e. for functions f,g : Σ^y → ℝ

E_β(f,g) := −∫ f L_β g dQ_β
          = (1/(2Γ̂)) Σ_{{σ^a},{τ^a}} ( f({σ^a}) − f({τ^a}) )( g({σ^a}) − g({τ^a}) ) exp( −β ( Σ_a E(σ^a) ∨ Σ_a E(τ^a) ) ) μ_0({σ^a}) Π_γ({σ^a},{τ^a}).

Then with

ψ(β) := inf{ E_β(f,f) : ||f||_{L²(Q_β)} = 1 and ∫ f dQ_β = 0 }

we have

Proposition 1 There are constants c > 0 and C < ∞ such that for all β ≥ 0,

c e^{−βm} ≤ ψ(β) ≤ C e^{−βm}.

Proof See [22, Theorem 2.1].

But we also have that

ψ(β) = 1 − λ_1(β),

where λ_1(β) is the second largest eigenvalue of S_β, cf. [23, p. 176] or [9, (1.2)]. This establishes the relation of m to the spectral gap of S_β.

Introduce

ε_n := ||Q_{β_n} − ν_n||.

Then

Theorem 1 ε_n converges to 0 if

lim_{n→∞} ( −C Σ_{k=1}^n T_k exp(−β_k m) + n log κ_1 ) = −∞.    (11)

In particular, we need that Σ_{k=1}^n T_k exp(−β_k m) → ∞. In this case ν_n has the same limit as Q_{β_n}, and this is given by a distribution Q_∞ on

N_0 := { {σ^a} : Σ_{a=1}^y E(σ^a) = 0 }    (12)

such that

Q_∞({σ^a}) ∝ μ_0({σ^a})  if {σ^a} ∈ N_0,  and  Q_∞({σ^a}) = 0  otherwise

(here ∝ denotes proportionality). Of course, Q_∞ is normalized in such a way that it is a probability measure on N_0.

Proof The convergence part is basically the content of [1, Sect. 7]. Note that the computations there are done for a proposal chain Π that has the uniform measure as its invariant distribution.

However, the proof on p. 231 of [1] carries over verbatim to our situation.

After that it is an easy matter to check that Q_{β_n} has a limiting distribution Q_∞ and that Q_∞ charges every point in N_0 with a probability proportional to μ_0({σ^a}).

A choice for T_k for which (11) holds is given by

T_k := ( e^{mβ_k} / C ) ( log κ_1 + C b )

for a constant b > 0, m as given in (10), and C as given in Proposition 1.

As Azencott [1] points out, in this case

β_n ≈ (α/B) log n + (b/B) n,   for some α > 1 and B defined as in (9),

T_n ≈ n^{αm/B} exp( (bm/B) n ),

and T_n and L_n have the same order of magnitude, i.e. the algorithm spends most of the time in the lowest temperature band.

One also sees the logarithmic relation between β and T, i.e.

β_n ∼ (1/m) log(L_n) ∼ (1/m) log(T_n).
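For readers who want to experiment with such piecewise-constant schedules, a small helper along the lines of the formulas above might look as follows; all numerical constants are illustrative placeholders, not values prescribed by the paper.

```python
import numpy as np

def azencott_schedule(n_stages, m, B, alpha=1.5, b=0.1, C=1.0, kappa1=2.0):
    """Piecewise-constant schedule: stage n runs T_n steps at inverse temperature beta_n.

    Uses beta_n ~ (alpha/B)*log n + (b/B)*n and T_n = exp(m*beta_n)*(log(kappa1) + C*b)/C
    as discussed above; every constant here is an illustrative placeholder.
    """
    ns = np.arange(1, n_stages + 1)
    betas = (alpha / B) * np.log(ns) + (b / B) * ns
    Ts = np.ceil(np.exp(m * betas) * (np.log(kappa1) + C * b) / C).astype(int)
    return betas, Ts

betas, Ts = azencott_schedule(10, m=2.0, B=1.0)
print(list(zip(np.round(betas, 2), Ts)))   # stage lengths grow roughly exponentially
```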

We now turn to the question whether this algorithm achieves that typical samples from it have realizations in dense regions of Σ. First of all, this needs to be defined:

Definition 1 Let σ ∈ Σ with E(σ) = 0 and let R > 0 and k ∈ ℕ. The discrete ball B_R(σ) ⊆ Σ with radius R, centered in σ, is called (R,k)-dense with respect to E if there are exactly k global minima τ of E in B_R(σ). (Without loss of generality, all balls considered here and henceforth are Hamming balls.)

σ is called R-isolated if σ is the only global minimum of E in B_R(σ).

The authors in [2] are not very explicit about a definition of ”dense regions” and the situation where Replicated Simulated Annealing should be applied. However, from their examples, they seem to have in mind a situation close to the following caricature:

Situation 1 Given 1 < a < b < 2 and α_N → ∞, δ_N → ∞ with

lim_{N→∞} α_N/δ_N = 0   as well as   lim_{N→∞} δ_N/N = 0,

we say that a sequence of energy functions E_N on Σ_N := Σ = {−1,+1}^N is (a,b,α_N,δ_N)-regular if it has b^N global minima, if there exists σ* ∈ Σ_N such that B_{α_N}(σ*) is (α_N, a^N)-dense, and if all the other b^N − a^N minima are δ_N-isolated.

It is now rather obvious that Q_∞ prefers such dense regions:

Proposition 2 Assume we are in the situation described in Situation 1, i.e. we have a sequence of energy functions that is (a,b,α_N,δ_N)-regular. Then, given ε > 0, for any admissible choice of these parameters, there exist y, N_0 and γ such that

Q_∞( Π_{a=1}^y B_{α_N}(σ*) ) := Q_∞( { (σ^1,...,σ^y) : σ^a ∈ B_{α_N}(σ*) for all a } ) ≥ 1 − ε

for all N ≥ N_0.

Proof Note that Q_∞ has its mass concentrated on the set N_0 (given by equation (12)), and the differences in the mass for the various configurations from this set stem from the factor

μ_0({σ^a}) = exp( Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ_i^a) ) / Γ.

Let us just consider the numerators of these weights.

Let σ be a δ_N-isolated minimum of E_N. If all σ^1,...,σ^y are located in σ, then the numerator of μ_0({σ^a}) equals exp(N log cosh(γy)). Otherwise there is at least one σ^a that is different from σ, say in a global minimum τ of E_N. By assumption d_H(σ,τ) ≥ δ_N. Thus a configuration that has at least one σ^a = τ has a weight at most

exp( (N − δ_N) log cosh(γy) + δ_N log cosh((y−2)γ) ).

Now there are b^N − a^N δ_N-isolated minima. Hence the sum of the numerators of the probabilities of these isolated minima can be bounded from above by

exp(N log cosh(γy)) b^N ( 1 + b^{Ny} e^{(N−δ_N) log cosh(γy) + δ_N log cosh((y−2)γ)} / e^{N log cosh(γy)} ).

Here b^N is a bound on the number of isolated minima, exp(N log cosh(γy)) is the weight when all σ^a are identical, b^{Ny} is an upper bound on the number of choices we have when one σ^a equals a given isolated minimum and at least one σ^b is different, and finally e^{(N−δ_N) log cosh(γy) + δ_N log cosh((y−2)γ)} is a rough upper bound on the weight in that case.

Note that we will choose γ and y below in such a way that γy → ∞ when N → ∞. We will therefore bound log cosh(γy) ≤ γy. Then the total contribution of the isolated minima becomes at most

e^{Nγy} b^N ( 1 + b^{Ny} e^{−2γδ_N} ).

If we choose γ ≥ Ny log b / (2δ_N), the contribution to the numerator of the probability of the isolated minima will be at most 2 exp(Nγy) b^N.

On the other hand, for the case that all σ^a are in the dense region B_{α_N}(σ*) we have a^{Ny} choices. For each of these choices, at least N − yα_N of the coordinates of all σ^1,...,σ^y are identical. Again, since γy → ∞ when N → ∞, given ε > 0 we may bound log cosh(γy) ≥ γy(1−ε). Thus the overall weight (this is again the numerator of the corresponding probability) of the dense region is at least a^{Ny} e^{(N−yα_N)γy(1−ε)}.

To compare the two weights, let us see if we can arrange the parameters in such a way that

a^{Ny} e^{(N−yα_N)γy(1−ε)} ≫ 2 e^{Nγy} b^N

(by which we mean that

a^{Ny} e^{(N−yα_N)γy(1−ε)} / ( 2 e^{Nγy} b^N ) → ∞

as N → ∞). Since ε > 0 is fixed and arbitrarily small, we may as well check whether a^{Ny} e^{(N−yα_N)γy} ≫ 2 e^{Nγy} b^N, which is the case if and only if

exp( −Ny ( log a − γy α_N/N − (log b)/y ) ) ≪ 1.

To this end, substitute γ = Ny log b / (2δ_N) (and note that indeed γ = γ_N → ∞ as N → ∞) to obtain for the exponent on the right-hand side

−Ny ( log a − y² (log b) α_N/(2δ_N) − (log b)/y ).

Now take y = 2 log b / log a. Since, by assumption, α_N/δ_N → 0 and y does not depend on N, also y² (log b) α_N/(2δ_N) converges to 0. This implies that the exponent will eventually become negative, hence the dense region carries an arbitrarily large mass.

Remark 1 Reading [2] carefully, one may get the impression that for the authors a dense region is one with an exponential number of local minima of E_N (again, the authors in [2] are not very explicit about this). However, if we take the limit β → ∞ slowly enough, as in a real Simulated Annealing schedule, the local minima that are not global minima will eventually get zero probability and hence are negligible. As a matter of fact, if one works with finite times, as in our next section, this is of course not true. In this case, however, one could equally well study a low temperature Metropolis chain, since most of the time in the annealing schedule is spent in the low temperature region anyway, as remarked above. For this Metropolis–Hastings chain a result similar to Proposition 2 can be shown very similarly.

3.2 The Limit γ → ∞

The situation where also γ depends on time and converges to infinity as time becomes large is different from the fixed-γ situation. Even though this is not explicitly stated in [2], it seems to be the version of the algorithm that the authors there have in mind. Indeed, as mentioned, they only consider a finite time horizon, in which they, however, increase γ.

In the situation with γ → ∞ we need to modify the considerations of the previous section.

Again we will assume that we keep β_n, γ_n constant on an interval T_n ≤ t ≤ T_{n+1} − 1. For the algorithm in this fixed time interval, again, the invariant measure is given by Q_β with β = β_n and γ = γ_n as given in (8). This is the case because during this interval the parameters of the Metropolis chain do not change. To stress the dependence on both parameters, we will now denote this measure by Q_{β,γ}.

Following the arguments in the previous subsection, we now see that there is a constant κ_2 such that

||Q_{β_n,γ_n} − Q_{β_{n+1},γ_{n+1}}|| ≤ κ_2 e^{−(β_n B + γ_n B')}.

Here again, B := min_{σ: E(σ)≠0} E(σ). Analogously, the constant B' is defined as the gap between the maximum of the function

H_γ({σ^a}) := Σ_{i=1}^N log cosh(γ Σ_{a=1}^y σ_i^a)

on Σ^y and its second largest value. Hence

B' := N log cosh(γy) − ( (N−1) log cosh(γy) + log cosh(γ(y−2)) ) = log cosh(γy) − log cosh(γ(y−2)).

The maximum of H_γ is realized when we take all σ^a identical, while the second term in B' stems from the fact that we obtain the second largest value of H_γ by changing one σ^a in one spin from a maximizing configuration. Since we will consider the limit γ → ∞, we may safely replace log cosh(γy) by γy − log 2 and log cosh(γ(y−2)) by γ(y−2) − log 2 to obtain

B' ≈ 2γ.
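A quick numerical sanity check of this approximation (our own, with arbitrary values of γ and y):

```python
import numpy as np

def b_prime(gamma, y):
    """B' = log cosh(gamma*y) - log cosh(gamma*(y-2)); close to 2*gamma for large gamma."""
    return np.log(np.cosh(gamma * y)) - np.log(np.cosh(gamma * (y - 2)))

for gamma in (0.5, 2.0, 5.0):
    print(gamma, b_prime(gamma, y=3), 2 * gamma)
# already for moderate gamma the deviation from 2*gamma is exponentially small
```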

To determine how the cooling schedule has to be chosen, we need to estimate the spectral gap of the Metropolis chain. Note that, if we use Proposition 1 to do so, we run into the problem that the constants c and C there depend on time, because the energy function does.

The solution is, of course, to include this time dependence into the definitions. Hence for a time t let

F_t({σ^a}) := Σ_{a=1}^y E(σ^a) − (1/β_t) H_{γ_t}({σ^a}) = Σ_{a=1}^y E(σ^a) − (1/β_t) Σ_{i=1}^N log cosh(γ_t Σ_{a=1}^y σ_i^a),

so that Q_{β_t,γ_t}({σ^a}) ∝ exp(−β_t F_t({σ^a})).

Then we can represent the Simulated Annealing chain, which we will now denote by S_{β,γ} and which is still given by (7) (with the only difference that now also Π_γ depends on time), as a Simulated Annealing algorithm with time-dependent energy function F_t, see e.g. [28], [14]. Indeed, in this case we may replace the proposal chain Π_γ in (7) by Π. Here Π, being in {σ^a}, picks one of a = 1,...,y and one index i = 1,...,N at random and flips σ_i^a to −σ_i^a. Then S_{β,γ} can be written as

S_{β,γ}({σ^a},{σ̃^a}) := exp( −β ( F_t({σ̃^a}) − F_t({σ^a}) )_+ ) Π({σ^a},{σ̃^a})   if {σ^a} ≠ {σ̃^a},

S_{β,γ}({σ^a},{σ^a}) := 1 − Σ_{{σ̂^a}≠{σ^a}} exp( −β ( F_t({σ̂^a}) − F_t({σ^a}) )_+ ) Π({σ^a},{σ̂^a}).    (13)

Now we can use results from [28] (cf. [27] for related work) to compute the spectral gap of S_{β,γ}. In analogy to what we did in the previous subsection, define

m_t := max_{{σ^a},{τ^a}} ( H_t({σ^a},{τ^a}) − F_t({σ^a}) − F_t({τ^a}) ),

where

H_t({σ^a},{τ^a}) := min_{p∈P({σ^a},{τ^a})} Elev_t(p)

and

Elev_t(p) := max_{r=0,...,M} F_t({ν^{a,r}}).

Again, in analogy to the previous subsection, for functions f,g : Σ^y → ℝ let the Dirichlet form E_{t,β,γ} be given by

E_{t,β,γ}(f,g) = (1/(2Γ̂_t)) Σ_{{σ^a},{τ^a}} ( f({σ^a}) − f({τ^a}) )( g({σ^a}) − g({τ^a}) ) exp( −β ( F_t({σ^a}) ∨ F_t({τ^a}) ) ) Π({σ^a},{τ^a}),

where Γ̂_t is the appropriate normalizing constant. Then for

ψ_t(β) := inf{ E_{t,β,γ}(f,f) : ||f||_{L²(Q_{β,γ})} = 1 and ∫ f dQ_{β,γ} = 0 }

it holds

Proposition 3 There are constants c > 0 and C < ∞ such that for all β ≥ 0,

c e^{−βm_t} ≤ ψ_t(β) ≤ C e^{−βm_t}.

Proof This is the content of [28, Theorem 2.1].

As before, Proposition 3 implies for the second largest eigenvalue λ_1(β,γ) of S_{β,γ} that

|λ_1(β_n,γ_n)| ≤ 1 − C e^{−β_n m_n},

where m_n denotes m_t for t in the n-th temperature stage. From linear algebra we therefore obtain that for each n = 1,2,... and any probability measure ν_0 on Σ^y

||ν_0 S_{β_n,γ_n}^{T_n} − Q_{β_n,γ_n}|| ≤ κ_3 ( 1 − C e^{−β_n m_n} )^{T_n} ||ν_0 − Q_{β_n,γ_n}||

for some constant κ_3 > 0 (cf. the very similar argument for ordinary Simulated Annealing in [1, (7.8)]). Writing again

ε_n := ||Q_{β_n,γ_n} − ν_n||,

by the recursive structure of the annealing algorithm and the considerations above we obtain the estimate

ε_n ≤ κ_2 e^{−(β_n B + γ_n B')} + κ_3 ( 1 − C e^{−β_n m_n} )^{T_n} ε_{n−1}.

Solving this recursive inequality gives

ε_n ≤ κ_3^n Π_{k=1}^n ( 1 − C e^{−β_k m_k} )^{T_k} ( Σ_{k=1}^n κ_2 e^{−(β_k B + γ_k B')} / u_k + ε_0 )    (14)

(cf. [1, (7.14)]). Here

u_k := κ_3^k Π_{j=1}^k ( 1 − C e^{−β_j m_j} )^{T_j}.

Hence we need to choose our parameters β_n, γ_n, T_n in such a way that the right-hand side converges to zero. In this case we have shown the following theorem.

Theorem 2 If

κ_3^n Π_{k=1}^n ( 1 − C e^{−β_k m_k} )^{T_k} ( Σ_{k=1}^n κ_2 e^{−(β_k B + γ_k B')} / u_k + ε_0 ) → 0    (15)

as n → ∞, then the distribution ν_n of S_{β_n,γ_n} has the same limit as Q_{β_n,γ_n} as n → ∞.

Following [1, (7.17)], a necessary condition for (15) is

lim_{n→∞} ( −C Σ_{k=1}^n T_k e^{−β_k m_k} + n log κ_3 ) = −∞.
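To get a feeling for when condition (15) can hold, one can simply iterate the recursive bound on ε_n numerically for a candidate schedule. The constants and the schedule in the following sketch are arbitrary placeholders, not values derived in the paper.

```python
import numpy as np

def epsilon_bound(n_stages, kappa2=1.0, kappa3=2.0, C=0.5, B=1.0, Bprime=2.0,
                  m=2.0, eps0=1.0):
    """Iterate eps_n <= kappa2*exp(-(beta_n*B + gamma_n*B'))
                        + kappa3*(1 - C*exp(-beta_n*m))**T_n * eps_{n-1}
    for a toy schedule beta_n, gamma_n, T_n (all placeholders)."""
    eps, out = eps0, []
    for n in range(1, n_stages + 1):
        beta = np.log(1 + n)                    # slow increase of beta_n
        gamma = np.log(1 + n)                   # and of gamma_n
        T = int(np.ceil(5 * np.exp(m * beta)))  # stage length beating the spectral gap
        eps = (kappa2 * np.exp(-(beta * B + gamma * Bprime))
               + kappa3 * (1 - C * np.exp(-beta * m)) ** T * eps)
        out.append(eps)
    return out

print(epsilon_bound(10))   # the bound decreases when T_n grows fast enough
```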

Remark 2 If we are right with the assumption that the authors in [2] would take γ_n → ∞ when time gets large, the result of the theorem is, however, not what the authors in [2] seem to intend with their introduction of the Replicated Simulated Annealing algorithm. Indeed, when β_n → ∞ and γ_n → ∞, the measure Q_{β_n,γ_n} converges to Q_{∞,∞}. However, the latter is nothing but the uniform distribution on

Ñ_0 := { (σ^1,...,σ^y) : E(σ^a) = 0 for all a = 1,...,y, and σ^1 = ... = σ^y }.

In particular, Q_{∞,∞} does not put higher probability on configurations in dense regions of the state space.

Remark 3 Note that for both Theorems 1 and 2 the cooling schedules have to be chosen very carefully. An anonymous referee remarked that there are simulation algorithms for Gibbs measures that do not use such a cooling strategy, such as parallel tempering [33], swapping [16,17,32], or equi-energy sampling [26]. We are grateful for this remark.

However, there are some issues with these algorithms. First of all, all of these algorithms simulate Gibbs measures at non-zero temperatures. That means we will obtain an impression of the energy landscapes, but not necessarily convergence towards their global maxima or minima. However, for the simulations in Sect. 4 this is still an important remark.

The tempering algorithms usually suffer from the deficit that they require computation of partition functions, which is as hard as finding the minima or maxima of the energies involved.

Swapping circumvents these problems. However, the speed of convergence may be a problem (as it is for simulated annealing). In some situations the swapping algorithm converges rapidly (i.e. in polynomial time), see e.g. [12,29,31]; in others the convergence takes exponentially long, see [3] or [10]. The results in [11] show that equi-energy sampling typically does not overcome the problem of torpid mixing.

In the next section, we empirically study a slightly modified version of the algorithm of Replicated Simulated Annealing, using both synthetic toy datasets and real databases.
