Stein's method in high dimensions with applications

(1)

www.imstat.org/aihp 2013, Vol. 49, No. 2, 529–549

DOI:10.1214/11-AIHP473

Stein’s method in high dimensions with applications

Adrian Röllin

Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, 117546 Singapore Received 1 February 2011; revised 19 September 2011; accepted 26 December 2011

Abstract. Leth be a three times partially differentiable function onRⁿ, letX=(X₁, . . . , X_n) be a collection of real-valued random variables and letZ=(Z₁, . . . , Z_n)be a multivariate Gaussian vector. In this article, we develop Stein’s method to give error bounds on the differenceEh(X)−Eh(Z)in cases where the coordinates ofXare not necessarily independent, focusing on the high dimensional casen→ ∞. In order to express the dependency structure we use Stein couplings, which allows for a broad range of applications, such as classic occupancy, local dependence, Curie–Weiss model, etc. We will also give applications to the Sherrington–Kirkpatrick model and last passage percolation on thin rectangles.

Résumé. Soithune fonction réelle surRⁿdont les dérivées partielles d’ordre trois existent, soitX=(X₁, . . . , X_n)un vecteur de variables aléatoire réelles et soitZ=(Z₁, . . . , Z_n)un vecteur aléatoire Gaussien. Dans cet article, nous établissons par la méthode de Stein une majoration de la différenceEh(X)−Eh(Z)dans le cas où les coordonnées deXne sont pas nécessairement indépendantes; nous nous concentrons sur le cas de la grande dimensionn→ ∞. Pour exprimer la structure de dépendance, nous utilisons des couplages de Stein, ce qui permet une large gamme d’applications, par exemple aux modèles d’urnes, au modèles avec dépendance locale, au modèle de Curie–Weiss, etc. Nous présentons aussi des applications au modèle de Sherrington–Kirkpatrick et à la percolation de dernier passage dans des rectangles étroits.

MSC:60F17; 82B44

Keywords:Stein’s method; Gaussian interpolation; Last passage percolation on thin rectangles; Sherrington–Kirkpatrick model; Curie–Weiss model

1. Introduction

Let X and Z be random vectors in Rⁿ and let h:Rⁿ →R be a function of interest. A fundamental problem in probability theory is to obtain bounds on the quantity

Eh(X)−Eh(Z), (1.1)

that is, to estimate the error when we replaceXinEh(X)byZ. If the error in (1.1) is small irrespective of the detailed properties ofXandZ then we will attribute to the functionha certain degree ofuniversality, which means that the expected value only depends on certain basic characteristics ofXandZ, such as the first few moments.

Of particular interest is the case whereZis a Gaussian vector having the same (or a similar) covariance structure asX, and probably the most prominent occurrence of such universality is the central limit theorem. IfXis a random vector, such that theX_iare independent of each other, centred and scaled such that

iVarX_i=1, andZis a centred Gaussian vector with uncorrelated coordinates having the same variances as those of X, then it is well known that (1.1) is small for functions of the form

h(x)=g _n

i=1

xi

, (1.2)

(2)

whereg:R→Ris not too irregular. A common heuristic says that the central limit theorem will also hold if independence is replaced by some form of “weak” dependence, and, furthermore, it can be expected that in many cases (1.1) will be small for more general functions than (1.2). Thus, in terms of dropping independence and considering more general functions than (1.2), universality often can be observed beyond the standard setting of the central limit theorem.

Even if the vector X is such that

iX_i does not satisfy the central limit theorem, we can consider (1.1) for functions more general than (1.2). Let, for example, ξ_i be the number of balls that end up in the ith urn, when a fixed number of ballsmis distributed independently among nurns. Clearly,

iξ_i =m, respectively,

iX_i = 0 if theX_i are the centered and properly scaled ξ_i. Although these sums do not satisfy a central limit theorem, it is nevertheless possible to give informative bounds on (1.1). Dembo and Rinott [11] and Chen and Röllin [10]

considered, for example, functions of the formh(x)=g(

iϕ(x_i))for fixed functionsgandϕ, whereϕis non-linear.

For other, non-trivial choices ofhwe refer to Sections4.1and4.2.

Over the last decades, Stein’s method has proved to be a very robust method to obtain explicit bounds for univariate and multivariate distributional approximations in cases whereXexhibits non-trivial dependencies which are not of martingale type, but more combinatorial in flavour. Although Stein’s method for the multivariate normal distribution has been successfully implemented in many places (see Meckes [18] and Reinert and Röllin [21] and references therein), the dependence on the dimension of the results obtained so far may give the impression that the method is not suitable if the dimension grows linearly with the size of the problem. Indeed, this high-dimensional case has remained untackled until now. The purpose of this article is to close this gap.

It is important to note at this point that the type of bounds that we will obtain will generally not imply that the marginal distributions of the individual coordinates will converge to a normal distribution. That is, the aim is not to prove convergence to a multivariate normal distribution. In the already mentioned example of classic occupancy, if the number of balls and urns are of the same order, thenξ_i will converge to a Poisson distribution with mean being equal to the limiting ratio limm/n. Bounds on (1.1) will only be informative if they are smaller than the fluctuation ofh(X), that is, if the bounds are smaller thanE|h(X)|(assuming here without loss of generality thatEh(Z)=0), which is an obvious upper bound on (1.1). The bounds that we obtain for functionshthat concentrate only on a few coordinates will typically have the same order asE|h(X)|and hence will not – and often cannot – be informative.

The remainder of the article is organised as follows. In the next section we will first discuss the key tools used in this article, in particular the fundamental idea of using interpolation to estimate (1.1), the Gaussian integration by parts formula and multivariate Stein couplings, leading to our main result, Lemma2.1. In Section3we will then give some abstract and more concrete examples of Stein couplings, ranging from the independent case to more sophisticated dependencies. In Section4we will discuss various applications.

2. The key lemma

An old idea to compare two quantities of interest is to find an interpolating sequence between them and to estimate the error “along the way” of the interpolation using the derivatives ofh(paraphrasing Talagrand [29] on “Gaussian interpolation and the smart path method”). One of the earliest encounters of this idea is Lindeberg’s method of telescoping sums. Define the interpolating sequence

Y (i)=(X₁, . . . , X_i, Z_i₊₁, . . . , Z_n), (2.1) and write

Eh(X)−Eh(Z)= n

i=1

E h

Y (i) −h

Y (i−1) ; (2.2)

one can now bound the right-hand side of (2.2) using Taylor expansion; this idea has been successfully implemented by Rotar’ [24], Chatterjee [7], Mossel et al. [19] and Tao and Vu [30] and surely by other authors. One of the important consequences of this approach is apparent when we look at (2.1): it forces us to treat the coordinates of X in an ordered way. If the components ofXare independent or, more generally, a martingale difference sequence, then this is of course desirable, and, indeed, quite a few central limit theorems for martingales are based upon (2.2) (see e.g.

(3)

Bolthausen [4], Grama [15] and Rinott and Rotar’ [22]). And even if no such structure is apparent in the problem, one can sometimes arrangeXsuch that it will be close enough to a martingale difference sequence.

This approach, however, is not entirely satisfying. Often the martingale structure is “artificial” and one would like to make use of a more natural dependence structure inX, instead (rates of convergences being another reason to avoid martingales). And in some cases, one may have difficulties to linearise the problem at all.

A key difference in Stein’s method is to chose an interpolating sequence that, in contrast to Lindeberg’s telescoping sum, treats the components ofX symmetrically. Note that (2.1) essentially interpolates “along the coordinate axes”

and the order of the axes determines the linearisation of the problem. Instead, we will interpolate betweenX andZ in a way that will linearly interpolate between the matricesXX^t andZZ^t. This approach is well-known asGaussian interpolationand independently developed by Slepian [25] and Stein [26], although the technique used by Stein looks very much different from what is commonly referred to as Gaussian interpolation (the interpolation is “hidden” in the solution to the so-calledStein equation).

Gaussian interpolation has become popular in many areas; Talagrand [29] gives a good account of the key idea, in particular in the context of statistical mechanics (where Gaussian interpolation is referred to assmart path method).

The method is a key ingredient in the rigorous proof of theParisi formulaby Talagrand [28]. It is also an important tool to prove universality in the bulk of eigenvalues for Wigner random matrices with matrix entries following so-called Gaussian divisible distributions. The generalisation from these special distributions to the general case, however, uses Lindeberg’s idea of telescoping sums; see Johansson [17] and Erd˝os et al. [13].

Now, assume that X andZ are independent and define the interpolating sequenceYt =√ t X+√

1−tZ, 0≤ t≤1. Note that, ifEX=EZ=0, thenEYt =0 and, if Cov(X)=Cov(Z), then Cov(Yt)=Cov(X)for allt (which may serve as an explanation why this particularY_t is a good choice). Withh_i being the partial derivative in theith coordinate, we can write

Eh(X)−Eh(Z)= 1

0

∂

∂tEh(Yt)dt

=1 2

₁

0 E

1

√t

i

X_ih_i(Y_t)− 1

√1−t

i

Z_ih_i(Y_t)

dt (2.3)

(differentiation in (2.3) corresponds to taking differences in (2.2) and integration replaces summation, but this is only a technical difference). One can easily see that, on the right-hand side of (2.3), the coordinates are treated symmetrically. The result obtained by Slepian [25] (known asSlepian’s Lemma) is valid under the assumption thatX andZ are centred Gaussian vectors having a different covariance structure. In this case, the Gaussian integration by parts formula

E

Zihi(Z)

= n

j=1

Cov(Zi, Zj)Ehij(Z) (2.4)

can be used to estimate the error on the right-hand side of (2.3) in terms of the covariances. Stein [26], on the other hand, considered the univariate case, but whereXis not Gaussian. Although (2.4) can still be used forZ, it needs to be replaced by an approximate version of (2.4) forX.

In order to formalise this approximate version of the Gaussian integration by parts formula, we will make use of a multivariate generalisation ofStein couplings, which were introduced by Chen and Röllin [10] in the univariate case, and then give more concrete constructions later on. Throughout this article summations will always range from 1 ton unless otherwise stated.

Definition 2.1. Let(X, X, G)be a triple ofn-dimensional random vectors.We say that the triple is a Stein coupling if,for any smooth enough functionf:Rⁿ→R,we have

E

i

X_if_i(X)=E

i

G_i f_i

X −f_i(X) (2.5)

whenever the involved expectations exist.

(4)

Remark 2.2. If(X, X, G)is a Stein coupling,it follows from the definition that

EX_i=0, E(G_iD_j+G_jD_i)=2 Cov(Xi, X_j) (2.6)

for alliandj,where we letD=X−Xthroughout this article(apply(2.5)to the functionsf (x)=x_i andf (x)= x_ix_j,respectively).If(2.5)is replaced by the stronger condition that

E

X_if_i(X)

=E G_i

f_i

X −f_i(X) (2.7)

for alli,then

E(G_iD_j)=E(G_jD_i)=Cov(Xi, X_j) (2.8)

for alliandj.

Equation (2.5) is the key condition to obtain an approximate Gaussian integration by parts formula: ifXandX are close to each other, then the difference on the right-hand side of (2.5) can be approximated by the corresponding derivatives, leading to a formula similar to (2.4). Hence, it is crucial thatXis only asmall perturbationofX.

The following result, although not difficult to prove, is crucial for our approach. On one hand, it measures how closelyXsatisfies the Gaussian integration by parts formula and, on the other hand, also compares the covariances ofXandZ (which in this article we will mostly assume to be the same). To make things more transparent, we keep everything explicit in terms of the functionh, instead of using the usual approach via Stein equation and its solution.

Unless otherwise stated, we will assume throughout this article that EX=0, VarX_i=σ_i², E|X_i|³=τ_i³<∞, τ¯=sup

i

τ_i. (2.9)

We will denote byΣ=(σij)1≤i,j≤nthe covariance matrix ofX, whereσij=E(XiXj), and we haveσii=σ_i². Lemma 2.1. Let(X, X, G)be a Stein coupling.LetX and D˜ ben-dimensional random vectors and letS be a randomn×nmatrix.DefineD=X−XandD=X−X.Assume that,for allkandl,

E(GkDl|X)=E(GkD˜l|X), E(Skl|X)=σkl. (2.10)

LetZ∼MVNn(0, Σ )be independent of the previous random vectors.Then,for any three times partially differentiable functionh,

Eh(X)−Eh(Z)=1 2

₁

0 ER₁(t)dt−1 2

₁

0

₁

0

t^1/2ER₂(t, s)dsdt +1

2 ₁

0

₁

0

₁

0

st^1/2ER₃(t, sr)drdsdt, (2.11)

where

R₁(t)=

k,l

(G_kD˜_l−S_kl)h_kl√

t X+√

1−tZ , R₂(t, u)=

k,l,m

(G_kD˜_l−S_kl)D_mh_klm√

t X+u√

t D+√

1−t Z , R₃(t, u)=

k,l,m

G_kD_lD_mh_klm√

tX+u√ t D+√

1−t Z , provided thatER_i(·)exists fori=1,2,3.In particular,

Eh(X)−Eh(Z)≤1 2sup

t

ER₁(t)+1 3sup

t,s

ER₂(t, s)+1 6sup

t,s

ER₃(t, s).

(5)

It seems rather difficult at this point to convey the purpose of all the random vectors appearing in the lemma.

Probably the best way to get an intuition for such couplings is to go through the different applications given later on; we also refer to Chen and Röllin [10] for the univariate case, where further examples are discussed. We note that finding the appropriate random vectors will usually require some trial and error.

Remark 2.3. Let us make a few comments at this point.

1. We will use the following simple fact in the applications.If(X, X, G)is a Stein coupling satisfying the stronger condition(2.7)and if there is aσ-algebraF⊃σ (X)such that

E

G_kD˜_l|F =E

S_kl|F , (2.12)

thenER₁(t)=0.

2. Except for the case of local dependence in Section3.5,we will chooseD˜ =D.

3. The result can be easily extended to include other error terms from the proof of the lemma under weaker conditions.

We will use the following extension later on.If(X, X, G)is not a Stein coupling,then one can include a measure of how close(2.5)is satisfied;with

R0(t)=

k

Xkhk

√t X+√

1−t Z −Gkhk

√t X+√ 1−t Z +Gkhk

√t X+√

1−t Z , an additional ¹₂₁

0

√1

tER₀(t)dtappears on the right-hand side of(2.11).

4. If(X, X, G)is not a Stein coupling,the identities(2.6)and(2.8)are no longer valid and need to be replaced by corresponding approximate versions.

Proof of Lemma2.1. Define the interpolating sequenceYt =√ t X+√

1−tZ, 0≤t≤1. Starting from (2.3), and using (2.4) and (2.5), we obtain

Eh(X)−Eh(Z)= ₁

0

∂

∂tEh(Y_t)dt

=1 2

1 0

E

k

√1

tX_kh_k(Y_t)−

k

√1

1−tZ_kh_k(Y_t)

dt

=1 2

₁

0 E

k

√1 tG_k

h_k

Y_t −h_k(Y_t) −

k,l

σ_klh_kl(Y_t)

dt, (2.13)

whereY_t=√

t X+√

1−t Z. Let us recall the definition ofR₁(t)and introduce two additional error terms:

R₁(t)=

k,l

(G_kD˜_l−S_kl)h_kl Y_t ,

R₄(t):=

k,l

(S_kl−σ_kl)h_kl(Y_t), R₅(t):=

k,l

G_k(D_l− ˜D_l)h_kl(Y_t), whereY_t=√

t X+√

1−t Z. Applying hk

Y_t −hk(Yt)= 1

0

√t

l

Dlhkl

Yt+s√ t D ds

(6)

to (2.13), and adding and subtracting the terms fromR₁(t),R₄(t)andR₅(t)yields Eh(X)−Eh(Z)

=1 2

₁

0

E ₁

0

k,l

G_kD_lh_kl Y_t+s√

t D ds−

k,l

σ_klh_kl(Y_t)

dt

=1 2

₁

0

E

R₁(t)+R₄(t)+R₅(t) dt +1

2 1

0

E

k,l

(G_kD˜_l−S_kl)

h_kl(Y_t)−h_kl Y_t

dt

+1 2

1 0

E

k,l

GkDl

hkl

Yt+s√

tD −hkl(Yt) dsdt.

Note that, under (2.10),ER4(t)=ER5(t)=0. Taylor expansion in the last two lines yields the final result; we refer to Chen and Röllin [10] for a more detailed exposition of the proof in the univariate case.

We now derive general norm bounds from Lemma2.1, along the lines of Raiˇc [20], Chatterjee and Meckes [8] and Meckes [18]. Let · be a norm onRⁿ and let | · |be a norm onRⁿ^×ⁿ, the space ofn×nmatrices. Define the following measures of smoothness ofh. Fork≥1, let

M_k(h)= sup

x∈Rⁿ sup

u⁽¹⁾,...,u^(k)∈Rⁿ

n

i₁,...,i_k=1

u⁽¹⁾_i

1 · · ·u^(k)_i

k

u⁽¹⁾ · · · u^(k) h_i₁_,...,i_k(x),

and fork≥2 define Mk(h)= sup

x∈Rⁿ sup

A∈Rⁿ^×ⁿ sup

u⁽³⁾,...,u^(k)∈Rⁿ

n

i1,...,ik=1

Ai₁i₂u⁽³⁾_i

3 · · ·u^(k)_i

k

|A | u⁽³⁾ · · · u^(k) hi₁,...,i_k(x)

(ifk=2, the third supremum in the definition of Mk(h)is just ignored). We then have the following straitforward result.

Lemma 2.2. Under the assumptions of Lemma2.1,letFbe aσ-algebra withσ (X)⊂F.Then,for all0≤t, s≤1, ER₁(t)≤M₂(h)EE

GD˜^t−S|F , ER2(t, s)≤M3(h)EGD˜^t+ |S | D, ER₃(t, s)≤M₃(h)E

G D ² .

It is clear from this lemma that the optimal choice of the norms · and | · |very much depends on the involved random vectors and how they are coupled. This, in turn, determines which functionshare considered smooth enough to yield informative bounds.

Let us fix some notation before we proceed. We denote by · _∞the supremum norm of functions. Fork≥1 and ak-times partially differentiable functionf:Rⁿ→R, we let

|f|k= sup

1≤i₁≤···≤i_k≤n

f_i₁_...i_k _∞.

For functionsg:R→Rwe will use the notation g _∞, g _∞, . . ., instead of the equivalent|g|1,|g|2, . . ..

(7)

The couplings we construct in this article are such that the random vectors and matrices in Lemma2.2are small with respect to theL₁-norms

u ₁= n

i=1

|u_i| and |A |1= n

i,j=1

|A_ij|.

It is not difficult to see that with respect to these norms we have Mk(h)=Mk(h)= |h|k.

For this reason, we will directly formulate our results in terms of|h|k.

Note that this is in contrast to the results for multivariate normal approximation of Chatterjee and Meckes [8] and Reinert and Röllin [21]. There, the vectors and matrices are typically closer inL₂than inL₁. Meckes [18] showed that in this case| · |kis too strong to measure the smoothness ofh, resulting in suboptimal dependence on the dimension.

Using instead M_k(h) and M_k(h) with respect to the L₂-norms, Meckes [18] showed that the dependence on the dimension can be substantially reduced.

Remark 2.4. One may be interested in comparing the distributions off (X)and f (Z)for some specific function f:Rⁿ→R.To this end,chooseh(x)=g(f (x))forg:R→R.Then,if(1.1)is small for all three times differentiable functionsg,then we can conclude thatf (X)andf (Z)are close in distribution.We record the useful estimates

|h|1≤ |f|1g

∞, |h|2≤ |f|2g

∞+ |f|²₁g

∞,

|h|3≤ |f|3g

∞+3|f|1|f|2g

∞+ |f|³1g

∞. Remark 2.5. A particular function of interest is

f (x)=log m

p=1

e^βy^(p)^(x)

for functionsy^(p):Rⁿ→R, 1≤p≤m.Defineγ_k=sup_p|y^(p)|k;it is straightforward to check that

|f|1≤βγ₁, |f|2≤βγ₂+2β²γ₁², |f|3≤βγ₃+6β²γ₁γ₂+6β³γ₁³. 3. Couplings

Many of the Stein couplings discussed by Chen and Röllin [10] can be adapted to the multivariate case: exchangeable pairs, size-biasing, local dependence, etc. Instead of generalising all of them here (which will be done elsewhere with emphasis on multivariate normal approximation for fixed dimension) we only go through a few of them explicitly and instead present some other couplings not discussed by Chen and Röllin [10].

3.1. A theoretical result

One may wonder if, given a pair(X, X)withEX=0, there exists aGto make the triple(X, X, G)a Stein coupling.

This question has been answered by Chen and Röllin [10] for the univariate case, but the construction given there can also be used in the multivariate setting. LetF=σ (X)be the σ-algebra induced byX and letF=σ (X). Define formally the sequence

G= −X+E

X|F −E E

X|F |F +E E

E

X|F |F |F − · · ·.

If the sequence converges absolutely in each coordinate, then this will make(X, X, G)a Stein coupling. Indeed, E(G|F)= −XandE(G|F)=0 so that (2.5) is satisfied.

(8)

To motivate the choice of G used in the next few settings, consider the case where the coordinates of X are independent. LetI be uniformly distributed on{1, . . . , n}, independent of all else. Define the vectorX⁽ⁱ⁾by

X⁽ⁱ⁾_k =(1−δki)Xk,

whereδ_ij is the Dirac delta function. LetX=X^{(I )}; that is,Xis the vector where we have set a randomly chosen coordinate to 0. Denote bye_ithe unit vector in directioni. Using independence of the coordinates,

E

X|F =E

X^{(I )}+e_IX_I|X^{(I )} =X^{(I )} and

E

X|F =1 n

i

X⁽ⁱ⁾=

1−1 n

X.

Hence,

G= −X+X−

1−1 n

X+

1−1 n

X−

1−1 n

2

X+

1−1 n

2

X+ · · ·

= −e_IX_I−

1−1 n

e_IX_I−

1−1

n 2

e_IX_I− · · · = −ne_IX_I.

3.2. Independent coordinates

In order to illustrate the method in a simple setting, we start with independent coordinates using(X, X, G)derived in the previous section.

Theorem 3.1. LetXbe as in(2.9)and assume the coordinates ofXare independent.IfZis a vector of independent centred Gaussian random variables with the same variances asX,then

Eh(X)−Eh(Z)≤5 6

i

τ_i³ h_iii _∞.

Proof. LetG_i = −nδ_iIX_i andX=X=X^{(I )}, hence D_i =D_i= −δ_iIX_i. LetD˜ =DandS_ij =nσ_i²δ_iIδ_{j I}. It is easy to see that(X, X, G)is a Stein coupling satisfying the stronger condition (2.7), that (2.10) is satisfied and that (2.12) holds withF=σ (X, I ); henceER₁(t)=0. The following estimates are immediate:

R₂(t)≤

i

h_iii _∞

σ_i²E|X_i| +E|X_i|³ ≤2

i

h_iii _∞E|X_i|³, R₃(t)≤

i

h_iii _∞E|X_i|³.

Lemma2.1concludes the theorem.

Using Lindeberg’s telescoping sum and Taylor expansion, and noting that the first two moments ofXandZmatch, one easily obtains

Eh(X)−Eh(Z)

≤1 6

i

E|X_i|³+E|Z_i|³ h_iii _∞≤(1+√ 8/π) 6

i

τ_i³ h_iii _∞.

Not surprisingly, the constants obtained via Stein’s method are larger for the case of independent random variables.

However, applications with dependencies is the main purpose of using Stein’s method.

(9)

3.3. Weak dependence

A simple way to measure how much a single coordinateX_i is influenced by the other coordinates is to look at the fluctuation of the conditional mean and variance ofX_i. To this end, letXbe as in (2.9) and defineX⁽ⁱ⁾as in Section3.1.

Furthermore, let μ_i

X⁽ⁱ⁾ =E

X_i|X⁽ⁱ⁾ , σ_i²

X⁽ⁱ⁾ =Var

X_i|X⁽ⁱ⁾ . Then we have the following.

Theorem 3.2. LetXbe as in(2.9)and letZ∼MVNn(0, Σ ).Then Eh(X)−Eh(Z)≤

i

h_i _∞Eμ_i

X⁽ⁱ⁾ +1 2

i

h_ii _∞ Eμ_i

X⁽ⁱ⁾ ²+Eσ_i²

X⁽ⁱ⁾ −σ_i² +5

6

i

τ_i³ h_iii _∞.

Proof. DefineG,X,X,S as in the proof of Theorem3.1; the error termsR₂andR₃can be bounded in the same way. As(X, X, G)is not necessarily a Stein coupling, we need the additional error term

ER₀(t)=E

i

X_ih_i√

tX⁽ⁱ⁾+√

1−tZ =E

i

μ_i

X⁽ⁱ⁾ h_i√

tX⁽ⁱ⁾+√ 1−tZ (see Remark2.3). Furthermore,

ER1(t)=E

i

X_i²−σ_i² h_ii√

t X⁽ⁱ⁾+√ 1−t Z

=E

i

Xi−μi

X⁽ⁱ⁾ 2

−σ_i²−μi

X⁽ⁱ⁾ ² hii

√tX⁽ⁱ⁾+√

1−tZ .

This easily leads to the final bound.

Note that if theX_iare independent, Theorem3.2reduces to Theorem3.1. Götze and Tikhomirov [14] assumed that μi(Xⁱ)=0 almost surely to obtain convergence rates to the semi-circular law in random matrix theory under such dependence.

3.4. Constant sum and symmetry

Recall the classic occupancy problem from the Introduction. The sum of the vector that describes the number of balls in each urn is equal to the total number of balls and hence, itself, does not satisfy a central limit theorem. This motivates us to consider general centered vectorsXthat satisfy

n

i

Xi=0 (3.1)

almost surely.

To apply our method, we will need to make more assumptions. A random vector X=(X₁, . . . , X_n)is called exchangeable if its distribution is invariant under permutation of the coordinates. Note that (3.1) implies

jσij=0 for eachi, and combined with exchangeability, we therefore have

σ_ij= − σ₁²

n−1 (3.2)

for alli=j.

(10)

Theorem 3.3. LetXbe an exchangeable random vector satisfying(2.9)and letZ∼MVNn(0, Σ ).Then Eh(X)−Eh(Z)≤ |h|2

Var

i

X²_i 1/2

+16|h|3nτ₁³. (3.3)

Remark 3.1. Note that the theorem can also be applied ifX is not exchangeable,buthsymmetric instead,that is ifh(x)remains the same under any permutation of the coordinates.In that case,Theorem3.3can be applied to the randomly permutedX.Note thatτ₁³is then replaced byn⁻¹

iτ_i³for the final result.

Proof of Theorem3.3. Forx∈Rⁿ, letx^ik∈Rⁿ be the vector obtained by interchanging theith andkth coordinate ofx(ifi=kthenx^ik=x). Note that due to exchangeability,

E

ϕ(X_i)h_i

X^ik =E

ϕ(X_k)h_i(X)

(3.4) for any functionϕfor which the expectations exist. Furthermore, for(i, j, k, l)∈ [n]⁴with

i=k ⇐⇒ j=l, (3.5)

denote byx^{ij kl}a permutation ofxsuch that E

ϕ(Xj, Xl)hik(X)

=E

ϕ(Xi, Xk)hik

X^{ij kl} (3.6)

for all functionsϕfor which the expectations exist. Note that this permutation can be defined independently ofx and h: ifX is exchangeable, keep[n] \ {i, j, k, l}fixed, mapj →i andl→kand map the remaining numbers among each other in any arbitrary, but fixed way. Let(I, J, K, L)be distributed on[n]⁴, such that(I, J, K, L)is uniform on [n]³and, given(I, J, K),Lis uniform on[n] \ {J}ifI=K, andJ=LifI=K; hence,(I, J, K, L)satisfies (3.5).

Define

X:=X^{I K}, X:=X^{I J KL} and

Gk= −nδkIXk, D˜k=Dk, Skl=n²δkIδlKσkl; note that

Dl=δlI(XK−XI)+δlK(XI−XK), D_l=

m∈{I,J,K,L}

δlm

X_l−Xl .

Fixt and let, for notational convenience, f_·(x)=Eh_·(√ tx+√

1−t Z), where· stands for i,ij or ij k. Clearly, E{G_kf_k(X)} =E{X_kf_k(X)}. Using exchangeability ofX, we can use (3.4) to obtain

E

k

G_kf_k

X = −nE X_If_I

X^{I K} = −nE

X_Kf_I(X)

= −1 nE

i,k

X_kf_i(X)=0.

Hence,(X, X, G)is a Stein coupling. Now, ER1(t)=E

k,l

(Skl−GkDl)fkl

X

=E

k,l

n²δ_kIδ_lKσ_kl+δ_kInX_k

δ_lI(X_K−X_l)+δ_lK(X_I−X_l) f_kl X

=n²EσI KfI K

X +nEXI(XK−XI)fI I

X +nEXI(XI−XK)fI K

X .

(11)

Using exchangeability and (3.2), n²E

σ_{I K}f_{I K}

X =n²E

σ_{J L}f_{I K}(X)

=1 nE

i,j

σ_jjf_ii(X)+ 1

n(n−1)E

i,j,k=i,l=j

σ_{j l}f_ik(X)

=σ₁²E

i

fii(X)− σ₁² n−1E

i,k=i

fik(X).

Furthermore, using (3.1) and (3.6), nE

X_I(X_K−X_I)f_{I I}

X =nE

X_J(X_L−X_J)f_{I I}(X)

= 1 n²E

i,j,l

X_j(X_l−X_j)f_ii(X)

= −1 nE

j

X²_j

i

f_ii(X)

and nE

X_I(X_I−X_K)f_{I K} X

=nE

X_J(X_J−X_L)f_{I K}(X)

= 1

n²(n−1)E

i,j,k=i,l=j

X_j(X_j−X_l)f_ik(X)

= 1 n²E

i,j,k=i

X²_jfik(X)− 1

n²(n−1)E

i,j,k=i,l=j

XjXlfik(X)

= 1 n²E

j

X²_j

i,k=i

f_ik(X)− 1

n²(n−1)E

j,l=j

X_jX_l

i,k=i

f_ik(X)

= 1 n²E

j

X²_j

i,k=i

f_ik(X)+ 1

n²(n−1)E

j

X²_j

i,k=i

f_ik(X)

= 1

n(n−1)E

j

X²_j

i,k=i

f_ik(X),

where for the last equality we used that _n(n¹₋₁₎=_n₋¹₁−¹_n. Hence, ER₁(t)≤ 1

n(n−1)

E

j

X²_j−nσ₁²

i,k=i

f_ik(X) +1

n

E

j

X²_j−nσ₁²

i

f_i(X)

≤2|h|2

Var

i

X_i² 1/2

.

(12)

This gives the first part of the result. Now, ER₂(t, u)=E

k,l,m

(S_kl−G_kD˜_l)D_mf_klm

X+uD

=n²E

m∈{I,J,K,L}

σ_{I K}

X_m −X_m f_{I Km}

X+uD

−nE

l∈{I,K},m∈{I,J,K,L}

XI

X_l−Xl X_m−Xm fI lm

X+uD

hence

ER₂(t, u)≤8|h|3τ₁

i,j

|σ_ij| +32|h|3nτ₁³≤8|h|3τ₁nσ₁²+32|h|3nτ₁³.

Similarly,

ER₃(t, u)=

k,l,m

G_kD_lD_mh_klm(X+uD)

=nE

l∈{I,K},m∈{I,K}

X_I

X_l−X_l X_m−X_m h_klm(X+uD), hence

ER₃(t, u)≤16|h|3nτ₁³.

3.5. Local dependence

Stein couplings to handle local dependence has already been discussed by Chen and Röllin [10], based on similar decompositions that appeared in many other places; we refer to the more detailed discussion in Chen and Röllin [10].

In particular, multivariate normal approximation for sums of locally dependent random vectors was considered by Rinott and Rotar’ [23] and Raiˇc [20].

LetX=(X₁, . . . , X_n)be as in (2.9). Assume that, for eachi∈ [n] := {1, . . . , n}, there is a subsetA_i⊂ [n]such thatX_A^c

i andX_i are independent. Assume further that for eachi∈ [n]andj⊂A_i there is a subsetB_ij⊂ [n]such that A_i⊂B_ij andX_B^c

ij is independent of(X_i, X_j). Central limit theorems for sums of random variables satisfying this refined version of local dependence were analyzed in detail by Barbour et al. [2].

Theorem 3.4. LetXas above.LetZ∼MVNn(0, Σ ).Then,for any three times partially differentiable functionh, Eh(X)−Eh(Z)≤ 1

3

i

j∈A_i

k∈B_ij

|σ_ij|E|X_k| +E|X_iX_jX_k| h_{ij k} _∞

+1 6

i

j,k∈Ai

E|X_iX_jX_k| h_{ij k} _∞≤ 5

6τ¯³nη|h|3, whereη=sup_i

j∈Ai|B_ij|.

Proof. LetI be uniform on[n]and, givenI, letJ be uniform onA_I. Define the vectorsX,X,GandD˜ and the matrixSas

G_k= −δ_kInX_k, X_k=I[k /∈A_I]X_k, X_k=I[k /∈B_{I J}]X_k, Skl=n|AI|δkIδlJσkl, D˜k= −|AJ|δkJXk.

(13)

Note thatXis independent ofG, which makes(X, X, G)a Stein coupling satisfying the stronger condition (2.7), similarly as for the independent case. Furthermore, withF=σ (X, I ), (2.12) holds and thereforeER1(t)=0. The

final bound follows now easily from Lemma2.1.

As we can see from the case of the CLT, whereh(x)=g(

ix_i), the typical scaling ofXis such thatτ¯³n⁻^3/2. With this scaling, a “typical” functionhwill have the property thatE|h(X)| 1, whereas the bound of Theorem3.4 is of order O(n⁻^1/2).

Note that anm-dependent sequence is a special case of local dependence: we have|Ai| =1+2mandBij≤1+3m.

However, the crucial aspect here is that the exact structure of the dependence is only important in terms of the size of A_i andB_ij. Any graph with maximal degreemthat describes the dependence structure ofX(that is, two subsets of vertices are independent if there is no edge between them) will have the upper bounds|A_i| ≤1+mand|B_ij| ≤1+2m.

In that case,

η≤2(m+1)². (3.7)

4. Applications

In this section, we present two different types of applications. First, we consider concrete functions h, for which we determine under what kind of dependencies (1.1) is small. If we can control the first three derivatives ofh, then we can analyse the universality of the givenh with respect to dependence, for example for the different settings of the previous section. The first two applications below are of this type. We analyse universality with respect to local dependence only, but it is clear that many of the other settings can be used instead. In the case of local dependence, we are interested in how big the “neighbourhoods”Ai andBij are allowed to become while keeping the bounds on (1.1) small enough. We useηfrom Theorem3.4as a simple measure of neighbourhood size, and hence dependence. These applications are closely related to Chatterjee [6]. Whereas in the first application of the SK-model the dependence enters in a straightforward way, in the second application of last passage percolation on thin rectangles, an certain optimisation step has to be recalculated, including the measure of dependenceη.

As a second type of application, we can consider more concrete vectorsX, for which we want to show that (1.1) is small for a large class of functionsh. In this situation, the structure of the dependence ofXwill either fit into one of the abstract settings of the previous section (this is the case for classic occupancy), or else, one has to construct a Stein coupling from scratch; the latter is the case for the Curie–Weiss model.

4.1. Environment with dependencies in the Sherrington–Kirkpatrick spin glass model

Consider theN-spin system{−1,1}^N. To each configurationσ∈ {−1,1}^Nwe assign the (random) Hamiltonian H_N(σ )= β

√N

i<j

ξ_ijσ_iσ_j,

whereξ=(ξ_ij)1≤i<j≤nis a family of random variables, which we call theenvironment. Given the environmentξ, we assign to eachσ the probability

P^ξ_N(σ )= e^H^N^{(σ )} Z_N(β, ξ ), where

Z_N(β, ξ )=

σ

e^βH^N^{(σ )}. Let

p_N(β)= 1

NElogZ_N(β, ξ ).