HAL Id: hal-03110213
https://hal.archives-ouvertes.fr/hal-03110213v2
Preprint submitted on 12 Nov 2021
Global implicit function theorems and the online expectation–maximisation algorithm

Hien Duy Nguyen (1,2,*) and Florence Forbes (3)

University of Queensland and Inria Grenoble Rhône-Alpes
Summary

The expectation–maximisation (EM) algorithm framework is an important tool for statistical computation. Due to the changing nature of data, online and mini-batch variants of EM and EM-like algorithms have become increasingly popular. The consistency of the estimator sequences that are produced by these EM variants often relies on an assumption regarding the continuous differentiability of a parameter update function. In many cases, the parameter update function is not in closed form and may only be defined implicitly, which makes the verification of the continuous differentiability property difficult. We demonstrate how a global implicit function theorem can be used to verify such properties in the cases of finite mixtures of distributions in the exponential family and, more generally, when the component-specific distributions admit data augmentation schemes within the exponential family. We then illustrate the use of such a theorem in the cases of mixtures of beta distributions, gamma distributions, fully-visible Boltzmann machines and Student distributions. Via numerical simulations, we provide empirical evidence towards the consistency of the online EM algorithm parameter estimates in such cases.

Key words: online algorithm; expectation–maximisation algorithm; global implicit function theorem; Student distribution; mixture models
1. Introduction
Since its introduction by Dempster, Laird & Rubin (1977), the expectation–maximisation (EM) algorithm framework has become an important tool for the conduct of maximum likelihood estimation (MLE) for complex statistical models. Comprehensive accounts of EM algorithms and their variants can be found in the volumes of McLachlan & Krishnan (2008) and Lange (2016).

Due to the changing nature of the acquisition and volume of data, online and incremental variants of EM and EM-like algorithms have become increasingly popular. Examples of such algorithms include those described in Cappé & Moulines (2009), Maire, Moulines & Lefebvre (2017), Karimi et al. (2019a,b), Fort, Moulines & Wai (2020a), Kuhn, Matias & Rebafka (2020), Nguyen, Forbes & McLachlan (2020), and Allassonniere & Chevalier (2021), among others. As an archetype of such algorithms, we shall consider the online EM algorithm of Cappé & Moulines (2009) as a primary example.

*Author to whom correspondence should be addressed.
1. School of Mathematics and Physics, University of Queensland, St. Lucia 4067, Queensland, Australia.
2. Department of Mathematics and Statistics, La Trobe University, Bundoora 3086, Victoria, Australia. Email: h.nguyen7@uq.edu.au
3. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
Suppose that we observe a sequence of $n$ independent and identically distributed (IID) replicates of some random variable $Y \in \mathbb{Y} \subseteq \mathbb{R}^d$, for $d \in \mathbb{N} = \{1, 2, \dots\}$ (i.e., $(Y_i)_{i=1}^n$), where $Y$ is the visible component of the pair $X^\top = (Y^\top, Z^\top)$, where $Z \in \mathbb{H}$ is a hidden (latent) variable, and $\mathbb{H} \subseteq \mathbb{R}^l$, for $l \in \mathbb{N}$. That is, each $Y_i$ ($i \in [n] = \{1, \dots, n\}$) is the visible component of a pair $X_i^\top = (Y_i^\top, Z_i^\top) \in \mathbb{X}$. In the context of online learning, we observe the sequence $(Y_i)_{i=1}^n$ one observation at a time, in sequential order.
Suppose that $Y$ arises from some data generating process (DGP) that is characterised by a probability density function (PDF) $f(y; \theta)$, which is parameterised by a parameter vector $\theta \in \mathbb{T} \subseteq \mathbb{R}^p$, for $p \in \mathbb{N}$. Specifically, the sequence of data arises from a DGP that is characterised by an unknown parameter vector $\theta_0 \in \mathbb{T}$. Using the sequence $(Y_i)_{i=1}^n$, one wishes to sequentially estimate the parameter vector $\theta_0$. The method of Cappé & Moulines (2009) assumes the following restrictions regarding the DGP of $Y$.
(A1) The complete-data likelihood corresponding to the pair $X$ is of the exponential family form:
\[ f_c(x; \theta) = h(x) \exp\left\{ [s(x)]^\top \phi(\theta) - \psi(\theta) \right\}, \quad (1) \]
where $h: \mathbb{R}^{d+l} \to [0, \infty)$, $\psi: \mathbb{R}^p \to \mathbb{R}$, $s: \mathbb{R}^{d+l} \to \mathbb{R}^q$, and $\phi: \mathbb{R}^p \to \mathbb{R}^q$, for $q \in \mathbb{N}$.

(A2) The function
\[ \bar{s}(y; \theta) = \mathrm{E}_\theta\left[ s(X) \mid Y = y \right] \quad (2) \]
is well-defined for all $y \in \mathbb{Y}$ and $\theta \in \mathbb{T}$, where $\mathrm{E}_\theta[\cdot \mid Y = y]$ is the conditional expectation under the assumption that $X$ arises from the DGP characterised by $\theta$.

(A3) There is a convex subset $\mathbb{S} \subseteq \mathbb{R}^q$, which satisfies the properties:
(i) for all $s \in \mathbb{S}$, $y \in \mathbb{Y}$, and $\theta \in \mathbb{T}$, $(1-\gamma)s + \gamma \bar{s}(y; \theta) \in \mathbb{S}$, for any $\gamma \in (0,1)$, and
(ii) for any $s \in \mathbb{S}$, the function
\[ Q(s; \theta) = s^\top \phi(\theta) - \psi(\theta) \quad (3) \]
has a unique global maximiser on $\mathbb{T}$, which is denoted by
\[ \bar{\theta}(s) = \operatorname*{arg\,max}_{\theta \in \mathbb{T}} Q(s; \theta). \quad (4) \]
Let $(\gamma_i)_{i=1}^n$ be a sequence of learning rates in $(0,1)$ and let $\theta^{(0)} \in \mathbb{T}$ be an initial estimate of $\theta_0$. For each $i \in [n]$, the method of Cappé & Moulines (2009) proceeds by computing
\[ s^{(i)} = \gamma_i \bar{s}\left( Y_i; \theta^{(i-1)} \right) + (1 - \gamma_i) s^{(i-1)}, \quad (5) \]
and
\[ \theta^{(i)} = \bar{\theta}\left( s^{(i)} \right), \quad (6) \]
where $s^{(0)} = \bar{s}(Y_1; \theta^{(0)})$. As an output, the algorithm produces a sequence of estimators of $\theta_0$: $(\theta^{(i)})_{i=1}^n$.
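Updates (5) and (6) lend themselves to a compact generic implementation. The following Python sketch (the paper's own scripts are written in R; Python is used here purely for illustration, and the index bookkeeping is simplified slightly relative to the paper) treats the maps $\bar{s}$ and $\bar{\theta}$ as pluggable functions. The toy instance at the end is a hypothetical normal-mean model with no latent variable, for which $\bar{s}(y; \theta) = y$ and $\bar{\theta}(s) = s$, so the recursion reduces to a Robbins–Monro average.

```python
import numpy as np

def online_em(y_seq, s_bar, theta_bar, theta0, gamma):
    """Generic online EM recursion in the style of Cappé & Moulines (2009).

    y_seq     : iterable of observations Y_1, Y_2, ...
    s_bar     : function (y, theta) -> conditional expected sufficient statistic
    theta_bar : function s -> maximiser of Q(s; theta) over T
    theta0    : initial parameter estimate theta^(0)
    gamma     : function i -> learning rate gamma_i in (0, 1)
    """
    theta = theta0
    s = None
    for i, y in enumerate(y_seq, start=1):
        if s is None:
            s = s_bar(y, theta)                      # s^(0) = s_bar(Y_1; theta^(0))
        else:
            g = gamma(i)
            s = g * s_bar(y, theta) + (1.0 - g) * s  # update (5)
        theta = theta_bar(s)                         # update (6)
    return theta

# Toy instance: estimate a normal mean; s_bar ignores theta and theta_bar
# is the identity, so the recursion is a weighted running average.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50000)
est = online_em(y, lambda yi, th: yi, lambda s: s, 0.0, lambda i: i ** -0.6)
```

Under this toy model, `est` approaches the generative mean (here 2.0) as the sample size grows.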
Suppose that the true DGP of $(Y_i)_{i=1}^n$ is characterised by the probability measure $\Pr_0$, where we write $\mathrm{E}_{\Pr_0}$ to indicate the expectation according to this DGP. We write
\[ \eta(s) = \mathrm{E}_{\Pr_0}\left[ \bar{s}\left( Y; \bar{\theta}(s) \right) \right] - s, \]
and define the roots of $\eta$ as $\mathbb{O} = \{ s \in \mathbb{S} : \eta(s) = 0 \}$. Further, let
\[ l(\theta) = \mathrm{E}_{\Pr_0}\left[ \log f(Y; \theta) \right] \]
and define the sets
\[ \mathbb{U}_{\mathbb{O}} = \left\{ l\left( \bar{\theta}(s) \right) : s \in \mathbb{O} \right\} \quad \text{and} \quad \mathbb{M}_{\mathbb{T}} = \left\{ \hat{\theta} \in \mathbb{T} : \frac{\partial l}{\partial \theta}(\hat{\theta}) = 0 \right\}. \]
Denote the distance between the real vector $a$ and the set $\mathbb{B}$ by
\[ \operatorname{dist}(a, \mathbb{B}) = \inf_{b \in \mathbb{B}} \| a - b \|, \]
where $\|\cdot\|$ is the Euclidean norm. Further, denote the complement of the set $\mathbb{B}$ by $\mathbb{B}^\complement$, and make the following assumptions:
(A4) The set $\mathbb{T}$ is convex and open, and $\phi$ and $\psi$ are both twice continuously differentiable with respect to $\theta \in \mathbb{T}$.

(A5) The function $\bar{\theta}$ is continuously differentiable with respect to $s \in \mathbb{S}$.

(A6) For some $r > 2$ and compact subset $\mathbb{K} \subset \mathbb{S}$,
\[ \sup_{s \in \mathbb{K}} \mathrm{E}_{\Pr_0} \left\| \bar{s}\left( Y; \bar{\theta}(s) \right) \right\|^r < \infty. \]

(A7) The sequence $(\gamma_i)_{i=1}^\infty$ satisfies the condition that $\gamma_i \in (0,1)$, for each $i \in \mathbb{N}$,
\[ \sum_{i=1}^\infty \gamma_i = \infty, \quad \text{and} \quad \sum_{i=1}^\infty \gamma_i^2 < \infty. \]

(A8) The value $s^{(0)}$ is in $\mathbb{S}$, and, with probability 1,
\[ \limsup_{i \to \infty} \left\| s^{(i)} \right\| < \infty, \quad \text{and} \quad \liminf_{i \to \infty} \operatorname{dist}\left( s^{(i)}, \mathbb{S}^\complement \right) > 0. \]

(A9) The set $\mathbb{U}_{\mathbb{O}}$ is nowhere dense.
Under Assumptions (A1)–(A9), Cappé & Moulines (2009) proved that the sequences $(s^{(i)})_{i=1}^\infty$ and $(\theta^{(i)})_{i=1}^\infty$, computed via the algorithm defined by (5) and (6), permit the conclusion that
\[ \lim_{i \to \infty} \operatorname{dist}\left( s^{(i)}, \mathbb{O} \right) = 0, \quad \text{and} \quad \lim_{i \to \infty} \operatorname{dist}\left( \theta^{(i)}, \mathbb{M}_{\mathbb{T}} \right) = 0, \quad (7) \]
with probability 1, when computed using an IID sequence $(Y_i)_{i=1}^\infty$ with DGP characterised by the measure $\Pr_0$ (cf. Cappé & Moulines 2009, Thm. 1).
The result can be interpreted as a type of consistency for the estimator $\theta^{(n)}$, as $n \to \infty$. Indeed, if $\Pr_0$ can be characterised by the PDF $f(y; \theta_0)$ in the family of PDFs $f(y; \theta)$, where the family is identifiable in the sense that $f(y; \theta) = f(y; \theta_0)$ for all $y \in \mathbb{Y}$ if and only if $\theta = \theta_0$, then $\theta_0 \in \mathbb{M}_{\mathbb{T}}$ and $\theta_0$ is the only value maximising $l(\cdot)$. If there is no other stationary point, then the result guarantees that $\theta^{(n)} \to \theta_0$, as $n \to \infty$. If the family is not identifiable, then $\mathbb{M}_{\mathbb{T}}$ could contain several maximisers of $l(\cdot)$ in addition to $\theta_0$, as well as other stationary points. This situation is illustrated in Section 4. In any case, a lack of identifiability does not affect the nature of $\mathbb{O}$, due to Proposition 1 in Cappé & Moulines (2009), which states that any two different parameter values $\theta'$ and $\theta''$ in $\mathbb{M}_{\mathbb{T}}$, with $f(\cdot; \theta') = f(\cdot; \theta'')$, correspond to the same $s \in \mathbb{O}$.
It is evident that, when satisfied, Assumptions (A1)–(A9) provide a strong guarantee of correctness for the online EM algorithm, and thus it is desirable to validate them in any particular application. In this work, we are particularly interested in the validation of (A5), since it is a key assumption in the algorithm of Cappé & Moulines (2009), and variants of it are also assumed in order to provide theoretical guarantees for many online and mini-batch EM-like algorithms, including those that appear in the works cited above.
In typical applications, the validation of (A5) is conducted by demonstrating that $Q(s; \theta)$ can be maximised in closed form, and then showing that the closed form maximiser $\bar{\theta}(s)$ is a continuously differentiable function, which hence satisfies (A5). This can be seen, for example, in the Poisson finite mixture model and normal finite mixture regression model examples of Cappé & Moulines (2009), and the exponential finite mixture model and multivariate normal finite mixture model examples of Nguyen, Forbes & McLachlan (2020).
However, in some important scenarios, no closed form solution for $\bar{\theta}(s)$ exists, such as when $Y$ arises from beta or gamma distributions, when $Y$ has a Boltzmann law (cf. Sundberg 2019, Ch. 6), such as when $Y$ arises from a fully-visible Boltzmann machine (cf. Hyvarinen 2006, and Bagnall et al. 2020), or when data arise from variance mixtures of normal distributions. In such cases, by (4), we can define $\bar{\theta}(s)$ as the root of the first-order condition
\[ J_\phi(\theta) s - \frac{\partial \psi}{\partial \theta}(\theta) = 0, \quad (8) \]
where $J_\phi(\theta) = \partial \phi / \partial \theta$ is the Jacobian of $\phi$, with respect to $\theta$, as a function of $\theta$.
To verify (A5), we are required to show that there exists a continuously differentiable function $\bar{\theta}(s)$ that satisfies (8), in the sense that
\[ J_\phi\left( \bar{\theta}(s) \right) s - \frac{\partial \psi}{\partial \theta}\left( \bar{\theta}(s) \right) = 0, \]
for all $s \in \mathbb{S}$. Such a result can be established via the use of a global implicit function theorem.
Recently, global implicit function theorems have been used in the theory of indirect inference to establish limit theorems for implicitly defined estimators (see, e.g., Phillips 2012, and Frazier, Oka & Zhu 2019). In this work, we demonstrate how the global implicit function theorem of Arutyunov & Zhukovskiy (2019) can be used to validate (A5) when applying the online EM algorithm of Cappé & Moulines (2009) to compute the MLE when data arise from the beta, gamma, and Student distributions, or from a fully-visible Boltzmann machine. Simulation results are presented to provide empirical evidence towards the exhibition of the theoretical guarantee (7). Discussions are also provided regarding the implementation of online EM algorithms for mean, variance, and mean and variance mixtures of normal distributions (see, e.g., Lee & McLachlan 2021 for details regarding such distributions). More generally, we show that it is straightforward to consider mixtures of the aforementioned distributions. We show that the problem of checking Assumption (A5) for such mixtures reduces to checking (A5) for their component distributions.
The remainder of the paper proceeds as follows. In Section 2, we provide a discussion regarding global implicit function theorems and present the main tool that we will use for the verification of (A5). In Section 3, we consider finite mixtures of distributions with complete-data likelihoods in the exponential family form. Here, we also illustrate the applicability of the global implicit function theorem to the validation of (A5) in the context of the online EM algorithm for the computation of the MLE in the gamma and Student distribution contexts. Additional illustrations for the beta distribution and the fully-visible Boltzmann machine model are provided in Appendices A.2 and A.3. Numerical simulations are presented in Section 4. Conclusions are finally drawn in Section 5. Additional technical results and illustrations are provided in the Appendix.
2. Global implicit function theorems
Implicit function theorems are among the most important analytical results from the perspective of applied mathematics; see, for example, the extensive exposition of Krantz & Parks (2003). The following result from Zhang & Ge (2006) is a typical (local) implicit function theorem for real-valued functions.
Theorem 1 (Local implicit function theorem). Let $g: \mathbb{R}^q \times \mathbb{R}^p \to \mathbb{R}^p$ be a function and $\mathbb{V} \times \mathbb{W} \subset \mathbb{R}^q \times \mathbb{R}^p$ be a neighbourhood of $(v_0, w_0) \in \mathbb{R}^q \times \mathbb{R}^p$, for $p, q \in \mathbb{N}$. Further, let $g$ be continuous on $\mathbb{V} \times \mathbb{W}$ and continuously differentiable with respect to $w \in \mathbb{W}$, for each $v \in \mathbb{V}$. If
\[ g(v_0, w_0) = 0 \quad \text{and} \quad \det\left[ \frac{\partial g}{\partial w}(v_0, w_0) \right] \ne 0, \]
then there exists a neighbourhood $\mathbb{V}_0 \subset \mathbb{V}$ of $v_0$ and a unique continuous mapping $\chi: \mathbb{V}_0 \to \mathbb{R}^p$, such that $g(v, \chi(v)) = 0$ and $\chi(v_0) = w_0$. Moreover, if $g$ is also continuously differentiable, jointly with respect to $(v, w) \in \mathbb{V} \times \mathbb{W}$, then $\chi$ is also continuously differentiable.
We note that Theorem 1 is local in the sense that the existence of the continuously differentiable mapping $\chi$ is only guaranteed within an unknown neighbourhood $\mathbb{V}_0$ of the root $v_0$. This is insufficient for the validation of (A5), since (in context) the existence of a continuously differentiable mapping is required to be guaranteed on all of $\mathbb{V}$, regardless of the location of the root $v_0$.
Since the initial works of Sandberg (1981) and Ichiraku (1985), the study of conditions under which global versions of Theorem 1 can be established has become popular in the mathematics literature. Some state-of-the-art variants of global implicit function theorems for real-valued functions can be found in the works of Zhang & Ge (2006), Galewski & Koniorczyk (2016), Cristea (2017), and Arutyunov & Zhukovskiy (2019), among many others. In this work, we make use of the following version of Arutyunov & Zhukovskiy (2019, Thm. 6), and note that other circumstances may call for different global implicit function theorems.
Theorem 2 (Global implicit function theorem). Let $g: \mathbb{V} \times \mathbb{R}^p \to \mathbb{R}^r$, where $\mathbb{V} \subseteq \mathbb{R}^q$ and $p, q, r \in \mathbb{N}$, and make the following assumptions:

(B1) The mapping $g$ is continuous.
(B2) The mapping $g(v, \cdot)$ is twice continuously differentiable with respect to $w \in \mathbb{R}^p$, for each $v \in \mathbb{V}$.
(B3) The mappings $\partial g / \partial w$ and $\partial^2 g / \partial w^2$ are continuous, jointly with respect to $(v, w) \in \mathbb{V} \times \mathbb{R}^p$.
(B4) There exists a root $(v_0, w_0) \in \mathbb{V} \times \mathbb{R}^p$ of the mapping $g$, in the sense that $g(v_0, w_0) = 0$.
(B5) For all pairs $(v_0, w_0) \in \mathbb{V} \times \mathbb{R}^p$, the linear operator defined by the Jacobian evaluated at $(v_0, w_0)$, $\partial g / \partial w (v_0, w_0)$, is surjective.

Under Assumptions (B1)–(B5), there exists a continuous mapping $\chi: \mathbb{V} \to \mathbb{R}^p$, such that $\chi(v_0) = w_0$ and $g(v, \chi(v)) = 0$, for any $v \in \mathbb{V}$. Furthermore, if $\mathbb{V}$ is an open subset of $\mathbb{R}^q$ and the mapping $g$ is twice continuously differentiable, jointly with respect to $(v, w) \in \mathbb{V} \times \mathbb{R}^p$, then $\chi$ can be chosen to be continuously differentiable.
We note that the stronger conclusions of Theorem 2 require stronger hypotheses on the function $g$, when compared to Theorem 1. Namely, Theorem 2 requires $g$ to have continuous second-order derivatives in all arguments, whereas only the first derivatives are required in Theorem 1. Assumption (B5) may be abstract in nature, but it can be replaced by the practical condition that
\[ \det\left[ \frac{\partial g}{\partial w}(v_0, w_0) \right] \ne 0, \quad (9) \]
for all $(v_0, w_0) \in \mathbb{V} \times \mathbb{R}^p$, when $p = r$, since a square matrix operator is bijective if and only if it is invertible. When $p > r$, Assumption (B5) can be validated by checking that
\[ \operatorname{rank}\left[ \frac{\partial g}{\partial w}(v_0, w_0) \right] = r, \]
for all $(v_0, w_0) \in \mathbb{V} \times \mathbb{R}^p$ (cf. Yang 2015, Def. 2.1). We thus observe that the assumptions of Theorem 2, although strong, can often be relatively simple to check.
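In numerical work, the determinant condition (9) and the rank condition can both be checked at any candidate point by computing the rank of the Jacobian. A minimal Python sketch (the matrices below are hypothetical illustrations, not Jacobians derived from any model in this paper):

```python
import numpy as np

def is_surjective(J, tol=1e-10):
    """(B5) holds at a point iff the r x p Jacobian dg/dw has full row rank r;
    when p == r this reduces to the determinant condition (9)."""
    J = np.asarray(J, dtype=float)
    r = J.shape[0]
    return np.linalg.matrix_rank(J, tol=tol) == r

# Square case (p = r): surjective iff the determinant is nonzero, cf. (9).
A = np.array([[2.0, 1.0], [0.0, 3.0]])   # det = 6, invertible
B = np.array([[1.0, 2.0], [2.0, 4.0]])   # det = 0, rank deficient
# Rectangular case (p > r): full row rank suffices.
C = np.array([[1.0, 0.0, 5.0]])          # rank 1 = r
```

Here `A` and `C` pass the check, while `B` fails it.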
3. Applications of the global implicit function theorem

We now proceed to demonstrate how Theorem 2 can be used to validate Assumption (A5) for the application of the online EM algorithm in various finite mixture scenarios of interest.

We recall the notation from Section 1. Suppose that $Y$ is a random variable that has a DGP characterised by a $K \in \mathbb{N}$ component finite mixture model (cf. McLachlan & Peel 2000), where each mixture component has a PDF of the form $f(y; \vartheta_z)$, for $z \in [K]$, and $f(y; \vartheta_z)$ has exponential family form, as defined in (A1). That is, $Y$ has PDF
\[ f(y; \theta) = \sum_{z=1}^{K} \pi_z f(y; \vartheta_z) = \sum_{z=1}^{K} \pi_z h(y) \exp\left\{ [s(y)]^\top \phi(\vartheta_z) - \psi(\vartheta_z) \right\}, \quad (10) \]
where $\pi_z > 0$ and $\sum_{z=1}^{K} \pi_z = 1$, and $\theta$ contains the concatenation of the elements $(\pi_z, \vartheta_z)$, for $z \in [K]$.
Remark 1. We note that the component density $f(y; \vartheta_z)$ in (10) can be replaced by a complete-data likelihood $f_c(x'; \vartheta_z)$ of exponential family form, where $X' = (Y, U)^\top$ is a further latent variable representation via the augmented random variable $U$, and where $Y$ is the observed random variable, as previously denoted. This is the case when $Y$ arises from a finite mixture of Student distributions. Although the Student distribution is not within the exponential family, its complete-data likelihood, when the distribution is considered as a Gaussian scale mixture, can be written as a product of a scaled Gaussian PDF and a gamma PDF, which can be expressed in an exponential family form. We illustrate this scenario in Section 3.2.1.
Remark 2. Another type of missing (latent) variable occurs when we have to face missing observations. We can consider the case of IID vectors of observations, where some of the elements are missing. Checking Assumption (A5) is the same as in the fully observed case, but the computation of $\bar{s}$ is different, as it requires an account of the missing data imputation. An illustration in the multivariate Gaussian case is given in Appendix A.7.
Let $Z \in [K]$ be a categorical latent random variable, such that $\Pr(Z = z) = \pi_z$. Then, upon defining $X^\top = (Y^\top, Z)$, we can write the complete-data likelihood in the exponential family form (cf. Nguyen, Forbes & McLachlan 2020, Prop. 2):
\[ f_c(x; \theta) = h(y) \exp\left[ \sum_{\zeta=1}^{K} \mathbb{1}\{z = \zeta\} \left( \log \pi_\zeta + [s(y)]^\top \phi(\vartheta_\zeta) - \psi(\vartheta_\zeta) \right) \right] = h_m(x) \exp\left\{ [s_m(x)]^\top \phi_m(\theta) - \psi_m(\theta) \right\}, \]
where the subscript $m$ stands for `mixture', and where $h_m(x) = h(y)$, $\psi_m(\theta) = 0$,
\[ s_m(x) = \begin{bmatrix} \mathbb{1}\{z = 1\} \\ \mathbb{1}\{z = 1\} s(y) \\ \vdots \\ \mathbb{1}\{z = K\} \\ \mathbb{1}\{z = K\} s(y) \end{bmatrix}, \quad \text{and} \quad \phi_m(\theta) = \begin{bmatrix} \log \pi_1 - \psi(\vartheta_1) \\ \phi(\vartheta_1) \\ \vdots \\ \log \pi_K - \psi(\vartheta_K) \\ \phi(\vartheta_K) \end{bmatrix}. \quad (11) \]
Recall that $\theta$ contains the pairs $(\pi_z, \vartheta_z)$ ($z \in [K]$) and that $q \in \mathbb{N}$ is the dimension of the component-specific sufficient statistic $s(y)$. We introduce the following notation, for $z \in [K]$:
\[ s_z^\top = (s_{1z}, \dots, s_{qz}), \quad \text{and} \quad s_m^\top = (s_{01}, s_1^\top, \dots, s_{0K}, s_K^\top), \]
where $s_z \in \mathbb{S}$, for an appropriate open convex set $\mathbb{S}$, as defined in (A3). Then $s_m \in \mathbb{S}_m$, where $\mathbb{S}_m = ((0, \infty) \times \mathbb{S})^K$ is an open and convex product space.
As noted by Cappé & Moulines (2009), the finite mixture model demonstrates the importance of the role played by the set $\mathbb{S}$ (and thus $\mathbb{S}_m$) in Assumption (A3). In the sequel, we require that $s_{0z}$ be strictly positive, for each $z \in [K]$. These constraints define $\mathbb{S}_m$, which is open and convex if $\mathbb{S}$ is open and convex. Via (11), the objective function $Q_m$ for the mixture complete-data likelihood, of form (3), can be written as
\[ Q_m(s_m, \theta) = s_m^\top \phi_m(\theta) = \sum_{z=1}^{K} s_{0z} \left( \log \pi_z - \psi(\vartheta_z) \right) + s_z^\top \phi(\vartheta_z). \]
Whatever the form of the component PDF, the maximisation with respect to $\pi_z$ yields the mapping
\[ \bar{\pi}_z(s_m) = \frac{s_{0z}}{\sum_{\zeta=1}^{K} s_{0\zeta}}. \]
Then, for each $z \in [K]$,
\[ \frac{\partial Q_m}{\partial \vartheta_z}(s_m, \theta) = -s_{0z} \frac{\partial \psi}{\partial \vartheta_z}(\vartheta_z) + J_\phi(\vartheta_z) s_z = s_{0z} \left[ J_\phi(\vartheta_z) \frac{s_z}{s_{0z}} - \frac{\partial \psi}{\partial \vartheta_z} \right] = s_{0z} \frac{\partial Q}{\partial \vartheta_z}\left( \frac{s_z}{s_{0z}}, \vartheta_z \right), \]
where $Q$ is the objective function of form (3) corresponding to the component PDFs. Since $s_{0z} > 0$, for all $z \in [K]$, it follows that the maximisation of $Q_m$ can be conducted by solving
\[ \frac{\partial Q}{\partial \vartheta_z}\left( \frac{s_z}{s_{0z}}, \vartheta_z \right) = 0, \]
with respect to $\vartheta_z$, for each $z$. Therefore, in order to verify (A5) for the maximiser of the mixture objective $Q_m$, it is enough to show that, for the component PDFs, there exists a continuously differentiable root $\bar{\vartheta}(s)$ of the equation above, with respect to $s$. That is, we can set
\[ \bar{\theta}_m(s_m) = \begin{bmatrix} \bar{\pi}_1(s_m) \\ \bar{\vartheta}(s_1 / s_{01}) \\ \vdots \\ \bar{\pi}_K(s_m) \\ \bar{\vartheta}(s_K / s_{0K}) \end{bmatrix}, \]
which is continuously differentiable if $\bar{\vartheta}$ is continuously differentiable. In the sequel, we illustrate how Theorem 2 can be applied, with $\mathbb{V} = \mathbb{S}$, to establish the existence of continuously differentiable functions $\bar{\vartheta}$ in various scenarios.
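The reduction above, that is, closed-form weights $\bar{\pi}_z(s_m) = s_{0z} / \sum_\zeta s_{0\zeta}$ together with component-wise solves at the normalised statistics $s_z / s_{0z}$, can be sketched compactly. This is an illustrative Python sketch; `theta_bar_component` stands in for the model-specific map $\bar{\vartheta}$, and the toy component map used at the end (the identity) is purely hypothetical.

```python
import numpy as np

def theta_bar_mixture(s0, s_comp, theta_bar_component):
    """Assemble the mixture maximiser theta_bar_m from the statistics of (11).

    s0                  : length-K vector of s_{0z} > 0
    s_comp              : K x q array whose z-th row is s_z
    theta_bar_component : map s -> component maximiser, applied at s_z / s_{0z}
    """
    s0 = np.asarray(s0, dtype=float)
    pi = s0 / s0.sum()  # closed-form mixture weight update pi_bar_z
    comps = [theta_bar_component(sz / s0z) for s0z, sz in zip(s0, s_comp)]
    return pi, comps

# Toy illustration with K = 2 and a hypothetical one-parameter component
# whose maximiser is the identity map on the normalised statistic.
pi, comps = theta_bar_mixture([2.0, 6.0], np.array([[4.0], [3.0]]),
                              lambda s: s)
```

With these inputs, the weights are proportional to $s_{01}$ and $s_{02}$, and each component estimate is its normalised statistic.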
3.1. The gamma distribution

We firstly suppose that $Y \in (0, \infty)$ is characterised by the PDF
\[ f(y; \theta) = \varsigma(y; k, \theta) = \frac{1}{\Gamma(k) \theta^k} y^{k-1} \exp\{-y / \theta\}, \]
where $\theta^\top = (\theta, k) \in (0, \infty)^2$, which has an exponential family form, with $h(y) = 1$, $\psi(\theta) = \log \Gamma(k) + k \log \theta$, $s(y) = (\log y, y)^\top$, and $\phi(\theta) = (k - 1, -1/\theta)^\top$. Here, $\Gamma(\cdot)$ denotes the gamma function. The objective function $Q$ in (A3) can be written as
\[ Q(s; \theta) = s_1 (k - 1) - \frac{s_2}{\theta} - \log \Gamma(k) - k \log \theta, \]
where $s^\top = (s_1, s_2) \in \mathbb{R} \times (0, \infty)$.
Using the first-order condition (8), we can define $\bar{\theta}$ as a solution of the system of equations:
\[ \frac{\partial Q}{\partial k} = s_1 - \Psi^{(0)}(k) - \log \theta = 0, \quad (12) \]
\[ \frac{\partial Q}{\partial \theta} = \frac{s_2}{\theta^2} - \frac{k}{\theta} = 0, \quad (13) \]
where $\Psi^{(r)}(k) = \mathrm{d}^{r+1} \log \Gamma(k) / \mathrm{d} k^{r+1}$ is the $r$th-order polygamma function (see, e.g., Olver et al. 2010, Sec. 5.15).
The existence and uniqueness of $\bar{\theta}$ can be proved using Proposition 2 (from Appendix A.1). Firstly, note that $\phi(\theta) = (k - 1, -1/\theta)^\top \in \mathbb{P} = (-1, \infty) \times (-\infty, 0)$, which is an open set, and hence we have regularity. Then, setting $\Phi^\top = (\Phi_1, \Phi_2)$, we obtain
\[ \delta(\Phi) = \begin{bmatrix} \Psi^{(0)}(1 + \Phi_1) + \log(-1/\Phi_2) \\ -(1 + \Phi_1)/\Phi_2 \end{bmatrix}. \]
For any $s^\top = (s_1, s_2)$, we can solve $\delta(\Phi) = s$ with respect to $\Phi$, which yields $\Phi_2 = -(1 + \Phi_1)/s_2$ and requires the root of $\Psi^{(0)}(1 + \Phi_1) + \log s_2 - \log(1 + \Phi_1) = s_1$, which is solvable for any $s$ satisfying $s_1 - \log s_2 < 0$, since both $\Psi^{(0)}(\cdot)$ and $\log(\cdot)$ are continuous, and $\Psi^{(0)}(a) - \log(a)$ is increasing in $a \in (0, \infty)$ and has limits of $-\infty$ and $0$, as $a \to 0$ and $a \to \infty$, respectively, by Guo et al. (2015, Eqns. 1.5 and 1.6). Thus, $\bar{\theta}$ exists and is unique when
\[ s \in \mathbb{S} = \left\{ s = (s_1, s_2) \in \mathbb{R} \times (0, \infty) : s_1 - \log s_2 < 0 \right\}. \quad (14) \]
Assuming $s \in \mathbb{S}$, we can proceed to solve (13) with respect to $\theta$, to obtain
\[ \theta = \frac{s_2}{k}, \quad (15) \]
which substitutes into (12) to yield
\[ s_1 - \Psi^{(0)}(k) - \log s_2 + \log k = 0. \quad (16) \]
Notice that $\theta$, as defined by (15), is continuously differentiable with respect to $k$, and thus if $k$ is a continuously differentiable function of $s$, then $\theta$ is also a continuously differentiable function of $s$. Hence, we are required to show that there exists a continuously differentiable root of (16), with respect to $k$, as a function of $s$.
We wish to apply Theorem 2 to show that there exists a continuously differentiable solution of (16). Let
\[ g(s, w) = s_1 - \Psi^{(0)}(e^w) - \log s_2 + w, \quad (17) \]
where $k = e^w$. We reparameterise with respect to $w$, since Theorem 2 requires the parameter to be defined over the entire domain $\mathbb{R}$. Notice that (B1)–(B3) are easily satisfied by considering the existence and continuity of $\Psi^{(r)}$ over $(0, \infty)$, for all $r \ge 0$. Assumption (B4) is satisfied when $s \in \mathbb{S}$, since it is satisfied if $\bar{\theta}$ exists. Next, to assess (B5), we require the derivative:
\[ \frac{\partial g}{\partial w} = 1 - e^w \Psi^{(1)}(e^w) = 1 - k \Psi^{(1)}(k). \quad (18) \]
By the main result of Ronning (1986), we have the fact that $-k \Psi^{(1)}(k)$ is negative and strictly increasing for all $k > 0$. Using an asymptotic expansion, it can be shown that $-k \Psi^{(1)}(k) \to -1$, as $k \to \infty$ (see the proof of Batir 2005, Lem. 1.2). Consequently, $-k \Psi^{(1)}(k) < -1$ for all $k > 0$, and thus (18) is negative for all $w$, implying that (B5) is validated.
Finally, we establish the existence of a continuously differentiable function $\chi(s)$, such that $g(s, \chi(s)) = 0$, by noting that $g$ is twice continuously differentiable jointly in $(s, w)$. We thus validate (A5) in this scenario by setting
\[ \bar{\theta}(s) = \begin{bmatrix} s_2 / \exp\{\chi(s)\} \\ \exp\{\chi(s)\} \end{bmatrix}, \]
where $\chi(s)$ is a continuously differentiable root of (17), as guaranteed by Theorem 2.
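Numerically, the root $\chi(s)$, or equivalently $k = \exp\{\chi(s)\}$ solving (16), can be computed by bracketed root-finding, since the left-hand side of (16) is strictly decreasing in $k$ (the derivative (18) is negative). A Python sketch, assuming SciPy is available; the sanity check at the end uses the population statistics of a gamma$(k = 9, \theta = 0.5)$ distribution:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def theta_bar_gamma(s1, s2):
    """Solve the first-order conditions (12)-(13) for the gamma model:
    find k from (16), then theta = s2 / k from (15).
    Requires s in the set S of (14), i.e. s2 > 0 and s1 - log(s2) < 0."""
    assert s2 > 0 and s1 - np.log(s2) < 0, "s must lie in the set S of (14)"
    f = lambda k: s1 - digamma(k) - np.log(s2) + np.log(k)
    # f is strictly decreasing in k, with f -> +inf as k -> 0+ and
    # f -> s1 - log(s2) < 0 as k -> inf, so a sign change always exists.
    hi = 1.0
    while f(hi) > 0:
        hi *= 2.0
    k = brentq(f, 1e-8, hi)
    return s2 / k, k  # (theta, k)

# Sanity check at population statistics of a gamma(k = 9, theta = 0.5) law:
# s1 = E[log Y] = digamma(9) + log(0.5) and s2 = E[Y] = 9 * 0.5.
theta_hat, k_hat = theta_bar_gamma(digamma(9.0) + np.log(0.5), 4.5)
```

At these exact population statistics, the solver recovers $(\theta, k) = (0.5, 9)$ up to root-finding tolerance.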
3.2. Variance mixtures of normal distributions

Variance, or scale, mixtures of normal distributions refer to the family of distributions with PDFs that are generated by scaling the covariance matrix of a Gaussian PDF by a positive scalar random variable $U$. A recent review of such distributions can be found in Lee & McLachlan (2021). Although such distributions are not necessarily in the exponential family, we show that they can be handled within the online EM setting presented in this paper. Indeed, if $U$ admits an exponential family form, a variance mixture of normal distributions admits a hierarchical representation whose joint distribution, after data augmentation, belongs to the exponential family. We present the general form in this section and illustrate its use by deriving an online EM algorithm for the Student distribution.
Let $f_u(u; \theta_u)$ denote the PDF of $U$, depending on some parameters $\theta_u$, and admitting an exponential family representation
\[ f_u(u; \theta_u) = h_u(u) \exp\left\{ [s_u(u)]^\top \phi_u(\theta_u) - \psi_u(\theta_u) \right\}. \]
If $Y$ is characterised by a variance mixture of normal distributions, then with $x^\top = (y^\top, u)$ and $\theta^\top = (\mu^\top, \operatorname{vec}(\Sigma)^\top, \theta_u^\top)$, we can write $f_c(x; \theta)$ as the product of a scaled Gaussian PDF and $f_u$:
\[ f_c(x; \theta) = \varphi(y; \mu, \Sigma/u) f_u(u; \theta_u), \]
where $\varphi(y; \mu, \Sigma/u)$ is the PDF of a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma/u$. Here, $\operatorname{vec}(\cdot)$ denotes the vectorisation operator, which converts matrices to column vectors.

Using the exponential family forms of both PDFs (see Nguyen, Forbes & McLachlan 2020 for the Gaussian representation), it follows that
\[ f_c(x; \theta) = h(x) \exp\left\{ [s(x)]^\top \phi(\theta) - \psi(\theta) \right\}, \]
where $h(x) = (2\pi/u)^{-d/2} h_u(u)$, $\psi(\theta) = \log \det[\Sigma]/2 + \psi_u(\theta_u)$,
\[ s(x) = \begin{bmatrix} u y \\ u \operatorname{vec}(y y^\top) \\ u \\ s_u(u) \end{bmatrix} \quad \text{and} \quad \phi(\theta) = \begin{bmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \operatorname{vec}(\Sigma^{-1}) \\ -\frac{1}{2} \mu^\top \Sigma^{-1} \mu \\ \phi_u(\theta_u) \end{bmatrix}. \quad (19) \]
Depending on the statistics defining $s_u(u)$, the representation above can be made more compact; see, for example, the Student distribution case, below.
Consider the objective function $Q(s; \theta)$, as per (A3), with $s^\top = (s_1^\top, \operatorname{vec}(S_2)^\top, s_3, s_4^\top)$, where $s_1$ and $s_4$ are real vectors, $S_2$ is a matrix (all of appropriate dimensions), and $s_3$ is a strictly positive scalar. An interesting property is that, whatever the mixing PDF $f_u$, when maximising $Q$, closed-form expressions are available for $\mu$ and $\Sigma$:
\[ \bar{\mu}(s) = \frac{s_1}{s_3}, \quad (20) \]
\[ \bar{\Sigma}(s) = S_2 - \frac{s_1 s_1^\top}{s_3}. \quad (21) \]
The rest of the expression of $\bar{\theta}_u(s)$ depends on the specific choice of $f_u$, as illustrated in the sequel.
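The maps (20) and (21) can be checked directly at population-level statistics: if $Y \mid U = u \sim \mathrm{N}(\mu, \Sigma/u)$, then $s_1 = \mathrm{E}[UY] = \mathrm{E}[U]\mu$, $S_2 = \mathrm{E}[UYY^\top] = \Sigma + \mathrm{E}[U]\mu\mu^\top$, and $s_3 = \mathrm{E}[U]$, so (20) and (21) recover $(\mu, \Sigma)$ exactly, whatever the mixing distribution. A Python sketch (the mixing mean $\mathrm{E}[U] = 1.7$ below is an arbitrary hypothetical value):

```python
import numpy as np

def mu_sigma_bar(s1, S2, s3):
    """Closed-form maximisers (20) and (21) for the Gaussian part of a
    variance mixture of normal distributions."""
    mu = s1 / s3                        # (20)
    Sigma = S2 - np.outer(s1, s1) / s3  # (21)
    return mu, Sigma

# Population check: build (s1, S2, s3) from known (mu0, Sigma0, E[U]) and
# verify that the maps recover (mu0, Sigma0).
mu0 = np.array([1.0, -2.0])
Sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
EU = 1.7  # hypothetical mixing mean E[U]
s1 = EU * mu0
S2 = Sigma0 + EU * np.outer(mu0, mu0)
s3 = EU
mu_hat, Sigma_hat = mu_sigma_bar(s1, S2, s3)
```

The recovered `mu_hat` and `Sigma_hat` equal `mu0` and `Sigma0` exactly, for any positive mixing mean.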
Remark 3. Similarly, other families of distributions can be generated by considering mean mixtures, and mean and variance mixtures, of normal distributions. If the mixing distribution belongs to the exponential family, the corresponding complete-data likelihood also belongs to the exponential family and can be handled in a similar manner as above. These exponential family forms are provided in Appendices A.5 and A.6. Examples of such distributions are listed in Lee & McLachlan (2021) but are not discussed further in this work.
3.2.1. The Student distribution
In contrast to the gamma distribution example, the case of the Student distribution requires the introduction of an additional positive scalar latent variable $U$. The Student distribution is a variance mixture of normal distributions, where $U$ follows a gamma distribution with parameters, in the previous notation of Section 3.1, $k = \nu/2$ and $\theta = 2/\nu$, where $\nu$ is commonly referred to as the degrees-of-freedom (dof) parameter.

Remark 4. When the two parameters of the gamma distribution are not linked via the joint parameter $\nu$, we obtain a slightly more general form of the Student distribution, which is often referred to as the Pearson type VII or generalised Student distribution. Although this latter case may appear more general, the Pearson type VII distribution suffers from an identifiability issue that requires a constraint be placed upon the parameter values, which effectively makes it equivalent in practice to the usual Student distribution. See Fang, Kotz & Ng (1990, Sec. 3.3) for a detailed account regarding the Pearson type VII distribution.
Maximum likelihood estimation of a Student distribution is usually performed via an EM algorithm. As noted in the previous section, the Student distribution does not belong to the exponential family, but the complete-data likelihood, after data augmentation by $U$, does have an exponential family form. Indeed, $f_c(x; \theta)$ is the product of a scaled Gaussian and a gamma PDF, which both belong to the exponential family. More specifically, with $x^\top = (y^\top, u)$ and $\theta^\top = (\mu^\top, \operatorname{vec}(\Sigma)^\top, \nu)$:
\[ f_c(x; \theta) = \varphi(y; \mu, \Sigma/u) \, \varsigma\left( u; \frac{\nu}{2}, \frac{2}{\nu} \right). \]
It follows from the more general case (19) that
\[ f_c(x; \theta) = h(x) \exp\left\{ [s(x)]^\top \phi(\theta) - \psi(\theta) \right\}, \]
where $h(x) = (2\pi/u)^{-d/2}$, $\psi(\theta) = \log \det[\Sigma]/2 + \log \Gamma(\nu/2) - (\nu/2) \log(\nu/2)$,
\[ s(x) = \begin{bmatrix} u y \\ u \operatorname{vec}(y y^\top) \\ u \\ \log u \end{bmatrix}, \quad \text{and} \quad \phi(\theta) = \begin{bmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \operatorname{vec}(\Sigma^{-1}) \\ -\frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{\nu}{2} \\ \frac{\nu}{2} - 1 \end{bmatrix}. \quad (22) \]
The closed-form expressions for $\bar{\mu}$ and $\bar{\Sigma}$ are given in (20) and (21), while for the dof parameter, we obtain similar equations as in Section 3.1, which leads to defining $\bar{\nu}(s)$ as the solution, with respect to $\nu$, of
\[ s_4 - \Psi^{(0)}\left( \frac{\nu}{2} \right) - s_3 + 1 + \log \frac{\nu}{2} = 0. \]
With the necessary restrictions on $\mathbb{S}$ (i.e., that $s_3 > 0$ and that $S_2$ be symmetric positive definite), the same arguments as in Section 3.1 apply and provide verification of (A5). A complete derivation of the online EM algorithm is detailed in Appendix A.4 and a numerical illustration is provided in the next section.
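The dof equation has the same monotone structure as (16): its left-hand side is strictly decreasing in $\nu$ and tends to $+\infty$ as $\nu \to 0$, so bracketed root-finding applies. A Python sketch, assuming SciPy is available; the population check uses the fact that, for $U \sim \mathrm{Gamma}(k = \nu_0/2, \theta = 2/\nu_0)$, $s_3 = \mathrm{E}[U] = 1$ and $s_4 = \mathrm{E}[\log U] = \Psi^{(0)}(\nu_0/2) + \log(2/\nu_0)$:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def nu_bar(s3, s4):
    """Solve s4 - digamma(nu/2) - s3 + 1 + log(nu/2) = 0 for the Student
    degrees-of-freedom parameter (cf. Section 3.2.1)."""
    f = lambda nu: s4 - digamma(nu / 2.0) - s3 + 1.0 + np.log(nu / 2.0)
    # f is strictly decreasing in nu and tends to +inf as nu -> 0+, so a
    # bracketing interval can be found by doubling the upper endpoint.
    hi = 1.0
    while f(hi) > 0:
        hi *= 2.0
    return brentq(f, 1e-8, hi)

# Population check with nu0 = 3: s3 = E[U] = 1 and
# s4 = E[log U] = digamma(nu0/2) + log(2/nu0).
nu0 = 3.0
nu_hat = nu_bar(1.0, digamma(nu0 / 2.0) + np.log(2.0 / nu0))
```

At these exact population statistics, the solver recovers $\nu_0$ up to root-finding tolerance.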
4. Numerical Simulations

We now present empirical evidence towards the exhibition of the consistency conclusion (7) of Cappé & Moulines (2009, Thm. 1). The online EM algorithm is illustrated on the scenarios described in Section 3; that is, for gamma and Student mixtures. For the Student distribution example, the estimation of a single such distribution already requires a latent variable representation, due to the scale mixture form, and the online EM algorithm for this case is provided in Appendix A.4. Both the classes of gamma and Student mixtures have been shown to be identifiable; see, for instance, Teicher (1963, Prop. 2) regarding finite gamma mixtures and Holzmann, Munk & Gneiting (2006, Example 1) regarding finite Student mixtures. If observations are generated from one of these mixtures, checking for evidence that (7) is true is equivalent to checking that the algorithm generates a sequence of parameters that converges to the parameter values used for the simulation. In contrast, beta mixtures are not identifiable, as proved in Ahmad & Al-Hussaini (1982). Boltzmann machine mixtures are also not identifiable. In these cases, the algorithm may generate parameter sequences that converge to values different from the ones used for simulation. However, the assumptions implying (7) may still be satisfied, and the convergence of the online EM algorithm in these cases can be demonstrated (see Appendix A.8).
The numerical simulations are all conducted in the R programming environment (R Core Team 2020) and simulation scripts are made available at https://github.com/hiendn/onlineEM. In each of the cases, we follow Nguyen, Forbes & McLachlan (2020) in using the learning rates $(\gamma_i)_{i=1}^\infty$, defined by $\gamma_i = (1 - 10^{-10}) \times i^{-6/10}$, which satisfies (A7). This choice of $\gamma_i$ satisfies the recommendation by Cappé & Moulines (2009) to set $\gamma_i = \gamma_0 i^{-(0.5 + \epsilon)}$, with $\epsilon \in (0, 0.5)$ and $\gamma_0 \in (0, 1)$. Our choice agrees with the previous reports of Cappe (2009) and Kuhn, Matias & Rebafka (2020), who demonstrated good performance using the same choice of $\epsilon = 0.1$. Furthermore, Le Corff & Fort (2013) and Allassonniere & Chevalier (2021) showed good numerical results using $\epsilon = 0.03$ and $\epsilon = 0.15$, respectively. The online EM algorithm is then run for $n = 100000$ to $n = 500000$ iterations (i.e., using a sequence of observations $(Y_i)_{i=1}^n$, where $n = 100000$ or $n = 500000$). Where required, the mappings $\bar{\theta}$ are numerically computed using the optim function, which adequately solves the respective optimisation problems, as defined by (4). Random gamma observations are generated using the rgamma function. Random Student observations are generated hierarchically using the rnorm and rgamma functions.
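For a single (non-mixture) gamma component there is no latent variable, so $\bar{s}(y; \theta) = s(y) = (\log y, y)$ does not depend on $\theta$, and recursion (5) reduces to a weighted running average of sufficient statistics; the map $\bar{\theta}$ of Section 3.1 then only needs to be applied at the end. The following self-contained Python sketch (an illustration of the consistency behaviour (7), not the paper's R implementation; SciPy assumed) uses the learning rate $\gamma_i = (1 - 10^{-10}) i^{-6/10}$ described above:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def theta_bar(s1, s2):
    # Solve (16) for k by bracketed root-finding, then theta = s2/k as in (15).
    f = lambda k: s1 - digamma(k) - np.log(s2) + np.log(k)
    hi = 1.0
    while f(hi) > 0:
        hi *= 2.0
    k = brentq(f, 1e-8, hi)
    return s2 / k, k

rng = np.random.default_rng(1)
k0, t0 = 9.0, 0.5                              # generative parameter values
y = rng.gamma(shape=k0, scale=t0, size=100000)

gamma_i = lambda i: (1.0 - 1e-10) * i ** -0.6  # learning rate satisfying (A7)
s = np.array([np.log(y[0]), y[0]])             # s^(0) = s(Y_1)
for i in range(2, y.size + 1):
    g = gamma_i(i)
    s = g * np.array([np.log(y[i - 1]), y[i - 1]]) + (1.0 - g) * s  # update (5)

# By Jensen's inequality, s lies in the set S of (14), so theta_bar is defined.
theta_hat, k_hat = theta_bar(s[0], s[1])       # update (6)
```

The estimates fluctuate around the generative values $(k_0, \theta_0) = (9, 0.5)$, mirroring the behaviour reported for the mixture simulations below.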
307
For the gamma mixture distribution scenario, we generate data from a mixture of K = 3 gamma distributions using the values θ_{0z} = 0.5, 0.05, 0.1, k_{0z} = 9, 20, 1, and π_{0z} = 0.3, 0.2, 0.5, for each of the 3 components z = 1, 2, 3, respectively. The algorithm is initialised with θ_z^{(0)} = 1, 1, 1, k_z^{(0)} = 5, 9, 2, and π_z^{(0)} = 1/3, 1/3, 1/3, for each z = 1, 2, 3. For the Student mixture case, we restrict our attention to the d = 1 case where Σ is a scalar, which we denote by σ² > 0. We generate data from a mixture of K = 3 Student distributions using μ_{0z} = 0, 3, 6, σ²_{0z} = 1, 1, 1, and ν_{0z} = 3, 2, 1, for z = 1, 2, 3. The corresponding mixture weights are taken as π_{0z} = 0.3, 0.5, 0.2, for each respective component. The algorithm is initialised with values set to μ_z^{(0)} = 1, 4, 7, σ_z^{2(0)} = 2, 2, 2, ν_z^{(0)} = 4, 4, 4, and π_z^{(0)} = 1/3, 1/3, 1/3, for each component z = 1, 2, 3.
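These data-generating mechanisms can be sketched as follows. This Python version stands in for the paper's rgamma/rnorm R calls; it treats θ_z as the gamma scale parameter and uses the standard normal/gamma hierarchy for the Student components — the scale-versus-rate convention is our assumption and should be checked against the paper's parameterisation:

```python
import random

rng = random.Random(2021)

def sample_gamma_mixture(n, pis, ks, thetas):
    """Draw n observations from a K-component gamma mixture.
    Here theta_z is treated as the scale parameter (an assumption)."""
    ys = []
    for _ in range(n):
        z = rng.choices(range(len(pis)), weights=pis)[0]  # latent label
        ys.append(rng.gammavariate(ks[z], thetas[z]))
    return ys

def sample_student_mixture(n, pis, mus, sigma2s, nus):
    """Draw from a K-component Student mixture via the normal/gamma hierarchy:
    W ~ Gamma(nu/2, scale 2/nu), then Y | W ~ N(mu, sigma^2 / W)."""
    ys = []
    for _ in range(n):
        z = rng.choices(range(len(pis)), weights=pis)[0]
        w = rng.gammavariate(nus[z] / 2.0, 2.0 / nus[z])
        ys.append(rng.gauss(mus[z], (sigma2s[z] / w) ** 0.5))
    return ys
```

For instance, sample_gamma_mixture(n, [0.3, 0.2, 0.5], [9, 20, 1], [0.5, 0.05, 0.1]) draws from the generative gamma mixture above; note that the ν = 1 Student component is Cauchy, so its sample moments are not meaningful.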
Example sequences of online EM parameter estimates (θ^{(i)})_{i=1}^n for the gamma and Student simulations are presented in Figures 1 and 2, respectively. As suggested by Cappé & Moulines (2009), the parameter vector is not updated for the first 500 iterations. That is, for i ≤ 500, θ^{(i)} = θ^{(0)}, and for i > 500, θ^{(i)} = θ̄(s^{(i)}). This is to ensure that the initial elements of the sufficient statistic sequence are stable. Other than ensuring that s^{(0)} ∈ S in each case, we did not find it necessary to mitigate against any tendencies towards violations of Assumption (A8). We note that if such violations are problematic, then one can employ a truncation of the sequence (s^{(i)})_{i=0}^n, as suggested in Cappé & Moulines (2009) and considered in Nguyen, Forbes & McLachlan (2020).

From the two figures, we notice that the sequences each approach and fluctuate around the respective generative parameter values θ_0. This provides empirical evidence towards the correctness of conclusion (7) of Cappé & Moulines (2009, Thm. 1), in the cases considered
in Section 3. In each of the figures, we also observe the decrease in volatility as the iterations increase. This may be explained by the asymptotic normality of the sequences (cf. Cappé & Moulines 2009, Thm. 2), which is generally true under the assumptions of Cappé & Moulines (2009, Thm. 1). For the Student mixture, we considered n = 500000 and observed that convergence for the degrees of freedom (dof) parameters may be slower, especially when the dof is larger. This may be due to a flatter likelihood surface for larger values of the dof.

Figure 1. Online EM algorithm estimator sequence θ_z^{(i)} = (θ_z^{(i)}, k_z^{(i)}, π_z^{(i)})^⊤ (z ∈ [3]), for a mixture of K = 3 gamma distributions. The dashed lines indicate the generative parameter values of the DGP. Components are grouped in columns.
5. Conclusion

Assumptions regarding the continuous differentiability of mappings are common for the establishment of consistency results for online and mini-batch EM and EM-like algorithms. As an archetype of such algorithms, we studied the online EM algorithm of Cappé & Moulines (2009), which requires the verification of Assumption (A5) in order for consistency to be established. We demonstrated that (A5) can be verified in the interesting scenarios when data arises from mixtures of beta distributions, gamma distributions, fully-visible Boltzmann machines and Student distributions, using a global implicit function theorem. Via numerical simulations, we also provided empirical evidence of the convergence of the online EM algorithm in the aforementioned scenarios.
Figure 2. Online EM algorithm sequence of estimators θ_z^{(i)} = (μ_z^{(i)}, σ_z^{2(i)}, ν_z^{(i)}, π_z^{(i)})^⊤ (z ∈ [3]), for a mixture of K = 3 Student distributions. The dashed lines indicate the generative parameter values of the DGP. Components are grouped in columns.

Furthermore, our technique can be used to verify (A5) for other exponential family distributions of interest that do not have closed-form estimators, such as the inverse gamma and Wishart distributions, which are widely used in practice. Other models for which our
method is applicable include the wide variety of variance mixtures, and mean and variance mixtures, of normal distributions. We have exclusively studied the verification of assumptions of an online EM algorithm in the IID setting. An interesting question arises as to whether our results apply to online EM algorithms for hidden Markov models (HMMs). Online parameter estimation of HMMs is a challenging task due to the non-trivial dependence between the observations. Recent results in this direction appear in Le Corff & Fort (2013). In that paper, Assumption (A1)(c) is equivalent to our Assumption (A5), where (A1)(c) assumes that a parameter map θ̄ is continuous for convergence of the algorithm. Additionally, to study the rate of convergence of their algorithm, Assumption (A8)(a) is made, which assumes that θ̄ is twice continuously differentiable. Our Theorem 2 can be directly applied to check (A1)(c) but cannot be used to show (A8)(a).

We also note that when the complete-data likelihood or objective function cannot be represented in exponential family form, other online algorithms may be required. The recent works of Karimi et al. (2019b) and Fort, Moulines & Wai (2020b) demonstrate how penalisation and regularisation can be incorporated within the online EM framework. Outside of online EM algorithms, the related online MM (minorisation–maximisation) algorithms of Mairal (2013) and Razaviyayn, Sanjabi & Luo (2016) can be used to estimate the parameters of generic distributions. However, these MM algorithms require their own restrictive assumptions, such as the strong convexity of the objective function and related expressions. We defer the exploration of applications of the global implicit function theorem in these settings to future work.
A. Appendix

A.1. Properties of the objective function and maximiser from Assumption (A3)

An expression of form (1) is said to be regular if φ: T → P, where P is an open subset of R^q. Let φ^{−1}: P → T denote the inverse function of φ. Call D ⊆ R^q the closed convex support of the exponential family form (1), defined as the smallest closed and convex set such that

inf_{θ∈T} Pr_θ({x ∈ X: s(x) ∈ D}) = 1,

where Pr_θ is the probability measure of X, under the assumption that X arises from the DGP characterised by θ. Further, let int D be the interior of D. The following pair of results are taken from Sundberg (2019, Prop. 3.10) and combine Sundberg (2019, Props. 3.11 and 3.12), respectively (cf. Johansen 1979, Ch. 3, and Barndorff-Nielsen 2014, Ch. 9). The first result provides conditions under which the objective Q is strictly concave, and the second provides conditions under which the maximiser (4) exists and is unique.

Proposition 1. If (1) is regular and φ is bijective, then

Q(s; Φ) = s^⊤Φ − ψ(φ^{−1}(Φ))

is a smooth and strictly concave function of Φ ∈ P.

Proposition 2. If (1) is regular, then

δ(Φ) = (∂/∂Φ) ψ(φ^{−1}(Φ))

is a one-to-one function of Φ ∈ P, where δ(P) = {δ(Φ) ∈ R^q: Φ ∈ P} is open, and if φ is bijective, then (4) exists and is unique if and only if s ∈ δ(P). Furthermore, we can write (4) as the unique root of s = δ(φ(θ)), and s ∈ δ(P) = int D.
A.2. The beta distribution

We now consider a beta distributed random variable Y ∈ (0, 1), characterised by the density function

f(y; θ) = Γ(α + β) / (Γ(α) Γ(β)) × y^{α−1} (1 − y)^{β−1},

where θ^⊤ = (α, β) ∈ (0, ∞)², which has an exponential family form with h(y) = y^{−1}(1 − y)^{−1}, ψ(θ) = log Γ(α) + log Γ(β) − log Γ(α + β), s(y) = (log y, log(1 − y))^⊤, and φ(θ) = (α, β)^⊤. The objective function Q in (A3) can be written as

Q(s; θ) = s1 α + s2 β − log Γ(α) − log Γ(β) + log Γ(α + β),

where s ∈ R².

As in Section 3.1, we can specify conditions for the existence of θ̄ using Proposition 2. Here, there are no problems with regularity, and we can write

δ(φ(θ)) = δ(θ) = (Ψ^{(0)}(α) − Ψ^{(0)}(α + β), Ψ^{(0)}(β) − Ψ^{(0)}(α + β))^⊤.

Proposition 2 then states that θ̄ exists and is unique when s ∈ S = δ(P), where P = (0, ∞)². We may then use the fact that δ(P) = int D to write

S = int D = {s = (s1, s2) ∈ R²: s1 < 0, s2 < log(1 − exp s1)},

since s1 = log y < 0 and s2 = log(1 − y) = log(1 − exp s1) is a concave function of s1, and hence has a convex hypograph. This is exactly the result of Barndorff-Nielsen (2014, Example 9.2).
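The strict concavity of Q guaranteed by Proposition 1 can also be illustrated numerically for the beta family, for which φ is the identity, so that Φ = θ. A minimal midpoint-concavity check in Python (the test points below are arbitrary illustrative values, not from the paper):

```python
import math

def Q(s1, s2, alpha, beta):
    # Beta-family objective: Q(s; theta) = s1*a + s2*b - log G(a) - log G(b) + log G(a+b).
    return (s1 * alpha + s2 * beta
            - math.lgamma(alpha) - math.lgamma(beta) + math.lgamma(alpha + beta))

def midpoint_concave(s1, s2, p, q):
    """Strict concavity implies that Q at the midpoint of two distinct points
    of P = (0, inf)^2 strictly exceeds the average of Q at the endpoints."""
    mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    return Q(s1, s2, *mid) > 0.5 * (Q(s1, s2, *p) + Q(s1, s2, *q))
```

Since Q is linear in s, the check does not depend on s lying in S; the concavity comes entirely from the strict convexity of ψ.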
Next, we can define θ̄ as the solution of the first-order conditions (8):

∂Q/∂α = s1 − Ψ^{(0)}(α) + Ψ^{(0)}(α + β) = 0,

∂Q/∂β = s2 − Ψ^{(0)}(β) + Ψ^{(0)}(α + β) = 0.
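Since these conditions have no closed-form solution, θ̄ must be computed numerically (the simulations use R's optim). As a self-contained alternative, the system can be solved by Newton's method; in this Python sketch, the finite-difference digamma, the starting point (1, 1) and the damping rule are our own choices, not from the paper:

```python
import math

def digamma(x, h=1e-6):
    # Central finite difference of log Gamma; adequate for illustration.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def theta_bar_beta(s1, s2, alpha=1.0, beta=1.0, tol=1e-8, max_iter=200):
    """Root of the first-order conditions for the beta distribution:
       s1 - psi(alpha) + psi(alpha + beta) = 0,
       s2 - psi(beta)  + psi(alpha + beta) = 0,
    solved by damped Newton with a forward-difference Jacobian."""
    def F(a, b):
        return (s1 - digamma(a) + digamma(a + b),
                s2 - digamma(b) + digamma(a + b))
    for _ in range(max_iter):
        f1, f2 = F(alpha, beta)
        if abs(f1) < tol and abs(f2) < tol:
            break
        eps = 1e-5
        # Forward-difference Jacobian of F at (alpha, beta).
        g1, g2 = F(alpha + eps, beta)
        h1, h2 = F(alpha, beta + eps)
        j11, j21 = (g1 - f1) / eps, (g2 - f2) / eps
        j12, j22 = (h1 - f1) / eps, (h2 - f2) / eps
        det = j11 * j22 - j12 * j21
        # Newton step: solve J d = -F for d = (da, db).
        da = (-f1 * j22 + f2 * j12) / det
        db = (-f2 * j11 + f1 * j21) / det
        # Damping keeps (alpha, beta) inside (0, inf)^2.
        step = 1.0
        while alpha + step * da <= 0 or beta + step * db <= 0:
            step /= 2.0
        alpha, beta = alpha + step * da, beta + step * db
    return alpha, beta
```

For example, with s generated from (α, β) = (2, 3), i.e. s1 = Ψ^{(0)}(2) − Ψ^{(0)}(5) and s2 = Ψ^{(0)}(3) − Ψ^{(0)}(5), the iteration should recover (α, β) close to (2, 3), and the strict concavity of Q guarantees that this root is the unique maximiser whenever s ∈ S.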