On Tracking The Partition Function:
Supplemental Material
Guillaume Desjardins, Aaron Courville, Yoshua Bengio {desjagui,courvila,bengioy}@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle, Université de Montréal
This supplementary material provides a more comprehensive view of our log partition function tracking algorithm. The notation is mostly identical to that of the paper and is summarized below for convenience. One minor difference, however, is that we make widespread use of the notation $A_{:,l}$ and $A_{m,:}$ to indicate the vectors formed by the $l$-th column and $m$-th row of matrix $A$ (respectively).
$q_{i,t}$: RBM at inverse temperature $\beta_i$ at time-step $t$; $q_{i,t}(x) = \tilde{q}_{i,t}(x)/Z_{i,t}$, with $i \in [1, M]$.
$\theta$: set of model parameters.
$F_{i,t}(x)$: free energy assigned by $q_{i,t}$ to input configuration $x$.
$\zeta_{i,t}$: log partition function of model $q_{i,t}$.
$X_{i,t}$: mini-batch of samples $\{x^{(n)}_{i,t} \sim q_{i,t}(x);\ n \in [1, N]\}$.
$Y_t$: additional mini-batch of samples $\{x^{(n)}_{1,t} \sim q_{1,t}(x);\ n \in [1, N_Y]\}$, $N_Y \geq N$.
$\mathcal{D}$: set of training examples.
$\mu_{t,t}, P_{t,t}$: estimated mean and covariance of the posterior distribution $p(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0})$.
$\Sigma_\zeta$: fixed covariance matrix of $p(\zeta_t \mid \zeta_{t-1})$.
$\epsilon_{\text{init}}, \epsilon_t$: initial learning rate and learning rate at time $t$.
$s_{i,t}$: coarse estimate of $Z_{i+1,t}/Z_{i,t}$ used to generate the bridging distribution $q^*_{i,t}$.
$q^*_{i,t}$: bridging distribution with $q^*_{i,t}(x) = \dfrac{\tilde{q}_{i,t}(x)\,\tilde{q}_{i+1,t}(x)}{s_{i,t}\,\tilde{q}_{i,t}(x) + \tilde{q}_{i+1,t}(x)}$.
$r(q_1, q_2, x_1, x_2)$: swap probability used by PT, with $r_{i,t} = \min\left(1,\ \dfrac{\tilde{q}_1(x_2)\,\tilde{q}_2(x_1)}{\tilde{q}_1(x_1)\,\tilde{q}_2(x_2)}\right)$.
1 Parallel Tempering Algorithm
We divide our algorithm into three parts. Algorithm 1 presents the pseudo-code for the Parallel Tempering sampling algorithm. For details, we refer the reader to [1].
Algorithm 1 sample_PT($q_{:,t}$, $X_{:,t-1}$, $k$)
    Initialize $X_{i,t}$ as empty sets, $\forall i \in [1, M]$.
    for $n \in [1, N]$ do
        for $i \in [1, M]$ do
            Initialize the Markov chain associated with model $q_{i,t}$ with state $x^{(n)}_{i,t-1}$.
            Perform $k$ steps of Gibbs sampling, yielding $x^{(n)}_{i,t}$.
        end for
        Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ even $i$.
        Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ odd $i$.
        $X_{i,t} \leftarrow X_{i,t} \cup \{x^{(n)}_{i,t}\}$, $\forall i$.
    end for
    return $X_{:,t}$.
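As a concrete sketch of the swap phase of Algorithm 1, the Python below operates on a list of chain states; `log_qs` is a hypothetical list of callables returning unnormalized log-probabilities (in an actual RBM implementation these would be negative free energies). This is an illustration of the swap rule only, not the RBM-specific sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_probability(log_q1, log_q2, x1, x2):
    # r = min(1, q1(x2) q2(x1) / (q1(x1) q2(x2))), computed in the log domain
    # for numerical stability.
    log_r = log_q1(x2) + log_q2(x1) - log_q1(x1) - log_q2(x2)
    return min(1.0, float(np.exp(log_r)))

def pt_swap_pass(states, log_qs, parity):
    # Attempt swaps between adjacent chains (i, i+1) for i of the given
    # parity (0 = "even" pass, 1 = "odd" pass), in place.
    for i in range(parity, len(states) - 1, 2):
        r = swap_probability(log_qs[i], log_qs[i + 1], states[i], states[i + 1])
        if rng.random() < r:
            states[i], states[i + 1] = states[i + 1], states[i]
    return states
```

Alternating even and odd passes, as in Algorithm 1, lets a sample diffuse across all temperatures over successive time-steps.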
[Figure 1 diagram: a two-slice directed graphical model over $\zeta_{1,t}, \ldots, \zeta_{M,t}$ and the bias $b_t$, with observed nodes $O^{(\Delta t)}_{i,t}$ linking time-steps $t-1$ and $t$, and $O^{(\Delta\beta)}_{i,t}$ linking adjacent temperatures.]

System Equations:

$p(\zeta_0) = \mathcal{N}(\mu_0, \Sigma_0)$
$p(\zeta_t \mid \zeta_{t-1}) = \mathcal{N}(\zeta_{t-1}, \Sigma_\zeta)$
$p(O^{(\Delta t)}_t \mid \zeta_t, \zeta_{t-1}) = \mathcal{N}(C\,[\zeta_t, \zeta_{t-1}]^T, \Sigma_{\Delta t})$
$p(O^{(\Delta\beta)}_t \mid \zeta_t) = \mathcal{N}(H \zeta_t, \Sigma_{\Delta\beta})$

$C = \begin{bmatrix} I_M & 0 & -I_M & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\qquad
H = \begin{bmatrix} -1 & +1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & +1 & \cdots & 0 & 0 \\ & & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & -1 & +1 & 0 \end{bmatrix}$

Figure 1: A directed graphical model for log partition function tracking. The shaded nodes represent observed variables, and the double-walled nodes represent the tractable $\zeta_{M,:}$ with $\beta_M = 0$. For clarity of presentation, we show the bias term as distinct from the other $\zeta_{i,t}$ (recall $b_t = \zeta_{M+1,t}$).
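Assuming the block structure shown above (state dimension $M+1$, rows $1..M$ of $C$ measuring $\zeta_{i,t} - \zeta_{i,t-1}$ and the last row reading the tractable bias $b_t$ directly), the observation matrices can be built programmatically. `build_C` and `build_H` below are illustrative helpers, not part of the paper's code:

```python
import numpy as np

def build_C(M):
    # C maps [zeta_t; zeta_{t-1}] (length 2(M+1)) to the M+1 Delta-t
    # observations: rows 0..M-1 give zeta_{i,t} - zeta_{i,t-1};
    # the last row reads the tractable bias term b_t = zeta_{M+1,t} directly.
    C = np.zeros((M + 1, 2 * (M + 1)))
    C[:M, :M] = np.eye(M)
    C[:M, M + 1:2 * M + 1] = -np.eye(M)
    C[M, M] = 1.0
    return C

def build_H(M):
    # H maps zeta_t (length M+1) to the M-1 bridge observations
    # zeta_{i+1,t} - zeta_{i,t}; its last column (bias term) is zero.
    H = np.zeros((M - 1, M + 1))
    for i in range(M - 1):
        H[i, i] = -1.0
        H[i, i + 1] = 1.0
    return H
```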
Inference Equations:

(i) $p\left(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t-1:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\eta_{t-1,t-1}, V_{t-1,t-1})$
with $\eta_{t-1,t-1} = \begin{bmatrix} \mu_{t-1,t-1} \\ \mu_{t-1,t-1} \end{bmatrix}$ and $V_{t-1,t-1} = \begin{bmatrix} P_{t-1,t-1} & P_{t-1,t-1} \\ P_{t-1,t-1} & \Sigma_\zeta + P_{t-1,t-1} \end{bmatrix}$

(ii) $p\left(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\eta_{t,t-1}, V_{t,t-1})$
with $V_{t,t-1} = \left(V_{t-1,t-1}^{-1} + C^T \Sigma_{\Delta t}^{-1} C\right)^{-1}$ and $\eta_{t,t-1} = V_{t,t-1}\left(C^T \Sigma_{\Delta t}^{-1} O^{(\Delta t)}_t + V_{t-1,t-1}^{-1} \eta_{t-1,t-1}\right)$

(iii) $p\left(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\mu_{t,t-1}, P_{t,t-1})$
with $\mu_{t,t-1} = [\eta_{t,t-1}]_2$ and $P_{t,t-1} = [V_{t,t-1}]_{2,2}$

(iv) $p\left(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0}\right) = \mathcal{N}(\mu_{t,t}, P_{t,t})$
with $P_{t,t} = \left(P_{t,t-1}^{-1} + H^T \Sigma_{\Delta\beta}^{-1} H\right)^{-1}$ and $\mu_{t,t} = P_{t,t}\left(H^T \Sigma_{\Delta\beta}^{-1} O^{(\Delta\beta)}_t + P_{t,t-1}^{-1} \mu_{t,t-1}\right)$

Figure 2: Inference equations for our log partition tracking algorithm, a variant of the Kalman filter. For any vector $v$ and matrix $V$, we use the notation $[v]_2$ to denote the vector obtained by preserving the bottom half of the elements of $v$, and $[V]_{2,2}$ to indicate the lower right-hand quadrant of $V$.
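Equations (i) through (iv) have the standard information-filter form: a joint prediction step, two measurement updates, and a marginalization. The sketch below implements them with generic NumPy linear algebra; `A` stands in for either $C$ or $H$, and the function names are illustrative, not taken from the paper:

```python
import numpy as np

def predict(mu, P, Sigma_zeta):
    # Equation (i): joint Gaussian over (zeta_{t-1}, zeta_t) under the
    # random-walk prior p(zeta_t | zeta_{t-1}) = N(zeta_{t-1}, Sigma_zeta).
    eta = np.concatenate([mu, mu])
    V = np.block([[P, P], [P, Sigma_zeta + P]])
    return eta, V

def measure(eta, V, A, Sigma, obs):
    # Equations (ii)/(iv): condition a Gaussian N(eta, V) on an observation
    # obs ~ N(A x, Sigma), in information (precision) form.
    V_post = np.linalg.inv(np.linalg.inv(V) + A.T @ np.linalg.inv(Sigma) @ A)
    eta_post = V_post @ (A.T @ np.linalg.inv(Sigma) @ obs + np.linalg.inv(V) @ eta)
    return eta_post, V_post

def marginalize(eta, V):
    # Equation (iii): keep the bottom half of eta and the lower-right
    # quadrant of V, i.e. the marginal over zeta_t.
    d = eta.shape[0] // 2
    return eta[d:], V[d:, d:]
```

One filtering step then chains `predict`, `measure` with $(C, \Sigma_{\Delta t}, O^{(\Delta t)}_t)$, `marginalize`, and `measure` with $(H, \Sigma_{\Delta\beta}, O^{(\Delta\beta)}_t)$.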
2 Kalman Filter
Algorithm 2 presents the tracking algorithm. The statistical estimate $O^{(\Delta t)}_{i,t}$ is computed as an average of importance weights, measured between the adjacent models $q_{i,t}$ and $q_{i,t-1}$. The estimate $O^{(\Delta\beta)}_{i,t}$ is computed through bridge sampling, applied to models $q_{i+1,t}$ and $q_{i,t}$. These observations are combined through a Kalman filter, which also exploits a smoothness prior on the evolution of $\zeta_t$. We include the graphical model, system equations and inference equations in Figures 1 & 2 for completeness; we refer the reader to the accompanying paper, however, for a more thorough description of these figures.
3 Simultaneous Tracking and Learning
Finally, Algorithm 3 ties everything together, performing joint training and estimation of the log partition function. Note that using two sets of samples from the target distribution ($X_{1,t}$ and $Y_t$)
Algorithm 2 kalman_filter($q_{:,t-1}$, $q_{:,t}$, $\mu_{t-1,t-1}$, $P_{t-1,t-1}$, $s_{i,t}$, $\Sigma_\zeta$)
    Using $\mu_{t-1,t-1}$, $P_{t-1,t-1}$ and $\Sigma_\zeta$, compute $\eta_{t-1,t-1}$ and $V_{t-1,t-1}$ through equation (i).
    for $i \in [1, M]$ do
        $w^{(n)}_{i,t} \leftarrow \tilde{q}_{i,t}\big(x^{(n)}_{i,t-1}\big) \big/ \tilde{q}_{i,t-1}\big(x^{(n)}_{i,t-1}\big)$
        $O^{(\Delta t)}_{i,t} \leftarrow \log\left[\frac{1}{N} \sum_{n=1}^N w^{(n)}_{i,t}\right]$
        $\Sigma_{\Delta t} \leftarrow \mathrm{Diag}\left[\mathrm{Var}[w_{i,t}] \Big/ \big(\textstyle\sum_n w^{(n)}_{i,t}\big)^2\right]$
    end for
    Using $O^{(\Delta t)}_{i,t}$ and $\Sigma_{\Delta t}$, compute $\eta_{t,t-1}$ and $V_{t,t-1}$ through equation (ii).
    Compute $\mu_{t,t-1}$ and $P_{t,t-1}$ using equation (iii).
    for $i \in [1, M-1]$ do
        $u^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i,t}\big) \big/ \tilde{q}_{i,t}\big(x^{(n)}_{i,t}\big)$;  $v^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i+1,t}\big) \big/ \tilde{q}_{i+1,t}\big(x^{(n)}_{i+1,t}\big)$
        $O^{(\Delta\beta)}_{i,t} \leftarrow \log\left[\frac{1}{N} \sum_{n=1}^N u^{(n)}_{i,t}\right] - \log\left[\frac{1}{N} \sum_{n=1}^N v^{(n)}_{i,t}\right]$
        $\Sigma_{\Delta\beta} \leftarrow \mathrm{Diag}\left[\mathrm{Var}[u_{i,t}] \Big/ \big(\textstyle\sum_n u^{(n)}_{i,t}\big)^2 + \mathrm{Var}[v_{i,t}] \Big/ \big(\textstyle\sum_n v^{(n)}_{i,t}\big)^2\right]$
    end for
    Using $O^{(\Delta\beta)}_{i,t}$ and $\Sigma_{\Delta\beta}$, compute $\mu_{t,t}$ and $P_{t,t}$ through equation (iv).
    Return ($\mu_{t,t}$, $P_{t,t}$).
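The two estimators inside Algorithm 2 are plain importance sampling (for $O^{(\Delta t)}$) and bridge sampling with the bridge $q^*$ defined in the notation table (for $O^{(\Delta\beta)}$). A log-domain sketch follows; the callables for the unnormalized log-densities and the sample lists are hypothetical placeholders:

```python
import numpy as np

def log_mean_exp(a):
    # Numerically stable log((1/N) * sum_n exp(a_n)).
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def delta_t_observation(log_qt, log_qt_prev, samples):
    # O^(Delta t)_{i,t}: log of the average importance weight between
    # q_{i,t} and q_{i,t-1}, evaluated on samples drawn from q_{i,t-1}.
    log_w = np.array([log_qt(x) - log_qt_prev(x) for x in samples])
    return log_mean_exp(log_w)

def delta_beta_observation(log_qi, log_qip1, s, samples_i, samples_ip1):
    # O^(Delta beta)_{i,t}: bridge sampling between q_{i,t} and q_{i+1,t},
    # with bridge log q*(x) = log_qi + log_qip1 - log(s*exp(log_qi) + exp(log_qip1)).
    def log_bridge(x):
        return log_qi(x) + log_qip1(x) - np.logaddexp(np.log(s) + log_qi(x), log_qip1(x))
    log_u = np.array([log_bridge(x) - log_qi(x) for x in samples_i])
    log_v = np.array([log_bridge(x) - log_qip1(x) for x in samples_ip1])
    return log_mean_exp(log_u) - log_mean_exp(log_v)
```

Working in the log domain avoids the overflow that raw importance weights between RBMs would otherwise cause.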
is not required. Their use is inspired by [2] and allows us to separately tune $N$, the number of "tempered" mini-batches, and $N_Y$, the size of the mini-batch used to estimate the gradient.
Algorithm 3 main
    Initialize $\theta_1$ and compute exact log partition functions $\zeta_{:,1}$.
    Initialize $\mu_{1,1}$ with $\zeta_{:,1}$, $P_{1,1}[1{:}M, 1{:}M] = 0$ and $P_{1,1}[M{+}1, M{+}1] = \epsilon_{\text{init}} \cdot \sigma_b^2$.
    Initialize samples $X_{:,1}$ and $Y_{:,1}$ according to the RBM visible biases.
    Initialize $s_{i,1}$ to $\exp(\zeta_{i+1,1} - \zeta_{i,1})$, $\forall i \in [1, M-1]$.
    for $t \in [2, T]$ do
        Obtain training examples $X_t^+ = \{x^{(n)} \in \mathcal{D};\ n \in [1, N]\}$.
        $\theta_t \leftarrow \theta_{t-1} - \epsilon_t \left[ \frac{1}{N} \sum_{x \in X_t^+} \frac{\partial F(x; \theta_{t-1})}{\partial \theta} - \frac{1}{N_Y} \sum_{y \in Y_t} \frac{\partial F(y; \theta_{t-1})}{\partial \theta} \right]$
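The parameter update above is the usual positive-phase/negative-phase free-energy gradient step for stochastic maximum likelihood. As an illustration for a binary RBM, restricted to the visible bias (whose free-energy gradient is simply $-v$); all names and toy tensors here are hypothetical, not the paper's implementation:

```python
import numpy as np

def free_energy(v, W, b, c):
    # Binary-RBM free energy:
    # F(v) = -b^T v - sum_j log(1 + exp(c_j + v^T W_{:,j})).
    return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W), axis=-1)

def sgd_step_bias(b, pos_batch, neg_batch, lr):
    # One step of theta <- theta - lr * (E_data[dF/dtheta] - E_model[dF/dtheta]),
    # shown for the visible bias only, where dF/db = -v.
    grad_b = -pos_batch.mean(axis=0) + neg_batch.mean(axis=0)
    return b - lr * grad_b
```

The same pattern applies to the weights and hidden biases, with the model expectation estimated from the PT samples $Y_t$.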