On Tracking The Partition Function:
Supplemental Material
Guillaume Desjardins, Aaron Courville, Yoshua Bengio {desjagui,courvila,bengioy}@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle, Université de Montréal
This supplementary material provides a more comprehensive view of our log partition function tracking algorithm. The notation is mostly identical to that of the paper and is summarized below for convenience. One minor difference, however, is that we make widespread use of the notation $A_{:,l}$ and $A_{m,:}$ to indicate the vectors formed by the $l$-th column and $m$-th row of matrix $A$ (respectively).
$q_{i,t}$: RBM at inverse temperature $\beta_i$ at time-step $t$; $q_{i,t}(x) = \tilde{q}_{i,t}(x)/Z_{i,t}$, with $i \in [1, M]$.
$\theta$: set of model parameters.
$F_{i,t}(x)$: free energy assigned by $q_{i,t}$ to input configuration $x$.
$\zeta_{i,t}$: log partition function of model $q_{i,t}$.
$X_{i,t}$: mini-batch of samples $\{x^{(n)}_{i,t} \sim q_{i,t}(x);\ n \in [1, N]\}$.
$Y_t$: additional mini-batch of samples $\{x^{(n)}_{1,t} \sim q_{1,t}(x);\ n \in [1, N_Y]\}$, $N_Y \geq N$.
$\mathcal{D}$: set of training examples.
$\mu_{t,t}, P_{t,t}$: estimated mean and covariance of the posterior distribution $p(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0})$.
$\Sigma_\zeta$: fixed covariance matrix of $p(\zeta_t \mid \zeta_{t-1})$.
$\epsilon_{\text{init}}, \epsilon_t$: initial learning rate and learning rate at time $t$.
$s_{i,t}$: coarse estimate of $Z_{i+1,t}/Z_{i,t}$ used to generate the bridging distribution $q^*_{i,t}$.
$q^*_{i,t}$: bridging distribution with $q^*_{i,t}(x) = \dfrac{\tilde{q}_{i,t}(x)\,\tilde{q}_{i+1,t}(x)}{s_{i,t}\,\tilde{q}_{i,t}(x) + \tilde{q}_{i+1,t}(x)}$.
$r(q_1, q_2, x_1, x_2)$: swap probability used by PT, with $r_{i,t} = \min\left(1,\ \dfrac{\tilde{q}_1(x_2)\,\tilde{q}_2(x_1)}{\tilde{q}_1(x_1)\,\tilde{q}_2(x_2)}\right)$.
1 Parallel Tempering Algorithm
We divide our algorithm into three parts. Algorithm 1 presents the pseudo-code for the Parallel Tempering sampling algorithm. For details, we refer the reader to [1].
Algorithm 1 sample_PT($q_{:,t}$, $X_{:,t-1}$, $k$)
    Initialize $X_{i,t}$ as empty sets, $\forall i \in [1, M]$.
    for $n \in [1, N]$ do
        for $i \in [1, M]$ do
            Initialize the Markov chain associated with model $q_{i,t}$ with state $x^{(n)}_{i,t-1}$.
            Perform $k$ steps of Gibbs sampling, yielding $x^{(n)}_{i,t}$.
        end for
        Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ even $i$.
        Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ odd $i$.
        $X_{i,t} \leftarrow X_{i,t} \cup \{x^{(n)}_{i,t}\}$, $\forall i$.
    end for
    return $X_{:,t}$.
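As a concrete sketch of the swap phase of Algorithm 1, the Python below operates on a list of chain states; `log_qs` is a hypothetical list of callables returning unnormalized log-probabilities (in an actual RBM implementation these would be negative free energies). This is an illustration of the swap rule only, not the RBM-specific sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_probability(log_q1, log_q2, x1, x2):
    # r = min(1, q1(x2) q2(x1) / (q1(x1) q2(x2))), computed in the log domain
    # for numerical stability.
    log_r = log_q1(x2) + log_q2(x1) - log_q1(x1) - log_q2(x2)
    return min(1.0, float(np.exp(log_r)))

def pt_swap_pass(states, log_qs, parity):
    # Attempt swaps between adjacent chains (i, i+1) for i of the given
    # parity (0 = "even" pass, 1 = "odd" pass), in place.
    for i in range(parity, len(states) - 1, 2):
        r = swap_probability(log_qs[i], log_qs[i + 1], states[i], states[i + 1])
        if rng.random() < r:
            states[i], states[i + 1] = states[i + 1], states[i]
    return states
```

Alternating even and odd passes, as in Algorithm 1, lets a sample diffuse across all temperatures over successive time-steps.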
[Figure 1 diagram: a two-slice directed graphical model over $\zeta_{1,t}, \ldots, \zeta_{M,t}$ and the bias $b_t$, with observed nodes $O^{(\Delta t)}_{i,t}$ linking time-steps $t-1$ and $t$, and $O^{(\Delta\beta)}_{i,t}$ linking adjacent temperatures.]

System Equations:

$p(\zeta_0) = \mathcal{N}(\mu_0, \Sigma_0)$
$p(\zeta_t \mid \zeta_{t-1}) = \mathcal{N}(\zeta_{t-1}, \Sigma_\zeta)$
$p(O^{(\Delta t)}_t \mid \zeta_t, \zeta_{t-1}) = \mathcal{N}(C\,[\zeta_t, \zeta_{t-1}]^T, \Sigma_{\Delta t})$
$p(O^{(\Delta\beta)}_t \mid \zeta_t) = \mathcal{N}(H \zeta_t, \Sigma_{\Delta\beta})$

$C = \begin{bmatrix} I_M & 0 & -I_M & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\qquad
H = \begin{bmatrix} -1 & +1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & +1 & \cdots & 0 & 0 \\ & & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & -1 & +1 & 0 \end{bmatrix}$

Figure 1: A directed graphical model for log partition function tracking. The shaded nodes represent observed variables, and the double-walled nodes represent the tractable $\zeta_{M,:}$ with $\beta_M = 0$. For clarity of presentation, we show the bias term as distinct from the other $\zeta_{i,t}$ (recall $b_t = \zeta_{M+1,t}$).
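Assuming the block structure shown above (state dimension $M+1$, rows $1..M$ of $C$ measuring $\zeta_{i,t} - \zeta_{i,t-1}$ and the last row reading the tractable bias $b_t$ directly), the observation matrices can be built programmatically. `build_C` and `build_H` below are illustrative helpers, not part of the paper's code:

```python
import numpy as np

def build_C(M):
    # C maps [zeta_t; zeta_{t-1}] (length 2(M+1)) to the M+1 Delta-t
    # observations: rows 0..M-1 give zeta_{i,t} - zeta_{i,t-1};
    # the last row reads the tractable bias term b_t = zeta_{M+1,t} directly.
    C = np.zeros((M + 1, 2 * (M + 1)))
    C[:M, :M] = np.eye(M)
    C[:M, M + 1:2 * M + 1] = -np.eye(M)
    C[M, M] = 1.0
    return C

def build_H(M):
    # H maps zeta_t (length M+1) to the M-1 bridge observations
    # zeta_{i+1,t} - zeta_{i,t}; its last column (bias term) is zero.
    H = np.zeros((M - 1, M + 1))
    for i in range(M - 1):
        H[i, i] = -1.0
        H[i, i + 1] = 1.0
    return H
```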
Inference Equations:

(i) $p\left(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t-1:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\eta_{t-1,t-1}, V_{t-1,t-1})$
with $\eta_{t-1,t-1} = \begin{bmatrix} \mu_{t-1,t-1} \\ \mu_{t-1,t-1} \end{bmatrix}$ and $V_{t-1,t-1} = \begin{bmatrix} P_{t-1,t-1} & P_{t-1,t-1} \\ P_{t-1,t-1} & \Sigma_\zeta + P_{t-1,t-1} \end{bmatrix}$

(ii) $p\left(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\eta_{t,t-1}, V_{t,t-1})$
with $V_{t,t-1} = \left(V_{t-1,t-1}^{-1} + C^T \Sigma_{\Delta t}^{-1} C\right)^{-1}$ and $\eta_{t,t-1} = V_{t,t-1}\left(C^T \Sigma_{\Delta t}^{-1} O^{(\Delta t)}_t + V_{t-1,t-1}^{-1} \eta_{t-1,t-1}\right)$

(iii) $p\left(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\right) = \mathcal{N}(\mu_{t,t-1}, P_{t,t-1})$
with $\mu_{t,t-1} = [\eta_{t,t-1}]_2$ and $P_{t,t-1} = [V_{t,t-1}]_{2,2}$

(iv) $p\left(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0}\right) = \mathcal{N}(\mu_{t,t}, P_{t,t})$
with $P_{t,t} = \left(P_{t,t-1}^{-1} + H^T \Sigma_{\Delta\beta}^{-1} H\right)^{-1}$ and $\mu_{t,t} = P_{t,t}\left(H^T \Sigma_{\Delta\beta}^{-1} O^{(\Delta\beta)}_t + P_{t,t-1}^{-1} \mu_{t,t-1}\right)$

Figure 2: Inference equations for our log partition tracking algorithm, a variant of the Kalman filter. For any vector $v$ and matrix $V$, we use the notation $[v]_2$ to denote the vector obtained by preserving the bottom half of the elements of $v$, and $[V]_{2,2}$ to indicate the lower right-hand quadrant of $V$.
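Equations (i) through (iv) have the standard information-filter form: a joint prediction step, two measurement updates, and a marginalization. The sketch below implements them with generic NumPy linear algebra; `A` stands in for either $C$ or $H$, and the function names are illustrative, not taken from the paper:

```python
import numpy as np

def predict(mu, P, Sigma_zeta):
    # Equation (i): joint Gaussian over (zeta_{t-1}, zeta_t) under the
    # random-walk prior p(zeta_t | zeta_{t-1}) = N(zeta_{t-1}, Sigma_zeta).
    eta = np.concatenate([mu, mu])
    V = np.block([[P, P], [P, Sigma_zeta + P]])
    return eta, V

def measure(eta, V, A, Sigma, obs):
    # Equations (ii)/(iv): condition a Gaussian N(eta, V) on an observation
    # obs ~ N(A x, Sigma), in information (precision) form.
    V_post = np.linalg.inv(np.linalg.inv(V) + A.T @ np.linalg.inv(Sigma) @ A)
    eta_post = V_post @ (A.T @ np.linalg.inv(Sigma) @ obs + np.linalg.inv(V) @ eta)
    return eta_post, V_post

def marginalize(eta, V):
    # Equation (iii): keep the bottom half of eta and the lower-right
    # quadrant of V, i.e. the marginal over zeta_t.
    d = eta.shape[0] // 2
    return eta[d:], V[d:, d:]
```

One filtering step then chains `predict`, `measure` with $(C, \Sigma_{\Delta t}, O^{(\Delta t)}_t)$, `marginalize`, and `measure` with $(H, \Sigma_{\Delta\beta}, O^{(\Delta\beta)}_t)$.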
2 Kalman Filter
Algorithm 2 presents the tracking algorithm. The statistical estimate $O^{(\Delta t)}_{i,t}$ is computed as an average of importance weights, measured between the adjacent models $q_{i,t}$ and $q_{i,t-1}$. The estimate $O^{(\Delta\beta)}_{i,t}$ is computed through bridge sampling, applied to models $q_{i+1,t}$ and $q_{i,t}$. These observations are combined through a Kalman filter, which also exploits a smoothness prior on the evolution of $\zeta_t$. We include the graphical model, system equations and inference equations in Figures 1 & 2 for completeness; we refer the reader to the accompanying paper, however, for a more thorough description of these figures.
3 Simultaneous Tracking and Learning
Finally, Algorithm 3 ties everything together, performing joint training and estimation of the log partition function. Note that using two sets of samples from the target distribution ($X_{1,t}$ and $Y_t$)
Algorithm 2 kalman_filter($q_{:,t-1}$, $q_{:,t}$, $\mu_{t-1,t-1}$, $P_{t-1,t-1}$, $s_{i,t}$, $\Sigma_\zeta$)
    Using $\mu_{t-1,t-1}$, $P_{t-1,t-1}$ and $\Sigma_\zeta$, compute $\eta_{t-1,t-1}$ and $V_{t-1,t-1}$ through equation (i).
    for $i \in [1, M]$ do
        $w^{(n)}_{i,t} \leftarrow \tilde{q}_{i,t}\big(x^{(n)}_{i,t-1}\big) \big/ \tilde{q}_{i,t-1}\big(x^{(n)}_{i,t-1}\big)$
        $O^{(\Delta t)}_{i,t} \leftarrow \log\left[\frac{1}{N} \sum_{n=1}^N w^{(n)}_{i,t}\right]$
        $\Sigma_{\Delta t} \leftarrow \mathrm{Diag}\left[\mathrm{Var}[w_{i,t}] \Big/ \big(\textstyle\sum_n w^{(n)}_{i,t}\big)^2\right]$
    end for
    Using $O^{(\Delta t)}_{i,t}$ and $\Sigma_{\Delta t}$, compute $\eta_{t,t-1}$ and $V_{t,t-1}$ through equation (ii).
    Compute $\mu_{t,t-1}$ and $P_{t,t-1}$ using equation (iii).
    for $i \in [1, M-1]$ do
        $u^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i,t}\big) \big/ \tilde{q}_{i,t}\big(x^{(n)}_{i,t}\big)$;  $v^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i+1,t}\big) \big/ \tilde{q}_{i+1,t}\big(x^{(n)}_{i+1,t}\big)$
        $O^{(\Delta\beta)}_{i,t} \leftarrow \log\left[\frac{1}{N} \sum_{n=1}^N u^{(n)}_{i,t}\right] - \log\left[\frac{1}{N} \sum_{n=1}^N v^{(n)}_{i,t}\right]$
        $\Sigma_{\Delta\beta} \leftarrow \mathrm{Diag}\left[\mathrm{Var}[u_{i,t}] \Big/ \big(\textstyle\sum_n u^{(n)}_{i,t}\big)^2 + \mathrm{Var}[v_{i,t}] \Big/ \big(\textstyle\sum_n v^{(n)}_{i,t}\big)^2\right]$
    end for
    Using $O^{(\Delta\beta)}_{i,t}$ and $\Sigma_{\Delta\beta}$, compute $\mu_{t,t}$ and $P_{t,t}$ through equation (iv).
    Return ($\mu_{t,t}$, $P_{t,t}$).
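The two estimators inside Algorithm 2 are plain importance sampling (for $O^{(\Delta t)}$) and bridge sampling with the bridge $q^*$ defined in the notation table (for $O^{(\Delta\beta)}$). A log-domain sketch follows; the callables for the unnormalized log-densities and the sample lists are hypothetical placeholders:

```python
import numpy as np

def log_mean_exp(a):
    # Numerically stable log((1/N) * sum_n exp(a_n)).
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def delta_t_observation(log_qt, log_qt_prev, samples):
    # O^(Delta t)_{i,t}: log of the average importance weight between
    # q_{i,t} and q_{i,t-1}, evaluated on samples drawn from q_{i,t-1}.
    log_w = np.array([log_qt(x) - log_qt_prev(x) for x in samples])
    return log_mean_exp(log_w)

def delta_beta_observation(log_qi, log_qip1, s, samples_i, samples_ip1):
    # O^(Delta beta)_{i,t}: bridge sampling between q_{i,t} and q_{i+1,t},
    # with bridge log q*(x) = log_qi + log_qip1 - log(s*exp(log_qi) + exp(log_qip1)).
    def log_bridge(x):
        return log_qi(x) + log_qip1(x) - np.logaddexp(np.log(s) + log_qi(x), log_qip1(x))
    log_u = np.array([log_bridge(x) - log_qi(x) for x in samples_i])
    log_v = np.array([log_bridge(x) - log_qip1(x) for x in samples_ip1])
    return log_mean_exp(log_u) - log_mean_exp(log_v)
```

Working in the log domain avoids the overflow that raw importance weights between RBMs would otherwise cause.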
is not required. Their use is inspired by [2] and allows us to separately tune $N$, the number of "tempered" mini-batches, and $N_Y$, the size of the mini-batch used to estimate the gradient.
Algorithm 3 main
    Initialize $\theta_1$ and compute exact log partition functions $\zeta_{:,1}$.
    Initialize $\mu_{1,1}$ with $\zeta_{:,1}$, $P_{1,1}[1{:}M, 1{:}M] = 0$ and $P_{1,1}[M{+}1, M{+}1] = \epsilon_{\text{init}} \cdot \sigma_b^2$.
    Initialize samples $X_{:,1}$ and $Y_{:,1}$ according to the RBM visible biases.
    Initialize $s_{i,1}$ to $\exp(\zeta_{i+1,1} - \zeta_{i,1})$, $\forall i \in [1, M-1]$.
    for $t \in [2, T]$ do
        Obtain training examples $X_t^+ = \{x^{(n)} \in \mathcal{D};\ n \in [1, N]\}$.
        $\theta_t \leftarrow \theta_{t-1} - \epsilon_t \left[ \frac{1}{N} \sum_{x \in X_t^+} \frac{\partial F(x; \theta_{t-1})}{\partial \theta} - \frac{1}{N_Y} \sum_{y \in Y_t} \frac{\partial F(y; \theta_{t-1})}{\partial \theta} \right]$
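The parameter update above is the usual positive-phase/negative-phase free-energy gradient step for stochastic maximum likelihood. As an illustration for a binary RBM, restricted to the visible bias (whose free-energy gradient is simply $-v$); all names and toy tensors here are hypothetical, not the paper's implementation:

```python
import numpy as np

def free_energy(v, W, b, c):
    # Binary-RBM free energy:
    # F(v) = -b^T v - sum_j log(1 + exp(c_j + v^T W_{:,j})).
    return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W), axis=-1)

def sgd_step_bias(b, pos_batch, neg_batch, lr):
    # One step of theta <- theta - lr * (E_data[dF/dtheta] - E_model[dF/dtheta]),
    # shown for the visible bias only, where dF/db = -v.
    grad_b = -pos_batch.mean(axis=0) + neg_batch.mean(axis=0)
    return b - lr * grad_b
```

The same pattern applies to the weights and hidden biases, with the model expectation estimated from the PT samples $Y_t$.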