On Tracking The Partition Function:

Supplemental Material

Guillaume Desjardins, Aaron Courville, Yoshua Bengio
{desjagui,courvila,bengioy}@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle, Université de Montréal

This supplementary material provides a more comprehensive view of our log partition function tracking algorithm. The notation is mostly identical to that of the paper and is summarized below for convenience. One minor difference, however, is that we make widespread use of the notation $A_{:,l}$ and $A_{m,:}$ to indicate the vectors formed by the $l$-th column and $m$-th row of matrix $A$ (respectively).

$q_{i,t}$: RBM at inverse temperature $\beta_i$ at time-step $t$; $q_{i,t}(x) = \tilde{q}_{i,t}(x)/Z_{i,t}$, with $i \in [1, M]$.

$\theta$: set of model parameters.

$F_{i,t}(x)$: free energy assigned by $q_{i,t}$ to input configuration $x$.

$\zeta_{i,t}$: log partition function of model $q_{i,t}$.

$X_{i,t}$: mini-batch of samples $\{x^{(n)}_{i,t} \sim q_{i,t}(x);\ n \in [1, N]\}$.

$Y_t$: additional mini-batch of samples $\{x^{(n)}_{1,t} \sim q_{1,t}(x);\ n \in [1, N_Y]\}$, with $N_Y \geq N$.

$\mathcal{D}$: set of training examples.

$\mu_{t,t}, P_{t,t}$: estimated mean and covariance of the posterior distribution $p(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0})$.

$\Sigma_\zeta$: fixed covariance matrix of $p(\zeta_t \mid \zeta_{t-1})$.

$\epsilon_{\mathrm{init}}, \epsilon_t$: initial learning rate and learning rate at time $t$.

$s_{i,t}$: coarse estimate of $Z_{i+1,t}/Z_{i,t}$ used to generate the bridging distribution $q^*_{i,t}$.

$q^*_{i,t}$: bridging distribution with $q^*_{i,t}(x) = \dfrac{\tilde{q}_{i,t}(x)\,\tilde{q}_{i+1,t}(x)}{s_{i,t}\,\tilde{q}_{i,t}(x) + \tilde{q}_{i+1,t}(x)}$.

$r(q_1, q_2, x_1, x_2)$: swap probability used by PT, with $r_{i,t} = \min\left(1,\ \dfrac{\tilde{q}_1(x_2)\,\tilde{q}_2(x_1)}{\tilde{q}_1(x_1)\,\tilde{q}_2(x_2)}\right)$.

1 Parallel Tempering Algorithm

We divide our algorithm into three parts. Algorithm 1 presents the pseudo-code for the Parallel Tempering sampling algorithm. For details, we refer the reader to [1].

Algorithm 1 sample_PT($q_{:,t}$, $X_{:,t-1}$, $k$)

  Initialize $X_{i,t}$ as empty sets, $\forall i \in [1, M]$.
  for $n \in [1, N]$ do
    for $i \in [1, M]$ do
      Initialize the Markov chain associated with model $q_{i,t}$ with state $x^{(n)}_{i,t-1}$.
      Perform $k$ steps of Gibbs sampling, yielding $x^{(n)}_{i,t}$.
    end for
    Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ even $i$.
    Swap $x^{(n)}_{i,t} \leftrightarrow x^{(n)}_{i+1,t}$ with prob. $r(q_{i,t}, q_{i+1,t}, x^{(n)}_{i,t}, x^{(n)}_{i+1,t})$, $\forall$ odd $i$.
    $X_{i,t} \leftarrow X_{i,t} \cup \{x^{(n)}_{i,t}\}$, $\forall i$.
  end for
  return $X_{:,t}$.
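To make the swap phase concrete, here is a minimal NumPy sketch of one sweep of Algorithm 1 for a single particle index $n$. The callbacks `gibbs_k` and `free_energy` are hypothetical stand-ins for the RBM-specific operators, chain indices are 0-based, and the swap test evaluates $r$ in the log domain using the standard identity $\tilde{q}_{i,t}(x) = \exp(-F_{i,t}(x))$.

```python
import numpy as np

def sample_pt_particle(free_energy, gibbs_k, x_prev, rng):
    """One sweep of Algorithm 1 for a single particle index n.

    free_energy(i, x) -> F_{i,t}(x) and gibbs_k(i, x) -> k Gibbs steps
    under q_{i,t} are hypothetical RBM-specific callbacks; x_prev is the
    list of M states x^{(n)}_{i,t-1}.
    """
    M = len(x_prev)
    # Advance every tempered chain by k steps of Gibbs sampling.
    x = [gibbs_k(i, x_prev[i]) for i in range(M)]

    def maybe_swap(i):
        # Since q~_{i,t}(x) = exp(-F_{i,t}(x)), the PT swap ratio is
        # log r = F_i(x_i) + F_{i+1}(x_{i+1}) - F_i(x_{i+1}) - F_{i+1}(x_i).
        log_r = (free_energy(i, x[i]) + free_energy(i + 1, x[i + 1])
                 - free_energy(i, x[i + 1]) - free_energy(i + 1, x[i]))
        if np.log(rng.uniform()) < min(0.0, log_r):
            x[i], x[i + 1] = x[i + 1], x[i]

    for i in range(0, M - 1, 2):  # swap pass over even pairs
        maybe_swap(i)
    for i in range(1, M - 1, 2):  # swap pass over odd pairs
        maybe_swap(i)
    return x
```

Alternating even and odd swap passes lets every adjacent pair attempt a swap per sweep without two swaps competing for the same chain.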

[Figure 1, diagram omitted: a two-slice directed graphical model over $\zeta_{1,t}, \ldots, \zeta_{M,t}$ and $b_t$, in which each $O^{(\Delta t)}_{i,t}$ observation connects $\zeta_{i,t-1}$ and $\zeta_{i,t}$ across time-slices, each $O^{(\Delta\beta)}_{i,t}$ observation connects the adjacent-temperature pair $\zeta_{i,t}$ and $\zeta_{i+1,t}$ within a slice, and the bias terms $b_{t-1}, b_t$ form their own chain.]

System Equations:

$$p(\zeta_0) = \mathcal{N}(\mu_0, \Sigma_0)$$
$$p(\zeta_t \mid \zeta_{t-1}) = \mathcal{N}(\zeta_{t-1}, \Sigma_\zeta)$$
$$p(O^{(\Delta t)}_t \mid \zeta_t, \zeta_{t-1}) = \mathcal{N}\big(C\,[\zeta_t, \zeta_{t-1}]^T, \Sigma_{\Delta t}\big)$$
$$p(O^{(\Delta\beta)}_t \mid \zeta_t) = \mathcal{N}(H\zeta_t, \Sigma_{\Delta\beta})$$

$$C = \begin{bmatrix} I_M & e_1 & -I_M & \mathbf{0} \end{bmatrix},
\qquad
H = \begin{bmatrix}
-1 & +1 & 0 & \cdots & 0 & 0 \\
0 & -1 & +1 & \cdots & 0 & 0 \\
 & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & -1 & +1 & 0
\end{bmatrix},$$

where $e_1 = (1, 0, \ldots, 0)^T$ and $\mathbf{0}$ are column vectors acting on the bias entries $b_t$ and $b_{t-1}$, and the all-zero last column of $H$ leaves $b_t$ out of the $\Delta\beta$ observations.

Figure 1: A directed graphical model for log partition function tracking. The shaded nodes represent observed variables, and the double-walled nodes represent the tractable $\zeta_{M,:}$ with $\beta_M = 0$. For clarity of presentation, we show the bias term as distinct from the other $\zeta_{i,t}$ (recall $b_t = \zeta_{M+1,t}$).

Inference Equations:

(i) $p\big(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t-1:0}, O^{(\Delta\beta)}_{t-1:0}\big) = \mathcal{N}(\eta_{t-1,t-1}, V_{t-1,t-1})$, with
$$\eta_{t-1,t-1} = \begin{bmatrix} \mu_{t-1,t-1} \\ \mu_{t-1,t-1} \end{bmatrix}
\quad\text{and}\quad
V_{t-1,t-1} = \begin{bmatrix} P_{t-1,t-1} & P_{t-1,t-1} \\ P_{t-1,t-1} & \Sigma_\zeta + P_{t-1,t-1} \end{bmatrix}.$$

(ii) $p\big(\zeta_{t-1}, \zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\big) = \mathcal{N}(\eta_{t,t-1}, V_{t,t-1})$, with
$$V_{t,t-1} = \big(V_{t-1,t-1}^{-1} + C^T \Sigma_{\Delta t}^{-1} C\big)^{-1}
\quad\text{and}\quad
\eta_{t,t-1} = V_{t,t-1}\big(C^T \Sigma_{\Delta t}^{-1} O^{(\Delta t)}_t + V_{t-1,t-1}^{-1}\,\eta_{t-1,t-1}\big).$$

(iii) $p\big(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t-1:0}\big) = \mathcal{N}(\mu_{t,t-1}, P_{t,t-1})$, with $\mu_{t,t-1} = [\eta_{t,t-1}]_2$ and $P_{t,t-1} = [V_{t,t-1}]_{2,2}$.

(iv) $p\big(\zeta_t \mid O^{(\Delta t)}_{t:0}, O^{(\Delta\beta)}_{t:0}\big) = \mathcal{N}(\mu_{t,t}, P_{t,t})$, with
$$P_{t,t} = \big(P_{t,t-1}^{-1} + H^T \Sigma_{\Delta\beta}^{-1} H\big)^{-1}
\quad\text{and}\quad
\mu_{t,t} = P_{t,t}\big(H^T \Sigma_{\Delta\beta}^{-1} O^{(\Delta\beta)}_t + P_{t,t-1}^{-1}\,\mu_{t,t-1}\big).$$

Figure 2: Inference equations for our log partition tracking algorithm, a variant of the Kalman filter. For any vector $v$ and matrix $V$, we use the notation $[v]_2$ to denote the vector obtained by preserving the bottom half of the elements of $v$, and $[V]_{2,2}$ to indicate the lower right-hand quadrant of $V$.

2 Kalman Filter

Algorithm 2 presents the tracking algorithm. The statistical estimate $O^{(\Delta t)}_{i,t}$ is computed as an average of importance weights, measured between adjacent models $q_{i,t}$ and $q_{i,t-1}$. The estimate $O^{(\Delta\beta)}_{i,t}$ is computed through bridge sampling, applied to models $q_{i+1,t}$ and $q_{i,t}$. These observations are combined through a Kalman filter, which also exploits a smoothness prior on the evolution of $\zeta_t$. We include the graphical model, system equations and inference equations in Figures 1 and 2 for completeness, but refer the reader to the accompanying paper for a more thorough description of these figures.
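To make the filtering recursion concrete, here is a minimal NumPy sketch of one pass through equations (i) to (iv) of Figure 2. The function name `kalman_step` and its argument layout are our own; $C$, $H$ and the covariance matrices are assumed to be built as in Figure 1.

```python
import numpy as np

def kalman_step(mu, P, C, H, o_dt, o_db, Sigma_z, Sigma_dt, Sigma_db):
    """One filtering step, following equations (i)-(iv) of Figure 2.

    mu, P : posterior mean/covariance at t-1 (mu_{t-1,t-1}, P_{t-1,t-1}),
    o_dt  : importance-sampling observations O^{(dt)}_t,
    o_db  : bridge-sampling observations O^{(db)}_t.
    """
    D = mu.shape[0]
    # (i) joint prior over the stacked state [zeta_{t-1}; zeta_t].
    eta = np.concatenate([mu, mu])
    V = np.block([[P, P], [P, Sigma_z + P]])
    # (ii) condition on O^{(dt)} in information form.
    V_inv = np.linalg.inv(V)
    V_new = np.linalg.inv(V_inv + C.T @ np.linalg.inv(Sigma_dt) @ C)
    eta_new = V_new @ (C.T @ np.linalg.solve(Sigma_dt, o_dt) + V_inv @ eta)
    # (iii) marginalize out zeta_{t-1}: keep the bottom half / lower-right block.
    mu_pred = eta_new[D:]
    P_pred = V_new[D:, D:]
    # (iv) condition on O^{(db)}.
    P_pred_inv = np.linalg.inv(P_pred)
    P_post = np.linalg.inv(P_pred_inv + H.T @ np.linalg.inv(Sigma_db) @ H)
    mu_post = P_post @ (H.T @ np.linalg.solve(Sigma_db, o_db) + P_pred_inv @ mu_pred)
    return mu_post, P_post
```

The information-form updates in (ii) and (iv) make the structure explicit: each observation type simply adds its precision contribution to the prior precision.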


Algorithm 2 kalman_filter($q_{:,t-1}$, $q_{:,t}$, $\mu_{t-1,t-1}$, $P_{t-1,t-1}$, $s_{i,t}$, $\Sigma_\zeta$)

  Using $\mu_{t-1,t-1}$, $P_{t-1,t-1}$ and $\Sigma_\zeta$, compute $\eta_{t-1,t-1}$ and $V_{t-1,t-1}$ through equation (i).
  for $i \in [1, M]$ do
    $w^{(n)}_{i,t} \leftarrow \tilde{q}_{i,t}\big(x^{(n)}_{i,t-1}\big) \,/\, \tilde{q}_{i,t-1}\big(x^{(n)}_{i,t-1}\big)$;
    $O^{(\Delta t)}_{i,t} \leftarrow \log\Big[\frac{1}{N}\sum_{n=1}^{N} w^{(n)}_{i,t}\Big]$;
    $\Sigma_{\Delta t} \leftarrow \mathrm{Diag}\Big[\mathrm{Var}[w_{i,t}] \,\big/\, \big(\textstyle\sum_n w^{(n)}_{i,t}\big)^2\Big]$
  end for
  Using $O^{(\Delta t)}_{i,t}$ and $\Sigma_{\Delta t}$, compute $\eta_{t,t-1}$ and $V_{t,t-1}$ through equation (ii).
  Compute $\mu_{t,t-1}$ and $P_{t,t-1}$ using equation (iii).
  for $i \in [1, M-1]$ do
    $u^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i,t}\big) \,/\, \tilde{q}_{i,t}\big(x^{(n)}_{i,t}\big)$;
    $v^{(n)}_{i,t} \leftarrow q^*_{i,t}\big(x^{(n)}_{i+1,t}\big) \,/\, \tilde{q}_{i+1,t}\big(x^{(n)}_{i+1,t}\big)$;
    $O^{(\Delta\beta)}_{i,t} \leftarrow \log\Big[\frac{1}{N}\sum_{n=1}^{N} u^{(n)}_{i,t}\Big] - \log\Big[\frac{1}{N}\sum_{n=1}^{N} v^{(n)}_{i,t}\Big]$;
    $\Sigma_{\Delta\beta} \leftarrow \mathrm{Diag}\Big[\mathrm{Var}[u_{i,t}] \,\big/\, \big(\textstyle\sum_n u^{(n)}_{i,t}\big)^2 + \mathrm{Var}[v_{i,t}] \,\big/\, \big(\textstyle\sum_n v^{(n)}_{i,t}\big)^2\Big]$
  end for
  Using $O^{(\Delta\beta)}_{i,t}$ and $\Sigma_{\Delta\beta}$, compute $\mu_{t,t}$ and $P_{t,t}$ through equation (iv).
  Return ($\mu_{t,t}$, $P_{t,t}$).
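In practice, the averages of importance weights in Algorithm 2 are best computed in the log domain. The sketch below assumes hypothetical callbacks (`log_q_t`, `log_q_tm1`, `log_q_bridge`, `log_q_i`, `log_q_ip1`) that return unnormalized log-probabilities on a mini-batch; it shows one way to obtain $O^{(\Delta t)}_{i,t}$, $O^{(\Delta\beta)}_{i,t}$ and the corresponding variance entries. Note that the ratio $\mathrm{Var}[w]/(\sum_n w^{(n)})^2$ is invariant to rescaling the weights, which is what makes the max-shift trick safe.

```python
import numpy as np
from scipy.special import logsumexp

def log_avg_weights(log_num, log_den):
    """Log of (1/N) sum_n w^{(n)}, with w^{(n)} = exp(log_num - log_den)[n],
    plus the variance term Var[w] / (sum_n w^{(n)})^2 from Algorithm 2."""
    log_w = log_num - log_den                  # shape (N,)
    N = log_w.shape[0]
    o = logsumexp(log_w) - np.log(N)           # log-domain average of weights
    w = np.exp(log_w - log_w.max())            # rescaled weights; the ratio below
    var = w.var() / w.sum() ** 2               # is invariant to the rescaling
    return o, var

def obs_dt(log_q_t, log_q_tm1, X_old):
    """O^{(dt)}_{i,t}: importance weights between q_{i,t} and q_{i,t-1},
    evaluated on the previous samples x^{(n)}_{i,t-1}."""
    return log_avg_weights(log_q_t(X_old), log_q_tm1(X_old))

def obs_db(log_q_bridge, log_q_i, log_q_ip1, X_i, X_ip1):
    """O^{(db)}_{i,t}: bridge sampling between q_{i,t} and q_{i+1,t}
    through the bridging distribution q*_{i,t}."""
    o_u, var_u = log_avg_weights(log_q_bridge(X_i), log_q_i(X_i))
    o_v, var_v = log_avg_weights(log_q_bridge(X_ip1), log_q_ip1(X_ip1))
    return o_u - o_v, var_u + var_v
```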

3 Simultaneous Tracking and Learning

Finally, Algorithm 3 ties everything together, performing joint training and estimation of the log partition function. Note that using two sets of samples from the target distribution ($X_{1,t}$ and $Y_t$) is not required. Their use is inspired by [2] and allows us to separately tune $N$, the number of "tempered" mini-batches, and $N_Y$, the size of the mini-batch used to estimate the gradient.

Algorithm 3 main

  Initialize $\theta_1$ and compute the exact log partition functions $\zeta_{:,1}$.
  Initialize $\mu_{1,1}$ with $\zeta_{:,1}$, $P_{1,1}[1{:}M, 1{:}M] = 0$ and $P_{1,1}[M{+}1, M{+}1] = \epsilon_{\mathrm{init}} \cdot \sigma_b^2$.
  Initialize samples $X_{:,1}$ and $Y_1$ according to the RBM visible biases.
  Initialize $s_{i,1}$ to $\exp(\zeta_{i+1,1} - \zeta_{i,1})$, $\forall i \in [1, M-1]$.
  for $t \in [2, T]$ do
    Obtain training examples $X^{+}_{t} = \{x^{(n)} \in \mathcal{D};\ n \in [1, N]\}$.
    $\theta_t \leftarrow \theta_{t-1} - \epsilon_t \Big[\frac{1}{N}\sum_{x \in X^{+}_{t}} \frac{\partial F(x;\,\theta_{t-1})}{\partial\theta} - \frac{1}{N_Y}\sum_{y \in Y_t} \frac{\partial F(y;\,\theta_{t-1})}{\partial\theta}\Big]$.
    Choose $N$ samples from $Y_t$ to swap with $X_{1,t}$.
    $X_{:,t} \leftarrow$ sample_PT($q_{:,t}$, $X_{:,t-1}$, $k$).
    $Y_t \leftarrow$ sample_Gibbs($q_{1,t}$, $Y_{t-1}$, $k$).
    $(\mu_{t,t}, P_{t,t}) \leftarrow$ kalman_filter($q_{:,t-1}$, $q_{:,t}$, $\mu_{t-1,t-1}$, $P_{t-1,t-1}$, $s_{i,t}$, $\Sigma_\zeta$).
    $\hat{\zeta}_{:,t} \leftarrow \mu_{t,t}$;  $s_{i,t+1} \leftarrow \exp(\hat{\zeta}_{i+1,t} - \hat{\zeta}_{i,t})$, $\forall i \in [1, M-1]$.
  end for
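Putting the pieces together, the control flow of Algorithm 3 might look as follows in Python. This is a skeletal sketch under our own naming: every attribute of the `h` bundle is a hypothetical callback standing in for a component sketched earlier in this supplement, not an API from the paper.

```python
import numpy as np

def train_and_track(theta, X, Y, mu, P, h, T, N, M, k, rng):
    """Skeleton of Algorithm 3 (main). `h` bundles hypothetical callbacks:
    next_batch, sgd_update, sample_pt, sample_gibbs, make_observations
    and kalman_step."""
    s = np.exp(np.diff(mu[:M]))                      # s_{i,1} from the exact zeta_{:,1}
    for t in range(2, T + 1):
        X_plus = h.next_batch(N)                     # N training examples from D
        theta = h.sgd_update(theta, X_plus, Y)       # free-energy gradient step
        idx = rng.choice(len(Y), size=N, replace=False)
        X[0], Y[idx] = Y[idx].copy(), X[0].copy()    # refresh beta=1 chains from Y_t
        X = h.sample_pt(X, k)                        # Algorithm 1 (tempered chains)
        Y = h.sample_gibbs(Y, k)                     # k Gibbs steps at beta = 1
        obs = h.make_observations(X, theta, s)       # O^{(dt)}, O^{(db)} and variances
        mu, P = h.kalman_step(mu, P, *obs)           # Algorithm 2 / Figure 2
        s = np.exp(np.diff(mu[:M]))                  # s_{i,t+1} = exp(zeta_{i+1,t} - zeta_{i,t})
    return theta, mu, P
```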

References

[1] Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 145-152.

[2] Salakhutdinov, R. (2010). Learning deep Boltzmann machines using adaptive MCMC. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), volume 1, pages 943-950. ACM.
