Information geometry and minimum description length networks

(1)

Proceedings Chapter

Reference

Information geometry and minimum description length networks

SUN, Ke, et al.

Abstract

We study parametric unsupervised mixture learning. We measure the loss of intrinsic information from the observations to complex mixture models, and then to simple mixture models. We present a geometric picture, where all these representations are regarded as free points in the space of probability distributions. Based on minimum description length, we derive a simple geometric principle to learn all these models together. We present a new learning machine with theories, algorithms, and simulations.

SUN, Ke, et al . Information geometry and minimum description length networks. In:

Proceedings of the 32 nd International Conference on Machine Learning . 2015. p.

49-58

Available at:

http://archive-ouverte.unige.ch/unige:73193

Disclaimer: layout of this document may differ from the published version.

(2)

Proofs for “Information Geometry and Minimum Description Length Networks”

An Approximation of ln N (B, α)

As the value of lnN(B,α) does not depend on the choice of the coordinate system, we abuse notation and vary η = η⁽¹⁾, . . . , η^(dim^S)

to an ideal coordinate system, where g(η) is everywhere identity. In theory, this is possible locally. However, for convenience, we assume that such a coordinate system exists globally. By definition,

lnN(B,α) = ln Z

η∈S m

X

i=1

αiexp (−D(ηkη_i))dη

!

= ln

m

X

i=1

αi

Z

η∈S

exp (−D(ηkη_i))dη

!

≈ln

m

X

i=1

αi

Z

η⁽¹⁾

· · · Z

η^(dimS)

exp

−1

2(η−η_i)^Tg(η_i)(η−η_i)

×p

|g(η)|dη⁽¹⁾· · ·dη^(dim^S)

!

= ln

m

X

i=1

α_i Z

η⁽¹⁾

· · · Z

η^(dim^S)

exp

−1

2kη−η_ik²₂

dη⁽¹⁾· · ·dη^(dim^S)

!

≈ln

m

X

i=1

α_iexp

dimS 2 ln(2π)

!

= dimS

2 ln(2π). (S.1)

The first “≈” is by approximatingDto a square distance, which is only accurate whenη andη_i are close enough. The second “≈” is by relaxing the domain of the integration fromSto<^dim^S. This is a rough approximation for a generalS, to show the order of the term lnN(B,α), and to show its weak dependence to Bandα. More accurate approximations based on specific choices ofScan lead

(3)

Proof of E(N , A) ≤ E(N ˆ , A) (HARDN)

Proof. ∀l,∀i,

n_l+1

X

j=1

αl+1,jexp −D(η_likη_l+1,j)

≥max

j

αl+1,jexp −D(η_likη_l+1,j) . (S.2)

As−ln(x) is monotonically decreasing,

E(N, A) =−

L−1

X

l=0 nl

X

i=1

ln





nl+1

X

j=1





≤ −

L−1

X

l=0 n_l

X

i=1

ln max

j

=−

L−1

X

l=0 n_l

X

i=1

maxj

lnαl+1,j−D(η_likη_l+1,j)

=

L−1

X

l=0 nl

X

i=1

minj

−lnαl+1,j+D(η_likη_l+1,j)

= ˆE(N, A). (S.3)

Proof of E(N , A) ≤ E(N ¯ , A, B) (SOFTN)

Proof. Because of the convexity of−ln(x),

E(N, A) =−

L−1

X

l=0 nl

X

i=1

ln





n_l+1

X

j=1





=−

L−1

X

l=0 n_l

X

i=1

ln





n_l+1

X

j=1

β_li^j ·αl+1,jexp −D(η_likη_l+1,j) β^j_li





≤

L−1

X

l=0 n_l

X

i=1 nl+1

X

j=1

β^j_li

"

−ln α_l+1,jexp −D(η_likη_l+1,j) β_li^j

!#

=

L−1

X

l=0 n_l

X

i=1 n_l+1

X

j=1

β^j_li ln β^j_li

α_l+1,j +D(η_likη_l+1,j)

!

= ¯E(N, A, B). (S.4)

(4)

Proof of Theorem 3

Proof. Denote the true distribution with the components{η^t_i}and the weights {α^t_i}byT rue(x). By eq. (7),∀N,∀A, whenn→ ∞,

E(N, A) =−n Z

T rue(x) ln





n₁

X

j=1

α1jexp −D(η(x)kη_1j)



dx

−

L−1

X

l=1 n_l

X

i=1

ln





nl+1

X

j=1





=−n Z

T rue(x) ln





n1

X

j=1

α_1jp x|η_1j



dx+constant

−

L−1

X

l=1 n_l

X

i=1

ln





n_l+1

X

j=1

α_l+1,jexp −D(η_likη_l+1,j)



. (S.5)

We construct an MDL networkN^t, where L^t₁ is given by {η^t_1i =η^t_i} with the weights {α^t_1i = α^t_i}. The rest of the cells {η^t_li} in higher levels, including their weights {α^t_li} are given by the sub-optimal solution which minimizes the above eq. (S.5) withL^t₁ and its weights fixed. Given thatL0 is fixed by infinite samples corresponding to the truth,∀N,∀A,

E(N, A)−E(N^t, A^t) =n Z

T rue(x) ln T rue(x) Pn1

j=1α1jp xi|η_1jdx

−

L−1

X

l=1 n_l

X

i=1

ln





n_l+1

X

j=1

α_l+1,jexp −D(η_likη_l+1,j)





+

L−1

X

l=1 n_l

X

i=1

ln





n_l+1

X

j=1

α^t_l+1,jexp −D(η^t_likη^t_l+1,j)



. (S.6) If{η_1j} in N or {α_1j}in Adoes not correspond toT rue(x), the first term on the right-hand-side of eq. (S.6) will go to +∞ as n → ∞. The second term is always non-negative, because of the non-negativity of D. Because of the sub-optimality discussed earlier, the third term is lower-bounded, as in

L−1

X

n_l

Xln





n_l+1

Xα^t_l+1,jexp −D(η^t_likη^t_l+1,j)



≥ −

n₁

XD(η^t_ikη),˜ (S.7)

(5)

Proof of Theorem 4

Proof. By the definition ofD(η₁kη₂) in section 2.3 as a Bregman divergence,

∀θ(η), we have

gain(η) = D(η₁kη₂)−D(η₁kη)−D(ηkη₂)

= +

ψ^?(η₁)−ψ^?(η₂)−θ^T₂(η₁−η₂)

−

ψ^?(η₁)−ψ^?(η)−θ^T(η₁−η)

−

ψ^?(η)−ψ^?(η₂)−θ^T₂(η−η₂)

= (θ2−θ)^T(η−η₁). (S.8)

Letθlc = (θ1+θ2)/2 be the left-handed Bregman centroid of θ1 and θ2, thenθ2−θlc=θlc−θ1. Therefore,

gain(η_lc) = (θ2−θlc)^T(η_lc−η₁) = (θlc−θ1)^T(η_lc−η₁). (S.9) On the other hand,∀η_a,η_b∈ S, η_a 6=η_b,

D(η_akη_b) +D(η_bkη_a) = +

ψ^?(η_a)−ψ^?(η_b)−θ^T_b(η_a−η_b) +

ψ^?(η_b)−ψ^?(η_a)−θ^T_a(η_b−η_a)

=(θa−θb)^T(η_a−η_b)>0. (S.10) By eqs. (S.9) and (S.10),

gain(η_lc) =D(η_lckη₁) +D(η₁kη_lc)>0 (which proves¬). (S.11) Similarly, we let η_rc = (η₁+η₂)/2 be the righted-handed Bregman centroid, then

gain(η_rc) = (θ₂−θ_rc)^T(η_rc−η₁) = (θ₂−θ_rc)^T(η₂−η_rc)

=D(η₂kη_rc) +D(η_rckη₂). (S.12) By eqs. (S.11) and (S.12),∃η∈ S satisfying

gain(η)≥max{D(η_lckη₁) +D(η₁kη_lc), D(η₂kη_rc) +D(η_rckη₂)}. (S.13)