Information geometry and minimum description length networks

Proceedings Chapter

Reference

SUN, Ke, et al.

Abstract

We study parametric unsupervised mixture learning. We measure the loss of intrinsic information from the observations to complex mixture models, and then to simple mixture models. We present a geometric picture, where all these representations are regarded as free points in the space of probability distributions. Based on minimum description length, we derive a simple geometric principle to learn all these models together. We present a new learning machine with theories, algorithms, and simulations.

SUN, Ke, et al. Information geometry and minimum description length networks. In: Proceedings of the 32nd International Conference on Machine Learning. 2015. p. 49-58

Available at: http://archive-ouverte.unige.ch/unige:73193

Disclaimer: layout of this document may differ from the published version.

Proofs for “Information Geometry and Minimum Description Length Networks”

An Approximation of $\ln N(B, \alpha)$

As the value of $\ln N(B, \alpha)$ does not depend on the choice of the coordinate system, we abuse notation and change $\eta = (\eta^{(1)}, \dots, \eta^{(\dim S)})$ to an ideal coordinate system, where $g(\eta)$ is everywhere the identity. In theory, this is possible locally. However, for convenience, we assume that such a coordinate system exists globally. By definition,

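$$
\begin{aligned}
\ln N(B, \alpha)
&= \ln\left( \int_{\eta \in S} \sum_{i=1}^{m} \alpha_i \exp\left(-D(\eta \,\|\, \eta_i)\right) d\eta \right) \\
&= \ln\left( \sum_{i=1}^{m} \alpha_i \int_{\eta \in S} \exp\left(-D(\eta \,\|\, \eta_i)\right) d\eta \right) \\
&\approx \ln\left( \sum_{i=1}^{m} \alpha_i \int_{\eta^{(1)}} \cdots \int_{\eta^{(\dim S)}} \exp\left(-\frac{1}{2}(\eta - \eta_i)^T g(\eta_i)(\eta - \eta_i)\right) \times \sqrt{|g(\eta)|}\, d\eta^{(1)} \cdots d\eta^{(\dim S)} \right) \\
&= \ln\left( \sum_{i=1}^{m} \alpha_i \int_{\eta^{(1)}} \cdots \int_{\eta^{(\dim S)}} \exp\left(-\frac{1}{2}\|\eta - \eta_i\|_2^2\right) d\eta^{(1)} \cdots d\eta^{(\dim S)} \right) \\
&\approx \ln\left( \sum_{i=1}^{m} \alpha_i \exp\left(\frac{\dim S}{2}\ln(2\pi)\right) \right) \\
&= \frac{\dim S}{2}\ln(2\pi).
\end{aligned}
\tag{S.1}
$$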
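As a quick sanity check on the last two steps of eq. (S.1), the sketch below numerically integrates the Gaussian kernel under the identity metric and compares the result with $\frac{\dim S}{2}\ln(2\pi)$. The mixture weights, the centers, and the helper names `gaussian_integral_1d` and `log_normalizer` are illustrative assumptions, not taken from the paper, and the integration domain is taken to be all of $\Re^{\dim S}$, as in the second approximation.

```python
import math

def gaussian_integral_1d(center, half_width=12.0, steps=20_000):
    """Midpoint-rule estimate of the integral of exp(-(x - center)^2 / 2) over the real line."""
    lo = center - half_width
    h = 2.0 * half_width / steps
    return h * sum(
        math.exp(-0.5 * (lo + (k + 0.5) * h - center) ** 2) for k in range(steps)
    )

def log_normalizer(alphas, centers):
    """ln sum_i alpha_i * integral over R^d of exp(-||eta - eta_i||^2 / 2) d eta.

    With an identity metric the d-dimensional integral factorizes into a
    product of one-dimensional integrals, one per coordinate.
    """
    total = 0.0
    for alpha, center in zip(alphas, centers):
        total += alpha * math.prod(gaussian_integral_1d(c) for c in center)
    return math.log(total)

# Hypothetical example: m = 3 components in a 2-dimensional space.
alphas = [0.2, 0.5, 0.3]
centers = [(0.0, 1.0), (-2.0, 0.5), (3.0, -1.5)]
dim = 2

# The two printed numbers should agree up to numerical integration error.
print(log_normalizer(alphas, centers), 0.5 * dim * math.log(2.0 * math.pi))
```

Because the weights sum to one and each component integrates to the same constant, the result does not depend on the weights or on the centers, which is the weak dependence on $B$ and $\alpha$ that the approximation is meant to show.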

The first “≈” approximates $D$ by a squared distance, which is only accurate when $\eta$ and $\eta_i$ are close enough. The second “≈” relaxes the domain of integration from $S$ to $\Re^{\dim S}$. This is a rough approximation for a general $S$, meant to show the order of the term $\ln N(B, \alpha)$ and its weak dependence on $B$ and $\alpha$. More accurate approximations based on specific choices of $S$ can lead

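Proof of $E(N, A) \le \hat{E}(N, A)$ (HARDN)

Proof. $\forall l$, $\forall i$,

$$
\sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)
\ge \max_{j}\, \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right).
\tag{S.2}
$$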

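As $-\ln(x)$ is monotonically decreasing,

$$
\begin{aligned}
E(N, A)
&= -\sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \\
&\le -\sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \ln \max_{j}\, \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \\
&= -\sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \max_{j} \left( \ln \alpha_{l+1,j} - D(\eta_{li} \,\|\, \eta_{l+1,j}) \right) \\
&= \sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \min_{j} \left( -\ln \alpha_{l+1,j} + D(\eta_{li} \,\|\, \eta_{l+1,j}) \right) \\
&= \hat{E}(N, A).
\end{aligned}
\tag{S.3}
$$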
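The bound in eq. (S.3) holds summand by summand, so it can be spot-checked on a single pair $(l, i)$. The following is a minimal sketch under that reading: the weights and divergence values are random stand-ins, and the function names `soft_cost`, `hard_cost`, and `random_case` are hypothetical.

```python
import math
import random

def soft_cost(alphas, divs):
    """-ln sum_j alpha_j exp(-D_j): one (l, i) summand of E(N, A)."""
    return -math.log(sum(a * math.exp(-d) for a, d in zip(alphas, divs)))

def hard_cost(alphas, divs):
    """min_j (-ln alpha_j + D_j): the matching summand of E-hat(N, A)."""
    return min(-math.log(a) + d for a, d in zip(alphas, divs))

def random_case(rng, n):
    """Random positive weights summing to one and non-negative divergences."""
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    alphas = [r / total for r in raw]
    divs = [rng.uniform(0.0, 5.0) for _ in range(n)]
    return alphas, divs

rng = random.Random(0)
cases = [random_case(rng, 4) for _ in range(1000)]

# Eq. (S.2) followed by the monotonicity of -ln gives soft <= hard in every case.
assert all(soft_cost(a, d) <= hard_cost(a, d) + 1e-12 for a, d in cases)
```

The gap between the two costs is largest when several components explain $\eta_{li}$ about equally well, and it shrinks as one term dominates the sum.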

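Proof of $E(N, A) \le \bar{E}(N, A, B)$ (SOFTN)

Proof. Because of the convexity of $-\ln(x)$,

$$
\begin{aligned}
E(N, A)
&= -\sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \\
&= -\sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \beta^{li}_{j} \cdot \frac{\alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)}{\beta^{li}_{j}} \\
&\le \sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \sum_{j=1}^{n_{l+1}} \beta^{li}_{j} \left[ -\ln\left( \frac{\alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)}{\beta^{li}_{j}} \right) \right] \\
&= \sum_{l=0}^{L-1} \sum_{i=1}^{n_l} \sum_{j=1}^{n_{l+1}} \beta^{li}_{j} \left( \ln \frac{\beta^{li}_{j}}{\alpha_{l+1,j}} + D(\eta_{li} \,\|\, \eta_{l+1,j}) \right) \\
&= \bar{E}(N, A, B).
\end{aligned}
\tag{S.4}
$$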
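The Jensen step in eq. (S.4) can also be checked on a single $(l, i)$ summand. The sketch below assumes that the weights $\beta^{li}_j$ are positive and sum to one over $j$. It additionally illustrates a standard fact not stated in the proof above: the bound is tight when $\beta^{li}_j \propto \alpha_{l+1,j} \exp(-D(\eta_{li} \,\|\, \eta_{l+1,j}))$. All numbers and function names are illustrative assumptions.

```python
import math
import random

def soft_cost(alphas, divs):
    """-ln sum_j alpha_j exp(-D_j): one (l, i) summand of E(N, A)."""
    return -math.log(sum(a * math.exp(-d) for a, d in zip(alphas, divs)))

def variational_cost(alphas, divs, betas):
    """sum_j beta_j (ln(beta_j / alpha_j) + D_j): the matching summand of E-bar(N, A, B)."""
    return sum(b * (math.log(b / a) + d) for a, d, b in zip(alphas, divs, betas))

def normalize(weights):
    """Rescale positive weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def posterior(alphas, divs):
    """beta_j proportional to alpha_j exp(-D_j), the choice that makes the bound tight."""
    return normalize([a * math.exp(-d) for a, d in zip(alphas, divs)])

rng = random.Random(1)
for _ in range(1000):
    alphas = normalize([rng.random() + 1e-3 for _ in range(4)])
    betas = normalize([rng.random() + 1e-3 for _ in range(4)])
    divs = [rng.uniform(0.0, 5.0) for _ in range(4)]
    e = soft_cost(alphas, divs)
    # Jensen: E <= E-bar for any admissible beta.
    assert e <= variational_cost(alphas, divs, betas) + 1e-12
    # Equality at the posterior weights.
    assert math.isclose(
        e, variational_cost(alphas, divs, posterior(alphas, divs)), abs_tol=1e-9
    )
```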

Proof of Theorem 3

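Proof. Denote the true distribution with the components $\{\eta^t_i\}$ and the weights $\{\alpha^t_i\}$ by $\mathrm{True}(x)$. By eq. (7), $\forall N$, $\forall A$, when $n \to \infty$,

$$
\begin{aligned}
E(N, A)
&= -n \int \mathrm{True}(x) \ln \sum_{j=1}^{n_1} \alpha_{1j} \exp\left(-D(\eta(x) \,\|\, \eta_{1j})\right) dx
 - \sum_{l=1}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \\
&= -n \int \mathrm{True}(x) \ln \sum_{j=1}^{n_1} \alpha_{1j}\, p(x \mid \eta_{1j})\, dx + \text{constant}
 - \sum_{l=1}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right).
\end{aligned}
\tag{S.5}
$$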

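We construct an MDL network $N^t$, where $L^t_1$ is given by $\{\eta^t_{1i} = \eta^t_i\}$ with the weights $\{\alpha^t_{1i} = \alpha^t_i\}$. The rest of the cells $\{\eta^t_{li}\}$ in higher levels, including their weights $\{\alpha^t_{li}\}$, are given by the sub-optimal solution which minimizes the above eq. (S.5) with $L^t_1$ and its weights fixed. Given that $L_0$ is fixed by infinite samples corresponding to the truth, $\forall N$, $\forall A$,

$$
\begin{aligned}
E(N, A) - E(N^t, A^t)
&= n \int \mathrm{True}(x) \ln \frac{\mathrm{True}(x)}{\sum_{j=1}^{n_1} \alpha_{1j}\, p(x \mid \eta_{1j})}\, dx \\
&\quad - \sum_{l=1}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha_{l+1,j} \exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \\
&\quad + \sum_{l=1}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha^t_{l+1,j} \exp\left(-D(\eta^t_{li} \,\|\, \eta^t_{l+1,j})\right).
\end{aligned}
\tag{S.6}
$$

If $\{\eta_{1j}\}$ in $N$ or $\{\alpha_{1j}\}$ in $A$ does not correspond to $\mathrm{True}(x)$, the first term on the right-hand side of eq. (S.6) will go to $+\infty$ as $n \to \infty$. The second term is always non-negative, because of the non-negativity of $D$. Because of the sub-optimality discussed earlier, the third term is lower-bounded, as in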

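$$
\sum_{l=1}^{L-1} \sum_{i=1}^{n_l} \ln \sum_{j=1}^{n_{l+1}} \alpha^t_{l+1,j} \exp\left(-D(\eta^t_{li} \,\|\, \eta^t_{l+1,j})\right)
\ge -\sum_{i=1}^{n_1} D(\eta^t_i \,\|\, \tilde{\eta}),
\tag{S.7}
$$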

Proof of Theorem 4

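Proof. By the definition of $D(\eta_1 \,\|\, \eta_2)$ in section 2.3 as a Bregman divergence, $\forall \theta(\eta)$, we have

$$
\begin{aligned}
\mathrm{gain}(\eta)
&= D(\eta_1 \,\|\, \eta_2) - D(\eta_1 \,\|\, \eta) - D(\eta \,\|\, \eta_2) \\
&= + \left[ \psi^\star(\eta_1) - \psi^\star(\eta_2) - \theta_2^T(\eta_1 - \eta_2) \right] \\
&\quad - \left[ \psi^\star(\eta_1) - \psi^\star(\eta) - \theta^T(\eta_1 - \eta) \right] \\
&\quad - \left[ \psi^\star(\eta) - \psi^\star(\eta_2) - \theta_2^T(\eta - \eta_2) \right] \\
&= (\theta_2 - \theta)^T(\eta - \eta_1).
\end{aligned}
\tag{S.8}
$$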

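Let $\theta_{lc} = (\theta_1 + \theta_2)/2$ be the left-handed Bregman centroid of $\theta_1$ and $\theta_2$, then $\theta_2 - \theta_{lc} = \theta_{lc} - \theta_1$. Therefore,

$$
\mathrm{gain}(\eta_{lc}) = (\theta_2 - \theta_{lc})^T(\eta_{lc} - \eta_1) = (\theta_{lc} - \theta_1)^T(\eta_{lc} - \eta_1).
\tag{S.9}
$$

On the other hand, $\forall \eta_a, \eta_b \in S$, $\eta_a \ne \eta_b$,

$$
\begin{aligned}
D(\eta_a \,\|\, \eta_b) + D(\eta_b \,\|\, \eta_a)
&= + \left[ \psi^\star(\eta_a) - \psi^\star(\eta_b) - \theta_b^T(\eta_a - \eta_b) \right] \\
&\quad + \left[ \psi^\star(\eta_b) - \psi^\star(\eta_a) - \theta_a^T(\eta_b - \eta_a) \right] \\
&= (\theta_a - \theta_b)^T(\eta_a - \eta_b) > 0.
\end{aligned}
\tag{S.10}
$$

By eqs. (S.9) and (S.10),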

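$$
\mathrm{gain}(\eta_{lc}) = D(\eta_{lc} \,\|\, \eta_1) + D(\eta_1 \,\|\, \eta_{lc}) > 0 \quad \text{(which proves ①)}.
\tag{S.11}
$$

Similarly, we let $\eta_{rc} = (\eta_1 + \eta_2)/2$ be the right-handed Bregman centroid, then

$$
\begin{aligned}
\mathrm{gain}(\eta_{rc})
&= (\theta_2 - \theta_{rc})^T(\eta_{rc} - \eta_1) = (\theta_2 - \theta_{rc})^T(\eta_2 - \eta_{rc}) \\
&= D(\eta_2 \,\|\, \eta_{rc}) + D(\eta_{rc} \,\|\, \eta_2).
\end{aligned}
\tag{S.12}
$$

By eqs. (S.11) and (S.12), $\exists \eta \in S$ satisfying

$$
\mathrm{gain}(\eta) \ge \max\left\{ D(\eta_{lc} \,\|\, \eta_1) + D(\eta_1 \,\|\, \eta_{lc}),\; D(\eta_2 \,\|\, \eta_{rc}) + D(\eta_{rc} \,\|\, \eta_2) \right\}.
\tag{S.13}
$$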
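The identities in eqs. (S.8), (S.11), and (S.12) can be verified numerically for a concrete Bregman generator. The sketch below assumes $\psi^\star(\eta) = \sum_k \eta_k \ln \eta_k$ on positive vectors, so that $\theta = \nabla \psi^\star(\eta) = \ln \eta + 1$. This generator, the two test points, and all function names are illustrative choices, not the paper's setting.

```python
import math

def psi_star(eta):
    """Assumed generator: psi*(eta) = sum_k eta_k ln eta_k (strictly convex for eta > 0)."""
    return sum(e * math.log(e) for e in eta)

def theta_of(eta):
    """Dual coordinates theta = grad psi*(eta) = ln(eta) + 1."""
    return [math.log(e) + 1.0 for e in eta]

def eta_of(theta):
    """Inverse map of theta_of."""
    return [math.exp(t - 1.0) for t in theta]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def divergence(eta_a, eta_b):
    """D(eta_a || eta_b) = psi*(eta_a) - psi*(eta_b) - theta_b^T (eta_a - eta_b)."""
    diff = [a - b for a, b in zip(eta_a, eta_b)]
    return psi_star(eta_a) - psi_star(eta_b) - dot(theta_of(eta_b), diff)

def gain(eta, eta1, eta2):
    """gain(eta) = D(eta1 || eta2) - D(eta1 || eta) - D(eta || eta2)."""
    return divergence(eta1, eta2) - divergence(eta1, eta) - divergence(eta, eta2)

def gain_closed_form(eta, eta1, eta2):
    """Right-hand side of eq. (S.8): (theta2 - theta)^T (eta - eta1)."""
    d_theta = [t2 - t for t2, t in zip(theta_of(eta2), theta_of(eta))]
    d_eta = [e - e1 for e, e1 in zip(eta, eta1)]
    return dot(d_theta, d_eta)

# Hypothetical test points.
eta1 = [0.2, 1.5, 0.7]
eta2 = [1.1, 0.4, 2.0]

# Left-handed centroid averages in theta; right-handed centroid averages in eta.
eta_lc = eta_of([(a + b) / 2.0 for a, b in zip(theta_of(eta1), theta_of(eta2))])
eta_rc = [(a + b) / 2.0 for a, b in zip(eta1, eta2)]

sym_lc = divergence(eta_lc, eta1) + divergence(eta1, eta_lc)
sym_rc = divergence(eta2, eta_rc) + divergence(eta_rc, eta2)

assert math.isclose(gain(eta_lc, eta1, eta2), gain_closed_form(eta_lc, eta1, eta2), abs_tol=1e-9)
assert math.isclose(gain(eta_lc, eta1, eta2), sym_lc, abs_tol=1e-9) and sym_lc > 0.0  # eq. (S.11)
assert math.isclose(gain(eta_rc, eta1, eta2), sym_rc, abs_tol=1e-9) and sym_rc > 0.0  # eq. (S.12)
```

Either centroid gives a strictly positive gain, so the larger of the two symmetrized divergences is a lower bound on the best achievable gain, as stated in eq. (S.13).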
