Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering

(1)

HAL Id: hal-02400486

https://hal.archives-ouvertes.fr/hal-02400486

Submitted on 9 Dec 2019

Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering

Christophe Biernacki, Matthieu Marbac-Lourdelle, Vincent Vandewalle

To cite this version:

Christophe Biernacki, Matthieu Marbac-Lourdelle, Vincent Vandewalle. Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering. SPSR 2019, Apr 2019, Bucharest, Romania. ⟨hal-02400486⟩

(2)

Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering

M. Marbac-Lourdelle, C. Biernacki, V. Vandewalle

SPSR 2019

10-11 May 2019, Bucharest, Romania

1/22

(3)

Take home message

Data/density visualization

Traditionally: choose the form of the mapping from $\mathcal{X}$ to $\mathcal{Y}$ for user convenience

Proposal: choose the form of the density of the data on $\mathcal{Y}$ for cluster interpretation convenience

2/22

(4)

Outline

1 Clustering: from modeling to visualizing

2 Mapping clusters as spherical Gaussians

3 Numerical illustrations for functional data

4 Discussion

3/22

(5)

Model-based clustering: pitch¹

Data set: $x = (x_1, \ldots, x_n)$, each $x_i \in \mathcal{X}$ with $d_X$ variables (possibly mixing continuous, categorical, functional, ...)

Unknown partition in $K$ clusters: $z = (z_1, \ldots, z_n)$ with binary notation $z_i = (z_{i1}, \ldots, z_{iK})$

Statistical model: couples $(x_i, z_i)$ independently arise from the parametrized pdf

$$\underbrace{f(x_i, z_i)}_{\in \mathcal{F}} = \prod_{k=1}^{K} \left[\pi_k f_k(x_i)\right]^{z_{ik}} \quad\Rightarrow\quad f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i)$$

Estimating $f$: implement the MLE principle through an EM-like algorithm

Estimating $K$: use an information criterion such as BIC, ICL, ...

Estimating $z$: use the MAP principle $\hat z_{ik} = 1$ iff $k = \arg\max_{\ell} t_{i\ell}(\hat f)$, where

$$t_{ik}(f) = p(z_{ik} = 1 \mid x_i; f) = \frac{\pi_k f_k(x_i)}{\underbrace{\sum_{\ell=1}^{K} \pi_\ell f_\ell(x_i)}_{f(x_i)}}$$

¹ See for instance [McLachlan & Peel 2004], [Biernacki 2017]
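As a minimal sketch of the MAP step, the conditional probabilities $t_{ik}$ and the resulting labels can be computed as below. The univariate Gaussian components, the parameter values and the function names are illustrative assumptions, not taken from the slides; in practice the parameters would come from an EM fit.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the univariate Gaussian N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probabilities(x, weights, means, sds):
    """t_ik(f) = pi_k f_k(x_i) / sum_l pi_l f_l(x_i) for one observation x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)  # this is the mixture density f(x_i)
    return [j / total for j in joint]

def map_label(probabilities):
    """MAP rule: index k that maximizes t_ik."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])

# Hypothetical fitted mixture with K = 2 components (illustrative values only).
weights, means, sds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
data = [-0.3, 1.9, 2.1, 4.2]
posteriors = [posterior_probabilities(x, weights, means, sds) for x in data]
labels = [map_label(t) for t in posteriors]
```

With equal weights and variances the decision boundary sits halfway between the two means, so the first two observations fall in cluster 0 and the last two in cluster 1.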

4/22

(6)

Model-based clustering: poor user-friendly understanding

$n$ or $K$ large: poor overview of partition $\hat z$

$d_X$ large: too many parameters to embrace as a whole in $\hat f_k$

Complex $\mathcal{X}$: specific and non-trivial parameters involved in $\hat f_k$

Visualization procedures

Aim at proposing user-friendly understanding of the mathematical clustering results

5/22

(7)

Overview of clustering visualization: individual and pdf mappings

Individual mapping: visualize $x$ and its estimated partition $\hat z$

Transforms $x$, defined on $\mathcal{X}$, into $y = (y_1, \ldots, y_n)$, defined on a new space $\mathcal{Y}$

$$M^{ind} \in \mathcal{M}^{ind}: \quad x \in \mathcal{X}^n \mapsto y = M^{ind}(x) \in \mathcal{Y}^n$$

Many methods, depending on the definition of $\mathcal{X}$: PCA, MCA, MFA, FPCA, MDS, ...

Some of them use $\hat z$ in $M^{ind}$: LDA, mixture entropy preservation [Scrucca 2010]

Nearly always, $\mathcal{Y} = \mathbb{R}^2$

Model $\hat f(x, z)$ is not taken into account; the approach is focused on $x$

Pdf mapping: display information relative to the $f$ distribution

Transforms $f = \sum_k \pi_k f_k \in \mathcal{F}$ into a new mixture $g = \sum_k \pi_k g_k \in \mathcal{G}$

$$M^{pdf} \in \mathcal{M}^{pdf}: \quad f \in \mathcal{F} \mapsto g = M^{pdf}(f) \in \mathcal{G}$$

$\mathcal{G}$ is a pdf family defined on the space $\mathcal{Y}$

$M^{pdf}$ is often obtained as a by-product of $M^{ind}$ (change-of-variable formula)

$\mathcal{G}$ is not a usual mixture family (Gaussian, ...) when $M^{ind}$ is a nonlinear mapping

For large $n$, $M^{ind}$ finally displays $M^{pdf}$

Often, both $y$ and $g$ are overlaid
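To illustrate how a pdf mapping can follow from an individual mapping, the sketch below uses the simplest case: a linear map $y = Ax$ applied to a Gaussian component, whose image is again Gaussian with mean $A m$ and covariance $A \Sigma A^\top$. The matrix, the parameter values and the helper names are hypothetical and chosen only for illustration.

```python
def mat_vec(a, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def mat_mul(a, b):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def map_gaussian_component(a, mean, cov):
    """Push N(mean, cov) through y = A x: the image is N(A mean, A cov A^T)."""
    return mat_vec(a, mean), mat_mul(mat_mul(a, cov), transpose(a))

# Hypothetical linear map from R^3 to R^2 (keeps the first two coordinates).
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
mean_x = [1.0, 2.0, 3.0]
cov_x = [[2.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 4.0]]

# Individual mapping: one observation x_i is sent to y_i = A x_i.
y_i = mat_vec(A, [0.5, -1.0, 2.0])
# Pdf mapping: the Gaussian component f_k is sent to the Gaussian g_k.
mean_y, cov_y = map_gaussian_component(A, mean_x, cov_x)
```

For a nonlinear map no such closed form holds in general, which is why the image family is then no longer a usual mixture family.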

6/22

(9)

Traditional visualization strategies and proposal

Traditional strategies: controlling the mapping family $\mathcal{M}^{pdf}$ ²

$$\underbrace{\mathcal{G}(\mathcal{M}^{pdf})}_{\text{uncontrolled}} = \Big\{ g : g = M^{pdf}(f),\ f \in \mathcal{F},\ \underbrace{M^{pdf} \in \mathcal{M}^{pdf}}_{\text{controlled}} \Big\}$$

Nature of $\mathcal{G}$ can dramatically depend on the choice of $\mathcal{M}^{pdf}$

It can potentially lead to very different cluster shapes!

Arguments for traditional $\mathcal{M}^{pdf}$: user-friendly, easy to compute

Examples: linear mappings in all PCA-like methods

Proposed strategy: controlling the pdf family $\mathcal{G}$

$$\underbrace{\mathcal{M}^{pdf}(\mathcal{G})}_{\text{uncontrolled}} = \Big\{ M^{pdf} : g = M^{pdf}(f),\ f \in \mathcal{F},\ \underbrace{g \in \mathcal{G}}_{\text{controlled}} \Big\}$$

It is the reversed situation, where $\mathcal{G}$ is controlled instead of $\mathcal{M}^{pdf}$

Offers the opportunity to directly impose $\mathcal{G}$ to be a user-friendly mixture family

Strategy $\mathcal{M}$ and Strategy $\mathcal{G}$ are both valid, but Strategy $\mathcal{G}$ is rarely explored!

² Similar thinking with $\mathcal{M}^{ind}$

7/22

(12)

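Spherical Gaussians as candidates

Users are usually familiar with multivariate spherical Gaussians on $\mathcal{Y} = \mathbb{R}^{d_Y}$

Thus a simple and "user-friendly" candidate $g$ is a mixture of spherical Gaussians

$$g(y; \mu) = \sum_{k=1}^{K} \underbrace{\pi_k}_{\text{from } f}\, \phi_{d_Y}(y; \underbrace{\mu_k}_{?}, I)$$

where $\mu = (\mu_1, \ldots, \mu_K)$ and $\phi_{d_Y}(\cdot\,; \mu_k, I)$ is the pdf of the Gaussian distribution with expectation $\mu_k = (\mu_{k1}, \ldots, \mu_{k d_Y}) \in \mathbb{R}^{d_Y}$ and covariance matrix equal to the identity $I$

$g(\cdot\,; \mu)$ should then be linked with $f$ in order to define a sensible $\mathcal{G}$

$$\mathcal{G} = \{ g : g(\cdot\,; \mu),\ \mu \in \arg\min \delta(f, g(\cdot\,; \mu)),\ f \in \mathcal{F} \}$$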
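The candidate density $g(y; \mu)$ is cheap to evaluate. The following is a small sketch with hypothetical weights and centers for $K = 3$ clusters in $\mathbb{R}^2$; the function names are ours.

```python
import math

def spherical_gaussian_pdf(y, center):
    """phi_d(y; center, I): Gaussian density with identity covariance."""
    d = len(y)
    sq_dist = sum((yi - ci) ** 2 for yi, ci in zip(y, center))
    return math.exp(-0.5 * sq_dist) / (2.0 * math.pi) ** (d / 2.0)

def mixture_pdf(y, weights, centers):
    """g(y; mu) = sum_k pi_k phi_d(y; mu_k, I)."""
    return sum(w * spherical_gaussian_pdf(y, c) for w, c in zip(weights, centers))

# Hypothetical example: the proportions pi_k would be taken from f,
# the centers mu_k are the unknowns that remain to be chosen.
weights = [0.5, 0.3, 0.2]
centers = [[3.0, 0.0], [1.0, 2.0], [0.0, 0.0]]
density_at_origin = mixture_pdf([0.0, 0.0], weights, centers)
```

Only the centers are free here: the proportions are inherited from $f$ and every covariance matrix is fixed to the identity.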

9/22

(13)

$g$ as the "clustering twin" of $f$

Question: how to choose $\delta$, since generally $\mathcal{X} \neq \mathcal{Y}$?

Answer: in our clustering context, $\delta$ should measure the difference in clustering ability

Kullback-Leibler divergence of clustering ability between $f$ and $g(\cdot\,; \mu)$ ²

$$\delta_{KL}(f, g(\cdot\,; \mu)) = \int_{\mathcal{T}} p_f(t) \ln \frac{p_f(t)}{p_g(t; \mu)}\, dt$$

where

$p_f$: pdf of the probabilities of classification $t(f) = (t_i(f))_{i=1}^{n}$, with $t_i(f) = (t_{ik}(f))_{k=1}^{K-1}$

$p_g(\cdot\,; \mu)$: pdf of the probabilities of classification $t(g) = (t_i(g))_{i=1}^{n}$, with $t_i(g) = (t_{ik}(g))_{k=1}^{K-1}$

$\mathcal{T} = \{ t : t = (t_1, \ldots, t_{K-1}),\ t_k > 0,\ \sum_k t_k = 1 \}$

Thus, $g$ should produce a distribution of the class membership posterior probabilities similar to the one resulting from $f$.

A natural requirement: $p_g(\cdot\,; \mu)$ and $g$ should be linked by a one-to-one mapping

Currently not true, since rotations and/or translations are possible

It means: for one distribution $f$, there is a unique optimal distribution $g(\cdot\,; \mu)$

Additional constraints on $g(\cdot\,; \mu)$: $d_Y = K - 1$, $\mu_K = 0$, $\mu_{kh} = 0$ ($h > k$), $\mu_{kk} \geq 0$
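One way to see why these constraints give a one-to-one link is the short derivation below. It is added for illustration and does not appear on the slides. It assumes that the classification probabilities of $g$ are those of the spherical mixture with $\mu_K = 0$, and writes $M$ and $b(t)$ for quantities introduced here.

```latex
\ln\frac{t_k(y;g)}{t_K(y;g)}
  = \ln\frac{\pi_k}{\pi_K} + \mu_k^\top y - \tfrac{1}{2}\lVert\mu_k\rVert^2,
  \qquad k = 1,\ldots,K-1
% Stacking the K-1 equations gives the linear system M y = b(t), with
% M = (\mu_1,\ldots,\mu_{K-1})^\top and
% b_k(t) = \ln\frac{t_k}{t_K} - \ln\frac{\pi_k}{\pi_K} + \tfrac{1}{2}\lVert\mu_k\rVert^2
```

With $d_Y = K - 1$ and $\mu_{kh} = 0$ for $h > k$, the matrix $M$ is square and lower triangular, so whenever every $\mu_{kk}$ is strictly positive each $t$ corresponds to exactly one $y$. Fixing $\mu_K = 0$ removes the translation freedom, and the triangular and sign constraints remove the rotation freedom.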

² $p_f$ is the reference measure

10/22

(15)

Estimating the Gaussian centers

The Kullback-Leibler divergence $\delta_{KL}$ generally has no closed form

Estimate it by the following consistent (in $S$) Monte Carlo expression

$$\hat\delta_{KL}(f, g(\cdot\,; \mu)) = -\underbrace{\frac{1}{S} \sum_{s=1}^{S} \ln p_g(t^{(s)}; \mu)}_{L(\mu;\, t)} + \text{cst}$$

with $S$ independent draws of conditional probabilities $t = (t^{(1)}, \ldots, t^{(S)})$ from $p_f$

$L(\mu; t)$ is the normalized (observed-data) log-likelihood function of a mixture model, so minimizing $\hat\delta_{KL}$ amounts to maximizing it

But, by construction, all the conditional probabilities are fixed in this mixture

Thus, just maximize the normalized complete-data log-likelihood $L_{comp}(\mu; t)$:

$K = 2$: this maximization is straightforward

$K > 2$: use a standard quasi-Newton algorithm with different random initializations, to avoid possible local optima
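The sketch below illustrates the $K = 2$ case under stated assumptions, and is not the authors' implementation. It maximizes the observed-data form $L(\mu; t)$ directly, by a plain grid search rather than the complete-data form. It uses a change-of-variable expression for $p_g$ that we derived for $\mu_1 = m > 0$, $\mu_2 = 0$, namely $p_g(t; m) = g(y(t); m) / (m\, t_1 t_2)$. The draws are simulated from a known $g$ as a stand-in for draws from $p_f$.

```python
import math
import random

def normal_pdf(y, mean):
    """Density of N(mean, 1) at y."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_p_g(t1, m, pi1):
    """ln p_g(t; mu) for K = 2, with centers mu_1 = m > 0 and mu_2 = 0."""
    t2, pi2 = 1.0 - t1, 1.0 - pi1
    # Invert the posterior: ln(t1/t2) = ln(pi1/pi2) + m*y - m^2/2.
    y = (math.log(t1 / t2) - math.log(pi1 / pi2) + 0.5 * m * m) / m
    # ln g(y; m), computed with log-sum-exp to avoid underflow.
    log_a = math.log(pi1) - 0.5 * (y - m) ** 2
    log_b = math.log(pi2) - 0.5 * y ** 2
    top = max(log_a, log_b)
    log_g = (top + math.log(math.exp(log_a - top) + math.exp(log_b - top))
             - 0.5 * math.log(2.0 * math.pi))
    # Change of variable: |dy/dt1| = 1 / (m * t1 * t2).
    return log_g - math.log(m * t1 * t2)

def normalized_log_likelihood(m, draws, pi1):
    """L(mu; t) = (1/S) * sum_s ln p_g(t^(s); mu)."""
    return sum(log_p_g(t1, m, pi1) for t1 in draws) / len(draws)

def estimate_center(draws, pi1, grid):
    """Pick the grid value of m that maximizes L(mu; t)."""
    return max(grid, key=lambda m: normalized_log_likelihood(m, draws, pi1))

# Synthetic stand-in for draws from p_f: posteriors simulated from a known g.
random.seed(0)
pi1, m_true, S = 0.4, 2.0, 2000
draws = []
for _ in range(S):
    center = m_true if random.random() < pi1 else 0.0
    y = random.gauss(center, 1.0)
    a, b = pi1 * normal_pdf(y, m_true), (1.0 - pi1) * normal_pdf(y, 0.0)
    draws.append(a / (a + b))

grid = [0.1 * j for j in range(1, 61)]  # candidate values 0.1, 0.2, ..., 6.0
m_hat = estimate_center(draws, pi1, grid)
```

Because the draws come from a $g$ with center `m_true`, the estimate `m_hat` should land close to it. For $K > 2$ the grid search would be replaced by a numerical optimizer, as the slide suggests.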

11/22

(16)

From a multivariate to a bivariate Gaussian mixture

$g$ is defined on $\mathbb{R}^{K-1}$, but it is more convenient to be on $\mathbb{R}^2$

Just apply LDA on $g$ to display this distribution on its most discriminative map

It leads to the bivariate spherical Gaussian mixture $\tilde g$

$$\tilde g(\tilde y; \tilde\mu) = \sum_{k=1}^{K} \pi_k\, \phi_2(\tilde y; \tilde\mu_k, I),$$

where $\tilde y \in \mathbb{R}^2$, $\tilde\mu = (\tilde\mu_1, \ldots, \tilde\mu_K)$ and $\tilde\mu_k \in \mathbb{R}^2$

Use the % of inertia of LDA to measure the quality of the mapping from $g$ to $\tilde g$

Remark

If $\mathcal{X} = \mathbb{R}^d$ and $f$ is a Gaussian mixture with isotropic covariance matrices, then the proposed mapping is equivalent to applying an LDA to the centers of $f$
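Since every component of $g$ has identity covariance, the LDA axes of $g$ can be obtained, under that assumption, as the leading eigenvectors of the between-cluster scatter matrix of the centers. The sketch below does this with a basic power iteration. The centers, weights and function names are hypothetical, and a linear algebra library would normally replace the hand-written eigen-solver.

```python
import math

def between_scatter(weights, centers):
    """B = sum_k pi_k (mu_k - mu_bar)(mu_k - mu_bar)^T."""
    d = len(centers[0])
    mean = [sum(w * c[j] for w, c in zip(weights, centers)) for j in range(d)]
    dev = [[c[j] - mean[j] for j in range(d)] for c in centers]
    return [[sum(w * v[i] * v[j] for w, v in zip(weights, dev))
             for j in range(d)] for i in range(d)]

def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def top_eigenpairs(matrix, count, iterations=500):
    """Leading eigenpairs of a symmetric PSD matrix: power iteration + deflation."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for r in range(count):
        v = [1.0 / (1.0 + i + r) for i in range(d)]  # deterministic start
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        value = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((value, v))
        for i in range(d):  # deflate: remove the axis just found
            for j in range(d):
                work[i][j] -= value * v[i] * v[j]
    return pairs

# Hypothetical centers for K = 4 clusters in R^3 (mu_K = 0, triangular pattern).
weights = [0.25, 0.25, 0.25, 0.25]
centers = [[4.0, 0.0, 0.0], [1.0, 3.0, 0.0], [1.0, 1.0, 0.5], [0.0, 0.0, 0.0]]
B = between_scatter(weights, centers)
pairs = top_eigenpairs(B, 2)
axes = [vec for _, vec in pairs]
# Bivariate centers: coordinates of each mu_k on the two discriminant axes.
centers_2d = [[sum(a * c for a, c in zip(axis, mu)) for axis in axes] for mu in centers]
trace = sum(B[i][i] for i in range(len(B)))
inertia_percent = [100.0 * value / trace for value, _ in pairs]
```

The two entries of `inertia_percent` play the role of the % of inertia by axis: the closer their sum is to 100, the less is lost when going from $g$ to $\tilde g$.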

12/22

(17)

Overall accuracy of the mapping between $f$ and $\tilde g$

Use the following difference between the normalized entropies of $f$ and $\tilde g$

$$\delta_E(f, \tilde g) = -\frac{1}{\ln K} \sum_{k=1}^{K} \left\{ \int_{\mathcal{X}} t_k(x; f) \ln t_k(x; f)\, dx - \int_{\mathbb{R}^2} t_k(\tilde y; \tilde g) \ln t_k(\tilde y; \tilde g)\, d\tilde y \right\}$$

Such a quantity can be easily estimated by empirical values

Its meaning is particularly relevant:

$\delta_E(f, \tilde g) \approx 0$: the component overlap conveyed by $\tilde g$ (over $f$) is accurate

$\delta_E(f, \tilde g) \approx 1$: $\tilde g$ strongly underestimates the component overlap of $f$

$\delta_E(f, \tilde g) \approx -1$: $\tilde g$ strongly overestimates the component overlap of $f$

$\delta_E(f, \tilde g)$ makes it possible to evaluate the bias of the visualization
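A possible empirical estimate is sketched below, assuming the two integrals are read as expectations and replaced by averages over posterior probability vectors: those of $f$ at the observed $x_i$, and those of $\tilde g$ at points drawn from $\tilde g$. The function names and the toy inputs are ours.

```python
import math

def normalized_entropy(posteriors):
    """-(1 / (n ln K)) * sum_i sum_k t_ik ln t_ik, with 0 ln 0 taken as 0."""
    n, k = len(posteriors), len(posteriors[0])
    total = sum(t * math.log(t) for row in posteriors for t in row if t > 0.0)
    return -total / (n * math.log(k))

def entropy_difference(posteriors_f, posteriors_g):
    """Empirical delta_E: normalized entropy under f minus that under g-tilde."""
    return normalized_entropy(posteriors_f) - normalized_entropy(posteriors_g)

# Toy inputs (illustrative only): f is maximally ambiguous, g-tilde is not at all.
posteriors_f = [[0.5, 0.5], [0.5, 0.5]]
posteriors_g = [[1.0, 0.0], [0.0, 1.0]]
delta_e = entropy_difference(posteriors_f, posteriors_g)
```

In this extreme toy case the difference equals 1, the situation where $\tilde g$ strongly underestimates the overlap of $f$; swapping the two arguments gives $-1$.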

13/22

(18)

Drawing $\tilde g$

Cluster centers: the locations of $\tilde\mu_1, \ldots, \tilde\mu_K$ are materialized by vectors

Cluster spread: the 95% confidence level is displayed by a black border

Cluster overlap: iso-probability curves of the MAP classification for different levels

Mapping accuracy: $\delta_E(f, \tilde g)$ and also % of inertia by axis

14/22

(20)

Bike sharing system: data³ and model

Station occupancy data collected over the course of one month on the bike sharing system in Paris

Data collected over 5 weeks, between February 24 and March 30, 2014, on 1,189 bike stations

Functional data: station status information (available bikes/docks) downloaded every hour from the open-data APIs of the JCDecaux company

The final data set contains 1,189 loading profiles, one per station, sampled at 1,448 time points

Model: profiles of the stations were projected on a basis of 25 Fourier functions

Model-based clustering of these functional data [Bouveyron et al. 2015] with the R package FunFEM [Bouveyron 2015]

Retain 10 clusters

Visualization using the ClusVis R package

³ [Bouveyron et al. (2015)]

16/22

(21)

Bike sharing system: cluster of curves visualization

17/22

(22)

Mapping of $f$ on this graph is accurate because $\delta_E(f, \tilde g) = -0.03$

18/22

(24)

Conclusion and extensions

Conclusion

Generic method for visualizing the results of a model-based clustering

Very easy to understand output since "Gaussian-like"

Permits visualization for any type of data, because it is only based on the probabilities of classification

Can be used after any existing package of model-based clustering

The overall accuracy of the visualization is also provided

Extensions

Possibility to explore other pdf visualizations than Gaussians

However, one should keep in mind that simple visualizations are targeted

Possibility to compare pdf candidates through $\delta_{KL}$ or $\delta_E$

20/22

(25)

About individual visualization

Theoretically, it is impossible to obtain an individual visualization from a pdf visualization

However, we can propose a pseudo scatter plot of $x$ as follows

$$x_i \longmapsto t_i(f) = t_i(g) \overset{\text{bijection}}{\longmapsto} y_i \in \mathbb{R}^{K-1} \overset{\text{LDA}}{\longmapsto} \tilde y_i \in \mathbb{R}^2$$

$\tilde y$ only allows visualizing the classification position of $x$

Caution: do not overlay pdf and individual plots, since $\tilde y = (\tilde y_1, \ldots, \tilde y_n)$ is not necessarily drawn from a Gaussian mixture
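The bijection step can be sketched as follows, assuming the constraints $\mu_K = 0$, $\mu_{kh} = 0$ for $h > k$ and $\mu_{kk} > 0$. Under them, $\ln(t_k / t_K) = \ln(\pi_k / \pi_K) + \mu_k^\top y - \lVert\mu_k\rVert^2 / 2$ is a lower-triangular linear system in $y$ that forward substitution solves. The function names and numerical values are hypothetical, and the final LDA projection to $\mathbb{R}^2$ is not shown.

```python
import math

def posteriors_from_y(y, weights, centers):
    """t_k(y; g) for a spherical Gaussian mixture with identity covariance."""
    logs = [math.log(w) - 0.5 * sum((a - b) ** 2 for a, b in zip(y, c))
            for w, c in zip(weights, centers)]
    top = max(logs)
    expd = [math.exp(v - top) for v in logs]
    total = sum(expd)
    return [e / total for e in expd]

def y_from_posteriors(t, weights, centers):
    """Invert t -> y by forward substitution (requires the stated constraints)."""
    K = len(weights)
    y = []
    for k in range(K - 1):
        mu = centers[k]
        rhs = (math.log(t[k] / t[K - 1]) - math.log(weights[k] / weights[K - 1])
               + 0.5 * sum(m * m for m in mu))
        rhs -= sum(mu[h] * y[h] for h in range(k))  # already-solved coordinates
        y.append(rhs / mu[k])
    return y

# Hypothetical mixture g with K = 3, so y lives in R^2.
weights = [0.5, 0.3, 0.2]
centers = [[3.0, 0.0], [1.0, 2.0], [0.0, 0.0]]

# Round trip: a point y, its classification probabilities, and y recovered.
y_true = [0.7, -0.4]
t = posteriors_from_y(y_true, weights, centers)
y_back = y_from_posteriors(t, weights, centers)
```

In the pseudo scatter plot, `t` would be the vector $t_i(f)$ computed from the fitted model on $x_i$ rather than probabilities generated from a known `y_true`.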

21/22

(26)

Model-based clustering: flexibility of F for complex X

Continuous data ($\mathcal{X} = \mathbb{R}^{d_X}$): multivariate Gaussian/$t$ distributions [McNicholas 2016]

Categorical data: product of multinomial distributions [Goodman 1974]

Mixing continuous/categorical: product of Gaussian/multinomial [Moustaki & Papageorgiou 2005]

Functional data: the discriminative functional mixture [Bouveyron et al. 2015]

Network data: the Erdős-Rényi mixture [Zanghi et al. 2008]

Other kinds of data, missing data, high dimension,. . .

22/22