Variational inference for probabilistic Poisson PCA

(1)

HAL Id: hal-01608912

https://hal.archives-ouvertes.fr/hal-01608912

Submitted on 3 Oct 2017

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Julien Chiquet, Mahendra Mariadassou, Stephane Robin

To cite this version:

Julien Chiquet, Mahendra Mariadassou, Stephane Robin. Variational inference for probabilistic Pois-son PCA. [University works] auto-saisine. 2017. �hal-01608912�

(2)

VARIATIONAL INFERENCE FOR PROBABILISTIC POISSON PCA

By Julien Chiquet, Mahendra Mariadassou and St´ephane Robin

AgroParisTech, INRA, Universit´e Paris-Saclay

Many application domains such as ecology or genomics have to deal with multivariate non Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general. We consider here the multivariate exponential family framework for which we introduce a generic model with multivariate Gaussian latent variables. We show that approximate maximum likelihood inference can be achieved via a variational algorithm for which gradient descent easily applies. We show that this setting enables us to account for covariates and offsets. We then focus on the case of the Poisson-lognormal model in the context of community ecology.

1. Introduction. Principal component analysis (PCA) is among the oldest and most popular tool for multivariate analysis. It basically aims at reducing the dimension of a large data set made of continuous variables

(Anderson,2003;Mardia et al.,1979) in order to ease its interpretation and

visualization. The methodology basically exploits the dependency structure between the variables to exhibit the few synthetic variables that best sum-marize the information content of the whole data set: the principal compo-nents. From a purely algebraic point-of-view, PCA can be seen as a matrix-factorization problem were the data matrix is decomposed as the product of a loading matrix with a score matrix (Eckart and Young,1936a).

For statistical purposes, PCA can also be cast in a probabilistic frame-work. Probabilistic PCA (pPCA) is a model-based version of PCA originally defined in a Gaussian setting, in which the principal components are treated as hidden variables (Tipping and Bishop, 1999; Minka, 2000). It is closely related to factor analysis and, as it involves hidden variable, maximum-likelihood estimates (MLE) can be obtained via an EM algorithm (Dempster

et al.,1977). The Gaussian setting is obviously convenient as the dependency

Keywords and phrases: Probabilistic PCA, Poisson-lognormal model, Count data, Vari-ational inference

1

(3)

structure is entirely described by the covariance matrix but pPCA has since been extended to more general settings.

Indeed, in many applications (Royle and Wikle, 2005; Srivastava and Chen,2010) Gaussian pPCA needs to be adapted to handle specific measure-ment types, such as binary or count data. For count data, the multivariate Poisson distribution seems a natural counterpart of the multivariate normal. Still, there is no canonical form (Johnson et al.,1997) and several distribu-tions have been proposed in the literature including Gamma-Poisson ( Nel-son, 1985) and lognormal-Poisson (Aitchison and Ho, 1989; Izs´ak,2008) as an alternative. The latter takes advantage of the properties of the Gaussian distribution to display a larger panel of dependency structure than the for-mer, but maximum likelihood-based parameter inference raises some issues as the MLE of the covariance matrix is not always positive definite.

A series of works have considered the extension of PCA to a broader class of distributions, typically in the exponential family. The matrix factoriza-tion point-of-view has been adopted to satisfy a positivity constraint of the parameters (Lafond,2015), to minimize the Poisson loss function (Cao and Xie,2015) or more general losses (Lee and Seung,2001) consistent with ex-ponential family noise. Sparse extensions have also been proposed (Witten

et al.,2009;Liu et al.,2016). In a model-based context, Collins et al.(2001)

suggest to minimize a Bregman divergence to get estimates of the principal components: the divergence is chosen according to the distribution at hand and a generic alternative minimization scheme is proposed. Salmon et al.

(2014) consider a similar framework and use matrix factorization for the minimization of Bregman divergence. In both cases, the principal compo-nents are considered as fixed parameters. Mohamed et al. (2009) cast the same model in a Bayesian context and use Monte-Carlo sampling for the inference.Acharya et al.(2015) consider Bayesian inference of the Gamma-Poisson distribution.

As recalled above, in pPCA, principal components are treated as hid-den variables. One of the main issue of non-Gaussian pPCA arises from the fact that their conditional distribution given the observed data is often intractable, which hampers the use of an Expectation-Minimization (EM) strategy. Variational approximations (Jaakkola and Jordan, 2000;

Wain-wright and Jordan,2008) have become a standard tool to approximate such

a conditional distribution.Karlis (2005) use such an approximation for the inference of the one-dimensional lognormal-Poisson model and derive a vari-ational EM (VEM) algorithm. Hall et al.(2011) provide a theoretical anal-ysis of this approximation for the same model and prove the consistency of the estimators. Li and Tao (2010) also use such an approximation to

(4)

extend pPCA to the simple exponential family, considering both loadings and scores as random hidden variables. Landgraf and Lee (2015) reframes exponential family PCA as an optimization problem with rank constraints and develops both a convex relaxation and a Maximization-Minimization (MM) algorithm to solve it for binomial and Poisson families. FinallyZhou

(2016);Zhou et al.(2012) consider factor analysis in the more complex

set-ting of negative-binomial families. The main difference between previous approaches and ours is that we only consider loading as random hidden variables, whereas we consider the scores as parameter. This has deep con-sequences on the general properties of the inference algorithm.

Our contribution. In this paper, pPCA is extended to the simple expo-nential family. We consider the principal components as Gaussian hidden variables to allow a large panel of dependency structures. We rely on a fre-quentist setting in order to avoid heavy-computing Monte-Carlo sampling often required in Bayesian inference. We use a variational approximation of the conditional distribution of the components given the observed data to derive a variational lower bound of the likelihood. This bound is biconcave, i.e. concave in the model parameters and in the variational parameters but not jointly concave in general. We use a quasi-Newton method with box-constraints to optimize the objective function and thus estimate the param-eters, in contrast to the EM algorithm traditionally used in this setting. Additionally, the model-based framework allows us to introduce covariates and offsets in the model and to handle missing data at no additional cost. In practical analyses, this enables us to distinguish between correlations that are caused by known covariates from those corresponding to unknown structure and requiring further investigations.

Organization of the paper is as follows: in Section2we introduce pPCA for the exponential family and the variational framework considered. Section3

generalizes the model in the manner of a generalized linear model, in order to handle covariates and offsets. Then, Section 4 is dedicated to the inference and optimization strategy. Section 5 details the special Poisson case and Section6devises the visualization, an important issue for non-Gaussian PCA methods. Finally, Section 7 considers applications to two examples from metagenomics: the impact of a pathogenic fungi on microbial communities from tree leaves, and the impact of weaning on piglets gut microbiota.

2. A variational framework for probabilistic PCA in the expo-nential family. We start this section by stating the probabilistic frame-work associated to Gaussian probabilistic PCA. Then we show how it can be naturally extended to other exponential families. We finally derive

(5)

vari-ational lower bounds for the likelihoods of pPCA which are the building blocks of our inference strategy.

2.1. Gaussian probabilistic PCA (pPCA). The probabilistic version of principal component analysis – or pPCA – (Minka,2000; Mohamed et al.,

2009;Tipping and Bishop,1999) relates a sample of p-dimensional

observa-tion vectors Yi to a sample of q-dimensional vectors of latent variables Wi

in the following way:

(1) Yi= µ + BWi+ εi, εi ∼ N (0p, σ2Ip).

The parameter µ allows the mode to have main effects. The p × q ma-trix B captures the dependence between latent and observed variables. Furthermore, the latent vectors are conventionally assumed to have inde-pendent Gaussian component with unit variance σ2 _{= 1, that is to say,}

εi ∼ N (0q, Iq). This ensures that there is no structure in the latent space.

Model (1) can thus be restated as Yi∼ N (µ, BB|+ Ip).

In the following, we consider an alternative formulation stated in a hier-archical framework. Despite its seemingly more complex statement it lends itself to generalizations. Formally,

(2)

latent space Wi i.i.d. Wi∼ N (0q, Iq)

parameter space Zi|Wi = µ + BWi Zi ∼ N (µ, BB|)

observation space Yij|Zij indep. Yij|Zij ∼ N (Zij, σ2)

In Equation (2), Zi is a linear transform of Wi and the last layer Yi|Zi

simply corresponds to observation noise. The diagonal nature of the covari-ance matrix of εi in (2) means that, conditionally on Zi, all components of

Yiare in fact independent. This is why we may consider univariate variables

Yij|Zij in Formulation (2).

Informally, the latent variables Wi (in Rq) are mapped to a linear

sub-space of the parameter sub-space Rp via the Zi which are then pushed into

the observation space using Gaussian emission laws. The main idea of this paper is to replace Gaussian emission laws with more general probability distributions, namely univariate natural exponential families. The focus of the inference is on the main effects µ and the matrix B that captures the dependence.

Hereafter and unless stated otherwise,

• index i refers to observations and ranges in {1, . . . , n}, • index j refers to variables and ranges in {1, . . . , p}, • index k refers to factors and ranges in {1, . . . , q}.

(6)

2.2. Natural Exponential family (NEF). The work in this study is based on essential properties of univariate natural exponential families (NEF) where the parameter is in canonical form. They include normal distribu-tion with known variance, Poisson distribudistribu-tion, gamma distribudistribu-tion with known shape parameter (and therefore exponential distribution as a partic-ular example) and binomial distribution with known number of trials. The probability density (or mass function) of a NEF can be written

(3) f (x|λ) = exp (xλ − b(λ) − a(x))

where λ is the canonical parameter and b and a are known functions. The function b is well known to be convex (and analytic) over its domain and the mean and variance are easily deduced from b as

Eλ[X] = b0(λ) and Vλ[X] = b00(λ).

The canonical link function g is defined such that g(b0(λ)) = λ. The max-imum likelihood estimate ˆλ of λ from a single observation x is given by ˆ

λ = ˆλ(x) = g(x) and satisfies

E_λ(x)ˆ [X] = b0(ˆλ(x)) = x.

2.3. Probabilistic PCA for the exponential family. We now extend pPCA from the Gaussian setting to more general NEF. The connection between the two versions is exactly the same as the connection between linear models and generalized linear models (GLM). Intuitively, we assume that i) there exists a (low) q-dimensional (linear) subspace in the natural canonical pa-rameter space where some latent variable Zi lie; and ii) observations Yi

are generated in the data space according to some NEF distribution with parameter Z. The latter is linked to E[Yi|Zi] through the canonical link

function g. In the Gaussian case, the link function is the identity and the parameter space can be identified with the data space but this is not the case in general for other families.

Formally, we extend Model (2) to

(4)

Wi i.i.d. Wi ∼ N (0q, Iq)

Zi|Wi= µ + BWi Zi ∼ N (µ, BB|)

Yij|Zij indep. Yij|Zij ∼ exp (YijZij − b(Zij) − a(Yij))

Note in particular that g(E[Yij|Zij]) = b0(Zij) and that an unconstrained

estimate ˜Zij of Zij is ˜Zij = g(Yij). The vector µ corresponds to main effects,

(7)

observation in the low-dimensional latent subspace of the parameter space. The model specified in (4) is the same as the one specified in (2) but for the last data emission layer.

Remark 1. B suffers from two identifiability limitations. First, it is only identifiable through BB| and therefore at best up to rotations in Rq. Second, if B is of rank q0 < q, the model is degenerate and could be written using a q0-dimensional latent space. Note that the first limitation is shared with standard PCA where each principal component (PC) can be arbitrarily flipped without changing the least square criteria. More generally PCA finds a good approximation subspace but without additional constraints, infinitely many bases can be used to parametrize this subspace. The second limitation deprives us from the nestedness properties induced by Eckart and Young’s theorem in standard PCA: there is no guarantee that the best q0-dimensional model can be built easily from the best q-dimensional model.

2.4. Likelihood. Note Y (resp. W ) the n×p (resp. n×q) matrix obtained by stacking the row-vectors Y|_i (resp. W|_i). Conversely, for any matrix A, Ai refers to the i-th row of A considered as a column vector. In matrix

expression, Z = 1nµ|+ W B|. The observation matrix Y only depends on

Z through µ, B and W and the complete log-likelihood is therefore

log p(Y , W ; µ, B) =

n

X

i=1

log p(Yi|Wi; µ, B) + log p(Wi)

= n X i=1   p X j=1 Yij(µj+ B|jWi) − b(µj+ B|jWi) − a(Yij) − q X k=1 W_ik2 + log(2π) 2  

which can be stated in the following compact matrix form:

(5) log p(Y , W ; µ, B) = 1|_n[Y (1nµ|+ W B|) − b(1nµ|+ W B|)] 1p −kW k 2 F 2 − nq 2 log(2π) − K(Y ), where the function a and b are applied component-wise to vectors and ma-trices, is the Hadamard product and K(Y ) = 1|na(Y )1p is a constant

depending only on Y and not scaling with q.

We do not know how to integrate out W and therefore cannot derive an analytic expression of log p(Y ; µ, B). Numerical approximation using Hermite-Gauss quadrature or MCMC techniques are possible but likely to become computationally prohibitive as the dimension of the integration

(8)

space increases. A standard EM algorithm relying on EW |Y[log p(Y , W ; µ, B)]

is also not possible as it would require at least first and second order of p(Wi|Yi) which are unknown in general. We resort instead to a variational

strategy and integrate out W under a tractable approximation of p(W |Y ). 2.5. Variational bound of the likelihood. Consider any product distribu-tion ˜p = ⊗n_i=1p˜i on the Zi. The variational approximation relies on

maxi-mizing the following lower bound over a tractable set for ˜p log p(Y ; µ, B) ≥ Jq(˜p, µ, B)

where

(6)

Jq(˜p, µ, B) := log p(Y ; µ, B) − KL(˜p(W )||p(W |Y ; µ, B))

= Ep˜[log p(Y , W ; µ, B) − log ˜p(W )]

=

n

X

i=1

Ep˜i[log p(Wi) + log p(Yi|Wi; µ, B) − log ˜pi(Wi)],

with term-by-term inequality: log p(Yi; µ, B) ≥ Jq(˜pi, µ, B)

:= Ep˜i[log p(Wi) + log p(Yi|Wi; µ, B) − log ˜pi(Wi)].

In our variational approximation, we choose here the set Q of product distribution of q-dimensional multivariate Gaussian with diagonal covariance matrices: Q = ( ˜ p := ˜pM ,S; ˜p(w) = n Y i=1 ˜ pi(wi) ) , where ˜pi = N (mi, diag(si si)), (mi, si) ∈ Rq× Rq+. (7)

The n × q matrices M and S are obtained by respectively stacking m|_i and s|_i. Note that by construction p(W |Y ) is a product distribution and that the approximation only stems from the functional form of each ˜pi, i.e.

multivariate normal with diagonal variance-covariance matrix. For such ˜p = ˜

pM ,S, results on first and second order moments of multivariate Gaussian

show that Jq(µ, B, mi, si) := Jq(˜pi, µ, B) = Y|_i(µ + Bmi) − 1 2[kmik 2 2+ ksik22] + 1 2(2 | qlog(si) + q) − 1|_p_Ep˜i[b(µ + BWi)] − K(Y ).

(9)

Therefore, (8) Jq(µ, B, M , S) := Jq(˜pM ,S, µ, B) = n X i=1 Jq(µ, B, mi, si) = 1|_n[Y (1nµ|+ M B|) − Ep˜[b (1|nµ + W B|)]] 1p − 1 21 | n[M M + S S − 2 log(S) − 1n,q] 1q− K(Y ).

Depending on the natural exponential family and thus the exact value of b in (8), we may have a fully explicit variational bound for the complete like-lihood which paves the way for efficient optimization. In particular, this is the case with the Poisson distribution that we investigate in further details in Section5.

Before moving on to actual inference, we show that the framework that we introduced above can be extended to account for covariates and offsets. 3. Accounting for covariates and offsets. Multivariate analyses typ-ically aim at deciphering dependencies between variables. Variations induced by the effect of covariates may be confounded with these dependencies. Therefore, it is desirable to account for such effects to focus on the residual dependencies. The rational of our approach is to postulate the existence of a model similar to linear regression in the parameter space. We consider the general framework of linear regression with multivariate outputs, which encompasses multivariate analysis of variance.

3.1. Model and likelihood. Suppose that each observation i is associated to a known d-dimensional covariate vector Xi. We assume that the

covari-ates act linearly in the parameter space through a p × d regression matrix Θ, i.e. Xi is linearly related to Zi. It can be also useful to add an

off-set to model different sampling efforts and/or exposures. There is usually one known offset parameter Oij per observation Yij and this offset can be

readily incorporated in our framework. Thus, a natural generalization of (4) accounting for covariates and offsets is

(9)

Wi i.i.d. Wi ∼ N (0q, Iq)

Zi|Wi= Oi+ ΘXi+ BWi Zi ∼ N (Oi+ ΘXi, BB|)

Yij|Zij indep. Yij|Zij ∼ exp (YijZij − b(Zij) − a(Yij))

where a column of ones can be added to the data matrix X to get an intercept in the model. The log-likelihood can be computed from (9) like before to get

(10)

3.2. Variational bound of the likelihood. We can use the variational class Q previously defined in (7) to lower bound the likelihood from Eq. (10). We first introduce the instrumental matrix A, which appears in many equations.

(11) A = Ep˜[b (O + XΘ

|_{+ W B}|_)]

= E[b (O + XΘ|+ (M + S U )B|)] = E[b (Z)],

where Z = (O + XΘ|+ (M + S U )B|) and U is a n × q matrix with unit variance independent Gaussian components.

Since O is known, we drop it from the arguments of Jq and obtain the

following lower bound, which extends the bound from Eq. (8):

(12) Jq(Θ, B, M , S) = 1|n Y (O + XΘ|+ M B|) − A 1p − 1 21 | n[M M + S S − 2 log(S) − 1n,q] 1q− K(Y ).

4. Inference. As usual in the variational framework, we aim to maxi-mize the lower bound Jqwhich we call the objective function in an

optimiza-tion perspective. The optimizaoptimiza-tion shall be performed on Θ, B, M , S. We only give results in the most general case (12) with covariates and offsets. All other case are deduced by setting O = 0n×p and/or X = 1nhereafter.

4.1. Inference strategy. We first highlight the biconcavity of the objec-tive function Jq. The major part of the proof is postponed to Appendix

A.

Proposition 1. The variational objective function Jq(Θ, B, M , S) is

concave in (Θ, B) for (M , S) fixed and vice-versa.

Proof. Fix (M , S) in (12). The non explicit part of Jq, that is to say

−1|_nA1p, is concave in (Θ, B) thanks to Lemma 2 (see Appendix A). By

inspection, the explicit part of Jq involves linear, quadratic and concave

functions of (Θ, B) and is also concave. The objective Jqis therefore concave

(11)

A standard approach for maximizing biconcave functions is block coordi-nate descents, of which the Expectation-Maximization (EM) algorithm is a popular representative in the latent variable setting. It is especially powerful when we have access to closed formula for both the optimal (M , S) given (Θ, B) (E-step) and the optimal (Θ, B) given (M , S) (M-step). However, the non-linear nature of Epe[b (O + XΘ

|_{+ W B}|_{)] combined with careful}

inspection of the objective function Jq shows that setting the derivatives of

Jq to 0, even after fixing the variational or model parameters, does not lead

to closed formula neither for (M , S) nor (B, Θ). Nevertheless, since we may derive convenient expressions for the gradient ∇Jq (see next Section 4.2),

we propose to rely on Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton updates with box constraints and limited memory (a.k.a L-BFGS-B) to maximize Jq (see, e.g.Press et al.,1989). In the general case (12), the

total number of parameters to optimize Jq(Θ, B, M , S) is p(d+q)+2nq. The

box constraints concern the variational parameters S, standing for standard deviations in (7) and thus only defined on Rq+. The starting point is chosen

according to the exact value of b.

4.2. Blockwise gradients of Jq. The blockwise gradient of Jq(Θ, B, M , S)

can be expressed compactly in matrix notations. We skip the tedious but straightforward derivation and present only the resulting partial gradients. We introduce A0 = E[b0(Z)], the natural counterpart to matrix A given in (11). Intuitively, A0_ij is the conditional expectation of Yij under pei. On top of that, we need two other matrices denoted A0₁ and A0₂, defined as follows:

A0₁ = E[b0(Z)|(S U )], A0₂ = E[(b0(Z) B) U ].

With those matrices the derivatives of Jq can be expressed compactly as

(13) ∂Jq ∂Θ = (Y − A 0₎|_X, ∂Jq ∂B = (Y − A 0₎|_{M − A}0 1, ∂Jq ∂M = (Y − A 0_{)B − M ,} ∂Jq ∂S = S _{− 2A}0 2− 2S ,

where the n × q matrix S is the elementwise inverse of S, i.e. S_ij = S_ij−1 for all i = 1, . . . , n, q = 1, . . . , Q.

4.3. About missing data. In the presence of missing data, note Ω ⊂ {1, . . . , n} × {1, . . . , p} the set of observed data and Ω the matrix where Ωij = 1 if (i, j) ∈ Ω and 0 otherwise. The likelihood can be adapted from

(12)

Eq. (10) as follows: log p(Y , W ; B, Θ, O) = 1|_n Y (O + XΘ|+ W B|) − b(O + XΘ|+ W B|) Ω1p −kW k 2 F 2 − nq 2 log(2π) − tr(Ω |_{a(Y )).}

The corresponding variational bound Jq and its partial derivatives are then

simple adaptations from Equations (12) and (13) where Y (resp. A, A0) is replaced with Y Ω (resp. A Ω, A0 Ω).

Note that it is strictly equivalent for the quasi-Newton method to use (Y − A0) Ω or to impute missing Yij with A0ij before using Eq. (13). Since

A0is computed as part of the gradient computation at each step, imputation of missing data is essentially a free by-product of the quasi-Newton method. Finally, note that A0_ij = Epei[Yij] so that the imputation makes intuitive

sense: we’re imputing Yij with its conditional expectation under the current

variational parameters.

4.4. Model selection. The dimension q of the latent space itself needs to be estimated. To this aim, we adopt a penalized-likelihood approach, replacing the log-likelihood by its lower bound Jq. We considered two

clas-sical criteria: BIC (Schwarz, 1978) and ICL (Biernacki et al., 2000). We remind that ICL uses the conditional entropy of the latent variables given the observations as an additional penalty with respect to BIC. Because the true conditional distribution p(W |Y ) is intractable, we replace it with its variational approximationp(W ) to evaluate this entropy. The difference be-_e tween BIC and ICL measures the uncertainty of the representation of the observations in the latent space.

Because the number of parameters in our model is p(q+d) and the entropy of each Wi under epi is q log(2πe)/2 + 1

|

qlog(si)], we obtain the following

pseudo-BIC and pseudo-ICL criteria:

(14) BIC(q) = Jq− p(d + q) log(n) ICL(q) = Jq− p(d + q) log(n) − nq 2 log(2πe) − 1 | nlog(S)1q

5. Poisson Family. Each term of the expectation matrix A in (11) can be reduced to computing expectations of the form E[b(a + cU )] for a convex analytic function b, a standard Gaussian U ∼ N (0, 1) and arbitrary scalars (a, c) ∈ R2. It can therefore be computed numerically efficiently using Gauss-Hermite quadrature (see, e.g., Press et al., 1989). However in

(13)

the special case of Poisson-distributed observations, b(x) = ex and most of the expectations can be computed analytically leading to explicit formulas for Equations (11), (12) and (13).

5.1. Explicit form of A, Jq, and ∇Jq. In the Poisson-case, the

varia-tional expectation of the non-linear part involving b – the matrix of condi-tional expectations A – is equal to A0 and can be expressed as

A = A0 = exp O + XΘ|+ M B|+1 2(S S)(B B) | .

The lower bound Jqand matrices A01, A02appearing in (13) can be expressed

simply from A as

A0₁ = [A|(S S)] B, A02 = 2[A(B B)] S.

5.2. Implementation details. We implemented our inference algorithm for the Poisson family in the R package PLNmodels, the last version of which is available on github https://github.com/jchiquet/PLNmodels. Maximization of variational bound Jq is done using the L-BFGS-B

imple-mentation ofByrd et al.(1995) available from the R optim function carefully tuned (R Development Core Team,2008). All graphics are produced using the ggplot2 package (Wickham,2009).

The choice of a good starting value is crucial in iterative procedures as it helps the algorithm start in the attractor field of a good local maximum and can substantially speed-up convergence. Here we initialize (Θ, B) by fitting a GLM-Poisson to Y , then extracting the regression coefficients ΘGLM and

the variance-covariance matrix ΣGLM of the Pearson residuals. We set Θ0 =

ΘGLM and B0= (Σ(q)GLM)1/2where Σ (q)

GLM is the best rank q approximation

of ΣGLM, as given by keeping the first q-dimensions of a SVD of ΣGLM. We

set the other starting values as M0= S0= 0n×q.

6. Visualization.

6.1. Specific issues in non-Gaussian PCA. This section is dedicated to the visualization of the results of the proposed modeling. Although this prob-lem has many similarities with visualization in standard PCA, two important differences exist that make adaptations of the usual procedures necessary.

(i) In the general case, the parameter space defined in (4) is different from the observation space, as opposed to the special case of Gaussian PCA.

(14)

(ii) The optimal subspaces of dimension q = 1, 2, . . . are not nested. As a consequence, in a model with q > 1 latent dimensions, there is no such thing as a ’first axis’. The genuine first axis corresponds to the optimal subspace of dimension 1, which is provided by the model with q = 1 latent dimension.

To address point (i), we choose to provide representations in the parame-ter space as it sets us in the Gaussian (and accompanying Euclidean) space setting practitioners are most familiar with. We thus focus on the positions Zi and more specifically on their inferred version Oi+ bΘXi+ bBmfi, gathered in the n × p matrix eZ := O + X bΘ|+ bB fM . Note that eZ is most useful to assess goodness of fit. For visualization of the latent structure remaining af-ter correction for the offset and the covariates, we consider instead positions

e

P = bB fM in the latent subspace of dimension q.

6.2. Quality of the dimension reduction. A first important criterion in PCA is the amount of information that is preserved by the q-dimensional reduction. To evaluate this criterion we define the matrix Λ(q)= [λ(q)_ij ] where we use entry λ(q)_ij := exp( eZij) as an estimate of the canonical parameter λij

of the distribution of Yij given in (3). Thus, we can define the log-likelihood

`q of the observed data with these estimates as

`q = n X i=1 p X j=1 [Yijg(λ(q)ij ) − Yij] − K(Y ).

We can compare the log-likelihood of the saturated model `max (replacing

λ(q)_ij with λmax_ij := Yij) and the log-likelihood `min of a GLM for Poisson

re-gression with no latent structure, which plays the role of the null model here (using λmin_ij := oij+ bΘXi, with bΘ estimated using a standard GLM). Then,

similarly to the deviance criterion used in GLMs, we define the following measure of fit:

(15) R_q2= (`q− `min) /(`max− `min) .

Note that, unlike standard PCA, there is no guarantee the R2_q is nonde-creasing in q as `q is not the objective function of the variational inference

algorithm of Section4.

6.3. Graphical outputs. Because no orthogonality constraint is applied in the inference step, the (centered) columns of eP are not orthogonal in general, as opposed to standard PCA. Furthermore, because of point (ii),

(15)

the order of the columns of B have no specific meaning as only the subspace spanned by eP is identifiable.

To provide a representation of the observations in a series of 2D-plots, we simply apply PCA to eP . As in regular PCA, we use the associated relative squared singular value d2_j/ Pq

k=1d2k to measure the contribution of

com-ponent j. Note that this measures the contribution of the comcom-ponent to the q-dimensional representation of the data and that it needs to be combined with the global measure of fit provided by R_q2. The variational variances_esi

can be used to draw confidence ellipsoids. Following the same line, we may plot the correlations between the columns of eP and the components arising from its PCA, to help with the interpretation of these components.

7. Illustrations.

7.1. Oak powdery mildew pathobiome.

Description of the experiment. We considered the metagenomic dataset introduced inJakuschkin et al.(2016), which consists in abundance measures of 66 bacterial species and 48 fungal species (p = 114) collected on the surface of n = 116 oak leaves. One aim of this experiment is to understand the association between the abundance of the fungal pathogenic species E. alphitoides, responsible for the oak powdery mildew, and the other species. The leaves were collected on three different trees and the species abundances were measured via metabarcoding.

Importance of the offset. The abundances Yij (where i denotes the leaf

and j the species) were measured separately for fungi and bacteria resulting in different sampling efforts for the two types of species: the median total abundance were respectively 668 for bacteria and 2166 for fungi. To account for this we define an offset oij term as the log-total count of each species

type (fungal or bacteria) for each leaf.

Model selection. The three trees from which the leafs where collected were respectively susceptible, intermediately resistant (hereafter “intermediate”) and resistant to mildew. We first fitted a null lognormal-Poisson model M0as

defined in (9) only with an offset term. Alternatively, we considered model M1 involving two covariates: the tree from which each leaf was collected

from, and the orientation (0=south-east, 1=north-west) of its branch. Figure1adisplays the lower bound J , the BIC and the ICL for model M0

(left) and M1 (right) as a function of the number of axes q considered. We

observe that the Jq is always increasing and that the BIC and ICL criteria

(16)

(a) Model selection

Offset (M0) Offset and covariates (M1)

criterion v alue ● ● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●● R2 = 0.28 R2 = 0.49R2 = 0.6 R2 = 0.68R2 = 0.73 R2 = 0.78R2 = 0.81 R2 = 0.83R2 = 0.85 R2 = 0.87R2 = 0.88 R2 = 0.89R2 = 0.9R2 = 0.91 R2 = 0.92R2 = 0.92 R2 = 0.93R2 = 0.94 R2 = 0.94R2 = 0.95 R2 = 0.95R2 = 0.95 R2 = 0.96R2 = 0.96R2 = 0.96 R2 = 0.96R2 = 0.97 R2 = 0.97R2 = 0.97 R2 = 0.97 −200000 −150000 −100000 −50000 0 10 20 30 ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ●● ●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●● R2 = 0.33 R2 = 0.47R2 = 0.56 R2 = 0.64R2 = 0.7 R2 = 0.73R2 = 0.77R2 = 0.79 R2 = 0.82R2 = 0.84 R2 = 0.85R2 = 0.86 R2 = 0.88R2 = 0.89 R2 = 0.9R2 = 0.9 R2 = 0.91R2 = 0.92 R2 = 0.92R2 = 0.93R2 = 0.93 R2 = 0.94R2 = 0.94 R2 = 0.95R2 = 0.95 R2 = 0.95R2 = 0.95 R2 = 0.96R2 = 0.96 R2 = 0.96 −125000 −100000 −75000 −50000 0 10 20 30 criterion ● ● ● J BIC ICL number of axes

(b) Goodness of fit and Entropy

Rq2criterion Entropy (BICq− ICLq)

criterion v alue ● ● ● ● ● ● ● ● ● ● ●● ●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●● 0.4 0.6 0.8 1.0 0 10 20 30 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●●●●●● ●● ● −2000 −1000 0 10 20 30 model ● ● ● M0 M1 M1 corrected number of axes (c) Individual Factor Maps

Offset (M0) Offset and covariates (M1)

A1.02 A1.03 A1.04A1.05 A1.06 A1.07 A1.08 A1.09 A1.10

A1.11A1.12 _A1.13 A1.14 A1.15 A1.16 A1.17 A1.18 A1.19 A1.20 A1.21 A1.22 A1.23 A1.24 A1.25 A1.26 A1.28 A1.29 A1.30 A1.31A1.32 A1.33 A1.34 A1.35 A1.36 A1.37 A1.38 A1.39 A1.40 A2.01 A2.02 A2.03A2.04 A2.05 A2.06 A2.07 A2.08 A2.09 A2.10 A2.11 A2.12 A2.13A2.14 A2.15A2.16

A2.17 A2.18 A2.19 A2.20 A2.21 A2.22 A2.23 A2.24 A2.25 A2.26 A2.28_A2.29 A2.30 A2.31 A2.32 A2.33 A2.34 A2.35 A2.36 A2.37 A2.38 A2.39 A2.40 A3.01 A3.02 A3.03 A3.04 A3.05 A3.06 A3.08 A3.09 _A3.10 A3.11 A3.12 A3.13 A3.14 A3.15 A3.16 A3.17A3.18 A3.19

A3.20 A3.21 A3.22 A3.23 A3.24 A3.25 A3.26 A3.27 A3.28 A3.29 A3.30 A3.31 A3.32 A3.33 A3.34 A3.35 A3.36 A3.37 A3.38 A3.39 A3.40 −30 −20 −10 0 10 −25 0 25 50 75 axis 1 (48.52%) axis 2 (15.69%) A1.02 A1.03 A1.04 A1.05 A1.06 A1.07 A1.08 A1.09 A1.10 A1.11 A1.12 A1.13 A1.14 A1.15 A1.16 A1.17 A1.18 A1.19 A1.20 A1.21 A1.22 A1.23 A1.24 A1.25 A1.26 A1.28 A1.29 A1.30 A1.31 A1.32 A1.33 A1.34 A1.35 A1.36 A1.37 A1.38 A1.39 A1.40 A2.01 A2.02 A2.03 A2.04 A2.05A2.06 A2.07 A2.08 A2.09 A2.10 A2.11 A2.12A2.13 A2.14 A2.15 A2.16 A2.17 A2.18 A2.19 A2.20 A2.21 A2.22 A2.23 A2.24 A2.25 A2.26 A2.28 A2.29 A2.30 A2.31 A2.32 A2.33 A2.34 A2.35 A2.36 A2.37 A2.38 A2.39 A2.40 A3.01 A3.02 A3.03 A3.04 A3.05 A3.06 A3.08_A3.09 A3.10 A3.11 A3.12 A3.13 A3.14 A3.15 A3.16 A3.17 A3.18 A3.19 A3.20 A3.21 A3.22 A3.23 A3.24 A3.25 A3.26 A3.27 A3.28 A3.29 A3.30 A3.31_A3.32 A3.33 A3.34 A3.35 A3.36 A3.37 A3.38 A3.39 A3.40 −10 0 10 20 −20 0 20 axis 1 (31.16%) axis 2 (13.41%) tree a a a intermediate resistant susceptible

Fig 1: Dataset from Jakuschkin et al. (2016). (a) model selection criteria Jq, BICq and ICLq for model M0 (left) and M1 (right); (b) Rq2 criterion

and entropy ofp(W ); (c) scatter plot of the leaves on the first two principale components (left: M0, right: M1).

(17)

−45377) latent dimensions for model M0 and bq = 21 (ICL = −45430) for model M1. This suggest that the two models (with their respective optimal

dimension) provide a very similar fit.

We looked at the approximate posterior entropy in panel left of Figure1b: we observed that it is minimal near to the respective optimum in terms of model selection. This indicates that the selected dimensions are also optimal in terms of uncertainty on the latent variables.

Effect of the covariates. The choice between model M0 and M1 is mostly a

matter of the type of dependency we analyze with each of them, as the former does not account for the covariates whereas the later does. This is illustrated in Figure1c, when plotting the first principal plane. In model M0 (left), the

leafs collected on each tree are clearly separated. As expected, taking the tree as a covariate (right) removes the tree effect from the principal plane.

The effect of covariates on the abundance of E. alphitoides were also consistent: the estimates parameters θij associated with the intermediate

and resistant trees were −4.53 and −8.83, respectively, taking the susceptible tree as a reference.

We compared the respective estimates of Σ under M0 (denoted bΣ0) and

under M1 ( bΣ1) focusing on the correlation between E. alphitoides and the

other species. bΣ0 contains correlations between species, that are either due

to marginal co-variations between them or to the effects of the covariates, whereas as the correlations in bΣ1 are corrected from the effects of covariates.

We first observed a reduction of the variances (mean=8.96, median=2.73 in b

Σ0; mean=3.72, median=1.53 in bΣ1), which proves the strong effect of the

covariates on the abundance of the different species. Then we considered the correlations and found a low similarity between their rankings under model M0 and M1 (Kendall’s τ = .40), showing that the covariates drastically

change the apparent relationship between species abundances.

Percentage of variance. We now comment on use of the R2_qcriterion defined in Section 6 to evaluate the proportion of variability captured by a model with q latent dimensions. R2

q compares the pseudo-likelihood `mq obtained

with q latent dimensions under model Mm (m = 0, 1) with the likelihoods

`m_min and `m_max. We know that `0_max = `1_max whereas `0_min < `1_min because `0

min only relies on the offsets whereas `1min accounts for both the offsets

and the covariates. As a consequence, R_q2 tends to be higher under M0 than

under M1 for a given q. Right panel of Figure 1b compares the genuine

R2

q under models M0 and M1 and the corrected version of R2q under model

M1 using `0_min in place of `1_min. As expected, the corrected version of Rq2

(18)

models, the proportion of variability captured by the latent space is quite high: R2₂₄= 95.8% for M0and R221= 95.5% for M1. We remind thatqb0 = 24 and _bq1= 21 should both be compared with p = 114.

Variance of the variational posterior. We remind that Sij is the

approxi-mate conditional variance of Wij given the data. This parameter measures

the precision of the location of individual i along the j-th latent dimension. We can derive from them the approximate conditional variance of each Zij

as [B diag(si si)B|]jj. Figure 2, shows that this variance is much higher

when the corresponding abundance Yij is low. Indeed, any large negative

negative values of Zij yields in a Poisson parameter close to zero and, so,

to a null Yij. As a consequence, large negative Zij can not be predicted

accurately.

approximate conditional standard error of Zij

1 2 3 4 5 0 10 ₁₀₀ ₅₀₀ 1000 1500 2000 Yij(log scale)

Fig 2: Variational approximate conditional standard of the Zij (y axis) as a

function of the abundance Yij (x axis).

7.2. Impact of weaning on piglet microbiome.

Description of the experiment. We considered the metagenomic dataset introduced inMach et al. (2015), which consists in abundance measures of p = 4031 bacterial species collected from the feces of 31 piglets at 5 times points after birth (n = 155). One aim of this experiment is to understand the impact of weaning on gut microbiota. The species abundances were measured via metabarcoding (seeMach et al.(2015) for details). We mostly use this example to illustrate of the the proposed methodology behaves when the number of variable p increases.

Numerical Experiments. To test the impact of the number of variables on the dimension of the latent subspace, we inferred q on nested subsets of the

(19)

count table. We selected only the 3000, 2000, 1000, 500 and 100 most abun-dant species and fitted a model with appropriate offset to each subset. The offsets were chosen as log-total count of each sample. For context, the 2500 least abundant species each have total abundance lower than 5, and more than half (1287) are even seen only once in one sample. As expected, Fig-ure4shows that removing low-counts and low-information species increases the dimension of the latent subspace (ˆq going up from 4 to 23) and the R2 (up from 53% to 93%). We are thus able to explain a larger fraction of the variability when considering only the most abundant species rather than all of them. Figure3shows that running times increase sublinearly with q and linearly with p.

running time (seconds)

0 500 1000 1500 2000 0 1000 2000 3000 number of axes 2 3 5 10 20 number of variables

Fig 3: Dataset from Mach et al. (2015). Running times averaged over 4 replicates of the PLNPCA function in R PLNModels package. Single core Intel i7-4600U CPU 2.33GHz, R 3.3.3, Linux Ubuntu 16.04.

Impact of Weaning. We focus on results obtained on the 500 most abun-dant species, which account for 90.3% of the total counts. The ICL criteria on this subset selects ˆq = 19 (R2 = 86%). The main structure present in the latent subspace is the strong and systematic impact of weaning (Fig.5, left), almost entirely captured by Axis 1. The variable factor map highlights species from two specific bacterial families: Lactobacillaceae (red) and Pre-votellaceae (blue). The former are typically found in dairy products and thought to be transmitted to the piglets via breast milk. As expected, they are enriched in suckling piglets and negatively correlated with Axis 1. The latter produce enzymes that are essential to degrade cereals introduced in

(20)

log10(abundance) R 2

criterion chosen rank ˆq

0 1 2 0 1000 2000 3000 4000 0.00 0.25 0.50 0.75 1.00 0 1000 2000 3000 4000 0 5 10 15 20 0 1000 2000 3000 4000 number of variables

Fig 4: Dataset from Mach et al. (2015). The minimum overall abundance of included species (left panel), quality of approximation R2_q (central panel) and selected value ˆq (right panel) decrease when species with low abundance are added to the dataset.

the diet after weaning. As reported inMach et al.(2015), they are enriched after weaning and positively correlated with Axis 1. There is no system-atic effect of weaning on Axes 3 and 4 and neither Lactobacillaceae nor Prevotellaceae are highly correlated with those axes.

Acknowledgement. We thank C. Vacher and N. Mach for providing the data and discussing the results. This work was funded by ANR Hydrogen (project ANR-14-CE23-0001).

References.

A. Acharya, J. Ghosh, and M. Zhou. Nonparametric bayesian factor analysis for dynamic count matrices. In AISTATS, 2015.

J. Aitchison and C. H Ho. The multivariate Poisson-log normal distribution. Biometrika, 76(4):643–653, 1989.

T. W. Anderson. An introduction to multivariate statistical analysis. Series in Probability and Statistics. Wiley, 3 edition, 2003.

C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Machine Intel., 22(7): 719–25, 2000.

Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5): 1190–1208, 1995.

Y. Cao and Y. Xie. Poisson matrix completion. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1841–1845. IEEE, 2015.

M. Collins, S. Dasgupta, and R. E Schapire. A generalization of principal components analysis to the exponential family. In Advances in neural information processing sys-tems, pages 617–624, 2001.

(21)

Individual Factor Map Variable Factor Map SF0901 SF0902SF0904SF0903 SF0905 SF0931 SF0932 SF0933 SF0934 SF0935 SF0951 SF0952 SF0953 SF0954SF0955 SF1011 SF1012 SF1013 SF1014SF1015 SF1031 SF1032 SF1033 SF1034 SF1035 SF1831 SF1832 SF1833 SF1834 SF1835 SF1861 SF1862 SF1863_SF1864 SF1865 SF1871 _SF1872 SF1873 SF1874 SF1875 SF1911 SF1912 SF1913 SF1914 SF1915 SF1931 SF1932 SF1933 SF1934 SF1935 SF2641 SF2642 SF2643SF2644SF2645 SF2671 SF2672 SF2673SF2674SF2675 SF2721 SF2722 SF2723 SF2724_SF2725 SF2741 SF2742 SF2743 SF2744 SF2745 SF2751 SF2752 SF2753 SF2754 SF2755 SF2801 SF2802SF2803 SF2804SF2805 SF2821 SF2822 SF2823 SF2824 SF2825 SF2881 SF2882 SF2883 SF2884 SF2885 SF2921 SF2922 SF2923 SF2924SF2925 SF2941 SF2942 SF2943_SF2945SF2944 SF4481 SF4482 SF4483 SF4484 SF4485 SF4491 SF4492 SF4493 SF4494SF4495 SF4521 SF4522 SF4523 SF4524SF4525 SF4541 SF4542 SF4543 SF4544SF4545 SF4581 SF4582 SF4583_SF4584 SF4585 SF5181 SF5182 SF5183 SF5184SF5185 SF5201 SF5202SF5203 SF5204 SF5205 SF5231 SF5232 SF5233SF5234SF5235 SF5261 SF5262 SF5263SF5264 SF5265 SF5271 SF5272 SF5273 SF5274SF5275 SF5301 SF5302 SF5303 SF5304SF5305 −100 −50 0 50 100 −400 −300 −200 −100 0 100 axis 1 (66.4%) axis 2 (7.85%) SF0901 SF0902 SF0903SF0904 SF0905 SF0931 SF0932 SF0933_SF0934 SF0935 SF0951 SF0952 SF0953 SF0954 SF0955 SF1011 SF1012SF1013 SF1014 SF1015 SF1031 SF1032_SF1033 SF1034 SF1035 SF1831 SF1832 SF1833 SF1834SF1835 SF1861 SF1862 SF1863 SF1864SF1865 SF1871 SF1872SF1873_SF1874 SF1875 SF1911 SF1912 SF1913 SF1914SF1915 SF1931 SF1932 SF1933 SF1934SF1935 SF2641 SF2642 SF2643SF2644 SF2645 SF2671 SF2672SF2673 SF2674 SF2675 SF2721 SF2722SF2723SF2725_SF2724 SF2741_SF2742 SF2743SF2744 SF2745 SF2751 _SF2752 SF2753 SF2754 SF2755 SF2801 SF2802 SF2803 SF2804SF2805 SF2821 SF2822_SF2823 SF2824 SF2825 SF2881 SF2882 SF2883 SF2884 SF2885 SF2921 SF2922 SF2923 SF2924 SF2925 SF2941SF2942 SF2943 SF2944 SF2945 SF4481 SF4482 SF4483 SF4484SF4485 SF4491 SF4492 SF4493_SF4494 SF4495 SF4521 SF4522 SF4523 SF4524SF4525 SF4541 SF4542 SF4543SF4544 SF4545 SF4581 SF4582 SF4583 SF4584 SF4585 SF5181 SF5182 SF5183 SF5184 SF5185 SF5201 SF5202 SF5203 SF5204 SF5205 SF5231 SF5232 SF5233 SF5234 SF5235 SF5261 SF5262 SF5263 SF5264 SF5265 SF5271 SF5272 SF5273SF5274 SF5275 SF5301 SF5302 SF5303 SF5304 SF5305 −150 −100 −50 0 50 −100 −50 0 50 100 axis 3 (4.8%) axis 4 (3%)

Weaned aFALSEaTRUE

250 29 291 266 86 345 22 559 80 11 331 54 293 230 62483 323 6 261 6726 231 299 61 17 555 111 1477 32 36 247 199 556 123 7 38331 737659 357 540 24256 298 226 346 1084 69 5776 81 23 2200 99 265 3368 519 408 424 ₆₀₄₀ 85 ₃₀₇ 386 112 347 107138 4166 63 5448 18 260 388 5889 349 482 365₉ 28 301 7238 1486 434 5824 6899 371 108 181 65 164 5 1 521 74 124 258 47 70 5700 151643041058188 245 426 13095 15 217 259 49 473305 209 541 374 21 5789 6449 431 666 20884 223 100 1817 5839 814 380 5957₄₉₇119468466 465 64 187 6350 888 127 19 5917 5535 6079 546 122 27 425 302 278 333 5786 6088 405 5743 533 270 50₂₀₅ 45 387 341 433 170 384 629 34 6391 89 516 720 5612 272 194 6462109 90 5783 172 811₄₆ 455 421429 276 5427 355₅₈ 5642 296 420 60 135 6673 1111 1128 343 311 6867 334 297 435 180 344 251 1897 3238693339 5866 659 274 641940₁₁₃ 150 329 8274 5978 303 10103203249210 606 5756 4 166 407 14 513 809₅₅ 409 5849 5881 267 171262 1125 2521922350 6355 1914 2454309 48 8993 8951 1950₂₆₈ 351₇₁₇₁1953528286 7793 152 5647 1837 257 6366 5925 5987 7281535 1087 201 281 288 10003 273 628 5436 219 547 304 6168 403 229 20 3385 1684 6372 13 83756213 105 1895 1484 14261₄₈₄ 816 5652 5712 5838 5982 97 5847 476 279 211 6261 870 125 5634 607 488 6345 39 8548 151 1263 8997 335 6099 2 38 5906 650 1476 415 494 243 ₄₉₈ 414 5879 7645 411 232 574 8180 5711 2131 8791 554 505 5895 ₆₈₁₅ 7954206 5777 6157 1086 550 10484 7305 5888 490 306 144 633 1464375 5901 381218 11893 300 6119 348 15198 6003 457 157 7296 5886 1129512810 237 319 6197 5790 213 2395 1109 5532 2298 5969 5807361 241 73 4151 15153 6623 336 6606263 437 6062 158 491 1896 499 5926 1096 8990 193 8660₂₂₄ 10393 1535 6448 6199 1954 10586₇₂ 62047809129 6216₆₃₈₇ 95 5672 6364 5876 6978 5845 480 214 277 5878 6954 2221 5744 254 1709 524 6849 1865423 2821 1677 1522238 3120 198 248 886 2262 1813 5432 7253 11432 1880₁₁₇ 5958 984842 587110413₆₅₉₉5844162 294 10430 456 1406378 59902649 1818 72832903 12 6050 24 57787412312 6042 5885 6358 6302 5764 9003 1493 8441 7251 5962 6138 57171082 646 5980 12600 8530 1499 5924453 5855 337 5928 803 10063 250 86 559 80 11 54 67 26 61 11132 36 199 123 7 59 242 1084 69 519₆₆ 138 28 1486 181 65 124 151688 13095217 49 84 223 1817 814 119 187 888 302₄₅ 720 19490 81158 135 1128297 180 ₁₈₉₇693 40 150 21055 1125 1922 1914 2454 152 535 1087 219 1895 14261 816 211 574 554 206 144 218 157 263 158 1896 193 1535 1865 1522238 3120 198 162 2649 1818 2903 12 1493 1082 646 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 axis 1 (66.4%) axis 2 (7.85%) 250 29₂₉₁ 266 86 34522559 80 11 33129354 230 62483 323 6 261 6726 231 29961 17 555 111 1477 32 36 247 199 556123 7 383 31 737659 357 540 ₂₄₂ 56298226 346 108457768169 232200 99 265 33 68 519 408424 6040 85 307 386112 347 107 66 41138635448 18 260 3885889 3494823656899582437172384342814863019 108 181 65 164 5 1₅₂₁ 74 ₁₂₄ 258 47 70 5700 1516 430410 58188 245 42613095 15 217 259 49473305 209 541 374 21 5789 6449 431666 20822384 100 1817 5839 814 380 497 119 468 466 5957 465 646350187888 127 19 5917 5535 6079 546 122 27 425302 278 333 5786 6088 405 5743 533 270 50 205341₄₅387 433170384 629 34 6391 89 516720 5612 272 194 6462 10990 5783172 81146 455 421429 276 5427 355296585642420 60 667311111128135 343 6867311 334 297 435 180 344 251 1897₃₂₃₈₃₃₉693 5866 6592746419113 40 150 329 8274₅₉₇₈ 303203₁₀₁₀₃₂₄₉ 210 606 5756 4 166 40714 513 80955 409 5849 5881 267 171262 1125350₁₉₂₂ 252 6355 19142454 309 48 8993 89511950₇₁₇₁ 268 351 5282861953 7793 1525647 1837 257 6366598759257281 535 1087 201281 288 10003273 628 5436 219 547 304₆₁₆₈ 403 229 20 3385 1684 6372 13 8375 6213 1051484₁₈₉₅ 14261 484816 565259829757125838 5847 476 2792118706261 125 5634 607 488₆₃₄₅151398548 1263₃₃₅₈₉₉₇ 6099 238 59064151476494 ₆₅₀ 243 498 414 5879 7645 411 232574 8180 5711 2131 8791554 505 5895 6815 7954206 577710866157550 10484 73055888 490 306 144₆₃₃ 375 1464 5901 381 218 11893 300 61191519845715760037296348 5886 11295 10₁₂₈ 237 319 6197 5790213 2395 1109 5532 2298 5969580773₄₁₅₁361241 1515343733666062636623 6062 158 491 1896 4995926 1096 89902241938660 10393 1535 64486199 1954 105861297809 620472 6216 6387₅₆₇₂22212772146364956978480₅₈₇₈69545876₅₈₄₅ 5744 254₆₈₄₉₁₆₇₇15222821238170952418653120423 198 248 88622627253₁₈₈₀₁₁₇5432₅₉₅₈42114321813₉₈₄₈ 5871 6599 162 10413 5844 294 10430 4563781406 59902649290318187283 126050 24 5778 7412 31260425885 63586302900314935764 8441 7251 5962 6138 5717 1082 646 5980₁₂₆₀₀ 8530 1499 5924 453 5855 337 5928 803 1006355925860 80 1154 6726 61 11132 36 199₁₂₃ 7 59242 10845196669 138281486 181 65 124 151688 13095217 4984 223 18173021945818711289072081181488845297135119 180 1897693 40 15021055 1125 1922 19141525352454₁₀₈₇ 219 1895574211₁₄₂₆₁816 554144218206 157263 158 1896 193 1535 1865 15222383120 198162 26491818 2903 1214931082 646 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 axis 3 (4.8%) axis 4 (3%)

Family aaLactobacillaceae aaOtheraaPrevotellaceae

Fig 5: Individual (left) and variable (right) maps corresponding to the first (top, Axes 1 and 2) and second (bottom, Axes 3 and 4) principal plane. Weaning has a strong and systematic effect on gut microbiota composition, well captured by axis 1. Bacterial families Prevotellaceae (red) and the Lac-tobacillaceae (blue) are two families well known to be affected by weaning and have a high correlation with Axis 1. The latent structure seen in Axes 2, 3 and 4 does not correspond to weaning.

(22)

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1–38, 1977.

C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936a.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936b. ISSN 1860-0980. . URLhttp://dx.doi.org/10.

1007/BF02288367.

P. Hall, J. T Ormerod, and MP Wand. Theory of gaussian variational approximation for a Poisson mixed model. Statistica Sinica, pages 369–389, 2011.

R. Izs´ak. Maximum likelihood fitting of the Poisson log-normal distribution. Environmen-tal and Ecological Statistics, 15(2):143–156, 2008.

T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.

B. Jakuschkin, V. Fievet, L. Schwaller, T. Fort, C. Robin, and C. Vacher. Deciphering the pathobiome: Intra-and interkingdom interactions involving the pathogen Erysiphe alphitoides. Microbial ecology, pages 1–11, 2016.

N. L. Johnson, S. Kotz, and N. Balakrishnan. Discrete multivariate distributions, volume 165. Wiley New York, 1997.

D. Karlis. EM algorithm for mixed Poisson and other discrete distributions. Astin bulletin, 35(01):3–24, 2005.

J. Lafond. Low rank matrix completion with exponential family noise. arXiv preprint arXiv:1502.06919, 2015.

A. J Landgraf and Y. Lee. Dimensionality reduction for binary data through the projection of natural parameters. arXiv preprint 1510.06112, 2015.

D. D Lee and H S. Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.

J. Li and D. Tao. Simple exponential family PCA. In AISTATS, pages 453–460, 2010. L. T. Liu, E. Dobriban, and A. Singer. ePCA: High Dimensional Exponential Family

PCA. ArXiv e-prints arXiv:1611.05550, 2016.

N. Mach, M. Berri, J. Estell´e, F. Levenez, G. Lemonnier, C. Denis, J.-J. Leplat, C. Cheva-leyre, Y. Billon, J. Dor, and et al. Early-life establishment of the swine gut microbiome and impact on host phenotypes. Environmental Microbiology Reports, 7(3):554–569, May 2015. ISSN 1758-2229. . URLhttp://dx.doi.org/10.1111/1758-2229.12285. K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate analysis. Academic press, 1979. T. P Minka. Automatic choice of dimensionality for PCA. In NIPS, volume 13, pages

598–604, 2000.

S. Mohamed, Z. Ghahramani, and K. A Heller. Bayesian exponential family PCA. In Advances in neural information processing systems, pages 1089–1096, 2009.

J. F Nelson. Multivariate gamma-poisson models. Journal of the American Statistical Association, 80(392):828–834, 1985.

William H Press, Brian P Flannery, Saul A Teukolsky, William T Vetterling, et al. Nu-merical recipes. cambridge University Press, cambridge, third edition edition, 1989. R Development Core Team. R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria, 2008. URL http://www.

R-project.org. ISBN 3-900051-07-0.

J A. Royle and C. K Wikle. Efficient statistical mapping of avian count data. Environ-mental and Ecological Statistics, 12(2):225–243, 2005.

J. Salmon, Z. Harmany, C.-A. Deledalle, and R. Willett. Poisson noise reduction with non-local PCA. Journal of mathematical imaging and vision, 48(2):279–294, 2014. G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6:461–4, 1978.

(23)

S. Srivastava and L. Chen. A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic acids research, 38(17):e170–e170, 2010.

M. E Tipping and C. M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999. M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and

varia-tional inference. Found. Trends Mach. Learn., 1(1–2):1–305, 2008.

Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN 978-0-387-98140-6. URLhttp://ggplot2.org.

D. M Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with appli-cations to sparse principal components and canonical correlation analysis. Biostatistics, page kxp008, 2009.

M. Zhou. Nonparametric bayesian negative binomial factor analysis. arXiv preprint arXiv:1604.07464, 2016.

M. Zhou, L. Hannah, D. B Dunson, and L. Carin. Beta-negative binomial process and poisson factor analysis. In AISTATS, volume 22, pages 1462–1471, 2012.

APPENDIX A: CONVEXITY LEMMAS

Lemma 1. For any vectors θ, x, m, s and b (with matching dimensions) and convex function f , if u ∼ N (0, I) and w = m + s u ∼ N (m, diag(s s)), then the map g : (θ, m, s, b) 7→ E[f (θ|x + b|w)] is convex in (θ, b) for (m, s) fixed and vice-versa.

Proof. Note Z = θ|x + b|w = (θ|x + b|m) + b|(s u). The first order derivative of g is

∇(θ, b, m, s) = Ef0

(Z)x m + s u b b u| . The second order partial derivatives of g are:

And the associated quadratic form Φ1(v, w) = (v, w)|Ψ1(θ, b)(v, w) and

Φ2(v, w) = (v, w)|Ψ2(m, s)(v, w) can be simplified to

Φ1(v, w) = E[f00(Z)(x|v + (m + s u)|w)2] ≥ 0

Φ2(v, w) = E[f00(Z)(b|v + (b u)|w)2] ≥ 0

The Hessians Ψ1 and Ψ2 are thus semidefinite positive, which ends the

(24)

Lemma 2. For any matrices Θ, X, M , S and B (with matching di-mensions) and convex function f , if U = [U1, . . . , Un]| where the Ui are

i.i.d and Ui ∼ N (0, I) and W = M + S U . The map g : (Θ, M , S, B) 7→

1|nE[f (XΘ|+W B|)]1p is convex in (Θ, B) for (M , S) fixed and vice-versa.

Proof. The function g is a sum of functions of the form gij : (Θ, M , S, B) 7→

E[f (X|iΘj+ B|j(Mi+ Si U )]. The result follows from Lemma1.

MIA-Paris

UMR 518 AgroParisTech / INRA AgroParisTech

16, rue Claude Bernard 75231 Paris CEDEX 05, France

E-mail:[email protected];[email protected]

INRA Unit´e MaIAGE

Bt. 233 et 210 Domaine de Vilvert

78352 JOUY-EN-JOSAS CEDEX, France E-mail:[email protected]