SCGLR: a component-based multivariate regression method to model species distributions. Frédéric Mortier

(1)


Frédéric Mortier

with Xavier Bry, Guillaume Cornu & Catherine Trottier

International Statistical Ecology Conference Montpellier

July 2014

(2)

Problem statement

⇒ Question: What in X may predict what in Y?

(3)

Collinearities:

to avoid overfitting, search for components.

Components must:

- capture enough variance in X,
- model and predict y.

Several components:

to avoid redundancy, search for uncorrelated components.

→ construction constraint: orthogonality.

Multiple y :

same components,

but each y with its own regression coefficients.

Exponential-family distributed y → generalized linear regression.

(4)

PLS1 : single y

• First component is a compromise between the direction of X that best predicts y and the first principal component (PC) of X.

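→ Criterion:

$\max_{\|u\|^2 = 1} \operatorname{cov}(y, Xu) = \max_{\|u\|^2 = 1} \left[ \sqrt{\operatorname{var}(y)} \, \sqrt{\operatorname{var}(Xu)} \, \operatorname{corr}(y, Xu) \right]$

→ Program to solve:

$P_1: \max_{\|u\|^2 = 1} \langle y, Xu \rangle_W$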

• Further components: W-orthogonality of components is ensured by using the part of X that is not yet used, i.e. the residuals of X regressed on previous components.
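
Since $\langle y, Xu \rangle_W = u'X'Wy$, the unit vector solving the program is proportional to $X'Wy$. The following base R lines are a minimal sketch of the first two PLS1 components under that reading. The simulated data, the uniform weights W = I/n and all object names (X, y, W, u1, f1, crit, ...) are assumptions made for illustration and are not taken from the SCGLR package.

```r
set.seed(1)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))                    # simulated, centred and scaled predictors
y <- as.vector(scale(X[, 1] - 0.5 * X[, 2] + rnorm(n)))   # simulated response
W <- diag(1 / n, n)                                       # assumed uniform observation weights

# First component: u maximises <y, Xu>_W subject to ||u|| = 1, i.e. u is proportional to X'Wy
g  <- crossprod(X, W %*% y)
u1 <- as.vector(g / sqrt(sum(g^2)))
f1 <- as.vector(X %*% u1)

# Deflation: residuals of X regressed (in the W metric) on the first component
X1 <- X - f1 %*% (crossprod(f1, W %*% X) / as.numeric(crossprod(f1, W %*% f1)))

# Second component: same program, applied to the residual matrix
g2 <- crossprod(X1, W %*% y)
u2 <- as.vector(g2 / sqrt(sum(g2^2)))
f2 <- as.vector(X1 %*% u2)

# Criterion <y, Xu>_W, used to compare u1 with other unit vectors
crit <- function(u) as.numeric(t(y) %*% W %*% X %*% u)
```

With this construction `sum(f1 * (W %*% f2))` is zero up to rounding, which is the W-orthogonality the deflation step is meant to guarantee.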

(5)

PLS2: multiple y

• First component can be obtained using several equivalent programs:

→ $P_2: \max_{\|u\|^2 = 1, \, \|v\|^2 = 1} \langle Xu, Yv \rangle_W$

→ $P_3: \max_{\|u\|^2 = 1} \sum_{k=1}^{q} \langle Xu, y_k \rangle^2_W$

$P_3$ is adapted to the case of multiple weighting:

→ $P_4: \max_{\|u\|^2 = 1} \left[ \sum_{k=1}^{q} \langle Xu, y_k \rangle^2_{W_k} \right]$

⇒ Solution: eigenvector associated with the largest eigenvalue of:

$A = X' \Omega X$ with $\Omega = \sum_{k=1}^{q} W_k y_k y_k' W_k$

• Further components: same program, subject to the constraint of orthogonality to previous components.
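
The eigenvector solution of the multiple-weighting program can be checked numerically. The base R sketch below builds A = X'ΩX from simulated data and compares the criterion value at the leading eigenvector with the largest eigenvalue. The data, the per-response weight vectors `w` (standing for the diagonals of the W_k) and the helper `p4` are illustrative assumptions.

```r
set.seed(2)
n <- 40; p <- 5; q <- 3
X <- scale(matrix(rnorm(n * p), n, p))               # simulated predictors
Y <- scale(X[, 1:q] + matrix(rnorm(n * q), n, q))    # simulated responses
# One diagonal weight matrix W_k per response, stored as a weight vector (assumed values)
w <- lapply(seq_len(q), function(k) runif(n, 0.5, 1.5) / n)

# Omega = sum_k W_k y_k y_k' W_k, then A = X' Omega X
Omega <- matrix(0, n, n)
for (k in seq_len(q)) {
  wy <- w[[k]] * Y[, k]              # W_k y_k
  Omega <- Omega + tcrossprod(wy)    # adds W_k y_k y_k' W_k
}
A <- crossprod(X, Omega %*% X)

eig <- eigen(A, symmetric = TRUE)
u <- eig$vectors[, 1]                # unit-norm direction of the first component

# Criterion: sum_k <Xu, y_k>^2 in the metric W_k
p4 <- function(u) {
  sum(sapply(seq_len(q), function(k) sum(w[[k]] * (X %*% u) * Y[, k])^2))
}
```

Here `p4(u)` coincides with `eig$values[1]`, because the criterion is the quadratic form u'Au.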

(6)

Multiple GLM with common predictor

In the GLM, linear predictors are constrained to be collinear to one another:

$\forall k = 1, \dots, q: \quad \eta_k = X \beta_k + T \delta_k = X \gamma_k u + T \delta_k$

→ modified Fisher Scoring Algorithm:

u and $\gamma = (\gamma_k)_{k=1,\dots,q}$ are estimated by iterating an alternated least squares sequence of two steps:

(1) Given $\gamma$, the working data $(z_k)_k$ are regressed on the matrix $[\gamma \otimes X, \; 1_q \otimes T]$ with respect to the working matrix $W = \operatorname{diag}[W_k]_k$

→ coefficient vectors $\hat{u}$, $\hat{\delta} = (\hat{\delta}_k)_k$

→ $\hat{u}$ made unit norm → updated u

(2) Given Xu, each working vector $z_k$ is regressed on $[Xu, T]$ with respect to the working matrix $W_k$
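
The two-step sequence can be sketched in base R for fixed working variables. This is only an illustration of the alternation, not the package's implementation: in a real Fisher scoring iteration the z_k and W_k would be recomputed from the current linear predictors, whereas here they are simulated stand-ins (`Z`, `Wt`). T is assumed to be a single intercept column (`Tm`), and its block is built as `kronecker(diag(q), Tm)` so that each response keeps its own δ_k, which is also an assumption.

```r
set.seed(3)
n <- 60; p <- 4; q <- 3
X  <- scale(matrix(rnorm(n * p), n, p))       # simulated predictors
Tm <- matrix(1, n, 1)                         # additional covariates T: intercept only (assumed)
Z  <- sapply(1:q, function(k) k * X[, 1] + X[, 2] + rnorm(n))  # stand-in working variables z_k
Wt <- matrix(runif(n * q, 0.5, 1.5), n, q)    # stand-in working weights (diagonal of each W_k)

# Weighted residual sum of squares over all responses
objective <- function(u, gamma, delta) {
  f <- as.vector(X %*% u)
  sum(sapply(1:q, function(k) sum(Wt[, k] * (Z[, k] - gamma[k] * f - delta[k])^2)))
}

u <- rep(1, p) / sqrt(p); gamma <- rep(1, q); delta <- rep(0, q)
obj <- objective(u, gamma, delta)
for (it in 1:20) {
  # Step 1: given gamma, regress the stacked z on [gamma (x) X, block-diagonal T]
  D   <- cbind(kronecker(matrix(gamma, q, 1), X), kronecker(diag(q), Tm))
  fit <- lm.wfit(D, as.vector(Z), as.vector(Wt))
  u   <- fit$coefficients[1:p]
  u   <- u / sqrt(sum(u^2))                   # unit-norm update of u
  # Step 2: given Xu, regress each z_k on [Xu, T] with weights W_k
  f <- as.vector(X %*% u)
  for (k in 1:q) {
    cf <- lm.wfit(cbind(f, Tm), Z[, k], Wt[, k])$coefficients
    gamma[k] <- cf[1]; delta[k] <- cf[2]
  }
  obj <- c(obj, objective(u, gamma, delta))
}
```

The vector `obj` records the criterion after each full pass. It never increases, because each step is an exact weighted least squares fit and rescaling u is absorbed by the γ_k in step 2.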

(7)

SCGLR

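Step t of the FSA:

$\min_{\gamma, \, u: \, u'u = 1} \sum_k \| z_k^{[t]} - X \gamma_k u \|^2_{W_k^{[t]}}$

$\Leftrightarrow \min_{u: \, u'u = 1} \sum_k \| z_k^{[t]} - \Pi_{Xu} z_k^{[t]} \|^2_{W_k^{[t]}}$

$\Leftrightarrow \max_{u: \, u'u = 1} \sum_k \| z_k^{[t]} \|^2_{W_k^{[t]}} \cos^2_{W_k^{[t]}}(z_k^{[t]}, Xu)$

is replaced by:

$\max_{u: \, u'u = 1} \sum_k \| z_k^{[t]} \|^2_{W_k^{[t]}} \cos^2_{W_k^{[t]}}(z_k^{[t]}, Xu) \, \| Xu \|^2_{W_k^{[t]}}$

equivalent to:

$\max_{u: \, u'u = 1} \left[ \sum_k \langle z_k^{[t]}, Xu \rangle^2_{W_k^{[t]}} \right]$ = local extended PLS2

⇒ Solution: eigenvector associated with the largest eigenvalue of:

$A = X' \Omega^{[t]} X$ with $\Omega^{[t]} = \sum_{k=1}^{q} W_k^{[t]} z_k^{[t]} z_k^{[t]\prime} W_k^{[t]}$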
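
The two equivalences used on this slide rest on standard identities in the metric W_k. They are spelled out here as a short worked step, with the iteration index [t] dropped and Π_{Xu} read as the W_k-orthogonal projector onto the direction Xu (an interpretation of the slide's notation):

```latex
\begin{aligned}
\| z_k - \Pi_{Xu} z_k \|^2_{W_k}
  &= \| z_k \|^2_{W_k} - \| \Pi_{Xu} z_k \|^2_{W_k}
   = \| z_k \|^2_{W_k} \left( 1 - \cos^2_{W_k}(z_k, Xu) \right), \\
\| z_k \|^2_{W_k} \, \cos^2_{W_k}(z_k, Xu) \, \| Xu \|^2_{W_k}
  &= \langle z_k, Xu \rangle^2_{W_k}
   = u' X' W_k z_k z_k' W_k X u .
\end{aligned}
```

Summing the first line over k shows that minimizing the residuals amounts to maximizing the weighted cos² terms, since the norms of the z_k do not depend on u. Summing the second line over k gives the quadratic form u'Au, whose maximizer under u'u = 1 is the leading eigenvector of A.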

(8)

Application

Abundance of tropical tree species (CoForChange project)

all trees with diameter greater than 30 cm

more than 120,000 plots of 0.5 ha

more than 200 genera

soil, rainfall, human disturbances, vegetation activity (EVI) maps available

(9)

Application II

Selecting the number of components: a cross-validation approach

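> library(SCGLR)
> # cross-validation with up to K = 12 components
> genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,
+                           offset=genus$surface)
> # rescale each response's criterion by its mean, then average over responses
> mean.crit <- t(apply(genus.cv,1,function(x) x/mean(x)))
> mean.crit <- apply(mean.crit,2,mean)
> # the first column corresponds to K = 0 (see the plot axis), hence the -1
> K.cv <- which.min(mean.crit)-1
> cat("Best number of components: ",K.cv)
Best number of components: 8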

[Figure: criterion (MSPE) versus K, the number of components, for K from 0 to 12.]
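
The selection rule can be tried on a small made-up matrix. The values below are hypothetical and do not come from the genus data. The layout (one row per response, one column per number of components starting at K = 0) is an assumption inferred from the `-1` in the code and from the plot axis starting at 0.

```r
# Hypothetical cross-validation errors: rows = responses, columns = K = 0, 1, ..., 4
cv <- rbind(sp1 = c(10, 8, 7, 7.5, 9),
            sp2 = c(200, 150, 120, 130, 160),
            sp3 = c(1.0, 0.9, 0.95, 1.0, 1.1))
colnames(cv) <- 0:4

# Divide each response's errors by their mean so that no response dominates the average
rel <- t(apply(cv, 1, function(x) x / mean(x)))
mean.crit <- colMeans(rel)

# Column 1 corresponds to K = 0, hence the "- 1"
K.best <- unname(which.min(mean.crit)) - 1
```

Without the rescaling, `sp2` would dominate a plain column average simply because its errors are on a larger scale.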

(10)

Application III

Fitting and Plots

> genus.scglr <- scglr(formula=form, data=genus, family=fam, K=K.cv,

(11)

Application IV

Prediction of two genera

[Figure panels: Manilkara mabokeensis; Musanga cecropioides]

(12)

Ongoing work

New alternate optimization algorithms: Iterative Normalized Gradient

Multi-table (Theme) support

New versions of the SCGLR package:

- Enhancements for plot customization
- New distribution families (Negative-Binomial, Exponential, Inverse Gaussian)
- Multi-theme

(13)

References

1. X. Bry, C. Trottier, T. Verron and F. Mortier (2013). Supervised component generalized linear regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119(0), 47.

2. S. Gourlet-Fleury et al. (2009–2014). CoForChange project: http://www.coforchange.eu/.

3. F. Mortier, C. Trottier, G. Cornu and X. Bry (2014). SCGLR: Supervised Component Generalized Linear Regression (SCGLR). R package version 1.2. http://CRAN.R-project.org/package=SCGLR

4. F. Mortier, C. Trottier, G. Cornu and X. Bry (2014). SCGLR - An R package for Supervised Component Generalized Linear Regression. Journal of Statistical Software. Submitted.