(1)

SCGLR: a component-based multivariate regression method to model species distributions.

Frédéric Mortier

with Xavier Bry, Guillaume Cornu & Catherine Trottier

International Statistical Ecology Conference, Montpellier

July 2014

(2)

Problem statement

⇒ Question: What in X may predict what in Y?

(3)

Collinearities:

to avoid overfitting, search for components.

Components must:

- capture enough variance in X,
- model and predict y.

Several components:

to avoid redundancy, search for uncorrelated components.

→ construction constraint: orthogonality.

Multiple y:

same components,

but each y with its own regression coefficients.

Exponential-family distributed responses → generalized linear regression.

(4)

PLS1: single y

• The first component is a compromise between the direction of X that best predicts y and the first principal component (PC) of X.

→ Criterion:

$$\max_{\|u\|^2=1} \operatorname{cov}(y, Xu) = \max_{\|u\|^2=1} \left[ \sqrt{\operatorname{var}(y)}\,\sqrt{\operatorname{var}(Xu)}\;\operatorname{corr}(y, Xu) \right]$$

→ Program to solve: $P_1: \max_{\|u\|^2=1} \langle y, Xu \rangle_W$

• Further components: W-orthogonality of the components is ensured by using the part of X that is not yet used, i.e. the residuals of X regressed on the previous components.
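The PLS1 construction above can be sketched in base R. This is a minimal illustration on simulated data, not code from the talk: the function name `pls1_direction`, the uniform weights and the data are all assumptions. Under the unit-norm constraint, the maximiser of $\langle y, Xu \rangle_W = y'WXu$ is proportional to $X'Wy$, and the second component is built from the W-residuals of X regressed on the first.

```r
# Simulated, centred data (illustrative only)
set.seed(1)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- X %*% c(1, -1, 0, 0) + rnorm(n, sd = 0.1)
y <- y - mean(y)
W <- diag(1 / n, n)   # uniform weights (assumption)

# Unit-norm maximiser of <y, Xu>_W: u proportional to X' W y
pls1_direction <- function(X, y, W) {
  u <- crossprod(X, W %*% y)
  u / sqrt(sum(u^2))
}

u1 <- pls1_direction(X, y, W)
f1 <- X %*% u1                      # first component

# Deflation: W-residuals of X regressed on the first component
coef1 <- (t(f1) %*% W %*% X) / as.numeric(t(f1) %*% W %*% f1)
X1 <- X - f1 %*% coef1

u2 <- pls1_direction(X1, y, W)
f2 <- X1 %*% u2                     # second component

# W-inner product of the two components (zero up to rounding)
w_inner <- as.numeric(t(f1) %*% W %*% f2)
```

Because the second direction is computed on the deflated matrix, the W-orthogonality of the components holds by construction rather than being imposed as an explicit constraint.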

(5)

PLS2: multiple y

• The first component can be obtained from several equivalent programs:

→ $P_2: \max_{\|u\|^2=1,\ \|v\|^2=1} \langle Xu, Yv \rangle_W$

→ $P_3: \max_{\|u\|^2=1} \sum_{k=1}^{q} \langle Xu, y_k \rangle_W^2$

$P_3$ is adapted to the case of multiple weightings:

→ $P_4: \max_{\|u\|^2=1} \left[ \sum_{k=1}^{q} \langle Xu, y_k \rangle_{W_k}^2 \right]$

⟹ Solution: eigenvector associated with the largest eigenvalue of

$$A = X' \Omega X \quad \text{with} \quad \Omega = \sum_{k=1}^{q} W_k\, y_k y_k'\, W_k$$

• Further components: idem, subject to the constraint of orthogonality to the previous components.
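The eigenvector solution of the multiply weighted program can be checked numerically. The sketch below uses base R only; the simulated data, the per-response diagonal weights in `W_list` and the helper `criterion` are illustrative assumptions, not objects from the SCGLR package.

```r
set.seed(2)
n <- 40; p <- 5; q <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)

# One diagonal weight matrix per response (arbitrary positive weights)
W_list <- lapply(seq_len(q), function(k) diag(runif(n, 0.5, 1.5) / n))

# Omega = sum_k W_k y_k y_k' W_k
Omega <- Reduce(`+`, lapply(seq_len(q), function(k) {
  wy <- W_list[[k]] %*% Y[, k]
  tcrossprod(wy)
}))

# A = X' Omega X; its leading eigenvector is the first direction u
A <- t(X) %*% Omega %*% X
eig <- eigen(A, symmetric = TRUE)
u <- eig$vectors[, 1]

# Criterion sum_k <Xu, y_k>^2_{W_k}, evaluated directly
criterion <- function(u) {
  sum(sapply(seq_len(q), function(k) {
    as.numeric(t(X %*% u) %*% W_list[[k]] %*% Y[, k])^2
  }))
}
```

Evaluating `criterion(u)` returns the largest eigenvalue of `A` up to rounding, which is the value of the program at its optimum.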

(6)

Multiple GLM with common predictor

In the GLM, the linear predictors are constrained to be collinear to one another:

$$\forall k = 1, \dots, q: \quad \eta_k = X\beta_k + T\delta_k = X\gamma_k u + T\delta_k$$

→ modified Fisher Scoring Algorithm (FSA):

u and $\gamma = (\gamma_k)_{k=1,\dots,q}$ are estimated by iterating a two-step alternating least squares sequence:

(1) Given γ, the working data $(z_k)_k$ are regressed on the matrix $[\gamma \otimes X,\ 1_q \otimes T]$ with respect to the working matrix $W = \operatorname{diag}[W_k]_k$

→ coefficient vectors $\hat{u}$, $\hat{\delta} = (\hat{\delta}_k)_k$

→ $\hat{u}$ made unit norm → updated u

(2) Given Xu, each working vector $z_k$ is regressed on $[Xu, T]$ with respect to the working matrix $W_k$
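The two steps can be read as alternating minimization of one weighted least squares criterion. The block below is a sketch of that reading, not a formula from the slides: it assumes the working data $z_k$ and weights $W_k$ are held fixed within an iteration, and it introduces the name $Q$ for the criterion.

```latex
% Criterion at fixed working data z_k and working weights W_k
\[
  Q(u, \gamma, \delta) \;=\; \sum_{k=1}^{q}
  \left\| z_k - X\gamma_k u - T\delta_k \right\|^2_{W_k},
  \qquad \|a\|^2_{W} = a' W a .
\]
% Step (1): gamma fixed, Q is quadratic in (u, delta)
\[
  (\hat{u}, \hat{\delta}) = \arg\min_{u,\,\delta} Q(u, \gamma, \delta),
  \qquad u \leftarrow \hat{u} / \|\hat{u}\| .
\]
% Step (2): u fixed, Q separates into q independent regressions
\[
  (\hat{\gamma}_k, \hat{\delta}_k) = \arg\min_{\gamma_k,\,\delta_k}
  \left\| z_k - \gamma_k\, Xu - T\delta_k \right\|^2_{W_k},
  \qquad k = 1, \dots, q .
\]
```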

(7)

SCGLR

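Step t of the FSA:

$$\min_{\gamma,\, u:\, u'u=1} \sum_k \left\| z_k^{[t]} - X\gamma_k u \right\|^2_{W_k^{[t]}}$$

$$\Leftrightarrow \min_{u:\, u'u=1} \sum_k \left\| z_k^{[t]} - \Pi_{Xu}\, z_k^{[t]} \right\|^2_{W_k^{[t]}}$$

$$\Leftrightarrow \max_{u:\, u'u=1} \sum_k \left\| z_k^{[t]} \right\|^2_{W_k^{[t]}} \cos^2_{W_k^{[t]}}\!\left( z_k^{[t]}, Xu \right)$$

is replaced by:

$$\max_{u:\, u'u=1} \sum_k \left\| z_k^{[t]} \right\|^2_{W_k^{[t]}} \cos^2_{W_k^{[t]}}\!\left( z_k^{[t]}, Xu \right) \left\| Xu \right\|^2_{W_k^{[t]}}$$

equivalent to:

$$\max_{u:\, u'u=1} \left[ \sum_k \left\langle z_k^{[t]}, Xu \right\rangle^2_{W_k^{[t]}} \right]$$

= local extended PLS2

⟹ Solution: eigenvector associated with the largest eigenvalue of

$$A = X' \Omega^{[t]} X \quad \text{with} \quad \Omega^{[t]} = \sum_{k=1}^{q} W_k^{[t]}\, z_k^{[t]} z_k^{[t]\prime}\, W_k^{[t]}$$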
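The slide states the last equivalence and the eigenvector solution without the intermediate algebra. A short derivation follows, with the iteration index $[t]$ dropped for readability and assuming each $W_k$ is symmetric positive definite.

```latex
% Definition of the W_k-cosine turns the modified criterion into a squared inner product
\[
  \cos^2_{W_k}(z_k, Xu)
  = \frac{\langle z_k, Xu \rangle^2_{W_k}}{\|z_k\|^2_{W_k}\, \|Xu\|^2_{W_k}}
  \;\Longrightarrow\;
  \|z_k\|^2_{W_k} \cos^2_{W_k}(z_k, Xu)\, \|Xu\|^2_{W_k}
  = \langle z_k, Xu \rangle^2_{W_k} .
\]
% Summing over k gives a quadratic form in u
\[
  \sum_{k=1}^{q} \langle z_k, Xu \rangle^2_{W_k}
  = \sum_{k=1}^{q} u' X' W_k z_k z_k' W_k X u
  = u' X' \Omega X u
  = u' A u .
\]
% Maximising u'Au under u'u = 1 is a Rayleigh quotient problem
\[
  \max_{u'u = 1} u' A u = \lambda_{\max}(A),
  \quad \text{attained at a unit eigenvector of } A \text{ for } \lambda_{\max}(A).
\]
```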

(8)

Application

Abundance of tropical tree species (CoForChange project)

all trees with diameter greater than 30 cm

more than 120,000 plots of 0.5 ha

more than 200 genera

soil, rainfall, human disturbance and vegetation activity (EVI) maps available

(9)

Application II

Selecting the number of components: a cross-validation approach

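> library(SCGLR)

> # cross-validate models with up to K=12 components

> genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,

+ offset=genus$surface)

> # scale each row of the criterion matrix by its own mean, then average over rows

> mean.crit <- t(apply(genus.cv,1,function(x) x/mean(x)))

> mean.crit <- apply(mean.crit,2,mean)

> # index of the smallest averaged criterion, shifted by one

> K.cv <- which.min(mean.crit)-1

> cat("Best number of components: ",K.cv)

Best number of components: 8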

[Figure: cross-validation criterion (MSPE) versus K, the number of components, for K from 0 to 12.]
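The selection rule used in the console session above can be tried on a toy criterion matrix. This sketch assumes, as the session suggests but the slides do not state, that the cross-validation result has one row per response and one column per number of components starting at 0. The matrix `cv` and its values are made up for illustration.

```r
# Toy criterion matrix: 2 responses, models with 0 to 3 components
cv <- rbind(sp1 = c(4.0, 3.0, 2.0, 2.5),
            sp2 = c(10,  9,   6,   7))
colnames(cv) <- paste0("K", 0:3)

# Scale each response's criterion by its own mean so that responses
# measured on different scales contribute equally
rel <- t(apply(cv, 1, function(x) x / mean(x)))

# Average over responses, then pick the number of components with the
# smallest averaged criterion (column 1 is K = 0, hence the -1)
mean_crit <- colMeans(rel)
K_best <- unname(which.min(mean_crit)) - 1
```

Without the row scaling, the response with the largest raw criterion values (`sp2` here) would dominate the average.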

(10)

Application III

Fitting and Plots

> genus.scglr <- scglr(formula=form, data=genus, family=fam, K=K.cv,

(11)

Application IV

Prediction of two genera

Manilkara mabokeensis and Musanga cecropioides

(12)

Ongoing work

New alternative optimization algorithms: Iterative Normalized Gradient

Multi-table (Theme) support

New versions of the SCGLR package

Enhancements for plot customization

New distribution families (Negative-Binomial, Exponential, Inverse Gaussian)

Multi-theme

(13)

References

1 X. Bry, C. Trottier, T. Verron and F. Mortier (2013). Supervised component generalized linear regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47.

2 S. Gourlet-Fleury et al. (2009–2014). CoForChange project: http://www.coforchange.eu/.

3 F. Mortier, C. Trottier, G. Cornu and X. Bry (2014). SCGLR: Supervised Component Generalized Linear Regression. R package version 1.2. http://CRAN.R-project.org/package=SCGLR

4 F. Mortier, C. Trottier, G. Cornu and X. Bry (2014). SCGLR - An R package for Supervised Component Generalized Linear Regression. Journal of Statistical Software. Submitted.
