• Aucun résultat trouvé

Non linear robust regression in high dimension

N/A
N/A
Protected

Academic year: 2021

Partager "Non linear robust regression in high dimension"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01423622

https://hal.archives-ouvertes.fr/hal-01423622

Submitted on 30 Dec 2016

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Non linear robust regression in high dimension

Emeline Perthame, Florence Forbes, Brice Olivier, Antoine Deleforge

To cite this version:

Emeline Perthame, Florence Forbes, Brice Olivier, Antoine Deleforge. Non linear robust regression in high dimension. The XXVIIIth International Biometric Conference, Jul 2016, Victoria, Canada. �hal-01423622�

(2)

Non linear robust regression in high dimension

E. Perthame

, F. Forbes

, B. Olivier

, A. Deleforge

∗∗

MISTIS, INRIA Grenoble, France, ∗∗PANAMA, INRIA Rennes, France

1 - Non linear mapping problem

The goal is to retrieve X from Y through a non linear regression

func-tion g E(X|Y = y) = g(y) with Y ∈ RD, X ∈ RL,D  L y =      y1 .. . .. . yD      g(y)   x1 .. . xL   = x

For example, Y is a reflectance spectrum (D = 184) mea-sured at a specific location of the Mars planet and X is the composition of the ground at this location (L = 3)

0 50 100 150 0.1 0.2 0.3 0.4 0.5 Wavelength Reflectance prop. of dust prop. of CO2 ice

prop. of water ice " #

2 - Difficulties

High dimension (D  L)Inverse regression strategy

E(Y|X = x) = f (x)

Non linear mappingPiecewise linear approximation of f (and g)

Y = PKk=1(IZ=k)AkX + bk + Ek with E(Ek2) ∝ Σk and Z multinomial latent variable

P(Z = k) = πk

Dealing with outliersHeavy tail distribution

-6 -4 -2 0 2 4 6 0.0 0.1 0.2 0.3 0.4 x Densit y Gaussian Student α=0.1

Generalized Student distribution

SM(y; µ, Σ, α, γ) =

Γ(α + M/2)

|Σ|1/2 Γ(α) (2πγ)M/2 [1 + δ(y, µ, Σ)/(2γ)]

−(α+M/2)

,

Gaussian scale mixture representation (using weight variable U

dis-tributed according to a Gamma distribution ) SM(y; µ, Σ, α, γ) =

Z ∞

0

NM(y; µ, Σ/u) G(u; α, γ) du

Parameters estimation is tractable by a general EM algorithm

3 - SLLiM model

A mixture of Student distributions encodes the piecewise linear

regres-sions p(X = x, Y = y|Z = k) = SL+D [x, y]T; mk, Vk, αk, 1 with mk =  ck Akck + bk  and Vk =  Γk ΓkATk AkΓk Σk + AkΓkATk 

Therefore, the joint density (X, Y) is a mixture of Student regressions

p(X = x, Y = y) = PKk=1 πkSL+D([x, y]T; mk, Vk, αk, 1)

We denote by θ = (ck, Γk, Ak, bk, Σk, πk, αk)1≤k≤K the set of parametersExtension to partially observed responses

X = [T, W]T with T observed and W hidden variables

Allow to account for dependence among covariates and reduce the

sensitivity of the method to model mispecification

4 - Inverse regression strategy

Forward strategy (x = g(y)), conditionals are

p(X = x; θ) = PKk=1 πkSL(x; ck, Γk, αk, 1)

p(Y = y|X = x; θ) = PKk=1 πkSD(y; Akx + bk, Σk, α y k, γ y k) → D = 500, L = 2, Γk diagonal → 126 254 parametersInverse strategy (y = f (x)) p(Y = y; θ∗) = PKk=1 πkSD(y; c∗k, Γ∗k, αk, 1) p(X = x|Y = y;θ∗) = PKk=1 πkSL(x; Ak∗y + b∗k, Σ∗k, αxk, γkx) with θ∗ = (c∗k, Γ∗k, A∗k, b∗k, Σ∗k, πk, αk)1≤k≤K and c∗k = Akck + bk; Γ∗k = Σk + AkΓkATk ; A∗k = Σ∗kATk Σk−1; b∗k = Σ∗k(Γ−1k ck − ATk Σ−1k bk); Σ∗k = (Γ−1k + ATk Σ−1k Ak)−1 → D = 500, L = 2, Σk diagonal → 2 003 parameters

Our approach reduces the number of parameters to estimate

Prediction : The regression function of interest g is approached by eg

e g(y) = E(X|Y = y;θ∗) = K X k=1 πkSD(y; c∗k, Γ∗k, αk, 1) PK j=1 πjSD(y; c ∗ j, Γ∗j, αj, 1) (A∗ky + b∗k)

5 - Estimation of θ by EM algorithm

E-step

E-U step: Update of weight of each data point E[U|x, y, Z = k; θ(i)]

E-Z step: Update posterior probabilities P(Z = k|x, y, θ(i))

M-step

− (πk, ck, Γk) → Estimation is like a standard Student mixture

− (Ak, bk, Σk) → Estimation is “Linear regression-like"

− αk → Not in closed-form but standard

6 - Application to air quality in the subway in Paris

Prediction of NO (L=1) from NO2 (D=1) in Châtelet station in Paris

during March 2015 (N = 341 measures)

SLLiM achieves better prediction a than its Gaussian counterpart

(GLLiM) on complete data

SLLiM is equivalent to GLLiM when no outliers (removed)

Estimated regression functions with 7 outliers (left panel) and no outliers (center panel) and prediction error rates (right panel)

20 30 40 50 60 70 80 0 100 200 300 400 500 NO2 NO GLLiM SLLiM 20 30 40 50 60 70 80 0 100 200 300 400 500 NO2 NO GLLiM SLLiM 1 2 3 4 5 6 7 8 9 10 0.76 0.78 0.80 0.82 0.84 K NRMSE GLLiM SLLiM GLLiM-WO SLLiM-WO aNRMSE = s P i (ti −ti)ˆ 2 P i (ti −ttrain)¯ 2

7 - Other applications

Application when D  L

Hyperspectral data on MarsD=184, L=3, N=6983

∗ K fixed to 10, number of latent variables W estimated by BIC

Prediction of proportion of CO2 ice and dust from spectraNear-infrared spectra on orange juices

D=134, L=1, N=218

Prediction of sucrose level of each orange juice from its spectraComparison with other non linear regression methods

Prediction error rates for Mars data: average NRMSE (standard deviations) for proportions of CO2 ice and dust over 100 cross validation runs

Method Prop. of CO2 ice Prop. of dust

SLLiM (K=10) 0.168 (0.019) 0.145 (0.020) GLLiM (K=10) 0.180 (0.023) 0.155 (0.023) Regression splines 0.173 (0.016) 0.160 (0.021) SIR 0.243 (0.025) 0.157 (0.016) RVM 0.299 (0.021) 0.275 (0.034)

References

[1] A. Deleforge, F. Forbes, and R. Horaud. High-dimensional regression with Gaus-sian mixtures and partially-latent response variables. Statistics and Computing, 2015. [2] E. Perthame, F. Forbes, and A. Deleforge. Inverse regression approach to robust

non-linear high-to-low dimensional mapping. Submitted, 2016.

[3] Link to RATP (subway) data: http://data.ratp.fr/explore/dataset/qua-lite-de-lair-mesuree-dans-la-station-chatelet

Références

Documents relatifs

To test whether the vesicular pool of Atat1 promotes the acetyl- ation of -tubulin in MTs, we isolated subcellular fractions from newborn mouse cortices and then assessed

Néanmoins, la dualité des acides (Lewis et Bronsted) est un système dispendieux, dont le recyclage est une opération complexe et par conséquent difficilement applicable à

Cette mutation familiale du gène MME est une substitution d’une base guanine par une base adenine sur le chromosome 3q25.2, ce qui induit un remplacement d’un acide aminé cystéine

En ouvrant cette page avec Netscape composer, vous verrez que le cadre prévu pour accueillir le panoramique a une taille déterminée, choisie par les concepteurs des hyperpaysages

Chaque séance durera deux heures, mais dans la seconde, seule la première heure sera consacrée à l'expérimentation décrite ici ; durant la seconde, les élèves travailleront sur

A time-varying respiratory elastance model is developed with a negative elastic component (E demand ), to describe the driving pressure generated during a patient initiated

The aim of this study was to assess, in three experimental fields representative of the various topoclimatological zones of Luxembourg, the impact of timing of fungicide

Attention to a relation ontology [...] refocuses security discourses to better reflect and appreciate three forms of interconnection that are not sufficiently attended to