Colloque Jeunes Probabilistes et Statisticiens (3-7 May, Mont Dore)

(1)

M-estimation in Complex Modeling

Nabil Rachdi (PhD Student)

Advisors:

Jean-Claude Fort (Paris V), Thierry Klein (Toulouse III)

Fabien Mangeant (EADS IW), Régis Lebrun (EADS IW).

(2)

Outline

1 Motivations and notations

2 Model Selection

3 Risk excess bound

4 Theoretical Results

5 Examples

(3)

General framework

Variable of interest: $Y$, with density $f$ and unknown probability measure $Q$

Feature: density, mean, threshold probability, quantile, etc.

Experimental data: $\mathbf{Y}_n^{exp} = (Y_1^{exp}, \dots, Y_n^{exp})$ (a priori training data), assumed i.i.d. from $Q$

- Link to history
- Arise from experiments, complex codes, etc.
- Small number
- Difficult to obtain

Simulation model: $h \in \mathcal{H} : (x, \theta) \in \mathcal{X} \times \Theta \mapsto y = h(x, \theta)$

For a configuration $x_0$, the model gives $y_0 = h(x_0, \theta_0)$ ⇒ deterministic!

Goal: use simulated data to improve the estimation of a feature of $Y$:

⇒ 1. Estimation procedure: choice of the model $h$ and of the parameter $\theta$

⇒ 2. Study of a feature based on $(\widehat{Y}_1^{sim}, \dots, \widehat{Y}_m^{sim})$ (a posteriori training data), to be compared with the feature based on $(Y_1^{exp}, \dots, Y_n^{exp})$

(4)

General framework

Simulated data: the input is now treated as random, i.e. the deterministic $x$ is replaced by a random variable $X$ on $(\mathcal{X}, \mathcal{B}, P_X)$.

For $m \ge 0$, let $\mathbf{Y}_m^{sim} = (Y_1^{sim}, \dots, Y_m^{sim})$ (depends on the model $h$ and the parameter $\theta$)

- $h \in \mathcal{H}$ (set of models), $\theta \in \Theta$ (set of parameters)
- $Y_i^{sim} = h(X_i, \theta)$, $i = 1, \dots, m$, with $X_i$ i.i.d. random variables (distribution $P_X$)
- $X \sqcup Y^{exp}$

(6)

Simulated data estimation

Illustration of experimental and simulated data: example of a 2D performance $Y = (Y^1, Y^2)$

Choice of $h \in \mathcal{H}$ and $\theta \in \Theta$? ⇒ driven by the feature considered

(8)

Definition: Feature

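Let $\mathcal{P}$ be the set of all probability measures and $\mathcal{D}$ a metric space. We define a feature as a function

$$\rho : \mathcal{P} \longrightarrow \mathcal{D}, \qquad \mu \longmapsto \rho(\mu).$$

Some features:

- $\rho(\mu) = E_{W \sim \mu}(W)$ (mean), $q_W^{\alpha}$ ($\alpha$-quantile) ⇒ $\mathcal{D} = \mathbb{R}$
- $\rho(\mu) = P(W > s)$ ⇒ $\mathcal{D} = [0, 1]$
- $\rho(\mu) = f_\mu$ (density of $\mu$) ⇒ $\mathcal{D}$ = {set of distributions}
- etc.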
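As a small illustration, the features listed above can be evaluated on the empirical measure of a sample. This is only a sketch: the function names and the toy sample are hypothetical and do not come from the slides.

```python
import numpy as np

def mean_feature(sample):
    """rho(mu) = E_{W~mu}(W), with mu the empirical measure of `sample`."""
    return float(np.mean(sample))

def quantile_feature(sample, alpha):
    """rho(mu) = alpha-quantile of mu (NumPy's default linear interpolation)."""
    return float(np.quantile(sample, alpha))

def threshold_feature(sample, s):
    """rho(mu) = P(W > s), estimated by the fraction of points above s."""
    return float(np.mean(np.asarray(sample) > s))

sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy sample, for illustration only
print(mean_feature(sample))            # 3.0
print(quantile_feature(sample, 0.5))   # 3.0
print(threshold_feature(sample, 3.5))  # 0.4
```

The first two features take values in $\mathbb{R}$ and the third in $[0, 1]$, matching the spaces $\mathcal{D}$ given above.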

(9)

Choice of (h, θ) for a Feature ρ

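In our framework, the model induces the chain

$$P_X \xrightarrow{\;(h,\theta)\;} P_{h,\theta} \xrightarrow{\;\text{feature } \rho\;} \rho(P_{h,\theta}) := \rho_h(\theta)$$

Important remark: the feature $\rho_h(\theta)$ can be unreachable → we say that $h$ is complex.

Denote the "true" feature by $\rho(Q) := \rho^*$.

For fixed $h \in \mathcal{H}$, choice of $\theta$:

$$\theta_0 = \operatorname*{Argmin}_{\theta \in \Theta} D(\rho_h(\theta), \rho^*)$$

with $D$ a measure of distance on $\mathcal{D} \times \mathcal{D}$.

We use the form

$$M(h, \theta) = \int \gamma_{h,\theta}(y)\, Q(dy)$$

where the function $\gamma_{h,\theta}$ is called the contrast of $(h, \theta)$.

We have $\gamma_{h,\theta} = \Psi(\rho_h(\theta))$, with $\Psi$ some operator on $\mathcal{D}$.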

(10)

Example of contrasts:

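If the feature is the density:

- $\gamma_{h,\theta}(y) = -\ln(\rho_h(\theta))(y)$ ⇒ $M(h, \theta) = K(\rho^*, \rho_h(\theta))$, the Kullback-Leibler divergence
- $\gamma_{h,\theta}(y) = \|\rho_h(\theta)\|_2^2 - 2\rho_h(\theta)(y)$ ⇒ $M(h, \theta) = \|\rho^* - \rho_h(\theta)\|_2^2$
- etc.

Both identities hold up to an additive term that does not depend on $(h, \theta)$.

The same can be done for the mean, quantile, threshold probability, etc.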
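The link between each contrast and its distance can be checked by expanding the integral. The computation below is a sketch that assumes $\rho^*$ is the density of $Q$ and that the densities involved are square integrable (assumptions not stated on the slide):

```latex
\begin{aligned}
\int -\ln\big(\rho_h(\theta)(y)\big)\, \rho^*(y)\, dy
  &= K(\rho^*, \rho_h(\theta)) - \int \rho^*(y) \ln \rho^*(y)\, dy, \\
\int \Big( \|\rho_h(\theta)\|_2^2 - 2\rho_h(\theta)(y) \Big) \rho^*(y)\, dy
  &= \|\rho_h(\theta)\|_2^2 - 2 \int \rho_h(\theta)(y)\, \rho^*(y)\, dy \\
  &= \|\rho^* - \rho_h(\theta)\|_2^2 - \|\rho^*\|_2^2 .
\end{aligned}
```

In both cases the remaining term depends only on $\rho^*$, so minimizing $M(h, \theta)$ over $(h, \theta)$ is equivalent to minimizing the corresponding distance.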

Criterion to minimize:

$$M(h, \theta) = \int \gamma_{h,\theta}(y)\, Q(dy)$$

Assume that $(h^*, \theta^*)$ is the unique minimizer of $M(\cdot, \cdot)$.

Difficulties:

- The measure $Q$ is unknown (classical!)
- For complex models $h$, the feature $\rho_h(\theta)$ is unreachable, and so is the contrast ($\gamma_{h,\theta} = \Psi(\rho_h(\theta))$).

(11)

$$M(h, \theta) = \int \gamma_{h,\theta}(y)\, Q(dy)$$

Alternative: use of experimental and simulated data

- Replace $Q$ by its empirical version $Q_n$ (depends on the $n$ experimental data $\mathbf{Y}_n^{exp}$). We have

$$M_n(h, \theta) = \frac{1}{n} \sum_{i=1}^{n} \gamma_{h,\theta}(Y_i^{exp})$$

- Replace the feature $\rho_h(\theta)$ by an estimator $\rho_h^m(\theta)$ (depends on the $m$ simulated data $\mathbf{Y}_m^{sim}$), which gives the simulated contrast $\gamma_{h,\theta}^m = \Psi(\rho_h^m(\theta))$.

Related work:

- Pakes and Pollard (1989) studied estimators minimizing an approximation $\widetilde{M}_n(h, \theta) \sim M_n(h, \theta)$ (optimization estimators).
- They prove $\sqrt{n}$-consistency and give limit theorems.
- The conditions are given on the (approximated) criterion $\widetilde{M}_n(h, \theta)$.
- In contrast, we want conditions on the simulation (quantification) and on the feature (regularity).

(12)

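Criterion to minimize in practice:

$$M_{n,m}(h, \theta) := \frac{1}{n} \sum_{i=1}^{n} \gamma_{h,\theta}^m(Y_i^{exp})$$

Estimator of $\theta_0(h) = \operatorname*{Argmin}_{\theta \in \Theta} M(h, \theta)$:

$$\widehat{\theta}_{n,m}(h) = \operatorname*{Argmin}_{\theta \in \Theta} M_{n,m}(h, \theta).$$

First question: consistency. Does $\widehat{\theta}_{n,m}(h) \longrightarrow \theta_0(h)$ as $n \to +\infty$, $m \to +\infty$?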
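The estimation procedure can be sketched on a toy case. Everything specific below is an assumption made for illustration and not taken from the slides: the model `h(x, theta) = theta * x`, the mean as feature with squared-error contrast, the uniform input distribution, and a grid search standing in for the minimization over $\Theta$.

```python
import numpy as np

def h(x, theta):
    """Hypothetical toy simulation model (not from the slides)."""
    return theta * x

def rho_m(theta, x_sim):
    """Estimated feature rho_h^m(theta): mean of the m simulated outputs."""
    return np.mean(h(x_sim, theta))

def M_nm(theta, y_exp, x_sim):
    """Empirical criterion M_{n,m}(h, theta) with the squared-error contrast
    gamma^m_{h,theta}(y) = (y - rho_h^m(theta))**2 (mean feature)."""
    return np.mean((y_exp - rho_m(theta, x_sim)) ** 2)

def theta_hat(y_exp, x_sim, grid):
    """Argmin of M_{n,m}(h, .) over a finite grid standing in for Theta."""
    values = [M_nm(t, y_exp, x_sim) for t in grid]
    return float(grid[int(np.argmin(values))])

rng = np.random.default_rng(0)
x_sim = rng.uniform(0.5, 1.5, size=3000)      # m simulated inputs X_i ~ P_X
y_exp = 2.0 * rng.uniform(0.5, 1.5, size=20)  # n synthetic "experimental" data (toy truth: theta = 2)
grid = np.linspace(0.0, 4.0, 401)
print(theta_hat(y_exp, x_sim, grid))
```

The same simulated inputs are reused for every candidate $\theta$, which keeps $M_{n,m}(h, \cdot)$ a deterministic function of $\theta$ once the data are drawn.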

(13)

Source of errors

Proposition: Model Error Upper Bound

We prove

$$\underbrace{M(h, \widehat{\theta}_{n,m}(h)) - M(h^*, \theta^*)}_{\text{risk excess of } (h,\, \widehat{\theta}_{n,m}(h))} \;\lesssim\; \frac{1}{\sqrt{n}} \|G_n \gamma_{h,\cdot}^m\|_\Theta + \|E_h^m\|_\Theta + \Delta_h$$

Variance terms (random terms):

- $\frac{1}{\sqrt{n}} \|G_n \gamma_{h,\cdot}^m\|_\Theta = \sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^{n} \gamma_{h,\theta}^m(Y_i^{exp}) - E_Y\big(\gamma_{h,\theta}^m(Y)\big) \right|$ (deviation) ⇒ estimation error of statistical data ($G_n := \sqrt{n}(Q_n - Q)$)
- $\|E_h^m\|_\Theta = \sup_{\theta \in \Theta} \|\gamma_{h,\theta}^m - \gamma_{h,\theta}\|_{1,Q}$, with $\|g\|_{1,Q} = \int_{\mathbb{R}} |g(y)|\, Q(dy)$ ⇒ simulation error

Bias term:

- $\Delta_h = M(h, \theta_0(h)) - M(h^*, \theta^*)$ ⇒ approximation error of the model $h$
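The three terms can be traced back to a standard decomposition. The lines below sketch the usual argument and are not necessarily the proof used by the authors. The notation $M^m(h, \theta) := \int \gamma_{h,\theta}^m(y)\, Q(dy)$ is introduced here only for illustration, and $M_{n,m}$ is the practical criterion defined earlier.

```latex
\begin{aligned}
M(h,\widehat{\theta}_{n,m}(h)) - M(h^*,\theta^*)
  &= \big[ M(h,\widehat{\theta}_{n,m}(h)) - M(h,\theta_0(h)) \big] + \Delta_h, \\
M(h,\widehat{\theta}_{n,m}(h)) - M(h,\theta_0(h))
  &\le 2 \sup_{\theta\in\Theta} \big| M_{n,m}(h,\theta) - M(h,\theta) \big|, \\
\sup_{\theta\in\Theta} \big| M_{n,m}(h,\theta) - M(h,\theta) \big|
  &\le \sup_{\theta\in\Theta} \big| M_{n,m}(h,\theta) - M^m(h,\theta) \big|
     + \sup_{\theta\in\Theta} \big| M^m(h,\theta) - M(h,\theta) \big| \\
  &\le \frac{1}{\sqrt{n}} \|G_n \gamma^m_{h,\cdot}\|_\Theta + \|E^m_h\|_\Theta .
\end{aligned}
```

The second line uses $M_{n,m}(h, \widehat{\theta}_{n,m}(h)) \le M_{n,m}(h, \theta_0(h))$, which holds because $\widehat{\theta}_{n,m}(h)$ minimizes $M_{n,m}(h, \cdot)$. The factor 2 is absorbed in the symbol $\lesssim$.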

(14)

Link with classical methods

Classical inequality: recall that $M(h, \theta) = \int \gamma_{h,\theta}(y)\, Q(dy)$. We have

$$M(h, \widehat{\theta}_n(h)) - M(h^*, \theta^*) \le \frac{1}{\sqrt{n}} \|G_n \gamma_{h,\cdot}\|_\Theta + \Delta_h$$

For non-complex models (statistical models, simplified models, etc.):

⇒ the feature $\rho_h(\theta)$ is reachable ⇒ simulation is useless

Advantages

- Well studied
- Maximum likelihood, regression, etc.

Drawback

- $\Delta_h$ can be large (due to simplification of $h$)

Bias-variance trade-off

OUR GOAL: extend the theory to physical models

(15)

Theoretical result

Recall:

model $h$ + parameter $\theta$ ⇒ feature $\rho_h(\theta)$ ⇒ contrast $\gamma_{h,\theta}$

In our framework:

model $h$ + parameter $\theta$ ⇒ feature $\rho_h^m(\theta)$ ⇒ contrast $\gamma_{h,\theta}^m$

Inequality:

$$M(h, \widehat{\theta}_{n,m}(h)) - M(h^*, \theta^*) \;\lesssim\; \frac{1}{\sqrt{n}} \|G_n \gamma_{h,\cdot}^m\|_\Theta + \|E_h^m\|_\Theta + \Delta_h$$

Key point: consider the class of functions

$$\Gamma_h^m := \{\gamma_{h,\theta}^m,\ \theta \in \Theta\} = \{\Psi(\rho_h^m(\theta)),\ \theta \in \Theta\}$$

- the class is random (simulated data)
- the class changes with $m$

Important remark: $\|G_n \gamma_{h,\cdot}^m\|_\Theta = \|G_n\|_{\Gamma_h^m}$

Main difficulty: prove the tightness of the sequence of random variables $\big(\|G_n\|_{\Gamma_h^m}\big)_{n,m}$

(16)

Definition: $L_2(Q)$ bracketing numbers and entropy with bracketing

- Let $\mathcal{G}$ be a class of functions. Given two functions $l, u$, the bracket $[l, u]$ is the set of all functions $g$ with $l \le g \le u$. An $\varepsilon$-bracket is a bracket $[l, u]$ with $\|u - l\|_{2,Q} < \varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, \mathcal{G}, L_2(Q))$ is the minimum number of $\varepsilon$-brackets needed to cover the class of functions $\mathcal{G}$.
- The entropy with bracketing is the logarithm of the bracketing number.

Definition: $L_2(Q)$ bracketing integral

The bracketing integral is defined as

$$J_{[\,]}(\delta, \mathcal{G}, L_2(Q)) := \int_0^{\delta} \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{G}, L_2(Q))}\; d\varepsilon .$$

References:

- van der Vaart & Wellner (1996), Weak Convergence and Empirical Processes
- van der Vaart (1998), Asymptotic Statistics
- van de Geer (2000), Empirical Processes in M-estimation

(17)

Particular classes

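Recall that $G_n := \sqrt{n}(Q_n - Q)$.

Definition: Glivenko-Cantelli classes

A class of functions $\mathcal{G}$ is $Q$-Glivenko-Cantelli if

$$\frac{1}{\sqrt{n}} \|G_n\|_{\mathcal{G}} \xrightarrow{\;a.s.\;} 0 .$$

Definition: Donsker classes

A class of functions $\mathcal{G}$ is $Q$-Donsker if

$$G_n \rightsquigarrow G \quad \text{in } \ell^\infty(\mathcal{G})$$

(weak convergence to a tight limit process $G$).

Theorems

1. If for all $\varepsilon > 0$, $N_{[\,]}(\varepsilon, \mathcal{G}, L_1(Q)) < \infty$, then $\mathcal{G}$ is Glivenko-Cantelli.
2. If $J_{[\,]}(1, \mathcal{G}, L_2(Q)) < \infty$, then $\mathcal{G}$ is Donsker.

The class $\Gamma_h^m$ is random and changes with $m$. We need more results.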
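The Glivenko-Cantelli property can be illustrated numerically on a fixed (non-random) class. The sketch below is an illustration chosen here, not an example from the slides: it takes $Q$ = Uniform(0, 1) and the class of half-line indicators $y \mapsto 1\{y \le t\}$, a classical Glivenko-Cantelli class, for which $\frac{1}{\sqrt{n}} \|G_n\|_{\mathcal{G}}$ is the sup distance between the empirical and the true distribution functions.

```python
import numpy as np

def sup_deviation(sample):
    """sup_t |Q_n(-inf, t] - Q(-inf, t]| for Q = Uniform(0, 1),
    i.e. (1/sqrt(n)) * ||G_n||_G over the class of indicators 1{y <= t}."""
    u = np.sort(sample)
    n = len(u)
    upper = np.arange(1, n + 1) / n - u  # F_n(u_i) - F(u_i)
    lower = u - np.arange(0, n) / n      # F(u_i) - F_n just before u_i
    return float(max(upper.max(), lower.max()))

rng = np.random.default_rng(0)
for n in (20, 200, 2000, 20000):
    # The printed deviation should shrink as n grows.
    print(n, sup_deviation(rng.uniform(size=n)))
```

For the random class $\Gamma_h^m$ this simple picture is not enough, since the indexing class itself changes with the simulated data.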

(18)

Our Results

Theoretical study of the $\Gamma_h^m$-indexed empirical process $G_n = \sqrt{n}(Q_n - Q)$

Consistency

We prove that $\widehat{\theta}_{n,m} \longrightarrow \theta_0$ as $n \to +\infty$, $m \to +\infty$, under some conditions in terms of:

- Model complexity (bracketing numbers) (model $h$ + feature regularity)
- Simulation speed (size of the simulated data set, $m$)
- Control of simulated contrasts (modified Lindeberg conditions)

Limiting distribution

1. There exists $r_{n,m} \ge 0$ such that $r_{n,m}(\widehat{\theta}_{n,m} - \theta_0) = O_P(1)$.

2. Under some conditions on the model $h$ and on the feature $\rho$, we can define $m_n := \inf\{m \ge 0,\ r_{n,m} \gtrsim n^{\frac{1}{2} - \varepsilon_n}\}$, where $\varepsilon_n > 0$ is a nonincreasing sequence.

3. Under some conditions, $\widehat{\theta}_{n,m} - \theta_0 \rightsquigarrow N(0, \Sigma)$.

(19)

Example of the Range study

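Phenomenon: $Y$ = range (distance an aircraft can travel), feature = density

A priori training data: experimental data, $n = 20$, $\mathbf{Y}_n^{exp} = (Y_1^{exp}, \dots, Y_n^{exp})$ (obtained from a complex model $h^*$ supposed to be the "true" one)

Additional knowledge: simulated data, $m = 3000$, $\mathbf{Y}_m^{sim} = (Y_1^{sim}, \dots, Y_m^{sim})$ from

$$h(X, \theta) = \frac{F\, V}{C_s}\, \frac{1}{\theta_1}\, \log\!\left(\frac{1}{1 - \theta_2}\right)$$

- Uncertain inputs $X = (F, V, C_s)^T$
- Parameters $\theta = (\theta_1, \theta_2) \in \Theta$

Choice of $\theta$?

$$\theta_0(h) = \operatorname*{Argmin}_{\theta \in \Theta} \int_{\mathbb{R}} \gamma_{h,\theta}(y)\, f(y)\, dy, \qquad \gamma_{h,\theta} = -\ln(f_{h,\theta})$$

In practice, $f_{h,\theta}$ is replaced by $f_{h,\theta}^m$ (kernel estimator) and $f$ by the empirical measure $\frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i^{exp}}$:

$$\widehat{\theta}_{n,m}(h) = \operatorname*{Argmin}_{\theta \in \Theta}\; -\frac{1}{n} \sum_{i=1}^{n} \ln\!\left( \frac{1}{m} \sum_{j=1}^{m} K_{h_m}\big(Y_i^{exp} - Y_j^{sim}\big) \right)$$

(here $h_m$ in $K_{h_m}$ denotes the kernel bandwidth, not the model $h$).

A posteriori training data: $(Y_1^{exp}, \dots, Y_n^{exp}, \widehat{Y}_1^{sim}, \dots, \widehat{Y}_m^{sim})$ → $n + m = 3020$!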
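The kernel-based estimator can be sketched as follows. This is a minimal illustration under loudly stated assumptions, none of which come from the slides: the input distributions, the Gaussian kernel, the rule-of-thumb bandwidth, the value 9.81 for $\theta_1$, the synthetic "experimental" data and the grid search are all hypothetical. For brevity $\theta_1$ is held fixed and only $\theta_2$ is searched.

```python
import numpy as np

def h(x, theta):
    """Range model as written on the slide:
    (F * V / C_s) * (1 / theta1) * log(1 / (1 - theta2))."""
    F, V, Cs = x
    theta1, theta2 = theta
    return F * V / Cs / theta1 * np.log(1.0 / (1.0 - theta2))

def sample_inputs(m, rng):
    """Hypothetical input distribution P_X (illustrative values only)."""
    F = rng.uniform(18.0, 20.0, size=m)
    V = rng.uniform(220.0, 240.0, size=m)
    Cs = rng.uniform(0.9, 1.1, size=m)
    return F, V, Cs

def neg_log_lik(theta, y_exp, x_sim):
    """M_{n,m}(h, theta): minus the mean log of a Gaussian kernel density
    estimate built on simulated outputs, evaluated at the experimental data."""
    y_sim = h(x_sim, theta)
    m = y_sim.size
    bw = 1.06 * np.std(y_sim) * m ** (-0.2)        # rule-of-thumb bandwidth (assumption)
    z = (y_exp[:, None] - y_sim[None, :]) / bw
    dens = np.exp(-0.5 * z ** 2).sum(axis=1) / (m * bw * np.sqrt(2.0 * np.pi))
    return float(-np.mean(np.log(dens + 1e-300)))  # small constant guards against log(0)

rng = np.random.default_rng(0)
x_sim = sample_inputs(3000, rng)                   # m = 3000 simulated inputs
theta_true = (9.81, 0.30)                          # toy "truth" used to fake the experiment
y_exp = h(sample_inputs(20, rng), theta_true)      # n = 20 synthetic experimental data

grid = np.linspace(0.05, 0.60, 56)                 # candidate values of theta2
scores = [neg_log_lik((9.81, t2), y_exp, x_sim) for t2 in grid]
theta2_hat = float(grid[int(np.argmin(scores))])
print(theta2_hat)
```

Once $\widehat{\theta}_{n,m}(h)$ is chosen, the simulated outputs $h(X_j, \widehat{\theta}_{n,m}(h))$ play the role of the additional $\widehat{Y}_j^{sim}$ in the a posteriori training data.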

(20)

Feature with a priori (Experimental) and a posteriori (Experimental + Simulated) training data

(23)

Work in progress

Nonparametric estimation improvement:

- Let $\varphi$ be a nonparametric estimator of the feature $\rho$:

$$\varphi : \{\text{set of samples}\} \longrightarrow \mathcal{D}$$