M-estimation in Complex Modeling
Nabil Rachdi (PhD Student) [email protected]
I Colloque Jeunes Probabilistes et Statisticiens (3 → 7 of May – Mont Dore) J Advisors:
Jean-Claude Fort (Paris V), Thierry Klein (Toulouse III)
Fabien Mangeant (EADS IW), Régis Lebrun (EADS IW).
Outline
1 Motivations and notations
2 Model Selection
3 Risk excess bound
4 Theorical Results
5 Examples
General framework
Variable of Interest: Y, density f , probability measure Q unknown Feature: Density distribution, mean, threshold probability, quantile etc...
Experimental data: Y
nexp= Y
1exp, ..., Y
nexp(a priori training data) supposed i.i.d from Q - Link to history
- Arise from experiments, complex codes etc...
- Small number - Difficult to obtain
Simulation model: h ∈ H : (x, θ) ∈ X × Θ 7→ y = h(x, θ)
For a configuration x
0, the model gives y
0= h(x, θ
0) ⇒ Deterministic !!
Goal: Use Simulated data to improve feature estimation of Y :
⇒ 1. Estimation procedure: choice of the model h, and parameter θ
⇒ 2. Study of a feature based on ( Y b
1sim, ..., Y b
msim) (a posteriori training data)
to compare with the feature based on (Y
1exp, ..., Y
nexp)
General framework
Variable of Interest: Y, density f , probability measure Q unknown Feature: Density distribution, mean, threshold probability, quantile etc...
Experimental data: Y
nexp= Y
1exp, ..., Y
nexp(a priori training data) supposed i.i.d from Q - Link to history
- Arise from experiments, complex codes etc...
- Small number - Difficult to obtain
Simulated data: Consider that X ↔ (X ,B,P
X) i.e x (deterministic) ↔ X (random).
For m ≥ 0, let Y
msim= Y
1sim, ..., Y
msim(depends on the model h and the parameter θ) - h ∈ H (set of models), θ ∈ Θ (set of parameters)
- Y
isim= h(X
i,θ), i = 1, ..., m, X
ii.i.d random variables (distribution P
X).
- X t Y
expGoal: Use Simulated data to improve feature estimation of Y :
⇒ 1. Estimation procedure: choice of the model h, and parameter θ
⇒ 2. Study of a feature based on ( Y b
1sim, ..., Y b
msim) (a posteriori training data)
to compare with the feature based on (Y
1exp, ..., Y
nexp)
General framework
Variable of Interest: Y, density f , probability measure Q unknown Feature: Density distribution, mean, threshold probability, quantile etc...
Experimental data: Y
nexp= Y
1exp, ..., Y
nexp(a priori training data) supposed i.i.d from Q - Link to history
- Arise from experiments, complex codes etc...
- Small number - Difficult to obtain
Simulated data: Consider that X ↔ (X ,B,P
X) i.e x (deterministic) ↔ X (random).
For m ≥ 0, let Y
msim= Y
1sim, ..., Y
msim(depends on the model h and the parameter θ) - h ∈ H (set of models), θ ∈ Θ (set of parameters)
- Y
isim= h(X
i,θ), i = 1, ..., m, X
ii.i.d random variables (distribution P
X).
- X t Y
expGoal: Use Simulated data to improve feature estimation of Y :
⇒ 1. Estimation procedure: choice of the model h, and parameter θ
⇒ 2. Study of a feature based on ( Y b
1sim, ..., Y b
msim) (a posteriori training data)
to compare with the feature based on (Y
1exp, ..., Y
nexp)
Simulated data estimation
Illustration of Experimental & Simulated Data: Example of a 2D-Performance Y = (Y
1,Y
2)
Choice of h ∈ H and θ ∈ Θ ? ⇒ driven by the feature considered
Simulated data estimation
Illustration of Experimental & Simulated Data: Example of a 2D-Performance Y = (Y
1,Y
2)
Choice of h ∈ H and θ ∈ Θ ? ⇒ driven by the feature considered
Definition: Feature
Let P be the set of all probability measures, D a metric space, we define a feature as a function
ρ : P −→ D
µ 7−→ ρ(µ).
Some feature:
- ρ(µ) = E
W∼µ(W ) (mean), q
Wα(α-quantile)
⇒ D = R
- ρ(µ) = P(W > s) ⇒ D = [0, 1]
- ρ(µ) = f
µ⇒ D = {set of distributions}
- etc ...
Choice of (h, θ) for a Feature ρ
In our framework:
the model
P
X (h,θ)−→ P
h,θ featureρ−→ ρ(P
h,θ) := ρ
h(θ)
Important Remark: the feature ρ
h(θ) can be unreachable → we say that h is complex denote the "true" feature by ρ(Q) := ρ
∗For fix h ∈ H, choice of θ
θ
0= Argmin
Θ
D(ρ
h(θ), ρ
∗)
with D a measure of distance D × D.
We use the form:
M(h, θ) = Z
γ
h,θ(y) Q(dy)
where the function γ
h,θis called contrast of (h,θ).
We have
γ
h,θ= Ψ (ρ
h(θ))
with Ψ some operator on D.
Example of contrasts:
If the feature is the density
- γ
h,θ(y) = − ln (ρ
h(θ)) (y) ⇒ M(h,θ) = K(ρ
∗, ρ
h(θ)) - γ
h,θ(y) = ||ρ
h(θ)||
22− 2ρ
h(θ)(y ) ⇒ M(h, θ) = ||ρ
∗− ρ
h(θ)||
22. - etc...
The same for mean, quantile, threshold probability etc...
Criterion to minimize:
M(h, θ) = Z
γ
h,θ(y) Q(dy)
Assume that (h
∗, θ
∗) is the unique minimum of M(·, ·) Difficulties:
- The measure Q is unknown (Classical !)
- For complex models h, the feature ρ
h(θ) is unreachable so the contrast too
(γ
h,θ= Ψ (ρ
h(θ))).
M(h, θ) = Z
γ
h,θ(y ) Q(dy)
Alternative: Use of Experimental & Simulated data
- Replace Q by its empirical version Q
n(depends on n-Experimental data Y
nexp) We have
M
n(h, θ) = 1 n
n
X
i=1
γ
h,θ(Y
iexp)
- Since γ
h,θm= Ψ ρ
mh(θ)
, replace the feature ρ
h(θ) by an estimator ρ
mh(θ) (depends on m-Simulated data Y
msim).
B Pakes and Pollard (1989) studied estimators minimizing an approximation M e
n(h, θ) ∼ M
n(h, θ) (Optimization Estimators).
B They prove √
n-consistency and give limit theorems
B The conditions are given on the (approximated) criterion M e
n(h, θ)
I We want conditions on the simulation (quantification) and on the feature (regularity)
Criterion to minimize in practice:
M
n,m(h, θ) := 1 n
n
X
i=1
γ
h,θm(Y
iexp)
Estimator of θ
0(h) = Argmin
θ∈ΘM(h,θ) b θ
n,m(h) = Argmin
θ∈Θ
M
n,m(h,θ) .
First question: Consistency b θ
n,m(h) −→
n→+∞
m→+∞
θ
0(h) ?
Source of errors
Proposition: Model Error Upper Bound We prove
M(h, b θ
n,m(h)) − M(h
∗, θ
∗)
| {z }
risk excess of(h,θbn,m(h))
. 1
√ n || G
nγ
h,·m||
Θ+ ||E
hm||
Θ+ ∆
hVariance terms (random terms):
- 1
√ n || G
nγ
mh,·||
Θ= sup
θ∈Θ
1 n
n
X
i=1
γ
h,θm(Y
iexp) − E
Y(γ
mh,θ(Y ))
(deviation)
⇒ Estimation Error of Statistical Data (G
n:= √
n(Q
n− Q)) - ||E
hm||
Θ= sup
θ∈Θ
|| γ
h,θm− γ
h,θ||
1,Q, with ||g||
1,Q= R
R
|g(y )| Q (dy)
⇒ Simulation Error Bias term :
- ∆
h= M(h,θ
0(h)) − M(h
∗, θ
∗)
⇒ Approximation Error of the model h
Link with classical methods
Classical Inequality Recall M(h, θ) = R
γ
h,θ(y) Q (dy), we have
M(h, b θ
n(h)) − M(h
∗, θ
∗) ≤ 1
√ n || G
nγ
h,·||
Θ+ ∆
hFor no complex models: (Statistical models, simplified models etc...)
⇒ the feature ρ
h(θ) is reachable ⇒ Simulation is useless Advantages
- Well studied
- Maximum Likelihood, Regression etc...
Drawback
- ∆
hcan be large (due to simplification of h) Trade-off Bias-Variance
OUR GOAL: Extend the theory to physical models
Theorical result
Recall:
model h + parameter θ ⇒ Feature ρ
h(θ) ⇒ Contrast γ
h,θIn our framework:
model h + parameter θ ⇒ Feature ρ
mh(θ) ⇒ Contrast γ
h,θmInequality:
M(h, b θ
n,m(h)) − M(h
∗, θ
∗) . 1
√ n || G
nγ
h,·m||
Θ+ ||E
hm||
Θ+ ∆
hKey point: Let the class of functions
Γ
mh:= {γ
h,θm, θ ∈ Θ} = {Ψ(ρ
mh(θ)), θ ∈ Θ}
- the class is random (simulated data) - the class changes with m
Important Remark: || G
nγ
h,·m||
Θ= || G
n||
Γm hMain difficulty : Prove the tightness of the random variable sequence
|| G
n||
Γm hn,m
Definition: L
2(Q) Bracketing numbers and Entropy with bracketing
B Let G be a class of functions. Given two functions l, u, the bracket [l, u] is the set of all functions g with l ≤ g ≤ u. An ε-bracket is a bracket [l, u] with ||u − l||
2,Q< ε. The bracketing number N
[ ](ε, G, L
2(Q)) is the minimum number of -brackets needed to cover the class of functions G.
B The entropy with bracketing is the logarithm of the bracketing number.
Definition: L
2(Q) Bracketing integral The bracketing integral is defined as
J
[ ](δ, G, L
2(Q)) :=
Z
δ 0q
log N
[ ](ε, G, L
2(Q))dε .
References:
v d Vaart & Wellner (1996) Weak
Convergence and Empirical Processes
v d Vaart (1998) Asymptotics Statistics
S van de Geer (2000) Empirical
Processes in M-estimation
Particular classes
Recall that G
n:= √
n(Q
n− Q)
Definition: Glivenko-Cantelli classes
A class of function G is ( Q )-Glivenko-Cantelli if
√ 1
n || G
n||
G−→
p.s
0 .
Definition: Donsker classes
A class of function G is (Q)-Donsker if
G
nG in l
∞(G)
Theorems
1. If for all ε > 0, N
[ ](ε, G, L
1(Q)) < ∞ then G is Glivenko-Cantelli.
2. If J
[ ](1, G, L
2(Q)) < ∞ then G is Donsker.
The class Γ
mhis random and changes with m. We need more results.
Our Results
Theorical study of the Γ
mh-indexed empirical G
n= √
n(Q
n− Q) process
Consistency
We prove that b θ
n,m−→
n→+∞
m→+∞
θ
0, under some conditions in terms of :
- Model Complexity (Bracketing Numbers) (model h + feature regularity) - Simulation Speed (Size of Simulated Data set, m)
- Control of Simulated contrasts (Modified Lindeberg conditions)
Limiting Distribution
1. It exists r
n,m≥ 0 such that r
n,m(b θ
n,m− θ
0) = O
P(1)
2. Under some conditions on the model h and on the feature ρ, we can define m
n:= inf{m ≥ 0, r
n,m& n
12−εn}, where ε
n> 0 is a nonincreasing sequence.
3. Under some conditions,
b θ
n,m− θ
0N (0,Σ)
Example of the Range study
Phenomenon : Y = Range (distance an aircraft can travel), feature = density distribution A priori training data : Experimental data, n = 20, Y
nexp= Y
1exp, ..., Y
nexp(obtained from complex model h
∗supposed to be the "true") Additional knowledge : Simulated data, m = 3000, Y
msim= Y
1sim, ..., Y
msimfrom
h(X, θ) = F V C
s1 θ
1log 1
1 − θ
2- Uncertain Inputs X = (F, V ,C
s)
T- Parameters θ = (θ
1, θ
2) ∈ Θ
Choice of θ ? θ
0(h) = Argmin
θ∈Θ
Z
R
γ
h,θ(y ) f (y) dy
γ
h,θ= − ln(f
h,θ) f
h,θ↔ f
h,θm(Kernel) f ↔ 1 n
n
X
i=1
δ
Yexp ib θ
n,m(h) = Argmin
θ∈Θ
− 1 n
n
X
i=1
ln
1 m
m
X
j=1