HAL Id: hal-00802394
https://hal.archives-ouvertes.fr/hal-00802394
Submitted on 19 Mar 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Preliminary control variates to improve empirical regression methods
Tarik Benzineb, Emmanuel Gobet
To cite this version:
Tarik Benzineb, Emmanuel Gobet. Preliminary control variates to improve empirical regression methods. Monte Carlo Methods and Applications, De Gruyter, 2013, 19 (4), pp. 331–354. hal-00802394
Preliminary control variates to improve empirical regression methods
Tarik Ben Zineb and Emmanuel Gobet
Abstract. We design a variance reduction method to reduce the estimation error in regression problems. It is based on an appropriate use of other known regression functions. Theoretical estimates support this improvement and numerical experiments illustrate the efficiency of the method.
Keywords. Empirical regression, variance reduction, distribution-free theory.
2010 Mathematics Subject Classification. 62G08, 65Cxx.
1 Introduction
Empirical regression methods to compute $\mathbb{E}(H \mid X = x)$ are widely used in Monte-Carlo algorithms to solve optimal stopping problems [4, 6, 12], Backward Stochastic Differential Equations [7, 8, 11] and various stochastic control problems [1, 2]: these algorithms are often referred to as Least Squares Monte-Carlo algorithms.
However, in some situations the number of simulations is constrained to be relatively small (due to restrictions on the computational time, see [3]). That case may cause significant inaccuracy in the least squares regression method, because the estimation error (also called statistical error) can be dominant, in particular if the conditional variance $\operatorname{Var}(H \mid X)$ is large (see [9, Chapter 11] for details, or Theorem 2.1 below). The purpose of this work is to design a flexible method to significantly reduce it.
Depending on the use of empirical regression methods, the framework can be quite different. In the statistics field, with applications to inference, data are rather given to the experimenter and, in general, he cannot rely on extra information about their distribution. In the probability field related to Monte-Carlo algorithms, the situation is different: the experimenter simulates data with possibly additional information. Here we consider this second framework by assuming the knowledge
This work was done while the first author was preparing a CIFRE PhD Thesis at AXA and Ecole Polytechnique. The first author's research is supported by AXA Fund Research and ANRT. The second author's research is part of the Chair Financial Risks of the Risk Foundation, the Chair Derivatives of the Future and the Chair Finance and Sustainable Development.
of explicit regression functions that are going to be used to optimally reduce the estimation error of the empirical regression methods (see Theorems 2.2 and 2.3 below). These explicit regression functions are going to serve as control variates (called Preliminary Control Variates, or PCV for short), which results in a two-stage algorithm presented in Section 2.3: first, an $L^2$-projection of the response $H$ using the PCV; second, a linear least squares regression applied to the residual.
Our aim here is to show, numerically as well as mathematically, how, in a context of few simulations, this reduces the estimation error and hence achieves the optimal bound (approximation error).
This paper is structured as follows. In the next section, we review an existing result on the error convergence of the standard linear least squares regression estimate; then we introduce the PCV method and state our main result about the global error estimates. The proofs are done in Section 3, where we recall the necessary tools of the Vapnik-Chervonenkis theory. Section 4 gathers numerical tests (in dimension 1 and 2 for $X$) showing the efficiency of the method. In some cases, the same accuracy is obtained using 50 times fewer simulations. We refer the reader to [3] for additional experiments.
2 Statement of the problem and main results
2.1 Setting
Our goal is to approximate
$$m(x) = \mathbb{E}[H \mid X = x] \quad \text{with } H = h(Z),$$
where $Z$ and $X$ are two random variables taking values respectively in $\mathbb{R}^{d_z}$ and $\mathbb{R}^{d_x}$ ($1 \le d_z, d_x < +\infty$) and $h : \mathbb{R}^{d_z} \to \mathbb{R}$ is a known function. To evaluate $m(\cdot)$, we make use of a data sample $D_N := (Z_i, X_i)_{1 \le i \le N}$ which consists of i.i.d. simulations of $(Z, X)$. We set $H_i = h(Z_i)$.
2.2 Standard linear least squares regression method

Our presentation is inspired from [9].
• Let $\mu_N$ be the empirical measure associated to the sample $D_N$:
$$\mu_N(dz, dx) := \frac{1}{N} \sum_{i=1}^N \delta_{(Z_i, X_i)}(dz, dx).$$
• The $L^2$-norm of a function $g$, measured with respect to $\mu_N$, is denoted by $\|g\|_{L^2(\mu_N)}$:
$$\|g\|^2_{L^2(\mu_N)} := \frac{1}{N} \sum_{i=1}^N g^2(Z_i, X_i).$$
• Similarly, by denoting $\mu$ the law of $(Z, X)$, we define $\|g\|_{L^2(\mu)}$ by
$$\|g\|^2_{L^2(\mu)} := \int g^2(z, x)\, \mu(dz, dx).$$
• The unknown regression function $m(\cdot) : \mathbb{R}^{d_x} \to \mathbb{R}$ is approximated within the linear vector space
$$F_N = \operatorname{Span}(\Phi_k : 1 \le k \le K_F),$$
with $K_F \in \mathbb{N}^*$ and where $\Phi_k : \mathbb{R}^{d_x} \to \mathbb{R}$ may depend on $(X_1, \dots, X_N)$ (see for instance the examples of data-driven basis functions in [4]).
The standard linear least squares regression method approximates¹ $m$ by $m_N = \sum_{k=1}^{K_F} \widetilde{\gamma}_k \Phi_k$ where
$$(\widetilde{\gamma}_k)_{1 \le k \le K_F} = \operatorname{arginf}_{(\gamma_k)_k} \frac{1}{N} \sum_{i=1}^N \Big( H_i - \sum_{k=1}^{K_F} \gamma_k \Phi_k(X_i) \Big)^2.$$
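As a concrete illustration of this step, here is a minimal Python sketch (our own toy model and variable names, not the authors' code). `np.linalg.lstsq` performs an SVD-based solve and returns the minimal-norm coefficient vector when the minimizer is not unique:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X ~ N(0,1) and H = X**2 + noise, so that m(x) = E[H | X = x] = x**2.
N = 5000
X = rng.standard_normal(N)
H = X**2 + rng.standard_normal(N)

# Basis functions Phi_k(x) = x**k for k = 0..3 (K_F = 4).
K_F = 4
Phi = np.vander(X, K_F, increasing=True)   # N x K_F design matrix

# SVD-based least squares; np.linalg.lstsq returns the minimal-norm solution.
gamma, *_ = np.linalg.lstsq(Phi, H, rcond=None)
m_N = Phi @ gamma                          # m_N evaluated at the sample points
print(gamma)                               # approximately [0, 0, 1, 0]
```

With $N = 5000$ simulations the fitted coefficients recover the true conditional mean $x^2$ up to statistical error.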
The following result (see [9, Theorem 11.1]) provides a standard control on the convergence rate of $m_N$. For more recent results, see [10].

Theorem 2.1. Assume $\Sigma^2 = \sup_{x \in \mathbb{R}^{d_x}} \operatorname{Var}(H \mid X = x) < +\infty$. Then
$$\mathbb{E}\Big[ \| m_N - m \|^2_{L^2(\mu_N)} \Big] \le \Sigma^2 \frac{K_F}{N} + \mathbb{E}\Big[ \inf_{f \in F_N} \| f - m \|^2_{L^2(\mu_N)} \Big].$$
The above global error reads as the usual bias-variance decomposition, consisting of a sum of two terms.
• The first term is the estimation error: it is due to the finite number of simulations.
• The second term is the approximation error: it measures how well the regression function $m$ can be approximated by functions of $F_N$.
¹ The minimizing function $m_N$ is unique in $L^2(\mu_N)$ but there may be multiple minimizing coefficients: in that case, we choose for $\widetilde{\gamma}$ the SVD-optimal one, corresponding to that with minimal norm. This choice does not affect the empirical regression function $m_N$, see [8] for details.
• The larger $K_F$, the smaller the approximation error but the larger the estimation error: hence, $K_F$ and $N$ have to be tuned jointly to achieve optimal convergence rates [13, 15].
Numerical experiments in [3] show that the upper bound in Theorem 2.1 is quite tight. But in a context of few simulations, or when the variance $\sup_{x \in \mathbb{R}^{d_x}} \operatorname{Var}(H \mid X = x)$ is large, the estimation error is dominant, making the global error far from achieving the approximation error. Hence, in such a context, designing a new regression method which accelerates the convergence of the global error to the approximation error becomes an essential concern: this is the subject of the next subsection, where we present the PCV method and the related results.
2.3 PCV least squares regression method

Heuristics
Suppose that the available additional information is the knowledge of $K_{pcv}$ regression functions $x \mapsto \mathbb{E}[P_k(Z) \mid X = x]$ for some known functions $P_k : \mathbb{R}^{d_z} \to \mathbb{R}$ ($1 \le K_{pcv} < +\infty$). There is no loss of generality² in assuming that they are conditionally centered:
$$\forall\, 1 \le k \le K_{pcv} : \quad \mathbb{E}[P_k(Z) \mid X] = 0.$$
Based on this extra information and on the same sample data $D_N$, we wish to reduce the estimation error term $\Sigma^2 \frac{K_F}{N}$ in the previous theorem, that is, to improve the factor $\sup_{x \in \mathbb{R}^{d_x}} \operatorname{Var}(H \mid X = x)$.
We expect such an improvement by replacing $H_i$ by $H_i - \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k(Z_i)$, where $(\alpha^\star_k)_{1 \le k \le K_{pcv}}$ are the optimal coefficients
$$(\alpha^\star_k)_{1 \le k \le K_{pcv}} = \operatorname{arginf}_{\alpha \in \mathcal{A}}\, \mathbb{E}\Big[ \Big( H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z) \Big)^2 \Big] \qquad (1)$$
(the set $\mathcal{A}$ of admissible parameters is defined later on). Indeed, as $\mathbb{E}[P_k(Z) \mid X] = 0$ for any $k$, observe that $\alpha^\star$ is the minimizer
$$\operatorname{arginf}_{\alpha \in \mathcal{A}}\, \operatorname{Var}\Big( H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z) \Big) = \operatorname{arginf}_{\alpha \in \mathcal{A}}\, \mathbb{E}\Big[ \operatorname{Var}\Big( H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z) \,\Big|\, X \Big) \Big], \qquad (2)$$
² Indeed, we can rewrite the regression problem using extended variables $(\widehat{Z}, \widehat{X}, \widehat{H}, \widehat{m})$ where $\widehat{Z} := (X, Z)$, $\widehat{X} := X$, $\widehat{H} := h(Z)$, $\widehat{m}(\widehat{X}) := \mathbb{E}(\widehat{H} \mid \widehat{X}) = m(X)$ and $\widehat{P}_k(\widehat{Z}) := P_k(Z) - \mathbb{E}[P_k(Z) \mid X]$, which satisfies $\mathbb{E}(\widehat{P}_k(\widehat{Z}) \mid \widehat{X}) = 0$. See also further examples.
which might be close to minimizing $\sup_{x \in \mathbb{R}^{d_x}} \operatorname{Var}\big( H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z) \mid X = x \big)$ over $\alpha$. Actually, to achieve a fully implementable scheme, the minimization w.r.t. the $L^2(\mu)$-norm in (1) is going to be replaced by a minimization over the sample data $D_N$.
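The heuristic above can be checked on a toy example. The sketch below (illustrative names; the constraint set $\mathcal{A}$ is ignored and plain least squares is used) computes the empirical analogue of (1) with a single conditionally centered control variate; for this Gaussian toy model one can check that the optimal coefficient is $e^{-1} \approx 0.37$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: X = W_1 and H = sin(W_2) for a Brownian motion observed at
# times 1 and 2, so W_2 = W_1 + G with G ~ N(0,1) independent of W_1.
N = 20000
W1 = rng.standard_normal(N)
W2 = W1 + rng.standard_normal(N)
H = np.sin(W2)

# One conditionally centered control variate: P_1(Z) = W_2 - W_1,
# which satisfies E[P_1(Z) | W_1] = 0 (independent centered increment).
P = (W2 - W1).reshape(-1, 1)

# Empirical counterpart of (1): least squares of H on the columns of P.
alpha, *_ = np.linalg.lstsq(P, H, rcond=None)

H_alpha = H - P @ alpha        # variance-reduced responses
print(np.var(H), np.var(H_alpha))
```

The printed variance of the corrected responses is smaller than that of the raw ones, which is exactly the effect exploited by the PCV method.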
Examples of known regression functions
Our examples are inspired by applications related to a given stochastic process $(Y_s)_{0 \le s \le T}$, that is, computing $\mathbb{E}[h(Y_T) \mid Y_t]$.
We first consider control variates with local support.
a) Consider a standard Brownian motion $W$; if $h$ were well approximated by piecewise constant functions, one might take advantage of the control variates
$$P_k(Z) := P_k(W_t, W_T) := p^0_{x_k, \Delta}(W_T) - \mathbb{E}\big[ p^0_{x_k, \Delta}(W_T) \mid W_t \big],$$
with $p^0_{x_k, \Delta} := \mathbf{1}_{]x_k - \Delta,\, x_k + \Delta]}$ and $x_k := -2\sqrt{T} + (k - \tfrac{1}{2}) \frac{4\sqrt{T}}{K_{pcv}}$, $\Delta := \frac{2\sqrt{T}}{K_{pcv}}$, so that the full support of $(p^0_{x_k, \Delta})_k$ is $]-2\sqrt{T}, 2\sqrt{T}]$. Besides, one has
$$\mathbb{E}\big[ p^0_{x_k, \Delta}(W_T) \mid W_t = x \big] = \mathcal{N}\Big( \frac{x_k + \Delta - x}{\sqrt{T-t}} \Big) - \mathcal{N}\Big( \frac{x_k - \Delta - x}{\sqrt{T-t}} \Big)$$
and $X = W_t$, $Z = (W_t, W_T)$.
b) Now suppose that $h$ could be well approximated by hat functions. Define
$$p^1_{x_k, \Delta}(x) := \Big( 1 - \Big| \frac{x - x_k}{\Delta} \Big| \Big)_+$$
where $x_k := -2\sqrt{T} + k\Delta$, $\Delta := \frac{4\sqrt{T}}{K_{pcv} + 1}$, and set $P_k(Z) := P_k(W_t, W_T) := p^1_{x_k, \Delta}(W_T) - \mathbb{E}\big[ p^1_{x_k, \Delta}(W_T) \mid W_t \big]$, where
$$\mathbb{E}\big[ p^1_{x_k, \Delta}(W_T) \mid W_t = x \big] = \frac{x - x_{k-1}}{\Delta} \Big[ \mathcal{N}\Big( \frac{x_k - x}{\sqrt{T-t}} \Big) - \mathcal{N}\Big( \frac{x_{k-1} - x}{\sqrt{T-t}} \Big) \Big] + \frac{x_{k+1} - x}{\Delta} \Big[ \mathcal{N}\Big( \frac{x_{k+1} - x}{\sqrt{T-t}} \Big) - \mathcal{N}\Big( \frac{x_k - x}{\sqrt{T-t}} \Big) \Big] + \frac{\sqrt{T-t}}{\Delta \sqrt{2\pi}} \Big[ e^{-\frac{(x_{k-1} - x)^2}{2(T-t)}} - 2 e^{-\frac{(x_k - x)^2}{2(T-t)}} + e^{-\frac{(x_{k+1} - x)^2}{2(T-t)}} \Big].$$
c) The extension of the previous example to a $d$-dimensional Brownian motion $W = (W^1, \dots, W^d)$ is directly obtained by tensorization, leading to $K_{pcv} = (K_{pcv}^{(d=1)})^d$ control variates:
$$P_k(Z) := P_{k_1, \dots, k_d}(W_t, W_T) := \prod_{i=1}^d p^1_{x_{k_i}, \Delta}(W^i_T) - \prod_{i=1}^d \mathbb{E}\big[ p^1_{x_{k_i}, \Delta}(W^i_T) \mid W_t \big],$$
where $1 \le k_i \le K_{pcv}^{(d=1)}$, $1 \le i \le d$.
We now consider control variates with full support.
d) Associated to a standard Brownian motion $(W_s)_{0 \le s \le T}$, put $X = W_t$, $Z = (W_t, W_T)$ and let $(H_k)_k$ be the Hermite polynomials given by $H_k(x) := e^{x^2/2} \frac{d^k}{dx^k}\big( e^{-x^2/2} \big)$. Because of the martingale property of $\big( t^{k/2} H_k\big( \frac{W_t}{\sqrt{t}} \big) \big)_{t \ge 0}$ (see [14, Chapter 4]), one can take as control variates
$$P_k(Z) := T^{k/2} H_k\Big( \frac{W_T}{\sqrt{T}} \Big) - t^{k/2} H_k\Big( \frac{W_t}{\sqrt{t}} \Big).$$
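The martingale property can be checked by simulation. The sketch below (our own code) evaluates the probabilists' Hermite polynomials by the standard three-term recurrence; they coincide with the $H_k$ above up to the sign $(-1)^k$, which is irrelevant for the centering:

```python
import numpy as np
from math import sqrt

def hermite(k, x):
    # Probabilists' Hermite polynomial He_k (three-term recurrence),
    # equal to the text's H_k up to the sign (-1)^k.
    h_prev, h = np.ones_like(x), x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

rng = np.random.default_rng(3)
t, T = 1.0, 2.0
N = 400000
Wt = sqrt(t) * rng.standard_normal(N)
WT = Wt + sqrt(T - t) * rng.standard_normal(N)

for k in (1, 2, 3):
    P = T**(k / 2) * hermite(k, WT / sqrt(T)) - t**(k / 2) * hermite(k, Wt / sqrt(t))
    print(k, P.mean())   # martingale increments: each mean is close to 0
```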
e) Similarly, for a Poisson process $(N_s)_{0 \le s \le T}$, let $(C_k)_k$ be the Charlier polynomials defined by $\sum_{i=0}^\infty C_i(y, a) \frac{w^i}{i!} = e^w \big( 1 - \frac{w}{a} \big)^y$. Since $\big( t^k C_k(N_t, t) \big)_{0 \le t \le T}$ is a martingale,
$$P_k(Z) := T^k C_k(N_T, T) - t^k C_k(N_t, t)$$
defines a control variate. Other examples of orthogonal polynomials in relation with stochastic processes are provided in [14].
f) Recently in [5], Cuchiero et al. have proved that polynomials of stochastic processes are preserved by conditional expectations, for a large class of models including affine processes, processes with quadratic diffusion coefficients, and Lévy-driven SDEs with affine vector fields. All these examples naturally lead to PCVs of polynomial type.
g) More generally, let $(Y_s)_{0 \le s \le T}$ be an $\mathbb{R}^d$-valued diffusion process
$$dY_t = b(t, Y_t)\, dt + \sigma(t, Y_t)\, dW_t$$
(assuming globally Lipschitz coefficients), and let $(p_k)_{1 \le k \le K_{pcv}}$ be a family of smooth functions. Put $X = Y_t$, $Z = (Y_t, U_{t,T}, Y_{U_{t,T}}, Y_T)$ where $U_{t,T}$ is a random variable uniformly distributed on $[t, T]$, independent of $(Y_s)_{0 \le s \le T}$. Denoting by $\mathcal{L}$ the infinitesimal generator and using Itô's formula, it is easy to check that
$$P_k(Z) := p_k(T, Y_T) - (T - t)(\partial_t + \mathcal{L}) p_k(U_{t,T}, Y_{U_{t,T}}) - p_k(t, Y_t)$$
is such that $\mathbb{E}[P_k(Z) \mid Y_t] = 0$, thus it is a control variate.
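A minimal numerical check of this construction, with an illustrative choice of diffusion and test function (an Ornstein-Uhlenbeck process simulated with its exact Gaussian transition, and $p(s, y) = y^2$; the names are ours):

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(4)

# Toy diffusion: Ornstein-Uhlenbeck process dY = -Y dt + dW, simulated with
# its exact Gaussian transition. Take p(s, y) = y**2, so that
# (d/ds + L) p = 1 - 2*y**2, with L p = b p_y + (sigma**2 / 2) p_yy.
t, T = 0.5, 2.0
N = 300000
Yt = rng.standard_normal(N) / sqrt(2.0)       # stationary start N(0, 1/2)
U = rng.uniform(t, T, size=N)                 # independent uniform time on [t, T]

def ou_step(y, dt):
    mean = y * np.exp(-dt)
    std = np.sqrt(0.5 * (1.0 - np.exp(-2.0 * dt)))
    return mean + std * rng.standard_normal(y.shape[0])

YU = ou_step(Yt, U - t)
YT = ou_step(YU, T - U)

# Control variate of example g): centered by construction (Dynkin's formula
# plus the uniform randomization of the time integral).
P = YT**2 - (T - t) * (1.0 - 2.0 * YU**2) - Yt**2
print(P.mean())   # close to 0
```

The randomization over $U_{t,T}$ replaces the time integral in Dynkin's formula by a single unbiased draw, which keeps the simulation cost per path constant.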
Notations and algorithm description
Let $\mathcal{A}$ be a non-empty closed convex subset of $\mathbb{R}^{K_{pcv}}$ and for $\alpha \in \mathcal{A}$, put
$$H^\alpha = H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z).$$
The PCV algorithm is as follows.
STEP 1. Variance reduction. We approximate $(\alpha^\star_k)_{1 \le k \le K_{pcv}}$ by the coefficients $(\widetilde{\alpha}_k)_{1 \le k \le K_{pcv}}$ using the empirical norm in (1):
$$(\widetilde{\alpha}_k)_{1 \le k \le K_{pcv}} := \operatorname{arginf}_{\alpha \in \mathcal{A}} \frac{1}{N} \sum_{i=1}^N \Big( H_i - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z_i) \Big)^2.$$
STEP 2. Least squares regression. We compute the coefficients of the least squares regression $(\widetilde{\beta}_k)_{1 \le k \le K_F}$ verifying:
$$(\widetilde{\beta}_k)_{1 \le k \le K_F} := \operatorname{arginf}_{(\beta_k)_k} \frac{1}{N} \sum_{i=1}^N \Big( H_i - \sum_{k=1}^{K_{pcv}} \widetilde{\alpha}_k P_k(Z_i) - \sum_{k=1}^{K_F} \beta_k \Phi_k(X_i) \Big)^2.$$
Then, $\widetilde{m}_N$ is defined by
$$\widetilde{m}_N := \sum_{k=1}^{K_F} \widetilde{\beta}_k \Phi_k,$$
because the regression function of $H - \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k(Z)$ is also $m(\cdot)$, due to the fact that $\mathbb{E}[P_k(Z) \mid X] = 0$.
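Putting the two steps together, here is a compact Python sketch of the whole PCV algorithm on a toy model where $m$ is known in closed form (our own names; STEP 1 is done by unconstrained least squares, ignoring the projection on $\mathcal{A}$):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: X = W_1, H = sin(W_2) with W_2 = W_1 + G, G ~ N(0,1) independent,
# so that m(x) = E[H | X = x] = exp(-1/2) * sin(x) is known in closed form.
N = 2000
W1 = rng.standard_normal(N)
W2 = W1 + rng.standard_normal(N)
X, H = W1, np.sin(W2)

# One conditionally centered PCV: P_1(Z) = W_2 - W_1.
P = (W2 - W1).reshape(-1, 1)

# STEP 1: empirical minimization of (1); plain least squares here, i.e. we
# skip the projection onto the constraint set A for this sketch.
alpha, *_ = np.linalg.lstsq(P, H, rcond=None)

# STEP 2: least squares of the corrected responses on the basis functions
# Phi_k(x) = x**k, k = 0..5.
K_F = 6
Phi = np.vander(X, K_F, increasing=True)
beta, *_ = np.linalg.lstsq(Phi, H - P @ alpha, rcond=None)
m_tilde = Phi @ beta                     # PCV estimate at the sample points

err = np.mean((m_tilde - np.exp(-0.5) * np.sin(X))**2)
print(err)   # small empirical squared error
```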
Hypothesis
To prove our theorems, we assume

(H) For some $L \ge 1$:
i) $\|P_k\|_\infty \le 1$ and $\|h\|_\infty \le L$;
ii) $\mathcal{A} := \{\alpha \in \mathbb{R}^{K_{pcv}} : \sum_{i=1}^{K_{pcv}} |\alpha_i| \le L\}$.

Clearly, $\mathcal{A}$ is a non-empty closed convex set. These are technical assumptions, which allow us to bound the PCVs uniformly and to apply Hoeffding-type concentration inequalities to them.
Main theorems
Theorem 2.2. Set $\Sigma^2(\alpha^\star) = \sup_{x \in \mathbb{R}^{d_x}} \operatorname{Var}(H^{\alpha^\star} \mid X = x)$. Then, for any $\rho > 0$,
$$\mathbb{E}\big[ \|\widetilde{m}_N - m\|^2_{L^2(\mu_N)} \big] \le (1 + \rho^{-1}) L^4\, \frac{c_1 + (c_2 + c_3 \log(N))(K_{pcv} + 1)}{N} + (1 + \rho) \frac{K_F}{N} \Sigma^2(\alpha^\star) + (1 + \rho)\, \mathbb{E}\Big[ \inf_{f \in F_N} \|f - m\|^2_{L^2(\mu_N)} \Big],$$
where $c_1$, $c_2$ and $c_3$ are universal constants.
This theorem shows that the PCV regression presumably reduces the estimation error through the factor $\Sigma^2(\alpha^\star)$ instead of the usual $\Sigma^2 = \Sigma^2(0)$. On the other hand, the error contains an additional term (related to $c_1$, $c_2$, $c_3$) because of the extra error in the estimation of $\alpha^\star$ by $\widetilde{\alpha}$. Hence, in practice, we have to choose a relatively small $K_{pcv}$ with respect to $K_F$. In the numerical tests, we illustrate this issue.
Theorem 2.3. Consider piecewise constant $F_N$-basis functions of the form $\Phi_k = \mathbf{1}_{I_k}$ where $(I_k)_{1 \le k \le K_F}$ are disjoint sets. Assume that one of the following assumptions holds, for some constant $c_I \ge 1$:
a) $(I_k)_k$ depends on $(X_1, \dots, X_N)$ and their frequencies dominate the uniform distribution, i.e.
$$\frac{1}{N} \sum_{i=1}^N \mathbf{1}_{X_i \in I_k} \ge \frac{1}{c_I K_F}, \qquad 1 \le k \le K_F;$$
b) $(I_k)_k$ are deterministic and their occurrences w.r.t. the distribution of $X$ dominate the uniform distribution, i.e.
$$\mathbb{P}(X \in I_k) \ge \frac{1}{c_I K_F}, \qquad 1 \le k \le K_F.$$
Then, for any $\rho > 0$,
$$\mathbb{E}\big[ \|\widetilde{m}_N - m\|^2_{L^2(\mu_N)} \big] \le (1 + \rho^{-1}) L^4\, \frac{c_1 + (c_2 + c_3 \log(N))(K_{pcv} + 1)}{N} + (1 + \rho)\, c_I \frac{K_F}{N} \inf_{\alpha \in \mathcal{A}} \mathbb{E}[\operatorname{Var}(H^\alpha \mid X)] + (1 + \rho)\, \mathbb{E}\Big[ \inf_{f \in F_N} \|f - m\|^2_{L^2(\mu_N)} \Big],$$
with the same constants $c_1$, $c_2$ and $c_3$ as before.
In Theorem 2.3, under an additional hypothesis on the $\Phi_k$, we obtain a more precise bound for the overall error, since the factor in the second term is $\inf_{\alpha \in \mathcal{A}} \mathbb{E}[\operatorname{Var}(H^\alpha \mid X)]$, which is exactly the quantity minimized in STEP 1 of the PCV algorithm.
3 Proofs
We first introduce notations specific to the proofs.
i) Denote by $\langle \cdot, \cdot \rangle_\mu$ and $\langle \cdot, \cdot \rangle_{\mu_N}$ the scalar products related to the $L^2$-norms $\|\cdot\|_{L^2(\mu)}$ and $\|\cdot\|_{L^2(\mu_N)}$ introduced before: for two functions $g_1, g_2 : \mathbb{R}^{d_z} \times \mathbb{R}^{d_x} \to \mathbb{R}$,
$$\langle g_1, g_2 \rangle_\mu = \int g_1(z, x) g_2(z, x)\, \mu(dz, dx), \qquad \langle g_1, g_2 \rangle_{\mu_N} = \frac{1}{N} \sum_{i=1}^N g_1(Z_i, X_i) g_2(Z_i, X_i).$$
ii) PCV spaces: set
$$\mathcal{G} := \Big\{ \sum_{k=1}^{K_{pcv}} \alpha_k P_k : \alpha \in \mathbb{R}^{K_{pcv}} \Big\}, \qquad \mathcal{G}_{\mathcal{A}} := \Big\{ \sum_{k=1}^{K_{pcv}} \alpha_k P_k : \alpha \in \mathcal{A} \Big\}, \qquad T_L \mathcal{G} := \{ T_L g : g \in \mathcal{G} \},$$
where $T_L$ is the truncation operator at level $L$ defined by $T_L g(x) = -L \vee g(x) \wedge L$.
It is important to observe that the assumption (H) implies $\mathcal{G}_{\mathcal{A}} \subset T_L \mathcal{G}$.
iii) Define $\widetilde{m}^\star_N$ by
$$\widetilde{m}^\star_N = \sum_{k=1}^{K_F} \widetilde{\beta}^\star_k \Phi_k$$
where
$$(\alpha^\star_k)_{1 \le k \le K_{pcv}} = \operatorname{arginf}_{\alpha \in \mathcal{A}}\, \mathbb{E}\Big[ \Big( H - \sum_{k=1}^{K_{pcv}} \alpha_k P_k(Z) \Big)^2 \Big],$$
$$(\widetilde{\beta}^\star_k)_{1 \le k \le K_F} = \operatorname{arginf}_{(\beta_k)_k} \frac{1}{N} \sum_{i=1}^N \Big( H_i - \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k(Z_i) - \sum_{k=1}^{K_F} \beta_k \Phi_k(X_i) \Big)^2.$$
We now present some tools useful in the theory of non-parametric regression,
see [9, Chapter 9] for full details.
Definition 3.1. [Covering numbers $\mathcal{N}_p(\varepsilon, \mathcal{G}, z_1^M)$] Let $\mathcal{G}$ be a set of functions from $\mathbb{R}^d$ to $\mathbb{R}$ and let $z_1^M = (z_1, \dots, z_M)$ be $M$ points in $\mathbb{R}^d$: denote by $\nu_M$ the corresponding empirical measure and by $\|\cdot\|_{L^p(\nu_M)}$ the related $L^p$-norm ($p \ge 1$).
(i) For $\varepsilon > 0$, $(g_j : \mathbb{R}^d \to \mathbb{R})_{1 \le j \le n}$ is called an $\varepsilon$-cover of $\mathcal{G}$ w.r.t. $\|\cdot\|_{L^p(\nu_M)}$ if for every $g \in \mathcal{G}$, there is a $j \in \{1, \dots, n\}$ such that $\|g - g_j\|_{L^p(\nu_M)} < \varepsilon$.
(ii) We denote by $\mathcal{N}_p(\varepsilon, \mathcal{G}, z_1^M)$ the minimal size $n$ of $\varepsilon$-covers $(g_j)_{1 \le j \le n}$ of $\mathcal{G}$ w.r.t. $\|\cdot\|_{L^p(\nu_M)}$.
The combination of Lemma 9.2, Theorem 9.4, Theorem 9.5 and Equation (10.23) in [9] leads to the following ready-to-use estimates.
Proposition 3.2. For any $M$ points $z_1^M$ in $\mathbb{R}^d$ and any $\varepsilon \in\, ]0, L/2]$, we have
$$\mathcal{N}_1\big( \varepsilon, T_L \mathcal{G}, z_1^M \big) \le 3 \Big( \frac{2e(2L)}{\varepsilon} \log \frac{3e(2L)}{\varepsilon} \Big)^{K_{pcv}+1} \le 3 \Big( \frac{9L}{\varepsilon} \Big)^{2(K_{pcv}+1)},$$
$$\mathcal{N}_2\big( \varepsilon, T_L \mathcal{G}, z_1^M \big) \le 3 \Big( \frac{2e(2L)^2}{\varepsilon^2} \log \frac{3e(2L)^2}{\varepsilon^2} \Big)^{K_{pcv}+1} \le 3 \Big( \frac{18L^2}{\varepsilon^2} \Big)^{2(K_{pcv}+1)}.$$
Actually, the second series of inequalities is easily derived from $\log(x) \le x/e$ for any $x > 0$.
3.1 Proof of Theorem 2.2

We have:
$$\mathbb{E}\big[ \|\widetilde{m}_N - m\|^2_{L^2(\mu_N)} \big] \le (1 + \rho^{-1})\, \mathbb{E}\big[ \|\widetilde{m}_N - \widetilde{m}^\star_N\|^2_{L^2(\mu_N)} \big] + (1 + \rho)\, \mathbb{E}\big[ \|\widetilde{m}^\star_N - m\|^2_{L^2(\mu_N)} \big]. \qquad (3)$$
Step 1: Bound for $\mathbb{E}[\|\widetilde{m}_N - \widetilde{m}^\star_N\|^2_{L^2(\mu_N)}]$. $\widetilde{m}_N$ and $\widetilde{m}^\star_N$ are respectively the orthogonal projections, with respect to the scalar product $\langle \cdot, \cdot \rangle_{\mu_N}$, onto $\operatorname{Span}(\Phi_k : 1 \le k \le K_F)$ of $h - \sum_{k=1}^{K_{pcv}} \widetilde{\alpha}_k P_k$ and $h - \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k$. By the contraction property of projections, we get:
$$\mathbb{E}\big[ \|\widetilde{m}_N - \widetilde{m}^\star_N\|^2_{L^2(\mu_N)} \big] \le \mathbb{E}\Big[ \Big\| \sum_{k=1}^{K_{pcv}} \widetilde{\alpha}_k P_k - \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k \Big\|^2_{L^2(\mu_N)} \Big] = \mathbb{E}\big[ \|r_N - r\|^2_{L^2(\mu_N)} \big], \qquad (4)$$
where we have set $r := \sum_{k=1}^{K_{pcv}} \alpha^\star_k P_k$ and $r_N := \sum_{k=1}^{K_{pcv}} \widetilde{\alpha}_k P_k$.
We now estimate $\mathbb{E}[\|r_N - r\|^2_{L^2(\mu)}]$ before deducing a bound on $\mathbb{E}[\|r_N - r\|^2_{L^2(\mu_N)}]$. Actually, in view of (1), $r$ is the $L^2(\mu)$-projection of $h$ onto the non-empty closed convex set $\mathcal{G}_{\mathcal{A}}$; since $r_N$ is also in $\mathcal{G}_{\mathcal{A}}$, we have
$$\int |r_N(z) - r(z)|^2\, \mu(dz, dx) \le \int |r_N(z) - h(z)|^2\, \mu(dz, dx) - \mathbb{E}[|r(Z) - H|^2].$$
Because $\frac{1}{N} \sum_{i=1}^N |r_N(Z_i) - H_i|^2 - \frac{1}{N} \sum_{i=1}^N |r(Z_i) - H_i|^2 \le 0$, we deduce:
$$\int |r_N(z) - r(z)|^2\, \mu(dz, dx) \le T_{1,N} := \int |r_N(z) - h(z)|^2\, \mu(dz, dx) - \mathbb{E}[|r(Z) - H|^2] - 2 \Big( \frac{1}{N} \sum_{i=1}^N |r_N(Z_i) - H_i|^2 - \frac{1}{N} \sum_{i=1}^N |r(Z_i) - H_i|^2 \Big).$$
Let us estimate $\mathbb{E}(T_{1,N})$ by controlling $\mathbb{P}(T_{1,N} > t)$, $t > 0$. Since $r_N \in \mathcal{G}_{\mathcal{A}}$, we have
$$\mathbb{P}(T_{1,N} > t) \le \mathbb{P}\Big( \exists f \in \mathcal{G}_{\mathcal{A}} :\ \mathbb{E}[|f(Z) - H|^2] - \mathbb{E}[|r(Z) - H|^2] - \Big[ \frac{1}{N} \sum_{i=1}^N |f(Z_i) - H_i|^2 - \frac{1}{N} \sum_{i=1}^N |r(Z_i) - H_i|^2 \Big] > \frac{1}{2} \Big[ \frac{t}{2} + \frac{t}{2} + \mathbb{E}[|f(Z) - H|^2] - \mathbb{E}[|r(Z) - H|^2] \Big] \Big).$$
By using Theorem A.2 in the Appendix and Proposition 3.2 with $\mathcal{G}_{\mathcal{A}} \subset T_L \mathcal{G}$, and restricting to $t \ge \frac{320 L^4}{N}$, we get
$$\mathbb{P}(T_{1,N} > t) \le 14 \sup_{z_1^N} \mathcal{N}_1\Big( \frac{t}{640 L^3}, \mathcal{G}_{\mathcal{A}}, z_1^N \Big) \exp\Big( -\frac{N t}{5136 L^4} \Big) \le 14 \sup_{z_1^N} \mathcal{N}_1\Big( \frac{L}{2N}, T_L \mathcal{G}, z_1^N \Big) \exp\Big( -\frac{N t}{5136 L^4} \Big) \le 42 (18N)^{2(K_{pcv}+1)} \exp\Big( -\frac{N t}{5136 L^4} \Big).$$
We deduce, for any $\varepsilon \ge \frac{320 L^4}{N}$,
$$\mathbb{E}(T_{1,N}) \le \varepsilon + \int_\varepsilon^\infty 42 (18N)^{2(K_{pcv}+1)} \exp\Big( -\frac{N t}{5136 L^4} \Big)\, dt \le \varepsilon + 42 (18N)^{2(K_{pcv}+1)}\, \frac{5136 L^4}{N} \exp\Big( -\frac{N \varepsilon}{5136 L^4} \Big).$$
The expression above is minimal for $\varepsilon = \frac{5136 L^4}{N} \log\big( 42 (18N)^{2(K_{pcv}+1)} \big)$ (note that $\varepsilon \ge \frac{320 L^4}{N}$). By plugging this value, we find:
$$\mathbb{E}\big[ \|r_N - r\|^2_{L^2(\mu)} \big] \le \mathbb{E}(T_{1,N}) \le \frac{5136 L^4}{N} \big( \log(42e) + 2(K_{pcv}+1) \log(18N) \big). \qquad (5)$$
To deduce a bound for $\mathbb{E}\big[ \|r_N - r\|^2_{L^2(\mu_N)} \big]$, we set
$$T_{2,N} := 2 \big( \max\{ \|r_N - r\|_{L^2(\mu_N)} - 2 \|r_N - r\|_{L^2(\mu)},\, 0 \} \big)^2$$
and we write the decomposition
$$\|r_N - r\|^2_{L^2(\mu_N)} \le \big( \max\{ \|r_N - r\|_{L^2(\mu_N)} - 2 \|r_N - r\|_{L^2(\mu)},\, 0 \} + 2 \|r_N - r\|_{L^2(\mu)} \big)^2 \le T_{2,N} + 8 \|r_N - r\|^2_{L^2(\mu)}. \qquad (6)$$
Let $u > \frac{144 L^2}{N}$. Using $r_N \in \mathcal{G}_{\mathcal{A}} \subset T_L \mathcal{G}$, Lemma A.1 in the Appendix and Proposition 3.2, we obtain
$$\mathbb{P}(T_{2,N} > u) \le \mathbb{P}\big( \exists f \in T_L \mathcal{G} : \|f - r\|_{L^2(\mu_N)} - 2 \|f - r\|_{L^2(\mu)} > \sqrt{u/2} \big) \le 3 \sup_{z_1^{2N}} \mathcal{N}_2\Big( \frac{\sqrt{u}}{24}, T_L \mathcal{G}, z_1^{2N} \Big) \exp\Big( -\frac{N u}{2304 L^2} \Big) \le 3 \sup_{z_1^{2N}} \mathcal{N}_2\Big( \frac{L}{2\sqrt{N}}, T_L \mathcal{G}, z_1^{2N} \Big) \exp\Big( -\frac{N u}{2304 L^2} \Big) \le 9 (72N)^{2(K_{pcv}+1)} \exp\Big( -\frac{N u}{2304 L^2} \Big).$$
Thus, similarly to the evaluation of $\mathbb{E}(T_{1,N})$, we obtain, for any $\varepsilon > \frac{144 L^2}{N}$,
$$\mathbb{E}(T_{2,N}) \le \varepsilon + 9 (72N)^{2(K_{pcv}+1)}\, \frac{2304 L^2}{N} \exp\Big( -\frac{N \varepsilon}{2304 L^2} \Big).$$
For the choice $\varepsilon = \frac{2304 L^2}{N} \log\big( 9 (72N)^{2(K_{pcv}+1)} \big) > \frac{144 L^2}{N}$, we get:
$$\mathbb{E}(T_{2,N}) \le \frac{2304 L^2}{N} \big( \log(9e) + 2(K_{pcv}+1) \log(72N) \big). \qquad (7)$$
Plugging (5)-(6)-(7) into (4) and using $L^2 \le L^4$, we obtain
$$\mathbb{E}\big[ \|\widetilde{m}_N - \widetilde{m}^\star_N\|^2_{L^2(\mu_N)} \big] \le L^4\, \frac{c_1 + (c_2 + c_3 \log(N))(K_{pcv} + 1)}{N}. \qquad (8)$$
Step 2: Bound for $\mathbb{E}[\|\widetilde{m}^\star_N - m\|^2_{L^2(\mu_N)}]$. To simplify, we introduce the notation $\mathbb{E}^*[\cdot] := \mathbb{E}[\cdot \mid X_1, \dots, X_N]$.
Consider a complete orthonormal basis $(f_1, \dots, f_K)$ ($K \le K_F$) for $F_N$ w.r.t. the empirical scalar product $\langle \cdot, \cdot \rangle_{\mu_N}$:
$$\langle f_k, f_l \rangle_{\mu_N} = \frac{1}{N} \sum_{i=1}^N f_k(X_i) f_l(X_i) = \delta_{k,l}. \qquad (9)$$
Observe that
$$\mathbb{E}^*\big[ \|\widetilde{m}^\star_N - m\|^2_{L^2(\mu_N)} \big] = \mathbb{E}^*\Big[ \frac{1}{N} \sum_{i=1}^N |\widetilde{m}^\star_N(X_i) - \mathbb{E}^*[\widetilde{m}^\star_N(X_i)]|^2 \Big] + \mathbb{E}^*\Big[ \frac{1}{N} \sum_{i=1}^N |\mathbb{E}^*[\widetilde{m}^\star_N(X_i)] - m(X_i)|^2 \Big]$$
because $\mathbb{E}^*\big[ \frac{1}{N} \sum_{i=1}^N (\widetilde{m}^\star_N(X_i) - \mathbb{E}^*[\widetilde{m}^\star_N(X_i)]) (\mathbb{E}^*[\widetilde{m}^\star_N(X_i)] - m(X_i)) \big] = 0$.
This proves
$$\mathbb{E}\big[ \|\widetilde{m}^\star_N - m\|^2_{L^2(\mu_N)} \big] = \mathbb{E}\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] + \mathbb{E}\big[ \|\mathbb{E}^*[\widetilde{m}^\star_N] - m\|^2_{L^2(\mu_N)} \big]. \qquad (10)$$
Similarly to the proof of [9, Theorem 11.1, pp. 185–187], we have
$$\mathbb{E}\big[ \|\mathbb{E}^*[\widetilde{m}^\star_N] - m\|^2_{L^2(\mu_N)} \big] = \mathbb{E}\Big[ \inf_{f \in F_N} \|f - m\|^2_{L^2(\mu_N)} \Big], \qquad (11)$$
$$\mathbb{E}^*\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{\mu_N} \big] = \frac{1}{N^2} \sum_{i=1}^N \operatorname{Var}(H_i^{\alpha^\star} \mid X_i) \sum_{k=1}^K (f_k(X_i))^2. \qquad (12)$$
Introducing $\Sigma^2(\alpha^\star)$ for the uniform bound on the conditional variance of $H^{\alpha^\star}$ and using $\|f_k\|_{L^2(\mu_N)} = 1$, we obtain
$$\mathbb{E}^*\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] \le \frac{K}{N} \Sigma^2(\alpha^\star).$$
Thus, since $K \le K_F$, it follows that
$$\mathbb{E}\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] \le \frac{K_F}{N} \Sigma^2(\alpha^\star). \qquad (13)$$
We complete the proof by combining (3)-(8)-(10)-(11)-(13).
3.2 Proof of Theorem 2.3
Case a) The orthonormal basis of $F_N$ w.r.t. $\langle \cdot, \cdot \rangle_{\mu_N}$ (see (9)) is readily given by
$$f_k = \frac{\mathbf{1}_{I_k}}{\sqrt{\frac{1}{N} \sum_{i=1}^N \mathbf{1}_{X_i \in I_k}}} \le \sqrt{c_I K_F}\, \mathbf{1}_{I_k}, \qquad 1 \le k \le K = K_F.$$
Therefore, we have $\sum_{k=1}^K (f_k(X_i))^2 \le c_I K_F$ and from (12), we deduce
$$\mathbb{E}\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] \le \frac{1}{N} \mathbb{E}\big[ \operatorname{Var}(H^{\alpha^\star} \mid X) \big]\, c_I K_F. \qquad (14)$$
Gathering the identity (2) and the inequalities (3)-(8)-(10)-(11)-(14) leads to the theorem result.
Case b) Concerning the orthogonalisation of $(\Phi_k)_{1 \le k \le K_F}$, only $K \le K_F$ sets contain at least one data point. More precisely, for such a set $I_k$, for which $p_{k,N} := \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{X_i \in I_k} > 0$, we can set
$$f_k = \frac{\mathbf{1}_{I_k}}{\sqrt{p_{k,N}}}.$$
Hence, the last factor in (12) is equal to $\sum_{k=1}^K (f_k(X_i))^2 = \sum_{k=1}^{K_F} \frac{\mathbf{1}_{X_i \in I_k}}{p_{k,N}} \mathbf{1}_{p_{k,N} > 0}$, using the convention $0/0 = 0$. Then, by a symmetry argument within $(X_1, \dots, X_N)$, we have
$$\mathbb{E}\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] = \frac{1}{N} \mathbb{E}\Big[ \operatorname{Var}(H_1^{\alpha^\star} \mid X_1) \sum_{k=1}^{K_F} \frac{\mathbf{1}_{X_1 \in I_k}}{p_{k,N}} \mathbf{1}_{p_{k,N} > 0} \Big] = \sum_{k=1}^{K_F} \mathbb{E}\Big\{ \operatorname{Var}(H_1^{\alpha^\star} \mid X_1)\, \mathbf{1}_{X_1 \in I_k}\, \mathbb{E}\Big[ \frac{1}{1 + \sum_{i=2}^N \mathbf{1}_{X_i \in I_k}} \Big] \Big\}.$$
Observe that $\sum_{i=2}^N \mathbf{1}_{\{X_i \in I_k\}}$ is binomially distributed with parameters $(N-1, \mathbb{P}(X \in I_k))$: from [9, Lemma 4.1] we know that
$$\mathbb{E}\Big[ 1 \Big/ \Big( 1 + \sum_{i=2}^N \mathbf{1}_{X_i \in I_k} \Big) \Big] \le \frac{1}{N\, \mathbb{P}(X \in I_k)} \le \frac{c_I K_F}{N}.$$
Hence, $\mathbb{E}\big[ \|\widetilde{m}^\star_N - \mathbb{E}^*[\widetilde{m}^\star_N]\|^2_{L^2(\mu_N)} \big] \le c_I \frac{K_F}{N} \inf_{\alpha \in \mathbb{R}^{K_{pcv}}} \mathbb{E}[\operatorname{Var}(H^\alpha \mid X)]$. We conclude as for Case a).
4 Numerical experiments
Consider two independent Brownian motions W and B. We experiment the method in two cases related to the dimension of X: d x = 1 and d x = 2.
4.1 Dimension 1
Figure 1. Empirical error (in log scale) as a function of $N$. Parameters: $d_x = 1$, $K_1 = K_{pcv}$, $K_2 = K_F$.
Our goal is to estimate $m(x) = \mathbb{E}[h(W_2) \mid W_1 = x]$ where $h(x) = e^{-x^2/2}$; due to the model assumptions, $m$ is explicit. For the PCVs, we take the hat functions, see example b) in Paragraph 2.3. Regarding $F_N$, we choose $\Phi_k(x) = x^k$, $0 \le k \le 9$.
Figure 1 shows the global empirical error³ $\mathbb{E}\big[ \|\widetilde{m}_N - m\|^2_{L^2(\mu_N)} \big]$ for $K_F = 10$ and various values of $K_{pcv}$, in a range of few simulations ($N \le 10000$) and in a range of many simulations ($10000 \le N \le 100000$). The variables on the plots are $K_1 = K_{pcv}$ and $K_2 = K_F$.
We observe that the PCV method, compared to a simple regression, improves the convergence of the global error. However, the PCV efficiency for small $N$ deteriorates when $K_{pcv}$ becomes large ($K_{pcv} = 21$). This is explained by the fact that the statistical error in the estimation of $\alpha^\star$ (the term with $c_1$, $c_2$, $c_3$ in Theorem 2.2) is significant for small $N$ and large $K_{pcv}$. Thus, for a small number of simulations, using $K_{pcv} = 3$ or $5$ is optimal. The standard method with $N = 100000$ yields an error equivalent to that of the PCV method with $N = 2000$: hence, the PCV yields an improvement factor of 50 in terms of simulation effort. In the range of large $N$, the PCV method reaches the approximation error while the standard method still requires more simulations.
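For readers who want to reproduce the flavor of this experiment, the sketch below (our own seed and sample size; STEP 1 done by unconstrained least squares; error levels will differ from the figure) compares the standard and PCV empirical errors on the same data, using the hat-function conditional expectation of example b):

```python
import numpy as np
from math import erf, sqrt, pi

norm_cdf = np.vectorize(lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0))))

rng = np.random.default_rng(6)
t, T = 1.0, 2.0
s = sqrt(T - t)

K_pcv = 5
Delta = 4.0 * sqrt(T) / (K_pcv + 1)
x_k = -2.0 * sqrt(T) + Delta * np.arange(1, K_pcv + 1)

def hat(y, k):
    # hat function p1_{x_k, Delta}
    return np.maximum(1.0 - np.abs((y - x_k[k]) / Delta), 0.0)

def cond_exp_hat(x, k):
    # closed-form E[p1_{x_k, Delta}(W_T) | W_t = x] from example b)
    a, b, c = x_k[k] - Delta, x_k[k], x_k[k] + Delta
    gauss = lambda u: np.exp(-(u - x)**2 / (2.0 * (T - t)))
    return ((x - a) / Delta * (norm_cdf((b - x) / s) - norm_cdf((a - x) / s))
            + (c - x) / Delta * (norm_cdf((c - x) / s) - norm_cdf((b - x) / s))
            + s / (Delta * sqrt(2.0 * pi)) * (gauss(a) - 2.0 * gauss(b) + gauss(c)))

def experiment(N):
    Wt = sqrt(t) * rng.standard_normal(N)
    WT = Wt + s * rng.standard_normal(N)
    X, H = Wt, np.exp(-WT**2 / 2.0)
    m_true = np.exp(-X**2 / 4.0) / sqrt(2.0)       # explicit regression function
    Phi = np.vander(X, 10, increasing=True)        # K_F = 10 monomials

    g, *_ = np.linalg.lstsq(Phi, H, rcond=None)    # standard regression
    err_std = np.mean((Phi @ g - m_true)**2)

    P = np.stack([hat(WT, k) - cond_exp_hat(X, k) for k in range(K_pcv)], axis=1)
    a, *_ = np.linalg.lstsq(P, H, rcond=None)            # STEP 1 (unconstrained)
    b, *_ = np.linalg.lstsq(Phi, H - P @ a, rcond=None)  # STEP 2
    err_pcv = np.mean((Phi @ b - m_true)**2)
    return err_std, err_pcv

print(experiment(2000))   # err_pcv is typically much smaller than err_std
```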
4.2 Dimension 2
Consider the estimation of $m(x) = \mathbb{E}[h(W_2, B_2) \mid W_1 = x, B_1 = x]$ where $h(x_1, x_2) = e^{-\frac{x_1^2 + x_2^2 + \rho x_1 x_2}{2}}$ with $\rho = 0.5$. We use the hat functions in dimension 2, see example c) in Paragraph 2.3. Figure 2 shows the global empirical error $\mathbb{E}\big[ \|\widetilde{m}_N - m\|^2_{L^2(\mu_N)} \big]$ for $K_F = 13 \times 13$ and different values of $K_{pcv}$. We observe the same features as in dimension 1. Here, the improvement factor is about 25, comparing the PCV with $K_{pcv} = 5 \times 5$ at $N = 2000$ and the standard method at $N = 50000$.
5 Conclusion and perspectives
The PCV method significantly accelerates the convergence of the estimation error to 0, regardless of the selected approximation space. It provides a higher accuracy of regression-based Monte-Carlo algorithms, especially with few simulations; thus it can be used to efficiently reduce the computational time. However, in view of the theoretical and numerical results, special attention has to be paid to the choice of the PCVs and their number $K_{pcv}$, which has to be small compared to the dimension $K_F$ of the approximation space.
An adaptive selection procedure of PCVs would be worth designing, which is left to further research. Moreover, we will investigate how to relax the bound-
3