HAL Id: hal-00460632
https://hal.archives-ouvertes.fr/hal-00460632
Submitted on 3 Mar 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires
CONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION
Robert Frouin, Bruno Pelletier
To cite this version:
Robert Frouin, Bruno Pelletier. CONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION. Communications in Statistics - Theory and Methods, Taylor &
Francis, 2009, 38 (8), pp.1272-1283. �10.1080/03610920802395702�. �hal-00460632�
C ONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION
Robert FROUINaand Bruno PELLETIERb,∗
aScripps Institution of Oceanography University of California, San Diego
9500 Gilman Drive, La Jolla, CA 92093-0224, USA [email protected]
aInstitut de Math´ematiques et de Mod´elisation de Montpellier UMR CNRS 5149, Equipe de Probabilit´es et Statistique
Universit´e Montpellier II, CC 051
Place Eug`ene Bataillon, 34095 Montpellier Cedex 5, France [email protected]
Abstract
A nonparametric regression model proposed in [Pelletier and Frouin, Ap- plied Optics, 2006] as a solution to the geophysical problem of ocean color remote sensing is studied. The model, called ridge function field, com- bines a regression estimate in the form of a superposition of ridge func- tions, or equivalently a neural network, with the idea pertaining to varying- coefficients models, where the parameters of a parametric family are allowed to vary with other variables. Under mild assumptions on the underlying dis- tribution of the data, the strong universal consistency of the least-squares ridge function fields estimate is established.
Index Terms— Varying coefficients model, ridge function approximation, nonparametric regression, universal consistency, least squares.
AMS 2000 Classification: 62G08, 62G05.
∗Corresponding author.
1 Introduction
The motivation for the present work comes from the geophysical problem of ocean color remote sensing from space, which we shall describe first. In this problem, the aim is to predict the value of an oceanic parameter Y (e.g., the surface phy- toplankton pigment concentration) from a vectorXof radiometric measurements at several wavelengths acquired by a sensor onboard a satellite platform. This is in fact an inverse problem where the measurements X depend on Y and other parameters, such as the aerosol optical properties, through a given forward oper- ator (governing the radiative transfer in the ocean-atmosphere system) contami- nated by a random measurement noise. In particular, X depends on a vectorT of angular variables describing the relative positions of the Sun and the sensor with respect to the target; as such, T is observed simultaneously with X. The difficulty, though, is that the operator to be inverted, resulting from modeling of scattering and absorption processes at the micro-physical scale, is rather complex.
To circumvent this issue, the approach taken up in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007) consists in i) sampling the forward oper- ator according to some prior distribution, ii) selecting a reasonable noise model, and iii) estimating the regression function m of Y on X and T from the data (X1, T1, Y1), . . . ,(Xn, Tn, Yn), wheremis defined by
m(x, t) = EX=x,T=t
Y|X, T .
In this display, the variable T is independent from the response Y but not from the predictorX. As such, T acts as an effect modifierin the sense of Hastie and Tibshirani (1993) in the context of varying coefficient models, i.e., the shape of the relation explaining Y from X is affected by the values taken by T. Let us recall that a varying-coefficient model is a linear model the parameters of which are allowed to vary with other variables. Since the work of Hastie and Tibshirani (1993), varying-coefficient models have been studied by several authors (see e.g., Fan and Zhang, 1999, 2000; Fan, Yao, and Cai, 2003, Wong, Ip, and Zhang, 2008, and the references therein). By relaxing to some extent the parametric constraint, these models have proved useful in various application contexts, and in particular in a high-dimensional setting.
As pointed out by Hastie and Tibshirani (1993) as well as Fan and Zhang (1999), the idea of allowing the parameters of a linear model to vary with other variables
is not new, and is widely used in the literature. More generally, one may start with an arbitrary parametric family in place of a linear model, and allow its parameters to vary with other variables. This idea is developed in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007) where we considered models of the form
ζn∗ : (x, t)7→ζn∗(x, t) :=f x;θn(t)
. (1.1)
In model (1.1), the functionf(.;θn)belongs to a setRnof functions, about which more will be said later, parameterized by a vector θn, which in (1.1) is allowed to vary witht. Additionally, to obtain an implementable model, each coordinate map of the application t 7→ θn(t) in (1.1) is taken to lie in a parametric set Tn
of functions oft. Now suppose that X and T are taking values in some subsets X ⊆RdandS ⊆Rp, respectively. Another way of formalizing this construction is to view the model (1.1) as a function ofxbeing “attached” to each pointtinS, i.e., in mathematical terms, as a map
ζn:S → Rn, (1.2) such that the map ζn∗ in (1.1) is the representation of ζn over the product space X × S, i.e., we let ζn∗(x, t) = ζn(t)(x). Thus the representation (1.2) highlights the fact that the variableT acts as an effect modifier.
In the experiments reported in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007), the sequenceRnof functions in (1.2) is taken as the set spanned by functions of the ridge form or, equivalently, neural networks, and the resulting models (1.1)-(1.2) is called a ridge function field over S. The sets Rn form a nested sequence of the formR1 ⊆ R2 ⊆. . .. There exists a vast literature on re- gression estimation with these ridge functions and neural networks. In particular, it is known that the union of sets spanned by ridge functions is dense in L2(µ), for any probability measure µ (see, Cybenko, 1989; Hornik, Stinchcombe, and White, 1989; Barron, 1993; Lin and Pinkus, 1993; Burger and Neubaeur, 2001;
Maiorov, 1999), and that they yield consistent nonparametric regression proce- dures (see e.g. Gyorfi, Kohler, Krzyzak, and Walk, 2002, Chapter 16). Similarly, the sets Tn have been chosen to form an increasing sequenceT1 ⊆ T2 ⊆ . . . of sets of real-valued functions onS.
Density results for ridge function fields have been obtained in Pelletier (2004) in the compact-open topology. In particular, under mild assumptions on the sets (Rn)nand(Tn)n, the union of ridge function fields is dense. Thus under these con- ditions, ridge function fields are suitable candidates for nonparametric regression,
since the approximation error of the estimate will decrease to 0 as n increases.
The purpose of the present work is to establish the strong universal consistency of the least-squares regression estimate in the form of (1.1)-(1.2).
The paper is organized as follows. Section 2 introduces ridge function fields and presents our main result (Theorem 2.1), which states that ridge function fields adjusted by least squares arestrongly universally consistent, i.e., that for any dis- tribution of(X, T, Y)satisfying mild assumptions, theL2 error of the regression estimate converges to 0 with probability 1 as the sample size tends to infinity.
Section 3 is devoted to the proof of Theorem 2.1.
2 Ridge function fields
Let(X, T, Y)be a random object, whereX,T, andY take values inRd,Rp, and R, respectively. We shall assume that X, T, and Y are bounded, i.e., that there exists positive constantsCX,CT andCY such that
|X| ≤CX, |T| ≤CT, |Y| ≤CY, with probability 1. (2.1) Givenn independent copies(X1, T1, Y1), . . . ,(Xn, Tn, Yn)of(X, T, Y), our aim is to construct an estimate of the regression function mof Y on the pair(X, T), i.e.,
m(x, t) =E
Y|(X, T) = (x, t) .
Let us start with the definition of a ridge function. A ridge function on Rd is a function of the form σ ha, xi
where σ : R → R is a given function, where a∈Rd, and whereh., .idenotes the usual scalar product onRd. Approximation by ridge functions refers to approximation by linear combinations of ridge functions.
There exists several variants of approximation by ridge functions, according to whether the function σ is fixed or not. In the following, we shall consider the familyRnof functions onRddefined by
Rn=n x7→
Kn
X
j=1
cjσ haj, xi+bj
+c0 : aj ∈Rd, bj ∈R, c0, cj ∈R,
j = 1, . . . , Kn
o,
whereKnis an integer, and whereσ : R→ Ris a fixedsquashing function, i.e., σis nondecreasing, and satisfies the following two conditions:
u→−∞lim σ(u) = 0 and lim
u→∞σ(u) = 1.
Next, let Ψ1,Ψ2, . . . : Rp → R be bounded basis functions, the linear span of which is dense inC Rp
in the topology of uniform convergence on compact sets, i.e., the set
∞
[
k=1
nt7→
k
X
j=1
ajΨj(t) : a1, . . . , ak ∈Ro is dense in C Rp
. Without loss of generality, we assume that |Ψi| ≤ 1for all i ≥ 1. Next, given an integerL ∈ Nand a real numberρ >0, we shall consider the following families of functions onRp:
T(L) = n t7→
L
X
j=1
ajΨj(t) : a1, . . . , ak∈Ro ,
Tρ(L) = n t7→
L
X
j=1
ajΨj(t) : a1, . . . , ak∈R,
L
X
j=1
|aj| ≤ρo .
Then, given integers Lan, Lbn, and Lcn, and a positive number ρn, we define the familyFnof ridge function fields by
Fn =n
(x, t)7→
Kn
X
j=1
cj(t)σ haj(t), xi+bj(t)
+c0(t) : aj ∈ T(Lan)d
, bj ∈ T(Lbn), c0, cj ∈ Tρn(Lcn), j = 1, . . . , Kn
o .
We shall also consider the subset F˜n of Fn where only the coefficients cj are allowed to vary with the variablet, i.e., we define
F˜n=n
(x, t)7→
Kn
X
j=1
cj(t)σ haj, xi+bj
+c0(t) : aj, bj ∈R, c0, cj ∈ Tρn(Lcn),
j = 1, . . . , Kn
o.
Both sets Fn andF˜n are dense in the topology of uniform convergence on com- pact sets (Pelletier, 2004). Consequently,Fn andF˜nare also dense inL2 for any distribution with bounded support and, as exposed in the Introduction, the approx- imation error of the estimates will decrease to 0 asn→ ∞.
Let us first define the mapsm†nandm˜†nas any minimizer of the empiricalL2risk overFnandF˜n, respectively, i.e.,mnandm˜nare such that
1 n
n
X
i=1
m†n(Xi, Ti)−Yi
2
= inf
f∈Fn
1 n
n
X
i=1
f(Xi, Ti)−Yi
2
, (2.2) 1
n
n
X
i=1
m˜†n(Xi, Ti)−Yi
2
= inf
f˜∈F˜n
1 n
n
X
i=1
f(X˜ i, Ti)−Yi
2
. (2.3) Then we define the estimates mn and m˜n as truncated versions of m†n and m˜†n respectively, i.e.,
mn(x, t) = Tβnm†n(x, t), (2.4)
˜
mn(x, t) = Tβnm˜†n(x, t), (2.5) where(βn)nis a sequence of positive numbers such thatβn→ ∞, and whereTβn
is the truncation operator defined byTβnu= min{|u|, βn}sign(u).
We are now in a position to state our main result.
Theorem 2.1 Letmnandm˜nbe the truncated least-squares estimates defined by (2.2),(2.3),(2.4), and(2.5).
(i) If
Kn→ ∞, Lan→ ∞, Lbn→ ∞, Lcn→ ∞, βn→ ∞, ρn → ∞,
asn→ ∞in such a way that
Knβn4(Lan+Lbn+Lcn) log(βnρnKn)
n →0 and βn4
n1−δ →0, for someδ >0asn→ ∞, then
n→∞lim Z
mn(x, t)−m(x, t)2
µ(dx, dt) = 0 with probability 1,
for every distribution of(X, T, Y)satisfying (2.1).
(ii) If
Kn→ ∞, Lcn→ ∞, βn→ ∞, and ρn→ ∞, asn→ ∞in such a way that
Knβn4Lcnlog(βnρnKn)
n →0 and βn4
n1−δ →0, for someδ >0asn→ ∞, then
n→∞lim Z
˜
mn(x, t)−m(x, t)2
µ(dx, dt) = 0 with probability 1,
for every distribution of(X, T, Y)satisfying (2.1).
Theorem 2.1 states that nonparametric regression estimation with ridge function fields is a consistent procedure as long as the underlying distribution satisfies (2.1).
Note first that the extension to a multivariate response Y is straightforward, and second that the family F˜nis a subset of the family Fn. Consequently,Fn offers more flexibility than F˜n but, as expected, at the expense of an increased com- plexity of the fitting algorithm. More generally, one may consider replacing the familyRnin the definition of a set of function fields by any other nested sequence of models used in nonparametric regression. That said, regarding the ocean color remote sensing problem, the choice of using ridge functions has been dictated not only by their approximation properties, but more importantly by the speed of ex- ecution of these models, which should be high for processing large data sets, as provided by satellite imagery. Models belonging to Fn have been implemented and evaluated in Pelletier and Frouin (2006) for the retrieval of the chlorophyll-a concentration, i.e., a univariate response, while the family F˜n has been used in Frouin and Pelletier (2007) for the retrieval of the spectral marine reflectance, a multivariate output. In practice, the minimization of the empiricalL2risk in (2.2) and (2.3) may be solved using a stochastic gradient descent algorithm (see Pel- letier and Frouin, 2006 for details).
Let us also mention that assumption (2.1), which is reasonable from a physical perspective, may be weakened at the price of a little extra work. As a matter of fact, one may assume that the response Y is unbounded and use a truncation argument, as in, e.g., Kohler and Krzyzak (2005), and Gyorfi, Kohler, Krzyzak, and Walk (2002, Theorem 10.2). Next, ifXandT are unbounded, it is immediate to show that the union of the sets Fn is dense in L2(µ) for any distribution µ
provided that the sets T(L) contain the constant and affine functions of t. On the other hand, a little extra work would be needed to derive the density in L2
of the union of the F˜n’s. The study of the convergence rates is left for future research. In this perspective, let us emphasize that the respective influences of the variablesX andT on the responseY are well separated in a ridge function field, by construction, which may prove useful for adaptation over anisotropic classes of regression functions (see e.g. Hofmann and Lepski, 2002).
3 Proofs
3.1 Technical Lemmas
We shall need the following technical Lemmas to derive an upper-bound on the covering number of Fn. Let us start by introducing the notations and definitions used hereafter.
First of all, letG be a set of real valued functions onRd, and letx1, . . . , xn ben fixed points inRd. For any functionsg1, g2 :Rd→R, set
dist1,n(g1, g2) =Pn |g1−g2| ,
wherePnis the discrete uniform measure on{x1, . . . , xn}. Theε-covering num- ber ofGwith respect todist1,nis called theL1ε-covering number ofGon{x1, . . . , xn} and will be denoted byN1 ε,G, xn1
, i.e., N1 ε,G, xn1
is the smallest integerN such that there exists functionsg1, . . . , gnsuch that
1≤j≤Nmin 1 n
n
X
i=1
|g(xi)−gj(xi)| ≤ε,
for allg ∈ G.
Similarly to the above, the L1 ε-packing number of G on {x1, . . . , xn}, further denoted byM1(ε,G, xn1), is the maximal integerNsuch that there exists functions g1, . . . , gN inGsuch that
1 n
n
X
i=1
gj(xi)−gk(xi) ≥ε,
for all1≤j < k≤N.
Finally,G+will denote the set of all subgraphs of functions inG, i.e., G+ =
(x, u)∈Rd×R : u≤g(x), g ∈ G ,
and the Vapnik-Chervonenkis dimension of G+ will be denoted byVG+ (Vapnik and Chervonenkis, 1971).
The following results, extracted from Lemmas 16.3, 16.4, and 16.5 in Gyorfi, Kohler, Krzyzak, and Walk (2002), will be needed in the following, and are given to make the paper self-contained. LetF be a set of functions Rd → R. Given a squashing functionσ, defineG={σ ◦ f : f ∈ F}. Then
VG+ ≤VF+. (3.1)
Given two setsF andGof functionsRd→R, setF ⊕ G ={f+g : f ∈ F, g ∈ G}, and let{x1, . . . , xn}benpoints inRd. Then
N1 ε+δ,F ⊕ G, xn1
≤ N1 ε,F, xn1
N1 δ,G, xn1
. (3.2)
Given two sets F andG of functions Rd → R, uniformly bounded over Rd by some constantsCF andCG, respectively, setF ⊙ G ={f g : f ∈ F, g ∈ G}, and let{x1, . . . , xn}benpoints inRd. Then
N1 ε+δ,F ⊙ G, xn1
≤ N1 ε/CG,F, xn1
N1 δ/CF,G, xn1
. (3.3) At last, given a setGof functionsRd →[0;B]withVG+ ≥2,npoints{x1, . . . , xn} in Rd, and 0 < ε < B/4, we have (Gyorfi, Kohler, Krzyzak, and Walk (2002), Theorem 9.4)
M1 ε,G, xn1
≤3 2eB
ε log 3eB ε
VG+
. (3.4)
We may now state an upper-bound on the covering number ofFn. Lemma 3.1 Let N1 ε,Fn,(X, T)n1
be the L1 ε-covering number of Fn on the sample(X1, T1), . . . ,(Xn, Tn). Then
N1 ε,Fn,(X, T)n1
≤3
108eρn(Kn+ 1) ε
2(Lcn+1)+2Kn(Land+Lbn+Lcn+2)
. (3.5)
Proof. Consider the following sets of functions:
G1 =
(x, t)7→ ha(t), xi+b(t) : a1, . . . , ad ∈ T(Lan), b∈ T(Lbn) , G2 =
(x, t)7→σ ha(t), xi+b(t)
: a1, . . . , ad∈ T(Lan), b∈ T(Lbn) , G3 =
(x, t)7→c(t)σ ha(t), xi+b(t)
:a1, . . . , ad∈ T(Lan), b∈ T(Lbn), c∈ Tρn(Lcn) .
SinceG1 is a vector space of dimensionLand+Lbn, we have VG+
1 ≤Land+Lbn+ 1.
Next, (3.1) yields
VG+
2 ≤VG+
1 .
Consequently, since|σ(u)| ≤1for allu∈R, and using (3.4), we obtain N1 ε,G2,(X, T)n1
≤ M1 ε,G2,(X, T)n1
≤ 3
"
2e ε log
3e ε
#VG+ 2
≤ 3 3e
ε
2(Land+Lbn+1)
. (3.6)
Now, using the inequality (3.3) leads to N1 ε,G3,(X, T)n1
≤ N1
ε
2,Tρn(Lcn),(X, T)n1 N1
ε 2ρn
,G2,(X, T)n1 .
But, using (3.4), we have N1 ε,Tρn(Lcn),(X, T)n1
≤ M1 ε,Tρn(Lcn),(X, T)n1
≤ 3
"
2e(2ρn)
ε log
3e(2ρn) ε
#VT
ρn(Lcn)+
≤ 3 6eρn
ε
2(Lcn+1)
, (3.7)
since
VT
ρn(Lcn)+ ≤VT(Lcn)+ ≤Lcn+ 1,
and sinceT(Lcn)is a vector space of dimensionLcn. Then from (3.6) and (3.7), it follows that
N1 ε,G3,(X, T)n1
≤ 3 6eρn
ε/2
2(Lcn+1)
3
3e ε/(2ρn)
2(Land+Lbn+1)
= 9
12eρn
ε
2(Lcn+1) 6eρn
ε
2(Land+Lbn+1)
≤ 9
12eρn
ε
2(Land+Lbn+Lcn+2)
. (3.8)
Applying (3.2) yields N1 ε,Fn,(X, T)n1
≤ N1
ε
Kn+ 1,Tρn(Lcn),(X, T)n1 "
N1
ε
Kn+ 1,G3,(X, T)n1 #Kn
. Finally, by reporting (3.7) and (3.8) in the equation above, we obtain the upper
bound:
N1 ε,Fn,(X, T)n1
≤ 3
6eρn
ε/(Kn+ 1)
2(Lcn+1)"
9
12eρn
ε/(Kn+ 1)
2(Land+Lbn+Lcn+2)#Kn
≤ 3
108eρn(Kn+ 1) ε
2(Lcn+1)+2Kn(Land+Lbn+Lcn+2)
.
3.2 Proof of Theorem 2.1
By Lemma 10.2 in Gyorfi, Kohler, Krzyzak, and Walk (2002), we have the in- equality
Z
mn(x, t)−m(x, t)
2µ(dx, dt)
≤ inf
{f∈Fn:kfk∞≤βn}
Z
f(x, t)−m(x, t)
µ(dx, dt)
+ 2 sup
f∈TβnFn
1 n
n
X
i=1
f(Xi, Ti)−Yi
2
−E
n f(Xi, Ti)−Yi
2o
, (3.9)
where the first term is the approximation error, and where the second term is a uniform upper-bound overTβnFn of the estimation error. Since the union of the Fn’s is dense inC(Rd×Rp)in the topology of uniform convergence on compact sets, and sinceXandT are bounded by assumption, the union of theFn’s is dense inL2(µ)for everyµwith bounded support. Then it follows that the approximation error of m by an element of {f ∈ Fn : kfk∞ ≤ βn} converges to zero as Kn, Lan, Lbn, Lcn, βn, ρn → ∞, and part (i) of the theorem will be proved if we show that
sup
f∈TβnFn
1 n
n
X
i=1
f(Xi, Ti)−Yi
2
−En
f(Xi, Ti)−Yi
2o
→0, (3.10) a.s. asKn, Lan, Lbn, Lcn, βn, ρn→ ∞. The same arguments hold for the familyF˜n, so that part (ii) of the theorem will be proved if we show that (3.10) is satisfied withTβnFnreplaced byTβnF˜n.
To bound the second term in the right hand side of (3.9), let us first define the set Hnof functions by
Hn =
h:Rd×Rp×R∋(x, t, y)7→ |f(x, t)−y|21[−CY;CY](y) : f ∈TβnFn , where1[−CY;CY](.)is the characteristic function of the interval[−CY;CY]. Next, we shall setZi = (Xi, Ti, Yi), fori= 1, . . . , n.
For everyf inTβnFn, we have|f(x, t)| ≤βn. Hence fornlarge enough such that CY ≤βn,
|f(x, t)−y|21[−CY;CY](y)≤4βn2,
so that the range of each function h in Hn is included in the interval
0; 4βn2 . Then from the proof of Theorem 24 in Pollard (1984, p. 25), or Theorem 9.1 in Gyorfi, Kohler, Krzyzak, and Walk (2002, p. 136), we obtain for any ε > 0the following inequality:
P (
sup
f∈TβnFn
1 n
n
X
i=1
|f(Xi, Ti)−Yi|2−E
|f(X, T)−Y|2
> ε )
=P (
sup
h∈Hn
1 n
n
X
i=1
h(Zi)−E h(Z)
> ε )
≤8EN1ε
8,Hn, Z1n
exp − nε2 128 4βn22
!
. (3.11)
Now we proceed to bound the covering number in (3.11). Givenh1andh2 inHn, corresponding respectively tof1andf2 inTβnFn, we have
1 n
n
X
i=1
|h1(Zi)−h2(Zi)| = 1 n
n
X
i=1
|f1(Xi, Ti)−Yi|2− |f2(Xi, Ti)−Yi|2 a.s.
≤ 4βn
1 n
n
X
i=1
|f1(Xi, Ti)−f2(Xi, Ti)|, (3.12) since |f(x, t)| ≤ βn for every f inTβnFn, and since |Y| ≤ CY ≤ βn a.s. for n
large enough. Therefore, iff1, . . . , fNis aL1ε-cover ofTβnFnon(X1, T1), . . . ,(Xn, Tn) with
N =N1 ε, TβnFn,(X, T)n1 ,
there exists a L1 2ε-coverf1′, . . . , fN′ ofTβnFn on(X, T)n1 withfj′ ∈TβnFn, for allj = 1, . . . , n. Then using the inequality (3.12) we conclude that
N1ε
8,Hn, Z1n
≤ N1 ε 64βn
, TβnFn,(X, T)n1
!
. (3.13)
Using Lemma 3.1 together with the fact thatN1 ε, TβnFn,(X, T)n1
≤ N1 ε,Fn,(X, T)n1 , it follows that
P (
sup
f∈TβnFn
1 n
n
X
i=1
|f(Xi, Ti)−Yi|2−E
|f(X, T)−Y|2
> ε )
≤24
6912eβnρn(Kn+ 1) ε
2(Lcn+1)+2Kn(Land+Lbn+Lcn+2)
×exp
− nε2 2048βn4
= 24 exp (
−nδn1−δ βn4
ε2
2048 −2βn4
Lcn+ 1 +Kn(Land+Lbn+Lcn+ 2) n
×log
6912βnρn(Kn+ 1) ε
)
, (3.14)
for someδ >0.
Consequently, if
Kn→ ∞, Lan→ ∞, Lbn→ ∞, Lcn→ ∞, βn→ ∞, ρn → ∞, asn→ ∞in such a way that
Knβn4(Lan+Lbn+Lcn) log(βnρnKn)
n →0 and βn4
n1−δ →0, for someδ >0asn→ ∞, we deduce from (3.14) that
X
n≥1
P (
sup
f∈TβnFn
1 n
n
X
i=1
|f(Xi, Ti)−Yi|2 −E
|f(X, T)−Y|2
> ε )
<∞.
(3.15) Part (i) of Theorem 2.1 then follows from the Borel-Cantelli Lemma.
To prove part (ii), it suffices to setLan = 1andLbn = 1. Then one obtains (3.15) from (3.14) if
Kn → ∞, Lcn → ∞, βn → ∞, and ρn → ∞, asn→ ∞in such a way that
Knβn4Lcnlog(βnρnKn)
n →0 and βn4
n1−δ →0,
for someδ > 0asn → ∞, which together with the Borel-Cantelli Lemma con- cludes the proof of the second part of Theorem 2.1.
Acknowledgment
This work was supported by the National Aeronautics and Space Administration under Grant No. NNX08AF65A. The authors are indebted to an anonymous ref- eree for a careful reading of the paper and insightful comments.
References
[1] Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information Theory, Vol. 39, pp.
930-945.
[2] Burger, M. and Neubauer, A. (2001). Error bounds for approximation with neural networks.Journal of Approximation Theory,Vol. 112, pp. 235-250.
[3] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal func- tionMathematics of Control, Signals, and Systems,Vol. 2, pp. 303-314.
[4] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficients models.The Annals of Statistics,Vol. 27, pp. 1491-1518.
[5] Fan, J. and Zhang, J.-T. (2000). Two-step estimation of functional linear models with applications to longitudinal data.Journal of the Royal Statisti- cal Society, Series B,Vol. 62, pp.303-322.
[6] Fan, J., Yao, Q., and Cai, Z. (2003). Adaptive varying-coefficient linear mod- els.Journal of the Royal Statistical Society, Series B,Vol. 65, pp 57-80.
[7] Frouin, R. and Pelletier, B. (2007). Fields of nonlinear regression models for atmospheric correction of satellite ocean-color imagery.Remote Sensing of Environment,Vol. 111, pp. 450-465.
[8] Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A Distribution- Free Theory of Nonparametric Regression. Springer-Verlag, New-York.
[9] Hastie, T. and Tibshirani, R. (1993). Varying coefficient models,Journal of the Royal Statistical Society, Series B,Vol. 55, pp. 757-796.
[10] Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regres- sion.The Annals of Statistics,Vol. 30, pp. 325-358.
[11] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators.Neural Networks,Vol. 2, pp. 359-366.
[12] Kohler, M. and Krzyzak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks.Nonparametric Statistics, Vol. 17, pp. 891-913.
[13] Lin, V.Y. and Pinkus, A. (1993). Fundamentality of ridge functions.Journal of Approximation Theory,Vol. 75, pp. 295-311.
[14] Maiorov, V. (1999). On best approximation by ridge functions. Journal of Approximation Theory,Vol. 99, pp. 68-94.
[15] Pelletier, B. (2004). Approximation by ridge function fields over compact sets.Journal of Approximation theory,Vol. 129, pp. 230-239.
[16] Pelletier, B. and Frouin, R. (2006). Remote sensing of phytoplankton chlorophyll-a concentration by ridge function fields. Applied Optics, Vol.
45, pp784-798.
[17] Pelletier, B. and Frouin, R. (2004). Fields of nonlinear regression models for inversion of satellite data.Geophysical Research Letters, Vol. 31, L16304, doi 10.1029/2004GL019840.
[18] Pollard, D. (1984). Convergence of Stochastic Processes, Springer-Verlag, New-york.
[19] Vapnik, V.N. and Chervonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities.Theory of Probability and its Applications,Vol. 16, pp. 264-280.
[20] Wong, H., Ip, W., and Zhang, R. (2008). Varying-coefficient single-index model.Computational Statistics & Data Analysis,Vol. 52, pp. 1458-1476.