CONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION

(1)

HAL Id: hal-00460632

https://hal.archives-ouvertes.fr/hal-00460632

Submitted on 3 Mar 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires

CONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION

Robert Frouin, Bruno Pelletier

To cite this version:

Robert Frouin, Bruno Pelletier. CONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION. Communications in Statistics - Theory and Methods, Taylor &

Francis, 2009, 38 (8), pp.1272-1283. �10.1080/03610920802395702�. �hal-00460632�

(2)

C ONSISTENCY OF RIDGE FUNCTION FIELDS FOR VARYING NONPARAMETRIC REGRESSION

Robert FROUIN^aand Bruno PELLETIER^b,^∗

aScripps Institution of Oceanography University of California, San Diego

9500 Gilman Drive, La Jolla, CA 92093-0224, USA [email protected]

aInstitut de Mathématiques et de Modélisation de Montpellier UMR CNRS 5149, Equipe de Probabilités et Statistique

Universit´e Montpellier II, CC 051

Place Eug`ene Bataillon, 34095 Montpellier Cedex 5, France [email protected]

Abstract

A nonparametric regression model proposed in [Pelletier and Frouin, Ap- plied Optics, 2006] as a solution to the geophysical problem of ocean color remote sensing is studied. The model, called ridge function field, com- bines a regression estimate in the form of a superposition of ridge functions, or equivalently a neural network, with the idea pertaining to varying- coefficients models, where the parameters of a parametric family are allowed to vary with other variables. Under mild assumptions on the underlying distribution of the data, the strong universal consistency of the least-squares ridge function fields estimate is established.

Index Terms— Varying coefficients model, ridge function approximation, nonparametric regression, universal consistency, least squares.

AMS 2000 Classification: 62G08, 62G05.

∗Corresponding author.

(3)

1 Introduction

The motivation for the present work comes from the geophysical problem of ocean color remote sensing from space, which we shall describe first. In this problem, the aim is to predict the value of an oceanic parameter Y (e.g., the surface phytoplankton pigment concentration) from a vectorXof radiometric measurements at several wavelengths acquired by a sensor onboard a satellite platform. This is in fact an inverse problem where the measurements X depend on Y and other parameters, such as the aerosol optical properties, through a given forward operator (governing the radiative transfer in the ocean-atmosphere system) contami- nated by a random measurement noise. In particular, X depends on a vectorT of angular variables describing the relative positions of the Sun and the sensor with respect to the target; as such, T is observed simultaneously with X. The difficulty, though, is that the operator to be inverted, resulting from modeling of scattering and absorption processes at the micro-physical scale, is rather complex.

To circumvent this issue, the approach taken up in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007) consists in i) sampling the forward operator according to some prior distribution, ii) selecting a reasonable noise model, and iii) estimating the regression function m of Y on X and T from the data (X1, T1, Y1), . . . ,(Xn, Tn, Yn), wheremis defined by

m(x, t) = E_X=x,T_=t

Y|X, T .

In this display, the variable T is independent from the response Y but not from the predictorX. As such, T acts as an effect modifierin the sense of Hastie and Tibshirani (1993) in the context of varying coefficient models, i.e., the shape of the relation explaining Y from X is affected by the values taken by T. Let us recall that a varying-coefficient model is a linear model the parameters of which are allowed to vary with other variables. Since the work of Hastie and Tibshirani (1993), varying-coefficient models have been studied by several authors (see e.g., Fan and Zhang, 1999, 2000; Fan, Yao, and Cai, 2003, Wong, Ip, and Zhang, 2008, and the references therein). By relaxing to some extent the parametric constraint, these models have proved useful in various application contexts, and in particular in a high-dimensional setting.

As pointed out by Hastie and Tibshirani (1993) as well as Fan and Zhang (1999), the idea of allowing the parameters of a linear model to vary with other variables

(4)

is not new, and is widely used in the literature. More generally, one may start with an arbitrary parametric family in place of a linear model, and allow its parameters to vary with other variables. This idea is developed in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007) where we considered models of the form

ζ_n^∗ : (x, t)7→ζ_n^∗(x, t) :=f x;θn(t)

. (1.1)

In model (1.1), the functionf(.;θn)belongs to a setRnof functions, about which more will be said later, parameterized by a vector θn, which in (1.1) is allowed to vary witht. Additionally, to obtain an implementable model, each coordinate map of the application t 7→ θn(t) in (1.1) is taken to lie in a parametric set Tn

of functions oft. Now suppose that X and T are taking values in some subsets X ⊆R^dandS ⊆R^p, respectively. Another way of formalizing this construction is to view the model (1.1) as a function ofxbeing “attached” to each pointtinS, i.e., in mathematical terms, as a map

ζn:S → Rn, (1.2) such that the map ζ_n^∗ in (1.1) is the representation of ζn over the product space X × S, i.e., we let ζ_n^∗(x, t) = ζn(t)(x). Thus the representation (1.2) highlights the fact that the variableT acts as an effect modifier.

In the experiments reported in Pelletier and Frouin (2004, 2006) and Frouin and Pelletier (2007), the sequenceRnof functions in (1.2) is taken as the set spanned by functions of the ridge form or, equivalently, neural networks, and the resulting models (1.1)-(1.2) is called a ridge function field over S. The sets Rn form a nested sequence of the formR1 ⊆ R2 ⊆. . .. There exists a vast literature on regression estimation with these ridge functions and neural networks. In particular, it is known that the union of sets spanned by ridge functions is dense in L2(µ), for any probability measure µ (see, Cybenko, 1989; Hornik, Stinchcombe, and White, 1989; Barron, 1993; Lin and Pinkus, 1993; Burger and Neubaeur, 2001;

Maiorov, 1999), and that they yield consistent nonparametric regression proce- dures (see e.g. Gyorfi, Kohler, Krzyzak, and Walk, 2002, Chapter 16). Similarly, the sets T_n have been chosen to form an increasing sequenceT₁ ⊆ T₂ ⊆ . . . of sets of real-valued functions onS.

Density results for ridge function fields have been obtained in Pelletier (2004) in the compact-open topology. In particular, under mild assumptions on the sets (Rn)nand(Tn)n, the union of ridge function fields is dense. Thus under these conditions, ridge function fields are suitable candidates for nonparametric regression,

(5)

since the approximation error of the estimate will decrease to 0 as n increases.

The purpose of the present work is to establish the strong universal consistency of the least-squares regression estimate in the form of (1.1)-(1.2).

The paper is organized as follows. Section 2 introduces ridge function fields and presents our main result (Theorem 2.1), which states that ridge function fields adjusted by least squares arestrongly universally consistent, i.e., that for any distribution of(X, T, Y)satisfying mild assumptions, theL2 error of the regression estimate converges to 0 with probability 1 as the sample size tends to infinity.

Section 3 is devoted to the proof of Theorem 2.1.

2 Ridge function fields

Let(X, T, Y)be a random object, whereX,T, andY take values inR^d,R^p, and R, respectively. We shall assume that X, T, and Y are bounded, i.e., that there exists positive constantsCX,CT andCY such that

|X| ≤CX, |T| ≤CT, |Y| ≤CY, with probability 1. (2.1) Givenn independent copies(X1, T1, Y1), . . . ,(Xn, Tn, Yn)of(X, T, Y), our aim is to construct an estimate of the regression function mof Y on the pair(X, T), i.e.,

m(x, t) =E

Y|(X, T) = (x, t) .

Let us start with the definition of a ridge function. A ridge function on R^d is a function of the form σ ha, xi

where σ : R → R is a given function, where a∈R^d, and whereh., .idenotes the usual scalar product onR^d. Approximation by ridge functions refers to approximation by linear combinations of ridge functions.

There exists several variants of approximation by ridge functions, according to whether the function σ is fixed or not. In the following, we shall consider the familyRnof functions onR^ddefined by

Rn=n x7→

Kn

X

j=1

cjσ haj, xi+bj

+c0 : aj ∈R^d, bj ∈R, c0, cj ∈R,

j = 1, . . . , Kn

o,

(6)

whereKnis an integer, and whereσ : R→ Ris a fixedsquashing function, i.e., σis nondecreasing, and satisfies the following two conditions:

u→−∞lim σ(u) = 0 and lim

u→∞σ(u) = 1.

Next, let Ψ₁,Ψ₂, . . . : R^p → R be bounded basis functions, the linear span of which is dense inC R^p

in the topology of uniform convergence on compact sets, i.e., the set

∞

[

k=1

nt7→

k

X

j=1

ajΨj(t) : a1, . . . , ak ∈Ro is dense in C R^p

. Without loss of generality, we assume that |Ψi| ≤ 1for all i ≥ 1. Next, given an integerL ∈ Nand a real numberρ >0, we shall consider the following families of functions onR^p:

T(L) = n t7→

L

X

j=1

ajΨj(t) : a1, . . . , ak∈Ro ,

Tρ(L) = n t7→

L

X

j=1

ajΨj(t) : a1, . . . , ak∈R,

L

X

j=1

|aj| ≤ρo .

Then, given integers L^a_n, L^b_n, and L^c_n, and a positive number ρn, we define the familyFnof ridge function fields by

Fn =n

(x, t)7→

Kn

X

j=1

cj(t)σ haj(t), xi+bj(t)

+c0(t) : aj ∈ T(L^a_n)d

, bj ∈ T(L^b_n), c0, cj ∈ Tρn(L^c_n), j = 1, . . . , Kn

o .

We shall also consider the subset F˜n of Fn where only the coefficients cj are allowed to vary with the variablet, i.e., we define

F˜n=n

(x, t)7→

Kn

X

j=1

cj(t)σ haj, xi+bj

+c0(t) : aj, bj ∈R, c0, cj ∈ Tρn(L^c_n),

j = 1, . . . , Kn

o.

(7)

Both sets Fn andF˜n are dense in the topology of uniform convergence on compact sets (Pelletier, 2004). Consequently,Fn andF˜nare also dense inL2 for any distribution with bounded support and, as exposed in the Introduction, the approximation error of the estimates will decrease to 0 asn→ ∞.

Let us first define the mapsm^†_nandm˜^†_nas any minimizer of the empiricalL2risk overFnandF˜n, respectively, i.e.,mnandm˜nare such that

1 n

n

X

i=1

m^†_n(Xi, Ti)−Yi

2

= inf

f∈Fn

1 n

n

X

i=1

f(Xi, Ti)−Yi

2

, (2.2) 1

n

X

i=1

m˜^†_n(Xi, Ti)−Yi

2

= inf

f˜∈F˜n

1 n

n

X

i=1

f(X˜ i, Ti)−Yi

2

. (2.3) Then we define the estimates mn and m˜n as truncated versions of m^†_n and m˜^†_n respectively, i.e.,

mn(x, t) = Tβnm^†_n(x, t), (2.4)

˜

mn(x, t) = Tβnm˜^†_n(x, t), (2.5) where(βn)nis a sequence of positive numbers such thatβn→ ∞, and whereTβn

is the truncation operator defined byTβnu= min{|u|, βn}sign(u).

We are now in a position to state our main result.

Theorem 2.1 Letmnandm˜nbe the truncated least-squares estimates defined by (2.2),(2.3),(2.4), and(2.5).

(i) If

Kn→ ∞, L^a_n→ ∞, L^b_n→ ∞, L^c_n→ ∞, βn→ ∞, ρn → ∞,

asn→ ∞in such a way that

Knβ_n⁴(L^a_n+L^b_n+L^c_n) log(βnρnKn)

n →0 and β_n⁴

n^1−δ →0, for someδ >0asn→ ∞, then

n→∞lim Z

mn(x, t)−m(x, t)2

µ(dx, dt) = 0 with probability 1,

for every distribution of(X, T, Y)satisfying (2.1).

(8)

(ii) If

Kn→ ∞, L^c_n→ ∞, βn→ ∞, and ρn→ ∞, asn→ ∞in such a way that

Knβ_n⁴L^c_nlog(βnρnKn)

n →0 and β_n⁴

n^1−δ →0, for someδ >0asn→ ∞, then

n→∞lim Z

˜

mn(x, t)−m(x, t)2

µ(dx, dt) = 0 with probability 1,

for every distribution of(X, T, Y)satisfying (2.1).

Theorem 2.1 states that nonparametric regression estimation with ridge function fields is a consistent procedure as long as the underlying distribution satisfies (2.1).

Note first that the extension to a multivariate response Y is straightforward, and second that the family Fñis a subset of the family Fn. Consequently,Fn offers more flexibility than Fñ but, as expected, at the expense of an increased com- plexity of the fitting algorithm. More generally, one may consider replacing the familyRnin the definition of a set of function fields by any other nested sequence of models used in nonparametric regression. That said, regarding the ocean color remote sensing problem, the choice of using ridge functions has been dictated not only by their approximation properties, but more importantly by the speed of ex- ecution of these models, which should be high for processing large data sets, as provided by satellite imagery. Models belonging to Fn have been implemented and evaluated in Pelletier and Frouin (2006) for the retrieval of the chlorophyll-a concentration, i.e., a univariate response, while the family Fñ has been used in Frouin and Pelletier (2007) for the retrieval of the spectral marine reflectance, a multivariate output. In practice, the minimization of the empiricalL2risk in (2.2) and (2.3) may be solved using a stochastic gradient descent algorithm (see Pel- letier and Frouin, 2006 for details).

Let us also mention that assumption (2.1), which is reasonable from a physical perspective, may be weakened at the price of a little extra work. As a matter of fact, one may assume that the response Y is unbounded and use a truncation argument, as in, e.g., Kohler and Krzyzak (2005), and Gyorfi, Kohler, Krzyzak, and Walk (2002, Theorem 10.2). Next, ifXandT are unbounded, it is immediate to show that the union of the sets F_n is dense in L2(µ) for any distribution µ

(9)

provided that the sets T(L) contain the constant and affine functions of t. On the other hand, a little extra work would be needed to derive the density in L2

of the union of the F˜n’s. The study of the convergence rates is left for future research. In this perspective, let us emphasize that the respective influences of the variablesX andT on the responseY are well separated in a ridge function field, by construction, which may prove useful for adaptation over anisotropic classes of regression functions (see e.g. Hofmann and Lepski, 2002).

3 Proofs

3.1 Technical Lemmas

We shall need the following technical Lemmas to derive an upper-bound on the covering number of Fn. Let us start by introducing the notations and definitions used hereafter.

First of all, letG be a set of real valued functions onR^d, and letx1, . . . , xn ben fixed points inR^d. For any functionsg1, g2 :R^d→R, set

dist_1,n(g₁, g₂) =P_n |g₁−g₂| ,

whereP_nis the discrete uniform measure on{x1, . . . , xn}. Theε-covering number ofGwith respect todist1,nis called theL1ε-covering number ofGon{x₁, . . . , xn} and will be denoted byN₁ ε,G, xⁿ₁

, i.e., N₁ ε,G, xⁿ₁

is the smallest integerN such that there exists functionsg1, . . . , gnsuch that

1≤j≤Nmin 1 n

n

X

i=1

|g(xi)−gj(xi)| ≤ε,

for allg ∈ G.

Similarly to the above, the L1 ε-packing number of G on {x₁, . . . , xn}, further denoted byM₁(ε,G, xⁿ₁), is the maximal integerNsuch that there exists functions g1, . . . , gN inGsuch that

1 n

n

X

i=1

gj(xi)−gk(xi) ≥ε,

(10)

for all1≤j < k≤N.

Finally,G⁺will denote the set of all subgraphs of functions inG, i.e., G⁺ =

(x, u)∈R^d×R : u≤g(x), g ∈ G ,

and the Vapnik-Chervonenkis dimension of G⁺ will be denoted byVG⁺ (Vapnik and Chervonenkis, 1971).

The following results, extracted from Lemmas 16.3, 16.4, and 16.5 in Gyorfi, Kohler, Krzyzak, and Walk (2002), will be needed in the following, and are given to make the paper self-contained. LetF be a set of functions R^d → R. Given a squashing functionσ, defineG={σ ◦ f : f ∈ F}. Then

V_G⁺ ≤V_F⁺. (3.1)

Given two setsF andGof functionsR^d→R, setF ⊕ G ={f+g : f ∈ F, g ∈ G}, and let{x₁, . . . , xn}benpoints inR^d. Then

N1 ε+δ,F ⊕ G, xⁿ₁

≤ N1 ε,F, xⁿ₁

N1 δ,G, xⁿ₁

. (3.2)

Given two sets F andG of functions R^d → R, uniformly bounded over R^d by some constantsC_F andC_G, respectively, setF ⊙ G ={f g : f ∈ F, g ∈ G}, and let{x1, . . . , xn}benpoints inR^d. Then

N₁ ε+δ,F ⊙ G, xⁿ₁

≤ N₁ ε/CG,F, xⁿ₁

N₁ δ/CF,G, xⁿ₁

. (3.3) At last, given a setGof functionsR^d →[0;B]withV_G⁺ ≥2,npoints{x1, . . . , xn} in R^d, and 0 < ε < B/4, we have (Gyorfi, Kohler, Krzyzak, and Walk (2002), Theorem 9.4)

M₁ ε,G, xⁿ₁

≤3 2eB

ε log 3eB ε

V_G+

. (3.4)

We may now state an upper-bound on the covering number ofFn. Lemma 3.1 Let N₁ ε,Fn,(X, T)ⁿ₁

be the L1 ε-covering number of Fn on the sample(X1, T1), . . . ,(Xn, Tn). Then

N₁ ε,Fn,(X, T)ⁿ₁

≤3

108eρn(Kn+ 1) ε

2(L^c_n+1)+2Kn(L^a_nd+L^b_n+L^c_n+2)

. (3.5)

(11)

Proof. Consider the following sets of functions:

G₁ =

(x, t)7→ ha(t), xi+b(t) : a1, . . . , ad ∈ T(L^a_n), b∈ T(L^b_n) , G₂ =

(x, t)7→σ ha(t), xi+b(t)

: a1, . . . , ad∈ T(L^a_n), b∈ T(L^b_n) , G₃ =

(x, t)7→c(t)σ ha(t), xi+b(t)

:a1, . . . , ad∈ T(L^a_n), b∈ T(L^b_n), c∈ Tρn(L^c_n) .

SinceG₁ is a vector space of dimensionL^a_nd+L^b_n, we have V_G⁺

1 ≤L^a_nd+L^b_n+ 1.

Next, (3.1) yields

V_G⁺

2 ≤V_G⁺

1 .

Consequently, since|σ(u)| ≤1for allu∈R, and using (3.4), we obtain N1 ε,G2,(X, T)ⁿ₁

≤ M1 ε,G2,(X, T)ⁿ₁

≤ 3

"

2e ε log

3e ε

#V_G+ 2

≤ 3 3e

ε

2(L^and+L^bn+1)

. (3.6)

Now, using the inequality (3.3) leads to N1 ε,G3,(X, T)ⁿ₁

≤ N1

ε

2,Tρn(L^c_n),(X, T)ⁿ₁ N1

ε 2ρn

,G2,(X, T)ⁿ₁ .

But, using (3.4), we have N1 ε,Tρn(L^c_n),(X, T)ⁿ₁

≤ M1 ε,Tρn(L^c_n),(X, T)ⁿ₁

≤ 3

"

2e(2ρn)

ε log

3e(2ρn) ε

#V_T

ρn(Lcn)+

≤ 3 6eρn

ε

2(L^c_n+1)

, (3.7)

since

V_T

ρn(L^cn)⁺ ≤V_T_(L^c_n₎⁺ ≤L^c_n+ 1,

(12)

and sinceT(L^c_n)is a vector space of dimensionL^c_n. Then from (3.6) and (3.7), it follows that

N₁ ε,G₃,(X, T)ⁿ₁

≤ 3 6eρn

ε/2

2(L^c_n+1)

3

3e ε/(2ρn)

2(L^a_nd+L^b_n+1)

= 9

12eρn

ε

2(L^cn+1) 6eρn

ε

2(L^and+L^bn+1)

≤ 9

12eρn

ε

2(L^a_nd+L^b_n+L^c_n+2)

. (3.8)

Applying (3.2) yields N1 ε,Fn,(X, T)ⁿ₁

≤ N1

ε

Kn+ 1,Tρn(L^c_n),(X, T)ⁿ₁ "

N1

ε

Kn+ 1,G3,(X, T)ⁿ₁ #Kn

. Finally, by reporting (3.7) and (3.8) in the equation above, we obtain the upper

bound:

N₁ ε,Fn,(X, T)ⁿ₁

≤ 3

6eρn

ε/(Kn+ 1)

2(L^cn+1)"

9

12eρn

ε/(Kn+ 1)

2(L^and+L^bn+L^cn+2)#Kn

≤ 3

108eρn(Kn+ 1) ε

2(L^cn+1)+2Kn(L^and+L^bn+L^cn+2)

.

3.2 Proof of Theorem 2.1

By Lemma 10.2 in Gyorfi, Kohler, Krzyzak, and Walk (2002), we have the inequality

Z

mn(x, t)−m(x, t)

2µ(dx, dt)

≤ inf

{f∈Fn:kfk∞≤βn}

Z

f(x, t)−m(x, t)

µ(dx, dt)

+ 2 sup

f∈T_βnFn

1 n

n

X

i=1

f(Xi, Ti)−Yi

2

−E

n f(Xi, Ti)−Yi

2o

, (3.9)

(13)

where the first term is the approximation error, and where the second term is a uniform upper-bound overTβnFn of the estimation error. Since the union of the Fn’s is dense inC(R^d×R^p)in the topology of uniform convergence on compact sets, and sinceXandT are bounded by assumption, the union of theFn’s is dense inL2(µ)for everyµwith bounded support. Then it follows that the approximation error of m by an element of {f ∈ Fn : kfk_∞ ≤ βn} converges to zero as Kn, L^a_n, L^b_n, L^c_n, βn, ρn → ∞, and part (i) of the theorem will be proved if we show that

sup

f∈T_βnFn

1 n

n

X

i=1

f(Xi, Ti)−Yi

2

−En

f(Xi, Ti)−Yi

2o

→0, (3.10) a.s. asKn, Lâ_n, L^b_n, L^c_n, βn, ρn→ ∞. The same arguments hold for the familyFñ, so that part (ii) of the theorem will be proved if we show that (3.10) is satisfied withTβnFnreplaced byTβnFñ.

To bound the second term in the right hand side of (3.9), let us first define the set Hnof functions by

Hn =

h:R^d×R^p×R∋(x, t, y)7→ |f(x, t)−y|²1[−CY;CY](y) : f ∈TβnFn , where1[−CY;CY](.)is the characteristic function of the interval[−CY;CY]. Next, we shall setZi = (Xi, Ti, Yi), fori= 1, . . . , n.

For everyf inTβnFn, we have|f(x, t)| ≤βn. Hence fornlarge enough such that CY ≤βn,

|f(x, t)−y|²1[−CY;CY](y)≤4β_n²,

so that the range of each function h in Hn is included in the interval

0; 4β_n² . Then from the proof of Theorem 24 in Pollard (1984, p. 25), or Theorem 9.1 in Gyorfi, Kohler, Krzyzak, and Walk (2002, p. 136), we obtain for any ε > 0the following inequality:

P (

sup

f∈T_βnFn

1 n

n

X

i=1

|f(Xi, Ti)−Yi|²−E

|f(X, T)−Y|²

> ε )

=P (

sup

h∈Hn

1 n

n

X

i=1

h(Zi)−E h(Z)

> ε )

≤8EN₁ε

8,Hn, Z₁ⁿ

exp − nε² 128 4β_n²2

!

. (3.11)

(14)

Now we proceed to bound the covering number in (3.11). Givenh₁andh₂ inHn, corresponding respectively tof1andf2 inTβnFn, we have

1 n

n

X

i=1

|h₁(Zi)−h2(Zi)| = 1 n

n

X

i=1

|f₁(Xi, Ti)−Yi|²− |f₂(Xi, Ti)−Yi|² a.s.

≤ 4βn

1 n

n

X

i=1

large enough. Therefore, iff1, . . . , fNis aL1ε-cover ofTβnFnon(X1, T1), . . . ,(Xn, Tn) with

N =N₁ ε, TβnF_n,(X, T)ⁿ₁ ,

there exists a L1 2ε-coverf₁^′, . . . , f_N^′ ofTβnFn on(X, T)ⁿ₁ withf_j^′ ∈TβnFn, for allj = 1, . . . , n. Then using the inequality (3.12) we conclude that

N₁ε

8,Hn, Z₁ⁿ

≤ N₁ ε 64βn

, TβnFn,(X, T)ⁿ₁

!

. (3.13)

Using Lemma 3.1 together with the fact thatN₁ ε, TβnFn,(X, T)ⁿ₁

≤ N₁ ε,Fn,(X, T)ⁿ₁ , it follows that

P (

sup

f∈T_βnFn

1 n

n

X

i=1

|f(Xi, Ti)−Yi|²−E

|f(X, T)−Y|²

> ε )

≤24

6912eβnρn(Kn+ 1) ε

2(L^cn+1)+2Kn(L^and+L^bn+L^cn+2)

×exp

− nε² 2048β_n⁴

= 24 exp (

−n^δn^1−δ β_n⁴

ε²

2048 −2β_n⁴

L^c_n+ 1 +Kn(L^a_nd+L^b_n+L^c_n+ 2) n

×log

6912βnρn(Kn+ 1) ε

)

, (3.14)

for someδ >0.

(15)

Consequently, if

Kn→ ∞, L^a_n→ ∞, L^b_n→ ∞, L^c_n→ ∞, βn→ ∞, ρn → ∞, asn→ ∞in such a way that

Knβ_n⁴(L^a_n+L^b_n+L^c_n) log(βnρnKn)

n →0 and β_n⁴

n^1−δ →0, for someδ >0asn→ ∞, we deduce from (3.14) that

X

n≥1

P (

sup

f∈T_βnFn

1 n

n

X

i=1

|f(Xi, Ti)−Yi|² −E

|f(X, T)−Y|²

> ε )

<∞.

(3.15) Part (i) of Theorem 2.1 then follows from the Borel-Cantelli Lemma.

To prove part (ii), it suffices to setL^a_n = 1andL^b_n = 1. Then one obtains (3.15) from (3.14) if

Kn → ∞, L^c_n → ∞, βn → ∞, and ρn → ∞, asn→ ∞in such a way that

Knβ_n⁴L^c_nlog(βnρnKn)

n →0 and β_n⁴

n^1−δ →0,

for someδ > 0asn → ∞, which together with the Borel-Cantelli Lemma con- cludes the proof of the second part of Theorem 2.1.

Acknowledgment

This work was supported by the National Aeronautics and Space Administration under Grant No. NNX08AF65A. The authors are indebted to an anonymous ref- eree for a careful reading of the paper and insightful comments.

References

[1] Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information Theory, Vol. 39, pp.

930-945.

(16)

[2] Burger, M. and Neubauer, A. (2001). Error bounds for approximation with neural networks.Journal of Approximation Theory,Vol. 112, pp. 235-250.

[3] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal func- tionMathematics of Control, Signals, and Systems,Vol. 2, pp. 303-314.

[4] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficients models.The Annals of Statistics,Vol. 27, pp. 1491-1518.

[5] Fan, J. and Zhang, J.-T. (2000). Two-step estimation of functional linear models with applications to longitudinal data.Journal of the Royal Statisti- cal Society, Series B,Vol. 62, pp.303-322.

[6] Fan, J., Yao, Q., and Cai, Z. (2003). Adaptive varying-coefficient linear models.Journal of the Royal Statistical Society, Series B,Vol. 65, pp 57-80.

[7] Frouin, R. and Pelletier, B. (2007). Fields of nonlinear regression models for atmospheric correction of satellite ocean-color imagery.Remote Sensing of Environment,Vol. 111, pp. 450-465.

[8] Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A Distribution- Free Theory of Nonparametric Regression. Springer-Verlag, New-York.

[9] Hastie, T. and Tibshirani, R. (1993). Varying coefficient models,Journal of the Royal Statistical Society, Series B,Vol. 55, pp. 757-796.

[10] Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regression.The Annals of Statistics,Vol. 30, pp. 325-358.

[11] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators.Neural Networks,Vol. 2, pp. 359-366.

[12] Kohler, M. and Krzyzak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks.Nonparametric Statistics, Vol. 17, pp. 891-913.

[13] Lin, V.Y. and Pinkus, A. (1993). Fundamentality of ridge functions.Journal of Approximation Theory,Vol. 75, pp. 295-311.

[14] Maiorov, V. (1999). On best approximation by ridge functions. Journal of Approximation Theory,Vol. 99, pp. 68-94.

(17)

[15] Pelletier, B. (2004). Approximation by ridge function fields over compact sets.Journal of Approximation theory,Vol. 129, pp. 230-239.

[16] Pelletier, B. and Frouin, R. (2006). Remote sensing of phytoplankton chlorophyll-a concentration by ridge function fields. Applied Optics, Vol.

45, pp784-798.

[17] Pelletier, B. and Frouin, R. (2004). Fields of nonlinear regression models for inversion of satellite data.Geophysical Research Letters, Vol. 31, L16304, doi 10.1029/2004GL019840.

[18] Pollard, D. (1984). Convergence of Stochastic Processes, Springer-Verlag, New-york.

[19] Vapnik, V.N. and Chervonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities.Theory of Probability and its Applications,Vol. 16, pp. 264-280.

[20] Wong, H., Ip, W., and Zhang, R. (2008). Varying-coefficient single-index model.Computational Statistics & Data Analysis,Vol. 52, pp. 1458-1476.