
2. Projection-based curve clustering

The idea is to project $X$ by considering only the first $d$ coefficients of its expansion on a Hilbert basis, and then to perform clustering in $\mathbb{R}^d$. We study the theoretical properties of this clustering method with projection. A bound expressing what is lost when replacing the empirically optimal cluster centers by the centers obtained by projection is offered (Section 2.2). Since the result may depend on the basis choice, several projection bases are used in practice, and we look for the best one, minimizing a given criterion. To this end, an algorithm based on Coifman and Wickerhauser [54]

is implemented, searching for an optimal basis among a library of wavelet packet bases available in the R package wmtsa, and this “optimal basis” is compared with the Fourier basis, the Haar basis, and the functional principal component basis (Section 2.3). Finally, this algorithm is applied to a simulated example and to our industrial problem (Section 2.4). Proofs are postponed to Section 2.6.
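To fix ideas, here is a minimal numerical sketch of this strategy, assuming curves observed on a regular grid of $[0,1]$; the trigonometric basis and the plain Lloyd iterations below are illustrative stand-ins (all function names are ours), not the implementation used in this chapter.

```python
# Sketch of projection-based curve clustering: expand each curve on a basis,
# keep the first d coefficients, and cluster the coefficient vectors in R^d.
import numpy as np

def fourier_design(t, d):
    """First d elements of the trigonometric basis on [0, 1], as columns."""
    cols = [np.ones_like(t)]
    j = 1
    while len(cols) < d:
        cols.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * j * t))
        if len(cols) < d:
            cols.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * j * t))
        j += 1
    return np.stack(cols, axis=1)  # shape (m, d)

def lloyd(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations in R^d, minimizing the empirical distortion."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep a center unchanged if its cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Toy usage: three groups of noisy sinusoids, clustered through d coefficients.
rng = np.random.default_rng(1)
n, m, d, k = 120, 256, 9, 3
t = np.linspace(0.0, 1.0, m, endpoint=False)
groups = rng.integers(0, k, size=n)
curves = (np.sin(2.0 * np.pi * (groups[:, None] + 1.0) * t[None, :])
          + 0.1 * rng.standard_normal((n, m)))
coeffs = curves @ fourier_design(t, d) / m  # quadrature approximation of <X, phi_j>
centers, labels = lloyd(coeffs, k)
```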

2.2. Finite-dimensional projection for clustering

As mentioned earlier, we are concerned with square integrable functions. Since all results can be adapted to $L^2([a,b])$ by an appropriate rescaling, we consider for the sake of simplicity the space $L^2([0,1])$. As an infinite-dimensional separable Hilbert space, $L^2([0,1])$ is isomorphic, via the choice of a Hilbert basis, to the space $\ell^2$ of square-summable sequences. We focus more particularly on functions in $L^2([0,1])$ whose coefficients in the expansion on a given Hilbert basis belong to the subset $S$ of $\ell^2$ given by

$$S = \Big\{ (x_j)_{j \geq 1} \in \ell^2 : \sum_{j \geq 1} \varphi_j x_j^2 \leq R^2 \Big\},$$

where $R > 0$ and $(\varphi_j)_{j \geq 1}$ is a nonnegative increasing sequence such that

$$\lim_{j \to +\infty} \varphi_j = +\infty.$$

It is worth pointing out that S is closely linked with the basis choice, even if the basis does not appear explicitly in the definition. To illustrate this important fact, three examples are discussed below.

Example 2.2.1 (Sobolev ellipsoids). For $\beta \in \mathbb{N}$ and $L > 0$, the periodic Sobolev class $W_{per}(\beta, L)$ is the space of all functions $f : [0,1] \to \mathbb{R}$ such that $f^{(\beta-1)}$ is absolutely continuous, $\int_0^1 \big(f^{(\beta)}(x)\big)^2 dx \leq L^2$, and $f^{(j)}(0) = f^{(j)}(1)$ for $j = 0, \dots, \beta - 1$. A function $f$ belongs to $W_{per}(\beta, L)$ if and only if the coefficients of its expansion on the trigonometric basis belong to the set $S$, where

$$\varphi_j = \begin{cases} j^{2\beta} & \text{for even } j,\\ (j-1)^{2\beta} & \text{for odd } j, \end{cases} \qquad \text{and} \qquad R = \frac{L}{\pi^{\beta}}.$$

For the proof of this result and further details about Sobolev classes, we refer the reader to the book of Tsybakov [183]. Note that the set $S$ could also be defined by $\varphi_j = j^r e^{\alpha j}$ with $\alpha > 0$ and $r \geq -\alpha$ (Tsybakov [182]).
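As a concrete illustration, the following sketch implements the membership test $\sum_j \varphi_j x_j^2 \leq R^2$ with the weights above; the truncation of the series and the parameter values are our illustrative choices.

```python
# Numerical check of membership in the Sobolev ellipsoid of Example 2.2.1,
# with phi_j = j^(2*beta) for even j, (j-1)^(2*beta) for odd j, R = L/pi^beta.
import numpy as np

def phi(j, beta):
    """Weight sequence of the Sobolev ellipsoid."""
    return float(j) ** (2 * beta) if j % 2 == 0 else float(j - 1) ** (2 * beta)

def in_ellipsoid(x, beta, L):
    """Check sum_j phi_j x_j^2 <= R^2 for a (truncated) coefficient sequence x."""
    R = L / np.pi ** beta
    total = sum(phi(j, beta) * xj ** 2 for j, xj in enumerate(x, start=1))
    return total <= R ** 2

# Coefficients decaying like j^(-beta-1) stay inside for a small enough scale.
beta, L = 2, 10.0
x = 0.05 * np.arange(1, 200, dtype=float) ** (-(beta + 1))
print(in_ellipsoid(x, beta, L))  # True
```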

Example 2.2.2 (Reproducing Kernel Hilbert Spaces). Let $K : [0,1] \times [0,1] \to \mathbb{R}$ be a Mercer kernel, i.e., $K$ is continuous, symmetric and positive definite. Recall that a kernel $K$ is said to be positive definite if, for every finite set $\{x_1, \dots, x_m\}$, the matrix $A$ defined by $a_{ij} = K(x_i, x_j)$ for $1 \leq i, j \leq m$ is positive definite. For example, the Gaussian kernel $K(x,y) = \exp(-(x-y)^2/\sigma^2)$ and the kernel $K(x,y) = (c^2 + (x-y)^2)^{-a}$ with $a > 0$ are Mercer kernels. For $x \in [0,1]$, let $K_x : y \mapsto K(x,y)$. Then, the Moore-Aronszajn Theorem (Aronszajn [11]) states that there exists a unique Hilbert space $(\mathcal{H}_K, \langle \cdot, \cdot \rangle)$ of functions on $[0,1]$ such that:

1. For all $x \in [0,1]$, $K_x \in \mathcal{H}_K$.

2. The span of the set $\{K_x,\ x \in [0,1]\}$ is dense in $\mathcal{H}_K$.

3. For all $f \in \mathcal{H}_K$ and $x \in [0,1]$, $f(x) = \langle K_x, f \rangle$.

The Hilbert space $\mathcal{H}_K$ is said to be the reproducing kernel Hilbert space (for short, RKHS) associated with the kernel $K$. Next, the operator $K$ defined by

$$Kf : y \mapsto \int_0^1 K(x,y) f(x)\, dx$$

is self-adjoint, positive and compact. Consequently, there exists a complete orthonormal system $(\psi_j)_{j \geq 1}$ of $L^2([0,1])$ such that $K\psi_j = \lambda_j \psi_j$, where the set of eigenvalues $\{\lambda_j,\ j \geq 1\}$ is either finite or a sequence tending to $0$ at infinity. Moreover, the $\lambda_j$ are nonnegative. Suppose that $K$ is not of finite rank (so that $\{\lambda_j,\ j \geq 1\}$ is infinite) and that the eigenvalues are sorted in decreasing order, that is $\lambda_j \geq \lambda_{j+1}$ for all $j \geq 1$. Clearly, there is no loss of generality in assuming that $\lambda_j > 0$ for all $j \geq 1$. Indeed, if not, $L^2([0,1])$ is replaced by the linear subspace spanned by the eigenvectors corresponding to non-zero eigenvalues.

According to Mercer's theorem, $K$ has the representation

$$K(x,y) = \sum_{j=1}^{+\infty} \lambda_j \psi_j(x) \psi_j(y),$$


where the convergence is absolute and uniform (Cucker and Smale [60, Chapter III, Theorem 1]). Moreover, $\mathcal{H}_K$ may be characterized through the eigenvalues of the operator $K$ by

$$\mathcal{H}_K = \Big\{ f \in L^2([0,1]) : f = \sum_{j=1}^{+\infty} a_j \psi_j \ \text{ with } \ \sum_{j=1}^{+\infty} \frac{a_j^2}{\lambda_j} < +\infty \Big\}$$

(Cucker and Smale [60, Chapter III, Theorem 4]). Then, letting $\varphi_j = 1/\lambda_j$, the set $S$ corresponds to the ball of radius $R$ of $\mathcal{H}_K$, whose elements are identified with their coefficient sequences on the basis $(\psi_j)_{j \geq 1}$.
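To make the spectral quantities tangible, here is a small numerical sketch approximating the eigenvalues $\lambda_j$ and eigenfunctions $\psi_j$ of the operator $K$, assuming a Gaussian kernel; the uniform-grid (Nyström-type) discretization is our illustrative choice and is not part of the original development.

```python
# Sketch: approximate the eigenpairs of (Kf)(y) = int_0^1 K(x, y) f(x) dx
# by discretizing [0, 1] on a uniform grid (Nystrom-type approximation).
import numpy as np

m, sigma = 400, 0.2
x = (np.arange(m) + 0.5) / m  # midpoint quadrature nodes in [0, 1]
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / sigma ** 2)  # Gaussian kernel matrix

# Eigenpairs of G/m approximate those of the integral operator K.
evals, evecs = np.linalg.eigh(G / m)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort decreasingly, as in the text

print(evals[:5])           # the lambda_j decay very fast for the Gaussian kernel
psi = evecs * np.sqrt(m)   # columns approximate L2([0,1])-normalized psi_j
```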

where $\omega_r(f, t, [0,1])_2$ denotes the modulus of smoothness of $f$, as defined for instance in DeVore and Lorentz [70], and $r = \lfloor \alpha \rfloor + 1$. Let $\Lambda(j)$ be an index set

where $C > 0$ depends only on the basis. We refer to Donoho and Johnstone [74] for more details.

Let us now come back to the general setting

$$S = \Big\{ (x_j)_{j \geq 1} \in \ell^2 : \sum_{j \geq 1} \varphi_j x_j^2 \leq R^2 \Big\}.$$

Some notation and assumptions are in order. First, we will suppose that

$$\mathbb{P}\{\|X\| \leq R\} = 1.$$

Notice that the fact that $X$ takes its values in $S$ is in general not enough to imply $\mathbb{P}\{\|X\| \leq R\} = 1$. Secondly, let $j_0$ be the smallest integer $j$ such that $\varphi_j > 0$. To avoid technical difficulties, we require in the sequel $d \geq j_0$. For all $d \geq 1$, we will denote by $\Pi_d$ the orthogonal projection on $\mathbb{R}^d$ and let $S_d = \Pi_d(S)$. Lastly, observe that $S_d$ identifies with the ellipsoid

$$S_d = \Big\{ (x_1, \dots, x_d) \in \mathbb{R}^d : \sum_{j=1}^{d} \varphi_j x_j^2 \leq R^2 \Big\}.$$

As explained in the introduction, the criterion to minimize is the empirical distortion

$$W_n(\mathbf{c}) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \leq \ell \leq k} \|X_i - c_\ell\|^2,$$

and the quality of a collection of cluster centers $\mathbf{c} = (c_1, \dots, c_k) \in S^k$ is measured by the distortion

$$W(\mathbf{c}) = \mathbb{E} \min_{1 \leq \ell \leq k} \|X - c_\ell\|^2,$$

whose infimum $W^\star = \inf_{\mathbf{c} \in S^k} W(\mathbf{c})$ represents the optimal risk we can achieve. With the intention of performing clustering in the projection space $S_d$, we also introduce the “finite-dimensional” distortion

$$W_d(\mathbf{c}) = \mathbb{E} \min_{1 \leq \ell \leq k} \|\Pi_d(X) - \Pi_d(c_\ell)\|^2,$$

its empirical counterpart

$$W_{d,n}(\mathbf{c}) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \leq \ell \leq k} \|\Pi_d(X_i) - \Pi_d(c_\ell)\|^2,$$

and the corresponding optimal risk $W_d^\star = \inf_{\mathbf{c} \in S^k} W_d(\mathbf{c})$.

Let us observe that, as the support of the empirical measure $\mu_n$ contains at most $n$ points, there exists an element $\hat{\mathbf{c}}_{d,n}$ which is a minimizer of $W_{d,n}(\mathbf{c})$ on $S^k$. Moreover, in view of its definition, $W_{d,n}(\mathbf{c})$ only depends on the projection $\Pi_d(\mathbf{c})$ of the centers (one has $W_{d,n}(\mathbf{c}) = W_{d,n}(\Pi_d(\mathbf{c}))$ for all $\mathbf{c}$), and we can thus assume that $\hat{\mathbf{c}}_{d,n} \in (S_d)^k$. Notice also that, for all $c \in S$,

$$\|\Pi_d(X) - \Pi_d(c)\|^2 \leq \|X - c\|^2$$


(the projection $\Pi_d$ is 1-Lipschitz), which implies, for all $\mathbf{c}$, $W_d(\mathbf{c}) \leq W(\mathbf{c})$.
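The inequality above is easy to check numerically. The following sketch computes the empirical distortions on coefficient vectors and verifies that projecting can only decrease them; the truncation length $D$ and the synthetic coefficients are our illustrative choices.

```python
# Sketch: empirical distortions W_n(c) and W_{d,n}(c) on coefficient vectors,
# illustrating W_{d,n} <= W_n (Pi_d is 1-Lipschitz). D is a finite numerical
# horizon standing in for the infinite expansion; illustrative only.
import numpy as np

def distortion(points, centers):
    """(1/n) sum_i min_l ||x_i - c_l||^2, i.e. the empirical distortion."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
n, D, k, d = 200, 50, 3, 5
X = rng.standard_normal((n, D)) / np.arange(1, D + 1)  # decaying coefficients
c = rng.standard_normal((k, D)) / np.arange(1, D + 1)

W_n = distortion(X, c)                 # distortion in the ambient space
W_dn = distortion(X[:, :d], c[:, :d])  # distortion after the projection Pi_d
assert W_dn <= W_n + 1e-12
print(W_n, W_dn)
```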

The following lemma provides an upper bound for the maximal deviation $\sup_{\mathbf{c} \in S^k} [W(\mathbf{c}) - W_d(\mathbf{c})]$.

Lemma 2.2.1. We have

$$\sup_{\mathbf{c} \in S^k} \big[ W(\mathbf{c}) - W_d(\mathbf{c}) \big] \leq \frac{4R^2}{\varphi_d}.$$
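The proof is postponed to Section 2.6; the following short computation, sketched here for intuition under the definitions above and not reproducing the chapter's argument, indicates one way the constant $4R^2$ can arise. For any $x \in S$ and $d \geq j_0$, the tail of the expansion satisfies

$$\sum_{j > d} x_j^2 \leq \frac{1}{\varphi_{d+1}} \sum_{j > d} \varphi_j x_j^2 \leq \frac{R^2}{\varphi_{d+1}} \leq \frac{R^2}{\varphi_d},$$

since $(\varphi_j)_{j \geq 1}$ is increasing. Hence, using $(u - v)^2 \leq 2u^2 + 2v^2$ coordinatewise, for $X$ and $c_\ell$ with coefficients in $S$,

$$\|X - c_\ell\|^2 - \|\Pi_d(X) - \Pi_d(c_\ell)\|^2 = \sum_{j > d} (X_j - c_{\ell,j})^2 \leq 2 \sum_{j > d} X_j^2 + 2 \sum_{j > d} c_{\ell,j}^2 \leq \frac{4R^2}{\varphi_d},$$

and taking the minimum over $\ell$ (via $\min_\ell a_\ell - \min_\ell b_\ell \leq \max_\ell (a_\ell - b_\ell)$) and then expectations yields a bound of this form.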

We are now in a position to state the main result of this section.

Theorem 2.2.1. Let $\hat{\mathbf{c}}_{d,n} \in (S_d)^k$ be a minimizer of $W_{d,n}(\mathbf{c})$. Then,

$$\mathbb{E}[W(\hat{\mathbf{c}}_{d,n})] - W^\star \leq \mathbb{E}[W_d(\hat{\mathbf{c}}_{d,n})] - W_d^\star + \frac{8R^2}{\varphi_d}. \qquad (2.2)$$

Theorem 2.2.1 expresses the fact that the expected excess clustering risk in the infinite-dimensional space is bounded by the corresponding “finite-dimensional risk” plus an additional term representing the price to pay when projecting onto $S_d$. Moreover, the first term on the right-hand side of inequality (2.2) above is known to tend to $0$ as $n$ goes to infinity. More precisely, as $\mathbb{P}\{\|\Pi_d(X)\| \leq R\} = 1$, we have

$$\mathbb{E}[W_d(\hat{\mathbf{c}}_{d,n})] - W_d^\star \leq \frac{Ck}{\sqrt{n}},$$

where $C = 12R^2$ (Biau, Devroye, and Lugosi [32]). In our setting, to keep the same rate of convergence $O(1/\sqrt{n})$ in spite of the extra term $8R^2/\varphi_d$, $\varphi_d$ must be of the order $\sqrt{n}$. For Sobolev ellipsoids (Example 2.2.1), where $\varphi_j \geq (j-1)^{2\beta}$, this means a dimension $d$ of the order $n^{1/(4\beta)}$. When $\varphi_j = j^r e^{\alpha j}$, the rate of convergence is $O(1/\sqrt{n})$ as long as $d$ is chosen of the order $\ln n/(2\alpha)$. In the RKHS context (Example 2.2.2), consider the case of eigenvalues $\{\lambda_j,\ j \geq 1\}$ with polynomial or exponential-polynomial decay, which covers a broad range of kernels (Williamson, Smola, and Schölkopf [188]). If $\lambda_j = O(j^{-(\alpha+1)})$, $\alpha > 0$, then $1/\varphi_d = O(d^{-(\alpha+1)})$, and $d$ must be of the order $n^{1/(2\alpha+2)}$, whereas $\lambda_j = O(e^{-\alpha j^p})$, $\alpha, p > 0$, leads to a projection dimension $d$ of the order $(\ln n/(2\alpha))^{1/p}$. Obviously, the upper bound (2.2) is better for large $\varphi_d$, and consequently large $d$. Nevertheless, from a computational point of view, the projection dimension should not be chosen too large.
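In practice, these prescriptions are easy to tabulate. The sketch below computes, for a given sample size $n$, the projection dimension suggested by each regime discussed above; the rounding and the example parameter values are our illustrative choices.

```python
# Projection dimensions keeping the O(1/sqrt(n)) rate in each regime,
# transcribing the orders of magnitude from the discussion above.
import math

def d_sobolev(n, beta):       # phi_j ~ j^(2*beta)        -> d ~ n^(1/(4*beta))
    return math.ceil(n ** (1.0 / (4.0 * beta)))

def d_exponential(n, alpha):  # phi_j = j^r e^(alpha*j)   -> d ~ ln(n)/(2*alpha)
    return math.ceil(math.log(n) / (2.0 * alpha))

def d_rkhs_poly(n, alpha):    # lambda_j ~ j^-(alpha+1)   -> d ~ n^(1/(2*alpha+2))
    return math.ceil(n ** (1.0 / (2.0 * alpha + 2.0)))

def d_rkhs_exp(n, alpha, p):  # lambda_j ~ e^(-alpha*j^p) -> d ~ (ln(n)/(2*alpha))^(1/p)
    return math.ceil((math.log(n) / (2.0 * alpha)) ** (1.0 / p))

n = 10_000
print(d_sobolev(n, beta=2), d_exponential(n, alpha=1.0),
      d_rkhs_poly(n, alpha=1.0), d_rkhs_exp(n, alpha=1.0, p=1.0))
```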

Remark 2.2.1. Throughout, we assumed that $\mathbb{P}\{\|X\| \leq R\} = 1$. This requirement, called the peak power constraint, is standard in the clustering and signal processing literature. We do not consider in this paper the case where this assumption is not satisfied, which is feasible but leads to technical complications (see Merhav and Ziv [145], and Biau, Devroye, and Lugosi [32] for results in this direction). Besides, the number of clusters $k$ is assumed to be fixed throughout the paper. Several methods for estimating $k$ have been proposed in the literature (see, e.g., Milligan and Cooper [147], and Gordon [97]).

As already mentioned, the subset of coefficients $S$ is intimately connected to the underlying Hilbert basis. As a consequence, all the results presented strongly depend on the orthonormal system considered. Therefore, the choice of a proper basis is crucial and is discussed in the next section.