A Tutorial Review of RKHS Methods in Machine Learning
Thomas Hofmann
∗, Bernhard Sch¨ olkopf
†, and Alexander J. Smola
‡December 6, 2005
1 Introduction
Over the last ten years, estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical slant than earlier machine learning methods (e.g., neural networks), there is also significant interest in the statistical and mathematical community for these methods. The present review aims to summarize the state of the art on a conceptual level. In doing so, we build on various sources (including the books [Vapnik, 1998, Burges, 1998, Cristianini and Shawe-Taylor, 2000, Herbrich, 2002] and in particular [Sch¨olkopf and Smola, 2002]), but we also add a fair amount of recent material which helps unifying the exposition.
The main idea of all the described methods can be summarized in one paragraph. Tradition- ally, theory and algorithms of machine learning and statistics has been very well developed for the linear case. Real world data analysis problems, on the other hand, often requires nonlinear methods to detect the kind of dependences that allow successful prediction of properties of in- terest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a (usually high dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to work in the high dimensional feature space.
2 Kernels
2.1 An Introductory Example
Suppose we are given empirical data
(x1, y1), . . . ,(xn, yn)∈X×Y. (1) Here, the domainXis some nonempty set that theinputsxiare taken from; theyi∈Yare called targets. Here and below,i, j= 1, . . . , n.
Note that we have not made any assumptions on the domain Xother than it being a set.
In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, given some new input x∈X, we want to predict the corresponding y ∈ {±1}. Loosely speaking, we want
∗Universit¨at Darmstadt
†Max-Planck-Institut f¨ur biologische Kybernetik
‡[email protected], Statistical Machine Learning Program, National ICT Australia, and Computer Science Laboratory, Australian National University, Canberra, 0200 ACT
Figure 1: A simple geometric classification algorithm: given two classes of points (depicted by
‘o’ and ‘+’), compute their means c+, c− and assign a test input x to the one whose mean is closer. This can be done by looking at the dot product betweenx−c (wherec= (c++c−)/2) and w := c+−c−, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [Sch¨olkopf and Smola, 2002]).
to choose y such that (x, y) is in some sensesimilar to the training examples. To this end, we need similarity measures inXand in{±1}. The latter is easier, as two target values can only be identical or different. For the former, we require a function
k:X×X→R, (x, x0)7→k(x, x0) (2) satisfying, for allx, x0∈X,
k(x, x0) =hΦ(x),Φ(x0)i, (3)
where Φ maps into some dot product spaceH, sometimes called thefeature space. The similarity measurekis usually called akernel, and Φ is called itsfeature map.
The advantage of using such a kernel as a similarity measure is that it allows us to construct algorithms in dot product spaces. For instance, consider the following simple classification algo- rithm, whereY={±1}. The idea is to compute the means of the two classes in the feature space, c+ = n1
+
P
{i:yi=+1}Φ(xi),andc−= n1
−
P
{i:yi=−1}Φ(xi),where n+ andn− are the number of examples with positive and negative target values, respectively. We then assign a new point Φ(x) to the class whose mean is closer to it. This leads to
y = sgn(hΦ(x), c+i − hΦ(x), c−i+b) (4) withb=12 kc−k2− kc+k2
.Substituting the expressions forc± yields
y= sgn
1 n+
X
{i:yi=+1}
hΦ(x),Φ(xi)i − 1 n−
X
{i:yi=−1}
hΦ(x),Φ(xi)i+b
(5) Rewritten in terms ofk, this reads
y= sgn
1 n+
X
{i:yi=+1}
k(x, xi)− 1 n−
X
{i:yi=−1}
k(x, xi) +b
, (6)
whereb=12
1 n2−
P
{(i,j):yi=yj=−1}k(xi, xj)−n12 +
P
{(i,j):yi=yj=+1}k(xi, xj) .
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k(., x) is a density for all x∈X.If the two classes are equally likely and were generated from two probability distributions that are correctly estimated by the Parzen windows estimators
p+(x) := 1 n+
X
{i:yi=+1}
k(x, xi), p−(x) := 1 n−
X
{i:yi=−1}
k(x, xi), (7)
then (6) is the Bayes decision rule.
The classifier (6) is quite close to the Support Vector Machine (SVM) that we will discuss below. It is linear in the feature space (see (4)), while in the input domain, it is represented by a kernel expansion (6). In both cases, the decision boundary is a hyperplane in the feature space;
however, the normal vectors are usually rather different.1
2.2 Positive Definite Kernels
We have above required that a kernel satisfy (3), i.e., correspond to a dot product in some dot product space. In the present section, we show that the class of kernels that can be written in the form (3) coincides with the class of positive definite kernels. This has far-reaching consequences.
There are examples of positive definite kernels which can be evaluated efficiently even though via (3) they correspond to dot products in infinite dimensional dot product spaces. In such cases, substitutingk(x, x0) forhΦ(x),Φ(x0)i, as we have done when going from (5) to (6), is crucial. In the machine learning community, this substitution is called thekernel trick.
2.2.1 Prerequisites
Definition 1 (Gram matrix) Given a kernelkand inputsx1, . . . , xn ∈X, then×nmatrix
K:= (k(xi, xj))ij (8)
is called the Gram matrix (or kernel matrix) ofk with respect tox1, . . . , xn.
Definition 2 (Positive definite matrix) A realn×n symmetric matrixKij satisfying X
i,j
cicjKij ≥0 (9)
for allci ∈Ris called positive definite. If equality in (9) only occurs forc1=· · ·=cn = 0then we shall call the matrix strictly positive definite.
Definition 3 (Positive definite kernel) LetXbe a nonempty set. A functionk:X×X→R which for all n ∈ N, xi ∈ X gives rise to a positive definite Gram matrix is called a positive definite kernel. A function k:X×X→Rwhich for all n∈Nand distinct xi∈Xgives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
1For (4), the normal vector isw =c+−c−. As an aside, note that if we normalize the targets such that ˆ
yi=yi/|{j:yj=yi}|, in which case the ˆyi sum to zero, thenkwk2=˙ K,yˆyˆ>¸
F, whereh., .iF is the Frobenius dot product. If the two classes have equal size, then up to a scaling factor involvingkKk2andn, this equals the kernel-target alignment defined in [Cristianini et al., 2002].
Occasionally, we shall refer to positive definite kernels simply as akernels. Note that for simplicity we have restricted ourselves to the case of real valued kernels. However, with small changes, the below will also hold for the complex valued case.
Since P
i,jcicjhΦ(xi),Φ(xj)i =D P
iciΦ(xi),P
jcjΦ(xj)E
≥ 0, kernels of the form (3) are positive definite for any choice of Φ. In particular, ifXis already a dot product space, we may choose Φ to be the identity. Kernels can thus be regarded as generalized dot products. Whilst they are not generally bilinear, they share important properties with dot products, such as the Cauchy-Schwartz inequality:
Proposition 4 If kis a positive definite kernel, andx1, x2∈X, then
k(x1, x2)2≤k(x1, x1)·k(x2, x2). (10) Proof The 2×2 Gram matrix with entries Kij =k(xi, xj) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product,K’s determinant, i.e.,
0≤K11K22−K12K21=K11K22−K122 . (11) Substitutingk(xi, xj) forKij, we get the desired inequality.
2.2.2 Construction of the Reproducing Kernel Hilbert Space
We now define a map fromXinto the space of functions mappingXintoR, denoted asRX, via Φ :X → RX
x 7→ k(., x). (12)
Here, Φ(x) =k(., x) denotes the function that assigns the valuek(x0, x) tox0 ∈X.
We next construct a dot product space containing the images of the inputs under Φ. To this end, we first turn it into a vector space by forming linear combinations
f(.) =
n
X
i=1
αik(., xi). (13)
Here,n∈N, αi∈Randxi∈Xare arbitrary.
Next, we define a dot product between f and another function g(.) =Pn0
j=1βjk(., x0j) (with n0∈N,βj∈Randx0j∈X) as
hf, gi:=
n
X
i=1 n0
X
j=1
αiβjk(xi, x0j). (14)
To see that this is well-defined although it contains the expansion coefficients, note thathf, gi= Pn0
j=1βjf(x0j). The latter, however, does not depend on the particular expansion off. Similarly, forg, note thathf, gi=Pn
i=1αig(xi). This also shows thath·,·iis bilinear. It is symmetric, as hf, gi=hg, fi. Moreover, it is positive definite, since positive definiteness ofk implies that for any functionf, written as (13), we have
hf, fi=
n
X
i,j=1
αiαjk(xi, xj)≥0. (15)
Next, note that given functionsf1, . . . , fp, and coefficientsγ1, . . . , γp∈R, we have
p
X
i,j=1
γiγjhfi, fji=
* p X
i=1
γifi,
p
X
j=1
γjfj
+
≥0. (16)
Here, the left hand equality follows from the bi-linearity ofh·,·i, and the right hand inequality from (15).
By (16), h·,·i is a positive definite kernel, defined on our vector space of functions. For the last step in proving that it even is a dot product, we note that by (14), for all functions (13),
hk(., x), fi=f(x), (17)
and in particular
hk(., x), k(., x0)i=k(x, x0). (18) By virtue of these properties,kis called areproducing kernel [Aronszajn, 1950].
Due to (17) and Proposition 4, we have
|f(x)|2=|hk(., x), fi|2≤k(x, x)· hf, fi. (19) By this inequality,hf, fi= 0 impliesf = 0, which is the last property that was left to prove in order to establish thath., .iis a dot product.
Skipping some details, we add that one can complete the space of functions (13) in the norm corresponding to the dot product, and thus gets a Hilbert spaceH, called areproducing kernel Hilbert space (RKHS).
One can define an RKHS as a Hilbert spaceHof functions on a setXwith the property that for all x∈Xand f ∈ H, the point evaluationsf 7→f(x) are continuous linear functionals (in particular, all point valuesf(x) are well defined, which already distinguishes RKHS’s from many L2Hilbert spaces). From the point evaluation functional, one can then construct the reproducing kernel using the Riesz representation theorem. The Moore-Aronszajn-Theorem [Aronszajn, 1950]
states that for every positive definite kernel onX×X, there exists a unique RKHS and vice versa.
There is an analogue of the kernel trick for distances rather than dot products, i.e., dissim- ilarities rather than similarities. This leads to the larger class of conditionally positive definite kernels. Those kernels are defined just like positive definite ones, with the one difference being that their Gram matrices need to satisfy (9) only subject to
n
X
i=1
ci = 0. (20)
Interestingly, it turns out that many kernel algorithms, including SVMs and kernel PCA (see Section 3), can be applied also with this larger class of kernels, due to their being translation invariant in feature space [Sch¨olkopf and Smola, 2002, Hein et al., 2005].
We conclude this section with a note on terminology. In the early years of kernel machine learning research, it was not the notion of positive definite kernels that was being used. Instead, researchers considered kernels satisfying the conditions of Mercer’s theorem [Mercer, 1909], see e.g. [Vapnik, 1998, Cristianini and Shawe-Taylor, 2000]. However, whilst all such kernels do satisfy (3), the converse is not true. Since (3) is what we are interested in, positive definite kernels are thus the right class of kernels to consider.
2.2.3 Properties of Positive Definite Kernels
We begin with some closure properties of the set of positive definite kernels.
Proposition 5 Below, k1, k2, . . . are arbitrary positive definite kernels on X×X, whereXis a nonempty set.
(i) The set of positive definite kernels is a closed convex cone, i.e., (a) if α1, α2 ≥ 0, then α1k1+α2k2 is positive definite; and (b) Ifk(x, x0) := limn→∞kn(x, x0)exists for all x, x0, then k is positive definite.
(ii) The pointwise productk1k2 is positive definite.
(iii) Assume that fori= 1,2,ki is a positive definite kernel onXi×Xi, where Xi is a nonempty set. Then the tensor product k1⊗k2 and the direct sum k1⊕k2 are positive definite kernels on (X1×X2)×(X1×X2).
The proofs can be found in [Berg et al., 1984]. We only give a short proof of (ii), a result due to Schur (see e.g., [Berg and Forst, 1975]). Denote byK, Lthe positive definite Gram matrices ofk1, k2 with respect to some data set x1, . . . , xn. Being positive definite, we can expressL as SS> in terms of ann×nmatrixS (e.g., the positive definite square root ofL). We thus have
Lij=
n
X
m=1
SimSjm, (21)
and therefore, for anyc1, . . . , cn∈R,
n
X
i,j=1
cicjKijLij=
n
X
i,j=1
cicjKij
n
X
m=1
SimSjm=
n
X
m=1 n
X
i,j=1
(ciSim)(cjSjm)Kij≥0, (22) where the inequality follows from (9).
It is reassuring that sums and products of positive definite kernels are positive definite.
We will now argue that, loosely speaking, there are no other operations that preserve positive definiteness. To this end, let C denote the set of all functions ψ: R → R that map positive definite kernels to (conditionally) positive definite kernels (readers who are not interested in the case of conditionally positive definite kernels may ignore the term in parentheses):
C={ψ |kis a positive definite kernel ⇒ψ(k) is a (conditionally) positive definite kernel}.
(23) Moreover, define
C0={ψ|for any Hilbert spaceF, ψ(hx, x0iF) is (conditionally) positive definite} (24) and
C00={ψ|for alln∈N:K is a positive definiten×nmatrix ⇒ψ(K) is (conditionally) positive definite}, (25)
whereψ(K) is then×nmatrix with elementsψ(Kij).
Proposition 6 C = C’ = C”
Proof Let ψ ∈ C. For any Hilbert space F, hx, x0iF is positive definite, thus ψ(hx, x0iF) is (conditionally) positive definite. Henceψ∈C0.
Next, assumeψ∈C0. Consider, forn∈N, an arbitrary positive definiten×n-matrixK. We can expressK as the Gram matrix of somex1, . . . , xn∈Rn, i.e.,Kij=hxi, xji
Rn. Sinceψ∈C0,
we know that ψ(hx, x0i
Rn) is a (conditionally) positive definite kernel, hence in particular the matrixψ(K) is (conditionally) positive definite, thus ψ∈C00.
Finally, assume ψ ∈ C00. Let k be a positive definite kernel, n ∈ N, c1, . . . , cn ∈ R, and x1, . . . , xn∈X. Then
X
ij
cicjψ(k(xi, xj)) =X
ij
cicj(ψ(K))ij≥0 (26)
(for P
ici = 0), where the inequality follows from ψ ∈ C00. Therefore, ψ(k) is (conditionally) positive definite, and henceψ∈C.
The following proposition follows from a result of FitzGerald et al. [1995] for (conditionally) positive definite matrices; by Proposition 6, it also applies for (conditionally) positive definite kernels, and for functions of dot products. We state the latter case.
Proposition 7 Let ψ:R→R. Then
ψ(hx, x0iF) (27)
is positive definite for any Hilbert spaceF if and only ifψ is real entire of the form
ψ(t) =
∞
X
n=0
antn (28)
withan ≥0 forn≥0.
Moreover, ψ(hx, x0iF)is conditionally positive definite for any Hilbert space F if and only if ψ is real entire of the form (28) withan ≥0 forn≥1.
There are further properties ofkthat can be read off the coefficientsan.
• Steinwart [2002a] showed that if allanare strictly positive, then the kernel (27) isuniversal on every compact subset S of Rd in the sense that its RKHS is dense in the space of continuous functions onS in thek.k∞ norm. For support vector machines using universal kernels, he then shows (universal) consistency [Steinwart, 2002b]. Examples of universal kernels are (29) and (30) below.
In Lemma 13, we will show that thea0 term does not affect an SVM. Hence we infer that it is actually sufficient for consistency to havean>0 forn≥1.
• LetIeven andIodd denote the sets of even and odd incidesiwith the property thatai >0, respectively. Pinkus [2004] showed that for anyF, the kernel (27) is strictly positive definite for any if and only ifa0>0 and neitherIeven norIodd is finite.2 Moreover, he states that the necessary and sufficient conditions for universality of the kernel are that in addition both P
i∈Ieven
1
i andP
i∈Iodd
1
i diverge.
While the proofs of the above statements are fairly involved, one can make certain aspects plausible using simple arguments. For instance [Pinkus, 2004], supposekis not strictly positive definite. Then we know that for some c 6= 0 we have P
ijcicjk(xi, xj) = 0. This means that
2One might be tempted to think that given some strictly positive definite kernelkwith feature map Φ and feature spaceH, we could setF to equalHand considerψ(hΦ(x),Φ(x0)iH). In this case, it would seem that choosinga1 = 1 andan= 0 for n6= 1 should give us a strictly positive definite kernel which violates Pinkus’
conditions. However, the latter kernel is only strictly positive definite on Φ(X), which is a weaker statement than it being strictly positive definite on all ofH.
DP
icik(xi, .),P
jcjk(xj, .)E
= 0, implying that P
icik(xi, .) = 0. For any f =P
jαjk(xj, .), this implies that P
icif(xi) = D P
icik(xi, .),P
jαjk(xj, .)E
= 0. This equality thus holds for any function in the RKHS. Therefore, the RKHS cannot lie dense in the space of continuous functions onX, andkis thus not universal.
We thus know that if kis universal, it is necessarily strictly positive definite. The latter, in turn, implies that its RKHSHis infinite dimensional, since otherwise the maximal rank of the Gram matrix ofk with respect to a data set inHwould equal its dimensionality. If we assume thatFis finite dimensional, then infinite dimensionality ofHimplies that infinitely many of the aiin (27) must be strictly positive. If only finitely many of theaiin Proposition 7 were positive, we could construct the feature spaceHofψ(hx, x0i) by taking finitely many tensor products and (scaled) direct sums of copies ofF, andHwould be finite.3
We conclude the section with an example of a kernel which is positive definite by Proposition 7.
To this end, letXbe a dot product space. The power series expansion of ψ(x) = ex then tells us that
k(x, x0) =eh
x,x0i
σ2 (29)
is positive definite [Haussler, 1999]. If we further multiply k with the positive definite kernel f(x)f(x0), wheref(x) =e−
kxk2
2σ2 andσ >0, this leads to the positive definiteness of the Gaussian kernel
k0(x, x0) =k(x, x0)f(x)f(x0) =e−kx−x0 k
2
2σ2 . (30)
2.2.4 Properties of Positive Definite Functions
We now letX=Rd and consider positive definite kernels of the form
k(x, x0) =f(x−x0), (31)
in which casef is called apositive definite function.
The following characterization is due to Bochner [1933], see also [Rudin, 1962]. We state it in the form given by Wendland [2005].
Theorem 8 A continuous functionf onRd is positive definite if and only there exists a finite nonnegative Borel measure µonRd such that
f(x) = Z
Rd
e−ihx,ωidµ(ω). (32)
Whilst normally formulated for complex valued functions, the theorem also holds true for real functions. Note, however, that if we start with an arbitrary nonnegative Borel measure, its Fourier transform may not be real. Real valued positive definite functions are distinguished by the fact that the corresponding measuresµare symmetric.
We may normalize f such thatf(0) = 1 (hence by Proposition 4 |f(x)| ≤1), in which case µ is a probability measure and f is its characteristic function. For instance, if µ is a normal distribution of the form (2π/σ2)−d/2e−σ
2kωk2
2 dω, then the corresponding positive definite function is the Gaussiane−kxk
2
2σ2 , cf. (30).
Bochner’s theorem allows us to interpret the similarity measure k(x, x0) =f(x−x0) in the frequency domain. The choice of the measure µdetermines which frequency components occur
3Recall that the dot product ofF⊗Fis the square of the original dot product, and the one ofF⊕F0, where F0is another dot product space, is the sum of the original dot products.
in the kernel. Since the solutions of kernel algorithms will turn out to be finite kernel expansions, the measureµwill thus determine which frequencies occur in the estimates, i.e., it will determine their regularization properties — more on that in Section 2.3.3 below.
A proof of the theorem can be found for instance in [Rudin, 1962]. One part of the theorem is easy to prove: iff takes the form (32), then
X
ij
aiajf(xi−xj) =X
ij
aiaj Z
Rd
e−ihxi−xj,ωidµ(ω) = Z
Rd
X
i
aie−ihxi,ωi
2
dµ(ω)≥0. (33) Bochner’s theorem generalizes earlier work of Mathias, and has itself been generalized in various ways, e.g. by Schoenberg [1938]. An important generalization considers Abelian semigroups (e.g., [Berg et al., 1984]). In that case, the theorem provides an integral representation of positive definite functions in terms of the semigroup’s semicharacters. Further generalizations were given by Krein, for the cases of positive definite kernels and functions with a limited number of negative squares (see [Stewart, 1976] for further details and references).
As in the previous section, there are conditions that ensure that the positive definiteness becomes strict.
Proposition 9 [Wendland, 2005] A positive definite functionf is strictly positive definite if the carrier of the measure in the representation (32) contains an open subset.
This implies that the Gaussian kernel is strictly positive definite.
An important special case of positive definite functions, which includes the Gaussian, are radial basis functions. These are functions that can be written as f(x) = g(kxk2) for some functiong : [0,∞[→R. They have the property of being invariant under the Euclidean group.
If we would like to have the additional property of compact support, which is computationally attractive in a number of large scale applications, the following result becomes relevant.
Proposition 10 [Wendland, 2005] Assume that g : [0,∞[→ R is continuous and compactly supported. Then f(x) =g(kxk2)cannot be positive definite on every Rd.
•4
2.2.5 Examples of Kernels
We have above already seen several instances of positive definite kernels, and now intend to complete our selection with a few more examples. In particular, we discuss polynomial kernels, convolution kernels, ANOVA expansions, and kernels on documents.
Polynomial Kernels From Proposition 5 it is clear that homogeneous polynomial kernels k(x, x0) =hx, x0ip are positive definite for p∈N andx, x0 ∈Rd. By direct calculation we can derive the corresponding feature map [Poggio, 1975]
hx, x0ip=
d
X
j=1
[x]j[x0]j
p
= X
j∈[d]p
[x]j1· · · · ·[x]jp·[x0]j1· · · · ·[x0]jp=hCp(x), Cp(x0)i, (34)
4Bernhard: universality of pdf functions should follow from standard results of Fourier transf., ifµhas full support?
where Cp maps x∈ Rd to the vectorCp(x) whose entries are all possible p-th degree ordered products of the entries ofx(note that we use [d] as a shorthand for{1, . . . , d}). The polynomial kernel of degreepthus computes a dot product in the space spanned by all monomials of degree pin the input coordinates. Other useful kernels include the inhomogeneous polynomial,
k(x, x0) = (hx, x0i+c)p wherep∈Nandc≥0, (35) which computes all monomials up to degreep.
Spline Kernels It is possible to obtain spline functions as a result of kernel expansions [Smola, 1996, Vapnik et al., 1997] simply by noting that convolution of an even number of indicator functions yields a positive kernel function. Denote byIXthe indicator (or characteristic) function on the setX, and denote by⊗the convolution operation, (f⊗g)(x) :=R
Rdf(x0)g(x0−x)dx0).
Then the B-spline kernels are given by
k(x, x0) =B2p+1(x−x0) wherep∈NwithBi+1:=Bi⊗B0. (36) Here B0 is the characteristic function on the unit ball5 in Rd. From the definition of (36) it is obvious that for oddm we may writeBm as inner product between functionsBm/2. Moreover, note that for evenm,Bmis not a kernel.
Convolutions and Structures Let us now move to kernels defined on structured objects [Haussler, 1999, Watkins, 2000]. Suppose the object x ∈ X is composed of xp ∈ Xp, where p∈[P] (note that the sets Xp need not be equal). For instance, consider the stringx=AT G, and P = 2. It is composed of the parts x1 =AT and x2 =G, or alternatively, of x1=A and x2=T G. Mathematically speaking, the set of “allowed” decompositions can be thought of as a relation R(x1, . . . , xP, x), to be read as “x1, . . . , xP constitute the composite object x.”
Haussler [1999] investigated how to define a kernel between composite objects by building on similarity measures that assess their respective parts; in other words, kernels kp defined on Xp×Xp. Define theR-convolution ofk1, . . . , kP as
[k1?· · ·? kP] (x, x0) := X
¯x∈R(x),¯x0∈R(x0) P
Y
p=1
kp(¯xp,x¯0p), (37)
where the sum runs over all possible waysR(x) and R(x0) in which we can decompose x into
¯
x1, . . . ,x¯P andx0 analogously.6 If there is only a finite number of ways, the relationR is called finite. In this case, it can be shown that theR-convolution is a valid kernel [Haussler, 1999].
ANOVA Kernels Specific examples of convolution kernels are Gaussians and ANOVA kernels [Wahba, 1990, Burges and Vapnik, 1995, Vapnik, 1998, Stitson et al., 1999]. To construct an ANOVA kernel, we considerX=SN for some setS, and kernelsk(i)onS×S, wherei= 1, . . . , N. ForP = 1, . . . , N, theANOVA kernel of orderP is defined as
kP(x, x0) := X
1≤i1<···<iP≤N P
Y
p=1
k(ip)(xip, x0i
p). (38)
5Note that inRone typically usesI[−12,12].
6We use the convention that an empty sum equals zero, hence if eitherxorx0 cannot be decomposed, then (k1?· · ·? kP)(x, x0) = 0.
Note that ifP =N, the sum consists only of the term for which (i1, . . . , iP) = (1, . . . , N), and kequals the tensor productk(1)⊗ · · · ⊗k(N). At the other extreme, ifP= 1, then the products collapse to one factor each, andkequals the direct sumk(1)⊕ · · · ⊕k(N). For intermediate values ofP, we get kernels that lie in between tensor products and direct sums.
ANOVA kernels typically use some moderate value of P, which specifies the order of the interactions between attributes xip that we are interested in. The sum then runs over the numerous terms that take into account interactions of orderP; fortunately, the computational cost can be reduced to O(P d) cost by utilizing recurrent procedures for the kernel evaluation [Burges and Vapnik, 1995, Stitson et al., 1999]. ANOVA kernels have been shown to work rather well in multi-dimensional SV regression problems [Stitson et al., 1999].
Brownian Bridge and Related Kernels The Brownian bridge kernel min(x, x0) defined on Rlikewise is a positive kernel. Note that one-dimensional linear spline kernels with an infinite number of nodes [Vapnik et al., 1997] forx, x0∈R+0 are given by
k(x, x0) = min(x, x0)3
3 +min(x, x0)2|x−x0|
2 + 1 +xx0. (39)
These kernels can be used as basis functionsk(n)in ANOVA expansions. Note that it is advisable to use ak(n)which never or rarely takes the value zero, since a single zero term would eliminate the product in (38).
Bag of Words One way in which SVMs have been used for text categorization [Joachims, 1998] is the bag-of-words representation. This maps a given text to a sparse vector, where each component corresponds to a word, and a component is set to one (or some other number) whenever the related word occurs in the text. Using an efficient sparse representation, the dot product between two such vectors can be computed quickly. Furthermore, this dot product is by construction a valid kernel, referred to as a sparse vector kernel. One of its shortcomings, however, is that it does not take into account the word ordering of a document. Other sparse vector kernels are also conceivable, such as one that maps a text to the set of pairs of words that are in the same sentence [Joachims, 1998, Watkins, 2000], or those which look only at pairs of words within a certain vicinity with respect to each other [Sim, 2001].
n-grams and Suffix Trees A more sophisticated way of dealing with string data was proposed [Watkins, 2000, Haussler, 1999]. The basic idea is as described above for general structured objects (37): Compare the strings by means of the substrings they contain. The more substrings two strings have in common, the more similar they are. The substrings need not always be contiguous; that said, the further apart the first and last element of a substring are, the less weight should be given to the similarity. Depending on the specific choice of a similarity measure it is possible to define more or less efficient kernels which compute the dot product in the feature space spanned byall substrings of documents.
Consider a finite alphabet Σ, the set of all strings of length n, Σn, and the set of all finite strings, Σ∗ := ∪∞n=0Σn. The length of a string s ∈ Σ∗ is denoted by |s|, and its elements by s(1). . . s(|s|); the concatenation ofsandt∈Σ∗ is writtenst. Denote by
k(x, x0) =X
s
#(x, s)#(x0, s)cs
a string kernel computed from exact matches. Here #(x, s) is the number of occurrences ofsin xand cs≥0.
Vishwanathan and Smola [2004a] prove that for arbitrarycsthe value of the kernel k(x, x0) can be computed inO(|x|+|x0|) time and memory. This is done by means of suffix trees, which allow one to find a compact representation of all substrings ofxin only O(|x|) time and space.
Even better, alsof(x) =hw,Φ(x)ican be computed inO(|x|) time if preprocessing linear in the size of the support vectors is carried out.
For inexact matches of a limited degree, typically up to= 3, and strings of bounded length a similar data structure can be built by explicitly generating a dictionary of strings and their neighborhood in terms of a Hamming distance [Leslie et al., 2002b,a]. These kernels are defined by replacing #(x, s) by a mismatch function #(x, s, ) which reports the number of approximate occurrences ofsinx. By trading off computational complexity with storage (hence the restriction to small numbers of mismatches) essentially linear-time algorithms can be designed. Whether a general purpose algorithm exists which allows for efficient comparisons of strings with mismatches in linear time is still an open question. Tools from approximate string matching [Navarro and Raffinot, 1999, Cole and Hariharan, 2000, Gusfield, 1997] promise to be of help in this context.
Mismatch Kernels In the general case it is only possible to find algorithms whose complexity is linear in the lengths of the documents being compared, and the length of the substrings, i.e.
O(|x| · |x0|) or worse. We now describe such a kernel with a specific choice of weights, following Lodhi et al. [2000].
Let us now form subsequences uof strings. Given an index sequencei:= (i1, . . . , i|u|) with 1≤i1 <· · · < i|u| ≤ |s|, we define u:=s(i) :=s(i1). . . s(i|u|). We calll(i) :=i|u|−i1+ 1the length of the subsequence ins. Note that ifiis not contiguous, thenl(i)>|u|.
The feature space built from strings of length nis defined to beHn :=R(Σ
n). This notation means that the space has one dimension (or coordinate) for each element of Σn, labelled by that element (equivalently, we can think of it as the space of all real-valued functions on Σn). We can thus describe the feature map coordinate-wise for eachu∈Σn via
[Φn(s)]u:= X
i:s(i)=u
λl(i). (40)
Here, 0< λ≤1 is a decay parameter: The larger the length of the subsequence ins, the smaller the respective contribution to [Φn(s)]u. The sum runs over all subsequences ofswhich equalu.
For instance, consider a dimension of H3 spanned (that is, labelled) by the string asd. In this case, we have [Φ3(Nasdaq)]asd=λ3, while [Φ3(lass das)]asd= 2λ5.7 The kernel induced by the map Φn takes the form
kn(s, t) = X
u∈Σn
[Φn(s)]u[Φn(t)]u= X
u∈Σn
X
(i,j):s(i)=t(j)=u
λl(i)λl(j). (41)
The string kernelkncan be computed using a dynamic programming scheme, see [Watkins, 2000, Lodhi et al., 2000, Durbin et al., 1998].
Locality Improved Kernels They are adjusted for the structure of the data. Recall the Gaussian RBF and polynomial kernels. When applied to an image, it makes no difference whether one uses as xthe image or a version of x where all locations of the pixels have been permuted. This indicates that function space onXinduced bykdoes not take advantage of the locality properties of the data.
7In the first string,asdis a contiguous substring. In the second string, it appears twice as a non-contiguous substring of length 5 inlass das, the two occurrences arelass dasandlass das.
By taking advantage of the local structure, estimates can be improved. On biological se- quences [Zien et al., 2000] one may assign more weight to the entries of the sequence close to the location where estimates should occur. In other words, one replaceshx, x0ibyx>Ωx, where Ω0 is a diagonal matrix with largest terms at the location which needs to be classified.
For images, local interactions between image patches need to be considered. One way is to use thepyramidal kernel [Sch¨olkopf, 1997, DeCoste and Sch¨olkopf, 2002], inspired by the pyramidal cells of the brain: It takes inner products between corresponding image patches, then raises the latter to some power p1, and finally raises their sum to another power p2. This means that mainly short-range interactions are considered and that the long-range interactions are taken with respect to short-range groups.
Tree Kernels We now discuss similarity measures on more structured objects. For trees Collins and Duffy [2001] propose a decomposition method which maps a treexinto its set of subtrees.
The kernel between two trees x, x0 is then computed by taking a weighted sum of all terms between both trees. In particular, Collins and Duffy [2001] show a quadratic time algorithm, i.e. O(|x| · |x0|) to compute this expression, where|x|is the number of nodes of the tree. When restricting the sum to all proper rooted subtrees it is possible to reduce the computational cost to linear time, i.e.O(|x|+|x0|) by means of a tree to string conversion [Vishwanathan and Smola, 2004a].
Tree Kernels Graphs pose a dual challenge: one may both design a kernelonvertices of them and also a kernelbetween them. In the former case, the graph itself becomes the object defining the metric between the vertices. See [G¨artner, 2003, Kashima et al., 2004, 2003] for details on the latter. In the following we discuss kernelson graphs.
Denote by W ∈ Rn×n the adjacency matrix of a graph with Wij > 0 if an edge between i, j exists. Moreover, assume for simplicity that the graph is undirected, that is W> =W (see Zhou et al. [2005] for extensions to directed graphs). Denote byL=D−W the graph Laplacian and by ˜L=1−D−12W D−12 the normalized graph Laplacian. HereD is a diagonal matrix with Dii =P
jWij denoting the degree of vertexi.
It is well known [Fiedler, 1973] that the second largest eigenvector of L approximately de- composes the graph into two parts according to their sign. The other large eigenvectors partition the graph into correspondingly smaller portions. L arises from the fact that for a function f defined on the vertices of the graphP
i,j(f(i)−f(j))2= 2f>Lf.
Finally, Smola and Kondor [2003] show that under mild conditions and up to rescalingL is the only quadratic permutation invariant form which can be obtained as a linear function ofW. Hence it is reasonable to consider kernel matrices K obtained from L (and ˜L). [Smola and Kondor, 2003] suggest kernelsK =r(L) orK =r( ˜L) and they show that they have desirable smoothness properties. Herer: [0,∞)→[0,∞) is a monotonically decreasing function. Popular choices include
r(ξ) = exp(−λξ) diffusion kernel (42)
r(ξ) = (ξ+λ)−1 regularized graph Laplacian (43)
r(ξ) = (λ−ξ)p p-step random walk (44)
where λ > 0 is suitably chosen. Eq. 42 was proposed by Kondor and Lafferty [2002]. In Section 2.3.3 we will discuss the connection between regularization operators and kernels inRn. Without going into details, the functionr(ξ) describes the smoothness properties on the graph andLplays the role of the Laplace operator.
Kernels on Sets and Subspaces Whenever each observationxiconsists of aset of instances we may use a range of methods to capture the specific properties of these sets:
• G¨artner et al. [2002] take the average of the elements of the set in feature space, that is, φ(xi) =n1P
jφ(xij). This yields good performance in the area of multi-instance learning.
• Jebara and Kondor [2003] extend the idea by dealing with distributions pi(x) such that φ(xi) =E[φ(x)] wherex∼pi(x). This is used with great success in the classification of images with missing pixels.
• Wolf and Shashua [2003] take a different approach by studying the subspace spanned by the observations. More specifically, they consider subspace angles [Martin, 2000, Cock and Moor, 2002] between spaces and design a kernel based on this. In a nutshell, ifU, U0denote the orthogonal matrices spanning the subspaces of x andx0 respectively, then k(x, x0) = detU>U0.
• Vishwanathan and Smola [2004b] extend this to arbitrary compound matrices (i.e. matrices composed of subdeterminants of matrices) and dynamical systems. Their result exploits the Binet Cauchy theorem which states that compound matrices are a representation of the matrix group, i.e.Cq(AB) =Cq(A)Cq(B). This leads to kernels of the formk(x, x0) = trCq(AB). Note that forq= dimAwe recover the determinant kernel of Wolf and Shashua [2003].
Fisher Kernels In their quest to to make density estimates directly accessible to kernel meth- ods Jaakkola and Haussler [1999b,a] designed kernels which work directly on probability density estimatesp(x|θ). Denote by
Uθ(x) :=∂θ−logp(x|θ) (45)
I:=Ex
Uθ(x)Uθ>(x)
(46) the Fisher scores and the Fisher information matrix respectively. Note that for maximum likeli- hood estimatorsEx[Uθ(x)] = 0 and thereforeI is the covariance ofUθ(x). The Fisher kernel is defined as
k(x, x0) :=Uθ>(x)I−1Uθ(x0) ork(x, x0) :=Uθ>(x)Uθ(x0) (47) depending on whether whether we study the normalized or the unnormalized kernel respectively.
It is a versatile tool to re-engineer existing density estimators for the purpose of discriminative estimation.
In addition to that, it has several attractive theoretical properties: Oliver et al. [2000] show that estimation using the normalized Fisher kernel corresponds to estimation subject to a regu- larization on theL2(p(·|θ)) norm.
Moreover, in the context of exponential families (see Section 4.1 for a more detailed discussion) wherep(x|θ) = exp(hφ(x), θi −g(θ)) we have
k(x, x0) = [φ(x)−∂θg(θ)] [φ(x0)−∂θg(θ)]. (48) for the unnormalized Fisher kernel. This means that up to centering by∂θg(θ) the Fisher kernel is identical to the kernel arising from the inner product of the sufficient statistics φ(x). This is not a coincidence. In fact, in our analysis of nonparametric exponential families we will encounter this fact several times. See Section 4 for further details. Moreover, note that the centering is immaterial, as can be seen in Lemma 13.
While we presented an overview over recent kernel design in this section the overview is by no means complete. The reader is referred to textbooks by Cristianini and Shawe-Taylor [2000], Shawe-Taylor and Cristianini [2004], Sch¨olkopf and Smola [2003], Joachims [2002], Herbrich [2002]
for further details on how kernels may be constructed.
2.3 Kernel Function Classes
2.3.1 The Representer Theorem
From kernels, we now move to functions that can be expressed in terms of kernel expansions.
The representer theorem [Kimeldorf and Wahba, 1971, Cox and O’Sullivan, 1990] shows that solutions of a large class of optimization problems can be expressed as kernel expansions over the sample points. We present a slightly more general version of the theorem with a simple proof [Sch¨olkopf et al., 2001]. As above,His the RKHS associated to the kernel k.
Theorem 11 (Representer Theorem) Denote by Ω : [0,∞) → R a strictly monotonic in- creasing function, byXa set, and byc: (X×R2)n→R∪ {∞}an arbitrary loss function. Then each minimizer f ∈H of the regularized risk functional
c((x1, y1, f(x1)), . . . ,(xn, yn, f(xn))) + Ω kfk2H
(49) admits a representation of the form
f(x) =
n
X
i=1
αik(xi, x). (50)
Proof We decompose any f ∈ H into a part contained in the span of the kernel functions k(x1,·),· · · , k(xn,·), and one in the orthogonal complement;
f(x) =fk(x) +f⊥(x) =
n
X
i=1
αik(xi, x) +f⊥(x). (51) Here αi ∈Randf⊥∈H withhf⊥, k(xi,·)iH = 0 for all i∈[n] :={1, . . . , n}. By (17) we may writef(xj) (for allj∈[n]) as
f(xj) =hf(·), k(xj, .)i=
n
X
i=1
αik(xi, xj) +hf⊥(·), k(xj, .)iH=
n
X
i=1
αik(xi, xj). (52) Second, for allf⊥,
Ω(kfk2H) = Ω
n
X
i
αik(xi,·)
2
H
+kf⊥k2H
≥Ω
n
X
i
αik(xi,·)
2
H
. (53) Thus for any fixed αi ∈Rthe risk functional (49) is minimized forf⊥ = 0. Since this also has to hold for the solution, the theorem holds.
Monotonicity of Ω does not prevent the regularized risk functional (49) from having multiple local minima. To ensure a global minimum, we would need to require convexity. If we discard the strictness of the monotonicity, then it no longer follows that each minimizer of the regularized risk admits an expansion (50); it still follows, however, that there is always another solution that is as good, and thatdoes admit the expansion.
The significance of the Representer Theorem is that although we might be trying to solve an optimization problem in an infinite-dimensional space H, containing linear combinations of kernels centered onarbitrary points ofX, it states that the solution lies in the span ofnparticular kernels — those centered on the training points. We will encounter (50) again further below, where it is called theSupport Vector expansion. For suitable choices of loss functions, many of theαi often equal 0.
2.3.2 Reduced Set Methods
Nevertheless, it can often be the case that the expansion (50) contains a large number of terms.
This can be problematic in practical applications, since the time required to evaluate (50) is proportional to the number of terms. To deal with this issue, we thus need to express or at least approximate (50) using a smaller number of terms. Exact expressions in a reduced number of basis points are often not possible; e.g., if the kernel is strictly positive definite and the data points x1, . . . , xn are distinct, then no exact reduction is possible. For the purpose of approximation, we need to decide on a discrepancy measure that we want to minimize. The most natural such measure is the norm of the RKHS. From a practical point of view, this norm has the advantage that it can be computed using evaluations of the kernel function which opens the possibility for relatively efficient approximation algorithms. Clearly, if we were given a set of pointsz1, . . . , zm
such that the function (cf. (50))
f =
n
X
i=1
αik(xi, .). (54)
can be expressed with small RKHS-error in the subspace spanned byk(z1, .), . . . , k(zm, .), then we can compute anm-term approximation off by projection onto that subspace.
The problem thus reduces to the choice of the expansion pointsz1, . . . , zm. In the literature, two classes of methods have been pursued. In the first, it is attempted to choose the points as a subset of some larger set x1, . . . , xn, which is usually the training set [Sch¨olkopf, 1997, Frieß and Harrison, 1998]. In the second, we compute the difference in the RKHS between (54) and thereduced set expansion [Burges, 1996]
g=
m
X
p=1
βpk(zp, .), (55)
leading to
kf−gk2=
n
X
i,j=1
αiαjk(xi, xj) +
m
X
p,q=1
βpβqk(zp, zq)−2
n
X
i=1 m
X
p=1
αiβpk(xi, zp). (56) To obtain thez1, . . . , zmandβ1, . . . , βm, we can minimize this objective function, which for most kernels will be a nonconvex optimization problem. Specialized methods for specific kernels exist, as well as iterative methods taking into account the fact that given thez1, . . . , zm, the coefficients β1, . . . , βmcan be computed in closed form. See [Sch¨olkopf and Smola, 2002] for further details.
Once we have constructed or selected a set of expansion points, we are working in a finite dimensional subspace ofHspanned by these points.
It is known that each closed subspace of an RKHS is itself an RKHS. Denote its kernel byl, then the projectionP onto the subspace is given by (e.g., [Meschkowski, 1962])
(P f)(x0) =hf, l(., x0)i. (57)
If the subspace is spanned by a finite set of linearly independent kernelsk(., z1), . . . , k(., zm), the kernel of the subspace takes the form
km(x, x0) = (k(x, z1), . . . , k(x, zm))K−1(k(x0, z1), . . . , k(x0, zm))>, (58) where K is the Gram matrix of k with respect toz1, . . . , zm. This can be seen by noting that (i) the km(., x) span an m-dimensional subspace, and (ii) we have km(zi, zj) = k(zi, zj) for
i, j = 1, . . . , m. The kernel of the subspace’s orthogonal complement, on the other hand, is k−km.
The projection onto them-dimensional subspace can be written P=
m
X
i,j=1
K−1
ijk(., zi)k(., zj)>, (59) wherek(., zj)> denotes the linear form mappingk(., x) tok(zj, x) forx∈X.
If we prefer to work with coordinates and the Euclidean dot product, we can use the kernel PCA map [Sch¨olkopf and Smola, 2002]
Φm:X→Rm, x7→K−1/2(k(x, z1), . . . , k(x, zm))>, (60) which directly projects into an orthonormal basis of the subspace.8
A number of articles exist discussing how to best choose the expansion points zi for the purpose of processing a given dataset, e.g., [Smola and Sch¨olkopf, 2000, Williams and Seeger, 2000].
2.3.3 Regularization Properties
The regularizer kfk2H used in Theorem 11 stems from the dot product hf, fik in the RKHS H associated with a positive definite kernel. Since it is somewhat hard to interpret which aspects of f are influencing the value of this dot product, it is instructive to consider the cases where there exists a positive operator Υ mapping the RKHSHinto a function space endowed with the dot product
hf, gi= Z
f(x)g(x)dx, (61)
with the property that for allf, g∈H
hΥf,Υgi=hf, gik. (62)
We shall now sketch this, with a particular focus on the case where the dot product can be written conveniently in the Fourier domain. Our exposition will be informal (see also [Poggio and Girosi, 1990, Girosi et al., 1993, Smola et al., 1998]), and we will implicitly assume that all integrals are overRd and exist, and that the operators are well defined. Rather than (62), we consider the condition
hΥk(x, .),Υk(x0, .)i=hk(x, .), k(x0, .)ik, (63) which by linearity extends to (62) (cf. Section 2.2.2).
Let us rewrite the first dot product as
Υ2k(x, .), k(x0, .)
(recall that Υ is positive and thus symmetric). We then see that ifk(x, .) is aGreens function of Υ2, we have
Υ2k(x, .), k(x0, .)
=hδx, k(x0, .)i=k(x, x0), (64) which by the reproducing property (18) amounts to the desired equality (63). For conditionally positive definite kernels, a similar correspondence can be established, with a regularization op- erator whose null space is spanned by a set of functions which are not regularized (in the case (20), which is sometimes calledconditionally positive definite of order 1, these are the constant functions).
8In caseKdoes not have full rank, projections onto the span{k(·, z1), . . . , k(·, zm)}are achieved by using the pseudoinverse ofKinstead.
We now consider the particular case where the kernel can be written k(x, x0) = f(x−x0) with a continuous strictly positive definite functionf ∈L1(Rd) (cf. Section 2.2.4). A variation of Bochner’s theorem (Theorem 8), stated in Wendland [2005], then tells us that the measure corresponding tof has a nonvanishing densityυwith respect to the Lebesgue measure, i.e., that kcan be written as
k(x, x0) = Z
e−ihx−x0,ωiυ(ω)dω= Z
e−ihx,ωie−ihx0,ωiυ(ω)dω. (65) We would like to rewrite this as hΥk(x, .),Υk(x0, .)i for some linear operator Υ. It turns out that a multiplication operator in the Fourier domain will do the job. To this end, recall the d-dimensional Fourier transform, given by
F[f](ω) := (2π)−d2 Z
f(x)e−ihx,ωidx, (66)
with the inverse
F−1[f](x) = (2π)−d2 Z
f(ω)eihx,ωidω. (67)
Next, compute the Fourier transform of kas F[k(x, .)](ω) = (2π)−d2
Z Z
υ(ω0)e−ihx,ω0i
eihx0,ω0idω0e−ihx0,ωidx0 (68)
= (2π)d2υ(ω)e−ihx,ωi. (69)
Hence we can rewrite (65) as
k(x, x0) = (2π)−d
Z F[k(x, .)](ω)F[k(x0, .)](ω)
υ(ω) dω. (70)
If our regularization operator maps
Υ : f 7→(2π)−d2υ−12F[f], (71) we thus have
k(x, x0) = Z
(Υk(x, .))(ω)(Υk(x0, .))(ω)dω, (72) i.e., our desired identity (63) holds true.
As required in (62), we can thus interpret the dot producthf, gikin the RKHS as a dot product R(Υf)(ω)(Υg)(ω)dω. This allows us to understand regularization properties ofkin terms of its (scaled) Fourier transformυ(ω). Small values ofυ(ω) amplify the corresponding frequencies in (71). Penalizinghf, fik thus amounts to astrong attenuation of the corresponding frequencies.
Hence small values ofυ(ω) for largekωkare desirable, since high frequency components ofF[f] correspond to rapid changes in f. It follows that υ(ω) describes the filter properties of the corresponding regularization operator Υ. In view of our comments following Theorem 8, we can translate this insight into probabilistic terms: if the probability measure Rυ(ω)dω
υ(ω)dω describes the desired filter properties, then the natural translation invariant kernel to use is the characteristic function of the measure.