Support vector machines and applications in computational biology
Jean-Philippe Vert
Jean-Philippe.Vert@mines.org
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Cancer diagnosis
Problem 1
Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?
Cancer prognosis
Problem 2
Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?
Pharmacogenomics / Toxicogenomics
Problem 3
Given the genome of a person, which drug should we give?
Protein annotation
Data available
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Problem 4
Given a newly sequenced protein, is it secreted or not?
Drug discovery
(figure: example candidate molecules, each labeled "active" or "inactive")
Problem 5
Given a new candidate molecule, is it likely to be active?
A common topic
On real data...
Pattern recognition, aka supervised classification
Challenges
High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Linear classifier
Which one is better?
The margin of a linear classifier
Largest margin classifier (hard-margin SVM)
Support vectors
More formally
The training set is a finite set of $n$ data/class pairs:
$$\mathcal{S} = \left\{ (\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n) \right\},$$
where $\vec{x}_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$.
We assume (for the moment) that the data are linearly separable, i.e., that there exists $(\vec{w}, b) \in \mathbb{R}^p \times \mathbb{R}$ such that:
$$\begin{cases} \vec{w} \cdot \vec{x}_i + b > 0 & \text{if } y_i = 1, \\ \vec{w} \cdot \vec{x}_i + b < 0 & \text{if } y_i = -1. \end{cases}$$
How to find the largest separating hyperplane?
For a given linear classifier $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:
(figure: the hyperplanes $\vec{w} \cdot \vec{x} + b = -1$, $0$, $+1$ in the $(x_1, x_2)$ plane, with points satisfying $\vec{w} \cdot \vec{x} + b > +1$ on one side, $\vec{w} \cdot \vec{x} + b < -1$ on the other, and the normal vector $\vec{w}$)
The margin is $2 / \| \vec{w} \|$
Indeed, the points $\vec{x}_1$ and $\vec{x}_2$ satisfy:
$$\begin{cases} \vec{w} \cdot \vec{x}_1 + b = 0, \\ \vec{w} \cdot \vec{x}_2 + b = 1. \end{cases}$$
By subtracting we get $\vec{w} \cdot (\vec{x}_2 - \vec{x}_1) = 1$, and therefore:
$$\gamma = 2 \, \| \vec{x}_2 - \vec{x}_1 \| = \frac{2}{\| \vec{w} \|}.$$
All training points should be on the correct side of the dotted line
For positive examples ($y_i = 1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \geq 1.$$
For negative examples ($y_i = -1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \leq -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \geq 1.$$
Finding the optimal hyperplane
Find $(\vec{w}, b)$ which minimize:
$$\| \vec{w} \|^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0.$$
This is a classical quadratic program on $\mathbb{R}^{p+1}$.
Lagrangian
In order to minimize:
$$\frac{1}{2} \| \vec{w} \|_2^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} \| \vec{w} \|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right).$$
Lagrangian
$L(\vec{w}, b, \vec{\alpha})$ is convex quadratic in $\vec{w}$. It is minimized for:
$$\nabla_{\vec{w}} L = \vec{w} - \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i = 0 \implies \vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i.$$
$L(\vec{w}, b, \vec{\alpha})$ is affine in $b$. Its minimum is $-\infty$ except if:
$$\nabla_b L = \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Dual function
We therefore obtain the Lagrange dual function:
$$q(\vec{\alpha}) = \inf_{\vec{w} \in \mathbb{R}^p, \, b \in \mathbb{R}} L(\vec{w}, b, \vec{\alpha}) = \begin{cases} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, \vec{x}_i \cdot \vec{x}_j & \text{if } \sum_{i=1}^{n} \alpha_i y_i = 0, \\ -\infty & \text{otherwise.} \end{cases}$$
The dual problem is:
maximize $q(\vec{\alpha})$ subject to $\vec{\alpha} \geq 0$.
Dual problem
Find $\vec{\alpha}^* \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, n$), and
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$
This is a quadratic program on $\mathbb{R}^n$, with "box constraints". $\vec{\alpha}^*$ can be found efficiently using dedicated optimization software.
Recovering the optimal hyperplane
Once $\vec{\alpha}^*$ is found, we recover $(\vec{w}^*, b^*)$ corresponding to the optimal hyperplane. $\vec{w}^*$ is given by:
$$\vec{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \vec{x}_i,$$
and the decision function is therefore:
$$f^*(\vec{x}) = \vec{w}^* \cdot \vec{x} + b^* = \sum_{i=1}^{n} \alpha_i^* y_i \, \vec{x}_i \cdot \vec{x} + b^*. \qquad (1)$$
Interpretation: support vectors
(figure: support vectors have $\alpha > 0$; all other training points have $\alpha = 0$)
What if data are not linearly separable?
Soft-margin SVM
Find a trade-off between large margin and few errors.
Mathematically:
$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$
$C$ is a parameter.
Soft-margin SVM formulation
The margin of a labeled point $(\vec{x}, y)$ is:
$$\text{margin}(\vec{x}, y) = y \left( \vec{w} \cdot \vec{x} + b \right).$$
The error is:
$$\begin{cases} 0 & \text{if } \text{margin}(\vec{x}, y) > 1, \\ 1 - \text{margin}(\vec{x}, y) & \text{otherwise.} \end{cases}$$
The soft-margin SVM solves:
$$\min_{\vec{w}, b} \left\{ \| \vec{w} \|^2 + C \sum_{i=1}^{n} \max \left( 0, 1 - y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \right) \right\}$$
Soft-margin SVM and hinge loss
$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^{n} \ell_{\text{hinge}} \left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \| \vec{w} \|_2^2 \right\},$$
for $\lambda = 1/C$ and the hinge loss function:
$$\ell_{\text{hinge}}(u, y) = \max(1 - yu, 0) = \begin{cases} 0 & \text{if } yu \geq 1, \\ 1 - yu & \text{otherwise.} \end{cases}$$
(figure: the hinge loss $\ell(f(x), y)$ as a function of $y f(x)$, zero for $y f(x) \geq 1$ and increasing linearly below $1$)
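A minimal sketch in R, reproducing the hinge loss curve of the figure above with base graphics:

hinge <- function(u) pmax(1 - u, 0)   # hinge loss as a function of u = y f(x)
curve(hinge, from = -2, to = 3, xlab = "y f(x)", ylab = "hinge loss")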
Dual formulation of soft-margin SVM (exercise)
Maximize
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Interpretation: bounded and unbounded support vectors
(figure: unbounded support vectors on the margin have $0 < \alpha < C$; bounded support vectors have $\alpha = C$; all other points have $\alpha = 0$)
Primal (for large n) vs dual (for large p) optimization
1 Find $(\vec{w}, b) \in \mathbb{R}^{p+1}$ which solve:
$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^{n} \ell_{\text{hinge}} \left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \| \vec{w} \|_2^2 \right\}.$$
2 Find $\vec{\alpha}^* \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
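As an illustration, here is a minimal sketch of the dual optimization using the general-purpose QP solver ipop() from the kernlab R package (the function svm_dual and its defaults are assumptions for illustration, not what production SVM solvers do):

library(kernlab)

svm_dual <- function(x, y, C = 1) {
  n <- length(y)
  K <- x %*% t(x)                     # linear kernel Gram matrix
  H <- (y %*% t(y)) * K               # H_ij = y_i y_j x_i . x_j
  # minimize -sum(alpha) + 1/2 alpha' H alpha
  # s.t. sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C
  sol <- ipop(c = rep(-1, n), H = H,
              A = t(y), b = 0, r = 0,
              l = rep(0, n), u = rep(C, n))
  alpha <- as.vector(primal(sol))
  w <- colSums(alpha * y * x)         # w = sum_i alpha_i y_i x_i
  sv <- which(alpha > 1e-6 & alpha < C - 1e-6)    # unbounded support vectors
  b <- mean(y[sv] - x[sv, , drop = FALSE] %*% w)  # y_i (w.x_i + b) = 1 on them
  list(alpha = alpha, w = w, b = b)
}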
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Sometimes linear methods are not enough
Solution: nonlinear mapping to a feature space
(figure: points inside and outside a circle of radius $R$ in $\mathbb{R}^2$ become linearly separable in the coordinates $(x_1^2, x_2^2)$)
For $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. The decision function is:
$$f(\vec{x}) = x_1^2 + x_2^2 - R^2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix} - R^2 = \vec{\beta}^\top \Phi(\vec{x}) + b.$$
Kernel = inner product in the feature space
Definition
For a given mapping
$$\Phi : \mathcal{X} \to \mathcal{H}$$
from the space of objects $\mathcal{X}$ to some Hilbert space of features $\mathcal{H}$, the kernel between two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x)^\top \Phi(x').$$
Example
Let $\mathcal{X} = \mathcal{H} = \mathbb{R}^2$ and for $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. Then:
$$K(\vec{x}, \vec{x}') = \Phi(\vec{x})^\top \Phi(\vec{x}') = (x_1)^2 (x_1')^2 + (x_2)^2 (x_2')^2.$$
The kernel tricks
2 tricks
1 Many linear algorithms (in particular linear SVM) can be performed in the feature space of $\Phi(x)$ without explicitly computing the images $\Phi(x)$, but instead by computing the kernels $K(x, x')$.
2 It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces: $K(x, x')$ is often much simpler to compute than $\Phi(x)$ and $\Phi(x')$.
Trick 1: SVM in the original space
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, x_i^\top x + b^*.$$
Trick 1: SVM in the feature space
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \Phi(x_i)^\top \Phi(x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, \Phi(x_i)^\top \Phi(x) + b^*.$$
Trick 1: SVM in the feature space with a kernel
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b^*.$$
Trick 2 illustration: polynomial kernel
(figure: the circle-separable data of the previous example, mapped to monomial features)
For $\vec{x} = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\Phi(\vec{x}) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^\top \in \mathbb{R}^3$:
$$K(\vec{x}, \vec{x}') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2 = \left( \vec{x}^\top \vec{x}' \right)^2.$$
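A quick numerical check of this identity in R (the point values are arbitrary; note that kernlab's polydot computes (scale * <x,y> + offset)^degree, so offset = 0 gives the homogeneous kernel above):

library(kernlab)
x <- c(1, 2); y <- c(3, -1)
Phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
sum(Phi(x) * Phi(y))                              # explicit feature map: 1
polydot(degree = 2, scale = 1, offset = 0)(x, y)  # kernel evaluation: 1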
Trick 2 illustration: polynomial kernel
More generally, for $\vec{x}, \vec{x}' \in \mathbb{R}^p$,
$$K(\vec{x}, \vec{x}') = \left( \vec{x}^\top \vec{x}' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).
Combining tricks: learn a polynomial discrimination rule with SVM
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left( x_i^\top x_j + 1 \right)^d,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \left( x_i^\top x + 1 \right)^d + b^*.$$
Illustration: toy nonlinear problem
> plot(x,col=ifelse(y>0,1,2),pch=ifelse(y>0,1,2))
(figure: "Training data" scatter plot of the two classes in the (x1, x2) plane)
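The objects x and y are not defined on the slide; a minimal sketch that generates a comparable two-class toy set (an assumption, not the original data):

set.seed(123)
n <- 60                                            # total number of points
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),   # class -1 cluster
           matrix(rnorm(n, mean = 2), ncol = 2))   # class +1 cluster
y <- rep(c(-1, 1), each = n / 2)
plot(x, col = ifelse(y > 0, 1, 2), pch = ifelse(y > 0, 1, 2))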
Illustration: toy nonlinear problem, linear SVM
> library(kernlab)
> svp <- ksvm(x,y,type="C-svc",kernel='vanilladot')
> plot(svp,data=x)
(figure: "SVM classification plot" of the linear SVM decision values in the (x1, x2) plane)
Illustration: toy nonlinear problem, polynomial SVM
> svp <- ksvm(x,y,type="C-svc",
+             kernel=polydot(degree=2))
> plot(svp,data=x)
(figure: "SVM classification plot" of the degree-2 polynomial SVM decision values in the (x1, x2) plane)
Which functions K(x, x') are kernels?
Definition
A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping
$$\Phi : \mathcal{X} \to \mathcal{H},$$
such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle_{\mathcal{H}}.$$
Positive Definite (p.d.) functions
Definition
A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K(x_i, x_j) \geq 0.$$
Kernels are p.d. functions
Theorem (Aronszajn, 1950)
$K$ is a kernel if and only if it is a positive definite function.
Proof?
Kernel $\implies$ p.d. function:
$$\left\langle \Phi(x), \Phi(x') \right\rangle_{\mathbb{R}^d} = \left\langle \Phi(x'), \Phi(x) \right\rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle_{\mathbb{R}^d} = \left\| \sum_{i=1}^{N} a_i \Phi(x_i) \right\|_{\mathbb{R}^d}^2 \geq 0.$$
P.d. function $\implies$ kernel: more difficult...
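The definition quantifies over all finite point sets, but a numerical spot-check on sampled points is a handy sanity test. A minimal sketch (the helper is_psd is an assumption for illustration):

is_psd <- function(K, tol = 1e-10) {
  K <- (K + t(K)) / 2   # symmetrize against numerical noise
  min(eigen(K, symmetric = TRUE, only.values = TRUE)$values) > -tol
}
x <- matrix(rnorm(40), ncol = 2)   # 20 random points in R^2
K <- tcrossprod(x)                 # Gram matrix of the linear kernel
is_psd(K)                          # TRUE, up to numerical error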
Example: SVM with a Gaussian kernel
Training:
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \exp \left( - \frac{\| \vec{x}_i - \vec{x}_j \|^2}{2 \sigma^2} \right)$$
$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Prediction:
$$f(\vec{x}) = \sum_{i=1}^{n} \alpha_i y_i \exp \left( - \frac{\| \vec{x} - \vec{x}_i \|^2}{2 \sigma^2} \right) + b^*$$
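A minimal sketch of training such a model with kernlab (note that rbfdot is parameterized as K(x, x') = exp(-sigma * ||x - x'||^2), so its sigma corresponds to 1/(2*sigma^2) in the formula above; xnew is an assumed matrix of test points):

library(kernlab)
svp <- ksvm(x, y, type = "C-svc",
            kernel = "rbfdot", kpar = list(sigma = 0.5), C = 1)
predict(svp, xnew)   # predicted class labels for the test points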
Example: SVM with a Gaussian kernel
$$f(\vec{x}) = \sum_{i=1}^{n} \alpha_i y_i \exp \left( - \frac{\| \vec{x} - \vec{x}_i \|^2}{2 \sigma^2} \right) + b^*$$
(figure: "SVM classification plot" showing the nonlinear decision boundary of a Gaussian-kernel SVM)
Linear vs nonlinear SVM
Regularity vs data fitting trade-off
$C$ controls the trade-off:
$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$
Why it is important to control the trade-off
How to choose C in practice
Split your dataset in two ("train" and "test")
Train SVM with different C on the "train" set
Compute the accuracy of the SVM on the "test" set
Choose the C which minimizes the "test" error
(you may repeat this several times = cross-validation)
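A minimal sketch of this procedure using kernlab's built-in cross-validation (the grid of C values is an assumption):

library(kernlab)
C_grid <- 2^seq(-4, 8, by = 2)
cv_err <- sapply(C_grid, function(C) {
  svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
              C = C, cross = 5)   # 5-fold cross-validation
  cross(svp)                      # cross-validation error rate
})
best_C <- C_grid[which.min(cv_err)]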
SVM summary
Large margin
Linear or nonlinear (with the kernel trick)
Control of the regularization / data fitting trade-off with C
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Goal
Build a classifier to predict whether new proteins are secreted or not.
String kernels
The idea
Map each string $x \in \mathcal{X}$ to a vector $\Phi(x) \in \mathcal{F}$.
Train a classifier for vectors on the images $\Phi(x_1), \ldots, \Phi(x_n)$ of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...)
(figure: protein sequences mapped by $\Phi$ from the string space $\mathcal{X}$ to the vector space $\mathcal{F}$)
Example: substring indexation
The approach
Index the feature space by fixed-length strings, i.e.,
$$\Phi(x) = \left( \Phi_u(x) \right)_{u \in \mathcal{A}^k}$$
where $\Phi_u(x)$ can be:
the number of occurrences of $u$ in $x$ (without gaps): spectrum kernel (Leslie et al., 2002)
the number of occurrences of $u$ in $x$ up to $m$ mismatches (without gaps): mismatch kernel (Leslie et al., 2004)
the number of occurrences of $u$ in $x$ allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002)
Spectrum kernel (1/2)
Kernel definition
The 3-spectrum of x = CGGSLIAMMWFGV is:
(CGG, GGS, GSL, SLI, LIA, IAM, AMM, MMW, MWF, WFG, FGV).
Let $\Phi_u(x)$ denote the number of occurrences of $u$ in $x$. The $k$-spectrum kernel is:
$$K(x, x') := \sum_{u \in \mathcal{A}^k} \Phi_u(x) \Phi_u(x').$$
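A minimal sketch of this kernel in R (a naive implementation for illustration, not the indexed one of the cited papers):

spectrum_phi <- function(x, k) {
  n <- nchar(x)
  kmers <- substring(x, 1:(n - k + 1), k:n)   # all k-mers of x
  table(kmers)                                # counts of each distinct k-mer
}
spectrum_kernel <- function(x1, x2, k = 3) {
  p1 <- spectrum_phi(x1, k)
  p2 <- spectrum_phi(x2, k)
  shared <- intersect(names(p1), names(p2))
  sum(p1[shared] * p2[shared])                # inner product over shared k-mers
}
spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k = 3)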
Spectrum kernel (2/2)
Implementation
The computation of the kernel is formally a sum over $|\mathcal{A}|^k$ terms, but at most $|x| - k + 1$ terms are non-zero in $\Phi(x)$ $\implies$ computation in $O(|x| + |x'|)$ with pre-indexation of the strings.
Fast classification of a sequence $x$ in $O(|x|)$:
$$f(x) = \vec{w} \cdot \Phi(x) = \sum_u w_u \Phi_u(x) = \sum_{i=1}^{|x| - k + 1} w_{x_i \ldots x_{i+k-1}}.$$
Remarks
Works with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
Local alignment kernel (Saigo et al., 2004)
CGGSLIAMM----WFGV
|...|||||....||||
C---LIVMMNRLMWFGV
$$s_{S,g}(\pi) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2 S(M,M) + S(W,W) + S(F,F) + S(G,G) + S(V,V) - g(3) - g(4)$$
$$SW_{S,g}(x, y) := \max_{\pi \in \Pi(x, y)} s_{S,g}(\pi) \quad \text{is not a kernel}$$
$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp \left( \beta \, s_{S,g}(x, y, \pi) \right) \quad \text{is a kernel}$$
LA kernel is p.d.: proof (1/2)
Definition: Convolution kernel (Haussler, 1999)
Let $K_1$ and $K_2$ be two p.d. kernels for strings. The convolution of $K_1$ and $K_2$, denoted $K_1 \star K_2$, is defined for any $x, y \in \mathcal{X}$ by:
$$K_1 \star K_2 (x, y) := \sum_{x_1 x_2 = x, \; y_1 y_2 = y} K_1(x_1, y_1) K_2(x_2, y_2).$$
Lemma
If $K_1$ and $K_2$ are p.d. then $K_1 \star K_2$ is p.d.
LA kernel is p.d.: proof (2/2)
$$K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{(n-1)} \star K_a^{(\beta)} \star K_0,$$
with:
The constant kernel:
$$K_0(x, y) := 1.$$
A kernel for letters:
$$K_a^{(\beta)}(x, y) := \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp \left( \beta S(x, y) \right) & \text{otherwise.} \end{cases}$$
A kernel for gaps:
$$K_g^{(\beta)}(x, y) = \exp \left[ \beta \left( g(|x|) + g(|y|) \right) \right].$$
The choice of kernel matters
(figure: number of SCOP families achieving a given ROC50 score, for SVM-LA, SVM-pairwise, SVM-Mismatch, and SVM-Fisher)
Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).
Virtual screening for drug discovery
(figure: example molecules from the screen, each labeled "active" or "inactive")
NCI AIDS screen results (from http://cactus.nci.nih.gov).
Image retrieval and classification
From Harchaoui and Bach (2007).
Graph kernels
1 Represent each graph $x$ by a vector $\Phi(x) \in \mathcal{H}$, either explicitly or implicitly through the kernel
$$K(x, x') = \Phi(x)^\top \Phi(x').$$
2 Use a linear method for classification in $\mathcal{H}$.
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
Proof.
The linear graph of size $n$ is a subgraph of a graph $X$ with $n$ vertices iff $X$ has a Hamiltonian path.
The decision problem whether a graph has a Hamiltonian path is NP-complete.
Indexing by specific subgraphs
Substructure selection
We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list):
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest paths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al., 2009)
all frequent subgraphs in the database (Helma et al., 2004)
Example: Indexing by all shortest paths
(figure: a labeled graph mapped to the vector of counts of the label sequences of its shortest paths, e.g. (0, ..., 0, 2, 0, ..., 0, 1, 0, ...))
Properties (Borgwardt and Kriegel, 2005)
There are $O(n^2)$ shortest paths.
The vector of counts can be computed in $O(n^4)$ with the Floyd-Warshall algorithm.
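A minimal sketch of this feature map in R, for a graph given as an adjacency matrix A with a vector of vertex labels lab (the triplet encoding is an assumption for illustration; the kernel between two graphs is then the inner product of these count vectors):

floyd_warshall <- function(A) {
  D <- ifelse(A > 0, 1, Inf)   # edge = distance 1, non-edge = Inf
  diag(D) <- 0
  n <- nrow(A)
  for (k in 1:n) for (i in 1:n) for (j in 1:n)
    D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
  D
}
shortest_path_features <- function(A, lab) {
  D <- floyd_warshall(A)
  idx <- which(is.finite(D) & D > 0, arr.ind = TRUE)
  # count (start label, end label, path length) triplets
  table(paste(lab[idx[, 1]], lab[idx[, 2]], D[idx], sep = "-"))
}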
Example: Indexing by all subgraphs up to k vertices
Properties (Shervashidze et al., 2009)
Naive enumeration scales as $O(n^k)$.
Enumeration of connected graphlets in $O(n d^{k-1})$ for graphs with degree $\leq d$ and $k \leq 5$.
Randomly sample subgraphs if enumeration is infeasible.
Walks
Definition
A walk of a graph $(V, E)$ is a sequence of vertices $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.
We note $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and $\mathcal{W}(G)$ the set of all walks.
Walks $\neq$ paths: a walk may visit the same vertex several times.
Walk kernel
Definition
Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$. For any graph $G$, let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$.
Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w) \, \mathbf{1} \left( s \text{ is the label sequence of } w \right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2).$$
Walk kernel examples
The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have:
$$K(G_1, G_2) = P \left( \text{label}(W_1) = \text{label}(W_2) \right),$$
where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.
Product graph
Definition
Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:
1 $V = \left\{ (v_1, v_2) \in V_1 \times V_2 \, : \, v_1 \text{ and } v_2 \text{ have the same label} \right\}$,
2 $E = \left\{ \left( (v_1, v_2), (v_1', v_2') \right) \in V \times V \, : \, (v_1, v_1') \in E_1 \text{ and } (v_2, v_2') \in E_2 \right\}$.
(figure: two labeled graphs $G_1$ and $G_2$ and their product graph $G_1 \times G_2$)
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks $w_1 \in \mathcal{W}_n(G_1)$ and $w_2 \in \mathcal{W}_n(G_2)$ with the same label sequences,
2 The walks on the product graph $w \in \mathcal{W}_n(G_1 \times G_2)$.
Corollary
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1) \lambda_{G_2}(w_2) \, \mathbf{1} \left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$
Computation of the nth-order walk kernel
For the $n$th-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} \left[ A^n \right]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.
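A minimal sketch of this computation in R, for graphs given as adjacency matrices with vertex label vectors (all function names are assumptions for illustration):

nth_order_walk_kernel <- function(A1, lab1, A2, lab2, n) {
  # vertices of the product graph: pairs of vertices with matching labels
  V <- expand.grid(i = seq_along(lab1), j = seq_along(lab2))
  V <- V[lab1[V$i] == lab2[V$j], ]
  m <- nrow(V)
  # adjacency matrix of the product graph
  A <- matrix(0, m, m)
  for (p in seq_len(m)) for (q in seq_len(m))
    A[p, q] <- A1[V$i[p], V$i[q]] * A2[V$j[p], V$j[q]]
  # count common walks of length n: 1' A^n 1
  An <- diag(m)
  for (s in seq_len(n)) An <- An %*% A
  sum(An)
}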
Computation of random and geometric walk kernels
In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of $\lambda_i(v)$ and $\Lambda_t$ be the matrix of $\lambda_t(v, v')$:
$$K_{\text{walk}}(G_1, G_2) = \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i) = \sum_{n=0}^{\infty} \Lambda_i \Lambda_t^n \mathbf{1} = \Lambda_i \left( I - \Lambda_t \right)^{-1} \mathbf{1}$$
Computation in $O(|G_1|^3 |G_2|^3)$.
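A minimal sketch of the geometric case in R, on a product-graph adjacency matrix A built as in the previous sketch (uniform initial weights are an assumption):

geometric_walk_kernel <- function(A, beta) {
  m <- nrow(A)
  Lt <- beta * A                       # Lambda_t = beta * adjacency
  # the series converges iff the spectral radius of Lambda_t is < 1
  stopifnot(max(Mod(eigen(Lt, only.values = TRUE)$values)) < 1)
  li <- rep(1, m)                      # Lambda_i: uniform initial weights
  drop(li %*% solve(diag(m) - Lt, rep(1, m)))   # Lambda_i (I - Lambda_t)^{-1} 1
}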
Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009)
(figure: branching walks, i.e., tree patterns, on a molecular graph)
$$T(v, n+1) = \sum_{R \subset \mathcal{N}(v)} \prod_{v' \in R} \lambda_t(v, v') \, T(v', n).$$
2D Subtree vs walk kernels
(figure: AUC of the walk and subtree kernels on each of 60 NCI cancer cell lines)
Screening of inhibitors for 60 cancer cell lines.
Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), and a combination (M).
(figure: test error on Corel14 for the kernels H, W, TW, wTW, and M)
Performance comparison on Corel14
References
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950. URL http://www.jstor.org/stable/1990404.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi: 10.1109/ICDM.2005.132.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1–8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402–11, 2004. doi: 10.1021/ci034254q.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564–575, Singapore, 2002. World Scientific.
References (cont.)
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444, 2002. URL http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3–35, 2009. doi: 10.1007/s10994-008-5086-2.
A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65–74, 2003.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.