Support vector machines and applications in computational biology
Jean-Philippe Vert
Jean-Philippe.Vert@mines.org
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Cancer diagnosis
Problem 1
Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?
Cancer prognosis
Problem 2
Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?
Pharmacogenomics / Toxicogenomics
Problem 3
Given the genome of a person, which drug should we give?
Protein annotation
Data available
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Problem 4
Given a newly sequenced protein, is it secreted or not?
Drug discovery
(figure: example candidate molecules, each labeled "active" or "inactive")
Problem 5
Given a new candidate molecule, is it likely to be active?
A common topic
On real data...
Pattern recognition, aka supervised classification
Challenges
High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Linear classifier
Which one is better?
The margin of a linear classifier
Largest margin classifier (hard-margin SVM)
Support vectors
More formally
The training set is a finite set of $n$ data/class pairs:
$$\mathcal{S} = \left\{ (\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n) \right\},$$
where $\vec{x}_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$.
We assume (for the moment) that the data are linearly separable, i.e., that there exists $(\vec{w}, b) \in \mathbb{R}^p \times \mathbb{R}$ such that:
$$\begin{cases} \vec{w} \cdot \vec{x}_i + b > 0 & \text{if } y_i = 1, \\ \vec{w} \cdot \vec{x}_i + b < 0 & \text{if } y_i = -1. \end{cases}$$
How to find the largest separating hyperplane?
For a given linear classifier $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:
(figure: the hyperplanes $\vec{w} \cdot \vec{x} + b = -1$, $0$, $+1$ in the $(x_1, x_2)$ plane, with points satisfying $\vec{w} \cdot \vec{x} + b > +1$ on one side, $\vec{w} \cdot \vec{x} + b < -1$ on the other, and the normal vector $\vec{w}$)
The margin is $2 / \| \vec{w} \|$
Indeed, the points $\vec{x}_1$ and $\vec{x}_2$ satisfy:
$$\begin{cases} \vec{w} \cdot \vec{x}_1 + b = 0, \\ \vec{w} \cdot \vec{x}_2 + b = 1. \end{cases}$$
By subtracting we get $\vec{w} \cdot (\vec{x}_2 - \vec{x}_1) = 1$, and therefore:
$$\gamma = 2 \, \| \vec{x}_2 - \vec{x}_1 \| = \frac{2}{\| \vec{w} \|}.$$
All training points should be on the correct side of the dotted line
For positive examples ($y_i = 1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \geq 1.$$
For negative examples ($y_i = -1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \leq -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \geq 1.$$
Finding the optimal hyperplane
Find $(\vec{w}, b)$ which minimize:
$$\| \vec{w} \|^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0.$$
This is a classical quadratic program on $\mathbb{R}^{p+1}$.
Lagrangian
In order to minimize:
$$\frac{1}{2} \| \vec{w} \|_2^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} \| \vec{w} \|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right).$$
Lagrangian
$L(\vec{w}, b, \vec{\alpha})$ is convex quadratic in $\vec{w}$. It is minimized for:
$$\nabla_{\vec{w}} L = \vec{w} - \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i = 0 \implies \vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i.$$
$L(\vec{w}, b, \vec{\alpha})$ is affine in $b$. Its minimum is $-\infty$ except if:
$$\nabla_b L = \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Dual function
We therefore obtain the Lagrange dual function:
$$q(\vec{\alpha}) = \inf_{\vec{w} \in \mathbb{R}^p, \, b \in \mathbb{R}} L(\vec{w}, b, \vec{\alpha}) = \begin{cases} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, \vec{x}_i \cdot \vec{x}_j & \text{if } \sum_{i=1}^{n} \alpha_i y_i = 0, \\ -\infty & \text{otherwise.} \end{cases}$$
The dual problem is:
maximize $q(\vec{\alpha})$ subject to $\vec{\alpha} \geq 0$.
Dual problem
Find $\vec{\alpha}^* \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, n$), and
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$
This is a quadratic program on $\mathbb{R}^n$, with "box constraints". $\vec{\alpha}^*$ can be found efficiently using dedicated optimization software.
Recovering the optimal hyperplane
Once $\vec{\alpha}^*$ is found, we recover $(\vec{w}^*, b^*)$ corresponding to the optimal hyperplane. $\vec{w}^*$ is given by:
$$\vec{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \vec{x}_i,$$
and the decision function is therefore:
$$f^*(\vec{x}) = \vec{w}^* \cdot \vec{x} + b^* = \sum_{i=1}^{n} \alpha_i^* y_i \, \vec{x}_i \cdot \vec{x} + b^*. \qquad (1)$$
Interpretation: support vectors
(figure: support vectors have $\alpha > 0$; all other training points have $\alpha = 0$)
What if data are not linearly separable?
Soft-margin SVM
Find a trade-off between large margin and few errors.
Mathematically:
$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$
$C$ is a parameter.
Soft-margin SVM formulation
The margin of a labeled point $(\vec{x}, y)$ is:
$$\text{margin}(\vec{x}, y) = y \left( \vec{w} \cdot \vec{x} + b \right).$$
The error is:
$$\begin{cases} 0 & \text{if } \text{margin}(\vec{x}, y) > 1, \\ 1 - \text{margin}(\vec{x}, y) & \text{otherwise.} \end{cases}$$
The soft-margin SVM solves:
$$\min_{\vec{w}, b} \left\{ \| \vec{w} \|^2 + C \sum_{i=1}^{n} \max \left( 0, 1 - y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \right) \right\}$$
Soft-margin SVM and hinge loss
$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^{n} \ell_{\text{hinge}} \left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \| \vec{w} \|_2^2 \right\},$$
for $\lambda = 1/C$ and the hinge loss function:
$$\ell_{\text{hinge}}(u, y) = \max(1 - yu, 0) = \begin{cases} 0 & \text{if } yu \geq 1, \\ 1 - yu & \text{otherwise.} \end{cases}$$
(figure: the hinge loss $\ell(f(x), y)$ as a function of $y f(x)$, zero for $y f(x) \geq 1$ and increasing linearly below $1$)
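A minimal sketch in R, reproducing the hinge loss curve of the figure above with base graphics:

hinge <- function(u) pmax(1 - u, 0)   # hinge loss as a function of u = y f(x)
curve(hinge, from = -2, to = 3, xlab = "y f(x)", ylab = "hinge loss")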
Dual formulation of soft-margin SVM (exercise)
Maximize
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Interpretation: bounded and unbounded support vectors
(figure: unbounded support vectors on the margin have $0 < \alpha < C$; bounded support vectors have $\alpha = C$; all other points have $\alpha = 0$)
Primal (for large n) vs dual (for large p) optimization
1 Find $(\vec{w}, b) \in \mathbb{R}^{p+1}$ which solve:
$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^{n} \ell_{\text{hinge}} \left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \| \vec{w} \|_2^2 \right\}.$$
2 Find $\vec{\alpha}^* \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
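As an illustration, here is a minimal sketch of the dual optimization using the general-purpose QP solver ipop() from the kernlab R package (the function svm_dual and its defaults are assumptions for illustration, not what production SVM solvers do):

library(kernlab)

svm_dual <- function(x, y, C = 1) {
  n <- length(y)
  K <- x %*% t(x)                     # linear kernel Gram matrix
  H <- (y %*% t(y)) * K               # H_ij = y_i y_j x_i . x_j
  # minimize -sum(alpha) + 1/2 alpha' H alpha
  # s.t. sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C
  sol <- ipop(c = rep(-1, n), H = H,
              A = t(y), b = 0, r = 0,
              l = rep(0, n), u = rep(C, n))
  alpha <- as.vector(primal(sol))
  w <- colSums(alpha * y * x)         # w = sum_i alpha_i y_i x_i
  sv <- which(alpha > 1e-6 & alpha < C - 1e-6)    # unbounded support vectors
  b <- mean(y[sv] - x[sv, , drop = FALSE] %*% w)  # y_i (w.x_i + b) = 1 on them
  list(alpha = alpha, w = w, b = b)
}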
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Sometimes linear methods are not enough
Solution: nonlinear mapping to a feature space
(figure: points inside and outside a circle of radius $R$ in $\mathbb{R}^2$ become linearly separable in the coordinates $(x_1^2, x_2^2)$)
For $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. The decision function is:
$$f(\vec{x}) = x_1^2 + x_2^2 - R^2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix} - R^2 = \vec{\beta}^\top \Phi(\vec{x}) + b.$$
Kernel = inner product in the feature space
Definition
For a given mapping
$$\Phi : \mathcal{X} \to \mathcal{H}$$
from the space of objects $\mathcal{X}$ to some Hilbert space of features $\mathcal{H}$, the kernel between two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x)^\top \Phi(x').$$
Example
Let $\mathcal{X} = \mathcal{H} = \mathbb{R}^2$ and for $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. Then:
$$K(\vec{x}, \vec{x}') = \Phi(\vec{x})^\top \Phi(\vec{x}') = (x_1)^2 (x_1')^2 + (x_2)^2 (x_2')^2.$$
The kernel tricks
2 tricks
1 Many linear algorithms (in particular linear SVM) can be performed in the feature space of $\Phi(x)$ without explicitly computing the images $\Phi(x)$, but instead by computing the kernels $K(x, x')$.
2 It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces: $K(x, x')$ is often much simpler to compute than $\Phi(x)$ and $\Phi(x')$.
Trick 1: SVM in the original space
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, x_i^\top x + b^*.$$
Trick 1: SVM in the feature space
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \Phi(x_i)^\top \Phi(x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, \Phi(x_i)^\top \Phi(x) + b^*.$$
Trick 1: SVM in the feature space with a kernel
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b^*.$$
Trick 2 illustration: polynomial kernel
(figure: the circle-separable data of the previous example, mapped to monomial features)
For $\vec{x} = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\Phi(\vec{x}) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^\top \in \mathbb{R}^3$:
$$K(\vec{x}, \vec{x}') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2 = \left( \vec{x}^\top \vec{x}' \right)^2.$$
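A quick numerical check of this identity in R (the point values are arbitrary; note that kernlab's polydot computes (scale * <x,y> + offset)^degree, so offset = 0 gives the homogeneous kernel above):

library(kernlab)
x <- c(1, 2); y <- c(3, -1)
Phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
sum(Phi(x) * Phi(y))                              # explicit feature map: 1
polydot(degree = 2, scale = 1, offset = 0)(x, y)  # kernel evaluation: 1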
Trick 2 illustration: polynomial kernel
More generally, for $\vec{x}, \vec{x}' \in \mathbb{R}^p$,
$$K(\vec{x}, \vec{x}') = \left( \vec{x}^\top \vec{x}' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).
Combining tricks: learn a polynomial discrimination rule with SVM
Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left( x_i^\top x_j + 1 \right)^d,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^{n} \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \left( x_i^\top x + 1 \right)^d + b^*.$$
Illustration: toy nonlinear problem
> plot(x,col=ifelse(y>0,1,2),pch=ifelse(y>0,1,2))
(figure: "Training data" scatter plot of the two classes in the (x1, x2) plane)
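The objects x and y are not defined on the slide; a minimal sketch that generates a comparable two-class toy set (an assumption, not the original data):

set.seed(123)
n <- 60                                            # total number of points
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),   # class -1 cluster
           matrix(rnorm(n, mean = 2), ncol = 2))   # class +1 cluster
y <- rep(c(-1, 1), each = n / 2)
plot(x, col = ifelse(y > 0, 1, 2), pch = ifelse(y > 0, 1, 2))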
Illustration: toy nonlinear problem, linear SVM
> library(kernlab)
> svp <- ksvm(x,y,type="C-svc",kernel='vanilladot')
> plot(svp,data=x)
(figure: "SVM classification plot" of the linear SVM decision values in the (x1, x2) plane)
Illustration: toy nonlinear problem, polynomial SVM
> svp <- ksvm(x,y,type="C-svc",
+             kernel=polydot(degree=2))
> plot(svp,data=x)
(figure: "SVM classification plot" of the degree-2 polynomial SVM decision values in the (x1, x2) plane)
Which functions K(x, x') are kernels?
Definition
A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping
$$\Phi : \mathcal{X} \to \mathcal{H},$$
such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle_{\mathcal{H}}.$$
Positive Definite (p.d.) functions
Definition
A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K(x_i, x_j) \geq 0.$$
Kernels are p.d. functions
Theorem (Aronszajn, 1950)
$K$ is a kernel if and only if it is a positive definite function.
Proof?
Kernel $\implies$ p.d. function:
$$\left\langle \Phi(x), \Phi(x') \right\rangle_{\mathbb{R}^d} = \left\langle \Phi(x'), \Phi(x) \right\rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle_{\mathbb{R}^d} = \left\| \sum_{i=1}^{N} a_i \Phi(x_i) \right\|_{\mathbb{R}^d}^2 \geq 0.$$
P.d. function $\implies$ kernel: more difficult...
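The definition quantifies over all finite point sets, but a numerical spot-check on sampled points is a handy sanity test. A minimal sketch (the helper is_psd is an assumption for illustration):

is_psd <- function(K, tol = 1e-10) {
  K <- (K + t(K)) / 2   # symmetrize against numerical noise
  min(eigen(K, symmetric = TRUE, only.values = TRUE)$values) > -tol
}
x <- matrix(rnorm(40), ncol = 2)   # 20 random points in R^2
K <- tcrossprod(x)                 # Gram matrix of the linear kernel
is_psd(K)                          # TRUE, up to numerical error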
Example: SVM with a Gaussian kernel
Training:
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \exp \left( - \frac{\| \vec{x}_i - \vec{x}_j \|^2}{2 \sigma^2} \right)$$
$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Prediction:
$$f(\vec{x}) = \sum_{i=1}^{n} \alpha_i y_i \exp \left( - \frac{\| \vec{x} - \vec{x}_i \|^2}{2 \sigma^2} \right) + b^*$$
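A minimal sketch of training such a model with kernlab (note that rbfdot is parameterized as K(x, x') = exp(-sigma * ||x - x'||^2), so its sigma corresponds to 1/(2*sigma^2) in the formula above; xnew is an assumed matrix of test points):

library(kernlab)
svp <- ksvm(x, y, type = "C-svc",
            kernel = "rbfdot", kpar = list(sigma = 0.5), C = 1)
predict(svp, xnew)   # predicted class labels for the test points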
Example: SVM with a Gaussian kernel
$$f(\vec{x}) = \sum_{i=1}^{n} \alpha_i y_i \exp \left( - \frac{\| \vec{x} - \vec{x}_i \|^2}{2 \sigma^2} \right) + b^*$$
(figure: "SVM classification plot" showing the nonlinear decision boundary of a Gaussian-kernel SVM)
Linear vs nonlinear SVM
Regularity vs data fitting trade-off
$C$ controls the trade-off:
$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$
Why it is important to control the trade-off
How to choose C in practice
Split your dataset in two ("train" and "test")
Train SVM with different C on the "train" set
Compute the accuracy of the SVM on the "test" set
Choose the C which minimizes the "test" error
(you may repeat this several times = cross-validation)
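A minimal sketch of this procedure using kernlab's built-in cross-validation (the grid of C values is an assumption):

library(kernlab)
C_grid <- 2^seq(-4, 8, by = 2)
cv_err <- sapply(C_grid, function(C) {
  svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
              C = C, cross = 5)   # 5-fold cross-validation
  cross(svp)                      # cross-validation error rate
})
best_C <- C_grid[which.min(cv_err)]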
SVM summary
Large margin
Linear or nonlinear (with the kernel trick)
Control of the regularization / data fitting trade-off with C
Outline
1 Motivations
2 Linear SVM
3 Nonlinear SVM and kernels
4 Kernels for strings and graphs
Supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Goal
Build a classifier to predict whether new proteins are secreted or not.
String kernels
The idea
Map each string $x \in \mathcal{X}$ to a vector $\Phi(x) \in \mathcal{F}$.
Train a classifier for vectors on the images $\Phi(x_1), \ldots, \Phi(x_n)$ of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...)
(figure: protein sequences mapped by $\Phi$ from the string space $\mathcal{X}$ to the vector space $\mathcal{F}$)
Example: substring indexation
The approach
Index the feature space by fixed-length strings, i.e.,
$$\Phi(x) = \left( \Phi_u(x) \right)_{u \in \mathcal{A}^k}$$
where $\Phi_u(x)$ can be:
the number of occurrences of $u$ in $x$ (without gaps): spectrum kernel (Leslie et al., 2002)
the number of occurrences of $u$ in $x$ up to $m$ mismatches (without gaps): mismatch kernel (Leslie et al., 2004)
the number of occurrences of $u$ in $x$ allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002)
Spectrum kernel (1/2)
Kernel definition
The 3-spectrum of x = CGGSLIAMMWFGV is:
(CGG, GGS, GSL, SLI, LIA, IAM, AMM, MMW, MWF, WFG, FGV).
Let $\Phi_u(x)$ denote the number of occurrences of $u$ in $x$. The $k$-spectrum kernel is:
$$K(x, x') := \sum_{u \in \mathcal{A}^k} \Phi_u(x) \Phi_u(x').$$
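A minimal sketch of this kernel in R (a naive implementation for illustration, not the indexed one of the cited papers):

spectrum_phi <- function(x, k) {
  n <- nchar(x)
  kmers <- substring(x, 1:(n - k + 1), k:n)   # all k-mers of x
  table(kmers)                                # counts of each distinct k-mer
}
spectrum_kernel <- function(x1, x2, k = 3) {
  p1 <- spectrum_phi(x1, k)
  p2 <- spectrum_phi(x2, k)
  shared <- intersect(names(p1), names(p2))
  sum(p1[shared] * p2[shared])                # inner product over shared k-mers
}
spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k = 3)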
Spectrum kernel (2/2)
Implementation
The computation of the kernel is formally a sum over $|\mathcal{A}|^k$ terms, but at most $|x| - k + 1$ terms are non-zero in $\Phi(x)$ $\implies$ computation in $O(|x| + |x'|)$ with pre-indexation of the strings.
Fast classification of a sequence $x$ in $O(|x|)$:
$$f(x) = \vec{w} \cdot \Phi(x) = \sum_u w_u \Phi_u(x) = \sum_{i=1}^{|x| - k + 1} w_{x_i \ldots x_{i+k-1}}.$$
Remarks
Works with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
Local alignment kernel (Saigo et al., 2004)
CGGSLIAMM----WFGV
|...|||||....||||
C---LIVMMNRLMWFGV
$$s_{S,g}(\pi) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2 S(M,M) + S(W,W) + S(F,F) + S(G,G) + S(V,V) - g(3) - g(4)$$
$$SW_{S,g}(x, y) := \max_{\pi \in \Pi(x, y)} s_{S,g}(\pi) \quad \text{is not a kernel}$$
$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp \left( \beta \, s_{S,g}(x, y, \pi) \right) \quad \text{is a kernel}$$
LA kernel is p.d.: proof (1/2)
Definition: Convolution kernel (Haussler, 1999)
Let $K_1$ and $K_2$ be two p.d. kernels for strings. The convolution of $K_1$ and $K_2$, denoted $K_1 \star K_2$, is defined for any $x, y \in \mathcal{X}$ by:
$$K_1 \star K_2 (x, y) := \sum_{x_1 x_2 = x, \; y_1 y_2 = y} K_1(x_1, y_1) K_2(x_2, y_2).$$
Lemma
If $K_1$ and $K_2$ are p.d. then $K_1 \star K_2$ is p.d.
LA kernel is p.d.: proof (2/2)
$$K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{(n-1)} \star K_a^{(\beta)} \star K_0,$$
with:
The constant kernel:
$$K_0(x, y) := 1.$$
A kernel for letters:
$$K_a^{(\beta)}(x, y) := \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp \left( \beta S(x, y) \right) & \text{otherwise.} \end{cases}$$
A kernel for gaps:
$$K_g^{(\beta)}(x, y) = \exp \left[ \beta \left( g(|x|) + g(|y|) \right) \right].$$
The choice of kernel matters
(figure: number of SCOP families achieving a given ROC50 score, for SVM-LA, SVM-pairwise, SVM-Mismatch, and SVM-Fisher)
Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).
Virtual screening for drug discovery
(figure: example molecules from the screen, each labeled "active" or "inactive")
NCI AIDS screen results (from http://cactus.nci.nih.gov).
Image retrieval and classification
From Harchaoui and Bach (2007).
Graph kernels
1 Represent each graph $x$ by a vector $\Phi(x) \in \mathcal{H}$, either explicitly or implicitly through the kernel
$$K(x, x') = \Phi(x)^\top \Phi(x').$$
2 Use a linear method for classification in $\mathcal{H}$.
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
Proof.
The linear graph of size $n$ is a subgraph of a graph $X$ with $n$ vertices iff $X$ has a Hamiltonian path.
The decision problem whether a graph has a Hamiltonian path is NP-complete.
Indexing by specific subgraphs
Substructure selection
We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list):
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest paths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al., 2009)
all frequent subgraphs in the database (Helma et al., 2004)
Example: Indexing by all shortest paths
(figure: a labeled graph mapped to the vector of counts of the label sequences of its shortest paths, e.g. (0, ..., 0, 2, 0, ..., 0, 1, 0, ...))
Properties (Borgwardt and Kriegel, 2005)
There are $O(n^2)$ shortest paths.
The vector of counts can be computed in $O(n^4)$ with the Floyd-Warshall algorithm.
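A minimal sketch of this feature map in R, for a graph given as an adjacency matrix A with a vector of vertex labels lab (the triplet encoding is an assumption for illustration; the kernel between two graphs is then the inner product of these count vectors):

floyd_warshall <- function(A) {
  D <- ifelse(A > 0, 1, Inf)   # edge = distance 1, non-edge = Inf
  diag(D) <- 0
  n <- nrow(A)
  for (k in 1:n) for (i in 1:n) for (j in 1:n)
    D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
  D
}
shortest_path_features <- function(A, lab) {
  D <- floyd_warshall(A)
  idx <- which(is.finite(D) & D > 0, arr.ind = TRUE)
  # count (start label, end label, path length) triplets
  table(paste(lab[idx[, 1]], lab[idx[, 2]], D[idx], sep = "-"))
}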
Example: Indexing by all subgraphs up to k vertices
Properties (Shervashidze et al., 2009)
Naive enumeration scales as $O(n^k)$.
Enumeration of connected graphlets in $O(n d^{k-1})$ for graphs with degree $\leq d$ and $k \leq 5$.
Randomly sample subgraphs if enumeration is infeasible.
Walks
Definition
A walk of a graph $(V, E)$ is a sequence of vertices $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.
We note $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and $\mathcal{W}(G)$ the set of all walks.
Walks $\neq$ paths: a walk may visit the same vertex several times.
Walk kernel
Definition
Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$. For any graph $G$, let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$.
Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w) \, \mathbf{1} \left( s \text{ is the label sequence of } w \right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2).$$
Walk kernel examples
The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have:
$$K(G_1, G_2) = P \left( \text{label}(W_1) = \text{label}(W_2) \right),$$
where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.
Product graph
Definition
Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:
1 $V = \left\{ (v_1, v_2) \in V_1 \times V_2 \, : \, v_1 \text{ and } v_2 \text{ have the same label} \right\}$,
2 $E = \left\{ \left( (v_1, v_2), (v_1', v_2') \right) \in V \times V \, : \, (v_1, v_1') \in E_1 \text{ and } (v_2, v_2') \in E_2 \right\}$.
(figure: two labeled graphs $G_1$ and $G_2$ and their product graph $G_1 \times G_2$)
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks $w_1 \in \mathcal{W}_n(G_1)$ and $w_2 \in \mathcal{W}_n(G_2)$ with the same label sequences,
2 The walks on the product graph $w \in \mathcal{W}_n(G_1 \times G_2)$.
Corollary
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1) \lambda_{G_2}(w_2) \, \mathbf{1} \left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$
Computation of the nth-order walk kernel
For the $n$th-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} \left[ A^n \right]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.
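A minimal sketch of this computation in R, for graphs given as adjacency matrices with vertex label vectors (all function names are assumptions for illustration):

nth_order_walk_kernel <- function(A1, lab1, A2, lab2, n) {
  # vertices of the product graph: pairs of vertices with matching labels
  V <- expand.grid(i = seq_along(lab1), j = seq_along(lab2))
  V <- V[lab1[V$i] == lab2[V$j], ]
  m <- nrow(V)
  # adjacency matrix of the product graph
  A <- matrix(0, m, m)
  for (p in seq_len(m)) for (q in seq_len(m))
    A[p, q] <- A1[V$i[p], V$i[q]] * A2[V$j[p], V$j[q]]
  # count common walks of length n: 1' A^n 1
  An <- diag(m)
  for (s in seq_len(n)) An <- An %*% A
  sum(An)
}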
Computation of random and geometric walk kernels
In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of $\lambda_i(v)$ and $\Lambda_t$ be the matrix of $\lambda_t(v, v')$:
$$K_{\text{walk}}(G_1, G_2) = \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i) = \sum_{n=0}^{\infty} \Lambda_i \Lambda_t^n \mathbf{1} = \Lambda_i \left( I - \Lambda_t \right)^{-1} \mathbf{1}$$
Computation in $O(|G_1|^3 |G_2|^3)$.
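A minimal sketch of the geometric case in R, on a product-graph adjacency matrix A built as in the previous sketch (uniform initial weights are an assumption):

geometric_walk_kernel <- function(A, beta) {
  m <- nrow(A)
  Lt <- beta * A                       # Lambda_t = beta * adjacency
  # the series converges iff the spectral radius of Lambda_t is < 1
  stopifnot(max(Mod(eigen(Lt, only.values = TRUE)$values)) < 1)
  li <- rep(1, m)                      # Lambda_i: uniform initial weights
  drop(li %*% solve(diag(m) - Lt, rep(1, m)))   # Lambda_i (I - Lambda_t)^{-1} 1
}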
Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009)
(figure: branching walks, i.e., tree patterns, on a molecular graph)
$$T(v, n+1) = \sum_{R \subset \mathcal{N}(v)} \prod_{v' \in R} \lambda_t(v, v') \, T(v', n).$$
2D Subtree vs walk kernels
(figure: AUC of the walk and subtree kernels on each of 60 NCI cancer cell lines)
Screening of inhibitors for 60 cancer cell lines.
Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), and a combination (M).
(figure: test error on Corel14 for the kernels H, W, TW, wTW, and M)
Performance comparison on Corel14
References
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950. URL http://www.jstor.org/stable/1990404.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi: 10.1109/ICDM.2005.132.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1–8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402–11, 2004. doi: 10.1021/ci034254q.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564–575, Singapore, 2002. World Scientific.
References (cont.)
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444, 2002. URL http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3–35, 2009. doi: 10.1007/s10994-008-5086-2.
A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65–74, 2003.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.