
(1)

Support vector machines and applications in computational biology

Jean-Philippe Vert

Jean-Philippe.Vert@mines.org

(2)

Outline

1 Motivations

2 Linear SVM

3 Nonlinear SVM and kernels

4 Kernels for strings and graphs


(4)

Cancer diagnosis

Problem 1

Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?

(5)

Cancer prognosis

Problem 2

Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?

(6)

Pharmacogenomics / Toxicogenomics

Problem 3

Given the genome of a person, which drug should we give?

(7)

Protein annotation

Data available

Secreted proteins:

MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...

MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...

MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...

...

Non-secreted proteins:

MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...

MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...

MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..

...

Problem 4

Given a newly sequenced protein, is it secreted or not?

(8)

Drug discovery

[Figure: example candidate molecules labeled active or inactive.]

Problem 5

Given a new candidate molecule, is it likely to be active?

(9)

A common topic


(13)

On real data...

(14)

Pattern recognition, aka supervised classification

Challenges

High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models

(15)

Outline

1 Motivations

2 Linear SVM

3 Nonlinear SVM and kernels

4 Kernels for strings and graphs

(16)

Linear classifier


(24)

Which one is better?

(25)

The margin of a linear classifier


(30)

Largest margin classifier (hard-margin SVM)

(31)

Support vectors

(32)

More formally

The training set is a finite set of $n$ data/class pairs:
$$\mathcal{S} = \left\{ (\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n) \right\}, \quad \text{where } \vec{x}_i \in \mathbb{R}^p \text{ and } y_i \in \{-1, 1\}.$$

We assume (for the moment) that the data are linearly separable, i.e., that there exists $(\vec{w}, b) \in \mathbb{R}^p \times \mathbb{R}$ such that:
$$\begin{cases} \vec{w} \cdot \vec{x}_i + b > 0 & \text{if } y_i = 1, \\ \vec{w} \cdot \vec{x}_i + b < 0 & \text{if } y_i = -1. \end{cases}$$

(33)

How to find the largest separating hyperplane?

For a given linear classifier $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:

[Figure: the separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ together with the two margin hyperplanes $\vec{w} \cdot \vec{x} + b = +1$ and $\vec{w} \cdot \vec{x} + b = -1$; points with $\vec{w} \cdot \vec{x} + b > +1$ lie on one side of the tube and points with $\vec{w} \cdot \vec{x} + b < -1$ on the other.]

(34)

The margin is $2/\|\vec{w}\|$

Indeed, the points $\vec{x}_1$ and $\vec{x}_2$ satisfy:
$$\begin{cases} \vec{w} \cdot \vec{x}_1 + b = 0, \\ \vec{w} \cdot \vec{x}_2 + b = 1. \end{cases}$$
By subtracting we get $\vec{w} \cdot (\vec{x}_2 - \vec{x}_1) = 1$, and therefore:
$$\gamma = 2 \left\| \vec{x}_2 - \vec{x}_1 \right\| = \frac{2}{\|\vec{w}\|}.$$

(35)

All training points should be on the correct side of the dotted line

For positive examples ($y_i = 1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \geq 1.$$
For negative examples ($y_i = -1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \leq -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \geq 1.$$

(36)

Finding the optimal hyperplane

Find $(\vec{w}, b)$ which minimize:
$$\|\vec{w}\|^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0.$$

This is a classical quadratic program on $\mathbb{R}^{p+1}$.

(37)

Lagrangian

In order to minimize:
$$\frac{1}{2} \|\vec{w}\|_2^2$$
under the constraints:
$$\forall i = 1, \ldots, n, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L\left( \vec{w}, b, \vec{\alpha} \right) = \frac{1}{2} \|\vec{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right].$$

(38)

Lagrangian

$L\left( \vec{w}, b, \vec{\alpha} \right)$ is convex quadratic in $\vec{w}$. It is minimized for:
$$\nabla_{\vec{w}} L = \vec{w} - \sum_{i=1}^n \alpha_i y_i \vec{x}_i = 0 \implies \vec{w} = \sum_{i=1}^n \alpha_i y_i \vec{x}_i.$$

$L\left( \vec{w}, b, \vec{\alpha} \right)$ is affine in $b$. Its minimum is $-\infty$ except if:
$$\nabla_b L = -\sum_{i=1}^n \alpha_i y_i = 0, \quad \text{i.e., } \sum_{i=1}^n \alpha_i y_i = 0.$$

(39)

Dual function

We therefore obtain the Lagrange dual function:
$$q(\vec{\alpha}) = \inf_{\vec{w} \in \mathbb{R}^p,\, b \in \mathbb{R}} L\left( \vec{w}, b, \vec{\alpha} \right) = \begin{cases} \displaystyle \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j \, \vec{x}_i \cdot \vec{x}_j & \text{if } \sum_{i=1}^n \alpha_i y_i = 0, \\ -\infty & \text{otherwise.} \end{cases}$$

The dual problem is:
$$\text{maximize } q(\vec{\alpha}) \quad \text{subject to } \vec{\alpha} \geq 0.$$

(40)

Dual problem

Find $\vec{\alpha} \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, n$), and
$$\sum_{i=1}^n \alpha_i y_i = 0.$$

This is a quadratic program on $\mathbb{R}^n$, with "box constraints". $\vec{\alpha}$ can be found efficiently using dedicated optimization software.

(41)

Recovering the optimal hyperplane

Once $\vec{\alpha}$ is found, we recover $(\vec{w}, b)$ corresponding to the optimal hyperplane. $\vec{w}$ is given by:
$$\vec{w} = \sum_{i=1}^n \alpha_i y_i \vec{x}_i,$$
and the decision function is therefore:
$$f(\vec{x}) = \vec{w} \cdot \vec{x} + b = \sum_{i=1}^n \alpha_i y_i \, \vec{x}_i \cdot \vec{x} + b. \tag{1}$$
($b$ can be recovered from any support vector $\vec{x}_i$, for which $y_i \left( \vec{w} \cdot \vec{x}_i + b \right) = 1$.)

(42)

Interpretation: support vectors

[Figure: the support vectors are the training points with $\alpha_i > 0$; they lie on the margin. Points with $\alpha_i = 0$ lie strictly outside the margin and do not affect the solution.]

(43)

What if data are not linearly separable?


(47)

Soft-margin SVM

Find a trade-off between large margin and few errors.

Mathematically:
$$\min_f \left\{ \frac{1}{\text{margin}(f)} + C \times \text{errors}(f) \right\}$$
$C$ is a parameter.

(48)

Soft-margin SVM formulation

The margin of a labeled point $(\vec{x}, y)$ is
$$\text{margin}(\vec{x}, y) = y \left( \vec{w} \cdot \vec{x} + b \right).$$
The error is
$$\begin{cases} 0 & \text{if } \text{margin}(\vec{x}, y) > 1, \\ 1 - \text{margin}(\vec{x}, y) & \text{otherwise.} \end{cases}$$
The soft-margin SVM solves:
$$\min_{\vec{w}, b} \left\{ \|\vec{w}\|^2 + C \sum_{i=1}^n \max\left( 0, 1 - y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \right) \right\}$$

(49)

Soft-margin SVM and hinge loss

$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^n \ell_{\text{hinge}}\left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \|\vec{w}\|_2^2 \right\},$$
for $\lambda = 1/C$ and the hinge loss function:
$$\ell_{\text{hinge}}(u, y) = \max(1 - yu, 0) = \begin{cases} 0 & \text{if } yu \geq 1, \\ 1 - yu & \text{otherwise.} \end{cases}$$

[Figure: the hinge loss $\ell(f(x), y)$ plotted against $yf(x)$: zero for $yf(x) \geq 1$, increasing linearly as $yf(x)$ decreases below 1.]
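A two-line R check of the hinge loss formula above (the helper name hinge is illustrative):

# Hinge loss l_hinge(u, y) = max(1 - y*u, 0)
hinge <- function(u, y) pmax(1 - y * u, 0)
hinge(u = c(2, 0.3, -1), y = c(1, 1, -1))   # 0.0 0.7 0.0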

(50)

Dual formulation of soft-margin SVM (exercise)

Maximize
$$L(\vec{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$
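As an illustration, the following R sketch solves this box-constrained dual with a generic QP solver and then recovers $(\vec{w}, b)$ and the decision function of equation (1). It assumes the quadprog package; the toy data, the value of C and the helper names are illustrative, not taken from the slides.

# Soft-margin SVM dual solved with quadprog (illustrative sketch).
library(quadprog)
set.seed(1)
n <- 40; p <- 2
x <- rbind(matrix(rnorm(n/2 * p, mean =  2), ncol = p),
           matrix(rnorm(n/2 * p, mean = -2), ncol = p))
y <- rep(c(1, -1), each = n/2)
C <- 1
Q <- (y %*% t(y)) * (x %*% t(x))                # Q_ij = y_i y_j x_i . x_j
D <- Q + diag(1e-8, n)                          # tiny ridge: solve.QP needs a positive definite matrix
# solve.QP minimizes 1/2 a'Da - d'a subject to t(Amat) a >= bvec (first meq rows are equalities),
# which here encodes: maximize L(alpha) s.t. sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C.
sol <- solve.QP(Dmat = D, dvec = rep(1, n),
                Amat = cbind(y, diag(n), -diag(n)),
                bvec = c(0, rep(0, n), rep(-C, n)),
                meq  = 1)
alpha <- sol$solution
w  <- colSums(alpha * y * x)                    # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-5 & alpha < C - 1e-5)    # unbounded support vectors lie on the margin
b  <- mean(y[sv] - x[sv, , drop = FALSE] %*% w) # y_i (w.x_i + b) = 1 on the margin
f  <- function(xnew) as.numeric(xnew %*% w + b) # decision function f(x) = w.x + b

In practice one would rather call a dedicated SVM solver (such as kernlab's ksvm, used later in these slides) than a generic QP solver, but the sketch makes the correspondence with the dual explicit.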

(51)

Interpretation: bounded and unbounded support vectors

[Figure: points with $\alpha_i = 0$ lie outside the margin; unbounded support vectors with $0 < \alpha_i < C$ lie exactly on the margin; bounded support vectors with $\alpha_i = C$ lie inside the margin or are misclassified.]

(52)

Primal (for large n) vs dual (for large p) optimization

1 Find $(\vec{w}, b) \in \mathbb{R}^{p+1}$ which solve:
$$\min_{\vec{w}, b} \left\{ \sum_{i=1}^n \ell_{\text{hinge}}\left( \vec{w} \cdot \vec{x}_i + b, y_i \right) + \lambda \|\vec{w}\|_2^2 \right\}.$$

2 Find $\vec{\alpha} \in \mathbb{R}^n$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$

(53)

Outline

1 Motivations

2 Linear SVM

3 Nonlinear SVM and kernels

4 Kernels for strings and graphs

(54)

Sometimes linear methods are not interesting

(55)

Solution: nonlinear mapping to a feature space

[Figure: a disc of radius $R$ in the $(x_1, x_2)$ plane becomes a half-plane in the $(x_1^2, x_2^2)$ feature space.]

For $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. The decision function is:
$$f(\vec{x}) = x_1^2 + x_2^2 - R^2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix} - R^2 = \vec{w}^\top \Phi(\vec{x}) + b,$$
with $\vec{w} = (1, 1)^\top$ and $b = -R^2$.

(56)

Kernel = inner product in the feature space

Definition

For a given mapping
$$\Phi : \mathcal{X} \to \mathcal{H}$$
from the space of objects $\mathcal{X}$ to some Hilbert space of features $\mathcal{H}$, the kernel between two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x)^\top \Phi(x').$$

(57)

Example

Let $\mathcal{X} = \mathcal{H} = \mathbb{R}^2$ and for $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ let $\Phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}$. Then
$$K(\vec{x}, \vec{x}') = \Phi(\vec{x})^\top \Phi(\vec{x}') = (x_1)^2 (x_1')^2 + (x_2)^2 (x_2')^2.$$
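A two-line R check of this identity on the toy feature map above (function names are illustrative):

# Check K(x, x') = <Phi(x), Phi(x')> for Phi(x) = (x1^2, x2^2).
Phi <- function(x) c(x[1]^2, x[2]^2)
K   <- function(x, xp) x[1]^2 * xp[1]^2 + x[2]^2 * xp[2]^2
x <- c(1, 2); xp <- c(3, -1)
all.equal(sum(Phi(x) * Phi(xp)), K(x, xp))   # TRUE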

(58)

The kernel tricks

2 tricks

1 Many linear algorithms (in particular linear SVM) can be performed in the feature space of $\Phi(x)$ without explicitly computing the images $\Phi(x)$, but instead by computing the kernels $K(x, x')$.

2 It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces: $K(x, x')$ is often much simpler to compute than $\Phi(x)$ and $\Phi(x')$.

(59)

Trick 1 : SVM in the original space

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, x_i^\top x_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function
$$f(x) = \sum_{i=1}^n \alpha_i y_i \, x_i^\top x + b.$$

(60)

Trick 1 : SVM in the feature space

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \Phi(x_i)^\top \Phi(x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function
$$f(x) = \sum_{i=1}^n \alpha_i y_i \, \Phi(x_i)^\top \Phi(x) + b.$$

(61)

Trick 1 : SVM in the feature space with a kernel

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function
$$f(x) = \sum_{i=1}^n \alpha_i y_i \, K(x_i, x) + b.$$

(62)

Trick 2 illustration: polynomial kernel

[Figure: the mapping from the $(x_1, x_2)$ plane to the $(x_1^2, x_2^2)$ feature space, as before.]

For $x = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3$:
$$K(x, x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2 = \left( x^\top x' \right)^2.$$
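A small numerical check of this identity (names illustrative):

# Check (x.x')^2 = <Phi(x), Phi(x')> for Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
Phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
x <- c(1, 2); xp <- c(3, -1)
c(sum(Phi(x) * Phi(xp)), sum(x * xp)^2)   # both equal 1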

(63)

Trick 2 illustration: polynomial kernel

[Figure: the same mapping as on the previous slide.]

More generally, for $x, x' \in \mathbb{R}^p$,
$$K(x, x') = \left( x^\top x' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).

(64)

Combining tricks: learn a polynomial discrimination rule with SVM

Train the SVM by maximizing
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \left( x_i^\top x_j + 1 \right)^d,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, n, \\ \sum_{i=1}^n \alpha_i y_i = 0. \end{cases}$$
Predict with the decision function
$$f(x) = \sum_{i=1}^n \alpha_i y_i \left( x_i^\top x + 1 \right)^d + b.$$

(65)

Illustration: toy nonlinear problem
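The toy data x and y used in the R snippets of this and the following slides are not included in the extracted slides; a minimal sketch that generates a comparable nonlinearly separable two-class sample, purely for illustration, is:

# Hypothetical toy data: an inner disc (class +1) surrounded by a ring (class -1).
set.seed(123)
n <- 150
r     <- c(runif(n/3, 0, 1), runif(2 * n/3, 2, 3))   # radii of the two groups
theta <- runif(n, 0, 2 * pi)
x <- cbind(x1 = r * cos(theta), x2 = r * sin(theta))
y <- c(rep(1, n/3), rep(-1, 2 * n/3))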

> plot(x,col=ifelse(y>0,1,2),pch=ifelse(y>0,1,2))

[Figure: "Training data" scatter plot of the two classes in the $(x_1, x_2)$ plane.]

(66)

Illustration: toy nonlinear problem, linear SVM

> library(kernlab)

> svp <- ksvm(x,y,type="C-svc",kernel='vanilladot')

> plot(svp,data=x)

[Figure: "SVM classification plot" showing the decision values of the linear SVM in the $(x_1, x_2)$ plane.]

(67)

Illustration: toy nonlinear problem, polynomial SVM

> svp <- ksvm(x,y,type="C-svc",kernel=polydot(degree=2))

> plot(svp,data=x)

[Figure: "SVM classification plot" showing the decision values of the degree-2 polynomial SVM in the $(x_1, x_2)$ plane.]

(68)

Which functions K(x, x') are kernels?

Definition

A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping
$$\Phi : \mathcal{X} \to \mathcal{H},$$
such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle_{\mathcal{H}}.$$

(69)

Positive Definite (p.d.) functions

Definition

A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^N \sum_{j=1}^N a_i a_j K(x_i, x_j) \geq 0.$$

(70)

Kernels are p.d. functions

Theorem (Aronszajn, 1950)

$K$ is a kernel if and only if it is a positive definite function.

(71)

Proof?

Kernel $\implies$ p.d. function:
$$\left\langle \Phi(x), \Phi(x') \right\rangle_{\mathbb{R}^d} = \left\langle \Phi(x'), \Phi(x) \right\rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^N \sum_{j=1}^N a_i a_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle_{\mathbb{R}^d} = \left\| \sum_{i=1}^N a_i \Phi(x_i) \right\|_{\mathbb{R}^d}^2 \geq 0.$$

P.d. function $\implies$ kernel: more difficult...

(72)

Example: SVM with a Gaussian kernel

Training:
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \exp\left( -\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2} \right)$$
$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \text{and} \quad \sum_{i=1}^n \alpha_i y_i = 0.$$

Prediction:
$$f(\vec{x}) = \sum_{i=1}^n \alpha_i y_i \exp\left( -\frac{\|\vec{x} - \vec{x}_i\|^2}{2\sigma^2} \right) + b$$
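For instance, with kernlab on the toy data generated earlier, a Gaussian-kernel SVM can be trained as follows (the sigma and C values are illustrative choices):

# Gaussian (RBF) kernel SVM with kernlab; sigma and C are arbitrary illustrative values.
library(kernlab)
svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", kpar = list(sigma = 0.5), C = 10)
plot(svp, data = x)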

(73)

Example: SVM with a Gaussian kernel

$$f(\vec{x}) = \sum_{i=1}^n \alpha_i y_i \exp\left( -\frac{\|\vec{x} - \vec{x}_i\|^2}{2\sigma^2} \right) + b$$

[Figure: "SVM classification plot" showing the nonlinear decision function of the Gaussian-kernel SVM on the toy data.]

(74)

Linear vs nonlinear SVM

(75)

Regularity vs data fitting trade-off

(76)

C controls the trade-off

$$\min_f \left\{ \frac{1}{\text{margin}(f)} + C \times \text{errors}(f) \right\}$$

(77)

Why it is important to control the trade-off

(78)

How to choose C in practice

Split your dataset in two ("train" and "test")
Train SVMs with different C on the "train" set
Compute the accuracy of the SVM on the "test" set
Choose the C which minimizes the "test" error
(you may repeat this several times = cross-validation)
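A minimal R sketch of this procedure on the toy data generated earlier (a single train/test split; the grid of C values and the 70/30 split are arbitrary illustrative choices):

# Pick C on a held-out split.
library(kernlab)
set.seed(1)
idx  <- sample(nrow(x), round(0.7 * nrow(x)))   # "train" indices, the rest is "test"
Cs   <- 10^(-2:3)
errs <- sapply(Cs, function(C) {
  svp <- ksvm(x[idx, ], y[idx], type = "C-svc", kernel = "rbfdot", C = C)
  mean(predict(svp, x[-idx, ]) != y[-idx])      # "test" error
})
Cs[which.min(errs)]                              # chosen C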

(79)

SVM summary

Large margin

Linear or nonlinear (with the kernel trick)

Control of the regularization / data fitting trade-off with C

(80)

Outline

1 Motivations

2 Linear SVM

3 Nonlinear SVM and kernels

4 Kernels for strings and graphs

(81)

Supervised sequence classification

Data (training)

Secreted proteins:

MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...

MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...

MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...

...

Non-secreted proteins:

MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...

MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...

MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..

...

Goal

Build a classifier to predict whether new proteins are secreted or not.

(82)

String kernels

The idea

Map each string $x \in \mathcal{X}$ to a vector $\Phi(x) \in \mathcal{F}$.

Train a classifier for vectors on the images $\Phi(x_1), \ldots, \Phi(x_n)$ of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...).

[Figure: the map $\Phi$ sends each protein string in $\mathcal{X}$ (maskat..., marssl..., malhtv..., mahtlg..., msises..., mappsv...) to a point in the vector space $\mathcal{F}$.]

(83)

Example: substring indexation

The approach

Index the feature space by fixed-length strings, i.e.,
$$\Phi(x) = \left( \Phi_u(x) \right)_{u \in \mathcal{A}^k},$$
where $\Phi_u(x)$ can be:

the number of occurrences of $u$ in $x$ (without gaps): spectrum kernel (Leslie et al., 2002)
the number of occurrences of $u$ in $x$ up to $m$ mismatches (without gaps): mismatch kernel (Leslie et al., 2004)
the number of occurrences of $u$ in $x$ allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002)

(84)

Spectrum kernel (1/2)

Kernel definition

The 3-spectrum of
$$x = \mathtt{CGGSLIAMMWFGV}$$
is:
$$(\mathtt{CGG}, \mathtt{GGS}, \mathtt{GSL}, \mathtt{SLI}, \mathtt{LIA}, \mathtt{IAM}, \mathtt{AMM}, \mathtt{MMW}, \mathtt{MWF}, \mathtt{WFG}, \mathtt{FGV}).$$
Let $\Phi_u(x)$ denote the number of occurrences of $u$ in $x$. The $k$-spectrum kernel is:
$$K(x, x') := \sum_{u \in \mathcal{A}^k} \Phi_u(x)\, \Phi_u(x').$$
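A minimal R sketch of this kernel computed directly from the definition (helper names are illustrative; this naive version does not use the pre-indexation discussed on the next slide):

# Naive k-spectrum kernel: count common k-mers between two strings.
spectrum <- function(x, k) {
  starts <- 1:(nchar(x) - k + 1)
  table(substring(x, starts, starts + k - 1))   # Phi_u(x): count of each k-mer u in x
}
k_spectrum <- function(x, xp, k = 3) {
  px <- spectrum(x, k); py <- spectrum(xp, k)
  u  <- intersect(names(px), names(py))         # only shared k-mers contribute to the sum
  sum(as.numeric(px[u]) * as.numeric(py[u]))
}
k_spectrum("CGGSLIAMMWFGV", "CLIVMMWFGVCGG", k = 3)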

(85)

Spectrum kernel (2/2)

Implementation

The computation of the kernel is formally a sum over $|\mathcal{A}|^k$ terms, but at most $|x| - k + 1$ terms are non-zero in $\Phi(x)$ $\implies$ computation in $O(|x| + |x'|)$ with pre-indexation of the strings.

Fast classification of a sequence $x$ in $O(|x|)$:
$$f(x) = \vec{w} \cdot \Phi(x) = \sum_u w_u \Phi_u(x) = \sum_{i=1}^{|x|-k+1} w_{x_i \ldots x_{i+k-1}}.$$

Remarks

Works with any strings (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of $k$-mers up to $m$ mismatches.

(86)

Local alignment kernel (Saigo et al., 2004)

CGGSLIAMM----WFGV
|...|||||....||||
C---LIVMMNRLMWFGV

$$s_{S,g}(\pi) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2\,S(M,M) + S(W,W) + S(F,F) + S(G,G) + S(V,V) - g(3) - g(4)$$
$$SW_{S,g}(x, y) := \max_{\pi \in \Pi(x,y)} s_{S,g}(\pi) \quad \text{is not a kernel.}$$
$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x,y)} \exp\left( \beta\, s_{S,g}(x, y, \pi) \right) \quad \text{is a kernel.}$$

(87)

LA kernel is p.d.: proof (1/2)

Definition: Convolution kernel (Haussler, 1999)

Let $K_1$ and $K_2$ be two p.d. kernels for strings. The convolution of $K_1$ and $K_2$, denoted $K_1 \star K_2$, is defined for any $x, y \in \mathcal{X}$ by:
$$K_1 \star K_2(x, y) := \sum_{x_1 x_2 = x,\; y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2).$$

Lemma

If $K_1$ and $K_2$ are p.d. then $K_1 \star K_2$ is p.d.

(88)

LA kernel is p.d.: proof (2/2)

$$K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{(n-1)} \star K_a^{(\beta)} \star K_0,$$
with

The constant kernel:
$$K_0(x, y) := 1.$$
A kernel for letters:
$$K_a^{(\beta)}(x, y) := \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\left( \beta\, S(x, y) \right) & \text{otherwise.} \end{cases}$$
A kernel for gaps:
$$K_g^{(\beta)}(x, y) = \exp\left[ \beta \left( g(|x|) + g(|y|) \right) \right].$$

(89)

The choice of kernel matters

[Figure: number of SCOP families with a given performance as a function of the ROC50 score, for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher.]

Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).

(90)

Virtual screening for drug discovery

[Figure: example molecules from the screen labeled active or inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

(91)

Image retrieval and classification

From Harchaoui and Bach (2007).

(92)

Graph kernels

1 Represent each graph $x$ by a vector $\Phi(x) \in \mathcal{H}$, either explicitly or implicitly through the kernel
$$K(x, x') = \Phi(x)^\top \Phi(x').$$

2 Use a linear method for classification in $\mathcal{H}$.


(95)

Indexing by all subgraphs?

Theorem

Computing all subgraph occurrences is NP-hard.

Proof.

The linear graph of size $n$ is a subgraph of a graph $X$ with $n$ vertices iff $X$ has a Hamiltonian path.
The decision problem whether a graph has a Hamiltonian path is NP-complete.


(98)

Indexing by specific subgraphs

Substructure selection

We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list):

substructures selected by domain knowledge (MDL fingerprint)
all paths up to length $k$ (Openeye fingerprint, Nicholls 2005)
all shortest paths (Borgwardt and Kriegel, 2005)
all subgraphs up to $k$ vertices (graphlet kernel, Shervashidze et al., 2009)
all frequent subgraphs in the database (Helma et al., 2004)

(99)

Example: Indexing by all shortest paths

[Figure: a graph with vertex labels A and B, and its feature vector indexed by all shortest paths: (0, ..., 0, 2, 0, ..., 0, 1, 0, ...).]

Properties (Borgwardt and Kriegel, 2005)

There are $O(n^2)$ shortest paths.
The vector of counts can be computed in $O(n^4)$ with the Floyd-Warshall algorithm.
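For reference, a compact (unoptimized) R sketch of the Floyd-Warshall computation of all shortest-path lengths from a 0/1 adjacency matrix (the helper name is illustrative); the shortest-path features then count, for each pair of endpoint labels and each length, the corresponding entries of the distance matrix:

# All-pairs shortest-path lengths by Floyd-Warshall; A is a 0/1 adjacency matrix.
floyd_warshall <- function(A) {
  n <- nrow(A)
  D <- ifelse(A > 0, 1, Inf); diag(D) <- 0      # an edge has length 1, no edge = Inf
  for (k in 1:n) for (i in 1:n) for (j in 1:n)
    D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
  D                                              # D[i, j] = shortest-path length from i to j
}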


(101)

Example: Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)

Naive enumeration scales as $O(n^k)$.
Enumeration of connected graphlets in $O(n d^{k-1})$ for graphs with degree $\leq d$ and $k \leq 5$.
Randomly sample subgraphs if enumeration is infeasible.


(103)

Walks

Definition

A walk of a graph $(V, E)$ is a sequence $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.
We note $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and $\mathcal{W}(G)$ the set of all walks.

[Figure: examples of walks of increasing length on a small graph, etc.]

(104)

Walks ≠ paths

(105)

Walk kernel

Definition

Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$.
For any graph $G$, let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$.
Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w)\, \mathbf{1}\left( s \text{ is the label sequence of } w \right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2).$$


(107)

Walk kernel examples

The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.

The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have:
$$K(G_1, G_2) = P\left( \text{label}(W_1) = \text{label}(W_2) \right),$$
where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).

The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).


(110)

Computation of walk kernels

Proposition

These three kernels ($n$th-order, random and geometric walk kernels) can be computed efficiently in polynomial time.

(111)

Product graph

Definition

Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:

1 $V = \left\{ (v_1, v_2) \in V_1 \times V_2 \,:\, v_1 \text{ and } v_2 \text{ have the same label} \right\}$,
2 $E = \left\{ \left( (v_1, v_2), (v_1', v_2') \right) \in V \times V \,:\, (v_1, v_1') \in E_1 \text{ and } (v_2, v_2') \in E_2 \right\}$.

[Figure: two labeled graphs $G_1$ (vertices 1, 2, 3, 4) and $G_2$ (vertices a, b, c, d, e) and their product graph $G_1 \times G_2$ with vertices 1a, 1b, 2a, 2b, 1d, 2d, 3c, 4c, 3e, 4e.]

(112)

Walk kernel and product graph

Lemma

There is a bijection between:

1 The pairs of walks $w_1 \in \mathcal{W}_n(G_1)$ and $w_2 \in \mathcal{W}_n(G_2)$ with the same label sequences,
2 The walks on the product graph $w \in \mathcal{W}_n(G_1 \times G_2)$.

Corollary
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1)\, \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1)\, \lambda_{G_2}(w_2)\, \mathbf{1}\left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$


(114)

Computation of the nth-order walk kernel

For the $n$th-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} \left[ A^n \right]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.
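A small R sketch of this computation (the helper walk_kernel_nth is illustrative, not a library function; it uses vertex labels only, edge labels are ignored):

# nth-order walk kernel via the product graph: 1' A^n 1.
# A1, A2: 0/1 adjacency matrices; l1, l2: vertex label vectors.
walk_kernel_nth <- function(A1, l1, A2, l2, n) {
  keep <- as.vector(t(outer(l1, l2, "==")))   # product vertices = label-matching pairs,
                                              # in the same (v1 outer, v2 inner) order as kronecker()
  Ax <- kronecker(A1, A2)[keep, keep, drop = FALSE]
  M  <- diag(nrow(Ax))
  for (s in seq_len(n)) M <- M %*% Ax         # M = Ax^n
  sum(M)                                      # number of pairs of walks of length n with equal labels
}

# Toy usage: a triangle and a path graph, all vertices labeled "C".
A1 <- matrix(c(0,1,1, 1,0,1, 1,1,0), 3, 3)
A2 <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3)
walk_kernel_nth(A1, rep("C", 3), A2, rep("C", 3), n = 2)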

(115)

Computation of random and geometric walk kernels

In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^n \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of the $\lambda_i(v)$ and $\Lambda_t$ be the matrix of the $\lambda_t(v, v')$:
$$K_{\text{walk}}(G_1, G_2) = \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^n \lambda_t(v_{i-1}, v_i) = \sum_{n=0}^{\infty} \Lambda_i^\top \Lambda_t^n \mathbf{1} = \Lambda_i^\top \left( I - \Lambda_t \right)^{-1} \mathbf{1}.$$
Computation in $O(|G_1|^3 |G_2|^3)$.
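For the geometric kernel ($\lambda_i \equiv 1$ and $\lambda_t \equiv \beta$ on the edges of the product graph), this matrix-inversion formula reduces to $\mathbf{1}^\top (I - \beta A)^{-1} \mathbf{1}$; a short R sketch (illustrative helper, valid only when $\beta$ is small enough for the series to converge):

# Geometric walk kernel on the product graph; beta must be below 1/spectral-radius of Ax.
geometric_walk_kernel <- function(A1, l1, A2, l2, beta = 0.1) {
  keep <- as.vector(t(outer(l1, l2, "==")))
  Ax <- kronecker(A1, A2)[keep, keep, drop = FALSE]
  N  <- nrow(Ax)
  if (N == 0) return(0)
  sum(solve(diag(N) - beta * Ax))   # 1' (I - beta*Ax)^{-1} 1 = sum over all walk lengths
}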

(116)

Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009)

[Figure: a molecular graph with C, N and O atoms and examples of the tree patterns (branching walks) extracted from it.]

$$T(v, n+1) = \sum_{R \subseteq \mathcal{N}(v)} \prod_{v' \in R} \lambda_t(v, v')\, T(v', n).$$

(117)

2D Subtree vs walk kernels

[Figure: AUC of the walk kernel vs. the subtree kernel on each of 60 cancer cell lines (CCRF-CEM, HL-60(TB), K-562, MOLT-4, ...).]

Screening of inhibitors for 60 cancer cell lines.

(118)

Image classification (Harchaoui and Bach, 2007)

COREL14 dataset

1400 natural images in 14 classes

Compare kernel between histograms (H), walk kernel (W), subtree kernel (TW), weighted subtree kernel (wTW), and a combination (M).

[Figure: "Performance comparison on Corel14": test error for the kernels H, W, TW, wTW and M.]

(119)

References

N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950. URL http://www.jstor.org/stable/1990404.

K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi: 10.1109/ICDM.2005.132.

Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1–8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL http://dx.doi.org/10.1109/CVPR.2007.383049.

D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.

C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402–11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.

C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.

C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564–575, Singapore, 2002. World Scientific.

(120)

References (cont.)

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444, 2002. URL http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.

P. Mahé and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3–35, 2009. doi: 10.1007/s10994-008-5086-2. URL http://dx.doi.org/10.1007/s10994-008-5086-2.

A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.

J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65–74, 2003.

H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.

N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.
