Support vector machines, kernels, and applications in computational biology
Jean-Philippe Vert
Jean-Philippe.Vert@mines-paristech.fr
Mines ParisTech / Institut Curie / Inserm
Mines ParisTech, ES "Machine learning" module, April 1st, 2011.
Outline
1 Machine learning in bioinformatics
2 Linear support vector machines
3 Nonlinear SVM and kernels
4 SVM for complex data: the case of graphs
5 Conclusion
A simple view of cancer progression
Chromosomal aberrations in cancer
Comparative Genomic Hybridization (CGH)
Motivation
Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome
Very useful, in particular in cancer research
Can we classify CGH arrays for diagnosis or prognosis purposes?
Aggressive vs non-aggressive melanoma
[Figure: CGH profiles (log-ratio of DNA copy number along the genome) of several melanomas.]
Problem 1
Given the CGH profile of a melanoma, is it aggressive or not?
DNA → RNA → protein
CGH shows the (static) DNA
Cancer cells also have abnormal (dynamic) gene expression (i.e., transcription)
Tissue profiling with DNA chips
Use in diagnosis
Problem 2
Given the expression profile of a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?
Use in prognosis
Problem 3
Given the expression profile of a breast cancer, is the risk of relapse within 5 years high?
Proteins
A: Alanine, V: Valine, L: Leucine
F: Phenylalanine, P: Proline, M: Methionine
E: Glutamic acid, K: Lysine, R: Arginine
T: Threonine, C: Cysteine, N: Asparagine
H: Histidine, Y: Tyrosine, W: Tryptophan
I: Isoleucine, S: Serine, Q: Glutamine
D: Aspartic acid, G: Glycine
Protein annotation
Data available
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Problem 4
Given a newly sequenced protein, is it secreted or not?
Drug discovery
[Figure: candidate molecules labeled active or inactive.]
Problem 5
Given a new candidate molecule, is it likely to be active?
Pattern recognition, aka supervised classification
[Figure: labeled training examples — CGH profiles and molecules annotated active/inactive.]
Pattern recognition, aka supervised classification
Challenges
High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models
Outline
1 Machine learning in bioinformatics
2 Linear support vector machines
3 Nonlinear SVM and kernels
4 SVM for complex data: the case of graphs
5 Conclusion
Linear classifiers
Which one is better?
The margin of a linear classifier
Largest margin classifier (support vector machines)
Support vectors
More formally
The training set is a finite set of N data/class pairs:
$$\mathcal{S} = \{(x_1, y_1), \ldots, (x_N, y_N)\}, \quad \text{where } x_i \in \mathbb{R}^d \text{ and } y_i \in \{-1, 1\}.$$
We assume (for the moment) that the data are linearly separable, i.e., that there exists $(w, b) \in \mathbb{R}^d \times \mathbb{R}$ such that:
$$\begin{cases} w \cdot x_i + b > 0 & \text{if } y_i = 1, \\ w \cdot x_i + b < 0 & \text{if } y_i = -1. \end{cases}$$
How to find the largest separating hyperplane?
For a given linear classifier $f(x) = w \cdot x + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:
[Figure: the hyperplanes $w \cdot x + b = -1$, $w \cdot x + b = 0$ and $w \cdot x + b = +1$, with the normal vector $w$.]
The margin is $2 / \|w\|$
Indeed, the points $x_1$ and $x_2$ satisfy:
$$\begin{cases} w \cdot x_1 + b = 0, \\ w \cdot x_2 + b = 1. \end{cases}$$
By subtracting we get $w \cdot (x_2 - x_1) = 1$, and since $x_2 - x_1$ is parallel to $w$, we obtain:
$$\gamma = 2 \, \|x_2 - x_1\| = \frac{2}{\|w\|}.$$
All training points should be on the right side of the dotted line
For positive examples ($y_i = 1$) this means:
$$w \cdot x_i + b \ge 1.$$
For negative examples ($y_i = -1$) this means:
$$w \cdot x_i + b \le -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, N, \quad y_i \left( w \cdot x_i + b \right) \ge 1.$$
Finding the optimal hyperplane
Find $(w, b)$ which minimize
$$\|w\|^2$$
under the constraints:
$$\forall i = 1, \ldots, N, \quad y_i \left( w \cdot x_i + b \right) - 1 \ge 0.$$
This is a classical quadratic program on $\mathbb{R}^{d+1}$.
Lagrangian
In order to minimize
$$\frac{1}{2} \|w\|^2$$
under the constraints:
$$\forall i = 1, \ldots, N, \quad y_i \left( w \cdot x_i + b \right) - 1 \ge 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i \left( w \cdot x_i + b \right) - 1 \right).$$
Dual problem
Find $\alpha^* \in \mathbb{R}^N$ which maximizes
$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
under the (simple) constraints $\alpha_i \ge 0$ for $i = 1, \ldots, N$, and
$$\sum_{i=1}^{N} \alpha_i y_i = 0.$$
This is a quadratic program on $\mathbb{R}^N$ with box constraints. $\alpha^*$ can be found efficiently using dedicated optimization software.
Recovering the optimal hyperplane
Once $\alpha^*$ is found, we recover $(w^*, b^*)$ corresponding to the optimal hyperplane. $w^*$ is given by:
$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i \, x_i,$$
and the decision function is therefore:
$$f^*(x) = w^* \cdot x + b^* = \sum_{i=1}^{N} \alpha_i^* y_i \, x_i \cdot x + b^*. \quad (1)$$
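To make the dual problem and the recovery of $(w^*, b^*)$ concrete, here is a minimal numerical sketch (not from the original slides): it maximizes $L(\alpha)$ on a hypothetical toy dataset with SciPy's general-purpose SLSQP solver, then reads off $w^*$, $b^*$ and the support vectors. A dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two linearly separable point clouds in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
N = len(y)

# Q_ij = y_i y_j x_i . x_j, so that L(alpha) = sum(alpha) - 1/2 alpha^T Q alpha
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # we minimize the negative of the dual objective
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=constraints)
alpha = res.x

# Recover the primal solution: w* = sum_i alpha_i y_i x_i, b* from the support vectors
w = (alpha * y) @ X
sv = alpha > 1e-5
b = np.mean(y[sv] - X[sv] @ w)
print("support vectors:", sv.sum(), " w* =", w, " b* =", b)
```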
Interpretation: support vectors
[Figure: support vectors are the points on the margin, with $\alpha_i > 0$; all other training points have $\alpha_i = 0$.]
What if data are not linearly separable?
Soft-margin SVM
Find a trade-off between large margin and few errors.
Mathematically:
$$\min_f \; \frac{1}{\mathrm{margin}(f)} + C \times \mathrm{errors}(f)$$
where $C$ is a parameter.
Soft-margin SVM formulation
The margin of a labeled point $(x, y)$ is
$$\mathrm{margin}(x, y) = y \left( w \cdot x + b \right).$$
The error is 0 if $\mathrm{margin}(x, y) > 1$, and $1 - \mathrm{margin}(x, y)$ otherwise.
The soft-margin SVM solves:
$$\min_{w, b} \; \|w\|^2 + C \sum_{i=1}^{N} \max\left( 0, \, 1 - y_i \left( w \cdot x_i + b \right) \right).$$
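For illustration only, the sketch below minimizes this primal soft-margin objective by subgradient descent; the arrays X and y, the learning rate and the iteration count are all hypothetical choices, and real implementations solve the dual instead.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal soft-margin objective: ||w||^2 + C * sum of hinge losses."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w @ w + C * hinge.sum()

def fit_soft_margin_svm(X, y, C=1.0, lr=1e-3, n_iter=2000):
    """Subgradient descent on the primal soft-margin objective (illustration only)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        active = y * (X @ w + b) < 1            # points with a nonzero hinge loss
        grad_w = 2.0 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```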
Dual formulation of soft-margin SVM
Maximize
$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
under the constraints:
$$0 \le \alpha_i \le C \text{ for } i = 1, \ldots, N, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$$
Interpretation: bounded and unbounded support vectors
[Figure: points outside the margin have $\alpha = 0$, points on the margin have $0 < \alpha < C$ (unbounded support vectors), and points inside the margin have $\alpha = C$ (bounded support vectors).]
Outline
1 Machine learning in bioinformatics
2 Linear support vector machines
3 Nonlinear SVM and kernels
4 SVM for complex data: the case of graphs
5 Conclusion
Sometimes linear classifiers are not interesting
Solution: non-linear mapping to a feature space
[Figure: a circular decision boundary of radius $R$ in the input space $(x_1, x_2)$ becomes a linear boundary in the feature space $(x_1^2, x_2^2)$.]
Let $\Phi(x) = (x_1^2, x_2^2)^\top$, $w = (1, 1)^\top$ and $b = -R^2$. Then the decision function is:
$$f(x) = x_1^2 + x_2^2 - R^2 = w \cdot \Phi(x) + b.$$
Kernel (simple but important)
For a given mapping $\Phi$ from the space of objects $\mathcal{X}$ to some feature space, the kernel of two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x) \cdot \Phi(x').$$
Example: if $\Phi(x) = (x_1^2, x_2^2)^\top$, then
$$K(x, x') = \Phi(x) \cdot \Phi(x') = x_1^2 (x_1')^2 + x_2^2 (x_2')^2.$$
Training a SVM in the feature space
Replace each $x \cdot x'$ in the SVM algorithm by $\Phi(x) \cdot \Phi(x') = K(x, x')$. The dual problem is to maximize
$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
under the constraints:
$$0 \le \alpha_i \le C \text{ for } i = 1, \ldots, N, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$$
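The practical consequence is that only the Gram matrix $K(x_i, x_j)$ is needed for training. As an illustrative sketch (not from the slides), scikit-learn's SVC accepts a precomputed kernel matrix; the data below are hypothetical and the linear Gram matrix merely stands in for any kernel, e.g., a graph kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data; any kernel function could fill the Gram matrix instead
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.where(X[:, 0] > 0, 1, -1)

K_train = X @ X.T                       # Gram matrix K(x_i, x_j) (linear kernel here)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_new = rng.normal(size=(5, 5))
K_new = X_new @ X.T                     # kernel between new points and training points
print(clf.predict(K_new))
```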
Predicting with a SVM in the feature space
The decision function becomes:
$$f(x) = w^* \cdot \Phi(x) + b^* = \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x) + b^*. \quad (2)$$
The kernel trick
The explicit computation of $\Phi(x)$ is not necessary: the kernel $K(x, x')$ is enough, so SVMs work implicitly in the feature space.
It is sometimes possible to easily compute kernels which correspond to complex, high-dimensional feature spaces.
Kernel example: polynomial kernel
[Figure: mapping from the input space $(x_1, x_2)$ to the feature space where the degree-2 polynomial kernel is linear.]
For $x = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3$:
$$K(x, x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2 = \left( x \cdot x' \right)^2.$$
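A quick numerical sanity check of this identity (a sketch, not part of the slides): the explicit feature map and the kernel-trick computation give the same value.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, xp):
    """Same inner product computed with the kernel trick: (x . x')^2."""
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(x) @ phi(xp))        # inner product in the feature space: 30.25
print(poly2_kernel(x, xp))     # identical value without building the feature space
```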
Kernel example: polynomial kernel
More generally,
$$K(x, x') = \left( x \cdot x' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).
Which functions K (x , x 0 ) are kernels?
Definition
A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping
$$\Phi : \mathcal{X} \to \mathcal{H},$$
such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}.$$
Positive Definite (p.d.) functions
Definition
A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K(x_i, x_j) \ge 0.$$
Kernels are p.d. functions
Theorem (Aronszajn, 1950)
K is a kernel if and only if it is a positive definite function.
Proof?
Kernel $\Rightarrow$ p.d. function:
$$\langle \Phi(x), \Phi(x') \rangle_{\mathbb{R}^d} = \langle \Phi(x'), \Phi(x) \rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathbb{R}^d} = \left\| \sum_{i=1}^{N} a_i \Phi(x_i) \right\|_{\mathbb{R}^d}^2 \ge 0.$$
P.d. function $\Rightarrow$ kernel: more difficult...
Kernel examples
Polynomial (on $\mathbb{R}^d$): $K(x, x') = (x \cdot x' + 1)^d$
Gaussian radial basis function (RBF) (on $\mathbb{R}^d$): $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$
Laplace kernel (on $\mathbb{R}$): $K(x, x') = \exp\left( -\gamma |x - x'| \right)$
Min kernel (on $\mathbb{R}_+$): $K(x, x') = \min(x, x')$
Exercise: for each kernel, find a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
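As a sketch, these four kernels can be written directly as small functions; the parameter defaults (d, sigma, gamma) are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def polynomial_kernel(x, xp, d=3):
    """Polynomial kernel (x . x' + 1)^d on R^d vectors."""
    return (x @ xp + 1.0) ** d

def gaussian_rbf_kernel(x, xp, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def laplace_kernel(x, xp, gamma=1.0):
    """Laplace kernel exp(-gamma |x - x'|) on scalars."""
    return np.exp(-gamma * abs(x - xp))

def min_kernel(x, xp):
    """Min kernel min(x, x') on nonnegative scalars."""
    return min(x, xp)
```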
Example: SVM with a Gaussian kernel
Training:
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$$
$$\text{s.t. } 0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$
Prediction:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b.$$
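In practice this program is rarely solved by hand; a minimal usage sketch with scikit-learn (assumed available) is shown below, where gamma corresponds to $1/(2\sigma^2)$ and the ring-shaped toy dataset is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Toy nonlinear problem (hypothetical data): label +1 inside a disc, -1 outside
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.5, 1, -1)

# RBF-kernel SVM; gamma = 1 / (2 * sigma**2) in scikit-learn's parameterization
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors per class:", clf.n_support_)
```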
Example: SVM with a Gaussian kernel
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b$$
[Figure: SVM classification plot — contours of the decision function on a 2D toy dataset, with levels ranging from $-1$ to $1$.]
Linear vs nonlinear SVM
Regularity vs data fitting trade-off
C controls the trade-off
$$\min_f \; \frac{1}{\mathrm{margin}(f)} + C \times \mathrm{errors}(f)$$
Why it is important to control the trade-off
How to choose C in practice
Split your dataset in two ("train" and "test")
Train SVMs with different C on the "train" set
Compute the accuracy of each SVM on the "test" set
Choose the C which minimizes the "test" error
(you may repeat this several times = cross-validation)
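A minimal sketch of this procedure with scikit-learn's GridSearchCV (assumed available); the toy data and the grid of C values are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data; in practice X, y would be CGH or expression profiles
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=100) > 0, 1, -1)

# 5-fold cross-validation over a grid of C values
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"],
      " cross-validated accuracy:", grid.best_score_)
```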
SVM summary
Large margin
Linear or nonlinear (with the kernel trick)
Control of the regularization / data fitting trade-off with C
Outline
1 Machine learning in bioinformatics
2 Linear support vector machines
3 Nonlinear SVM and kernels
4 SVM for complex data: the case of graphs
5 Conclusion
Virtual screening for drug discovery
[Figure: molecular structures labeled active or inactive.]
NCI AIDS screen results (from http://cactus.nci.nih.gov).
Classification with SVM
1. Represent each graph $x$ by a vector $\Phi(x) \in \mathcal{H}$, either explicitly or implicitly through the kernel
$$K(x, x') = \Phi(x)^\top \Phi(x').$$
2. Use a linear method for classification in $\mathcal{H}$.
Example: indexing by substructures
[Figure: two molecular graphs sharing common substructures.]
Often we believe that the presence of particular substructures is an important predictive pattern
Hence it makes sense to represent a graph by features that indicate the presence (or the number of occurrences) of particular substructures
However, detecting the presence of particular substructures may be computationally challenging...
Subgraphs
Definition
A subgraph of a graph (V , E) is a connected graph (V 0 , E 0 ) with V 0 ⊂ V and E 0 ⊂ E .
Indexing by all subgraphs?
Theorem
1. Computing all subgraph occurrences is NP-hard.
2. Computing the subgraph kernel is NP-hard.
Proof.
1. Finding an occurrence of the linear path with $n$ vertices amounts to finding a Hamiltonian path, which is NP-complete.
2. Similarly, if we can compute the subgraph kernel then we can deduce the presence of a Hamiltonian path (left as an exercise).
Paths
Definition
A path of a graph $(V, E)$ is a sequence of distinct vertices $v_1, \ldots, v_n \in V$ (i.e., $i \ne j \Rightarrow v_i \ne v_j$) such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.
Equivalently, the paths are the linear subgraphs.
Indexing by all paths?
[Figure: a labeled graph and its feature vector indexed by the label sequences of all paths.]
Theorem
1. Computing all path occurrences is NP-hard.
2. Computing the path kernel is NP-hard.
Proof.
Same as for subgraphs.
Walks
Definition
A walk of a graph $(V, E)$ is a sequence of vertices $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.
We denote by $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and by $\mathcal{W}(G)$ the set of all walks.
Walks ≠ paths
Walk kernel
Definition
Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \ge 1} \mathcal{S}_n$. For any graph $G$, let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$.
Let the feature vector $\Phi(G) = (\Phi_s(G))_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w) \, \mathbf{1}(s \text{ is the label sequence of } w).$$
A walk kernel is a graph kernel defined by:
$$K_{\mathrm{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2).$$
Walk kernel examples
Examples
The $n$th-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.
The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have:
$$K(G_1, G_2) = P(\mathrm{label}(W_1) = \mathrm{label}(W_2)),$$
where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\mathrm{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.
Product graph
Definition
Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:
1. $V = \{(v_1, v_2) \in V_1 \times V_2 : v_1 \text{ and } v_2 \text{ have the same label}\}$,
2. $E = \left\{ \left( (v_1, v_2), (v_1', v_2') \right) \in V \times V : (v_1, v_1') \in E_1 \text{ and } (v_2, v_2') \in E_2 \right\}$.
[Figure: two labeled graphs $G_1$ and $G_2$ and their product graph $G_1 \times G_2$.]
Walk kernel and product graph
Lemma
There is a bijection between:
1. the pairs of walks $w_1 \in \mathcal{W}_n(G_1)$ and $w_2 \in \mathcal{W}_n(G_2)$ with the same label sequences, and
2. the walks on the product graph $w \in \mathcal{W}_n(G_1 \times G_2)$.
Corollary
$$K_{\mathrm{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2) = \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1) \lambda_{G_2}(w_2) \, \mathbf{1}\left( l(w_1) = l(w_2) \right) = \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w).$$
Computation of the nth-order walk kernel
For the $n$th-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} [A^n]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.
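A small sketch of this computation (not from the slides): it handles vertex labels only, builds the adjacency matrix of the product graph, then evaluates $\mathbf{1}^\top A^n \mathbf{1}$. The toy graphs at the bottom are hypothetical.

```python
import numpy as np
from itertools import product

def product_graph_adjacency(adj1, labels1, adj2, labels2):
    """Adjacency matrix of the product graph of two vertex-labeled graphs.

    Product-graph vertices are pairs (v1, v2) with identical labels;
    (u1, u2)-(v1, v2) is an edge iff u1-v1 and u2-v2 are both edges.
    """
    pairs = [(i, j) for i, j in product(range(len(labels1)), range(len(labels2)))
             if labels1[i] == labels2[j]]
    n = len(pairs)
    A = np.zeros((n, n))
    for a, (i1, j1) in enumerate(pairs):
        for b, (i2, j2) in enumerate(pairs):
            A[a, b] = adj1[i1, i2] * adj2[j1, j2]
    return A

def nth_order_walk_kernel(adj1, labels1, adj2, labels2, n):
    """Number of common walk label sequences of length n: 1^T A^n 1."""
    A = product_graph_adjacency(adj1, labels1, adj2, labels2)
    if A.size == 0:
        return 0.0
    ones = np.ones(len(A))
    return ones @ np.linalg.matrix_power(A, n) @ ones

# Hypothetical toy example: a labeled triangle and a labeled path
adj1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
labels1 = ["C", "C", "O"]
adj2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
labels2 = ["C", "O", "C"]
print(nth_order_walk_kernel(adj1, labels1, adj2, labels2, n=2))
```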
Computation of random and geometric walk kernels
In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of $\lambda_i(v)$ and $\Lambda_t$ be the matrix of $\lambda_t(v, v')$:
$$K_{\mathrm{walk}}(G_1, G_2) = \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i) = \sum_{n=0}^{\infty} \Lambda_i \Lambda_t^n \mathbf{1} = \Lambda_i (I - \Lambda_t)^{-1} \mathbf{1}.$$
Computation in $O(|G_1|^3 |G_2|^3)$.
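Under the simplifying assumptions of unit initial weights and a constant transition weight $\beta$ on every product-graph edge, the geometric kernel reduces to $\mathbf{1}^\top (I - \beta A)^{-1} \mathbf{1}$. The sketch below (not from the slides) reuses product_graph_adjacency from the previous sketch and assumes $\beta$ is small enough for the series to converge.

```python
import numpy as np

def geometric_walk_kernel(adj1, labels1, adj2, labels2, beta=0.1):
    """Geometric walk kernel sketch: sum over walks of beta^(number of edges).

    Requires beta < 1 / (largest eigenvalue of A) for convergence; each term
    (beta * A)^k counts the walks with k edges in the product graph.
    """
    A = product_graph_adjacency(adj1, labels1, adj2, labels2)
    if A.size == 0:
        return 0.0
    n = len(A)
    ones = np.ones(n)
    return ones @ np.linalg.solve(np.eye(n) - beta * A, ones)
```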
Extensions 1: label enrichment
Atom relabeling with the Morgan index
[Figure: a molecule relabeled with the Morgan index — no Morgan indices (all labels equal to 1), order 1 indices, order 2 indices.]
Compromise between fingerprints and structural keys features.
Other relabeling schemes are possible (graph coloring).
Faster computation with more labels (fewer matches implies a smaller product graph).
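A sketch of Morgan-style iterative relabeling on a hypothetical adjacency matrix: starting from 1 on every vertex, one iteration gives the order-1 indices (vertex degrees) and a second iteration gives the order-2 indices.

```python
import numpy as np

def morgan_indices(adj, n_iter=2):
    """Iterative Morgan-style relabeling: start from 1 on every vertex and
    repeatedly replace each index by the sum of its neighbors' indices."""
    idx = np.ones(len(adj), dtype=int)
    for _ in range(n_iter):
        idx = adj @ idx
    return idx

# Hypothetical molecule-like graph
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]])
print(morgan_indices(adj, n_iter=1))   # order 1 indices (vertex degrees)
print(morgan_indices(adj, n_iter=2))   # order 2 indices
```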
Extension 2: Non-tottering walk kernel
Tottering walks
A tottering walk is a walk $w = v_1 \ldots v_n$ with $v_i = v_{i+2}$ for some $i$.
[Figure: a tottering walk vs. a non-tottering walk.]
Tottering walks seem irrelevant for many applications
Focusing on non-tottering walks is a way to get closer to the path kernel (e.g., equivalent on trees).
Computation of the non-tottering walk kernel (Mahé et al., 2005)
Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented graph
Normal walk kernel on the augmented graph (which is always a directed graph)
Extension 3: Subtree kernels
Example: Tree-like fragments of molecules
[Figure: tree-like fragments extracted from a molecular graph.]
Computation of the subtree kernel
Like the walk kernel, this amounts to computing the (weighted) number of subtrees in the product graph.
Recursion: if $T(v, n)$ denotes the weighted number of subtrees of depth $n$ rooted at the vertex $v$, then:
$$T(v, n+1) = \sum_{R \subseteq \mathcal{N}(v)} \prod_{v' \in R} \lambda_t(v, v') \, T(v', n),$$
where $\mathcal{N}(v)$ is the set of neighbors of $v$.
Can be combined with the non-tottering graph transformation as preprocessing to obtain the non-tottering subtree kernel.
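An illustrative sketch of this recursion on a single adjacency matrix (in the actual kernel it is applied to the product graph), assuming a constant transition weight lam and summing over nonempty neighbor subsets.

```python
from itertools import combinations

def subtree_counts(adj, max_depth, lam=1.0):
    """T[v][n]: weighted number of subtrees of depth n rooted at v (a sketch).

    Implements T(v, n+1) = sum over nonempty neighbor subsets R of
    prod_{v' in R} lam * T(v', n), with T(v, 1) = 1 for every vertex.
    """
    n_vertices = len(adj)
    neighbors = [[u for u in range(n_vertices) if adj[v][u]]
                 for v in range(n_vertices)]
    T = [[0.0] * (max_depth + 1) for _ in range(n_vertices)]
    for v in range(n_vertices):
        T[v][1] = 1.0
    for n in range(1, max_depth):
        for v in range(n_vertices):
            total = 0.0
            for k in range(1, len(neighbors[v]) + 1):
                for R in combinations(neighbors[v], k):
                    prod = 1.0
                    for vp in R:
                        prod *= lam * T[vp][n]
                    total += prod
            T[v][n + 1] = total
    return T
```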
Application in chemoinformatics (Mahé et al., 2004)
MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity / no mutagenic activity, assayed in Salmonella typhimurium
188 compounds: 125 + / 63 −
Results (10-fold cross-validation accuracy):
Method: Accuracy
Progol1: 81.4%
2D kernel: 91.2%
2D Subtree vs walk kernels (Mahé and V., 2009)
[Figure: per-cell-line AUC (roughly 70 to 80) of the walk and subtree kernels on 60 NCI cancer cell lines.]
Screening of inhibitors for 60 cancer cell lines.
Summary: graph kernels
What we saw
Kernels do not allow us to overcome the NP-hardness of indexing by subgraph patterns
They allow us to work with approximate subgraphs (walks, subtrees), in infinite dimension, thanks to the kernel trick
They give state-of-the-art results
Outline
1 Machine learning in bioinformatics
2 Linear support vector machines
3 Nonlinear SVM and kernels
4 SVM for complex data: the case of graphs
5 Conclusion
Machine learning in computational biology
Biology faces a flood of data following the development of high-throughput technologies (sequencing, DNA chips, ...)
Many problems can be formalized in the framework of machine learning, e.g.:
Diagnosis, prognosis
Protein annotation
Drug discovery, virtual screening
These data often have complex structures (strings, graphs, high-dimensional vectors) and often require dedicated algorithms.
Support vector machines (SVM)
A general-purpose algorithm for pattern recognition
Based on the principle of large margin ("séparateur à vaste marge")
Linear or nonlinear with the kernel trick
Control of the regularization / data fitting trade-off with the C parameter
State-of-the-art performance on many applications
Kernels
A central ingredient of SVMs
Allows nonlinearity
Allows working implicitly in a high-dimensional feature space
Allows working with structured data (e.g., graphs)