(1)

Support vector machines, kernels, and applications in computational biology

Jean-Philippe Vert

Jean-Philippe.Vert@mines-paristech.fr

Mines ParisTech / Institut Curie / Inserm

Mines ParisTech, ES "Machine learning" module, April 1st, 2011.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 1 / 88

(2)

Outline

1 Machine learning in bioinformatics

2 Linear support vector machines

3 Nonlinear SVM and kernels

4 SVM for complex data: the case of graphs

5 Conclusion

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 2 / 88

(8)

A simple view of cancer progression

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 4 / 88

(9)

Chromosomal aberrations in cancer

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 5 / 88

(10)

Comparative Genomic Hybridization (CGH)

Motivation

Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome

Very useful, in particular in cancer research

Can we classify CGH arrays for diagnosis or prognosis purposes?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 6 / 88

(11)

Aggressive vs non-aggressive melanoma

[Figure: ten CGH profiles (DNA copy-number log-ratios along ~2500 genomic positions) from aggressive and non-aggressive melanomas]

Problem 1

Given the CGH profile of a melanoma, is it aggressive or not?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 7 / 88

(12)

DNA → RNA → protein

CGH shows the (static) DNA

Cancer cells also have abnormal (dynamic) gene expression (= transcription)

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 8 / 88

(13)

Tissue profiling with DNA chips

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 9 / 88

(14)

Use in diagnosis

Problem 2

Given the expression profile of a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 10 / 88

(15)

Use in prognosis

Problem 3

Given the expression profile of a breast cancer, is the risk of relapse within 5 years high?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 11 / 88

(16)

Proteins

A: Alanine        V: Valine       L: Leucine

F: Phenylalanine  P: Proline      M: Methionine

E: Glutamic acid  K: Lysine       R: Arginine

T: Threonine      C: Cysteine     N: Asparagine

H: Histidine      Y: Tyrosine     W: Tryptophan

I: Isoleucine     S: Serine       Q: Glutamine

D: Aspartic acid  G: Glycine

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 12 / 88

(17)

Protein annotation

Data available

Secreted proteins:

MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...

MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...

MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...

...

Non-secreted proteins:

MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...

MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...

MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..

...

Problem 4

Given a newly sequenced protein, is it secreted or not?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 13 / 88

(18)

Drug discovery

[Figure: candidate molecules, each labeled active or inactive]

Problem 5

Given a new candidate molecule, is it likely to be active?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 14 / 88

(19)

Pattern recognition, aka supervised classification

[Figure: labeled training examples — CGH profiles and candidate molecules, each with a class label such as aggressive/non-aggressive or active/inactive]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 15 / 88

(23)

Pattern recognition, aka supervised classification

Challenges

High dimension
Few samples
Structured data
Heterogeneous data
Prior knowledge
Fast and scalable implementations
Interpretable models

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 16 / 88

(24)

Outline

1 Machine learning in bioinformatics

2 Linear support vector machines

3 Nonlinear SVM and kernels

4 SVM for complex data: the case of graphs

5 Conclusion

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 17 / 88

(25)

Linear classifiers

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 18 / 88

(33)

Which one is better?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 19 / 88

(34)

The margin of a linear classifier

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 20 / 88

(39)

Largest margin classifier (support vector machines)

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 21 / 88

(40)

Support vectors

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 22 / 88

(41)

More formally

The training set is a finite set of N data/class pairs:
$$\mathcal{S} = \left\{ (\vec{x}_1, y_1), \ldots, (\vec{x}_N, y_N) \right\},$$
where $\vec{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.

We assume (for the moment) that the data are linearly separable, i.e., that there exists $(\vec{w}, b) \in \mathbb{R}^d \times \mathbb{R}$ such that:
$$\begin{cases} \vec{w} \cdot \vec{x}_i + b > 0 & \text{if } y_i = 1, \\ \vec{w} \cdot \vec{x}_i + b < 0 & \text{if } y_i = -1. \end{cases}$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 23 / 88

(42)

How to find the largest separating hyperplane?

For a given linear classifier $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$, consider the "tube" defined by the values $-1$ and $+1$ of the decision function:

[Figure: the hyperplanes $\vec{w}\cdot\vec{x}+b = -1$, $0$, $+1$; the regions $\vec{w}\cdot\vec{x}+b > +1$ and $\vec{w}\cdot\vec{x}+b < -1$; the points $\vec{x}_1$, $\vec{x}_2$ and the normal vector $\vec{w}$]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 24 / 88

(43)

The margin is $2/\|\vec{w}\|$

Indeed, the points $\vec{x}_1$ and $\vec{x}_2$ satisfy:
$$\begin{cases} \vec{w} \cdot \vec{x}_1 + b = 0, \\ \vec{w} \cdot \vec{x}_2 + b = 1. \end{cases}$$
By subtracting we get $\vec{w} \cdot (\vec{x}_2 - \vec{x}_1) = 1$, and therefore:
$$\gamma = 2\,\|\vec{x}_2 - \vec{x}_1\| = \frac{2}{\|\vec{w}\|}.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 25 / 88

(44)

All training points should be on the right side of the dotted line

For positive examples ($y_i = 1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \geq 1.$$
For negative examples ($y_i = -1$) this means:
$$\vec{w} \cdot \vec{x}_i + b \leq -1.$$
Both cases are summarized by:
$$\forall i = 1, \ldots, N, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) \geq 1.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 26 / 88

(45)

Finding the optimal hyperplane

Find $(\vec{w}, b)$ which minimize
$$\|\vec{w}\|^2$$
under the constraints:
$$\forall i = 1, \ldots, N, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0.$$
This is a classical quadratic program on $\mathbb{R}^{d+1}$.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 27 / 88

(46)

Lagrangian

In order to minimize
$$\frac{1}{2}\|\vec{w}\|^2$$
under the constraints
$$\forall i = 1, \ldots, N, \quad y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0,$$
we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right).$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 28 / 88

(47)

Dual problem

Find $\vec{\alpha} \in \mathbb{R}^N$ which maximizes
$$L(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, N$), and
$$\sum_{i=1}^{N} \alpha_i y_i = 0.$$
This is a quadratic program on $\mathbb{R}^N$, with "box constraints". $\vec{\alpha}$ can be found efficiently using dedicated optimization software.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 29 / 88
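As a minimal illustration of solving this dual in practice (not part of the original slides), the sketch below uses scikit-learn's SVC on a made-up toy dataset and reads back the dual variables; a hard margin is approximated with a very large C.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Hard-margin SVM approximated by a very large C
svm = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
print("support vector indices:", svm.support_)
print("alpha_i * y_i:", svm.dual_coef_)
print("sum_i alpha_i y_i (should be ~0):", svm.dual_coef_.sum())
```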

(48)

Recovering the optimal hyperplane

Once $\vec{\alpha}$ is found, we recover $(\vec{w}, b)$ corresponding to the optimal hyperplane. $\vec{w}$ is given by:
$$\vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i,$$
and the decision function is therefore:
$$f(\vec{x}) = \vec{w} \cdot \vec{x} + b = \sum_{i=1}^{N} \alpha_i y_i \, \vec{x}_i \cdot \vec{x} + b. \qquad (1)$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 30 / 88

(49)

Interpretation: support vectors

[Figure: support vectors have α > 0; all other points have α = 0]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 31 / 88

(50)

What if data are not linearly separable?

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 32 / 88

(54)

Soft-margin SVM

Find a trade-off between large margin and few errors.

Mathematically:

$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$

$C$ is a parameter

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 33 / 88

(55)

Soft-margin SVM formulation

The margin of a labeled point $(\vec{x}, y)$ is
$$\text{margin}(\vec{x}, y) = y \left( \vec{w} \cdot \vec{x} + b \right).$$
The error is
$$\begin{cases} 0 & \text{if } \text{margin}(\vec{x}, y) > 1, \\ 1 - \text{margin}(\vec{x}, y) & \text{otherwise.} \end{cases}$$
The soft-margin SVM solves:
$$\min_{\vec{w}, b} \; \left\{ \|\vec{w}\|^2 + C \sum_{i=1}^{N} \max\left(0,\; 1 - y_i \left( \vec{w} \cdot \vec{x}_i + b \right)\right) \right\}$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 34 / 88
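As a small numerical companion to this formulation (my own sketch, not from the slides), the soft-margin objective can be evaluated directly; the weights, bias and data below are arbitrary placeholders.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return w @ w + C * hinge.sum()

# Arbitrary example values
X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.2, 0.1]])
y = np.array([1, -1, 1])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```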

(56)

Dual formulation of soft-margin SVM

Maximize
$$L(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j,$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, N, \\ \sum_{i=1}^{N} \alpha_i y_i = 0. \end{cases}$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 35 / 88

(57)

Interpretation: bounded and unbounded support vectors

[Figure: unbounded support vectors have 0 < α < C, bounded support vectors have α = C, and all other points have α = 0]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 36 / 88

(58)

Outline

1 Machine learning in bioinformatics

2 Linear support vector machines

3 Nonlinear SVM and kernels

4 SVM for complex data: the case of graphs

5 Conclusion

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 37 / 88

(59)

Sometimes linear classifiers are not interesting

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 38 / 88

(60)

Solution: non-linear mapping to a feature space

[Figure: points inside and outside a circle of radius $R$ in $\mathbb{R}^2$ (coordinates $x_1, x_2$) become linearly separable in the coordinates $(x_1^2, x_2^2)$]

Let $\vec{\Phi}(\vec{x}) = (x_1^2, x_2^2)'$, $\vec{w} = (1, 1)'$ and $b = -R^2$. Then the decision function is:
$$f(\vec{x}) = x_1^2 + x_2^2 - R^2 = \vec{w} \cdot \vec{\Phi}(\vec{x}) + b.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 39 / 88

(61)

Kernel (simple but important)

For a given mapping $\Phi$ from the space of objects $\mathcal{X}$ to some feature space, the kernel of two objects $x$ and $x'$ is the inner product of their images in the feature space:
$$\forall x, x' \in \mathcal{X}, \quad K(x, x') = \vec{\Phi}(x) \cdot \vec{\Phi}(x').$$
Example: if $\vec{\Phi}(\vec{x}) = (x_1^2, x_2^2)'$, then
$$K(\vec{x}, \vec{x}') = \vec{\Phi}(\vec{x}) \cdot \vec{\Phi}(\vec{x}') = (x_1)^2 (x_1')^2 + (x_2)^2 (x_2')^2.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 40 / 88
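A quick numerical check of this example (illustration only, not from the slides): computing K(x, x') directly gives the same value as the inner product of the explicit images Φ(x) and Φ(x').

```python
import numpy as np

def phi(x):
    # Explicit feature map Phi(x) = (x1^2, x2^2)
    return np.array([x[0] ** 2, x[1] ** 2])

def k(x, xp):
    # Kernel computed without building Phi explicitly
    return x[0] ** 2 * xp[0] ** 2 + x[1] ** 2 * xp[1] ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(xp), k(x, xp))
```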

(62)

Training a SVM in the feature space

Replace each $\vec{x} \cdot \vec{x}'$ in the SVM algorithm by $\vec{\Phi}(x) \cdot \vec{\Phi}(x') = K(x, x')$. The dual problem is to maximize
$$L(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
under the constraints:
$$\begin{cases} 0 \leq \alpha_i \leq C, & \text{for } i = 1, \ldots, N, \\ \sum_{i=1}^{N} \alpha_i y_i = 0. \end{cases}$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 41 / 88
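In practice, "replace each dot product by K" can be done by handing the solver a precomputed Gram matrix. A minimal sketch with scikit-learn (my example, with a made-up kernel and toy labels):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy nonlinear labels

def gram(A, B):
    # Example kernel K(x, x') = (x . x')^2 evaluated for all pairs
    return (A @ B.T) ** 2

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(gram(X, X), y)                 # training only sees K(x_i, x_j)
X_new = rng.normal(size=(5, 2))
print(svm.predict(gram(X_new, X)))     # prediction only sees K(x_new, x_i)
```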

(63)

Predicting with a SVM in the feature space

The decision function becomes:
$$f(x) = \vec{w} \cdot \vec{\Phi}(x) + b = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b. \qquad (2)$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 42 / 88

(64)

The kernel trick

The explicit computation of $\vec{\Phi}(x)$ is not necessary: the kernel $K(x, x')$ is enough. SVMs work implicitly in the feature space.

It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 43 / 88

(65)

Kernel example: polynomial kernel

[Figure: the polynomial feature map sends $\mathbb{R}^2$ (coordinates $x_1, x_2$) to the space of degree-2 monomials]

For $\vec{x} = (x_1, x_2)^\top \in \mathbb{R}^2$, let $\vec{\Phi}(\vec{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3$:
$$K(\vec{x}, \vec{x}') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2 = \left( \vec{x} \cdot \vec{x}' \right)^2.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 44 / 88
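A two-line numerical check of this identity (illustration only): the inner product of Φ(x) = (x1², √2 x1x2, x2²) matches (x·x')² on arbitrary points.

```python
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([0.5, -2.0]), np.array([1.5, 3.0])
assert np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2)
```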

(66)

Kernel example: polynomial kernel

More generally,
$$K(\vec{x}, \vec{x}') = \left( \vec{x} \cdot \vec{x}' + 1 \right)^d$$
is an inner product in a feature space of all monomials of degree up to $d$ (left as an exercise).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 45 / 88

(67)

Which functions K (x , x 0 ) are kernels?

Definition

A function $K(x, x')$ defined on a set $\mathcal{X}$ is a kernel if and only if there exists a feature space (Hilbert space) $\mathcal{H}$ and a mapping
$$\Phi : \mathcal{X} \mapsto \mathcal{H},$$
such that, for any $x, x'$ in $\mathcal{X}$:
$$K(x, x') = \left\langle \Phi(x), \Phi(x') \right\rangle_{\mathcal{H}}.$$

[Figure: the mapping $\phi$ from $\mathcal{X}$ to the feature space $\mathcal{F}$]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 46 / 88

(68)

Positive Definite (p.d.) functions

Definition

A positive definite (p.d.) function on the set $\mathcal{X}$ is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is symmetric:
$$\forall (x, x') \in \mathcal{X}^2, \quad K(x, x') = K(x', x),$$
and which satisfies, for all $N \in \mathbb{N}$, $(x_1, x_2, \ldots, x_N) \in \mathcal{X}^N$ and $(a_1, a_2, \ldots, a_N) \in \mathbb{R}^N$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j K(x_i, x_j) \geq 0.$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 47 / 88
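The p.d. condition says every Gram matrix must be symmetric positive semidefinite. A quick empirical check on random points (my sketch, using a Gaussian kernel with an arbitrary bandwidth):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Gram matrix of a Gaussian kernel (bandwidth chosen arbitrarily)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# All eigenvalues of a p.d. kernel's Gram matrix are >= 0 (up to round-off)
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```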

(69)

Kernels are p.d. functions

Theorem (Aronszajn, 1950)

K is a kernel if and only if it is a positive definite function.

[Figure: the mapping $\phi$ from $\mathcal{X}$ to the feature space $\mathcal{F}$]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 48 / 88

(70)

Proof?

Kernel $\Longrightarrow$ p.d. function:
$$\left\langle \Phi(x), \Phi(x') \right\rangle_{\mathbb{R}^d} = \left\langle \Phi(x'), \Phi(x) \right\rangle_{\mathbb{R}^d},$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle_{\mathbb{R}^d} = \Big\| \sum_{i=1}^{N} a_i \Phi(x_i) \Big\|_{\mathbb{R}^d}^2 \geq 0.$$
P.d. function $\Longrightarrow$ kernel: more difficult...

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 49 / 88

(71)

Kernel examples

Polynomial (on $\mathbb{R}^d$):
$$K(x, x') = (x \cdot x' + 1)^d$$
Gaussian radial basis function (RBF) (on $\mathbb{R}^d$):
$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$$
Laplace kernel (on $\mathbb{R}$):
$$K(x, x') = \exp\left( -\gamma |x - x'| \right)$$
Min kernel (on $\mathbb{R}_+$):
$$K(x, x') = \min(x, x')$$

Exercise: for each kernel, find a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 50 / 88
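For concreteness, here is one possible direct implementation of the four example kernels (my own sketch; the parameter values d, sigma, gamma are arbitrary defaults, not prescribed by the slides):

```python
import numpy as np

def poly_kernel(x, xp, d=3):
    return (np.dot(x, xp) + 1.0) ** d

def rbf_kernel(x, xp, sigma=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))

def laplace_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * abs(x - xp))   # x, xp real numbers

def min_kernel(x, xp):
    return min(x, xp)                      # x, xp nonnegative reals
```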

(72)

Example: SVM with a Gaussian kernel

Training:
$$\max_{\vec{\alpha} \in \mathbb{R}^N} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \exp\left( -\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2} \right)$$
$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$
Prediction:
$$f(\vec{x}) = \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\frac{\|\vec{x} - \vec{x}_i\|^2}{2\sigma^2} \right) + b$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 51 / 88
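A minimal training run matching this setting (illustration with made-up data; note that scikit-learn parameterizes the Gaussian kernel as exp(-gamma * ||x - x'||^2), i.e., gamma = 1/(2 sigma^2)):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5, 1, -1)   # nonlinear toy labels

svm = SVC(kernel="rbf", C=1.0, gamma=0.5)   # gamma = 1 / (2 sigma^2)
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
```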

(73)

Example: SVM with a Gaussian kernel

$$f(\vec{x}) = \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\frac{\|\vec{x} - \vec{x}_i\|^2}{2\sigma^2} \right) + b$$

[Figure: SVM classification plot — decision function values of a Gaussian-kernel SVM on a 2D toy dataset]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 52 / 88

(74)

Linear vs nonlinear SVM

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 53 / 88

(75)

Regularity vs data fitting trade-off

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 54 / 88

(76)

C controls the trade-off

$$\min_f \; \frac{1}{\text{margin}(f)} + C \times \text{errors}(f)$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 55 / 88

(77)

Why it is important to control the trade-off

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 56 / 88

(78)

How to choose C in practice

Split your dataset in two ("train" and "test")
Train SVMs with different C on the "train" set
Compute the accuracy of each SVM on the "test" set
Choose the C which minimizes the "test" error

(you may repeat this several times = cross-validation)

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 57 / 88
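This recipe is what cross-validation tools automate. A minimal sketch with scikit-learn's GridSearchCV on made-up data (the grid of C values is an arbitrary choice):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over a grid of C values, on the training set only
grid = GridSearchCV(SVC(kernel="rbf", gamma=0.1),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
print("best C:", grid.best_params_["C"])
print("held-out accuracy:", grid.score(X_test, y_test))
```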

(79)

SVM summary

Large margin

Linear or nonlinear (with the kernel trick)

Control of the regularization / data fitting trade-off with C

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 58 / 88

(80)

Outline

1 Machine learning in bioinformatics

2 Linear support vector machines

3 Nonlinear SVM and kernels

4 SVM for complex data: the case of graphs

5 Conclusion

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 59 / 88

(81)

Virtual screening for drug discovery

inactive active

active

active

inactive

inactive

NCI AIDS screen results (from http://cactus.nci.nih.gov).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 60 / 88

(82)

Classification with SVM

1. Represent each graph $x$ by a vector $\Phi(x) \in \mathcal{H}$, either explicitly or implicitly through the kernel
$$K(x, x') = \Phi(x)^\top \Phi(x').$$
2. Use a linear method for classification in $\mathcal{H}$.

[Figure: the mapping $\phi$ from the space of graphs $\mathcal{X}$ to the feature space $\mathcal{H}$]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 61 / 88

(85)

Example: indexing by substructures

[Figure: example molecules and some of their substructure fragments]

Often we believe that the presence of particular substructures is an important predictive pattern.

Hence it makes sense to represent a graph by features that indicate the presence (or the number of occurrences) of particular substructures.

However, detecting the presence of particular substructures may be computationally challenging...

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 62 / 88

(86)

Subgraphs

Definition

A subgraph of a graph $(V, E)$ is a connected graph $(V', E')$ with $V' \subset V$ and $E' \subset E$.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 63 / 88

(87)

Indexing by all subgraphs?

Theorem

1. Computing all subgraph occurrences is NP-hard.
2. Computing the subgraph kernel is NP-hard.

Proof.

1. Finding an occurrence of the linear path of size n amounts to finding a Hamiltonian path, which is NP-complete.
2. Similarly, if we can compute the subgraph kernel then we can deduce the presence of a Hamiltonian path (left as an exercise).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 64 / 88

(90)

Paths

Definition

A path of a graph $(V, E)$ is a sequence of distinct vertices $v_1, \ldots, v_n \in V$ ($i \neq j \Rightarrow v_i \neq v_j$) such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.

Equivalently, the paths are the linear subgraphs.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 65 / 88

(91)

Indexing by all paths?

[Figure: a vertex-labeled graph and its feature vector indexed by all possible label paths]

Theorem

1. Computing all path occurrences is NP-hard.
2. Computing the path kernel is NP-hard.

Proof.

Same as for subgraphs.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 66 / 88

(94)

Walks

Definition

A walk of a graph $(V, E)$ is a sequence of vertices $v_1, \ldots, v_n \in V$ such that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, n-1$.

We denote by $\mathcal{W}_n(G)$ the set of walks with $n$ vertices of the graph $G$, and by $\mathcal{W}(G)$ the set of all walks.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 67 / 88

(95)

Walks ≠ paths

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 68 / 88

(96)

Walk kernel

Definition

Let $\mathcal{S}_n$ denote the set of all possible label sequences of walks of length $n$ (including vertex and edge labels), and $\mathcal{S} = \cup_{n \geq 1} \mathcal{S}_n$. For any graph $G$, let a weight $\lambda_G(w)$ be associated to each walk $w \in \mathcal{W}(G)$.

Let the feature vector $\Phi(G) = \left( \Phi_s(G) \right)_{s \in \mathcal{S}}$ be defined by:
$$\Phi_s(G) = \sum_{w \in \mathcal{W}(G)} \lambda_G(w) \, \mathbf{1}\left(s \text{ is the label sequence of } w\right).$$
A walk kernel is a graph kernel defined by:
$$K_{\text{walk}}(G_1, G_2) = \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2).$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 69 / 88

(98)

Walk kernel examples

Examples

The nth-order walk kernel is the walk kernel with $\lambda_G(w) = 1$ if the length of $w$ is $n$, 0 otherwise. It compares two graphs through their common walks of length $n$.

The random walk kernel is obtained with $\lambda_G(w) = P_G(w)$, where $P_G$ is a Markov random walk on $G$. In that case we have:
$$K(G_1, G_2) = P\left( \text{label}(W_1) = \text{label}(W_2) \right),$$
where $W_1$ and $W_2$ are two independent random walks on $G_1$ and $G_2$, respectively (Kashima et al., 2003).

The geometric walk kernel is obtained (when it converges) with $\lambda_G(w) = \beta^{\text{length}(w)}$, for $\beta > 0$. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 70 / 88

(101)

Computation of walk kernels

Proposition

These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 71 / 88

(102)

Product graph

Definition

Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs with labeled vertices. The product graph $G = G_1 \times G_2$ is the graph $G = (V, E)$ with:

1. $V = \left\{ (v_1, v_2) \in V_1 \times V_2 : v_1 \text{ and } v_2 \text{ have the same label} \right\}$,
2. $E = \left\{ \left( (v_1, v_2), (v_1', v_2') \right) \in V \times V : (v_1, v_1') \in E_1 \text{ and } (v_2, v_2') \in E_2 \right\}$.

[Figure: two labeled graphs $G_1$ (vertices 1-4) and $G_2$ (vertices a-e) and their product graph $G_1 \times G_2$ (vertices such as 1a, 1b, 2a, 2d, 3c, 3e, 4c, 4e)]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 72 / 88
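A direct implementation of this definition for vertex-labeled graphs given as adjacency matrices (my own sketch; the function and variable names are mine, not from the slides):

```python
import numpy as np

def product_graph(A1, labels1, A2, labels2):
    """Adjacency matrix of G1 x G2 for graphs with labeled vertices.

    Product vertices are the pairs (v1, v2) with identical labels; two product
    vertices are adjacent iff both coordinates are adjacent in their own graph.
    """
    V = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
         if labels1[i] == labels2[j]]
    A = np.zeros((len(V), len(V)), dtype=int)
    for a, (i, j) in enumerate(V):
        for b, (k, l) in enumerate(V):
            if A1[i, k] and A2[j, l]:
                A[a, b] = 1
    return A, V
```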

(103)

Walk kernel and product graph

Lemma

There is a bijection between:

1. the pairs of walks $w_1 \in \mathcal{W}_n(G_1)$ and $w_2 \in \mathcal{W}_n(G_2)$ with the same label sequences,
2. the walks on the product graph $w \in \mathcal{W}_n(G_1 \times G_2)$.

Corollary

$$\begin{aligned} K_{\text{walk}}(G_1, G_2) &= \sum_{s \in \mathcal{S}} \Phi_s(G_1) \Phi_s(G_2) \\ &= \sum_{(w_1, w_2) \in \mathcal{W}(G_1) \times \mathcal{W}(G_2)} \lambda_{G_1}(w_1) \, \lambda_{G_2}(w_2) \, \mathbf{1}\left( l(w_1) = l(w_2) \right) \\ &= \sum_{w \in \mathcal{W}(G_1 \times G_2)} \lambda_{G_1 \times G_2}(w). \end{aligned}$$

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 73 / 88

(105)

Computation of the nth-order walk kernel

For the nth-order walk kernel we have $\lambda_{G_1 \times G_2}(w) = 1$ if the length of $w$ is $n$, 0 otherwise. Therefore:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} 1.$$
Let $A$ be the adjacency matrix of $G_1 \times G_2$. Then we get:
$$K_{n\text{th-order}}(G_1, G_2) = \sum_{i,j} \left[ A^n \right]_{i,j} = \mathbf{1}^\top A^n \mathbf{1}.$$
Computation in $O(n |G_1| |G_2| d_1 d_2)$, where $d_i$ is the maximum degree of $G_i$.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 74 / 88
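With the product-graph adjacency matrix from the earlier sketch, the formula 1^T A^n 1 is one line of numpy (illustration only):

```python
import numpy as np

def nth_order_walk_kernel(A_product, n):
    # 1^T A^n 1: total number of entries of A^n, as in the slide's formula
    return int(np.linalg.matrix_power(A_product, n).sum())
```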

(106)

Computation of random and geometric walk kernels

In both cases $\lambda_G(w)$ for a walk $w = v_1 \ldots v_n$ can be decomposed as:
$$\lambda_G(v_1 \ldots v_n) = \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i).$$
Let $\Lambda_i$ be the vector of $\lambda_i(v)$ and $\Lambda_t$ be the matrix of $\lambda_t(v, v')$:
$$\begin{aligned} K_{\text{walk}}(G_1, G_2) &= \sum_{n=1}^{\infty} \sum_{w \in \mathcal{W}_n(G_1 \times G_2)} \lambda_i(v_1) \prod_{i=2}^{n} \lambda_t(v_{i-1}, v_i) \\ &= \sum_{n=0}^{\infty} \Lambda_i \Lambda_t^n \mathbf{1} \\ &= \Lambda_i (I - \Lambda_t)^{-1} \mathbf{1}. \end{aligned}$$
Computation in $O(|G_1|^3 |G_2|^3)$.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 75 / 88
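A numpy sketch of the closed form Lambda_i (I - Lambda_t)^{-1} 1, here with the geometric choice of a constant transition weight beta on every product-graph edge and uniform start weights (these weight choices are my assumption for illustration; the series converges only when beta times the spectral radius of A is below 1):

```python
import numpy as np

def geometric_walk_kernel(A_product, beta):
    n = A_product.shape[0]
    lam_i = np.ones(n)             # start weights Lambda_i (assumed uniform)
    lam_t = beta * A_product       # transition weights Lambda_t (assumed constant)
    # Lambda_i (I - Lambda_t)^{-1} 1, valid when the geometric series converges
    return float(lam_i @ np.linalg.solve(np.eye(n) - lam_t, np.ones(n)))
```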

(107)

Extension 1: label enrichment

Atom relabeling with the Morgan index

[Figure: a molecule relabeled with Morgan indices — no Morgan indices (all atoms labeled 1), order 1 indices (atom degrees), order 2 indices]

Compromise between fingerprint and structural-key features.

Other relabeling schemes are possible (graph coloring).

Faster computation with more labels (fewer matches implies a smaller product graph).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 76 / 88

(108)

Extension 2: Non-tottering walk kernel

Tottering walks

A tottering walk is a walk $w = v_1 \ldots v_n$ with $v_i = v_{i+2}$ for some $i$.

[Figure: a tottering walk vs a non-tottering walk]

Tottering walks seem irrelevant for many applications.

Focusing on non-tottering walks is a way to get closer to the path kernel (e.g., they are equivalent on trees).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 77 / 88

(109)

Computation of the non-tottering walk kernel (Mahé et al., 2005)

Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented graph
Normal walk kernel on the augmented graph (which is always a directed graph).

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 78 / 88

(110)

Extension 3: Subtree kernels

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 79 / 88

(111)

Example: Tree-like fragments of molecules

[Figure: tree-like fragments of molecules built from C, N and O atoms]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 80 / 88

(112)

Computation of the subtree kernel

Like the walk kernel, this amounts to computing the (weighted) number of subtrees in the product graph.

Recursion: if $T(v, n)$ denotes the weighted number of subtrees of depth $n$ rooted at the vertex $v$, then:
$$T(v, n+1) = \sum_{R \subset \mathcal{N}(v)} \prod_{v' \in R} \lambda_t(v, v') \, T(v', n),$$
where $\mathcal{N}(v)$ is the set of neighbors of $v$.

Can be combined with the non-tottering graph transformation as preprocessing to obtain the non-tottering subtree kernel.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 81 / 88
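A literal (exponential in the vertex degree) transcription of this recursion for a product graph given as an adjacency matrix; treating the sum as running over non-empty subsets R and taking lambda_t constant are my assumptions for this sketch, not details fixed by the slides:

```python
import numpy as np
from itertools import combinations
from functools import lru_cache

def subtree_counts(A_product, depth, lam_t=1.0):
    """T(v, depth) for every vertex v, with T(v, 1) = 1 and
    T(v, n+1) = sum over non-empty subsets R of neighbors(v)
                of prod_{v' in R} lam_t * T(v', n)."""
    neighbors = [tuple(int(u) for u in np.flatnonzero(A_product[v]))
                 for v in range(A_product.shape[0])]

    @lru_cache(maxsize=None)
    def T(v, n):
        if n == 1:
            return 1.0
        total = 0.0
        for size in range(1, len(neighbors[v]) + 1):
            for R in combinations(neighbors[v], size):
                prod = 1.0
                for vp in R:
                    prod *= lam_t * T(vp, n - 1)
                total += prod
        return total

    return np.array([T(v, depth) for v in range(A_product.shape[0])])
```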

(113)

Application in chemoinformatics (Mahé et al., 2004)

MUTAG dataset

aromatic/hetero-aromatic compounds
high mutagenic activity / no mutagenic activity, assayed in Salmonella typhimurium
188 compounds: 125 + / 63 −

Results

10-fold cross-validation accuracy:

Method      Accuracy
Progol1     81.4%
2D kernel   91.2%

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 82 / 88

(114)

2D Subtree vs walk kernels (Mahé and V., 2009)

[Figure: AUC (roughly 70-80%) of walk kernels vs subtree kernels for each of the 60 NCI cancer cell lines]

Screening of inhibitors for 60 cancer cell lines.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 83 / 88

(115)

Summary: graph kernels

What we saw

Kernels do not overcome the NP-hardness of subgraph patterns

They allow us to work with approximate subgraphs (walks, subtrees), in infinite dimension, thanks to the kernel trick

They give state-of-the-art results

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 84 / 88

(116)

Outline

1 Machine learning in bioinformatics

2 Linear support vector machines

3 Nonlinear SVM and kernels

4 SVM for complex data: the case of graphs

5 Conclusion

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 85 / 88

(117)

Machine learning in computational biology

Biology faces a flood of data following the development of high-throughput technologies (sequencing, DNA chips, ...)

Many problems can be formalized in the framework of machine learning, e.g.:

Diagnosis, prognosis
Protein annotation
Drug discovery, virtual screening

These data often have complex structures (strings, graphs, high-dimensional vectors) and often require dedicated algorithms.

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 86 / 88

(118)

Support vector machines (SVM)

A general-purpose algorithm for pattern recognition

Based on the principle of large margin ("séparateur à vaste marge")

Linear or nonlinear with the kernel trick

Control of the regularization / data fitting trade-off with the C parameter

State-of-the-art performance on many applications

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 87 / 88

(119)

Kernels

A central ingredient of SVMs
Allows nonlinearity
Allows working implicitly in a high-dimensional feature space
Allows working with structured data (e.g., graphs)

[Figure: the mapping $\phi$ from $\mathcal{X}$ to the feature space $\mathcal{H}$]

Jean-Philippe Vert (ParisTech) Machine learning in bioinformatics Mines ParisTech 88 / 88
