• Aucun résultat trouvé

Intégration de données hétérogènes Enrichissement

N/A
N/A
Protected

Academic year: 2022

Partager "Intégration de données hétérogènes Enrichissement"

Copied!
33
0
0

Texte intégral

(1)

Intégration de données hétérogènes Enrichissement

Master 2

Bioinformatique et Biologie des Systèmes

(2)

Recoupement de voisinages : approche ensembliste

complexes protéiques voies

métaboliques

Gene Ontology co-citation

localisation chromosomique

domaines

protéiques

ensembles de gènes

(3)

Définitions

• (Identifiants de) gène  ARNm  protéine

G : ensemble des gènes d’un organisme

Fonction de regroupement : relation entre gènes basée sur un indice de similarité.

Ensemble de (gènes) voisins : ensemble de gènes E

⊆ G regroupés par une fonction de regroupement.

Voisinage : sous-ensemble de P(G) formant un ensemble d’ensembles de voisins, V ⊆ P(G),

regroupés par une même fonction de regroupement.

G

E 6 E 5

E 4

E 3

E 1 c d

e f

h

b

g i k

j l a

n

E 2

V = {E 1 ,E 2 ,E 3 ,E 4 ,E 5 ,E 6 } ⊆ P(G)

m

(4)

Représentation d’un voisinage : ordre partiel (poset)

• Un voisinage est un ensemble (d'ensembles de voisins) ordonné par la relation d’inclusion ⊆

h i g j l a b c d e f n k m

E 1 E 2 E 3 E 4 E 5 E 6 G

G

E 5

E 4

E 3

E 1 c d e f

h

b g i k

j l a

n

E 2

E m 6

diagramme de Hasse de V

V = {E 1 ,E 2 ,E 3 ,E 4 ,E 5 ,E 6 }

(5)

Exemple de fonction de regroupement : complexes protéiques

un complexe  un ensemble de protéines

(6)

Exemple de critère de regroupement : voies métaboliques

une voie métabolique  un ensemble de protéines

[Kanehisa et Goto, 2000]

(7)

Exemple de critère de regroupement : données d’expression

Conditions expérimentales

gè ne s

clustering hiérarchique des profils

un cluster  un ensemble de gènes

(8)

Mise en relation des données

• Recherche d’ensembles similaires

Q

ensemble requête Q ⊆ G

Quels sont les ensembles cibles

qui lui sont similaires ?

base de données de voisinages :

ensembles cibles

(9)

Mesure de (dis)similarité

• Loi hypergéométrique : probabilité d’avoir au moins le nombre

d’éléments communs observé entre 2 échantillons issus d’une même population

avec

g = |G| : taille de la population

q = |Q| : taille de l’ensemble requête

t = |T| : taille de l’ensemble cible

c = |Q T| : nombre d’éléments communs

• Autres mesures :

 Loi binomiale

 χ 2

 ratio, pourcentage

( )( )

å ( )

=

- -

=

- ( , , , ) min( q , t )

c

k g

q t g k t q

g k

q t c valeur p

G Q

T

(10)

Mise en relation des données

• Recherche d’ensembles similaires

Q

ensemble requête Q ⊆ G

Quels sont les ensembles cibles

qui lui sont similaires ?

base de données de voisinages :

ensembles cibles

(11)

Tests multiples et significativité des p-valeurs

• Probabilité d’obtenir une p-valeur aussi faible par hasard :

fonction de répartition des p-valeurs

minimales

• Simulations

RandomSet_1, minPi = M1 RandomSet_2, minPi = M2 . .

RandomSet_n, minPi = Mn Étant donnée une p-valeur p Combien ont un meilleur score ?

levure Saccharomyces cerevisiae n=500, q=9, g=5786, KEGG Pathways

x = p-valeur y = p ou rc en ta ge d ’e ns em bl es al éa to ire s ay an t u ne p- va le ur ≤ x

0,0002 0,05

[Barriot et al. 2004]

(12)

Significativité des p-valeurs obtenues

Saccharomyces cerevisiae n=500, q=6-9-200-500-1000,

g=5786, KEGG Pathways x = p-valeur

pr op or tio n d’ en se m bl es re qu êt es a lé at oi re s ay an t u ne p- va le ur ≤ x

Saccharomyces cerevisiae n=500, q=50, g=5786,

GO molecular function, Ferea et al., 1999

x = p-valeur

pr op or tio n d’ en se m bl es re qu êt es a lé at oi re s ay an t u ne p- va le ur ≤ x

(13)

Complex 440.30.10 mRNA splicing

GO Term Description Targetsize elementsCommon

GO:0000398 nuclear mRNA splicing, via spliceosome 84 33 GO:0000377 RNA splicing, via transesterification reactions with

bulged adenosine as nucleophile 84 33

GO:0000375 RNA splicing, via transesterification reactions 88 33

GO:0008380 RNA splicing 99 33

GO:0006397 mRNA processing 108 33

GO:0016071 mRNA metabolism 132 33

GO:0006396 RNA processing 262 34

GO:0016070 RNA metabolism 360 34

GO:0043283 biopolymer metabolism 812 34

GO:0006139 nucleobase, nucleoside, nucleotide and nucleic

acid metabolism 1057 34

GO:0044238 primary metabolism 2191 34

GO:0044237 cellular metabolism 2407 34

GO:0008152 metabolism 2465 34

GO:0000245 spliceosome assembly 10 5

GO:0006461 protein complex assembly 61 5

GO:0006374 nuclear mRNA splicing via U2-type

spliceosome 8 8

GO:0000391 U2-type spliceosome dissembly 2 2

GO:0000390 spliceosome dissembly 2 2

GO:0000370 U2-type nuclear mRNA branch site recognition 2 2

GO:0000348 nuclear mRNA branch site recognition 2 2

GO:0000393 spliceosomal conformational changes to

generate catalytic conformation 3 3

GO:0008152 (34/2465)

GO:0044237 (34/2407) GO:0044238 (34/2191)

GO:0006139 (34/1057)

GO:0016070 (34/360)

GO:0042283 (34/812)

GO:0016071 (33/132)

GO:0006397 (33/108) GO:0000377 (33/84)

GO:0000375 (33/88) GO:0008380 (33/99)

GO:0006396 (34/262)

GO:0000398 (33/84)

GO:0006374 (8/8)

GO:0000245 (5/10) GO:0000393 (3/3)

GO:0000390 (2/2)

GO:0000391 (2/2)

GO:0000370 (2/2)

GO:0000348 (2/2) GO:0006461 (5/61)

(14)

Optimisations

a target set T is pertinent if Q ∩ T ≠

∄ T ′ ∈ N such that T ′ ⊂ T and T ′ ∩ Q = T ∩ Q and

∄ T ′ ∈ N such that T ⊂ T ′ and T ′ − Q = T − Q and

G

S 5 S 4

S 3

S 1 c d e f

h

b g i k

j l a

n

S 2

m S 6

N = {S 1 ,S 2 ,S 3 ,S 4 ,S 5 ,S 6 }

h i g j l a b c d e f n k m

S 1 S 2 S 3 S 4 S 5 S 6 G

Hasse diagram of N

(15)

Pertinence definition

• Q a non empty query set

• N a neighborhood

• a target set T ∈ N

• T pertinent if

Q = {a,b,e}

Q’ = {a,b}

T

2

= {a,b} T

5

= {a,b,x}

T

4

= {a,b,x,y}

T

1

= {a,b,x}

T

6

= {a,x}

T

3

= {a}

N

1

N

2

c = T ∩ Q d = T - Q T = c ∪ d c’ ⊂ c

d’ ⊂ d c’ ⊂ c

d’ = d

c’ = c d’ ⊂ d c’ = c

d’ ⊃ d

c’ ⊃ c

d’ ⊃ d c’ ⊃ c

d’ = d

Local decision

|c| > 0

|d| < min({d parents })

|c| > max({c children })

[Barriot, Dutour, Sherman, 2007, BMC Bioinformatics]

Q ∩ T ≠ ∅ and

T ′ ∈ N such that T ′ ⊂ T and T ′ ∩ Q = T ∩ Q

T ′ ∈ N such that T ⊂ T ′ and T ′ − Q = T − Q and

(16)

Illustration

• Pertinence des comparaisons & redondance des résultats

RNA metabolic process {a,b,c,d,e,f,g}

RNA processing {a,b,c,d,e}

RNA splicing {b,c,d,e}

regulation of RNA splicing

{b,c}

nuclear mRNA splicing

{c,e}

mRNA metabolic process {b,c,d,e,f,g}

modification mRNA {f}

ensemble requête :

gènes d’intérêt comparaisons ensembles cibles

résultats {b,d,e}

mRNA metabolic process {b,c,d,e,f,g}

RNA processing {a,b,c,d,e}

RNA splicing {b,c,d,e}

ensembles cibles similaires

si m ila rit é dé cr oi ss an te

b

d

c e

f

a g

(17)

thousands of elements (genes or proteins) query elements

sets having common elements and that have bad p-values

sets having

common elements and that may have good p-values sets having no common

elements are not interesting

A small portion of the DAG is searched

up to m ill io ns o f t ar ge t s et s in th e D AG o f t he n ei gh bo rh oo d

(18)

Complex 440.30.10 mRNA splicing

GO Term Description Targetsize elementsCommon

GO:0000398 nuclear mRNA splicing, via spliceosome 84 33 GO:0000377 RNA splicing, via transesterification reactions with

bulged adenosine as nucleophile 84 33

GO:0000375 RNA splicing, via transesterification reactions 88 33

GO:0008380 RNA splicing 99 33

GO:0006397 mRNA processing 108 33

GO:0016071 mRNA metabolism 132 33

GO:0006396 RNA processing 262 34

GO:0016070 RNA metabolism 360 34

GO:0043283 biopolymer metabolism 812 34

GO:0006139 nucleobase, nucleoside, nucleotide and nucleic

acid metabolism 1057 34

GO:0044238 primary metabolism 2191 34

GO:0044237 cellular metabolism 2407 34

GO:0008152 metabolism 2465 34

GO:0000245 spliceosome assembly 10 5

GO:0006461 protein complex assembly 61 5

GO:0006374 nuclear mRNA splicing via U2-type

spliceosome 8 8

GO:0000391 U2-type spliceosome dissembly 2 2

GO:0000390 spliceosome dissembly 2 2

GO:0000370 U2-type nuclear mRNA branch site recognition 2 2

GO:0000348 nuclear mRNA branch site recognition 2 2

GO:0000393 spliceosomal conformational changes to

generate catalytic conformation 3 3

GO:0008152 (34/2465)

GO:0044237 (34/2407) GO:0044238 (34/2191)

GO:0006139 (34/1057)

GO:0016070 (34/360)

GO:0042283 (34/812)

GO:0016071 (33/132)

GO:0006397 (33/108) GO:0000377 (33/84)

GO:0000375 (33/88) GO:0008380 (33/99)

GO:0006396 (34/262)

GO:0000398 (33/84)

GO:0006374 (8/8)

GO:0000245 (5/10) GO:0000393 (3/3)

GO:0000390 (2/2)

GO:0000391 (2/2)

GO:0000370 (2/2)

GO:0000348 (2/2) GO:0006461 (5/61)

(19)

Autre approche pour la visualisation et l'interprétation des résultats

• 134 GO Terms, p-value < 0.1 (no FDR)

63 BP, 19 CC, 52 MF

# co mm on ge ne s

# ge ne s w ith th is an no ta tio n i n t he

ge no me GO Te rm

p-v alu e

(20)

Treemap visualisation

• 134 GO Terms, p-value < 0.1 (no FDR)

63 BP, 19 CC, 52 MF

REVIGO [Supek et al., 2011]

(21)

REVIGO scatterplot

(22)

Similarité entre Termes GO

• Node/term information content

with p(term)= freq(term)

MICA ( t 1 , t 2 ): Maximum Information Common Ancestor

sim res ( t 1 , t 2 ) = IC ( MICA ( t 1 , t 2 )) [Resnik, 1995]

sim lin ( t 1 , t 2 )= IC ( MICA ( t 1 , t 2 ))/( IC ( t 1 )+ IC ( t 2 )) [Lin, 1998]

• [Pesquita et al., 2008]

) (

log )

( term p term

IC = -

) , ( ),

( max arg

) ,

( t 1 t 2 IC t t ancestors t 1 t 2

MICA = i i Î

{ }

{ }

å å

È Î

Ç

= Î

) ( )

(

) ( )

( 2

1

2 1

2 1

) (

) ) (

, (

t GO t

GO t

t GO t

GO t

gic IC t

t t IC

t

sim

(23)

REVIGO interactive graph

(24)

Further optimizations (1/2)

• each node has only 1 parent

• Algorithm

• parses the input with a stack of stacks at the time it is loaded

O (| G |) time

( ( ( a ( b c ) ) d ) e )

tree

(25)

Further optimizations (2/2)

implicit

a b c d e f g

• DAG is implicit, e.g.

adjacent genes on the chromosome:

 store the genes order

Ɵ (| G |) space instead of Ɵ (| G | 2 )

 each pair of genes defines an interval which defines a set

• requires a specific algorithm

O (| Q | 2 ) time

(26)

Context

→ Question: Do those genes surprisingly cluster in the genome?

Goal: consider every possible region for enrichment

?

Set of genes of interest Examples

 Differentially expressed genes

 Co-expressed genes

 Tissue specific genes

 Partners of a protein complex

 Imprinted genes

 …

(27)

Down Syndrome differentially expressed genes Experiment:

Published list of differentially expressed genes in Down syndrome patients from Mao, R., C.L. Zielke, H.R. Zielke, and J. Pevsner, Global up-regulation of chromosome 21 gene

expression in the developing Down syndrome brain (2003) Genomics 81: 457-467.

Enrichment measure: Hypergeometric distribution

Multiple testing adjustment: False Discovery Rate (FDR)

Issues:

• Number of regions to test

• False positives

• Redundancy

(28)

Pertinent Regions

1 2 3 4 5 6 7

A region is pertinent if it is:

• bounded by genes of interests

• the largest, when genes of interest are consecutive

8 9

(29)

Down Syndrome (minP i )

29

Large regions tend to have smaller p-values while small regions tend to have higher percentage of enrichment

→ A smaller region included in a more significant one is pertinent if it has a much

higher percentage of genes of interests (>50%)

(30)

Down Syndrome d.e.g. Final Results

(31)

Défis actuels

GO:0016070

GO:0016071

GO:0006397 GO:0000377

GO:0000375

GO:0006396

GO:0000398

GO:0006374 GO:0000245 GO:0000393

GO:0000390

GO:0000391

GO:0000370

GO:0000348 GO:0006461

Ensembles requêtes

base de données

de voisinages :

ensembles cibles

(32)

Analyse de données d’expression

[Ferea et al., 1999] [Kanehisa & Goto, 2000]

?

(33)

Tendances

Extraction de sous-graphe pertinent & visualisation

Gènes ayant la même annotation ex : interaction with host

Idée :

• Grands graphes d’interactions physiques et/ou fonctionnelles

• Visualiser les relations entre gènes

d’intérêt

Visualisation du sous graphe expliquant le mieux ce qui lie

les gènes d’intérêt

Marche aléatoire :

pondération des arcs Surreprésentation :

sous-graphe pertinent

Références

Documents relatifs

At their next stop on the safe motherhood road, women should be provided with access to community- based maternity services (during pregnancy, during delivery and after

The Health Care Service (SAS) collects information and starts the monitoring of the case. Student notifies SAS and the degree coordinator the high

The boot stem (for the forms je, tu, il / elle, ils / elles) is the third person plural of the present indicative, the very same as the stem for regular formation of the

Avoir is also used as an auxiliary in compound tenses (passé composé with avoir, plus-que-parfait, futur antérieur, etc.) Besides ownership, the verb avoir expresses age in

Since each small rectangle R i has an integer side, it must contain an equal area coloured black and white.. Therefore the large rectangle R must contain such

If Japan is, historically speaking, a special case in its relationship between budo and sports, what is striking at the global level, it is our inability to think the physical

[r]

Large regions tend to have smaller p-values while small regions tend to have higher percentage of enrichment. → A smaller region included in a more significant one is pertinent if