Algorithms and bioinformatics

Comparative genomics

Anthony Labarre

September 26, 2016

Strings Other models Alternative approaches Beyond pairwise comparisons

Duplications in evolution Balanced strings General strings

Motivation

- We saw a model for representing genomes without directionality;

- We saw another model for taking directionality into account;

- Both of them lack realism in a crucial way: they do not allow duplications;

- Yet duplications, insertions and deletions account for a very large part of what happens in evolution [Ohno, 1970];

Two examples of duplications

Example (tandem duplications)

(source: K. Aainsqatsi on Wikimedia)

Example (whole genome duplication)

(source: Eric Lyons on CoGePedia)

Today

- Models that take duplications into account;

- Other approaches to solving the corresponding problems;

- Other models for those cases where only partial information is available or relevant;

Strings

- Since duplications pervade genomes, we should take them into account;

- We now see genomes as strings on an alphabet Σ;

- Be careful: similar segments have been identified, so Σ = {segments} and not {A, C, G, T};

- Our goal is still to explain evolution using most parsimonious scenarios made of fixed transformations;

- Note: the restriction to sorting problems does not work anymore;

  - if you have two A’s, which one should be “number one”?

- So we are really interested in transforming one string into another, which is not equivalent to sorting another string;

- Sorting problems have been considered in that model, but they are just a special case of a more general problem;

- We can distinguish between several approaches based on gene content;

- Either we have exactly the same content in both genomes (and duplications are of course allowed);

- Or we have duplications but with different numbers of copies (e.g. three 1’s in genome A but only two in genome B);

- This time the breakpoint graph cannot save us, since we would not know how to connect elements or decompose the graph;

Balanced strings

- The number of occurrences of a character c in a string S is denoted by occ(c, S);

Definition (balanced strings)

Two strings S and T on an alphabet Σ are balanced if:

∀ c ∈ Σ: occ(c, S) = occ(c, T).

- Basically, S and T are anagrams;

- Straightforward generalisation of permutations: we have duplications, but we actually still have the same content in both genomes;
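
The definition above can be checked mechanically. The following is a minimal sketch; the names `occ` and `are_balanced` are illustrative and not taken from the lecture. It works on Python strings as well as on lists of segment labels.

```python
from collections import Counter

def occ(c, s):
    """Number of occurrences of character c in the sequence s."""
    return s.count(c)

def are_balanced(s, t):
    """True if every character occurs equally often in s and in t."""
    return Counter(s) == Counter(t)

print(are_balanced("dictionary", "indicatory"))  # True: the words are anagrams
print(are_balanced([1, 1, 2, 3], [1, 2, 2, 3]))  # False: occ(1, S) != occ(1, T)
```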

Comparing balanced strings

- One way of relating genomes’ contents is to identify common segments;

- In other words, we want to partition genomes into the same set of segments;

  - this is how we obtained (signed) permutations;

  - but now we want to partition the resulting sequences;

Generalising breakpoints

- Recall that, for permutations:

  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨n n−1 ··· 1⟩ for reversals);

  - breakpoints are pairs that are not adjacencies;

- Recall that, for signed permutations:

  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨−n −(n−1) ··· −1⟩ for signed reversals);

  - breakpoints are pairs that are not adjacencies;

- Those can be generalised to any pair of permutations;

- And we can do the same thing for strings;
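
As a reminder of how these counts work on unsigned permutations, here is a small sketch. It assumes the usual framing convention (adding 0 in front of π and n+1 at the end), which the slide does not state explicitly; the function name and the `reversals` flag are illustrative.

```python
def breakpoints(pi, reversals=False):
    """Count breakpoints of the permutation pi of 1..n.

    Assumption: pi is framed by 0 and n+1. Without reversals, a pair (a, b)
    is an adjacency if b = a + 1 (adjacent in the identity); with reversals,
    it is an adjacency if |b - a| = 1 (adjacent in the identity or in its
    reverse).
    """
    n = len(pi)
    ext = [0] + list(pi) + [n + 1]

    def is_adjacency(a, b):
        return abs(b - a) == 1 if reversals else b - a == 1

    return sum(1 for a, b in zip(ext, ext[1:]) if not is_adjacency(a, b))

print(breakpoints([2, 1, 3]))                  # 3
print(breakpoints([2, 1, 3], reversals=True))  # 2: the pair (2, 1) is adjacent in the reverse
```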

Minimum common string partition

- A partition of a string S is a set of strings that can be concatenated to obtain S;

- A common partition of two strings S and T is a set of strings that can be concatenated to obtain both S and T;

Example (common string partitions)

Here’s a common partition of “dictionary” and “indicatory”:

dictionary = dic | t | i | o | n | a | ry
             S1    S2  S3  S4  S5  S6  S7

indicatory = i | n | dic | a | t | o | ry
             S3  S5  S1    S6  S2  S4  S7
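
The example can be verified with a few lines of code. This is only a sketch: the block contents below are the ones forced by the two orderings S1 ... S7 and S3 S5 S1 S6 S2 S4 S7, and the function name is hypothetical.

```python
def is_common_partition(blocks, order_s, order_t, s, t):
    """Check that concatenating the blocks in each order yields s and t."""
    return ("".join(blocks[i] for i in order_s) == s
            and "".join(blocks[i] for i in order_t) == t)

# Blocks S1..S7 of the dictionary / indicatory example.
blocks = {1: "dic", 2: "t", 3: "i", 4: "o", 5: "n", 6: "a", 7: "ry"}

print(is_common_partition(blocks,
                          [1, 2, 3, 4, 5, 6, 7],
                          [3, 5, 1, 6, 2, 4, 7],
                          "dictionary", "indicatory"))  # True
```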

- A common string partition is minimum if there is no smaller common string partition for the two strings under consideration;

- This leads to the following decision problem:

Problem (minimum common string partition (mcsp))

Instance: balanced strings S and T, a bound k ∈ ℕ;

Question: is there a common partition of S and T with at most k blocks?

Relation(s) to rearrangement problems

- Recall that breakpoints were pairs of elements adjacent in one genome but not in the other;

- Common string partitions generalise that point of view to an arbitrary number of elements in each part;

- So if we have a minimum common string partition for S and T, we get the number of breakpoints between strings S and T;

About mcsp

- Bad news about mcsp:

  - NP-hard, even if only one gene family is nontrivial [Blin et al., 2004];

  - APX-hard, even if no character appears more than twice [Goldstein et al., 2005];

- Good news about mcsp:

  - fixed-parameter tractable: a solution of size k can be found in time f(k)·poly(n) (n = |S| = |T|) [Bulteau and Komusiewicz, 2014];

- Greedy approach [Goldstein and Lewenstein, 2011]: repeatedly select a longest common substring (LCS) without any marked letter, and mark its letters;

  ✓ simple and fast (runs in O(n) time);

  × approximation ratio between Ω(n^0.43) and O(n^0.69) [Kaplan and Shafrir, 2006];
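
To make the greedy idea concrete, here is a naive sketch that reads "LCS" as longest common substring. It recomputes a dynamic-programming table at every round, so it is far from the O(n) running time quoted above; it is not the algorithm of the cited paper, only an illustration of the selection rule. The function name and the block encoding are assumptions.

```python
def greedy_mcsp(s, t):
    """Greedy common partition of two balanced strings.

    Returns a list of blocks (start in s, start in t, length).
    Naive version: one O(n^2) table per round.
    """
    if sorted(s) != sorted(t):
        raise ValueError("the input strings must be balanced")
    n = len(s)
    marked_s, marked_t = [False] * n, [False] * n
    blocks = []
    while not all(marked_s):
        best = (0, -1, -1)  # (length, start in s, start in t)
        prev = [0] * (n + 1)
        for i in range(1, n + 1):
            cur = [0] * (n + 1)
            if not marked_s[i - 1]:
                for j in range(1, n + 1):
                    if not marked_t[j - 1] and s[i - 1] == t[j - 1]:
                        # Longest common unmarked run ending at s[i-1], t[j-1].
                        cur[j] = prev[j - 1] + 1
                        if cur[j] > best[0]:
                            best = (cur[j], i - cur[j], j - cur[j])
            prev = cur
        length, i0, j0 = best
        for k in range(length):
            marked_s[i0 + k] = marked_t[j0 + k] = True
        blocks.append((i0, j0, length))
    return blocks

s, t = "dictionary", "indicatory"
print([s[i:i + length] for i, _, length in greedy_mcsp(s, t)])
```

On this pair the greedy choice happens to recover the seven blocks of the earlier example, but in general the result is only an approximation.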

Minimum common string partition: variants

- One can also consider signed strings: each segment is then equivalent up to a reversal;

- Or equivalence under full reversals: a partition of S is also a partition of T if one can concatenate its elements to obtain T or its reverse;

- Those variants are still hard, but the positive results do not straightforwardly generalise [Bulteau and Komusiewicz, 2014];

Unbalanced strings

- Of course, we are not always so lucky that our genomes are just anagrams;

- Most of the time, duplications are not balanced;

- So, what do we do?

Arbitrary strings

- One idea is to try and match different copies of the same gene across two genomes;

- Three general approaches have been proposed:

  1. the exemplar model;

  2. the intermediate model;

  3. the full model;

- All three are based on a notion of matching;

Matching and pruning

Definition (gene matching)

A gene matching between two strings S and T is a set of disjoint pairs {S_i, T_j} such that S_i = T_j for every such pair (1 ≤ i ≤ |S|, 1 ≤ j ≤ |T|).

Definition (pruning)

Given two strings S and T and a gene matching M, the M-pruning is the pair (S′, T′) obtained by removing all unmatched characters from S and T and relabelling the remaining characters according to M.

(examples to appear shortly)
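
A pruning can be computed directly from a matching. The sketch below makes two assumptions that are not in the definitions: positions are 0-based, and genes of signed strings are matched up to sign. Its relabelling convention is also only one possibility: matched pairs are numbered 1, 2, ... in their order of appearance in S, and a label is negated in T′ when the two matched copies have opposite signs.

```python
def prune(S, T, matching):
    """M-pruning (S', T') of S and T for a matching given as (i, j) index pairs."""
    used_i, used_j = set(), set()
    for i, j in matching:
        if abs(S[i]) != abs(T[j]):
            raise ValueError(f"positions {i} and {j} carry different genes")
        if i in used_i or j in used_j:
            raise ValueError("pairs of a matching must be disjoint")
        used_i.add(i)
        used_j.add(j)
    label_in_T = {}
    S_pruned = []
    # Number the matched pairs by their position in S.
    for label, (i, j) in enumerate(sorted(matching), start=1):
        S_pruned.append(label)
        same_sign = (S[i] > 0) == (T[j] > 0)
        label_in_T[j] = label if same_sign else -label
    T_pruned = [label_in_T[j] for j in sorted(label_in_T)]
    return S_pruned, T_pruned

S = [1, 2, -4, -2, 3, 1, 4, -3, 4]
T = [4, 1, -3, -2, 2, 1, 2, 4]
# One copy of each gene is matched (an exemplar matching).
print(prune(S, T, [(0, 1), (1, 3), (4, 2), (8, 7)]))  # ([1, 2, 3, 4], [1, -3, -2, 4])
```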

Matching(s) and pruning(s)

- Matchings will depend on the model we use;

- Since prunings are derived from matchings, they will also vary depending on the underlying model;

- Let us review them through examples;

Exemplar matching / pruning

- In the exemplar model, we match only one copy of each gene:

Example (exemplar matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 3 4

T′ = 1 −3 −2 4

Intermediate matching/pruning

- In the intermediate model, we match at least one copy of each gene:

Example (intermediate matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 3 1′ 4

T′ = 1 −3 −2 1′ 4

Full matching / pruning

- In the full model, we match as many copies of each gene as possible:

Example (full matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 −4′ −2′ 3 1′ 4

T′ = 4′ 1′ −3 2′ 1 2 4

Using matchings and prunings

- Once we’ve pruned our input strings, we can compare them as if they were permutations;

- This gives rise to many variations on the following theme:

Problem (“(M, d)-comparison”)

Input: two strings S and T.

Goal: find an “M matching” such that the resulting “M pruning” (S′, T′) minimises d(S′, T′).

- Here M ∈ {exemplar, intermediate, full}, and d is any distance on S_n or S_n^± (with n = |S′| = |T′|);
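
For very small inputs, one instance of this theme can be solved by exhaustive search. The sketch below takes M = exemplar and, as an assumption, takes d to be the number of signed breakpoints with respect to the identity (framed by 0 and n+1), without treating a full reversal as free. It enumerates every choice of one copy per gene in each string, so its running time is exponential; all names are illustrative.

```python
from itertools import product

def signed_breakpoints(perm):
    """Breakpoints of a signed permutation w.r.t. the identity, framed by 0 and n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)

def exemplar_breakpoint_distance(S, T):
    """Minimum number of breakpoints over all exemplar matchings (brute force)."""
    genes = sorted({abs(x) for x in S} & {abs(x) for x in T})
    # For each gene, every way of picking one copy in S and one copy in T.
    choices = [
        list(product([i for i, x in enumerate(S) if abs(x) == g],
                     [j for j, x in enumerate(T) if abs(x) == g]))
        for g in genes
    ]
    best = None
    for matching in product(*choices):
        # Relabel so that S' is the identity; T' is then a signed permutation.
        label = {j: (k if (S[i] > 0) == (T[j] > 0) else -k)
                 for k, (i, j) in enumerate(sorted(matching), start=1)}
        T_pruned = [label[j] for j in sorted(label)]
        d = signed_breakpoints(T_pruned)
        if best is None or d < best:
            best = d
    return best

print(exemplar_breakpoint_distance([1, 2, 1, 3], [2, 1, 3]))  # 0: keep the second copy of 1
```

The search has to build a complete matching before it can evaluate d, which is exactly why the problem does not reduce to a weighted matching computation.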

Strings

- This is not “just a matching problem”;

  - in matching problems, every edge is given a weight, and we have to optimise a function that takes all weights into account;

  - while here, we look for a matching that optimises a quantity, but the edge weights are not fixed to begin with;

- In other words: in matching problems we can compute the cost of a partial solution, while here we must have a full matching before we can even begin to compute the cost;

Strings: extensions

- Strings can of course be signed to take directionality into account;

- They can also be circular;

- And of course we could have a mix of both to represent different chromosomes;

Posets Set systems

Other models

- We’ve mostly seen (signed) permutations and strings so far;

- Other models may be more suitable, according to:

  - the data we have;

  - the relations we want to take into account;

- We briefly mention the following structures:

  - posets;

  - set systems;

The need for other models

- Most genomes consist of several chromosomes:

Posets

- Recall that genomes are not directly copied from a long string of DNA to a drive;

  1. “small” subsequences called reads are identified;

  2. then those reads are assembled to form the target genome;

- We still want to be able to compare genomes even if only partial gene order information is available;

- This naturally leads us to compare posets instead of permutations or strings;

- Informally, although we may not know the complete ordering, we may know parts of it;

- So segments are partially ordered, and genomes may be represented by directed acyclic graphs, where:

  - vertices stand for segments;

  - arc (u, v) means “segment u precedes segment v”;

- In this regard, permutations are paths of maximal length;

Example (a genome as a poset)

[Figure: a directed acyclic graph on the segments 1, −2, 3, −5, 6, 10, 9, 12]

Comparing genomes as posets

- Comparing genomes G1 and G2 represented as posets is based on permutations:

  - find linear extensions L1 and L2 that minimise d(L1, L2);

- Another way of trying to aggregate their contents is by:

  - merging them into a conflict-free graph;

  - finding a linear extension of that graph;
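
The first approach can be sketched by brute force on tiny posets: enumerate all linear extensions of both directed acyclic graphs and keep the closest pair. The distance used here (consecutive pairs of one order that are not consecutive in the other) is only a stand-in for d, signs are ignored, and the enumeration is exponential in general; all names are hypothetical.

```python
def linear_extensions(vertices, arcs):
    """Yield every linear extension of the poset given by a DAG (backtracking)."""
    preds = {v: set() for v in vertices}
    for u, v in arcs:
        preds[v].add(u)

    def extend(prefix, placed):
        if len(prefix) == len(preds):
            yield list(prefix)
            return
        for v in preds:
            # v can come next once all of its predecessors are placed.
            if v not in placed and preds[v] <= placed:
                prefix.append(v)
                placed.add(v)
                yield from extend(prefix, placed)
                prefix.pop()
                placed.remove(v)

    yield from extend([], set())

def ordered_breakpoints(p, q):
    """Consecutive pairs of p that are not consecutive (in that order) in q."""
    consecutive_in_q = set(zip(q, q[1:]))
    return sum(1 for pair in zip(p, p[1:]) if pair not in consecutive_in_q)

def closest_extensions(poset1, poset2, d=ordered_breakpoints):
    """Return (distance, L1, L2) minimising d over all pairs of linear extensions."""
    return min(((d(l1, l2), l1, l2)
                for l1 in linear_extensions(*poset1)
                for l2 in linear_extensions(*poset2)),
               key=lambda triple: triple[0])

G1 = ([1, 2, 3], [(1, 2)])  # 1 precedes 2, position of 3 unknown
G2 = ([1, 2, 3], [(3, 1)])  # 3 precedes 1, position of 2 unknown
print(closest_extensions(G1, G2))  # (0, [3, 1, 2], [3, 1, 2])
```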

Finding an “agreement” for posets

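[Figure: poset G1 on the segments 1, −2, 3, −5, 6, 10, 8, 12; poset G2 on the segments 1, −2, −4, −5, 7, 9, 11, 12; and the graph obtained by merging G1 and G2, on the segments 1, −2, −4, −5, 3, 6, 10, 8, 12, 7, 9, 11]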

Set systems and the syntenic distance

- Recall that chromosomes are ordered sets of genes;

- Sometimes we’re not interested in order, but in the fact that two segments belong to the same chromosome;

- So we view a genome as a family of (unordered) sets of genes;

- Three operations are taken into account in that setting, illustrated here on the genome {{a,b,c},{p,q,r},{x,y}}:

  - fission: {{a,b,c},{p,q,r},{x,y}} → {{a,b},{c},{p,q,r},{x,y}}

  - fusion: {{a,b,c},{p,q,r},{x,y}} → {{a,b,c,x,y},{p,q,r}}

  - translocation: {{a,b,c},{p,q,r},{x,y}} → {{a,p},{b,c,q,r},{x,y}}

- The syntenic distance between two genomes is then the minimum number of such operations that are needed to transform one genome into the other;
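
The three operations are easy to state on families of sets. The following sketch assumes that chromosomes are pairwise distinct sets and represents a genome as a frozenset of frozensets; the function names, argument conventions and the helper `genome` are illustrative choices.

```python
def genome(*chromosomes):
    """Build a genome (family of unordered gene sets) from iterables of genes."""
    return frozenset(frozenset(c) for c in chromosomes)

def fission(g, chrom, part):
    """Split chrom into part and chrom - part."""
    chrom, part = frozenset(chrom), frozenset(part)
    if chrom not in g or not part or not part < chrom:
        raise ValueError("part must be a nonempty proper subset of a chromosome")
    return (g - {chrom}) | {part, chrom - part}

def fusion(g, c1, c2):
    """Merge two distinct chromosomes into one."""
    c1, c2 = frozenset(c1), frozenset(c2)
    if c1 == c2 or c1 not in g or c2 not in g:
        raise ValueError("fusion needs two distinct chromosomes of the genome")
    return (g - {c1, c2}) | {c1 | c2}

def translocation(g, c1, c2, part1, part2):
    """Exchange gene sets: c1, c2 become part1 | part2 and the union of the rests."""
    c1, c2, part1, part2 = map(frozenset, (c1, c2, part1, part2))
    if c1 == c2 or c1 not in g or c2 not in g or not part1 <= c1 or not part2 <= c2:
        raise ValueError("invalid translocation")
    new1, new2 = part1 | part2, (c1 - part1) | (c2 - part2)
    if not new1 or not new2:
        raise ValueError("a translocation must leave two nonempty chromosomes")
    return (g - {c1, c2}) | {new1, new2}

g = genome("abc", "pqr", "xy")
print(fission(g, "abc", "c") == genome("ab", "c", "pqr", "xy"))            # True
print(fusion(g, "abc", "xy") == genome("abcxy", "pqr"))                    # True
print(translocation(g, "abc", "pqr", "a", "p") == genome("ap", "bcqr", "xy"))  # True
```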

- There is a compact representation that allows us to assume that:

  1. our input is {S_1, S_2, …, S_k} (subsets of {1, 2, …, n});

  2. our target is {{1}, {2}, …, {n}};

- So we want to obtain that genome using as few fissions, fusions and translocations as possible;

- Syntenic genes are simply genes that belong to the same chromosome;

Synteny graph

- A graph-theoretic approach for attacking the problem was proposed:

Definition ([DasGupta et al., 1998])

The synteny graph of an instance S(n, k) is defined by:

- V = {S_1, S_2, …, S_k};

- E = {{S_i, S_j} | S_i ∩ S_j ≠ ∅, 1 ≤ i ≠ j ≤ k};

- The synteny graph of our target {{1}, {2}, …, {n}} has n components;

Mutations and the synteny graph

- Translocations, fusions and fissions affect the graph in different ways;

  - translocations (may) disconnect adjacent vertices;

  - fissions split vertices into two nonadjacent vertices;

  - fusions: opposite of fissions;

- Our goal is to obtain n components;

- It can be proved that the distance is at least n − p (where p is the number of components in our instance’s graph);
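
The lower bound n − p is straightforward to evaluate. Here is a minimal sketch that counts the connected components of the synteny graph with a union-find structure, assuming the compact representation (the input sets are subsets of {1, ..., n}); the function names are hypothetical.

```python
def synteny_components(sets):
    """Number of connected components of the synteny graph of S_1, ..., S_k."""
    k = len(sets)
    parent = list(range(k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(k):
        for j in range(i + 1, k):
            if sets[i] & sets[j]:  # S_i and S_j share a gene: edge {S_i, S_j}
                parent[find(i)] = find(j)
    return len({find(i) for i in range(k)})

def syntenic_lower_bound(sets, n):
    """Lower bound n - p on the syntenic distance to {{1}, ..., {n}}."""
    return n - synteny_components(sets)

instance = [{1, 2}, {2, 3}, {4}]
print(synteny_components(instance))       # 2
print(syntenic_lower_bound(instance, 4))  # 2
```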

About the syntenic distance

- The synteny graph dictates that we want to increase the number of connected components;

- In that regard, restricting oneself to “intra-component moves” seems optimal;

- But any approach that does this is a 2-approximation [Liben-Nowell, 2001];

- No better approximation is known;

- And computing the distance or an optimal scenario is NP-hard [DasGupta et al., 1998];

SAT solvers Linear programming

Today’s models: wrap-up

I As soon as we have duplications, most problems become hard (to solve exactly, or even to approximate within a reasonable factor);

I As soon as we forget about order (partially or completely), we also end up with difficult problems;

I Yet the problems still have to be solved;


Alternative approach: SAT solvers

I SAT solvers are highly optimised programs for solving the well-known NP-complete satisfiability problem [Cook, 1971]:

Problem (satisfiability (SAT))

Input: a Boolean formula φ in conjunctive normal form.

Question: is there a satisfying assignment for φ?

I Idea: take advantage of these solvers;
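To make the problem statement concrete, here is a minimal sketch of a satisfiability check by exhaustive search. It is not how real SAT solvers work (they are far more sophisticated); the function name `satisfiable` and the integer encoding of literals (`k` for variable x_k, `-k` for its negation) are assumptions chosen for illustration.

```python
from itertools import product

def satisfiable(clauses, num_vars):
    """Return a satisfying assignment for a CNF formula, or None.

    clauses: list of clauses; each clause is a list of non-zero ints,
    where k stands for variable x_k and -k for its negation.
    """
    for values in product([False, True], repeat=num_vars):
        # A clause is satisfied when at least one of its literals is true.
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return values
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
phi = [[1, 2], [-1, 2], [-2, 3]]
print(satisfiable(phi, 3))

# (x1) and (not x1) has no satisfying assignment
print(satisfiable([[1], [-1]], 1))
```

The search tries all 2^n assignments, which is only workable for tiny formulas; the point of dedicated solvers is precisely to avoid this blow-up on many practical instances.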


Alternative approach: SAT solvers

I The workflow is as follows:

PROBLEM INSTANCE → (translation) → BOOLEAN FORMULA → SAT SOLVER → SATISFYING ASSIGNMENT → SOLUTION
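The workflow can be sketched end to end on a toy problem. The example below uses graph 2-colouring, which is not a problem from this lecture and is chosen only because its encoding is short; `encode`, `solve` and `decode` are hypothetical names, and `solve` is an exhaustive-search stand-in for a real SAT solver.

```python
from itertools import product

def encode(edges):
    """Translation: 2-colouring instance -> CNF clauses.

    Vertices are numbered 1..n; variable k is true when vertex k is red.
    """
    clauses = []
    for u, v in edges:
        clauses.append([u, v])    # u and v are not both blue
        clauses.append([-u, -v])  # u and v are not both red
    return clauses

def solve(clauses, num_vars):
    """Stand-in for a SAT solver: try every assignment."""
    for values in product([False, True], repeat=num_vars):
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return values
    return None

def decode(assignment):
    """Translation back: satisfying assignment -> colouring (or None)."""
    if assignment is None:
        return None
    return {v + 1: ("red" if is_red else "blue")
            for v, is_red in enumerate(assignment)}

path = [(1, 2), (2, 3)]
triangle = [(1, 2), (2, 3), (1, 3)]
print(decode(solve(encode(path), 3)))      # a valid colouring of the path
print(decode(solve(encode(triangle), 3)))  # None: a triangle is not 2-colourable
```

In practice only `encode` and `decode` are problem-specific; the middle step is delegated to an off-the-shelf solver.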


Alternative approach: linear and pseudo-boolean programming

I Linear programs are of the form:

$$\text{maximise } c^{T}x \quad \text{subject to } Ax \le b \text{ and } x \ge 0$$

I Pseudo-boolean programs: same form, but the function to optimise maps $\{0,1\}^n$ to $\mathbb{R}$ (versus $\{0,1\}$ for boolean functions);

I Specialised solvers also exist for those and were used to solve rearrangement problems on strings [Angibaud et al., 2007] and posets [Angibaud et al., 2009];
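As an illustration of the pseudo-boolean form, the sketch below maximises a linear objective over $\{0,1\}^n$ under constraints $Ax \le b$ by plain enumeration. The function name `solve_pseudo_boolean` and the small instance are assumptions made for this example; the specialised solvers mentioned above use much better techniques, and the objective need not be linear in general.

```python
from itertools import product

def solve_pseudo_boolean(c, A, b):
    """Maximise c.x subject to A x <= b with x in {0,1}^n, by enumeration.

    Returns (best value, best x), or (None, None) if no x is feasible.
    """
    best_value, best_x = None, None
    for x in product([0, 1], repeat=len(c)):
        # Check every constraint row: sum_j A[i][j] * x[j] <= b[i].
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        value = sum(c_j * x_j for c_j, x_j in zip(c, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

# maximise 3*x1 + 2*x2 + 4*x3  subject to  x1 + x2 <= 1  and  x2 + x3 <= 1
c = [3, 2, 4]
A = [[1, 1, 0], [0, 1, 1]]
b = [1, 1]
print(solve_pseudo_boolean(c, A, b))
```

Here choosing x2 = 1 forces x1 = x3 = 0, so the best solution sets x1 = x3 = 1 instead.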


Comparative genomics wrap-up

I Here we talked mostly about computing “edit distances” between genomes;

I Other measures of similarity exist that are not associated with mutations;

I Many hard problems;

I Much remains to be done in order to satisfy biologists:

    I realistic models;

    I software;

    I ...


From comparisons to phylogenies Bounds

Selected results

Beyond pairwise comparisons

I The genome rearrangement problems we’ve seen were formulated in a pairwise fashion;

I But actually, more than two genomes can be taken into account;

I Unsurprisingly, most problems become hard in that setting;


Why more than two genomes?

I A sequence does not yield enough information for ancestral genome reconstruction:

[Figure: genomes G1 and G2]

I Taking an additional genome into account restricts our choices:

[Figure: genomes G1, G2 and G3]

I What’s more, it’s ultimately one of our goals;


Median problems

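I Measures of similarities between genomes are useful in reconstructing phylogenies;

Example (phylogeny from distance matrix)

        a   b   c   d   e
    a   0   2   3   6   6
    b   2   0   3   6   6
    c   3   3   0   5   5
    d   6   6   5   0   4
    e   6   6   5   4   0

[Figure: weighted tree realising the matrix. Leaves a and b are each joined by an edge of weight 1 to a first internal node; that node is joined by an edge of weight 1 to a second internal node, which carries leaf c on an edge of weight 1; the second internal node is joined by an edge of weight 2 to a third internal node, which carries leaves d and e on edges of weight 2 each.]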

I (The matrix must satisfy some conditions [Buneman, 1971]);
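A minimal sketch of one such check, assuming the conditions alluded to include the four-point condition usually credited to Buneman: for every four taxa, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal. The function name `satisfies_four_point` is hypothetical, and the code assumes the matrix is already symmetric with a zero diagonal.

```python
from itertools import combinations

def satisfies_four_point(d):
    """Check the four-point condition on all quadruples of distinct taxa."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must coincide.
        if sums[1] != sums[2]:
            return False
    return True

# Distance matrix from the example (taxa a, b, c, d, e).
D = [
    [0, 2, 3, 6, 6],
    [2, 0, 3, 6, 6],
    [3, 3, 0, 5, 5],
    [6, 6, 5, 0, 4],
    [6, 6, 5, 4, 0],
]
print(satisfies_four_point(D))

# Shortest-path distances on a 4-cycle: not realisable by a tree.
C4 = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
print(satisfies_four_point(C4))
```

The example matrix passes the check, which is consistent with the fact that a weighted tree realising it exists.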


Median problems

I Parsimony again: search for a tree that minimises the total number of evolutionary events (i.e. the sum of all edge weights);

I In its simplest form, the problem we want to solve is:

Problem (median of three)

Given: $\pi, \sigma, \tau$ in $S_n^{\pm}$; a distance $d : S_n^{\pm} \times S_n^{\pm} \to \mathbb{N}$.

Find: a permutation $\mu$ in $S_n^{\pm}$ that minimises

$$w(\mu) = d(\pi, \mu) + d(\sigma, \mu) + d(\tau, \mu).$$

I Can be generalised to more than three input permutations;
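For very small n, the median-of-three problem can be solved by enumerating all of $S_n^{\pm}$. The sketch below does so with a pluggable distance; `mismatch_distance` (the number of positions where two signed permutations differ) is only a placeholder metric chosen for brevity, not one of the rearrangement distances discussed in the lecture, and all function names are hypothetical.

```python
from itertools import permutations, product

def signed_permutations(n):
    """Yield every signed permutation of {1, ..., n} as a tuple."""
    for order in permutations(range(1, n + 1)):
        for signs in product([1, -1], repeat=n):
            yield tuple(s * x for s, x in zip(signs, order))

def mismatch_distance(p, q):
    """Placeholder metric: number of positions where p and q differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

def median_of_three(pi, sigma, tau, d):
    """Return (w(mu), mu) for a mu minimising d(pi,mu)+d(sigma,mu)+d(tau,mu)."""
    n = len(pi)
    return min(
        ((d(pi, mu) + d(sigma, mu) + d(tau, mu), mu)
         for mu in signed_permutations(n)),
        key=lambda pair: pair[0],
    )

weight, mu = median_of_three((1, 2, 3), (1, 2, 3), (1, -2, 3), mismatch_distance)
print(weight, mu)
```

The enumeration visits $2^n \cdot n!$ candidates, so this only illustrates the definition; it says nothing about how the problem is tackled for realistic genome sizes.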


Generic bounds [Siepel and Moret, 2001]

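I Generic lower and upper bounds for any distance:

[Figure: triangle with corners π, σ, τ and sides labelled d(π, σ), d(π, τ), d(σ, τ); a median µ is joined to each corner by edges labelled d(π, µ), d(µ, σ), d(µ, τ).]

I $w(\mu) \le \min\{\overbrace{d(\pi, \sigma) + d(\pi, \tau)}^{\text{if } \mu = \pi},\ \overbrace{d(\pi, \sigma) + d(\sigma, \tau)}^{\text{if } \mu = \sigma},\ \overbrace{d(\pi, \tau) + d(\sigma, \tau)}^{\text{if } \mu = \tau}\}$.

I $w(\mu) \ge \dfrac{d(\pi, \sigma) + d(\pi, \tau) + d(\sigma, \tau)}{2}$ (triangle inequalities)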
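The lower bound can be spelled out as follows, assuming only that $d$ is symmetric and satisfies the triangle inequality. Routing each pairwise distance through $\mu$ and adding the three inequalities makes every term of $w(\mu)$ appear exactly twice:

```latex
\begin{align*}
d(\pi, \sigma) &\le d(\pi, \mu) + d(\mu, \sigma), \\
d(\pi, \tau)   &\le d(\pi, \mu) + d(\mu, \tau), \\
d(\sigma, \tau) &\le d(\sigma, \mu) + d(\mu, \tau),
\end{align*}
and summing the three lines gives
\[
d(\pi, \sigma) + d(\pi, \tau) + d(\sigma, \tau)
  \le 2\bigl(d(\pi, \mu) + d(\sigma, \mu) + d(\tau, \mu)\bigr) = 2\,w(\mu).
\]
```

For instance, three genomes with pairwise distances 2, 3 and 3 would have a median weight between (2 + 3 + 3)/2 = 4 and min{2 + 3, 2 + 3, 3 + 3} = 5.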
