Algorithms and bioinformatics
Comparative genomics
Anthony Labarre
September 26, 2016
Strings Other models Alternative approaches Beyond pairwise comparisons
Duplications in evolution Balanced strings General strings
Motivation
I We saw a model for representing genomes without directionality;
I We saw another model for taking directionality into account;
I Both of them lack realism in a crucial way: they don’t allow duplications;
I And duplications / insertions / deletions account for a very large part of what happens in evolution [Ohno, 1970];
Two examples of duplications
Example (tandem duplications)
(source: K. Aainsqatsi on Wikimedia)
Example (whole genome duplication)
(source: Eric Lyons on CoGePedia)
Today
I Models that take duplications into account;
I Other approaches to solving the corresponding problems;
I Other models for those cases where only partial information is available or relevant;
Strings
I Since duplications pervade genomes, we should take them into account;
I We now see genomes as strings over an alphabet Σ;
I Be careful: similar segments have already been identified, so Σ = {segments} and not {A, C, G, T};
I Our goal is still to explain evolution using most parsimonious scenarios made of fixed transformations;
Strings
I Note: the restriction to sorting problems does not work anymore;
I if you have two A’s, which one should be “number one”?
I So we really are interested in transforming one string into another, which is not equivalent to sorting another string;
I Sorting problems have been considered in that model, but they’re just a special case of a more general problem;
Strings
I We can distinguish between several approaches based on gene contents;
I Either we have exactly the same contents in both genomes (and duplications are of course allowed);
I Or we have duplications but with different amounts of repetitions (e.g. three 1’s in genome A but only two in genome B);
I This time the breakpoint graph cannot save us anymore, since we would not know how to connect elements or decompose the graph;
Balanced strings
I The number of occurrences of a character c in a string S is denoted by occ(c, S);
Definition (balanced strings)
Two strings S and T on an alphabet Σ are balanced if:
∀ c ∈ Σ : occ(c, S) = occ(c, T).
I Basically, S and T are anagrams;
I Straightforward generalisation of permutations: we have duplications, but we actually still have the same content in both genomes;
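The definition above can be checked mechanically. The following is a minimal sketch, assuming genomes are given as Python strings or lists of segment labels; the function names `occ` and `are_balanced` are illustrative only.

```python
from collections import Counter

def occ(c, s):
    """Number of occurrences of character c in the sequence s."""
    return Counter(s)[c]

def are_balanced(s, t):
    """True when every character occurs equally often in s and in t."""
    return Counter(s) == Counter(t)

# Two anagrams are balanced; sequences with different gene contents are not.
print(are_balanced("dictionary", "indicatory"))
print(are_balanced([1, 1, 2, 3], [1, 2, 2, 3]))
```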
Comparing balanced strings
I One way of relating genomes’ contents is to identify common segments;
I In other words, we want to partition genomes into the same set of segments;
I this is how we obtained (signed) permutations;
I but now we want to partition the resulting sequences;
Generalising breakpoints
I Recall that, for permutations:
I adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ⋯ n⟩ (or χ = ⟨n n−1 ⋯ 1⟩ for reversals);
I breakpoints are pairs that are not adjacencies;
I Recall that, for signed permutations:
I adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ⋯ n⟩ (or χ = ⟨−n −(n−1) ⋯ −1⟩ for signed reversals);
I breakpoints are pairs that are not adjacencies;
I Those can be generalised to any pair of permutations;
I And we can do the same thing for strings;
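As a reminder of how breakpoints are counted against the identity, here is a small sketch. It assumes the usual framing of π with 0 and n + 1 (a convention not stated on the slide), and the function names are illustrative.

```python
def breakpoints_unsigned(pi):
    """Breakpoints of an unsigned permutation for reversals:
    a pair is an adjacency when its elements differ by exactly 1,
    i.e. they are adjacent in the identity or in its reverse."""
    framed = [0] + list(pi) + [len(pi) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)

def breakpoints_signed(pi):
    """Breakpoints of a signed permutation:
    a pair (a, b) is an adjacency when b = a + 1,
    which also covers reversed pairs such as (-3, -2)."""
    framed = [0] + list(pi) + [len(pi) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

print(breakpoints_unsigned([3, 1, 2, 4]))
print(breakpoints_signed([1, -3, -2, 4]))
```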
Minimum common string partition
I A partition of a string S is a set of strings that can be concatenated to obtain S;
I A common partition of two strings S and T is a set of strings that can be concatenated to obtain both S and T;
Example (common string partitions)
Here’s a common partition of “dictionary” and “indicatory”:
dictionary = dic | t | i | o | n | a | ry   (blocks S1 S2 S3 S4 S5 S6 S7)
indicatory = i | n | dic | a | t | o | ry   (blocks S3 S5 S1 S6 S2 S4 S7)
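The example can be verified by concatenating the blocks in both orders. This is a minimal sketch: the block contents (dic, t, i, o, n, a, ry) are read off the example above, and the function name is illustrative.

```python
def is_common_partition(blocks, order_s, order_t, s, t):
    """Check that the blocks, each used exactly once, spell s when
    concatenated in order_s and t when concatenated in order_t."""
    def spell(order):
        return "".join(blocks[i] for i in order)
    every_block_once = (
        sorted(order_s) == sorted(order_t) == list(range(len(blocks)))
    )
    return every_block_once and spell(order_s) == s and spell(order_t) == t

blocks = ["dic", "t", "i", "o", "n", "a", "ry"]  # S1 .. S7
order_s = [0, 1, 2, 3, 4, 5, 6]                  # S1 S2 S3 S4 S5 S6 S7
order_t = [2, 4, 0, 5, 1, 3, 6]                  # S3 S5 S1 S6 S2 S4 S7
print(is_common_partition(blocks, order_s, order_t, "dictionary", "indicatory"))
```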
Minimum common string partition
I A common string partition is minimum if there is no smaller common string partition for the two strings under consideration;
I This leads to the following decision problem:
Problem (minimum common string partition (MCSP))
Instance: balanced strings S and T, a bound k ∈ ℕ;
Question: is there a common partition of S and T with at most k blocks?
Relation(s) to rearrangement problems
I Recall that breakpoints were pairs of elements adjacent in one genome but not in the other;
I Common string partitions generalise that point of view to an arbitrary number of elements in each part;
I So if we have a minimum common string partition for S and T, we get the number of breakpoints between strings S and T;
About MCSP
I Bad news about MCSP:
I NP-hard, even if only one gene family is nontrivial [Blin et al., 2004];
I APX-hard, even if no character appears more than twice [Goldstein et al., 2005];
I Good news about MCSP:
I fixed-parameter tractable: a solution of size k can be found in time f(k)·poly(n), where n = |S| = |T| [Bulteau and Komusiewicz, 2014];
I Greedy approach [Goldstein and Lewenstein, 2011]: repeatedly select a longest common substring (LCS) without any marked letter, and mark its letters;
✓ simple and fast (runs in O(n) time);
× approximation ratio between Ω(n^0.43) and O(n^0.69) [Kaplan and Shafrir, 2006];
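The greedy idea can be sketched in a few lines. This is a naive illustration only: it rescans both strings at every step and is therefore far from the linear-time implementation cited above. The function name and the output format are assumptions made for this sketch.

```python
def greedy_mcsp(s, t):
    """Greedy common partition of two balanced strings.
    Returns blocks as (start in s, start in t, length)."""
    assert sorted(s) == sorted(t), "the strings must be balanced"
    free_s = [True] * len(s)   # letters of s not yet marked
    free_t = [True] * len(t)   # letters of t not yet marked
    blocks = []
    while any(free_s):
        best = (0, 0, 0)       # (length, start in s, start in t)
        for i in range(len(s)):
            for j in range(len(t)):
                k = 0
                # Extend the common substring over unmarked letters only.
                while (i + k < len(s) and j + k < len(t)
                       and free_s[i + k] and free_t[j + k]
                       and s[i + k] == t[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        for k in range(length):
            free_s[i + k] = False
            free_t[j + k] = False
        blocks.append((i, j, length))
    return blocks

for i, j, length in greedy_mcsp("dictionary", "indicatory"):
    print("dictionary"[i:i + length], "at", i, "and", j)
```

Since the strings are balanced, the unmarked letters always form the same multiset on both sides, so each round finds a block of length at least one and the loop terminates.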
Minimum common string partition: variants
I One can also consider signed strings: each segment is then equivalent up to a reversal;
I Or equivalence under full reversals: a partition of S is also a partition of T if one can concatenate its elements to obtain T or its reverse;
I Those variants are still hard, but the positive results do not straightforwardly generalise [Bulteau and Komusiewicz, 2014];
Unbalanced strings
I Of course, we are not always so lucky that our genomes are just anagrams;
I Most of the time, duplications are not balanced;
I So, what do we do?
Arbitrary strings
I One idea is to try and match different copies of the same gene across two genomes;
I Three general approaches have been proposed:
1. the exemplar model;
2. the intermediate model;
3. the full model;
I All three are based on a notion of matching;
Matching and pruning
Definition (gene matching)
A gene matching between two strings S and T is a set of disjoint pairs {S_i, T_j} such that S_i = T_j for every such pair (1 ≤ i ≤ |S|, 1 ≤ j ≤ |T|).
Definition (pruning)
Given two strings S and T and a gene matching M, the M-pruning is the pair (S′, T′) obtained by removing all unmatched characters from S and T and relabelling the remaining characters according to M.
(examples to appear shortly)
Matching(s) and pruning(s)
I Matchings will depend on the model we use;
I Since prunings are derived from matchings, they will also vary depending on the underlying model;
I Let us review them on examples;
Exemplar matching / pruning
I In the exemplar model, we match only one copy of each gene:
Example (exemplar matching / pruning)
S = 1 2 −4 −2 3 1 4 −3 4
T = 4 1 −3 −2 2 1 2 4
S′ = 1 2 3 4
T′ = 1 −3 −2 4
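Pruning is easy to express once a matching is fixed. The sketch below makes two assumptions that the slides do not spell out: copies of a gene are matched regardless of sign, and the relabelling turns S′ into the identity while T′ carries the relative signs. The matching used in the demo is one that reproduces the exemplar example above, written with 0-based positions.

```python
def prune(s, t, matching):
    """Return (S', T') for a matching given as (i, j) position pairs,
    where s[i] and t[j] are copies of the same gene up to sign."""
    assert len({i for i, _ in matching}) == len(matching), "pairs must be disjoint"
    assert len({j for _, j in matching}) == len(matching), "pairs must be disjoint"
    assert all(abs(s[i]) == abs(t[j]) for i, j in matching)
    s_pruned, label_in_t = [], {}
    # Matched positions of s are relabelled 1, 2, ... from left to right.
    for label, (i, j) in enumerate(sorted(matching), start=1):
        s_pruned.append(label)
        label_in_t[j] = label if s[i] == t[j] else -label
    t_pruned = [label_in_t[j] for j in sorted(label_in_t)]
    return s_pruned, t_pruned

S = [1, 2, -4, -2, 3, 1, 4, -3, 4]
T = [4, 1, -3, -2, 2, 1, 2, 4]
exemplar = [(0, 1), (1, 3), (4, 2), (6, 7)]  # one pair per gene 1, 2, 3, 4
print(prune(S, T, exemplar))
```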
Intermediate matching/pruning
I In the intermediate model, we match at least one copy of each gene:
Example (intermediate matching / pruning)
S = 1 2 −4 −2 3 1 4 −3 4
T = 4 1 −3 −2 2 1 2 4
S′ = 1 2 3 1′ 4
T′ = 1 −3 −2 1′ 4
Full matching / pruning
I In the full model, we match as many copies of each gene as possible:
Example (full matching / pruning)
S = 1 2 −4 −2 3 1 4 −3 4
T = 4 1 −3 −2 2 1 2 4
S′ = 1 2 −4′ −2′ 3 1′ 4
T′ = 4′ 1′ −3 2′ 1 2 4
Using matchings and prunings
I Once we’ve pruned our input strings, we can compare them as if they were permutations;
I This gives rise to many variations on the following theme:
Problem (“(M, d)-comparison”)
Input: two strings S and T
Goal: find an “M matching” such that the resulting “M pruning” (S′, T′) minimises d(S′, T′)
I Here M ∈ {exemplar, intermediate, full}, and d is any distance on S_n or S_n^± (with n = |S′| = |T′|);
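For very small inputs, one instance of this theme can be solved by exhaustive search. The sketch below takes M = exemplar and d = number of signed breakpoints against the identity; both choices, the framing with 0 and n + 1, and all function names are assumptions made for illustration. The search is exponential in the number of gene families.

```python
from itertools import product

def signed_breakpoints(perm):
    """Breakpoints of a signed permutation against the identity,
    after framing with 0 and n + 1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

def exemplar_prunings(s, t):
    """Yield T' for every exemplar matching, S' being relabelled to 1..m."""
    genes = sorted({abs(x) for x in s} & {abs(x) for x in t})
    # For each shared gene, every way of pairing one copy in s with one in t.
    options = [
        [(i, j) for i, x in enumerate(s) if abs(x) == g
                for j, y in enumerate(t) if abs(y) == g]
        for g in genes
    ]
    for matching in product(*options):
        label = {}
        for rank, (i, j) in enumerate(sorted(matching), start=1):
            label[j] = rank if s[i] == t[j] else -rank
        yield [label[j] for j in sorted(label)]

def exemplar_breakpoint_distance(s, t):
    """Minimum number of breakpoints over all exemplar prunings."""
    return min(signed_breakpoints(tp) for tp in exemplar_prunings(s, t))

S = [1, 2, -4, -2, 3, 1, 4, -3, 4]
T = [4, 1, -3, -2, 2, 1, 2, 4]
print(exemplar_breakpoint_distance(S, T))
```

Note that the cost of a choice is only known once the whole matching has been pruned, which is why this sketch enumerates complete matchings.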
Strings
I This is not “just a matching problem”;
I in matching problems, every edge is given a weight, and we have to optimise a function that takes all weights into account;
I while here, we look for a matching that optimises a quantity, but the edge weights are not fixed to begin with;
I In other words: in matching problems we can compute the cost of a partial solution, while here we must have a full matching before we can even begin to compute the cost;
Strings: extensions
I Strings can of course be signed to take directionality into account;
I They can also be circular;
I And of course we could have a mix of both to represent different chromosomes;
Posets Set systems
Other models
I We’ve mostly seen (signed) permutations and strings so far;
I Other models may be more suitable, according to:
I the data we have;
I the relations we want to take into account;
I We mention briefly the following structures:
I posets;
I set systems;
The need for other models
I Most genomes consist of several chromosomes:
Posets
I Recall that genomes are not directly copied from a long string of DNA to a drive;
1. “small” subsequences called reads are identified;
2. then those reads are assembled to form the target genome;
I We still want to be able to compare genomes even if only partial gene order information is available;
I This naturally leads us to compare posets instead of permutations or strings;
Posets
I Informally, although we may not know the complete ordering, we may know parts of it;
I So segments are partially ordered, and genomes may be represented by directed acyclic graphs, where:
I vertices stand for segments;
I arc (u, v) means “segment u precedes segment v”;
I In this regard, permutations are paths of maximal length;
Example (a genome as a poset)
(figure: a directed acyclic graph on the segments 1, −2, 3, −5, 6, 10, 9, 12)
Comparing genomes as posets
I Comparing genomes G1 and G2 represented as posets is based on permutations:
I find linear extensions L1 and L2 that minimise d(L1, L2);
I Another way of trying to aggregate their contents is by:
I merging them into a conflict-free graph;
I finding a linear extension of that graph;
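The second approach can be sketched with the standard library. This is a simplification: the sketch only detects conflicts (cycles in the merged graph) and gives up, whereas a real method would have to remove arcs to resolve them. Posets are given as lists of arcs (u, v), a hypothetical input format, and the toy arcs in the demo are not taken from the slides.

```python
from graphlib import TopologicalSorter, CycleError

def merge_and_linearise(arcs1, arcs2):
    """Merge two posets given as arc lists and return one linear
    extension of the union, or None when the union contains a cycle."""
    predecessors = {}
    for u, v in list(arcs1) + list(arcs2):
        predecessors.setdefault(u, set())
        predecessors.setdefault(v, set()).add(u)   # u must come before v
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None

print(merge_and_linearise([(1, 2), (2, 3)], [(1, 4), (4, 3)]))
print(merge_and_linearise([(1, 2)], [(2, 1)]))
```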
Finding an “agreement” for posets
(figure: poset G1 on segments 1, −2, 3, −5, 6, 10, 8, 12; poset G2 on segments 1, −2, −4, −5, 7, 9, 11, 12; and the merged graph G1 ∪ G2 on all twelve segments)
Set systems and the syntenic distance
I Recall that chromosomes are ordered sets of genes;
I Sometimes we’re not interested in order, but in the fact that two segments belong to the same chromosome;
I So we view a genome as a family of (unordered) sets of genes;
Set systems and the syntenic distance
I Three operations are taken into account in that setting; starting from {{a,b,c},{p,q,r},{x,y}}:
I fission yields {{a,b},{c},{p,q,r},{x,y}};
I fusion yields {{a,b,c,x,y},{p,q,r}};
I translocation yields {{a,p},{b,c,q,r},{x,y}};
I The syntenic distance between two genomes is then the minimum number of such operations that are needed to transform one into the other;
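The three operations are simple set manipulations. The sketch below assumes a genome is a frozenset of frozensets with pairwise disjoint chromosomes, as in the example; the function names and argument conventions are illustrative.

```python
def make_genome(chromosomes):
    """Build a genome (frozenset of frozensets) from any iterable of sets."""
    return frozenset(frozenset(c) for c in chromosomes)

def fission(genome, chrom, part):
    """Split chromosome `chrom` into `part` and the remaining genes."""
    chrom, part = frozenset(chrom), frozenset(part)
    assert chrom in genome and part and part < chrom
    return (genome - {chrom}) | {part, chrom - part}

def fusion(genome, c1, c2):
    """Merge chromosomes c1 and c2 into a single chromosome."""
    c1, c2 = frozenset(c1), frozenset(c2)
    assert c1 in genome and c2 in genome and c1 != c2
    return (genome - {c1, c2}) | {c1 | c2}

def translocation(genome, c1, c2, new1):
    """Redistribute the genes of c1 and c2 into `new1` and the rest."""
    c1, c2, new1 = frozenset(c1), frozenset(c2), frozenset(new1)
    union = c1 | c2
    assert c1 in genome and c2 in genome and c1 != c2
    assert new1 and new1 < union
    return (genome - {c1, c2}) | {new1, union - new1}

G = make_genome(["abc", "pqr", "xy"])
print(fission(G, "abc", "c") == make_genome(["ab", "c", "pqr", "xy"]))
print(fusion(G, "abc", "xy") == make_genome(["abcxy", "pqr"]))
print(translocation(G, "abc", "pqr", "ap") == make_genome(["ap", "bcqr", "xy"]))
```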
Set systems and the syntenic distance
I There is a compact representation that allows us to assume that:
1. our input is {S_1, S_2, …, S_k} (subsets of {1, 2, …, n});
2. our target is {{1}, {2}, …, {n}};
I So we want to obtain that genome using as few fissions, fusions and translocations as possible;
I Syntenic genes are simply genes that belong to the same chromosome;
Synteny graph
I A graph-theoretic approach for attacking the problem was proposed:
Definition ([DasGupta et al., 1998])
The synteny graph of an instance S(n, k) is defined by:
I V = {S_1, S_2, …, S_k};
I E = {{S_i, S_j} | S_i ∩ S_j ≠ ∅, 1 ≤ i ≠ j ≤ k};
I The synteny graph of our target {{1}, {2}, …, {n}} has n components;
Mutations and the synteny graph
I Translocations, fusions and fissions affect the graph in different ways;
I translocations (may) disconnect adjacent vertices;
I fissions split vertices into two nonadjacent vertices;
I fusions: opposite of fissions;
I Our goal is to obtain n components;
I It can be proved that the distance is at least n − p (where p is the number of components in our instance’s graph);
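The lower bound n − p is straightforward to compute. The sketch below builds the synteny graph implicitly (two sets are adjacent when they intersect) and counts its components with a small union-find; the function names are illustrative, and the input is assumed to be a list of Python sets over {1, …, n}.

```python
from itertools import combinations

def synteny_components(sets):
    """Number of connected components of the synteny graph of `sets`."""
    parent = list(range(len(sets)))

    def find(x):
        # Follow parent pointers to the representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(sets)), 2):
        if sets[i] & sets[j]:          # an edge {S_i, S_j} of the graph
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(sets))})

def syntenic_lower_bound(sets, n):
    """The bound n - p, where p is the number of components."""
    return n - synteny_components(sets)

instance = [{1, 2}, {2, 3}, {4}]
print(synteny_components(instance), syntenic_lower_bound(instance, 4))
```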
About the syntenic distance
I The synteny graph dictates that we want to increase the number of connected components;
I In that regard, restricting oneself to “intra-component moves” seems optimal;
I But any approach that does this is a 2-approximation [Liben-Nowell, 2001];
I No better approximation is known;
I And computing the distance or an optimal scenario is NP-hard [DasGupta et al., 1998];
SAT solvers Linear programming
Today’s models: wrap-up
I As soon as we have duplications, most problems become hard (to solve exactly, or even to approximate within a reasonable factor);
I As soon as we forget about order (partially or completely), we also end up with difficult problems;
I Yet the problems still have to be solved;
Alternative approach: SAT solvers
I SAT solvers are highly-optimised programs for solving the well-known NP-complete satisfiability problem [Cook, 1971]:
Problem (satisfiability (SAT))
Input: a Boolean formula φ in conjunctive normal form.
Question: is there a satisfying assignment for φ?
I Idea: take advantage of these solvers;
Alternative approach: SAT solvers
I The workflow is as follows:
(workflow figure: problem instance → [translation] → Boolean formula → SAT solver → satisfying assignment → solution)
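To make the middle of that workflow concrete, here is a toy stand-in for the solver step. It assumes the formula is already in conjunctive normal form and encoded as lists of non-zero integers (literal v for variable v, −v for its negation), a common convention. It simply tries every assignment, which a real SAT solver would of course avoid.

```python
from itertools import product

def brute_force_sat(clauses, num_vars):
    """Return a satisfying assignment {variable: bool}, or None.
    Exponential in num_vars: for illustration only."""
    for bits in product([False, True], repeat=num_vars):
        # A clause holds when at least one of its literals is true.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return {v + 1: bits[v] for v in range(num_vars)}
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
formula = [[1, 2], [-1, 2], [-2, 3]]
print(brute_force_sat(formula, 3))
print(brute_force_sat([[1], [-1]], 1))
```

The translation steps on either side (from a rearrangement instance to a formula, and from an assignment back to a solution) depend on the problem at hand and are not sketched here.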
Alternative approach: linear and pseudo-boolean programming
I Linear programs are of the form:
\[ \text{maximise } c^{T} x \quad \text{subject to } A x \le b \text{ and } x \ge 0 \]
I Pseudo-boolean programs: same form, but the function to optimise maps {0,1}^n to ℝ (versus {0,1} for Boolean functions);
I Specialised solvers also exist for those and were used to solve rearrangement problems on strings [Angibaud et al., 2007] and posets [Angibaud et al., 2009];
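A pseudo-boolean program of that form can be solved by enumeration when n is tiny. The sketch below is only meant to fix the notation (c, A, b and 0/1 variables); it is not how the specialised solvers mentioned above work, and the function name and toy instance are made up for the example.

```python
from itertools import product

def solve_pseudo_boolean(c, A, b):
    """Maximise c.x subject to A x <= b with x in {0,1}^n, by enumeration.
    Returns (best value, best x), or (None, None) if infeasible."""
    best_value, best_x = None, None
    for x in product((0, 1), repeat=len(c)):
        feasible = all(
            sum(a * xi for a, xi in zip(row, x)) <= bound
            for row, bound in zip(A, b)
        )
        if feasible:
            value = sum(ci * xi for ci, xi in zip(c, x))
            if best_value is None or value > best_value:
                best_value, best_x = value, x
    return best_value, best_x

# maximise 3*x1 + 2*x2 + 4*x3  subject to  x1 + x2 <= 1  and  x2 + x3 <= 1
print(solve_pseudo_boolean([3, 2, 4], [[1, 1, 0], [0, 1, 1]], [1, 1]))
```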
Comparative genomics wrap-up
I Here we talked mostly about computing “edit distances” between genomes;
I Other measures of similarity exist that are not associated with mutations;
I Many hard problems;
I Much remains to be done in order to satisfy biologists;
I realistic models;
I software;
I ...
From comparisons to phylogenies Bounds
Selected results
Beyond pairwise comparisons
I The genome rearrangement problems we’ve seen were formulated in a pairwise fashion;
I But actually, more than two genomes can be taken into account;
I Unsurprisingly, most problems become hard in that setting;
Why more than two genomes?
I A sequence does not yield enough information for ancestral genome reconstruction:
(figure: genomes G1 and G2)
I Taking an additional genome into account restricts our choices:
(figure: genomes G1, G2 and G3)
I What’s more, it’s ultimately one of our goals;
Median problems
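I Measures of similarities between genomes are useful in reconstructing phylogenies;
Example (phylogeny from distance matrix)
    a  b  c  d  e
a   0  2  3  6  6
b   2  0  3  6  6
c   3  3  0  5  5
d   6  6  5  0  4
e   6  6  5  4  0
(figure: the corresponding tree on leaves a, b, c, d, e; pendant edges of weight 1 for a, b, c and 2 for d, e; internal edges of weight 1 and 2)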
I (The matrix must satisfy some conditions [Buneman, 1971]);
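The condition usually associated with that reference is the four-point condition: for any four taxa, the two largest of the three pairwise sums of distances must be equal. Assuming this is the condition meant here, it can be checked directly on the matrix of the example; the function name is illustrative.

```python
from itertools import combinations

def satisfies_four_point(d):
    """Check the four-point condition on a symmetric distance matrix d."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if sums[1] != sums[2]:   # the two largest sums must coincide
            return False
    return True

# Rows and columns in the order a, b, c, d, e.
matrix = [
    [0, 2, 3, 6, 6],
    [2, 0, 3, 6, 6],
    [3, 3, 0, 5, 5],
    [6, 6, 5, 0, 4],
    [6, 6, 5, 4, 0],
]
print(satisfies_four_point(matrix))
```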
Median problems
I Parsimony again: search for a tree that minimises the total number of evolutionary events (i.e. the sum of all edge weights);
I In its simplest form, the problem we want to solve is:
Problem (median of three)
Given: π, σ, τ in S_n^±; a distance d : S_n^± × S_n^± → ℕ.
Find: a permutation µ in S_n^± that minimises
w(µ) = d(π, µ) + d(σ, µ) + d(τ, µ).
I Can be generalised to more than three input permutations;
Generic bounds [Siepel and Moret, 2001]
I Generic lower and upper bounds for any distance:
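(figure: triangle on π, σ, τ with sides d(π, σ), d(π, τ), d(σ, τ), and a median µ in the middle joined to the corners by d(π, µ), d(µ, σ), d(µ, τ))
I Upper bound for a median µ (any of the three inputs is a candidate):
\[ w(\mu) \le \min\{ \overbrace{d(\pi,\sigma)+d(\pi,\tau)}^{\text{if } \mu=\pi},\ \overbrace{d(\pi,\sigma)+d(\sigma,\tau)}^{\text{if } \mu=\sigma},\ \overbrace{d(\pi,\tau)+d(\sigma,\tau)}^{\text{if } \mu=\tau} \} \]
I Lower bound for any µ (sum the three triangle inequalities):
\[ w(\mu) \ge \frac{d(\pi,\sigma)+d(\pi,\tau)+d(\sigma,\tau)}{2} \]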
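Both bounds only need the three pairwise distances, so they are cheap to compute. The sketch below checks them against an exhaustive search on a tiny example. As a stand-in distance it uses the Kendall tau distance on unsigned permutations, chosen only because it is a metric that fits in a few lines; the median problem above is stated for signed permutations, and all names here are illustrative.

```python
from itertools import permutations

def kendall_tau(p, q):
    """Number of pairs of elements ordered differently in p and q."""
    position = {x: i for i, x in enumerate(q)}
    r = [position[x] for x in p]
    return sum(1 for i in range(len(r))
                 for j in range(i + 1, len(r)) if r[i] > r[j])

def median_bounds(pi, sigma, tau, d):
    """Generic (lower, upper) bounds on the weight of a median."""
    a, b, c = d(pi, sigma), d(pi, tau), d(sigma, tau)
    lower = (a + b + c + 1) // 2         # weights are integers: round up
    upper = min(a + b, a + c, b + c)     # mu = pi, mu = sigma, mu = tau
    return lower, upper

def brute_force_median_weight(pi, sigma, tau, d):
    """Weight of an optimal median, trying every permutation."""
    return min(d(pi, m) + d(sigma, m) + d(tau, m)
               for m in permutations(sorted(pi)))

pi, sigma, tau = (1, 2, 3, 4), (2, 1, 4, 3), (4, 3, 2, 1)
print(median_bounds(pi, sigma, tau, kendall_tau))
print(brute_force_median_weight(pi, sigma, tau, kendall_tau))
```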