Algorithms and bioinformatics: Comparative genomics
Anthony Labarre
September 26, 2016

(1)


(2)

Motivation

- We saw a model for representing genomes without directionality;
- We saw another model for taking directionality into account;
- Both of them lack realism in a crucial way: they don’t allow duplications;
- And duplications, insertions and deletions account for a very large part of what happens in evolution [Ohno, 1970];

(3)

Two examples of duplications

Example (tandem duplications)

(source: K. Aainsqatsi on Wikimedia)

Example (whole genome duplication)

(4)

Today

- Models that take duplications into account;
- Other approaches to solving the corresponding problems;
- Other models for those cases where only partial information is available or relevant;

(5)

Strings

- Since duplications pervade genomes, we should take them into account;
- We now see genomes as strings on an alphabet Σ;
- Be careful: similar segments have been identified, so Σ = {segments} and not {A, C, G, T};
- Our goal is still to explain evolution using most parsimonious scenarios made of fixed transformations;

(6)

Strings

- Note: the restriction to sorting problems does not work anymore;
  - if you have two A’s, which one should be “number one”?
- So we really are interested in transforming one string into another, which is not equivalent to sorting another string;
- Sorting problems have been considered in that model, but they’re just a special case of a more general problem;

(7)

Strings

- We can distinguish between several approaches based on gene contents;
  - Either we have exactly the same contents in both genomes (and duplications are of course allowed);
  - Or we have duplications but with different amounts of repetitions (e.g. three 1’s in genome A but only two in genome B);
- This time the breakpoint graph cannot save us anymore, since we would not know how to connect elements or decompose the graph;

(8)

Balanced strings

- The number of occurrences of a character c in a string S is denoted by occ(c, S);

Definition (balanced strings)

Two strings S and T on an alphabet Σ are balanced if:

∀ c ∈ Σ : occ(c, S) = occ(c, T).

- Basically, S and T are anagrams;
- Straightforward generalisation of permutations: we have duplications, but we actually still have the same content in both genomes;
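The definition translates almost literally into code. The following is a minimal sketch, assuming strings are ordinary Python strings over single-character segments; the function names `occ` and `are_balanced` are chosen here for illustration.

```python
from collections import Counter

def occ(c, s):
    """Number of occurrences of character c in string s."""
    return s.count(c)

def are_balanced(s, t):
    """True if every character occurs equally often in s and t (anagrams)."""
    return Counter(s) == Counter(t)

print(are_balanced("dictionary", "indicatory"))  # True
print(are_balanced("1123", "1223"))              # False: occ('1') differs
```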

(9)

Comparing balanced strings

- One way of relating genomes’ contents is to identify common segments;
- In other words, we want to partition genomes into the same set of segments;
  - this is how we obtained (signed) permutations;
  - but now we want to partition the resulting sequences;

(10)

Generalising breakpoints

- Recall that, for permutations:
  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨n n−1 ··· 1⟩ for reversals);
  - breakpoints are pairs that are not adjacencies;
- Recall that, for signed permutations:
  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨−n −(n−1) ··· −1⟩ for signed reversals);
  - breakpoints are pairs that are not adjacencies;
- Those can be generalised to any pair of permutations;
- And we can do the same thing for strings;
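The generalisation to a pair of permutations can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `breakpoints_between` is hypothetical, the permutations are unsigned, and the framing elements 0 and n+1 that are often added are left out because the slide does not mention them. Setting `undirected=True` also accepts pairs that are adjacent in reverse order, which corresponds to the reversal case.

```python
def breakpoints_between(pi, sigma, undirected=False):
    """Count pairs of adjacent elements of pi that are not adjacent in sigma."""
    adjacent_in_sigma = set(zip(sigma, sigma[1:]))
    if undirected:
        # Also accept (y, x) when (x, y) is adjacent in sigma.
        adjacent_in_sigma |= {(y, x) for x, y in adjacent_in_sigma}
    return sum(pair not in adjacent_in_sigma for pair in zip(pi, pi[1:]))

identity = [1, 2, 3, 4, 5]
print(breakpoints_between([3, 4, 1, 2, 5], identity))                   # 2
print(breakpoints_between([3, 2, 1, 4, 5], identity))                   # 3
print(breakpoints_between([3, 2, 1, 4, 5], identity, undirected=True))  # 1
```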

(11)

Minimum common string partition

- A partition of a string S is a set of strings that can be concatenated to obtain S;

- A common partition of two strings S and T is a set of strings that can be concatenated to obtain both S and T;

Example (common string partitions)

Here’s a common partition of “dictionary” and “indicatory”:

d i c t i o n a r y

S1 S2 S3 S4 S5 S6 S7
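Checking that a collection of blocks is a common partition can be sketched with a small backtracking search. The block boundaries of the slide's figure are not recoverable from the text, so the 7-block partition used below is an assumption that merely agrees with the labels S1 to S7; the function names are hypothetical.

```python
from collections import Counter

def can_tile(s, blocks):
    """True if the multiset `blocks` can be reordered and concatenated to spell s."""
    remaining = Counter(blocks)

    def extend(pos):
        if pos == len(s):
            return sum(remaining.values()) == 0
        for block in list(remaining):
            if remaining[block] > 0 and s.startswith(block, pos):
                remaining[block] -= 1
                if extend(pos + len(block)):
                    return True
                remaining[block] += 1  # undo and try another block
        return False

    return extend(0)

def is_common_partition(blocks, s, t):
    """True if the same blocks can be concatenated to obtain both s and t."""
    return can_tile(s, blocks) and can_tile(t, blocks)

# Assumed partition: dic|t|i|o|n|a|ry versus i|n|dic|a|t|o|ry.
blocks = ["dic", "t", "i", "o", "n", "a", "ry"]
print(is_common_partition(blocks, "dictionary", "indicatory"))  # True
```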

(12)

Minimum common string partition

- A common string partition is minimum if there is no smaller common string partition for the two strings under consideration;
- This leads to the following decision problem:

Problem (minimum common string partition (MCSP))
Instance: balanced strings S and T, a bound k ∈ ℕ;
Question: is there a common partition of S and T with at most k blocks?

(13)

Relation(s) to rearrangement problems

- Recall that breakpoints were pairs of elements adjacent in one genome but not in the other;
- Common string partitions generalise that point of view to an arbitrary number of elements in each part;
- So if we have a minimum common string partition for S and T, we get the number of breakpoints between strings S and T;

(14)

About MCSP

- Bad news about MCSP:
  - NP-hard, even if only one gene family is nontrivial [Blin et al., 2004];
  - APX-hard, even if no character appears more than twice [Goldstein et al., 2005];
- Good news about MCSP:
  - fixed parameter tractable: a solution of size k can be found in time f(k)·poly(n) (n = |S| = |T|) [Bulteau and Komusiewicz, 2014];
- Greedy approach [Goldstein and Lewenstein, 2011]: repeatedly select a longest common substring (LCS) that contains no marked letter, and mark its letters;
  ✓ simple and fast (runs in O(n) time);
  × approximation ratio between Ω(n^0.43) and O(n^0.69) [Kaplan and Shafrir, 2006];
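The greedy idea can be sketched in a few lines. This is a deliberately naive polynomial-time scan written for illustration, not the linear-time implementation of Goldstein and Lewenstein; the function name `greedy_mcsp` is hypothetical.

```python
def greedy_mcsp(s, t):
    """Greedy heuristic: repeatedly take a longest common substring made of
    unmarked letters in both strings, then mark it. Returns the blocks found."""
    assert sorted(s) == sorted(t), "inputs must be balanced"
    marked_s = [False] * len(s)
    marked_t = [False] * len(t)
    blocks = []
    while not all(marked_s):
        best = (0, 0, 0)  # (length, start in s, start in t)
        for i in range(len(s)):
            for j in range(len(t)):
                k = 0
                while (i + k < len(s) and j + k < len(t)
                       and not marked_s[i + k] and not marked_t[j + k]
                       and s[i + k] == t[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        for k in range(length):
            marked_s[i + k] = marked_t[j + k] = True
        blocks.append(s[i:i + length])
    return blocks

print(greedy_mcsp("dictionary", "indicatory"))
# ['dic', 'ry', 't', 'i', 'o', 'n', 'a']
```

On this pair the greedy answer happens to be optimal: the only common substrings of length at least two are inside "dic" and "ry", so no common partition has fewer than 7 blocks.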

(15)

Minimum common string partition: variants

- One can also consider signed strings: each segment is then equivalent up to a reversal;
- Or equivalence under full reversals: a partition of S is also a partition of T if one can concatenate its elements to obtain T or its reverse;
- Those variants are still hard, but the positive results do not straightforwardly generalise [Bulteau and Komusiewicz, 2014];

(16)

Unbalanced strings

- Of course, we are not always so lucky that our genomes are just anagrams;
- Most of the time, duplications are not balanced;
- So, what do we do?

(17)

Arbitrary strings

- One idea is to try and match different copies of the same gene across two genomes;
- Three general approaches have been proposed:
  1. the exemplar model;
  2. the intermediate model;
  3. the full model;
- All three are based on a notion of matching;

(18)

Matching and pruning

Definition (gene matching)

A gene matching between two strings S and T is a set of disjoint pairs {S_i, T_j} such that S_i = T_j for every such pair (1 ≤ i ≤ |S|, 1 ≤ j ≤ |T|).

Definition (pruning)

Given two strings S and T and a gene matching M, the M-pruning is the pair (S′, T′) obtained by removing all unmatched characters from S and T and relabelling the remaining characters according to M.

(examples to appear shortly)

(19)

Matching(s) and pruning(s)

- Matchings will depend on the model we use;
- Since prunings are derived from matchings, they will also vary depending on the underlying model;
- Let us review them on examples;

(20)

Exemplar matching / pruning

- In the exemplar model, we match only one copy of each gene:

Example (exemplar matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 3 4

T′ = 1 −3 −2 4
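The pruning step of this example can be reproduced with a short sketch. Two assumptions are made here and are not stated on the slide: a matching is given as 0-based index pairs (i, j), and genes are matched up to sign (the example pairs a 2 in S with a −2 in T). Relabelling is omitted, since each gene keeps a single copy in the exemplar model; the function name `prune` is hypothetical.

```python
def prune(s, t, matching):
    """Keep only the matched positions of s and t.
    `matching` is a list of disjoint (i, j) index pairs with |s[i]| == |t[j]|."""
    used_i = {i for i, _ in matching}
    used_j = {j for _, j in matching}
    assert len(used_i) == len(used_j) == len(matching), "pairs must be disjoint"
    assert all(abs(s[i]) == abs(t[j]) for i, j in matching), "genes must agree"
    return [s[i] for i in sorted(used_i)], [t[j] for j in sorted(used_j)]

S = [1, 2, -4, -2, 3, 1, 4, -3, 4]
T = [4, 1, -3, -2, 2, 1, 2, 4]
M = [(0, 1), (1, 3), (4, 2), (6, 7)]  # one copy of each of the genes 1, 2, 3, 4
print(prune(S, T, M))  # ([1, 2, 3, 4], [1, -3, -2, 4])
```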

(21)

Intermediate matching/pruning

- In the intermediate model, we match at least one copy of each gene:

Example (intermediate matching/pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

0 0

(22)

Full matching / pruning

- In the full model, we match as many copies of each gene as possible:

Example (full matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 −4′ −2′ 3 1′ 4

T′ = 4′ 1′ −3 2′ 1 2 4

(23)

Using matchings and prunings

- Once we’ve pruned our input strings, we can compare them as if they were permutations;
- This gives rise to many variations on the following theme:

Problem (“(M, d)-comparison”)
Input: two strings S and T
Goal: find an “M matching” such that the resulting “M pruning” (S′, T′) minimises d(S′, T′)
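For tiny inputs, one instance of this theme can be solved by exhaustive search. The sketch below takes M to be the exemplar model and d to be a signed breakpoint count; both choices, the matching of genes up to sign, and all function names are assumptions made for illustration. The search is exponential in the number of duplicated genes.

```python
from itertools import product

def signed_breakpoints(p, q):
    """Pairs adjacent in p that are adjacent in q neither as (x, y) nor as (-y, -x)."""
    adjacent = set(zip(q, q[1:]))
    adjacent |= {(-y, -x) for x, y in adjacent}
    return sum(pair not in adjacent for pair in zip(p, p[1:]))

def exemplar_prunings(s, t):
    """Yield every pair (s', t') keeping exactly one copy of each common gene."""
    families = sorted({abs(g) for g in s} & {abs(g) for g in t})
    pos_s = [[i for i, g in enumerate(s) if abs(g) == f] for f in families]
    pos_t = [[j for j, g in enumerate(t) if abs(g) == f] for f in families]
    for choice_s in product(*pos_s):
        for choice_t in product(*pos_t):
            yield ([s[i] for i in sorted(choice_s)],
                   [t[j] for j in sorted(choice_t)])

def exemplar_distance(s, t, d=signed_breakpoints):
    """Minimum of d over all exemplar prunings (brute force)."""
    return min(d(sp, tp) for sp, tp in exemplar_prunings(s, t))

print(exemplar_distance([1, 2, 1, 3], [2, 1, 3]))  # 0: keep the second copy of 1
```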

(24)

Strings

- This is not “just a matching problem”;
  - in matching problems, every edge is given a weight, and we have to optimise a function that takes all weights into account;
  - while here, we look for a matching that optimises a quantity, but the edge weights are not fixed to begin with;
- In other words: in matching problems we can compute the cost of a partial solution, while here we must have a full matching before we can even begin to compute the cost;

(25)

Strings: extensions

- Strings can of course be signed to take directionality into account;
- They can also be circular;
- And of course we could have a mix of both to represent different chromosomes;

(26)

Other models

- We’ve mostly seen (signed) permutations and strings so far;
- Other models may be more suitable, according to:
  - the data we have;
  - the relations we want to take into account;
- We mention briefly the following structures:
  - posets;
  - set systems;

(27)

The need for other models

- Most genomes consist of several chromosomes:

(28)

Posets

- Recall that genomes are not directly copied from a long string of DNA to a drive;
  1. “small” subsequences called reads are identified;
  2. then those reads are assembled to form the target genome;
- We still want to be able to compare genomes even if only partial gene order information is available;
- This naturally leads us to compare posets instead of permutations or strings;

(29)

Posets

- Informally, although we may not know the complete ordering, we may know parts of it;
- So segments are partially ordered, and genomes may be represented by directed acyclic graphs, where:
  - vertices stand for segments;
  - arc (u, v) means “segment u precedes segment v”;
- In this regard, permutations are paths of maximal length;

Example (a genome as a poset)

[Figure: a genome drawn as a directed acyclic graph on signed segments]

(30)

Comparing genomes as posets

- Comparing genomes G₁ and G₂ represented as posets is based on permutations:
  - find linear extensions L₁ and L₂ that minimise d(L₁, L₂);
- Another way of trying to aggregate their contents is by:
  - merging them into a conflict-free graph;
  - finding a linear extension of that graph;
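The second approach can be sketched with a plain union of arcs followed by Kahn's topological sort. This is a minimal sketch under assumptions: posets are given as (vertices, arcs) pairs, the two toy posets below are invented for illustration, and a conflict is only detected (as a cycle in the union), not resolved.

```python
from collections import defaultdict, deque

def merge(poset1, poset2):
    """Union of two posets given as (vertices, arcs) pairs."""
    (v1, a1), (v2, a2) = poset1, poset2
    vertices = list(dict.fromkeys(list(v1) + list(v2)))  # keep first occurrences
    return vertices, set(a1) | set(a2)

def linear_extension(vertices, arcs):
    """Kahn's algorithm: one linear extension, or None if the arcs contain a cycle."""
    successors = defaultdict(set)
    indegree = {v: 0 for v in vertices}
    for u, v in set(arcs):
        successors[u].add(v)
        indegree[v] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order if len(order) == len(vertices) else None

g1 = ([1, 2, 3], [(1, 2), (2, 3)])   # hypothetical poset: 1 < 2 < 3
g2 = ([1, 4, 3], [(1, 4), (4, 3)])   # hypothetical poset: 1 < 4 < 3
vertices, arcs = merge(g1, g2)
print(linear_extension(vertices, arcs))        # a valid order of 1, 2, 3, 4
print(linear_extension([1, 2], [(1, 2), (2, 1)]))  # None: conflicting orders
```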

(31)

Finding an “agreement” for posets

[Figure: two posets G1 and G2 on signed segments, and the merged graph G1 ∪ G2]

(32)

Set systems and the syntenic distance

- Recall that chromosomes are ordered sets of genes;
- Sometimes we’re not interested in order, but in the fact that two segments belong to the same chromosome;
- So we view a genome as a family of (unordered) sets of genes;

(33)

Set systems and the syntenic distance

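- Three operations are taken into account in that setting, illustrated on {{a,b,c},{p,q,r},{x,y}}:
  - fission: {{a,b,c},{p,q,r},{x,y}} → {{a,b},{c},{p,q,r},{x,y}}
  - fusion: {{a,b,c},{p,q,r},{x,y}} → {{a,b,c,x,y},{p,q,r}}
  - translocation: {{a,b,c},{p,q,r},{x,y}} → {{a,p},{b,c,q,r},{x,y}}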
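The three operations are easy to express on families of sets. The following sketch assumes genomes are frozensets of frozensets; the function names and the way a translocation is parameterised (by the parts that end up together) are choices made for illustration.

```python
def genome_of(*chromosomes):
    """Build a genome (family of unordered gene sets) from iterables of genes."""
    return frozenset(frozenset(c) for c in chromosomes)

def fission(genome, chrom, part):
    """Split `chrom` into `part` and the rest."""
    chrom, part = frozenset(chrom), frozenset(part)
    assert chrom in genome and part and part < chrom
    return (genome - {chrom}) | {part, chrom - part}

def fusion(genome, c1, c2):
    """Merge two chromosomes into one."""
    c1, c2 = frozenset(c1), frozenset(c2)
    assert c1 in genome and c2 in genome and c1 != c2
    return (genome - {c1, c2}) | {c1 | c2}

def translocation(genome, c1, c2, part1, part2):
    """Regroup genes: part1 (from c1) joins part2 (from c2), the rests join too."""
    c1, c2 = frozenset(c1), frozenset(c2)
    part1, part2 = frozenset(part1), frozenset(part2)
    assert c1 in genome and c2 in genome and part1 <= c1 and part2 <= c2
    new1, new2 = part1 | part2, (c1 - part1) | (c2 - part2)
    assert new1 and new2, "both resulting chromosomes must be non-empty"
    return (genome - {c1, c2}) | {new1, new2}

g = genome_of("abc", "pqr", "xy")
print(fission(g, "abc", "ab") == genome_of("ab", "c", "pqr", "xy"))           # True
print(fusion(g, "abc", "xy") == genome_of("abcxy", "pqr"))                    # True
print(translocation(g, "abc", "pqr", "a", "p") == genome_of("ap", "bcqr", "xy"))  # True
```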

(34)

Set systems and the syntenic distance

- There is a compact representation that allows us to assume that:
  1. our input is {S₁, S₂, …, S_k} (subsets of {1, 2, …, n});
  2. our target is {{1}, {2}, …, {n}};
- So we want to obtain that genome using as few fissions, fusions and translocations as possible;
- Syntenic genes are simply genes that belong to the same chromosome;

(35)

Synteny graph

- A graph-theoretic approach for attacking the problem was proposed:

Definition ([DasGupta et al., 1998])

The synteny graph of an instance S(n, k) is defined by:

- V = {S₁, S₂, …, S_k};
- E = {{S_i, S_j} | S_i ∩ S_j ≠ ∅, 1 ≤ i ≠ j ≤ k};

- The synteny graph of our target {{1}, {2}, …, {n}} has n connected components;

(36)

Mutations and the synteny graph

- Translocations, fusions and fissions affect the graph in different ways;
  - translocations (may) disconnect adjacent vertices;
  - fissions split vertices into two nonadjacent vertices;
  - fusions: opposite of fissions;
- Our goal is to obtain n components;
- It can be proved that the distance is at least n − p (where p is the number of components in our instance’s graph);
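The bound n − p is cheap to evaluate. The sketch below counts the components of the synteny graph with a small union-find, joining two sets whenever they intersect; the function names and the toy instance are assumptions made for illustration.

```python
def synteny_components(sets):
    """Number of connected components of the synteny graph of `sets`."""
    parent = list(range(len(sets)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if sets[i] & sets[j]:          # an edge of the synteny graph
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(sets))})

def syntenic_lower_bound(sets, n):
    """The bound n - p from the slide, p being the number of components."""
    return n - synteny_components(sets)

instance = [{1, 2}, {2, 3}, {4, 5}]        # hypothetical instance with n = 5, k = 3
print(synteny_components(instance))        # 2
print(syntenic_lower_bound(instance, 5))   # 3
```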

(37)

About the syntenic distance

- The synteny graph dictates that we want to increase the number of connected components;
- In that regard, restricting oneself to “intra-component moves” seems optimal;
- But any approach that does this is a 2-approximation [Liben-Nowell, 2001];
- No better approximation is known;
- And computing the distance or an optimal scenario is NP-hard [DasGupta et al., 1998];

(38)

Today’s models: wrap-up

- As soon as we have duplications, most problems become hard (to solve exactly, or even to approximate within a reasonable factor);
- As soon as we forget about order (partially or completely), we also end up with difficult problems;
- Yet the problems still have to be solved;

(39)

Alternative approach: SAT solvers

- SAT solvers are highly-optimised programs for solving the well-known NP-complete satisfiability problem [Cook, 1971]:

Problem (satisfiability (SAT))
Input: a Boolean formula φ in conjunctive normal form.
Question: is there a satisfying assignment for φ?

- Idea: take advantage of these solvers;
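To make the problem statement concrete, here is a brute-force check that tries every assignment. It only illustrates the problem and is in no way how optimised solvers work; the clause encoding (signed integers, as in the DIMACS convention) and the function name are choices made for this sketch.

```python
from itertools import product

def brute_force_sat(clauses, num_vars):
    """clauses: list of clauses, each a list of nonzero ints; literal v means
    variable v is true, -v means it is false. Returns an assignment or None."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return {v + 1: bits[v] for v in range(num_vars)}
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
print(brute_force_sat([[1, -2], [2, 3], [-1, -3]], 3))
# {1: False, 2: False, 3: True}
print(brute_force_sat([[1], [-1]], 1))  # None: x1 and not x1 is unsatisfiable
```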

(40)

Alternative approach: SAT solvers

- The workflow is as follows:

problem instance → (translation) → Boolean formula → SAT solver → satisfying assignment → solution

(41)

Alternative approach: linear and pseudo-boolean programming

- Linear programs are of the form:

maximise c^T x
subject to Ax ≤ b
and x ≥ 0

- Pseudo-boolean programs: same form, but the function to optimise maps {0,1}ⁿ to ℝ (versus {0,1} for boolean functions);
- Specialised solvers also exist for those and were used to solve rearrangement problems on strings [Angibaud et al., 2007] and

(42)

Comparative genomics wrap-up

- Here we talked mostly about computing “edit distances” between genomes;
- Other measures of similarity exist that are not associated to mutations;
- Many hard problems;
- Much remains to be done in order to satisfy biologists;
  - realistic models;
  - software;
  - ...

(43)

Beyond pairwise comparisons

- The genome rearrangement problems we’ve seen were formulated in a pairwise fashion;
- But actually, more than two genomes can be taken into account;
- Unsurprisingly, most problems become hard in that setting;

(44)

Why more than two genomes?

- A sequence does not yield enough information for ancestral genome reconstruction:

[Figure: G1 and G2]

- Taking an additional genome into account restricts our choices:

[Figure: G1, G2 and G3]

- What’s more, it’s ultimately one of our goals;

(45)

Median problems

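- Measures of similarities between genomes are useful in reconstructing phylogenies;

Example (phylogeny from distance matrix)

     a  b  c  d  e
a    0  2  3  6  6
b    2  0  3  6  6
c    3  3  0  5  5
d    6  6  5  0  4
e    6  6  5  4  0

[Figure: edge-weighted tree with leaves a, b, c, d, e realising the matrix; the pendant edges of a, b and c have weight 1, those of d and e have weight 2, and the two internal edges have weights 1 and 2]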
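That a weighted tree realises a distance matrix can be checked by summing edge weights along leaf-to-leaf paths. The tree below is a reconstruction from the edge weights that survive in the text of the slide, and the internal vertex names x, y, z are invented, so it should be read as an assumption rather than as the slide's exact drawing.

```python
def tree_distances(edges, leaves):
    """Path lengths between all pairs of leaves in a weighted tree."""
    neighbours = {}
    for u, v, w in edges:
        neighbours.setdefault(u, []).append((v, w))
        neighbours.setdefault(v, []).append((u, w))

    def from_source(src):
        dist, stack = {src: 0}, [src]
        while stack:
            u = stack.pop()
            for v, w in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    result = {}
    for a in leaves:
        dist = from_source(a)
        result[a] = [dist[b] for b in leaves]
    return result

# Assumed tree: a and b hang from x, c from y, d and e from z.
edges = [("a", "x", 1), ("b", "x", 1), ("x", "y", 1), ("c", "y", 1),
         ("y", "z", 2), ("d", "z", 2), ("e", "z", 2)]
matrix = {"a": [0, 2, 3, 6, 6], "b": [2, 0, 3, 6, 6], "c": [3, 3, 0, 5, 5],
          "d": [6, 6, 5, 0, 4], "e": [6, 6, 5, 4, 0]}
print(tree_distances(edges, "abcde") == matrix)  # True
```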

(46)

Median problems

- Parsimony again: search for a tree that minimises the total number of evolutionary events (i.e. the sum of all edge weights);
- In its simplest form, the problem we want to solve is:

Problem (median of three)
Given: π, σ, τ in S_n^±; a distance d : S_n^± × S_n^± → ℕ.
Find: a permutation µ in S_n^± that minimises

w(µ) = d(π, µ) + d(σ, µ) + d(τ, µ).

- Can be generalised to more than three input permutations;
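For very small n the median can be found by trying every candidate µ. The sketch below departs from the problem statement in two labelled ways: it uses unsigned permutations, and it takes d to be the exchange distance (n minus the number of cycles of the permutation mapping one input to the other). Function names are hypothetical, and the search takes n! steps.

```python
from itertools import permutations

def exchange_distance(p, q):
    """Minimum number of exchanges of two elements turning p into q."""
    position_in_q = {value: idx for idx, value in enumerate(q)}
    image = [position_in_q[value] for value in p]
    seen, cycles = [False] * len(p), 0
    for start in range(len(p)):
        if not seen[start]:
            cycles += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = image[x]
    return len(p) - cycles

def median_of_three(pi, sigma, tau, d=exchange_distance):
    """Brute force: return (mu, w(mu)) minimising d(pi,mu)+d(sigma,mu)+d(tau,mu)."""
    def weight(mu):
        return d(pi, mu) + d(sigma, mu) + d(tau, mu)
    best = min(permutations(range(1, len(pi) + 1)), key=weight)
    return best, weight(best)

print(median_of_three((1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3)))
# ((1, 2, 3, 4), 2)
```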

(47)

Generic bounds [Siepel and Moret, 2001]

- Generic lower and upper bounds for any distance:

[Figure: triangle with corners π, σ, τ and sides d(π, σ), d(π, τ), d(σ, τ); the median µ lies inside and is joined to the corners by d(π, µ), d(µ, σ), d(µ, τ)]

- w(µ) ≤ min{ d(π, σ) + d(π, τ)  (if µ = π),
              d(π, σ) + d(σ, τ)  (if µ = σ),
              d(π, τ) + d(σ, τ)  (if µ = τ) }.
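The upper bound on this slide comes from taking one of the inputs as the median. A matching lower bound can be derived in the same generic way, assuming that d is a metric (in particular, that it satisfies the triangle inequality); the derivation below is a sketch under that assumption, where µ* denotes an optimal median.

```latex
% Upper bound: each input is a candidate median, e.g. \mu = \pi gives
w(\pi) = d(\pi,\pi) + d(\sigma,\pi) + d(\tau,\pi) = d(\pi,\sigma) + d(\pi,\tau),
\quad\text{so}\quad
w(\mu^*) \le \min\{w(\pi), w(\sigma), w(\tau)\}.

% Lower bound: apply the triangle inequality through any \mu
d(\pi,\sigma) \le d(\pi,\mu) + d(\mu,\sigma), \qquad
d(\pi,\tau)   \le d(\pi,\mu) + d(\mu,\tau),   \qquad
d(\sigma,\tau) \le d(\sigma,\mu) + d(\mu,\tau).

% Summing the three inequalities counts every term of w(\mu) twice:
d(\pi,\sigma) + d(\pi,\tau) + d(\sigma,\tau) \le 2\,w(\mu)
\quad\Longrightarrow\quad
w(\mu) \ge \left\lceil \frac{d(\pi,\sigma) + d(\pi,\tau) + d(\sigma,\tau)}{2} \right\rceil .
```

The ceiling is valid because d takes integer values.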

(48)

Results on median problems

- What has been done:

Operation or measure          Median of three            Best approximation
breakpoint                    NP-hard [Bryant, 1998]     5/3 [Caprara, 2002]
signed breakpoint             NP-hard [Bryant, 1998]     7/6 [Pe’er and Shamir, 2000]
exchange                      ?                          ?
signed reversal               NP-hard [Caprara, 2003]    4/3 [Caprara, 1999]
signed double-cut-and-join    NP-hard [Caprara, 2003]    4/3 [Caprara, 1999]
transposition                 NP-hard [Bader, 2011]      ?

- What could be done:
  1. complexity of the exchange median problem? (trivial for 2 permutations, NP-hard for ≥ 4; what about 3?)
  2. better approximations;
  3. “median clouds” [Eriksen, 2009];

(49)

Further topics

- Other topics could have been discussed:
  - what to do in the presence of multiple optimal sequences?
  - what can be said about the distribution of those distances?
  - how else can we assess the quality of the solutions?
  - how do we modify them if they’re unsatisfactory?
  - what other biological constraints can we add?
  - ...

(50)

References I

Angibaud, S., Fertin, G., Rusu, I., and Vialette, S. (2007).

A pseudo-boolean framework for computing rearrangement distances between genomes with duplicates.

Journal of Computational Biology, 14(4):379–393.

Angibaud, S., Fertin, G., Thévenin, A., and Vialette, S. (2009).

Pseudo boolean programming for partially ordered genomes.

In Ciccarelli, F. and Miklós, I., editors, RECOMB-CG, volume 5817 of Lecture Notes in Computer Science, pages 126–137. Springer.

Bader, M. (2011).

The transposition median problem is NP-complete.

Theoretical Computer Science, 412(12-14):1099–1110.

Blin, G., Fertin, G., Chauve, C., et al. (2004).

The breakpoint distance for signed sequences.

In 1st Conference on Algorithms and Computational Methods for biochemical and Evolutionary Networks (CompBioNets’04), volume 3, pages 3–16.

Bryant, D. (1998).

The complexity of the breakpoint median problem.

Technical report, Université de Montréal.

Bulteau, L. and Komusiewicz, C. (2014).

Minimum common string partition parameterized by partition size is fixed-parameter tractable.

In Proc. 25th SODA, pages 102–121.

(51)

References II

Buneman, P. (1971).

The recovery of trees from measures of dissimilarity.

Mathematics in the Archaeological and Historical Sciences, pages 387–395.

Caprara, A. (1999).

Formulations and hardness of multiple sorting by reversals.

In RECOMB’99, pages 84–93, New York, NY, USA. ACM.

Caprara, A. (2002).

Additive bounding, worst-case analysis, and the breakpoint median problem.

SIAM Journal on Optimization, 13:508–519.

Caprara, A. (2003).

The reversal median problem.

INFORMS Journal on Computing, 15:93–113.

Cook, S. A. (1971).

The complexity of theorem-proving procedures.

In Proc. 3rd STOC, pages 151–158, Shaker Heights, Ohio, USA. ACM.

(52)

References III

Eriksen, N. (2009).

Median clouds and a fast transposition median solver.

In FPSAC’09, Discrete Math. Theor. Comput. Sci. Proc., AK, pages 373–384. Assoc. Discrete Math. Theor. Comput. Sci., Nancy.

Goldstein, A., Kolman, P., and Zheng, J. (2005).

Minimum common string partition problem: Hardness and approximations.

Electronic Journal of Combinatorics, 12(1).

Goldstein, I. and Lewenstein, M. (2011).

Quick greedy computation for minimum common string partitions.

In Giancarlo, R. and Manzini, G., editors, CPM, volume 6661 of Lecture Notes in Computer Science, pages 273–284. Springer.

Kaplan, H. and Shafrir, N. (2006).

The greedy algorithm for edit distance with moves.

Information Processing Letters, 97(1):23–27.

Liben-Nowell, D. (2001).

On the structure of syntenic distance.

Journal of Computational Biology, 8(1):53–67.

Ohno, S. (1970).

Evolution by gene duplication.

Springer-Verlag.

(53)

References IV

Pe’er, I. and Shamir, R. (2000).

Approximation algorithms for the median problem in the breakpoint model.

In Sankoff, D. and Nadeau, J. H., editors, Comparative Genomics, pages 225–241. Kluwer, Dordrecht.

Siepel, A. C. and Moret, B. M. E. (2001).

Finding an optimal inversion median: Experimental results.

In WABI’01, volume 2149 of LNCS, pages 189–203. Springer-Verlag.