Algorithms and bioinformatics Comparative genomics Anthony Labarre September 26, 2016


(2)

Strings Other models Alternative approaches Beyond pairwise comparisons

Duplications in evolution Balanced strings General strings

Motivation

I We saw a model for representing genomes without directionality;

I We saw another model for taking directionality into account;

I Both of them lack realism in a crucial way: they don’t allow duplications;

I And duplications / insertions / deletions account for a very large part of what happens in evolution [Ohno, 1970];

(3)

Two examples of duplications

Example (tandem duplications)

(source: K. Aainsqatsi on Wikimedia)

Example (whole genome duplication)

(source: Eric Lyons on CoGePedia)

(5)

Today

I Models that take duplications into account;

I Other approaches to solving the corresponding problems;

I Other models for those cases where only partial information is available or relevant;

(6)

Strings

I Since duplications pervade genomes, we should take them into account;

I We now see genomes as strings on an alphabet Σ;

I Be careful: similar segments have been identified, so Σ = {segments} and not {A,C,G,T};

I Our goal is still to explain evolution using most parsimonious scenarios made of fixed transformations;

(7)

Strings

I Note: the restriction to sorting problems does not work anymore;

I if you have two A’s, which one should be “number one”?

I So we really are interested in transforming one string into another, which is not equivalent to sorting another string;

I Sorting problems have been considered in that model, but they’re just a special case of a more general problem;

(8)

Strings

I We can distinguish between several approaches based on gene contents;

I Either we have exactly the same contents in both genomes (and duplications are of course allowed);

I Or we have duplications but with different amounts of repetitions (e.g. three 1’s in genome A but only two in genome B);

I This time the breakpoint graph cannot save us anymore, since we would not know how to connect elements or decompose the graph;

(9)

Balanced strings

I The number of occurrences of a character c in a string S is denoted by occ(c,S);

Definition (balanced strings)

Two strings S and T on an alphabet Σ are balanced if:

∀ c ∈ Σ : occ(c,S) = occ(c,T).

I Basically, S and T are anagrams;

I Straightforward generalisation of permutations: we have duplications, but we actually still have the same content in both genomes;
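The definition of balanced strings can be checked directly by counting characters. The following is a minimal sketch; the function names `occ` and `are_balanced` are chosen here for illustration.

```python
from collections import Counter

def occ(c, s):
    """Number of occurrences of character c in string s."""
    return s.count(c)

def are_balanced(s, t):
    """True if every character occurs equally often in s and in t."""
    return Counter(s) == Counter(t)

print(are_balanced("dictionary", "indicatory"))  # anagrams
print(are_balanced("1123", "1233"))              # same alphabet, different counts
```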

(10)

Comparing balanced strings

I One way of relating genomes’ contents is to identify common segments;

I In other words, we want to partition genomes into the same set of segments;

I this is how we obtained (signed) permutations;

I but now we want to partition the resulting sequences;

(11)

Generalising breakpoints

I Recall that, for permutations:

I adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨n n−1 ··· 1⟩ for reversals);

I breakpoints are pairs that are not adjacencies;

I Recall that, for signed permutations:

I adjacencies are pairs of adjacent elements in π that are also adjacent in ι = ⟨1 2 ··· n⟩ (or χ = ⟨−n −(n−1) ··· −1⟩ for signed reversals);

I Those can be generalised to any pair of permutations;

I And we can do the same thing for strings;
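As a rough illustration of breakpoints between an arbitrary pair of permutations, the sketch below counts pairs that are adjacent in one permutation but not in the other. It makes two simplifying assumptions that are not taken from the slides: elements are unsigned, so a pair counts as an adjacency in either order, and the permutations are not framed by extra elements 0 and n+1.

```python
def breakpoints(pi, sigma):
    """Count pairs adjacent in pi that are not adjacent in sigma.

    Assumptions: unsigned elements (order within a pair is ignored)
    and no framing elements at the ends.
    """
    adjacent_in_sigma = {frozenset(pair) for pair in zip(sigma, sigma[1:])}
    return sum(frozenset(pair) not in adjacent_in_sigma
               for pair in zip(pi, pi[1:]))

print(breakpoints([3, 1, 2], [1, 2, 3]))
print(breakpoints([2, 4, 1, 3], [1, 2, 3, 4]))
```

Taking `sigma` to be the identity recovers the count for the sorting setting, under the same assumptions.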

(19)

Minimum common string partition

I A partition of a string S is a set of strings that can be concatenated to obtain S;

I A common partition of two strings S and T is a set of strings that can be concatenated to obtain both S and T;

Example (common string partitions)

Here’s a common partition of “dictionary” and “indicatory”:

dic | t  | i  | o  | n  | a  | ry
S1    S2   S3   S4   S5   S6   S7

i  | n  | dic | a  | t  | o  | ry
S3   S5   S1    S6   S2   S4   S7
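The example can be checked mechanically. The sketch below assumes the blocks are S1 = dic, S2 = t, S3 = i, S4 = o, S5 = n, S6 = a, S7 = ry, which is the only choice of seven blocks compatible with both block orders shown; the helper name `is_common_partition` is hypothetical.

```python
def is_common_partition(blocks, order_s, order_t, s, t):
    """Check that concatenating the blocks in the two given orders yields s and t."""
    uses_each_once = (sorted(order_s) == sorted(order_t)
                      == list(range(len(blocks))))
    return (uses_each_once
            and "".join(blocks[i] for i in order_s) == s
            and "".join(blocks[i] for i in order_t) == t)

blocks = ["dic", "t", "i", "o", "n", "a", "ry"]   # S1 .. S7 (0-based indices below)
order_s = [0, 1, 2, 3, 4, 5, 6]                   # S1 S2 S3 S4 S5 S6 S7
order_t = [2, 4, 0, 5, 1, 3, 6]                   # S3 S5 S1 S6 S2 S4 S7

print(is_common_partition(blocks, order_s, order_t, "dictionary", "indicatory"))
```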

(24)

Minimum common string partition

I A common string partition is minimum if there is no smaller common string partition for the two strings under consideration;

I This leads to the following decision problem:

Problem (minimum common string partition (mcsp))

Instance: balanced strings S and T, a bound k ∈ N;

Question: is there a common partition of S and T with at most k blocks?

(25)

Relation(s) to rearrangement problems

I Recall that breakpoints were pairs of elements adjacent in one genome but not in the other;

I Common string partitions generalise that point of view to an arbitrary number of elements in each part;

I So if we have a minimum common string partition for S and T, we get the number of breakpoints between strings S and T;

(26)

About mcsp

I Bad news about mcsp:

I NP-hard, even if only one gene family is nontrivial [Blin et al., 2004];

I APX-hard, even if no character appears more than twice [Goldstein et al., 2005];

I Good news about mcsp:

I fixed-parameter tractable: a solution of size k can be found in time f(k)·poly(n) (n = |S| = |T|) [Bulteau and Komusiewicz, 2014];

I Greedy approach [Goldstein and Lewenstein, 2011]: repeatedly select a longest common substring (LCS) that contains no marked letter, and mark its letters;

✓ simple and fast (runs in O(n) time);

✗ approximation ratio between Ω(n^0.43) and O(n^0.69) [Kaplan and Shafrir, 2006];
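The greedy idea can be sketched as follows. This is a naive illustration that rescans both strings at every step, so it is far slower than the linear-time implementation cited above; the function name `greedy_mcsp` and the output format are assumptions made for this example.

```python
def greedy_mcsp(s, t):
    """Naive greedy common partition of two balanced strings.

    Repeatedly picks a longest common substring made of unmarked letters,
    marks its letters in both strings, and records it as a block.
    Returns a list of (start in s, start in t, block) triples.
    """
    assert sorted(s) == sorted(t), "strings must be balanced"
    free_s = [True] * len(s)
    free_t = [True] * len(t)
    blocks = []
    remaining = len(s)
    while remaining:
        best = (0, 0, 0)  # (length, start in s, start in t)
        for i in range(len(s)):
            for j in range(len(t)):
                k = 0
                while (i + k < len(s) and j + k < len(t)
                       and free_s[i + k] and free_t[j + k]
                       and s[i + k] == t[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        for k in range(length):
            free_s[i + k] = free_t[j + k] = False
        blocks.append((i, j, s[i:i + length]))
        remaining -= length
    return blocks

for block in greedy_mcsp("dictionary", "indicatory"):
    print(block)
```

Because the unmarked letters of the two strings always form the same multiset, each round finds a block of length at least one, so the loop terminates.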

(34)

Minimum common string partition: variants

I One can also consider signed strings: each segment is then equivalent up to a reversal;

I Or equivalence under full reversals: a partition of S is also a partition of T if one can concatenate its elements to obtain T or its reverse;

I Those variants are still hard, but the positive results do not straightforwardly generalise [Bulteau and Komusiewicz, 2014];

(35)

Unbalanced strings

I Of course, we are not always so lucky that our genomes are just anagrams;

I Most of the time, duplications are not balanced;

I So, what do we do?

(36)

Arbitrary strings

I One idea is to try and match different copies of the same gene across two genomes;

I Three general approaches have been proposed:

1. the exemplar model;

2. the intermediate model;

3. the full model;

I All three are based on a notion of matching;

(37)

Matching and pruning

Definition (gene matching)

A gene matching between two strings S and T is a set of disjoint pairs {S_i, T_j} such that S_i = T_j for every such pair (1 ≤ i ≤ |S|, 1 ≤ j ≤ |T|).

Definition (pruning)

Given two strings S and T and a gene matching M, the M-pruning is the pair (S′, T′) obtained by removing all unmatched characters from S and T and relabelling the remaining characters according to M.

(examples to appear shortly)

(40)

Matching(s) and pruning(s)

I Matchings will depend on the model we use;

I Since prunings are derived from matchings, they will also vary depending on the underlying model;

I Let us review them on examples;

(41)

Exemplar matching / pruning

I In the exemplar model, we match only one copy of each gene:

Example (exemplar matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 3 4

T′ = 1 −3 −2 4
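A pruning can be computed mechanically once a matching is fixed. The sketch below represents a matching as a list of 0-based index pairs (i, j), ignores signs when checking that matched genes agree (as the example does when pairing 2 with −2), and relabels matched genes 1, 2, ... in their order of appearance in S while keeping each occurrence's own sign. These conventions and the name `prune` are assumptions made for illustration.

```python
def sign(x):
    return -1 if x < 0 else 1

def prune(s, t, matching):
    """Return the pruned pair for a matching given as (i, j) index pairs."""
    pairs = sorted(matching)  # order the pairs by position in s
    assert all(abs(s[i]) == abs(t[j]) for i, j in pairs)
    label_of_j = {j: k for k, (_, j) in enumerate(pairs, start=1)}
    s_pruned = [sign(s[i]) * k for k, (i, _) in enumerate(pairs, start=1)]
    t_pruned = [sign(t[j]) * label_of_j[j] for j in sorted(label_of_j)]
    return s_pruned, t_pruned

S = [1, 2, -4, -2, 3, 1, 4, -3, 4]
T = [4, 1, -3, -2, 2, 1, 2, 4]
exemplar = [(0, 1), (1, 3), (4, 2), (6, 7)]  # one matched copy per gene
print(prune(S, T, exemplar))
```

With this particular matching the new labels happen to coincide with the original gene names, so the output can be compared directly with the example.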

(42)

Intermediate matching/pruning

I In the intermediate model, we match at least one copy of each gene:

Example (intermediate matching/pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 3 1′ 4

T′ = 1 −3 −2 1′ 4

(43)

Full matching / pruning

I In the full model, we match as many copies of each gene as possible:

Example (full matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S′ = 1 2 −4′ −2′ 3 1′ 4

T′ = 4′ 1′ −3 2′ 1 2 4

(44)

Using matchings and prunings

I Once we’ve pruned our input strings, we can compare them as if they were permutations;

I This gives rise to many variations on the following theme:

Problem (“(M, d)-comparison”)

Input: two strings S and T

Goal: find an “M matching” such that the resulting “M pruning” (S′, T′) minimises d(S′, T′)

I Here M ∈ {exemplar, intermediate, full}, and d is any distance on S_n or S_n^± (with n = |S′| = |T′|);
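To make the problem statement concrete, here is an exhaustive sketch for one instance of the theme: M = exemplar and d = number of breakpoints. It assumes unsigned genes and unframed sequences, and it tries every exemplar choice, so it is only usable on tiny inputs. All names are hypothetical.

```python
from itertools import product

def breakpoints(a, b):
    """Pairs adjacent in a that are not adjacent in b (unsigned, unframed)."""
    adjacent_in_b = {frozenset(p) for p in zip(b, b[1:])}
    return sum(frozenset(p) not in adjacent_in_b for p in zip(a, a[1:]))

def positions(s):
    """Map each gene to the list of positions where it occurs."""
    pos = {}
    for i, gene in enumerate(s):
        pos.setdefault(gene, []).append(i)
    return pos

def exemplar_breakpoint_distance(s, t):
    """Minimum number of breakpoints over all exemplar prunings of s and t."""
    pos_s, pos_t = positions(s), positions(t)
    genes = sorted(set(pos_s) & set(pos_t))
    best = None
    for choice_s in product(*(pos_s[g] for g in genes)):
        s_pruned = [s[i] for i in sorted(choice_s)]
        for choice_t in product(*(pos_t[g] for g in genes)):
            t_pruned = [t[j] for j in sorted(choice_t)]
            d = breakpoints(s_pruned, t_pruned)
            if best is None or d < best:
                best = d
    return best

print(exemplar_breakpoint_distance([1, 3, 2, 3], [1, 2, 3]))
```

Note that the cost is only evaluated once a complete choice of exemplars is available.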

(45)

Strings

I This is not “just a matching problem”;

I in matching problems, every edge is given a weight, and we have to optimise a function that takes all weights into account;

I while here, we look for a matching that optimises a quantity, but the edge weights are not fixed to begin with;

I In other words: in matching problems we can compute the cost of a partial solution, while here we must have a full matching before we can even begin to compute the cost;

(46)

Strings: extensions

I Strings can of course be signed to take directionality into account;

I They can also be circular;

I And of course we could have a mix of both to represent different chromosomes;

(47)

Posets Set systems

Other models

I We’ve mostly seen (signed) permutations and strings so far;

I Other models may be more suitable, according to:

I the data we have;

I the relations we want to take into account;

I We mention briefly the following structures:

I posets;

I set systems;

(48)

The need for other models

I Most genomes consist of several chromosomes:

(49)

Posets

I Recall that genomes are not directly copied from a long string of DNA to a drive;

1. “small” subsequences called reads are identified;

2. then those reads are assembled to form the target genome;

I We still want to be able to compare genomes even if only partial gene order information is available;

I This naturally leads us to compare posets instead of permutations or strings;

(50)

Posets

I Informally, although we may not know the complete ordering, we may know parts of it;

I So segments are partially ordered, and genomes may be represented by directed acyclic graphs, where:

I vertices stand for segments;

I arc (u,v) means “segment u precedes segment v”;

I In this regard, permutations are paths of maximal length;

Example (a genome as a poset)

(figure: a directed acyclic graph on the segments 1, −2, 3, −5, 6, 10, 9, 12)

(51)

Comparing genomes as posets

I Comparing genomes G₁ and G₂ represented as posets is based on permutations:

I find linear extensions L1 and L2 that minimise d(L1, L2);

I Another way of trying to aggregate their contents is by:

I merging them into a conflict-free graph;

I finding a linear extension of that graph;
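The merge-then-extend idea can be sketched with a topological sort from the standard library (`graphlib`, Python 3.9+). The arc lists below are made-up examples, not the posets from the slides, and treating "conflict" as "the merged graph has a directed cycle" is an assumption made for this sketch.

```python
from graphlib import CycleError, TopologicalSorter

def linear_extension(*arc_lists):
    """Merge arc lists, where (u, v) means 'u precedes v', and return one
    linear extension of the merged graph, or None if it contains a cycle."""
    predecessors = {}
    for arcs in arc_lists:
        for u, v in arcs:
            predecessors.setdefault(u, set())
            predecessors.setdefault(v, set()).add(u)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None

g1 = [(1, -2), (-2, 3), (3, 6)]      # hypothetical partial order for genome 1
g2 = [(1, -2), (-2, -4), (-4, 6)]    # hypothetical partial order for genome 2
print(linear_extension(g1, g2))
print(linear_extension([(1, 2)], [(2, 1)]))  # conflicting orders
```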

(52)

Finding an “agreement” for posets

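(figure: three directed acyclic graphs. G1 on the segments 1, −2, 3, −5, 6, 10, 8, 12; G2 on the segments 1, −2, −4, −5, 7, 9, 11, 12; their union G1∪G2 on the segments 1, −2, −4, −5, 3, 6, 10, 8, 12, 7, 9, 11)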

(53)

Set systems and the syntenic distance

I Recall that chromosomes are ordered sets of genes;

I Sometimes we’re not interested in order, but in the fact that two segments belong to the same chromosome;

I So we view a genome as a family of (unordered) sets of genes;

(54)

Set systems and the syntenic distance

I Three operations are taken into account in that setting:

{{a,b,c},{p,q,r},{x,y}} → {{a,b},{c},{p,q,r},{x,y}} (fission)

{{a,b,c},{p,q,r},{x,y}} → {{a,b,c,x,y},{p,q,r}} (fusion)

{{a,b,c},{p,q,r},{x,y}} → {{a,p},{b,c,q,r},{x,y}} (translocation)

I The syntenic distance between two genomes is then the minimum number of such operations that are needed to transform one genome into the other;
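The three operations are easy to express on sets of sets. The sketch below models a genome as a frozenset of frozensets and replays the three moves of the example; the function signatures, in particular describing a translocation by the two parts that get exchanged, are choices made here for illustration.

```python
def make_genome(*chromosomes):
    """Build a genome (a set of chromosomes) from iterables of genes."""
    return frozenset(frozenset(c) for c in chromosomes)

def fission(genome, chrom, part):
    """Split chrom into part and chrom - part (both nonempty)."""
    chrom, part = frozenset(chrom), frozenset(part)
    assert chrom in genome and part and part < chrom
    return (genome - {chrom}) | {part, chrom - part}

def fusion(genome, c1, c2):
    """Merge two distinct chromosomes into one."""
    c1, c2 = frozenset(c1), frozenset(c2)
    assert c1 in genome and c2 in genome and c1 != c2
    return (genome - {c1, c2}) | {c1 | c2}

def translocation(genome, c1, c2, move1, move2):
    """Move the genes move1 from c1 into c2, and move2 from c2 into c1."""
    c1, c2 = frozenset(c1), frozenset(c2)
    move1, move2 = frozenset(move1), frozenset(move2)
    assert c1 in genome and c2 in genome and move1 <= c1 and move2 <= c2
    return (genome - {c1, c2}) | {(c1 - move1) | move2, (c2 - move2) | move1}

g = make_genome("abc", "pqr", "xy")
print(fission(g, "abc", "ab") == make_genome("ab", "c", "pqr", "xy"))
print(fusion(g, "abc", "xy") == make_genome("abcxy", "pqr"))
print(translocation(g, "abc", "pqr", "bc", "p") == make_genome("ap", "bcqr", "xy"))
```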

(55)

Set systems and the syntenic distance

I There is a compact representation that allows us to assume that:

1. our input is {S1, S2, ..., Sk} (subsets of {1, 2, ..., n});

2. our target is {{1},{2}, ..., {n}};

I So we want to obtain that genome using as few fissions, fusions and translocations as possible;

I Syntenic genes are simply genes that belong to the same chromosome;

(56)

Synteny graph

I A graph-theoretic approach for attacking the problem was proposed:

Definition ([DasGupta et al., 1998])

The synteny graph of an instance S(n,k) is defined by:

I V = {S_1, S_2, ..., S_k};

I E = {{S_i, S_j} | S_i ∩ S_j ≠ ∅, 1 ≤ i ≠ j ≤ k};

I The synteny graph of our target {{1},{2}, ..., {n}} has n components;

(57)

Mutations and the synteny graph

I Translocations, fusions and fissions affect the graph in different ways;

I translocations (may) disconnect adjacent vertices;

I fissions split vertices into two nonadjacent vertices;

I fusions: opposite of fissions;

I Our goal is to obtain n components;

I It can be proved that the distance is at least n−p (where p is the number of components in our instance’s graph);
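The lower bound n−p only requires counting the connected components of the synteny graph. The sketch below does this with a depth-first search over the input sets; the instance at the bottom is a made-up example, and the function names are hypothetical.

```python
def synteny_components(sets):
    """Number of connected components of the synteny graph: one vertex per
    set, and an edge between two sets whenever they intersect."""
    sets = [set(s) for s in sets]
    seen = [False] * len(sets)
    components = 0
    for start in range(len(sets)):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(sets)):
                if not seen[j] and sets[i] & sets[j]:
                    seen[j] = True
                    stack.append(j)
    return components

def syntenic_lower_bound(sets, n):
    """The bound n - p, with p the number of components."""
    return n - synteny_components(sets)

instance = [{1, 2}, {2, 3}, {4, 5}, {5}]   # hypothetical instance with n = 5
print(synteny_components(instance), syntenic_lower_bound(instance, 5))
```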

(58)

About the syntenic distance

I The synteny graph dictates that we want to increase the number of connected components;

I In that regard, restricting oneself to “intra-component moves” seems optimal;

I But any approach that does this is a 2-approximation [Liben-Nowell, 2001];

I No better approximation is known;

I And computing the distance or an optimal scenario is NP-hard [DasGupta et al., 1998];

(63)

SAT solvers Linear programming

Today’s models: wrap-up

I As soon as we have duplications, most problems become hard (to solve exactly, or even to approximate within a reasonable factor)

I As soon as we forget about order (partially or completely), we also end up with difficult problems;

I Yet the problems still have to be solved;

(64)

Alternative approach: sat solvers

I sat solvers are highly-optimised programs for solving the well-known NP-complete satisfiability problem [Cook, 1971]:

Problem (satisfiability (sat))

Input: a Boolean formula φ in conjunctive normal form.

Question: is there a satisfying assignment for φ?

I Idea: take advantage of these solvers;
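To fix ideas about the input and output of the problem, here is a brute-force satisfiability check. It is only meant to illustrate the problem statement: real sat solvers rely on far more sophisticated techniques. Encoding clauses as lists of nonzero integers, where k stands for variable k and −k for its negation, follows the common DIMACS convention and is an assumption of this sketch.

```python
from itertools import product

def brute_force_sat(clauses, num_vars):
    """Return a satisfying assignment {variable: bool}, or None if none exists.

    Tries all 2^num_vars assignments, so only usable on tiny formulas.
    """
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

print(brute_force_sat([[1, -2], [2]], 2))   # (x1 or not x2) and x2
print(brute_force_sat([[1], [-1]], 1))      # x1 and not x1
```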

(65)

Alternative approach: sat solvers

I The workflow is as follows:

(figure: workflow PROBLEM INSTANCE → BOOLEAN FORMULA → SAT SOLVER → SATISFYING ASSIGNMENT → SOLUTION, with a step labelled “translation”)

(66)

Alternative approach: linear and pseudo-boolean programming

I Linear programs are of the form:

maximise c^T x subject to Ax ≤ b and x ≥ 0

I Pseudo-boolean programs: same form, but the function to optimise maps {0,1}ⁿ to R (versus {0,1} for boolean functions);

I Specialised solvers also exist for those and were used to solve rearrangement problems on strings [Angibaud et al., 2007] and posets [Angibaud et al., 2009];
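As a toy illustration of the pseudo-boolean setting, the sketch below maximises a linear objective over {0,1}ⁿ under linear constraints by exhaustive search. Restricting to a linear objective and enumerating all 0/1 vectors are simplifications made for this example; dedicated solvers work very differently. The instance is made up.

```python
from itertools import product

def solve_01_program(c, A, b):
    """Maximise c.x subject to A x <= b over x in {0,1}^n by exhaustive search.

    Returns (best value, best x), or (None, None) if no x is feasible.
    """
    best_value, best_x = None, None
    for x in product((0, 1), repeat=len(c)):
        feasible = all(sum(a * xi for a, xi in zip(row, x)) <= bound
                       for row, bound in zip(A, b))
        value = sum(ci * xi for ci, xi in zip(c, x))
        if feasible and (best_value is None or value > best_value):
            best_value, best_x = value, x
    return best_value, best_x

# Pick at most two of three items so as to maximise the total value.
print(solve_01_program(c=[3, 2, 4], A=[[1, 1, 1]], b=[2]))
```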

(67)

Comparative genomics wrap-up

I Here we talked mostly about computing “edit distances” between genomes;

I Other measures of similarity exist that are not associated to mutations;

I Many hard problems;

I Much remains to be done in order to satisfy biologists;

I realistic models;

I software;

I ...

(68)

From comparisons to phylogenies Bounds

Selected results

Beyond pairwise comparisons

I The genome rearrangement problems we’ve seen were formulated in a pairwise fashion;

I But actually, more than two genomes can be taken into account;

I Unsurprisingly, most problems become hard in that setting;

(69)

Why more than two genomes?

I A sequence does not yield enough information for ancestral genome reconstruction:

(figure: genomes G1 and G2)

I Taking an additional genome into account restricts our choices:

(figure: genomes G1, G2 and G3)

I What’s more, it’s ultimately one of our goals;

(72)

Median problems

I Measures of similarities between genomes are useful in reconstructing phylogenies;

Example (phylogeny from distance matrix)

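    a  b  c  d  e
a   0  2  3  6  6
b   2  0  3  6  6
c   3  3  0  5  5
d   6  6  5  0  4
e   6  6  5  4  0

(figure: the corresponding tree with leaves a, b, c, d, e. Leaves a and b are each joined by an edge of weight 1 to a common internal node; that node and leaf c are each joined by an edge of weight 1 to a second internal node; the second internal node is joined by an edge of weight 2 to a third internal node, from which d and e each hang by an edge of weight 2)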

I (The matrix must satisfy some conditions [Buneman, 1971]);
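One classical condition of this kind is the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. That this is the condition alluded to above is an assumption of this sketch, which checks it on the example matrix; the function name `is_additive` is hypothetical.

```python
from itertools import combinations

def is_additive(d, taxa):
    """Four-point condition: for every quadruple of taxa, the two largest of
    the three pairwise sums of distances are equal."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]])
        if sums[1] != sums[2]:
            return False
    return True

taxa = "abcde"
rows = [[0, 2, 3, 6, 6],
        [2, 0, 3, 6, 6],
        [3, 3, 0, 5, 5],
        [6, 6, 5, 0, 4],
        [6, 6, 5, 4, 0]]
d = {x: dict(zip(taxa, row)) for x, row in zip(taxa, rows)}
print(is_additive(d, taxa))
```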

(73)

Median problems

I Parsimony again: search for a tree that minimises the total number of evolutionary events (i.e. the sum of all edge weights);

I In its simplest form, the problem we want to solve is:

Problem (median of three)

Given: π, σ, τ in S_n^±; a distance d : S_n^± × S_n^± → N.

Find: a permutation µ in S_n^± that minimises

w(µ) = d(π, µ) + d(σ, µ) + d(τ, µ).

I Can be generalised to more than three input permutations;

(74)

Generic bounds [Siepel and Moret, 2001]

I Generic lower and upper bounds for any distance:

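(figure: a triangle on π, σ, τ with sides d(π, σ), d(π, τ), d(σ, τ), and a median µ joined to the three corners by d(π, µ), d(µ, σ), d(µ, τ))

I Upper bound, obtained by taking one of the three inputs as µ:

\[
w(\mu) \le \min\{\,
\overbrace{d(\pi,\sigma)+d(\pi,\tau)}^{\text{if } \mu=\pi},\;
\overbrace{d(\pi,\sigma)+d(\sigma,\tau)}^{\text{if } \mu=\sigma},\;
\overbrace{d(\pi,\tau)+d(\sigma,\tau)}^{\text{if } \mu=\tau}
\,\}
\]

I Lower bound, obtained by summing the three triangle inequalities through µ:

\[
w(\mu) \ge \frac{d(\pi,\sigma)+d(\pi,\tau)+d(\sigma,\tau)}{2}
\]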
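The generic bounds can be checked experimentally on small instances. The sketch below finds a median of three by exhaustive search and compares its weight with the upper bound (the best of the three inputs used as µ) and the lower bound (half the sum of the pairwise distances). It uses unsigned permutations and the Kendall tau distance purely for convenience; both are substitutions made for this example rather than the signed setting of the problem statement.

```python
from itertools import combinations, permutations

def kendall_tau(p, q):
    """Number of element pairs whose relative order differs between p and q."""
    pos = {x: i for i, x in enumerate(q)}
    return sum(pos[a] > pos[b] for a, b in combinations(p, 2))

def median_of_three(pi, sigma, tau, d=kendall_tau):
    """Exhaustively search all permutations for one minimising w."""
    def w(mu):
        return d(pi, mu) + d(sigma, mu) + d(tau, mu)
    best = min(permutations(sorted(pi)), key=w)
    return best, w(best)

pi, sigma, tau = (1, 2, 3, 4), (2, 1, 4, 3), (4, 3, 2, 1)
mu, weight = median_of_three(pi, sigma, tau)
pairwise = [kendall_tau(pi, sigma), kendall_tau(pi, tau), kendall_tau(sigma, tau)]
lower = sum(pairwise) / 2
upper = sum(pairwise) - max(pairwise)  # smallest of the three two-term sums
print(mu, weight, lower <= weight <= upper)
```

The search visits all n! permutations, so it is only meant for very small n.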