Fixed-Parameter Tractable Sampling for RNA Design with Multiple Target Structures


HAL Id: hal-01631277

https://hal.inria.fr/hal-01631277v2

Submitted on 31 Mar 2018


Stefan Hammer, Yann Ponty, Wei Wang, Sebastian Will

To cite this version:

Stefan Hammer, Yann Ponty, Wei Wang, Sebastian Will. Fixed-Parameter Tractable Sampling for RNA Design with Multiple Target Structures. RECOMB - 22nd Annual International Conference on Research in Computational Molecular Biology - 2018, Apr 2018, Paris, France. ⟨hal-01631277v2⟩


Stefan Hammer^{1,2,3}, Yann Ponty^{4,5,*}, Wei Wang^{4} and Sebastian Will^{2}

1 University Leipzig, Department of Computer Science and Interdisciplinary Center for Bioinformatics, 04107 Leipzig, Germany; 2 University of Vienna, Faculty of Chemistry, Department of Theoretical Chemistry, 1090 Vienna, Austria; 3 University of Vienna, Faculty of Computer Science, Research Group Bioinformatics and Computational Biology, 1090 Vienna, Austria; 4 CNRS UMR 7161 LIX, Ecole Polytechnique, Bat. Turing, 91120 Palaiseau, France; and 5 AMIBio team, Inria Saclay, Bat Alan Turing, 91120 Palaiseau, France

Abstract. Motivation: The design of multi-stable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches profit strongly from effective in-silico methods, which can tremendously impact their cost and feasibility.

Results: We revisit a central ingredient of most in-silico design methods: the sampling of sequences for the design of multi-target structures, possibly including pseudoknots. For this task, we present the efficient, tree decomposition-based algorithm RNARedPrint. Our fixed-parameter tractable approach is underpinned by establishing the #P-hardness of uniform sampling. Modeling the problem as a constraint network, RNARedPrint supports generic Boltzmann-weighted sampling for arbitrary additive RNA energy models; this enables the generation of RNA sequences meeting specific goals like expected free energies or GC-content. Finally, we empirically study general properties of the approach and generate biologically relevant multi-target Boltzmann-weighted designs for a common design benchmark. Generating seed sequences with RNARedPrint, we demonstrate significant improvements over the previously best multi-target sampling strategy (uniform sampling).

Availability: Our software is freely available at: https://github.com/yannponty/RNARedPrint

Contact: [email protected]

1 Introduction

Synthetic biology strives to engineer artificial biological systems, promising broad applications in biology, biotechnology and medicine. Centrally, this requires the design of biological macromolecules with highly specific properties and programmable functions. In particular, RNAs present themselves as well-suited tools for rational design targeting specific functions (Kushwaha et al., 2016). RNA function is tightly coupled to the formation of secondary structure, as well as changes in base pairing propensities and the accessibility of regions, e.g. by burying or exposing interaction sites (Rodrigo and Jaramillo, 2014). At the same time, the thermodynamics of RNA secondary structure is well understood and its prediction is computationally tractable (McCaskill, 1990). Thus, structure can serve as an effective proxy within rational design approaches, ultimately targeting catalytic (Zhang et al., 2013) or regulatory (Rodrigo and Jaramillo, 2014) functions.

The function of many RNAs depends on their selective folding into one or several alternative conformations. Classic examples include riboswitches, which are known to adopt different stable structures upon binding a specific ligand. Riboswitches have been a popular application of rational design (Wachsmuth et al., 2013; Domin et al., 2017), partly motivated by their capacity to act as biosensors (Findeiß et al., 2017). At the co-transcriptional level, certain RNA families feature alternative, evolutionarily conserved, transient structures (Zhu et al., 2013), which facilitate the ultimate adoption of functional structures at full elongation. More generally, simultaneous compatibility with multiple structures is a relevant design objective for engineering kinetically controlled RNAs, ultimately targeting prescribed folding pathways. Thus, modern applications of RNA design often target multiple structures, additionally aiming at other features, such as a specific GC-content (Reinharz et al., 2013) or the presence/absence of functionally relevant motifs (either anywhere or at specific positions) (Zhou et al., 2013); these objectives motivate flexible computational design methods.


Many computational methods for RNA design rely on similar overall strategies: initially generating one or several seed sequences and subsequently optimizing them. In many cases, the seed quality was found to be critical for the empirical performance of RNA design methods (Levin et al., 2012). For instance, random seed generation improves the prospect of subsequent optimizations, helping to overcome local optima of the objective function, and increases the diversity across designs (Reinharz et al., 2013). For single-target approaches, INFO-RNA (Busch and Backofen, 2006) made significant improvements mainly by starting its local search from the minimum energy sequence for the target structure instead of the (uniform) random sequences used by the early RNAinverse algorithm (Hofacker et al., 1994). This strategy was later shown to result in unrealistically high GC-contents in designed sequences. To address this issue, IncaRNAtion (Reinharz et al., 2013) controls the GC-content through an adaptive sampling strategy.

Specifically, for multi-target design, virtually all available methods (Lyngsø et al., 2012; Höner zu Siederdissen et al., 2013; Taneda, 2015; Hammer et al., 2017) follow the same overall generation/optimization scheme. Facing the complex sequence constraints induced by multiple targets, early methods such as Frnakenstein (Lyngsø et al., 2012) and Modena (Taneda, 2015) did not attempt to solve sequence generation systematically, but rely on ad-hoc sampling strategies. Recently, the RNAdesign approach (Höner zu Siederdissen et al., 2013), coupled with powerful local search within RNAblueprint (Hammer et al., 2017), solved the problem of sampling seeds from the uniform distribution for multiple targets. These methods adopt a graph coloring perspective, initially decomposing the graph hierarchically using various decomposition algorithms, and precomputing the number of valid sequences within each subgraph. The decomposition is then reinterpreted as a decision tree to perform a stochastic backtrack, inspired by Ding and Lawrence (2003). Uniform sampling is achieved by choosing individual nucleotide assignments with probabilities derived from the subsolution counts. The overall complexity of RNAdesign grows like $\Theta(4^{\gamma})$, where the parameter $\gamma$ is bounded by the length of the designed RNA; typically, the decomposition strategy achieves much lower $\gamma$.

The exponential time and space requirements of the RNAdesign method already raise the question of the complexity of (uniform) sampling for multi-target design. Since stochastic backtrack can be performed in linear time per sample, the method is dominated by the precomputation step, which requires counting valid designs. Thus, we focus on the question: Is there a polynomial-time algorithm to count valid multi-target designs? In Section 5, we answer in the negative, showing that there exists no such algorithm unless P = NP. Our result relies on a surprising bijection (up to a trivial symmetry) between valid sequences and independent sets of a bipartite graph, being the object of recent breakthroughs in approximate counting complexity (Bulatov et al., 2013; Cai et al., 2016). The hardness of counting (and conjectured hardness of sampling) does not preclude, however, practically applicable algorithms for counting and sampling. In particular, we wish to extend the flexibility of multi-structure design, leading to the following questions: How to sample, generally, from a Boltzmann distribution based on expressive energy models? How to enforce additional constraints, such as the GC-content, complex sequence constraints, or the energy of individual structures?

To answer these questions, we introduce a generic framework (illustrated in Fig. 1) enabling efficient Boltzmann-weighted sampling over RNA sequences with multiple target structures (Section 3). Guided by a tree decomposition of the network, we devise dynamic programming to compute partition functions and sample sequences from the Boltzmann distribution. We show that these algorithms are fixed-parameter tractable for the treewidth; in practice, we keep this parameter small by using state-of-the-art tree decomposition algorithms. By evaluating (partial) sequences in a weighted constraint network, we support arbitrary multi-ary constraints and thus arbitrarily complex energy models, notably subsuming all commonly used RNA energy models. Moreover, we describe an adaptive sampling strategy to control the free energies of the individual target structures and the GC-content. We observe that sampling based on less complex RNA energy models (taking only the most important energy contributions into account) still allows targeting realistic RNA energies in the well-accepted Turner RNA energy model. The resulting combination of efficiency and high accuracy finally enables generating biologically relevant multi-target designs in the final application of our overall strategy to a large set of multi-target RNA design instances from a representative benchmark (Section 4).


[Figure 1: panels i) Input Structures (R1, R2, R3); ii) Merged Base-Pairs; iii) Compatibility Graph; iv) Tree Decomposition; v) Weight Optimization (Adaptive Sampling), showing the partition function, stochastic backtrack and weights update steps of RNARedPrint, with energies E1, E2, E3 (kcal.mol−1) and GC%; vi) Final Designs.]

Fig. 1. General outline of RNARedPrint for base pair-based energy models. From a set of target secondary structures (i), base-pairs are merged (ii) into a (base pair) dependency graph (iii) and transformed into a tree decomposition (iv). The tree is then used to compute the partition function, followed by a Boltzmann sampling of valid sequences (v). An adaptive scheme learns weights to achieve targeted energies and GC-content, leading to the production of suitable designs (vi).

2 Definitions and problem statement

An RNA sequence $S$ is a word over the nucleotide alphabet $\Sigma = \{A, C, G, U\}$; let $\mathcal{S}_n$ denote the set of sequences of length $n$. An RNA (secondary) structure $R$ of length $n$ is a set of base pairs $(i, j)$ with $1 \le i < j \le n$, such that $\{i, j\} \cap \{i', j'\} = \emptyset$ for all distinct $(i, j), (i', j') \in R$ (“degree ≤ 1”). A valid base pair must pair bases from $\mathcal{B} := \{\{A, U\}, \{G, C\}, \{G, U\}\}$. Consequently, $S$ is valid for $R$ iff $\{S_i, S_j\} \in \mathcal{B}$ for all $(i, j) \in R$.

We consider a fixed set of target RNA structures $\mathcal{R} := \{R_1, \dots, R_k\}$ for sequences of length $n$. $\mathcal{R}$ induces a base pair dependency graph $G_{\mathcal{R}}$ with nodes $\{1, \dots, n\}$ and edges $\bigcup_{\ell \in [1,k]} R_\ell$, which describe the minimal dependencies present in all relevant settings due to the requirement of canonical base pairing.

One can interpret the valid sequences for $\mathcal{R}$ as colorings of $G_{\mathcal{R}}$ in a slightly modified graph coloring variant, where the colors (from $\Sigma$) assigned to adjacent vertices of $G_{\mathcal{R}}$ must constitute valid base pairs. We define the energy of a sequence $S$ based on the set of structures $\mathcal{R}$ as $E_{\mathcal{R}}(S) \in \mathbb{R} \cup \{\infty\}$. In our setting, the energy $E_{\mathcal{R}}(S)$ is additively composed of the energies of the single RNA structures in an RNA energy model, as well as sequence-dependent features like GC-content. Note furthermore that $E_{\mathcal{R}}(S)$ is finite iff $S$ is valid for each structure $R_1, \dots, R_k$.
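To make these definitions concrete, the following minimal Python sketch builds the edge set of the base pair dependency graph from a list of targets and tests whether a sequence is valid for all of them. The dot-bracket input format and the helper names (`parse_dot_bracket`, `dependency_edges`, `is_valid`) are assumptions made for illustration; plain dot-bracket strings cannot express pseudoknots, which the framework itself allows.

```python
# Illustrative sketch only; helper names and the dot-bracket input are assumptions.
VALID_PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), 1-based with i < j."""
    stack, pairs = [], set()
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def dependency_edges(structures):
    """Edges of the dependency graph: the union of all target base pairs."""
    edges = set()
    for structure in structures:
        edges |= parse_dot_bracket(structure)
    return edges


def is_valid(sequence, structures):
    """True iff every base pair of every target joins a canonical pair."""
    return all(
        frozenset((sequence[i - 1], sequence[j - 1])) in VALID_PAIRS
        for i, j in dependency_edges(structures)
    )


targets = ["((..))", "..(())"]
print(sorted(dependency_edges(targets)))
print(is_valid("GGAGCU", targets), is_valid("AAAAAA", targets))
```

Note that position 6 pairs with position 1 in the first target and with position 3 in the second, so its nucleotide must be compatible with both partners at once.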

At the core of this work, we study the computation of partition functions over sequences.

Central problem (Partition function over sequences). Given an energy function $E$ and a set $\mathcal{R}$ of structures of length $n$, compute the partition function

$$Z_{E_{\mathcal{R}}} = \sum_{S \in \mathcal{S}_n} \exp(-\beta E_{\mathcal{R}}(S)), \qquad (1)$$

where $\beta$ denotes the inverse pseudo-temperature, and $\mathcal{S}_n$ the set of sequences of length $n$.
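As a baseline for intuition, Eq. (1) can be evaluated by exhaustive enumeration on tiny instances. The sketch below assumes a toy energy (a constant `pair_energy` per valid base pair, infinite energy for an invalid pair); this toy model and the function names are illustrative assumptions, and the enumeration takes time exponential in the sequence length.

```python
import itertools
import math

VALID_PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def energy(sequence, structures, pair_energy=-1.0):
    """Toy additive energy: pair_energy per base pair, infinite if a pair is invalid."""
    total = 0.0
    for pairs in structures:
        for i, j in pairs:
            if frozenset((sequence[i - 1], sequence[j - 1])) not in VALID_PAIRS:
                return math.inf
            total += pair_energy
    return total


def partition_function(n, structures, beta=1.0):
    """Sum exp(-beta * E(S)) over all 4^n sequences of length n."""
    z = 0.0
    for letters in itertools.product("ACGU", repeat=n):
        e = energy(letters, structures)
        if e != math.inf:  # invalid sequences contribute exp(-inf) = 0
            z += math.exp(-beta * e)
    return z


# One structure with the single base pair (1, 2) on sequences of length 2.
print(partition_function(2, [{(1, 2)}], beta=0.0))  # beta = 0 counts valid sequences
```

With `beta=0.0` every valid sequence has weight 1, so the partition function reduces to the number of valid designs, which is the counting problem discussed in the introduction.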

As we elaborate in subsequent sections, our approach relies on breaking down the energy function $E_{\mathcal{R}}(S)$ into additive components, each depending on only a few sequence positions. Given $\mathcal{R}$, we express $E_{\mathcal{R}}(S)$ as the sum of energy contributions $f(S)$ over a set $\mathcal{F}$ of functions $f : \mathcal{S}_n \to \mathbb{R}$, s.t. $E_{\mathcal{R}}(S) = \sum_{f \in \mathcal{F}} f(S)$. This captures realistic RNA energy models, including nearest neighbor models, e.g. Turner and Mathews (2009), and even pseudoknot models, e.g. Andronescu et al. (2010), while bounding the number of sequence positions on which each function depends. We define the dependencies $\mathrm{dep}(f)$ of $f$ as the minimum set of sequence positions $I \subseteq \{1, \dots, n\}$ such that $f(S) = f(S')$ for all sequences $S$ and $S'$ that agree at the positions in $I$.

Each set $\mathcal{F}$ of functions (on sequences of length $n$) induces a dependency graph on sequence positions, namely the hypergraph $G_{\mathcal{F}} = (\{1, \dots, n\}, \{\mathrm{dep}(f) \mid f \in \mathcal{F}\})$. Our algorithms will critically rely on a tree decomposition of the dependency graph, which we define below.

Definition 1 (Tree decomposition and width). Let $G = (X, E)$ be a hypergraph. A tree decomposition of $G$ is a pair $(T, \chi)$, where $T$ is an unrooted tree/forest and, for each $v \in T$, $\chi(v) \subseteq X$ is a set of vertices assigned to the tree node $v$, such that

1. each $x \in X$ occurs in at least one $\chi(v)$;
2. for all $x \in X$, $\{v \mid x \in \chi(v)\}$ induces a connected subtree of $T$;
3. for all $e \in E$, there is a node $v \in T$ such that $e \subseteq \chi(v)$.

The width of a tree decomposition $(T, \chi)$ is defined as $w(T, \chi) = \max_{u \in T} |\chi(u)| - 1$. The treewidth of $G$ is the smallest width of any tree decomposition of $G$.
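The three conditions of Definition 1 and the width can be checked mechanically. The following Python sketch is an illustration under simple assumptions: bags are given as a dictionary from node names to sets, the tree as a list of undirected edges that is trusted to be acyclic, and the names `is_tree_decomposition` and `width` are hypothetical.

```python
def is_tree_decomposition(vertices, hyperedges, bags, tree_edges):
    """Check conditions 1-3 of a tree decomposition (tree_edges assumed acyclic)."""
    # Condition 1: every vertex occurs in at least one bag.
    if any(all(x not in bag for bag in bags.values()) for x in vertices):
        return False
    # Condition 3: every hyperedge is contained in some bag.
    if any(all(not set(e) <= bag for bag in bags.values()) for e in hyperedges):
        return False
    # Condition 2: the nodes whose bags contain x are connected in the tree.
    for x in vertices:
        nodes = {v for v, bag in bags.items() if x in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for a, b in tree_edges:
                for p, q in ((a, b), (b, a)):
                    if p == v and q in nodes and q not in seen:
                        seen.add(q)
                        stack.append(q)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# Path graph 1-2-3-4, as induced by the base pairs (1,2), (2,3) and (3,4).
vertices = {1, 2, 3, 4}
edges = [{1, 2}, {2, 3}, {3, 4}]
bags = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}
print(is_tree_decomposition(vertices, edges, bags, [("a", "b"), ("b", "c")]), width(bags))
```

Swapping the bags of `b` and `c` in this example breaks condition 2, because the two bags containing position 2 are then no longer adjacent in the tree.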

3 An FPT algorithm for the partition function and sampling of Boltzmann-weighted designs

For our algorithmic description, we translate the concepts of Section 2 to the formalism of constraint networks, here specialized as RNA design networks. This allows us to base our algorithm on the cluster tree elimination (CTE) of Dechter (2013). In the RNA design network, (partially determined) RNA sequences replace the more general concept of (partial) assignments in constraint networks. Partially determined RNA sequences, for short partial sequences, are words $\bar{S}$ over the alphabet $\Sigma \cup \{?\}$, equivalently representing the set $\mathcal{S}(\bar{S})$ of RNA sequences, where for positions $1 \le i \le n$, $\bar{S}_i \in \Sigma$ implies $S_i = \bar{S}_i$ for all $S \in \mathcal{S}(\bar{S})$. The positions $1 \le i \le n$ where $\bar{S}_i \in \Sigma$ are called determined by $\bar{S}$ and form its domain. Since the functions $f \in \mathcal{F}$ of Section 2 depend on only the subset $\mathrm{dep}(f)$ of sequence positions, one can evaluate them for partial sequences $\bar{S}$ that determine (at least) the nucleotides at all positions in $\mathrm{dep}(f)$. Thus, for functions $f$ and partial sequences $\bar{S}$ that determine $\mathrm{dep}(f)$, we write $f[\![\bar{S}]\!]$ to evaluate $f$ for $\bar{S}$; i.e. $f[\![\bar{S}]\!] := f(S)$ for any sequence $S \in \mathcal{S}(\bar{S})$.

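Definition 2. An RNA design network (for sequences of length $n$) is a tuple $\mathcal{N} = (\mathcal{X}, \mathcal{F})$, where

– $\mathcal{X}$ is the set of sequence positions $1, \dots, n$;
– $\mathcal{F}$ is a set of functions $f : \mathcal{S}_n \to \mathbb{R}$.

The energy $e_{\mathcal{N}}(S)$ of a sequence $S$ in a network $\mathcal{N}$ is defined as the sum of the values of all functions $f \in \mathcal{F}$ evaluated for $S$, i.e. $e_{\mathcal{N}}(S) := \sum_{f \in \mathcal{F}} f[\![S]\!]$.

The network energy $e_{\mathcal{N}}(S)$ corresponds to the energy in Eq. (1), where this energy is modeled as the sum of the functions in $\mathcal{F}$. Consequently, $Z_{E_{\mathcal{R}}}$ of Eq. (1) is modeled as the network partition function $Z_{\mathcal{N}} := \sum_{S} \exp(-\beta\, e_{\mathcal{N}}(S)) = \sum_{S} \prod_{f \in \mathcal{F}} \exp(-\beta \cdot f[\![S]\!])$.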

3.1 Partition function and Boltzmann sampling through stochastic backtrack

The minimum energy, counting, and partition function over an RNA design network can be computed by dynamic programming based on a tree decomposition of the network’s dependency graph (i.e. cluster tree elimination). We focus on the efficient computation of the partition function.

We require additional definitions: A cluster tree for the network $\mathcal{N} = (\mathcal{X}, \mathcal{F})$ is a tuple $(T, \chi, \phi)$, where $(T, \chi)$ is a tree decomposition of $G_{\mathcal{F}}$, and $\phi(v)$ represents a set of functions $f$, each uniquely assigned to a node $v \in T$, such that $\mathrm{dep}(f) \subseteq \chi(v)$ for all $f \in \phi(v)$ and $\phi(v) \cap \phi(v') = \emptyset$ for all $v \ne v'$. For two nodes $v$ and $u$ of the cluster tree, define their separator as $\mathrm{sep}(u, v) := \chi(u) \cap \chi(v)$; moreover, we define the difference positions from $u$ to an adjacent $v$ by $\mathrm{diff}(u \to v) := \chi(u) - \mathrm{sep}(u, v)$, i.e. the positions of $u$ that are not shared with $v$.


Data: Cluster tree $(T, \chi, \phi)$

Result: Messages $m_{u \to v}$ for all $(u \to v) \in T$; i.e. partition functions of the subtrees of all $u$ for all possible partial sequences determining exactly the positions $\mathrm{sep}(u, v)$.

for $u \to v \in T$ in postorder do
    for $\bar{S} \in PS(\mathrm{sep}(u, v))$ do
        $x := 0$;
        for $\bar{S}' \in PS(\mathrm{diff}(u \to v))$ do
            $p$ := product( $\exp(-\beta f[\![\bar{S} \oplus \bar{S}']\!])$ for $f \in \phi(u)$ ) · product( $m_{w \to u}[\![\bar{S} \oplus \bar{S}']\!]$ for $(w \to u) \in T$ );
            $x := x + p$;
        $m_{u \to v}[\![\bar{S}]\!] := x$;
return $m$;

Algorithm 1: FPT computation of the partition function using dynamic programming (CTE).

For a set $Y$ of sequence positions, write $PS(Y)$ to denote the set of all partial sequences that determine exactly the positions in $Y$; furthermore, given partial sequences $\bar{S}$ and $\bar{S}'$, we define the combined partial sequence $\bar{S} \oplus \bar{S}'$ such that

$$(\bar{S} \oplus \bar{S}')_i := \begin{cases} \bar{S}'_i & \text{if } \bar{S}'_i \in \Sigma \\ \bar{S}_i & \text{otherwise.} \end{cases}$$

Finally we assume, w.l.o.g., that all position difference sets $\mathrm{diff}(u \to v)$ are singletons: for any given cluster tree, an equivalent (in terms of treewidth) cluster tree can always be obtained by inserting at most $\Theta(|\mathcal{X}|)$ additional clusters.

Let us now consider the computation of the partition function. Given are the RNA design network $\mathcal{N} = (\mathcal{X}, \mathcal{F})$ and its cluster tree decomposition $(T, \chi, \phi)$. W.l.o.g., we assume that $T$ is connected and contains a dedicated node $r$, with $\chi(r) = \emptyset$ and $\phi(r) = \emptyset$, added as a virtual root connected to a node in each connected component of $T$. Now, we consider the set of directed edges $E_T^r$ of $T$ oriented to $r$; define $T_r(u)$ as the induced subtree of $u$. Algorithm 1 computes the partition function by passing messages along these directed edges $u \to v$ (i.e. always from some child $u$ to its parent $v$). Each message $m$ is a function that depends on the positions $\mathrm{dep}(m) \subseteq \mathcal{X}$ and yields a partition function in $\mathbb{R} \cup \{\infty\}$. The message from $u$ to $v$ represents the partition functions of the subtree of $u$ for all possible partial sequences in $PS(\mathrm{sep}(u, v))$. Induction over $T$ lets us show the correctness of the algorithm (Supp. Mat. 10). After running Alg. 1, multiplying the 0-ary messages sent to the root $r$ yields the total partition function: $Z_{\mathcal{N}} = \prod_{(u \to r) \in T} m_{u \to r}[\![\emptyset]\!]$.
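The message passing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RNARedPrint implementation: the cluster tree is given by hypothetical dictionaries `bags` (the sets χ), `children` (the tree rooted at the virtual root `"r"`) and `funcs` (the assignment φ), messages are computed recursively with a cache rather than in an explicit postorder loop, and the toy energy terms are single base pairs. The result is compared with exhaustive enumeration.

```python
import itertools
import math

ALPHABET = "ACGU"
VALID = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def pair_term(i, j, energy):
    """Energy function with dep(f) = {i, j}: `energy` for a valid pair, else infinity."""
    def f(assign):
        ok = frozenset((assign[i], assign[j])) in VALID
        return energy if ok else math.inf
    return f


def weight(f, assign, beta):
    """Boltzmann factor exp(-beta * f), with weight 0 for infinite energy."""
    e = f(assign)
    return 0.0 if e == math.inf else math.exp(-beta * e)


def message(u, fixed, children, bags, funcs, beta, cache):
    """Partition function of the subtree of u; `fixed` assigns the separator positions."""
    key = (u, tuple(sorted(fixed.items())))
    if key not in cache:
        free = sorted(bags[u] - set(fixed))  # positions in diff(u -> parent)
        total = 0.0
        for letters in itertools.product(ALPHABET, repeat=len(free)):
            assign = {**fixed, **dict(zip(free, letters))}
            p = 1.0
            for f in funcs.get(u, []):
                p *= weight(f, assign, beta)
            for w in children.get(u, []):
                sep = {x: assign[x] for x in bags[w] & bags[u]}
                p *= message(w, sep, children, bags, funcs, beta, cache)
            total += p
        cache[key] = total
    return cache[key]


def brute_force(n, all_funcs, beta):
    """Reference value: enumerate all 4^n sequences."""
    total = 0.0
    for letters in itertools.product(ALPHABET, repeat=n):
        assign = dict(zip(range(1, n + 1), letters))
        total += math.prod(weight(f, assign, beta) for f in all_funcs)
    return total


def toy_instance(energy):
    """Base pairs (1,2), (2,3), (3,4): a path, decomposed into three bags of size 2."""
    bags = {"r": set(), "b": {2, 3}, "a": {1, 2}, "c": {3, 4}}
    children = {"r": ["b"], "b": ["a", "c"]}
    funcs = {"a": [pair_term(1, 2, energy)], "b": [pair_term(2, 3, energy)],
             "c": [pair_term(3, 4, energy)]}
    return bags, children, funcs


bags, children, funcs = toy_instance(-1.0)
z = message("r", {}, children, bags, funcs, 1.0, {})
print(z, brute_force(4, [f for fs in funcs.values() for f in fs], 1.0))
```

Each cached entry plays the role of one table cell of a message: the work per node is bounded by the number of assignments of its bag, which is where the dependence on the width comes from.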

The partition functions can then direct a stochastic backtrack to achieve Boltzmann sampling of sequences, such that one samples from the Boltzmann distribution of a given design network $\mathcal{N}$. The sampling algorithm assumes that the cluster tree was expanded and that the messages $m_{u \to v}$ for the edges in $E_T^r$ were already generated by Algorithm 1 for the expanded cluster tree. Algorithm 2 defines the recursive procedure Sample($u$, $\bar{S}$), which returns a partial sequence, randomly drawn from the Boltzmann distribution, that determines all sequence positions in the subtree rooted at $u$. Called on $r$ and the empty partial sequence, which does not determine any positions, the procedure samples a random sequence from the Boltzmann distribution.

3.2 Computational complexity of the multiple target sampling algorithm

Note that in the following complexity analysis, we omit the time and space for computing the tree decomposition itself, since we observed that the computation time of the tree decomposition (GreedyFillIn, implemented in LibTW by van Dijk et al. (2006)) for multi-target sampling is negligible compared to Alg. 1 (Supp. Mat. 8 and 9).

We define the maximum separator size $s$ as $\max_{u,v \in V} |\mathrm{sep}(u, v)|$ and denote the maximum size of $\mathrm{diff}(u \to v)$ over $(u, v) \in E_T^r$ as $D$. In the absence of specific optimizations, running Alg. 1 requires $O((|\mathcal{F}| + |V|) \cdot 4^{w+1})$ time and $O(|V| \cdot 4^{s})$ space (Supp. Mat. 11); Alg. 2 would require $O((|\mathcal{F}| + |V|) \cdot 4^{D})$ per sample on arbitrary tree decompositions (Supp. Mat. 11). W.l.o.g. we assume that $D = 1$; note that tree decompositions can generally be transformed such that $|\mathrm{diff}(u \to v)| \le 1$. Moreover, the size of $\mathcal{F}$ is linearly bounded: for $k$ input structures for sequences of length $n$, the energy function is expressed by $O(n k)$ functions. Finally, the number of cluster tree nodes is in $O(n)$, such that $|\mathcal{F}| + |V| \in O(n k)$.

Data: Node $u$, partial sequence $\bar{S} \in PS(\mathrm{sep}(u, v))$; cluster tree $(T, \chi, \phi)$ and partition functions $m_{u' \to v'}[\bar{S}']$, $\forall (u' \to v') \in T$ and $\bar{S}' \in PS(\mathrm{sep}(u', v'))$.

Result: Boltzmann-distributed random partial sequence for the subtree rooted at $u$, specializing a partial sequence $\bar{S}$.

Function Sample($u$, $\bar{S}$; $T$, $\chi$, $\phi$, $m$)
    $r$ := UnifRand($m_{u \to v}[\![\bar{S}]\!]$);
    for $\bar{S}' \in PS(\mathrm{diff}(u \to v))$ do
        $p$ := product( $\exp(-\beta f[\![\bar{S} \oplus \bar{S}']\!])$ for $f \in \phi(u)$ ) · product( $m_{w \to u}[\![\bar{S} \oplus \bar{S}']\!]$ for $(w \to u) \in T$ );
        $r := r - p$;
        if $r < 0$ then
            $\bar{S}_{res} := \bar{S} \oplus \bar{S}'$;
            for $(w \to u) \in T$ do
                $\bar{S}_{res} := \bar{S}_{res} \oplus$ Sample($w$, $\bar{S} \oplus \bar{S}'$; $T$, $\chi$, $\phi$, $m$);
            return $\bar{S}_{res}$;

Algorithm 2: Stochastic backtrack algorithm for partial sequences in the Boltzmann distribution.
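The stochastic backtrack can be illustrated with the following Python sketch, written under simplifying assumptions: the cluster tree is a hypothetical triple `(bags, children, factors)` rooted at a virtual node `"r"`, each factor directly returns a Boltzmann weight, and messages are recomputed on demand instead of being tabulated beforehand as Algorithm 1 would do. It is meant to show the sampling logic, not to be efficient.

```python
import itertools
import math
import random

ALPHABET = "ACGU"
VALID = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def pair_weight(i, j, energy, beta=1.0):
    """Boltzmann factor of a base-pair term; 0 for a non-canonical pair."""
    def g(assign):
        ok = frozenset((assign[i], assign[j])) in VALID
        return math.exp(-beta * energy) if ok else 0.0
    return g


def restrict(assign, u, w, bags):
    """Restrict an assignment to the separator of the nodes u and w."""
    return {x: assign[x] for x in bags[u] & bags[w]}


def extensions(u, fixed, tree):
    """Yield (assignment, weight) for every completion of `fixed` on the bag of u."""
    bags, children, factors = tree
    free = sorted(bags[u] - set(fixed))
    for letters in itertools.product(ALPHABET, repeat=len(free)):
        assign = {**fixed, **dict(zip(free, letters))}
        p = math.prod(g(assign) for g in factors.get(u, []))
        for w in children.get(u, []):
            p *= message(w, restrict(assign, u, w, bags), tree)
        yield assign, p


def message(u, fixed, tree):
    """Partition function of the subtree of u for a given separator assignment."""
    return sum(p for _, p in extensions(u, fixed, tree))


def sample(u, fixed, tree, rng):
    """Draw all positions in the subtree of u from the Boltzmann distribution."""
    bags, children, _ = tree
    r = rng.uniform(0.0, message(u, fixed, tree))
    chosen = None
    for assign, p in extensions(u, fixed, tree):
        if p == 0.0:
            continue  # never select an invalid completion
        chosen = assign
        r -= p
        if r < 0.0:
            break
    result = dict(chosen)
    for w in children.get(u, []):
        result.update(sample(w, restrict(chosen, u, w, bags), tree, rng))
    return result


def toy_tree(energy=0.0):
    """Base pairs (1,2), (2,3), (3,4) with one bag per base pair."""
    bags = {"r": set(), "b": {2, 3}, "a": {1, 2}, "c": {3, 4}}
    children = {"r": ["b"], "b": ["a", "c"]}
    factors = {"a": [pair_weight(1, 2, energy)], "b": [pair_weight(2, 3, energy)],
               "c": [pair_weight(3, 4, energy)]}
    return bags, children, factors


rng = random.Random(0)
drawn = sample("r", {}, toy_tree(), rng)
print("".join(drawn[i] for i in range(1, 5)))
```

Because every choice is made with probability proportional to the weight of the partial solution times the messages of the subtrees below it, the product of the conditional probabilities along the recursion equals the Boltzmann probability of the returned sequence.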

Theorem 1 (Complexities). For sequence length $n$, $k$ target structures, treewidth $w$ and a base pair dependency graph having $c$ connected components, $t$ sequences are generated from the Boltzmann distribution in $O(n\, k\, 4^{w+1} + t\, n\, k)$ time.

As shown in Supp. Mat. 12, the complexity of the precomputation can be further improved to $O(n\, k\, 2^{w+1}\, 2^{c})$, where $c$ is the maximum number of connected components represented in a node of the tree decomposition ($c \le w + 1$).

3.3 Sequence features, constraints, and energy models.

The functions in F allow expressing complex features of the sequences alone, e.g. rewarding or penalizing specific sequence motifs, as well as features depending on the target structures. Furthermore, constraints, which enforce or forbid features, are naturally expressed by assigning infinite penalties to invalid (partial) sequences. Finally, but less obviously, the framework captures various RNA energy models with bounded dependencies, which we describe briefly.

In simple base pair-based energy models, energy is defined as the sum of base pair (pseudo-)energies. If base pair energies $E^{\mathrm{bp}}_{\ell}(i, j, x, y)$ (where $i$ and $j$ are sequence positions, $x$ and $y$ are bases in $\Sigma$) are given for each target structure $\ell$, s.t. $E(S; R_\ell) := \sum_{(i,j) \in R_\ell} E^{\mathrm{bp}}_{\ell}(i, j, S_i, S_j)$, we encode the network energy by the set of functions $f$ for each base pair $(i, j) \in R_\ell$ of each input structure $R_\ell$ that evaluate to $\ln(\pi_\ell)\, E^{\mathrm{bp}}_{\ell}(i, j, \bar{S}(x_i), \bar{S}(x_j))$ under partial sequence $\bar{S}$; here, $\pi_\ell > 0$ is a weight that controls the influence of structure $\ell$ on the sampling (as elaborated later).

More complex loop-based energy models (e.g. the Turner model, which among others includes energy terms for special loops and dangling ends) can also be encoded as straightforward extensions. An interesting stripped-down variant of the nearest neighbor model is the stacking energy model. This model assigns non-zero energy contributions only to stacks, i.e. interior loops with closing base pair $(i, j)$ and inner base pair $(i + 1, j - 1)$.

The arity of the introduced functions provides an important bound on the treewidth of the network (and therefore on the computational complexity). Thus, it is noteworthy that the base pair energy model requires only binary functions; the stacking model, only quaternary dependencies. This arity is increased in a few cases by the commonly used Turner 2004 model (Turner and Mathews, 2009) for encoding tabulated special hairpin and interior loop contributions, which depend on up to nine bases for the interior loops with a total of 5 unpaired bases (“2x3” interior loops); all other energy contributions (like dangling ends) still depend on at most four bases of the sequence.

3.4 Extension to multidimensional Boltzmann sampling

The flexibility of our framework supports an advanced sampling technique, named multidimensional Boltzmann sampling (Bodini and Ponty, 2010), to (probabilistically) enforce additional constraints. This technique was previously used to control the GC-content (Waldispühl and Ponty, 2011; Reinharz et al., 2013) and dinucleotide content (Zhang et al., 2013) of sampled RNA sequences; it enables explicit control of the free energies $(E_1^\star, \dots, E_k^\star)$ of the single targets.

Multidimensional Boltzmann sampling requires the ability to sample from a weighted distribution over the set of compatible sequences, where the probability of a sequence $S$ with free energies $(E_1, \dots, E_k)$ for its target structures is

$$\mathbb{P}(S \mid \boldsymbol{\pi}) = \frac{\prod_{\ell=1}^{k} \pi_\ell^{-E_\ell}}{Z_{\boldsymbol{\pi}}},$$

where $\boldsymbol{\pi} := (\pi_1 \cdots \pi_k)$ is a vector of positive real-valued weights, and $Z_{\boldsymbol{\pi}}$ is the weighted partition function. Such a distribution can be induced by a simple modification of the functions described in Sec. 3.3, where any energy function $E(\bar{S})$ for a structure $\ell$ is replaced by $E'(\bar{S}) := \ln(\pi_\ell)\, E(\bar{S})/\beta$. The probability of a sequence $S$ is thus proportional to $\prod_{\ell=1}^{k} e^{-\ln(\pi_\ell)\, E_\ell} = \prod_{\ell=1}^{k} \pi_\ell^{-E_\ell}$.

One then needs to learn a weight vector $\boldsymbol{\pi}$ such that, on average, the targeted energies are achieved by a random sequence in the weighted distribution, i.e. such that $\mathbb{E}(E_\ell(S) \mid \boldsymbol{\pi}) = E_\ell^\star$, $\forall \ell \in [1, k]$. The expected value of $E_\ell$ is always decreasing for increasing weights $\pi_\ell$ (see Supp. Mat. 14). More generally, computing a suitable $\boldsymbol{\pi}$ can be restated as a convex optimization problem, and be efficiently solved using a wide array of methods (Denise et al., 2010; Bendkowski et al., 2017). In practice, we use a simple heuristic which starts from an initial weight vector $\boldsymbol{\pi}^{[0]} := (e^{\beta}, \dots, e^{\beta})$ and, at each iteration, generates a sample $\mathcal{S}$ of sequences. The expected value of an energy $E_\ell$ is estimated as $\hat{\mu}_\ell(\mathcal{S}) = \sum_{S \in \mathcal{S}} E_\ell(S)/|\mathcal{S}|$, and the weights are updated at the $t$-th iteration by $\pi_\ell^{[t+1]} = \pi_\ell^{[t]} \cdot \gamma^{\hat{\mu}_\ell(\mathcal{S}) - E_\ell^\star}$. The constant $\gamma > 1$ is chosen empirically to achieve effective optimization. While heuristic in nature, this basic iteration was chosen for our initial version of RNARedPrint because of its reasonable empirical behavior (with $\gamma = 1.2$).
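A single iteration of this weight update can be written as a small pure function. The sketch below assumes that the energies of the sampled sequences are already available as tuples $(E_1, \dots, E_k)$; the function name `update_weights` and the data layout are illustrative assumptions.

```python
def update_weights(weights, sampled_energies, targets, gamma=1.2):
    """One iteration of the heuristic: pi_l <- pi_l * gamma ** (mean_l - target_l).

    sampled_energies holds one tuple (E_1, ..., E_k) per sampled sequence.
    """
    count = len(sampled_energies)
    new_weights = []
    for l, (pi, target) in enumerate(zip(weights, targets)):
        mean = sum(energies[l] for energies in sampled_energies) / count
        # A mean energy above the target raises the weight, which lowers
        # the expected energy of structure l in the next round of sampling.
        new_weights.append(pi * gamma ** (mean - target))
    return new_weights


# Two structures: the first is 3 kcal/mol above its target, the second on target.
print(update_weights([1.0, 2.0], [(-10.0, -20.0), (-14.0, -20.0)], (-15.0, -20.0)))
```

In a full adaptive loop, this function would alternate with a sampling step that draws a new batch of sequences under the updated weights, until the estimated means are close enough to the targets.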

A further rejection step is applied to the generated sequences to retain only those whose energy for each structure $R_\ell$ belongs to $[E_\ell^\star \cdot (1-\varepsilon), E_\ell^\star \cdot (1+\varepsilon)]$, for some predefined tolerance $\varepsilon \ge 0$. The rejection approach is justified by the following considerations: i) Enacting an exact control over the energies would be technically hard and costly. Indeed, controlling the energies through dynamic programming would require explicit convolution products, generalizing Cupal et al. (1996), inducing additional $\Theta(n^{2k})$ time and $\Theta(n^{k})$ space overheads; ii) Induced distributions are typically concentrated. Intuitively, unless sequences are fully constrained, individual energy terms are independent enough so that their sum is concentrated around its mean, the targeted energy (cf. Fig. 2). For base pair-based energy models and special base pair dependency graphs (paths, cycles, ...), this property rigorously follows from analytic combinatorics, see Bender et al. (1983) and Drmota (1997). In such cases, the expected number of rejections before reaching the targeted energies remains constant when $\varepsilon \ge 1/\sqrt{n}$, and is in $\Theta(n^{k/2})$ when $\varepsilon = 0$. The GC-content of designs can also be controlled, jointly and in a similar way, as done in IncaRNAtion (Reinharz et al., 2013).
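The rejection criterion itself is a simple interval test. The following sketch uses hypothetical names (`within_tolerance`, `reject`) and assumes that each sample is a pair of a sequence and its tuple of energies. Since target energies are typically negative, the two interval endpoints are sorted before the comparison.

```python
def within_tolerance(energies, targets, eps):
    """True iff each energy lies between target*(1-eps) and target*(1+eps)."""
    for energy, target in zip(energies, targets):
        # For a negative target, target*(1+eps) is the lower endpoint.
        low, high = sorted((target * (1 - eps), target * (1 + eps)))
        if not low <= energy <= high:
            return False
    return True


def reject(samples, targets, eps):
    """Keep only the (sequence, energies) samples that meet all targets."""
    return [(seq, e) for seq, e in samples if within_tolerance(e, targets, eps)]


samples = [("seq1", (-38.5, -21.0)), ("seq2", (-35.0, -20.0))]
print(reject(samples, (-40.0, -20.0), 0.1))
```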

4 Results

4.1 Targeting Turner energies and GC-content

We implemented the Boltzmann sampling approach (Algorithms 1 and 2), performing sampling for given target structures and weights $\pi_1, \dots, \pi_k$; on top of this, we implemented multidimensional Boltzmann sampling (see Section 3.4) to target specific energies and GC-content. Our tool RNARedPrint evaluates energies according to the base pair energy model or the stacking energy model, using parameters which were trained to approximate Turner energies (Supp. Mat. 13).


Fig. 2. Targeting specific energies using multi-dimensional Boltzmann sampling. (A) Turner energy distributions for three target structures (right) while targeting (−40, −40, −20) and (−20, −20, −20) free energy vectors. For comparison, we show the distribution of uniform and Boltzmann samples; respectively associated with homogeneous weights 1 and $e^{\beta}$. (B) Accuracy of targeting over all target structures of the Modena benchmark instances RNAtabupath (2str), 3str and 4str. Shown is the relative deviation of targeted and achieved energies in the simple stacking model and the Turner model.

To capture the realistic Turner model $E_T$, we exploit its very good correlation (Supp. Fig. 6) with our simple stacking-based model $E_S$. Namely, we observe a structure-specific affine dependency between the Turner and stacking energy models, so that $E_T(S; R_\ell) \approx \gamma_\ell\, E_S(S; R_\ell) + \delta_\ell$ for each structure $R_\ell$. We learn the $(\gamma_\ell, \delta_\ell)$ parameters from a set of sequences generated with homogeneous weights $w = e^{\beta}$, tuning only the GC-content to a predetermined target frequency. Finally, we adjust the targeted energy of our stacking model to $E_S^\star = (E_T^\star - \delta_\ell)/\gamma_\ell$.
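The affine adjustment can be illustrated as follows. The text does not specify how the parameters $(\gamma_\ell, \delta_\ell)$ are fitted; the sketch below assumes an ordinary least-squares fit, and the function names and the numbers in the example are made up for illustration.

```python
def fit_affine(stacking, turner):
    """Least-squares fit of turner ~ gamma * stacking + delta (an assumed fitting method)."""
    n = len(stacking)
    mean_s = sum(stacking) / n
    mean_t = sum(turner) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(stacking, turner))
    var = sum((s - mean_s) ** 2 for s in stacking)
    gamma = cov / var
    delta = mean_t - gamma * mean_s
    return gamma, delta


def stacking_target(turner_target, gamma, delta):
    """Invert the affine relation: E_S* = (E_T* - delta) / gamma."""
    return (turner_target - delta) / gamma


# Made-up energies of three sampled sequences for one structure.
gamma, delta = fit_affine([-10.0, -20.0, -30.0], [-15.0, -35.0, -55.0])
print(gamma, delta, stacking_target(-40.0, gamma, delta))
```

One such fit is needed per target structure, since the dependency is described as structure-specific.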

To illustrate the above strategy, we sampled n = 1 000 sequences targeting GC% = 0.5 and different Turner energies for the three structure targets of an example instance. Fig. 2A illustrates how the Turner energy distributions of the shown target structures can be accurately shifted to prescribed target energies; for comparison, we plot the respective energy distribution of uniformly and Boltzmann sampled sequences. Fig. 2B summarizes the targeting accuracy for the single structure energies over a larger set of instances from the Modena benchmark. We emphasize that only the simple stacking energies are directly targeted (at ±10% tolerance, except for a few hard instances). Due to our linear adjustment, we finally achieve well-controlled energies in the realistic Turner energy model.

4.2 Generating high-quality seeds for further optimization

We empirically study the capacity of RNARedPrint to improve the generation of seed sequences for subsequent local optimizations, showing an important application in a biologically relevant scenario.

For a multi-target design instance, individual reasonable target energies are estimated from averaging the energy over samples at relatively high weights $e^{\beta}$, setting the targeted GC-content to 50%. For each instance of the Modena benchmark (Taneda, 2015), we estimated such target energies and subsequently generated 1 000 seed sequences at these energies. Our procedure is tailored to produce sequences with similar Turner energy that favor the stability of the target structures having moderate GC-content; all these properties are desirable objectives for RNA design. All sampled sequences are evaluated under the multi-stable design objective function of Hammer et al. (2017):

$$f(S) = \frac{1}{k}\sum_{\ell=1}^{k}\bigl(E(S, R_\ell) - G(S)\bigr) + 0.5\,\frac{1}{\binom{k}{2}}\sum_{1\le \ell<j\le k}\bigl|E(S, R_\ell) - E(S, R_j)\bigr|. \tag{2}$$
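As a reading aid, Eq. (2) can be evaluated as in the following sketch. It assumes that the energies $E(S, R_\ell)$ and the term $G(S)$ (taken here to be the ensemble free energy of S, which this section does not define) have already been computed by an external folding tool; the function name and the parameter `xi` for the factor 0.5 are hypothetical.

```python
from itertools import combinations


def multistable_objective(target_energies, ensemble_energy, xi=0.5):
    """Evaluate Eq. (2) from precomputed energies.

    target_energies: list of E(S, R_l), one value per target structure
    ensemble_energy: G(S), assumed here to be the ensemble free energy of S
    xi:              weight of the second term (0.5 in Eq. (2))
    """
    k = len(target_energies)
    # First term: mean distance of the target energies to G(S)
    mean_gap = sum(e - ensemble_energy for e in target_energies) / k
    # Second term: mean absolute pairwise energy difference between targets
    pairs = list(combinations(target_energies, 2))
    spread = sum(abs(a - b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return mean_gap + xi * spread
```

For two targets with energies -10 and -8 and G(S) = -12, the first term is 3 and the second is 0.5 · 2 = 1, giving a score of 4.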

Using our strategy, we generated at least 1 000 seeds per instance of the subsets RNAtabupath (2str), 3str, and 4str of the Modena benchmark set, with 2, 3, and 4 structures respectively. The results are compared, in terms of their value for the objective function of Eq. (2), to seed sequences uniformly sampled using RNAblueprint (Hammer et al., 2017) (Table 1, seeds). Our first analysis reveals that Boltzmann-sampled sequences constitute better seeds, associated with better values (∼ 69% improvement on average), compared to uniform seeds for all instances of our benchmark (Supp. Mat. 15).

Stage       Dataset   RNARedPrint      Uniform          Improvement
Seeds       2str      21.67 (±4.38)    37.74 (±6.45)    73%
Seeds       3str      18.09 (±3.98)    30.49 (±5.41)    71%
Seeds       4str      19.94 (±3.84)    32.29 (±5.24)    63%
Optimized   2str       5.84 (±1.31)     7.95 (±1.76)    28%
Optimized   3str       5.08 (±1.10)     7.04 (±1.52)    31%
Optimized   4str       8.77 (±1.48)    13.13 (±2.13)    37%

Table 1. Comparison of the multi-stable design objective function (Eq. (2)) of sequences sampled by RNARedPrint vs. uniform samples for three benchmark sets; before (seeds) and after optimization (optimized). We report the average scores with standard deviations. Smaller scores are better; the final column reports our improvements over uniform sampling.

Moreover, for each generated seed sequence, we optimized the objective function (2) through an adaptive greedy walk consisting of 500 steps. At each step, we resampled (uniformly at random) the positions of a randomly selected component in the base pair dependency graph, and accepted the modification only if it resulted in a gain, as described in RNAblueprint (Hammer et al., 2017). Once again, for all instances, we observe a positive improvement of the quality of Boltzmann designs over uniform ones, suggesting a superior potential for optimization (∼ 32% average improvement, Table 1, optimized). Notably, our subsequent optimization runs, which are directly inherited from the uniformly sampling RNAblueprint software, partially level out the advantages of Boltzmann sampling. In future work, we hope to improve this aspect by exploiting Boltzmann sampling even during the optimization run.
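The greedy walk can be sketched as follows. This is not the RNAblueprint implementation: the function name is hypothetical, the resampling step draws nucleotides uniformly without enforcing base pair compatibility (a simplifying assumption), and whatever makes the walk "adaptive" is omitted. The objective is passed in as a callable to be minimized.

```python
import random


def greedy_walk(seed, components, objective, steps=500, rng=None):
    """Greedy walk: resample one random component per step, keep only improvements.

    seed:       initial sequence (string over ACGU)
    components: list of position lists, one per connected component
    objective:  callable mapping a sequence to a score (smaller is better)
    """
    rng = rng or random.Random()
    current = list(seed)
    current_score = objective("".join(current))
    for _ in range(steps):
        component = rng.choice(components)
        candidate = list(current)
        # Toy resampling: uniform nucleotides, ignoring pairing constraints (assumption)
        for pos in component:
            candidate[pos] = rng.choice("ACGU")
        score = objective("".join(candidate))
        if score < current_score:  # accept only if the move is a gain
            current, current_score = candidate, score
    return "".join(current), current_score
```

In a realistic setting the resampling step would draw only from assignments that are valid for the selected component, and the objective would be Eq. (2).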

5 Counting valid designs is #P-hard

Finally, we turn to the complexity of #Designs(G), the problem of computing the number of valid sequences for a given compatibility graph G = (V, E). Note that this problem corresponds to the partition function problem in a simple (0/∞)-valued base pair model, with β ≠ 1. As previously noted (Flamm et al., 2001), a set of target structures admits a valid design iff its compatibility graph is bipartite, which can be tested in linear time. Moreover, without {G, U} base pairs, any connected component C ∈ CC(G) is entirely determined by the assignment of a single one of its nucleotides. The number of valid designs is thus simply $4^{\#CC(G)}$, where #CC(G) is the number of connected components.
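The easy case without {G, U} pairs can be sketched in a few lines: a breadth-first 2-coloring tests bipartiteness and counts the connected components in linear time. The function name and the edge-list representation of the compatibility graph are assumptions made for this example.

```python
from collections import deque


def count_designs_without_gu(n, edges):
    """Number of valid designs when only A-U and G-C pairs are allowed.

    Returns 4 ** (#connected components) if the graph is bipartite, else 0.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    components = 0
    for start in range(n):
        if color[start] is not None:
            continue
        components += 1
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return 0  # odd cycle: no valid design exists
    return 4 ** components
```

For instance, a path on three positions plus one isolated position has two components, hence 16 designs, whereas a triangle admits none.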

The introduction of {G, U} base pairs radically changes this picture, and we show that valid designs for a set of structures cannot be counted in polynomial time, unless #P = FP. The latter equality would, in particular, imply the classic P = NP, and solving #Designs(G) in polynomial time is therefore unlikely to be possible.

To establish that claim, we consider instances G = (V1 ∪ V2, E) that are connected and bipartite (((V1 × V1) ∪ (V2 × V2)) ∩ E = ∅, i.e. every edge joins a node of V1 to a node of V2), noting that hardness over restricted instances implies hardness of the general problem. Moreover, recall that, as observed in Subsec. 3.2, assigning a nucleotide to a position u ∈ V constrains the parity ({A, G} or {C, U}) of all positions in the connected component of u. For this reason, we restrict our attention to the counting of valid designs up to the trivial (A ↔ C / G ↔ U) symmetry, by constraining the positions in V1 (resp. V2) to only A and G (resp. C and U). Let Designs⋆(G) denote the subset of all designs for G under this constraint, noting that #Designs(G) = 2 · |Designs⋆(G)|.

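Recall that an independent set of G = (V, E) is a subset V′ ⊆ V of nodes, no two of which are connected by an edge in E. Let IndSets(G) denote the set of all independent sets in G. The following proof establishes that |Designs⋆(G)| = |IndSets(G)|, by exhibiting a bijection between the two sets.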

Proof: Consider the function Ψ : Designs⋆(G) → IndSets(G) defined by Ψ(f) := {v ∈ V | f(v) ∈ {A, C}}. The set Ψ(f) is indeed independent: two adjacent nodes of Ψ(f) would be assigned A and C respectively, which do not form a valid base pair. Let us establish the injectivity of Ψ, i.e. that Ψ(f) ≠ Ψ(f′) for all f ≠ f′. If f ≠ f′, then there exists a node v ∈ V such that f(v) ≠ f′(v). Assume that v ∈ V1 (the case v ∈ V2 is symmetric), and recall that the only nucleotides allowed in V1 are A and G. Since f(v) ≠ f′(v), we have {f(v), f′(v)} = {A, G}, and we conclude that Ψ(f) differs from Ψ(f′) at least with respect to its inclusion/exclusion of v.

We turn to the surjectivity of Ψ , i.e. the existence of a preimage for each element S ∈ IndSets(G). Let us consider the function f defined as

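$$\forall v_1 \in V_1,\ v_2 \in V_2:\quad f(v_1) = \begin{cases} \mathrm{A} & \text{if } v_1 \in S \\ \mathrm{G} & \text{if } v_1 \notin S \end{cases} \quad\text{and}\quad f(v_2) = \begin{cases} \mathrm{C} & \text{if } v_2 \in S \\ \mathrm{U} & \text{if } v_2 \notin S \end{cases} \tag{3}$$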

One easily verifies that Ψ (f ) = S.

We are then left to determine if f is a valid design for G, i.e. if for each (v, v′) ∈ E one has {f(v), f(v′)} ∈ B. Since G is bipartite, any edge in E involves two nodes v1 ∈ V1 and v2 ∈ V2. Remark that, among all the possible assignments f(v1) and f(v2), the only invalid combination of nucleotides is (f(v1), f(v2)) = (A, C). However, such nucleotides are assigned to positions that are in the independent set S, and therefore cannot be adjacent. We conclude that Ψ is surjective, and thus bijective.
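The bijection can be checked empirically on small graphs with a brute-force sketch. Both functions below are exponential-time enumerations written only for illustration (their names are hypothetical); they assume the set B of valid pairs to be {A-U, G-C, G-U}, in either orientation.

```python
from itertools import product

# Valid base pairs B, including the wobble pair G-U (both orientations)
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def count_designs(n, edges):
    """Brute-force #Designs(G): sequences where every edge is a valid pair."""
    return sum(all(seq[u] + seq[v] in VALID_PAIRS for u, v in edges)
               for seq in product("ACGU", repeat=n))


def count_independent_sets(n, edges):
    """Brute-force |IndSets(G)|: subsets of nodes containing no edge."""
    count = 0
    for mask in range(1 << n):
        if all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges):
            count += 1
    return count
```

On the path with three nodes, there are 5 independent sets and 10 valid designs, in agreement with #Designs(G) = 2 · |IndSets(G)| for connected bipartite graphs.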

Now we can build on the connection between the two problems to obtain complexity results for #Designs. Counting independent sets in bipartite graphs (#BIS) is indeed a well-studied #P-hard problem (Ge and Štefankovič, 2012), from which we immediately conclude:

Corollary 1. #Designs is #P-hard

Proof: Note that #BIS is also #P-hard on connected graphs, as the number of independent sets for a disconnected graph G is given by $|\mathrm{IndSets}(G)| = \prod_{cc \in CC(G)} |\mathrm{IndSets}(cc)|$. Thus any efficient algorithm for #BIS on connected instances provides an efficient algorithm for general graphs.

Let us now hypothesize the existence of a polynomial-time algorithm A for #Designs over connected graphs G. Consider the (polynomial-time) algorithm A′ that first executes A on G to produce #Designs(G), and returns #Designs(G)/2 = |Designs⋆(G)| = |IndSets(G)|. Clearly A′ solves #BIS in polynomial time. This means that #Designs is at least as hard as #BIS, and thus does not admit a polynomial-time exact algorithm unless #P = FP.

6 Conclusion

Motivated by the hardness of the problem, established here, we introduced a general framework and efficient algorithms for the design of RNA sequences to multiple target structures with fine-tuned properties. Our method combines an FPT stochastic sampling algorithm with multi-dimensional Boltzmann sampling over distributions controlled by expressive RNA energy models. Compared to the previously best available sampling method (uniform sampling), this approach generated significantly better seed sequences for instances of a common multi-target design benchmark.

The presented method enables new possibilities for sequence generation in the field of RNA sequence design by enforcing additional constraints, like the GC-content, while controlling the energy of multiple target structures. Thus it presents a major advance over previously applied ad-hoc sampling and even efficient uniform sampling strategies. We have shown the practicality of such controlled sequence generation and studied its use for multi-target RNA design. Moreover, our framework is equipped to include more complex sequence constraints, including mandatory/forbidden motifs at specific positions or anywhere in the designed sequences. Furthermore, the method can support even negative design principles, for instance by penalizing a set of alternative helices/structures. Based on this generality, we envision our framework, in extensions that still require delicate elaboration, to support various complex RNA design scenarios far beyond the sketched applications.


Acknowledgements

YP is supported by the Agence Nationale de la Recherche and the FWF (ANR-14-CE34-0011; project RNALands). SH is supported by the German Federal Ministry of Education and Research (BMBF support code 031A538B; de.NBI: German Network for Bioinformatics Infrastructure) and the Future and Emerging Technologies programme (FET-Open grant 323987; project RiboNets). We thank Leonid Chindelevitch for suggesting using the stable sets analogy to optimize our FPT algorithm, and Arie Koster for practical recommendations for computing tree decompositions.


Bibliography

Andronescu, M. S. et al. (2010). Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA, 16(1), 26–42.

Bender, E. A. et al. (1983). Central and local limit theorems applied to asymptotic enumeration. iii. matrix recursions. Journal of Combinatorial Theory, Series A, 35(3), 263–278.

Bendkowski, M. et al. (2017). Polynomial tuning of multiparametric combinatorial samplers. arXiv preprint arXiv:1708.01212 .

Bodini, O. and Ponty, Y. (2010). Multi-dimensional Boltzmann sampling of languages. In Proceedings of the 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA’10), pages 49–64. DMTCS Proceedings.

Bulatov, A. A. et al. (2013). The expressibility of functions on the boolean domain, with applications to counting CSPs. J. ACM , 60(5), 32:1–32:36.

Busch, A. and Backofen, R. (2006). INFO-RNA–a fast approach to inverse RNA folding. Bioinformatics (Oxford, England), 22, 1823–1831.

Cai, J.-Y. et al. (2016). #BIS-hardness for 2-spin systems on bipartite bounded degree graphs in the tree non-uniqueness region. Journal of Computer and System Sciences, 82(5), 690–711.

Cupal, J. et al. (1996). Dynamic programming algorithm for the density of states of RNA secondary structures. In German Conference on Bioinformatics, pages 184–186.

Dechter, R. (2013). Reasoning with Probabilistic and Deterministic Graphical Models: Exact Algorithms. Morgan & Claypool.

Denise, A. et al. (2010). Controlled non-uniform random generation of decomposable structures. Theoretical Computer Science, 411(40-42), 3527–3552.

Ding, Y. and Lawrence, C. E. (2003). A statistical sampling algorithm for RNA secondary structure prediction. Nucleic acids research, 31, 7280–7301.

Domin, G. et al. (2017). Applicability of a computational design approach for synthetic riboswitches. Nucleic Acids Research, 45(7), 4108–4119.

Drmota, M. (1997). Systems of functional equations. Random Structures and Algorithms, 10(1-2), 103–124.

Findeiß, S. et al. (2017). Design of artificial riboswitches as biosensors. Sensors (Basel, Switzerland), 17.

Flamm, C. et al. (2001). Design of multistable RNA molecules. RNA (New York, N.Y.), 7, 254–265.

Ge, Q. and Štefankovič, D. (2012). A graph polynomial for independent sets of bipartite graphs. Combinatorics, Probability and Computing, 21(05), 695–714.

Goldberg, L. A. et al. (2004). The complexity of choosing an H-coloring (nearly) uniformly at random. SIAM Journal on Computing, 33(2), 416–432.

Hammer, S. et al. (2017). RNAblueprint: flexible multiple target nucleic acid sequence design. Bioinformatics (Oxford, England), 33, 2850–2858.

Hofacker, I. L. et al. (1994). Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2), 167–188.

Höner zu Siederdissen, C. et al. (2013). Computational design of RNAs with complex energy landscapes. Biopolymers, 99, 1124–1136.

Jerrum, M. R. et al. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43(Supplement C), 169 – 188.

Kushwaha, M. et al. (2016). Using RNA as molecular code for programming cellular function. ACS Synthetic Biology, 5(8), 795–809. PMID: 26999422.

Levin, A. et al. (2012). A global sampling approach to designing and reengineering RNA secondary structures. Nucleic acids research, 40(20), 10041–10052.

Lyngsø, R. B. et al. (2012). Frnakenstein: multiple target inverse RNA folding. BMC bioinformatics, 13, 260.

McCaskill, J. S. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.


Reinharz, V. et al. (2013). A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics (Oxford, England), 29, i308–i315.

Rodrigo, G. and Jaramillo, A. (2014). RiboMaker: computational design of conformation-based riboregulation. Bioinformatics, 30(17), 2508–2510.

Taneda, A. (2015). Multi-objective optimization for RNA design with multiple target secondary structures. BMC bioinformatics, 16, 280.

Turner, D. H. and Mathews, D. H. (2009). NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic acids research, 38(suppl 1), D280–D282.

van Dijk, T. et al. (2006). Computing treewidth with LibTW. Technical report, University of Utrecht.

Wachsmuth, M. et al. (2013). De novo design of a synthetic riboswitch that regulates transcription termination. Nucleic Acids Research, 41(4), 2541–2551.

Waldispühl, J. and Ponty, Y. (2011). An unbiased adaptive sampling algorithm for the exploration of RNA mutational landscapes under evolutionary pressure. Journal of computational biology : a journal of computational molecular cell biology, 18, 1465–1479.

Weitz, D. (2006). Counting independent sets up to the tree threshold. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, Seattle, WA, USA, May 21-23, 2006, pages 140–149.

Zhang, Y. et al. (2013). SPARCS: a web server to analyze (un)structured regions in coding RNA sequences. Nucleic acids research, 41, W480–W485.

Zhou, Y. et al. (2013). Flexible RNA design under structure and sequence constraints using formal languages. In J. Gao, editor, ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics. ACM-BCB 2013, Washington, DC, USA, September 22-25, 2013, page 229. ACM.

Zhu, J. Y. A. et al. (2013). Transient RNA structure features are evolutionarily conserved and can be

Supplementary material

7 Approximate counting and random generation

In fact, not only is #BIS a reference problem in counting complexity, but it is also a landmark problem with respect to the complexity of approximate counting problems. In this context, it is the representative for a class of #BIS-hard problems (Bulatov et al., 2013) that are easier to approximate than #SAT, yet are widely believed not to admit any Fully Polynomial-time Randomized Approximation Scheme. Recent results reveal a surprising dichotomy in the behavior of #BIS: it admits a Fully Polynomial-Time Approximation Scheme (FPTAS) for graphs of max degree ≤ 5 (Weitz, 2006), but is as hard to approximate as the general #BIS problem on graphs of degree ≥ 6 (Cai et al., 2016). In other words, there is a clear threshold, in terms of the max degree, separating (relatively) easy instances from really hard ones.

Additionally, let us note that, from the classic Vizing Theorem (König's edge-coloring theorem in the bipartite case), the edges of any bipartite graph G having maximum degree ∆ can be decomposed in polynomial time into exactly ∆ matchings. Any such matching can be reinterpreted as a secondary structure, possibly with crossing interactions (pseudoknots). These results have two immediate consequences for the pseudoknotted version of the multiple design counting problem.

Corollary 2 (as follows from (Weitz, 2006)). The number of designs compatible with m ≤ 5 pseudoknotted RNA structures can be approximated within any fixed ratio by a deterministic polynomial-time algorithm.

Corollary 3 (as follows from (Cai et al., 2016)). As soon as the number of pseudoknotted RNA structures strictly exceeds 5, #Designs is as hard to approximate as #BIS.

It is worth noting that the #P-hardness of #Designs does not immediately imply the hardness of generating a valid design uniformly at random, as demonstrated constructively by Jerrum, Valiant and Vazirani (Jerrum et al., 1986). However, in the same work, the authors establish a strong connection between the complexity of approximate counting and that of uniform random generation. Namely, they showed that, for problems associated with self-reducible relations, approximate counting is as hard as (almost) uniform random generation. We conjecture that the (almost) uniform sampling of sequences from multiple structures with pseudoknots is in fact #BIS-hard as soon as the number of input structures strictly exceeds 5, as indicated by Goldberg et al. (2004), motivating even further our parameterized approach.

8 Tree decomposition for RNA design instances in practice

For studying the typically expected treewidths and tree decomposition run times in multi-target design instances, we consider five sets of multi-target RNA design instances of different complexity. Our first set consists of the Modena benchmark instances.

In addition, we generated four sets of instances of increasing complexity. The instances of the sets RF3, RF4, RF5, and RF6 specify, respectively, 3, 4, 5, and 6 target structures for sequence length 100. For each instance (100 instances per set), we generated a set of k (k = 3, . . . , 6) compatible structures as follows:

– Generate a random sequence of length 100;

– Compute its minimum free energy structure (ViennaRNA package);

– Add the new structure to the instance if the resulting base pair dependency graph is bipartite;

– Repeat until k structures are collected.
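The steps above can be sketched as follows. The folding step is passed in as a callable `fold` (a hypothetical parameter; in practice it would wrap a minimum free energy predictor such as the one in the ViennaRNA package), so that the sketch itself only depends on the standard library. The helper names are assumptions made for this example.

```python
import random
from collections import deque


def base_pairs(dot_bracket):
    """Parse a dot-bracket string into a list of (i, j) base pairs."""
    stack, pairs = [], []
    for i, c in enumerate(dot_bracket):
        if c == "(":
            stack.append(i)
        elif c == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def is_compatible(structures):
    """True iff the union of all base pairs forms a bipartite dependency graph."""
    n = len(structures[0])
    adj = [set() for _ in range(n)]
    for s in structures:
        for i, j in base_pairs(s):
            adj[i].add(j)
            adj[j].add(i)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def generate_instance(k, length, fold, rng=None, max_tries=10000):
    """Collect k structures of random sequences that keep the graph bipartite."""
    rng = rng or random.Random()
    structures = []
    for _ in range(max_tries):
        sequence = "".join(rng.choice("ACGU") for _ in range(length))
        candidate = fold(sequence)  # expected to return a dot-bracket string
        if is_compatible(structures + [candidate]):
            structures.append(candidate)
        if len(structures) == k:
            return structures
    raise RuntimeError("could not collect k compatible structures")
```

Whether duplicate structures were filtered out in the original experiments is not stated; this sketch accepts them.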

For each instance, we generated the dependency graphs in the base pair model and in the stacking model. Then, we performed tree decomposition (using strategy “GreedyFillIn” of LibTW (van Dijk et al., 2006)) on each dependency graph. The obtained treewidths are reported in Fig. 3, while Fig. 4 shows the corresponding run-times of the tree decomposition. Fig. 5 shows the tree decompositions for an example instance from set RF3.

Fig. 3. Treewidths for multi-target RNA design instances of different complexity. Distributions of treewidths shown as boxplots for the base pair (left) and stacking model (right). [Plot omitted; x-axis: instance sets Modena, RF3, RF4, RF5, RF6; y-axis: treewidth.]

9 Expressing higher arity functions as cliques in the dependency graph

Our implementation relies on external tools for the tree decomposition. Therefore it is not trivial that the weight functions (i.e. terms of the energy function) can be captured within the tree decompositions returned by a specific tool. In other words, it is not immediately clear how one can construct a cluster tree from an arbitrary tree decomposition for a given network.

Crucially, one can show that, as long as the arguments dep(f) of a function f are materialized by a clique within the network N, there exists at least one node in the tree decomposition that contains dep(f), thus allowing the evaluation of f.

Lemma 1. Let G = (V, E) be an undirected graph, and (T, χ) be a tree decomposition for G. For each clique C ⊂ V, there exists a node n ∈ T such that C ⊂ χ(n).

Proof: For the sake of simplicity, let us consider T as rooted on an arbitrary node, and denote by tverts(n) the set of vertices found in n or its descendants, namely

$$\mathrm{tverts}(n) = \chi(n) \cup \bigcup_{n' \in \mathrm{Children}(n)} \mathrm{tverts}(n').$$

Now consider (one of) the deepest node(s) n ∈ T such that C ⊂ tverts(n). We are going to prove that C ⊂ χ(n) by contradiction, by showing that C ⊄ χ(n) implies that T is not a tree decomposition.

Indeed, let us assume that C ⊄ χ(n). Then there exists a vertex v′ ∈ C with v′ ∉ χ(n), whose presence in tverts(n) stems from its presence in some descendant of n. Let us denote by n′ ≠ n the closest descendant of n such that v′ ∈ χ(n′). Now, consider a vertex v ∈ C such that v ∈ tverts(n) and v ∉ tverts(n′). Such a vertex always exists, otherwise n′, and not n, would be the deepest node such that C ⊂ tverts(n′). Since v ∉ tverts(n′), neither n′ nor its descendants may contain v. On the other hand, from the definition of the tree decomposition, none of the parents or siblings of n′ may contain v′. It follows that there is no node n″ ∈ T whose vertex set χ(n″) includes, at the same time, v and v′. Since both v and v′ belong to a clique, one has {v, v′} ∈ E. The absence of a node in T capturing an edge in E contradicts our initial assumption that T is a tree decomposition for G.

We conclude that n is such that C ⊂ χ(n).

Fig. 4. Computation time of tree decompositions for multi-target RNA design instances of different complexity. Distributions of times (in ms/instance) shown as boxplots for the base pair (left) and stacking model (right). [Plot omitted; x-axis: instance sets Modena, RF3, RF4, RF5, RF6; y-axis: time (ms).]

This result implies that classic tree decomposition algorithms and, more importantly, implementations can be directly re-used to produce decompositions that capture energy models of arbitrary complexity, i.e. with functions of arbitrary arity. Indeed, it suffices to add a clique, involving the parameters of the function, and Lemma 1 guarantees that the tree decomposition will feature one node to which the function can be associated.
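A minimal sketch of this use of Lemma 1: the dependencies of a function are added to the graph as a clique before decomposition, and the function is afterwards attached to any bag containing all of them. The bags are represented as a dictionary from (hypothetical) node names to position sets; both helper names are assumptions made for this example.

```python
from itertools import combinations


def clique_edges(dependencies):
    """Edges to add to the dependency graph so that dep(f) forms a clique."""
    return set(combinations(sorted(dependencies), 2))


def bag_containing(dependencies, bags):
    """Return a tree node whose bag contains all dependencies, or None."""
    needed = set(dependencies)
    for node, bag in bags.items():
        if needed <= bag:
            return node
    return None
```

For a ternary function on positions 1, 2 and 3, `clique_edges` yields the three edges to add; in a decomposition with bags {0, 1}, {1, 2, 3} and {3, 4}, the function is attached to the second bag. A `None` result for a clique would indicate that the bags do not form a valid tree decomposition of the augmented graph.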

10 Correctness of the FPT partition function algorithm

Theorem 2 (Correctness of Alg. 1). As computed by Alg. 1 for cluster tree (T, χ, φ), the messages $m_{u\to v}$ yield the partition functions of the subtree of u for the partial sequences $\bar S \in PS(\mathrm{sep}(u, v))$, i.e. the messages satisfy

$$m_{u\to v}\llbracket \bar S \rrbracket = \sum_{\bar S' \in PS(\chi(T_r(u)) - \mathrm{sep}(u,v))}\ \prod_{f \in \phi(T_r(u))} \exp\bigl(-\beta f\llbracket \bar S' \oplus \bar S \rrbracket\bigr), \tag{4}$$

where $\chi(T_r(u))$ denotes all χ-assigned positions of nodes in $T_r(u)$ and, respectively, $\phi(T_r(u))$ all φ-assigned functions.

Fig. 5. Example instance from RF3 (top) with its tree decompositions in the base pair (left) and stacking model (right). The respective treewidths are 2 and 12. [Figure omitted: target structures and drawings of the tree decompositions.]

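Proof: Note that in more concise notation, Alg. 1 computes messages such that

$$m_{u\to v}\llbracket \bar S \rrbracket := \sum_{\bar S' \in PS(\mathrm{diff}(u\to v))}\ \prod_{f\in\phi(u)} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr) \prod_{(w\to u)\in T} m_{w\to u}\llbracket \bar S'\oplus\bar S\rrbracket. \tag{5}$$

Proof by induction on T. If u is a leaf, then $\chi(u) = \chi(T_r(u))$ and there are no messages sent to u, so that Eq. (5) coincides with Eq. (4). Otherwise, since the edges are processed in postorder, u received from its children $w_1, \ldots, w_q$ the messages $m_{w_1\to u}, \ldots, m_{w_q\to u}$, which satisfy Eq. (4) (induction hypothesis). Let $\bar S \in PS(\mathrm{sep}(u, v))$; then, $m_{u\to v}\llbracket \bar S \rrbracket$ is computed by the algorithm according to Eq. (5). We rewrite as follows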

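$$
\begin{aligned}
&\sum_{\bar S' \in PS(\mathrm{diff}(u\to v))}\ \prod_{f\in\phi(u)} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr) \prod_{(w\to u)\in T} m_{w\to u}\llbracket \bar S'\oplus\bar S\rrbracket \\
&= \sum_{\bar S' \in PS(\chi(u)-\mathrm{sep}(u,v))}\ \prod_{f\in\phi(u)} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr) \prod_{i=1}^{q} m_{w_i\to u}\llbracket \bar S'\oplus\bar S\rrbracket \\
&=_{\mathrm{IH}} \sum_{\bar S' \in PS(\chi(u)-\mathrm{sep}(u,v))}\ \prod_{f\in\phi(u)} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr) \prod_{i=1}^{q}\ \sum_{\bar S'' \in PS(\chi(T_r(w_i))-\mathrm{sep}(w_i,u))}\ \prod_{f\in\phi(T_r(w_i))} \exp\bigl(-\beta f\llbracket \bar S''\oplus\bar S'\oplus\bar S\rrbracket\bigr) \\
&=_{(*)} \sum_{\bar S' \in PS(\chi(u)-\mathrm{sep}(u,v))}\ \sum_{\bar S'' \in PS\left(\bigcup_{i=1}^{q} \chi(T_r(w_i))-\mathrm{sep}(w_i,u)\right)}\ \prod_{f\in\phi(u)} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr) \prod_{i=1}^{q}\ \prod_{f\in\phi(T_r(w_i))} \exp\bigl(-\beta f\llbracket \bar S''\oplus\bar S'\oplus\bar S\rrbracket\bigr) \\
&=_{(**)} \sum_{\bar S' \in PS(\chi(T_r(u))-\mathrm{sep}(u,v))}\ \prod_{f\in\phi(T_r(u))} \exp\bigl(-\beta f\llbracket \bar S'\oplus\bar S\rrbracket\bigr)
\end{aligned}
$$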

To see (*) and (**), we observe:

– The sets $\chi(u) - \mathrm{sep}(u, v)$ and $\chi(T_r(w_i)) - \mathrm{sep}(w_i, u)$ are all disjoint due to Def. 1, property 2. First, this property implies that any shared position between the subtrees of $w_i$ and $w_j$ must be in $\chi(w_i)$, $\chi(w_j)$ and $\chi(u)$, thus the sets of positions $\chi(T_r(w_i)) - \mathrm{sep}(w_i, u)$ are disjoint. Second, if a position of $\chi(T_r(w_i))$ occurs in $\chi(u)$, it must occur in $\chi(w_i)$ and consequently in $\mathrm{sep}(w_i, u)$.

– The union of the sets $\chi(u) - \mathrm{sep}(u, v)$ and $\chi(T_r(w_i)) - \mathrm{sep}(w_i, u)$ is $\chi(T_r(u)) - \mathrm{sep}(u, v)$.
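The message recursion of Eq. (5) can be sketched compactly and checked against a brute-force sum over all sequences. This is an illustrative sketch, not the authors' Alg. 1: the cluster tree is assumed to be given as dictionaries `bags` (χ), `children` (tree structure, rooted) and `phi` (functions as pairs of a dependency tuple and a callable), all of which are hypothetical representations chosen for this example.

```python
import itertools
import math

ALPHABET = "ACGU"


def assignments(positions):
    """Yield every partial sequence (dict position -> nucleotide) over positions."""
    positions = sorted(positions)
    for values in itertools.product(ALPHABET, repeat=len(positions)):
        yield dict(zip(positions, values))


def message(u, parent, bags, children, phi, beta):
    """Compute m_{u -> parent}, keyed by the nucleotides on sep(u, parent)."""
    sep = sorted(bags[u] & bags[parent]) if parent is not None else []
    diff = bags[u] - set(sep)
    # Messages from the children of u, each with its own separator
    incoming = [(sorted(bags[w] & bags[u]), message(w, u, bags, children, phi, beta))
                for w in children.get(u, [])]
    msg = {}
    for s_bar in assignments(sep):
        total = 0.0
        for s_prime in assignments(diff):
            full = {**s_bar, **s_prime}
            term = 1.0
            for deps, f in phi.get(u, []):
                term *= math.exp(-beta * f(*(full[p] for p in deps)))
            for sep_w, m_w in incoming:
                term *= m_w[tuple(full[p] for p in sep_w)]
            total += term
        msg[tuple(s_bar[p] for p in sep)] = total
    return msg


def partition_function(root, bags, children, phi, beta=1.0):
    """Partition function: the message of the root towards an empty separator."""
    return message(root, None, bags, children, phi, beta)[()]


def brute_force(n, functions, beta=1.0):
    """Reference value: sum of Boltzmann weights over all 4**n sequences."""
    total = 0.0
    for seq in itertools.product(ALPHABET, repeat=n):
        energy = sum(f(*(seq[p] for p in deps)) for deps, f in functions)
        total += math.exp(-beta * energy)
    return total
```

On a small cluster tree, the value returned by `partition_function` should agree with `brute_force` up to floating-point error, which is the statement of Theorem 2 applied at the root.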

11 General complexity of the partition function computation by cluster tree elimination and generation of samples

Proposition 2. Given a cluster tree (T, χ, φ), T = (V, E) of the RNA design network N = (X, F) with treewidth w and maximum separator size s, computing the partition function by Algorithm 1 takes $O((|F| + |V|) \cdot 4^{w+1})$ time and $O(|V| \cdot 4^{s})$ space (for storing all messages). The sampling step has time complexity of $O((|F| + |V|) \cdot 4^{D})$ per sample.

Proof (Proposition 2): Let $d_u$ denote the degree of node u in T, and $s_u$ the size of the separator between u and its parent in T rooted at r. For each node in the cluster tree, Alg. 1 computes one message by combining $(|\phi(u)| + d_u - 1)$ functions, each time enumerating $4^{|\chi(u)|}$ combinations; Alg. 2 computes $4^{|\chi(u)| - s_u}$ partition functions, each time combining $(|\phi(u)| + d_u - 1)$ functions. $4^{|\chi(u)|}$ is bounded by $4^{w+1}$ and $4^{|\chi(u)| - s_u}$ by $4^{D}$; moreover $\sum_{u \in V} (|\phi(u)| + d_u - 1) = |F| + |V| - 1$.

12 Exploiting constraint consistency to reduce the complexity

While Alg. 1 computes message values for all possible combinations of nucleotides for the positions in a node, we observe here that many such combinations are not required for computing all relevant partition functions. In particular, the algorithm can be restricted to consider only valid combinations, satisfying the (hard) constraints induced by valid base pairs.

RNA design introduces constraints due to the requirement of canonical (aka complementary) base pairing, which can be exploited in a particularly simple and effective way to reduce the complexity. As previously noted by Flamm et al. (2001), the base pair complementarity induces a bi-partition of each connected component within the base pair dependency graph, such that the nucleotides assigned to the two sets of nodes in the partition are restricted to values in {A, G} and {C, U} respectively. We call a partial sequence cc-valid, iff its determined positions are consistent with such a separation for all determined positions of the same connected component.

One can now modify Alg. 1, on a tree decomposition (T, χ), such that the message values $m_{u\to v}\llbracket \bar S \rrbracket$ are only computed for cc-valid partial sequences $\bar S \in PS(\mathrm{sep}(u, v))$. Moreover, the loop over $\bar S' \in PS(\mathrm{diff}(u \to v))$ is restricted, such that $\bar S \oplus \bar S'$ are cc-valid.

Analogous restrictions are then implemented in the sampling algorithm Alg. 2. The correctness of the modified algorithm follows from the same induction argument, where the message computation over one node evaluates messages from its children only at cc-valid partial sequences. The result of this computation is a message, which corresponds to the partition function, restricted to cc-valid partial sequences. Since invalid sequences have infinite energy, they do not contribute to the partition function, and the partition function restricted to cc-valid sequences coincides with the initial one.
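The effect of the cc-valid restriction on the enumeration can be illustrated with a small sketch. It assumes that each position comes with the identifier of its connected component and with its side (0 or 1) in the bipartition of that component; the names `component`, `side` and `cc_valid_assignments` are hypothetical.

```python
import itertools

# Parity class of each nucleotide: {A, G} versus {C, U}
PARITY = {"A": 0, "G": 0, "C": 1, "U": 1}


def cc_valid_assignments(bag, component, side):
    """Yield the cc-valid assignments of nucleotides to the positions of a bag.

    component[p]: identifier of the connected component of position p
    side[p]:      side (0 or 1) of p in the bipartition of its component
    """
    positions = sorted(bag)
    for values in itertools.product("ACGU", repeat=len(positions)):
        orientation = {}  # per component: parity class carried by side 0
        valid = True
        for p, nt in zip(positions, values):
            implied = PARITY[nt] ^ side[p]
            if orientation.setdefault(component[p], implied) != implied:
                valid = False
                break
        if valid:
            yield dict(zip(positions, values))
```

For a bag of three positions, two of which lie on opposite sides of the same component, this enumeration keeps 32 of the 64 unrestricted assignments. For clarity the sketch filters the full enumeration; an efficient implementation would generate the cc-valid assignments directly.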

This restriction drastically improves the time complexity. Indeed, for any given node v, the original algorithm sends messages for $\chi(v) := \bar S \oplus \bar S'$ such that $\bar S \in PS(\mathrm{sep}(u, v))$ and $\bar S' \in PS(\mathrm{diff}(u \to v))$, while the modified algorithm only considers cc-valid assignments for χ(v). Remark that, in any connected component cc, assigning some nucleotide to a position reduces cc-valid assignments to (at most) two alternatives for each of the |cc| − 1 remaining positions. It follows that, for a node v featuring positions from #cc(v) distinct connected components $\{cc_1, cc_2, \ldots\}$ in the base pair dependency graph, the number of cc-valid assignments to positions in χ(v) is exactly

$$4^{\#cc(v)} \prod_{i=1}^{\#cc(v)} 2^{|cc_i| - 1} = 2^{\#cc(v)}\, 2^{|\chi(v)|} \in \Theta\bigl(2^{\#cc}\, 2^{w+1}\bigr),$$

for a single node, where #cc is the total number of connected components in the base pair dependency graph, and w is the treewidth of the tree decomposition T. Since the number of nodes in T is in Θ(n), and the number of atomic energy contributions associated with k structures is in O(k n), the overall complexity grows like Θ(n k 2^{#cc} 2^{w+1}