Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests


HAL Id: hal-02403517

https://hal.inria.fr/hal-02403517

Submitted on 11 Dec 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Maher Mallem, Alain Denise, Yann Ponty

To cite this version:

Maher Mallem, Alain Denise, Yann Ponty. Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests. SEQBIM 2019 - Séquences en Bioinformatique, Informatique et Mathématiques, Dec 2019, Marne-la-Vallée, France. ⟨hal-02403517⟩


Maher Mallem¹, Alain Denise²*, Yann Ponty³*

¹Department of Computer Science, ENS Paris-Saclay, Cachan, France

²LRI and I2BC, Université Paris-Sud / Paris-Saclay, Gif-sur-Yvette, France

³LIX, École Polytechnique, Palaiseau, France

*Corresponding authors: alain.denise@u-psud.fr and yann.ponty@lix.polytechnique.fr

Abstract

RNA structure design methods have grown in complexity to cover an increasing scope of application. Recent approaches combine an initial random generation with a local optimization step, and consider both a user-specified secondary structure and sets of mandatory and forbidden substrings.

Although these additional constraints lead to better design results, they may interfere with the local optimization phase. Indeed, forbidden substrings may disrupt the connectivity of the underlying search space, a key property for the success of the local search. A naive connectivity test would explore the whole graph of candidate sequences, and would thus run in time exponential in the sequence length.

In this work, we propose two partial algorithms based on compact graph structures, namely De Bruijn graphs and the Aho-Corasick automaton, allowing the detection of disconnectivity in time independent of the length of the RNA sequence. Tested on random instances, our tests were able to detect the disconnectivity with sensitivity ranging between 35% and 55%, motivating further research.

Keywords

RNA Design – Forbidden Substrings – De Bruijn graphs – Aho-Corasick automaton

1. Introduction

First introduced in [1], the computational design of RiboNucleic Acids (RNA) has been studied extensively over the past decades [2] due to its successful application in a variety of biological contexts [3,4]. Its ultimate goal is the synthesis of molecules that achieve a targeted biological function. In its simplest form, also called inverse folding of RNA, the design problem consists in finding a sequence that adopts a given secondary structure as its Minimum Free Energy (MFE) structure, typically computed using polynomial-time dynamic programming [5]. Given the NP-hardness of the problem [6], recent methods [7,8,9,10] tackle the problem heuristically in two phases: first, an initial seed sequence is sampled from a distribution that captures a relaxed version of the objective function [11]; next, the seed is iteratively refined using a local search strategy [1], eventually inducing a Boltzmann-Gibbs distribution with respect to the final objective function (e.g. the free-energy difference between the MFE structure of the sequence and its first suboptimal structure).


However, realistic applications of design require additional sequence constraints, for instance to avoid undesired interactions within a cellular context. The seed sampling phase can be adapted to avoid a predefined set $F$ of forbidden motifs using formal language constructs [12] or direct dynamic programming [13].

However, to the best of our knowledge, little to no work has been done to assess the impact of forbidden motifs on the local search. Indeed, allowing the local search to violate sequence constraints would lead to very few valid candidate sequences, since an overwhelming proportion of the sequences may (and will, from the monkey/typewriter paradox) feature some forbidden motif during the local search.

On the other hand, enforcing the avoidance of $F$ at each step of the local search may disrupt the connectivity of the search space or, equivalently, the ergodicity of the Markov chain induced by the sequence space and the move set of the local search.

For instance, while designing an RNA of length $n$ over the alphabet $\Sigma = \{\mathrm{A}, \mathrm{U}\}$ with $F = \{\mathrm{AU}, \mathrm{UA}\}$, the only two words avoiding $F$, namely $\mathrm{A}^n$ and $\mathrm{U}^n$, are at Hamming distance $n$. The search space is thus disconnected for any move set inducing changes of bounded Hamming distance $n' < n$. Such a disconnectivity prevents the convergence of the local search, i.e. it rules out any (probabilistic) guarantee to ultimately discover promising candidates whenever such candidates exist.

In this work, we address the efficient algorithmic detection of disconnected search spaces for a given set $F$ of forbidden motifs, a given RNA sequence length $n$ and a given move set. We restrict our attention to $k$-Hamming move sets, consisting of symmetric moves $s \leftrightarrow s'$ where both $s$ and $s'$ avoid $F$, and such that the Hamming distance satisfies $H(s, s') \le k$. A brute-force solution would generate the whole search space as a graph, and check the existence of a single connected component in a highly impractical $O(|\Sigma|^n)$ time complexity. Instead, we exploit the highly structured nature of the problem to propose partial algorithms, based on De Bruijn graphs and Aho-Corasick automata, whose complexity depends on $F$ and $k$, but remains largely independent of $n$.

2. Definition of the problem

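Let $\Sigma$ be an alphabet with $|\Sigma| \ge 2$, and let $n \in \mathbb{N}$, $n \ge 2$, be a sequence length. Denote by $F \subset \Sigma^\star$ the set of forbidden motifs; then $L_{F,n} \subseteq \Sigma^n$ represents the words that do not contain any motif in $F$. Let $m(F) \stackrel{\mathrm{def}}{=} \max_{f \in F} |f|$; we assume that $n \gg m(F)$.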

General problem

Input: Length $n \ge 2$, set $F$ of forbidden motifs, and neighborhood function $\delta : L_{F,n} \to L_{F,n}$

Output: Yes if $G = (L_{F,n}, \delta)$ is (strongly) connected, No otherwise.

Here we restrict our attention to the $k$-Hamming neighborhood $\delta_k$ for some $k \in [1, n]$, defined for any word $w \in L_{F,n}$ as $\delta_k(w) = \{w' \in L_{F,n} \mid H(w, w') \le k\}$, where $H(w, w')$ is the classic Hamming distance between two words $w, w' \in \Sigma^n$.

Sincek-Hamming neighborhoods are symmetric, strong connectivity and connectivity are equivalent. The central question, addressed in the following, becomes:

Is the Hamming graph $G_{F,n,k} \stackrel{\mathrm{def}}{=} (L_{F,n}, \delta_k)$ connected?
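As a baseline, the brute-force test mentioned in the introduction can be sketched as follows. This is only an illustrative sketch: the function names (`valid_words`, `hamming`, `hamming_graph_connected`) are introduced here and do not come from the paper, and treating an empty search space as connected is a convention chosen for this example.

```python
from collections import deque
from itertools import product


def valid_words(sigma, forbidden, n):
    """Enumerate L_{F,n}: words of length n over sigma avoiding every motif."""
    for letters in product(sigma, repeat=n):
        word = "".join(letters)
        if not any(f in word for f in forbidden):
            yield word


def hamming(u, v):
    """Hamming distance between two words of equal length."""
    return sum(a != b for a, b in zip(u, v))


def hamming_graph_connected(sigma, forbidden, n, k):
    """Brute-force connectivity of G_{F,n,k}; exponential in n."""
    words = list(valid_words(sigma, forbidden, n))
    if not words:
        return True  # convention of this sketch: empty graph counts as connected
    seen = {words[0]}
    queue = deque([words[0]])
    while queue:  # breadth-first search from an arbitrary valid word
        u = queue.popleft()
        for v in words:
            if v not in seen and hamming(u, v) <= k:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(words)


if __name__ == "__main__":
    # Example from the introduction: only A^n and U^n avoid {AU, UA}.
    print(list(valid_words("AU", ["AU", "UA"], 4)))
    print(hamming_graph_connected("AU", ["AU", "UA"], 4, 1))
```

The enumeration alone visits $|\Sigma|^n$ candidate words, which is why this approach is only usable for very small $n$.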


Figure 1. De Bruijn graph $DB_F$ for $F = \{\mathrm{ACA}, \mathrm{CAAA}, \mathrm{AAC}\}$ and $\Sigma = \{\mathrm{A}, \mathrm{C}\}$

3. Algorithms

We derive a first partial disconnectivity test from a simple property of De Bruijn graphs. Then, using an equivalence relation on the nodes of the De Bruijn graph, we infer a similar partial disconnectivity test on a variant of the Aho-Corasick automaton, which runs in time linear in the length of the desired sequence.

3.1 Detecting disconnectivity using the De Bruijn graph of m(F)-mers

We use variants of the De Bruijn graph [14] to infer the disconnectivity of $G_{F,n,k}$.

Definition 1. Given a set $F$ of forbidden motifs, we define:

• The De Bruijn (di)graph $DB_F = (V, E)$ of $F$, such that $V := L_{F,m(F)}$, the valid sequences of length $m(F)$, and $E := \{(a.w, w.b) \in L_{F,m(F)}^2 \mid a, b \in \Sigma\}$;

• The pruned De Bruijn graph $DB_{F,n}$, obtained by removing any connected component in $DB_F$ that cannot generate any word of length $n$.

$DB_{F,n}$ can be built in $O(|\Sigma|^{m(F)+1})$ time, and detecting unproductive connected components (CC) to build $DB_{F,n}$ can be done in $O(|V|)$ time using topological sorting to either detect a cycle (→ keep CC), or determine the length $n'$ of the longest path (→ keep CC only if $n' \ge n - m(F) - 1$).

Note that $DB_F$ has $O(|\Sigma|^{m(F)})$ nodes, and is typically much smaller than the Hamming graph $G_{F,n,k}$ ($O(|\Sigma|^n)$ nodes). Nevertheless, all valid sequences of length $n$ are represented in $DB_F$ as paths of length $n - m(F)$. For example, in Figure 1 the valid sequence CACCAA corresponds to the path CACC → ACCA → CCAA.
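The construction of Definition 1 and the word-to-path correspondence can be illustrated with a short sketch. The helper names `de_bruijn_graph` and `word_to_path` are hypothetical, and the graph is stored as a plain successor dictionary for simplicity.

```python
from itertools import product


def de_bruijn_graph(sigma, forbidden):
    """Return (m(F), successor lists) for DB_F.

    Nodes are the words of length m(F) avoiding F; there is an edge
    a.w -> w.b whenever both endpoints are valid nodes.
    """
    m = max(len(f) for f in forbidden)
    nodes = {"".join(p) for p in product(sigma, repeat=m)}
    nodes = {v for v in nodes if not any(f in v for f in forbidden)}
    succ = {v: [v[1:] + b for b in sigma if v[1:] + b in nodes] for v in nodes}
    return m, succ


def word_to_path(word, m, succ):
    """Map a valid word to its path in DB_F (the list of its m(F)-mers)."""
    if len(word) < m:
        raise ValueError("word is shorter than m(F)")
    path = [word[i:i + m] for i in range(len(word) - m + 1)]
    # Consecutive windows overlap by construction, so the path exists
    # exactly when every window is a valid node.
    if any(v not in succ for v in path):
        raise ValueError(f"{word!r} contains a forbidden motif")
    return path


if __name__ == "__main__":
    m, succ = de_bruijn_graph("AC", ["ACA", "CAAA", "AAC"])
    print(word_to_path("CACCAA", m, succ))  # path with 6 - m(F) = 2 edges
```

On the example of Figure 1, the node AAAA only has itself as successor and predecessor, which is what makes this instance interesting for the disconnectivity test.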

Lemma 1. Upon reading a sequence of letters $a_1.a_2 \ldots a_j$, $j \ge m(F)$, from two distinct nodes $u, v \in DB_F$, the two paths merge at some index $i \le m(F)$.

Figure 2. $\widehat{AC}_{F,n}$ and $DB_{F,n}$ when $F = \{\mathrm{ACA}, \mathrm{AAC}, \mathrm{CAAA}\}$ and $\Sigma = \{\mathrm{A}, \mathrm{C}\}$. (a) $AC_F$; (b) $DB_F$ and the equivalence classes.

Intuitively, $DB_F$ can be seen as an automaton whose states encode the suffixes of length $m(F)$. Thus, after reading $m(F)$ characters the resulting state is $a_1 \ldots a_{m(F)}$, irrespective of the starting state, so the paths merge at index $m(F)$ or before. This means that if we follow two paths in different connected components of $DB_F$, the sequences of letters must diverge at least once every $m(F)$ steps, which implies an increasing Hamming distance between the corresponding valid words. This holds for any pair of paths in $DB_F$ generated from different connected components, leading to the following result.

Theorem 2. $\forall n \ge (k+1) \times m(F)$, $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.
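The counting step behind this statement can be spelled out as follows. This is a sketch of one possible reading of the argument, not the authors' formal proof; the block decomposition and the symbol $B$ are introduced here for illustration only.

```latex
% Sketch only. Assumptions: w, w' \in L_{F,n} are spelled by paths lying in two
% different connected components of DB_{F,n}; B is a symbol introduced here.
\begin{align*}
  B &= \left\lfloor \frac{n}{m(F)} \right\rfloor
     && \text{disjoint blocks of } m(F) \text{ consecutive positions in } \{1, \dots, n\}, \\
  w_{[i,\, i+m(F)-1]} &\neq w'_{[i,\, i+m(F)-1]}
     && \text{for each block: a shared block would be a node visited by both paths,} \\
  H(w, w') &\geq B \geq k + 1 > k
     && \text{one mismatch per block, and } n \geq (k+1) \times m(F).
\end{align*}
```

Under these assumptions no $k$-Hamming move links words spelled in different components, and since every component kept in $DB_{F,n}$ spells at least one word of length $n$, $G_{F,n,k}$ has at least two connected components.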

The implication is not an equivalence, as it is possible to build instances where $G_{F,n,k}$ is disconnected while $DB_{F,n}$ remains connected. It nevertheless suggests a first algorithm for a partial disconnectivity test within $G_{F,n,k}$: build $DB_{F,n}$ and report its connectivity. It has overall time complexity in $O(|\Sigma|^{m(F)})$, i.e. no longer exponential in the sequence length $n$, yet it remains exponential in the length of the forbidden substrings.
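A minimal sketch of this first test is given below. It makes several assumptions that are not spelled out in the text: connected components are taken to be weakly connected components, a component is considered productive when it admits a walk of $n - m(F)$ edges, and the topological sort is replaced by a simpler iterative pruning. The function name `de_bruijn_disconnected` is hypothetical.

```python
from itertools import product


def de_bruijn_disconnected(sigma, forbidden, n, k):
    """Partial test: True certifies that G_{F,n,k} is disconnected,
    False means that no conclusion can be drawn."""
    m = max(len(f) for f in forbidden)
    if n < (k + 1) * m:
        raise ValueError("Theorem 2 requires n >= (k + 1) * m(F)")

    nodes = ["".join(p) for p in product(sigma, repeat=m)]
    nodes = [v for v in nodes if not any(f in v for f in forbidden)]
    node_set = set(nodes)
    succ = {v: {v[1:] + b for b in sigma} & node_set for v in nodes}

    # Weakly connected components via union-find (assumption of this sketch).
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u in nodes:
        for v in succ[u]:
            parent[find(u)] = find(v)
    components = {}
    for v in nodes:
        components.setdefault(find(v), set()).add(v)

    def productive(comp):
        # After t rounds, `alive` holds the nodes starting a walk of t edges.
        # A walk of len(comp) edges repeats a node, hence implies a cycle and
        # arbitrarily long walks, so the number of rounds does not grow with n.
        alive = set(comp)
        for _ in range(min(n - m, len(comp))):
            alive = {v for v in alive if succ[v] & alive}
        return bool(alive)

    return sum(productive(c) for c in components.values()) >= 2


if __name__ == "__main__":
    print(de_bruijn_disconnected("AC", ["ACA", "CAAA", "AAC"], 8, 1))
```

Capping the number of pruning rounds at the component size keeps the running time independent of $n$, in the spirit of the complexity claim above.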

3.2 Detecting disconnectivity using the Aho-Corasick automaton of F

Next we attempt to exploit the Nerode equivalence, with respect to the suffix language, of some states in $DB_{F,n}$.

Definition 3. Define the Aho-Corasick automaton $AC_F$ as the DFA having state set $Q = \{u \text{ proper prefix of some } f \in F\}$, initial state $q_I = \{\varepsilon\}$, and accepting all words ending in $Q$. Transitions are $\Delta = \Delta_f \uplus \Delta_b$, with:

• $\Delta_f$ the forward edges: $\{(u, a, u.a) \mid a \in \Sigma \wedge u, u.a \in Q\}$ (i.e. the prefix tree of $F$)

• $\Delta_b$ the backward edges: $\{(u, a, v) \mid u.a \notin Q \wedge v \in Q \text{ longest suffix of } u.a\}$

With this definition of $AC_F$, a word $w$ is accepted iff no $f \in F$ is a substring of $w$, i.e. $AC_F$ recognizes the complement language of the usual Aho-Corasick automaton [15]. Moreover, $AC_F$ can be built in time $O(|\Sigma| \times |F| \times m(F))$.
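The automaton of Definition 3 can be sketched with a naive construction. The names `build_ac` and `accepts` are hypothetical, motifs are assumed to be non-empty, and this sketch assumes that a transition is simply absent when reading a letter would complete a forbidden motif. The suffix search is quadratic and does not use failure links, so it does not reach the stated construction time.

```python
def build_ac(sigma, forbidden):
    """Return (states, delta) for a complement Aho-Corasick automaton.

    States are the proper prefixes of the motifs (the empty word included).
    A missing entry in delta means that the letter completes a motif.
    """
    states = {f[:i] for f in forbidden for i in range(len(f))}
    delta = {}
    for u in states:
        for a in sigma:
            ua = u + a
            if any(ua.endswith(f) for f in forbidden):
                continue  # reading a would create a forbidden substring
            # Longest suffix of u.a that is a state: u.a itself for a forward
            # edge, a strictly shorter suffix for a backward edge.
            for i in range(len(ua) + 1):
                if ua[i:] in states:
                    delta[(u, a)] = ua[i:]
                    break
    return states, delta


def accepts(word, delta):
    """True iff the word contains no forbidden motif."""
    state = ""
    for a in word:
        state = delta.get((state, a))
        if state is None:
            return False
    return True


if __name__ == "__main__":
    states, delta = build_ac("AC", ["ACA", "CAAA", "AAC"])
    print(sorted(states))
    print(accepts("CACCAA", delta), accepts("CCAAA", delta))
```

A production implementation would compute the backward edges with failure links, as in the classical Aho-Corasick construction [15].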

Definition 4. We define:

• $\widehat{AC}_F$, obtained from $AC_F$ by removing states that are no longer visited after $m(F)$ steps;

• $\widehat{AC}_{F,n}$ as the restriction of $\widehat{AC}_F$ to components producing words of length $n$.


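| $\lvert\Sigma\rvert$ | $m(F)$ | $n$ | #Samples | #$G_{F,n,1}$ disconnected | Recall $DB_{F,n}$ (%) | Recall $\widehat{AC}_{F,n}$ (%) |
|---|---|---|---|---|---|---|
| 2 | 5 | 10 | 100 000 | 36 630 | 49.5 | 47.1 |
| 2 | 5 | 11 | 100 000 | 35 893 | 48.2 | 46.2 |
| 3 | 5 | 10 | 10 000 | 4 395 | 53.9 | 49.2 |
| 4 | 3 | 6 | 25 000 | 9 447 | 37.6 | 34.3 |
| 4 | 3 | 7 | 10 000 | 3 728 | 37.9 | 35.7 |
| 4 | 4 | 8 | 4 000 | 1 904 | 54.3 | 50.1 |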

Figure 3. Recall (TP/P) of our disconnectivity tests for various sets of parameters

As illustrated in Figure 2, grouping together nodes in $DB_F$ having the same prefix/suffix overlaps with forbidden substrings, we get exactly $\widehat{AC}_F$. This equivalence relation and Theorem 2 imply the following:

Theorem 5. $\forall n \ge (k+1) \times m(F)$, one has

$\widehat{AC}_{F,n}$ disconnected $\Rightarrow$ $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.

Again, the new implication is only one-way: $DB_{F,n}$ may be disconnected while $\widehat{AC}_{F,n}$ remains connected. Still, building $\widehat{AC}_{F,n}$ and testing its disconnectivity represents an additional partial disconnectivity test for $G_{F,n,k}$. While this variant is expected to detect fewer cases of disconnectivity, its complexity is significantly better, with the overall construction of $\widehat{AC}_{F,n}$ now only requiring $O(|\Sigma| \times |F| \times m(F))$ time.
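The Aho-Corasick-based test can be sketched along the same lines. The pruning below is only one possible reading of Definition 4: states are kept when they can still be visited at some step $t \ge m(F)$, components are weakly connected components, and a component is productive when it admits a walk of $n - m(F)$ edges. The function name `ac_disconnected` is hypothetical and the automaton is built naively, without failure links.

```python
def ac_disconnected(sigma, forbidden, n, k):
    """Partial test: True certifies disconnection, False is inconclusive."""
    m = max(len(f) for f in forbidden)
    if n < (k + 1) * m:
        raise ValueError("the theorem requires n >= (k + 1) * m(F)")

    # Complement Aho-Corasick automaton: no transition when a motif completes.
    states = {f[:i] for f in forbidden for i in range(len(f))}
    succ = {u: set() for u in states}
    for u in states:
        for a in sigma:
            ua = u + a
            if any(ua.endswith(f) for f in forbidden):
                continue
            succ[u].add(next(ua[i:] for i in range(len(ua) + 1)
                             if ua[i:] in states))

    # Pruning (assumed reading of Definition 4): states reached after exactly
    # m(F) steps, closed under transitions.
    frontier = {""}
    for _ in range(m):
        frontier = {v for u in frontier for v in succ[u]}
    kept, stack = set(frontier), list(frontier)
    while stack:
        for v in succ[stack.pop()]:
            if v not in kept:
                kept.add(v)
                stack.append(v)

    # Weakly connected components of the kept states (union-find).
    parent = {u: u for u in kept}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u in kept:
        for v in succ[u]:
            parent[find(u)] = find(v)
    components = {}
    for u in kept:
        components.setdefault(find(u), set()).add(u)

    def productive(comp):
        # Nodes starting a walk of t edges, for t up to min(n - m, |comp|).
        alive = set(comp)
        for _ in range(min(n - m, len(comp))):
            alive = {u for u in alive if succ[u] & alive}
        return bool(alive)

    return sum(productive(c) for c in components.values()) >= 2


if __name__ == "__main__":
    print(ac_disconnected("AC", ["ACA", "CAAA", "AAC"], 8, 1))
```

On the instance of Figure 2, the state AA ends up alone in its component, mirroring the isolated node AAAA of the De Bruijn graph.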

4. Results and Discussion

Both our partial tests were executed on randomly generated sets of forbidden substrings with various parameters. Since the connectivity of the Hamming graph $G_{F,n,k}$ had to be checked on every instance to establish a ground truth, tests could only be conducted with $k = 1$ and small $n$ and $m(F)$ values. The recall (#DetectedDisconnections/#Disconnections, or TP/P) results are given in Figure 3.

As expected, the Aho-Corasick-based test always performs slightly worse than the De Bruijn-based one, but not by a large margin: the recall gap stays below 5 percentage points in our empirical experiments (Figure 3). With such a small loss in recall, the Aho-Corasick-based variant seems to represent a natural first choice in most cases. Recall values range between 35% and 55% for both variants, which is already significant but could probably be improved by exploring subtler relationships between the Aho-Corasick automaton and the Hamming graph.
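The gap between the two tests can be recomputed directly from the values reported in Figure 3. The snippet below is a small illustrative sketch; the variable names are introduced here, and the rows are transcribed from the figure rather than regenerated.

```python
# Rows of Figure 3:
# (|Sigma|, m(F), n, #samples, #disconnected, recall DB (%), recall AC (%))
rows = [
    (2, 5, 10, 100_000, 36_630, 49.5, 47.1),
    (2, 5, 11, 100_000, 35_893, 48.2, 46.2),
    (3, 5, 10, 10_000, 4_395, 53.9, 49.2),
    (4, 3, 6, 25_000, 9_447, 37.6, 34.3),
    (4, 3, 7, 10_000, 3_728, 37.9, 35.7),
    (4, 4, 8, 4_000, 1_904, 54.3, 50.1),
]

# Recall gap, in percentage points, between the De Bruijn-based and the
# Aho-Corasick-based tests for each parameter set.
gaps = [round(r_db - r_ac, 1) for *_, r_db, r_ac in rows]

if __name__ == "__main__":
    for (size, m, n, *_), gap in zip(rows, gaps):
        print(f"|Sigma|={size} m(F)={m} n={n}: gap = {gap} points")
    print("smallest gap:", min(gaps), "largest gap:", max(gaps))
```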

This preliminary work leaves open several questions of general interest, including:

• What are the shared properties of disconnected instances associated with a connected $\widehat{AC}_{F,n}$? With a connected $DB_{F,n}$?

• Is the problem NP-hard in general?

• How to generalize our constructs to mandatory motifs? To any general automaton generating sequences?

• How to design move sets ensuring connectivity for a given F?


References

[1] Ivo Hofacker, Walter Fontana, Peter Stadler, Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2):167–188, Feb 1994.

[2] Alexander Churkin, Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. Design of RNAs: comparing programs for inverse RNA folding. Briefings in Bioinformatics, 19(2):350–358, 01 2017.

[3] Sven Findeiß, Manja Wachsmuth, Mario Mörl, and Peter F Stadler. Design of transcription regulating riboswitches. In Methods in enzymology, volume 550, pages 1–22. Elsevier, 2015.

[4] Ryota Yamagami, Mohammad Kayedkhordeh, David H Mathews, and Philip C Bevilacqua. Design of highly active double-pseudoknotted ribozymes: a combined computational and experimental study. Nucleic acids research, 47(1):29–42, 2018.

[5] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133–148, 1981.

[6] Édouard Bonnet, Paweł Rzążewski, and Florian Sikora. Designing RNA secondary structures is hard. In Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, pages 248–250, 2018.

[7] Joseph N Zadeh, Brian R Wolfe, and Niles A Pierce. Nucleic acid sequence design via efficient ensemble defect optimization. Journal of Computational Chemistry, 32(3):439–52, 2011.

[8] Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. incaRNAfbinv: a web server for the fragment-based design of RNA sequences. Nucleic acids research, 44(W1):W308–W314, 2016.

[9] Stefan Hammer, Birgit Tschiatschek, Christoph Flamm, Ivo L Hofacker, and Sven Findeiß. RNAblueprint: flexible multiple target nucleic acid sequence design. Bioinformatics, 33(18):2850–2858, 04 2017.

[10] Stefan Hammer, Wei Wang, Sebastian Will, and Yann Ponty. Fixed-parameter tractable sampling for RNA design with multiple target structures. BMC bioinformatics, 20(1):209, 2019.

[11] Vladimir Reinharz, Yann Ponty, and Jérôme Waldispühl. A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):i308–i315, 2013.

[12] Yu Zhou, Yann Ponty, Stéphane Vialette, Jérôme Waldispühl, Yi Zhang, and Alain Denise. Flexible RNA design under structure and sequence constraints using formal languages. In ACM-BCB - ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics - 2013, Bethesda, Washington DC, United States, September 2013.

[13] Vincent Le Gallic, Alain Denise, and Yann Ponty. Résultats algorithmiques pour le design d’ARN avec contraintes de séquence. In SeqBio 2015, pages 26–31, Orsay, France, November 2015.

[14] N. G. De Bruijn. A combinatorial problem. Proc. Koninklijke Nederlandse Academie van Wetenschappen, 49:758–764, 1946.

[15] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, June 1975.