HAL Id: hal-02403517
https://hal.inria.fr/hal-02403517
Submitted on 11 Dec 2019
Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial
disconnectivity tests
Maher Mallem, Alain Denise, Yann Ponty
To cite this version:
Maher Mallem, Alain Denise, Yann Ponty. Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests. SEQBIM 2019 - Séquences en Bioinformatique, Informatique et Mathématiques, Dec 2019, Marne-la-Vallée, France. ⟨hal-02403517⟩
Forbidden substrings and the connectivity of the Hamming graph of RNA sequences:
partial disconnectivity tests
Maher Mallem¹, Alain Denise²*, Yann Ponty³*
¹Department of Computer Science, ENS Paris-Saclay, Cachan, France
²LRI and I2BC, Université Paris-Sud / Paris-Saclay, Gif-sur-Yvette, France
³LIX, École Polytechnique, Palaiseau, France
*Corresponding authors: alain.denise@u-psud.fr and yann.ponty@lix.polytechnique.fr
Abstract
RNA structure design methods have grown in complexity to cover an increasing scope of application. Recent approaches combine an initial random generation with a local optimization step, and consider both a user-specified secondary structure and sets of mandatory and forbidden substrings.
Although these additional constraints lead to better design results, they may interfere with the local optimization phase. Indeed, forbidden substrings may disrupt the connectivity of the underlying search space, a key property for the success of the local search. A naive connectivity test would explore the whole graph of candidate sequences, and would thus take time exponential in the sequence length.
In this work, we propose two partial algorithms based on compact graph structures (De Bruijn graphs and the Aho-Corasick automaton), which detect disconnectivity in time independent of the length of the RNA sequence. On random instances, our tests detected disconnectivity with a sensitivity ranging between 35% and 55%, motivating further research.
Keywords
RNA Design – Forbidden Substrings – De Bruijn graphs – Aho-Corasick automaton
1. Introduction
First introduced in [1], the computational design of RiboNucleic Acids (RNA) has been studied extensively over the past decades [2] due to its successful application in a variety of biological contexts [3,4]. Its ultimate goal is the synthesis of molecules that achieve a targeted biological function. In its simplest form, also called inverse folding of RNA, the design problem consists in finding a sequence that adopts a given secondary structure as its Minimum Free Energy (MFE) structure, typically computed using polynomial-time dynamic programming [5]. Given the NP-hardness of the problem [6], recent methods [7,8,9,10] tackle the problem heuristically in two phases. First, an initial seed sequence is sampled from a distribution that captures a relaxed version of the objective function [11]. Next, the seed is iteratively refined using a local search strategy [1], eventually inducing a Boltzmann-Gibbs distribution with respect to the final objective function (e.g. the free-energy difference between the MFE structure of the sequence and its first suboptimal structure).
Forbidden substrings and Hamming graphs connectivity
However, realistic applications of design require additional sequence constraints, for instance to avoid undesired interactions within a cellular context.
The seed sampling phase can be adapted to avoid a predefined set F of forbidden motifs using formal language constructs [12] or direct dynamic programming [13].
However, to the best of our knowledge, little to no work has been done to assess the impact of forbidden motifs on the local search. Indeed, allowing the local search to violate sequence constraints would lead to very few valid candidate sequences, since an overwhelming proportion of the sequences may (and will, from the monkey/typewriter paradox) feature some forbidden motif during the local search.
On the other hand, enforcing the avoidance of $F$ at each step of the local search may disrupt the connectivity of the search space or, equivalently, the ergodicity of the Markov chain induced by the sequence space and the move set of the local search.
For instance, while designing an RNA of length $n$ over the alphabet $\Sigma = \{A, U\}$ with $F = \{AU, UA\}$, the only two words avoiding $F$, namely $A^n$ and $U^n$, are at Hamming distance $n$.
The search space is thus disconnected for any move set inducing changes of bounded Hamming distance $n' < n$. Such a disconnectivity prevents the convergence of the local search, i.e. it rules out any (probabilistic) guarantee to ultimately discover promising candidates whenever such candidates exist.
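The two-letter example above can be checked by exhaustive enumeration. The following is a minimal sketch; the helper names `avoids`, `valid_words` and `hamming`, as well as the choice $n = 6$, are illustrative assumptions rather than part of the original work.

```python
from itertools import product

def avoids(word, forbidden):
    """True if no forbidden motif occurs as a substring of word."""
    return not any(f in word for f in forbidden)

def valid_words(alphabet, forbidden, n):
    """All words of length n over the alphabet that avoid every forbidden motif."""
    candidates = ("".join(p) for p in product(alphabet, repeat=n))
    return [w for w in candidates if avoids(w, forbidden)]

def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

n = 6  # illustrative length
words = valid_words("AU", {"AU", "UA"}, n)
print(words)            # the only two valid words: A^n and U^n
print(hamming(*words))  # their Hamming distance, equal to n
```

Any move changing fewer than $n$ positions therefore cannot link the two valid words.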
In this work, we address the efficient algorithmic detection of disconnected search spaces for a given set $F$ of forbidden motifs, a given RNA sequence length $n$ and a given move set. We restrict our attention to $k$-Hamming move sets, consisting of symmetric moves $s \leftrightarrow s'$ where both $s$ and $s'$ avoid $F$, and such that the Hamming distance satisfies $H(s, s') = k$. A brute-force solution would generate the whole search space as a graph and check the existence of a single connected component, at a highly impractical $O(|\Sigma|^n)$ time complexity. Instead, we exploit the highly structured nature of the problem to propose partial algorithms, based on De Bruijn graphs and Aho-Corasick automata, whose complexity depends on $F$ and $k$ but remains largely independent of $n$.
2. Definition of the problem
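Let $\Sigma$ be an alphabet with $|\Sigma| \geq 2$, and let $n \in \mathbb{N}$, $n \geq 2$, be a sequence length. Denote by $F \subset \Sigma^*$ the set of forbidden motifs; then $\mathcal{L}_{F,n} \subseteq \Sigma^n$ denotes the set of words of length $n$ that do not contain any motif in $F$. Let $m(F) \overset{\mathrm{def}}{=} \max_{f \in F} |f|$; we assume that $n \geq m(F)$.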
General problem
Input: Length $n \geq 2$, set $F$ of forbidden motifs, and neighborhood function $\delta : \mathcal{L}_{F,n} \to \mathcal{L}_{F,n}$
Output: Yes if $G = (\mathcal{L}_{F,n}, \delta)$ is (strongly) connected, No otherwise.
Here we restrict our attention to the $k$-Hamming neighborhood $\delta_k$ for some $k \in [1, n]$, defined for any word $w \in \mathcal{L}_{F,n}$ as $\delta_k(w) = \{w' \in \mathcal{L}_{F,n} \mid H(w, w') \leq k\}$, where $H(w, w')$ is the classic Hamming distance between two words $w, w' \in \Sigma^n$.
Since $k$-Hamming neighborhoods are symmetric, strong connectivity and connectivity are equivalent. The central question, addressed in the following, becomes:
Is the Hamming graph $G_{F,n,k} \overset{\mathrm{def}}{=} (\mathcal{L}_{F,n}, \delta_k)$ connected?
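For very small instances, the question can be answered directly by building the Hamming graph. The sketch below is a naive baseline written for illustration only: the function names are hypothetical, the search compares words pairwise, and treating an empty or single-word graph as connected is a convention assumed here, not one stated in the text.

```python
from itertools import product

def valid_words(alphabet, forbidden, n):
    """Words of length n over the alphabet containing no forbidden motif."""
    candidates = ("".join(p) for p in product(alphabet, repeat=n))
    return [w for w in candidates if not any(f in w for f in forbidden)]

def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def hamming_graph_connected(alphabet, forbidden, n, k):
    """Exhaustive connectivity test of G_{F,n,k}; exponential in n."""
    words = valid_words(alphabet, forbidden, n)
    if len(words) <= 1:
        return True  # assumed convention for degenerate graphs
    seen = {words[0]}
    stack = [words[0]]
    while stack:  # depth-first traversal from an arbitrary valid word
        w = stack.pop()
        for v in words:
            if v not in seen and hamming(w, v) <= k:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(words)

# Forbidding AA keeps the graph connected for single-letter moves,
# whereas the motif set of Figure 1 isolates the all-A word.
print(hamming_graph_connected("AC", {"AA"}, 4, 1))
print(hamming_graph_connected("AC", {"ACA", "CAAA", "AAC"}, 8, 1))
```

The cost grows as $|\Sigma|^n$, which is what the partial tests of the next section try to avoid.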
Figure 1. De Bruijn graph $DB_F$ for $F = \{ACA, CAAA, AAC\}$ and $\Sigma = \{A, C\}$
3. Algorithms
We derive a first partial disconnectivity test from a simple property of De Bruijn graphs. Then, using an equivalence relation on the nodes of the De Bruijn graph, we infer a similar partial disconnectivity test on a variant of the Aho-Corasick automaton, which runs in time linear in the length of the desired sequence.
3.1 Detecting disconnectivity using the De Bruijn graph of m(F)-mers
We use variants of the De Bruijn graph [14] to infer the disconnectivity of $G_{F,n,k}$.
Definition 1. Given a set $F$ of forbidden motifs, we define:
• The De Bruijn (di)graph $DB_F = (V, E)$ of $F$, such that $V := \mathcal{L}_{F,m(F)}$, the valid sequences of length $m(F)$, and $E := \{(a.w, w.b) \in \mathcal{L}_{F,m(F)}^2 \mid a, b \in \Sigma\}$;
• The pruned De Bruijn graph $DB_{F,n}$, obtained by removing any connected component in $DB_F$ that cannot generate any word of length $n$.
$DB_{F,n}$ can be built in $O(|\Sigma|^{m(F)+1})$ time, and detecting unproductive connected components (CC) to build $DB_{F,n}$ can be done in $O(|V|)$ time, using topological sorting to either detect a cycle (→ keep the CC) or determine the length $n'$ of the longest path (→ keep the CC only if $n' \geq n - m(F) - 1$).
Note that $DB_F$ has $O(|\Sigma|^{m(F)})$ nodes, and is thus typically much smaller than the Hamming graph $G_{F,n,k}$ ($O(|\Sigma|^n)$ nodes). Nevertheless, all valid sequences of length $n$ are represented in $DB_F$ as paths of length $n - m(F)$. For example, in Figure 1 the valid sequence CACCAA corresponds to the path CACC → ACCA → CCAA.
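As an illustration of Definition 1, the following sketch builds $DB_F$ by plain enumeration of the $m(F)$-mers and maps a valid word to its path. The function names `de_bruijn_graph` and `word_to_path` are hypothetical, and $F$ is assumed to be non-empty.

```python
from itertools import product

def de_bruijn_graph(alphabet, forbidden):
    """Return m(F), the valid m(F)-mers and the successor lists of DB_F."""
    m = max(len(f) for f in forbidden)
    nodes = {w for w in map("".join, product(alphabet, repeat=m))
             if not any(f in w for f in forbidden)}
    # Edge (a.w, w.b): drop the first letter, append b, stay among valid nodes.
    succ = {u: [u[1:] + b for b in alphabet if u[1:] + b in nodes]
            for u in nodes}
    return m, nodes, succ

def word_to_path(word, m):
    """Successive windows of length m, i.e. the nodes visited by the word."""
    return [word[i:i + m] for i in range(len(word) - m + 1)]

m, nodes, succ = de_bruijn_graph("AC", {"ACA", "CAAA", "AAC"})
path = word_to_path("CACCAA", m)
print(sorted(nodes))
print(path)  # ['CACC', 'ACCA', 'CCAA'], as in the example of the text
```

A word of length $n$ yields $n - m(F) + 1$ nodes, hence a path of $n - m(F)$ edges.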
Lemma 1. Upon reading a sequence of letters $a_1.a_2 \ldots a_j$, $j \geq m(F)$, from two distinct nodes $u, v \in DB_F$, the two paths merge at some index $i \leq m(F)$.
Figure 2. $\widehat{AC}_{F,n}$ and $DB_{F,n}$ when $F = \{ACA, AAC, CAAA\}$ and $\Sigma = \{A, C\}$: (a) $AC_F$; (b) $DB_F$ and the equivalence classes.
Intuitively, $DB_F$ can be seen as an automaton whose states encode the suffixes of length $m(F)$. Thus, after reading $m(F)$ characters, the resulting state is $a_1 \ldots a_{m(F)}$, irrespective of the starting state, so the paths merge at index $m(F)$ or before. This means that if we follow two paths in different connected components of $DB_F$, the sequences of letters must diverge at least once every $m(F)$ steps, which implies an increasing Hamming distance between the corresponding valid words. This holds for any pair of paths in $DB_F$ generated from different connected components, leading to the following result.
Theorem 2. $\forall n \geq (k+1) \times m(F)$, $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.
The implication is not an equivalence, as it is possible to build instances where $G_{F,n,k}$ is disconnected while $DB_{F,n}$ remains connected. It nevertheless suggests a first algorithm for a partial disconnectivity test within $G_{F,n,k}$: build $DB_{F,n}$ and report its connectivity. It has overall time complexity in $O(|\Sigma|^{m(F)})$, i.e. no longer exponential in the sequence length $n$, yet it remains exponential in the length of the forbidden substrings.
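The test suggested by Theorem 2 can be sketched as follows. Several choices here are assumptions made for illustration: connected components are taken with edge directions ignored, a component is kept when it contains a walk of $n - m(F)$ edges (the path length used earlier for words of length $n$), and a bounded fixed-point iteration stands in for the topological sort mentioned above. All function names are hypothetical.

```python
from itertools import product

def de_bruijn_graph(alphabet, forbidden):
    """Return m(F) and the successor lists of DB_F (nodes are valid m(F)-mers)."""
    m = max(len(f) for f in forbidden)
    nodes = {w for w in map("".join, product(alphabet, repeat=m))
             if not any(f in w for f in forbidden)}
    succ = {u: [u[1:] + b for b in alphabet if u[1:] + b in nodes] for u in nodes}
    return m, succ

def weak_components(succ):
    """Connected components of the graph once edge directions are ignored."""
    adj = {u: set() for u in succ}
    for u, targets in succ.items():
        for v in targets:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in succ:
        if start in seen:
            continue
        seen.add(start)
        component, stack = {start}, [start]
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    stack.append(v)
        components.append(component)
    return components

def has_walk(component, succ, steps):
    """True if the component contains a walk made of `steps` edges."""
    alive = set(component)  # nodes starting a walk of 0 edges
    # A walk of len(component) edges repeats a node, hence implies a cycle,
    # so the number of rounds never needs to exceed the component size.
    for _ in range(min(steps, len(component))):
        alive = {u for u in component if any(v in alive for v in succ[u])}
    return bool(alive)

def db_test_reports_disconnected(alphabet, forbidden, n, k):
    """Partial test: True means G_{F,n,k} is disconnected; False is inconclusive."""
    m, succ = de_bruijn_graph(alphabet, forbidden)
    if n < (k + 1) * m:
        raise ValueError("Theorem 2 requires n >= (k + 1) * m(F)")
    productive = [c for c in weak_components(succ) if has_walk(c, succ, n - m)]
    return len(productive) > 1

print(db_test_reports_disconnected("AC", {"ACA", "CAAA", "AAC"}, 8, 1))
```

On the motif set of Figure 1, the node AAAA only has a self-loop and forms its own component, which is why the test fires. A `False` answer carries no information about $G_{F,n,k}$.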
3.2 Detecting disconnectivity using the Aho-Corasick automaton of F
Next, we attempt to exploit the Nerode equivalence, with respect to the suffix language, of some states in $DB_{F,n}$.
Definition 3. Define the Aho-Corasick automaton $AC_F$ as the DFA having state set $Q = \{u \text{ proper prefix of some } f \in F\}$, initial state $q_I = \varepsilon$, and accepting all words ending in $Q$. Transitions are $\Delta = \Delta_f \uplus \Delta_b$, with:
• $\Delta_f$, the forward edges: $\{(u, a, u.a) \mid a \in \Sigma \land u, u.a \in Q\}$ (i.e. the prefix tree of $F$)
• $\Delta_b$, the backward edges: $\{(u, a, v) \mid u.a \notin Q \land v \in Q \text{ longest suffix of } u.a\}$
With this definition of $AC_F$, a word $w$ is accepted iff no $f \in F$ is a substring of $w$, i.e. $AC_F$ recognizes the complement language of the usual Aho-Corasick automaton [15]. Moreover, $AC_F$ can be built in time $O(|\Sigma| \times |F| \times m(F))$.
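A small sketch of Definition 3 may help fix ideas. It relies on one reading that is an assumption on our part: a transition is simply left undefined whenever $u.a$ ends with a forbidden motif, so that running into a missing transition means rejection. The construction below favors clarity over the stated complexity bound, and the names `build_ac` and `accepts` are hypothetical.

```python
from itertools import product

def build_ac(alphabet, forbidden):
    """States (proper prefixes of motifs) and transition table of AC_F."""
    states = {f[:i] for f in forbidden for i in range(len(f))}
    delta = {}
    for u in states:
        for a in alphabet:
            ua = u + a
            # Assumed reading: no transition if u.a ends with a forbidden motif.
            if any(ua.endswith(f) for f in forbidden):
                continue
            # Longest suffix of u.a that is a state: u.a itself for forward
            # edges, a shorter suffix (possibly empty) for backward edges.
            delta[u, a] = next(ua[i:] for i in range(len(ua) + 1)
                               if ua[i:] in states)
    return states, delta

def accepts(delta, word):
    """Run the automaton from the empty prefix; a missing move rejects."""
    state = ""
    for a in word:
        if (state, a) not in delta:
            return False
        state = delta[state, a]
    return True

F = {"ACA", "AAC", "CAAA"}
states, delta = build_ac("AC", F)
print(sorted(states))
print(accepts(delta, "CACCAA"), accepts(delta, "CCACAC"))

# Sanity check against a direct substring search on all short words.
for length in range(8):
    for w in map("".join, product("AC", repeat=length)):
        assert accepts(delta, w) == (not any(f in w for f in F))
```

The final loop compares the automaton with a naive substring search on every word of length at most 7 over $\{A, C\}$.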
Definition 4. We define:
• $\widehat{AC}_F$ from $AC_F$ by removing states that are no longer visited after $m(F)$ steps;
• $\widehat{AC}_{F,n}$ as the restriction of $\widehat{AC}_F$ to components producing words of length $n$.
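| $\lvert\Sigma\rvert$ | $m(F)$ | $n$ | #Samples | #Disconnected $G_{F,n,1}$ | Recall $DB_{F,n}$ (%) | Recall $\widehat{AC}_{F,n}$ (%) |
|---|---|---|---|---|---|---|
| 2 | 5 | 10 | 100 000 | 36 630 | 49.5 | 47.1 |
| 2 | 5 | 11 | 100 000 | 35 893 | 48.2 | 46.2 |
| 3 | 5 | 10 | 10 000 | 4 395 | 53.9 | 49.2 |
| 4 | 3 | 6 | 25 000 | 9 447 | 37.6 | 34.3 |
| 4 | 3 | 7 | 10 000 | 3 728 | 37.9 | 35.7 |
| 4 | 4 | 8 | 4 000 | 1 904 | 54.3 | 50.1 |

Figure 3. Recall (TP/P) of our disconnectivity tests for various sets of parameters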
As illustrated in Figure 2, grouping together nodes in $DB_F$ having the same prefix/suffix overlaps with forbidden substrings, we get exactly $\widehat{AC}_F$. This equivalence relation and Theorem 2 imply the following:
Theorem 5. $\forall n \geq (k+1) \times m(F)$, one has
$\widehat{AC}_{F,n}$ disconnected $\Rightarrow$ $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.
Again, the implications are only one-way: $DB_{F,n}$ may be disconnected while $\widehat{AC}_{F,n}$ remains connected. Still, building $\widehat{AC}_{F,n}$ and testing its disconnectivity represents an additional partial disconnectivity test for $G_{F,n,k}$. While this variant is expected to detect fewer cases of disconnectivity, its complexity is significantly better, with the overall construction of $\widehat{AC}_{F,n}$ now only requiring $O(|\Sigma| \times |F| \times m(F))$ time.
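The second test can be sketched along the same lines. This is one possible reading of Definition 4, not a confirmed implementation: the states of $\widehat{AC}_F$ are taken to be those reached after reading exactly $m(F)$ letters of a valid word, components ignore edge directions, and a component is kept when it contains a walk of $n - m(F)$ edges. As before, transitions ending on a forbidden motif are assumed absent, and all names are hypothetical.

```python
def build_ac(alphabet, forbidden):
    """Transition table of AC_F; missing entries stand for rejection."""
    states = {f[:i] for f in forbidden for i in range(len(f))}
    delta = {}
    for u in states:
        for a in alphabet:
            ua = u + a
            if any(ua.endswith(f) for f in forbidden):
                continue  # assumed reading: forbidden motif completed
            delta[u, a] = next(ua[i:] for i in range(len(ua) + 1)
                               if ua[i:] in states)
    return delta

def ac_hat(alphabet, forbidden):
    """m(F) and successor sets of the states still reachable after m(F) letters."""
    m = max(len(f) for f in forbidden)
    delta = build_ac(alphabet, forbidden)
    current = {""}
    for _ in range(m):  # states reached after exactly one more letter
        current = {delta[u, a] for u in current for a in alphabet
                   if (u, a) in delta}
    # The membership filter is defensive: targets are expected to stay inside.
    succ = {u: {delta[u, a] for a in alphabet
                if (u, a) in delta and delta[u, a] in current}
            for u in current}
    return m, succ

def weak_components(succ):
    """Connected components once edge directions are ignored."""
    adj = {u: set(vs) for u, vs in succ.items()}
    for u, vs in succ.items():
        for v in vs:
            adj[v].add(u)
    seen, components = set(), []
    for start in succ:
        if start in seen:
            continue
        seen.add(start)
        component, stack = {start}, [start]
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    stack.append(v)
        components.append(component)
    return components

def has_walk(component, succ, steps):
    """True if the component contains a walk of `steps` edges (capped by its size)."""
    alive = set(component)
    for _ in range(min(steps, len(component))):
        alive = {u for u in component if any(v in alive for v in succ[u])}
    return bool(alive)

def ac_test_reports_disconnected(alphabet, forbidden, n, k):
    """Partial test: True means G_{F,n,k} is disconnected; False is inconclusive."""
    m, succ = ac_hat(alphabet, forbidden)
    if n < (k + 1) * m:
        raise ValueError("Theorem 5 requires n >= (k + 1) * m(F)")
    productive = [c for c in weak_components(succ) if has_walk(c, succ, n - m)]
    return len(productive) > 1

print(ac_test_reports_disconnected("AC", {"ACA", "AAC", "CAAA"}, 8, 1))
```

For the motif set of Figure 2, the state AA is left with a self-loop only and forms a component of its own, mirroring the isolated node AAAA of the De Bruijn graph.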
4. Results and Discussion
Both our partial tests were executed on randomly generated sets of forbidden substrings with various parameters. Since the connectivity of the Hamming graph $G_{F,n,k}$ had to be checked on every instance to establish a ground truth, tests could only be conducted with $k = 1$ and small values of $n$ and $m(F)$. The recall (#DetectedDisconnections/#Disconnections, or TP/P) results are given in Figure 3.
As expected, the Aho-Corasick-based test always performs slightly worse than the De Bruijn-based one, but not by a large margin (∼5%) in our empirical experiments.
With such a minimal trade-off in accuracy, the Aho-Corasick-based variant seems to represent a natural first choice in most cases. Recall values range between 35% and 55% for both variants, which is already significant but could probably be improved by exploring subtler relationships between the Aho-Corasick automaton and the Hamming graph.
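The gap between the two tests can be read off Figure 3 with a few lines of arithmetic. The sketch below only re-uses the figures reported there; the tuple layout and variable names are illustrative, and the detected counts are approximations obtained by inverting recall = TP/P from rounded percentages.

```python
# (|Sigma|, m(F), n, disconnected instances P, recall DB (%), recall AC-hat (%))
rows = [
    (2, 5, 10, 36630, 49.5, 47.1),
    (2, 5, 11, 35893, 48.2, 46.2),
    (3, 5, 10, 4395, 53.9, 49.2),
    (4, 3, 6, 9447, 37.6, 34.3),
    (4, 3, 7, 3728, 37.9, 35.7),
    (4, 4, 8, 1904, 54.3, 50.1),
]

# Recall gap, in percentage points, between the two tests on each row.
gaps = [round(rec_db - rec_ac, 1) for *_, rec_db, rec_ac in rows]

# Approximate number of detected disconnections (TP) for each test.
detected = [(round(p * rec_db / 100), round(p * rec_ac / 100))
            for _, _, _, p, rec_db, rec_ac in rows]

print(gaps)
print(min(gaps), max(gaps))
print(detected)
```

On these rows the gap stays between 2.0 and 4.7 percentage points, consistent with the margin quoted above.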
This preliminary work leaves open several questions of general interest, including:
• What are the shared properties of disconnected instances associated with a connected $\widehat{AC}_{F,n}$? With a connected $DB_{F,n}$?
• Is the problem NP-hard in general?
• How to generalize our constructs to mandatory motifs? To any general automaton generating sequences?
• How to design move sets ensuring connectivity for a given F?
References
[1] Ivo Hofacker, Walter Fontana, Peter Stadler, Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2):167–188, Feb 1994.
[2] Alexander Churkin, Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. Design of RNAs: comparing programs for inverse RNA folding. Briefings in Bioinformatics, 19(2):350–358, 01 2017.
[3] Sven Findeiß, Manja Wachsmuth, Mario Mörl, and Peter F Stadler. Design of transcription regulating riboswitches. In Methods in enzymology, volume 550, pages 1–22. Elsevier, 2015.
[4] Ryota Yamagami, Mohammad Kayedkhordeh, David H Mathews, and Philip C Bevilacqua. Design of highly active double-pseudoknotted ribozymes: a combined computational and experimental study. Nucleic acids research, 47(1):29–42, 2018.
[5] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133–148, 1981.
[6] Édouard Bonnet, Paweł Rzążewski, and Florian Sikora. Designing RNA secondary structures is hard. In Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, pages 248–250, 2018.
[7] Joseph N Zadeh, Brian R Wolfe, and Niles A Pierce. Nucleic acid sequence design via efficient ensemble defect optimization. Journal of Computational Chemistry, 32(3):439–52, 2011.
[8] Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. incaRNAfbinv: a web server for the fragment-based design of RNA sequences. Nucleic acids research, 44(W1):W308–W314, 2016.
[9] Stefan Hammer, Birgit Tschiatschek, Christoph Flamm, Ivo L Hofacker, and Sven Findeiß. RNAblueprint: flexible multiple target nucleic acid sequence design. Bioinformatics, 33(18):2850–2858, 04 2017.
[10] Stefan Hammer, Wei Wang, Sebastian Will, and Yann Ponty. Fixed-parameter tractable sampling for RNA design with multiple target structures. BMC bioinformatics, 20(1):209, 2019.
[11] Vladimir Reinharz, Yann Ponty, and Jérôme Waldispühl. A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):i308–i315, 2013.
[12] Yu Zhou, Yann Ponty, Stéphane Vialette, Jérôme Waldispühl, Yi Zhang, and Alain Denise. Flexible RNA design under structure and sequence constraints using formal languages. In ACM-BCB - ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics - 2013, Bethesda, Washington DC, United States, September 2013.
[13] Vincent Le Gallic, Alain Denise, and Yann Ponty. Résultats algorithmiques pour le design d’ARN avec contraintes de séquence. In SeqBio 2015, pages 26–31, Orsay, France, November 2015.
[14] N. G. De Bruijn. A combinatorial problem. Proc. Koninklijke Nederlandse Academie van Wetenschappen, 49:758–764, 1946.
[15] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, June 1975.