
Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests


HAL Id: hal-02403517

https://hal.inria.fr/hal-02403517

Submitted on 11 Dec 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Maher Mallem, Alain Denise, Yann Ponty

To cite this version:

Maher Mallem, Alain Denise, Yann Ponty. Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests. SEQBIM 2019 - Séquences en Bioinformatique, Informatique et Mathématiques, Dec 2019, Marne-la-Vallée, France. ⟨hal-02403517⟩


Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: partial disconnectivity tests

Maher Mallem¹, Alain Denise²*, Yann Ponty³*

¹Department of Computer Science, ENS Paris-Saclay, Cachan, France

²LRI and I2BC, Université Paris-Sud / Paris-Saclay, Gif-sur-Yvette, France

³LIX, École Polytechnique, Palaiseau, France

*Corresponding authors: alain.denise@u-psud.fr and yann.ponty@lix.polytechnique.fr

Abstract

RNA structure design methods have grown in complexity to cover an increasing scope of application. Recent approaches combine an initial random generation with a local optimization step, and consider both a user-specified secondary structure and sets of mandatory and forbidden substrings.

Although these additional constraints lead to better design results, they may interfere with the local optimization phase. Indeed, forbidden substrings may disrupt the connectivity of the underlying search space, a key property for the success of the local search. A naive connectivity test would explore the whole graph of candidate sequences, and would therefore run in exponential time.

In this work, we propose two partial algorithms based on compact graph structures, namely De Bruijn graphs and the Aho-Corasick automaton, which detect disconnectivity in time independent of the RNA sequence length. On random instances, our tests detected disconnectivity with a sensitivity ranging between 35% and 55%, motivating further research.

Keywords

RNA Design – Forbidden Substrings – De Bruijn graphs – Aho-Corasick automaton

1. Introduction

First introduced in [1], the computational design of RiboNucleic Acids (RNA) has been studied extensively over the past decades [2] due to its successful application in a variety of biological contexts [3,4]. Its ultimate goal is the synthesis of molecules that achieve a targeted biological function. In its simplest form, also called inverse folding of RNA, the design problem consists in finding a sequence that adopts a given secondary structure as its Minimum Free Energy (MFE) structure, typically computed using polynomial-time dynamic programming [5]. Given the NP-hardness of the problem [6], recent methods [7,8,9,10] tackle the problem heuristically in two phases: first, an initial seed sequence is sampled from a distribution that captures a relaxed version of the objective function [11]; next, the seed is iteratively refined using a local search strategy [1], eventually inducing a Boltzmann-Gibbs distribution with respect to the final objective function (e.g. the free-energy difference between the MFE structure of the sequence and its first suboptimal structure).


However, realistic applications of design require additional sequence constraints, for instance to avoid undesired interactions within a cellular context.

The seed sampling phase can be adapted to avoid a predefined set F of forbidden motifs using formal language constructs [12] or direct dynamic programming [13].

However, to the best of our knowledge, little to no work has been done to assess the impact of forbidden motifs on the local search. Indeed, allowing the local search to violate sequence constraints would lead to very few valid candidate sequences, since an overwhelming proportion of the sequences may (and will, from the monkey/typewriter paradox) feature some forbidden motif during the local search.

On the other hand, enforcing the avoidance of $F$ at each step of the local search may disrupt the connectivity of the search space or, equivalently, the ergodicity of the Markov chain induced by the sequence space and the move set of the local search.

For instance, when designing an RNA of length $n$ over the alphabet $\Sigma = \{A, U\}$ with $F = \{AU, UA\}$, the only two words avoiding $F$, $A^n$ and $U^n$, are at Hamming distance $n$.

The search space is thus disconnected for any move set inducing changes of bounded Hamming distance $n' < n$. Such a disconnectivity prevents the convergence of the local search, i.e. it rules out any (probabilistic) guarantee to ultimately discover promising candidates whenever such candidates exist.
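The two-letter example can be checked by exhaustive enumeration. The sketch below is only an illustration: the helper names `valid_words` and `hamming` are ours, and the length n = 5 is an arbitrary choice.

```python
from itertools import product


def valid_words(alphabet, forbidden, n):
    """Return all words of length n over `alphabet` containing no forbidden motif."""
    words = ("".join(letters) for letters in product(alphabet, repeat=n))
    return [w for w in words if not any(f in w for f in forbidden)]


def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


n = 5  # arbitrary illustrative length
words = valid_words("AU", {"AU", "UA"}, n)
print(words)                         # ['AAAAA', 'UUUUU']
print(hamming(words[0], words[1]))   # 5, i.e. the full length n
```

Any move changing fewer than n positions therefore leaves each of the two words isolated.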

In this work, we address the efficient algorithmic detection of disconnected search spaces for a given set $F$ of forbidden motifs, a given RNA sequence length $n$ and a given move set. We restrict our attention to $k$-Hamming move sets, consisting of symmetric moves $s \leftrightarrow s'$ where both $s$ and $s'$ avoid $F$, and such that the Hamming distance satisfies $H(s, s') \le k$. A brute-force solution would generate the whole search space as a graph and check the existence of a single connected component, in a highly impractical $O(|\Sigma|^n)$ time complexity. Instead, we exploit the highly structured nature of the problem to propose partial algorithms, based on De Bruijn graphs and Aho-Corasick automata, whose complexities depend on $F$ and $k$ but remain largely independent of $n$.

2. Definition of the problem

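Let $\Sigma$ be an alphabet with $|\Sigma| \ge 2$, and let $n \in \mathbb{N}$, $n \ge 2$, be a sequence length. Denote by $F \subset \Sigma^*$ the set of forbidden motifs; then $L_{F,n} \subseteq \Sigma^n$ denotes the set of words of length $n$ that do not contain any motif in $F$. Let $m(F) \overset{\text{def}}{=} \max_{f \in F} |f|$; we assume that $n \ge m(F)$.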

General problem

Input: Length $n \ge 2$, set $F$ of forbidden motifs, and neighborhood function $\delta : L_{F,n} \to L_{F,n}$

Output: Yes if $G = (L_{F,n}, \delta)$ is (strongly) connected, No otherwise.

Here we restrict our attention to the $k$-Hamming neighborhood $\delta_k$ for some $k \in [1, n]$, defined for any word $w \in L_{F,n}$ as $\delta_k(w) = \{w' \in L_{F,n} \mid H(w, w') \le k\}$, where $H(w, w')$ is the classic Hamming distance between two words $w, w' \in \Sigma^n$.

Since $k$-Hamming neighborhoods are symmetric, strong connectivity and connectivity are equivalent. The central question, addressed in the following, becomes:

Is the Hamming graph $G_{F,n,k} \overset{\text{def}}{=} (L_{F,n}, \delta_k)$ connected?
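For very small instances, the central question can be answered directly by exploring the Hamming graph. The sketch below is such a brute-force baseline. It is exponential in $n$ and is meant only to make the definition of $G_{F,n,k}$ concrete. All function names are ours, and treating the empty graph as connected is a convention chosen here.

```python
from collections import deque
from itertools import combinations, product


def is_valid(word, forbidden):
    """True if `word` contains no forbidden motif as a substring."""
    return not any(f in word for f in forbidden)


def neighbors(word, alphabet, forbidden, k):
    """Yield the valid words at Hamming distance between 1 and k from `word`."""
    for r in range(1, k + 1):
        for positions in combinations(range(len(word)), r):
            # At each chosen position, substitute a letter different from the current one.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for letters in product(*choices):
                candidate = list(word)
                for p, c in zip(positions, letters):
                    candidate[p] = c
                candidate = "".join(candidate)
                if is_valid(candidate, forbidden):
                    yield candidate


def hamming_graph_connected(alphabet, forbidden, n, k):
    """Breadth-first search over all valid words of length n (exponential in n)."""
    words = ["".join(t) for t in product(alphabet, repeat=n)]
    words = [w for w in words if is_valid(w, forbidden)]
    if not words:
        return True  # convention chosen here: the empty graph counts as connected
    seen = {words[0]}
    queue = deque([words[0]])
    while queue:
        w = queue.popleft()
        for v in neighbors(w, alphabet, forbidden, k):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(words)


print(hamming_graph_connected("AU", {"AU", "UA"}, 4, 1))  # False: AAAA and UUUU are 4 substitutions apart
print(hamming_graph_connected("AU", {"AU", "UA"}, 4, 4))  # True: a single move may change all 4 positions
```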


Figure 1. De Bruijn graph $DB_F$ for $F = \{ACA, CAAA, AAC\}$ and $\Sigma = \{A, C\}$

3. Algorithms

We derive a first partial disconnectivity test from a simple property of De Bruijn graphs. Then, using an equivalence relation on the nodes of the De Bruijn graph, we infer a similar partial disconnectivity test on a variant of the Aho-Corasick automaton, which runs in time linear in the length of the desired sequence.

3.1 Detecting disconnectivity using the De Bruijn graph of m(F)-mers

We use variants of the De Bruijn graph [14] to infer the disconnectivity of $G_{F,n,k}$.

Definition 1. Given a set $F$ of forbidden motifs, we define:

• The De Bruijn (di)graph $DB_F = (V, E)$ of $F$, such that $V := L_{F,m(F)}$, the valid sequences of length $m(F)$, and $E := \{(a.w, w.b) \in L_{F,m(F)}^2 \mid a, b \in \Sigma\}$;

• The pruned De Bruijn graph $DB_{F,n}$, obtained by removing any connected component in $DB_F$ that cannot generate any word of length $n$.

$DB_F$ can be built in $O(|\Sigma|^{m(F)+1})$ time, and detecting unproductive connected components (CC) to build $DB_{F,n}$ can be done in $O(|V|)$ time, using topological sorting to either detect a cycle (→ keep CC), or determine the length $n'$ of the longest path (→ keep CC only if $n' \ge n - m(F) - 1$).

Note that $DB_F$ has $O(|\Sigma|^{m(F)})$ nodes, and is typically much smaller than the Hamming graph $G_{F,n,k}$ ($O(|\Sigma|^n)$ nodes). All valid sequences of length $n$ are represented in $DB_F$ as paths of length $n - m(F)$. For example, in Figure 1 the valid sequence CACCAA corresponds to the path CACC → ACCA → CCAA.
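To make Definition 1 concrete, here is a minimal sketch that builds the De Bruijn graph for the forbidden set of Figure 1 and maps a valid word to its path. The function names `de_bruijn_graph` and `word_to_path` are ours, and nodes are enumerated naively rather than with any optimized construction.

```python
from itertools import product


def de_bruijn_graph(alphabet, forbidden):
    """Nodes are the valid m(F)-mers; u -> v is an edge when v = u[1:] + b is also valid."""
    m = max(len(f) for f in forbidden)
    candidates = ("".join(t) for t in product(alphabet, repeat=m))
    nodes = {u for u in candidates if not any(f in u for f in forbidden)}
    edges = {u: sorted(u[1:] + b for b in alphabet if u[1:] + b in nodes) for u in nodes}
    return m, nodes, edges


def word_to_path(word, m):
    """Successive m-mers of `word`: a path with len(word) - m edges."""
    return [word[i:i + m] for i in range(len(word) - m + 1)]


m, nodes, edges = de_bruijn_graph("AC", {"ACA", "CAAA", "AAC"})
path = word_to_path("CACCAA", m)
print(sorted(nodes))  # the 8 valid 4-mers
print(path)           # ['CACC', 'ACCA', 'CCAA']
```

Because every forbidden motif has length at most $m(F)$, a word is valid exactly when all of its $m(F)$-mers are nodes of the graph, which is why checking nodes alone suffices when creating edges.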

Lemma 1. Upon reading a sequence of letters $a_1.a_2 \ldots a_j$, $j \ge m(F)$, from two distinct nodes $u, v \in DB_F$, the two paths merge at some index $i \le m(F)$.

Figure 2. $\widehat{AC}_{F,n}$ and $DB_{F,n}$ when $F = \{ACA, AAC, CAAA\}$ and $\Sigma = \{A, C\}$: (a) $AC_F$; (b) $DB_F$ and the equivalence classes.

Intuitively, $DB_F$ can be seen as an automaton whose states encode the suffixes of length $m(F)$. Thus, after reading $m(F)$ characters, the resulting state is $a_1 \ldots a_{m(F)}$, irrespective of the starting state, so the paths merge at index $m(F)$ or before. This means that if we follow two paths in different connected components of $DB_F$, the sequences of letters must diverge at least once every $m(F)$ steps, which implies an increasing Hamming distance between the corresponding valid words. This holds for any pair of paths in $DB_F$ generated from different connected components, leading to the following result.

Theorem 2. $\forall n \ge (k+1) \times m(F)$, $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.
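One way to make the step from Lemma 1 to Theorem 2 explicit is the following counting argument. It is our reading of the proof idea, not a statement taken from the paper. The symbols $w, w'$ (two valid words of length $n$ whose paths lie in different components of $DB_{F,n}$), $m = m(F)$ and the windows $W_j$ are introduced here for illustration only.

```latex
% Split the first (k+1)m positions into k+1 disjoint windows of length m.
\[
W_j = \{(j-1)\,m + 1, \dots, j\,m\}, \qquad j = 1, \dots, k+1,
\qquad \text{which fit in } \{1, \dots, n\} \text{ since } n \ge (k+1)\,m .
\]
% If w and w' agreed on a whole window W_j, both paths would visit the node
% w_{(j-1)m+1} ... w_{jm} and would therefore lie in the same component.
% Hence every window contains at least one mismatch:
\[
H(w, w') \;\ge\; \sum_{j=1}^{k+1} \bigl|\{\, i \in W_j : w_i \ne w'_i \,\}\bigr| \;\ge\; k + 1 \;>\; k .
\]
```

Under this reading, no $k$-Hamming move links words whose paths lie in different components, and each component kept in $DB_{F,n}$ produces at least one word of length $n$, so $G_{F,n,k}$ has at least two connected components.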

The implication is not an equivalence, as it is possible to build instances where $G_{F,n,k}$ is disconnected while $DB_{F,n}$ remains connected. It nevertheless suggests a first algorithm for a partial disconnectivity test within $G_{F,n,k}$: build $DB_{F,n}$ and report its connectivity. It has an overall time complexity in $O(|\Sigma|^{m(F)})$, i.e. no longer exponential in the sequence length $n$, yet still exponential in the length of the forbidden substrings.
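The De Bruijn-based test can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: components are taken as weakly connected components, path length is counted in edges (so a word of length $n$ needs a path of $n - m(F)$ edges, as in the CACCAA example), Kahn's algorithm stands in for the topological sort, and all function names are ours.

```python
import math
from itertools import product


def de_bruijn_graph(alphabet, forbidden):
    """Nodes are the valid m(F)-mers; u -> u[1:] + b is an edge when the target is valid."""
    m = max(len(f) for f in forbidden)
    candidates = ("".join(t) for t in product(alphabet, repeat=m))
    nodes = {u for u in candidates if not any(f in u for f in forbidden)}
    edges = {u: [u[1:] + b for b in alphabet if u[1:] + b in nodes] for u in nodes}
    return m, nodes, edges


def weak_components(nodes, edges):
    """Connected components of the graph with edge directions ignored."""
    adj = {u: set() for u in nodes}
    for u, succs in edges.items():
        for v in succs:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = {start}, [start]
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        components.append(comp)
    return components


def longest_path_or_cycle(comp, edges):
    """math.inf if `comp` contains a cycle, else the edge count of its longest path."""
    indeg = {u: 0 for u in comp}
    for u in comp:
        for v in edges[u]:
            indeg[v] += 1
    ready = [u for u in comp if indeg[u] == 0]
    dist = {u: 0 for u in comp}  # longest path ending at u
    processed = 0
    while ready:
        u = ready.pop()
        processed += 1
        for v in edges[u]:
            dist[v] = max(dist[v], dist[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    # Nodes never released by Kahn's algorithm lie on or after a cycle.
    return math.inf if processed < len(comp) else max(dist.values())


def de_bruijn_test(alphabet, forbidden, n, k):
    """True certifies that G_{F,n,k} is disconnected (Theorem 2); False is inconclusive."""
    m, nodes, edges = de_bruijn_graph(alphabet, forbidden)
    if n < (k + 1) * m:
        raise ValueError("Theorem 2 requires n >= (k + 1) * m(F)")
    productive = [c for c in weak_components(nodes, edges)
                  if longest_path_or_cycle(c, edges) >= n - m]
    return len(productive) >= 2


print(de_bruijn_test("AC", {"ACA", "CAAA", "AAC"}, 8, 1))  # True: AAAA forms its own component
```

Note that the running time depends on the alphabet and on $m(F)$ only; $n$ enters solely through the pruning threshold.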

3.2 Detecting disconnectivity using the Aho-Corasick automaton of F

Next we attempt to exploit the Nerode equivalence, with respect to the suffix language, of some states in $DB_{F,n}$.

Definition 3. Define the Aho-Corasick automaton $AC_F$ as the DFA having state set $Q = \{u \text{ proper prefix of some } f \in F\}$, initial state $q_I = \{\varepsilon\}$, and accepting all words ending in $Q$. Transitions are $\Delta = \Delta_f \uplus \Delta_b$, with:

• $\Delta_f$ the forward edges: $\{(u, a, u.a) \mid a \in \Sigma \wedge u, u.a \in Q\}$ (i.e. the prefix tree of $F$)

• $\Delta_b$ the backward edges: $\{(u, a, v) \mid u.a \notin Q \wedge v \in Q \text{ longest suffix of } u.a\}$

With this definition of $AC_F$, a word $w$ is accepted iff no $f \in F$ is a substring of $w$, i.e. $AC_F$ recognizes the complement language of the usual Aho-Corasick automaton [15]. Moreover, $AC_F$ can be built in time $O(|\Sigma| \times |F| \times m(F))$.
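Definition 3 can be illustrated with a small, deliberately naive construction. The sketch below makes one assumption explicit that the definition leaves implicit: a transition that would complete a forbidden motif is simply absent, so that reading it rejects the word. Function names are ours, and the code favours readability over the construction bound stated above.

```python
def build_ac(alphabet, forbidden):
    """Transition table of AC_F; states are the proper prefixes of the forbidden motifs."""
    states = {f[:i] for f in forbidden for i in range(len(f))}
    delta = {}
    for u in states:
        for a in alphabet:
            ua = u + a
            # Assumption: a move that completes a forbidden motif is left undefined.
            if any(ua.endswith(f) for f in forbidden):
                continue
            # Forward edge when ua is itself a state; otherwise fall back to the
            # longest suffix of ua that is a state (the empty word always qualifies).
            delta[u, a] = next(ua[i:] for i in range(len(ua) + 1) if ua[i:] in states)
    return states, delta


def accepts(delta, word):
    """True iff `word` can be read from the initial state without a missing transition."""
    state = ""
    for a in word:
        if (state, a) not in delta:
            return False
        state = delta[state, a]
    return True


F = {"ACA", "AAC", "CAAA"}
states, delta = build_ac("AC", F)
print(sorted(states))             # ['', 'A', 'AA', 'AC', 'C', 'CA', 'CAA']
print(accepts(delta, "CACCAA"))   # True: no forbidden motif occurs
print(accepts(delta, "CCAAAC"))   # False: contains CAAA
```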

Definition 4. We define:

• $\widehat{AC}_F$ from $AC_F$ by removing states that are no longer visited after $m(F)$ steps;

• $\widehat{AC}_{F,n}$ as the restriction of $\widehat{AC}_F$ to components producing words of length $n$.
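The first item of Definition 4 can be sketched on the forbidden set of Figure 2. Two assumptions are made here: "no longer visited after $m(F)$ steps" is read as "not reachable by any run of at least $m(F)$ transitions from the initial state", and the transition table below was worked out by hand from Definition 3, leaving out transitions that would complete a forbidden motif. The restriction to components producing words of length $n$ is not implemented, and function names are ours.

```python
# Hand-derived transitions of AC_F for F = {ACA, AAC, CAAA} over {A, C}.
# The state "CAA" has no outgoing transition: both letters complete a motif.
delta = {
    ("", "A"): "A",     ("", "C"): "C",
    ("A", "A"): "AA",   ("A", "C"): "AC",
    ("AA", "A"): "AA",
    ("AC", "C"): "C",
    ("C", "A"): "CA",   ("C", "C"): "C",
    ("CA", "A"): "CAA", ("CA", "C"): "AC",
}


def states_after(delta, start, steps):
    """States that can be visited after at least `steps` transitions from `start`."""
    frontier = {start}
    for _ in range(steps):
        frontier = {v for (u, _a), v in delta.items() if u in frontier}
    # Anything reachable from the step-`steps` frontier is visited at a later step.
    kept, stack = set(frontier), list(frontier)
    while stack:
        u = stack.pop()
        for (src, _a), v in delta.items():
            if src == u and v not in kept:
                kept.add(v)
                stack.append(v)
    return kept


def weak_components(kept, delta):
    """Connected components of the automaton restricted to `kept`, ignoring direction."""
    adj = {u: set() for u in kept}
    for (u, _a), v in delta.items():
        if u in kept and v in kept:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        comp, stack = {start}, [start]
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        components.append(comp)
    return components


kept = states_after(delta, "", 4)          # m(F) = 4 for this F
components = weak_components(kept, delta)
print(sorted(kept))                        # ['AA', 'AC', 'C', 'CA', 'CAA']
print(len(components))                     # 2: the state AA is isolated from the rest
```

On this instance the pruned automaton has five states against eight nodes in the De Bruijn graph, while still exposing the isolated all-A words.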


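| $\lvert\Sigma\rvert$ | $m(F)$ | $n$ | #Samples | # disconnected $G_{F,n,1}$ | Recall $DB_{F,n}$ (%) | Recall $\widehat{AC}_{F,n}$ (%) |
|---|---|---|---|---|---|---|
| 2 | 5 | 10 | 100 000 | 36 630 | 49.5 | 47.1 |
| 2 | 5 | 11 | 100 000 | 35 893 | 48.2 | 46.2 |
| 3 | 5 | 10 | 10 000 | 4 395 | 53.9 | 49.2 |
| 4 | 3 | 6 | 25 000 | 9 447 | 37.6 | 34.3 |
| 4 | 3 | 7 | 10 000 | 3 728 | 37.9 | 35.7 |
| 4 | 4 | 8 | 4 000 | 1 904 | 54.3 | 50.1 |

Figure 3. Recall (TP/P) of our disconnectivity tests for various sets of parameters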

As illustrated in Figure 2, grouping together nodes in $DB_F$ having the same prefix/suffix overlaps with forbidden substrings, we get exactly $\widehat{AC}_F$. This equivalence relation and Theorem 2 imply the following:

Theorem 5. $\forall n \ge (k+1) \times m(F)$, one has

$\widehat{AC}_{F,n}$ disconnected $\Rightarrow$ $DB_{F,n}$ disconnected $\Rightarrow$ $G_{F,n,k}$ disconnected.

Again, the first implication is only one-way: $DB_{F,n}$ may be disconnected while $\widehat{AC}_{F,n}$ remains connected. Still, building $\widehat{AC}_{F,n}$ and testing its disconnectivity represents an additional partial disconnectivity test for $G_{F,n,k}$. While this variant is expected to detect fewer cases of disconnectivity, its complexity is significantly better, with the overall construction of $\widehat{AC}_{F,n}$ now only requiring $O(|\Sigma| \times |F| \times m(F))$ time.

4. Results and Discussion

Both our partial tests were executed on randomly generated sets of forbidden substrings with various parameters. Since the connectivity of the Hamming graph $G_{F,n,k}$ had to be checked on every instance to establish a ground truth, tests could only be conducted with $k = 1$ and small values of $n$ and $m(F)$. The recall (#DetectedDisconnections/#Disconnections, or TP/P) results are given in Figure 3.

As expected, the Aho-Corasick-based test always performs slightly worse than the De Bruijn-based one, but not by a large margin (at most ∼5 percentage points) in our empirical experiments.

With such a minimal trade-off in accuracy, the Aho-Corasick-based variant seems to represent a natural first choice in most cases. Recall values range between 35% and 55% for both variants, which is already significant but could probably be improved by exploring subtler relationships between the Aho-Corasick automaton and the Hamming graph.

This preliminary work leaves open several questions of general interest, including:

• What are the shared properties of disconnected instances associated with a connected $\widehat{AC}_{F,n}$? With a connected $DB_{F,n}$?

• Is the problem NP-hard in general?

• How to generalize our constructs to mandatory motifs? To any general automaton generating sequences?

• How to design move sets ensuring connectivity for a given $F$?


References

[1] Ivo Hofacker, Walter Fontana, Peter Stadler, Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2):167–188, Feb 1994.

[2] Alexander Churkin, Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. Design of RNAs: comparing programs for inverse RNA folding. Briefings in Bioinformatics, 19(2):350–358, 01 2017.

[3] Sven Findeiß, Manja Wachsmuth, Mario Mörl, and Peter F Stadler. Design of transcription regulating riboswitches. In Methods in enzymology, volume 550, pages 1–22. Elsevier, 2015.

[4] Ryota Yamagami, Mohammad Kayedkhordeh, David H Mathews, and Philip C Bevilacqua. Design of highly active double-pseudoknotted ribozymes: a combined computational and experimental study. Nucleic acids research, 47(1):29–42, 2018.

[5] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133–148, 1981.

[6] Édouard Bonnet, Paweł Rzążewski, and Florian Sikora. Designing RNA secondary structures is hard. In Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, pages 248–250, 2018.

[7] Joseph N Zadeh, Brian R Wolfe, and Niles A Pierce. Nucleic acid sequence design via efficient ensemble defect optimization. Journal of Computational Chemistry, 32(3):439–52, 2011.

[8] Matan Drory Retwitzer, Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl, and Danny Barash. incaRNAfbinv: a web server for the fragment-based design of RNA sequences. Nucleic acids research, 44(W1):W308–W314, 2016.

[9] Stefan Hammer, Birgit Tschiatschek, Christoph Flamm, Ivo L Hofacker, and Sven Findeiß. RNAblueprint: flexible multiple target nucleic acid sequence design. Bioinformatics, 33(18):2850–2858, 04 2017.

[10] Stefan Hammer, Wei Wang, Sebastian Will, and Yann Ponty. Fixed-parameter tractable sampling for RNA design with multiple target structures. BMC bioinformatics, 20(1):209, 2019.

[11] Vladimir Reinharz, Yann Ponty, and Jérôme Waldispühl. A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):i308–i315, 2013.

[12] Yu Zhou, Yann Ponty, Stéphane Vialette, Jérôme Waldispühl, Yi Zhang, and Alain Denise. Flexible RNA design under structure and sequence constraints using formal languages. In ACM-BCB - ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics - 2013, Bethesda, Washington DC, United States, September 2013.

[13] Vincent Le Gallic, Alain Denise, and Yann Ponty. Résultats algorithmiques pour le design d’ARN avec contraintes de séquence. In SeqBio 2015, pages 26–31, Orsay, France, November 2015.

[14] N. G. De Bruijn. A combinatorial problem. Proc. Koninklijke Nederlandse Academie van Wetenschappen, 49:758–764, 1946.

[15] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, June 1975.
