
Compilation of Rfam data and design of the database

Foreword

This chapter presents an article in preparation for submission to the journal Database and ontologies. This database is one of the methods used in this work and made it possible to compile the Rfam data so that they are easier to use. I took part in designing the database and its website. I also tested the operation of the database. I did not take part in writing this article.

Database and ontologies

RNAstem: a relational database of stem-loop features in RNA

Pierre-Étienne Cholley¹, Amell El Korbi¹, Jon Antony², Michelle Meyer²* and Jonathan Perreault¹

¹ INRS - Institut Armand-Frappier, Laval (Québec), H7V 1B7, Canada.

² Biology Department, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA.

2.1 Abstract

Motivation: Our understanding of RNA tertiary structures has benefited a great deal from

experimentally determined atomic resolution structures. Still, the 1102 structures currently

available represent a very small fraction of known RNAs; the Rfam database lists nearly 2000

RNA families and over two million sequences. Many recurring small RNA motifs have been

identified from mining RNA structural data, but a different organization of data is needed to

fully take advantage of the wealth of information provided by Rfam and the growing amounts of data

from sequencing projects.

Results: We introduce RNAstem, a web server to study RNA secondary structure.

RNAstem is a MySQL database with highly flexible querying that allows simple or global analysis of loops, bulges, internal loops, base-paired regions and multistem junctions, in order to look for conservation and covariation of motifs and to discover novel motifs.

2.2 Introduction

Following many major discoveries of the past two decades, RNA is now recognized as

a complex multifunctional molecule. The wealth of available sequence data has been exploited

by computational tools to find hundreds of non-coding RNAs. Many recurring small RNA

motifs have been identified from mining RNA structural data, such as the notoriously stable

GNRA and UNCG tetraloops, the most common RNA loops found in ribosomes (89,90). Such

motifs have been useful for deepening our understanding of RNA structure and in the

development of computational tools for RNA structure prediction. For instance, when folding

a query sequence, the predictive tool Mfold takes into consideration the improvement in free energy provided by a GNRA tetraloop in addition to that derived from base-pairing (91), and

many computational tools that find novel RNA structures through comparative genomics rely

in part on free-energy calculations (48,75,92-94). More sophisticated prediction tools, like

MC-fold, attempt to consider most of the potential tertiary interactions (95). All of these

approaches rely on data accumulated from the determination of three-dimensional structures.

However, the number of RNA sequences characterized in these studies represents a very small

subset of available functional RNA sequences. There are currently 1102 RNA structures in the Protein Data Bank (the common repository for all biomolecule crystal structures (96)), derived from fewer than 100 families of RNA. In contrast, the Rfam database of RNA families

(97) consisted of 1973 RNA families in 2009, with secondary structure alignments comprising

over two million sequences.

To aid in the search for such constraints, we have built a database of all the stem-loops present in the Rfam database, which total approximately 10,000,000 stem-loops. All the features of these hairpins have been extracted and listed in separate tables, together with associated information, to allow complex queries.

2.3 Description

Stem-loop browsing and searches: A user-friendly interface allows casual visitors to

sequence and species, as well as additional information like the overall conservation of that

stem (see Figure 7).

Figure 7. Screenshot of interface.

Search by features and multiple conditions: Users interested in all the stems with a

given feature set can easily access that information. For example, all the stems harbouring a

GNRA tetraloop can be accessed by using the pattern G[ACGU][AG]A in conjunction with

the “terminal loops” option. Other examples of information that can be fetched include the

presence, size or sequence of any secondary structure element: terminal loops, internal loops,

bulges or the stem itself. Queries can combine many different criteria simultaneously, for example, looking for all stems that have two bulges, one internal loop and a terminal loop of five bases that contains GGA. Other examples include the presence of a D-loop and a T-loop in

the same RNA.
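
As an illustration of the pattern-based search described above, the following is a minimal sketch in Python, not the RNAstem implementation. It applies the pattern G[ACGU][AG]A quoted in the text to a handful of made-up terminal-loop sequences. The function name `is_gnra_tetraloop` and the example sequences are assumptions introduced here for illustration only.

```python
import re

# Pattern quoted in the text: G, any base, a purine (A or G), then A.
GNRA = re.compile(r"G[ACGU][AG]A")


def is_gnra_tetraloop(loop_seq: str) -> bool:
    """Return True if the whole terminal loop is a four-base GNRA match.

    Hypothetical helper: fullmatch is used so that longer loops that merely
    contain a GNRA-like substring are not reported as tetraloops.
    """
    return GNRA.fullmatch(loop_seq.upper()) is not None


# Made-up terminal-loop sequences, for illustration only.
loops = ["GAAA", "GCGA", "UUCG", "GAAAU", "gtga"]
hits = [seq for seq in loops if is_gnra_tetraloop(seq)]
print(hits)
```

Using `fullmatch` rather than `search` is a design choice of this sketch: it restricts hits to loops of exactly four bases, which corresponds to combining the pattern with a loop-size condition.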

Complex queries: A schematic of the database tables is provided for users familiar with

SQL to allow them to write their own queries. We therefore provide the full capabilities of the SQL language, without the need for numerous options that would only add confusion to a user

interface.
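
To give an idea of what such a user-written query could look like, here is a hedged sketch of the multi-criteria example mentioned earlier (two bulges, one internal loop, a five-base terminal loop containing GGA). The actual RNAstem schema is not reproduced in this chapter, so the table `stem` and all of its columns are hypothetical. An in-memory SQLite database accessed from Python stands in for the MySQL server, purely so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names; the real RNAstem schema may differ.
conn.execute(
    """
    CREATE TABLE stem (
        stem_id           INTEGER PRIMARY KEY,
        family            TEXT,
        n_bulges          INTEGER,
        n_internal_loops  INTEGER,
        terminal_loop_seq TEXT
    )
    """
)

# Made-up rows, for illustration only.
rows = [
    (1, "famA", 2, 1, "UGGAC"),
    (2, "famA", 2, 1, "GAAA"),
    (3, "famB", 1, 1, "AGGAU"),
    (4, "famB", 2, 1, "CGGAA"),
    (5, "famC", 2, 0, "GGAUU"),
]
conn.executemany("INSERT INTO stem VALUES (?, ?, ?, ?, ?)", rows)

# Two bulges, one internal loop, five-base terminal loop containing GGA.
query = """
    SELECT stem_id, family, terminal_loop_seq
    FROM stem
    WHERE n_bulges = 2
      AND n_internal_loops = 1
      AND LENGTH(terminal_loop_seq) = 5
      AND terminal_loop_seq LIKE '%GGA%'
    ORDER BY stem_id
"""
matches = conn.execute(query).fetchall()
print(matches)
```

The `WHERE` clause uses only standard comparison, `LENGTH` and `LIKE` constructs, so a similar statement should carry over to MySQL once the real table and column names are substituted.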

2.4 Implementation

The MySQL database has a web interface implemented in PHP. The server runs on UNIX. Simple queries are processed and displayed in real time. However, because of the size of the database and the complexity of some potential queries, the results of more demanding queries are sent as compressed files to the email address of registered users.

2.5 Results

To highlight the potential of the database, we performed a query to identify all stems

with at least one putative non-canonical bp (within Rfam aligned sequences). In this search,

we find that the closing stem of most riboswitches is more likely to include a non-canonical bp

than simple stem-loops, in contrast with other well-known RNA families not known to be “structural switches” (see Table 1). This can be explained by a need for more flexible

structures that allow switching where the aptamer and expression platform portions of

riboswitches overlap, typically the closing stem of the aptamer (98).

Riboswitch    cl. NC   st. NC   cl.-st.   RNA (other)     cl. NC   st. NC   cl.-st.
Mg sensor     0.01     0.15     -0.14     RNaseP bact-b   0.08     0.18     -0.10
SAM-IV        0.02     0.22     -0.20     CPEB3           0.16     0.24     -0.08
Lysine        0.51     0.57     -0.06     RNaseP nuc      0.27     0.36     -0.09
Purine        0.58     0.55     0.03      CoTC ribozyme   0.49     0.56     -0.08
TPP           0.53     0.29     0.25      RNaseP bact a   0.37     0.27     0.09
Cobalamin     0.37     0.20     0.17      Hammerhead 1    0.47     0.34     0.13
AdoCbl(2)     0.61     0.29     0.32      RNase P         0.33     0.18     0.15
MOCO          0.98     0.46     0.52      RNaseP arch     0.48     0.25     0.24
SAM-I-IV      0.43     0.15     0.28      RNase MRP       0.54     0.27     0.28
AdoCbl(3)     0.37     0.12     0.25      Hammerhead 3    0.07     0.03     0.04
SAM           0.48     0.15     0.33      IRES HCV        0.58     0.71     -0.13
THF           0.77     0.20     0.58      U1              0.33     0.59     -0.26
Glycine       0.60     0.11     0.48      U11             0.28     0.47     -0.20
drz-agam-2    0.96     0.18     0.78      U1 yeast        0.11     0.08     0.03
FMN           0.59     0.01     0.58      tRNA            0.20     0.28     -0.08
Avg           0.52     0.24     0.28*     Avg             0.42     0.31     -0.01*

Table 1. Stems with non-canonical bp.

cl. NC, ratio of closing stem sequences with non-canonical bp over total number of sequences;

st. NC, same ratio for the other stems; cl.-st., closing stem ratio from which the ratio of the

other stems is subtracted; Avg, average. *P<0.005.
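
The ratios defined in the legend can be sketched as follows. This is an illustrative Python example, not the query actually run against RNAstem. It assumes that each stem is available as a list of (5' base, 3' base) pairs and that G-U wobble pairs count as canonical; the source does not state either point. The names `has_noncanonical` and `nc_ratio` and the example data are hypothetical.

```python
# Pairs treated as canonical in this sketch. Counting G-U wobble pairs as
# canonical is an assumption; the source does not specify it.
CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def has_noncanonical(pairs):
    """True if at least one base pair of the stem is not in CANONICAL."""
    return any((five, three) not in CANONICAL for five, three in pairs)


def nc_ratio(stems):
    """Fraction of stems (one per sequence) with a non-canonical bp."""
    if not stems:
        return 0.0
    return sum(has_noncanonical(stem) for stem in stems) / len(stems)


# Made-up data: one closing stem and one other stem for four sequences.
closing = [
    [("G", "C"), ("A", "A")],
    [("G", "C"), ("A", "U")],
    [("C", "G"), ("G", "A")],
    [("G", "U"), ("C", "G")],
]
other = [
    [("G", "C"), ("C", "G")],
    [("A", "U"), ("U", "A")],
    [("G", "C"), ("U", "U")],
    [("C", "G"), ("G", "C")],
]

cl_nc = nc_ratio(closing)   # ratio for closing stems ("cl. NC")
st_nc = nc_ratio(other)     # ratio for the other stems ("st. NC")
difference = cl_nc - st_nc  # the "cl.-st." column
print(cl_nc, st_nc, difference)
```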

A simpler example of a query is a search for all instances of internal loops consisting

of one “U” on each strand, opened by a CG bp and closed by a GC bp. This motif is found in

type 1 myotonic dystrophy where a “CUG” repeat is overabundant, leading to the formation of

a very long stem with numerous “U,U” internal loops. Recent work suggests that a molecule interacting specifically with this motif is a potential treatment for the disease (99). We found

2879 instances of this motif in Rfam, most of which are in 5S rRNA sequences, but not in the

human 5S rRNA. Overall, this is not a common motif, suggesting that the molecule will have

few off-target effects.
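
The motif described above can also be searched for directly in a single folded sequence. The sketch below is a hypothetical Python illustration, not the RNAstem query. It assumes that the secondary structure is given in dot-bracket notation, a representation the source does not mention, and it reports 1x1 internal loops with a U on each strand, flanked by a C-G pair on the outside and a G-C pair on the inside. The function names and the short CUG-repeat example are made up for illustration.

```python
def pair_table(structure):
    """Map each position to its pairing partner (None if unpaired)."""
    partner = [None] * len(structure)
    stack = []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            left = stack.pop()
            partner[left], partner[idx] = idx, left
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def find_uu_loops(seq, structure):
    """Return 0-based (i, j) positions of the two unpaired U's of each
    U,U internal loop opened by a C-G pair and closed by a G-C pair."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    seq = seq.upper()
    partner = pair_table(structure)
    hits = []
    for i, j in enumerate(partner):
        if j is None or j < i:
            continue  # unpaired, or pair already seen from its 5' side
        if i + 2 >= j - 2:
            continue  # no room for an inner base pair
        if partner[i + 1] is not None or partner[j - 1] is not None:
            continue  # the two loop positions must be unpaired
        if partner[i + 2] != j - 2:
            continue  # loop must be closed by a single inner pair (1x1)
        if (
            (seq[i], seq[j]) == ("C", "G")
            and seq[i + 1] == seq[j - 1] == "U"
            and (seq[i + 2], seq[j - 2]) == ("G", "C")
        ):
            hits.append((i + 1, j - 1))
    return hits


# Made-up hairpin built from two CUG repeats on each strand.
example_seq = "CUGCUGAAAACUGCUG"
example_str = "(.((.(....).)).)"
uu_loops = find_uu_loops(example_seq, example_str)
print(uu_loops)
```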

2.6 Conclusion

RNAstem represents a powerful tool for those interested in studying RNA secondary

structure on a global scale. By providing the basic element of RNA structure, hairpins, it

partially circumvents the problem of comparing different RNA families, thus allowing the user

to look for general features applicable to any RNA.

Acknowledgements

The authors wish to thank colleagues at the INRS for helpful comments.

Funding: Start-up funds from the INRS and Boston College. JP is the recipient of a Junior 1

Chapter 3

Identification of common and rare motifs of
