Foreword
This chapter presents an article in preparation for submission to the journal
Database and ontologies. This database is one of the methods used in this
work and was used to compile the Rfam data to make them easier to use. I
took part in the design of the database and of its website. I also tested the
operation of the database. I did not take part in writing this article.
Database and ontologies
RNAstem: a relational database of stem-loop features in RNA
Pierre-Étienne Cholley 1, Amell El Korbi 1, Jon Antony 2, Michelle Meyer 2* and Jonathan Perreault 1
1 INRS - Institut Armand-Frappier, Laval (Québec), H7V 1B7, Canada.
2 Biology Department, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA.
2.1 Abstract
Motivation: Our understanding of RNA tertiary structures has benefited a great deal from
experimentally determined atomic-resolution structures. Still, the 1102 structures currently
available represent a very small fraction of known RNAs; the Rfam database lists nearly 2000
RNA families and over two million sequences. Many recurring small RNA motifs have been
identified by mining RNA structural data, but a different organization of data is needed to
take full advantage of the wealth of information provided by Rfam and the growing amounts
of data from sequencing projects.
Results: We introduce RNAstem, a web server to study RNA secondary structure.
RNAstem is a MySQL database whose flexible query capabilities allow simple or
global analysis of loops, bulges, internal loops, base-paired regions and multistem junctions,
to look for conservation and covariation of motifs and to discover novel motifs.
2.2 Introduction
Following many major discoveries of the past two decades, RNA is now recognized as
a complex multifunctional molecule. The wealth of available sequence data has been exploited
by computational tools to find hundreds of non-coding RNAs. Many recurring small RNA
motifs have been identified by mining RNA structural data, such as the notoriously stable
GNRA and UNCG tetraloops, the most common RNA loops found in ribosomes (89,90). Such
motifs have been useful for deepening our understanding of RNA structure and in the
development of computational tools for RNA structure prediction. For instance, when folding
a query sequence, the predictive tool Mfold takes into consideration the improvement in
free energy provided by a GNRA tetraloop in addition to that derived from base-pairing (91),
and many computational tools that find novel RNA structures through comparative genomics
rely in part on free-energy calculations (48,75,92-94). More sophisticated prediction tools, like
MC-fold, attempt to consider most of the potential tertiary interactions (95). All of these
approaches rely on data accumulated from the determination of three-dimensional structures.
However, the number of RNA sequences characterized in these studies represents a very small
subset of available functional RNA sequences. There are currently 1102 RNA structures in the
Protein Data Bank (the common repository for biomolecule crystal structures (96)),
derived from fewer than 100 families of RNA. In contrast, the Rfam database of RNA families
(97) consisted of 1973 RNA families in 2009, with secondary structure alignments comprising
over two million sequences.
To aid in the search for such constraints, we have built a database of all the stem-loops
present in the Rfam database, approximately 10,000,000 in total. The features of each
hairpin have been extracted and stored in separate tables, which allows complex queries
across them.
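As an illustration of what extracting the features of a hairpin can involve, the sketch below decomposes a single unbranched stem-loop, given as a sequence and a dot-bracket structure, into its base pairs, bulges, internal loops and terminal loop. It is a minimal sketch and not the pipeline used to build RNAstem: the function name `hairpin_features`, the dot-bracket input and the example sequence are all assumptions made for this example.

```python
def hairpin_features(seq, structure):
    """Decompose one unbranched stem-loop into secondary-structure elements.

    Assumes `structure` is a dot-bracket string describing a single hairpin
    (no branching) with at least one base pair. Hypothetical helper, not
    part of RNAstem.
    """
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    pairs.sort()  # in a single hairpin this orders pairs from outermost to innermost

    features = {"base_pairs": len(pairs), "bulges": [],
                "internal_loops": [], "terminal_loop": ""}
    # Unpaired bases between two consecutive pairs form a bulge (one strand)
    # or an internal loop (both strands).
    for (i, j), (k, l) in zip(pairs, pairs[1:]):
        left, right = seq[i + 1:k], seq[l + 1:j]
        if left and right:
            features["internal_loops"].append((left, right))
        elif left or right:
            features["bulges"].append(left or right)
    # Bases enclosed by the innermost pair form the terminal loop.
    i, j = pairs[-1]
    features["terminal_loop"] = seq[i + 1:j]
    return features


# Made-up example: one bulge (A), one 1x1 internal loop (U/U), GAAA terminal loop.
example = hairpin_features("GGACCUGCGAAAGCUGGCC", "((.((.((....)).))))")
```

Each element returned here corresponds to one kind of record that could be stored in its own table and queried independently.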
2.3 Description
Stem-loop browsing and searches: A user-friendly interface allows casual visitors to
sequence and species, as well as additional information like the overall conservation of that
stem (see Figure 7).
Figure 7. Screenshot of interface.
Search by features and multiple conditions: Users interested in all the stems with a
given feature set can easily access that information. For example, all the stems harbouring a
GNRA tetraloop can be accessed by using the pattern G[ACGU][AG]A in conjunction with
the “terminal loops” option. Other examples of information that can be fetched include the
presence, size or sequence of any secondary structure element: terminal loops, internal loops,
bulges or the stem itself. Queries can combine many different criteria simultaneously, for
example, looking for all stems that have two bulges, one internal loop and a terminal loop of
five bases that contains GGA. Another example is the presence of a D-loop and a T-loop in
the same RNA.
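The GNRA pattern quoted above is an ordinary regular expression, so its behaviour can be checked outside the web interface. The sketch below applies it with Python's `re` module to a few made-up terminal-loop sequences. Requiring the whole loop to match (a loop of exactly four bases) is an assumption of this example; how the RNAstem interface anchors the pattern is not stated here.

```python
import re

# Pattern quoted in the text: G, any base, a purine (A or G), then A.
GNRA = re.compile(r"G[ACGU][AG]A")


def is_gnra_tetraloop(loop_seq):
    """Return True if a terminal-loop sequence is exactly a GNRA tetraloop."""
    return GNRA.fullmatch(loop_seq.upper()) is not None


# Hypothetical terminal-loop sequences, for illustration only.
loops = ["GAAA", "GCGA", "UUCG", "GAUA", "GGAAA"]
gnra_loops = [s for s in loops if is_gnra_tetraloop(s)]
```

Here `GAAA` and `GCGA` are retained, `UUCG` and `GAUA` fail the pattern, and `GGAAA` is rejected only because the whole five-base loop must match.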
Complex queries: A schematic of the database tables is provided so that users familiar with
SQL can write their own queries. This gives access to the full capabilities of the
SQL language without the numerous options that would only clutter the user
interface.
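To give an idea of what such a hand-written query might look like, the sketch below expresses the combined search mentioned earlier (two bulges, one internal loop, and a five-base terminal loop containing GGA) as a single SQL statement. The tables `stem` and `terminal_loop`, their columns and the rows are hypothetical and do not reproduce the actual RNAstem schema; an in-memory SQLite database stands in for the MySQL server so that the example runs on its own.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server. Table and column names
# are hypothetical, not the actual RNAstem schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stem (
    stem_id INTEGER PRIMARY KEY,
    family TEXT,
    n_bulges INTEGER,
    n_internal_loops INTEGER
);
CREATE TABLE terminal_loop (
    stem_id INTEGER REFERENCES stem(stem_id),
    length INTEGER,
    sequence TEXT
);
""")
# Made-up rows, for illustration only.
conn.executemany("INSERT INTO stem VALUES (?, ?, ?, ?)", [
    (1, "famA", 2, 1),
    (2, "famA", 2, 1),
    (3, "famB", 1, 1),
    (4, "famB", 2, 1),
])
conn.executemany("INSERT INTO terminal_loop VALUES (?, ?, ?)", [
    (1, 5, "UGGAC"),
    (2, 4, "GGAA"),
    (3, 5, "AGGAU"),
    (4, 5, "CUUCG"),
])

# Stems with two bulges, one internal loop and a five-base terminal loop
# whose sequence contains GGA.
query = """
SELECT s.stem_id, s.family, t.sequence
FROM stem AS s
JOIN terminal_loop AS t ON t.stem_id = s.stem_id
WHERE s.n_bulges = 2
  AND s.n_internal_loops = 1
  AND t.length = 5
  AND t.sequence LIKE '%GGA%'
ORDER BY s.stem_id
"""
rows = conn.execute(query).fetchall()
```

Only stem 1 satisfies all four conditions in this toy data set. Adding or removing a criterion means editing the `WHERE` clause, with no change to the interface.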
2.4 Implementation
The MySQL database has a web interface implemented in PHP. The server runs on
UNIX. Simple queries are processed and displayed in real time. However, because of the size
of the database and the complexity of some potential queries, the results of larger queries are
sent as compressed files to the email address of registered users.
2.5 Results
To highlight the potential of the database, we performed a query to identify all stems
with at least one putative non-canonical bp (within Rfam aligned sequences). In this search,
we find that the closing stem of most riboswitches is more likely to include a non-canonical bp
than simple stem-loops, in contrast with other well-known RNA families not known to be
“structural switches” (see Table 1). This can be explained by a need for more flexible
structures that allow switching where the aptamer and expression platform portions of
riboswitches overlap, typically the closing stem of the aptamer (98).
Riboswitch   cl. NC  st. NC  cl.-st.  RNA (other)    cl. NC  st. NC  cl.-st.
Mg sensor    0.01    0.15    -0.14    RNaseP bact-b  0.08    0.18    -0.10
SAM-IV       0.02    0.22    -0.20    CPEB3          0.16    0.24    -0.08
Lysine       0.51    0.57    -0.06    RNaseP nuc     0.27    0.36    -0.09
Purine       0.58    0.55    0.03     CoTC ribozyme  0.49    0.56    -0.08
TPP          0.53    0.29    0.25     RNaseP bact a  0.37    0.27    0.09
Cobalamin    0.37    0.20    0.17     Hammerhead 1   0.47    0.34    0.13
AdoCbl(2)    0.61    0.29    0.32     RNase P        0.33    0.18    0.15
MOCO         0.98    0.46    0.52     RNaseP arch    0.48    0.25    0.24
SAM-I-IV     0.43    0.15    0.28     RNase MRP      0.54    0.27    0.28
AdoCbl(3)    0.37    0.12    0.25     Hammerhead 3   0.07    0.03    0.04
SAM          0.48    0.15    0.33     IRES HCV       0.58    0.71    -0.13
THF          0.77    0.20    0.58     U1             0.33    0.59    -0.26
Glycine      0.60    0.11    0.48     U11            0.28    0.47    -0.20
drz-agam-2   0.96    0.18    0.78     U1 yeast       0.11    0.08    0.03
FMN          0.59    0.01    0.58     tRNA           0.20    0.28    -0.08
Avg          0.52    0.24    0.28*    Avg            0.42    0.31    -0.01*
Table 1. Stems with non-canonical bp.
cl. NC, ratio of closing stem sequences with a non-canonical bp over the total number of sequences;
st. NC, same ratio for the other stems; cl.-st., closing stem ratio minus the ratio for the
other stems; Avg, average. *P<0.005.
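The ratios defined in the legend can be illustrated with a short calculation. The sketch below flags a sequence when any position paired in a consensus dot-bracket structure holds a non-canonical pair, then computes cl. NC, st. NC and their difference for made-up sequences. Two points are assumptions of this example and are not taken from the article: Watson-Crick and G-U wobble pairs are treated as canonical, and the sequences are ungapped. The helper names and data are hypothetical.

```python
# Assumption: Watson-Crick pairs and G-U wobble pairs count as canonical.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_table(structure):
    """Map each opening-bracket index to its closing partner (dot-bracket)."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs[stack.pop()] = i
    return pairs


def has_noncanonical_pair(seq, structure):
    """True if at least one paired position holds a non-canonical pair."""
    seq = seq.upper()
    return any(seq[i] + seq[j] not in CANONICAL
               for i, j in pair_table(structure).items())


def nc_ratio(seqs, structure):
    """Fraction of sequences with at least one non-canonical pair."""
    flagged = sum(has_noncanonical_pair(s, structure) for s in seqs)
    return flagged / len(seqs)


# Made-up ungapped sequences sharing one consensus stem, for illustration only.
structure = "(((....)))"
closing_seqs = ["GCAGAAAUGC", "GCAGAAAAGC", "GGAGAAAUUC", "GCCGAAAAGC"]
other_seqs = ["CCUGAAAAGG", "CCUGAAAGGG", "CAUGAAAAGG", "CCGGAAACGG"]

cl_nc = nc_ratio(closing_seqs, structure)  # closing-stem ratio
st_nc = nc_ratio(other_seqs, structure)    # ratio for the other stems
cl_minus_st = cl_nc - st_nc                # value reported in the cl.-st. column
```

With these toy sequences two of four closing-stem sequences and one of four other-stem sequences are flagged, so the difference is positive, in the same direction as most riboswitch rows of Table 1.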
A simpler example of a query is a search for all instances of internal loops consisting
of one “U” on each strand, opened by a CG bp and closed by a GC bp. This motif is found in
type 1 myotonic dystrophy, where a “CUG” repeat is overabundant, leading to the formation of
a very long stem with numerous “U,U” internal loops. Recent work suggests that a molecule
interacting specifically with this motif is a potential treatment for the disease (99). We found
2879 instances of this motif in Rfam, most of which are in 5S rRNA sequences, but not in the
human 5S rRNA. Overall, this is not a common motif, suggesting that such a molecule may
have few off-target effects.
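The motif searched for here can be stated precisely as a 1x1 internal loop: a C-G pair, one unpaired U on each strand, then a G-C pair. The sketch below finds this motif in a sequence with a dot-bracket structure. It is an illustrative assumption about how the search could be written in Python, not the SQL query run against RNAstem, and the function names and the short CUG-repeat hairpin are made up for the example.

```python
def partner_table(structure):
    """Map every paired index to its partner in a dot-bracket structure."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[j] = i
            partner[i] = j
    return partner


def find_uu_loops(seq, structure):
    """Return (i, j) positions of the two unpaired U's of each U/U internal
    loop opened by a C-G pair and closed by a G-C pair."""
    seq = seq.upper()
    partner = partner_table(structure)
    hits = []
    for i, j in partner.items():
        if i > j:
            continue  # visit each pair once, from its 5' side
        # 1x1 internal loop: one unpaired base on each strand, then the
        # next base pair; i + 2 < j - 2 excludes degenerate tiny hairpins.
        is_1x1_loop = (i + 2 < j - 2
                       and i + 1 not in partner
                       and j - 1 not in partner
                       and partner.get(i + 2) == j - 2)
        if (is_1x1_loop
                and seq[i] + seq[j] == "CG"
                and seq[i + 1] == "U" and seq[j - 1] == "U"
                and seq[i + 2] + seq[j - 2] == "GC"):
            hits.append((i + 1, j - 1))
    return sorted(hits)


# Made-up CUG-repeat hairpin with two U/U internal loops.
uu_hits = find_uu_loops("CUGCUGAAAACUGCUG", "(.((.(....).)).)")
```

In this toy hairpin both U/U mismatches are reported, at 0-based positions (1, 14) and (4, 11), mirroring the repeated motif described for the long CUG-repeat stem.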
2.6 Conclusion
RNAstem represents a powerful tool for those interested in studying RNA secondary
structure on a global scale. By providing the basic element of RNA structure, hairpins, it
partially circumvents the problem of comparing different RNA families, thus allowing the user
to look for general features applicable to any RNA.
Acknowledgements
The authors wish to thank colleagues at the INRS for helpful comments.
Funding: Start up funds from the INRS and Boston College. JP is the recipient of a Junior 1
Chapter 3
Identification of common and rare motifs of