• Aucun résultat trouvé

An observation of circular RNAs in bacterial RNA-seq data

N/A
N/A
Protected

Academic year: 2021

Partager "An observation of circular RNAs in bacterial RNA-seq data"

Copied!
6
0
0

Texte intégral

(1)

1Department of Computational Biology, KTH Royal Institute of Technology, AlbaNova University Center, Roslagstullsbacken 17, SE-10691 Stockholm, Sweden

2Combient AB, Nettov¨agen 6, SE-17541, J¨arf¨alla, Sweden

3System Biology Programme, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

4Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7, Avenue des Hauts-Fourneaux, L-4362 Esch-sur-Alzette, Luxembourg

5Department of Information and Computer Science, Aalto University, Konemiehentie 2, FI-02150 Espoo, Finland

Circular RNAs (circRNAs) are a class of RNA with an important role in micro RNA (miRNA) regulation recently discovered in Human and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained fromEnterococcus faecalisand Escherichia coli in a way similar to previous studies performed on eukaryotes. We report observations of circRNAs in RNA-seq data that are reproducible across multiple experiments performed with different protocols or growth conditions.

Circular RNAs (circRNAs) are a type of RNA transcripts the 3’ends of which are ligated to their 5’ends during splicing. They were first dis- covered in plants [1] almost 40 years ago and relatively little studied for several decades. Re- cently, an incidental discovery of a high abun- dance of circRNAs in human [2] and their role in micro RNA (miRNA) regulation [3] attracted the attention of many. Large scale searches for novel circRNAs was performed in animals [4]

and archaea [5] leading to hundreds of previ- ously unknown circularised transcripts. While it has been shown that splicing in bacteria occurs, though much more rarely than in eukaryotes [6], that circRNAs can be synthesized by bacteria in vivo using genetic engineering, and that those transcripts can even direct translations of pro- teins in those organisms, it is commonly believed that all natural transcripts in bacteria are linear [7].

Our starting point is a slightly modified ver- sion of the well established pipeline used by Memczak et al. [4] to analyse our recently pub- lished transcriptome data on E. faecalis v583 [8]. The major difference of our version of the pipeline is that we removed the requirement for break points to be flanked by GU/AG sites, as such a configuration is characteristic of spliceoso- mal splicing, absent in bacteria. As a result, the algorithm reports as potential circRNA all cases where the 3’ region of a read can be mapped up-

stream of its 5’ region and where the sequence on the reference genome near them corresponds to the one in the read insert (Figure 1(a) ).

The discovery pipeline was applied to the transcriptome coined ”IlluminaSt” [SRA acces- sion ID : SRR1770393] in our previous nomen- clature [8], leading to a total of 134 circRNA can- didates (Table S1). Based on these, we built a custom database of junctions encompassing 250 nt on each side of the predicted junctions (Figure 1(a)). This database was subsequently used as reference against which we aligned whole reads directly. After alignment, we counted the num- ber of reads that spanned the junction, and reads extending 8 nt or more on both sides of a junc- tion were termed ”long spanning reads” (Figure 1(b)).

Alignment of the same ”IlluminaSt” tran- scriptome reads yielded an enrichment of long spanning reads (Table S1) caused by relaxed con- straints since anchors in our 50 nt long reads were required to be 20 nt long: a junction needed to be in a region of 10 nt to be identified by the detection algorithm, which, assuming uni- form RNA fragmentation, would happen in 20%

of the cases. Few large deviations from the typi- cally observed two-to-ten-fold enrichment (Table S1) are plausibly explained by fragmentation bi- ases in the sequencing sample preparation [9].

In addition, we mapped reads from four otherE. faecalis V583 transcriptomes [8] named

arXiv:1606.04576v1 [q-bio.GN] 14 Jun 2016

(2)

5’ anchor Insert 3’ anchor

Mapped 5’ anchor Mapped 3’ anchor

Tail Head

Junction

250 nt 250 nt

Read from RNA-seq

Mapping on reference genome

New reference sequence for re-alignment

(a)

8 nt 8 nt

Reads aligned onto the new reference with discovered junctions

Irrelevant alignments Spanning reads

Long spanning reads

(b)

FIG. 1: (a) In the discovery phase, reads are cut in silico into three parts : a region of 20 nt at the 5’end (5’anchor), another of 20 nt at the 3’end (3’anchor) and the middle region containing the remaining nucleotides of the reads. For reads that contain a tail-to-head junction the insert, the discovery pipeline will map the 5’anchor downstream of the 3’anchor and then use the insert sequence to recover the beginning and end of the RNA transcript before circularisation. Once this position has been found, a new reference sequence is created to be used for a second mapping procedure. (b) Schematic representation of reads after aligning to the new reference built during the discovery phase. Reads that map entirely on one side of the junction are irrelevant to our purpose. Each read that maps across the junction votes in favor of its existence, with a higher confidence for reads having a longer sequences on each side.

”KTH”, ”KTHr”, ”Rt” and ”St” [SRA accession IDs: SRR1766416, SRR1769261, SRR1769749, SRR1769750] acquired using different experi- mental protocols, growth conditions and se- quencing platforms. Out of the previously men- tioned 134 candidates, 81 (60%) were confirmed by at least two long spanning reads from at least

two transcriptomes (Table S1).

Removing cases with multiply mapped read fragments and retaining only candidates sup- ported by at least five non-duplicated reads, we extracted a list of 23 transcripts with lengths between 56 and 549 nt for further analysis (Ta- ble S1). Among these, 15 (65%) were found

(3)

with long spanning reads in at least two tran- scriptomes, and only one case (circ 000082 cor- responding to the sRNA ref114) was observed only in ”IlluminaSt”. We then verified that, while antisense transcription was observed in the neighborhood of several candidates, in all but one case no long antisense spanning reads were observed, with the exception of the candidate

”circ 000104”, discussed in more detail below.

We retained circ 000104 because it was found in all transcriptomes analyzed, had the high- est number of long-spanning reads in the KTH transcriptome, and the antisense signal across its junction was about 20 times weaker than the direct signal. In all the 23 cases, coverage signal clearly indicates presence of transcripts across the entire candidate sequences (See ”The ppRNomebrowse ”[12][8]).

In order to estimate the false discovery rate of our procedure, we generated a database of 5000 tail-to-head junctions up to 600 nt apart randomly distributed over the genome with a uniform distribution and assessed the number of long spanning reads found covering those junc- tions. Indeed not a single one was reported while over 95% the database had reads aligned in an irrelevant way near the junction, strengthening the confidence in our analysis of the true tran- scriptomic data.

Of the 23 retrieved circRNA candidates, 11 correspond to short RNAs (sRNAs) previously described, the majority of which having so far uncharacterised functions, 5 overlap with un- translated regions (UTRs) sometimes includ- ing part of a neighbouring coding region, 6 roughly match the locations of annotated genes (including two candidates in the vicinity of EF 0104/arcA) and 1 coincides with a trypto- phan transfer RNA (tRNA-Trp1) (TableS1).

A striking observation is a set of 3 candi- dates (circ 000023, circ 000025 and circ 000104) in a region we recently reported as containing an antisense organisation of two or three ncR- NAs called Ref85A/B and Ref86 [8]. On the for- ward strand, we observes two tail-to-head junc- tions linking coordinates 1.894.992 to 1.895.401 and 1.894.992/-3 to 1.895.183/-6, respectively.

Position 1.894.992 coincides with the mapped transcription start site (TSS) of Ref85A[8]. The

other coordinates correspond to the 3’ of Ref85B and to a position 20 nt upstream of the 3’ of Ref86 on the opposite strand (Figure 2). On the reverse strand, the detected junction links posi- tion 1.895.168 to 1.895.273, i.e., 2 nt upstream of the reported beginning of Ref86 to exactly its reported end.

While the role of such an intricate organiza- tion remains elusive, the above observation clar- ifies that Ref85A and Ref85B originate from the same primary transcript and that Ref86 may in- deed interact with Ref85A/B.

While it is well known that artifacts in the reverse transcription, particularly template switching, may lead to junction reads similar to those resulting from splicing or circularisa- tion, those effects are expected to be random [10]. The presence of non-identical (long) span- ning reads across multiple experiments is an ar- gument against artifactual origin. Moreover, the clear tendency for the candidates to be found in non-protein coding regions (UTRs and ncRNAs) that constitute a minor portion of the genome strengthens the biological origin of our observa- tions.

To verify that the effect is not limited to the particular bacterial strain used or to our exper- imental protocol, we then applied our discov- ery pipeline to 23 publicly available RNA-seq datasets of E. coli. Datasets were taken from the SRA database and chosen arbitrarily among experiments done with single-end strand-specific standard RNA-seq on the Illumina sequencing platform, on strain MG1655 or closely related ones, and preferentially performed at high cov- erage (accession numbers for the datasets and results of the analyses are given in table S2).

From the 23 transcriptomes, we obtained a total of about 20 000 predictions that cluster in 2660 candidate loci (Table S2), indicating that predictions accross different samples lead to sim- ilar positions. Among those 2660 loci, 316 were observed in at least 7 transcriptomes (Table S2).

To provide a global overview, we plotted the density of predictions across all 23 samples in Figure 2 and, in Figure S1, the coverage signal calculated from the reads on which predictions are based for the 316 regions. The two strongest peaks in the plots correspond tocsrBand csrC,

(4)

20 nt

2 nt

Circ_000023 : 1894992 -> 1895401 Circ_000025 : 1894992/-3 -> 1895183/-6

Circ_000104 : 1895273 -> 1895168

FIG. 2: Three head-to-tail junctions detected in a region of the genome ofE. faecaliscontaining previously reported ncRNAs region, suggesting an intricate interaction between the RNA transcripts.

two short non-coding RNAs with the same func- tion : binding multiple copies of the CsrA pro- tein, which is itself an inhibitor of translation, in order to inhibit its function and consequently increase protein synthesis. Both have similar structures, forming many hair-pins on an oth- erwise opened backbone [11]. Smaller peaks in Figure 2 correspond to hdeA and B, an operon coding for two well known proteins in E. coli, yehD, a region annotated as putative protein, and rnpB, the catalytic subunit of RNase P.

The latter is also found circularised in E. fae- calis as candidates circ 000068, and especially circ 000048 which is one the strongest signals discussed above.

To summarize, we discovered a series of po- tential circular RNA candidates in bacteria using publicly available discovery pipeline and RNA- seq data. We have verified that a number of the candidates appear to be reproducible across ex- periments with different RNA-seq protocols and platforms and the case of two or possibly three ncRNAs antisense in E. faecalis to each other showing signs of circularisation in an intricate manner.

A result of our analysis is that most can- didates were not found in coding regions but rather in functional RNA transcripts or UTRs.

In many cases, candidates found were repro- ducible across experiments with good accuracy.

In particular the rnpB sRNA shows one of the strongest signals in bothE. faecalis andE. coli.

We finally remark that in all samples, the signal from putative circRNA was weak compared to

the total amount of RNA detected in the cor- responding regions, suggesting that either the circularized form is rare byproduct of a post- transcriptional process or that the circular form is not well captured by the RNA-seq procedure, for instance by escaping fragmentation.

Authors’ Contributions

A.F.d’H. and N.I. formulated the research problem. N.I., A.F.d’H., and E.A. designed the work. H.-S.N. implemented the computational pipeline. NI and H.-S.N. performed data analy- sis. N.I. and E.A. wrote the paper. All authors critically reviewed the manuscript.

[1] Sanger, H. L., Klotz, G., Riesner, D., Gross, H. J. & Kleinschmidt, A. K. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc. Natl. Acad. Sci. USA 73, 3852–3856 (1976).

[2] Salzman, J., Gawad, C., Wang, P. L., Lacayo, N. & Brown, P. O. Circular RNAs are the pre- dominant transcript isoform from hundreds of human genes in diverse cell types. PLoS ONE 7, e30733 (2012).

[3] Hansen, T. B.et al. Natural RNA circles func- tion as efficient microRNA sponges.Nature495, 384–388 (2013).

[4] Memczak, S. et al. Circular RNAs are a large class of animal RNAs with regulatory potency.

Nature 495, 333–338 (2013).

(5)

Coli

0k 50k 100k 150k 200k 250k

300k 350k 400k

450k 500k

550k 600k

650k 700k

750k 800k

850k 900k

950k 1000k

1050k 1100k 1150k 1200k 1250k 1300k 1350k 1400k 1450k 1500k 1550k 1600k 1650k 1700k 1750k 1800k 1850k 1900k 1950k 2000k 2050k 2100k 2150k 2200k 2250k 2300k 2350k 2400k 2450k 2500k 2550k 2600k 2650k 2700k 2750k 2800k 2850k 2900k 2950k 3000k 3050k 3100k 3150k 3200k 3250k 3300k 3350k 3400k 3450k 3500k 3550k 3600k 3650k

3700k 3750k

3800k 3850k

3900k 3950k

4000k 4050k

4100k 4150k

4200k

4250k4300k4350k 4400k 4450k 4500k 4550k 4600k

csrB

E. Coli

(MG1655)

rnpB csrC

hdeA-B

yehD

Density of predictions Ribosomal RNA loci

FIG. 3: Density of predicted circular RNAs in E. coli for the 23E. colitranscriptomes analysed. The plot uses a linear scale and is normalised to the maximum value across the genome (the peak atcsrB).

[5] Danan, M., Schwartz, S., Edelheit, S. & Sorek, R. Transcriptome-wide discovery of circular RNAs in archaea. Nucleic Acids Res.(2011).

[6] Woodson, S. A. Ironing out the kinks: splic- ing and translation in bacteria. Genes Dev.12, 1243–1247 (1998).

[7] Perriman, R. & Ares, M. Circular mRNA can direct translation of extremely long repeating- sequence proteins in vivo. RNA 4, 1047–1054 (1998).

[8] Innocenti, N. et al. Whole genome mapping of 5’ RNA ends in bacteria by tagged sequencing : A comprehensive view in Enterococcus faecalis.

RNA(2015).

[9] Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-Seq expres- sion estimates by correcting for fragment bias.

Genome Biol.12, R22 (2011).

[10] Jeck, W. R. & Sharpless, N. E. Detecting and characterizing circular rnas. Nat. Biotech. 32, 453–461 (2014).

[11] Gardner, P. P.et al. Rfam: updates to the rna families database.Nucleic Acids Res.37, D136–

D140 (2009).

[12] http://ebio.u-psud.fr/eBIO_BDD.php

(6)

Co li

0k 50k

100k 150k

200k 250k

300k 350k

400k 450k

500k 550k

600k

650k 700k 750k 800k 850k 900k 950k 1000k 1050k 1100k 1150k 1200k 1250k 1300k 1350k 1400k 1450k

1500k 1550k

1600k 1650k

1700k 1750k

1800k 1850k

1900k 1950k

2000k 2050k

2100k 2150k

2200k 2250k 2300k

2350k 2400k 2450k 2500k 2550k 2600k 2650k 2700k 2750k 2800k 2850k 2900k 2950k 3000k 3050k 3100k 3150k 3200k 3250k 3300k 3350k 3400k

3450k

3500k 3550k 3600k 3650k 3700k 3750k 3800k 3850k 3900k 3950k 4000k 4050k 4100k 4150k 4200k 4250k 4300k 4350k 4400k 4450k 4500k 4550k 4600k

csrB

E. Coli (MG1655) rnpB

csrC hdeA-B yehD

Co verage from junction reads Rib osomal RNA lo ci

Figure S1 : Coverage signal calculated from junction reads (column #reads in table S2) for the 316 regions that contain predictions of circular RNAs in at least 7 samples.

Références

Documents relatifs

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

We propose ELECTOR, a novel tool that enables the evaluation of long read correction methods:. I provide more metrics than LRCstats on the correction quality I scale to very long

Même si l’al- lure globale est similaire à celle que l’on a obtenu lors des simulations par éléments finis (voir chapitre 3), les champs de gradients sont très bruités, ceci

Intuitively, averaging the abundance across each unitig makes it possible to ’rescue’ genomic k-mers with low abundance (that tend to be lost by k-mer-spectrum techniques, see

Keywords: Transcriptome assembly, RNA‑seq, Repeats, Alternative splicing, Formal model for representing repeats, Enumeration algorithm, De Bruijn graph topology, Assembly

Lecture : en 2015, dans le regroupement de niveau 1 de la Cris « Métalllurgie et sidérurgie », la rémunération brute moyenne en EQTP des salariés s’élève à 3 815 euros par

The read filter [1] we developed improves on a previous work that detects privacy-sensitive reads [11] in the sense that it is more sensitive and accurate, in particular when

These experiments have been performed on three archaeal species (Sulfolobus solfataricus, Pyrococcus abyssi and Nanoarchaeum equitans). In particular, we will discuss i)