TRACTOR_DB: a database of regulatory networks in gamma-proteobacterial genomes.

(1)

TRACTOR_DB: a database of regulatory networks

in gamma-proteobacterial genomes

Abel D. Gonza´lez, Vladimir Espinosa, Ana T. Vasconcelos

1

,

Ernesto Pe´rez-Rueda

2

and Julio Collado-Vides

3,

*

National Bioinformatics Center, Industria y San Jose´, Capitolio Nacional, CP. 10200, Habana Vieja, Habana, Cuba,

1

National Laboratory for Scientific Computing, Avenue Getulio Vargas 333, Quitandinha, CEP 25651-075, Petropolis,

Rio de Janeiro, Brazil,2Depto. de Ingenierı`a Celular y Biocata´lisis, IBT-UNAM, Cuernavaca, Morelos, Mexico and

3

Center of Genomics, UNAM, AP 565-A Cuernavaca, CP. 62100, Morelos, Mexico Received August 6, 2004; Revised and Accepted October 1, 2004

ABSTRACT

Experimental data on the Escherichia coli

transcriptional regulatory system has been used in the past years to predict new regulatory elements (promoters, transcription factors (TFs), TFs’ binding sites and operons) within its genome. As more gen-omes of gamma-proteobacteria are being sequenced, the prediction of these elements in a growing number of organisms has become more feasible, as a step towards the study of how different bacteria respond to environmental changes at the level of transcriptional regulation. In this work, we present TRACTOR_DB (TRAnscription FaCTORs’ predicted binding sites in prokaryotic genomes), a relational data-base that contains computational predictions of new members of 74 regulons in 17 gamma-proteobacterial genomes. For these predictions we used a compar-ative genomics approach regarding which several proof-of-principle articles for large regulons have been published. TRACTOR_DB may be currently accessed at http://www.bioinfo.cu/Tractor_DB, http:// www.tractor.lncc.br/ or at http://www.cifn.unam.mx/ Computational_Genomics/tractorDB. Contact Email id is tractor@cifn.unam.mx.

INTRODUCTION

One of the challenges of Functional Genomics is the identifi-cation of all the elements that take part in an organism’s transcriptional regulatory network. This is necessary to under-stand how the cell reacts to environmental stimuli at the level of transcriptional regulation. Intense research is being carried

out in this direction (1–3). The first step towards this goal is the recognition of all the genes regulated by a transcription factor (TF), i.e. its regulon.

Computational approaches to recognizing the location of regulatory sites in bacterial genomes include the use of weight matrices (4), phylogenetic footprinting (5), searching for statistical overrepresentation of oligonucleotides within a genome and clustering co-expressed genes in order to find conserved patterns in their upstream regions (6,7), among

others (8,9). Recently, Tanet al. (10) proposed a new

method-ology which brings together the advantages of both, the weight matrices and the phylogenetic footprinting approaches.

In the past few years, a great amount of research has been dedicated to computational prediction of important regulatory

elements in the Escherichia coli genome: promoters (11),

operons (12), TFs (13) and TF binding sites (9). As more bacterial genomes are sequenced, it is becoming more impor-tant to extend these efforts to other organisms, and decipher their transcriptional regulatory networks by means of com-parative regulatory studies (10,14–19).

Our two major goals in this work were the production of a reliable set of binding site predictions for as many gamma-proteobacterial TFs as possible in 17 organisms of this division

[E.coli K12 (NC_000913), Haemophilus influenzae

(NC_000907), Salmonella typhi (NC_003198), Salmonella

typhimurium LT2 (NC_003197), Shewanella oneidensis

(NC_004347), Shigella flexneri 2a (NC_004337), Vibrio

cholerae (NC_002505), Yersinia pestis KIM (NC_004088), Buchnera aphidicola (NC_004545), Pseudomonas aeruginosa

(NC_002516),Pseudomonas syringae (NC_004578),

Pasteur-ella multocida (NC_002663), Pseudomonas putida KT2440

(NC_002947),Vibrio parahaemolyticus (NC_004603), Vibrio

vulnificus CMCP6 (NC_004459), Xanthomonas axonopodis

(NC_003919) and Xylella fastidiosa (NC_002488)], and

the construction of a database (TRACTOR_DB, accessible

at http://www.bioinfo.cu/Tractor_DB, http://www.tractor.

*To whom correspondence should be addressed. Tel:+527 773 132063; Fax: +527 773 175581; Email: collado@ccg.unam.mx

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions@oupjournals.org. ª 2005, the authors

at University of Luxembourg on May 12, 2014

http://nar.oxfordjournals.org/

(2)

lncc.br and http://www.cifn.unam.mx/Computational_ Genomics/tractorDB) with a user-friendly navigation interface containing these computationally predicted sites. Several recent papers have addressed the first goal; however, to our knowledge, the present work is the most complete in terms of the number of regulons and organisms that it comprises. Most

other studies have been limited toE.coli and H.influenzae (10)

or to one or few regulons (18,19). In contrast, the work by

McCueet al. (20) predicted a large number of putative

reg-ulatory sites in 10 gamma-proteobacterial genomes, but could not associate many of those sites to a given TF.

METHODS

Selecting organisms and regulons

We used the main ideas of the methodology proposed by Tan et al. (10) to predict new regulon members in 17 gamma-proteobacterial genomes. Orthology information is used in this methodology to assess the biological significance of putative regulon members. The methodology requires that

the organisms selected be phylogenetically close to E.coli,

since E.coli known binding sequences are used to build a

statistical model of a TF’s binding site, and this model is subsequently used to predict new members of that regulon in the other organisms within the study. Hence, we included in our study a group of organisms from different subdivisions of the gamma-proteobacteria subclass whose genomes were completely sequenced. We started working with all TFs with

at least one binding site known inE.coli.

Outline of the predictive methodology

Original training sets containE.coli TFs’ binding sequences

extracted from RegulonDB, version 4.0 (21). The first eight organism referenced in the list above (those sharing at least

30% of orthologous genes withE.coli) were used to construct

the original training sets. A training set was built for each TF

with at least one known binding sequence inE.coli (21) and it

included also those of orthologous non-coding regions when

less than 25 binding sequences were known inE.coli. We built

weight matrices only for TFs with training sets larger than four sequences. Two statistical models were built for each TF: one using CONSENSUS, and the second using the Gibbs-SAMPLER. The training sets were filtered twice as proposed

by Tan et al. (10) to eliminate possible weak binding

sequences and cutoff values were selected in accordance with their methodology.

The model built using CONSENSUS was aligned against

TUs promoter regions (from 400 to+50) using PATSER (4),

setting the lower threshold of the program to the weak cutoff value; the DSCAN software was used to scan these regions with the model built using the Gibbs-SAMPLER (22). The sets of putative sites thus produced by CONSENSUS and Gibbs-SAMPLER were merged and subsequently filtered using orthology information and score (10).

New training sets were built rescuing putative sites predicted for each TF in all organisms. For those TFs that

produced more than four putative sites in E.coli (after the

orthology filtering) and in at least one other organism, a train-ing set was built for each separate organism with more than

four putative sites. These sets were then used to build a specific model for each organism using CONSENSUS, which were then used to re-calculate the cutoffs and to re-scan the reg-ulatory regions of each organism.

Summing up, the main features included in the approach

proposed by Tanet al. (10) in order to extend it to as many TFs

as possible were as follows: (i) the use of two different algo-rithms to build the statistical models of each TF binding site: the CONSENSUS (4) and the Gibbs-SAMPLER (22); (ii) the inclusion, within the training set used to build the model of each TF, of non-coding regions upstream the TUs of

the organisms other thanE.coli that are orthologous to those

that in E.coli are known as regulon members (orthologous

non-coding regions), along with E.coli known binding

sequences of the TF; and (iii) the reconstruction of the models for each TF in each organism after the prediction process, which allows the refinement of the search for new members of the regulon within each genome.

All the sites found for a given TF—in the eight initial organisms—after the second scanning using the rebuilt models were aligned, using CONSENSUS, to produce a Positional Weight Matrix (PWM). Those PWMs were then used to scan the genomes of the other nine organisms for new putative binding sites. The orthology filtering process was done as described in the Supplementary Material, Section I.5, using only the first eight organisms at the centre of the analysis.

For a thorough description of the predictive methodo-logy see Section I.(1–6) and Figure 1 of the Supplementary Material.

Designing and building the database

The database was designed and built following the relational model, and installed on a MySQL server. The web interface is managed by a cgi PERL script that does all the work, from querying the database at the user’s request, to generating the dynamic web pages that form the interface. The design that we have adopted makes it very easy to incorporate new instances to the database, which may be very easily accommodated into the interface. Figure 1 shows the main relationships between the tables that compose TRACTOR_DB and the most impor-tant queries carried out by the interface program.

Several links in the dynamically generated web pages that form the interface make the navigation easy and user-friendly. The dynamic pages are linked to other databases such as RegulonDB (21), EcoCyc (23) and the NCBI database. All the data stored in the database may be downloaded in the form of flat files and the complete system will be available for installation upon request in the near future.

RESULTS

Putative new regulon members by organism

Table 1 of the Supplementary Material shows the number of members (TUs) found in all organisms for each of the 74 regulons for which a statistical model was built. (In those cases for which the model of the binding site could be rebuilt, the results shown correspond to the search done with that

rebuilt model.) For those regulon members found in E.coli,

predictions are compared to regulon members (TUs) annotated

Nucleic Acids Research, 2005, Vol. 33, Database issue D99

(3)

in RegulonDB (shown in column one), in order to assess the sensitivity of our search for new regulon members. Regulons may be grouped in three different clusters based on this sen-sitivity. First, there is a set of six regulons for which 50–80% of the known TUs are recognized by the methodology. This group contains several global regulators (24), and its most prominent members are CRP (75% of member TUs recog-nized) and FNR (65%). This improves the results obtained

by Tan et al. (10) (44.6 and 46.2%, respectively), which

are mostly due to the incorporation of new organisms into the study. The failure in detecting the remaining quarter of CRP, and one-third of FNR regulated TUs is mostly due to the restriction imposed by the weak cutoff. Most of the sites that the methodology does not recognize score below this cutoff. The second group comprises 63 regulons, with few (<10) TU

members known in E.coli. For these, most TU members are

recognized by the methodology (from one-half to all). Finally, there is a group of five regulons for which less than that one-half of the members are identified, in most cases due to the failure in producing a good model of the binding site.

Our results recognizing E.coli known regulon members are

comparable to those of McCueet al. (20) for small regulons

and are better for large regulons.

For some regulon–organism combinations no putative sites are found although the TF is found in that particular organism. This happens mainly for small regulons, those for which there is a greater variation in the sequence of the recognition helix from genome to genome (16), and thus, for which a greater variation in the recognition sequence from one organism to the next is expected.

The number of regulons for which our methodology could predict sites is significantly low in the organisms more distant

from E.coli (i.e. Xanthomonas axonopodis and Buchnera

aphidicola, Table 1). This is mainly due to the limitations that impose the methodology which rely on orthology relation-ships between the organisms under study.

The database

We designed the web interface TRACTOR_DB so that it may be accessed and browsed in four different ways or modes. The first mode may be called one-gene-one-genome; it displays all the sites found for a TU of any organism. The user selects this mode by entering a gene name or gi and ticking an organism’s name. In the second mode (one-gene-multigenome), the user

enters a gene name or gi and ticks the all_organisms

Figure 1. TRACTOR_DB design. Main relationships between tables (oval arrows), and main queries performed by the interface program (stealth arrows).

(4)

option: the interface then displays the orthology information

that supports all sites predicted upstream theE.coli TU where

that gene is located. The third mode (one-TF-one-genome) displays all predicted members of the TF regulon in one organ-ism. To access this mode, the user ticks an organism’s name and selects a TF name from the menu next to it. The last mode, which may be called one-TF-multigenome, displays the infor-mation regarding the conservation of a simple or complex E.coli regulon (a set of genes regulated by the combination of two or more TFs) across all the other 16 genomes.

The current release of TRACTOR_DB (1.0) contains com-putational predictions of new members of 74 regulons in the 17 gamma-proteobacterial genomes referenced above. Future releases will include transcription binding site predictions for new gamma-proteobacteria whose genomes are already sequenced and others which are currently in progress. The update process will be carried out as new genomes are available approximately once every six months. Table 1 summarizes the information available in the current release of TRACTOR_DB.

DISCUSSION

To fulfill the identification of new members of as many known gamma-proteobacterial regulons as possible in eight organ-isms of this group, we employed the main ideas developed

by Tanet al. (10) in their approach to predict new members of

regulons, as an effective way to reduce the high false positives rate inherent to all computational approaches used to predict regulatory signals. Moreover, we incorporated several exten-sions into that methodology in order to include as many reg-ulons as possible to the study, including all with at least one

known binding sequence in E.coli. Finally, we found new

putative binding sites in E.coli for 74 out of 102 TFs with

binding sites annotated in RegulonDB, and for 42 of them, we were able to rebuild the model of the binding site and tune the search in each genome. The sensitivity of this extended

methodology inE.coli proved very good (Table 1 of

Supple-mentary Material). In addition, we predicted new members of regulons in the other 16 gamma-proteobacterial genomes

under study, in numbers ranging from two regulons in X.axonopodis to 32 in S.flexneri.

The most important result of this work is the establishment of TRACTOR_DB, a relational database that gives access, through a user-friendly browsing interface, to our comput-ationally predicted members of proteobacterial regulons. The design of the database makes it very easy to include new computationally predicted regulon members in other gamma-proteobacterial genomes and/or using different pre-dictive approaches. TRACTOR_DB should aid experiment-alist researchers in the analysis of microarray results or in any other situation when a piece of information regarding the possible regulation of a gene or set of genes may be of help (orienting researchers about which genes are more likely to be regulated by a given TF or set of TFs).

TRACTOR_DB may also be regarded as a starting point in the understanding of the organization of the transcriptional

regulatory network of bacteria other thanE.coli, given the fact

that—to our knowledge—this is the largest set of putative regulon members in gamma-proteobacterial genomes from the point of view of number of regulons and organisms.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

ACKNOWLEDGEMENTS

We thank Dr Marcelo T. dos Santos (LNCC) for his important contribution to this work, Dr Gabriel Moreno-Hagelsieb for fruitful discussions, and Fernanda Mendonc¸a (LNCC),

Roger Paix~aao (LNCC) and Heladia Salgado (CCG) for their

contribution to the database interface. This work was partly funded by a collaboration project on Bioinformatics between Cuba and Brazil supported by CNPq/MCT. J.C.-V. acknowl-edges support from grants 0028 from Conacyt and GM62205 from NIH.

REFERENCES

1. Buhler,N.E., Gerland,U. and Hwa,T. (2003) On schemes of combinatorial

transcription logic.Proc. Natl Acad. Sci. USA, 100, 5136–5141.

2. Cases,I., de Lorenzo,V. and Ouzounis,C.A. (2003) Transcription

regulation and environmental adaptation in bacteria.Trends Microbiol.,

11, 248–253.

3. Thieffry,D., Huerta,A.M., Peerez-Rueda,E. and Collado-Vides,J. (1998)

From specific gene regulation to genomics networks: a global analysis of

transcriptional regulation inEscherichia coli. Bioessays, 20, 433–440.

4. Hertz,G.Z. and Stormo,G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563–577.

5. Thompson,W., Rouchka,E.C. and Lawrence,C.E. (2003) Gibbs recursive

sampler: finding TF binding sites.Nucleic Acids Res., 31, 3580–3585.

6. Qian,J., Lin,J., Luscombe,N.M., Yu,H. and Gerstein,M. (2003) Prediction of regulatory networks: genome-wide identification of TF

targets from gene expression data.Bioinformatics, 19, 1917–1926.

7. Segal,E., Yelensky,R. and Koller,D. (2003) Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19, I273–I282.

8. Li,H., Rhodius,V., Gross,C. and Siggia,E.D. (2002) Identification of the

binding sites of regulatory proteins in bacterial gnomes.Proc. Natl Acad.

Sci. USA, 99, 11772–11777. Table 1. Total number of predictions—supported by orthology or scoring

above the strong cutoff—and TFs found in each organism

Organism Total predictions Number of TFs

E.coli K12 853 74 S.typhi 725 23 S.typhimurium LT2 752 26 S.flexneri 2a 658 32 Y.pestis KIM 354 11 H.influenzae 140 5 V.cholerae 234 7 S.oneidensis 285 6 B.aphidicola 7 3 P.aeruginosa 22 11 P.putida KT2440 17 9 P.syringae 16 9 V.parahaemolyticus 141 28 V.vulnificus CMCP6 112 23 X.axonopodis 3 2 X.campestris 5 4 X.fastidiosa 7 4

Nucleic Acids Research, 2005, Vol. 33, Database issue D101

(5)

9. Thieffry,D., Salgado,H., Huerta,A.M. and Collado-Vides,J. (1998) Prediction of transcriptional regulatory sites in the complete

genome sequence ofEscherichia coli K12. Bioinformatics, 14,

391–400.

10. Tan,K., Moreno-Hagelsieb,G., Collado-Vides,J. and Stormo,G. (2001) A comparative genomics approach to prediction of new members of

regulons.Genome Res., 11, 566–584.

11. Huerta,A.M. and Collado-Vides,J. (2003) Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping

promoter-like signals.J. Mol. Biol., 333, 261–278.

12. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J.

(2000) Operons inEscherichia coli: Genomic analyses and predictions.

Proc. Natl Acad. Sci. USA, 97, 6652–6657.

13. Peerez-Rueda,E. and Collado-Vides,J. (2000) The repertoire of

DNA-binding transcriptional regulators inEscherichia coli K12.

Nucleic Acids Res., 28, 1838–1847.

14. Mirny,L.A. and Gelfand,M.S. (2002) Structural analysis of conserved

base pairs in protein–DNA complexes.Nucleic Acids Res., 30,

1704–1711.

15. Moreno-Hagelsieb,G. and Collado-Vides,J. (2002) A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics, 18, S329–S336.

16. Rajewsky,N., Socci,N.D., Zapotocky,M. and Siggia,E.D. (2002) The evolution of DNA regulatory regions for proteo-gamma

bacteria by interspecies comparisons.Genome Res., 12,

298–308.

17. Peerez-Rueda,E. and Collado-Vides,J. (2001) A common origin of

transcriptional repression by helix-turn-helix proteins in the context of

the evolution of regulatory families in Archea and Eubacteria. J. Mol. Biol., 53, 172–179.

18. Panina,E.M., Mironov,A.A. and Gelfand,M.S. (2001) Comparative

analysis of FUR regulons in gamma-proteobacteria.Nucleic Acids Res.,

29, 5195–5206.

19. Erill,I., Escribano,M., Campoy,S. and Barbee,J. (2003)In silico

analysis reveals substantial variability in the gene contents of the

gamma proteobacteria LexA-regulon.Bioinformatics, 19,

2225–2236.

20. McCue,L., Thompson,W., Carmack,C., Ryan,M.P., Liu,J.S., Derbyshire,V. and Lawrence,C.E. (2001) Phylogenetic footprinting of

TF binding sites in proteobacterial genomes.Nucleic Acids Res.,

29, 774–782.

21. Salgado,H., Gama-Castro,S., Martıńez-Antonio,A., Dıáz-Peredo,E., Sańchez- Solano,F., Peralta-Gil,M., Garcia-Alonso,D.,

Jimeenez-Jacinto,V., Santos-Zavaleta,A., Bonavides-Martı´nez,C. and

Collado-Vides,J. (2004) RegulonDB (version 4.0): transcriptional

regulation, operon organization and growth conditions inEscherichia coli

K-12.Nucleic Acids Res., 32, 303–306.

22. Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling

strategy for multiple alignment.Science, 262, 208–214.

23. Karp,P.D., Arnaud,M., Collado-Vides,J., Ingraham,J., Paulsen,I.T. and

Saier,M.H.,Jr (2004) TheE. coli EcoCyc Database: no longer just a

metabolic pathway database.ASM News, 70, 25–30.

24. Martı´nez-Antonio,A. and Collado-Vides,J. (2003) Identifying global

regulators in transcriptional regulatory networks in bacteria.Curr. Opin.

Microbiol., 6, 482–489.