PREDetector : Prokaryotic Regulatory Element Detector

Samuel Hiard (1), Sébastien Rigali (2), Séverine Colson (2), Raphaël Marée (1) and Louis Wehenkel (1)

(1) Bioinformatics and Modeling, GIGA & Department of Electrical Engineering and Computer Science – University of Liège, Sart-Tilman B28, Liège, Belgium
(2) Centre for Protein Engineering – University of Liège, Sart-Tilman B6, Liège, Belgium

The weight matrix based approach

Transcription factor binding sites usually vary slightly in sequence. A position weight matrix summarizes the information contained in an alignment of binding site sequences. It also makes it possible to predict the occurrence of new sites and to estimate how efficiently a transcription factor binds them.

The generation of a position weight matrix starts with the alignment of the experimentally validated DNA motifs of a specific transcription factor.

Multiple alignment

A C G T C

A C G G T

C C G C T

The multiple alignment is then converted into an alignment matrix that records how many times nucleotide i was observed at position j of the alignment.
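The counting step can be sketched in a few lines of Python. This is a minimal illustration using the three aligned sites shown above; the function name `alignment_matrix` and the dictionary layout are assumptions made for this example, not part of PREDetector.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def alignment_matrix(sites):
    """Count how many times each nucleotide occurs at each alignment position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    # zip(*sites) yields one tuple of nucleotides per alignment column.
    columns = [Counter(column) for column in zip(*sites)]
    # A Counter returns 0 for nucleotides that never occur in a column.
    return {nt: [column[nt] for column in columns] for nt in NUCLEOTIDES}


# The three aligned sites from the multiple alignment above.
sites = ["ACGTC", "ACGGT", "CCGCT"]
counts = alignment_matrix(sites)
for nt in NUCLEOTIDES:
    print(nt, counts[nt])
```

Each row of `counts` corresponds to one nucleotide i and each list entry to one position j.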

The alignment matrix is then converted into a weight matrix via the formula:

where:

- n_i,j is the observed frequency (count) of nucleotide i at position j
- N is the number of sequences in the set
- p_i is the expected frequency of nucleotide i in the genome, for instance 0.25 for each nucleotide in a genome with 50% GC content.
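As a rough illustration of the conversion, the sketch below assumes the weight takes the pseudocount form weight_i,j = ln(((n_i,j + p_i) / (N + 1)) / p_i), built from the three quantities defined above. That exact form, the function name `weight_matrix` and the uniform background frequencies are assumptions made for this example, not PREDetector's own code.

```python
import math


def weight_matrix(counts, n_sites, background):
    """Convert an alignment (count) matrix into a log-odds weight matrix.

    Assumed formula: ln(((n_ij + p_i) / (N + 1)) / p_i).
    Adding p_i to the count keeps the logarithm finite when n_ij is 0.
    """
    return {
        nt: [
            math.log(((n + background[nt]) / (n_sites + 1)) / background[nt])
            for n in row
        ]
        for nt, row in counts.items()
    }


# Counts for the three aligned sites ACGTC, ACGGT and CCGCT.
counts = {
    "A": [2, 0, 0, 0, 0],
    "C": [1, 3, 0, 1, 1],
    "G": [0, 0, 3, 1, 0],
    "T": [0, 0, 0, 1, 2],
}
# Assumed background: 0.25 per nucleotide (50% GC content).
background = {nt: 0.25 for nt in "ACGT"}
weights = weight_matrix(counts, n_sites=3, background=background)
for nt, row in weights.items():
    print(nt, [round(w, 2) for w in row])
```

Under this assumed form, a nucleotide never observed at a position receives the weight ln(p_i), and the weight grows with the observed count.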

Weight matrix

The highest score in each column (shown in red) is that of the best nucleotide at that position. The consensus sequence is ACG(C/G)T. The score of an L-length sequence is computed by summing, over its L positions, the weight of the nucleotide found at each position.
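The scoring rule can be sketched as follows. The weights are copied from the poster's weight matrix; the names `WEIGHTS` and `score` are hypothetical and only serve this example.

```python
# Weight matrix from the poster (rows: nucleotides, columns: positions 1 to 5).
WEIGHTS = {
    "A": [0.65, -1.39, -1.39, -1.39, -1.39],
    "C": [0.41, 1.39, -1.39, 0.41, 0.41],
    "G": [-1.39, -1.39, 1.39, 0.41, -1.39],
    "T": [-1.39, -1.39, -1.39, 0.08, 0.65],
}


def score(site, weights=WEIGHTS):
    """Sum the weight of the nucleotide found at each position of the site."""
    width = len(next(iter(weights.values())))
    if len(site) != width:
        raise ValueError(f"expected a site of length {width}")
    return sum(weights[nt][pos] for pos, nt in enumerate(site))


for site in ["ACGCT", "ACGGT", "ACGTC", "TTTTT"]:
    print(site, round(score(site), 2))
```

Both forms of the consensus, ACGCT and ACGGT, reach the maximal score because C and G carry the same weight at position 4.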

1. Weight matrix creation

The first part of PREDetector consists of generating a weight matrix from a set of experimentally validated binding sites. The weight matrix can be saved to the user's library and later used to scan different bacterial genomes.

Why PREDetector?

Our motivation for developing PREDetector came from our intensive use of previously described similar programmes, such as Target Explorer (A. Sosinsky et al., 2003), Predictregulon (S. Yellaboina et al., 2004), or Virtual footprint (R. Munch et al., 2006), which were not suited to predicting some of our DNA binding sites validated experimentally in vivo. The priority and challenge for PREDetector was to offer a programme that provides an easy way to estimate the reliability of the predictions and that, beyond identifying strongly reliable cis-acting elements, also gives users access to predicted sites whose scores fall outside statistical reliability thresholds and are therefore generally regarded as having no regulatory function.

Conclusion

PREDetector is an accurate prokaryotic regulon prediction tool designed to meet biologists' needs. Suggestions for improvements are welcome (contact S.Hiard@ulg.ac.be, L.Wehenkel@ulg.ac.be).

Abstract

Background: In the post-genomic era, in silico prediction of regulatory networks is considered a powerful approach to decipher and understand biological pathways within prokaryotic cells. The emergence of programs based on position weight matrices has made this approach more accessible. However, a tool that automatically estimates the reliability of the predictions and allows users to extend predictions to genomic regions generally regarded as having no regulatory function was still in high demand.

Result: Here, we introduce PREDetector, a tool developed for predicting regulons of DNA-binding proteins in prokaryotic genomes that (i) automatically predicts, scores and positions potential binding sites and their respective target genes, (ii) includes the downstream co-regulated genes, (iii) extends the predictions to coding sequences and terminator regions, (iv) saves private matrices and allows predictions in other genomes, and (v) provides an easy way to estimate the reliability of the predictions.

Conclusion: We present, with PREDetector, an accurate prokaryotic regulon prediction tool designed to meet biologists' needs. PREDetector can be downloaded freely at http://www.montefiore.ulg.ac.be/~hiard/predetectorfr.html

Alignment matrix (rows: nucleotide i; columns: position j)

     1   2   3   4   5
A    2   0   0   0   0
C    1   3   0   1   1
G    0   0   3   1   0
T    0   0   0   1   2

Weight matrix (rows: nucleotide i; columns: position j)

       1      2      3      4      5
A   0.65  -1.39  -1.39  -1.39  -1.39
C   0.41   1.39  -1.39   0.41   0.41
G  -1.39  -1.39   1.39   0.41  -1.39
T  -1.39  -1.39  -1.39   0.08   0.65





Weight matrix formula:

$$\mathrm{weight}_{i,j} = \ln\left(\frac{(n_{i,j} + p_i)/(N + 1)}{p_i}\right)$$



4. Prediction Reliability

One of the main advantages of PREDetector is that it lets the user estimate the reliability of the predictions. Most naturally occurring transcription factor binding sites are located within intergenic regions rather than within coding sequences. PREDetector reports these statistics for the predicted sites, so the user can estimate the scores at which predictions are strongly or weakly reliable.

2. Regulon Prediction

The search for potential binding sites of the regulatory protein starts with the selection of one of the saved weight matrices and the definition of the cut-off score. By default, the recommended cut-off score for a matrix is the lowest score among the input sequences used to build it; users can modify this value. PREDetector can scan either complete or selected regions of bacterial genomes available in the GenBank database. Users can set the bounds of the so-called “regulatory regions” (the estimated maximal distances upstream and downstream of the translational start within which functional regulatory motifs could be found), as well as the bounds for co-directionally transcribed genes.
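The default cut-off and the scan can be illustrated with a short sliding-window sketch. It reuses the weights from the poster's weight matrix and the three input sites of the multiple alignment. The function names, the toy sequence and the restriction to a single strand are assumptions made for this example; they do not describe PREDetector's implementation.

```python
# Weight matrix from the poster (rows: nucleotides, columns: positions 1 to 5).
WEIGHTS = {
    "A": [0.65, -1.39, -1.39, -1.39, -1.39],
    "C": [0.41, 1.39, -1.39, 0.41, 0.41],
    "G": [-1.39, -1.39, 1.39, 0.41, -1.39],
    "T": [-1.39, -1.39, -1.39, 0.08, 0.65],
}
WIDTH = 5


def score(site):
    """Sum the weight of the nucleotide found at each position of the site."""
    return sum(WEIGHTS[nt][pos] for pos, nt in enumerate(site))


def default_cutoff(input_sites):
    """Lowest score among the sites used to build the matrix."""
    return min(score(site) for site in input_sites)


def scan(sequence, cutoff):
    """Return (start, window, score) for every window scoring at least cutoff.

    Only the given strand is scanned in this sketch.
    """
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        if any(nt not in WEIGHTS for nt in window):
            continue  # skip windows containing ambiguous bases such as N
        window_score = score(window)
        if window_score >= cutoff:
            hits.append((start, window, window_score))
    return hits


input_sites = ["ACGTC", "ACGGT", "CCGCT"]
cutoff = default_cutoff(input_sites)
hits = scan("TTACGGTAACCGCTAA", cutoff)  # hypothetical toy sequence
for start, window, window_score in hits:
    print(start, window, round(window_score, 2))
```

Lowering the cut-off below the default returns more candidate sites, at the price of weaker reliability.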

3. Results

Once the options have been set, PREDetector scans the selected genome sequences and classifies the predicted target DNA motifs according to their localisation in the genome: coding sequences, or intergenic sequences, which are further classified as (1) regulatory regions (where regulatory elements are predicted to be found), (2) upstream regions (any region upstream of a translational start codon), and (3) terminator regions (in PREDetector, the term "terminator region" only denotes a region between two translational stop codons). Prediction results are distributed among these four genome localisation categories.

[Figure: genome map of orf 1 to orf 5 showing a terminator region, regulatory regions, an upstream region and co-transcribed genes.]
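One possible reading of the four localisation categories is sketched below. It is a simplified illustration under stated assumptions: genes are non-overlapping intervals on a linear sequence, only the upstream bound of the regulatory region is modelled, and the gene coordinates and the bound of 200 bp are hypothetical. It is not PREDetector's classification code.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def classify(pos, genes, upstream_bound):
    """Assign a predicted site at coordinate pos to one of four categories."""
    if any(g.start <= pos < g.end for g in genes):
        return "coding sequence"
    # Nearest genes flanking the intergenic region that contains pos.
    left = max((g for g in genes if g.end <= pos), key=lambda g: g.end, default=None)
    right = min((g for g in genes if g.start > pos), key=lambda g: g.start, default=None)
    # Distances to the translational starts that border this region:
    # a "-" gene on the left starts at its right edge,
    # a "+" gene on the right starts at its left edge.
    distances = []
    if left is not None and left.strand == "-":
        distances.append(pos - left.end + 1)
    if right is not None and right.strand == "+":
        distances.append(right.start - pos)
    if not distances:
        return "terminator region"  # bordered only by stop codons
    if min(distances) <= upstream_bound:
        return "regulatory region"
    return "upstream region"


# Hypothetical gene layout used only for this example.
genes = [
    Gene("orf1", 100, 400, "+"),
    Gene("orf2", 500, 800, "-"),
    Gene("orf3", 1500, 1800, "+"),
]
for pos in (250, 450, 900, 1150):
    print(pos, classify(pos, genes, upstream_bound=200))
```

Keeping the bound as an explicit parameter mirrors the fact that users set the limits of the regulatory regions themselves.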