Genefinders and Feature Detection in DNA - Sequence Analysis, Pairwise Alignment, and Database

Part III: Tools for Bioinformatics

Chapter 7. Sequence Analysis, Pairwise Alignment, and Database Searching

7.5 Genefinders and Feature Detection in DNA

Once a large chunk of DNA has been mapped and sequenced, the task of understanding its function begins. In this section, we describe some programs that search the sequence for genes and other biologically important features. A feature is a sequence pattern with some functional significance, such as start and stop codons, splice sites (in the case of eukaryotes), and sequences that are bound by proteins in order to regulate gene expression. Some features can be found by searching for a specific sequence, such as the restriction site cleaved by a given restriction enzyme. Others, such as promoters and genes, aren't so easy to pick out.
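The simplest kind of feature search, looking for one fixed sequence, can be sketched in a few lines of Python. The example below is only an illustration: the function name find_sites and the test sequence are made up, and the EcoRI recognition sequence GAATTC is used because it is a familiar restriction site. Because that site is palindromic, scanning one strand is enough; a non-palindromic pattern would also need to be searched on the reverse complement.

```python
import re

# EcoRI recognition sequence, used here only as a familiar example.
ECORI_SITE = "GAATTC"

def find_sites(dna, site):
    """Return the 0-based start position of every occurrence of site in dna."""
    dna = dna.upper()
    site = site.upper()
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = re.compile("(?=" + re.escape(site) + ")")
    return [match.start() for match in pattern.finditer(dna)]

# Made-up sequence containing two EcoRI sites.
example = "ttGAATTCaaccggGAATTCtt"
print(find_sites(example, ECORI_SITE))
```

Promoters and genes cannot be found this way, because no single fixed string defines them.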

Analysis of single DNA sequences in search of sequence features is a rapidly growing research area in bioinformatics.

Genefinding and feature detection are notoriously difficult problems for two reasons. First, there are a huge number of protein-DNA interactions, many of which have not yet been experimentally characterized, and some of which differ from organism to organism. Second, and more importantly, we don't always know what constitutes a binding sequence.

Current promoter detection algorithms yield about 20-40 false positives for each real promoter identified. Some proteins bind to specific sequences; others are more flexible in their preference for attachment sites. To complicate matters further, a protein can bind in one part of a chromosome but affect a completely different region hundreds or thousands of base pairs away.
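To put the false-positive figure in perspective, it can be restated as the fraction of predictions that are real. Writing r for the number of false positives per real promoter found (a symbol introduced here for illustration), and assuming the quoted range of 20 to 40 applies uniformly, the positive predictive value works out as follows:

```latex
\mathrm{PPV} = \frac{TP}{TP + FP} = \frac{1}{1 + r},
\qquad r = 20 \;\Rightarrow\; \mathrm{PPV} = \tfrac{1}{21} \approx 0.048,
\qquad r = 40 \;\Rightarrow\; \mathrm{PPV} = \tfrac{1}{41} \approx 0.024
```

In other words, only about 2 to 5 percent of predicted promoters would be expected to be genuine.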

7.5.1 Predicting Gene Locations

Genefinders are programs that identify (or try to, anyway) all the open reading frames in unannotated DNA. They use a variety of approaches to locate genes, but the most successful combine content-based and pattern-recognition approaches. Content-based methods for gene prediction take advantage of the fact that the distribution of nucleotides in genes differs from that in non-genes. The GRAIL family of programs developed at Oak Ridge National Laboratory uses a neural network to combine evidence from seven different statistical measures of DNA content (frame bias, periodicities, fractal dimension, coding 6-tuples, in-frame 6-tuples, k-tuple commonality, and repetitive 6-tuple words); subsequent versions measure additional features to better exploit these different types of data. At each position in the DNA sequence, the program weighs each type of information, integrates the evidence, and comes up with a score that represents the likelihood that the region in question is in an ORF or an intergenic region. Pattern-recognition methods look for characteristic sequences associated with genes (start and stop codons, promoters, splice sites) to infer the presence and structure of a gene.
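The most basic pattern-recognition step, scanning for a start codon followed by an in-frame stop codon, can be sketched as follows. This is a minimal illustration, not how GRAIL or any other genefinder works: the function name find_orfs and the min_codons parameter are hypothetical, only the forward strand is scanned, and introns, promoters, and content statistics are ignored entirely.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for simple ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and frame is 0, 1, or 2.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Made-up sequence with one short ORF in the third reading frame.
print(find_orfs("CCATGAAATGGTAACC"))
```

A scan like this reports every ATG-to-stop stretch, most of which are not genes, which is why content-based evidence is needed alongside it.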

In isolation, each method goes only so far. You would have a similarly limited rate of success if you tried to identify human faces by looking for either a characteristic skin texture (content) or the presence of a moustache (pattern), but not both. Not surprisingly, the current generation of genefinders combines both methods with additional knowledge, such as gene structure or sequences of other, known genes.

Some genefinders are accessible only through web interfaces, making the interaction very straightforward: the sequence that needs to be examined for genes is submitted to the program, it is processed, and the output is returned. On one hand, this eliminates the need for installation and maintenance of the genefinder on your system, and it provides a relatively uniform interface for the different programs. On the other, if you plan to rely on the results of a genefinder, you should take the time to understand the underlying algorithm, find out if the model is specific for a given species or family, and, in the case of content-based models, know which sequences they were trained on. The accuracy of a genefinder can be misleadingly high if it is tested on the same sequences on which it was trained.
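One way to check a genefinder yourself is to compare its predictions with trusted annotation on sequences it was not trained on. The sketch below computes two common nucleotide-level measures, sensitivity (the fraction of annotated coding bases that were predicted) and precision (the fraction of predicted coding bases that are annotated). The function names and all coordinates are hypothetical, and intervals are assumed to be half-open (start, end) pairs on a single sequence.

```python
def coding_positions(intervals):
    """Expand half-open (start, end) intervals into a set of base positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions

def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, precision) at the nucleotide level."""
    pred = coding_positions(predicted)
    true = coding_positions(annotated)
    true_positives = len(pred & true)
    sensitivity = true_positives / len(true) if true else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return sensitivity, precision

# Hypothetical coordinates, for illustration only.
annotated = [(100, 200), (300, 400)]
predicted = [(120, 200), (300, 350), (500, 520)]
sensitivity, precision = nucleotide_accuracy(predicted, annotated)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Numbers like these are meaningful only when the test sequences are independent of the training set.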

Some commonly used programs in gene finding include Oak Ridge National Labs' GRAIL, GENSCAN (developed by Chris Burge, now at MIT, and Samuel Karlin at Stanford), PROCRUSTES (developed by Pavel Pevzner and coworkers), and GeneWise (developed by Ewan Birney and Richard Durbin). GRAIL combines evidence from a variety of signal and content information using a neural network. GENSCAN combines information about content statistics with a probabilistic model of gene structure. PROCRUSTES and GeneWise find open reading frames by translating the DNA sequence and comparing the resulting protein sequence with known protein sequences. PROCRUSTES compares potential ORFs with close homologs, while GeneWise compares the gene against a single sequence or a model of an entire protein family.
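The first step in the translate-and-compare approach can be illustrated with a six-frame translation. The sketch below is not taken from PROCRUSTES or GeneWise, which use considerably more sophisticated alignment methods that account for introns; it only shows how one DNA sequence yields six candidate protein sequences. The function names are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is translated as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG order at each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate from the first base; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames

# Made-up sequence, for illustration only.
for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six translations could then be compared against known protein sequences with an alignment or database search tool.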

7.5.2 Feature Detection

In addition to their role in genefinder systems, feature-detection algorithms can be used on their own to find patterns in DNA sequences. Frequently, these tools help interpret freshly sequenced DNA or choose targets for designing PCR primers or microarray oligomers. Some starting places for tools like these include the Center for Biological Sequence Analysis at the Technical University of Denmark (which has several web-based applications for finding intron-exon splice sites and transcription start sites in eukaryotic DNA), the CodeHop server at the Fred Hutchinson Cancer Research Center (which predicts PCR primers based on conserved protein sequences), and the Tools collection at the European Bioinformatics Institute.

In addition to these special-purpose tools, another popular approach is to use motif discovery programs that automatically find common patterns in sequences. We will examine these programs in greater detail when we look at multiple sequence analysis methods.
