Michael Rebhan - Data Mining Techniques for the Life Sciences

Abstract

Protein sequence databases do not contain just the sequence of the protein itself but also annotation that reflects our knowledge of its function and contributing residues. In this chapter, we will discuss various public protein sequence databases, with a focus on those that are generally applicable. Special attention is paid to issues related to the reliability of both sequence and annotation, as those are fundamental to many questions researchers will ask. Using both well-annotated and scarcely annotated human proteins as examples, we show what information about the targets can be collected from freely available Internet resources and how this information can be used. The results are summarized in a simple graphical model of the protein’s sequence architecture highlighting its structural and functional modules.

Key words: proteins, protein sequence, protein annotation, protein function, databases, knowledgebases, Web resources, expert review, sequence data curation.

1. Introduction

Since Fred Sanger’s work paved the way for obtaining the sequences of proteins in the 1950s (1), we have been accumulating information about proteins at an ever-increasing pace. By 1965, 10 years after the publication of the sequence of insulin (1), a few dozen protein sequences had been published. This triggered the interest of scientists who started to wonder how this valuable information could best be compared between species and how such comparisons of sequences could help us to elucidate the evolution of molecular mechanisms. Would such distant taxa as mammals, insects, fungi, plants, and bacteria have much in common at the level of sequence? And how would we be able to interpret such similarity? At that time, Margaret Dayhoff (1925–1983), a scientist with a very interdisciplinary background and a lot of foresight, decided that it was time to put all this information together into a book that has since become the ancestor of all protein collections (2). By providing well-organized information on the 65 protein sequences that were published at the time, she wanted to make it easier for other scientists to join her in the quest for developing an understanding of their biological meaning through comparison. Soon it turned out that this work, indeed, became one of the foundations of a new field, which is now commonly referred to as “bioinformatics.”

O. Carugo, F. Eisenhaber (eds.), Data Mining Techniques for the Life Sciences, Methods in Molecular Biology 609, DOI 10.1007/978-1-60327-241-4_3, © Humana Press, a part of Springer Science+Business Media, LLC 2010

In the 1980s, in her last few years, one of Dr. Dayhoff’s major efforts was to ensure the continuation of this work, by trying to obtain adequate long-term funding to support the maintenance and further development of the collection. Less than a week before her death, she submitted a proposal to the Division of Research Resources at NIH (3). Her vision was to develop an online system of computer programs and databases which can be accessed by scientists all over the world, for making predictions based on sequences and for browsing the known information. After her death, her colleagues were determined to see her vision realized, by creating the PIR (Protein Information Resource) (4).

Inspired by Dr. Dayhoff’s legacy, a Ph.D. student who was busy writing one of the first software packages for sequence analysis in the mid-1980s encountered some problems with the data from PIR. He decided to set up his own collection, to have the freedom to develop it as he pleased. The result of this effort became the Swiss-Prot database (5). Amos Bairoch, its founder, decided to make a bold career move, that is, to focus his work on the computational analysis of protein sequences (which, at that time, may not have been an obvious choice to many colleagues). But his foresight was rewarded as well: the EMBL in Heidelberg agreed to distribute it, and as soon as the Internet was mature enough to allow direct access by researchers anywhere in the world, the associated Web site, ExPASy (6), quickly became one of the most fundamental electronic resources for any scientist working with proteins.

With the rise of the Internet in the mid-1990s, thousands of small and large resources related to proteins and genes emerged as well, created by scientists who wanted to share the collections they were developing locally. But this also created increasing confusion, as biologists found it difficult to find the information they were interested in without spending the whole day on the Internet, following an increasingly bewildering forest of hyperlinks between Web sites that often did not last longer than the research project itself due to lack of long-term funding. As a result, resources like GeneCards were developed (7), with the goal of presenting a structured overview of current knowledge on genes and their products, and the ability to drill down into the information to check sources and find additional information if needed. A similar goal was pursued by scientists at the NCBI, who wanted to create a comprehensive resource for genes of all the main organisms, including expert-reviewed, high-quality, full-length sequences with annotation. This was the beginning of the RefSeq collection of sequences, which is still one of the best sources for reliable nucleotide and protein sequences (8, 9).

With the advent of genome sequencing projects, increasing efforts were made to develop approaches for finding the correct structures of protein-coding genes in those assembled sequences. This led to a new generation of databases that include information on protein sequences with various levels of evidence, such as Ensembl (9) and many organism-specific resources. These genome resources add even more protein sequences to the above “high confidence sets” (Swiss-Prot and RefSeq) and to their associated large repositories of as yet uncurated sequences that have at least transcript-level experimental evidence, namely TrEMBL (10) and GenBank translations (11). They take advantage of the high reliability of mature genomic sequence and provide candidate protein sequences even for cases where the experimental evidence available at the transcript and protein level may be minimal or absent. As a result, we now have many different protein sequence databases with different strengths and limitations, and protein sequences with many different levels of evidence at the genomic, transcript, and protein levels. Unfortunately, those data are presented in diverse formats and with diverse ways of access, which can be a challenge for users. Due to the flood of nucleotide sequences that are likely to be translated into proteins, we now have millions of proteins in the public databases; for example, a search at NCBI Entrez’s Protein database (12) at the time of writing returns more than 4.4 million proteins for eukaryotes, and more than 7.7 million proteins for prokaryotes. To guide scientists who are not familiar with the different databases through this complex landscape, we will attempt to discuss some of the key differences in content and use. But, like anything that evolves, this database landscape itself will certainly change, so we will try to emphasize fundamental issues that are likely to persist for some time.

2. The Foundation Is the Sequence

Obviously, the most fundamental piece of information on proteins is the amino acid sequence itself. Before we start to annotate it, we need to be sure that the sequence is reliable enough for our purposes. All kinds of artifacts may complicate further work, either at the computer or in the lab: a small part of the sequence could be wrong, some part may be missing, or the whole sequence may not even exist in nature as its mRNA is actually not translated. If we are lucky enough that we can rely on the judgment of experts who can assess the reliability of the sequence, such as the curators reviewing sequences for Swiss-Prot and RefSeq, we will of course pick our sequence from their reviewed sequence collections. But in the unfortunate event that our protein of interest is not available in those resources, we have to take what we can get and perform at least some basic checks to make sure that the sequence has sufficient quality. During this analysis, we will first compare our sequence to the reviewed reference sequences and then check for conserved regions.

For example, imagine that we picked up our sequence based on a search for the gene name at NCBI’s Entrez query system. By performing a BLAST at ExPASy against the Swiss-Prot database, we can find out if there is clear local similarity to known proteins (rule of thumb: the BLAST score should be at least about 100 using the default search parameters), and do the same using protein BLAST at NCBI against RefSeq (see Fig. 3.1 and Table 3.1).
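The rule of thumb above can be applied mechanically once the hits are available in a machine-readable form. The following is a minimal sketch, assuming that the search results were saved in the 12-column tabular layout produced by BLAST+ with `-outfmt 6` (where the last column is the bit score) and that "score" in the rule of thumb refers to the bit score. The subject identifiers and values in `EXAMPLE` are made up for illustration.

```python
from typing import List, NamedTuple

MIN_SCORE = 100.0  # rule-of-thumb threshold taken from the text


class Hit(NamedTuple):
    subject: str      # identifier of the database sequence
    identity: float   # percent identity over the aligned region
    bit_score: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse 12-column tabular BLAST output (assumed -outfmt 6 column order)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append(Hit(subject=cols[1],
                        identity=float(cols[2]),
                        bit_score=float(cols[11])))
    return hits


def clear_hits(hits: List[Hit], min_score: float = MIN_SCORE) -> List[Hit]:
    """Keep only hits whose score reaches the rule-of-thumb threshold."""
    return [h for h in hits if h.bit_score >= min_score]


# Hypothetical example output: one strong hit and one weak hit.
EXAMPLE = "\n".join([
    "query1\tsp|HYPO1|HYPO1_HUMAN\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "query1\tsp|HYPO2|HYPO2_MOUSE\t35.0\t60\t39\t0\t5\t64\t10\t69\t0.5\t32.1",
])

if __name__ == "__main__":
    for hit in clear_hits(parse_tabular_hits(EXAMPLE)):
        print(hit.subject, hit.bit_score)
```

The threshold is only a first filter: a hit that passes it shows clear local similarity, but says nothing yet about the regions of the query that lie outside the alignment.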

Of course, if we can find a match in Swiss-Prot or RefSeq that is almost identical along the entire length (in the right organism), we can use that one as our new reference sequence for further analysis, instead of our original sequence. If we only get clear similarity for a small region in our protein sequence, we can consider this area as more likely to be reliable, but we cannot make a statement about other regions. If we still have insufficient information about the reliability of large regions in our sequence, we can submit the query protein to InterProScan at EBI (13), which will allow us to do a comprehensive search for conserved domains that are covered in one of the many protein family databases. If all those searches fail to produce convincing results for the whole sequence or for the part of the sequence we are interested in, and if other curated resources such as the organism-specific ones listed in Table 3.1 do not provide good hits either, it would be advisable to consult an expert (especially if subsequent analysis depends heavily on the quality of the sequence), or even to obtain experimental validation.

Fig. 3.1. How to estimate the reliability of a protein sequence (see text).
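The decision steps described above can be written down as a small triage function. This is only a sketch of the reasoning, not a validated procedure: the class `BestHit`, the function names, and the cut-offs for "almost identical along the entire length" (98% identity, 95% coverage) are assumptions chosen for illustration, and the conserved-domain regions are assumed to come from a separate domain search.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Region = Tuple[int, int]  # 1-based, inclusive residue range on the query


@dataclass
class BestHit:
    """Best hit against a reviewed collection (hypothetical container)."""
    identity: float       # fraction of identical residues in the alignment (0..1)
    query_start: int      # first aligned query residue
    query_end: int        # last aligned query residue
    same_organism: bool


def merge_regions(regions: Sequence[Region]) -> List[Region]:
    """Merge overlapping or directly adjacent regions."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def triage(query_length: int,
           hit: Optional[BestHit],
           domain_regions: Sequence[Region] = (),
           min_identity: float = 0.98,   # assumed cut-off
           min_coverage: float = 0.95):  # assumed cut-off
    """Return (decision, supported_regions) for a query sequence."""
    if hit is not None:
        coverage = (hit.query_end - hit.query_start + 1) / query_length
        if (hit.same_organism and hit.identity >= min_identity
                and coverage >= min_coverage):
            # Near-identical full-length match: switch to the reviewed sequence.
            return "use_reference", [(1, query_length)]
    supported = list(domain_regions)
    if hit is not None:
        supported.append((hit.query_start, hit.query_end))
    merged = merge_regions(supported)
    if not merged:
        # Nothing convincing anywhere: expert advice or experimental validation.
        return "consult_expert", []
    # Only part of the sequence is supported; the rest remains unassessed.
    return "partially_supported", merged


if __name__ == "__main__":
    print(triage(300, BestHit(0.99, 1, 300, True)))
    print(triage(300, BestHit(0.60, 50, 120, False), [(100, 180)]))
    print(triage(300, None))
```

A result of "partially_supported" still requires a check of whether the supported regions include the part of the sequence that matters for the question at hand.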

Table 3.1. Overview of key resources for protein sequences and annotations

Name(s) | Relevance | References | URL
Swiss-Prot | Find the (in general) most reliable sequence and | |
RefSeq | Also contains many reliable protein sequences (search the “Protein” database at NCBI, use the “RefSeq” tab, and then check their annotation to find out if they have been reviewed!) | |

In a recent paper that brought the reliability of protein sequences to the forefront again, Michele Clamp and her coworkers analyzed human genes, many of which had so far been considered protein-coding by default if they did not look similar to known ncRNA genes (14). By carefully assessing the evolutionary conservation of those sequences, they concluded that only about 20,000 of them show conservation patterns typical for validated protein-coding genes, while the others could as well be new types of noncoding RNA genes, in contrast to the common practice of declaring them protein-coding by default. In other words, in the absence of evidence to the contrary, we have to take into account the possibility that an ORF predicted in a transcript sequence does not necessarily translate into a real protein in the cell under normal conditions. Fortunately, Swiss-Prot has recently started to include information in its database entries that clearly describes the type of evidence available for the existence of a particular protein.
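The evidence information mentioned above can be read programmatically. The sketch below assumes a flat-file entry in which the protein existence level appears on a line starting with `PE`, followed by a level from 1 to 5 (for example `PE   1: Evidence at protein level;`). The entry text in `EXAMPLE_ENTRY` is a hypothetical fragment, not a real record.

```python
import re
from typing import Optional

# The five protein existence levels used by UniProtKB.
PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed layout of the flat-file line: "PE   <level>: <description>;"
PE_LINE = re.compile(r"^PE\s+([1-5]):")


def protein_existence_level(flat_file_text: str) -> Optional[int]:
    """Return the protein existence level of an entry, or None if absent."""
    for line in flat_file_text.splitlines():
        match = PE_LINE.match(line)
        if match:
            return int(match.group(1))
    return None


# Hypothetical, heavily shortened entry used only for illustration.
EXAMPLE_ENTRY = "\n".join([
    "ID   EXAMPLE_HUMAN          Reviewed;         100 AA.",
    "PE   1: Evidence at protein level;",
    "//",
])

if __name__ == "__main__":
    level = protein_existence_level(EXAMPLE_ENTRY)
    print(level, PROTEIN_EXISTENCE.get(level, "no PE line found"))
```

Sequences at levels 4 and 5 are the ones for which the caution about untranslated ORFs applies most directly.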

An additional layer of complexity is provided by alternative splicing, alternative promoter usage, alternative start codon usage during translation, and other phenomena that result in different protein sequences being encoded by the same gene. In fact, experts in deriving protein sequences from gene and transcript data do not always agree on the protein sequence(s) for a particular gene, which can be due to limited transcript data for a gene and a number of experimental artifacts that pollute nucleotide sequence data. An example of efforts that try to address those issues is the CCDS initiative, which defines coding sequences, and therefore encoded protein sequences, that different curation teams at NCBI, EBI, Sanger Institute, and UCSC can agree on (15). So, if your protein sequence of interest is part of this collection, this can be interpreted as a good sign (i.e., there is a good chance that this protein exists with exactly this sequence in nature). A gene that exemplifies this kind of complexity is the human form of the microtubule-binding protein tau (approved gene symbol: MAPT), which displays different numbers of protein isoforms, depending on the resource you inspect (see Table 3.1). So how should such complexity be represented in a protein database? Should every unique sequence get its own entry? Since its beginnings, Swiss-Prot has tried to represent all the isoforms with one reference sequence that is annotated with features that describe which parts of the sequence occur in which isoforms (see the Swiss-Prot entry P10636, which currently contains nine isoforms). Based on this concept, all isoform sequences can then be generated on demand, e.g., for comprehensive sequence similarity searches, while a single reference sequence representing the key properties of a large fraction of the protein molecules is available as well. If an isoform is rare and contains sequence that does not occur in other more common isoforms, this sequence may not be represented in the main entry, but only in the annotation and when querying specifically for this isoform. Therefore, such areas may be missed in sequence searches that do not search against all isoforms in Swiss-Prot. RefSeq, on the other hand, sometimes provides more than one protein sequence per gene. In any case, it may be useful to align all available Swiss-Prot isoform sequences and RefSeq protein sequences (and possibly sequences from organism-specific databases, see Table 3.1) for the gene of interest to see where they differ, as there can be substantial differences that require careful examination.
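The idea of one reference sequence plus features that describe each isoform can be illustrated with a short sketch. The `VarSeq` tuple, the toy reference sequence, and the variant coordinates below are all invented for illustration and do not correspond to any real entry. The comparison step uses Python's `difflib`, which is a generic string comparison and not a biological alignment method, so it is only adequate for near-identical sequences such as isoforms of one gene.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple, Sequence, Tuple


class VarSeq(NamedTuple):
    """A region of the reference that differs in an isoform (hypothetical type)."""
    start: int         # 1-based, inclusive, on the reference sequence
    end: int           # 1-based, inclusive
    replacement: str   # "" means the region is missing in the isoform


def build_isoform(reference: str, variants: Sequence[VarSeq]) -> str:
    """Generate an isoform sequence from the reference and its variant features."""
    seq = reference
    # Apply edits from the C-terminus backwards so that the coordinates of
    # the remaining (more N-terminal) features stay valid.
    for v in sorted(variants, key=lambda v: v.start, reverse=True):
        seq = seq[:v.start - 1] + v.replacement + seq[v.end:]
    return seq


def differences(a: str, b: str) -> List[Tuple[str, str, str]]:
    """List the segments in which two sequences differ as (operation, a_part, b_part)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]


# Toy data, not a real protein.
REFERENCE = "MAAAKKKCCCDDDEEE"
ISOFORM_2 = [VarSeq(5, 7, "")]                       # residues 5-7 missing
ISOFORM_3 = [VarSeq(5, 7, ""), VarSeq(11, 13, "GG")]  # plus a replaced segment

if __name__ == "__main__":
    iso2 = build_isoform(REFERENCE, ISOFORM_2)
    iso3 = build_isoform(REFERENCE, ISOFORM_3)
    print(iso2, iso3)
    print(differences(REFERENCE, iso2))
```

Generating every isoform in this way before a similarity search avoids missing regions that occur only in rare isoforms.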

3. What Is Known About the Protein?

Once we can be sure that our protein sequence is reasonably reliable, we can try to find out more about its biological meaning. If it is a protein that has been well characterized, or if it is highly similar to such a protein, we may want to turn to Swiss-Prot to get an overview of the information that has accumulated in the literature. In addition, there can be useful annotation in other resources, like RefSeq, and some organism-specific databases that offer expert-reviewed information as well (see Table 3.1).

But what exactly does it mean for a protein to be “well-characterized”? Do we really know all the functions of this protein in all cell types and developmental stages, and which parts of the sequence play which role in which function? Even in cases where there are hundreds of publications on a particular protein, it is possible that most of the work that has been done has looked at the protein from a particular angle, while much less attention has been paid to additional “moonlighting” functions, functions in other cell types and stages, and some of the transient interactions it engages in. In many cases, we know what particular regions of the protein do, but we cannot make confident statements about the function of the rest of the protein.

On c-myc, for example, an impressive amount of information is available. A search in all databases at NCBI reveals more than 10,000 articles in PubMed, more than 2,000 protein sequences, and 20 macromolecular structures. Few proteins have been investigated so thoroughly. Therefore, one may easily think that almost every interesting aspect of the function of this protein would have been unearthed by now. Let us see what the protein databases provide in such a case. First, we will examine the human entry in Swiss-Prot (16). The human-friendly current “name” (or “ID”) is “MYC_HUMAN,” but the more stable accession that is independent of gene name changes (which do occur!) is “P01106.” In this particular case, a gene name change is not expected, but in any case it is a good habit to use the stable accession for any type of documentation, to be on the safe side. If you have a careful look at the information in the entry, it shows that the protein was entered into Swiss-Prot a long time ago (in 1986), so actually it may have been one of the first proteins in the database. But since then annotations have been added or modified, the last time quite recently. Note also the “gene name” field, which lists the gene name or symbol that can be useful for searches in genome resources. For this protein, there is evidence at the protein level, as you can see in the respective field. Below the references, the “Comments” section provides an overview of the knowledge about its biological function. Information on molecular interactions is given as well, although at this point we do not know which residues mediate which interaction. Further down, a list of keywords is provided that can be very useful for obtaining a quick impression about the protein and for finding other proteins that have been annotated with the same keyword. Then, we can see “features,” i.e., annotations that can be localized to particular residues or regions. This includes a helix-loop-helix motif, a potential leucine zipper, and a basic motif (see Fig. 3.2, rectangles marked HLH, LZ, and BM). There is also some information on the 353 amino acids N-terminal of those C-terminal domains, but it consists mostly of posttranslational modifications (e.g., T58, which can be both phosphorylated and glycosylated) and areas with compositional bias. So, what is actually the function of this large area, and which residues are involved in which aspect of this function? At the end of the features list, we can see the position of secondary structure elements that are experimentally validated, in this case three alpha-helices at the C-terminal end. But this still leaves us wondering what is known about the N-terminal two thirds of the protein. Being aware of the amount of literature that is available, we may wonder if it would make a lot of sense to try to locate this information in PubMed, as such data are often not obvious from the abstract of a paper.
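The question of which parts of a sequence carry no localized annotation can be answered by a simple interval calculation over the feature table. The following sketch assumes that features are available as 1-based, inclusive (start, end) pairs. The sequence length and coordinates in the example are invented and are not the actual c-myc feature positions.

```python
from typing import Iterable, List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive residue range


def unannotated_regions(length: int, features: Iterable[Region]) -> List[Region]:
    """Return the stretches of a sequence not covered by any feature."""
    gaps: List[Region] = []
    next_free = 1  # first residue not yet covered by a feature seen so far
    for start, end in sorted(features):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= length:
        gaps.append((next_free, length))
    return gaps


if __name__ == "__main__":
    # Hypothetical protein of 100 residues with two overlapping
    # C-terminal features; the N-terminal part shows up as one large gap.
    print(unannotated_regions(100, [(60, 70), (68, 90)]))
```

Large gaps reported in this way are candidates for the further analysis of functional modules, for example with the domain and disorder resources discussed in this chapter.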

At RefSeq, we can find the entry “NP_002458.2” (17), which is version 2 of entry “NP_002458.” Note the comments on the protein isoform created by the usage of downstream alternative start codons, which seem to have some role in the cell. In the “FEATURES” section, we can again find details on the residues involved in particular functions, in this case DNA binding and dimerization (summarized as ovals in Fig. 3.2). But still we did not find much new information on the N-terminal part. To see if this lack of annotation of functional modules is due to a high

Fig. 3.2. Architecture of human c-myc, based on annotations in Swiss-Prot, RefSeq, and DisProt. From left to right: “iso” =
