Representation of functional information in the SWISS-PROT Data Bank


BIOINFORMATICS APPLICATIONS NOTE, Vol. 15 no. 12 1999, Pages 1066–1067

V. L. Junker (1,*), R. Apweiler (1) and A. Bairoch (2)

(1) EMBL Outstation – The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SD, UK and (2) Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1, rue Michel Servet, 1211 Geneva 4, Switzerland

Received on 26 March 1999; revised on 18 June 1999; accepted on 29 June 1999

Abstract

Summary: Functional information in SWISS-PROT results primarily from the assessment of articles reporting characterization. Predicted information is labeled with flags describing the evidence level (e.g. potential, probable, by similarity).

Availability: SWISS-PROT can be accessed via http://www.expasy.ch/sprot and http://www.ebi.ac.uk/swissprot/

Contact: [email protected]

The editorial by Peter Karp in Bioinformatics, 14, 753–754 (1998) discussed the risk of propagating incorrect annotation in sequence database entries. The sheer amount of sequence data from genome sequencing projects has led to a dramatic increase in computational methods predicting function. Such large-scale computational analysis increases the risk of large-scale propagation of annotation errors. In addition, the point was raised that databases do not distinguish experimentally determined functions from those determined computationally. In response to this editorial we wish to highlight how functional annotation is assigned in SWISS-PROT and how we portray experimentally and computationally determined information.

SWISS-PROT is a curated protein sequence data bank that strives to provide a high level of annotation. Efforts are made to enter as much functional information as possible into the database. The main sources of information are articles reporting sequencing and/or characterization.
* To whom correspondence should be addressed.

When biochemical experiments have been undertaken to characterize a protein, this is recorded in the Reference Position (RP) line of the entry. The RP line is part of the reference block and describes what has been determined in that publication. As an example, below is the RP line of an entry where the translated sequence, the function and the phosphorylation of a protein have been determined:

    RP   SEQUENCE FROM N.A., FUNCTION, AND
    RP   PHOSPHORYLATION.

We cannot always list all experimentally determined characteristics. In these cases ‘CHARACTERIZATION’ is added.

Not all entries in SWISS-PROT are fully characterized. With the increasing amount of data coming from megasequencing projects you will find more and more proteins in SWISS-PROT without experimental characterization. These proteins can be identified through the standardized labeling of their DE (description) line.

When a protein exhibits extensive sequence similarity to a characterized protein and/or has the same conserved regions, the label ‘probable’ is used in the DE line. It is normally followed by the full name of a protein from the same family that it matches. For example:

    DE   PROBABLE 5'-NUCLEOTIDASE PRECURSOR
    DE   (EC 3.1.3.5).

The label ‘putative’ is used in the DE line of proteins that exhibit limited sequence similarity to characterized proteins. These proteins often have a conserved site, e.g. an ATP-binding site, but no other significant similarity to a characterized protein. The label is most frequently used for sequences from genome projects. For example:

    DE   PUTATIVE AMINO-ACID PERMEASE.

The assignment of the labels ‘probable’ and ‘putative’ depends primarily on the results of sequence similarity searches against SWISS-PROT. It is important to point out that no specific cut-off point is used to assign a protein as ‘putative’ or ‘probable’. Let us take Q10480, a predicted Schizosaccharomyces pombe protein, as an example.
This entry has the following description line:

    DE   PROBABLE MITOCHONDRIAL NUCLEASE
    DE   (EC 3.1.30.-).

The FastA results show that the sequence is 47% identical over its entire length to the mitochondrial nuclease from Saccharomyces cerevisiae (P08466). Large segments contain identical residues, the E-value of the alignment is statistically highly significant, and the active site is conserved, so we classify it as a ‘PROBABLE MITOCHONDRIAL NUCLEASE’.

© Oxford University Press 1999

Table 1. Flags describing the evidence level of SWISS-PROT annotation

    Flag                  Line  Explanation
    Hypothetical protein  DE    Predicted proteins, which may not be translated.
    Putative              DE    Some similarity over conserved domains/sites. Used to show a tentative classification.
    Probable              DE    Strong similarity over conserved domains/sites. Used to show a likely classification.
                          CC    Experiments performed but not entirely conclusive.
                          FT    Experiments have shown the protein to be modified/processed but actual site has not been confirmed.
    Potential             CC    Member of the same family has (potential) after comment.
                          FT    Added to features resulting from prediction programs for signal sequences, transmembrane regions, coiled-coils, etc.
    By similarity         CC    Experimentally proven in member of the same family to which the new entry is believed to belong.
                          FT    From alignment the domain/site exists and the sequence it is aligning against has the domain/site experimentally proven.

All predicted protein sequences lacking any significant sequence similarity to characterized proteins are labeled as ‘hypothetical proteins’. The majority of these cases come from the genome sequencing projects. Example:

    DE   HYPOTHETICAL 33.8 KD PROTEIN C5H10.01
    DE   IN CHROMOSOME I.

Most of the SWISS-PROT annotation can be found in the CC and FT lines. The CC (Comment) lines and the FT (Feature Table) lines indicate the level of characterization of a protein. All non-experimentally verified information is flagged. Let us look again at Q10480, where you can find the following CC lines:

    CC   -!- FUNCTION: THIS ENZYME HAS BOTH
    CC       RNASE AND DNASE ACTIVITY (BY SIMILARITY).
    CC   -!- COFACTOR: REQUIRES MANGANESE OR
    CC       MAGNESIUM (BY SIMILARITY).
    CC   -!- SUBUNIT: HOMODIMER (BY SIMILARITY).
    CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL
    CC       INNER MEMBRANE (POTENTIAL).
    CC   -!- SIMILARITY: BELONGS TO THE DNA/RNA
    CC       NON-SPECIFIC ENDONUCLEASES FAMILY.

The function, cofactor and subunit comments are all labeled ‘by similarity’. This indicates that they have been assigned due to similarity to an existing characterized entry, in this case the mitochondrial nuclease from Saccharomyces cerevisiae. The label ‘potential’ is also used to indicate assignment by comparative analysis. In general this label is used if there is no experimental proof for the information given in a CC topic for a protein, but similarity searches or other prediction methods allow potential comments (as with the subcellular location comment in Q10480). If comparative analysis reveals highly likely comments, then the label ‘probable’ is used:

    CC   -!- SUBUNIT: HOMOTRIMER (PROBABLE).

All features in the FT lines that have been derived from sequence similarity searches and prediction programs, but lack final experimental verification, are flagged. If a feature is highly likely, i.e. experiments have been performed but are not entirely conclusive, then the label ‘probable’ is used. The label ‘potential’ is used to indicate assignment by comparative analysis and/or prediction programs for signal sequences, transmembrane regions, coiled-coils etc. Another label used to indicate that a feature has not been experimentally proven but only inferred through sequence analysis is ‘by similarity’:

    FT   ACT_SITE    142    142       BY SIMILARITY.

This example again comes from Q10480, the ‘PROBABLE MITOCHONDRIAL NUCLEASE’ of Schizosaccharomyces pombe. The label ‘by similarity’ indicates that this feature has been assigned due to similarity to an existing characterized entry, in this case the mitochondrial nuclease from Saccharomyces cerevisiae.
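The labeling conventions described above lend themselves to simple automated checks. The following is a minimal sketch in Python, assuming each entry line starts with its two-letter line code followed by spaces, as in the examples shown here. The function names (de_label, cc_flags, ft_flags) and the parsing rules are hypothetical illustrations, not part of SWISS-PROT or of any official parser, and the feature key is written ACT_SITE on the assumption that the key contains an underscore.

```python
import re

# Evidence labels as named in the text. The parsing rules below are an
# illustrative assumption, not the official SWISS-PROT line grammar.
DE_LABELS = ("HYPOTHETICAL", "PUTATIVE", "PROBABLE")
FLAGS = ("BY SIMILARITY", "POTENTIAL", "PROBABLE")
FLAG_AT_END = re.compile(r"\((%s)\)\.?$" % "|".join(FLAGS))


def _payload(lines, code):
    """Join the text of all lines that start with the given two-letter code."""
    return " ".join(line[2:].strip() for line in lines if line[:2] == code)


def de_label(lines):
    """Return the evidence label that opens the DE line, or None."""
    text = _payload(lines, "DE")
    for label in DE_LABELS:
        if text.startswith(label + " "):
            return label
    return None


def cc_flags(lines):
    """Map each CC topic to the flag closing its comment (None if unflagged)."""
    flags = {}
    for block in _payload(lines, "CC").split("-!-"):
        if ":" not in block:
            continue  # skip the empty chunk before the first '-!-'
        topic, body = block.split(":", 1)
        match = FLAG_AT_END.search(body.strip())
        flags[topic.strip()] = match.group(1) if match else None
    return flags


def ft_flags(lines):
    """Return (key, start, end, flag) for each FT line; flag may be None."""
    found = []
    for line in lines:
        if line[:2] != "FT":
            continue
        fields = line[2:].split(None, 3)  # key, from, to, description
        if len(fields) < 3:
            continue
        desc = fields[3].rstrip(". ") if len(fields) == 4 else ""
        flag = next((f for f in FLAGS if desc.endswith(f)), None)
        found.append((fields[0], fields[1], fields[2], flag))
    return found


# Lines of Q10480 as quoted in the text (layout assumed).
Q10480_LINES = [
    "DE   PROBABLE MITOCHONDRIAL NUCLEASE",
    "DE   (EC 3.1.30.-).",
    "CC   -!- FUNCTION: THIS ENZYME HAS BOTH",
    "CC       RNASE AND DNASE ACTIVITY (BY SIMILARITY).",
    "CC   -!- COFACTOR: REQUIRES MANGANESE OR",
    "CC       MAGNESIUM (BY SIMILARITY).",
    "CC   -!- SUBUNIT: HOMODIMER (BY SIMILARITY).",
    "CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL",
    "CC       INNER MEMBRANE (POTENTIAL).",
    "CC   -!- SIMILARITY: BELONGS TO THE DNA/RNA",
    "CC       NON-SPECIFIC ENDONUCLEASES FAMILY.",
    "FT   ACT_SITE    142    142       BY SIMILARITY.",
]

print(de_label(Q10480_LINES))
print(cc_flags(Q10480_LINES))
print(ft_flags(Q10480_LINES))
```

For the quoted lines, the sketch reports ‘PROBABLE’ for the DE line, ‘BY SIMILARITY’ for the function, cofactor and subunit comments, ‘POTENTIAL’ for the subcellular location, and ‘BY SIMILARITY’ for the active-site feature. A value of None only means that this sketch found no flag at the end of the comment or feature description.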
Table 1 outlines the rules used to flag annotation added to entries where no, or only limited, experimental evidence is available.

The addition of functional information to SWISS-PROT entries is a labor-intensive process in which every effort is made to add all relevant, available data. Great care is taken to ensure that the information added can be traced either to the relevant data source or back to the entry (or entries) reporting the experimentally determined characteristics. For a more detailed description of these assignments please visit our documentation files at www.expasy.ch/sprot/sp-docu.html.

Finally, we also want to point out the importance of using the original SWISS-PROT data produced by SIB and EBI. The SWISS-PROT database is widely redistributed; it is integrated into other public data sets, and is also distributed by many vendors in conjunction with software packages. Users should be aware that the database supplied may not contain the latest data and may not include all of the information available in the original.
