Solving the Protein Identifier

Philip Jones

2. Solving the Protein Identifier Problem – Just Too Many Databases?

109 Analysing Proteomics Identifications in the Context of Functional and Structural Protein

database, which provides a “minimally redundant yet maximally complete set of proteins for featured species” (1).

The researcher may then wish to retrieve a broad range of annotation of the identified proteins. This requires that the protein identifiers from the database used (e.g. IPI numbers) are mapped on to the most appropriate species-specific database and/or high quality human-curated database, such as UniProtKB/Swiss-Prot (2). Clearly, the most desirable situation is that the identified protein accessions can be mapped on to all of the relevant protein sequence databases in one step. This can be achieved for the majority of public protein sequence databases using the Protein Identifier Cross Reference Service, PICR (http://www.ebi.ac.uk/Tools/picr/) (3). Following is a step-by-step description of how this can be easily achieved. These instructions assume that you have a list of protein identifiers or accessions from the search database used for protein identification.

1. Build a text file (using, for example, Microsoft Notepad) containing one protein ID on each line. Save this file and note its location and name on your computer. If you wish to create this file using a spreadsheet application, you should paste the protein IDs into the first column and then save the file in “CSV” format.

2. Visit http://www.ebi.ac.uk/Tools/picr/ in an Internet browser (see Fig. 1).

3. In the centre of the screen, you will see a large text area under the heading “Input Data.” Under this text area is a “browse” button. Click on this button.

4. Browse to, then open the text file that you saved earlier. (The exact details of the dialogue box used to browse to the file will depend upon the operating system and the Internet Browser you are using, so are not described here.)

5. You should now see the path to the file displayed on the Web page next to the browse button.

6. Select a format for the protein identifier mappings. For the purpose of using the mappings to query further services, such as DAS and BioMart, the CSV format is recommended (plain text, comma-separated values file). This format can be used for importing the data into any spreadsheet software.

7. By default, there is no limitation by taxonomy, so mappings for all species are returned. You may wish, however, to limit the mappings returned to a specific species. There is a pull-down list of species located at the top right-hand corner of the PICR Web form. This includes the species described in the Ensembl database. If you are interested in other taxonomic groups, type the name of the taxon into the text box below the pull-down list. Suggested species matching your search will start to appear as you type.

8. It is recommended that you leave the check box labelled “Return only active mappings” in its default state (checked/ticked). This ensures that only current protein identifiers are returned from PICR.

9. Select the protein sequence database that you wish to map your identifiers to. To use the list of identifiers returned from PICR in a tool, such as DAS or BioMart, it is recommended that you select a single database to map to at a time, to keep the results from PICR simple. Please see Note 1 for a discussion of the default settings, SwissProt and TrEMBL.

10. Click on the red “Search” button which is situated in the “Output Parameters” section in the middle of the screen.

11. The search may take several seconds to perform, or longer if you have supplied a long list of protein identifiers. You will see a progress bar appear on your browser, which is regularly updated. If nothing happens for a long time, click on the “Refresh” link.
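Step 1 above (one protein ID per line in a plain text file) can also be scripted when the identifiers already exist in a program rather than in a spreadsheet. The following is a minimal sketch, not part of PICR itself: the function name `write_id_file`, the output file name, and the example identifiers are all placeholders chosen for illustration, and the removal of blank and duplicate entries is an added convenience rather than a PICR requirement.

```python
import tempfile
from pathlib import Path


def write_id_file(identifiers, path):
    """Write one protein identifier per line, dropping blanks and duplicates.

    Returns the cleaned list in the order in which it was written.
    """
    seen = set()
    cleaned = []
    for raw in identifiers:
        ident = raw.strip()
        if ident and ident not in seen:
            seen.add(ident)
            cleaned.append(ident)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Placeholder identifiers in the IPI style, for illustration only.
example_ids = ["IPI00000001", " IPI00000002 ", "", "IPI00000001"]

# The file is written to the system temporary directory (an arbitrary choice).
id_file = Path(tempfile.gettempdir()) / "protein_ids.txt"
written = write_id_file(example_ids, id_file)
```

The resulting file can then be selected with the “browse” button as described in steps 3 and 4.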

Fig. 1. The PICR service user interface. See the main text for a step-by-step description of how to use this interface. Note that there is also a “web service” interface to PICR for use directly from code.

To use the data returned from PICR, it is important to understand how the mappings are generated. PICR maps to protein accessions that are either assigned to exactly the same sequence or have been annotated in UniProtKB/Swiss-Prot as logical cross-references. PICR is not a BLAST service. If you are interested in finding similar protein sequences rather than alternative identifiers for the same sequence, PICR is not for you.

In step 6 above, it was recommended that you select the CSV format. This format includes four columns of data:

1. The input protein identifier (the one you searched with).

2. The name of the database that the input identifier has been mapped to.

3. The mapped protein accessions.

4. The “status” of the mapping, one of “identical” or “logical.”

“Identical” indicates that the mapped protein identifier refers to exactly the same protein sequence. “Logical” indicates that the mapped accession is a cross-reference in UniProtKB/Swiss-Prot.
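The four-column layout described above lends itself to simple scripted processing before the mappings are passed on to other tools. The sketch below reads such a file with Python's standard `csv` module and groups the mappings by input identifier, optionally keeping only “identical” mappings. It rests on several assumptions that are not stated in the text: that the file has no header row, that each row carries a single mapped accession, and that the columns appear in the order listed. The function name `load_mappings`, the database label, and all identifiers in the sample are illustrative placeholders.

```python
import csv
import io
from collections import defaultdict


def load_mappings(handle, keep_status=("identical", "logical")):
    """Group (database, accession, status) tuples by input identifier.

    Assumes four columns per row: input identifier, database name,
    mapped accession, mapping status.
    """
    mappings = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        input_id, database, accession, status = (f.strip() for f in row[:4])
        status = status.lower()
        if status in keep_status:
            mappings[input_id].append((database, accession, status))
    return dict(mappings)


# Made-up sample data standing in for a downloaded mapping file.
sample = io.StringIO(
    "IPI00000001,SWISSPROT,P00001,identical\n"
    "IPI00000001,SWISSPROT,P00002,logical\n"
    "IPI00000002,SWISSPROT,P00003,identical\n"
)

# Keep only mappings that refer to exactly the same protein sequence.
identical_only = load_mappings(sample, keep_status=("identical",))
```

Restricting the result to “identical” mappings is one possible design choice when annotation must refer to exactly the same sequence; “logical” mappings can be retained by leaving `keep_status` at its default.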

The Distributed Annotation System (DAS), together with the software tools that have been developed to use this service, allows the user to retrieve annotation on protein sequences or nucleic acid sequences from many physically and geographically separate locations in one request. The real power of this system is that the separate sources of annotation need not be aware of each other in any way, so long as they are using a common naming system and coordinate system for the sequences they describe. The software tool (DAS “client”) being used by the researcher is able to locate these separate sources of annotation using a central registry. The tool then requests annotation from all of the registered sources and finally collates this annotation for display or analysis.
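The collation pattern described above can be illustrated with a small, self-contained sketch. It does not contact any real DAS server or registry: the `REGISTRY` and `SOURCES` dictionaries are hypothetical in-memory stand-ins, and the source names, accessions, and feature strings are invented for illustration. The point is only to show how independent sources that share a naming and coordinate system can be queried and merged by a client.

```python
from collections import defaultdict

# Hypothetical stand-in for the central registry:
# source name -> coordinate system used by that source.
REGISTRY = {
    "domains_source": "UniProt",
    "ptm_source": "UniProt",
    "gene_source": "Ensembl",
}

# Hypothetical annotation sources; none of them knows about the others.
SOURCES = {
    "domains_source": {"P00001": ["domain 10-120"]},
    "ptm_source": {"P00001": ["phosphoserine 45"]},
    "gene_source": {"ENSG0000001": ["exon 1"]},
}


def collate(accession, coordinate_system):
    """Ask every registered source in the given coordinate system for
    annotation on one accession and merge the answers by source name."""
    collated = defaultdict(list)
    for name, system in REGISTRY.items():
        if system != coordinate_system:
            continue  # this source describes a different kind of sequence
        for feature in SOURCES[name].get(accession, []):
            collated[name].append(feature)
    return dict(collated)


result = collate("P00001", "UniProt")
```

In a real client, the dictionary lookups would be replaced by requests to the registry and to each annotation server.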

DAS has been in common use since it was first used for nucleic acid sequence annotation in 2001 (4), becoming a widely used and stable standard following the release of version 1.53 of the specification in 2002 (5). At this point, the focus of the standard was on serving sequence information and annotations coordinated on to this sequence. Since then, the scope of the standard has been expanded significantly. It is now possible to use DAS to retrieve structural information (at the level of atomic coordinates), to perform sequence alignments, and to retrieve interaction data (6). More recently, groups have been working on DAS writeback to allow researchers to contribute annotation to a remote server. These new facilities have been described in later versions of the DAS specification, including DAS 1.53E (7) (http://www.dasregistry.org/spec_1.53E.jsp) and the DAS 1.6 standard (http://www.biodas.org/wiki/DAS1.6).

3. Collating
