
Jari Häkkinen and Fredrik Levander

In the document Data Mining in Proteomics (Pages 92-106)

Abstract

Proteomic experiments can be difficult to handle because of the large amount of data in different formats that is generated. Samples need to be managed, and generated data needs to be integrated with samples and annotation information. A laboratory information management system (LIMS) can be used to overcome some of the data handling problems. In this chapter, we discuss the role of a LIMS in the proteomics laboratory, and show two step-by-step examples of using the Proteios Software Environment (ProSE) to handle two different proteomics workflows.

Michael Hamacher et al. (eds.), Data Mining in Proteomics: From Standards to Applications, Methods in Molecular Biology, vol. 696, DOI 10.1007/978-1-60761-987-1_5, © Springer Science+Business Media, LLC 2011

1. Introduction

The data management problem in proteomics is significant for several reasons: (i) proteomics methods are evolving rapidly with new workflows, (ii) proteomics experiments and analyses involve many steps that generate large amounts of data, and (iii) instruments produce data in different formats (1).

Consequently, a major task for the proteomics researcher is to merge heterogeneous data into meaningful information and to collect metadata for critical evaluation of results.

A Laboratory Information Management System (LIMS) is software used in the laboratory for the management of samples, users, instruments, protocols, data analysis, and workflow automation. The goal of a LIMS is to create an environment where all laboratory and analysis information is tracked from biosources to final results (2). A LIMS can be a key element in an enterprise setting, with connections to other information systems for streamlining production and yield and for enforcing regulatory restrictions, but here we restrict the discussion of LIMS to a laboratory setting. A LIMS should support:
Instrument integration – Information from the instrument should be useful for the LIMS, and the LIMS should generate information for instruments, such as inclusion lists for targeted tandem mass spectrometry (MS).

Analysis tools – Users perform calculations, document, and review results using information from instruments, reference databases, and Web-based services.

Information sharing and searching – A research group needs to share data, external partners need access to data, and some users should be able to monitor progress, review results, and other documentation. Users search for samples, proteins, and other relevant information, and display sample relationships based on analysis results.

Tracking – The information flow and data generation throughout experiments must be tracked, and the researchers' data tracking work should be supported by the LIMS.

Standards adoption – For proteomics data, there are open XML file formats developed for sharing and publication of information (http://www.psidev.info). A LIMS should support such open standards but also be able to store files in other commonly used data formats, such as comma/tab-separated (csv/tsv/txt), word processor (doc/odt), PDF and PostScript, and spreadsheet (xls/ods) formats.

The foremost advantage of using a LIMS is that the automation of experiments and data analysis can dramatically increase a laboratory's productivity. Accessibility to data is significantly improved, particularly if a Web-based interface allows access from remote locations. In addition, traditional laboratory notebooks are not compatible with a multiuser, multitask environment, so an electronic means of storing and sharing data is an attractive option.

Which LIMS to choose depends on laboratory requirements, system capabilities, integration and data needs, flexibility, standards compliance, and security requirements. How to choose the proper application is out of scope for this chapter; instead we describe how to use the Proteios Software Environment (ProSE, (3)) as a LIMS. There are many other applications that perform similar services; see (2, 4) for more information about other LIMSs. ProSE is built around a Web-based local data repository for proteomics experiments and meets many of the requirements on a LIMS. Apart from pure information tracking, a feature of the system is its analysis possibilities, such as the combination of search results from different search engines, which are integrated into different proteomic workflows. Using two example scenarios, 2D gel and LC-MS-based experiments, we describe our best practices for solving issues related to information tracking from sample to results to public data repository submission.

81 Laboratory Data and Sample Management for Proteomics

2. Materials

To make the most of the remainder of this chapter you need access to a ProSE server. This document is based on ProSE version 2.8.0 but is kept at a general level, so later versions of ProSE should also work. Either follow the installation outlined in Note 1 or use the demo server available through the ProSE Web site http://www.proteios.org. However, the demo server does not support protein identification searching directly from the application, but you can run searches outside ProSE and upload result files. The data used in the examples is also available at http://www.proteios.org (see Note 2).

We assume that you have access to an account on a ProSE server with a set of plug-ins available for your use (as outlined in Note 1). Throughout the examples below we show one way to perform actions, but there are usually several ways to achieve the same effect.

3. Methods

To get the most out of a LIMS, not only must the laboratory practice be adapted to the tool, but the LIMS also needs to be adapted to support laboratory practices. These adaptations are mostly related to tracking issues and arise because there is no standard for performing tracking in the laboratory. For example, by using file naming conventions, information usable for tracking can be added to the file name, even if the information is not present in the file itself (see Note 5).

ProSE spans the whole proteomics experiment, from hypothesis to actual protein identifications. ProSE manages sample information, raw data, images, and analysis results, as well as connectivity to protein identification, data viewing, and analysis tools. The organisation and interface of ProSE are designed to closely follow the natural workflow of the proteomics researcher, and are compatible with both liquid chromatography (LC)–tandem MS and two-dimensional (2D) gel experiments (see Note 1).

The ProSE data model is designed to map the steps of a proteomics experiment. The ProSE development team has specifically considered the fact that some parts of the data in an experiment are generated automatically, whereas other data is collected manually. We also take into account that experiment steps occur at different points in time and in different locations, which corresponds to a typical researcher's work situation.

Here, we describe in a step-by-step fashion the usage of ProSE in two different workflows. Parts which require more attention, like sample annotations, are discussed further in Subheading 4 below.

For a detailed description of the proteomics approaches, please refer to Chap. 1 (Schönebeck et al.).

3.1. 2D-Gel Electrophoresis Case Study

3.1.1. Laboratory Work

Our sample is a complex protein mixture extracted from human tissue. The sample is run on a 2D gel to separate proteins into distinct spots. The gel is scanned and passed through image analysis for the detection of spots, and a gel picking robot is set up to pick spots chosen for identification. The robot digests the proteins and extracts the peptides into wells on microtiter plates. The digests were in this study analyzed using LC-MS/MS in a quadrupole time-of-flight (Q-TOF) instrument. The resulting spectra are subjected to database searching to identify proteins.

Throughout the laboratory work, a lot of data is generated, most of which is stored in data files: spot picking and digestion logs, mass spectrum files, and protein identification search result files. ProSE supports upload/import of result files from several search engines, but we recommend running identification directly from ProSE if you have access to local search engines. The generated files and other relevant information should be collected for upload into ProSE.

3.1.2. ProSE Work

A new project (experiment) should start with some preparation steps in ProSE. ProSE does not enforce specific routes of data upload, but in some instances certain data objects must be in place before new data can be added. We do not care about such constraints here but rather work through data upload in a sequence. ProSE provides a gel project-biased wizard that guides the user from creating a project through to the presentation of identification results. However, we do not cover the wizard here but rather work through using other actions available through the menu and buttons.

1. Log in and create a new project (File → New → Project). Name the project GelProject and save. You are presented with a new page containing, among other things, a “Members” tab. This tab allows you to add other users on the server as project members. The information in the project is available to the members, with privileges defined by the project owner. We do not share data in this example, but more information on sharing information is available in the ProSE user guide found at the ProSE Web site.

2. Make sure that GelProject is active. When the project is created it automatically becomes the active project, but if you later log out and in again, you must select the project as the active project (File → Select Project → GelProject). The active project will be listed on the menu bar. The GelProject menu item has many different actions, several of which will be covered throughout this tutorial.

3. Add a new sample (GelProject → LIMS → Samples, and click the “New sample” button). Fill in the fields: name it “GelSample”, use “ExtGelSample” for the “external id” (an external identifier), and set the original amount field to the amount of sample (in this example use 100; the unit is predefined in ProSE and shown for the fields). Finalize by clicking “Save”. The biomaterial LIMS optionally keeps track of the storage location of material and tracks amounts of material. Material is automatically decreased when new events are created that affect the material. For samples, we create an extract in the next step; this action is an event in ProSE, and all biomaterial events are stored in ProSE. Further information about the sample can be entered as annotations, see Note 3.

4. Select the GelSample and click on the “Make Extract” button to create a new extract. Fill in the fields: name it “GelSample.e1”, enter an external id, enter the amount of sample used (use 10) and the amount of extract produced (use 35). Click “Save”. Return to the GelSample and note that the remaining amount of sample is decreased by the used amount. Clicking on the “Event” tab will show the events associated with the sample, and clicking on the creation event will display details about the event.
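The amount tracking behaviour in steps 3 and 4, where creating an extract event automatically decreases the remaining amount of the source material, can be sketched in a few lines. The class and field names below are illustrative assumptions for this sketch, not ProSE's actual data model:

```python
class Biomaterial:
    """Minimal sketch of biomaterial amount tracking via events."""

    def __init__(self, name, amount):
        self.name = name
        self.remaining = amount
        self.events = []

    def create_extract(self, name, used, produced):
        # Creating an extract is an event that decreases the source amount.
        self.remaining -= used
        self.events.append(("extract created", name, used))
        return Biomaterial(name, produced)

# The quantities from steps 3 and 4 above:
sample = Biomaterial("GelSample", 100)
extract = sample.create_extract("GelSample.e1", used=10, produced=35)
# sample.remaining == 90, extract.remaining == 35
```

The point of keeping every change as an event, rather than editing amounts in place, is that the full history remains queryable, which is what the “Event” tab exposes.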

5. Now, we add the first dimension separation event for the extract. Click “Next” (or GelProject → LIMS → Extracts), select the “Event” tab, and click on the “New separation event” button. Select the separation technique (here: IPG) and click “Next”.

6. The second dimension separation is done similarly to the IPG separation. Select the “Event” tab and click on the “New separation event” button. Select the separation technique (GelElectrophoresis) and click “Next”. There is no gel readily available yet, so we need to create one following the wizard. Fill in the two forms appropriately (use “pool_test” as the External ID; this is important since the sample data file set expects that for tracking, as outlined in Note 5 below). ProSE will report “Gel saved”; then finalize by adding the date of the event in the laboratory, while the used quantity should be set to zero, since no more extract was used for this step. Click “Save”.

7. Connect the IPG separation to the “pool_test” gel by selecting the gel (GelProject → LIMS → Gels); on the right hand side of the gel information display, click on “Add previous separation dimension” and select the IPG event from step 5.


8. Create a staining event by clicking “New Staining Event” on the gel information page.

9. Create a scan event by clicking on “New Gel Scanning Event” on the gel information page, and select the new scanning event in the list. In the GelScanEvent display, click “SelectImageFile” to add a gel scan image to the event. Locate the image file in the directory listing, and click on “Next” and “Next” again to get back to the GelScanEvent display page. Now, you can view the image by clicking “view image file”.

This finalizes most of the manual creation of information. Now, we import all mass spectrometer data.

10. Select GelProject → Hits Import → Gel Based. Enter the gel id “pool_test” in the Gel field. Click on “Next – Select Robot Result File[s]”, and in the file listing select the “spot_pick2.xml” file and click on “Import” to import spots. A job listing is presented; click on the “Update” button (bottom left on screen) to update the display. When the “GelSpotPlatePosToHitPlugin” gets the status “Done”, the job is finished and the spots are imported.

11. The next step is to register all the peak lists generated by MS. Select GelProject → Hits Import → Gel Based a second time. The files provided in the example data all come from one microtiter plate, with the ID 181150420000TEST, and are in mzData format (see Note 4 about peak list file formats). Select Plate external ID 181150420000TEST in the selection box, and click on “Next – Select PeakList File[s]”. Select all files that begin with the string 181150420000TEST_ and click “Import”. You will be taken to the job listing display; peak list import is done when all jobs with names like PeakListToHitPlugin File: 181150420000TEST_E2.xml get status “Done”.
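The example file names encode tracking information directly in the name, as discussed in Note 5: the plate external ID followed by the well position, as in 181150420000TEST_E2.xml. A minimal parser for such a convention might look as follows; the pattern and field names are illustrative assumptions, not the exact rules ProSE applies internally:

```python
import re

# Assumed convention: <plate external ID>_<well row><well column>.<extension>
NAME_PATTERN = re.compile(
    r"^(?P<plate>[A-Z0-9]+)_(?P<row>[A-P])(?P<column>\d{1,2})\.(?P<ext>\w+)$"
)

def parse_peak_list_name(filename):
    """Extract plate ID and well coordinates from a peak list file name."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename}")
    return {
        "plate": match.group("plate"),
        "well": (match.group("row"), int(match.group("column"))),
        "format": match.group("ext"),
    }

info = parse_peak_list_name("181150420000TEST_E2.xml")
# info["plate"] == "181150420000TEST", info["well"] == ("E", 2)
```

A convention like this lets the import plug-in connect each spectrum file to the correct microtiter plate well even though neither the plate ID nor the well is stored inside the file itself.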

12. Set up search parameters to run search engines from ProSE. Note that the ProSE installation needs to be configured to access search engines first (see Note 1); in case you do not have access to a ProSE installation with search engine access, you can proceed by uploading the supplied search results and continue at step 15. Search engine parameters are edited by selecting View → Search Setup. Parameter setups for Mascot and X!Tandem should be generated, and for this sample data a tolerance of 100 ppm at both precursor and fragment level, and a human database with a decoy section, should be used (see Note 6 about combination of search results regarding the choice of database).

13. Run X!Tandem. Select GelProject → Files. Select the files starting with the string “181150420000TEST_” and click on “Extensions”. In the pop-up, select “Use spectrum file(s) for X!Tandem search”, then the X!Tandem parameter file, and “Next – Create search job[s]” to start an X!Tandem search. You will again end up in the job listing; wait until the job finishes. When jobs are finished, the search result files will be found in the project top directory.

14. Redo the above with Mascot. Select GelProject → Files. Select the files starting with the string “181150420000TEST_” and click on “Extensions”. In the pop-up, select “Use spectrum file(s) for Mascot search”, enter your name and email address for the Mascot server, select the Mascot parameter file, and “Next – Create search job[s]” to start a Mascot search.

15. Now, the search results need to be imported into the database. So far, the files have been automatically (or manually) uploaded. Select GelProject → Hits Import → Gel Based to import the X!Tandem and Mascot results: click on “Next – Select Search Result File[s]”, select the files with file type “Tandem result” and “Mascot result”, and click “Import”. When the imports finalize, you can select GelProject → Result → Hits to get a listing of your search results, which can be examined more closely by filtering and clicking.

16. Select GelProject → Result → Combined Hits to create combined identification reports (see Note 6 for details). Select an acceptable false discovery rate (FDR), typically 0.05 or 0.01 (Fig. 1). Since proteins have been separated by 2D gel, the search results combination can be done using protein scores at the protein level (peptide level check box not checked). Set the result file name and click “Next” to start the job. Also see Note 6 about combining search results. When the job has finished, a text report will have been generated, and the Hits table will be updated with combined FDRs for the identifications.

Fig. 1. Form for combining searches in ProSE. The gel or sample is selected in the select boxes. The random hits prefix needs to be adjusted to whatever prefix is used for random hits in the database used for the searches. Search engines to include can also be selected.

17. The reports will now contain gel spot identifiers and spot coordinates on the 2D gel, as well as the associated identifications. To visualize the gel spots on the gel, move to the hits report (GelProject → Result → Hits) and select view gels. All spots that are active using the current filter will be visible on the gel. For example, if “45” is entered in the Spot ID filter, only spot 45 will be shown on the gel. To show all spots where the protein Actin was found, the filter “=*actin*” in the description column can be used. This can also be combined with a combined FDR filter, for example “<0.05”.
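The role of the random hits prefix in Fig. 1 can be illustrated with a simplified target-decoy FDR estimate: hits matching decoy (random) database entries are counted as the threshold moves down the score-ranked list. This is a generic sketch of the idea, not ProSE's actual combination algorithm:

```python
def decoy_fdr(hits, decoy_prefix="decoy_"):
    """Estimate FDR for score-ranked hits using a target-decoy approach.

    `hits` is a list of (accession, score) pairs; accessions starting
    with `decoy_prefix` are hits against the random (decoy) section of
    the database. Simplified sketch, not ProSE's exact algorithm.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    decoys = targets = 0
    fdrs = {}
    for accession, score in ranked:
        if accession.startswith(decoy_prefix):
            decoys += 1
        else:
            targets += 1
        # FDR at this score threshold: estimated false hits / accepted targets
        fdrs[accession] = decoys / max(targets, 1)
    return fdrs

hits = [("P1", 95), ("decoy_X", 80), ("P2", 70), ("P3", 60)]
fdrs = decoy_fdr(hits)
# P1 is accepted before any decoy, so its FDR is 0.0;
# P3 comes after one decoy and two other targets, so its FDR is 1/3.
```

This is why the prefix must match whatever labelling the decoy section of the search database actually uses; with the wrong prefix, no hits are counted as random and the estimated FDR is meaninglessly low.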

18. Now, the complete experiment is saved in ProSE. Many files will probably be found in the project main directory, and it is therefore advisable to create subdirectories and move files there. Separate directories can be made for reports, search results, peak lists etc.

19. Hopefully, the results are worth sharing with the rest of the world. Then, uploading to PRIDE (5) is recommended (also see Note 7 about publication of data). To generate files in PRIDE-formatted XML, the built-in PRIDE XML export can be reached from the Hits report.

3.2. Quantitative LC-MS with Isobaric Labels Case Study

3.2.1. Laboratory Work

In this second example workflow, the protein levels of four different cell states are compared. For this analysis, the samples were reduced and alkylated using iodoacetamide, digested with trypsin, and labeled using the isobaric label TMT (6). The labeled peptides were loaded onto a nano-LC system and analyzed online by LTQ-Orbitrap. To get many peptide identifications, CID fragmentation and analysis in the linear trap was used. However, the reporter ions are not visible in most of the spectra, since they are found at lower masses than what can be analyzed in the ion trap using CID fragmentation. To overcome this, each MS/MS scan in the linear trap was followed by an MS/MS scan in the Orbitrap of the same precursor ion using high-energy collision-induced dissociation (HCD) fragmentation.
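The acquisition scheme above pairs each linear trap CID scan (used for identification) with a following Orbitrap HCD scan of the same precursor (which carries the reporter ions). A simplified sketch of such pairing, with assumed field names and a hypothetical m/z tolerance, might look like this:

```python
def pair_cid_with_hcd(scans, mz_tolerance=0.01):
    """Pair each CID scan with the next HCD scan of the same precursor.

    `scans` is a time-ordered list of dicts with keys "scan", "type"
    ("CID" or "HCD"), and "precursor_mz". This illustrates the
    acquisition scheme described above, not ProSE's implementation.
    """
    pairs = []
    for i, scan in enumerate(scans):
        if scan["type"] != "CID":
            continue
        # Look ahead for the next HCD scan on (nearly) the same precursor:
        for candidate in scans[i + 1:]:
            if (candidate["type"] == "HCD"
                    and abs(candidate["precursor_mz"] - scan["precursor_mz"])
                    <= mz_tolerance):
                pairs.append((scan["scan"], candidate["scan"]))
                break
    return pairs

scans = [
    {"scan": 100, "type": "CID", "precursor_mz": 650.32},
    {"scan": 101, "type": "HCD", "precursor_mz": 650.32},
    {"scan": 102, "type": "CID", "precursor_mz": 480.77},
    {"scan": 103, "type": "HCD", "precursor_mz": 480.77},
]
# pair_cid_with_hcd(scans) -> [(100, 101), (102, 103)]
```

Once the pairing is known, reporter ion intensities can be read from the HCD member of each pair and attached to the identification made from the CID member.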

3.2.2. ProSE Work

The major reason for using ProSE in this project was to get a large number of confident peptide identifications at a controlled error rate, and to automatically get the reporter ion ratios included in the report. Currently, there is no other software that takes the reporter ion quantities from adjacent scans if they are not present in the spectrum used for identification. Here, we have chosen not to enter the sample information into ProSE, but rather to
