
Design of mzML

From the document Data Mining in Proteomics (pages 195-198)

Michael Turewicz and Eric W. Deutsch

2. Design of mzML

183 Spectra, Chromatograms, Metadata: mzML-The Standard Data Format

– “Sharing of best practice”: Methods that have been successful at identifying low abundance peptides or proteins should be reviewable for sharing of best practice.

– “Evaluation of results”: Sufficient additional information about a particular acquisition method should be provided to allow critical evaluation of the acquired data.

– “Sharing of data sets”: Public repositories should be able to import or export the data, multisite projects should be able to share results to support integrated analysis, and meta-analysis of previously published data should be possible.

– “Most comprehensive support of the instrument output”: Data should be representable in all relevant forms of mass spectrometry representation, especially in centroid mode and profile mode.

Furthermore, the designers had to agree on how to reconcile these principal tasks with the philosophies of the two precursor formats. The outcome of this discussion was a set of design principles formulated as follows:

Simplicity: Although the introduction of new features was discussed, the designers decided to abandon most extensions proposed during the design process. In the end, the conviction prevailed that a simple but robust implementation would be a better basis for the new format.

Uniqueness: The same information should always be encoded in a unique way. The designers preferred rigid unambiguity to inappropriate flexibility.

Stability: The data format should be as stable as possible, and the expected frequency of software updates should be limited. This is ensured by the concept of controlled vocabularies, because new terms can be added to a vocabulary without changing the file schema. Nevertheless, it was obvious that some flexibility for encoding important new information had to be incorporated. This is provided by the <userParam> element (see Subheading 4.2).

Preservation of functionality: All features of the precursor formats should be supported. At the same time, however, the designers decided to refrain from introducing new features in mzML 1.0.

Rapid development: The designers recognized the duality of mzData and mzXML as the main problem for the community and the primary target for their efforts. Therefore, they decided to devote all resources to releasing a new standard data format and making its precursors obsolete. The rapid development of version 1.0 had higher priority than supporting new features. Support for new features has been deferred until the release of version 2.0.


Validity: The designers decided to validate mzML first by implementing software to read and write the new format before its release.
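The balance between stability and flexibility described under “Stability” can be illustrated with a minimal sketch that writes one controlled and one uncontrolled annotation using Python's standard library. The <cvParam> values are those of the total ion current example used in this chapter; the <userParam> name and value are hypothetical and chosen only for illustration, and the bare <spectrum> parent is a simplification rather than a schema-valid mzML fragment.

```python
import xml.etree.ElementTree as ET

# Parent element that carries the annotations (a simplified <spectrum>).
spectrum = ET.Element("spectrum", id="scan=1")

# Controlled annotation: accession and name come from the PSI-MS vocabulary.
ET.SubElement(spectrum, "cvParam", cvRef="MS", accession="MS:1000285",
              name="total ion current", value="16675500")

# Uncontrolled annotation: this name is hypothetical and has no accession,
# so software cannot interpret it without prior agreement.
ET.SubElement(spectrum, "userParam", name="operator comment",
              type="xsd:string", value="column replaced before run")

xml_text = ET.tostring(spectrum, encoding="unicode")
print(xml_text)
```

Because only the <cvParam> carries an accession, a reader can resolve its meaning from the vocabulary, whereas the <userParam> remains free text.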

Finally, we want to outline the scope of application of mzML by listing several of its essential use cases. The example files referred to in the following can be found on the mzML web page (16). These essential use cases include the following:

– The ability to encode both forms of spectrum representation: profile mass spectra and centroid mass spectra.

– Information about all current mass spectrometers (e.g., LTQ-FT mass spectrometers) and their settings, as well as their experimental output, should be encodable, and their (proprietary) mass spectrometer output should be easily convertible into mzML. Example files: small.pwiz.1.1.mzML, small_miape.pwiz.1.1.mzML, and small_zlib.pwiz.1.1.mzML (generated via conversion with the msconvert tool from ProteoWizard (29) of a Thermo RAW file from an LTQ FT instrument).

– Possibility to convert not only a single source file into a single mzML file, but also sets of files into a single mzML file. Example file: dta_example_1.1.0.mzML (folder of DTA files generated by Proteios Software Environment (30) and converted into a single mzML file).

– Possibility to convert an arbitrary common peak list file into mzML format. Example file: plgs_example_1.1.0.mzML (generated by conversion of a ProteinLynx Global SERVER (31) XML peak list which was generated by Proteios).

– Provision of full support for different data and metadata from different spectrum types, such as the neutral loss spectrum, which is acquired by neutral loss scans. Example file: neutral_loss_example_1.1.0.mzML (hand crafted).

– Another important spectrum type is the precursor spectrum. Spectra acquired by precursor scans should be supported. Example file: precursor_spectrum_example_1.1.0.mzML (hand crafted).

– Storage of quantitation-related data and metadata should be possible. All important modes of scanning and acquiring data, e.g., Selected Reaction Monitoring (SRM), Total Ion Current (TIC), and Selected Ion Monitoring (SIM), should be supported. Example file: MRM_example_1.1.0.mzML (hand crafted).

– Another important type of instrument metadata is the information about the detector type used. It should also be possible to support all the common and different types of detectors, such as photodiode array (PDA) detectors, position and time-resolved ion collector (PATRIC) detectors, Faraday cups (or cages), electron multipliers (EMs), or microchannel plate (MCP) detectors. Example file: The “PDA example file” (hand crafted).

– Finally, encoding the same information in mzML as in mzData and mzXML should be possible. Three example files containing the same information in mzML, mzData, and mzXML have been uploaded on the mzML web page to demonstrate this: tiny1.mzML1.1.0.mzML, tiny1.mzData1.05.xml, and tiny1.mzXML3.0.mzXML.
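The first use case above, encoding both profile and centroid spectra, can be sketched as a small reader that reports the representation of each spectrum. This is only an illustration under stated assumptions: the accessions MS:1000127 (“centroid spectrum”) and MS:1000128 (“profile spectrum”) are taken to be the relevant PSI-MS terms and should be checked against the current vocabulary, the helper name spectrum_representation is hypothetical, and the embedded XML is a hand-made fragment without the mzML namespace rather than one of the example files.

```python
import xml.etree.ElementTree as ET

# Accessions assumed from the PSI-MS controlled vocabulary.
REPRESENTATIONS = {
    "MS:1000127": "centroid",
    "MS:1000128": "profile",
}

def spectrum_representation(spectrum):
    """Return 'centroid', 'profile', or None for one <spectrum> element."""
    for param in spectrum.iter("cvParam"):
        kind = REPRESENTATIONS.get(param.get("accession"))
        if kind is not None:
            return kind
    return None

# Hand-made fragment for illustration; real files declare the mzML namespace.
SNIPPET = """
<spectrumList count="2">
  <spectrum id="scan=1" index="0">
    <cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum"/>
  </spectrum>
  <spectrum id="scan=2" index="1">
    <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum"/>
  </spectrum>
</spectrumList>
"""

root = ET.fromstring(SNIPPET)
kinds = {s.get("id"): spectrum_representation(s) for s in root.iter("spectrum")}
print(kinds)
```

Matching on the accession rather than on the term name keeps the check independent of spelling or case, which is the purpose of a controlled vocabulary.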

A “controlled vocabulary” generally contains predefined terms to avoid spelling or case ambiguities. The PSI controlled vocabularies are hierarchies of controlled terms (“ontologies”) having for example “is_a” or “has_a” relationships to one or many “parent terms.”

Each term has a unique accession number and can have a value (e.g., MS:1000031, “instrument model”) and a unit for this value (e.g., MS:1001117, “theoretical mass”, unit = dalton). In an mzML file, <cvParam> elements are used to describe further details of a modeled object. Thus, most of the data concerning a mass spectrometry experiment are annotated using controlled vocabulary terms, e.g., <cvParam cvRef="MS" accession="MS:1000285" name="total ion current" value="16675500"/>, which states the sum of all the separate ion currents carried by the ions of different m/z contributing to a complete mass spectrum or to a specified m/z range of a mass spectrum. In the controlled vocabulary hierarchy, this term “is_a” “spectrum attribute,” which itself “is_a” “object attribute” and has a “part_of” relationship to “spectrum.” The position within the hierarchy can be used to check the correct use of controlled vocabulary terms (important for file validation). If an important new term needs to be added to the PSI-MS controlled vocabulary, the PSI-PI workgroup must be informed (see Note 4).
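How the position in the hierarchy can support validation may be sketched as follows. The example parses the <cvParam> shown above and walks a tiny excerpt of the “is_a” hierarchy that contains only the two relations named in the text. The helper names ancestors and allowed_under are hypothetical, and the lookup is keyed by term name for readability; a real validator would presumably load the full vocabulary file and key terms by accession.

```python
import xml.etree.ElementTree as ET

# Excerpt of the is_a hierarchy, limited to the relations named in the text.
IS_A = {
    "total ion current": "spectrum attribute",
    "spectrum attribute": "object attribute",
}

def ancestors(term):
    """Follow is_a links upward and return the parent terms in order."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def allowed_under(param, required_ancestor):
    """True if the cvParam's term descends from the required parent term."""
    return required_ancestor in ancestors(param.get("name"))

param = ET.fromstring(
    '<cvParam cvRef="MS" accession="MS:1000285" '
    'name="total ion current" value="16675500"/>'
)
tic = float(param.get("value"))
ok = allowed_under(param, "spectrum attribute")
print(param.get("accession"), tic, ok)
```

A term that does not descend from the expected parent would be flagged as misplaced by such a check.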

The following ontologies or controlled vocabularies may also be suitable or required for some elements of mzML:

– Unit Ontology (http://www.obofoundry.org/cgi-bin/detail.

– OBI (Ontology of Biological Investigations – http://obi.sourceforge.net/)

– PSI Protein modifications CV (http://psidev.sourceforge.net/mod/data/PSI-MOD.obo)

– Unimod modifications database (http://www.unimod.org/
