
Databases of Protein–Protein Interactions and Complexes


Hong Sain Ooi, Georg Schneider, Ying-Leong Chan, Teng-Ting Lim, Birgit Eisenhaber, and Frank Eisenhaber

Abstract

In the current understanding, translation of genomic sequences into proteins is the most important path for realization of genome information. In exercising their intended function, proteins work together through various forms of direct (physical) or indirect interaction mechanisms. For a variety of basic functions, many proteins form a large complex representing a molecular machine or a macromolecular super-structural building block. After several high-throughput techniques for detection of protein–protein interactions had matured, protein interaction data became available on a large scale, and curated databases for protein–protein interactions (PPIs) became a new necessity for efficient research. Here, their scope, annotation quality, and retrieval tools are reviewed. In addition, attention is paid to portals that provide unified access to a variety of such databases with added annotation value.

Key words: protein–protein interaction, protein-complex database, PPI database.

1. Introduction

Protein–protein interactions (PPIs) are a critical attribute of most cellular processes. Protein interactions can either be direct (physical) via the formation of an interaction complex (with varying affinity of interaction and duration of complex formation) or they can be indirect (just functional) via a variety of genetic dependencies, transcriptional regulation mechanisms, or biochemical pathways. Traditionally, instances of PPI have been studied by genetic, biophysical, and biochemical techniques. Until less than a decade ago, their experimental detection was cumbersome; the cost of such a laborious effort restricted the number of known complexes and the main information source about PPI was scientific journal articles that, typically, described one or a handful of interactions only.

O. Carugo, F. Eisenhaber (eds.), Data Mining Techniques for the Life Sciences, Methods in Molecular Biology 609, DOI 10.1007/978-1-60327-241-4_9, © Humana Press, a part of Springer Science+Business Media, LLC 2010


The first high-throughput PPI detection technology was provided with the yeast two-hybrid technology (1) followed by several others, among which tag-based mass-spectrometric techniques (2) have recently become the state of the art. Other major sources are correlated expression profiles (3, 4) and genetic interaction data (5) (e.g., on synthetic lethality) but theoretical, in silico computed approaches based on interaction predictions from gene context studies (gene fusion events (6–9), gene neighborhood (10–13), and gene co-occurrences/absences, also called the method of phylogenetic profiles (14–18)) increasingly contribute to our understanding of protein networks.

It is evident that, at present, we know only a fraction of the interaction network in cellular systems (and, of course, only in a qualitative manner). Nevertheless, the sheer size of the available data about interactions requires their collection in electronically readable databases. Currently, there are a number of competing database projects that vary in their scope, annotation quality, and availability to the public. Some of these databases are ambitious projects that try to collect all possible known interactions between proteins of every organism. The Biomolecular Interaction Network Database (BIND) (19, 20) (it was recently renamed Biomolecular Object Network Databank, BOND, and commercialized) is one of the most comprehensive databases of protein–protein interactions and complexes. Among its many features, it not only has an interactive Web portal for searching and browsing through the records, but also provides standardized application interfaces (APIs) for various computer languages like Perl, Java, C, and C++ to allow another avenue to access its data. Other databases, on the other hand, can be specific for certain diseases or organisms only. For example, NCBI's HIV-1, Human Protein Interaction Database (http://www.ncbi.nlm.nih.gov/projects/RefSeq/HIVInteractions) attempts to collect all known interactions between the various HIV-1 viral proteins and human proteins. Such databases are very specific and therefore usually contain less data and have less functionality than the general interaction databases.

Protein interaction databases, in turn, will become useful only with respective retrieval tools and, most importantly, with their integration into annotation pipelines that enables them to become means for discovery of new biomolecular mechanisms. For example, a recent publication describes the use of information about protein complexes in yeast to predict the phenotypic effect of gene mutation (21) and suggests that this approach can possibly be extended to predicting and investigating the genes of Mendelian or complex diseases. We need to admit that, at this front, there are still many open issues and the qualitative change in biological theory aimed at more system biological understanding is still a matter of the future.

2. Recent Status of Protein–Protein Interaction and Complex Databases

Entry items in PPI databases are interactions or complexes. A protein–protein interaction usually refers to a binary relationship between one protein and another. On the other hand, protein complexes consist of several subunits and, thus, refer to a set of proteins. Each protein pair in this set forms an interaction and some pairs even interact physically in a direct manner. More generally, a protein complex can be viewed as a special case of a set of proteins with a common functional description. Other examples are the set of proteins in a pathway or the set of coexpressed targets under specific biological conditions. The amount of interactions measured with a specific method depends on the degree of interaction (e.g., its affinity) and the duration of this interaction. The duration of the interaction may be long term and with high affinity (so that the complex can survive the harsh purification procedures); it may also be rather transient as in enzyme substrate complexes.
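The difference between the two kinds of entry can be illustrated with a small Python sketch. It is illustrative only: the subunit identifiers are made-up placeholders, and expanding a complex into all of its pairs is just one simple convention that records co-membership, not necessarily direct physical contact.

```python
from itertools import combinations

def complex_to_pairs(subunits):
    """Expand a complex (a set of proteins) into unordered protein pairs."""
    unique = sorted(set(subunits))
    return {frozenset(pair) for pair in combinations(unique, 2)}

# Hypothetical complex with four subunits (placeholder identifiers).
example_complex = ["P1", "P2", "P3", "P4"]
pairs = complex_to_pairs(example_complex)

# A complex with n subunits yields n * (n - 1) / 2 pairs.
print(len(pairs))
```

Only a subset of these pairs would correspond to direct physical contacts, which is why a database entry for a complex carries different information than a list of binary interactions.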

In the following section, we mention the most important sources of PPIs currently available. We classify the protein–protein interaction databases into three main categories, based on the methods used to collect or generate the data. A majority of these databases are repositories of experimental data, which were collected either through manual curation, computational extraction, or direct deposit by the authors, such as DIP (22), MINT (23), and IntAct (24). The second type of databases stores predicted protein–protein interactions. Examples of these are PIPs (25), OPHID (26), and HomoMINT (27). Finally, the last category is a portal that provides unified access to a variety of protein interaction databases. The most advanced example of this category is STRING (28, 29). A comparison of primary databases for PPIs is provided by Mathivanan et al. (30). There are also databases for PPI in bacteria (31, 32). For a more complete list of protein–protein interaction databases, readers can refer to Pathguide (33), which contains information about 290 biological pathway and interaction resources.

2.1. Database of Interacting Proteins (DIP)

The main aim of DIP (22) is to provide the scientific community with a single, user-friendly online database by integrating the existing experimentally determined protein–protein interactions from various sources. It mainly records binary protein–protein interactions that were manually curated by experts. In recent years, DIP has been extended to include interactions between protein ligands and protein receptors (DLRP) (34). The database is consistently updated and the interaction data together with the protein sequences can be downloaded in several formats including tab-delimited and PSI-MI (35–37).


Access to DIP requires registration and is free for academic users. An extensive help page and a search guide are provided. A search for proteins can be performed in a number of ways such as by node identifiers (a node is a protein in DIP), descriptions, keywords, BLAST query of a protein sequence, sequence motifs, or literature articles. The search returns a list of proteins that matched the search criteria. The "Links" field lists all the interactions of a particular protein. The link under the "Interaction" field provides the experimental evidence and the corresponding publication support for the interaction. The detailed description of a protein is also given and can be viewed by selecting the "Interactor(s)" field. The "graph" link opens the interaction map for the current protein. To provide a reasonable visualization, only nodes up to two edges from the root node are displayed. The width of the edges reflects the number of independent experiments supporting this interaction and is useful to identify highly confident interactions. The interaction maps generated have links to all nodes and this allows navigation from one protein to another.
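The two-edge display limit amounts to a depth-limited breadth-first search from the root node. The following Python sketch shows the idea on a toy network; the node names and experiment counts are invented, and the code is not DIP's implementation, only one plausible way to obtain the same kind of view.

```python
from collections import deque

def neighborhood(edges, root, max_depth=2):
    """Return {node: distance} for nodes within max_depth edges of root."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if distance[node] == max_depth:
            continue  # do not expand beyond the display limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Toy network: each edge maps to a hypothetical number of
# independent experiments supporting it (the "edge width").
edges = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 2, ("A", "E"): 1}

shown = neighborhood(edges, "A")
visible_edges = {
    pair: support
    for pair, support in edges.items()
    if pair[0] in shown and pair[1] in shown
}
```

Here node D lies three edges from the root A and is therefore left out, while the edge A to B, supported by three experiments, would be drawn widest.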

2.2. Molecular INTeraction database (MINT)

The Molecular INTeraction database (MINT) (23) is an endeavor to document experimentally verified protein–protein interactions, which are mined from the scientific literature by expert curators. While the main focus of the team is on protein–protein interactions, other interaction data such as enzymatic modifications of the interacting partners are also recorded. Although most of the interactions come from high-throughput experiments, the main value of MINT resides in the high number of curated articles. The data can be freely downloaded and are available in several formats.

The search interface presents several query options. The user can retrieve the list of proteins from an article based on PUBMED ID or authors. The query might also be based on protein or gene names, protein accession numbers, or keywords, and can be limited to specific data sets (all taxa, mammalian, yeast, worm, fly, or viruses). Finally, a BLAST search can be performed to find proteins that are homologous to the query protein. The search returns a list of proteins with information such as a brief description of protein function, Uniprot AC, taxonomy, and domains. The detailed page of the protein shows a summary of the protein features in the left panel while the set of the interacting partners of the query protein is given in the right panel. The type of evidence support from the literature is also specified together with the respective scores. The interaction network is visualized with the MINT Viewer. The viewer provides advanced features such as filtering the network based on scores as well as expanding and collapsing network sections. The result can be exported in several formats, for example, flat file, Osprey (38), and PSI-MI (36).

2.3. IntAct

IntAct (24), by itself, is an open source database and software framework. The system provides a flexible data model which can accommodate a high level of experimental details. It also contains a suite of tools that can be used to visualize and analyze the interaction data. The interaction data are manually extracted from public literature and annotated to a high level of detail through the extensive use of controlled vocabulary. Most of the interaction data come from protein–protein interactions, but IntAct also captures nonprotein molecular interactors such as DNA, RNA, and small molecules. IntAct is updated weekly and can be downloaded in the PSI-MI format (36). Both the IntAct software Rintact (39) and the data are freely available to all users.

A simple, yet flexible search engine is provided. Users can search for a broad range of identifiers, accession numbers, names, and aliases. The search results may also be filtered with criteria such as publication ID, first author, experiment type, and interaction type. The search result is displayed in tabular form for easy browsing and can be downloaded in the PSI-MI format. To visualize the interactions for a particular protein, the link with the IntAct accession, instead of that with the Uniprot accession, has to be selected. The new page displays basic information about the selected protein and a number of interactions involving the current protein. Then, one can select the protein and click on the "Graph" link. The interactive viewer provides a number of unique features such as highlighting the node based on the molecule type, Gene Ontology (40), InterPro (41) annotation, experimental and biological role, or species. Similar to the MINT viewer, the interaction network can also be expanded or refocused to a new protein. The result of the navigation can be immediately exported in PSI-MI format (36).

2.4. BioGRID

The Biological General Repository for Interaction Data sets (BioGRID) (42, 43) is an effort to collect both protein and genetic interactions from major model organisms. BioGRID provides the most up-to-date and virtually complete set of interaction data reported in the published literature for both the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe (42). The database contains data from both high-throughput and conventional studies. It is updated monthly. The data can be downloaded freely in several formats such as PSI-MI (36), tab-delimited, and Osprey. The data can also be downloaded ordered by gene, publication, organism, or experimental system.

A search can be performed with a wide variety of identifiers, for example, cDNA accession and GI numbers as well as with Ensembl, Entrez gene and Uniprot accessions (see their Help page for full descriptions). The result page contains a list of matched items with and without associations. The description page of a selected protein shows the standard annotations, links to external databases, Gene Ontology, and the number of both protein and genetic interactions. Subsequently, a list of interacting partners is displayed and so are the experimental support and the corresponding publications. The interaction type can be recognized via the color code of the experiments. It is possible to download the data for each interaction or publication in the supported formats. Currently, no visualization is available; however, Osprey can be used (38).

2.5. Human Protein Reference Database (HPRD)

The main purpose of HPRD (44) is to build a complete catalogue of human proteins pertaining to health and disease. While HPRD is not a protein–protein interaction database, it contains an extensive list of interaction data of human proteins. All data in HPRD are manually extracted from public literature and curated by a team of trained biologists. The data are freely available for academic users and can be downloaded in either tab-delimited or XML formats. Users can download the whole database or only protein–protein interaction data without annotations in a tab-delimited or PSI-MI format (36).

The database can be searched by keywords or by sequences. The "Query" page provides various keyword fields; these include protein names, accession numbers, gene symbols, chromosome locations, molecular classes, domains or motifs, and posttranslational modifications. The "Browse" page organizes the list of proteins into different categories for easy browsing. It is a unique feature of HPRD that the annotations for a particular protein are organized in tabs. The "Interactions" tab provides the list of protein interactors together with the experiment type. Nonprotein interactors are also listed on the same page. No interaction visualizer is provided. The "Pathways" tab leads to the corresponding protein entry in NetPath (www.nethpath.org), which contains a number of immune- and cancer-signaling pathways. From NetPath, users can download the corresponding pathway in popular file formats.

2.6. MPact

MPact (45) is an organism-specific database focusing on manually curated protein–protein interactions and complexes from S. cerevisiae and acts as an access point to PPI resources available in CYGD (46). As the database is part of CYGD, the rich set of information in CYGD is directly accessible from MPact. Due to its quality, the data set has been used in numerous studies and is widely considered as a gold standard for yeast protein–protein interactions (47–49). The latest version of data is available for download in PSI-MI format (36).

A "Quick Search" box is provided for quick access to the interaction data by protein ID and gene name. More specific queries can be performed by using the "Query by Protein" search page. Here, protein attributes such as names or aliases, functional categories, cellular localization, and EC numbers can be specified. Additional criteria such as evidence and interaction type, publication ID, and an option to exclude high-throughput experiments are available. This feature is useful to select interaction data based on the strength of the detection methods. The results can be viewed in two formats. In the short format, only protein ID, gene name, a simple description, and the link to CYGD are listed. The long format provides additional information such as the type of experimental evidence, the publication ID, the full function description, and the type of the interaction. The search result can be downloaded in PSI-MI format (36). A simple visualizer is available for illustrating the interaction network. The nodes are colored based on functional categories and the color of the edges reflects the level of supporting evidence for the corresponding interactions. The network can also be downloaded in PDF format for offline use.
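Selecting interactions by evidence type, with the option of leaving out high-throughput results, is a simple record filter. The Python sketch below illustrates it; the record layout, field names, gene names, and evidence labels are all assumptions made for the example and do not reflect MPact's actual schema or query interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    protein_a: str
    protein_b: str
    evidence: str          # hypothetical label, e.g. "two-hybrid"
    high_throughput: bool  # True if reported by a large-scale screen

def select(records, exclude_high_throughput=True, evidence=None):
    """Keep only records that pass the chosen evidence criteria."""
    kept = []
    for rec in records:
        if exclude_high_throughput and rec.high_throughput:
            continue
        if evidence is not None and rec.evidence != evidence:
            continue
        kept.append(rec)
    return kept

# Placeholder records; the gene names are invented.
records = [
    Interaction("YFG1", "YFG2", "two-hybrid", True),
    Interaction("YFG1", "YFG3", "coimmunoprecipitation", False),
    Interaction("YFG2", "YFG3", "two-hybrid", False),
]

small_scale = select(records)
small_scale_y2h = select(records, evidence="two-hybrid")
```

With these toy records, excluding high-throughput data drops the first entry, and additionally restricting the evidence type leaves a single interaction.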

2.7. STRING

STRING was first introduced in the year 2000 (50) and evolved from a Web server of predicted functional association between proteins into a comprehensive Web portal of protein–protein interactions (28, 51). It integrates data from numerous sources, not only from experimental repositories but also from computational prediction methods and automated text mining of public text collections such as PUBMED. To facilitate the integration of multiple data sets, the interactions are mapped onto a consistent set of proteins and identifiers. During the integration, isoforms are reduced to a single representative protein sequence. While this approach enables unique comparison and efficient storage, the interaction information may lead to misinterpretation of the result in later stages as some interactions only occur for a particular isoform of the protein. While STRING data can be freely downloaded mostly in flat file or as a database dump, the complete data set is only available under a license agreement, which is free for academic users.
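The isoform caveat can be made concrete with a short Python sketch. The identifiers and the isoform-to-representative mapping are invented, and the code is only an illustration of the general idea of collapsing isoforms, not STRING's actual integration procedure.

```python
def collapse_isoforms(interactions, representative):
    """Map each interactor to its representative and merge duplicate pairs."""
    collapsed = set()
    for a, b in interactions:
        rep_a = representative.get(a, a)
        rep_b = representative.get(b, b)
        if rep_a != rep_b:  # skip pairs that collapse onto one protein
            collapsed.add(frozenset((rep_a, rep_b)))
    return collapsed

# Hypothetical mapping: two isoforms of protein X share one representative.
representative = {"X-1": "X", "X-2": "X"}

# Each isoform has its own, distinct partner.
interactions = [("X-1", "Y"), ("X-2", "Z")]

collapsed = collapse_isoforms(interactions, representative)
```

After collapsing, X appears to interact with both Y and Z, although in this toy example no single isoform binds both partners. This is the kind of misinterpretation the text warns about.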

The interaction networks can be searched by protein names and accessions and a variety of accession types is supported. The search returns a list of proteins that match the term and the user can select the best candidate. A similar search can also be performed using the protein sequence with the best-matched protein selected automatically. STRING provides a powerful network visualizer together with a rich set of annotations. Several visualization tools are available for analysis and facilitate navigation within the interaction network. Users are encouraged to refer to the online help page for more information. STRING also provides a search interface for querying the interaction network with a protein list that tries to connect all or most of them via interactions in the STRING database.
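One simple way to think about connecting a list of query proteins is to keep only the interactions among the listed proteins and then group the proteins into connected components. The Python sketch below does exactly that on invented data; it is an assumption-laden illustration and makes no claim about how STRING itself connects the query proteins (for instance, it does not add intermediate proteins).

```python
def connected_groups(query, interactions):
    """Group query proteins linked by interactions among themselves."""
    query = set(query)
    adjacency = {protein: set() for protein in query}
    for a, b in interactions:
        if a in query and b in query:
            adjacency[a].add(b)
            adjacency[b].add(a)
    groups, seen = [], set()
    for start in sorted(query):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Placeholder query list and interaction set.
query = ["A", "B", "C", "D"]
interactions = [("A", "B"), ("B", "C"), ("D", "E")]
groups = connected_groups(query, interactions)
```

In this toy case A, B, and C end up in one group, while D stays isolated because its only partner, E, is not part of the query list.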

2.8. Unified Human Interactome (UniHI)

Unified Human Interactome (UniHI) (52) provides unified access to human protein interaction data from various sources including both computational and experimental repositories. The aim of UniHI is to be the most comprehensive platform to study the human interactome. Currently, it contains interaction data extracted from six public experimental repositories, large-scale Y2H screenings, and computational extraction through text-mining and orthologue transfer. The integration of proteins was performed using information from Ensmart (PMID: 14707178) and HGNC (PMID: 11810281).

Users can query UniHI using the UniHI search tool. A variety of protein identifiers are supported and users can submit a set of proteins to obtain their functional information and interacting partners. The search returns a list of matched proteins together with the original source database names. UniHI provides an interactive viewer to visualize the interaction networks. This software offers several options to refine the network. In addition, UniHI provides two powerful tools to analyze the human interactome. The first one is UniHI

