Gene Function-Prediction Experiments

From the document Data Mining in Biomedicine Using Ontologies (pages 115-120)

GO-Based Gene Function and Network Characterization

5.6 Gene Function-Prediction Experiments

5.6.1 Data Processing

This step includes data collection and preprocessing. Once the required data is collected from online resources, it needs to be processed in order to be used appropriately, as our statistical methods, like many others, are sensitive to data quality. Any poor-quality data might lead to significant false positives in analysis and prediction.

For example, in the case of microarray data, normalization and noise removal are important; otherwise, the estimation of the Pearson correlation coefficient (the most commonly used statistic for microarray data) can misrepresent the true correlation between gene-expression profiles.
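To make the sensitivity to noise concrete, the following minimal sketch computes the Pearson correlation coefficient for two expression profiles, once with clean values and once with a single corrupted measurement. The profiles and the names `pearson`, `gene_a`, `gene_b`, and `gene_b_noisy` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(vx * vy)

# Hypothetical expression profiles over five conditions (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.9]

# The same profile with one corrupted measurement in the last condition.
gene_b_noisy = gene_b[:-1] + [-40.0]

r_clean = pearson(gene_a, gene_b)        # strongly positive
r_noisy = pearson(gene_a, gene_b_noisy)  # sign flips because of one outlier
print(f"clean: {r_clean:.3f}, noisy: {r_noisy:.3f}")
```

A single bad array value is enough to turn a strongly positive correlation into a negative one here, which is why noise removal precedes any correlation-based neighbor search.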

5.6.2 Sequence-Based Prediction

We ran multiple simulations with the hypergeometric model introduced in Section 5.5.1. In each simulation, we randomly selected 100 genes from 5,117 genes with known functions and 1,000 GO annotations from 2,859 total GO annotations.

The simulations applied different p-value thresholds to define the neighbors. We compared the gene-function prediction performance between nonweighted p-values and weighted p-values. We sorted the p-values in descending order and calculated the ratio TP/(TP + FP), where TP is the number of true positives and FP the number of false positives, as a function of the p-value threshold, as shown in Figure 5.9. The figure shows that a strict p-value threshold helps enhance the prediction. When the p-value threshold increases from e-6 to e-3, prediction accuracy drops quickly. In addition, the weighted method is consistently better than the nonweighted one.
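The ratio TP/(TP + FP) at a given threshold can be computed as in the sketch below. It is an illustration under stated assumptions: the `(p_value, is_correct)` pairs are made up, the function name `precision_at_threshold` is hypothetical, and the weighting of p-values discussed above is not modeled.

```python
def precision_at_threshold(predictions, threshold):
    """Return TP / (TP + FP) over predictions with p-value <= threshold.

    predictions: iterable of (p_value, is_correct) pairs.
    Returns None when no prediction passes the threshold.
    """
    selected = [ok for p, ok in predictions if p <= threshold]
    if not selected:
        return None
    tp = sum(1 for ok in selected if ok)
    return tp / len(selected)

# Hypothetical (p-value, prediction confirmed by annotation?) pairs.
predictions = [
    (1e-8, True), (5e-7, True), (2e-6, True), (8e-6, False),
    (3e-5, False), (4e-4, True), (9e-4, False), (5e-3, False),
]

# One point of the accuracy curve per threshold.
curve = {t: precision_at_threshold(predictions, t)
         for t in (1e-6, 1e-5, 1e-4, 1e-3)}
for t, value in curve.items():
    print(f"threshold {t:g}: TP/(TP+FP) = {value:.3f}")
```

With this toy data the ratio falls as the threshold is relaxed, which mirrors the qualitative trend described for Figure 5.9 but says nothing about the actual values in the study.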

Figure 5.9 Prediction accuracy versus p-value threshold. The x-axis is the p-value threshold, and the y-axis is TP/(TP+FP).

5.6.3 Meta-Analysis of Yeast Microarray Data

We conducted a pilot study using 7 independent yeast microarray datasets from the GPL90 platform, covering 116 experimental conditions in total for all the genes in yeast (Table 5.2). We used the microarray data of 5,419 genes from the GPL90 platform, among which 4,519 genes have GO annotations, whereas the yeast genome GO annotation data was downloaded from the NCBI Gene Expression Omnibus (GEO) Web site, http://www.ncbi.nlm.nih.gov/geo/ [36–38].

Table 5.2 shows the dataset ID, the number of conditions or time points, and the overall experimental condition.

We plotted the conditional probability of the GO functional similarity given an individual p-value (on the log scale) for a single dataset or given the meta p-value for multiple datasets, as shown in Figure 5.10. Although the curves did not differ substantially between a single dataset and the multiple datasets combined, the curve for the meta p-value is much smoother than the curve for any single dataset, reflecting better statistics with a much larger sample size in the meta-analysis. We also found that, at the same threshold, many more pairs were statistically significant under the meta p-values of multiple datasets than under the p-values of any single dataset. This suggests that combining multiple datasets using the meta-analysis gives more discerning power in establishing statistical neighbors for query genes and hence increases the sensitivity for function prediction.
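This section does not spell out how the meta p-value is computed. As an illustration only, the sketch below uses Fisher's combined probability test, one common way to merge independent p-values; it is an assumption standing in for the method cited in the text, not a reproduction of it. The function name `fisher_meta_p` and the example values are hypothetical.

```python
from math import exp, log

def fisher_meta_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For an even
    number of degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!, so no external library
    is needed.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j   # (X/2)^j / j!, built incrementally
        total += term
    return exp(-half_x) * total

# Hypothetical p-values for one gene pair across seven datasets.
single = [0.05] * 7
print(f"each dataset: 0.05, combined: {fisher_meta_p(single):.2e}")
```

Seven individually borderline p-values combine into a much smaller one, which is consistent with the observation that more pairs pass a fixed threshold after meta-analysis. The independence assumption behind Fisher's method would need to be checked for real datasets.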

Table 5.2 Selection of Microarray Datasets for the Yeast Study

No. | Dataset | Columns | Experimental Condition
1 | GDS777 | 24 | Nutrient limitation under aerobic and anaerobic condition effect on gene expression (growth protocol variation)
2 | GDS772 | 18 | Histone deacetylase RPD3 deletion and histone mutation effect on gene regulation (genotype/variation)
3 | GDS344 | 11 | Chitin synthesis (protocol variation)
4 | GDS1205 | 12 | Ssl1 mutant for a subunit of TFIIH response to methyl methanesulfonate (genotype/variation)
5 | GDS1103 | 12 | Leu3 mutant expression profiles (genotype/variation)
6 | GDS991 | 15 | Phosphomannose isomerase PMI40 deletion strain response to excess mannose (dose variation)
7 | GDS1013 | 24 | IFH1 overexpression (time course)

To confirm this, we applied our function-prediction method to ~10% (500) randomly selected query genes from the yeast genome, using either single datasets or multiple datasets. We compared the sensitivity-specificity plot obtained with 1 dataset against the one obtained with all 7 datasets from Table 5.2. For this purpose, we selected the top 200 neighbors for each query gene to generate the coexpression-linkage network, using either 1 dataset or 7 datasets. We predicted functions for each query gene, one at a time, and then evaluated the sensitivities and specificities of the predictions of all query genes using the sensitivity-specificity curve. For each prediction scheme that corresponds to a particular functional-linkage network and a specific cutoff value for the likelihood scores, the sensitivity and specificity are calculated according to the following definition.

We consider assigning a function to a gene as a decision/prediction, which can be verified from the annotation data. There are two types of errors we can make: (1) we assign an incorrect function to a gene, which is a type I error, or a false positive; and (2) we do not assign a known function to a gene, which is a type II error, or a false negative. On the other hand, if we assign a correct function to a gene, it is a true positive; if a gene does not have a function and we do not assign it, it is a true negative. We consider all query genes and all available GO IDs in the annotation data and summarize the results in the format of Table 5.3.

Figure 5.10 Conditional probability of functional similarity given an individual p-value (on log scale) for a single dataset (from five datasets) and given the meta p-value for multiple datasets from yeast.

By changing the number of predictions selected for each query gene based on the likelihood scores for a fixed coexpression-linkage network, we can obtain a sensitivity-specificity plot, where

$$\text{Sensitivity} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} (TP_i + FN_i)}, \qquad \text{Specificity} = \frac{\sum_{i=1}^{K} TN_i}{\sum_{i=1}^{K} (TN_i + FP_i)},$$

where $K$ is the number of query genes, $TP_i$ is the number of correctly predicted functions for gene $i$, $FN_i$ is the number of known functions that are not predicted for gene $i$, $FP_i$ is the number of incorrectly assigned functions for gene $i$, and $TN_i$ is the number of functions among all available GO IDs that are neither known nor predicted for gene $i$.
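The per-gene counts defined above can be tallied and pooled as in the following minimal sketch. It assumes that sensitivity and specificity are computed from counts summed over all K query genes (the standard pooled definitions). The GO labels, gene names, and function names are hypothetical placeholders.

```python
def gene_counts(known, predicted, all_go_ids):
    """Return (TP, FN, FP, TN) for one query gene from sets of GO IDs."""
    tp = len(known & predicted)               # known and predicted
    fn = len(known - predicted)               # known but not predicted
    fp = len(predicted - known)               # predicted but not known
    tn = len(all_go_ids - known - predicted)  # neither known nor predicted
    return tp, fn, fp, tn

def pooled_sensitivity_specificity(counts):
    """Pool per-gene (TP, FN, FP, TN) tuples into sensitivity and specificity."""
    tp = sum(c[0] for c in counts)
    fn = sum(c[1] for c in counts)
    fp = sum(c[2] for c in counts)
    tn = sum(c[3] for c in counts)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical annotation universe and two query genes (illustration only).
all_go_ids = {"go1", "go2", "go3", "go4", "go5", "go6"}
genes = {
    "geneX": ({"go1", "go2"}, {"go1", "go3"}),  # (known, predicted)
    "geneY": ({"go4"}, {"go4"}),
}

counts = [gene_counts(known, pred, all_go_ids) for known, pred in genes.values()]
sensitivity, specificity = pooled_sensitivity_specificity(counts)
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```

Repeating this calculation while varying how many top-scoring predictions are kept per gene yields one point of the sensitivity-specificity curve per cutoff.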

We applied our method to the yeast data. Figure 5.11 shows that the meta-analysis using all 7 datasets significantly improved the prediction accuracy over any single dataset (4 were chosen as examples). The result suggests that the proposed method of combining multiple microarray datasets using meta-analysis works well.

5.6.4 Case Study: Sin1 and PCBP2 Interactions

When SIN1 (MAPKAP1) was used as the bait in a two-hybrid screen of a human bone marrow cDNA library, its most frequent partner was poly(rC) binding protein 2 (PCBP2/hnRNP-E2). PCBP2 associates with the N-terminal domain of SIN1 and the cytoplasmic domain of the IFN receptor IFNAR2. SIN1, but not PCBP2, also associates with the receptors that bind TNF. PCBP2 is known to bind to pyrimidine-rich repeats within the 3′ UTR of mRNAs and has been implicated in the control of RNA stability and translation and selective cap-independent transcription. RNAi silencing of either SIN1 or PCBP2 renders cells sensitive to basal and stress-induced apoptosis. Stress in the form of TNF and H2O2 treatments rapidly raises the cell content of SIN1 and PCBP2, an effect reversible by inhibiting MAPK14.

Table 5.3 Decision Table for Function Prediction

 | Prediction: GO ID Not Assigned | Prediction: GO ID Assigned
Known: GO ID not assigned | True negative (TN) | False positive (FP)
Known: GO ID assigned | False negative (FN) | True positive (TP)

Human microarray data in SOFT (Simple Omnibus Format in Text) format from the NCBI Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) were analyzed to determine the datasets in which SIN1 and PCBP2 showed a significant (up or down) change in expression level. Then, the meta-analysis [39] was performed on these datasets to determine which genes were coexpressed with SIN1.

The analysis created a statistical neighboring linkage network based on the functional similarity score and its significance level [17]. Close neighbors (i.e., genes that are coexpressed with SIN1 over time or in response to treatments) were assumed to have functions related to those of SIN1. Here, the meta-analysis was confined to a single microarray platform, GPL96 (i.e., the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A), and used 13 curated microarray datasets, each of which had between 50 and 154 arrays. The data were preprocessed and analyzed to provide 2 separate neighbor lists, one for SIN1 and one for PCBP2. The genes common to both lists with a significance level of P < 0.01 were then identified and ranked based on their associated confidence scores. The annotations of these identified genes are shown in Figure 5.12.
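The intersection step can be sketched as follows. The text does not say how the confidence scores from the two lists are combined for ranking, so this example assumes ranking by the weaker of the two scores. The gene labels, p-values, scores, and the function name `common_neighbors` are hypothetical.

```python
def common_neighbors(list_a, list_b, alpha=0.01):
    """Genes significant (p < alpha) in both neighbor lists, ranked by confidence.

    Each list maps gene -> (p_value, confidence_score).
    """
    sig_a = {g: s for g, (p, s) in list_a.items() if p < alpha}
    sig_b = {g: s for g, (p, s) in list_b.items() if p < alpha}
    shared = sig_a.keys() & sig_b.keys()
    # Assumption: rank by the weaker of the two confidence scores,
    # highest first; ties are broken alphabetically for determinism.
    return sorted(shared, key=lambda g: (-min(sig_a[g], sig_b[g]), g))

# Hypothetical neighbor lists for the two query genes (illustration only).
sin1_neighbors = {
    "GENE_A": (0.001, 0.90),
    "GENE_B": (0.020, 0.80),   # not significant at P < 0.01
    "GENE_C": (0.005, 0.60),
    "GENE_D": (0.0001, 0.95),  # absent from the other list
}
pcbp2_neighbors = {
    "GENE_A": (0.004, 0.70),
    "GENE_B": (0.001, 0.90),
    "GENE_C": (0.009, 0.75),
    "GENE_E": (0.002, 0.80),   # absent from the other list
}

shared_ranked = common_neighbors(sin1_neighbors, pcbp2_neighbors)
print(shared_ranked)
```

Only genes that pass the threshold in both lists survive, so a gene that is a strong neighbor of one query but marginal for the other is dropped.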

The meta-analysis of human microarray data supports the hypothesis that SIN1 plays a central, directive role in controlling apoptosis [40]. With few exceptions, genes and pathways regulated in concert with SIN1 are involved in reacting

Figure 5.11 Performance comparison between single datasets and meta-analysis in yeast. In each plot, various cutoff values for the likelihood scores of the predicted functions for the query genes are used to generate different points in the sensitivity-specificity curve. In particular, the 7 points correspond to using the top 50, 100, 200, 400, 800, 1,600, and 3,200 predictions for each query gene.
