• Aucun résultat trouvé

GrAPFI: Graph Based Inference for Automatic Protein Function Annotation

N/A
N/A
Protected

Academic year: 2021

Partager "GrAPFI: Graph Based Inference for Automatic Protein Function Annotation"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01876907

https://hal.inria.fr/hal-01876907v2

Submitted on 14 Nov 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

GrAPFI: Graph Based Inference for Automatic Protein Function Annotation

Bishnu Sarker, David Ritchie, Sabeur Aridhi

To cite this version:

Bishnu Sarker, David Ritchie, Sabeur Aridhi. GrAPFI: Graph Based Inference for Automatic Protein Function Annotation. ECCB 2018 - 17th European Conference on Computational Biology, Sep 2018, Athens, Greece. �hal-01876907v2�

(2)

GrAPFI: Graph Based Inference for

Automatic Protein Function Annotation

Bishnu Sarker, David W Ritchie and Sabeur Aridhi

University of Lorraine, CNRS, INRIA, LORIA, Vandoeuvre-les-Nancy, France

[email protected]

Introduction

1(a) 1(b)

The growing number of protein sequences in UniProtKB (Fig-1(a)) makes it increas- ingly expensive to annotate them manually.

Here, we present GrAPFI (Graph based Automatic Protein Function Inference), a tool for the automatic functional annotation of proteins with Enzyme Commission (EC) numbers. The EC System uses a four digit numbering with a hierarchical structure.

GrAPFI utilizes the domain composition of the proteins. Our general observation is that protein domains shares functional proper- ties. Fig-1(b). illustrates this for proteins from viruses dataset. The 6 different colors in this figure correspond to the 6 different top level EC classes.

Graph Construction

GrAPFI constructs the protein graph based on domain composition of each protein.

Each node of the graph represents a pro- tein, while a link between two nodes means that the proteins exhibit a given minimum level of domain similarity. Each node u is identified by a set of labels L(u) (one or more annotations to propagate), has a set of neighbours N (u) , and for every neighbour v it has an associated weight W

u,v

. If u and v have domain composition D1=(d1,d2,d3,d4) and D2=(d1,d3,d5), then, W

u,v

= |(d1, d2, d3, d4) ∩ (d1, d3, d5)|

|(d1, d2, d3, d4) ∪ (d1, d3, d5)|

= |(d1, d3)|

|(d1, d2, d3, d4, d5)|

= 2

5 = 0.4.

GrAPFI Annotation Workflow

Fig-2: GrAPFI follows the above work-flow to annotate an un-reviewed protein.

Example Function Annotation of 2rk2A Enzyme

3(a) 3(b)

Fig-3(a) shows a query protein (yellow node) connected with its five neighbors. Fig-3(b) shows the result of label propagation applied on the graph in Fig-3(a).

Results

Dataset # Nodes # Edges Average Degree # Domains Total EC

SwissProt Viruses 3208 478447 298.28 1031 150

SwissProt Archaea 10619 1168710 220.12 2499 727

SwissProt Eukaryota 55042 30753219 1117.45 6744 2832

SwissProt Bacteria 193429 409837148 4237.6 6480 2902

A comparative study is presented based on a benchmark of 318 proteins taken from COFACTOR[4]. After removing proteins that do not have ground truth EC annotation, we had 297 Enzymes to annotate.

Conclusion

GrAPFI is a novel graph based approach for automatic protein functional annotation. It utilizes a domain based graph representation of the UniProtKB/SwissProt pro- tein database.

To evaluate the performance, leave one out cross val- idation is used on the graph built on four species from UniProt/SwissProt namely Viruses, Archaea, Eu- karyota and Bacteria. Average precision,P ravg =

1 M

P

∀p∈P P rp, recall, Reavg = 1

M

P

∀p∈P Rep and F-Measure, FM easure = 2×P ravg×Reavg

P ravg+Reavg are used as performance metrics. Our performance compar- ison shows GrAPFI outperforms other methods e.g.

DEEPre[3], SVMProt[2]. The future work aims at using GrAPFI for protein Annotation with GO terms.

References

1. T. U. Consortium. Uniprot: a hub for protein infor- mation. Nucleic Acids Research, 2014.

2. Li, Ying Hong, et al. SVM-prot 2016: A web-server for machine learning prediction of protein functional fami- lies from sequence irrespective of similarity, 2016.

3. Li, Yu, et al. DEEPre: sequence-based enzyme EC number prediction by deep learning, 2017.

4.Chengxin Zhang, Peter L. Freddolino, and Yang Zhang. COFACTOR: improved protein function pre- diction by combining structure, sequence and protein- protein interaction information. Nucleic Acids Re- search,2017.

Acknowledgement Bishnu Sarker is funded by an INRIA Cordi-S Doctoral Grant.

Références

Documents relatifs

Etienne Papegnies, Vincent Labatut, Richard Dufour, Georges Linares. Graph-based Features for Automatic Online Abuse Detection. 5th International Conference on Statistical Language

Due to the availability of biomedical data and many biomedical databases, it has become now feasible to build biomedical knowledge graphs and perform reasoning over such

combines face colour approach with principle component analysis (PCA) [12, 13] to achieve a high reliability in face detection. MIT [11] also exploit colour cues for face recognition

In some modern Euler and Venn diagrams, we tend to associate regions with a high number of elements to a larger area of the drawing, and regions with low number of elements to a

To satisfy the goals of annotation management, the upper menu allows for further functions, namely: (1) a display function that presents the list of

GM-PLSA [11] deals with the data of different modalities in terms of their char- acteristics, which assumes that feature vectors in an image are governed by a Gaussian

Recall that our core intuition is that the spectrum-based fault localization algorithm misses causal information about the propagation of a fault in the program. Using our

It is after all these comings and goings between manual observation and automatic data processing that we developed a stable algorithm of automatic segmentation