• Aucun résultat trouvé

Chapter 4 Conclusion

4.1. Big data challenge

MS datasets range in size from a few thousand to tens of millions of spectra. The software engineering challenge this pose is writing software that can gracefully scale from the smallest to the largest datasets while remaining computationally efficient and easy to develop. To address this challenge, we developed our software to run on the Apache Spark data processing engine. We showed that our software can scale from running on a single computer to a cluster containing 256 CPUs.

4.2. Code Quality

Like any other software project, a challenge for bioinformatics research projects is maintaining good code quality. In bioinformatics code quality is important because it helps ensure that research which uses the code is valid and that software development time is optimized. The characteristics of good quality code which are important for bioinformatics are:

 Correctness

 Maintainability

 Ease of use

 Efficiency

82

We addressed the code quality challenge by applying test driven development, measuring code quality using SonarQube and using Jenkins for continuous integration. In the paper that describes MzJava, we compared the MzJava code quality to other bioinformatics software packages using SonarQube. The results showed that MzJava compares favourably to the other libraries. MzJava has the highest test coverage and the lowest package tangle index. Both MzMod and Glycoforest are based on MzJava which greatly accelerated their development.

4.3. Accurate assignment challenge

Conceptually assigning a molecule to a spectrum can be split into two steps. The first is to generate a candidate set of potential molecules. And the second is to score the candidates so that the best molecule can be selected from the candidate set. Both improvements to the candidate generation and the scoring can help improve the accuracy of the assignment.

Improvements to the candidate generation are particularly important in cases where the candidates are not all known a priory, for example in glycomics. Improvements to the scoring are important when the correct candidates are available but are not being assigned the highest score.

There are two ways that the scoring can be improved. The first is to develop a scoring function that reduces the total false discovery rate (FDR). The second is to improve the correlation between the score that is calculated and the quality of the spectrum - spectrum match (SSM).

This does not increase the total FDR, but it does increase the fraction of the results that can be accepted for a given FDR. The scoring function that we used as a starting point for MzMod was the normalized dot product (ndp) which had a total FDR of approximately 11%. We were able to reduce the total FDR to 5% by adding peptide sequence and modification positioning information to the scoring function. The addition of a delta score improved the correlation between the score and SSM quality making it possible to use a 1% FDR and still retain over half of the results.

For Glycoforest we developed a novel approach to candidate generation and scoring based on the open similarity that is used by the OMS. Our test showed that Glycoforest generated the correct candidate structure for 92% of query spectra in our test set. And that the scoring algorithm has a total FDR of 30%. This FRD shows that more work is required before

83

Glycoforest can be used to automatically assign glycans to spectra. However, the fragment delta results showed that Glycoforest is very good at inferring the general glycan structure. In total 90% of the results have a fragment delta that is smaller than three. This shows that only a small manual intervention is required to correct the structure, making Glycoforest a valuable tool to accelerate the annotation of glycan spectra.

The outcome of this thesis is three papers and their associated software packages: MzJava, MzMod and Glycoforest.

 MzJava provides the building blocks for developing MS/MS analysis software

 MzMod is a validated and improved OMS spectrum library search engine

 Glycoforest pioneered a new approach for automating glycomics analysis

84

Chapter 5 References

(1) Nørregaard Jensen, O. Curr. Opin. Chem. Biol. 2004, 8 (1), 33–41.

(2) Walsh, C. T.; Garneau-Tsodikova, S.; Gatto, G. J. Angew. Chemie Int. Ed. 2005, 44 (45), 7342–7372.

(3) Prabakaran, S.; Lippens, G.; Steen, H.; Gunawardena, J. Wiley Interdiscip. Rev. Syst.

Biol. Med. 2012, 4 (6), 565–583.

(4) Beltrao, P.; Bork, P.; Krogan, N. J.; van Noort, V. Mol. Syst. Biol. 2013, 9 (1), n/a – n/a.

(5) Aoki-Kinoshita, K.; Agravat, S.; Aoki, N. P.; Arpinar, S.; Cummings, R. D.; Fujita, A.;

Fujita, N.; Hart, G. M.; Haslam, S. M.; Kawasaki, T.; Matsubara, M.; Moreman, K. W.;

Okuda, S.; Pierce, M.; Ranzinger, R.; Shikanai, T.; Shinmachi, D.; Solovieva, E.; Suzuki, Y.; Tsuchiya, S.; Yamada, I.; York, W. S.; Zaia, J.; Narimatsu, H. Nucleic Acids Res.

2016, 44 (D1), D1237–D1242.

(6) Khoury, G. A.; Baliban, R. C.; Floudas, C. a. Sci. Rep. 2011, 1, 1–5.

(7) Han, L.; Costello, C. E. Biochem. Biokhimii͡a 2013, 78 (7), 710–720.

(8) Pinho, S. S.; Reis, C. A. Nat. Rev. Cancer 2015, 15 (9), 540–555.

(9) Connelly, M. A.; Gruppen, E. G.; Otvos, J. D.; Dullaart, R. P. F. Clin. Chim. Acta 2016, 459, 177–186.

(10) Maverakis, E.; Kim, K.; Shimoda, M.; Gershwin, M. E.; Patel, F.; Wilken, R.;

Raychaudhuri, S.; Ruhaak, L. R.; Lebrilla, C. B. J. Autoimmun. 2015, 57, 1–13.

(11) Silva, A. M. N.; Vitorino, R.; Domingues, M. R. M.; Spickett, C. M.; Domingues, P.

Free Radic. Biol. Med. 2013, 65, 925–941.

(12) Olsen, J. V.; Mann, M. Mol. Cell. Proteomics 2013, 12 (12), 3444–3452.

(13) Murray, K. Schematic of tandem mass spectrometry

https://en.wikipedia.org/wiki/Tandem_mass_spectrometry#/media/File:MS_MS.png.

(14) Nesvizhskii, A. I. J. Proteomics 2010, 73 (11), 2092–2123.

(15) Thaysen-Andersen, M.; Packer, N. H. Biochim. Biophys. Acta - Proteins Proteomics 2014, 1844 (9), 1437–1452.

(16) Lingdong Quan, M. L. Mod. Chem. Appl. 2013, 01 (01).

(17) Roepstorff, P.; Fohlman, J. Biol. Mass Spectrom. 1984, 11 (11), 601–601.

85

(18) Hernandez, P.; Müller, M.; Appel, R. D. Mass Spectrom. Rev. 2006, 25 (2), 235–254.

(19) Ahrné, E.; Müller, M.; Lisacek, F. Proteomics 2010, 10 (4), 671–686.

(20) Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M.; Kohlbacher, O.; Hermjakob, H.; Wang, R.; Vizcaíno, J. A. Nat. Methods 2016, 13 (8), 651–656.

(21) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. Nat. Biotechnol. 2015, 33 (7), 743–749.

(22) Karlsson, N. G.; Schulz, B. L.; Packer, N. H. J. Am. Soc. Mass Spectrom. 2004, 15 (5), 659–672.

(23) Jin, C.; Padra, J. T.; Sundell, K.; Sundh, H.; Karlsson, N. G.; Lindén, S. K. J. Proteome Res. 2015, 3239–3251.

(24) Jin, C.; Kenny, D. T.; Skoog, E. C.; Padra, M.; Adamczyk, B.; Vitizeva, V.; Thorell, A.;

Venkatakrishnan, V.; Lindén, S. K.; Karlsson, N. G. Mol. Cell. Proteomics 2017, 16 (5), 743–758.

(25) Hadoop - Apache Software Foundation project home page http://hadoop.apache.org/.

(26) Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; Stoica, I. In Proc. 2nd USENIX Conf. Hot Top. Cloud Comput; 2010.

(27) Taylor, R. C. BMC Bioinformatics 2010, 11 Suppl 1 (Suppl 12), S1.

(28) Kalyanaraman, A.; Cannon, W. R.; Latt, B.; Baxter, D. J. Bioinformatics 2011, 27 (21), 3072–3073.

(29) Pratt, B.; Howbert, J. J.; Tasman, N. I.; Nilsson, E. J. Bioinformatics 2012, 28 (1), 136–

137.

(30) Hung, C.-L.; Hua, G.-J. Biomed Res. Int. 2013, 2013, 1–7.

(31) Wiewiórka, M. S.; Messina, A.; Pacholewska, A.; Maffioletti, S.; Gawrysiak, P.;

Okoniewski, M. J. Bioinformatics 2014, 30 (18), 2652–2653.

(32) Freeman, J.; Vladimirov, N.; Kawashima, T.; Mu, Y.; Sofroniew, N. J.; Bennett, D. V;

Rosen, J.; Yang, C.-T.; Looger, L. L.; Ahrens, M. B. Nat. Methods 2014, 11 (9), 941–

950.

(33) Ghemawat, S.; Gobioff, H.; Leung, S.-T. In Proceedings of the nineteenth ACM symposium on Operating systems principles - SOSP ’03; ACM Press: New York, New York, USA, 2003; p 29.

86

(34) Dean, J.; Ghemawat, S. In Communications of the ACM - 50th anniversary issue: 1958 - 2008; 2008; pp 107–113.

(35) Dudley, J. T.; Butte, A. J. PLoS Comput. Biol. 2009, 5 (12), e1000589.

(36) Sandve, G. K.; Nekrutenko, A.; Taylor, J.; Hovig, E. PLoS Comput. Biol. 2013, 9 (10), e1003285.

(37) Wilson, G.; Aruliah, D. A.; Brown, C. T.; Chue Hong, N. P.; Davis, M.; Guy, R. T.;

Haddock, S. H. D.; Huff, K. D.; Mitchell, I. M.; Plumbley, M. D.; Waugh, B.; White, E.

P.; Wilson, P. PLoS Biol. 2014, 12 (1), e1001745.

(38) Leprevost, F. da V.; Barbosa, V. C.; Francisco, E. L.; Perez-Riverol, Y.; Carvalho, P. C.

Front. Genet. 2014, 5.

(39) Beck, K. Test-driven development; Addison-Wesley Professional, 2003.

(40) Fowler, M.; Beck, K.; Brant, J.; Opdyke, W.; Roberts, D. Refactoring: Improving the Design of Existing Code; Addison-Wesley Professional, 1999.

(41) Duvall, P. M.; Matyas, S.; Glover, A. Continuous Integration, Improving Software Quality and Reducing Risk; Addison-Wesley Professional, 2007.

(42) Biemann, K.; Cone, C.; Webster, B. R.; Arsenault, G. P. J. Am. Chem. Soc. 1966, 88 (23), 5598–5606.

(43) Maxwell, E.; Tan, Y.; Tan, Y.; Hu, H.; Benson, G.; Aizikov, K.; Conley, S.; Staples, G.

O.; Slysz, G. W.; Smith, R. D.; Zaia, J. PLoS One 2012, 7 (9), e45474.

(44) Li, F.; Glinskii, O. V; Glinsky, V. V. Proteomics 2013, 13 (2), 341–354.

(45) Pevzner, P. A.; Dan, V.; Tang, C. L. J. Comput. Biol. 2000, 7 (6), 777–787.

(46) Na, S.; Jeong, J.; Park, H.; Lee, K.-J.; Paek, E. Mol. Cell. Proteomics 2008, 7 (12), 2452–

2463.

(47) Na, S.; Bandeira, N.; Paek, E. Mol. Cell. Proteomics 2012, 11 (4), M111.010199.

(48) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. Mol. Cell. Proteomics 2006, 5 (5), 935–

948.

(49) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. J. Proteome Res. 2008, 7 (11), 4614–4622.

(50) Ahrné, E.; Nikitin, F.; Lisacek, F.; Müller, M. J. Proteome Res. 2011, 10 (7), 2913–2921.

(51) Ye, D.; Fu, Y.; Sun, R.-X.; Wang, H.-P.; Yuan, Z.-F.; Chi, H.; He, S.-M. Bioinformatics 2010, 26 (12), i399–i406.

87

(52) Ma, C. W. M.; Lam, H. J. Proteome Res. 2014, 13 (5), 2262–2271.

(53) Sun, W.; Lajoie, G. A.; Ma, B.; Zhang, K. 2015; pp 320–330.

(54) Jeong, K.; Kim, S.; Pevzner, P. A. Bioinformatics 2013, 29 (16), 1953–1962.

(55) Domokos, L.; Hennberg, D.; Weimann, B. Anal. Chim. Acta 1984, 165, 61–74.

(56) Yates, J. R.; Morgan, S. F.; Gatlin, C. L.; Griffin, P. R.; Eng, J. K. Anal. Chem. 1998, 70 (17), 3557–3565.

(57) Kameyama, A.; Kikuchi, N.; Nakaya, S.; Ito, H.; Sato, T.; Shikanai, T.; Takahashi, Y.;

Takahashi, K.; Narimatsu, H. Anal. Chem. 2005, 77 (15), 4719–4725.

(58) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976–989.

(59) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20 (18), 3551–3567.

(60) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20 (9), 1466–1467.

(61) Ranzinger, R.; Herget, S.; Wetter, T.; von der Lieth, C.-W. BMC Bioinformatics 2008, 9 (1), 384.

(62) Joshi, H. J.; Harrison, M. J.; Schulz, B. L.; Cooper, C. a.; Packer, N. H.; Karlsson, N. G.

Proteomics 2004, 4 (6), 1650–1664.

(63) Na, S.; Paek, E.; Introduction, I. 2014, No. Cid, 1–15.

(64) Kong, A. T.; Leprevost, F. V; Avtonomov, D. M.; Mellacheruvu, D.; Nesvizhskii, A. I.

Nat. Methods 2017, 14 (5), 513–520.

(65) Kim, M.-S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.;

Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.;

Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D. N.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.;

Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.;

Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A.

A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T.-C.; Zhong, J.; Wu, X.;

Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.;

Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.;

Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S.

88

K.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.;

Pandey, A. Nature 2014, 509 (7502), 575–581.

(66) Cheng, K.; Ning, Z.; Zhang, X.; Li, L.; Liao, B.; Mayne, J.; Stintzi, A.; Figeys, D.

Microbiome 2017, 5 (1), 157.

(67) Müller, M.; Gfeller, D.; Coukos, G.; Bassani-Sternberg, M. Front. Immunol. 2017, 8.

(68) Blanco-Míguez, A.; Gutiérrez-Jácome, A.; Fdez-Riverola, F.; Lourenço, A.; Sánchez, B.

Food Microbiol. 2016, 60, 137–141.

(69) Savitski, M. M.; Lemeer, S.; Boesche, M.; Lang, M.; Mathieson, T.; Bantscheff, M.;

Kuster, B. Mol. Cell. Proteomics 2011, 10 (2), M110.003830.

(70) Hayes, C. A.; Karlsson, N. G.; Struwe, W. B.; Lisacek, F.; Rudd, P. M.; Packer, N. H.;

Campbell, M. P. Bioinformatics 2011, 27 (9), 1343–1344.

(71) Campbell, M. P.; Nguyen-Khuong, T.; Hayes, C. A.; Flowers, S. A.; Alagesan, K.;

Kolarich, D.; Packer, N. H.; Karlsson, N. G. Biochim. Biophys. Acta 2014, 1844 (1 Pt A), 108–116.

89

Chapter 6 Appendix Co-Author Papers

6.1. Property Graph vs RDF Triple Store: A Comparison on Glycan

Substructure Search

Property Graph vs RDF Triple Store: A

Comparison on Glycan Substructure Search

Davide Alocci1,2, Julien Mariethoz1, Oliver Horlacher1,2, Jerven T. Bolleman3, Matthew P. Campbell4, Frederique Lisacek1,2*

1Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland, 2Computer Science Department, University of Geneva, Geneva, 1227, Switzerland,3Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland,4Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, Australia

*Frederique.lisacek@isb-sib.ch

Abstract

Resource description framework (RDF) and Property Graph databases are emerging tech-nologies that are used for storing graph-structured data. We compare these techtech-nologies through a molecular biology use case: glycan substructure search. Glycans are branched tree-like molecules composed of building blocks linked together by chemical bonds. The molecular structure of a glycan can be encoded into a direct acyclic graph where each node represents a building block and each edge serves as a chemical linkage between two build-ing blocks. In this context, Graph databases are possible software solutions for storbuild-ing gly-can structures and Graph query languages, such as SPARQL and Cypher, gly-can be used to perform a substructure search. Glycan substructure searching is an important feature for querying structure and experimental glycan databases and retrieving biologically meaning-ful data. This applies for example to identifying a region of the glycan recognised by a glycan binding protein (GBP). In this study, 19,404 glycan structures were selected from Glyco-meDB (www.glycome-db.org) and modelled for being stored into a RDF triple store and a Property Graph. We then performed two different sets of searches and compared the query response times and the results from both technologies to assess performance and accu-racy. The two implementations produced the same results, but interestingly we noted a dif-ference in the query response times. Qualitative measures such as portability were also used to define further criteria for choosing the technology adapted to solving glycan sub-structure search and other comparable issues.

Introduction

Nowadays the use of high throughput technologies and optimized pipelines allows life scien-tists to generate terabytes of data in a reduced amount of time and subsequently feed quickly and comprehensively online bioinformatics databases. In this scenario, the interoperability between data resources has become a fundamental challenge. Issues are gradually being solved in applications involving genome (DNA) or transcriptome (RNA) analyses but

PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 1 / 17

OPEN ACCESS

Citation:Alocci D, Mariethoz J, Horlacher O, Bolleman JT, Campbell MP, Lisacek F (2015) Property Graph vs RDF Triple Store: A Comparison on Glycan Substructure Search. PLoS ONE 10(12):

e0144578. doi:10.1371/journal.pone.0144578

Copyright:© 2015 Alocci et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement:All relevant data are within the paper and its Supporting Information files.

Funding:This work is supported by the Seventh Framework Programme - Initial Training Network N°

316929 (http://ec.europa.eu/research/fp7/index_en.

cfm). The grant recipient is FL. This activity at SIB is supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation SERI. DA is supported by EU-ITN (FP7-PEOPLE-2012-ITN # 316929). MPC acknowledges funding support from the Australian National eResearch Collaboration Tools and Resources project (NeCTAR).

problems remain for less documented molecules such as lipids, glycans (also referred to as

“carbohydrate”,“oligosaccharide”or“polysaccharide”to designate this type of molecule) or metabolites.

Glycosylation is the addition of glycan molecules to proteins and/or lipids. It is an important post-translational modification that enhances the functional diversity of proteins and influ-ences their biological activities and circulatory half-life. A glycan is a branched tree-like mole-cule that naturally lends itself to graph encoding. However, glycans have long been described in the IUPAC linear format [1], that is, as regular expressions delineating branching structures with different bracket types. Such encoding can generate directional/linkage/topology ambigu-ity and is not sufficient in the handling of incomplete or repeated units. More recently, several encoding formats for glycans have developed based on sets of nodes and edges, e.g., GlycoCT [2], Glyde-II [3,4], IUPAC condensed [5], KCAM/KCF [6,7] or more recently WURCS [8].

To date the GlycoCT format is acknowledged as the default format for data sharing between databases [9] and consequently the most commonly used format for storing structural data.

Glycans are composed of monosaccharides (8 common building blocks and dozens of less fre-quent ones as described in MonosaccharideDB (http://www.monosaccharidedb.org) that are cyclic molecules. These monosaccharides are linked together in different ways depending on carbon attachment positions in the cycle as detailed further.

In a graph representation of a glycan, each monosaccharide residue is a node possibly asso-ciated with a list of properties and each linkage is an edge also potentially assoasso-ciated with a list of properties. In fact, chemical bonds between building blocks, called glycosidic linkages, are transformed into edges in the acyclic graph structure. An example is shown inFig 1, where the simplified graphic representation popularised by the Consortium for Functional Glycomics (CFG) [10] (originally proposed by the authors of Essentials in Glycobiology [11]) is matched to a graph. This notation assigns each monosaccharide to a coloured shape (e.g., yellow circle for galactose, shortened as Gal). Shared colours or shapes express structural similarity among monosaccharides. For example, N-Acetylgalactosamine (yellow square) differs from galactose (yellow circle) through a so-called substituent (removal of an OH group and addition of an amino-acetyl group).Substituentas a property is precisely the type that qualifies a node.

Starting with the CarbBank project in 1987 [12], a range of glycoinformatics resources con-taining glycan-related information has been developed thereby creating a variety of reference databases [13–15]. In the last two years, two articles [16,17] have been published proposing mechanisms for connecting glycan-related (or glycomics) databases. The respective authors suggested moving towards new technologies designed for semantic web, which are adapted to aggregating information from different sources. The outcome for these proposals was a com-mon standard ontology called GlycoRDF [16] that is now being widely adopted by the commu-nity to develop glycomics resources that cooperate and share standard formats enabling federated queries [9]. Nonetheless, in an effort to confirm the relevance of RDF as the technol-ogy of choice for glycomics data representation and integration, comparison with other graph-based technologies such as Property Graph is necessary. In this paper we discuss methods for interconnecting different databases through addressing the question of glycan substructure (motif or pattern) searching. This report should be considered as a preliminary study of feasi-bility and a comparison between different software implementations. Our main goal is to compare the performance of these two technologies (RDF and Property Graph) for glycan substructure search. In this study, Neo4j was chosen as a representative option for Property Graphs and multiple RDF triple stores were tested mainly due to the shared graph query lan-guage. Specifically, we have used for the latter: Virtuoso Open-Source Edition [18], Sesame [19], Jena Fuseki [20] and Blazegraph [21].

Property Graph and RDF Triple Store Comparison

PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 2 / 17

Competing Interests:The authors have declared that no competing interests exist.

Searching (sub)structures is meaningful in glycomics. It matches the concept of glycan epi-tope or glycan determinant associated with the part of the whole structure that is recognized by glycan-binding proteins, which include lectins, receptors, toxins, adhesins, antibodies and enzymes [22]. Several substructure search solutions have been developed including for instance, regular expression matching [23] as applied to processing glycans in linear format (IUPAC) [5]. Although these approaches have showed reasonable robustness, they also have intrinsic limitations especially in handling structural ambiguities. Frequently, experimental data is insufficient to assign a precise monosaccharide (e.g., Galactose) to a position in the structure so that it is characterised only by its carbon content (e.g., Hexose). Regular expres-sions do not account for dependencies while a graph description handles inheritance of proper-ties (e.g., Hexose ->Galactose). Emerging technologies capable of storing graph models can be used, namely, Resource Description Framework (RDF) [24] and Property Graph databases [25]. Even though RDF and Property Graph databases can be used to reach the same goal, these two approaches are not synonymous and can be distinguished. RDF is designed for stor-ing statements in the form of subject–predicate–object called triples, whereas a Property Graph

Fig 1. Glycan CFG encoding and graph encoding.On the left hand side a glycan structure encoded with CFG nomenclature is presented, while the right hand side shows the same structure translated into a graph. Each monosaccharide or substituent becomes a node and each glycosidic bond becomes an edge in the graph. Avoiding any loss of information all the properties of each monosaccharide or substituent are converted in node properties whereas glycosidic bond properties are translated in edge properties. To be more clear the colour code associate with the monosaccharide type is preserved among the images.

doi:10.1371/journal.pone.0144578.g001

PLOS ONE | DOI:10.1371/journal.pone.0144578 December 14, 2015 3 / 17

is designed to implement different types of graphs such as hypergraphs, undirected graphs, weighted graphs, etc. Nonetheless, all possible types of graphs can be built with triples and stored in an RDF triple store [26]. Furthermore, Property Graphs are node-centric whereas RDF triple stores are edge-centric. For this reason, RDF triple stores use a list of edges, many of which are properties of a node and not critical to the graph structure itself. Moreover, Property Graph databases tend to be optimized for graph traversals where only one big graph is present.

With RDF triple stores, the cost of traversing an edge tends to be logarithmic [27]. Finally, the query language is another key point in the comparison: RDF triple stores support SPARQL [28] as a native query language whereas Property Graphs have mainly proprietary languages.

Even though Neo4J has a plugin for SPARQL, it relies essentially on its own proprietary lan-guage called Cypher [29]. In either case, a substructure search can be defined with the query language provided by these new technologies. To reach our benchmarking goal in this applica-tion, we first describe the main steps of building substructure search software using alterna-tively an RDF triple store or a Property Graph databases. We then provide a selection of quantitative and qualitative measures investigated during the study such as portability. Finally, we summarise the main achievements and draw some conclusions on the expected characteris-tics of a web application for glycan substructure search.

Material and Methods

The development of a software solution to perform glycan substructure searching involves two main tasks 1) the storage of glycan structures and 2) translation of a query into a specific query language for pulling out all the glycan structures that match the query pattern.

The glycan encoding and the data storage sections provide details regarding these tasks using native graph stores. The substructure search query shows how specific languages pro-vided by RDF and Property Graph databases can be used.

Data structure

As previously described most dedicated glycan databases store structures using string encoding standards. This multiplicity led to create a new layer between the data store and the glycan encoding formats to decouple the glycan structure information from any string-encoding format.

A comprehensive framework was developed within the EUROCarbDB project for parsing multiple glycan encoding formats [30]. Even though this framework is still used and includes parsers for each glycan encoding format, for pragmatic reasons we have relied on an in-house library, called MzJava [31]. It is an open source library that provides a data structure and spe-cific readers for multiple glycan encoding formats that are not limited to processing mass spec-trometry data. The MzJava data structure organizes the information from a glycan structure into a directed acyclic graph, which can be directly stored into a graph storage solution. Mono-saccharides, the basic units, which compose the glycan structures, are treated as graph nodes and all the biological properties of these building blocks are stored as separate node properties.

Substituents—particular building blocks, which can be combined with basic units—are

Substituents—particular building blocks, which can be combined with basic units—are