
1.4 Glycomics

1.4.1 Glycan Diversity

Glycans, also frequently referred to as carbohydrates or saccharides, are large linear or branched compounds composed of multiple monosaccharides linked by glycosidic bonds. Since monosaccharides cannot be hydrolyzed into simpler forms [93], they are the monomer units from which oligosaccharides (fewer than a dozen units) and polysaccharides (more than a dozen units) are assembled [94]. They are therefore the equivalent of nucleotides for nucleic acids or amino acids for proteins. Monosaccharides can be in linear or cyclic form and generally have a backbone of three to nine carbon atoms. Hexoses, such as glucose or galactose, contain six carbon atoms and are the most common monosaccharides [95]. The hydroxyl and amino groups of monosaccharides can be further modified by enzymes, which increases structural diversity [96]. Two distinct stereoisomers (D and L) of the same monosaccharide can exist, depending on the three-dimensional orientation of the atoms, but the D configuration is more frequent [96].

This stereoisomerism can modulate their biological properties. Different sites are available on monosaccharides to form glycosidic bonds, and the linkage itself can take two distinct orientations (α or β) depending on the relative stereochemistry of the C1 carbon compared with the plane of the ring [97]. An α-glycosidic bond is formed when the two linking carbons have the same anomeric configuration, and a β-glycosidic bond when they have different anomeric configurations. The orientation of glycosidic bonds directly affects the structure of glycans.

Figure 1.8: N-linked and O-linked glycans in the SNFG format.

Glycans can be either covalently bonded to proteins (glycoproteins) or lipids (glycolipids) to form glycoconjugates [93], or free when released in body fluids by those glycoconjugates [98]. In the case of glycoproteins, there are two main categories of glycans based on the type of linkage: N-linked and O-linked glycans. In N-linked glycans, an oligosaccharide is attached to the amide nitrogen of the side chain of an asparagine (Asn) residue [99]. N-glycosylation sites usually follow an Asn-X-Ser/Thr pattern, where X is any amino acid except proline [99]. In eukaryotes, N-glycans all share a common core sequence defined as Manα1-3(Manα1-6)Manβ1-4GlcNAcβ1-4GlcNAcβ1 [99]. They can be further subdivided into three categories based on their structure (Figure 1.8): oligomannose (i), when the core is solely extended by mannose (Man) residues; complex (ii), when the core is extended by antennae starting with an N-acetylglucosamine (GlcNAc) residue; and hybrid (iii), when the Manα1-3 arm of the core is extended by one or two GlcNAc residues and the Manα1-6 arm is extended by Man residues [99]. In O-linked glycans, an oligosaccharide is attached to the oxygen atom of the side chain of, generally, a serine (Ser) or threonine (Thr) residue [100]. O-glycans all start with an N-acetylgalactosamine (GalNAc) monosaccharide that is mostly elongated into eight different core structures (Figure 1.8), which can in turn be further elongated and modified [101].
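The Asn-X-Ser/Thr pattern described above can be illustrated with a short regular-expression scan. The following is a minimal sketch only: the function name and the toy sequence are hypothetical, and a matching motif indicates a potential N-glycosylation site, not an occupied one.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead consumes only the Asn, so overlapping motifs are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return the 1-based positions of Asn residues inside an Asn-X-Ser/Thr motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence (not a real protein).
toy_sequence = "MNGTAANPSLNNSTK"
print(find_sequons(toy_sequence))  # [2, 11, 12]
```

The Asn at position 7 is not reported because it is followed by a proline, while positions 11 and 12 show why the lookahead is useful: the two motifs overlap.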


1.4.2 Mass Spectrometry

In comparison to the characterization of proteins, additional challenges are encountered in glycomics when aiming to characterize glycan structures using MS-based approaches. A given precursor mass can only provide reliable information about the composition of a glycan and cannot fully resolve the corresponding structure [102]. Because they have the same mass, isomeric structures and isomeric monosaccharides cannot be directly distinguished. Through MS/MS approaches, the fragmentation of glycans can provide supplementary information about their primary sequences. The additional use of orthogonal techniques, such as nuclear magnetic resonance (NMR), is however often required to reduce the number of possible glycan structures by determining glycan configurations and linkage positions [102]. In comparison to traditional CID or HCD fragmentation, electron activation fragmentation techniques, such as ECD or ETD, were shown to yield more detailed glycan structures [102]. In a typical protein-based glycomics MS experiment, the glycans are first released from their proteins, either enzymatically for N-glycans using peptide N-glycosidase F (PNGase F) or chemically for O-glycans using β-elimination [103]. In glycoproteomics approaches, this step is not performed, so that glycosylation sites can be identified within the amino acid sequences of intact glycoproteins and glycopeptides. Site localization is obtained at the cost of precise structural information, as only the glycan composition can be fully resolved [102]. As in glycomics, ECD and ETD fragmentation techniques now tend to be preferred in glycoproteomics experiments, as they provide more information overall about glycan structures. They also mainly fragment the peptide backbone instead of glycosidic bonds, preventing the complete loss of glycans and their corresponding site information [103].
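The point made at the start of this section, that a precursor mass constrains the composition but not the structure, can be illustrated with a small calculation. The sketch below is not a description of any specific tool: the function names are hypothetical, the masses are commonly tabulated monoisotopic residue masses rounded to five decimals, and the glycan is assumed to be free and underivatized.

```python
# Monoisotopic residue masses in Da (monosaccharide minus one water).
RESIDUE_MASS = {
    "Hex": 162.05282,     # glucose, galactose, mannose, ...
    "HexNAc": 203.07937,  # GlcNAc, GalNAc
    "dHex": 146.05791,    # fucose
    "NeuAc": 291.09542,   # N-acetylneuraminic acid
}
WATER = 18.01056

# Isomeric monosaccharides collapse into the same mass class.
CLASS_OF = {
    "Glc": "Hex", "Gal": "Hex", "Man": "Hex",
    "GlcNAc": "HexNAc", "GalNAc": "HexNAc",
    "Fuc": "dHex", "Neu5Ac": "NeuAc",
}


def composition_of(residues: list[str]) -> dict[str, int]:
    """Count residues per mass class, discarding identity and linkage."""
    counts: dict[str, int] = {}
    for residue in residues:
        mass_class = CLASS_OF[residue]
        counts[mass_class] = counts.get(mass_class, 0) + 1
    return counts


def glycan_mass(composition: dict[str, int]) -> float:
    """Monoisotopic mass of a free, underivatized glycan."""
    return sum(RESIDUE_MASS[name] * n for name, n in composition.items()) + WATER


# Two different residue lists yield the same composition, hence the same mass.
print(composition_of(["Gal", "GlcNAc"]) == composition_of(["Man", "GalNAc"]))  # True
print(f"{glycan_mass({'Hex': 5, 'HexNAc': 4, 'NeuAc': 2}):.3f}")  # 2222.783
```

Because the lookup discards both the identity of the isomer and every linkage, any number of distinct structures map to one mass, which is why fragmentation or orthogonal techniques are needed to go further.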

1.4.3 Data and Databases

Once fully resolved by glycomics approaches, glycan structures need to be encoded in standardized data formats to reliably store the corresponding sequence information. As no clear community standard was originally available, numerous glycan databases and initiatives ended up developing their own internal format. This led to the coexistence of a multitude of glycan structure formats (Table 1.3), which vary in the level of detail they encapsulate. They also differ in the way the sequence information is encoded. Some formats, like GLYDE [104], are XML-based, while others, like KCF [105], are formatted as indented tables. The majority of formats, such as LINUCS [106] or Linear Code [107], however consist of a one-line string. While those string formats tend to produce very condensed results, they can prove difficult to read, especially in the case of branched structures.

Another issue is that each format uses a distinct notation for the encoded monosaccharides. The MonosaccharideDB [108] database notably addresses this problem by listing the existing notations and providing translation tools. While tools converting one glycan format into another exist, they usually support only a restricted number of input and output formats. Recently, GlycoCT [109] was selected by several databases as a default format in an effort to improve information exchange and cross-referencing between glycoinformatics resources. This, however, did not prevent new formats, such as WURCS [110], from being developed.

Figure 1.9: Graphical representation of glycans. Example of a glycan structure (GlyConnect: 2641; GlyTouCan: G01614ZM) represented in four distinct formats: SNFG (a); Oxford (b); text (c); and as its chemical structure depiction (d).

<residue_link from="2" to="1"><atom_link from="N1H" to="C2" to_replace="O2" bond_order="1"/></residue_link>
<residue_link from="3" to="1"><atom_link from="C1" to="O3" from_replace="O1" bond_order="1"/></residue_link>
<residue_link from="4" to="3"><atom_link from="C2" to="O3" from_replace="O2" bond_order="1"/></residue_link>
<residue_link from="5" to="4"><atom_link from="N1H" to="C5" to_replace="O5" bond_order="1"/></residue_link>
<residue_link from="6" to="1"><atom_link from="C2" to="O6" from_replace="O2" bond_order="1"/></residue_link>
<residue_link from="7" to="6"><atom_link from="N1H" to="C5" to_replace="O5" bond_order="1"/></residue_link>
</molecule></GlydeII>

Table 1.3: Common glycan sequence formats. Example of a glycan structure (GlyConnect: 2641; GlyTouCan: G01614ZM) represented in different data formats that can be found in glycomics.

With the advent of MS-based glycoproteomics approaches, there has been a considerable increase in the number of observed glycans that lack a fully resolved structure. For these cases, the structure formats detailed above cannot be used, since only the monosaccharide composition is fully known. Once more, no consensus was originally achieved, and distinct glycan composition formats were developed by the databases and tools working with such data. Glycan composition formats all describe the observed monosaccharides and their number of occurrences as key-value pairs (e.g., Hex:5 HexNAc:4 NeuAc:2). While these formats are inherently less complex than glycan structure formats, they share the same issue of having distinct notations for monosaccharides.
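The key-value notation mentioned above (Hex:5 HexNAc:4 NeuAc:2) is simple enough to parse in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the alias table is an assumption made for this example to show how differing monosaccharide notations could be normalized. It is not a community standard or the notation of any particular database.

```python
# Assumed, illustrative alias table mapping alternative labels to one notation.
ALIASES = {"H": "Hex", "N": "HexNAc", "F": "dHex", "S": "NeuAc", "Neu5Ac": "NeuAc"}


def parse_composition(text: str) -> dict[str, int]:
    """Parse whitespace-separated 'Name:count' pairs into a dictionary."""
    composition: dict[str, int] = {}
    for token in text.split():
        name, _, count = token.partition(":")
        if not name or not count.isdigit():
            raise ValueError(f"malformed token: {token!r}")
        name = ALIASES.get(name, name)  # normalize the notation
        composition[name] = composition.get(name, 0) + int(count)
    return composition


def format_composition(composition: dict[str, int]) -> str:
    """Write a composition back as 'Name:count' pairs, skipping zero counts."""
    return " ".join(f"{name}:{n}" for name, n in composition.items() if n > 0)


parsed = parse_composition("Hex:5 HexNAc:4 NeuAc:2")
print(parsed)  # {'Hex': 5, 'HexNAc': 4, 'NeuAc': 2}
print(parse_composition("H:5 N:4 S:2") == parsed)  # True
print(format_composition(parsed))  # Hex:5 HexNAc:4 NeuAc:2
```

Normalizing names at parse time keeps the rest of a pipeline independent of the notation used by the data source, which is the practical cost of the lack of consensus described above.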

The graphical representation of glycan structures also required the design of specialized formats (Figure 1.9). In addition to the classical chemical structure depiction commonly preferred by chemists, new formats emerged that use a specific encoding for the appearance of monosaccharides. While the most basic one uses text abbreviations, the Oxford [111] and SNFG [112] formats opted for symbols varying in shape and color to represent the various monosaccharide classes. In contrast to the glycan sequence formats, the SNFG format was rapidly adopted by a large number of databases and related resources and is notably recommended by the Essentials of Glycobiology [113].

| Database | Content | URL |
|---|---|---|
| CFG | Glycan array data | http://www.functionalglycomics.org |
| CSDB | Plant, fungal and bacterial glycan structures | http://csdb.glycoscience.ru |
| GlycoEpitope | Glycan epitopes | https://www.glycoepitope.jp |
| GlyConnect | Glycan structures, glycoproteins, glycosylation sites | https://glyconnect.expasy.org |
| GLYCOSCIENCES.de | 3D glycan structures, NMR data | http://www.glycosciences.de |
| GlyTouCan | Glycan structures | https://glytoucan.org |
| KEGG | Glycan structures, pathways | https://www.genome.jp/kegg/glycan/ |
| MonosaccharideDB | Monosaccharide data | http://www.monosaccharidedb.org |
| SugarBind | Pathogen-glycan binding data | https://sugarbind.expasy.org |
| UniCarb-DB | Glycan MS/MS data | https://unicarb-db.expasy.org |
| UniCarbKB | Glycan structures, glycoproteins, glycosylation sites | http://www.unicarbkb.org |

Table 1.4: Main glycan databases and repositories. Resources that were discontinued or that focus on topics that are too specific (e.g., enzymes) are not listed.

Databases in glycomics differ significantly in terms of goals and content (Table 1.4). While some focus solely on describing the structures and compositions of glycans, others aim to characterize their interactions with proteins or pathogens.

GlyConnect [114] is a database that aims to characterize the different molecular actors of protein glycosylation. It was developed at the SIB Proteome Informatics Group (PIG) and is openly accessible on the ExPASy server (https://glyconnect.expasy.org). The database specifically covers N- and O-linked glycans and the glycoproteins that carry them. The data stored in GlyConnect can be accessed through any of the nine available categories: structures (i); compositions (ii); proteins (iii); peptides (iv); sites (v); taxonomies (vi); tissues (vii); diseases (viii); and references (ix). This enables users to perform multi-criteria queries answering specific questions, such as finding all the glycan structures that have been reported in a given disease. GlyConnect is built upon the content of the GlycoSuiteDB database [115], following the transfer of the license to the SIB. The database consequently consists of a manually curated data core containing numerous glycan structures. With the rise of high-throughput glycoproteomics approaches, a large amount of data was subsequently added to GlyConnect. While a lot of information was gained about the location of glycosylation sites in protein sequences, the proportion of glycans for which only the monosaccharide composition is known drastically increased. As a result, there is a growing need for new software to help make sense of this data.
