Semi-Automatic Annotation based on Local Structural Analysis (SAALSA)

Part 1 - Mapping structures to sequence

2.2 Methods and implementation

2.2.1 Semi-Automatic Annotation based on Local Structural Analysis (SAALSA)

In the following paragraphs, we first present the methods developed to prepare the data used by the SAALSA application: the validation of disulfide bonds and the pre-computation of the local environments of PDB ligands.

Secondly, we describe technical details of the design and implementation of the application. This description is followed by a summary of how the database is used to store SAALSA outputs and manual actions throughout the annotation procedure. Thirdly, we present the methods applied on the fly during a SAALSA run: the definition of non-redundant binding sites and the formatting of final annotations.

2.2.1.1 Validation of disulfide bonds

In this paragraph, we describe the method we used to automatically detect and validate disulfide bond information from the atomic coordinate section of PDB files.

Although authors and curators of structural data annotate intra- and inter-chain disulfide bonds in the SSBOND lines of PDB files, this annotation does not necessarily describe all disulfide bonds, and some of the annotated ones can be erroneous. A typical disulfide bond can be described geometrically by the following parameters and associated canonical values:

- the distance between the sulfur atoms of the two cysteines, noted Sγ1-Sγ2, of 2.05 Å;

- the angles Cβ1-Sγ1-Sγ2 and Sγ1-Sγ2-Cβ2.

Figure 15 - Typical disulfide bond with canonical angles and inter-sulfur distance.

In Annex 5, we describe the analysis performed to set the thresholds for the inter-sulfur distance and for the Cβ1-Sγ1-Sγ2 and Sγ1-Sγ2-Cβ2 angles that guarantee the existence of a disulfide bond in the structure. The following thresholds were defined:

1.9 Å < Sγ1-Sγ2 < 2.2 Å

95° < Cβ1-Sγ1-Sγ2 < 115°

95° < Sγ1-Sγ2-Cβ2 < 115°

A combinatorial search for disulfide bonds was then performed on all PDB files. Every pair of cysteines in each PDB structure was tested against the thresholds of the geometrical parameters defined above to determine whether it forms a disulfide bridge.
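As an illustration, the geometric test and the pairwise search can be sketched as follows in Python. This is a minimal sketch and not the original implementation: the function names and the data layout (one dictionary per cysteine holding the coordinates of its CB and SG atoms) are hypothetical, and only the thresholds are taken from the text.

```python
import math
from itertools import combinations

# Thresholds quoted in the text above.
SS_MIN, SS_MAX = 1.9, 2.2            # Angstrom, distance between the two sulfur atoms
ANGLE_MIN, ANGLE_MAX = 95.0, 115.0   # degrees, C(beta)-S-S angles


def angle_deg(a, b, c):
    """Angle at vertex b (in degrees) formed by the points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to protect acos against rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def is_disulfide(cys1, cys2):
    """cys1, cys2: dicts with 'CB' and 'SG' coordinates (hypothetical layout)."""
    distance = math.dist(cys1["SG"], cys2["SG"])
    if not SS_MIN < distance < SS_MAX:
        return False
    angle1 = angle_deg(cys1["CB"], cys1["SG"], cys2["SG"])
    angle2 = angle_deg(cys1["SG"], cys2["SG"], cys2["CB"])
    return ANGLE_MIN < angle1 < ANGLE_MAX and ANGLE_MIN < angle2 < ANGLE_MAX


def find_disulfides(cysteines):
    """cysteines: dict mapping a residue identifier to its atom dict.

    Every pair of cysteines is tested (combinatorial search).
    """
    return [(id1, id2)
            for (id1, c1), (id2, c2) in combinations(cysteines.items(), 2)
            if is_disulfide(c1, c2)]
```

The distance is tested first because it is the cheapest criterion and rejects the vast majority of cysteine pairs before any angle is computed.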

2.2.1.2 Pre-computation of the local environment of PDB ligands

A script (get_sub3d.pl) was developed to compute molecular environments in PDB structures. For a PDB ligand, the program requires three parameters: the PDB filename, the 3-letter PDB ligand code and the threshold for the maximal distance between atoms of the PDB ligand and atoms of the protein of interest. The interatomic distance threshold was empirically set to 3.4 Å, in accordance with current annotation practices. This value is high enough to collect most of the residues that interact non-covalently with the ligand, i.e. through metal coordination bonds, hydrogen bonds or hydrophobic contacts (except some cases of pi-stacking), while excluding residues that do not interact with the ligand.

The molecular environment of each ligand of each PDB structure was calculated as follows:

First, atomic coordinates are extracted from the structure. Second, the structure is hashed into cubes with 2 Å sides, i.e. each atom is associated with a given indexed cube. Third, molecules labeled with the PDB ligand 3-letter code are identified. Fourth, structural environments are computed for each ligand: the corresponding cubes are selected (taking into account each of the ligand's atoms) and distance tests are performed between the ligand's atoms and the atoms in the surrounding cubes that do not belong to the ligand (Figure 16). Atom pairs satisfying the distance threshold (3.4 Å in this case), as well as the list of corresponding residues, are reported in detail in the program output.

Computed data are stored in the sub3d.hetatm table, with one row per distinct environment.
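The hashing and neighbour search described above can be sketched as follows in Python. This is only an illustrative sketch, not the get_sub3d.pl code: the atom tuples, the function names and the way the number of scanned cubes is derived from the threshold are assumptions made for the example.

```python
import math
from collections import defaultdict


def build_grid(atoms, cell=2.0):
    """Associate each atom with the index (ix, iy, iz) of the cube containing it.

    atoms: list of (residue_id, atom_name, (x, y, z)) tuples (hypothetical layout).
    """
    grid = defaultdict(list)
    for atom in atoms:
        key = tuple(math.floor(coord / cell) for coord in atom[2])
        grid[key].append(atom)
    return grid


def ligand_environment(protein_atoms, ligand_atoms, threshold=3.4, cell=2.0):
    """Return (ligand_atom_name, (residue_id, atom_name), distance) contacts."""
    grid = build_grid(protein_atoms, cell)
    # Assumption: scan enough cubes on each side to cover the whole threshold.
    reach = math.ceil(threshold / cell)
    offsets = range(-reach, reach + 1)
    contacts = []
    for lig in ligand_atoms:
        cx, cy, cz = (math.floor(coord / cell) for coord in lig[2])
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    # Only atoms hashed into the surrounding cubes are tested.
                    for prot in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        distance = math.dist(lig[2], prot[2])
                        if distance <= threshold:
                            contacts.append((lig[1], prot[:2], round(distance, 2)))
    return contacts
```

Because only the protein atoms are hashed, ligand atoms are excluded from the distance tests by construction. The list of residues in the environment is then simply the set of residue identifiers found in the returned contacts.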

Figure 16 - 2D representation of the selection of atoms (blue dots) close to a given reference atom (red dot) in a hashed PDB structure (grid). According to the index of this reference atom (i.e. an atom of the ligand of interest) and its position in space, the search space is restricted to a fraction of the structure (highlighted in red), corresponding to adjacent indexed squares in the different dimensions. Distance tests are then performed between the reference atom and each of the atoms present in this fraction of the overall atom set in order to retrieve the atoms closer than the indicated threshold (represented by the black circle).

2.2.1.3 Design of the tool

SAALSA was designed following a standard 3-tier architecture. The back-end corresponds to the four core schemata of the SSMap database (cf. 1.2.2), plus the annot, auto_annot (cf. 2.2.1.4) and sub3d (cf. 2.2.1.2) schemata. The application logic is contained in a collection of Perl functions distributed across several modules that handle data retrieval from the database, data integration and, finally, data formatting. The front-end is composed of Perl CGI scripts that generate HTML and JavaScript code. The web interface consists of several pages organized around a main result page. Specific pages are dedicated to the annotation of ligand binding sites and to the curation of the SSMap alignments (Figure 17). Jmol [Willighagen E, 2007] was used to render the 3D environment of PDB ligands.

Figure 17 - Global navigation map in the SAALSA web interface. Different tools can be called from the main page (in blue): an alignment editor to modify the residue level of SSMap mappings (in green) and a set of views and tools to produce ligand binding sites (in violet). Final annotation can be generated from the main page.

When a SAALSA search is initiated, a series of queries is performed on the database to generate the main result page. Firstly, SAALSA collects information related to the UniProtKB entry: the isoform identifiers and related sequences as well as taxonomy information. Secondly, it queries the SSMap mapping and obtains the list of automatically mapped and potentially mappable PDB chains (associated with alignments stored in the ssmap.ali table), together with the related alignment information. Potentially mappable PDB chains are defined as sharing at least 90% sequence identity with the UniProtKB sequence. Residue-level mappings are obtained through the get_ali_equivalence2 subroutine (cf. 1.2.4). Thirdly, basic 3D-structural information and annotations are retrieved for each of the mapped (or mappable) PDB chains from the pdb schema of the SSMap database: the PDB reconstructed sequence, the list of amino-acid positions in structures, the resolution of structures, validated disulfide bonds (cf. 2.2.1.1) and precomputed ligand environments (cf. 2.2.1.2), as well as various PTMs annotated in the PDB flat file (i.e. MODRES, SSBOND, CROSSLNK records). Finally, existing manual annotations are imported from various tables located in the annot schema (cf. next paragraph).
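To make the 90% identity criterion used for potentially mappable PDB chains concrete, a possible filter is sketched below in Python. The text does not state how identity is computed in SSMap, so the denominator used here (alignment columns in which both sequences have a residue) and the function names are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the columns without gaps.

    Assumption: columns containing a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def is_potentially_mappable(aligned_uniprot, aligned_pdb, min_identity=90.0):
    """True if the PDB chain shares at least min_identity % with the sequence."""
    return percent_identity(aligned_uniprot, aligned_pdb) >= min_identity
```

Other conventions (for example dividing by the length of the shorter sequence) would give different values for gapped alignments, which is why the choice is flagged as an assumption.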

2.2.1.4 Recorded data

During a SAALSA run, every manual action performed by curators is stored in several tables contained in the annot schema of the SSMap database. A manual action can be a validation, a modification or a rejection of automatic annotations.

Annotations themselves are not recorded directly in the database. Instead, the database stores the basic information needed to construct the final annotations by automatically applying the UniProtKB/Swiss-Prot annotation rules. In the following paragraphs, we summarize the data that are actually stored in the database, first for mappings and then for protein features.

For mappings between UniProtKB entries and PDB chains (entry-level mappings), UniProtKB accession code, splice variant number, PDB code, chain name, boundaries on the UniProtKB sequence and the PDB sequence are the essential data to produce cross-references to PDB. These data are stored in the annot.mapping table. At the residue level, mapping changes can be translated into new values for data originally contained in the ssmap.ali table (cf. 1.2.2):

-the sum_ali field, which summarizes the sequence of gapped and ungapped

Position-associated protein features can be classified into two categories. The first one corresponds to features that can be unambiguously attributed to specific residues and that are often already annotated in the PDB file itself. This category comprises sequence variations and chemical modifications of residues.

Sequence variations are derived directly from the alignment between the reconstructed PDB sequence and the UniProtKB sequence. Modifications of residues are retrieved from PDB file HETATM fields and MODRES fields. For disulfide bonds and cross-links between residues, information is retrieved from DISULFID and CROSSLINK fields respectively. They are extracted from the PDB files and stored in the pdb schema during the production / update pipeline of SSMap (cf. 1.2.1.1). Both sequence variations and residue modifications, once validated by UniProtKB curators for a given UniProtKB entry, are stored in the annot.ft_pos table.

The second category of position-specific features can be defined as molecule environments. These features cannot be extracted directly from the PDB file and have to be defined computationally with the method described in paragraph 2.2.1.2. This category includes ligand binding sites, protein-protein interfaces and, more generally, macromolecule-protein interfaces. As the format of UniProtKB entries does not allow a detailed position-specific description of interfaces between proteins and interacting macromolecules, we concentrated our efforts on the definition of binding sites.

Several pieces of information are needed to generate a binding site annotation: the name of the ligand; the type of interaction between the ligand and each of the residues involved; and a number that identifies this site among other potential sites for the same ligand and the same protein (i.e. several non-redundant environments for the same ligand; cf. next paragraph). These data are stored in the annot.ft_env_pos table. The association between the name of the ligand in PDB and in UniProtKB is done semi-automatically. Indeed, for the hundred most frequently encountered ligand names in PDB, we manually attributed a name and indicated when the ligand is likely to be non-physiological (this information is stored in the auto_annot schema). As an example, phosphate, which is often found in crystallographic structures, is highly concentrated in the mother liquor⁴ and is therefore artificially bound to macromolecules.

2.2.1.5 Definition of non-redundant environments of PDB ligands

A single protein chain can bind several ligands of the same or different nature. Binding sites are necessarily non-redundant when a single protein chain is considered. However, several factors cause redundancy in the ligand binding site information of PDB structures. Firstly, a given 3D structure can contain several copies of the same protein chain (these copies can either interact to form the biological unit or simply result from artificial constraints in the crystal), and each copy can be crystallized with its ligands. Secondly, several 3D structures may be available for a given protein. In summary, several individual ligand binding sites, in the same structure or in different structures, can in fact describe the same non-redundant binding site, which is the one of interest for annotation. In such a non-redundant binding site, the ligand binds the same residues of the protein.

Moreover, in different PDB structures, similar (but different) PDB ligands can interact with the same protein residues (i.e. same binding site). In UniProtKB, only physiological ligands are indicated in the annotation. However, in order to produce this annotation, curators often need to integrate knowledge available for structural analogues of the physiological compound.

In brief, UniProtKB ligand binding sites are defined as non-redundant environments, each of them being built from individual PDB ligand environments for a compound and structural analogues of interest.

Thus, during the process of annotation, we have to compute non-redundant environments for each UniProtKB ligand from the individual PDB ligand environments in PDB structures (Figure 18). To solve this problem, we applied a simple clustering algorithm based on residue positions in the UniProtKB sequence. Individual environments are compared pairwise. The reference environment and the order of the list of compared environments are chosen arbitrarily. Individual environments sharing more than 80% of their residues are clustered into the same non-redundant environment.

⁴ Solution in which protein crystals grow.

The definition of non-redundant environments of PDB ligands is performed on the fly when the ligand environment editor is accessed in the SAALSA web interface (Figure 17).
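A clustering of this kind can be sketched as follows in Python. This is a hedged sketch rather than the SAALSA code: environments are represented as sets of UniProtKB residue positions, the first member of each cluster is taken as its reference, and the shared fraction is computed relative to the smaller of the two environments. The last two choices are assumptions, since the text only specifies the 80% cutoff and the arbitrary choice of reference and order.

```python
def overlap(env_a, env_b):
    """Fraction of shared positions, relative to the smaller environment.

    The choice of denominator is an assumption made for this sketch.
    """
    if not env_a or not env_b:
        return 0.0
    return len(env_a & env_b) / min(len(env_a), len(env_b))


def cluster_environments(environments, cutoff=0.8):
    """Greedy clustering of individual ligand environments.

    environments: iterable of sets of UniProtKB residue positions.
    Each environment joins the first cluster whose reference (first member)
    it overlaps by more than `cutoff`; otherwise it starts a new cluster.
    """
    clusters = []
    for env in environments:
        for cluster in clusters:
            if overlap(cluster[0], env) > cutoff:
                cluster.append(env)
                break
        else:
            # No existing cluster matched: env becomes a new reference.
            clusters.append([env])
    return clusters
```

With such a greedy scheme the result can depend on the input order, which is consistent with the arbitrary choice of reference environment mentioned above.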

Figure 18 - Relation between individual binding sites, non-redundant binding sites and annotations. The color scheme is the same as the one used in Figure 17.

2.2.1.6 Automatic filters and checks to produce the final annotation

To produce the annotation, rules (cf. Annex 2 and Annex 3) as well as rule exceptions (cf. Annex 4) have been implemented and applied to the validated structural features. Some of the rules are coded and applied during the validation of disulfide bonds (cf. 2.2.1.1) and the pre-computation of ligand environments (cf. 2.2.1.2). Other rules are applied on the fly, during the annotation process with SAALSA:

- When several binding sites exist for the same ligand, the program automatically numbers the ligands according to the order of the first amino acid encountered for each binding site along the UniProtKB sequence. The order of lines is also determined automatically according to the position of residues in the UniProtKB sequence.

- For covalently bound sugars, a CARBOHYD key replaces the default BINDING key. In these cases, the UniProtKB ligand name depends on the type of linkage (N-, C-, S- or O-linked). N-glycosylation by N-acetylglucosamine molecules (the most frequent case) is automatically detected and triggers the appropriate format changes in the resulting annotation. Two criteria are checked to ensure the existence of a covalently bound N-acetylglucosamine. First, exactly one interatomic distance below 2.5 Å must exist between the ligand and an asparagine at position n in the sequence. Second, the residue at position n+2 must be a serine or a threonine. In the other (quite rare) cases, curators can manually change the ligand name, which in turn changes the key.
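The two on-the-fly rules above (numbering of binding sites and detection of N-linked N-acetylglucosamine) can be illustrated with the following Python sketch. It is one possible reading of the rules and not the SAALSA implementation: the function names and input structures are hypothetical, positions are assumed to be 1-based, and "only one interatomic distance" is interpreted as exactly one ligand-protein atom pair below 2.5 Å.

```python
def number_binding_sites(sites):
    """Number the binding sites of one ligand along the UniProtKB sequence.

    sites: list of non-empty sets of UniProtKB positions (one set per site).
    Returns {site_number: sorted positions}; sites are numbered from 1
    according to their first (lowest) residue position.
    """
    ordered = sorted((sorted(site) for site in sites), key=lambda pos: pos[0])
    return {number: pos for number, pos in enumerate(ordered, start=1)}


def n_linked_glcnac_position(sequence, contacts, max_dist=2.5):
    """Return the position of the glycosylated asparagine, or None.

    sequence: UniProtKB sequence as a string of one-letter codes.
    contacts: list of (position, distance) pairs, one per ligand-protein
              atom pair (hypothetical layout, 1-based positions).
    """
    close = [pos for pos, dist in contacts if dist < max_dist]
    if len(close) != 1:            # exactly one short distance is required
        return None
    n = close[0]
    if n < 1 or n + 2 > len(sequence):
        return None
    if sequence[n - 1] != "N":     # residue n must be an asparagine
        return None
    if sequence[n + 1] not in ("S", "T"):  # residue n+2 must be Ser or Thr
        return None
    return n
```

When the second function returns a position, the annotation would use the CARBOHYD key for that residue. Every other case is left to the curator, as described above.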
