
Prediction of SAMPL-1 Hydration Free Energies Using a Continuum Electrostatics-Dispersion Model


Publisher's version:

The Journal of Physical Chemistry B, 113, 14, pp. 4511-4520, 2009-06-03


For the publisher's version, please access the DOI link below.

https://doi.org/10.1021/jp8061477


Sulea, Traian; Wanapun, Duangporn; Dennis, Sheldon; Purisima, Enrico O.


Prediction of SAMPL-1 Hydration Free Energies Using a Continuum Electrostatics-Dispersion Model

Traian Sulea, Duangporn Wanapun,‡ Sheldon Dennis, and Enrico O. Purisima*

Biotechnology Research Institute, National Research Council Canada, 6100 Royalmount Avenue, Montreal, Quebec H4P 2R2, Canada

Received: July 11, 2008; Revised Manuscript Received: October 16, 2008

The SAMPL-1 hydration free energy blind prediction challenge data set includes 63 compounds that are more chemically diverse, polyfunctional, and drug-like than those in previously tabulated data sets of neutral compounds, with transfer free energies and molecular weights larger than any seen there. For the prospective SAMPL-1 study, we employed a continuum model including a boundary element solution of the Poisson equation to describe electrostatic solvation, a molecular surface area-based cost of cavity formation in water, and a continuum Lennard-Jones potential to account for dispersion-repulsion solute-solvent effects. For the latter contribution, continuum van der Waals atom-type coefficients were calibrated and validated on previously available hydration data sets. In the prospective study, this continuum hydration model yielded SAMPL-1 predictions highly correlated with experimental data, albeit with a slope slightly above 0.5, suggesting a valid model but with a systematic error. Analysis of the major outliers, all overestimating the experimental hydration data, highlights a common structural theme as a possible cause of the prediction errors: densely polar and hydrogen-bond-capable structures, featuring primarily substituted (sulfon)amide groups, often in conjugated systems. By examining analog pairs within the SAMPL-1 data set, it was also noted that certain solvation trends are captured neither by chemical sense nor by our hydration model, which seems too additive. A retrospective analysis of model transferability between hydration data sets as a function of its parameters and complexity indicates that the electrostatic component of the model is fairly transferable across data sets, but the nonelectrostatic terms are less so. For the chemical space covered in SAMPL-1, absolute prediction errors indicate that the simpler transferable electrostatics-only model outperforms the more complex model including cavity and continuum dispersion terms. Possible directions to further improve this continuum hydration model are proposed.

Introduction

Changes in hydration free energy upon complex formation are a critical component of binding affinities in aqueous solution [1-3]. Hence, simulation methods that aim to predict binding free energies require accurate solvation models to achieve their goal. Over the years, much effort has been expended in developing and parametrizing solvation models at various levels of theory [4-10]. These invariably involve the use of published data sets of vacuum-to-water transfer free energies of small molecules that are divided into training and validation subsets. Experimental data are available for a few hundred small organic molecules [11]. These molecules are quite similar, mostly monofunctional, with a sprinkling of still relatively simple polyfunctional molecules. In contrast, most organic molecules of biological interest, e.g., drug molecules, are highly polyfunctional.

Recently, a new test set became available through the SAMPL-1 blind prediction challenge sponsored by OpenEye Scientific Software [12]. This consisted of 63 highly diverse, densely polyfunctional, neutral but very polar compounds, encompassing larger magnitudes of hydration free energies and molecular weights than seen in the common public data sets. The structures of the molecules were made available to the public a few months before the experimental transfer free energies were disclosed, allowing a true test of prediction methods.

We participated in the SAMPL-1 blind prediction challenge using a continuum model of hydration with terms that include continuum solute-solvent dispersion-repulsion contributions, a surface-area-based cost of cavity formation in water, and electrostatic contributions from a boundary element method. Calibration and validation of the continuum dispersion-repulsion atom-type parameters were based on commonly available hydration data sets.

In this paper, we evaluate the performance of our prospective SAMPL-1 predictions and retrospectively analyze the merits and generality of the hydration model. We obtain insights into how well our solvation model, calibrated against the publicly available hydration data sets, transfers to more complex, highly polyfunctional molecules.

Part of the special section “Calculation of Aqueous Solvation Energies of Drug-Like Molecules: A Blind Challenge”.

* Corresponding author. Phone: 514-496-6343. Fax: 514-496-5143. E-mail: rico@bri.nrc.ca.

‡ Current address: Department of Chemistry, Purdue University.

J. Phys. Chem. B 2009, 113, 4511–4520

10.1021/jp8061477 CCC: $40.75. Published 2009 by the American Chemical Society. Published on Web 03/06/2009.

Materials and Methods

Hydration Model. For the SAMPL-1 blind prediction of hydration free energies, we employed the continuum framework shown in eq 1, where the first, second, and third terms on the right-hand side represent continuum formulations for electrostatic, cavity, and dispersion-repulsion solute-solvent effects, respectively. The electrostatic contribution is the change in solute reaction field energy, ΔG_hyd^R, calculated by solving the Poisson equation in water and in vacuum. The cavity contribution is proportional to the total molecular surface area, MSA. The dispersion-repulsion contribution describing solute-solvent van der Waals interactions is obtained from integrals of the continuum Lennard-Jones potential over the molecular surface, U_i^vdW, for defined atom types, i [13-18]. Details of the continuum van der Waals calculation are described in the Appendix. (For the sake of brevity, for the remainder of the paper, we shall use dispersion as synonymous with dispersion-repulsion.) All electrostatic calculations were carried out with the boundary element method (BEM) implemented in the BRI BEM program [19,20]. All molecular surface calculations were carried out with a variable-radius probe [21]. AM1-BCC partial atomic charges [22] were employed throughout.
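As described above, eq 1 is a plain sum of an electrostatic reaction-field term, a cavity term proportional to MSA, a sum of per-atom-type continuum van der Waals terms, and a constant C. The following Python sketch only illustrates that bookkeeping. It assumes the reaction-field energy, the surface area, and the per-atom-type U_i^vdW values have already been computed elsewhere (by the BEM and surface programs, which are not reproduced here); the class and function names and all numerical inputs except the γcav value of 0.1213 kcal/mol/Å² are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HydrationTerms:
    """Precomputed inputs for one solute (illustrative container, not from the paper)."""
    dg_reaction_field: float            # electrostatic term, kcal/mol
    molecular_surface_area: float       # MSA, in square angstroms
    u_vdw_by_type: Mapping[str, float]  # U_i^vdW per atom type, already evaluated with B_i, kcal/mol


def hydration_free_energy(terms: HydrationTerms, gamma_cav: float, constant_c: float) -> float:
    """Sum the four contributions of eq 1 for a single solute."""
    cavity = gamma_cav * terms.molecular_surface_area
    dispersion = sum(terms.u_vdw_by_type.values())
    return terms.dg_reaction_field + cavity + dispersion + constant_c


# Hypothetical numbers chosen only to exercise the function.
example = HydrationTerms(
    dg_reaction_field=-6.0,
    molecular_surface_area=150.0,
    u_vdw_by_type={"C": -8.0, "O": -3.5, "H": -1.0},
)
dg_calc = hydration_free_energy(example, gamma_cav=0.1213, constant_c=0.5)
```

Keeping the three physical terms as separate inputs mirrors the way the text treats them: the electrostatic and cavity terms depend on the radii scaling, while the dispersion coefficients are fitted afterwards.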

For a given molecular geometry and set of partial atomic charges, the continuum model presented in eq 1 depends on a number of parameters: the solute dielectric constant, Din; atomic radii; the cavity surface coefficient, γcav; and the set of atom-type-dependent continuum van der Waals (c-vdW) coefficients, {Bi}. A convenient treatment of the dependence on atomic radii is to uniformly alter AMBER van der Waals radii [23] with a linear scaling coefficient, ρ [24]. The parameters Din and ρ can take values within reasonable physical ranges [24,25]. Prospective SAMPL-1 predictions were carried out at Din = 1 and ρ = 0.9. Retrospective analyses presented throughout this paper for SAMPL-1 and other hydration data sets were performed with Din set at 1.0, 2.0, or 4.0 and the radii scaling coefficient set at 0.9, 1.0, or 1.1.
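The retrospective scan described above covers three solute dielectric values and three radii scaling values, i.e., nine parameter combinations. A minimal sketch of that grid, together with the uniform linear scaling of atomic radii, is given below. The function names are hypothetical and the radius in the usage example is a placeholder, not an actual AMBER value.

```python
from itertools import product

D_IN_VALUES = (1.0, 2.0, 4.0)       # solute dielectric constants scanned retrospectively
RADII_SCALING_VALUES = (0.9, 1.0, 1.1)  # linear scaling coefficients applied to the radii


def scale_radii(radii, scaling):
    """Uniformly scale every atomic radius by the same linear coefficient."""
    return {atom_type: scaling * radius for atom_type, radius in radii.items()}


def parameter_grid():
    """All (D_in, scaling) combinations used in the retrospective analysis."""
    return list(product(D_IN_VALUES, RADII_SCALING_VALUES))


grid = parameter_grid()
prospective_setting = (1.0, 0.9)  # the combination used for the blind predictions
scaled = scale_radii({"C": 2.0}, 0.9)  # placeholder radius in angstroms
```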

At a given radii scaling coefficient, ρ, the cavity surface coefficient, γcav, was fitted to the cavity contribution to hydration for a set of rigid alkanes, previously derived on the basis of free energy perturbation simulations in explicit solvent [26]. For a given (Din, ρ, γcav) parameter set, the set of c-vdW coefficients, {Bi}, for the defined atom types i and the constant C were determined by a fit that minimizes the mean absolute error between the hydration free energies calculated from eq 1 and the experimental hydration free energies for compounds in a training hydration data set, using the R statistics software [27]. Various atom type definitions ranging from 2 to 29 atom types were explored and evaluated on training and test hydration data sets. An atom type definition that affords a substantial improvement of mean absolute error relative to simpler definitions consisted of 11 elemental atom types (C, O, N, S, P, F, Cl, Br, and 3 types of H). At the other end of the spectrum, a relatively complex definition consisting of 25 atom types formed by separating elements according to hybridization and connectivity further improved the fit. By merging different elemental types having similar coefficients, this 25-atom-type definition was reduced to a 16-atom-type definition (C, 3 types of O, 3 types of N, 2 types of S, P, F, Cl, Br, and 3 types of H) with marginal decay in performance. Both the 16- and the 25-atom-type definitions were used for prospective predictions and re-examined in the retrospective analysis.
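The fitting criterion named above is the mean absolute error rather than the more common sum of squares. The sketch below is not the authors' R procedure and does not fit the {Bi} coefficients; it only illustrates the criterion for the simplest case, the constant C with every other term held fixed. In that one-parameter case the offset that minimizes the mean absolute error is the median of the residuals, which is a standard property of least-absolute-deviation fitting. All numbers are hypothetical.

```python
from statistics import median


def mean_absolute_error(calc, expt):
    """Mean of |calc - expt| over paired values."""
    return sum(abs(c - e) for c, e in zip(calc, expt)) / len(calc)


def best_constant(partial_calc, expt):
    """Offset C minimizing the MAE of (partial_calc + C) against expt.

    For a single additive constant the minimizer is the median residual.
    """
    return median(e - p for e, p in zip(expt, partial_calc))


# Hypothetical sums of the electrostatic, cavity, and dispersion terms (kcal/mol)
partial = [-3.0, -5.0, -1.0, -8.0, -2.0]
# Hypothetical experimental hydration free energies (kcal/mol)
experiment = [-2.5, -4.0, -1.2, -6.0, -2.1]

c_fit = best_constant(partial, experiment)
mae_fit = mean_absolute_error([p + c_fit for p in partial], experiment)
mae_without_c = mean_absolute_error(partial, experiment)
```

A joint fit of many coefficients under the same criterion needs a dedicated least-absolute-deviation solver; the median rule applies only to the constant term.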

Hydration Data Sets. A data set consisting of experimental hydration free energies for 295 neutral organic small molecules was compiled from the published literature [24,28-30]. The composition of the data set is provided in Tables S1 and S2 of the Supporting Information. A subset of 129 compounds constituted a training data set used for calibrating the continuum solute-solvent Lennard-Jones atom-type coefficients, Bi, in eq 1. In the training set, we included mostly rigid representatives of the various chemical classes, with the majority of compounds (117) being monofunctional, and only a few polyfunctional compounds (12) were included to increase coverage of functional groups. The remaining 166 compounds formed the test data set, which mirrors the training data set in terms of chemical class representation for monofunctional compounds (117) but differs from the training analogs by having increased flexibility and containing a larger collection of polyfunctional compounds (49). Single-conformation molecular geometries were generated using CONCORD [31]. Atomic partial charges were calculated with the AM1-BCC method [22] using Molcharge (OpenEye, Inc., Santa Fe, NM).

The SAMPL-1 data set consists of 63 polyfunctional compounds. We explored the chemical class coverage of this data set by hierarchical clustering based on atom pairs and 2D chemical fingerprints as descriptors using Selector in Sybyl 8.0 (Tripos, Inc., St. Louis, MO). A useful chemical classification was achieved by forming 25 clusters. Molecular structures and partial atomic charges employed for hydration calculations were those provided in the SAMPL-1 blind prediction challenge; that is, single-conformation Merck molecular force field (MMFF)-optimized geometries and AM1-BCC partial atomic charges. We retrospectively rebuilt 7 compounds (37, 42, 58-62) that initially had erroneous molecular structures in the distributed SAMPL-1 data set, optimized their geometries with the MMFF force field [32] in Sybyl 8.0, and assigned AM1-BCC partial atomic charges [22] with Molcharge.

Results and Discussion

Hydration Model Parameters. The SAMPL-1 blind predictions were obtained with a hydration model that includes continuum treatments of electrostatic, cavity, and dispersion solute-water interactions as described in eq 1. The hydration model parameters used for these single-conformation-based predictions were a solute dielectric constant of 1 (Din), scaling of AMBER van der Waals atomic radii by a factor of 0.9 (ρ), a cavity molecular-surface coefficient of 0.1213 kcal/mol/Å² (γcav), and two different sets of continuum dispersion coefficients {Bi} corresponding to separate parametrizations consisting of 16 and 25 atom types. These atom-type definitions and their dispersion coefficients, fitted using a hydration training data set of 129 compounds (see the Materials and Methods section and Tables S1 and S2 of the Supporting Information), are shown in the upper panels of Figure 1A and B (see also Table S4 of the Supporting Information). The 16-atom-type definition was derived from the 25-atom-type definition by merging elemental atom types that yielded similar dispersion coefficients upon fitting at (Din, ρ, γcav) of (1, 0.9, 0.1213). An interesting atom-typing resulted from this process; for example, O sp3 atom types are differentiated on the basis of the hybridization of the connected atom (e.g., O sp3 bound to C sp3 vs O sp3 bound to C sp2) rather than on the substitution pattern (e.g., O sp3 in alcohols and phenols vs O sp3 in ethers and esters).

\[
\Delta G_{\mathrm{hyd}}^{\mathrm{calc}}(D_{\mathrm{in}}, \rho, \gamma_{\mathrm{cav}}, \{B_i\}) = \Delta G_{\mathrm{hyd}}^{R}(D_{\mathrm{in}}, \rho) + \gamma_{\mathrm{cav}}\,\mathrm{MSA}(\rho) + \sum_i U_i^{\mathrm{vdW}}(B_i) + C \tag{1}
\]

Most dispersion coefficients obtained at (Din, ρ) of (1, 0.9) are positive, consistent with attractive solute-water van der Waals interactions. It should be noted that we use the label “van der Waals interaction” here rather loosely. This fitted term also includes part of the hydrogen-bonding interactions not sufficiently captured by the continuum electrostatics model as well as other solvation effects. General trends include larger coefficients for N atom types than for O and C atom types and monotonically increasing coefficients in the halogen series (B_F < B_Cl < B_Br). Opposite-sign coefficients are obtained for polar and nonpolar H atom types, the negative coefficients for polar H atoms suggesting repulsive van der Waals interactions with water, presumably reflecting a close water approach upon hydrogen bonding. A negative coefficient was also obtained for a buried S atom connected to two O atoms (the S.o2 type found, for instance, in sulfones, sulfonates, and sulfonamides), apparently indicating a repulsive effect. Interestingly, the positive dispersion coefficient for P was obtained by fitting to phosphates, phosphonates, and thiophosphates present in the training data set, where the P atom has a topology similar to that of S.o2. Further analysis indicated that the fitted values of B_S.o2 and B_P remain distinct (but the large negative value of B_S.o2 can vanish) when dispersion coefficients are calibrated at other (Din, ρ) values (e.g., as in the lower panels of Figure 1A and B).

The 16-atom-type model is relatively parsimonious in its number of parameters, considering that it covers nine atomic elements. The use of atom types based on continuum van der Waals energies instead of accessible surface area appears to confer greater generality. For example, for the frequently occurring elements C, H, N, and O, the GB-SA model of Gallicchio et al. [9] requires 5 C, 3 H, 5 N, and 3 O atom types, as compared with the 1 C, 3 H, 3 N, and 2 O atom types in our 16-atom-type model. Models with no continuum electrostatics require even more atom types. The weighted solvent-accessible surface (WSAS) model of Wang et al. [30] required 11 C, 12 H, 9 N, and 5 O atom types.

Figure 1. Continuum dispersion coefficients obtained by fitting to the training data set (129 compounds) at indicated internal dielectric, Din, and AMBER radii scaling, ρ. (A) 16-atom-type definition. (B) 25-atom-type definition. Corresponding cavity coefficients fitted on rigid aliphatic hydrocarbons are 0.1213 kcal/mol/Å² for models at ρ = 0.9, and 0.1151 kcal/mol/Å² for models at ρ = 1.0. See Table S4 of the Supporting Information for the numerical values of these coefficients.

Overall Measures of SAMPL-1 Predictions. The SAMPL-1 blind predictions afforded by the model in eq 1 and with the parameters described above are listed in Table S3 of the Supporting Information. Statistical measures, including mean and median absolute errors, squared correlation coefficient, and correlation slope and intercept between the predicted and experimental hydration free energies, are given in Table 1 both for the entire SAMPL-1 data set (63 compounds) and for a SAMPL-1 subset (56 compounds), which excludes compounds with erroneous chemical structures provided during the blind prediction stage (which we corrected afterward and have now included in the entire data set). Correlation plots are shown in Figure 2.

As seen in Table 1, an overall good performance was achieved on the entire SAMPL-1 blind prediction data set. Similar statistical measures are returned for the 56-compound SAMPL-1 subset. It is encouraging that the 25-atom-type dispersion definition yielded predictions similar to the simpler 16-atom-type-based model, given the overfitting concerns generated by the data on the training and test data sets showing no improvement with the increase in model complexity (Table 1). For the SAMPL-1 data set, the c-vdW contributions (Table S3 of the Supporting Information) showed subtle variations between the two dispersion models, apparently favoring the more complex model. Our prospective SAMPL-1 predictions are particularly good in terms of squared correlation coefficients (R²) of 0.80 to 0.83, and acceptable in terms of median absolute errors of 1.6-2.0 kcal/mol (see bootstrapped averages and SD values in Table 1). However, a critical analysis indicates mediocre values for the mean absolute error (about 2.5 kcal/mol), as well as correlation slope values of about 0.55 and intercept values of about -2.7 kcal/mol, which deviate from ideal correlation. With the exception of R², all statistical indices listed in Table 1 indicate a marked deterioration of model performance for the SAMPL-1 data set relative to commonly available, typical, hydration data sets used here for model training and testing. Being able to highlight this behavior, which is also graphically reflected in the correlation plots shown in Figure 2, underlines the value of blind predictions with original data sets such as SAMPL-1 to diagnose model parametrization and transferability problems.
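The statistical indices discussed above (mean and median absolute error, R², slope, intercept) and their bootstrap averages and standard deviations can be sketched as follows. This is an illustrative implementation, not the authors' script. Two choices are assumptions: the regression takes the calculated values as x and the experimental values as y, and degenerate resamples with no spread are redrawn. The data are synthetic and lie exactly on the line expt = 0.5 × calc - 2.7, chosen only so that the recovered slope and intercept are easy to check; they are not SAMPL-1 values.

```python
import random
from statistics import mean, median, stdev


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus squared correlation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("degenerate sample: no spread in x or y")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy * sxy / (sxx * syy)


def profile(calc, expt):
    """Statistical profile of calculated vs experimental values."""
    errors = [abs(c - e) for c, e in zip(calc, expt)]
    slope, intercept, r2 = linear_fit(calc, expt)
    return {"mean_abs_err": mean(errors), "median_abs_err": median(errors),
            "r2": r2, "slope": slope, "intercept": intercept}


def bootstrap_profile(calc, expt, n_samples=5000, seed=0):
    """Average and SD of each index over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(calc)
    runs = []
    while len(runs) < n_samples:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            runs.append(profile([calc[i] for i in idx], [expt[i] for i in idx]))
        except ValueError:
            continue  # redraw resamples that contain a single repeated point
    return {key: (mean([r[key] for r in runs]), stdev([r[key] for r in runs]))
            for key in runs[0]}


calc = [-1.0, -3.0, -5.0, -8.0, -12.0, -20.0]  # synthetic calculated values, kcal/mol
expt = [0.5 * c - 2.7 for c in calc]           # synthetic experimental values, kcal/mol
stats = profile(calc, expt)
boot = bootstrap_profile(calc, expt, n_samples=500)
```

On this synthetic set the correlation is perfect while the mean absolute error exceeds the median absolute error, which illustrates how a high R² can coexist with a non-ideal slope and sizable errors.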

TABLE 1: Statistical Profile of Select Hydration Models
Absolute errors and intercepts are in kcal/mol.

mode           model(a)                           Din, ρ   set       N    mean |error|   median |error|  R²             slope          intercept
prospective    RF + Cav + c-vdW (16 atom types)   1, 0.9   training  129  0.60 ± 0.046   0.47 ± 0.051    0.93 ± 0.013   0.95 ± 0.025   -0.18 ± 0.100
                                                           test      166  0.72 ± 0.050   0.53 ± 0.081    0.92 ± 0.015   0.93 ± 0.027   -0.47 ± 0.093
                                                           test-pf   49   1.07 ± 0.125   0.98 ± 0.184    0.85 ± 0.056   0.86 ± 0.042   -0.88 ± 0.248
                                                           SAMPL-1   56   2.56 ± 0.341   1.79 ± 0.336    0.83 ± 0.053   0.55 ± 0.041   -2.75 ± 0.268
                                                           SAMPL-1   63   2.68 ± 0.326   1.97 ± 0.266    0.82 ± 0.050   0.54 ± 0.038   -2.77 ± 0.265
prospective    RF + Cav + c-vdW (25 atom types)   1, 0.9   training  129  0.59 ± 0.045   0.46 ± 0.039    0.93 ± 0.013   0.95 ± 0.024   -0.18 ± 0.098
                                                           test      166  0.72 ± 0.050   0.51 ± 0.071    0.92 ± 0.014   0.93 ± 0.027   -0.50 ± 0.094
                                                           test-pf   49   1.03 ± 0.118   1.01 ± 0.186    0.87 ± 0.050   0.86 ± 0.043   -0.87 ± 0.247
                                                           SAMPL-1   56   2.41 ± 0.332   1.64 ± 0.331    0.82 ± 0.058   0.57 ± 0.047   -2.68 ± 0.305
                                                           SAMPL-1   63   2.58 ± 0.317   1.87 ± 0.343    0.80 ± 0.056   0.55 ± 0.045   -2.71 ± 0.298
retrospective  RF + Cav + c-vdW (25 atom types)   2, 1.0   training  129  0.55 ± 0.046   0.40 ± 0.037    0.93 ± 0.013   1.01 ± 0.029   0.03 ± 0.109
                                                           test      166  0.80 ± 0.044   0.68 ± 0.068    0.92 ± 0.012   0.93 ± 0.028   -0.59 ± 0.110
                                                           test-pf   49   0.87 ± 0.097   0.74 ± 0.174    0.90 ± 0.034   0.90 ± 0.040   -0.50 ± 0.234
                                                           SAMPL-1   56   2.25 ± 0.380   1.14 ± 0.229    0.82 ± 0.058   0.57 ± 0.052   -2.50 ± 0.362
                                                           SAMPL-1   63   2.54 ± 0.364   1.39 ± 0.322    0.80 ± 0.053   0.55 ± 0.046   -2.50 ± 0.366
retrospective  RF                                 1, 1.1   training  129  1.13 ± 0.072   0.99 ± 0.074    0.79 ± 0.036   1.16 ± 0.063   0.70 ± 0.212
                                                           test      166  1.33 ± 0.071   1.17 ± 0.081    0.78 ± 0.030   1.29 ± 0.072   1.06 ± 0.263
                                                           test-pf   49   1.77 ± 0.167   1.53 ± 0.287    0.68 ± 0.077   1.23 ± 0.171   0.46 ± 0.926
                                                           SAMPL-1   56   1.46 ± 0.163   1.17 ± 0.158    0.81 ± 0.058   0.84 ± 0.053   -1.14 ± 0.343
                                                           SAMPL-1   63   1.55 ± 0.174   1.21 ± 0.148    0.77 ± 0.065   0.80 ± 0.060   -1.39 ± 0.405
retrospective  RF                                 2, 1.0   training  129  1.14 ± 0.078   1.00 ± 0.115    0.79 ± 0.039   1.22 ± 0.072   0.65 ± 0.226
                                                           test      166  1.38 ± 0.078   1.23 ± 0.094    0.77 ± 0.030   1.36 ± 0.076   1.06 ± 0.257
                                                           test-pf   49   1.89 ± 0.183   1.68 ± 0.235    0.68 ± 0.081   1.34 ± 0.187   0.71 ± 0.945
                                                           SAMPL-1   56   1.57 ± 0.159   1.39 ± 0.229    0.79 ± 0.062   0.85 ± 0.057   -1.28 ± 0.354
                                                           SAMPL-1   63   1.65 ± 0.163   1.44 ± 0.231    0.75 ± 0.068   0.81 ± 0.065   -1.55 ± 0.408

(a) Following eq 1, RF is the change in reaction field energy as the electrostatic component, ΔG_hyd^R; Cav is the cavity contribution, γcav·MSA; and c-vdW is the continuum van der Waals contribution, Σi U_i^vdW(Bi). Statistics are given as av ± SD for 5000 bootstrap samples.

Chemical Class and Outlier Analysis. The above global analysis shows that our prospective continuum hydration model achieved highly correlative SAMPL-1 predictions at the expense of a correlation slope of ∼0.5, which translates into an overestimation of hydration free energies for the more polar compounds (by as much as 11.7 kcal/mol), coupled with a less-pronounced but apparent underestimation for the more nonpolar compounds (by as much as 2.9 kcal/mol) (Figure 2). In addition, the smaller values for median absolute errors relative to the mean absolute errors signal that a large number of compounds were relatively well-predicted, but that predictions include a few major outliers. These observations prompted us to perform hierarchical chemical clustering of the SAMPL-1 data set to (1) examine whether the correlation is maintained for chemical classes represented within SAMPL-1, if any, and (2) relate the major outliers to chemical classes or at least specific moieties. Given that the 63 compounds in the SAMPL-1 data set are highly diverse and polyfunctional, we were able to form only a relatively high number of 25 tight chemical clusters, out of which 9 are singletons (Table S3, Supporting Information). Three of these clusters contained at least 5 analogs and warranted correlation analyses: 7 aliphatic nitrates (1, 2, 3, 4, 5, 6, 41), 6 meta-dinitrobenzenes (11, 27, 28, 40, 46, 56), and 5 aromatic sulfonylureas (12, 20, 39, 51, 54). For the first two clusters (solid triangles and squares in lower graphs of Figure 2), high correlations are maintained (R² of 0.88-0.95), but slopes (0.65-0.67) improve only slightly relative to the entire data set. This indicates a rather systematic deficiency in the model, also considering that experimental data for these clusters span from -1.82 to -8.18 kcal/mol and thus do not reach the extreme negative values that might be expected to influence the slope most. Only a weak correlation is obtained for the cluster of sulfonylureas (R² < 0.43).

Figure 2. Correlation plots for the training, test, and SAMPL-1 data sets. (A) Hydration model with 16 c-vdW atom types, at Din = 1 and ρ = 0.9. (B) Hydration model with 25 c-vdW atom types at Din = 1 and ρ = 0.9. For the test data set, data for the polyfunctional compounds (49) are represented with filled squares. For the SAMPL-1 data set, the following representations are used: ■, aliphatic nitrates (7); ▲, m-dinitrobenzene derivatives (6); ●, aromatic sulfonylureas (5); ○, compounds with structures corrected after the prospective study (7); □, the remainder of the data set. A perfect correlation is indicated by the diagonal.

We identified 12 major outliers with absolute errors above 4 kcal/mol, all overestimating the experimental hydration data (labeled in Figure 2, see also Table S3 of the Supporting Information). In Figure 3, we compare these outliers with other SAMPL-1 compounds that contain similar substructures. The five sulfonylureas in the cluster mentioned earlier (solid circles in Figure 2) are the most highly solvated compounds in the SAMPL-1 data set (-14.01 to -20.25 kcal/mol), a range that is also unprecedented in the hydration data sets previously reported for neutral molecules. Their trisubstituted pyrimidine and triazine moieties can also be found in SAMPL-1 compounds 24 and 50, both with hydration free energies predicted very well in the prospective study. The phenylsulfone moiety, present in three of the sulfonylurea outliers and similar to the benzylsulfone and thienylsulfone moieties in the other two, is also present in the SAMPL-1 compound 40, which was closely predicted as well. This leaves the substituted urea in the central parts of these molecules as a moiety presumably responsible for the prediction error. Consistently, we identified SAMPL-1 compound 32, a small substituted urea whose hydration free energy was also overestimated by our models (by over 30%). We note that the hydration model gives an excellent prediction for N,N-dimethylbenzamide from our test set, which lacks an NH group relative to the urea derivative 32. These observations suggest that our hydration model may not be parametrized sufficiently to handle the urea moiety in general, as well as urea derivatives carrying aromatic and polar (e.g., sulfone) substituents.

Figure 3. Major outliers (12) in the prospective SAMPL-1 predictions based on a 16-atom-type definition (err1) and a 25-atom-type definition (err2) of the continuum dispersion model, at Din = 1 and ρ = 0.9. The outliers are drawn in the center. Double-headed arrows indicate interesting relationships between these outliers and similar SAMPL-1 compounds (boxed). A few additional compounds are also included in these comparisons: NN-DMBA (N,N-dimethylbenzamide), N-MAA (N-methylacetamide), and NN-DMAA (N,N-dimethylacetamide). In this figure, SAMPL-1 compounds are assigned the original identification numbers of general form CUP080XX, while throughout the rest of the paper and Supporting Information, only the last two digits are used for compound identification, for brevity.
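The outlier criterion used above, an absolute error above 4 kcal/mol, can be expressed as a simple filter. In the sketch below the signed error is taken as calculated minus experimental, so that a negative error means the hydration free energy is predicted as too favorable; this sign convention is an assumption made for illustration. The compound labels and values are hypothetical and are not taken from Table S3.

```python
def flag_outliers(calc, expt, threshold=4.0):
    """Return {compound_id: signed_error} where |calc - expt| exceeds the threshold."""
    flagged = {}
    for compound_id, calc_value in calc.items():
        signed_error = calc_value - expt[compound_id]
        if abs(signed_error) > threshold:
            flagged[compound_id] = signed_error
    return flagged


def overestimates_hydration(signed_error):
    """True when the calculated value is more negative than experiment."""
    return signed_error < 0


# Hypothetical calculated and experimental hydration free energies (kcal/mol)
calc = {"A": -25.0, "B": -9.1, "C": -3.0}
expt = {"A": -16.0, "B": -8.5, "C": -5.5}

outliers = flag_outliers(calc, expt)
all_overestimated = all(overestimates_hydration(err) for err in outliers.values())
```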

It is interesting to compare the very closely related SAMPL-1 compounds 39 and 51, the worst and best predicted among these aromatic sulfonylureas. Compound 39 contains two more heteroatoms than analog 51, and we predicted that 39 would be more hydrated than 51 (by about 5 kcal/mol), yet the experimental data show the opposite; that is, 39 is less hydrated than 51 (by about 5 kcal/mol). We also note that the hydration free energy of the apparently highly polar compound 39 is actually only 5.5 kcal/mol more negative than that of N-methylacetamide; we predicted a difference almost 3 times as large. Evidently, such solvation phenomena are captured neither by chemical intuition nor by our hydration model, which seems too “additive”.

Two related clusters contain another group of four major outliers: compounds 10, 23, 13, and 52. Although they are smaller than the previously discussed sulfonylurea outliers, some of the errors obtained for these compounds are even larger, which undermines the idea that model deficiencies may be related to size. The thiophosphate ester moiety in outliers 10 and 23 is also present in SAMPL-1 compounds 17 and 36 with well-predicted hydration free energies. Since N-substituted benzamide is a substructure of both 10 and 23, we also note the good prediction of hydration free energy for N,N-dimethylbenzamide. Although less clear than in the case of the highly congeneric sulfonylureas, these four outliers highlight a common structural theme as a possible candidate for the obtained prediction errors: densely polar and hydrogen-bond-capable structures, featuring primarily substituted amide groups, often in conjugated systems with aromatic, unsaturated, or other amide moieties. For example, the N-substituted pyrrolidine-2,5-dione ring is present in both the major outlier 23 and the well-predicted compound 14, but the two carbonyl groups are conjugated (to a fused benzene ring) only in the outlier. The pattern seems to carry on to outliers 13 and 52, where one of the carbonyls is conjugated to an endocyclic double bond, whereas the other carbonyl is engaged in a cyclic urea moiety. Replacing one amide group of outlier 52 by just a C=N bond as in analog 49 leads to an almost perfect prediction of hydration free energy, but further replacing the C=N bond with an exposed, conjugated N=N bond as in compound 10 leads to our biggest SAMPL-1 outlier.

Finally, we note three pairs of SAMPL-1 compounds, each consisting of a major outlier and a well-predicted close analog. Experimental data indicate that the major outlier 42 is less hydrated (by ∼0.5 kcal/mol) than its well-predicted analog 37, despite converting the substituted amide group of 42 into a methyl group in 37. Our models predicted the opposite, a lower hydration free energy for analog 42 by ∼6.5 kcal/mol. This compares favorably with the hydration free energy of N,N-dimethylacetamide (-8.5 kcal/mol). We note in the outlier 42 the presence of two amide groups in a highly polar, conjugated, hydrogen-bond-capable structure, in agreement with the features highlighted by the other outliers. Another pair consists of the major outlier 29 and its well-predicted analog 30, with experimental data indicating that the latter analog is the more hydrated one (by ∼0.6 kcal/mol), despite containing three aliphatic C centers instead of a sulfurous ester SO2 group in the former analog. Our prospective predictions favored hydration of analog 29 by over 6 kcal/mol. The discrepancy is not as pronounced for the pair of SAMPL-1 analogs 62 and 58, with both compounds overestimated by our models; the former, however, is overestimated by as much as 6 kcal/mol.
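The analog-pair comparisons above come down to checking whether the predicted and experimental differences in hydration free energy between two analogs have the same sign. The sketch below encodes three of the pairs using the approximate differences quoted in the text ("about 5", "∼6.5", "∼0.5", "∼0.6", and "over 6" taken as 6.0), so the numbers are rounded readings rather than tabulated data. The type and function names are hypothetical, and each difference is defined as ΔG(first) minus ΔG(second), with negative meaning the first compound is more hydrated.

```python
from typing import NamedTuple


class AnalogPair(NamedTuple):
    first: str
    second: str
    ddg_pred: float  # predicted dG(first) - dG(second), kcal/mol
    ddg_expt: float  # experimental dG(first) - dG(second), kcal/mol


# Approximate differences as quoted in the text (rounded, for illustration only)
PAIRS = [
    AnalogPair("39", "51", -5.0, 5.0),
    AnalogPair("42", "37", -6.5, 0.5),
    AnalogPair("29", "30", -6.0, 0.6),
]


def ranking_disagrees(pair):
    """True when prediction and experiment rank the two analogs in opposite order."""
    return pair.ddg_pred * pair.ddg_expt < 0


def relative_error(pair):
    """Error in the predicted difference between the two analogs, kcal/mol."""
    return pair.ddg_pred - pair.ddg_expt


misranked = [(p.first, p.second) for p in PAIRS if ranking_disagrees(p)]
```

Working with differences between close analogs removes any error common to both compounds, which is why these pairs isolate the specific structural change that the model mishandles.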

Retrospective Analysis of Model Transferability. The SAMPL-1 prospective study indicated that our continuum hydration model in eq 1, with the set of parameters described earlier, is not entirely transferable between chemical classes that span a wide range of polarity, size, and chemical complexity. Using (D_in, F) of (1, 0.9) with a consistent cavity surface coefficient, γ_cav, and with solute-solvent continuum dispersion coefficients, B_i, fitted to a training hydration data set, we observed a deterioration of prediction quality from the commonly available hydration data sets to the SAMPL-1 data set. Although this was not totally unexpected, it provides an opportunity to retrospectively explore how the transferability of the model depends on its parameters and complexity. Here, we explore the model performance and transferability between data sets as a function of the (D_in, F) parameters, which set the γ_cav value and influence the B_i values that are refitted at each (D_in, F) combination on the same training set used for the prospective study. We do not attempt to fit the dispersion term on SAMPL-1 compounds due to overfitting concerns. We maintain the two atom-type definitions (with 16 and 25 atom types) for the continuum dispersion model as in the prospective study. Although modifying atom-type definitions is probably one way to improve model generality, it is outside the scope of this paper. The trends in statistical measures of the continuum hydration model in eq 1 at a relatively coarse yet sufficiently informative sampling of the (D_in, F) parameter space for the training, test, and SAMPL-1 data sets are plotted in Figure 4. A first observation is the small variation in the model performance at various parameter combinations for each of the training set (129 compounds), the test set (166 compounds), and a test subset consisting of 49 polyfunctional compounds. The parameters derived at (D_in, F) of (1, 0.9) and used in the prospective study give the best or competitive performance in both training and testing, while slightly better parameter combinations can be selected for the test subset of polyfunctionals. The stable performance across parameter combinations for a given data set is largely attributed to the dispersion contribution that was fitted to experimental hydration data. It is encouraging that this stability is maintained in model testing, although not entirely surprising, given that there is a fair similarity between the chemical space present in the training and test data sets. Additionally, there is only a minor deterioration of the model performance from the training data set to the test data set and then to the polyfunctionals test subset. Mean and median absolute errors range from about 0.4 kcal/mol for the training set up to only about 1.1 kcal/mol in the worst test predictions, R² coefficients range from close to 0.95 for the training data set down to about 0.85 in the worst test predictions, and the almost ideal slope and intercept for the training data set go down to values of about 0.9 and -1, respectively, in the worst test predictions. All these data indicate the robustness of the hydration model across the chemical space covered by the commonly available hydration data set, and this with a profoundly degenerate set of optimal parameters.
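The statistical measures quoted above (mean and median absolute error, R², slope, and intercept) can be computed along the following lines. This is a minimal sketch, not the authors' analysis script: the function name `hydration_statistics` is hypothetical, the regression direction (predicted regressed on experimental) is an assumption, and the four data points are invented for illustration.

```python
import numpy as np

def hydration_statistics(experimental, predicted):
    """Summary statistics for predicted vs experimental hydration free energies (kcal/mol)."""
    exp = np.asarray(experimental, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    abs_err = np.abs(pred - exp)
    # Assumption: predicted values are regressed on experimental ones.
    slope, intercept = np.polyfit(exp, pred, 1)
    r = np.corrcoef(exp, pred)[0, 1]
    return {
        "mean_abs_error": float(abs_err.mean()),
        "median_abs_error": float(np.median(abs_err)),
        "r2": float(r ** 2),
        "slope": float(slope),
        "intercept": float(intercept),
    }

# Invented numbers for illustration only (not data from the paper).
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [0.5 * x - 2.0 for x in experimental]
stats = hydration_statistics(experimental, predicted)
```

In this toy case the predictions are perfectly correlated with the reference values (R² of 1), yet the mean absolute error is 0.75 kcal/mol because the slope is 0.5 rather than 1. Correlation and absolute error therefore have to be read together.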


Figure 4. Comparative statistics for different models on training, test, and SAMPL-1 data sets. The analysis covers models derived at different solute dielectric constants (1, 2, or 4) as indicated. At each D_in, a dependence on scaling of atomic radii is given, with the AMBER radii scaling factor, F, taking the values of 0.9, 1.0, and 1.1 for the left, middle, and right connected points, respectively. Arrows indicate the statistical parameters of the models used in the prospective study. RF, Cav, and c-vdW are contributions to the hydration free energy arising from the change in reaction field energy, cavity formation, and solute-solvent van der Waals interactions, respectively.

However, the hydration model in eq 1 loses its performance transferability to the SAMPL-1 hydration data set. What is maintained instead is the model stability at various optimal parameter combinations, which remain highly degenerate due to the compensatory effects, mostly between the electrostatic and dispersion contributions (the cavity contributions vary less with parameters). Most deterioration is seen in terms of mean absolute error, now increased to between 2.2 and 3.1 kcal/mol, and in slope and intercept, ranging tightly between 0.5 and 0.6 and between -2.3 and -2.8, respectively. The median absolute error and R² deteriorate less relative to training and testing and show more dependence on parameter settings. It appears that (D_in, F) of (2, 1.0) with the 25-atom-type definition of dispersion is the better performing and the more transferable of the tested parametrizations for the continuum model from eq 1, slightly outperforming the parametrizations at the (D_in, F) of (1, 0.9) used in the prospective study, primarily in terms of median errors (Table 1, Figure 5). We see in Figure 1 that the continuum dispersion coefficients react to the electrostatic parameters (D_in, F), particularly to the doubling of the solute dielectric constant, by having larger, more positive values (i.e., increased attractive dispersion to compensate for weaker favorable electrostatics at D_in of 2 vs 1). This also points to the difficulty in interpreting the physical meaning of different contributions upon parameter variation (see also the calculated contributions listed in Table S3, Supporting Information).
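The compensation between electrostatic and dispersion contributions can be illustrated with a toy refit. This is a sketch under several stated assumptions, not the fitting procedure used in this work: the dispersion coefficients are obtained by ordinary least squares, the reaction-field energy is assumed to scale with D_in in a Born-like way with a solvent dielectric of 78.5, the cavity term is held fixed, and all energies and surface descriptors are synthetic. The names `refit_dispersion` and `born_like_scale` are hypothetical.

```python
import numpy as np

def refit_dispersion(exp_dg, rf_dg, cav_dg, descriptors):
    """Least-squares coefficients B such that rf + cav + descriptors @ B approximates exp."""
    residual = exp_dg - rf_dg - cav_dg
    coeffs, *_ = np.linalg.lstsq(descriptors, residual, rcond=None)
    return coeffs

def born_like_scale(d_in, d_solvent=78.5):
    """Assumed Born-type scaling of the reaction-field energy relative to D_in = 1."""
    return (1.0 / d_in - 1.0 / d_solvent) / (1.0 - 1.0 / d_solvent)

# Synthetic data for illustration only (not from the paper).
rng = np.random.default_rng(0)
n_mol, n_types = 40, 3
# Per-atom-type surface integrals; negative because the attractive term dominates.
descriptors = -rng.uniform(0.5, 3.0, size=(n_mol, n_types))
b_true = np.array([1.0, 2.0, 1.5])
rf_din1 = -rng.uniform(0.0, 8.0, size=n_mol)  # favorable electrostatics at D_in = 1
cav = rng.uniform(1.0, 3.0, size=n_mol)       # unfavorable cavity term, kept fixed
exp_dg = rf_din1 + cav + descriptors @ b_true

# Refit with the original electrostatics, then with the weakened D_in = 2 electrostatics.
b_din1 = refit_dispersion(exp_dg, rf_din1, cav, descriptors)
b_din2 = refit_dispersion(exp_dg, born_like_scale(2.0) * rf_din1, cav, descriptors)
```

With consistent synthetic data the first fit recovers `b_true`. In the second fit the dispersion term has to absorb the stabilization lost from the weaker electrostatics, so at least one coefficient grows. The fitted coefficients therefore depend on the electrostatic parameters they were fitted alongside, which is why the individual contributions are hard to interpret physically.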

To gain some understanding of the source of this transferability behavior, we repeated the parameter-dependent transferability analysis for a simpler version of the continuum model. We completely eliminated from eq 1 the nonpolar contribution (cavity and dispersion terms) and retained only the electrostatic term consisting of the change in reaction field between water and vacuum. We note that since the continuum dispersion term was eliminated, the training data set is no longer used for the training of this hydration model, and it essentially becomes another test data set. As seen in Figure 4, the mean and median absolute errors of this simple model based on the electrostatic contribution alone are much more sensitive to the (D_in, F) combination than those of the full model. There is a pronounced parameter dependence for the electrostatics-only model, following the pattern of concerted changes in D_in and F, as we have previously described.24 However, at specific optimal (D_in, F) combinations (e.g., (1, 1.1), (2, 1.0), (4, 0.9)), this simplest model has a stable performance across the different data sets and, for the SAMPL-1 data set, is actually able to outperform the complex model from eq 1, particularly in terms of the mean absolute error. This is due in large part to a reduction in the predicted solvation energies of the major outliers. For the median absolute error, the performance of the simple and full models for SAMPL-1 is comparable. Interestingly, the simpler models give correlation coefficients that are higher for the SAMPL-1 data set than for traditional polyfunctional hydration data sets. In addition, the correlation slopes between predicted and experimental data for the various data sets examined are closer to unity using the electrostatics-only model at optimal parameters than using the complex model, especially in the case of SAMPL-1 (Table 1, Figure 5).

The above retrospective analysis indicates that the electrostatic component of the hydration model is reasonably transferable across different data sets. However, by itself, the electrostatic term captures only part of the hydration free energy and leads to errors in the range of 1-2 kcal/mol. This simpler model reflects the overall trend in polarity while lacking some of the correlation details within certain chemical classes (e.g., hydrocarbons) and, thus, performs more poorly on the traditional data sets than the fitted complex model (Table 1, Figure 4). The addition of the nonelectrostatic terms reduces the errors to less than 1 kcal/mol for the traditional data sets, but the errors are twice as large for the SAMPL-1 set. This may result from a certain degree of parameter overfitting of the complex model to the chemical space provided by the previously available hydration data sets. However, the complex model gives SAMPL-1 predictions that are highly correlated with the experimental data, albeit with a slope significantly less than 1, suggesting a valid model but with a systematic error due to a miscalculated or absent property.

Conclusion

The continuum hydration model (eq 1) used in this work appears to capture the essential behavior of the hydration free energies of organic molecules. However, quantitative deficiencies were revealed in the application of the model to the highly polyfunctional SAMPL-1 data set. A re-examination of both the cavity formulation and the dispersion atom-type definition would be worthwhile. Additional terms may also need to be included to describe solute-solvent interactions that are currently poorly represented in the model, for example, directional effects in hydrogen bonding.10 This may reduce the observed overestimation that is apparently due to an overly additive contribution of the dispersion term in the case of densely polar, hydrogen-bond-capable compounds, such as the major outliers highlighted earlier. A major benefit of the distinct hydration data set from the SAMPL-1 blind prediction challenge is that it highlights shortcomings of our current solvent model that need to be addressed.

Figure 5. Scatter correlation plots comparing retrospective models at D_in = 2 and F = 1.0. (A) Model including electrostatic contributions (reaction field) plus nonelectrostatic contributions consisting of cavitation and a continuum dispersion model with 25 atom types trained on the original set (129 compounds). (B) Model based only on the reaction field contribution.

Appendix

Continuum van der Waals Calculation. The continuum van der Waals terms are calculated as described by Floris et al.14,15 Briefly, the discrete surrounding water molecules are replaced by a continuum of uniform density distribution, and the solute-solvent van der Waals interaction is taken to be proportional to the integral of the solute-continuum interaction over all of space. For ease of computation, the volume integral is transformed into a surface integral at the solute-solvent boundary defined by the solvent-accessible surface. For a 6-12 Lennard-Jones potential, this leads to eq A1, where r_ik is the vector from atom i to surface point k, n_k is the surface normal at k, and C is a scaling factor that incorporates the solvent number density.
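The 1/9 and 1/3 prefactors in eq A1 can be traced to the divergence theorem. The following is a sketch of that step, assuming that n_k points out of the solute into the solvent, that the solvent fills all space outside the surface S, and that atom i lies inside S. The symbol V_solv for the solvent region is introduced here for illustration only.

```latex
% For a radial power law the divergence is again a power law:
\nabla \cdot \left( \frac{\mathbf{r}}{r^{n}} \right) = \frac{3 - n}{r^{n}}
% Integrating over the solvent region (the contribution at infinity vanishes for n > 3),
% with n_k pointing from the solute into the solvent:
\int_{V_{\mathrm{solv}}} \frac{dV}{r^{n}}
  = \frac{1}{n - 3} \int_{S} \frac{\mathbf{r} \cdot \mathbf{n}}{r^{n}} \, ds
% Applying this with n = 12 and n = 6:
\int_{V_{\mathrm{solv}}} \left( \frac{A_i}{r^{12}} - \frac{B_i}{r^{6}} \right) dV
  = \int_{S} \left( \frac{A_i}{9 r_{ik}^{12}} - \frac{B_i}{3 r_{ik}^{6}} \right)
    \mathbf{r}_{ik} \cdot \mathbf{n}_k \, ds
```

Multiplying by the scaling factor C and summing over the solute atoms i gives the form of eq A1.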

One modification we have made is that we use the excluded or molecular surface (MS) rather than the solvent-accessible surface (SAS) to carry out the surface integral. Our reason for doing this is that, in future applications to protein-ligand binding, using the MS will better capture the solute-solvent van der Waals interactions for narrow channels containing a single layer of water molecules. These would be greatly underestimated or completely lost when using the SAS. The typical 6-12 Lennard-Jones potential over the SAS needs to be modified when carrying out a continuum van der Waals calculation over the MS. We find that a scaled 4-8 Lennard-Jones potential is appropriate for use with the MS and reproduces the radial behavior of the 6-12 Lennard-Jones potential when computing the contribution of successive radial shells of solvent volume; that is, we can find a single scaling factor, f, such that eq A2 is a very good approximation. This allows us to carry out the surface integral over the MS rather than the SAS.

For a given solute atom, we also require the minimum of the 4-8 potential to be at the atomic radius. In terms of the WCA decomposition,33 we are using only the attractive part of the decomposition, since the integral is never evaluated for points inside the solute. The c-vdW energy can then be written as eq A3, where r_i,0 is the radius of atom i and the scaling factor C has been incorporated into B_i, which is obtained by fitting to a training set rather than using the force field parameters. The surface integral is calculated using the same tessellated surface generated by the BRI BEM program19,20 for solving the Poisson equation.
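A discretized version of the surface integral in eq A3 can be sketched as a sum over surface elements. This is an illustrative sketch only. It does not use the BRI BEM tessellation: a sphere sampled with a Fibonacci lattice stands in for the molecular surface, and the function names, the radius, and the B value are all hypothetical.

```python
import numpy as np

def fibonacci_sphere(center, radius, n_points):
    """Quasi-uniform points on a sphere, with outward unit normals and equal area weights."""
    k = np.arange(n_points) + 0.5
    polar = np.arccos(1.0 - 2.0 * k / n_points)
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * k
    normals = np.column_stack((np.sin(polar) * np.cos(azimuth),
                               np.sin(polar) * np.sin(azimuth),
                               np.cos(polar)))
    points = np.asarray(center, dtype=float) + radius * normals
    areas = np.full(n_points, 4.0 * np.pi * radius ** 2 / n_points)
    return points, normals, areas

def continuum_vdw(atom_xyz, atom_radii, atom_b, points, normals, areas):
    """Discretized eq A3: sum over atoms i and surface elements k."""
    total = 0.0
    for xyz, r0, b in zip(np.asarray(atom_xyz, dtype=float), atom_radii, atom_b):
        r_vec = points - xyz                         # r_ik for every surface point k
        r = np.linalg.norm(r_vec, axis=1)
        kernel = r0 ** 4 / (2.0 * r ** 8) - 1.0 / r ** 4
        flux = np.einsum("ij,ij->i", r_vec, normals)  # r_ik . n_k
        total += b * np.sum(kernel * flux * areas)
    return total

# Single hypothetical atom whose surface is a sphere at its own radius.
radius, b_coeff = 1.7, 2.0
points, normals, areas = fibonacci_sphere((0.0, 0.0, 0.0), radius, 500)
energy = continuum_vdw([(0.0, 0.0, 0.0)], [radius], [b_coeff], points, normals, areas)
```

For this single-atom case the integral has the closed form -2πB/r_0, which gives a simple check on the discretization. For a real molecule the points, normals, and areas would come from the tessellated molecular surface, with one B_i per atom type.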

Acknowledgment. This is National Research Council of Canada publication number 49593.

Supporting Information Available: Referenced experimental hydration free energies for the molecules in the training and test sets (Table S1), composition of the training and test data sets by chemical class (Table S2), predicted hydration free energies and their components for the molecules in the SAMPL-1 data set (Table S3), and fitted cavity and continuum van der Waals coefficients represented in Figure 1 (Table S4). This information is available free of charge via the Internet at http://pubs.acs.org.

References and Notes

(1) Rashin, A. A. Prog. Biophys. Mol. Biol. 1993, 60, 73–200.
(2) Honig, B.; Sharp, K.; Yang, A.-S. J. Phys. Chem. 1993, 97, 1101–1109.
(3) Gilson, M. K.; Zhou, H. X. Annu. Rev. Biophys. Biomol. Struct. 2007, 36, 21–42.
(4) Eisenberg, D.; McLachlan, A. D. Nature 1986, 319, 199–203.
(5) Kang, Y. K.; Némethy, G.; Scheraga, H. A. J. Phys. Chem. 1987, 91, 4109–4117.
(6) Sitkoff, D.; Sharp, K. A.; Honig, B. J. Phys. Chem. 1994, 98, 1978–1988.
(7) Chambers, C. C.; Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. 1996, 100, 16385–16398.
(8) Marten, B.; Kim, K.; Cortis, C.; Friesner, R. A.; Murphy, R. B.; Ringnalda, M. N.; Sitkoff, D.; Honig, B. J. Phys. Chem. 1996, 100, 11775–11788.
(9) Gallicchio, E.; Zhang, L. Y.; Levy, R. M. J. Comput. Chem. 2002, 23, 517–529.
(10) Tan, C.; Yang, L.; Luo, R. J. Phys. Chem. B 2006, 110, 18680–18687.
(11) Cabani, S.; Gianni, P.; Mollica, V.; Lepori, L. J. Solution Chem. 1981, 10, 563–595.
(12) Guthrie, J. P. J. Phys. Chem. B 2009, 113, 4501–4507.
(13) Huron, M.-J.; Claverie, P. J. Phys. Chem. 1972, 76, 2123–2133.
(14) Floris, F.; Tomasi, J. J. Comput. Chem. 1989, 10, 616–627.
(15) Floris, F. M.; Tomasi, J.; Pascual-Ahuir, J. L. J. Comput. Chem. 1991, 12, 784–791.
(16) Zacharias, M. J. Phys. Chem. A 2003, 107, 3000–3004.
(17) Levy, R. M.; Zhang, L. Y.; Gallicchio, E.; Felts, A. K. J. Am. Chem. Soc. 2003, 125, 9523–9530.
(18) Tan, C.; Tan, Y. H.; Luo, R. J. Phys. Chem. B 2007, 111, 12263–12274.
(19) Purisima, E. O.; Nilar, S. H. J. Comput. Chem. 1995, 16, 681–689.
(20) Purisima, E. O. J. Comput. Chem. 1998, 19, 1494–1504.
(21) Bhat, S.; Purisima, E. O. Proteins: Struct., Funct., Bioinf. 2006, 62, 244–261.
(22) Jakalian, A.; Bush, B. L.; Jack, D. B.; Bayly, C. I. J. Comput. Chem. 2000, 21, 132–146.
(23) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M., Jr.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179–5197.
(24) Rankin, K. N.; Sulea, T.; Purisima, E. O. J. Comput. Chem. 2003, 24, 954–962.
(25) Naïm, M.; Bhat, S.; Rankin, K. N.; Dennis, S.; Chowdhury, S. F.; Siddiqi, I.; Drabik, P.; Sulea, T.; Bayly, C.; Jakalian, A.; Purisima, E. O. J. Chem. Inf. Model. 2007, 47, 122–133.
(26) Gallicchio, E.; Kubo, M. M.; Levy, R. M. J. Phys. Chem. B 2000, 104, 6271–6285.
(27) R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2005.
(28) Rizzo, R. C.; Aynechi, T.; Case, D. A.; Kuntz, I. D. J. Chem. Theory Comput. 2006, 2, 128–139.
(29) Mobley, D. L.; Dill, K. A.; Chodera, J. D. J. Phys. Chem. B 2008, 112, 938–946.
(30) Wang, J.; Wang, W.; Huo, S.; Lee, M.; Kollman, P. A. J. Phys. Chem. B 2001, 105, 5055–5067.
(31) Pearlman, R. S. Chem. Des. Auto. News 1987, 2, 1–7.
(32) Halgren, T. A. J. Comput. Chem. 1999, 20, 730–748.

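(33) Weeks, J. D.; Chandler, D.; Andersen, H. C. J. Chem. Phys. 1971, 54, 5237–5247.

$$
\text{c-vdW} = C \sum_i \int_S \left( \frac{A_i}{9 r_{ik}^{12}} - \frac{B_i}{3 r_{ik}^{6}} \right) \mathbf{r}_{ik} \cdot \mathbf{n}_k \, ds \tag{A1}
$$

$$
\int_{R}^{R+2r_{\mathrm{sol}}} (\text{6-12 L-J}) \, d\Omega \approx f \int_{R-r_{\mathrm{sol}}}^{R+r_{\mathrm{sol}}} (\text{4-8 L-J}) \, d\Omega \tag{A2}
$$

$$
\text{c-vdW} = \sum_i \int_S B_i \left( \frac{r_{i,0}^{4}}{2 r_{ik}^{8}} - \frac{1}{r_{ik}^{4}} \right) \mathbf{r}_{ik} \cdot \mathbf{n}_k \, ds \tag{A3}
$$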
