2,601
Views
33
CrossRef citations to date
0
Altmetric
Research Article

Development of linear and nonlinear predictive QSAR models and their external validation using molecular similarity principle for anti-HIV indolyl aryl sulfones

&
Pages 980-995 | Received 19 Sep 2007, Accepted 29 Oct 2007, Published online: 20 Oct 2008

Abstract

Quantitative structure–activity relationship (QSAR) studies have been carried out on indolyl aryl sulfones, a class of novel HIV-1 non-nucleoside reverse transcriptase inhibitors, using physicochemical, topological and structural parameters along with appropriate indicator variables. The statistical tools used were linear methods (e.g., stepwise regression analysis, partial least squares (PLS), factor analysis followed by multiple regression (FA-MLR), genetic function approximation combined with multiple linear regression (GFA-MLR) and GFA followed by PLS or G/PLS and nonlinear method (artificial neural network or ANN). In case of physicochemical parameters, GFA-MLR generated the best Equation (n = 97, R2 = 0.862, Q2 = 0.821). Using topological parameters, the best Equation (based on leave-one-out Q2) was obtained with stepwise regression technique (n = 97, R2 = 0.867, Q2 = 0.811). When topological and physicochemical parameters were used in combination, statistical quality increased to a great extent (n = 97, R2 = 0.891, Q2 = 0.849 from stepwise regression). Furthermore, the whole dataset had been divided into test (25% of whole dataset) and training (remaining 75%) sets. Models were developed based on the training set and predictive potential of such models was checked from the test set. The selection of the training set was based on K-means clustering of the standardized descriptors (topological and physicochemical). In this case also the best results were obtained with stepwise regression (n = 72, R2 = 0.906, Q2 = 0.853) but external predictive capacity of this model () was inferior to the model developed from GFA-MLR technique (R2 = 0.883, Q2 = 0.823, ). However, the squared regression coefficient between observed activity and predicted activity values of the test set compounds for the best linear model, i.e., GFA-MLR (r2 = 0.736) was lower in comparison to the best nonlinear model developed using artificial neural network (r2 = 0.781). Thus, based on external validation, the ANN models were superior to the linear models. The predictive potential of the best linear Equation (stepwise regression model) was superior to that of the previously published CoMFA (Q2 = 0.81, SDEPTest = 0.89) on the same data set (Ragno R. et al., J Med Chem 2006, 49, 3172–3184). Furthermore, the physicochemical parameter based models also supported the previous observations based on docking (Ragno R. et al., J Med Chem 2005, 48, 213–223).

Introduction

Acquired immunodeficiency syndrome (AIDS), characterized by opportunistic infections (T4 cell falls below 200/μL) and opportunistic neoplasms, is one of the leading causes of death worldwide [Citation1]. About 39.5 million people are living with HIV positive till 2006. Nearly 4.3 million people have newly infected with HIV, and AIDS have claimed 2.9 million people including 3,80,000 children under 15 years in the year of 2006 [Citation2].

There are generally two serotypes of HIV virus, which can be distinguished genetically and antigenetically. HIV-1 causes more fatal and rapid infection than HIV-2. There are three subgroups of HIV-1 including M (major or main), N (new) and O (outlier) [Citation3]. HIV is a special type of retrovirus of lentivirus family. There are at least nine recognizable genes in the HIV virus, but the major structures are composed of gag, pol and env. The other six genes are involved in the infection process as well as regulatory production in gag, pol and env genes. The gag gene is “group specific antigen” composed of viral nucleocapsid. It is responsible for development of virus in the absence of pol and env genes. The pol gene codes for HIV enzymes—reverse transcriptase, protease and integrase. Finally the env gene codes for the two major envelope's glycoproteins (gp120 and gp4) [Citation4]. When HIV enters into the blood stream, it binds its glycoprotein (gp120) to a T4 cell's or macrophage's, CD4 receptor and the coreceptor CCR5 and/or CXCR4. Then it fuses with the cell membrane and penetrate through it. Inside the cell virion sheds off its coat and leaves its envelope. Single stranded RNA is converted to single stranded DNA using reverse transcriptase from which DNA synthesis of a second strand occurs to form double stranded DNA. This migrates to the nucleus of the cell and integrates into host nucleus by integrase. This provirus transfers its genetic codes to that of the host which becomes a virus factory [Citation5]. Newly formed HIV core proteins, enzymes and genomic RNA gather inside the cell and an immature viral particles composed of long chain proteins are not infectious. These are divided into small fragments to make them infectious using protease enzyme [Citation6]. Thus, the inhibition of reverse transcriptase and protease are the most effective target for anti-HIV drug development.

Because of emergence of resistance to present antiviral therapy, development of new anti-HIV drugs is necessary. This necessitates QSAR studies for developing good predictive models involving ligands with different functionalities acting on different anti-HIV targets. Villar et al. have developed a theoretical model using probabilistic neural network that discriminates between active and non-active drugs against HIV-1 with four different mechanisms of action of active drugs [Citation7]. QSAR of HIV-1 reverse transcriptase inhibitory activities of 2-(2,6-dihalophenyl-2-yl)-thiazolidine-4-ones have been studied by Probhakar et al. using topological descriptors obtained from DRAGON software [Citation8]. Senese and Hopfinger have used HIV-1 protease inhibitors derived from norstatine containing 3S-amino-2S-hydroxy-4-phenylbutanoic acid core to construct 4-D QSAR model [Citation9]. 3D-QSAR studies have been performed on a series of inhibitors of HIV-1 integrase with respect to their inhibition of 3′-processing and 3′-end joining steps in vitro by Makhija and Kulkarni [Citation10]. CoMFA and CoMSIA and docking studies have been performed by Buolamwini et al. on conformationally restrained cinnamoyl HIV-1 integrase inhibitors to explore binding mode at the active site [Citation11]. Niwa has predicted responses for biological targets including HIV-1 protease for diverse molecules using probabilistic neural network and atom type descriptors from their chemical structure for generating focused libraries, selecting compounds for screening and annotating biological activity for those compounds whose activities are unknown [Citation12]. Weekes and Fogel have used evolutionary optimization, back propagation and data preparation issues in QSAR modeling of HIV inhibition by HEPT derivatives. Evolutionary computation gives appropriate set of weights and bias terms associated with artificial neural network that minimize selected functions of the error between the actual and desired outputs [Citation13]. Hopfinger and Senese have used simple clustering technique to facilitate and improve model selection and test set prediction (training set of 50 tetrahydropyrimidine 2-one based inhibitors of HIV-1 protease) [Citation14]. A 3D-QSAR was applied to a set of dipyridodiazepinone derivatives which is active against wild and mutant type HIV-1 reverse transcriptase by Pungpo et al. [Citation15]. Ragno et al. have also performed docking and 3D-QSAR (CoMFA) on indolyl aryl sulfones to explore binding mode at HIV-1 reverse transcriptase binding site [Citation16].

The present group of authors [Citation1,17–25] has developed a few anti-HIV QSARs involving different series of chemicals acting on different targets. In continuation of such efforts, the present paper deals with modeling of anti-HIV-1 activity of reverse transcriptase inhibitor indolyl aryl sulfones reported by Ragno et al. [Citation16,26]. Some compounds were excluded from our study due to lack of quantitative activity data. We have modeled the data set using linear techniques like multiple linear regression (with stepwise regression, factor analysis and genetic function approximation as variable selection strategy) and partial least squares and compared the results with those obtained from nonlinear method (feed-forward back propagation artificial neural network). The objectives of the present study include development of QSAR models with physicochemical significance in one hand and development of predictive model with good validation characteristics on the other hand for anti-HIV-1 activity of reverse transcriptase inhibitor indolyl aryl sulfones.

Materials and methods

The anti-HIV data (EC50) of indolyl aryl sulfone derivatives [Citation16,26] had been converted to logarithmic scale [pC = − logEC50 (M)] and then used for the QSAR study. Though the original paper reported 117 compounds in total, twenty of the reported compounds do not have exact biological activity values. Thus, 97 compounds were considered for the present QSAR study (). There are four different positions of substitutions: one is the fragment flanked between the indole nucleus and phenyl ring (X) and the second one is the substitution on the phenyl ring (). The other two positions are second and fifth positions of the indole nucleus.

Table I. Structural features, observed and calculated data of HIV-1 reverse transcriptase inhibitory activity of indole aryl sulfones.

For the construction of the linear models, multiple regression (with stepwise regression, factor analysis and genetic function approximation as variable selection tools) and partial least squares were used.

Descriptors

Physicochemical parameters like hydrophobicity (π), electronic (Hammett σ), steric (molar refractivity MR and STERIMOL L, B1 to B4) substituent constants () of phenyl ring substituents were taken from reference [Citation27]. The topological descriptors including Balaban index (Jx), connectivity indices ( etc), kappa shape indices ( etc), molecular flexibility index (Φ), Wiener index, Zagreb index, E-state indices ( etc) were calculated using Cerius2 version 10 [Citation28] on a silicon graphics computer. Besides these, structural descriptors like number of chiral centres, molecular weight (MW), number of rotatable bond (Rotlbonds) and indicator parameters as defined in were also used. For the development of the QSAR models we initially considered physicochemical and topological parameters separately and then combination of both types of parameters along with indicators variables were used. In the total pool of descriptors, there were 23 physicochemical, 52 topological and 9 indicator parameters from which variable selection was made using different strategies as detailed below.

Table II. Values of physicochemical parameters (substituent constants)#.

Table III. Definition of indicator variables.

Cluster analysis and validation

Initially QSAR models were developed on the whole data set. The models were crossvalidated using leave-one-out method. However, internal validation does not ascertain that the model will perform well on a new set of data. Thus, the whole data set was divided into training and test sets and the models developed from training set were externally validated using the test set. Predictive capacity of a model for new chemical entities is influenced by chemical nature of the training set molecules used for development of the model [Citation29–31]. In actual case, the test set molecules will be predicted well when these molecules are structurally very similar to the training set molecules. The reason is that the model has captured all features common to the training set molecules.

Any QSAR modeling should ultimately lead to statistically robust models capable of making accurate and reliable predictions of biological activities of compounds. When QSAR models are developed, it is important to validate any fitted models to check that it is plausible that its predictions will carry over to fresh data not used in the model fitting exercise. The validation strategies check the reliability of the developed models for their possible application on a new set of data, and confidence of prediction can thus be judged. Often, truly external data points being unavailable for prediction purpose, original data set compounds are divided into training and test sets. A QSAR model's ability to predict the properties of unknown chemicals depends largely on the nature of the training set and the algorithm used to establish the structure–activity relationship [Citation29]. The process is based on the assumption that a molecule that is structurally very similar to the training set molecules will be predicted well because the model has captured features that are common to the training set molecules and is able to find them in the new molecule. On the other hand, a new molecule which has very little in common with the training set data should not be predicted very well, i.e., the confidence in its prediction should be low [Citation30]. A model's predictive accuracy and confidence for different unknown chemicals varies according to how well the training set represents the unknown chemicals and how robust the model is in extrapolating beyond the chemistry space defined by the training set. Therefore, assessing a model's prediction accuracy outside the training domain is a vital step toward defining the application domain of a model for the regulatory acceptance of QSARs. The selection of the training set is significantly important in QSAR analysis. In this paper, the data set was divided into training and test sets (75% and 25% respectively of the total number of compounds) based on clusters obtained from K-means clustering applied on standardized descriptor matrix. All the parameters were standardized to values between 0 and 1 and the whole dataset was clustered into seven subgroups from each of which 25% of compounds were selected as members of the test set. Cluster analysis is a method of arrangement of objects into groups [Citation32–34. It classifies different objects into groups in such a way that the degree of association between two objects is maximum if they possess same group and otherwise minimum. Most clustering techniques are hierarchical, i.e, the resultant classification has an increasing number of nested classes [Citation32]. There are some non-hierarchical methods e.g., K-means clustering [Citation32–34]. In this method, number of groups or clusters (K) generated is specified by the user. At the end of the analysis the data are split between K clusters. From the results of K-means clustering analysis, one can examine the means for each cluster on each dimension to assess how distinct the K clusters are. After clustering, the test set compounds are selected from each cluster by taking approximately 25% of the compounds from each cluster so that test and training set can represent all clusters and the whole dataset.

For the development of equations different chemometric tools were utilized.

Stepwise regression

In stepwise regression, a multiple-term linear equation is built step-by-step. First, an initial model is recognized and then the model is altered repeatedly at the previous step by adding or removing a predictor variable according to the “stepping criteria” [Citation35]. At the last step the search is terminated when stepping is no longer possible or when a specified maximum number of steps has been reached. Specifically, at each step all variables are reviewed and evaluated to determine which one will contribute most to the equation. The method selected for stepwise regression is forward selection and backward elimination. The criteria “F to Enter” and “F to Remove” determine how significant or insignificant the contribution of a variable in the regression equation respectively for adding to the equation and removing from the equation. A limitation of the stepwise regression search approach is that it presumes that there is a single “best” subset of X variables and seeks to identify it.

PLS

For PLS [Citation36,37], “leave-one-out” method was used for crossvalidation to obtain the optimum number of components. PLS is a useful technique when number of factors is large and they are highly collinear. This technique generalizes and combines features from principal component and multiple regression. In case of PLS analysis on the present data set, based on the standardized regression coefficients, the variables with smaller coefficients were removed from the PLS regression, until there was no further improvement in Q2 value, irrespective of the components. It gives statistically more robust solution than MLR. To avoid overfitting, a strict test for the significance of each consecutive PLS component is necessary and then stopping when the components are non-significant. This ensures that the QSAR equations are selected based on their ability to predict the data rather than to fit the data.

Fa-mlr

Factor analysis [Citation38,39] has been performed to find out the relationship among variables. It reduces the large numbers of variables to few factors from which important variables for multiple linear regression can be identified. It is a data processing step to identify the variables contributing to the response variable. The whole dataset containing biological activity and all descriptor variables is extracted by principle component method and rotated by VARIMAX rotation to obtain Thurston's simple structure. The effective variables are selected from rotated component matrix obtained from the previous operation. Linear regression is performed using these variables.

Gfa-mlr

For the development of genetic function approximation (GFA) model Cerius2 4.10 version has been used. GFA provides a new approach to the problem of developing QSAR models. Genetic algorithms are derived with the spread of mutations in a population. It was initially conceived from two seemingly disparate algorithms - i) Holland's genetic algorithm and ii) Friedman's multivariate adaptive regression splines algorithm [Citation40]. GFA allows construction of models superior to standard techniques and gives additional information not provided by other techniques. A distinctive feature of GFA is that it produces a population of models (e.g., 100), instead of generating a single model, as do most other statistical methods. These multiple models are generated from random initial models using a genetic algorithm. A “fitness function” (lack-of-fit or LOF) is used to measure the quality of each model, so that the best model receives the best fitness score. Features, which are necessary for construction of the model, are automatically selected by GFA. GFA can build not only linear models but also higher order polynomials, splines and Gaussian, i.e., the type of model is user-defined. It can automatically remove outlier and perform classification using spline-based terms. Its random search procedure for building of the model leads to the discovery of highly predictive models.

G/PLS

Genetic partial least squares (G/PLS) model is derived from two methods: i) genetic function approximation (GFA) and ii) partial least squares (PLS) [Citation36,37]. Both of these methods are valuable analytical tools for QSAR modeling where numbers of descriptors are more than samples. Genetic function approximation is used to select the appropriate variables to be used in the development of model. It is followed by PLS regression as fitting technique to weigh the relative contribution of the selected variables in the final model.

ANN

For the purpose of development of nonlinear model, multilayer perceptron (MLP) of “Custom Network Designer” had been selected to design the network. We selected the back propagation method of MLP followed by conjugate gradient descent to train the network. The back propagation method is the most popular method of developing nonlinear model. There are at least three layers including one input layer, one hidden layer and one output layer. Each layer is interconnected with each other. In this method difference between output of the network and the desired output is calculated. This error value is back propagated to the transfer function for adjustment of weight. Through transfer function (sigmoid function), the output is obtained as [Citation41] where Oj is the output of node j and β is a gain, being able to adjust the form of the function. Usually β is taken as 1. Using the error signal to adjust the connected weights, the following adjusted weights are obtained for the output layer.

In back propagation, the gradient vector of the error surface is calculated. This vector points in the direction of steepest descent from the current point, so one knows that if one moves along it a “short” distance, one will decrease the error. A sequence of such moves will eventually find a minimum of some sort.

Conjugate gradient descent method is a good secondary and advanced method of training multilayer perceptron. It is generally used for the network of large numbers of weights and/or multiple output units. It is a batch update algorithm whereas back propagation adjusts the weights of the network. Learning rate and momentum of each epoch are adjusted and weight decay is regularized.

Most work on assessing performance in neural modeling concentrates on approaches to resampling. A neural network is optimized using a training subset. Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. Then, a third subset (the test subset) is used to perform an unbiased estimation of the network's likely performance.

Although the use of a test subset set allows us to generate unbiased performance estimates, these estimates may exhibit high variance. Ideally, one would like to repeat the training procedure a number of different times, each time using new training, selection and test cases drawn from the population - then, one could average the performance prediction over the different test subsets, to get a more reliable indicator of generalization performance.

In reality, one seldom has enough data to perform a number of training runs with entirely separate training, selection and test subsets. Crossvalidation is the simplest resampling technique. We have cross-validated the network using 15 resampling. Using different numbers of hidden layers and different numbers of units per layer it was shown that the one hidden layer of 39 units had good predictive capacity on this dataset.

Model quality

The statistical qualities of the multiple regression equations [Citation42] were judged by the parameters like explained variance (), correlation coefficient (R), standard error of estimate (s) and variance ratio (F) at specified degrees of freedom (df). All accepted MLR equations have regression coefficients and F ratios significant at 95% and 99% levels respectively, if not stated otherwise. The generated QSAR equations were validated by leave-one-out or LOO statistics [Citation43,44] and cross-validation R2 (Q2) and predicted residual sum of squares (PRESS) values were reported. In case of external validation, predictive capacity of a model was judged by its application for prediction of test set activity values and calculation of predictive R2 () value.

Softwares

MINITAB [Citation45] was used for stepwise regression and PLS whereas SPSS [Citation46] and STATISTICA [Citation47] were used for FA-MLR and ANN respectively. Cerius2 version 4.10 [Citation28] was used for GFA and G/PLS analyses.

Results and discussions

QSAR of the whole data set using physicochemical descriptors and indicator variables

Stepwise regression

The following best equation was obtained with ten variables using F criterion (F = 4 for inclusion; F = 3.9 for exclusion).

All regression coefficients were significant at 95% confidence level and the corresponding confidence intervals were mentioned within parentheses. Equation 1 could explain 83.7% of the variance of the anti-HIV activity while the leave-one-out predicted variance was 81.6%. Equation 1 contains 10 number of independent variables, which is considerably large for 97 compounds, though it maintains the recommended 1:5 ratio for the number of descriptors and number of observations [Citation29]. According to Livingstone and Salt [Citation48,49], when multiple linear regression models are constructed from a large pool of potential independent variables, they suffer from an effect known as “selection bias”. The effect of selection bias is to make the resulting models appear more significant than they really are. According to them, a critical F 5% values should be used to judge the significance of MLR models constructed by best subset selection and the critical value (Fmax) is calculated as follows [Citation49]:

In the above equation, p is the number of predictor variables used in a MLR equation, k is the total number of variables from which the p variables have been chosen and n is the number of compounds. For Equation 1, the values of p, k and n are 10, 32 and 97 respectively. N is defined as k!/(p!(k − p)!) and v2 is the second degree of freedom of the F-statistics, i.e., n − p − 1. For Equation 1, Fmax is calculated to be 27.222 whereas the F value of the equation is 50.324. Thus, Equation 1 passes the criterion of critical F value as prescribed by Livingstone and Salt [Citation49].

The positive value of hydrophobic substituent constant (πm) indicates that the anti-HIV-1 activity increases with increase in lipophilicity of meta substituents of the phenyl ring. The previously reported docking study [Citation16] indicates that the phenyl ring of indolyl aryl sulfones occupies a hydrophobic aromatic-rich pocket formed mainly by the side chain of Tyr181, Tyr188, Phe227 and Tryp229. This explains why compound 29 (containing 3,5-dimethylphenyl moiety) has better anti-HIV-1 activity than compound 18 (containing 2-amino-5-chlorophenyl moiety). The negative coefficients of the STERIMOL width parameters (B1) at the ortho and para positions of the phenyl ring also indicate that the anti-HIV-1 activity decreases with increase in width of the ortho and para substituents. This indicates that the pocket within which the phenyl ring fits is of restricted size. The chlorine atom at the W position of the indole moiety also increases the anti-HIV-1 activity as evidenced from the positive coefficient of IW_Cl. This is supported by the observation of the previously reported docking study that this chlorine atom interacts with Pro236. The positive coefficient of the parameter IZ_amino indicates that a carboxamido substituent at 2 position of the indole nucleus is conducive to the anti-HIV-1 activity while the negative coefficients of IZ_hethydr and IZ_hetami indicate that presence of hydroxyethylamino or hydroxyethylhydrazine group at Z position contributes negatively to the anti-HIV-1 activity. The positive coefficient of IX indicates that –SO2– group at X position increases the activity. The sulfonyl group fits in a little hydrophobic pocket made by the side chains of Val106, Lys103 (only α and β CH2) and Val179 as evidenced from the results of the docking study[Citation16]. The positive coefficient of INH indicates that presence of the unsubstituted indole NH is conducive for the anti-HIV-1 activity as evidenced from the interaction of indole NH with the Lys101 carbonyl by hydrogen bond[Citation16].

PLS

In case of PLS, the following equation with two components was developed.

Equation 2 could explain and predict 82.2% and 79.2% respectively of the variance of the anti-HIV activity, the values being slightly inferior to the corresponding values in case of stepwise regression equation. The negative coefficient of the electronic parameter (σm) indicates that presence of electron withdrawing meta substituent on the phenyl ring decreases the anti-HIV-1 activity.

Fa-mlr

From the factor analysis on the data matrix consisting of anti-HIV activity of indolyl aryl sulfones and physicochemical parameters with indicator variables, it was observed that 8 factors could explain the data matrix to the extent of 96.531%. The anti-HIV activity was highly loaded with factor 6 (loaded in IZ_hethy and IZ_amino) and moderately loaded in factor 1 (loaded in πm, MRm, L_m, B1_m, B2_m, B3_m and B4_m), factor 7 (loaded in IZ_hetami) and factor 9 (loaded in Ix). Based on factor analysis the following variables were selected for multiple linear regression. The best equation evolved was as follows:

Equation 3 involved six descriptors explaining and predicting 75.7% and 63.7% respectively of the variance of the activity. The statistical quality of Equation 3 was inferior to both Equations 1 and 2. The critical Fmax value for Equation 3 calculated according to Livingstone and Salt [Citation49] was 17.958. The F value of Equation 3 being 50.890, this equation passed the critical F value test.

GFA

In case of GFA (100,000 iterations), the following equation with 11 variables appeared as the best equation.

Equation 4 was comparable in statistical quality to that of Equation 1. The critical Fmax value for Equation 4 calculated according to Livingstone and Salt [Citation49] was 28.723. The F value of Equation 4 being 48.280, this equation passed the critical F value test. Equation 4 showed negative coefficient of B1 for meta substituent on the phenyl ring while positive coefficient of B3 for the same substituent and this suggested that the meta substituents on the phenyl ring should be of optimum shape for interaction with the enzyme cavity.

G/PLS

In G/PLS (5,000 iterations) the best equation was obtained with nine variables and three components.

Equation 5 was comparable in statistical quality to Equation 2. Equation 5 could explain and predict 81.9% and 78.8% respectively of the variance of the anti-HIV-1 activity.

QSAR of the whole data set using topological descriptors and indicator variables

In search of models of better statistical quality, equations using topological parameters and indicator variables were developed which are shown in . From stepwise regression, Equation 6 with twelve predictor variables was obtained. This equation involves five E-state parameters, two connectivity parameters and one flexibility index. Equation 6 was comparable in statistical quality to Equation 1 obtained from physicochemical descriptors. In case of PLS regression, Equation 7 with nineteen variables and six components (optimized with crossvalidation) was obtained. Equation 7 involves five E-state parameters, two molecular connectivity parameters, four indicator parameters apart from Balaban index, hydrogen bonding parameter and number of rotatable bonds. The predicted variance (Q2) value of Equation 7 was lower than that of the stepwise regression model Equation 6 but was comparable to that of Equation 2, the PLS model obtained from physicochemical and indicator parameters. From FA-MLR, Equation 8 was obtained which showed statistical quality and prediction potential less than those of both stepwise regression and PLS models. Equation 8 was slightly inferior to Equation 3 obtained from the physicochemical descriptors. The GFA model of 100,000 iterations produced the best equation with eleven variables [Equation 9]. The predicted variance of this equation was 80.4% while the explained variance was 84.1%. In case of G/PLS with 5,000 iterations, Equation 10 (five components) with 73.6% predicted variance and 77.0% explained variance was obtained. Because of large pool of descriptors used (from which variables were chosen), Equations 6 and 9 did not pass the criterion of critical Fmax value as recommended by Livingstone and Salt [Citation49]. However, we selected the best equation based on internal validation statistics (see the Overview section below).

Table IV. Different equations obtained from topological descriptors using different statistical methods (whole set).

QSAR of the whole data set using combination of physicochemical and topological descriptors and indicator variables

Attempt was also made to develop models using combined pool of descriptors and the best equations are shown in . Equation 11 was developed from stepwise regression. This equation consisted of eleven predictors which included one flexibility index, three E-state indices, two sterimol parameters and one kappa shape index. The values of explained variance and predicted variance had been improved to some extent on using physicochemical and topological descriptors in combination. Equation 12 involving 18 descriptors (five components) was a PLS model obtained from combined pool of descriptors and statistical quality of this model was better than that of the PLS models obtained from individual group of descriptors [Equations 2 and 7]. From FA-MLR, Equation 13 was developed using six variables. This equation had more predictive power than Equation 3 and Equation 8 obtained from individual group of descriptors. The GFA model obtained from 100,000 iterations (Equation 14) consisted of ten predictors out of which five are topological three are physicochemical and two are indicator variables. In case of G/PLS obtained from 5000 iterations, the best equation (five components) showed 80.8% explained variance and 78.5% predicted variance. shows a comparison of statistical parameters of different models applied on the whole data set. For QSAR using the combined set of descriptors (total 84 in number as a pool from which variables were chosen), we did not check the Fmax criterion of Livingstone and Salt [Citation49]. However, we selected equations based on validation criteria (internal, or external for the next section).

Table V. Different equations obtained from combined set of descriptors using different statistical methods (whole set).

Table VI. Comparative study of statistical parameters of models using different descriptors.

Splitting of the data set into training and test sets

Not even a robust and validated QSAR model can be expected to reliably predict the response for the entire universe of chemicals [Citation50]. Only the predictions for chemicals falling within the applicability domain can be considered reliable. The selection of training and test sets should be based on the proximity of the representative points of the test set to representative points of the training set in the multidimensional descriptor space.There are different methods of rational division of QSAR data set into training and test sets [Citation51] out of which a clustering technique has been used in the present case.

At first all compounds were clustered according to their structural similarities (using K-means clustering technique) as shown in . From each of seven clusters, approximately 25% of the compounds were taken in the test set as shown in by superscript “*”. Using 72 compounds as the training set, equations were developed which are shown in . The test set compounds were used to examine predictive potential of the compounds and predictive R2 values () were noted. While developing the models, both physicochemical and topological descriptors along with indicator variables were used. In this case, stepwise regression gave the best Equation (Equation 16) with twelve predictors including E-state parameters, connectivity index, Balaban index, hydrophobicity parameters etc. In case of PLS, Equation 17 was obtained with eleven variables (six components). Although the value of explained variance was low but predictive power of the model was good. In case of FA-MLR, Equation 18 had shown that except connectivity index all other four variables had negative impact on anti-HIV-1 activity. Equation 19 was obtained from GFA-MLR with 100,000 iterations with nine variables. This equation includes six topological descriptors and three indicator variables. In case of G/PLS with 5000 iterations, Equation 20 was obtained. Like PLS, it has low explained variance but good predictive capacity. It consists of two E-state parameters, two kappa shape indices, one indicator, one hydrophobicity parameter and one connectivity index. shows that the GFA derived model has the maximum value. The squared correlation coefficient values between the observed and predicted values of the test set compounds with (r2) and without () were also noted. All the models have satisfied the requirement of the value of (r2)/r2 being less than 0.1 as recommended by Golbraikh and Tropsha [Citation52].

Table VII. Serial numbers of compounds under different clusters.

Table VIII. Different equations obtained from combined set of descriptors using different statistical methods from the training set.

Table IX. Comparison of predictivity parameters of different models obtained from the training set.

Moreover, value is mainly controlled by the value of , i.e., mean of training data set. Thus, it may not truly reflect the predictive capability on new dataset. Besides squared regression coefficient (r2) between observed and predicted values of the test set compounds does not necessarily mean that the predicted values are very near to observed activity (there may be considerable numerical difference between the values though maintaining an overall good intercorrelation). To better indicate external predictive capacity of a model a modified r2 term () was defined in the following manner [Citation53].

In case of good external prediction predicted values will be very close to observed activity values. So, r2 value will be very near to value. In the best case will be equal to r2 whereas in the worst case value will be zero. In the present case values of all the models [Equations 16–20] are acceptable.

Artificial neural network

For the development of neural network model we had trained the network with the training set using backpropagation followed by conjugate gradient descent method. The network so developed was used for prediction of anti-HIV-1 activity values of the test set compounds. Using different iterations of backpropagation and conjugate gradient descent, varying numbers of hidden layers and its units per layer, a number of models were developed. Neural networks were optimized using a training subset. A separate subset (the selection subset) was used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. Then, a third subset (the test subset) was used to perform an unbiased estimation of the network's likely performance. During this study we have first selected certain values for iterations, numbers of hidden layers, numbers of elements per layer etc. After that we have increased and decreased the numbers of a particular parameter by fixing the other parameters. Here we have presented comparative study of 5 different models in Tables and . The model shown in bold is the best one which we have developed so far. In that model one hidden layer of 39 elements was used. Initialization method selected for network was random uniform. Weight decay was regularized in both phases (decay factor = 0.01, scale factor = 1). Learning rate and momentum of each iteration were adjusted to 0.01 and 0.3 respectively. Number of crossvalidated resampling was set to 15. During 15 resampling, numbers of cases selected for training, selection and test were 36, 19 and 4 respectively. Root mean square error of prediction (RMSEP) of this model was 0.765. In , r2 (correlation coefficient between observed and predicted value) of nonlinear method (ANN) was compared with the r2 value obtained from other linear methods.

Table X. Comparative study of different networks.

Table XI. Comparison of external predictivity characteristics of different ANN models.

Table XII. Comparison of r2 between observed and predicted values of the test set compounds using different techniques.

Overview

Different statistical methods like stepwise regression, PLS, FA-MLR, GFA-MLR, G/PLS have been applied for modeling anti-HIV-1 activity of indolyl aryl sulfone derivatives using physicochemical and topological descriptors along with indicator variables. Different equations indicate that sulfonyl group flanked between the indole nucleus and phenyl ring, NH group of the indole nucleus, hydrophobicity of the substituents on the phenyl nucleus and chlorine atom at the 5 position of indole moiety are necessary for optimum interaction with reverse transcriptase enzyme. These interactions are also supported by the previously published results of docking studies on this group of compounds [Citation16]. In case of the modeling of the whole data set with physicochemical parameters, the best equation based on internal validation characteristics was obtained with GFA-MLR (Q2 = 0.821). When the data set was modeled with topological descriptors, the best model came from stepwise regression (Q2 = 0.811). On using combined set of descriptors, the best model was obtained from stepwise regression (Q2 = 0.849). Again, the whole dataset was divided into training set (72 compounds) and test set (25 compounds). The best model obtained from the training set (stepwise regression) showed good internal predictive power (Q2 = 0.853) which was superior to predictive power of the model (Q2 = 0.81) obtained from 3D-QSAR study published in reference [Citation26]. The external predictive power of the model was also encouraging (). However, the model showing the best external validation parameter was one obtained from GFA-MLR. (). Again, ANN model was developed based on the training set data. The best model obtained from ANN showed a good r2 value (squared regression coefficient between observed and predicted values) for the test set compounds (0.781) which was superior to the corresponding value (0.736) in case of the best linear model (GFA-MLR). This suggests that nonlinear modeling performs better than the linear technique for this data set. The calculated (or predicted) values of the compounds according to Equations 16 and 19, and ANN model (4) are given in . The scatter plots of observed versus predicted values of the test set compounds according to stepwise regression, GFA and ANN models are shown in .

Figure 1. Scatter plots of observed versus predicted values of the test set compounds using (a) stepwise regression model {Equation 16]; (b) GFA-MLR model [Equation 19; (c) artificial neural network model [ANN model 4].

Figure 1.  Scatter plots of observed versus predicted values of the test set compounds using (a) stepwise regression model {Equation 16]; (b) GFA-MLR model [Equation 19; (c) artificial neural network model [ANN model 4].

Conclusions

Among the linear models, the best equation based on internal validation was obtained with stepwise regression while the best model based on external validation was obtained from GFA-MLR. Again, ANN models were better than GFA-MLR model based on external validation. Thus, nonlinear modeling performs better than the linear technique for this data set.

Acknowledgements

Financial support under the DST Fast Track Scheme for Young Scientists (DST, Govt. of India, New Delhi) is thankfully acknowledged.

References

  • K Roy, and J Leonard. (2004). QSAR modeling of HIV-1 reverse transcriptase inhibitor 2-amino-6-arylsulfonylbenzonitriles and congeners using molecular connectivity and E-state parameters. Bioorg Med Chem 12:745–754.
  • www.unaids.org
  • pathmicro.med.sc.edu/lecture/hiv2000.htm.
  • www.rhodes.edu/biology/glindquester/viruses/pagespass/hiv/hiv.html
  • uhavax.hartfold.edu/bugl/hiv.htm
  • www.aegis.com/topics/basics/hivandaids.html
  • S Villar, L Santana, and E Uriate. (2006). PNN model for in silico evaluation of anti-HIV activity and mechanism of action. J Med Chem 49 (3):118–1124.
  • YS Prabhakar, RK Rawal, MK Gupta, VR Solomon, and SB Katti. (2005). Topological descriptors in modeling the HIV inhibitory activity of 2-aryl 3-pyridyl-thiazolidin 4-ones. Comb Chem High Throughput Screen 8 (5):431–437.
  • CL Senese, and AJ Hopfinger. (2003). Receptor-independent 4D-QSAR analysis of a set of norstatine derived inhibitors of HIV-1 protease. J Chem Inf Comput Sci 43:1297–1307.
  • MT Makhija, and VM Kulkarni. (2002). 3D-QSAR and molecular modeling of HIV-1 integrase inhibitors. J Comp Aided Mol Des 16 (3):181–200.
  • JK Buolamwini, and H Assefa. (2002). CoMFA and CoMSIA 3D-QSAR and docking studies on conformationally-restrained cinnamoyl HIV-1 integrase inhibitors: Exploration of a binding mode at the active site. J Med Chem 45 (4):841–852.
  • T Niwa. (2004). Prediction of biological targets using PNN and atom-type descriptors. J Med Chem 47 (10):2645–2650.
  • D Weekes, and GB Fogel. (2003). Evolutionary optimization, backpropagation and data preparation issues in QSAR modeling of HIV inhibition by HEPT derivatives. Biosystems 72:149–158.
  • CL Senese, and AJ Hopfinger. (2003). A simple clustering technique to improve QSAR model selection and predictivity: Application to a receptor independent 4D-QSAR analysis of cyclic urea derived inhibitors of HIV-1 protease. J Chem Inf Comp Sci 43 (6):2180–2193.
  • P Pungpo, and S Hannongbua. (2000). 3D-QSAR study on HIV-1 RT inhibitors in the class of dipyridodiazepinone derivatives, using CoMFA. J Mol Graph Model 18 (6):581–590. & 601
  • R Ragno, M Artico, GD Martino, GL Regina, A Coluccia, AD Pasquali, and R Silvestri. (2005). Docking and 3-D QSAR studies on indolyl aryl sulfones. Binding mode exploration at the HIV-1 reverse transcriptase non-nucleoside binding site and design of highly active N-(2-Hydroxyethyl) carboxamide and N-(2-Hydroxyethyl) carbohydrazide derivatives. J Med Chem 48:213–223.
  • JT Leonard, and K Roy. (2004). Classical QSAR modeling of HIV-1 reverse transcriptase inhibitor 2-amino-6-arylsulfonylbenzonitriles and congener. QSAR Comb Sci 23:23–35.
  • JT Leonard, and K Roy. (2003). QSAR modeling of anti-HIV activities of alkenyldiarylmethanes using topological and physicochemical descriptors. Drug Des Discov 18:165–180.
  • JT Leonard, and K Roy. (2004). Classical QSAR modeling of CCR5 receptor binding affinity of substituted benzylpyrazoles. QSAR Comb Sci 23:387–398.
  • K Roy, and JT Leonard. (2005). Classical QSAR modeling of anti-HIV 2,3-diaryl-1,3-thiazolidin-4-ones. QSAR Comb Sci 24:579–592.
  • K Roy, and JT Leonard. (2005). QSAR by LFER model of cytotoxicity data of anti-HIV 5-phenyl-1-phenylamino-1H-imidazole derivatives using principal component factor analysis and genetic function approximation. Bioorg Med Chem 13:2967–2973.
  • K Roy, and JT Leonard. (2006). Topological QSAR modeling of cytotoxicity data of anti-HIV5-phenyl-1-phenylamino-imidazole derivatives using GFA, G/PLS, FA and PCRA techniques. Indian J Chem 45A:126–137.
  • K Roy, and JT Leonard. (2005). QSAR analyses of 3-(4-benzylpiperidin-1-yl)-N phenylpropylamine derivatives as potent CCR5 antagonists. J Chem Inf Model 45:1352–1368.
  • JT Leonard, and K Roy. (2006). QSAR by LFER model of HIV protease inhibitor mannitol derivatives using FA-MLR, PCRA, and PLS techniques. Bioorg Med Chem 14:1039–1046.
  • JT Leonard, K Roy, and QSAR Comparative. (2006). modeling of CCR5 receptor binding affinities of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas. Bioorg Med Chem Lett 16:4467–4474.
  • R Ragno, A Coluccia, GL Regina, GD Martino, F Piscitelli, A Lavecchia, E Novellino, A Bergamini, C Ciaprini, A Sinistro, G Maga, E Crespan, M Artico, and R Silvestri. (2006). Design molecular modeling, synthesis, and anti-HIV-1 activity of new indolyl aryl sulfones. Novel derivatives of the indole-2-carboxamide. J Med Chem 49:3172–3184.
  • C Hansch, A Leo, and D Hoekman. Exploring QSAR Hydrophobic, electronic and steric constants. Washington, DC: American Chemical Society; (1995).
  • Cerius2 Version 4.10 is a product of Accelrys Inc., San Diego, CA.
  • L Eriksson, J Jaworska, AP Worth, MTD Cronin, RM McDowell, and P Gramatica. (2003). Methods for reliability and uncertainty assessment and for applicability evaluations of classification and regression-based QSARs. Environ Health Perspect 111:1361–1375.
  • R Guha, and PC Jurs. (2005). Determining the validity of a QSAR Model-A classification approach. J Chem Inf Model 45:65–73.
  • JT Leonard, and K Roy. (2006). On selection of training and test sets for the development of predictive QSAR models. QSAR Comb Sci 25 (3):235–251.
  • BS Everitt, S Landau, and M Leese. Cluster analysis. Edward Arnold: London; (2001).
  • RB Kowalski, S Wold. Handbook of statistics. Amsterdam: North Holland Publishing Company; (1982).
  • GM Downs, P. Willett. In: H van de Waterbeemd, editor. Advanced computer assisted techniques in drug discovery. Weinheim (Ger.): VCH; (1995). p 111–130.
  • RB Darlington. Regression and linear models. New York: McGraw-Hill; (1990).
  • S Wold. In: H van de Waterbeemd, editor. Chemometric methods in molecular design. Weinheim: VCH; (1995). p 195.
  • Y Fan, LM Shi, KW Kohn, Y Pommier, JN Weinstein. (2001). Quantitative structure–antitumor activity relationships of camptothecin analogues: Cluster analysis and genetic algorithm-based studies. J Med Chem 44:3254.
  • R Franke. Theoretical drug design methods. Amsterdam: Elsevier, (1984). p 184.
  • R Franke, A Gruska. Chemometric methods in molecular design. In: H van de Waterbeemd, editor. Weinheim: VCH; (1995). p 113.
  • D Rogers, and AJ Hopfinger. (1994). Application of genetic function approximation to quantitative structure–activity relationships and quantitative structure–property relationships. J Chem Inf Comput Sci 34:854–866.
  • Y Tang, HL Jiang, KX Chen, and RY Ji. (1996). QSAR study of artemisinin (Qinghaosu) derivatives using neural network method. Indian J Chem 35B:325–332.
  • GW Snedecor, WG Cochran. . In: H van de Waterbeemd, editor. Statistical methods. New Delhi: Oxford and IBH; (1967). p 381.
  • S Wold, L Eriksson. In: H van de Waterbeemd. , editor. Chemometric methods in molecular designWeinheim: VCH; (1995). p 312.
  • AK Debnath. In AK Ghose, and VN Viswanadhan, editors. Combinatorial library design and evaluation. New York: Marcel Dekker Inc. (2001). p 73.
  • MINITAB is a statistical software of Minitab Inc.; USA.
  • SPSS is a statistical software of SPSS Inc.; USA.
  • STATISTICA is a statistical software of STATSOFT Inc.; USA.
  • DJ Livingstone, and DW Salt. (2005). Judging the significance of multiple linear regression models. J Med Chem 48:661–663.
  • http://www.port.ac.uk/research/cmd/research/selectionbiasinmultipleregression/
  • P Gramatica. (2007). Principles of QSAR models validation: Internal and external. QSAR Comb Sci 26:694–701.
  • K Roy. On some aspects of validation of predictive QSAR models. Expert Opin Drug Discov 2007;2:1567–1577.
  • A Golbraikh, and A Tropsha. (2002). Beware of q2!. J Mol Graphics Mod 20:269–276.
  • P Roy, and K Roy. (2007). On some aspects of variable selection for partial least squares regression models. QSAR Comb Sci 26: http://dx.doi.org/10.1002/qsar.200710043

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.