Review

Current approaches for choosing feature selection and learning algorithms in quantitative structure–activity relationships (QSAR)

Pages 1075-1089 | Received 24 Sep 2018, Accepted 26 Oct 2018, Published online: 03 Nov 2018

References

  • Hansch C, Muir RM, Fujita T, et al. The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients. J Am Chem Soc. 1963;85(18):2817–2824.
  • Hansch C, Fujita T. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc. 1964;86(8):1616–1626.
  • Hansch C, Hoekman D, Gao H. Comparative QSAR: toward a deeper understanding of chemicobiological interactions. Chem Rev. 1996;96(3):1045–1076.
  • Yasri A, Hartsough D. Toward an optimal procedure for variable selection and QSAR model building. J Chem Inf Comput Sci. 2001;41(5):1218–1227.
  • Katritzky AR, Lobanov VS, Karelson M. QSPR: the correlation and quantitative prediction of chemical and physical properties from structure. Chem Soc Rev. 1995;24(4):279–287.
  • Siraki A, Chevaldina T, Moridani M, et al. Quantitative structure-toxicity relationships by accelerated cytotoxicity mechanism screening. Curr Opin Drug Discov Devel. 2004;7(1):118–125.
  • Nigam A, Klein MT. A mechanism-oriented lumping strategy for heavy hydrocarbon pyrolysis: imposition of quantitative structure-reactivity relationships for pure components. Ind Eng Chem Res. 1993;32(7):1297–1303.
  • Driebergen R, Moret E, Janssen L, et al. Electrochemistry of potentially bioreductive alkylating quinones: part 3. Quantitative structure-electrochemistry relationships of aziridinylquinones. Anal Chim Acta. 1992;257(2):257–273.
  • Tömpe P, Clementis G, Petnehazy I, et al. Quantitative structure-electrochemistry relationships of α,β-unsaturated ketones. Anal Chim Acta. 1995;305(1–3):295–303.
  • Hemmateenejad B, Yazdani M. QSPR models for half-wave reduction potential of steroids: A comparative study between feature selection and feature extraction from subsets of or entire set of descriptors. Anal Chim Acta. 2009;634(1):27–35.
  • Mozrzymas A, Rozycka-Roszak B. Prediction of critical micelle concentration of nonionic surfactants by a quantitative structure–property relationship. Comb Chem High Throughput Screen. 2010;13(1):39–44.
  • Fourches D, Pu D, Tassa C, et al. Quantitative nanostructure−activity relationship modeling. ACS Nano. 2010;4(10):5703–5712.
  • Mauri A, Consonni V, Pavan M, et al. Dragon software: an easy approach to molecular descriptor calculations. MATCH Commun Math Comput Chem. 2006;56(2):237–248.
  • Yap CW. PaDEL‐descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011;32(7):1466–1474.
  • Shahlaei M. Descriptor selection methods in quantitative structure–activity relationship studies: a review study. Chem Rev. 2013;113(10):8093–8103.
  • Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997;97(1–2):273–324.
  • Kohavi R, John GH. The wrapper approach. Feature extraction, construction and selection. Springer: New York; 1998. p. 33–50.
  • Dutta D, Guha R, Wild D, et al. Ensemble feature selection: consistent descriptor subsets for multiple QSAR models. J Chem Inf Model. 2007;47(3):989–997.
  • Blanchet FG, Legendre P, Borcard D. Forward selection of explanatory variables. Ecology. 2008;89(9):2623–2632.
  • Unger SH. Consequences of the Hansch paradigm for the pharmaceutical industry. Med Chem. 1980;9:47–119. Elsevier.
  • Merkwirth C, Mauser H, Schulz-Gasch T, et al. Ensemble methods for classification in cheminformatics. J Chem Inf Comput Sci. 2004;44(6):1971–1978.
  • Guyon I, Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422.
  • Wegner J, Fröhlich H, Zell A. Feature selection for descriptor based classification models. 1. Theory and GA-SEC algorithm. J Chem Inf Comput Sci. 2004;44(3):921–930.
  • Santos V. Uma perspectiva da modelagem QSPR para triagem/desenho de catalisadores para a síntese de carbonatos oleoquímicos. Porto Alegre: Pontifícia Universidade Católica do Rio Grande do Sul; 2018.
  • Seierstad M, Agrafiotis DK. A QSAR model of hERG binding using a large, diverse, and internally consistent training set. Chem Biol Drug Des. 2006;67(4):284–296.
  • Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol. 1992;45(2):265–282.
  • Yang S-P, Song S-T, Tang Z-M, et al. Optimization of antisense drug design against conservative local motif in simulant secondary structures of HER-2 mRNA and QSAR analysis. Acta Pharmacol Sin. 2003;24(9):897–902.
  • Szaleniec M, Tadeusiewicz R, Witko M. How to select an optimal neural model of chemical reactivity? Neurocomputing. 2008;72(1–3):241–256.
  • Darlington RB. Regression and linear models. New York: McGraw-Hill; 1990.
  • Yousefinejad S, Hemmateenejad B. Chemometrics tools in QSAR/QSPR studies: A historical perspective. Chemom Intell Lab Syst. 2015;149:177–204.
  • Liu S-S, Liu H-L, Yin C-S, et al. VSMP: a novel variable selection and modeling method based on the prediction. J Chem Inf Comput Sci. 2003;43(3):964–969.
  • Furnival GM, Wilson RW. Regressions by leaps and bounds. Technometrics. 1974;16(4):499–511.
  • Chen B-K, Horvath C, Bertino JR. Multivariate analysis and quantitative structure-activity relationships. Inhibition of dihydrofolate reductase and thymidylate synthetase by quinazolines. J Med Chem. 1979;22(5):483–491.
  • Zhou Y-X, Xu L, Wu Y-P, et al. A QSAR study of the antiallergic activities of substituted benzamides and their structures. Chemom Intell Lab Syst. 1999;45(1–2):95–100.
  • Goldberg DE. Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley; 1989.
  • Rogers D, Hopfinger AJ. Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci. 1994;34(4):854–866.
  • Holland JH. Adaptation in natural and artificial systems. Ann Arbor, MI: The University of Michigan Press; 1975.
  • Lucasius CB, Kateman G. Understanding and using genetic algorithms Part 1. Concepts, properties and context. Chemom Intell Lab Syst. 1993;19(1):1–33.
  • Aoyama T, Suzuki Y, Ichikawa H. Neural networks applied to structure-activity relationships. J Med Chem. 1990;33(3):905–908.
  • Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal. 2000;22(5):717–727.
  • Winkler DA, Burden FR. Application of neural networks to large dataset QSAR, virtual screening, and library design. Combinatorial library. Totowa, NJ: Springer; 2002. p. 325–367.
  • Wikel JH, Dow ER. The use of neural networks for variable selection in QSAR. Bioorg Med Chem Lett. 1993;3(4):645–651.
  • Tetko IV, Tanchuk VY, Chentsova NP, et al. HIV-1 reverse transcriptase inhibitor design using artificial neural networks. J Med Chem. 1994;37(16):2520–2526.
  • Tetko IV, Villa AEP, Livingstone DJ. Neural network studies. 2. Variable selection. J Chem Inf Comput Sci. 1996;36(4):794–803.
  • Andersson M. A comparison of nine PLS1 algorithms. J Chemom. 2009;23(10):518–529.
  • Eberhart RC, Shi Y, editors. Comparing inertia weights and constriction factors in particle swarm optimization. Evolutionary computation, 2000. Proceedings of the 2000 Congress on Evolutionary Computation; 2000. Piscataway, NJ: IEEE.
  • Eberhart RC, Shi Y, editors. Tracking and optimizing dynamic systems with particle swarms. Evolutionary computation, 2001. Proceedings of the 2001 Congress on Evolutionary Computation; 2001. Piscataway, NJ: IEEE.
  • Eberhart R, Simpson P, Dobbins R. Computational intelligence PC tools. Boston, MA: Academic Press Professional, Inc.; 1996.
  • Poli R, Kennedy J, Blackwell T. Particle swarm optimization. Swarm Intelligence. 2007;1(1):33–57.
  • Sutter JM, Kalivas JH. Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection. Microchem J. 1993;47(1–2):60–66.
  • van Laarhoven PJ, Aarts EH. Simulated annealing: theory and applications. Netherlands: Springer; 1987.
  • Ghosh P, Bagchi MC. QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection. Curr Med Chem. 2009;16(30):4032–4048.
  • Ghosh P, Bagchi M. Comparative QSAR studies of nitrofuranyl amide derivatives using theoretical structural properties. Mol Simul. 2009;35(14):1185–1200.
  • Bhaisare ML, Karthikeyan C, Tanwar O, et al., editors. A QSAR analysis of some amino substituted Pyrido [3, 2-b] pyrazinones as potent and selective PDE-5 inhibitors. Proceedings of the 14th Int. Electron. Conf. Synth. Org. Chem. Sciforum Electronic Conferences Series Electronic conference hosted at https://sciforum.net/; 2010.
  • Burden FR, Ford MG, Whitley DC, et al. Use of automatic relevance determination in QSAR studies using Bayesian neural networks. J Chem Inf Comput Sci. 2000;40(6):1423–1430.
  • MacKay DJ. Bayesian non-linear modeling for the energy prediction competition. ASHRAE Transactions. 1994;100(pt 2):1053–1062.
  • Burden F, Winkler D. Bayesian Regularization of neural networks. Artificial neural networks. Totowa, NJ: Springer; 2008. p. 23–42.
  • Winkler DA, Burden FR. Modelling blood-brain barrier partitioning using Bayesian neural nets. J Mol Graph Model. 2004;22(6):499–505.
  • Duchowicz PR, Castro EA, Fernandez FM. Alternative algorithm for the search of an optimal set of descriptors in QSAR-QSPR studies. MATCH Commun Math Comput Chem. 2006;55:179–192.
  • Duchowicz PR, Fernández M, Caballero J, et al. QSAR for non-nucleoside inhibitors of HIV-1 reverse transcriptase. Bioorg Med Chem. 2006;14(17):5876–5889.
  • Morales AH, Duchowicz PR, Perez MAC, et al. Application of the replacement method as a novel variable selection strategy in QSAR. 1. Carcinogenic potential. Chemom Intell Lab Syst. 2006;81(2):180–187.
  • Duchowicz PR, Castro EA, Fernandez FM, et al. A new search algorithm for QSPR/QSAR theories: normal boiling points of some organic molecules. Chem Phys Lett. 2005;412(4–6):376–380.
  • Duchowicz PR, González MP, Helguera AM, et al. Application of the replacement method as novel variable selection in QSPR. 2. Soil sorption coefficients. Chemom Intell Lab Syst. 2007;88(2):197–203.
  • Duchowicz PR, Castañeta H, Castro EA, et al. QSPR prediction of the Dubinin–Radushkevich’s k parameter for the adsorption of organic vapors on BPL carbon. Atmos Environ. 2006;40(16):2929–2934.
  • Zheng W, Tropsha A. Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci. 2000;40(1):185–194.
  • Araujo M, Saldanha TCB, Galvao RKH, et al. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom Intell Lab Syst. 2001;57(2):65–73.
  • Daszykowski M, Stanimirova I, Walczak B, et al. Improving QSAR models for the biological activity of HIV reverse transcriptase inhibitors: aspects of outlier detection and uninformative variable elimination. Talanta. 2005;68(1):54–60.
  • Franke R. Theoretical drug design methods (Pharmacochemistry Library). Vol. 7. Amsterdam: Elsevier; 1984.
  • Van Waterbeemd H. Chemometric methods in molecular design. Mannhold R, Krogsgaard-Larsen P, Timmerman H, et al., editors. New York: Wiley-VCH; 1995.
  • Leonard JT, Roy K. QSAR by LFER model of HIV protease inhibitor mannitol derivatives using FA-MLR, PCRA, and PLS techniques. Bioorg Med Chem. 2006;14(4):1039–1046.
  • Bhattacharya P, Leonard JT, Roy K. Exploring QSAR of thiazole and thiadiazole derivatives as potent and selective human adenosine A3 receptor antagonists using FA and GFA techniques. Bioorg Med Chem. 2005;13(4):1159–1165.
  • Roy K, Roy PP. Comparative chemometric modeling of cytochrome 3A4 inhibitory activity of structurally diverse compounds using stepwise MLR, FA-MLR, PLS, GFA, G/PLS and ANN techniques. Eur J Med Chem. 2009;44(7):2913–2922.
  • Roy K, Leonard JT. QSAR by LFER model of cytotoxicity data of anti-HIV 5-phenyl-1-phenylamino-1H-imidazole derivatives using principal component factor analysis and genetic function approximation. Bioorg Med Chem. 2005;13(8):2967–2973.
  • Weisberg S. Applied linear regression. Vol. 528. Hoboken: John Wiley & Sons; 2005.
  • Schneider C, Catalano P, Biggio J, et al. Associations of neonatal adiponectin and leptin with growth and body composition in African American infants. Pediatr Obes. 2018;13(8):485–491.
  • Duchowicz PR. Linear regression QSAR models for Polo-Like Kinase-1 inhibitors. Cells. 2018;7(2):13.
  • Wang L, Chen B, Zhang T. Predicting hydrolysis kinetics for multiple types of halogenated disinfection byproducts via QSAR models. Chem Eng J. 2018;342:372–385.
  • Roy K, Ambure P. The “double cross-validation” software tool for MLR QSAR model development. Chemom Intell Lab Syst. 2016;159:108–126.
  • Frank LE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35(2):109–135.
  • Jolliffe I. Principal component analysis. International encyclopedia of statistical science. Berlin, Heidelberg: Springer; 2011. p. 1094–1096.
  • Liu RX, Kuang J, Gong Q, et al. Principal component regression analysis with SPSS. Comput Methods Programs Biomed. 2003;71(2):141–147.
  • Constantin C. Principal component analysis-a powerful tool in computing marketing information. Bull Transilvania Univ Brasov Econ Sci Ser. 2014;7(2):25.
  • Valle S, Li W, Qin SJ. Selection of the number of principal components: the variance of the reconstruction error criterion with a comparison to other methods. Ind Eng Chem Res. 1999;38(11):4389–4401.
  • Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001;58(2):109–130.
  • Agrafiotis DK, Rassokhin DN, Lobanov VS. Multidimensional scaling and visualization of large molecular similarity tables. J Comput Chem. 2001;22(5):488–500.
  • Esposito Vinzi V, Russolillo G. Partial least squares algorithms and methods. Wiley Interdisciplinary Reviews: Comput Stat. 2013;5(1):1–19.
  • Hoerl AE, Kennard RW. Ridge regression: applications to nonorthogonal problems. Technometrics. 1970;12(1):69–82.
  • Rogers D. G/SPLINES: a hybrid of Friedman’s multivariate adaptive regression splines (MARS) algorithm with Holland’s genetic algorithm. Proceedings of the Fourth International Conference on Genetic Algorithms; San Diego; 1991.
  • Rogers D, editor. Data analysis using G/SPLINES. Advances in Neural Information Processing Systems. San Mateo, CA; 1992.
  • Holland J. Adaptation in natural and artificial systems: an introductory analysis with application to biology, control and artificial intelligence. Cambridge, MA: MIT Press; 1975.
  • Rogers D. Some theory and examples of genetic function approximation with comparison to evolutionary techniques. Genetic algorithms in molecular modeling. New York: Elsevier; 1996. p. 87–107.
  • Livingstone DJ, Manallack DT, Tetko IV. Data modelling with neural networks: advantages and limitations. J Comput Aided Mol Des. 1997;11(2):135–142.
  • McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5(4):115–133.
  • Devillers J. Neural networks in QSAR and drug design. New York: Academic Press; 1996.
  • Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science. La Jolla, CA: 1985.
  • Caudill M, Butler C. Naturally intelligent systems. Cambridge, MA: MIT Press; 1992.
  • Eberhart RC. Neural network PC tools: a practical guide. New York: Academic Press; 1990.
  • Goh ATC. Back-propagation neural networks for modeling complex systems. Artif Intelligence Eng. 1995;9(3):143–151.
  • MacKay DJC. A practical Bayesian framework for backpropagation networks. Neural Comput. 1992;4(3):448–472.
  • Neal RM. Priors for infinite networks. Bayesian learning for neural networks. New York: Springer; 1996. p. 29–53.
  • Fox T, Kriegl JM. Machine learning techniques for in silico modeling of drug metabolism. Curr Top Med Chem. 2006;6(15):1579–1591.
  • Baskin II. Machine learning methods in computational toxicology. Computational toxicology. New York: Springer; 2018. p. 119–139.
  • Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–297.
  • Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
  • Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov. 1998;2(2):121–167.
  • Andres C, Hutter MC. CNS permeability of drugs predicted by a decision tree. QSAR Comb Sci. 2006;25(4):305–309.
  • Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.
  • Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–1588.
  • Yee LC, Wei YC. Current modeling methods used in QSAR/QSPR. Statistical Modelling of Molecular Descriptors in QSAR/QSPR. 2012;2:1–31.
  • Barandiaran I. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20(8).
  • Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  • Lavecchia A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today. 2015;20(3):318–331.
  • Duda RO, Hart PE, Stork DG. Pattern classification. Weinheim, Germany: John Wiley & Sons; 2001.
  • Kauffman GW, Jurs PC. QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically-based numerical descriptors. J Chem Inf Comput Sci. 2001;41(6):1553–1560.
  • Konovalov DA, Coomans D, Deconinck E, et al. Benchmarking of QSAR models for blood-brain barrier permeation. J Chem Inf Model. 2007;47(4):1648–1656.
  • The Codessa Project software. [cited 2018 Sept 26]. Available from: http://codessa-pro.com/
  • Guha R. Chemical informatics functionality in R. J Stat Softw. 2007;18(5):1–16.
  • Roy K, Kar S, Das RN. A primer on QSAR/QSPR modeling: fundamental concepts (SpringerBriefs in molecular science). NY: Springer; 2015.
  • Roy K, Mitra I. On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design. Comb Chem High Throughput Screen. 2011;14(6):450–474.
  • Golbraikh A, Tropsha A. Beware of q2! J Mol Graph Model. 2002;20(4):269–276.
  • Roy K, Das RN, Ambure P, et al. Be aware of error measures. Further studies on validation of predictive QSAR models. Chemom Intell Lab Syst. 2016;152:18–33.
  • Gadaleta D, Mangiatordi GF, Catto M, et al. Applicability domain for QSAR models: where theory meets reality. Int J Quant Struct-Prop Relat. 2016;1(1):45–63.
  • Melagraki G, Ntougkos E, Rinotas V, et al. Cheminformatics-aided discovery of small-molecule Protein-Protein Interaction (PPI) dual inhibitors of Tumor Necrosis Factor (TNF) and receptor activator of NF-κB ligand (RANKL). PLoS Comput Biol. 2017;13(4):e1005372.
  • Melagraki G, Afantitis A. A risk assessment tool for the virtual screening of metal oxide nanoparticles through enalos InSilicoNano platform. Curr Top Med Chem. 2015;15(18):1827–1836.
  • OECD. Validation of QSAR models; [cited 2018 Oct 24]. Available from: http://www.oecd.org/chemicalsafety/risk-assessment/validationofqsarmodels.htm
  • Roy K, Ambure P, Kar S, et al. Is it possible to improve the quality of predictions from an “intelligent” use of multiple QSAR/QSPR/QSTR models? J Chemom. 2018;32(4):e2992.
  • Roy K, Ambure P, Kar S. How precise are our quantitative structure–activity relationship derived predictions for new query chemicals? ACS Omega. 2018;3(9):11392–11406.
  • Fujita T, Winkler DA. Understanding the roles of the “Two QSARs”. J Chem Inf Model. 2016;56(2):269–274.
  • Bhattarai B, Garg R, Gramatica P. Are mechanistic and statistical QSAR approaches really different? MLR studies on 158 cycloalkyl‐pyranones. Mol Inform. 2010;29(6–7):511–522.
  • Devinyak OT, Lesyk RB. 5-year trends in QSAR and its machine learning methods. Curr Comput Aided Drug Des. 2016;12(4):265–271.
  • Antanasijević D, Antanasijević J, Trišović N, et al. From classification to regression multitasking QSAR modeling using a novel modular neural network: simultaneous prediction of anticonvulsant activity and neurotoxicity of succinimides. Mol Pharm. 2017;14(12):4476–4484.
