Research Article

Sample-size dependence of validation parameters in linear regression models and in QSAR

Pages 247-268 | Received 09 Nov 2020, Accepted 10 Feb 2021, Published online: 22 Mar 2021

References

  • R.G. Brereton, The use and misuse of p values and related concepts, Chemom. Intell. Lab. Syst. 195 (2019), pp. 1–7. doi:10.1016/j.chemolab.2019.103884.
  • P.R. Bevington, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, New York, 1969.
  • S. Dowdy, S. Wearden, and D. Chilko, Statistics for Research, 3rd ed., Wiley, Hoboken, 2004.
  • A.J. Miller, Subset Selection in Regression, 2nd ed., Chapman & Hall, London, UK, 2002.
  • A. Nicholls, Confidence limits, error bars and method comparison in molecular modeling. Part 1: The calculation of confidence intervals, J. Comput. Aided Mol. Des. 28 (2014), pp. 887–918. doi:10.1007/s10822-014-9753-z.
  • R. Todeschini, V. Consonni, A. Mauri, and M. Pavan, Detecting “bad” regression models: Multicriteria fitness functions in regression analysis, Anal. Chim. Acta 515 (2004), pp. 199–208. doi:10.1016/j.aca.2003.12.010.
  • R. Todeschini, D. Ballabio, and F. Grisoni, Beware of unreliable Q2! A comparative study of regression metrics for predictivity assessment of QSAR models, J. Chem. Inf. Model. 56 (2016), pp. 1905–1913. doi:10.1021/acs.jcim.6b00277.
  • K. Roy (ed.), Advances in QSAR Modeling, Challenges and Advances in Computational Chemistry and Physics Vol. 24, Springer, Cham, 2017. doi:10.1007/978-3-319-56850-8.
  • K. Kjeldahl and R. Bro, Some common misunderstandings in chemometrics, J. Chemom. 24 (2010), pp. 558–564. doi:10.1002/cem.1346.
  • P. Gramatica and A. Sangion, A historical excursus on the statistical validation parameters for QSAR models: A clarification concerning metrics and terminology, J. Chem. Inf. Model. 56 (2016), pp. 1127–1131. doi:10.1021/acs.jcim.6b00088.
  • A. Golbraikh and A. Tropsha, Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection, J. Comput. Aided Mol. Des. 16 (2002), pp. 357–369. doi:10.1023/A:1020869118689.
  • A. Golbraikh, M. Shen, Z. Xiao, Y.D. Xiao, K.H. Lee, and A. Tropsha, Rational selection of training and test sets for the development of validated QSAR models, J. Comput. Aided Mol. Des. 17 (2003), pp. 241–253. doi:10.1023/A:1025386326946.
  • P.K. Ojha, I. Mitra, R.N. Das, and K. Roy, Further exploring rm2 metrics for validation of QSPR models, Chemom. Intell. Lab. Syst. 107 (2011), pp. 194–205. doi:10.1016/j.chemolab.2011.03.011.
  • R. Veerasamy, H. Rajak, A. Jain, S. Sivadasan, C.P. Varghese, and R.K. Agrawal, Validation of QSAR Models - Strategies and Importance, Intern. J. Drug Des. Discov. 2 (2011), pp. 511–519.
  • G.B. McBride, A proposal for strength-of-agreement criteria for Lin’s Concordance Correlation Coefficient, NIWA Client Report, National Institute of Water and Atmospheric Research Ltd, Hamilton, New Zealand, 2005; Available at http://www.medcalc.org/download/pdf/McBride2005.pdf.
  • N. Chirico and P. Gramatica, Real external predictivity of QSAR models: How to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient, J. Chem. Inf. Model. 51 (2011), pp. 2320–2335. doi:10.1021/ci200211n.
  • L.I.-K. Lin, A concordance correlation coefficient to evaluate reproducibility, Biometrics 45 (1989), pp. 255–268. doi:10.2307/2532051.
  • N.M. Faber and R. Rajkó, How to avoid over-fitting in multivariate calibration—The conventional validation approach and an alternative, Anal. Chim. Acta 595 (2007), pp. 98–106. doi:10.1016/j.aca.2007.05.030.
  • C. Rücker, G. Rücker, and M. Meringer, y-Randomization and its variants in QSPR/QSAR, J. Chem. Inf. Model. 47 (2007), pp. 2345–2357. doi:10.1021/ci700157b.
  • E. Besalú and L. Vera, Internal test set (ITS) method: A new cross-validation technique to assess the predictive capability of QSAR models. Application to a benchmark set of steroids, J. Chil. Chem. Soc. 53 (2008), pp. 1576–1580. doi:10.4067/S0717-97072008000300005.
  • P.P. Roy and K. Roy, On some aspects of variable selection for partial least squares regression models, QSAR Comb. Sci. 27 (2008), pp. 302–313. doi:10.1002/qsar.200710043.
  • K. Roy and A.S. Mandal, Development of linear and nonlinear predictive QSAR models and their external validation using molecular similarity principle for anti-HIV indolyl aryl sulfones, J. Enzyme Inhib. Med. Chem. 23 (2008), pp. 980–995. doi:10.1080/14756360701811379.
  • G. Schüürmann, R.U. Ebert, J. Chen, B. Wang, and R. Kühne, External validation and prediction employing the predictive squared correlation coefficients test set activity mean vs training set activity mean, J. Chem. Inf. Model. 48 (2008), pp. 2140–2145. doi:10.1021/ci800253u.
  • F. Gharagheizi, Prediction of upper flammability limit percent of pure compounds from their molecular structures, J. Hazard. Mater. 167 (2009), pp. 507–510. doi:10.1016/j.jhazmat.2009.01.002.
  • N.K. Mahobia, R.D. Patel, N.W. Sheikh, S.K. Singh, A. Mishra, and R. Dhardubey, Validation method used in quantitative structure activity relationship, Der Pharma Chemica 2 (2010), pp. 260–271.
  • K. Roy, I. Mitra, S. Kar, P.K. Ojha, R.N. Das, and H. Kabir, Comparative studies on some metrics for external validation of QSPR models, J. Chem. Inf. Model. 52 (2012), pp. 396–408. doi:10.1021/ci200520g.
  • A. Rácz, D. Bajusz, and K. Héberger, Consistency of QSAR models: Correct split of training and test sets, ranking of models and performance parameters, SAR QSAR Environ. Res. 26 (2015), pp. 683–700. doi:10.1080/1062936X.2015.1084647.
  • A.P. Toropova and A.A. Toropov, Does the index of ideality of correlation detect the better model correctly? Mol. Inform. 38 (2019), pp. 1800157. doi:10.1002/minf.201800157.
  • V. Consonni, D. Ballabio, and R. Todeschini, Comments on the definition of the Q2 parameter for QSAR validation, J. Chem. Inf. Model. 49 (2009), pp. 1669–1678. doi:10.1021/ci900115y.
  • K. Roy, I. Mitra, P.K. Ojha, S. Kar, R.N. Das, and H. Kabir, Introduction of rm2(rank) metric incorporating rank-order predictions as an additional tool for validation of QSAR/QSPR models, Chemom. Intell. Lab. Syst. 118 (2012), pp. 202–220. doi:10.1016/j.chemolab.2012.06.004.
  • OECD principles for the validation, for regulatory purposes, of (quantitative) structure-activity relationship models, Organisation for Economic Cooperation and Development (OECD), 2004; Available at http://www.oecd.org/chemicalsafety/risk-assessment/37849783.pdf.
  • P. Gramatica, Principles of QSAR models validation: Internal and external, QSAR Comb. Sci. 26 (2007), pp. 694–701. doi:10.1002/qsar.200610151.
  • E. Papa and P. Gramatica, Externally validated QSPR modelling of VOC tropospheric oxidation by NO3 radicals, SAR QSAR Environ. Res. 19 (2008), pp. 655–668. doi:10.1080/10629360802550697.
  • P.P. Roy, J.T. Leonard, and K. Roy, Exploring the impact of size of training sets for the development of predictive QSAR models, Chemom. Intell. Lab. Syst. 90 (2007), pp. 31–42. doi:10.1016/j.chemolab.2007.07.004.
  • P. Filzmoser, B. Liebmann, and K. Varmuza, Repeated double cross validation, J. Chemom. 23 (2009), pp. 160–171. doi:10.1002/cem.1225.
  • T.M. Martin, P. Harten, D.M. Young, E.N. Muratov, A. Golbraikh, H. Zhu, and A. Tropsha, Does rational selection of training and test sets improve the outcome of QSAR modeling? J. Chem. Inf. Model. 52 (2012), pp. 2570–2578. doi:10.1021/ci300338w.
  • P. Gramatica, E. Giani, and E. Papa, Statistical external validation and consensus modeling: A QSPR case study for Koc prediction, J. Mol. Graph. Model. 25 (2007), pp. 755–766. doi:10.1016/j.jmgm.2006.06.005.
  • V. Rastija, M. Molnar, T. Siladi, and V.H. Masand, QSAR analysis for antioxidant activity of dipicolinic acid derivatives, Comb. Chem. High. Throughput Screen. 21 (2018), pp. 1–11. doi:10.2174/1386207321666180213092352.
  • E.W. Steyerberg, F.E. Harrell, G.J.J.M. Borsboom, M.J.C.R. Eijkemans, Y. Vergouwe, and J.D.F. Habbema, Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis, J. Clin. Epidemiol. 54 (2001), pp. 774–781. doi:10.1016/S0895-4356(01)00341-9.
  • E.W. Steyerberg, S.E. Bleeker, H.A. Moll, D.E. Grobbee, and K.G.M. Moons, Internal and external validation of predictive models: A simulation study of bias and precision in small samples, J. Clin. Epidemiol. 56 (2003), pp. 441–447. doi:10.1016/S0895-4356(03)00047-7.
  • O.Y. Rodionova and A.L. Pomerantsev, Subset selection strategy, J. Chemom. 22 (2008), pp. 674–685. doi:10.1002/cem.1103.
  • N. Fox, A. Hunn, and N. Mathers, Sampling and Sample Size Calculation, The NIHR RDS for the East Midlands/Yorkshire & the Humber, Sheffield, 2007.
  • T. Puzyn, A. Mostrag-Szlichtyng, A. Gajewicz, M. Skrzynski, and A.P. Worth, Investigating the influence of data splitting on the predictive ability of QSAR/QSPR models, Struct. Chem. 22 (2011), pp. 795–804. doi:10.1007/s11224-011-9757-4.
  • L.I.-K. Lin, Assay validation using the concordance correlation coefficient, Biometrics 48 (1992), pp. 599–604. doi:10.2307/2532314.
  • T. Borovicka, Training set construction methods, Master's thesis, Czech Technical University in Prague, Prague, 2012.
  • P. Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, Int. J. Elec. Power 60 (2014), pp. 126–140. doi:10.1016/j.ijepes.2014.02.027.
  • I.C. Yeh, Modeling of strength of high performance concrete using artificial neural networks, Cement Concrete Res. 28 (1998), pp. 1797–1808. doi:10.1016/S0008-8846(98)00165-3.
  • Kaggle Inc. Available at http://kaggle.com.
  • D. Dua and C. Graff, UCI machine learning repository, University of California, School of Information and Computer Science, Irvine, CA, 2019; Available at http://archive.ics.uci.edu/ml. Direct URLs to datasets: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant and https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength.
  • V. Ruusmann, S. Sild, and U. Maran, QSAR DataBank repository: Open and linked qualitative and quantitative structure–activity relationship models, J. Cheminf. 7 (2015), pp. 32. doi:10.1186/s13321-015-0082-6, http://www.qsardb.org, Direct links to datasets: http://dx.doi.org/10.15152/QDB.196 http://dx.doi.org/10.15152/QDB.197 http://dx.doi.org/10.15152/QDB.177 http://dx.doi.org/10.15152/QDB.135 http://dx.doi.org/10.15152/QDB.115 http://dx.doi.org/10.15152/QDB.84 http://dx.doi.org/10.15152/QDB.183 http://dx.doi.org/10.15152/QDB.114 http://dx.doi.org/10.15152/QDB.182.
  • G. Tóth, TOX3_TOX4_TOX5 generated data, Mendeley Data, V1, 2021; dataset available at http://dx.doi.org/10.17632/y5jyd3ycgf.1.
  • R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2014; software available at http://www.R-project.org/.
  • F. Gharagheizi, A QSPR model for estimation of lower flammability limit temperature of pure compounds based on molecular structure, J. Hazard. Mater. 169 (2009), pp. 217–220. doi:10.1016/j.jhazmat.2009.03.083.
  • P. Gramatica and N. Chirico, QSARINS-Chem: Insubria datasets and new QSAR/QSPR models for environmental pollutants in QSARINS, J. Comput. Chem. 35 (2014), pp. 1036–1044. doi:10.1002/jcc.23576.
  • G. Piir, S. Sild, A. Roncaglioni, E. Benfenati, and U. Maran, QSAR model for the prediction of bio-concentration factor using aqueous solubility and descriptors considering various electronic effects, SAR QSAR Environ. Res. 21 (2010), pp. 711–729. doi:10.1080/1062936X.2010.528596.
  • T.W. Schultz, M. Hewitt, T.I. Netzeva, and M.T.D. Cronin, Assessing applicability domains of toxicological QSARs: Definition, confidence in predicted values, and the role of mechanisms of action, QSAR Comb. Sci. 26 (2007), pp. 238–254. doi:10.1002/qsar.200630020.
  • P.P. Roy, S. Kovarich, and P. Gramatica, QSAR model reproducibility and applicability: A case study of rate constants of hydroxyl radical reaction models applied to polybrominated diphenyl ethers and (benzo-)triazoles, J. Comput. Chem. 32 (2011), pp. 2386–2396. doi:10.1002/jcc.21820.
  • E. Papa, F. Villa, and P. Gramatica, Statistically validated QSARs, based on theoretical descriptors, for modeling aquatic toxicity of organic chemicals in Pimephales promelas (fathead minnow), J. Chem. Inf. Model. 45 (2005), pp. 1256–1266. doi:10.1021/ci050212l.
  • P. Gramatica, S. Cassani, P.P. Roy, S. Kovarich, C.W. Yap, and E. Papa, QSAR modeling is not “push a button and find a correlation”: A case study of toxicity of (benzo-)triazoles on algae, Mol. Inform. 31 (2012), pp. 817–835. doi:10.1002/minf.201200075.
  • V. Consonni, D. Ballabio, and R. Todeschini, Evaluation of model predictive ability by external validation techniques, J. Chemom. 24 (2010), pp. 194–201. doi:10.1002/cem.1290.
  • J.S. Cramer, Mean and variance of R2 in small and moderate samples, J. Econom. 35 (1987), pp. 253–266. doi:10.1016/0304-4076(87)90027-3.
  • D. Kovács, Validációs statisztikai paraméterek mintamérettől való függésének vizsgálata [Investigation of the sample-size dependence of validation statistical parameters], Young researcher competition project work, Institute of Chemistry, Eötvös Loránd University, Budapest, Hungary, 2019.
  • L. Sachs, Angewandte Statistik, 9th ed., Springer, New York, 1999.
  • G. Tóth, Z. Bodai, and K. Héberger, Estimation of influential points in any data set from coefficient of determination and its leave-one-out cross-validated counterpart, J. Comput. Aided Mol. Des. 27 (2013), pp. 837–844. doi:10.1007/s10822-013-9680-4.
  • K. Roy, R.N. Das, P. Ambure, and R.B. Aher, Be aware of error measures. Further studies on validation of predictive QSAR models, Chemom. Intell. Lab. Syst. 152 (2016), pp. 18–33. doi:10.1016/j.chemolab.2016.01.008.
  • G. Tóth, P. Király, and D. Kovács, Effect of variable allocation on validation and optimality parameters and on cross-optimization perspectives, Chemom. Intell. Lab. Syst. (2020), pp. 104106. doi:10.1016/j.chemolab.2020.104106.
