Research Article

Sample-size dependence of validation parameters in linear regression models and in QSAR

Pages 247-268 | Received 09 Nov 2020, Accepted 10 Feb 2021, Published online: 22 Mar 2021
 

ABSTRACT

We investigated how statistical validation parameters depend on the size of the sample used to fit multivariate linear models. R² and related internal parameters were misleading because they overestimated the goodness-of-fit of models at small sample sizes, whereas cross-validation metrics showed correct trends. By correcting the degrees of freedom of the models, the leave-one-out and leave-many-out results could be scaled to be nearly identical. Validation parameters were calculated with both y- and x-randomization, and the two methods gave nearly identical results. We suggest using the simplest methods in both cases. The external parameters followed correct trends with respect to the sample size, but their sensitivity differed. We plotted the Roy-Ojha metrics in 2D and coloured them according to other external parameters to provide an easy classification of models. Rank correlations were calculated between the performance parameters. Up to a certain sample size, goodness-of-fit and robustness were distinguishable; above it, the parameters were redundant. The external-internal pairs were weakly correlated. Our data show that all three aspects of validation are necessary at small sample sizes, but the internal check of robustness is not informative above a given sample size.
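The small-sample optimism of R² relative to a cross-validated metric can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a single descriptor, synthetic data with Gaussian noise, and a leave-one-out Q² defined as 1 - PRESS/TSS with TSS taken about the full-sample mean (other Q² definitions exist). The function names `fit_line`, `r_squared`, `q2_loo` and `mean_gap` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(xs, ys):
    """Goodness-of-fit: R2 = 1 - RSS/TSS on the training data."""
    a, b = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - my) ** 2 for y in ys)
    return 1.0 - rss / tss


def q2_loo(xs, ys):
    """Leave-one-out Q2 = 1 - PRESS/TSS (assumed definition, full-sample mean)."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without point i, then predict the left-out point.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    tss = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / tss


def mean_gap(n, repeats, seed=0):
    """Average of R2 - Q2_LOO over synthetic data sets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
        # Synthetic response: assumed slope 2 and noise sd 0.5.
        ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
        total += r_squared(xs, ys) - q2_loo(xs, ys)
    return total / repeats


if __name__ == "__main__":
    for n in (5, 10, 25, 100):
        print(n, round(mean_gap(n, repeats=50), 4))
```

Under this definition Q² can never exceed R² for an ordinary least-squares fit, because each leave-one-out residual is at least as large in magnitude as the corresponding fitted residual. The gap between the two shrinks as the sample grows.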

Acknowledgements

The authors thank the participants of the Conferentia Chemometrica conference held in Karcag (Hungary) in September 2019 for fruitful discussions. The investigation was partly supported by grant NKFI K-128136.

Description of the supplementary material

In the supplementary material, we clarified two questions about validation parameters on which we found several misinterpretations or insufficiently thorough conclusions in the literature. Table S1 collects the differences in the interpretation of validation parameters between unconstrained linear regression and other model types. Figure S1 shows biased models with shift and scale to demonstrate the behaviour of CCC and R². This seemed necessary because of the several misinterpretations that originate from the merely conditional equivalence of R² and the square of the Pearson correlation coefficient. Here, we found that the signalling power of CCC was certainly not larger than that of R².
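The difference between R² and the squared Pearson correlation under shifted or scaled predictions can be checked numerically. The following is a minimal sketch, not taken from the supplementary material: it assumes R² is the coefficient of determination 1 - Σ(y - ŷ)²/Σ(y - ȳ)², CCC is Lin's concordance correlation coefficient with population (1/n) variances, and the toy data and function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _sums(obs, pred):
    """Centered sums of squares and cross-products (not divided by n)."""
    mo, mp = _mean(obs), _mean(pred)
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    return mo, mp, soo, spp, sop


def pearson_r2(obs, pred):
    """Square of the Pearson correlation; blind to shift and scale."""
    _, _, soo, spp, sop = _sums(obs, pred)
    return sop ** 2 / (soo * spp)


def r2_determination(obs, pred):
    """Coefficient of determination of predictions against observations."""
    mo = _mean(obs)
    rss = sum((o - p) ** 2 for o, p in zip(obs, pred))
    tss = sum((o - mo) ** 2 for o in obs)
    return 1.0 - rss / tss


def ccc(obs, pred):
    """Lin's concordance correlation coefficient (1/n variances assumed)."""
    mo, mp, soo, spp, sop = _sums(obs, pred)
    n = len(obs)
    return 2.0 * sop / (soo + spp + n * (mo - mp) ** 2)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    cases = {
        "unbiased": observed,
        "shift +1": [y + 1.0 for y in observed],
        "scale 0.5": [0.5 * y for y in observed],
    }
    for name, predicted in cases.items():
        print(name,
              round(pearson_r2(observed, predicted), 3),
              round(r2_determination(observed, predicted), 3),
              round(ccc(observed, predicted), 3))
```

With these definitions R² cannot exceed CCC: both subtract the same mean squared deviation between observed and predicted values, but CCC divides it by a quantity at least as large as the variance of the observations. The squared Pearson coefficient stays at 1 for any shifted or positively scaled copy of the observations, so it coincides with R² only when the predictions carry no such bias.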

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplementary data for this article can be accessed at: https://doi.org/10.1080/1062936X.2021.1890208.

Additional information

Funding

This work was supported by the National Research, Development and Innovation Office, Hungary [NKFI K-128136].
