Research Article

Sample-size dependence of validation parameters in linear regression models and in QSAR

Pages 247-268 | Received 09 Nov 2020, Accepted 10 Feb 2021, Published online: 22 Mar 2021
 

ABSTRACT

We investigated how statistical validation parameters depend on the size of the sample used to fit multivariate linear models. We observed that R2 and related internal parameters were misleading, as they overestimated the goodness-of-fit of models at small sample sizes. Cross-validation metrics showed correct trends. By correcting the degrees of freedom of the models, the leave-one-out and leave-many-out results could be scaled to be nearly identical. y- and x-randomized validation parameters were calculated, and the two methods provided nearly identical results. We suggest using the simplest method in both cases. The external parameters followed correct trends with respect to sample size, but their sensitivity differed. We plotted the Roy-Ojha metrics in 2D and coloured them according to other external parameters to provide an easy classification of models. Rank correlations were calculated between the performance parameters. Up to a certain sample size, goodness-of-fit and robustness were distinguishable, but above that size the parameters were redundant. The external-internal pairs were weakly correlated. Our data show that all three aspects of validation are necessary at small sample sizes, but the internal check of robustness is not informative above a given sample size.
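The sample-size effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' protocol: the synthetic data, the number of descriptors, the noise level, and the helper names `r2_and_q2_loo` and `mean_metrics` are all assumptions made for illustration. It fits ordinary least squares with an intercept and compares R2 with a leave-one-out Q2 computed from the leverages, using the total sum of squares about the full-sample mean as the reference (one of several conventions in use).

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, Q2_LOO) for an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    resid = y - A @ beta                           # fitted residuals
    # Leverages h_ii are the diagonal of the hat matrix H = A (A^T A)^-1 A^T
    h = np.diag(A @ np.linalg.pinv(A.T @ A) @ A.T)
    loo_resid = resid / (1.0 - h)                  # exact LOO residuals for OLS
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    q2 = 1.0 - np.sum(loo_resid ** 2) / tss
    return r2, q2

def mean_metrics(n, p=3, noise=1.0, repeats=200, seed=0):
    """Average (R2, Q2_LOO) over repeated synthetic samples of size n.

    Hypothetical setup: p standard-normal descriptors, unit coefficients,
    Gaussian noise. None of these choices come from the article.
    """
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(repeats):
        X = rng.normal(size=(n, p))
        y = X @ np.ones(p) + rng.normal(scale=noise, size=n)
        results.append(r2_and_q2_loo(X, y))
    return np.mean(results, axis=0)

if __name__ == "__main__":
    for n in (8, 20, 100):
        r2, q2 = mean_metrics(n)
        print(f"n={n:4d}  mean R2={r2:.3f}  mean Q2_LOO={q2:.3f}")
```

Under these assumptions, the mean R2 is inflated at the smallest sample size and the gap between R2 and Q2 narrows as n grows, which is the qualitative pattern the abstract reports.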

Acknowledgements

The authors thank the participants of the Conferentia Chemometrica conference, held in Karcag (Hungary) in September 2019, for fruitful discussions. The investigation was partly supported by grant NKFI K-128136.

Description of the supplementary material

In the supplementary material, we clarify two questions on validation parameters for which we found several misinterpretations or incomplete conclusions in the literature. Table S1 collects the differences in the interpretation of validation parameters between unconstrained linear regression and other model types. Figure S1 shows biased models with shift and scale to demonstrate the behaviour of CCC and R2; this seemed necessary because of the several misinterpretations that originate from the merely conditional equivalence of R2 and the square of the Pearson correlation coefficient. Here, we found that the signalling power of CCC was certainly not larger than that of R2.
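The distinction between R2, the squared Pearson correlation coefficient, and CCC can be made concrete with a short numerical sketch. It does not reproduce Figure S1; the data, the shift of 1.0, and the scale of 1.5 are hypothetical values chosen for illustration. CCC is taken here to be Lin's concordance correlation coefficient, and R2 is computed as one minus the ratio of the residual sum of squares to the total sum of squares of the observed values.

```python
import numpy as np

def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - RSS / TSS of the observed values."""
    rss = np.sum((y_obs - y_pred) ** 2)
    tss = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - rss / tss

def pearson_r2(y_obs, y_pred):
    """Square of the Pearson correlation coefficient."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

def ccc(y_obs, y_pred):
    """Lin's concordance correlation coefficient (population variances)."""
    mo, mp = y_obs.mean(), y_pred.mean()
    cov = np.mean((y_obs - mo) * (y_pred - mp))
    return 2.0 * cov / (y_obs.var() + y_pred.var() + (mo - mp) ** 2)

# Hypothetical data: noisy predictions, then a shifted and scaled copy of them.
rng = np.random.default_rng(1)
y_obs = rng.normal(size=200)
y_fit = y_obs + rng.normal(scale=0.3, size=200)
y_biased = 1.0 + 1.5 * y_fit

if __name__ == "__main__":
    for name, pred in (("unbiased", y_fit), ("shift+scale", y_biased)):
        print(f"{name:12s} r^2={pearson_r2(y_obs, pred):.3f} "
              f"R2={r2(y_obs, pred):.3f} CCC={ccc(y_obs, pred):.3f}")
```

The squared Pearson coefficient is unchanged by the shift and scale, whereas R2 and CCC both decrease, so the squared correlation cannot stand in for R2 when predictions are biased.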

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplementary data for this article can be accessed at: https://doi.org/10.1080/1062936X.2021.1890208.

Additional information

Funding

This work was supported by the National Research, Development and Innovation Office, Hungary [NKFI K-128136].
