Research Article

Sample-size dependence of validation parameters in linear regression models and in QSAR

D. Kovács, Institute of Chemistry, Loránd Eötvös University, Budapest, Hungary
https://orcid.org/0000-0001-9584-0878

P. Király, Institute of Chemistry, Loránd Eötvös University, Budapest, Hungary

G. Tóth, Institute of Chemistry, Loránd Eötvös University, Budapest, Hungary (corresponding author)
https://orcid.org/0000-0002-5146-5700

ABSTRACT

We investigated how statistical validation parameters depend on the size of the sample used to fit multivariate linear models. R² and related internal parameters were misleading: they overestimated the goodness-of-fit of models at small sample sizes. Cross-validation metrics showed correct trends. By correcting the degrees of freedom of the models, the leave-one-out and leave-many-out results could be scaled to be nearly identical. y- and x-randomized validation parameters were also calculated, and the two methods provided nearly identical results. We suggest using the simplest methods in both cases. The external parameters followed correct trends with respect to the sample size, but their sensitivity differed. We plotted the Roy-Ojha metrics in 2D and coloured them according to other external parameters to provide an easy classification of models. Rank correlations were calculated between the performance parameters. Up to a certain sample size, goodness-of-fit and robustness were distinguishable, but above it the parameters were redundant. The external-internal pairs were weakly correlated. Our data show that all three aspects of validation are necessary at small sample sizes, but the internal check of robustness is not informative above a given sample size.
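The gap between R² and a leave-one-out cross-validated Q² at small sample size can be illustrated with a minimal sketch. This is not the authors' procedure: the synthetic data, the number of descriptors, the noise level, and the function names `r2_and_q2_loo` and `mean_gap` are all assumptions made for illustration. The leave-one-out residuals are obtained from the hat-matrix identity for ordinary least squares instead of refitting the model n times.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, Q2_LOO) for an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])    # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta                         # ordinary (fitted) residuals
    # Diagonal of the hat matrix H = A A^+ (leverage of each sample)
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    loo_resid = resid / (1.0 - h)                # exact leave-one-out residuals for OLS
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    q2 = 1.0 - np.sum(loo_resid ** 2) / tss
    return r2, q2

def mean_gap(n, n_desc=3, noise=1.0, repeats=200, seed=0):
    """Average R2 - Q2_LOO over repeated synthetic samples of size n (illustrative)."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(repeats):
        X = rng.normal(size=(n, n_desc))         # hypothetical descriptors
        y = X @ np.ones(n_desc) + rng.normal(scale=noise, size=n)
        r2, q2 = r2_and_q2_loo(X, y)
        gaps.append(r2 - q2)
    return float(np.mean(gaps))

gap_small = mean_gap(10)     # small sample
gap_large = mean_gap(200)    # large sample
print(f"mean R2 - Q2_LOO: n=10 -> {gap_small:.3f}, n=200 -> {gap_large:.3f}")
```

In this toy setting the average difference between R² and Q² shrinks as the sample grows, which is in line with the statement that goodness-of-fit and robustness parameters become redundant above a certain sample size.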

Acknowledgements

The authors thank the participants of the Conferentia Chemometrica conference held in Karcag (Hungary) in September 2019 for fruitful discussions. The investigation was partly supported by grant NKFI K-128136.

Description of the supplementary material

In the supplementary material, we clarify two questions on validation parameters for which we found several misinterpretations or incomplete conclusions in the literature. Table S1 collects the differences in the interpretation of validation parameters between unconstrained linear regression and other model types. Figure S1 shows biased models with shift and scale in order to demonstrate the behaviour of CCC and R². This seemed necessary because of the several misinterpretations that originate from the only conditional equivalence of R² and the square of the Pearson correlation coefficient. We found that the signalling power of CCC was certainly not larger than that of R².
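The effect of shift and scale bias on these parameters can be reproduced with a small sketch. It does not reproduce Figure S1: the data, the shift of 2 and the scale factor of 1.5 are assumptions chosen for illustration. CCC is taken here as Lin's concordance correlation coefficient, and R² is computed as 1 − SS_res/SS_tot, not as the squared Pearson coefficient.

```python
import numpy as np

def r2(y_obs, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def pearson_r2(y_obs, y_pred):
    """Square of the Pearson correlation coefficient."""
    return float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)

def ccc(y_obs, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    cov = np.mean((y_obs - y_obs.mean()) * (y_pred - y_pred.mean()))
    denom = y_obs.var() + y_pred.var() + (y_obs.mean() - y_pred.mean()) ** 2
    return 2.0 * cov / denom

y = np.linspace(0.0, 10.0, 21)    # hypothetical observed values
biased = {
    "shifted": y + 2.0,           # constant offset (assumed value)
    "scaled": 1.5 * y,            # multiplicative bias (assumed value)
}

for name, pred in biased.items():
    print(f"{name}: r^2 = {pearson_r2(y, pred):.3f}, "
          f"R2 = {r2(y, pred):.3f}, CCC = {ccc(y, pred):.3f}")
```

For both biased predictions the squared Pearson coefficient stays at 1 while R² and CCC drop, and in this example R² drops further than CCC.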

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplementary data for this article can be accessed at: https://doi.org/10.1080/1062936X.2021.1890208.

Additional information

Funding

This work was supported by the National Research, Development and Innovation Office, Hungary [NKFI K-128136].
