Factors influencing the resubstitution accuracy in multivariate classification analysis: implications for study design in ergonomics: Ergonomics: Vol 40, No 4

Abstract

The use of multivariate classification analysis (e.g. discriminant analysis, linear regression, logistic regression) is becoming widespread in ergonomics, as well as in numerous other disciplines. Classification analysis is frequently used to determine what combination of features (independent variables), and in what mathematical relations and proportions, defines an acceptable versus an unacceptable risk. Accurate predictive classification models can be useful in suggesting interventions that can minimize illness and injury. Frequently, classification studies in the ergonomics literature report the resubstitution accuracy, i.e. the accuracy that is realized when the classifier is evaluated on the same sample that was used to generate the classification coefficients. However, it is well established that the resubstitution accuracy is optimistically biased. The extent, or magnitude, of this bias is not well understood. Thus, a Monte Carlo simulation study was conducted to investigate this bias. Random data containing no true classification power (denoted the ‘Nil Model’) were generated, then analysed using discriminant analysis. For the case of two outcome groups, the true accuracy of the Nil Model is 50% (i.e. no better than flipping a fair coin). For conditions similar to those in the literature, the random data ‘reported’ highly accurate classification performance, with results as high as 100%. These ‘reports’ represent the bias artefact of resubstitution accuracy. Factors influencing the extent of the bias were studied. It was found that the resubstitution bias is reduced if the sample size is increased, the number of candidate features is decreased, the number of selected features is decreased, and the proportion of samples from each outcome group is equalized. Feature correlation did not influence resubstitution accuracy. These simulation studies indicate that reporting the resubstitution accuracy alone can be problematic.
It is suggested that research reports that incorporate classification analysis either (1) train the classification function on one data set, but report as the performance metric the classification accuracy achieved on an independent, adequately sized test data set, or (2) demonstrate that the magnitude of the resubstitution bias is minimal.
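
The Nil Model experiment described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' procedure: it uses a plain two-group Fisher linear discriminant written with NumPy, standard normal features, balanced groups, and no feature-selection step (so candidate and selected features coincide). The function names `lda_fit`, `lda_predict`, and `nil_model_accuracies`, and all default parameter values, are hypothetical choices made for illustration.

```python
import numpy as np


def lda_fit(X, y):
    """Fit a two-group Fisher linear discriminant; return weights w and cutoff c."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group scatter matrix.
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # The pseudo-inverse guards against a singular scatter matrix
    # (an assumption made here for robustness with few samples).
    w = np.linalg.pinv(S) @ (m1 - m0)
    # Cutoff halfway between the projected group means (equal priors assumed).
    c = w @ (m0 + m1) / 2.0
    return w, c


def lda_predict(X, w, c):
    """Assign group 1 when the discriminant score exceeds the cutoff."""
    return (X @ w > c).astype(int)


def nil_model_accuracies(n_samples=30, n_features=10, n_trials=200,
                         n_test=2000, seed=0):
    """Average resubstitution and independent-test accuracy on pure-noise data."""
    rng = np.random.default_rng(seed)
    resub, holdout = [], []
    for _ in range(n_trials):
        # Nil Model: features are random noise, labels carry no information.
        X = rng.standard_normal((n_samples, n_features))
        y = np.arange(n_samples) % 2  # balanced two-group labels
        w, c = lda_fit(X, y)
        # Resubstitution: evaluate on the same sample used for fitting.
        resub.append(np.mean(lda_predict(X, w, c) == y))
        # Independent test set drawn from the same noise distribution.
        X_test = rng.standard_normal((n_test, n_features))
        y_test = np.arange(n_test) % 2
        holdout.append(np.mean(lda_predict(X_test, w, c) == y_test))
    return float(np.mean(resub)), float(np.mean(holdout))


if __name__ == "__main__":
    for n in (30, 100, 300):
        r, h = nil_model_accuracies(n_samples=n)
        print(f"n={n}: resubstitution={r:.3f}, independent test={h:.3f}")
```

Under these assumptions, the resubstitution figure should sit noticeably above the 50% true accuracy for small samples and shrink toward it as `n_samples` grows, while the independent-test figure should stay close to 50%. Varying `n_features` or the group proportions in the same way gives a rough feel for the other factors listed in the abstract.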
