Original Articles

Factors influencing the resubstitution accuracy in multivariate classification analysis: implications for study design in ergonomics

Pages 417-427 | Published online: 09 Nov 2010
 

Abstract

The use of multivariate classification analysis (e.g. discriminant analysis, linear regression, logistic regression) is becoming widespread in ergonomics, as well as in numerous other disciplines. Classification analysis is frequently used to determine what combination of features (independent variables), and in what mathematical relations and proportions, defines an acceptable versus an unacceptable risk. Accurate predictive classification models can be useful in suggesting interventions that can minimize illness and injury. Frequently, classification studies in the ergonomics literature report the resubstitution accuracy, that is, the accuracy realized when the classifier is evaluated on the same sample that was used to generate the classification coefficients. However, it is well established that the resubstitution accuracy is optimistically biased, although the magnitude of this bias is not well understood. Thus, a Monte Carlo simulation study was conducted to investigate this bias. Random data containing no true classification power (denoted the ‘Nil Model’) were generated, then analysed using discriminant analysis. For the case of two outcome groups, the true accuracy of the Nil Model is 50% (i.e. no better than flipping a fair coin). For conditions similar to those in the literature, the random data ‘reported’ highly accurate classification performance, with results as high as 100%. These ‘reports’ represent the bias artefact of resubstitution accuracy. Factors influencing the extent of the bias were studied. It was found that the resubstitution bias is reduced if: the sample size is increased, the number of candidate features is decreased, the number of selected features is decreased, and the proportion of samples from each outcome group is equalized. Feature correlation did not influence resubstitution accuracy. These simulation studies indicate that reporting the resubstitution accuracy alone can be problematic.
It is suggested that research reports that incorporate classification analysis either (1) train the classification function on one data set, but report as the performance metric the classification accuracy achieved on an independent, adequately sized test data set, or (2) demonstrate that the magnitude of the resubstitution bias is minimal.
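The Nil Model experiment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' simulation code. The function names (`solve`, `fit_lda`, `simulate`, and so on), the choice of standard normal features, the equal group sizes, the small ridge term, and all default parameter values are assumptions made for illustration, and the sketch omits the feature-selection step that the study also examined. It fits a two-group linear discriminant to pure noise, then compares the resubstitution accuracy with the accuracy on an independent test sample drawn from the same noise.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mean_vec(rows):
    """Column means of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_lda(x0, x1, ridge=1e-8):
    """Two-group linear discriminant: pooled covariance, equal priors.

    The ridge term is an illustrative safeguard against a singular matrix.
    """
    m0, m1 = mean_vec(x0), mean_vec(x1)
    p = len(m0)
    s = [[0.0] * p for _ in range(p)]
    for rows, m in ((x0, m0), (x1, m1)):
        for r in rows:
            d = [r[j] - m[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(x0) + len(x1) - 2
    for i in range(p):
        for j in range(p):
            s[i][j] /= dof
        s[i][i] += ridge
    w = solve(s, [m1[j] - m0[j] for j in range(p)])
    # Decision threshold: projection of the midpoint between group means.
    c = sum(w[j] * (m0[j] + m1[j]) / 2 for j in range(p))
    return w, c


def accuracy(model, x0, x1):
    """Fraction of rows assigned to their own group by the discriminant."""
    w, c = model

    def score(r):
        return sum(wj * rj for wj, rj in zip(w, r)) - c

    hits = sum(score(r) <= 0 for r in x0) + sum(score(r) > 0 for r in x1)
    return hits / (len(x0) + len(x1))


def nil_trial(rng, n_per_group, n_features):
    """One Nil Model trial: both groups come from the same distribution."""

    def draw(n):
        return [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]

    train0, train1 = draw(n_per_group), draw(n_per_group)
    test0, test1 = draw(n_per_group), draw(n_per_group)
    model = fit_lda(train0, train1)
    return accuracy(model, train0, train1), accuracy(model, test0, test1)


def simulate(n_per_group=10, n_features=5, trials=200, seed=0):
    """Mean (resubstitution, independent-test) accuracy over many trials."""
    rng = random.Random(seed)
    results = [nil_trial(rng, n_per_group, n_features) for _ in range(trials)]
    resub = sum(r for r, _ in results) / trials
    holdout = sum(h for _, h in results) / trials
    return resub, holdout


if __name__ == "__main__":
    for n in (10, 100):
        resub, holdout = simulate(n_per_group=n)
        print(f"n per group={n}: resub={resub:.3f}, test={holdout:.3f}")
```

Under these assumptions, the mean resubstitution accuracy with the small sample should sit noticeably above 50% even though the data carry no classification power, while the independent-test accuracy should stay near 50%. Rerunning with a larger `n_per_group` should shrink the gap, in line with the sample-size effect reported in the abstract.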
