ABSTRACT
It is widely believed that unlabeled data are promising for improving prediction accuracy in classification problems. Although theoretical studies of when and how unlabeled data are beneficial exist, the actual improvement in prediction for finite samples has not been sufficiently investigated in a systematic manner. We investigate the impact of unlabeled data in linear discriminant analysis and compare the error rates of classifiers estimated with and without unlabeled data. Our focus is the labeling mechanism, which characterizes the probabilistic structure of the occurrence of labeled cases. The results imply that an extremely small proportion of unlabeled data has a large effect on the analysis results.
Acknowledgment
The authors would like to acknowledge the associate editor and anonymous reviewers for their helpful comments and suggestions.
Funding
K. Hayashi is supported by JSPS KAKENHI (Grant-in-Aid for Scientific Research) grant number 24700276. K. Takai is supported by JSPS KAKENHI (Grant-in-Aid for Scientific Research) grant number 20572019.
Notes
1 To be precise, this quantity should be called “non-MCAR-ness.” However, for simplicity, we call it “MCAR-ness.”