ABSTRACT
Randomization is considered a safeguard against bias and a gold standard in clinical studies. To assess the generalizability of a model's accuracy, a common approach is to randomly split a master data set into two parts: one for training and the other for testing. In this paper, we demonstrate the limitations of random splitting for assessing the generalizability of model accuracy through simulation studies. We generated three simulated data sets with binary or continuous endpoints, each with a large sample size (n = 10,000). In each simulation scenario, we randomly split the data into a training set and a testing set and then compared model performance between the two. All simulations were repeated 1,000 times. Under a random split, model performance on the training and testing data behaved similarly in terms of the true positive fraction and false positive fraction for binary endpoints and the mean squared error for continuous endpoints. However, when a time drift effect was present in the data, splitting by time rather than at random produced large differences between training and testing performance. Because a random split yields training and testing data that are similar, assessing the generalizability of a model on such similar data will generate similarly optimistic results. Generalizability of model accuracy is thus best assessed by testing on a distinct and independent study.
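The contrast described above can be sketched in a small simulation. This is a minimal illustration, not the paper's actual simulation design: it assumes a hypothetical continuous endpoint whose mean drifts linearly over the collection period, fits an ordinary least squares model that ignores the drift, and compares mean squared error under a random split versus a chronological split.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data with a time drift: the outcome mean shifts
# linearly over the collection period (t scaled to [0, 1]).
t = np.linspace(0.0, 1.0, n)
x = rng.normal(size=n)
y = 2.0 * x + 3.0 * t + rng.normal(scale=0.5, size=n)  # 3*t is the drift

# Design matrix with intercept; the model does not include t,
# mimicking a predictor set that misses the drift.
X = np.column_stack([np.ones(n), x])

def fit_mse(train, test):
    """Fit OLS on the training indices; return train and test MSE."""
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    mse_tr = np.mean((y[train] - X[train] @ beta) ** 2)
    mse_te = np.mean((y[test] - X[test] @ beta) ** 2)
    return mse_tr, mse_te

# Random split: both halves span the whole drift, so the two MSEs agree.
perm = rng.permutation(n)
mse_tr_rand, mse_te_rand = fit_mse(perm[: n // 2], perm[n // 2:])

# Chronological split: the test half comes from later in the drift,
# so the test MSE is substantially inflated relative to training.
idx = np.arange(n)
mse_tr_time, mse_te_time = fit_mse(idx[: n // 2], idx[n // 2:])

print(f"random split:       train {mse_tr_rand:.2f}, test {mse_te_rand:.2f}")
print(f"chronological split: train {mse_tr_time:.2f}, test {mse_te_time:.2f}")
```

The random split produces near-identical training and testing error, hiding the drift, while the chronological split, which mimics deployment on future data, exposes it.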
Disclosure statement
No potential conflict of interest was reported by the author(s).