
Effect of data preprocessing on ensemble learning for classification in disease diagnosis

Pages 1657-1677 | Received 04 Feb 2021, Accepted 09 Mar 2022, Published online: 24 Mar 2022

References

  • Acock, A. C. 2005. Working with missing values. Journal of Marriage and Family 67 (4):1012–28.
  • Acuña, E, and C. Rodriguez. 2004. The treatment of missing values and its effect on classifier accuracy. In Classification, clustering, and data mining applications. studies in classification, data analysis, and knowledge organisation, ed. D. Banks, F. R. McMorris, P. Arabie, W. Gaul. Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-17103-1_60.
  • Alanis-Tamez, M. D., Y. Villuendas-Rey, and C. Yáñez-Márquez. 2017. Computational intelligence algorithms applied to the pre-diagnosis of chronic diseases. Research in Computing Science 138 (1):41–50. Accessed July 6, 2019. https://www.semanticscholar.org/paper/Computational-Intelligence-Algorithms-Applied-to-of-Alanis-Tamez-Villuendas-Rey/055677db623b94f2cc940b94e3c732c111a9677d. doi:10.13053/rcs-138-1-4.
  • Alcalá-Fdez, J., L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, et al. 2009. KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Computing 13 (3):307–18. doi:10.1007/s00500-008-0323-y.
  • Alghamdi, M., M. Al-Mallah, S. Keteyian, C. Brawner, J. Ehrman, and S. Sakr. 2017. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project. PLoS One 12 (7):e0179805. doi:10.1371/journal.pone.0179805.
  • Allison, P. 2002. Missing data. 1st ed. Thousand Oaks, CA, USA: SAGE Publications, Inc. doi:10.4135/9781412985079.
  • Amaratunga, D., J. Cabrera, and Y. S. Lee. 2008. Enriched random forests. Bioinformatics 24 (18):2010–4. doi:10.1093/bioinformatics/btn356.
  • Azencott, C. 2018. Machine learning and genomics: Precision medicine vs. patient privacy. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 376 (20170350):1–13. doi:10.1098/rsta.2017.0350.
  • Bennett, D. A. 2001. How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health 25 (5):464–9.
  • Bradley, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (7):1145–59. doi:10.1016/S0031-3203(96)00142-2.
  • Breiman, L. 2001. Random forests. Machine Learning 45 (1):5–32. doi:10.1023/A:1010933404324.
  • Brodley, C. E, and M. A. Friedl. 1999. Identifying mislabeled training data. Journal of Artificial Intelligence Research 11:131–67. doi:10.1613/jair.606.
  • Buckland, M, and F. Gey. 1994. The relationship between Recall and Precision. Journal of the American Society for Information Science 45 (1):12–9. doi:10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L.
  • Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–57. doi:10.1613/jair.953.
  • Chen, J. H, and S. M. Asch. 2017. Machine learning and prediction in medicine – Beyond the peak of inflated expectations. The New England Journal of Medicine 376 (26):2507–9. doi:10.1056/NEJMp1702071.
  • Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1):37–46. doi:10.1177/001316446002000104.
  • Dietterich, T. G. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, Springer, Berlin, 1–15. doi:10.1007/3-540-45014-9_1.
  • Dong, Y, and C. Y. J. Peng. 2013. Principled missing data methods for researchers. SpringerPlus 2 (1):222. doi:10.1186/2193-1801-2-222.
  • Fatima, M, and M. Pasha. 2017. Survey of machine learning algorithms for disease diagnostic. Journal of Intelligent Learning Systems and Applications 9 (1):1–16. doi:10.4236/jilsa.2017.91001.
  • Fernández, A., V. López, M. Galar, M. J. del Jesus, and F. Herrera. 2013. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems 42:97–110. doi:10.1016/j.knosys.2013.01.018.
  • Fluss, R., D. Faraggi, and B. Reiser. 2005. Estimation of the Youden index and its associated cutoff point. Biometrical Journal. Biometrische Zeitschrift 47 (4):458–72. doi:10.1002/bimj.200410135.
  • Folorunso, S. O, and A. B. Adeyemo. 2013. Alleviating classification problem of imbalanced dataset. African Journal of Computing & ICT 6 (2):137–44.
  • Frénay, B, and M. Verleysen. 2014. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems 25 (5):845–69. doi:10.1109/TNNLS.2013.2292894.
  • Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29 (5):1189–232. doi:10.1214/aos/1013203451.
  • Friedman, J., T. Hastie, and R. Tibshirani. 2000. Additive logistic regression: A statistical view of boosting. The Annals of Statistics 28 (2):337–407. doi:10.1214/aos/1016218223.
  • Gammerman, A. 2010. Modern machine learning techniques and their applications to medical diagnostics. Berlin: Springer. doi:10.1007/978-3-642-16239-8_2.
  • Gunčar, G., M. Kukar, M. Notar, M. Brvar, P. Černelč, M. Notar, and M. Notar. 2018. An application of machine learning to haematological diagnosis. Scientific Reports 8 (411):1–12. doi:10.1038/s41598-017-18564-8.
  • Hajian-Tilaki, K. 2013. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian Journal of Internal Medicine 4 (2):627–35.
  • Ho, T. K. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8):832–44. doi:10.1109/34.709601.
  • Hu, S., Y. Liang, L. Ma, and Y. He. 2009. MSMOTE: Improving classification performance when training data is imbalanced. 2nd International Workshop on Computer Science and Engineering, WCSE 2009, Qingdao, China, vol. 2, 13–7. doi:10.1109/WCSE.2009.756.
  • Jutel, A. 2011. Classification, disease, and diagnosis. Perspectives in Biology and Medicine 54 (2):189–205. doi:10.1353/pbm.2011.0015.
  • Kang, H. 2013. The prevention and handling of the missing data. Korean Journal of Anesthesiology 64 (5):402–6. doi:10.4097/kjae.2013.64.5.402.
  • Khalilia, M., S. Chakraborty, and M. Popescu. 2011. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making 11 (1):51. doi:10.1186/1472-6947-11-51.
  • Kononenko, I. 2001. Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine 23 (1):89–109. doi:10.1016/S0933-3657(01)00077-X.
  • Kotsiantis, S. 2011. Combining bagging, boosting, rotation forest and random subspace methods. Artificial Intelligence Review 35 (3):223–40. doi:10.1007/s10462-010-9192-8.
  • Krawczyk, B, and B. T. McInnes. 2018. Local ensemble learning from imbalanced and noisy data for word sense disambiguation. Pattern Recognition 78:103–19. doi:10.1016/j.patcog.2017.10.028.
  • Kübler, S., C. Liu, and Z. A. Sayyed. 2017. To use or not to use: Feature selection for sentiment analysis of highly imbalanced data. Natural Language Engineering 24:3–37. doi:10.1017/S1351324917000298.
  • Kumar, G. R., V. S. Kongara, and G. A. Ramachandra. 2013. An efficient ensemble based classification techniques for medical diagnosis. International Journal of Latest Technology in Engineering, Management & Applied Science 2 (8):5–9. Accessed July 5, 2019. https://www.academia.edu/35248870/An_Efficient_Ensemble_Based_Classification_Techniques_for_Medical_Diagnosis.
  • Lavanya, D, and U. Rani. 2012. Ensemble decision tree classifier for breast cancer data. International Journal of Information Technology Convergence and Services 2 (1):17–24. doi:10.5121/ijitcs.2012.2103.
  • Little, R. J. A., and D. B. Rubin. 1987. Statistical analysis with missing data. 1st ed. New York, NY, USA: John Wiley & Sons, Inc.
  • Liu, A. Y. C. 2004. The effect of oversampling and undersampling on classifying imbalanced text datasets (Doctoral dissertation, University of Texas at Austin).
  • Maimon, O, and L. Rokach. 2005. Data mining and knowledge discovery handbook. Berlin: Springer-Verlag.
  • McHugh, M. L. 2012. Interrater reliability: The kappa statistic. Biochemia Medica 22 (3):276–82. doi:10.11613/BM.2012.031.
  • Mislevy, R., R. Little, and D. Rubin. 1991. Statistical analysis with missing data. Journal of Educational Statistics 16 (2):150–5. doi:10.2307/1165119.
  • Moon, H., H. Ahn, R. L. Kodell, S. Baek, C. J. Lin, and J. J. Chen. 2007. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine 41 (3):197–207. doi:10.1016/J.ARTMED.2007.07.003.
  • Morales, P., J. Luengo, L. P. F. Garcia, A. C. Lorena, A. C. P. L. F. Carvalho, and F. Herrera. 2017. The NoiseFiltersR package: Label noise preprocessing in R. The R Journal 9 (1):219–28. doi:10.32614/RJ-2017-027.
  • Obermeyer, Z, and E. J. Emanuel. 2016. Predicting the future – Big data, machine learning, and clinical medicine. The New England Journal of Medicine 375 (13):1216–9. doi:10.1056/NEJMp1606181.
  • Sáez, J. A., B. Krawczyk, and M. Woźniak. 2016. On the influence of class noise in medical data classification: Treatment using noise filtering methods. Applied Artificial Intelligence 30 (6):590–609. doi:10.1080/08839514.2016.1193719.
  • Salcedo-Bernal, A., M. P. Villamil-Giraldo, and A. D. Moreno-Barbosa. 2016. Clinical data analysis: An opportunity to compare machine learning methods. Procedia Computer Science 100:731–8. doi:10.1016/j.procs.2016.09.218.
  • Salzberg, S. L. 1994. C4.5: Programs for machine learning. Machine Learning 16 (3):235–40. doi:10.1007/BF00993309.
  • Schafer, J. L. 1999. Multiple imputation: A primer. Statistical Methods in Medical Research 8 (1):3–15. doi:10.1177/096228029900800102.
  • Shah, P., F. Kendall, S. Khozin, R. Goosen, J. Hu, J. Laramie, M. Ringel, and N. Schork. 2019. Artificial intelligence and machine learning in clinical development: A translational perspective. npj Digital Medicine 2 (1):1–5. doi:10.1038/s41746-019-0148-3.
  • Sinharay, S., H. S. Stern, and D. Russell. 2001. The use of multiple imputation for the analysis of missing data. Psychological Methods 6 (4):317–29. doi:10.1037/1082-989X.6.4.317.
  • Sutton, C. D. 2005. Classification and regression trees, bagging, and boosting. In Handbook of statistics, ed. R. C. Rao and A. S. R. S. Rao, 24th ed., 303–30. San Francisco, CA, USA: Elsevier. doi:10.1016/S0169-7161(04)24011-1.
  • Swets, J. 1988. Measuring the accuracy of diagnostic systems. Science 240 (4857):1285–93. doi:10.1126/science.3287615.
  • Thottakkara, P., T. Ozrazgat-Baslanti, B. Hupf, P. Rashidi, P. Pardalos, P. Momcilovic, and A. Bihorac. 2016. Application of machine learning techniques to high-dimensional clinical data to forecast postoperative complications. PLoS One 11 (5):e0155705. doi:10.1371/journal.pone.0155705.
  • UCI. 2019. UC Irvine Machine Learning Repository. Center for Machine Learning and Intelligent Systems. Accessed July 27, 2019. https://archive.ics.uci.edu/ml/index.php.
  • Verma, B, and Z. S. Hassan. 2011. Hybrid ensemble approach for classification. Applied Intelligence 34 (2):258–78. doi:10.1007/s10489-009-0194-7.
  • Wang, R. Y., V. C. Storey, and C. P. Firth. 1995. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering 7 (4):623–40. doi:10.1109/69.404034.
  • Watson, P. F, and A. Petrie. 2010. Method agreement analysis: A review of correct methodology. Theriogenology 73 (9):1167–79. doi:10.1016/J.THERIOGENOLOGY.2010.01.003.
  • Witten, I. H., E. Frank, and M. A. Hall. 2011. Data mining: Practical machine learning tools and techniques. 3rd ed. San Francisco, CA, USA: Morgan Kaufmann.
  • Wu, X, and X. Zhu. 2008. Mining with noise knowledge: Error-aware data mining. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 38 (4):917–32. doi:10.1109/TSMCA.2008.923034.
  • Wu, Z., W. Lin, and Y. Ji. 2018. An integrated ensemble learning model for imbalanced fault diagnostics and prognostics. IEEE Access 6:8394–402. doi:10.1109/ACCESS.2018.2807121.
  • Xu, B., J. Z. Huang, G. Williams, Q. Wang, and Y. Ye. 2012. Classifying very high-dimensional data with random forests built from small subspaces. International Journal of Data Warehousing and Mining 8 (2):44–63. doi:10.4018/jdwm.2012040103.
  • Yang, P., J. Y. H. Yang, B. Zhou, and A. Zomaya. 2010. A review of ensemble methods in bioinformatics. Current Bioinformatics 5 (4):296–308. doi:10.2174/157489310794072508.
  • Youden, W. J. 1950. Index for rating diagnostic tests. Cancer 3 (1):32–5. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
  • Zhang, Y., B. Liu, J. Cai, and S. Zhang. 2017. Ensemble weighted extreme learning machine for imbalanced data classification based on differential evolution. Neural Computing and Applications 28 (s1):259–67. doi:10.1007/s00521-016-2342-4.
  • Zhang, Z., Y. Zhao, A. Canes, D. Steinberg, O. Lyashevska, and written on behalf of AME Big-Data Clinical Trial Collaborative Group. 2019. Predictive analytics with gradient boosting in clinical medicine. Annals of Translational Medicine 7 (7):152–9. doi:10.21037/atm.2019.03.29.
  • Zhu, X, and X. Wu. 2004. Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review 22 (3):177–210. doi:10.1007/s10462-004-0751-8.
