ABSTRACT
This paper combines a theory-based model with a data-driven approach to develop an Early Warning System that detects students who are more likely to drop out. The model uses innovative multilevel statistical and machine learning methods. The paper demonstrates the validity of the approach by applying it to administrative data from a leading Italian university.
Acknowledgments
This research stems from an institutional initiative launched by Politecnico di Milano under the label ‘Data Analytics for Institutional Support’, whose broad aim is to leverage the university’s available (administrative) datasets to analyze many aspects of academic life and to support better decision-making. We are grateful to the University’s management for their support and encouragement, and to the IT Office of Politecnico di Milano for their support in extracting and pre-processing the data. Any remaining errors are solely our responsibility.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 An important note is needed here. Dropout represents a net waste of resources when students leave university altogether, but some students leave in order to switch major or university. In the latter case, the effect is not a net waste of resources for society, but only for the university they leave. The argument therefore remains valid, although its application depends on the specific definition of dropout. In this paper, we consider the viewpoint of the single university involved (see the section about Methodology and data).
2 It is worth recalling that the relevance of the covariates and the threshold values of the splits are identified automatically by the tree, given certain input parameters.
3 We chose this threshold because the third semester after the enrolment represents the deadline for students to enrol in the second academic year.
4 In the early dropout analyses, late dropout students are excluded from the sample, and vice versa.
5 Tables in Annexes A1 and A2 report detailed results of Models 1a, 1b and 1c, for early and late dropout, respectively. The association between student-level covariates and the response remains coherent across the models.
6 We are aware that some students may make no attempts because they have already decided to drop out, which creates a potential endogeneity issue in studying the phenomenon. To check the robustness of our results and to avoid this potential confounding factor, we re-ran our linear models for predicting early dropout after excluding from the sample those students who made no attempts in the first semester. Results, reported in , confirm that the student characteristics associated with the dropout probability, together with the models’ predictive performance, remain largely unchanged (AUC values are slightly lower when zero-attempt students are excluded).
7 The technical and mathematical details about the computation of degree courses’ effects are reported in Pinheiro and Bates (2006) and Pellagatti et al. (2021).
8 We report the mean and interquartile range for numerical variables and percentages for categorical variables.
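As an illustration of the descriptive statistics mentioned in note 8, the following minimal Python sketch computes the mean and the quartiles bounding the interquartile range for a numerical variable, and category percentages for a categorical variable. The variable names and the toy values are hypothetical and are not taken from the paper's data; the "inclusive" quartile method is also an assumption, since the paper does not state which quantile definition was used.

```python
from collections import Counter
from statistics import mean, quantiles


def summarize_numeric(values):
    """Return the mean and the quartiles (Q1, Q3) bounding the interquartile range."""
    # Assumption: "inclusive" quartiles; other quantile definitions give
    # slightly different values on small samples.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "q1": q1, "q3": q3}


def summarize_categorical(values):
    """Return the percentage of observations falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {level: 100 * count / total for level, count in counts.items()}


# Hypothetical toy data, for illustration only.
credits_first_semester = [0, 10, 20, 30, 40]
gender = ["F", "M", "M", "F", "M"]

numeric_summary = summarize_numeric(credits_first_semester)
categorical_summary = summarize_categorical(gender)

print(numeric_summary)
print(categorical_summary)
```

Reporting the two quartiles rather than their difference keeps the location of the middle half of the distribution visible, which is a common way to present an interquartile range in a descriptive table.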