
Combining statistical and machine learning methods to explore German students’ attitudes towards ICT in PISA

Pages 180-199 | Received 07 Dec 2019, Accepted 17 Jul 2021, Published online: 05 Aug 2021

ABSTRACT

In our age of big data and growing computational power, versatility in data analysis is important. This study presents a flexible way to combine statistics and machine learning for data analysis of a large-scale educational survey. The authors used statistical and machine learning methods to explore German students’ attitudes towards information and communication technology (ICT) in relation to mathematical and scientific literacy measured by the Programme for International Student Assessment (PISA) in 2015 and 2018. Implementations of the random forest (RF) algorithm were applied to impute missing data and to predict students’ proficiency levels in mathematics and science. Hierarchical linear models (HLM) were built to explore relationships between attitudes towards ICT and mathematical and scientific literacy with the focus on the nested structure of the data. ICT autonomy was an important variable in RF models, and associations between this attitude and literacy scores in HLM were significant and positive, while for other ICT attitudes the associations were negative (ICT in social interaction) or non-significant (ICT competence and ICT interest). The need for further research on ICT autonomy is discussed, and benefits of combining statistical and machine learning approaches are outlined.

Introduction

Educational researchers are facing a new challenge brought about by growing volumes of data. Coping with this challenge requires creative adaptability of methods, which needs to be an integral part of academic training. Contemporary data science methodologies endeavour to integrate statistical analysis with the relatively novel machine learning approach, which aims at finding generalizable data patterns and making predictions on their basis (Breiman 2001b, Donoho 2017). In educational research, this integration remains to be attained: Machine learning is increasingly used for data analysis of large-scale educational surveys (Bethmann et al. 2014, Gabriel et al. 2018, Nationales Bildungspanel 2019, Saarela et al. 2016) but rarely in combination with statistical modeling. A flexible way to use statistical and machine learning methods can be advantageous for data analysis of a large-scale educational survey, as we illustrate with our analysis of German students’ attitudes towards information and communication technology (ICT).

Attitudes towards ICT, which have a considerable impact on learning (Fraillon et al. 2020, Organisation for Economic Co-operation and Development [OECD] 2015, 2019d), are examined in the frame of large-scale educational surveys (Eickelmann and Vennemann 2017, Fraillon et al. 2019). In this study, we used the ICT framework (Goldhammer et al. 2016) from the Programme for International Student Assessment (PISA). PISA has been conducted by the OECD every three years since 2000. It measures 15-year-old students’ literacy (competence required for coping with adult life) in different domains, attitudes to schooling, and a broad range of demographic factors. Literacy scores are translated into proficiency levels, from Level 1b to Level 6, using cut-off points (OECD 2017b); students with scores below the baseline Level 2 are classified as low performers, and students with scores at Level 5 and above as high performers (OECD 2016b, 2019b, Pokropek et al. 2018).

Based on previous findings (Güzeller and Akin 2014, Luu and Freeman 2011, Petko et al. 2017, Zhang and Liu 2016), we explored students’ attitudes towards ICT in PISA 2015 and 2018 in relation to their mathematical and scientific literacy. We used solely German samples to avoid pitfalls outlined by critics of PISA with regard to international comparisons based on simplistic interpretations (Anderson et al. 2009, Goldstein 2004, Hopfenbeck et al. 2018, Zhao 2020). In order to determine the relative importance of attitudes towards ICT, we compared them with each other and with two demographic variables that were shown to be influential factors for academic achievement in mathematics and science in Germany: economic, social and cultural status (ESCS) and gender (Mostafa and Schwabe 2019, OECD 2016a, 2019c, Rodrigues and Biagi 2017).

From the methodological perspective, our aim was to present a versatile way to apply statistical and machine learning methods to a large-scale educational survey. We selected the most suitable tools from statistical and machine learning toolboxes for each of the three consecutive tasks that we needed to complete. For the first task, missing data imputation, we chose random forest (RF), a machine learning algorithm (Stekhoven 2013). For the second task, predicting proficiency levels in mathematics and science (below Level 2, Levels 2–4, or Level 5 and above), another implementation of the RF algorithm (Breiman 2001a, Liaw and Wiener 2002) was used. For the third task, exploring associations between literacy scores and attitudes towards ICT with the focus on the hierarchical structure of the data, we resorted to a traditional statistical method, hierarchical linear modeling (HLM; Dedrick et al. 2009, Harrison et al. 2018, McCoach 2010). We justify our analytical decisions and discuss methodological recommendations for educational researchers interested in novel methods of data analysis.

In the second section of this paper, we describe the current state of research; in the third, we outline our analytical decisions; in the fourth, we report our results; and in the fifth, we discuss limitations of the study and its implications for future research. Our R script is freely available to the readers, to allow them to reproduce the findings of the study, practice data analysis with plausible values in a large-scale educational survey, or get acquainted with the basics of machine learning in R.

Theoretical background

In this section, we discuss statistical and machine learning methods in educational research and versatility in application of these approaches. Then we focus on the RF algorithm and model-agnostic methods, which make machine learning models interpretable. Finally, as we illustrate the use of these tools by exploring relationships between ICT attitudes and academic achievement in PISA 2015 and 2018, we briefly review previous research on this topic.

Statistical and machine learning methods in educational research

Statistical methods in educational research vary widely, and their comprehensive description is beyond the frame of this paper. In this study, we focus on multilevel methods, which are increasingly used for large-scale educational surveys (Anderson et al. 2009, OECD 2009), as failing to recognize the hierarchical structure of the data leads to Type I error inflation (Cheah 2009, Musca et al. 2011). Multilevel structural equation modeling (Areepattamannil and Santos 2019, Goldstein et al. 2007), multilevel latent class analysis (Yalçın 2018), multilevel multidimensional item response theory (Lu and Bolt 2015), and the most frequently used method, HLM (Areepattamannil 2014, Cosgrove and Cunningham 2011, Sun et al. 2012, Yıldırım 2012), were applied to hierarchical data in large-scale educational surveys. Methodological recommendations for best practices in HLM were summarized by Dedrick et al. (2009) and Harrison et al. (2018).

In addition to statistical methods, the relatively novel machine learning approach has recently been applied to large-scale educational surveys (Bethmann et al. 2014, Gabriel et al. 2018, Nationales Bildungspanel 2019, Saarela et al. 2016). Shmueli (2010) discussed the contrast between explanatory modeling typical for statistics and predictive modeling in machine learning: ‘While statistical theory has focused on model estimation, inference and fit, machine learning and data mining have concentrated on developing computationally efficient predictive algorithms and tackling the bias-variance trade-off in order to achieve high predictive accuracy’ (p. 306). Machine learning differs from statistics in regard to data pre-processing (splitting the data into the training set and the test set), choice of variables (less theory-driven than in statistics), and model evaluation (predictive accuracy instead of explanatory power). Machine learning methods are widely used by such interdisciplinary communities as educational data mining and learning analytics (see Rienties, Køhler Simonsen, and Herodotou [2020], Romero and Ventura [2020]).

Integration of statistical and machine learning approaches has been discussed for more than two decades (Friedman 1997). Breiman (2001b) urged statisticians to embrace algorithmic thinking typical for machine learning. Hastie et al. (2009) presented a combination of statistical and machine learning methods to train the most effective predictive model for the dataset in question. Shmueli (2010) suggested a bi-dimensional approach that takes into consideration the explanatory power and predictive accuracy of a model, thus combining statistical and machine learning approaches to model evaluation. Donoho (2017) stressed that data science, which uses scientific methods to create meaning from raw data, needs to include mutually enriching statistical and machine learning perspectives. In the frame of this integration, novel methods were developed which combine statistics and machine learning. For instance, Ryo and Rillig (2017) introduced statistically reinforced machine learning, in which machine learning models are augmented by investigating significance and effect sizes typical for statistical models. Another way to integrate statistics and machine learning is to exercise versatility in data analysis, when the researcher selects tools from either of these toolboxes depending on the study aims and the dataset characteristics (Breiman 2001b, Donoho 2017, Hastie et al. 2009). The best techniques for specific tasks can be chosen based on existing research on their comparative effectiveness (Breiman 2001b, Bzdok et al. 2018, Couronné et al. 2018, Makridakis et al. 2018).

In educational science, however, this integration remains to be attained. We demonstrate that the two approaches can be combined for data analysis of a large-scale educational survey, and argue that statisticians would benefit from widening the scope of their techniques with machine learning methods. One of these methods, RF, is discussed in detail in the next section.

Random forest and model-agnostic methods

RF, which has recently grown from a single algorithm (Breiman 2001a) into a framework of various models (Denil et al. 2014, Fawagreh et al. 2014), is increasingly used in educational research (Beaulac and Rosenthal 2019, Golino et al. 2014, Han et al. 2019, Hardman et al. 2013, McDaniel 2018, Spoon et al. 2016, Yi and Na 2019). RF belongs to the family of algorithms based on decision trees. The idea of a decision tree is to split the dataset on a particular variable based on the information gained from each split. As splits are binary (yes or no), continuous variables are dichotomized at each split, that is, compared as lesser than or greater than a certain value. Figure 1 shows a decision tree predicting students’ mathematical proficiency (the classification task), which we built for illustration purposes. For this decision tree, the first two splits are based on the ESCS value, and the third on the value of ICT autonomy.
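The splitting logic described above can be sketched in a few lines. The study's own analysis was done in R; the following Python/scikit-learn snippet on synthetic data is a hedged analogue (the variables `escs` and `auto` are illustrative stand-ins for ESCS and ICT autonomy, not the PISA data), printing a shallow tree's threshold rules.

```python
# Hedged illustration (not the authors' R code): a shallow decision tree on
# synthetic data, with binary "lesser than / greater than" splits printed out.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
escs = rng.normal(0, 1, n)   # synthetic stand-in for ESCS
auto = rng.normal(0, 1, n)   # synthetic stand-in for ICT autonomy
latent = 0.8 * escs + 0.4 * auto + rng.normal(0, 1, n)
y = np.digitize(latent, [-1.0, 1.5]) + 1  # 1 = below Level 2, 2 = Levels 2-4, 3 = Level 5+

X = np.column_stack([escs, auto])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["ESCS", "AUTO"]))  # threshold rules per split
```

As in Figure 1, a tree fitted to data of this shape tends to split first on the stronger predictor (here the ESCS stand-in) before using ICT autonomy.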

Figure 1. Decision Tree for Mathematical Proficiency Levels in PISA 2015. Note: Split labels specify how the decision tree splits the data. Node labels contain the following information: the predicted class for the node; the probability per class of observations in the node (conditioned on the node, the sum across a node is 1); and the percentage of observations in the node. ESCS = economic, social, and cultural status; AUTO = ICT autonomy; INTE = ICT interest; SOCI = ICT in social interaction; class 1 = mathematical proficiency below Level 2; class 2 = Levels 2–4; class 3 = Level 5 and above.


Decision trees have a number of advantages: They are interpretable as they mirror human decision-making more than other predictive models, they handle a wide range of problems and different types of variables, and they impose no assumptions on the data (Golino and Gomes 2014). Their substantial disadvantage is the problem of overfitting, when the algorithm performs poorly on new datasets. This problem is resolved in RF by generating a large number of bootstrapped trees based on random samples of variables (a ‘forest’ of decision trees) and aggregating their results.
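The overfitting contrast can be demonstrated directly. This generic scikit-learn sketch on synthetic data (an assumed setup, not the study's models) shows that an unpruned tree memorizes its training set, while bagging many bootstrapped trees narrows the gap between training and test accuracy.

```python
# Sketch of the overfitting argument: single unpruned tree vs bagged forest.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

deep_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)  # unpruned
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# The unpruned tree fits its training data perfectly but generalizes worse;
# the forest averages bootstrapped trees grown on random variable subsets.
gap_tree = deep_tree.score(X_tr, y_tr) - deep_tree.score(X_te, y_te)
gap_forest = forest.score(X_tr, y_tr) - forest.score(X_te, y_te)
```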

RF is effective for highly dimensional data and handles interactions and nonlinearity (Fawagreh et al. 2014, Molnar 2019). It has become a widely used method for classification tasks (Denil et al. 2014), as it has a lower predictive error than logistic regression (Breiman 2001b, Couronné et al. 2018), and for multiclass tasks, it does not require the proportional odds assumption needed for ordinal regression (Field et al. 2017, O’Connell and Liu 2011). RF is increasingly used for missing data imputation, as it is adaptive to interactions and nonlinearity (Shah et al. 2014), performs well even with data missing not at random (Tang and Ishwaran 2017), and has better imputation accuracy than nearest neighbours imputation and multiple imputation by chained equations (Misztal 2019, Waljee et al. 2013). Therefore, Golino and Gomes (2016) suggested using RF to impute missing data in educational and psychological research.

To measure the model performance of RF, different metrics can be used (Chai and Draxler 2014, Janitza et al. 2014). One of the most commonly used is the area under the Receiver Operating Characteristic (ROC) curve (Kleiman and Page 2019). The ROC curve plots the true positive rate on the y-axis against the false positive rate on the x-axis across classification thresholds. The area under the curve (AUC) represents the ability of the model to separate classes. For a perfect classifier, the area is 100%, and for a random classifier, for which the true positive rate equals the false positive rate, it is 50%. AUC can be extended to multiclass problems, with separate comparisons for each pair of classes (Hand and Till 2001).
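The pairwise multiclass extension of AUC is available off the shelf. A hedged sketch on synthetic three-class data (the study itself computed AUC in R with pROC): scikit-learn's `roc_auc_score` with `multi_class="ovo"` averages the pairwise class comparisons in the spirit of Hand and Till (2001).

```python
# Multiclass AUC via averaged pairwise class comparisons on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 4))
# Three ordered classes (1, 2, 3) driven by the first predictor plus noise
y = np.clip(np.round(X[:, 0] + rng.normal(scale=0.8, size=1500)), -1, 1).astype(int) + 2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

clf = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                      # one probability per class
auc = roc_auc_score(y_te, proba, multi_class="ovo")  # mean of pairwise AUCs
```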

RF models are better predictors than decision trees, but they are less understandable. Breiman (2001b) described this trade-off between accuracy and interpretability of a model as the Occam dilemma. In order to make the output of such ‘black boxes’ more understandable, model-agnostic (that is, flexibly applicable to any model) methods can be used, such as permutation variable importance and partial dependence plots (Molnar 2019).

The concept of permutation variable importance can be briefly explained as follows. A variable is ‘considered important if deleting it seriously affects prediction accuracy’ of the model (Breiman 2001b, p. 230). However, removing one variable after another and retraining the model each time would be computationally expensive; instead, a variable is replaced with random noise to see how this influences the model’s performance. The noise is created by permuting the variable (that is, shuffling its values). Permutation variable importance was shown to be less biased than other variable importance measures (Altmann et al. 2010).
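The permute-and-measure idea translates directly into code. A hedged Python sketch on synthetic data (the study used the mean decrease in accuracy from the R package randomForest): each column is shuffled in turn and the drop in held-out accuracy is recorded.

```python
# Sketch of permutation variable importance: shuffle each predictor on
# held-out data and record the resulting drop in accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1200, 3))  # column 0 is informative, columns 1-2 are noise
y = (X[:, 0] + 0.3 * rng.normal(size=1200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

clf = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)
result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=3)
importances = result.importances_mean  # mean accuracy drop per permuted column
```

The informative column receives a large importance value, while shuffling the noise columns barely changes accuracy.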

The partial dependence plot for a variable shows the marginal effect that it has on the predicted outcome of the model (Molnar 2019). It depicts the nature of the relationship between the variable and the outcome and indicates whether this relationship is linear, monotonic, or more complex. In classification tasks, partial dependence plots can be built for each class separately to show the probability of the class given the values of the variable. Partial dependence plots can also be constructed for two variables in 3D to display not only their influence on the outcome but also their interactions with each other (Greenwell 2017).
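Computationally, a partial dependence curve is the model's prediction averaged over the data while one variable is swept across a grid. A hedged scikit-learn sketch on synthetic binary data (the study built its plots in R with the package pdp):

```python
# Sketch of a partial dependence curve: the average predicted probability of
# the positive class as one variable is varied over a grid of its values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

clf = RandomForestClassifier(random_state=4).fit(X, y)
pd_result = partial_dependence(clf, X, features=[0], kind="average")
curve = pd_result["average"][0]  # P(class 1) marginalized over the other feature
```

For two-variable (3D) plots, `features=[(0, 1)]` produces the grid of averaged predictions over both variables jointly.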

Based on the findings showing effectiveness of RF for missing data imputation and for classification tasks, we chose the RF algorithm for these tasks in our study. Model-agnostic methods (variable importance and partial dependence plots) were used to increase interpretability of the models.

Attitudes towards ICT in PISA

Attitude is a multifaceted construct comprising affective, cognitive, and behavioural components (Guillén-Gámez and Mayorga-Fernández 2020). Attitudes towards ICT have been conceptualized in different ways (Zhang, Aikman, and Sun 2008) and examined in the frame of large-scale educational surveys (Eickelmann and Vennemann 2017, Fraillon et al. 2019, Fraillon et al. 2020). In the current study, the ICT framework from PISA 2015 (Goldhammer et al. 2016) was used.

In previous PISA waves, the ICT framework was different: PISA 2003, 2006, and 2009 assessed students’ confidence related to three types of ICT tasks (basic, Internet-related, and high-level tasks). In PISA 2012, positive components of attitude towards ICT (perceiving ICT as a useful tool) were measured as a construct independent from negative components (perceiving ICT as an uncontrollable entity). As an ICT literacy measure has not yet been developed in PISA, attitudes towards ICT were studied in relation to mathematical and scientific literacy. In PISA 2003–2012, positive attitudes towards ICT were significantly and positively associated with mathematical and scientific literacy scores (Güzeller and Akin 2014, Luu and Freeman 2011, Petko et al. 2017, Zhang and Liu 2016).

In PISA 2015, a new conceptualization of ICT engagement (Goldhammer et al. 2016) was introduced. This framework was based on self-determination theory (Deci and Ryan 2000) and included the dimensions of ICT competence, ICT interest, ICT autonomy, and ICT in social interaction (see Appendix A). Scale indices for each dimension were obtained in the frame of item response theory by means of the generalized partial credit model, which allows item discrimination to vary between items within any given scale (OECD 2017b). It was shown (Kunina-Habenicht and Goldhammer 2020, Meng et al. 2019) that the scales were reliable instruments with the expected dimensionality. The same ICT framework was used in PISA 2018.

Recent findings on attitudes towards ICT in PISA 2015 showed that, in all countries participating in the optional questionnaire, ICT autonomy was significantly positively associated with mathematical and scientific literacy scores, while ICT in social interaction was significantly negatively associated with these scores (Hu et al. 2018). For ICT competence and ICT interest, the significance and the sign of their relationships with literacy scores varied at the country level (Hu et al. 2018, Meng, Qiu, and Boyd-Wilson 2019, Odell et al. 2020).

In the current study, we re-examined the associations between German students’ attitudes towards ICT and mathematical and scientific literacy in PISA 2015 using a different methodology than Meng et al. (2019); additionally, we explored these associations in PISA 2018. We also used attitudes towards ICT to predict students’ proficiency in mathematics and science (below Level 2, Levels 2–4, or Level 5 and above) in PISA 2015 and PISA 2018.

Methods

Dataset

The data for the study were obtained from the PISA 2015 and 2018 databases on the OECD website (OECD n.d.-a, n.d.-b). The authors used German subsets of the PISA student questionnaire and the optional ICT familiarity questionnaire for students. Cases with 100% missing ICT responses, which occurred because the ICT questionnaire was optional in Germany, were removed (Reiss et al. 2016). In accordance with this criterion, 1093 cases (16.81%) were removed from the 2015 dataset and 944 cases (17.32%) from the 2018 dataset before the analysis. The resulting 2015 sample consisted of 5411 students (50.71% female, n = 2744; 49.29% male, n = 2667) from 254 German schools. The resulting 2018 sample consisted of 4507 students (47.33% female, n = 2133; 52.67% male, n = 2374) from 208 German schools.

Measures

Mathematical and scientific literacy

Mathematical literacy is defined as ‘an individual’s capacity to formulate, employ and interpret mathematics in a variety of contexts’ (OECD 2017a, p. 67). Scientific literacy is defined as ‘the ability to engage with science-related issues, and with the ideas of science, as a reflective citizen’ (OECD 2017a, p. 22). Ten plausible values were included in the analysis for each literacy score (OECD 2009). In both PISA waves, the cut-off score for performance below Level 2 was 420.07 in mathematics and 409.54 in science. The cut-off score for performance at Level 5 and above was 606.99 in mathematics and 633.33 in science (OECD 2019b).
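Given these cut-off scores, deriving the three-class outcome used throughout the paper is mechanical; a minimal Python sketch (the function name is ours, not from the study's R script):

```python
# Map a literacy score to the three-class outcome used in the study, using
# the cut-off scores for both PISA waves quoted in the text (OECD 2019b).
MATH_CUTS = (420.07, 606.99)  # mathematics: below Level 2 / Level 5 and above
SCI_CUTS = (409.54, 633.33)   # science: below Level 2 / Level 5 and above

def proficiency_class(score, cuts):
    low_cut, high_cut = cuts
    if score < low_cut:
        return 1  # below Level 2 (low performers)
    if score < high_cut:
        return 2  # Levels 2-4
    return 3      # Level 5 and above (high performers)
```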

Demographics

ESCS is a composite score based on three indicators: parental education, highest parental occupation, and home possessions. The latter indicator is used in PISA as a proxy for family wealth (OECD 2017a). The measure is constructed via principal component analysis and standardized to a standard deviation of one, with zero representing the overall OECD average (OECD 2017a). In the datasets, gender was coded as 1 for female and 2 for male; we recoded it as 0 for female and 1 for male.

Attitudes towards ICT

Attitudes towards ICT were measured on a 4-point Likert scale from strongly disagree to strongly agree, and the following IRT-scaled indices were calculated (OECD 2017b): (a) perceived ICT competence (COMPICT) from five items of the IC014 scale; (b) interest in ICT (INTICT) from six items of the IC013 scale; (c) ICT in social interaction (SOIAICT) from five items of the IC016 scale; and (d) ICT autonomy (AUTICT) from five items of the IC015 scale. The measures were the same in PISA 2018 (OECD 2019a).

Analytical strategy

In this section, we describe our analytical choices in completing the three consecutive tasks: (a) missing data imputation; (b) the classification task with proficiency levels (below Level 2, Levels 2–4, or Level 5 and above) as the categorical outcome variable; and (c) the regression task with the literacy score as the continuous outcome variable. The data were analysed with R, version 4.0.0 (R Core Team 2018). The R script is available at https://github.com/OlgaLezhnina/PISAstatsML and in the Supplemental Materials.

Missingness patterns were checked with the package VIM (Kowarik and Templ 2016). For missing data imputation, we used the RF algorithm as an effective and unbiased method (Golino and Gomes 2016, Misztal 2019, Waljee et al. 2013) and chose the package missForest (Stekhoven 2013), as this implementation was shown to be more robust than other versions of RF imputation (Tang and Ishwaran 2017). Histograms of variables for the complete cases and the dataset after imputation were compared to visually assess the results of imputation. The imputed data were used for further analysis.
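The missForest idea, iteratively predicting each variable's missing entries with a random forest trained on the other variables, has a close Python analogue. A hedged sketch on synthetic data (scikit-learn's IterativeImputer with an RF regressor is an approximation of, not identical to, the R package the authors used):

```python
# Python analogue of missForest-style imputation: iterative imputation with
# a random-forest regressor (the study itself used the R package missForest).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
X[:, 1] = 0.7 * X[:, 0] + 0.3 * rng.normal(size=500)  # correlated columns

X_miss = X.copy()
mask = rng.random(X.shape) < 0.05  # roughly 5% missing completely at random
X_miss[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=5),
    max_iter=10,
    random_state=5,
)
X_imp = imputer.fit_transform(X_miss)  # no NaNs remain after imputation
```

As the authors did, one would then compare distributions of the complete cases and the imputed data before modeling.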

For predicting students’ proficiency levels in mathematics and science (below Level 2, Levels 2–4, or Level 5 and above), RF classification models were built, with attitudes towards ICT, ESCS, and gender as predictors. We preferred RF to ordinal regression, as it does not require the proportional odds assumption (O’Connell and Liu 2011) and handles nonlinearity and interactions (Fawagreh et al. 2014, Molnar 2019). We used the package randomForest (Liaw and Wiener 2002), a simple and commonly used implementation of the algorithm based on Breiman’s (2001a) original code. As the sample was imbalanced (there were substantially more students at Levels 2–4), we oversampled the training set with the package UBL (Branco et al. 2016). To avoid an excessively optimistic performance assessment, we evaluated the models on the test set (20% of the 2015 data), which was not oversampled, and then on the 2018 data. AUC was used as a measure of model performance; multiclass AUC and separate class comparisons were estimated with the package pROC (Robin et al. 2011). The mean decrease in accuracy (permutation importance) was chosen as the variable importance measure. Partial dependence plots for each predictor were built with the package pdp (Greenwell 2017). Three-dimensional partial dependence plots for pairs of predictors were built to illustrate their relationships with each other and with the predicted outcome.
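The split-oversample-train-evaluate workflow can be sketched end to end. This hedged Python analogue on synthetic imbalanced data (the study used the R packages randomForest, UBL, and pROC) illustrates the key safeguard: only the training set is oversampled, while the test set keeps its natural class balance.

```python
# Sketch of the classification workflow: train/test split, random oversampling
# of the training set only, RF training, and multiclass AUC on the test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 3000
X = rng.normal(size=(n, 6))  # stand-ins for four ICT attitudes, ESCS, gender
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
y = np.digitize(latent, [-1.8, 1.8]) + 1  # imbalanced: middle class dominates

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=6)

# Oversample minority classes in the training set; the test set stays as-is
n_max = max((y_tr == c).sum() for c in np.unique(y_tr))
idx = np.concatenate([
    rng.choice(np.flatnonzero(y_tr == c), size=n_max, replace=True)
    for c in np.unique(y_tr)
])
clf = RandomForestClassifier(n_estimators=200, random_state=6).fit(X_tr[idx], y_tr[idx])
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovo")
```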

In order to obtain a more detailed picture of the relationships between attitudes towards ICT and mathematical and scientific literacy with the focus on the nested structure of the data (different schools), we needed a multilevel regression model. Our RF models were not hierarchical due to a limit on factor levels in randomForest (Liaw and Wiener 2002), so we had to choose a different method to build such a model. It was shown that for predictive multilevel models (such as RF), an increase in group size is more beneficial than an increase in the number of groups, while for estimative models (such as HLM) the opposite is the case (Afshartous and de Leeuw 2005). The group size in our case was the number of student participants in each school, and the number of groups was the number of schools. In both datasets, the group size varied from one to 30 students, while the number of groups was rather large (254 in the 2015 dataset and 208 in the 2018 dataset), which made HLM our instrument of choice.

HLM was conducted with the package lme4 (Bates et al. 2015) in accordance with the methodological recommendations summarized by Dedrick et al. (2009) and Harrison et al. (2018). Restricted maximum likelihood was used as the estimation method (Dedrick et al. 2009). Independent variables were grand-mean centred and standardized by two standard deviations for comparability of continuous and binary variables (Gelman 2008). The need for multilevel modeling was assessed by exploring variance decomposition in null (unconditional) models with random intercepts (McCoach 2010). As full models with random slopes and random intercepts (Harrison et al. 2018) failed to converge, full models with random intercepts were built. To estimate sampling variance, Fay’s method with 80 replicates was used, and statistical estimates were averaged over 10 plausible values (OECD 2009). Formulae for analysis with plausible values and replicate weights, which we used in our R script, are given in Appendix B. As we had only six predictors (gender, ESCS, and four ICT attitudes), they were all included in the model simultaneously (see Heinze et al. [2018] for variable selection methods). Significance of estimates was assessed with the type III Wald test (Luke 2017), and a significance level of .001 was set as appropriate for our procedures and sample size (Lakens et al. 2018). Nakagawa’s marginal and conditional R² (Nakagawa and Schielzeth 2013, Nakagawa et al. 2017) were calculated with the package performance (Lüdecke et al. 2020). Assumptions for residuals were checked with diagnostic plots from the package sjPlot (Lüdecke 2019). To check for multicollinearity, a cut-off value of 3 for the variance inflation factor was used, as recommended by Zuur et al. (2010). Intra-class correlation coefficients (ICCs) and Nakagawa’s marginal and conditional R² were reported as effect size measures for random effects (LaHuis et al. 2014, Lorah 2018).
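The core steps of this workflow can be sketched outside R as well. This hedged Python analogue with statsmodels on synthetic data (the study used lme4; replicate weights and plausible-value combining are omitted, and all variable names are illustrative) shows 2-SD standardization, the null model for the ICC, and a random-intercept model.

```python
# Sketch of the multilevel workflow: 2-SD standardization (Gelman, 2008),
# a null model for variance decomposition / ICC, and a random-intercept model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_schools, per_school = 50, 20
school = np.repeat(np.arange(n_schools), per_school)
u = rng.normal(0.0, 15.0, n_schools)[school]    # random school intercepts
auto = rng.normal(size=n_schools * per_school)  # stand-in for ICT autonomy
score = 500 + 20 * auto + u + rng.normal(0, 40, n_schools * per_school)

df = pd.DataFrame({"score": score, "auto": auto, "school": school})
# Centre and divide continuous predictors by two standard deviations
df["auto_std"] = (df["auto"] - df["auto"].mean()) / (2 * df["auto"].std())

# Null (unconditional) model with random intercepts: variance decomposition
null_model = smf.mixedlm("score ~ 1", df, groups=df["school"]).fit(reml=True)
tau2 = float(null_model.cov_re.iloc[0, 0])      # between-school variance
icc = tau2 / (tau2 + null_model.scale)          # intra-class correlation

# Full random-intercept model with the standardized predictor
full_model = smf.mixedlm("score ~ auto_std", df, groups=df["school"]).fit(reml=True)
```

In the full PISA analysis, this fit would be repeated per plausible value and per replicate weight, with estimates combined as in Appendix B.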

Results

Random forest for missing data imputation and classification tasks

In the 2015 dataset, 2.41% of the data was missing, and in the 2018 dataset, 3.81%. Aggregation plots and scatter matrices for missing values (see the R script) suggested that the data were missing at random, as there were no discernible patterns in missingness. RF imputation was conducted with the default settings (number of trees = 100, maximal number of iterations = 10). Histograms of variables from the complete-cases dataset and the dataset after imputation (see Figure 2) showed that the imputation did not lead to changes in their distributions that could potentially bias our models.

Figure 2. Histograms of Complete Cases and Imputed Data in Years 2015 and 2018. Note: The upper row is for the 2015 data, the lower row for the 2018 data. Dark green = complete cases, light green = imputed data. ESCS = economic, social, and cultural status; INTE = ICT interest; COMP = ICT competence; AUTO = ICT autonomy; SOCI = ICT in social interaction.


To predict students’ proficiency levels (below Level 2, Levels 2–4, or Level 5 and above), two RF models were trained: the mathematics model and the science model. The training set (80% of the 2015 dataset) was used. Variable importance plots for the models are shown in Figure 3. Attitudes towards ICT were less important than ESCS but more important than gender in both models, and ICT autonomy was more important than the other attitudes.

Figure 3. Permutation Variable Importance. Note: ESCS = economic, social, and cultural status; AUTO = ICT autonomy; SOCI = ICT in social interaction; INTE = ICT interest; COMP = ICT competence; GEND = gender.


Variable importance measures do not indicate whether the association of a variable with the predicted outcome is positive or negative; this information can be obtained from partial dependence plots for each variable. In Figure 4, the partial dependence plots for the science model are shown; the plots for the mathematics model (see the R script) revealed similar patterns. The upper row shows the probability for a student to perform below Level 2, and the lower row the probability to perform at Level 5 and above, as predicted by ICT interest, ICT competence, ICT autonomy, ICT in social interaction, and ESCS. In the plots, higher levels of ICT autonomy predicted a higher probability for a student to perform at Level 5 and above in science, while lower levels of autonomy increased the probability to perform below Level 2. For ICT in social interaction, the partial dependence plots gave the opposite picture: higher levels of this attitude predicted a higher probability to perform below Level 2. The plots show the nonlinearity of relationships between predictors and the predicted outcome, which is most clearly seen for ICT interest.

Figure 4. Partial Dependence Plots for Variables in the Science Model. Note: INTE = ICT interest; COMP = ICT competence; AUTO = ICT autonomy; SOCI = ICT in social interaction; ESCS = economic, social, and cultural status; SCI-1 = probability for class 1 (below Level 2); SCI-3 = probability for class 3 (Level 5 and above).

Partial dependence plots for pairs of variables indicated nonlinear relationships between them in terms of the probability of the predicted outcome for each class (below Level 2, Levels 2–4, or Level 5 and above). In Figure 5, partial dependence plots for ESCS and ICT autonomy, the two most important variables, are shown for the science model. The highest probability for a student to perform on Level 5 and above in science was predicted by high levels of both ESCS and ICT autonomy.

Figure 5. Partial Dependence Plots for ESCS and ICT Autonomy in the Science Model. Note: ESCS = economic, social, and cultural status; AUTO = ICT autonomy; SCI-1 = probability for class 1 (below Level 2); SCI-2 = probability for class 2 (Levels 2–4); SCI-3 = probability for class 3 (Level 5 and above).

The models were evaluated on the test set from the 2015 data (20% of the sample). The multiclass AUC was 67.44% for the mathematics model and 71.66% for the science model. When evaluated on the 2018 data (the whole dataset), the multiclass AUC was 69.19% for the mathematics model and 68.51% for the science model. Although the model performance was suboptimal (see the Limitations section), it is noteworthy that it did not change considerably for the 2018 data. The ROC curves for the class comparisons are presented in Figure 6.
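A multiclass AUC for a three-class problem is commonly computed as the average of all pairwise one-vs-one AUCs (the Hand-and-Till measure). The authors evaluated their models in R; assuming predicted class probabilities are available, an equivalent computation in Python is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def multiclass_auc(y_true, proba):
    """Multiclass AUC as the macro-average of pairwise one-vs-one AUCs.
    `proba` is an (n_samples, n_classes) array of predicted probabilities
    whose rows sum to 1."""
    return roc_auc_score(y_true, proba, multi_class="ovo", average="macro")
```

A value of 0.5 corresponds to chance-level discrimination between every pair of classes, and 1.0 to perfect separation.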

Figure 6. Receiver Operating Characteristic Curves for the Models Evaluated on the 2015 Test Set and the 2018 Data. Note: SCI = Science; MATH = Mathematics; TP = True Positive; FP = False Positive; class 1 = proficiency below Level 2; class 2 = Levels 2–4; class 3 = Level 5 and above.

Multilevel modeling

Unconditional models for mathematical literacy and scientific literacy were built. As the ICC values reported in Table 1 show, multilevel modeling was required for both outcomes.

Table 1. Unconditional Models for Mathematical Literacy and Scientific Literacy.
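The ICC from an unconditional (intercept-only) model is the share of the total variance attributable to differences between schools: with between-school variance τ00 and within-school residual variance σ², it equals τ00 / (τ00 + σ²). A minimal computation (illustrative numbers only, not values from the study):

```python
def icc(tau_00: float, sigma_sq: float) -> float:
    """Intraclass correlation coefficient from an unconditional model:
    between-group (school) variance over total variance."""
    return tau_00 / (tau_00 + sigma_sq)
```

For example, `icc(2000.0, 6000.0)` gives 0.25, i.e. a quarter of the variance would lie between schools; an ICC clearly above zero is what makes single-level regression inappropriate here.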

For mathematical literacy and scientific literacy, full models with random intercepts were built. Fixed and random effects estimates are reported in Table 2.

Table 2. Full Models for Mathematical Literacy and Scientific Literacy.

Assumptions for the full models were checked for each of 10 plausible values in mathematical and scientific literacy in the 2015 dataset and the 2018 dataset. Values of the variance inflation factor were below the cut-off value of 3, indicating that there was no multicollinearity in the data. Diagnostic plots for residuals showed that linearity and homoscedasticity assumptions were met, and residuals of the models were normally distributed.

As the data were standardized by two standard deviations, we could compare the relative importance of all variables, including the binary variable gender, on the basis of their regression coefficients. ICT in social interaction was significantly negatively associated with mathematical literacy and scientific literacy. ICT autonomy was significantly positively associated with mathematical literacy and scientific literacy, and it was almost as influential as gender and ESCS for mathematical literacy both in the 2015 dataset and in the 2018 dataset. Its association with scientific literacy (β = 38.29) was the strongest among all variables in the 2015 dataset and the second strongest, after ESCS, in the 2018 dataset (β = 25.99). In Figure 7, regression coefficients of all variables are shown with confidence intervals in different colours for the 10 plausible values.
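Dividing a centered continuous predictor by two standard deviations gives it a standard deviation of 0.5, which is the maximum possible for a binary predictor; this is what makes the coefficients directly comparable. A sketch of the rescaling:

```python
import numpy as np

def standardize_2sd(x):
    """Center and divide by two standard deviations, so the rescaled
    variable has a standard deviation of 0.5 -- the largest standard
    deviation a binary 0/1 predictor can have."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std())
```

After this transformation, a regression coefficient is the expected change in the outcome for a two-standard-deviation change in the predictor, roughly a move from a "low" to a "high" value.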

Figure 7. HLM Estimates for 10 Plausible Values with Confidence Intervals. Note: GEND = gender; ESCS = economic, social, and cultural status; COMP = ICT competence; INTE = ICT interest; SOCI = ICT in social interaction; AUTO = ICT autonomy.

Discussion

Findings on attitudes towards ICT

The findings of the study, which are consistent with previous studies (Hu et al. Citation2018, Meng et al. Citation2019), show that ICT in social interaction was significantly negatively associated with mathematical and scientific literacy, while ICT competence, the least important attitudinal variable in the RF models, was not significantly associated with literacy scores in HLM. Negative associations of German students’ mathematical and scientific literacy with ICT interest in PISA 2015 were reported as significant by Meng et al. (Citation2019) but were non-significant in our study. This discordance can be explained by differences in analytical choices, which are to some degree inevitably subjective and need to be reported transparently so that they can be tested in further research (see Silberzahn et al. [Citation2018]).

Our results on ICT autonomy, which are also congruent with previous studies (Hu et al. Citation2018, Meng et al. Citation2019), suggest that this attitude might play a prominent role in the mathematics and science domains. For German students, ICT autonomy was an important variable in the classification models, and it was significantly positively associated with literacy scores in HLM in PISA 2015 and PISA 2018. The role of ICT autonomy is probably not fully recognized by contemporary educational systems, with their emphasis on collaborative learning and group activities (Dillenbourg et al. Citation2009). In line with this tendency, the OECD focuses increasingly on collaborative problem solving in PISA, with less attention paid to learners’ autonomy. The change in the optional ICT questionnaire planned for PISA 2021 (Lorenceau et al. Citation2019) is indicative of this trend, as it seems that ICT autonomy in its current operationalization will cease to be a topic of research in PISA (OECD Citation2019d). We strongly believe, however, that research on ICT autonomy and its development in students (Areepattamannil and Santos Citation2019) is crucially important for the educational system. In our contemporary world, which requires the ability to make independent decisions and to interact with an increasingly digitalized environment, ICT autonomy might become one of the main building blocks of effective learning.

Combining statistical and machine learning approaches to data analysis

In the current study, we presented a flexible way to combine statistical and machine learning methods for data analysis of a large-scale educational survey. For each task in the analytical process, we chose the most suitable tool from either statistical or machine learning toolbox and justified our analytical decisions.

As missing data is a persistent problem in PISA (Hopfenbeck et al. Citation2018) and in large-scale educational surveys in general (Golino and Gomes Citation2016), educational researchers need to keep themselves informed about different statistical and machine learning imputation techniques. For this task, we chose the RF algorithm implemented in the package missForest based on previous studies showing its effectiveness (Misztal Citation2019, Waljee et al. Citation2013) and checked the quality of imputation by comparing histograms of the imputed data with those of the complete cases. We would suggest that researchers include RF imputation in their toolbox as an effective and unbiased imputation method.
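missForest iteratively fits a random forest on the observed part of each variable and predicts its missing part until the imputations stabilize. The authors used the R package; a comparable Python sketch (an analogue under stated assumptions, not their script) uses scikit-learn's experimental IterativeImputer with a random-forest estimator on a tiny synthetic dataset:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Tiny synthetic example: two strongly correlated columns with missing entries
rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, 2 * x + rng.normal(scale=0.1, size=(100, 1))])
X[::10, 1] = np.nan  # knock out every tenth value in the second column

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)  # no NaNs remain
```

Because the second column is almost a deterministic function of the first, the forest can recover the missing values from the observed ones, which is exactly the situation where RF imputation performs well.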

With a different RF implementation from the package randomForest, we built classification models for scientific and mathematical proficiency levels of German students in PISA 2015 and PISA 2018. We adhered to the three-part division of proficiency levels (below Level 2, Levels 2–4, or Level 5 and above) accepted in PISA (OECD Citation2016b, Citation2019b, Pokropek et al. Citation2018), but another classification task, such as predicting the whole range of proficiency levels or low vs regular academic performance (a binary task), would also be possible. While choosing an instrument for this task, we preferred RF to ordinal regression (O’Connell and Liu Citation2011) and to more sophisticated machine learning models (see Limitations section). A different analytical decision for the same dataset is also feasible (Silberzahn et al. Citation2018). Performance of the models built for the 2015 data did not decrease on the 2018 data, indicating that the same patterns persisted in both years. With the help of model-agnostic methods (variable importance and partial dependence plots), we assessed the relative importance of our variables (attitudes towards ICT, ESCS, and gender) for proficiency levels and visually presented relationships between the predictors and the predicted outcome. Thus, we observed nonlinear relationships between the predictors and the predicted outcome, as indicated by the partial dependence plots. We recommend using model-agnostic methods in educational research to increase interpretability of machine learning models.

As we learnt from the RF models that students’ ICT attitudes were important for predicting their proficiency levels, we decided to explore relationships between these variables and mathematical and scientific literacy in more detail. The data had a hierarchical structure; in our choice of an instrument for multilevel modeling, we relied on previous studies indicating that in our case, HLM was preferable to machine learning models because of the dataset characteristics (Afshartous and de Leeuw Citation2005). With HLM, we obtained information on fixed and random effects, such as significance of relationships, their direction, and effect sizes. The 2018 data revealed the same patterns in associations between ICT attitudes and literacy scores as the 2015 data, and fixed effects estimates in HLM for these two years did not substantially differ.

Combining machine learning and statistical approaches is beneficial for research on large-scale educational surveys, as the former is a valuable tool for finding generalizable patterns, while the latter is useful for testing hypotheses and making statistical inferences. A flexible choice of analytic instruments from both toolboxes depending on the study aims and the dataset characteristics is an effective way of analysing the data.

Limitations and further research

The limitations of the study are related to PISA design and to our goals, which determined our analytical choices. We briefly outline both areas of limitations.

Like any cross-sectional design, PISA allows us only to detect associations between variables, not to establish causal relationships; the question of causality should be left to longitudinal or experimental studies. The following flaws of the PISA sampling design have been mentioned in the literature and summarized by Hopfenbeck et al. (Citation2018) and Zhao (Citation2020): (a) the exclusion of students with disabilities from PISA is problematic, as it prevents them from taking part in policy aiming at educational equity (Schuelka Citation2013); (b) there is a sampling bias towards slower-maturing students, as some students are excluded because of early graduation, dropout, or other forms of attrition (Anderson et al. Citation2009); and (c) the age criterion for selecting participants means that students might have had different exposure to the curriculum, as some of them might have repeated a class or skipped one (Prais Citation2004). Concerns have also been raised about plausible values in PISA, which are constructed even for students who have not taken a single item on the subject (Kreiner and Christensen Citation2014). In addition, attitudes towards ICT, like any self-report measures, might be subject to social desirability bias.

In this study, we chose simple and commonly used implementations of RF to make our analytical procedures more understandable and our R script easily reproducible. A more sophisticated implementation of the algorithm for the classification task (Janitza et al. Citation2014, Ryo and Rillig Citation2017, Strobl et al. Citation2007), with model hyperparameters tuned with the package mlr (Schiffner et al. Citation2016), would probably have led to better model performance. Another limitation, relevant both to the RF and the HLM models, stems from our choice of variables. Both model types would have benefited from adding more variables related to these domains, such as science attitudes in PISA 2015 (Areepattamannil and Santos Citation2019); the RF algorithm could have effectively selected the most relevant variables out of a wider scope of attitudinal and demographic factors. Exploring the whole range of factors that influence mathematical and scientific literacy was not, however, the goal of this study. We intended to use the same variables in all our models, so that attitudes towards ICT could be compared with each other and with ESCS and gender; for HLM, the nested structure of the data (different schools) was also taken into account. The other reason for not including more variables, which is itself a limitation of the study, was the issue of model convergence: HLM with random intercepts and slopes (Harrison et al. Citation2018) failed to converge, and we had to resort to random intercept models. In further research, more advanced statistical and machine learning methodology can be applied (Romero and Ventura Citation2020, Ryo and Rillig Citation2017).

Conclusion

In this study, statistical and machine learning models were used to explore German students’ attitudes towards ICT in PISA 2015 and PISA 2018. We demonstrated the benefits of choosing the most suitable methods from statistical and machine learning toolboxes for data analysis of a large-scale educational survey. For the first task, missing data imputation, and the second, predicting proficiency levels, we used the RF algorithm, while for the third task, exploring associations between literacy scores and attitudes, we resorted to HLM. Findings obtained with our classification models (RF) and regression models (HLM) indicate that ICT autonomy might play a prominent role in mathematics and science domains and needs to be studied further. The flexible way to combine statistical and machine learning approaches can be recommended to educational researchers, as in our age of growing data volumes and complexity, creative adaptability and versatility are indispensable. As Breiman (Citation2001b, p. 204) said, ‘To solve a wider range of data problems, a larger set of tools is needed’.

Acknowledgements

We are grateful to Manuel Prinz for his insightful comments on the first draft, to Junaid Ghauri for thought-provoking discussions, and to anonymous reviewers for their criticism and constructive suggestions that helped us substantially improve the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Appendices

Appendix A. The ICT engagement framework

IC013: ICT Interest

IC013Q01NA. I forget about time when I'm using digital devices.

IC013Q04NA. The Internet is a great resource for obtaining information I am interested in (e.g. news, sports, dictionary).

IC013Q05NA. It is very useful to have social networks on the Internet.

IC013Q11NA. I am really excited discovering new digital devices or applications.

IC013Q12NA. I really feel bad if no internet connection is possible.

IC013Q13NA. I like using digital devices.

IC014: ICT Competence

IC014Q03NA. I feel comfortable using digital devices that I am less familiar with.

IC014Q04NA. If my friends and relatives want to buy new digital devices or applications, I can give them advice.

IC014Q06NA. I feel comfortable using my digital devices at home.

IC014Q08NA. When I come across problems with digital devices, I think I can solve them.

IC014Q09NA. If my friends and relatives have a problem with digital devices, I can help them.

IC015: ICT Autonomy

IC015Q02NA. If I need new software, I install it by myself.

IC015Q03NA. I read information about digital devices to be independent.

IC015Q05NA. I use digital devices as I want to use them.

IC015Q07NA. If I have a problem with digital devices I start to solve it on my own.

IC015Q09NA. If I need a new application, I choose it by myself.

IC016: ICT in Social Interaction

IC016Q01NA. To learn something new about digital devices, I like to talk about them with my friends.

IC016Q02NA. I like to exchange solutions to problems with digital devices with others on the internet.

IC016Q04NA. I like to meet friends and play computer and video games with them.

IC016Q05NA. I like to share information about digital devices with my friends.

IC016Q07NA. I learn a lot about digital media by discussing with my friends and relatives.

Appendix B. Formulae for analysis with plausible values and replicate weights

The instructions given by the PISA Data Analysis Manual (OECD Citation2009) on the computation of statistical estimates and standard errors with plausible values and replicate weights can be summarized as follows:

  1. To calculate a final estimate (e.g. a regression coefficient), estimates for models with all plausible values and the final weight are averaged (OECD Citation2009, p. 118): $\theta = \frac{1}{M}\sum_{i=1}^{M}\theta_i$

  2. The standard error of this final estimate is calculated as follows:

    • (2.1) The sampling variance is calculated by averaging the sampling variances for all plausible values (OECD Citation2009, p. 118): $\sigma^2 = \frac{1}{M}\sum_{i=1}^{M}\sigma_i^2$

    • Each of them is calculated as the averaged (with the coefficient specified below) sum of squared differences between the estimate for plausible value $i$ obtained with the final weight, $\theta_i$, and each of the replicate estimates $\theta_{i(g)}$ (OECD Citation2009, pp. 73–74): $\sigma_i^2 = \frac{1}{G(1-k)^2}\sum_{g=1}^{G}\left(\theta_{i(g)}-\theta_i\right)^2$

    • (2.2) The imputation variance is calculated as the averaged (with the coefficient specified below) sum of squared differences between the final estimate and the estimate for each of the plausible values (OECD Citation2009, p. 100): $B = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i-\theta)^2$

    • (2.3) The final variance is calculated as the sum (with the coefficient specified below) of the sampling variance and the imputation variance (OECD Citation2009, p. 100): $V = \sigma^2 + \left(1+\frac{1}{M}\right)B$

    • (2.4) The standard error of the final estimate is the square root of the final variance (OECD Citation2009, p. 119): $SE(\theta) = \sqrt{V}$

In these formulae, θ is a statistical estimate (e.g. a regression coefficient); σ² is the sampling variance; B is the imputation variance; M is the number of plausible values (in our case, 10); G is the number of replicate weights (in our case, 80); k is a deflation factor, in our case, ‘PISA uses the Fay method with a factor of 0.5’ (OECD Citation2009, p. 73).
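The steps above can be collected into a single function. The sketch below takes the M plausible-value estimates and an M × G matrix of replicate estimates as inputs and follows the formulae, with k = 0.5 as in PISA; it is an illustration of the combination rules, not the authors' R script.

```python
import numpy as np

def final_estimate_and_se(theta_pv, theta_reps, k=0.5):
    """Combine plausible-value and replicate estimates into a final
    estimate and its standard error (OECD, 2009 combination rules).
    theta_pv: length-M estimates, one per plausible value (final weight).
    theta_reps: (M, G) replicate estimates, one row per plausible value."""
    theta_pv = np.asarray(theta_pv, dtype=float)
    theta_reps = np.asarray(theta_reps, dtype=float)
    M, G = theta_reps.shape
    theta = theta_pv.mean()  # step 1: final estimate
    # step 2.1: Fay-method sampling variance per plausible value, averaged
    samp_var = ((theta_reps - theta_pv[:, None]) ** 2).sum(axis=1) \
        / (G * (1 - k) ** 2)
    sigma2 = samp_var.mean()
    # step 2.2: imputation variance across plausible values
    B = ((theta_pv - theta) ** 2).sum() / (M - 1)
    # steps 2.3-2.4: final variance and its square root
    V = sigma2 + (1 + 1 / M) * B
    return float(theta), float(np.sqrt(V))
```

With M = 10 plausible values and G = 80 replicate weights, as in this study, the function would be called once per model coefficient.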