
Are all Pupils Equally Motivated to do Their Best on all Tests? Differences in Reported Test-Taking Motivation within and between Tests with Different Stakes

Pages 95-111 | Received 22 Dec 2014, Accepted 15 Jul 2015, Published online: 25 Jan 2016

ABSTRACT

This study investigated changes in reported test-taking motivation from a low-stakes to a high-stakes test and whether there were differences in reported test-taking motivation between school classes. A questionnaire including scales assessing reported effort, expectancies, perceived importance, interest, and test anxiety was administered to a sample of pupils (n = 375) in 9th grade in direct connection with a national test field trial and then again to the same sample in connection with the regular national test in science. Two-level second-order latent growth modelling was used to analyse data. In summary, the results show a significant increase in reported test-taking motivation from the field trial to the regular test and significant variability in test-taking motivation between classes.

When interpreting results from achievement tests, a basic assumption often made is that every pupil was motivated to give his or her best effort to the test, or at least that there were equal levels of test-taking motivation among pupils. This is possibly a reasonable assumption when the test result is important or otherwise has consequences for the test-taker (high-stakes test). However, in many test situations, such as field trials, school accountability tests, and international comparative studies, the test result is not important for the test-taker personally (low-stakes). Thus, such assumptions might not hold and the test result might reflect not only the pupils’ knowledge in the area of interest, but also, to varying degrees, their level of motivation to do their best on the given test (i.e., their test-taking motivation). If test-taking motivation is similar within certain groups of individuals (e.g., school classes or countries), but different across groups participating in the same assessment, this could lead to systematically biased estimates of, for example, item difficulty, school performance, and differences between groups. Systematic errors are easy to deal with if we know what they are. If we do not know their magnitude, however, systematic errors are serious since they, unlike random error, do not cancel out (Kane, 2011). Thus, knowledge of test-taking motivation and how it can differ over individuals, groups of individuals, or test situations is important for valid interpretation of many achievement test results (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 2014; Sundre & Kitsantas, 2004; Wise & DeMars, 2005).

How pupils perceive tests, in terms of, for example, test anxiety, self-efficacy, perceived importance, and interest, can also affect their motivation for learning and their general achievement results (Harlen & Deakin Crick, 2003). For example, if too much focus is placed on assessment, pupils seem to be more oriented towards performance goals, be more anxious, have lower levels of self-efficacy/competence beliefs, and use less effective learning strategies (Harlen & Deakin Crick, 2003; Ryan & Deci, 2009). It has further been argued that tests motivate only some pupils and increase the gap between higher and lower achieving pupils (Harlen & Deakin Crick, 2003; Ryan & Deci, 2009). Thus, with a changing test climate in several countries (Stobart & Eggen, 2012), monitoring pupils’ perceptions of different tests is important not only for valid interpretation of test results, but also in a more general psycho-educational sense.

Although there is a growing body of research on situation-specific test-taking motivation, there is still limited research with samples of young adolescents. Further, few, if any, studies have compared test-taking motivation across authentic high- and low-stakes test situations for the same sample of pupils, which we aimed to do in the present study. The assessment context in the present study is two versions of the Swedish national tests in science for 9th grade pupils: a field trial, where items for the national test are tried out, and a regular national test.

The Swedish National Assessment System

The Swedish school system has undergone major reform over the past decades. As a part of this, national tests have been introduced in more subjects and for more age groups than before (Eklöf & Nyroos, 2013). Although the national tests constitute a significant part of pupils’ schooling, there is still limited research on how the pupils perceive these tests (see, however, Eklöf & Nyroos, 2013; Nyroos & Wiklund-Hörnqvist, 2011; Silfver, Sjöberg, & Bagger, 2013).

The Swedish national tests are mandatory for all pupils participating in regular schooling. The aim of the tests is to support equal and fair assessment and grading and to provide information on the extent to which curricular goals are reached. As test scores are supportive, rather than decisive, of pupil grades, the national test might not be a high-stakes test in the proper sense. Still, the national tests are the highest-stakes tests in Swedish compulsory education: they are the only external tests used on a regular basis, there is a high level of agreement between national test scores and final course grades (and where there is not, schools are criticized) (Swedish National Agency for Education, 2014), and there is empirical evidence suggesting that the national tests are perceived as high-stakes by pupils (Eklöf & Knekta, 2014). For simplicity, the regular national test is referred to as a “high-stakes test” in this paper.

To ensure high quality of large-scale achievement tests, each new item has to be evaluated in one or several item try-outs (AERA, APA, & NCME, 2014). During the process of developing the Swedish national tests, item quality, in terms of pupils’ understanding of the items, item scoring rubrics, group differences, item difficulty, and discrimination, is evaluated in try-outs in the target population. These item try-outs are administered and implemented in separate processes, so-called field trials. Schools are invited to participate in field trials of sets of test items, and participation is voluntary. Unfortunately, field trials are often performed under conditions that are considered lower-stakes for the test-taker than the regular test. Thus it may be difficult to motivate test-takers to perform in the same way they would have during the regular test (Wendler & Walker, 2009), and this might affect the validity of the inferences drawn from the field trials. Data simulation studies have shown that low test-taker motivation, in terms of a higher probability of omitted items and random guessing, has a large influence on item parameter estimates during test construction, and if pupils’ results are spread due to motivation rather than ability, an item can be believed to be more discriminating than it actually is (van Barneveld, 2007). Further, because field trials are sometimes conducted on a relatively small scale, the validity of the results can be especially sensitive to a group of unmotivated pupils, and thus monitoring pupils’ motivation is important for interpreting the results. Although field trials in general are considered low-stakes, very little research has actually examined pupils’ test-taking motivation when taking them. In the context of field trials for the Swedish national tests, no studies concerning pupils’ test-taking motivation have previously been conducted. Hence, we do not really know how these tests are perceived. In the present study, the field trial is referred to as a “low-stakes test” even if the perceived stakes may vary across pupils and classes.

Theoretical Rationale

Within this study, we define test-taking motivation as the extent to which test-takers are motivated to give their “best effort to the test, with the goal being to accurately represent what one knows and can do” (Wise & DeMars, 2005, p. 2). Thus, it is defined as a state and concerns a specific type of achievement motivation focusing on a specific situation (i.e., a specific test). Theoretically, this study is based on the expectancy-value theory of achievement motivation (Eccles et al., 1983; Wigfield & Eccles, 1992, 2000, 2002). According to the theory, student motivation (expressed as performance, persistence, and task choice) depends on expectations for success and the value placed on the task. The expectancy component is divided into ability beliefs and expectancies (Wigfield & Eccles, 2002). The value component is divided into importance, interest, utility, and cost (Eccles et al., 1983). The theory is widely used as a theoretical framework for understanding domain-specific motivation. Also, much of the previous research on test-taking motivation refers to the propositions of the expectancy-value framework (e.g., Liu, Bridgeman, & Adler, 2012; O'Neil, Abedi, Miyoshi, & Mastergeorge, 2005; Thelk, Sundre, Horst, & Finney, 2009). Still, only a few studies concerning test-taking motivation have measured more than three constructs of the theory (see, however, Cole, Bergin, & Whittaker, 2008; Knekta & Eklöf, 2015; Penk, Pöhlmann, & Roppelt, 2014). Based on Wigfield and Eccles's (1992, 2000, 2002) definition of motivation, applied to a specific test situation, we treat invested effort on the particular test as an overall measure of test-taking motivation that depends on expectancies for success on the test (expectancies), the personal importance of doing well on the test (importance), inherent immediate enjoyment of taking the test (interest), the importance of the test for some future goals (utility), and the negative aspects of engaging in the test, here operationalized as experienced test anxiety (cost). Invested effort, in turn, is hypothesized to directly affect test performance. The expectancy-value theory does not explicitly state how the different constructs are affected when the consequences of a task change. However, it seems reasonable to assume that constructs that relate more to intrinsic motivation, such as interest (Wigfield, Tonks, & Lutz Klauda, 2009), would be less affected by the stakes of the test than constructs that depend more on situated external factors, such as utility and importance (Eccles et al., 1983; Wigfield et al., 2009). Test anxiety is here conceived as a state, and previous research has suggested that higher assessment stakes are associated with higher levels of test anxiety (e.g., Segool, Carlson, Goforth, von der Embse, & Barterian, 2013). Following the theoretical discussion by Simzar, Martinez, Rutherford, Domina, and Conley (2015), we assume expectancy beliefs to be relatively stable over stakes.
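As a rough formalization of this hypothesized structure, the relations can be written as a pair of structural equations. This is our own linear rendering for illustration; expectancy-value theory itself does not prescribe a functional form:

```latex
\begin{align}
  \text{Effort}_i &= \beta_1\,\text{Expectancies}_i + \beta_2\,\text{Importance}_i
    + \beta_3\,\text{Interest}_i + \beta_4\,\text{Utility}_i
    + \beta_5\,\text{Anxiety}_i + \zeta_i \\
  \text{Performance}_i &= \gamma\,\text{Effort}_i + \varepsilon_i
\end{align}
```

Here, Anxiety operationalizes the cost component (so the coefficient on it would be expected to be negative or small), and the second equation captures the hypothesized direct effect of invested effort on test performance.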

Previous Research on Test-Taking Motivation

There are few studies that have compared pupils’ reported test-taking motivation between tests with different stakes, and the majority of these studies are based on experimentally induced differences, such as different test instructions or incentives for correct items (Baumert & Demmrich, 2001; Liu et al., 2012; Sundre & Kitsantas, 2004; Wolf & Smith, 1995), or on comparisons between different samples taking different types of tests in different contexts and at different points in time (Eklöf & Knekta, 2014; Thelk et al., 2009). The studies performed have shown that pupils tend to report higher effort, importance, and/or test anxiety on high-stakes tests as compared to low-stakes tests (Eklöf & Knekta, 2014; Liu et al., 2012; Segool et al., 2013; Sundre & Kitsantas, 2004; Thelk et al., 2009; Wolf & Smith, 1995). However, Baumert and Demmrich (2001) found only small differences in effort, test anxiety, importance, and utility between stakes.

Pupils’ test-taking motivation also seems to vary within low-stakes situations. In a recent review, DeMars, Bashkov, and Socha (2013) showed that the existing literature reports small but consistent gender differences in test-taking motivation on low-stakes tests, with females generally reporting higher effort. Boekaerts (2002) reports several gender differences in test-taking motivation. For example, boys score higher on emotional state (e.g., feeling more at ease and more enthusiastic) before and after doing tasks. They also score higher on results assessment (how well they thought they did). Other studies have found no or only minor gender differences (Abdelfattah, 2010; Eklöf & Nyroos, 2013). Effort spent during the test also seems to differ between individuals within tests depending on personality traits such as openness and agreeableness (Barry, Horst, Finney, Brown, & Kopp, 2010). Cole et al. (2008) found that the relationships between motivational constructs vary depending on the subject of the test, and they therefore recommended that researchers consider subject in future test-taking motivation studies. On the other hand, Eklöf and Nyroos (2013) found small differences between groups of pupils taking the Swedish national test in science in different subjects. Variations in task persistence within a test between classes and countries have been shown in the TIMSS context (Boe, May, & Boruch, 2002). Further, a study performed in higher education contexts has shown that supportive and active test administrator behaviour can make a significant difference in the level of test-taking motivation between groups within a low-stakes test situation (Lau, Jones, Anderson, & Markle, 2009). Drawing on these findings, and the fact that teachers, schools, and classrooms are considered important for pupils’ achievement motivation in general (e.g., Schunk, Pintrich, & Meece, 2010; Stroet, Opdenakker, & Minnaert, 2013; Urdan & Schoenfelder, 2006), it seems reasonable to assume that there might be differences in test-taking motivation between school classes within low-stakes tests. These differences could depend on, for example, how the teacher/test administrator communicates the importance of the test or on social relationships in the classroom. If there are considerable differences in test-taking motivation between different classes, and if this causes performance differences between classes, this might be problematic when interpreting results from small-scale try-outs or when achievement test results are used for comparative purposes. For example, conclusions from an item try-out regarding the difficulty of a specific test item will depend on whether the try-out was performed in a high- or low-motivated class, or, when test results are used for accountability purposes, differences between schools or classes might be due to test-taking motivation rather than knowledge.

Test-Taking Motivation and Performance

Although the level of test-taking motivation might differ between and within tests with different stakes, serious effects on the validity of the interpretation and use of test results are only likely if test-taking motivation affects test performance. Several studies have shown differences in test scores between test stakes and correlations between different constructs of test-taking motivation and performance (Abdelfattah, 2010; Eklöf, Japelj, & Grønmo, 2014; von der Embse & Hasson, 2012; Wise & DeMars, 2005). In some studies, effort/persistence has been an equally strong or stronger predictor of test performance than college admission test scores (Cole et al., 2008) and content knowledge (Boe et al., 2002). Sundre and Kitsantas (2004) showed that test-taking motivation, measured as effort and importance, predicted pupil performance in low-stakes conditions (14% of the variance in test scores accounted for) but not in high-stakes conditions. Further, differences in performance between high- and low-stakes tests have been shown to be larger for constructed response items than for multiple choice items (DeMars, 2000). On the other hand, O'Neil et al. (2005) found no correlation between effort and test performance. Also, studies have indicated almost zero correlation between ability and effort (Wise & Kong, 2005). Thus, reported effort in the test situation does not seem to be merely a proxy for ability.

A majority of the studies on test-taking motivation have been conducted in higher education contexts. As pupils’ motivational beliefs develop over time (Wigfield et al., 2009), and as their perceptions of different tests can be assumed to change with increased experience of testing, the results from these studies might not generalize to younger pupils.

Study Objectives

The main objectives of the present study were to investigate (1) changes in reported test-taking motivation for the same groups of pupils from a field trial (low-stakes test) to a regular national test (high-stakes test) and (2) whether there are differences in reported test-taking motivation between school classes. In our analyses, we also looked at the role of gender and subject in relation to test-taking motivation, as well as the relationship between test-taking motivation and test performance.

Methods

Participants

The sample included 375 pupils (47% girls) in 26 classes in 17 municipalities in Sweden, all participating both in a national test field trial and then in a regular national test. Of these, test scores were submitted for 361 pupils (96%) on the field trial and for 229 pupils (61%) on the regular national test, and preliminary course grades were submitted for 287 pupils (77%). The pupils were all in 9th grade (last year of compulsory schooling in Sweden, 15–16 years old). Preliminary grades for the pupils in the sample were representative of the final course grades in the Swedish population of 9th grade pupils.

Instruments

The motivation questionnaire.

The instrument used for measuring test-taking motivation included 15 items and consisted of five subscales representing five constructs of the expectancy-value theory of achievement motivation (Wigfield & Eccles, 2000): Effort (4 items, e.g., “I did my best on this test”), Expectancies (2 items, e.g., “I did well on this test”), Importance (3 items, e.g., “This was an important test to me”), Interest (3 items, e.g., “It was fun to do this test”), and Test Anxiety (3 items, e.g., “I was scared of failing on this test”). All items were rated on a 4-point Likert-type scale ranging from 1 (strongly disagree) to 4 (strongly agree). One item was reverse worded. Coefficient ω ranged from ω = .67 (test anxiety) to ω = .76 (effort) for the subscales, and model fit for the structural equation model of the questionnaire (where effort was modelled as the outcome of motivation) was adequate (field trial, n = 375: Comparative Fit Index (CFI) = .91, Root Mean Square Error of Approximation (RMSEA) = .070, Standardized Root Mean Square Residual (SRMR) = .064; regular test, n = 375: CFI = .91, RMSEA = .062, SRMR = .064). An invariance study on partly the same sample has shown that the questionnaire had partial scalar invariance over the field trial and the regular test (Knekta & Eklöf, 2015). For a more detailed description of the motivation questionnaire, see Knekta and Eklöf (2015).
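Coefficient ω for a congeneric scale can be computed from the factor loadings and residual variances of a one-factor model. The sketch below shows the standard formula; the loadings and residual variances are purely hypothetical and are not the study's estimates:

```python
import numpy as np

def mcdonalds_omega(loadings, residual_variances):
    # omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances)
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_variances, dtype=float)
    common = lam.sum() ** 2
    return common / (common + theta.sum())

# Hypothetical standardized loadings and residual variances for a 4-item scale
print(round(mcdonalds_omega([0.70, 0.60, 0.65, 0.55], [0.51, 0.64, 0.58, 0.70]), 2))  # -> 0.72
```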

The national test in science.

The regular national test in this study included a paper-and-pencil part with 12–14 items (14–23 sub-items) and a laboratory part with 3 items (6 sub-items). About a fifth of the sub-items were multiple choice or matching items, a fifth were short-answer items, and three fifths were longer, constructed response items. Each item was designed to assess one or several skills (ability to describe and employ concepts, models, and theories; scientific methods and how science has developed and influenced us; and use of scientific and other arguments). The sub-items were scored as pass or fail at one or several levels (pass, pass with distinction, and pass with special distinction). The pupils had 210 minutes to complete the test and could be awarded a maximum of 39 points. Each school was assigned to complete the test in either biology, physics, or chemistry. The tests in the different subjects are based on the same test model and are designed to assess the same skills and to have the same degree of difficulty. Coefficient alphas for the tests were .87 for biology, .90 for physics, and .89 for chemistry. The field trial at this particular administration consisted of 10 different sets of items, with between 4 and 7 items each, from the theoretical part or the laboratory part. The main focus of this field trial was to evaluate the pupils’ understanding of the test items. The item scoring rubrics were still to be refined, for instance on the basis of explanatory pupil answers received from the field trial. Achievement test scores from the field trial might therefore not be very reliable and must be interpreted with caution.

Procedure

The motivation questionnaire was first administered to all schools (n = 49) participating in a national test field trial and then again to the same schools in connection with a regular national test, one to three weeks later. Questionnaires were returned from 47 schools for the field trial and 17 schools for the regular test. Only pupils who completed the questionnaire on both test administrations were included in this study. Teachers were requested to hand out the questionnaire immediately after completion of the achievement test. The questionnaire took around three minutes to complete. Teachers were asked to return the completed questionnaires together with test scores and preliminary grades in the subject.

The large difference in response rate between the field trial and the regular test is probably due to the different administrative procedures for the two tests. At the field trial, the questionnaire was distributed in the same process as the test items, while at the regular test the teacher had to distribute it separately. Because distributing the questionnaire at the regular test was both time consuming and voluntary, many teachers chose not to participate. This could lead to a biased sample, with only the most “test positive” teachers, and potentially also more “test positive” pupils, participating. However, we still consider the comparison of motivation over the different test situations to be worthwhile.

Statistical Analysis

Descriptive statistics.

First, data were screened using descriptive statistics. Item means and composite mean scores for each scale were calculated. Data were analysed with respect to skewness, kurtosis, multicollinearity, and outliers. The variance inflation factor (VIF) was used to screen for multicollinearity, and influence values were used to screen for outliers at both the individual and the class level.
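For readers who want to reproduce this kind of VIF screening, the following is a minimal sketch using statsmodels; the data are simulated and the variable names are ours, not the study's:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated stand-in for the predictors used in the screening regressions
rng = np.random.default_rng(1)
n = 375
X = pd.DataFrame({
    "gender": rng.integers(0, 2, n) - 0.5,   # -0.5 = girl, 0.5 = boy (coding used in Model 3)
    "biology": rng.integers(0, 2, n),        # subject dummy
    "effort": rng.normal(2.8, 0.8, n),       # composite scale mean
})
exog = sm.add_constant(X)
for i, col in enumerate(exog.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(exog.values, i), 2))
# VIF < 2 for all predictors would, as in the study, suggest no multicollinearity concern
```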

Two-level second-order latent growth models.

For each of the motivational constructs—effort, expectancies, importance, interest, and test anxiety—two-level second-order latent growth models were used to assess the effect of test situation, gender, and subject on test-taking motivation. The effects of test-taking motivation and preliminary grades on test score, and the variances in test-taking motivation between test situations, individuals, and classes, were also assessed (Figure 1). First-level units were pupils (n = 375) and second-level units were classes (n = 26). The pupils’ responses on the questionnaire were treated as continuous.

Figure 1. Schematic picture for the two-level second-order latent growth model, Model 5. Note: Effort is used as an example. E1 to E4 represent questionnaire items that form the latent variable effort at the field trial and the regular test. The intercept and slope represent the latent growth factors. Small oblique arrows illustrate estimated variances and residual variances (arrows for questionnaire items were excluded to enhance the clarity of the picture).

We chose a multilevel approach because our data consist of observations nested within individuals and individuals nested within classes. We cannot assume homogeneous variance over the two time points or over the different classes. Thus, basic assumptions for the paired-samples t-test and repeated-measures analysis of variance are violated, and using either of these methods to estimate differences between the field trial and the regular test, or between schools, would give downwardly biased standard errors (Hox, 2010). Further, multilevel modelling allows for the inclusion of variables at several levels (i.e., gender at the individual level and different subjects at different time points and for different classes). Consequently, data do not need to be aggregated or disaggregated, which could lead to loss of power, inflated operational alpha levels, or interpretation of effects at the wrong level (Hox, 2010).

Second-order latent growth modelling was chosen to estimate change over time (Newsom, 2002; Voelkle, 2007). Latent growth models can handle heterogeneous error structures, and the growth curve can be incorporated into more complex models (Newsom, 2002). For example, latent constructs can be defined by multiple items, whereby measurement errors are taken into account (Newsom, 2002). In our growth model, there were two parallel first-order latent variables for each pupil, each defined by multiple items in the questionnaire (e.g., a latent effort variable for the field trial and the regular test, respectively; Figure 1). The second-order factors are the growth parameters: the intercept and the slope. The intercept represents the initial value of the motivational construct and the slope the change in the construct between the field trial and the regular test. In line with conventional multilevel modelling, the paths from the intercept to the two latent first-order factors were fixed to one, and the loadings for the slope factor were set to 0 and 1 to define linear growth (Figure 1). Because there were only two time points (dyadic/two-wave data), the model would not be identified if intercept variance, slope variance, the covariance between slope and intercept, and unique residual variances for the latent variables at each time point were all allowed to be freely estimated at the lower level (Newsom, 2002; Voelkle, 2007). Therefore, slope variance and the covariance between slope and intercept were set to 0 at the first level. The residual variances for the latent variables were still freely estimated because we expected the variance at the field trial to differ from the variance at the regular test.
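In equation form, the within-pupil part of the model just described can be sketched as follows (our notation, a simplified rendering):

```latex
\begin{align}
  y_{ijt} &= \nu_j + \lambda_j\,\eta_{it} + \varepsilon_{ijt}
    && \text{item } j \text{ loading on the first-order factor at time } t \\
  \eta_{i1} &= \pi_{0i} + 0 \cdot \pi_{1i} + \zeta_{i1}
    && \text{field trial} \\
  \eta_{i2} &= \pi_{0i} + 1 \cdot \pi_{1i} + \zeta_{i2}
    && \text{regular test}
\end{align}
```

Here, π0i is the latent intercept and π1i the latent slope for pupil i; at the pupil level, the slope variance and the intercept-slope covariance are fixed to zero for identification, as described above, while the residuals ζi1 and ζi2 are estimated freely.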

At the second level, intercept variance, slope variance, and the covariance between the slope and intercept were introduced in a stepwise manner (see below). Residual variances for the first-order latent variables were set to zero. In general, residual variances for the questionnaire items were freely estimated. However, if the model estimation did not terminate normally because of a negative residual variance for an item at the between level, the residual variance at the between level for this particular item was set to 0.001. Means for the latent growth factors and the intercepts for the items were estimated at the between level. Scalar invariance across levels and partial scalar invariance across time were assumed for the instrument. Based on a previous invariance study across test conditions with partly the same sample, the intercepts for two items were free to vary over time (Knekta & Eklöf, 2015).

Data analyses were run in Mplus 7.11 (Muthén & Muthén, 1998–2012), and maximum likelihood estimation with robust standard errors was used for model estimation. Following the recommendation by Schermelleh-Engel, Moosbrugger, and Müller (2003), the following criteria were used to evaluate the adequacy of the models: CFI > .90, RMSEA < .08, and SRMR < .10 were considered adequate fit, and CFI > .95, RMSEA < .05, and SRMR < .05 were considered good fit.

Five different multilevel latent growth models were specified for each of the motivational constructs:

  1. A null model with an intercept with a fixed part (the grand mean of, e.g., effort over all classes) and a random part (variability in individual intercepts and variability in class intercepts). No fixed factors or variations in slopes were included. The aim of this model was to look at the intraclass correlation (i.e., whether there are any differences in the reported construct between classes; see the computational sketch after this list).

  2. For the second model, the slope in the growth model was allowed to vary at the class level and a covariance between the intercept and the slope in the growth model part was added. Thus random variation in the change in the constructs between the field trial and the regular test was allowed.

  3. For the third model, the fixed effect of gender (-0.5 = girl, 0.5 = boy) at the individual level was added. Cohen's d for differences in motivation between stakes was calculated by dividing the latent growth factor (fixed part of the slope, s) by the square root of the total variance: d = s / √[(σ²effort, field trial + σ²effort, regular test)/2 + σ²intercept, individual + σ²intercept, class] (see the computational sketch after this list).

  4. For the fourth model, fixed effects at the second level were added. Subject was added to see how much of the variance between classes could be explained by the classes taking the test in different science subjects. Subject was represented by two dummy-coded variables, one coded 1 if the test was taken in biology and one coded 1 if the test was taken in physics, with chemistry as the reference category. Further, the intercept and slope from the growth model were regressed on a dummy variable coded 0 if the field trial and regular test were taken in the same subject and 1 if the tests were taken in different subjects.

  5. To analyse the relationships between the latent motivational constructs and the test score, test score and preliminary grades were added to the third model described above. Test score was regressed on the latent motivational construct and preliminary grade. Before the analysis, achievement test scores were rescaled to relative scores, ranging from 0 (0% correct) to 1 (100% correct).
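As a computational companion to Models 1 and 3 above, the following sketch shows how the intraclass correlation and the Cohen's d formula could be evaluated once the variance components have been extracted from the fitted models. The numbers below are illustrative only, not estimates from the study:

```python
import math

def intraclass_correlation(var_class, var_individual):
    # Model 1: share of the total intercept variance located at the class level
    return var_class / (var_class + var_individual)

def cohens_d(slope_mean, var_t1, var_t2, var_int_individual, var_int_class):
    # Model 3: latent slope mean divided by the square root of the total variance,
    # d = s / sqrt((var_t1 + var_t2) / 2 + var_int_individual + var_int_class)
    total = (var_t1 + var_t2) / 2 + var_int_individual + var_int_class
    return slope_mean / math.sqrt(total)

# Illustrative (made-up) variance components for an effort-like construct
print(round(intraclass_correlation(0.04, 0.46), 2))      # -> 0.08, i.e., 8% at class level
print(round(cohens_d(0.56, 0.30, 0.20, 0.35, 0.04), 2))  # -> 0.7
```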

Results

Descriptive Statistics

The items had between 0.5 and 8% missing values, with two items having > 5% missing values. The means for the questionnaire items ranged between 1.8 and 3.0 at the field trial and between 2.1 and 3.7 at the regular test (1 = strongly disagree, 4 = strongly agree). All items except one (Note 1) had higher means at the regular test. The proportion of class-level variance (Note 2) for the items varied between 1.1 and 17%. Plots of the observed composite means of the different constructs at the field trial and the regular test in different classes in general indicated an increase between the two test situations as well as variability between classes (Figure 2). All questionnaire items except one (Note 3) had skewness < |1.1| and kurtosis < |2.2| in both test situations. The SDs ranged between 0.73 and 1.05 for the field trial and between 0.51 and 0.96 for the regular test. The mean relative test score was .36 (SD = 0.23) for the field trial and .47 (SD = 0.18) for the regular test. The skewness and kurtosis of the test scores were < |0.6|. Thus, the data were considered approximately univariate normal. No multicollinearity (VIF < 2) was indicated when regressing the motivational constructs on gender and subject or when regressing test score on the mean composite scores of all motivational constructs, preliminary grades, gender, and subject. Two classes, and a few individuals in those two classes, had high influence values and were thus possible outliers. Analyses without these outliers did not change the substantive interpretation of the models. Further, we could find no theoretical reason why these classes and individuals should be classified as outliers. It was decided to keep all classes and individuals in the analysis.

Figure 2. Observed mean composite score for the different test-taking motivation constructs at the field trial and the regular test for the 26 classes.

Two-Level Second-Order Latent Growth Models

The model fit indices CFI, RMSEA, and SRMR overall indicated that the model with gender as the only fixed effect (Model 3) had the best model fit for all constructs (Table 1). The CFI was > .91, RMSEA < .07, and SRMR < .06; thus, the models had adequate or good model fit. However, the chi-square values for all models had p-values < .05.

Table 1. Model Fit and Parameter Estimates for Two-Level Second-Order Latent Growth Models for Effort, Expectancies, Importance, Interest and Test Anxiety at the Field Trial and the Regular Test (n = 375)

The intraclass correlations, based on Model 1 variances, showed that 8, 3, 9, 11, and 6% of the variation in the motivational constructs effort, expectancies, importance, interest, and test anxiety, respectively, could be explained by variation among classes. Thus, there seems to be between-class variance in the different constructs. Allowing the classes to have different slopes and the slope and intercept at the class level to covary (Model 2) increased model fit for all constructs (Table 1).

Mean values at the field trial for effort, expectancies, importance, interest, and test anxiety were 2.84, 2.61, 2.56, 2.06, and 2.25, respectively (Table 1, Model 3, fixed part intercept). All constructs showed a significant increase between the field trial and the regular test (Table 1, Model 3, fixed part slope). Cohen's d was 0.89, 0.30, 1.10, 0.33, and 1.01 for effort, expectancies, importance, interest, and test anxiety, respectively. Overall, the variance was higher at the field trial, with the largest difference in variance for importance. Thus, perceptions of the field trial differ more among the pupils than perceptions of the regular test do.

For all constructs except expectancies, there was significant variability among classes at the field trial (Table 1, Model 3, class intercept). For effort, importance, and test anxiety, the mean change between the field trial and the regular test also varied across classes (Table 1, Model 3, class slope). The covariance between the intercept and slope was significant for test anxiety and importance (Table 1, slope × intercept): the higher the test anxiety and importance at the field trial, the smaller the change to the regular test.

Gender had a significant (p < .05) effect on expectancies, importance, and test anxiety (Table 1). Girls in general reported higher perceived importance and test anxiety and lower expectancies. The effect was largest for test anxiety.

The models that included subject, and whether the field trial and the regular test were taken in the same or a different subject, fit the data poorly for most constructs and in all cases worse than the model without subject. Interpretations concerning the effects of these variables are therefore tentative and are not included in the tables. Adding subject did not substantially affect the results concerning variability among classes, and significant effects were found only for a few variables. The reported expectancies at the field trial seemed to be negatively affected if the pupils knew they were going to take the regular test in another subject. Further, pupils reported higher perceived importance if the field trial was taken in physics or biology, and the change in perceived importance was larger if the field trial and regular test were taken in the same subject.

Relation to Performance

Effort, expectancies, importance, and interest were significant predictors of test score, after preliminary grades had been accounted for, at both the field trial and the regular test (Table 2). Test anxiety was not significantly related to performance. At the field trial, effort and expectancies had a stronger effect on performance than preliminary grades did. It should be noted that the models including the regression of test scores on preliminary grades and the different motivational constructs fit the data poorly for most constructs, and in all cases worse than the model without grades and test score. As the quality of the performance data from the field trial can also be questioned, these results need to be interpreted with caution.

Table 2. Model Fit and Selected Fixed Parameter Estimates for Test Score Regressed on the Motivational Constructs and Preliminary Grades at the Field Trial and the Regular Test (Model 5, n = 287)

Discussion

The primary aim of this study was to investigate possible changes in reported test-taking motivation from a low-stakes to a high-stakes test, and to explore whether there seemed to be variation in test-taking motivation between classes. This was done using a sample of pupils in the 9th grade participating both in a field trial for a national test and in a regular national test. Test-taking motivation was assessed with a self-report instrument containing five different subscales, covering different motivational constructs: effort, expectancies, importance, interest, and test anxiety.

The study is motivated by the fact that few previous studies have compared reported levels of test-taking motivation in young adolescent samples across authentic low-stakes and high-stakes settings. This is, however, a highly relevant issue to explore, particularly in contexts like the present one, where the high-stakes test is dependent on the findings from the low-stakes test. Even fewer studies have investigated possible differences in test-taking motivation between different school classes, although this also is a relevant topic, particularly in low-stakes assessment contexts. Differences at the between-class level could suggest that the teacher and his or her behaviour is important for the pupils’ motivation at low-stakes tests, and that it may be difficult to validly evaluate findings from a field trial if different groups of pupils have been differently motivated.

The first research question, concerning changes in reported test-taking motivation from the field trial to the regular national test, was analysed with latent growth factors. In summary, the results show a significant increase in all motivational constructs from the field trial to the regular test. Effect sizes (Cohen's d) ranged between 0.30 and 1.10. Thus, pupils reported that they spent more effort, expected to do better, perceived the test as more important and more interesting, and experienced more test anxiety at the regular test. The largest differences between the two test situations were seen for reported effort, importance, and test anxiety, while there were rather small changes for expectancies and interest. This follows the assumptions we made concerning changes in situation-specific motivation when stakes change. Similar differences in effort, importance, and test anxiety between stakes have been reported by Eklöf and Knekta (2014), Lau et al. (2009), Liu et al. (2012), Thelk et al. (2009), and Wolf and Smith (1995). Effect sizes for effort and importance in the present study were lower than those reported by Thelk et al. (2009: Cohen's d = 1.3 for effort and 2.3 for importance) and Wolf and Smith (1995: Cohen's d = 1.45; Note 4) and higher than those reported by Liu et al. (2012: .57 SD difference; Note 4) and Lau et al. (2009: Cohen's d < 0.57 for effort), but then the assessment conditions also differed between the studies. The first two studies analysed differences between true high-stakes and true low-stakes tests (the test counted towards grades or not), while the latter two studies analysed differences between pupils taking the same low-stakes test, with differences in stakes induced by instructions (Note 5) or by different strategies used by the test administrator. Concerning variation in the data between the different test situations, it was observed that, overall, the variance was higher at the field trial than at the regular test. The largest difference in variance between the two tests was seen for importance. These findings also make sense: in the low-stakes condition, where the consequences of the test are less clear, pupils vary more in how they perceive the test and how motivated they feel to do their best. When an “external consequence” or extrinsic motivator is added to the test in the regular test condition (teachers have to consider the test result as one of their sources for grading the pupils, even if the test score is not decisive of pupil grades), all pupils have a similar motivator to relate to, the test is on average perceived as more important, and pupils therefore become more similar with respect to reported level of motivation. Lower variance in high-stakes settings compared to low-stakes settings has also been reported earlier (e.g., Lau et al., 2009; Sundre & Kitsantas, 2004; Thelk et al., 2009). When performing field trials with relatively small sample sizes, as is the case in the Swedish national test system, high variability in motivation might pose a real threat to the reliability and validity of the test results, particularly as different groups of pupils take different item sets.

The second research question concerned whether there were differences in reported test-taking motivation between school classes. If variability in motivation is found also at the class level, the inferences made on the basis of the test results may depend on which class or classes happened to do that particular field trial. In our study, between 3 and 11% of the variance in the test-taking motivation constructs was found at the class level. Not only was there variability in motivation between classes, but the mean change between the field trial and the regular test also varied across classes for effort, importance, and test anxiety. This could make, for example, an overall correction of item difficulty from the field trial to the regular test problematic. As our second-level sample size is small (n = 26), multilevel modelling might underestimate group-level variances and their standard errors (Maas & Hox, 2004). It follows that differences between classes might be larger than estimated by the model, but at the same time we run the risk of too often concluding that differences are significant when they are not. On the other hand, if we had ignored the multilevel structure of the data, standard errors would have been underestimated too.

In our study, all test-taking motivation constructs except test anxiety predicted pupil performance at both the field trial and the regular test, even when preliminary course grades were accounted for. Similar results have been reported by Cole et al. (2008). In our model, the test scores were relative scores ranging from 0 to 1; the mean change from the field trial to the regular test in, for example, effort was 0.56; and the unstandardized regression coefficient (performance on effort) was 0.13. Applied to a hypothetical field trial with a maximum score of 40, this would mean that if a student invested the same effort in the field trial as in the regular test, her field trial result would on average increase by about three points (0.13 × 0.56 × 40 ≈ 2.9). This could be considered a relatively small effect, but this number is averaged over individuals, classes, and item types. It could be argued from a theoretical perspective, and previous studies have shown, that test-taking motivation has different effects on different types of items, and thus some items might be more affected by low test-taking motivation than others (see DeMars, 2000; Sundre & Kitsantas, 2004). It must be noted that not all pupils completed the same set of items at the field trial and that the scoring rubrics were at an early stage of development. Consequently, conclusions regarding relationships between test-taking motivation and performance must be seen as preliminary, but they do suggest, much in line with previous research, that test-taking motivation has a non-negligible effect on performance.

Previous studies as well as a recent review have suggested that girls in general report higher levels of effort and motivation on low-stakes tests (DeMars et al., 2013). Contrary to this, we found no significant differences in invested effort between girls and boys. Our results are, however, similar to those of previous studies in the Swedish context, with non-significant gender differences in invested effort, while girls reported higher test anxiety (Eklöf et al., 2014; Eklöf & Nyroos, 2013). As also described by Boekaerts (2002), girls scored lower on expectancies. In the present study we also looked at the effect of subject, as not all pupils take the national test in science in the same subject, and our findings suggest that subject did not have a substantial effect on test-taking motivation (as also indicated by Eklöf & Nyroos, 2013). Differences between classes persisted even when subject was included in the model.

Methods to Reduce the Effect of Low Test-Taking Motivation

The high costs, workload, and policy consequences of many tests that are perceived as low-stakes from the test-taker's perspective make it worthwhile to invest effort in securing the quality of these tests. Test-taking motivation is one important quality aspect to consider. If there are indications of low or varying test-taking motivation in a particular test situation, the results from pupils or groups with low motivation should possibly be filtered out, as described by Swerdzewski, Harmes, and Finney (2011), or at least interpreted with care. Even better than post-test adjustments would be to increase, or at least equalize, pupils’ test-taking motivation before and during the test situation. In our study, effort, importance, and test anxiety were the constructs that varied most between stakes, and the strongest relationships with test score were found for effort and expectancies. Further, for effort, importance, and interest, more than 7% of the variance at the field trial was found at the class level. Thus, to obtain more similar motivation over tests with different stakes, and across pupils in different classes, effort seems to be the most important construct to work with. However, because the other constructs directly or indirectly relate to effort (Cole et al., 2008; Knekta & Eklöf, 2015), these could be useful to consider too. One solution could be to make the field trials higher-stakes for the test-takers by, for example, including the test items in other high-stakes tests in school. This is not always possible, due to administrative constraints such as security and exposure concerns or difficulties in finding a sample that reflects the target population. Further, increasing the stakes could have other undesirable effects, such as increased test anxiety. Another solution would be to standardize the conditions under which the field trials are performed. As Kane (2011) pointed out, standardization of tests might decrease random errors. However, by doing so we run the risk of increasing systematic errors: strict standardization might increase motivation for some groups of pupils while decreasing it for others, and might even lead to teachers or schools choosing not to participate in the field trials. However, information to test administrators/teachers on how they could create a positive test climate and how they should present the test might be useful. Lau et al. (2009) showed that giving test administrators strategies to enhance pupils’ effort during the test session was an effective method for increasing effort in low-stakes situations.

One of the strengths of this study is its high ecological validity. The study was performed in the real world of schools and pupils, in real assessment settings, without any manipulation by the researchers. This strength is, however, also one of the study's weaknesses, as we only had access to a limited number of second-level units (26 classes), the performance data are not fully reliable, and we had no control over how the different tests were actually administered. Although the present study is a rather small-scale example that should not be generalized to other settings, we believe the findings could be relevant for teachers, researchers, and policy makers.

Conclusion

Our study is one of the first to analyse differences in test-taking motivation between classes, and one of few studies comparing test-taking motivation between authentic tests with different stakes for the same group of pupils. In conclusion, our study showed that test-taking motivation varies between and within tests with different stakes, and that test-taking motivation seems to be one factor affecting test performance. Taken together, this could affect the validity of the interpretation and use of test results, especially from small-scale field trials. Our findings suggest that the issue of pupil test-taking motivation in different assessment settings is worth further exploration, in order to enhance the quality of data and the inferences made from data, as well as to increase the understanding of pupil perceptions of tests and of how the pupil-motivation-assessment dynamic works.

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Funding

This work was supported by the Swedish Research Council [grant number 2012-5075].

Notes

1 “It was fun to do this test”.

2 Observed variance at class level divided by total observed variance.

3 Kurtosis for “I did my best on this test” on the regular test was 6.0.

4 They used a combined effort and importance scale.

5 The highest stake condition was “However, your test scores may be released to faculty in your college or to potential employers to evaluate your academic ability.”

References

  • Abdelfattah, F. (2010). The relationship between motivation and achievement in low-stakes examinations. Social Behavior and Personality: An International Journal, 38(2), 159–167. doi: 10.2224/sbp.2010.38.2.159
  • American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  • van Barneveld, C. (2007). The effect of examinee motivation on test construction within an IRT framework. Applied Psychological Measurement, 31(1), 31–46. doi: 10.1177/0146621606286206
  • Barry, C. L., Horst, S. J., Finney, S. J., Brown, A. R., & Kopp, J. P. (2010). Do examinees have similar test-taking effort? A high-stakes question for low-stakes testing. International Journal of Testing, 10(4), 342–363. doi: 10.1080/15305058.2010.508569
  • Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. European Journal of Psychology of Education, 16(3), 441–462. doi: 10.1007/BF03173192
  • Boe, E. E., May, H., & Boruch, R. F. (2002). Student task persistence in the third international mathematics and science study: A major source of achievement differences at the national, classroom, and student levels. (Report No. CRESP-RR-2002-TIMSS1). Retrieved from http://files.eric.ed.gov/fulltext/ED478493.pdf
  • Boekaerts, M. (2002). The on-line motivation questionnaire: A self-report instrument to assess students’ context sensitivity. In P. R. Pintrich & M. L. Maehr (Eds.), New directions in measures and methods (Vol. 12, pp. 77–120). Oxford, UK: Elsevier Science.
  • Cole, J., Bergin, D., & Whittaker, T. (2008). Predicting student achievement for low stakes tests with effort and task value. Contemporary Educational Psychology, 33(4), 609–624. doi: 10.1016/j.cedpsych.2007.10.002
  • DeMars, C. E. (2000). Test stakes and item format interactions. Applied Measurement in Education, 13(1), 55–77. doi: 10.1207/s15324818ame1301_3
  • DeMars, C. E., Bashkov, B. M., & Socha, A. B. (2013). The role of gender in test-taking motivation under low-stakes conditions. Research & Practice in Assessment, 8, 69–82. Retrieved from http://www.rpajournal.com
  • Eccles, J., Adler, T. F., Futterman, R., Goff, S. B., Kaczala, C. M., Meece, J. L., & Midgley, C. (1983). Expectancies, values and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). San Francisco, CA: Freeman.
  • Eklöf, H., Japelj, B., & Grønmo, L. S. (2014). A cross-national comparison of reported effort and mathematics performance in TIMSS advanced. Applied Measurement in Education, 27(1), 31–45. doi: 10.1080/08957347.2013.853070
  • Eklöf, H., & Knekta, E. (2014, April). Different stakes, different motivation? Swedish studies of test-taking motivation in different assessment contexts. In D. L. Sundre (Chair), The impact of test-taking motivation and test consequences on the validity of test score inferences. Symposium conducted at the annual meeting of the American Educational Research Association, Philadelphia, PA.
  • Eklöf, H., & Nyroos, M. (2013). Pupil perceptions of national tests in science: Perceived importance, invested effort, and test anxiety. European Journal of Psychology of Education, 28(2), 497–510. doi: 10.1007/s10212-012-0125-6
  • von der Embse, N., & Hasson, R. (2012). Test anxiety and high-stakes test performance between school settings: Implications for educators. Preventing School Failure: Alternative Education for Children and Youth, 56(3), 180–187. doi: 10.1080/1045988X.2011.633285
  • Harlen, W., & Deakin Crick, R. (2003). Testing and motivation for learning. Assessment in Education: Principles, Policy & Practice, 10(2), 169–207. doi: 10.1080/0969594032000121270
  • Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York: Routledge.
  • Kane, M. (2011). The errors of our ways. Journal of Educational Measurement, 48(1), 12–30. doi: 10.1111/j.1745-3984.2010.00128.x
  • Knekta, E., & Eklöf, H. (2015). Modeling the test-taking motivation construct through investigation of psychometric properties of an expectancy-value-based questionnaire. Journal of Psychoeducational Assessment, 33(7), 662–673. doi: 10.1177/0734282914551956
  • Lau, A. R., Jones, A. T., Anderson, R. D., & Markle, R. E. (2009). Proctors matter: Strategies for increasing examinee effort on general education program assessments. Journal of General Education, 58(3), 196–217. doi: 10.1353/jge.0.0045
  • Liu, O. L., Bridgeman, B., & Adler, R. M. (2012). Measuring learning outcomes in higher education: Motivation matters. Educational Researcher, 41(9), 352–362. doi: 10.3102/0013189X12459679
  • Maas, C. J. M., & Hox, J. J. (2004). Robustness issues in multilevel regression analysis. Statistica Neerlandica, 58(2), 127–137. doi: 10.1046/j.0039-0402.2003.00252.x
  • Muthén, L. K., & Muthén, B. O. (1998–2012). Mplus user's guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
  • Newsom, J. T. (2002). A multilevel structural equation model for dyadic data. Structural Equation Modeling: A Multidisciplinary Journal, 9(3), 431–447. doi: 10.1207/S15328007SEM0903_7
  • Nyroos, M., & Wiklund-Hörnqvist, C. (2011). The association between working memory and educational attainment as measured in different mathematical subtopics in the Swedish national assessment: primary education. Educational Psychology, 32(2), 239–256. doi: 10.1080/01443410.2011.643578
  • O'Neil, H. F., Abedi, J., Miyoshi, J., & Mastergeorge, A. (2005). Monetary incentives for low-stakes tests. Educational Assessment, 10(3), 185–208. doi: 10.1207/s15326977ea1003_3
  • Penk, C., Pöhlmann, C., & Roppelt, A. (2014). The role of test-taking motivation for students’ performance in low-stakes assessments: An investigation of school-track-specific differences. Large-scale Assessments in Education, 2(1), 1–17. doi: 10.1186/s40536-014-0005-4
  • Ryan, M. R., & Deci, E. L. (2009). Promoting self-determined school engagement, motivation, learning, and wellbeing. In K. R. Wentzel & A. Wigfield (Eds.), Handbook of motivation at school (pp. 171–195). New York, NY: Routledge.
  • Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8(2), 23–74. Retrieved from http://www.cob.unt.edu/slides/Paswan../BUSI6280/Y-Muller_Erfurt_2003.pdf
  • Schunk, D. H., Pintrich, P. R., & Meece, J. L. (2010). Motivation in education: Theory, research, and applications (3rd ed.). London: Pearson Education.
  • Segool, N. K., Carlson, J. S., Goforth, A. N., von der Embse, N., & Barterian, J. A. (2013). Heightened test anxiety among young children: Elementary school students’ anxious responses to high-stakes testing. Psychology in the Schools, 50(5), 489–499. doi: 10.1002/pits.21689
  • Silfver, E., Sjöberg, G., & Bagger, A. (2013). Changing our methods and disrupting the power dynamics: National tests in third-grade classrooms. International Journal of Qualitative Methods, 12, 39–51. Retrieved from https://ejournals.library.ualberta.ca
  • Simzar, R. M., Martinez, M., Rutherford, T., Domina, T., & Conley, A. M. (2015). Raising the stakes: How students’ motivation for mathematics associates with high- and low-stakes test achievement. Learning and Individual Differences, 39(0), 49–63. doi: 10.1016/j.lindif.2015.03.002
  • Stobart, G., & Eggen, T. (2012). High-stakes testing: Value, fairness and consequences. Assessment in Education: Principles, Policy & Practice, 19(1), 1–6. doi: 10.1080/0969594X.2012.639191
  • Stroet, K., Opdenakker, M.-C., & Minnaert, A. (2013). Effects of need supportive teaching on early-adolescents’ motivation and engagement: A review of the literature. Educational Research Review, 9, 65–87. doi: 10.1016/j.edurev.2012.11.003
  • Sundre, D. L., & Kitsantas, A. (2004). An exploration of the psychology of the examinee: Can examinee self-regulation and test-taking motivation predict consequential and non-consequential test performance? Contemporary Educational Psychology, 29(1), 6–26. doi: 10.1016/S0361-476X(02)00063-2
  • Swedish National Agency for Education. (2014). Redovisning av uppdrag om avvikelser mellan provresultat och betyg i grundskolans årskurs 6 och årskurs 9 [Report of assignment regarding discrepancy between test scores and grades in primary school grade 6 and grade 9] (Dnr U2014/335/S). Retrieved from http://www.skolverket.se/publikationer?id=3209
  • Swerdzewski, P. J., Harmes, J. C., & Finney, S. J. (2011). Two approaches for identifying low-motivated students in a low-stakes assessment context. Applied Measurement in Education, 24(2), 162–188. doi: 10.1080/08957347.2011.555217
  • Thelk, A. D., Sundre, D. L., Horst, S. J., & Finney, S. J. (2009). Motivation matters: Using the Student Opinion Scale to make valid inferences about student performance. Journal of General Education, 58(3), 129–151. doi: 10.1353/jge.0.0047
  • Urdan, T., & Schoenfelder, E. (2006). Classroom effects on student motivation: Goal structures, social relationships, and competence beliefs. Journal of School Psychology, 44(5), 331–349. doi: 10.1016/j.jsp.2006.04.003
  • Voelkle, M. C. (2007). Latent growth curve modeling as an integrative approach to the analysis of change. Psychology Science, 49(4), 375–414. Retrieved from http://www.psychologie-aktuell.com
  • Wendler, C. L. W., & Walker, M. E. (2009). Practical issues in designing and maintaining multiple test forms for large-scale programs. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 445–467). New York, NY: Routledge.
  • Wigfield, A., & Eccles, J. S. (1992). The development of achievement task values: A theoretical analysis. Developmental Review, 12(3), 265–310. doi: 10.1016/0273-2297(92)90011-P
  • Wigfield, A., & Eccles, J. S. (2000). Expectancy-value theory of achievement motivation. Contemporary Educational Psychology, 25(1), 68–81. doi: 10.1006/ceps.1999.1015
  • Wigfield, A., & Eccles, J. S. (2002). The development of competence beliefs, expectancies for success, and achievement values from childhood through adolescence. In A. Wigfield & J. S. Eccles (Eds.), Development of achievement motivation (pp. 91–120). San Diego, CA: Academic Press.
  • Wigfield, A., Tonks, S., & Lutz Klauda, S. (2009). Expectancy-value theory. In K. R. Wentzel & A. Wigfield (Eds.), Handbook of motivation at school (pp. 55–75). New York, NY: Routledge.
  • Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17. doi: 10.1207/s15326977ea1001_1
  • Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183. doi: 10.1207/s15324818ame1802_2
  • Wolf, L. F., & Smith, J. K. (1995). The consequence of consequence: Motivation, anxiety, and test performance. Applied Measurement in Education, 8(3), 227–242. doi: 10.1207/s15324818ame0803_3