Using Multilevel Modeling in Language Assessment Research: A Conceptual Introduction

Pages 241-273 | Published online: 15 Aug 2013
 

Abstract

This article critiques traditional single-level statistical approaches (e.g., multiple regression analysis) to examining relationships between language test scores and variables in the assessment setting. It highlights the conceptual, methodological, and statistical problems associated with these techniques in dealing with multilevel or nested data and discusses an alternative approach, multilevel modeling (MLM), that can handle such data appropriately. An example focusing on contrast effects in essay rating is used to illustrate the main points discussed in the paper and the applications and advantages of MLM. The article also discusses some of the main considerations and issues in MLM (e.g., model building and testing, centering) and concludes by pointing out areas where MLM can be applied in language assessment research.

Notes

1 There are numerous ways to combine analytic scores, depending on assessment purpose and context. The simplest approach is to sum or average them, which means assigning equal weight to all rating criteria. Other approaches involve assigning more weight to some criteria than others based on statistical, practical, and/or theoretical considerations.

2 A MANOVA could be used here, but to keep the discussion short and simple, an ANOVA is used.

3 In this discussion, holistic scores are used as the outcome, but the same equation applies to the analytic scores as well.

4 See the discussion of centering that follows.

5 The following terms are used interchangeably in this article: residual term, error term, unexplained variance, and unmodeled variance.

6 In an ANOVA, independent variables are always categorical (e.g., group), whereas in MR, independent variables can be either continuous (e.g., test scores) or categorical.

7 Readers are referred to the references just listed for a statistical explanation of this point.

8 Kreft and Leeuw (1998) noted that the sample size lies somewhere between the number of ratings and the number of raters depending on the amount of variance within and between raters (i.e., ICC).

9 Aggregation may be less common in language assessment research because such research usually includes a small number of observations.

10 It is possible that these authors considered the issues just discussed, but there is no indication in the publications themselves whether the issue of nested data and its implications were examined empirically (e.g., by computing ICC).

11 ANOVA, G-theory, and MFRM also cannot estimate mediating effects of Level-2 factors (e.g., rater characteristics) on the relationships between Level-1 variables (e.g., between EPL and target essay score).

12 Context here refers to the context of the ratings that includes the rater, the rating panel (e.g., size and composition of panel), physical context of rating (e.g., place, time), and so forth.

13 MLM models fall into two broad statistical categories: a multiple regression approach and a structural equation modeling (SEM) approach (Heck & Thomas, 2000; Hox, 2002). This article adopts a multiple regression approach because it is conceptually less complex.

14 As is shown next, a first step in conducting MLM is to run a null model to estimate the ICC and determine whether MLM is needed or whether, if the ICC is close to 0, MR can be used instead.

15 Level 2 variables can be categorical (e.g., novice vs. experienced) or continuous (e.g., number of years of experience).

16 The decisions and considerations discussed in this section are similar to those in SEM (see Kunnan, 1998).

17 Computer programs such as HLM provide user-friendly tools, examples, and step-by-step guidelines for building and evaluating various MLM models.

18 As Hox (1995) explained, if the models being compared are not nested models, “the principle that models should be as simple as possible (theories and models should be parsimonious) indicates that we should generally stick with the simpler model” (p. 17).

19 It is the ratio of the true parameter variance to the observed variance (which consists of true and error variances; Deadrick et al., 1997).

20 It should be noted here that conventional statistical models (i.e., ANOVA, MR) form the building blocks of MLM and can be considered special single-level cases of MLM (Hox, 2002).

21 Analyses of the residuals from the final model in the Applying MLM to the Data Set section indicated that the distribution of Level-1 and Level-2 residuals did not depart significantly from normality. However, the novice raters tended to have higher residual variances, suggesting more unpredictable variability in the ratings of these raters compared to those of the experienced raters.

22 Some of the scenarios described here can be modeled using G-theory if the predictors are categorical. Where predictors are continuous or mixed, only MLM can handle such data.
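
Notes 8 and 14 both turn on the ICC. As a rough illustration only, the sketch below computes the ICC from the two variance components of a null model and then applies the common design-effect approximation, 1 + (m − 1) × ICC, to show how the effective sample size moves between the number of raters and the number of ratings. The design-effect formula is not taken from this article; it is a standard approximation that assumes every rater contributes the same number of ratings (m). The function names and the numbers used (200 ratings, 10 raters) are hypothetical.

```python
def intraclass_correlation(between_var: float, within_var: float) -> float:
    """ICC: share of total variance that lies between raters (Level 2)."""
    total = between_var + within_var
    if total <= 0:
        raise ValueError("Total variance must be positive.")
    return between_var / total


def effective_sample_size(n_ratings: int, n_raters: int, icc: float) -> float:
    """Approximate effective sample size for ratings nested within raters.

    Assumes equal numbers of ratings per rater (an assumption of this
    sketch) and uses the design effect 1 + (m - 1) * ICC.
    """
    ratings_per_rater = n_ratings / n_raters
    design_effect = 1 + (ratings_per_rater - 1) * icc
    return n_ratings / design_effect


if __name__ == "__main__":
    # Hypothetical variance components from a null model.
    icc = intraclass_correlation(between_var=0.25, within_var=0.75)
    print(f"ICC = {icc:.2f}")
    # Hypothetical design: 200 ratings produced by 10 raters.
    for value in (0.0, icc, 1.0):
        n_eff = effective_sample_size(200, 10, value)
        print(f"ICC = {value:.2f} -> effective n = {n_eff:.1f}")
```

With an ICC of 0 the effective sample size equals the number of ratings, and with an ICC of 1 it equals the number of raters, which matches the range described in Note 8.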
