Abstract
Human raters are normally involved in L2 performance assessment; as a result, rater behavior has been widely investigated to reduce rater effects on test scores and to provide validity arguments. Yet raters’ cognition and use of rubrics in their actual rating have rarely been explored qualitatively in L2 speaking assessments. In this study three rater groups (novice, developing, and expert) were first operationalized on the basis of four background variables (rating experience, teaching experience, rater training, and educational background) to predict different levels of expertise in rating. The three groups of raters then evaluated 18 ESL learners’ oral responses using an analytic scoring rubric across three occasions, separated by one-month intervals. Recorded verbal report data were analyzed (a) to compare rater behavior across the three groups and (b) to examine the development of rating performance within each group over time. The analysis revealed that the three groups of raters from different backgrounds presented varying levels of rating ability and different paces of improvement in their rating performance. The findings of the study suggest that a comprehensive consideration of rater characteristics contributes to a better understanding of raters’ different needs for training and rating.
ACKNOWLEDGMENTS
The author thanks James Purpura, Hansun Waring, and Kirby Grabowski, who read an earlier version of this manuscript. Thanks also to the three anonymous reviewers for Language Assessment Quarterly, who provided insightful comments.
Notes
1 Originally, Lumley (2005) used the term “educational background” (i.e., postgraduate qualifications in Applied Linguistics and/or ESL) as a criterion of rater selection. Instead, the current study used the term “coursework” to further differentiate the degree of educational background of the raters, who were MA students or recent graduates of the TESOL and Applied Linguistics programs.
2 The original scoring rubric used in the language program to score the speaking placement test had not been developed on the basis of a solid theoretical grounding. Therefore, the rubric was revised, deriving the five components from Purpura’s (2004) definition of language knowledge, and was pilot-tested for the current study.