ABSTRACT
We evaluate the feasibility of developing predictive models of rater behavior, that is, rater-specific models for predicting the scores a rater produces under operational conditions. In the present study, the dependent variable is the score a rater assigned to an essay, and the predictors are linguistic attributes of the essay as computed by the e-rater® engine. Specifically, for each rater, a linear regression of rater scores on the linguistic attributes was estimated from data in each of two consecutive time periods, and each period's regression was cross-validated against data from the other period. Raters were characterized in terms of their level of predictability and the relative importance of the predictors. Results suggest that rater models capture stable individual differences among raters. To assess the usefulness of rater models as a quality-control mechanism, we examined the relationship between rater predictability and both inter-rater agreement and performance on pre-scored essays. Finally, we conducted a simulation in which raters score exclusively as a function of essay length at different points during the scoring day. We concluded that predictive rater models merit further investigation as a means of quality controlling human scoring.
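The cross-period validation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data are synthetic, the predictability index (mean cross-period correlation) is one plausible choice, and in the actual study the predictors are e-rater v14.1 linguistic attributes rather than random features.

```python
# Sketch of per-rater predictability via cross-period regression.
# All data here are synthetic; in the study, X would hold e-rater
# linguistic attributes and y the rater's operational scores.
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Ordinary least squares with an intercept column.
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict(coef, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ coef

def cross_period_r(X1, y1, X2, y2):
    # Fit on one period, correlate predictions with the rater's actual
    # scores from the other period, in both directions; the mean r is
    # one possible index of rater predictability.
    r12 = np.corrcoef(predict(fit_linear(X1, y1), X2), y2)[0, 1]
    r21 = np.corrcoef(predict(fit_linear(X2, y2), X1), y1)[0, 1]
    return (r12 + r21) / 2

# Synthetic "stable" rater: scores depend on the same attribute
# weights in both periods, plus noise.
n, p = 300, 5
true_w = rng.normal(size=p)
X1, X2 = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y1 = X1 @ true_w + rng.normal(scale=0.5, size=n)
y2 = X2 @ true_w + rng.normal(scale=0.5, size=n)

print(round(cross_period_r(X1, y1, X2, y2), 2))
```

A rater whose weights drift between periods, or who scores on an attribute outside the model (the essay-length simulation in the abstract is one such case), would show a markedly lower cross-period correlation.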
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1 The data used in this investigation are from an argumentative writing prompt of an operational high-stakes writing assessment. At the time the data for the investigation were sampled, essays were scored by one rater and a portion of the essays was double-scored. Essays were scored operationally following standard ETS human scoring procedures.
2 The details of the engine evolve over time, of course. For this investigation, we used the features as defined for version 14.1 of the e-rater engine.
3 https://www.kaggle.com/c/asap-aes