References
- Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. doi:10.1111/j.1468-0262.2006.00655.x
- Abadie, A., & Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1), 1–11. doi:10.1198/jbes.2009.07333
- Al-Bayatti, M., & Jones, B. (2005). NAA enhancing the quality of marking project: The effect of sample size on increased precision in detecting errant marking. London: QCA.
- Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. doi:10.1177/0265532215582283
- Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University Press.
- Baird, J.-A., Meadows, M., Leckie, G., & Caro, D. (2017). Rater accuracy and training group effects in expert- and supervisor-based monitoring systems. Assessment in Education: Principles, Policy & Practice, 24, 44–59.
- Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27(4), 515–535. doi:10.1177/0265532210368717
- Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18, 279–293.
- Cole, T. L., Cochran, L. F., Troboy, L. K., & Roach, D. W. (2012). Efficiency in assessment: Can trained student interns rate essays as well as faculty members? International Journal for the Scholarship of Teaching and Learning, 6(2), 1–11. doi:10.20429/ijsotl.2012.060206
- Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163–178. doi:10.1111/j.1745-3984.2000.tb01081.x
- Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51. doi:10.1177/026553229000700104
- Engelhard, G., & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model (College Board Research Report No. 2003-1). New York, NY: College Entrance Examination Board.
- Erdosy, M. U. (2004). Exploring variability in judging writing ability in a second language: A study of four experienced raters of ESL compositions (ETS Research Report No. RR-03-17). Princeton, NJ: ETS.
- Glazer, N. (2017, April). The rater calibration process. Paper presented at the annual meeting of the National Council on Measurement in Education, San Antonio, TX.
- Hoskens, M., & Wilson, M. (2001). Real-time feedback on rater drift in constructed response items: An example from the Golden State Examination. Journal of Educational Measurement, 38(2), 121–145. doi:10.1111/j.1745-3984.2001.tb01119.x
- Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge: Cambridge University Press.
- Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26–43. doi:10.1016/j.asw.2007.04.001
- Lamprianou, I. (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7(2), 192–205.
- Lane, S., & Stone, C. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education.
- Leckie, G., & Baird, J.-A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399–418. doi:10.1111/j.1745-3984.2011.00152.x
- Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. doi:10.1177/0265532211406422
- Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation & the Health Professions, 13(4), 425–444. doi:10.1177/016327879001300405
- Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. Report for the National Assessment Agency by AQA Centre for Education Research and Policy, London, United Kingdom.
- Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46(4), 371–389. doi:10.1111/j.1745-3984.2009.00088.x
- Powers, D. E., Escoffery, D. S., & Duchnowski, M. P. (2015). Validating automated essay scoring: A (modest) refinement of the “gold standard”. Applied Measurement in Education, 28(2), 130–142. doi:10.1080/08957347.2014.1002920
- Raczynski, K. R., Cohen, A. S., Engelhard, G., & Lu, Z. (2015). Comparing the effectiveness of self-paced and collaborative frame-of-reference training on rater accuracy in a large-scale writing assessment. Journal of Educational Measurement, 52(3), 301–318. doi:10.1111/jedm.12079
- Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223. doi:10.1177/026553229401100206
- Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287. doi:10.1177/026553229801500205
- Wolfe, E. W., Jurgens, M., Sanders, B., Vickers, D., & Yue, J. (2014). Evaluation of pseudo-scoring as an extension of rater training (Research Report). Iowa City, IA: Pearson.
- Wolfe, E. W., Kao, C.-W., & Ranney, M. (1998). Cognitive differences in proficient and nonproficient essay scorers. Written Communication, 15(4), 465–492. doi:10.1177/0741088398015004002
- Wolfe, E. W., Matthews, S., & Vickers, D. (2010). The effectiveness and efficiency of distributed online, regional online, and regional face-to-face training for writing assessment raters. The Journal of Technology, Learning, and Assessment, 10(1), 4–21.
- Wolfe, E. W., Myford, C. M., Engelhard, G., & Manalo, J. R. (2007). Monitoring reader performance and DRIFT in the AP® English Literature and Composition examination using benchmark essays (College Board Research Report No. 2007-2). New York, NY: The College Board.