Research Article

The Impact of Operational Scoring Experience and Additional Mentored Training on Raters’ Essay Scoring Accuracy
