References
- Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. doi:10.1111/j.1468-0262.2006.00655.x
- Abadie, A., & Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1), 1–11. doi:10.1198/jbes.2009.07333
- Al-Bayatti, M., & Jones, B. (2005). NAA enhancing the quality of marking project: The effect of sample size on increased precision in detecting errant marking. London: QCA.
- Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. doi:10.1177/0265532215582283
- Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University Press.
- Baird, J.-A., Meadows, M., Leckie, G., & Caro, D. (2017). Rater accuracy and training group effects in expert- and supervisor-based monitoring systems. Assessment in Education: Principles, Policy & Practice, 24, 44–59.
- Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27(4), 515–535. doi:10.1177/0265532210368717
- Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18, 279–293.
- Cole, T. L., Cochran, L. F., Troboy, L. K., & Roach, D. W. (2012). Efficiency in assessment: Can trained student interns rate essays as well as faculty members? International Journal for the Scholarship of Teaching and Learning, 6(2), 1–11. doi:10.20429/ijsotl.2012.060206
- Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163–178. doi:10.1111/j.1745-3984.2000.tb01081.x
- Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51. doi:10.1177/026553229000700104
- Engelhard, G., & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model (College Board Research Report No. 2003-1). New York, NY: College Entrance Examination Board.
- Erdosy, M. U. (2004). Exploring variability in judging writing ability in a second language: A study of four experienced raters of ESL compositions (ETS Research Report No. RR-03-17). Princeton, NJ: ETS.
- Glazer, N. (2017, April). The rater calibration process. Paper presented at the annual meeting of the National Council on Measurement in Education, San Antonio, TX.
- Hoskens, M., & Wilson, M. (2001). Real-time feedback on rater drift in constructed response items: An example from the Golden State Examination. Journal of Educational Measurement, 38(2), 121–145. doi:10.1111/j.1745-3984.2001.tb01119.x
- Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge: Cambridge University Press.
- Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26–43. doi:10.1016/j.asw.2007.04.001
- Lamprianou, I. (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7(2), 192–205.
- Lane, S., & Stone, C. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education.
- Leckie, G., & Baird, J.-A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399–418. doi:10.1111/j.1745-3984.2011.00152.x
- Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. doi:10.1177/0265532211406422
- Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation & the Health Professions, 13(4), 425–444. doi:10.1177/016327879001300405
- Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. Report for the National Assessment Agency by AQA Centre for Education Research and Policy, London, United Kingdom.
- Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46(4), 371–389. doi:10.1111/j.1745-3984.2009.00088.x
- Powers, D. E., Escoffery, D. S., & Duchnowski, M. P. (2015). Validating automated essay scoring: A (modest) refinement of the “gold standard”. Applied Measurement in Education, 28(2), 130–142. doi:10.1080/08957347.2014.1002920
- Raczynski, K. R., Cohen, A. S., Engelhard, G., & Lu, Z. (2015). Comparing the effectiveness of self-paced and collaborative frame-of-reference training on rater accuracy in a large-scale writing assessment. Journal of Educational Measurement, 52(3), 301–318. doi:10.1111/jedm.12079
- Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223. doi:10.1177/026553229401100206
- Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287. doi:10.1177/026553229801500205
- Wolfe, E. W., Jurgens, M., Sanders, B., Vickers, D., & Yue, J. (2014). Evaluation of pseudo-scoring as an extension of rater training (Research Report). Iowa City, IA: Pearson.
- Wolfe, E. W., Kao, C.-W., & Ranney, M. (1998). Cognitive differences in proficient and nonproficient essay scorers. Written Communication, 15(4), 465–492. doi:10.1177/0741088398015004002
- Wolfe, E. W., Matthews, S., & Vickers, D. (2010). The effectiveness and efficiency of distributed online, regional online, and regional face-to-face training for writing assessment raters. The Journal of Technology, Learning, and Assessment, 10(1), 4–21.
- Wolfe, E. W., Myford, C. M., Engelhard, G., & Manalo, J. R. (2007). Monitoring reader performance and DRIFT in the AP® English Literature and Composition examination using benchmark essays (College Board Research Report No. 2007-2). New York, NY: The College Board.