Research Article

Predictive Modeling of Rater Behavior: Implications for Quality Assurance in Essay Scoring


References

  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3). Retrieved from http://www.jtla.org
  • Baker, B. A. (2012). Individual differences in rater decision-making style: An exploratory mixed-methods study. Language Assessment Quarterly, 9(3), 225–248. doi:10.1080/15434303.2011.637262
  • Bejar, I. I. (2017). A historical survey of research regarding constructed-response formats. In R. Bennett & M. von Davier (Eds.), Advancing human assessment: Methodological, psychological, and policy contributions. New York, NY: Springer. Retrieved from https://link.springer.com/chapter/10.1007/978-3-319-58689-2_18
  • Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Human scoring. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 49–82). Mahwah, NJ: Lawrence Erlbaum.
  • Bennett, R. E., & Ben-Simon, A. (2005). Toward theoretically meaningful automated essay scoring. Journal of Technology, Learning, and Assessment.
  • Braun, H. I. (1986). Calibration of essay readers: Final report (RR-86-09). Princeton, NJ: Educational Testing Service. doi:10.1002/j.2330-8516.1986.tb00164.x
  • Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13(1), 1–18. doi:10.3102/10769986013001001
  • Bröder, A., Gräf, M., & Kieslich, P. J. (2017). Measuring the relative contributions of rule-based and exemplar-based processes in judgment: Validation of a simple model. Judgment and Decision Making, 12(5), 491–506.
  • Cohen, Y. (2017). Estimating the intra-rater reliability of essay raters. Frontiers in Education, 2, Article 49. doi:10.3389/feduc.2017.00049
  • Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67–96. doi:10.1111/1540-4781.00137
  • Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81(2), 95–106. doi:10.1037/h0037613
  • Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability (RB-61-15). Princeton, NJ: Educational Testing Service. doi:10.1002/j.2333-8504.1961.tb00286.x
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. doi:10.1177/0265532207086780
  • Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270–292. doi:10.1080/15434303.2011.649381
  • Egberink, I. J. L., Meijer, R. R., Veldkamp, B. P., Schakel, L., & Smid, N. G. (2010). Detection of aberrant item score patterns in computerized adaptive testing: An empirical example using the CUSUM. Personality and Individual Differences, 48(8), 921–925. doi:10.1016/j.paid.2010.02.023
  • Farag, Y., Yannakoudakis, H., & Briscoe, T. (2018). Neural automated essay scoring and coherence modeling for adversarially crafted input. In Proceedings of NAACL-HLT 2018 (pp. 263–271). New Orleans, LA: Association for Computational Linguistics.
  • Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental design and cognitive theory. In P. Mosenthal, L. Tamor, & S. A. Walmsley (Eds.), Research on writing: Principles and methods (pp. 75–98). New York, NY: Longman.
  • Houston, W. M., Raymond, M. R., & Svec, J. C. (1991). Adjustments for rater effects in performance assessment. Applied Psychological Measurement, 15(4), 409–421. doi:10.1177/014662169101500411
  • Karren, R. J., & Barringer, M. W. (2002). A review and analysis of the policy-capturing methodology in organizational research: Guidelines for research and practice. Organizational Research Methods, 5(4), 337–361. doi:10.1177/109442802237115
  • Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York, NY: Springer.
  • McClellan, C. A. (2010). Constructed-response scoring–doing it right (RDC-13). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/research/policy_research_reports/rdc-13
  • Monaghan, W., & Bridgeman, B. (2005). e-rater as a quality control of human scores. Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections7.pdf
  • Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
  • Naylor, J. C., & Wherry, R. J., Sr. (1965). The use of simulated stimuli and the “JAN” technique to capture and cluster the policies of raters. Educational and Psychological Measurement, 25(4), 969–986. doi:10.1177/001316446502500403
  • Nguyen, H., & Dery, L. (2018). Neural networks for automated essay grading. Retrieved from https://cs224d.stanford.edu/reports/huyenn.pdf
  • Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100–115. doi:10.2307/2333009
  • Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341–384. doi:10.3102/10769986027004341
  • Paul, S. R. (1981). Bayesian methods for calibration of examiners. British Journal of Mathematical and Statistical Psychology, 34(2), 213–223. doi:10.1111/j.2044-8317.1981.tb00630.x
  • Powers, D. E. (2005). “Wordiness”: A selective review of its influence, and suggestions for investigating its relevance in tests requiring extended written responses (RM-04-08). Princeton, NJ: Educational Testing Service. Retrieved from https://www.ets.org/Media/Research/pdf/RM-04-08.pdf
  • Powers, D. E., & Fowles, M. E. (2000). Likely impact of the GRE® writing assessment on graduate admissions decisions (GRE 97-06R, ETS RR −16). Princeton, NJ: Educational Testing Service.
  • Raymond, M. R., Harik, P., & Clauser, B. E. (2011). The impact of statistically adjusting for rater effects on conditional standard errors of performance ratings. Applied Psychological Measurement, 35(3), 235–246. doi:10.1177/0146621610390675
  • Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172–181.
  • Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How raters evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 129–152). Cambridge: Cambridge University Press.
  • Song, Y., Heilman, M., Beigman Klebanov, B., & Deane, P. (2014). Applying argumentation schemes for essay scoring. In Proceedings of the First Workshop on Argumentation Mining (pp. 69–78). Baltimore, MD: Association for Computational Linguistics.
  • Suto, I. (2012). A critical review of some qualitative research methods used to explore rater cognition. Educational Measurement: Issues and Practice, 31(3), 21–30. doi:10.1111/j.1745-3992.2012.00240.x
  • Suto, I., & Greatorex, J. (2008). What goes through an examiner’s mind? Using verbal protocols to gain insights into the GCSE marking process. British Educational Research Journal, 34(2), 213–233. doi:10.1080/01411920701492050
  • Wang, C., Song, T., Wang, Z., & Wolfe, E. (2017). Essay selection methods for adaptive rater monitoring. Applied Psychological Measurement, 41(1), 60–79. doi:10.1177/0146621616672855
  • Wolfe, E. W. (2014). Methods for monitoring rating quality: Current practices and suggested changes.
  • Zhang, J. (2016). Same text different processing? Exploring how raters’ cognitive and meta-cognitive strategies influence rating accuracy in essay scoring. Assessing Writing, 27, 37–53. doi:10.1016/j.asw.2015.11.001
