
Benchmark rating procedure, best of both worlds? Comparing procedures to rate text quality in a reliable and valid manner

Pages 302–319 | Received 30 Mar 2021, Accepted 24 Jul 2023, Published online: 11 Aug 2023

References

  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning, and Assessment, 4(3), 1–30.
  • Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
  • Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86–107. https://doi.org/10.1016/j.asw.2007.07.001
  • Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293. https://doi.org/10.1080/0969594X.2010.526585
  • Black, P., Harrison, C., Hodgen, J., Marshall, B., & Serret, N. (2011). Can teachers’ summative assessments produce dependable results and also enhance classroom learning? Assessment in Education: Principles, Policy & Practice, 18(4), 451–469. https://doi.org/10.1080/0969594X.2011.557020
  • Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice, 25(6), 551–575. https://doi.org/10.1080/0969594X.2018.1441807
  • Bouwer, R., Béguin, A., Sanders, T., & Van den Bergh, H. (2015). Effect of genre on the generalizability of writing scores. Language Testing, 32(1), 83–100. https://doi.org/10.1177/0265532214542994
  • Bouwer, R., & Gerits, H. (2022). Aan de slag met het schoolexamen schrijfvaardigheid [Getting started with the school exam for writing skills]. Levende Talen Magazine, 109(2), 10–15.
  • Bouwer, R., & Koster, M. (2016). Bringing writing research into the classroom: The effectiveness of Tekster, a newly developed writing program for elementary students [Unpublished doctoral dissertation]. Utrecht University.
  • Bouwer, R., Koster, M., & Van den Bergh, H. (2018). Effects of a strategy-focused instructional program on the writing quality of upper elementary students in the Netherlands. Journal of Educational Psychology, 110(1), 58–71. https://doi.org/10.1037/edu0000206
  • Brennan, R. L. (2001). Generalizability theory. Springer-Verlag. https://doi.org/10.1007/978-1-4757-3456-0
  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Houghton Mifflin.
  • Cooper, C. R., & Odell, L. (1977). Evaluating writing: Describing, measuring, judging. National Council of Teachers of English.
  • Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. John Wiley.
  • De Smedt, F., Van Keer, H., & Merchie, E. (2016). Student, teacher and class-level correlates of Flemish late elementary school children’s writing performance. Reading & Writing, 29(5), 833–868. https://doi.org/10.1007/s11145-015-9590-z
  • Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability (Research Bulletin No. RB-61-15). Educational Testing Service.
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
  • Feenstra, H. (2014). Assessing writing ability in primary education: On the evaluation of text quality and text complexity [Unpublished doctoral dissertation]. University of Twente.
  • Feldt, L. S. (1980). A test of the hypothesis that Cronbach’s alpha reliability coefficient is the same for two tests administered to the same sample. Psychometrika, 45(1), 99–105. https://doi.org/10.1007/BF02293600
  • Grabowski, J., Becker-Mrotzek, M., Knopp, M., Jost, J., & Weinzierl, C. (2014). Comparing and combining different approaches to the assessment of text quality. In D. Knorr, C. Heine, & J. Engberg (Eds.), Methods in writing process research (pp. 147–165). Lang.
  • Graham, S. (2018). Instructional feedback in writing. In A. A. Lipnevich & J. K. Smith (Eds.), The Cambridge handbook of instructional feedback (pp. 145–168). Cambridge University Press. https://doi.org/10.1017/9781316832134.009
  • Hakstian, A. R., & Whalen, T. E. (1976). A K-sample significance test for independent alpha coefficients. Psychometrika, 41(2), 219–231. https://doi.org/10.1007/BF02291840
  • Hopster-den Otter, D., Wools, S., Eggen, T. J. H. M., & Veldkamp, B. P. (2019). A general framework for the validation of embedded formative assessment. Journal of Educational Measurement, 56(4), 715–732. https://doi.org/10.1111/jedm.12234
  • Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60(2), 237–263. https://doi.org/10.3102/00346543060002237
  • Inspectorate of Education. (2021). Peil.Schrijfvaardigheid: Einde (speciaal) basisonderwijs 2018-2019 [Level.Writing ability: End of (special) elementary education 2018-2019].
  • Jönsson, A., Balan, A., & Hartell, E. (2021). Analytic or holistic? A study about how to increase the agreement in teachers’ grading. Assessment in Education: Principles, Policy & Practice, 28(3), 212–227. https://doi.org/10.1080/0969594X.2021.1884041
  • Jönsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
  • Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192
  • Kuhlemeier, H., Van Til, A., Hemker, B., De Klijn, W., & Feenstra, H. (2013). Balans van de schrijfvaardigheid in het basis- en speciaal basisonderwijs 2: Periodieke peiling van het onderwijsniveau (No. 53) [Present state of writing competency in elementary and special education 2: Periodical assessment of the level of education]. Cito.
  • Laming, D. R. J. (2004). Human judgment: The eye of the beholder. Thomson Learning.
  • Lesterhuis, M. (2018). The validity of comparative judgement for assessing text quality: An assessor’s perspective [Unpublished doctoral dissertation]. University of Antwerp.
  • Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 33–68). National Council of Teachers of English.
  • Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
  • Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276. https://doi.org/10.1191/0265532202lt230oa
  • Marshall, N., Shaw, K., Hunter, J., & Jones, I. (2020). Assessment by comparative judgement: An application to secondary statistics and English in New Zealand. New Zealand Journal of Educational Studies, 55(1), 49–71. https://doi.org/10.1007/s40841-020-00163-3
  • Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281–300. https://doi.org/10.1080/0969594X.2012.665354
  • Pollman, E., Prenger, J., & De Glopper, K. (2012). Het beoordelen van leerlingteksten met behulp van een schaalmodel [Rating students’ texts with a benchmark scale]. Levende Talen Tijdschrift, 13(3), 15–24.
  • Pullens, T. (2012). Bij wijze van schrijven: Effecten van computerondersteund schrijven in het primair onderwijs [In a manner of writing: Effects of computer-supported writing in primary education] [Unpublished doctoral dissertation]. Utrecht University.
  • Rietdijk, S., Janssen, T., van Weijen, D., van den Bergh, H., & Rijlaarsdam, G. (2017). Improving writing in primary schools through a comprehensive writing program. Journal of Writing Research, 9(2), 173–225. https://doi.org/10.17239/jowr-2017.09.02.04
  • Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment in Education: Principles, Policy & Practice, 5(1), 77–84. https://doi.org/10.1080/0969595980050104
  • Sadler, D. R. (2009). Indeterminacy in the use of preset criteria for assessment and grading. Assessment & Evaluation in Higher Education, 34(2), 159–179. https://doi.org/10.1080/02602930801956059
  • Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
  • Shermis, M. D., Burstein, J., Elliot, N., Miel, S., & Foltz, P. W. (2017). Automated writing evaluation: An expanding body of knowledge. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd ed., pp. 395–410). The Guilford Press.
  • Smith, G. S., & Paige, D. D. (2019). A study of reliability across multiple raters when using the NAEP and MDFS rubrics to measure oral reading fluency. Reading Psychology, 40(1), 34–69. https://doi.org/10.1080/02702711.2018.1555361
  • Solomon, C., Lutkus, A. D., Kaplan, B., & Skolnik, I. (2004). Writing in the nation’s classrooms: Teacher interviews and student work collected from participants in the NAEP 1998 Writing Assessment (ETS-NAEP Technical and Research Report No. 04-R02). ETS. https://www.ets.org/Media/Research/pdf/ETS-NAEP-04-R02.pdf
  • Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4), 66–78.
  • Tillema, M., Van den Bergh, H., Rijlaarsdam, G., & Sanders, T. (2012). Quantifying the quality difference between L1 and L2 essays: A rating procedure with bilingual raters and L1 and L2 benchmark essays. Language Testing, 30(1), 71–97. https://doi.org/10.1177/0265532212442647
  • Uzun, N. B., Alici, D., & Aktas, M. (2019). Reliability of the analytic rubric and checklist for the assessment of story writing skills: G and decision study in generalizability theory. European Journal of Educational Research, 8(1), 169–180. https://doi.org/10.12973/eu-jer.8.1.169
  • Van den Bergh, H., De Maeyer, S., Van Weijen, D., & Tillema, M. (2012). Generalizability of text quality scores. In E. van Steendam, M. Tillema, G. Rijlaarsdam, & H. van den Bergh (Eds.), Measuring writing: Recent insights into theory, methodology and practices (Vol. 27, pp. 23–32). Brill.
  • Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223. https://doi.org/10.1177/026553229401100206
  • Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
  • Wesdorp, H. (1981). Evaluatietechnieken voor het moedertaalonderwijs [Evaluation techniques for the mother tongue education]. Stichting voor Onderzoek van het Onderwijs.