
Benchmark rating procedure, best of both worlds? Comparing procedures to rate text quality in a reliable and valid manner

Pages 302–319 | Received 30 Mar 2021, Accepted 24 Jul 2023, Published online: 11 Aug 2023

References

  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning, and Assessment, 4(3), 1–30.
  • Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
  • Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86–107. https://doi.org/10.1016/j.asw.2007.07.001
  • Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293. https://doi.org/10.1080/0969594X.2010.526585
  • Black, P., Harrison, C., Hodgen, J., Marshall, B., & Serret, N. (2011). Can teachers’ summative assessments produce dependable results and also enhance classroom learning? Assessment in Education: Principles, Policy & Practice, 18(4), 451–469. https://doi.org/10.1080/0969594X.2011.557020
  • Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice, 25(6), 551–575. https://doi.org/10.1080/0969594X.2018.1441807
  • Bouwer, R., Béguin, A., Sanders, T., & Van den Bergh, H. (2015). Effect of genre on the generalizability of writing scores. Language Testing, 32(1), 83–100. https://doi.org/10.1177/0265532214542994
  • Bouwer, R., & Gerits, H. (2022). Aan de slag met het schoolexamen schrijfvaardigheid [Getting started with the school exam for writing skills]. Levende Talen Magazine, 109(2), 10–15.
  • Bouwer, R., & Koster, M. (2016). Bringing writing research into the classroom: The effectiveness of Tekster, a newly developed writing program for elementary students [Unpublished doctoral dissertation]. Utrecht University.
  • Bouwer, R., Koster, M., & Van den Bergh, H. (2018). Effects of a strategy-focused instructional program on the writing quality of upper elementary students in the Netherlands. Journal of Educational Psychology, 110(1), 58–71. https://doi.org/10.1037/edu0000206
  • Brennan, R. L. (2001). Generalizability theory. Springer-Verlag. https://doi.org/10.1007/978-1-4757-3456-0
  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Houghton Mifflin.
  • Cooper, C. R., & Odell, L. (1977). Evaluating writing: Describing, measuring, judging. National Council of Teachers of English.
  • Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. John Wiley.
  • De Smedt, F., Van Keer, H., & Merchie, E. (2016). Student, teacher and class-level correlates of Flemish late elementary school children’s writing performance. Reading & Writing, 29(5), 833–868. https://doi.org/10.1007/s11145-015-9590-z
  • Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability (Research Bulletin No. RB-61-15). Educational Testing Service.
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
  • Feenstra, H. (2014). Assessing writing ability in primary education: On the evaluation of text quality and text complexity [Unpublished doctoral dissertation]. University of Twente.
  • Feldt, L. S. (1980). A test of the hypothesis that Cronbach’s alpha reliability coefficient is the same for two tests administered to the same sample. Psychometrika, 45(1), 99–105. https://doi.org/10.1007/BF02293600
  • Grabowski, J., Becker-Mrotzek, M., Knopp, M., Jost, J., & Weinzierl, C. (2014). Comparing and combining different approaches to the assessment of text quality. In D. Knorr, C. Heine, & J. Engberg (Eds.), Methods in writing process research (pp. 147–165). Lang.
  • Graham, S. (2018). Instructional feedback in writing. In A. A. Lipnevich & J. K. Smith (Eds.), The Cambridge handbook of instructional feedback (pp. 145–168). Cambridge University Press. https://doi.org/10.1017/9781316832134.009
  • Hakstian, A. R., & Whalen, T. E. (1976). A K-sample significance test for independent alpha coefficients. Psychometrika, 41(2), 219–231. https://doi.org/10.1007/BF02291840
  • Hopster-den Otter, D., Wools, S., Eggen, T. J. H. M., & Veldkamp, B. P. (2019). A general framework for the validation of embedded formative assessment. Journal of Educational Measurement, 56(4), 715–732. https://doi.org/10.1111/jedm.12234
  • Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60(2), 237–263. https://doi.org/10.3102/00346543060002237
  • Inspectorate of Education. (2021). Peil.Schrijfvaardigheid: Einde (speciaal) basisonderwijs 2018-2019 [Level.Writing ability: End of (special) elementary education 2018-2019].
  • Jönsson, A., Balan, A., & Hartell, E. (2021). Analytic or holistic? A study about how to increase the agreement in teachers’ grading. Assessment in Education: Principles, Policy & Practice, 28(3), 212–227. https://doi.org/10.1080/0969594X.2021.1884041
  • Jönsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
  • Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192
  • Kuhlemeier, H., Van Til, A., Hemker, B., De Klijn, W., & Feenstra, H. (2013). Balans van de schrijfvaardigheid in het basis- en speciaal basisonderwijs 2: Periodieke peiling van het onderwijsniveau (No. 53) [Present state of writing competency in elementary and special education 2: Periodical assessment of the level of education]. Cito.
  • Laming, D. R. J. (2004). Human judgment: The eye of the beholder. Thomson Learning.
  • Lesterhuis, M. (2018). The validity of comparative judgement for assessing text quality: An assessor’s perspective [Unpublished doctoral dissertation]. University of Antwerp.
  • Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 33–68). National Council of Teachers of English.
  • Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
  • Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276. https://doi.org/10.1191/0265532202lt230oa
  • Marshall, N., Shaw, K., Hunter, J., & Jones, I. (2020). Assessment by comparative judgement: An application to secondary statistics and English in New Zealand. New Zealand Journal of Educational Studies, 55(1), 49–71. https://doi.org/10.1007/s40841-020-00163-3
  • Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281–300. https://doi.org/10.1080/0969594X.2012.665354
  • Pollman, E., Prenger, J., & De Glopper, K. (2012). Het beoordelen van leerlingteksten met behulp van een schaalmodel [Rating students’ texts with a benchmark scale]. Levende Talen Tijdschrift, 13(3), 15–24.
  • Pullens, T. (2012). Bij wijze van schrijven: Effecten van computerondersteund schrijven in het primair onderwijs [In a manner of writing: Effects of computer-supported writing in primary education] [Unpublished doctoral dissertation]. Utrecht University.
  • Rietdijk, S., Janssen, T., van Weijen, D., van den Bergh, H., & Rijlaarsdam, G. (2017). Improving writing in primary schools through a comprehensive writing program. Journal of Writing Research, 9(2), 173–225. https://doi.org/10.17239/jowr-2017.09.02.04
  • Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment in Education: Principles, Policy & Practice, 5(1), 77–84. https://doi.org/10.1080/0969595980050104
  • Sadler, D. R. (2009). Indeterminacy in the use of preset criteria for assessment and grading. Assessment & Evaluation in Higher Education, 34(2), 159–179. https://doi.org/10.1080/02602930801956059
  • Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
  • Shermis, M. D., Burstein, J., Elliot, N., Miel, S., & Foltz, P. W. (2017). Automated writing evaluation: An expanding body of knowledge. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd ed., pp. 395–410). The Guilford Press.
  • Smith, G. S., & Paige, D. D. (2019). A study of reliability across multiple raters when using the NAEP and MDFS rubrics to measure oral reading fluency. Reading Psychology, 40(1), 34–69. https://doi.org/10.1080/02702711.2018.1555361
  • Solomon, C., Lutkus, A. D., Kaplan, B., & Skolnik, I. (2004). Writing in the nation’s classrooms: Teacher interviews and student work collected from participants in the NAEP 1998 Writing Assessment (ETS-NAEP Technical and Research Report No. 04-R02). ETS. https://www.ets.org/Media/Research/pdf/ETS-NAEP-04-R02.pdf
  • Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4), 66–78.
  • Tillema, M., Van den Bergh, H., Rijlaarsdam, G., & Sanders, T. (2012). Quantifying the quality difference between L1 and L2 essays: A rating procedure with bilingual raters and L1 and L2 benchmark essays. Language Testing, 30(1), 71–97. https://doi.org/10.1177/0265532212442647
  • Uzun, N. B., Alici, D., & Aktas, M. (2019). Reliability of the analytic rubric and checklist for the assessment of story writing skills: G and decision study in generalizability theory. European Journal of Educational Research, 8(1), 169–180. https://doi.org/10.12973/eu-jer.8.1.169
  • Van den Bergh, H., De Maeyer, S., Van Weijen, D., & Tillema, M. (2012). Generalizability of text quality scores. In E. van Steendam, M. Tillema, G. Rijlaarsdam, & H. van den Bergh (Eds.), Measuring writing: Recent insights into theory, methodology and practices (Vol. 27, pp. 23–32). Brill.
  • Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223. https://doi.org/10.1177/026553229401100206
  • Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
  • Wesdorp, H. (1981). Evaluatietechnieken voor het moedertaalonderwijs [Evaluation techniques for the mother tongue education]. Stichting voor Onderzoek van het Onderwijs.