References
- Bond, T. G., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
- Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi.org/10.1177/001316448104100307
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Cohen, Y. (2015, April 14–19). The “third rater fallacy” in essay rating: An empirical test [Conference session]. National Council on Measurement in Education annual meeting. https://higherlogicdownload.s3.amazonaws.com/NCME/4b7590fc-3903-444d-b89d-c45b7fa3da3f/UploadedImages/Documents/NCME_2015_Program_WEB3.pdf
- Educational Testing Service. (n.d.). TOEFL iBT test. Retrieved August 31, 2023, from https://www.ets.org/toefl/score-users/ibt/about.html
- Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1(1), 19–33.
- Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Taylor & Francis.
- Finkelman, M., Darby, M., & Nering, M. (2009). A two-stage scoring method to enhance accuracy of performance level classification. Educational and Psychological Measurement, 69(1), 5–17. https://doi.org/10.1177/0013164408322025
- Johnson, R. L., Penny, J., Fisher, S., & Kuhs, T. (2003). Score resolution: An investigation of the reliability and validity of resolved scores. Applied Measurement in Education, 16(4), 299–322. https://doi.org/10.1207/S15324818AME1604_3
- Johnson, R. L., Penny, J. A., & Gordon, B. (2000). The relation between score resolution methods and interrater reliability: An empirical study of an analytic scoring rubric. Applied Measurement in Education, 13(2), 121–138. https://doi.org/10.1207/S15324818AME1302_1
- Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. The Guilford Press.
- Johnson, R. L., Penny, J., & Gordon, B. (2001). Score resolution and the interrater reliability of holistic scores in rating essays. Written Communication, 18(2), 229–249. https://doi.org/10.1177/0741088301018002003
- Kim, S., & Moses, T. (2013). Determining when single scoring for constructed-response items is as effective as double scoring in mixed-format licensure tests. International Journal of Testing, 13(4), 314–328. https://doi.org/10.1080/15305058.2013.776050
- Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre, J. M. (2015). Facets Rasch measurement (Version 3.71.4) [Computer software]. Winsteps.com.
- Mao, X., Zhang, J., & Xin, T. (2022). The optimal design of bifactor multidimensional computerized adaptive testing with mixed-format items. Applied Psychological Measurement, 46(7), 605–621. https://doi.org/10.1177/01466216221108382
- Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
- Meijer, R. R., Niessen, A. S. M., & Tendeiro, J. N. (2016). A practical guide to check the consistency of item response patterns in clinical research through person-fit statistics: Examples and a computer program. Assessment, 23(1), 52–62. https://doi.org/10.1177/1073191115577800
- Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107–135. https://doi.org/10.1177/01466210122031957
- Miao, J., Sinharay, S., Kelbaugh, C., Cao, Y., & Wang, W. (2023). Evaluating targeted double scoring for the performance assessment for school leaders using imputation and decision theory. ETS Research Report Series, 2023(1), 1–10. https://doi.org/10.1002/ets2.12363
- Myford, C. M., & Wolfe, E. W. (2002). When raters disagree, then what: Examining a third-rating discrepancy resolution procedure and its utility for identifying unusual patterns of ratings. Journal of Applied Measurement, 3(3), 300–324.
- National Assessment of Educational Progress. (2022). 2022 NAEP mathematics assessment. https://www.nationsreportcard.gov/itemmaps/?subj=MAT&grade=4&year=2022
- Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Expanded ed., 1980). University of Chicago Press.
- Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184. https://doi.org/10.1111/j.1745-3984.2003.tb01102.x
- Rupp, A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55(1), 3–38.
- Sijtsma, K., & Meijer, R. R. (2001). The person response function as a tool in person-fit research. Psychometrika, 66(2), 191–207. https://doi.org/10.1007/BF02294835
- Sinharay, S. (2015). Assessment of person fit for mixed-format tests. Journal of Educational and Behavioral Statistics, 40(4), 343–365. https://doi.org/10.3102/1076998615589128
- Sinharay, S., Johnson, M. S., Wang, W., & Miao, J. (2023). Targeted double scoring of performance tasks using a decision-theoretic approach. Applied Psychological Measurement, 47(2), 155–163. https://doi.org/10.1177/01466216221129271
- Uebersax, J. (2009). The myth of chance-corrected agreement. https://john-uebersax.com/stat/kappa2.htm
- Walker, A. A., Jennings, J. K., & Engelhard, G. (2018). Using person response functions to investigate areas of person misfit related to item characteristics. Educational Assessment, 23(1), 47–68. https://doi.org/10.1080/10627197.2017.1415143
- Wind, S. A. (2019). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159–171. https://doi.org/10.1177/0146621618789391
- Wind, S. A. (2020). Exploring the impact of rater effects on person fit in rater-mediated assessments. Educational Measurement: Issues and Practice, 39(4), 76–94. https://doi.org/10.1111/emip.12354
- Wind, S. A., & Engelhard, G. (2013). How invariant and accurate are domain ratings in writing assessment? Assessing Writing, 18(4), 278–299. https://doi.org/10.1016/j.asw.2013.09.002
- Wind, S. A., Engelhard, G., & Wesolowski, B. (2016). Exploring the effects of rater linking designs and rater fit on achievement estimates within the context of music performance assessments. Educational Assessment, 21(4), 278–299. https://doi.org/10.1080/10627197.2016.1236676
- Wind, S. A., & Ge, Y. (2021). Detecting rater biases in sparse rater-mediated assessment networks. Educational and Psychological Measurement, 81(5), 996–1022. https://doi.org/10.1177/0013164420988108
- Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79(5), 962–987. https://doi.org/10.1177/0013164419834613
- Wind, S. A., & Walker, A. A. (2019). Exploring the correspondence between traditional score resolution methods and person fit indices in rater-mediated writing assessments. Assessing Writing, 39, 25–38. https://doi.org/10.1016/j.asw.2018.12.002
- Wind, S. A., & Walker, A. A. (2020). Exploring the impacts of different score resolution procedures on person fit and estimated achievement in rater-mediated assessments. Language Assessment Quarterly, 17(4), 362–385. https://doi.org/10.1080/15434303.2020.1783668
- Wind, S. A., & Walker, A. A. (2021). A model-data-fit-informed approach to score resolution in performance assessments. Educational Measurement: Issues and Practice, 40(3), 52–63. https://doi.org/10.1111/emip.12427
- Wolfe, E. W. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
- Wolfe, E. W., & Song, T. (2015). Comparison of models and indices for detecting rater centrality. Journal of Applied Measurement, 16(3), 228–241.
- Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. MESA Press.
- Wright, B. D., Mead, R., & Draba, R. E. (1976). Detecting and correcting test item bias with a logistic response model (No. 22; Research Memorandum). Statistical Laboratory, Department of Education, University of Chicago.
- Zhao, Y., & Hambleton, R. K. (2017). Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data. Frontiers in Psychology, 8(484), 1–11. https://doi.org/10.3389/fpsyg.2017.00484