Research Article

Resolving and Re-Scoring Constructed Response Items in Mixed-Format Assessments: An Exploration of Three Approaches

