References
- Bond, T. G., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
- Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi.org/10.1177/001316448104100307
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Cohen, Y. (2015, April 14–19). The “third rater fallacy” in essay rating: An empirical test [Conference session]. National Council on Measurement in Education annual meeting. https://higherlogicdownload.s3.amazonaws.com/NCME/4b7590fc-3903-444d-b89d-c45b7fa3da3f/UploadedImages/Documents/NCME_2015_Program_WEB3.pdf
- Educational Testing Service. (n.d.). TOEFL iBT test. Retrieved August 31, 2023, from https://www.ets.org/toefl/score-users/ibt/about.html
- Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1(1), 19–33.
- Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Taylor & Francis.
- Finkelman, M., Darby, M., & Nering, M. (2009). A two-stage scoring method to enhance accuracy of performance level classification. Educational and Psychological Measurement, 69(1), 5–17. https://doi.org/10.1177/0013164408322025
- Johnson, R. L., Penny, J., Fisher, S., & Kuhs, T. (2003). Score resolution: An investigation of the reliability and validity of resolved scores. Applied Measurement in Education, 16(4), 299–322. https://doi.org/10.1207/S15324818AME1604_3
- Johnson, R. L., Penny, J. A., & Gordon, B. (2000). The relation between score resolution methods and interrater reliability: An empirical study of an analytic scoring rubric. Applied Measurement in Education, 13(2), 121–138. https://doi.org/10.1207/S15324818AME1302_1
- Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. The Guilford Press.
- Johnson, R. L., Penny, J., & Gordon, B. (2001). Score resolution and the interrater reliability of holistic scores in rating essays. Written Communication, 18(2), 229–249. https://doi.org/10.1177/0741088301018002003
- Kim, S., & Moses, T. (2013). Determining when single scoring for constructed-response items is as effective as double scoring in mixed-format licensure tests. International Journal of Testing, 13(4), 314–328. https://doi.org/10.1080/15305058.2013.776050
- Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre, J. M. (2015). Facets Rasch measurement (Version 3.71.4) [Computer software]. Winsteps.com.
- Mao, X., Zhang, J., & Xin, T. (2022). The optimal design of bifactor multidimensional computerized adaptive testing with mixed-format items. Applied Psychological Measurement, 46(7), 605–621. https://doi.org/10.1177/01466216221108382
- Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
- Meijer, R. R., Niessen, A. S. M., & Tendeiro, J. N. (2016). A practical guide to check the consistency of item response patterns in clinical research through person-fit statistics: Examples and a computer program. Assessment, 23(1), 52–62. https://doi.org/10.1177/1073191115577800
- Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107–135. https://doi.org/10.1177/01466210122031957
- Miao, J., Sinharay, S., Kelbaugh, C., Cao, Y., & Wang, W. (2023). Evaluating targeted double scoring for the performance assessment for school leaders using imputation and decision theory. ETS Research Report Series, 2023(1), 1–10. https://doi.org/10.1002/ets2.12363
- Myford, C. M., & Wolfe, E. W. (2002). When raters disagree, then what: Examining a third-rating discrepancy resolution procedure and its utility for identifying unusual patterns of ratings. Journal of Applied Measurement, 3(3), 300–324.
- National Assessment of Educational Progress. (2022). 2022 NAEP mathematics assessment. https://www.nationsreportcard.gov/itemmaps/?subj=MAT&grade=4&year=2022
- Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Expanded ed., 1980). University of Chicago Press.
- Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184. https://doi.org/10.1111/j.1745-3984.2003.tb01102.x
- Rupp, A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55(1), 3–38.
- Sijtsma, K., & Meijer, R. R. (2001). The person response function as a tool in person-fit research. Psychometrika, 66(2), 191–207. https://doi.org/10.1007/BF02294835
- Sinharay, S. (2015). Assessment of person fit for mixed-format tests. Journal of Educational and Behavioral Statistics, 40(4), 343–365. https://doi.org/10.3102/1076998615589128
- Sinharay, S., Johnson, M. S., Wang, W., & Miao, J. (2023). Targeted double scoring of performance tasks using a decision-theoretic approach. Applied Psychological Measurement, 47(2), 155–163. https://doi.org/10.1177/01466216221129271
- Uebersax, J. (2009). The myth of chance-corrected agreement. https://john-uebersax.com/stat/kappa2.htm
- Walker, A. A., Jennings, J. K., & Engelhard, G. (2018). Using person response functions to investigate areas of person misfit related to item characteristics. Educational Assessment, 23(1), 47–68. https://doi.org/10.1080/10627197.2017.1415143
- Wind, S. A. (2019). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159–171. https://doi.org/10.1177/0146621618789391
- Wind, S. A. (2020). Exploring the impact of rater effects on person fit in rater-mediated assessments. Educational Measurement: Issues and Practice, 39(4), 76–94. https://doi.org/10.1111/emip.12354
- Wind, S. A., & Engelhard, G. (2013). How invariant and accurate are domain ratings in writing assessment? Assessing Writing, 18(4), 278–299. https://doi.org/10.1016/j.asw.2013.09.002
- Wind, S. A., Engelhard, G., & Wesolowski, B. (2016). Exploring the effects of rater linking designs and rater fit on achievement estimates within the context of music performance assessments. Educational Assessment, 21(4), 278–299. https://doi.org/10.1080/10627197.2016.1236676
- Wind, S. A., & Ge, Y. (2021). Detecting rater biases in sparse rater-mediated assessment networks. Educational and Psychological Measurement, 81(5), 996–1022. https://doi.org/10.1177/0013164420988108
- Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79(5), 962–987. https://doi.org/10.1177/0013164419834613
- Wind, S. A., & Walker, A. A. (2019). Exploring the correspondence between traditional score resolution methods and person fit indices in rater-mediated writing assessments. Assessing Writing, 39, 25–38. https://doi.org/10.1016/j.asw.2018.12.002
- Wind, S. A., & Walker, A. A. (2020). Exploring the impacts of different score resolution procedures on person fit and estimated achievement in rater-mediated assessments. Language Assessment Quarterly, 17(4), 362–385. https://doi.org/10.1080/15434303.2020.1783668
- Wind, S. A., & Walker, A. A. (2021). A model-data-fit-informed approach to score resolution in performance assessments. Educational Measurement: Issues and Practice, 40(3), 52–63. https://doi.org/10.1111/emip.12427
- Wolfe, E. W. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
- Wolfe, E. W., & Song, T. (2015). Comparison of models and indices for detecting rater centrality. Journal of Applied Measurement, 16(3), 228–241.
- Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. MESA Press.
- Wright, B. D., Mead, R., & Draba, R. E. (1976). Detecting and correcting test item bias with a logistic response model (No. 22; Research Memorandum). Statistical Laboratory, Department of Education, University of Chicago.
- Zhao, Y., & Hambleton, R. K. (2017). Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data. Frontiers in Psychology, 8(484), 1–11. https://doi.org/10.3389/fpsyg.2017.00484