Research Article

Detecting Rater Bias in Mixed-Format Assessments

References

  • Andrich, D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
  • Andrich, D. A., & Hagquist, C. (2012). Real and artificial differential item functioning. Journal of Educational and Behavioral Statistics, 37(3), 387–416. https://doi.org/10.3102/1076998611411913
  • Andrich, D. A., & Hagquist, C. (2015). Real and artificial differential item functioning in polytomous items. Educational and Psychological Measurement, 75(2), 185–207. https://doi.org/10.1177/0013164414534258
  • Dewberry, C., Davies-Muir, A., & Newell, S. (2013). Impact and causes of rater severity/leniency in appraisals without postevaluation communication between raters and ratees. International Journal of Selection and Assessment, 21(3), 286–293. https://doi.org/10.1111/ijsa.12038
  • Draba, R. E. (1977). The identification and interpretation of item bias (Research Memorandum No. 25). Statistical Laboratory, Department of Education, University of Chicago.
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
  • Engelhard, G., & Wind, S. A. (2013). Rating quality studies using Rasch measurement theory (Research Report No. 2013–3). The College Board.
  • Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Taylor & Francis.
  • Ercikan, K., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with multiple‐choice and constructed‐response item types. Journal of Educational Measurement, 35(2), 137–154. https://doi.org/10.1111/j.1745-3984.1998.tb00531.x
  • Farrokhi, F., Esfandiari, R., & Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79. https://doi.org/10.37546/JALTJJ34.1-3
  • Guo, W., & Wind, S. A. (2021). Examining the impacts of ignoring rater effects in mixed-format tests. Journal of Educational Measurement, 58(3), 364–387. https://doi.org/10.1111/jedm.12292
  • Han, C. (2015). Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach. Interpreting, 17(2), 255–283. https://doi.org/10.1075/intp.17.2.05han
  • Jin, K.-Y., & Eckes, T. (2021). Detecting differential rater functioning in severity and centrality: The dual DRF facets model. Educational and Psychological Measurement, 82(4). https://doi.org/10.1177/00131644211043207
  • Kim, S., Walker, M. E., & McHale, F. (2008). Equating of mixed‐format tests in large‐scale assessments. ETS Research Report Series, 2008(1), i–26. https://doi.org/10.1002/j.2333-8504.2008.tb02112.x
  • Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
  • Linacre, J. M. (2015). Facets Rasch measurement (Version 3.71.4).
  • Mao, X., Zhang, J., & Xin, T. (2022). The optimal design of bifactor multidimensional computerized adaptive testing with mixed-format items. Applied Psychological Measurement, 46(7), 605–621. https://doi.org/10.1177/01466216221108382
  • Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
  • Myers, N. D., Wolfe, E. W., Feltz, D. L., & Penfield, R. D. (2006). Identifying differential item functioning of rating scale items with the Rasch model: An introduction and an application. Measurement in Physical Education and Exercise Science, 10(4), 215–240. https://doi.org/10.1207/s15327841mpee1004_1
  • Myford, C. M., & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs. ETS Research Report Series, 2000(1), i–34. https://doi.org/10.1002/j.2333-8504.2000.tb01832.x
  • Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
  • NAEP - Scoring. (n.d.). National Center for Education Statistics.
  • NAEP Scoring—Backreading. (n.d.). National Center for Education Statistics. Retrieved June 7, 2021, from https://nces.ed.gov/nationsreportcard/tdw/scoring/scoring_backreading.aspx
  • NAEP Scoring—Within-Year Interrater Agreement. (n.d.). National Center for Education Statistics. Retrieved June 7, 2021, from https://nces.ed.gov/nationsreportcard/tdw/scoring/scoring_within.aspx
  • Peabody, M. R., & Wind, S. A. (2019). Exploring the stability of differential item functioning across administrations and critical values using the Rasch separate calibration t-test method. Measurement: Interdisciplinary Research and Perspectives, 17(2), 78–92. https://doi.org/10.1080/15366367.2018.1533782
  • Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502. https://doi.org/10.1007/BF02294403
  • Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Expanded ed., 1980). University of Chicago Press.
  • Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–493. https://doi.org/10.1177/0265532208094273
  • Sinharay, S. (2015). Assessment of person fit for mixed-format tests. Journal of Educational and Behavioral Statistics, 40(4), 343–365. https://doi.org/10.3102/1076998615589128
  • Wind, S. A. (2019). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159–171. https://doi.org/10.1177/0146621618789391
  • Wind, S. A., & Ge, Y. (2021). Detecting rater biases in sparse rater-mediated assessment networks. Educational and Psychological Measurement, 81(5), 996–1022. https://doi.org/10.1177/0013164420988108
  • Wind, S. A., & Guo, W. (2021). Beyond agreement: Exploring rater effects in large-scale mixed format assessments. Educational Assessment, 26(4), 264–283. https://doi.org/10.1080/10627197.2021.1962277
  • Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252. https://doi.org/10.1177/0265532212456968
  • Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46(1), 35–51.
  • Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31–37. https://doi.org/10.1111/j.1745-3992.2012.00241.x
  • Yao, L., & Schwarz, R. D. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30(6), 469–492. https://doi.org/10.1177/0146621605284537
