References
- American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: AERA.
- Brennan, R. L. (1992). An NCME instructional module on generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34. doi:10.1111/j.1745-3992.1992.tb00260.x
- Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
- Chiu, C. W. T., & Wolfe, E. W. (2002). A method for analyzing sparse data matrices in the generalizability theory framework. Applied Psychological Measurement, 26(3), 321–338. doi:10.1177/0146621602026003006
- Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.
- Eckes, T. (2017). Guest editorial: Rater effects: Advances in item response modeling of human ratings – Part I. Psychological Test and Assessment Modeling, 59(4), 443–452.
- Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112. doi:10.1111/j.1745-3984.1994.tb00436.x
- Greener, J. M., & Osburn, H. G. (1980). Accuracy of corrections for restriction in range due to explicit selection in heteroscedastic and non-linear distributions. Educational and Psychological Measurement, 40(2), 337–346. doi:10.1177/001316448004000208
- Gross, A. L. (1982). Relaxing assumptions underlying corrections for range restriction. Educational and Psychological Measurement, 42(3), 795–801. doi:10.1177/001316448204200311
- Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
- Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics, 9(2), 226–252. doi:10.2307/3001853
- Houston, W. M., Raymond, M. R., & Svec, J. C. (1991). Adjustments for rater effects in performance assessment. Applied Psychological Measurement, 15(4), 409–421. doi:10.1177/014662169101500411
- Johnson, S., & Johnson, R. (2009). Conceptualising and interpreting reliability (Ofqual Report No. 10/4706). Coventry, England: Ofqual.
- Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5–16. doi:10.1111/j.1745-3992.1994.tb00443.x
- Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability (Report for the National Assessment Agency by AQA Centre for Education Research and Policy, England).
- Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.
- Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York: Wiley.
- Shavelson, R. J., Gao, X., & Baxter, G. P. (1996). On the content validity of performance assessments: Centrality of domain specification. In M. Birenbaum & F. Dochy (Eds.), Alternatives in assessment of achievements, learning processes and prior knowledge (pp. 131–141). Boston: Kluwer Academic Publishers.
- Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.
- Tisi, J., Whitehouse, G., Maughan, S., & Burdett, N. (2013). A review of literature on marking reliability research (Report for Ofqual). Slough, England: NFER.
- Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. doi:10.1177/0265532216686999
- Zhang, M. (2013, March). Contrasting automated and human scoring of essays (R&D Connections, No. 21). Princeton, NJ: Educational Testing Service.