Measurement, Statistics, and Research Design

Exploring difficult-to-score essays with a hyperbolic cosine accuracy model and Coh-Metrix indices

References

  • Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
  • Andrich, D. (1988). The application of an unfolding model of the PIRT type to the measurement of attitude. Applied Psychological Measurement, 12(1), 33–51. https://doi.org/10.1177/014662168801200105
  • Andrich, D. (1995). Hyperbolic cosine latent trait models for unfolding direct responses and pairwise preferences. Applied Psychological Measurement, 19(3), 269–290. https://doi.org/10.1177/014662169501900306
  • Andrich, D. (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49(2), 347–365. https://doi.org/10.1111/j.2044-8317.1996.tb01093.x
  • Andrich, D., & Luo, G. (1993). A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement, 17(3), 253–276.
  • Attali, Y. (2013). Validity and reliability of automated essay scoring. In M. D. Shermis & J. C. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 181–198). Routledge.
  • Behizadeh, N., & Pang, M. E. (2016). Awaiting a new wave: The status of state writing assessment in the United States. Assessing Writing, 29, 25–41. https://doi.org/10.1016/j.asw.2016.05.003
  • Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65(1), 60–66. https://doi.org/10.1037/0021-9010.65.1.60
  • Chung, G. K. W. K., & Baker, E. L. (2003). Issues in the reliability and validity of automated scoring of constructed responses. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach (pp. 23–40). Lawrence Erlbaum Associates.
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Erlbaum.
  • Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A, 33(4), 497–505. https://doi.org/10.1080/14640748108400805
  • Conference on College Composition and Communication. (2004). CCCC position statement on teaching, learning, and assessing writing in digital environments. https://cccc.ncte.org/cccc/resources/positions/digitalenvironments
  • Coombs, C. H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57(3), 145–158. https://doi.org/10.1037/h0060984
  • Coombs, C. H. (1964). A theory of data. Wiley.
  • Coombs, C. H., & Avrunin, C. S. (1977). Single-peaked functions and the theory of preference. Psychological Review, 84(2), 216–230. https://doi.org/10.1037/0033-295X.84.2.216
  • Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart, and Winston.
  • DeSarbo, W. S., & Hoffman, D. L. (1986). Simple and weighted unfolding threshold models for the spatial representation of binary choice data. Applied Psychological Measurement, 10(3), 247–264. https://doi.org/10.1177/014662168601000304
  • Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.
  • Engelhard, G. (1996). Evaluating rater accuracy in performance assessments. Journal of Educational Measurement, 33(1), 56–70.
  • Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge.
  • Foltz, P. W., Streeter, L. A., Lochbaum, K. E., & Landauer, T. K. (2013). Implementation and applications of the intelligent essay assessor. In M. D. Shermis & J. C. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 68–88). Routledge.
  • Freedman, S. W. (1979). How characteristics of student essays influence teachers’ evaluations. Journal of Educational Psychology, 71(3), 328–338. https://doi.org/10.1037/0022-0663.71.3.328
  • Gilhooly, K. J., & Logie, R. H. (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation, 12(4), 395–427. https://doi.org/10.3758/BF03201693
  • Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments & Computers, 36(2), 193–202. https://doi.org/10.3758/BF03195564
  • Guttman, L. (1941). The quantification of class attributes: A theory and method of scale construction. In P. Horst & P. Wallin (Eds.), The prediction of personal adjustment (pp. 319–348). Social Science Research Council.
  • Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, & J. A. Clausen (Eds.), Measurement and prediction (pp. 60–90). Princeton University Press.
  • Hoijtink, H. (1990). PARELLA: Measurement of latent traits by proximity items. University of Groningen.
  • Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86. https://doi.org/10.1037/1082-989X.5.1.64
  • Hoyt, W. T., & Kerns, M. D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403–424. https://doi.org/10.1037/1082-989X.4.4.403
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of latent semantic analysis. Erlbaum.
  • Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
  • Luo, G. (1998). A general formulation for unidimensional unfolding and pairwise preference models: Making explicit the latitude of acceptance. Journal of Mathematical Psychology, 42(4), 400–417. https://doi.org/10.1006/jmps.1998.1206
  • Luo, G. (2001). A class of probabilistic unfolding models for polytomous responses. Journal of Mathematical Psychology, 45(2), 224–248. https://doi.org/10.1006/jmps.2000.1310
  • Luo, G., & Andrich, D. (2003). RateFOLD computer program. Social Measurement Laboratory, School of Education, Murdoch University.
  • McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
  • Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. SAGE.
  • Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46(4), 371–389. https://doi.org/10.1111/j.1745-3984.2009.00088.x
  • National Commission on Writing in America's Schools and Colleges. (2003). The neglected “R”: The need for a writing revolution. College Entrance Examination Board.
  • National Council of Teachers of English & Conference on College Composition and Communication. (2014). Writing assessment: A position statement. http://www.ncte.org/cccc/resources/positions/writingassessment
  • Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76(1, Pt. 2), 1–25. https://doi.org/10.1037/h0025327
  • Post, W. J., van Duijn, M. A., & van Baarsen, B. (2001). Single-peaked or monotone tracelines? On the choice of an IRT model for scaling data. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 391–414). Springer.
  • Rafoth, B. A., & Rubin, D. L. (1984). The impact of content and mechanics on judgments of writing quality. Written Communication, 1(4), 446–458. https://doi.org/10.1177/0741088384001004004
  • Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24(1), 3–32. https://doi.org/10.1177/01466216000241001
  • Roberts, J. S., & Laughlin, J. E. (1996). A unidimensional item response model for unfolding responses from a graded disagree-agree response scale. Applied Psychological Measurement, 20(3), 231–255. https://doi.org/10.1177/014662169602000305
  • Rost, J., & Luo, G. (1997). An application of a Rasch-based model to a questionnaire on adolescent centrism. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 278–286). Waxmann.
  • Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428. https://doi.org/10.1037/0033-2909.88.2.413
  • Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76. https://doi.org/10.1016/j.asw.2013.04.001
  • Shin, J., & Gierl, M. J. (2021). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://doi.org/10.1177/0265532220937830
  • Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626–633. https://doi.org/10.2307/2283989
  • Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73(3), 497–506. https://doi.org/10.1037/0021-9010.73.3.497
  • Templin, M. (1957). Certain language skills in children: Their development and interrelationships. The University of Minnesota Press.
  • Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. https://doi.org/10.1037/h0070288
  • Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554. https://doi.org/10.1086/214483
  • Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude: A psychophysical method and some experiments for measuring attitude toward the church. The University of Chicago Press.
  • Toglia, M. P., & Battig, W. F. (1978). Handbook of semantic word norms. Lawrence Erlbaum.
  • Verhelst, N. D., & Verstralen, H. H. F. M. (1993). A stochastic unfolding model derived from the partial credit model. Kwantitatieve Methoden, 42, 93–108.
  • Wang, J., Engelhard, G., & Wolfe, E. W. (2016). Evaluating rater accuracy in rater-mediated assessments using an unfolding model. Educational and Psychological Measurement, 76(6), 1005–1025.
  • Wang, J., Engelhard, G., Raczynski, K., Song, T., & Wolfe, E. W. (2017). Evaluating rater accuracy and perception for integrated writing assessments using a mixed-methods approach. Assessing Writing, 33, 36–47.
  • Wang, J., & Engelhard, G. (2019a). Conceptualizing rater judgments and rating processes for rater-mediated assessments. Journal of Educational Measurement, 56(3), 582–609.
  • Wang, J., & Engelhard, G. (2019b). Exploring the impersonal judgments and personal preferences of raters in rater-mediated assessments with unfolding models. Educational and Psychological Measurement, 79(4), 773–795.
  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
  • Wind, S. A., & Engelhard, G. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement, 13(4), 321–335.
  • Wind, S. A., Wolfe, E. W., Engelhard, G., Foltz, P., & Rosenstein, M. (2018). The influence of rater effects in training sets on the psychometric quality of automated scoring for writing assessments. International Journal of Testing, 18(1), 27–49.
  • Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999
  • Wolfe, E. W. (2006). Uncovering rater’s cognitive processing and focus using think aloud protocols. Journal of Writing Assessment, 2, 37–56.
  • Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31–37. https://doi.org/10.1111/j.1745-3992.2012.00241.x
  • Wolfe, E. W. (2014). Methods for monitoring rating quality: Current practices and suggested changes. Pearson.
  • Wolfe, E. W., Song, T., & Jiao, H. (2016). Features of difficult-to-score essays. Assessing Writing, 27, 1–10. https://doi.org/10.1016/j.asw.2015.06.002
  • Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections, 21(2), 1–11. https://www.ets.org/Media/Research/pdf/RD_Connections_21.pdf
