Measurement, Statistics, and Research Design

Exploring difficult-to-score essays with a hyperbolic cosine accuracy model and Coh-Metrix indices

References

  • Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
  • Andrich, D. (1988). The application of an unfolding model of the PIRT type to the measurement of attitude. Applied Psychological Measurement, 12(1), 33–51. https://doi.org/10.1177/014662168801200105
  • Andrich, D. (1995). Hyperbolic cosine latent trait models for unfolding direct responses and pairwise preferences. Applied Psychological Measurement, 19(3), 269–290. https://doi.org/10.1177/014662169501900306
  • Andrich, D. (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49(2), 347–365. https://doi.org/10.1111/j.2044-8317.1996.tb01093.x
  • Andrich, D., & Luo, G. (1993). A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement, 17(3), 253–276.
  • Attali, Y. (2013). Validity and reliability of automated essay scoring. In M. D. Shermis & J. C. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 181–198). Routledge.
  • Behizadeh, N., & Pang, M. E. (2016). Awaiting a new wave: The status of state writing assessment in the United States. Assessing Writing, 29, 25–41. https://doi.org/10.1016/j.asw.2016.05.003
  • Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65(1), 60–66. https://doi.org/10.1037/0021-9010.65.1.60
  • Chung, G. K. W. K., & Baker, E. L. (2003). Issues in the reliability and validity of automated scoring of constructed responses. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach (pp. 23–40). Lawrence Erlbaum Associates.
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Erlbaum.
  • Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A, 33(4), 497–505. https://doi.org/10.1080/14640748108400805
  • Conference on College Composition and Communication. (2004). CCCC position statement on teaching, learning, and assessing writing in digital environments. https://cccc.ncte.org/cccc/resources/positions/digitalenvironments
  • Coombs, C. H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57(3), 145–158. https://doi.org/10.1037/h0060984
  • Coombs, C. H. (1964). A theory of data. Wiley.
  • Coombs, C. H., & Avrunin, C. S. (1977). Single-peaked functions and the theory of preference. Psychological Review, 84(2), 216–230. https://doi.org/10.1037/0033-295X.84.2.216
  • Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart, and Winston.
  • DeSarbo, W. S., & Hoffman, D. L. (1986). Simple and weighted unfolding threshold models for the spatial representation of binary choice data. Applied Psychological Measurement, 10(3), 247–264. https://doi.org/10.1177/014662168601000304
  • Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.
  • Engelhard, G. (1996). Evaluating rater accuracy in performance assessments. Journal of Educational Measurement, 33(1), 56–70.
  • Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge.
  • Foltz, P. W., Streeter, L. A., Lochbaum, K. E., & Landauer, T. K. (2013). Implementation and applications of the intelligent essay assessor. In M. D. Shermis & J. C. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 68–88). Routledge.
  • Freedman, S. W. (1979). How characteristics of student essays influence teachers’ evaluations. Journal of Educational Psychology, 71(3), 328–338. https://doi.org/10.1037/0022-0663.71.3.328
  • Gilhooly, K. J., & Logie, R. H. (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation, 12(4), 395–427. https://doi.org/10.3758/BF03201693
  • Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments & Computers, 36(2), 193–202. https://doi.org/10.3758/BF03195564
  • Guttman, L. (1941). The quantification of class attributes: A theory and method of scale construction. In P. Horst & P. Wallin (Eds.), The prediction of personal adjustment (pp. 319–348). Social Science Research Council.
  • Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, & J. A. Clausen (Eds.), Measurement and prediction (pp. 60–90). Princeton University Press.
  • Hoijtink, H. (1990). PARELLA: Measurement of latent traits by proximity items. University of Groningen.
  • Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86. https://doi.org/10.1037/1082-989X.5.1.64
  • Hoyt, W. T., & Kerns, M. D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403–424. https://doi.org/10.1037/1082-989X.4.4.403
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of latent semantic analysis. Erlbaum.
  • Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
  • Luo, G. (1998). A general formulation for unidimensional unfolding and pairwise preference models: Making explicit the latitude of acceptance. Journal of Mathematical Psychology, 42(4), 400–417. https://doi.org/10.1006/jmps.1998.1206
  • Luo, G. (2001). A class of probabilistic unfolding models for polytomous responses. Journal of Mathematical Psychology, 45(2), 224–248. https://doi.org/10.1006/jmps.2000.1310
  • Luo, G., & Andrich, D. (2003). RateFOLD computer program. Social Measurement Laboratory, School of Education, Murdoch University.
  • McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
  • Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. SAGE.
  • Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46(4), 371–389. https://doi.org/10.1111/j.1745-3984.2009.00088.x
  • National Commission on Writing in America's Schools and Colleges. (2003). The neglected “R”: The need for a writing revolution. College Entrance Examination Board.
  • National Council of Teachers of English & Conference on College Composition and Communication. (2014). Writing assessment: A position statement. http://www.ncte.org/cccc/resources/positions/writingassessment
  • Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76(1, Pt. 2), 1–25. https://doi.org/10.1037/h0025327
  • Post, W. J., van Duijn, M. A., & van Baarsen, B. (2001). Single-peaked or monotone tracelines? On the choice of an IRT model for scaling data. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 391–414). Springer.
  • Rafoth, B. A., & Rubin, D. L. (1984). The impact of content and mechanics on judgments of writing quality. Written Communication, 1(4), 446–458. https://doi.org/10.1177/0741088384001004004
  • Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24(1), 3–32. https://doi.org/10.1177/01466216000241001
  • Roberts, J. S., & Laughlin, J. E. (1996). A unidimensional item response model for unfolding responses from a graded disagree-agree response scale. Applied Psychological Measurement, 20(3), 231–255. https://doi.org/10.1177/014662169602000305
  • Rost, J., & Luo, G. (1997). An application of a Rasch-based model to a questionnaire on adolescent centrism. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 278–286). Waxmann.
  • Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428. https://doi.org/10.1037/0033-2909.88.2.413
  • Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76. https://doi.org/10.1016/j.asw.2013.04.001
  • Shin, J., & Gierl, M. J. (2021). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://doi.org/10.1177/0265532220937830
  • Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626–633. https://doi.org/10.2307/2283989
  • Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73(3), 497–506. https://doi.org/10.1037/0021-9010.73.3.497
  • Templin, M. (1957). Certain language skills in children: Their development and interrelationships. The University of Minnesota Press.
  • Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. https://doi.org/10.1037/h0070288
  • Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554. https://doi.org/10.1086/214483
  • Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude: A psychophysical method and some experiments for measuring attitude toward the church. The University of Chicago Press.
  • Toglia, M. P., & Battig, W. F. (1978). Handbook of semantic word norms. Lawrence Erlbaum.
  • Verhelst, N. D., & Verstralen, H. H. F. M. (1993). A stochastic unfolding model derived from the partial credit model. Kwantitatieve Methoden, 42, 93–108.
  • Wang, J., Engelhard, G., & Wolfe, E. W. (2016). Evaluating rater accuracy in rater-mediated assessments using an unfolding model. Educational and Psychological Measurement, 76(6), 1005–1025.
  • Wang, J., Engelhard, G., Raczynski, K., Song, T., & Wolfe, E. W. (2017). Evaluating rater accuracy and perception for integrated writing assessments using a mixed-methods approach. Assessing Writing, 33, 36–47.
  • Wang, J., & Engelhard, G. (2019a). Conceptualizing rater judgments and rating processes for rater-mediated assessments. Journal of Educational Measurement, 56(3), 582–609.
  • Wang, J., & Engelhard, G. (2019b). Exploring the impersonal judgments and personal preferences of raters in rater-mediated assessments with unfolding models. Educational and Psychological Measurement, 79(4), 773–795.
  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
  • Wind, S. A., & Engelhard, G. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement, 13(4), 321–335.
  • Wind, S. A., Wolfe, E. W., Engelhard, G., Foltz, P., & Rosenstein, M. (2018). The influence of rater effects in training sets on the psychometric quality of automated scoring for writing assessments. International Journal of Testing, 18(1), 27–49.
  • Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999
  • Wolfe, E. W. (2006). Uncovering rater’s cognitive processing and focus using think aloud protocols. Journal of Writing Assessment, 2, 37–56.
  • Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31–37. https://doi.org/10.1111/j.1745-3992.2012.00241.x
  • Wolfe, E. W. (2014). Methods for monitoring rating quality: Current practices and suggested changes. Pearson.
  • Wolfe, E. W., Song, T., & Jiao, H. (2016). Features of difficult-to-score essays. Assessing Writing, 27, 1–10. https://doi.org/10.1016/j.asw.2015.06.002
  • Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections, 21(2), 1–11. https://www.ets.org/Media/Research/pdf/RD_Connections_21.pdf
