References
- Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. OECD Publications. https://doi.org/10.1787/9789264199521-en
- Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332. https://doi.org/10.1007/BF02294359
- Alderson, C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. Continuum.
- Alderson, C. (2010). “Cognitive diagnosis and Q-matrices in language assessment”: A commentary. Language Assessment Quarterly, 7(1), 96–103. https://doi.org/10.1080/15434300903426748
- Alderson, C., Brunfaut, T., & Harding, L. (2015). Towards a theory of diagnosis in second and foreign language assessment: Insights from professional practice across diverse fields. Applied Linguistics, 36(2), 236–260. https://doi.org/10.1093/applin/amt046
- Alderson, C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on the Common European Framework. Language Testing, 22(3), 301–320. https://doi.org/10.1191/0265532205lt310oa
- Aryadoust, V. (2011). Application of the fusion model to while-listening performance tests. SHIKEN: JALT Testing & Evaluation SIG Newsletter, 15(2), 2–9. https://hosted.jalt.org/test/ary_2.htm
- Aryadoust, V. (2021). A cognitive diagnostic assessment study of the listening test of the Singapore-Cambridge General Certificate of Education O-Level: Application of DINA, DINO, G-DINA, HO-DINA, and RRUM. International Journal of Listening, 35(1), 29–52. https://doi.org/10.1080/10904018.2018.1500915
- Aryadoust, V., Foo, S., & Ng, L. Y. (2021). What can gaze behaviors, neuroimaging data, and test scores tell us about test method effects and cognitive load in listening assessments? Language Testing, 1–34. https://doi.org/10.1177/02655322211026876
- Bolt, D. (2007). The present and future of IRT-based cognitive diagnostic models (ICDMs) and related methods. Journal of Educational Measurement, 44(4), 377–383. https://doi.org/10.1111/j.1745-3984.2007.00045.x
- Bolt, D. (2019). Bifactor MIRT as an appealing and related alternative to CDMs in the presence of skill attribute continuity. In M. von Davier & Y.-S. Lee (Eds.), Handbook of diagnostic classification models (pp. 395–417). Springer.
- Bolt, D., & Kim, J.-S. (2018). Parameter invariance and skill attribute continuity in the DINA model. Journal of Educational Measurement, 55(2), 264–280. https://doi.org/10.1111/jedm.12175
- Bradshaw, L., & Madison, M. (2016). Invariance properties for general diagnostic classification models. International Journal of Testing, 16(2), 99–118. https://doi.org/10.1080/15305058.2015.1107076
- Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157. https://doi.org/10.1177/026553229801500201
- Cai, L. (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75(4), 581–612. https://doi.org/10.1007/s11336-010-9178-0
- Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO user’s guide. Scientific Software International, Inc.
- Cai, Y., Tu, D., & Ding, S. (2018). Theorems and methods of a complete Q Matrix with attribute hierarchies under restricted Q-matrix design. Frontiers in Psychology, 9, 1413. https://doi.org/10.3389/fpsyg.2018.01413
- Carroll, J. B. (1972). Defining language comprehension. In R. O. Freedle & J. B. Carroll (Eds.), Language comprehension and the acquisition of knowledge (pp. 1–29). John Wiley.
- Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
- Choi, H. J. (2010). A model that combines diagnostic classification assessment with mixture item response theory models [Unpublished doctoral dissertation]. University of Georgia. https://getd.libs.uga.edu/pdfs/choi_hye-jeong_201005_phd.pdf
- Choi, I., & Papageorgiou, S. (2020). Evaluating subscore uses across multiple levels: A case of reading and listening subscores for young EFL learners. Language Testing, 37(2), 254–279. https://doi.org/10.1177/0265532219879654
- de la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333–353. https://doi.org/10.1007/BF02295640
- de la Torre, J., & Lee, Y. S. (2010). A note on the invariance of the DINA model parameters. Journal of Educational Measurement, 47(1), 115–127. https://doi.org/10.1111/j.1745-3984.2009.00102.x
- DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13(4), 354–378. https://doi.org/10.1080/15305058.2013.799067
- Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Erlbaum.
- Embretson, S. E., & Yang, X. (2013). A multicomponent latent trait model for diagnosis. Psychometrika, 78(1), 14–36. https://doi.org/10.1007/s11336-012-9296-y
- Fan, J., & Yan, X. (2017). From test performance to language use: Using self-assessment to validate a high-stakes English proficiency test. The Asia-Pacific Education Researcher, 26(1–2), 61–73. https://doi.org/10.1007/s40299-017-0327-4
- Field, J. (2008). Listening in the language classroom. Cambridge University Press.
- Field, J. (2013). Cognitive validity. In L. Taylor & A. Geranpayeh (Eds.), Examining listening (pp. 77–151). Cambridge University Press.
- Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the certificate in advanced English examination. Language Assessment Quarterly, 4(2), 190–222. https://doi.org/10.1080/15434300701375758
- Haberman, S., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62(1), 79–95. https://doi.org/10.1348/000711007X248875
- Harding, L., Alderson, C., & Brunfaut, T. (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing, 32(3), 317–336. https://doi.org/10.1177/0265532214564505
- He, L., & Chen, D. (2017). Developing common listening ability scales for Chinese learners of English. Language Testing in Asia, 7(4), 1–12. https://doi.org/10.1186/s40468-017-0033-4
- Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9(1), 1–11. https://doi.org/10.1177/026553229200900102
- Holzknecht, F., Eberharter, K., Kremmel, B., Zehentner, M., McCray, G., Konrad, E., & Spöttl, C. (2017). Looking into listening: Using eye-tracking to establish the cognitive validity of the Aptis Listening Test (ARAGs Research Reports). British Council. https://www.britishcouncil.org/sites/default/files/looking_into_listening.pdf
- Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for applying Fusion Model to LanguEdge assessment. Language Testing, 26(1), 31–73. https://doi.org/10.1177/0265532208097336
- Jang, E. E., Dunlop, M., Park, G., & Van Der Boom, E. H. (2015). How do young students with different profiles of reading skill mastery, perceived ability, and goal orientation respond to holistic diagnostic feedback? Language Testing, 32(3), 359–383. https://doi.org/10.1177/0265532215570924
- Jang, E. E., Kim, H., Vincett, M., Barron, C., & Russell, B. (2019). Improving IELTS reading test score interpretations and utilisation through cognitive diagnosis model-based skill profiling (IELTS Research Reports Online Series No. 2). British Council, Cambridge Assessment English and IDP:. https://www.ielts.org/research/research-reports/online-series-2019-2
- Javidanmehr, Z., & Sarab, A. M. R. (2019). Retrofitting non-diagnostic reading comprehension assessment: Application of the G-DINA model to a high stakes reading comprehension test. Language Assessment Quarterly, 16(3), 294–311. https://doi.org/10.1080/15434303.2019.1654479
- Kim, S. Y., Lee, W. C., & Kolen, M. J. (2020). Simple-structure multidimensional item response theory equating for multidimensional tests. Educational and Psychological Measurement, 80(1), 91–125. https://doi.org/10.1177/0013164419854208
- Kirkpatrick, R., Wang, C., Shin, C.-W., Chien, Y., & Goodman, J. (2013, April). Profile classification for cognitive diagnostic assessment: A simulation study [Conference paper presentation]. The 2013 Annual Meeting of National Council on Measurement in Education, San Francisco, CA, United States.
- Klem, M., Gustafsson, J.-E., & Hagtvet, B. (2015). The dimensionality of language ability in four-year-olds: Construct validation of a language screening tool. Scandinavian Journal of Educational Research, 59(2), 195–213. https://doi.org/10.1080/00313831.2014.904416
- Kunnan, A. J., & Jang, E. E. (2009). Diagnostic feedback in language assessment. In M. H. Long & C. J. Doughty (Eds.), The handbook of language teaching (pp. 610–627). Wiley-Blackwell.
- Lee, Y., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6(3), 239–263. https://doi.org/10.1080/15434300903079562
- Li, H., Hunter, C. V., & Lei, P.-W. (2016). The selection of cognitive diagnostic models for a reading comprehension test. Language Testing, 33(3), 391–409. https://doi.org/10.1177/0265532215590848
- Li, X., & Wang, W.-C. (2015). Assessment of differential item functioning under cognitive diagnosis models: The DINA model example. Journal of Educational Measurement, 52(1), 28–54. https://doi.org/10.1111/jedm.12061
- Liu, R., Huggins-Manley, A. C., & Bulut, O. (2018). Retrofitting diagnostic classification models to responses from IRT-based assessment forms. Educational and Psychological Measurement, 78(3), 357–383. https://doi.org/10.1177/0013164416685599
- Liu, Y., Li, Z., & Liu, H. (2019). Reporting valid and reliable overall scores and domain scores using bi-factor model. Applied Psychological Measurement, 43(7), 562–576. https://doi.org/10.1177/0146621618813093
- Ma, W., de la Torre, J., Sorrel, M., & Jiang, Z. (2020). Package ‘GDINA’. CRAN. https://cran.r-project.org/web/packages/GDINA/GDINA.pdf
- Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075
- McNamara, T. (1996). Measuring second language performance. Longman.
- Mellenbergh, G. L. (2019). Counteracting methodological errors in behavioral research. Springer.
- Min, S., & He, L. (2014). Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment. Language Testing, 31(4), 453–477. https://doi.org/10.1177/0265532214527277
- Min, S., & Jiang, Z. (2020). 校本听力考试与《中国英语能力等级量表》对接研究 [Linking the listening subtest of an in-house English proficiency test to China’s Standards of English Language Ability (CSE)]. Foreign Language Education, 41(4), 47–51. https://www.cnki.com.cn/Article/CJFDTotal-TEAC202004009.htm
- Mirzaei, A., Vincheh, M. H., & Hashemian, M. (2020). Retrofitting the IELTS reading section with a general cognitive diagnostic model in an Iranian EAP context. Studies in Educational Evaluation, 64, Article 100817. https://doi.org/10.1016/j.stueduc.2019.100817
- Morin, A. J. S., Arens, A. K., & Marsh, H. W. (2016). A bifactor exploratory structural equation modeling framework for the identification of distinct sources of construct-relevant psychometric multidimensionality. Structural Equation Modeling: A Multidisciplinary Journal, 23(1), 116–139. https://doi.org/10.1080/10705511.2014.961800
- Musek, J. (2017). The general factor of personality. Academic Press.
- Neyman, J., & Pearson, E. S. (1992). On the problem of the most efficient tests of statistical hypotheses. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in statistics (pp. 73–108). Springer.
- Pae, T. (2004). DIF for examinees with different academic backgrounds. Language Testing, 21(1), 53–73. https://doi.org/10.1191/0265532204lt274oa
- Ranjbaran, F., & Alavi, S. M. (2017). Developing a reading comprehension test for cognitive diagnostic assessment: A RUM analysis. Studies in Educational Evaluation, 55, 167–179. https://doi.org/10.1016/j.stueduc.2017.10.007
- Ravand, H., & Robitzsch, A. (2018). Cognitive diagnostic model of best choice: A study of reading comprehension. Educational Psychology, 38(10), 1255–1277. https://doi.org/10.1080/01443410.2018.1489524
- Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational and Behavioral Statistics, 4(3), 207–230. https://doi.org/10.3102/10769986004003207
- Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555
- Reise, S. P., Cook, K. F., & Moore, T. M. (2014). Evaluating the impact of multidimensionality on unidimensional item response theory model parameters. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 13–40). Routledge.
- Rupp, A. A. (2007). The answer is in the question: A guide for describing and investigating the conceptual foundations and statistical properties of cognitive psychometric models. International Journal of Testing, 7(2), 95–125. https://doi.org/10.1080/15305050701193454
- Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
- Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-Matrix construction: Defining the link between constructs and test items in large-scale reading and listening comprehension assessments. Language Assessment Quarterly, 6(3), 190–209. https://doi.org/10.1080/15434300902801917
- Sawaki, Y., Stricker, L., & Oranje, A. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30. https://doi.org/10.1177/0265532208097335
- Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136
- Sinharay, S., Puhan, G., Haberman, S. J., & Hambleton, R. K. (2019). Subscores: When to communicate them, what are their alternatives, and some recommendations. In D. Zapata-Rivera (Ed.), Score reporting research and applications (pp. 80–107). Routledge Taylor & Francis Group.
- Sinharay, S., Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic scores in educational testing: Temptations, pitfalls, and some solutions. Multivariate Behavioral Research, 45(3), 553–573. https://doi.org/10.1080/00273171.2010.483382
- Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25(4), 435–464. https://doi.org/10.1177/0265532208094272
- Stout, W. (2007). Skills diagnosis using IRT-based continuous latent trait models. Journal of Educational Measurement, 44(4), 313–324. https://doi.org/10.1111/j.1745-3984.2007.00041.x
- Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339. https://doi.org/10.1007/s11336-013-9362-0
- Toprak, E., Aryadoust, V., & Goh, C. (2019). The log-linear cognitive diagnosis modeling (LCDM) in second language listening assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment Volume II: Advanced methods (pp. 56–78). Routledge.
- Urmston, A., Raquel, M., & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiary students’ English language proficiency: The development and validation of DELTA. Hong Kong Journal of Applied Linguistics, 14(2), 60–82. https://www.academia.edu/13521940
- Vandergrift, L. (2007). Recent developments in second and foreign language listening comprehension research. Language Teaching, 40(3), 191–210. https://doi.org/10.1017/S0261444807004338
- Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening. Routledge.
- von Davier, M., & Haberman, S. (2014). Hierarchical diagnostic classification models morphing into unidimensional ‘Diagnostic’ classification models – A commentary. Psychometrika, 79(2), 340–346. https://doi.org/10.1007/s11336-013-9363-z
- Weeks, J. P. (2015). Multidimensional test linking. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 406–434). Routledge.
- Yi, Y.-S. (2017). Probing the relative importance of different attributes in L2 reading and listening comprehension items: An application of cognitive diagnostic models. Language Testing, 34(3), 1–19. https://doi.org/10.1177/0265532216646141
- Yu, X., Cheng, Y., & Chang, H.-H. (2019). Recent developments in cognitive diagnostic computerized adaptive testing (CD-CAT): A comprehensive review. In M. von Davier & Y.-S. Lee (Eds.), Handbook of diagnostic classification models (pp. 307–331). Springer.
- Zhan, P., Jiao, H., Liao, D., & Li, F. (2019). A longitudinal higher-order diagnostic classification model. Journal of Educational and Behavioral Statistics, 44(3), 251–281. https://doi.org/10.3102/1076998619827593