References
- AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Barkaoui, K. (2014). Multifaceted Rasch analysis for test evaluation. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. III, pp. 1301–1322). John Wiley & Sons.
- Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 327(8476), 307–310. https://doi.org/10.1016/S0140-6736(86)90837-8
- Bland, J. M., & Altman, D. G. (1995). Comparing methods of measurement: Why plotting difference against standard method is misleading. Lancet, 346(8982), 1085–1087. https://doi.org/10.1016/S0140-6736(95)91748-9
- Bland, J. M., & Altman, D. G. (1999). Measuring agreement in method comparison studies. Statistical Methods in Medical Research, 8(2), 135–160. https://doi.org/10.1177/096228029900800204
- Bland, J. M., & Altman, D. G. (2003). Applying the right statistics: Analyses of measurement studies. Ultrasound in Obstetrics & Gynecology, 22(1), 85–93. https://doi.org/10.1002/uog.122
- Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi.org/10.1177/001316448104100307
- Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (ETS Research Report No. RR-05-05). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2005.tb01982.x
- Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. Routledge.
- Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis, L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B. (2018). Automated scoring of nonnative speech using the SpeechRater v. 5.0 Engine (ETS Research Report No. RR-18-10). Educational Testing Service. https://doi.org/10.1002/ets2.12198
- Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The authenticity of the PhonePass test. Language Assessment Quarterly, 3(3), 295–306. https://doi.org/10.1207/s15434311laq0303_4
- Chun, C. W. (2008). Comments on ‘Evaluation of the usefulness of the Versant for English Test: A response’: The author responds. Language Assessment Quarterly, 5(2), 168–172. https://doi.org/10.1080/15434300801934751
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational Linguistics, 30(1), 95–101. https://doi.org/10.1162/089120104773633402
- Enhanced Speech Technology Ltd. (2020). EST custom automated speech engine (CASE) v1.0. Cambridge English internal research report.
- Fan, J. (2014). Chinese test takers’ attitudes towards the Versant English Test: A mixed-methods approach. Language Testing in Asia, 4(6), 1–17. https://doi.org/10.1186/s40468-014-0006-9
- Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of Machine Learning Research, 70, 1321–1330. http://proceedings.mlr.press/v70/guo17a.html
- Higgins, D., Xi, X., Zechner, K., & Williamson, D. M. (2011). A three-stage approach to the automated scoring of spontaneous spoken responses. Computer Speech and Language, 25(2), 282–306. https://doi.org/10.1016/j.csl.2010.06.001
- Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
- Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Praeger.
- Khabbazbashi, N., Xu, J., & Galaczi, E. (2021). Opening the black box: Exploring automated speaking evaluation. In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world (pp. 333–343). Springer.
- Lieberman, H., Faaborg, A., Daher, W., & Espinosa, J. (2005). How to wreck a nice beach you sing calm incense. In J. Riedl, A. Jameson, D. Billsus, & T. Lau (Eds.), Proceedings of the 10th international conference on Intelligent user interfaces (pp. 278–280). Association for Computing Machinery. https://doi.org/10.1145/1040830.1040898
- Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre, J. M. (2014). Many-facet Rasch measurement: Facets tutorial. Winsteps. https://www.winsteps.com/tutorials.htm
- Litman, D., Strik, H., & Lim, G. S. (2018). Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities. Language Assessment Quarterly, 15(3), 294–309. https://doi.org/10.1080/15434303.2018.1472265
- Lu, Y., Gales, M., Knill, K., Manakul, P., Wang, L., & Wang, Y. (2019). Impact of ASR performance on spoken grammatical error detection. Proc. Interspeech 2019, 1876–1880. https://doi.org/10.21437/Interspeech.2019-1706
- Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. https://pubmed.ncbi.nlm.nih.gov/14523257/
- Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227. https://pubmed.ncbi.nlm.nih.gov/15064538/
- Ockey, G. J., & Chukharev-Hudilainen, E. (2021). Human versus computer partner in the paired oral discussion test. Applied Linguistics. Advance online publication. https://doi.org/10.1093/applin/amaa067
- Pontius, R. G., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32(15), 4407–4429. https://doi.org/10.1080/01431161.2011.552923
- Song, Z. (2020). English speech recognition based on deep learning with multiple features. Computing, 102(3), 663–682. https://doi.org/10.1007/s00607-019-00753-0
- van Dalen, R. C., Knill, K. M., & Gales, M. J. F. (2015). Automatically grading learners’ English using a Gaussian process. SLaTE 2015: Workshop on Speech and Language Technology in Education, 7–12. https://www.repository.cam.ac.uk/handle/1810/249186
- Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–344. https://doi.org/10.1177/0265532211424478
- Van Moere, A., & Downey, R. (2016). Technology and artificial intelligence in language assessment. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 341–358). Walter de Gruyter.
- Wagner, E. (2020). Duolingo English test, Revised version July 2019. Language Assessment Quarterly, 17(3), 300–315. https://doi.org/10.1080/15434303.2020.1771343
- Wagner, E., & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment Quarterly, 12(3), 320–331. https://doi.org/10.1080/15434303.2015.1061530
- Wang, Y., Gales, M. J. F., Knill, K. M., Kyriakopoulos, K., Malinin, A., van Dalen, R. C., & Rashid, M. (2018). Towards automatic assessment of spontaneous spoken English. Speech Communication, 104, 47–56. https://doi.org/10.1016/j.specom.2018.09.002
- Wang, Y. [Yanhong], Luan, H., Yuan, J., Wang, B., & Lin, H. (2020). LAIX corpus of Chinese learner English: Towards a benchmark for L2 English ASR. Proc. Interspeech 2020, 414–418. https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1677.pdf
- Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335–353. https://doi.org/10.1177/0265532210364406
- Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
- Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment. Proc. Interspeech 2020, 3860–3864. https://doi.org/10.21437/Interspeech.2020-2238
- Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300. https://doi.org/10.1177/0265532210364643
- Xi, X. (2012). Validity in the automated scoring of performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 438–451). Routledge.
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two scoring methods for an automated speech scoring system. Language Testing, 29(3), 371–394. https://doi.org/10.1177/0265532211425673
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRaterSM v1.0 (ETS Research Report No. RR-08-62). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2008.tb02148.x
- Xi, X., Schmidgall, J., & Wang, Y. (2016). Chinese users’ perceptions of the use of automated scoring for a speaking practice test. In G. Yu & Y. Jin (Eds.), Assessing Chinese learners of English: Language constructs, consequences and conundrums (pp. 150–175). Palgrave Macmillan.
- Xu, J. (2015). Predicting ESL learners’ oral proficiency by measuring the collocations in their spontaneous speech [Doctoral dissertation, Iowa State University]. Iowa State University Digital Repository. https://doi.org/10.31274/etd-180810-4474
- Xu, J., Brenchley, J., Jones, E., Pinnington, A., Benjamin, T., Knill, K., Seal-Coon, G., Robinson, M., & Geranpayeh, A. (2020). Linguaskill: Building a validity argument for the Speaking test. Cambridge Assessment English. https://www.cambridgeenglish.org/Images/589637-linguaskill-building-a-validity-argument-for-the-speaking-test.pdf
- Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527. https://doi.org/10.1177/0265532214536171
- Yannakoudakis, H., & Cummins, R. (2015). Evaluating the performance of automated text scoring systems. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 213–223). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0625
- Yu, D., & Deng, L. (2016). Automatic speech recognition. Springer.