References
- AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Barkaoui, K. (2014). Multifaceted Rasch analysis for test evaluation. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. III, pp. 1301–1322). John Wiley & Sons.
- Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 327(8476), 307–310. https://doi.org/10.1016/S0140-6736(86)90837-8
- Bland, J. M., & Altman, D. G. (1995). Comparing methods of measurement: Why plotting difference against standard method is misleading. Lancet, 346(8982), 1085–1087. https://doi.org/10.1016/S0140-6736(95)91748-9
- Bland, J. M., & Altman, D. G. (1999). Measuring agreement in method comparison studies. Statistical Methods in Medical Research, 8(2), 135–160. https://doi.org/10.1177/096228029900800204
- Bland, J. M., & Altman, D. G. (2003). Applying the right statistics: Analyses of measurement studies. Ultrasound in Obstetrics & Gynecology, 22(1), 85–93. https://doi.org/10.1002/uog.122
- Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi.org/10.1177/001316448104100307
- Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (ETS Research Report No. RR-05-05). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2005.tb01982.x
- Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. Routledge.
- Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis, L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B. (2018). Automated scoring of nonnative speech using the SpeechRater v. 5.0 Engine (ETS Research Report No. RR-18-10). Educational Testing Service. https://doi.org/10.1002/ets2.12198
- Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The authenticity of the PhonePass test. Language Assessment Quarterly, 3(3), 295–306. https://doi.org/10.1207/s15434311laq0303_4
- Chun, C. W. (2008). Comments on ‘Evaluation of the usefulness of the Versant for English Test: A response’: The author responds. Language Assessment Quarterly, 5(2), 168–172. https://doi.org/10.1080/15434300801934751
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational Linguistics, 30(1), 95–101. https://doi.org/10.1162/089120104773633402
- Enhanced Speech Technology Ltd. (2020). EST custom automated speech engine (CASE) v1.0. Cambridge English internal research report.
- Fan, J. (2014). Chinese test takers’ attitudes towards the Versant English Test: A mixed-methods approach. Language Testing in Asia, 4(6), 1–17. https://doi.org/10.1186/s40468-014-0006-9
- Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of Machine Learning Research, 70, 1321–1330. http://proceedings.mlr.press/v70/guo17a.html
- Higgins, D., Xi, X., Zechner, K., & Williamson, D. M. (2011). A three-stage approach to the automated scoring of spontaneous spoken responses. Computer Speech and Language, 25(2), 282–306. https://doi.org/10.1016/j.csl.2010.06.001
- Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
- Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Praeger.
- Khabbazbashi, N., Xu, J., & Galaczi, E. (2021). Opening the black box: Exploring automated speaking evaluation. In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world (pp. 333–343). Springer.
- Lieberman, H., Faaborg, A., Daher, W., & Espinosa, J. (2005). How to wreck a nice beach you sing calm incense. In J. Riedl, A. Jameson, D. Billsus, & T. Lau (Eds.), Proceedings of the 10th international conference on Intelligent user interfaces (pp. 278–280). Association for Computing Machinery. https://doi.org/10.1145/1040830.1040898
- Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre, J. M. (2014). Many-facet Rasch measurement: Facets tutorial. Winsteps. https://www.winsteps.com/tutorials.htm
- Litman, D., Strik, H., & Lim, G. S. (2018). Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities. Language Assessment Quarterly, 15(3), 294–309. https://doi.org/10.1080/15434303.2018.1472265
- Lu, Y., Gales, M., Knill, K., Manakul, P., Wang, L., & Wang, Y. (2019). Impact of ASR performance on spoken grammatical error detection. Proc. Interspeech 2019, 1876–1880. https://doi.org/10.21437/Interspeech.2019-1706
- Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. https://pubmed.ncbi.nlm.nih.gov/14523257/
- Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227. https://pubmed.ncbi.nlm.nih.gov/15064538/
- Ockey, G. J., & Chukharev-Hudilainen, E. (2021). Human versus computer partner in the paired oral discussion test. Applied Linguistics. Advance online publication. https://doi.org/10.1093/applin/amaa067
- Pontius, R. G., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32(15), 4407–4429. https://doi.org/10.1080/01431161.2011.552923
- Song, Z. (2020). English speech recognition based on deep learning with multiple features. Computing, 102(3), 663–682. https://doi.org/10.1007/s00607-019-00753-0
- van Dalen, R. C., Knill, K. M., & Gales, M. J. F. (2015). Automatically grading learners’ English using a Gaussian process. SLaTE 2015: Workshop on Speech and Language Technology in Education, 7–12. https://www.repository.cam.ac.uk/handle/1810/249186
- Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–344. https://doi.org/10.1177/0265532211424478
- Van Moere, A., & Downey, R. (2016). Technology and artificial intelligence in language assessment. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 341–358). Walter de Gruyter.
- Wagner, E. (2020). Duolingo English test, Revised version July 2019. Language Assessment Quarterly, 17(3), 300–315. https://doi.org/10.1080/15434303.2020.1771343
- Wagner, E., & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment Quarterly, 12(3), 320–331. https://doi.org/10.1080/15434303.2015.1061530
- Wang, Y., Gales, M. J. F., Knill, K. M., Kyriakopoulos, K., Malinin, A., van Dalen, R. C., & Rashid, M. (2018). Towards automatic assessment of spontaneous spoken English. Speech Communication, 104, 47–56. https://doi.org/10.1016/j.specom.2018.09.002
- Wang, Y. [Yanhong], Luan, H., Yuan, J., Wang, B., & Lin, H. (2020). LAIX corpus of Chinese learner English: Towards a benchmark for L2 English ASR. Proc. Interspeech 2020, 414–418. https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1677.pdf
- Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335–353. https://doi.org/10.1177/0265532210364406
- Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
- Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment. Proc. Interspeech 2020, 3860–3864. https://doi.org/10.21437/Interspeech.2020-2238
- Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300. https://doi.org/10.1177/0265532210364643
- Xi, X. (2012). Validity in the automated scoring of performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 438–451). Routledge.
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two scoring methods for an automated speech scoring system. Language Testing, 29(3), 371–394. https://doi.org/10.1177/0265532211425673
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRaterSM v1.0 (ETS Research Report No. RR-08-62). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2008.tb02148.x
- Xi, X., Schmidgall, J., & Wang, Y. (2016). Chinese users’ perceptions of the use of automated scoring for a speaking practice test. In G. Yu & Y. Jin (Eds.), Assessing Chinese learners of English: Language constructs, consequences and conundrums (pp. 150–175). Palgrave Macmillan.
- Xu, J. (2015). Predicting ESL learners’ oral proficiency by measuring the collocations in their spontaneous speech [Doctoral dissertation, Iowa State University]. Iowa State University Digital Repository. https://doi.org/10.31274/etd-180810-4474
- Xu, J., Brenchley, J., Jones, E., Pinnington, A., Benjamin, T., Knill, K., Seal-Coon, G., Robinson, M., & Geranpayeh, A. (2020). Linguaskill: Building a validity argument for the Speaking test. Cambridge Assessment English. https://www.cambridgeenglish.org/Images/589637-linguaskill-building-a-validity-argument-for-the-speaking-test.pdf
- Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527. https://doi.org/10.1177/0265532214536171
- Yannakoudakis, H., & Cummins, R. (2015). Evaluating the performance of automated text scoring systems. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 213–223). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0625
- Yu, D., & Deng, L. (2016). Automatic speech recognition. Springer.