
Conceptualizing and Operationalizing Second Language Speaking Assessment: Updating the Construct for a New Century


Speaking has been a component of second language proficiency tests for at least 100 years (Weir, Vidakovic, & Galaczi, 2013). Over the course of a century, ideas about the nature of speaking ability and best practice in language assessment have evolved. Early tests that assessed phonetics knowledge, reading aloud, and dictation have been followed by more communicatively oriented paired and group format tests. In addition to better-researched task types (Taylor & Wigglesworth, 2009; Young & He, 1998), tests nowadays also have more clearly defined constructs, empirically validated assessment criteria (Fulcher, 1996; Galaczi, Ffrench, Hubbard, & Green, 2011), and provisions in place for rater training and the handling of rater effects (Reed & Cohen, 2001; Winke, 2012).

Indeed, speaking contexts and second language speaking assessment have continued to evolve as the use of technology has become more widespread. Technology has made face-to-face conversations between people thousands of miles apart—once by definition a contradiction in terms—not only possible but increasingly commonplace, giving rise to learners who are “digital natives” (Prensky, 2001). Automated technologies are also beginning to be applied more widely to the delivery and scoring of speaking tests (Bernstein, 2012; Xi, Higgins, Zechner, & Williamson, 2012). It is interesting that the rise of computer-delivered tests has seen the revival of dictation and reading-aloud tasks from a century ago. As with previous periods of change, these developments have fundamental implications for how the speaking construct is conceptualized and assessed and will play a role in its future development.

Therefore, at this juncture it seems opportune to take a look at where second language speaking assessment is at present and where it might be in the future. To this end the articles in this special issue consider the state of the art for select aspects of the speaking construct and some relevant speech technologies. Galaczi and Taylor consider the notion of interactional competence, while Xu proposes a construct he calls spoken collocational competence. De Jong discusses the topic of fluency and how research in applied linguistics more broadly might benefit language assessment practice, while Isaacs discusses work on pronunciation in historical and social context and their implications not just for testing but also for teaching. Finally, Litman, Strik, and Lim provide an overview of developments in automatic speech recognition and in spoken dialogue systems, two technologies that have the potential to be useful for second language speaking assessment in the future.

While the authors were each asked to focus on a particular aspect of the speaking construct, the essential interrelatedness of these aspects can be seen in the resulting articles. Isaacs’s discussion of pronunciation moves into comprehensibility, which is affected by lexico-grammatical factors, the topic of Xu’s article. While Xu’s focus was nominally lexis, a major component of his construct of spoken collocational competence is fluency. De Jong’s discussion of fluency treats pauses as a feature of interaction, which is the topic of Galaczi and Taylor’s contribution. And interactive speech is precisely what technologies such as the spoken dialogue systems discussed by Litman et al. aim to assess. This refusal to be constrained to particular, narrow units of analysis in and of itself says something about how these authors conceive of the speaking construct.

Another insight that emerges clearly from these articles is that speaking constructs—as reflected in rating scales and descriptors, which are the embodiment of a test’s construct (Weigle, 2002)—have not been conceptualized carefully enough. For example, De Jong (this issue) points out that disfluencies in speech can be deliberate, deployed to achieve particular communicative purposes (e.g., to signal upcoming complex speech, to manage turn taking, and to hold the floor), but they are generally conceived of in present-day rating scales only as a deficit, even though such pragmatic competence is central to and predictive of high-level language ability (Grabowski, 2009). Few scales reflect the rich research that treats lexis in terms of collocations, formulaic sequences, and lexico-grammar (Xu, this issue). Interactional competence is reflected in some scales but not represented comprehensively (Galaczi & Taylor, this issue). And Isaacs notes that accent continues to appear as a criterion, even though it does not necessarily have a negative impact on intelligibility. Thus, while the development of assessment criteria has become more empirical (Fulcher, 1996), there is probably an even greater need for them to be conceptually driven, so that clear thinking about constructs is reflected in future assessment materials and practices.

In discussions about the future of speaking assessment, it is difficult not to talk about the use of technology. The discourse in this regard has often tended toward the negative, focusing on the ways technology can narrow the construct. There is of course truth in that; many present-day computer-delivered speaking tests tend to elicit relatively short, monologic speech samples, so it can be more difficult to make a validity argument for, and inferences about, spoken interactional ability. Spoken dialogue systems, as Litman, Strik, and Lim (this issue) discuss, are not yet mature enough to simulate spontaneous conversation, and it remains to be seen whether they ever will be. However, it should be pointed out that interactive speech in some contexts is naturally more restricted, such as in aviation English (Moder, 2012), and the technology may well be ready to deliver interactive assessments in those domains.

What can be missed in those discussions are the ways in which technology can also broaden constructs. One way might simply be by provoking discussion about the topic. Isaacs (this issue) suggests that the emphasis placed on pronunciation by automated scoring algorithms is part of the reason for the recent resurgence of interest in the topic. Another way is by newer technologies being able to do what older technologies cannot do or find more difficult. Rating scales, for example, the traditional “technology” for scoring, are typically limited to four or five criteria because of the attentional limitations of human raters. That these scales take the form of a grid and typically have the same number of score points across different criteria is also known to be problematic, because raters can distinguish more or fewer levels depending on the criterion being assessed (Galaczi & Taylor, this issue; Isaacs, Trofimovich, Yu, & Muñoz Chereau, 2015). By contrast, automated marking systems are able to pick up myriad predictive features (e.g., Zechner, Higgins, & Xi, 2007)—just as human raters actually do, if not consciously or in ways they can articulate (Lumley, 2005). Scoring is then based not on an arbitrary number of levels but on relevant features being applied and weighted according to their ability to predict levels of performance, leading to finer-grained evaluation and rank ordering. Being able to separate out these criteria and apply them accordingly means more clearly described constructs and therefore more valid and defensible test outcomes.
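
To make the feature-weighting idea concrete, the following is a minimal sketch in Python of how an automated system might combine many measured features into a single fine-grained score. The feature names, weights, and intercept are hypothetical illustrations, not those of SpeechRater or any other operational engine; in practice such weights would be estimated from human-scored training data.

```python
# Minimal sketch of feature-weighted automated scoring.
# Feature names, weights, and the intercept below are hypothetical
# illustrations, not those of any operational scoring engine.

# Weights of the kind that might be estimated by regressing human
# scores on automatically extracted speech features.
WEIGHTS = {
    "articulation_rate": 0.45,     # syllables per second of phonation time
    "mean_silent_pause": -0.60,    # mean silent pause duration in seconds
    "type_token_ratio": 0.30,      # lexical diversity
    "error_free_clauses": 0.45,    # proportion of clauses without errors
}
INTERCEPT = 2.0

def predict_score(features):
    """Combine many weighted features into one continuous score,
    rather than assigning one of a few fixed scale levels."""
    return INTERCEPT + sum(WEIGHTS[name] * value
                           for name, value in features.items())

# A (hypothetical) candidate's measured features.
candidate = {
    "articulation_rate": 3.8,
    "mean_silent_pause": 0.7,
    "type_token_ratio": 0.55,
    "error_free_clauses": 0.80,
}
print(round(predict_score(candidate), 2))  # a fine-grained score, ~3.8
```

Because the output is a weighted combination of many features rather than a choice among four or five band levels, such a system can in principle rank order performances more finely than a conventional rating grid allows.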

From the foregoing it should be apparent that all technologies have their own affordances and limitations and can broaden or narrow the construct in different ways. That being the case, it is perhaps more fruitful to consider how new and emerging technologies can be used or developed in ways that are fit for purpose, that is, to operationalize particular constructs. This way of thinking makes apparent that while spoken dialogue systems may not be close to ready for some uses, they may be for others. It has helped make clear that automatic speech recognition trained on L1 speech is not appropriate for L2 speech, prompting further development in that area (Litman et al., this issue). This way of thinking can reveal that this century’s read-aloud task differs from last century’s read-aloud task, even if the two are apparently identical on the surface, because they operationalize different constructs (Van Moere, 2012). It can potentially turn the myriad features that automated evaluation systems pick up into rich feedback for learners, helping to operationalize learning-oriented assessment (Carless, 2007; Jones & Saville, 2016; Turner & Purpura, 2015). For that matter, humans and technologies can jointly be used to deliver assessment: face-to-face speaking tests can be delivered via videoconferencing technology (Nakatsuhara, Inoue, Berry, & Galaczi, 2017), or humans and computers can each score those criteria they are best at scoring (Isaacs, this issue).
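
As an illustration of that last possibility, here is a small sketch of how human-awarded and machine-awarded sub-scores might be combined into one reported result. The criteria, the division of labour between human and machine, and the 0–5 scale are assumptions made for the example, not a description of any existing test.

```python
# Hypothetical sketch: the machine scores criteria it handles well
# (e.g., pronunciation, fluency), human examiners score those they
# handle better (e.g., interaction), and the sub-scores are combined.
# Criterion names and the 0-5 scale are illustrative assumptions.

MACHINE_CRITERIA = {"pronunciation", "fluency"}
HUMAN_CRITERIA = {"interaction", "discourse_management"}

def combined_score(machine_scores, human_scores):
    """Average machine and human sub-scores on a common 0-5 scale."""
    assert set(machine_scores) == MACHINE_CRITERIA
    assert set(human_scores) == HUMAN_CRITERIA
    scores = list(machine_scores.values()) + list(human_scores.values())
    return sum(scores) / len(scores)

overall = combined_score(
    machine_scores={"pronunciation": 4.2, "fluency": 3.9},
    human_scores={"interaction": 4.0, "discourse_management": 3.5},
)
print(round(overall, 1))  # 3.9
```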

The developments in technology make this an interesting and exciting time to be engaged in speaking assessment, holding the promise that speaking assessment a century hence will look very different. At the same time, the current emphasis in the field on validity and validation is vitally important. It should help to ensure that technology and construct develop in tandem, rather than one outstripping the other. Indeed, however the technology develops, there is one thing that it will not be able to help with, which is determining the reasons why we would assess in the first place (McNamara & Roever, 2006). The purposes are many, and the stakes are often high. For good or for ill, in the new century as in the old, speaking remains ultimately a human enterprise.

References

  • Bernstein, J. C. (2012). Computer scoring of spoken responses. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 857–863). New York, USA: Wiley.
  • Carless, D. (2007). Learning-oriented assessment: Conceptual bases and practical implications. Innovations in Education and Teaching International, 44(1), 57–66.
  • De Jong, N. (this issue). Fluency in second language testing: Insights from different disciplines. Language Assessment Quarterly.
  • Fulcher, G. (1996). Does thick description lead to smart tests? A rating-scale approach to language test construction. Language Testing, 13(2), 208–238.
  • Galaczi, E., & Taylor, L. (this issue). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly.
  • Galaczi, E. D., Ffrench, A., Hubbard, C., & Green, A. (2011). Developing assessment scales for large-scale speaking tests: A multiple method approach. Assessment in Education: Principles, Policy & Practice, 18(3), 217–237.
  • Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure grammatical and pragmatic knowledge in the context of speaking (Doctoral dissertation). Available from ProQuest Dissertations & Theses Global. (304882463).
  • Isaacs, T. (this issue). Shifting sands in second language pronunciation assessment research and practice. Language Assessment Quarterly.
  • Isaacs, T., Trofimovich, P., Yu, G., & Muñoz Chereau, B. (2015). Examining the linguistic aspects of speech that most efficiently discriminate between upper levels of the revised IELTS pronunciation scale. IELTS Research Reports, 2015(4), 1–48. Cambridge, UK: IELTS.
  • Jones, N., & Saville, N. (2016). Learning oriented assessment: A systematic approach. Cambridge, UK: Cambridge University Press.
  • Litman, D., Strik, H., & Lim, G. S. (this issue). Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities. Language Assessment Quarterly.
  • Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Frankfurt, Germany: Peter Lang.
  • McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Massachusetts, USA: Blackwell.
  • Moder, C. L. (2012). Aviation English. In B. Paltridge & S. Starfield (Eds.), The handbook of English for specific purposes (pp. 227–242). New York, USA: Wiley.
  • Nakatsuhara, F., Inoue, C., Berry, V., & Galaczi, E. (2017). Exploring the use of video-conferencing technology in the assessment of spoken language: A mixed-methods study. Language Assessment Quarterly, 14(1), 1–18.
  • Prensky, M. (2001). Digital natives, digital immigrants. On the Horizon, 9(5), 1–6.
  • Reed, D. J., & Cohen, A. D. (2001). Revisiting raters and ratings in oral language assessment. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, … K. O’Loughlin (Eds.), Experimenting with uncertainty: Language testing essays in honor of Alan Davies (pp. 82–96). Cambridge, UK: Cambridge University Press.
  • Taylor, L., & Wigglesworth, G. (2009). Are two heads better than one? Pair work in L2 assessment contexts. Language Testing, 26(3), 325–339.
  • Turner, C. E., & Purpura, J. E. (2015). Learning-oriented assessment in the classroom. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 255–274). Berlin, Germany: Mouton de Gruyter.
  • Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–344.
  • Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
  • Weir, C. J., Vidakovic, I., & Galaczi, E. D. (2013). Measured constructs: A history of Cambridge English examinations, 1913–2012. Cambridge, UK: Cambridge University Press.
  • Winke, P. (2012). Rating oral language. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 4849–4855). New York, USA: Wiley.
  • Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two scoring methods for an automated speech scoring system. Language Testing, 29(3), 371–394.
  • Xu, J. (this issue). Measuring ‘spoken collocational competence’ in communicative speaking assessment. Language Assessment Quarterly.
  • Young, R., & He, A. W. (Eds.). (1998). Talking and testing: Discourse approaches to the assessment of oral proficiency. Amsterdam, Netherlands: John Benjamins.
  • Zechner, K., Higgins, D., & Xi, X. (2007). SpeechRater: A construct-driven approach to scoring spontaneous non-native speech. In Proceedings of ISCA-SLATE (pp. 128–131). Pennsylvania, USA: ISCA.
