Research Article

L2 Speaking Assessment in Secondary School Classrooms in Japan


ABSTRACT

In Japanese secondary schools, speaking assessment in English classrooms is designed, conducted, and scored by teachers. Although this assessment is intended to serve both summative and formative purposes, it is not regularly or adequately practiced. This paper reports the main problems (i.e., a lack of continuous speaking assessment, a limited range of speaking test formats, and reliability that is not ensured) and presents future directions for implementing second language speaking assessment so as to enhance its summative and formative uses: improving teacher training and resources and discussing an effective speaking assessment framework.


It is widely recognized that assessing second language (L2) speaking in the classroom is an essential part of education, facilitating the tracking of learning progress and providing feedback to improve both learning and teaching (Poehner & Inbar-Lourie, 2020; Yan et al., 2021). In addition to the formative functions of L2 classroom assessment, summative functions are important for reporting progress to students and families and for future admission and employment opportunities.

To realize the formative and summative assessment of L2 speaking in the classroom, speaking tests are ideally conducted multiple times; students are invited to speak in various formats, such as oral presentations to the class or dialogues with the teacher or their peer(s); teachers and students evaluate oral performances using rubrics (i.e., teacher, self-, and peer assessment), with reliability between or within raters checked; scores and comments are used for grading and provided to students as feedback. However, ideal summative and formative practice is not often observed in classrooms, where L2 English classroom speaking assessment (SA) is designed, administered, and graded by teachers who do not usually have much knowledge about SA. While this issue can be observed across countries, this paper focuses on L2 English classrooms at secondary schools in Japan as a case study, examines relevant problems, and proposes directions to improve L2 English SA both summatively and formatively. The paper is primarily based on published reports relevant to the Japanese context, particularly from the Ministry of Education, Culture, Sports, Science, and Technology in Japan (MEXT) and the National Institute for Educational Policy Research (NIER), an organization under the purview of MEXT.

This paper addresses classroom SA in secondary school education (Years 7–12) in Japan for two reasons. First, MEXT provides national guidelines on teaching and assessment in primary and secondary education, which can serve as the basis for discussing SA comprehensively. This contrasts with tertiary education, which has no such guidelines and values the relatively strong autonomy of teachers (Katsuno, 2019), allowing universities and teachers to teach and assess according to their own principles. Second, English language education as a required subject officially began in primary school in the 2020 academic year (hereafter 2020), and it is too early to examine its practices. In contrast, data are available at the secondary school level. The current paper also focuses on classroom SA created, administered, and scored by teachers, which is embedded in learning and is learning-focused (see Akiyama, 2019; Allen, 2020; Meijitosho Shuppan, 2021; Sawaki & Koizumi, 2017, for the use of standardized speaking tests for entrance exams and monitoring learning progress).

Summative and formative use of classroom-based speaking assessment in Japan

Typically, classroom SA in Japanese secondary schools is used for formative and summative purposes, with more emphasis on the latter (Bacquet, 2020; Wicking, 2020), and the same test can serve one or both purposes. Davison and Leung (2009) categorized four types of formative assessment according to different purposes and priorities: “in-class contingent formative assessment-while-teaching;” “more planned integrated formative assessment;” “more formal mock or trial assessments modeled on summative assessments but used for formative purposes;” and “prescribed summative assessments, but results also used formatively to guide future teaching/learning” (p. 400). Classroom SA in Japan, as illustrated and problematized in the current paper, falls under the fourth type.

In terms of summative SA in Japan, the stakes are neither high nor low; SA scores are used for grading but carry a relatively small weight in the overall grade. However, the grades for each term and each year are cumulative and are considered in decisions such as admission to higher education (Negishi, 2020). For formative use, SA is meant to improve learning and teaching by having students work on their speaking strengths and weaknesses and by having teachers modify future lesson plans, materials, styles, short- and long-term teaching goals, and curricula. However, SA for summative and formative purposes is not well executed in Japan, despite strong recommendations for, and reforms in, teacher education and training.

Previous studies on formative and summative assessment have shown positive washback effects on learning and teaching. Lee et al. (2020) conducted a meta-analysis of formative assessment in U.S. K-12 education and found that learners who experienced formative assessment performed slightly better than those who did not (Hedges's g = .29, based on 33 studies). Muñoz and Álvarez (2010) described their classroom L2 SA system, which had both formative and summative aspects and used rubrics, self-assessment, teacher training, and regular meetings on formative assessment; formal speaking tests were conducted six times over 15 months (n = 55). They demonstrated that their system improved teachers' instructional behaviors, students' perceptions, and speaking ability. Green (2020) stated that washback is mediated by many factors, including teachers, students, and other stakeholders, as well as the authenticity, resources, importance, and difficulty of the test, and emphasized that positive consequences can be derived through careful planning and collaboration with those involved in test development and use.
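For readers less familiar with the effect size metric, Hedges's g is a standardized mean difference with a small-sample bias correction; a standard formulation (not spelled out in the source) is

$$
g = J \cdot \frac{\bar{X}_{T} - \bar{X}_{C}}{s_p}, \qquad
s_p = \sqrt{\frac{(n_T - 1)s_T^2 + (n_C - 1)s_C^2}{n_T + n_C - 2}}, \qquad
J \approx 1 - \frac{3}{4(n_T + n_C - 2) - 1},
$$

where T and C denote the treatment (formative assessment) and comparison groups. A g of .29 thus means that learners who experienced formative assessment scored about 0.29 pooled standard deviations higher, on average, than those who did not.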

While summative and formative SA can potentially affect teaching and learning positively, several key elements are required for it to work effectively in practice. Muñoz and Álvarez (2010) summarized five principles for generating positive test washback: (a) continuous assessment that allows students to showcase their abilities; (b) effective use of various tasks that match teaching objectives and instructional tasks, that are authentic and meaningful, that cater to different types of learners, and that specify clear instructions and procedures to elicit the targeted ability; (c) the use of rubrics that cover wide aspects of speaking ability, which makes the tested ability easy for students and teachers to understand; (d) encouragement of self-assessment; and (e) concrete, specific feedback to students. While these points are primarily related to validity, reliability and practicality should also be considered, as they are vital aspects of L2 assessment (Brown & Abeywickrama, 2019; Green, 2020): (f) test scores and feedback that are sufficiently reliable, and (g) test development, administration, scoring, and feedback that are reasonably feasible. As seen in Table 1, each principle is related to summative and formative functions, and following these seven principles would improve classroom summative and formative SA. This improvement could exert a positive impact on learning and teaching specifically (1) by formatively providing opportunities for students to improve their speaking ability, grasp their own characteristics, and modify their learning behaviors, and for teachers to modify future lesson plans, teaching goals, and curricula; and (2) by teachers summatively including SA results in grades and acknowledging students' efforts and achievement. It should be noted that such refinement does not guarantee success in learning and teaching L2 speaking, because educational success involves complicated interplays among many factors, such as students, teachers, materials, and learning environments, as documented in previous studies on washback, L2 acquisition, and assessment (Ali & Hamid, 2020; Kormos, 2006; Taylor, 2011). Among these factors, functional SA practice is an essential element of progress toward improved L2 education.

Table 1. Principles for effective classroom speaking assessment and why they are important in summative and formative assessment

Current context

Foreign language education is a required subject at primary schools (Years 5–6), junior high schools (or lower secondary schools; Years 7–9), senior high schools (or upper secondary schools; Years 10–12), and tertiary institutions. Japan has national guidelines for the primary and secondary stages, called the Course of Study. Foreign language curricula, primarily focusing on English, aim to develop four well-balanced skills – listening, reading, speaking (interaction and production), and writing – as well as integrated skills, for example, by having students listen to and read something, then speak both extemporaneously and with preparation (Ministry of Education, Culture, Sports, Science and Technology [MEXT], 2020a).

L2 English proficiency among students is still considered inadequate, despite long-standing efforts to improve it. MEXT (2018) aimed for over half of junior high school (HS) students to reach a high A1 level and over half of senior HS students to reach the A2 level of the Common European Framework of Reference (CEFR). However, these goals were not achieved; in a stratified sample of students, only 33.1% of junior HS students and 12.9% of senior HS students met the speaking target (MEXT, 2018; see Note 1).

To boost students' English ability, MEXT and NIER created plans to improve English teaching and assessment. Some examples include refining national guidelines, encouraging teachers to focus more on communicative ability, distributing teaching materials and tips for teachers, and conducting fact-finding surveys on English education (MEXT, 2020b). Others include providing teachers with handbooks that contain good assessment practice samples along with explanations; encouraging performance-based speaking and writing tests; and conducting learning-oriented and criterion-referenced teacher, self-, and peer assessment (National Institute for Educational Policy Research [NIER], 2021).

MEXT also commissioned a research group to formulate the National Core Curriculum for Teaching English to enhance pre-service and in-service training (Kasuya et al., 2021). This Core Curriculum for teachers highlights the central role of L2 assessment and mandates training to acquire the knowledge and skills to create paper-based and performance tests that evaluate independent and integrated skills, aiming to help teachers gain language assessment literacy (Inbar-Lourie, 2017) and conduct effective assessment as part of L2 education. Further, MEXT (2020c) has required each region (i.e., prefecture and government-designated city) to create annual concrete plans for teacher training and to make efforts to realize these plans, which should include increasing the number of speaking and writing tests.

It should be emphasized that the assessment guidelines presented by NIER (2021) only provide general assessment frameworks to follow, with sample tasks and rubrics. Therefore, teachers are expected to create, administer, and score SA on their own and to use test scores to improve learning, teaching, and school education within the Plan-Do-Check-Act cycle. Thus, teachers have a certain degree of freedom in this system, but a lack of language assessment literacy usually makes it difficult to juggle teaching and assessment effectively.

Issues and challenges

Considering the use of L2 classroom SA in secondary schools in Japan for formative and summative purposes, the aforementioned seven key elements should be implemented. Among the seven, three areas essential for summative and formative purposes deserve particular attention because available evidence points to them as areas for improvement: (a) conducting continuous assessment, (b) using various tasks effectively, and (f) ensuring sufficiently reliable test scores and feedback. The issues and challenges related to these three are summarized below. The other areas are not necessarily practiced well either, but the lack of clear evidence on classroom SA precludes detailed analysis.

Lack of continuous speaking assessment

Japanese secondary schools, in particular senior HSs, do not conduct SA as frequently as desired (Yonezaki, 2016), so SA is not continuously implemented. MEXT (2020b) conducted an annual survey of all public schools to track current conditions in English language education, including how frequently speaking and writing tests are administered. While the survey is based on teachers' self-reports, so the precise numbers may not be fully accurate, it shows an overall tendency. Figure 1 shows that 86.1% of junior HSs conducted both speaking and writing tests, while 8.0% conducted only speaking tests. On the other hand, 36.4% of senior HSs conducted both tests, while 15.7% conducted only speaking tests. In other words, almost all junior HS courses (94.1%) conducted SA at least once a year, while only about half of the senior HS courses (52.1%) did so. In addition, SA frequency in junior HSs was higher than in senior HSs: on average, 3.8 times a year in junior HSs (98,893/26,239) and 1.8 times in senior HSs (16,067/9,083). Large variations across regions have also been reported (MEXT, 2020c). In Japan, the academic year is usually divided into three terms, and term tests are conducted five times per year. Thus, I would argue that SA should be conducted at least three times a year to examine progress, and the current frequency at senior HSs appears insufficient.
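To make the arithmetic behind these figures explicit, the short Python sketch below recomputes the quoted values from the raw counts given in the text; the variable names are illustrative, not MEXT's.

```python
# Recomputing the MEXT (2020b) figures quoted above (counts taken from the text).
junior_tests, junior_courses = 98_893, 26_239  # speaking tests administered / courses
senior_tests, senior_courses = 16_067, 9_083

print(f"Junior HS: {junior_tests / junior_courses:.1f} speaking tests per course per year")  # 3.8
print(f"Senior HS: {senior_tests / senior_courses:.1f} speaking tests per course per year")  # 1.8

# Share of courses with at least one speaking test in the year (Figure 1 categories:
# "both speaking and writing tests" plus "speaking test only").
print(f"Junior HS: {86.1 + 8.0:.1f}% of courses conducted SA at least once")   # 94.1%
print(f"Senior HS: {36.4 + 15.7:.1f}% of courses conducted SA at least once")  # 52.1%
```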

Figure 1. Percentages of courses that conducted (or would conduct in the same year) both a Speaking Test (ST) and Writing Test (WT), either an ST or a WT, or neither test in a year.

Note. Calculated as “the number of courses in each category” divided by “the total number of courses (that had students),” multiplied by 100. The counting unit was a course: junior high schools (HSs) typically had three courses (i.e., first-, second-, and third-year); senior HSs had several courses, whose number depends on the school (five-course example: Communication English I, II, and III, and English Expression I and II). When schools had multiple departments, such as a general department (futsuka), an international department, and an agricultural department, courses were counted for each department. Data for this figure were derived from MEXT (2020b). These notes also apply to Figure 2.

Figure 2. Percentages of speaking test formats that courses used (or would use in the same year).

Note. Calculated as “the number of times that each format was used” divided by “the total number of tests conducted,” multiplied by 100.

Some educators may argue that it is not feasible to conduct SA three times a year because entrance examinations do not typically include SA and teachers are busy preparing students for examinations, an issue common in other Asian countries (Vongpumivitch, 2014; see Note 2). However, the abovementioned previous studies (e.g., Muñoz & Álvarez, 2010) suggest that continuous assessment is needed to meet the formative and summative purposes of SA, garner positive effects on learning and teaching, and increase L2 ability. Nevertheless, making time to administer and score classroom SA and changing teachers' mindsets regarding SA are challenging. The current article addresses this issue and presents, in the Future directions section, possible ways to make classroom SA feasible by efficiently using in-class or out-of-class time to administer and score speaking tests.

Limited speaking test formats used

One of the principles of classroom SA is to use various tasks effectively. However, the SA formats used in Japanese secondary classrooms are limited (MEXT, 2020b). As Figure 2 shows, the frequently used formats are interview (39.7% and 32.4% at junior and senior HSs, respectively), speech (35.6% and 27.8%), and presentation (16.8% and 24.2%), which suggests a limited range of SA formats. Related to this finding, three issues need to be addressed. First, assessing talk between students, such as discussions and debates, is rare, although it seems feasible, as similar pair and group activities are frequently employed as learning activities, particularly in junior HSs (MEXT, 2016). Second, prepared or scripted talk is frequently assessed (Honda, 2007). MEXT (2018) reported that 61.2% of senior HS students responded negatively to the statement that they had experienced extemporaneous talking on a given topic in class in a learning setting. Third, tasks that integrate speaking with other skills are not often used. MEXT (2018) reported that 50.7% of senior HS students responded negatively to the statement that they had experienced talking about what they listened to and read in class. Thus, the SA formats particularly missing are those that elicit spoken interaction with classmates, extemporaneous talk, and integrated skills.

The limited use of SA formats in Japan is likely to result in unsuccessful formative and summative SA (see Table 1). Moreover, the restricted use of SA formats may lead to the assessment of a limited range of speaking ability. Previous studies indicate that each format taps a somewhat shared but substantially different aspect of speaking ability (Honda, 2007; Roever & Ikeda, 2021). Ockey et al. (2015) examined how scores derived from different SA formats were correlated among 226 Japanese university students. They reported moderate correlations between scores on group oral discussion, oral presentation, picture and graph description, and other monologic tasks (r = .67 to .76, i.e., roughly 45% to 58% shared variance), which indicates the distinctiveness of the constructs measured by different SA formats.

Narrowly measured aspects of speaking ability, resulting from a narrow range of SA formats, weaken the relationship between learning/teaching and assessment (see Note 2). As described earlier, speaking (both interaction and production) is taught at secondary schools, as specified by the national teaching guidelines (MEXT, 2020a). When only a limited range of task formats is used, students' achievement is not adequately measured. This leads to construct underrepresentation in summative and formative assessment, specifically in grades and feedback. In other words, because some SA formats are not used, some aspects of students' achievement and L2 development may not be precisely identified and valued, while other aspects may be overestimated. This could also make it difficult to grasp students' strengths and weaknesses in L2 speaking and lead to failure to give appropriate feedback and to modify teaching contents, methods, syllabi, and curricula to cater to students' needs. Further, the absence of certain SA formats may suggest that certain aspects of speaking ability are not very important, which could generate negative washback on teaching and learning. Thus, the use of multiple, varied tasks would lead to better construct representation of the targeted speaking ability, thereby improving the validity and fairness of SA (Akiyama, 2004).

Reliability that is not ensured

While the lack of continuous SA and the limited range of SA formats are primarily related to validity, issues of reliability also require attention. In classroom assessment, the degree to which the same concepts of validity and reliability as used in high-stakes assessment apply remains controversial (Lewkowicz & Leung, 2021). However, given the current focus on both summative and formative purposes, moderate yet adequate reliability would be required (e.g., reliability of .70 or higher; Wells & Wollack, 2003).
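As a reminder of what that threshold quantifies, classical test theory defines reliability as the proportion of observed-score variance attributable to true scores:

$$
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},
$$

so a reliability of .70 implies that roughly 70% of score variance reflects the ability being measured and the remaining 30% reflects measurement error.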

Reliability involves the consistency of test scores across test forms, tasks, occasions, and raters; in practice, it is often assumed rather than examined. Takaki et al. (2018) examined whether interrater or instrument reliability was reported in articles published between 2002 and 2017 in a prestigious Japanese peer-reviewed journal, Annual Review of English Language Education in Japan (ARELE). They found that reliability was reported in only 27% of speaking studies, which, according to the authors, were conducted in classrooms. Since reliability is usually examined more often in peer-reviewed studies than in everyday SA practice, these results suggest that reliability is rarely checked in classrooms.

To ensure rater reliability, rater training or moderation is usually recommended (Knoch et al., 2021). However, teachers in Japan typically have neither such training nor time to discuss how to administer and score speaking tests, which are usually scored by a single rater. In this case, misjudgment can lead to relatively large scoring variations for students with similar achievement levels. At many large schools, several classes in the same grade are taught by a few teachers, with each class taught by one teacher. Inconsistency in rating among teachers may lead to a lack of score comparability. Moreover, there is concern about consistency across and within raters (Kaneko, 2019). Large divergences in scoring could lead to errors in decision making and hesitation to conduct SA and use the results for feedback. Koizumi and Watanabe (2021) examined the rater reliability of four speaking tests at a senior HS (N = 116) using a simple rubric without prior intensive rater training and found that rater reliability with a single rater was acceptable for presentation and paired role play (Φ = .69 and .86, respectively) but low for free group discussions (Φ = .39 to .45; three or four raters were needed to obtain Φ = .70). This result is not surprising, since relatively low rater reliability has been documented in the literature on group oral discussions, even with intensive rater training (e.g., Van Moere, 2006). The current practice of using a single rater without intensive rater training appears to have room for improvement, particularly for group oral discussions.
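The reported finding that three or four raters would be needed can be roughly reproduced with the Spearman-Brown prophecy formula, which, in a simple person × rater design, parallels the D-study projection in generalizability theory. The sketch below is an approximation under that assumption, not the exact model of the cited study.

```python
import math

def phi_k(phi1: float, k: int) -> float:
    """Projected dependability of scores averaged over k raters,
    given single-rater dependability phi1 (Spearman-Brown form)."""
    return k * phi1 / (1 + (k - 1) * phi1)

def raters_needed(phi1: float, target: float = 0.70) -> int:
    """Smallest number of raters whose averaged scores reach the target dependability."""
    return math.ceil(target * (1 - phi1) / (phi1 * (1 - target)))

# Single-rater dependability for free group discussion (Koizumi & Watanabe, 2021)
for phi1 in (0.39, 0.45):
    k = raters_needed(phi1)
    print(f"phi1 = {phi1:.2f}: {k} raters -> projected phi = {phi_k(phi1, k):.2f}")
# phi1 = 0.39: 4 raters -> projected phi = 0.72
# phi1 = 0.45: 3 raters -> projected phi = 0.71
```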

Factors behind the problems

The three major issues of summative and formative classroom SA in secondary schools in Japan, mentioned above, can be attributed to three factors: (a) the inherent nature of SA, (b) teachers' personal factors, and (c) surrounding contextual factors that make classroom SA difficult. Because these factors are neither new nor specific to the Japanese context, the references cited here sometimes date back more than 10 years and include studies from other countries. In particular, (b) personal and (c) contextual factors have been comprehensively summarized in Yan et al.'s (2021) systematic review of 52 articles on the elements affecting teachers' willingness to conduct assessment and their actual implementation of it (see Note 2). While Yan et al. (2021) focused on formative assessment, the factors they proposed are also relevant to summative assessment in the current context (see Saito & Inoi, 2017; Wicking, 2017, for teachers' perceptions and beliefs about assessment in general in Japan).

Inherent difficulties in L2 classroom speaking assessment

Previous research suggests that SA is inherently difficult for three main reasons (Akiyama, 2004, 2019; Hirai & Koizumi, 2013; Honda, 2007; Luoma, 2004; Matsuzawa, 2002). First, speaking performance and scores are affected by various factors. Fulcher (2003) delineated key factors influencing SA scores, including test-takers' ability, real-time processing, personality, tasks, interlocutors, local performance conditions, rubrics, raters, and test purposes. Because of these factors, SA tends to have low reliability, especially consistency across and within raters in general and across test situations with different interlocutors (i.e., interviewers, pairs, and group members).

Second, SA suffers from low practicality, particularly in administering tests and scoring students' responses. Each format requires teachers' time, test space, devices, and effort, including having students prepare for SA and reflect on their achievement using feedback.

Third, classroom assessment poses challenges for teachers, who need to play two contrasting roles as facilitators (for teaching and formative assessment) and assessors (for summative assessment), with teaching as the primary focus (Teasdale & Leung, 2000). Studies suggest that systematically maintaining classroom assessment is challenging for teachers (Brindley, 2001; Davison, 2004).

Teachers’ personal factors

Teachers are the main agents in classroom SA, creating, administering, and scoring SA and providing feedback to students. Thus, their personal perceptions of SA may affect actual SA practices. There appear to be three major factors related to perceptions and practices (based on Yan et al., 2021): instrumental attitude toward SA, self-efficacy (or perceived control), and education and training.

First, teachers' negative attitudes toward classroom SA are related to how SA is planned and conducted; especially relevant are instrumental attitudes, that is, perceptions of whether SA is effective and whether it impacts learning and teaching. It has been reported that some teachers doubt the effectiveness and usefulness of SA given its low reliability and practicality (Akiyama, 2004). Kaneko (2019) reported several relevant comments from teachers: A junior HS teacher mentioned that s/he could conduct classroom SA and give overall feedback to the class but could not find time to provide individual feedback and have students reflect on their performances. This teacher wondered whether SA was effective when formative feedback was not provided. Another junior HS teacher valued the positive impact of SA on students' learning but worried about other students' waiting time, which potentially gives more preparation time to students who take the test later and reduces time devoted to learning (see also Kellermeier, 2010). Although this teacher asked students who had finished the test to reflect on their performance, strengths, and weaknesses and provided tasks to review previous lessons, s/he still felt that SA was challenging. Some teachers use breaks between classes and time after school to conduct SA to shorten students' waiting time. Some teachers also question whether extraverted students may have an advantage in SA, thus creating unfairness. Others note difficulties in eliciting responses, particularly extemporaneous ones, from beginners who can handle only a limited range of topics and situations using narrow grammatical structures and vocabulary.

Some teachers feel that they do not need formal SA because they believe that they can grasp students' achievement via class observation. Akiyama (2004) reported that 32.7% of junior HS teachers use observation rather than speaking tests to assess speaking ability. Although observation is a powerful formative tool, one disadvantage is that it can be difficult to obtain information on progress and achievement systematically for summative purposes (McKay & Brindley, 2007).

Others worry about dedicating time to SA rather than preparing students for entrance examinations that do not have a speaking component. Further, Akiyama (2019) reported that some senior HS teachers (16.2%) tend to believe that teaching speaking skills is less important than teaching reading and writing, and they consider “English as a cognitively demanding academic subject” (p. 170). These views correspond with the lack of SA in entrance examinations (see Vongpumivitch, 2014) and undervalue the importance of teaching and assessing speaking ability and of maintaining wide construct representation in classroom English assessment (Akiyama, 2004; see also Leung et al., 2018, for similar conflicting values in teacher assessment in other contexts). While classroom SA is not high-stakes, teachers explicitly or implicitly hold positive or negative views toward SA that can affect its implementation (Akiyama, 2004).

The second personal factor is teachers' self-efficacy, or lack of confidence, in conducting SA, particularly in creating rubrics and scoring students' performances. Takashima (2019) reported that junior HS teachers stated that they were less confident in administering interaction-type SA than monologue-type SA, with 57.1% vs. 38.1% lacking confidence, respectively. He also reported that junior HS teachers lacked confidence in scoring interactions and monologues to a similar degree, with 61.9% and 66.7% lacking confidence, respectively. Kaneko (2019) reported that junior HS teachers tended to assign all or most SA-related work to foreign assistant language teachers, possibly due to a lack of confidence. In contrast, senior HS teachers tended to conduct SA themselves, possibly because assistant teachers visit senior HSs less often (MEXT, 2020b). Teachers' limited involvement in formative SA can be problematic, leading to insufficient integration of teaching objectives, instructional tasks, and test tasks.

Third, some teachers may struggle with insufficient knowledge and skills in SA, as well as in English itself, probably due to a lack of education and training. MEXT (2020b) reported that only 38.1% of junior HS teachers and 72.0% of senior HS teachers said that they had attained CEFR level B2 or higher. As mentioned above, teacher training for managing classroom SA is not yet sufficient at the pre-service and in-service stages (see Kasuya et al., 2021). Kaneko (2019) documented teachers' desire for SA examples to use as templates. Teachers' relatively low levels of English ability and language assessment literacy may be related to their limited confidence in managing classroom SA.

Contextual factors

In addition to teachers' personal factors, four contextual factors appear to affect teachers' plans and actual implementation of SA (based on Yan et al., 2021): school environment, student characteristics, external policy, and cultural norms.

First, the school environment, including working conditions and internal school support, may affect SA practices. Examples include short class time (i.e., 50 minutes maximum per lesson), large class sizes (i.e., up to 40 students), a less cooperative school culture, teacher shortages, long working hours, heavy workloads, and time constraints, which are not specific to L2 classroom SA (Katsuno, 2019). Lee (2010) reported similar issues in South Korea as factors that made teachers hesitant to conduct SA. A less desirable environment can lead to situations where only limited training, assistance (both mental and technical), resources, and materials are provided at school to help teachers conduct SA. Although technology can help teachers conduct SA efficiently, teachers struggle with a lack of devices and/or expertise in using them (MEXT, 2020b).

Second, student characteristics, including student resistance or anxiety, may be another reason for limited SA practices. Kellermeier (2010) documented that students hesitate to participate in SA due to a lack of motivation or confidence and a fear of failure, particularly in front of other students. Some teachers need to provide psychological care for students who lack speaking skills and are likely to skip speaking tests due to test anxiety (Kaneko, 2019).

Third and fourth, external policies and cultural norms are involved. In relation to external policy, NIER (2021) strongly encourages teachers to conduct summative and formative SA, which could be a positive move for SA administration. However, as mentioned previously, speaking skills are not typically tested in high-stakes entrance exams in Japan, so some teachers may feel that classroom SA is less important or unnecessary. Regarding cultural norms, SA practices may be affected by an examination-oriented culture that emphasizes summative over formative assessment, or by a Confucian heritage culture (Wicking, 2020).

As Yan et al. (2021) described, contextual factors are closely related to teachers' personal factors. These factors, along with the inherent difficulties of SA, may make teachers hesitant to conduct SA multiple times using multiple formats and to take the time to ensure reliability.

Future directions

The previous sections discussed the major issues of classroom SA in Japan (i.e., lack of continuous assessment, limited use of task formats, and reliability that is not ensured) and suggested the reasons behind them from three viewpoints: the inherent difficulties of classroom SA, teachers' personal factors, and contextual factors. Based on these issues, the author proposes two future directions to improve summative and formative SA in Japan: (a) conduct intensive teacher training and make resources available to teachers, and (b) explore a feasible and effective SA system for the Japanese context through discussion.

Regarding the first point (a), although inherent characteristics of classroom SA in Japan, such as low reliability and practicality, are difficult to resolve, teachers can explore ways to conduct SA that fit their own contexts, for example, by using simple and clear tasks and rubrics, ensuring time for rater training, sharing work with colleagues, and creating communities of practice at school and in the region (e.g., Knoch et al., 2021; Koizumi & Watanabe, 2021). However, to achieve this, teachers need to know the basic principles and practices of classroom SA through teacher training and self-study using available resources. The teacher factor of “education and training” has been reported to affect teachers' perceptions and actual implementation of assessment (Yan et al., 2021) and could also change contextual factors such as internal school support. Concerning the second point (b), discussions on classroom SA frameworks have been scarce in Japan. Such discussions could help shape “external policy” (a contextual factor) and lead to the modification of personal and other contextual factors.

These two directions are likely to advance summative and formative SA practices. Of the two, I would argue that the first should be prioritized because it is feasible within the current system and achievable in the short term once intensive efforts are made and resources are provided. The second direction is essential for making classroom SA more appropriate and sustainable in the long term.

Improving teacher training and resources

After the abovementioned Core Curriculum was established for pre-service and in-service teacher training, a model program for putting the Curriculum into practice was recently developed for pre-service teacher training courses (Kasuya et al., 2021). This program includes performance assessment as a fundamental component of L2 assessment, although this is only one of numerous topics for pre-service teachers to learn, alongside second language acquisition, English linguistics and literature, and intercultural understanding. Still, this is a promising advancement for pre-service teacher training. However, the program does not yet cover in-service training, and it is difficult to ascertain whether the important components of language assessment literacy are taught and acquired in teacher training programs. Given the difficulty of organizing classroom SA, the topic should be covered mandatorily and repeatedly in training programs. Moreover, to ensure effective teacher training with up-to-date information, experts specializing in L2 assessment, particularly classroom SA, are needed. They should introduce theoretical and practical issues and provide hands-on training involving test development, administration, evaluation, and feedback using SA, especially the ways in which each teacher can apply existing insights and resources in their own school environment. Opportunities for teachers to reflect on their own explicit and implicit views toward SA are also required, which may help teachers alter negative beliefs and attitudes. Training contents and methods can be improved by learning from previous studies (Gebril, 2021; Malone, 2017).

In addition, open resources must be created, particularly online platforms that compile existing information (Fulcher, 2020; Japan Language Testing Association [JLTA], n.d.) and newly constructed content. Guidelines, sample tasks, and sample performances should be added to enable teachers to learn how to conduct SA. One attractive website is the Tools to Enhance Assessment Literacy for Teachers of English as an Additional Language (TEAL; https://teal.global2.vic.edu.au/). According to Michell and Davison (2020), this system was created in collaboration with researchers and teachers in Victoria and New South Wales, Australia. TEAL uses assessment for learning (AfL) principles (Black & Wiliam, 1998) and a Vygotskian theory of learning, consisting of “(1) a set of sequenced teacher professional learning resources about EAL [English as an Additional Language] (including self-assessments) designed for small group or self-directed study; (2) an assessment tool bank containing a range of assessment tools and tasks …; (3) a range of AfL and teaching exemplars including a selection of annotated units of work across a range of subject areas and year levels showing assessment tasks with formative feedback embedded within a teaching/learning cycle; and (4) an online teacher discussion forum” (pp. 33–34). The website is organized so that teachers and first-time visitors can readily understand its importance and content. This would be helpful for teacher professional development, dialogue, and assessment quality maintenance (Brindley, 2001). Item (2) can include a task bank, or “a bank of fully-piloted exemplar assessment tasks with known measurement properties that teachers can use either for specific assessment in their own classrooms or as exemplars for writing their own tasks,” where the bank “will be continuously updated as new tasks are developed and piloted, using Rasch-calibrated tasks as ‘anchors’” (Brindley, 2001, p. 401). Although the creation of such task banks has also been advocated by Akiyama (2004) and Nakamura (2019), it has not been realized in Japan. Open resources with such task banks would be useful.

Discussing effective speaking assessment directions in Japan

Although a general assessment framework is provided, there are no fixed assessment procedures for classroom SA in Japan. Teachers can decide when and how SA is conducted, because the national guidelines are not legally binding. There are three possible directions for improving SA according to the level of standardization.

Maintaining the current, unstandardized system while improving administration and scoring

The first direction is to continue with the current system of not mandating strict procedures while clarifying four possible methods of administering and scoring classroom speaking tests: (a) in-class administration and in-class scoring, (b) in-class administration and out-of-class scoring, (c) out-of-class administration with scoring on the spot, and (d) out-of-class administration with scoring after the test (Koizumi, 2022). In (a), teachers can conduct a speaking activity in class as part of teaching speaking and assess students' performance at the same time; this is often observed in presentation activities. Another pattern of (a) is to conduct a speaking test for a student (or students) in a separate room while the other students are taught by another teacher or study by themselves. In (b), teachers record students' performance and listen to it after class. In (c) and (d), teachers conduct the test during lunch break or after school and score performance (c) on the spot or (d) after the test. All methods have advantages and disadvantages, and the most feasible style differs across school contexts. With respect to time outside class, (a) requires little, while (b) to (d) use a few hours outside lesson time. In terms of lesson time, (a) uses one to three lessons, (b) uses limited time, and (c) and (d) use none. Further, concerning collaboration with other teachers, (c) and (d) often require more understanding from colleagues, who may have other plans for breaks and after school. Additionally, (b) and (d) present difficulties in scoring from recorded utterances, especially recordings of student interactions in a noisy classroom.

Of the four methods, (a) seems suitable in many situations because it is practical timewise, with administration and scoring completed within a few lessons. This method is one possible way to make continuous SA feasible without overburdening teachers. Further, SA is worth spending lesson time on, given that it provides important learning opportunities, including the test activities themselves and reflection on speaking ability through self-, peer, and teacher assessment. Students also tend to value SA opportunities. For example, in response to a questionnaire about impressions of SA, a senior HS student who took 20 speaking tests over three years wrote that performing on a speaking test with a sense of tension was a greater learning experience than speaking on multiple occasions in normal lessons (Koizumi, 2022). Table 2 provides one example that used method (a), in-class administration and in-class scoring, at an upper secondary school. Teachers annually conducted three or four speaking tests per English course for about 700 students (Years 10–12). Teachers used approximately two 50-minute lessons per test; a student's performance was scored by one or two teachers; SA scores were used for summative grading (30% of the total grade); and score reports were provided as formative feedback. Koizumi et al. (in press) analyzed two speaking tests at this school, which used simple rubrics without intensive rater training, and reported no misfits for raters or rating criteria and high reliability across raters (Φ = .71 or higher for a single rater). This school had used method (a) for seven years. This continued good practice appears to indicate that method (a) is sustainable and suitable for conducting continuous SA formatively and summatively.

Table 2. Example of a year-long plan using an in-class administration and in-class scoring method

Adopting standardized procedures for school-based assessment

The second direction is more standardized. Globally, there are many school-based assessment procedures, some of which are stipulated by national or regional governments. There are two promising systems from which Japan may be able to learn, each of which is strongly based on AfL principles. The first is the school-based assessment (SBA) of L2 English speaking in Hong Kong. All English teachers are required to conduct SBA in secondary school years 4 to 6, and two scores from years 5 or 6 must be sent to the Hong Kong Examinations and Assessment Authority (HKEAA); the SBA scores account for 15% of the examination for university admission. The SBA uses two tasks (a group discussion and an individual presentation) and rubrics; teachers embed this SA in their teaching, providing formative feedback. According to the HKEAA (2019), school representatives join “SBA conferences and coordinator-teacher meetings” before SBA. Then, a principal selects school coordinators for English Language who “oversee the conduct of the SBA,” hold meetings to ensure that all relevant teachers understand the rubric and procedures, and “conduct a within-school meeting to review performance samples and standardise marks before the submission of marks to the HKEAA” (pp. 26–29). The HKEAA then adjusts SBA scores through across-school statistical moderation.

Another useful system is the pair or group interaction assessment (interact) in L2 classrooms in New Zealand (East, 2020). It is a high-stakes, year-end, school-based test used for summative and formative purposes. It is intended to assess extemporaneous, natural conversation and was introduced to replace converse, “a summative teacher-led interview test” (East, 2020, p. 224), which elicited prepared and memorized answers to questions from the students' own teacher. During interact, students talk extemporaneously in regular lessons with other students in the L2 about a topic. The teacher or students record good performances in audio/video format. Every year, students and/or teachers choose two or more recordings per student that show “the best evidence of [the students'] interactional proficiency” (East, 2020, p. 225). The teacher then evaluates the talks using a holistic rubric and assigns grades, and formative feedback is provided to students. To maintain comparability and consistency within and across schools, tasks are examined through within-school moderation, and sample recorded talks are examined through outside-school moderation. Supporting materials are available, such as sample tasks, evaluation criteria, and audio samples with grades and explanations (East, 2015).

The systems in these two countries are intended for both summative and formative uses and include not only assessment tasks, rubrics, and procedures but also embedded teacher training and moderation processes. They are viable systems that ensure the quality of education across regions. Although limitations have been reported, they appear to function appropriately (Chan & Davison, 2020; East, 2020; Leung et al., 2018). While scores from the two systems are explicitly included in high-stakes entrance examinations, it would be possible to adopt one of the systems without using the scores for such examinations. Even so, one concern may be that standardization decreases teachers' autonomy in teaching and assessment (Katsuno, 2019) and raises the stakes of classroom SA. However, in my opinion, these systems, which focus on AfL principles and involve teachers as active key players, are flexible enough to let teachers make local, creative, and autonomous decisions, for example, about how to organize lessons, select topics for discussion or presentation, and determine the timing of test tasks. The optimal balance between implementing standardization and securing autonomy may be a contentious issue that needs to be discussed carefully.

Adopting some aspects of standardization

The third direction is intermediate between the first two: the current Japanese system of flexibility and freedom for teachers is maintained, while some aspects of the fixed systems in Hong Kong and New Zealand are introduced. For example, it is possible to mandate the use of common standardized rubrics (Akiyama, 2004), tasks, and/or procedures. One example is to require the use of fixed rubrics and a specified timing for a speaking test (e.g., at the end of the final year at junior and senior HSs) while allowing teachers to determine other SA procedures. Mandating even one of these fixed components would greatly benefit teachers who are unfamiliar with formative and summative SA. To improve classroom SA in Japan, it would be helpful to discuss which direction(s) to adopt, considering the strengths and weaknesses of each and the summative and formative purposes of SA.

Conclusion

Although L2 SA in classrooms at secondary schools in Japan should be conducted regularly and adequately and used summatively and formatively, it is not well practiced, and various problems surround teacher-made, teacher-scored SA. The current review summarizes the main issues and challenges and presents future directions for improving practice: providing more intensive teacher training and resources and discussing an effective SA framework that fits the Japanese educational model. Classroom SA involves complicated practical, personal, and contextual issues but is worth developing because of its distinct advantages in L2 education.

While the current paper focuses on Japan, the issues, challenges, underlying factors, and possible future directions are relevant to other countries with similar educational situations, cultures, and orientations. For example, Ross (2008) mentioned that Asian countries share common assessment characteristics. Further, Wicking (2020) described some East Asian nations, including Japan, Hong Kong, China, Taiwan, Singapore, and South Korea, as sharing a Confucian heritage culture and common contextual factors affecting classroom SA. More research is needed in Japan and other countries to further understand the contextual factors behind SA practices, to improve classroom SA, and to address the problems with which teachers struggle. Such research should involve secondary school teachers and researchers, appraising theory and improving practices (Poehner & Inbar-Lourie, 2020). In Japan, teacher–researcher collaboration is emerging (Koizumi et al., in press; Nekoda, 2020; Shinshu English Project, 2020) but should be expanded. Such studies will inform not only Japan and other countries but also the language assessment field in general. They will be particularly useful in light of Fan and Yan's (2020) finding that studies on classroom-based or learning-oriented speaking assessment are limited.

Acknowledgments

I would like to thank Anthony J. Kunnan, Constant Leung, Yo In'nami, and three anonymous reviewers for their assistance.

Disclosure statement

No potential conflict of interest was reported by the author(s).


Funding

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Challenging Research (Pioneering), Grant Number 20K20421.

Notes

Note 1. While this result may be affected by the sampling method and test used (as suggested by one of the reviewers), no evidence shows that students in Japan have high speaking ability. For example, Educational Testing Service (2020b) reported that Japanese test takers of the Test of English as a Foreign Language Internet-based test (TOEFL iBT) scored 17 points on average on the speaking section, with a percentile rank of 20 (Educational Testing Service, 2020a). This suggests that the Japanese group has, on average, CEFR B1 level speaking ability and ranks near the bottom. While I acknowledge that the Japanese test takers in the survey may not be representative of the entire Japanese population, I argue that the survey can be used to infer a trend.

Note 2. The author wishes to thank one of the reviewers for suggesting this point.
