The effectiveness of multimedia input on vocabulary learning and retention

Pages 738-754 | Received 19 Apr 2022, Accepted 23 Sep 2022, Published online: 26 Oct 2022

ABSTRACT

Multimedia input can enhance vocabulary learning in the context of learning English as a foreign language (EFL). Drawing upon a mixed method, this study explores the potential of multimedia input in vocabulary learning. EFL vocabulary learning was assessed under four input conditions (definition + word information + video, definition + word information + audio, definition + word information, and definition-only). One hundred twenty-five Chinese-speaking university students were randomly allocated to the four conditions. The vocabulary knowledge test focused on receptive and productive vocabulary knowledge and served as a pretest, immediate posttest, and delayed test after 2 weeks. Participants also completed a survey related to their perceptions of the assigned learning mode. Five participants from each input condition completed an individual interview as well. Analyses of covariance supported the pronounced effects of the definition + word information + video condition on vocabulary learning and retention. The questionnaire and interview findings explained the value of multimedia input, particularly in the definition + word information + video condition. Overall, results highlight the importance of audiovisual input in vocabulary learning and retention. Relevant theoretical and pedagogical implications are discussed based on the findings.

Introduction

Vocabulary helps students think and learn; it serves as a foundation for other language-related skills, including reading, writing, speaking, and listening (Webb and Nation Citation2017). Multimedia input (e.g. text, images, audio, animation, and captions/subtitles) has been widely adopted to facilitate second language (L2)/foreign language teaching and learning. Teachers and researchers have sought to make vocabulary instruction more effective by incorporating glosses into authentic texts (Ramezanali and Faez Citation2019; Teng Citation2020), captioned videos (Lwo and Lin Citation2012; Teng Citation2019, Citation2022), L2 TV programs (Montero Perez Citation2020; Peters and Webb Citation2018; Puimège and Peters Citation2020), and multimedia input (Teng and Zhang Citation2021). Multimedia input is more beneficial than simplified texts in this regard—learners are afforded opportunities to use authentic texts without the need for extra time and effort. Despite the importance of multimodal input in vocabulary learning, vocabulary learning is still challenging for students in a foreign language context (Teng Citation2021). This challenge may be related to the nature of vocabulary knowledge: learning new words is an incremental process and requires sufficient target word exposure (Schmitt Citation2010). Access to multimedia input can strengthen target word exposure, but more remains to be understood about the role of this type of input in vocabulary learning.

The advent of multimedia-related computer technology has created opportunities for teachers and researchers to enhance foreign language learning (Zhang and Zou Citation2021), including vocabulary learning and teaching (Mohsen and Balakumar Citation2011; Yanagisawa, Webb, and Uchihara Citation2020). Vocabulary researchers have endeavored to innovate vocabulary teaching and learning by examining multimedia input, e.g. multimedia glosses (Mohsen and Balakumar Citation2011; Yanagisawa, Webb, and Uchihara Citation2020). Multimedia input, conceptualized in the present study as blending text, vocals, pictures, and videos for language learning, is an essential stimulus for different aspects of vocabulary knowledge. Multimedia input can help learners construct referential connections between two mental representation systems: the verbal and the visual (Ramezanali, Uchihara, and Faez Citation2021; Yoshii and Flaitz Citation2002). Mayer (Citation2001) argued that multimedia input can keep learners cognitively engaged, for example when selecting relevant material, organizing content into visual and/or verbal models, and integrating these new models with prior knowledge. Sustained attention can lower learners’ cognitive burden in terms of working memory. Additionally, multimedia can help learners manage their intrinsic load, optimize their germane load, and minimize their extraneous load to ensure maximum information storage for better learning (Mayer Citation2001).

Even though multimedia learning resources are continually updated to enhance vocabulary learning, research on cognitive burdens has raised doubts about the utility of presenting information in a multimedia format (Mayer and Moreno Citation1998). Multimedia-related studies have yet to come to uniform conclusions (Ramezanali, Uchihara, and Faez Citation2021). Research on video, audio, and textual input for vocabulary learning is especially inconclusive. The present study attempts to bridge this gap by examining the extent to which four multimedia input conditions (definition + word information + video, definition + word information + audio, definition + word information, and definition) influence vocabulary learning and retention in students learning English as a foreign language (EFL). Interview data were triangulated with quantitative data to provide further insight into learners’ perceptions and behavior under different conditions. Findings offer theoretical and pedagogical contributions to the domain of vocabulary instruction. The present study enriches the understanding of theories such as the cognitive theory of multimedia learning (Mayer Citation1997, Citation2001). Pedagogically, results provide insight into how the affordances of multimedia input can innovate vocabulary learning.

Literature review

Multimedia input in vocabulary learning: cognitive theory of multimedia learning

Mayer (Citation1997, Citation2001) proposed the cognitive theory of multimedia learning, framing multimedia materials as not simply meant for information delivery but as cognitive aids for knowledge construction. Mayer’s theory is based on three assumptions about how people process information: 1) the dual-channel assumption, 2) the limited-capacity assumption, and 3) the active-processing assumption. The dual-channel assumption states that ‘humans possess separate channels for processing visual and auditory information’ (Mayer Citation2009, 63). It includes the visual–pictorial channel for processing images captured by sight and the auditory–verbal channel for processing spoken words. This assumption reflects dual-coding theory (Paivio Citation1972, Citation1986, Citation1990). The limited-capacity assumption suggests that people have a limited capacity for processing and memorizing information at any given moment. People can only maintain limited visual and verbal information in working memory; metacognitive strategies are needed to manage cognitive resources more efficiently. The active-processing assumption asserts that people must engage in active cognitive processes to learn rather than passively absorbing information. Learners cannot simply be bombarded with information. Instead, they should receive input that can help them synthesize words and pictures into meaningful information for long-term memory storage.

This theory underscores the benefits of multimedia learning. The following three points can be interpreted as key features of the cognitive theory of multimedia learning. First, multimedia information (i.e. verbal and visual information transmitted via one’s ears and eyes in sensory memory) can be synthesized in one’s working memory and assembled into coherent verbal/pictorial models. The combination of verbal and pictorial models, together with learners’ prior knowledge, is stored in long-term memory. Second, when learners are presented with textual information (e.g. word definitions and example sentences), they use their eyes to receive this input and bring it into sensory memory. Learners next select input and send it to working memory. Relevant information is then constructed. Third, the processing of multimedia input (i.e. textual, pictorial, and auditory) is complex in that sounds and images should not be separated: verbal and visual information should interact. Prior knowledge can thus be activated to process information for long-term memory.

Use of multimedia in vocabulary learning: empirical research

Scholars have explored the potential of multimedia input (e.g. textual, auditory, and visual forms) on vocabulary learning (e.g. Chun and Plass Citation1996; Ramezanali and Faez Citation2019; Teng and Zhang Citation2021; Yoshii Citation2006; Yoshii and Flaitz Citation2002). Consistent with the cognitive theory of multimedia learning (Mayer Citation2001), integrating verbal and visual input is more beneficial for vocabulary learning than using either verbal or visual input alone. For instance, Chun and Plass (Citation1996) observed that learners earned significantly higher vocabulary scores for words annotated with text and pictures than for words glossed with text only or text + video. Yeh and Wang (Citation2003) showed that text + picture was the most effective type of annotation for vocabulary learning; combining audio input with text annotations and pictures was not as helpful as combining pictures and text annotations. In Al-Seghayer’s (Citation2001) within-participant research design, 30 ESL learners were exposed to three input modes via a hypermedia-learning program: 1) printed text definition alone, 2) printed text definition coupled with still pictures, and 3) printed text definition coupled with video clips. Learners’ vocabulary test concerned recognition and production. The means and percentages of correct answers for words with video and text, picture and text, and text alone were 4.3 (87%), 3.3 (67%), and 2.7 (53%), respectively. Ramezanali and Faez (Citation2019) explored three input conditions (L2 definition alone, L2 definition + audio glossing, and L2 definition + video animation glossing) on learners’ vocabulary learning and retention. Dual glossing modes (i.e. L2 definition and video animation glossing) were found to be more effective for vocabulary learning and retention than single glossing modes (L2 definition). 
Lwo and Lin (Citation2012) suggested that using videos with captions in a multimedia learning environment was preferable to not having captions for less proficient learners: about 93.8% of participants thought that animations were more beneficial than imagery. In a more recent study (Teng and Zhang Citation2021), the best vocabulary learning outcome corresponded to having video input to supplement word information and its definition, followed by word information + definition and finally only the definition. The pronounced effects in the group receiving a video for word information and its definition were tied to learners’ individual differences in working memory capacity.

Other studies have revealed different results. For instance, Çakmak and Erçetin (Citation2018) allocated 88 students with low English proficiency levels into four groups: control, textual gloss, pictorial gloss, and textual + pictorial gloss. The gloss conditions yielded significantly better results than the control condition. Comparisons of the three gloss conditions indicated no significant differences. The authors provided two explanations for the non-significant effects of multimedia gloss conditions: 1) learners’ low proficiency level and 2) learners might not have acclimated to the mobile environment. Boers et al. (Citation2017) explored two groups’ vocabulary learning. One group condition focused on glosses containing textual information (text-only glosses); the other group received the same glosses complemented by pictures to elucidate target words’ meaning (multimodal glosses). No significant differences in meaning recognition were observed between the groups. In terms of recalling word form, the condition with pictures led to lower vocabulary learning performance. The authors argued that adding pictures may reduce learners’ amount of attention to the target words. Another study (Yanguas Citation2009) also evaluated the effects of gloss type on vocabulary production and recognition. Regarding word recognition, the three gloss groups (textual, pictorial, and textual + pictorial) significantly outperformed the control group. However, significant differences were not detected among the three experimental groups. Multimedia annotations exerted no significant impacts on the production of target vocabulary items. Similarly, Akbulut (Citation2007) examined three groups’ incidental vocabulary learning (i.e. definition only, definition + picture, and definition + video). The video group performed best on form recognition, meaning recognition, and meaning production, followed by the picture and definition group. 
However, no significant differences emerged between the video and picture groups.

In summary, while most of the aforementioned studies favored dual-modality glosses (visual + textual) over single-modality glosses (text-only), single-mode input outperformed dual modes for word learning and retention in certain cases. Such inconsistencies regarding multimedia input and which input combinations are more effective in facilitating L2 vocabulary learning may be due to word- and learner-related factors. As reported in a meta-analysis (Ramezanali, Uchihara, and Faez Citation2021), many aspects (e.g. learners’ L2 proficiency, gloss language and type, and research design) could affect the results of vocabulary learning in multimedia environments. Any generalization about the significantly positive effects of multimedia input on vocabulary learning should be treated with caution. The above studies used various measures to assess vocabulary knowledge, and learners came from different backgrounds. In addition, the distinct research designs preclude universal determinations of which types of multimedia input are most beneficial for vocabulary learning in a foreign language context.

Learners’ perceptions of multimedia for vocabulary learning

Questionnaires and interviews have been adopted to examine learners’ perceptions of multimedia for vocabulary learning. In Ramezanali and Faez’s (Citation2019) research, learners in the dual glossing group (L2 definition and video animation glossing) expressed more positive attitudes than others. These learners believed in the potential of the dual mode. For example, this model could help them learn and retain new words, process new words in greater depth, and motivate vocabulary learning via an enjoyable process. Al-Seghayer (Citation2001) conducted interviews including closed-ended questions and three open-ended questions with 30 ESL learners. Interviews were intended to uncover learners’ perceptions of annotation techniques: a printed text definition alone, a printed text definition + still pictures, and a printed text definition + video clips. Interview data corresponded to quantitative data reflecting participants’ vocabulary test performance. Learners perceived words annotated with video to be most useful because the video helped them retrieve cues to recall new words, access more contextual support, and ponder target words’ meanings. In an early study (Chun and Plass Citation1996), learners used a multimedia program called CyberBuch which provided annotations via pictures, printed text, and video. Learners wrote a recall protocol after the treatment. Clear effects in vocabulary learning through the combination of verbal and visual modes emerged because learners could construct referential connections between verbal and visual input. However, data based on think-aloud protocols in Yanguas (Citation2009) indicated that learners in multimedia conditions, including textual, pictorial, and textual + pictorial, did not exhibit deeper processing of the target words. Learners were mainly interested in sentences’ general meaning while paying little attention to the glossed words. 
Akbulut (Citation2007) found that 48% of participants’ comments indicated that multimedia annotations (as videos and pictures) were beneficial for vocabulary learning. Most students reported being aware of improvements in their vocabulary knowledge between the pretest and posttest. They became more confident and successful on their vocabulary tests. Sixteen percent of statements indicated that visual annotations along with definitions made target words more memorable.

Overall, the literature on multimedia and vocabulary learning reports that learners hold positive attitudes toward multimedia conditions. These perceptions seem to stem from the availability of visual and verbal information. For example, multimedia input improves learners’ engagement and motivation, produces a coherent understanding of text without the need for conventional dictionary lookups, guides learners in learning vocabulary independently, and produces more opportunities to process materials or information versus traditional input. Such qualitative analysis may explain why words annotated with verbal and visual modes are generally learned and retained better than words presented in a single mode.

Rationale for the present study

Multimedia input is one of many ways to promote learners’ attention to information for learning new words. Individuals’ degree of attention to a new word may be a significant predictor of vocabulary acquisition and retention (Teng and Zhang Citation2021). In the present study, vocabulary learning is conceptualized as the process by which learners in a foreign language context deliberately learn new words, and vocabulary retention is conceptualized as the ability to recall or remember the newly learned words after an interval of time.

The effects of single, dual, and multimodal input on vocabulary learning and retention in a multimedia setting are not yet well understood. In the present study, single input types include text-only, audio-only, picture-only, and video-only input. Dual input presents information through a combination of text and pictures, text and audio, or text and video. Multimodal input can refer to short definitions or explanations in a range of input conditions. The present study does not aim to cover all types of multimedia input. Given the inconclusive findings regarding the effectiveness of multimedia input combinations, in particular audio and video input, in facilitating L2 vocabulary learning, the present study compares the effects of four input types on vocabulary learning among EFL students: definition only, definition + word information, definition + word information + audio, and definition + word information + video. The definition refers to the meaning of each target word, while word information refers to background information on each word, including its usage in a sentence and its origins. Qualitative findings about learners’ perceptions of their respective conditions and learners’ efforts to acquire and retain vocabulary were triangulated with quantitative results to clarify the impacts of multimedia input on vocabulary learning and retention. The present study extends what is known in the domain of isolated and deliberate teaching and learning of words through multimedia tools (e.g. Chun and Plass Citation1996; Ramezanali and Faez Citation2019). It contributes to a theoretical and practical understanding of vocabulary learning and retention through different input types in a multimedia setting. The following research questions guided this effort:

  1. To what extent do the four input conditions differ from each other in EFL learners’ immediate and delayed vocabulary learning and retention?

  2. What are EFL learners’ perceptions of their respective input conditions for vocabulary learning and retention? Why?

Method

Participants

All participants were undergraduate students majoring in English at a university in China. The sample comprised 125 first-year students from four intact classes. Their ages ranged from 18 to 20 years old. They spoke Chinese as their first language and were EFL students. Each class was randomly assigned to one of four input conditions: definition (n = 31), definition + word information (n = 31), definition + word information + audio (n = 32), and definition + word information + video (n = 31). The initial sample contained 135 participants, but 10 failed to complete all tests and were thus excluded from data analysis. Schmitt, Schmitt, and Clapham’s (Citation2001) vocabulary levels test (VLT) was used to control for homogeneity and proficiency differences between the four conditions (see the Results section). In terms of VLT scores, ANOVA results showed that there were no significant differences among the four groups, F(3, 124) = .732, p = .535.

Target words

Three teachers who had teaching experience with the participants chose target words from the ‘Word of the Day’ section of the Merriam-Webster Online Dictionary (available at https://www.merriam-webster.com/word-of-the-day). The teachers were also researchers in applied linguistics. Twenty-four difficult words were selected based on their opinions. The Merriam-Webster Dictionary is America’s most trusted online dictionary for English word definitions, meanings, and pronunciation. The ‘Word of the Day’ section highlights one word each day. There were two reasons for choosing the Merriam-Webster Online Dictionary. First, the mean VLT scores for the four groups ranged from 68.19 to 72.74 (Table 5), and the words selected from the Merriam-Webster Online Dictionary were all low-frequency words. The words were thus appropriate for assessing learners’ vocabulary learning performance after the intervention, as they were unknown to the learners. Second, the Merriam-Webster Online Dictionary offers word definitions and extra information about each word, including background stories, synonyms, antonyms, example sentences, and etymology, which made it convenient for research purposes. In addition, the ‘Word of the Day’ section includes an audio recording or a video explaining the word’s background, pronunciation, meaning, use, part of speech, and example sentences. However, because the target words were isolated low-frequency words, the main research purpose was explicit and intentional vocabulary teaching.

The 24 target words were analyzed through the VocabProfile section (http://www.lextutor.ca/vp/comp) of Compleat Lexical Tutor (www.lextutor.ca) to determine their frequency. All target words (Table 1) were beyond the K-10 level. In a pilot study with 30 learners from similar backgrounds, all participants reported the target words as being of low frequency and unknown to them. The pretest results confirmed that no participants knew any of the target words.

Table 1. The 24 target words.

Treatment

As mentioned above, the four classes were randomly allocated to four treatment groups as detailed in Table 2. Participants in Group 1 received word definitions only. The definitions were meant to help participants understand the target words’ basic meanings. Participants in Group 2 received the word definition and background information on each word, including its usage and origins. Participants in Group 3 received the word definition, background information, and an audio recording on the word’s meaning and use. Finally, participants in Group 4 received the word definition, background information, and a video on the word’s meaning and use. All information sources, including words’ definitions, background information, audio input, and video input, were displayed on a computer.

Table 2. The four conditions.

Measures

Vocabulary learning performance

Participants’ vocabulary learning performance was evaluated using a two-part vocabulary test covering productive and receptive vocabulary knowledge. The test first assessed productive vocabulary knowledge and then receptive vocabulary knowledge to prevent details in the receptive test from providing hints for the productive test. The vocabulary test served as a pretest, immediate posttest, and delayed posttest. For the productive test, participants were required to write down the target word based on its definition; for the receptive test, participants were instructed to choose the appropriate alternative among five options (the correct item, three distractors, and one ‘I don’t know’ option to discourage wild guessing) for each sentence. Table 3 shows an example from the vocabulary test. The Cronbach’s alpha (α) value for the test was .85, indicating sound reliability.

Table 3. An example of the vocabulary test: Bonhomie

On the receptive test, participants earned 1 point for a correct answer and 0 points for an incorrect answer. On the productive test, they earned 1 point for a correct answer, half a point for a partially correct answer, and 0 points for an incorrect answer. Partially correct answers containing minor spelling mistakes, such as ‘bonhonie’ for ‘bonhomie,’ were regarded as a sign of vocabulary learning growth. Two experienced teachers rated participants’ answers independently to minimize the risk of scoring bias. Interrater reliability for the vocabulary test was 98.7%. Disagreements were resolved via discussion.
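The scoring scheme above can be illustrated with a minimal Python sketch. In the study, partial credit was judged by two human raters; the edit-distance threshold used here (`max_typos`) is purely a hypothetical stand-in for that judgment, and the function names are illustrative rather than part of the study’s materials.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score_receptive(chosen: str, correct: str) -> float:
    """Receptive item: 1 point if the chosen option is correct, else 0."""
    return 1.0 if chosen == correct else 0.0


def score_productive(answer: str, target: str, max_typos: int = 1) -> float:
    """Productive item: 1 for an exact match, 0.5 for a near miss, else 0.

    ASSUMPTION: a "minor spelling mistake" is approximated as an edit
    distance of at most `max_typos`; the study relied on rater judgment.
    """
    answer, target = answer.strip().lower(), target.strip().lower()
    if answer == target:
        return 1.0
    if answer and edit_distance(answer, target) <= max_typos:
        return 0.5
    return 0.0


print(score_productive("bonhonie", "bonhomie"))  # 0.5
```

With 24 target words, the maximum score on each part of the test is 24 under this scheme.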

Perceptions of vocabulary learning from multimedia

A Chinese-language questionnaire was used to measure participants’ perceptions of vocabulary learning from multimedia (see the Results section). The questionnaire included five closed-ended questions focusing on participants’ perceptions of vocabulary learning and retention through different modes and their possible future practice. Participants indicated the degree to which they agreed with each statement by dragging a bar to the corresponding point (0–100 points). A score of 0 points indicated absolute disagreement, while a score of 100 indicated absolute agreement. The reliability of the questionnaire, based on Cronbach’s alpha coefficient (α = .86), was high.
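For readers unfamiliar with the reliability coefficient reported above, the following is a minimal sketch of the standard Cronbach’s alpha formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the response data are hypothetical; the study itself computed reliability in SPSS.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of k item scores per participant)."""
    k = len(responses[0])
    # Sample variance of each item across participants
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of participants' total scores
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# HYPOTHETICAL data: three participants answering five 0-100 slider items
ratings = [
    [80, 75, 90, 85, 70],
    [40, 50, 45, 35, 55],
    [95, 90, 85, 100, 90],
]
print(round(cronbach_alpha(ratings), 2))
```

Higher values indicate that the items vary together across participants, which is the sense in which the questionnaire’s reliability is described as high.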

Interviews were also held in Chinese to gain a deeper understanding of why participants preferred (or did not prefer) their input modes. Five participants from each group were randomly invited to attend an interview. Each interview lasted approximately 10 min. Interview questions were designed to help participants reflect on their learning practice in their respective input mode (e.g. ‘What did you like most about vocabulary learning in the assigned learning mode?’; ‘What challenged you most during the vocabulary learning process in the assigned learning setting?’; ‘Do you think the vocabulary learning mode is effective or not? Why?’).

Procedure

Prior to the study, ethical approval was granted by the university’s research committee. Participants also signed a consent form. They were assured anonymity and confidentiality and were told that they had the right to withdraw at any time if they felt uncomfortable. The research procedure for the whole study is summarized in Table 4. In the first week, participants completed a pretest and the VLT. Participants completed the treatment session in the fifth week. The immediate posttest was administered directly after this session. Two weeks later, participants finished the delayed posttest as well as the questionnaire and interviews. Test items were largely the same for the pretest, immediate posttest, and delayed posttest, with slight differences (e.g. we added different sets of non-target words). The sequence of test items also varied by test to minimize test–retest learning effects. Participants’ vocabulary learning performance was expected to be mainly attributable to the instructional sessions.

Table 4. Research procedures.

The entire study was conducted in a computer classroom. None of the participants withdrew. Each participant received a supermarket coupon after completing all research requirements.

Data analysis

One-way analyses of covariance (ANCOVAs) were performed in SPSS Version 26 to test the impacts of treatment conditions on vocabulary learning performance on the immediate and delayed tests (Pallant Citation2011). Learners’ VLT scores were controlled as a covariate because they correlated significantly with scores on the immediate receptive vocabulary knowledge test (r = .288), the immediate productive vocabulary knowledge test (r = .313), the delayed receptive vocabulary knowledge test (r = .297), and the delayed productive vocabulary knowledge test (r = .317). Bonferroni post hoc comparisons were used to compare differences across input modes on vocabulary learning and retention. Statistical test assumptions were met. The significance level was set at α = .05.

We converted the questionnaire findings into percentages to better understand the data. Interviews were conducted, recorded, and transcribed by the author. The participants checked the transcripts to ensure that statements reflected what the participants expressed. The author coded and extracted themes that recursively emerged in the data. To ensure reliable analysis of interview data, the above-mentioned three teachers were invited to check whether coded themes reflected the findings. The most frequent themes were selected to capture this study’s focus.

Results

The first research question explored the extent to which input modes contributed to participants’ vocabulary learning and retention. Descriptive statistics are presented first in Table 5.

Table 5. Descriptive statistics.

Table 5 reports participants’ average performance (M), variation in performance (SD), and the sample size (N) for each part of the vocabulary test and the VLT. Learners in the definition + word information + video group performed best in terms of their immediate test scores for receptive vocabulary knowledge (M = 19.06, SD = 2.57), immediate test scores for productive vocabulary knowledge (M = 14.58, SD = 2.09), delayed test scores for receptive vocabulary knowledge (M = 16.55, SD = 2.25), and delayed test scores for productive vocabulary knowledge (M = 12.26, SD = 2.05). Slight variations emerged for the VLT across the four groups.

An ANCOVA was run to examine whether vocabulary test scores differed between the four conditions while controlling for participants’ vocabulary proficiency level. Preliminary checks were completed to assess the assumptions of normality, linearity, homogeneity of regression slopes, and homogeneity of variance. A Shapiro–Wilk test indicated that scores on the immediate test of receptive vocabulary knowledge were not normally distributed in the definition group [W(31) = .85, p < .05], the definition + word information group [W(31) = .72, p < .001], the definition + word information + audio group [W(32) = .85, p < .001], and the definition + word information + video group [W(31) = .90, p < .05]. Test scores on the immediate test of productive vocabulary knowledge, delayed test of receptive vocabulary knowledge, and delayed test of productive vocabulary knowledge were also not normally distributed. However, because the distribution was close to normal and ANCOVAs are robust to this violation (Pallant Citation2011), non-normality did not need to be addressed. A scatterplot indicated that the relationship between vocabulary knowledge test scores and the VLT was linear in the four conditions. That is, vocabulary knowledge test scores increased at approximately the same rate as the VLT changed by one unit. The scatterplot also suggested that the regression slopes were similar, and an F test revealed no interaction effect between the VLT and condition for immediate test scores for receptive vocabulary knowledge [F(3, 117) = 1.355, p = .26], delayed test scores for receptive vocabulary knowledge [F(3, 117) = 1.039, p = .23], immediate test scores for productive vocabulary knowledge [F(3, 117) = .462, p = .229], and delayed test scores for productive vocabulary knowledge [F(3, 117) = 1.447, p = .236]. 
Levene’s test indicated that the assumption of homogeneity of variance was not violated for immediate test scores on receptive vocabulary knowledge [F(3, 121) = 2.054, p = .110], delayed test scores on receptive vocabulary knowledge [F(3, 121) = 2.168, p = .095], immediate test scores on productive vocabulary knowledge [F(3, 121) = .966, p = .411], and delayed test scores on productive vocabulary knowledge [F(3, 121) = 2.352, p = .076]. After controlling for the VLT, a significant effect of condition was observed for immediate test scores regarding receptive vocabulary knowledge [F(3, 120) = 321.442, p < .001, ηp² = .889], delayed test scores regarding receptive vocabulary knowledge [F(3, 120) = 83.634, p < .001, ηp² = .676], immediate test scores regarding productive vocabulary knowledge [F(3, 120) = 277.057, p < .001, ηp² = .874], and delayed test scores regarding productive vocabulary knowledge [F(3, 120) = 225.942, p < .001, ηp² = .85]. The VLT was significantly related to immediate test scores on receptive vocabulary knowledge [F(1, 120) = 36.669, p < .001, ηp² = .234], delayed test scores on receptive vocabulary knowledge [F(1, 120) = 45.370, p < .001, ηp² = .274], immediate test scores on productive vocabulary knowledge [F(1, 120) = 17.039, p < .001, ηp² = .124], and delayed test scores on productive vocabulary knowledge [F(1, 120) = 15.063, p < .001, ηp² = .112].
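
The partial eta squared values above can be recovered from the reported F statistics and degrees of freedom through the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies that identity to the values reported in this paragraph; the function name `partial_eta_squared` and the list layout are illustrative assumptions, and the sketch is a consistency check rather than a reproduction of the original analysis.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, reported partial eta squared), copied from the text above.
reported = [
    ("condition, immediate receptive", 321.442, 3, 120, 0.889),
    ("condition, delayed receptive", 83.634, 3, 120, 0.676),
    ("condition, immediate productive", 277.057, 3, 120, 0.874),
    ("condition, delayed productive", 225.942, 3, 120, 0.85),
    ("VLT, immediate receptive", 36.669, 1, 120, 0.234),
    ("VLT, delayed receptive", 45.370, 1, 120, 0.274),
    ("VLT, immediate productive", 17.039, 1, 120, 0.124),
    ("VLT, delayed productive", 15.063, 1, 120, 0.112),
]

for label, f_value, df_effect, df_error, eta_reported in reported:
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: computed {eta:.3f}, reported {eta_reported}")
```

The error degrees of freedom of 120 are also what a one-way ANCOVA with 125 participants, four conditions, and one covariate would yield (125 − 4 − 1).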

A Bonferroni post hoc test of the immediate posttest revealed that the definition + word information + video group earned significantly higher scores than the definition + word information + audio group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001). The definition + word information + audio group earned significantly higher scores than the definition + word information group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001). The definition + word information group earned significantly higher scores than the definition group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001). In terms of the delayed posttest, the definition + word information + video group earned significantly higher scores than the definition + word information + audio group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001). The definition + word information + audio group earned significantly higher scores than the definition + word information group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001). The definition + word information group earned significantly higher scores than the definition group on receptive vocabulary knowledge (p < .001) and productive vocabulary knowledge (p < .001).
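
For readers less familiar with the Bonferroni procedure, the Python sketch below enumerates the pairwise comparisons possible among the four conditions and computes the per-comparison significance threshold. The overall alpha of .05 and the function name `bonferroni_threshold` are assumptions made for illustration; the text does not state the alpha level or whether the statistical software adjusted the threshold or the p values.

```python
from itertools import combinations

conditions = [
    "definition",
    "definition + word information",
    "definition + word information + audio",
    "definition + word information + video",
]


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Every unordered pair of conditions is one post hoc comparison.
pairs = list(combinations(conditions, 2))
threshold = bonferroni_threshold(0.05, len(pairs))  # alpha = .05 is an assumption

print(f"{len(pairs)} pairwise comparisons, per-comparison threshold = {threshold:.4f}")
for first, second in pairs:
    print(f"  {first}  vs  {second}")
```

Under these assumptions, the reported values of p < .001 fall below the corrected threshold for each comparison described above.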

The second research question concerned participants’ perceptions of vocabulary learning under their respective input conditions and the reasons for their opinions. We first address the questionnaire findings (Table 6).

Table 6. Results on the questionnaire.

Table 6 shows the results for five closed-ended statements. Participants were asked to rate their agreement on whether each gloss mode facilitated vocabulary learning, retention, emotion, and future practice. As indicated, participants perceived the definition + word information + video mode to be a simple way to learn vocabulary (Q1, M = 65.15, SD = 15.12), followed by the definition + word information + audio mode (M = 58.42, SD = 13.31), the definition + word information mode (M = 53.43, SD = 11.13), and finally the definition mode (M = 48.21, SD = 11.32). In addition, compared with the other three groups, participants perceived the definition + word information + video mode to be helpful for learning vocabulary (Q2, M = 61.65, SD = 14.18) and for retaining vocabulary (Q3, M = 61.62, SD = 13.16). Participants enjoyed vocabulary acquisition most under the definition + word information + video mode (Q4, M = 68.19, SD = 11.13), and they expressed a higher likelihood of using this mode in the future versus other modes (Q5, M = 66.31, SD = 15.12).

Interviews complemented the questionnaire data. Overall, all five participants (100%) expressed positive attitudes toward the definition + word information + video mode. One interviewee stated that watching video clips helped him learn and memorize new words because the clips provided ‘contextual clues to understand a new word.’ He also said he could perform better on the immediate and delayed posttests. Another interviewee explained she enjoyed the video for understanding word information because the video featured ‘cultural knowledge.’ This mode enabled her to synthesize relevant background information and better understand the word’s meaning. An interviewee recounted the benefits of a video because the images allowed him to build a mental image to grasp new words. Another interviewee regarded vocabulary learning via the definition + word information + video mode as pleasant: he was motivated to learn and retain target words, and he became ‘more confident’ and ‘more engaged’ in vocabulary acquisition. One interviewee shared that this mode helped her learn the correct pronunciation of new words. She also decided to adopt the mode for future vocabulary learning.

Three out of five participants (60%) favored the definition + word information + audio mode. Reasons given for not favoring this mode included ‘low proficiency level,’ ‘difficulties in discerning the meaning from the audio recordings,’ ‘speaking speed is too fast in the audio recording,’ and ‘increased cognitive load for understanding the audio recording.’ Two out of five participants (40%) expressed a positive attitude toward the definition + word information mode. Advantages included ‘understanding background information surrounding the target words,’ ‘understanding meanings of the target words,’ and ‘retaining target words more effectively.’ Only one participant (20%) was positive toward the definition mode. The benefit cited was ‘the aid in understanding the word meanings.’ In sum, a higher percentage of participants in the definition + word information + video group reported that they enjoyed vocabulary learning practice because they could learn new words in greater depth and comprehend different aspects of vocabulary knowledge. The qualitative findings explained why this mode is effective for vocabulary learning and retention.

Discussion and conclusion

The present study examined the potential of four multimedia input conditions for vocabulary learning and retention. Results revealed that 1) the definition + word information + video condition yielded the best vocabulary learning gains and retention and 2) participants exposed to the definition + word information + video condition reported positive attitudes toward dual multimedia modes. The following sections consider the findings in light of previous research and relevant theories.

Roles of multimedia input in vocabulary learning and retention

The first research question explored how individuals learn and retain vocabulary from multimedia input. Findings showed that combining the definitions of target words, word information, and associated visual input (i.e. video) was more effective than combining the definition, word information, and audio input. Combining the word’s definition and word-related information was less useful for vocabulary learning and retention. Providing the definition alone was least effective. Audio input seemed to act as a ‘redundant multimedia presentation’ (Mayer and Fiorella 2014, 287), in that multimedia learning might not be effective when verbal or visual channels are overloaded with information in the same category. This explanation could have applied when learners in the present study processed two verbal modes (textual and aural information) without receiving visual input, resulting in single-channel overload. Results also highlight the role of the dual multimedia mode (i.e. verbal + visual information) in vocabulary learning. This outcome echoes previous studies (e.g. Ramezanali and Faez 2019; Teng and Zhang 2021; Yoshii and Flaitz 2002) and was unsurprising: blending multiple forms of input is intended to direct learners’ attention to unknown vocabulary items. One noteworthy finding from this work concerns integrating video to enable learners to receive verbal input via the ears and eyes; more compelling vocabulary learning can follow. A multimedia mode including video may be especially helpful because videos offer retrieval cues to learn target words. Video-provided information may then enhance word comprehension, reinforcing learners’ receptive and productive vocabulary knowledge gains.
Some researchers may wonder why the video mode is not subject to redundancy and an extraneous information load for learners: videos can stimulate learners’ curiosity to follow the presented material and to build a form–meaning link between a word and its meaning, which is a key step in meaningful vocabulary acquisition (Teng 2021). Conversely, a word’s definition alone may not be adequate to activate prior knowledge. A core issue regarding vocabulary retention in this study entailed participants’ scores on delayed vocabulary tests. These scores were lower than those on the immediate posttest after treatment. That is, word attrition occurred in the four conditions. Yet this negative effect was counterbalanced by the fact that participants in the definition + word information + video condition displayed significantly more vocabulary knowledge than participants in other conditions. Participants in the definition + word information + video group may have sought to make lexical connections, signaling deeper processing of words and potentially longer-lasting learning. This phenomenon complements existing studies (Ramezanali and Faez 2019) by demonstrating that visual and textual information (by combining multimedia input in the definition + word information + video group) facilitated vocabulary retention. This result is intriguing and contrasts with findings about receptive and productive vocabulary knowledge, in which no differences were observed between experimental groups (Yanguas 2009). Combining multimedia input thus appears to significantly influence vocabulary retention.

As noted, this study highlights the advantage of a dual multimedia mode over a single mode. However, results are tentative. The research design, including instruction level, text type, and measurement, can affect participants’ learning outcomes (Abraham 2008). The advantage of the dual multimedia mode over the single mode in this research is not consistent with all earlier work (Boers et al. 2017; Çakmak and Erçetin 2018; Yanguas 2009). Blending pictures with textual information has not been deemed more effective than glosses containing only verbal explanations (Boers et al. 2017); adding pictures may not shift learners’ attention to target words, but this possibility was not explored in the present study. Chun and Plass (1996) argued that definitions combined with pictures were more effective in promoting vocabulary acquisition. Al-Seghayer (2001) found that definitions combined with videos yielded better vocabulary learning. Akbulut (2007) found no statistically significant difference between videos and pictures, indicating that videos + definitions could be as effective as pictures + definitions for vocabulary learning. In the present study, integrating video input led to the best performance in vocabulary learning and retention. Learners may be more engaged when using videos as multimedia input, given an anticipated substantial positive relationship between access to video and the number of recognized target words (Teng 2022). Schmitt’s (2008) notion of engagement supports this finding: ‘the more a learner engages with a new word, the more likely they are to learn it’ (338). Using a video to convey a word’s definition and related information could therefore facilitate learners’ awareness of words, ultimately fostering receptive and productive vocabulary knowledge.
The noticing hypothesis (Schmidt 2001) takes attention and noticing as prerequisites for input to become intake. Given that videos draw learners’ attention to target words’ meanings and background information, word comprehension may be of interest along with word form. The elaboration and definitional support accompanying the definition + word information + video condition could facilitate perceived input (i.e. choosing appropriate verbal and pictorial information to synthesize linguistic information), which is essential for noticing and learning new words. Notably, video coupled with a word’s definition and information helped learners retain information in this study: participants had opportunities to receive the same information more than twice. This modality effect may have encouraged information processing, reduced cognitive load, and increased working memory capacity for target vocabulary knowledge.

Perceptions of vocabulary learning from multimedia input

The second research question concerned participants’ perceptions of their input condition. Questionnaire results showed that the definition + word information + video mode received the highest mean rating as a simple way to acquire vocabulary (M = 65.15). Consistent with Akbulut (2007), participants identified multimedia annotations (in video form) as especially useful for vocabulary learning. The qualitative results corresponded to vocabulary learning performance. Participants’ comments in interviews were in line with Ramezanali and Faez’s (2019) work, in which the textual definition and video mode were found to motivate vocabulary learning and retention. Al-Seghayer (2001) also reported that videos can provide contextual support for understanding target words. In the present study, participants cited several benefits: cultural knowledge for word comprehension, building mental images for vocabulary learning, greater depth and attention to target words, and confidence and motivation related to vocabulary learning. Consistent with Yanguas (2009), combining input channels helped learners better comprehend the text. Multimedia input can support individuals in learning target vocabulary items because the input can aid text comprehension.

Additionally, participants’ engagement with target words was higher when watching the video because they were eager to know what would happen in the next video segment. Learners’ stronger attention or concentration could have made the presentation of target words (via a video) more enjoyable and a better retrieval cue for word recognition and production. The video’s contextual richness and cultural authenticity were also important in rendering information for vocabulary learning more memorable. Participants’ positive attitudes may explain why words annotated both visually and textually led to higher scores than other modes. The quantitative and qualitative data collectively suggest the value of multimedia learning and of dual coding. Vocabulary learning and retention should occur through textual and visual modes rather than textual modes only.

Connecting the effectiveness of and perceptions about multimedia input in vocabulary learning and retention

Overall, the present study explored the efficacy of different combination types of multimedia input for learning and retaining unknown lexical items. The quantitative results suggested that a video clip in combination with a text definition and background information is more effective in learning unknown vocabulary than an audio recording in combination with a text definition and background information. That is, students could learn and recall more words when video clips were provided. The variety of modality cues in a video can reinforce each other, and the different elements are linked together in meaningful ways to provide an in-depth vocabulary learning experience. Possible factors that may explain these results were detected in the qualitative analysis. For example, video better helps learners build a mental image for contextual clues and cultural knowledge, curiosity increases engagement, and video’s combination of modalities (dynamic image and sound) facilitates recall of new words.

The positive findings regarding the effectiveness of multimedia input and learners’ positive attitudes about multimedia input in vocabulary learning and retention reflect Mayer’s (1997, 2001) cognitive theory of multimedia learning. Mayer (2001) claimed that transfer and retention enable multimedia learning. Transfer involves using material from multimedia input to solve problems. Retention refers to one’s ability to preserve important information for learning. Evidence from this study indicated that transfer occurred based on the pronounced effects of definition + word information + video on receptive and productive vocabulary knowledge acquisition. Retention was apparent because the combination of definitions, word information, and videos helped learners hold information to build a connection between form and meaning, leading to better vocabulary retention (Teng 2021). Multiple types of input, rather than a single type, therefore promoted vocabulary learning. A useful point borrowed from the cognitive theory of multimedia learning is that the dual presentation of verbal and visual information can capture learners’ attention and help them construct mental images depicting linkages or providing a gestalt. The dual channels of verbal and visual input may hence allow for the establishment of a more enduring mental model of processing key information. This model can then lessen cognitive constraints in comprehending information and boost potential short-term recall of vocabulary knowledge.

Limitations and implications

This study is not without limitations. First, individual differences (e.g. in learners’ English proficiency level, working memory, and language aptitude) were not examined. Such variations may influence vocabulary learning performance through multimodal input, as Teng and Zhang (2021) reported. Second, aspects of vocabulary knowledge apart from receptive and productive knowledge (e.g. form, meaning, use, word association, word parts, and collocations) should be considered in the future. A holistic view of vocabulary knowledge may provide a more vivid picture of vocabulary acquisition in foreign language contexts. Third, participants in the treatment groups might have been exposed to the target words many more times than the participants in the control group. However, word exposure frequency was not taken into account. Word-related factors, including frequency (Teng 2020) and even word relevance (Peters and Webb 2018), should be examined to identify random effects through mixed effects modeling. Finally, more word forms, including nouns, verbs, and adjectives, should be used as target words to clarify the effects of multimedia input on vocabulary learning.

Despite these limitations, the findings provide theoretical and pedagogical implications for vocabulary instruction through multimedia material. The cognitive theory of multimedia learning (Mayer 1997, 2001) was supported: multimedia, in the dual mode of visual and verbal input, can provide retrieval cues for word learning. The design of multimedia instruction can stimulate learners’ cognitive processes. The visual and verbal information processing system enables people to build visual and verbal cues and then retrieve relevant stored information from memory to activate knowledge for learning. The cognitive theory of multimedia learning can be reconsidered on this basis. Moreover, multimedia technologies allow for a simpler presentation of authentic input. Multimodal input helps learners process real language while accelerating their information access. As the qualitative findings indicate, learners can gain more confidence and control over their learning processes and acquire vocabulary at their own pace. Lastly, the use of multiple modalities (e.g. word definition, information + video) can aid vocabulary instruction. Word definitions or information are common in such teaching; videos should be added given their apparent helpfulness. Rather than solely serving as entertainment, videos can reinforce learners’ mental imagery to promote word comprehension and retention (Teng 2021). Varied modalities of multimedia input provide a language learning environment that can tangibly affect vocabulary acquisition. EFL students should be guided in becoming independent through specific multimedia modes when incorporating videos. Familiarizing EFL learners with multimodal input hence shows promise.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Mark Feng Teng

Mark Feng Teng is a language teacher educator in China. His main research interests include L2 vocabulary acquisition and L2 writing. His latest publications appeared in TESOL Quarterly, System, Applied Linguistics, Language Teaching Research, Computer Assisted Language Learning, Computers & Education, and other international mainstream journals. His recent monographs were published by Springer, Routledge, and Bloomsbury. He has edited or co-edited special issues for Asian EFL Journal, TESOL Journal, SSLLT, and Journal of Writing Research.

References

  • Abraham, L. B. 2008. “Computer-mediated Glosses in Second Language Reading Comprehension and Vocabulary Learning: A Meta-Analysis.” Computer Assisted Language Learning 21 (3): 199–226.
  • Akbulut, Y. 2007. “Effects of Multimedia Annotations on Incidental Vocabulary Learning and Reading Comprehension of Advanced Learners of English as a Foreign Language.” Instructional Science 35 (6): 499–517.
  • Al-Seghayer, K. 2001. “The Effect of Multimedia Annotation Modes on L2 Vocabulary Acquisition: A Comparative Study.” Language Learning & Technology 5 (1): 202–232.
  • Boers, F., P. Warren, L. He, and J. Deconinck. 2017. “Does Adding Pictures to Glosses Enhance Vocabulary Uptake from Reading?” System 66: 113–129.
  • Çakmak, F., and G. Erçetin. 2018. “Effects of Gloss Type on Text Recall and Incidental Vocabulary Learning in Mobile-Assisted L2 Listening.” ReCALL 30 (1): 24–47. doi:10.1017/S0958344017000155.
  • Chun, D. M., and J. L. Plass. 1996. “Effects of Multimedia Annotations on Vocabulary Acquisition.” The Modern Language Journal 80 (2): 183–198.
  • Lwo, L., and M. C. T. Lin. 2012. “The Effects of Captions in Teenagers’ Multimedia L2 Learning.” ReCALL 24 (2): 188–208.
  • Mayer, R. E. 1997. “Multimedia Learning: Are we Asking the Right Questions?” Educational Psychologist 32: 10–19.
  • Mayer, R. E. 2001. Multimedia Learning. New York: Cambridge University Press.
  • Mayer, R. E. 2009. Multimedia Learning (2nd ed.). Cambridge: Cambridge University Press.
  • Mayer, R. E., and L. Fiorella. 2014. “Principles of Reducing Extraneous Processing in Multimedia Learning: Coherence, Signaling, Redundancy, Spatial Contiguity, and Temporal Contiguity Principles.” In The Cambridge Handbook of Multimedia Learning, edited by R. E. Mayer, 279–315. New York, NY: Cambridge University Press.
  • Mayer, R. E., and R. Moreno. 1998. “A Split-Attention Effect in Multimedia Learning: Evidence for Dual Processing Systems in Working Memory.” Journal of Educational Psychology 90 (2): 312–320.
  • Mohsen, M. A., and M. Balakumar. 2011. “A Review of Multimedia Glosses and Their Effects on L2 Vocabulary Acquisition in CALL Literature.” ReCALL 23 (2): 135–159.
  • Montero Perez, M. 2020. “Incidental Vocabulary Learning through Viewing Video: The Role of Vocabulary Knowledge and Working Memory.” Studies in Second Language Acquisition 42 (4): 749–773.
  • Paivio, A. 1972. Imagery and Verbal Processes. New York: Holt, Rinehart & Winston.
  • Paivio, A. 1986. Mental Representations. New York: Oxford University Press.
  • Paivio, A. 1990. Mental Representation: A Dual-Coding Approach. Oxford: Oxford University Press.
  • Pallant, J. 2011. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using the SPSS Program. 4th ed. Berkshire: Allen & Unwin.
  • Peters, E., and S. Webb. 2018. “Incidental Vocabulary Acquisition Through Viewing L2 Television and Factors That Affect Learning.” Studies in Second Language Acquisition 40: 551–577.
  • Puimège, E., and E. Peters. 2020. “Learning Formulaic Sequences Through Viewing L2 Television and Factors That Affect Learning.” Studies in Second Language Acquisition 42 (3): 525–549.
  • Ramezanali, N., and F. Faez. 2019. “Vocabulary Learning and Retention Through Multimedia Glossing.” Language Learning & Technology 23 (2): 105–124. doi:10125/44685.
  • Ramezanali, N., T. Uchihara, and F. Faez. 2021. “Efficacy of Multimodal Glossing on Second Language Vocabulary Learning: A Meta-Analysis.” TESOL Quarterly 55: 105–133. doi:10.1002/tesq.579.
  • Schmidt, R. 2001. “Attention.” In Cognition and Second Language Instruction, edited by P. Robinson, 3–32. Cambridge: Cambridge University Press.
  • Schmitt, N. 2008. “Instructed Second Language Vocabulary Learning.” Language Teaching Research 12 (3): 329–363.
  • Schmitt, N. 2010. Researching Vocabulary: A Vocabulary Research Manual. New York: Palgrave Macmillan.
  • Schmitt, N., D. Schmitt, and C. Clapham. 2001. “Developing and Exploring the Behaviour of Two New Versions of the Vocabulary Levels Test.” Language Testing 18: 55–88.
  • Teng, F. 2019. “The Effects of Video Caption Types and Advance Organizers on Incidental L2 Collocation Learning.” Computers & Education 142: 103655. doi:10.1016/j.compedu.2019.103655.
  • Teng, F. 2020. “Retention of new Words Learned Incidentally from Reading: Word Exposure Frequency, L1 Marginal Glosses, and Their Combination.” Language Teaching Research 24 (6): 785–812.
  • Teng, F. 2021. Language Learning Through Captioned Videos: Incidental EFL Vocabulary Acquisition. New York: Routledge.
  • Teng, F. 2022. “Incidental L2 Vocabulary Learning from Viewing Captioned Videos: Effects of Learner-Related Factors.” System 105: 102736. doi:10.1016/j.system.2022.102736.
  • Teng, F., and D. Zhang. 2021. “The Associations Between Working Memory and the Effects of Multimedia Input on L2 Vocabulary Learning.” International Review of Applied Linguistics in Language Teaching. doi:10.1515/iral-2021-0130.
  • Webb, S., and P. Nation. 2017. How Vocabulary is Learned. Oxford: Oxford University Press.
  • Yanagisawa, A., S. Webb, and T. Uchihara. 2020. “How do Different Forms of Glossing Contribute to L2 Vocabulary Learning from Reading?” Studies in Second Language Acquisition 42 (2): 411–438.
  • Yanguas, I. 2009. “Multimedia Glosses and Their Effect on L2 Text Comprehension and Vocabulary Learning.” Language Learning & Technology 13 (2): 48–67.
  • Yeh, Y., and C. W. Wang. 2003. “Effects of Multimedia Vocabulary Annotations and Learning Styles on Vocabulary Learning.” CALICO Journal 21: 131–144.
  • Yoshii, M. 2006. “L1 and L2 Glosses: Their Effects on Incidental Vocabulary Learning.” Language Learning & Technology 10 (3): 85–101.
  • Yoshii, M., and J. Flaitz. 2002. “Second Language Incidental Vocabulary Retention: The Effect of Text and Picture Annotation Types.” CALICO Journal 20 (1): 33–58.
  • Zhang, R., and D. Zou. 2021. “A State-of-the-art Review of the Modes and Effectiveness of Multimedia Input for Second and Foreign Language Learning.” Computer Assisted Language Learning, 1–27. doi:10.1080/09588221.2021.1896555.