Abstract
While audiovisual integration is well known in speech perception, faces and speech are also informative with respect to speaker recognition. To date, audiovisual integration in the recognition of familiar people has never been demonstrated. Here we show systematic benefits and costs for the recognition of familiar voices when these are combined with time-synchronized articulating faces of corresponding or noncorresponding speaker identity, respectively. While these effects were strong for familiar voices, they were smaller or nonsignificant for unfamiliar voices, suggesting that the effects depend on the prior creation of a multimodal representation of a person's identity. Moreover, the effects were reduced or eliminated when voices were combined with the same faces presented as static pictures, demonstrating that the effects do not simply reflect the use of facial identity as a “cue” for voice recognition. This is the first direct evidence for audiovisual integration in person recognition.
Acknowledgments
A huge thanks to B.T.J., M.L., D.D., P.O.D., and four additional Glasgow University staff members who agreed to temporarily donate their faces and voices to create the stimulus material for this study. This research was supported by a grant from the Deutsche Forschungsgemeinschaft (DFG) to S.R.S. The study further benefited from a summer studentship from the Nuffield Foundation to D.R. J.M.K. was supported by a British Academy Postdoctoral Fellowship.
Notes
1 An analysis of variance (ANOVA) on absolute asynchronies, with the factors familiarity (familiar vs. unfamiliar), correspondence (corresponding, noncorresponding within familiarity, and noncorresponding across familiarity), and parameter (initial, intermediate, and final positions), revealed only a significant main effect of parameter, F(2, 6) = 6.2, p < .05. Perhaps unsurprisingly, this indicated that synchronization was somewhat better for initial than for intermediate and final sentence positions (M = 41 ms, 104 ms, and 85 ms for initial, intermediate, and final positions, respectively). The analysis revealed no effects involving familiarity (M = 77 ms for both familiar and unfamiliar speakers) or correspondence (M = 65 ms, 82 ms, and 83 ms for corresponding, noncorresponding within familiarity, and noncorresponding across familiarity conditions, respectively).
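For readers unfamiliar with how the degrees of freedom F(2, 6) arise for the three-level parameter factor, the computation can be sketched as a one-way repeated-measures ANOVA. The data values below are synthetic placeholders (only their column means are chosen to echo the condition means reported above); they are not the study's measurements, and the resulting F value is purely illustrative.

```python
# Hand-computed one-way repeated-measures ANOVA for a 3-level
# within-subjects factor (sentence position) measured in 4 units,
# yielding df = (3 - 1) and (4 - 1) * (3 - 1), i.e. F(2, 6).
# Data are synthetic placeholders, not the study's asynchronies.
import numpy as np

# rows = measurement units, columns = sentence position
# (initial, intermediate, final); asynchronies in ms
data = np.array([
    [35., 110., 90.],
    [45.,  95., 80.],
    [40., 105., 85.],
    [44., 106., 85.],
])
n_subj, n_cond = data.shape

grand = data.mean()
# variability between condition means (the effect of interest)
ss_cond = n_subj * ((data.mean(axis=0) - grand) ** 2).sum()
# variability between units, removed from the error term
ss_subj = n_cond * ((data.mean(axis=1) - grand) ** 2).sum()
ss_total = ((data - grand) ** 2).sum()
ss_error = ss_total - ss_cond - ss_subj

df_cond = n_cond - 1                    # 2
df_error = (n_subj - 1) * (n_cond - 1)  # 6
F = (ss_cond / df_cond) / (ss_error / df_error)
print(f"F({df_cond}, {df_error}) = {F:.2f}")
```

Partialling out the between-unit sum of squares (`ss_subj`) is what distinguishes the repeated-measures design from a between-subjects ANOVA and gives the smaller error degrees of freedom seen in the note.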