Original Articles

The role of syntactic structure in guiding prosody perception with ordinary listeners and everyday speech

Pages 1141-1177 | Published online: 11 Jun 2010
 

Abstract

The relationship between syntactic and prosodic phrase structures is investigated in the production and perception of spontaneous speech. Three hypotheses are tested: (1) syntax influences prosody production; (2) listeners' perception of prosodic boundaries is sensitive to acoustic duration; and (3) syntax directly influences boundary perception, (partly) independent of the acoustic evidence for boundaries. Data are from the Buckeye corpus of conversational speech, and the real-time prosodic transcription of those data by 97 untrained listeners. Inter-transcriber agreement codes boundary strength at word junctures, and Boundary scores are shown to be correlated with both the syntactic context and vowel duration of a word. Vowel duration is also correlated with syntactic context, but the effect of syntactic context on boundary perception is not fully explained by vowel duration. Regression analyses show that syntactic clause boundaries and vowel duration are the first and second strongest predictors of boundary perception in spontaneous speech.
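One way to read "inter-transcriber agreement codes boundary strength" is as a per-juncture proportion: the share of listeners who marked a boundary at that word juncture. The sketch below illustrates that reading only. The abstract does not give the formula, so the proportion-based definition, the function name `boundary_scores`, and the toy data are all assumptions made for illustration.

```python
from typing import List


def boundary_scores(marks: List[List[int]]) -> List[float]:
    """Return one score per word juncture.

    marks[t][w] is 1 if transcriber t marked a boundary after word w, else 0.
    The score is the proportion of transcribers who marked that juncture
    (an assumed definition, not taken from the article).
    """
    n_transcribers = len(marks)
    n_words = len(marks[0])
    return [
        sum(row[w] for row in marks) / n_transcribers
        for w in range(n_words)
    ]


# Hypothetical toy data: 4 listeners, 5 word junctures.
toy_marks = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
]
b_scores = boundary_scores(toy_marks)
print(b_scores)
```

Under this reading, a score near 1 means nearly all listeners heard a boundary at that juncture, and a score near 0 means almost none did, which gives a graded rather than categorical measure of boundary strength.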

Acknowledgements

This work is supported by NSF award IIS-0414117 to the first author and Mark Hasegawa-Johnson. For their varied contributions to the work presented here, we thank three anonymous reviewers, Mark Hasegawa-Johnson, Chilin Shih, Tae-Jin Yoon, Xiaodan Zhuang, other members of the Prosody-ASR research group, and José Ignacio Hualde. Special thanks to Zachary Hulstrom and Eun-Kyung Lee for their help with the prosody transcription experiments. Statements in this paper reflect the opinions and conclusions of the authors, and are not endorsed by the NSF or the University of Illinois.

Notes

1 Although phonetic implementation of prosody can be seen in evidence from both articulation and acoustics, our project is focused on acoustic correlates and their relation to prosody perception. There are numerous works reporting on a wide range of acoustic parameters as correlates of prosody, of which we cite a few here: Beckman (1986), Beckman and Pierrehumbert (1986), Ladd (1996/2008) for F0; Turk and Sawusch (1997) and Wightman et al. (1992) for duration; Kochanski, Grabe, Coleman, and Rosner (2005) for overall intensity; Heldner (2003) and Sluijter and van Heuven (1996) for spectral emphasis and balance (intensities in sub-bands); van Bergem (1993) for formant structures; Choi, Hasegawa-Johnson, and Cole (2005) for various harmonic and voice source parameters.

2 In addition to durational effects of prosody, we also find overall intensity (Root Mean Square) and acoustic measures of creaky voicing (H1*–H2* and H2*–H4*) to be significantly correlated with the perception of prosodic phrase boundaries for at least some vowel phonemes. We report only the duration findings here, as duration was not only the strongest correlate (based on Pearson's r values), but is also the only acoustic measure that is significantly correlated with boundary perception across most of the vowel phonemes (Mo, 2008). Pause duration is also expected to cue prosodic boundaries, but we have not yet examined pause duration in our materials. For the data reported here, speech excerpts were selected to minimise the occurrence of disfluency within the excerpt, where silent and filled pauses were one of the factors used to identify disfluency. We expect that this selection criterion has skewed the distribution of pause duration at prosodic juncture in these materials. In our ongoing work we are investigating the influence of pause duration on prosody perception with longer excerpts for which pause duration was not a selection criterion.

3 Prosodic prominence also conditions lengthening of a stressed vowel (e.g., Turk & Sawusch, 1997), so the duration measure examined here may in some cases exhibit combined effects of prominence and boundary lengthening. Prominence is coded in our data with a probabilistic P-score assigned to each word, parallel to the assignment of B-scores, which means that we cannot simply separate prominent (pitch-accented) words from non-prominent (unaccented) words, as has been done in prior studies that are based on ToBI-style prosody transcription. Instead, we use correlation and regression analysis to look at the relationship between duration, B-scores, and P-scores. Comparison of correlation coefficients between duration and B-scores (Kendall's tau = .369) vs. duration and P-scores (Kendall's tau = .243) shows that duration is more strongly correlated with B-scores (all duration measures are normalised via z-transform). The correlation between P-scores and B-scores is even weaker (Kendall's tau = .204). Furthermore, regression analysis shows that P-scores only very weakly predict B-scores (r² = .027). Stepwise regression analysis shows that duration is the primary predictor of B-scores (r² = .239; shown in , Model B) and P-scores as a second factor contribute only marginally as a predictor (r² = .008). Looking at it from the perspective of duration modelling, we also find that B-scores are stronger predictors of vowel duration (r² = .278) with P-scores again as weak predictors (r² = .039). These findings demonstrate that boundary effects on duration outweigh prominence effects, and thus that boundary lengthening effects on words marked as prominent cannot be solely attributed to prominence-based lengthening.
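The comparison in note 3 rests on two steps: z-transforming the duration measures and then computing Kendall's tau against the B-scores. The sketch below shows those two steps on made-up numbers. It is an illustration under assumptions, not the authors' analysis: the note does not say which tau variant or software was used (tau-b is assumed here because it handles ties), and the function names and the five-word toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import List


def z_transform(values: List[float]) -> List[float]:
    """Normalise values to zero mean and unit (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def kendall_tau_b(x: List[float], y: List[float]) -> float:
    """Kendall's tau-b, computed by direct comparison of all pairs (O(n^2))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: contributes to neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical toy data: vowel durations (ms) and B-scores for five words.
durations_ms = [80.0, 95.0, 120.0, 150.0, 200.0]
toy_b_scores = [0.0, 0.1, 0.05, 0.4, 0.9]

z_durations = z_transform(durations_ms)
tau = kendall_tau_b(z_durations, toy_b_scores)
print(round(tau, 3))
```

Because Kendall's tau depends only on rank order, the z-transform does not change the coefficient for a single pooled sample; it would matter if durations were normalised within groups (for example per speaker or per vowel phoneme) before pooling, which the note does not specify.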

4 A reviewer asks about the possibility of directly testing the independence of acoustic and lexico-syntactic cues to prosody by testing prosody perception with delexicalised speech, that is, using filters or transformations of the acoustic signal to remove segmental information that reveals the lexical content of the speech. This approach is illustrated in the work of de Pijper and Sanderman (1994), who tested prosodic boundary perception by untrained listeners with delexicalised speech materials which were created by resynthesising speech after replacing the first eight spectral peaks with peaks of fixed frequency and bandwidth, and also manipulating pitch and Linear Predictive Coefficient (LPC) gain. This manipulation has the effect of rendering every vowel as schwa-like in its spectral features, and eliminating consonantal distinctions. The resulting materials were judged by de Pijper and Sanderman to preserve prosodic cues while rendering the utterances otherwise unintelligible; and the procedure was considered more successful than simpler alternative methods involving only low-pass filtering or spectral inversion. We have also considered methods for delexicalisation in our work on prosody perception, but like de Pijper and Sanderman, we have been dissatisfied with the filtering methods we have tested thus far, which were either unsuccessful in removing segmental cues to lexical content or successful in delexicalisation but with distorted or very unnatural sounding prosody. We did not attempt the complex method of spectral peak substitution used by de Pijper and Sanderman, which is unsuitable for our purposes given that we are interested in both segmental and suprasegmental effects of prosody. A related suggestion from this reviewer was to ask transcribers to mark prosody on the text without listening to the associated speech file, in which case lexico-syntactic features alone would guide the annotation. We did not collect such data in the initial phase of this project, whose findings are presented here, but are currently doing so for the second phase of data collection, and expect to report on the findings in our future work.

5 Although there are an equal number of left and right syntactic edges coded in this dataset (each word contributes one left and one right edge), the total numbers of left and right syntactic edges are not equal in Tables 3 and 4, because some categories are omitted: either they have fewer than 10 instances in the dataset, or they are not coded for both left and right edges. Examples of the latter are the right edges of subordinating conjunctions whose left edge would typically be coded as SBAR, or left edges of gerundive or subject-less infinitival clauses whose right edge would typically be coded as S or SBAR in the guidelines adopted here.

6 All differences reported here as significant by non-parametric analyses of mean differences were also confirmed as significant under ANOVA.
