Research Article

Quantitative Aspects of PDTB-Style Discourse Relations across Languages

Pages 342-371 | Published online: 05 Jan 2018
 

Abstract

The frequency distributions of words, syntax and semantics in many languages abide by certain laws. However, because discourse corpora are scarce, few studies have examined whether the frequency of discourse relations follows any distributional pattern. Although some research has been based on the Rhetorical Structure Theory discourse treebank (RST-DT), each of these studies is limited to a single language. Unlike the RST-DT, the Penn Discourse Treebank (PDTB) adopts a different annotation system and has had an enormous influence on the study of discourse structure and discourse annotation. Discourse corpora in other languages, such as Chinese, Hindi, Turkish, Czech and Arabic, have been annotated following the PDTB style. With the data from these discourse treebanks, we find that the rank-frequency distributions of discourse relations follow the same pattern and that these languages share significant similarities in using semantic relations to organize discourse. Our research provides evidence that readers tend to assume a causal or expansion relation between two consecutive sentences, since these relations are marked by fewer connectives, whereas contrast is the relation most often marked by connectives. This research will be of significance for understanding the homogeneity of discourse structure across languages.

Acknowledgements

We thank two anonymous reviewers of JQL for their constructive comments on previous versions. Special thanks go to Prof. Hiatao Liu for his helpful suggestions, and to Dr. Fatemeh Torabi Asr for providing the data on PDTB.

Notes

1. RST-DT (Carlson et al., 2002) used 76 types of relations (actually 86 types according to Zhang and Liu (2016)), whereas the Spanish RST discourse treebank used 29 types (http://corpus.iingen.unam.mx/rst/manual_en.html), the Chinese RST corpus used 50 types (Yue & Liu, 2011), and the German Potsdam Commentary Corpus used the original 23 types. The number of relations annotated in these RST-style corpora thus varies greatly. Additionally, the classification of rhetorical relations differs from corpus to corpus, and even the classification system itself is quite confusing. For example, RST-DT classified 86 types into 16 classes, but the name of each class overlaps with its members, such as Comparison (comparison, preference, analogy, proportion), Evaluation (evaluation, interpretation, conclusion, comment), Cause (cause, result, consequence), Enablement (purpose, enablement). Because of these overlapping terminologies, it is not clear that the rank-frequency data on 16 classes are really distinct from the data on 83 types in Zhang and Liu (2016).

2. The original sentence comes from Taboada and Stede (http://www.sfu.ca/rst/pdfs/RST_Introduction.pdf). There are many relevant arguments about intentional relations in RST; for example, Stede (2012, p. 83, emphasis added) states, ‘RST opts to make the speaker intention the central criterion for assigning a particular relation when analysing a text.’

3. In Example 2, we examine the implicit situation only, ignoring the explicit connectives in the latter clause.

4. Put simply, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, provided that these events occur at a known average rate and independently of the time since the last event (https://en.wikipedia.org/wiki/Poisson_distribution).
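
For reference, this description corresponds to the standard probability mass function below. The symbols are notational assumptions rather than the article's own: λ stands for the known average rate per interval and k for the number of events observed in that interval.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
```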

5. The total frequencies of the 16 classes of English rhetorical relations are: elaboration (8420), attribution (2371), explanation (1131), background (1009), contrast (715), cause (641), evaluation (618), enablement (594), condition (321), manner-means (268), summary (255), temporal (241), comparison (188), topic-comment (51), topic-change (21).
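
As a rough illustration of how such counts become rank-frequency data, the following Python sketch ranks the classes listed in this note by frequency and computes each class's share of the total. The function name `rank_frequency` is hypothetical, and the sketch only tabulates the figures given above; it does not reproduce any model fitting from the article.

```python
# Frequencies of the English rhetorical relation classes, copied from Note 5.
COUNTS = {
    "elaboration": 8420, "attribution": 2371, "explanation": 1131,
    "background": 1009, "contrast": 715, "cause": 641,
    "evaluation": 618, "enablement": 594, "condition": 321,
    "manner-means": 268, "summary": 255, "temporal": 241,
    "comparison": 188, "topic-comment": 51, "topic-change": 21,
}


def rank_frequency(counts):
    """Return (rank, name, frequency, relative frequency) tuples.

    Rank 1 is assigned to the most frequent class.
    """
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, freq, freq / total)
        for rank, (name, freq) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, name, freq, share in rank_frequency(COUNTS):
        print(f"{rank:>2}  {name:<14} {freq:>5}  {share:.4f}")
```

Plotting rank against frequency on logarithmic axes is the usual next step when inspecting such a table for a distributional pattern.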

6. Zhang and Liu (2016) classified 83 types into 16 classes, and chose only 15 of these classes for quantitative analysis. They also grouped types that share the same initial part before the dash into a single type; there are 37 such types in the RST-DT. Moreover, the classification of rhetorical relations is quite diverse. According to Stede (2012, p. 86), rhetorical relations can be classified into four main categories: nucleus and satellite; multi-nuclear; semantic; and pragmatic. Taboada and Stede present various classifications: subject matter vs. presentational; relations that hold outside the text vs. those that are only internal to the text; relations frequently marked by a discourse marker vs. relations that are rarely, or never, marked; preferred order of spans: nucleus before satellite vs. satellite–nucleus (http://www.sfu.ca/rst/pdfs/RST_Introduction.pdf).

7. Motifs refer to uninterrupted sequences of unrepeated elements (Köhler, 2015), like this: [Explanation + Manner-Means + Elaboration], [Elaboration], [Elaboration + Enablement + Background].
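
A minimal Python sketch of this segmentation follows. It assumes a greedy left-to-right reading of the definition, in which a motif is closed as soon as the next relation already occurs in it. The function name `segment_into_motifs` is hypothetical, and the input sequence is simply the example motifs above concatenated.

```python
def segment_into_motifs(relations):
    """Split a sequence into maximal runs that contain no repeated element.

    Assumption: a new motif starts whenever the incoming element is
    already present in the motif currently being built.
    """
    motifs = []
    current = []
    for relation in relations:
        if relation in current:
            # The element would repeat, so close the current motif.
            motifs.append(current)
            current = []
        current.append(relation)
    if current:
        motifs.append(current)
    return motifs


SEQUENCE = [
    "Explanation", "Manner-Means", "Elaboration",
    "Elaboration",
    "Elaboration", "Enablement", "Background",
]

if __name__ == "__main__":
    for motif in segment_into_motifs(SEQUENCE):
        print("[" + " + ".join(motif) + "]")
```

Under this assumption the sequence splits into the same three motifs as in the example given in the note.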
