ABSTRACT
This study investigates the syntactic complexity of various text-types in the Czech language by analysing the mean dependency distance (MDD), a measure that quantifies the average distance between syntactic heads and their dependents within a sentence, and average sentence length (ASL). Using data from the SYN2020 corpus, a large and balanced collection of contemporary written Czech, we calculate the MDD and ASL for different text-types. Our findings reveal distinct patterns in the MDD and ASL values across genres, suggesting that syntactic complexity varies among different types of texts. We observe a clear distinction between fiction and non-fiction genres, with fiction exhibiting lower MDD and ASL values, indicating a more compact syntactic structure. Non-fiction genres, particularly scientific literature, display higher MDD and ASL values, reflecting more complex syntactic constructions. Journalistic texts, such as newspapers and magazines, fall between fiction and non-fiction in terms of MDD and ASL values. These results demonstrate the potential of MDD and ASL as quantitative measures for characterizing and differentiating text-types based on their syntactic complexity. Furthermore, our analysis contributes to a deeper understanding of the syntactic variations across diverse genres in the Czech language.
Acknowledgments
This work was supported by the Czech Science Foundation (GAČR), project No. 22-20632S.
Disclosure Statement
No potential conflict of interest was reported by the author(s).
Notes
1. The dependency tree also contains additional information, such as the part-of-speech (POS) tags of words (e.g. VERB, NOUN, ADJ) and the types of dependencies (e.g. root: the root of the sentence, nsubj: nominal subject, obj: object, amod: adjectival modifier). In the current study, however, we analyse only the dependency distances (DD) between heads and their dependents, because our primary interest lies in quantifying syntactic complexity through the MDD measure. The root dependency is therefore excluded, since the root has no head within the sentence and thus no distance to measure. While POS tags and dependency types can provide valuable insights in other contexts, they fall outside the scope of this research.
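The two measures can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, and several choices in it are assumptions: each sentence is represented as a list of 1-based head positions with 0 marking the root (as in the HEAD column of CoNLL-U), distances are pooled over all dependencies in the sample before averaging, and sentence length is counted in tokens. The function names are hypothetical.

```python
from statistics import mean


def dependency_distances(heads):
    """Return the absolute head-dependent distances for one sentence.

    heads[i] is the 1-based position of the head of token i + 1;
    0 marks the root, which is skipped because it has no head.
    """
    return [abs(head - pos) for pos, head in enumerate(heads, start=1) if head != 0]


def mean_dependency_distance(sentences):
    """MDD over a sample: all dependency distances pooled, then averaged (assumption)."""
    distances = [d for heads in sentences for d in dependency_distances(heads)]
    return mean(distances) if distances else 0.0


def average_sentence_length(sentences):
    """ASL over a sample, measured in tokens per sentence (assumption)."""
    return mean(len(heads) for heads in sentences) if sentences else 0.0


# Illustrative sentence "The student reads a book":
# The -> student, student -> reads, reads = root, a -> book, book -> reads
example = [2, 3, 0, 5, 3]
print(dependency_distances(example))        # [1, 1, 1, 2]
print(mean_dependency_distance([example]))  # 1.25
print(average_sentence_length([example]))   # 5
```

Averaging MDD per sentence first and then across sentences is an equally plausible design; the two variants can yield different values when sentence lengths vary widely.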