Abstract
Choosing an appropriate unit for measuring word length is a key prerequisite for studies of word length distribution, since the suitable unit varies with the type of language or text. Taking Chinese as an example, this study explores the word length distributions of spoken and written Chinese, based on a data source of 20 dialogue texts (spoken language) and 20 prose texts (written language). Word length is measured in pinyin letters, phonemes, and syllables for spoken Chinese, and in strokes, components, and characters for written Chinese. To select the most appropriate unit, the study draws on empirical word length distribution models, synergetic linguistic theory, and Menzerath’s law. The results show that the syllable is the most appropriate measurement unit for spoken Chinese and the component is the most appropriate for written Chinese. Chinese word length distributions can be described with the Poisson or Binomial distribution families, among which Extended Logarithmic and Mixed Poisson are the most generally accepted models for spoken and written Chinese, respectively.
Acknowledgements
This work was supported by the National Social Science Foundation of China (11&ZD188).
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1 Pinyin is in fact a system of symbols that notates the sounds of words rather than their meanings; literally it means “spelled-out sounds”, so pinyin is sufficient to record the spoken language accurately.
2 We explore only the part of the hierarchy at or below the lexical level, such as word, letter, character, etc.
3 Selected Prose Website: http://swsk.qikan.com.
4 Pinyin4j: http://pinyin4j.sourceforge.net.
5 Characters not listed in our measuring list were added manually.
6 Same as above.
7 http://www.ram-verlag.biz/altmann-fitte.
8 To save space, we show only part of the results.
9 In fact, the proportion of each of the four word length classes is very stable in contemporary Chinese.