Abstract
The corpora for this study are drawn from News Co-broadcasting, Daily Conversations and Behind the Headlines with Wentao, which represent the formal written style, the colloquial style and the conversational style, respectively. Sentence length, word length, part of speech (POS) and sentence-initial word POS are selected as features from the pre-processed corpora to generate text vectors, which are then clustered with the PAM (partitioning around medoids) and Ward algorithms. The clustering results show that (1) it is reasonable to select sentence length, word length, POS and sentence-initial word POS as quantitative stylistic features of Chinese; and (2) style is a polarized continuum: the formal written style and the colloquial style lie at opposite poles, while the conversational style lies in between, nearer the colloquial pole.
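To make the clustering step concrete, the following is a minimal sketch of medoid-based clustering over text vectors. It is not the implementation used in the study. The two-dimensional vectors (mean sentence length, mean word length) are invented toy values, the function names `total_cost` and `pam` are hypothetical, and the naive initialization is a simplification of PAM's greedy build step. The study's vectors also include POS features, and in practice features would likely be rescaled so that sentence length does not dominate the distance.

```python
from math import dist  # Euclidean distance, Python 3.8+

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid.
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    # Simplified PAM: naive initialization (an assumption, not PAM's
    # greedy BUILD step), followed by the swap phase.
    medoids = list(range(k))
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                # Try replacing the i-th medoid with a non-medoid point.
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Assign each point to the index of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Toy text vectors (illustrative values only, not data from the study):
# (mean sentence length in words, mean word length in characters)
points = [
    (28.0, 2.1), (30.0, 2.2), (27.0, 2.0),  # longer sentences and words
    (8.0, 1.4), (9.0, 1.5), (7.5, 1.3),     # shorter sentences and words
]

medoids, labels = pam(points, 2)
print(medoids, labels)
```

On these toy vectors the first three texts and the last three texts fall into separate clusters. A Ward clustering of the same vectors would instead merge clusters hierarchically, at each step choosing the merge that least increases within-cluster variance.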
Acknowledgment
We thank the JQL referees for their insightful comments and suggestions.
Funding
This work was supported by the National Natural Science Foundation of China (61171114), the Tsinghua University Self-determination Research Project (20111081023 & 20111081010), and the Human & Liberal Arts Development Foundation (2010WKHQ 009).
Notes
* Address correspondence to: Hou Renkui, Laboratory of Computational Linguistics, School of Humanities, Tsinghua University, Beijing 100084, China. E-mail: [email protected]
1 Requoted from Shuixian Li. TV News Style Research[M]. Beijing: CUC Press. 2007.10, p. 23.