ABSTRACT
The system-level complexity of language has been thoroughly investigated in terms of Zipf’s law, whose quantitative features have been shown to reflect text and language typology. This study extends the scope of Zipf’s law from the macroscopic scale of language to specific words in context, with the aim of examining its potential as an indicator of word typology. The focus is confined to high-frequency words in English and Chinese as found in the FLOB and LCMC corpora. The log–log rank-frequency distributions of the contextual words of each word examined are found to generally follow the linear function y = ax + b. Moreover, an adjusted version of parameter a helps to distinguish the word classes of the words examined. The contextual information reflected by this Zipf-based index might be more important to the emergence of word classes in Chinese, which has no real inflection as a word-class indicator. From a Zipfian perspective, the findings lend preliminary support to Saussure’s systems thinking regarding linguistic signs. They may also contribute to fields such as usage-based linguistics.
Disclosure Statement
No potential conflict of interest was reported by the author(s).
Notes
1. Modal verbs were not lemmatized except for their contracted forms (i.e. ’ll -> will, ’d -> would). Personal pronouns were lemmatized only to remove the nominative-accusative distinction in word form (e.g. me -> I).
2. A word separated from the word in question by a punctuation mark does not count as a co-occurring word.
3. The fitting failed for the English lemma “terms”, which is tagged as II32 and always occurs in the phrase “in terms of”, and so it was excluded from the analysis that followed.
5. * represents a string of any length.
7. The auxiliaries in Chinese are rather different from auxiliaries in English. See Section 3.5 for more details.
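For illustration only, the procedure summarized in the abstract and in Note 2 can be sketched in Python: count the contextual words of a given word without crossing punctuation, then fit a straight line y = ax + b to the log–log rank-frequency distribution by ordinary least squares. This is a minimal sketch and not the authors' implementation. The function names, the one-word context window, the base-10 logarithm, and the rule that a token made up only of non-word characters counts as punctuation are all assumptions made here. The adjustment applied to parameter a is not specified in this section and is therefore not reproduced.

```python
import math
import re
from collections import Counter

# Assumption: a token made up only of non-word characters is punctuation.
PUNCT = re.compile(r"\W+")


def contextual_words(tokens, target, window=1):
    """Count words within `window` positions of `target`.

    Following Note 2, scanning in each direction stops at the first
    punctuation token, so words beyond it are not counted.
    The window size is a hypothetical parameter, not taken from the study.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for step in (-1, 1):  # scan left, then right
            for k in range(1, window + 1):
                j = i + step * k
                if j < 0 or j >= len(tokens) or PUNCT.fullmatch(tokens[j]):
                    break
                counts[tokens[j]] += 1
    return counts


def fit_loglog(counts):
    """Least-squares fit of log10(frequency) = a * log10(rank) + b."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct contextual words")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


if __name__ == "__main__":
    # Toy token list, invented for this sketch; not corpus data.
    toy = "the cat , the dog saw the cat . a cat".split()
    neighbours = contextual_words(toy, "cat")
    print(neighbours)
    print(fit_loglog(neighbours))
```

On real data the tokens would come from a lemmatized, tagged corpus, and the fitted slope a would be computed separately for each high-frequency word before any adjustment or comparison across word classes.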