Original Articles

Chinese Word Frequency Approximation Based on Multitype Corpora

Pages 142-166 | Published online: 14 May 2010
 

Abstract

Due to the nature of Chinese, a perfectly word-segmented Chinese corpus, ideal for the task of word frequency estimation, may never exist. Reliable estimation of Chinese word frequencies therefore remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. Each type has its own advantages and drawbacks, so none of them is sufficient alone. In this article, we propose a hybrid scheme that uses existing corpora of different types for word frequency approximation. Experiments were performed from both statistical and application-oriented perspectives. They demonstrate that, compared with other schemes, the proposed scheme is the most effective and leads to better word frequency approximation results.

Acknowledgements

This work is supported by the National Science Foundation of China under Grant No. 60873174, the National 863 High-Tech Project of China under Grant No. 2007AA01Z148 and the China–Germany (Tsinghua–Hamburg University) CINACS Program.

Notes

1. ICTCLAS 1.0: http://www.nlp.org.cn

3. The four types are Numbers, Dates (the Chinese characters for “day”, “month”, “year”, respectively), English letters and Others.
