Chinese Word Frequency Approximation Based on Multitype Corpora: Journal of Quantitative Linguistics: Vol 17, No 2

Abstract

Due to the nature of Chinese, a perfectly word-segmented Chinese corpus that would be ideal for word frequency estimation may never exist. Reliable estimation of Chinese word frequencies therefore remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme that utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective and leads to better word frequency approximation results.
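To make the contrast between the three corpus types concrete, the following toy sketch counts one word in a raw corpus (substring matching), an automatically segmented corpus (containing a deliberate segmentation error) and a manually segmented corpus, and then blends the three counts. It is an illustration only and not the scheme proposed in the article: the example sentences, the function names `count_in_raw`, `count_in_segmented` and `blend`, and the weights are all assumptions made for this sketch.

```python
# Toy illustration only; NOT the hybrid scheme from the article.
# All data, function names and weights below are hypothetical.

def count_in_raw(word, sentences):
    """Count `word` as a substring of unsegmented sentences.

    Overlapping matches are counted. The result tends to overestimate
    the true word frequency, because the string may straddle a word boundary.
    """
    total = 0
    for sentence in sentences:
        start = sentence.find(word)
        while start != -1:
            total += 1
            start = sentence.find(word, start + 1)
    return total


def count_in_segmented(word, segmented_sentences):
    """Count `word` as a whole token in word-segmented sentences."""
    return sum(tokens.count(word) for tokens in segmented_sentences)


def blend(counts, weights):
    """Weighted average of per-corpus counts (an assumed, simplistic combination)."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights)) / sum(weights)


# The same two sentences in three forms.
raw = ["研究生命起源", "他是研究生"]
# Assumed automatic segmentation with an error in the first sentence.
auto_segmented = [["研究生", "命", "起源"], ["他", "是", "研究生"]]
# Assumed manual (correct) segmentation.
manual_segmented = [["研究", "生命", "起源"], ["他", "是", "研究生"]]

word = "研究生"
counts = (
    count_in_raw(word, raw),
    count_in_segmented(word, auto_segmented),
    count_in_segmented(word, manual_segmented),
)
# Hypothetical weights favouring the manually segmented corpus.
estimate = blend(counts, weights=(0.2, 0.3, 0.5))
print(counts, estimate)
```

In this toy data the raw and automatically segmented corpora both count the word twice, while the manually segmented corpus counts it once. This is the kind of disagreement that motivates combining corpus types rather than relying on one alone.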

Acknowledgements

This work is supported by the National Science Foundation of China under Grant No. 60873174, the National 863 High-Tech Project of China under Grant No. 2007AA01Z148 and the China–Germany (Tsinghua–Hamburg University) CINACS Program.

Notes

¹ ICTCLAS 1.0: http://www.nlp.org.cn

² http://chasen.org/taku/software/CRF++/

³ The four types are Numbers, Dates (the Chinese characters for “day”, “month” and “year”), English letters and Others.
