Abstract
Statistical co-occurrence analysis is widely used in the linguistic and literary analysis of corpora. Several measures of co-occurrence are available, and it has been argued that the measures adopted by mainstream convention are not necessarily the best choice. Against this background, the author carried out a comparative evaluation of four major statistical measures, namely χ², likelihood ratio, Yule's coefficient of colligation Y, and mutual information, on two tasks: extracting valid morphological units from Japanese kanji (Chinese character) sequences, and decomposing kanji sequences. In the experiments, likelihood ratio performed better than the other measures. This measure is also preferable in that (1) its notion of association between collocates reflects our intuition and (2) it treats low-frequency events properly.
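For orientation, the four measures can be sketched on a 2×2 contingency table of co-occurrence counts. The abstract does not give formulas, so the definitions below are the common textbook forms and are an assumption; in particular, mutual information is taken here in its pointwise form. The function name `association_measures` and the cell names `a`, `b`, `c`, `d` are hypothetical, chosen only for this illustration.

```python
import math

def association_measures(a, b, c, d):
    """Score a pair (x, y) from a 2x2 table of counts.

    a = count(x, y), b = count(x, not y),
    c = count(not x, y), d = count(not x, not y).
    Textbook definitions, assumed for illustration only.
    """
    n = a + b + c + d
    r1, r2 = a + b, c + d          # row margins
    c1, c2 = a + c, b + d          # column margins

    # Pearson chi-square; defined as 0.0 here when a margin is empty.
    denom = r1 * r2 * c1 * c2
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0

    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)), with 0 * ln 0 = 0.
    g2 = 0.0
    for obs, row, col in ((a, r1, c1), (b, r1, c2), (c, r2, c1), (d, r2, c2)):
        if obs > 0:
            expected = row * col / n
            g2 += obs * math.log(obs / expected)
    g2 *= 2.0

    # Yule's coefficient of colligation Y.
    s_ad, s_bc = math.sqrt(a * d), math.sqrt(b * c)
    yule_y = (s_ad - s_bc) / (s_ad + s_bc) if (s_ad + s_bc) else 0.0

    # Pointwise mutual information in bits; -inf when the pair never co-occurs.
    mi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")

    return {"chi2": chi2, "llr": g2, "yule_y": yule_y, "mi": mi}
```

Under these definitions, all four scores are zero when the observed counts equal the counts expected under independence, and positive when the pair co-occurs more often than expected. They differ in how they scale with sample size and with very small counts.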