Abstract
We evaluate existing and modified approaches for measuring the semantic similarity of sentences in the Malay language. These approaches have mainly been applied to English sentences, and no study to date has evaluated or compared their effectiveness on Malay sentences. We used a pre-processed Malay machine-readable dictionary to calculate word-to-word semantic similarity with two methods: probability of intersection and normalization. We then used the word-to-word similarity scores to compute sentence-level semantic similarity. We evaluated five measures of semantic sentence similarity: vector-based semantic similarity, word order similarity, highest word-to-sentence similarity, and two combinations, one of vector-based with word-to-sentence similarity and one of word order with word-to-sentence similarity. We also evaluated the effects of including and excluding lexical components such as prepositions, conjunctions, verbs, and morphological variants.
Acknowledgments
The authors wish to thank the Ministry of Higher Education for funding this project and the anonymous referees for their helpful and constructive comments on this paper.