Abstract
Word frequencies are central to linguistic studies investigating processing difficulty, learnability, age of acquisition, diachronic transmission and the relative weight given to a concept in society. However, there are few cross-linguistic studies on entire distributions of word frequencies, and even fewer on systematic changes within them. Here, we first define and test an exact measure for the relative difference between distributions – the Normalised Frequency Difference (NFD). We then apply this measure to parallel corpora in a total of 19 languages, explaining systematic variation in the frequency distributions within the same language and across different languages. We further establish the NFD between lemmatised and un-lemmatised corpora as a frequency-based measure of the inflectional productivity of a language. Finally, we argue that quantitative measures like the NFD can advance language typology beyond abstract, theory-driven expert judgments, towards more corpus-based, empirical and reproducible analyses.
Notes
1. We made an R package available for NFD calculation and plotting via https://github.com/dimalik/nfd/.
2. Note that we included the ’s genitive under both inflexion and clitics. Theoretically, it should be considered a phrasal clitic, since it attaches not directly to nouns but to noun phrases. In practice, however, it is found mostly on nouns and might be perceived as noun inflexion by learners and speakers.
3. In the upper panels we log-transform the ranks of the distributions, but not the ΔFreq. This exaggerates the visual differences in frequencies somewhat.
4. The POS tags used in the BTagger are the first two letters of the Multext-East morphosyntactic definitions (MSD). See a full list here: http://nl.ijs.si/ME/V4/msd/html/index.html.
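The truncation described in note 4 can be illustrated with a minimal Python sketch. The function name `coarse_pos` and the sample MSD strings are assumptions chosen for illustration; they are not taken from the BTagger itself.

```python
def coarse_pos(msd: str) -> str:
    """Return the first two characters of an MSD string as a coarse POS tag.

    Hypothetical helper: it only mirrors the 'first two letters' rule
    stated in note 4 and is not part of the BTagger.
    """
    return msd[:2]


# Illustrative MSD-style strings (assumed examples, not drawn from the corpora).
tags = ["Ncmsn", "Vmip3s", "Afpmsnn"]
coarse = [coarse_pos(t) for t in tags]
print(coarse)
```

An MSD shorter than two characters is returned unchanged, since the slice simply stops at the end of the string.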