Variation in Word Frequency Distributions: Definitions, Measures and Implications for a Corpus-Based Language Typology
