
N-Gram Approaches to the Historical Dynamics of Basic Vocabulary*

Pages 50-64 | Published online: 17 Dec 2013
 

ABSTRACT

In this paper, we apply an information-theoretic measure, the self-entropy of phoneme n-gram distributions, to quantify the amount of phonological variation in words for the same concepts across languages. We use it to investigate the stability of concepts in a standardized concept list, based on the 100-item Swadesh list, that was specifically designed for automated language classification. Our findings are consistent with those of the ASJP project (Automated Similarity Judgment Program; Holman et al. 2008a). The correlation of our ranking with that of ASJP is statistically highly significant. Our ranking also largely agrees with two other reduced concept lists proposed in the literature. Our results suggest that n-gram analysis works at least as well as other measures for investigating the relation of phonological similarity to geographical spread, automatic language classification, and typological similarity. It is also computationally considerably cheaper than the most widespread method (normalized Levenshtein distance), which matters when processing large quantities of language data.
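
As a rough illustration of the kind of measure the abstract names, the following sketch computes the Shannon entropy of a pooled n-gram distribution over the words for one concept. It is not the paper's exact formulation, which is not given in this section. The function names `char_ngrams` and `ngram_entropy` are hypothetical, and the sketch assumes that each character of a transcription stands for one phoneme and that the n-grams of all words for a concept are pooled before the entropy is taken.

```python
from collections import Counter
from math import log2


def char_ngrams(word, n):
    """Return the overlapping n-grams of a word, one symbol per character."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_entropy(words, n=2):
    """Shannon entropy (in bits) of the pooled n-gram distribution of `words`.

    Assumption for illustration: all n-grams from all words for one concept
    are pooled into a single frequency distribution.
    """
    counts = Counter(g for w in words for g in char_ngrams(w, n))
    total = sum(counts.values())
    if total == 0:
        # No word is long enough to yield an n-gram.
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Under these assumptions, a concept whose word forms share the same n-grams across languages gets a low entropy, and a concept whose forms share few n-grams gets a high one.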

Notes

1 In the terminology of historical linguistics, items (words, morphemes or constructions) in related languages are cognates if they all descend directly from the same proto-language item. This is sometimes called vertical transmission, as opposed to horizontal transmission, i.e., borrowing in a wide sense. Thus, cognacy in historical linguistics explicitly excludes loanwords.

2 The term cognate set refers to a set of cognate items, i.e., words in different languages going back to the same proto-language word. In working with Swadesh lists, cognates are further required to express the same sense in order for them to be in the same cognate set.

3 Compare English wheel to Hindi chakka ‘wheel’, which do not reveal themselves to be cognates through visual inspection, but can nevertheless be traced back to the same Proto-Indo-European root.

6 We use a weighted average to factor out the effect of the sample size of each language family. We also tried averaging by the number of families and by the total number of languages present in the 100-word list sample. Both of these averaging techniques correlate highly (ρ > 0.92, p < 0.001) with the weighted average.

7 Dolgopolsky (1986) arrived at the 23-item list by comparing 140 languages belonging to 10 families.

8 The LDND program takes about one hour to compute the inter-language distances, whereas the n-gram analysis takes less than two minutes.
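
The weighted average mentioned in note 6 can be sketched as follows. This is only an illustration: the note does not state which weights were used, so the sketch assumes that each family's score is weighted by the number of languages sampled from that family. The function name `weighted_average` is hypothetical.

```python
def weighted_average(values, weights):
    """Weighted mean of per-family values.

    Assumption for illustration: `weights` holds the number of languages
    sampled from each family; the source does not specify the weights.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total
```

For example, two families with scores 1.0 and 3.0 and sample sizes 1 and 3 give (1.0 × 1 + 3.0 × 3) / 4 = 2.5.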
