N-Gram Approaches to the Historical Dynamics of Basic Vocabulary*

ABSTRACT

In this paper, we apply an information-theoretic measure, the self-entropy of phoneme n-gram distributions, to quantify the amount of phonological variation in words for the same concepts across languages. We thereby investigate the stability of concepts in a standardized concept list, based on the 100-item Swadesh list, that was specifically designed for automated language classification. Our findings are consistent with those of the ASJP project (Automated Similarity Judgment Program; Holman et al. 2008a): the correlation of our ranking with that of ASJP is statistically highly significant. Our ranking also largely agrees with two other reduced concept lists proposed in the literature. Our results suggest that n-gram analysis works at least as well as other measures for investigating the relation of phonological similarity to geographical spread, automatic language classification, and typological similarity. It is also computationally considerably cheaper than the most widespread method (normalized Levenshtein distance), which is very important when processing large quantities of language data.
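
To make the measure concrete, the following is a minimal sketch of how the entropy of an n-gram distribution might be computed for the word forms of a single concept. It is not the procedure of the paper, whose exact definition is not given in this abstract. It assumes plain Shannon entropy over the pooled n-grams of all forms, and it assumes that each character of a transcription stands for one phoneme symbol. The function names and the toy strings are hypothetical and do not represent real language data.

```python
from collections import Counter
from math import log2


def ngrams(word, n):
    """Return the contiguous n-grams of a word.

    Assumption: each character is one phoneme symbol.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_entropy(words, n=2):
    """Shannon entropy (in bits) of the pooled n-gram distribution of `words`.

    Assumption: all word forms for a concept are pooled into one distribution.
    """
    counts = Counter(g for w in words for g in ngrams(w, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over all observed n-grams
    return sum((c / total) * log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Invented toy strings, not real language data.
    similar_forms = ["nam", "nam", "nem", "nam"]
    diverse_forms = ["kot", "pisu", "mau", "bili"]
    print(ngram_entropy(similar_forms, n=2))
    print(ngram_entropy(diverse_forms, n=2))
```

Under these assumptions, a concept whose forms share many n-grams across languages yields a lower entropy than a concept whose forms differ widely, so concepts can be ranked by this value. Counting n-grams takes time linear in the total length of the word forms, whereas Levenshtein-based methods compare forms pairwise.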

Notes

1 In the terminology of historical linguistics, items (words, morphemes or constructions) in related languages are cognates if they all descend directly from the same proto-language item. This is sometimes called vertical transmission, as opposed to horizontal transmission, i.e., borrowing in a wide sense. Thus, cognacy in historical linguistics explicitly excludes loanwords.

2 The term cognate set refers to a set of cognate items, i.e., words in different languages going back to the same proto-language word. In working with Swadesh lists, cognates are further required to express the same sense in order for them to be in the same cognate set.

3 Compare English wheel to Hindi chakka ‘wheel’, which do not reveal themselves to be cognates through visual inspection, but can nevertheless be traced back to the same Proto-Indo-European root.

6 We use a weighted average to factor out the effect of the sample size of each language family. We also tried averaging by the number of families and by the total number of languages present in the 100-word list sample. Both averaging techniques correlate highly (ρ > 0.92, p < 0.001) with the weighted average.

7 Dolgopolsky (1986) arrived at the 23-item list by comparing 140 languages belonging to 10 families.

8 The LDND program takes about one hour to compute the inter-language distances, whereas the n-gram analysis takes less than two minutes.
