Original Articles

Automatic Language Classification by means of Syntactic Dependency Networks

Pages 291-336 | Published online: 17 Nov 2011
 

Abstract

This article presents an approach to automatic language classification by means of linguistic networks. Networks of 11 languages were constructed from dependency treebanks, and the topology of these networks serves as input to the classification algorithm. The results match the genealogical similarities of these languages. In addition, we test two alternative approaches to automatic language classification – one based on n-grams and the other on quantitative typological indices. All three methods show good results in identifying genealogical groups. Beyond genetic similarities, network features (and feature combinations) offer a new source of typological information about languages. This information can contribute to a better understanding of the interplay of single linguistic phenomena observed in language.

ACKNOWLEDGEMENTS

This work was supported by the Linguistic Networks project (http://www.linguistic-networks.net/), funded by the German Federal Ministry of Education and Research (BMBF), and by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) in the Collaborative Research Center 673 “Alignment in Communication”.

We are grateful to Ramon Ferrer i Cancho, Barbara Job, Tatiana Lokot and the anonymous reviewers for their useful comments.

Notes

1. This notion goes back to Ferrer i Cancho et al. (2004) and will be explained in more detail below.

2. Cognates are pairs of words from different languages that originate from the same ancestor language. The common origin is determined by regular phonetic change from one language to another and by related meaning of the two words. Borrowed words are not cognates (Kruskal et al., 1992).

3. This model is based on the Graph eXchange Language GXL (Holt et al., 2006). Pustylnikov and Mehler (2008) and Pustylnikov et al. (2008) adapted this format in order to model syntactic trees. See the TreebankWiki (http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/) for all details on the conversion process.

4. Ferrer i Cancho et al. (2004, p. 2).

5. Note that edge weights are not considered by this model; that is, if two words occur more than once in a modifier-head relation, the degrees of these words do not increase.

6. $C_2$ is a variant of $C_1$ that weights the individual $c(v_i)$ values by their vertex degrees.

7. Note that all indices are computed only for the largest connected component (LCC). This is, of course, an abstraction, and some information may be lost.

8. The DC is 1 for a “star graph” (i.e. all vertices but one have degree d = 1, and the remaining vertex has degree d = |V| − 1).

9. The features $\gamma(S)$ and $R^2_{\gamma S}(G)$ represent the power-law fit of the distribution of connected components of G, and lcc is the fraction of the largest connected component of G (see Features $F_{10}$ and $F_{11}$ in ).

10. We selected 300 n-grams as suggested by Cavnar and Trenkle (1994) for n ∈ {1, …, 6}.

11. We selected these indices because they could be applied to our kind of data, i.e. dependency treebanks.

12. 29 is the maximum number of 1000-token samples that can be taken from the smallest treebank (i.e. Slovene). We therefore take 29 samples from each treebank.

13. This and the following indices were calculated on a sample of 1499 sentences from each treebank. This is the largest number of sentences obtainable from every treebank.

14. The example is taken from Altmann and Lehfeldt (1973).

15. m is the maximum value of the index j (see $D_m$), i.e. the deepest level in the dependency tree.

16. All computations of the cluster analysis were made using MATLAB version 7.11.0.584 (R2010b), including the Statistics and Curve Fitting Toolboxes (www.mathworks.de).

17. We have computed the pairwise correlations among all features. Here and in the following paragraphs we show only results that are statistically significant.

18. Note that all correlations exemplified here are significant with a p-value < 0.05.
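
Notes 5, 7 and 8 describe three modelling choices that may be easier to follow in code. The following is a minimal Python sketch, not the authors' implementation. It assumes that the network is treated as undirected and that "DC" denotes Freeman's degree centralization; neither is stated in the notes. The function names and the toy word pairs are hypothetical.

```python
from collections import defaultdict


def build_network(dependencies):
    """Build an unweighted graph from (modifier, head) word pairs.

    Assumption: edges are undirected. Repeated pairs do not raise
    degrees because neighbours are stored in sets (cf. Note 5).
    """
    adjacency = defaultdict(set)
    for modifier, head in dependencies:
        if modifier == head:
            continue  # self-loops are ignored in this sketch
        adjacency[modifier].add(head)
        adjacency[head].add(modifier)
    return dict(adjacency)


def largest_connected_component(adjacency):
    """Return the vertex set of the largest connected component (cf. Note 7)."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal from `start`
            vertex = stack.pop()
            for neighbour in adjacency[vertex]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def degree_centralization(adjacency, vertices):
    """Freeman degree centralization on `vertices` (assumed meaning of DC)."""
    n = len(vertices)
    if n < 3:
        raise ValueError("degree centralization needs at least 3 vertices")
    degrees = [len(adjacency[v] & vertices) for v in vertices]
    max_degree = max(degrees)
    # Normalised by the sum obtained for a star graph, (n - 1) * (n - 2).
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


# Toy data: the pair ("the", "dog") occurs twice but counts once.
pairs = [("the", "dog"), ("dog", "barks"), ("the", "dog"),
         ("loudly", "barks"), ("a", "cat")]
network = build_network(pairs)
lcc = largest_connected_component(network)
print(len(network["dog"]), sorted(lcc))

# A star graph reaches the maximum value of 1 (cf. Note 8).
star = build_network([("a", "hub"), ("b", "hub"), ("c", "hub")])
print(degree_centralization(star, set(star)))  # 1.0
```

Storing neighbours in sets keeps the graph unweighted without extra bookkeeping, and restricting the index to the largest component mirrors the abstraction mentioned in Note 7: the isolated pair ("a", "cat") is simply not counted.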
