Automatic Language Classification by means of Syntactic Dependency Networks: Journal of Quantitative Linguistics, Vol 18, No 4


Abstract

This article presents an approach to automatic language classification by means of linguistic networks. Networks of 11 languages were constructed from dependency treebanks, and the topology of these networks serves as input to the classification algorithm. The results match the genealogical similarities of these languages. In addition, we test two alternative approaches to automatic language classification – one based on n-grams and the other on quantitative typological indices. All three methods show good results in identifying genealogical groups. Beyond genetic similarities, network features (and feature combinations) offer a new source of typological information about languages. This information can contribute to a better understanding of the interplay of single linguistic phenomena observed in language.

ACKNOWLEDGEMENTS

This work was supported by the Linguistic Networks project (http://www.linguistic-networks.net/), funded by the German Federal Ministry of Education and Research (BMBF), and by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) in the Collaborative Research Center 673 “Alignment in Communication”.

We are grateful to Ramon Ferrer i Cancho, Barbara Job, Tatiana Lokot and the anonymous reviewers for their useful comments.

Notes

¹This notion goes back to Ferrer i Cancho et al. (2004) and will be explained in more detail below.

²Cognates are pairs of words from different languages that originate from the same ancestor language. The common origin is determined by regular phonetic change from one language to another and by related meaning of the two words. Borrowed words are not cognates (Kruskal et al., 1992).

³This model is based on the Graph eXchange Language GXL (Holt et al., 2006). Pustylnikov and Mehler (2008) and Pustylnikov et al. (2008) adapted this format in order to model syntactic trees. See the TreebankWiki (http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/) for all details on the conversion process.

⁴Ferrer i Cancho et al. (2004, p. 2).

⁵Note that weights of edges are not considered by this model; that is, if two words occur more than once in a modifier-head relation it does not result in an increase of degrees of these words.

⁶C₂ is a variant of C₁ that weights the individual values c(v_i) by their vertex degrees.

⁷Note that all indices are computed only for the largest connected component (LCC). This is, of course, an abstraction, and some information might be lost.

⁸The DC is 1 for a “star graph” (i.e. all vertices but one have degree d = 1, and the remaining vertex has degree d = |V| − 1).

⁹The features γ(S) and R²_γS(G) represent the power-law fit of the distribution of connected components of G, and lcc is the fraction of the largest connected component of G (see Features F₁₀ and F₁₁).

¹⁰We selected 300 n-grams as suggested by Cavnar and Trenkle (1994) for n = {1, … , 6}.

¹¹We selected these indices since they could be applied to our sort of data, i.e. dependency treebanks.

¹²29 is the maximal number of samples with 1000 tokens that can be taken from the smallest treebank (i.e. from Slovene). That is, we select 29 as the least common number of samples for each treebank.

¹³This and the following indices were calculated on a sample of 1499 sentences from each treebank. This number is the smallest common number of sentences obtainable from each treebank.

¹⁴The example is taken from Altmann and Lehfeldt (1973).

¹⁵m is the maximal value of the index j (see Dm), i.e. the deepest level in the dependency tree.

¹⁶All the computations of the cluster analysis were made using MATLAB version 7.11.0.584 (R2010b), including the Statistics and Curve Fitting Toolboxes (www.mathworks.de).

¹⁷We have computed the pairwise correlations among all features. Here and in the following paragraphs we show only results that are statistically significant.

¹⁸Note that all correlations exemplified here are significant with a p-value < 0.05.
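Notes 5 and 8 can be illustrated with a minimal sketch. It assumes an undirected network built from modifier-head word pairs in which repeated pairs add no weight (Note 5), and it assumes that DC denotes Freeman's degree centralization, which this page does not spell out. The function names and the word pairs are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import defaultdict


def build_network(dependencies):
    """Build an undirected, unweighted graph from (modifier, head) pairs.

    Assumption: repeated pairs are collapsed into a single edge, so they
    do not increase the degree of either word (cf. Note 5).
    """
    neighbours = defaultdict(set)
    for modifier, head in dependencies:
        if modifier == head:
            continue  # assumption: self-loops are ignored
        neighbours[modifier].add(head)
        neighbours[head].add(modifier)
    return neighbours


def degree_centralization(neighbours):
    """Freeman's degree centralization (assumed definition of DC).

    DC = sum(d_max - d_i) / ((n - 1) * (n - 2)) for a graph with n vertices.
    """
    n = len(neighbours)
    if n < 3:
        raise ValueError("degree centralization needs at least 3 vertices")
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical modifier-head pairs; ("the", "dog") occurs twice on purpose.
chain_pairs = [("the", "dog"), ("dog", "barks"),
               ("the", "dog"), ("loudly", "barks")]

# Hypothetical star: every modifier attaches to the same head.
star_pairs = [("cat", "eats"), ("fish", "eats"),
              ("quickly", "eats"), ("often", "eats")]

chain = build_network(chain_pairs)
star = build_network(star_pairs)

print(len(chain["the"]))            # degree of "the" despite the repeated pair
print(degree_centralization(chain))
print(degree_centralization(star))
```

Under these assumptions the star graph reaches the maximum value of 1, as stated in Note 8, while the chain-like graph stays well below it.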

References

Ferrer i Cancho, R., Solé, R. V. and Köhler, R. 2004. Patterns in syntactic dependency networks. Physical Review E, 69(5): 051915.

Kruskal, J. B., Black, P. and Dyen, I. 1992. An Indo-European Classification. A Lexicostatistical Experiment (Transactions of the American Philosophical Society). American Philosophical Society.

Holt, R. C., Schürr, A., Elliott Sim, S. and Winter, A. 2006. GXL: A graph-based standard exchange format for reengineering. Science of Computer Programming, 60(2): 149–170.

Pustylnikov, O. and Mehler, A. 2008. Towards a uniform representation of treebanks: Providing interoperability for dependency tree data. Proceedings of First International Conference on Global Interoperability for Language Resources (ICGL 2008). Hong Kong SAR, January 9–11.

Pustylnikov, O., Mehler, A. and Gleim, R. 2008. A unified database of dependency treebanks. Integrating, quantifying & evaluating dependency data. Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco. pp. 3359–3365.

Cavnar, W. B. and Trenkle, J. M. 1994. N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. Las Vegas, US. pp. 161–175.

Altmann, G. and Lehfeldt, W. 1973. Allgemeine Sprachtypologie. München: Wilhelm Fink.
