Perspectives
Studies in Translation Theory and Practice
Volume 27, 2019 - Issue 1

Optimising the Europarl corpus for translation studies with the EuroparlExtract toolkit

Pages 107-123 | Received 06 Feb 2018, Accepted 29 May 2018, Published online: 19 Jun 2018

Figures & data

Figure 1. Percentage of citations of Europarl according to field of study. Total number of citations = 1,000.

Table 1. Samples of the German, English and Spanish monolingual Europarl files ep-11-05-09-018.txt (= proceedings of chapter 18 from May 9, 2011). Information about speaker and original language in metadata tags.

Table 2. Examples of inconsistent and incorrectly encoded source language identifiers in Europarl source files.

Table 3. Examples of inconsistencies in speaker names that require normalisation prior to statement matching.

Table 4. Number of Europarl statements yielded by source language identification and speaker matching procedure.

Table 5. Total sizes of extracted subcorpora in tokens. For parallel corpora, only target language tokens are counted.

Figure 2. Text quantity in parallel subcorpus by languages. Node size and colour proportional to number of tokens translated into given language; weight and colour of directed edges proportional to number of tokens translated from given language.

Figure 3. Text quantity in comparable subcorpus by languages. For meaning of nodes and edges, see Figure 2.

Data availability statement

The data extracted with the tool presented in this article are available in Zenodo at https://zenodo.org/record/1066474#.WnnEM3wiHcs (DOI: 10.5281/zenodo.1066473) and https://zenodo.org/record/1066472#.WnnEYXwiHcs (DOI: 10.5281/zenodo.106647). These data were derived from the European Parliament Proceedings Parallel Corpus (www.statmt.org/europarl). The data supporting the bibliometric analyses presented in this article are available within the supplementary materials.