Abstract
SUMAT is a project funded through the EU ICT Policy Support Programme (2011–2014). It involves four subtitling companies (InVision, DDS, Titelbild, VSI) and five technical partners (ALS, ATC, TextShuttle, University of Maribor, Vicomtech). For the SUMAT project, translated subtitles for seven language pairs have been collected. The four subtitling companies have contributed to this effort, which has so far resulted in collections of between 200,000 and 2 million subtitles per language pair. This paper describes the process of converting, classifying and aligning the subtitles. Conversion to a common text format and cross-language alignment were performed automatically, using specially built converters, whilst classification of the subtitles by text genre was a manual process, performed by the teams harvesting the subtitles. The resulting subtitle corpora are well suited to various applications. The focus of the SUMAT project is to use them as training material for statistical machine translation systems, and this paper reports on the initial experiences with some of the language pairs. In addition, the parallel corpora may serve as input data for parallel concordancing systems. As part of the project, a small prototype has been built which shows how word-aligned parallel subtitles offer new insights for translation science.
Acknowledgements
The work leading to these results has received funding from the European Community under grant agreement no. 270919.
Notes
3. www.vsi.tv/
7. www.atc.gr
9. www.um.si
10. www.vicomtech.es
11. For more information on the technique of using ‘template’ files for subtitling, see Georgakopoulou (2003, pp. 210–221).
14. www.ted.com/tedx
15. The corpus is called WIT3 and can be downloaded from wit3.fbk.eu/mt.php?release=2012-01
16. www.linguee.com
17. www.glosbe.com
18. This sentence is found at the bottom of every search results page on www.glosbe.com.