ABSTRACT
With global media content databases and online content now widely available, analyzing topical structures in multiple languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to capture the cross-lingual meanings of words and includes a mechanism to normalize the residual influence of language differences. We present a benchmark comparing the topics extracted from a corpus of English, German, and French news using our method with those produced by methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied to tracking news topics across time and languages.
Acknowledgements
The authors would like to thank Fabienne Lind (Computational Communication Science Lab, University of Vienna) for her comments that greatly improved this manuscript, and Valeria Glauser (University of Zurich) for her help in trilingual coding.
Disclosure Statement
No potential conflict of interest was reported by the authors.
Data availability statement
The data that support the findings of this study are openly available in OSF at https://doi.org/10.17605/osf.io/e2mhy
Notes
1. English translation: For a large class of cases – though not for all – in which we employ the word meaning, it can be defined thus: the meaning of a word is its use in the language.
2. Extensions allowing multilingual topic modeling exist, but major toolkits do not include these new implementations, the exception being the Polylingual Topic Model (Mimno et al., Citation2009) in the MALLET toolkit. Such extensions are not very useful for most cases due to their reliance on parallel corpora (i.e., every document in the corpus must have equivalents in all the languages). At the time of writing, only a handful of social science studies have adopted such methods (e.g., Lind et al., Citation2019a; Pruss et al., Citation2019).
3. The term multilingual corpus in computational linguistics can mean two things: parallel corpus and comparable corpus. A parallel corpus is one where a document in the corpus must have all direct multilingual equivalents. An example is the European Parliament dataset, in which a speech is manually translated into 21 European languages. We argue in footnote 2 that this kind of corpus is rare and has limited applicability for communication research. In this paper, multilingual corpus means a comparable corpus, in which multilingual articles are in the same category (e.g., news articles) but not directly equivalent to each other. Examples of comparable corpora are the ones used in Reber (Citation2019) and Lucas et al. (Citation2015) and public datasets such as the Manifesto Project and Mannheim International News Discourse Data Set.
4. For example, cross-lingual news topics are topics that attract similar attention from news outlets in different languages. Thematic news topics, such as sports and culture, are usually cross-lingual. Episodic news topics that attract worldwide attention (e.g., terrorist attacks or a global pandemic) are also likely to be cross-lingual.
5. However, the Google Translate API has a rate limit even for the paid service. For this project, we needed to translate 2,314 articles (or 10,311,329 characters) in German and French into English. Including the waiting time, this process took 5 hours and cost USD$206. Although it is much cheaper than manual translation, it still has a cost.
6. The previous case of reaching this holy grail status was crowdsourcing. Researchers announced the service to be “cheap, good, and fast” (Snow et al., Citation2008), but that was before the rapid downfall due to bots (Chmielewski & Kucker, Citation2019).
7. Besides, one cannot assume these third-party services will exist forever.
8. The latest version of Google Translate and DeepL translates this sentence as “From the animal welfare association, I have a hangover.”
9. English translation: The room is bright.
10. Hell in German and hell in English are false friends (faux amis), or bilingual homophones.
11. Using the DTM translation approach by Reber (Citation2019), the two columns are lumped together in all cases.
12. Joulin et al. (Citation2018) measured this alignment accuracy by using the aligned fastText word embeddings in a “word translation as a retrieval task.” The ground truth was based on a bilingual dictionary between two languages L1 and L2 (e.g., English and German). Given the L1 word embeddings vector of a word (e.g., France), the cosine distance between it and all aligned L2 word embeddings vectors was calculated. This alignment accuracy measure quantifies how likely the word embeddings vector of the L2 equivalent word according to the ground truth dictionary (Frankreich in this example) is to show up as the one with the shortest distance to the input L1 word embeddings vector.
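The retrieval task described in this note can be sketched as follows. The vocabulary and vectors below are toy stand-ins, not the actual aligned fastText embeddings; only the nearest-neighbor-by-cosine logic is the point.

```python
import numpy as np

def nearest_l2_word(query_vec, l2_vocab, l2_matrix):
    """Return the L2 word whose aligned embedding vector has the
    highest cosine similarity (shortest cosine distance) to query_vec."""
    # Normalize so that dot products equal cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    m = l2_matrix / np.linalg.norm(l2_matrix, axis=1, keepdims=True)
    return l2_vocab[int(np.argmax(m @ q))]

# Toy aligned space: the English vector for "France" should retrieve
# its German dictionary equivalent "Frankreich".
l2_vocab = ["Frankreich", "Hund", "Katze"]
l2_matrix = np.array([[0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
france_vec = np.array([1.0, 0.2, 0.0])  # hypothetical vector for "France"
print(nearest_l2_word(france_vec, l2_vocab, l2_matrix))  # Frankreich
```

Alignment accuracy is then the share of dictionary entries for which the top-ranked L2 word matches the ground-truth translation.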
13. The details of how BERT can make the word embeddings context aware are beyond the scope of this paper. Readers can refer to Devlin et al. (Citation2018) for further details.
14. An example of a French-language New York Times article: https://www.nytimes.com/2015/11/17/opinion/les-attentats-a-paris-revelent-les-limites-de-daesh.html As our original application is to track news topics across outlets in different languages, it is inconvenient to have multiple languages within one outlet. Therefore, we decided to remove these three articles. However, M-BERT does have built-in support for language detection.
15. We reuse the terminology of Benoit et al. (Citation2018).
16. Please refer to the online appendix and the reproducible material for how to use the package. The package is available here: https://github.com/chainsawriot/rectr
17. https://fasttext.cc/docs/en/crawl-vectors.html
18. This number of dimensions represents the total number of hidden units in the artificial neural network used to learn the meanings of words.
19. https://github.com/google-research/bert/blob/master/multilingual.md
20. Reproducing these word embeddings and language models is theoretically possible, although it requires a massive amount of computational power. The chief author of this paper successfully reproduced the training of fastText word embeddings and aligned them using the program provided by Facebook Research with an off-the-shelf Linux computer. It took two weeks. More advanced models, such as BERT or M-BERT, require even more computational power (e.g., Google Research trained M-BERT with computers that had specialized tensor processing unit chips, each with 64 GB of RAM). Third-party researchers with such computational power have successfully reproduced those results. For example, French AI firm Hugging Face has reproduced both BERT and M-BERT and made them available to the public.
21. For GMM, it is possible to converge to a solution with fewer than k topics, which indicates that the chosen k is too large: the data do not provide enough variation to fit that many topics.
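This collapse of surplus components can be illustrated with scikit-learn. Note that the sketch below uses the Bayesian (variational) variant of the Gaussian mixture purely because it makes the shrinkage of unused components easy to observe, and the data are simulated, not our corpus; rectr itself fits a regular GMM.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Simulated low-dimensional document representations with only
# two genuine clusters of variation
X = np.vstack([rng.normal(-5, 0.3, (200, 2)),
               rng.normal(5, 0.3, (200, 2))])

# Ask for k = 10 topics; surplus components receive near-zero weight
gmm = BayesianGaussianMixture(n_components=10, random_state=1).fit(X)
effective_k = int((gmm.weights_ > 0.01).sum())
print(effective_k)  # fewer than the requested k = 10
```

A researcher who observes `effective_k` well below the requested k has a signal that k was set too high for the data at hand.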
22. The correlation between the two trilingual coders is 0.502 (95% CI = 0.401–0.591). The intercoder reliability, as measured by Krippendorff’s alpha, is moderate (α for ordinal scale: 0.451). As indicated by the moderate correlation and intercoder reliability, cross-lingual topic similarity is a complex latent construct that can vary between individuals. We also recalculated the intercoder reliability as Cohen’s kappa and obtained a value of 0.492. This is slightly lower than the kappa of 0.58 reported by Hatzivassiloglou et al. (Citation1999), who carried out a similar annotation task of topic similarities in news. However, our task is much more difficult: the task in Hatzivassiloglou et al. (Citation1999) was monolingual and the coders evaluated only paragraphs, whereas our task is trilingual and our coders evaluated full articles.
23. The same graph for mb-rectr and ft-stm is available in Appendix VII.
24. Similar to the “Amsterdam Embedding Model” (https://github.com/annekroon/AEM)
25. We took advantage of this extensibility ourselves. The original implementation of rectr supported only aligned fastText. We extended the software to add support for M-BERT in a very short time frame.
26. On a typical computer, we can complete step 2 of our analysis in less than 10 minutes using aligned fastText. In comparison, the same task using M-BERT takes 4 hours.