ABSTRACT
With global media content databases and online content now widely available, analyzing topical structures in multiple languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to capture the cross-lingual meanings of words and includes a mechanism to normalize the residual influence of language differences. We present a benchmark comparing the topics extracted from a corpus of English, German, and French news using our method with those produced by methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied to tracking news topics across time and languages.
Acknowledgements
The authors would like to thank Fabienne Lind (Computational Communication Science Lab, University of Vienna) for her comments that greatly improved this manuscript, and Valeria Glauser (University of Zurich) for her help in trilingual coding.
Disclosure Statement
No potential conflict of interest was reported by the authors.
Data availability statement
The data that support the findings of this study are openly available in OSF at https://doi.org/10.17605/osf.io/e2mhy
Notes
1. English translation: For a large class of cases – though not for all – in which we employ the word meaning, it can be defined thus: the meaning of a word is its use in the language.
2. Extensions allowing multilingual topic modeling exist, but major toolkits do not include these new implementations, the exception being the Polylingual Topic Model (Mimno et al., Citation2009) in the MALLET toolkit. Such extensions are not very useful for most cases due to their reliance on parallel corpora (i.e., every document in the corpus must have equivalents in all the languages). At the time of writing, only a handful of social science studies have adopted such methods (e.g., Lind et al., Citation2019a; Pruss et al., Citation2019).
3. The term multilingual corpus in computational linguistics can mean two things: parallel corpus and comparable corpus. A parallel corpus is one where a document in the corpus must have all direct multilingual equivalents. An example is the European Parliament dataset, in which a speech is manually translated into 21 European languages. We argue in footnote 2 that this kind of corpus is rare and has limited applicability for communication research. In this paper, multilingual corpus means a comparable corpus, in which multilingual articles are in the same category (e.g., news articles) but not directly equivalent to each other. Examples of comparable corpora are the ones used in Reber (Citation2019) and Lucas et al. (Citation2015) and public datasets such as the Manifesto Project and Mannheim International News Discourse Data Set.
4. For example, cross-lingual news topics are topics that attract similar attention from news outlets in different languages. Thematic news topics, such as sports and culture, are usually cross-lingual. Episodic news topics that attract worldwide attention (e.g., terrorist attacks or a global pandemic) are also likely to be cross-lingual.
5. However, the Google Translate API has a rate limit even for the paid service. For this project, we needed to translate 2,314 articles (or 10,311,329 characters) in German and French into English. Including the waiting time, this process took 5 hours and cost USD$206. Although it is much cheaper than manual translation, it still has a cost.
6. The previous case of reaching this holy grail status was crowdsourcing. Researchers announced the service to be “cheap, good, and fast” (Snow et al., Citation2008), but that was before the rapid downfall due to bots (Chmielewski & Kucker, Citation2019).
7. Besides, one cannot assume these third-party services will exist forever.
8. The latest version of Google Translate and DeepL translates this sentence as “From the animal welfare association, I have a hangover.”
9. English translation: The room is bright.
10. Hell in German and hell in English are false friends (faux amis), or bilingual homophones.
11. Using the DTM translation approach by Reber (Citation2019), the two columns are lumped together in all cases.
12. Joulin et al. (Citation2018) measured this alignment accuracy by using the aligned fastText word embeddings in a “word translation as a retrieval task.” The ground truth was based on a bilingual dictionary between two languages L1 and L2 (e.g., English and German). Given the L1 word embeddings vector of a word (e.g., France), the cosine distance between it and all aligned L2 word embeddings vectors was calculated. This alignment accuracy measure quantifies how likely the word embeddings vector of the L2 equivalent word according to the ground truth dictionary (Frankreich in this example) is to show up as the one with the shortest distance to the input L1 word embeddings vector.
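The retrieval task described in this note can be sketched as follows. The vocabulary and vectors below are toy stand-ins, not the actual aligned fastText embeddings; only the nearest-neighbor-by-cosine logic is the point.

```python
import numpy as np

def nearest_l2_word(query_vec, l2_vocab, l2_matrix):
    """Return the L2 word whose aligned embedding vector has the
    highest cosine similarity (shortest cosine distance) to query_vec."""
    # Normalize so that dot products equal cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    m = l2_matrix / np.linalg.norm(l2_matrix, axis=1, keepdims=True)
    return l2_vocab[int(np.argmax(m @ q))]

# Toy aligned space: the English vector for "France" should retrieve
# its German dictionary equivalent "Frankreich".
l2_vocab = ["Frankreich", "Hund", "Katze"]
l2_matrix = np.array([[0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
france_vec = np.array([1.0, 0.2, 0.0])  # hypothetical vector for "France"
print(nearest_l2_word(france_vec, l2_vocab, l2_matrix))  # Frankreich
```

Alignment accuracy is then the share of dictionary entries for which the top-ranked L2 word matches the ground-truth translation.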
13. The details of how BERT can make the word embeddings context aware are beyond the scope of this paper. Readers can refer to Devlin et al. (Citation2018) for further details.
14. An example of a French-language New York Times article: https://www.nytimes.com/2015/11/17/opinion/les-attentats-a-paris-revelent-les-limites-de-daesh.html As our original application is to track news topics across outlets in different languages, it is inconvenient to have multiple languages within one outlet. Therefore, we decided to remove these three articles. However, M-BERT does have built-in support for language detection.
15. We reuse the terminology of Benoit et al. (Citation2018).
16. Please refer to the online appendix and the reproducible material for how to use the package. The package is available here: https://github.com/chainsawriot/rectr
17. https://fasttext.cc/docs/en/crawl-vectors.html
18. This number of dimensions represents the total number of hidden units in the artificial neural network used to learn the meanings of words.
19. https://github.com/google-research/bert/blob/master/multilingual.md
20. Reproducing these word embeddings and language models is theoretically possible, although it requires a massive amount of computational power. The chief author of this paper successfully reproduced the training of fastText word embeddings and aligned them using the program provided by Facebook Research with an off-the-shelf Linux computer. It took two weeks. More advanced models, such as BERT or M-BERT, require even more computational power (e.g., Google Research trained M-BERT with computers that had specialized tensor processing unit chips, each with 64 GB of RAM). Third-party researchers with such computational power have successfully reproduced those results. For example, French AI firm Hugging Face has reproduced both BERT and M-BERT and made them available to the public.
21. For GMM, it is possible to converge to a solution with fewer than k topics, which indicates that the chosen k is too large: the data do not provide enough variation to fit that many topics.
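This collapse of surplus components can be illustrated with scikit-learn. Note that the sketch below uses the Bayesian (variational) variant of the Gaussian mixture purely because it makes the shrinkage of unused components easy to observe, and the data are simulated, not our corpus; rectr itself fits a regular GMM.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Simulated low-dimensional document representations with only
# two genuine clusters of variation
X = np.vstack([rng.normal(-5, 0.3, (200, 2)),
               rng.normal(5, 0.3, (200, 2))])

# Ask for k = 10 topics; surplus components receive near-zero weight
gmm = BayesianGaussianMixture(n_components=10, random_state=1).fit(X)
effective_k = int((gmm.weights_ > 0.01).sum())
print(effective_k)  # fewer than the requested k = 10
```

A researcher who observes `effective_k` well below the requested k has a signal that k was set too high for the data at hand.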
22. The correlation between the two trilingual coders is 0.502 (95% CI = 0.401–0.591). The intercoder reliability, as measured by Krippendorff’s alpha, is moderate (α for ordinal scale: 0.451). As indicated by the moderate correlation and intercoder reliability, cross-lingual topic similarity is a complex latent construct that can vary between individuals. We also recalculated the intercoder reliability as Cohen’s kappa and obtained a value of 0.492. This is slightly lower than the kappa of 0.58 reported by Hatzivassiloglou et al. (Citation1999), who carried out a similar annotation task of topic similarities in news. However, our task is much more difficult: the task in Hatzivassiloglou et al. (Citation1999) was monolingual and the coders evaluated only paragraphs, whereas our task is trilingual and our coders evaluated full articles.
23. The same graph for mb-rectr and ft-stm is available in Appendix VII.
24. Similar to the “Amsterdam Embedding Model” (https://github.com/annekroon/AEM)
25. We took advantage of this extensibility ourselves. The original implementation of rectr supported only aligned fastText. We extended the software to add support for M-BERT in a very short time frame.
26. On a typical computer, we can complete step 2 of our analysis in less than 10 minutes using aligned fastText. In comparison, the same task using M-BERT takes 4 hours.