Click, click boom: Using Wikipedia data to predict changes in battle-related deaths

Pages 678-696 | Received 15 Dec 2020, Accepted 23 Mar 2022, Published online: 29 Apr 2022
 

Abstract

Developing data and methods is key to improving our ability to forecast conflict. Relatively recent data sources such as mobile phone data, social media data, or images have received widespread attention in conflict research. Often, however, these sources do not cover substantial parts of the globe, or they are difficult to obtain and manipulate, which makes regular updating challenging. The sometimes vast amounts of data can also be computationally and financially costly. The data source we propose instead is cheap, readily and openly available, updated in real time, and global in coverage: Wikipedia. We argue that the number of country page views can be considered a measure of interest or salience, whereas the number of page changes can be considered a measure of controversy between competing political views. We expect these predictors to be particularly successful in capturing tensions before a conflict escalates. We test our argument by predicting changes in battle-related deaths in Africa at the country-month level. We find evidence that country page views increase predictive performance while page changes do not. Contrary to our expectation, our model seems to capture long-term trends better than sharp short-term changes.


Acknowledgements

We want to thank Michael Colaresi, Håvard Hegre, and Paola Vesco for organizing the Violence Early Warning System prediction competition and workshop and for dealing with all the problems we caused. We also want to thank the scoring committee (Mike Ward, Nils Weidmann, Adeline Lo, and Gregor Reisch) and all competition and workshop participants. In addition, we thank the participants of the PolMeth Europe Conference 2021, the European Political Science Association Annual Conference 2021, the Jan Tinbergen European Peace Science Conference 2021, the Society for Political Methodology Annual Conference 2021, and the Predicting Conflict Workshop, as well as two anonymous reviewers and the editors, for their helpful comments and feedback. We also thank Avner Bar-Hen and François-Xavier Jollois for very timely responses to queries about the WikipediaR package.

Data availability

Replication materials and instructions are available at http://dvn.iq.harvard.edu/dvn/dv/internationalinteractions.

Notes

1 More details about the framework, the full benchmark models, and the scoring metrics can be found in Hegre, Vesco, and Colaresi (2022) and Vesco et al. (2022).

2 The true forecasts had to be submitted to the ViEWS team by 30 September 2020 and were stored for future evaluation.

3 The most unique insights were obtained for the test set, and the best average model ablation loss score was obtained for the training set.

4 There are a number of applications where Google Trends data are useful, however, such as protest detection and forecasting (Timoneda and Wibbels 2022) or measuring and visualizing issue salience for difficult-to-survey populations (Chykina and Crabtree 2018). For a recent overview of Google Trends usage in different research areas, see Jun, Yoo, and Choi (2018). For this research, however, Wikipedia data are likely better suited.

5 See, for example, the influence of propaganda through radio stations (Yanagizawa-Drott 2014), the use of news articles in print media to predict conflict onsets (Mueller and Rauh 2018; Chadefaux 2014), or social media (Zeitzoff 2017) such as Twitter data to analyze different Gaza conflicts (Zeitzoff 2011, 2018).

6 In addition, while the number of linked disputed articles from country pages would be a reasonable controversy indicator, for country pages themselves this approach would yield only a binary indicator of whether the page was disputed in a given month. Such an indicator would disregard the variation in actual changes made, which we deemed more appropriate for the task of predicting changes in fatalities. The function in the R package is also still under development.

7 We did collect data from the French Wikipedia, however, and performed all the analyses on this dataset too without observing major changes to the results presented here. Results for the French Wikipedia can be found in the Supplemental Material.

8 We do not have information on the geographical locations of page views and edits to test these expectations.

9 For more details on the outcome variable, see Hegre, Vesco, and Colaresi (2022, Forthcoming) and Vesco et al. (2022) in this issue. A visualization of the log-transformed fatality counts for three countries can be found in the Supplemental Material.
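As a rough illustration of an outcome built from log-transformed fatality counts, the sketch below computes the change in log(fatalities + 1) between two months. The "+ 1" offset, the function name, and the example counts are assumptions made for illustration; the exact definition of the outcome variable is given in the papers cited in note 9.

```python
import math


def log_change(fatalities_now, fatalities_later):
    """Change in log(fatalities + 1) between two months (assumed outcome form).

    log1p(x) computes log(1 + x), so months with zero fatalities are defined.
    """
    return math.log1p(fatalities_later) - math.log1p(fatalities_now)


# Hypothetical counts: an escalation yields a positive value,
# a de-escalation a negative one, and no change yields zero.
escalation = log_change(0, 9)
de_escalation = log_change(9, 0)
no_change = log_change(5, 5)
```

Working on the log scale means that a jump from 0 to 9 fatalities and a fall from 9 to 0 have the same magnitude with opposite signs.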

10 While WikipediaR can collect data prior to 2008, we limited ourselves to data starting in 2008 to be able to combine them with data gathered with wikipediatrend.

11 We did not take page views of linked articles into account as wikipediatrend does not accommodate page redirections, at least at the time of analysis, while WikipediaR does.

12 Summary statistics for additional ratio variables that we specified, but that did not improve predictive performance, can be found in the Supplemental Material. Time series plots showing variation in Wikipedia page views are available in the Supplemental Material as well.

13 None of these variables helped decrease MSE or TADDA; we will return to this in the discussion and conclusion section.

14 We chose the 30 features that contributed the most to decreasing the weighted impurity of the nodes across all trees based on the training data. The most important feature, number of fatalities at t-1, had a score of 0.20, whereas features 28 through 30 had scores of 0.003. We deemed this a good cutoff as the added value of including more variables was low. Relying on a smaller subset of features was also necessary due to limited computing power. Feature importance tables based on permutation and the test data for our out-of-sample forecasts, which are more indicative of the predictive power of each variable, can be found in the Supplemental Material.
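The cutoff described in note 14 can be sketched as a simple ranking step. This is a minimal illustration that assumes the impurity-based importance scores are already available as a mapping from feature name to score; the function name, the feature names, and all scores other than the 0.20 and 0.003 values quoted in the note are hypothetical.

```python
def top_k_features(importances, k=30):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores; only 0.20 and 0.003 echo values from the note.
scores = {
    "fatalities_lag1": 0.20,
    "pageviews_en": 0.05,
    "population": 0.003,
    "rarely_used_feature": 0.0001,
}

# Keep the three highest-scoring features in this toy example (the note uses k=30).
selected = top_k_features(scores, k=3)
```

As the note itself cautions, impurity-based scores come from the training data; permutation importance on held-out data is more indicative of predictive power.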

15 Results using the French-language Wikipedia page views and other model specifications can be found in the Supplemental Material. These additional specifications include (1) interaction terms between English- and French-language page views and variables indicating whether English or French is a main language spoken in a country, (2) an Arab Spring dummy variable interacted with the English-language page views, and (3) first differences of monthly page views in place of raw page view numbers. The results are by and large very similar.
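Two of the transformations mentioned in note 15, first differences and interaction terms, can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names and the page view numbers are hypothetical.

```python
def first_differences(series):
    """Month-over-month change; the first month has no predecessor, so it is None."""
    return [None] + [curr - prev for prev, curr in zip(series, series[1:])]


def interact(values, indicator):
    """Multiply each value by a 0/1 indicator, e.g. 1 if English is a main language."""
    return [value * indicator for value in values]


# Hypothetical monthly page views for one country page.
views = [1200, 1500, 1400, 2100]

diffs = first_differences(views)            # [None, 300, -100, 700]
views_if_english = interact(views, 1)       # unchanged when the indicator is 1
views_if_not_english = interact(views, 0)   # all zeros when the indicator is 0
```

First differences remove the level of a series and keep only its month-to-month movement, which is why they are a natural alternative to raw page view numbers.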

16 See Vesco et al. (2022) for a more detailed discussion of performance metrics and TADDA in particular.

17 Error maps for the seven-months-ahead forecasts can be found in the Supplemental Material. The results for the MSE are similar with the exception of Libya.

18 A plot showing how violence evolved over time in the three countries can be found in the Supplemental Material.
