How to Avoid Pitfalls in Statistical Analysis of Political Texts: The Case of Germany: German Politics: Vol 18, No 3

Abstract

The statistical analysis of political texts has gained a prominent place in the study of party politics, coalition formation and legislative decision making in Germany. Yet we still lack a thorough understanding of the conditions under which such analysis produces valid estimates of policy positions. This article examines the properties of the word scaling method ‘Wordfish’ and uses the technique to estimate party positions in Germany. Through Monte Carlo simulations, we investigate how the choice of texts, including the number of documents analysed and their length, affects party position estimates. Moreover, we present guidelines on how to process linguistic information for political scientists interested in using the technique, focusing specifically on German texts. Finally, we present an analysis of the German party system from 1969–2005 using the Wordfish algorithm. We demonstrate the robustness of the algorithm in extracting left-right positions for various subsets of words, but show that agenda effects dominate when estimating a long time series if the entire manifesto corpus is analysed.

ACKNOWLEDGEMENTS

The order of authors' names reflects the principle of rotation. Both authors have contributed equally to all work. A previous version of this article was presented at the Workshop on Estimating Policy Preferences hosted by the Mannheim Centre for European Social Research in June 2008.

Notes

Kathleen Bawn, ‘Money and Majorities in the Federal Republic of Germany: Evidence for a Veto Players Model of Government Spending’, American Journal of Political Science 43/3 (1999), pp.707–36.

Thomas König and Thomas Bräuninger, ‘The Checks and Balances of Party Federalism: German Federal Government in a Divided Legislature’, European Journal of Political Research 36/2 (1999), pp.207–34; Thomas König, ‘Bicameralism and Party Politics in Germany: an Empirical Social Choice Analysis’, Political Studies 49 (2001), pp.411–37. See also German Politics 17/3 (2008) for a general discussion.

Sven-Oliver Proksch and Jonathan B. Slapin, ‘Institutions and Coalition Formation: The German Election of 2005’, West European Politics 29/3 (2006), pp.540–59.

Marc Debus, ‘Party Competition and Government Formation in Multilevel Settings: Evidence from Germany’, Government and Opposition 43/4 (2008), pp.505–38.

Kenneth Benoit and Michael Laver, Party Policy in Modern Democracies (London: Routledge, 2006).

Ian Budge, Hans-Dieter Klingemann, Andrea Volkens and Judith Bara, Mapping Policy Preferences II: Estimates for Parties, Electors and Governments in Central and Eastern Europe, European Union and OECD 1990–2003 (Oxford: Oxford University Press, 2006).

Jonathan B. Slapin and Sven-Oliver Proksch, ‘A Scaling Model for Estimating Time-Series Party Positions from Texts’, American Journal of Political Science 52/3 (2008), pp.705–22.

Ibid. Wordfish has been implemented in the R statistical language. Current code is available at www.wordfish.org.

Michael Laver, Kenneth Benoit and John Garry, ‘Extracting Policy Positions from Political Texts Using Words as Data’, American Political Science Review 97/3 (2003), pp.311–32.

Ibid., pp.329–30.

Sven-Oliver Proksch and Jonathan B. Slapin, ‘Position Taking in European Parliament Speeches’, British Journal of Political Science (forthcoming).

See Laver et al., ‘Extracting Policy Positions from Political Texts’, p.329. Text mining software, such as the free text mining package TM for R, allows the fast counting of any n-gram in any language (Ingo Feinerer, Kurt Hornik and David Meyer, ‘Text Mining Infrastructure in R’, Journal of Statistical Software 25/5 (2008), pp.1–54).

Keith Poole and Howard Rosenthal, ‘A Spatial Model for Legislative Roll Call Analysis’, American Journal of Political Science 29/2 (1985), pp.357–84.

Joshua Clinton, Simon Jackman and Douglas Rivers, ‘The Statistical Analysis of Roll Call Data’, American Political Science Review 98 (2004), pp.355–70; Andrew D. Martin and Kevin M. Quinn, ‘Dynamic Ideal Point Estimation via Markov Chain Monte Carlo for the U.S. Supreme Court, 1953–1999’, Political Analysis 10 (2002), pp.134–53.

Susana Eyheramendy, David Lewis and David Madigan, ‘On the Naive Bayes Model for Text Categorization’, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (2003); David D. Lewis, ‘Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval’, Proceedings of the 10th European Conference on Machine Learning (1998), pp.4–15.

Andrew McCallum and Kamal Nigam, ‘A Comparison of Event Models for Naive Bayes Text Classification’, AAAI-98 Workshop on Learning for Text Categorization (1998).

Feinerer et al., ‘Text Mining Infrastructure in R’, p.10.

Frederick Mosteller and David L. Wallace, Applied Bayesian and Classical Inference: The Case of The Federalist Papers (New York: Springer Verlag, 1964).

Ibid.

Kenneth W. Church and William A. Gale, ‘Poisson Mixtures’, Natural Language Engineering 1/2 (1995), pp.163–90.

Martin Jansche, ‘Parametric Models of Linguistic Count Data’, 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan (2003), pp.288–95.

Formally, y_ijt ∼ Poisson(λ_ijt), where y_ijt is the count of word j in party i's manifesto at time t. The lambda parameter has the systematic component λ_ijt = exp(α_it + ψ_j + β_j × ω_it), with α as a set of document (party-election year) fixed effects, ψ as a set of word fixed effects, β as estimates of word-specific weights capturing the importance of word j in discriminating between manifestos, and ω as the estimate of party i's position in election year t (it therefore indexes one specific manifesto). See also Slapin and Proksch, ‘A Scaling Model for Estimating Time-Series Party Positions from Texts’, for a more detailed discussion.
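As a minimal sketch of this model, the Python snippet below computes the Poisson rate and the log-likelihood for a small count matrix. It assumes the log-linear form λ = exp(α + ψ + β × ω) implied by the parameter definitions above; the function names and the toy data are hypothetical and are not taken from the Wordfish code.

```python
import math

def poisson_rate(alpha_i, psi_j, beta_j, omega_i):
    """Expected count of word j in document i: exp(alpha + psi + beta * omega)."""
    return math.exp(alpha_i + psi_j + beta_j * omega_i)

def log_likelihood(counts, alpha, psi, beta, omega):
    """Poisson log-likelihood summed over documents (rows) and words (columns)."""
    total = 0.0
    for i, row in enumerate(counts):
        for j, y in enumerate(row):
            lam = poisson_rate(alpha[i], psi[j], beta[j], omega[i])
            # log Poisson density: y * log(lam) - lam - log(y!)
            total += y * math.log(lam) - lam - math.lgamma(y + 1)
    return total

# Hypothetical toy data: 2 documents, 3 words.
counts = [[4, 0, 2], [1, 3, 2]]
alpha = [0.0, 0.1]          # document fixed effects
psi = [0.5, 0.2, 0.7]       # word fixed effects
beta = [1.0, -1.0, 0.0]     # word weights
omega = [0.8, -0.8]         # document positions
print(round(log_likelihood(counts, alpha, psi, beta, omega), 3))
```

A word with β close to zero (the third word here) contributes nothing to separating the two documents, whatever their positions.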

The appendix lists the length of manifestos in words.

Since the only data in the model are word counts (the ‘dependent variable’), we cannot estimate all parameters on the right-hand side of the equation simultaneously. Given some starting values, however, we can estimate document parameters conditional on word parameters. This yields new estimates for the document parameters, which are then used as data to re-estimate the word parameters. Wordfish employs this iterative procedure, an Expectation-Maximization algorithm: first, party parameters are held fixed at a certain value while word parameters are estimated; then word parameters are held fixed at their new values while the party positions are estimated. This process is repeated until the parameter estimates reach an acceptable level of convergence. For a more detailed description of the estimation process, see ibid.
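The alternating logic can be illustrated with a small Python sketch. It is not the Wordfish implementation: as a simplifying assumption, each conditional step is a plain gradient step with step-halving rather than a full conditional maximization, no identifying restriction is imposed, and all names and data are hypothetical.

```python
import math

def loglik(Y, alpha, psi, beta, omega):
    """Poisson log-likelihood up to a constant (the log(y!) term is dropped)."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            eta = alpha[i] + psi[j] + beta[j] * omega[i]
            total += y * eta - math.exp(eta)
    return total

def residuals(Y, alpha, psi, beta, omega):
    """Observed minus expected counts for every document-word cell."""
    return [[y - math.exp(alpha[i] + psi[j] + beta[j] * omega[i])
             for j, y in enumerate(row)] for i, row in enumerate(Y)]

def update_words(Y, alpha, psi, beta, omega, step):
    """Gradient step on psi and beta; document parameters are held fixed."""
    R = residuals(Y, alpha, psi, beta, omega)
    cols = range(len(psi))
    new_psi = [psi[j] + step * sum(row[j] for row in R) for j in cols]
    new_beta = [beta[j] + step * sum(R[i][j] * omega[i] for i in range(len(Y)))
                for j in cols]
    return alpha, new_psi, new_beta, omega

def update_documents(Y, alpha, psi, beta, omega, step):
    """Gradient step on alpha and omega; word parameters are held fixed."""
    R = residuals(Y, alpha, psi, beta, omega)
    new_alpha = [alpha[i] + step * sum(R[i]) for i in range(len(Y))]
    new_omega = [omega[i] + step * sum(r * beta[j] for j, r in enumerate(R[i]))
                 for i in range(len(Y))]
    return new_alpha, psi, beta, new_omega

def improve(update, Y, params, current, step=0.1):
    """Halve the step until the update does not lower the log-likelihood."""
    while step > 1e-10:
        candidate = update(Y, *params, step)
        value = loglik(Y, *candidate)
        if value >= current:
            return candidate, value
        step /= 2
    return params, current

def fit(Y, omega_start, tol=1e-8, max_iter=500):
    """Alternate word and document updates until the log-likelihood stabilises."""
    params = ([0.0] * len(Y), [0.0] * len(Y[0]), [0.0] * len(Y[0]), list(omega_start))
    history = [loglik(Y, *params)]
    for _ in range(max_iter):
        params, value = improve(update_words, Y, params, history[-1])
        params, value = improve(update_documents, Y, params, value)
        history.append(value)
        if history[-1] - history[-2] < tol:
            break
    return params, history

# Hypothetical counts: 3 documents (rows) by 4 words (columns).
Y = [[8, 1, 4, 3], [4, 4, 4, 2], [1, 9, 4, 3]]
params, history = fit(Y, omega_start=[-1.0, 0.0, 1.0])
print(len(history) - 1, "iterations; log-likelihood", round(history[-1], 3))
```

Because a step is accepted only when it does not lower the log-likelihood, the recorded values never decrease, which mirrors the convergence check described in the note.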

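Without identifying restrictions, a transformation of the parameters (for example, shifting, rescaling or reversing the position dimension) can yield identical log-likelihoods.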
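The identification problem can be checked numerically. The sketch below assumes the log-linear rate exp(α + ψ + β × ω) and shows that shifting and rescaling the positions, with compensating changes to the word parameters, leaves every Poisson rate (and hence the log-likelihood) unchanged. The function names and values are hypothetical.

```python
import math

def rates(alpha, psi, beta, omega):
    """Matrix of Poisson rates exp(alpha_i + psi_j + beta_j * omega_i)."""
    return [[math.exp(a + p + b * w) for p, b in zip(psi, beta)]
            for a, w in zip(alpha, omega)]

def rescale(alpha, psi, beta, omega, shift, scale):
    """Map omega to (omega - shift) / scale and adjust psi and beta to compensate."""
    new_omega = [(w - shift) / scale for w in omega]
    new_beta = [b * scale for b in beta]
    new_psi = [p + b * shift for p, b in zip(psi, beta)]
    return alpha, new_psi, new_beta, new_omega

alpha = [0.2, -0.1]
psi = [0.5, 1.0, 0.3]
beta = [1.2, -0.7, 0.0]
omega = [-1.0, 0.6]

original = rates(alpha, psi, beta, omega)
# A negative scale also reverses the direction of the dimension.
transformed = rates(*rescale(alpha, psi, beta, omega, shift=0.4, scale=-2.0))
same = all(math.isclose(x, y, rel_tol=1e-9)
           for row_a, row_b in zip(original, transformed)
           for x, y in zip(row_a, row_b))
print(same)
```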

Douglas Rivers, ‘Identification of Multidimensional Spatial Voting Models’, Political Methodology Working Paper, 2003, available at http://polmeth.wustl.edu; Joshua Clinton, Simon Jackman and Douglas Rivers, ‘The Statistical Analysis of Roll Call Data’, American Political Science Review 98/2 (2004), pp.355–70.

Both identification strategies are implemented in the latest release of Wordfish.

To do this, document positions are fixed and range between −1.5 and 1.5. The word fixed effects and word discrimination parameters are drawn from normal distributions, and document fixed effects are set as a sequence of values. R code to run the simulation is available upon request from the authors.

Note that we vary the number of unique words, not the total number of words in each document; the total is determined by the parameter values we set to generate the data. To identify the model, we fix the two extreme omegas at −1.5 and +1.5. There are therefore no confidence intervals for the two extreme positions, as they are excluded from the estimation.
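A data-generating step of this kind might look as follows in Python. This is only a sketch under stated assumptions: the authors' simulation code is in R and is not reproduced here, so the even spacing of positions, the range of the document fixed effects, the standard deviations of the normal draws and the Poisson sampler are all illustrative choices.

```python
import math
import random

def draw_poisson(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method (modest rates only)."""
    limit, k, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def simulate(n_docs, n_words, seed=1):
    """Generate a documents-by-words count matrix from fixed positions."""
    rng = random.Random(seed)
    # Document positions are fixed on an even grid from -1.5 to 1.5 (spacing assumed).
    omega = [-1.5 + 3.0 * i / (n_docs - 1) for i in range(n_docs)]
    # Document fixed effects set as a sequence of values (range assumed).
    alpha = [0.5 * i / (n_docs - 1) for i in range(n_docs)]
    # Word fixed effects and word weights drawn from normal distributions (sd assumed).
    psi = [rng.gauss(0.0, 0.5) for _ in range(n_words)]
    beta = [rng.gauss(0.0, 0.5) for _ in range(n_words)]
    counts = [[draw_poisson(math.exp(alpha[i] + psi[j] + beta[j] * omega[i]), rng)
               for j in range(n_words)] for i in range(n_docs)]
    return counts, omega

counts, omega = simulate(n_docs=10, n_words=200)
# Only the number of unique words is chosen; document totals follow from the parameters.
print([sum(row) for row in counts])
```

Repeating the call with different values of `n_words` and `n_docs`, and re-estimating the positions each time, gives the Monte Carlo comparison described in the notes.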

Markku Laakso and Rein Taagepera, ‘Effective Number of Parties: A Measure With Application to Western Europe’, Comparative Political Studies 12/1 (1979), pp.3–27.

Thomas Saalfeld, ‘The German Party System: Continuity and Change’, German Politics 11/3 (2002), pp.99–130; Charles Lees, ‘The German Party System(s) in 2005: A Return to Volkspartei Dominance’, German Politics 15/4 (2006), pp.361–75.

In 1990, the West German Greens failed to surpass the electoral threshold. In contrast, the East German party Bündnis 90/Die Grünen gained parliamentary seats.

Slapin and Proksch, ‘A Scaling Model for Estimating Time-Series Party Positions from Texts’.

Ian Budge, ‘The Internal Analysis of Election Programmes’, in Ian Budge, David Robertson and Derek Hearl (eds.), Ideology, Strategy, and Party Change: Spatial Analyses of Post-War Election Programmes in 19 Democracies (Cambridge University Press, 1987), p.18.

Ibid.

Michael Laver and Kenneth Benoit, ‘Locating TDs in Policy Spaces: Wordscoring Dail Speeches’, Irish Political Studies 17 (2002), p.65.

Michael Laver and John Garry, ‘Estimating Policy Positions from Political Texts’, American Journal of Political Science 44/3 (2000), p.620.

Hans-Dieter Klingemann, ‘Electoral Programmes in West Germany 1949–1980: Explorations in the Nature of Political Controversy’, in Budge et al. (eds.), Ideology, Strategy, and Party Change: Spatial Analyses of Post-War Election Programmes in 19 Democracies, pp.294–323.

Ibid., p.300.

Ibid.

The list is based on appendix V included with the CMP dataset, see Ian Budge, Hans-Dieter Klingemann, Andrea Volkens and Judith Bara, Mapping Policy Preferences II: Estimates for Parties, Electors and Governments in Central and Eastern Europe, European Union and OECD 1990–2003 (Oxford: Oxford University Press, 2006).

Klingemann, ‘Electoral Programmes in West Germany’, p.301.

Laver et al., ‘Extracting Policy Positions from Political Texts’; but see Sven-Oliver Proksch and Jonathan B. Slapin, ‘Institutions and Coalition Formation: The German Election of 2005’, for an alternative strategy that uses only policy-area specific subsets of the reference manifestos.

We agree with the Wordscores approach on this need.

Slapin and Proksch, ‘A Scaling Model for Estimating’, pp.712–13.

Saving the file as Unicode (UTF-8) ensures cross-platform compatibility and preserves non-English characters, such as German vowels with umlauts, in a readable format. One should read the documents carefully after scanning them to ensure that characters were encoded correctly and that all parts of the document were properly scanned.
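A quick automated check can complement reading the scanned files by eye. The Python sketch below reads a file strictly as UTF-8 and flags a few character sequences that commonly indicate encoding damage in German text. The marker list is an illustrative assumption, not an exhaustive test, and the file name and sample sentence are hypothetical.

```python
import tempfile
from pathlib import Path

def read_manifesto(path):
    """Read a text file as UTF-8, raising an error if the bytes are not valid UTF-8."""
    return Path(path).read_text(encoding="utf-8", errors="strict")

def suspicious_characters(text):
    """Return markers that often signal encoding damage (replacement char, mojibake)."""
    # "Ã¤", "Ã¶", "Ã¼" and "ÃŸ" appear when UTF-8 umlauts or ß were decoded as a
    # single-byte Western encoding; "\ufffd" is the Unicode replacement character.
    markers = ["\ufffd", "Ã¤", "Ã¶", "Ã¼", "ÃŸ"]
    return [m for m in markers if m in text]

with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "example.txt"   # hypothetical file name
    sample.write_text("Für soziale Gerechtigkeit und ökologische Erneuerung",
                      encoding="utf-8")
    text = read_manifesto(sample)
    print(suspicious_characters(text))

# The same sentence decoded with the wrong encoding is flagged.
damaged = "Für".encode("utf-8").decode("latin-1")
print(suspicious_characters(damaged))
```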

Ian Budge, Hans-Dieter Klingemann, Andrea Volkens, Judith Bara and Eric Tanenbaum (eds.), Mapping Policy Preferences: Estimates for Parties, Electors, and Governments 1945–1998 (Oxford: Oxford University Press, 2001); Budge et al., Mapping Policy Preferences II.

The Comparative Electronic Manifestos Project is directed by Paul Pennings and Hans Keman, Vrije Universiteit Amsterdam, in collaboration with the Zentralarchiv für Empirische Sozialforschung, Universität zu Köln.

We thank Thomas König and Bernd Luig for making these files available to us. Each text file includes the full text of the manifesto listed in the appendix. Some texts include material that researchers may prefer to remove prior to estimation, such as listings of speakers or party names, self-references to party names, headers and footers, enumeration, bullets and section headings. This can be done either manually or with pattern matching in customized Perl or Python scripts. Software specifically designed for text processing can also perform many of these pre-processing tasks: it removes punctuation, numbers and stop words (i.e. words defined by the researcher that should be systematically removed from the text) and converts all capital letters to lower case, so that words are not counted differently merely because of capitalization. Researchers should also ensure that the spelling of words is consistent across documents, which may be particularly problematic in German given the recent reform of German spelling. In the version that we use, titles, candidate-oriented preambles, headers, footers and indices were removed, and spelling and grammar were corrected.
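The pre-processing steps listed in this note can be sketched in a few lines of Python. This is an illustration only, not the procedure the authors used: the stop-word set and the spelling map are hypothetical inputs that a researcher would have to supply, and the regular expression is one of several reasonable ways to strip punctuation and digits.

```python
import re

def preprocess(text, stopwords=(), spelling=None):
    """Lower-case, strip punctuation and digits, harmonise spelling, drop stop words."""
    spelling = spelling or {}
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space;
    # \w is Unicode-aware, so umlauts and ß survive.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Map old spellings to a single form so both variants count as the same word.
    tokens = [spelling.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in stopwords]

tokens = preprocess(
    "Wir fordern, daß 3 Millionen Arbeitsplätze entstehen.",
    stopwords={"wir"},            # hypothetical researcher-defined stop words
    spelling={"daß": "dass"},     # hypothetical map for the spelling reform
)
print(tokens)
```

Harmonising spelling before counting matters because a scaling model would otherwise treat pre-reform and post-reform spellings as different words, which confounds position with publication date.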

The word count matrix is constructed with the R text mining package TM. This package allows the removal of customized stop words, the use of dictionaries, and the consideration of bigrams instead of unigrams. Alternative word count tools include Jfreq and Yoshikoder.
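To make the structure of such a matrix concrete, here is a small Python stand-in. It does not use or reproduce the TM package's interface; the function names, document labels and tokens are hypothetical, and the input is assumed to be already tokenised.

```python
from collections import Counter

def ngrams(tokens, n=1):
    """Join each run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_count_matrix(documents, n=1):
    """Return (vocabulary, matrix) with one row per document and one column per feature."""
    counts = {name: Counter(ngrams(tokens, n)) for name, tokens in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = [[counts[name][term] for term in vocabulary] for name in documents]
    return vocabulary, matrix

# Hypothetical tokenised documents.
documents = {
    "Party A 1998": ["arbeit", "und", "arbeit"],
    "Party B 1998": ["freiheit", "und", "markt"],
}
vocabulary, matrix = word_count_matrix(documents)
print(vocabulary)
print(matrix)

# Bigrams instead of unigrams.
bigram_vocabulary, bigram_matrix = word_count_matrix(documents, n=2)
print(bigram_vocabulary)
```

Switching from unigrams to bigrams enlarges the vocabulary and makes the matrix sparser, which is worth keeping in mind given that the simulations vary the number of unique words.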

Feinerer et al., ‘Text Mining Infrastructure in R’, p.25. The TM package includes a list of 264 German stop words, but only a small proportion of them appear in the manifestos.

In the Congressional and US Supreme Court ideal point estimation literature, legislators' positions over time are either a polynomial function of previous positions (D-NOMINATE; see Keith T. Poole and Howard Rosenthal, Congress: A Political-Economic History of Roll Call Voting (New York: Oxford University Press, 1997)) or defined by a random walk process (see Andrew D. Martin and Kevin M. Quinn, ‘Dynamic Ideal Point Estimation via Markov Chain Monte Carlo for the U.S. Supreme Court, 1953–1999’, Political Analysis 10 (2002), pp.134–53; Andrew D. Martin and Kevin M. Quinn, ‘Assessing Preference Change on the U.S. Supreme Court’, Journal of Law, Economics, and Organization 23 (2007), pp.303–25). Wordfish does not place any such constraint on the data.

Burt L. Monroe, Michael P. Colaresi and Kevin M. Quinn, ‘Fightin’ Words: Lexical Selection and Evaluation for Identifying the Content of Political Conflict’, Political Analysis (forthcoming).

A prior on the distribution of word weights in Wordfish constrains the range of estimated values.

The model is identified by fixing the mean position at 0 and the standard deviation at 1 and by constraining the FDP in 1990 to have a smaller value than the PDS in 1990.
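This identification rule can be applied to a set of raw estimates as a post-processing step, sketched below in Python. The sketch assumes the log-linear rate exp(α + ψ + β × ω), uses the population standard deviation (dividing by the number of documents), and takes hypothetical raw values; it is not the Wordfish code.

```python
import math

def identify(alpha, psi, beta, omega, lower, higher):
    """Standardise positions to mean 0 and sd 1, oriented so omega[lower] < omega[higher]."""
    mean = sum(omega) / len(omega)
    sd = math.sqrt(sum((w - mean) ** 2 for w in omega) / len(omega))
    # Flip the sign of the scale if the direction constraint is violated.
    scale = sd if omega[lower] < omega[higher] else -sd
    new_omega = [(w - mean) / scale for w in omega]
    # Compensating changes keep every rate exp(alpha + psi + beta * omega) unchanged.
    new_beta = [b * scale for b in beta]
    new_psi = [p + b * mean for p, b in zip(psi, beta)]
    return alpha, new_psi, new_beta, new_omega

# Hypothetical raw estimates for three 1990 manifestos.
labels = ["FDP 1990", "SPD 1990", "PDS 1990"]
alpha = [0.0, 0.1, -0.2]
psi = [0.4, 0.9]
beta = [0.6, -0.3]
omega = [0.9, 0.1, -1.3]

_, new_psi, new_beta, new_omega = identify(
    alpha, psi, beta, omega,
    lower=labels.index("FDP 1990"), higher=labels.index("PDS 1990"),
)
print([round(w, 3) for w in new_omega])
```

The direction constraint only fixes which end of the scale is which; it does not affect the distances between documents.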

Laver et al., ‘Extracting Policy Positions from Political Texts’.
