ABSTRACT
It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing poll results to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures. We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.
Acknowledgments
The survey weights discussed in Footnote 2 are based on polls obtained from the iPOLL Databank provided by the Roper Center for Public Opinion Research at Cornell University. The data and code to replicate our results are available online at https://github.com/5harad/polling-errors.
Notes
1 One common technique for setting survey weights is raking, in which weights are defined so that the weighted distributions of various demographic features (e.g., age, sex, and race) of respondents in the sample agree with the marginal distributions in the target population (Voss, Gelman, and King 1995).
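Raking is typically implemented as iterative proportional fitting: weights are repeatedly rescaled, one feature at a time, until all weighted marginals match their targets. The following is a minimal sketch; the feature names, sample composition, and target shares are hypothetical, not data from the paper.

```python
import numpy as np

def rake(sample, targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting (raking): rescale weights so the
    weighted marginal distribution of each feature matches its target.

    sample:  dict mapping feature name -> array of category labels per respondent
    targets: dict mapping feature name -> dict of category -> population share
    """
    n = len(next(iter(sample.values())))
    w = np.ones(n)
    for _ in range(max_iter):
        max_change = 0.0
        for feat, codes in sample.items():
            for cat, share in targets[feat].items():
                mask = codes == cat
                current = w[mask].sum() / w.sum()  # weighted share in sample
                if current > 0:
                    factor = share / current
                    w[mask] *= factor
                    max_change = max(max_change, abs(factor - 1))
        if max_change < tol:  # all marginals match; stop early
            break
    return w / w.mean()  # normalize to mean weight 1

# Hypothetical example: reweight a skewed sample to 50/50 sex
# and 30/70 age marginals.
rng = np.random.default_rng(0)
sample = {
    "sex": rng.choice(["m", "f"], size=1000, p=[0.7, 0.3]),
    "age": rng.choice(["young", "old"], size=1000, p=[0.2, 0.8]),
}
targets = {"sex": {"m": 0.5, "f": 0.5}, "age": {"young": 0.3, "old": 0.7}}
w = rake(sample, targets)
```

After raking, respondents in underrepresented groups carry weights above 1 and overrepresented groups below 1, so the weighted sample matches both sets of marginals simultaneously.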
2 In a sample of 96 polls for 2012 Senate elections, only 19 reported margins of error higher than what one would compute using the SRS formula, and 14 of these exceptions were accounted for by YouGov, a polling organization that explicitly notes that it inflates variance to adjust for the survey weights. Similarly, in a sample of 36 state-level polls for the 2012 presidential election, only 9 reported higher-than-SRS margins of error. Complete survey weights are available for 21 ABC, CBS, and Gallup surveys conducted during the 2012 election and deposited in the Roper Center's iPOLL. To account for the weights in these surveys, standard errors should on average be multiplied by 1.3 (with an interquartile range of 1.2–1.4 across the surveys) relative to the standard errors computed assuming SRS.
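An inflation factor of this kind can be approximated directly from the survey weights via the Kish design effect, deff = n·Σw² / (Σw)², with the SRS standard error multiplied by √deff. The sketch below illustrates the computation; the example weights are hypothetical, not from the surveys discussed above.

```python
import numpy as np

def se_inflation(weights):
    """Kish approximation: multiply the SRS standard error by sqrt(deff),
    where deff = n * sum(w^2) / (sum(w))^2, i.e., 1 + CV(w)^2."""
    w = np.asarray(weights, dtype=float)
    n = len(w)
    deff = n * np.sum(w**2) / np.sum(w)**2
    return np.sqrt(deff)

# Equal weights imply no inflation; variable weights inflate the SE.
print(se_inflation([1.0, 1.0, 1.0, 1.0]))   # 1.0
w = [0.5, 0.8, 1.0, 1.5, 2.2]               # hypothetical raked weights
print(round(se_inflation(w), 3))            # ≈ 1.117
```

The more variable the weights, the larger the design effect, which is why heavily adjusted surveys should report wider intervals than the SRS formula suggests.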
3 Most reported margins of error assume estimates are unbiased, and correspond to 95% confidence intervals of approximately ±3.5 percentage points for a sample of 800 respondents. This in turn implies the RMSE for such a sample is approximately 1.8 percentage points, about half of our empirical estimate of RMSE. As discussed in Footnote 2, many polling organizations do not adjust for survey weights when computing uncertainty estimates, which in part explains this gap.
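The arithmetic in this footnote can be checked directly: under SRS with p ≈ 0.5 and n = 800, the standard error is √(p(1−p)/n) ≈ 1.8 percentage points, and the 95% margin of error is 1.96 times that, ≈ 3.5 points. For an unbiased estimator, the RMSE equals the standard error.

```python
import math

n, p = 800, 0.5
se = math.sqrt(p * (1 - p) / n) * 100  # SRS standard error, in percentage points
moe = 1.96 * se                        # 95% margin of error
print(round(se, 2), round(moe, 2))     # prints 1.77 3.46
```

So a reported ±3.5-point margin of error implicitly claims an RMSE of about 1.8 points, roughly half the 3.5-point average error found empirically in the paper.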
4 Let N be the number of polls. For each poll i ∈ {1, …, N}, let y_i denote the two-party support for the Republican candidate, and let v_{r[i]} denote the final two-party vote share of the Republican candidate in the corresponding election r[i]. Then, RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - v_{r[i]}\right)^2}.
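This definition translates directly into code: index each poll's election, take the per-poll error against the final vote share, and average the squared errors. The data below are hypothetical, purely for illustration.

```python
import numpy as np

def poll_rmse(y, v, race):
    """RMSE of two-party poll estimates against final vote shares.

    y    : poll-level Republican two-party support, length N
    v    : election-level final Republican two-party vote shares
    race : r[i], the index mapping each poll to its election
    """
    y, v, race = np.asarray(y), np.asarray(v), np.asarray(race)
    return np.sqrt(np.mean((y - v[race]) ** 2))

# Hypothetical example: two elections, three polls.
y = np.array([0.52, 0.49, 0.47])
v = np.array([0.50, 0.45])      # final vote shares for elections 0 and 1
race = np.array([0, 0, 1])      # polls 0,1 -> election 0; poll 2 -> election 1
print(poll_rmse(y, v, race))    # ≈ 0.0173, i.e., about 1.7 percentage points
```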
5 To clarify our notation, we note that for each poll i, r[i] denotes the election for which the poll was conducted, and αr[i], βr[i], and τr[i] denote the corresponding coefficients for that election. Thus, for each election j, there is one (αj, βj, τj) triple. Our model allows for a linear time trend (βj), but we note that our empirical results are qualitatively similar even without this term.
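One plausible reading of this structure can be illustrated by simulation: each election j contributes a shared bias intercept α_j, a linear time trend β_j, and excess variance τ_j beyond sampling noise. Every distributional choice and parameter value below is an assumption for illustration only, not the paper's model specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values (assumed, not estimated from the paper's data).
n_elections, polls_per = 50, 8
alpha = rng.normal(0, 0.02, n_elections)        # election-level bias intercepts
beta = rng.normal(0, 0.01, n_elections)         # election-level time trends
tau = np.abs(rng.normal(0, 0.01, n_elections))  # excess SD beyond sampling noise
v = rng.uniform(0.4, 0.6, n_elections)          # true two-party vote shares

race = np.repeat(np.arange(n_elections), polls_per)  # r[i]: poll -> election map
t = rng.uniform(0, 1, race.size)                     # rescaled time of each poll
n = 800                                              # respondents per poll
srs_sd = np.sqrt(v[race] * (1 - v[race]) / n)        # SRS sampling noise

# Poll estimate = truth + election bias + time trend + (sampling + excess) noise.
y = rng.normal(v[race] + alpha[race] + beta[race] * t,
               np.sqrt(srs_sd**2 + tau[race]**2))

# Empirical election-level bias: mean poll error within each election.
bias_hat = np.array([np.mean(y[race == j] - v[j]) for j in range(n_elections)])
print(round(np.mean(np.abs(bias_hat)), 4))  # average absolute election-level bias
```

The simulation makes the footnote's point concrete: because polls within an election share α_j, averaging more polls for the same race does not wash out this component of error.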
6 To calculate these numbers, we removed an extreme outlier, not shown in the figure, corresponding to polls conducted in Utah in 2004. There are only two polls in the dataset for each race in Utah in 2004.