Research Article

The Polls and the U.S. Presidential Election in 2020 …. and 2024

Article: 2199809 | Received 19 Nov 2022, Accepted 02 Mar 2023, Published online: 30 May 2023

Abstract

Arguably, the single greatest determinant of U.S. public policy is the identity of the president. And if trusted, polls not only provide forecasts about presidential-election outcomes but can act to shape those outcomes. Looking ahead to the 2024 U.S. presidential election and recognizing that polls before the 2020 presidential election were sharply criticized, we consider whether such harsh assessments are warranted. Initially, we explore whether such polls as processed by the sophisticated aggregator FiveThirtyEight successfully forecast actual 2020 state-by-state outcomes. We evaluate FiveThirtyEight’s forecasts using customized statistical methods not used previously, methods that take account of likely correlations among election outcomes in similar states. We find that, taken together, the pollsters and FiveThirtyEight did an excellent job in predicting who would win in individual states, even those “tipping point” states where forecasting is more difficult. However, we also find that FiveThirtyEight underestimated Donald Trump’s vote shares by state to a modest but statistically significant extent. We further consider how the polls performed when the more primitive aggregator Real Clear Politics combined their results, and then how well single statewide polls performed without aggregation. It emerges that both Real Clear Politics and the individual polls fared surprisingly well.

1 Introduction

In 2020 as in 2016, there was widespread frustration concerning the performance of polls about the U.S. presidential election. The New York Times ran an article titled "2016 Dealt a Blow to Polling. Did 2020 Kill It?" (Bokat-Lindell 2020). Writing in The Wall Street Journal, former Secretary of State James Baker declared that "this time, we were promised that pollsters would get it right. They didn't" (Baker 2020). A Washington Post op-ed was titled "The Polling Industry Can't Sweep Its Failure Under the Rug" (Olson 2020), while Fast Company described 2020 as "another embarrassing failure for election pollsters" (Campbell 2020), and a commentator in Yahoo! Finance thought that "the biggest election takeaway is the absolutely massive failure of polling" (Ferre 2020). The commentary about 2020 polls was just as harsh as that for their 2016 predecessors, even though, unlike those in 2016, the polls in 2020 correctly identified the winner of the election.

As the quotes above imply, whether presidential polls are accurate is an issue of immense importance in the United States. Arguably, the single greatest determinant of U.S. public policy is the identity of the president. And if trusted, polls not only provide forecasts about presidential-election outcomes but can greatly affect those outcomes, playing a large role in the choice of each party’s presidential nominee and in the behavior of voters.

There is a large literature about how polls go beyond describing voter preferences and act to shape those preferences. One well-known phenomenon is the "bandwagon effect," under which the fact that a candidate is ahead in the polls leads to support from some voters out of a desire to be on the winning side. There is also evidence that polling results can affect voter turnout: if the election does not appear close, then some people see no need actually to vote. There is also strategic voting based on polls, whereby citizens vote for candidates other than those they actually favor (e.g., they choose a lesser-desired candidate in a presidential primary because polls say that candidate would be stronger in the general election). If the polls are suspect, however, then all these behaviors could diminish. That might not be altogether a bad thing, but neither need it be an unalloyed good. Reliable polls that depict a close election can stimulate voter turnout, and some strategic voting can yield an election winner who best reflects the policy views of the majority of voters.

Because confidence in the polls is presumably tied to their recent performance, perceptions about how they fared in the 2020 presidential election can, for better or worse, have consequences for the election in 2024. For that reason, this article investigates the success or failure of the presidential polls prior to the 2020 election.

Most negative assessments of 2020 polling relate to alleged deficiencies in local polls conducted in individual states. Yet even if these assessments are accurate rather than overwrought, it could be misleading to focus on the frailties of particular polls. The polls might perhaps more reasonably be treated as raw materials used by sophisticated aggregators who, taking account of the limitations of such polls as well as broader patterns, synthesize the polling results to devise probabilistic forecasts about what will happen in elections. If those predictions perform well, then the polls that contributed heavily to the forecasts might collectively be construed as successful despite their individual imperfections.

Pursuant to that viewpoint, we first concentrate here on aggregated forecasts and, more specifically, those advanced for the 2020 presidential election by FiveThirtyEight, which is arguably the best known and most respected of the aggregators. FiveThirtyEight went further than just predicting the winner of the election; it advanced a series of probabilistic assessments about state-by-state win/loss outcomes and the vote split in each state among the candidates (Trump, Biden, and third-party nominees). We evaluate these predictions using customized statistical methods, which go beyond those that FiveThirtyEight itself uses or that have appeared in recent literature about the accuracy of polling results. In actuality, we are initially evaluating the combination of the polls and FiveThirtyEight rather than the “raw” polls in themselves. If the combination succeeds, it is a joint success.

The accuracy of 2024 presidential polls is already a live issue at the start of 2023. Polls have already emerged about a rematch in 2024 between Joe Biden and Donald Trump. Other polls ask whether Democrats want Biden as their standard bearer in the 2024 election, and about how Trump would fare in 2024 against several possible Democratic opponents. Further polls ask Republicans whom they prefer as the party's nominee in the next presidential election. A political analyst for New York magazine noted that such presidential polls have "real world consequences," because they "affect the decision-making of potential candidates, operatives and activists" (Kilgore 2023).

More specifically, if such polls—as distilled by a respected aggregator like FiveThirtyEight—are viewed as trustworthy, they could affect the intensity of pressure on Joe Biden to retire. They could influence Republican voters in state primaries who wonder whether Donald Trump could plausibly win reelection. The potential candidacies of Democrats like Amy Klobuchar or Republicans like Ron DeSantis could rise or fall with their standings in voter surveys. The polls, in other words, could play a sizable role in determining who each party's candidate will be. And once the nominees are chosen, polls could greatly influence media coverage of the election campaign. Indeed, it is routinely lamented that polls turn the election into a "horse race," in which who is ahead and by how many furlongs gets greater attention than what the candidates say about the issues.

Moreover, there is reason to fear that, as in 2020, questions, whether warranted or not, will be raised about the legitimacy of the 2024 election outcome. Preelection polls that are trusted can cast light on the credibility of such accusations. The inverse of these statements is also true: if the polls are not taken seriously, they cannot help adjudicate controversies about the 2024 election.

Here we restrict ourselves to the statistical accuracy of 2020 U.S. presidential polls, and not to broader issues about the proper role for polling in the selection of the president. And, as suggested, we proceed on the premise that the polls’ performance in the most recent presidential election is the best single indicator of their ability to answer the questions about 2024 that motivate them.

As we will see, the FiveThirtyEight-mediated forecasts about the 2020 elections fared well, the only shortcoming being a modest underprediction of Donald Trump’s state-by-state vote shares. To gauge the centrality of FiveThirtyEight’s own statistical modeling to that favorable outcome, we then turn to the corresponding forecasts from the less sophisticated but highly influential aggregator Real Clear Politics, which simply averages recent polls together with no attempt to correct for their potential biases. Then we step back from aggregated forecasts, to consider the heavily-criticized results from original local polls. To a surprising degree, we find that both Real Clear Politics and the original polls did well in their own right.

1.1 Previous Work

There is a large literature suggesting that preelection polls affect election outcomes. Based on an experiment, Farjam (2021) discerned a substantial bandwagon effect, estimating that "after participants saw pre-election polls, majority options on average received an additional 7% of the votes." (Farjam also offers an extensive bibliography of papers about polls and elections.) Burden (2005) explored strategic voting in U.S. presidential elections with respect to supporters of third-party candidates, concluding that many such supporters shifted their votes in 2000 to the "lesser evil" between the two major candidates, but that they rarely did so in 1992 and 1996, when polls suggested an easy victory for Bill Clinton. Bursztyn et al. (2017) estimated that voter turnout increased when polls indicated a close race (with the implication that, when polls depicted a race that was not close, turnout declined relative to an average election). Westwood, Messing, and Lelkes (2020) lamented that "probabilistic horse race" election coverage, like that advanced by FiveThirtyEight based on preelection polls, "confuses and demobilizes" the public, and concluded that, in 2016, confidence among Hillary Clinton's supporters that she would win the presidency was associated with lower voter turnout.

There is also a voluminous literature about the accuracy of political polls. It is useful to distinguish evaluations of the individual polls from evaluations of the forecasts of aggregators like FiveThirtyEight, which combine various polling results after adjusting for shortcomings among the polls. We turn first to some papers in the former category (individual polls).

Arnesen and Bergfjord (2014) studied U.S. presidential polls in 2008 and 2012, and offered evidence that the polls' estimates of probabilities of victory were further from the mark than were chances of winning derived from the odds in betting prediction markets.

Prosser and Mellon (2018) questioned whether conspicuous failures to predict the winners in recent United States and United Kingdom elections had created a "twilight of the polls." They discussed several reasons that polls fell short, including late swings, inadequate turnout models, mishandling of undecided voters, and unrepresentative samples sometimes tied to nonresponse biases. In the 2016 U.S. presidential election, the authors cited the failure to weight properly for voter education levels as contributing to the underestimation of Donald Trump's strength. However, the authors concluded that polls were not getting worse over time, and thus it was excessive to suggest their imminent demise.

Panagopoulos (2021) discerned systematic polling errors in the 2020 U.S. election cycle, which he said reflected "pro-Democratic biases." This pattern appeared in national and state-level polls in races for president, U.S. Senate, and state governors. Panagopoulos saw growing difficulties tied to rising costs, declining response rates, and "a host of technical and methodological challenges" that pollsters need to confront. "In the meantime," he advises, "the public is wise to consume polling information with caution."

Of exceptional importance in considering preelection polls are the "post mortems" performed by the American Association for Public Opinion Research (AAPOR). The association's evaluation for the 2016 presidential race was less negative than one might expect, given the widespread shock at Trump's victory over Clinton (Ad Hoc Committee 2017). The statewide polls, AAPOR concluded, correctly indicated a competitive, uncertain contest and only implied that Clinton was ahead by "the slimmest of margins." What weakened the polls, according to AAPOR, was an overrepresentation of college graduates, and the fact that undecided voters broke heavily for Trump. Moreover, forecasts about turnout were seemingly off, perhaps because some Clinton supporters, treating her election as a foregone conclusion, saw no need actually to vote.

AAPOR was somewhat more critical of presidential polls in 2020 (Clinton et al. 2021). It described polling errors as "of unusual magnitude" and saw a tendency to overstate Biden's vote shares in individual states and to understate Trump's, a tendency that was greater in the states that supported Trump in 2016. Yet AAPOR did not see a repetition of the problems that it had noted in 2016: college graduates were not overrepresented in the surveys, and late-deciding voters split evenly between Trump and Biden. Furthermore, contrary to some theories, the Association found that those Trump supporters who participated in polls were not reluctant to declare their preference. AAPOR felt unable to explain why the polls faltered in 2020, though it speculated that voters supporting Trump were less willing to speak with surveyors than those who opposed him. (In 2020, as in many previous years, only a minority of those contacted by pollsters agreed to take part in the canvassing.)

What AAPOR did not do was to consider the possibility that a polling aggregator like FiveThirtyEight was aware of the biases in presidential surveys in a given year and had largely corrected for them. Whether that happened in 2020 is a major focus of this article.

As for FiveThirtyEight itself, several papers have addressed its performance in presidential elections prior to 2020. Barnett (2018) spoke favorably of FiveThirtyEight's record in the 2016 election, noting that it estimated that Trump had about a 30% chance of winning and praising its awareness that outcomes in Pennsylvania, Michigan, and Wisconsin (the three key states Clinton was expected to carry but which went to Trump) were positively correlated.

Two other performance reviews for FiveThirtyEight were less unabashedly positive. Wright and Wright (2018) explored FiveThirtyEight's state-by-state record in the 2016 presidential election. The authors acknowledged that FiveThirtyEight had treated a Trump victory as only moderately unlikely, but suggested that the website had paid insufficient attention to a late-developing trend toward Trump. They advanced a smoothing mixed-effects model, sensitive to both national and local trends, that they argued would have performed better than FiveThirtyEight in 2016.

Rothschild (2009) evaluated FiveThirtyEight in connection with the 2008 presidential election, the first for which FiveThirtyEight offered forecasts. While he had favorable things to say about FiveThirtyEight, he concluded that the website suffered from some anti-incumbency bias, meaning a tendency to underestimate incumbents' vote shares. Rothschild suggested that a reason for this bias could be understating the extent to which voters who declare themselves undecided to pollsters vote for the incumbent on Election Day. He conducted a comparison between FiveThirtyEight and betting prediction markets and argued that, while FiveThirtyEight offered more accurate election forecasts, its advantage disappeared when the forecasts by prediction markets were "debiased."

However, FiveThirtyEight presumably has sought to improve its forecasting techniques over time based on any shortcomings it identifies. For that reason, its success or lack thereof in presidential elections before 2020 bears an unknown relationship to its performance in 2020 itself.

Nate Silver, the founder of FiveThirtyEight, himself writes after each presidential election about the accuracy of its forecasts. He concluded (Silver 2020) that its predictions in the 2020 Biden/Trump race "did very well." Silver drew attention to what he called the "rigorous methods" that FiveThirtyEight uses to evaluate its own performance, which we discuss at length in Appendix B.

2 Materials and Methods

2.1 FiveThirtyEight

We focus on the website FiveThirtyEight, created by Nate Silver, because it is probably the best known and arguably the most respected among election-forecast aggregators in the United States. We concentrate on its final state-by-state predictions for the 2020 presidential election, released in early November (FiveThirtyEight 2020), which, unlike its earlier forecasts, give scant weight in key states to economic and historical factors and are based almost exclusively on polls conducted within the state (Note 1). The website takes a weighted average of polls, with weights related both to their recency and to their patterns of error in recent forecasts. If a poll has tended systematically in the past to (say) overstate the actual vote shares of Republican candidates, FiveThirtyEight applies a correction for that bias. To some extent, the projections consider possible correlations among the outcomes in similar states. We consider the accuracy of predictions about Donald Trump's performance but, because Trump and Joe Biden were essentially in a two-person race, an analysis of Biden's performance would yield equivalent results. (For actual election results, we turn to the Federal Election Commission 2021.)

For a given state, FiveThirtyEight presents:

  • An estimate of the probability that Trump will win that state

  • A point estimate of Trump’s share of the popular vote

  • An 80% confidence interval for Trump’s share of the popular vote, which extends from the 10th percentile to the 90th percentile of FiveThirtyEight’s distribution for that quantity. The point estimate is at the midpoint of the confidence interval.

Actually, there are 56 “states” according to FiveThirtyEight: the usual 50 states, plus the District of Columbia, the three congressional districts in Nebraska, and the two in Maine. (In these two states, the popular-vote winner in a congressional district gains its Electoral-College vote regardless of the statewide outcome.)

We will conduct tests of the accuracy of FiveThirtyEight’s 2020 projections by state.

We believe our approach to assessing its predictive accuracy is more stringent and transparent than the validation procedures the website itself uses.

2.2 Win/Loss Projections

The simplest question one might ask about FiveThirtyEight's 2020 performance is: how many states did it get right? By "right," we mean that the website assigned the winner a victory probability higher than 1/2. An equivalent question is: how many states did the website get wrong? This right/wrong dichotomy lacks any nuance: if a candidate assigned a 45% chance of winning actually does so, then declaring the forecast an error seems superficial. But one can compare the website's actual number of "erroneous" forecasts with the number implied by its probabilistic projections.

Let $P_{Li}$ be FiveThirtyEight's estimated probability that the disfavored candidate in state $i$ actually wins (meaning that $P_{Li} < 1/2$), and let the random variable $Z$ be the number of "erroneous" forecasts over the 56 states. Then the website's mean number of errors would follow:
$$E(Z) = \sum_{i=1}^{56} P_{Li}$$

If the outcomes in different states were assumed independent, then the variance of the number of errors would be given by:
$$\sigma^2(Z) = \sum_{i=1}^{56} P_{Li}(1 - P_{Li}) \qquad (1)$$

However, the outcomes across states may not be independent, a circumstance we will discuss.
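To make the calculation concrete, here is a minimal sketch (in Python, with hypothetical probabilities rather than FiveThirtyEight's actual 56 values) of the error-count mean and standard deviation under independence, together with a simple Monte Carlo approximation of the distribution of Z like the one used later in Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical P_Li values (one per state); FiveThirtyEight's actual 56 values
# would be substituted here.
p_L = np.array([0.05, 0.12, 0.31, 0.45, 0.02])

mean_Z = p_L.sum()                           # E(Z) = sum of P_Li
sd_Z = np.sqrt((p_L * (1 - p_L)).sum())      # sigma(Z) under independence, eq. (1)

# Monte Carlo distribution of Z (a Poisson-binomial count), e.g. to locate an
# observed number of "erroneous" forecasts within that distribution.
sims = (rng.random((100_000, p_L.size)) < p_L).sum(axis=1)
observed_errors = 1                          # hypothetical observed count
percentile = (sims <= observed_errors).mean() * 100

print(f"E(Z) = {mean_Z:.2f}, sd(Z) = {sd_Z:.2f}, percentile of observed = {percentile:.0f}")
```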

2.3 The “Tipping Point” States

An assessment of FiveThirtyEight's accuracy in 2020 should presumably give major emphasis to its performance in nine pivotal states, which the website identified as those "close to the tipping point" where the election would likely be decided. These nine states are listed in Table 1, which presents FiveThirtyEight's estimate of Trump's chance of winning in each of the tipping-point states.

Table 1 FiveThirtyEight's Nine Tipping States in the 2020 Presidential Election.

Let random variable S be the total number of swing states Trump would carry. We can approximate the probability mass function for S, based on both FiveThirtyEight’s “win” probability for Trump in each state and the estimated correlations of outcomes across the states.

We define the indicator variable $X_i$ for each of the nine listed states by:
$$X_i = \begin{cases} 1 & \text{if Trump wins state } i \\ 0 & \text{if Trump loses state } i \end{cases}$$

The i’s reflect alphabetical ordering of the states, meaning that X1 refers to Arizona, etc.

Then the total number $S$ of Trump wins would follow:
$$S = \sum_{i=1}^{9} X_i$$

Then, according to FiveThirtyEight just prior to the election, the mean of $S$ would be given by:
$$E(S) = \sum_{i=1}^{9} E(X_i) = \sum_{i=1}^{9} p_{Ti} \qquad (2)$$
where $p_{Ti} = P(\text{Trump would win state } i \text{ according to FiveThirtyEight})$.

2.3.1 Correlated Outcomes across Tipping Point States

In estimating the variance and standard deviation of $S$, we need to consider that the election outcomes in different states can be correlated. For example, two states that had the same winner in all presidential elections from 1976 to 2016 would seem likely to go the same way again. We have the general expression:
$$\sigma^2(S) = \sum_{i=1}^{9} \sigma^2(X_i) + 2\sum_{1 \le i < j \le 9} \mathrm{Cov}(X_i, X_j) \qquad (3)$$

To estimate the covariance $\mathrm{Cov}(X_i, X_j)$, we focus on $A_{ij}$ (Note 2), the proportion of times the two states supported the same candidate in the presidential elections between 1976 and 2016. We initially set out four linear equations as follows:
$$P_{TB} + P_{TT} = p_{Ti}$$
$$P_{BT} + P_{TT} = p_{Tj}$$
$$P_{BB} + P_{TT} = A_{ij}$$
$$P_{TT} + P_{TB} + P_{BT} + P_{BB} = 1 \qquad (4)$$
where

$P_{TT} = P(\text{Trump carries both states})$

$P_{TB} = P(\text{Trump carries state } i \text{ but not state } j)$

$P_{BT} = P(\text{Trump carries state } j \text{ but not state } i)$

$P_{BB} = P(\text{Trump loses both states to Biden})$

$p_{Tk} = $ FiveThirtyEight's estimate of the chance that Trump will win state $k$

$A_{ij} = $ fraction of presidential elections over 1976–2016 with the same outcome in states $i$ and $j$

The first two of these linear equations equate Trump’s chance of winning a given state in 2020 to FiveThirtyEight’s probability estimate for that outcome. The third equation uses the frequency with which states i and j agreed on presidential outcomes over 1976–2016 as an estimate of the chance they would agree again in 2020.

But there is a potential problem. The four linear equations in four unknowns in (4) can be solved for $P_{TT}$, $P_{TB}$, $P_{BT}$, and $P_{BB}$, but there is no guarantee that these quantities will all fall in the range (0, 1). To avoid that problem, we pull back on the requirement that $P_{BB} + P_{TT}$ must equal $A_{ij}$ and insist instead that $P_{BB} + P_{TT}$ be as close as possible to $A_{ij}$ in a least-squares sense, consistent with feasible solutions for the various probabilities. We do so by advancing the following quadratic-programming optimization model:
$$\text{Minimize } (P_{BB} + P_{TT} - A_{ij})^2$$
subject to:
$$P_{TB} + P_{TT} = p_{Ti}$$
$$P_{BT} + P_{TT} = p_{Tj}$$
$$P_{TT} + P_{TB} + P_{BT} + P_{BB} = 1$$
$$P_{TT}, P_{TB}, P_{BT}, P_{BB} \ge 0 \qquad (5)$$

Note that, whenever (4) yields a solution with nonnegative probabilities, it is also the solution to (5).
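As an illustration, the sketch below solves (5) without a general-purpose quadratic-programming solver by exploiting the fact that the equality constraints determine $P_{TB}$, $P_{BT}$, and $P_{BB}$ from $P_{TT}$ alone; this one-dimensional reformulation is ours, not necessarily how the calculation was carried out for the paper.

```python
def joint_win_prob(p_i, p_j, A_ij):
    """P_TT solving (5): minimize (P_BB + P_TT - A_ij)^2 subject to the constraints.

    The equality constraints give P_TB = p_i - P_TT, P_BT = p_j - P_TT, and
    P_BB = 1 - p_i - p_j + P_TT, so feasibility reduces to an interval for P_TT.
    """
    lo = max(0.0, p_i + p_j - 1.0)     # keeps P_TT and P_BB nonnegative
    hi = min(p_i, p_j)                 # keeps P_TB and P_BT nonnegative
    # P_BB + P_TT = 1 - p_i - p_j + 2 * P_TT, so the unconstrained minimizer is:
    best = (A_ij - 1.0 + p_i + p_j) / 2.0
    return min(max(best, lo), hi)      # clip to the feasible interval
```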

Once we have the estimate of $P_{TT}$ from (5), we can obtain the probability distribution for $Z_{ij}$, which is defined by:
$$Z_{ij} = \begin{cases} 1 & \text{if Trump carries states } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$

Note that $Z_{ij}$ is the product of $X_i$ and $X_j$, meaning that $E(Z_{ij}) = E(X_i X_j)$. Note too that $E(Z_{ij}) = P_{TT}$.

We then have:
$$\mathrm{Cov}(X_i, X_j) = E(Z_{ij}) - E(X_i)E(X_j) = P_{TT} - p_{Ti}\,p_{Tj} \qquad (6)$$

Using the covariances calculated via (5) and (6) for the $\binom{9}{2} = 36$ combinations of $i$ and $j$, and noting that $\sigma^2(X_i) = p_{Ti}(1 - p_{Ti})$, we can obtain via (3) an estimate of $\sigma^2(S)$.
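Continuing the sketch, $\sigma^2(S)$ in (3) can then be assembled from the nine win probabilities and the pairwise agreement fractions (both placeholders below), reusing the hypothetical joint_win_prob helper given after (5).

```python
def cov_win(p_i, p_j, A_ij):
    """Cov(X_i, X_j) = P_TT - p_i * p_j, as in (6)."""
    return joint_win_prob(p_i, p_j, A_ij) - p_i * p_j

def var_S(p_T, A):
    """sigma^2(S) from (3); p_T holds the nine win probabilities and A[i][j]
    the 1976-2016 agreement fraction for states i and j."""
    n = len(p_T)
    total = sum(p * (1.0 - p) for p in p_T)                  # sum of sigma^2(X_i)
    for i in range(n):
        for j in range(i + 1, n):
            total += 2.0 * cov_win(p_T[i], p_T[j], A[i][j])  # 2 * sum of covariances
    return total
```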

However, knowing the mean and standard deviation of $S$ does not immediately yield its probability mass function, which takes the form:
$$S = j \text{ with probability } q_j \text{ for } j = 0, 1, \ldots, 9$$

Consistent with the mean and standard deviation calculated for $S$, the $q_j$'s must satisfy three linear equations:
$$\sum_{j=0}^{9} j\,q_j = E(S)$$
$$\sum_{j=0}^{9} j^2 q_j = E(S^2) = \sigma^2(S) + (E(S))^2$$
$$\sum_{j=0}^{9} q_j = 1 \qquad (7)$$

But there are ten $q_j$'s, and these equations impose only three constraints on them. In consequence, there are many feasible sets of $q_j$'s that satisfy (7).

As described in Appendix A, we use an algorithm to obtain a “composite” distribution for S, in essence averaging across the feasible distributions consistent with (7).

Once the composite distribution for S is at hand, one can see where the actual number of tipping states that Trump won falls within that distribution. If it falls at (say) the extreme right tail, then the accuracy of FiveThirtyEight’s projections about the tipping states would be called into question.

2.4 Trump’s Vote Share

UCLA Coach Henry Russell Sanders informed his players that “winning isn’t everything; it’s the only thing.” But FiveThirtyEight accompanied its “win/loss” probabilities with probability distributions for the proportion of the vote Trump would receive in each state. For a full test of FiveThirtyEight’s prediction methodology, it is important to explore how well those vote-share forecasts fared against the actual Trump/Biden vote split.

Specifically, FiveThirtyEight offered 80% confidence intervals for Trump's vote share in each of the 51 states, with the point estimate at the center of the interval. (We exclude the congressional districts of Maine and Nebraska from this analysis, because their vote-share data are fully contained in the statewide data we use.) Assuming normal distributions, the confidence interval ranged from the 10th percentile to the 90th percentile, namely, from 1.28 standard deviations below the mean to 1.28 standard deviations above it. Therefore, if the two bounds were $a_{10}$ and $a_{90}$ in a given state, the corresponding mean $\mu$ and standard deviation $\sigma$ would be given by:
$$\mu = \frac{a_{10} + a_{90}}{2}, \qquad \sigma = \frac{a_{90} - \mu}{1.28} = \frac{a_{90} - a_{10}}{2.56}$$
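The conversion just described is simple enough to state in a few lines; the sketch below assumes, as in the text, that the published 80% interval is symmetric about the point estimate and comes from a normal distribution.

```python
def normal_params_from_80ci(a10, a90):
    """Recover (mu, sigma) of the assumed normal vote-share distribution
    from its 10th and 90th percentiles."""
    mu = (a10 + a90) / 2.0
    sigma = (a90 - a10) / 2.56     # the 80% interval spans 2 * 1.28 standard deviations
    return mu, sigma
```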

We consider the null hypothesis $H_0$ that all of FiveThirtyEight's probability distributions were accurate, meaning that the actual result in each state was one random pick from the state's specified normal distribution. Let the random variable $W_i$ be that pick when expressed as a percentile of the website's normal distribution for state $i$. Under $H_0$, $W_i$ would be uniform on (0, 100), because an outcome between (say) the 7th and 8th percentiles would have the same 1% chance of arising as one between the 43rd and 44th, or the 82nd and 83rd. For that reason, the mean of $W_i$ would be 50 under $H_0$, and its standard deviation would be 29 (approximately $100/\sqrt{12}$), based on general properties of uniform distributions.

A test of $H_0$ could fruitfully focus on $\bar{W}$, the arithmetic average of the 51 $W_i$'s. The two-sided p-value associated with $\bar{W}$ would follow:
$$p\text{-value} = \begin{cases} 2\,P(R \ge \bar{W}) & \text{if } \bar{W} \ge 50 \\ 2\,P(R \le \bar{W}) & \text{if } \bar{W} < 50 \end{cases}$$
where $R$ is the average of 51 (correlated) picks from the joint distribution of the $W_i$'s under $H_0$.

One would reject H0 if the p-value falls below some threshold value, the most common of which is 0.05.

If the different $W_i$'s were independent random variables, then under $H_0$, $\bar{W}$ would be approximately normally distributed by the Central Limit Theorem, with a mean of 50 and a standard deviation of $29/\sqrt{51} \approx 4.07$. But, like the $X_i$'s in the tipping states, the $W_i$'s need not be independent, because high-percentile outcomes in some states could foreshadow similar outcomes in others. We will consider this point in connection with actual results.
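Putting the pieces of this subsection together, a rough sketch of the test might look as follows; the bounds and actual vote shares are placeholders, and the standard deviation of $\bar{W}$ is the independence-based value discussed above (the correlation adjustment is taken up in Section 3.3).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical FiveThirtyEight 80% bounds and actual Trump shares (%), three
# states only for illustration; the paper uses all 51.
a10 = np.array([40.1, 52.3, 33.0])
a90 = np.array([46.1, 58.3, 39.0])
actual = np.array([45.0, 57.9, 37.2])

mu = (a10 + a90) / 2.0
sigma = (a90 - a10) / 2.56
W = 100 * norm.cdf((actual - mu) / sigma)    # percentile of each actual outcome

W_bar = W.mean()
sd_indep = 29 / np.sqrt(len(W))              # ~4.07 when there are 51 states
z = (W_bar - 50) / sd_indep
p_value = 2 * (1 - norm.cdf(abs(z)))         # two-sided, normal approximation for W-bar

print(f"W-bar = {W_bar:.1f}, z = {z:.2f}, p = {p_value:.3f}")
```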

3 Results

3.1 Win/Loss Forecasts

As discussed, the number of state-by-state outcomes in which the candidate disfavored by FiveThirtyEight would win (the random variable $Z$ of Section 2.2) has a mean $\mu_L$ that follows:
$$\mu_L = \sum_{i=1}^{56} p_{iL}$$
where $p_{iL}$ = P(disfavored candidate wins in state $i$ according to FiveThirtyEight), with $p_{iL} < 1/2$.

Based on FiveThirtyEight's 56 win/loss probabilities for 2020, $\mu_L = 5.17$. (All data we used about FiveThirtyEight's final presidential forecasts for 2020 appear in FiveThirtyEight 2020.) That statistic shows that FiveThirtyEight was not timid in projecting winners: because 5.17 is less than 10% of 56, the website expected that it would correctly identify the winner more than 9/10 of the time.

In actuality, only three of FiveThirtyEight's win/loss forecasts were incorrect (in Florida, North Carolina, and the northern congressional district of Maine). Assuming independence, the standard deviation of the number of errors ($Z$) would be 1.92 under (1). Furthermore, simulation reveals that an outcome of three would be at the 24th percentile of the distribution of $Z$ (i.e., the simulated value of $Z$ was three or lower 24% of the time). Correlations between pairwise outcomes in different states could affect $\sigma^2(Z)$; however, their effect is unlikely to imperil the conclusion that the observed error rate fell comfortably within the distribution implied by FiveThirtyEight's state-by-state assessments (Note 3). In any event, when the estimated win/loss success rate is 91% (50.83/56) and the actual rate is 94% (53/56), there is no credible basis for claiming that FiveThirtyEight overestimated the accuracy of its win/loss predictions. The website made sharp claims about what would happen and satisfied them.

3.2 Trump’s Success in the Tipping States

Table 1 offered some details about the nine states that FiveThirtyEight deemed most likely to "tip" the presidential election to the winning candidate. The issue is how closely the website's predictions about those states as a group corresponded to what actually happened in 2020.

In Section 2.3.1, we described how we estimated the $\binom{9}{2} = 36$ correlation coefficients for the various pairings among the nine tipping states. Once those correlations, and thus the $\mathrm{Cov}(X_i, X_j)$'s, are at hand, the mean and variance of the number of tipping states $S$ that Trump would win can be estimated based on (2), (3), and the probabilities in Table 1. The results are $E(S) = 1.82$ and $\sigma^2(S) \approx 1.64$.

We describe $E(S)$ here as exactly 1.82 because it is based literally on the probabilities from FiveThirtyEight that appear in Table 1.

The next issue is the distribution of the discrete random variable $S$, the number of the nine tipping states that Trump would win, which has a probability mass function of the form:
$$S = j \text{ with probability } q_j \text{ for } j = 0, 1, \ldots, 9$$

We describe our procedure for making estimates of the $q_j$'s in Appendix A. We reach the following distribution (w.p. = with probability):
$$S = \begin{cases} 0 & \text{w.p. } .094 \\ 1 & \text{w.p. } .328 \\ 2 & \text{w.p. } .380 \\ 3 & \text{w.p. } .118 \\ 4 & \text{w.p. } .041 \\ 5 & \text{w.p. } .018 \\ 6 & \text{w.p. } .010 \\ 7 & \text{w.p. } .005 \\ 8 & \text{w.p. } .004 \\ 9 & \text{w.p. } .002 \end{cases} \qquad (8)$$

Among the nine states of Table 1, Trump actually won two (Florida and North Carolina). That outcome is at the mode of $S$ in (8) and as close as possible to FiveThirtyEight's hypothesized mean of 1.82. In short, it appears that FiveThirtyEight did an excellent job in the tipping states, and that to treat Trump's victories in Florida and North Carolina as "errors" by the website requires valuing binary "win/lose" variables above the more nuanced use of actual probabilities and correlations.

Under FiveThirtyEight's probability assessments, Trump's bleak prognosis in the tipping states all but guaranteed his defeat. The remaining 47 states (56 − 9) included 22 for which the website favored Biden and 25 for which it favored Trump. Based on the state-by-state win probabilities in those 47 states, Biden would gain a mean of 245 electoral votes, while Trump's mean gain would be 160 electoral votes. To counter that mean difference of 85 (245 − 160), Trump would have needed at least 109 of the 133 electoral votes in the tipping states (Note 4). Table 1 implies that doing so would have required at least seven Trump victories in the nine states, an outcome that is assigned a probability of 0.011 in (8). Even that probability is an upper bound, because many combinations of seven or eight Trump victories would fall short of yielding 109 electoral votes (e.g., all those that exclude Florida).

3.3 Vote-Share Distributions

As we discussed in Section 2.4, our test of the accuracy of FiveThirtyEight's state-by-state Trump vote-share distributions entailed expressing his 2020 vote shares in individual states as percentiles of the corresponding FiveThirtyEight distributions. The observed outcomes in the 2020 election tilted decisively toward the upper tails of those distributions, with the average percentile over the 51 states at 69.52%. Under $H_0$ (all FiveThirtyEight distributions are correct), that outcome is 4.75 standard deviations above the expected value of 50% if the standard deviation is estimated as 4.07% (i.e., assuming independence across states), meaning that the calculated p-value would be infinitesimal.

However, the presence of (generally positive) cross-state correlations would increase the standard deviation of $\bar{W}$. Data that would directly allow estimating the correlation between Trump's vote share in state $i$ (as a FiveThirtyEight percentile) and his share in state $j$ are not available. However, generalizing the method we applied for the nine tipping states, we can estimate the correlations of Trump's binary win/loss variables for all $\binom{51}{2} = 1275$ pairs of states, and thus FiveThirtyEight's overall standard deviation for $T$, the total number of states Trump would carry (which was assigned a mean of 23.2). That standard deviation was 3.66, about twice the standard deviation of $T$ assuming independence, which was 1.79. Using that two-to-one ratio as an approximate guide to what would happen to $\sigma(\bar{W})$ because of correlation, we might double its value as calculated under both independence and $H_0$. Then the observed value of 69.52% would still be about 2.4 standard deviations above the mean, with a two-sided p-value of 0.0164. Thus, the null hypothesis $H_0$ that FiveThirtyEight's 51 Trump vote-share distributions were all accurate would again be rejected at the usual 5% significance level.

That this adverse outcome is not spurious is further supported by the fact that in 47 of the 51 states (all except Alaska, Colorado, Louisiana, and Maryland), FiveThirtyEight underestimated Trump's support on election day. Over the 51 states, FiveThirtyEight's point estimates of Trump's vote share were too low by an average of 1.90 percentage points. (Both FiveThirtyEight's projections and the actual vote tallies take account of third parties, which received about 2% of votes nationwide.) Trump outperformed the point estimates in both heavily Democratic states like Delaware, Massachusetts, and New York and heavily Republican states like North Dakota, Kentucky, and Wyoming. Interestingly, Trump exceeded his projected vote share by twice as much in states that he won as in states that he lost (2.56 percentage points vs. 1.26). And when the state-by-state difference between Trump's vote share and FiveThirtyEight's point estimate for that share is regressed via OLS on the explanatory variable "point estimate," a statistically significant positive slope emerges:
$$y = 0.055x - 0.707$$
where $y$ = difference between Trump's actual 2020 vote share and FiveThirtyEight's point estimate (%), and $x$ = FiveThirtyEight's point estimate of Trump's vote share (%). Slope standard error: 0.018; p-value of slope estimate: 0.004; correlation of $x$ and $y$: 0.401.

In essence, the more favorable to Trump’s performance was FiveThirtyEight’s estimate, the greater in general was the extent to which it was not favorable enough.
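For readers who want to reproduce this kind of diagnostic on their own data, a minimal sketch using scipy.stats.linregress follows; the arrays are placeholders, not the 51 state values analyzed above.

```python
import numpy as np
from scipy.stats import linregress

# Placeholder data: x = FiveThirtyEight point estimates of Trump's share (%),
# y = actual share minus point estimate (percentage points).
x = np.array([38.2, 45.1, 52.7, 57.6, 63.0])
y = np.array([1.0, 1.7, 2.1, 2.5, 2.9])

fit = linregress(x, y)
print(f"slope = {fit.slope:.3f} (se {fit.stderr:.3f}), "
      f"intercept = {fit.intercept:.3f}, p = {fit.pvalue:.4f}, r = {fit.rvalue:.3f}")
```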

Given that FiveThirtyEight underestimated Trump’s vote share by a statistically-significant average of about two percentage points, why was it so successful in forecasting the winners in individual states? The answer is that, in the vast majority of states, the vote shares of Trump and Biden differed by more than two percentage points. That Trump carried Idaho with 63.8% of the vote rather than the projected 59.5% had no effect on the state’s win/loss outcome; that Biden carried Hawaii with 65.7% of the vote rather than 69.1% was likewise immaterial.

3.4 Polling Accuracy without FiveThirtyEight’s Intervention

We have discussed the wide public perception that the presidential polls failed in 2020, and the detailed negative conclusions reached by experts at AAPOR. For that reason, we have been treating the individual 2020 polls as presumptively deficient and exploring whether FiveThirtyEight counteracted their weaknesses. But it is worth estimating how large was the accuracy problem that FiveThirtyEight was meant to ameliorate.

It is helpful in this connection to turn to Real Clear Politics, which offers direct information about the collective performance of preelection presidential polls. In each state where Real Clear Politics (hereafter RCP) saw the 2020 election as close, it simply took an arithmetical average of local polling results for a limited period before election day. It paid no attention to the sample sizes of individual polls, to a given poll’s recency (e.g., two weeks before the election or three days before), or to any poll’s historical tendency to favor one political party over the other. If one worked with the RCP averages alone, how much worse would have been the forecasts than those produced by FiveThirtyEight?

To answer that question, one first has to make reasonable adjustments to the raw RCP data. For example, suppose that RCP's last estimate before the election was that Trump and Biden were tied at 46% in a given state, with 5% undecided and 3% favoring third-party candidates. Then RCP was not actually predicting that, on Election Day, Trump's vote share would be 46%. Under the simple (simplistic?) premise that undecided voters would ultimately split between Trump and Biden the same way as the decided ones (and assuming third-party candidates maintained their minimal projected vote shares), the 46–46 split would be revised to 48.5%–48.5%. Making such adjustments, we present in Table 2 RCP-based projections, FiveThirtyEight projections, and actual election results, focusing on 14 states/districts where the election seemed close. (Those districts, one in Maine and one in Nebraska, each have one Electoral College vote; for simplicity, we shall speak of 14 states.) These states constitute all those classified as toss-ups by RCP and/or tipping states by FiveThirtyEight (hereafter, swing states). It is in swing states, where the outcome is not obvious (unlike California or Alabama), that polling accuracy is most important.
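A sketch of that adjustment, assuming (as the text does) that undecided respondents split in proportion to the decided two-party support while third-party shares stay fixed:

```python
def allocate_undecideds(trump, biden, third_party):
    """Reallocate undecided respondents to Trump and Biden in proportion to
    their decided support; all arguments and results are in percent."""
    undecided = 100.0 - trump - biden - third_party
    decided = trump + biden
    return (trump + undecided * trump / decided,
            biden + undecided * biden / decided)

print(allocate_undecideds(46.0, 46.0, 3.0))   # -> (48.5, 48.5), as in the example above
```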

Table 2 Trump’s actual 2020 vote shares and his projected vote shares in 14 swing states, for FiveThirtyEight and Real Clear Politics (RCP).

As Table 2 shows, FiveThirtyEight only marginally outperformed RCP. Like FiveThirtyEight, RCP underestimated Trump's vote shares in the swing states, but RCP did so by slightly more; RCP, however, had the lower mean absolute forecast error.

Yet Table 2 does not contradict the possibility that individual polls in the various states performed badly. RCP, like FiveThirtyEight, is an aggregator of polls. Perhaps what happened is that large biases of some local polls were largely canceled by opposite biases of other surveys. Even if that happened, though, the polls are not fatally flawed if the simple expedient of averaging their results yields a reasonably accurate outcome.

Actually, however, the local polls fared rather well in themselves. Table 3 presents for each state the average absolute forecast errors of all the local polls RCP used (i.e., not allowing for cancelations among opposite biases). (By definition, local forecast errors in each state had the same average as RCP's, meaning that their overall average was 1.93 percentage points.) The absolute errors across the 14 entities averaged 2.53 percentage points, not much higher than FiveThirtyEight's average of 2.13 percentage points. Table 3 also presents state-specific average margins of random sampling error for individual local polls, taking account of their sample sizes. These sampling margins generally exceeded absolute errors to an appreciable extent (on average, 3.83% vs. 2.53%).

Table 3 Average absolute forecast error and average margin of random sampling error for Trump's 2020 vote share, among key local polls in 14 swing states.

Against this backdrop, it is difficult to depict those local polls as major failures. Among the 70 local polls used by RCP across the 14 states, 77% yielded estimates of Trump's vote share within the usual 95% confidence interval for that share based on sampling fluctuations alone. That figure does fall below the theoretical 95% level, meaning that it suggests the presence of some systematic error. But not huge systematic error.

Table 4 summarizes the comparison between FiveThirtyEight, RCP, and the local polls used by RCP in its final preelection estimates.

Table 4 Average Forecasting Error and Average Absolute Forecasting Error for Trump's 2020 Vote Share in 14 Swing States, for FiveThirtyEight, Real Clear Politics, and Individual Local Polls.

What about the other 42 states where the election was not considered close? RCP generally declined to offer forecasts there because reliable polls were scarce, presumably because the outcomes were viewed as foregone conclusions. But FiveThirtyEight did offer forecasts, although it often had to work with nonrandom polls like those from SurveyMonkey, to which the website itself had assigned the grade D-. Under the circumstances, FiveThirtyEight did well in the 42 noncompetitive states: Section 3.3 and Table 4 imply that its forecasts were too low on average by about two percentage points there, about the same as in the 14 swing states where serious polls were numerous.

4 Final Remarks

From the perspective of U.S. public policy, a lesser role for presidential-election polls could have its advantages. Some of the energy spent obsessing over polls might be redirected to discussions of policy, and there could be fewer distortions like the bandwagon effect and abstentions from voting because the polls were not close. Fewer voters might strategically decline to choose their preferred candidate. It is noteworthy in this connection that dozens of countries, including Canada, France, Greece, Mexico, Norway, and Poland, impose blackout periods on preelection polls. The reasoning that led to those blackouts could well apply to the United States.

However, there is also a case that the primacy of polls in U.S. elections for president, though not ideal, is better than the alternative. The dichotomy “polls versus policy issues” is overstated: The preferences that participants express in polls to a considerable extent reflect their views on policy matters. And when polls are close, they might stimulate voter turnout, reduce strategic voting, and induce candidates to speak at greater length on the virtues of their policy stances. There is also the deeper point that attempts to restrict polls could be viewed as antidemocratic in spirit, implying that voters should be denied information they want because they might use that information inappropriately. If voters cannot be trusted with polling results, should they likewise be deprived of facts on a variety of policy matters?

Yet all this discussion becomes moot if presidential polls are not viewed as trustworthy. We have considered such polls in 2020 in the swing states where accuracy was most important and where presidential candidates focused their campaigns. Whether working with FiveThirtyEight, Real Clear Politics, or the individual polls that offered “fuel” for such aggregators, we found performances in 2020 that were objectively very good. These performances were especially impressive given that the Covid-19 pandemic caused unprecedented difficulties in forecasting election results. The suggestion that the pollsters and aggregators failed in 2020 emerges as exaggerated, while the notion that biased individual polls required massive corrections from aggregators is inconsistent with relevant data.

Yet it is concerning that, for the second election in a row, the polls underestimated the support for Donald Trump, and FiveThirtyEight did not devise an appropriate adjustment for the downward bias. Measures taken after the 2016 election to counteract the bias seem not to have fully succeeded, and the American Association for Public Opinion Research (AAPOR) has explained at length that simple explanations for the problem do not readily fit the data. The only common explanation for the shortfall that AAPOR did not exclude was that Trump supporters may have refused to take part in voter surveys to a greater extent than Trump opponents, even within identifiable subgroups like white working-class voters or Republican voters. While one hopes that lessons from 2020 will prevent a recurrence of the problem in 2024, there is no certainty that this will be the case.

Those who think presidential polls get undue attention can continue to advance their arguments. But given what happened in 2020, those contending in 2024 that such polls should be ignored should not advance the assertion that the polls are highly unreliable.

Acknowledgments

The authors are grateful to the editors and reviewers for their thoughtful suggestions.

Disclosure Statement

The authors have no potential competing interests.

Notes

1 In FiveThirtyEight’s nine swing states most likely to “tip” the election, the median weight it accorded to polls in its final 2020 forecast was 97%.

2 FiveThirtyEight's modeling allows for correlated forecasts across states, but it does not disclose how, and our own approach to correlation is probably different. But a test of a model need not be predicated on treating all its assumptions as correct (e.g., someone evaluating a model that assumes the earth is flat is not required to do likewise).

3 Suppose that state-by-state win/loss outcomes for Trump are positively correlated. Then a Trump victory in state A could moderately increase his chance of winning in state B, where FiveThirtyEight assigns him a 25% chance of victory, and also in state C (75% chance). The conditional probability of a win/loss error would then rise above 25% in B but fall below 25% in C. Thus, relative to independence, the net effect on $\sigma^2(Z)$ of these two opposite movements could well be modest.

4 While the projected Biden/Trump difference in electoral votes could fluctuate around its mean of 85 for these 47 states, that circumstance would not meaningfully alter this approximate analysis.

5 For example, suppose that a model correctly assumes that the candidate has a 50% chance of winning in each state, but that the various outcomes have strong positive correlation. Then the percentage actually won could be polarized towards 100% or 0%, and the unbiased estimates of 50% could appear highly inaccurate.

References

Appendix A:

A Probability Distribution Based on FiveThirtyEight for the Number of Tipping States Trump Would Carry in 2020

We define $S$ as the total number of the tipping states Trump would carry (out of nine) and $q_j$ as the probability that he carries exactly $j$ of those states. Then, based on FiveThirtyEight and the analyses of its forecasts, we estimated in the main text that $E(S) = 1.82$ and $\sigma^2(S) \approx 1.64$; we also have $\sum_{j=0}^{9} q_j = 1$. There are an infinite number of distributions for $S$ that satisfy these three conditions but, to be practical, we used the following procedure:

  1. To keep the number of feasible solutions finite, restrict the individual $q_j$'s to be multiples of 0.01 in the range 0–1.

  2. Identify, using an algorithm, the combinations of $q_j$'s that match $E(S) = 1.82$, $\sigma^2(S) \approx 1.64$, and $\sum_{j=0}^{9} q_j = 1$.

  3. Assign equal weight to all such combinations.

  4. Average their values of qj together to get a composite estimate of the probability that Trump would win exactly j of the swing states.

Steps 3 and 4 reflect the premise that, to get the best representation of what FiveThirtyEight implies about outcomes in the nine tipping states, it is reasonable to average over all distributions for $S$ that are consistent with both the website's state-by-state probabilities and the assumption about cross-state correlation reflected in $\sigma^2(S)$. In a Bayesian sense, it is as if all probability distributions on the integers 0–9 were initially assumed equally likely to be correct (a uniform prior), and the distributions were updated by the stated requirements on mean and standard deviation. Those distributions that failed to meet the requirements were assigned a posterior probability of zero, while all the rest were assigned equal probabilities. Averaging $q_j$-values across the "surviving" distributions could therefore be viewed as yielding a reasonable expected value for $q_j$.

To implement Step 2, we began by identifying $q_j$'s that matched $E(S) = 1.82$ and $\sum_{j=0}^{9} q_j = 1$, using an algorithm that rapidly excluded the overwhelming majority of $(q_0, q_1, \ldots, q_9)$ combinations. For example, the procedure immediately excluded values of $q_0$ above 0.8 because, even if the remaining probability mass were assigned to $q_9$, the mean would fall short of 1.82. For the same reason, $q_9$ could not exceed 0.20. Given a feasible value of $q_0$, the range of feasible values of $q_1$ could be identified. Continuing in this way, the algorithm generated the sets of values $(q_0, q_1, \ldots, q_9)$ that yielded a mean of 1.82 (to the nearest hundredth).

For each set $(q_0, q_1, \ldots, q_9)$ that yielded a mean of 1.82, the algorithm calculated the variance of the distribution and retained only those sets with mean-squares of $1.82^2 + 1.64 \approx 4.95$ to the nearest hundredth. Then, as described earlier, we created the "composite" distribution for $S$ based on averaging the feasible distributions, which was:
$$S = \begin{cases} 0 & \text{w.p. } .094 \\ 1 & \text{w.p. } .328 \\ 2 & \text{w.p. } .380 \\ 3 & \text{w.p. } .118 \\ 4 & \text{w.p. } .041 \\ 5 & \text{w.p. } .018 \\ 6 & \text{w.p. } .010 \\ 7 & \text{w.p. } .005 \\ 8 & \text{w.p. } .004 \\ 9 & \text{w.p. } .002 \end{cases}$$
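The sketch below is an illustrative, not exact, re-implementation of this procedure: to keep a brute-force search fast, it uses a coarser grid (multiples of 0.05 rather than 0.01) and explicit tolerance windows around the two moment targets, both of which are our assumptions rather than the paper's settings, so its composite will only roughly resemble the one above.

```python
import numpy as np

E_S, VAR_S = 1.82, 1.64              # targets from the main text
MEAN_SQ = VAR_S + E_S ** 2           # E(S^2), roughly 4.95
STEP = 0.05                          # assumption: coarser than the paper's 0.01 grid
N = round(1 / STEP)                  # total probability mass in grid units
MEAN_TOL, MSQ_TOL = 0.03, 0.06       # assumed tolerance windows around the targets

solutions = []

def search(j, counts, mass_left, s1, s2):
    """Assign grid counts n_9, n_8, ..., pruning on the reachable moments."""
    if j == 0:
        mean, msq = STEP * s1, STEP * s2          # n_0 = mass_left adds nothing to moments
        if abs(mean - E_S) <= MEAN_TOL and abs(msq - MEAN_SQ) <= MSQ_TOL:
            solutions.append([mass_left] + counts[::-1])
        return
    for n in range(mass_left + 1):
        ns1, ns2, rem = s1 + j * n, s2 + j * j * n, mass_left - n
        # Remaining indices are all < j, so the final moments lie in these ranges:
        if STEP * ns1 > E_S + MEAN_TOL or STEP * (ns1 + (j - 1) * rem) < E_S - MEAN_TOL:
            continue
        if STEP * ns2 > MEAN_SQ + MSQ_TOL or STEP * (ns2 + (j - 1) ** 2 * rem) < MEAN_SQ - MSQ_TOL:
            continue
        search(j - 1, counts + [n], rem, ns1, ns2)

search(9, [], N, 0, 0)
composite = STEP * np.mean(solutions, axis=0)     # average q_j across the feasible sets
print(len(solutions), "feasible distributions; composite:", np.round(composite, 3))
```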

Appendix B:

FiveThirtyEight’s Own Model Validation Procedures

In a document at its website, FiveThirtyEight identifies its two procedures for validating its probabilistic forecasts against actual election results. The first one compares the probabilities assigned to events to the frequencies with which they actually occurred. For example, the website reports that “we’ll throw every prediction …. between a 37.5% and 42.5% chance of winning into the same “40 percent” group—and then plot the averages of each bin’s forecasted chances of winning against their actual win percentage.” This is done for a full set of ranges to create a calibration plot and, it is stated, the data points should be “close to the 45 degree line” if the forecasts are all accurate.

A second evaluation method entails the use of Brier skill scores. The initial Brier score $BS$ for a set of $n$ probabilistic forecasts follows:
$$BS = \frac{1}{n}\sum_{i=1}^{n}(p_i - O_i)^2$$
where $p_i$ = estimated probability that event $i$ will occur, and
$$O_i = \begin{cases} 1 & \text{if event } i \text{ actually occurs} \\ 0 & \text{if event } i \text{ does not occur} \end{cases}$$

BS is effectively a measure of mean-squared forecasting error: the smaller BS is, the higher the level of accuracy under this criterion.

The Brier Skill Score BSS compares BS to the Brier score $BS_{\mathrm{ref}}$ that would arise if a series of "unskilled" forecasts were advanced for the same $n$ events. BSS takes the form:
$$BSS = 1 - BS/BS_{\mathrm{ref}}$$

BSS is somewhat analogous to $R^2$ in linear regression analysis. FiveThirtyEight suggests that an unskilled forecast assumes that "each candidate has an equal shot." In a two-candidate race, that would mean that the unskilled forecaster would assign win probabilities of 1/2 to both candidates. Then $BS_{\mathrm{ref}}$ would be 1/4, because the quantity $(p_i - O_i)^2$ would necessarily be 1/4 for all such forecasts.
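A compact sketch of these two scores, with the "unskilled" reference forecast taken (as FiveThirtyEight suggests) to be an equal chance for each of two candidates:

```python
import numpy as np

def brier_score(p, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    p, outcomes = np.asarray(p, float), np.asarray(outcomes, float)
    return np.mean((p - outcomes) ** 2)

def brier_skill_score(p, outcomes, p_ref=0.5):
    """BSS = 1 - BS / BS_ref; with p_ref = 0.5 the reference score is 0.25."""
    return 1.0 - brier_score(p, outcomes) / brier_score(np.full(len(p), p_ref), outcomes)

# Hypothetical example: four forecasts of "Trump wins this state".
print(brier_skill_score([0.9, 0.2, 0.65, 0.05], [1, 0, 1, 0]))
```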

Comments on FiveThirtyEight’s Evaluation Methods

The website's evaluation methods are reasonable in general, but they might be less so in the context of U.S. presidential elections. The test with the 45-degree line works well if the various forecasts are independent, but can be misleading if the forecasts are correlated (Note 5). The Brier scores can also be problematic in the context of presidential elections. They reward predictions that are bold and accurate: if an event occurs, a 98% probability previously assigned to it is treated as far superior to a probability of 52%. The implicit premise is that the 98% reflects far greater insight than the equivocal 52%.

In the 2020 election, however, assigning strong probabilities to a Biden win in California or a Trump win in Wyoming was belaboring the obvious; the germane question is how well the forecaster performed in the "swing states" that would determine the winner. There, a sophisticated forecaster processing all available information might sensibly assign a probability near 50% that Trump would win. Yet BSS would discount the forecaster's skill in those states by likening the assessment to a coin toss. The issue is especially important because only about one-fifth of American states were swing states in 2020. Thus, the easy predictions in the other four-fifths of states will dominate the Brier score, potentially yielding a highly positive assessment of forecast accuracy even if the swing-state predictions fell short of the mark.

Importantly, FiveThirtyEight does not identify any procedures that mention correlation, or consider the accuracy of its vote-share distributions. Our tests that consider these issues, therefore, are not redundant.