ABSTRACT
This study explores the forecasting of Major League Baseball game ticket sales and identifies important attendance predictors by means of random forests that are grown from classification and regression trees (CART) and conditional inference trees. Unlike previous studies that predict sports demand, I consider different forecasting horizons and only use information that is publicly accessible in advance of a game or season. The models are trained on data from the 2013 and 2014 seasons to make predictions for the 2015 regular season. The static within-season approach is complemented by a dynamic month-ahead forecasting strategy. Out-of-sample performance is evaluated for individual teams and tested against different least-squares dummy variable regression models and a naïve lagged attendance forecast. My empirical results show high variation in team-specific prediction accuracy with respect to both models and forecasting horizons. Linear and tree-ensemble models, on average, do not vary substantially in predictive accuracy; however, least-squares regression fails to account for various team-specific peculiarities, despite including team fixed effects and censoring attendance predictions at stadium capacity.
Acknowledgements
I thank seminar and conference participants in Munich (Econometrics in the Castle: Machine Learning in Economics and Econometrics), Hamburg (University), and Kiel (IfW, University) for helpful comments and suggestions and, in particular, Martin Spindler, Wolfgang Maennig, and an anonymous referee.
This article has been corrected with minor changes. These changes do not impact the academic content of the article.
Disclosure statement
No potential conflict of interest was reported by the author.
Supplementary material
Supplemental data for this article can be accessed online.
Notes
1 It is common practice in the sports demand literature to use attendance and ticket sales as proxies for sports demand (Borland and Macdonald 2003). Furthermore, the officially reported attendance figures are the total number of sold and free tickets per game, not the number of fans who were present at a game. In this paper, the terms sports demand, ticket sales, and attendance are used interchangeably, unless explicitly stated otherwise. For an emerging literature on spectator no-show behaviour that addresses differences in fans’ attendance and ticket purchase behaviours, see, e.g., Schreyer (2019).
2 http://www.retrosheet.org, https://www.mlb.com, http://www.seamheads.com, https://www.covers.com, https://darksky.net.
3 Double-header events usually stem from rescheduled games: in my data set, 105 out of 166 double-header games are rescheduled games.
4 The cleaned data sample includes 7011 games from the 2013, 2014, and 2015 regular MLB seasons; 351 (5%) of the games show ticket sales that exceed stadium capacity, and 19 games have ticket sales that equal stadium capacity. In comparison, Denaux, Denaux, and Yalcin (2011) analyse factors affecting MLB attendance using 22,940 game observations from 1979 to 2004 and find that 4.2% of the games show attendance above stadium capacity. Similarly, Lemke, Leonard, and Tlhokwane (2010) predict MLB ticket sales for the 2007 season and find 7.5% of their analysed sample of games to be sold out (i.e., attendance at capacity or exceeding capacity), while Meehan, Nelson, and Richardson (2007) analyse MLB attendance data from 2002 to 2003 and report that 4.8% of the games are sold out.
5 MCC is a balanced score measure that takes into account all four dimensions of the confusion matrix and, in contrast to the F1 score, is invariant to the definition of positive and negative classes (Chicco and Jurman 2020; Matthews 1975).
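The invariance property mentioned in note 5 can be illustrated with a minimal sketch. The labels below are hypothetical toy data, not taken from the article's sample, and the helper names (confusion_counts, mcc, f1) are introduced here for illustration only. The sketch computes both scores twice, once with class 1 treated as positive and once with class 0 treated as positive.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a chosen positive class label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(tp, fp, tn, fn):
    """F1 score; ignores true negatives by construction."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical toy labels (assumption): an imbalanced binary outcome.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

counts_pos1 = confusion_counts(y_true, y_pred, positive=1)
counts_pos0 = confusion_counts(y_true, y_pred, positive=0)

mcc_pos1, mcc_pos0 = mcc(*counts_pos1), mcc(*counts_pos0)
f1_pos1, f1_pos0 = f1(*counts_pos1), f1(*counts_pos0)

print(f"MCC: {mcc_pos1:.3f} (positive=1) vs {mcc_pos0:.3f} (positive=0)")
print(f"F1:  {f1_pos1:.3f} (positive=1) vs {f1_pos0:.3f} (positive=0)")
```

In this toy case the MCC is identical under both labellings, while the F1 score changes markedly because it does not use the true negatives.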