Research Article

Statistical model selection with “Big Data”

Jurgen A. Doornik & David F. Hendry | (Reviewing Editor)
Article: 1045216 | Received 11 Mar 2015, Accepted 01 Apr 2015, Published online: 22 May 2015
 

Abstract

Big Data offer potential benefits for statistical modelling, but confront problems including an excess of false positives, mistaking correlations for causes, ignoring sampling biases, and selecting by inappropriate methods. We consider the many important requirements when searching for a data-based relationship using Big Data, and the possible role of Autometrics in that context. Paramount considerations include: embedding relationships in general initial models, possibly restricting the number of variables to be selected over by non-statistical criteria (the formulation problem); using good-quality data on all variables, analysed with tight significance levels by a powerful selection procedure while retaining available theory insights (the selection problem); testing that relationships are well specified and invariant to shifts in explanatory variables (the evaluation problem); and using a viable approach that resolves the computational problem posed by the immense number of possible models.
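To make the "tight significance levels" point concrete, the following is a minimal sketch of the general-to-specific idea: start from a model containing all candidate regressors, then iteratively drop the least significant one until everything retained passes a stringent threshold. This is only an illustration on simulated data, not the Autometrics algorithm itself, which additionally uses multi-path search, diagnostic testing, and encompassing comparisons; the function and variable names here are purely illustrative.

```python
import numpy as np

def gets_select(y, X, names, crit=3.29):
    """Toy backward elimination ("general-to-specific") selection.

    crit=3.29 approximates a two-sided 0.1% significance level under
    normality; a tight level like this keeps the expected number of
    false positives low when many candidate variables are searched over.
    Purely illustrative -- not the Autometrics multi-path algorithm.
    """
    keep = list(range(X.shape[1]))
    while keep:
        Xk = X[:, keep]
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (len(y) - len(keep))
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
        t = np.abs(beta / se)
        worst = int(np.argmin(t))
        if t[worst] >= crit:      # every retained variable is significant
            break
        del keep[worst]           # drop the least significant regressor
    return [names[i] for i in keep]

# Simulated "wide" data: 20 candidate regressors, only two truly relevant.
rng = np.random.default_rng(0)
n, k = 500, 20
X = rng.standard_normal((n, k))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.standard_normal(n)
names = [f"x{i}" for i in range(k)]

selected = gets_select(y, X, names)
print(selected)
```

With a strong signal and a tight critical value, the two relevant variables are retained while almost all irrelevant candidates are eliminated; at a conventional 5% level, by contrast, one would expect roughly one in twenty irrelevant variables to be retained by chance.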


Public Interest Statement

Big Data offer potential benefits for discovering empirical links, but confront potentially serious problems unless modelled with care. Key dangers include finding spurious relationships, mistaking correlations for causes, ignoring sampling biases and over-stating the significance of results. We describe the requirements needed to avoid these four difficulties when seeking relationships in Big Data. Important considerations are to commence from a general initial framework that allows for all influences likely to matter (the formulation problem), and use high-quality data analysed by a powerful search algorithm, requiring high significance (the selection problem). It is also crucial not to neglect insights from prior theory-based knowledge, and to test that claimed relationships both characterize all the evidence and are not changing over time (the evaluation problem). Finally, one must use a method that can efficiently handle immense numbers of possible models (the computational problem). Our approach provides a solution to all four problems.

Acknowledgements

Financial support from the Open Society Foundations and the Oxford Martin School is gratefully acknowledged, as are helpful comments from Jennifer L. Castle, Felix Pretis and Genaro Sucarrat. Numerical results are based on Ox and Autometrics, and we are indebted to Felix Pretis for the Lasso calculations.

Additional information

Funding

This work was financially supported by Open Society Foundations and the Oxford Martin School.

Notes on contributors

Jurgen A. Doornik

Jurgen A. Doornik is a James Martin Fellow at the Institute for New Economic Thinking at the Oxford Martin School, University of Oxford, a research fellow at Nuffield College, and a director of OxMetrics Technologies Ltd. He has published widely on econometric methods, modelling, software, numerical methods, computation, and mathematics, and has developed the OxMetrics software packages, including Ox and PcGive (the latter with D.F. Hendry), which incorporate Autometrics, an algorithm implementing automated general-to-specific model selection.

David F. Hendry

David F. Hendry is director of the Program in Economic Modeling at the Institute for New Economic Thinking at the Oxford Martin School, professor of Economics, and fellow of Nuffield College, Oxford University. He was knighted in 2009, and received a Lifetime Achievement Award from the Economic and Social Research Council in 2014. He has received eight honorary doctorates, is a Thomson Reuters Citation Laureate, and has published more than 200 papers and 25 books, the latest being Empirical Model Discovery and Theory Evaluation (MIT Press, 2014, with J.A. Doornik).