Abstract
Many interesting datasets available on the Internet are of medium size: too big to fit into a personal computer’s memory, but small enough to fit comfortably on its hard disk. In the coming years, datasets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints, they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstacles hamper thorough peer review and thus impede the progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Supplementary material for this article is available online.
Acknowledgments
The author gratefully acknowledges the editor, associate editor, and two anonymous reviewers for helpful comments, as well as Carson Sievert, Nicholas Horton, Weijia Zhang, Wencong Li, Rose Gueth, Trang Le, and Eva Gjekmarkaj for their contributions to this work.