Abstract
This study investigates the optimal use of covariates in reducing variance when analyzing experimental data. We show that finding the variance-minimizing strategy for making use of pre-treatment observables is equivalent to estimating the conditional expectation function of the outcome given all available pre-randomization observables. This is a pure prediction problem, which recent advances in machine learning (ML) are well-suited to tackling. Through a number of empirical examples, we show how ML-based regression adjustments can feasibly be implemented in practical settings. We compare our proposed estimator to other standard variance reduction techniques in the literature. Two important advantages of our ML-based regression adjustment estimators are that (i) they improve asymptotic efficiency relative to other alternatives and (ii) they can be implemented automatically, with relatively little tuning from the researcher, which limits the scope for data snooping.
Acknowledgments
We thank Lyft Inc. for providing a large portion of the data used in this project. We additionally thank Adeline Sutton for her help in accessing and interpreting the CHECC data, as well as Brent Hickman, Michael Cuna, Atom Vayalinkal, and participants at the Advances in Field Experiments conference for helpful comments that have improved the article. Documentation of our procedures and our Stata and R code can be found here: https://github.com/gsun593/FlexibleRA
Declaration of Interest
John List was Chief Economist at Lyft when this research was carried out. He is now Chief Economist at Walmart. Ian Muir and Gregory Sun were also employed at Lyft at the time that the research was carried out. They are no longer affiliated with Lyft.
Author Contribution Statement
The authors confirm contribution to the paper as follows: study conception and design: Gregory Sun; data collection: Ian Muir, Gregory Sun; analysis and interpretation of results: John List, Ian Muir, Gregory Sun; draft manuscript preparation: John List, Gregory Sun. All authors reviewed the results and approved the final version of the manuscript.
Notes
1. Specifically, our estimators attain the asymptotic efficiency bound subject to the constraint that treatment assignment probabilities equal the fixed proportions 𝝆 for all x. If randomization probabilities can be made conditional on x, then for a fixed target parameter, variance can be further decreased by exploiting heteroskedasticity in Yi(g) conditional on Xi. For instance, if the researcher is interested in estimating the average treatment effect, then the researcher could further reduce variance by over-sampling treatments for which the variance of the outcome is higher. This information is often difficult to obtain in practice, and moreover, the optimal sampling design for one target parameter may not be optimal for another.
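The variance-based over-sampling described in this note corresponds to the classic Neyman allocation: arm-level sample shares proportional to each arm's outcome standard deviation. A minimal Python sketch, for illustration only (`neyman_allocation` is our name for the helper; the paper's own code is in R and Stata):

```python
def neyman_allocation(std_devs):
    """Allocate sampling proportions across treatment arms in proportion
    to each arm's outcome standard deviation (Neyman allocation)."""
    total = sum(std_devs)
    return [s / total for s in std_devs]

# Arms with higher outcome variance receive larger shares.
props = neyman_allocation([1.0, 2.0, 1.0])
print(props)  # [0.25, 0.5, 0.25]
```

In practice the arm-level standard deviations are unknown before the experiment, which is exactly the difficulty the note points out.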
2. We provide code for doing so at https://github.com/gsun593/FlexibleRA
3. Such a choice makes 𝐂𝐡 deterministically 0.
4. Note a subtle difference in the justification for this fact. In this case, Ag is uncorrelated only with linear functions of X, but because we restrict ourselves to the class of linear-in-X regression adjustments, the summands of Bg are all restricted to be linear as well.
5. The two examples NW explicitly have in mind are logistic regression and Poisson regression. In the former case, the conditional mean function is exp(x′β)/(1 + exp(x′β)), while in the latter case it is exp(x′β).
6. Where here, A and B are as in the previous section.
7. As we will see in our simulations, we still prefer the fitted prediction function to be high quality, as its ability to fit the data affects the sampling variability of the resulting estimator.
8. However, this point should not be overstated. Nonparametric estimators typically suffer from slower rates of convergence than parametric estimators, so in a finite sample, one may still prefer linear regression adjustment. Our empirical results suggest that, in general, one should pick the method that produces the highest quality out-of-sample predictions of the outcome as measured by mean squared error.
9. Note that if the sample size is not sufficiently large, some care should be taken to ensure that each fold gets observations from each of the treatment groups g.
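One simple way to guarantee that each fold receives observations from every treatment group is to shuffle within each group and deal its members round-robin across folds. A Python sketch (our own helper, not taken from the paper's repository):

```python
import random
from collections import defaultdict

def stratified_folds(groups, n_folds, seed=0):
    """Assign each observation to a cross-fitting fold, shuffling within
    each treatment group so every fold receives observations from every
    group (when group sizes permit)."""
    rng = random.Random(seed)
    fold_of = [None] * len(groups)
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for idxs in by_group.values():
        rng.shuffle(idxs)
        for k, i in enumerate(idxs):
            fold_of[i] = k % n_folds  # deal group members round-robin across folds
    return fold_of

groups = ['T'] * 6 + ['C'] * 6
folds = stratified_folds(groups, 3)
# Each of the 3 folds contains both treatment and control observations.
for k in range(3):
    assert {groups[i] for i in range(12) if folds[i] == k} == {'T', 'C'}
```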
10. R code implementing our flexible regression adjustment, along with the analyses of the three non-Lyft settings, can be found at the following link: https://github.com/gsun593/FlexibleRA. We have also included a copy of the code in Appendix B.
11. See Friedman et al. (2004) for an interpretation of this strategy as approximating the solution to a LASSO-like estimation procedure.
12. For confidentiality reasons, we cannot report the exact size of this sample. However, at the time of writing, Lyft's passenger count was in the tens of millions.
13. Specifically, the x-axis in these Q-Q plots is defined by the theoretical quantiles of a standard normal distribution, while the y-axis corresponds to the empirical quantiles. If the asymptotic theory is correct, the points in these plots should lie close to the 45-degree line, and deviations from this prediction allow us to more precisely visualize departures from asymptotic normality.
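The plot coordinates described in this note are straightforward to compute. A Python sketch using plotting positions (i − 0.5)/n, which is one common convention among several; `qq_points` is our name for the helper:

```python
from statistics import NormalDist

def qq_points(sample):
    """(theoretical, empirical) quantile pairs for a normal Q-Q plot:
    x = standard-normal quantiles at plotting positions (i - 0.5)/n,
    y = the sorted sample values."""
    n = len(sample)
    z = NormalDist()  # standard normal
    xs = [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    ys = sorted(sample)
    return list(zip(xs, ys))

# For a sample consisting exactly of standard-normal quantiles,
# the points lie on the 45-degree line.
z = NormalDist()
sample = [z.inv_cdf((i - 0.5) / 100) for i in range(1, 101)]
pts = qq_points(sample)
assert all(abs(x - y) < 1e-9 for x, y in pts)
```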
14. This reduction is not just due to noise: the difference would be statistically significant if subjected to formal hypothesis testing.
15. If the nonparametric method being used has algorithmic complexity growing faster than linearly in dataset size (which is common), two-fold cross-fitting would be even faster than not using a split sample for sufficiently large datasets.
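The arithmetic behind this claim: if training cost scales as n^a, then two fits on n/2 observations cost 2(n/2)^a = 2^(1−a) · n^a, which falls below n^a exactly when a > 1. A one-function Python check (ours, purely illustrative):

```python
def relative_cost_two_fold(n, a):
    """Cost of fitting twice on half the data, relative to one fit on
    all n observations, when training cost scales as n ** a."""
    return 2 * (n / 2) ** a / n ** a  # simplifies to 2 ** (1 - a)

# Superlinear training cost (a > 1) makes two-fold cross-fitting cheaper.
print(relative_cost_two_fold(10_000, 2.0))  # 0.5: quadratic cost, half as expensive
print(relative_cost_two_fold(10_000, 1.0))  # 1.0: linear cost, break-even
```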
16. Specifically, we implemented our point estimates according to Equation (11) and our standard errors according to Equation (12), but using an OLS fit in place of a fitted machine learning model.
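For readers without the equations at hand, a standard regression-adjusted mean takes the following shape: average the fitted values over the full sample, then add the arm-specific average residual. This is a generic augmented estimator sketched in Python and may not match the paper's Equation (11) term for term:

```python
from statistics import fmean

def adjusted_mean(y, x, g, target, fit):
    """Regression-adjusted mean outcome for treatment arm `target`:
    full-sample average of fitted values plus the arm's average residual.
    A generic augmented estimator, not necessarily the paper's exact form."""
    fitted_all = fmean(fit(xi) for xi in x)
    resid_arm = fmean(yi - fit(xi) for yi, xi, gi in zip(y, x, g) if gi == target)
    return fitted_all + resid_arm

# With a perfect fit, the adjustment removes all covariate noise:
y = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]
g = ['T', 'C', 'T', 'C']
print(adjusted_mean(y, x, g, 'T', fit=lambda xi: xi))  # 2.5
```

With a null fit (predicting 0 for everyone), the estimator reduces to the simple arm mean, which is the sense in which an OLS fit or an ML fit can be slotted into the same formula.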