Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology

Pages 135-149 | Received 14 Dec 2022, Accepted 31 Aug 2023, Published online: 18 Oct 2023

References

  • Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2020), “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 88, 265–296. DOI: 10.3982/ECTA12675.
  • Abhishek, V., and Mannor, S. (2017), “A Nonparametric Sequential Test for Online Randomized Experiments,” in Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, pp. 610–616, Perth, Australia: International World Wide Web Conferences Steering Committee. DOI: 10.1145/3041021.3054196.
  • Athey, S., and Imbens, G. (2016), “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113, 7353–7360. https://www.pnas.org/content/113/27/7353.full.pdf. DOI: 10.1073/pnas.1510489113.
  • Austrian, J., Mendoza, F., Szerencsy, A., Fenelon, L., Horwitz, L. I., Jones, S., Kuznetsova, M., and Mann, D. M. (2021), “Applying A/B Testing to Clinical Decision Support: Rapid Randomized Controlled Trials,” Journal of Medical Internet Research, 23, e16651. DOI: 10.2196/16651.
  • Barber, R. F., and Candès, E. J. (2015), “Controlling the False Discovery Rate via Knockoffs,” The Annals of Statistics, 43, 2055–2085. DOI: 10.1214/15-AOS1337.
  • Basse, G. W., and Airoldi, E. M. (2018), “Model-Assisted Design of Experiments in the Presence of Network-Correlated Outcomes,” Biometrika, 105, 849–858. DOI: 10.1093/biomet/asy036.
  • Begg, C. B., and Leung, D. H. (2000), “On the Use of Surrogate End Points in Randomized Trials,” Journal of the Royal Statistical Society, Series A, 163, 15–28. DOI: 10.1111/1467-985X.00153.
  • Berman, R., and Van den Bulte, C. (2021), “False Discovery in A/B Testing,” Management Science, 68, 6762–6782. DOI: 10.1287/mnsc.2021.4207.
  • Biddle, G. (2019), “Proxy Metrics: How to Define a Metric to Prove or Disprove Your Hypotheses and Measure Progress,” available at https://gibsonbiddle.medium.com/4-proxy-metricsa82dd30ca810. (Accessed on 03/04/2022).
  • Blake, T., and Coey, D. (2014), “Why Marketplace Experimentation is Harder than it Seems: The Role of Test-Control Interference,” in Proceedings of the Fifteenth ACM Conference on Economics and Computation, pp. 567–582.
  • Bojinov, I., and Gupta, S. (2022), “Online Experimentation: Benefits, Operational and Methodological Challenges, and Scaling Guide,” Harvard Data Science Review, 4. DOI: 10.1162/99608f92.a579756e.
  • Bojinov, I., Simchi-Levi, D., and Zhao, J. (2022), “Design and Analysis of Switchback Experiments,” Management Science, 69, 3759–3777. DOI: 10.1287/mnsc.2022.4583.
  • Boucher, C., Knoblich, U., Miller, D., Patotski, S., and Saied, A. (2020), “Metric Computation for Multiple Backends,” available at https://www.microsoft.com/en-us/research/group/experimentationplatform-exp/articles/metric-computation-for-multiple-backends/. (Accessed on 09/16/2022).
  • Box, G. E., Hunter, J. S., and Hunter, W. G. (2005), Statistics for Experimenters: Design, Innovation, and Discovery (2nd ed.), Hoboken, NJ: Wiley-Interscience.
  • Bui, T., Steiner, S. H., and Stevens, N. T. (2023), “General Additive Network Effect Models,” The New England Journal of Statistics in Data Science, 1–19. DOI: 10.51387/23-NEJSDS29.
  • Bump, P. (2019), “Analysis—’60 Minutes’ Profiles the Genius Who Won Trump’s Campaign: Facebook.”
  • Chamandy, N. (2016), “Experimentation in a Ridesharing Marketplace,” available at https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e.
  • Chan, H., and Lai, T. (2005), “Importance Sampling for Generalized Likelihood Ratio Procedures in Sequential Analysis,” Sequential Analysis, 24, 259–278. DOI: 10.1081/SQA-200063280.
  • Chen, N., Liu, M., and Xu, Y. (2018), “Automatic Detection and Diagnosis of Biased Online Experiments,” arXiv preprint arXiv:1808.00114.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017), “Double/Debiased/Neyman Machine Learning of Treatment Effects,” American Economic Review, 107, 261–265. DOI: 10.1257/aer.p20171038.
  • Christian, B. (2012), “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired, 20. Available at https://www.wired.com/2012/04/ff-abtesting/
  • Courthoud, M. (2022), “Understanding CUPED,” available at https://towardsdatascience.com/understandingcuped-a822523641af. (Accessed on 08/18/2022).
  • Crook, T., Frasca, B., Kohavi, R., and Longbotham, R. (2009), “Seven Pitfalls to Avoid When Running Controlled Experiments on the Web,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’09, p. 1105, Paris, France: ACM Press. DOI: 10.1145/1557019.1557139.
  • Deng, A., and Hu, V. (2015), “Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 349–358. DOI: 10.1145/2684822.2685307.
  • Deng, A., Lu, J., and Chen, S. (2016a), “Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing,” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 243–252, IEEE. DOI: 10.1109/DSAA.2016.33.
  • Deng, A., Lu, J., and Litz, J. (2017), “Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, Cambridge, UK: Association for Computing Machinery, pp. 641–649. DOI: 10.1145/3018661.3018677.
  • Deng, A., and Shi, X. (2016), “Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 77–86, San Francisco, CA: ACM. DOI: 10.1145/2939672.2939700.
  • Deng, A., Xu, Y., Kohavi, R., and Walker, T. (2013), “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data,” Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM ’13, p. 123, Rome, Italy: ACM Press. DOI: 10.1145/2433396.2433413.
  • Deng, A., Yuan, L.-H., Kanai, N., and Salama-Manteau, A. (2023), “Zero to Hero: Exploiting Null Effects to Achieve Variance Reduction in Experiments with One-sided Triggering,” Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 823–831. DOI: 10.1145/3539597.3570413.
  • Deng, A., Zhang, P., Chen, S., Kim, D. W., and Lu, J. (2016b), “Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression,” arXiv preprint arXiv:1610.03917.
  • Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., and Vaz, G. (2016), “Pitfalls of Long-Term Online Controlled Experiments,” 2016 IEEE International Conference on Big Data (Big Data), pp. 1367–1376, IEEE. DOI: 10.1109/BigData.2016.7840744.
  • Dmitriev, P., Gupta, S., Kim, D. W., and Vaz, G. (2017), “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17, pp. 1427–1436, New York: Association for Computing Machinery. DOI: 10.1145/3097983.3098024.
  • Drutsa, A., Gusev, G., and Serdyukov, P. (2015), “Future User Engagement Prediction and its Application to Improve the Sensitivity of Online Experiments,” in Proceedings of the 24th International Conference on World Wide Web, pp. 256–266. DOI: 10.1145/2736277.2741116.
  • Drutsa, A., Ufliand, A., and Gusev, G. (2015), “Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. CIKM ’15, pp. 763–772, Melbourne, Australia: Association for Computing Machinery. DOI: 10.1145/2806416.2806496.
  • Eckles, D., Karrer, B., and Ugander, J. (2014), “Design and Analysis of Experiments in Networks: Reducing Bias from Interference,” arXiv preprint arXiv:1404.7530.
  • Ensor, H., Lee, R. J., Sudlow, C., and Weir, C. J. (2016), “Statistical Approaches for Evaluating Surrogate Outcomes in Clinical Trials: A Systematic Review,” Journal of Biopharmaceutical Statistics, 26, 859–879. DOI: 10.1080/10543406.2015.1094811.
  • Fabijan, A., Dmitriev, P., Holmstrom Olsson, H., and Bosch, J. (2018), “Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing,” in 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 68–72. DOI: 10.1109/SEAA.2018.00021.
  • Feng, W., and Bauman, J. (2022), “Balancing Network Effects, Learning Effects, and Power in Experiments,” available at https://doordash.engineering/2022/02/16/balancing-network-effects-learning-effects-and-power-in-experiments/
  • Frangakis, C. E., and Rubin, D. B. (2002), “Principal Stratification in Causal Inference,” Biometrics, 58, 21–29. DOI: 10.1111/j.0006-341x.2002.00021.x.
  • Georgiev, G. (2022), “Fully Sequential vs Group Sequential Test,” available at https://blog.analyticstoolkit.com/2022/fully-sequential-vs-group-sequential-tests.
  • Georgiev, G. Z. (2019), Statistical Methods in Online A/B Testing, Self-Published.
  • Google (2022), “How Google’s Algorithm is Focused on Its Users - Google Search,” available at https://www.google.com/search/howsearchworks/mission/users/. (Accessed on 03/29/2022).
  • Gui, H., Xu, Y., Bhasin, A., and Han, J. (2015), “Network A/B Testing: From Sampling to Estimation,” in Proceedings of the 24th International Conference on World Wide Web, WWW ’15, pp. 399–409, Florence, Italy: International World Wide Web Conferences Steering Committee. DOI: 10.1145/2736277.2741081.
  • Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D., Curtis, M. A., Deng, A., Duan, W., Forbes, P., Frasca, B., Guy, T., Imbens, G. W., Saint Jacques, G., Kantawala, P., Katsev, I., Katzwer, M., Konutgan, M., Kunakova, E., Lee, M., Lee, M., Liu, J., McQueen, J., Najmi, A., Smith, B., Trehan, V., Vermeer, L., Walker, T., Wong, J., and Yashkov, I. (2019), “Top Challenges from the First Practical Online Controlled Experiments Summit,” SIGKDD Explorations Newsletter, 21, 20–35. DOI: 10.1145/3331651.3331655.
  • Ham, D. W., Bojinov, I., Lindon, M., and Tingley, M. (2022), “Design-Based Confidence Sequences for Anytime-valid Causal Inference,” arXiv preprint arXiv:2210.08639.
  • Hassan, A., Shi, X., Craswell, N., and Ramsey, B. (2013), “Beyond Clicks: Query Reformulation as a Predictor of Search Satisfaction,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2019–2028.
  • Hern, A. (2014), “Why Google has 200m Reasons to Put Engineers Over Designers—Google—The Guardian,” available at https://www.theguardian.com/technology/2014/feb/05/why-googleengineers-designers. (Accessed on 10/26/2021).
  • Hohnhold, H., O’Brien, D., and Tang, D. (2015), “Focusing on the Long-Term: It’s Good for Users and Business,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1849–1858, Sydney, NSW, Australia: Association for Computing Machinery. DOI: 10.1145/2783258.2788583.
  • Holtz, D., Lobel, R., Liskovich, I., and Aral, S. (2020), “Reducing Interference Bias in Online Marketplace Pricing Experiments,” arXiv preprint arXiv:2004.12489.
  • Hopkins, F. (2020), “Increasing Experimental Power with Variance Reduction at the BBC—by Frank Hopkins—BBC Data Science—Medium,” available at https://medium.com/bbc-datascience/increasing-experiment-sensitivity-through-pre-experiment-variancereduction-166d7d00d8fd. (Accessed on 02/25/2022).
  • Hu, Y., and Wager, S. (2022), “Switchback Experiments under Geometric Mixing,” arXiv preprint arXiv:2209.00197.
  • Imai, K., and Ratkovic, M. (2013), “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation,” The Annals of Applied Statistics, 7, 443–470. DOI: 10.1214/12-AOAS593.
  • Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge: Cambridge University Press.
  • Isaac, M. (2021), “Facebook Wrestles With the Features It Used to Define Social Networking.” The New York Times. Available at https://www.nytimes.com/2021/10/25/technology/facebook-like-share-buttons.html
  • Ivaniuk, A. (2020), “Our Evolution Towards T-REX: The Prehistory of Experimentation Infrastructure at LinkedIn—LinkedIn Engineering,” available at https://engineering.linkedin.com/blog/2020/our-evolution-towards-t-rex–the-prehistory-of-experimentation-i. (Accessed on 02/14/2022).
  • Jackson, S. (2018), “How Booking.com Increases the Power of Online Experiments with CUPED—Booking.com Data Science,” available at https://booking.ai/how-booking-com-increasesthe-power-of-online-experiments-with-cuped-995d186fff1d. (Accessed on 01/13/2021).
  • Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2022a), “Always Valid Inference: Continuous Monitoring of A/B Tests,” Operations Research, 70, 1806–1821. DOI: 10.1287/opre.2021.2135.
  • Johari, R., Li, H., Liskovich, I., and Weintraub, G. Y. (2022b), “Experimental Design in Two-Sided Platforms: An Analysis of Bias,” Management Science, 68, 7069–7089. DOI: 10.1287/mnsc.2021.4247.
  • Ju, N., Hu, D., Henderson, A., and Hong, L. (2019), “A Sequential Test for Selecting the Better Variant: Online A/B Testing, Adaptive Allocation, and Continuous Monitoring,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, pp. 492–500, Melbourne VIC, Australia: Association for Computing Machinery. DOI: 10.1145/3289600.3291025.
  • Karrer, B., Shi, L., Bhole, M., Goldman, M., Palmer, T., Gelman, C., Konutgan, M., and Sun, F. (2021), “Network Experimentation at Scale,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3106–3116. DOI: 10.1145/3447548.3467091.
  • Keenan, M. (2022), “Global Ecommerce Explained: Stats and Trends to Watch in 2022,” available at https://www.shopify.ca/enterprise/global-ecommerce-statistics. (Accessed on 04/30/2023).
  • Kemp, S. (2023), “DIGITAL 2023: Global Overview Report,” available at https://datareportal.com/reports/digital-2023-global-overview-report. (Accessed on 04/30/2023).
  • Kharitonov, E., Drutsa, A., and Serdyukov, P. (2017), “Learning Sensitive Combinations of A/B Test Metrics,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, pp. 651–659, Cambridge, UK: Association for Computing Machinery. DOI: 10.1145/3018661.3018708.
  • Kharitonov, E., Vorobev, A., Macdonald, C., Serdyukov, P., and Ounis, I. (2015), “Sequential Testing for Early Stopping of Online Experiments,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pp. 473–482, Santiago, Chile: Association for Computing Machinery. DOI: 10.1145/2766462.2767729.
  • Kohavi, R. (2012), “Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics,” in Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, pp. 1–2, New York: Association for Computing Machinery. DOI: 10.1145/2365952.2365954.
  • ———(2023), “Build vs Buy,” available at https://bit.ly/ABTestsBuildVsBuy8.
  • Kohavi, R., Deng, A., and Vermeer, L. (2022), “A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, Washington, DC: Association for Computing Machinery. DOI: 10.1145/3534678.3539160.
  • Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., and Xu, Y. (2012), “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 786–794, Beijing, China: Association for Computing Machinery. DOI: 10.1145/2339530.2339653.
  • Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., and Pohlmann, N. (2013), “Online Controlled Experiments at Large Scale,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’13, p. 1168, Chicago, Illinois, USA: ACM Press. DOI: 10.1145/2487575.2488217.
  • Kohavi, R., Deng, A., Longbotham, R., and Xu, Y. (2014), “Seven Rules of Thumb for Web Site Experimenters,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14, pp. 1857–1866, New York, New York, USA: ACM Press. DOI: 10.1145/2623330.2623341.
  • Kohavi, R., and Longbotham, R. (2023), “Online Controlled Experiments and A/B Tests,” in Encyclopedia of Machine Learning and Data Science, eds. D. Phung, G. I. Webb, and C. Sammut, pp. 1–13, New York: Springer. DOI: 10.1007/978-1-4899-7502-7_891-2.
  • Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009), “Controlled Experiments On the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery, 18, 140–181. DOI: 10.1007/s10618-008-0114-1.
  • Kohavi, R., Tang, D., and Xu, Y. (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge: Cambridge University Press. DOI: 10.1017/9781108653985. Available at https://experimentguide.com/.
  • Kohavi, R., and Thomke, S. (2017), “The Surprising Power of Online Experiments,” Harvard Business Review, 95, 74–82.
  • Kohlmeier, S. (2022), “Microsoft’s Experimentation Platform: How We Build a World Class Product - Microsoft Research,” available at https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/microsofts-experimentationplatform-how-we-build-a-world-class-product/. (Accessed on 02/14/2022).
  • Koning, R., Hasan, S., and Chatterji, A. (2022), “Experimentation and Start-up Performance: Evidence from A/B Testing,” Management Science, 68, 6434–6453. DOI: 10.1287/mnsc.2021.4209.
  • Koutra, V., Gilmour, S. G., and Parker, B. M. (2021), “Optimal Block Designs for Experiments on Networks,” Journal of the Royal Statistical Society, Series C, 70, 596–618. DOI: 10.1111/rssc.12473.
  • Lan, K. K. G., and DeMets, D. L. (1983), “Discrete Sequential Boundaries for Clinical Trials,” Biometrika, 70, 659–663. DOI: 10.2307/2336502.
  • Lan, Y., Bakthavachalam, V., Sharan, L., Douriez, M., Azarnoush, B., and Kroll, M. (2022), “A Survey of Causal Inference Applications at Netflix—by Netflix Technology Blog,” available at https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f. (Accessed on 08/18/2022).
  • Li, H., Zhao, G., Johari, R., and Weintraub, G. Y. (2022), “Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms,” in Proceedings of the ACM Web Conference 2022, pp. 182–192.
  • Lindon, M., and Malek, A. (2020), “Anytime-Valid Inference for Multinomial Count Data,” arXiv preprint arXiv:2011.03567. DOI: 10.48550/arXiv.2011.03567.
  • Lindon, M., Sanden, C., and Shirikian, V. (2022), “Rapid Regression Detection in Software Deployments through Sequential Testing,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22, pp. 3336–3346. Washington DC, USA: Association for Computing Machinery. DOI: 10.1145/3534678.3539099.
  • Liou, K., and Taylor, S. J. (2020), “Variance-Weighted Estimators to Improve Sensitivity in Online Experiments,” in Proceedings of the 21st ACM Conference on Economics and Computation, pp. 837–850. DOI: 10.1145/3391403.3399542.
  • Liu, C., Cardoso, A., Couturier, P., and McCoy, E. J. (2021), “Datasets for Online Controlled Experiments,” arXiv preprint arXiv:2111.10198.
  • Liu, M., Mao, J., and Kang, K. (2021), “Trustworthy and Powerful Online Marketplace Experimentation with Budget-split Design,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3319–3329. DOI: 10.1145/3447548.3467193.
  • Luca, M., and Bazerman, M. H. (2021), The Power of Experiments: Decision Making in a Data-Driven World, Cambridge, MA: MIT Press.
  • Manzi, J. (2012), Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society, New York, NY: Basic Books.
  • Matias, J. N., Munger, K., Le Quere, M. A., and Ebersole, C. (2021), “The Upworthy Research Archive, A Time Series of 32,487 Experiments in US Media,” Scientific Data, 8, 195. DOI: 10.1038/s41597-021-00934-7.
  • McFarland, C. (2012), Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing, pp. 190, Berkeley, CA: New Riders.
  • McFowland III, E., Gangarapu, S., Bapna, R., and Sun, T. (2021), “A Prescriptive Analytics Framework for Optimal Policy Deployment Using Heterogeneous Treatment Effects,” MIS Quarterly, 45, 1807–1832. DOI: 10.25300/MISQ/2021/15684.
  • Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles,” Annals of Agricultural Sciences, 1–51.
  • Ni, T., Bojinov, I., and Zhao, J. (2023), “Design of Panel Experiments with Spatial and Temporal Interference,” Available at SSRN 4466598.
  • O’Brien, P. C., and Fleming, T. R. (1979), “A Multiple Testing Procedure for Clinical Trials,” Biometrics, 35, 549–556.
  • Parker, B. M., Gilmour, S. G., and Schormans, J. (2017), “Optimal Design of Experiments on Connected Units with Application to Social Networks,” Journal of the Royal Statistical Society, Series C, 66, 455–480. DOI: 10.1111/rssc.12170.
  • Pekelis, L. (2015), “Statistics for the Internet Age: The Story Behind Optimizely’s New Stats Engine,” available at https://www.optimizely.com/insights/blog/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/. (Accessed on 03/08/2022).
  • Petersen, A., Witten, D., and Simon, N. (2016), “Fused Lasso Additive Model,” Journal of Computational and Graphical Statistics, 25, 1005–1025. DOI: 10.1080/10618600.2015.1073155.
  • Peysakhovich, A., and Lada, A. (2016), “Combining Observational and Experimental Data to Find Heterogeneous Treatment Effects,” arXiv preprint arXiv:1611.02385.
  • Pocock, S. J. (1977), “Group Sequential Methods in the Design and Analysis of Clinical Trials,” Biometrika, 64, 191–199. DOI: 10.1093/biomet/64.2.191.
  • Pokhilko, V., Zhang, Q., Kang, L., and Mays, D. P. (2019), “D-Optimal Design for Network A/B Testing,” Journal of Statistical Theory and Practice, 13, 1–23. DOI: 10.1007/s42519-019-0058-3.
  • Poyarkov, A., Drutsa, A., Khalyavin, A., Gusev, G., and Serdyukov, P. (2016), “Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16, pp. 235–244. San Francisco, California, USA: Association for Computing Machinery. DOI: 10.1145/2939672.2939688.
  • Pramanik, S., Johnson, V. E., and Bhattacharya, A. (2021), “A Modified Sequential Probability Ratio Test,” Journal of Mathematical Psychology, 101, 102505. DOI: 10.1016/j.jmp.2021.102505.
  • Prentice, R. L. (1989), “Surrogate Endpoints in Clinical Trials: Definition and Operational Criteria,” Statistics in Medicine, 8, 431–440. DOI: 10.1002/sim.4780080407.
  • Quin, F., Weyns, D., Galster, M., and Silva, C. C. (2023), “A/B Testing: A Systematic Literature Review,” arXiv preprint arXiv:2308.04929.
  • Robertson, D. S., Choodari-Oskooei, B., Dimairo, M., Flight, L., Pallmann, P., and Jaki, T. (2023), “Point Estimation for Adaptive Trial Designs I: A Methodological Review,” Statistics in Medicine, 42, 122–145. DOI: 10.1002/sim.9605.
  • Robinson, P. M. (1988), “Root-N-Consistent Semiparametric Regression,” Econometrica, 56, 931–954. DOI: 10.2307/1912705.
  • Ruberg, S. J. (1995a), “Dose Response Studies I. Some Design Considerations,” Journal of Biopharmaceutical Statistics, 5, 1–14. DOI: 10.1080/10543409508835096.
  • ———(1995b), “Dose Response Studies II. Analysis and Interpretation,” Journal of Biopharmaceutical Statistics, 5, 15–42.
  • Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701. DOI: 10.1037/h0037350.
  • Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., and Zhang, R. (2022), “Novelty and Primacy: A Long-Term Estimator for Online Experiments,” Technometrics, 64, 524–534. DOI: 10.1080/00401706.2022.2124309.
  • Saint-Jacques, G. (2019), “Detecting interference: An A/B test of A/B Tests—LinkedIn Engineering,” available at https://engineering.linkedin.com/blog/2019/06/detecting-interference–an-a-b-test-of-a-b-tests. (Accessed on 02/22/2022).
  • Saint-Jacques, G., Varshney, M., Simpson, J., and Xu, Y. (2019), “Using Ego-Clusters to Measure Network Effects at LinkedIn,” arXiv preprint arXiv:1903.08755.
  • Saveski, M., Pouget-Abadie, J., Saint-Jacques, G., Duan, W., Ghosh, S., Xu, Y., and Airoldi, E. M. (2017), “Detecting Network Effects: Randomizing Over Randomized Experiments,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17, pp. 1027–1035. Halifax, NS, Canada: Association for Computing Machinery. DOI: 10.1145/3097983.3098192.
  • Schroeder, B. (2021), “The Data Analytics Profession And Employment Is Exploding: Three Trends That Matter,” available at https://www.forbes.com/sites/bernhardschroeder/2021/06/11/the-data-analytics-profession-and-employment-is-exploding-threetrends-that-matter/?sh=12c5c3c3f81e. (Accessed on 03/10/2022).
  • Schultzberg, M., and Ankargren, S. (2023), “Choosing Sequential Testing Framework—Comparisons and Discussions,” available at https://engineering.atspotify.com/2023/03/choosingsequential-testing-framework-comparisons-and-discussions/.
  • Sepehri, A., and DiCiccio, C. (2020), “Interpretable Assessment of Fairness During Model Evaluation,” arXiv preprint arXiv:2010.13782.
  • Sexauer, C. (2022), “CUPED on Statsig,” available at https://blog.statsig.com/cuped-on-statsigd57f23122d0e. (Accessed on 08/18/2022).
  • Sharma, C. (2021), “Reducing Experiment Durations - Eppo Blog,” available at https://www.geteppo.com/blog/reducing-experiment-durations. (Accessed on 02/25/2022).
  • ———(2022), “Bending time in experimentation - Eppo Blog,” available at https://www.geteppo.com/blog/bending-time-in-experimentation. (Accessed on 08/18/2022).
  • Shi, C., Wang, X., Luo, S., Song, R., Zhu, H., and Ye, J. (2020), “A Reinforcement Learning Framework for Time-Dependent Causal Effects Evaluation in A/B Testing,” arXiv preprint arXiv:2002.01711.
  • Skotara, N. (2023), “Sequential Testing at Booking.com,” available at https://booking.ai/sequentialtesting-at-booking-com-650954a569c7.
  • Syrgkanis, V., Lei, V., Oprescu, M., Hei, M., Battocchi, K., and Lewis, G. (2019), “Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments,” in Advances in Neural Information Processing Systems, pp. 15193–15202.
  • Tang, D., Agarwal, A., O’Brien, D., and Meyer, M. (2010), “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pp. 17–26. Washington, DC: Association for Computing Machinery. DOI: 10.1145/1835804.1835810.
  • Thomke, S. H. (2020), Experimentation Works: The Surprising Power of Business Experiments, Brighton, MA: Harvard Business Press.
  • Tran, C., and Zheleva, E. (2019), “Learning Triggers for Heterogeneous Treatment Effects,” in Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33), pp. 5183–5190. DOI: 10.1609/aaai.v33i01.33015183.
  • Tsiatis, A. A. (2006), Semiparametric Theory and Missing Data, New York, NY: Springer.
  • Ugander, J., Karrer, B., Backstrom, L., and Kleinberg, J. (2013), “Graph Cluster Randomization: Network Exposure to Multiple Universes,” arXiv preprint arXiv:1305.6979.
  • Urban, S., Sreenivasan, R., and Kannan, V. (2016), “It’s All A/Bout Testing: The Netflix Experimentation Platform—by Netflix Technology Blog—Netflix TechBlog,” available at https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15. (Accessed on 10/26/2021).
  • Visser, D. (2020), “In-House Experimentation Platforms,” available at https://www.linkedin.com/pulse/inhouse-experimentation-platforms-denise-visser/. (Accessed on 09/15/2022).
  • Von Ahn, L. (2022), “Shareholder Letter Q2 2022,” available at https://investors.duolingo.com/staticfiles/ae55dd31-2ce4-41ac-bb26-948bafe8409c.
  • Wager, S., and Athey, S. (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242. DOI: 10.1080/01621459.2017.1319839.
  • Wald, A. (1945), “Sequential Tests of Statistical Hypotheses,” The Annals of Mathematical Statistics, 16, 117–186. DOI: 10.1214/aoms/1177731118.
  • ———(1947), Sequential Analysis, New York: Courier Corporation.
  • Wang, Y., Gupta, S., Lu, J., Mahmoudzadeh, A., and Liu, S. (2019), “On Heavy-user Bias in A/B Testing,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2425–2428. DOI: 10.1145/3357384.3358143.
  • Waudby-Smith, I., Arbour, D., Sinha, R., Kennedy, E. H., and Ramdas, A. (2021), “Time-Uniform Central Limit Theory, Asymptotic Confidence Sequences, and Anytime-Valid Causal Inference,” arXiv preprint arXiv:2103.06476.
  • Xia, T., Bhardwaj, S., Dmitriev, P., and Fabijan, A. (2019), “Safe Velocity: A Practical Guide to Software Deployment at Scale Using Controlled Rollout,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 11–20, IEEE. DOI: 10.1109/ICSE-SEIP.2019.00010.
  • Xie, H., and Aurisset, J. (2016), “Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 645–654. San Francisco, CA: Association for Computing Machinery. DOI: 10.1145/2939672.2939733.
  • Xie, Y., Chen, N., and Shi, X. (2018), “False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’18, pp. 876–885. London, UK: Association for Computing Machinery. DOI: 10.1145/3219819.3219860.
  • Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. (2015), “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, pp. 2227–2236. Sydney, NSW, Australia: ACM Press. DOI: 10.1145/2783258.2788602.
  • Xu, Y., Duan, W., and Huang, S. (2018), “SQR: Balancing Speed, Quality and Risk in Online Experiments,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’18, pp. 895–904. London, UK: ACM. DOI: 10.1145/3219819.3219875.
  • Yoon, S. (2018), “Designing A/B Tests in a Collaboration Network,” The Unofficial Google Data Science Blog, available at http://www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html. (Accessed on 06/11/2020).
  • Yu, M., Lu, W., and Song, R. (2020), “A New Framework for Online Testing of Heterogeneous Treatment Effect,” arXiv preprint arXiv:2002.03277.
  • Zhang, C., Coey, D., Goldman, M., and Karrer, B. (2021), “Regression Adjustment with Synthetic Controls in Online Experiments,” Meta Research. Available at https://research.facebook.com/publications/regression-adjustment-with-synthetic-controls-in-online-experiments/
  • Zhang, Q., and Kang, L. (2022), “Locally Optimal Design for A/B Tests in the Presence of Covariates and Network Dependence,” Technometrics, 64, 358–369. DOI: 10.1080/00401706.2022.2046169.
  • Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012), “Estimating Individualized Treatment Rules Using Outcome Weighted Learning,” Journal of the American Statistical Association, 107, 1106–1118. DOI: 10.1080/01621459.2012.695674.
  • Zhou, Y., Liu, Y., Li, P., and Hu, F. (2020), “Cluster-Adaptive Network A/B Testing: From Randomization to Estimation,” arXiv preprint arXiv:2008.08648.