References
- Adcock, C. (1997), “Sample Size Determination: A Review,” Journal of the Royal Statistical Society, Series D, 46, 261–283. DOI: https://doi.org/10.1111/1467-9884.00082.
- An, D. (2018), “Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed,” available at https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks/.
- Anderson-Cook, C. M., and Borror, C. M. (2016), “The Difference Between ‘Equivalent’ and ‘Not Different’,” Quality Engineering, 28, 249–262. DOI: https://doi.org/10.1080/08982112.2015.1079918.
- Benjamin, D. J., and Berger, J. O. (2019), “Three Recommendations for Improving the Use of p-values,” The American Statistician, 73, 186–191. DOI: https://doi.org/10.1080/00031305.2018.1543135.
- Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C. et al. (2018), “Redefine Statistical Significance,” Nature Human Behaviour, 2, 6–10. DOI: https://doi.org/10.1038/s41562-017-0189-z.
- Berman, R., Pekelis, L., Scott, A., and Van den Bulte, C. (2018), “P-Hacking and False Discovery in A/B Testing,” B Testing (December 11, 2018).
- Berman, R., and Van den Bulte, C. (2020), “False Discovery in A/B Testing,” B Testing (March 10, 2020).
- Betensky, R. A. (2019), “The p-Value Requires Context, Not a Threshold,” The American Statistician, 73, 115–117. DOI: https://doi.org/10.1080/00031305.2018.1529624.
- Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., and Dupont, W. D. (2019), “An Introduction to Second-Generation p-Values,” The American Statistician, 73, 157–167. DOI: https://doi.org/10.1080/00031305.2018.1537893.
- Colquhoun, D. (2019), “The False Positive Risk: A Proposal Concerning What to Do About p-Values,” The American Statistician, 73, 192–201. DOI: https://doi.org/10.1080/00031305.2018.1529622.
- de Castro, M., and Galea, M. (2021), “Bayesian Inference for the Pairwise Probability of Agreement Using Data From Several Measurement Systems,” Quality Engineering, 33, 571–580. DOI: https://doi.org/10.1080/08982112.2021.1931317.
- De Santis, F. (2007), “Using Historical Data for Bayesian Sample Size Determination,” Journal of the Royal Statistical Society, 170, 95–113. DOI: https://doi.org/10.1111/j.1467-985X.2006.00438.x.
- Deng, A. (2015), “Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments,” in Proceedings of the 24th International Conference on World Wide Web, pp. 923–928.
- Deng, A., Li, Y., Lu, J., and Ramamurthy, V. (2019), “On Post-Selection Inference in A/B Tests,” arXiv:1910.03788.
- Deng, A., Lu, J., and Chen, S. (2016), “Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 243–252. IEEE. DOI: https://doi.org/10.1109/DSAA.2016.33.
- Gannon, M. A., de Bragança Pereira, C. A., and Polpo, A. (2019), “Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels,” The American Statistician, 73, 213–222. DOI: https://doi.org/10.1080/00031305.2018.1518268.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian Data Analysis, Boca Raton, FL: CRC Press.
- Goodman, W. M., Spruill, S. E., and Komaroff, E. (2019), “A Proposed Hybrid Effect Size Plus p-Value Criterion: Empirical Evidence Supporting Its Use,” The American Statistician, 73, 168–185. DOI: https://doi.org/10.1080/00031305.2018.1564697.
- Greenland, S. (2019), “Valid p-Values Behave Exactly as They Should: Some Misleading Criticisms of p-Values and Their Resolution With s-Values,” The American Statistician, 73, 106–114. DOI: https://doi.org/10.1080/00031305.2018.1529625.
- Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D. et al. (2019), “Top Challenges From the First Practical Online Controlled Experiments Summit,” ACM SIGKDD Explorations Newsletter, 21, 20–35. DOI: https://doi.org/10.1145/3331651.3331655.
- Hern, A. (2014), “Why Google has 200M Reasons to Put Engineers Over Designers,” available at https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers.
- Hoffmann, T., and Wagenmakers, E.-J. (2021), “Bayesian Inference for the A/B test: Example Applications With r and jasp,” PsyArXiv. June 10.
- Hurlbert, S. H., Levine, R. A., and Utts, J. (2019), “Coup de grâce for a tough old bull:“statistically significant” expires,” The American Statistician, 73, 352–357. DOI: https://doi.org/10.1080/00031305.2018.1543616.
- Jeffreys, H. (1935), “Some Tests of Significance, Treated by the Theory of Probability,” in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 31, pp. 203–222. Cambridge, MA: Cambridge University Press. DOI: https://doi.org/10.1017/S030500410001330X.
- Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017), “Peeking at A/B Tests: Why It Matters, and What to Do About It,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1517–1525.
- Johari, R., Pekelis, L., and Walsh, D. J. (2015), “Always Valid Inference: Bringing Sequential Analysis to A/B Testing,” arXiv:1512.04922.
- Jones-Farmer, L. A. (2019), “Leveraging Industrial Statistics in the Data Revolution: The Youden Memorial Address at the 63rd Annual Fall Technical Conference,” Quality Engineering, 31, 205–211. DOI: https://doi.org/10.1080/08982112.2019.1572187.
- Joseph, L., and Belisle, P. (1997), “Bayesian Sample Size Determination for Normal Means and Differences Between Normal Means,” Journal of the Royal Statistical Society, Series D, 46, 209–226. DOI: https://doi.org/10.1111/1467-9884.00077.
- Joseph, L., and Wolfson, D. B. (1997), “Interval-Based Versus Decision Theoretic Criteria for the Choice of Sample Size,” Journal of the Royal Statistical Society, Series D, 46, 145–149. DOI: https://doi.org/10.1111/1467-9884.00070.
- Kamalbasha, S., and Eugster, M. J. (2021), “Bayesian A/B Testing for Business Decisions,” in Data Science–Analytics and Applications, Haber P., Lampoltshammer T., Mayr M., Plankensteiner K., eds. pp. 50–57. Vieweg, Wiesbaden: Springer.
- Kass, R. E., and Raftery, A. E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795. DOI: https://doi.org/10.1080/01621459.1995.10476572.
- Kohavi, R., Tang, D., and Xu, Y. (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge: Cambridge University Press.
- Kohavi, R., and Thomke, S. (2017), “The Surprising Power of Online Experiments,” Harvard Business Review, 95, 74–82.
- Kruschke, J. K. (2011), “Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison,” Perspectives on Psychological Science, 6, 299–312. DOI: https://doi.org/10.1177/1745691611406925.
- Kruschke, J. K. (2013), “Bayesian Estimation Supersedes the t Test,” Journal of Experimental Psychology: General, 142, 573–603.
- Kruschke, J. K. (2018), “Rejecting or Accepting Parameter Values in Bayesian Estimation,” Advances in Methods and Practices in Psychological Science, 1, 270–280.
- Kruschke, J. K., and Liddell, T. M. (2018), “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis From a Bayesian Perspective,” Psychonomic Bulletin & Review, 25, 178–206.
- Liu, M., Sun, X., Varshney, M., and Xu, Y. (2019), “Large-Scale Online Experimentation With Quantile Metrics,” arXiv:1903.08762.
- Lu, L., Anderson-Cook, C., Stevens, N., and Hagar, M. (2021), “Using a Baseline With the Probability of Agreement to Compare Distribution Characteristics,” Quality Engineering (submitted).
- Luca, M., and Bazerman, M. H. (2020), The Power of Experiments: Decision Making in a Data-Driven World, Cambridge, MA: MIT Press.
- Matthews, R. A. (2019), “Moving Towards the Post p < 0.05 Era Via the Analysis of Credibility,” The American Statistician, 73, 202–212.
- McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019), “Abandon Statistical Significance,” The American Statistician, 73, 235–245. DOI: https://doi.org/10.1080/00031305.2018.1527253.
- Plummer, M. (2003), “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling,” in Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vol. 124, pp. 10. Vienna, Austria.
- Plummer, M. (2013), “rjags: Bayesian Graphical Models Using MCMC,” R package version 3 (10).
- Rougier, J. (2019), “p-values, Bayes Factors, and Sufficiency,” The American Statistician, 73, 148–151. DOI: https://doi.org/10.1080/00031305.2018.1502684.
- Saint-Jacques, G., Varshney, M., Simpson, J., and Xu, Y. (2019), “Using Ego-Clusters to Measure Network Effects at Linkedin,” arXiv:1903.08755.
- Scott, S. L. (2010), “A Modern Bayesian Look at the Multi-Armed Bandit,” Applied Stochastic Models in Business and Industry, 26, 639–658. DOI: https://doi.org/10.1002/asmb.874.
- Scott, S. L. (2015), “Multi-Armed Bandit Experiments in the Online Service Economy,” Applied Stochastic Models in Business and Industry, 31, 37–45.
- Siroker, D. (2010), “How Obama Raised $60 Million by Running a Simple Experiment,” available at https://www.optimizely.com/insights/blog/how-obama-raised-60-million-by-running-a-simple-experiment/.
- Siroker, D., and Koomen, P. (2013), A/B Testing: The Most Powerful Way to Turn Clicks Into Customers, Hoboken, NJ: Wiley.
- Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., and Wagenmakers, E.-J. (2019), “A Tutorial on Bayes Factor Design Analysis Using an Informed Prior,” Behavior Research Methods, 51, 1042–1058. DOI: https://doi.org/10.3758/s13428-018-01189-8.
- Stevens, N. T., and Anderson-Cook, C. M. (2019), “Design and Analysis of Confirmation Experiments,” Journal of Quality Technology, 51, 109–124. DOI: https://doi.org/10.1080/00224065.2019.1571344.
- Stevens, N. T., and Lu, L. (2020), “Comparing Kaplan–Meier Curves With the Probability of Agreement,” Statistics in Medicine, 39, 4621– 4635. DOI: https://doi.org/10.1002/sim.8744.
- Stevens, N. T., Lu, L., Anderson-Cook, C. M., and Rigdon, S. E. (2020), “Bayesian Probability of Agreement for Comparing Survival or Reliability Functions With Parametric Lifetime Regression Models,” Quality Engineering, 1–21. DOI: https://doi.org/10.1080/08982112.2020.1741619.
- Stevens, N. T., Rigdon, S. E., and Anderson-Cook, C. M. (2018), “Bayesian Probability of Predictive Agreement for Comparing the Outcome of Two Separate Regressions,” Quality and Reliability Engineering International, 34, 968–978. DOI: https://doi.org/10.1002/qre.2284.
- Stevens, N. T., Rigdon, S. E., and Anderson-Cook, C. M. (2020), “Bayesian Probability of Agreement for Comparing the Similarity of Response Surfaces,” Journal of Quality Technology, 52, 67–80.
- Stevens, N. T., Steiner, S. H., and MacKay, R. J. (2017), “Assessing Agreement Between Two Measurement Systems: An Alternative to the Limits of Agreement Approach,” Statistical Methods in Medical Research, 26, 2487–2504. DOI: https://doi.org/10.1177/0962280215601133.
- Stevens, N. T., Steiner, S. H., and MacKay, R. J. (2018), “Comparing Heteroscedastic Measurement Systems With the Probability of Agreement,” Statistical Methods in Medical Research, 27, 3420–3435.
- Thomke, S. H. (2020), Experimentation Works: The Surprising Power of Business Experiments, Cambridge: Harvard Business Press.
- Walker, E., and Nowacki, A. S. (2011), “Understanding Equivalence and Noninferiority Testing,” Journal of General Internal Medicine, 26, 192–196. DOI: https://doi.org/10.1007/s11606-010-1513-8.
- Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133. DOI: https://doi.org/10.1080/00031305.2016.1154108.
- Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. (2019), “Moving to a World Beyond ‘p ¡ 0.05,”’ The American Statistician, 73, 1–19. DOI: https://doi.org/10.1080/00031305.2019.1583913.
- Wellek, S. (2010), Testing Statistical Hypotheses of Equivalence and Noninferiority, Boca Raton, FL: Chapman and Hall/CRC.
- Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. (2015), “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2227–2236.
- Ziliak, S. T. (2019), “How Large Are Your g-Values? Try Gosset’s Guinnessometrics When a Little ‘p’ is Not Enough,” The American Statistician, 73, 281–290. DOI: https://doi.org/10.1080/00031305.2018.1514325.