The American Statistician Volume 76, 2022 - Issue 3

General

Comparative Probability Metrics: Using Posterior Probabilities to Account for Practical Equivalence in A/B tests

Nathaniel T. Stevens, Department of Statistics & Actuarial Science, University of Waterloo, Waterloo, ON, Canada (corresponding author)

Luke Hagar, Department of Statistics & Actuarial Science, University of Waterloo, Waterloo, ON, Canada
https://orcid.org/0000-0002-1093-9463

Pages 224-237 | Received 21 Dec 2020, Accepted 20 Oct 2021, Published online: 04 Jan 2022

https://doi.org/10.1080/00031305.2021.2000495

References

Adcock, C. (1997), “Sample Size Determination: A Review,” Journal of the Royal Statistical Society, Series D, 46, 261–283. DOI: https://doi.org/10.1111/1467-9884.00082.
An, D. (2018), “Find Out How You Stack Up to New Industry Benchmarks for Mobile Page Speed,” available at https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks/.
Anderson-Cook, C. M., and Borror, C. M. (2016), “The Difference Between ‘Equivalent’ and ‘Not Different’,” Quality Engineering, 28, 249–262. DOI: https://doi.org/10.1080/08982112.2015.1079918.
Benjamin, D. J., and Berger, J. O. (2019), “Three Recommendations for Improving the Use of p-values,” The American Statistician, 73, 186–191. DOI: https://doi.org/10.1080/00031305.2018.1543135.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C. et al. (2018), “Redefine Statistical Significance,” Nature Human Behaviour, 2, 6–10. DOI: https://doi.org/10.1038/s41562-017-0189-z.
Berman, R., Pekelis, L., Scott, A., and Van den Bulte, C. (2018), “P-Hacking and False Discovery in A/B Testing” (December 11, 2018).
Berman, R., and Van den Bulte, C. (2020), “False Discovery in A/B Testing” (March 10, 2020).
Betensky, R. A. (2019), “The p-Value Requires Context, Not a Threshold,” The American Statistician, 73, 115–117. DOI: https://doi.org/10.1080/00031305.2018.1529624.
Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., and Dupont, W. D. (2019), “An Introduction to Second-Generation p-Values,” The American Statistician, 73, 157–167. DOI: https://doi.org/10.1080/00031305.2018.1537893.
Colquhoun, D. (2019), “The False Positive Risk: A Proposal Concerning What to Do About p-Values,” The American Statistician, 73, 192–201. DOI: https://doi.org/10.1080/00031305.2018.1529622.
de Castro, M., and Galea, M. (2021), “Bayesian Inference for the Pairwise Probability of Agreement Using Data From Several Measurement Systems,” Quality Engineering, 33, 571–580. DOI: https://doi.org/10.1080/08982112.2021.1931317.
De Santis, F. (2007), “Using Historical Data for Bayesian Sample Size Determination,” Journal of the Royal Statistical Society, Series A, 170, 95–113. DOI: https://doi.org/10.1111/j.1467-985X.2006.00438.x.
Deng, A. (2015), “Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments,” in Proceedings of the 24th International Conference on World Wide Web, pp. 923–928.
Deng, A., Li, Y., Lu, J., and Ramamurthy, V. (2019), “On Post-Selection Inference in A/B Tests,” arXiv:1910.03788.
Deng, A., Lu, J., and Chen, S. (2016), “Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 243–252. IEEE. DOI: https://doi.org/10.1109/DSAA.2016.33.
Gannon, M. A., de Bragança Pereira, C. A., and Polpo, A. (2019), “Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels,” The American Statistician, 73, 213–222. DOI: https://doi.org/10.1080/00031305.2018.1518268.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian Data Analysis, Boca Raton, FL: CRC Press.
Goodman, W. M., Spruill, S. E., and Komaroff, E. (2019), “A Proposed Hybrid Effect Size Plus p-Value Criterion: Empirical Evidence Supporting Its Use,” The American Statistician, 73, 168–185. DOI: https://doi.org/10.1080/00031305.2018.1564697.
Greenland, S. (2019), “Valid p-Values Behave Exactly as They Should: Some Misleading Criticisms of p-Values and Their Resolution With s-Values,” The American Statistician, 73, 106–114. DOI: https://doi.org/10.1080/00031305.2018.1529625.
Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D. et al. (2019), “Top Challenges From the First Practical Online Controlled Experiments Summit,” ACM SIGKDD Explorations Newsletter, 21, 20–35. DOI: https://doi.org/10.1145/3331651.3331655.
Hern, A. (2014), “Why Google has 200M Reasons to Put Engineers Over Designers,” available at https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers.
Hoffmann, T., and Wagenmakers, E.-J. (2021), “Bayesian Inference for the A/B Test: Example Applications With R and JASP,” PsyArXiv. June 10.
Hurlbert, S. H., Levine, R. A., and Utts, J. (2019), “Coup de Grâce for a Tough Old Bull: ‘Statistically Significant’ Expires,” The American Statistician, 73, 352–357. DOI: https://doi.org/10.1080/00031305.2018.1543616.
Jeffreys, H. (1935), “Some Tests of Significance, Treated by the Theory of Probability,” in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 31, pp. 203–222. Cambridge: Cambridge University Press. DOI: https://doi.org/10.1017/S030500410001330X.
Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017), “Peeking at A/B Tests: Why It Matters, and What to Do About It,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1517–1525.
Johari, R., Pekelis, L., and Walsh, D. J. (2015), “Always Valid Inference: Bringing Sequential Analysis to A/B Testing,” arXiv:1512.04922.
Jones-Farmer, L. A. (2019), “Leveraging Industrial Statistics in the Data Revolution: The Youden Memorial Address at the 63rd Annual Fall Technical Conference,” Quality Engineering, 31, 205–211. DOI: https://doi.org/10.1080/08982112.2019.1572187.
Joseph, L., and Belisle, P. (1997), “Bayesian Sample Size Determination for Normal Means and Differences Between Normal Means,” Journal of the Royal Statistical Society, Series D, 46, 209–226. DOI: https://doi.org/10.1111/1467-9884.00077.
Joseph, L., and Wolfson, D. B. (1997), “Interval-Based Versus Decision Theoretic Criteria for the Choice of Sample Size,” Journal of the Royal Statistical Society, Series D, 46, 145–149. DOI: https://doi.org/10.1111/1467-9884.00070.
Kamalbasha, S., and Eugster, M. J. (2021), “Bayesian A/B Testing for Business Decisions,” in Data Science–Analytics and Applications, eds. P. Haber, T. Lampoltshammer, M. Mayr, and K. Plankensteiner, pp. 50–57, Wiesbaden: Springer Vieweg.
Kass, R. E., and Raftery, A. E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795. DOI: https://doi.org/10.1080/01621459.1995.10476572.
Kohavi, R., Tang, D., and Xu, Y. (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge: Cambridge University Press.
Kohavi, R., and Thomke, S. (2017), “The Surprising Power of Online Experiments,” Harvard Business Review, 95, 74–82.
Kruschke, J. K. (2011), “Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison,” Perspectives on Psychological Science, 6, 299–312. DOI: https://doi.org/10.1177/1745691611406925.
Kruschke, J. K. (2013), “Bayesian Estimation Supersedes the t Test,” Journal of Experimental Psychology: General, 142, 573–603.
Kruschke, J. K. (2018), “Rejecting or Accepting Parameter Values in Bayesian Estimation,” Advances in Methods and Practices in Psychological Science, 1, 270–280.
Kruschke, J. K., and Liddell, T. M. (2018), “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis From a Bayesian Perspective,” Psychonomic Bulletin & Review, 25, 178–206.
Liu, M., Sun, X., Varshney, M., and Xu, Y. (2019), “Large-Scale Online Experimentation With Quantile Metrics,” arXiv:1903.08762.
Lu, L., Anderson-Cook, C., Stevens, N., and Hagar, M. (2021), “Using a Baseline With the Probability of Agreement to Compare Distribution Characteristics,” Quality Engineering (submitted).
Luca, M., and Bazerman, M. H. (2020), The Power of Experiments: Decision Making in a Data-Driven World, Cambridge, MA: MIT Press.
Matthews, R. A. (2019), “Moving Towards the Post p < 0.05 Era Via the Analysis of Credibility,” The American Statistician, 73, 202–212.
McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019), “Abandon Statistical Significance,” The American Statistician, 73, 235–245. DOI: https://doi.org/10.1080/00031305.2018.1527253.
Plummer, M. (2003), “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling,” in Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vol. 124, pp. 10. Vienna, Austria.
Plummer, M. (2013), “rjags: Bayesian Graphical Models Using MCMC,” R package version 3 (10).
Rougier, J. (2019), “p-values, Bayes Factors, and Sufficiency,” The American Statistician, 73, 148–151. DOI: https://doi.org/10.1080/00031305.2018.1502684.
Saint-Jacques, G., Varshney, M., Simpson, J., and Xu, Y. (2019), “Using Ego-Clusters to Measure Network Effects at Linkedin,” arXiv:1903.08755.
Scott, S. L. (2010), “A Modern Bayesian Look at the Multi-Armed Bandit,” Applied Stochastic Models in Business and Industry, 26, 639–658. DOI: https://doi.org/10.1002/asmb.874.
Scott, S. L. (2015), “Multi-Armed Bandit Experiments in the Online Service Economy,” Applied Stochastic Models in Business and Industry, 31, 37–45.
Siroker, D. (2010), “How Obama Raised $60 Million by Running a Simple Experiment,” available at https://www.optimizely.com/insights/blog/how-obama-raised-60-million-by-running-a-simple-experiment/.
Siroker, D., and Koomen, P. (2013), A/B Testing: The Most Powerful Way to Turn Clicks Into Customers, Hoboken, NJ: Wiley.
Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., and Wagenmakers, E.-J. (2019), “A Tutorial on Bayes Factor Design Analysis Using an Informed Prior,” Behavior Research Methods, 51, 1042–1058. DOI: https://doi.org/10.3758/s13428-018-01189-8.
Stevens, N. T., and Anderson-Cook, C. M. (2019), “Design and Analysis of Confirmation Experiments,” Journal of Quality Technology, 51, 109–124. DOI: https://doi.org/10.1080/00224065.2019.1571344.
Stevens, N. T., and Lu, L. (2020), “Comparing Kaplan–Meier Curves With the Probability of Agreement,” Statistics in Medicine, 39, 4621–4635. DOI: https://doi.org/10.1002/sim.8744.
Stevens, N. T., Lu, L., Anderson-Cook, C. M., and Rigdon, S. E. (2020), “Bayesian Probability of Agreement for Comparing Survival or Reliability Functions With Parametric Lifetime Regression Models,” Quality Engineering, 1–21. DOI: https://doi.org/10.1080/08982112.2020.1741619.
Stevens, N. T., Rigdon, S. E., and Anderson-Cook, C. M. (2018), “Bayesian Probability of Predictive Agreement for Comparing the Outcome of Two Separate Regressions,” Quality and Reliability Engineering International, 34, 968–978. DOI: https://doi.org/10.1002/qre.2284.
Stevens, N. T., Rigdon, S. E., and Anderson-Cook, C. M. (2020), “Bayesian Probability of Agreement for Comparing the Similarity of Response Surfaces,” Journal of Quality Technology, 52, 67–80.
Stevens, N. T., Steiner, S. H., and MacKay, R. J. (2017), “Assessing Agreement Between Two Measurement Systems: An Alternative to the Limits of Agreement Approach,” Statistical Methods in Medical Research, 26, 2487–2504. DOI: https://doi.org/10.1177/0962280215601133.
Stevens, N. T., Steiner, S. H., and MacKay, R. J. (2018), “Comparing Heteroscedastic Measurement Systems With the Probability of Agreement,” Statistical Methods in Medical Research, 27, 3420–3435.
Thomke, S. H. (2020), Experimentation Works: The Surprising Power of Business Experiments, Cambridge: Harvard Business Press.
Walker, E., and Nowacki, A. S. (2011), “Understanding Equivalence and Noninferiority Testing,” Journal of General Internal Medicine, 26, 192–196. DOI: https://doi.org/10.1007/s11606-010-1513-8.
Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133. DOI: https://doi.org/10.1080/00031305.2016.1154108.
Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. (2019), “Moving to a World Beyond ‘p < 0.05’,” The American Statistician, 73, 1–19. DOI: https://doi.org/10.1080/00031305.2019.1583913.
Wellek, S. (2010), Testing Statistical Hypotheses of Equivalence and Noninferiority, Boca Raton, FL: Chapman and Hall/CRC.
Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. (2015), “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2227–2236.
Ziliak, S. T. (2019), “How Large Are Your g-Values? Try Gosset’s Guinnessometrics When a Little ‘p’ is Not Enough,” The American Statistician, 73, 281–290. DOI: https://doi.org/10.1080/00031305.2018.1514325.
