Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology

Pages 135-149 | Received 14 Dec 2022, Accepted 31 Aug 2023, Published online: 18 Oct 2023

Abstract

The rise of internet-based services and products in the late 1990s brought about an unprecedented opportunity for online businesses to engage in large scale data-driven decision making. Over the past two decades, organizations such as Airbnb, Alibaba, Amazon, Baidu, Booking.com, Alphabet’s Google, LinkedIn, Lyft, Meta’s Facebook, Microsoft, Netflix, Twitter, Uber, and Yandex have invested tremendous resources in online controlled experiments (OCEs) to assess the impact of innovation on their customers and businesses. Running OCEs at scale has presented a host of challenges requiring solutions from many domains. In this article we review challenges that require new statistical methodologies to address them. In particular, we discuss the practice and culture of online experimentation, as well as its statistics literature, placing the current methodologies within their relevant statistical lineages and providing illustrative examples of OCE applications. Our goal is to raise academic statisticians’ awareness of these new research opportunities to increase collaboration between academia and the online industry.

1 Introduction

1.1 Background

It is estimated that in 2022, 5.16 billion people (64.4% of the world’s population) used the internet, each engaging with it on average 6.5 hr per day, and in aggregate spending over $5 trillion (USD) on consumer goods, travel and tourism, digital media, and health-related products and services (Kemp Citation2023). In 2023, e-commerce is predicted to account for 21% of all commerce, and by 2025 that number is expected to grow to nearly 25% (Keenan Citation2022). Given this scale of internet use, it is unsurprising that the optimization of online products and services is of great interest to online businesses and online components of traditional brick-and-mortar businesses.

Online controlled experiments (OCEs), digital versions of randomized controlled trials (RCTs) (Box, Hunter, and Hunter Citation2005), collect user-generated data to test and improve internet-based products and services (Kohavi and Longbotham Citation2023). Informally referred to as A/B tests, OCEs are an indispensable tool for major technology companies when it comes to maximizing revenue and optimizing the user experience (Luca and Bazerman Citation2021). Industry giants run hundreds of experiments on millions of users every day (Gupta et al. Citation2019), testing changes along multiple axes including: websites, services, and installed software; desktop and mobile devices; front- and back-end product features; personalization and recommendations; and monetization strategies. With OCEs, the causal impact of such changes—whether it be positive, negative, or zero—can be estimated. While this article focuses on experiments, we acknowledge that observational methods for causal inference are also relevant here, though their use is less prominent. We discuss them briefly in Section SM4 of the supplementary material.

While most positive changes are small, and improvement is incremental (Bojinov and Gupta Citation2022), results from OCEs can be potentially lucrative. Google’s famous “41 shades of blue” experiment is a classic example of an OCE that translated into a $200 million (USD) increase in annual revenue (Hern Citation2014); Amazon used insights from an OCE to move credit card offers from the homepage to the checkout page, resulting in tens of millions (USD) in profit annually (Kohavi and Thomke Citation2017); Bing deployed an A/B test for ad displays that resulted in $100 million (USD) of additional annual revenue in the United States alone (Kohavi, Tang, and Xu Citation2020). Even though such million-dollar ideas are relatively rare, the net gains from OCEs have been so profound that many organizations have completely overhauled their business models, with experimentation at the epicenter (Thomke Citation2020). For instance, Netflix attributes its membership growth from two countries to over 190 in the span of just 6 years to its adoption of online controlled experimentation (Urban, Sreenivasan, and Kannan Citation2016), and Duolingo’s 2022 Q2 shareholder letter attributes their growth to an “A/B test everything” mentality (Von Ahn Citation2022). The document even includes a description of their A/B testing process and several examples of how the product has evolved through experimentation.

Organizations that have accepted OCEs as standard practice generally adopt a so-called “culture of experimentation,” which is rooted in three tenets (Kohavi et al. Citation2013): (a) the organization wants to make data-driven decisions, (b) the organization is willing to invest in the people and infrastructure needed to run trustworthy experiments, and (c) the organization recognizes that it is poor at assessing the value of ideas. Generally, more than 50% of ideas fail to generate meaningful improvements (Kohavi, Tang, and Xu Citation2020). And in some domains, the failure rate of experiments (due to a combination of bad ideas or buggy implementation) is 90% or higher (Kohavi, Deng, and Vermeer Citation2022). Thus, carefully executed experiments provide a trustworthy, data-driven means to determine which ideas improve key metrics, which hurt, and which have no detectable impact, allowing the organization to invest in those that work, while pivoting to avoid the others. Within this culture, the attitude of “more, better, faster” is prevalent (Tang et al. Citation2010); organizations strive to increase the number of experiments so that all changes are properly evaluated; invalid experiments and harmful combinations of variants are straightforward to identify; and deployment, run time, and analysis occur within a relatively short period of time.

Compared to physical controlled experiments (in e.g., agriculture, manufacturing, pharmaceutical development), the cost incurred to design and run an OCE is low, even negligible for organizations with expertise in software development and statistics. Consequently, practitioners are able to run large numbers of experiments with potentially enormous sample sizes. In the case of large tech organizations, the combination of new features and modifications can result in billions of different versions of a given product (Kohavi et al. Citation2013). In these cases, hundreds of thousands of users are randomized concurrently to hundreds of experiments running simultaneously, so users may be in hundreds of experiments at a time (Gupta et al. Citation2019). Companies performing OCEs at this scale typically use experimentation platforms (software that is licensed or developed in-house) to automate the experimentation process, such as randomizing users, collecting data, managing concurrent experiments, and generating analysis reports (Tang et al. Citation2010; Fabijan et al. Citation2018; Ivaniuk Citation2020; Kohlmeier Citation2022). See Visser (Citation2020) for a catalogue of in-house experimentation platforms developed by several prominent tech companies. Smaller companies tend to opt for third-party vendors that specialize in setting up, deploying, and analyzing OCEs. Several vendors were compared in Kohavi (Citation2023) including (in alphabetical order): A/B Smartly, AB Tasty, EPPO, GrowthBook, Kameleoon, Optimizely, Split, Statsig, VWO, and Webtrends-Optimize. In all cases, this level of automation necessitates data quality checks like A/A tests and sample ratio mismatch (SRM) tests to establish trust in the experimentation platform. For further discussion of these practices and challenges, see Chapters 19 and 21 of Kohavi, Tang, and Xu (Citation2020), and the introduction in Lindon and Malek (Citation2020).

In this online setting, with the culture of testing as many ideas as possible, as quickly as possible, novel practical issues and modern challenges abound (see, e.g., Gupta et al. Citation2019; Bojinov and Gupta Citation2022; Quin et al. Citation2023 for nontechnical discussions, and Georgiev (Citation2019) for a technical primer). The context in which OCEs operate departs markedly from the original applications for which traditional experiments were developed nearly a century ago; understanding this context is vital for developing relevant methodology for OCEs. For statisticians, online controlled experimentation provides a host of new opportunities for methodological and theoretical development. New approaches that fit the nuances of OCE applications are in high demand, with the majority of cutting-edge research spearheaded by those in industry. The purpose of this article, therefore, is to review the statistical methodology associated with OCEs, summarize its accompanying literature, and provide an overview of open statistical problems. We hope that increasing academic statisticians’ awareness of these research opportunities will help to bridge the gap between academia and the online industry.

1.2 The General Framework

Here we introduce the notation and key terms that will be used throughout this review and we describe the basic statistical framework for OCEs. It is useful to note that as a field, online experimentation has developed disparately across industries and domains, thus there are no unifying standards with respect to methodological approach and notation; even the term “controlled experiment” goes by different names depending on the organization: “flights” at Microsoft, “bucket tests” at Yahoo, “field experiments” at Facebook, and “1% experiments/click evaluations” at Google. The following notation largely draws from traditional RCT and causal inference literature, and is intended to help unify much of the OCE literature.

Let K be the number of variants (also known as buckets, arms, splits, and treatments) that compose the experiment. Ordinarily one of these variants is a control against which all other variants are compared. Unless explicitly stated, we shall assume for the rest of this review a standard treatment-versus-control setup, in which case K = 2. While multi-variant (K > 2) experiments exist in this space (colloquially referred to as “A/B/n tests”), we focus on the K = 2 “A/B test” for pedagogical reasons; even with K > 2 variants, determining which is optimal typically reduces to a pairwise comparison between each treatment and the control.

In A/B tests, n experimental units (e.g., users, cookies, sessions, etc.) are typically randomized in real time to one of the variants, and a response observation Yi is collected for each i = 1, …, n. It is important to note that these response observations are typically themselves aggregates of more granular raw event data (Boucher et al. Citation2020). For instance, consider the response variable “number of clicks per user”, which may be a count per user aggregated across sessions and/or pages. Interest then lies in optimizing some metric, which is a numerical summary of the response. Extending the previous example, interest may lie in maximizing the average number of clicks per user. Such metrics are often, but not always, averages. In some contexts, quantile or double-average metrics may be more suitable. We discuss such applications in more detail in Section SM4 of the supplementary material.
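
As a small illustration of this aggregation step, the following sketch (using a hypothetical event-log schema and toy values) rolls raw click events up to a per-user count and then into an average-clicks-per-user metric.

```python
import pandas as pd

# Raw event log (hypothetical schema): one row per click event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "session_id": ["a", "a", "b", "c", "c", "d"],
    "clicks": [1, 1, 1, 1, 1, 1],
})

# Aggregate raw events to the analysis unit (here, clicks per user),
# then summarize into the metric of interest (average clicks per user).
clicks_per_user = events.groupby("user_id")["clicks"].sum()
metric = clicks_per_user.mean()
print(clicks_per_user.to_dict(), metric)
```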

For simplicity of exposition, we have described a situation with one metric and hence one response variable. However, in practice there may be hundreds (even thousands) of metrics computed, many of which are used for debugging, some of which may be organizational guardrail metrics that the experimenters wish to avoid negatively impacting, and a small number of which compose the overall evaluation criterion (OEC) which is to be optimized. In general, defining and selecting metrics (as well as their corresponding randomization and analysis units) are key components of OCEs and we direct the reader to Kohavi et al. (Citation2009), Deng and Shi (Citation2016), Dmitriev et al. (Citation2017), and Kohavi, Tang, and Xu (Citation2020) for further discussion.

When the metric is an average, the primary goal of the experiment is to estimate the average treatment effect (ATE): the difference between the average outcome when the treatment is applied globally and when the control is applied globally. Within the potential outcomes framework (Neyman Citation1923; Rubin Citation1974), Yi(0) represents unit i’s response in the hypothetical scenario where i receives the control, and Yi(1) is the potential response when unit i receives the treatment. Letting Wi denote the binary treatment indicator for unit i, and given a particular treatment assignment to all experimental units W = (W1, …, Wn), the expected outcome is E[(1/n) Σ_{i=1}^{n} Yi(Wi)] = μ(W), and the ATE is therefore given by
(1) τ = μ(1) − μ(0) = (1/n) Σ_{i=1}^{n} E[Yi(1) − Yi(0)],
where these expectations may be taken with respect to the random sampling or the random assignment, depending on whether inference is sample-based or design-based (Abadie et al. Citation2020). In reality, i can only be assigned to a single variant at a time, thus one cannot directly observe both (Yi(0), Yi(1)), and so the ATE is typically estimated with the difference-in-means estimator,
(2) τ̂ = (1/n1) Σ_{i: Wi=1} Yi − (1/n0) Σ_{i: Wi=0} Yi,
where n0 and n1 are respectively the sizes of the control and treatment groups such that n0 + n1 = n. In practice it is also common to define the treatment effect as a relative percent, often referred to as lift, since it is easier to interpret and it is more stable (over experiment duration, for example).

Statistical significance is the most common mechanism by which a given treatment’s effectiveness is affirmed in an A/B test. Analyses of A/B tests are therefore most often carried out via two-sample hypothesis tests for τ with standard test statistics of the form τ̂/σ̂_τ̂, where σ̂_τ̂ is the estimated standard deviation of τ̂. Such analyses, and the designs that generate data for them, commonly assume that the response of each user does not depend on other users’ treatment assignments (the Stable Unit Treatment Value Assumption, or SUTVA). SUTVA is a reasonable assumption for many scenarios; however, in Section 6 we discuss OCE settings where the assumption is violated and alternative methodologies are necessary. In many scenarios, sample sizes are large enough to confidently exploit the central limit theorem, permitting the use of the standard normal null distribution. There are, however, scenarios in which only a fraction of the user base is experimented on and asymptotic normality cannot be assumed. Such scenarios are discussed in Section 2.2. Given the heavy reliance on p-values, it is important to acknowledge that the reproducibility crisis stemming from the misuse of hypothesis tests also plagues OCEs; p-value misinterpretation and problematic practices regularly lead to increased false-positive rates (Berman and Van den Bulte Citation2021; Kohavi, Deng, and Vermeer Citation2022). This is an area of ongoing practical and methodological concern in many fields, including online experiments.
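
To make the notation concrete, the following minimal sketch on simulated data (with illustrative effect sizes of our own choosing) computes the difference-in-means estimate, the corresponding lift, and the standard two-sample z-statistic and p-value described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated example: Y is a per-user metric, W the binary treatment
# indicator under a 50/50 randomization; the effect size is illustrative.
n = 200_000
W = rng.integers(0, 2, size=n)
Y = rng.normal(loc=1.25 + 0.02 * W, scale=6.0, size=n)

y1, y0 = Y[W == 1], Y[W == 0]
n1, n0 = len(y1), len(y0)

tau_hat = y1.mean() - y0.mean()          # difference-in-means estimate of the ATE
lift = tau_hat / y0.mean()               # relative (percent) effect

se = np.sqrt(y1.var(ddof=1) / n1 + y0.var(ddof=1) / n0)  # estimated sd of tau_hat
z = tau_hat / se                                          # standard test statistic
p_value = 2 * stats.norm.sf(abs(z))                       # normal approximation

print(f"tau_hat = {tau_hat:.4f} ({100 * lift:.2f}% lift), z = {z:.2f}, p = {p_value:.3f}")
```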

1.3 Roadmap

With this context and foundation laid, we now review the statistical research in this area and discuss the many open problems. The article proceeds as follows. Section 2 discusses techniques for improving experimental power—a critical issue despite the relatively large sample sizes in OCEs. Sections 3 and 4, respectively, present literature regarding the challenges of estimating heterogeneous and long-term treatment effects. Section 5 discusses the problem of optional stopping and how sequential testing methods have been adapted to run online experiments. All of these sections presume SUTVA holds; we summarize the literature that explores violations of this assumption in Section 6. We conclude the review with a call to action for further collaboration between academia and OCE practitioners in Section 7. Note that a supplementary material file accompanies the article in which we provide expanded coverage of certain topics from the main text, as well as a brief discussion of additional topics outside the scope of this review.

2 Sensitivity and Small Treatment Effects

Motivating Example: Suppose an e-commerce website observes that 5% of their visitors make a purchase and the average purchase is $25 with a standard deviation of σ = $6 during a one-week experiment period. Therefore, on average, a visitor spends $1.25. Suppose also the company’s annual revenue is $20 million, and gains or losses of $1 million (5%) are material. If the company wishes to run an experiment and detect a 5% change in revenue (i.e., δ = 1.25 × 0.05) with 80% power at a 5% significance level, a rough sample size calculation indicates they need n0 = n1 = 16σ²/δ² = (16 × 6²)/(1.25 × 0.05)² = 147,456 users per variant. This is reasonable for a small startup. However, suppose now that the company’s annual revenue is $50 billion, with gains or losses of $10 million (0.02%) of interest. An experiment designed to detect a 0.02% change in revenue (i.e., δ = 1.25 × 0.0002) requires n0 = n1 = 16σ²/δ² = (16 × 6²)/(1.25 × 0.0002)² ≈ 9.2 billion users per variant, that is, 18.4 billion users in a single week. The human population on Earth is about 8 billion at the time of writing, so it is impossible for this company to detect changes that would gain or lose them $10 million per year.
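
The rule-of-thumb calculation in this example is easy to reproduce directly; the sketch below implements n = 16σ²/δ² (the rough approximation for 80% power at a 5% significance level used above) and evaluates both scenarios.

```python
def users_per_variant(sigma: float, delta: float) -> float:
    """Rough sample size per variant for ~80% power at a 5% two-sided
    significance level, using the common rule of thumb n = 16 * sigma^2 / delta^2."""
    return 16 * sigma ** 2 / delta ** 2

# Numbers from the motivating example: average revenue per visitor $1.25, sd $6.
sigma = 6.0
print(f"{users_per_variant(sigma, 1.25 * 0.05):,.0f}")     # ~147,456 users per variant
print(f"{users_per_variant(sigma, 1.25 * 0.0002):,.0f}")   # ~9.2 billion users per variant
```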

Many leading organizations at the forefront of online controlled experimentation have user populations numbering in the hundreds of millions, if not billions. However, the sentiment that OCEs do not suffer from inadequate sample sizes is misconceived (Tang et al. Citation2010). Given the fundamental relationship between sample size and an experiment’s ability to detect true, nonzero treatment effects (namely its power), a key challenge facing even the largest of organizations is designing adequately powered experiments. A naive solution would be to simply extend the experiment’s duration, thereby increasing the number of users. However, as we elaborate upon in Section 4, this practice is ill-advised. Instead, it is better practice to employ a tactic that is tailored to the reason for insufficient power, which is generally one of three causes.

First, the treatment impacts the entire user population and the effect is roughly homogeneous, but very small in magnitude. As illustrated in the opening examples, even a fraction of a percent change can translate to millions of dollars in revenue. We discuss the literature around this issue in Section 2.1. Second, many experiments test features that do not affect all users, attenuating the average treatment effect (see Section 2.2). Third, the treatment effects on known subpopulations are of interest, where sample sizes are smaller by definition (we defer this discussion to Section 3). In general, research regarding improving experimental power for OCEs tends to focus on boosting sensitivity, either by directly reducing the variance of Yi or by reducing the variance of estimators for τ. The aforementioned subsections provide an overview of common methodology in this area. While specific methods of combating inadequate power are reviewed in this section, we encourage the reader to keep in mind that the issue of adequate power applies to all the challenges subsequently discussed in this review.

2.1 Transforming Y, Method of Control Variates, and Stratified Sampling

To improve sensitivity, a common approach is to transform Y into Y* of lower variance which, all else being equal, translates to a lower variance estimator of τ. In online experiments there can be dozens, even hundreds of metrics of potential interest, many with different properties that make it all but impossible to identify a “one size fits all” transformation. Much work has been devoted to documenting metric behavior and discussing techniques for metric definition. Kohavi et al. (Citation2014) describe several examples of nonintuitive metric behavior and other peculiarities, illustrating the benefits of identifying skewed metrics and capping (truncating) them to improve sensitivity. Other transformations for improving the sensitivity of Y include binarizing count metrics and revenue. Deng and Shi (Citation2016) define directionality (consistent behavior in one direction for positive treatment effects and in the opposite direction for negative effects) as an important feature when choosing metrics, suggesting that one should leverage prior experiments to compile a corpus of good metrics and to evaluate sensitivity and directionality with Bayesian priors. Deng and Shi (Citation2016) also propose aggregating metrics in the form of a weighted linear combination, which is adopted and expanded upon by Kharitonov, Drutsa, and Serdyukov (Citation2017). They frame finding sensitive combinations of metrics as a machine learning problem, incorporating both labeled and unlabeled data from past experiments. In Drutsa, Gusev, and Serdyukov (Citation2015), features are extracted from data while the experiment is running and used to forecast metrics over a hypothetical post-experiment period. The authors also note their methodology may be applied to long-term effect estimation using statistical surrogacy, which we further discuss in Section 4.
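
As a simple illustration of two of the transformations mentioned above, the following sketch caps a skewed simulated revenue metric at an arbitrary illustrative quantile and binarizes a count metric; the data and thresholds are not taken from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed per-user revenue: most users spend nothing, a few spend a lot.
revenue = rng.exponential(scale=2.0, size=100_000) * rng.binomial(1, 0.05, size=100_000)
clicks = rng.poisson(lam=0.3, size=100_000)

# Capping (truncating) at an illustrative threshold reduces the influence of outliers.
cap = np.quantile(revenue[revenue > 0], 0.99)
revenue_capped = np.minimum(revenue, cap)

# Binarizing a count metric ("did the user click at all?") also lowers variance.
clicked = (clicks > 0).astype(float)

print(f"var(revenue)        = {revenue.var():.3f}")
print(f"var(capped revenue) = {revenue_capped.var():.3f}")
print(f"var(clicks) = {clicks.var():.3f},  var(binarized) = {clicked.var():.3f}")
```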

In addition to transformations of Y, a popular approach is to define an efficient, mean-zero augmented estimator of τ using the method of control variates (Courthoud Citation2022; Sexauer Citation2022; Sharma Citation2022). Briefly, this method assumes, in addition to iid observations {Y1, …, Yn}, the availability of independent observations of a covariate, {X1, …, Xn}, such that E[Xi] = μx. Often, these covariate measurements are collected from prior logs or experiments. Let Yi* = Yi − θ(Xi − μx); then
var(Yi*) = var(Yi) + θ² var(Xi) − 2θ cov(Yi, Xi)
is minimized with respect to θ by the OLS solution θ = cov(Yi, Xi)/var(Xi). Putting this together in the context of sample means gives var(Ȳ*) = (1 − ρ²) var(Ȳ) ≤ var(Ȳ), where ρ = corr(Yi, Xi). Thus, an ATE estimator that uses the difference in treatment and control means of Yi* tends to have lower variance than the traditional τ̂, particularly when Xi is strongly correlated with Yi. For OCEs, this technique is referred to as CUPED (Controlled experiments Utilizing Pre-Experiment Data) and was first proposed by Deng et al. (Citation2013). The authors empirically demonstrate that an effective covariate choice is the same variable Yi but collected during a pre-experiment period (Xi = Yi^pre). Such a choice can drastically increase sensitivity and thereby reduce time to statistical significance in evaluating H1: τ ≠ 0. The authors also demonstrate that μx need not be known when Xi is uncorrelated with Wi, and they also emphasize that despite resembling ANCOVA, CUPED does not require any linear model assumptions and can be treated as efficiency augmentation as in semi-parametric estimation (Tsiatis Citation2006). Consequently, CUPED has become a standard tool for many practitioners, although it is important to note that it can only be applied to users for which prior information exists (Drutsa, Ufliand, and Gusev Citation2015; Jackson Citation2018; Gupta et al. Citation2019; Hopkins Citation2020; Sharma Citation2021).
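
The following is a minimal CUPED sketch on simulated data, assuming every user has a pre-experiment measurement of the same metric (in practice some users will not, as noted above); the variable names and effect sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: X is the pre-experiment value of the metric, Y the in-experiment value.
n = 200_000
W = rng.integers(0, 2, size=n)
X = rng.normal(10.0, 3.0, size=n)                        # pre-period metric
Y = 0.8 * X + rng.normal(0.0, 2.0, size=n) + 0.05 * W    # correlated outcome + small effect

cov_yx = np.cov(Y, X)
theta = cov_yx[0, 1] / cov_yx[1, 1]      # OLS coefficient minimizing var(Y*)
Y_cv = Y - theta * (X - X.mean())        # CUPED-adjusted outcome

def diff_in_means(y, w):
    return y[w == 1].mean() - y[w == 0].mean()

def std_err(y, w):
    n1, n0 = (w == 1).sum(), (w == 0).sum()
    return np.sqrt(y[w == 1].var(ddof=1) / n1 + y[w == 0].var(ddof=1) / n0)

print(f"unadjusted: tau_hat = {diff_in_means(Y, W):.4f}, se = {std_err(Y, W):.5f}")
print(f"CUPED     : tau_hat = {diff_in_means(Y_cv, W):.4f}, se = {std_err(Y_cv, W):.5f}")
```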

A key open question with respect to CUPED applications concerns the situation when the covariate alone is not sufficiently correlated with the response. An approach that shows promise employs synthetic controls, where one identifies a similar population without treatment exposure to use as covariates for modeling Y (Zhang et al. Citation2021). Another technique is to take advantage of a phenomenon that occurs in online experiments known as “triggering” (Deng et al. Citation2023), which we further discuss in Section 2.2. Further research with respect to the interplay between CUPED and other standard variance reduction techniques is also of interest. Xie and Aurisset (Citation2016) apply CUPED to large-scale A/B tests for a subscription streaming service, and Liou and Taylor (Citation2020) compare CUPED against variance-weighted estimators on a social media platform, finding that an aggregation of the two methods outperformed either individually. Deng et al. (Citation2013) note that CUPED also permits nonlinear adjustments to the response variable. Following this, Poyarkov et al. (Citation2016) develop an approach that assumes each user has a response Y and a set of features F ∈ R^p independent of treatment assignment. Let Y = f(F), where f is an unknown, nonparametric function that is estimated with machine learning. Following the general idea of control variates, the covariate is chosen to be the predicted outcomes of f̂. Poyarkov et al. (Citation2016) then use Y* = Y − f̂(F) as the primary metric for estimating the ATE, noting an increase in sensitivity compared to traditional A/B tests.

Closely related to the method of control variates/CUPED is stratified sampling. We discuss these connections as well as the use and drawbacks of stratified sampling in more detail in Section SM1 of the supplementary material.

2.2 Triggered Analysis

Motivating Example: Suppose engineers are testing a change made on an e-commerce website’s checkout page. Users in the experiment who never interact with this checkout page are not impacted by the experiment and so their treatment effect is zero. Many such users will increase noise and dilute the treatment effect. Sensitivity may be increased by analyzing only the users who could have been impacted by the experiment; those that were triggered into the analysis. Although this reduces sample size, the treatment effect among the triggered users is undiluted and therefore higher and easier to detect.

Triggered analysis broadly refers to an OCE analysis where one only considers users who have the potential of being impacted by an experiment, excluding those who would not be affected by the proposed variant (Deng et al. Citation2023; Kohavi et al. Citation2009; Kohavi Citation2012; Xu, Duan, and Huang Citation2018). Users are said to have triggered the experiment when they exhibit behavior that results in direct exposure to their assigned variant. Such users could be assigned to a variant either early (e.g., when first visiting the web site) or when they exhibit some type of behavior that triggers the experiment (e.g., when reaching the checkout page, which impacts the variant they receive). Key analysis challenges include: (a) generalizing the results from the triggered users to a broader population, and (b) reducing the variance of τ estimators to offset the smaller sample sizes that result from triggering. For an in-depth discussion of triggering case-studies, including the example above, see Chapter 20 of Kohavi, Tang, and Xu (Citation2020). We will provide a brief description and an overview of research in this area.

Let Ω be the overall user population and Θ ⊆ Ω the population of users who could be affected by the treatment. A given user is determined to belong to Θ via techniques such as conditional checks or counterfactual logging (Kohavi, Tang, and Xu Citation2020; Deng et al. Citation2023). If Θ comprises only a modest fraction of Ω (i.e., |Θ|/|Ω| ≤ 0.2, for instance), an experiment that samples data from the entire population could be severely under-powered, particularly when effect sizes are small (Kohavi et al. Citation2009). To mitigate this issue, practitioners focus analysis only on triggered users. The difference-in-means estimator τ̂Θ is an unbiased estimator for the ATE of the triggered population, τΘ, under standard assumptions. However, τΘ is typically larger than the population-level τΩ and the corresponding estimator generally has greater variance. The process of estimating τΩ from τ̂Θ is referred to as estimating the diluted treatment effect. This allows the business to estimate the global impact of launching a new feature to all users (some of whom will be directly impacted, and some of whom will not).

Most triggered analyses fall under the following framework. Assume a (not necessarily random) sample of N units, n of which are triggered. Each user i interacts with the website over multiple separate events. During each event, i may or may not trigger (e.g., i may interact with the checkout page during one event, but not another). The most common analysis technique is the user-trigger analysis, which incorporates all events beginning with the first event where i triggered. Such analyses are quite popular as they do not require any assumptions regarding the treatment effect, and are amenable to common user-level metrics. Chen, Liu, and Xu (Citation2018) use the user-trigger framework to illustrate the benefits of triggered analyses in terms of power gains and variance reduction, as well as to highlight the types of biases that may occur under such approaches. The session-trigger analysis is another approach that groups events into “sessions” and only keeps sessions that contain trigger events for analysis. While Deng and Hu (Citation2015) note that estimates from session-trigger analyses do tend to have lower variance than those from user-trigger analyses, the treatment effect must be zero in the nontriggered sessions in order for this approach to be valid. While perhaps true in some cases, this assumption is generally difficult to verify in most applications.

One approach for estimating the diluted treatment effect is to derive τΩ in terms of τΘ, producing so-called “diluted formulas.” For additive metrics, Yi = Yi,t + Yi,u, where Yi,t is an outcome when i is triggered and Yi,u is for when i is untriggered, it can be shown that the diluted treatment effect is the average treatment effect on triggered users weighted by the proportion of triggered users, that is, τΩ = τΘ × (n/N). Note that this expression only applies when, for i ∈ Θ, there is no treatment effect on the sessions where i is untriggered, that is, Yi,u(1) − Yi,u(0) = 0. In other words, this expression is only for valid session-trigger analyses. With ratio metrics, Yi = ai/bi, if there is no treatment effect for the denominator term, bi = bi(1) = bi(0), and the rate at which users are triggered into the experiment is independent of τΘ, then the diluted formula is τΩ = τΘ × (n/N) × r̄, where r̄ is the average trigger rate as a function of bi. Further details for these derivations may be found in Deng and Hu (Citation2015), but it is important to note that these formulas only apply when the treatment effect defined in (1) is of interest. When a relative effect (i.e., lift) is of interest, one must perform relative dilution where weighting is not by the proportion of triggered users, but by their contribution to the metric. While these formulas are certainly helpful in illustrating the connection between τΘ and τΩ, they are restrictive because their underlying assumptions are not necessarily realistic and closed-form expressions only exist for special cases. As noted by Deng and Hu (Citation2015), the trigger rates are rarely independent of the triggered treatment effect; users who visit a website frequently will have higher trigger rates and tend to have a larger treatment effect than less-frequent users (see Wang et al. Citation2019).
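
As a small numerical illustration of the additive-metric formula τΩ = τΘ × (n/N), the sketch below uses hypothetical counts; it is not tied to any particular study.

```python
# Illustrative (hypothetical) numbers: N randomized users, n of whom triggered.
N = 1_000_000          # all randomized users
n = 150_000            # users who reached the checkout page (triggered)
tau_triggered = 0.40   # estimated ATE among triggered users (e.g., +$0.40 per user)

# For an additive metric with no effect on untriggered sessions,
# the diluted (site-wide) effect is the triggered effect scaled by the trigger rate.
tau_diluted = tau_triggered * n / N
print(f"diluted ATE = {tau_diluted:.3f} per user")   # 0.060
```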

The above formulas provide a solution for estimating τΩ but they do not address the issue of low power that typically afflicts triggered analyses. Deng and Hu (Citation2015) and Deng et al. (Citation2023) simultaneously address both issues by formalizing the connection between all diluted formulas and variance reduction. Under the assumption that there is no treatment effect when users are untriggered, Deng and Hu (Citation2015) apply the CUPED framework (Section 2.1) by augmenting τ̂Θ with mean-zero data from the trigger complement group. The authors show that the resulting augmented estimator is unbiased for τΩ, can achieve appreciable variance reduction, and applies to metrics of any form. Deng et al. (Citation2023) extend this application of CUPED to one-sided triggering, a type of one-sided noncompliance where only the triggering status of the treatment group is observed.

Compared to other challenges in OCEs, the literature for triggering is, at present, rather sparse. Consequently, there are still many areas open for further research. The discussed methodologies for estimating the diluted treatment effect each depend on assumptions that may be too restrictive in certain applications. An additional challenge concerns bias of standard ATE estimators induced by triggering. Chen, Liu, and Xu (Citation2018) identify a special type of bias that occurs when a user triggers on day d, but not day d + 1. Other types of bias, as well as the questions of how to define the randomization unit (user, session, or webpage) and how and when to aggregate data into sessions, remain open for exploration. Recent work in Deng et al. (Citation2023) also suggests that exploring noncompliance and other similar concepts from the causal inference literature (such as principal stratification) with respect to triggering may be an area for future development.

3 Heterogeneous Treatment Effects

Motivating Example: Suppose an online ad provider wishes to determine the impact of changing from static textual ads to short video ads on website traffic. For the treatment group, website traffic appears to have increased uniformly except among Safari users. Consequently, the ad team wishes to estimate the treatment effect at the browser level. Likewise, after observing an improvement in user engagement metrics, the ad team may want to perform a post-hoc analysis to determine if this increase is roughly the same for all users or perhaps concentrated within certain user segments (such as those defined by market/country, user activity level, device/platform type, and time).

Treatment effects on subgroups differing from the population-level ATE are known as heterogeneous treatment effects (HTE) and are commonly of interest in OCEs. Identifying and interpreting such heterogeneity is vitally important for business applications. For example, practitioners may be interested in estimating the treatment effect for different devices or browser types, or for users of different ages, or users living in different parts of the world. Identifying and estimating HTEs is also of concern for those wishing to develop personalized experiences, or to detect bugs, or interactions with other experiments. Three key challenges are associated with estimating HTEs: (a) small treatment effects (see Section 2) often make online studies under-powered, resulting in high false negative rates for subgroup effects; (b) testing for HTEs tends to risk inflated Type I error rates due to multiple comparisons; (c) in cases where users are not randomized to the subgroups under comparison, the usual tension between correlation and causation manifests. Below we review existing methodologies that are commonly used in the context of OCEs to address these problems.

Heterogeneous treatment effects have a rich history in statistical theory and application (Robinson Citation1988; Zhao et al. Citation2012; Imai and Ratkovic Citation2013; Athey and Imbens Citation2016; Wager and Athey Citation2018; Tran and Zheleva Citation2019). In this review, we focus on the intersection of this literature and OCEs. Assume each unit i has a pair of potential outcomes {Yi(1), Yi(0)} and a vector of pretreatment covariates Xi, with e(x) = Pr(Wi = 1 | Xi = x) being the probability that a user is treated given a particular value of the covariate. For randomized studies where causality may be inferred, e(Xi) is known and technically independent of Xi; however, when HTE analysis is under an observational setting, e(Xi) is typically unknown. Most of the literature regarding HTEs employs the following assumptions: (a) SUTVA and (b) unconfoundedness, meaning that the potential outcomes are independent of the treatment assignment Wi conditional on the covariate, {Yi(1), Yi(0)} ⫫ Wi | Xi. The main goal is to estimate the conditional average treatment effect (CATE), τ(x) = E[Yi(1) − Yi(0) | Xi = x].

Another key challenge is to detect exactly for which levels of the covariate τ(x) differs from τ and, given several covariates, to identify which X’s are the source of heterogeneity.

Interpretation is crucial in the online industry; thus, a popular approach is to assume a linear model relating Yi to (Wi, Xi), from which main and interaction effects may be estimated. Unsurprisingly, the relationship between Yi and Xi is often highly complex, so a common method is to use the semi-parametric model from Robinson (Citation1988), Yi = τ(Xi)Wi + g(Xi) + εi, which makes no assumptions about the forms of τ(Xi) and g(Xi). Under unconfoundedness, one may write Yi − m(Xi) = τ(Xi)(Wi − e(Xi)) + εi, where m(Xi) = E[Yi | Xi] and e(Xi) are unknown. The l2 loss function is used to estimate the heterogeneous treatment effects, resulting in the estimate
τ̂(X) = argmin_τ (1/n) Σ_{i=1}^{n} {Yi − m(Xi) − τ(Xi)[Wi − e(Xi)]}².

Thus HTE estimation is a ripe target for machine learning methods. Researchers have approached this problem using a technique called “Double Machine Learning” (DML) (Chernozhukov et al. Citation2017). Briefly, this technique models m(X) and e(X) as nuisance parameters, estimating them with nonparametric regression on a hold-out sample set. The CATE may then be estimated on the remaining sample set using a variety of machine learning methodologies. Chernozhukov et al. (Citation2017) demonstrate that the above squared error loss is Neyman orthogonal to m(X) and e(X), which, along with sample splitting, ensures unbiasedness of τ̂(X) and enforces parsimonious modeling of Y with respect to the nuisance parameters. Syrgkanis et al. (Citation2019) extend DML for estimating heterogeneous treatment effects when the covariates are hidden. Such situations arise in online experiments when users do not comply with a treatment due to unobserved factors. By modeling Y with instrumental variables, Syrgkanis et al. (Citation2019) estimate the HTEs using a doubly-robust, fully convex loss function that is minimized with an algorithm that builds on the DML technique. To avoid the challenge of directly estimating the HTE, Peysakhovich and Lada (Citation2016) use historic, user-level data to learn individual effect estimates conditional on the covariates that correlate with the true treatment effect. Practitioners at Netflix also used DML to understand the localized impact on viewership from subbed and dubbed movies (Lan et al. Citation2022).
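
The sketch below illustrates the cross-fitted, residual-on-residual idea behind DML for a constant treatment effect, using off-the-shelf scikit-learn regressors for the nuisance functions m(X) and e(X); replacing the final stage with a model for τ(X) yields a CATE estimator. The data-generating process and model choices are illustrative rather than those of the cited papers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)

# Simulated experiment: X enters the outcome nonlinearly, W is randomized (e(x) = 0.5).
n, p = 20_000, 5
X = rng.normal(size=(n, p))
W = rng.binomial(1, 0.5, size=n)
tau = 0.3                                             # true (constant) effect
Y = np.sin(X[:, 0]) + X[:, 1] ** 2 + tau * W + rng.normal(scale=1.0, size=n)

# Cross-fitted residuals for Y and W (partialling out m(X) and e(X)).
res_y, res_w = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    m_hat = GradientBoostingRegressor().fit(X[train], Y[train])
    e_hat = GradientBoostingClassifier().fit(X[train], W[train])
    res_y[test] = Y[test] - m_hat.predict(X[test])
    res_w[test] = W[test] - e_hat.predict_proba(X[test])[:, 1]

# Final stage: regress outcome residuals on treatment residuals.
tau_hat = (res_w * res_y).sum() / (res_w ** 2).sum()
print(f"DML estimate of tau: {tau_hat:.3f}")
```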

Other popular machine learning approaches for estimating τ(X) are regression trees and random forests. Following the DML approach, for instance, one may use trees to identify meaningful segments of a continuous or categorical variable, and then model τ(X) with partially linear regression. In an adaptation of the classical CART algorithm, Athey and Imbens (Citation2016) build modified regression trees to partition the data into subgroups corresponding to different magnitudes of the treatment effect, thus each terminal leaf produces an estimate for τ(x), rather than the traditional estimate of E[Y | Xi = x]. To correct for over-fitting, an additional split of the training data into nonoverlapping sub-partitions for each leaf is used. Naturally, this method can be extended to random forests to create a causal forest for estimating the HTE (Wager and Athey Citation2018). While causal trees and forests do not require linearity of the treatment effects, and perhaps are conceptually more intuitive than DML, they are somewhat lacking in terms of interpretability compared to the effect estimates from DML and other similar methods. A further disadvantage is that the additional training split reduces the sample size for an application that may already suffer from low power.

In addition to estimating τ(X), identifying which covariates or levels of covariates contribute to treatment heterogeneity is of great practical concern (Sepehri and DiCiccio Citation2020; McFowland III et al. Citation2021). Obtaining a parsimonious model of Y is critical in such situations, as there are typically a large number of covariates from which to choose and strong statistical significance is required for detecting HTEs. This challenge is simultaneously a variable selection and multiple testing problem. Xie, Chen, and Shi (Citation2018) assume an experimental design setup where e(X) = Pr(Wi = 1), using this value to transform Yi into Yi* such that E[Yi* | X] = τ(X), which is estimated with the standard difference-in-means τ̂. Using Y* − τ̂ as the response variable, the authors perform Lasso regression in conjunction with the “knockoff” variable selection defined by Barber and Candès (Citation2015) to select heterogeneous covariates while controlling false discovery rates (FDRs). They also demonstrate how to use the Benjamini-Hochberg correction to identify levels within these covariates where HTEs occur. Deng et al. (Citation2016b) also use variable selection when clusters of covariates are of interest, such as devices grouped by brand name. They employ a linear model with first-order effect and second-order interaction terms and enforce sparsity using total variation regularization, a technique similar to the Fused Lasso (Petersen, Witten, and Simon Citation2016).
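
To convey the flavor of this approach, the sketch below uses a standard transformed outcome satisfying E[Y* | X] = τ(X) when e = Pr(Wi = 1) is fixed by design, followed by a Lasso fit of Y* − τ̂ on the covariates; this is a simplified stand-in for the cited procedure and omits the knockoff step that provides FDR control.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)

# Simulated experiment with e(X) = Pr(W = 1) = 0.5 by design.
n, p, e = 50_000, 10, 0.5
X = rng.normal(size=(n, p))
W = rng.binomial(1, e, size=n)
tau_x = 0.2 + 0.5 * X[:, 0]                 # heterogeneity driven by the first covariate
Y = X[:, 1] + tau_x * W + rng.normal(size=n)

# Transformed outcome with E[Y*|X] = tau(X) when e is known by design.
Y_star = Y * (W - e) / (e * (1 - e))
tau_hat = Y[W == 1].mean() - Y[W == 0].mean()   # overall ATE estimate

# Sparse regression of Y* - tau_hat on X; nonzero coefficients flag covariates
# associated with treatment-effect heterogeneity (knockoff FDR control omitted here).
lasso = LassoCV(cv=5).fit(X, Y_star - tau_hat)
print(np.round(lasso.coef_, 3))
```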

Given the wide array of scenarios under which HTEs occur in online experiments, there are still many situations where the methods discussed above may not be appropriate. Much of the literature in this review makes strong model assumptions that are difficult to verify in practice. For example, the SUTVA requirement that units in a given variant receive the same level of treatment may not be satisfied if users have different levels of engagement (Imbens and Rubin Citation2015); highly engaged users will typically experience a higher “dose” of their treatment (relative to light users) due to repeated exposure over the experimental period. As such, the vast literature on dose-response studies (Ruberg Citation1995a, Citation1995b) seems pertinent here. Additionally, the low power due to small effect sizes makes multiple testing quite challenging. Simulations regarding the approach for controlling FDR in Xie, Chen, and Shi (Citation2018) showed that the knockoff method may be too conservative when faced with small effect sizes, and Deng et al. (Citation2016b) reported difficulties regarding high false positive rates. For more open challenges regarding HTE estimation, we encourage the interested reader to consult Gupta et al. (Citation2019), Kohavi, Tang, and Xu (Citation2020), and Bojinov and Gupta (Citation2022).

4 Long-Term Effects

Motivating Example: At Bing, researchers hypothesized that showing large numbers of advertisements should have a positive effect on revenue, but may hurt user engagement in the long-term. To test this, the researchers exposed users to varying ad loads, noting a significant difference in engagement metrics for users exposed to a high ad load versus a low one. It was proposed that one may estimate the long-term effect by performing a post-hoc analysis some time after the experiment. Unfortunately, the post-hoc differences between high-load and low-load users could not be solely attributed to treatment assignment: many users quickly abandoned Bing as a result of too many ads, biasing results toward the users who remained (Dmitriev et al. Citation2016).

Practitioners are often interested in understanding the treatment effect not just during the experiment, but months, even years after the experiment concludes. In many online experiments, the short-term treatment effect observed during and immediately after the experiment is not necessarily the same as the long-term effect. For instance, click-bait advertising has a positive short-term effect on click-through-rates, but a negative long-term effect on user retention and revenue (Kohavi et al. Citation2012). More generally, novelty and primacy effects are of concern. A novelty effect exists when a novel change is initially intriguing, leading to increased engagement that diminishes over time. A primacy effect, on the other hand, exists when the initial reaction to a change is not positive, but over time, as users get used to the change, their engagement increases (McFarland Citation2012; Sadeghi et al. Citation2022). In both cases, the nature of the treatment effect may change over time as users learn. The treatment effect itself may also change dynamically over time independent of user learning. As such, the ATE τ should not be viewed as a constant with respect to time (t); rather, it should be regarded as a function of it: τ(t).

Current OCE literature regarding long-term effect estimation is highly context-specific. At the time of writing this review, it is difficult to pinpoint a single statistical lineage of methodologies for this area (unlike with heterogeneous treatment effects, for example). We begin by introducing several distinct approaches that draw from a variety of statistical fields, and then finish with discussion of one area in particular that shows promise in providing a statistical framework for modeling and estimating long-term effects in online settings. For more industry-specific examples of the challenges concerning long-term effects, see Gupta et al. (Citation2019) and Bojinov and Gupta (Citation2022).

A straightforward way to assess long-term effects is to simply run the experiment longer and ensure that the appropriate metrics for capturing long-term behavior are observed. However, much of the literature written by practitioners of OCEs has been devoted to describing the pitfalls associated with running long-term controlled experiments specifically for estimating long-term effects (Kohavi et al. Citation2009, Citation2012; Dmitriev et al. Citation2016; Gupta et al. Citation2019; Kohavi, Tang, and Xu Citation2020). Besides increased cost, several other external factors often make long-term experiments unappealing. For instance, when browser cookies are used to identify users, long-term experiments risk losing upwards of 75% of users to cookie churn, rendering the results invalid (Dmitriev et al. Citation2016). These users may also reenter the experiment unbeknownst to the experimenters and receive both the treatment and control experiences. This type of contamination can also happen if users access the product or service on multiple devices, a problem that becomes more likely as the experiment’s duration increases. Additionally, the longer the experiment, the more likely it is that multiple users (e.g., family members) will use the same device, obfuscating results. As such, in this section we focus on techniques for estimating long-term treatment effects other than increasing experiment length.

Several approaches for estimating long-term effects intersect with other areas discussed in this review. In Wang et al. (Citation2019), long-term effects are characterized as a form of bias due to heterogeneous treatment effects (Section 3). In this context, long-term effects manifest because heavy-users (frequent users of the product) tend to be included in experiments at higher rates than light-users, biasing the ATE particularly in the short-term. Here, the treatment effect is presumed to be different depending on whether user i is a heavy- or light-user. Under SUTVA and an assumed independence of outcomes from treatment assignment, the authors derive bias due to heavy-users in closed form, proposing a bias-adjusted jackknife estimator for the overall ATE. For a two-sided market where each experimental unit has a treatment history up to time t, Shi et al. (Citation2020) leverage sequential testing (Section 5) and reinforcement learning to test for long-term treatment effects. Using data from a ride-sharing company, they demonstrate how their derived test statistic is able to detect long-term effects where regular two-sample t-tests fail. While the solutions from Wang et al. (Citation2019) and Shi et al. (Citation2020) are effective, they only target specific types of long-term effects, which limits their potential generalizability to other settings.

Another common solution is to define and measure short-term driver metrics that are causally linked to the long-term effect (Kohavi, Tang, and Xu Citation2020). Driver metrics allow practitioners to focus experiments on short-term goals while still taking into account the long-term effects (see Biddle (Citation2019) for anecdotal examples). Hassan et al. (Citation2013) define heuristics for modeling implicit indicators of customer satisfaction, noting that query-based models tend to serve as better proxies than click-based models. Hohnhold, O’Brien, and Tang (Citation2015) define models for how users “learn” to search or click for a product as a result of being exposed to a treatment, such as a change in the number of ads shown, using “learned click-through-rates” as a driver metric for estimating the long-term effect on revenue. To estimate the long-term revenue effect of a treatment, the authors model it as a linear function of short-term revenue and the estimated learned click-through-rates. The model has been successfully deployed by Google and is widely cited in the OCE literature (Deng, Lu, and Litz Citation2017; Gupta et al. Citation2019; Wang et al. Citation2019; Kohavi, Tang, and Xu Citation2020). A recent paper by Sadeghi et al. (Citation2022) proposes an observational approach based on difference-in-differences to estimate user learning and hence the long-term treatment effect in contexts where novelty and primacy effects exist.

Methodology in this area tends to resemble recent works in the causal inference literature that also aim to address this challenge by combining short-term experimental data with long-term observational data. In Section SM2 of the supplementary material we elaborate on the use of surrogate outcomes (Prentice Citation1989; Begg and Leung Citation2000; Frangakis and Rubin Citation2002; Ensor et al. Citation2016) in this context.

5 Optional Stopping

Motivating Example: Suppose an online streaming service is altering a certain feature that positively correlates with subscription renewals. While an improvement to this feature could increase the rate of subscription renewals, a harmful change may have the opposite effect. It is in the service’s best interest to quickly abandon harmful or poorly performing variants, and identify those that perform well. Methods that support early termination without compromising overall statistical validity are desirable. A notable practice within this context is to “ramp up” the experiment by slowly exposing an increasing percentage of users to the treatment (Xu, Duan, and Huang Citation2018), and to integrate OCEs into a phased software deployment via “controlled rollout” (Xia et al. Citation2019).

Most OCEs are run in real-time, and it is not uncommon for estimates and confidence intervals associated with τ, and p-values associated with H0: τ = 0, to be updated in near-real-time as the data are collected. Although a fixed horizon is usually determined based on development cycles (typically two weeks) and minimum sample size requirements (determined via power arguments), the near-real-time availability of results encourages a phenomenon known colloquially as “peeking,” whereby p-values are monitored continuously and the experiment is stopped as soon as a significant p-value is observed. While it is well known that this practice seriously inflates false positive rates (Johari et al. Citation2022a; Kohavi, Deng, and Vermeer Citation2022), there are nevertheless situations where having a mechanism for optional stopping is desirable. For example, it is extremely important to quickly detect and abort treatments that are negatively impacting the user experience (Lindon, Sanden, and Shirikian Citation2022). Thus, in situations like this, a methodology that permits near-real-time decision-making without inflating Type I error rates is invaluable. Unsurprisingly, design and analysis methods from the sequential testing body of literature are relevant here. Below we describe the development and application of methods in this area to the OCE context.

As per the field of sequential testing, interest lies in assessing H0: τ = 0 using sample size-dependent decision rules. Within this class of methods, Type I error is controlled at each current sample size n, which avoids the inflated risk of Type I error associated with preemptively stopping an experiment when the current p-value is statistically significant by chance. Such methods also improve testing efficiency, since a sequential test terminates, on average, at a smaller sample size regardless of where the true treatment effect might lie. However, there is no free lunch. Existing methodology is not well-suited for all OCE applications, such as monitoring multiple metrics (e.g., the OEC and guardrails). Additionally, the reduced sample sizes guarantee that HTE inference across user segments is under-powered. Nevertheless, sequential testing methods have appeared in numerous OCE applications and studies (Kohavi et al. Citation2013; Kharitonov et al. Citation2015; Deng, Lu, and Chen Citation2016; Abhishek and Mannor Citation2017; Ju et al. Citation2019; Yu, Lu, and Song Citation2020; Shi et al. Citation2020; Johari et al. Citation2022a; Schultzberg and Ankargren Citation2023; Skotara Citation2023). The following section broadly introduces the method of sequential testing as it pertains to ongoing evaluation of the treatment effect(s) of interest in OCEs.

The majority of the OCE literature in sequential testing builds on the classic sequential probability ratio test (SPRT) developed by Wald (Citation1945). Define constants 0 < B < A, where B = β/(1 − α) and A = (1 − β)/α, and a simple hypothesis test H0: θ = θ0 versus H1: θ = θ1. The SPRT method proceeds as follows. For current sample size n, compute the likelihood ratio test statistic
Λn = ∏_{i=1}^{n} f(yi | θ1)/f(yi | θ0),
where the yi are observations of iid data {Y1, …, Yn} ~ f(· | θ). The rejection region divides the sample space into three mutually exclusive decision rules: (a) if Λn > A, reject H0 and stop the test; (b) if Λn < B, fail to reject H0 and stop the test; (c) if B < Λn < A, obtain another observation Yn+1 and compute Λn+1. Although it seems like testing in this manner would permit the possibility of never drawing a conclusion about H0 (i.e., n → ∞), Wald (Citation1947) proved that the SPRT will eventually terminate at a finite n. The SPRT does not require specifying n in advance, and requires on average about half the number of observations required for a uniformly most powerful Neyman-Pearson test with the same level of power (Wald Citation1945).
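
A minimal sketch of Wald's SPRT for normally distributed observations with known variance is given below; the hypotheses, error rates, and data are illustrative.

```python
import numpy as np

def sprt(y_stream, theta0, theta1, sigma, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: theta = theta0 vs H1: theta = theta1,
    assuming iid N(theta, sigma^2) observations."""
    A, B = (1 - beta) / alpha, beta / (1 - alpha)
    log_lr, n = 0.0, 0
    for n, y in enumerate(y_stream, start=1):
        # log of f(y|theta1)/f(y|theta0) for a normal likelihood
        log_lr += ((y - theta0) ** 2 - (y - theta1) ** 2) / (2 * sigma ** 2)
        if log_lr >= np.log(A):
            return "reject H0", n
        if log_lr <= np.log(B):
            return "fail to reject H0", n
    return "no decision", n

rng = np.random.default_rng(4)
decision, n_used = sprt(rng.normal(0.1, 1.0, size=10_000), theta0=0.0, theta1=0.1, sigma=1.0)
print(decision, "after", n_used, "observations")
```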

The first and perhaps most widely-known application of sequential testing in OCEs is a modified version of the SPRT called the mixture sequential probability ratio test, or mSPRT (Pramanik, Johnson, and Bhattacharya Citation2021; Johari et al. Citation2022a). The mSPRT allows for a simple null hypothesis versus a composite alternative hypothesis H1: θ ≠ θ0 by assuming a mixture distribution H with density h(·) defined over the parameter space of all possible θ. The test statistic is therefore a mixture of the likelihood ratios,
Λn^H = ∫ ∏_{i=1}^{n} [f(yi | θ)/f(yi | θ0)] h(θ) dθ.

The procedure rejects H0 and ends if Λn^H ≥ α⁻¹. Johari et al. (Citation2022a) use the mSPRT to define “always valid p-values,” which are computed iteratively such that p0 = 1 and pn = min{pn−1, (Λn^H)⁻¹}. Thus, practitioners may stop an experiment at any time while still controlling Type I error. The always valid p-values and their confidence interval counterparts are currently deployed by Optimizely, a widely-used third-party vendor for OCEs (Pekelis Citation2015). With the vast quantity of experiments that this company has facilitated, they are able to leverage prior data for estimating the mixture distribution H. However, Johari et al. (Citation2022a) derive their optimality conditions for the mSPRT only for data that come from the exponential family of distributions, which does not include distributions for the ratio metrics that are popular in industry. Another limitation lies in how the likelihood ratios for a two-sample hypothesis test are defined. While the authors assume a standard independent, two-sample stream of data, they impose an additional restriction by arbitrarily pairing observations, which suggests a matched pairs design. This allows for defining a tractable f(yi | θ), but there is no practical reason for observations to be paired, and no practical guidance is given for how to perform the pairing. Additionally, the assumption that observations arise independently may be violated when a unit generates multiple observations as they interact repeatedly with the experiment over time. Moreover, although the Type I error rate is satisfactorily controlled (when all assumptions are met), unbiased estimation of the treatment effect is still a concern. Methodology that relaxes these assumptions and yields unbiased estimates is therefore valuable.
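
For the special case of a normal likelihood with known variance and a normal mixing distribution H, the mixture likelihood ratio has a closed form, which makes the always valid p-value construction easy to sketch. The sketch below treats the data stream as paired treatment-control differences and uses illustrative parameter values.

```python
import numpy as np

def msprt_always_valid_pvalues(y, sigma2, tau2):
    """Always-valid p-values from the mSPRT for H0: mean = 0, assuming
    iid N(mean, sigma2) observations and a N(0, tau2) mixture over the mean."""
    n = np.arange(1, len(y) + 1)
    s = np.cumsum(y)
    # Closed-form mixture likelihood ratio for the normal/normal case.
    log_lr = 0.5 * np.log(sigma2 / (sigma2 + n * tau2)) \
             + tau2 * s ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    # p_n = min(p_{n-1}, 1 / Lambda_n^H), capped at 1.
    return np.minimum.accumulate(np.minimum(1.0, np.exp(-log_lr)))

rng = np.random.default_rng(5)
diffs = rng.normal(0.05, 1.0, size=50_000)   # stream of paired treatment-control differences
p = msprt_always_valid_pvalues(diffs, sigma2=1.0, tau2=0.01)
first = int(np.argmax(p < 0.05)) + 1 if (p < 0.05).any() else None
print("first observation at which p < 0.05:", first)
```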

The well-publicized usage of the mSPRT has inspired several related works in the literature. Abhishek and Mannor (Citation2017) account for the situation where $f(\cdot \mid \theta)$ is unknown by creating a bootstrap algorithm to approximate $\Lambda_n^H$. While the algorithm also requires a prior distribution to approximate $H$, this method still allows practitioners to use the mSPRT for commonly used online metrics that are otherwise difficult to model. Lindon and Malek (Citation2020) extend the mSPRT to multinomial count data, which includes an application for conducting SRM tests sequentially, in near-real-time. Yu, Lu, and Song (Citation2020) also extend the mSPRT to the multiple testing scenario to test for heterogeneous treatment effects, using always valid p-values to allow for continuous monitoring. Xu, Duan, and Huang (Citation2018) use a technique similar to the mSPRT, called the generalized sequential probability ratio test (GSPRT), to determine the risk of exposing more users to a new variant. Briefly, the GSPRT replaces the likelihoods in $\Lambda_n$ with their suprema, and can be shown to require smaller sample sizes on average than the mSPRT (Chan and Lai Citation2005). Xu, Duan, and Huang (Citation2018) use a prior-weighted GSPRT to provide a rigorous statistical framework dubbed “speed, quality, and risk” (SQR) for the practice of ramping up, that is, gradually introducing users to a new variant in order to mitigate the fallout associated with exposing them to potentially negative variants (for a high-level discussion of SQR, see Chapter 15 of Kohavi, Tang, and Xu Citation2020). An alternative to frequentist sequential testing is also explored by Deng, Lu, and Chen (Citation2016), where the authors use Bayesian hypothesis testing as the foundation. Bayesian methods for OCEs are briefly discussed in Section SM4 of the supplementary material.

Finally, we acknowledge that the sequential testing methods discussed here are fully sequential. However, group sequential methods commonly used in adaptive clinical trials (Pocock Citation1977; O’Brien and Fleming Citation1979; Lan and DeMets Citation1983; Robertson et al. Citation2023) are rapidly gaining popularity in the context of OCEs. Georgiev (Citation2022), Skotara (Citation2023), and Schultzberg and Ankargren (Citation2023) describe the use of these methods in this context, and the value they provide with respect to the speed of decision making (i.e., increased power) and false positive control. Tailoring such methods for use with OCEs appears to be a nascent but fruitful line of research.
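To convey the basic idea behind group sequential monitoring, the sketch below uses Monte Carlo simulation to find a constant (Pocock-style) two-sided critical value for $K$ equally spaced interim analyses; the per-look threshold must exceed the fixed-sample 1.96 to keep the overall Type I error at the nominal level. The number of looks, simulation size, and seed are illustrative; practical designs would typically use alpha-spending implementations rather than simulation.

```python
# A Monte Carlo sketch of a group sequential design with K equally spaced
# looks and a constant (Pocock-style) critical value. All settings are
# illustrative.
import numpy as np

def pocock_constant(K, alpha=0.05, sims=200_000, seed=3):
    rng = np.random.default_rng(seed)
    # Under H0, the z-statistic at look k equals the cumulative sum of k iid
    # N(0,1) increments divided by sqrt(k), which captures the correlation
    # between successive looks.
    increments = rng.standard_normal((sims, K))
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    max_abs_z = np.abs(z).max(axis=1)
    # smallest constant boundary c with P(max_k |Z_k| > c) <= alpha under H0
    return np.quantile(max_abs_z, 1 - alpha)

print(round(pocock_constant(K=5), 3))   # roughly 2.4, versus 1.96 for a single look
```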

6 Interference

Motivating Example 1: Suppose LinkedIn plans to test the impact of a new feature for their messaging service, with the objective of increasing total messages sent. Under balanced randomization, if user i is exposed to the new feature, there is roughly a 50% chance that i’s friend j is randomized to the old service. If the new feature indeed increases messages sent, it is likely that friend j will also send more messages in response to i, despite j belonging to the old service. Thus, the overall impact of the new messaging feature on total messages sent is contaminated by network interference between the treatment and control groups, biasing standard estimators of the ATE (Saint-Jacques Citation2019).

Motivating Example 2: Suppose Lyft is experimenting with a new version of its pricing algorithm that results in treated passengers booking more rides. However, since the number of available drivers is finite, increased ride bookings in the treatment group necessarily reduce the supply of drivers and hence the number of possible rides for the control group (Chamandy Citation2016). This in turn biases naive treatment effect estimates that compare ride bookings in the two groups.

Recall that SUTVA requires that the potential outcome $Y_i(W_i)$ for unit i remain the same regardless of the treatment assignments and outcomes of the other experimental units. However, in certain OCE applications (e.g., social networks and online marketplaces), SUTVA may be violated because units interfere with one another. Such a SUTVA violation is referred to as interference, spillover, or leakage. Interference was illustrated in both motivating examples above; in each case a unit’s outcome depended not only on its own treatment assignment, but also on the treatment assignments and outcomes of other units in the experiment. The two examples typify two different forms of interference: the first, network interference, arises when the units are connected to one another through a network, such as a social network like LinkedIn (Saint-Jacques et al. Citation2019) or Facebook (Eckles, Karrer, and Ugander Citation2014). The second form, marketplace interference, arises when units compete for shared resources in two-sided marketplaces such as Lyft (Chamandy Citation2016), eBay (Blake and Coey Citation2014), or Airbnb (Holtz et al. Citation2020), and three-sided marketplaces like DoorDash (Feng and Bauman Citation2022). Note that Kohavi, Tang, and Xu (Citation2020) define these interference mechanisms as resulting (respectively) from “direct” and “indirect” connections among the units. Bojinov and Gupta (Citation2022) alternatively define partial and arbitrary interference in addition to marketplace interference, where the mechanism is partial or arbitrary according to whether a unit is influenced by some or by all of the other units in the experiment. Kohavi, Tang, and Xu (Citation2020) describe another type of interference whereby a malfunctioning treatment causes a crash that impacts both treatment and control users.

As described in Section 1.2, estimates of the average treatment effect seek to quantify the difference in outcomes when all units are treated versus when all units are controlled. In standard settings, a subset of units randomized to treatment and another subset randomized to control serve as an adequate proxy for the unobserved counterfactuals. However, when the treatment and control groups interfere with each other, traditional randomization no longer adequately approximates the counterfactuals and hence standard difference-in-means estimators are no longer unbiased. A rich literature has recently been developed that carefully considers both the design and analysis of OCEs in the presence of interference. Of particular relevance are experimental designs that reduce the amount of interference, and analysis methods that provide unbiased treatment effect estimates in the presence of interference. We provide a brief overview of that literature below. As will become apparent, many ideas from the clinical RCT literature are relevant here (e.g., cluster randomized trials and crossover designs).

In network A/B tests, cluster-based randomization methods have gained widespread attention (Ugander et al. Citation2013; Eckles, Karrer, and Ugander Citation2014; Gui et al. Citation2015; Saveski et al. Citation2017; Yoon Citation2018; Zhou et al. Citation2020; Karrer et al. Citation2021). Such methods first partition the network into disjoint clusters, commonly via community detection algorithms, yielding groups of nodes with much greater intra-cluster connectivity than inter-cluster connectivity. Randomization is then performed at the cluster level, where all units in a cluster receive the same treatment assignment. In doing so, units will (for the most part) have the same treatment assignment as the units nearest—and hence most likely to influence—them. This limits the opportunity for interference and therefore better mimics the all-treated and all-controlled counterfactuals. However, with the clusters rather than the individual users serving as the randomization units, the effective sample size, and hence the power, is dramatically reduced. Saint-Jacques et al. (Citation2019) therefore propose the use of many smaller ego-clusters. These clusters are defined by a single user (the ego) and a subset of its direct neighbors (the alters). Ego-cluster-based randomization does not pay as significant a penalty in terms of power, since there are many more ego-clusters than clusters in a traditional cluster-based design. Moreover, other treatment assignment schemes, in which the ego is treated differently from the alters, may be used to estimate the network interference. In Section SM3 of the supplementary material we elaborate on these and other methods developed for the design and analysis of network A/B tests.
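The following sketch illustrates graph-cluster randomization end to end on a toy network: a community detection routine partitions the graph, treatment is assigned at the cluster level, and the ATE is estimated with a difference in means. The graph, outcome model, and spillover coefficient are illustrative assumptions, not those of any cited design.

```python
# A minimal sketch of graph-cluster randomization on a toy social network.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(11)
G = nx.karate_club_graph()                       # stand-in for a social network

# 1. Partition the network so that most edges fall within clusters.
clusters = list(greedy_modularity_communities(G))

# 2. Randomize at the cluster level with a balanced assignment:
#    every node in a cluster shares its cluster's treatment indicator.
flips = rng.permutation([k % 2 for k in range(len(clusters))])
w = {node: int(flips[c]) for c, cluster in enumerate(clusters) for node in cluster}

# 3. Simulate outcomes with a spillover term (fraction of treated neighbors),
#    then compute the usual difference-in-means estimate of the ATE.
y = {}
for node in G:
    nbrs = list(G[node])
    exposure = np.mean([w[j] for j in nbrs]) if nbrs else 0.0
    y[node] = 1.0 + 0.5 * w[node] + 0.3 * exposure + rng.normal(scale=0.1)

treated = [y[i] for i in G if w[i] == 1]
control = [y[i] for i in G if w[i] == 0]
print("difference-in-means estimate:", np.mean(treated) - np.mean(control))
```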

The spirit of cluster-based randomization—that is, treating units likely to influence each other in the same way—is at the heart of methodologies intended to mitigate other forms of interference as well. For instance, in settings with marketplace interference, switchback experiments are commonly used to sequentially alternate units between treatment and control over time (Bojinov, Simchi-Levi, and Zhao Citation2022). In doing so, at any given time period all users have the same treatment assignment and therefore cannot exert cross-arm influence on one another. Through repeated exposure to treatment and control over time, the ATE can then be estimated. Such experiments suffer from temporal carryover effects, but these can be mitigated with “burn-in” periods analogous to washout periods in clinical trials (Hu and Wager Citation2022). Optimal design strategies have also been developed to address carryover effects (Bojinov, Simchi-Levi, and Zhao Citation2022). Even still, like cluster-based randomization in the network setting, switchback experiments suffer from decreased power via a reduced effective sample size. Recent work by Ni, Bojinov, and Zhao (Citation2023) explores the use of spatial clustering and temporal balance to overcome this problem.
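The sketch below illustrates a simple switchback analysis: the entire (toy) marketplace alternates between treatment and control in randomized time blocks, a burn-in portion of each block is discarded to limit carryover, and the ATE is estimated from the remaining periods. All constants and the carryover structure are illustrative assumptions.

```python
# A minimal sketch of a switchback design with burn-in periods.
import numpy as np

rng = np.random.default_rng(5)
n_blocks, periods_per_block, burn_in = 40, 6, 2

# randomized assignment per block, shared by every unit in those periods
block_w = rng.integers(0, 2, size=n_blocks)
w = np.repeat(block_w, periods_per_block)

# toy outcome process with carryover from the previous period's assignment
y = np.empty(w.size)
prev = 0
for t, wt in enumerate(w):
    y[t] = 10.0 + 1.0 * wt + 0.4 * prev + rng.normal(scale=0.5)
    prev = wt

# analysis: discard the burn-in periods at the start of every block, then
# compare treatment and control periods with a difference in means
keep = np.tile(np.arange(periods_per_block) >= burn_in, n_blocks)
est = y[keep & (w == 1)].mean() - y[keep & (w == 0)].mean()
print("switchback ATE estimate:", round(est, 3))
```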

In marketplaces based on auctions (e.g., eBay auctions (Blake and Coey Citation2014) or advertising auctions (Liu, Mao, and Kang Citation2021)), interference due to shared resources is also a problem. A treatment that encourages higher bidding will lead to treated users winning more auctions and control users necessarily losing them. This leads to a “cannibalization bias” whereby the margin by which the treatment looks better than the control is exaggerated, because when the treatment wins, the control must lose. Switchback experiments have been proposed as a means to mitigate such bias, but their limitations (described above) have led to the development of more tailored experimental designs for online auctions. For example, Liu, Mao, and Kang (Citation2021) propose budget-split designs that eliminate the opportunity for cannibalization bias by splitting the available resources (i.e., the budget) equally and independently between the treatment and control groups. With the resources no longer shared, there exists no imbalanced competition for them.

The works described above seek to eliminate (or at least minimize) interference through the experiment’s design so that the traditional difference-in-means estimator yields unbiased estimates of the ATE. However, an alternative paradigm exists in which de-biasing is achieved at the analysis stage by modeling the interference rather than eliminating it. For instance, Bui, Steiner, and Stevens (Citation2023) develop a class of general additive network effect models that facilitate unbiased ATE estimation while flexibly modeling network influence. Many other such network modeling approaches exist; see, for example, Parker, Gilmour, and Schormans (Citation2017), Basse and Airoldi (Citation2018), Pokhilko et al. (Citation2019), Koutra, Gilmour, and Parker (Citation2021), Zhang and Kang (Citation2022), and Section SM3 of the supplementary material for a deeper discussion of such methods. Likewise, Johari et al. (Citation2022b) and Li et al. (Citation2022) develop stochastic market models to capture interference dynamics in two-sided marketplaces. Such methods, if the interference is accurately modeled, enjoy the benefit of increased power by permitting randomization at the user level. However, accurately modeling interference is nontrivial.
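As a simple illustration of this modeling paradigm, the sketch below augments the usual outcome-on-treatment regression with a neighbor-exposure covariate (the fraction of a unit’s treated neighbors) under user-level randomization. The simulated network, coefficients, and choice of exposure covariate are illustrative assumptions and are not the specification of any of the models cited above.

```python
# A minimal sketch of "model the interference" analysis: regress the outcome
# on the unit's own treatment and a neighbor-exposure covariate.
import numpy as np
import networkx as nx

rng = np.random.default_rng(2)
G = nx.erdos_renyi_graph(n=2_000, p=0.005, seed=2)
w = rng.integers(0, 2, size=G.number_of_nodes())          # user-level randomization

# exposure = fraction of a unit's neighbors that are treated
exposure = np.array([
    np.mean([w[j] for j in G[i]]) if len(G[i]) else 0.0 for i in G
])

# toy outcome with a direct effect (0.5) and a spillover effect (0.3)
y = 1.0 + 0.5 * w + 0.3 * exposure + rng.normal(scale=0.2, size=w.size)

# OLS of y on [1, w, exposure]; under this linear exposure model, the implied
# "global" effect of treating everyone is the sum of the two coefficients.
X = np.column_stack([np.ones_like(exposure), w, exposure])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("direct, spillover, implied global effect:",
      round(beta[1], 3), round(beta[2], 3), round(beta[1] + beta[2], 3))
```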

7 Conclusion

The value of experimentation and the accompanying philosophy of trial and error has been observed in many facets of society (Manzi Citation2012), and its positive impacts in the realm of business in particular are remarkable (Koning, Hasan, and Chatterji Citation2022). Online controlled experiments are vital tools utilized hundreds of times a day by companies whose products touch the lives of billions (Kohavi et al. Citation2013; Xu et al. Citation2015; Google Citation2022). As many vital societal functions shift online at an unprecedented rate, online experimentation has already found applications outside the mainstream spheres of technology and e-commerce. OCEs have been used to optimize political advertisements and increase user engagement with campaign platforms during the Obama and Trump election campaigns (Christian Citation2012; Bump Citation2019). Decision-making tools streamlined by OCEs help clinicians make safer, more cost-effective decisions regarding patient care (Austrian et al. Citation2021). OCEs have also been deployed to identify the psychological impacts of social media on younger demographics (Isaac Citation2021). With the continued growth and popularity of careers in the evolving “data science” profession, online experimentation is almost certainly going to become a common tool for online businesses of all sizes (Schroeder Citation2021). Given the breadth and depth of OCE applications, we believe that solving the research challenges presented in this review will improve the quality of data-driven decision making in online businesses across applied domains.

We conclude this literature review with a call to action for greater collaboration between industry and academic statisticians to address the research challenges presented by online experimentation. While this article may be one of the first to provide a cohesive review of the OCE statistics literature, the need for increased cooperation between industry and academia has already been explicitly stated by experts at 13 leading organizations that run online experiments (Gupta et al. Citation2019). Collaborative partnerships between academia and industry do exist in this space (see, e.g., Waudby-Smith et al. Citation2021 and Ham et al. Citation2022, which respectively reflect partnerships between Carnegie Mellon and Adobe, and between Harvard and Netflix, on the problem of sequential experimentation). However, the academic statistics community in general seems to lack familiarity with—and access to—this research area. The purpose of this review, therefore, was to introduce academicians to the context and goals of online experimentation, as well as to provide examples and broad, technical discussion of the statistical methodologies regarding sensitivity, effect size, heterogeneity, long-term effects, optional stopping, and interference. In the absence of direct collaboration with industry, it is difficult to develop and test novel methodology, since access to relevant data is limited. While some open-access repositories exist (see, e.g., Liu et al. Citation2021; Matias et al. Citation2021), the proprietary nature of these experiments makes open-access data-sharing uncommon. This is admittedly a challenge and a remaining open problem for research in this space.

Supplementary Materials

The supplementary materials file contains additional discussion of some topics not considered in the paper (e.g., ethics, observational causal inference, and Bayesian methods) as well as expanded discussion of certain topics from the paper (e.g., the relationship between stratified sampling and CUPED, surrogate methods for long-term treatment effect estimation, and network A/B testing).

Supplemental material

Supplemental Material

Download PDF (269 KB)

Acknowledgments

The authors thank Art Owen, Georgi Georgiev, and Somit Gupta for helpful comments on an earlier draft of this article.

Disclosure Statement

The authors report there are no competing interests to declare.

References

  • Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2020), “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 88, 265–296. DOI: 10.3982/ECTA12675.
  • Abhishek, V., and Mannor, S. (2017), “A Nonparametric Sequential Test for Online Randomized Experiments,” in Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, pp. 610–616, Perth, Australia: International World Wide Web Conferences Steering Committee. DOI: 10.1145/3041021.3054196.
  • Athey, S., and Imbens, G. (2016), “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113, 7353–7360. https://www.pnas.org/content/113/27/7353.full.pdf. DOI: 10.1073/pnas.1510489113.
  • Austrian, J., Mendoza, F., Szerencsy, A., Fenelon, L., Horwitz, L. I., Jones, S., Kuznetsova, M., and Mann, D. M. (2021), “Applying A/B Testing to Clinical Decision Support: Rapid Randomized Controlled Trials,” Journal of Medical Internet Research, 23, e16651. DOI: 10.2196/16651.
  • Barber, R. F., and Candès, E. J. (2015). “Controlling the False Discovery Rate via Knockoffs,” The Annals of Statistics, 43, 2055–2085. DOI: 10.1214/15-AOS1337.
  • Basse, G. W., and Airoldi, E. M. (2018), “Model-Assisted Design of Experiments in the Presence of Network-Correlated Outcomes,” Biometrika, 105, 849–858. DOI: 10.1093/biomet/asy036.
  • Begg, C. B., and Leung, D. H. (2000), “On the Use of Surrogate End Points in Randomized Trials,” Journal of the Royal Statistical Society, Series A, 163, 15–28. DOI: 10.1111/1467-985X.00153.
  • Berman, R., and Van den Bulte, C. (2021), “False Discovery in A/B Testing,” Management Science, 68, 6762–6782. DOI: 10.1287/mnsc.2021.4207.
  • Biddle, G. (2019), “Proxy Metrics: How to Define a Metric to Prove or Disprove Your Hypotheses and Measure Progress,” available at https://gibsonbiddle.medium.com/4-proxy-metricsa82dd30ca810. (Accessed on 03/04/2022).
  • Blake, T., and Coey, D. (2014), “Why Marketplace Experimentation is Harder than it Seems: The Role of Test-Control Interference,” in Proceedings of the Fifteenth ACM Conference on Economics and Computation, pp. 567–582.
  • Bojinov, I., and Gupta, S. (2022), “Online Experimentation: Benefits, Operational and Methodological Challenges, and Scaling Guide,” Harvard Data Science Review, 4. DOI: 10.1162/99608f92.a579756e.
  • Bojinov, I., Simchi-Levi, D., and Zhao, J. (2022), “Design and Analysis of Switchback Experiments,” Management Science, 69, 3759–3777. DOI: 10.1287/mnsc.2022.4583.
  • Boucher, C., Knoblich, U., Miller, D., Patotski, S., and Saied, A. (2020), “Metric Computation for Multiple Backends,” available at https://www.microsoft.com/en-us/research/group/experimentationplatform-exp/articles/metric-computation-for-multiple-backends/. (Accessed on 09/16/2022).
  • Box, G. E., Hunter, J. S., and Hunter, W. G. (2005), Statistics for Experimenters: Design, Innovation, and Discovery (2nd ed.), Hoboken, NJ: Wiley-Interscience.
  • Bui, T., Steiner, S. H., and Stevens, N. T. (2023), “General Additive Network Effect Models,” The New England Journal of Statistics in Data Science, 1–19. DOI: 10.51387/23-NEJSDS29.
  • Bump, P. (2019), “Analysis—’60 Minutes’ Profiles the Genius Who Won Trump’s Campaign: Facebook.”
  • Chamandy, N. (2016), “Experimentation in a Ridesharing Marketplace,” available at https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e.
  • Chan, H., and Lai, T. (2005), “Importance Sampling for Generalized Likelihood Ratio Procedures in Sequential Analysis,” Sequential Analysis, 24, 259–278. DOI: 10.1081/SQA-200063280.
  • Chen, N., Liu, M., and Xu, Y. (2018), “Automatic Detection and Diagnosis of Biased Online Experiments,” arXiv preprint arXiv:1808.00114.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017), “Double/Debiased/Neyman Machine Learning of Treatment Effects,” American Economic Review, 107, 261–65. DOI: 10.1257/aer.p20171038.
  • Christian, B. (2012), “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired, 20. Available at https://www.wired.com/2012/04/ff-abtesting/
  • Courthoud, M. (2022), “Understanding CUPED,” available at https://towardsdatascience.com/understandingcuped-a822523641af. (Accessed on 08/18/2022).
  • Crook, T., Frasca, B., Kohavi, R., and Longbotham, R. (2009), “Seven Pitfalls to Avoid When Running Controlled experiments on the Web,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’09, p. 1105, Paris, France: ACM Press. DOI: 10.1145/1557019.1557139.
  • Deng, A., and Hu, V. (2015), “Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 349–358. DOI: 10.1145/2684822.2685307.
  • Deng, A., Lu, J., and Chen, S. (2016), “Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing,” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 243–252, IEEE. DOI: 10.1109/DSAA.2016.33.
  • Deng, A., Lu, J., and Litz, J. (2017), “Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, Cambridge, UK: Association for Computing Machinery, pp. 641–649. DOI: 10.1145/3018661.3018677.
  • Deng, A., and Shi, X. (2016), “Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 77–86, San Francisco, CA: ACM. DOI: 10.1145/2939672.2939700.
  • Deng, A., Xu, Y., Kohavi, R., and Walker, T. (2013), “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data,” Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM ’13, p. 123, Rome, Italy: ACM Press. DOI: 10.1145/2433396.2433413.
  • Deng, A., Yuan, L.-H., Kanai, N., and Salama-Manteau, A. (2023), “Zero to Hero: Exploiting Null Effects to Achieve Variance Reduction in Experiments with One-sided Triggering,” Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 823–831. DOI: 10.1145/3539597.3570413.
  • Deng, A., Zhang, P., Chen, S., Kim, D. W., and Lu, J. (2016b), “Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression,” arXiv preprint arXiv:1610.03917.
  • Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., and Vaz, G. (2016), “Pitfalls of Long-Term Online Controlled Experiments,” 2016 IEEE International Conference on Big Data (Big Data), pp. 1367–1376, IEEE. DOI: 10.1109/BigData.2016.7840744.
  • Dmitriev, P., Gupta, S., Kim, D. W., and Vaz, G. (2017), “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17, pp. 1427–1436, New York: Association for Computing Machinery. DOI: 10.1145/3097983.3098024.
  • Drutsa, A., Gusev, G., and Serdyukov, P. (2015), “Future User Engagement Prediction and its Application to Improve the Sensitivity of Online Experiments,” in Proceedings of the 24th International Conference on World Wide Web, pp. 256–266. DOI: 10.1145/2736277.2741116.
  • Drutsa, A., Ufliand, A., and Gusev, G. (2015), “Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. CIKM ’15, pp. 763–772, Melbourne, Australia: Association for Computing Machinery. DOI: 10.1145/2806416.2806496.
  • Eckles, D., Karrer, B., and Ugander, J. (2014), “Design and Analysis of Experiments in Networks: Reducing Bias from Interference,” arXiv:1404.7530 [physics, stat]. arXiv: 1404.7530.
  • Ensor, H., Lee, R. J., Sudlow, C., and Weir, C. J. (2016), “Statistical Approaches for Evaluating Surrogate Outcomes in Clinical Trials: A Systematic Review,” Journal of Biopharmaceutical Statistics, 26, 859–879. DOI: 10.1080/10543406.2015.1094811.
  • Fabijan, A., Dmitriev, P., Holmstrom Olsson, H., and Bosch, J. (2018), ‘Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing,” in 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 68–72. DOI: 10.1109/SEAA.2018.00021.
  • Feng, W., and Bauman, J. (2022), “Balancing Network Effects, Learning Effects, and Power in Experiments,” available at https://doordash.engineering/2022/02/16/balancing-network-effects-learning-effects-and-power-in-experiments/
  • Frangakis, C. E., and Rubin, D. B. (2002), “Principal Stratification in Causal Inference,” Biometrics, 58, 21–29. DOI: 10.1111/j.0006-341x.2002.00021.x.
  • Georgiev, G. (2022), “Fully Sequential vs Group Sequential Test,” available at https://blog.analyticstoolkit.com/2022/fully-sequential-vs-group-sequential-tests.
  • Georgiev, G. Z. (2019), Statistical Methods in Online A/B Testing, Self-Published.
  • Google (2022), “How Google’s Algorithm is Focused on Its Users - Google Search,” available at https://www.google.com/search/howsearchworks/mission/users/. (Accessed on 03/29/2022).
  • Gui, H., Xu, Y., Bhasin, A., and Han, J. (2015), “Network A/B Testing: From Sampling to Estimation,” in Proceedings of the 24th International Conference on World Wide Web, WWW ’15, pp. 399–409, Florence, Italy: International World Wide Web Conferences Steering Committee. DOI: 10.1145/2736277.2741081.
  • Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D., Curtis, M. A., Deng, A., Duan, W., Forbes, P., Frasca, B., Guy, T., Imbens, G. W., Saint Jacques, G., Kantawala, P., Katsev, I., Katzwer, M., Konutgan, M., Kunakova, E., Lee, M., Lee, M., Liu, J., McQueen, J., Najmi, A., Smith, B., Trehan, V., Vermeer, L., Walker, T., Wong, J., and Yashkov, I. (2019), “Top Challenges from the First Practical Online Controlled Experiments Summit,” SIGKDD Explorations Newsletter, 21, 20–35. DOI: 10.1145/3331651.3331655.
  • Ham, D. W., Bojinov, I., Lindon, M., and Tingley, M. (2022), “Design-Based Confidence Sequences for Anytime-valid Causal Inference,” arXiv preprint arXiv:2210.08639.
  • Hassan, A., Shi, X., Craswell, N., and Ramsey, B. (2013), “Beyond Clicks: Query Reformulation as a Predictor of Search Satisfaction,” in Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pp. 2019–2028.
  • Hern, A. (2014), “Why Google has 200m Reasons to Put Engineers Over Designers—Google—The Guardian,” available at https://www.theguardian.com/technology/2014/feb/05/why-googleengineers-designers. (Accessed on 10/26/2021).
  • Hohnhold, H., O’Brien, D., and Tang, D. (2015), “Focusing on the Long-Term: It’s Good for Users and Business,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1849–1858, Sydney, NSW, Australia: Association for Computing Machinery. DOI: 10.1145/2783258.2788583.
  • Holtz, D., Lobel, R., Liskovich, I., and Aral, S. (2020), “Reducing Interference Bias in Online Marketplace Pricing Experiments,” arXiv preprint arXiv:2004.12489.
  • Hopkins, F. (2020), “Increasing Experimental Power with Variance Reduction at the BBC—by Frank Hopkins—BBC Data Science—Medium,” available at https://medium.com/bbc-datascience/increasing-experiment-sensitivity-through-pre-experiment-variancereduction-166d7d00d8fd. (Accessed on 02/25/2022).
  • Hu, Y., and Wager, S. (2022), “Switchback Experiments under Geometric Mixing,” arXiv preprint arXiv:2209.00197.
  • Imai, K., and Ratkovic, M. (2013), “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation,” The Annals of Applied Statistics, 7, 443–470. DOI: 10.1214/12-AOAS593.
  • Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge: Cambridge University Press.
  • Isaac, M. (2021), “Facebook Wrestles With the Features It Used to Define Social Networking.” The New York Times. Available at https://www.nytimes.com/2021/10/25/technology/facebook-like-share-buttons.html
  • Ivaniuk, A. (2020), “Our Evolution Towards T-REX: The Prehistory of Experimentation Infrastructure at LinkedIn—LinkedIn Engineering,” available at https://engineering.linkedin.com/blog/2020/our-evolution-towards-t-rex–the-prehistory-of-experimentation-i. (Accessed on 02/14/2022).
  • Jackson, S. (2018), “How Booking.com Increases the Power of Online Experiments with CUPED—Booking.com Data Science,” available at https://booking.ai/how-booking-com-increasesthe-power-of-online-experiments-with-cuped-995d186fff1d. (Accessed on 01/13/2021).
  • Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2022a), “Always Valid Inference: Continuous Monitoring of a/b Tests,” Operations Research, 70, 1806–1821. DOI: 10.1287/opre.2021.2135.
  • Johari, R., Li, H., Liskovich, I., and Weintraub, G. Y. (2022b), “Experimental Design in Two-Sided Platforms: An Analysis of Bias,” Management Science, 68, 7069–7089. DOI: 10.1287/mnsc.2021.4247.
  • Ju, N., Hu, D., Henderson, A., and Hong, L. (2019), “A Sequential Test for Selecting the Better Variant: Online A/B Testing, Adaptive Allocation, and Continuous Monitoring,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, pp. 492–500, Melbourne VIC, Australia: Association for Computing Machinery. DOI: 10.1145/3289600.3291025.
  • Karrer, B., Shi, L., Bhole, M., Goldman, M., Palmer, T., Gelman, C., Konutgan, M., and Sun, F. (2021), “Network Experimentation at Scale,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3106–3116. DOI: 10.1145/3447548.3467091.
  • Keenan, M. (2022), “Global Ecommerce Explained: Stats and Trends to Watch in 2022,” available at https://www.shopify.ca/enterprise/global-ecommerce-statistics. (Accessed on 04/30/2023).
  • Kemp, S. (2023), “DIGITAL 2023: Global Overview Report,” available at https://datareportal.com/reports/digital-2023-global-overview-report. (Accessed on 04/30/2023).
  • Kharitonov, E., Drutsa, A., and Serdyukov, P. (2017), “Learning Sensitive Combinations of A/B Test Metrics,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, pp. 651–659, Cambridge, UK: Association for Computing Machinery. DOI: 10.1145/3018661.3018708.
  • Kharitonov, E., Vorobev, A., Macdonald, C., Serdyukov, P., and Ounis, I. (2015), “Sequential Testing for Early Stopping of Online Experiments,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pp. 473–482, Santiago, Chile: Association for Computing Machinery. DOI: 10.1145/2766462.2767729.
  • Kohavi, R. (2012), “Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics,” in Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, pp. 1–2, New York: Association for Computing Machinery. DOI: 10.1145/2365952.2365954.
  • ———(2023), “Build vs Buy,” available at https://bit.ly/ABTestsBuildVsBuy8.
  • Kohavi, R., Deng, A., and Vermeer, L. (2022), “A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments,” DOI: 10.1145/3534678.3539160.
  • Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., and Xu, Y. (2012), “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 786–794, Beijing, China: Association for Computing Machinery. DOI: 10.1145/2339530.2339653.
  • Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., and Pohlmann, N. (2013), “Online Controlled Experiments at Large Scale,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’13, p. 1168, Chicago, Illinois, USA: ACM Press. DOI: 10.1145/2487575.2488217.
  • Kohavi, R., Deng, A., Longbotham, R., and Xu, Y. (2014). “Seven Rules of Thumb for Web Site Experimenters,” Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14, pp. 1857–1866. New York, New York, USA: ACM Press. DOI: 10.1145/2623330.2623341.
  • Kohavi, R., and Longbotham, R. (2023), “Online Controlled Experiments and A/B Tests,” in Encyclopedia of Machine Learning and Data Science, eds. D. Phung, G. I. Webb, and C. Sammut, pp. 1–13, New York: Springer. DOI: 10.1007/978-1-4899-7502-7_891-2.
  • Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009), “Controlled Experiments On the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery, 18, 140–181. DOI: 10.1007/s10618-008-0114-1.
  • Kohavi, R., Tang, D., and Xu, Y. (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge: Cambridge University Press. DOI: 10.1017/9781108653985. Available at https://experimentguide.com/.
  • Kohavi, R., and Thomke, S. (2017), “The Surprising Power of Online Experiments,” Harvard Business Review, 95, 74–82.
  • Kohlmeier, S. (2022), “Microsoft’s Experimentation Platform: How We Build a World Class Product - Microsoft Research,” available at https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/microsofts-experimentationplatform-how-we-build-a-world-class-product/. (Accessed on 02/14/2022).
  • Koning, R., Hasan, S., and Chatterji, A. (2022), “Experimentation and Start-up Performance: Evidence from A/B Testing,” Management Science, 68, 6434–6453. DOI: 10.1287/mnsc.2021.4209.
  • Koutra, V., Gilmour, S. G., and Parker, B. M. (2021), “Optimal Block Designs for Experiments on Networks,” Journal of the Royal Statistical Society, Series C, 70, 596–618. DOI: 10.1111/rssc.12473.
  • Lan, K. K. G., and DeMets, D. L. (1983), “Discrete Sequential Boundaries for Clinical Trials,” Biometrika, 70, 659–663. DOI: 10.2307/2336502.
  • Lan, Y., Bakthavachalam, V., Sharan, L., Douriez, M., Azarnoush, B., and Kroll, M. (2022), “A Survey of Causal Inference Applications at Netflix—by Netflix Technology Blog,” at https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f. (Accessed on 08/18/2022).
  • Li, H., Zhao, G., Johari, R., and Weintraub, G. Y. (2022), “Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms,” in Proceedings of the ACM Web Conference 2022, pp. 182–192.
  • Lindon, M., and Malek, A. (2020), “Anytime-Valid Inference for Multinomial Count Data,” DOI: 10.48550/ARXIV.2011.03567.
  • Lindon, M., Sanden, C., and Shirikian, V. (2022), “Rapid Regression Detection in Software Deployments through Sequential Testing,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22, pp. 3336–3346. Washington DC, USA: Association for Computing Machinery. DOI: 10.1145/3534678.3539099.
  • Liou, K., and Taylor, S. J. (2020), “Variance-Weighted Estimators to Improve Sensitivity in Online Experiments,” in Proceedings of the 21st ACM Conference on Economics and Computation, pp. 837–850. DOI: 10.1145/3391403.3399542.
  • Liu, C., Cardoso, A., Couturier, P., and McCoy, E. J. (2021), “Datasets for Online Controlled Experiments,” arXiv preprint arXiv:2111.10198.
  • Liu, M., Mao, J., and Kang, K. (2021), “Trustworthy and Powerful Online Marketplace Experimentation with Budget-split Design,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3319–3329. DOI: 10.1145/3447548.3467193.
  • Luca, M., and Bazerman, M. H. (2021), The Power of Experiments: Decision Making in a Data-Driven World, Cambridge, MA: MIT Press.
  • Manzi, J. (2012), UNCONTROLLED The Surprising Payoff of Trial-and-Error for Business, Politics, and Society, New York, NY: Basic Books.
  • Matias, J. N., Munger, K., Le Quere, M. A., and Ebersole, C. (2021), “The Upworthy Research Archive, A Time Series of 32,487 Experiments in US Media,” Scientific Data, 8, 195. DOI: 10.1038/s41597-021-00934-7.
  • McFarland, C. (2012), Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing, pp. 190, Berkeley, CA: New Riders.
  • McFowland III, E., Gangarapu, S., Bapna, R., and Sun, T. (2021), “A Prescriptive Analytics Framework for Optimal Policy Deployment Using Heterogeneous Treatment Effects,” MIS Quarterly, 45, 1807–1832. DOI: 10.25300/MISQ/2021/15684.
  • Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles,” Annals of Agricultural Sciences, 1–51.
  • Ni, T., Bojinov, I., and Zhao, J. (2023), “Design of Panel Experiments with Spatial and Temporal Interference,” Available at SSRN 4466598.
  • O’Brien, P. C., and Fleming, T. R. (1979), “A Multiple Testing Procedure for Clinical Trials,” Biometrics, 35, 549–556.
  • Parker, B. M., Gilmour, S. G., and Schormans, J. (2017), “Optimal Design of Experiments on Connected Units with Application to Social Networks,” Journal of the Royal Statistical Society, Series C, 66, 455–480. DOI: 10.1111/rssc.12170.
  • Pekelis, L. (2015), “Statistics for the Internet Age: The Story Behind Optimizely’s New Stats Engine,” available at https://www.optimizely.com/insights/blog/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/. (Accessed on 03/08/2022).
  • Petersen, A., Witten, D., and Simon, N. (2016), “Fused Lasso Additive Model,” Journal of Computational and Graphical Statistics, 25, 1005–1025. DOI: 10.1080/10618600.2015.1073155.
  • Peysakhovich, A., and Lada, A. (2016), “Combining Observational and Experimental Data to Find Heterogeneous Treatment Effects,” arXiv preprint arXiv:1611.02385.
  • Pocock, S. J. (1977), “Group Sequential Methods in the Design and Analysis of Clinical Trials,” Biometrika, 64, 191–199. DOI: 10.1093/biomet/64.2.191.
  • Pokhilko, V., Zhang, Q., Kang, L., and Mays, D. P. (2019), “D-optimal Design for Network a/b Testing,” Journal of Statistical Theory and Practice, 13, 1–23. DOI: 10.1007/s42519-019-0058-3.
  • Poyarkov, A., Drutsa, A., Khalyavin, A., Gusev, G., and Serdyukov, P. (2016), “Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16, pp. 235–244. San Francisco, California, USA: Association for Computing Machinery. DOI: 10.1145/2939672.2939688.
  • Pramanik, S., Johnson, V. E., and Bhattacharya, A. (2021), “A Modified Sequential Probability Ratio Test,” Journal of Mathematical Psychology, 101, 102505. DOI: 10.1016/j.jmp.2021.102505.
  • Prentice, R. L. (1989), “Surrogate Endpoints in Clinical Trials: Definition and Operational Criteria,” Statistics in Medicine, 8, 431–440. DOI: 10.1002/sim.4780080407.
  • Quin, F., Weyns, D., Galster, M., and Silva, C. C. (2023), “A/B Testing: A Systematic Literature Review,” arXiv preprint arXiv:2308.04929.
  • Robertson, D. S., Choodari-Oskooei, B., Dimairo, M., Flight, L., Pallmann, P., and Jaki, T. (2023), “Point Estimation for Adaptive Trial Designs I: A Methodological Review,” Statistics in Medicine. 42, 122–145. DOI: 10.1002/sim.9605.
  • Robinson, P. M. (1988), “Root-N-Consistent Semiparametric Regression,” Econometrica, 56, 931–954. DOI: 10.2307/1912705.
  • Ruberg, S. J. (1995a), “Dose Response Studies I. Some Design Considerations,” Journal of Biopharmaceutical Statistics, 5, 1–14. DOI: 10.1080/10543409508835096.
  • ———(1995b), “Dose Response Studies II. Analysis and Interpretation,” Journal of Biopharmaceutical Statistics, 5, 15–42.
  • Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701. DOI: 10.1037/h0037350.
  • Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., and Zhang, R. (2022), “Novelty and Primacy: A Long-Term Estimator for Online Experiments,” Technometrics, 64, 524–534. DOI: 10.1080/00401706.2022.2124309.
  • Saint-Jacques, G. (2019), “Detecting interference: An A/B test of A/B Tests—LinkedIn Engineering,” available at https://engineering.linkedin.com/blog/2019/06/detecting-interference–an-a-b-test-of-a-b-tests. (Accessed on 02/22/2022).
  • Saint-Jacques, G., Varshney, M., Simpson, J., and Xu, Y. (2019), “Using Ego-Clusters to Measure Network Effects at LinkedIn,” arXiv preprint arXiv:1903.08755.
  • Saveski, M., Pouget-Abadie, J., Saint-Jacques, G., Duan, W., Ghosh, S., Xu, Y., and Airoldi, E. M. (2017), “Detecting Network Effects: Randomizing Over Randomized Experiments,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17, pp. 1027–1035. Halifax, NS, Canada: Association for Computing Machinery. DOI: 10.1145/3097983.3098192.
  • Schroeder, B. (2021), “The Data Analytics Profession And Employment Is Exploding: Three Trends That Matter,” available at https://www.forbes.com/sites/bernhardschroeder/2021/06/11/the-data-analytics-profession-and-employment-is-exploding-threetrends-that-matter/?sh=12c5c3c3f81e. (Accessed on 03/10/2022).
  • Schultzberg, M., and Ankargren, S. (2023), “Choosing Sequential Testing Framework—Comparisons and Discussions,” available at https://engineering.atspotify.com/2023/03/choosingsequential-testing-framework-comparisons-and-discussions/.
  • Sepehri, A., and DiCiccio, C. (2020), “Interpretable Assessment of Fairness During Model Evaluation,” arXiv preprint arXiv:2010.13782.
  • Sexauer, C. (2022), “CUPED on Statsig,” available at https://blog.statsig.com/cuped-on-statsigd57f23122d0e. (Accessed on 08/18/2022).
  • Sharma, C. (2021), “Reducing Experiment Durations - Eppo Blog,” available at https://www.geteppo.com/blog/reducing-experiment-durations. (Accessed on 02/25/2022).
  • ———(2022), “Bending time in experimentation - Eppo Blog,” available at https://www.geteppo.com/blog/bending-time-in-experimentation. (Accessed on 08/18/2022).
  • Shi, C., Wang, X., Luo, S., Song, R., Zhu, H., and Ye, J. (2020), “A Reinforcement Learning Framework for Time-Dependent Causal Effects Evaluation in A/B Testing,” arXiv preprint arXiv:2002.01711.
  • Skotara, N. (2023), “Sequential Testing at Booking.com,” available at https://booking.ai/sequentialtesting-at-booking-com-650954a569c7.
  • Syrgkanis, V., Lei, V., Oprescu, M., Hei, M., Battocchi, K., and Lewis, G. (2019), “Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments,” in Advances in Neural Information Processing Systems, pp. 15193–15202.
  • Tang, D., Agarwal, A., O’Brien, D., and Meyer, M. (2010), “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pp. 17–26. Washington, DC: Association for Computing Machinery. DOI: 10.1145/1835804.1835810.
  • Thomke, S. H. (2020), Experimentation works: The Surprising Power of Business Experiments, Brighton, MA: Harvard Business Press.
  • Tran, C., and Zheleva, E. (2019), “Learning Triggers for Heterogeneous Treatment Effects,” in Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33), pp. 5183–5190. DOI: 10.1609/aaai.v33i01.33015183.
  • Tsiatis, A. A. (2006), Semiparametric Theory and Missing Data, New York, NY: Springer.
  • Ugander, J., Karrer, B., Backstrom, L., and Kleinberg, J. (2013), “Graph Cluster Randomization: Network Exposure to Multiple Universes,” arXiv:1305.6979 [physics, stat]. arXiv: 1305.6979.
  • Urban, S., Sreenivasan, R., and Kannan, V. (2016), “It’s All A/Bout Testing: The Netflix Experimentation Platform—by Netflix Technology Blog—Netflix TechBlog,” available at https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15. (Accessed on 10/26/2021).
  • Visser, D. (2020), “In-House Experimentation Platforms,” available at https://www.linkedin.com/pulse/inhouse-experimentation-platforms-denise-visser/. (Accessed on 09/15/2022).
  • Von Ahn, L. (2022), “Shareholder Letter Q2 2022,” available at https://investors.duolingo.com/staticfiles/ae55dd31-2ce4-41ac-bb26-948bafe8409c.
  • Wager, S., and Athey, S. (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242. DOI: 10.1080/01621459.2017.1319839.
  • Wald, A. (1945), “Sequential Tests of Statistical Hypotheses,” The Annals of Mathematical Statistics, 16, 117–186. DOI: 10.1214/aoms/1177731118.
  • ———(1947), Sequential Analysis, New York: Courier Corporation.
  • Wang, Y., Gupta, S., Lu, J., Mahmoudzadeh, A., and Liu, S. (2019), “On Heavy-user Bias in A/B Testing,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2425–2428. DOI: 10.1145/3357384.3358143.
  • Waudby-Smith, I., Arbour, D., Sinha, R., Kennedy, E. H., and Ramdas, A. (2021), “Time-Uniform Central Limit Theory, Asymptotic Confidence Sequences, and Anytime-Valid Causal Inference,” arXiv preprint arXiv:2103.06476.
  • Xia, T., Bhardwaj, S., Dmitriev, P., and Fabijan, A. (2019), “Safe Velocity: A Practical Guide to Software Deployment at Scale Using Controlled Rollout,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 11–20, IEEE. DOI: 10.1109/ICSE-SEIP.2019.00010.
  • Xie, H., and Aurisset, J. (2016), “Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 645–654. San Francisco, CA: Association for Computing Machinery. DOI: 10.1145/2939672.2939733.
  • Xie, Y., Chen, N., and Shi, X. (2018), “False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’18, pp. 876–885. London, UK: Association for Computing Machinery. DOI: 10.1145/3219819.3219860.
  • Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. (2015), “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, pp. 2227–2236. Sydney, NSW, Australia: ACM Press. DOI: 10.1145/2783258.2788602.
  • Xu, Y., Duan, W., and Huang, S. (2018), “SQR: Balancing Speed, Quality and Risk in Online Experiments,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’18, pp. 895–904. London, UK: ACM. DOI: 10.1145/3219819.3219875.
  • Yoon, S. (2018), “Designing A/B Tests in a Collaboration Network,” The Unofficial Google Data Science Blog, available at http://www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html (visited on 06/11/2020).
  • Yu, M., Lu, W., and Song, R. (2020), “A New Framework for Online Testing of Heterogeneous Treatment Effect,” arXiv: 2002.03277 [stat.ME].
  • Zhang, C., Coey, D., Goldman, M., and Karrer, B. (2021), “Regression Adjustment with Synthetic Controls in Online Experiments,” Meta Research. Available at https://research.facebook.com/publications/regression-adjustment-with-synthetic-controls-in-online-experiments/
  • Zhang, Q., and Kang, L. (2022), “Locally Optimal Design for A/B Tests in the Presence of Covariates and Network Dependence,” Technometrics, 64, 358–369. DOI: 10.1080/00401706.2022.2046169.
  • Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012), “Estimating Individualized Treatment Rules Using Outcome Weighted Learning,” Journal of the American Statistical Association, 107, 1106–1118. DOI: 10.1080/01621459.2012.695674.
  • Zhou, Y., Liu, Y., Li, P., and Hu, F. (2020), “Cluster-Adaptive Network A/B Testing: From Randomization to Estimation.” arXiv:2008.08648 [stat]. arXiv: 2008.08648.