This article was originally posted on Optimizely.com as a blog post. It has been moved to the support documentation for completeness.
This article is part one of a two of a series on Stats Accelerator. Review Accelerator Under Time-Varying Signals for the next article.
What is Stats Accelerator, and why use it?
When customers set up an A/B/n test using Optimizely Experimentation, they choose what percentage of visitors (or samples) will be exposed to each test’s variations. This is the test’s traffic allocation. The traffic allocation stays the same unless a customer manually changes it — for example, to divert traffic to a variation that seems to be performing better than the others.
For several years, we only allowed customers to adjust traffic manually. Over time, we heard from several of them that our setup was inconvenient and unscalable because their goals required frequently changing traffic. Their business needs could be classified into two scenarios:
- Identifying a statistically significant variation quickly: In this scenario, a customer running an A/B/n test wants to know if a variation can lead to actionable insight. That is, is a variation different from the baseline with respect to a given metric? We call this difference the lift or improvement. If different, find the one furthest from the baseline as quickly as possible — while controlling the false discovery rate (FDR) under a predetermined threshold — so this can inform them on how to optimize their business.
- Optimizing reward for a period of time: In this scenario, customers want to maximize impact, such as revenue. For example, customers with a Black Friday promotion may wish to experiment with different headlines on their landing page to maximize profit. They are not concerned with implementing any permanent changes; they only want to direct visitors to the variation that yields the highest conversion rates right now and will remove the variations from their website after the promotion ends.
To help customers achieve these two goals, we launched Stats Accelerator with two modes: Accelerate Learnings to address the first scenario, and Accelerate Impact to address the second. Both modes dynamically change a test’s traffic allocation, but their strategies and intent differ vastly.
Optimizely Experimentation has changed how it refers to these two goals. Accelerate Impact is now referred to as Multi-Armed Bandit (MAB), and Accelerate Learning is called Stats Accelerator (SA).
Though both use multi-armed bandits to allocate visitors dynamically, we used the commonly used industry term of MAB to rename Accelerate Impact because a multi-armed bandit is often associated with minimizing regret. For more information, please review the article on Improvements to Stats Accelerator's Accelerate Impact and Accelerate Learning.
Stats Accelerator (SA) (formerly Accelerate Learning)
We have partnered with Kevin Jamieson, from the Computer Science & Engineering Department at the University of Washington, who has developed an iterative algorithm that discovers significant variations as quickly as possible by directing more visitors to the variations with a better chance of reaching statistical significance. This prioritizes finding the variation whose effect size — the observed difference between it and the baseline — is most significant while keeping FDR below a given threshold.
For simplicity, let us assume that all variations have a true mean greater than the baseline for the rest of this section. This implies that all variations will eventually be found to have statistically significant improvements. We are interested in first finding the variation that is furthest from the baseline in as few samples as possible. Loosely speaking, this algorithm allocates more traffic to the variation with the highest confidence bound on each iteration. The highest confidence bound is the upper endpoint of the confidence interval that is furthest from 0. At the next iteration, it observes the resulting variations’ confidence intervals and empirical means and makes decisions based on that. Once a variation reaches statistical significance, it is taken out of consideration and the process repeats. The algorithm focuses on the remaining inconclusive variations to determine the next best one.
Variations whose true means are close to the baseline’s true mean require more samples to detect this effect size, whereas those furthest away require fewer. To identify the variations whose true means are furthest from the baseline, the algorithm uses the confidence interval by interpreting the upper confidence bound as the highest possible value it will observe of that variation. It picks the one with the highest upper confidence bound and guesses that this will be the best performing one. If correct, the allocated samples will continue to reinforce the empirical mean it observed, and the variation will reach statistical significance faster. If wrong, the updated confidence interval and empirical mean would reveal this, and the algorithm refocuses on other variations. It does so according to the exact heuristic used when starting the round that just completed: it assumes that the variation whose upper confidence bound is greatest shows the most potential to reach statistical significance first.
This adaptive algorithm reduces the time to find the first best statistically significant variation compared to one using uniform allocation. Check out the Experiments Section of the paper, where the results indicate that this speedup can be at least two to three times faster.
For example, suppose a customer runs Stats Accelerator and is looking for the best variation.
At time t = t0, the results are:
Figure 1. An illustration of what Accelerate Learnings observes before updating the test’s allocation at time t0. The dashed line indicates where improvement is 0. The circle is the observed improvement for each variation, and the line represents its confidence interval. Anything above the dotted line means the improvement is positive.
Stats Accelerator allocates more traffic to Var1 than Var2 and Var3 because it looks like it can reach statistical significance with fewer samples than the others. Its confidence interval suggests this, whose upper bound is the furthest from 0.
And at time t = t0 + 1 — the next time we run the algorithm — we observe the results:
Figure 2. What is observed after we run Accelerate Learnings for one iteration.
Var1 reaches statistical significance, so we no longer consider it; we allocate new visitors to the remaining inconclusive variations to identify the next best statistically significant variation as soon as possible.
Multi-Armed Bandit (MAB) (formerly Accelerate Impact)
MAB for Bernoulli metrics
To optimize for a Bernoulli metric (a measurable action that is either a 0 or 1), we use Thompson sampling (a heuristic to choose how to allocate traffic while addressing the exploitation-exploration problem in a multi-armed bandit scenario.) For more information, refer to the Glossary at the end of the article for explanations of these terms.
For each variation, we use the test’s current observed number of unique conversions and number of visitors to characterize a beta distribution. We sample these distributions several times and allocate traffic according to their win ratio.
For example, suppose we have the following results on comparing unique conversions when we are updating the traffic allocation:
Figure 3. The observed results of the experiment under a Bernoulli metric.
We characterize beta distributions, denoted as Beta(𝛼, 𝛽), based on these values, where 𝛼 is the number of unique conversions, and 𝛽 is the number of visitors for each variation. For example, the Original would be characterized by Beta(189, 323).
Figure 4. The resulting Beta distributions are characterized by the results in Figure 3.
We sample these distributions N times — say 10,000 — recording the distribution that yields the highest value in each round. This simulates what might happen if we had 10,000 visitors allocated to the Original, 10,000 to Variation #1 and 10,000 to Variation #2. The ratio of times each variation wins determines the percentage of new visitors allocated to it:
Figure 5. Result of Thompson Sampling where N=10000.
On the next iteration, we run Thompson Sampling again with the updated observations.
MAB for numeric metrics
For numeric metrics, such as revenue or number of events per user, we implemented an Epsilon-Greedy bandit. This bandit distributes a small (epsilon) fraction of the traffic uniformly amongst all variations for exploration and then allocates the larger (1- epsilon) fraction to the variation with the most significant observed mean for exploitation.
There needs to be a portion of traffic designated for exploratory purposes because the empirical (sample) mean may not be close to its true mean. Both Epsilon-Greedy and Thompson Sampling in Multi-Armed bandit are common strategies that address the exploitation-exploration dilemma.
Thompson Sampling favors early exploration due to uncertainties at the early stages of a test. In contrast, Epsilon-Greedy focuses more on exploitation, funneling most traffic to the variation with the highest empirical mean. If a customer only uses the observed information, they would direct all new visitors to the variation with the highest empirical mean. But how do they know if that variation has the highest actual mean? They would need more samples to reduce this uncertainty. This is where exploration becomes essential because that reveals more evidence about the underlying means. Hence, a balance between the two is necessary.
Stats Accelerator also has a form of exploration and exploitation as well. In terms of exploitation, it uses the upper confidence bounds of each variation, allocating more traffic to the highest bound. And if its decision was suboptimal, it will know from the updated results of the test. As a result, it will explore other variations.
When should I use Stats Accelerator vs. Multi-armed Bandit?
SA provides customers with actionable insight, where the best significant variation can be implemented quickly in their business. It focuses on the statistical uncertainties of the variations, using the confidence intervals to update the test’s allocation to reduce the time to finding the best significant variation.
MAB should be used for promotions, sales, or anything temporary, where the intent is to drive more traffic to a variation with the highest metric value. It only focuses on the variations’ empirical mean to maximize some payoff by funneling more traffic to the winning variation.
For more information, refer to Multi-armed bandits vs. Stats Accelerator: when to use each.
- Arm: This is bandit lingo. An arm is synonymous with a variation.
- Bernoulli metric: A measurable action that is either a 0 or 1, such as whether a visitor converted or not.
- Effect Size: In an A/B/n test, the observed difference between the baseline and variation is observed.
- Epsilon-Greedy Bandit: Check out “Epsilon-Greedy strategy” under “Bandit Strategies” here.
- Exploitation: Bandit lingo. Making a decision given the current information.
- Exploration: Bandit lingo. Gathering more information.
- Exploration-exploitation problem: Bandit lingo. Do we direct all traffic to a variation we have observed with the highest empirical mean (exploitation)? If this is not the best variation, how do we direct some traffic to the other variations to find the one with the highest true mean (exploration)? What is the optimal trade-off between these two states?
- False Discovery Rate: This is the average number of incorrect rejections of the null hypothesis over all discoveries made in a set of A/B/n tests. Check out the Why Stats Engine controls for false discovery instead of false positives article that talks about why we use this within Stats Engine.
- Multi-armed bandit: The problem where the strategy is to optimize a specific function by deciding how to allocate resources.
- Numeric metric: A measurable event that is not a 0 or 1 (read: not a Bernoulli metric). This includes metrics such as revenue, the number of clicks per visitor, and so on.
- Sample: An observation. If we measure unique conversions per visitor, a sample would be a visitor.
- Statistical significance: The likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. Check out the article on Confidence intervals and improvement intervals to see how confidence intervals and statistical significance are related.
- Thompson Sampling: A heuristic to choose how to allocate traffic while addressing the exploitation-exploration problem in a multi-armed bandit scenario. Check out this link for an in-depth introduction to this topic.
- Traffic allocation: The distribution of new visitors to a test’s variations.
- Upper confidence bound: Bandit lingo. Also known as UCB. The upper value of the confidence interval.