History of Stats Accelerator and multi-armed bandits

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack Experimentation (Legacy)

When you set up an A/B test using Optimizely Experimentation, you choose the percentage of visitors (or samples) exposed to each of the test's variations. This is the test's traffic allocation. The traffic allocation stays the same unless you manually change it. For example, you might change the allocation to direct more traffic to a variation that performs better than the others.

For several years, Optimizely Experimentation only let you adjust traffic allocation manually. Over time, Optimizely heard from users that this setup was inconvenient and did not scale because they needed to change traffic allocation frequently. Their business needs fell into two scenarios:

  1. Identifying a statistically significant variation quickly – In this scenario, you are running an A/B test and want to know whether a variation can lead to actionable insight. You want to know: is a variation different from the baseline with respect to a given metric? Optimizely calls this difference the lift or improvement. If a difference exists, you want to find the variation furthest from the baseline as quickly as possible, while controlling the false discovery rate (FDR) under a predetermined threshold, so it can inform how you optimize your business.
  2. Optimizing reward for a limited time – In this scenario, you want to maximize impact, such as revenue. For example, you have a Black Friday promotion and may want to experiment with different headlines on your landing page to maximize profit. You are not concerned with implementing any permanent changes. You only want to direct visitors to the variation that yields the highest conversion rate right now, and you will remove the variations from your website after the promotion ends.

To help users achieve these two goals, Optimizely Experimentation initially launched Stats Accelerator with two modes: Accelerate Learnings to address the first scenario and Accelerate Impact to address the second. Both modes dynamically change a test's traffic allocation, but their strategies and intent differ vastly.

Optimizely Experimentation has since changed how it refers to these two modes. Accelerate Impact is now referred to as a multi-armed bandit (MAB) optimization, and Accelerate Learnings is now called Stats Accelerator (SA).

Although both multi-armed bandit optimizations and Stats Accelerator use multi-armed bandits to allocate visitors dynamically, Optimizely Experimentation chose the common industry term MAB when renaming Accelerate Impact because multi-armed bandits are most often associated with minimizing regret.

Stats Accelerator (formerly Accelerate Learnings)

Optimizely Experimentation partnered with Kevin Jamieson of the Computer Science and Engineering Department at the University of Washington, who developed an iterative algorithm that discovers significant variations as quickly as possible by directing more visitors to the variations with a better chance of reaching statistical significance. The algorithm prioritizes finding the variation whose effect size (the observed difference between it and the baseline) is largest while keeping the FDR below a given threshold.

For simplicity, assume for the rest of this section that every variation has a true mean greater than the baseline's. This implies that all variations will eventually be found to have statistically significant improvements, and the goal is to find the variation furthest from the baseline in as few samples as possible. On each iteration, the algorithm allocates more traffic to the variation with the highest upper confidence bound (the upper endpoint of the confidence interval that is furthest from 0). At the next iteration, it observes the resulting confidence intervals and empirical means and makes decisions based on them. When a variation reaches statistical significance, it is taken out of consideration and the process repeats, with the algorithm focusing on the remaining inconclusive variations to determine the next best one.

Variations whose true means are close to the baseline's true mean require more samples to detect their effect size, whereas those furthest away require fewer. To identify the variations whose true means are furthest from the baseline, the algorithm interprets each variation's upper confidence bound as the highest value it could plausibly observe for that variation. It picks the variation with the highest upper confidence bound and guesses that this is the best-performing one. If the guess is correct, the allocated samples will keep reinforcing the observed empirical mean, and the variation will reach statistical significance faster. If the guess is wrong, the updated confidence interval and empirical mean will reveal this, and the algorithm refocuses on other variations, applying the same heuristic it used at the start of the round that just completed: the variation with the greatest upper confidence bound shows the most potential to reach statistical significance first.
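As a rough sketch, this allocation heuristic can be expressed in a few lines of Python. This is not Optimizely's implementation (which also controls the FDR through Stats Engine); the fixed z value, the variation names, and the (lift, standard error) inputs are illustrative assumptions.

```python
def allocate_traffic(stats, z=1.96):
    """Toy UCB-style allocation step (not Optimizely's implementation).

    `stats` maps each still-inconclusive variation to a tuple of
    (estimated lift over the baseline, standard error of that estimate).
    Returns the variation to favor with extra traffic next round.
    """
    # Upper confidence bound on each variation's lift over the baseline.
    upper_bounds = {name: lift + z * se for name, (lift, se) in stats.items()}
    # The interval that reaches highest shows the most potential to hit
    # statistical significance first, so it receives more traffic.
    return max(upper_bounds, key=upper_bounds.get)

# Hypothetical round: B's upper bound (0.015 + 1.96 * 0.012 = 0.039) is the
# highest, so B receives more traffic even though A's point estimate is close.
round_stats = {"A": (0.010, 0.008), "B": (0.015, 0.012), "C": (0.004, 0.005)}
print(allocate_traffic(round_stats))  # -> B
```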

See the Experiments section of 'A Bandit Approach to Multiple Testing with False Discovery Control', where the results indicate that this adaptive allocation can find significant variations at least two to three times faster than uniform allocation.

Multi-armed bandit optimizations (formerly Accelerate Impact)

Optimizely Experimentation uses a different algorithm depending on the type of metric the MAB optimization is run on:

  • Bernoulli metrics – To optimize a Bernoulli metric (a measurable action that is either 0 or 1), Optimizely uses Thompson sampling, a heuristic for choosing how to allocate traffic while addressing the exploitation-exploration problem in a multi-armed bandit scenario. Refer to the Glossary at the end of this article for explanations of these terms.
  • Numeric metrics – For numeric metrics, such as revenue or the number of events per user, Optimizely implemented an Epsilon-Greedy bandit. This bandit distributes a small (epsilon) fraction of the traffic uniformly among the variations for exploration and allocates the larger (1 - epsilon) fraction to the variation with the highest observed mean for exploitation. A minimal sketch of both strategies follows this list.
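Here is a minimal Python sketch of the two strategies under simplified assumptions. The variation names, counts, Beta(1, 1) priors, and epsilon value are all illustrative; this is not Optimizely's implementation.

```python
import random

def thompson_pick(successes, failures):
    """Thompson Sampling for a Bernoulli metric: draw a plausible
    conversion rate from each arm's Beta posterior (Beta(1, 1) prior
    assumed) and send the next visitor to the arm with the highest draw."""
    draws = {
        arm: random.betavariate(successes[arm] + 1, failures[arm] + 1)
        for arm in successes
    }
    return max(draws, key=draws.get)

def epsilon_greedy_pick(observed_means, epsilon=0.1):
    """Epsilon-Greedy for a numeric metric: explore uniformly with
    probability epsilon, otherwise exploit the highest observed mean."""
    if random.random() < epsilon:
        return random.choice(list(observed_means))      # exploration
    return max(observed_means, key=observed_means.get)  # exploitation

# Hypothetical counts and means for two variations.
print(thompson_pick({"A": 40, "B": 55}, {"A": 960, "B": 945}))
print(epsilon_greedy_pick({"A": 1.20, "B": 1.35}))
```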

See Maximize lift with multi-armed bandit optimizations.

Exploration-exploitation dilemma

A portion of traffic must be designated for exploration because the empirical (sample) mean may not be close to the true mean. Epsilon-Greedy and Thompson Sampling are common multi-armed bandit strategies that address the exploitation-exploration dilemma.

Thompson Sampling favors early exploration because of the uncertainty at the early stages of a test. In contrast, Epsilon-Greedy focuses more on exploitation, funneling most traffic to the variation with the highest empirical mean. If you only used the observed information, you would direct new visitors to the variation with the highest empirical mean. But how do you know that variation has the highest true mean? You would need more samples to reduce this uncertainty. This is where exploration becomes essential, because it reveals more evidence about the underlying means. Hence, a balance between the two is necessary.
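The following simulation sketch, again using made-up counts and Beta(1, 1) priors, illustrates this behavior: with little data the posteriors are wide and overlapping, so traffic stays spread out; with more data they narrow, and traffic concentrates on the leader.

```python
import random

def traffic_share(successes, failures, arm, draws=10_000):
    """Estimate how often Thompson Sampling would pick `arm` by
    repeatedly sampling every arm's Beta posterior."""
    wins = 0
    for _ in range(draws):
        sampled = {
            a: random.betavariate(successes[a] + 1, failures[a] + 1)
            for a in successes
        }
        wins += max(sampled, key=sampled.get) == arm
    return wins / draws

# Early in the test (10 visitors per arm): posteriors are wide, so B gets
# only a modest majority of traffic even though it looks better (exploration).
print(traffic_share({"A": 1, "B": 2}, {"A": 9, "B": 8}, "B"))

# Later (1,000 visitors per arm, same conversion rates): posteriors are
# narrow, so B receives nearly all of the traffic (exploitation).
print(traffic_share({"A": 100, "B": 200}, {"A": 900, "B": 800}, "B"))
```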

Stats Accelerator also balances exploration and exploitation. For exploitation, it uses the upper confidence bounds of the variations, allocating more traffic to the variation with the highest bound. If that decision was suboptimal, the updated test results reveal it, and the algorithm explores other variations as a result.

When should I use Stats Accelerator or multi-armed bandit?

See Stats Accelerator compared to multi-armed bandit optimizations for details.

Stats Accelerator – Used to reduce the time to find the best statistically significant variation, giving you actionable insight you can implement quickly. It focuses on the statistical uncertainties of the variations, using their confidence intervals to update the test's traffic allocation.

Multi-armed bandit – Used for promotions, sales, or anything temporary, where the intent is to drive more traffic to the variation with the highest metric value. It focuses only on the variations' empirical means, maximizing payoff by funneling more traffic to the winning variation.

Glossary

  • Arm – Bandit algorithm language. An arm is synonymous with a variation.
  • Bernoulli metric – A measurable action that is a 0 or 1, such as whether a visitor converted or not.
  • Effect Size – In an A/B/n test, the observed difference between the baseline and a variation.
  • Epsilon-Greedy Bandit – See the Epsilon-Greedy description in the Multi-armed bandit optimizations section above.
  • Exploitation – Making a decision given the current information.
  • Exploration – Gathering information.
  • Exploration-exploitation problem – Problem: Do you direct traffic to a variation you have observed with the highest empirical mean (exploitation)? If this is not the best variation, how do you direct some traffic to the other variations to find the one with the highest true mean (exploration)? What is the optimal trade-off between these two states?
  • False Discovery Rate (FDR) – The expected proportion of incorrect rejections of the null hypothesis among all discoveries made in a set of A/B/n tests. See Why Stats Engine controls for false discovery instead of false positives for why Optimizely uses FDR within Stats Engine.
  • Multi-armed bandit – A problem in which resources are allocated among competing choices (arms) to optimize an objective, such as total payoff.
  • Numeric metric – A measurable event that is not a 0 or 1 (not a Bernoulli metric). This includes metrics such as revenue, the number of clicks per visitor, and so on.
  • Sample – An observation. If you measure unique conversions per visitor, a sample would be a visitor.
  • Statistical significance – The likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. Check out the article on Confidence intervals and improvement intervals to see how confidence intervals and statistical significance are related.
  • Thompson Sampling – A heuristic to choose how to allocate traffic while addressing the exploitation-exploration problem in a multi-armed bandit scenario. See A Tutorial on Thompson Sampling for an in-depth introduction to this topic.
  • Traffic allocation – The distribution of visitors to a test's variations.
  • Upper confidence bound – The upper value of the confidence interval.