- Optimizely Web Experimentation
- Optimizely Personalization
- Optimizely Feature Experimentation
- Optimizely Full Stack (Legacy)
If you run a lot of experiments, you may face two challenges:
- Data collection is costly, and time spent experimenting means you have less time to exploit the value of the eventual winner.
- Creating more than one or two variations can delay statistical significance longer than you might like.
Stats Accelerator manipulates traffic to minimize time to statistical significance. It monitors ongoing experiments and uses machine learning to adjust traffic distribution among variations. In other words, it shows more visitors the variations with a better chance of reaching statistical significance. The process attempts to discover as many significant variations as possible.
Specifically, Stats Accelerator does the following:
- Identifies the variation showing the most significant difference from the baseline.
- Pushes traffic to that variation.
- When that variation reaches statistical significance, the variation is taken out of consideration.
- Traffic is redistributed to the remaining variations, and the cycle continues.
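The cycle above can be sketched in a few lines of Python. This is a toy illustration only, not Optimizely's implementation: the `Variation` class, the `focus_share` parameter, the rule "focus on the inconclusive variation whose confidence interval reaches furthest from zero", and the even split of the remaining traffic are all assumptions made for the example. Baseline traffic is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Variation:
    name: str
    lower: float  # lower bound of the confidence interval on improvement
    upper: float  # upper bound of the confidence interval on improvement


def is_significant(v):
    # Toy rule: a variation is conclusive once its interval excludes zero.
    return v.lower > 0 or v.upper < 0


def pick_focus(variations):
    """Return the inconclusive variation whose interval reaches furthest from zero."""
    open_vars = [v for v in variations if not is_significant(v)]
    if not open_vars:
        return None
    return max(open_vars, key=lambda v: max(abs(v.lower), abs(v.upper)))


def allocate(variations, focus_share=0.6):
    """Give most traffic to the focus variation; conclusive variations get none."""
    weights = {v.name: 0.0 for v in variations}
    open_vars = [v for v in variations if not is_significant(v)]
    if not open_vars:
        return weights
    focus = pick_focus(variations)
    if len(open_vars) == 1:
        weights[focus.name] = 1.0
        return weights
    rest = (1.0 - focus_share) / (len(open_vars) - 1)
    for v in open_vars:
        weights[v.name] = focus_share if v is focus else rest
    return weights


# First pass: nothing is conclusive yet, so Var1 (widest reach) gets the focus.
snapshot = [
    Variation("Var1", -0.01, 0.09),
    Variation("Var2", -0.03, 0.05),
    Variation("Var3", -0.04, 0.03),
]
print(allocate(snapshot))

# Later pass: Var1 is now conclusive and drops out; traffic moves to the rest.
snapshot[0] = Variation("Var1", 0.01, 0.08)
print(allocate(snapshot))
```

In the second call, Var1 receives no traffic and the focus moves to the next most promising inconclusive variation, mirroring the last two steps of the list.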
You can only use Stats Accelerator in partial factorial mode if running a multivariate test. When Stats Accelerator is enabled, you cannot switch directly from partial factorial to full factorial mode. You must set your traffic distribution mode to Manual if you want to use full factorial mode.
Stats Accelerator results reporting
Stats Accelerator adjusts the percentage of visitors who see each variation. This means visitor counts will reflect the distribution decisions of the Stats Accelerator.
Stats Accelerator experiments and campaigns report absolute improvement in percentage points, denoted by the "pp" unit.
Additionally, the winning variation displays its results in terms of approximate weighted improvement. This appears just below the absolute improvement and is provided for continuity, so that customers accustomed to weighted improvement can develop a sense of how absolute improvement and weighted improvement compare to each other. Read more about weighted improvement and Stats Accelerator.
Because traffic distribution is updated frequently, Feature Experimentation and Full Stack (Legacy) customers should implement sticky bucketing to avoid exposing the same visitor to multiple variations. To do this, implement a user profile service.
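As a rough illustration, a user profile service is an object that remembers which variation each user was bucketed into. The sketch below is a minimal in-memory version in Python. The `lookup` and `save` method names and the profile shape (`user_id` plus an `experiment_bucket_map`) follow the convention used by the Optimizely SDKs as far as we know, but treat them as assumptions and confirm them against the SDK documentation for your language. The IDs are made up, and a production service would persist profiles in a shared store rather than in process memory.

```python
class InMemoryUserProfileService:
    """Toy user profile service that keeps bucketing decisions in a dict."""

    def __init__(self):
        self._profiles = {}

    def lookup(self, user_id):
        # Return the stored profile for this user, or None if there is none.
        return self._profiles.get(user_id)

    def save(self, user_profile):
        # Assumed profile shape:
        # {"user_id": ..., "experiment_bucket_map": {exp_id: {"variation_id": ...}}}
        self._profiles[user_profile["user_id"]] = user_profile


service = InMemoryUserProfileService()
service.save({
    "user_id": "user_123",  # hypothetical ID
    "experiment_bucket_map": {"exp_1": {"variation_id": "var_a"}},
})

# Even if the traffic distribution changes later, the stored decision is reused.
print(service.lookup("user_123"))
```

You would then pass an instance of this class to the SDK client when you create it. The exact constructor argument depends on the SDK and is not shown here.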
Set up an experiment with Stats Accelerator as the traffic distribution mode
Create an experiment using Stats Accelerator in Web Experimentation
- From the Experiments window, click Create New.
- Select A/B Test from the drop-down menu.
- Give your experiment a name, description, and URL to target, just as you would with any Optimizely experiment. Then click Create Experiment.
- Create your variations in the Visual Editor. Experiments using Stats Accelerator need at least two variations and a baseline, for a total of three.
- Click Metrics and choose your primary metric. Your experiment will use the primary metric to determine how traffic is distributed across variations.
- Click Traffic Allocation. Under Variation Traffic Distribution, click the Distribution Mode drop-down list and select Stats Accelerator.
- Click Start Experiment to launch your experiment.
Create an experiment using Stats Accelerator in Feature Experimentation
- Follow the instructions on how to run an A/B test in the developer documentation.
- Select Stats Accelerator for the flag rule's Distribution mode. Experiments using Stats Accelerator need at least two variations and a baseline, for a total of three.
Modify an experiment when Stats Accelerator is enabled
You can modify an experiment if you have Stats Accelerator enabled. However, there are some limitations you should be aware of.
Before starting your experiment, you can add or delete variations for Web Experimentation, Feature Experimentation, and Full Stack Experimentation (Legacy) experiments as long as you still have at least three variations.
You can also add or delete sections or section variations for multivariate tests, provided that you still have the minimum number of variations required by the algorithm you use.
Due to the possibility of affecting your experiment results, Optimizely highly discourages you from modifying a running experiment. See the importance of maintaining experiment consistency for information.
After you start your experiment, you can add, stop, or pause regular variations in Optimizely Web Experimentation, Feature Experimentation, and Full Stack Experimentation (Legacy) experiments. For a multivariate test, however, you can only add or delete sections. You cannot add or delete section variations after the experiment has begun. Refer to the Multivariate tests for Optimizely or Section rollups in multivariate tests documentation for information about variable sections.
Weighted improvement
Stats Accelerator relies on dynamic traffic allocation to achieve its results. Anytime you allocate traffic dynamically over time, you risk introducing bias into your results. Left uncorrected, this bias can significantly impact your reported results.
Stats Accelerator neutralizes this bias through a technique called weighted improvement.
Do not confuse weighted improvement and absolute improvement.
Weighted improvement helps Stats Accelerator counterbalance bias created while allocating traffic to different variations, while absolute improvement is shown on the Optimizely Experimentation Results page.
Weighted improvement is designed to estimate the true lift as accurately as possible by breaking down the duration of an experiment into much shorter segments called epochs. These epochs cover periods of constant allocation. In other words, traffic allocation between variations does not change for the duration of each epoch.
Results are calculated for each epoch, which has the effect of minimizing the bias in each epoch. At the end of the experiment, these results are used to calculate the estimated true lift, filtering out the bias that would have otherwise been present.
You can further examine this by looking at the two graphs below:
- The first chart shows conversion rates for two variations when traffic allocation is static. In this example, conversions for every variation begin to decline after 5,000 visitors have seen each. And while there is plenty of fluctuation in conversion rates, the gap between the winning and losing variations never strays far from the true lift.
The steady decline in the observed conversion rates shown above is caused by the sudden, one-time shift in the true conversion rates when the experiment has 10,000 visitors.
- The following chart shows what happens when traffic is dynamically allocated instead, with 90 percent of traffic directed to the winning variation after 5,000 visitors have seen each variation. Here, the winning variation shows the same decline in conversion rates as it did in the previous example. However, fewer visitors have seen the losing variation, so its conversion rates are slower to change.
This gives the impression that the difference between the two variations is much less than it truly is.
This situation is known as Simpson's Paradox, and it is especially hazardous when the true lift is relatively small. In those cases, it can even cause the sign on your results to flip, essentially reporting winning variations as losers and vice versa.
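The sign flip can be reproduced with a small worked example. The visitor and conversion counts below are invented for illustration and loosely follow the scenario described above: a 50/50 split for the first 10,000 visitors, then a drop in the true conversion rates together with a shift to 90 percent of traffic on the winning variation. Within each period the variation beats the baseline by 2 pp, yet the pooled numbers say otherwise.

```python
# Each tuple: (baseline_visitors, baseline_conversions,
#              variation_visitors, variation_conversions)
epochs = [
    (5000, 500, 5000, 600),  # 50/50 split, rates of 10% vs 12%
    (1000, 50, 9000, 630),   # 10/90 split after rates drop to 5% vs 7%
]

# Lift measured separately inside each period of constant allocation.
per_epoch_lift = [vc / vn - bc / bn for bn, bc, vn, vc in epochs]

# Lift measured naively on the pooled totals.
base_n = sum(e[0] for e in epochs)
base_c = sum(e[1] for e in epochs)
var_n = sum(e[2] for e in epochs)
var_c = sum(e[3] for e in epochs)
pooled_lift = var_c / var_n - base_c / base_n

for i, lift in enumerate(per_epoch_lift, start=1):
    print(f"epoch {i}: {lift * 100:+.2f} pp")
print(f"pooled:  {pooled_lift * 100:+.2f} pp")
```

The pooled lift comes out slightly negative (roughly -0.4 pp) even though the variation wins in both periods. The variation received most of its traffic while rates were low, which drags its pooled conversion rate below the baseline's.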
Stats Accelerator example
For example, suppose a customer runs Stats Accelerator and is looking for the best variation.
At time t = t0, the results are:
Figure 1. An illustration of what Stats Accelerator observes before updating the test's allocation at time t0. The dashed line indicates where improvement is 0. The circle is the observed improvement for each variation, and the line represents its confidence interval. Anything above the dashed line means the improvement is positive.
Stats Accelerator allocates more traffic to Var1 than to Var2 and Var3 because it looks like Var1 can reach statistical significance with fewer samples than the others. Its confidence interval, whose upper bound is the furthest from 0, suggests this.
At time t = t0 + 1, the next time Optimizely runs the algorithm, Optimizely observes the following results:
Figure 2. Observation after Optimizely runs Stats Accelerator for one iteration.
Var1 reaches statistical significance, so Optimizely no longer considers it; Optimizely allocates visitors to the remaining inconclusive variations to identify the next best statistically significant variation when possible.
Technical FAQ
How does Stats Accelerator work with Stats Engine?
Stats Engine will continue to decide when a variation has a statistically significant difference from the control, just as it always has. However, because some differences are easier to spot than others, each variation requires a different number of samples to reach significance.
Stats Accelerator decides in real time how many samples each variation should be allocated, so that you get the same statistically significant results as standard A/B/n testing but in less time. These algorithms are only compatible with always-valid p-values, such as those used in Stats Engine, which hold at all sample sizes and support continuous peeking or monitoring. You can use the results page for Stats Accelerator-enabled experiments as you would for any other experiment.
What algorithms or frameworks does Stats Accelerator support?
Optimizely Experimentation draws from the research area of multi-armed bandits. Specifically, for pure-exploration tasks, such as discovering all variants that have statistically significant differences from the control, the algorithms are based on the popular upper confidence bound heuristic, which is optimal for such tasks (Jamieson, Malloy, Nowak, Bubeck 2014).
Can I use my own algorithm?
You can programmatically adjust Traffic Allocation weights as needed using the REST API. Optimizely's out-of-the-box Stats Accelerator feature was finely tuned based on millions of historical data points and state-of-the-art work on bandits and adaptive sampling.
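If you do compute your own allocation, you will typically need to turn fractional traffic shares into the integer weights an API expects. The helper below is a sketch under one assumption that you should verify against the REST API reference: that variation weights are integers that must sum to 10,000 (basis points). The function name and the variation keys are hypothetical, and no API call is made here.

```python
def to_integer_weights(shares, total=10_000):
    """Convert fractional shares to integers summing exactly to `total`.

    Uses largest-remainder rounding so that no traffic is lost to truncation.
    The default total of 10,000 (basis points) is an assumption.
    """
    share_sum = sum(shares.values())
    if share_sum <= 0:
        raise ValueError("shares must sum to a positive number")
    raw = {key: value / share_sum * total for key, value in shares.items()}
    weights = {key: int(value) for key, value in raw.items()}
    leftover = total - sum(weights.values())
    # Hand the leftover units to the entries that lost the most to truncation.
    by_remainder = sorted(raw, key=lambda key: raw[key] - weights[key], reverse=True)
    for key in by_remainder[:leftover]:
        weights[key] += 1
    return weights


# Shares produced by your own algorithm (hypothetical variation keys).
print(to_integer_weights({"baseline": 1, "var_1": 1, "var_2": 1}))
```

Rounding each share independently can leave the weights summing to 9,999 or 10,001. The largest-remainder step avoids that.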
How much time will I save with Stats Accelerator?
When using Stats Accelerator, users typically achieve statistical significance two to three times faster than with standard A/B/n testing. Alternatively, with the same amount of traffic, you can reach significance on two to three times as many variants simultaneously as you could with standard A/B/n testing.
How often does Stats Accelerator make a decision?
The model that drives Stats Accelerator is updated hourly. Even for Optimizely Experimentation users with extremely high traffic, this is more than sufficient to get the maximum benefits of a dynamic, adaptive allocation. File a support ticket if you require more or less frequent model updates.
What happens if I change the baseline on the results page?
Selecting another baseline has no adverse impact, but the numbers may be challenging to interpret. You should keep the original baseline when you analyze the results data.
What happens if I change my primary metric?
The Stats Accelerator scheme reacts and adapts to the primary metric. If you change the primary metric mid-experiment, the Stats Accelerator scheme will change its policy to optimize that metric. For this reason, you should not modify the primary metric after you begin the experiment or campaign. See Why you should not change a running experiment.
What happens when I pause or stop a variation?
You should refrain from doing this. Although Stats Accelerator will ignore those variations' results data when adjusting traffic distribution among the remaining live variations, the paused or stopped variation will continue to record conversion events because of delayed conversions.
Weighted improvement is a weighted sum of the observed effect size within epochs. In the periods after the variation has been stopped, delayed conversion events still arrive but no new decisions do, so those periods show an inflated effect size. The result is a skewed and misleading weighted improvement.
If you believe that a variation is underperforming, you should let Stats Accelerator determine this, after which it will minimize traffic to this variation (because it has reached statistical significance) so it can funnel the remaining traffic to the other variations. Otherwise, create a new A/B test with the variation removed.
How does Stats Accelerator handle revenue and numeric metrics?
For numeric metrics like revenue, the number of parameters needed to fully describe the distribution may be unbounded. Optimizely uses robust estimators for the first few moments (for example, the mean, variance, and skew) to construct confidence bounds that are used just like those for binary metrics.
How does Stats Accelerator work with Personalization?
In Optimizely Personalization, the Stats Accelerator option can be found in the settings for an individual experience. Stats Accelerator will automatically adjust traffic distribution between variations within campaign experiences. This will not affect the holdback. You should increase your holdback to a uniform distribution level to maximize benefit. For example, if you have three variations and a holdback, consider a 25% holdback.
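The 25% figure follows from giving the holdback the same share as each variation. A one-line sketch of that arithmetic, where the function name is ours rather than part of any Optimizely API:

```python
def uniform_holdback(num_variations):
    """Holdback share when the holdback and each variation get equal traffic."""
    if num_variations < 1:
        raise ValueError("need at least one variation")
    return 1 / (num_variations + 1)


# Three variations plus a holdback: four equal groups.
print(f"{uniform_holdback(3):.0%}")
```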
What is the mathematical difference between Stats Accelerator and Multi-Armed Bandit?
See Stats Accelerator versus multi-armed bandit optimization.
Can you run a "blended experiment" where you start the experiment with Stats Engine and switch to Stats Accelerator mid-experiment?
No, do not combine A/B and Bandit methods in the same experiment. Do not change your audience distribution model mid-experiment. Combining A/B and Bandit methods ad-hoc will introduce severe bias and destroy the reliability and interpretability of your results for business decisions.
For example, suppose Stats Accelerator (or any other bandit model) starts funneling traffic to variation A. There would be a disproportionate number of conversions on variation B because some of the conversions are from people assigned in the test's initial A/B stage (that is, Stats Engine). This bias would undermine the validity and interpretability of the results.
Additionally, splicing different audience models into the same experiment ad hoc may lead to an instance of the peeking problem. Essentially, the experimenter would manually and arbitrarily interrogate the experiment midway through, vastly inflating the chance of a false positive result. A false positive result occurs when a conclusive result is reported between two variations when there is no underlying behavior difference between them.
See Why you should not change a running experiment.
How does Stats Accelerator handle conversion rates that change over time and Simpson's Paradox?
Time variation occurs when a metric's conversion rate changes over time, that is, when the underlying distribution of the metric value depends on time. Stats Engine assumes that the data are identically distributed over time.
Time variation is caused by changes in the underlying conditions that affect visitor behavior. Examples include more purchasing visitors on weekends, an aggressive discount that yields more customer purchases, or a marketing campaign in an unexplored market that brings in many visitors with different interaction behavior than existing visitors.
Optimizely Experimentation assumes identically distributed data because this assumption enables continuous monitoring and faster learning (see Stats Engine for details). However, Stats Engine has a built-in mechanism to detect violations of this assumption. When a violation is detected, Stats Engine updates the statistical significance calculations. This is called a "stats reset."
Time variation affects experiments using Stats Accelerator because the algorithms adjust the percentage of traffic exposed to each variation during the experiment. This can introduce bias in the estimated improvement, known as Simpson's Paradox. The result is that stats resets may be much more likely to occur. See the weighted improvement section for information.
The solution is to change the way the improvement number is calculated. Specifically, Optimizely Experimentation compares the conversion rates of the baseline and the variations within each interval between traffic allocation changes. Then, Optimizely computes statistics using weighted averages across these time intervals. For example, the difference in observed conversion rates is scaled by the number of visitors in each interval to estimate the true difference in conversion rates. This estimate is reported as the weighted improvement.
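A minimal sketch of that calculation is shown below. It assumes each interval is weighted by its total visitor count, which is one straightforward reading of the description above. The exact weights Optimizely uses are not specified here, and the `Epoch` class and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Epoch:
    """Counts observed during one interval of constant traffic allocation."""
    base_visitors: int
    base_conversions: int
    var_visitors: int
    var_conversions: int

    def lift(self):
        # Difference in observed conversion rates inside this interval.
        return (self.var_conversions / self.var_visitors
                - self.base_conversions / self.base_visitors)

    def visitors(self):
        return self.base_visitors + self.var_visitors


def weighted_improvement(epochs):
    """Visitor-weighted average of the per-interval lifts."""
    total = sum(e.visitors() for e in epochs)
    return sum(e.lift() * e.visitors() for e in epochs) / total


epochs = [
    Epoch(2000, 200, 2000, 240),  # lift of 2 pp over 4,000 visitors
    Epoch(1000, 80, 5000, 550),   # lift of 3 pp over 6,000 visitors
]
print(f"{weighted_improvement(epochs) * 100:.1f} pp")
```

With these numbers the result is 2.6 pp: the second interval has more visitors, so its 3 pp lift counts for more than the first interval's 2 pp.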
Why is Stats Accelerator routing traffic away from well-performing variations and towards poorly performing variations?
Stats Accelerator is looking to minimize time, not regret. That means it routes traffic towards variations that appear the most distinct from the baseline, regardless of whether that performance is positive or negative.
This lets Stats Accelerator achieve statistical significance on more variations more quickly by routing traffic to where insight is promising but statistical significance has not yet been reached.
Additional resources
- When to use Stats Accelerator or a multi-armed bandit optimization
- FDR Control with Adaptive Sequential Experimental Design
- Peeking at A/B Tests -- Why it matters, and what to do about it
- A Bandit Approach to Multiple Testing with False Discovery Control
- Always Valid Inference: Continuous Monitoring of A/B Tests