Table of Contents
- Understand Optimizely's Stats Accelerator and how it affects your results
- Determine whether to use Stats Accelerator for your experiments
- Enable Stats Accelerator for your account
If you run a lot of experiments, you may face two challenges.
- Data collection is costly, and time spent experimenting means you have less time to exploit the value of the eventual winner.
- Creating more than one or two variations can delay statistical significance longer than you might like.
Stats Accelerator helps you algorithmically capture more value from your experiments by reducing the time to statistical significance, so you spend less time waiting for results.
It monitors ongoing experiments and uses machine learning to adjust traffic distribution among variations. In other words, it shows more visitors the variations that have a better chance of reaching statistical significance. The process attempts to discover as many significant variations as possible.
Specifically, Stats Accelerator does the following:
- Identifies the variation showing the most significant difference from the baseline
- Pushes more traffic to that variation
- Removes that variation from consideration once it reaches statistical significance
- Redistributes traffic among the remaining variations and repeats the cycle
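The cycle above can be sketched in code. This is an illustrative sketch only, not Optimizely's actual implementation: the significance scores, the 90% threshold, and the boost fraction are all invented for illustration.

```python
def reallocate(variations, significance, threshold=0.90):
    """One illustrative Stats Accelerator-style reallocation pass.

    variations: list of variation names still in the experiment
    significance: dict mapping variation -> current statistical
        significance (0.0 to 1.0), assumed to come from the stats backend
    Returns a dict mapping variation -> traffic share for the next period.
    """
    # Take variations that have already reached significance out of play.
    active = [v for v in variations if significance.get(v, 0.0) < threshold]
    if not active:
        return {}  # every variation has a conclusive result

    # Push extra traffic to the variation closest to significance,
    # splitting the remainder evenly among all active variations.
    leader = max(active, key=lambda v: significance.get(v, 0.0))
    boost = 0.5  # invented boost fraction, for illustration only
    share = (1.0 - boost) / len(active)
    allocation = {v: share for v in active}
    allocation[leader] += boost
    return allocation
```

With three variations where one has already reached significance, the function drops the finished variation and favors the one closest to a conclusive result.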
For multivariate tests, you can only use Stats Accelerator in partial factorial mode. Once Stats Accelerator is enabled, you cannot switch directly from partial factorial to full factorial mode; set your distribution mode to Manual first if you want to use full factorial mode.
Stats Accelerator results reporting
Stats Accelerator adjusts the percentage of visitors who see each variation. This means visitor counts will reflect the distribution decisions of the Stats Accelerator.
Stats Accelerator experiments and campaigns report absolute improvement, expressed in percentage points and denoted by the "pp" unit.
Additionally, the winning variation displays its results as an approximate weighted improvement, shown just below the absolute improvement (in this example, the weighted improvement is -12.15%). This is provided for continuity, so customers accustomed to weighted improvement can develop a sense of how absolute and weighted improvement compare. Read more about weighted improvement and Stats Accelerator.
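For a quick sense of the two units, the arithmetic below uses made-up conversion rates: absolute improvement is a difference in percentage points, while a relative (percentage) improvement is scaled by the baseline rate.

```python
baseline_rate = 0.10   # 10% conversion on the original (invented number)
variation_rate = 0.12  # 12% conversion on the variation (invented number)

# Absolute improvement, in percentage points (pp):
absolute_pp = (variation_rate - baseline_rate) * 100

# Relative improvement, in percent of the baseline rate:
relative_pct = (variation_rate - baseline_rate) / baseline_rate * 100
```

Here the same result reads as a 2 pp absolute improvement but a 20% relative improvement, which is why the unit matters when comparing numbers on the Results page.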
Because traffic distribution will be updated frequently, Full Stack customers should implement sticky bucketing to avoid exposing the same visitor to multiple variations. To do this, implement the user profile service.
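A minimal in-memory sketch of a user profile service is shown below. The Optimizely SDKs expect an object exposing `lookup` and `save` methods; verify the exact contract and the client constructor parameter (`user_profile_service` in the Python SDK) against your SDK's documentation. An in-memory dict is for illustration only: production code should use a durable, shared store so bucketing survives restarts.

```python
class InMemoryUserProfileService:
    """Illustrative sticky-bucketing store (not production-ready).

    Keeps each visitor in the variation they first saw, even as
    Stats Accelerator changes the traffic distribution over time.
    """

    def __init__(self):
        self._profiles = {}

    def lookup(self, user_id):
        # Return the saved profile, or None if this user is new.
        return self._profiles.get(user_id)

    def save(self, user_profile):
        # user_profile is a dict containing at least "user_id" plus the
        # per-experiment bucketing decisions made by the SDK.
        self._profiles[user_profile["user_id"]] = user_profile
```

You would then pass an instance of this class to the SDK client when constructing it, so repeat visits reuse the stored bucketing decision instead of re-bucketing under the new traffic split.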
Modify an experiment when Stats Accelerator is enabled
You can modify an experiment if you have Stats Accelerator enabled. However, there are some limitations you should be aware of.
Before starting your experiment, you can add or delete variations for Web, Personalization and Full Stack experiments as long as you still have at least three variations.
You can also add or delete sections or section variations for multivariate tests, provided that you still have the minimum number of variations required by the algorithm you use.
After you start your experiment, you can add, stop or pause regular variations in Web, Personalization and Full Stack experiments. However, you can only add or delete sections for a multivariate test. You cannot add or delete section variations once the experiment has begun. Refer to the Multivariate tests for Optimizely or Section rollups in multivariate tests documentation for more information about variable sections.
When Stats Accelerator is enabled for a test, it will periodically re-publish the Optimizely snippet so that the variation traffic distribution changes can go live. This is the same as a regular publish. When this happens, any pre-existing unpublished changes are published as well.
Stats Accelerator relies on dynamic traffic allocation to achieve its results. Anytime you allocate traffic dynamically over time, you risk introducing bias into your results. Left uncorrected, this bias can significantly impact your reported results.
Stats Accelerator neutralizes this bias through a technique called weighted improvement.
Do not confuse weighted improvement and absolute improvement.
Weighted improvement helps Stats Accelerator counterbalance bias created while allocating traffic to different variations, while absolute improvement is what is shown on the Optimizely Results page.
Weighted improvement is designed to estimate the true lift as accurately as possible by breaking down the duration of an experiment into much shorter segments called epochs. These epochs cover periods of constant allocation: in other words, traffic allocation between variations does not change for the duration of each epoch.
Results are calculated for each epoch, which has the effect of minimizing the bias in each individual epoch. At the end of the experiment, these results are all used to calculate the estimated true lift, filtering out the bias that would have otherwise been present.
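The epoch idea can be sketched numerically. This toy function illustrates the approach, not Optimizely's exact estimator: it computes the observed lift within each epoch of constant allocation, then averages those lifts weighted by each epoch's visitor count.

```python
def weighted_improvement(epochs):
    """Estimate lift as a visitor-weighted average of per-epoch lifts.

    epochs: list of dicts, one per period of constant traffic allocation:
      {"baseline_visitors": ..., "baseline_conversions": ...,
       "variation_visitors": ..., "variation_conversions": ...}
    """
    total_weight = 0
    weighted_sum = 0.0
    for e in epochs:
        base_rate = e["baseline_conversions"] / e["baseline_visitors"]
        var_rate = e["variation_conversions"] / e["variation_visitors"]
        # Comparing rates WITHIN an epoch avoids mixing periods that had
        # different traffic splits, which is where the bias comes from.
        weight = e["baseline_visitors"] + e["variation_visitors"]
        weighted_sum += (var_rate - base_rate) * weight
        total_weight += weight
    return weighted_sum / total_weight
```

With invented numbers where the variation truly converts 2 points better in every epoch, this estimator recovers the 2-point lift even when the second epoch sends most traffic to one arm, whereas naively pooling all visitors would not.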
We can further examine this by looking at the two graphs below:
- The first chart shows conversion rates for two variations when traffic allocation is static. In this example, conversions for both variations begin to decline after 5,000 visitors have seen each. And while we see plenty of fluctuation in conversion rates, the gap between the winning and losing variations never strays far from the true lift.
The steady decline in the observed conversion rates shown above is caused by the sudden, one-time shift in the true conversion rates when the experiment has 10,000 visitors.
- The following chart shows what happens when traffic is dynamically allocated instead, with 90 percent of all traffic directed to the winning variation after 5,000 visitors have seen each variation. Here, the winning variation shows the same decline in conversion rates as it did in the previous example. However, fewer visitors have seen the losing variation, so its conversion rates are slower to change.
This gives the impression that the difference between the two variations is much less than it truly is.
This situation is known as Simpson's Paradox, and it is hazardous when the true lift is relatively small. In those cases, it can even cause the sign on your results to flip, essentially reporting winning variations as losers and vice versa:
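A deterministic toy example (all numbers invented) makes the sign flip concrete. The variation truly converts 2 points better than the baseline in both epochs, but conversion rates drop for everyone in the second epoch, and the variation receives 90% of second-epoch traffic.

```python
# Each tuple: (visitors, conversions) for one epoch of constant allocation.
baseline = [(5000, 500), (1000, 50)]    # 10% conversion, then 5%
variation = [(5000, 600), (9000, 630)]  # 12% conversion, then 7%

def pooled_rate(epochs):
    visitors = sum(v for v, _ in epochs)
    conversions = sum(c for _, c in epochs)
    return conversions / visitors

# Naive pooling ignores the allocation change; the heavy epoch-2 traffic
# drags the variation's pooled rate down and flips the sign of the lift.
naive_lift = pooled_rate(variation) - pooled_rate(baseline)

# Comparing within each epoch recovers the true +2-point lift both times.
per_epoch_lifts = [
    (vc / vv) - (bc / bv)
    for (bv, bc), (vv, vc) in zip(baseline, variation)
]
```

Here the pooled comparison reports the winning variation as a loser, while the per-epoch comparison, which is what weighted improvement aggregates, shows the consistent 2-point advantage.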
How does Stats Accelerator work with Stats Engine?
Stats Accelerator decides in real time how many samples each variation should receive, aiming to reach the same statistically significant results as standard A/B/n testing in less time. These algorithms are only compatible with always-valid p-values, such as those used in Stats Engine, which remain valid at any sample size and support continuous peeking and monitoring. You can use the Results page for Stats Accelerator-enabled experiments just as you would for any other experiment.
What algorithms or frameworks does Stats Accelerator support?
Can I use my own algorithm?
How much time will I save with Stats Accelerator?
How often does Stats Accelerator make a decision?
What happens if I change the baseline on the Results page?
What happens if I change my primary metric?
What happens when I pause or stop a variation?
Weighted improvement is a weighted sum of the observed effect sizes within epochs. After a variation is stopped, later epochs still accumulate delayed conversion events for it but no new traffic decisions, which inflates the observed effect size and skews the weighted improvement in a misleading way.
If you believe a variation is underperforming, we recommend letting Stats Accelerator confirm it: once the variation reaches statistical significance, the algorithm minimizes its traffic and funnels the remainder to the other variations. Alternatively, create a new A/B test with that variation removed.
How does Stats Accelerator handle revenue and numeric metrics?
How does Stats Accelerator work with Personalization?
Stats Accelerator automatically adjusts traffic distribution among the variations within campaign experiences. It does not affect the holdback. To maximize the benefit, set your holdback to the share it would receive under uniform distribution. For example, with three variations plus a holdback, uniform distribution gives each group 25 percent, so consider a 25% holdback.
What is the mathematical difference between Stats Accelerator and Multi-Armed Bandit?
In simple terms, if your goal is to learn whether any variations are better or worse than the baseline and take actions that have a longer-term impact on your business based on this information, use Stats Accelerator. On the other hand, if you want to maximize conversions among these variations, choose Multi-Armed Bandit. For a complete explanation, see our documentation on Multi-armed bandits vs. Stats Accelerator: when to use each.
In traditional A/B/n testing, you define a control and several variants, and the experiment determines whether each variant performs better or worse than the control. Typically, such an experiment runs on a fraction of web traffic to estimate the potential benefit or detriment of using a particular variant instead of the control.
In a nutshell, use Stats Accelerator when you have a control or default and want to investigate alternative variants before committing to one and replacing the control. With Multi-Armed Bandit, the variants and the control (if one exists) are on equal footing. Instead of reaching statistical significance on the hypothesis that each variant differs from the control, Multi-Armed Bandit adapts the allocation toward the best-performing variant.
Can you run a "blended experiment" where you start the experiment with Stats Engine and switch to Stats Accelerator mid-experiment?
How does Stats Accelerator handle conversion rates that change over time and Simpson's Paradox?
Time variation is caused by changes in the underlying conditions that affect visitor behavior. Examples include more purchasing visitors on weekends; an aggressive new discount that yields more customer purchases; or a marketing campaign in a new market that brings in many visitors with different interaction behavior than existing visitors.
Optimizely assumes identically distributed data because this assumption enables continuous monitoring and faster learning (see the Stats Engine article for details). However, Stats Engine has a built-in mechanism to detect violations of this assumption. When a violation is detected, Stats Engine updates the statistical significance calculations. This is called a “stats reset.”
Time variation affects experiments using Stats Accelerator because the algorithms adjust the percentage of traffic exposed to each variation during the experiment. This can introduce bias in the estimated improvement, known as Simpson's Paradox. The result is that stats resets may be much more likely to occur. (See the weighted improvement section for more information.)
The solution is to change the way the improvement number is calculated. Specifically, Optimizely compares the baseline conversion rates and variation(s) within each interval between traffic allocation changes. Then, Optimizely computes statistics using weighted averages across these time intervals. For example, the difference in observed conversion rates is scaled by the number of visitors in each interval to estimate the true difference in conversion rates. This estimate is represented as a weighted improvement.
Why is Stats Accelerator routing traffic away from well-performing variations, or toward poorly performing ones?
Stats Accelerator routes traffic to where insight is promising but statistical significance has not yet been reached. Variations that have already reached significance, whether winners or losers, no longer need traffic, so redirecting visitors to undecided variations lets the experiment reach statistical significance on more variations more quickly.
- Multi-armed bandits vs. Stats Accelerator: when to use each
- FDR Control with Adaptive Sequential Experimental Design, a technical white paper on the mathematical foundation of Stats Accelerator
- Peeking at A/B Tests -- Why it matters, and what to do about it
- Stats Accelerator — The When, Why, and How
- Stats Accelerator – Acceleration Under Time-Varying Signals
- A Bandit Approach to Multiple Testing with False Discovery Control
- Always Valid Inference: Continuous Monitoring of A/B Tests