- Use confidence intervals and improvement intervals to analyze results
- Predict what behavior you should see from your results over time
Statistical significance tells you whether a variation is outperforming or underperforming the baseline at the confidence level you chose. Confidence intervals tell you the uncertainty around that improvement: Stats Engine provides a range of values where the conversion rate for a particular experience likely lies. The interval starts wide, and as Stats Engine collects more data, it narrows to reflect increasing certainty.
A variation is declared significant as soon as the confidence interval stops crossing zero. An interval that crosses zero means there is not yet enough evidence to say whether the variation has a clear impact.
Once a variation reaches statistical significance, the confidence interval always lies entirely above or below 0.
A winning variation will have a confidence interval that is entirely above 0%.
An inconclusive variation will have a confidence interval that includes 0%.
A losing variation will have a confidence interval that is entirely below 0%.
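The decision rules above can be sketched as a small function. This is an illustration of the logic, not an Optimizely API; the function name and the example intervals (taken from the examples later in this article) are only for demonstration:

```python
def classify_interval(lower: float, upper: float) -> str:
    """Classify a variation by where its improvement confidence
    interval sits relative to zero."""
    if lower > 0:
        return "winner"        # entire interval above 0%
    if upper < 0:
        return "loser"         # entire interval below 0%
    return "inconclusive"      # interval crosses 0%

print(classify_interval(22.01, 54.79))    # winner
print(classify_interval(-27.37, -12.34))  # loser
print(classify_interval(-1.35, 71.20))    # inconclusive
```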
Optimizely sets your confidence interval at the same level as your project's statistical significance threshold. For example, if you accept 90% significance to declare a winner, you also accept 90% confidence that the interval contains the true improvement.
Example: Winning interval
In the example shown above, you can say there is a 99% chance that the improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (+38.4%) may not be the exact improvement you see going forward.
In reality, if you implement that variation instead of the original, the relative improvement in conversion rate will probably be between 22.01% and 54.79% over the baseline conversion rate. Compared to a baseline conversion rate of 34.80%, you are likely to see your variation convert between 42.46% (34.80 + 34.80*0.2201) and 53.87% (34.80 + 34.80*0.5479).
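The arithmetic above can be reproduced in a few lines of Python. The function name and structure are illustrative, not part of any Optimizely tool:

```python
def projected_rate_range(baseline, rel_lower, rel_upper):
    """Translate a relative-improvement interval (in %) into the
    conversion-rate range you can expect for the variation."""
    return (baseline * (1 + rel_lower / 100),
            baseline * (1 + rel_upper / 100))

# Winning example: 34.80% baseline, interval of +22.01% to +54.79%
low, high = projected_rate_range(34.80, 22.01, 54.79)
print(f"{low:.2f}% to {high:.2f}%")  # 42.46% to 53.87%
```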
Although the observed statistical significance is 99%, there is a 90% chance that the actual results will fall within the confidence interval range. This is because the statistical significance setting for your project is 90%. The confidence interval's coverage level does not change as your variation's observed statistical significance changes; instead, you will see the interval become narrower as Optimizely collects more data.
Example: Losing interval
This example shows a losing variation, with the confidence interval entirely below 0.
In this example, you can say that there is a 99% chance that the negative improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (-19.85%) may not be the exact improvement you see going forward.
In reality, the difference in conversion rate will probably be between -27.37% and -12.34% under the baseline conversion rate if you implement the variation instead of the original. Compared to a baseline conversion rate of 62.90%, you're likely to see your variation convert between 45.68% (62.90 - 62.90*0.2737) and 55.14% (62.90 - 62.90*0.1234).
In this experiment, the variation's conversion rate (50.41%) was 12.49 percentage points below the original's (62.90%), a relative change of about -19.85%, which falls within the confidence interval. If you rerun this experiment, the difference between the baseline and variation conversion rates will probably fall in the same range.
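The losing example works the same way, with the interval bounds subtracted from the baseline. The code below is a sketch of that arithmetic, not an Optimizely API:

```python
def relative_improvement(baseline, variation):
    """Relative change of the variation over the baseline, in percent."""
    return (variation - baseline) / baseline * 100

# Projected conversion-rate range for the losing example:
# 62.90% baseline, interval of -27.37% to -12.34%
low = 62.90 * (1 - 27.37 / 100)
high = 62.90 * (1 - 12.34 / 100)
print(f"{low:.2f}% to {high:.2f}%")  # 45.68% to 55.14%

# Relative change between the observed rates (close to the -19.85%
# measured above; the small gap comes from rounding the displayed rates)
print(f"{relative_improvement(62.90, 50.41):.2f}%")
```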
Example: Inconclusive interval
If you must stop a test early or have a low sample size, the confidence interval will give you a rough idea of whether implementing that variation will have a positive or negative impact.
For this reason, when you see low statistical significance on specific goals, the confidence interval can serve as another data point to help you make decisions. When you have an inconclusive goal, the interval will look like this:
Here, we can say that this variation's relative improvement will be between -1.35% and 71.20%. In other words, the true impact could be either positive or negative.
When implementing this variation, you can say, "We are 90% confident that this variation is no more than 1.35% worse, and no more than 71.20% better, than the baseline," which lets you make a business decision about whether implementing that variation is worthwhile.
Another way you can interpret the confidence interval is as a worst case, middle ground, and best case scenario. For example, we are 90% confident that the worst case difference between variation and baseline conversion rates is -1.35%, the best case is 71.20%, and a middle ground is 34.93%.
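The worst case, middle ground, and best case reading can be computed directly from the interval endpoints. This is a sketch of the interpretation described above, using the example interval:

```python
# Inconclusive interval from the example above
lower, upper = -1.35, 71.20

worst_case = lower
best_case = upper
middle_ground = (lower + upper) / 2  # midpoint of the interval

print(f"worst {worst_case}%, middle {middle_ground:.2f}%, best {best_case}%")
```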
How statistical significance and confidence intervals are connected
Optimizely shows you the statistical likelihood that the improvement is due to changes you made on the page, not chance. Until Stats Engine has enough data to declare statistical significance, the Results page will state that more visitors are needed and show you an estimated wait time based on the current conversion rate.
Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.
Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the results, and the amount of traffic you receive.
Significance and confidence intervals are connected: your experiment reaches significance at precisely the same time your confidence interval on improvement moves away from zero.
Depending on what type of experiment or optimization you are running, Optimizely displays a different improvement interval on the Results page:
- For A/B experiments: Optimizely displays the relative improvement in conversion rate for the variation over the baseline as a percentage for most experiments. This is true for all A/B experiment metrics, regardless of whether they are binary conversions or numeric.
- In Optimizely, a relative improvement interval of 1% to 10% means that the variation sees between 1% and 10% improvement over baseline. For example, if the baseline conversion rate is 25%, you can expect the variation conversion rate to fall between 25.25% and 27.5%.
- For Multi-armed Bandit (MAB) optimizations: Optimizely displays the absolute improvement.
- For A/B experiments or MAB optimizations with Stats Accelerator enabled: Optimizely displays the absolute improvement.
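The distinction matters when you read the numbers. The sketch below contrasts the two interpretations using the hypothetical 25% baseline from the relative-improvement example above:

```python
baseline = 25.0  # baseline conversion rate, in %

# Relative improvement (A/B experiments): a +10% bound means
# 10% *of* the baseline rate.
relative_rate = baseline * (1 + 10 / 100)
print(f"{relative_rate:.2f}%")  # 27.50%

# Absolute improvement (MAB or Stats Accelerator): a +10 point bound
# means 10 percentage points *added to* the baseline rate.
absolute_rate = baseline + 10
print(f"{absolute_rate:.2f}%")  # 35.00%
```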
Estimated wait time and <1% significance
As your experiment or campaign runs, Optimizely estimates how long it will take for a test to reach a conclusive result.
This estimate is based on the current, observed baseline and variation conversion rates. If those rates change, the estimate will adjust automatically.
You may see a significance of less than 1%, with a certain number of "visitors remaining." What does this mean? In statistical terms, this experiment is underpowered: Optimizely needs to gather more evidence to determine whether the change you see is a true difference in visitor behaviors or chance.
Look at the variation in the example shown above. Optimizely needs approximately 11,283 more visitors to be exposed to that variation before it can decide whether there is a true difference in conversion rates between the variation and the original. Remember that the estimated 11,283 visitors assumes that the observed conversion rate does not fluctuate. If more visitors see the variation, but conversions decrease, your experiment will probably take more time, which means the "visitors remaining" estimate will increase. If conversions increase, Optimizely will need fewer visitors to be sure that the behavior change is real.
To learn more about the importance of sample size, see our article on how long to run a test.
Unlike many testing tools, Optimizely's Stats Engine uses a statistical approach that removes the need to decide on sample size and minimum detectable effect (MDE) before starting a test. You do not have to commit to large sample sizes ahead of time, and you can check on results whenever you want!
However, many optimization programs estimate how long tests take to run so they can build robust roadmaps. Use our sample size calculator to estimate the number of visitors you need for a test. To learn more about choosing a minimum detectable effect for the calculator, see using the minimum detectable effect to prioritize experiments.
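For roadmap planning, the classical fixed-horizon formula for a two-proportion test gives a rough per-variation estimate. Note that this is the standard textbook power calculation, not Stats Engine's sequential method, and the function below is an illustration, not Optimizely's calculator:

```python
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, mde_relative,
                              alpha=0.10, power=0.80):
    """Classical fixed-horizon sample size for a two-proportion test.

    baseline_rate: e.g. 0.25 for a 25% conversion rate
    mde_relative:  minimum detectable effect, relative (0.10 = 10% lift)
    alpha:         two-sided false positive rate (1 - significance level)
    power:         probability of detecting a true effect of size MDE
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Visitors needed per variation to detect a 10% relative lift
# from a 25% baseline, at 90% significance and 80% power
print(sample_size_per_variation(0.25, 0.10))
```

As the formula shows, halving the MDE roughly quadruples the required sample, which is why small expected effects take much longer to test.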