- Optimizely Web Experimentation
- Optimizely Performance Edge
- Optimizely Personalization
- Optimizely Feature Experimentation
- Optimizely Full Stack (Legacy)
Statistical significance measures how unusual your experiment results would be if there were actually no difference in performance between your variation and the baseline and the observed lift were due to random chance alone. In other words, it indicates how strongly the data contradicts the null hypothesis that there is no difference between the test variation and the control.
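For intuition, here is a minimal Python sketch of the same idea using a classic fixed-horizon two-proportion z-test. Optimizely's Stats Engine computes significance sequentially, so its reported numbers will differ; the function name and sample counts below are illustrative only.

```python
# A minimal sketch of the idea behind statistical significance, using a
# classic fixed-horizon two-proportion z-test. Optimizely's Stats Engine
# uses sequential (always-valid) statistics instead, so the numbers it
# reports will differ; this only illustrates the concept.
from math import sqrt, erf

def significance(baseline_conv, baseline_visitors, variation_conv, variation_visitors):
    """Return an approximate significance (1 - two-sided p-value) in percent."""
    p1 = baseline_conv / baseline_visitors
    p2 = variation_conv / variation_visitors
    pooled = (baseline_conv + variation_conv) / (baseline_visitors + variation_visitors)
    se = sqrt(pooled * (1 - pooled) * (1 / baseline_visitors + 1 / variation_visitors))
    z = (p2 - p1) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p-value
    return 100 * (1 - p_value)

# Example: 500/10,000 baseline conversions vs. 600/10,000 variation conversions
print(round(significance(500, 10_000, 600, 10_000), 1))
```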
Confidence intervals
A confidence interval is an estimated range of values that is likely, but not guaranteed, to contain the unknown true value for your audience if you replicated the experiment numerous times. Confidence intervals account for the size of the sample exposed to the experiment and the amount of variability (dispersion or noise) in that sample.
Confidence intervals contain:
- A point estimate (uplift or improvement) – A single value derived from your statistical model of choice.
- A margin of error – A range around the point estimate that indicates the amount of uncertainty in the sample's estimate of the population value.
You should report confidence intervals to supplement your statistical significance results, as they can offer information about the observed effect size of your experiment.
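As a rough illustration of those two ingredients, the sketch below computes a 90% interval for the absolute difference in conversion rates using a simple normal approximation. This is not how Stats Engine computes its adaptive intervals; the function name and inputs are hypothetical.

```python
# A minimal sketch of "point estimate ± margin of error" for the absolute
# difference in conversion rates, using a normal approximation at a fixed
# confidence level. Stats Engine's adaptive, sequential intervals are
# computed differently; this only shows the two ingredients named above.
from math import sqrt

def difference_interval(p_base, n_base, p_var, n_var, z=1.645):  # z ≈ 1.645 -> 90% level
    point_estimate = p_var - p_base                    # the observed improvement
    se = sqrt(p_base * (1 - p_base) / n_base + p_var * (1 - p_var) / n_var)
    margin_of_error = z * se                           # uncertainty around the estimate
    return point_estimate - margin_of_error, point_estimate, point_estimate + margin_of_error

low, mid, high = difference_interval(0.050, 10_000, 0.056, 10_000)
print(f"improvement {mid:+.3%}, interval [{low:+.3%}, {high:+.3%}]")
```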
Optimizely's confidence interval is adaptive. The Optimizely Experiment Results page shows the running intersection of all previous confidence intervals by tracking the smallest upper limit and the largest lower limit observed during the experiment's runtime. The largest lower limit is the least negative value, the one closest to zero. This technique provides an optimal tradeoff between bias and variance so you can feel confident about the result.
Optimizely sets your confidence interval to the same level that you set your statistical significance threshold for the project. By default, the statistical significance setting for your project is 90%.
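The running-intersection idea described above can be sketched in a few lines: as each new interval estimate arrives, keep the largest lower limit and the smallest upper limit seen so far. The interval sequence below is made up for illustration and does not reflect Optimizely's implementation.

```python
# Illustrative sketch of the running intersection described above: as new
# interval estimates arrive, keep the largest lower limit and the smallest
# upper limit seen so far. (Not Optimizely's implementation; the interval
# sequence here is made up.)
def running_intersection(intervals):
    lower, upper = float("-inf"), float("inf")
    for lo, hi in intervals:
        lower = max(lower, lo)   # largest lower limit so far
        upper = min(upper, hi)   # smallest upper limit so far
        yield lower, upper

observed = [(-0.10, 0.30), (-0.05, 0.28), (-0.08, 0.22), (0.01, 0.20)]
for lo, hi in running_intersection(observed):
    print(f"[{lo:+.2f}, {hi:+.2f}]")
```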
When Optimizely declares significance
A variation is declared significant when its confidence interval stops crossing zero. Intervals that cross zero mean there is not enough evidence to say whether there is a clear impact.
When a variation reaches statistical significance, its confidence interval consistently lies entirely above or entirely below 0 (see the sketch after the following list):
- Winning variation – The confidence interval is entirely above 0%.
- Inconclusive variation – The confidence interval includes 0%.
- Losing variation – The confidence interval is entirely below 0%.
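A short sketch of this three-way classification, using the interval endpoints from the examples later in this article; the helper function is purely illustrative.

```python
# A small sketch of the three cases above: a variation is called winning or
# losing only when its confidence interval no longer crosses zero. The
# intervals below come from the examples discussed later in this article.
def classify(lower, upper):
    if lower > 0:
        return "winning"       # interval entirely above 0%
    if upper < 0:
        return "losing"        # interval entirely below 0%
    return "inconclusive"      # interval includes 0%

print(classify(0.7757, 1.0521))   # +77.57% to +105.21% -> winning
print(classify(-0.2019, 0.2213))  # -20.19% to +22.13%  -> inconclusive
print(classify(-0.1555, -0.0919)) # -15.55% to -9.19%   -> losing
```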
Confidence interval entirely above zero percent
In the preceding screenshot, enough evidence has accumulated so far that it is highly unlikely that the improvement Optimizely observed here is due to random chance alone. But the improvement Optimizely measured (+89.9%) may be different from the exact improvement you see going forward. The confidence interval indicates that this test variation has a positive impact in the long run. For this experiment iteration, the error bounds were between 77.57% and 105.21% improvement.
The statistical significance setting for this example is 90%. As Optimizely collects more data, the confidence interval may narrow.
Confidence interval includes zero percent
If you must stop a test early or have a low sample size, the confidence interval gives you an idea of whether implementing that variation has a positive or negative impact.
When you see low statistical significance on specific variations, the confidence interval can serve as another data point to help you make decisions. When you have an inconclusive variation, the interval looks like this:
This variation's improvement over the baseline is between -20.19% and 22.13%. You can interpret the confidence interval as a worst-case, middle-ground, and best-case scenario. Optimizely is 90% confident that the difference between the variation and baseline conversion rates is:
- Worst-case – -20.19%
- Middle-ground – 0.69%
- Best-case – 22.13%
Confidence interval entirely below zero percent
In this example, enough evidence has accumulated so far that it is highly unlikely that the negative improvement Optimizely observed here is due to random chance alone. The negative improvement Optimizely measured (-15.3%) is likely to be what you see going forward. The confidence interval indicates that this test variation has a negative impact in the long run. For this experiment iteration, the error bounds were between -15.55% and -9.19% improvement. If this experiment were rerun, the difference between the baseline and variation conversion rates would probably fall in the same range.
How statistical significance and confidence intervals are connected
Optimizely shows you the statistical likelihood that the improvement is due to changes you made on the page, not chance. Until Stats Engine has enough data to declare statistical significance, the Experiment Results page states more visitors are needed and shows you an estimated wait time based on the current conversion rate.
Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.
Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you receive.
Improvement intervals
Optimizely displays a different improvement interval on the results page depending on what type of experiment or optimization you are running:
- For A/B tests – Optimizely displays the relative improvement in conversion rate for the variation over the baseline as a percentage. This is true for all A/B test metrics, regardless of whether they are binary or numeric conversions.
- In Optimizely Experimentation, a relative improvement interval of 1% to 10% means that the variation sees between 1% and 10% improvement over the baseline. For example, if the baseline conversion rate is 25%, you can expect the variation conversion rate to fall between 25.25% and 27.5% (the arithmetic is worked through in the sketch after this list).
- For multi-armed Bandit (MAB) optimizations – Optimizely displays absolute improvement.
- For A/B tests or MAB optimizations with Stats Accelerator enabled – Optimizely displays absolute improvement.
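The sketch below works through the relative-to-absolute conversion from the A/B-test example above (25% baseline, 1% to 10% relative improvement); the helper name is illustrative only.

```python
# A minimal sketch of the arithmetic in the A/B-test example above: mapping a
# relative improvement interval onto absolute conversion rates. The helper
# name is just illustrative.
def relative_to_absolute(baseline_rate, rel_low, rel_high):
    return baseline_rate * (1 + rel_low), baseline_rate * (1 + rel_high)

low, high = relative_to_absolute(0.25, 0.01, 0.10)  # 25% baseline, +1% to +10% relative lift
print(f"expected variation rate between {low:.2%} and {high:.2%}")  # 25.25% and 27.50%
```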
Estimated wait time and <1% significance
As your experiment or campaign runs, Optimizely estimates how long it will take for a test to reach conclusiveness. This estimate is based on the current observed baseline and variation conversion rates. If those rates change, the estimate adjusts automatically.
You may see a significance of less than 1%, with a certain number of visitors remaining. In statistical terms, this experiment is underpowered. Optimizely needs to gather more evidence to determine whether the change you see is a true difference in visitor behaviors or chance.
Look at the variation in the preceding example. More than 100,000 additional visitors must be bucketed into that variation before Optimizely can decide whether the conversion rates of the variation and the original truly differ. Remember that this estimate assumes the observed conversion rate does not fluctuate. If more visitors see the variation but conversions decrease, your experiment will probably take more time, and the visitors-remaining estimate will increase. If conversions increase, Optimizely needs fewer visitors to confirm that the behavior change is real.
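As rough intuition for why small differences require large samples, here is a classic fixed-horizon sample-size formula for comparing two conversion rates. Optimizely's wait-time estimate comes from its sequential Stats Engine and will not match these numbers; the rates and power level below are assumptions for illustration.

```python
# A rough, fixed-horizon sketch of why small observed differences need many
# more visitors: the classic per-variation sample-size formula for comparing
# two conversion rates. Optimizely's own wait-time estimate is based on its
# sequential Stats Engine and will differ; this only shows the intuition.
from math import sqrt, ceil

def visitors_per_variation(p_base, p_var, z_alpha=1.645, z_power=0.84):
    """Approximate visitors per variation for a 90% significance level and 80% power."""
    p_bar = (p_base + p_var) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_base * (1 - p_base) + p_var * (1 - p_var))) ** 2
    return ceil(numerator / (p_var - p_base) ** 2)

# The smaller the observed difference, the larger the required sample:
print(visitors_per_variation(0.050, 0.060))  # ~6,400 visitors per variation
print(visitors_per_variation(0.050, 0.052))  # ~150,000 visitors per variation
```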
To learn more about the importance of sample size, see How long to run an experiment.