Confidence intervals and improvement intervals

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Web Personalization
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Statistical significance measures how unusual your experiment results would be if there were no real difference in performance between your variation and the baseline, and the observed lift were due to random chance alone. It indicates how strongly the data contradicts the baseline hypothesis that the test variation and the control perform the same.

Confidence intervals

A confidence interval is an estimated range of values that is likely, but not guaranteed, to include the unknown true value that describes your audience. If you repeated the experiment many times, most of the intervals computed this way would contain that value. Confidence intervals account for the size of the sample exposed to the experiment and the amount of variability (dispersion or noise) in that sample.

Confidence intervals contain: 

  • A point estimate (uplift or improvement) – A single value derived from your statistical model of choice.
  • A margin of error – A range around the point estimate that indicates how much uncertainty there is in estimating the population value from the sample.

You should report confidence intervals to supplement your statistical significance results, as they can offer information about the observed effect size of your experiment. 

Optimizely Experimentation's confidence interval is adaptive. The Optimizely Experiment Results page shows the running intersection of all previous confidence intervals by tracking the smallest upper limit and the largest lower limit seen during the experiment runtime. When the lower limits are negative, the largest lower limit is the least negative value, that is, the one closest to zero. This technique balances the tradeoff between bias and variance in the estimate so you can feel confident about the result.
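The running intersection described above can be illustrated with a short Python sketch. This is not Stats Engine code: the function name `running_intersection` and the sample intervals are hypothetical, and the sketch assumes each new interval still overlaps the intersection so far (it does not handle an empty intersection).

```python
from typing import Iterable, List, Tuple

# (lower, upper) limits of an improvement interval, in percent.
Interval = Tuple[float, float]


def running_intersection(intervals: Iterable[Interval]) -> List[Interval]:
    """Return the intersected interval after each new interval arrives."""
    largest_lower = float("-inf")  # largest lower limit seen so far
    smallest_upper = float("inf")  # smallest upper limit seen so far
    displayed = []
    for lower, upper in intervals:
        largest_lower = max(largest_lower, lower)
        smallest_upper = min(smallest_upper, upper)
        displayed.append((largest_lower, smallest_upper))
    return displayed


# Hypothetical intervals computed at four points during an experiment.
history = [(-12.0, 30.0), (-8.0, 34.0), (-9.5, 25.0), (-3.0, 27.0)]
for step, interval in enumerate(running_intersection(history), start=1):
    print(step, interval)
```

In this made-up sequence the displayed interval never widens: a later interval with a looser limit (such as the upper limit of 34.0 in the second step) does not replace a tighter limit that was already observed.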

Optimizely Experimentation sets your confidence interval to the same level that you set your statistical significance threshold for the project. By default, the statistical significance setting for your project is 90%.

When Optimizely Experimentation declares significance

A variation is declared significant when the confidence interval stops crossing zero. Intervals that cross zero mean there is not enough evidence to say whether there is a clear impact.

When a variation reaches statistical significance, the confidence interval remains entirely above or entirely below 0.
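As a rough illustration of this rule, the following Python sketch classifies an interval by whether it excludes zero. The function name `interval_verdict` and its labels are hypothetical and are not part of any Optimizely API. The sample limits are the three intervals discussed in this article, and treating an interval that only touches zero as inconclusive is an assumption of the sketch.

```python
def interval_verdict(lower: float, upper: float) -> str:
    """Classify an improvement interval given its limits in percent."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    if lower > 0:
        return "significant positive"  # interval entirely above zero
    if upper < 0:
        return "significant negative"  # interval entirely below zero
    # Assumption: an interval that crosses or touches zero is inconclusive.
    return "inconclusive"


# Interval limits quoted in this article.
examples = [(77.57, 105.21), (-20.19, 22.13), (-15.55, -9.19)]
for lower, upper in examples:
    print(f"[{lower}%, {upper}%]: {interval_verdict(lower, upper)}")
```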

Confidence interval entirely above zero percent

In the preceding screenshot, enough evidence has accumulated so far that it is highly unlikely that the improvement Optimizely Experimentation observed here is due to random chance alone. However, the improvement Optimizely measured (+89.9%) may differ from the exact improvement you see going forward. The confidence interval indicates that this test variation will have a positive impact in the long run. For this experiment iteration, the error bounds were between 77.57% and 105.21% improvement.

The statistical significance setting for this example is 90%. As Optimizely Experimentation collects more data, the confidence interval may narrow.

Confidence interval includes zero percent

If you must stop a test early or have a low sample size, the confidence interval will give you an idea of whether implementing that variation will have a positive or negative impact.

When you see low statistical significance on specific variations, the confidence interval can serve as another data point to help you make decisions. When you have an inconclusive variation, the interval will look like this:

This variation's improvement in conversion rate is between -20.19% and 22.13%. You can interpret the confidence interval as a worst-case, middle-ground, and best-case scenario. Optimizely is 90% confident that the difference between the variation and baseline conversion rates is:

  • Worst-case – -20.19%
  • Middle-ground – 0.69%
  • Best-case – 22.13%

Confidence interval entirely below zero percent

In this example, enough evidence has accumulated so far that it is highly unlikely that the negative improvement Optimizely Experimentation observed here is due to random chance alone. The negative improvement Optimizely measured (-15.3%) is likely to be what you see going forward. The confidence interval indicates that this test variation will have a negative impact in the long run. For this experiment iteration, the error bounds were between -15.55% and -9.19% improvement. If this experiment were rerun, the difference between the baseline and variation conversion rates would probably fall in the same range.

How statistical significance and confidence intervals are connected

Optimizely Experimentation shows you the statistical likelihood that the improvement is due to changes you made on the page, not chance. Until Stats Engine has enough data to declare statistical significance, the Optimizely Experiment Results page will state that more visitors are needed and show you an estimated wait time based on the current conversion rate.

Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.

Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you receive.

Significance and confidence intervals are connected but are not the same. Your experiment reaches significance precisely when your confidence interval on improvement moves away from zero.

Improvement intervals

Optimizely Experimentation will display a different improvement interval on the results page depending on what type of experiment or optimization you are running:

  • For A/B tests – Optimizely Experimentation displays the relative improvement in conversion rate for the variation over the baseline as a percentage. This is true for all A/B test metrics, regardless of whether they are binary or numeric conversions.
    • In Optimizely Experimentation, a relative improvement interval of 1% to 10% means that the variation sees between 1% and 10% improvement over the baseline. For example, if the baseline conversion rate is 25%, you can expect the variation conversion rate to fall between 25.25% and 27.5%.
  • For multi-armed bandit (MAB) optimizations – Optimizely Experimentation displays absolute improvement.
  • For A/B tests or MAB optimizations with Stats Accelerator enabled – Optimizely Experimentation displays absolute improvement.

MAB optimizations do not generate statistical significance.
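
The arithmetic behind the relative improvement example for A/B tests can be checked with a few lines of Python. This is only a sketch: the function name `variation_rate_range` is hypothetical, and rates and improvements are expressed as fractions (0.25 for 25%, 0.01 for 1%).

```python
from typing import Tuple


def variation_rate_range(
    baseline_rate: float, rel_lower: float, rel_upper: float
) -> Tuple[float, float]:
    """Convert a relative improvement interval into a range of variation rates."""
    # A relative improvement r scales the baseline rate by (1 + r).
    return (baseline_rate * (1 + rel_lower), baseline_rate * (1 + rel_upper))


# Baseline conversion rate of 25% with a relative improvement interval of 1% to 10%.
low, high = variation_rate_range(0.25, 0.01, 0.10)
print(f"{low:.2%} to {high:.2%}")
```

With these inputs the sketch reproduces the range of 25.25% to 27.5% given in the A/B test example. An absolute improvement interval would instead be added to the baseline rate directly rather than scaling it.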

Estimated wait time and <1% significance

As your experiment or campaign runs, Optimizely Experimentation estimates how long it will take for a test to reach conclusiveness. This estimate is based on the current observed baseline and variation conversion rates. If those rates change, the estimate will adjust automatically.

You may see a significance of less than 1%, with a certain number of visitors remaining. In statistical terms, this experiment is underpowered. Optimizely Experimentation needs to gather more evidence to determine whether the change you see reflects a true difference in visitor behavior or is due to chance.

Look at the variation in the preceding example. More than 100,000 additional visitors must be bucketed into that variation before Optimizely Experimentation can determine whether the conversion rates of the variation and the original differ. Remember that the estimate assumes that the observed conversion rate does not fluctuate. If more visitors see the variation but conversions decrease, your experiment will probably take more time, which means the visitors remaining estimate will increase. If conversions increase, Optimizely Experimentation will need fewer visitors to confirm that the behavior change is real.

To learn more about the importance of sample size, see How long to run an experiment.