Statistical significance in Optimizely Experimentation

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Web Personalization
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Statistical significance (fondly known as stat sig) measures how unusual your A/B test results would be if there were no true difference between your variation and the baseline, that is, if the observed lift were due to random chance alone.

Multi-armed bandit optimizations do not generate statistical significance. 

When you observe a lift with 90% or higher statistical significance, the observed results are more unusual than what you would expect in 90% of cases if there were no lift. Assuming there is no difference in performance between the variation and the baseline, the higher your statistical significance, the more unusual your results would be.
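To make this definition concrete, here is a minimal sketch that computes a significance value with a textbook two-proportion z-test evaluated once on the final counts. This is an illustrative assumption and not the calculation that Stats Engine performs. The function name and the visitor and conversion counts are hypothetical.

```python
import math

def classical_significance(base_visitors, base_conversions,
                           var_visitors, var_conversions):
    """Return 1 - p for a two-sided two-proportion z-test (textbook method)."""
    p_base = base_conversions / base_visitors
    p_var = var_conversions / var_visitors
    # Pooled rate: the best estimate of the shared rate if there is no lift.
    pooled = (base_conversions + var_conversions) / (base_visitors + var_visitors)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_visitors + 1 / var_visitors))
    if se == 0:
        return 0.0  # no variability at all, so nothing can be concluded
    z = (p_var - p_base) / se
    # Two-sided p-value: chance of a gap at least this large if there is no lift.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return 1 - p_value

# Hypothetical counts: 10% baseline conversion rate versus 13% in the variation.
print(classical_significance(1000, 100, 1000, 130))
```

A value above 0.90 from this sketch means a gap this large would appear in fewer than 10% of experiments in which the variation and the baseline truly perform the same.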

Optimizely Experimentation does not declare a variation as statistically significant or update the Confidence Interval until your experiment meets specific criteria for visitors and conversions. These criteria are different for experiments using numeric and binary metrics.

  • Numeric metrics (such as revenue) – Do not require a specific number of conversions but require at least 100 visitors or sessions in the variations.
  • Binary metrics – Require at least 100 visitors or sessions and 25 conversions in the variation and the baseline before a winner can be declared.
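The criteria above can be sketched as a simple check. The function name, the dictionary keys, and the metric type labels are hypothetical, and the sketch assumes that "the variations" for numeric metrics includes the baseline.

```python
MIN_VISITORS = 100     # visitors or sessions, per the criteria above
MIN_CONVERSIONS = 25   # applies to binary metrics only

def meets_minimum_criteria(metric_type, baseline, variation):
    """baseline and variation are dicts with 'visitors' and 'conversions' keys."""
    if metric_type not in ("numeric", "binary"):
        raise ValueError(f"unknown metric type: {metric_type!r}")
    arms = (baseline, variation)
    # Both metric types need enough visitors (assumed here to apply to every arm).
    if any(arm["visitors"] < MIN_VISITORS for arm in arms):
        return False
    if metric_type == "binary":
        # Binary metrics also need enough conversions in both arms.
        return all(arm["conversions"] >= MIN_CONVERSIONS for arm in arms)
    return True

baseline = {"visitors": 150, "conversions": 30}
variation = {"visitors": 140, "conversions": 20}
print(meets_minimum_criteria("binary", baseline, variation))   # False: only 20 conversions
print(meets_minimum_criteria("numeric", baseline, variation))  # True: visitors suffice
```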

In statistics, you observe a sample of a population and use it to draw inferences about the whole population. Optimizely Experimentation uses statistical significance to infer whether your variation caused movement in the metric.

Statistical significance helps Optimizely Experimentation control the rate of errors in experiments. In any controlled experiment, there are three possible outcomes:

  • Accurate results – When there is an underlying positive or negative difference between your original and your variation, the data shows a winner or loser. When there is no difference, the data shows an inconclusive result.

  • False-positive (Type I Error) – Your test data shows a significant difference between your original and your variation, but the difference is only random noise in the data. There is no underlying difference between your original and your variation.

  • False-negative (Type II Error) – Your test shows an inconclusive result, but your variation differs from your baseline.

Statistical significance indicates how likely it is that your improvement comes from an actual change in underlying behavior rather than a false positive.

Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.
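The trade-off between significance level and error rate can be illustrated with a small simulation of A/A experiments, in which both arms share the same true conversion rate, so every declared winner is a false positive. This is a sketch under stated assumptions: it uses a textbook two-proportion z-test evaluated once at the end of each run, which is not how Stats Engine computes significance, and all names and parameters are hypothetical.

```python
import math
import random

def significance(n_a, c_a, n_b, c_b):
    """1 - p for a two-sided two-proportion z-test (textbook method)."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = (c_b / n_b - c_a / n_a) / se
    return 1 - math.erfc(abs(z) / math.sqrt(2))

def false_positive_rates(thresholds, runs=2000, visitors=500, rate=0.10, seed=7):
    """Share of simulated A/A runs that reach each significance threshold."""
    rng = random.Random(seed)
    hits = {t: 0 for t in thresholds}
    for _ in range(runs):
        # Both arms draw conversions from the same true rate: no real difference.
        c_a = sum(rng.random() < rate for _ in range(visitors))
        c_b = sum(rng.random() < rate for _ in range(visitors))
        sig = significance(visitors, c_a, visitors, c_b)
        for t in thresholds:
            if sig >= t:
                hits[t] += 1
    return {t: hits[t] / runs for t in thresholds}

rates = false_positive_rates([0.80, 0.90, 0.95])
for threshold, fp_rate in rates.items():
    print(f"threshold {threshold:.2f}: false-positive rate {fp_rate:.3f}")
```

In this simplified setting, the false-positive rate lands near one minus the threshold, so a lower significance level lets more noise through as apparent winners.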

Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you receive.

One-tailed and two-tailed tests

When you run a test, you can run a one-tailed or two-tailed test.

  • Two-tailed tests – Detect differences between the original and the variation in both directions. Two-tailed tests tell you if your variation is a winner and if your variation is a loser.
  • One-tailed tests – Detect differences between the original and the variation in only one direction. One-tailed tests tell you whether the variation is a winner or whether it is a loser, but not both.

Optimizely Experimentation uses two-tailed tests because they are required for the false discovery rate control that Optimizely has implemented in Stats Engine. False discovery rate control matters more for business decisions than the choice between a one-tailed and a two-tailed test, because you want to avoid acting on a false positive or a false negative.
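The difference between the two kinds of test can be sketched with p-values computed from a standard normal z statistic. This is a textbook illustration and not the Stats Engine calculation. The function name is hypothetical, and the one-tailed test here is assumed to look only for a winning variation.

```python
import math

def tail_p_values(z):
    """Return (one_tailed_upper, two_tailed) p-values for a z statistic."""
    # One-tailed (upper): only asks whether the variation is better.
    one_tailed_upper = 0.5 * math.erfc(z / math.sqrt(2))
    # Two-tailed: asks whether the variation differs in either direction.
    two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return one_tailed_upper, two_tailed

for z in (1.5, -1.5):
    one, two = tail_p_values(z)
    print(f"z = {z:+.1f}: one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
```

For a negative z (a losing variation), the upper one-tailed p-value is close to 1, so that test can never flag the loser, while the two-tailed p-value is the same as for a winner of equal size.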

Segmentation and statistical significance

Optimizely Experimentation lets you segment your results to see if certain groups of visitors behave differently from your visitors overall. However, Optimizely Experimentation does not perform additional false discovery rate correction for segmented results.

When you repeatedly segment results and hunt for statistically significant results, the significant results you find are much more likely to be false positives. The false discovery rate rises because you are searching for significant results among many segments. You can limit the risk of false positives if you test only the most meaningful segments.
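A rough calculation shows why searching many segments inflates the risk. The sketch below computes the chance of at least one false positive across several segments, assuming the segments are independent, each is tested at the same level, and no segment has a true effect. Strictly, this is a family-wise error probability rather than the false discovery rate itself, and the function name is hypothetical.

```python
def chance_of_any_false_positive(segments, significance_level=0.90):
    """P(at least one false positive) across independent, effect-free segments."""
    alpha = 1 - significance_level  # per-segment false-positive probability
    return 1 - (1 - alpha) ** segments

for k in (1, 5, 10, 20):
    print(f"{k:>2} segments: {chance_of_any_false_positive(k):.2f}")
```

Under these assumptions, the chance of at least one spurious "significant" segment grows quickly with the number of segments examined.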

Novelty effect and statistical significance

The statistical significance from a novelty effect stays for a long time. As the A/B test continues to run, statistical significance calculations self-correct and take into account how long the test has been running, not just the sample size.

Changing the statistical significance setting

Be aware of the trade-offs associated with changing the statistical significance setting. In general:

  • A higher significance setting is more accurate but increases the time required for Optimizely Experimentation to declare significant results because it requires a larger sample size.
  • A lower statistical significance level decreases the amount of time needed to declare significant results, but lowering the statistical significance setting also increases the chance that some results are false positives.

For more information, see Change the statistical significance setting in Optimizely Experimentation.

Changing your statistical significance setting instantly affects all currently running experiments. For example, if a goal in your experiment has a variation at 85% statistical significance and you change your statistical significance setting from 90% to 80%, you see a winner the next time you load your Experiment Results page (85% > 80%). Your difference intervals also shrink to reflect the reduced need for confidence.
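The example above can be sketched in a few lines. Both function names are hypothetical. The first treats a result as conclusive when its significance meets or exceeds the setting, which is an assumption about the boundary case. The second uses the classical normal-approximation multiplier for a two-sided interval to show why a lower setting narrows the interval. Stats Engine computes its intervals differently, so the numbers are only indicative.

```python
from statistics import NormalDist

def is_declared_significant(observed_percent, setting_percent):
    """True when the observed significance reaches the configured setting."""
    return observed_percent >= setting_percent

def classical_interval_multiplier(setting_percent):
    """z multiplier for a two-sided interval at the given confidence (textbook)."""
    alpha = 1 - setting_percent / 100
    return NormalDist().inv_cdf(1 - alpha / 2)

observed = 85
for setting in (90, 80):
    print(
        f"setting {setting}%: winner declared = "
        f"{is_declared_significant(observed, setting)}, "
        f"interval multiplier = {classical_interval_multiplier(setting):.3f}"
    )
```

Lowering the setting from 90% to 80% turns the 85% result into a declared winner and, under this approximation, reduces the interval multiplier, which corresponds to a narrower interval.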