Statistical significance

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Personalization
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Statistical significance (known as stat sig) measures how unusual your A/B test results would be if the variation and baseline performed identically.

Multi-armed bandit optimizations do not generate statistical significance. 

When you observe a lift with 90% or higher statistical significance, the observed results would be unusual if there were truly no lift. The higher your statistical significance, the more unusual your results are under the assumption that the variation and the baseline perform identically.
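
To make this concrete, the sketch below computes significance for a classic fixed-horizon, two-sided z-test on two conversion rates. Stats Engine uses a sequential method, so its numbers will differ; the conversion counts here are made up for illustration.

    # A classic fixed-horizon two-proportion z-test (not Stats Engine's
    # sequential method), showing how "unusualness" becomes a significance number.
    from math import sqrt
    from statistics import NormalDist

    def two_sided_significance(conv_a, n_a, conv_b, n_b):
        """Return significance as a percentage: 100 * (1 - two-tailed p-value)."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)                # pooled rate
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error
        z = (p_b - p_a) / se                                    # test statistic
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-tailed
        return 100 * (1 - p_value)

    # Hypothetical counts: baseline 500/5000, variation 570/5000 conversions.
    print(round(two_sided_significance(500, 5000, 570, 5000), 1))  # about 97.6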

Optimizely does not declare a variation statistically significant or update the Confidence Interval until your experiment meets specific criteria for visitors and conversions. These criteria differ for experiments using numeric and binary metrics, as the sketch after the following list shows.

  • Numeric metrics (such as revenue) – Do not require a specific number of conversions, but require at least 100 visitors or sessions in the variations.
  • Binary metrics – Require at least 100 visitors or sessions and 25 conversions in both the variation and the baseline before Optimizely declares a winner.
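
These thresholds are straightforward to express as a guard. The following sketch is hypothetical and only mirrors the rules above; it is not Optimizely's implementation.

    # Hypothetical eligibility check mirroring the thresholds above.
    def ready_for_significance(metric_type, baseline_visitors, variation_visitors,
                               baseline_conversions=0, variation_conversions=0):
        """Return True once a metric is eligible for a significance call."""
        enough_traffic = baseline_visitors >= 100 and variation_visitors >= 100
        if metric_type == "numeric":     # e.g. revenue: traffic threshold only
            return enough_traffic
        if metric_type == "binary":      # also needs 25 conversions on each side
            return (enough_traffic
                    and baseline_conversions >= 25
                    and variation_conversions >= 25)
        raise ValueError(f"unknown metric type: {metric_type}")

    print(ready_for_significance("binary", 150, 160, 30, 24))  # False: 24 < 25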

In statistics, you observe a sample and use it to draw inferences about the whole population. Optimizely uses statistical significance to infer whether your variation caused the movement in the metric.

The statistical significance level helps Optimizely control the rate of experiment errors. Every controlled experiment has three possible outcomes:

  • Correct results – When a real positive or negative difference exists between the original and the variation, the data shows a winner or loser. When no real difference exists, the data shows an inconclusive result.
  • False-positive (Type I error) – The test data shows a significant difference between the original and the variation, but the difference reflects random noise rather than a real effect.
  • False-negative (Type II error) – The test reads as inconclusive even though the variation differs from the baseline.
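
One way to see a Type I error rate directly is to simulate A/A tests, where both arms share the same true conversion rate, so any significant result is a false positive by construction. The sketch below reuses the fixed-horizon z-test from earlier; at a 90% threshold it flags roughly 10% of A/A tests.

    # Simulated A/A tests: both arms convert at the same true rate, so every
    # "significant" call is a false positive (Type I error).
    import random
    from math import sqrt
    from statistics import NormalDist

    def aa_false_positive_rate(trials=1000, n=2000, rate=0.10, threshold=90.0):
        false_positives = 0
        for _ in range(trials):
            conv_a = sum(random.random() < rate for _ in range(n))
            conv_b = sum(random.random() < rate for _ in range(n))
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(pooled * (1 - pooled) * (2 / n))
            z = (conv_b - conv_a) / n / se
            p_value = 2 * (1 - NormalDist().cdf(abs(z)))
            if 100 * (1 - p_value) >= threshold:
                false_positives += 1
        return false_positives / trials

    random.seed(1)
    print(aa_false_positive_rate())  # roughly 0.10 at a 90% threshold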

Lower significance levels may increase the likelihood of error but let you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.

Choosing the right significance level is a balance between the types of tests you run, the confidence you want in the results, and the amount of traffic you receive.

One-tailed and two-tailed tests

When you run a test, you choose between a one-tailed and a two-tailed test; the sketch after the following list compares the two readings.

  • Two-tailed tests – Detect differences between the original and the variation in both directions, so they identify both winning and losing variations.
  • One-tailed tests – Tell you whether the variation is a winner or a loser, but not both, because they detect differences in only one direction.
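
For a given test statistic, the two readings are easy to compare: a one-tailed p-value is half the two-tailed one, but only in the direction you committed to in advance. A sketch with a made-up z score:

    # Same hypothetical z statistic, read one-tailed and two-tailed.
    from statistics import NormalDist

    z = 1.5                                          # made-up test statistic
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))  # detects winner OR loser
    one_tailed = 1 - NormalDist().cdf(z)             # detects a winner only

    print(round(two_tailed, 3))  # ~0.134: not significant at the 10% level
    print(round(one_tailed, 3))  # ~0.067: significant, but blind to losers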

Optimizely uses two-tailed tests because Stats Engine's false discovery rate control requires them. When you make business decisions, controlling the false discovery rate matters more than the choice between a one-tailed and a two-tailed test, because it protects you from implementing a false positive.
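
Stats Engine applies false discovery rate control continuously as data arrives, and its exact sequential procedure is not shown here. The classical offline analogue is the Benjamini-Hochberg procedure, sketched below to illustrate what controlling the false discovery rate means when many metrics are tested at once.

    # Classical Benjamini-Hochberg FDR control, an offline analogue of the idea
    # (Stats Engine uses a sequential method, not this exact procedure).
    def benjamini_hochberg(p_values, fdr=0.10):
        """Return indices of hypotheses rejected at the given FDR level."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        # Find the largest rank k with p_(k) <= (k / m) * fdr; reject 1..k.
        cutoff = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= rank / m * fdr:
                cutoff = rank
        return sorted(order[:cutoff])

    # Hypothetical p-values from five metrics in one experiment.
    print(benjamini_hochberg([0.001, 0.012, 0.031, 0.18, 0.44]))  # [0, 1, 2]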

Segmentation and statistical significance

You can segment your results on the Optimizely Experiment Results page to see whether specific visitor groups behave differently from the overall audience. Optimizely does not perform additional false discovery rate control across segments.

Avoid repeatedly segmenting your results in search of a segment that reaches significance. The chance of a false discovery grows quickly with the number of segments you examine. You can limit the risk by testing only the most meaningful segments.
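
The arithmetic behind this warning: if each segment is tested at 90% significance (a 10% false positive rate) and the segments behaved independently, the chance of at least one false discovery across k segments would be 1 - 0.9^k. Real segments overlap, so treat this as a rough illustration.

    # Rough chance of at least one false positive across k segments, each
    # tested at 90% significance (assumes independence, which is optimistic).
    alpha = 0.10
    for k in (1, 3, 5, 10, 20):
        p_any = 1 - (1 - alpha) ** k
        print(f"{k:2d} segments -> {p_any:.0%} chance of a false discovery")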

Change the statistical significance setting

You should understand the trade-offs associated with changing the statistical significance setting. In general:

  • A higher significance setting increases the time required for Optimizely to declare significant results because it requires a larger sample size; the sketch after this list illustrates the scaling.
  • A lower significance setting decreases the time needed to declare significant results but increases the chance that some of those results are false positives.
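
You can estimate the traffic cost of a higher setting with a standard fixed-horizon sample-size formula for comparing two proportions. Stats Engine is sequential, so its actual sample sizes differ, but the scaling is similar; the baseline rate and lift below are made up.

    # Rough fixed-horizon sample size per variation for a two-proportion test
    # (illustrative scaling only; Stats Engine's sequential math will differ).
    from math import ceil, sqrt
    from statistics import NormalDist

    def visitors_per_variation(base_rate, relative_lift, alpha, power=0.80):
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed threshold
        z_power = NormalDist().inv_cdf(power)
        p1, p2 = base_rate, base_rate * (1 + relative_lift)
        p_bar = (p1 + p2) / 2
        n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
             / (p2 - p1) ** 2)
        return ceil(n)

    # Hypothetical 10% baseline rate and 10% relative lift.
    for setting in (80, 90, 95):
        n = visitors_per_variation(0.10, 0.10, alpha=1 - setting / 100)
        print(f"{setting}% setting -> about {n:,} visitors per variation")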

For more information, see Change the statistical significance setting.

A statistical significance change instantly affects every running experiment. Suppose your experiment has a goal at 85% statistical significance. If you lower the project setting from 90% to 80%, the Experiment Results page shows that goal as a winner the next time you reload it (85% > 80%). Your confidence intervals also narrow to reflect the lower confidence requirement.
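
The narrowing follows from the normal multiplier attached to each confidence level; dropping from 90% to 80% shrinks a symmetric interval by roughly 22%. A sketch assuming plain normal-approximation intervals (Stats Engine's always-valid intervals are computed differently):

    # Half-width of a symmetric normal-approximation interval at each setting
    # (illustrative; Stats Engine's always-valid intervals differ).
    from statistics import NormalDist

    standard_error = 0.005              # hypothetical SE of the observed lift
    for level in (0.80, 0.90, 0.95):
        z = NormalDist().inv_cdf(1 - (1 - level) / 2)
        print(f"{level:.0%} interval: +/- {z * standard_error:.4f}")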