Statistical significance in Optimizely Experimentation

This topic describes how to:

  • Use statistical significance to analyze results

Optimizely Experimentation won't declare a variation a winner or loser until your experiment meets specific criteria for visitors and conversions. These criteria are different for experiments using numeric metrics and those using binary metrics.

Numeric metrics (such as revenue) do not require a specific number of conversions, but they do require 100 visitors/sessions in the variation. Binary metrics, on the other hand, require at least 100 visitors/sessions and 25 conversions in both the variation and the baseline before a winner can be declared.
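The minimum criteria above can be sketched as a small check. This is only an illustration of the stated thresholds, not Optimizely Experimentation's actual code; the `ArmCounts` class and `can_declare` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

MIN_VISITORS = 100     # visitors/sessions threshold stated in the text
MIN_CONVERSIONS = 25   # conversions threshold, binary metrics only


@dataclass
class ArmCounts:
    """Hypothetical container for the totals of one arm (variation or baseline)."""
    visitors: int
    conversions: int = 0


def can_declare(metric_type: str, variation: ArmCounts, baseline: ArmCounts) -> bool:
    """Return True if the minimum criteria for declaring a result are met."""
    if metric_type == "numeric":
        # Numeric metrics: no conversion minimum, 100 visitors in the variation.
        return variation.visitors >= MIN_VISITORS
    if metric_type == "binary":
        # Binary metrics: both arms need 100 visitors and 25 conversions.
        return all(
            arm.visitors >= MIN_VISITORS and arm.conversions >= MIN_CONVERSIONS
            for arm in (variation, baseline)
        )
    raise ValueError(f"unknown metric type: {metric_type!r}")
```

For example, a binary metric with 150 visitors and 30 conversions in the variation, but only 24 conversions in the baseline, does not yet meet the criteria.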

Statistical significance is a measure of how unusual your experiment results would be if there were actually no difference between your variation and baseline, so that any observed lift was due to random chance alone. When we observe a lift with 90% or higher statistical significance, it means that the observed results are more unusual than what would be expected in 90% of cases if there were no lift. In other words, assuming there is no difference in performance between the variation and the baseline, the higher your statistical significance, the more unusual your results would appear.

[Figure: statsig-example.png]
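To make the idea of "how unusual the results would be with no real difference" concrete, here is a minimal sketch using a classical fixed-horizon two-proportion z-test. This is a swapped-in textbook technique chosen for illustration, not the calculation Stats Engine performs; the function name and inputs are assumptions.

```python
import math


def two_sided_significance(conv_base: int, n_base: int,
                           conv_var: int, n_var: int) -> float:
    """Return 1 - p for a classical two-sided two-proportion z-test.

    Illustrative only: a fixed-horizon test, not Optimizely's Stats Engine.
    """
    rate_base = conv_base / n_base
    rate_var = conv_var / n_var
    # Pooled rate under the assumption that there is no real difference.
    pooled = (conv_base + conv_var) / (n_base + n_var)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_var))
    if std_err == 0:
        return 0.0  # no variability at all, so nothing unusual to report
    z = (rate_var - rate_base) / std_err
    # Two-sided p-value: chance of a gap at least this large from noise alone.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return 1 - p_value
```

With identical conversion rates in both arms the function returns 0, and the value moves toward 1 as the observed gap becomes harder to explain by chance.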

In statistics, you observe a sample of the population and use it to make inferences about the total population. Optimizely Experimentation uses statistical significance to infer whether your variation caused movement in the Improvement metric.

Statistical significance helps Optimizely Experimentation control the rate of errors in experiments. In any controlled experiment, you should anticipate three possible outcomes:

  • Accurate results – When there is an underlying positive (or negative) difference between your original and your variation, the data shows a winner (or loser); when there isn’t a difference, the data shows an inconclusive result.

  • False-positive – (Type I Error) Your test data shows a significant difference between your original and your variation, but it’s actually random noise in the data—there is no underlying difference between your original and your variation.

  • False-negative – (Type II Error) Your test shows an inconclusive result, but your variation is actually different from your baseline.
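The three outcomes above can be expressed as a small classifier. This is a sketch for intuition only: in a real experiment the true underlying difference is unknown, and the function and label names here are hypothetical.

```python
def classify_outcome(true_difference: float, declared: str) -> str:
    """Label an experiment result given the (normally unknown) true difference.

    declared must be 'winner', 'loser', or 'inconclusive'.
    """
    if declared not in {"winner", "loser", "inconclusive"}:
        raise ValueError(f"unexpected result: {declared!r}")
    if true_difference == 0:
        # Any declared winner or loser here is random noise (Type I error).
        return "accurate" if declared == "inconclusive" else "false positive"
    if declared == "inconclusive":
        # A real difference exists but was not detected (Type II error).
        return "false negative"
    expected = "winner" if true_difference > 0 else "loser"
    # A result in the wrong direction is not one of the three outcomes listed
    # in the text; it is flagged separately here as an assumption of this sketch.
    return "accurate" if declared == expected else "wrong direction"
```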

Statistical significance is a measure of how likely it is that your improvement comes from an actual change in underlying behavior, instead of a false positive.

Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.

Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you actually receive.
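The trade-off between significance level and sample size can be illustrated with the textbook fixed-horizon sample size approximation for comparing two conversion rates. This formula is an assumption made for illustration and is not how Stats Engine determines when to declare results; the 80% power default and the function name are also assumptions.

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline_rate: float, variation_rate: float,
                     significance: float = 0.90, power: float = 0.80) -> int:
    """Approximate visitors needed per arm for a classical two-sided test."""
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    # Sum of the per-visitor variances of the two conversion rates.
    variance = (baseline_rate * (1 - baseline_rate)
                + variation_rate * (1 - variation_rate))
    effect = variation_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Same effect (10% vs. 12% conversion rate), three significance levels.
for level in (0.80, 0.90, 0.95):
    print(level, visitors_per_arm(0.10, 0.12, significance=level))
```

Under these assumptions, the required number of visitors rises as the significance level rises, which is the trade-off described above.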

One-tailed and two-tailed tests

When you run a test, you can run a one-tailed or two-tailed test. Two-tailed tests are designed to detect differences between your original and your variation in both directions: they tell you if your variation is a winner and if your variation is a loser. A one-tailed test will tell you whether your variation is a winner or a loser, but not both. One-tailed tests are designed to detect differences between your original and your variation in only one direction.

With the introduction of the Stats Engine, Optimizely Experimentation uses two-tailed tests because they are required for the false discovery rate control that we have implemented in our Stats Engine.

In practice, false discovery rate control matters more to your ability to make business decisions than whether you use a one-tailed or two-tailed test, because your main goal when making business decisions is to avoid implementing a false positive or false negative.

Switching from a two-tailed to a one-tailed test will typically change error rates by a factor of two, but requires the additional overhead of specifying in advance whether you are looking for winners or losers. If you know you're looking for a winner, you can increase your statistical significance setting from 90% to 95%. On the other hand, not using false discovery rate control can inflate error rates by a factor of five or more.

It’s more helpful to know the actual chance of implementing false results and to make sure that your results aren’t compromised by adding multiple goals.
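The factor-of-two relationship between one-tailed and two-tailed tests can be checked with a short sketch based on the standard normal distribution. This uses a classical z-statistic for illustration and is not a Stats Engine calculation; the function names are assumptions.

```python
import math


def one_tailed_p(z: float) -> float:
    """Chance of a z-score at least this high when there is no real difference."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def two_tailed_p(z: float) -> float:
    """Chance of a z-score at least this extreme in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))


z = 1.645  # roughly the cutoff for 95% one-tailed significance
print(one_tailed_p(z))  # close to 0.05
print(two_tailed_p(z))  # close to 0.10, twice the one-tailed value
```

For a positive z-score, the two-tailed value is exactly twice the one-tailed value. A two-tailed test at 90% significance therefore puts 5% of the error in each tail, which is the same bar for winners as a one-tailed test at 95%.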

Segmentation and statistical significance

Optimizely Experimentation lets you segment your results so you can see if certain groups of visitors behave differently from your visitors overall. However, Optimizely Experimentation does not perform additional false discovery rate correction for segmented results. This means that if you repeatedly segment results and hunt for statistically significant ones, any significant results you find are much more likely to be false positives.

You can limit the risk of false positives if you only test the segments that are the most meaningful. The higher false discovery rate arises when you are searching for significant results among many segments.
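A rough calculation shows why searching across many segments raises the risk. The sketch below assumes each segment is tested independently at the same per-test false positive rate and that no segment has a real difference; these are simplifying assumptions, and the function name is hypothetical.

```python
def chance_of_any_false_positive(num_segments: int,
                                 significance: float = 0.90) -> float:
    """Probability that at least one of several independent segments
    looks significant purely by chance (simplified, illustrative model)."""
    alpha = 1 - significance  # per-segment false positive rate
    return 1 - (1 - alpha) ** num_segments


# Risk grows quickly with the number of segments examined.
for k in (1, 3, 10):
    print(k, round(chance_of_any_false_positive(k), 3))
```

Under these assumptions, one segment carries about a 10% chance of a spurious significant result, while ten segments carry roughly a 65% chance of at least one.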

Novelty effect and statistical significance

Currently, statistical significance that results from a novelty effect can persist for a long time. In the future, statistical significance calculations will self-correct and take into account how long the test has been running, not just sample size.

Changing statistical significance setting

You should be aware of certain trade-offs associated with changing the statistical significance setting. In general, a higher significance setting is more accurate but increases the time required for Optimizely Experimentation to declare significant results, because it requires a larger sample size. A lower significance setting decreases the time needed to declare significant results, but it also increases the chance that some of those results will be false positives. For more information, see Changing the statistical significance setting.

Changing your statistical significance setting will instantly affect all currently running experiments. If a goal in your experiment has a variation that is ahead at 85% statistical significance, and you change your statistical significance setting from 90% to 80%, the next time you load your Results page, you will see that variation declared a winner (85% > 80%). Your difference intervals will also shrink to reflect the reduced need for confidence.
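The example above amounts to comparing an observed significance against the configured threshold. The following sketch is a simplification for illustration, not Optimizely Experimentation's implementation; `result_status` and its parameters are hypothetical.

```python
def result_status(observed_significance: float, lift: float,
                  threshold: float) -> str:
    """Return the displayed status for a goal under a given threshold."""
    if observed_significance < threshold:
        return "inconclusive"
    # At or above the threshold, the sign of the lift decides the label.
    return "winner" if lift > 0 else "loser"


# A goal at 85% significance with a positive lift (the 4% lift is made up).
print(result_status(0.85, lift=0.04, threshold=0.90))  # inconclusive
print(result_status(0.85, lift=0.04, threshold=0.80))  # winner
```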