Why is my experiment failing to reach statistical significance?

  • Optimizely Web Experimentation
  • Optimizely Personalization
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

If you have been running A/B tests, you have probably wondered why your experiment did not reach statistical significance. 

Statistical significance measures how unusual your experiment results would be if there were no real difference between your variation and the baseline, so that any observed lift was due to random chance alone. In other words, it indicates how well the results from the sample you tested reflect reality. On the journey of experience optimization, your speed of travel is tied to how quickly you get statistically significant results.

Fortunately, savvy experiment design and an understanding of how statistical significance works under the hood will help you reach conclusive results.

This article provides a few tips on reaching statistical significance. See How long to run an experiment and Use minimum detectable effect to prioritize experiments for more tips.

Changes are too small 

Sometimes, a small change can make a difference. Other times, minor adjustments are not enough to push your experiment to statistical significance.

If your revision is minor, its impact on your baseline conversion rate will also likely be small. Stats Engine picks up this small difference but takes longer to decide whether it is a chance fluctuation or a lasting change in visitor behavior. 

The chart below shows how smaller improvements over the baseline require larger sample sizes (and more time) to declare a statistically significant result.
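To get a feel for the relationship between lift and sample size, here is a minimal sketch using a classical fixed-horizon, two-sided two-proportion z-test. This is not how Stats Engine calculates significance (Stats Engine evaluates results sequentially), so treat the numbers as rough illustrations only. The function name, the 10% baseline, the 90% significance threshold (alpha=0.10), and the 80% power are all assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.10, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift.

    Illustrative fixed-horizon formula for a two-sided two-proportion z-test.
    It is an assumption for this sketch, not Stats Engine's sequential method.
    """
    p1 = baseline                        # baseline conversion rate
    p2 = baseline * (1 + relative_lift)  # variation conversion rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline: compare a small, medium, and large relative lift.
for lift in (0.05, 0.10, 0.20):
    visitors = sample_size_per_variation(0.10, lift)
    print(f"{lift:.0%} relative lift: about {visitors:,} visitors per variation")
```

Under these assumptions, halving the lift you want to detect roughly quadruples the traffic you need, which is why minor revisions take so much longer to resolve.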

Best practice

When designing an experiment, consider making changes that will significantly impact your visitor's experience, whether the change itself is big or small.

A text change to a CTA can drive more clicks if the initial text does not reflect the purpose of the CTA properly. Adjusting the copy to match the visitor's intent can be a significant change. If the purpose of the CTA is generally clear (like a "buy" button on a product page), changes to the text are less likely to drive noticeable improvements.

Low baseline  

The most important metrics to a business sometimes have relatively low baseline conversion rates. In ecommerce, for example, the "purchase" conversion rate is a relatively low-frequency event: often below 3%. 

Low baseline conversion rates affect the time it takes to reach statistical significance. In the chart above, note the difference in traffic required to reach significance for a 1% versus a 5% baseline.
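The effect of the baseline can be sketched with a simplified rule of thumb. For a small relative lift d on a baseline rate p, a fixed-horizon two-proportion test needs roughly 2 * (z_alpha + z_beta)^2 * (1 - p) / (p * d^2) visitors per variation. This approximation, the function name, and the default alpha and power values are assumptions for illustration; Stats Engine uses a sequential method, so actual traffic requirements will differ.

```python
from math import ceil
from statistics import NormalDist


def approx_visitors_needed(baseline, relative_lift, alpha=0.10, power=0.80):
    """Rule-of-thumb visitors per variation for a given baseline rate.

    Simplified fixed-horizon approximation (an assumption for this sketch):
    n ~ 2 * (z_alpha + z_beta)^2 * (1 - p) / (p * d^2)
    """
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z_total ** 2 * (1 - baseline) / (baseline * relative_lift ** 2))


# Same 10% relative lift, measured against a 1% and a 5% baseline.
for baseline in (0.01, 0.05):
    visitors = approx_visitors_needed(baseline, 0.10)
    print(f"{baseline:.0%} baseline: about {visitors:,} visitors per variation")
```

Because the baseline rate sits in the denominator, a 1% baseline needs roughly five times the traffic of a 5% baseline to detect the same relative lift under this approximation.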

Best practice

While it is important to track how experiments affect key metrics, it is not always possible to directly capture the impact on an infrequent event in a timely manner. When this is the case, use a metric with a higher baseline as a stand-in to measure success.

Imagine that you are optimizing the homepage of an ecommerce site with a banner that prompts visitors to visit the electronics category. You expect more visitors to click the banner, view electronics, and purchase. However, the baseline conversion rate for purchases is relatively low. You have a limited amount of time to run this experiment, and a purchase metric would take too long to reach statistical significance.

Instead of measuring success in purchases, you set your primary metric to track clicks on the banner. That way, you do not have to wait for significance to travel down the funnel to decide whether the variation wins or loses. You measure the impact of your experiment directly in clicks, where you made the change. You can then extrapolate that win to estimate your experiment's impact on revenue.
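One way to do that extrapolation is a simple back-of-the-envelope calculation. The sketch below assumes that the extra clicks convert to purchases at the same rate as existing clicks, which may not hold in practice. The function name and every number in the example are hypothetical.

```python
def estimate_revenue_impact(visitors, baseline_ctr, variation_ctr,
                            click_to_purchase_rate, average_order_value):
    """Rough revenue estimate from a lift in banner clicks.

    Assumes (for illustration only) that incremental clicks purchase at the
    same rate, and with the same order value, as existing clicks.
    """
    extra_clicks = visitors * (variation_ctr - baseline_ctr)
    extra_purchases = extra_clicks * click_to_purchase_rate
    return extra_purchases * average_order_value


# Hypothetical inputs: 100,000 visitors, banner click rate rising from 4% to 5%,
# 2% of banner clicks ending in a purchase, and an $80 average order.
impact = estimate_revenue_impact(100_000, 0.04, 0.05, 0.02, 80.0)
print(f"Estimated additional revenue: ${impact:,.2f}")
```

Treat the result as a planning estimate rather than a measured outcome, since the purchase metric itself never reached significance.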

Too many goals

Stats Engine distinguishes between primary and secondary metrics. The more secondary (or monitoring) metrics you add to an experiment, the longer it may take to reach statistical significance.
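To see why extra metrics can slow things down, consider a simple multiple-comparisons correction. This article does not describe the exact correction Stats Engine applies, so the sketch below uses a plain Bonferroni correction (splitting the error budget evenly across metrics) as a conservative stand-in for a fixed-horizon test. The function name and the default alpha and power values are assumptions.

```python
from statistics import NormalDist


def traffic_multiplier(num_metrics, alpha=0.10, power=0.80):
    """Extra traffic each metric needs, relative to testing a single metric.

    Uses a Bonferroni correction (alpha / num_metrics) as an illustrative
    assumption; it is not Stats Engine's actual correction.
    """
    norm = NormalDist()
    z_beta = norm.inv_cdf(power)
    z_single = norm.inv_cdf(1 - alpha / 2)
    z_corrected = norm.inv_cdf(1 - alpha / (2 * num_metrics))
    return ((z_corrected + z_beta) / (z_single + z_beta)) ** 2


# Compare how the traffic requirement grows as metrics are added.
for count in (1, 2, 5, 10):
    print(f"{count} metric(s): about {traffic_multiplier(count):.2f}x the traffic")
```

Under this stand-in, every additional metric tightens the threshold each metric must clear, so each one needs more visitors before it can be called significant.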

Best practice

Be strategic when deciding which metrics to track in an experiment. Add the critical goals, even if there are ten or more. Do not track goals that are not crucial in deciding whether an experiment is a success or failure for your business needs.

For example, suppose you are optimizing the search bar on your homepage. You are tempted to track how your changes impact clicks on your customer support widget. While customer support is a valid consideration, measuring it in this particular experiment may not be crucial.

Return to your hypothesis. Does the impact on support tell you whether the hypothesis is valid or not? If not, this goal may just get in the way of reaching statistical significance. Avoid adding it to this particular experiment.