Why is my experiment failing to reach statistical significance?

This topic describes how to:

  • Design experiments that reach statistical significance faster
  • Determine the cause of long-running or inconclusive experiments

If you've been running A/B tests, you've probably wondered: why isn't my experiment reaching statistical significance? 

Statistical significance is the likelihood that the difference in conversion rates between a given variation and the baseline is not due to random chance. In other words, it's a good indicator of how well the results of the sample you tested will reflect reality. On the journey of experience optimization, your speed of travel is tied to your success in getting statistically significant results.
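
If you like to see the math, here's a minimal sketch of a classical two-proportion significance test in Python, using made-up numbers. Stats Engine uses a sequential method rather than this fixed-horizon test, so treat it only as an illustration of the underlying idea: a small p-value means the observed difference is unlikely to be pure chance.

```python
# Minimal sketch: a classical two-proportion z-test on made-up numbers.
# Stats Engine uses a sequential method, so this only illustrates the idea.
from scipy.stats import norm

def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    return 2 * norm.sf(abs(z))

# Hypothetical result: 3.0% baseline vs 3.4% variation, 10,000 visitors each.
print(two_proportion_p_value(300, 10_000, 340, 10_000))  # about 0.11
```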

Fortunately, savvy experiment design and an understanding of how statistical significance works under the hood will help you reach conclusive results.

This article provides a few tips on reaching statistical significance. Related concepts are covered in other articles, including How long to run an experiment and Use minimum detectable effect to prioritize experiments; although we repeat some of the same principles here, we recommend reading those as well.

Read on to learn why your experiment isn't reaching statistical significance.

Changes are too small 

Sometimes, a small change can make a huge difference. A new call-to-action (CTA) can help a charity raise $1.5m more, for example. Other times, modest adjustments don't make big enough waves to push your experiment to statistical significance.

If your revision is minor, its impact on your baseline conversion rate is likely to be small too. Stats Engine picks up this small difference but takes longer to decide whether it's a chance fluctuation or a lasting change in visitor behavior. 

Check out the chart below to see how smaller improvements over the baseline require larger sample sizes (and time) to declare a statistically significant result.
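
To get a feel for that relationship yourself, here's a minimal sketch using the standard fixed-horizon sample size formula for comparing two proportions. The alpha, power, and baseline values are assumptions for illustration, and Stats Engine's sequential calculation is different, but the trend is the same: the smaller the lift, the more visitors you need.

```python
# Minimal sketch: required visitors per variation under a classical
# fixed-horizon two-proportion test (alpha = 0.05, power = 0.8).
# Stats Engine's sequential model differs, but the trend is the same:
# the smaller the lift, the more visitors you need.
from scipy.stats import norm

def visitors_per_variation(baseline, absolute_lift, alpha=0.05, power=0.8):
    """Approximate sample size per variation to detect a given absolute lift."""
    p1, p2 = baseline, baseline + absolute_lift
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(z ** 2 * variance / absolute_lift ** 2)

baseline = 0.03  # hypothetical 3% baseline conversion rate
for lift in (0.002, 0.005, 0.01, 0.02):
    n = visitors_per_variation(baseline, lift)
    print(f"+{lift:.1%} absolute lift -> ~{n:,} visitors per variation")
```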

Best practice:

When you design an experiment, consider making changes that will have a significant impact on your visitors' experience, whether the change itself is big or small.

A text change to a CTA can drive more clicks if the initial text doesn't reflect the purpose of the CTA properly. Adjusting the copy to match the visitor's intent can be a significant change. If the purpose of the CTA is generally clear (like a "buy" button on a product page), changes to the text are less likely to drive noticeable improvements.

Low baseline  

The most important metrics to a business sometimes have relatively low baseline conversion rates. In e-commerce, for example, a purchase is a relatively low-frequency event, with conversion rates often below 3%.

Low baseline conversion rates affect the time it takes to reach statistical significance. In the chart above, note the difference in traffic required to reach significance for a 1% versus a 5% baseline.
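
Applying the same rough approximation from the earlier sketch to this comparison (again, an illustration rather than Stats Engine's actual math) shows how much harder a low baseline works against you:

```python
# The same classical approximation, comparing a 1% and a 5% baseline that are
# both chasing a hypothetical 10% relative lift. Illustration only; this is
# not Stats Engine's actual calculation.
from scipy.stats import norm

def visitors_per_variation(baseline, absolute_lift, alpha=0.05, power=0.8):
    p1, p2 = baseline, baseline + absolute_lift
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / absolute_lift ** 2)

for baseline in (0.01, 0.05):
    lift = baseline * 0.10  # 10% relative improvement
    print(f"{baseline:.0%} baseline -> ~{visitors_per_variation(baseline, lift):,} visitors per variation")
```

With the same relative lift, the 1% baseline needs roughly five times the traffic of the 5% baseline.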

Best practice:

While it's important to track how experiments affect key metrics, it's not always possible to directly measure the impact on an infrequent event in a timely manner. When this is the case, use a metric with a higher baseline conversion rate as a proxy for success.

Imagine that you're optimizing the homepage of an e-commerce site with a banner that prompts visitors to visit the electronics category. You expect more visitors to click the banner, view electronics, and purchase. But the baseline conversion rate for purchases is relatively low. You have a limited amount of time to run this experiment, and the purchase metric will take too long to reach statistical significance.

Instead of measuring success in purchases, you set your primary metric to track clicks to the banner. That way, you don't have to wait for significance to travel all the way down the funnel to decide whether the variation wins or loses. You measure the impact of your experiment directly in clicks, where you made the change. And, you can extrapolate that win to estimate your experiment's impact on revenue.
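
As a back-of-the-envelope illustration of that extrapolation, you might combine the measured click lift with your historical funnel rates. Every number below is hypothetical:

```python
# Back-of-the-envelope extrapolation from a measured click lift to estimated
# revenue impact. Every number below is hypothetical; substitute your own
# historical funnel rates and order values.
monthly_visitors = 500_000
baseline_banner_ctr = 0.040          # banner clicks per homepage visitor
measured_relative_lift = 0.15        # +15% clicks, the significant primary metric
click_to_purchase_rate = 0.06        # historical purchases per banner click
average_order_value = 120.00         # dollars

extra_clicks = monthly_visitors * baseline_banner_ctr * measured_relative_lift
extra_purchases = extra_clicks * click_to_purchase_rate
estimated_extra_revenue = extra_purchases * average_order_value

print(f"Estimated extra purchases per month: {extra_purchases:,.0f}")
print(f"Estimated extra revenue per month: ${estimated_extra_revenue:,.0f}")
```

This assumes the click-to-purchase rate stays the same for the variation, which is exactly the kind of assumption worth sanity-checking against a secondary metric.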

Too many goals  

Stats Engine makes a distinction between primary and secondary metrics. The more secondary (or monitoring) metrics that you add to an experiment, the longer it may take to reach statistical significance.
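
One way to build intuition for why extra metrics slow things down: testing many metrics at once requires a correction for multiple comparisons, which raises the evidence bar each individual metric has to clear. The sketch below applies a generic Benjamini-Hochberg false discovery rate correction to some made-up p-values; Stats Engine applies its own correction, so this only illustrates the principle.

```python
# Illustration of why more metrics raise the bar: a generic Benjamini-Hochberg
# false discovery rate correction over made-up p-values. Stats Engine applies
# its own correction, so treat this only as a sketch of the principle.
def benjamini_hochberg(p_values, fdr=0.05):
    """Return which hypotheses remain significant after BH correction."""
    m = len(p_values)
    ranked = sorted(enumerate(p_values), key=lambda kv: kv[1])
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for rank, (index, _) in enumerate(ranked, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant

p_values = [0.004, 0.030, 0.032, 0.041, 0.250]  # hypothetical, one per metric
print(benjamini_hochberg(p_values))  # [True, False, False, False, False]
# On its own, p = 0.030 would clear an uncorrected 0.05 bar; alongside four
# other metrics, it no longer clears the corrected threshold.
```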

Best practices:

Be strategic when deciding which metrics to track in an experiment. Add all the goals that are critical to measure, even if that means 10 or more. But don't track goals that aren't crucial to deciding whether an experiment succeeds or fails for your business.

Imagine that you're optimizing the search bar on your homepage; you're tempted to track how your changes impact clicks on your customer support widget. While customer support is a valid consideration, it may not be crucial to measure for this particular experiment. 

Return to your hypothesis. Does the impact on support tell you whether the hypothesis is valid? If not, this goal may just get in the way of reaching statistical significance. Avoid adding it to this particular experiment.