Why Stats Engine results sometimes differ from traditional statistics results

Relevant products:

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation

This topic describes how to:

  • Compare Optimizely Experimentation's Stats Engine methodology to the t-test in classical statistics
  • Interpret your Results page

Sometimes, Optimizely Experimentation declares a winning variation in a situation where a traditional t-test would find no statistically significant difference between it and the other variations. This happens because Optimizely Experimentation's Stats Engine does not use the classical approach. Its method is more conservative about declaring a winner and is also less likely to reverse that declaration as more data accumulates.

Stats Engine vs traditional statistics

Rather than using traditional statistical calculations to determine significance and declare a winning variation, Optimizely Experimentation computes a series of 100 successive confidence intervals over the course of an experiment. At each of those 100 points, it calculates a separate statistical significance value along with the confidence interval.

The statistical significance that appears on the Results page reflects the strongest statistical significance value (that is, the largest) that Optimizely Experimentation saw over the course of all of these sequential intervals. The value is not the average statistical significance of the entire experiment.

Similarly, the confidence interval that you see on the Results page is the intersection of all of the confidence intervals that Optimizely Experimentation created across those sequential intervals.

Because of that, the statistical significance and confidence interval that you see may not exactly match the currently observed means in the experiment.
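The intersection idea can be illustrated with a minimal sketch. The function name, the checkpoint values, and the choice to return None for non-overlapping intervals are all assumptions made for illustration; this is not Stats Engine's actual implementation.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[float, float]


def intersect_intervals(intervals: Iterable[Interval]) -> Optional[Interval]:
    """Return the overlap of all (low, high) intervals, or None if there is none.

    Hypothetical helper: keeps the highest lower bound and the lowest
    upper bound seen so far.
    """
    low, high = float("-inf"), float("inf")
    for lo, hi in intervals:
        low = max(low, lo)
        high = min(high, hi)
    if low > high:
        return None  # assumption: signal "no overlap" with None
    return (low, high)


# Made-up intervals for the difference in conversion rate at three checkpoints.
checkpoints = [(-0.010, 0.060), (0.005, 0.050), (-0.002, 0.030)]

reported = intersect_intervals(checkpoints)        # (0.005, 0.03)
latest_low, latest_high = checkpoints[-1]
latest_midpoint = (latest_low + latest_high) / 2   # center of the newest interval only
```

In this example the intersection keeps the lower bound from the second checkpoint and the upper bound from the third, so the midpoint of the newest interval does not sit at the center of the combined interval. That is the kind of mismatch with the currently observed mean described above.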


In classical statistics, a t-test is often used to measure differences in responses between two events, often in a “before-and-after” context. For example, a t-test might be used to measure the effectiveness of a particular medical treatment by comparing the health outcomes of a sample group before and after receiving the treatment.

Because it measures changes in response to an event, the t-test is also widely used in A/B testing. However, there are some weaknesses inherent in this approach, which is why Optimizely Experimentation uses its proprietary Stats Engine instead.

Unlike Stats Engine, the t-test uses only the currently observed means and the difference between them to compute statistical significance and a confidence interval. It does not take into account anything it saw earlier in the experiment.
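As a rough illustration of a calculation that depends only on the data currently in hand, the sketch below computes a Welch-style t statistic for two samples. It assumes samples large enough that a normal approximation to the p-value and interval is reasonable; the function name and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t_snapshot(
    a: Sequence[float], b: Sequence[float], alpha: float = 0.05
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (t statistic, approximate two-sided p-value, CI for mean(b) - mean(a)).

    Uses only the two samples passed in; no earlier state is consulted.
    Assumes each sample has at least two values and that they are not all identical.
    """
    diff = mean(b) - mean(a)
    # Welch standard error: each arm contributes its own sample variance.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t_stat = diff / se
    normal = NormalDist()
    # Large-sample assumption: the normal distribution stands in for Student's t.
    p_value = 2 * (1 - normal.cdf(abs(t_stat)))
    z = normal.inv_cdf(1 - alpha / 2)
    return t_stat, p_value, (diff - z * se, diff + z * se)


# Made-up data: 1 = converted, 0 = did not convert.
control = [1] * 100 + [0] * 900      # 10% conversion
variation = [1] * 130 + [0] * 870    # 13% conversion
t_stat, p_value, interval = welch_t_snapshot(control, variation)
```

Running the same function again after more visitors arrive produces a fresh answer from the new totals alone, which is why the result can move in either direction from one check to the next.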

If Optimizely Experimentation detected strong evidence of a difference between two variations at first, but then the evidence weakened over time, the statistical significance and confidence interval shown on the Results page could still reflect the strong evidence that was detected at the start of the experiment.

To learn how to capture more value from your experiments by increasing the number of conversions collected, see our article on Stats Accelerator.


As a result, Optimizely Experimentation is more conservative both about declaring a winner and about “un-declaring” one. A t-test that is checked repeatedly, on the other hand, is much more likely to show an experiment flipping into significance and then back out of it.
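The flipping behavior of a repeatedly checked test can be explored with a small simulation. The sketch below is an A/A test with made-up traffic in which both arms share the same true conversion rate, so any "significant" reading is a false positive. The function names, the seed, and the traffic numbers are illustrative assumptions, and the test uses a large-sample normal approximation rather than anything from Stats Engine.

```python
import math
import random
from statistics import NormalDist, mean, variance
from typing import List, Sequence


def fixed_horizon_flags(
    a: Sequence[float], b: Sequence[float],
    checkpoints: Sequence[int], alpha: float = 0.05,
) -> List[bool]:
    """For each checkpoint n, test only the first n visitors per arm.

    Each check ignores every earlier check, as a naive repeated t-test would.
    Every checkpoint must be at least 2.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    flags = []
    for n in checkpoints:
        xa, xb = a[:n], b[:n]
        se = math.sqrt(variance(xa) / n + variance(xb) / n)
        # Treat a zero standard error (no variation yet) as "not significant".
        flags.append(se > 0 and abs(mean(xb) - mean(xa)) / se > z_crit)
    return flags


def count_flips(flags: Sequence[bool]) -> int:
    """Count how often the result switches between significant and not significant."""
    return sum(1 for prev, cur in zip(flags, flags[1:]) if prev != cur)


# A/A test: both arms have the same true 10% conversion rate.
rng = random.Random(7)
arm_a = [1.0 if rng.random() < 0.10 else 0.0 for _ in range(2000)]
arm_b = [1.0 if rng.random() < 0.10 else 0.0 for _ in range(2000)]
flags = fixed_horizon_flags(arm_a, arm_b, range(100, 2001, 100))
print("significant at any check:", any(flags), "| flips:", count_flips(flags))
```

Trying several seeds gives a feel for how often an uncorrected test crosses the threshold at some point during an experiment, even when there is no real difference to find.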

With Stats Engine, you are far less likely to see a result where Optimizely Experimentation has incorrectly declared a winner, even briefly, than you are with a t-test.

If you are unsure whether Optimizely Experimentation is likely to “un-declare” a winner, look at the currently observed mean (that is, the ‘tick mark’). If it is at the edge of the confidence interval, Optimizely Experimentation may be accumulating evidence against the conclusion it has already drawn. In those cases, it may be worth waiting for more data. But if the observed mean is closer to the center of the confidence interval, you can feel more secure in treating the variation as a winner.
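One way to put a number on "edge" versus "center" is to express the observed mean's distance from the middle of the interval as a fraction of the interval's half-width. The helper below is a hypothetical sketch, not an Optimizely feature, and the example values are made up. How close to 1.0 counts as "near the edge" is a judgment call that this article does not specify.

```python
def edge_proximity(observed: float, low: float, high: float) -> float:
    """Return 0.0 at the interval's center, 1.0 at either edge, above 1.0 outside it."""
    if high <= low:
        raise ValueError("interval must have positive width")
    center = (low + high) / 2
    half_width = (high - low) / 2
    return abs(observed - center) / half_width


# Made-up Results page reading: interval (0.005, 0.030), tick mark at 0.014.
score = edge_proximity(0.014, 0.005, 0.030)  # roughly 0.28, closer to the center
```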