Why Stats Engine results sometimes differ from traditional statistics results

  • Optimizely Web Experimentation
  • Optimizely Personalization
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Sometimes, Optimizely declares a winning variation in a situation where a traditional t-test would fail to find any statistically significant difference between it and the other variations. This happens because Optimizely's Stats Engine uses an approach that differs from classical statistical models: it is simultaneously more conservative in declaring a winner and less likely to reverse that declaration as more data accumulates.

Stats Engine versus traditional statistics

Rather than using traditional statistical calculations, Optimizely computes a series of 100 successive evaluations throughout an experiment. Each evaluation produces its own statistical significance value and confidence interval.

The statistical significance on the Experimentation Results page reflects the smallest p-value (that is, the largest statistical significance) that Optimizely saw across those sequential evaluations. It is not the average statistical significance of the entire experiment.

Similarly, the confidence interval you see on the Results page is the intersection of all the confidence intervals that Optimizely created across those sequential evaluations.

Because of that, the statistical significance and confidence interval you see may not precisely match the currently observed means in the experiment.
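
To make the mechanics concrete, here is a minimal sketch in Python of how a displayed value could be derived from a sequence of per-evaluation results. This is illustrative only, not Optimizely's actual Stats Engine implementation, and the input numbers are hypothetical.

```python
# Sketch: deriving a displayed result from sequential evaluations.
# Illustrative only; not Optimizely's actual implementation.

def displayed_result(evaluations):
    """Each evaluation is (significance, (ci_low, ci_high))."""
    best_significance = 0.0
    lo, hi = float("-inf"), float("inf")
    for significance, (ci_low, ci_high) in evaluations:
        # Displayed significance: the strongest evidence seen so far.
        best_significance = max(best_significance, significance)
        # Displayed interval: the intersection of all intervals so far.
        lo, hi = max(lo, ci_low), min(hi, ci_high)
    return best_significance, (lo, hi)

# Hypothetical sequence: strong early evidence that weakens later.
evals = [
    (0.96, (0.8, 9.5)),   # early: significant, wide interval
    (0.91, (0.2, 7.0)),
    (0.85, (-0.5, 6.0)),  # later: evidence weakens
]
sig, ci = displayed_result(evals)
print(sig, ci)  # 0.96 (0.8, 6.0) -- still reflects the early evidence
```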

T-tests

In classical statistics, a t-test measures the difference in responses between two conditions, often in a "before-and-after" context. For example, a t-test might be used to measure the efficacy of a medical treatment by comparing the health outcomes of a sample group before and after receiving the treatment.
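
For illustration, here is a minimal paired t-test along those lines using SciPy; the health-score numbers are made up.

```python
# A minimal paired t-test sketch for the before-and-after case.
# The health scores below are hypothetical, for illustration only.
from scipy import stats

before = [72, 68, 75, 80, 66, 71, 69, 74]
after  = [78, 70, 79, 85, 70, 73, 74, 80]

# ttest_rel compares paired measurements on the same subjects.
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```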

The t-test is also widely used in A/B testing because it measures changes in response to an event. However, this approach has some inherent weaknesses, which is why Optimizely uses its proprietary Stats Engine instead.

The t-test uses only the currently observed means and their difference to compute statistical significance and a confidence interval.
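
The sketch below runs a two-sample (Welch) t-test on a single hypothetical snapshot of conversion data. Notice that nothing about earlier looks at the data enters the calculation, which is why re-testing the same experiment repeatedly with a t-test inflates the false-positive rate.

```python
# Sketch: a two-sample (Welch) t-test on one snapshot of A/B data.
# Only the currently observed conversions enter the calculation.
# The conversion rates and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.binomial(1, 0.10, size=5000)  # 0/1 conversions, ~10% rate
variation = rng.binomial(1, 0.12, size=5000)  # ~12% rate

t_stat, p_value = stats.ttest_ind(variation, control, equal_var=False)
print(f"p = {p_value:.4f}")
```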

Stats Engine, by contrast, remembers the full sequence: if Optimizely detected strong evidence of a difference between two variations early on, but that evidence weakened over time, the statistical significance and confidence interval shown on the Results page could still reflect the strong evidence detected at the start of the experiment.

To learn how to capture more value from your experiments by increasing the number of conversions collected, see Stats Accelerator.

Implications

This process makes Optimizely more conservative about declaring a winning variation. With a t-test, by contrast, an experiment is more likely to move in and out of significance as data accumulates.

With Stats Engine, you are less likely to see a result where Optimizely has incorrectly declared a winner than with a t-test.

If you are unsure whether Optimizely is likely to "un-declare" a winner, look at the currently observed mean (the tick mark). If it sits at the edge of the confidence interval, Optimizely is likely accumulating evidence against the conclusion it has already drawn, and you may want to wait. If the observed mean sits closer to the center of the confidence interval, you can feel more secure in the declared winner.
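
As a rough illustration of this rule of thumb, the sketch below (with hypothetical values) computes where the observed mean sits inside the displayed confidence interval.

```python
# Sketch of the rule of thumb above: where does the observed mean
# (the tick mark) sit inside the displayed confidence interval?
# All values are hypothetical.

def position_in_interval(observed_mean, ci_low, ci_high):
    """0.0 = at the lower edge, 0.5 = centered, 1.0 = at the upper edge."""
    return (observed_mean - ci_low) / (ci_high - ci_low)

pos = position_in_interval(observed_mean=1.2, ci_low=1.0, ci_high=6.0)
if pos < 0.1 or pos > 0.9:
    print("Observed mean near an edge: evidence may be moving against the call.")
else:
    print("Observed mean near the center: the declared winner looks stable.")
```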