- Optimizely Web Experimentation
- Optimizely Performance Edge
- Optimizely Feature Experimentation
This topic describes how to:
- Compare Optimizely Experimentation's Stats Engine methodology to the t-test in classical statistics
- Interpret your Results page
Sometimes, Optimizely Experimentation declares a winning variation in a situation where a traditional t-test would fail to find any statistically significant difference between it and the other variations. This is because Optimizely Experimentation's Stats Engine uses an approach that differs from classical statistics-based models: it is simultaneously more conservative in declaring a winner and less likely to reverse that declaration as more data accumulates.
In classical statistics, a t-test is often used to measure differences in responses between two conditions, often in a “before-and-after” context. For example, a t-test might be used to measure the efficacy of a particular medical treatment by comparing the health outcomes of a sample group before and after receiving the treatment.
Because it measures changes in response to an event, the t-test is also widely used in A/B testing. However, this approach has inherent weaknesses in an A/B testing context—most notably, its error guarantees assume the results are evaluated only once, at a predetermined sample size. This is why Optimizely Experimentation uses its proprietary Stats Engine instead.
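As a sketch of what such a before-and-after comparison involves, the snippet below computes Welch's t statistic for two samples. The measurements and variable names are invented for illustration and are not drawn from any real study:

```python
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference in means
    se = (var_a / len(sample_a) + var_b / len(sample_b)) ** 0.5
    return (mean_b - mean_a) / se

# Hypothetical "before" and "after" measurements
before = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]
after = [4.9, 5.1, 4.7, 5.0, 4.8, 5.2]

print(round(welch_t(before, after), 2))  # → 7.57
```

A t statistic this far from zero would be judged significant against a critical value from the t-distribution; the key point is that the whole calculation happens once, on a fixed set of data.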
Stats Engine vs classical statistics
Rather than using those classical statistical tools to determine significance and declare a winning variation, Optimizely Experimentation evaluates results over a series of 100 successive intervals through the course of an experiment. Each of those intervals receives its own p-value and confidence interval.
The p-value that appears on the Results page reflects the smallest p-value that Optimizely Experimentation saw over the course of all of these sequential intervals. It is not an average p-value for the entire experiment.
Similarly, the confidence interval that you see on the Results page is the intersection of all of the confidence intervals that Optimizely Experimentation created across those sequential intervals.
Because of that, the p-value and confidence interval that you see may not exactly match the currently observed means in the experiment.
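The summary logic described above can be sketched as follows. The per-interval p-values and confidence intervals here are invented placeholder numbers, and the min/intersection bookkeeping is a simplified illustration of the description above, not Optimizely's actual implementation:

```python
# Each entry: (p_value, (ci_low, ci_high)) for one sequential interval.
# These numbers are made up for illustration.
intervals = [
    (0.04, (0.5, 3.0)),
    (0.02, (0.8, 2.6)),
    (0.06, (0.4, 2.9)),
]

# Reported p-value: the smallest p-value seen across all intervals.
reported_p = min(p for p, ci in intervals)

# Reported confidence interval: the intersection of all interval CIs.
reported_ci = (max(ci[0] for p, ci in intervals),
               min(ci[1] for p, ci in intervals))

print(reported_p)   # → 0.02
print(reported_ci)  # → (0.8, 2.6)
```

Note that the reported values come from the strongest evidence seen so far, which is why they need not match the currently observed means.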
A traditional t-test, on the other hand, only uses the currently observed mean and difference to compute a p-value and confidence interval.
If Optimizely Experimentation detected strong evidence of a difference between two variations at first, but then the evidence weakened over time, the p-value and confidence interval shown in the Results page could still reflect the strong evidence that was detected at the start of the experiment.
To learn how to capture more value from your experiments, either by reducing the time to statistical significance or by increasing the number of conversions collected, see our article on Stats Accelerator.
The approach described above means that Optimizely Experimentation is more conservative both in declaring a winner and in “un-declaring” a winning variation. A traditional t-test, on the other hand, is much more likely to result in an experiment flipping into (and then back out of) significance.
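To see how a fixed-horizon test can flip in and out of significance when it is checked repeatedly, the sketch below applies a two-proportion z-test (a normal-approximation relative of the t-test, commonly used for conversion rates) to made-up cumulative snapshots of an experiment. The visitor and conversion counts are invented purely to illustrate the flip-flopping behavior:

```python
import math

def z_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test, using only the
    currently observed conversion counts."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail probability via the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Made-up cumulative snapshots: (visitors_a, conversions_a,
#                                visitors_b, conversions_b)
snapshots = [
    (200, 20, 200, 34),
    (400, 44, 400, 55),
    (600, 62, 600, 88),
]

for n_a, ca, n_b, cb in snapshots:
    p = z_p_value(ca, n_a, cb, n_b)
    # p dips below 0.05, rises above it, then dips below it again
    print(n_a, round(p, 3), "significant" if p < 0.05 else "not significant")
```

An experimenter peeking at each snapshot would declare a winner, retract it, and declare it again; Stats Engine's sequential approach is designed to avoid exactly this behavior.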
With Stats Engine, you are far less likely to see a result where Optimizely Experimentation has incorrectly declared a winner, even briefly, than you are with a t-test.
If you are unsure whether Optimizely Experimentation is likely to “un-declare” a winner, look at the currently observed mean (that is, the ‘tick mark’) on the Results page. If it sits at the edge of the confidence interval, Optimizely Experimentation may be accumulating evidence against the conclusion it has already drawn, and it may be worth waiting before acting. If the observed mean is closer to the center of the confidence interval, you can feel more secure in declaring a winner.
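One way to make the “tick mark near the edge” check concrete is a small helper like the one below. The 10% cutoff is an arbitrary illustrative threshold, not an Optimizely rule:

```python
def near_edge(observed_mean, ci_low, ci_high, threshold=0.1):
    """Return True if the observed mean sits within `threshold` (as a
    fraction of the interval's width) of either end of the confidence
    interval. The cutoff is an illustrative heuristic only."""
    width = ci_high - ci_low
    distance = min(observed_mean - ci_low, ci_high - observed_mean)
    return distance < threshold * width

print(near_edge(2.9, 0.5, 3.0))  # tick mark near the upper edge → True
print(near_edge(1.7, 0.5, 3.0))  # tick mark near the center → False
```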