- Optimizely Web Experimentation
- Optimizely Web Personalization
This topic describes how to:
- Understand the reasons statistical significance can change over time
- Read and interpret the Results page
Optimizely’s Stats Engine uses sequential experimentation rather than the fixed-horizon experiments used on other platforms. As a result, statistical significance should generally increase over time, rather than fluctuate, as Optimizely collects more and stronger evidence.
Optimizely collects two main forms of conclusive evidence as time goes on:
- Larger conversion rate differences
- Conversion rate differences that persist over more visitors
The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you'll see a Statistical Significance line that starts flat but increases sharply as Optimizely begins to collect evidence.
In a controlled environment, you should expect a stepwise, always-increasing behavior for statistical significance. When the statistical significance increases sharply, you’re seeing the experiment accumulate more conclusive evidence than it had before. Conversely, during the flat periods, the Stats Engine is not finding additional conclusive evidence beyond what it already knew about your experiments.
Below, you'll see how Optimizely collects evidence over time and displays it on the Results page. The area circled in red is the "flat" line you would expect to see early in an experiment.
When statistical significance crosses the significance threshold you have chosen, Optimizely declares a winner or loser based on the direction of the improvement.
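The decision rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Optimizely's implementation: the function name `declare_result`, the `"inconclusive"` label, and the 0.90 default threshold are assumptions made for the example.

```python
def declare_result(significance, improvement, threshold=0.90):
    """Classify a variation from its significance and improvement.

    significance: statistical significance as a fraction (0.0 to 1.0).
    improvement:  relative improvement over the baseline (negative = worse).
    threshold:    accepted significance threshold (0.90 is a hypothetical default).
    """
    # Below the threshold there is not enough evidence to call a result.
    if significance < threshold:
        return "inconclusive"
    # At or above the threshold, the direction of the improvement decides.
    if improvement > 0:
        return "winner"
    if improvement < 0:
        return "loser"
    return "inconclusive"


print(declare_result(0.95, 0.12))
print(declare_result(0.95, -0.05))
print(declare_result(0.50, 0.30))
```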
Why stat sig might go down instead of up
In a controlled environment, Optimizely's Stats Engine will provide a statistical significance calculation that is always increasing. However, real-world experiments do not run in a controlled environment, and variables can change mid-experiment. Our analysis shows that decreases in statistical significance are rare, occurring in only about 4% of experiments.
Speaking very broadly, there are two conditions under which statistical significance might fall: either there was a run of data that looked significant at first, but now Optimizely has enough additional information to say that it probably is not, or there was an underlying change in the environment that requires a more conservative approach.
If statistical significance drops by only a few percentage points, the cause is data bucketing: the way Optimizely aggregates data to calculate what the Results page shows. Larger decreases (potentially all the way to 0%) are the result of a protective measure called a stats reset.
Optimizely calculates significance from aggregated data:
- First, Optimizely splits the desired date range into 100 time-buckets of equal duration.
- Then it fills each bucket with the number of conversions and visitors seen within that time window.
These time buckets can sometimes fall out of sync as the date range changes.
In this example, the Day Two buckets miss some of the granularity of Day One. Any conversion rate fluctuations that happen in between the second day's buckets are missed when we calculate significance at the end of Day Two.
Stats Engine's statistical significance calculations depend on the entire history of the data, not just on the most recent data. For that reason, any small deviations caused by adjustments in the way Optimizely buckets the experiment data can slightly alter the statistical significance calculations.
However, bear in mind that Stats Engine was designed to be accurate for any bucketing frequency. Its false positive rate guarantee still holds, and Stats Engine remains accurate, even if significance drops by a few percentage points over time.
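The bucketing steps described above can be sketched as follows. This is a minimal illustration, assuming each visitor is represented as a `(timestamp, converted)` pair; the function name `bucket_events` and the sample data are hypothetical and do not reflect how Optimizely stores or processes events internally.

```python
from datetime import datetime, timedelta


def bucket_events(events, start, end, n_buckets=100):
    """Aggregate (timestamp, converted) pairs into equal-duration buckets.

    Returns a list of [visitors, conversions] counts, one per bucket.
    """
    width = (end - start) / n_buckets  # duration of a single bucket
    buckets = [[0, 0] for _ in range(n_buckets)]
    for ts, converted in events:
        if not (start <= ts < end):
            continue  # ignore events outside the selected date range
        # Index of the bucket this event falls into (clamped for safety).
        idx = min(int((ts - start) / width), n_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += int(converted)
    return buckets


# Hypothetical data: one visitor per minute for two days, every tenth converts.
start = datetime(2024, 1, 1)
events = [(start + timedelta(minutes=m), m % 10 == 0) for m in range(2 * 24 * 60)]

day_one = bucket_events(events, start, start + timedelta(days=1))
day_two = bucket_events(events, start, start + timedelta(days=2))
```

With a one-day range, each of the 100 buckets spans 14.4 minutes. Once the range grows to two days, each bucket spans 28.8 minutes, so Day One is now summarized by only 50 coarser buckets. That loss of granularity is the source of the small shifts in significance described above.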
If statistical significance has dropped more than just a few points—potentially even all the way down to zero—then you're probably seeing the results of a stats reset.
This usually happens when Stats Engine spots seasonality and drift in the conversion rates. Because most A/B procedures (including the t-test and Stats Engine) are only accurate in the absence of drift, Stats Engine has to compensate to protect the experiment's validity, which it does by resetting the significance.
Imagine the variation of an experiment containing a July-only promotion. During that month, the experiment may show significant uplift, due to the variation. But if the experiment continues into August, that uplift may diminish, or even disappear. In this case, any statistical conclusions based on July's data no longer apply.
By detecting this change and resetting significance, Stats Engine keeps you from jumping to an erroneous conclusion based on the lift from the July promotion, which is no longer in effect. Of course, in this example, the problem is easy to avoid because you already know about the promotion. However, there are many external factors you might not know about that could have similar, short-term effects. Stats reset is intended to guard against events like these.
Unlike in traditional statistical tests, the currently observed improvement value may not always lie exactly in the center of the confidence interval. It can fluctuate. If the improvement ever drifts outside the confidence interval, Stats Engine interprets that as strong evidence that the current lift and the historical lift are different, and will in turn trigger a stats reset, as depicted in the figure below.
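Expressed as code, the trigger condition described above is a simple bounds check. The sketch below only mirrors that stated condition. The function name `should_reset` and its parameters are hypothetical, and the way Stats Engine actually computes the interval and performs the reset is not shown here.

```python
def should_reset(observed_improvement, ci_lower, ci_upper):
    """Return True when the observed improvement lies outside the
    confidence interval, the condition described as triggering a stats reset."""
    return not (ci_lower <= observed_improvement <= ci_upper)


# Improvement inside the interval: no reset.
print(should_reset(0.05, ci_lower=0.01, ci_upper=0.10))
# Improvement has drifted above or below the interval: reset.
print(should_reset(0.15, ci_lower=0.01, ci_upper=0.10))
print(should_reset(-0.02, ci_lower=0.01, ci_upper=0.10))
```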