Stats Engine: How and why statistical significance changes over time

Relevant products:

  • Optimizely Web Experimentation
  • Optimizely Web Personalization
  • Optimizely Feature Experimentation

This topic describes how to:
  • Understand the reasons statistical significance can change over time
  • Read and interpret the Results page

Optimizely Experimentation’s Stats Engine uses sequential experimentation, not the fixed-horizon experiments you see on other platforms. As Optimizely Experimentation collects more evidence, statistical significance should generally increase over time instead of fluctuating; more substantial evidence progressively increases your statistical significance.

Optimizely Experimentation collects two primary forms of conclusive evidence as time goes on:

  • Larger conversion rate differences

  • Conversion rate differences that persist over more visitors

The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you will see a Statistical Significance line that starts flat but increases sharply as Optimizely Experimentation begins to collect evidence.

In a controlled environment, you should expect statistical significance to show stepwise, always-increasing behavior. When the statistical significance increases sharply, the experiment has accumulated more conclusive evidence than before. Conversely, during the flat periods, Stats Engine is not finding additional conclusive evidence beyond what it already knew about your experiment.
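
To see why the same underlying difference becomes more convincing as visitors accumulate, here is a minimal sketch using an ordinary fixed-horizon two-proportion z-test. This is not Stats Engine's sequential calculation, and the conversion rates, sample sizes, and function name are hypothetical; it only illustrates how a difference that persists over more visitors translates into higher significance.

```python
import math

def naive_significance(conv_a, n_a, conv_b, n_b):
    """Fixed-horizon two-proportion z-test, for illustration only.
    Stats Engine uses a different, sequential calculation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    p_value = math.erfc(z / math.sqrt(2))  # two-sided p-value
    return 1 - p_value

# The same 10% vs. 12% conversion rate difference becomes stronger
# evidence as it persists over more visitors.
for n in (500, 5_000, 50_000):
    print(n, round(naive_significance(round(0.10 * n), n, round(0.12 * n), n), 4))
```

In this sketch, the same 10% versus 12% difference yields roughly 69% significance at 500 visitors per variation, about 99.9% at 5,000, and effectively 100% at 50,000, which mirrors the sharply rising portion of the Statistical Significance line.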

Below, you will see how Optimizely Experimentation collects evidence over time and displays it on the Results page. The area circled in red is the "flat" line you would expect to see early in an experiment.

[Image: 201241868.png, the Results page graph of statistical significance over time, with the early flat period circled in red]

When statistical significance crosses your accepted significance threshold, Optimizely Experimentation declares a winner or loser based on the direction of the improvement.
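
The decision rule itself is simple. The sketch below is only an illustration; the function name and default threshold are hypothetical assumptions, not part of any Optimizely API.

```python
def declare_result(significance, improvement, threshold=0.90):
    """Illustrative decision rule (hypothetical names and threshold):
    once significance crosses the accepted threshold, the sign of the
    improvement determines whether a variation is a winner or a loser."""
    if significance < threshold:
        return "inconclusive"
    return "winner" if improvement > 0 else "loser"

# Example: 92% significance with a +3% improvement would be declared a winner.
assert declare_result(0.92, 0.03) == "winner"
```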

Why statistical significance might go down instead of up

In a controlled environment, Optimizely Experimentation's Stats Engine provides a statistical significance calculation that is constantly increasing. However, real-world experiments are not a controlled environment, and variables can change mid-experiment. Optimizely's analysis shows that a drop in significance is rare, occurring in only about 4% of experiments.

Broadly speaking, statistical significance can fall under two conditions: either a run of data looked significant at first, but Optimizely Experimentation now has enough additional information to say that it probably is not, or there was an underlying change in the environment that requires a more conservative approach.

If Stats Engine's statistical significance calculation drops, but only by a few percentage points, the cause is data bucketing, which is how Optimizely Experimentation aggregates the data used to calculate the Results page. For larger decreases (potentially all the way to 0%), there is a protective measure called a stats reset.

Data bucketing

Optimizely Experimentation calculates significance from aggregated data:

  1. Optimizely Experimentation splits the desired date range into 100 time buckets of equal duration.

  2. It fills each bucket with the number of conversions and visitors seen within that time window (see the sketch after this list).
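
Here is a minimal sketch of that bucketing step, assuming a hypothetical list of (timestamp, converted) visitor events rather than Optimizely Experimentation's internal data format.

```python
from datetime import datetime

def bucket_events(events, start, end, num_buckets=100):
    """Split [start, end) into equal-duration buckets and count the
    visitors and conversions that fall in each one (illustrative only)."""
    bucket_width = (end - start) / num_buckets
    buckets = [{"visitors": 0, "conversions": 0} for _ in range(num_buckets)]
    for timestamp, converted in events:
        index = min(int((timestamp - start) / bucket_width), num_buckets - 1)
        buckets[index]["visitors"] += 1
        if converted:
            buckets[index]["conversions"] += 1
    return buckets

# Example: the same events land in different buckets when the date range grows
# from one day to two, because every bucket doubles in duration.
events = [(datetime(2024, 7, 1, 9, 30), True), (datetime(2024, 7, 1, 18, 0), False)]
day_one = bucket_events(events, datetime(2024, 7, 1), datetime(2024, 7, 2))
day_two = bucket_events(events, datetime(2024, 7, 1), datetime(2024, 7, 3))
```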

These time buckets can sometimes fall out of sync as the date range changes.

[Image: buckets.png, showing the Day One and Day Two time buckets falling out of alignment]

In this example, the Day Two buckets miss some of the granularity of Day One. Any conversion rate fluctuations that happen between the second day's buckets are missed when Optimizely Experimentation calculates significance at the end of Day Two.

Stats Engine's statistical significance calculations depend on the entire history of the data, not just on the most recent data. For that reason, any small deviations caused by adjustments in the way Optimizely Experimentation buckets the experiment data can slightly alter the statistical significance calculations.

However, remember that Stats Engine was designed to be accurate for any bucketing frequency, so the stated false positive rate still holds, and Stats Engine remains accurate, even if the significance drops by a few percentage points over time.

Stats reset

If statistical significance has dropped more than just a few points—potentially even all the way down to zero—then you are probably seeing the results of a stats reset.

This usually happens when Stats Engine spots seasonality and drift in the conversion rates. Because most A/B procedures (including the t-test and Stats Engine) are only accurate in the absence of drift, Stats Engine has to compensate to protect the experiment's validity, which it does by resetting the significance.

Imagine an experiment whose variation contains a July-only promotion. During that month, the experiment may show a significant uplift due to the variation. But if the experiment continues into August, that uplift may diminish or even disappear. In this case, any statistical conclusions based on July's data no longer apply.

By detecting this change and resetting significance, Stats Engine keeps you from jumping to an erroneous conclusion based on a lift that is no longer in effect once the July promotion ends. Of course, in this example, the mistake is easy to avoid because you already know about the promotion. However, there are many external factors you might not know about that could have similar, short-term effects. Stats reset is intended to guard against events like these.

Unlike in traditional statistical tests, the currently observed improvement value may not always lie exactly in the center of the confidence interval; it can fluctuate. If the improvement ever drifts outside the confidence interval, Stats Engine interprets that as strong evidence that the current lift and the historical lift are different and triggers a stats reset, as depicted in the figure below.

[Image: stats_reset.png, showing the observed improvement drifting outside the confidence interval and triggering a stats reset]
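
The trigger behind a stats reset can be sketched as a simple containment check. The function and values below are illustrative assumptions, not Stats Engine's actual implementation.

```python
def needs_stats_reset(observed_improvement, interval_low, interval_high):
    """Illustrative only: if the currently observed improvement drifts
    outside the confidence interval built from the experiment's history,
    treat that as strong evidence of drift and reset significance."""
    return not (interval_low <= observed_improvement <= interval_high)

# Example: historical data put the lift between +2% and +8%, but the
# current observation has fallen to -1%, so a reset would be triggered.
assert needs_stats_reset(-0.01, 0.02, 0.08)
```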