How and why statistical significance changes over time in Optimizely Experimentation

  • Optimizely Web Experimentation
  • Optimizely Web Personalization
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Optimizely Experimentation's Stats Engine uses sequential experimentation, not the fixed-horizon experiments you see on other platforms. As Optimizely Experimentation collects more evidence, statistical significance should generally increase over time rather than fluctuate. Stronger evidence progressively increases your statistical significance.

Optimizely Experimentation collects two primary forms of conclusive evidence as time goes on:

  • Larger conversion rate differences

  • Conversion rate differences that persist over more visitors

The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you will see a Statistical Significance line that starts flat but increases sharply as Optimizely Experimentation begins to collect evidence.

In a controlled environment, you should expect a stepwise, always-increasing behavior for statistical significance. When the statistical significance increases sharply, the experiment accumulates more conclusive evidence than before. Conversely, during the flat periods, the Stats Engine is not finding additional conclusive evidence beyond what it already knew about your experiments.

Below, you will see how Optimizely Experimentation collects evidence over time and displays it on the Results page. The area circled in red is the "flat" line you would expect to see early in an experiment.

When statistical significance crosses the significance threshold you have set, Optimizely will declare a winner or loser based on the direction of the improvement.
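The declaration rule described above can be sketched in a few lines. This is only an illustration, not Optimizely's implementation: the function name `declare_result`, the 0.90 default threshold, and the choice to treat reaching the threshold exactly as crossing it are all assumptions made for the example.

```python
def declare_result(significance: float, improvement: float,
                   threshold: float = 0.90) -> str:
    """Classify a variation as 'winner', 'loser', or 'inconclusive'.

    significance and threshold are fractions in [0, 1]; improvement is the
    relative lift of the variation over the baseline (negative means worse).
    The names and the default threshold are hypothetical.
    """
    if not 0.0 <= significance <= 1.0:
        raise ValueError("significance must be between 0 and 1")
    # Assumption: reaching the threshold exactly counts as crossing it.
    if significance < threshold or improvement == 0:
        return "inconclusive"
    # The direction of the improvement decides winner versus loser.
    return "winner" if improvement > 0 else "loser"


print(declare_result(0.95, 0.12))   # significance above threshold, positive lift
print(declare_result(0.95, -0.05))  # significance above threshold, negative lift
print(declare_result(0.60, 0.30))   # large lift, but not enough evidence yet
```

The third call shows the point made earlier: a large observed difference alone is not enough until significance reaches the threshold.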

Why statistical significance might go down instead of up

In a controlled environment, Optimizely Experimentation's Stats Engine will provide a statistical significance calculation that is constantly increasing. However, experiments in the real world are not a controlled environment, and variables can change mid-experiment, which can cause statistical significance to decrease. Our analysis shows that this is rare: it happens in only about 4% of experiments.

Speaking very broadly, there are two conditions under which statistical significance might fall:

  • A run of data looked significant at first, but Optimizely Experimentation now has enough additional information to say that it probably is not.
  • An underlying change in the environment requires a more conservative approach.

If Stats Engine lowers its statistical significance calculation by only a few percent, the cause is time bucketing, which is how Optimizely Experimentation aggregates the data used to calculate the Results page. Larger decreases (potentially all the way to 0%) come from a protective measure called a stats reset.

Time bucketing

Optimizely Experimentation calculates significance from aggregated data:

1. Optimizely Experimentation splits the desired date range into 100 time-buckets of equal duration.

2. The 100 buckets are constantly getting larger as the experiment timeline gets longer. When an experiment launches, visitors are placed into 100 time buckets. Each time bucket has a duration of the total running time of the experiment divided by 100.

3. Visitors are always getting reshuffled among the 100 ever-expanding time buckets.

New visitors arrive in an experiment, denoted in blue. As before, Optimizely Experimentation places the visitors into one of the 100 time buckets.

Even though the total running time of the experiment has increased, Optimizely Experimentation still divides the total time into 100 buckets. Therefore the duration of each time bucket expands.

Because the time buckets have become longer, Optimizely Experimentation may reshuffle the visitors into a different time bucket, denoted in yellow. We continuously reshuffle the group of visitors across the 100 time buckets. Stats Engine recalculates results based on these continuously updating buckets. Consequently, there are slight fluctuations in statistical significance.
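The reshuffling can be illustrated with a small sketch. It assumes only what the text states (100 equal-duration buckets spanning the total running time); the function name `bucket_index`, the use of hours as the time unit, and the 0-based indexing are choices made for this example and are not Optimizely's actual code.

```python
NUM_BUCKETS = 100  # the text states the date range is split into 100 buckets


def bucket_index(visitor_hours: float, running_hours: float,
                 num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based time bucket that contains a visitor's arrival time.

    visitor_hours: hours between experiment start and the visitor's arrival.
    running_hours: total hours the experiment has been running so far.
    """
    if running_hours <= 0:
        raise ValueError("running_hours must be positive")
    if not 0 <= visitor_hours <= running_hours:
        raise ValueError("visitor must arrive within the running time")
    index = int(visitor_hours * num_buckets / running_hours)
    # A visitor arriving exactly at the end belongs to the last bucket.
    return min(index, num_buckets - 1)


# The same visitor, who arrived 30 hours after launch, lands in a different
# bucket once the experiment has run twice as long.
print(bucket_index(30, running_hours=48))
print(bucket_index(30, running_hours=96))
```

Because the bucket boundaries move as the running time grows, the visitors grouped together in any one bucket change too, which is why the recalculated significance can shift slightly.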

Time bucketing example

For example, if an experiment has been running for 20 days, then each time bucket represents 4.8 hours' worth of data. If an experiment has been running for 100 days, each bucket represents 24 hours or one day's worth of data. Since the time interval for the buckets is constantly changing, the number of visitors and conversions per bucket also changes, and the statistical significance has to be recalculated for each bucket each time.
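The arithmetic behind these figures is simply the running time divided by 100. A minimal sketch, where the helper name `bucket_duration_hours` is hypothetical:

```python
def bucket_duration_hours(running_days: float, num_buckets: int = 100) -> float:
    """Duration of each time bucket, in hours, for a given running time."""
    if running_days <= 0 or num_buckets <= 0:
        raise ValueError("running_days and num_buckets must be positive")
    return running_days * 24 / num_buckets


for days in (20, 100):
    print(f"{days} days -> {bucket_duration_hours(days):.1f} hours per bucket")
```

For the two cases in the text this gives 4.8 hours per bucket at 20 days and 24.0 hours per bucket at 100 days.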

For an example of why statistical significance can change, consider an experiment where statistical significance is at 45% at the end of today. Tomorrow, today's data will be grouped with data from previous or future days as the length of each time bucket increases. If the time bucket that contains today's data also contains data from other days, the statistical significance value for that time bucket could end up higher or lower due to the influence of the other days' data.

Why Optimizely Experimentation uses time buckets

Mathematically speaking, statistical significance could be updated after every incoming visitor. Practically speaking, performing all these computations would result in very long load times for the Results page. So, Stats Engine divides the experiment into 100 time-interval buckets, then groups all visitors within each bucket together. Optimizely Experimentation calculates the statistical significance and confidence interval per bucket. However, the consequence of using this approximation is that as the experiment runs, visitors get shuffled into different buckets as the time interval buckets get longer. These small changes manifest in (usually tiny) fluctuations in statistical significance.

The 100-bucket approximation does not, in any way, compromise the validity of the test. The statistical significance calculated from the 100 buckets will be no larger than the statistical significance calculated after every visitor. So, Stats Engine upholds the false positive guarantee. Further, the resulting graphs can help an experimenter identify any unusual patterns in their experiment, such as atypical traffic patterns or huge spikes in conversion rates.

Stats reset

If statistical significance has dropped more than just a few points, potentially down to zero, then you are probably seeing the results of a stats reset.

Stats resets usually happen when Stats Engine spots seasonality and drift in the conversion rates. Because most A/B procedures (including the t-test and Stats Engine) are only accurate in the absence of drift, Stats Engine has to compensate to protect the experiment's validity, which it does by resetting the significance.

Imagine the variation of an experiment containing a July-only promotion. During that month, the experiment may show significant uplift due to the variation. But if the experiment continues into August, that uplift may diminish or even disappear. In this case, any statistical conclusions based on July's data no longer apply. 

Optimizely recommends running all tests for a minimum of one business cycle (seven days) to ensure all kinds of user behavior are accounted for.

By detecting this change and resetting significance, Stats Engine keeps you from jumping to an erroneous conclusion, namely that the lift from the July promotion is still in effect. Of course, in this example, the effect is easy to avoid because you already know about the promotion. However, there are many external factors you might not know about that could have similar, short-term effects. Stats reset is intended to guard against events like these.

Unlike in traditional statistical tests, the currently-observed improvement value may not always lie exactly in the center of the confidence interval. It can fluctuate. If the improvement ever drifts outside the confidence interval, Stats Engine interprets that as strong evidence that the current lift and the historical lift are different and will in turn trigger a stats reset, as depicted in the figure below.
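The trigger condition described above can be written as a one-line check. This is a simplified sketch of the stated rule only: the function name `should_reset` and its parameters are hypothetical, and the real Stats Engine logic for computing the interval and handling the reset is not shown here.

```python
def should_reset(improvement: float, ci_lower: float, ci_upper: float) -> bool:
    """Return True when the observed improvement lies outside the interval.

    improvement: currently observed improvement (for example 0.08 for +8%).
    ci_lower, ci_upper: bounds of the current confidence interval.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    # Inside the interval (bounds included) is treated as consistent with
    # the historical lift; anything outside is treated as evidence of drift.
    return not (ci_lower <= improvement <= ci_upper)


print(should_reset(0.10, ci_lower=0.05, ci_upper=0.15))  # inside the interval
print(should_reset(0.02, ci_lower=0.05, ci_upper=0.15))  # drifted below it
```

Treating the interval bounds as inclusive is an assumption of this sketch; the text only says that drifting outside the interval triggers a reset.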



Stats reset information

Stats resets are protective measures. They do not automatically indicate that something is wrong with your experiment. Despite the reset, Stats Engine still remembers all of the data collected before the reset to help you regain significance.

Stats resets can happen for a few reasons. Here are the most common:

  • Data initially looked significant but no longer does after gathering additional information.
  • An underlying change in the environment requires Stats Engine to be more conservative.
  • Traffic allocation changed mid-experiment, which caused accuracy issues.

Making changes mid-experiment can lead to incorrect conclusions. See changing an experiment while it is running for more information.

Stats reset is a strength of Optimizely's Stats Engine. It is unwilling to declare victory too soon and does not stubbornly hold on to a conclusion contradicted by later data.