Relevant products:
 Optimizely Web Experimentation
 Optimizely Web Personalization
 Optimizely Full Stack
This topic describes how to:
 Understand the reasons statistical significance can change over time
 Read and interpret the Results page
Optimizely's Stats Engine uses sequential experimentation rather than the fixed-horizon experiments found on other platforms. As Optimizely collects more evidence, statistical significance should generally increase over time instead of fluctuating: more substantial evidence progressively raises your statistical significance.
Optimizely collects two primary forms of conclusive evidence as time goes on:

Larger conversion rate differences

Conversion rate differences that persist over more visitors
The weight of this evidence depends on time. Early in an experiment, when your sample size is still small, large deviations between conversion rates are treated more conservatively than they are once your experiment has a larger number of visitors. As a result, you will see a Statistical Significance line that starts flat and then climbs sharply as Optimizely collects evidence.
In a controlled environment, you should expect stepwise, always-increasing behavior for statistical significance. When statistical significance jumps sharply, the experiment has accumulated more conclusive evidence than before. Conversely, during flat periods, Stats Engine is not finding additional conclusive evidence beyond what it already knew about your experiment.
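The always-increasing behavior described above can be pictured with a toy mixture sequential probability ratio test (mSPRT), the family of always-valid tests Optimizely has described publicly. Everything below—the function name, the mixing parameter `tau2`, and the simulated data—is an illustrative assumption, not Optimizely's actual implementation.

```python
import math
import random

def msprt_significance(control, variation, tau2=0.01):
    """Toy always-valid significance for two Bernoulli (0/1 conversion) streams.

    Returns one significance value (1 minus the always-valid p-value) per paired
    observation. Illustrative sketch only; the real Stats Engine differs in detail.
    """
    p_always_valid = 1.0
    sig = []
    cx = cy = 0  # running conversion counts for control and variation
    for n, (x, y) in enumerate(zip(control, variation), start=1):
        cx += x
        cy += y
        pooled = (cx + cy) / (2 * n)
        sigma2 = max(pooled * (1 - pooled), 1e-9)  # per-visitor variance estimate
        theta = cy / n - cx / n                    # observed lift
        v = 2 * sigma2                             # variance of one paired difference
        # Normal-mixture likelihood ratio for the running difference in means
        lam = math.sqrt(v / (v + n * tau2)) * math.exp(
            n * n * tau2 * theta * theta / (2 * v * (v + n * tau2))
        )
        # Taking the running minimum makes the p-value (and hence the
        # significance) monotone: significance can only flatten or rise.
        p_always_valid = min(p_always_valid, 1 / lam)
        sig.append(1 - min(p_always_valid, 1.0))
    return sig

random.seed(7)
control = [1 if random.random() < 0.10 else 0 for _ in range(5000)]
variation = [1 if random.random() < 0.13 else 0 for _ in range(5000)]
curve = msprt_significance(control, variation)
```

Because the p-value is a running minimum, `curve` never decreases: flat stretches correspond to periods with no new conclusive evidence, and jumps correspond to bursts of evidence, matching the stepwise shape described above.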
Below, you will see how Optimizely collects evidence over time and displays it on the Results page. The area circled in red is the "flat" line you would expect to see early in an experiment.
When statistical significance crosses your accepted threshold for statistical significance, we will declare a winner or loser based on the direction of the improvement.
Why stat sig might go down instead of up
In a controlled environment, Optimizely's Stats Engine produces a statistical significance calculation that is constantly increasing. However, real-world experiments do not run in a controlled environment, and variables can change mid-experiment. Our analysis shows that this is rare, occurring in only about 4% of experiments.
Speaking very broadly, there are two conditions under which statistical significance might fall: either there was a run of data that looked significant at first, but now Optimizely has enough additional information to say that it probably is not, or there was an underlying change in the environment that requires a more conservative approach.
If Stats Engine's statistical significance calculation drops by only a few percent, the cause is data bucketing: how Optimizely aggregates the data used to calculate the Results page. For larger decreases (potentially all the way to 0%), there is a protective measure called a stats reset.
Data bucketing
Optimizely calculates significance from aggregated data:

Optimizely splits the desired date range into 100 time buckets of equal duration

Optimizely fills each bucket with the number of conversions and visitors seen within that time window
These time buckets can sometimes fall out of sync as the date range changes.
In this example, the Day Two buckets miss some of the granularity of Day One. Any conversion rate fluctuations that happen in between the second day's buckets are missed when we calculate significance at the end of Day Two.
Stats Engine's statistical significance calculations depend on the entire history of the data, not just on the most recent data. For that reason, any small deviations caused by adjustments in the way Optimizely buckets the experiment data can slightly alter the statistical significance calculations.
However, remember that Stats Engine was designed to be accurate for any bucketing frequency, so the false positive rate guarantee still holds, and Stats Engine remains accurate even if significance drops by a few percentage points over time.
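The bucketing scheme above can be sketched as follows. The 100-bucket count comes from this article; the function name, data shapes, and sample events are illustrative assumptions, not Optimizely code.

```python
from datetime import datetime, timedelta

def bucket_events(events, start, end, n_buckets=100):
    """Aggregate (timestamp, converted) visitor events into n_buckets equal
    time windows spanning [start, end). Illustrative sketch only."""
    width = (end - start) / n_buckets
    buckets = [{"visitors": 0, "conversions": 0} for _ in range(n_buckets)]
    for ts, converted in events:
        if not (start <= ts < end):
            continue  # event falls outside the selected date range
        # Integer bucket index; the min() guards the final boundary
        i = min(int((ts - start) / width), n_buckets - 1)
        buckets[i]["visitors"] += 1
        buckets[i]["conversions"] += int(converted)
    return buckets

# One synthetic visitor every 5 minutes across one day, converting every 10th visit
start = datetime(2023, 7, 1)
end = datetime(2023, 7, 2)
events = [(start + timedelta(minutes=m), m % 10 == 0) for m in range(0, 1440, 5)]
buckets = bucket_events(events, start, end)
```

Note that changing `start` or `end` shifts every bucket boundary, so the same events can land in different buckets on different days—this is the re-aggregation effect that can nudge significance by a few points.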
Stats reset
If statistical significance has dropped more than just a few points—potentially even all the way down to zero—then you are probably seeing the results of a stats reset.
This usually happens when Stats Engine detects seasonality or drift in the conversion rates. Because most A/B testing procedures (including the t-test and Stats Engine) are only accurate in the absence of drift, Stats Engine must compensate to protect the experiment's validity, which it does by resetting significance.
Imagine that the variation of an experiment contains a July-only promotion. During that month, the experiment may show significant uplift due to the variation. But if the experiment continues into August, that uplift may diminish or even disappear. In that case, any statistical conclusions based on July's data no longer apply.
By detecting this change and resetting significance, Stats Engine keeps you from jumping to an erroneous conclusion—in this case, acting on a lift from the July promotion that is no longer in effect. Here, the mistake is easy to avoid because you already know about the promotion. However, many external factors you might not know about can have similar short-term effects, and stats reset is intended to guard against events like these.
Unlike in traditional statistical tests, the currently observed improvement value may not always lie exactly at the center of the confidence interval; it can fluctuate. If the improvement ever drifts outside the confidence interval, Stats Engine interprets that as strong evidence that the current lift and the historical lift differ, and in turn triggers a stats reset, as depicted in the figure below.
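One way to picture the reset trigger is as a simple range check: when the observed improvement falls outside the historical confidence interval, the accumulated significance is discarded. The function below is a hypothetical simplification of that idea, not Stats Engine's actual logic.

```python
def should_reset(observed_improvement, ci_low, ci_high):
    """Hypothetical drift check: an observed improvement outside the
    historical confidence interval is treated as evidence that the
    underlying lift has changed, which would trigger a stats reset."""
    return not (ci_low <= observed_improvement <= ci_high)

# In the July-promotion example: July's data put the lift in roughly a
# [2%, 8%] interval, but August's observed lift of -1% falls outside it,
# so significance would be reset. A 5% lift inside the interval would not.
assert should_reset(-0.01, 0.02, 0.08) is True
assert should_reset(0.05, 0.02, 0.08) is False
```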