History of how Optimizely Experimentation controls Simpson's Paradox in experiments with Stats Accelerator enabled

  • Optimizely Web Experimentation
  • Optimizely Web Personalization
  • Optimizely Feature Experimentation
  • Optimizely Full Stack Experimentation (Legacy)

In 2017, Optimizely announced the release of Stats Accelerator and multi-armed bandits, a pair of multi-armed bandit algorithms for accelerating experimentation. These algorithms are designed either to maximize rewards over a period of time or to identify a statistically significant variation as quickly as possible, by intelligently changing the allocation of traffic between variations (or arms, in machine learning terminology) of the experiment.

However, when underlying conversion rates or means vary over time (for instance, due to seasonality), dynamic traffic allocation can cause substantial bias in estimates of the difference between the treatment and the control, a phenomenon known as Simpson's Paradox. This bias can invalidate statistical results on experiments using Stats Accelerator, breaking usual guarantees on false discovery rate control.

Bias, in this context, is the part of the observed difference in conversion rates that does not reflect the true, underlying difference.

To prevent this, Optimizely developed Epoch Stats Engine, a simple modification to Optimizely's existing Stats Engine methodology that makes it robust to Simpson's Paradox. At its core, it is a stratified estimate of the difference in means between the control and the treatment. Because it requires no estimation of the underlying time variation and is compatible with other central limit theorem-based approaches such as the t-test, Optimizely believes it lifts a substantial roadblock to combining traditional hypothesis testing and bandit approaches to A/B testing in practice.

Time variation definition

A fundamental assumption behind many A/B tests is that the underlying parameter values you are interested in do not change over time. When this assumption is violated, there is time variation. In the context of A/B experiments using Stats Accelerator, Simpson's Paradox can only occur in the presence of time variation, so understanding time variation is important.

In each experiment, suppose that there are underlying quantities that determine the performance of each variation. Optimizely is interested in measuring these quantities, but they cannot be observed. These are parameters.

For example, in a conversion rate Optimizely Web Experimentation experiment, Optimizely presumes that each page variation has some ability to induce a conversion for each visitor. If Optimizely expresses this ability as the probability that each visitor will convert, then these true conversion probabilities would be the parameters of interest for each variation. Because parameters are unobserved, Optimizely must compute point estimates from the data to infer parameter values and decide whether the control or treatment is better. In the conversion rate example, the observed conversion rates would be point estimates for the true conversion probabilities of each variation.

Noise and randomness will cause point estimates to fluctuate. In the basic A/B example, Optimizely views these fluctuations as centered around a constant parameter value. However, when the parameter values themselves fluctuate over time, reflecting underlying changes in the actual performance of the control and treatment, Optimizely says that there is time variation.

save-tag-event.png

Example of a traditional view with no time variation of the parameter values behind an A/B test.

figure-1b.png

An example of time variation impacting underlying parameter values in an A/B test.

Time variation examples

Seasonality

The classic example of time variation is seasonality. For example, it is often reasonable to assume that visitors to a website during the workweek behave differently than visitors on weekends. Conversion rate experiments may therefore see higher (or lower) observed conversion rates on weekdays than on weekends, reflecting time variation in the true conversion rates caused by genuine differences in how visitors behave on weekdays and weekends.

One-time effects

Time variation can also manifest as a one-time effect. A landing page with a banner for a 20%-off sale valid for December may generate a higher-than-usual number of purchases per visitor for December but then drop off after January arrives. This would be a time variation in the parameter representing the average number of purchases per visitor for that page.

Time variation can take different forms and affect your results in different ways. Whether time variation is cyclic (seasonality) or transient (a one-time effect) suggests different ways to interpret your results. Another critical distinction is how the time variation affects the different arms of the experiment.

  • Symmetric time variation – Occurs when parameters vary across time so that arms of the experiment are affected equally (in a sense to be defined shortly).
  • Asymmetric time variation – Covers the broad swath of scenarios in which parameters vary across time so that arms of the experiment are affected unequally.

Optimizely Experimentation's Stats Engine has a feature to detect substantial asymmetric time variation. When it does, Optimizely resets your statistical results to avoid misleading conclusions. However, handling asymmetric time variation generally requires strong assumptions and a wholly different type of analysis. This remains an open area of research.

Symmetric time variation example

In what follows, Optimizely will restrict the examples to the symmetric case with an additive effect for simplicity. Specifically, imagine the parameters for the control and treatment, θC(t) and θT(t), may be written as θC(t) = μC + f(t) and θT(t) = μT + f(t), so that each can be decomposed into a non-time-varying component (μC or μT) and a common source of time variation f(t). The underlying lift may then be written as the non-time-dependent quantity: μT – μC = θT(t) – θC(t).

Generally, symmetric time variation occurs when the source of the time variation is not associated with a specific variation but rather with the entire population of visitors. For example, visitors during the winter holiday are more inclined to make purchases. A variation that induces more purchases will maintain its difference over the control even though the overall holiday-influenced conversion rate is higher for both the variation and the control.

Generally, most A/B testing procedures, such as the t-test and sequential testing, are robust to this type of time variation. Optimizely is often less interested in estimating the individual parameters of the control and treatment and more interested in the difference between the parameters. If both parameters are impacted in an additive manner by the same amount, then such time variations are canceled out when differences are taken, and any subsequent inferences are relatively unaffected. Using the notation above, this can be seen in the fact that the difference in the time-varying parameters θT(t) – θC(t) = μT – μC does not contain the time-varying factor f(t).
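The cancellation described above can be checked numerically. The following is a minimal sketch, not Optimizely's implementation: the baseline values `MU_C` and `MU_T` and the weekly sine-shaped shift `f` are hypothetical choices made only for illustration.

```python
import math

MU_C = 0.10  # hypothetical non-time-varying component for the control
MU_T = 0.12  # hypothetical non-time-varying component for the treatment


def f(t):
    """Hypothetical shared time variation: a 7-day seasonal cycle."""
    return 0.03 * math.sin(2 * math.pi * t / 7)


def theta_c(t):
    """Time-varying control parameter: mu_C + f(t)."""
    return MU_C + f(t)


def theta_t(t):
    """Time-varying treatment parameter: mu_T + f(t)."""
    return MU_T + f(t)


# The individual parameters move from day to day, but their difference does not.
lifts = [theta_t(t) - theta_c(t) for t in range(14)]
for t, lift in enumerate(lifts):
    print(f"day {t:2d}: control={theta_c(t):.4f} "
          f"treatment={theta_t(t):.4f} lift={lift:.4f}")
```

Every printed lift equals μT – μC (up to floating-point rounding) because the shared term f(t) drops out of the subtraction.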

However, when dynamic traffic allocation is introduced, the innocuous-seeming case of symmetric time variation can become a different problem.

Simpson's paradox

Suppose the traffic split in an experiment is adjusted in sync with the underlying time variation. A disproportionate amount of high- or low-performing traffic may then be allocated to one arm relative to the other, biasing Optimizely's view of the true difference between the two arms. This is a form of Simpson's paradox, in which a trend that appears in several different groups of data disappears or reverses when the groups are aggregated. This bias can completely invalidate experiments on any platform by tricking the stats methodology into declaring a larger or smaller effect size than what exists.

See Simpson's Paradox: Discover possibilities with your segments, not shipping decisions.

Simpson's paradox example

For example, consider a two-month conversion rate experiment with one control and one treatment. In November, the actual conversion rates for the control and treatment are 10% and 20%, respectively. For December, they rise to 20% and 30%. Each month's conversion rate difference is ten percentage points (pp).

Suppose traffic is split 50% to treatment and 50% to control (or any other fixed proportion) for the entire experiment. In that case, the final estimate of the difference between the two variations should be close to 10 pp. What happens if traffic is split 50/50 in November but then changes to 75% for control and 25% for treatment in December? For simplicity, assume 1000 total visitors to the experiment each month. A simple calculation shows that:

Control

Total visitors – 500 + 750 = 1250

Percent from high-converting regime – 750/1250 = 60%

Treatment

Total visitors – 500 + 250 = 750

Percent from high-converting regime – 250/750 = 33%

So, the control and treatment have equal numbers of visitors from low-converting November, but the treatment has far fewer visitors than the control from high-converting December. This imbalance indicates bias, and working through the math confirms it. The conversion rate for the control is around 16%, and the conversion rate for the treatment is about 23%, a difference of only about 7 pp rather than the 10 pp that you would expect.
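The arithmetic behind these figures can be reproduced with a short script. This is an illustrative sketch that works with expected conversions (true rate times visitors) rather than sampled data; the dictionary layout and the helper name `pooled_rate` are choices made for this example only.

```python
# True conversion rates and visitor counts from the November/December example.
rates = {
    "November": {"control": 0.10, "treatment": 0.20},
    "December": {"control": 0.20, "treatment": 0.30},
}
visitors = {
    "November": {"control": 500, "treatment": 500},  # 50/50 split
    "December": {"control": 750, "treatment": 250},  # 75/25 split
}


def pooled_rate(arm):
    """Expected conversion rate when both months are lumped together."""
    conversions = sum(rates[m][arm] * visitors[m][arm] for m in rates)
    total = sum(visitors[m][arm] for m in rates)
    return conversions / total


control_rate = pooled_rate("control")
treatment_rate = pooled_rate("treatment")
naive_difference = treatment_rate - control_rate

print(f"control:    {control_rate:.4f}")
print(f"treatment:  {treatment_rate:.4f}")
print(f"difference: {naive_difference:.4f} (true difference is 0.10 in each month)")
```

The pooled difference falls short of 10 pp even though the within-month difference is exactly 10 pp in both months.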

figure-2a.png

Movement of observed conversion rates under time variation with constant traffic allocation.

figure-2a-part2.png

Movement of observed conversion rates under time variation with changing traffic allocation. The simulation initially begins at 50% allocation to treatment, switching to 90% allocation to treatment after 2,500 visitors.

In this example, the diminished point estimate may cause a statistical method to fail to report significance when it otherwise would with an unbiased point estimate closer to 10 pp. But other adverse effects are also possible. When there is no true effect (for instance, when running an A/A test), Simpson's Paradox can create the illusion of a significant positive or negative impact, leading to inflated false positives. And when the time variation is extreme or the traffic allocation syncs up well with the time variation, the bias can be so drastic as to reverse the sign of the estimated difference between the control and treatment parameters (as seen in the changing-allocation graph above), completely misleading experimenters as to the true effect of their proposed change.
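Both failure modes can be illustrated with the same two-month setup (50/50 allocation in the first month, 75/25 in the second). The sketch below uses expected values rather than sampled data. The A/A rates and the small 2 pp true lift are hypothetical numbers chosen for illustration, not figures from Optimizely.

```python
def pooled_rate(rates, visitors):
    """Expected conversion rate when all periods are lumped together."""
    conversions = sum(r * n for r, n in zip(rates, visitors))
    return conversions / sum(visitors)


control_visitors = [500, 750]    # month 1, month 2
treatment_visitors = [500, 250]  # allocation shifts away from treatment

# Case 1: A/A test. Both arms convert at 10%, then 20% (no true effect).
aa_rates = [0.10, 0.20]
aa_difference = (pooled_rate(aa_rates, treatment_visitors)
                 - pooled_rate(aa_rates, control_visitors))

# Case 2: treatment is truly better by 2 pp in both months (hypothetical).
control_rates = [0.10, 0.20]
treatment_rates = [0.12, 0.22]
reversed_difference = (pooled_rate(treatment_rates, treatment_visitors)
                       - pooled_rate(control_rates, control_visitors))

print(f"A/A pooled difference:       {aa_difference:+.4f} (true effect is 0)")
print(f"+2 pp lift pooled difference: {reversed_difference:+.4f} (true effect is +0.02)")
```

In the first case the pooled estimate shows a nonzero difference where none exists. In the second, the pooled estimate is negative even though the treatment is better in every period.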

Epoch Stats Engine

Because Simpson's Paradox manifests as a bias in the point estimate of the difference in means of the control and treatment variations, mitigating Simpson's Paradox requires a way to remove such bias. In turn, eliminating bias requires accounting for one of the two factors causing it:

  1. Time variation.
  2. Dynamic traffic allocation.

Because time variation is unknown and must be estimated, while traffic allocation (proportion of total traffic included in the experiment) is directly controlled by you and is known, Optimizely Experimentation opted to focus on the latter.

Optimizely Experimentation's solution for avoiding Simpson's Paradox comes from the observation that bias due to Simpson's paradox cannot occur over periods of constant traffic allocation.

Optimizely Experimentation can derive an unbiased point estimate by making estimates within periods of constant allocation (called epochs) and then aggregating those individual within-epoch estimates to obtain one unbiased across-epoch estimate.
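As a rough illustration of this idea, the sketch below applies it to the November/December example, treating each month as one epoch. It assumes each epoch is weighted by its share of total visitors and uses true rates in place of observed sample means. The function name `epoch_estimate` and the dictionary keys are hypothetical, not Optimizely's API.

```python
# One entry per epoch (period of constant traffic allocation).
epochs = [
    {"n_c": 500, "rate_c": 0.10, "n_t": 500, "rate_t": 0.20},  # November, 50/50
    {"n_c": 750, "rate_c": 0.20, "n_t": 250, "rate_t": 0.30},  # December, 75/25
]


def epoch_estimate(epochs):
    """Weighted average of within-epoch differences (treatment minus control).

    Each epoch's weight is its share of all visitors seen so far.
    """
    n = sum(e["n_c"] + e["n_t"] for e in epochs)
    estimate = 0.0
    for e in epochs:
        weight = (e["n_c"] + e["n_t"]) / n
        estimate += weight * (e["rate_t"] - e["rate_c"])
    return estimate


estimate = epoch_estimate(epochs)
print(f"epoch-stratified difference: {estimate:.4f}")
```

Because the difference is computed inside each epoch before aggregating, the change in allocation between months no longer distorts the result, and the estimate recovers the 10 pp difference.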

How epochs are compatible with Optimizely's Stats Engine, in theory

Here Optimizely will demonstrate how using epochs can help experiments avoid Simpson's Paradox.

Suppose there are K(n) total epochs when the experiment has seen n visitors. Within each epoch k, denote by nk,C and nk,T the control and treatment sample sizes, respectively, and by X̄k and Ȳk the sample means of the control and treatment, respectively. Letting nk = nk,C + nk,T, the epoch estimator for the difference in means is:

epoch-part1.png

Because the dependence across epochs induced by the data-dependent allocation rule is restricted to changes in the relative balance of traffic between the control and treatment, the within-epoch estimates are orthogonal, and the variance for Tn is well-estimated by the sum of the estimated variances of each within-epoch component:

epoch-part2.png

Here, σ̂C and σ̂T are consistent estimates for the standard deviations of the data-generating processes for the control and treatment arms.

At a high level, Tn is a stratified estimate where the strata represent data belonging to individual epochs of fixed allocation. At a low level, this is a weighted estimate of within-epoch estimates. The weight assigned to each epoch is proportional to the total number of visitors within that epoch. Optimizely also surfaces the epoch estimate as "Weighted improvement" on the results page for experiments using Stats Accelerator.
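The variance of such a weighted estimate can be sketched as follows. This assumes the standard form for a weighted sum of independent difference-in-means estimates, in which each epoch contributes (nk / n)² × (σ̂C² / nk,C + σ̂T² / nk,T). Whether this matches the exact expression in the formula image is an assumption, and the standard deviations used here are hypothetical inputs.

```python
import math


def epoch_variance(epoch_sizes, sigma_c, sigma_t):
    """Estimated variance of an epoch-weighted difference in means.

    epoch_sizes: list of (control_visitors, treatment_visitors), one per epoch.
    sigma_c, sigma_t: estimated standard deviations of the two arms.
    """
    n = sum(n_c + n_t for n_c, n_t in epoch_sizes)
    variance = 0.0
    for n_c, n_t in epoch_sizes:
        weight = (n_c + n_t) / n
        # Variance of the within-epoch difference, scaled by the squared weight.
        variance += weight ** 2 * (sigma_c ** 2 / n_c + sigma_t ** 2 / n_t)
    return variance


# Two epochs: 50/50 allocation, then 75/25 allocation (1000 visitors each).
sizes = [(500, 500), (750, 250)]
variance = epoch_variance(sizes, sigma_c=0.4, sigma_t=0.4)  # hypothetical sigmas
standard_error = math.sqrt(variance)

print(f"variance:       {variance:.6f}")
print(f"standard error: {standard_error:.4f}")
```

Note that the unbalanced second epoch contributes more variance than the balanced first one, even though both contain the same number of visitors.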

weighted-improvement.png

Calculation of an epoch-stratified estimate (computed at 15,000 visitors) of the difference in true conversion probabilities in an experiment with a traffic allocation change and one-time symmetric time variation occurring at 10,000 visitors.

The epoch estimate is guaranteed to be unbiased because each within-epoch estimate is unbiased. Optimizely provides rigorous guarantees that the epoch estimate is fully compatible with the sequential test employed by Optimizely Experimentation and is valid for use in other central limit theorem-based methods, such as the t-test. See Acceleration of A/B/n Testing under time-varying signals.

How epochs are compatible with Optimizely's Stats Engine, in practice, using simulated data

Optimizely simulated data with time variation and ran four different Stats Engine configurations on that data:

  1. Standard Stats Engine.
  2. Epoch Stats Engine.
  3. Standard Stats Engine with Stats Accelerator.
  4. Epoch Stats Engine with Stats Accelerator.

Specifically, Optimizely generated 600,000 draws from 7 Bernoulli arms: one control and six variants, of which one truly converts at a higher rate and the other five convert at the same rate as the control. The conversion rate for the control starts at 0.10 and then undergoes cyclic time variation, rising as high as 0.15. In each of these plots, Optimizely plots visitors on the horizontal axis and false discovery rate or true discovery rate on the vertical axis, averaging over 1000 simulations.
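A data-generating process of this kind might look roughly like the sketch below. Only the number of arms and the 0.10 to 0.15 range come from the description above. The cosine shape of the cycle, the `PERIOD` length, the `WINNER_LIFT` size, and the uniform fixed allocation are all assumptions made for illustration, and the sketch does not include any bandit policy or Stats Engine logic.

```python
import math
import random

N_ARMS = 7           # arm 0 is the control, arm 1 is the truly better variant
BASE_RATE = 0.10     # control conversion rate at the start
PEAK_RATE = 0.15     # highest control conversion rate during a cycle
WINNER_LIFT = 0.02   # assumption: additive lift of the winning arm
PERIOD = 100_000     # assumption: visitors per full seasonal cycle


def control_rate(visitor_index):
    """Control conversion probability, cycling between BASE_RATE and PEAK_RATE."""
    phase = 2 * math.pi * visitor_index / PERIOD
    return BASE_RATE + (PEAK_RATE - BASE_RATE) * (1 - math.cos(phase)) / 2


def arm_rate(arm, visitor_index):
    """Symmetric time variation: every arm shares the control's cycle."""
    lift = WINNER_LIFT if arm == 1 else 0.0
    return control_rate(visitor_index) + lift


def simulate(n_visitors, seed=0):
    """Draw Bernoulli outcomes under a fixed, uniform allocation."""
    rng = random.Random(seed)
    visitors = [0] * N_ARMS
    conversions = [0] * N_ARMS
    for i in range(n_visitors):
        arm = rng.randrange(N_ARMS)
        visitors[arm] += 1
        conversions[arm] += int(rng.random() < arm_rate(arm, i))
    return visitors, conversions


visitors, conversions = simulate(20_000)
for arm in range(N_ARMS):
    print(f"arm {arm}: {conversions[arm]}/{visitors[arm]} conversions")
```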

figure-4.png

Average false discovery rate over time for the simulation scenario described above.

The false discovery rate graph shows that Epoch Stats Engine protects you from false discoveries due to Simpson's Paradox. The non-epoch bandit policy's false discovery rate exceeds the configured false discovery rate level (0.10) by up to 150%.

In contrast, the epoch-enabled bandit policy shows proper control of the false discovery rate at levels comparable to those achieved by Stats Engine without the bandit policy enabled. The main goal is to bring the false discovery rate of bandit-enhanced A/B testing down to levels similar to those seen without the bandit enabled.

figure-5.png

Average true discovery rate over time for the simulation scenario described above.

The true discovery rate plot shows that Optimizely Experimentation does not lose much power due to switching from standard to Epoch Stats Engine.

First, Optimizely observes a large gap between the bandit allocation runs and the fixed allocation runs, reflecting that speedup due to bandit allocation is preserved under Epoch Stats Engine. Furthermore, Optimizely observes little difference in time to significance between the epoch and non-epoch scenarios under fixed allocation.

In contrast, Optimizely observes a small gap in time to significance between the epoch and non-epoch scenarios under the bandit policy. This gap can be explained by the fact that the non-epoch Stats Engine, when run under dynamic allocation, is highly sensitive to time variation, especially after crossing an epoch boundary, which creates the scalloped shape of the blue curve.

Conclusion

Epoch Stats Engine is:

  • Proven to work well with sequential testing in both theory and simulation.
  • Simple to compute and understand.
  • Widely applicable to other methods and platforms, not just sequential testing and Optimizely.