Simpson's Paradox: Discover possibilities with your segments, not shipping decisions

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

The bottom line

Changing the traffic distribution in a live experiment invites Simpson's Paradox and ruins your experiment results.

Simpson's Paradox explained

Simpson's Paradox is a statistical phenomenon where a trend observed in individual data groups vanishes or reverses when the groups are combined.

This bias can invalidate experiments on any platform, not just Optimizely, by misleading the statistical methodology into declaring a larger or smaller effect size than actually exists. The paradox concerns the disparity between the results of an entire dataset and the results of the same dataset grouped by specific characteristics: group-level outcomes can differ from, or even contradict, the overall result.

In summary, a trend that certain data groups exhibit may disappear when those groups are combined.

Bias is the observed difference in conversion rates that does not reflect the true, underlying difference.

Example of Simpson's Paradox affecting an experiment's results

For example, consider an A/B test run over two weeks:

Week One – You are very cautious about a new variation and are worried it will fail, so you assign 99% of the traffic to the control and 1% of the traffic to the variation. The experiment runs for a week, and a statistically significant difference in favor of the variation emerges.

Week Two – Feeling more confident, you decide to reallocate and increase traffic to the variation. There is now a balanced, 50/50 split. However, at the end of Week Two, when you evaluate all data from both weeks, the control is performing better than the variation.

What happened?

In the first week, nearly all of the traffic (99%) went to the control, while in the second week only half (50%) did. As a result, roughly two-thirds of all control traffic came from the first week, whereas almost all of the variation's traffic (about 98%) came from the second week. If the underlying conversion rate differs between the weeks (for example, because of seasonality), the combined comparison reflects that time difference rather than the true treatment effect: the control's results are weighted toward week one and the variation's toward week two. Although the Optimizely Experimentation application does not forbid it, Optimizely strongly recommends not changing the traffic distribution of an active experiment.
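To make the arithmetic concrete, here is a minimal Python sketch of the scenario above. The weekly visitor counts and conversion rates are illustrative assumptions (not Optimizely data), chosen so that the variation wins within each week but loses once both weeks are pooled.

```python
# Simpson's Paradox in a two-week A/B test: the variation wins each week,
# but the control wins when both weeks are pooled, because the traffic
# split changed between weeks. All numbers below are illustrative.

weeks = {
    # label: (control_visitors, control_conversions,
    #         variation_visitors, variation_conversions)
    "Week 1 (99/1 split, high-converting week)": (9_900, 990, 100, 11),
    "Week 2 (50/50 split, low-converting week)": (5_000, 250, 5_000, 275),
}

total = [0, 0, 0, 0]
for label, counts in weeks.items():
    cv, cc, vv, vc = counts
    print(f"{label}: control {cc / cv:.2%} vs variation {vc / vv:.2%}")
    total = [t + x for t, x in zip(total, counts)]

cv, cc, vv, vc = total
print(f"Combined: control {cc / cv:.2%} vs variation {vc / vv:.2%}")
# Week 1:   control 10.00% vs variation 11.00%  -> variation ahead
# Week 2:   control  5.00% vs variation  5.50%  -> variation ahead
# Combined: control  8.32% vs variation  5.61%  -> control ahead (the paradox)
```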

Traffic distribution changes

Changing the traffic distribution (the proportion of traffic sent to a particular variation) in reaction to interim results invites Simpson's Paradox and detrimentally affects your results.

Examples of ad-hoc, mid-experiment changes in reaction to interim results include:

  • Manually changing the traffic distribution percentage midway through a live experiment.
  • Stopping one variant because it is performing poorly.

When you make ad-hoc changes, you invite confirmation bias: you look for outcomes that align with your existing expectations.

If you want to change the traffic distribution properly, Optimizely advises the following:

  1. Pause the experiment.
  2. Either duplicate the experiment in Optimizely Web Experimentation or copy the experiment rule in Optimizely Feature Experimentation.
  3. Publish the new experiment.

Traffic allocation

Changing the traffic allocation (proportion of total traffic included in the experiment) does not cause Simpson's Paradox. However, in some situations, poor execution of traffic allocation can hurt business decision-making as well.

An example is rolling out features to a small percentage of the audience and gradually exposing those features to more users. 

Although starting your experiment at a low traffic allocation (such as below 10%) and adjusting it later does not hurt the experiment's validity, it is costly to keep an experiment running for a long time (for example, more than a few days) at that low allocation.

The experiment will not have sufficient power to detect a meaningful impact. Without sufficient power, an experimenter risks placing users in a bad experience for an unnecessarily long period, which may induce permanent user churn.
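The cost of running at a low allocation can be estimated with a standard two-proportion sample-size calculation. The following sketch uses only the Python standard library; the baseline conversion rate, minimum detectable effect, and daily traffic figures are illustrative assumptions, not Optimizely defaults.

```python
# Days needed to reach 80% power for a two-proportion z-test, at different
# traffic allocations. Baseline rate, effect size, and daily traffic are
# illustrative assumptions.
from statistics import NormalDist

def visitors_per_arm(p1: float, p2: float,
                     alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate per-arm sample size to detect p1 vs p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

baseline, lifted = 0.05, 0.055          # 5% baseline, 10% relative MDE
daily_site_traffic = 20_000             # visitors per day reaching the page

n = visitors_per_arm(baseline, lifted)
for allocation in (0.05, 0.10, 0.50, 1.00):
    # 50/50 distribution within the allocated slice -> each arm gets half
    daily_per_arm = daily_site_traffic * allocation / 2
    print(f"{allocation:>4.0%} allocation: ~{n / daily_per_arm:5.1f} days to power")
```

Under these assumptions, the experiment reaches 80% power in about three days at full allocation but needs roughly a month at 10%, which illustrates why lingering at a low allocation is costly.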

Ramping is the practice of gradually exposing more traffic to new test variations. If it is not done carefully, the process can introduce inefficiency and risk.
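As a rough illustration of what "careful" ramping means, the sketch below only advances the allocation after the current stage has observed a minimum number of visitors and the guardrail metrics look healthy. The stage sizes, threshold, and function are hypothetical, not an Optimizely feature.

```python
# A simple staged ramp: increase traffic allocation only after the current
# stage has collected enough visitors to check guardrail metrics.
# Stage sizes and the dwell threshold are illustrative assumptions.

RAMP_STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of traffic allocated
MIN_VISITORS_PER_STAGE = 5_000           # dwell requirement before advancing

def next_allocation(current: float, visitors_in_stage: int,
                    guardrails_ok: bool) -> float:
    """Return the traffic allocation for the next period."""
    if not guardrails_ok:
        return 0.0                        # roll back on a guardrail breach
    if visitors_in_stage < MIN_VISITORS_PER_STAGE:
        return current                    # keep collecting data at this stage
    higher = [s for s in RAMP_STAGES if s > current]
    return higher[0] if higher else current   # advance, or hold at 100%
```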

The distinction between these two terms can be challenging to remember:
  • Traffic Distribution – Proportion of traffic sent to a particular variation.
  • Traffic Allocation – Proportion of total traffic included in the experiment.

The relationship between segmenting results and Simpson's Paradox

Simpson's Paradox underscores the importance of making decisions based on the overall experiment results first, rather than searching for user segments that support weak, spurious conclusions. Interpreting metric movements at the segment level in Optimizely Web Experimentation or Optimizely Feature Experimentation can be misleading, so Optimizely does not recommend making shipping decisions based on particular segments. Focus on the average treatment effect demonstrated on the aggregate (overall) population in the experiment, and use segment results to discover possibilities and ideas for future testing directions. See Take action based on the results of an experiment.

Specific subgroups

You may observe in your results that an event was positive for some subgroups. You can still learn from these segments and discover possibilities if you keep the sure-thing principle (STP) in mind. The STP says that if an action increases the probability of an event in every subpopulation, it must also increase the probability of that event in the whole population, provided the action does not change the distribution of the subpopulations. Use segmented results on major subgroups, such as desktop and mobile, as targets for more focused, rigorous experimentation. See Feature Experimentation: target audiences or Web: Target Audiences for more information.
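The proviso matters, as the following sketch with hypothetical desktop and mobile segments shows: when randomization keeps the segment mix identical in both arms, per-segment lifts carry over to the aggregate, but when the mix differs between arms, the aggregate can reverse even though the variation wins in every segment. All figures are illustrative.

```python
# Sure-thing principle, illustrated with hypothetical desktop/mobile segments.
# The variation converts better in BOTH segments; whether that advantage
# survives aggregation depends on whether both arms share the same segment mix.

def aggregate(segments):
    """segments: list of (visitors, conversions) tuples for one arm."""
    visitors = sum(v for v, _ in segments)
    conversions = sum(c for _, c in segments)
    return conversions / visitors

# Per-segment conversion rates (same in both scenarios):
# desktop: control 10%, variation 11%   mobile: control 2%, variation 3%

# Scenario A: randomization holds, both arms are 50% desktop / 50% mobile.
control_a   = [(5_000, 500), (5_000, 100)]   # (desktop, mobile)
variation_a = [(5_000, 550), (5_000, 150)]
print(f"A: control {aggregate(control_a):.2%} vs variation {aggregate(variation_a):.2%}")
# -> control 6.00% vs variation 7.00%: the per-segment win carries over.

# Scenario B: the segment mix differs between arms,
# so the STP's proviso is violated.
control_b   = [(9_000, 900), (1_000, 20)]    # 90% desktop
variation_b = [(1_000, 110), (9_000, 270)]   # 90% mobile
print(f"B: control {aggregate(control_b):.2%} vs variation {aggregate(variation_b):.2%}")
# -> control 9.20% vs variation 3.80%: the aggregate reverses (Simpson's Paradox).
```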