Imbalance detected: What to do if Optimizely's automatic SRM detection alerts you to an imbalance in your Stats Engine A/B test

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Sample ratio mismatch (SRM)

A sample ratio mismatch (SRM) occurs when the traffic distribution between variations in an A/B experiment becomes severely imbalanced because of an implementation issue. This imbalance can degrade the quality of the experiment and, in extreme cases, produce inaccurate results. Keep in mind that not every imbalance is cause for immediate panic or for abandoning the experiment.

Timing matters when checking an experiment for an imbalance, which is why Optimizely's automatic SRM detection evaluates an experiment continuously. A detected imbalance is a symptom of an underlying data quality issue; implementation errors and third-party bots are the most common culprits behind experiment imbalances.
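
For context, the classic fixed-horizon SRM check (the kind most online calculators perform) is a chi-square goodness-of-fit test of observed visitor counts against the intended split. The sketch below uses hypothetical counts and is only meant to show the idea; Optimizely's automatic detection uses a continuous approach instead, as discussed later in this article.

```python
from scipy.stats import chisquare

# Hypothetical cumulative visitor counts for an experiment intended to split 50/50.
observed = [5210, 4790]
intended_split = [0.5, 0.5]
expected = [p * sum(observed) for p in intended_split]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (for example, < 0.001) means the observed split is very
# unlikely under the intended allocation, i.e., a possible sample ratio mismatch.
print(f"chi-square = {stat:.2f}, p = {p_value:.2g}")
```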

Automatic SRM detection alerts

An automatic SRM detection alert does not necessarily mean your experiment is ruined. If Optimizely's automatic SRM detection alerts you to an imbalance in your Stats Engine A/B test, the alert may indicate an external influence affecting the distribution of traffic. Exercise caution and avoid overreacting to every traffic disparity; an alert does not automatically mean the experiment is useless.

The sections below illustrate the severity levels of an SRM imbalance with some real-world examples. They are categorized as follows (from highest severity to lowest):

  1. Immediate, critical failure
  2. High risk of experiment corruption
  3. Monitor performance
  4. Expected behavior

The following examples are not an exhaustive list of imbalance causes.

Immediate, critical failure

A Critical experiment health status is the highest-priority visitor imbalance, and you should investigate why the traffic imbalance is occurring. It means the experiment shows a statistically significant difference in visitor counts from the split the experimenter intended, which points to a consistent, non-ignorable underlying assignment bias.

The following are examples of situations that may have occurred. Decide what to do with your experiment, depending on your situation. See additional Causes of imbalance.

Example 1: Traffic distribution changes mid-experiment

You have three variations in your experiment. You set the traffic distribution to 0% for one of those variations. After the experiment launches, someone on your team ramps up that 0% traffic distribution. Now your results look strange.

The experiment has become completely corrupted. It is entirely unusable. End the experiment immediately.

To restart your experiment safely, duplicate it and do not adjust the traffic distribution while it is running. As a best practice, Optimizely strongly recommends against setting a single variation's traffic to 0%. Instead, remove that variation from the experiment, because no traffic is being sent to it anyway.

Changing an experiment's traffic distribution while the experiment is running is one of the most damaging things an experimenter can do to their experimentation program. In particular, launching the experiment with a variation at 0% traffic and later redistributing traffic to that variation is a sure way to destroy your results: it triggers a severe imbalance and invites Simpson's paradox directly into the experiment.

Example 2: The results page shows a different traffic split than what you set

Your experiment is set to a 50/50 traffic split, but the results page displays an 80/20 split. In addition, none of the metrics shows the uplift behavior, or even its polar opposite, that similar experiments have consistently produced.

Optimizely's automatic SRM detection identifies the earliest date of your experiment's imbalance. Correlate that date with any unusual behavior in the experiment change history. This lets you investigate whether a team member has biased the traffic distribution.

You can also use that earliest date to investigate whether third-party audience segmentation software was triggered, purposely or accidentally, and rerouted traffic while the experiment was live.

Another way to determine whether an experiment was headed for corrupted results from launch is to use carefully chosen guardrail metrics as monitoring goals. Guardrail metrics, such as bounce rate, help triangulate evidence of fatal implementation errors.

High risk of experiment corruption

A Minor experiment health status indicates that a minor visitor imbalance has been detected. Investigate why the visitor imbalance is occurring. The following examples describe situations that can cause a minor visitor imbalance. Decide what to do with your experiment based on your situation. See additional Causes of imbalance.

Example 1: The percentage of overall traffic allocation moves down and then up

Someone on your team toggles the percentage of overall traffic allocation down and back up again. Now Optimizely Web Experimentation or Optimizely Feature Experimentation results report an imbalance.

You can cause an imbalance in your own experiments if you down-ramp traffic (reduce traffic allocation) and then up-ramp it again (increase traffic allocation). Slow, steady up-ramping of traffic on its own does not cause an imbalance in experiments.

The easiest way to avoid imbalances associated with allocation is to refrain from decreasing and then increasing total traffic while an experiment is live. For more information, review Optimizely's documentation on how user bucketing works.

For example, an experimenter launches an experiment with 80% of the audience allocated, then down-ramps the traffic allocation to 50%, and later up-ramps it back to 80%. Users previously exposed to the Optimizely Feature Experimentation flag may no longer see it after the traffic allocation is ramped back up.

Check your experiment history for Web Experimentation or the flag's history for Feature Experimentation to troubleshoot traffic allocation changes. This lets you determine whether the traffic allocation was changed and by whom.

To avoid this issue entirely in Optimizely Feature Experimentation, take a proactive approach: only raise traffic monotonically (ramp traffic up in one direction) or implement a user profile service (UPS).

UPS is only compatible with experiments, not rollouts.

For more information, review Optimizely's documentation on how to ensure consistent visitor bucketing.
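
As a rough sketch of the UPS approach with the Optimizely Feature Experimentation Python SDK: a user profile service is any object with lookup and save methods that the SDK calls to persist each visitor's assignments. The in-memory store below is only for illustration (the SDK key is a placeholder, and a real service should write to durable storage such as Redis or a database).

```python
from optimizely import optimizely


class InMemoryUserProfileService:
    """Minimal user profile service: remembers each visitor's variation
    assignments so they stay stable across traffic-allocation changes."""

    def __init__(self):
        self._profiles = {}

    def lookup(self, user_id):
        # Return the stored profile dict, or None if this user has no profile yet.
        return self._profiles.get(user_id)

    def save(self, user_profile):
        # user_profile is a dict containing 'user_id' and 'experiment_bucket_map'.
        self._profiles[user_profile["user_id"]] = user_profile


optimizely_client = optimizely.Optimizely(
    sdk_key="YOUR_SDK_KEY",  # placeholder
    user_profile_service=InMemoryUserProfileService(),
)
```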

Monitor performance

Continue to monitor the performance of your experiment if a minor visitor imbalance is detected. The following are examples of situations that can cause the Experiment Health status to indicate a minor visitor imbalance. See additional Causes of imbalance.

Example 1: Redirect experiments and imbalances

When you create a redirect or URL split test, versions A and B sit on different URLs, whereas in a typical A/B test, all versions in the experiment share the same URL. Redirect tests can be efficient for small teams because they recycle and reuse existing content while testing for impact; homepage hero testing is a common example.

Redirect experiments produce valid results. However, due to the nature of redirects, users may close the window or tab and exit the page before the redirect finishes executing. When that happens, Optimizely does not receive the data, and the user is dropped from the count. Optimizely expects this behavior, so a slight imbalance beyond a 2% fluctuation may occur.

There are multiple reasons why redirects are associated with imbalances. For example:

  1. The browser may reject it if there are too many redirects. Optimizely may not be the only thing redirecting the user; it may be one step in a series of redirects.
  2. A user may have a browser setting or extension that rejects redirects.
  3. The delay could be long enough that a user closes the tab before the redirect finishes.

Example 2: You have a highly specific, targeted audience, and your experiment displays an imbalance after 48 hours

When there are conditions and constraints on who is exposed to an experiment, a slight imbalance can emerge before the experiment's first business cycle is complete (usually seven days). These experiments can still offer an experimentation program valid results; whether to act depends on the experimenter's tolerance for imbalances before one business cycle has completed. These imbalances do resolve after one business cycle.

Expected behavior

A Good experiment health status indicates that no visitor imbalance is detected. You do not need to do anything; your experiment is running smoothly. See Good experiment health status. However, it is important to note the following.

You may observe that the number of visitors assigned to each experiment variation is never an exact 50/50 split, yet the experiment shows a green checkmark of "good" health.

This is not a bug. Do not expect an exact, perfect 50/50 split for every experiment you run. There will always be some slight deviation. 

An imbalance occurs when the actual proportion of traffic does not match the intended proportion assigned to a variation. You cannot judge by eye how improbable a given deviation is, or whether assignment bias exists, across the life of the experiment. When an experiment shows a good health status alongside a slightly imperfect split, the algorithm has determined that nothing deviates unusually from what the experimenter intended.

This behavior comes from Optimizely's hashing function, which determines which variation to show each user. Optimizely uses a MurmurHash function to assign visitors. For bucketing, that means Optimizely assigns each user a number between 0 and 10,000 to determine whether they qualify for the experiment and, if so, which variation they see.
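
A simplified sketch of MurmurHash-style bucketing is shown below. The seed, key construction, and modulo mapping are illustrative assumptions, not Optimizely's exact implementation; the point is that the hash makes assignment deterministic per user.

```python
import mmh3  # MurmurHash3 bindings


def bucket(user_id: str, experiment_id: str, seed: int = 1) -> int:
    """Deterministically map a user to a bucket in [0, 10000).
    The seed and key construction here are illustrative assumptions."""
    return mmh3.hash(f"{user_id}{experiment_id}", seed) % 10_000


def variation(user_id: str, experiment_id: str) -> str:
    """50/50 split: buckets 0-4,999 see variation 'a', 5,000-9,999 see 'b'."""
    return "a" if bucket(user_id, experiment_id) < 5_000 else "b"


# The same user always lands in the same bucket, so assignment is sticky
# across page loads without storing any state.
print(variation("visitor-123", "exp-42"))  # prints the same letter every time
```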

Think of it as a coin flip per user, but that coin flip always gives the same result for the same user. If you flip a coin 10,000 times, it is extremely unlikely you will get precisely 5,000 heads and 5,000 tails.

  • Achieving a perfect 50/50 split of exactly 5,000 heads has a probability of about 0.008 (0.8%), no matter how many times you repeat the process.
  • You will, however, almost always get approximately 5,000 heads in 10,000 independent and identically distributed (IID) fair coin flips, as the quick calculation after this list confirms.
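
Both bullet points can be checked directly against the binomial distribution:

```python
from scipy.stats import binom

n, p = 10_000, 0.5

# Probability of exactly 5,000 heads in 10,000 fair flips: roughly 0.008 (0.8%).
print(binom.pmf(5_000, n, p))

# Probability the split lands between 49% and 51% (4,900-5,100 heads): about 95%.
print(binom.cdf(5_100, n, p) - binom.cdf(4_899, n, p))
```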

Your team went through all your old experiments and found imbalances using online sample ratio mismatch calculators.

Performing a retroactive or end-of-experiment imbalance check is not a recommended use of your time. Retroactive imbalance testing informs you about a possible implementation problem only after the experiment has collected all its data, which is far too late and defeats the reason most experimenters want imbalance detection in the first place.

Optimizely emphasizes the importance of running automatic checks. The automatic SRM detection algorithm created at Optimizely checks for imbalances after every data point, not just at the end, so that you can identify actual problematic imbalances at the first sign of trouble.
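
The sketch below illustrates the general idea of a check that remains valid after every data point. It is not Optimizely's algorithm; it is a generic anytime-valid sequential test built from a Bayes factor, with a hypothetical 60/40 bug simulated so the alert fires.

```python
import math
import random

from scipy.special import betaln


def srm_bayes_factor(n_a, n_b, intended_a=0.5, prior_a=100.0, prior_b=100.0):
    """Bayes factor for 'the true split differs from intended_a' (Beta prior)
    versus 'the true split equals intended_a', given cumulative counts.
    Large values are evidence of a sample ratio mismatch."""
    log_alt = betaln(prior_a + n_a, prior_b + n_b) - betaln(prior_a, prior_b)
    log_null = n_a * math.log(intended_a) + n_b * math.log(1.0 - intended_a)
    return math.exp(log_alt - log_null)


alpha = 0.01  # tolerated false-alert rate for the whole experiment
counts = {"a": 0, "b": 0}
random.seed(7)

# Simulate a broken 60/40 assignment even though the experimenter intended 50/50.
for visitor in range(1, 20_001):
    counts["a" if random.random() < 0.60 else "b"] += 1
    # Checking after every visitor is safe here: flagging only when the Bayes
    # factor exceeds 1/alpha keeps the false-alert rate below alpha, no matter
    # how often you look (Ville's inequality).
    if srm_bayes_factor(counts["a"], counts["b"]) > 1.0 / alpha:
        print(f"Possible SRM after {visitor} visitors: {counts}")
        break
```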

Segmenting experiment results

Optimizely Experimentation does not check for visitor imbalances when you segment your results. Use segmentation only for exploring results, not for making decisions.

Interpret segments for a redirect experiment

There is a concept called the Sure Thing Principle (STP) [1, 2]. Essentially, if doing something increases the chances of a bad outcome across almost all of your segments, it also dramatically raises the chances of a bad outcome in your experiment overall.

Citations:

  1. Savage, L. J. (1954). The Foundations of Statistics. John Wiley & Sons.
  2. Joyce, J. M. (1999). The Foundations of Causal Decision Theory. Cambridge University Press.

Paused or archived experiments and flags

Optimizely Experimentation does not check for visitor imbalances for the following: