Imbalance detected: What to do if Optimizely's automatic SRM detection alerts you to an imbalance

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Sample ratio mismatch (SRM)

A sample ratio mismatch (SRM) occurs when the traffic distribution between variations in an A/B experiment becomes severely imbalanced due to an implementation issue. This imbalance may lead to experiment degradation and, in extreme cases, inaccurate results. It is important to remember that not all imbalances should cause immediate panic and abandonment of the experiment.

Timing matters when checking an experiment for an imbalance, which is why Optimizely's automatic SRM detection evaluates an experiment continuously. A detected imbalance is a symptom of an underlying data quality issue. Implementation errors and third-party bots are the most common culprits behind experiment imbalances.
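To make the idea of an imbalance concrete, here is a minimal sketch of a classic fixed-horizon chi-square check for two variations. This is not Optimizely's automatic SRM detection algorithm, which evaluates the experiment continuously; the function names and visitor counts are hypothetical and serve only to show what "observed split versus configured split" means numerically.

```python
import math

def srm_chi_square(observed, expected_ratios):
    """Chi-square goodness-of-fit statistic for observed visitor counts."""
    total = sum(observed)
    expected = [total * ratio for ratio in expected_ratios]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_two_variations(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    Only valid for experiments with exactly two variations.
    """
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for an experiment configured as a 50/50 split.
mild = srm_chi_square([5050, 4950], [0.5, 0.5])    # small, ordinary deviation
severe = srm_chi_square([8000, 2000], [0.5, 0.5])  # an 80/20 split

print(mild, p_value_two_variations(mild))
print(severe, p_value_two_variations(severe))
```

A large p-value (as for the mild case) is consistent with ordinary random variation, while a p-value near zero (as for the 80/20 case) signals that the observed split is very unlikely under the configured distribution.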

Automatic SRM detection alerts

An automatic SRM detection alert does not necessarily mean your experiment is ruined. If Optimizely's automatic SRM detection alerts you to an imbalance, that may indicate an external influence affecting the distribution of traffic. It is important to exercise caution and refrain from overreacting to every traffic disparity, as this does not instantly signify that an experiment is useless.

See below for an illustration of the severity level of an SRM imbalance with some real-world examples. They are categorized as follows (from highest severity to lowest):

  1. Immediate, Critical Failure
  2. High Risk of Experiment Corruption
  3. Monitor Performance
  4. Expected Behavior

The following examples are not an exhaustive list of imbalance causes.

Immediate, Critical Failure

You had three variations in your experiment. You set the traffic distribution to 0% for one of those variations. After the launch of the experiment, someone on your team ramped up that 0% traffic distribution. Now your results look strange.

The experiment has become completely corrupted. It is entirely unusable. End the experiment immediately.

To restart your experiment safely, duplicate it and do not adjust the traffic distribution while it is running. As a best practice, we strongly recommend against setting a single variation to 0% traffic. Instead, remove that variation from the experiment, since no traffic is being sent to it.

Changing an experiment's traffic distribution while the experiment is running is the absolute worst thing any experimenter can do to their experiment program. Specifically, launching an experiment with a variation set to 0% traffic and later sending traffic to that variation is a reliable way to destroy results: it deliberately triggers a severe imbalance and invites Simpson's Paradox directly into the experiment.
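The following sketch illustrates how Simpson's Paradox can arise when the traffic distribution changes mid-experiment. All numbers are made up for illustration: variation B converts better than A in each period, but because B received most of its traffic during the low-converting period, the pooled results point the other way.

```python
# Hypothetical (visitors, conversions) per variation, before and after
# someone changed the traffic distribution from 10/90 to 90/10.
periods = {
    "before change": {"A": (1000, 100), "B": (9000, 990)},
    "after change": {"A": (9000, 1800), "B": (1000, 210)},
}

def rate(visitors, conversions):
    """Conversion rate for one variation in one period."""
    return conversions / visitors

def pooled_rate(variation):
    """Conversion rate for one variation with both periods lumped together."""
    visitors = sum(period[variation][0] for period in periods.values())
    conversions = sum(period[variation][1] for period in periods.values())
    return conversions / visitors

for name, data in periods.items():
    print(name, {v: round(rate(*counts), 3) for v, counts in data.items()})
print("pooled", {v: round(pooled_rate(v), 3) for v in ("A", "B")})
```

In this made-up data, B wins within each period (11% vs. 10%, then 21% vs. 20%), yet the pooled rates are 19% for A and 12% for B. The reversal comes entirely from the changed traffic split, not from any real difference between the variations.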

Your experiment is set at a 50/50 split but displays a bizarre 80/20 split. None of the metrics show the polar opposite or the expected uplift behavior presented consistently in similar experiments.

Optimizely's automatic SRM detection identifies the earliest date of your experiment imbalance. Correlate that date with any unusual behavior in the experiment change history. This allows you to investigate if a team member has biased traffic distribution.

With the earliest date, experimenters can investigate if there was a purposeful or accidental triggering of a third-party audience segmentation software that rerouted traffic while the experiment was live.

One additional way to determine if an experiment was destined from launch for corrupted results is to employ carefully chosen guardrail metrics as monitoring goals. Guardrail metrics, such as bounce rate, are tremendously helpful in triangulating evidence for fatal implementation errors.

High Risk of Experiment Corruption

Someone on your team toggled the percentage of overall traffic allocation down and back up again. Now your Optimizely Web Experimentation or Optimizely Feature Experimentation results are reporting an imbalance.

An experimenter causes an imbalance in their own experiment if they down-ramp traffic (reduce traffic allocation) and then up-ramp traffic (increase traffic allocation). Slow and steady up-ramping of traffic does not cause an imbalance. The easiest way to avoid imbalances associated with allocation is to refrain from decreasing and then increasing total traffic while an experiment is live. For more information, review Optimizely's documentation on how user bucketing works.

For example, an experimenter launches an experiment with 80% of an audience allocated, down-ramps the traffic allocation to 50%, and then up-ramps it back to 80%. Users previously exposed to the Optimizely Feature Experimentation flag may no longer see it when you ramp the traffic allocation back up. To troubleshoot this, check your experiment history (see the documentation for Web Experimentation or Feature Experimentation). This lets you quickly determine whether traffic allocation has been disrupted and by whom.

To avoid this issue entirely in Optimizely Feature Experimentation, take a proactive approach: either change traffic allocation monotonically (only ramping up, never down and back up) or implement a user profile service (UPS).

UPS is only compatible with experiments, not rollouts.

For more information, review Optimizely's documentation on how to ensure consistent visitor bucketing.
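As a rough illustration of the idea behind a user profile service, the sketch below stores each user's variation assignment so that it can be reused on later visits. It assumes the lookup/save shape and the `experiment_bucket_map` profile layout described in the Feature Experimentation SDK documentation; the class name and all IDs are hypothetical, and you should verify the exact interface against your SDK version.

```python
class InMemoryUserProfileService:
    """Toy profile store backed by a dict.

    A real deployment needs persistent storage (for example a database or
    cache) so that assignments survive restarts.
    """

    def __init__(self):
        self._profiles = {}

    def lookup(self, user_id):
        # Return the stored profile, or None if this user has no saved assignment.
        return self._profiles.get(user_id)

    def save(self, user_profile):
        # Profiles are keyed by their "user_id" field.
        self._profiles[user_profile["user_id"]] = user_profile

# Hypothetical IDs, for illustration only.
ups = InMemoryUserProfileService()
ups.save({
    "user_id": "user-123",
    "experiment_bucket_map": {"exp-1": {"variation_id": "var-b"}},
})
print(ups.lookup("user-123"))
```

Because the saved assignment is consulted before any new bucketing decision, a returning user keeps the same variation even if traffic allocation has changed in the meantime.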

Monitor Performance

Redirect experiments and imbalances

When you create a redirect or URL split test, versions A and B sit on different URLs, while in a typical A/B test, all versions in the experiment have the same URL. Redirect tests can be efficient for small teams by recycling and reusing content while testing for impact. An example of this is homepage hero testing.

Redirect experiments produce valid results. However, due to the nature of redirects, users may close the window or tab and exit a page before the redirect finishes executing. When that happens, Optimizely does not receive the data, so the user is not counted. For this reason, Optimizely expects that a redirect experiment may show a slight imbalance outside of the typical 2% fluctuation.

There are multiple reasons why redirects are associated with imbalances. For example:

  1. The browser may reject the redirect if there are too many redirects in a row. Optimizely may not be the only thing redirecting the user; it may be one step in a series of redirects.
  2. A user could have a browser setting or extension that rejects redirects.
  3. The delay could be long enough that a user closes the tab before the redirect has finished.
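
The effect of these losses on the observed split can be sketched with simple arithmetic. The 3% loss rate below is a hypothetical figure chosen for illustration, not a measured value, and the function name is an assumption of this example.

```python
def counted_split(assigned, loss_rates):
    """Return counted visitors per variation and their share of the total.

    assigned:   visitors assigned to each variation
    loss_rates: fraction of each variation's visitors lost before tracking
    """
    counted = [n * (1 - loss) for n, loss in zip(assigned, loss_rates)]
    total = sum(counted)
    return counted, [c / total for c in counted]

# 5,000 visitors assigned to each arm; the redirect arm loses a hypothetical 3%.
counted, shares = counted_split([5000, 5000], [0.0, 0.03])
print([round(c) for c in counted])
print([round(s, 4) for s in shares])
```

Even though assignment was an exact 50/50 split, the counted visitors end up at roughly 50.8% versus 49.2%, because only the redirect arm loses users before they are recorded.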

Interpret segments for a redirect experiment

There is something called the Sure Thing Principle (STP) [1, 2]. Essentially, if doing something increases the chances of something bad happening across almost all your segments, it also dramatically raises the chances of something bad happening overall in your experiment.

Citations:

  1. Savage, L. J. (1954). The Foundations of Statistics. John Wiley & Sons.
  2. Joyce, J. M. (1999). The Foundations of Causal Decision Theory. Cambridge University Press.

You have a highly specific targeted audience and your experiment is displaying an imbalance after 48 hours.

When there are conditions and constraints on exposure to an experiment, it is not uncommon for a slight imbalance to emerge before the first business cycle of an experiment completes. These experiments can still give an experiment program valid results, but whether to keep running them depends on the experimenter's tolerance for imbalances before one business cycle completes. These imbalances do resolve after one business cycle.

Expected Behavior

The number of visitors assigned to each experiment variation is never exactly at a 50/50 split.

This is not a bug. This is normal behavior of the product. Do not expect an exact, perfect 50/50 split for every experiment you run. There will always be some slight deviation, around 2% or less.

The Optimizely Experimentation platform's hashing function determines which variation to show to a user. It uses a MurmurHash function to assign visitors. For bucketing, that means Optimizely assigns each user a number between 0 and 10,000 to determine if they qualify for the experiment and, if so, which variation they see.
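A minimal sketch of deterministic, hash-based bucketing is shown below. It is an illustration of the general technique only: it uses SHA-256 from the Python standard library as a stand-in for MurmurHash, and the key format, function names, and allocation layout are assumptions of this example rather than Optimizely's actual implementation.

```python
import hashlib

BUCKET_COUNT = 10_000  # buckets are numbered 0..9999 in this sketch

def bucket_for(user_id, experiment_id):
    """Map a user to a stable bucket number (SHA-256 stands in for MurmurHash)."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % BUCKET_COUNT

def variation_for(user_id, experiment_id, allocations):
    """Pick a variation from (name, upper_bound) pairs with ascending bounds.

    Returns None when the bucket falls outside every range, meaning the
    user is not in the experiment.
    """
    bucket = bucket_for(user_id, experiment_id)
    for variation, upper_bound in allocations:
        if bucket < upper_bound:
            return variation
    return None

# Hypothetical 50/50 experiment that covers all traffic.
allocations = [("A", 5000), ("B", 10000)]
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[variation_for(f"user-{i}", "exp-1", allocations)] += 1
print(counts)
```

The same user and experiment always hash to the same bucket, so a returning user sees the same variation, while the counts across many users land close to, but rarely exactly on, the configured split.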

Think of it as a coin flip per user, but that coin flip always gives the same result for the same user. If you flip a coin 10,000 times, it is extremely unlikely you will get precisely 5,000 heads and 5,000 tails.

  • The probability of getting exactly 5,000 heads (a perfect 50/50 split) is about 0.008, so in the long run only about 0.8% of such 10,000-flip runs land on an exact split.
  • You will get approximately 5,000 heads in 10,000 independently and identically distributed (IID) fair coin flips.
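
These figures can be checked with an exact binomial calculation. The sketch below assumes fair, independent coin flips, as in the analogy above; the function names are hypothetical. It requires Python 3.8 or later for `math.comb`.

```python
import math

def prob_exact_heads(flips, heads):
    """Probability of exactly `heads` heads in `flips` fair coin flips."""
    return math.comb(flips, heads) / 2 ** flips

def prob_heads_between(flips, low, high):
    """Probability that the number of heads is between low and high, inclusive."""
    favorable = sum(math.comb(flips, k) for k in range(low, high + 1))
    return favorable / 2 ** flips

exact = prob_exact_heads(10_000, 5_000)
near = prob_heads_between(10_000, 4_900, 5_100)
print(f"exactly 5,000 heads: {exact:.4f}")
print(f"between 4,900 and 5,100 heads: {near:.4f}")
```

An exact split is rare, while landing within 100 heads of 5,000 happens in the large majority of runs, which is why a small deviation from 50/50 is expected rather than alarming.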

Your team went through all your old experiments and found imbalances using online sample ratio mismatch calculators.

Performing a retroactive or end-of-experiment imbalance check is not a recommended use of your time. Retroactive imbalance testing informs you about a possible implementation problem only after the experiment has collected all of its data, which is far too late and defeats the purpose of imbalance detection in the first place.

Optimizely emphasizes the importance of running automatic checks. The automatic SRM detection algorithm created at Optimizely checks for imbalances after every data point, not just at the end, so that you can identify actual problematic imbalances at the first sign of trouble.

In this blog post, Optimizely explains in more depth why typical free sample ratio mismatch calculators can be misleading.