Optimizely's automatic sample ratio mismatch detection catches experiment deterioration early

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

A sample ratio mismatch (SRM) occurs when the traffic distribution between variations in a Stats Engine A/B experiment becomes severely and unexpectedly unbalanced, often due to an implementation issue or third-party bots.

If an SRM does occur, it indicates a potential external influence on your traffic distribution. Exercise caution, however, and do not overreact to every traffic disparity; an imbalance does not automatically mean an experiment is useless.

How Optimizely protects your Stats Engine A/B experiments with its automatic SRM detection

Optimizely Experimentation aims to alert customers to experiment deterioration as soon as possible. Early detection helps you assess the severity of the imbalance and stop a faulty experiment, which can greatly reduce the number of users exposed to it.

To rapidly detect deterioration caused by mismanaged traffic distribution, Optimizely Experimentation's automatic SRM detection uses a statistical method called sequential sample ratio mismatch (SSRM). Optimizely's SSRM algorithm continuously checks traffic counts throughout an A/B experiment. It provides immediate detection at the beginning of an experiment's lifecycle instead of waiting until the experiment's end to test for an imbalance.

For information on why Optimizely does not use chi-squared tests to evaluate for imbalances, see A Better Way to Test for Sample Ratio Mismatches (SRMs) and Validate Experiment Implementations.

Going through your old experiments and trying to find imbalances with an online sample ratio mismatch calculator is not helpful. This retroactive, end-of-experiment imbalance check is not a recommended use of your time. Retroactive imbalance testing tells you about a possible implementation problem only after the experiment has collected all of its data, which is far too late and defeats the purpose of imbalance detection in the first place.

Optimizely Experimentation emphasizes the importance of running automatic checks. The automatic SRM detection algorithm created at Optimizely checks for imbalances after every data point, not just at the end, so that you can identify actual problematic imbalances at the first sign of trouble.
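
To make the idea concrete, here is a minimal, illustrative sketch of a sequential SRM check in the same spirit: a Dirichlet-multinomial Bayes factor that can be evaluated after every visitor while keeping the false-alarm rate controlled. This is not Optimizely's production implementation; the function names, the uniform prior, and the 0.001 false-alarm level are assumptions for illustration only.

```python
# Illustrative sketch only (not Optimizely's production code) of a sequential
# SRM check: a Dirichlet-multinomial Bayes factor that is a martingale under
# the configured split, so it can be checked after every visitor while keeping
# the false-alarm rate below alpha (Ville's inequality).
import numpy as np
from scipy.special import gammaln

def log_srm_bayes_factor(counts, expected_split, prior=None):
    """Log Bayes factor of 'some other split' vs the configured split."""
    n = np.asarray(counts, dtype=float)
    p0 = np.asarray(expected_split, dtype=float)
    a = np.ones_like(n) if prior is None else np.asarray(prior, dtype=float)
    # Dirichlet-multinomial marginal likelihood (multinomial coefficient cancels)
    log_alt = (gammaln(a.sum()) - gammaln(a.sum() + n.sum())
               + np.sum(gammaln(a + n) - gammaln(a)))
    log_null = np.sum(n * np.log(p0))  # multinomial likelihood under the configured split
    return log_alt - log_null

def check_srm(counts, expected_split, alpha=0.001):
    """Flag an SRM as soon as the Bayes factor exceeds 1/alpha."""
    return log_srm_bayes_factor(counts, expected_split) > np.log(1.0 / alpha)

# A 50/50 experiment that is actually delivering a 40/60 split:
print(check_srm([400, 600], [0.5, 0.5]))  # True  -- the 40/60 split is flagged
print(check_srm([495, 505], [0.5, 0.5]))  # False -- ordinary variation around 50/50
```

Because the statistic is valid at every look, you can rerun the check after each new visitor without inflating the false-alarm rate, which is the key difference from a one-time, end-of-experiment test.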

Optimizely's automatic SRM notification is only available for A/B tests:
  • With the traffic distribution set to Manual (Stats Accelerator is NOT enabled).
  • That have been running for 45 days or less, measured as total running time rather than age. Days when the experiment is paused do not count toward the total.
  • That have at least 1,000 visitors.
  • That have not had the Experiment Results page manually reset.

Segmenting experiment results

Optimizely Experimentation does not check for visitor imbalances when you segment your results.

Segments and filters should only be used for data exploration, not for making decisions.
The reasoning follows the Sure Thing Principle (STP)1,2: if something increases the probability of an outcome in every segment, it also increases the probability of that outcome in the experiment overall. An imbalance that appears consistently across segments therefore surfaces in the overall check, so Optimizely Experimentation does not check for visitor imbalances within segments.

Citations:
  1. Savage, L. J. (1954). The Foundations of Statistics. John Wiley & Sons.
  2. Joyce, J. M. (1999). The Foundations of Causal Decision Theory. Cambridge University Press.

Paused or archived experiments and flags

Optimizely Experimentation does not check for visitor imbalances in paused or archived experiments or flags.

Sample ratio mismatch

An SRM occurs when the traffic distribution between variations in a Stats Engine A/B experiment becomes significantly imbalanced. Optimizely Experimentation's Stats Engine does not generate SRMs, and its traffic-splitting mechanism is trustworthy. A severe traffic distribution imbalance may lead to experiment degradation and, in extreme cases, inaccurate results.

For example, in a Stats Engine A/B test, you set a 50/50 traffic split between Variation A and Variation B. But instead, you observe a 40/60 traffic distribution.

Remember, not every imbalance is a reason to panic and immediately abandon your experiment. If you understand the cause of the traffic distribution imbalance, you can still draw sound conclusions. An imbalance does not automatically invalidate your experiment results.

Evaluating experiments for traffic imbalances is most helpful at the start of your experiment launch period. Finding an experiment with an unknown source of a traffic imbalance lets you turn it off quickly and reduce the blast radius.

Optimizely Experimentation's automatic SRM detection uses a sequential sample ratio mismatch algorithm that continuously and efficiently checks traffic counts throughout an experiment. Automatic SRM detection is only available for Stats Engine A/B experiments.

Causes of imbalance

An imbalance flagged by Optimizely Experimentation's automatic SRM detection is a symptom of an underlying data quality issue. Implementation errors and third-party bots are the most common culprits behind experiment imbalances. To minimize the likelihood of an imbalance, set up your experiment carefully.

Specific experiment configurations pose a greater risk of an imbalance occurring. Assess the following scenarios to see if they are relevant to your experiment structure.

Redirect experiments

Redirect experiments are a known and reasonable cause of traffic imbalance. In Optimizely Web Experimentation or Optimizely Performance Edge, you can compare two separate URLs as variations in a Stats Engine A/B test. For example, you might create a redirect experiment that compares two landing page versions.

Due to the nature of redirects, users may close the window or tab and leave the page before the redirect finishes executing. The Optimizely Experimentation code does not activate in this situation, so the event data is never sent to Optimizely and the user is never counted. Redirect experiments are valid experiments, but it is reasonable to expect a slight imbalance.
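
As an illustration of how that data loss produces an imbalance, the following minimal simulation (not Optimizely code) assumes a 50/50 redirect experiment in which 5% of redirected users leave before the destination page loads; the abandonment rate is an assumption for illustration.

```python
# Minimal simulation (not Optimizely code) of redirect abandonment. Users sent
# to the redirect variation who close the tab before the destination page loads
# never fire the event, so Optimizely never counts them and the observed split
# drifts away from 50/50.
import random

random.seed(7)
counted = {"original": 0, "redirect": 0}
abandon_rate = 0.05  # assumed share of users who leave mid-redirect

for _ in range(100_000):
    variation = random.choice(["original", "redirect"])
    if variation == "redirect" and random.random() < abandon_rate:
        continue  # event never sent; user never counted
    counted[variation] += 1

print(counted)  # roughly {'original': 50000, 'redirect': 47500}
```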

URL redirects can vary, and you cannot rely on them to behave consistently; it is unreasonable to expect a specific, fixed rate of completed redirects. Do not make ad hoc adjustments that over- or under-correct traffic for redirect experiments, and do not run a redirect experiment for an extended period solely to rebalance visitor counts.

There are two major reasons Optimizely Experimentation delays event tracking until after the redirect is completed:

  1. Performance – Optimizely Experimentation's redirect hides the original page's content with CSS, so there is already a delay between the user requesting a webpage and seeing any content. End users are rightfully sensitive to site performance, and waiting to send the event before redirecting would make that delay worse. If changes must also be applied to the destination page, the snippet still needs to run there, pushing the user's experience out even further. Optimizely Experimentation minimizes this extra time by sending the event after the customer reaches the second page.

  2. Accuracy – Sending the event when the second page loads is the only way Optimizely Experimentation knows the redirect completed, the user was bucketed, and they actually received the variation. You might think that letting the snippet send the event, confirm receipt, and then redirect would ensure accurate results. In fact, it would do the opposite: users who never complete the redirect would still be counted in the redirect variation, and their data (or lack of it) would flow into results processing, distorting the reliability and precision of the metrics reported on the Experiment Results page.

There are many reasons outside of your control why a redirect may fail. For example:

  1. The browser may reject it if there are too many redirects. Optimizely Experimentation may not be the only thing redirecting the user, and it may be one step in a series of redirects.
  2. A user can have a browser setting or extension that rejects redirects.
  3. The delay can be long enough that a user closes the tab before the redirect has finished. 

See Test two URLs in Optimizely Web Experimentation or Optimizely Performance Edge using a redirect experiment.

Reduce and then increase traffic allocation

There is a high risk of corrupting your experiment data if the percentage of overall traffic allocation is moved down and then back up. You can directly cause a traffic imbalance in your experiments if you (or a teammate) down-ramp traffic (reduce traffic allocation) and then up-ramp traffic (increase traffic allocation). Bucketing at Optimizely is:

  • Deterministic – Because Optimizely Experimentation hashes user IDs, a returning user is not reassigned to a new variation.
  • Sticky unless reconfigured – If you reconfigure a "live," running experiment, for example, by decreasing and then increasing traffic, a user may get rebucketed into a different variation.

If you down-ramp and then up-ramp traffic, re-bucketing occurs and can irreparably harm your experiment results. Rebucketed users distort the visitor counts for each variation, producing a traffic imbalance that you caused yourself.

Slow and steady up-ramping of traffic does not cause an imbalance in experiments.

For example, suppose you launch an Optimizely Feature Experimentation experiment and allocate 80% of an audience, down-ramp the traffic allocation to 50%, and then up-ramp it back to 80%. Users previously exposed to the flag may no longer see it after you ramp the traffic allocation back up.
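
To see why re-bucketing can happen, consider a simplified range-based bucketing scheme. This is an assumption for illustration, not Optimizely's exact algorithm: each user hashes to a fixed bucket between 0 and 9,999, the allocated range grows or shrinks with the traffic allocation, and the variations split that range proportionally.

```python
# Simplified range-based bucketing (assumed for illustration, not Optimizely's
# exact algorithm). A user's hash bucket never changes, but the variation
# boundaries move whenever the traffic allocation changes.
def assign(bucket: int, allocation: float) -> str | None:
    in_experiment = int(allocation * 10_000)  # allocated range: [0, in_experiment)
    if bucket >= in_experiment:
        return None                           # user is outside the experiment
    return "A" if bucket < in_experiment // 2 else "B"  # 50/50 split of the range

user_bucket = 3000                 # deterministic hash of this user's ID
print(assign(user_bucket, 0.80))   # 'A'  (A owns buckets 0-3999)
print(assign(user_bucket, 0.50))   # 'B'  (A shrinks to 0-2499, so the user flips)
print(assign(user_bucket, 0.80))   # 'A'  again -- the user was re-bucketed twice
```

The user's bucket never changed, but because the variation boundaries moved with the allocation, the same visitor contributed data to two different variations.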

Check your experiment history for Web Experimentation or the flag's history for Feature Experimentation to troubleshoot traffic allocation changes. This lets you determine if traffic allocation was changed and by whom.

The easiest way to avoid imbalances associated with traffic allocation is to refrain from decreasing and increasing total traffic while an experiment is live. To avoid the issue entirely in Optimizely Feature Experimentation, only ramp traffic in one direction (increase it monotonically) or implement a user profile service (UPS).
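
Here is a minimal sketch of an in-memory user profile service for the Feature Experimentation Python SDK. The lookup/save interface follows the SDK's UPS documentation, the SDK key is a placeholder, and the in-memory store is an assumption for illustration; a real implementation would persist profiles in a database or cache.

```python
# Minimal sketch of a user profile service (UPS) for the Feature
# Experimentation Python SDK. A UPS keeps bucketing decisions sticky, so a
# user stays in their original variation even if traffic allocation changes.
from optimizely import optimizely

class InMemoryUserProfileService:
    """Illustration only -- production code should persist to a database or cache."""

    def __init__(self):
        self._profiles = {}

    def lookup(self, user_id):
        # Return the saved profile (user ID plus experiment-to-variation map),
        # or None if the user has not been bucketed before.
        return self._profiles.get(user_id)

    def save(self, user_profile):
        self._profiles[user_profile["user_id"]] = user_profile

optimizely_client = optimizely.Optimizely(
    sdk_key="YOUR_SDK_KEY",                            # placeholder
    user_profile_service=InMemoryUserProfileService(),
)
```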

Forced-bucketing

If a user gets bucketed in an Optimizely Web Experimentation experiment first and then that decision is used to force-bucket them in a legacy Full Stack Experimentation experiment, then the results of that Full Stack Experimentation experiment become imbalanced.

See the following example of how force-bucketing can cause an imbalance:

An experiment has two variations: Variation A and Variation B.

Variation A provides a superior user experience in comparison to Variation B. Visitors assigned to Variation A find it enjoyable, and many of them continue to log in and land in the Full Stack Experimentation experiment, where they are force-bucketed to Variation A.

In contrast, visitors assigned to Variation B do not have a good experience, and only a few proceed to log in and land in the Full Stack Experimentation experiment, where they are assigned to Variation B.

As a result, there are significantly more visitors in Variation A than in Variation B. These Variation A visitors are also more likely to convert in the Full Stack Experimentation experiment because they are happier with their experience. So, in addition to the visitor traffic imbalance, metrics and conversion rates are skewed.
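
The problematic pattern looks something like the following sketch, shown only to illustrate the anti-pattern, not as recommended usage. The SDK key, experiment key, and the login flow are hypothetical.

```python
# Sketch of the anti-pattern described above -- do not do this. Only users who
# log in ever reach this code, and whether they log in already depends on the
# Web variation they saw, so the forced Full Stack bucketing inherits that
# selection bias and the traffic split becomes imbalanced.
from optimizely import optimizely

optimizely_client = optimizely.Optimizely(sdk_key="YOUR_SDK_KEY")  # placeholder

def bucket_after_login(user_id: str, web_variation_key: str):
    # Forcing the legacy Full Stack experiment to mirror the Web decision:
    optimizely_client.set_forced_variation(
        "post_login_experiment", user_id, web_variation_key  # hypothetical keys
    )
    return optimizely_client.activate("post_login_experiment", user_id)
```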

Additional ways to harm your results

There are other situations that may arise and cause irreparable harm to the results of your experiments. 

Delayed or failed Optimizely API calls

The Event API sends event data directly to Optimizely Experimentation. A traffic imbalance may occur if anything delays these calls or prevents them from firing.

Differences in IDs across devices

In some cases, the chosen user ID is not an identifier that stays consistent across devices (such as a customer ID for logged-in users), so the user does not see the same variation on every device.
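
One hedged way to avoid this with the Feature Experimentation Python SDK is to key decisions on an identifier that is stable across devices. The SDK key, user ID, attribute, and flag key below are placeholders for illustration.

```python
# Hedged sketch: key decisions on a stable, cross-device identifier (such as a
# logged-in customer ID) so the same user sees the same variation on every
# device. SDK key, user ID, attributes, and flag key are placeholders.
from optimizely import optimizely

optimizely_client = optimizely.Optimizely(sdk_key="YOUR_SDK_KEY")  # placeholder

user_context = optimizely_client.create_user_context(
    "customer-48213",              # stable customer ID, not a device-scoped ID
    {"logged_in": True},
)
decision = user_context.decide("homepage_redesign")  # hypothetical flag key
print(decision.variation_key)
```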

Differences in the snippet or event dispatch timing

A traffic imbalance may occur if something causes the Optimizely Web Experimentation snippet code to misfire. Additionally, if you use the holdEvents or sendEvents JavaScript APIs in a location other than in the project JavaScript, the script may not load properly, resulting in a traffic distribution imbalance. Adding more scripts to your webpage may cause implementation or loading rates to differ across variations, particularly in the case of redirects.

The Optimizely Feature Experimentation SDKs make an HTTP request for every decision event or conversion event that is triggered. Each SDK has a built-in event dispatcher for handling these events. A traffic distribution imbalance may occur if events are dispatched incorrectly due to misconfiguration or other dispatching issues.

For more information, refer to your Feature Experimentation SDK's configure event dispatcher documentation.
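
As a hedged illustration for the Python SDK, a custom event dispatcher can at least surface dispatch failures instead of dropping them silently. The dispatch_event interface shown follows the Python SDK's event dispatcher documentation; verify it against your SDK version, and treat the SDK key as a placeholder.

```python
# Hedged sketch of a custom event dispatcher for the Feature Experimentation
# Python SDK that logs failed dispatches instead of dropping them silently.
# Every silently dropped event is an uncounted visitor, which is how traffic
# imbalances start.
import json
import logging

import requests
from optimizely import optimizely

class LoggingEventDispatcher:
    def dispatch_event(self, event):
        try:
            response = requests.post(
                event.url,
                data=json.dumps(event.params),
                headers=event.headers,
                timeout=10,
            )
            response.raise_for_status()
        except requests.RequestException:
            logging.exception("Optimizely event dispatch failed: %s", event.url)

optimizely_client = optimizely.Optimizely(
    sdk_key="YOUR_SDK_KEY",                      # placeholder
    event_dispatcher=LoggingEventDispatcher(),
)
```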

What to do if Optimizely identifies an imbalance in your experiment

If Optimizely Experimentation identifies an imbalance in your experiment, troubleshooting depends on the cause of the imbalance and if you can correct the problem directly.

  • If you can determine the root cause – Stop your experiment, fix the underlying issue, duplicate the experiment, and start that new experiment. You can continue monitoring your fix in the new experiment to verify that you have corrected the problem.
  • If you cannot determine the root cause of the imbalance – Stop the experiment or remove the variations to lessen the negative impact on customers while you investigate further.

For information, see Imbalance detected: What to do next if Optimizely identifies an SRM in your Stats Engine A/B test.