Why Stats Engine controls for false discovery instead of false positives

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

Optimizely Experimentation’s Stats Engine focuses on false discovery rate instead of false positive rate, which helps you make business decisions based on reliable results.

False positive rate can be calculated as the ratio between:

  • The number of negative events incorrectly categorized as positive.

  • The overall number of actual negative events.
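As a minimal sketch of this ratio, the snippet below computes a false positive rate from two counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of actual negative events that were incorrectly flagged as positive."""
    # Actual negatives are either flagged incorrectly or left unflagged correctly.
    actual_negatives = false_positives + true_negatives
    if actual_negatives == 0:
        raise ValueError("no actual negative events to evaluate")
    return false_positives / actual_negatives


# Hypothetical counts: 5 of 100 no-effect comparisons were reported as conclusive.
example_fpr = false_positive_rate(false_positives=5, true_negatives=95)
print(f"False positive rate: {example_fpr:.2%}")
```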

In every experiment, there is a risk of getting a false positive result. This happens when an experiment reports a conclusive winner (or loser, depending on the direction of the observed difference) but there is, in fact, no real difference in visitor behavior between your variations.

False discovery rate = (average number of incorrect winning and losing declarations) / (total number of winning and losing declarations)

With traditional statistics, the risk of generating at least one false positive result increases as you add more metrics and variations to your experiment. This is true even though the false positive rate stays the same for each individual metric or variation.

This may sound like a theoretical problem, and it is, but it can also have a significant real-world impact. When an experiment runs with more than one variation or more than one metric, these are collectively referred to as multiple hypotheses. When you test multiple hypotheses at the same time, you need to correct the statistical significance calculations. Otherwise, you run into the multiple comparisons problem (also known as multiplicity or the look-elsewhere effect): the probability of basing a critical business decision on a false positive result increases rapidly with the number of hypothesis tests you run simultaneously.
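To give a rough sense of how quickly this risk grows, here is a small sketch. It assumes the tests are independent and that each one has the same per-test false positive rate. Real metrics and variations are often correlated, so treat the numbers as an illustration rather than as what any particular experiment will see. The function name is hypothetical.

```python
def chance_of_any_false_positive(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every test has no real effect and the same false positive rate `alpha`.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Per-test false positive rate of 10% (a 90% significance threshold).
for num_tests in (1, 2, 5, 10, 20):
    risk = chance_of_any_false_positive(alpha=0.10, num_tests=num_tests)
    print(f"{num_tests:>2} tests: {risk:.1%} chance of at least one false positive")
```

Under these assumptions, ten simultaneous tests already carry roughly a 65% chance of at least one false positive, even though each individual test still has a 10% rate.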

False discovery rate

Optimizely Experimentation helps you avoid this by taking a more rigorous approach to controlling errors. Instead of focusing on the false positive rate, Optimizely Experimentation uses procedures that manage the false discovery rate, which we define like this:

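False discovery rate = (average number of incorrect winning and losing declarations) / (total number of winning and losing declarations)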

These procedures are designed to control the expected proportion of incorrect conclusive results.

In statistical language, this is the expected proportion of rejections of the null hypothesis that are incorrect (the null hypothesis being the claim that a particular change to your website caused no change in visitor behavior).

False discovery rate control is a statistical procedure that corrects for the multiplicity that arises when you run multiple hypothesis tests at once. Optimizely Experimentation's Stats Engine incorporates a tiered version of the Benjamini-Hochberg procedure for false discovery rate control to correct statistical significance across multiple metrics and variations. The procedure offers a way to increase the power to detect statistically significant differences between variations while maintaining a principled bound on the error.
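For intuition, here is a minimal sketch of the standard (untiered) Benjamini-Hochberg step-up procedure applied to fixed p-values. It is not Stats Engine's implementation: the tiered variant and the way Stats Engine computes significance are not reproduced here, and the function name and p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr_level: float = 0.10) -> List[bool]:
    """Return, for each p-value, whether it is declared conclusive under standard BH."""
    num_tests = len(p_values)
    if num_tests == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(num_tests), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most (k / m) * fdr_level.
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / num_tests * fdr_level:
            largest_passing_rank = rank

    # Every hypothesis ranked at or below that k is rejected.
    rejected = [False] * num_tests
    for index in order[:largest_passing_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for ten variation/metric combinations.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.45, 0.60, 0.74, 0.88, 0.95]
decisions = benjamini_hochberg(p_values, fdr_level=0.10)
uncorrected = [p <= 0.10 for p in p_values]
print("Conclusive without correction:", sum(uncorrected))
print("Conclusive with Benjamini-Hochberg:", sum(decisions))
```

With these made-up inputs, four comparisons fall under an uncorrected 10% threshold, but only the two strongest survive the correction.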

To go in depth on how Optimizely Experimentation incorporates false discovery rate control, read the article "Peeking at A/B Tests: Why it Matters, and what to do about it" (see pp. 1523-1525).

Citations for the specific false discovery rate control algorithms incorporated into Optimizely Experimentation's Stats Engine are:

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
  • Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4), 1165-1188.

Example

Here is an example of how false discovery rate control delivers better results in an experiment using multiple variations and metrics. Imagine a hypothetical experiment with five variations and two distinct metrics:

[Figure: example of false discovery rate control, showing results for five variations across two metrics]

In this experiment, there are ten different opportunities for a conclusive result. The experiment reports two winners; however, one of them (the one labeled "false winner") is actually inconclusive.

If we were to (incorrectly) use the false positive rate as our metric, we would think the likelihood of choosing the wrong winner is 10% because only one of the ten possible results is incorrect. That would likely be an acceptable level of risk.

However, looking at the false discovery rate, we see that our chances of selecting a false winner are actually 50%. That is because the false discovery rate only looks at actual conclusive results instead of all opportunities for results.

If you were running this experiment, the first thing you probably would do is discard all the inconclusive variation/metric combinations. You would then have to decide which of the two winning variations to implement. In doing so, you would have no better than a 50-50 chance of selecting the variation that actually helps drive the visitor behavior you wanted to encourage.
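The arithmetic behind the two figures in this example can be sketched as follows. The variable names are hypothetical; the counts come from the example above.

```python
# Counts taken from the example: 5 variations x 2 metrics, 2 reported winners,
# 1 of which is a false winner.
opportunities = 5 * 2
conclusive_results = 2
false_winners = 1

# The misleading view: incorrect results as a share of every opportunity.
share_of_all_results = false_winners / opportunities

# The false discovery view: incorrect results as a share of conclusive results only.
false_discovery_share = false_winners / conclusive_results

print(f"Incorrect share of all results: {share_of_all_results:.0%}")
print(f"Incorrect share of conclusive results: {false_discovery_share:.0%}")
```

The numerator is the same in both cases. Only the denominator changes, which is why the same experiment looks safe from one angle and risky from the other.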

A false discovery rate of 50% would definitely be alarming. But because Optimizely Experimentation uses techniques that work to keep the false discovery rate low (approximately 10%), your chances of selecting a true winning variation to implement are much higher than if you were using a tool that relied on more traditional statistical methods.

To learn how to capture more value from your experiments, either by reducing the time to statistical significance or by increasing the number of conversions collected, see our article on Stats Accelerator.

Rank your metrics to minimize risk

We explained earlier in the article how your chance of making an incorrect business decision increases as you add more metrics and variations (the “multiple comparisons problem”). This is true, but it is not the whole story.

Consider an experiment with seven metrics:

  • One headline metric that determines the success of your experiment.
  • Four secondary metrics tracking supplemental information.
  • Two diagnostic metrics used for debugging.

These metrics are not all equally important. Also, statistical significance is not as meaningful for some (the diagnostic metrics) as for others (the headline metric).

Optimizely Experimentation solves this problem by allowing you to rank your metrics:

  • The first ranked metric is your primary metric.
  • Metrics ranked 2 through 5 are considered secondary.
    • Secondary metrics take longer to reach significance as you add more, but they do not impact the primary metric's speed to significance.
  • Finally, any metrics ranked beyond the first five are monitoring metrics.
    • Monitoring metrics take longer to reach significance if there are more of them but have no impact on secondary metrics and no impact on the primary metric.
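The ranking rules above can be sketched as a simple mapping from rank to tier. This only illustrates how ranks translate into tiers, not how Stats Engine applies its correction within each tier. The function and metric names are hypothetical.

```python
from typing import Dict, List


def assign_tiers(ranked_metrics: List[str]) -> Dict[str, str]:
    """Map each metric to a tier based on its position in the ranked list."""
    tiers: Dict[str, str] = {}
    for rank, metric in enumerate(ranked_metrics, start=1):
        if rank == 1:
            tiers[metric] = "primary"
        elif rank <= 5:
            tiers[metric] = "secondary"
        else:
            tiers[metric] = "monitoring"
    return tiers


# Hypothetical metric names matching the seven-metric example.
ranked_metrics = [
    "headline_conversion",
    "secondary_1",
    "secondary_2",
    "secondary_3",
    "secondary_4",
    "diagnostic_1",
    "diagnostic_2",
]
metric_tiers = assign_tiers(ranked_metrics)
for metric, tier in metric_tiers.items():
    print(f"{metric}: {tier}")
```

With this ordering, the headline metric is the primary metric, the four supplemental metrics are secondary, and the two diagnostic metrics fall into the monitoring tier.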

Your chance of making a mistake on your primary metric is controlled. The false discovery rate of all other metrics is also controlled while prioritizing reaching statistical significance quickly on the metrics that matter most.

Like any other statistical test, Stats Engine does not completely shield the user from the effects of randomness. Even in cases where there is no underlying difference between the control and the treatment, differences between characteristics of users bucketed to different variations will generate slight differences in performance between the variants.

These fluke differences may occasionally be large enough to cause Stats Engine to declare statistical significance on a variation, but this should not happen frequently. The rate at which this happens is controlled according to the confidence threshold set by the user. With a confidence level (statistical significance threshold) of 90%, the user can expect an A/A comparison to result in statistical significance at most 10% of the time.

But notably, this still allows for the occasional false discovery. If the user observes a false discovery only rarely, then there is no reason to suspect an issue. If it happens consistently, there may be a need for a deeper dive.
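The idea that chance alone produces occasional significant A/A results can be illustrated with a small simulation. As a stand-in, it uses a classical fixed-horizon two-proportion z-test, evaluated once per test. This is not the method Stats Engine uses, and the visitor counts, conversion rate, and function names are assumptions made up for the sketch.

```python
import random
from statistics import NormalDist


def two_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if std_err == 0:
        return 1.0  # No variation at all, so no evidence of a difference.
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(num_tests: int, visitors_per_arm: int,
                      conversion_rate: float, alpha: float, seed: int) -> float:
    """Fraction of simulated A/A tests that reach significance by chance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(num_tests):
        # Both arms share the same true conversion rate: there is no real effect.
        conv_a = sum(rng.random() < conversion_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < conversion_rate for _ in range(visitors_per_arm))
        p_value = two_sided_p_value(conv_a, visitors_per_arm, conv_b, visitors_per_arm)
        if p_value < alpha:
            significant += 1
    return significant / num_tests


# alpha = 0.10 corresponds to a 90% statistical significance threshold.
aa_significance_rate = simulate_aa_tests(
    num_tests=1000, visitors_per_arm=500, conversion_rate=0.10, alpha=0.10, seed=42
)
print(f"A/A tests reaching significance: {aa_significance_rate:.1%}")
```

In a setup like this, the share of significant A/A tests should land near the chosen alpha. A rate well above it, observed consistently, would be the kind of pattern that warrants a deeper dive.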