- Optimizely Web Experimentation
- Optimizely Performance Edge
- Optimizely Feature Experimentation
- Optimizely Full Stack (Legacy)
Every experiment has a chance of reporting a false positive: a conclusive result between two variations when there is actually no underlying difference in behavior between them. You can calculate an experiment's error rate as 100 - [statistical significance]. For example, a result reported at 90% statistical significance carries a 10% chance of being a false positive. Higher statistical significance numbers therefore mean a lower rate of false positives.
With traditional statistics, your exposure to false positives increases as you test many goals and variations at the same time. This is called the multiple comparisons problem (also known as multiplicity or the look-elsewhere effect). It happens because traditional statistics control the false positive rate across all goals and variations. That error rate, however, does not match the chance of making an incorrect business decision, which comes from implementing a false positive among the conclusive results. This risk increases as you add goals and variations.
In this diagram, nine of the ten comparisons have no real underlying difference in behavior, and one of those nine incorrectly registers as a winner. This results in an overall false positive rate of about 10%. However, you implement the winning variations, not the inconclusive ones. Among the two winning variations, one is false, so the error rate of implementing a false positive is one out of two, or 50%. This is called the proportion of false discoveries. See the section Why Stats Engine controls for false discovery instead of false positives for more.
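The difference between the two rates comes down to the denominator. The following sketch (plain Python with hypothetical counts taken from the diagram, not Optimizely's implementation) shows the arithmetic:

```python
# Hypothetical counts from the diagram above: ten variation/metric comparisons,
# nine with no real underlying difference, two reported winners, one of them false.
total_comparisons = 10       # 5 variations x 2 metrics
true_null_comparisons = 9    # comparisons with no real difference in behavior
reported_winners = 2         # conclusive results the experiment reports
false_winners = 1            # reported winners with no real underlying difference

# False positive rate: false positives out of all comparisons with no real difference.
false_positive_rate = false_winners / true_null_comparisons    # ~0.11, about 10%

# Proportion of false discoveries: false positives out of the results you would act on.
false_discovery_proportion = false_winners / reported_winners  # 0.50, or 50%

print(f"False positive rate:             {false_positive_rate:.0%}")
print(f"Proportion of false discoveries: {false_discovery_proportion:.0%}")
```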
False discovery rate
Optimizely controls errors and the risk of incorrect business decisions by controlling the false discovery rate. The false discovery rate is calculated as follows:

False discovery rate = number of false positives among conclusive results / total number of conclusive results
False discovery rate control
False discovery rate control is a statistical procedure for correcting multiplicity caused by running multiple hypothesis tests simultaneously. Optimizely's Stats Engine uses a tiered version of the Benjamini-Hochberg procedure for false discovery rate control to correct statistical significance across multiple metrics and variations. The procedure offers a way to increase the power to detect statistically significant differences between variations while maintaining a principled bound on the error.
For an in-depth look at how Optimizely Experimentation incorporates false discovery rate control, read Peeking at A/B Tests: Why it Matters, and what to do about it, pages 1523-1525.
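Optimizely does not publish its exact tiered implementation in this article, but the standard Benjamini-Hochberg procedure that Stats Engine builds on can be sketched in a few lines of Python. Everything below (the function name, the example p-values, and the 10% target rate) is illustrative only:

```python
def benjamini_hochberg(p_values, fdr_target=0.10):
    """Standard Benjamini-Hochberg procedure (a sketch, not Optimizely's code).

    Returns a boolean list marking which hypotheses are declared significant
    while controlling the false discovery rate at fdr_target.
    """
    m = len(p_values)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr_target.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * fdr_target:
            max_k = rank

    # Declare significant every hypothesis ranked at or below max_k.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            significant[idx] = True
    return significant


# Example: p-values for several metric/variation comparisons (made-up numbers).
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))
# [True, True, True, True, False, False]
```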
False discovery rate control protects the integrity of your metrics from the multiple comparisons problem of adding several goals and variations to your experiment, while still letting your primary metric reach significance in a timely fashion. Optimizely Experimentation ensures that your primary metric has the highest statistical power by treating it separately from secondary and monitoring metrics in the false discovery rate control calculations.
False discovery rate control is applied to all metrics (primary, secondary, and monitoring) on the Optimizely Experiment Results page. However, when you segment results, false discovery rate control is not maintained. The deeper you segment, the higher the risk of finding false positives, so use segments for data exploration rather than decision-making.
Rank your metrics
Optimizely Experimentation further boosts the power of false discovery rate control by letting you rank your metrics. Consider an experiment with the following seven metrics:
- One headline metric that determines the success of your experiment.
- Four secondary metrics that track supplemental information.
- Two monitoring metrics for debugging.
These metrics are not equally important, and statistical significance is not as meaningful for some (the monitoring metrics) as it is for others (the headline metric). To solve this problem, Optimizely's Stats Engine applies its tiered version of the Benjamini-Hochberg procedure as follows:
- The first ranked metric is your primary metric – The primary metric's false discovery rate control is handled independently of the secondary and monitoring metrics, so its evaluation is never affected by the other metrics.
- Metrics ranked two through five are considered secondary metrics – The significance threshold for secondary metrics is adjusted based on the number of metrics and variations to ensure proper false discovery rate control. As a result, the more secondary metrics you have, the longer each may take to reach statistical significance. Secondary metrics do not affect the primary metric's speed to significance because the primary metric is evaluated separately.
- Any metrics ranked beyond the first five are monitoring metrics – The false discovery rate control calculations give each monitoring metric a fractional weight of 1/n, where n is the number of monitoring metrics. For example, if there are two monitoring metrics, each is given a weight of 1/2. This approach ensures that monitoring metrics have minimal impact on the evaluation of secondary metrics and no impact on the primary metric.
The result is that your chance of making a mistake on your primary metric is controlled. The false discovery rate of all other metrics is also controlled while prioritizing reaching statistical significance quickly on the metrics that matter most.
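The tiering described above can be sketched as follows. This is illustrative Python based only on the description in this article; the exact weighting scheme, function name, and thresholds are assumptions, not Optimizely's implementation:

```python
SIGNIFICANCE_TARGET = 0.10  # corresponds to 90% statistical significance

def evaluate_metrics(primary_p, secondary_ps, monitoring_ps):
    """Tiered evaluation sketch: primary alone, secondary corrected together,
    monitoring metrics down-weighted by 1/n (all assumptions for illustration)."""
    results = {}

    # Tier 1: the primary metric is tested on its own, so adding other metrics
    # never slows its path to statistical significance.
    results["primary"] = primary_p <= SIGNIFICANCE_TARGET

    # Tier 2: secondary metrics share a Benjamini-Hochberg-style correction,
    # so each additional secondary metric raises the bar the others must clear.
    m = len(secondary_ps)
    order = sorted(range(m), key=lambda i: secondary_ps[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if secondary_ps[idx] <= (rank / m) * SIGNIFICANCE_TARGET:
            max_k = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= max_k
    results["secondary"] = significant

    # Tier 3: each monitoring metric carries only a 1/n share of the error budget,
    # so it has minimal influence on secondary metrics and none on the primary.
    n = len(monitoring_ps)
    per_metric_budget = SIGNIFICANCE_TARGET / n if n else 0
    results["monitoring"] = [p <= per_metric_budget for p in monitoring_ps]

    return results


# Example: one primary, four secondary, and two monitoring metrics (made-up p-values).
print(evaluate_metrics(0.02, [0.01, 0.04, 0.30, 0.60], [0.03, 0.45]))
```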
Why Stats Engine controls for false discovery instead of false positives
Optimizely’s Stats Engine focuses on the false discovery rate instead of the false positive rate, which helps you make business decisions based on reliable results. The false positive rate is the proportion of comparisons with no real underlying difference that are incorrectly reported as conclusive.
In every experiment, there is a risk of getting a false positive result. This happens when an experiment reports a conclusive winner (or loser, depending on the direction of the observed difference), but there is, in fact, no real difference in visitor behavior between your variations.
With traditional statistics, the risk of generating at least one false positive result increases as you add more metrics and variations to your experiment. This is true even though each metric or variation's false positive rate stays the same.
When an experiment runs with more than one variation or metric, each variation and metric combination is a separate hypothesis, and together they are called multiple hypotheses. Testing multiple hypotheses at the same time requires correcting the statistical significance calculations. Otherwise, you run into the multiple comparisons problem, where the probability of basing a critical business decision on a false positive result grows rapidly with the number of hypothesis tests you run.
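To see how quickly that risk grows, here is a short illustration using standard probability arithmetic (nothing Optimizely-specific): if each independent test is run at 90% statistical significance, the chance of at least one false positive across m tests is 1 - 0.9^m.

```python
# Illustration of the multiple comparisons problem (standard probability arithmetic,
# not Optimizely-specific). Each independent test is run at 90% statistical
# significance (a 10% false positive rate), yet the chance of at least one
# false positive across the whole experiment climbs quickly.
per_test_false_positive_rate = 0.10

for num_tests in (1, 5, 10, 20):
    at_least_one = 1 - (1 - per_test_false_positive_rate) ** num_tests
    print(f"{num_tests:>2} tests -> {at_least_one:.0%} chance of at least one false positive")
```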
Example
Expanding on the diagram from the introduction of this article, the following example demonstrates how false discovery rate control delivers better results in an experiment using multiple variations and metrics.
Suppose an experiment has five variations and two distinct metrics. This gives ten different opportunities for a conclusive result. The experiment reports two winners. However, one of them (the one labeled False Winner) is actually a false positive: there is no real difference in visitor behavior for that combination.
If you use the false positive rate as the metric, you might think the likelihood of choosing the wrong winner is ten percent because only one of the ten possible results is incorrect. This is likely an acceptable rate of risk. However, looking at the false discovery rate, your chances of selecting a false winner are actually 50% because the false discovery rate only looks at actual conclusive results instead of all opportunities for results.
If you were running this experiment, you would likely discard all the inconclusive variation and metric combinations. You would then have to decide which of the two winning variations to implement. In doing so, you would have no better than a 50-50 chance of selecting the variation that helps drive the visitor behavior you wanted to encourage.
A false discovery rate of 50% would be alarming. Because Optimizely uses techniques that keep the false discovery rate low (approximately ten percent), your chances of selecting a true winning variation to implement are much higher than if you were using a tool that relied on more traditional statistical methods.
Citations
Citations for the specific false discovery rate control algorithms incorporated into Optimizely Experimentation's Stats Engine are the following:
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
- Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165-1188.