This topic describes how to:
- Distinguish between false positive rate control and false discovery rate control in Optimizely Experimentation
- Make business decisions based on the results you see
Every experiment has a chance of reporting a false positive: a conclusive result between two variations when there is actually no underlying difference in behavior between them. You can calculate the rate of error for a given experiment as 100% minus its statistical significance. This means that a higher statistical significance setting lowers the rate of false positives.
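As a quick arithmetic sketch of the relationship above (the function name is hypothetical and is not part of any Optimizely API), the rate of error follows directly from the significance level:

```python
def error_rate_percent(statistical_significance_percent: float) -> float:
    """Rate of error implied by a significance level, both expressed in percent.

    Hypothetical helper for illustration only: error = 100 - significance.
    """
    if not 0.0 <= statistical_significance_percent <= 100.0:
        raise ValueError("statistical significance must be between 0 and 100")
    return 100.0 - statistical_significance_percent


# Raising the significance level lowers the implied rate of error.
for significance in (80.0, 90.0, 95.0):
    print(f"{significance:.0f}% significance -> {error_rate_percent(significance):.0f}% error rate")
```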
Using traditional statistics, you increase your exposure to false positives as you run experiments on many goals and variations at once (the “multiple comparisons problem”). This happens because traditional statistics control the false positive rate among all goals and variations. However, this rate of error does not match the chance of making an incorrect business decision or implementing a false positive among conclusive results. Here's how this risk increases as you add goals and variations:
In this illustration, there are nine truly inconclusive results, and one of those registers as a false winner. This results in an overall false positive rate of about 10%. However, the business decision you'll make is to implement the winning variations, not the inconclusive ones. Two results are declared winners and one of them is false, so the rate of error of implementing a false positive from the winning variations is one out of two, or 50%. This is called the proportion of false discoveries.
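The two rates in the illustration can be reproduced with a few lines of arithmetic. This is a minimal sketch with hypothetical function names. It assumes that the "one out of two" above means two declared winners in total (one genuine, one false), and it divides the false winner by the nine truly inconclusive results, which gives roughly 11%, close to the "about 10%" quoted above.

```python
def false_positive_rate(false_winners: int, truly_inconclusive: int) -> float:
    """Share of truly inconclusive results that were wrongly declared conclusive."""
    return false_winners / truly_inconclusive


def false_discovery_proportion(false_winners: int, declared_winners: int) -> float:
    """Share of declared winners that are actually false positives."""
    if declared_winners == 0:
        return 0.0  # no declarations, so nothing can be a false discovery
    return false_winners / declared_winners


# Counts taken from the illustration (the total of two declared winners is an assumption).
truly_inconclusive = 9
false_winners = 1
declared_winners = 2

fpr = false_positive_rate(false_winners, truly_inconclusive)
fdp = false_discovery_proportion(false_winners, declared_winners)
print(f"False positive rate: {fpr:.1%}")
print(f"Proportion of false discoveries: {fdp:.1%}")
```

The same single false winner looks small against all inconclusive results but large against the short list of results you would actually act on.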
Optimizely Experimentation controls errors, and the risk of incorrect business decisions, by controlling the false discovery rate instead of the false positive rate. Here is how Optimizely Experimentation defines error rate:
False Discovery Rate = (average number of incorrect winning and losing declarations) / (total number of winning and losing declarations)
Read more about the distinction between false positive rate and false discovery rate in our blog post.
Optimizely Experimentation makes sure that the goal you choose as your primary goal always has the highest statistical power by treating it differently in our false discovery rate control calculations. Our false discovery rate control protects the integrity of all your goals from the “multiple comparisons problem” of adding several goals and variations to your experiment without keeping your primary goal from reaching significance in a timely fashion.
Learn more about setting events.
Learn how to optimize your events and goals for achieving significance quickly with Stats Engine.
False discovery rates in Optimizely Experimentation
If you perform several hypothesis tests simultaneously with traditional statistics, you run into the multiple comparisons problem (also known as multiplicity or the look-elsewhere effect): the probability of making at least one error increases rapidly with the number of hypothesis tests you run at once.
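To get a feel for how quickly this grows, here is a minimal sketch under simplifying assumptions that are not taken from Stats Engine: every test is independent, none has a real effect, and each is run at the same per-test false positive rate `alpha`. Under those assumptions, the chance of at least one false positive across `num_tests` tests is 1 - (1 - alpha)^num_tests.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.10) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every test has no true effect and uses the same per-test
    false positive rate `alpha` (an illustrative simplification).
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_tests


# The probability climbs quickly as goals and variations are added.
for num_tests in (1, 5, 10, 20):
    probability = chance_of_any_false_positive(num_tests)
    print(f"{num_tests:>2} tests -> {probability:.1%} chance of at least one false positive")
```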
False discovery rate control is a statistical procedure that corrects for the multiplicity caused by running multiple hypothesis tests at once. For an in-depth look at how Optimizely Experimentation incorporates false discovery rate control, read the article "Peeking at A/B Tests: Why it Matters, and what to do about it."
Optimizely Experimentation boosts the power of false discovery rate control even further by allowing you to rank your metrics. Consider an experiment with seven events: one headline metric that determines the success of your experiment; four secondary metrics tracking supplemental information; and two diagnostic metrics used for debugging. These metrics aren't all equally important. Also, statistical significance isn't as meaningful for some (the diagnostic metrics) as it is for others (the headline metric).
To solve this problem, Optimizely Experimentation's Stats Engine incorporates a tiered version of the Benjamini-Hochberg procedure for false discovery rate control to correct statistical significance across multiple metrics and variations. In this example, the first-ranked metric is your primary metric. Metrics ranked 2 through 5 are secondary metrics. Secondary metrics take longer to reach significance as you add more of them, but they don't affect the primary metric's speed to significance. Finally, any metrics ranked beyond the first five are monitoring metrics. Monitoring metrics take longer to reach significance if there are many of them, but they have minimal impact on secondary metrics and no impact on the primary metric.
The result is that your chance of making a mistake on your primary metric is controlled. The false discovery rate of all other metrics is also controlled, all while prioritizing reaching statistical significance quickly on the metrics that matter most.
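For readers who want to see the underlying idea, the following is a minimal sketch of the classic, untiered Benjamini-Hochberg step-up procedure from the 1995 paper. It is not Stats Engine's tiered implementation, which ranks metrics and is not reproduced here. The function name, the example p-values, and the target rate `q` are illustrative assumptions.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.10) -> List[bool]:
    """Classic Benjamini-Hochberg step-up procedure.

    Returns one flag per input p-value: True if that test is declared
    conclusive while targeting a false discovery rate of at most `q`.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            largest_passing_rank = rank

    # Declare every test ranked at or below k conclusive (the "step-up" part).
    conclusive = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            conclusive[index] = True
    return conclusive


# Ten hypothetical p-values, one per metric-variation comparison.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p_values, q=0.05))
```

Because the threshold scales with rank, a p-value is compared against a stricter bar when few other results look strong, which is what keeps the share of false discoveries among the declared results in check.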
Citations for the specific false discovery rate control algorithms incorporated into Optimizely Experimentation's Stats Engine are:
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4), 1165-1188.