- Optimizely Web Experimentation
- Optimizely Performance Edge
- Optimizely Feature Experimentation
- Optimizely Full Stack (Legacy)
Every experiment has a chance of reporting a false positive: a conclusive result between two variations when there is actually no underlying difference in behavior between them. You can calculate the rate of error for an experiment as 100 - [statistical significance]. For example, at 90% statistical significance the rate of error is 10%. This means that higher statistical significance numbers decrease the rate of false positives.
Using traditional statistics, you increase your exposure to false positives as you run experiments on many goals and variations at once (the "multiple comparisons problem"). This happens because traditional statistics control the false positive rate among all goals and variations. However, this rate of error does not match the chance of making an incorrect business decision or implementing a false positive among conclusive results. The risk increases as you add goals and variations:
In this illustration, nine results should be inconclusive because there is no true underlying difference, and one of those registers as a false winner. This results in an overall false positive rate of about 10%. However, the business decision you will make is to implement the winning variations, not the inconclusive ones. Two variations are declared winners and one of them is false, so the rate of error of implementing a false positive from the winning variations is one out of two, or 50%. This is called the proportion of false discoveries.
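The arithmetic behind the illustration can be reproduced in a few lines. This is a minimal sketch that assumes one reading of the illustration: nine results with no true difference plus one true winner, with one of the nine wrongly declared a winner. The variable names are illustrative only.

```python
# Assumed reading of the illustration: nine results with no true difference,
# one true winner, and one of the nine wrongly declared a winner.
null_results = 9       # results with no real underlying difference
true_winners = 1       # real differences correctly declared winners
false_winners = 1      # null results wrongly declared winners

declared_winners = true_winners + false_winners

# Share of the no-difference results that were wrongly declared conclusive.
false_positive_rate = false_winners / null_results

# Share of the declared winners that are actually false.
false_discovery_proportion = false_winners / declared_winners

print(f"False positive rate: {false_positive_rate:.0%}")
print(f"Proportion of false discoveries: {false_discovery_proportion:.0%}")
```

Under these assumptions the false positive rate comes out at roughly one in nine (the "about 10%" in the text), while the proportion of false discoveries is one in two.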
False discovery rate
Optimizely Experimentation controls errors, and the risk of incorrect business decisions, by controlling the false discovery rate instead of the false positive rate.
Here is how Optimizely Experimentation defines error rate:
False Discovery Rate = (average number of incorrect winning and losing declarations) / (total number of winning and losing declarations)
Read more about the distinction between false positive rate and false discovery rate in the Optimizely.com blog post.
Optimizely Experimentation ensures that your primary goal has the highest statistical power by treating it differently in the false discovery rate control calculations. False discovery rate control protects the integrity of your goals from the multiple comparisons problem (also called multiplicity or the look-elsewhere effect) that comes with adding several goals and variations to your experiment. It does this without keeping your primary goal from reaching significance in a timely fashion.
See also Primary metrics, secondary metrics, and monitoring goals in Optimizely Experimentation to learn how to optimize your events and goals for achieving significance quickly with Stats Engine.
If you perform several hypothesis tests simultaneously with traditional statistics, you run into the multiple comparisons problem, where the probability of making an error increases rapidly with the number of simultaneous hypothesis tests you are running.
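To give a sense of how quickly the error probability grows, here is a minimal sketch. It assumes the tests are independent, that none of them has a true underlying difference, and that each is run at the same significance threshold (90% is used here purely as an example value). The function name is hypothetical.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.10) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests with no true effect, each run at error rate `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


# alpha = 0.10 corresponds to an assumed 90% statistical significance threshold.
for m in (1, 5, 10, 20):
    print(f"{m:>2} tests: {chance_of_any_false_positive(m):.1%}")
```

With real experiments the tests are rarely fully independent, so treat these numbers as an illustration of the trend rather than an exact error rate.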
False discovery rate control is a statistical procedure for correcting multiplicity caused by running multiple hypothesis tests at once. You can go in-depth into how Optimizely Experimentation incorporates false discovery rate control by reading Peeking at A/B Tests: Why it Matters, and what to do about it.
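For intuition, the sketch below shows the classic Benjamini-Hochberg step-up procedure from the 1995 paper cited at the end of this article. It works on fixed p-values and is not Optimizely's Stats Engine implementation, which uses a tiered version of the procedure. The function name, the example p-values, and the 10% target rate are assumptions made for illustration.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.10) -> list[bool]:
    """Return, for each p-value, whether it is declared a discovery
    when controlling the false discovery rate at level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k whose p-value is <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank

    # Declare the largest_k smallest p-values as discoveries.
    discoveries = [False] * m
    for idx in order[:largest_k]:
        discoveries[idx] = True
    return discoveries


# Hypothetical p-values for five metric/variation comparisons.
p_values = [0.30, 0.01, 0.05, 0.70, 0.045]
print(benjamini_hochberg(p_values, q=0.10))
# → [False, True, True, False, True]
```

Note the step-up behavior: 0.045 misses its own threshold (0.04 at rank two) but is still declared a discovery because 0.05 passes the threshold at rank three (0.06).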
Rank your metrics
Optimizely Experimentation boosts the power of false discovery rate control even further by letting you rank your metrics. Consider an experiment with seven events:
- one headline metric that determines the success of your experiment
- four secondary metrics tracking supplemental information
- two diagnostic metrics used for debugging
These metrics are not equally important. Also, statistical significance is not as meaningful for some (the diagnostic metrics) as it is for others (the headline metric).
To solve this problem, Optimizely Experimentation's Stats Engine incorporates a tiered version of the Benjamini-Hochberg procedure for false discovery rate control to correct statistical significance across multiple metrics and variations. In this example, your first-ranked metric (the headline metric) is your primary metric.
Metrics ranked two through five are considered secondary. Secondary metrics take longer to reach significance as you add more of them, but they do not impact the primary metric's speed to significance.
Any metrics ranked beyond the first five are monitoring metrics. Monitoring metrics take longer to reach significance if there are many of them, but they have minimal impact on secondary metrics and no impact on the primary metric.
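The ranking rules above can be summarized in a small helper. This is only a sketch of the tier boundaries described in this article (rank one is primary, ranks two through five are secondary, anything beyond is monitoring); the function and metric names are hypothetical and are not part of any Optimizely API.

```python
def metric_tier(rank: int) -> str:
    """Map a 1-based metric rank to the tier described in this article."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank == 1:
        return "primary"
    if rank <= 5:
        return "secondary"
    return "monitoring"


# The seven-event example: one headline, four secondary, two diagnostic metrics.
ranked_metrics = [
    "headline",
    "secondary A", "secondary B", "secondary C", "secondary D",
    "diagnostic A", "diagnostic B",
]
for rank, name in enumerate(ranked_metrics, start=1):
    print(f"{rank}. {name}: {metric_tier(rank)}")
```

Ranked this way, the headline metric lands in the primary tier, the four supplemental metrics in the secondary tier, and the two diagnostic metrics in the monitoring tier.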
The result is that your chance of making a mistake on your primary metric is controlled. The false discovery rate of all other metrics is also controlled, and Stats Engine prioritizes reaching statistical significance quickly on the metrics that matter most.
Citations for the specific false discovery rate control algorithms incorporated into Optimizely Experimentation's Stats Engine are:
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
- Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4), 1165-1188.