Fixed Horizon is a frequentist statistical method used to run traditional A/B tests with a predetermined sample size. This approach relies on well-established statistical concepts such as p-values, minimum detectable effect (MDE), and variance to determine whether observed differences between variations are meaningful. See the following sections to understand the key statistical elements behind Frequentist (Fixed Horizon) testing and how they impact experiment design and analysis.
To learn how to configure a Frequentist (Fixed Horizon) test in Optimizely Experimentation, see Configure a Frequentist (Fixed Horizon) A/B test.
Why Frequentist (Fixed Horizon) tests require a predetermined sample size
In a Frequentist (Fixed Horizon) experiment, you must calculate the sample size before starting. This ensures the following:
- Statistical validity is maintained – You reduce the chance that results are due to random variation.
- Mid-test peeking is avoided – You prevent inflated false positives caused by early looks at partial data.
- A clear decision rule is committed to – You evaluate results once, using pre-defined thresholds.
Sample size calculation
The required sample size per variation depends on the following:
- Baseline metric value – The current performance of the metric you are testing.
- MDE – The smallest detectable relative change.
- Statistical significance level – Your desired confidence threshold.
- Variance – The variability of your data.
These inputs work together to determine the total number of visitors needed to make a reliable conclusion. Use the built-in Frequentist (Fixed Horizon) Sample Size Calculator to automatically compute the required visitors per variation.
Why peeking mid-test is a problem
In a Fixed Horizon experiment, peeking mid-test introduces bias and increases the chance of false positives. When you check results before the full sample is collected, you might stop the experiment too early, believing you found a winner when the effect was due to random chance.
Analogy – If you flip a coin 100 times to test fairness and check after 10 flips (7 heads), you might falsely conclude bias. Waiting for all 100 flips gives a more reliable answer.
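The coin analogy can be made concrete with a short Monte Carlo sketch (illustrative only; this is not Optimizely's internal logic). It simulates a fair coin, so the null hypothesis is true, and compares the false positive rate of a single test at the fixed horizon against a strategy that peeks at every interim checkpoint and stops at the first "significant" result:

```python
import math
import random
from statistics import NormalDist

def coin_p_value(heads, flips, p0=0.5):
    """Two-sided z-test p-value for the hypothesis that the coin is fair."""
    se = math.sqrt(p0 * (1 - p0) / flips)
    z = (heads / flips - p0) / se
    return 2 * NormalDist().cdf(-abs(z))

def false_positive_rates(trials=2000, flips=500, peek_every=25,
                         alpha=0.05, seed=7):
    """Compare false positive rates under a true null hypothesis when
    testing once at the fixed horizon versus peeking at checkpoints."""
    rng = random.Random(seed)
    fixed_fp = peeking_fp = 0
    for _ in range(trials):
        outcomes = [rng.random() < 0.5 for _ in range(flips)]
        # Fixed Horizon: one test after all flips are collected.
        if coin_p_value(sum(outcomes), flips) < alpha:
            fixed_fp += 1
        # Peeking: test at every checkpoint, stop at the first "win".
        heads = 0
        for i, h in enumerate(outcomes, start=1):
            heads += h
            if i % peek_every == 0 and coin_p_value(heads, i) < alpha:
                peeking_fp += 1
                break
    return fixed_fp / trials, peeking_fp / trials

fixed_rate, peeking_rate = false_positive_rates()
```

With these settings, the fixed-horizon false positive rate stays near the nominal 5%, while the peeking strategy's rate is several times higher, even though the coin is fair in every trial.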
Fixed Horizon example scenario
Scenario
- Baseline Metric Value – 5%
- Minimum Detectable Effect – 10% relative increase. That is, you expect the metric value in the variant to be \( 5\% \times (100 + 10)\% = 5.5\% \).
- Statistical Significance level – 95%
- Number of Variations – 2 (1 baseline + 1 treatment)
Outcome
- Visitors needed per variation – 34,363
- Minimum duration – 7 days.
You should run all tests for at least one business cycle (seven days) to account for recurring patterns in user behavior, such as differences between weekend and weekday visitors. See Seasonality and traffic spikes.
Interpretation
- You must run the experiment until each variation (1 baseline + 1 test) receives 34,363 visitors and at least seven days have passed.
- You cannot view results before these conditions are complete.
Statistical calculations
The following sections assume that there are two variants:
- One baseline (or control) group.
- One treatment (or test) group.
In situations with multiple treatments, Optimizely compares each treatment group to the baseline (control) group, resulting in a series of two-sample problems. This is a multiple comparison problem and requires a correction to control the rate of false discoveries.
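One simple, well-known correction is Bonferroni adjustment, shown below purely as an illustration of multiple-comparison control; this document does not specify which correction Optimizely applies, so treat the function as a hypothetical sketch rather than the product's method:

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Bonferroni correction: multiply each raw p-value by the number of
    comparisons m (capped at 1.0). A treatment is significant only if its
    adjusted p-value is still below alpha."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three treatments, each compared against one shared baseline.
raw = [0.010, 0.030, 0.200]
adjusted = bonferroni_adjust(raw)  # adjusted ≈ [0.03, 0.09, 0.60]
```

Note that after adjustment only the first treatment remains below the 0.05 threshold; the second, significant on its raw p-value, no longer is.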
Confidence interval calculations
Depending on the metric type (binary, numeric, or ratio) and the improvement type (absolute or relative), Optimizely uses one of the following confidence interval calculations:
Numeric or binary metric, relative improvement
- Let \( \hat{\theta}_h \) and \( \hat{\sigma}^2_h \) denote the estimated metric (sample mean) and variance in the baseline (or control) group with sample size \( n_h \).
- Let \( \hat{\theta}_t \) and \( \hat{\sigma}^2_t \) denote the estimated metric and variance in the treatment group with sample size \( n_t \).
- Let \( \hat{R} := \frac{\hat{\theta}_t}{\hat{\theta}_h} \). The relative improvement is then defined as \( 100(\hat{R} - 1)\% \).
An approximate \( 100(1 - \alpha)\% \) confidence interval using the Delta method is: \[ \hat{R} - 1 \pm \Phi^{-1}(1 - \alpha/2)\sqrt{\frac{1}{\hat{\theta}_h^2} \left( \frac{\hat{\sigma}_h^2 \hat{R}^2}{n_h} + \frac{\hat{\sigma}_t^2}{n_t} \right)} \]
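The interval above translates directly into code. The following is a sketch that mirrors the formula term by term (variable names are illustrative, and this is not Optimizely's implementation); the example numbers reuse the binary-metric scenario from earlier, where the variance of a conversion rate \( p \) is \( p(1-p) \):

```python
import math
from statistics import NormalDist

def relative_improvement_ci(theta_h, var_h, n_h, theta_t, var_t, n_t,
                            alpha=0.05):
    """Delta-method confidence interval, in percent, for the relative
    improvement 100*(R - 1)% where R = theta_t / theta_h."""
    R = theta_t / theta_h
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
    se = math.sqrt((var_h * R ** 2 / n_h + var_t / n_t) / theta_h ** 2)
    return 100 * (R - 1 - z * se), 100 * (R - 1 + z * se)

# Binary metric: 5% baseline vs. 5.5% treatment conversion rate.
lo, hi = relative_improvement_ci(
    theta_h=0.05, var_h=0.05 * 0.95, n_h=34_363,
    theta_t=0.055, var_t=0.055 * 0.945, n_t=34_363,
)
```

The interval is centered on the observed 10% relative improvement; its width shrinks as the per-variation sample sizes grow.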
P-values
The p-value corresponding to a Wald confidence interval is defined as follows:
Let \( w \) denote the observed value of the Wald statistic \( W \).
The p-value is: \[ p\text{-value} = \mathcal{P}_{\theta_0}(|W| > |w|) \approx 2\Phi(-|w|) \]
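The normal approximation \( 2\Phi(-|w|) \) is a one-liner in code (a sketch of the formula above, not Optimizely's implementation):

```python
from statistics import NormalDist

def wald_p_value(w):
    """Two-sided p-value for an observed Wald statistic w: 2 * Phi(-|w|)."""
    return 2 * NormalDist().cdf(-abs(w))
```

For example, `wald_p_value(1.96)` is just under 0.05, matching the familiar rule of thumb that \( |w| > 1.96 \) corresponds to significance at the 95% level.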
Sample size estimation calculations
Estimation for relative improvement for binary or numeric metrics
Suppose the true ratio of the treatment metric to the control metric is \( \gamma \), and assume that \( n_t = k n_h \). The sample size needed to achieve power \( 1 - \beta \) is
\[ n_h = \frac{\frac{1}{\theta_h^2} \left( \sigma_h^2 \gamma^2 + \frac{\sigma_t^2}{k} \right)} {\left(\frac{1 - \gamma}{z_{\alpha / 2} + z_\beta}\right)^2}. \]
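Plugging the earlier example scenario into this formula reproduces its 34,363 figure. The sketch below assumes a binary metric (so \( \sigma^2 = \theta(1 - \theta) \)), equal allocation (\( k = 1 \)), and 80% power (\( \beta = 0.2 \), a common default that the example does not state explicitly):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(theta_h, var_h, var_t, gamma,
                              k=1.0, alpha=0.05, beta=0.20):
    """Baseline sample size n_h needed to detect a true ratio gamma
    (treatment / control) with power 1 - beta, where n_t = k * n_h."""
    z = NormalDist().inv_cdf
    numerator = (var_h * gamma ** 2 + var_t / k) / theta_h ** 2
    denominator = ((1 - gamma) / (z(1 - alpha / 2) + z(1 - beta))) ** 2
    return numerator / denominator

# Example scenario: 5% baseline, 10% relative MDE, binary metric.
theta_h, theta_t = 0.05, 0.055
n_h = sample_size_per_variation(
    theta_h=theta_h,
    var_h=theta_h * (1 - theta_h),   # binomial variance p(1 - p)
    var_t=theta_t * (1 - theta_t),
    gamma=theta_t / theta_h,
)
```

Rounding `n_h` up gives 34,363 visitors per variation, matching the outcome quoted in the example scenario above.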
Next steps
For more information, see the following documentation: