Run and interpret an A/A test

  • Optimizely Web Experimentation
  • Optimizely Performance Edge
  • Optimizely Feature Experimentation
  • Optimizely Full Stack (Legacy)

When you run an experiment in Optimizely Experimentation in which variation A is identical to variation B, you are running an A/A experiment rather than an A/B experiment. It is called an A/A experiment because there is essentially no "B" variation: the original page and the variation are exactly the same. A/A experiments are also referred to as calibration testing.

A/A test versus monitoring campaign

An A/A test is distinct from a monitoring campaign, which is an experiment that has no variations.

  • Monitoring campaign – The goal is simply to either deliver content to visitors or determine the baseline conversion rate for a certain goal before you test.
  • A/A test – The typical purpose is to validate your experiment setup. Specifically, an A/A calibration test is a data reliability and quality assurance procedure to evaluate the implementation of all your experiment comparisons.

Most of your A/A calibration test results should show that the conversion improvement between the identical baseline pages is statistically inconclusive. You should expect this inconclusive result because you made no changes to the original page. However, when running a calibration test, it is important to understand that a difference in conversion rate between identical baseline pages is always possible. The statistical significance of your result is a probability, not a certainty.

As with any experimental process, some percentage of outcomes will be anomalies, because an experiment calculates its results from a random sample of the population of all visitors to your page. To declare any variation significant, an experiment must make a judgment call about how large a trend indicates a true difference. A large enough spurious trend can therefore make it look as though a true difference exists when none actually does. The significance level controls this trade-off between identifying more trends as significant and accepting more errors.
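
To make that trade-off concrete, the sketch below simulates many A/A tests and counts how often a simple fixed-horizon two-proportion z-test flags a spurious "significant" lift at a 90% significance level. This is only an illustration of sampling noise; it is not how Optimizely's Stats Engine computes results, and the conversion rate, sample sizes, and seed are arbitrary.

```python
# Illustration only: a classical fixed-horizon two-proportion z-test on
# simulated A/A data. Optimizely's Stats Engine uses sequential statistics,
# so this is not its actual computation.
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(seed=7)

def aa_test_is_significant(n_per_arm, true_rate, alpha):
    """Simulate one A/A test and report whether a z-test calls it significant."""
    conv_a = rng.binomial(n_per_arm, true_rate)
    conv_b = rng.binomial(n_per_arm, true_rate)
    p_a, p_b = conv_a / n_per_arm, conv_b / n_per_arm
    p_pool = (conv_a + conv_b) / (2 * n_per_arm)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n_per_arm))
    if se == 0:
        return False
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return p_value < alpha

alpha = 0.10  # corresponds to a 90% significance level
runs = 2_000
spurious = sum(
    aa_test_is_significant(n_per_arm=10_000, true_rate=0.05, alpha=alpha)
    for _ in range(runs)
)
print(f"Spurious 'winners' in {runs} A/A tests: {spurious} "
      f"(~{spurious / runs:.1%}, expected ~{alpha:.0%})")
```

Lowering the significance level flags more spurious winners; raising it flags fewer, but then real differences need stronger evidence (and more visitors) to be detected.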

Optimizely Experimentation's statistical approach

The statistical significance threshold is a project-level setting that you can adjust up or down based on your comfort level with statistical error. This also holds true for an A/A test: even when there is no real difference, there is a small chance that Optimizely Experimentation reports a significant result based on underlying trends in the experiment data.
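
As a rough back-of-envelope illustration (not an Optimizely calculation), the chance that a single A/A comparison produces a spurious significant result is roughly one minus the significance setting, so across many independent comparisons the expected number of spurious results scales with that setting. The number of comparisons below is hypothetical.

```python
# Back-of-envelope illustration: expected spurious "significant" results across
# independent A/A comparisons at different significance settings. Assumes a
# simple fixed-horizon view; Stats Engine's sequential behavior differs.
comparisons = 50  # hypothetical number of independent A/A metrics or tests

for setting in (0.90, 0.95, 0.99):
    expected_spurious = comparisons * (1 - setting)
    print(f"{setting:.0%} setting -> ~{expected_spurious:.1f} spurious results "
          f"out of {comparisons} comparisons")
```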

When you examine the results of your A/A test, you should see the following behavior:

  • Your statistical significance will stabilize around a certain value over time.
  • The confidence intervals for your experiment will shrink as more data is collected, narrowing around zero and ruling out larger non-zero lifts (see the sketch after this list).
  • At different points in the test results, the baseline and variation might perform differently, but neither should remain a statistically significant winner indefinitely.
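
For example, the sketch below simulates an A/A test at growing sample sizes and prints a 90% confidence interval for the difference in conversion rates, which tightens around zero as visitors accumulate. It uses a classical Wald interval for illustration, not Optimizely's improvement intervals, and the conversion rate and sample sizes are arbitrary.

```python
# Sketch: a 90% confidence interval for the lift in a simulated A/A test
# tightens around zero as visitors accumulate. Uses a classical Wald interval
# for the difference in conversion rates, not Optimizely's improvement intervals.
import numpy as np

rng = np.random.default_rng(seed=11)
true_rate = 0.05
z_90 = 1.645  # two-sided 90% critical value from the normal distribution

for n_per_arm in (1_000, 10_000, 100_000):
    conv_a = rng.binomial(n_per_arm, true_rate)
    conv_b = rng.binomial(n_per_arm, true_rate)
    p_a, p_b = conv_a / n_per_arm, conv_b / n_per_arm
    diff = p_b - p_a
    se = np.sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    print(f"n={n_per_arm:>7,} per arm: diff={diff:+.4f}, "
          f"90% CI=({diff - z_90 * se:+.4f}, {diff + z_90 * se:+.4f})")
```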

Statistical significance measures how unusual your experiment results would be if there were no difference between your variation and baseline and the observed lift were due to random chance alone. When you observe a lift with 90% or higher statistical significance, the results are more extreme than what you would see in 90% of cases if there were truly no lift. Assuming no difference in performance between the variation and the baseline, the higher your statistical significance, the more unusual your results would be.
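
Conceptually, you can think of statistical significance as one minus the probability of seeing results at least as extreme as yours when there is truly no lift. The sketch below expresses that idea with a fixed-horizon two-proportion z-test and hypothetical conversion counts; the numbers Stats Engine reports are computed sequentially and will not match this simple calculation.

```python
# Conceptual sketch: "significance" as 1 minus the two-sided p-value of a
# fixed-horizon two-proportion z-test. Stats Engine computes significance
# sequentially, so its reported numbers are not produced this way.
from math import erf, sqrt

def significance(conv_a, n_a, conv_b, n_b):
    """How unusual the observed lift would be if there were truly no difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return 1 - p_value

# Hypothetical counts: 500/10,000 conversions vs. 540/10,000.
print(f"Significance: {significance(500, 10_000, 540, 10_000):.1%}")
```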

What this means when interpreting a regular A/B experiment

Always pay attention to the statistical significance in your tests, and be skeptical of implementing variations that do not reach your chosen significance level.

With Stats Engine, Optimizely Experimentation accurately represents the likelihood of error regardless of when you look at your results. Ultimately, you may need to find a happy medium between declaring winning and losing variations with high confidence (which requires more visitors) and the opportunity cost of not being able to run as many experiments in that time.
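
The reason looking at results at any time matters is that repeatedly checking a classical fixed-horizon test ("peeking") inflates the chance of declaring a spurious winner. The sketch below shows that inflation on simulated A/A data; it motivates a sequential approach like Stats Engine but does not implement it, and the checkpoints, conversion rate, and seed are arbitrary.

```python
# Sketch: repeatedly "peeking" at a fixed-horizon z-test inflates the error
# rate in an A/A test. This motivates a sequential method like Stats Engine;
# it does not implement Stats Engine itself.
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(seed=3)
true_rate, alpha = 0.05, 0.10
checkpoints = [2_000, 4_000, 6_000, 8_000, 10_000]  # visitors per arm at each look

def p_value(conv_a, conv_b, n):
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

def ever_flagged_significant():
    """Run one A/A test and check the z-test at every checkpoint."""
    max_n = checkpoints[-1]
    cum_a = np.cumsum(rng.random(max_n) < true_rate)
    cum_b = np.cumsum(rng.random(max_n) < true_rate)
    return any(p_value(cum_a[n - 1], cum_b[n - 1], n) < alpha for n in checkpoints)

runs = 1_000
hits = sum(ever_flagged_significant() for _ in range(runs))
print(f"A/A runs flagged significant at ANY checkpoint: {hits / runs:.1%} "
      f"(vs. ~{alpha:.0%} for a single look)")
```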

See Confidence intervals and improvement intervals for more information.