Bayesian or frequentist: which approach is better for AB testing?

By Isabella Beatriz Silva and Bernardo Favoreto

When analyzing AB tests, it's common to ask which statistical approach is best to rely on. This post discusses two of the most widely used approaches in the experimentation market: Bayesian and frequentist.

Benchmark

Before answering this question from a scientific point of view, it's essential to understand how some companies approach AB testing. Therefore, we benchmarked popular AB testing tools on three key aspects:

  • The approach to interpreting test results
  • The key metrics tracked
  • Whether the company's core feature is AB testing

The benchmark shows that companies offering AB testing tools as a core feature prefer the Bayesian approach over the frequentist.

Another critical aspect observed is that companies using the Bayesian approach can present more detailed metrics (e.g., the probability of being best and the potential loss). In contrast, those using the frequentist approach are limited in what they can calculate, so they mostly report simpler metrics (e.g., the conversion rate and the uplift).

Based on this benchmark, the Bayesian approach appears to be an industry standard that provides richer decision-making information, although the frequentist approach is still widely used.

The following table summarizes the approach and metrics used by the popular tools.

| Company | Approach | AB testing as core? | Metrics |
|---|---|---|---|
| AB Tasty | Bayesian | | Conversion rate, reliability, improvement, median growth |
| Dynamic Yield | Bayesian | | Probability to be best, uplift, sessions, revenue, revenue/session |
| Google Optimize | Bayesian | | Modeled conversion rate, probability to be best, probability of beating original, modeled improvement |
| Qubit | Bayesian | | Converters, probability of uplift, probability to beat baseline |
| VWO | Bayesian | | Expected conversion rate, improvement, probability to beat baseline, conversions/visitors, absolute potential loss, probability to beat control |
| Amplitude | Frequentist | | Conversion, chance to outperform baseline |
| Crazy Egg | Frequentist | | Total traffic, visitors, conversions, conversion rate, improvement |
| HubSpot | Frequentist | | Open rate, click rate |
| OneSignal | Frequentist | | Clicks, click-through rate, delivered |
| Optimizely | Frequentist | | Unique conversions/visitors, conversion rate, improvement, confidence interval, statistical significance |
| Unbounce | Frequentist | | Visitors, views, conversions, conversion rate |

The Bayesian Approach

Bayesian statistics is a theory that treats probabilities as the degree of belief in an event (e.g., how certain one is about the conversion rate of a call-to-action button).

The core idea of Bayesian statistics is to update one’s beliefs after being exposed to new evidence: for example, revising the most likely conversion rate after gathering new data on conversion events.
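To make the idea of updating beliefs concrete, here is a minimal sketch using a Beta distribution as the belief about a conversion rate. The Beta prior, the uniform Beta(1, 1) starting point, and the `BetaBelief` class are our assumptions for illustration; they are not taken from any of the tools in the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about a conversion rate, expressed as a Beta(alpha, beta) distribution."""
    alpha: float
    beta: float

    def update(self, conversions: int, visitors: int) -> "BetaBelief":
        # Conjugate update: conversions add to alpha, non-conversions add to beta.
        if not 0 <= conversions <= visitors:
            raise ValueError("conversions must be between 0 and visitors")
        return BetaBelief(self.alpha + conversions,
                          self.beta + (visitors - conversions))

    @property
    def mean(self) -> float:
        # Expected conversion rate under the current belief.
        return self.alpha / (self.alpha + self.beta)


prior = BetaBelief(1.0, 1.0)  # assumed uniform prior: every rate equally plausible
posterior = prior.update(conversions=30, visitors=1000)
print(f"prior mean: {prior.mean:.3f}, posterior mean: {posterior.mean:.4f}")
```

Each new batch of data can be fed to `update` again, so yesterday's posterior becomes today's prior.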

Unlike the frequentist approach, the Bayesian approach treats every unknown quantity as a random variable, which by definition has a probability distribution (e.g., Gaussian) with parameters (e.g., mean, variance). This makes it possible to use the posterior probability distribution to estimate the probability that each variant is the best, and by how much, as well as the potential loss associated with each variant.
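One common way to obtain these two metrics is to draw samples from each variant's posterior and compare them. The sketch below assumes Beta posteriors with a uniform Beta(1, 1) prior and defines the potential loss of a variant as the average gap between the best sampled rate and that variant's sampled rate. The function name, the prior, and this loss definition are illustrative assumptions, not a description of how any specific tool computes them.

```python
import random


def compare_variants(counts, draws=20000, seed=0):
    """counts maps variant name -> (conversions, visitors).

    Returns variant name -> (probability to be best, expected loss).
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in counts}
    loss = {name: 0.0 for name in counts}
    for _ in range(draws):
        # One draw from each variant's Beta posterior (uniform prior assumed).
        sample = {name: rng.betavariate(1 + c, 1 + v - c)
                  for name, (c, v) in counts.items()}
        winner = max(sample, key=sample.get)
        wins[winner] += 1
        for name, rate in sample.items():
            # Conversion rate given up by picking this variant in this draw.
            loss[name] += sample[winner] - rate
    return {name: (wins[name] / draws, loss[name] / draws) for name in counts}


results = compare_variants({"control": (100, 2000), "treatment": (150, 2000)})
for name, (pbb, expected_loss) in results.items():
    print(f"{name}: probability to be best={pbb:.3f}, expected loss={expected_loss:.5f}")
```

Because the comparison is done draw by draw, the same code works for any number of variants.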

These results are crucial to making the best and safest decision after the test ends, which explains why so many arguments in favor of the Bayesian approach concern the quality of its results.

Another key benefit is that there are fewer constraints to Bayesian AB testing. Although it's always recommended to run the test for at least one week, users can stop the test as soon as they believe the results are safe and conclusive enough to make a decision. For example, if there are many sessions and conversions in a day, it might be possible to estimate with high confidence which variant is the best after a single day.

The Frequentist Approach

Frequentist inference is a method developed in the 20th century that grew to be the dominant statistical paradigm, widely used in experimental science. It is a statistically sound approach with valid results, but it presents limitations that aren't attractive in AB testing.

Moreover, interpreting the results is more complex than with the Bayesian approach, an opinion shared by experts in the field. The metrics are confusing and often misinterpreted (there's even a Wikipedia page describing the misuse of the "p-value", the score obtained using the frequentist approach). This is highly undesirable in AB testing, as the results directly impact business decisions.

Briefly put, the frequentist approach to AB testing commonly follows these steps:

  1. Define the control and treatment groups

  2. Define the null and alternative hypotheses

    Usually, these are:

    • Null hypothesis: the conversion rates for the control and treatment groups are the same
    • Alternative hypothesis: the conversion rate for the treatment group is different from the control

    Ideally, the test ends by rejecting the null hypothesis, indicating a difference in conversion rate resulting from the changes made.

  3. Define the confidence level

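    This sets the significance level (one minus the confidence level), which is the threshold the "p-value" is compared against. For example, a 95% confidence level corresponds to a significance level of 0.05.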

  4. Define the test duration

    There are online calculators available to decide the duration based on:

    • The average number of visitors participating in the test (for control and variants)
    • Estimated conversion rate
    • Minimum detectable improvement in conversion rate
    • Total number of variants (including control)

    As these calculators highlight, the test duration for the frequentist approach is typically longer than for the Bayesian approach.

  5. Run the experiment for the predefined period and only then analyze the results

    To avoid false positives, frequentist AB testing doesn't allow data peeking. Thus, users obtain the results only at the end of the test (the result here is whether or not to reject the null hypothesis, according to the "p-value" or test statistic).

    The rationale behind this is that the statistical significance fluctuates during the test, so stopping a test once it reaches statistical significance is a recipe for undesirable outcomes. The image below shows the analysis for two identical variants. As you can see, the p-value varies even when there is no difference between them.

Figure: p-value plotted over time for an A/A experiment under the frequentist approach.

Source: Medium
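The effect of peeking can also be reproduced with a small simulation. The sketch below is ours, not the source of the figure: it assumes two identical variants with a 5% conversion rate, a pooled two-proportion z-test, and a peek every 100 visitors per variant. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(experiments=200, visitors=2000, peek_every=100,
                         rate=0.05, alpha=0.05, seed=1):
    """Share of A/A tests declared significant with peeking vs. one final look."""
    rng = random.Random(seed)
    with_peeking = final_only = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        crossed = False
        for n in range(1, visitors + 1):
            # Both variants share the same true rate, so any "win" is a false positive.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
                crossed = True
        final_significant = p_value(conv_a, visitors, conv_b, visitors) < alpha
        with_peeking += crossed or final_significant
        final_only += final_significant
    return with_peeking / experiments, final_only / experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives when stopping at the first significant peek: {peeking_rate:.2f}")
print(f"false positives when looking only at the end: {fixed_rate:.2f}")
```

Under these assumptions, stopping at the first significant peek should flag a difference noticeably more often than the single final look, even though the variants are identical.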

As the steps above show, the primary constraint is not being able to act on the results before the test ends, which is potentially costly for companies. Moreover, the frequentist result is a simple binary outcome: either the null hypothesis is rejected, indicating a difference between control and variants, or it is not (assuming users interpret the p-value correctly).
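Steps 3 to 5 can be sketched in a few lines. This is a simplified illustration under our own assumptions: a standard normal-approximation sample-size formula for comparing two proportions, 80% power, an even traffic split, and a pooled two-proportion z-test run once at the end. The function names and the example numbers are hypothetical, and no correction for multiple variants is applied.

```python
import math
from statistics import NormalDist

NORMAL = NormalDist()


def sample_size_per_variant(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Visitors needed per variant to detect the given relative lift (two-sided)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    alpha = 1 - confidence  # significance level
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    z_power = NORMAL.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant, daily_visitors, num_variants=2):
    """Days needed when daily visitors are split evenly across all variants."""
    return math.ceil(n_per_variant * num_variants / daily_visitors)


def z_test(conv_control, n_control, conv_treatment, n_treatment, confidence=0.95):
    """Pooled two-proportion z-test; returns (p-value, reject null?)."""
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if se == 0:
        return 1.0, False
    z = (conv_treatment / n_treatment - conv_control / n_control) / se
    p_value = 2 * (1 - NORMAL.cdf(abs(z)))
    return p_value, p_value < 1 - confidence


n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.20)
days = duration_days(n, daily_visitors=1000)
p_value, reject_null = z_test(400, n, 480, n)  # hypothetical counts at the end of the test
print(f"visitors per variant: {n}, duration: {days} days")
print(f"p-value: {p_value:.4f}, reject null hypothesis: {reject_null}")
```

Note that the only output of the final step is the p-value and a yes/no decision; nothing here says how likely the treatment is to be the better variant.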

Frequentist vs. Bayesian

The following table summarizes the key differences between the Bayesian and frequentist approaches to AB testing.

| | Frequentist | Bayesian |
|---|---|---|
| Size of the sample | Predefined | No need to predefine |
| Test duration | Fixed and longer | Flexible and shorter |
| Results' intuitiveness | Low, as the p-value is a derived metric | High, as the results are directly calculated |
| Data peeking during the test | Not allowed | Allowed (with caution) |
| Velocity to make decisions | Slow, as it presents more constraints | Fast, as it has fewer constraints |
| Estimate the probability to be best (PBB) | Not possible | Possible regardless of the variations count |
| Estimate the potential loss | Not possible | Possible regardless of the variations count |
| Declaring a winner | When the p-value is below a threshold and the sample size is achieved | When either the potential loss is below a threshold or the PBB is above a threshold |
| Results computation | Less computationally intensive | More computationally intensive due to simulations |
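The "Declaring a winner" row can be read as two small decision rules. The sketch below is only an illustration: the function names and the threshold values (a 0.05 significance level, a 95% PBB, a 0.001 potential loss) are assumptions we picked for the example, not recommended settings.

```python
def frequentist_winner(p_value, observed_n, planned_n, alpha=0.05):
    """A winner is declared only once the planned sample is reached and p < alpha."""
    return observed_n >= planned_n and p_value < alpha


def bayesian_winner(metrics, pbb_threshold=0.95, loss_threshold=0.001):
    """metrics maps variant name -> (probability to be best, potential loss).

    Returns the leading variant if it clears either threshold, otherwise None.
    """
    leader = max(metrics, key=lambda name: metrics[name][0])
    pbb, potential_loss = metrics[leader]
    if pbb >= pbb_threshold or potential_loss <= loss_threshold:
        return leader
    return None


# The p-value is already low, but the planned sample size has not been reached.
print(frequentist_winner(p_value=0.01, observed_n=5000, planned_n=8000))  # False

# The treatment clears both thresholds, so it can be declared the winner.
print(bayesian_winner({"control": (0.03, 0.0040), "treatment": (0.97, 0.0001)}))  # treatment
```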


Why we use the Bayesian approach

Our choice of the Bayesian over the frequentist approach stems mainly from the fact that its results are much richer and more informative than the frequentist approach's binary outcome.

For example, the frequentist approach doesn't estimate the probability that a variation is the best or its potential loss. It relies on reaching statistical significance to conclude a test, and the result simply states whether or not there is a difference between a variation and the baseline.

Moreover, the constraints of the frequentist approach are often unattractive for companies. For instance, it requires predefining a sample size and test duration, a requirement that might discourage companies from using it.

These arguments support our decision to choose the Bayesian over the frequentist approach.
