How to Run A/B Tests That Actually Work
Run A/B tests that survive replication: pick one metric, power the test, stop peeking, and read results with the practical-significance check most teams skip.
Power the test first. With a baseline conversion rate of 3% and a target 15% relative lift, you need roughly 30,000 visitors per variation at 80% power and a 0.05 significance threshold[4]. Stopping early when a p-value dips below 0.05 inflates Type I error from 5% to north of 25%[2].
Run one variation, one primary metric, one full weekly cycle. Read the result against a minimum detectable effect you chose up front — not the widest confidence interval that happens to cross zero.
Most A/B-test writeups open with "X% of tests fail" — a number that traces back to vendor marketing, not peer-reviewed data. The honest story is narrower: tests fail when teams underpower them, change two things at once, or stop when early noise looks like signal. Those three failure modes dominate the published methodology literature[1][2].
This guide walks through the four decisions that actually determine whether your test produces a trustworthy answer. As of 2026-Q2, the references below are the canonical starting points — the relevant chapter of the Kohavi/Tang/Xu textbook is available as a free preprint, and the KDD 2017 peeking paper is free to read.
1. One hypothesis, one metric
Before you touch the variant, write the hypothesis in a format the team can argue with:
Changing [specific element] from [current] to [proposed] will change [one primary metric] by at least [minimum detectable effect], because [causal reasoning].
One primary metric. If you track five, the family-wise false positive rate compounds: at α = 0.05 across five independent metrics, the chance of at least one false positive is about 23%[1]. Pick the metric you would actually act on, designate the others as secondary for diagnostic purposes, and commit in writing.
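A quick way to reproduce that figure (pure arithmetic, assuming fully independent metrics; real metrics are usually correlated, which changes the exact number but not the direction of the problem):

```python
# Family-wise false positive rate when k independent metrics are each tested at alpha.
alpha = 0.05
for k in (1, 3, 5, 8):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k} metric(s): P(at least one false positive) = {fwer:.1%}")
# 5 metrics -> ~22.6% (the ~23% above); 8 post-hoc slices -> ~33.7%
```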
Only one element should change between control and variant. If you alter the headline, CTA copy, and hero image together, the test tells you the package works — not which piece did the work. That matters when you want to compound wins rather than replay the same three-change package on every page.
2. Power the test before you launch
Sample size is a function of four numbers: baseline conversion rate, minimum detectable effect (MDE), significance level (α), and statistical power (1 − β). The standard convention in applied experimentation is α = 0.05 and power = 0.80[4].
Working example: your checkout converts at 3.0%. You want to detect a 15% relative lift (to 3.45%) at 80% power. The required sample is roughly 30,000 visitors per variation — so 60,000 total before the test can meaningfully conclude. Drop the MDE to 5% relative and you need about 10x more traffic. That is the single biggest planning mistake smaller teams make: chasing small effects on traffic volumes that cannot detect them in any reasonable timeframe[5].
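For reference, here is a minimal sketch of the same calculation using statsmodels' normal-approximation power solver. Different calculators apply different approximations and corrections, so the solver's output and the round figure above will not match exactly; treat both as planning estimates of the same order.

```python
# Sample size per arm for a two-proportion test (normal approximation, Cohen's h).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.030                          # current checkout conversion rate
target = baseline * 1.15                  # 15% relative lift -> 3.45%

effect_size = proportion_effectsize(baseline, target)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Visitors required per variation: {n_per_arm:,.0f}")
```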
If the calculated sample size exceeds your monthly traffic, the test is not ready to run. Either (a) pick a higher-traffic surface, (b) raise the MDE to something your traffic can detect, or (c) batch the change with other low-risk improvements and ship untested.
3. Run it uninterrupted through a full cycle
Run the test for at least one full weekly cycle even if your sample size is hit on day three. User behavior on Tuesdays differs from Saturdays; promotional calendars, payday effects, and weekday-vs-weekend traffic mix all introduce systematic variation[3]. A test that finishes inside one business day is measuring whatever traffic mix happened to show up that day.
Do not stop the test early because it looks like a winner. This is the peeking problem, and it is not a rounding error. The Johari et al. simulation at KDD 2017 found that teams that checked p-values daily and stopped on the first significant result had an empirical Type I error rate above 25%, not 5%[2]. Five times the false-positive rate you thought you were accepting.
Two defenses. First, decide the stopping rule before launch — usually sample-size-reached-and-one-full-week. Second, if you need the ability to peek, use a sequential testing method (mSPRT, always-valid p-values, or Bayesian posteriors with pre-registered priors) designed for continuous monitoring[2]. Do not mix frequentist p-values with continuous monitoring.
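To see the inflation for yourself, here is a minimal A/A simulation of daily peeking with a naive two-proportion z-test; the traffic volume, horizon, and conversion rate are illustrative choices, not parameters from the paper.

```python
# A/A peeking simulation: both arms share the same true rate, we test daily and
# stop at the first p < 0.05, then count how often a "winner" is declared.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true, daily_n, days, sims, alpha = 0.03, 1_000, 28, 2_000, 0.05

false_positives = 0
for _ in range(sims):
    conv_a = conv_b = n = 0
    for _day in range(days):
        n += daily_n
        conv_a += rng.binomial(daily_n, p_true)
        conv_b += rng.binomial(daily_n, p_true)
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = abs(conv_a / n - conv_b / n) / se
        if 2 * norm.sf(z) < alpha:        # "looks like a winner" -> stop early
            false_positives += 1
            break

print(f"Empirical Type I error with daily peeking: {false_positives / sims:.1%}")
# A single fixed-horizon test on the same data would sit near the nominal 5%.
```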
4. Read results with practical significance
Statistical significance tells you the difference is unlikely to be noise. Practical significance tells you it matters. A 0.1 percentage-point lift on a high-traffic site can reach p < 0.05 while being smaller than the implementation cost.
Report three numbers together: point estimate, 95% confidence interval, and the MDE you set before launch. If the interval spans your MDE — for example, the variant shows a 2% lift with CI [−1%, 5%] and your MDE was 5% — the test is inconclusive, not a loss. Re-power it or rethink the hypothesis.
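One way to mechanize that read is sketched below; the helper and its decision labels are illustrative, not a standard API, and the counts in the example are made up.

```python
# Point estimate, 95% CI on the absolute lift, and a read against a pre-registered MDE.
import math

def readout(conv_c, n_c, conv_t, n_t, mde_abs, z=1.96):
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    lo, hi = lift - z * se, lift + z * se
    if hi <= 0:
        call = "loss"
    elif lo >= mde_abs:
        call = "win: interval clears the pre-registered MDE"
    elif lo > 0 and hi < mde_abs:
        call = "statistically real but below practical significance"
    else:
        call = "inconclusive: interval spans zero and/or the MDE; re-power or rethink"
    return lift, (lo, hi), call

# 3.00% control vs 3.20% treatment on 30,000 visitors per arm, MDE = 0.45 points absolute.
print(readout(conv_c=900, n_c=30_000, conv_t=960, n_t=30_000, mde_abs=0.0045))
```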
Segment the result, but only in pre-registered ways. Slice a result eight ways after the fact and the chance of at least one spurious "significant" subgroup at α = 0.05 is roughly one in three (1 − 0.95^8 ≈ 34%). Pre-registration is the guardrail.
Failure modes worth naming
- Peeking with frequentist stops. Inflates Type I error by roughly 5x in simulation[2].
- Underpowered tests. A test designed to detect a 15% lift running on traffic that can only resolve a 40% lift is an expensive coin flip.
- Variant interaction on the same page. Two tests touching overlapping DOM surfaces contaminate each other's results; layer them only with explicit orthogonality.
- Novelty effects. A bright new variant can show a short-term lift that decays once the change stops being novel — typical decay windows run 2–4 weeks for UI changes[1].
- Statistical-significance-only reporting. In the typical case, the lift that moves the business is large enough to be visible without p-value drama. If you need four decimal places to see it, the business case is likely thin.
Experimentation works when the process is boring: pick a hypothesis, power it honestly, let it run, read the result against a pre-registered threshold. The interesting part is the hypothesis — everything after should be mechanical.
5. Document the decision, not just the result
A test produces a number. The valuable output is the decision that followed and the reasoning that went into it. Build a standard readout template:
- Hypothesis — including expected effect size and directional reasoning.
- Design — variant, control, traffic allocation, duration, stopping rule.
- Sample size rationale — required N at chosen power and MDE.
- Result — point estimate, confidence interval, p-value, practical significance assessment.
- Decision — ship, don't ship, or iterate. Why.
- Secondary observations — segment splits, unexpected patterns, follow-up hypotheses generated.
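As a sketch, the same template can be captured as a structured record so every readout carries identical fields; the class and field names below mirror the list and are illustrative rather than a standard schema.

```python
# Illustrative readout record mirroring the template above.
from dataclasses import dataclass, field

@dataclass
class ExperimentReadout:
    hypothesis: str                       # expected effect size and directional reasoning
    design: str                           # variant, control, allocation, duration, stopping rule
    required_n_per_arm: int               # sample size rationale at chosen power and MDE
    point_estimate: float                 # observed lift on the primary metric
    ci95: tuple[float, float]             # 95% confidence interval on the lift
    p_value: float
    practically_significant: bool         # read against the pre-registered MDE
    decision: str                         # "ship", "don't ship", or "iterate", and why
    secondary_observations: list[str] = field(default_factory=list)
```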
Over 2+ years this compounds into a high-value corpus. New team members can read what's been tried and why — which prevents the recurring "let's try X" cycle for a variant that failed replication 18 months earlier.
6. Experimentation culture is the multiplier
The statistical discipline is necessary but not sufficient. Teams that consistently extract value from experimentation share cultural habits:
- Pre-registration is default. Hypothesis and stopping rule committed before launch. Reduces motivated reasoning when results are ambiguous.
- Negative results are published. A test that shows no effect is valuable information; teams that only celebrate wins systematically overstate the win rate and under-invest in the discipline.
- Arguments about methodology, not about direction. If someone disputes a test's conclusion, the conversation should be about power, sample selection, or attribution — not about whether the result "feels right." Teams that can't separate these struggle to build trust in experimentation[1].
- Clear ownership of the experimentation roadmap. Without a PM or analyst owning priorities, tests get proposed at random and the portfolio skews to easy-to-build rather than high-leverage ideas.
The cultural work takes 12–24 months to mature. The statistical work takes a quarter. Teams that invest in both, in that order, typically see CRO win rates converge to 30–40% — consistent enough to justify sustained investment, without the false-positive inflation that undermines decision-making[5].
7. Numeric worked example — full pre-registration walkthrough
A B2B SaaS checkout converts at 4.1% baseline. The team proposes a hero redesign targeting a 15% relative lift. Pre-registration document produced before any code ships:
- Hypothesis: Changing the checkout hero from feature-list to outcome-benefit will raise completed-purchase rate from 4.1% to ≥4.72%, because outcome-benefit framing aligned with the primary ICP jobs-to-be-done has outperformed feature framing in our prior 3 landing-page tests at p<0.05.
- Primary metric: Checkout completion rate (one, committed)
- Secondary metrics: ASP, bounce rate, form errors (diagnostic only; no claims made on them)
- Baseline: p1 = 4.1%
- MDE (relative): 15% (target p2 = 4.72%)
- α: 0.05
- Power: 0.80
- Required N per arm: ~25,600
- Monthly qualified traffic: 36,000 total / 18,000 per arm
- Days to N per arm: 43
- Minimum test duration: 43 days + one full weekly cycle = ~50 days
- Stopping rule: N reached AND ≥1 full week elapsed
- Peeking policy: weekly review of qualitative/UX only; no p-value inspection until the stop date[2]
- Segment pre-registrations: desktop vs mobile (two segments only)

Every decision is locked before launch. At stop, the team publishes the result against the pre-registered threshold — regardless of whether a 14.8% or 17.2% result "would have been significant if we peeked earlier."
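The schedule arithmetic in that document is worth scripting so nobody recomputes it by hand mid-test. The sketch below takes the pre-registered N per arm as an input (whatever the team's power calculator produced) and derives the target rate and dates.

```python
# Pre-registration schedule arithmetic for the worked example above.
import math

baseline = 0.041
relative_mde = 0.15
target = baseline * (1 + relative_mde)             # 0.04715, i.e. the >=4.72% threshold

n_per_arm = 25_600                                 # from the team's power calculation
daily_per_arm = 18_000 / 30                        # 36,000/month qualified traffic, split two ways

days_to_n = math.ceil(n_per_arm / daily_per_arm)   # ~43 days
min_duration = days_to_n + 7                       # plus one full weekly cycle, per the stopping rule
print(f"target {target:.3%}, {days_to_n} days to N, minimum duration ~{min_duration} days")
```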
8. More failure modes worth naming
- Treating secondary metrics as tiebreakers. If the primary metric lands flat but a secondary metric crosses p<0.05, the test is flat on the pre-registered question. Claiming the "win" on the secondary metric is HARKing and compounds toward non-replication at the programme level[5].
- Running multiple overlapping tests on the same user. Without orthogonal assignment, test A's treatment effect contaminates test B's measurement. Maintain a single-experiment-per-user policy for meaningful surfaces, or formally orthogonalise with documented interaction matrices (a minimal hashing sketch follows this list).
- Pre-registration skipped "because we know what we want." Without a written stopping rule, every team ends up peeking — the question becomes when someone rationalises ending the test, not whether. Write the stopping rule down before launch or don't run the test.
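A minimal sketch of the orthogonalisation branch, assuming hash-based bucketing: each experiment salts the user ID with its own key, so bucketing in one test is statistically independent of bucketing in any other. The registry and function names are illustrative, not a specific platform's API.

```python
# Deterministic, per-experiment-salted assignment: stable for a given user and
# independent across experiments, so overlapping tests do not share bucketing bias.
import hashlib

EXPERIMENTS = {
    "checkout_hero_v2": {"variants": ["control", "treatment"], "salt": "a8f3"},
    "pricing_page_cta": {"variants": ["control", "treatment"], "salt": "c91d"},
}

def assign(user_id: str, experiment: str) -> str:
    cfg = EXPERIMENTS[experiment]
    digest = hashlib.sha256(f"{cfg['salt']}:{user_id}".encode()).hexdigest()
    return cfg["variants"][int(digest, 16) % len(cfg["variants"])]

# Same user, two experiments: assignments are stable per experiment but unrelated to each other.
print(assign("user_42", "checkout_hero_v2"), assign("user_42", "pricing_page_cta"))
```

This removes correlated bucketing between tests; interaction between the treatments themselves still has to be reasoned about explicitly, as the item above notes.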
As of 2026-Q2, the published methodology literature[1][2] continues to support the same four disciplines: pre-registered hypothesis, calculated sample size, committed stopping rule, and honest read against practical significance. Teams that publish win rates over 50% on under-powered tests are reporting noise, not lift.
References
Primary sources only. No vendor-marketing blogs or aggregated secondary claims.
- 1 Kohavi, Tang, Xu — Trustworthy Online Controlled Experiments (Cambridge, 2020) — free preprint of ch. 3 — accessed 2026-04-24
- 2 Johari, Koomen, Pekelis, Walsh — Peeking at A/B Tests (KDD 2017) — accessed 2026-04-24
- 3 US Census Bureau — Advance Monthly Retail Trade Survey (methodology and standard errors) — accessed 2026-04-24
- 4 Cohen — Statistical Power Analysis for the Behavioral Sciences (2nd ed., Lawrence Erlbaum, 1988) — accessed 2026-04-24
- 5 Ioannidis — Why Most Published Research Findings Are False (PLoS Medicine, 2005) — accessed 2026-04-24