Statistical Significance Calculator

Last updated: March 11, 2026
Reviewed by: LumoCalculator Team

Compare a control and variant with a two-proportion z-test: conversion rates feed a pooled standard error, the rate gap becomes a z-score, and the z-score becomes a p-value. The result helps teams judge whether an observed uplift is strong enough to support an experiment decision.

Experiment Inputs

Compare control and variant conversion counts with a two-proportion significance check.

  • Control: the current experience or holdout group.
  • Variant: the new copy, layout, or offer being tested.
  • Confidence level: the significance threshold for the test (95% in the examples on this page).
  • Hypothesis direction: two-sided, or one-sided when only one direction matters (for example, Variant < Control).

This calculator assumes independent visitors, a binary conversion event, and one control compared with one variant.

Significance Verdict

Default example, evaluated at 95% confidence with a two-sided hypothesis:

  • Verdict: the variant clears the selected significance threshold (p = 0.0167)
  • Control conversion rate: 4.50% (540 / 12,000)
  • Variant conversion rate: 5.16% (612 / 11,850)
  • Absolute lift: +0.66 pp · Relative lift: +14.8%
  • 95% confidence interval: +0.12 pp to +1.21 pp
  • Observed power: 66.8% (moderate)
  • Incremental conversions per 10k visitors: 66.5

Interpretation

At 95% confidence, the variant is statistically higher than the control. The observed lift is +0.66 pp (+14.8%) with p = 0.0167.

Decision hint

The result is statistically significant and materially sized enough for a rollout discussion. Confirm sample quality, guardrail metrics, and segment consistency before final launch.

Detailed Breakdown

Use this section to pressure-test how strong the evidence is before you roll out the variant.

  • Pooled conversion rate: 4.83%
  • Standard error: 0.0028
  • Z-score: 2.393
  • Cohen's h: 0.031
  • Current confidence threshold: 95%
  • Tail setting: Two-sided
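Cohen's h in the breakdown is an arcsine-transformed effect size for two proportions. As a minimal sketch, assuming the standard definition h = 2·(arcsin√p₂ − arcsin√p₁) and the default example counts from this page:

```python
import math

# Default example rates from this page
p_control = 540 / 12_000    # 0.045
p_variant = 612 / 11_850    # ~0.0516

# Cohen's h: difference of arcsine-transformed square-root proportions
h = 2 * (math.asin(math.sqrt(p_variant)) - math.asin(math.sqrt(p_control)))
print(round(h, 3))  # 0.031
```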

  • 80% power target: 16,328 visitors per variant, recommended for a clearer read.
  • 90% power target: 21,858 visitors per variant, useful when false negatives are expensive.
  • Current traffic gap: 4,478 additional visitors per variant to reach the 80% power target.

Assumption notes

  • Each visitor should appear in only one group during the test window.
  • The conversion event should be binary and measured the same way for both groups.
  • Confidence intervals use the difference in conversion rates, not the relative lift.

What to review before rollout

  • Check that the experiment was not stopped early after repeated peeking.
  • Confirm guardrail metrics such as refunds, churn, or downstream activation.
  • Segment the result if major traffic sources or devices behaved differently.

Editorial & Review Information

Reviewed on: 2026-03-11

Published on: 2025-09-10

Author: LumoCalculator Editorial Team

What we checked: Formula math, default example arithmetic, interpretation thresholds, boundary statements, and source accessibility.

Purpose and scope: This page supports experiment planning for binary conversion events such as signups, checkouts, clicks, or submissions. It is not a replacement for a full experimentation platform or a multi-metric decision framework.

How to use this review: Keep user assignment clean, lock the primary metric before launch, compare the p-value with the selected confidence level, then review practical impact and guardrail metrics before rollout.

Use Scenarios

Landing-page and signup tests

Use the calculator when a product, growth, or lifecycle team is comparing one conversion event across two experiences and needs a fast “ship, hold, or keep running” readout.

Feature flags and rollout checks

Pair a primary conversion metric with a one-sided “variant worse than control” guardrail check before rolling a feature from limited beta to a broader audience.

Survey, email, and ops pilots

The same logic works for response rate, click-through rate, or task completion rate when each observation is a yes-or-no outcome and the two groups stay independent.

Formula Explanation

1) Conversion rate and absolute lift

Conversion rate = conversions / visitors

Absolute lift = variant rate - control rate

This first step turns raw counts into comparable rates. Absolute lift is the percentage-point gap between the control and variant rates. Relative lift then divides that gap by the control rate so teams can discuss "up 12%" and "up 0.5 points" at the same time.
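Step 1 can be sketched in a few lines of Python, using the default example counts from this page:

```python
# Step 1: raw counts -> rates and lift (default example from this page)
control_conversions, control_visitors = 540, 12_000
variant_conversions, variant_visitors = 612, 11_850

control_rate = control_conversions / control_visitors   # 0.045
variant_rate = variant_conversions / variant_visitors   # ~0.0516
absolute_lift = variant_rate - control_rate             # ~+0.0066 (+0.66 pp)
relative_lift = absolute_lift / control_rate            # ~+0.148 (+14.8%)
```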

2) Pooled standard error

Pooled rate = (control conversions + variant conversions) / total visitors

Standard error = sqrt(pooled rate x (1 - pooled rate) x (1 / control visitors + 1 / variant visitors))

The pooled rate creates the no-difference benchmark used by the z-test. The standard error then measures how much random variation we would expect around the observed lift if there were no true effect.
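Continuing the same example, the pooled rate and standard error of step 2 can be sketched as:

```python
import math

# Default example counts
c_conv, c_n = 540, 12_000
v_conv, v_n = 612, 11_850

# Pooled rate: the shared "no difference" benchmark
pooled_rate = (c_conv + v_conv) / (c_n + v_n)   # ~0.0483

# Standard error of the rate gap under the null hypothesis
std_error = math.sqrt(pooled_rate * (1 - pooled_rate) * (1 / c_n + 1 / v_n))  # ~0.0028
```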

3) Z-score and p-value

Z-score = absolute lift / standard error

P-value = probability of seeing a difference this large under the null hypothesis

A larger z-score means the observed gap is less compatible with pure noise. The p-value converts that z-score into a decision threshold. If the p-value is smaller than alpha, the result is marked statistically significant at the selected confidence level.
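A standard normal tail probability is all that is needed to turn the z-score into a p-value. The sketch below uses `math.erfc` so it needs no external packages; the counts are the default example from this page:

```python
import math

def normal_sf(z: float) -> float:
    """Upper-tail probability P(Z > z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Lift and pooled standard error, recomputed from the default example counts
c_conv, c_n, v_conv, v_n = 540, 12_000, 612, 11_850
lift = v_conv / v_n - c_conv / c_n
pooled = (c_conv + v_conv) / (c_n + v_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / v_n))

z = lift / se                          # ~2.393
p_two_sided = 2 * normal_sf(abs(z))    # ~0.0167
p_one_sided = normal_sf(z)             # only for a pre-committed "variant better" test
```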

4) Confidence interval, power, and sample planning

Confidence interval = absolute lift +/- critical value x unpooled standard error

Required sample per variant ≈ 2 x ((critical value + power target z) / effect size)^2, where effect size is measured as Cohen's h

The confidence interval shows the range of rate gaps still compatible with the data. Power and required sample size translate the same evidence into an execution question: do you have enough traffic to detect the effect size you care about with acceptable false-negative risk?
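Step 4 can be sketched end to end. The critical values here (1.96 for two-sided 95%, 0.8416 for 80% power) and the use of Cohen's h as the effect size are standard assumptions, not outputs quoted from the tool:

```python
import math

def normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard normal."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Default example counts
c_conv, c_n, v_conv, v_n = 540, 12_000, 612, 11_850
p_c, p_v = c_conv / c_n, v_conv / v_n
lift = p_v - p_c
z_crit = 1.959964          # two-sided 95% critical value

# Confidence interval uses the unpooled standard error
se_unpooled = math.sqrt(p_c * (1 - p_c) / c_n + p_v * (1 - p_v) / v_n)
ci_low = lift - z_crit * se_unpooled    # ~+0.0012 (+0.12 pp)
ci_high = lift + z_crit * se_unpooled   # ~+0.0121 (+1.21 pp)

# Observed power: chance an effect of this size would clear the threshold
pooled = (c_conv + v_conv) / (c_n + v_n)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / v_n))
power = normal_cdf(lift / se_pooled - z_crit)   # ~0.668

# Sample planning with Cohen's h as the effect size
h = 2 * (math.asin(math.sqrt(p_v)) - math.asin(math.sqrt(p_c)))
z_power_80 = 0.841621
n_per_variant_80 = 2 * ((z_crit + z_power_80) / h) ** 2   # roughly 16,300
```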

Example Cases

Case 1: Homepage CTA uplift

Inputs

  • Control: 12,000 visitors, 540 conversions
  • Variant: 11,850 visitors, 612 conversions
  • Confidence level: 95%
  • Hypothesis: Two-sided

Computed Results

  • Control rate: 4.50%
  • Variant rate: 5.16%
  • Absolute lift: +0.66 pp
  • P-value: 0.0167
  • Observed power: 66.8%

Interpretation

The variant clears a 95% threshold, but the traffic level is still below a comfortable power target for repeatability.

Decision Hint

Treat this as a launch candidate, then verify downstream quality metrics before a full rollout.

Case 2: Email subject-line test

Inputs

  • Control: 800 sends, 48 clicks
  • Variant: 810 sends, 59 clicks
  • Confidence level: 95%
  • Hypothesis: Two-sided

Computed Results

  • Control rate: 6.00%
  • Variant rate: 7.28%
  • Absolute lift: +1.28 pp
  • P-value: 0.3011
  • Observed power: 17.9%

Interpretation

The uplift looks promising, but the interval is wide and the test is badly underpowered.

Decision Hint

Keep the experiment running or narrow the minimum detectable effect before calling a winner.

Case 3: Guardrail drop check

Inputs

  • Control: 9,000 users, 396 conversions
  • Variant: 9,050 users, 340 conversions
  • Confidence level: 95%
  • Hypothesis: Variant < Control

Computed Results

  • Control rate: 4.40%
  • Variant rate: 3.76%
  • Absolute lift: -0.64 pp
  • P-value: 0.0145
  • Observed power: 70.6%

Interpretation

A one-sided guardrail check indicates the variant is materially worse than control.

Decision Hint

Pause or roll back the treatment and inspect UX friction or audience-quality shifts before relaunch.
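For a one-sided guardrail like Case 3, the only change from the two-sided calculation is which tail of the normal distribution is read. A sketch using the Case 3 counts:

```python
import math

def normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard normal."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Case 3: one-sided "variant worse than control" guardrail
c_conv, c_n, v_conv, v_n = 396, 9_000, 340, 9_050
lift = v_conv / v_n - c_conv / c_n      # negative: ~-0.64 pp
pooled = (c_conv + v_conv) / (c_n + v_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / v_n))
z = lift / se                           # ~-2.18

# H1 is "variant < control", so the p-value is the lower tail
p_one_sided = normal_cdf(z)             # ~0.0145
```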

Boundary Conditions

  • Visitors must be greater than zero for both groups, and conversions cannot exceed visitors.
  • Use this calculator only for binary outcomes such as converted / not converted or clicked / did not click.
  • The two groups should be independent: repeated exposure, identity-stitching problems, or audience contamination can distort the p-value.
  • Statistical significance does not fix early stopping, multiple-metric fishing, or post-hoc hypothesis switching.
  • Confidence intervals here describe the difference in rates, not the exact business value of the experiment.
  • Use a mean-based test, regression, or bootstrap instead when the target metric is revenue, time, or another continuous measurement.


Frequently Asked Questions

What does "statistically significant" mean in this calculator?
It means the observed conversion-rate gap is unlikely to be random noise under the selected confidence threshold. At 95% confidence, the p-value must be below 0.05 before the calculator labels the result significant.

Why can a positive uplift still be non-significant?
Uplift and significance answer different questions. Uplift shows direction and size. Significance checks whether the evidence is strong enough to separate that uplift from normal variation. Small samples and rare conversions often produce positive lift with weak evidence.

How is the p-value calculated here?
This page uses a two-proportion z-test. It builds control and variant conversion rates, calculates a pooled standard error, converts the rate gap into a z-score, and then translates that z-score into a p-value using the selected one-sided or two-sided hypothesis.

When should I use a one-sided test instead of a two-sided test?
Use a two-sided test when any meaningful difference matters, including downside risk. Use a one-sided test only when you pre-committed to checking one direction, such as "variant is better than control" or "variant is worse than control," and would not act on the opposite direction.

What does the power estimate tell me?
Observed power is a quick planning signal for how likely the current traffic would be to detect an effect of the size you are seeing now. Low power does not prove there is no effect. It usually means the test needs more traffic or a larger minimum detectable effect.

Does statistical significance guarantee business value?
No. A tiny but statistically significant lift can still be too small to matter after implementation cost, engineering effort, or downstream guardrail impact. Treat significance as an evidence threshold, then pair it with revenue, margin, retention, or operational impact.

Can I use this for survey completions or email clicks?
Yes, as long as the measured event is binary and the two groups are independent. The same proportion test works for signups, click-throughs, opt-ins, submissions, and similar yes-or-no outcomes.

What is not covered by this calculator?
It does not correct for multiple comparisons, sequential peeking, or overlapping users. It is also not the right tool for continuous outcomes such as order value or session duration, where a mean-based test or bootstrap is more appropriate.