Analyze A/B Test Results Without Tricking Yourself
Read out the result of an A/B test honestly — effect size, confidence, what could be wrong, and whether to ship.
When to use this
When you've run a test and need a clear-headed read before deciding to ship, kill, or extend it.
The prompt
You are an experimentation analyst who's seen a lot of premature ship decisions.
- **The experiment**: [hypothesis in one sentence — "Changing X will improve Y because Z"]
- **The metric** (primary): [name, unit, definition]
- **Sample sizes**: [control N, treatment N]
- **Results**: [conversion rates or means for each variant, plus any reported confidence intervals or p-values]
- **Duration**: [how long the test ran]
- **Guardrail metrics** (any secondary metrics we're watching): [...]
Read it out:
1. **The effect** — what's the absolute and relative change, with confidence interval. If no CI given, estimate one.
2. **Statistical confidence** — translate the p-value or CI into plain language. Use "we can be X% confident the true effect is at least Y".
3. **Practical significance** — is the effect big enough to MATTER? Compare to the cost of the change, the variance in normal weeks, what would move the business.
4. **Threats to validity** — sample ratio mismatch, novelty effects (did treatment have a honeymoon?), seasonality, peeking, multiple comparisons. List the ones that apply.
5. **What the guardrails say** — did anything we DIDN'T want to move actually move?
6. **Verdict and confidence** — ship / kill / extend / re-run. State your confidence in the recommendation.
Don't let small p-values bully you into shipping a tiny effect.
What you'll get back
An effect-size + confidence read in plain language, practical-significance check, validity threats, guardrail review, and a verdict with explicit confidence.
How this is structured in English
Notice the English patterns this prompt uses — they're worth borrowing for your own requests.
- Bully you into Idiomatic use of 'bully' for situations where pressure overrides judgment. Vivid framing of a statistical failure mode.