AI workflow · Anonymized travel-tech marketplace · paid-media A/B testing

How we auto-evaluate A/B tests for clients with AI

A way of working more than a build. Every test starts by brainstorming with AI against the client's strategy doc, then gets written into a short brief. After that, AI pulls the numbers every Monday and walks us through them in the same format, so we scan progress in seconds and never call a test early. A human still makes the final decision.

4–6 hrs → 30 min · Specialist time per test
3 → 8–10 · Tests one person can run
Every Mon · Hands-off progress digest
Days → min · Test-end to decision

Outcome

What used to eat 4–6 hours a week per test now takes about 30 minutes: set it up, read the final call. One specialist runs 8–10 tests at once instead of 3, a same-format digest lands automatically every Monday, and early gut-feel calls have all but disappeared. Same rigor, far less calendar drag, and the whole thing is a prompt any team can copy.

All numbers here are anonymized and illustrative of the pattern, not exact client data. This one is less "look what we built" and more a way of working — the whole point is that any team can copy it with Claude Code or any other LLM.

The idea

Running an A/B test isn't the hard part. The hard part is everything around it: pulling the numbers every week, reading them carefully, and — the big one — not talking yourself into calling a winner three days in. That weekly grind is where senior time quietly disappears, and where good tests get killed early on noise.

So we built a simple, repeatable way to run tests with AI doing the busywork. Three phases, and only two of them need a human:

Set the test up properly — once, with a human.
Let AI walk us through the numbers every week — automatically.
Make the final call when the test ends — always a human.

That's it. Here's each phase.

Phase 1 — Set up the test (human, once)

Every test starts as a conversation. We brainstorm with our AI (Claude Code, but any LLM works) using the strategy document we keep for each client. That matters: the ideas come from that client's actual goals and constraints, not a generic "best practices" list. Because the AI has the strategy doc in context, the tests it suggests are relevant to the account.

When we land on a test worth running, we write it down in a short brief. This is the most important step in the whole workflow — the brief becomes the context the AI uses every week to judge whether the test is going well. No brief, no automation.

We keep the brief as a simple checklist. Here's the template — ours lives as a Notion page:

🧪 A/B Test Brief: “[Test name]”

Status: 🟢 Running
Channel: Google Ads · GA4
Duration: 4 weeks · min 100 conv / cell
Owner: Performance team

Hypothesis

We believe [the change] will [improve this metric] because [reason grounded in the client’s strategy doc].

The brief: fill in once

What exactly are we changing? · control vs. treatment, defined precisely

Why this test? · which client goal from the strategy doc it serves

What does a win look like? · the success metric and the threshold

How long does it run? · and the minimum data before we can call it

When would we stop early? · guardrails / abort triggers

What could fool us? · seasonality, launch lag, known confounds

Which numbers should the weekly digest show? · the segments / cuts to track

💡 This brief is the context the weekly AI digest reads to judge progress. Write it once; the automation reuses it every week.

Nothing here is technical. It's just forcing yourself to answer, in plain language, what you're testing, why, what winning looks like, how long it runs, and when you'd pull the plug early.
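If you would rather keep the brief next to the automation than in a Notion page, the checklist maps onto a small record. The following is a minimal sketch, not our actual setup: the class name, field names, and both helper functions are hypothetical, one field per question in the checklist. It refuses to render a brief with blank answers, which is the "no brief, no automation" rule in code form.

```python
from dataclasses import dataclass, fields


@dataclass
class ABTestBrief:
    """One field per checklist question. Field names are illustrative."""
    name: str
    change: str          # control vs. treatment, defined precisely
    why: str             # which client goal from the strategy doc it serves
    win: str             # success metric and threshold
    duration: str        # run length and minimum data before a call
    stop_early: str      # guardrails / abort triggers
    confounds: str       # seasonality, launch lag, known confounds
    digest_cuts: str     # segments the weekly digest should show


def missing_fields(brief: ABTestBrief) -> list[str]:
    """Names of checklist fields that were left blank."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name).strip()]


def render_brief(brief: ABTestBrief) -> str:
    """Plain-text brief for the weekly prompt; refuses an incomplete brief."""
    blanks = missing_fields(brief)
    if blanks:
        raise ValueError("brief is incomplete: " + ", ".join(blanks))
    return "\n".join(f"{f.name}: {getattr(brief, f.name)}" for f in fields(brief))
```

The text returned by `render_brief` is what gets pasted (or injected) as context each week, so every digest is judged against the same written answers.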

Phase 2 — Track it every week (automated)

Once the test is live, the tracking runs itself. Every Monday morning:

The numbers get pulled automatically. The data pull runs through an MCP server connected to Google Ads, GA4, or whatever platform the test lives on, so the AI fetches the latest numbers itself, with no copy-pasting from dashboards.
The AI reads them against the brief and writes the same digest, every time. Same layout, same order, same cuts, so we can scan it in seconds. It tells us how far into the test we are, how the numbers moved since last week, and whether we're trending toward the "win" the brief defined.

The one rule: it reports, it doesn't advise. No "you should pause this." During the test we only want to see what's happening — the decision waits until the end (more on why below).

Here's what lands in Slack every Monday:

Marketing Intelligence · APP · Mon 9:01 AM

📊 A/B Test: "Landing page v2"   ·   Day 21 of 28   (data through Mon 2026-01-19)

Status:   still running — numbers only, decision at Day 28.
Trending: gap is widening, not closing. Treatment behind on the success metric.

HEADLINE  (success metric = ROAS)
  Treatment   1.08×   (was 1.14×,  -5% WoW)
  Control     1.22×   (was 1.20×,  +2% WoW)
  Gap         -11%    (was -5%)   ← widening

BY COUNTRY  (treatment vs control)
  🇨🇿 Czechia    +6%    (was +14%)
  🇵🇱 Poland     -3%    (was  -1%)
  🇩🇪 Germany   -18%    (was -22%)
  🇦🇹 Austria    +9%    (was +12%)

TREND  (treatment as % of control, each check)
  Day 7   →  96%
  Day 14  →  95%
  Day 21  →  89%

Brief said:  ship treatment only if gap ≥ 0% with ≥100 conv/cell.
             Currently -11%, 168 conv/cell.
Guardrails:  none tripped.

📅 Next check: Mon 2026-01-26 (final readout)

A few things make the format work. The (was X%) on every row shows movement at a glance, without digging up last week's message. The trend block answers the only question that matters mid-test: is the gap closing, holding, or widening? And there are no imperatives: nothing tells you to act, because acting comes at the end.
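The arithmetic behind the HEADLINE block is simple enough to check by hand or in a few lines of code. Here is a minimal sketch using the illustrative ROAS figures from the digest above. The function names and the 1-point `tolerance` for "holding" are assumptions, and the direction label compares the distance between the two arms, not the distance to the brief's threshold.

```python
def pct_change(now: float, before: float) -> float:
    """Relative change from `before` to `now`, in percent."""
    return (now / before - 1.0) * 100.0


def gap_direction(gap_now: float, gap_before: float, tolerance: float = 1.0) -> str:
    """Label the move in the treatment-vs-control gap since the last check.

    `tolerance` is an assumed dead band in percentage points: smaller moves
    count as "holding".
    """
    delta = abs(gap_now) - abs(gap_before)
    if delta > tolerance:
        return "widening"
    if delta < -tolerance:
        return "closing"
    return "holding"


def headline(t_now: float, t_before: float, c_now: float, c_before: float) -> str:
    """Render the HEADLINE block: each arm with its week-over-week change."""
    gap_now = pct_change(t_now, c_now)          # treatment relative to control
    gap_before = pct_change(t_before, c_before)
    return "\n".join([
        f"  Treatment   {t_now:.2f}×   (was {t_before:.2f}×, {pct_change(t_now, t_before):+.0f}% WoW)",
        f"  Control     {c_now:.2f}×   (was {c_before:.2f}×, {pct_change(c_now, c_before):+.0f}% WoW)",
        f"  Gap         {gap_now:+.0f}%    (was {gap_before:+.0f}%)   ← {gap_direction(gap_now, gap_before)}",
    ])


print(headline(1.08, 1.14, 1.22, 1.20))
```

With those inputs the gap works out to 1.08 / 1.22 − 1 ≈ −11% this week against 1.14 / 1.20 − 1 = −5% last week, which is why the digest labels it "widening".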

Run it yourself

You don't need our setup to do this. Paste this into your LLM, either with your numbers or wired to an MCP server that pulls them for you:

You're tracking an A/B test for me. Treat this as numbers-only: do NOT recommend
pausing, shipping, or killing — decisions wait until the test ends.

TEST BRIEF:
<paste the brief — hypothesis, success metric + threshold, duration,
 guardrails, and which segments to report>

THIS WEEK'S NUMBERS (and last week's, for comparison):
<paste them, or pull them via the connected MCP from Google Ads / GA4>

Write a progress digest in EXACTLY this format every week, so I can scan it in seconds:
  1) Header: test name, day X of Y, data date.
  2) Status + trend: is the gap to the success threshold closing, holding, or widening?
  3) HEADLINE: the success metric, treatment vs control, each with the change since
     last week in (parentheses).
  4) BY SEGMENT: one row per cut named in the brief — value vs control, plus the
     change since last week in (parentheses).
  5) TREND: the headline metric at each past check, oldest to newest.
  6) One line restating the brief's success criterion and where we stand against it.
  7) Guardrails: list any from the brief that have tripped, or "none tripped".

If your LLM is connected to an MCP server that can reach Google Ads or Google Analytics, it can pull the numbers itself and post the digest on a schedule. If not, paste them in by hand. The value is in the constant format and the read against the brief, not the plumbing.
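If you go the paste-by-hand route but still want the prompt identical every week, a small template function is enough. This is a sketch under assumptions: the function names and metric keys are hypothetical, the opening lines are condensed from the prompt above, and `format_rules` is where you paste the seven-part format list unchanged. How the finished prompt reaches your LLM is left out on purpose.

```python
PROMPT_TEMPLATE = """\
You're tracking an A/B test for me. Treat this as numbers-only: do NOT recommend
pausing, shipping, or killing. Decisions wait until the test ends.

TEST BRIEF:
{brief}

THIS WEEK'S NUMBERS:
{this_week}

LAST WEEK'S NUMBERS:
{last_week}

{format_rules}
"""


def format_numbers(numbers: dict[str, float]) -> str:
    """Render metric name/value pairs as one 'name: value' line each."""
    return "\n".join(f"{name}: {value}" for name, value in numbers.items())


def build_weekly_prompt(
    brief: str,
    this_week: dict[str, float],
    last_week: dict[str, float],
    format_rules: str,
) -> str:
    """Assemble the same prompt every week from the brief and two snapshots."""
    if not brief.strip():
        raise ValueError("no brief, no automation: the brief text is empty")
    return PROMPT_TEMPLATE.format(
        brief=brief.strip(),
        this_week=format_numbers(this_week),
        last_week=format_numbers(last_week),
        format_rules=format_rules.strip(),
    )
```

Keeping the template in one place means the only things that change from week to week are the two number snapshots, which is what keeps the digest format constant.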

Phase 3 — Evaluate at the end (human)

Every test runs for a set period, long enough to collect the data the brief said we'd need. When that period is up, the AI does one last pass: did the data support the hypothesis, is the result statistically significant, and how did the test go overall? It closes with a clear recommendation.

But that's only the first layer. The final call is always a person's. That's how we work: when a test ends, a human reads the end-of-test readout, checks it against everything the numbers can't see — seasonality, a competitor move, a client constraint — and decides whether we actually ship the change. The AI gets us most of the way there; the human makes the decision.
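The "statistically significant" part of that first-pass read can be sketched as follows, under loud assumptions: the success metric here is a conversion rate (the ROAS example in the digest would need a different test, such as a bootstrap over revenue), the test is a standard two-proportion z-test, and `alpha = 0.05` is a conventional default, not a figure from our briefs. The function names are hypothetical. The 100 conversions per cell minimum mirrors the brief template.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def first_pass_readout(
    conv_control: int,
    n_control: int,
    conv_treatment: int,
    n_treatment: int,
    min_conv_per_cell: int = 100,
    alpha: float = 0.05,
) -> dict:
    """Summarize the end-of-test numbers for the human who makes the call."""
    enough_data = min(conv_control, conv_treatment) >= min_conv_per_cell
    z, p_value = two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment)
    return {
        "enough_data": enough_data,
        "z": z,
        "p_value": p_value,
        # Only flag significance once the brief's minimum data is met.
        "significant": enough_data and p_value < alpha,
    }
```

The dictionary is an input to the human decision, not the decision itself: it says nothing about seasonality, competitor moves, or client constraints.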

What other teams could steal

Write the test down before you run it. The brief — what you're testing, why, what a win looks like, how long it runs, when you'd stop early — is the context the weekly digest reads to judge progress. Skip it and the automation has nothing to measure against. It's the highest-leverage 30 minutes in the whole workflow.
Numbers only during the test; recommendations only at the end. The most counterintuitive part. Everyone wants the weekly update to tell them what to do — resist. A digest that interprets half-finished data trains the team to kill tests on Week-1 vibes. The whole point of a test is that the call is made on final data.
Keep a human on the final call. Let AI do the pulling, the formatting, and the first-pass read of the result. Let a person decide whether it actually gets applied — checking it against what the numbers can't see. The automation buys back time; it doesn't take the wheel.
