An audit, end to end.
An audit goes from a URL plus a short business brief to a ranked backlog of A/B-test hypotheses — each grounded in a named behavioural principle, scored for priority, and stress-tested by a second model pass. The output is a JSON-shaped contract: 14 fields per hypothesis, including the strongest argument against it. The whole pipeline runs in one request, in roughly three minutes.
What happens after you enter a URL.
Seven stages, each one adding evidence or removing noise. The model never sees a bare URL — it sees a structured, multimodal payload.
The model sees what a careful auditor would see.
A multimodal prompt assembled in build-payload.ts, with real evidence attached.
Three image blocks (desktop, mobile, and a throttled-3G mobile render) plus a structured text block carrying the business context, DOM signals, Core Web Vitals, the live GA4 snapshot, the prior audits we’ve already run on this domain, and the outcomes you’ve logged.
The payload is framed as evidence, not as a list of claims, so the model reasons over what’s actually there.
- Business context
- DOM signals
- Core Web Vitals
- Live GA4 snapshot
- Prior audits on this domain
- Logged test outcomes
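As a rough illustration of the payload described above, the sketch below assembles three image blocks and one structured text block. Every type name, field name, and the `buildPayload` signature is an assumption made for this example; the real build-payload.ts is not shown here and may look quite different. Treating a missing GA4 snapshot as `null` is also an assumption.

```typescript
// Hypothetical shapes for illustration only; not the real types in build-payload.ts.
type ImageLabel = "desktop" | "mobile" | "mobile-throttled-3g";
interface ImageBlock { kind: "image"; label: ImageLabel; base64Png: string }
interface TextBlock { kind: "text"; text: string }
type PayloadBlock = ImageBlock | TextBlock;

interface Evidence {
  businessContext: string;
  domSignals: Record<string, unknown>;
  coreWebVitals: Record<string, number>;
  ga4Snapshot: Record<string, number> | null; // assumption: null when no GA4 data is available
  priorAudits: string[];
  loggedOutcomes: string[];
}

interface Screenshots { desktop: string; mobile: string; mobileThrottled3g: string }

// Render one titled section of the structured text block.
function section(title: string, body: unknown): string {
  const rendered = typeof body === "string" ? body : JSON.stringify(body, null, 2);
  return `## ${title}\n${rendered}`;
}

// Three image blocks first, then a single text block holding all six evidence sections.
function buildPayload(shots: Screenshots, evidence: Evidence): PayloadBlock[] {
  const text = [
    section("Business context", evidence.businessContext),
    section("DOM signals", evidence.domSignals),
    section("Core Web Vitals", evidence.coreWebVitals),
    section("Live GA4 snapshot", evidence.ga4Snapshot ?? "not available"),
    section("Prior audits on this domain", evidence.priorAudits),
    section("Logged test outcomes", evidence.loggedOutcomes),
  ].join("\n\n");

  return [
    { kind: "image", label: "desktop", base64Png: shots.desktop },
    { kind: "image", label: "mobile", base64Png: shots.mobile },
    { kind: "image", label: "mobile-throttled-3g", base64Png: shots.mobileThrottled3g },
    { kind: "text", text },
  ];
}
```

Keeping every evidence source under its own titled section means a missing source shows up as an explicit "not available" rather than silently disappearing from the prompt.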
Five stages, in order.
The same discipline runs on every audit. Each stage can stop the process or lower the confidence tier — by design.
Does the site have enough traffic for an 8-week test to detect a real lift? Below threshold, we return BORDERLINE and refuse to ground hypotheses in numbers we don’t trust.
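One way such a traffic gate could work is sketched below. It uses the standard two-proportion sample-size approximation to estimate the smallest relative lift an 8-week, two-arm test could detect, then compares that with a cut-off. The 5% significance level, 80% power, the 20% cut-off, the function names, and the `SUFFICIENT` label are all assumptions for illustration; only `BORDERLINE` and the 8-week window come from the text.

```typescript
const Z_ALPHA = 1.96;  // two-sided 5% significance (assumed)
const Z_BETA = 0.8416; // 80% power (assumed)

// Smallest relative lift detectable when traffic is split evenly across two arms.
function minDetectableRelativeLift(weeklyVisitors: number, baselineRate: number, weeks = 8): number {
  const perArm = (weeklyVisitors * weeks) / 2;
  if (perArm <= 0 || baselineRate <= 0 || baselineRate >= 1) return Infinity;
  const variance = baselineRate * (1 - baselineRate);
  const absoluteLift = Math.sqrt((2 * (Z_ALPHA + Z_BETA) ** 2 * variance) / perArm);
  return absoluteLift / baselineRate;
}

type TrafficVerdict = "SUFFICIENT" | "BORDERLINE"; // "SUFFICIENT" is a hypothetical label

// Returns BORDERLINE when even a large lift could not be detected in the test window.
function trafficGate(weeklyVisitors: number, baselineRate: number, maxRelativeLift = 0.2): TrafficVerdict {
  return minDetectableRelativeLift(weeklyVisitors, baselineRate) <= maxRelativeLift
    ? "SUFFICIENT"
    : "BORDERLINE";
}
```

Under these assumptions, a site with 20,000 weekly visitors and a 3% baseline conversion rate could detect a relative lift of about 8%, while a site with 500 weekly visitors would need a lift of roughly 50% and would be marked BORDERLINE.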
Structural signals: page type, heading hierarchy, form fields, schema markup, trust badges, Core Web Vitals.
A DOM observation is actionable only if GA4 or the flow simulation confirms it. With only one signal, it stays weak.
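The corroboration rule can be pictured as a small check over the signals attached to each observation. The sketch below is an assumption about how that might be encoded; the `Observation` shape, the signal names, and the `actionable`/`weak` labels are hypothetical.

```typescript
type Signal = "dom" | "ga4" | "flow-simulation";
type Strength = "actionable" | "weak";

interface Observation {
  id: string;        // hypothetical identifier
  signals: Signal[]; // which evidence sources support this observation
}

// A DOM observation needs a second, independent signal before it counts as actionable.
function classify(obs: Observation): Strength {
  const seen = new Set(obs.signals);
  const corroborated = seen.has("dom") && (seen.has("ga4") || seen.has("flow-simulation"));
  return corroborated ? "actionable" : "weak";
}
```

For example, `classify({ id: "cta-size", signals: ["dom"] })` stays `"weak"`, while adding `"ga4"` to the same observation makes it `"actionable"`.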
Each surviving hypothesis is grounded in a named principle — from Cialdini, Fogg M/A/T, prospect theory, or cognitive load theory. Mechanism over hunch.
Ten binary questions yield a priority score from 0 to 16, so the backlog is ranked by score rather than by opinion.
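Ten yes/no answers can only span 0 to 16 points if some questions count for more than others. The sketch below shows one way that could work; the specific weights are invented for illustration (they simply sum to 16), and the actual questions and their weights are not given in the text.

```typescript
// Hypothetical weights for the ten questions; chosen only so that they sum to 16.
const WEIGHTS: readonly number[] = [3, 3, 2, 2, 1, 1, 1, 1, 1, 1];

// Sum the weights of every question answered "yes".
function priorityScore(answers: readonly boolean[]): number {
  if (answers.length !== WEIGHTS.length) {
    throw new Error(`expected ${WEIGHTS.length} answers, got ${answers.length}`);
  }
  let score = 0;
  for (let i = 0; i < answers.length; i++) {
    if (answers[i]) score += WEIGHTS[i] ?? 0;
  }
  return score;
}

interface ScoredHypothesis { title: string; score: number }

// Highest score first; the input array is left untouched.
function rankBacklog(items: { title: string; answers: boolean[] }[]): ScoredHypothesis[] {
  return items
    .map((item) => ({ title: item.title, score: priorityScore(item.answers) }))
    .sort((a, b) => b.score - a.score);
}
```

Because every answer is binary and the weights are fixed, two people scoring the same hypothesis with the same answers always get the same rank.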
Every finding is argued against.
Before a hypothesis reaches you, a second pass tries to break it. Here’s a generic example — the hypothesis, then the self-critique that travels with it.
If we make the primary add-to-cart button a full-width, thumb-reachable tap target on mobile product pages, mobile add-to-cart rate will rise — because the control button is below a comfortable touch-target size and sits outside the thumb zone.
The tier tells you what we actually had.
Every hypothesis is labelled with how much evidence backed it. Less data in means a lower tier out — never a confident answer dressed over a guess.
The honest boundaries.
Hypothesisly doesn’t run the test, doesn’t change your site, and doesn’t predict the future. Pair it with whichever testing tool you already use. We make the recommendation defensible; you run the experiment.
Want to see it on your site?
We’ll run a real audit and walk you through what it found — and what it couldn’t be sure about.
15 minutes · a real audit on your site · no slide deck.