How it works

An audit, end to end.

An audit goes from a URL plus a short business brief to a ranked backlog of A/B-test hypotheses — each grounded in a named behavioural principle and scored for priority, with focused single-test audits adding an adversarial second model pass that argues against the recommendation. The output is a JSON-shaped contract: 23 fields per hypothesis. The whole pipeline finishes in a few minutes.

A few more recent additions: Hypothesisly drafts your brand voice from your live site copy, so the brief is half-written before you start (you refine it before re-auditing). It puts per-vertical benchmarks next to your own numbers, flags tactics to avoid in regulated niches, and reports opportunity sizing as a range, not a single number. Every brief is versioned, and each audit pins to the exact brief that was live when it ran.

The pipeline

What happens after you enter a URL.

Three streams of evidence converge into one prompt, then narrow to a ranked backlog. The model never sees a bare URL; it sees your page, how it performs, and what your analytics show.

Evidence in · three parallel streams

Reasoning · narrowed to a ranked backlog

01 · Intake

Site & business context

3-step wizard · 5 ways to connect · sets the tier

02 · Scrape

DOM, renders, funnel walk

~24 on-page signals · CWV + throttled-3G · 3 viewports

03 · Behaviour

GA4 funnel & drop-off

device CVR + channels · + Clarity signals

04 · Payload

Multimodal payload assembled

Three image blocks plus structured evidence, combined into one prompt.

3 images + text · vision-reconciled

05 · Generation

Grounded hypotheses

Each tied to a behavioural mechanism and a citation; thin evidence is labelled, not inflated.

23 fields · PXL 0–14 · 5 CRO pillars · two-signal gate

06 · Falsification

Argued against itself

Failure mode, confounder, alternative, and the test that settles it.

focused audits · higher tiers

07 · Output

23-field hypotheses, ranked

Ordered by PXL, each with its tier, opportunity size and a tracking spec.

PXL-ranked · 5 tiers · ready to run

From 07 · Output back into 01 · Intake

Outcomes you log become priors for the next audit

A few minutes · end to end

It produces the backlog; it never runs your test

What we send to the model

The model sees what a careful auditor would see.

A multimodal prompt assembled in build-payload.ts — three image blocks plus a structured text block, with real evidence attached.

The three image blocks are a desktop render, a mobile render, and a throttled-3G mobile render. The structured text block carries the business context, DOM signals, Core Web Vitals, the live GA4 snapshot, the prior audits we’ve already run on this domain, and the outcomes you’ve logged.

The payload is framed as evidence to reason over, not a list of claims to accept: the model works from what is actually on the page and in the data.

The payload

Image blocks

· Desktop render
· Mobile render
· Throttled-3G mobile render

Text block

· Business context
· DOM signals
· Core Web Vitals
· Live GA4 snapshot
· Prior audits on this domain
· Logged test outcomes
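
A minimal sketch of how a payload like this could be assembled, in TypeScript. The page names only the file (build-payload.ts) and the parts of the payload, so every type, field name and the buildPayload function below are hypothetical; the real module may be structured quite differently.

```typescript
// Hypothetical shapes: the page lists the payload's parts, not its field names.
type RenderLabel = "desktop" | "mobile" | "mobile-throttled-3g";

interface ImageBlock { kind: "image"; label: RenderLabel; pngBase64: string }
interface TextBlock { kind: "text"; text: string }
type PayloadBlock = ImageBlock | TextBlock;

interface Evidence {
  businessContext: string;
  domSignals: Record<string, unknown>;
  coreWebVitals: Record<string, number>;
  ga4Snapshot: Record<string, unknown> | null; // null when no analytics are connected (assumption)
  priorAudits: string[];
  loggedOutcomes: string[];
}

// Fixed order so the model always sees the renders in the same sequence.
const RENDER_ORDER: RenderLabel[] = ["desktop", "mobile", "mobile-throttled-3g"];

// Three image blocks first, then one structured text block carrying the evidence.
function buildPayload(renders: Record<RenderLabel, string>, evidence: Evidence): PayloadBlock[] {
  const images = RENDER_ORDER.map(
    (label): ImageBlock => ({ kind: "image", label, pngBase64: renders[label] })
  );
  const text: TextBlock = { kind: "text", text: JSON.stringify(evidence, null, 2) };
  return [...images, text];
}
```

Serialising the evidence as JSON is one possible choice for the text block; the page does not say how the structured text is actually encoded.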

The framework

Five stages, in order.

The same discipline runs on every audit. Each stage can stop the process or lower the confidence tier — by design.

Qualification

Does the site have enough traffic for an 8-week test to detect a real lift? Below threshold, we return BORDERLINE and refuse to ground hypotheses in numbers we don’t trust.
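
The traffic check can be sketched with the standard two-proportion sample-size formula. This is an illustration under stated assumptions, not the product's actual rule: the 5% significance level, 80% power, the 10% default relative lift, the two-arm split and the "QUALIFIED" label are all assumed here; only the 8-week window and the BORDERLINE verdict come from the page.

```typescript
// Assumed planning constants; the page does not publish its actual thresholds.
const Z_ALPHA = 1.96;  // two-sided test at 5% significance
const Z_BETA = 0.8416; // 80% power
const TEST_WEEKS = 8;  // test length named on the page

type Verdict = "QUALIFIED" | "BORDERLINE"; // "QUALIFIED" is a hypothetical label

// Visitors needed in each arm to detect a relative lift over the baseline conversion rate.
function requiredPerArm(baselineCvr: number, relativeLift: number): number {
  const p1 = baselineCvr;
  const p2 = baselineCvr * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / (p2 - p1) ** 2);
}

// Compares eight weeks of traffic with what a control + variant test would need.
function qualify(weeklyVisitors: number, baselineCvr: number, relativeLift = 0.1): Verdict {
  const needed = 2 * requiredPerArm(baselineCvr, relativeLift);
  return weeklyVisitors * TEST_WEEKS >= needed ? "QUALIFIED" : "BORDERLINE";
}
```

With a 3% baseline and a 10% relative lift, this formula asks for roughly 53,000 visitors per arm, so a site with 1,000 weekly visitors would come back BORDERLINE under these assumptions.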

DOM audit

Structural signals: page type, heading hierarchy, form fields, schema markup, trust badges, Core Web Vitals.

Behavioural validation · the two-sources rule

A DOM observation carries full weight only when GA4 or the flow simulation confirms it. On one signal it still appears, ranked lower.
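
One way to picture the two-sources rule, as a hedged sketch. The Observation shape, the source names and the 0.5 weight for a single signal are placeholders chosen for illustration; the page states only that confirmed observations carry full weight and unconfirmed ones still appear, ranked lower.

```typescript
type Source = "dom" | "ga4" | "flow-sim";

// Hypothetical shape for one observation and the evidence behind it.
interface Observation { id: string; sources: Source[] }

// Full weight only when a DOM signal is confirmed by GA4 or the flow simulation.
// The 0.5 fallback is a placeholder, not a published value.
function evidenceWeight(o: Observation): number {
  const hasDom = o.sources.includes("dom");
  const confirmed = o.sources.includes("ga4") || o.sources.includes("flow-sim");
  return hasDom && confirmed ? 1 : 0.5;
}

// Nothing is dropped: single-signal observations are kept and sorted below confirmed ones.
function rankByEvidence(observations: Observation[]): Observation[] {
  return [...observations].sort((a, b) => evidenceWeight(b) - evidenceWeight(a));
}
```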

Behavioural-science framing

Each surviving hypothesis is grounded in a named principle — from Cialdini, Fogg M/A/T, prospect theory, or cognitive load theory. Mechanism over hunch.

PXL scoring

10 questions, up to 14 priority points: six standard questions worth one point each, plus four evidence questions worth two points each (6 + 8 = 14). The backlog is ranked by that score rather than by opinion.
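
The scoring arithmetic can be sketched as follows. It assumes each question is answered yes or no, which the page does not state; the PxlAnswers shape and the pxlScore name are hypothetical.

```typescript
// Assumption: every question is a yes/no answer.
interface PxlAnswers {
  standard: boolean[]; // six standard questions, one point each
  evidence: boolean[]; // four evidence questions, double-weighted
}

function pxlScore(answers: PxlAnswers): number {
  if (answers.standard.length !== 6 || answers.evidence.length !== 4) {
    throw new Error("expected 6 standard and 4 evidence answers");
  }
  const yesCount = (xs: boolean[]): number => xs.filter(Boolean).length;
  return yesCount(answers.standard) + 2 * yesCount(answers.evidence);
}
```

A hypothesis that passes every question scores 6 + 2 × 4 = 14, which matches the 0 to 14 range shown in the pipeline.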

Adversarial falsification

We argue against your next test before you run it.

On a focused, single-test audit, a second pass tries to break the recommendation before it reaches you. Here’s a generic example: the hypothesis, then the self-critique that travels with it.

HYP-014 · product page · High confidence

If we make the primary add-to-cart button a full-width, thumb-reachable tap target on mobile product pages, mobile add-to-cart rate will rise — because the control button is below a comfortable touch-target size and sits outside the thumb zone.

Sources: DOM audit; GA4 · mobile CVR

The agent’s case against its own hypothesis

Failure mode

A larger button may push price and reviews below the fold, trading one friction for another and leaving mobile add-to-cart flat.

Likely confounder

Mobile conversion is already depressed by slow 3G load — the CTA size may be a symptom, not the cause.

Alternative explanation

Any lift could come from the variant’s stronger colour contrast, not the button’s size or position.

Discriminating test

Vary size and position independently — full-width in place vs. relocated at the same size — to isolate which factor moves add-to-cart.

A structural signal and a behavioural one both confirm this, so it ships at the High tier — with its counter-argument attached.
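
As a data structure, the self-critique above could travel with its hypothesis in a shape like the one below. The field names are hypothetical and are not taken from the 23-field contract, which this page does not list; the example values are condensed from the HYP-014 critique.

```typescript
// Hypothetical field names for the four parts of the critique shown above.
interface Falsification {
  failureMode: string;
  likelyConfounder: string;
  alternativeExplanation: string;
  discriminatingTest: string;
}

const hyp014Critique: Falsification = {
  failureMode: "A larger button may push price and reviews below the fold.",
  likelyConfounder: "Mobile conversion is already depressed by slow 3G load.",
  alternativeExplanation: "Any lift could come from stronger colour contrast.",
  discriminatingTest: "Vary size and position independently.",
};

// A critique is only useful if all four parts are filled in.
function isComplete(f: Falsification): boolean {
  const parts = [f.failureMode, f.likelyConfounder, f.alternativeExplanation, f.discriminatingTest];
  return parts.every((part) => part.trim().length > 0);
}
```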

Confidence tiers

The tier tells you what we actually had.

Every hypothesis is labelled with how much evidence backed it. Less data in means a lower tier out — never a confident answer dressed over a guess.

Tier: Brief
What data we have: Live GA4 + Microsoft Clarity behavioural data + a qualified Client Brief (brand voice, vertical class, qualitative evidence).
What the output can claim: Our most complete tier. Brand-respecting hypotheses grounded in DOM + GA4 + qualitative + Clarity behavioural signal.

Tier: Full
What data we have: Live GA4 via OAuth or service account + DOM + screenshots + flow sim.
What the output can claim: Behavioural validation, leak-targeted recommendations, lift estimates we stand behind.

Tier: High
What data we have: CSV upload of GA4 data + DOM + screenshots.
What the output can claim: Static behavioural context that is directional, not live.

Tier: Estimated
What data we have: Manual entry of key metrics + DOM + screenshots.
What the output can claim: Directional only. We won’t make confident claims.

Tier: Structural
What data we have: DOM + screenshots only.
What the output can claim: Surface issues without behavioural validation. Honest about its limits.
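
The tier table can be read as a small decision function. The sketch below is an interpretation, not the product's code: the AuditInputs shape is hypothetical, and it assumes DOM, screenshots and the flow simulation are always available, so only the analytics source, Clarity and the brief decide the tier.

```typescript
type Tier = "Brief" | "Full" | "High" | "Estimated" | "Structural";
type Ga4Source = "live" | "csv" | "manual" | "none";

// Hypothetical summary of what the customer connected during intake.
interface AuditInputs {
  ga4: Ga4Source;
  clarity: boolean;        // Microsoft Clarity behavioural data connected
  qualifiedBrief: boolean; // a qualified Client Brief exists
}

// Less data in means a lower tier out.
function confidenceTier(inputs: AuditInputs): Tier {
  if (inputs.ga4 === "live") {
    return inputs.clarity && inputs.qualifiedBrief ? "Brief" : "Full";
  }
  if (inputs.ga4 === "csv") return "High";
  if (inputs.ga4 === "manual") return "Estimated";
  return "Structural";
}
```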

Context carries over

Each audit starts where the last one left off.

The payload for every run already carries the prior audits on your domain and the outcomes you’ve logged. What won becomes a prior; what lost gets down-weighted. The more you test, the more each recommendation is shaped by your store’s own evidence rather than a generic playbook. That’s per-customer today; cross-customer pattern learning, abstracted to pillar and mechanism (never raw data), rolls out through 2026.
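
A deliberately simple sketch of turning logged outcomes into per-customer priors. Everything here is an assumption made for illustration: keying by mechanism, the +1 and -1 steps and the LoggedOutcome shape are placeholders, and the page does not describe how wins and losses are actually weighted.

```typescript
// Hypothetical record of one logged test outcome.
interface LoggedOutcome {
  mechanism: string; // e.g. "social proof"
  result: "won" | "lost" | "inconclusive";
}

// Wins nudge a mechanism's prior up, losses nudge it down, inconclusive tests leave it unchanged.
function mechanismPriors(outcomes: LoggedOutcome[]): Map<string, number> {
  const priors = new Map<string, number>();
  for (const outcome of outcomes) {
    const step = outcome.result === "won" ? 1 : outcome.result === "lost" ? -1 : 0;
    priors.set(outcome.mechanism, (priors.get(outcome.mechanism) ?? 0) + step);
  }
  return priors;
}
```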

What this doesn’t do

The honest boundaries.

Hypothesisly doesn’t run the test, doesn’t change your site, and doesn’t predict the future. Pair it with whichever testing tool you already use. We make the recommendation defensible; you run the experiment. Two more lines we hold by design: the funnel walk stops at the cart and never enters checkout or payment, and if a site blocks automated access we stop rather than evade it.

Early access

Want to see it on your site?

We’ll run a real audit and walk you through what it found — and what it couldn’t be sure about.

Request early access

15 minutes · a real audit on your site · no slide deck.