How it works

An audit, end to end.

An audit goes from a URL plus a short business brief to a ranked backlog of A/B-test hypotheses — each grounded in a named behavioural principle, scored for priority, and stress-tested by a second model pass. The output is a JSON-shaped contract: 14 fields per hypothesis, including the strongest argument against it. The whole pipeline runs in one request, in roughly three minutes.

The pipeline

What happens after you enter a URL.

Seven stages, each one adding evidence or removing noise. The model never sees a bare URL — it sees a structured, multimodal payload.

01 · Intake
Site & business context
URL, goal, what you sell and to whom.
02 · Scrape
DOM + screenshots + 3G + flow simulation
Structure, visual renders, a throttled-3G mobile shot, and a simulated walk through the funnel.
03 · Behaviour
GA4 funnel
Where real users actually drop off — when a GA4 connection is present.
04 · Payload
Multimodal payload assembled
Structural, visual and behavioural evidence combined into one prompt.
05 · Generation
Hypothesis generation
The model proposes candidate hypotheses, each gated on two independent sources.
06 · Falsification
Adversarial second pass
Each survivor is attacked: failure mode, confounder, alternative explanation, discriminating test.
07 · Output
14-field hypotheses, ranked
A prioritised list, each with its tier, score and falsification attached.
What we send to the model

The model sees what a careful auditor would see.

A multimodal prompt assembled in build-payload.ts — three image blocks plus a structured text block, with real evidence attached.

The three image blocks are a desktop render, a mobile render, and a throttled-3G mobile render. The structured text block carries the business context, DOM signals, Core Web Vitals, the live GA4 snapshot, the prior audits we’ve already run on this domain, and the outcomes you’ve logged.

The payload is framed as evidence, not a list of claims, so the model reasons over what’s actually there.

The payload
Image blocks
Desktop render · Mobile render · Throttled-3G mobile render
Text block
  • Business context
  • DOM signals
  • Core Web Vitals
  • Live GA4 snapshot
  • Prior audits on this domain
  • Logged test outcomes
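
As a rough illustration of the payload described above, the sketch below assembles three image blocks and one structured text block. It is not the contents of build-payload.ts: every type name, field name, and the `buildPayload` function itself are assumptions made for the example.

```typescript
// Hypothetical shapes: none of these names are taken from build-payload.ts.
type ImageBlock = {
  type: "image";
  label: "desktop" | "mobile" | "mobile-3g";
  base64Png: string;
};
type TextBlock = { type: "text"; text: string };
type PayloadBlock = ImageBlock | TextBlock;

interface AuditEvidence {
  businessContext: string;
  domSignals: Record<string, unknown>;
  coreWebVitals: Record<string, number>;
  ga4Snapshot?: Record<string, unknown>; // absent when no GA4 connection exists
  priorAudits: string[];
  loggedOutcomes: string[];
  screenshots: { desktop: string; mobile: string; mobile3g: string };
}

function buildPayload(e: AuditEvidence): PayloadBlock[] {
  // One titled section per item in the text-block list above.
  const sections: [string, unknown][] = [
    ["Business context", e.businessContext],
    ["DOM signals", e.domSignals],
    ["Core Web Vitals", e.coreWebVitals],
    ["Live GA4 snapshot", e.ga4Snapshot ?? "not connected"],
    ["Prior audits on this domain", e.priorAudits],
    ["Logged test outcomes", e.loggedOutcomes],
  ];
  const text = sections
    .map(([title, value]) => {
      const body = typeof value === "string" ? value : JSON.stringify(value, null, 2);
      return `## ${title}\n${body}`;
    })
    .join("\n\n");

  // Images first, then the single text block that carries the evidence.
  return [
    { type: "image", label: "desktop", base64Png: e.screenshots.desktop },
    { type: "image", label: "mobile", base64Png: e.screenshots.mobile },
    { type: "image", label: "mobile-3g", base64Png: e.screenshots.mobile3g },
    { type: "text", text },
  ];
}
```

Keeping the GA4 section in place and marking it "not connected" is one possible design choice: the model is told explicitly that behavioural data is missing rather than being left to infer it.
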
The framework

Five stages, in order.

The same discipline runs on every audit. Each stage can stop the process or lower the confidence tier — by design.

01
Qualification

Does the site have enough traffic for an 8-week test to detect a real lift? Below threshold, we return BORDERLINE and refuse to ground hypotheses in numbers we don’t trust.
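
The qualification check can be pictured as a sample-size calculation. The sketch below is an illustration under stated assumptions, not our actual threshold: it assumes a 50/50 split, a two-sided 5% significance level, 80% power, and the standard normal-approximation formula for comparing two proportions. The `QUALIFIED` label and all function names are hypothetical; only `BORDERLINE` and the 8-week window come from the description above.

```typescript
// Assumed constants (not from the product): two-sided alpha = 0.05, power = 0.80.
const Z_ALPHA = 1.96;
const Z_BETA = 0.8416;

// Visitors needed in each variant to detect a given relative lift
// over a baseline conversion rate (normal approximation, two proportions).
function requiredVisitorsPerVariant(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / (p2 - p1) ** 2);
}

type Qualification = "QUALIFIED" | "BORDERLINE"; // "QUALIFIED" is a hypothetical label

function qualify(
  weeklyVisitors: number,
  baselineRate: number,
  relativeLift: number,
  weeks = 8,
): Qualification {
  // Traffic is split evenly, so each variant sees half of the window's visitors.
  const availablePerVariant = (weeklyVisitors * weeks) / 2;
  const needed = requiredVisitorsPerVariant(baselineRate, relativeLift);
  return availablePerVariant >= needed ? "QUALIFIED" : "BORDERLINE";
}
```

Under these assumptions, a 3% baseline and a 10% relative lift need roughly 53,000 visitors per variant, so a page with 5,000 weekly visitors would come back as BORDERLINE over 8 weeks.
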

02
DOM audit

Structural signals: page type, heading hierarchy, form fields, schema markup, trust badges, Core Web Vitals.

03
Behavioural validation · the two-sources rule

A DOM observation is only actionable if GA4 or the flow simulation confirms it. With one signal it stays weak.
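
In code, the two-sources rule reduces to a small gate. This is a minimal sketch with hypothetical type and function names, not the production check:

```typescript
// Hypothetical types illustrating the two-sources rule.
type BehaviouralSource = "ga4" | "flow-simulation";

interface Observation {
  id: string;
  domFinding: string;
  confirmedBy: BehaviouralSource[]; // behavioural sources that corroborate the DOM finding
}

type Strength = "actionable" | "weak";

// A DOM finding counts as actionable only when at least one
// behavioural source (GA4 or the flow simulation) confirms it.
function classify(o: Observation): Strength {
  return o.confirmedBy.length > 0 ? "actionable" : "weak";
}

function actionableOnly(observations: Observation[]): Observation[] {
  return observations.filter((o) => classify(o) === "actionable");
}
```
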

04
Behavioural-science framing

Each surviving hypothesis is grounded in a named principle drawn from Cialdini’s principles of persuasion, the Fogg behaviour model (motivation, ability, trigger), prospect theory, or cognitive load theory. Mechanism over hunch.

05
PXL scoring

Ten yes/no questions, weighted to give 0–16 priority points. The backlog is ranked by score rather than by opinion.
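
A scoring pass of this kind can be sketched as a weighted checklist. The question texts and weights below are placeholders invented for the example; the only constraints taken from the description above are ten yes/no questions and a maximum of 16 points.

```typescript
// Placeholder questions and weights: illustrative only, chosen so the weights sum to 16.
interface Question {
  id: string;
  text: string;
  weight: number;
}

const QUESTIONS: Question[] = [
  { id: "q1", text: "Is the change above the fold?", weight: 1 },
  { id: "q2", text: "Is the change noticeable within five seconds?", weight: 2 },
  { id: "q3", text: "Does it add or remove an element?", weight: 2 },
  { id: "q4", text: "Does it run on a high-traffic page?", weight: 1 },
  { id: "q5", text: "Is it backed by analytics data?", weight: 2 },
  { id: "q6", text: "Is it backed by the flow simulation?", weight: 2 },
  { id: "q7", text: "Does it address a known drop-off step?", weight: 1 },
  { id: "q8", text: "Is it grounded in a named principle?", weight: 1 },
  { id: "q9", text: "Does it affect mobile users?", weight: 1 },
  { id: "q10", text: "Can it be built in under a day?", weight: 3 },
];

type Answers = Record<string, boolean>;

// Sum the weights of every question answered "yes"; unanswered counts as "no".
function pxlScore(answers: Answers): number {
  return QUESTIONS.reduce((sum, q) => sum + (answers[q.id] ? q.weight : 0), 0);
}

// Highest score first; the input array is left untouched.
function rankBacklog<T extends { answers: Answers }>(backlog: T[]): T[] {
  return [...backlog].sort((a, b) => pxlScore(b.answers) - pxlScore(a.answers));
}
```
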

Adversarial falsification

Every finding is argued against.

Before a hypothesis reaches you, a second pass tries to break it. Here’s a generic example — the hypothesis, then the self-critique that travels with it.

HYP-014 · product page · High confidence

If we make the primary add-to-cart button a full-width, thumb-reachable tap target on mobile product pages, mobile add-to-cart rate will rise — because the control button is below a comfortable touch-target size and sits outside the thumb zone.

Sources: DOM audit, GA4 · mobile CVR
The agent’s case against its own hypothesis
Failure mode
A larger button may push price and reviews below the fold, trading one friction for another and leaving mobile add-to-cart flat.
Likely confounder
Mobile conversion is already depressed by slow 3G load — the CTA size may be a symptom, not the cause.
Alternative explanation
Any lift could come from the variant’s stronger colour contrast, not the button’s size or position.
Discriminating test
Vary size and position independently — full-width in place vs. relocated at the same size — to isolate which factor moves add-to-cart.

A structural signal and a behavioural one both confirm this, so it ships at the High tier — with its counter-argument attached.
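
To show how a hypothesis and its self-critique might travel together in the JSON output, here is a partial sketch. It covers only the fields visible in the example above, not the full 14-field contract, and all field names are assumptions.

```typescript
// Partial, hypothetical shape: only fields visible in the example above.
interface Falsification {
  failureMode: string;
  likelyConfounder: string;
  alternativeExplanation: string;
  discriminatingTest: string;
}

interface HypothesisSketch {
  id: string;
  pageType: string;
  confidence: string;
  statement: string;
  sources: string[];
  falsification: Falsification;
}

// A hypothesis is complete only when all four parts of the critique are filled in.
function hasFullCritique(h: HypothesisSketch): boolean {
  const f = h.falsification;
  return [f.failureMode, f.likelyConfounder, f.alternativeExplanation, f.discriminatingTest]
    .every((part) => part.trim().length > 0);
}

const hyp014: HypothesisSketch = {
  id: "HYP-014",
  pageType: "product page",
  confidence: "High",
  statement: "A full-width, thumb-reachable add-to-cart button will raise mobile add-to-cart rate.",
  sources: ["DOM audit", "GA4 mobile CVR"],
  falsification: {
    failureMode: "A larger button may push price and reviews below the fold.",
    likelyConfounder: "Slow 3G load already depresses mobile conversion.",
    alternativeExplanation: "Any lift could come from stronger colour contrast.",
    discriminatingTest: "Vary size and position independently.",
  },
};
```
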

Confidence tiers

The tier tells you what we actually had.

Every hypothesis is labelled with how much evidence backed it. Less data in means a lower tier out, never a guess dressed up as a confident answer.

Full
What data we have: Live GA4 via OAuth or service account + DOM + screenshots + flow sim.
What the output can claim: Behavioural validation, leak-targeted recommendations, lift estimates we stand behind.

High
What data we have: CSV upload of GA4 data + DOM + screenshots.
What the output can claim: Static behavioural context — directional, not live.

Estimated
What data we have: Manual entry of key metrics + DOM + screenshots.
What the output can claim: Directional only. We won’t make confident claims.

Structural
What data we have: DOM + screenshots only.
What the output can claim: Surface issues without behavioural validation. Honest about its limits.
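
The tier list above behaves like a lookup from the kind of behavioural data supplied. The sketch below is a simplification with hypothetical names: it assumes DOM and screenshots are always present, and that the flow simulation is available whenever live GA4 is connected.

```typescript
// Hypothetical names. Assumes DOM + screenshots are always available, and that
// the flow simulation accompanies a live GA4 connection.
type Ga4Input = "live" | "csv" | "manual" | "none";
type Tier = "Full" | "High" | "Estimated" | "Structural";

const TIER_BY_INPUT: Record<Ga4Input, Tier> = {
  live: "Full", // OAuth or service account
  csv: "High", // uploaded GA4 export
  manual: "Estimated", // key metrics typed in by hand
  none: "Structural", // DOM + screenshots only
};

function assignTier(input: Ga4Input): Tier {
  return TIER_BY_INPUT[input];
}

// Per the tier descriptions, lift estimates are only claimed at the Full tier.
function canClaimLiftEstimates(tier: Tier): boolean {
  return tier === "Full";
}
```
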
What this doesn’t do

The honest boundaries.

Hypothesisly doesn’t run the test, doesn’t change your site, and doesn’t predict the future. Pair it with whichever testing tool you already use. We make the recommendation defensible; you run the experiment.

Early access

Want to see it on your site?

We’ll run a real audit and walk you through what it found — and what it couldn’t be sure about.

15 minutes · a real audit on your site · no slide deck.