Latest numbers

Run on 2026-04-15 against 101 tasks covering scraping, pagination, login, search, form-fill, CAPTCHA, and multi-step flows. All 4 providers hit the same tasks with the same input/output schemas. Scored by the same LLM judge against freshly-fetched ground-truth page text with provider names anonymized.
Provider                             Pass rate   Quality (0-100)   Wins / 101   Wall time     $ / passing task
pre.dev Browser Agents               100/101     79.9              70           275s          $0.010
Browser Use Cloud                    88/101      72.2              60           114s          $0.126
Firecrawl Interact                   84/101      71.0              66           260s          $0.0092
Firecrawl Scrape (static baseline)   95/101      66.2              50           19s           $0.0057
Claude Code + Chrome MCP             manual      manual            manual       interactive   per-session
Headline takeaways:
  • pre.dev wins every category that matters — pass rate, average quality, win count — at 12.6× lower cost than Browser Use Cloud.
  • +7.7 points of judged quality vs Browser Use Cloud, and 10 more wins out of 101 tasks. Not within noise — this is a clean win, repeated across runs.
  • Firecrawl Interact is competitive on cost but fails 16 tasks our API completes. Its judged quality (35.3 on an earlier run) only climbed to 71.0 after we added rate-limit and session-retry handling to the benchmark adapter — which means real-world usage at concurrency will hit the same errors we worked around.

Methodology

  1. 101 tasks (tasks.json) — hand-curated to cover the breadth of real browser-automation work: static scrapes, multi-page flows, logins, form submissions, search-and-click-result, dynamic JS, iframes, CAPTCHA.
  2. Each adapter sends the same url, instruction, input, and output (JSON Schema). Each adapter is responsible for returning structured JSON matching the schema.
  3. Per-task success is a uniform successCheck predicate — it accepts any JSON shape that contains the requested information, so no provider’s output format is privileged.
  4. Quality score (0-100) — scripts/llm-judge.ts fetches the live target page, labels each provider’s answer A/B/C/D per task (anonymized), and asks Gemini to score accuracy against the ground-truth excerpt. This removes both brand bias and any tool-specific formatting advantage.
  5. Wall time + cost — captured from each adapter’s response or estimated at published rates.
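Steps 2 and 3 can be sketched in a few lines. This is an illustrative TypeScript sketch, not the repo's actual code: the `Task` type and the `containsFields` predicate are hypothetical names standing in for the per-task contract and the shape-agnostic successCheck described above.

```typescript
// Hypothetical sketch of the uniform per-task contract
// (field names are illustrative, not the repo's actual types).
type Task = {
  url: string;
  instruction: string;
  input?: Record<string, unknown>;
  output: Record<string, unknown>; // JSON Schema for the expected answer
};

// A shape-agnostic success predicate: accept any JSON value that
// contains every requested field somewhere in its structure, so no
// provider's output format is privileged.
function containsFields(value: unknown, fields: string[]): boolean {
  const found = new Set<string>();
  const walk = (v: unknown): void => {
    if (v === null || typeof v !== "object") return;
    // Object.entries works for both objects and arrays here.
    for (const [key, child] of Object.entries(v as Record<string, unknown>)) {
      if (fields.includes(key)) found.add(key);
      walk(child);
    }
  };
  walk(value);
  return fields.every((f) => found.has(f));
}
```

The point of walking the whole structure rather than matching a fixed path is that a provider returning `{ price: 9 }` and one returning `{ data: { price: 9 } }` both pass, as long as the requested information is present.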

Reproduce it

Everything’s in the public benchmark repo:
git clone https://github.com/predotdev/browser-agents-benchmark
cd browser-agents-benchmark
bun install
cp .env.example .env   # fill in the 4 API keys
bun run bench          # runs all automated adapters in parallel
bun run judge <stamp>  # ground-truth quality scoring
bun run report <stamp> # markdown + HTML report with charts
Per-provider env vars:
  • PREDEV_API_KEY (solo userId or enterprise org key)
  • BROWSER_USE_API_KEY
  • FIRECRAWL_API_KEY
  • GOOGLE_GEMINI_API_KEY (for the judge)
Every run writes per-task JSON to results/<stamp>/, merges it into summary.json, and scores it into judgements.json. You can inspect any single data point.
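The roll-up from per-task records to the summary table is simple enough to show inline. A hedged sketch, assuming a per-task record shape of my own invention (the actual summary.json schema may differ):

```typescript
// Illustrative per-task record; field names are assumptions,
// not the benchmark repo's real schema.
type TaskRecord = { provider: string; pass: boolean; costUsd: number };

// Aggregate per-task records into one summary row per provider:
// pass rate plus dollars spent per *passing* task, as in the table above.
function summarize(records: TaskRecord[]) {
  const byProvider = new Map<string, { pass: number; total: number; cost: number }>();
  for (const r of records) {
    const s = byProvider.get(r.provider) ?? { pass: 0, total: 0, cost: 0 };
    s.total += 1;
    if (r.pass) {
      s.pass += 1;
      s.cost += r.costUsd; // only passing tasks count toward $/pass
    }
    byProvider.set(r.provider, s);
  }
  return [...byProvider.entries()].map(([provider, s]) => ({
    provider,
    passRate: `${s.pass}/${s.total}`,
    dollarsPerPassingTask: s.pass > 0 ? s.cost / s.pass : NaN,
  }));
}
```

Because every intermediate file is plain JSON, you can re-derive any summary cell yourself from the per-task records and check it against the published table.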

Claude Code + Chrome MCP

Claude Code is an interactive IDE tool, not an API — it can’t be benchmarked programmatically the same way. We ship CLAUDE_CODE_CHROME_BENCHMARK.md in the repo root as a self-contained prompt that you (or Claude Code itself) can follow to produce a comparable result file, which the same judge then scores.

Raw data

Every per-task JSON + per-task judge score for the run above is pushed to browser-agents-benchmark/results/2026-04-15T15-40-08. If you disagree with any specific score, the per-task files have the full raw output plus the judge’s rationale for that task.