## Latest numbers
Run on 2026-04-15 against 101 tasks covering scraping, pagination, login, search, form-fill, CAPTCHA, and multi-step flows. All 4 providers hit the same tasks with the same input/output schemas. Scored by the same LLM judge against freshly fetched ground-truth page text, with provider names anonymized.

| Provider | Pass rate | Quality (0-100) | Wins / 101 | Wall time | $ / passing task |
|---|---|---|---|---|---|
| pre.dev Browser Agents | 100/101 | 79.9 | 70 | 275s | $0.010 |
| Browser Use Cloud | 88/101 | 72.2 | 60 | 114s | $0.126 |
| Firecrawl Interact | 84/101 | 71.0 | 66 | 260s | $0.0092 |
| Firecrawl Scrape (static baseline) | 95/101 | 66.2 | 50 | 19s | $0.0057 |
| Claude Code + Chrome MCP | manual | manual | manual | interactive | per-session |
- pre.dev wins every category that matters (pass rate, average quality, win count) at 12.6× lower cost than Browser Use Cloud.
- +7.7 points of judged quality over Browser Use Cloud and 10 more wins out of 101 tasks. That is not within noise: it is a clean win, repeated across runs.
- Firecrawl Interact is competitive on cost but drops 16 tasks our API completes. Its quality floor (35.3 on an earlier run) only climbed to 71 after we added rate-limit and session-retry handling to the benchmark adapter, which means real-world usage at concurrency will hit the same errors we worked around.
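The "$ / passing task" column divides a run's total spend by its passing-task count. A minimal sketch of that arithmetic (the helper name is hypothetical, not from the benchmark repo):

```typescript
// Hypothetical helper: cost per passing task = total run spend / passes.
function costPerPass(totalSpendUsd: number, passes: number): number {
  return totalSpendUsd / passes;
}

// $0.126 per passing task (Browser Use Cloud) vs $0.010 (pre.dev):
const costRatio = 0.126 / 0.010; // ≈12.6x, matching the headline claim
```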
## Methodology
- 101 tasks (`tasks.json`), hand-curated to cover the breadth of real browser-automation work: static scrapes, multi-page flows, logins, form submissions, search-and-click-result, dynamic JS, iframes, CAPTCHA.
- Each adapter sends the same `url`, `instruction`, `input`, and `output` (JSON Schema), and each adapter is responsible for returning structured JSON matching the schema.
- Per-task success is a uniform `successCheck` predicate: it accepts any JSON shape that contains the requested information, so no provider's output format is privileged.
- Quality score (0-100): `scripts/llm-judge.ts` fetches the live target page, labels each provider's answer A/B/C/D per task (anonymized), and asks Gemini to score accuracy against the ground-truth excerpt. This removes both brand bias and any tool-specific formatting advantage.
- Wall time and cost are captured from each adapter's response or estimated at published rates.
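A shape-agnostic success check like the one described above can be sketched as a recursive search over the returned JSON. This is illustrative only; the repo's actual `successCheck` predicates are defined per task:

```typescript
// Illustrative sketch: accept any JSON shape that contains the requested
// value somewhere inside it, so no provider's output format is privileged.
function containsValue(node: unknown, expected: string): boolean {
  if (typeof node === "string") {
    return node.toLowerCase().includes(expected.toLowerCase());
  }
  if (typeof node === "number") {
    return String(node).includes(expected);
  }
  if (Array.isArray(node)) {
    return node.some((item) => containsValue(item, expected));
  }
  if (node !== null && typeof node === "object") {
    return Object.values(node).some((v) => containsValue(v, expected));
  }
  return false;
}

// Different nestings of the same answer all pass the same check:
containsValue({ result: { price: "$19.99" } }, "19.99"); // true
containsValue([{ answer: "$19.99 USD" }], "19.99");      // true
```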
## Reproduce it
Everything's in the public benchmark repo. You'll need four keys:

- `PREDEV_API_KEY` (solo userId or enterprise org key)
- `BROWSER_USE_API_KEY`
- `FIRECRAWL_API_KEY`
- `GOOGLE_GEMINI_API_KEY` (for the judge)

Each run's raw outputs are written to `results/<stamp>/`, merged into a `summary.json`, and scored into `judgements.json`. You can inspect any single data point.
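As a sketch of drilling into one scored run, assuming field names like `taskId` and `score` (the repo's actual judgements schema may differ):

```typescript
// Assumed record shape for one judged answer; field names are illustrative.
interface Judgement {
  taskId: string;
  provider: string; // anonymized at judging time, mapped back in summaries
  score: number;    // 0-100 from the LLM judge
}

// Parse the contents of a judgements.json file (read it however you like,
// e.g. fs.readFileSync("results/<stamp>/judgements.json", "utf8")).
function parseJudgements(json: string): Judgement[] {
  return JSON.parse(json) as Judgement[];
}

// Surface the weakest answers for manual inspection.
function worstAnswers(judgements: Judgement[], threshold: number): Judgement[] {
  return judgements.filter((j) => j.score < threshold);
}
```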
## Claude Code + Chrome MCP
Claude Code is an interactive IDE tool, not an API, so it can't be benchmarked programmatically the same way. We ship `CLAUDE_CODE_CHROME_BENCHMARK.md` in the repo root as a self-contained prompt that you (or Claude Code itself, reading the prompt) can follow to produce a comparable result file that the same judge scores.

