Latest numbers

Run on 2026-04-15 against 101 tasks covering scraping, pagination, login, search, form-fill, CAPTCHA, and multi-step flows. All 4 providers hit the same tasks with the same input/output schemas. Scored by the same LLM judge against freshly-fetched ground-truth page text with provider names anonymized.
Provider                             Pass rate   Quality (0-100)   Wins / 101   Wall time     $ / passing task
pre.dev Browser Agents               100/101     79.9              70           275s          $0.010
Browser Use Cloud                    88/101      72.2              60           114s          $0.126
Firecrawl Interact                   84/101      71.0              66           260s          $0.0092
Firecrawl Scrape (static baseline)   95/101      66.2              50           19s           $0.0057
Claude Code + Chrome MCP             manual      manual            manual       interactive   per-session
Headline takeaways:
  • pre.dev wins every category that matters — pass rate, average quality, win count — at 12.6× lower cost than Browser Use Cloud.
  • +7.7 points of judged quality vs Browser Use Cloud, and 10 more wins out of 101 tasks. Not within noise — this is a clean win, repeated across runs.
  • Firecrawl Interact is competitive on cost but fails 16 tasks our API completes. Its judged quality (35.3 on an earlier run) only climbed to 71.0 after we added rate-limit and session-retry handling to the benchmark adapter — which means real-world usage at concurrency will hit the same errors we worked around.

Methodology

  1. 101 tasks (tasks.json) — hand-curated to cover the breadth of real browser-automation work: static scrapes, multi-page flows, logins, form submissions, search-and-click-result, dynamic JS, iframes, CAPTCHA.
  2. Each adapter sends the same url, instruction, input, and output (JSON Schema). Each adapter is responsible for returning structured JSON matching the schema.
  3. Per-task success is a uniform successCheck predicate — it accepts any JSON shape that contains the requested information, so no provider’s output format is privileged.
  4. Quality score (0-100) — scripts/llm-judge.ts fetches the live target page, labels each provider’s answer A/B/C/D per task (anonymized), and asks Gemini to score accuracy against the ground-truth excerpt. This removes both brand bias and any tool-specific formatting advantage.
  5. Wall time + cost — captured from each adapter’s response or estimated at published rates.
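Steps 2 and 3 can be sketched in a few lines. This is an illustrative TypeScript sketch, not the repo's actual code: the `Task` type and the `containsFields` predicate are hypothetical names standing in for the per-task contract and the shape-agnostic successCheck described above.

```typescript
// Hypothetical sketch of the uniform per-task contract
// (field names are illustrative, not the repo's actual types).
type Task = {
  url: string;
  instruction: string;
  input?: Record<string, unknown>;
  output: Record<string, unknown>; // JSON Schema for the expected answer
};

// A shape-agnostic success predicate: accept any JSON value that
// contains every requested field somewhere in its structure, so no
// provider's output format is privileged.
function containsFields(value: unknown, fields: string[]): boolean {
  const found = new Set<string>();
  const walk = (v: unknown): void => {
    if (v === null || typeof v !== "object") return;
    // Object.entries works for both objects and arrays here.
    for (const [key, child] of Object.entries(v as Record<string, unknown>)) {
      if (fields.includes(key)) found.add(key);
      walk(child);
    }
  };
  walk(value);
  return fields.every((f) => found.has(f));
}
```

The point of walking the whole structure rather than matching a fixed path is that a provider returning `{ price: 9 }` and one returning `{ data: { price: 9 } }` both pass, as long as the requested information is present.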

Reproduce it

Everything’s in the public benchmark repo:
git clone https://github.com/predotdev/browser-agents-benchmark
cd browser-agents-benchmark
bun install
cp .env.example .env   # fill in the 4 API keys
bun run bench          # runs all automated adapters in parallel
bun run judge <stamp>  # ground-truth quality scoring
bun run report <stamp> # markdown + HTML report with charts
Per-provider env vars:
  • PREDEV_API_KEY (solo userId or enterprise org key)
  • BROWSER_USE_API_KEY
  • FIRECRAWL_API_KEY
  • GOOGLE_GEMINI_API_KEY (for the judge)
Every run writes per-task JSON to results/<stamp>/, merges it into summary.json, and scores it into judgements.json. You can inspect any single data point.
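The roll-up from per-task records to the summary table is simple enough to show inline. A hedged sketch, assuming a per-task record shape of my own invention (the actual summary.json schema may differ):

```typescript
// Illustrative per-task record; field names are assumptions,
// not the benchmark repo's real schema.
type TaskRecord = { provider: string; pass: boolean; costUsd: number };

// Aggregate per-task records into one summary row per provider:
// pass rate plus dollars spent per *passing* task, as in the table above.
function summarize(records: TaskRecord[]) {
  const byProvider = new Map<string, { pass: number; total: number; cost: number }>();
  for (const r of records) {
    const s = byProvider.get(r.provider) ?? { pass: 0, total: 0, cost: 0 };
    s.total += 1;
    if (r.pass) {
      s.pass += 1;
      s.cost += r.costUsd; // only passing tasks count toward $/pass
    }
    byProvider.set(r.provider, s);
  }
  return [...byProvider.entries()].map(([provider, s]) => ({
    provider,
    passRate: `${s.pass}/${s.total}`,
    dollarsPerPassingTask: s.pass > 0 ? s.cost / s.pass : NaN,
  }));
}
```

Because every intermediate file is plain JSON, you can re-derive any summary cell yourself from the per-task records and check it against the published table.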

Claude Code + Chrome MCP

Claude Code is an interactive IDE tool, not an API — it can’t be benchmarked programmatically the same way. We ship CLAUDE_CODE_CHROME_BENCHMARK.md in the repo root as a self-contained prompt that you (or Claude Code itself) can follow to produce a comparable result file, which the same judge then scores.

Raw data

Every per-task JSON + per-task judge score for the run above is pushed to browser-agents-benchmark/results/2026-04-15T15-40-08. If you disagree with any specific score, the per-task files have the full raw output plus the judge’s rationale for that task.