Benchmark

HarnessBench

HarnessBench compares Codex, Claude Code, Cursor Agent, and Antigravity CLI on the same 27 real-repository debugging issues. The primary score is deterministic hidden-test pass/fail, with wall time, token usage, cost estimates, and auxiliary failure reviews retained for analysis.

27 debugging tasks
18 harness/model/effort conditions
484 valid agent runs, including supplemental runs
70.9% overall hidden-test pass rate

[Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Success by Difficulty; Pass Rate vs Time]

Condition Table

Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts

Interpretation and Artifacts

Across the 18 displayed conditions, the strongest observed pass rate remains Codex / GPT-5.5 / xhigh at 22/27. Cursor / Composer 2.5 fast lands in the middle of the main pack at 19/27. That places it below the Codex / GPT-5.5, Cursor / GPT-5.5, and stronger Opus 4.7 conditions, which pass 20 to 22 tasks, but above Composer 2 fast and tied with or ahead of several slower high-effort conditions. Claude Code / Opus 4.8 / xhigh is 14/25 after excluding two infrastructure-invalid Vite runs; the implementations from those two runs passed the hidden tests when rechecked with CI setup. With only 27 paired tasks, these success-rate gaps should be read as directional rather than statistically firm.
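To make the "directional rather than statistically firm" caveat concrete, the sketch below computes 95% Wilson score intervals for two of the pass counts quoted above (22/27 and 19/27). This is an illustration and not part of the benchmark's report generator. The function name `wilson_interval` is hypothetical, and the calculation treats each condition as an independent binomial sample, which ignores the task pairing. A paired test such as McNemar's would need per-task outcomes for both conditions.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Pass counts taken from the interpretation paragraph above.
    for label, passes in [("Codex / GPT-5.5 / xhigh", 22), ("Cursor / Composer 2.5 fast", 19)]:
        low, high = wilson_interval(passes, 27)
        print(f"{label}: {passes}/27, 95% interval {low:.3f} to {high:.3f}")
```

With 27 tasks per condition, the two intervals overlap substantially, which is consistent with treating a three-task gap as suggestive only.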

Cost should be read carefully. Claude Code reports dollar cost directly, Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates, and Cursor Opus/Composer conditions may not expose comparable cost data.
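As a rough sketch of how an API-equivalent rate-card estimate and a cost-per-pass figure can be derived, consider the following. The function names, the token counts, and the per-million-token rates are all hypothetical placeholders. They are not the benchmark's actual rates, and the real report generator may compute these values differently.

```python
from typing import Optional


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Estimate dollar cost from token counts and per-million-token rates."""
    return (input_tokens / 1_000_000) * input_rate_per_mtok \
        + (output_tokens / 1_000_000) * output_rate_per_mtok


def cost_per_pass(total_cost: float, passes: int) -> Optional[float]:
    """Divide total spend for a condition by its number of passing runs.

    Returns None when there are no passes, since the ratio is undefined.
    """
    if passes == 0:
        return None
    return total_cost / passes


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    run_cost = estimate_cost(1_000_000, 500_000,
                             input_rate_per_mtok=2.0, output_rate_per_mtok=10.0)
    print(f"estimated run cost: ${run_cost:.2f}")
    print(f"cost per pass: {cost_per_pass(70.0, 20)}")
```

Because a directly reported dollar cost and a rate-card estimate come from different sources, a cost-per-pass value is only comparable between conditions that use the same costing method.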

The public repository contains the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this website; they are retained locally under the experiment artifact policy because they can be large and may contain provider-specific session details.

Case Set (official 14-condition aggregate)

Case | Repo | Difficulty | Size | Pass | PR
axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR
axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR
axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR
fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR
fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR
fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR
go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR
go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR
go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR
jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR
jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR
jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR
langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR
langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR
langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR
louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR
louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR
louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR
sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR
sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR
sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR
usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR
usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR
usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR
vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR
vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR
vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR
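
The per-case Pass counts in the case set above can be rolled up by difficulty tier. The sketch below does this with values copied from the Pass column. The names `CASE_PASSES` and `pass_rate_by_difficulty` are illustrative and are not taken from the benchmark's runner or report generator.

```python
from collections import defaultdict

# Each case in the official aggregate was attempted under 14 conditions.
CONDITIONS_PER_CASE = 14

# (difficulty, passes) pairs copied from the case table, grouped by repository.
CASE_PASSES = [
    ("high", 14), ("low", 14), ("mid", 14),  # axios/axios
    ("high", 12), ("low", 14), ("mid", 14),  # fastapi/fastapi
    ("high", 9), ("low", 14), ("mid", 7),    # go-gitea/gitea
    ("high", 5), ("low", 14), ("mid", 9),    # jesseduffield/lazygit
    ("high", 14), ("low", 12), ("mid", 0),   # langflow-ai/langflow
    ("high", 0), ("low", 12), ("mid", 11),   # louislam/uptime-kuma
    ("high", 13), ("low", 13), ("mid", 2),   # sharkdp/bat
    ("high", 12), ("low", 14), ("mid", 11),  # usememos/memos
    ("high", 9), ("low", 2), ("mid", 10),    # vitejs/vite
]


def pass_rate_by_difficulty(rows, per_case=CONDITIONS_PER_CASE):
    """Sum passes and attempts per difficulty tier; return (passes, attempts, rate)."""
    totals = defaultdict(lambda: [0, 0])
    for difficulty, passes in rows:
        totals[difficulty][0] += passes
        totals[difficulty][1] += per_case
    return {d: (p, n, p / n) for d, (p, n) in totals.items()}


if __name__ == "__main__":
    summary = pass_rate_by_difficulty(CASE_PASSES)
    for difficulty in ("low", "mid", "high"):
        passes, attempts, rate = summary[difficulty]
        print(f"{difficulty}: {passes}/{attempts} ({rate:.1%})")
```

On these rows the tiers aggregate to 109/126 for low, 78/126 for mid, and 88/126 for high. These totals cover only the official 14-condition aggregate, so they may differ from figures that include supplemental runs.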