Benchmark
HarnessBench
HarnessBench compares Codex, Claude Code, Cursor Agent, and Antigravity CLI on the same 27 real-repository debugging issues. The primary score is deterministic hidden-test pass/fail, with wall time, token usage, cost estimates, and auxiliary failure reviews retained for analysis.
[Figures: Pass Rate by Harness × Model × Effort; Median Wall Time; Success by Difficulty; Pass Rate vs Time]
Condition Table
| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|
Interpretation and Artifacts
Across the 18 displayed conditions, the strongest observed pass rate remains Codex / GPT-5.5 / xhigh at 22/27. Cursor / Composer 2.5 fast lands in the middle of the main pack at 19/27: below the Codex/GPT-5.5, Cursor/GPT-5.5, and stronger Opus 4.7 conditions, which pass 20 to 22 cases, but above Composer 2 fast and tied with or ahead of several slower high-effort conditions. Claude Code / Opus 4.8 / xhigh scores 14/25 after excluding two infrastructure-invalid Vite runs; the implementations from those two runs passed the hidden tests when rechecked with CI setup. With only 27 paired tasks, these pass-rate gaps should be read as directional rather than statistically firm.
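To make the "directional rather than statistically firm" caveat concrete, here is a minimal sketch that computes Wilson score intervals for the two pass counts quoted above (22/27 and 19/27). It is an illustration, not part of the benchmark's report generator: the function name `wilson_interval` is hypothetical, and the calculation treats each condition as an independent binomial sample, which ignores the paired task design. A paired test would need per-task outcomes for each condition.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Pass counts taken from the paragraph above.
codex_lo, codex_hi = wilson_interval(22, 27)
cursor_lo, cursor_hi = wilson_interval(19, 27)

print(f"22/27: {codex_lo:.2f} to {codex_hi:.2f}")
print(f"19/27: {cursor_lo:.2f} to {cursor_hi:.2f}")

# True when the two intervals share any common range.
intervals_overlap = codex_lo <= cursor_hi and cursor_lo <= codex_hi
print("intervals overlap:", intervals_overlap)
```

The two intervals overlap over a wide range, which is consistent with reading a three-case gap on 27 tasks as suggestive only.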
Cost figures should be read carefully because they come from different sources: Claude Code reports dollar cost directly; Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates; and Cursor Opus/Composer conditions may not expose comparable cost data.
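The sketch below shows one way an API-equivalent estimate and a cost-per-pass figure could be derived. Everything in it is an assumption made for illustration: the function names, the per-million-token rate parameters, and the definition of cost per pass as total condition cost divided by passing runs are not taken from the benchmark's runner or report generator.

```python
from typing import Optional


def api_equivalent_cost(input_tokens: int, output_tokens: int,
                        usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimate dollar cost from token counts and per-million-token rates."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def cost_per_pass(run_costs: list[Optional[float]], passes: int) -> Optional[float]:
    """Total cost over all runs divided by passing runs.

    Returns None when any run lacks cost data or when nothing passed,
    so that incomparable conditions are left blank rather than understated.
    """
    if passes == 0 or any(cost is None for cost in run_costs):
        return None
    return sum(run_costs) / passes


# Hypothetical rates and token counts, chosen only to exercise the functions.
estimate = api_equivalent_cost(1_000_000, 500_000, usd_per_m_input=2.0, usd_per_m_output=10.0)
print(estimate)
print(cost_per_pass([1.0, 2.0, 3.0], passes=2))
print(cost_per_pass([1.0, None], passes=1))
```

Returning `None` instead of a partial sum is a deliberate choice in this sketch: a condition with missing cost data would otherwise look cheaper than it is.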
The public repository contains the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this website; they are retained locally under the experiment artifact policy because they can be large and may contain provider-specific session details.
Case Set (official 14-condition aggregate)
| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
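
The per-case pass counts in the table above can be rolled up by difficulty. The following sketch hard-codes the Pass column, grouped by the Difficulty column in repository order (axios, fastapi, gitea, lazygit, langflow, uptime-kuma, bat, memos, vite). The names `PASSES_BY_DIFFICULTY`, `CONDITIONS`, and `summarize` are hypothetical and are not the project's report generator.

```python
# Pass counts out of 14 conditions, copied from the Case Set table.
PASSES_BY_DIFFICULTY = {
    "low": [14, 14, 14, 14, 12, 12, 13, 14, 2],
    "mid": [14, 14, 7, 9, 0, 11, 2, 11, 10],
    "high": [14, 12, 9, 5, 14, 0, 13, 12, 9],
}
CONDITIONS = 14  # the "official 14-condition aggregate"


def summarize(passes_by_difficulty: dict[str, list[int]],
              conditions: int) -> dict[str, tuple[int, int, float]]:
    """Return (passed, attempts, pass rate) for each difficulty tier."""
    summary = {}
    for difficulty, counts in passes_by_difficulty.items():
        passed = sum(counts)
        attempts = len(counts) * conditions
        summary[difficulty] = (passed, attempts, passed / attempts)
    return summary


summary = summarize(PASSES_BY_DIFFICULTY, CONDITIONS)
for difficulty, (passed, attempts, rate) in summary.items():
    print(f"{difficulty}: {passed}/{attempts} ({rate:.0%})")
# low: 109/126 (87%)
# mid: 78/126 (62%)
# high: 88/126 (70%)
```

In this aggregate the mid tier scores below the high tier, largely because two mid cases (langflow-ai-langflow-mid-mcp-connectable-inputs at 0/14 and sharkdp-bat-mid-control-character-wrapping at 2/14) were rarely solved.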