Benchmark
HarnessBench
HarnessBench compares Codex, Claude Code, and Cursor Agent on the same 27 real-repository debugging issues. The primary score is deterministic hidden-test pass/fail, with wall time, token usage, cost estimates, and auxiliary failure reviews retained for analysis.
[Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Success by Difficulty; Pass Rate vs Time]
Condition Table
| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|
Interpretation and Artifacts
The highest observed pass rate was Codex / GPT-5.5 / xhigh at 22/27 (81%). With only 27 paired tasks, however, no pairwise difference in success rate reached p < 0.05. Runtime differences were clearer: Cursor Composer 2 fast and Cursor GPT-5.5 medium were substantially faster than the other conditions, while higher Opus effort settings spent more wall time on deliberation without a statistically reliable gain in success rate in this run.
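To make the small-sample caveat concrete, the sketch below runs an exact McNemar test on paired pass/fail outcomes. The report does not state which test produced its p-values, so the choice of test, the function name `mcnemar_exact`, and the outcome vectors are illustrative assumptions, not the benchmark's actual method or data.

```python
from math import comb


def mcnemar_exact(a_pass, b_pass):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Assumption: this is one reasonable test for paired binary results;
    the report does not say which test it actually used.
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("both conditions must cover the same tasks")
    # Only discordant pairs (one condition passes, the other fails) carry signal.
    a_only = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    b_only = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Under the null, each discordant pair favors either side with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes on 27 tasks: condition A solves 5 tasks that B misses,
# B solves none that A misses, and the remaining 22 tasks agree.
a = [True] * 5 + [True] * 17 + [False] * 5
b = [False] * 5 + [True] * 17 + [False] * 5
print(mcnemar_exact(a, b))  # 0.0625
```

Under these assumptions, even a 5-to-0 split among discordant tasks gives p = 0.0625, which shows how hard it is for a 27-task set to separate conditions with similar pass rates.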
Cost figures should be read carefully because they are not measured the same way across harnesses. Claude Code reports dollar cost directly, Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates, and Cursor Opus/Composer conditions may not expose comparable cost data.
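One way to keep missing cost data from silently skewing a cost-per-pass figure is to refuse to compute it when any run lacks a cost. The sketch below assumes cost per pass means total cost over all runs divided by the number of passing runs; that definition, the record fields `passed` and `cost_usd`, and the sample values are hypothetical and are not taken from the benchmark's report generator.

```python
from typing import Optional


def cost_per_pass(runs) -> Optional[float]:
    """Total cost divided by number of passing runs, or None if not computable.

    Assumption: each run is a dict with a boolean "passed" and a "cost_usd"
    that is None when the harness exposes no comparable cost data.
    """
    costs = [r["cost_usd"] for r in runs]
    if any(c is None for c in costs):
        return None  # refuse to report a partial total as if it were complete
    passes = sum(1 for r in runs if r["passed"])
    if passes == 0:
        return None  # cost per pass is undefined with zero passes
    return sum(costs) / passes


# Hypothetical runs, not benchmark data.
reported = [
    {"passed": True, "cost_usd": 0.5},
    {"passed": False, "cost_usd": 0.25},
    {"passed": True, "cost_usd": 0.75},
]
unreported = [
    {"passed": True, "cost_usd": None},
    {"passed": True, "cost_usd": 0.5},
]
print(cost_per_pass(reported))    # 0.75
print(cost_per_pass(unreported))  # None
```

Failed runs still contribute to the numerator here, so a condition that fails expensively looks worse than one that fails cheaply; a different definition would change the ranking.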
The public repository contains the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this website; they are retained locally under the experiment artifact policy because they can be large and may contain provider-specific session details.
Case Set
| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
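
The per-case counts in the table above can be rolled up by difficulty with a few lines of Python. The sketch assumes the denominator 14 in the Pass column is the number of conditions run against each case, which the table implies but does not state; the names `PASSES` and `totals_by_difficulty` are illustrative.

```python
# Passing conditions per case, copied from the Pass column of the table above
# and grouped by the Difficulty column (same row order within each group).
PASSES = {
    "low": [14, 14, 14, 14, 12, 12, 13, 14, 2],
    "mid": [14, 14, 7, 9, 0, 11, 2, 11, 10],
    "high": [14, 12, 9, 5, 14, 0, 13, 12, 9],
}
CONDITIONS_PER_CASE = 14  # assumption: the shared denominator in the Pass column


def totals_by_difficulty(passes, per_case):
    """Return {difficulty: (passed, attempted)} summed over cases."""
    return {d: (sum(v), per_case * len(v)) for d, v in passes.items()}


totals = totals_by_difficulty(PASSES, CONDITIONS_PER_CASE)
for difficulty, (passed, attempted) in totals.items():
    print(f"{difficulty}: {passed}/{attempted} = {passed / attempted:.1%}")
# low: 109/126 = 86.5%
# mid: 78/126 = 61.9%
# high: 88/126 = 69.8%
```

Read this way, the mid bucket (78/126) lands below the high bucket (88/126), so the difficulty labels do not order the cases monotonically by pass rate across the 14 conditions.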