HarnessBench

HarnessBench compares Codex, Claude Code, and Cursor Agent on the same 27 real-repository debugging issues. The primary score is deterministic hidden-test pass/fail, with wall time, token usage, cost estimates, and auxiliary failure reviews retained for analysis.

27 debugging tasks
14 harness/model/effort conditions
378 official agent runs
72.8% overall hidden-test pass rate

[Figures: Pass Rate by Harness × Model × Effort; Median Wall Time; Success by Difficulty; Pass Rate vs Time]

Condition Table

| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|

Interpretation and Artifacts

The strongest observed pass rate was Codex / GPT-5.5 / xhigh at 22/27 (81.5%). However, with only 27 paired tasks, no pairwise success-rate gap reached p < 0.05. Runtime differences were clearer: Cursor Composer 2 fast and Cursor GPT-5.5 medium were substantially faster, while higher Opus effort settings took longer to finish without a statistically reliable gain in success rate in this run.
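To see why 27 paired tasks leave little room for significance, here is a minimal sketch of an exact McNemar (sign) test on paired pass/fail outcomes. The source does not state which test was used, so the choice of test is an assumption, and the function name `mcnemar_exact` and the discordant counts in the usage note are hypothetical.

```python
from math import comb


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results.

    only_a: tasks that condition A passed and condition B failed.
    only_b: tasks that condition B passed and condition A failed.
    Tasks where both conditions agree carry no information and are ignored.
    """
    n = only_a + only_b
    if n == 0:
        # No discordant tasks: the data cannot distinguish the conditions.
        return 1.0
    k = min(only_a, only_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical example: A wins 7 tasks that B loses, B wins 2 that A loses.
p_value = mcnemar_exact(7, 2)
```

With these hypothetical counts, a net gap of 5 tasks out of 27 (about 18.5 percentage points) gives p ≈ 0.18, well above 0.05. Under this test a split such as 8 versus 1 would be needed to fall below the threshold (p ≈ 0.039).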

Cost figures should be read with care because they are not measured the same way across harnesses. Claude Code reports dollar cost directly, Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates, and Cursor Opus/Composer conditions may not expose comparable cost data.
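A cost-per-pass figure can be sketched as below. This assumes the metric is a condition's total cost across all of its runs divided by its number of passing runs; the source does not define the Cost/pass column, so this definition, the function name `cost_per_pass`, and the dollar amount in the example are all hypothetical.

```python
from typing import Optional


def cost_per_pass(total_cost_usd: Optional[float], passes: int) -> Optional[float]:
    """Return dollars spent per passing run, or None when it is undefined.

    total_cost_usd is None when the harness exposes no comparable cost data.
    """
    if total_cost_usd is None or passes == 0:
        # Either no cost data or no passing runs to divide by.
        return None
    return total_cost_usd / passes


# Hypothetical condition: 44.00 USD spent over 27 runs, 22 of which passed.
example = cost_per_pass(44.0, 22)
```

Returning `None` rather than zero keeps conditions without cost data from looking free when they are ranked against conditions with reported or estimated costs.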

The public repository contains the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this website; they are retained locally under the experiment artifact policy because they can be large and may contain provider-specific session details.

Case Set

| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
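The per-case pass counts in the case set can be rolled up to check the headline numbers and to get success by difficulty. The sketch below copies the difficulty label and pass count of each case from the table above. The names `CASES`, `CONDITIONS`, and `pass_counts_by_difficulty` are illustrative and are not taken from the benchmark's report generator.

```python
from collections import defaultdict

CONDITIONS = 14  # harness/model/effort conditions run on every case

# (difficulty, passing conditions) per case, in case-table order.
CASES = [
    ("high", 14), ("low", 14), ("mid", 14),   # axios/axios
    ("high", 12), ("low", 14), ("mid", 14),   # fastapi/fastapi
    ("high", 9), ("low", 14), ("mid", 7),     # go-gitea/gitea
    ("high", 5), ("low", 14), ("mid", 9),     # jesseduffield/lazygit
    ("high", 14), ("low", 12), ("mid", 0),    # langflow-ai/langflow
    ("high", 0), ("low", 12), ("mid", 11),    # louislam/uptime-kuma
    ("high", 13), ("low", 13), ("mid", 2),    # sharkdp/bat
    ("high", 12), ("low", 14), ("mid", 11),   # usememos/memos
    ("high", 9), ("low", 2), ("mid", 10),     # vitejs/vite
]


def pass_counts_by_difficulty(cases, conditions=CONDITIONS):
    """Return {difficulty: (passed_runs, total_runs)}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for difficulty, passes in cases:
        passed[difficulty] += passes
        total[difficulty] += conditions
    return {d: (passed[d], total[d]) for d in passed}


summary = pass_counts_by_difficulty(CASES)
overall_passed = sum(passes for _, passes in CASES)
overall_total = CONDITIONS * len(CASES)
overall_rate = round(100 * overall_passed / overall_total, 1)
```

The totals agree with the headline figures: 275 of 378 runs passed, which rounds to 72.8%. By difficulty the counts are 109/126 (86.5%) for low, 78/126 (61.9%) for mid, and 88/126 (69.8%) for high, so in this case set the mid tier passed less often than the high tier.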