HarnessBench

HarnessBench compares Codex, Claude Code, Cursor Agent, and Antigravity CLI on the same 27 real-repository debugging issues. The primary score is deterministic hidden-test pass/fail, with wall time, token usage, cost estimates, and auxiliary failure reviews retained for analysis.

Links: GitHub repository, Official artifacts, Blog post, Supplemental post

27 debugging tasks

18 harness/model/effort conditions

484 valid agent runs, including supplemental

70.9% overall hidden-test pass rate

[Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Success by Difficulty; Pass Rate vs Time]

Condition Table

Harness × Model × Effort	Harness	Pass	Pass rate	Median time	Cost/pass	Timeouts

Interpretation and Artifacts

Across the 18 displayed conditions, Codex / GPT-5.5 / xhigh still has the highest observed pass rate at 22/27. Cursor / Composer 2.5 fast lands in the middle of the main pack at 19/27: below the Codex/GPT-5.5, Cursor/GPT-5.5, and stronger Opus 4.7 conditions, which pass 20 to 22 tasks, but above Composer 2 fast and tied with or ahead of several slower high-effort conditions. Claude Code / Opus 4.8 / xhigh scores 14/25 after excluding two infrastructure-invalid Vite runs; both excluded implementations passed the hidden tests when rechecked with CI setup. With only 27 paired tasks, these success-rate gaps should be read as directional rather than statistically firm.
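To make the "directional rather than statistically firm" caveat concrete, here is a minimal sketch that puts a 95% Wilson score interval around some of the pass counts quoted above. The function name `wilson_interval` and the choice of the Wilson method are illustrative assumptions, not part of the HarnessBench report generator. The interval also treats each condition as an independent binomial sample, which ignores the paired task design, so it is only a rough indication of uncertainty.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% Wilson score interval for passes/n.

    Illustrative helper (an assumption, not HarnessBench code).
    """
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Pass counts taken from the paragraph above.
conditions = {
    "Codex / GPT-5.5 / xhigh": (22, 27),
    "Cursor / Composer 2.5 fast": (19, 27),
    "Claude Code / Opus 4.8 / xhigh": (14, 25),
}

for name, (passes, n) in conditions.items():
    low, high = wilson_interval(passes, n)
    print(f"{name}: {passes}/{n} = {passes / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

Under these assumptions the intervals for 22/27 and 19/27 overlap substantially, which is consistent with reading a three-task gap as suggestive only. A paired analysis over per-task outcomes would be the more appropriate test, but it needs the per-task results rather than the totals shown here.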

Cost should be read carefully. Claude Code reports dollar cost directly, Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates, and Cursor Opus/Composer conditions may not expose comparable cost data.

The public repository contains the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this website; they are retained locally under the experiment artifact policy because they can be large and may contain provider-specific session details.

Official artifacts: artifact directory, summary.json, manifest.json, failure-reviews.json, full results.html
Opus 4.8 xhigh artifacts: artifact directory, summary.json, results.html

Case Set (official 14-condition aggregate)

Case	Repo	Difficulty	Size	Pass	PR
axios-axios-high-http-connect-timeout	axios/axios	high	small	14/14	PR
axios-axios-low-settle-error-code	axios/axios	low	small	14/14	PR
axios-axios-mid-fetch-global-access	axios/axios	mid	small	14/14	PR
fastapi-fastapi-high-pydantic-json-fast-path	fastapi/fastapi	high	large	12/14	PR
fastapi-fastapi-low-remove-vibe-decorator	fastapi/fastapi	low	large	14/14	PR
fastapi-fastapi-mid-jsonable-encoder-color-types	fastapi/fastapi	mid	large	14/14	PR
go-gitea-gitea-high-compare-no-common-history	go-gitea/gitea	high	large	9/14	PR
go-gitea-gitea-low-schedule-null-payload	go-gitea/gitea	low	large	14/14	PR
go-gitea-gitea-mid-pr-merge-self-reference	go-gitea/gitea	mid	large	7/14	PR
jesseduffield-lazygit-high-branch-divergence-fast-path	jesseduffield/lazygit	high	small	5/14	PR
jesseduffield-lazygit-low-github-owner-casing	jesseduffield/lazygit	low	small	14/14	PR
jesseduffield-lazygit-mid-preserve-commit-message-whitespace	jesseduffield/lazygit	mid	small	9/14	PR
langflow-ai-langflow-high-lfx-stream-fallback	langflow-ai/langflow	high	large	14/14	PR
langflow-ai-langflow-low-loguru-file-routing	langflow-ai/langflow	low	large	12/14	PR
langflow-ai-langflow-mid-mcp-connectable-inputs	langflow-ai/langflow	mid	large	0/14	PR
louislam-uptime-kuma-high-websocket-auth-options	louislam/uptime-kuma	high	medium	0/14	PR
louislam-uptime-kuma-low-submillisecond-ping-chart	louislam/uptime-kuma	low	medium	12/14	PR
louislam-uptime-kuma-mid-uptime-cleanup-buckets	louislam/uptime-kuma	mid	medium	11/14	PR
sharkdp-bat-high-fallback-syntax	sharkdp/bat	high	small	13/14	PR
sharkdp-bat-low-zip-binary-detection	sharkdp/bat	low	small	13/14	PR
sharkdp-bat-mid-control-character-wrapping	sharkdp/bat	mid	small	2/14	PR
usememos-memos-high-missing-related-users	usememos/memos	high	medium	12/14	PR
usememos-memos-low-omit-internal-user-settings	usememos/memos	low	medium	14/14	PR
usememos-memos-mid-mixed-case-user-resource-names	usememos/memos	mid	medium	11/14	PR
vitejs-vite-high-hmr-patch-esm-sentinel	vitejs/vite	high	large	9/14	PR
vitejs-vite-low-flatten-id-sanitized-chars	vitejs/vite	low	large	2/14	PR
vitejs-vite-mid-deno-workspace-root	vitejs/vite	mid	large	10/14	PR
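As a rough way to read the table above by difficulty tier, the sketch below sums the Pass column per tier. The pass counts are copied from the table; the dictionary layout and the function name `pass_totals_by_difficulty` are illustrative assumptions, not the format used by the official summary.json.

```python
# Passes out of 14 conditions per case, copied from the Case Set table.
# The nested-dict layout is an assumption made for this sketch.
PASSES = {
    "axios/axios":           {"low": 14, "mid": 14, "high": 14},
    "fastapi/fastapi":       {"low": 14, "mid": 14, "high": 12},
    "go-gitea/gitea":        {"low": 14, "mid": 7,  "high": 9},
    "jesseduffield/lazygit": {"low": 14, "mid": 9,  "high": 5},
    "langflow-ai/langflow":  {"low": 12, "mid": 0,  "high": 14},
    "louislam/uptime-kuma":  {"low": 12, "mid": 11, "high": 0},
    "sharkdp/bat":           {"low": 13, "mid": 2,  "high": 13},
    "usememos/memos":        {"low": 14, "mid": 11, "high": 12},
    "vitejs/vite":           {"low": 2,  "mid": 10, "high": 9},
}
CONDITIONS = 14  # size of the official aggregate


def pass_totals_by_difficulty(passes: dict) -> dict:
    """Sum passes per difficulty tier across all repositories."""
    totals = {"low": 0, "mid": 0, "high": 0}
    for per_repo in passes.values():
        for difficulty, count in per_repo.items():
            totals[difficulty] += count
    return totals


totals = pass_totals_by_difficulty(PASSES)
attempts_per_tier = len(PASSES) * CONDITIONS  # 9 repos x 14 conditions

for difficulty in ("low", "mid", "high"):
    rate = totals[difficulty] / attempts_per_tier
    print(f"{difficulty}: {totals[difficulty]}/{attempts_per_tier} ({rate:.1%})")
```

In this aggregate the mid tier has fewer passes (78/126) than the high tier (88/126), largely because of a few near-zero cases such as langflow-ai-langflow-mid-mcp-connectable-inputs and sharkdp-bat-mid-control-character-wrapping. The difficulty labels therefore do not order the cases strictly by observed pass rate.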