Benchmark
Harness Bench
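HarnessBench compares Codex, Claude Code, Cursor Agent, and Antigravity CLI on the same 27 debugging tasks derived from real repositories. The primary metric is deterministic pass/fail against hidden tests; wall time, tokens, cost estimates, and supplementary failure reviews are also retained for analysis.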
Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Success Rate by Difficulty; Pass Rate vs Time
Condition List
| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|
Interpretation and Artifacts
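Across all 18 conditions, the highest observed pass rate remains Codex / GPT-5.5 / xhigh at 22/27. Cursor / Composer 2.5 fast reached 19/27, the same count as Codex / GPT-5.5 high. That falls short of the 20-22 passes of the Codex/GPT-5.5, Cursor/GPT-5.5, and strong Opus 4.7 conditions, but it is above Composer 2 fast. Claude Code / Opus 4.8 / xhigh is at 14/25 once the two Vite cases marked infra invalid are excluded. The implementations for those two cases pass the hidden tests when rerun under the CI configuration. Because n=27, differences in pass rate are best read as directional indications.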
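To illustrate why differences at n=27 are only directional, here is a minimal sketch that computes a 95% Wilson score interval for a pass count. The Wilson interval is an assumption chosen for illustration; the benchmark itself does not state which interval, if any, it uses. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% coverage. This is an illustrative
    choice, not something the benchmark report specifies.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


# Pass counts quoted in the text above.
lo_22, hi_22 = wilson_interval(22, 27)  # Codex / GPT-5.5 / xhigh
lo_19, hi_19 = wilson_interval(19, 27)  # Cursor / Composer 2.5 fast

# True when the two intervals share some range of plausible pass rates.
intervals_overlap = lo_22 <= hi_19 and lo_19 <= hi_22
```

Under these assumptions the intervals for 22/27 and 19/27 overlap substantially, which is consistent with reading a three-case gap as a direction and not as a settled ranking.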
Cost figures need to be read with care. Claude Code reports dollar cost directly, whereas Codex and some Cursor GPT-5.5 runs use an API-equivalent rate-card estimate. The Cursor Opus/Composer conditions may not produce comparable cost data.
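As a rough sketch of how a rate-card estimate and a cost-per-pass figure could be derived, the following assumes cost is token counts multiplied by per-million-token rates, and that cost per pass is total cost divided by the number of passing runs. Both the formula and the rates are assumptions for illustration: `HYPOTHETICAL_RATES`, `estimate_cost`, and `cost_per_pass` are hypothetical names, and the rate values are placeholders, not any provider's actual pricing.

```python
from typing import Optional, Sequence

# Placeholder rates in USD per million tokens. These are NOT real prices.
HYPOTHETICAL_RATES = {"input": 1.0, "output": 10.0}


def estimate_cost(input_tokens: int, output_tokens: int,
                  rates: dict = HYPOTHETICAL_RATES) -> float:
    """API-equivalent estimate: tokens times per-million-token rate."""
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


def cost_per_pass(run_costs: Sequence[Optional[float]], passes: int) -> Optional[float]:
    """Total cost of all runs divided by the number of passing runs.

    Returns None when there are no passes or when any run lacks cost data,
    so that incomplete conditions are not reported as if they were comparable.
    """
    if passes == 0 or any(cost is None for cost in run_costs):
        return None
    return sum(run_costs) / passes
```

Returning `None` for conditions with missing cost data keeps directly reported and estimated figures from being mixed silently with incomplete ones.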
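The public repository includes the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this site. They are large and may contain provider-specific session details, so they are kept locally in accordance with the experiment artifact policy.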
Case Set (aggregated over the 14 official conditions)
| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
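The per-case counts in the table can be rolled up by difficulty label. The sketch below hard-codes the Pass column from the table above, grouped by difficulty in the table's repository order (axios, fastapi, gitea, lazygit, langflow, uptime-kuma, bat, memos, vite). The names `PASSES` and `pass_rate_by_difficulty` are hypothetical, and this aggregation is not necessarily how the site's own difficulty chart is computed.

```python
CONDITIONS = 14  # each case was attempted under 14 official conditions

# Pass counts copied from the Case Set table, grouped by difficulty label.
PASSES = {
    "low": [14, 14, 14, 14, 12, 12, 13, 14, 2],
    "mid": [14, 14, 7, 9, 0, 11, 2, 11, 10],
    "high": [14, 12, 9, 5, 14, 0, 13, 12, 9],
}


def pass_rate_by_difficulty(passes=PASSES, conditions=CONDITIONS):
    """Return {difficulty: (total_passes, total_attempts, pass_rate)}."""
    summary = {}
    for difficulty, counts in passes.items():
        total = sum(counts)
        attempts = len(counts) * conditions
        summary[difficulty] = (total, attempts, total / attempts)
    return summary


summary = pass_rate_by_difficulty()
for difficulty, (total, attempts, rate) in summary.items():
    print(f"{difficulty}: {total}/{attempts} ({rate:.1%})")
```

With the table's counts this gives 109/126 for low, 78/126 for mid, and 88/126 for high. The mid group scores below the high group here, largely because of the 0/14 langflow and 2/14 bat cases, so the difficulty labels do not map monotonically onto observed pass rates.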