Benchmark
Harness Bench
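HarnessBench compares Codex, Claude Code, and Cursor Agent on the same 27 debugging tasks drawn from real repositories. The primary metric is deterministic pass/fail from hidden tests; wall time, tokens, cost estimates, and supplementary failure reviews are also retained for analysis.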
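As a rough illustration of what "deterministic pass/fail from hidden tests" can look like in practice, here is a minimal sketch. It assumes the hidden tests run as a single command whose exit code is zero on success; the `Outcome` type and `run_hidden_tests` function are hypothetical names, not the benchmark's actual runner. The measured time covers only the test command, not the agent session.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Outcome:
    passed: bool
    timed_out: bool
    wall_seconds: float


def run_hidden_tests(cmd: list[str], timeout_s: float) -> Outcome:
    """Run a test command and map its exit code to pass/fail.

    Assumption: exit code 0 means every hidden test passed.
    A timeout is recorded as a failure with timed_out=True.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return Outcome(False, True, time.monotonic() - start)
    return Outcome(proc.returncode == 0, False, time.monotonic() - start)


if __name__ == "__main__":
    # Stand-in for a real test command: a Python process that exits with 0.
    print(run_hidden_tests([sys.executable, "-c", "raise SystemExit(0)"], 30.0))
```

Because the verdict depends only on the exit code, rerunning the same patch against the same tests should give the same result, provided the tests themselves are not flaky.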
[Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Pass Rate by Difficulty; Pass Rate vs. Time]
Conditions
| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|
Interpretation and Artifacts
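The highest observed pass rate was 22/27, for Codex / GPT-5.5 / xhigh. However, across the 27 paired tasks, no difference in pass rate reaches p < 0.05. Differences in run time are clearer: Cursor Composer 2 fast and Cursor GPT-5.5 medium were considerably faster. Higher Opus effort settings think for longer, but in this run that did not show up as a statistically reliable gain in pass rate.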
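The source does not say which test produced the p-values. One common choice for paired pass/fail data is the exact McNemar test, sketched below under that assumption. The outcome vectors are made up for illustration and are not results from this run.

```python
from math import comb


def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Only discordant pairs (one condition passes, the other fails) carry
    information; under the null hypothesis each discordant pair is equally
    likely to favor either condition.
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("outcome lists must cover the same tasks")
    only_a = sum(1 for x, y in zip(a_pass, b_pass) if x and not y)
    only_b = sum(1 for x, y in zip(a_pass, b_pass) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical 27-task comparison: 15 both pass, 6 only A, 2 only B, 4 neither.
    a = [True] * 15 + [True] * 6 + [False] * 2 + [False] * 4
    b = [True] * 15 + [False] * 6 + [True] * 2 + [False] * 4
    print(f"A: {sum(a)}/27, B: {sum(b)}/27, p = {mcnemar_exact(a, b):.3f}")
```

In this made-up case a four-task gap (21/27 vs. 17/27) gives p ≈ 0.289, which shows how little power 27 paired tasks offer for separating conditions with similar pass rates.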
Cost figures should be read with care. Claude Code reports dollar cost directly, whereas the figures for Codex and some Cursor GPT-5.5 runs are API-equivalent rate-card estimates. For the Cursor Opus and Composer conditions, comparable cost data may be unavailable.
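To make the distinction between reported and estimated cost concrete, the sketch below computes a cost-per-pass figure. It assumes cost per pass means total cost across all runs divided by the number of passing runs, and it uses hypothetical per-million-token rates; neither the definition nor the rates come from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    passed: bool
    input_tokens: int
    output_tokens: int
    reported_cost_usd: Optional[float] = None  # set when the harness reports dollars


def run_cost(run: Run, in_rate: float, out_rate: float) -> float:
    """Use the reported cost if present, else a rate-card estimate.

    Rates are USD per one million tokens (hypothetical values in the demo).
    """
    if run.reported_cost_usd is not None:
        return run.reported_cost_usd
    return (run.input_tokens / 1e6) * in_rate + (run.output_tokens / 1e6) * out_rate


def cost_per_pass(runs: list[Run], in_rate: float, out_rate: float) -> Optional[float]:
    """Total cost over all runs divided by passing runs (None if nothing passed)."""
    passes = sum(1 for r in runs if r.passed)
    if passes == 0:
        return None
    return sum(run_cost(r, in_rate, out_rate) for r in runs) / passes


if __name__ == "__main__":
    runs = [
        Run(passed=True, input_tokens=1_000_000, output_tokens=100_000),
        Run(passed=False, input_tokens=500_000, output_tokens=50_000),
        Run(passed=True, input_tokens=0, output_tokens=0, reported_cost_usd=1.5),
    ]
    # Hypothetical rate card: $2 per 1M input tokens, $10 per 1M output tokens.
    print(cost_per_pass(runs, in_rate=2.0, out_rate=10.0))
```

Failed runs still add to the numerator, so a condition with cheap runs but a low pass rate can end up with a higher cost per pass than a more expensive but more reliable one. Mixing reported and estimated costs in one column is what makes cross-harness comparison delicate.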
The public repository includes the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this site: they are large and may contain provider-specific session details, so they are kept locally in line with the experiment artifact policy.
Case Set
| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
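
The Pass column can be rolled up by difficulty straight from the table text. The sketch below does this for six rows copied from the table above. It assumes each `x/14` cell counts passing conditions out of 14 attempts for that case; the function name is hypothetical.

```python
from collections import defaultdict

# Six rows copied from the Case Set table (lazygit and bat cases).
SAMPLE_ROWS = """\
| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
"""


def pass_counts_by_difficulty(table_text: str) -> dict[str, tuple[int, int]]:
    """Sum (passes, attempts) per difficulty from markdown table rows."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header, the separator, and anything without an "x/y" Pass cell.
        if len(cells) != 6 or "/" not in cells[4]:
            continue
        passed, attempts = (int(part) for part in cells[4].split("/"))
        totals[cells[2]][0] += passed
        totals[cells[2]][1] += attempts
    return {level: (p, n) for level, (p, n) in totals.items()}


if __name__ == "__main__":
    for level, (p, n) in sorted(pass_counts_by_difficulty(SAMPLE_ROWS).items()):
        print(f"{level}: {p}/{n} ({p / n:.0%})")
```

Even in this small sample the difficulty labels do not order the pass rates: the two mid cases pass less often than the two high cases.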