Benchmark

Harness Bench

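HarnessBench compares Codex, Claude Code, Cursor Agent, and Antigravity CLI on the same 27 debugging tasks derived from real repositories. The primary metric is a deterministic pass/fail result from hidden tests; wall time, tokens, cost estimates, and a supplementary failure review are also kept for analysis.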

27 debugging tasks
18 harness/model/effort conditions
484 valid agent runs, including supplemental
70.9% overall hidden-test pass rate

[Chart: Pass Rate by Harness × Model × Effort]

[Chart: Median Wall Time]

[Chart: Success Rate by Difficulty]

[Chart: Pass Rate vs Time]

Condition List

Columns: Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts

Interpretation and Artifacts

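Across all 18 conditions, the highest observed pass rate remains Codex / GPT-5.5 / xhigh at 22/27. Cursor / Composer 2.5 fast scored 19/27, the same count as Codex / GPT-5.5 high. That falls short of the Codex/GPT-5.5, Cursor/GPT-5.5, and strong Opus 4.7 conditions at 20-22 passes, but is above Composer 2 fast. Claude Code / Opus 4.8 / xhigh is 14/25 once the two infra-invalid Vite cases are excluded. The implementations from those two cases do pass the hidden tests when rerun under the CI configuration. With n=27, differences in pass rate are best read as directional.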
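To illustrate why n=27 only supports directional reading, the sketch below computes a 95% Wilson score interval for the two pass counts quoted above (22/27 and 19/27). The Wilson interval is a choice made here for illustration, not a method the benchmark report itself states; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval (an assumption made
    for this illustration, not a setting taken from the benchmark).
    """
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Pass counts quoted in the text: 22/27 and 19/27.
    for label, passes in [("22/27", 22), ("19/27", 19)]:
        low, high = wilson_interval(passes, 27)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

Under these assumptions the two intervals come out at roughly 0.63 to 0.92 and 0.52 to 0.84, so they overlap heavily, which is consistent with treating a three-pass gap as a direction rather than a firm ranking.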

Cost figures need to be read with care. Claude Code reports dollar cost directly, whereas Codex and some Cursor GPT-5.5 runs use API-equivalent rate-card estimates. For the Cursor Opus/Composer conditions, comparable cost data may not be available.

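The public repository includes the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this site. Because they are large and may contain provider-specific session details, they are kept locally in accordance with the experiment artifact policy.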

Case Set (aggregate over the 14 official conditions)

| Case | Repo | Difficulty | Size | Pass | PR |
| --- | --- | --- | --- | --- | --- |
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
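
The per-case counts in the Case Set table can be rolled up by difficulty label. The following is a minimal sketch of that arithmetic, with the (difficulty, passes) pairs copied from the table's Difficulty and Pass columns; the names `CASES`, `RUNS_PER_CASE`, and `pass_rate_by_difficulty` are hypothetical and are not part of the published report generator.

```python
from collections import defaultdict

# Each case was run under the 14 official conditions (denominator in the table).
RUNS_PER_CASE = 14

# (difficulty, passes) per case, in table order; one row of three per repository.
CASES = [
    ("high", 14), ("low", 14), ("mid", 14),  # axios/axios
    ("high", 12), ("low", 14), ("mid", 14),  # fastapi/fastapi
    ("high", 9), ("low", 14), ("mid", 7),    # go-gitea/gitea
    ("high", 5), ("low", 14), ("mid", 9),    # jesseduffield/lazygit
    ("high", 14), ("low", 12), ("mid", 0),   # langflow-ai/langflow
    ("high", 0), ("low", 12), ("mid", 11),   # louislam/uptime-kuma
    ("high", 13), ("low", 13), ("mid", 2),   # sharkdp/bat
    ("high", 12), ("low", 14), ("mid", 11),  # usememos/memos
    ("high", 9), ("low", 2), ("mid", 10),    # vitejs/vite
]


def pass_rate_by_difficulty(cases, runs_per_case=RUNS_PER_CASE):
    """Sum passes and runs per difficulty label; returns {label: (passed, total)}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for difficulty, passes in cases:
        passed[difficulty] += passes
        total[difficulty] += runs_per_case
    return {label: (passed[label], total[label]) for label in passed}


if __name__ == "__main__":
    for label, (p, t) in sorted(pass_rate_by_difficulty(CASES).items()):
        print(f"{label}: {p}/{t} = {p / t:.1%}")
```

On the table's figures this gives 109/126 for low, 78/126 for mid, and 88/126 for high, so within these 14 conditions the mid-labelled cases passed less often than the high-labelled ones. These totals cover only the 14 official conditions and therefore need not match the 18-condition overall pass rate.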