Benchmark

Harness Bench

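HarnessBench compares Codex, Claude Code, and Cursor Agent on the same 27 debugging tasks drawn from real repositories. The primary metric is deterministic pass/fail against hidden tests; wall time, token counts, cost estimates, and a supplementary failure review are also retained for analysis.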

27 debugging tasks
14 harness/model/effort conditions
378 official agent runs
72.8% overall hidden-test pass rate

[Charts: Pass Rate by Harness × Model × Effort; Median Wall Time; Pass Rate by Difficulty; Pass Rate vs Time]

Conditions

| Harness × Model × Effort | Harness | Pass | Pass rate | Median time | Cost/pass | Timeouts |
|---|---|---|---|---|---|---|

Interpretation and Artifacts

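The highest observed pass rate was Codex / GPT-5.5 / xhigh at 22/27. However, across the 27 paired tasks, no difference in pass rate reached p < 0.05. Differences in run time were clearer: Cursor Composer 2 fast and Cursor GPT-5.5 medium were considerably faster than the other conditions. Higher Opus effort settings think for longer, but in this run that did not show up as a statistically reliable gain in pass rate.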
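To illustrate why a gap of a few tasks out of 27 can fail to reach p < 0.05, here is a minimal sketch of a paired comparison. It assumes an exact McNemar test on discordant task pairs; the report does not state which test was actually used. The function name `mcnemar_exact` and the outcome lists are hypothetical and are not benchmark data.

```python
from math import comb


def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Assumption: this is one reasonable test for paired binary results,
    not necessarily the one used in the report.
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("both conditions must cover the same tasks")
    # Only discordant pairs (one condition passes, the other fails)
    # carry information about a difference between conditions.
    only_a = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    only_b = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Lower tail of Binomial(n, 0.5), doubled for two sides, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes over 27 tasks: A passes 22, B passes 18,
# with 6 tasks only A solves and 2 tasks only B solves.
a = [True] * 22 + [False] * 5
b = [True] * 16 + [False] * 6 + [True] * 2 + [False] * 3
print(round(mcnemar_exact(a, b), 3))  # 0.289
```

In this made-up case a four-task lead yields p of about 0.29, because only the eight discordant tasks contribute evidence.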

Cost figures should be read with care. Claude Code reports dollar cost directly, whereas the figures for Codex and some Cursor GPT-5.5 runs are API-equivalent rate-card estimates. The Cursor Opus and Composer conditions may not yield comparable cost data.
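The distinction between reported and estimated cost can be sketched as follows. This is an illustrative sketch only: the `Run` record, the function names, and the definition of cost per pass (total cost over all runs divided by the number of passing runs) are assumptions, and the rates in the example are placeholders rather than real prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """Hypothetical per-run record; field names are illustrative."""
    passed: bool
    reported_cost_usd: Optional[float] = None  # set when the harness reports dollars
    input_tokens: int = 0
    output_tokens: int = 0


def run_cost(run: Run, rates: Optional[tuple[float, float]]) -> Optional[float]:
    """Prefer the reported dollar cost; otherwise estimate from a rate card."""
    if run.reported_cost_usd is not None:
        return run.reported_cost_usd
    if rates is None:
        return None  # no comparable cost data for this run
    in_rate, out_rate = rates  # USD per million tokens
    return (run.input_tokens * in_rate + run.output_tokens * out_rate) / 1_000_000


def cost_per_pass(runs: list[Run],
                  rates: Optional[tuple[float, float]] = None) -> Optional[float]:
    """Total cost of all runs divided by the number of passing runs (assumed definition)."""
    costs = [run_cost(r, rates) for r in runs]
    passes = sum(r.passed for r in runs)
    if passes == 0 or any(c is None for c in costs):
        return None
    return sum(costs) / passes


# Placeholder rates (NOT real prices): 2.0 and 10.0 USD per million tokens.
estimated = [Run(True, None, 1_000_000, 100_000), Run(False, None, 500_000, 50_000)]
print(cost_per_pass(estimated, rates=(2.0, 10.0)))  # 4.5
```

Returning `None` when any run lacks cost data keeps conditions without comparable figures out of the comparison instead of silently understating their cost.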

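The public repository includes the benchmark specification, cases, hidden tests, runner, report generator, and summary artifacts. Raw harness execution logs are not currently published on this site. They are large and may contain provider-specific session details, so they are retained locally in accordance with the experiment artifact policy.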

Case Set

| Case | Repo | Difficulty | Size | Pass | PR |
|---|---|---|---|---|---|
| axios-axios-high-http-connect-timeout | axios/axios | high | small | 14/14 | PR |
| axios-axios-low-settle-error-code | axios/axios | low | small | 14/14 | PR |
| axios-axios-mid-fetch-global-access | axios/axios | mid | small | 14/14 | PR |
| fastapi-fastapi-high-pydantic-json-fast-path | fastapi/fastapi | high | large | 12/14 | PR |
| fastapi-fastapi-low-remove-vibe-decorator | fastapi/fastapi | low | large | 14/14 | PR |
| fastapi-fastapi-mid-jsonable-encoder-color-types | fastapi/fastapi | mid | large | 14/14 | PR |
| go-gitea-gitea-high-compare-no-common-history | go-gitea/gitea | high | large | 9/14 | PR |
| go-gitea-gitea-low-schedule-null-payload | go-gitea/gitea | low | large | 14/14 | PR |
| go-gitea-gitea-mid-pr-merge-self-reference | go-gitea/gitea | mid | large | 7/14 | PR |
| jesseduffield-lazygit-high-branch-divergence-fast-path | jesseduffield/lazygit | high | small | 5/14 | PR |
| jesseduffield-lazygit-low-github-owner-casing | jesseduffield/lazygit | low | small | 14/14 | PR |
| jesseduffield-lazygit-mid-preserve-commit-message-whitespace | jesseduffield/lazygit | mid | small | 9/14 | PR |
| langflow-ai-langflow-high-lfx-stream-fallback | langflow-ai/langflow | high | large | 14/14 | PR |
| langflow-ai-langflow-low-loguru-file-routing | langflow-ai/langflow | low | large | 12/14 | PR |
| langflow-ai-langflow-mid-mcp-connectable-inputs | langflow-ai/langflow | mid | large | 0/14 | PR |
| louislam-uptime-kuma-high-websocket-auth-options | louislam/uptime-kuma | high | medium | 0/14 | PR |
| louislam-uptime-kuma-low-submillisecond-ping-chart | louislam/uptime-kuma | low | medium | 12/14 | PR |
| louislam-uptime-kuma-mid-uptime-cleanup-buckets | louislam/uptime-kuma | mid | medium | 11/14 | PR |
| sharkdp-bat-high-fallback-syntax | sharkdp/bat | high | small | 13/14 | PR |
| sharkdp-bat-low-zip-binary-detection | sharkdp/bat | low | small | 13/14 | PR |
| sharkdp-bat-mid-control-character-wrapping | sharkdp/bat | mid | small | 2/14 | PR |
| usememos-memos-high-missing-related-users | usememos/memos | high | medium | 12/14 | PR |
| usememos-memos-low-omit-internal-user-settings | usememos/memos | low | medium | 14/14 | PR |
| usememos-memos-mid-mixed-case-user-resource-names | usememos/memos | mid | medium | 11/14 | PR |
| vitejs-vite-high-hmr-patch-esm-sentinel | vitejs/vite | high | large | 9/14 | PR |
| vitejs-vite-low-flatten-id-sanitized-chars | vitejs/vite | low | large | 2/14 | PR |
| vitejs-vite-mid-deno-workspace-root | vitejs/vite | mid | large | 10/14 | PR |
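
As a cross-check, the headline figures can be recomputed from the per-case Pass counts in the Case Set. The sketch below copies those counts by difficulty (in repository order: axios, fastapi, gitea, lazygit, langflow, uptime-kuma, bat, memos, vite). The variable and function names are illustrative and are not taken from the project's report generator.

```python
CONDITIONS = 14  # harness/model/effort conditions per case

# Passing runs per case (out of 14), copied from the Case Set by difficulty.
PASSES = {
    "low": [14, 14, 14, 14, 12, 12, 13, 14, 2],
    "mid": [14, 14, 7, 9, 0, 11, 2, 11, 10],
    "high": [14, 12, 9, 5, 14, 0, 13, 12, 9],
}


def pass_rate(passes: list[int], conditions: int = CONDITIONS) -> float:
    """Fraction of runs that passed across the given cases."""
    return sum(passes) / (len(passes) * conditions)


all_cases = [p for counts in PASSES.values() for p in counts]
total_runs = len(all_cases) * CONDITIONS

print(len(all_cases), total_runs)               # 27 378
print(round(100 * pass_rate(all_cases), 1))     # 72.8
for difficulty, counts in PASSES.items():
    print(difficulty, round(100 * pass_rate(counts), 1))
# low 86.5
# mid 61.9
# high 69.8
```

The totals agree with the summary: 275 passing runs out of 378 gives 72.8%.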