Evaluating Antigravity Gemini 3.5 Flash and Cursor Composer 2.5 on HarnessBench

by 逆瀬川ちゃん


Hi there! This is Sakasegawa-chan (@gyakuse)!

Today I want to look at a supplemental HarnessBench run that adds Antigravity / Gemini 3.5 Flash (High) and Cursor / Composer 2.5 (fast and normal), and compare it against the existing Codex / Claude Code / Cursor conditions.

What I Evaluated

In the previous HarnessBench post, I compared Codex CLI, Claude Code, and Cursor Agent on the same 27 real-repository debugging tasks.

This time I added three supplemental conditions:

| Harness | Model | Effort / mode |
| --- | --- | --- |
| Antigravity CLI | Gemini 3.5 Flash | high |
| Cursor Agent | Composer 2.5 | fast |
| Cursor Agent | Composer 2.5 | normal |

The task set is unchanged: 9 real OSS repositories, with low / mid / high tasks for each repository, for 27 tasks total. A run passes only when both the core and regression hidden tests pass.
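As a rough illustration of the task grid and the pass rule described above, here is a minimal Python sketch. It is not the actual HarnessBench code: the `RunResult` type, the `run_passes` helper, and the `repo_1` ... `repo_9` placeholder names are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunResult:
    """Hypothetical record of one run's hidden-test outcomes."""
    core_passed: bool
    regression_passed: bool


def run_passes(result: RunResult) -> bool:
    # A run counts as a pass only when BOTH hidden test suites pass.
    return result.core_passed and result.regression_passed


# Placeholder repository names; the real benchmark uses 9 OSS repositories.
REPOS = [f"repo_{i}" for i in range(1, 10)]
DIFFICULTIES = ["low", "mid", "high"]

# One task per (repository, difficulty) pair: 9 x 3 = 27 tasks.
TASKS = list(product(REPOS, DIFFICULTIES))

if __name__ == "__main__":
    print(len(TASKS))
    print(run_passes(RunResult(core_passed=True, regression_passed=False)))
```

The point of the sketch is only that a fix which passes the core test but breaks a regression test is scored as a failure.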

The supplemental experiment ID is antigravity-cursor-composer-2.5-20260522T052522Z. I merged the results into the same charts and condition table on the HarnessBench result page.

Results

First, here are the three added conditions by themselves:

| Condition | Pass | Pass rate | Median time | Low | Mid | High | Timeout |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cursor / Composer 2.5 / fast | 19/27 | 70.4% | 7.5 min | 9/9 | 5/9 | 5/9 | 0 |
| Cursor / Composer 2.5 / normal | 18/27 | 66.7% | 8.1 min | 9/9 | 5/9 | 4/9 | 0 |
| Antigravity / Gemini 3.5 Flash / high | 17/27 | 63.0% | 14.3 min | 8/9 | 5/9 | 4/9 | 1 |
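The pass counts and pass rates in the table above follow directly from the per-difficulty columns. A small Python sketch of that arithmetic, with the low / mid / high counts copied from the table (the `TIER_PASSES` dictionary and `summarize` helper are names invented for this example):

```python
TOTAL_TASKS = 27

# (low, mid, high) passes out of 9 each, copied from the results table.
TIER_PASSES = {
    "Cursor / Composer 2.5 / fast": (9, 5, 5),
    "Cursor / Composer 2.5 / normal": (9, 5, 4),
    "Antigravity / Gemini 3.5 Flash / high": (8, 5, 4),
}


def summarize(tiers: tuple[int, int, int]) -> tuple[int, float]:
    """Return (total passes, pass rate in percent rounded to one decimal)."""
    passed = sum(tiers)
    return passed, round(100 * passed / TOTAL_TASKS, 1)


if __name__ == "__main__":
    for condition, tiers in TIER_PASSES.items():
        passed, rate = summarize(tiers)
        print(f"{condition}: {passed}/{TOTAL_TASKS} ({rate}%)")
```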

When placed next to the existing 14 conditions, the top of the ranking does not change. The strongest observed condition is still Codex / GPT-5.5 / xhigh at 22/27. Cursor / GPT-5.5 medium, Cursor / GPT-5.5 high, Codex / GPT-5.5 medium, and Cursor / Opus 4.7 max follow at 21/27.

Composer 2.5 fast is 19/27. It does not reach the top group, but it ties Codex / GPT-5.5 high and improves over Composer 2 fast, which was 17/27. Composer 2.5 normal is 18/27, the same pass count as Composer 2 normal.

Antigravity / Gemini 3.5 Flash (High) is 17/27. That ties Claude Code / Opus 4.7 max and Cursor / Composer 2 fast, putting it in the lower group among the 17 displayed conditions.

Position Against Existing Conditions

By pass count, the added conditions sit roughly here:

| Condition | Pass | Median time | Reading |
| --- | --- | --- | --- |
| Codex / GPT-5.5 / xhigh | 22/27 | 10.2 min | top observed condition |
| Cursor / GPT-5.5 / medium | 21/27 | 4.7 min | strong speed/accuracy balance |
| Cursor / GPT-5.5 / high | 21/27 | 6.2 min | top group |
| Cursor / Opus 4.7 / max | 21/27 | 19.7 min | high pass count, slow |
| Cursor / Composer 2.5 / fast | 19/27 | 7.5 min | upper-middle, improved over Composer 2 fast |
| Codex / GPT-5.5 / high | 19/27 | 9.0 min | same pass count as Composer 2.5 fast |
| Cursor / Composer 2.5 / normal | 18/27 | 8.1 min | same pass count as Composer 2 normal |
| Cursor / Composer 2 / fast | 17/27 | 3.6 min | fast, but lower pass count |
| Antigravity / Gemini 3.5 Flash / high | 17/27 | 14.3 min | lower pass count and relatively slow |
| Claude Code / Opus 4.7 / max | 17/27 | 15.1 min | same pass count as Antigravity |

With only 27 tasks, I would not overread the difference between 19/27 and 21/27. Still, Composer 2.5 fast looks better than Composer 2 fast on this task set.
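To make the small-sample caveat concrete, one option is to put a 95% Wilson score interval around each pass rate. This is a sketch under my own assumptions, not part of the HarnessBench analysis: it treats the 27 tasks as independent trials, which is a simplification because the same tasks are shared across conditions. The `wilson_interval` helper is written for this example.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for passes in (17, 19, 21):
        low, high = wilson_interval(passes, 27)
        print(f"{passes}/27: {low:.2f} to {high:.2f}")
```

Under these assumptions the intervals for 19/27 and 21/27 overlap, which is why a two-task gap is better read as directional than as a firm ranking.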

Cursor / GPT-5.5 medium remains a very strong point of comparison: 21/27 with a 4.7-minute median. Composer 2.5 fast is 19/27 with a 7.5-minute median, so in this run Cursor / GPT-5.5 medium is better on both pass count and runtime.
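"Better on both pass count and runtime" is a dominance comparison over two metrics. A minimal Python sketch of that check, using numbers from the table above (the `dominates` function and the tuple layout are assumptions for this illustration):

```python
# Each condition is (pass count out of 27, median minutes), from the table.
CURSOR_GPT55_MEDIUM = (21, 4.7)
COMPOSER_25_FAST = (19, 7.5)
COMPOSER_2_FAST = (17, 3.6)


def dominates(a: tuple[int, float], b: tuple[int, float]) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one.

    More passes is better; lower median time is better.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better


if __name__ == "__main__":
    print(dominates(CURSOR_GPT55_MEDIUM, COMPOSER_25_FAST))
    print(dominates(COMPOSER_25_FAST, COMPOSER_2_FAST))
```

The second comparison comes out false in both directions: Composer 2.5 fast has more passes, while Composer 2 fast has the shorter median time, so neither one dominates the other.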

How I Read Composer 2.5

In the official run, Cursor / Composer 2 fast solved 17/27, and Cursor / Composer 2 normal solved 18/27. In this supplemental run, Composer 2.5 fast solved 19/27, while Composer 2.5 normal solved 18/27.

| Condition | Pass | Median time |
| --- | --- | --- |
| Cursor / Composer 2 / fast | 17/27 | 3.6 min |
| Cursor / Composer 2 / normal | 18/27 | 5.3 min |
| Cursor / Composer 2.5 / fast | 19/27 | 7.5 min |
| Cursor / Composer 2.5 / normal | 18/27 | 8.1 min |

Composer 2.5 fast passed two more tasks than Composer 2 fast (19 vs. 17), but its median time roughly doubled (7.5 vs. 3.6 minutes). Composer 2.5 normal matched Composer 2 normal on pass count (18) and was also slower (8.1 vs. 5.3 minutes).

So my read is: Composer 2.5 fast improved over Composer 2 fast, but it does not replace the strongest Cursor GPT-5.5 conditions in this benchmark. Cursor / GPT-5.5 medium and high still look stronger in this 27-task run.

Interpretation

My conservative read is:

  • Cursor / Composer 2.5 fast reached 19/27, tying Codex / GPT-5.5 high
  • Composer 2.5 fast improved over Composer 2 fast by two tasks, while median time increased from 3.6 to 7.5 minutes
  • Cursor / Composer 2.5 normal reached 18/27, the same as Composer 2 normal
  • Antigravity / Gemini 3.5 Flash (High) reached 17/27, placing it in the lower group among the 17 conditions
  • The top remains Codex / GPT-5.5 xhigh at 22/27, followed by Cursor / GPT-5.5 medium/high and Cursor / Opus 4.7 max at 21/27
  • With only 27 tasks, small pass-rate differences should be treated as directional rather than definitive

In short, Composer 2.5 fast looks like an improvement over Composer 2 fast, but not a new top-tier condition on HarnessBench. Antigravity / Gemini 3.5 Flash (High) did not show a clear advantage in either pass count or runtime in this run.

Summary

  • I added Antigravity / Gemini 3.5 Flash (High), Cursor / Composer 2.5 fast, and Cursor / Composer 2.5 normal to HarnessBench
  • Across the 17 displayed conditions, the top remains Codex / GPT-5.5 / xhigh at 22/27
  • Cursor / Composer 2.5 fast reached 19/27, improving over Composer 2 fast but falling short of Cursor GPT-5.5 medium/high
  • Cursor / Composer 2.5 normal reached 18/27, matching Composer 2 normal
  • Antigravity / Gemini 3.5 Flash (High) reached 17/27, placing it in the lower group in this comparison
  • At 27 tasks, the broad groups are more meaningful than fine-grained rankings
