Evaluating Antigravity Gemini 3.5 Flash and Cursor Composer 2.5 on HarnessBench
by 逆瀬川ちゃん
Hi there! This is Sakasegawa-chan (@gyakuse)!
Today I want to look at a supplemental HarnessBench run covering Antigravity / Gemini 3.5 Flash (High) and Cursor / Composer 2.5 (fast and normal), compared against the existing Codex / Claude Code / Cursor conditions.
What I Evaluated
In the previous HarnessBench post, I compared Codex CLI, Claude Code, and Cursor Agent on the same 27 real-repository debugging tasks.
This time I added three supplemental conditions:
| Harness | Model | Effort / mode |
|---|---|---|
| Antigravity CLI | Gemini 3.5 Flash | high |
| Cursor Agent | Composer 2.5 | fast |
| Cursor Agent | Composer 2.5 | normal |
The task set is unchanged: 9 real OSS repositories, with low / mid / high tasks for each repository, for 27 tasks total. A run passes only when both the core and regression hidden tests pass.
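As a rough illustration of the task grid and the pass rule, here is a minimal sketch. The names `RunResult`, `REPOS`, and `TASKS` are hypothetical, the repository names are placeholders, and none of this is taken from the actual HarnessBench code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunResult:
    """Outcome of one run on one task (hypothetical structure)."""
    core_passed: bool
    regression_passed: bool

    @property
    def passed(self) -> bool:
        # A run counts as a pass only when BOTH hidden test suites pass.
        return self.core_passed and self.regression_passed


# Placeholder names: the real repositories are not listed in this post.
REPOS = [f"repo_{i}" for i in range(1, 10)]
DIFFICULTIES = ["low", "mid", "high"]

# 9 repositories x 3 difficulty levels = 27 tasks.
TASKS = list(product(REPOS, DIFFICULTIES))

print(len(TASKS))
print(RunResult(core_passed=True, regression_passed=False).passed)
```

The point of the `and` is that fixing the target bug while breaking a regression test still counts as a failure.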
The supplemental experiment ID is antigravity-cursor-composer-2.5-20260522T052522Z. I merged the results into the same charts and condition table on the HarnessBench result page.
Results
First, here are the three added conditions by themselves:
| Condition | Pass | Pass rate | Median time | Low | Mid | High | Timeout |
|---|---|---|---|---|---|---|---|
| Cursor / Composer 2.5 / fast | 19/27 | 70.4% | 7.5 min | 9/9 | 5/9 | 5/9 | 0 |
| Cursor / Composer 2.5 / normal | 18/27 | 66.7% | 8.1 min | 9/9 | 5/9 | 4/9 | 0 |
| Antigravity / Gemini 3.5 Flash / high | 17/27 | 63.0% | 14.3 min | 8/9 | 5/9 | 4/9 | 1 |
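The Pass and Pass rate columns follow directly from the per-difficulty counts. A small sketch that recomputes them from the table above, assuming 9 tasks per difficulty level (the `ROWS` dictionary and `summarize` helper are names I made up for this example):

```python
TASKS_PER_TIER = 9  # 9 repositories, one task per difficulty level each

# Per-difficulty pass counts copied from the table above.
ROWS = {
    "Cursor / Composer 2.5 / fast": {"low": 9, "mid": 5, "high": 5},
    "Cursor / Composer 2.5 / normal": {"low": 9, "mid": 5, "high": 4},
    "Antigravity / Gemini 3.5 Flash / high": {"low": 8, "mid": 5, "high": 4},
}


def summarize(tiers):
    """Return (passed, total, pass rate in percent rounded to one decimal)."""
    passed = sum(tiers.values())
    total = TASKS_PER_TIER * len(tiers)
    return passed, total, round(100 * passed / total, 1)


for name, tiers in ROWS.items():
    passed, total, rate = summarize(tiers)
    print(f"{name}: {passed}/{total} ({rate}%)")
```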
When placed next to the existing 14 conditions, the top of the ranking does not change. The strongest observed condition is still Codex / GPT-5.5 / xhigh at 22/27. Cursor / GPT-5.5 medium, Cursor / GPT-5.5 high, Codex / GPT-5.5 medium, and Cursor / Opus 4.7 max follow at 21/27.
Composer 2.5 fast is 19/27. It does not reach the top group, but it ties Codex / GPT-5.5 high and improves over Composer 2 fast, which was 17/27. Composer 2.5 normal is 18/27, the same pass count as Composer 2 normal.
Antigravity / Gemini 3.5 Flash (High) is 17/27. That ties Claude Code / Opus 4.7 max and Cursor / Composer 2 fast, putting it in the lower group among the 17 displayed conditions.
Position Against Existing Conditions
By pass count, the added conditions sit roughly here:
| Condition | Pass | Median time | Reading |
|---|---|---|---|
| Codex / GPT-5.5 / xhigh | 22/27 | 10.2 min | top observed condition |
| Cursor / GPT-5.5 / medium | 21/27 | 4.7 min | strong speed/accuracy balance |
| Cursor / GPT-5.5 / high | 21/27 | 6.2 min | top group |
| Cursor / Opus 4.7 / max | 21/27 | 19.7 min | high pass count, slow |
| Cursor / Composer 2.5 fast | 19/27 | 7.5 min | upper-middle, improved over Composer 2 fast |
| Codex / GPT-5.5 / high | 19/27 | 9.0 min | same pass count as Composer 2.5 fast |
| Cursor / Composer 2.5 normal | 18/27 | 8.1 min | same pass count as Composer 2 normal |
| Cursor / Composer 2 fast | 17/27 | 3.6 min | fast, but lower pass count |
| Antigravity / Gemini 3.5 Flash high | 17/27 | 14.3 min | lower pass count and relatively slow |
| Claude Code / Opus 4.7 max | 17/27 | 15.1 min | same pass count as Antigravity |
With only 27 tasks, I would not overread the difference between 19/27 and 21/27. Still, Composer 2.5 fast looks better than Composer 2 fast on this task set.
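To put a rough number on that caution, here is a minimal sketch using the Wilson score interval for a binomial proportion. This is an illustration I am adding here, not a description of how HarnessBench computes anything, and it treats the 27 tasks as independent trials, which is a simplification because every condition runs the same tasks. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low_19, high_19 = wilson_interval(19, 27)
low_21, high_21 = wilson_interval(21, 27)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = low_21 < high_19 and low_19 < high_21

print(f"19/27: {low_19:.2f} to {high_19:.2f}")
print(f"21/27: {low_21:.2f} to {high_21:.2f}")
print("overlap:", intervals_overlap)
```

Under these assumptions the two intervals come out at roughly 0.52 to 0.84 and 0.59 to 0.89, so they overlap heavily. That is the sense in which a two-task gap on 27 tasks is directional rather than conclusive.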
Cursor / GPT-5.5 medium remains a very strong point of comparison: 21/27 with a 4.7-minute median. Composer 2.5 fast is 19/27 with a 7.5-minute median, so in this run Cursor / GPT-5.5 medium is better on both pass count and runtime.
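"Better on both pass count and runtime" is a dominance relation, and it can be checked mechanically. The sketch below uses only the ten rows from the table above, so the result may differ once all 17 conditions are included. `CONDITIONS`, `dominates`, and `pareto_front` are names I picked for this example.

```python
# (pass count out of 27, median minutes), copied from the table above.
CONDITIONS = {
    "Codex / GPT-5.5 / xhigh": (22, 10.2),
    "Cursor / GPT-5.5 / medium": (21, 4.7),
    "Cursor / GPT-5.5 / high": (21, 6.2),
    "Cursor / Opus 4.7 / max": (21, 19.7),
    "Cursor / Composer 2.5 fast": (19, 7.5),
    "Codex / GPT-5.5 / high": (19, 9.0),
    "Cursor / Composer 2.5 normal": (18, 8.1),
    "Cursor / Composer 2 fast": (17, 3.6),
    "Antigravity / Gemini 3.5 Flash high": (17, 14.3),
    "Claude Code / Opus 4.7 max": (17, 15.1),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    a_pass, a_time = a
    b_pass, b_time = b
    no_worse = a_pass >= b_pass and a_time <= b_time
    strictly_better = a_pass > b_pass or a_time < b_time
    return no_worse and strictly_better


# Conditions that no other listed condition dominates.
pareto_front = [
    name
    for name, stats in CONDITIONS.items()
    if not any(dominates(other, stats) for other in CONDITIONS.values())
]

print(dominates(CONDITIONS["Cursor / GPT-5.5 / medium"],
                CONDITIONS["Cursor / Composer 2.5 fast"]))
print(pareto_front)
```

Among these ten rows, the only conditions that nothing else beats on both axes are Codex / GPT-5.5 / xhigh, Cursor / GPT-5.5 / medium, and Cursor / Composer 2 fast.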
How I Read Composer 2.5
In the official run, Cursor / Composer 2 fast solved 17/27, and Cursor / Composer 2 normal solved 18/27. In this supplemental run, Composer 2.5 fast solved 19/27, while Composer 2.5 normal solved 18/27.
| Condition | Pass | Median time |
|---|---|---|
| Cursor / Composer 2 fast | 17/27 | 3.6 min |
| Cursor / Composer 2 normal | 18/27 | 5.3 min |
| Cursor / Composer 2.5 fast | 19/27 | 7.5 min |
| Cursor / Composer 2.5 normal | 18/27 | 8.1 min |
Composer 2.5 fast passed two more tasks than Composer 2 fast, but it was also slower. Composer 2.5 normal matched Composer 2 normal on pass count and was slower.
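The same comparison as arithmetic, using the four rows from the table above. The `COMPOSER` dictionary and `delta` helper are hypothetical names for this sketch.

```python
# (pass count out of 27, median minutes), keyed by (version, mode).
COMPOSER = {
    ("2", "fast"): (17, 3.6),
    ("2", "normal"): (18, 5.3),
    ("2.5", "fast"): (19, 7.5),
    ("2.5", "normal"): (18, 8.1),
}


def delta(mode):
    """Change from Composer 2 to Composer 2.5 for one mode:
    (extra tasks passed, extra median minutes rounded to one decimal)."""
    old_pass, old_time = COMPOSER[("2", mode)]
    new_pass, new_time = COMPOSER[("2.5", mode)]
    return new_pass - old_pass, round(new_time - old_time, 1)


for mode in ("fast", "normal"):
    extra_pass, extra_min = delta(mode)
    print(f"{mode}: {extra_pass:+d} tasks, {extra_min:+.1f} min median")
```

So fast mode gained two tasks for about 3.9 extra minutes of median time, while normal mode gained none and took about 2.8 minutes longer.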
So my read is: Composer 2.5 fast improved over Composer 2 fast, but it does not replace the strongest Cursor GPT-5.5 conditions in this benchmark. Cursor / GPT-5.5 medium and high still look stronger in this 27-task run.
Interpretation
My conservative read is:
- Cursor / Composer 2.5 fast reached 19/27, tying Codex / GPT-5.5 high
- Composer 2.5 fast improved over Composer 2 fast by two tasks, while median time increased from 3.6 to 7.5 minutes
- Cursor / Composer 2.5 normal reached 18/27, the same as Composer 2 normal
- Antigravity / Gemini 3.5 Flash (High) reached 17/27, placing it in the lower group among the 17 conditions
- The top remains Codex / GPT-5.5 xhigh at 22/27, followed by Cursor / GPT-5.5 medium/high and Cursor / Opus max at 21/27
- With only 27 tasks, small success-rate differences should be treated as directional rather than definitive
In short, Composer 2.5 fast looks like an improvement over Composer 2 fast, but not a new top-tier condition on HarnessBench. Antigravity / Gemini 3.5 Flash (High) did not show a clear advantage in either pass count or runtime in this run.
Summary
- I added Antigravity / Gemini 3.5 Flash (High), Cursor / Composer 2.5 fast, and Cursor / Composer 2.5 normal to HarnessBench
- Across the 17 displayed conditions, the top remains Codex / GPT-5.5 / xhigh at 22/27
- Cursor / Composer 2.5 fast reached 19/27, improving over Composer 2 fast but falling short of Cursor GPT-5.5 medium/high
- Cursor / Composer 2.5 normal reached 18/27, matching Composer 2 normal
- Antigravity / Gemini 3.5 Flash (High) reached 17/27, placing it in the lower group in this comparison
- At 27 tasks, the broad groups are more meaningful than fine-grained rankings