Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
It'd be interesting to see the distributions if the author actually plotted the data, so we could see whether their analysis holds water.
A ggplot2 geom_density plot of the input lengths, with color and fill mapped to model, alpha 0.1, and an appropriate bandwidth adjustment, would show whether the input distributions look similar across the two models. The same plot for the output lengths, faceted by input-length bins, would give us an idea of whether those match too.
Edit: Or even a faceted plot, by input bins, of output length / input length.
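For what it's worth, the per-model density comparison described above boils down to a kernel density estimate over each model's input lengths. A minimal Python analogue of the ggplot2 approach (the model names, sample data, and `bw_adjust` knob here are all made up for illustration):

```python
import numpy as np

def gaussian_kde(samples, grid, bw_adjust=1.0):
    """Gaussian KDE with a Scott's-rule bandwidth, scaled by bw_adjust
    (analogous to geom_density's `adjust` parameter)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bw = bw_adjust * samples.std(ddof=1) * n ** (-1 / 5)  # Scott's rule, 1-D
    # Sum of Gaussian kernels centered at each sample, evaluated on the grid.
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Hypothetical input-length samples for two models (log-normal-ish, like token counts).
lengths = {
    "model_a": rng.lognormal(mean=7.0, sigma=0.6, size=500),
    "model_b": rng.lognormal(mean=7.2, sigma=0.5, size=500),
}
grid = np.linspace(0.0, 6000.0, 1000)
densities = {m: gaussian_kde(x, grid, bw_adjust=0.8) for m, x in lengths.items()}
```

Plotting `densities["model_a"]` and `densities["model_b"]` over `grid` (with translucent fills) is the geom_density overlay; binning inputs and repeating per bin gives the faceted version.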
E.g., crack this puzzle, or fix this code so these tests pass. (A human can verify it doesn't cheese things.)
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.
There are multiple open-weight models you can run on a pretty standard computer at home that match the quality of GPT-4. I guess that would also change the equation.
If I had to collapse the nature of the difference into one sentence, it'd be that 5.5 does more of what I'm asking it to do, versus doing a small aspect of what I'm asking and then stopping.
5.4 required a lot of "continue" encouragement. 5.5 just "gets it" a bit more.
What it boils down to for me is that even though it's more expensive, I would much rather use 5.5 on low than 5.4/5.3 on high/medium.
This doesn't always mean that there is a bottleneck in terms of raw power; it may also mean that your use cases (or the lower-hanging fruit among them) are already covered.
So quickly - this industry has had trillions thrown around to get here so quickly, heh.
But, yes, capability seems somewhat stagnant. It's about iso-perf with cost improvements, or iso-cost with perf improvements, plus agentic gains.
I.e., they had 100 compute units and demand is 200 units, so they have to do some combination of buying more compute, raising prices, lowering limits, etc.
Please stop. Critical theory is easy. Something about “X” sucks. Got it. What is the alternative? It’s the completely unserious philosophy of the peanut gallery.
If that is true then they should all invest resources into projects that will yield efficient use of the compute. The most efficient producer then gains a huge cost advantage AND capacity to serve more… so yeah.. that logic doesn’t hold.
I would say models entered a bottleneck a long time ago. My personal opinion is that they are now overfitting newer models on coding and "agentic" capabilities, at great cost to general abilities in other domains.
Still amazing, but 5.5 does feel like incremental progress with a massive upcharge.
We'll probably see another stair step change followed by another plateauing curve of incremental improvements when that happens.
Some releases are just "meh", but I wouldn't rule out exciting new stuff for 2026 just because Opus 4.7 sucked.
Interestingly, using your tests as a comparison, 5.5 low beats 5.4 medium at 82% of the cost. [0]
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.
If the changes needed are drastic, it usually signals that there was something wrong or ambiguous in the spec (or that the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.
If it's in the middle, I'll usually apply the best and write a follow-on spec.
Btw, this also helps manage scale. E.g., you have 15 diffs to review: run a few verifiers to get a short list, then review directly and apply the best.
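The shortlisting step is simple to picture. A toy sketch, where the diff IDs and verifier scores are entirely hypothetical:

```python
# Hypothetical verifier scores for 15 candidate diffs (diff id -> score in [0, 1]).
scores = {
    f"diff_{i}": s
    for i, s in enumerate(
        [0.2, 0.9, 0.4, 0.85, 0.1, 0.6, 0.95, 0.3, 0.5, 0.7, 0.15, 0.8, 0.45, 0.65, 0.25]
    )
}

# Shortlist the top 3 for direct human review; the rest are discarded.
shortlist = sorted(scores, key=scores.get, reverse=True)[:3]
print(shortlist)  # ['diff_6', 'diff_1', 'diff_3']
```

In practice each "score" would itself come from running verifier agents against the spec, but the review budget math is just this top-k cut.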
- deepseek v4 pro
- glm 5.1
- kimi k2.6
- qwen 3.6 max
- xiaomi 2.5 pro
- minimax 2.7
- grok
So far we have been native harnessmaxxing, which simplifies things a lot.
The configuration space around open models is much larger: e.g., which models, capability heterogeneity, which harness, networking, data egress/privacy, etc.
If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.
(Which is why my prior is that third party harnesses would not perform as well. But I haven't actually measured this.)
gpt-5-4-high > gpt-5-4-xhigh
gpt-5-4-high > gpt-5-5-high
gpt-5-4 > gpt-5-5
gpt-5-2-high > gpt-5-2-xhigh
No other ratings I've seen show that.
We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?
Most evals are static/synthetic and, for code, generally stop at tests. Test evals are weak proxies for quality, since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability, etc. in tests. [1]
Almost every agent in a given run can pass tests at this point, but there is large separation during review.
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.