Claude Sonnet 5 on Cogny Bench: Frontier Accuracy at a Sonnet Price
Anthropic shipped Claude Sonnet 5 today. About an hour after the API key resolved, we knew exactly where it ranked against Opus 4.8, Gemini 3.5 Flash, GPT-5.5 and the rest of the stack that powers Cogny's reports — on the reasoning that actually matters to us.
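The short version: Sonnet 5 is the new top of our board at 93.6, level with Gemini 3.5 Flash, edging out Opus 4.7 (93.2) and Opus 4.8 (92.9), and beating its own predecessor, Sonnet 4.6 (88.7), by nearly five points. It does so at 15% to 23% of Opus 4.8's per-run cost, depending on whether the introductory or standard price applies. That's a generational jump in the Sonnet tier, not the Opus tier.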
The numbers
Cogny Bench is our internal eval: synthetic marketing-analytics problems with a planted ground truth and a deliberately withheld trap (a Simpson's paradox, an attribution leak, a tracking outage masquerading as a CPA spike) that a model has to infer from the data using a SQL tool. Each problem's score is 70% deterministic field checks plus 30% a fixed Opus 4.8 judge.
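The 70/30 split can be read as a simple weighted sum. The sketch below is only an illustration: it assumes both components are reported on a 0-100 scale and combined linearly, and the function name and example inputs are hypothetical rather than taken from the actual grader.

```python
# Weights as stated in the post: 70% deterministic checks, 30% LLM judge.
DETERMINISTIC_WEIGHT = 0.7
JUDGE_WEIGHT = 0.3


def composite_score(deterministic: float, judge: float) -> float:
    """Combine two 0-100 component scores into one 0-100 problem score.

    Assumption: a plain linear blend. The real grader may differ.
    """
    for name, value in (("deterministic", deterministic), ("judge", judge)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score must be in [0, 100], got {value}")
    return DETERMINISTIC_WEIGHT * deterministic + JUDGE_WEIGHT * judge


# Hypothetical inputs: every field check passes, the judge awards 90.
example = composite_score(deterministic=100, judge=90)
print(f"composite: {example:.1f}")
```

Because the deterministic part carries most of the weight, a model that gets every field right but explains itself poorly still keeps most of its score under this reading.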
| Model | Avg score | $/run |
|---|---|---|
| Sonnet 5 — intro ($2/$10) | 93.6 | $0.098 |
| Sonnet 5 — standard ($3/$15) | 93.6 | $0.147 |
| Gemini 3.5 Flash | 93.6 | $0.153 |
| Opus 4.7 | 93.2 | $0.489 |
| Opus 4.8 | 92.9 | $0.652 |
| GLM-5.2 (Berget) | 90.0 | $0.075 |
| GPT-5.5 | 89.1 | $0.222 |
| Sonnet 4.6 | 88.7 | $0.308 |
| Haiku 4.5 | 73.8 | $0.108 |
Sonnet 5 appears at two price points because Anthropic launched it at an introductory rate of $2 / 1M input and $10 / 1M output tokens through August 31, 2026, after which it moves to the standard $3 / $15. The score is identical (same run, same tokens), so the only thing that changes is the cost per run. Even at the post-August standard price, Sonnet 5 lands at $0.147/run, still ~4.4× cheaper than Opus 4.8 for a higher score.
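The arithmetic behind the two rows is easy to check. In the sketch below the per-million-token prices and the per-run dollar figures come from this post; the token counts are hypothetical placeholders, since the real per-run token usage isn't published here. Both prices rise by the same factor, so the per-run cost scales by 1.5 whatever the input/output mix.

```python
# (input $/1M tokens, output $/1M tokens), as quoted in the post.
INTRO_PRICES = (2.0, 10.0)
STANDARD_PRICES = (3.0, 15.0)


def run_cost(input_tokens: int, output_tokens: int, prices: tuple) -> float:
    """Dollar cost of one run given token counts and per-1M-token prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical token counts, for illustration only.
tokens_in, tokens_out = 20_000, 5_000
intro = run_cost(tokens_in, tokens_out, INTRO_PRICES)
standard = run_cost(tokens_in, tokens_out, STANDARD_PRICES)
scale = standard / intro  # 1.5 for any token mix

# Cross-check against the per-run figures in the table.
standard_from_table = 0.098 * scale          # intro $/run scaled up
opus_ratio = 0.652 / 0.147                   # Opus 4.8 $/run vs Sonnet 5 standard

print(f"scale factor: {scale:.2f}")
print(f"standard $/run from intro: {standard_from_table:.3f}")
print(f"Opus 4.8 / Sonnet 5 standard: {opus_ratio:.1f}x")
```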
Where it cracked the traps
Per problem, Sonnet 5 was near-perfect on eight of nine:
| Problem | Score | What it had to catch |
|---|---|---|
| 01 price-increase-retention | 96 | the seasonal confound — use the legacy cohort as a control |
| 02 channel-attribution | 99 | funnel cannibalization + an attribution leak |
| 03 market-simpsons | 99 | Simpson's paradox across five markets |
| 04 cpa-tracking-outage | 99 | a tracking outage faking a CPA spike |
| 05 currency-mix | 99 | convert currencies before aggregating |
| 06 event-splitting | 99 | de-dupe double-logged orders across event names |
| 07 trend-early | 100 | a leading indicator with a ~4-week lag |
| T01 keyword-cannibalization | 98 | pause the right ad group, not the decoy |
| T02 chart-the-story | 53 | — see caveat below |
Problem 01 is the one that historically separates the field — a naive before/after over-attributes the churn, and the model has to find and use a grandfathered control cohort to land near the true 4.88pp lift. Sonnet 5 got it (96), the same class of result we previously only saw from Opus-tier and the reasoning models. Excluding the broken T02 cell (below), Sonnet 5 averages 98.6.
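The control-cohort idea in Problem 01 is a difference-in-differences comparison. Here is a minimal sketch with made-up churn rates; these numbers are not the benchmark's data and are chosen only to show why the naive before/after figure overstates the effect.

```python
def pp_change(before_pct: float, after_pct: float) -> float:
    """Change between two rates, in percentage points."""
    return after_pct - before_pct


# Hypothetical monthly churn rates (%), before and after the price increase.
# "treated" customers got the new price; "control" is a grandfathered
# legacy cohort that kept the old price over the same calendar period.
treated_before, treated_after = 6.0, 13.5
control_before, control_after = 6.0, 8.5

# Naive before/after: blames the price change for everything that moved.
naive_effect = pp_change(treated_before, treated_after)

# The control cohort moved too, so that part is seasonal, not price-driven.
seasonal_drift = pp_change(control_before, control_after)

# Difference-in-differences: subtract the drift both cohorts share.
did_effect = naive_effect - seasonal_drift

print(f"naive before/after:        {naive_effect:.2f}pp")
print(f"seasonal drift (control):  {seasonal_drift:.2f}pp")
print(f"difference-in-differences: {did_effect:.2f}pp")
```

The estimate is only as good as the assumption that both cohorts would have drifted the same way without the price change, which is why finding a suitable control in the data is the hard part of the problem.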
Honest caveats
- T02 is a known-broken cell for every model, not a Sonnet 5 weakness. Its deterministic checks pass (100) but the LLM judge scores the submission near zero — a scoring artifact in that problem that drags every model's average down by the same ~5 points. We're citing it transparently rather than hiding it; the ranking is unaffected because it hits everyone equally.
- These are single runs (n=1 per cell). Cogny Bench responses aren't deterministic (extended thinking forces sampling), so a few points of per-cell wobble is expected. The tier-level result — Sonnet 5 at the top of the board, a clear jump over Sonnet 4.6 — is the robust claim, not a to-the-decimal leaderboard.
- The fleet numbers it's compared against are from our June 25 sweep on the same nine problems and the same grader.
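The effect of the T02 cell on Sonnet 5's headline number can be reproduced from the per-problem table. This sketch uses only the scores listed in that table; the short dictionary keys are our own labels.

```python
from statistics import mean

# Sonnet 5 per-problem scores, copied from the table in this post.
scores = {
    "01": 96, "02": 99, "03": 99, "04": 99, "05": 99,
    "06": 99, "07": 100, "T01": 98, "T02": 53,
}

with_t02 = mean(scores.values())
without_t02 = mean(v for k, v in scores.items() if k != "T02")
t02_penalty = without_t02 - with_t02

print(f"all nine problems: {with_t02:.1f}")
print(f"excluding T02:     {without_t02:.1f}")
print(f"T02 penalty:       {t02_penalty:.1f} points")
```

With only nine cells, a single outlier moves the average by about five points, which is also why single-run wobble on any one cell matters less than the tier-level ordering.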
Why this matters for Cogny
Most of what Cogny does — scheduled reports, ticket generation, the report builder — runs on the Sonnet tier for cost reasons, falling back to Opus only when a task demands it. Sonnet 5 collapses that trade-off: we now get Opus-beating reasoning at the Sonnet price, which means better catches on exactly the confounds and attribution traps our customers care about, without the Opus bill. That's the difference between "directionally right" and "caught the thing a human analyst would have missed."
We knew all of this an hour after launch, for under a dollar of API spend. That's the whole point of having your own benchmark.