Claude Sonnet 5 on Cogny Bench: Frontier Accuracy at a Sonnet Price

Anthropic shipped Claude Sonnet 5 today. About an hour after the API key resolved, we knew exactly where it ranked against Opus 4.8, Gemini 3.5 Flash, GPT-5.5 and the rest of the stack that powers Cogny's reports — on the reasoning that actually matters to us.

The short version: Sonnet 5 is the new top of our board at 93.6, level with Gemini 3.5 Flash (93.6), ahead of Opus 4.8 (92.9), and nearly five points above its own predecessor, Sonnet 4.6 (88.7), at roughly a fifth of Opus 4.8's cost per run. That's a generational jump, and it landed in the Sonnet tier, not the Opus tier.

The numbers

Cogny Bench is our internal eval: synthetic marketing-analytics problems with a planted ground truth and a deliberately withheld trap (a Simpson's paradox, an attribution leak, a tracking outage masquerading as a CPA spike) that a model has to infer from the data using a SQL tool. Each score is a weighted blend: 70% from deterministic field checks and 30% from a fixed Opus 4.8 judge.
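For readers who want the 70/30 blend spelled out, here is a minimal sketch. The function name, the validation, and the assumption that both components are reported on a 0-100 scale are ours for illustration; the actual grader code is not shown in this post.

```python
# Weights as stated in the post; everything else here is an illustrative assumption.
DETERMINISTIC_WEIGHT = 0.70  # share from deterministic field checks
JUDGE_WEIGHT = 0.30          # share from the fixed LLM judge


def blended_score(deterministic: float, judge: float) -> float:
    """Combine two component scores, each assumed to be on a 0-100 scale."""
    for name, value in (("deterministic", deterministic), ("judge", judge)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score must be within 0-100, got {value}")
    return DETERMINISTIC_WEIGHT * deterministic + JUDGE_WEIGHT * judge


# Hypothetical component scores, not taken from any real run.
print(round(blended_score(90, 80), 1))  # 87.0
```

Because the deterministic checks carry most of the weight, a swing in the judge's opinion moves the final score by at most 30 points.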

Cogny Bench — score vs cost, Claude Sonnet 5 vs the stack

Model	Avg score	$/run
Sonnet 5 — intro ($2/$10)	93.6	$0.098
Sonnet 5 — standard ($3/$15)	93.6	$0.147
Gemini 3.5 Flash	93.6	$0.153
Opus 4.7	93.2	$0.489
Opus 4.8	92.9	$0.652
GLM-5.2 (Berget)	90.0	$0.075
GPT-5.5	89.1	$0.222
Sonnet 4.6	88.7	$0.308
Haiku 4.5	73.8	$0.108

Two price points appear in the table because Anthropic launched Sonnet 5 at an introductory rate of $2 / 1M input tokens and $10 / 1M output tokens through August 31, 2026, after which it moves to the standard $3 / $15. The score is identical (same run, same tokens), so the only thing that changes is the cost per run. Even at the post-August standard price, Sonnet 5 lands at $0.147/run, still ~4.4× cheaper than Opus 4.8 ($0.652/run) for a higher score.
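The cost ratios can be checked directly from the $/run column. This is a small sketch using only figures from the table above; the dictionary and function names are hypothetical.

```python
# $/run figures copied from the table in this post.
COST_PER_RUN = {
    "Sonnet 5 (intro)": 0.098,
    "Sonnet 5 (standard)": 0.147,
    "Opus 4.8": 0.652,
}


def times_cheaper(model: str, baseline: str = "Opus 4.8") -> float:
    """Return how many times cheaper `model` is per run than `baseline`."""
    return COST_PER_RUN[baseline] / COST_PER_RUN[model]


for name in ("Sonnet 5 (intro)", "Sonnet 5 (standard)"):
    print(f"{name}: {times_cheaper(name):.1f}x cheaper than Opus 4.8")
# Sonnet 5 (intro): 6.7x cheaper than Opus 4.8
# Sonnet 5 (standard): 4.4x cheaper than Opus 4.8
```

Both input and output prices rise by the same factor of 1.5 when the intro rate ends, which is why the per-run cost moves from $0.098 to $0.147 for the same token counts.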

Where it cracked the traps

Per problem, Sonnet 5 was near-perfect on eight of nine:

Problem	Score	What it had to catch
01 price-increase-retention	96	the seasonal confound — use the legacy cohort as a control
02 channel-attribution	99	funnel cannibalization + an attribution leak
03 market-simpsons	99	Simpson's paradox across five markets
04 cpa-tracking-outage	99	a tracking outage faking a CPA spike
05 currency-mix	99	convert currencies before aggregating
06 event-splitting	99	de-dupe double-logged orders across event names
07 trend-early	100	a leading indicator with a ~4-week lag
T01 keyword-cannibalization	98	pause the right ad group, not the decoy
T02 chart-the-story	53	— see caveat below

Problem 01 is the one that historically separates the field: a naive before/after comparison over-attributes the churn, and the model has to find and use a grandfathered control cohort to land near the true 4.88pp lift. Sonnet 5 got it (96), the same class of result we previously saw only from Opus-tier models and the reasoning models. Excluding the broken T02 cell (below), Sonnet 5 averages 98.6.
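One common way to use a control cohort like this is a difference-in-differences comparison, sketched below. This is our illustration of the general technique, not the benchmark's reference solution, and every rate in it is a made-up number that does not come from Cogny Bench data.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Change in a rate between two periods, in percentage points."""
    return (after - before) * 100


def diff_in_diff(treated_before: float, treated_after: float,
                 control_before: float, control_after: float) -> float:
    """Treated-cohort change minus control-cohort change, in percentage points."""
    treated = percentage_point_change(treated_before, treated_after)
    control = percentage_point_change(control_before, control_after)
    return treated - control


# Hypothetical churn rates (illustrative only, not benchmark data).
# The control cohort kept the old price, so its change reflects seasonality alone.
naive = percentage_point_change(0.050, 0.120)
adjusted = diff_in_diff(0.050, 0.120, 0.050, 0.070)
print(f"naive: {naive:.1f}pp, adjusted: {adjusted:.1f}pp")
# naive: 7.0pp, adjusted: 5.0pp
```

In this toy case the naive comparison blames the price increase for two points of churn that the control cohort shows would have happened anyway.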

Honest caveats

T02 is a known-broken cell for every model, not a Sonnet 5 weakness. Its deterministic checks pass (100) but the LLM judge scores the submission near zero — a scoring artifact in that problem that drags every model's average down by the same ~5 points. We're citing it transparently rather than hiding it; the ranking is unaffected because it hits everyone equally.
These are single runs (n=1 per cell). Cogny Bench responses aren't deterministic (extended thinking forces sampling), so a few points of per-cell wobble are expected. The robust claim is the tier-level result (Sonnet 5 at the top of the board, a clear jump over Sonnet 4.6), not a to-the-decimal leaderboard.
The fleet numbers Sonnet 5 is compared against come from our June 25 sweep on the same nine problems and the same grader.
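The effect of the T02 cell on Sonnet 5's average can be reproduced from the per-problem table. The sketch below uses only the scores listed in this post; the helper name is hypothetical, and it checks Sonnet 5 only, since per-problem scores for the other models are not published here.

```python
# Per-problem Sonnet 5 scores copied from the table in this post.
SCORES = {
    "01 price-increase-retention": 96,
    "02 channel-attribution": 99,
    "03 market-simpsons": 99,
    "04 cpa-tracking-outage": 99,
    "05 currency-mix": 99,
    "06 event-splitting": 99,
    "07 trend-early": 100,
    "T01 keyword-cannibalization": 98,
    "T02 chart-the-story": 53,
}


def mean_score(scores: dict, exclude: frozenset = frozenset()) -> float:
    """Unweighted mean over all problems not named in `exclude`."""
    kept = [value for name, value in scores.items() if name not in exclude]
    return sum(kept) / len(kept)


overall = mean_score(SCORES)
without_t02 = mean_score(SCORES, exclude=frozenset({"T02 chart-the-story"}))
print(f"all nine: {overall:.1f}, excluding T02: {without_t02:.1f}")
# all nine: 93.6, excluding T02: 98.6
```

The gap between the two means is about five points, matching the drag described in the first caveat.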

Why this matters for Cogny

Most of what Cogny does — scheduled reports, ticket generation, the report builder — runs on the Sonnet tier for cost reasons, falling back to Opus only when a task demands it. Sonnet 5 collapses that trade-off: we now get Opus-beating reasoning at the Sonnet price, which means better catches on exactly the confounds and attribution traps our customers care about, without the Opus bill. That's the difference between "directionally right" and "caught the thing a human analyst would have missed."

We knew all of this an hour after launch, for under a dollar of API spend. That's the whole point of having your own benchmark.