    Berner Setterwall · June 30, 2026 · 4 min read

    Claude Sonnet 5 on Cogny Bench: Frontier Accuracy at a Sonnet Price


    Anthropic shipped Claude Sonnet 5 today. About an hour after the API key resolved, we knew exactly where it ranked against Opus 4.8, Gemini 3.5 Flash, GPT-5.5 and the rest of the stack that powers Cogny's reports — on the reasoning that actually matters to us.

    The short version: Sonnet 5 lands at the top of our board at 93.6, level with Gemini 3.5 Flash, edging out Opus 4.8 (92.9) and beating its own predecessor, Sonnet 4.6 (88.7), by nearly five points, at roughly a fifth of Opus 4.8's cost per run. That's a generational jump in the Sonnet tier, not the Opus tier.

    The numbers

    Cogny Bench is our internal eval: synthetic marketing-analytics problems with a planted ground truth and a deliberately withheld trap (a Simpson's paradox, an attribution leak, a tracking outage masquerading as a CPA spike) that a model has to infer from the data using a SQL tool. Each score is 70% deterministic field checks plus 30% a fixed Opus 4.8 judge.
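As a rough illustration of the 70/30 blend described above, here is a minimal sketch. It assumes both components are on a 0-100 scale and that the weights are applied per problem, which the post does not spell out. The function name `composite_score` and the example inputs are hypothetical, not Cogny Bench's actual grader.

```python
def composite_score(deterministic: float, judge: float,
                    w_det: float = 0.7, w_judge: float = 0.3) -> float:
    """Blend a deterministic-check score and a judge score (both assumed 0-100)."""
    if abs(w_det + w_judge - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_det * deterministic + w_judge * judge


# Hypothetical inputs: every field check passes, the judge awards 90/100.
example = composite_score(deterministic=100.0, judge=90.0)
print(round(example, 1))
```

Keeping the deterministic share large means most of a score is reproducible from field checks alone, with the judge contributing the smaller, noisier part.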

    Cogny Bench — score vs cost, Claude Sonnet 5 vs the stack
    Model                         | Avg score | $/run
    ------------------------------|-----------|-------
    Sonnet 5 (intro, $2/$10)      | 93.6      | $0.098
    Sonnet 5 (standard, $3/$15)   | 93.6      | $0.147
    Gemini 3.5 Flash              | 93.6      | $0.153
    Opus 4.7                      | 93.2      | $0.489
    Opus 4.8                      | 92.9      | $0.652
    GLM-5.2 (Berget)              | 90.0      | $0.075
    GPT-5.5                       | 89.1      | $0.222
    Sonnet 4.6                    | 88.7      | $0.308
    Haiku 4.5                     | 73.8      | $0.108

    Two price points, because Anthropic launched Sonnet 5 at an introductory rate of $2 / 1M input and $10 / 1M output through August 31, 2026, after which it moves to the standard $3 / $15. The score is identical — same run, same tokens — so the only thing that changes is the x-axis. Even at the post-August standard price, Sonnet 5 lands at $0.147/run, still ~4.4× cheaper than Opus 4.8 for a higher score.
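The pricing arithmetic above can be sketched as follows. The per-million-token rates and the per-run dollar figures come from this post. The token counts are placeholders chosen for illustration, not measured values, and `cost_per_run` is a hypothetical helper.

```python
def cost_per_run(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one run given token counts and per-1M-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Placeholder token counts (assumption, not from the benchmark logs).
tokens_in, tokens_out = 30_000, 4_000

intro = cost_per_run(tokens_in, tokens_out, 2.0, 10.0)
standard = cost_per_run(tokens_in, tokens_out, 3.0, 15.0)

# Both rates scale by 2/3, so the run cost scales by 2/3 for any token mix.
ratio = intro / standard

# Ratios computed from the per-run figures reported in the table.
reported_intro_vs_standard = 0.098 / 0.147
opus_vs_standard = 0.652 / 0.147

print(round(ratio, 3), round(reported_intro_vs_standard, 3), round(opus_vs_standard, 1))
```

Because input and output prices drop by the same factor, the intro-to-standard ratio does not depend on how a run splits between input and output tokens, which is why the reported $0.098 and $0.147 sit in the same 2:3 proportion.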

    Where it cracked the traps

    Per problem, Sonnet 5 was near-perfect on eight of nine:

    Problem                      | Score | What it had to catch
    -----------------------------|-------|---------------------------------------------------------
    01 price-increase-retention  | 96    | the seasonal confound; use the legacy cohort as a control
    02 channel-attribution       | 99    | funnel cannibalization + an attribution leak
    03 market-simpsons           | 99    | Simpson's paradox across five markets
    04 cpa-tracking-outage       | 99    | a tracking outage faking a CPA spike
    05 currency-mix              | 99    | convert currencies before aggregating
    06 event-splitting           | 99    | de-dupe double-logged orders across event names
    07 trend-early               | 100   | a leading indicator with a ~4-week lag
    T01 keyword-cannibalization  | 98    | pause the right ad group, not the decoy
    T02 chart-the-story          | 53    | see caveat below
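The headline average can be checked directly from the per-problem scores in the table above. This is a small sketch using only those nine numbers; the variable names are ours.

```python
# Per-problem Sonnet 5 scores, copied from the table above.
scores = {
    "01 price-increase-retention": 96,
    "02 channel-attribution": 99,
    "03 market-simpsons": 99,
    "04 cpa-tracking-outage": 99,
    "05 currency-mix": 99,
    "06 event-splitting": 99,
    "07 trend-early": 100,
    "T01 keyword-cannibalization": 98,
    "T02 chart-the-story": 53,
}

mean_all = sum(scores.values()) / len(scores)

# Same average with the T02 cell left out.
kept = [v for name, v in scores.items() if not name.startswith("T02")]
mean_without_t02 = sum(kept) / len(kept)

print(round(mean_all, 1), round(mean_without_t02, 1))  # 93.6 98.6
```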

    Problem 01 is the one that historically separates the field — a naive before/after over-attributes the churn, and the model has to find and use a grandfathered control cohort to land near the true 4.88pp lift. Sonnet 5 got it (96), the same class of result we previously only saw from Opus-tier and the reasoning models. Excluding the broken T02 cell (below), Sonnet 5 averages 98.6.
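One common way to use a control cohort like this is a difference-in-differences estimate. The sketch below is an illustration of that idea, not the benchmark's reference solution, and the churn rates are made-up numbers rather than Cogny Bench data.

```python
def naive_lift(before: float, after: float) -> float:
    """Before/after change in the treated cohort, in percentage points."""
    return after - before


def did_lift(treated_before: float, treated_after: float,
             control_before: float, control_after: float) -> float:
    """Difference-in-differences: treated change minus control change."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical churn rates in percent (illustration only).
treated_before, treated_after = 4.0, 11.0   # cohort that got the price increase
control_before, control_after = 4.0, 6.0    # grandfathered cohort, same season

naive = naive_lift(treated_before, treated_after)
adjusted = did_lift(treated_before, treated_after, control_before, control_after)

print(naive, adjusted)  # 7.0 5.0
```

In this toy case the control cohort's churn also rises over the same period, so subtracting that seasonal movement shrinks the estimated effect of the price change. That is the over-attribution a naive before/after comparison makes.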

    Honest caveats

    • T02 is a known-broken cell for every model, not a Sonnet 5 weakness. Its deterministic checks pass (100) but the LLM judge scores the submission near zero — a scoring artifact in that problem that drags every model's average down by the same ~5 points. We're citing it transparently rather than hiding it; the ranking is unaffected because it hits everyone equally.
    • These are single runs (n=1 per cell). Cogny Bench responses aren't deterministic (extended thinking forces sampling), so a few points of per-cell wobble is expected. The tier-level result — Sonnet 5 at the top of the board, a clear jump over Sonnet 4.6 — is the robust claim, not a to-the-decimal leaderboard.
    • The fleet numbers Sonnet 5 is compared against come from our June 25 sweep on the same nine problems with the same grader.
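If a cell were rerun several times instead of once, the wobble mentioned in the caveats could be quantified along these lines. This is a sketch under that assumption; the five run scores are invented for illustration and `summarize` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    m = mean(runs)
    s = stdev(runs)            # sample standard deviation (n - 1 denominator)
    se = s / sqrt(len(runs))   # shrinks as more runs are added
    return m, s, se


# Invented scores for one problem across five repeated runs.
hypothetical_runs = [96.0, 99.0, 93.0, 97.0, 95.0]
m, s, se = summarize(hypothetical_runs)

print(m, round(s, 2), round(se, 2))
```

With n=1 per cell there is no spread to compute, which is why the tier-level ranking is the claim to lean on rather than sub-point differences between models.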

    Why this matters for Cogny

    Most of what Cogny does — scheduled reports, ticket generation, the report builder — runs on the Sonnet tier for cost reasons, falling back to Opus only when a task demands it. Sonnet 5 collapses that trade-off: we now get Opus-beating reasoning at the Sonnet price, which means better catches on exactly the confounds and attribution traps our customers care about, without the Opus bill. That's the difference between "directionally right" and "caught the thing a human analyst would have missed."

    We knew all of this an hour after launch, for under a dollar of API spend. That's the whole point of having your own benchmark.