Claude Fable 5 on Cogny Bench: When the Flagship Isn't the Value Pick
Two days ago Sonnet 5 went straight to the top of our board. This week we pointed Cogny Bench at Claude Fable 5, Anthropic's most capable tier, sitting above Opus and priced at $10 per 1M input tokens and $50 per 1M output tokens.
The result is a good reminder that "biggest model" and "best tool for the job" aren't the same question.
The numbers
| Model | Avg score | $/run |
|---|---|---|
| Sonnet 5 — intro ($2/$10) | 93.6 | $0.098 |
| Gemini 3.5 Flash | 93.6 | $0.153 |
| Opus 4.7 | 93.2 | $0.489 |
| Opus 4.8 | 92.9 | $0.652 |
| GLM-5.2 (Berget) | 90.0 | $0.075 |
| GPT-5.5 | 89.1 | $0.222 |
| Sonnet 4.6 | 88.7 | $0.308 |
| Fable 5 | 86.0 | $0.390 |
| Haiku 4.5 | 73.8 | $0.108 |
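Fable 5 is the third most expensive model on the board per run ($0.390, behind only the two Opus entries) and lands second from the bottom on score, ahead of only Haiku 4.5. Across the full set of marketing-analytics problems, Sonnet 5 beats it on average score and costs roughly a quarter as much ($0.098 vs. $0.390 per run).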
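The cost and rank claims can be checked straight from the table. This is a minimal sketch that uses only the scores and per-run costs listed above; the names `board`, `cost_ratio`, and `rank_by` are illustrative and not part of Cogny Bench.

```python
# (average score, cost per run in USD), copied from the table above.
board = {
    "Sonnet 5": (93.6, 0.098),
    "Gemini 3.5 Flash": (93.6, 0.153),
    "Opus 4.7": (93.2, 0.489),
    "Opus 4.8": (92.9, 0.652),
    "GLM-5.2 (Berget)": (90.0, 0.075),
    "GPT-5.5": (89.1, 0.222),
    "Sonnet 4.6": (88.7, 0.308),
    "Fable 5": (86.0, 0.390),
    "Haiku 4.5": (73.8, 0.108),
}


def cost_ratio(model_a: str, model_b: str) -> float:
    """Cost per run of model_a as a fraction of model_b's cost per run."""
    return board[model_a][1] / board[model_b][1]


def rank_by(column: int) -> list[str]:
    """Model names sorted from highest to lowest on the given column."""
    return sorted(board, key=lambda name: board[name][column], reverse=True)


ratio = cost_ratio("Sonnet 5", "Fable 5")
score_rank = rank_by(0)  # column 0: average score
cost_rank = rank_by(1)   # column 1: cost per run

print(f"Sonnet 5 costs {ratio:.0%} of Fable 5 per run")
print("Fable 5 score rank:", score_rank.index("Fable 5") + 1, "of", len(board))
print("Fable 5 cost rank:", cost_rank.index("Fable 5") + 1, "of", len(board))
```

On these figures Sonnet 5 comes out at about 25% of Fable 5's per-run cost, and Fable 5 ranks eighth of nine on score.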
That headline needs an honest asterisk, though — because it's really one problem doing the damage.
Where Fable 5 is frontier — and where it isn't
On the pure-reasoning problems, Fable 5 is genuinely frontier. Averaged across them it lands at 98.6 — matching Sonnet 5. Confounded comparisons, mix-shift paradoxes, attribution leaks, tracking outages: it reads through all of them cleanly. Its analysis is not the problem.
The damage is one problem, and it's a specific kind: a multi-step action task — not "figure out what's wrong" but "now go do the multi-part fix and report back." Fable 5 nails the diagnosis half — it correctly identifies the conflict and disables the right thing — and then just… stops. It never performs the second required action, and never files the closing summary. We re-ran that one problem four times; it scored 31, 6, 31, 31. Reproducible, not variance.
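To put numbers on "reproducible, not variance", the four reruns can be summarized in a few lines. This is a rough sketch that uses only the four scores quoted above and the 98.6 reasoning average reported in this post; the variable names are illustrative.

```python
from statistics import mean, median

# The four reruns of the action task, as reported above.
reruns = [31, 6, 31, 31]

# Fable 5's average on the pure-reasoning problems, as reported in this post.
reasoning_average = 98.6

summary = {
    "mean": mean(reruns),
    "median": median(reruns),
    "best": max(reruns),
    "worst": min(reruns),
}

# Distance between the best rerun and the reasoning average.
gap_from_best = round(reasoning_average - summary["best"], 1)

print(summary)
print("gap from best rerun to reasoning average:", gap_from_best)
```

Even the best of the four runs sits more than 67 points below the reasoning average, so no single lucky rerun would close the gap.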
And it isn't a harness quirk: Sonnet 5 scored in the high 90s on the identical task, on the identical setup — no special handling for either model. So it's apples-to-apples: Fable 5 specifically under-executes multi-step "do-the-thing" work while acing the "figure-out-the-thing" work. That single gap is the entire 98.6 → 86 drop.
Honest caveats
- One grader cell is a known artifact — its automated checks pass but the judge scores it near zero, so it pulls every model's average down by the same few points. It doesn't change the ranking.
- Single runs (n=1) on the reasoning cells, so expect a few points of wobble. The action-task result is the exception — we ran it four times precisely because it looked like an outlier, and it held.
- Fable 5's cost per run is computed from the rate we already use internally ($10 input / $50 output per MTok).
The takeaway: pick the right model for the job
The most capable model on the price sheet isn't automatically the most capable model for your job. Fable 5 is genuinely frontier at reasoning — but for a task that's mostly execution, a cheaper model that reliably finishes the job beats a pricier one that reasons beautifully and then stops halfway.
That's why Cogny doesn't hardcode one model. Our managed Cogny mix picks a model per job — chat, report generation, ideation, execution — and this benchmark is one of the inputs to that choice. The bench says: default the everyday reasoning work to Sonnet 5 (frontier accuracy, a quarter of Fable's cost), lean on Fable-class models for pure ideation where their reasoning shines, and reserve the heavyweight tiers for the narrow cases that actually earn them. Matching the model to the work — kept honest by the eval — is the whole point.
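Per-job model selection can be pictured as a small lookup table. The sketch below is hypothetical: the job types come from this post, but the model identifiers, the mapping itself (including sending execution work to Sonnet 5), and the `pick_model` helper are assumptions for illustration, not Cogny's actual configuration.

```python
# Hypothetical routing table. Job types are taken from the post; the model
# assigned to each job and the identifier strings are illustrative assumptions.
ROUTES = {
    "chat": "sonnet-5",
    "report_generation": "sonnet-5",
    "ideation": "fable-5",    # pure ideation, where reasoning quality matters most
    "execution": "sonnet-5",  # multi-step tasks that must be carried to completion
}

# Fallback for job types the table does not list (also an assumption).
DEFAULT_MODEL = "sonnet-5"


def pick_model(job_type: str) -> str:
    """Return the model routed to this job type, or the default if unlisted."""
    return ROUTES.get(job_type, DEFAULT_MODEL)


for job in ("chat", "ideation", "execution", "something_else"):
    print(job, "->", pick_model(job))
```

Keeping the mapping as plain data means a new benchmark result changes one table entry rather than any calling code.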