Claude Fable 5 on Cogny Bench: When the Flagship Isn't the Value Pick

Two days ago Sonnet 5 went straight to the top of our board. This week we pointed Cogny Bench at Claude Fable 5 — Anthropic's most capable tier, sitting above Opus, priced at $10 / 1M input and $50 / 1M output.

The result is a good reminder that "biggest model" and "best tool for the job" aren't the same question.

The numbers

Cogny Bench — score vs cost, Claude Fable 5 vs the stack

Model	Avg score	$/run
Sonnet 5 — intro ($2/$10)	93.6	$0.098
Gemini 3.5 Flash	93.6	$0.153
Opus 4.7	93.2	$0.489
Opus 4.8	92.9	$0.652
GLM-5.2 (Berget)	90.0	$0.075
GPT-5.5	89.1	$0.222
Sonnet 4.6	88.7	$0.308
Fable 5	86.0	$0.390
Haiku 4.5	73.8	$0.108

Fable 5 is one of the most expensive models on the board ($0.390 per run; only Opus 4.7 and Opus 4.8 cost more) and lands near the bottom on score, ahead of only Haiku 4.5. On our marketing-analytics reasoning problems, Sonnet 5 beats it on accuracy and costs roughly a quarter as much.
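To make the score-vs-cost comparison concrete, here is a minimal Python sketch that uses only the scores and per-run costs from the table above. The helper names (`cost_ratio`, `dominated`, `pareto_frontier`) are ours for illustration and are not part of Cogny Bench. A model counts as "dominated" when some other model scores at least as well and costs no more, with at least one of the two strictly better.

```python
# Scores and per-run costs copied from the table above.
# "Sonnet 5" is the intro-priced ($2/$10) row.
BOARD = {
    "Sonnet 5": (93.6, 0.098),
    "Gemini 3.5 Flash": (93.6, 0.153),
    "Opus 4.7": (93.2, 0.489),
    "Opus 4.8": (92.9, 0.652),
    "GLM-5.2 (Berget)": (90.0, 0.075),
    "GPT-5.5": (89.1, 0.222),
    "Sonnet 4.6": (88.7, 0.308),
    "Fable 5": (86.0, 0.390),
    "Haiku 4.5": (73.8, 0.108),
}


def cost_ratio(a, b):
    """Per-run cost of model a as a fraction of model b's."""
    return BOARD[a][1] / BOARD[b][1]


def dominated(name):
    """True if another model is at least as good on both axes and better on one."""
    score, cost = BOARD[name]
    return any(
        other != name and s >= score and c <= cost and (s > score or c < cost)
        for other, (s, c) in BOARD.items()
    )


def pareto_frontier():
    """Models nobody dominates, cheapest first."""
    survivors = [name for name in BOARD if not dominated(name)]
    return sorted(survivors, key=lambda name: BOARD[name][1])


if __name__ == "__main__":
    ratio = cost_ratio("Sonnet 5", "Fable 5")
    print(f"Sonnet 5 cost per run as a share of Fable 5's: {ratio:.2f}")
    print("Not dominated on score vs cost:", pareto_frontier())
```

On these numbers the ratio comes out at about 0.25, and only GLM-5.2 (Berget) and Sonnet 5 survive the dominance check. Every other model, Fable 5 included, is matched or beaten on score by something cheaper.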

That headline needs an honest asterisk, though — because it's really one problem doing the damage.

Where Fable 5 is frontier — and where it isn't

On the pure-reasoning problems, Fable 5 is genuinely frontier. Averaged across them it lands at 98.6 — matching Sonnet 5. Confounded comparisons, mix-shift paradoxes, attribution leaks, tracking outages: it reads through all of them cleanly. Its analysis is not the problem.

The damage is one problem, and it's a specific kind: a multi-step action task — not "figure out what's wrong" but "now go do the multi-part fix and report back." Fable 5 nails the diagnosis half — it correctly identifies the conflict and disables the right thing — and then just… stops. It never performs the second required action, and never files the closing summary. We re-ran that one problem four times; it scored 31, 6, 31, 31. Reproducible, not variance.
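A quick summary of those four re-runs shows why we read this as a consistent failure rather than noise. This is a small sketch using only the four scores reported above; the `summarize` helper is a name we made up for illustration.

```python
from statistics import mean, median

# Action-task scores from the four re-runs reported above.
RERUNS = [31, 6, 31, 31]


def summarize(scores):
    """Basic location and spread figures for a handful of repeat runs."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "best": max(scores),
        "spread": max(scores) - min(scores),
    }


if __name__ == "__main__":
    for key, value in summarize(RERUNS).items():
        print(f"{key}: {value}")
```

Three of the four runs land on the same score, and the best run is still 31, so no amount of re-rolling gets this task anywhere near a passing result.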

And it isn't a harness quirk: Sonnet 5 scored in the high 90s on the identical task, on the identical setup, with no special handling for either model. So it's apples-to-apples: Fable 5 specifically under-executes multi-step "do-the-thing" work while acing the "figure-out-the-thing" work. The two models tie at 98.6 on reasoning, so that single gap is what separates Fable 5's 86.0 from Sonnet 5's 93.6 on the board.

Honest caveats

One grader cell is a known artifact — its automated checks pass but the judge scores it near zero, so it pulls every model's average down by the same few points. It doesn't change the ranking.
Single runs (n=1) on the reasoning cells, so expect a few points of wobble. The action-task result is the exception — we ran it four times precisely because it looked like an outlier, and it held.
Fable 5's cost per run is computed at the rate we already use internally ($10/$50 per MTok).

The takeaway: pick the right model for the job

The most capable model on the price sheet isn't automatically the most capable model for your job. Fable 5 is genuinely frontier at reasoning — but for a task that's mostly execution, a cheaper model that reliably finishes the job beats a pricier one that reasons beautifully and then stops halfway.

That's why Cogny doesn't hardcode one model. Our managed Cogny mix picks a model per job — chat, report generation, ideation, execution — and this benchmark is one of the inputs to that choice. The bench says: default the everyday reasoning work to Sonnet 5 (frontier accuracy, a quarter of Fable's cost), lean on Fable-class models for pure ideation where their reasoning shines, and reserve the heavyweight tiers for the narrow cases that actually earn them. Matching the model to the work — kept honest by the eval — is the whole point.
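As an illustration of "matching the model to the work", here is one possible selection rule in Python: take the cheapest model whose score is within a tolerance of the best available, after excluding models known to be unsuitable for the job. This is a hypothetical sketch, not Cogny's actual routing logic. `pick_model`, `tolerance`, and `exclude` are names we invented, and the board is a four-row subset of the table above.

```python
# (avg score, $/run) for a subset of the table above.
BOARD = {
    "Sonnet 5": (93.6, 0.098),
    "Opus 4.7": (93.2, 0.489),
    "GLM-5.2 (Berget)": (90.0, 0.075),
    "Fable 5": (86.0, 0.390),
}


def pick_model(board, tolerance=1.0, exclude=()):
    """Cheapest model scoring within `tolerance` points of the best candidate.

    `exclude` is a hypothetical stand-in for per-job knowledge, e.g. leaving
    Fable 5 out of execution jobs because it stalls on multi-step tasks.
    """
    candidates = {name: v for name, v in board.items() if name not in exclude}
    if not candidates:
        raise ValueError("no candidate models left after exclusions")
    best = max(score for score, _ in candidates.values())
    eligible = [
        name for name, (score, _) in candidates.items() if best - score <= tolerance
    ]
    return min(eligible, key=lambda name: candidates[name][1])


if __name__ == "__main__":
    print(pick_model(BOARD))                       # tight tolerance
    print(pick_model(BOARD, tolerance=4.0))        # accept a few points less
    print(pick_model(BOARD, exclude={"Fable 5"}))  # e.g. an execution job
```

With a one-point tolerance the rule picks Sonnet 5. Loosening it to four points lets the cheaper GLM-5.2 (Berget) through, which is the kind of trade-off a per-job policy would have to set deliberately.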