    Berner Setterwall · July 2, 2026 · 3 min read

    Claude Fable 5 on Cogny Bench: When the Flagship Isn't the Value Pick

    Two days ago Sonnet 5 went straight to the top of our board. This week we pointed Cogny Bench at Claude Fable 5 — Anthropic's most capable tier, sitting above Opus, priced at $10 / 1M input and $50 / 1M output.

    The result is a good reminder that "biggest model" and "best tool for the job" aren't the same question.

    The numbers

    Cogny Bench — score vs cost, Claude Fable 5 vs the stack
    Model                       Avg score   $/run
    Sonnet 5 (intro, $2/$10)    93.6        $0.098
    Gemini 3.5 Flash            93.6        $0.153
    Opus 4.7                    93.2        $0.489
    Opus 4.8                    92.9        $0.652
    GLM-5.2 (Berget)            90.0        $0.075
    GPT-5.5                     89.1        $0.222
    Sonnet 4.6                  88.7        $0.308
    Fable 5                     86.0        $0.390
    Haiku 4.5                   73.8        $0.108

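    Fable 5 carries the flagship price tag, yet it lands second from bottom on score. Per run it is the third most expensive model on the board at $0.390, behind only the two Opus models. On our marketing-analytics reasoning problems, Sonnet 5 beats it by 7.6 points and costs roughly a quarter as much per run ($0.098 vs $0.390).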
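    For anyone who wants to check the cost and score comparison against the table, here is a minimal sketch. The figures are copied from the table above; the `BOARD` dictionary and the `rank` helper are names made up for this illustration, not part of Cogny Bench.

```python
# Figures copied from the table above: model -> (avg score, $ per run).
BOARD = {
    "Sonnet 5": (93.6, 0.098),
    "Gemini 3.5 Flash": (93.6, 0.153),
    "Opus 4.7": (93.2, 0.489),
    "Opus 4.8": (92.9, 0.652),
    "GLM-5.2 (Berget)": (90.0, 0.075),
    "GPT-5.5": (89.1, 0.222),
    "Sonnet 4.6": (88.7, 0.308),
    "Fable 5": (86.0, 0.390),
    "Haiku 4.5": (73.8, 0.108),
}

SCORE, COST = 0, 1  # column indices into each tuple


def rank(model, column, board=BOARD):
    """1-based rank of `model` when `column` is sorted from highest to lowest."""
    ordered = sorted(board, key=lambda m: board[m][column], reverse=True)
    return ordered.index(model) + 1


cost_ratio = BOARD["Sonnet 5"][COST] / BOARD["Fable 5"][COST]
score_gap = BOARD["Sonnet 5"][SCORE] - BOARD["Fable 5"][SCORE]

print(f"Sonnet 5 costs {cost_ratio:.0%} of Fable 5 per run")
print(f"Sonnet 5 leads Fable 5 by {score_gap:.1f} points")
print(f"Fable 5 rank by score: {rank('Fable 5', SCORE)} of {len(BOARD)}")
print(f"Fable 5 rank by $/run: {rank('Fable 5', COST)} of {len(BOARD)}")
```

    Ranking by $/run and by score separately keeps the two questions apart: what a run costs, and what that run gets right.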

    That headline needs an honest asterisk, though — because it's really one problem doing the damage.

    Where Fable 5 is frontier — and where it isn't

    On the pure-reasoning problems, Fable 5 is genuinely frontier. Averaged across them it lands at 98.6 — matching Sonnet 5. Confounded comparisons, mix-shift paradoxes, attribution leaks, tracking outages: it reads through all of them cleanly. Its analysis is not the problem.

    The damage is one problem, and it's a specific kind: a multi-step action task — not "figure out what's wrong" but "now go do the multi-part fix and report back." Fable 5 nails the diagnosis half — it correctly identifies the conflict and disables the right thing — and then just… stops. It never performs the second required action, and never files the closing summary. We re-ran that one problem four times; it scored 31, 6, 31, 31. Reproducible, not variance.
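    A quick summary of those four re-runs, as a rough sketch. The scores are the ones quoted above; setting them against the 98.6 reasoning average is only an informal yardstick, since that average comes from different problems.

```python
from statistics import mean, median

# Scores from the four re-runs of the multi-step action task (quoted above).
reruns = [31, 6, 31, 31]

# Fable 5's average on the pure-reasoning problems, used only as a yardstick.
REASONING_AVG = 98.6

summary = {
    "mean": mean(reruns),
    "median": median(reruns),
    "best": max(reruns),
    "worst": min(reruns),
}
gap_from_best = REASONING_AVG - summary["best"]

for name, value in summary.items():
    print(f"{name:>6}: {value}")
print(f"best run still trails the reasoning average by {gap_from_best:.1f} points")
```

    Even the best of the four runs sits far below the reasoning average, which is why a single lucky run would not change the picture.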

    And it isn't a harness quirk: Sonnet 5 scored in the high 90s on the identical task, on the identical setup — no special handling for either model. So it's apples-to-apples: Fable 5 specifically under-executes multi-step "do-the-thing" work while acing the "figure-out-the-thing" work. That single gap is the entire 98.6 → 86 drop.

    Honest caveats

    • One grader cell is a known artifact — its automated checks pass but the judge scores it near zero, so it pulls every model's average down by the same few points. It doesn't change the ranking.
    • Single runs (n=1) on the reasoning cells, so expect a few points of wobble. The action-task result is the exception — we ran it four times precisely because it looked like an outlier, and it held.
    • Fable 5's cost per run is computed from the rate we already use internally ($10 input / $50 output per MTok).

    The takeaway: pick the right model for the job

    The most capable model on the price sheet isn't automatically the most capable model for your job. Fable 5 is genuinely frontier at reasoning — but for a task that's mostly execution, a cheaper model that reliably finishes the job beats a pricier one that reasons beautifully and then stops halfway.

    That's why Cogny doesn't hardcode one model. Our managed Cogny mix picks a model per job — chat, report generation, ideation, execution — and this benchmark is one of the inputs to that choice. The bench says: default the everyday reasoning work to Sonnet 5 (frontier accuracy, a quarter of Fable's cost), lean on Fable-class models for pure ideation where their reasoning shines, and reserve the heavyweight tiers for the narrow cases that actually earn them. Matching the model to the work — kept honest by the eval — is the whole point.
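    To make "a model per job" concrete, here is a purely illustrative sketch of a routing table. It is not Cogny's actual implementation: the `Job` enum, `ROUTES`, `DEFAULT_MODEL` and `pick_model` are hypothetical names, and the mapping is only one reading of the recommendation above (Sonnet 5 as the default, a Fable-class model for ideation).

```python
from enum import Enum


class Job(Enum):
    """Job types named in the post; the enum itself is hypothetical."""
    CHAT = "chat"
    REPORT = "report generation"
    IDEATION = "ideation"
    EXECUTION = "execution"


# Assumption: everything defaults to Sonnet 5 unless a route overrides it.
DEFAULT_MODEL = "Sonnet 5"

# Assumption: only ideation is routed to a Fable-class model.
ROUTES = {
    Job.IDEATION: "Fable 5",
}


def pick_model(job: Job) -> str:
    """Return the routed model for a job, falling back to the default."""
    return ROUTES.get(job, DEFAULT_MODEL)


for job in Job:
    print(f"{job.value:<18} -> {pick_model(job)}")
```

    Keeping the overrides in one small table means a new benchmark result changes a single entry rather than code scattered across the product.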