    Berner Setterwall · June 29, 2026 · 8 min read

    The Capability Arc: Running Cogny Bench Across Six OpenAI Models, 2023 → 2025

    Cogny has been building agentic SQL analysis over marketing data since December 2022. For most of that time, the honest internal answer to "can a model actually do this yet?" was no — not reliably, not over data with traps in it. The product's whole premise is a model that reads a question, writes SQL against a warehouse, and doesn't take the surface reading at face value: it notices the seasonal confound, the mix shift, the tracking outage, the currency it forgot to convert. For years, the frontier models couldn't.

    We say that a lot. This week we decided to measure it.

    Cogny Bench is our internal eval: a set of synthetic marketing-analytics problems, each with a planted ground truth and a deliberately withheld trap, graded by deterministic checks plus a fixed LLM judge. The problems themselves stay private, because a benchmark you publish is a benchmark that gets trained on, and then it stops measuring anything. The results, though, we're happy to show. This time we pointed the eval backwards, at a span of OpenAI models stretching from GPT-3.5 Turbo (March 2023) to GPT-4.1 and o3 (April 2025), to see the capability arc in numbers.

    The failures turned out to be half the story.

    The arc, in one table

    Every model ran the same problem set, n=1 per cell. Scores are the 0–100 combined metric; cost is the OpenAI agent spend per run (the judge runs separately).

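    Model               | OpenAI release | Avg score | $/run   | Total $ | Ran?
    --------------------|----------------|-----------|---------|---------|---------------------------
    GPT-3.5 Turbo       | Mar 2023       | 41.0      | $0.0050 | $0.045  | ✅ ran
    GPT-4-32k           | Mar 2023       |           |         | $0.000  | retired (404 from the API)
    GPT-4o (2024-05-13) | May 2024       | 50.8      | $0.0509 | $0.458  | ✅ ran
    o1                  | Dec 2024       | 74.9      | $0.3088 | $2.780  | ✅ ran
    GPT-4.1             | Apr 2025       | 80.0      | $0.0618 | $0.557  | ✅ ran
    o3                  | Apr 2025       | 92.1      | $0.0815 | $0.733  | ✅ ran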

    The shape is exactly the thesis: the 2023–2024 chat models sit in the 40s and low 50s, and the score only clears 70 when the reasoning generation (o1, December 2024) shows up — then climbs into the 80s and 90s with GPT-4.1 and o3 in 2025. The product became viable right about when the models did.

    Cogny Bench — the OpenAI capability arc (score vs cost), with today's cross-provider frontier plotted for contrast

    The OpenAI arc climbs as you move through 2023 → 2025. The three points sitting in the 90s at the top (Sonnet 5, GLM-5.2 and GPT-5.5, from three different labs) are today's frontier, shown for contrast. More on those below.

    The retired model: GPT-4-32k can't even pick up the phone

    GPT-4-32k — the 32K-context GPT-4 from 2023, once the biggest-context model OpenAI offered — is no longer available to our account. It isn't a low score; it's no score. Every call returns:

    404 The model `gpt-4-32k` does not exist or you do not have access to it.
    

    A retired model is a real data point on a capability arc: the frontier of 2023 has been switched off. Every problem returns an identical 404, $0.00 spent — the model fails before a single token is billed.
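    A minimal sketch of how a harness might record "no score" rather than a zero for a retired model. The names `ModelUnavailable`, `CellResult`, `run_cell` and `run_problem` are hypothetical, introduced only for illustration; they are not Cogny's harness or the OpenAI SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ModelUnavailable(Exception):
    """Hypothetical error a client wrapper raises on an HTTP 404 for a model."""


@dataclass
class CellResult:
    model: str
    score: Optional[float]  # None means "no score", which is not the same as 0
    cost_usd: float
    status: str


def run_cell(model: str, run_problem: Callable[[str], tuple[float, float]]) -> CellResult:
    """Run one problem; treat an unavailable model as unscored, not as a low score."""
    try:
        score, cost = run_problem(model)
    except ModelUnavailable:
        # The request failed before any tokens were billed.
        return CellResult(model, None, 0.0, "retired")
    return CellResult(model, score, cost, "ran")


def average_score(cells: list[CellResult]) -> Optional[float]:
    """Average only over cells that actually produced a score."""
    scores = [c.score for c in cells if c.score is not None]
    return sum(scores) / len(scores) if scores else None
```

    Keeping the score as `None` means a retired model drops out of averages instead of dragging them toward zero, which matches the empty score cell in the table.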

    GPT-3.5 Turbo ran every problem — and still scored 41

    Here's a result that contradicted our own prediction. We expected GPT-3.5 Turbo's 16K context to blow up on the larger problems. It never did — because the eval keeps the data behind tools, not pasted into the prompt. The model queries the data and only ever sees small result sets, so the context stays tiny. That's the same architecture the product uses, and it's why an old small-context model can attempt every problem.
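    The pattern can be sketched with an in-memory SQLite table standing in for the warehouse. This is an illustration under assumptions, not the eval's actual tool: `run_sql_tool`, the `sessions` table and the 20-row cap are all made up for the example.

```python
import sqlite3

MAX_ROWS = 20  # assumed cap; keeps every tool result small


def run_sql_tool(conn: sqlite3.Connection, query: str, max_rows: int = MAX_ROWS) -> dict:
    """Execute a query and return at most max_rows rows plus a truncation flag."""
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchmany(max_rows + 1)  # fetch one extra row to detect truncation
    truncated = len(rows) > max_rows
    return {"columns": columns, "rows": rows[:max_rows], "truncated": truncated}


# Toy warehouse: 10,000 rows that never enter the prompt directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (day INTEGER, channel TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [(i % 30, "paid" if i % 2 else "organic", 1.0) for i in range(10_000)],
)

# An aggregate returns two rows; a raw dump is cut off at the cap.
summary = run_sql_tool(conn, "SELECT channel, SUM(spend) FROM sessions GROUP BY channel")
raw = run_sql_tool(conn, "SELECT * FROM sessions")
```

    Whatever the table size, the model only ever receives the capped result, so prompt length depends on the queries it writes rather than on the data volume.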

    It attempts them — and then it folds on the reasoning. It reports surface numbers, misses the effects hiding under them, and confidently calls a declining trajectory a win. GPT-3.5 Turbo is the cheapest model on the board ($0.005/run) and it runs. It just can't be trusted with a trap.

    GPT-4o (2024-05-13): better, but still gets caught

    The original GPT-4o snapshot lifts the average to 50.8 and handles the mechanical problems well, but it still gets caught by a confounded comparison — taking an inflated surface effect at face value instead of netting out the obvious alternative explanation sitting in the data. That class of mistake is precisely what separates "summarizes the numbers" from "can be trusted with a marketing decision."

    The reasoning generation clears the bar

    o1 (December 2024) is the first model in the span to reliably see through the confounds rather than fall for them. Across the suite the three 2024–2025 reasoning-era models land at 74.9 (o1), 80.0 (GPT-4.1) and 92.1 (o3) — a different league from the 41/51 of the chat era. o3 in particular is strong across nearly every problem, including the traps the chat models score near zero on.

    This is the measurement behind the claim we'd been making on faith. The capability the product needs — don't trust the surface reading — arrived with the reasoning models, not before.

    Two honest caveats (because a benchmark you can't defend is worthless)

    1. n=1, and model responses aren't deterministic. The grading is seeded and stable, but the models are sampled, so exact cell scores wobble run to run, sometimes by a lot (we saw one model swing ~40 points on a single problem between two runs, with no code change in between). So read this as a pass/fail capability arc, not a precision leaderboard: the tier gap (40s/50s chat era vs 70s to 90s reasoning era) is the robust signal; a few points of within-tier ordering is inside the sampling noise. For a tight ranking you'd average several runs.

    2. Cost ≠ capability. o1 is the priciest model here at $0.31/run, about 5× GPT-4.1 and about 4× o3, and it scores below both. o3's per-run cost reflects OpenAI's June 2025 80% price cut (from $10/$40 to $2/$8 per MTok, "same exact model, just cheaper"), so the top-scoring model on the board runs at roughly a quarter of o1's cost. The newest models aren't just better; they're cheaper. That's the 10×-every-12-months curve showing up inside one provider's own lineup.
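    The averaging mentioned in the first caveat could look roughly like this. The scores below are invented for illustration and are not Cogny Bench results; `summarize_runs` and `clearly_separated` are hypothetical helpers, and the 2× spread rule is a crude heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict:
    """Mean and sample standard deviation over repeated runs of one model."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {"n": len(scores), "mean": mean(scores), "stdev": stdev(scores)}


def clearly_separated(a: dict, b: dict, k: float = 2.0) -> bool:
    """Crude check: do the means differ by more than k times the larger spread?"""
    return abs(a["mean"] - b["mean"]) > k * max(a["stdev"], b["stdev"])


# Made-up scores for illustration only; these are not Cogny Bench results.
model_a = summarize_runs([78.0, 82.0, 80.0])
model_b = summarize_runs([81.0, 77.0, 85.0])
model_c = summarize_runs([40.0, 44.0, 42.0])

within_tier = clearly_separated(model_a, model_b)   # one-point gap, inside the noise
across_tiers = clearly_separated(model_a, model_c)  # tier gap, well outside it
```

    With n=1 there is no spread to estimate at all, which is why the single-run table supports tier-level claims but not a fine-grained ranking.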

    What we set the prices to (and where they come from)

    Pricing on a published benchmark has to be real, never guessed. Each model's per-MTok rate, with its source:

    Model                | $/MTok in | $/MTok out | Source
    ---------------------|-----------|------------|------------------------------------------------
    GPT-3.5 Turbo (0125) | 0.50      | 1.50       | OpenAI, Jan 2024 API update
    GPT-4-32k            | 60.00     | 120.00     | Original GPT-4 launch rate ($0.06/$0.12 per 1K)
    GPT-4o (2024-05-13)  | 5.00      | 15.00      | Launch ("Hello GPT-4o") snapshot rate
    o1                   | 15.00     | 60.00      | o1 GA pricing (Dec 17 2024)
    o3                   | 2.00      | 8.00       | OpenAI, Jun 2025 price cut
    GPT-4.1              | 2.00      | 8.00       | Introducing GPT-4.1 in the API

    One number to flag honestly: the gpt-4o-2024-05-13 dated snapshot is billed at its launch rate ($5/$15), even though the later gpt-4o alias dropped to $2.50/$10. We used $5/$15 because that's the snapshot we called. It's the right number for this run; it would be the wrong number for a generic "GPT-4o" line.
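    To make the effect of the rate choice concrete, here is a small sketch of the per-run cost arithmetic. The token counts are hypothetical (the post does not report them); only the per-MTok rates and the o3 price-cut figures come from the text above.

```python
def run_cost_usd(tokens_in: int, tokens_out: int, rate_in: float, rate_out: float) -> float:
    """Cost of one run given token counts and $/MTok rates."""
    return tokens_in / 1_000_000 * rate_in + tokens_out / 1_000_000 * rate_out


# Hypothetical usage for one run; real token counts are not given in the post.
TOKENS_IN, TOKENS_OUT = 8_000, 1_000

snapshot = run_cost_usd(TOKENS_IN, TOKENS_OUT, 5.00, 15.00)  # gpt-4o-2024-05-13 rate
alias = run_cost_usd(TOKENS_IN, TOKENS_OUT, 2.50, 10.00)     # later gpt-4o alias rate

# The o3 price cut quoted in the post: $10 -> $2 input, $40 -> $8 output.
cut_in = 1 - 2.00 / 10.00
cut_out = 1 - 8.00 / 40.00
```

    For these assumed token counts the snapshot rate gives $0.055 and the alias rate $0.030, so applying the wrong rate would understate the run by almost half. Both o3 cuts work out to 80%.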

    Where that leaves us: today, this is basically solved

    Run the current frontier against the same eval and the picture inverts — those are the three points clustered up in the 90s on the chart above. Where the 2023 chat models sat in the 40s and the biggest model of that year is now switched off entirely, a spread of 2025–2026 models — across three different labs, including a cheap open-weights one — all clear ~90:

    Model (today)                     | Avg score | $/run
    ----------------------------------|-----------|------
    Claude Sonnet 5                   | 93.6      | $0.10
    GLM-5.2 (open weights, EU-hosted) | 90.0      | $0.08
    GPT-5.5                           | 89.1      | $0.22

    (Numbers from our current cross-model runs on the same problem set; full standings on the Cogny Bench chart.)

    The marketing-analytics reasoning that no model could be trusted with in 2022 is now handled by several: a proprietary flagship, a frontier general model, and an open EU-hosted model that runs for eight cents. The hard question stopped being "can a model do this at all?" somewhere around late 2024. Today it's "which one should we use?", which is a cost, latency and data-sovereignty question, not a capability one.

    The takeaway: pick the right model for the job

    The point of an arc like this isn't to crown a single winner; it's that the right model depends on the work. o3 tops this run at about 1.3× the per-run cost of GPT-4.1, which scores 12 points lower; o1 is pricey and mid-pack; the 2023 chat models are cheap and untrustworthy. "Best model" is a question you can only answer against a task.

    That's exactly why Cogny doesn't hardcode one model. Our managed Cogny mix picks a model per job — chat, report generation, ideation, execution — and this benchmark is one of the inputs to that choice: for a given kind of work, reach for the cheapest model that reliably clears the bar, and reserve the expensive tiers for the tasks that actually earn them. The eval is how we keep that decision honest as new models ship.
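    The "cheapest model that reliably clears the bar" rule can be sketched against the numbers in the results table. This is an illustration only: `cheapest_clearing` and the bar values are assumptions, not how the Cogny mix is actually implemented, and a real selector would also weigh latency, run-to-run variance and data residency.

```python
from typing import Optional

# (model, average score, $/run) copied from the results table in this post.
RESULTS = [
    ("GPT-3.5 Turbo", 41.0, 0.0050),
    ("GPT-4o (2024-05-13)", 50.8, 0.0509),
    ("o1", 74.9, 0.3088),
    ("GPT-4.1", 80.0, 0.0618),
    ("o3", 92.1, 0.0815),
]


def cheapest_clearing(bar: float, results=RESULTS) -> Optional[str]:
    """Return the lowest-cost model whose score is at least `bar`, or None."""
    eligible = [(cost, name) for name, score, cost in results if score >= bar]
    return min(eligible)[1] if eligible else None


# Hypothetical bars for two kinds of work.
routine_pick = cheapest_clearing(70)   # GPT-4.1: clears 70 for less than o1 or o3
critical_pick = cheapest_clearing(90)  # o3: the only model in this run above 90
```

    Note that o1 is never selected at any bar: wherever it qualifies, GPT-4.1 or o3 qualifies at lower cost.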


    Total spend for the full arc: $6.66 of OpenAI inference (plus the fixed judge on the Anthropic side). For under seven dollars you get the receipt for a claim we'd been making since 2022: the models finally caught up to the product.