Cogny Bench: Why We Built Our Own Frontier-Model Eval (And Why It Paid Off This Week)

Yesterday, Berget.ai made Z.ai's GLM-5.2 available via their Stockholm-hosted sovereign-EU inference. Z.ai shipped the model on June 13 — 1M-token context, MIT-licensed open weights, a coding profile strong enough that Berget founder Christian Landgren put it on LinkedIn this way: "For the first time ever, the latest open model is better than the best closed model."

That makes it the first near-frontier model within reach of a Stockholm marketing-analytics company without sending customer traffic to a US or Asian cloud. About forty minutes and 60 cents after we got the API key working, we knew exactly where it ranked against Opus 4.8, GPT-5.5, Gemini 3.5 Flash and the rest of our stack.

That's the whole point of having your own benchmark.

Why generic benchmarks don't help us

MMLU and HumanEval are great for measuring how a model does on exam-style multiple choice and isolated coding puzzles. They tell us almost nothing about the questions we actually care about: Will this model catch a Simpson's paradox in a paid-social funnel? Will it notice that a "best-ROAS" channel is actually a downstream attribution leak? Will it pause an existing ad group before launching a new campaign that would cannibalize it?

Those are the reasoning traps that show up in the report-builder threads our customers run every day. A model that aces MMLU and trips on a seasonal confound in churn analysis isn't a model we can put in front of a CMO.

What Cogny Bench actually does

It's nine problems in two families. We're not going to walk through every one in this post (we'll publish the deep-dive separately for the engineering-curious), but here's the shape:

Analytics-trap problems. Each one is a small, fully-synthetic dataset — no customer data anywhere — generated from a seed so it's byte-identical run to run. Each has a planted ground truth and a planted trap: a confound the model has to discover, not be handed. Examples:

A price-increase impact analysis where the surface-level "churn spike" is actually a seasonal artifact. The right answer requires the model to notice a grandfathered cohort sitting in the data and use it as a control.
A channel-attribution scenario where the team's plan is to cut Paid Social and shift budget to Paid Search — and a new Affiliate channel shows the best ROAS. Both moves are wrong for the same reason: every Affiliate path started with Paid Social, so Affiliate's "ROAS" is a Paid Social attribution leak. A model that ranks channels by aggregate ROAS misses this entirely.
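To make the fixture idea concrete, here is a minimal sketch of a seeded generator that plants the attribution leak, plus the path-level check that exposes it. It is not Cogny Bench's actual code: the type names, the channel mix, the revenue range, and the small linear congruential generator are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: names, probabilities, and revenue ranges are illustrative.
type Channel = "Paid Social" | "Paid Search" | "Affiliate";

interface ConversionPath {
  touches: Channel[]; // ordered touchpoints, first touch at index 0
  revenue: number;
}

// Small linear congruential generator, so the same seed always yields the same data.
function makeRng(seed: number): () => number {
  let state = Math.abs(Math.floor(seed)) % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Generate paths with the planted trap: Affiliate only ever appears after Paid Social.
function generatePaths(seed: number, count: number): ConversionPath[] {
  const rng = makeRng(seed);
  const paths: ConversionPath[] = [];
  for (let i = 0; i < count; i++) {
    const roll = rng();
    const revenue = Math.round(50 + rng() * 150);
    let touches: Channel[];
    if (roll < 0.4) {
      touches = ["Paid Social", "Affiliate"];
    } else if (roll < 0.7) {
      touches = ["Paid Social"];
    } else {
      touches = ["Paid Search"];
    }
    paths.push({ touches, revenue });
  }
  return paths;
}

// Share of paths whose last touch is `closer` and whose first touch is `opener`.
function firstTouchShare(paths: ConversionPath[], closer: Channel, opener: Channel): number {
  const closed = paths.filter(p => p.touches[p.touches.length - 1] === closer);
  if (closed.length === 0) return 0;
  const opened = closed.filter(p => p.touches[0] === opener);
  return opened.length / closed.length;
}
```

In this sketch, calling generatePaths twice with the same seed returns identical data, and firstTouchShare(paths, "Affiliate", "Paid Social") comes out as 1. That is the signal an aggregate ROAS ranking never sees: the Affiliate conversions would not exist without the Paid Social touch that opened them.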

Action-flavored tool-use problems. Same datasets-as-fixtures idea, but instead of asking the model "what's the answer," we hand it a synthetic Google Ads account, a budget-committing write tool (create_campaign) that would spend real money the moment it's called in a live account, and a deliberately bare instruction: "Launch a new campaign for electric toothbrushes in Sweden, exact-match these five keywords, $50/day." The trick: an existing ad group in the fixture already serves those exact keywords in Sweden and is already converting. The right behavior is to investigate first: discover the conflict, pause the existing ad group, then launch the new campaign. A model that jumps straight to create_campaign without recon fails the test.
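The recon-then-pause-then-launch rule lends itself to a deterministic check on the recorded tool calls. The sketch below shows one way such a check could look. Only create_campaign is named in this post; list_ad_groups, pause_ad_group, the ad_group_id argument, and the verdict labels are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch of an action-trace check. Only create_campaign comes from
// the post; every other tool name, argument name, and verdict label is assumed.
interface ToolCall {
  tool: string;
  args: { [key: string]: string };
}

type Verdict = "pass" | "no_launch" | "no_recon" | "conflict_not_paused";

function gradeTrace(trace: ToolCall[], conflictingAdGroupId: string): Verdict {
  // Locate the first budget-committing write in the trace.
  const launchIndex = trace.map(call => call.tool).indexOf("create_campaign");
  if (launchIndex === -1) return "no_launch";

  // Only calls made before the launch count as reconnaissance or cleanup.
  const before = trace.slice(0, launchIndex);

  const didRecon = before.some(call => call.tool === "list_ad_groups");
  if (!didRecon) return "no_recon";

  const pausedConflict = before.some(
    call => call.tool === "pause_ad_group" && call.args["ad_group_id"] === conflictingAdGroupId
  );
  return pausedConflict ? "pass" : "conflict_not_paused";
}

// A careful trace: look first, pause the conflicting ad group, then launch.
const carefulTrace: ToolCall[] = [
  { tool: "list_ad_groups", args: { country: "SE" } },
  { tool: "pause_ad_group", args: { ad_group_id: "ag-1" } },
  { tool: "create_campaign", args: { daily_budget_usd: "50" } },
];

// A hasty trace: straight to the write tool.
const hastyTrace: ToolCall[] = [
  { tool: "create_campaign", args: { daily_budget_usd: "50" } },
];
```

Here gradeTrace(carefulTrace, "ag-1") returns "pass" and gradeTrace(hastyTrace, "ag-1") returns "no_recon". Returning a labeled failure mode rather than a bare pass/fail keeps the information about which way a model failed.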

Grading is hybrid: deterministic checks on the structured answer (within tolerance, categorical match, action trace), plus a fixed LLM judge scoring the reasoning quality. We hold the judge model constant across runs so scores stay comparable as the field moves.
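As a rough illustration of how a hybrid grade could be assembled, the sketch below combines a numeric tolerance check and a categorical match into a deterministic score, then blends it with a judge score. The field names, the 0 to 100 scale, and the 30% default judge weight are assumptions for the example; the post does not state the actual weighting.

```typescript
// Hypothetical sketch: field names, the 0-100 scale, and the judge weight are assumed.
interface StructuredAnswer {
  churnDeltaPct: number; // model's numeric estimate
  verdict: string;       // model's categorical conclusion, e.g. "seasonal_artifact"
}

interface GroundTruth {
  churnDeltaPct: number;
  tolerancePct: number;  // allowed absolute error on the numeric estimate
  verdict: string;
}

function withinTolerance(actual: number, expected: number, tolerance: number): boolean {
  return Math.abs(actual - expected) <= tolerance;
}

// Deterministic part: fraction of structured checks passed, scaled to 0-100.
function deterministicScore(answer: StructuredAnswer, truth: GroundTruth): number {
  const checks: boolean[] = [
    withinTolerance(answer.churnDeltaPct, truth.churnDeltaPct, truth.tolerancePct),
    answer.verdict === truth.verdict,
  ];
  const passed = checks.filter(ok => ok).length;
  return (100 * passed) / checks.length;
}

// Blend the deterministic score with the judge's 0-100 reasoning score.
function hybridScore(deterministic: number, judge: number, judgeWeight: number = 0.3): number {
  return (1 - judgeWeight) * deterministic + judgeWeight * judge;
}
```

Keeping the deterministic part separate from the judge score means a change in the blend weight never alters whether a structured answer was actually correct.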

What the chart looks like

Across all the problems we've shipped, here's where the models we've evaluated land on score versus cost:

[Figure: Cogny Bench, averaged score vs. cost across all problems]

Update (June 30, 2026): the chart now includes Claude Sonnet 5, which launched today and went straight to the top of the board: 93.6 averaged, edging out Opus 4.8 at roughly a fifth of the cost ($0.10/run at the introductory price, $0.15 at standard). It also beats its predecessor Sonnet 4.6 by ~5 points. Full write-up: "Claude Sonnet 5 on Cogny Bench." (We also dropped GPT-OSS-120B from the plot; at ~41 it's an off-scale small open model that was compressing the readable range.)

A few things jump out (as of the June 24 sweep):

Frontier tier converges. Gemini 3.5 Flash, Opus 4.7, Opus 4.8, GLM-5.2, GPT-5.5, and Sonnet 4.6 all land in the same high-80s-to-low-90s band (89–94 averaged across nine problems). The cost spread inside that band is ~8×, from $0.08 to $0.65 a run.
Cost ≠ capability. Gemini 3.5 Flash actually tops the averaged score (93.6) at $0.15 a run — about a quarter of Opus 4.8's $0.65, which scores no higher. And GLM-5.2 clears 90 at $0.08, roughly an eighth of Opus 4.8. The cheapest capable models match or beat the priciest — that's a procurement decision, not a rounding error.
The 10×-every-12-months curve is visible right in our stack. Sam Altman put it this way in his essay "Three Observations": "The cost to use a given level of AI falls about 10× every 12 months, and lower prices lead to much more use." Cogny Bench is what that observation looks like as data inside one company: Opus 4.5 costs 4.6× as much per run as Opus 4.7 for an identical score band. Picking the same capability tier from two release cycles back means a nearly 5× bill for no quality gain. If a benchmark isn't telling you which of two same-quality models to pick on cost, it's not earning its keep.
Mid-tier models fail predictably. Haiku 4.5 nails the structured-output tasks but folds on multi-step recon-then-write. GPT-OSS-120B (OpenAI's open-weight model, served by Berget) does the opposite: it'll happily call create_campaign without ever checking whether the keywords are already running somewhere else. Knowing which way a model fails is more useful than knowing it scored 28.
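The cost comparisons above are simple ratios, and the procurement question reduces to "cheapest model at or above a score floor." A minimal sketch, using only the score and cost pairs quoted in this post (Gemini 3.5 Flash and GLM-5.2) plus Opus 4.8's quoted cost; the type and function names are hypothetical.

```typescript
// Hypothetical sketch. Scores and costs are the figures quoted in this post;
// Opus 4.8 appears only as a cost because its exact averaged score is not quoted.
interface BoardEntry {
  model: string;
  score: number;         // averaged score across problems
  costPerRunUsd: number; // average cost of one run
}

const board: BoardEntry[] = [
  { model: "Gemini 3.5 Flash", score: 93.6, costPerRunUsd: 0.15 },
  { model: "GLM-5.2", score: 90.0, costPerRunUsd: 0.08 },
];

const OPUS_4_8_COST_USD = 0.65;

// How many times more one run costs than another.
function costRatio(expensiveUsd: number, cheapUsd: number): number {
  return expensiveUsd / cheapUsd;
}

// Cheapest entry whose score is at or above the floor, or undefined if none qualifies.
function cheapestAtOrAbove(entries: BoardEntry[], minScore: number): BoardEntry | undefined {
  let best: BoardEntry | undefined = undefined;
  for (const entry of entries) {
    if (entry.score < minScore) continue;
    if (best === undefined || entry.costPerRunUsd < best.costPerRunUsd) {
      best = entry;
    }
  }
  return best;
}

// costRatio(OPUS_4_8_COST_USD, 0.08) is about 8.1 (GLM-5.2 at roughly an eighth of the cost)
// costRatio(OPUS_4_8_COST_USD, 0.15) is about 4.3 (Gemini 3.5 Flash at about a quarter)
```

With a floor of 90, cheapestAtOrAbove(board, 90) picks GLM-5.2; raising the floor to 93 picks Gemini 3.5 Flash. The floor is the real decision, and it differs by workload.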

The GLM-5.2 angle — why this paid off this week

When the GLM-5.2 access on Berget went live, we ran one command:

npm run bench -- --models glm-5.2

About forty minutes and 60 cents later, we had its scores on every problem in the suite, side-by-side with Opus 4.8 and the rest. No guessing, no anecdotes from Twitter, no "well it feels smart in chat." A real ranked recommendation we could base an integration decision on.

The headline result: GLM-5.2 placed fourth of eight models on the averaged score (90.0), within a few points of the Opus tier (92.9–93.2), and at $0.08 a run, roughly an eighth of Opus 4.8's cost and the cheapest capable model on the board. That's not a "nice to know"; that's a real procurement question. If an open-weights model on Swedish-sovereign inference reaches 90 on the same problems where Opus lands ~93, the question for some of our workloads becomes: why are we paying Opus pricing?

That speed of evaluation is what makes the benchmark earn its keep. Frontier model releases are constant now: Anthropic, OpenAI, Google, xAI, DeepSeek, Zhipu/Z.ai, Moonshot, Mistral, and now sovereign EU inference. We don't want to take "is this the new state of the art?" as an act of faith. We want to take it as a measurement.

What it taught us about choosing models

Two takeaways that have changed how we think about model selection at Cogny:

1. Easy problems saturate fast; only confounds discriminate. On the simpler analytics problems, every frontier model scores 95+. The interesting separation only shows up when the problem contains a misleading signal the model has to recognize and step around. This is the same lesson Anthropic and OpenAI keep publishing about their own evals, and it holds just as well at the highly domain-specific level. If your bench is just "can the model compute these numbers," your bench is already saturated.

2. Tool-use ability and analytics ability are different axes. Some models reason about data beautifully but can't reliably call the right tool with the right arguments. Some execute tool calls flawlessly but skip the reconnaissance step that would prevent them from blowing up your live ad account. A bench that only measures one of these is going to recommend the wrong model for half of your real use cases.
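Both takeaways can be read straight off a per-problem results table. The sketch below shows one way to do it: the score spread across models flags saturated problems, and per-family means keep the analytics and tool-use axes separate. The model names, problem names, and scores are made-up placeholders, not Cogny Bench results.

```typescript
// Hypothetical sketch: all names and scores below are placeholders, not real results.
type Family = "analytics" | "tool_use";

interface Result {
  model: string;
  problem: string;
  family: Family;
  score: number;
}

const results: Result[] = [
  { model: "model-a", problem: "easy_sum", family: "analytics", score: 97 },
  { model: "model-a", problem: "seasonal_confound", family: "analytics", score: 92 },
  { model: "model-a", problem: "recon_then_write", family: "tool_use", score: 40 },
  { model: "model-b", problem: "easy_sum", family: "analytics", score: 96 },
  { model: "model-b", problem: "seasonal_confound", family: "analytics", score: 60 },
  { model: "model-b", problem: "recon_then_write", family: "tool_use", score: 95 },
];

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Max minus min score on one problem; a small spread means the problem is saturated.
function scoreSpread(rows: Result[], problem: string): number {
  const scores = rows.filter(r => r.problem === problem).map(r => r.score);
  if (scores.length === 0) return 0;
  const max = scores.reduce((a, b) => Math.max(a, b));
  const min = scores.reduce((a, b) => Math.min(a, b));
  return max - min;
}

// Mean score of one model within one problem family.
function familyMean(rows: Result[], model: string, family: Family): number {
  return mean(rows.filter(r => r.model === model && r.family === family).map(r => r.score));
}

// Model with the highest mean in a family, considering only models scored in that family.
function bestForFamily(rows: Result[], family: Family): string | undefined {
  let bestModel: string | undefined = undefined;
  let bestMean = -Infinity;
  for (const row of rows) {
    if (row.family !== family) continue;
    const m = familyMean(rows, row.model, family);
    if (m > bestMean) {
      bestMean = m;
      bestModel = row.model;
    }
  }
  return bestModel;
}
```

On this placeholder data the easy problem has a spread of 1 and the confound problem a spread of 32, and the best model differs by family: model-a for analytics, model-b for tool use. A single averaged score would hide both facts.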

Where this goes next

We run Cogny Bench on every model release that matters to our customers. When a new Anthropic, OpenAI, Google, or now Berget-hosted model drops, the question isn't "should we evaluate it?" but "how does it score against what we already ship?"

Behind every report Cogny builds for a customer, there's a specific model choice. Cogny Bench is how we make that choice defensible. The fact that we could evaluate a brand-new Swedish-sovereign frontier model the same afternoon it shipped on Berget, and have a graphed, costed ranking the same evening, is what makes the next model release — whichever provider it comes from — a measurement, not a leap of faith.