    Berner Setterwall · June 24, 2026 · 6 min read

    Cogny Bench: Why We Built Our Own Frontier-Model Eval (And Why It Paid Off This Week)

    Yesterday, Berget.ai made Z.ai's GLM-5.2 available via their Stockholm-hosted sovereign-EU inference. Z.ai shipped the model on June 13 — 1M-token context, MIT-licensed open weights, a coding profile strong enough that Berget founder Christian Landgren put it on LinkedIn this way: "For the first time ever, the latest open model is better than the best closed model."

    That makes it the first near-frontier model within reach of a Stockholm marketing-analytics company without sending customer traffic to a US or Asian cloud. About forty minutes and 60 cents after we got the API key working, we knew exactly where it ranked against Opus 4.8, GPT-5.5, Gemini 3.5 Flash and the rest of our stack.

    That's the whole point of having your own benchmark.

    Why generic benchmarks don't help us

    MMLU and HumanEval are great for measuring how a model does on grad-school multiple choice and isolated coding puzzles. They tell us almost nothing about the question we actually care about: will this model catch a Simpson's paradox in a paid-social funnel? Will it notice that a "best-ROAS" channel is actually a downstream attribution leak? Will it pause an existing ad group before launching a new campaign that would cannibalize it?

    Those are the reasoning traps that show up in the report-builder threads our customers run every day. A model that aces MMLU and trips on a seasonal confound in churn analysis isn't a model we can put in front of a CMO.

    What Cogny Bench actually does

    It's nine problems in two families. We're not going to walk through every one in this post (we'll publish the deep-dive separately for the engineering-curious), but here's the shape:

    Analytics-trap problems. Each one is a small, fully-synthetic dataset — no customer data anywhere — generated from a seed so it's byte-identical run to run. Each has a planted ground truth and a planted trap: a confound the model has to discover, not be handed. Examples:

    • A price-increase impact analysis where the surface-level "churn spike" is actually a seasonal artifact. The right answer requires the model to notice a grandfathered cohort sitting in the data and use it as a control.
    • A channel-attribution scenario where the team's plan is to cut Paid Social and shift budget to Paid Search — and a new Affiliate channel shows the best ROAS. Both moves are wrong for the same reason: every Affiliate path started with Paid Social, so Affiliate's "ROAS" is a Paid Social attribution leak. A model that ranks channels by aggregate ROAS misses this entirely.
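
    To make the "seeded dataset with a planted trap" idea concrete, here is a minimal sketch of how such a fixture could be generated and checked. It is not the actual Cogny Bench code: the type names, the channel mix, the revenue range, and the helper names (seededRandom, generatePaths, upstreamShare) are all assumptions made for illustration.

```typescript
// Hypothetical sketch: none of these names or proportions come from Cogny Bench itself.
type MarketingChannel = "Paid Social" | "Paid Search" | "Affiliate" | "Organic";

interface ConversionPath {
  touches: MarketingChannel[]; // ordered from first touch to last touch
  revenue: number;
}

// Park-Miller style generator: the same seed always yields the same sequence.
function seededRandom(seed: number): () => number {
  const M = 2147483647; // 2^31 - 1
  let state = (Math.abs(Math.floor(seed)) % (M - 1)) + 1; // keep state in 1..M-1
  return () => {
    state = (state * 48271) % M; // product stays below 2^53, so it is exact
    return (state - 1) / (M - 1); // value in [0, 1)
  };
}

function generatePaths(seed: number, n: number): ConversionPath[] {
  const rand = seededRandom(seed);
  const paths: ConversionPath[] = [];
  for (let i = 0; i < n; i++) {
    const r = rand();
    const revenue = Math.round(50 + rand() * 150);
    let touches: MarketingChannel[];
    if (r < 0.3) {
      // Planted trap: Affiliate only ever appears after a Paid Social first touch.
      touches = ["Paid Social", "Affiliate"];
    } else if (r < 0.6) {
      touches = ["Paid Social"];
    } else if (r < 0.85) {
      touches = ["Paid Search"];
    } else {
      touches = ["Organic", "Paid Search"];
    }
    paths.push({ touches, revenue });
  }
  return paths;
}

// Share of paths containing `downstream` whose first touch was `upstream`.
function upstreamShare(
  paths: ConversionPath[],
  downstream: MarketingChannel,
  upstream: MarketingChannel,
): number {
  const withDownstream = paths.filter((p) => p.touches.indexOf(downstream) !== -1);
  if (withDownstream.length === 0) return 0;
  const led = withDownstream.filter((p) => p.touches[0] === upstream);
  return led.length / withDownstream.length;
}
```

    Calling generatePaths twice with the same seed gives identical output, which is the property that keeps runs comparable. On this fixture, upstreamShare(paths, "Affiliate", "Paid Social") is 1 by construction whenever any Affiliate path exists, and that is the signal a model ranking channels by aggregate ROAS never looks at.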

    Action-flavored tool-use problems. Same datasets-as-fixtures idea, but instead of asking the model "what's the answer," we hand it a synthetic Google Ads account, a destructive tool (create_campaign), and a deliberately bare instruction: "Launch a new campaign for electric toothbrushes in Sweden, exact-match these five keywords, $50/day." The trick: an existing ad group already serves those exact keywords in Sweden, generating real conversions. The right behavior is to investigate first — discover the conflict, pause the existing ad group, then launch the new campaign. A model that jumps straight to create_campaign without recon fails the test.
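
    The "recon, pause, then launch" rule lends itself to a deterministic check on the model's tool-call trace. The sketch below shows one way that could look. Only create_campaign is named in this post; the trace format, the recon tool names (list_campaigns, list_ad_groups, search_keywords), pause_ad_group, and the ad_group_id argument are hypothetical stand-ins.

```typescript
// Hypothetical trace format and tool names; only create_campaign appears in the post.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface TraceVerdict {
  pass: boolean;
  reason: string;
}

function gradeLaunchTrace(
  trace: ToolCall[],
  conflictingAdGroupId: string,
  reconTools: string[] = ["list_campaigns", "list_ad_groups", "search_keywords"],
): TraceVerdict {
  // Position of the first destructive call, or -1 if the model never launched.
  const createIdx = trace.map((c) => c.tool).indexOf("create_campaign");
  if (createIdx === -1) {
    return { pass: false, reason: "never launched the campaign" };
  }

  // Everything the model did before launching.
  const before = trace.slice(0, createIdx);

  const didRecon = before.some((c) => reconTools.indexOf(c.tool) !== -1);
  if (!didRecon) {
    return { pass: false, reason: "create_campaign called without recon" };
  }

  const pausedConflict = before.some(
    (c) => c.tool === "pause_ad_group" && c.args["ad_group_id"] === conflictingAdGroupId,
  );
  if (!pausedConflict) {
    return { pass: false, reason: "conflicting ad group not paused before launch" };
  }

  return { pass: true, reason: "recon, pause, then launch" };
}
```

    Returning a reason alongside the pass flag is a deliberate choice in this sketch: it records which way a model failed, not just that it failed.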

    Grading is hybrid: deterministic checks on the structured answer (within tolerance, categorical match, action trace), plus a fixed LLM judge scoring the reasoning quality. We hold the judge model constant across runs so scores stay comparable as the field moves.
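
    As a rough illustration of hybrid grading, the sketch below blends two deterministic checks with a judge score. The post does not publish the real scoring formula, so the answer shape, the relative-tolerance rule, the 0..1 judge scale, and the 0.4 judge weight are all assumptions.

```typescript
// Hypothetical shapes and weights; the actual Cogny Bench formula is not published here.
interface StructuredAnswer {
  estimate: number; // e.g. a numeric effect size
  verdict: string; // e.g. a categorical recommendation
}

interface GroundTruth {
  estimate: number;
  relTolerance: number; // allowed relative error, e.g. 0.05 for 5%
  verdict: string;
}

function withinTolerance(actual: number, expected: number, relTol: number): boolean {
  // Fall back to an absolute bound when the expected value is zero.
  if (expected === 0) return Math.abs(actual) <= relTol;
  return Math.abs(actual - expected) / Math.abs(expected) <= relTol;
}

function categoricalMatch(actual: string, expected: string): boolean {
  return actual.trim().toLowerCase() === expected.trim().toLowerCase();
}

// judgeScore is assumed to be a 0..1 reasoning-quality score from a fixed judge model.
function hybridScore(
  answer: StructuredAnswer,
  truth: GroundTruth,
  judgeScore: number,
  judgeWeight: number = 0.4,
): number {
  const checks = [
    withinTolerance(answer.estimate, truth.estimate, truth.relTolerance),
    categoricalMatch(answer.verdict, truth.verdict),
  ];
  const deterministic = checks.filter((ok) => ok).length / checks.length;
  const clampedJudge = Math.min(1, Math.max(0, judgeScore));
  // Final score on a 0..100 scale.
  return 100 * ((1 - judgeWeight) * deterministic + judgeWeight * clampedJudge);
}
```

    Keeping the deterministic part separate from the judge part means a change of judge model would only move one term, which is one reason to hold the judge fixed.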

    What the chart looks like

    Across all the problems we've shipped, here's where the models we've evaluated land on score versus cost:

    Figure: Cogny Bench, averaged score vs. cost across all problems

    A few things jump out:

    • Frontier tier converges. Opus 4.7, Opus 4.8, GPT-5.5, and Gemini 3.5 Flash all sit in the same band (mid-80s average). The cost spread between them is over an order of magnitude.
    • Cost ≠ capability. Gemini 3.5 Flash matches Opus 4.8 on average score at roughly 1/15th the cost on these problems. That's not a marginal saving; that's a different procurement decision.
    • Mid-tier models fail predictably. Haiku 4.5 nails the structured-output tasks but folds on multi-step recon-then-write. GPT-OSS-120B (Berget's open-weights option) does the opposite — it'll happily call create_campaign without ever checking whether the keywords are already running somewhere else. Knowing which way a model fails is more useful than knowing it scored 28.
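
    One way to turn a score-versus-cost chart into a shortlist is to keep only the models that no other model beats on both axes. The sketch below computes that Pareto frontier. The ModelResult shape and the function name are assumptions, and any numbers fed to it in examples are placeholders, not Cogny Bench results.

```typescript
// Hypothetical result shape; feed it your own measured scores and costs.
interface ModelResult {
  model: string;
  score: number; // averaged bench score, 0..100
  costUsd: number; // cost of one full bench run
}

// A model is dominated if another model is at least as good on both axes
// and strictly better on at least one of them.
function paretoFrontier(results: ModelResult[]): ModelResult[] {
  return results
    .filter(
      (m) =>
        !results.some(
          (o) =>
            o !== m &&
            o.score >= m.score &&
            o.costUsd <= m.costUsd &&
            (o.score > m.score || o.costUsd < m.costUsd),
        ),
    )
    .sort((a, b) => a.costUsd - b.costUsd); // cheapest first
}
```

    A model that scores within a point of the leader at a fraction of the cost stays on the frontier, while a model that is both pricier and weaker than some alternative drops off it.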

    The GLM-5.2 angle — why this paid off this week

    When the GLM-5.2 access on Berget went live, we ran one command:

    npm run bench -- --models glm-5.2
    

    About forty minutes and 60 cents later, we had its scores on every problem in the suite, side-by-side with Opus 4.8 and the rest. No guessing, no anecdotes from Twitter, no "well it feels smart in chat." A real ranked recommendation we could base an integration decision on.

    The headline result: GLM-5.2 placed fourth out of seven models on the averaged score, just behind Opus 4.7 and ahead of Opus 4.8, at roughly a quarter of Opus 4.7's per-run cost. That's not a "nice to know"; that's a real procurement question. If a Swedish-sovereign open-weights model clears 85 on the same problems where Opus 4.7 hits 86, the question for some of our workloads becomes: why are we paying Opus pricing?

    That speed of evaluation is what makes the benchmark earn its keep. Frontier model releases are constant now, from Anthropic, OpenAI, Google, xAI, DeepSeek, Zhipu/Z.ai, Moonshot, and Mistral, and they now arrive on sovereign EU inference too. We don't want to answer "is this the new state of the art?" on faith. We want to answer it with a measurement.

    What it taught us about choosing models

    Two takeaways that have changed how we think about model selection at Cogny:

    1. Easy problems saturate fast; only confounds discriminate. On the simpler analytics problems, every frontier model scores 95+. The interesting separation only shows up when the problem contains a misleading signal the model has to recognize and step around. This is the same lesson Anthropic and OpenAI keep publishing about their own evals, and it holds just as well at the narrow, domain-specific level. If your bench is just "can the model compute these numbers," your bench is already saturated.

    2. Tool-use ability and analytics ability are different axes. Some models reason about data beautifully but can't reliably call the right tool with the right arguments. Some execute tool calls flawlessly but skip the reconnaissance step that would prevent them from blowing up your live ad account. A bench that only measures one of these is going to recommend the wrong model for half of your real use cases.
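
    The saturation point in the first takeaway can be checked mechanically: a problem where every model clears the same high bar no longer separates anything. Below is a minimal sketch of that check. The data shape and function name are assumptions, and the default floor of 95 simply mirrors the "95+" figure above.

```typescript
// Hypothetical data shape: problem id -> one score (0..100) per evaluated model.
type ScoresByProblem = Record<string, number[]>;

interface ProblemSpread {
  problem: string;
  min: number;
  max: number;
  spread: number; // max - min: how far apart the models land
  saturated: boolean; // true when every model scores at or above the floor
}

function findSaturated(scores: ScoresByProblem, floor: number = 95): ProblemSpread[] {
  return Object.keys(scores).map((problem) => {
    const s = scores[problem] ?? [];
    const min = s.reduce((a, b) => Math.min(a, b), Infinity);
    const max = s.reduce((a, b) => Math.max(a, b), -Infinity);
    const hasScores = s.length > 0;
    return {
      problem,
      min: hasScores ? min : 0,
      max: hasScores ? max : 0,
      spread: hasScores ? max - min : 0,
      saturated: hasScores && min >= floor,
    };
  });
}
```

    Problems flagged as saturated are candidates for retirement or for a harder variant with a planted confound; problems with a wide spread are the ones doing the ranking.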

    Where this goes next

    We run Cogny Bench on every model release that matters to our customers. When a new Anthropic, OpenAI, Google, or now Berget model drops, the question isn't should we evaluate it — it's how does it score against what we already ship.

    Behind every report Cogny builds for a customer, there's a specific model choice. Cogny Bench is how we make that choice defensible. The fact that we could evaluate a brand-new Swedish-sovereign frontier model the same afternoon it shipped on Berget, and have a graphed, costed ranking the same evening, is what makes the next model release — whichever provider it comes from — a measurement, not a leap of faith.