Falsifiable Marketing Experiments: Why AI Tickets Beat AI Suggestions
There's an uncomfortable truth about most "AI marketing tools" shipping in 2026.
Open the dashboard. See the recommendations. "Increase budget on top-performing campaigns. Test new audience segments. Refresh creative cadence." Read them. Nod. Close the tab. Never find out if the recommendation was right.
A month later the dashboard generates a new set of recommendations. Some of them contradict the previous set. There is no record of which old recommendations were tried, no record of what happened to the metric after they were, no learning, no compounding. Just an infinite scroll of suggestions, free-floating, ungrounded.
That is not an AI marketing platform. That is a horoscope with a sidebar.
The fix is older than computers. It's called falsifiability, and it's what separates an experiment from an opinion. This piece is about how we built it into Cogny at the schema level — and why every AI marketing tool should be evaluated on whether it can produce a falsifiable hypothesis on demand.
TL;DR
- A falsifiable marketing experiment has four things: a hypothesis, an expected outcome, an approval gate, and a measurement of what actually happened.
- Most AI marketing tools generate suggestions — prose without any of those four things. Suggestions cannot be evaluated and so cannot compound.
- Cogny's Growth Tickets enforce falsifiability at the database level. A ticket cannot reach the approval queue without expected_outcome and estimated_impact_usd. After execution, result_metrics is filled in and the model sees it on the next run.
- This is the loop that turns AI marketing from "ideas in a chat window" into a system that gets sharper over time.
- The test for any AI marketing platform: can it tell you, six months in, which of its past recommendations actually worked?
Suggestions vs. Experiments
The first thing to be honest about: the AI marketing tooling that ships today is overwhelmingly in the suggestions category, not the experiments category. The reason is that suggestions are dramatically easier to ship.
Here's the difference, side by side.
| | A Suggestion | A Falsifiable Experiment |
|---|---|---|
| Output | "Refresh creative cadence on Q2 LinkedIn ads" | "Pause ad set LI_Q2_Launch_V3; CTR has dropped from 0.51% to 0.22% over 14 days against a 30-day average; predicted weekly savings $480; expected lift on remaining ad sets +6%" |
| Specificity | Vague — what cadence, what change, what outcome? | Specific — exact ad set, exact change, exact threshold, exact prediction |
| Hypothesis | None stated | "This ad set has fatigued; pausing it shifts spend to higher-CTR creatives" |
| Expected outcome | None | "Account-level CTR rises from 0.34% to 0.36%, weekly spend efficiency up 4–7%" |
| Approval | "Sure, sounds good" | Yes/No on a structured ticket with logged reasoning |
| Measurement | None — recommendation evaporates | result_metrics written 7 days post-execution; compared to prediction |
| Outcome on the next cycle | Same generic suggestion, possibly contradicting itself | "We predicted +6% lift; got +4.2% lift. Update calibration for this account: creative-fatigue interventions over-predict by ~1.8pp." |
The right-hand column is what an experiment looks like. The left-hand column is what most "AI marketing assistants" ship. They feel productive because they generate output volume; they fail to compound because they cannot be wrong.
If your AI marketing tool cannot tell you, in the past tense, "on April 14 we recommended X, you approved it, and the metric moved by Y," then your AI marketing tool is producing suggestions. Demand experiments.
What Falsifiability Actually Means
Karl Popper's framing is the cleanest one: a claim is scientifically meaningful if it can be disproved by observation. "All swans are white" is falsifiable — find a black swan, claim disproved. "There's an invisible dragon in my garage that's undetectable by any instrument" is not falsifiable — no observation can disprove it, so it's not a useful claim.
The same test applies to AI marketing recommendations. "Optimise creative cadence" is the invisible-dragon recommendation. You cannot run a test that disproves it because it doesn't predict anything specific.
A falsifiable marketing recommendation makes a specific prediction. The prediction is observable. After the action, you compare the observation to the prediction and learn something.
For an AI marketing system, this matters operationally for a non-obvious reason: the model can only learn from falsifiable outputs. If the recommendation was "do something vaguely good and report back," there's nothing to feed into the next run. If the recommendation was "this exact change will produce this exact lift," the gap between prediction and reality becomes the signal that improves the next prediction.
Falsifiability is the substrate that makes AI marketing compound.
How Cogny Enforces It: The Growth Ticket Schema
I'll get specific because the database schema is where the philosophical argument turns into engineering.
A Cogny Growth Ticket is a row in Postgres with the following key fields (simplified):
growth_tickets
├── id -- ticket identifier
├── title -- one-line summary
├── body -- the recommendation, in prose
├── expected_outcome -- required; the testable prediction
├── estimated_impact_usd -- required; dollar prediction
├── execution_target -- manual / agent / code
├── approval_score -- 0–1, model's estimate of approval probability
├── approval_score_reasons -- structured reasoning behind the score
├── status -- new / doing / analysis / done / rejected
├── source_context -- which reporting agent / scheduled prompt spawned it
├── implemented_at -- timestamp when action was taken
├── result -- prose summary of what actually happened
└── result_metrics -- structured numbers from after the change
The two required fields are expected_outcome and estimated_impact_usd. A ticket cannot be inserted without both. If the model can't produce them, the ticket is dropped before it reaches the queue.
That's the falsifiability gate. The schema makes it impossible to ship a recommendation that isn't a hypothesis.
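A minimal sketch of that gate, written as an application-side check in Python. The article describes the rule as enforced by the schema itself, so this is only an illustration of the logic, not Cogny's code. The class `GrowthTicketDraft` and the functions `passes_falsifiability_gate` and `enqueue` are hypothetical names; only the two required field names come from the schema above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GrowthTicketDraft:
    """Hypothetical in-memory mirror of a few growth_tickets columns."""
    title: str
    body: str
    expected_outcome: Optional[str] = None        # the testable prediction
    estimated_impact_usd: Optional[float] = None  # the dollar prediction


def passes_falsifiability_gate(draft: GrowthTicketDraft) -> bool:
    """Return True only when both required prediction fields are present."""
    has_outcome = bool(draft.expected_outcome and draft.expected_outcome.strip())
    has_impact = draft.estimated_impact_usd is not None
    return has_outcome and has_impact


def enqueue(drafts: List[GrowthTicketDraft]) -> List[GrowthTicketDraft]:
    """Drop drafts that fail the gate; the rest would go on to the queue."""
    return [d for d in drafts if passes_falsifiability_gate(d)]
```

In a relational database the same rule would more likely be expressed as column constraints, so that a row without a prediction is rejected at insert time rather than filtered afterwards.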
After approval and execution, the cycle closes. implemented_at is stamped. result is filled in once enough time has passed for the metric to settle (typically 7–14 days post-execution depending on the action type). result_metrics carries the structured measurement — we predicted X, we got Y, delta Z.
The next analytical run reads result_metrics for every prior ticket in the same account. The model now has historical fact: when it recommended this kind of change in this account, here's what actually happened. The recommendations get more grounded because the predictor has feedback.
The Approval-Score Predictor (And Why It Exists)
Once you have falsifiable tickets, a new problem emerges: the model produces too many of them. Without filtering, the queue gets noisy — fifty low-confidence tickets for every ten high-confidence ones — and humans burn out triaging.
The solution is the approval-score predictor: a separate model that estimates, at ticket-creation time, how likely a human reviewer is to approve the ticket. It uses:
- Historical approval data from the same account (the dominant signal)
- Features of the ticket itself: impact size, channel, action type
- Cross-account patterns for accounts in similar verticals
- The reasoning chain that produced the ticket
Tickets below a threshold are filtered before they ever reach the queue. The remaining tickets carry a visible approval_score so reviewers can prioritise.
Two things are worth noting about this:
1. It is itself falsifiable. The predictor's accuracy is measured against actual approval/rejection decisions. When it's wrong, that's signal — the predictor recalibrates.
2. It is account-specific. The score for a $500/month spender's account is calibrated differently from a $50,000/month spender's. The same recommendation (pause a keyword) carries different stakes and different acceptance criteria.
Without the approval-score predictor, the queue collapses under its own volume. With it, the queue stays focused on the tickets most likely to ship.
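The filtering step and the "it is itself falsifiable" point can be sketched in a few lines. The threshold value of 0.5 is an assumption for illustration, as are the function names. The Brier score used here is one standard way to grade probability forecasts against yes/no outcomes; the article does not say which metric Cogny uses.

```python
from typing import Dict, List, Sequence


def filter_and_rank(tickets: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep tickets at or above the threshold, highest approval_score first."""
    kept = [t for t in tickets if t["approval_score"] >= threshold]
    return sorted(kept, key=lambda t: t["approval_score"], reverse=True)


def brier_score(scores: Sequence[float], approved: Sequence[bool]) -> float:
    """Mean squared gap between predicted approval probability and the
    actual decision (1 for approved, 0 for rejected). Lower is better."""
    if len(scores) != len(approved) or not scores:
        raise ValueError("need equal-length, non-empty sequences")
    total = sum((s - (1.0 if a else 0.0)) ** 2 for s, a in zip(scores, approved))
    return total / len(scores)
```

Tracking a score like this per account over time is one way to check whether the predictor is actually recalibrating or merely drifting.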
What "Result Metrics" Actually Look Like
The hardest part of all this isn't the recommendation — it's the post-mortem. Did the change actually do what we predicted?
For different action types, "result" has different definitions:
| Action type | What we measure | When |
|---|---|---|
| Pause a keyword/audience/ad set | Spend redirected, CPA delta on remaining traffic | 7 days post-execution |
| Increase a budget | Marginal ROAS on the incremental spend | 14 days |
| Negative-keyword addition | Wasted-impression reduction, conversion-rate lift | 14 days |
| Audience-overlap fix | Cost-per-impression improvement, frequency change | 7 days |
| SEO content publish | Indexation, ranking position over 30/60/90 days | 30/60/90 days |
| GEO content optimisation | AI-engine citation count, traffic from AI referrers | 30 days |
| Email-segment cleanup | Deliverability, open-rate lift on remaining list | 14 days |
The measurement window matters. Settling for "yesterday's metric" produces noisy result data. Waiting too long lets the world change underneath you. Each action type has a tuned window based on how long the metric typically takes to stabilise.
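The windows in the table translate naturally into a lookup keyed by action type. The sketch below assumes hypothetical key names and helper functions; only the day counts are taken from the table.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical action-type keys; the day counts come from the table above.
MEASUREMENT_WINDOWS_DAYS: Dict[str, Tuple[int, ...]] = {
    "pause_keyword_audience_or_ad_set": (7,),
    "increase_budget": (14,),
    "negative_keyword_addition": (14,),
    "audience_overlap_fix": (7,),
    "seo_content_publish": (30, 60, 90),  # measured at three checkpoints
    "geo_content_optimisation": (30,),
    "email_segment_cleanup": (14,),
}


def measurement_due_dates(action_type: str, implemented_at: datetime) -> List[datetime]:
    """Return every date on which this action type should be measured."""
    windows = MEASUREMENT_WINDOWS_DAYS[action_type]
    return [implemented_at + timedelta(days=d) for d in windows]


def is_ready_to_measure(action_type: str, implemented_at: datetime, now: datetime) -> bool:
    """True once the first measurement window for the action has elapsed."""
    return now >= measurement_due_dates(action_type, implemented_at)[0]
```

Keeping the windows in data rather than in code makes it easy to retune a single action type without touching the measurement logic.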
The result_metrics blob carries structured numbers. A simplified example for a paused-keyword ticket:
{
"predicted_monthly_saving_usd": 1840,
"actual_30d_saving_usd": 2110,
"predicted_cpa_impact_pct": 0,
"actual_cpa_impact_pct": -3.2,
"verdict": "outperformed",
"calibration_delta": "savings +14.7%, cpa -3.2pp better than prediction"
}
This row is now part of the account's history. The next paid-media audit reads it and adjusts. "In this account, keyword pauses tend to outperform predictions by ~15%. Recommended actions get higher confidence."
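The arithmetic behind that calibration note is simple enough to check by hand. The sketch below recomputes it from the example blob; the function names, the 5% tolerance band, and the "underperformed" and "on_target" labels are assumptions, since the article only shows the "outperformed" case.

```python
import json

# Numeric fields copied from the paused-keyword example above.
RESULT_METRICS_JSON = """
{
  "predicted_monthly_saving_usd": 1840,
  "actual_30d_saving_usd": 2110,
  "predicted_cpa_impact_pct": 0,
  "actual_cpa_impact_pct": -3.2
}
"""


def savings_delta_pct(metrics: dict) -> float:
    """Relative gap between actual and predicted savings, in percent."""
    predicted = metrics["predicted_monthly_saving_usd"]
    actual = metrics["actual_30d_saving_usd"]
    return (actual - predicted) / predicted * 100.0


def cpa_delta_pp(metrics: dict) -> float:
    """Gap between actual and predicted CPA impact, in percentage points."""
    return metrics["actual_cpa_impact_pct"] - metrics["predicted_cpa_impact_pct"]


def verdict(delta_pct: float, tolerance_pct: float = 5.0) -> str:
    """Label the outcome; the tolerance band is an illustrative assumption."""
    if delta_pct > tolerance_pct:
        return "outperformed"
    if delta_pct < -tolerance_pct:
        return "underperformed"
    return "on_target"


metrics = json.loads(RESULT_METRICS_JSON)
print(round(savings_delta_pct(metrics), 1))  # 14.7
print(cpa_delta_pp(metrics))                 # -3.2
print(verdict(savings_delta_pct(metrics)))   # outperformed
```

The savings figure works out as (2110 - 1840) / 1840, roughly 14.7%, which matches the calibration_delta string in the example.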
The Truth Ledger
The pattern of "every prediction has a post-action measurement" is what we call the Truth Ledger internally. It's not a separate system — it's just the discipline that every ticket carries both sides of the equation, prediction and outcome, and the model reads both on the next run.
The Truth Ledger does three things that suggestion-style AI marketing tools cannot:
- Calibrates the predictor. Over time, the model learns where it over- and under-predicts. The fix isn't generic; it's per-account, per-channel, per-action-type.
- Resolves contradictions. When recommendation A in March said one thing and recommendation B in May says the opposite, the Ledger has a fact: what happened when we ran A. The contradiction has an answer.
- Provides audit. For agencies, marketing teams, and CMOs answering to finance, the Ledger is the "why did we do this and what was the result" trail. It's a CFO-friendly format that "the AI suggested it" is not.
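One way to picture the per-account, per-channel, per-action-type calibration from the first bullet is as a grouped average of actual-to-predicted ratios. This is a deliberately simple sketch under that assumption; the row keys (`account`, `channel`, `action_type`, `predicted`, `actual`) and both function names are hypothetical, and a production system would presumably need more than a plain mean.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (account, channel, action_type)


def calibration_factors(ledger: List[dict]) -> Dict[Key, float]:
    """Average actual/predicted ratio for each (account, channel, action_type)."""
    ratios: Dict[Key, List[float]] = defaultdict(list)
    for row in ledger:
        if row["predicted"] == 0:
            continue  # a ratio is undefined for a zero prediction
        key = (row["account"], row["channel"], row["action_type"])
        ratios[key].append(row["actual"] / row["predicted"])
    return {key: mean(values) for key, values in ratios.items()}


def calibrated(prediction: float, key: Key, factors: Dict[Key, float]) -> float:
    """Scale a raw prediction by its group's factor; unknown groups pass through."""
    return prediction * factors.get(key, 1.0)
```

With two past keyword pauses that came in at 1.2x and 1.1x of prediction, the next raw prediction of $1,000 for the same group would be adjusted to about $1,150.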
The Ledger is also what makes the recommendation engine durable across model upgrades. When we move from Sonnet 4.6 to Sonnet 5.0 next year, the Ledger doesn't reset. The fact base survives. The new model walks into an account with months of grounded outcome data and gets useful immediately.
What This Looks Like for the User
The user doesn't see the schema. The user sees a queue of tickets that look roughly like this:
Pause keyword "enterprise crm software" in campaign "B2B - Brand"
Approval score: 0.84
Estimated monthly impact: +$1,840
Why: 30-day spend $1,840, conversions: 0, search-term report shows 92% navigational queries for competitor brand.
Predicted outcome: Spend reallocates to remaining keywords; account CPA drops ~$8.
Approve / Reject / Defer
Approve, and the action is executed via the Google Ads MCP. 7 days later, the ticket reappears in the analysis lane with:
Result: Spend savings $2,110/month (predicted $1,840, +14.7%). CPA dropped $11.40 (predicted -$8, outperformed). Verdict: outperformed.
This is the falsifiability loop at the user level. It is the difference between an AI that gives you opinions and an AI you can trust on Monday morning.
Why This Matters for AI Marketing as a Category
The big strategic point: AI marketing as a category is going to bifurcate this year.
On one side, tools that ship suggestions — pretty dashboards, recommendation feeds, "insights" pages. They feel like AI marketing but cannot be evaluated. Six months in, the customer can't answer the basic question "is this thing working?" and churns.
On the other side, tools that ship falsifiable experiments — Growth Tickets with predictions, approvals, and measured outcomes. The customer has a Truth Ledger they can show their boss. Renewal is easy because the value is provable.
Cogny is firmly in the second camp. So are a handful of other serious AI marketing products. We expect the suggestion-style tools to compress hard over the next 18 months as buyers learn to ask the right question.
That question, again: "Show me which of your past recommendations were approved, and what happened to the metric afterwards."
If the answer is a list — you're looking at a real platform. If the answer is a wave-of-the-hand about how AI is hard to measure — you're looking at a horoscope.
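In data terms, "the answer is a list" means a query like the one sketched below returns rows. It assumes tickets are available as dictionaries carrying the schema fields named earlier; the function name is hypothetical.

```python
from typing import Dict, List


def answered_recommendations(tickets: List[Dict]) -> List[Dict]:
    """Return executed tickets that carry both a prediction and a measured result."""
    return [
        {
            "title": t["title"],
            "implemented_at": t["implemented_at"],
            "expected_outcome": t["expected_outcome"],
            "result_metrics": t["result_metrics"],
        }
        for t in tickets
        if t.get("implemented_at") and t.get("result_metrics")
    ]
```

A tool that only stores prose suggestions has nothing to put in the expected_outcome and result_metrics columns, so the list comes back empty.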
Practical Implications for Marketing Teams
If you're running marketing operations and evaluating AI tooling:
1. Audit the schema, not the marketing copy. Ask the vendor to show you what a single recommendation looks like in their database. If there's no expected_outcome or equivalent field, you're buying suggestions.
2. Insist on an outcome log. Whatever they call it — Truth Ledger, audit trail, history view — you need a place where past recommendations live alongside what actually happened. Without it, you cannot prove ROI to finance.
3. Look for approval gates. Falsifiability is meaningless if the agent can change budgets without a human knowing. The approval gate is what makes the experiment actually run.
4. Time-bound the predictions. "Will save you money" is not an experiment. "Will reduce monthly spend by $1,840 over the next 30 days" is. The time horizon should be in the prediction.
5. Ask about calibration over time. A serious AI marketing platform's recommendations should get better in your specific account over months — not because the model upgraded, but because the Ledger now has account-specific patterns. If the vendor can't talk about per-account calibration, the system isn't compounding.
How to See This Working
The full falsifiability loop — scheduled runs across every channel, Growth Tickets with expected_outcome filled in, approval workflow, and the Truth Ledger that closes the loop with result_metrics — ships with Cogny Cloud at $530/month. This is the configuration that compounds month over month.
Cogny Solo at $9/month is the entry tier — bring-your-own-Claude, starter MCP set, one channel at a time. You can see the ticket workflow and individual recommendations, but the scheduler and Truth Ledger live at the Cloud tier. Solo is the right place to start if you want to test the model on your real data before committing.
For the bigger picture (why a harness is what makes any of this possible), see our harness vs. raw Claude piece. For the model side, see Claude for marketing.
FAQ
What is a falsifiable marketing experiment?
A marketing experiment is falsifiable if it makes a specific prediction that can be disproved by observation. A falsifiable experiment has four parts: a hypothesis, an expected outcome, an approval/execution step, and a measurement of what actually happened. Generic recommendations like "refresh creative cadence" are not falsifiable.
How is a Growth Ticket different from a suggestion?
A Growth Ticket has a required expected_outcome and estimated_impact_usd, gets an approval gate, and is followed by result_metrics measuring what actually happened. A suggestion is just prose. The ticket can be evaluated; the suggestion cannot.
Doesn't every AI marketing tool measure outcomes?
No. Most generate recommendations and never close the loop. They might show you "performance over time" in aggregate, but they don't tie specific past recommendations to specific subsequent outcomes. The Truth Ledger pattern (ticket → execution → measurement → feedback to the next cycle) is the rare thing.
What happens when the prediction is wrong?
The system writes the gap to result_metrics and the predictor recalibrates. The next prediction in the same account uses the updated calibration. This is the learning loop. Wrong predictions are not failures — they are the signal that makes the next prediction better.
Why is the approval gate necessary?
Two reasons. First, marketing actions affect real spend and real revenue; you want a human reading the ticket before it executes. Second, the approval/rejection itself is training data for the predictor: knowing which tickets humans accept and reject is how the system learns what's useful in your account.
Can I export the Truth Ledger?
Yes. The Ledger is structured data: every ticket, every prediction, every result. Cogny exports the full history to your warehouse so you can run your own analyses, build your own dashboards, and answer your own questions.
About Berner Setterwall
Berner is CTO and co-founder of Cogny, where he leads the engineering of the Growth Ticket schema, the approval-score predictor, and the Truth Ledger. He spent eleven years building large-scale optimisation systems at Campanja for Netflix, Zalando, and others. He thinks "show me your schema" is the most useful question a buyer can ask a marketing AI vendor.
See falsifiable experiments running against your account
The full Truth Ledger — scheduled runs, parallel reports, the closure loop that turns predictions into measured outcomes — ships with Cogny Cloud at $530/month.
Cogny Solo at $9/month is the entry tier: bring-your-own-Claude, starter MCP set, one channel at a time, 7-day free trial. Useful to see Growth Tickets with expected_outcome on your own data before stepping up to the full Cloud configuration.