AI Marketing Harness: Why Free Skills and Raw Claude Aren't Enough

Every week, somebody emails us a screenshot of Claude doing marketing work and asks the same question.

"This is basically what Cogny does, right? I just ask Claude."

It's a fair question. Claude is good. Genuinely good. The Sonnet 4.6 and Opus 4.8 models we run today are categorically better at reasoning over ad-platform data than anything we had eighteen months ago. If you open a clean Claude window and paste in last month's Google Ads export, you will get useful output.

The trouble starts in week three.

By then the screenshot has become a habit. Somebody exports the same CSV every Monday, pastes it into the same chat, and asks the same question. The output gets less useful every week — not because Claude got worse, but because the context never carries forward. The model forgot what you tried last week, doesn't know which keyword you already paused, can't see whether the change you made actually improved CPA, and has no idea that the campaign you're staring at represents 60% of your spend.

That gap — between "Claude can do this once" and "AI is now doing marketing for me" — is what a harness is for.

This piece is a technical argument about what the harness is, why a chat window plus a skills folder doesn't qualify, and what we ended up building at Cogny to get the loop to close.

TL;DR

An AI marketing harness is the production layer around an LLM: tool access, a schedule, an approval gate, outcome measurement, and historical memory.
Raw Claude is fine for one-off questions. It has no tools to your data, no clock, no memory of what you tried, and no measurement of what worked.
Claude + free skills (the recently launched npx skills add ecosystem) adds reusable prompts and a small amount of glue. It does not connect to your ad accounts, run on a schedule, or measure outcomes.
A harness adds the missing pieces: live MCP connections to your real data, scheduled execution, structured tickets with expected_outcome + result_metrics behind an approval gate, and a learning loop that compounds.
Cogny Cloud ships the full harness at $530/month — 25+ MCPs, 39 analytical templates, scheduled execution, falsifiable Growth Tickets, parallel reports, and the Truth Ledger that records what was approved and what happened next. Cogny Solo at $9/month is an entry tier (bring-your-own-Claude, starter MCP set, one channel at a time) — useful, but not the harness.
The model gets cheaper every quarter. The harness is the part that compounds value.

What Is an AI Marketing Harness?

Borrowed from the agent-research literature, a "harness" is everything around the model that turns it from a chat completion into a working system.

For marketing specifically, an AI marketing harness has five components:

Component	What it does	Why a chat window doesn't have it
Tool access	Reads your real ad-platform, warehouse, and analytics data	Chat windows can read pasted text, not live APIs
Schedule	Runs analyses on a clock — daily, hourly — without you asking	A chat window only runs when you open it
Workflow	Turns model output into structured tickets with hypotheses and expected outcomes	Chat output is free-form prose with nothing to triage, approve, or track
Approval gate	Human-in-the-loop before changes hit a live ad account	A chat window has nothing to approve
Memory + measurement	Records what was approved, what was rejected, what happened to the metric afterwards	A chat window forgets the conversation when you close it

Strip any one of those out and you're back to "I asked Claude and it had some ideas." Which is fine. It's not a marketing platform.

The Three Ways People Try to Do AI Marketing in 2026

Roughly speaking, there are three approaches in the wild right now. We've watched all three play out — in our own product, in customer migrations, and in side-by-side bake-offs against teams who chose differently.

Approach 1: Raw Claude (or raw GPT, or raw Gemini)

Open a chat window. Paste data. Ask questions. Read output.

What works:

You learn fast. The first session feels like magic.
The reasoning quality is real. Claude in 2026 can write SQL against an unfamiliar schema and explain why a CPA spiked.
Marginal cost is near zero. Twenty bucks a month, no infra.

What breaks in week three:

No live data. Every question requires you to export, clean, and paste. The longer the time horizon of the analysis, the more painful the export.
No memory. "What did we try last week?" requires you to scroll back through chats. Decisions don't persist.
No coverage. You analyze the top 5 campaigns because pasting all 47 of them is annoying. The wins live in the long tail.
No measurement. You never find out if the recommendation actually worked. The change happens in the ad platform, the result happens in BigQuery, and the two are never reconciled.
No safety. When you eventually decide to let Claude do something instead of just suggest, the only path is "I'll paste the API call and run it myself." That works until you accidentally pause the wrong campaign.

Raw Claude is a calculator, not an analyst. Useful, but it does not accumulate value.

Approach 2: Claude + Free Skills

This is the new shape of the conversation. Anthropic ships Skills — reusable prompts and tool definitions you install with npx skills add some/skill-pack. The Cogny team ships some of them. Other companies ship some. There are good skills for Supabase, Shopify, and a handful of marketing data sources.

The promise is real: a skill is a small, well-tested package of instructions and tool definitions that makes Claude better at a specific task. Use the SEO skill, get better SEO analysis. Use the Supabase skill, get safer database changes.

What works:

The output is genuinely sharper than raw Claude on the same task. Domain-specific instructions matter.
It's open. You can read the skill, modify it, contribute back.
It composes. A skill stack of 5–10 marketing-relevant skills is meaningfully more capable than a clean chat.

What still breaks:

Skills are stateless. A skill is a prompt scaffold and maybe some tool definitions. It does not remember what you ran yesterday, and it does not store the result.
No live integrations included by default. A skill can call an MCP server, but it doesn't ship the MCP server. You still need to wire up Meta Ads, Google Ads, GA4, Klaviyo, BigQuery, Search Console, and the rest. Each one is its own auth flow, its own quirks, its own breakage.
No schedule. Skills only run when you open Claude and type. The 03:00 budget anomaly is not getting caught by a skill.
No ticket workflow. The output is markdown in a chat window. There's nowhere to triage it, nowhere to approve it, nowhere to track what shipped.
No outcome measurement. Same problem as raw Claude. The skill doesn't know if its recommendation worked.

Free skills are an improvement, not a replacement. They are the equivalent of better libraries for a Python project. You still need a server, a database, a deployment pipeline, and someone watching the logs. The skill is one file. The harness is everything else.

Approach 3: A Harness

This is what Cogny is, and what a handful of other serious AI marketing products are converging toward. The model is one component. Everything around it is the actual product.

In Cogny's case, the harness is:

25+ MCPs that connect to live marketing data — Google Ads, Meta Ads, LinkedIn Ads, TikTok Ads, GA4, Search Console, Klaviyo, Mailchimp, HubSpot, Symplify, Shopify, BigQuery, and the rest. Auth handled. Rate limits handled. Schema evolution handled.
39 analytical templates that codify how to audit a specific surface: google-ads-optimization, meta-ads-historic-winners, geo-search-optimization, b2b-sales-pipeline, revenue-cohort-analysis, and dozens more. Each template is the kind of analysis a senior performance marketer would run, written down so the model runs it the same way every time.
A scheduler that fires those templates on a clock. Daily for paid media. Weekly for SEO. Monthly for cohort revenue. The model doesn't wait for you to ask.
Growth Tickets — every recommendation lands in a queue as a structured ticket with a title, a body, an expected_outcome, an estimated_impact_usd, an execution_target (manual / agent / code), and an approval_score that estimates how likely a human is to approve it. Falsifiable hypotheses. Not vibes.
An approval gate. Nothing changes in your ad accounts without a human saying yes. The agent drafts. The human ships.
The Truth Ledger. Every approved ticket is followed up after execution: did the change actually move the metric? The result lands in result_metrics and feeds back into the next cycle.

That last piece is the one that matters most and the one a chat window cannot fake. The harness measures outcomes. Over time, it learns which kinds of recommendations work for your account specifically. A chat window cannot do this because a chat window doesn't know what happened after you closed it.
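As a rough illustration of the ticket shape described above, here is a minimal Python sketch. The field names (title, body, expected_outcome, estimated_impact_usd, execution_target, approval_score, result_metrics) come from the text. The types, the 0-to-1 score range, the is_measured helper, and the example values are assumptions for illustration, not Cogny's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutionTarget(Enum):
    MANUAL = "manual"
    AGENT = "agent"
    CODE = "code"


@dataclass
class GrowthTicket:
    title: str
    body: str
    expected_outcome: str                  # the falsifiable prediction
    estimated_impact_usd: float
    execution_target: ExecutionTarget
    approval_score: float                  # assumed to be a probability in [0, 1]
    result_metrics: Optional[dict] = None  # filled in after execution

    def __post_init__(self) -> None:
        if not 0.0 <= self.approval_score <= 1.0:
            raise ValueError("approval_score must be between 0 and 1")

    def is_measured(self) -> bool:
        """Hypothetical helper: has the follow-up result been recorded?"""
        return self.result_metrics is not None


# Example values are made up for illustration.
ticket = GrowthTicket(
    title="Pause keyword 'enterprise crm software'",
    body="Campaign: B2B - Brand. Check back in 30 days.",
    expected_outcome="Monthly spend drops by about $1,840",
    estimated_impact_usd=1840.0,
    execution_target=ExecutionTarget.MANUAL,
    approval_score=0.8,
)
```

The point of the sketch is that the prediction and the later result live on the same record, which is what makes the ticket checkable after the fact.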

The Production Reality: What We Tried Before We Built the Harness

I want to be specific about why we landed here, because "build a harness" sounds like the kind of thing engineers say to justify a bigger product. It isn't. We tried the cheaper options first.

2024 Q4 — Raw Claude with a system prompt. Our first prototype was a chat interface with a long system prompt, a single BigQuery tool, and nothing else. It worked for demos. It did not work for our own marketing analytics. The same questions came back every Monday because the model had no memory of last Monday.

2025 Q1 — Skills-style scaffolds. We pulled the system prompt apart into a dozen smaller, task-specific prompts and called them from a router. This is structurally what the modern skills ecosystem ships. It was better. The Google Ads analysis got sharper because the Google Ads scaffold knew about quality scores and search-term reports. The SEO analysis got sharper because the SEO scaffold knew about Search Console's query/page join.

But the team still wasn't using the output. The recommendations sat in chat logs. Nothing shipped to ad accounts. There was no way to know which recommendations had actually been tried.

2025 Q2 — Scheduled runs + a ticket schema. We bolted on a scheduler and a Postgres table called growth_tickets. Every scheduled run wrote its findings as tickets with title, body, expected_outcome. The chat output stopped being prose-in-a-window and started being structured rows in a queue.

Usage jumped immediately. Not because the model got better — the model was the same. Because the workflow now matched how performance marketers actually work: a triage queue of specific things to do, ordered by estimated impact.

2025 Q3 — Approval gates and execution. We added a human approval step before any ticket could execute against an ad platform. We added result_metrics so that we could ask, after the fact, "did the change actually do what we predicted?" This is the falsifiability piece. Without it the agent is making astrology with a budget.

2025 Q4 onward — the Truth Ledger and the learning loop. The outcome data started feeding back into the model's context for the next run. "Here's what we recommended in this account, here's what was approved, here's what the metric did afterwards." The recommendations got noticeably more grounded. Approval rates climbed.

Each layer added value the previous one couldn't capture. None of them are tricks the model itself can do. All of them are harness.
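To make the learning loop concrete, here is a minimal sketch of how prior outcomes might be rendered into context for the next run. The function name build_history_context, the row keys (title, status, expected_outcome, result_metrics), and the text layout are all hypothetical; the source only says that recommendations, approvals, and metric results are fed back.

```python
def build_history_context(ledger_rows: list[dict], limit: int = 20) -> str:
    """Render ledger rows (assumed to be ordered most recent first) as plain text."""
    lines = ["Prior recommendations for this account:"]
    for row in ledger_rows[:limit]:
        result = row.get("result_metrics")
        measured = f"result: {result}" if result else "result: not yet measured"
        lines.append(
            f"- {row['title']} | {row['status']} | "
            f"expected: {row['expected_outcome']} | {measured}"
        )
    return "\n".join(lines)


# Example rows are made up for illustration.
rows = [
    {
        "title": "Pause keyword 'enterprise crm software'",
        "status": "approved",
        "expected_outcome": "Monthly spend drops by about $1,840",
        "result_metrics": {"monthly_saving_usd": 1610.0},
    },
    {
        "title": "Raise budget on retargeting campaign",
        "status": "rejected",
        "expected_outcome": "CPA falls 10% in 30 days",
        "result_metrics": None,
    },
]
context = build_history_context(rows)
```

The resulting string would be prepended to the next scheduled prompt, so the model reasons with the account's own track record in view.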

Why the Harness Is What Compounds

The frontier models keep getting better and cheaper. Claude Sonnet 4.6 is ~6x cheaper per token than Opus 3 was eighteen months ago, and roughly as good at the analytical tasks we care about. That trend is not going to reverse.

What this means for AI marketing tools is uncomfortable. The model is not the moat. Whatever advantage you had from prompt engineering or model choice in 2024 is being eroded every quarter by the providers themselves.

What does compound:

Data integrations you don't have to maintain. The MCP for Meta Ads is a non-trivial piece of engineering — token refresh, rate limit handling, schema drift, retry semantics, error classification. A free skill can describe how to call it. The MCP itself has to exist somewhere. Building and maintaining 25 of them is roughly a person-year of work that the user does not have to do.
Templates encoding how senior practitioners actually work. The difference between a competent paid-media audit and a useless one is not which model you used. It is whether the audit knows to compare search-term reports against keyword reports, to segment by device before declaring an audience dead, to check quality score before bid changes. That knowledge lives in templates, not in the model.
Outcome history. This is the irreplaceable asset. Once Cogny has been running against your account for six months, the model has a record of which recommendations worked in your specific business. No public model can have that. A new tool starting from zero cannot have that. It is genuinely unique to the harness that produced it.

These three things compound. Models improve. The harness keeps the improvements.
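The retry semantics and error classification mentioned in the first item can be sketched as follows. This is an assumption-heavy illustration: RetryableError, FatalError, and call_with_retry are hypothetical names, and exponential backoff with jitter is a common general technique, not a description of how Cogny's MCPs are actually implemented.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical: transient failures such as rate limits or timeouts."""


class FatalError(Exception):
    """Hypothetical: failures a retry cannot fix, such as revoked access."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise
            # The delay ceiling doubles each attempt: 1s, 2s, 4s, ...
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        # FatalError and any other exception propagate immediately.
    raise AssertionError("unreachable")


# Usage: a fake API call that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError("rate limited")
    return "ok"


sleeps: list[float] = []
result = call_with_retry(flaky_call, sleep=sleeps.append)
```

Passing sleep as a parameter keeps the sketch testable without real waiting. A production version would also need per-platform classification of which responses count as retryable.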

What a Real Harness Looks Like: Five Hard Tests

If you're evaluating AI marketing tools — including Cogny — here are five questions worth asking. A real harness should answer "yes" to all of them. A chat window with a fancy front end will fail several.

1. Does it run when you're asleep?

If the answer is "you open the app and ask," it is not a harness. Real systems run on a clock and bring findings to you. We schedule paid-media audits at 06:00 in the customer's timezone so the Monday-morning queue is already populated.
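Scheduling at a local wall-clock time can be sketched as below. The function name next_run_utc is hypothetical, and the example uses a fixed UTC+1 offset so it runs anywhere; in practice you would likely pass a zoneinfo.ZoneInfo such as ZoneInfo("Europe/Stockholm") so daylight saving time is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_run_utc(
    now_utc: datetime, customer_tz: tzinfo, local_time: time = time(6, 0)
) -> datetime:
    """Return the next occurrence of local_time in the customer's zone, in UTC."""
    now_local = now_utc.astimezone(customer_tz)
    candidate = datetime.combine(now_local.date(), local_time, tzinfo=customer_tz)
    if candidate <= now_local:
        # Today's slot has already passed, so schedule tomorrow's.
        candidate = datetime.combine(
            now_local.date() + timedelta(days=1), local_time, tzinfo=customer_tz
        )
    return candidate.astimezone(timezone.utc)


# Fixed offset used for the example; a real system would use an IANA zone.
utc_plus_1 = timezone(timedelta(hours=1))
before = next_run_utc(datetime(2026, 1, 5, 4, 30, tzinfo=timezone.utc), utc_plus_1)
after = next_run_utc(datetime(2026, 1, 5, 5, 30, tzinfo=timezone.utc), utc_plus_1)
```

Here 04:30 UTC is 05:30 local, so the run lands later the same morning; 05:30 UTC is 06:30 local, so the run rolls over to the next day.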

2. Can it read your real numbers without exports?

Pasted CSVs are not a data integration. The harness should connect to the live source — BigQuery, Search Console, the ad platform API — and pull the data itself. If the only way to feed it your numbers is to download a file, the limit on coverage will be how much downloading you can stomach.

3. Does every recommendation have a falsifiable hypothesis?

A real recommendation says what to change, in which campaign, with what expected dollar impact, on what timeline. "Optimise creative cadence" is astrology. "Pause keyword enterprise crm software in campaign B2B - Brand; predicted monthly saving $1,840; check back in 30 days" is a hypothesis. The first you cannot evaluate. The second you can.

We enforce this at the schema level. A Growth Ticket cannot reach the approval queue without an expected_outcome and an estimated_impact_usd. If the model can't fill those in, the ticket is dropped.
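A minimal sketch of that kind of gate, assuming tickets arrive as plain dictionaries. The function name admit_to_queue and the exact validation rules are hypothetical; only the two required field names come from the text.

```python
from typing import Any, Optional


def admit_to_queue(draft: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the draft if it carries a checkable prediction, else None (dropped)."""
    outcome = draft.get("expected_outcome")
    impact = draft.get("estimated_impact_usd")
    if not isinstance(outcome, str) or not outcome.strip():
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(impact, bool) or not isinstance(impact, (int, float)):
        return None
    return draft


vague = {"title": "Optimise creative cadence"}
specific = {
    "title": "Pause keyword 'enterprise crm software' in campaign B2B - Brand",
    "expected_outcome": "Predicted monthly saving $1,840; check back in 30 days",
    "estimated_impact_usd": 1840,
}
queue = [t for t in (admit_to_queue(vague), admit_to_queue(specific)) if t]
```

The vague draft never reaches a human; the specific one does.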

4. Is there an approval gate?

You do not want AI changing budgets at 03:00 without you. You especially do not want it changing budgets at 03:00 without a log of who approved what and why. The approval gate is non-negotiable for anyone running real ad spend. Audit trail. Rollback. Sleep.
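The gate-plus-audit-log idea can be sketched in a few lines. ApprovalGate, ApprovalRecord, and the rule that the most recent decision wins are assumptions made for illustration; rollback is left out of this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class ApprovalRecord:
    ticket_id: str
    approver: str
    approved: bool
    reason: str
    decided_at: datetime


@dataclass
class ApprovalGate:
    audit_log: list[ApprovalRecord] = field(default_factory=list)

    def decide(self, ticket_id: str, approver: str, approved: bool, reason: str) -> None:
        """Append a decision; records are never edited or deleted."""
        self.audit_log.append(
            ApprovalRecord(ticket_id, approver, approved, reason, datetime.now(timezone.utc))
        )

    def is_approved(self, ticket_id: str) -> bool:
        # Assumption: the most recent decision for a ticket wins.
        for record in reversed(self.audit_log):
            if record.ticket_id == ticket_id:
                return record.approved
        return False

    def execute(self, ticket_id: str, action: Callable[[], None]) -> None:
        """Run the change only if a human approval is on record."""
        if not self.is_approved(ticket_id):
            raise PermissionError(f"ticket {ticket_id} has no approval on record")
        action()


gate = ApprovalGate()
changes: list[str] = []
try:
    gate.execute("t-1", lambda: changes.append("paused keyword"))
    blocked = False
except PermissionError:
    blocked = True

gate.decide("t-1", approver="anna", approved=True, reason="low-intent keyword")
gate.execute("t-1", lambda: changes.append("paused keyword"))
```

The first execute call is refused because nobody has approved the ticket; the second goes through and the log shows who approved it and why.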

5. Does it measure whether the recommendation worked?

This is the question that separates serious AI marketing tools from the rest. If the recommendation flows out and disappears, with no follow-up, no result_metrics, and no comparison to the predicted impact, then the system cannot learn. Six months in, it will be no better than month one. Compare that to a harness with outcome measurement: every cycle gets sharper because the model can see what worked last cycle in this specific account.
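A minimal sketch of comparing a prediction with what actually happened. The function name score_outcome, the three verdict labels, and the 25% tolerance are hypothetical choices, and the observed figure is made up; the source only states that results are compared to predicted impact.

```python
def score_outcome(
    predicted_usd: float, observed_usd: float, tolerance: float = 0.25
) -> dict:
    """Classify a measured result against the ticket's predicted impact."""
    if predicted_usd <= 0:
        raise ValueError("predicted_usd must be positive")
    ratio = observed_usd / predicted_usd
    if ratio >= 1 - tolerance:
        verdict = "confirmed"   # within tolerance of the prediction, or better
    elif ratio > 0:
        verdict = "partial"     # right direction, smaller effect
    else:
        verdict = "refuted"     # no effect, or the wrong direction
    return {
        "predicted_usd": predicted_usd,
        "observed_usd": observed_usd,
        "ratio": round(ratio, 3),
        "verdict": verdict,
    }


# Predicted $1,840 monthly saving; suppose $1,610 was observed after 30 days.
result_metrics = score_outcome(1840.0, 1610.0)
```

A record like this is what lets the next cycle distinguish recommendations that held up from ones that did not.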

Why "Just Build It Yourself" Is Harder Than It Sounds

Every technical founder reading this has the same thought: I could build this. It's a Postgres table and some Claude calls.

Yes. We thought that too. Here is what we actually had to build:

A queue model with state transitions (new → doing → analysis → done/rejected) and a history table for audit.
An approval-score predictor that estimates, at ticket-creation time, how likely a human is to approve. This lets us filter low-quality output before it pollutes the queue.
Rate-limit and retry logic for every ad-platform API. Meta and Google have different failure modes and different recovery semantics; both have to be handled or the scheduler stalls.
A keepalive layer for long-running streams so that Cloudflare and GCP don't kill SSE connections during a 90-second analysis (see our notes on QUIC protocol errors).
Token refresh for every OAuth integration. Linked accounts expire. The harness has to notice, refresh transparently, and surface a re-auth UI before the user notices it's broken.
A schedule executor that survives pod recycles. Long-running chat and report sessions are killed when workers restart, so the queue has to be resumable. (We documented why we don't auto-ship worker releases on every PR — same reason.)
A Truth Ledger schema with expected_outcome, result, result_metrics, approval_score, approval_score_reasons, and a reference back to the scheduled prompt and reporting agent that spawned the ticket, so that six months later you can answer the question "why did we make this change and did it work?"

This is maybe 80–90% of the work in Cogny. The Claude calls are the easy part.
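To give a flavor of the first item in the list above, here is a small sketch of a queue with guarded state transitions and an append-only history. The state names come from the text. The exact transition table is an assumption (it reads the arrow notation literally, so rejection is only reachable from analysis), and TicketQueue is a hypothetical in-memory stand-in for what would be database tables.

```python
from datetime import datetime, timezone

# Assumed reading of: new -> doing -> analysis -> done/rejected
TRANSITIONS: dict[str, set[str]] = {
    "new": {"doing"},
    "doing": {"analysis"},
    "analysis": {"done", "rejected"},
    "done": set(),
    "rejected": set(),
}


class TicketQueue:
    def __init__(self) -> None:
        self.state: dict[str, str] = {}
        # Append-only audit history: (ticket_id, from_state, to_state, actor, when)
        self.history: list[tuple[str, str, str, str, datetime]] = []

    def _record(self, ticket_id: str, old: str, new: str, actor: str) -> None:
        self.history.append((ticket_id, old, new, actor, datetime.now(timezone.utc)))

    def create(self, ticket_id: str, actor: str) -> None:
        self.state[ticket_id] = "new"
        self._record(ticket_id, "", "new", actor)

    def transition(self, ticket_id: str, to_state: str, actor: str) -> None:
        from_state = self.state[ticket_id]
        if to_state not in TRANSITIONS[from_state]:
            raise ValueError(f"illegal transition {from_state} -> {to_state}")
        self.state[ticket_id] = to_state
        self._record(ticket_id, from_state, to_state, actor)


queue = TicketQueue()
queue.create("t-1", actor="scheduler")
queue.transition("t-1", "doing", actor="anna")
queue.transition("t-1", "analysis", actor="agent")
queue.transition("t-1", "done", actor="agent")
```

Even this toy version shows why the table alone is not enough: the guard and the history are what make the queue auditable.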

The Honest Comparison

Here is the matrix I'd build if I were comparing approaches for a marketing team in 2026.

	Raw Claude	Claude + Free Skills	Cogny Harness
Data access	Pasted text	Skill-defined tool calls (you build the MCP)	25+ live MCPs included
Schedule	None	None	Daily / weekly / hourly
Coverage	What you paste	What you paste	100% of campaigns, every cycle
Output shape	Prose	Better prose	Structured tickets with expected_outcome
Approval workflow	None	None	Built-in queue + audit log
Outcome measurement	None	None	Truth Ledger with result_metrics
Learning over time	None	None	Per-account history feeds next cycle
Operational ownership	You	You	Cogny
Best for	One-off questions	Engineers prototyping	Teams running real spend
Monthly cost	$20	$20 + your eng time	$530 (Cloud — full harness) · $9 (Solo — starter)

This is the comparison we're comfortable making publicly. Each row reflects the actual capability difference, not marketing copy.

Where the Free-Skill Ecosystem Fits

To be clear: we ship skills. We use skills. We think the skill ecosystem is genuinely useful and that more of it should exist.

The right way to think about it is: skills are a great primitive; they are not a product. A skill is something an engineer reaches for to build a harness faster. It is not something a marketing team adopts in place of a harness.

The closest analogue is the difference between shadcn/ui (a copy-pasteable component library) and a fully-built SaaS application. Both are valuable. Both fit different audiences. The component library is for engineers who want to compose their own thing. The application is for the end user who needs the thing to already work.

Cogny is the application. The skills are an input.

Who Should Care About This Distinction

If you are:

An engineer playing with Claude — raw Claude and free skills are perfect. Build whatever you want.
A solo operator running one channel — Cogny Solo at $9/month is the right entry point. You bring your own Claude, get a starter set of MCPs, and operate it yourself one channel at a time. It's not the full harness — there's no scheduler, no parallel reports, no Truth Ledger at this tier — but it's enough to start finding wins and to see how Claude marketing works on your real data.
A growth team at a $5M–$50M ARR company — you need the full harness: scheduled execution, parallel reports across every channel, the Truth Ledger, organisational memory, the full 25+ MCP catalogue. This is Cogny Cloud at $530/month, and the volume of recommendations plus the audit requirement alone make it the only viable shape.
An agency — the full harness is your unfair advantage. You can manage more accounts per analyst because the recommendation generation, prioritization, and outcome tracking are all running automatically. Your team's job becomes triage and strategy. This is Cogny Cloud territory.

How to Try It

If you want the full harness — scheduled execution against all 25+ MCPs, parallel reports across every channel, falsifiable Growth Tickets, the Truth Ledger — that ships with Cogny Cloud at $530/month. This is the configuration that compounds.

If you want a cheap way to first see Claude marketing running against your own data, Cogny Solo at $9/month is the entry point. Bring your own Claude, pick a single channel, use the starter MCPs. It is not the harness, but it is enough to find the first wins and decide whether the full Cloud configuration is worth it.

If you want to keep using raw Claude and bolt on tools yourself, that is also a legitimate choice — and our MCP server is available standalone for that purpose. Bring your own model, use our integrations. Some teams prefer that.

The thing not to do is the middle path — raw Claude with no tools, two-month-old chat logs, and a vague sense that AI is supposed to be helping with marketing. That is the version that gets abandoned after six weeks and convinces the team AI doesn't work yet. It works. It just needs a harness.

FAQ

What's the difference between an AI marketing agent and an AI marketing harness? The agent is the thing that runs — the LLM-driven loop that reads data, reasons about it, and produces output. The harness is everything around the agent that turns its output into a usable workflow: tools, scheduler, approval queue, audit trail, outcome measurement. You need both, but the harness is the part that determines whether the system actually compounds over time. See our piece on what an AI marketing agent is for the agent side of the picture.

Can I just install free skills and get the same thing? No. Skills make the agent better at specific tasks. They do not add live data integrations, scheduling, approval workflow, or outcome measurement. Those are harness features, not skill features. Free skills are an excellent input to a harness — they are not a substitute for one.

Is the Cogny harness Claude-specific? The current implementation runs on Claude (Opus 4.8 and Sonnet 4.6). We explained why we chose Claude over GPT for marketing reasoning. The harness layer — MCPs, scheduler, ticket workflow, Truth Ledger — is model-agnostic and could run other models with prompt re-engineering.

How does the approval-score predictor work? At ticket-creation time, we estimate how likely a human reviewer is to approve the ticket. It uses historical approval data from the same account plus features of the ticket itself (impact size, channel, similar prior tickets). Tickets below a threshold are filtered before they reach the queue. This keeps the queue useful instead of overwhelming.
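The filtering step can be illustrated with a deliberately simple stand-in. This is not the actual predictor: it uses only a smoothed per-channel approval rate, whereas the answer above describes more features. The function names, the Laplace smoothing, and the 0.4 threshold are all assumptions for illustration.

```python
def estimate_approval_score(channel: str, history: list[tuple[str, bool]]) -> float:
    """Smoothed share of past tickets on this channel that a human approved.

    Laplace smoothing (+1 / +2) means a channel with no history scores 0.5
    rather than 0 or 1.
    """
    decisions = [approved for ch, approved in history if ch == channel]
    return (sum(decisions) + 1) / (len(decisions) + 2)


def filter_queue(
    drafts: list[dict], history: list[tuple[str, bool]], threshold: float = 0.4
) -> list[dict]:
    """Keep drafts whose estimated approval score clears the threshold."""
    kept = []
    for draft in drafts:
        score = estimate_approval_score(draft["channel"], history)
        if score >= threshold:
            kept.append({**draft, "approval_score": score})
    return kept


# Made-up history: (channel, was_approved)
history = [("google_ads", True)] * 3 + [("meta_ads", False)] * 3
drafts = [
    {"title": "Pause low-intent keyword", "channel": "google_ads"},
    {"title": "Swap hero creative", "channel": "meta_ads"},
    {"title": "Tighten job-title targeting", "channel": "linkedin_ads"},
]
kept = filter_queue(drafts, history)
```

With this history the Google Ads and LinkedIn drafts reach the queue and the Meta draft is filtered out before a human sees it.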

Why do you need a Truth Ledger if Claude is already smart? Because Claude — like any other model — does not know what happened in your specific ad account after you closed the chat. The Truth Ledger gives the next cycle a fact base: which prior recommendations were approved, which were rejected and why, and what the metrics did afterwards. Without that ledger, the agent restarts every cycle. With it, the agent gets sharper about your account specifically.

What does this cost compared to using raw Claude? The full harness — scheduled execution, all 25+ MCPs, parallel reports, the Truth Ledger — ships with Cogny Cloud at $530/month. That is the configuration this whole piece is about. Cogny Solo at $9/month is a thinner entry tier (bring-your-own-Claude, starter MCPs, one channel at a time) — useful for solo operators or teams who want to see Claude marketing on their data before committing, but it does not include the harness components that make the system compound. The relevant comparison isn't model cost; it's the cost of building and maintaining the harness yourself. We estimate that's roughly a person-year of engineering. At any market salary, renting Cogny Cloud is dramatically cheaper than building the harness in-house.

About Berner Setterwall

Berner is CTO and co-founder of Cogny, where he leads the engineering of the AI marketing harness — MCPs, scheduler, ticket workflow, and the Truth Ledger. Previously he was a senior engineer at Campanja, building optimization systems for Netflix, Zalando, and other major brands. He specializes in AI architecture, large-scale data systems, and getting unreliable models to do reliable things in production.

Want to see the harness running against your own data?

The full harness ships with Cogny Cloud at $530/month — scheduled audits across every channel, all 25+ MCPs, falsifiable Growth Tickets, the Truth Ledger, organisational memory. This is the configuration that compounds.

If you want a $9 way to first see Claude marketing on your data, Cogny Solo is the entry tier — bring-your-own-Claude, starter MCP set, one channel at a time, 7-day free trial. Not the harness, but a useful place to start.