Why We Chose Claude Over GPT-4 for Marketing AI
Six months ago, I was convinced we'd build Cogny on GPT-4.
The reasoning was straightforward: GPT-4 was the most well-known model, had the largest ecosystem, and worked well in our early prototypes. We'd already built our proof-of-concept on OpenAI's API. Why switch?
Then I spent two weeks running side-by-side comparisons of GPT-4, Claude 3 Opus, and several other models on real marketing analytics tasks. The results surprised me—and ultimately led us to rebuild our entire platform on Claude.
This isn't a generic "Claude vs GPT" comparison. This is the technical story of why we made a bet-the-company decision to switch models, what we learned in production, and whether it was the right call.
The Context: What We're Actually Building
Before diving into model comparisons, it's important to understand what we needed the AI to do.
Cogny is an AI agent for marketing analytics. Users connect their data warehouse (usually BigQuery), ask questions in natural language, and get back analysis, insights, and recommendations.
A typical interaction:
- User: "Why did our Google Ads ROAS drop 15% last week?"
- AI: Analyzes campaign data, identifies the cause, explains the reasoning, suggests fixes
This requires several capabilities:
- Natural language understanding - interpreting vague, domain-specific questions
- SQL generation - writing complex queries against the user's schema
- Data analysis - finding patterns and anomalies in results
- Reasoning - connecting causes and effects
- Communication - explaining findings clearly to non-technical users
We needed a model that excelled at all five, not just a couple.
The Initial Hypothesis: GPT-4 Will Be Best
Our bias toward GPT-4 was based on conventional wisdom:
- Most developers were using it
- Largest ecosystem and community
- Strong performance on benchmarks
- We were already familiar with the API
So we built our proof-of-concept on GPT-4 Turbo. It worked... okay.
The problems we encountered:
- Inconsistent SQL generation - Sometimes perfect, sometimes syntactically correct but semantically wrong
- Verbose reasoning - Lots of explanation, not always insightful
- Overconfidence - Would confidently present incorrect analysis
- Context length issues - Large schemas often hit limits or degraded quality
Were these dealbreakers? Not necessarily. But they were concerning enough to make us question whether we were using the right foundation.
The Testing Process
I designed a test suite of 50 real marketing analytics tasks based on conversations with potential users:
Category 1: SQL Generation (15 tasks)
- Simple aggregations ("total spend by campaign")
- Complex joins (multiple tables, attribution logic)
- Window functions (cohort analysis, trend detection)
- Edge cases (NULL handling, timezone conversions)
Category 2: Analysis (15 tasks)
- Pattern recognition in campaign data
- Anomaly detection (what's unusual?)
- Cause identification (why did X happen?)
- Comparative analysis (this month vs last month)
Category 3: Reasoning (10 tasks)
- Multi-step problem solving
- Constraint satisfaction
- Trade-off evaluation
- Uncertainty handling
Category 4: Communication (10 tasks)
- Explaining technical findings to non-technical users
- Summarizing complex data
- Generating actionable recommendations
- Clarity under complexity
We tested four models:
- GPT-4 Turbo (gpt-4-turbo-preview at the time)
- GPT-4 (original, before turbo)
- Claude 3 Opus
- Claude 3 Sonnet (for cost comparison)
Each task was run 5 times per model to account for variance. We measured:
- Correctness (is the answer right?)
- Quality (how good is the explanation?)
- Consistency (how much does output vary?)
- Cost (inference price)
- Latency (response time)
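To make that concrete, here is a stripped-down sketch of the kind of harness this implies. It is not our actual test code: run_model and each task's grader are hypothetical stand-ins, and consistency is scored as the share of runs that agree with the most common answer (in practice we judged "substantively the same" by hand rather than by exact string match).

import statistics
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    grade_correct: Callable[[str], bool]  # task-specific correctness check (hypothetical grader)

def evaluate(models: dict[str, Callable[[str], str]], tasks: list[Task], runs: int = 5) -> dict:
    results = {}
    for model_name, run_model in models.items():
        correct = 0
        latencies = []
        per_task_answers = []
        for task in tasks:
            answers = []
            for _ in range(runs):
                start = time.time()
                answer = run_model(task.prompt)  # hypothetical call into the model under test
                latencies.append(time.time() - start)
                answers.append(answer)
                correct += task.grade_correct(answer)
            per_task_answers.append(answers)
        # Consistency: average share of runs agreeing with the most common answer per task.
        consistency = statistics.mean(
            Counter(a).most_common(1)[0][1] / len(a) for a in per_task_answers
        )
        results[model_name] = {
            "correctness": correct / (len(tasks) * runs),
            "consistency": consistency,
            "median_latency_s": statistics.median(latencies),
        }
    return results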
The Results: Claude Surprised Us
I expected GPT-4 and Claude to be roughly comparable with minor trade-offs. That's not what we found.
SQL Generation: Claude Won Clearly
Claude 3 Opus:
- 92% syntactically correct
- 87% semantically correct (did what was intended)
- Handled edge cases better (NULL, timezone, type coercion)
- Better at complex JOINs with multiple conditions
GPT-4 Turbo:
- 94% syntactically correct
- 78% semantically correct
- Struggled with implicit business logic
- Sometimes over-simplified complex queries
The key difference: Claude seemed to understand the intent behind queries better. GPT-4 more often wrote SQL that was technically correct but didn't answer the actual question.
Example task: "Show me campaigns where performance improved week-over-week but spend decreased"
GPT-4's approach:
SELECT campaign_name,
       current_week_conversions - prior_week_conversions as change
FROM campaigns
WHERE current_week_spend < prior_week_spend
Syntactically fine. But it never checks whether performance actually improved; it just computes the change in conversions without filtering on it.
Claude's approach:
WITH weekly_performance AS (
  SELECT
    campaign_name,
    week,
    spend,
    conversions / NULLIF(spend, 0) as roas
  FROM campaigns
)
SELECT
  cur.campaign_name,
  cur.roas - prior.roas as roas_improvement
FROM weekly_performance cur
JOIN weekly_performance prior
  ON cur.campaign_name = prior.campaign_name
  AND cur.week = prior.week + 1
WHERE cur.roas > prior.roas
  AND cur.spend < prior.spend
More complex, but actually answers the question asked.
Why this matters in production: Users don't ask perfectly precise questions. The model needs to infer what they actually want. Claude was meaningfully better at this.
Analysis Quality: Claude's Edge
We gave both models campaign performance data and asked: "What's happening here? What should we do?"
Claude's analysis:
- More structured reasoning
- Better at identifying multiple contributing factors
- Clearer separation of certainty vs speculation
- More actionable recommendations
GPT-4's analysis:
- Often focused on a single explanation
- Less nuanced about confidence levels
- Sometimes missed secondary patterns
- Recommendations were more generic
Example: Campaign performance dropped 20% over two weeks.
GPT-4 response: "Your campaign performance has declined by 20% over the past two weeks. This could be due to several factors including increased competition, audience fatigue, or seasonal trends. I recommend refreshing your creative and testing new audience segments."
Claude response: "I see a 20% performance decline, but let me break down what's actually happening:
Primary factor (high confidence): Your cost-per-click increased 35% starting 10 days ago, specifically for mobile placements. Desktop CPC remained stable.
Secondary factor (medium confidence): Conversion rate for the 25-34 age group dropped from 3.2% to 2.1%, but other age groups were unchanged.
Speculation: The timing coincides with iOS updates that often affect mobile ad delivery. The age-specific drop suggests either audience saturation or creative fatigue for that segment.
Recommendations (prioritized):
- Check for mobile bid adjustments or placement issues (likely quick fix)
- Analyze creative performance specifically for 25-34 mobile users
- Consider audience exclusions if we've been over-targeting that segment"
Notice the difference:
- Explicit confidence levels
- Specific data points referenced
- Prioritized recommendations
- Separation of facts, inference, and speculation
In production, this matters enormously. Users need to know what the AI is certain about vs what it's guessing.
Reasoning: Claude's Biggest Advantage
The most striking difference was in multi-step reasoning tasks.
Test scenario: "We want to improve ROAS. We can either increase budget by 30% or cut poor-performing campaigns. Budget increase would likely improve reach but might lower efficiency. What should we do?"
GPT-4 approach:
- Discussed trade-offs
- Suggested testing both
- Gave fairly generic strategic advice
Claude approach:
- Asked for current ROAS and marginal ROAS
- Estimated the impact of each approach based on historical data
- Calculated break-even scenarios
- Made a specific recommendation with conditions
The key insight: Claude was better at reasoning through problems rather than just describing them.
In another test, we asked models to debug why a query was running slowly. GPT-4 suggested generic optimizations. Claude analyzed the specific query structure, identified the problematic JOIN order, and explained why reordering would help.
Why? My hypothesis is that Claude's training emphasized reasoning processes, not just outputs. It shows its work more clearly.
Communication: Surprisingly Close
Both models were good at explaining complex findings clearly. Claude had a slight edge in structured communication, but GPT-4 was often more conversational.
For our use case, structure mattered more than warmth. Users wanted clear breakdowns, not friendly chat.
Consistency: Claude Won
We ran each task 5 times to measure output variance.
Claude: Remarkably consistent. 85%+ of the time, the substantive answer was the same across runs.
GPT-4: More variable. We'd sometimes get significantly different analyses on identical data.
In production, consistency is critical. If a user asks the same question twice and gets different answers, trust evaporates.
Cost: More Complicated Than It Looks
At face value, GPT-4 was cheaper:
- GPT-4 Turbo: ~$10/1M input tokens, ~$30/1M output tokens
- Claude 3 Opus: ~$15/1M input tokens, ~$75/1M output tokens
But in practice, Claude was often more cost-effective because:
- More concise outputs - GPT-4 was often verbose. Claude gave focused answers.
- Fewer retries - Better accuracy meant less "try again" loops
- Better context usage - Claude handled large contexts more gracefully
Real-world cost per query:
- GPT-4 Turbo: $0.08-0.15
- Claude 3 Opus: $0.10-0.18
Claude was slightly more expensive, but not dramatically so. Given the quality difference, the cost premium was worth it.
Latency: GPT-4 Faster, But Not By Much
Median response time (streaming):
- GPT-4 Turbo: 3.2 seconds to first token, 12s total
- Claude 3 Opus: 4.1 seconds to first token, 14s total
Not a dealbreaker difference. In a conversational interface, both felt fast enough.
The Decision Point
After two weeks of testing, the data was clear:
Claude was better at:
- Understanding intent behind questions (critical for natural language queries)
- Multi-step reasoning (essential for analysis)
- Consistency (necessary for trust)
- Structured communication (important for clarity)
GPT-4 was better at:
- Latency (slightly)
- Cost (slightly)
- Ecosystem size (more tools and libraries)
For our use case—marketing analytics where reasoning quality and consistency matter more than raw speed—Claude was the clear choice.
But we were already built on GPT-4. Switching would mean rewriting significant parts of our platform.
Tom and I had a long conversation about this. His question: "Are you sure Claude is worth the engineering cost of switching?"
My answer: "If we build on GPT-4, we'll be competing with everyone else using GPT-4. Claude gives us a technical edge."
We decided to switch.
The Migration Reality
Switching models sounds simple: change an API endpoint, right?
In practice, it required:
1. Prompt re-engineering
Claude and GPT-4 respond differently to the same prompts. What worked well for GPT-4 often produced worse results on Claude, and vice versa.
We had to:
- Rewrite our system prompts
- Adjust few-shot examples
- Change how we structured complex requests
- Optimize for Claude's reasoning style
Time investment: ~3 weeks of iteration
2. Tool calling adjustments
Both models support function calling, but with different interfaces and behaviors.
Claude's tool use is more explicit. GPT-4's is sometimes more flexible. We had to redesign how we expose database operations as tools.
Time investment: ~2 weeks
3. Context management
Claude handles large contexts well, but differently than GPT-4. We optimized our context preparation for Claude's architecture.
Time investment: ~1 week
4. Testing and validation
We couldn't just ship the switch. We ran A/B tests with real users comparing GPT-4 and Claude versions.
Time investment: ~2 weeks
Total migration cost: ~2 months of engineering time
Worth it? Absolutely. But it wasn't trivial.
What We Learned in Production
The test suite told us Claude was better. Production confirmed it, but with nuance.
Users Noticed the Difference
We ran an A/B test where 50% of users got GPT-4, 50% got Claude (they didn't know which). After a week, we surveyed them.
Results:
- 68% of Claude users rated the AI's analysis as "excellent"
- 42% of GPT-4 users rated it "excellent"
- Claude users were 2.3x more likely to say they "trusted" the AI's recommendations
The qualitative feedback was striking:
Claude users: "It feels like it actually understands my data" / "The explanations make sense"
GPT-4 users: "It's helpful but sometimes I need to ask follow-up questions" / "The analysis is good but generic"
Where Claude Excels
In production, Claude's advantages became even clearer:
1. Complex query scenarios
When users asked multi-part questions ("Compare performance across channels, segment by device, and identify the biggest opportunities"), Claude handled them more elegantly.
2. Uncertainty handling
Claude was better at saying "I don't have enough data to be confident about X, but here's what I can tell you about Y."
3. Following business logic
Marketing data has implicit rules: fiscal calendars, attribution windows, cohort definitions. Claude picked up on these contextual constraints better.
Where GPT-4 Held Its Own
There were areas where GPT-4 was competitive or better:
1. Creative tasks
When users asked for campaign messaging ideas or creative suggestions, GPT-4 was slightly more creative and varied.
2. Conversational flow
GPT-4 felt more "natural" in long conversations. Claude sometimes felt more formal.
3. Edge case creativity
When faced with totally novel requests, GPT-4 was sometimes more willing to try unconventional approaches.
Our response: We considered using GPT-4 for specific creative tasks and Claude for analytical tasks. But the complexity of managing two models wasn't worth the marginal gains.
The Technical Architecture
Here's how we use Claude in production:
1. System Prompt Engineering
We use a carefully tuned system prompt that:
- Defines the AI's role (marketing analytics expert)
- Sets reasoning guidelines (show your work, acknowledge uncertainty)
- Provides domain context (common marketing metrics, attribution models)
- Establishes output format (structured insights, prioritized recommendations)
Key learning: Claude responds well to explicit reasoning frameworks. We literally tell it "Think step-by-step" and it produces better outputs.
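To give a feel for the shape, here is a condensed, illustrative sketch of that kind of system prompt (not our production prompt, and the specific metrics and wording are examples):

SYSTEM_PROMPT = """\
You are a marketing analytics expert working with the user's BigQuery data.

Reasoning guidelines:
- Think step-by-step before answering.
- Separate facts from inference from speculation, and state your confidence.
- If the data is insufficient to answer, say so instead of guessing.

Domain context:
- ROAS = revenue / spend; CPC = spend / clicks; attribution windows vary by channel.

Output format:
- Lead with the key finding, then supporting evidence, then prioritized,
  actionable recommendations.
"""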
2. Context Management
For each user query, we build context:
- User's database schema (what tables/columns are available)
- Previous conversation history
- Relevant business context (industry, typical metrics)
- Recent query results (for follow-up questions)
Context size: Usually 20-40k tokens. Claude handles this well.
Optimization: We summarize old conversation history to keep context manageable.
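A simplified sketch of that assembly step is below. summarize() and count_tokens() are hypothetical helpers (the summarization can itself be a cheap model call), and the budget numbers are illustrative.

MAX_CONTEXT_TOKENS = 40_000
KEEP_VERBATIM_TURNS = 6

def build_context(schema_description: str, business_context: str,
                  conversation: list[str], recent_results: str) -> str:
    older = conversation[:-KEEP_VERBATIM_TURNS]
    recent = conversation[-KEEP_VERBATIM_TURNS:]
    parts = [
        "## Available tables and columns\n" + schema_description,
        "## Business context\n" + business_context,
    ]
    if older:
        # Older history gets summarized so the context stays manageable.
        parts.append("## Earlier conversation (summary)\n" + summarize(older))
    parts.append("## Recent turns\n" + "\n".join(recent))
    if recent_results:
        parts.append("## Latest query results\n" + recent_results)
    context = "\n\n".join(parts)
    if count_tokens(context) > MAX_CONTEXT_TOKENS:
        raise ValueError("Context too large; summarize or drop older material")
    return context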
3. Tool Use Pattern
We expose database operations as tools Claude can call:
- execute_query: Run SQL against BigQuery
- get_schema: Retrieve table structure
- list_tables: Show available data
Pattern:
- User asks question
- Claude determines what data it needs
- Claude calls tools to get that data
- Claude analyzes results
- Claude responds with insights
Why this works: Separating data retrieval from analysis lets Claude focus on reasoning, not query syntax.
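Concretely, the loop looks roughly like this with Anthropic's Python SDK. This is a simplified sketch: the tool list is cut down to a single tool, and run_tool stands in for a hypothetical dispatcher to our BigQuery helpers.

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "execute_query",
        "description": "Run a read-only SQL query against the user's BigQuery dataset.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
    # get_schema and list_tables are defined the same way.
]

def answer(question: str, system_prompt: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=2048,
            system=system_prompt,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),  # hypothetical dispatcher
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})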
4. Response Streaming
We stream Claude's responses token-by-token for better UX. Users see thinking happen in real-time, which builds trust.
Technical detail: We use Server-Sent Events (SSE) to stream from our backend to frontend.
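A minimal sketch of that path, shown here with FastAPI and the SDK's streaming helper (our actual backend differs in the details):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

@app.get("/ask")
def ask(question: str):
    def event_stream():
        with client.messages.stream(
            model="claude-3-5-sonnet-20240620",
            max_tokens=2048,
            messages=[{"role": "user", "content": question}],
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {text}\n\n"  # one SSE frame per chunk
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")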
5. Error Handling
Claude is good, but not perfect. We built guardrails:
- SQL validation before execution
- Cost limits on expensive queries
- Timeout handling for long-running analysis
- Fallback explanations when Claude is uncertain
Philosophy: Trust but verify. Let Claude reason freely, but validate outputs before presenting to users.
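As an example of what "verify" means in practice, here is a sketch of the SQL guardrail: reject anything that isn't a read, use BigQuery's dry-run mode to estimate bytes scanned before running, and cap cost and runtime. The thresholds are illustrative, not our production limits.

from google.cloud import bigquery

MAX_BYTES = 10 * 1024**3  # illustrative cap on bytes scanned per query

def safe_execute(client: bigquery.Client, sql: str):
    lowered = sql.strip().lower()
    if not (lowered.startswith("select") or lowered.startswith("with")):
        raise ValueError("Only read-only queries are allowed")
    # Dry run: BigQuery validates the SQL and reports bytes scanned without executing it.
    dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
    if dry.total_bytes_processed > MAX_BYTES:
        raise ValueError(f"Query would scan {dry.total_bytes_processed:,} bytes; too expensive")
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=MAX_BYTES)
    return client.query(sql, job_config=job_config).result(timeout=60)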
The Comparison Today (Claude 3.5 vs GPT-4o)
Since we made the initial decision, both models have been updated:
- Claude 3.5 Sonnet is better and cheaper than Opus
- GPT-4o is faster and cheaper than GPT-4 Turbo
Would we make the same decision today? Yes, but it's closer.
Claude 3.5 Sonnet:
- Better reasoning than Claude 3 Opus
- Cheaper ($3/M input, $15/M output tokens)
- Slightly faster
- Maintains Claude's structured thinking advantage
GPT-4o:
- Significantly faster than GPT-4 Turbo
- Cheaper
- Better at multimodal tasks
- Improved reasoning over previous versions
Current assessment: Claude 3.5 Sonnet is still our choice for analytical reasoning, but the gap has narrowed. If we were building a more conversational or creative product, GPT-4o would be tempting.
What We'd Do Differently
Knowing what I know now, here's what I'd change:
1. Start with both models
We should have architected for model flexibility from day one. Being able to A/B test models continuously would be valuable (a sketch of what that abstraction might look like follows this list).
2. Build better evaluation infrastructure
We did good testing before switching, but we should have automated more of it. Continuous model evaluation would help us catch regressions.
3. Invest more in prompt engineering earlier
We spent months iterating on prompts. Starting with more rigorous prompt engineering would have accelerated development.
4. Plan for model updates
Model versions change. We should have built better versioning and testing for model updates.
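On the first point, here is roughly what a model-agnostic layer could look like: one interface, per-provider adapters, swappable behind a single call site. A simplified sketch that ignores tools and streaming; the model names are the ones current as of writing.

from abc import ABC, abstractmethod
import anthropic
import openai

class ChatModel(ABC):
    @abstractmethod
    def complete(self, system: str, user: str) -> str: ...

class ClaudeModel(ChatModel):
    def __init__(self, model: str = "claude-3-5-sonnet-20240620"):
        self.client, self.model = anthropic.Anthropic(), model

    def complete(self, system: str, user: str) -> str:
        resp = self.client.messages.create(
            model=self.model, max_tokens=2048, system=system,
            messages=[{"role": "user", "content": user}],
        )
        return resp.content[0].text  # assumes the first block is text

class OpenAIModel(ChatModel):
    def __init__(self, model: str = "gpt-4o"):
        self.client, self.model = openai.OpenAI(), model

    def complete(self, system: str, user: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

With this in place, an A/B test is just a matter of routing a share of requests to a different ChatModel instance and logging which one served each conversation.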
The Honest Truth
Here's what I tell other technical founders asking about model choice:
There's no universal answer. The right model depends on your use case.
For analytical reasoning, complex queries, and structured output: Claude is probably better.
For creative tasks, conversational flow, and broad knowledge: GPT-4 is very competitive.
For cost-sensitive applications: Claude 3.5 Sonnet offers the best quality-to-cost ratio we've found.
For maximum flexibility: Build model-agnostic from the start.
The landscape changes fast. What's true today might not be true in six months. The models are improving rapidly, and new competitors are emerging.
The real competitive advantage isn't which model you use—it's how well you use it.
Prompt engineering, context management, tool design, and error handling matter more than model selection. We've seen brilliant applications built on GPT-3.5 and mediocre ones on Claude 3.5.
Why This Matters
I'm writing this not to convince everyone to use Claude, but to share the thinking process behind a critical technical decision.
When you're building an AI-powered product, model selection isn't just a technical choice—it's a strategic one. The model you choose affects:
- Product capabilities
- User experience
- Cost structure
- Competitive positioning
- Technical debt
We bet Cogny on Claude because the model's strengths aligned with our product's needs. That bet has paid off, but it required real engineering investment to maximize the benefits.
The takeaway: Test rigorously, understand your requirements, and be willing to invest in optimization. The default choice isn't always the right choice.
And in our case, going against the default (GPT-4) and choosing Claude turned out to be one of the best technical decisions we made.
---
About Berner Setterwall
Berner is CTO and co-founder of Cogny, where he's building AI-powered marketing automation on top of Claude. Previously, he was a senior engineer at Campanja, building optimization systems for Netflix, Zalando, and other major brands. He specializes in AI architecture, large-scale data systems, and making complex technology work reliably in production.
Want to see Claude in action for marketing analytics?
Cogny uses Claude to power real-time marketing analysis and insights. See how we've built AI that actually understands your data and delivers actionable recommendations. Book a demo to experience the difference.