Why We Chose Claude Over GPT-4 for Marketing AI
Six months ago, I was convinced we'd build Cogny on GPT-4.
The reasoning was straightforward: GPT-4 was the most well-known model, had the largest ecosystem, and worked well in our early prototypes. We'd already built our proof-of-concept on OpenAI's API. Why switch?
Then I spent two weeks running side-by-side comparisons of GPT-4, Claude 3 Opus, and several other models on real marketing analytics tasks. The results surprised me—and ultimately led us to rebuild our entire platform on Claude.
This isn't a generic "Claude vs GPT" comparison. This is the technical story of why we made a bet-the-company decision to switch models, what we learned in production, and whether it was the right call.
The Context: What We're Actually Building
Before diving into model comparisons, it's important to understand what we needed the AI to do.
Cogny is an AI agent for marketing analytics. Users connect their data warehouse (usually BigQuery), ask questions in natural language, and get back analysis, insights, and recommendations.
A typical interaction:
- User: "Why did our Google Ads ROAS drop 15% last week?"
- AI: Analyzes campaign data, identifies the cause, explains the reasoning, suggests fixes
This requires several capabilities:
- Natural language understanding - interpreting vague, domain-specific questions
- SQL generation - writing complex queries against the user's schema
- Data analysis - finding patterns and anomalies in results
- Reasoning - connecting causes and effects
- Communication - explaining findings clearly to non-technical users
We needed a model that excelled at all five, not just a couple.
The Initial Hypothesis: GPT-4 Will Be Best
Our bias toward GPT-4 was based on conventional wisdom:
- Most developers were using it
- Largest ecosystem and community
- Strong performance on benchmarks
- We were already familiar with the API
So we built our proof-of-concept on GPT-4 Turbo. It worked... okay.
The problems we encountered:
- Inconsistent SQL generation - Sometimes perfect, sometimes syntactically correct but semantically wrong
- Verbose reasoning - Lots of explanation, not always insightful
- Overconfidence - Would confidently present incorrect analysis
- Context length issues - Large schemas often hit limits or degraded quality
Were these dealbreakers? Not necessarily. But they were concerning enough to make us question whether we were using the right foundation.
The Testing Process
I designed a test suite of 50 real marketing analytics tasks based on conversations with potential users:
Category 1: SQL Generation (15 tasks)
- Simple aggregations ("total spend by campaign")
- Complex joins (multiple tables, attribution logic)
- Window functions (cohort analysis, trend detection)
- Edge cases (NULL handling, timezone conversions)
Category 2: Analysis (15 tasks)
- Pattern recognition in campaign data
- Anomaly detection (what's unusual?)
- Cause identification (why did X happen?)
- Comparative analysis (this month vs last month)
Category 3: Reasoning (10 tasks)
- Multi-step problem solving
- Constraint satisfaction
- Trade-off evaluation
- Uncertainty handling
Category 4: Communication (10 tasks)
- Explaining technical findings to non-technical users
- Summarizing complex data
- Generating actionable recommendations
- Clarity under complexity
We tested four models:
- GPT-4 Turbo (gpt-4-turbo-preview at the time)
- GPT-4 (original, before turbo)
- Claude 3 Opus
- Claude 3 Sonnet (for cost comparison)
Each task was run 5 times per model to account for variance. We measured:
- Correctness (is the answer right?)
- Quality (how good is the explanation?)
- Consistency (how much does output vary?)
- Cost (inference price)
- Latency (response time)
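To make that concrete, here is a stripped-down sketch of the kind of harness this implies. It is not our actual test code: run_model and each task's grader are hypothetical stand-ins, and consistency is scored as the share of runs that agree with the most common answer (in practice we judged "substantively the same" by hand rather than by exact string match).

import statistics
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    grade_correct: Callable[[str], bool]  # task-specific correctness check (hypothetical grader)

def evaluate(models: dict[str, Callable[[str], str]], tasks: list[Task], runs: int = 5) -> dict:
    results = {}
    for model_name, run_model in models.items():
        correct = 0
        latencies = []
        per_task_answers = []
        for task in tasks:
            answers = []
            for _ in range(runs):
                start = time.time()
                answer = run_model(task.prompt)  # hypothetical call into the model under test
                latencies.append(time.time() - start)
                answers.append(answer)
                correct += task.grade_correct(answer)
            per_task_answers.append(answers)
        # Consistency: average share of runs agreeing with the most common answer per task.
        consistency = statistics.mean(
            Counter(a).most_common(1)[0][1] / len(a) for a in per_task_answers
        )
        results[model_name] = {
            "correctness": correct / (len(tasks) * runs),
            "consistency": consistency,
            "median_latency_s": statistics.median(latencies),
        }
    return results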
The Results: Claude Surprised Us
I expected GPT-4 and Claude to be roughly comparable with minor trade-offs. That's not what we found.
SQL Generation: Claude Won Clearly
Claude 3 Opus:
- 92% syntactically correct
- 87% semantically correct (did what was intended)
- Handled edge cases better (NULL, timezone, type coercion)
- Better at complex JOINs with multiple conditions
GPT-4 Turbo:
- 94% syntactically correct
- 78% semantically correct
- Struggled with implicit business logic
- Sometimes over-simplified complex queries
The key difference: Claude seemed to understand the intent behind queries better. GPT-4 more often wrote SQL that was technically correct but didn't answer the actual question.
Example task: "Show me campaigns where performance improved week-over-week but spend decreased"
GPT-4's approach:
SELECT campaign_name,
       current_week_conversions - prior_week_conversions as change
FROM campaigns
WHERE current_week_spend < prior_week_spend
Syntactically fine. But it never checks whether performance actually improved; it just computes the change in conversions without filtering on it.
Claude's approach:
WITH weekly_performance AS (
  SELECT
    campaign_name,
    week,
    spend,
    conversions / NULLIF(spend, 0) as roas
  FROM campaigns
)
SELECT
  cur.campaign_name,
  cur.roas - prior.roas as roas_improvement
FROM weekly_performance cur
JOIN weekly_performance prior
  ON cur.campaign_name = prior.campaign_name
  AND cur.week = prior.week + 1
WHERE cur.roas > prior.roas
  AND cur.spend < prior.spend
More complex, but actually answers the question asked.
Why this matters in production: Users don't ask perfectly precise questions. The model needs to infer what they actually want. Claude was meaningfully better at this.
Analysis Quality: Claude's Edge
We gave both models campaign performance data and asked: "What's happening here? What should we do?"
Claude's analysis:
- More structured reasoning
- Better at identifying multiple contributing factors
- Clearer separation of certainty vs speculation
- More actionable recommendations
GPT-4's analysis:
- Often focused on a single explanation
- Less nuanced about confidence levels
- Sometimes missed secondary patterns
- Recommendations were more generic
Example: Campaign performance dropped 20% over two weeks.
GPT-4 response: "Your campaign performance has declined by 20% over the past two weeks. This could be due to several factors including increased competition, audience fatigue, or seasonal trends. I recommend refreshing your creative and testing new audience segments."
Claude response: "I see a 20% performance decline, but let me break down what's actually happening:
Primary factor (high confidence): Your cost-per-click increased 35% starting 10 days ago, specifically for mobile placements. Desktop CPC remained stable.
Secondary factor (medium confidence): Conversion rate for the 25-34 age group dropped from 3.2% to 2.1%, but other age groups were unchanged.
Speculation: The timing coincides with iOS updates that often affect mobile ad delivery. The age-specific drop suggests either audience saturation or creative fatigue for that segment.
Recommendations (prioritized):
- Check for mobile bid adjustments or placement issues (likely quick fix)
- Analyze creative performance specifically for 25-34 mobile users
- Consider audience exclusions if we've been over-targeting that segment"
Notice the difference:
- Explicit confidence levels
- Specific data points referenced
- Prioritized recommendations
- Separation of facts, inference, and speculation
In production, this matters enormously. Users need to know what the AI is certain about vs what it's guessing.
Reasoning: Claude's Biggest Advantage
The most striking difference was in multi-step reasoning tasks.
Test scenario: "We want to improve ROAS. We can either increase budget by 30% or cut poor-performing campaigns. Budget increase would likely improve reach but might lower efficiency. What should we do?"
GPT-4 approach:
- Discussed trade-offs
- Suggested testing both
- Gave fairly generic strategic advice
Claude approach:
- Asked for current ROAS and marginal ROAS
- Estimated the impact of each approach based on historical data
- Calculated break-even scenarios
- Made a specific recommendation with conditions
The key insight: Claude was better at reasoning through problems rather than just describing them.
In another test, we asked models to debug why a query was running slowly. GPT-4 suggested generic optimizations. Claude analyzed the specific query structure, identified the problematic JOIN order, and explained why reordering would help.
Why? My hypothesis is that Claude's training emphasized reasoning processes, not just outputs. It shows its work more clearly.
Communication: Surprisingly Close
Both models were good at explaining complex findings clearly. Claude had a slight edge in structured communication, but GPT-4 was often more conversational.
For our use case, structure mattered more than warmth. Users wanted clear breakdowns, not friendly chat.
Consistency: Claude Won
We ran each task 5 times to measure output variance.
Claude: Remarkably consistent. 85%+ of the time, the substantive answer was the same across runs.
GPT-4: More variable. We'd sometimes get significantly different analyses on identical data.
In production, consistency is critical. If a user asks the same question twice and gets different answers, trust evaporates.
Cost: More Complicated Than It Looks
At face value, GPT-4 was cheaper:
- GPT-4 Turbo: ~$10/1M input tokens, ~$30/1M output tokens
- Claude 3 Opus: ~$15/1M input tokens, ~$75/1M output tokens
But in practice, Claude was often more cost-effective because:
- More concise outputs - GPT-4 was often verbose. Claude gave focused answers.
- Fewer retries - Better accuracy meant less "try again" loops
- Better context usage - Claude handled large contexts more gracefully
Real-world cost per query:
- GPT-4 Turbo: $0.08-0.15
- Claude 3 Opus: $0.10-0.18
Claude was slightly more expensive, but not dramatically so. Given the quality difference, the cost premium was worth it.
Latency: GPT-4 Faster, But Not By Much
Median response time (streaming):
- GPT-4 Turbo: 3.2 seconds to first token, 12s total
- Claude 3 Opus: 4.1 seconds to first token, 14s total
Not a dealbreaker difference. In a conversational interface, both felt fast enough.
The Decision Point
After two weeks of testing, the data was clear:
Claude was better at:
- Understanding intent behind questions (critical for natural language queries)
- Multi-step reasoning (essential for analysis)
- Consistency (necessary for trust)
- Structured communication (important for clarity)
GPT-4 was better at:
- Latency (slightly)
- Cost (slightly)
- Ecosystem size (more tools and libraries)
For our use case—marketing analytics where reasoning quality and consistency matter more than raw speed—Claude was the clear choice.
But we were already built on GPT-4. Switching would mean rewriting significant parts of our platform.
Tom and I had a long conversation about this. His question: "Are you sure Claude is worth the engineering cost of switching?"
My answer: "If we build on GPT-4, we'll be competing with everyone else using GPT-4. Claude gives us a technical edge."
We decided to switch.
The Migration Reality
Switching models sounds simple: change an API endpoint, right?
In practice, it required:
1. Prompt re-engineering
Claude and GPT-4 respond differently to the same prompts. What worked well for GPT-4 often produced worse results on Claude, and vice versa.
We had to:
- Rewrite our system prompts
- Adjust few-shot examples
- Change how we structured complex requests
- Optimize for Claude's reasoning style
Time investment: ~3 weeks of iteration
2. Tool calling adjustments
Both models support function calling, but with different interfaces and behaviors.
Claude's tool use is more explicit. GPT-4's is sometimes more flexible. We had to redesign how we expose database operations as tools.
Time investment: ~2 weeks
3. Context management
Claude handles large contexts well, but differently than GPT-4. We optimized our context preparation for Claude's architecture.
Time investment: ~1 week
4. Testing and validation
We couldn't just ship the switch. We ran A/B tests with real users comparing GPT-4 and Claude versions.
Time investment: ~2 weeks
Total migration cost: ~2 months of engineering time
Worth it? Absolutely. But it wasn't trivial.
What We Learned in Production
The test suite told us Claude was better. Production confirmed it, but with nuance.
Users Noticed the Difference
We ran an A/B test where 50% of users got GPT-4, 50% got Claude (they didn't know which). After a week, we surveyed them.
Results:
- 68% of Claude users rated the AI's analysis as "excellent"
- 42% of GPT-4 users rated it "excellent"
- Claude users were 2.3x more likely to say they "trusted" the AI's recommendations
The qualitative feedback was striking:
Claude users: "It feels like it actually understands my data" / "The explanations make sense"
GPT-4 users: "It's helpful but sometimes I need to ask follow-up questions" / "The analysis is good but generic"
Where Claude Excels
In production, Claude's advantages became even clearer:
1. Complex query scenarios
When users asked multi-part questions ("Compare performance across channels, segment by device, and identify the biggest opportunities"), Claude handled them more elegantly.
2. Uncertainty handling
Claude was better at saying "I don't have enough data to be confident about X, but here's what I can tell you about Y."
3. Following business logic
Marketing data has implicit rules: fiscal calendars, attribution windows, cohort definitions. Claude picked up on these contextual constraints better.
Where GPT-4 Held Its Own
There were areas where GPT-4 was competitive or better:
1. Creative tasks
When users asked for campaign messaging ideas or creative suggestions, GPT-4 was slightly more creative and varied.
2. Conversational flow
GPT-4 felt more "natural" in long conversations. Claude sometimes felt more formal.
3. Edge case creativity
When faced with totally novel requests, GPT-4 was sometimes more willing to try unconventional approaches.
Our response: We considered using GPT-4 for specific creative tasks and Claude for analytical tasks. But the complexity of managing two models wasn't worth the marginal gains.
The Technical Architecture
Here's how we use Claude in production:
1. System Prompt Engineering
We use a carefully tuned system prompt that:
- Defines the AI's role (marketing analytics expert)
- Sets reasoning guidelines (show your work, acknowledge uncertainty)
- Provides domain context (common marketing metrics, attribution models)
- Establishes output format (structured insights, prioritized recommendations)
Key learning: Claude responds well to explicit reasoning frameworks. We literally tell it "Think step-by-step" and it produces better outputs.
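To give a feel for the shape, here is a condensed, illustrative sketch of that kind of system prompt (not our production prompt, and the specific metrics and wording are examples):

SYSTEM_PROMPT = """\
You are a marketing analytics expert working with the user's BigQuery data.

Reasoning guidelines:
- Think step-by-step before answering.
- Separate facts from inference from speculation, and state your confidence.
- If the data is insufficient to answer, say so instead of guessing.

Domain context:
- ROAS = revenue / spend; CPC = spend / clicks; attribution windows vary by channel.

Output format:
- Lead with the key finding, then supporting evidence, then prioritized,
  actionable recommendations.
"""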
2. Context Management
For each user query, we build context:
- User's database schema (what tables/columns are available)
- Previous conversation history
- Relevant business context (industry, typical metrics)
- Recent query results (for follow-up questions)
Context size: Usually 20-40k tokens. Claude handles this well.
Optimization: We summarize old conversation history to keep context manageable.
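A simplified sketch of that assembly step is below. summarize() and count_tokens() are hypothetical helpers (the summarization can itself be a cheap model call), and the budget numbers are illustrative.

MAX_CONTEXT_TOKENS = 40_000
KEEP_VERBATIM_TURNS = 6

def build_context(schema_description: str, business_context: str,
                  conversation: list[str], recent_results: str) -> str:
    older = conversation[:-KEEP_VERBATIM_TURNS]
    recent = conversation[-KEEP_VERBATIM_TURNS:]
    parts = [
        "## Available tables and columns\n" + schema_description,
        "## Business context\n" + business_context,
    ]
    if older:
        # Older history gets summarized so the context stays manageable.
        parts.append("## Earlier conversation (summary)\n" + summarize(older))
    parts.append("## Recent turns\n" + "\n".join(recent))
    if recent_results:
        parts.append("## Latest query results\n" + recent_results)
    context = "\n\n".join(parts)
    if count_tokens(context) > MAX_CONTEXT_TOKENS:
        raise ValueError("Context too large; summarize or drop older material")
    return context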
3. Tool Use Pattern
We expose database operations as tools Claude can call:
- execute_query: Run SQL against BigQuery
- get_schema: Retrieve table structure
- list_tables: Show available data
Pattern:
- User asks question
- Claude determines what data it needs
- Claude calls tools to get that data
- Claude analyzes results
- Claude responds with insights
Why this works: Separating data retrieval from analysis lets Claude focus on reasoning, not query syntax.
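Concretely, the loop looks roughly like this with Anthropic's Python SDK. This is a simplified sketch: the tool list is cut down to a single tool, and run_tool stands in for a hypothetical dispatcher to our BigQuery helpers.

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "execute_query",
        "description": "Run a read-only SQL query against the user's BigQuery dataset.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
    # get_schema and list_tables are defined the same way.
]

def answer(question: str, system_prompt: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=2048,
            system=system_prompt,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),  # hypothetical dispatcher
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})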
4. Response Streaming
We stream Claude's responses token-by-token for better UX. Users see thinking happen in real-time, which builds trust.
Technical detail: We use Server-Sent Events (SSE) to stream from our backend to frontend.
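A minimal sketch of that path, shown here with FastAPI and the SDK's streaming helper (our actual backend differs in the details):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

@app.get("/ask")
def ask(question: str):
    def event_stream():
        with client.messages.stream(
            model="claude-3-5-sonnet-20240620",
            max_tokens=2048,
            messages=[{"role": "user", "content": question}],
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {text}\n\n"  # one SSE frame per chunk
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")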
5. Error Handling
Claude is good, but not perfect. We built guardrails:
- SQL validation before execution
- Cost limits on expensive queries
- Timeout handling for long-running analysis
- Fallback explanations when Claude is uncertain
Philosophy: Trust but verify. Let Claude reason freely, but validate outputs before presenting to users.
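As an example of what "verify" means in practice, here is a sketch of the SQL guardrail: reject anything that isn't a read, use BigQuery's dry-run mode to estimate bytes scanned before running, and cap cost and runtime. The thresholds are illustrative, not our production limits.

from google.cloud import bigquery

MAX_BYTES = 10 * 1024**3  # illustrative cap on bytes scanned per query

def safe_execute(client: bigquery.Client, sql: str):
    lowered = sql.strip().lower()
    if not (lowered.startswith("select") or lowered.startswith("with")):
        raise ValueError("Only read-only queries are allowed")
    # Dry run: BigQuery validates the SQL and reports bytes scanned without executing it.
    dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
    if dry.total_bytes_processed > MAX_BYTES:
        raise ValueError(f"Query would scan {dry.total_bytes_processed:,} bytes; too expensive")
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=MAX_BYTES)
    return client.query(sql, job_config=job_config).result(timeout=60)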
The Comparison Today (Claude 3.5 vs GPT-4o)
Since we made the initial decision, both models have been updated:
- Claude 3.5 Sonnet is better and cheaper than Opus
- GPT-4o is faster and cheaper than GPT-4 Turbo
Would we make the same decision today? Yes, but it's closer.
Claude 3.5 Sonnet:
- Better reasoning than Claude 3 Opus
- Cheaper ($3/M input, $15/M output tokens)
- Slightly faster
- Maintains Claude's structured thinking advantage
GPT-4o:
- Significantly faster than GPT-4 Turbo
- Cheaper
- Better at multimodal tasks
- Improved reasoning over previous versions
Current assessment: Claude 3.5 Sonnet is still our choice for analytical reasoning, but the gap has narrowed. If we were building a more conversational or creative product, GPT-4o would be tempting.
What We'd Do Differently
Knowing what I know now, here's what I'd change:
1. Start with both models
We should have architected for model flexibility from day one. Being able to A/B test models continuously would be valuable (a sketch of what that abstraction might look like follows this list).
2. Build better evaluation infrastructure
We did good testing before switching, but we should have automated more of it. Continuous model evaluation would help us catch regressions.
3. Invest more in prompt engineering earlier
We spent months iterating on prompts. Starting with more rigorous prompt engineering would have accelerated development.
4. Plan for model updates
Model versions change. We should have built better versioning and testing for model updates.
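On the first point, here is roughly what a model-agnostic layer could look like: one interface, per-provider adapters, swappable behind a single call site. A simplified sketch that ignores tools and streaming; the model names are the ones current as of writing.

from abc import ABC, abstractmethod
import anthropic
import openai

class ChatModel(ABC):
    @abstractmethod
    def complete(self, system: str, user: str) -> str: ...

class ClaudeModel(ChatModel):
    def __init__(self, model: str = "claude-3-5-sonnet-20240620"):
        self.client, self.model = anthropic.Anthropic(), model

    def complete(self, system: str, user: str) -> str:
        resp = self.client.messages.create(
            model=self.model, max_tokens=2048, system=system,
            messages=[{"role": "user", "content": user}],
        )
        return resp.content[0].text  # assumes the first block is text

class OpenAIModel(ChatModel):
    def __init__(self, model: str = "gpt-4o"):
        self.client, self.model = openai.OpenAI(), model

    def complete(self, system: str, user: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

With this in place, an A/B test is just a matter of routing a share of requests to a different ChatModel instance and logging which one served each conversation.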
The Honest Truth
Here's what I tell other technical founders asking about model choice:
There's no universal answer. The right model depends on your use case.
For analytical reasoning, complex queries, and structured output: Claude is probably better.
For creative tasks, conversational flow, and broad knowledge: GPT-4 is very competitive.
For cost-sensitive applications: Claude 3.5 Sonnet offers the best quality-to-cost ratio we've found.
For maximum flexibility: Build model-agnostic from the start.
The landscape changes fast. What's true today might not be true in six months. The models are improving rapidly, and new competitors are emerging.
The real competitive advantage isn't which model you use—it's how well you use it.
Prompt engineering, context management, tool design, and error handling matter more than model selection. We've seen brilliant applications built on GPT-3.5 and mediocre ones on Claude 3.5.
Why This Matters
I'm writing this not to convince everyone to use Claude, but to share the thinking process behind a critical technical decision.
When you're building an AI-powered product, model selection isn't just a technical choice—it's a strategic one. The model you choose affects:
- Product capabilities
- User experience
- Cost structure
- Competitive positioning
- Technical debt
We bet Cogny on Claude because the model's strengths aligned with our product's needs. That bet has paid off, but it required real engineering investment to maximize the benefits.
The takeaway: Test rigorously, understand your requirements, and be willing to invest in optimization. The default choice isn't always the right choice.
And in our case, going against the default (GPT-4) and choosing Claude turned out to be one of the best technical decisions we made.
---
About Berner Setterwall
Berner is CTO and co-founder of Cogny, where he's building AI-powered marketing automation on top of Claude. Previously, he was a senior engineer at Campanja, building optimization systems for Netflix, Zalando, and other major brands. He specializes in AI architecture, large-scale data systems, and making complex technology work reliably in production.
Want to see Claude in action for marketing analytics?
Cogny uses Claude to power real-time marketing analysis and insights. See how we've built AI that actually understands your data and delivers actionable recommendations. Book a demo to experience the difference.