
    Why We Chose Claude Over GPT-4 for Marketing AI

    Berner Setterwall · January 17, 2025

    Six months ago, I was convinced we'd build Cogny on GPT-4.

    The reasoning was straightforward: GPT-4 was the most well-known model, had the largest ecosystem, and worked well in our early prototypes. We'd already built our proof-of-concept on OpenAI's API. Why switch?

    Then I spent two weeks running side-by-side comparisons of GPT-4, Claude 3 Opus, and several other models on real marketing analytics tasks. The results surprised me—and ultimately led us to rebuild our entire platform on Claude.

    This isn't a generic "Claude vs GPT" comparison. This is the technical story of why we made a bet-the-company decision to switch models, what we learned in production, and whether it was the right call.

    The Context: What We're Actually Building

    Before diving into model comparisons, it's important to understand what we needed the AI to do.

    Cogny is an AI agent for marketing analytics. Users connect their data warehouse (usually BigQuery), ask questions in natural language, and get back analysis, insights, and recommendations.

    A typical interaction:

    • User: "Why did our Google Ads ROAS drop 15% last week?"
    • AI: Analyzes campaign data, identifies the cause, explains the reasoning, suggests fixes

    This requires several capabilities:

    • Natural language understanding - interpreting vague, domain-specific questions
    • SQL generation - writing complex queries against the user's schema
    • Data analysis - finding patterns and anomalies in results
    • Reasoning - connecting causes and effects
    • Communication - explaining findings clearly to non-technical users

    We needed a model that excelled at all five, not just a couple.

    The Initial Hypothesis: GPT-4 Will Be Best

    Our bias toward GPT-4 was based on conventional wisdom:

    • Most developers were using it
    • Largest ecosystem and community
    • Strong performance on benchmarks
    • We were already familiar with the API

    So we built our proof-of-concept on GPT-4 Turbo. It worked... okay.

    The problems we encountered:

    • Inconsistent SQL generation - Sometimes perfect, sometimes syntactically correct but semantically wrong
    • Verbose reasoning - Lots of explanation, not always insightful
    • Overconfidence - Would confidently present incorrect analysis
    • Context length issues - Large schemas often hit limits or degraded quality

    Were these dealbreakers? Not necessarily. But they were concerning enough to make us question whether we were using the right foundation.

    The Testing Process

    I designed a test suite of 50 real marketing analytics tasks based on conversations with potential users:

    Category 1: SQL Generation (15 tasks)

    • Simple aggregations ("total spend by campaign")
    • Complex joins (multiple tables, attribution logic)
    • Window functions (cohort analysis, trend detection)
    • Edge cases (NULL handling, timezone conversions)

    Category 2: Analysis (15 tasks)

    • Pattern recognition in campaign data
    • Anomaly detection (what's unusual?)
    • Cause identification (why did X happen?)
    • Comparative analysis (this month vs last month)

    Category 3: Reasoning (10 tasks)

    • Multi-step problem solving
    • Constraint satisfaction
    • Trade-off evaluation
    • Uncertainty handling

    Category 4: Communication (10 tasks)

    • Explaining technical findings to non-technical users
    • Summarizing complex data
    • Generating actionable recommendations
    • Clarity under complexity

    We tested four models:

    • GPT-4 Turbo (gpt-4-turbo-preview at the time)
    • GPT-4 (original, before turbo)
    • Claude 3 Opus
    • Claude 3 Sonnet (for cost comparison)

    Each task was run 5 times per model to account for variance. For each run we measured the following (a small harness sketch follows the list):

    • Correctness (is the answer right?)
    • Quality (how good is the explanation?)
    • Consistency (how much does output vary?)
    • Cost (inference price)
    • Latency (response time)
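
    For readers who want to see the shape of this, here's a minimal harness sketch. The task format, the model runner, and the scoring function are placeholders, not our production tooling; the point is the loop: many runs per task, per model, with the same metrics captured each time.

    # Minimal evaluation-harness sketch; task format, model runner, and scoring
    # function are placeholders, not our production tooling.
    import statistics
    import time

    RUNS_PER_TASK = 5

    def evaluate(models, tasks, run_model, score_answer):
        """Run each task RUNS_PER_TASK times per model; aggregate score, consistency, latency."""
        results = {}
        for model in models:
            scores, consistencies, latencies = [], [], []
            for task in tasks:
                answers = []
                for _ in range(RUNS_PER_TASK):
                    start = time.monotonic()
                    answer = run_model(model, task["prompt"])   # call the model under test
                    latencies.append(time.monotonic() - start)
                    answers.append(answer)
                    scores.append(score_answer(task, answer))   # correctness/quality rubric
                # Crude consistency proxy: share of runs matching the most common answer.
                most_common = max(set(answers), key=answers.count)
                consistencies.append(answers.count(most_common) / RUNS_PER_TASK)
            results[model] = {
                "mean_score": statistics.mean(scores),
                "mean_consistency": statistics.mean(consistencies),
                "median_latency_s": statistics.median(latencies),
            }
        return results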

    The Results: Claude Surprised Us

    I expected GPT-4 and Claude to be roughly comparable with minor trade-offs. That's not what we found.

    SQL Generation: Claude Won Clearly

    Claude 3 Opus:

    • 92% syntactically correct
    • 87% semantically correct (did what was intended)
    • Handled edge cases better (NULL, timezone, type coercion)
    • Better at complex JOINs with multiple conditions

    GPT-4 Turbo:

    • 94% syntactically correct
    • 78% semantically correct
    • Struggled with implicit business logic
    • Sometimes over-simplified complex queries

    The key difference: Claude seemed to understand the intent behind queries better. GPT-4 would write technically correct SQL that didn't answer the actual question.

    Example task: "Show me campaigns where performance improved week-over-week but spend decreased"

    GPT-4's approach:

    SELECT
      campaign_name,
      current_week_conversions - prior_week_conversions as change
    FROM campaigns
    WHERE current_week_spend < prior_week_spend

    Syntactically fine. But it doesn't actually check if performance improved—just that conversions changed.

    Claude's approach:

    WITH weekly_performance AS (
      SELECT
        campaign_name,
        week,
        spend,
        conversions / NULLIF(spend, 0) as roas
      FROM campaigns
    )
    SELECT
      cur.campaign_name,
      cur.roas - prior.roas as roas_improvement
    FROM weekly_performance cur
    JOIN weekly_performance prior
      ON cur.campaign_name = prior.campaign_name
      AND cur.week = prior.week + 1
    WHERE cur.roas > prior.roas
      AND cur.spend < prior.spend

    More complex, but actually answers the question asked.

    Why this matters in production: Users don't ask perfectly precise questions. The model needs to infer what they actually want. Claude was meaningfully better at this.

    Analysis Quality: Claude's Edge

    We gave both models campaign performance data and asked: "What's happening here? What should we do?"

    Claude's analysis:

    • More structured reasoning
    • Better at identifying multiple contributing factors
    • Clearer separation of certainty vs speculation
    • More actionable recommendations

    GPT-4's analysis:

    • Often focused on a single explanation
    • Less nuanced about confidence levels
    • Sometimes missed secondary patterns
    • Recommendations were more generic

    Example: Campaign performance dropped 20% over two weeks.

    GPT-4 response: "Your campaign performance has declined by 20% over the past two weeks. This could be due to several factors including increased competition, audience fatigue, or seasonal trends. I recommend refreshing your creative and testing new audience segments."

    Claude response: "I see a 20% performance decline, but let me break down what's actually happening:

    Primary factor (high confidence): Your cost-per-click increased 35% starting 10 days ago, specifically for mobile placements. Desktop CPC remained stable.

    Secondary factor (medium confidence): Conversion rate for the 25-34 age group dropped from 3.2% to 2.1%, but other age groups were unchanged.

    Speculation: The timing coincides with iOS updates that often affect mobile ad delivery. The age-specific drop suggests either audience saturation or creative fatigue for that segment.

    Recommendations (prioritized):

    • Check for mobile bid adjustments or placement issues (likely quick fix)
    • Analyze creative performance specifically for 25-34 mobile users
    • Consider audience exclusions if we've been over-targeting that segment"

    Notice the difference:

    • Explicit confidence levels
    • Specific data points referenced
    • Prioritized recommendations
    • Separation of facts, inference, and speculation

    In production, this matters enormously. Users need to know what the AI is certain about vs what it's guessing.

    Reasoning: Claude's Biggest Advantage

    The most striking difference was in multi-step reasoning tasks.

    Test scenario: "We want to improve ROAS. We can either increase budget by 30% or cut poor-performing campaigns. Budget increase would likely improve reach but might lower efficiency. What should we do?"

    GPT-4 approach:

    • Discussed trade-offs
    • Suggested testing both
    • Gave fairly generic strategic advice

    Claude approach:

    • Asked for current ROAS and marginal ROAS
    • Estimated the impact of each approach based on historical data
    • Calculated break-even scenarios
    • Made a specific recommendation with conditions

    The key insight: Claude was better at reasoning through problems rather than just describing them.

    In another test, we asked models to debug why a query was running slowly. GPT-4 suggested generic optimizations. Claude analyzed the specific query structure, identified the problematic JOIN order, and explained why reordering would help.

    Why? My hypothesis is that Claude's training emphasized reasoning processes, not just outputs. It shows its work more clearly.

    Communication: Surprisingly Close

    Both models were good at explaining complex findings clearly. Claude had a slight edge in structured communication, but GPT-4 was often more conversational.

    For our use case, structure mattered more than warmth. Users wanted clear breakdowns, not friendly chat.

    Consistency: Claude Won

    We ran each task 5 times to measure output variance.

    Claude: Remarkably consistent. 85%+ of the time, the substantive answer was the same across runs.

    GPT-4: More variable. We'd sometimes get significantly different analyses on identical data.

    In production, consistency is critical. If a user asks the same question twice and gets different answers, trust evaporates.

    Cost: More Complicated Than It Looks

    At face value, GPT-4 was cheaper:

    • GPT-4 Turbo: ~$10/1M input tokens, ~$30/1M output tokens
    • Claude 3 Opus: ~$15/1M input tokens, ~$75/1M output tokens

    But in practice, Claude was often more cost-effective because:

    • More concise outputs - GPT-4 was often verbose. Claude gave focused answers.
    • Fewer retries - Better accuracy meant less "try again" loops
    • Better context usage - Claude handled large contexts more gracefully

    Real-world cost per query:

    • GPT-4 Turbo: $0.08-0.15
    • Claude 3 Opus: $0.10-0.18

    Claude was slightly more expensive, but not dramatically so. Given the quality difference, the cost premium was worth it.

    Latency: GPT-4 Faster, But Not By Much

    Median response time (streaming):

    • GPT-4 Turbo: 3.2 seconds to first token, 12 seconds total
    • Claude 3 Opus: 4.1 seconds to first token, 14 seconds total

    Not a dealbreaker difference. In a conversational interface, both felt fast enough.

    The Decision Point

    After two weeks of testing, the data was clear:

    Claude was better at:

    • Understanding intent behind questions (critical for natural language queries)
    • Multi-step reasoning (essential for analysis)
    • Consistency (necessary for trust)
    • Structured communication (important for clarity)

    GPT-4 was better at:

    • Latency (slightly)
    • Cost (slightly)
    • Ecosystem size (more tools and libraries)

    For our use case—marketing analytics where reasoning quality and consistency matter more than raw speed—Claude was the clear choice.

    But we were already built on GPT-4. Switching would mean rewriting significant parts of our platform.

    Tom and I had a long conversation about this. His question: "Are you sure Claude is worth the engineering cost of switching?"

    My answer: "If we build on GPT-4, we'll be competing with everyone else using GPT-4. Claude gives us a technical edge."

    We decided to switch.

    The Migration Reality

    Switching models sounds simple: change an API endpoint, right?

    In practice, it required:

    1. Prompt re-engineering

    Claude and GPT-4 respond differently to the same prompts. What worked well for GPT-4 often produced worse results on Claude, and vice versa.

    We had to:

    • Rewrite our system prompts
    • Adjust few-shot examples
    • Change how we structured complex requests
    • Optimize for Claude's reasoning style

    Time investment: ~3 weeks of iteration

    2. Tool calling adjustments

    Both models support function calling, but with different interfaces and behaviors.

    Claude's tool use is more explicit. GPT-4's is sometimes more flexible. We had to redesign how we expose database operations as tools.

    Time investment: ~2 weeks

    3. Context management

    Claude handles large contexts well, but differently than GPT-4. We optimized our context preparation for Claude's architecture.

    Time investment: ~1 week

    4. Testing and validation

    We couldn't just ship the switch. We ran A/B tests with real users comparing GPT-4 and Claude versions.

    Time investment: ~2 weeks

    Total migration cost: ~2 months of engineering time

    Worth it? Absolutely. But it wasn't trivial.

    What We Learned in Production

    The test suite told us Claude was better. Production confirmed it, but with nuance.

    Users Noticed the Difference

    We ran an A/B test where 50% of users got GPT-4, 50% got Claude (they didn't know which). After a week, we surveyed them.

    Results:

    • 68% of Claude users rated the AI's analysis as "excellent"
    • 42% of GPT-4 users rated it "excellent"
    • Claude users were 2.3x more likely to say they "trusted" the AI's recommendations

    The qualitative feedback was striking:

    Claude users: "It feels like it actually understands my data" / "The explanations make sense"

    GPT-4 users: "It's helpful but sometimes I need to ask follow-up questions" / "The analysis is good but generic"

    Where Claude Excels

    In production, Claude's advantages became even clearer:

    1. Complex query scenarios

    When users asked multi-part questions ("Compare performance across channels, segment by device, and identify the biggest opportunities"), Claude handled them more elegantly.

    2. Uncertainty handling

    Claude was better at saying "I don't have enough data to be confident about X, but here's what I can tell you about Y."

    3. Following business logic

    Marketing data has implicit rules: fiscal calendars, attribution windows, cohort definitions. Claude picked up on these contextual constraints better.

    Where GPT-4 Held Its Own

    There were areas where GPT-4 was competitive or better:

    1. Creative tasks

    When users asked for campaign messaging ideas or creative suggestions, GPT-4 was slightly more creative and varied.

    2. Conversational flow

    GPT-4 felt more "natural" in long conversations. Claude sometimes felt more formal.

    3. Edge case creativity

    When faced with totally novel requests, GPT-4 was sometimes more willing to try unconventional approaches.

    Our response: We considered using GPT-4 for specific creative tasks and Claude for analytical tasks. But the complexity of managing two models wasn't worth the marginal gains.

    The Technical Architecture

    Here's how we use Claude in production:

    1. System Prompt Engineering

    We use a carefully tuned system prompt that:

    • Defines the AI's role (marketing analytics expert)
    • Sets reasoning guidelines (show your work, acknowledge uncertainty)
    • Provides domain context (common marketing metrics, attribution models)
    • Establishes output format (structured insights, prioritized recommendations)

    Key learning: Claude responds well to explicit reasoning frameworks. We literally tell it "Think step-by-step" and it produces better outputs.
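
    To make that concrete, here's a trimmed-down illustration of how a prompt along those lines gets passed to the Messages API. The prompt text below is a simplified stand-in, not our production prompt, and the model ID is just an example.

    # Simplified illustration; the system prompt is a stand-in, not our production prompt.
    import anthropic

    SYSTEM_PROMPT = """You are a marketing analytics expert.
    Reasoning guidelines: think step-by-step, show your work, and state confidence
    explicitly (high confidence / medium confidence / speculation).
    Domain context: ROAS = revenue / spend; respect attribution windows and fiscal calendars.
    Output format: structured insights first, then prioritized recommendations."""

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # example model ID
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": "Why did our Google Ads ROAS drop 15% last week?"}],
    )
    print(response.content[0].text)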

    2. Context Management

    For each user query, we build context:

    • User's database schema (what tables/columns are available)
    • Previous conversation history
    • Relevant business context (industry, typical metrics)
    • Recent query results (for follow-up questions)

    Context size: Usually 20-40k tokens. Claude handles this well.

    Optimization: We summarize old conversation history to keep context manageable.
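
    As a rough sketch of what "building context" means in practice (the field names here are illustrative, not our exact format), the per-query payload is assembled from those pieces roughly like this:

    # Rough sketch of per-query context assembly; field names are illustrative.
    def build_context(schema_text: str, business_context: str,
                      history_summary: str, recent_results: str, question: str) -> str:
        """Concatenate the pieces given to Claude for each query (typically 20-40k tokens)."""
        return "\n\n".join([
            f"Database schema:\n{schema_text}",          # tables/columns from the user's warehouse
            f"Business context:\n{business_context}",    # industry, fiscal calendar, key metrics
            f"Conversation so far:\n{history_summary}",  # older turns summarized to stay manageable
            f"Most recent query results:\n{recent_results}",
            f"User question:\n{question}",
        ])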

    3. Tool Use Pattern

    We expose database operations as tools Claude can call:

    • execute_query: Run SQL against BigQuery
    • get_schema: Retrieve table structure
    • list_tables: Show available data

    Pattern:

    • User asks question
    • Claude determines what data it needs
    • Claude calls tools to get that data
    • Claude analyzes results
    • Claude responds with insights

    Why this works: Separating data retrieval from analysis lets Claude focus on reasoning, not query syntax.
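
    Here's a simplified sketch of that loop using the Anthropic Messages API. Only the execute_query tool is shown, and run_bigquery is a hypothetical stand-in for our query-execution layer; treat this as the shape of the pattern, not our production agent.

    # Simplified tool-use loop; run_bigquery is a hypothetical stand-in for query execution.
    import json
    import anthropic

    client = anthropic.Anthropic()

    TOOLS = [{
        "name": "execute_query",
        "description": "Run a SQL query against the user's BigQuery dataset and return rows.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string", "description": "Standard SQL query"}},
            "required": ["sql"],
        },
    }]

    def answer(question: str, run_bigquery) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            response = client.messages.create(
                model="claude-3-5-sonnet-20240620",
                max_tokens=2048,
                tools=TOOLS,
                messages=messages,
            )
            if response.stop_reason != "tool_use":
                # No more data needed: return Claude's final analysis text.
                return "".join(b.text for b in response.content if b.type == "text")
            # Claude asked for data: execute each tool call and feed the results back.
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use" and block.name == "execute_query":
                    rows = run_bigquery(block.input["sql"])  # validate before executing in practice
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(rows),
                    })
            messages.append({"role": "user", "content": tool_results})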

    4. Response Streaming

    We stream Claude's responses token-by-token for better UX. Users see thinking happen in real-time, which builds trust.

    Technical detail: We use Server-Sent Events (SSE) to stream from our backend to frontend.
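
    A minimal sketch of that streaming path, assuming a FastAPI backend purely for illustration (the same pattern works with any framework that can emit SSE):

    # Minimal SSE streaming sketch; FastAPI is an assumed framework choice for illustration.
    import anthropic
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    client = anthropic.Anthropic()

    @app.get("/analyze")
    def analyze(question: str):
        def event_stream():
            # Stream Claude's answer and forward each text chunk as an SSE event.
            with client.messages.stream(
                model="claude-3-5-sonnet-20240620",
                max_tokens=2048,
                messages=[{"role": "user", "content": question}],
            ) as stream:
                for text in stream.text_stream:
                    # Note: a real implementation should escape newlines inside chunks
                    # to preserve SSE framing.
                    yield f"data: {text}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(event_stream(), media_type="text/event-stream")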

    5. Error Handling

    Claude is good, but not perfect. We built guardrails:

    • SQL validation before execution
    • Cost limits on expensive queries
    • Timeout handling for long-running analysis
    • Fallback explanations when Claude is uncertain

    Philosophy: Trust but verify. Let Claude reason freely, but validate outputs before presenting to users.
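
    As a sketch of what "trust but verify" looks like in code: the specific checks and limits below are examples, not our exact production rules.

    # Illustrative guardrails before a generated query runs; rules and limits are examples.
    import re

    MAX_BYTES_BILLED = 5 * 10**9  # example cost cap: ~5 GB scanned per query
    FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|TRUNCATE|MERGE)\b", re.IGNORECASE)

    def validate_sql(sql: str) -> None:
        """Raise if the generated SQL looks unsafe; otherwise let it through."""
        if FORBIDDEN.search(sql):
            raise ValueError("Only read-only queries are allowed.")
        # Naive multi-statement check: any semicolon left after trimming trailing ones.
        if ";" in sql.rstrip().rstrip(";"):
            raise ValueError("Multiple statements are not allowed.")

    def run_with_guardrails(sql: str, dry_run_bytes, execute, timeout_s: int = 60):
        """Validate, estimate cost via a dry run, then execute with a timeout."""
        validate_sql(sql)
        estimated = dry_run_bytes(sql)              # e.g. a BigQuery dry-run byte estimate
        if estimated > MAX_BYTES_BILLED:
            raise ValueError(f"Query would scan {estimated} bytes; over the cost limit.")
        return execute(sql, timeout=timeout_s)      # caller-provided execution with timeout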

    The Comparison Today (Claude 3.5 vs GPT-4o)

    Since we made the initial decision, both models have been updated:

    • Claude 3.5 Sonnet is better and cheaper than Opus
    • GPT-4o is faster and cheaper than GPT-4 Turbo

    Would we make the same decision today? Yes, but it's closer.

    Claude 3.5 Sonnet:

    • Better reasoning than Claude 3 Opus
    • Cheaper ($3/M input, $15/M output tokens)
    • Slightly faster
    • Maintains Claude's structured thinking advantage

    GPT-4o:

    • Significantly faster than GPT-4 Turbo
    • Cheaper
    • Better at multimodal tasks
    • Improved reasoning over previous versions

    Current assessment: Claude 3.5 Sonnet is still our choice for analytical reasoning, but the gap has narrowed. If we were building a more conversational or creative product, GPT-4o would be tempting.

    What We'd Do Differently

    Knowing what I know now, here's what I'd change:

    1. Start with both models

    We should have architected for model flexibility from day one. Being able to A/B test models continuously would be valuable; a sketch of what that thin abstraction might look like follows this list.

    2. Build better evaluation infrastructure

    We did good testing before switching, but we should have automated more of it. Continuous model evaluation would help us catch regressions.

    3. Invest more in prompt engineering earlier

    We spent months iterating on prompts. Starting with more rigorous prompt engineering would have accelerated development.

    4. Plan for model updates

    Model versions change. We should have built better versioning and testing for model updates.
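
    On the model-flexibility point above: the abstraction we wish we'd started with is basically a thin adapter layer. Here's a sketch, with simplified provider calls and illustrative names, not our actual code:

    # Sketch of a thin model-agnostic layer; provider calls simplified, names illustrative.
    from typing import Protocol

    class ChatModel(Protocol):
        def complete(self, system: str, user: str) -> str: ...

    class ClaudeModel:
        def __init__(self, model: str = "claude-3-5-sonnet-20240620"):
            import anthropic
            self.client, self.model = anthropic.Anthropic(), model

        def complete(self, system: str, user: str) -> str:
            resp = self.client.messages.create(
                model=self.model, max_tokens=1024, system=system,
                messages=[{"role": "user", "content": user}],
            )
            return resp.content[0].text

    class OpenAIModel:
        def __init__(self, model: str = "gpt-4o"):
            import openai
            self.client, self.model = openai.OpenAI(), model

        def complete(self, system: str, user: str) -> str:
            resp = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user}],
            )
            return resp.choices[0].message.content

    # Swapping models, or A/B testing them, becomes a one-line change at the call site.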

    The Honest Truth

    Here's what I tell other technical founders asking about model choice:

    There's no universal answer. The right model depends on your use case.

    For analytical reasoning, complex queries, and structured output: Claude is probably better.

    For creative tasks, conversational flow, and broad knowledge: GPT-4 is very competitive.

    For cost-sensitive applications: Claude 3.5 Sonnet offers the best quality-to-cost ratio we've found.

    For maximum flexibility: Build model-agnostic from the start.

    The landscape changes fast. What's true today might not be true in six months. The models are improving rapidly, and new competitors are emerging.

    The real competitive advantage isn't which model you use—it's how well you use it.

    Prompt engineering, context management, tool design, and error handling matter more than model selection. We've seen brilliant applications built on GPT-3.5 and mediocre ones on Claude 3.5.

    Why This Matters

    I'm writing this not to convince everyone to use Claude, but to share the thinking process behind a critical technical decision.

    When you're building an AI-powered product, model selection isn't just a technical choice—it's a strategic one. The model you choose affects:

    • Product capabilities
    • User experience
    • Cost structure
    • Competitive positioning
    • Technical debt

    We bet Cogny on Claude because the model's strengths aligned with our product's needs. That bet has paid off, but it required real engineering investment to maximize the benefits.

    The takeaway: Test rigorously, understand your requirements, and be willing to invest in optimization. The default choice isn't always the right choice.

    And in our case, going against the default (GPT-4) and choosing Claude turned out to be one of the best technical decisions we made.

    ---

    About Berner Setterwall

    Berner is CTO and co-founder of Cogny, where he's building AI-powered marketing automation on top of Claude. Previously, he was a senior engineer at Campanja, building optimization systems for Netflix, Zalando, and other major brands. He specializes in AI architecture, large-scale data systems, and making complex technology work reliably in production.

    Want to see Claude in action for marketing analytics?

    Cogny uses Claude to power real-time marketing analysis and insights. See how we've built AI that actually understands your data and delivers actionable recommendations. Book a demo to experience the difference.