
    Real-Time BigQuery Analysis: Lessons from Production

    Berner Setterwall, January 19, 2025

    Three months ago, a user asked our AI agent: "Show me campaign performance for the last 24 hours."

    The AI generated a perfectly valid query, sent it to BigQuery, and... the user waited. And waited. 43 seconds later, they got their answer.

    They closed the app and never came back.

    That's when I realized that "real-time analysis" isn't about how fast you can query the database. It's about managing expectations, optimizing for the 90th percentile, and building systems that fail gracefully when BigQuery decides to be slow.

    Here's what we've learned running AI-powered BigQuery analysis in production, processing thousands of queries daily from users who expect instant answers.

    The Promise vs The Reality

    The promise: Users ask questions in natural language, AI generates SQL, BigQuery returns results in seconds, everyone's happy.

    The reality:

    • Some queries return in 2 seconds
    • Some take 30 seconds
    • Some timeout after 60 seconds
    • Some cost $0.001
    • Some cost $15
    • Some return 10 rows
    • Some try to return 10 million rows
    • Some queries are perfect
    • Some have subtle bugs that produce wrong results

    The challenge isn't making it work once. It's making it work reliably, fast, and affordably for arbitrary user questions against arbitrary schemas.

    Lesson 1: BigQuery Performance is Wildly Unpredictable

    I came from traditional SQL databases where query performance was relatively predictable. Optimize your indexes, understand your execution plan, and you could estimate query time pretty accurately.

    BigQuery doesn't work that way.

    Case 1: Query scans 100GB, completes in 3 seconds.

    Case 2: Query scans 5GB, takes 28 seconds.

    Why? BigQuery's performance depends on:

    • Data layout and clustering
    • Slot availability (shared resources)
    • Query complexity (joins, aggregations)
    • Partition pruning effectiveness
    • Time of day (contention)
    • Table size vs scanned data
    • Whether data is in cache

    The implication: You can't reliably predict query time from query size or complexity.

    What We Did About It

    1. Tiered timeout strategy

    Instead of one timeout, we use three:

    • Fast queries (< 5s): Interactive responses
    • Medium queries (5-15s): Show progress indicator
    • Slow queries (15-30s): Warn user, offer to run async
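    In code, the routing is just a threshold ladder. A minimal sketch, with the thresholds from the list above; the tier names and function are illustrative, not our production API:

```python
# Map an estimated query latency to a UX tier.
# Thresholds (5s / 15s / 30s) match the tiered timeout strategy above;
# the tier names are illustrative.
def latency_tier(estimated_seconds: float) -> str:
    if estimated_seconds < 5:
        return "interactive"   # answer inline
    if estimated_seconds < 15:
        return "progress"      # show a progress indicator
    if estimated_seconds <= 30:
        return "warn_async"    # warn the user, offer to run async
    return "reject"            # beyond our interactive budget
```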

    2. Query cost estimation

    Before running expensive queries, we estimate cost:

    -- Approximate table sizes (an upper bound on what a query can scan)
    SELECT
      table_id,
      size_bytes
    FROM `project.dataset.__TABLES__`
    WHERE table_id IN (...)

    If estimated cost > $1, we ask for user confirmation.

    3. Progressive disclosure

    Instead of running the full query immediately, we:

    • Run a COUNT(*) first (fast, cheap)
    • If count > 100k rows, suggest sampling
    • If user confirms, run the full query
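    The same flow as a sketch. Here `run_query` is a stand-in for whatever executes SQL and returns rows as dicts; only the shape of the logic is the point:

```python
# Progressive disclosure: count first, run the full query only if the result
# is small enough. run_query is an injected SQL executor (illustrative).
def progressive_run(run_query, base_sql, sample_threshold=100_000):
    # Step 1: COUNT(*) over the same query is fast and cheap
    count = run_query(f"SELECT COUNT(*) AS n FROM ({base_sql})")[0]["n"]
    if count > sample_threshold:
        # Step 2: too large -- suggest sampling before running in full
        return {"action": "suggest_sampling", "row_count": count}
    # Step 3: small enough to just run
    return {"action": "ran", "rows": run_query(base_sql)}
```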

    4. Caching layer

    We cache query results for identical queries. Hit rate: ~30% (better than expected).

    Implementation: Redis with TTL based on data freshness requirements.
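    A minimal sketch of that cache layer. The cache client is injected (any object with redis-py style `get`/`set(ex=...)` works), and queries are keyed by a hash of the whitespace-normalized SQL so trivial formatting differences still hit:

```python
import hashlib
import json

# Cache key: hash of whitespace-normalized SQL plus parameters, so the same
# query with different formatting still produces a cache hit.
def cache_key(sql, params):
    normalized = " ".join(sql.split())
    payload = json.dumps({"sql": normalized, "params": params}, sort_keys=True)
    return "bq:" + hashlib.sha256(payload.encode()).hexdigest()

# cache: redis-py style client; run_query: BigQuery executor;
# ttl_seconds comes from the data-freshness requirements.
def cached_query(cache, run_query, sql, params=None, ttl_seconds=300):
    key = cache_key(sql, params or {})
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: skip BigQuery entirely
    rows = run_query(sql)
    cache.set(key, json.dumps(rows), ex=ttl_seconds)
    return rows
```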

    Lesson 2: AI-Generated Queries Optimize for Correctness, Not Performance

    Claude is excellent at generating syntactically correct SQL. It's decent at generating semantically correct SQL. It's terrible at generating performant SQL.

    Example user question: "Show me daily spend by campaign for the last 30 days"

    Claude's first attempt:

    SELECT
      DATE(timestamp) as date,
      campaign_name,
      SUM(cost) as spend
    FROM `project.dataset.ad_impressions`
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY date, campaign_name
    ORDER BY date DESC

    Problems:

    • Scans the entire ad_impressions table (potentially billions of rows)
    • Filters after scanning (no partition pruning)
    • No partition or clustering key usage

    Optimized version:

    SELECT
      DATE(_PARTITIONTIME) as date,
      campaign_name,
      SUM(cost) as spend
    FROM `project.dataset.ad_impressions`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY date, campaign_name
    ORDER BY date DESC

    Impact:

    • Original: 2.3TB scanned, 47 seconds, $11.50
    • Optimized: 90GB scanned, 4 seconds, $0.45

    The challenge: How do you teach an AI to generate the optimized version?

    What We Did About It

    1. Schema enrichment

    We augment table schemas with metadata:

    {
      "table": "ad_impressions",
      "partition_field": "_PARTITIONTIME",
      "cluster_fields": ["campaign_id", "date"],
      "common_filters": ["timestamp >= ...", "campaign_id IN ..."],
      "performance_tips": [
        "Always filter on _PARTITIONTIME for date ranges",
        "Use campaign_id in WHERE clause when possible"
      ]
    }

    This metadata goes into the AI's context when generating queries.

    2. Query rewriting rules

    Before execution, we apply transformation rules:

    • Replace WHERE timestamp >= with WHERE _PARTITIONTIME >= if partition exists
    • Add clustering key filters when mentioned in WHERE clause
    • Push down projections to reduce data scanned
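    The first rule can be sketched as a narrow rewrite. A production version should use a real SQL parser; this regex only handles the exact `WHERE timestamp >=` shape, and the schema dict follows the enrichment metadata above:

```python
import re

# Swap a timestamp-range filter for the partition pseudo-column when the
# schema metadata declares one. Deliberately narrow: only the exact
# `WHERE timestamp >=` pattern is rewritten.
def rewrite_partition_filter(sql, schema):
    partition_field = schema.get("partition_field")
    if not partition_field:
        return sql
    return re.sub(r"\bWHERE\s+timestamp\s*>=",
                  f"WHERE {partition_field} >=", sql, count=1)
```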

    3. Query analysis and suggestions

    After generating a query, we analyze it:

    def analyze_query(query, schema):
        issues = []

        # Check for full table scans
        if not uses_partition_filter(query, schema):
            issues.append("Warning: Full table scan. Consider filtering on partition key.")

        # Check for expensive operations
        if has_cross_join(query):
            issues.append("Warning: Cross join detected. This may be slow.")

        # Estimate cost (on-demand pricing is $5 per TB scanned)
        estimated_bytes = estimate_bytes_scanned(query, schema)
        if estimated_bytes > 1_000_000_000_000:  # 1TB
            issues.append(f"Warning: Will scan ~{estimated_bytes / 1e12:.2f}TB (${estimated_bytes / 1e12 * 5:.2f})")

        return issues

    If issues are found, we ask Claude to revise the query with the analysis as context.
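    That generate-analyze-revise loop, sketched with the model call and the analyzer injected (the function names are illustrative, not our real API):

```python
# Generate, analyze, revise: feed analyzer warnings back into the next
# generation round until the query passes or we give up.
def generate_with_review(generate_query, analyze_query, question, schema, max_rounds=3):
    feedback = []
    for _ in range(max_rounds):
        query = generate_query(question, schema, feedback)
        issues = analyze_query(query, schema)
        if not issues:
            return query, []        # clean query, ship it
        feedback = issues           # revise with the warnings as context
    return query, issues            # best effort after max_rounds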

    4. Example-driven prompting

    We include performance examples in the system prompt:

    When querying partitioned tables, always filter on the partition key:

    GOOD:
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)

    BAD:
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)

    The GOOD version prunes partitions before scanning. The BAD version scans all partitions then filters.

    Results: Average query cost dropped from $1.20 to $0.18 per query.

    Lesson 3: Users Ask Questions Your Schema Can't Answer

    This surprised me: About 30% of user questions can't be answered from their data warehouse alone.

    Examples:

    User: "Compare our Facebook ROAS to industry average"
    Problem: Industry data isn't in their warehouse

    User: "Show me campaigns targeting women 25-34"
    Problem: Demographic targeting data isn't stored in impressions table

    User: "Why did traffic drop yesterday?"
    Problem: External factors (competitors, seasonality, news) aren't in the data

    Initial approach: Tell the user we can't answer.

    User response: Frustration and churn.

    What We Did About It

    1. Augmented data sources

    For common external data needs, we maintain reference datasets:

    • Industry benchmarks (by vertical and channel)
    • Seasonality indices
    • Holiday calendars
    • Platform changes/updates

    These are separate BigQuery tables users can optionally enable.

    2. Partial answers with context

    Instead of "I can't answer that," we say:

    "I can show you your campaign performance for that audience segment, but I don't have industry benchmark data to compare against. Would you like to see your historical performance instead?"

    3. Multi-source reasoning

    When the data warehouse is insufficient, Claude can:

    • Reference external knowledge ("Facebook launched iOS 14 targeting restrictions in April 2021, which might explain the drop")
    • Suggest what additional data would help
    • Provide qualified insights based on available data

    Example:

    User: "Why did our Instagram campaign suddenly get expensive?"

    Claude: "Your Instagram CPM increased 45% starting March 15th. While I can't see Instagram's platform changes directly, I notice this coincides with when Meta typically adjusts ad auction dynamics for Q2. Your competitor spend in the competitor_analysis table also increased 60% in that period, which could be driving auction pressure. I'd recommend checking Meta's advertiser updates for mid-March."

    This is way more helpful than "I don't have that data."

    Lesson 4: Result Sets Are Often Unusably Large

    When a user asks "show me all campaigns," they don't realize they have 50,000 campaigns.

    Naive approach: Return all 50,000 rows.

    Problems:

    • Takes forever to query
    • Takes forever to transmit
    • Crashes the browser trying to render
    • User can't meaningfully review that much data

    Better approach: be intelligent about limiting and summarizing.

    What We Did About It

    1. Smart defaults

    We automatically limit queries unless explicitly asked for more:

    -- User asks: "Show me campaigns"
    -- We actually run:
    SELECT * FROM campaigns
    ORDER BY spend DESC -- or most relevant order
    LIMIT 100

    Then we tell the user: "Showing top 100 campaigns by spend. There are 47,342 total. Would you like to filter or see more?"

    2. Automatic summarization

    For large result sets, we ask Claude to summarize:

    I found 12,347 campaigns. Here are the key patterns:

    - Top 10 campaigns represent 67% of total spend
    - 8,234 campaigns (67%) have zero conversions in the last 30 days
    - Campaign performance varies widely by category:
      - Branded: avg ROAS 8.2
      - Non-branded: avg ROAS 3.1

    Would you like me to:
    1. Show top-performing campaigns
    2. Show underperforming campaigns needing attention
    3. Break down by specific category

    3. Progressive loading

    For medium-sized results (100-1000 rows), we:

    • Load first 50 rows immediately
    • Stream remaining rows
    • Allow user to stop loading if they found what they need

    4. Aggregation suggestions

    If a query would return >10k rows, we suggest aggregation:

    "This query would return 45,000 individual ad impressions. Would you like me to aggregate by day/campaign/creative instead?"

    Lesson 5: Error Handling is Half the Product

    Queries fail. A lot. Users need to understand why and what to do about it.

    Categories of failures:

    • Syntax errors (AI generated invalid SQL)
    • Schema errors (referenced non-existent columns)
    • Permission errors (can't access certain tables)
    • Timeout errors (query too slow)
    • Cost errors (query too expensive)
    • Resource errors (BigQuery quota exceeded)

    Naive error handling:

    Error: Query failed
    Details: [BigQuery error message]

    User interpretation: "This thing is broken."

    What We Did About It

    1. Error translation

    We intercept BigQuery errors and translate them:

    def translate_error(bigquery_error):
        if "Exceeded rate limits" in bigquery_error:
            return {
                "user_message": "Your BigQuery project is currently at its query limit. This usually resolves in a few minutes.",
                "suggestion": "Wait 2-3 minutes and try again, or consider upgrading your BigQuery quota.",
                "can_retry": True
            }

        if "Not found: Table" in bigquery_error:
            table_name = extract_table_name(bigquery_error)
            return {
                "user_message": f"I tried to query table '{table_name}' but it doesn't exist in your dataset.",
                "suggestion": "Check if the table name is correct or if you need to connect additional datasets.",
                "can_retry": False
            }

        # ... more cases

    2. Automatic retry with fixes

    For certain errors, we automatically attempt fixes:

    Syntax errors: Ask Claude to fix the SQL with the error message as context.

    Schema errors: Refresh schema and regenerate query.

    Timeout errors: Suggest sampling or narrower date range.
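    A sketch of that retry loop, with the repair steps injected. `fix_sql` stands in for the Claude revision call and `refresh_schema` for the schema reload; both names are illustrative:

```python
# Error-driven retry: repair what we can, surface what we can't.
# Cost/timeout errors are deliberately not blind-retried.
def run_with_retry(run_query, fix_sql, refresh_schema, sql, max_attempts=2):
    msg = ""
    for _ in range(max_attempts):
        try:
            return {"ok": True, "rows": run_query(sql)}
        except Exception as e:
            msg = str(e)
            if "Syntax error" in msg:
                sql = fix_sql(sql, msg)      # repair SQL with the error as context
            elif "Not found" in msg:
                refresh_schema()             # stale schema: reload metadata
                sql = fix_sql(sql, msg)      # then regenerate the query
            else:
                break                        # not a retryable class of error
    return {"ok": False, "error": msg}
```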

    3. Graceful degradation

    When a query fails, we try to provide partial value:

    "I couldn't complete the full analysis due to query timeout, but based on the schema, I can tell you that your campaigns table has 12,347 unique campaigns. Would you like me to try a simpler query, or would you prefer to run this async?"

    4. User education

    After errors, we explain what happened and how to avoid it:

    "This query scanned 5TB of data, which exceeded the 1TB limit we set for interactive queries. For large analyses like this, you can:

    • Narrow the date range (e.g., last 7 days instead of all time)
    • Sample the data (e.g., 10% of impressions)
    • Run as a background job and get notified when complete"

    Impact: User retry rate after errors increased from 15% to 68%.

    Lesson 6: Cost Control is Not Optional

    Early in production, a user asked: "Analyze all impression data."

    Our AI generated a query that scanned 47TB.

    Cost: $235.

    We found out when the user complained about their BigQuery bill.

    What We Did About It

    1. Query cost estimation before execution

    from google.cloud import bigquery

    def estimate_query_cost(client, query):
        # A dry run reports bytes processed without actually executing
        job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
        job = client.query(query, job_config=job_config)
        bytes_scanned = job.total_bytes_processed

        # BigQuery on-demand pricing: $5 per TB scanned
        estimated_cost = (bytes_scanned / 1e12) * 5

        return estimated_cost, bytes_scanned

    2. Tiered confirmation

    • < $0.10: Run automatically
    • $0.10 - $1.00: Show estimate, auto-run
    • $1.00 - $10: Require user confirmation
    • > $10: Require explicit confirmation + suggest optimization
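    The tiers above as a lookup. Thresholds are in USD and match the list; the action names are ours:

```python
# Map an estimated query cost (USD) to a confirmation tier.
def cost_action(estimated_cost):
    if estimated_cost < 0.10:
        return "auto_run"
    if estimated_cost < 1.00:
        return "show_estimate_and_run"
    if estimated_cost <= 10.00:
        return "require_confirmation"
    return "confirm_and_suggest_optimization"
```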

    3. Cost budgets

    Users can set daily/monthly query budgets. As spending approaches the budget:

    • 80% of budget: Warning notification
    • 100% of budget: Require admin approval for queries

    4. Cost-aware query optimization

    When generating queries, we include cost context:

    "I can answer this two ways:

    Option A (fast, expensive):

    • Scan all 5TB of impression data
    • Complete in ~8 seconds
    • Cost: ~$25

    Option B (slower, cheap):

    • Use pre-aggregated daily summary table
    • Complete in ~3 seconds
    • Cost: ~$0.01

    Which would you prefer?"

    5. Optimization suggestions

    After expensive queries, we suggest improvements:

    "This query cost $5.50. You could reduce cost by:

    • Using the aggregated table instead of raw impressions (95% cost reduction)
    • Creating a materialized view for this analysis
    • Partitioning by date and querying shorter time ranges"

    Lesson 7: Schema Complexity Matters More Than Data Size

    One client has 200TB of data across 5 simple tables. Queries work great.

    Another client has 5TB of data across 200 tables with complex relationships. Queries are a mess.

    The problem: AI needs to understand relationships, not just tables.

    What We Did About It

    1. Schema graph representation

    We build a relationship graph:

    {
      "tables": {
        "campaigns": {
          "columns": [...],
          "relationships": [
            {
              "table": "ad_groups",
              "type": "one_to_many",
              "join_on": "campaigns.id = ad_groups.campaign_id"
            }
          ]
        }
      }
    }

    2. Smart context pruning

    Instead of sending all 200 tables to the AI, we:

    • Identify which tables are likely relevant based on the question
    • Send only those tables + their direct relationships
    • Keep total context under 40k tokens
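    A rough sketch of the pruning step, assuming the relationship-graph shape shown above. The keyword-overlap relevance score and the word-count "token" estimate are deliberately crude placeholders for the real versions:

```python
# Rank tables by keyword overlap with the question, keep the best ones under
# a token budget, and pull in each kept table's direct relationships.
def prune_schema(question, schema_graph, budget_tokens=40_000):
    words = set(question.lower().split())

    def score(name, info):
        text = (name + " " + info.get("description", "")).lower()
        return sum(1 for w in words if w in text)

    ranked = sorted(schema_graph.items(), key=lambda kv: score(*kv), reverse=True)
    selected, used = [], 0
    for name, info in ranked:
        if name in selected:
            continue                           # already pulled in as a neighbor
        cost = len(str(info).split())          # crude token estimate
        if used + cost > budget_tokens:
            break
        selected.append(name)
        used += cost
        for rel in info.get("relationships", []):
            if rel["table"] not in selected:
                selected.append(rel["table"])  # keep direct neighbors too
    return selected
```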

    3. Schema documentation

    We encourage users to document their schemas:

    {
      "table": "campaigns",
      "description": "Marketing campaigns across all channels",
      "columns": {
        "campaign_type": {
          "description": "Type of campaign: 'brand', 'perf', 'retarget'",
          "common_values": ["brand", "perf", "retarget"]
        }
      }
    }

    This helps the AI understand domain-specific terminology.

    4. Query templates

    For common analysis patterns, we maintain templates:

    "To calculate campaign ROAS, join campaigns + conversions + costs, group by campaign, calculate revenue / cost."

    When a question matches a pattern, we use the template as a starting point.

    Lesson 8: Streaming Improves Perceived Performance

    Even when queries take 15 seconds, users are more patient if they see progress.

    What We Did About It

    1. Thinking indicators

    We show what the AI is doing:

    • "Understanding your question..."
    • "Analyzing schema..."
    • "Generating query..."
    • "Running query (this may take 10-15 seconds)..."
    • "Analyzing results..."

    2. Partial results

    For queries with multiple steps:

    • Show results from step 1 while step 2 runs
    • Stream result rows as they arrive
    • Update summary statistics in real-time

    3. Progress estimation

    BigQuery doesn't provide progress updates, but we fake it:

    // Show estimated progress for UX (BigQuery exposes no real progress signal)
    let progress = 0;
    const estimatedTime = estimateQueryTime(query);
    const progressInterval = setInterval(() => {
      progress = Math.min(progress + 10, 90); // cap at 90% until the query returns
      updateUI(progress);
    }, estimatedTime / 10);

    Not accurate, but better than a blank screen.

    Lesson 9: Caching is Harder Than It Looks

    Caching BigQuery results seems simple: hash the query, store the result.

    Problems:

    • Data freshness: Cached result might be stale
    • Schema changes: Cached query might be invalid after schema update
    • Parameterization: Similar queries with different dates aren't cache hits
    • Storage costs: Caching 1M rows * 100 queries = expensive

    What We Did About It

    1. TTL based on data freshness

    We ask users about data update frequency:

    • Real-time data (e.g., live campaigns): 5-minute TTL
    • Daily data (e.g., GA4): 24-hour TTL
    • Historical data: 7-day TTL

    2. Semantic caching

    Instead of exact query matching, we use semantic similarity:

    "Show me campaign performance last 7 days" and "What's my campaign ROAS for the last week?" are semantically similar → cache hit.
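    A sketch of the lookup, with the embedding model injected (any sentence-embedding function works; we don't specify a model here, and the 0.9 threshold is something to tune). Cached questions are stored alongside their embeddings and matched by cosine similarity:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# cache_entries: list of (cached_question_embedding, cached_result).
# Returns the cached result of the most similar past question, or None.
def semantic_lookup(embed, cache_entries, question, threshold=0.9):
    q = embed(question)
    best_sim, best_result = 0.0, None
    for emb, result in cache_entries:
        sim = cosine(q, emb)
        if sim > best_sim:
            best_sim, best_result = sim, result
    return best_result if best_sim >= threshold else None
```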

    3. Partial caching

    For queries with date ranges, we cache by date:

    • User asks for last 30 days
    • We have cache for days 1-28
    • Query only last 2 days, merge with cache
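    Sketched as a per-day cache merge: `cache` maps ISO dates to row lists, and `fetch_day` (an illustrative stand-in for the single-day BigQuery query) is only called for the days the cache is missing:

```python
from datetime import date, timedelta

# Per-day cache merge: only the missing days hit BigQuery.
def rows_for_range(cache, fetch_day, start, end):
    rows = []
    day = start
    while day <= end:
        key = day.isoformat()
        if key not in cache:
            cache[key] = fetch_day(day)   # query only the missing day
        rows.extend(cache[key])
        day += timedelta(days=1)
    return rows
```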

    4. Smart invalidation

    When schema changes, we invalidate affected cached queries instead of all cache.

    Lesson 10: BigQuery Isn't Always the Answer

    Sometimes, the best solution is not querying BigQuery at all.

    Example: User asks "What's my total spend today?"

    Option A: Query BigQuery

    • Generate SQL
    • Run query (3-5 seconds)
    • Return result

    Option B: Maintain real-time aggregates

    • Update running totals on data insert
    • Return from cache (<100ms)

    For frequently-asked questions, precomputation beats querying.

    What We Did About It

    1. Identify hot queries

    We log all questions and identify the most common:

    • "Total spend today"
    • "Campaign count"
    • "Top 10 campaigns by spend"
    • etc.

    2. Materialized views

    For hot queries, we create materialized views that auto-update:

    CREATE MATERIALIZED VIEW daily_campaign_summary AS
    SELECT
      DATE(_PARTITIONTIME) as date,
      campaign_id,
      SUM(cost) as total_cost,
      SUM(conversions) as total_conversions
    FROM ad_impressions
    GROUP BY date, campaign_id

    3. Query rewriting

    When a question matches a hot pattern, we rewrite to use the materialized view:

    User query: "total spend today"
    AI generates: `SELECT SUM(cost) FROM ad_impressions WHERE ...`
    We rewrite: `SELECT SUM(total_cost) FROM daily_campaign_summary WHERE date = CURRENT_DATE()`

    Result: 15 seconds → 0.8 seconds, $0.50 → $0.001.

    What I'd Tell My Past Self

    If I could go back to when we started building this, here's what I'd say:

    1. Build cost controls from day one

    We added them after the $235 query incident. Should have been there from the start.

    2. Instrument everything

    We didn't track query performance metrics early. Now we track:

    • Query time distribution
    • Cost per query
    • Cache hit rates
    • Error rates by type
    • Most common queries

    This data drives optimization priorities.

    3. Design for failure

    Users will ask impossible questions. Queries will timeout. BigQuery will have outages.

    Build graceful degradation into the product from the beginning.

    4. Performance perception matters more than actual performance

    A 10-second query with progress indicators feels faster than a 7-second query with a blank screen.

    5. Invest in query optimization infrastructure

    We spent weeks building query rewriting, cost estimation, and performance analysis tools. Worth every hour.

    6. Users don't think in SQL

    Teaching the AI to generate performant SQL is necessary but not sufficient. You need to translate user intent to optimal data access patterns.

    The Honest Truth

    Running real-time BigQuery analysis in production is hard.

    Not "challenging" hard. Genuinely difficult engineering work that requires:

    • Deep BigQuery knowledge
    • AI prompt engineering
    • Cost management
    • Performance optimization
    • Error handling
    • UX design

    But when it works—when a user asks a complex question and gets an insightful answer in 3 seconds—it feels like magic.

    We're processing thousands of queries daily now. The system works. But it took months of optimization, iteration, and learning from production issues.

    If you're building something similar, expect to spend more time on performance, cost, and reliability than on the AI itself.

    The AI is the easy part. Making it work at scale is the challenge.

    ---

    About Berner Setterwall

    Berner is CTO and co-founder of Cogny, where he's building AI-powered marketing automation on top of BigQuery. He's spent years optimizing large-scale data systems, from Campanja's AI platforms to Cogny's real-time analytics engine. He specializes in making complex data systems fast, reliable, and cost-effective.

    Want to see our BigQuery integration in action?

    Cogny connects directly to your BigQuery warehouse and provides real-time AI-powered analysis. We've spent months optimizing for performance, cost, and reliability so you don't have to. Book a demo to see how we make BigQuery analysis feel instant.