
    Real-Time BigQuery Analysis: Lessons from Production

    Berner Setterwall, January 19, 2025

    Three months ago, a user asked our AI agent: "Show me campaign performance for the last 24 hours."

    The AI generated a perfectly valid query, sent it to BigQuery, and... the user waited. And waited. 43 seconds later, they got their answer.

    They closed the app and never came back.

    That's when I realized that "real-time analysis" isn't about how fast you can query the database. It's about managing expectations, optimizing for the 90th percentile, and building systems that fail gracefully when BigQuery decides to be slow.

    Here's what we've learned running AI-powered BigQuery analysis in production, processing thousands of queries daily from users who expect instant answers.

    The Promise vs The Reality

    The promise: Users ask questions in natural language, AI generates SQL, BigQuery returns results in seconds, everyone's happy.

    The reality:

    • Some queries return in 2 seconds
    • Some take 30 seconds
    • Some timeout after 60 seconds
    • Some cost $0.001
    • Some cost $15
    • Some return 10 rows
    • Some try to return 10 million rows
    • Some queries are perfect
    • Some have subtle bugs that produce wrong results

    The challenge isn't making it work once. It's making it work reliably, fast, and affordably for arbitrary user questions against arbitrary schemas.

    Lesson 1: BigQuery Performance is Wildly Unpredictable

    I came from traditional SQL databases where query performance was relatively predictable. Optimize your indexes, understand your execution plan, and you could estimate query time pretty accurately.

    BigQuery doesn't work that way.

    Case 1: Query scans 100GB, completes in 3 seconds.

    Case 2: Query scans 5GB, takes 28 seconds.

    Why? BigQuery's performance depends on:

    • Data layout and clustering
    • Slot availability (shared resources)
    • Query complexity (joins, aggregations)
    • Partition pruning effectiveness
    • Time of day (contention)
    • Table size vs scanned data
    • Whether data is in cache

    The implication: You can't reliably predict query time from query size or complexity.

    What We Did About It

    1. Tiered timeout strategy

    Instead of one timeout, we use three:

    • Fast queries (< 5s): Interactive responses
    • Medium queries (5-15s): Show progress indicator
    • Slow queries (15-30s): Warn user, offer to run async
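    In code, the routing is just a threshold ladder. A minimal sketch, with the thresholds from the list above; the tier names and function are illustrative, not our production API:

```python
# Map an estimated query latency to a UX tier.
# Thresholds (5s / 15s / 30s) match the tiered timeout strategy above;
# the tier names are illustrative.
def latency_tier(estimated_seconds: float) -> str:
    if estimated_seconds < 5:
        return "interactive"   # answer inline
    if estimated_seconds < 15:
        return "progress"      # show a progress indicator
    if estimated_seconds <= 30:
        return "warn_async"    # warn the user, offer to run async
    return "reject"            # beyond our interactive budget
```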

    2. Query cost estimation

    Before running expensive queries, we estimate cost:

    -- Approximate table sizes (an upper bound on what a query can scan)
    SELECT
      table_id,
      size_bytes
    FROM `project.dataset.__TABLES__`
    WHERE table_id IN (...)

    If estimated cost > $1, we ask for user confirmation.

    3. Progressive disclosure

    Instead of running the full query immediately, we:

    • Run a COUNT(*) first (fast, cheap)
    • If count > 100k rows, suggest sampling
    • If user confirms, run the full query
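    The same flow as a sketch. Here `run_query` is a stand-in for whatever executes SQL and returns rows as dicts; only the shape of the logic is the point:

```python
# Progressive disclosure: count first, run the full query only if the result
# is small enough. run_query is an injected SQL executor (illustrative).
def progressive_run(run_query, base_sql, sample_threshold=100_000):
    # Step 1: COUNT(*) over the same query is fast and cheap
    count = run_query(f"SELECT COUNT(*) AS n FROM ({base_sql})")[0]["n"]
    if count > sample_threshold:
        # Step 2: too large -- suggest sampling before running in full
        return {"action": "suggest_sampling", "row_count": count}
    # Step 3: small enough to just run
    return {"action": "ran", "rows": run_query(base_sql)}
```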

    4. Caching layer

    We cache query results for identical queries. Hit rate: ~30% (better than expected).

    Implementation: Redis with TTL based on data freshness requirements.
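    A minimal sketch of that cache layer. The cache client is injected (any object with redis-py style `get`/`set(ex=...)` works), and queries are keyed by a hash of the whitespace-normalized SQL so trivial formatting differences still hit:

```python
import hashlib
import json

# Cache key: hash of whitespace-normalized SQL plus parameters, so the same
# query with different formatting still produces a cache hit.
def cache_key(sql, params):
    normalized = " ".join(sql.split())
    payload = json.dumps({"sql": normalized, "params": params}, sort_keys=True)
    return "bq:" + hashlib.sha256(payload.encode()).hexdigest()

# cache: redis-py style client; run_query: BigQuery executor;
# ttl_seconds comes from the data-freshness requirements.
def cached_query(cache, run_query, sql, params=None, ttl_seconds=300):
    key = cache_key(sql, params or {})
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: skip BigQuery entirely
    rows = run_query(sql)
    cache.set(key, json.dumps(rows), ex=ttl_seconds)
    return rows
```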

    Lesson 2: AI-Generated Queries Optimize for Correctness, Not Performance

    Claude is excellent at generating syntactically correct SQL. It's decent at generating semantically correct SQL. It's terrible at generating performant SQL.

    Example user question: "Show me daily spend by campaign for the last 30 days"

    Claude's first attempt:

    SELECT
      DATE(timestamp) as date,
      campaign_name,
      SUM(cost) as spend
    FROM `project.dataset.ad_impressions`
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY date, campaign_name
    ORDER BY date DESC

    Problems:

    • Scans the entire ad_impressions table (potentially billions of rows)
    • Filters after scanning (no partition pruning)
    • No partition or clustering key usage

    Optimized version:

    SELECT
      DATE(_PARTITIONTIME) as date,
      campaign_name,
      SUM(cost) as spend
    FROM `project.dataset.ad_impressions`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY date, campaign_name
    ORDER BY date DESC

    Impact:

    • Original: 2.3TB scanned, 47 seconds, $11.50
    • Optimized: 90GB scanned, 4 seconds, $0.45

    The challenge: How do you teach an AI to generate the optimized version?

    What We Did About It

    1. Schema enrichment

    We augment table schemas with metadata:

    {
      "table": "ad_impressions",
      "partition_field": "_PARTITIONTIME",
      "cluster_fields": ["campaign_id", "date"],
      "common_filters": ["timestamp >= ...", "campaign_id IN ..."],
      "performance_tips": [
        "Always filter on _PARTITIONTIME for date ranges",
        "Use campaign_id in WHERE clause when possible"
      ]
    }

    This metadata goes into the AI's context when generating queries.

    2. Query rewriting rules

    Before execution, we apply transformation rules:

    • Replace WHERE timestamp >= with WHERE _PARTITIONTIME >= if partition exists
    • Add clustering key filters when mentioned in WHERE clause
    • Push down projections to reduce data scanned
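    The first rule can be sketched as a narrow rewrite. A production version should use a real SQL parser; this regex only handles the exact `WHERE timestamp >=` shape, and the schema dict follows the enrichment metadata above:

```python
import re

# Swap a timestamp-range filter for the partition pseudo-column when the
# schema metadata declares one. Deliberately narrow: only the exact
# `WHERE timestamp >=` pattern is rewritten.
def rewrite_partition_filter(sql, schema):
    partition_field = schema.get("partition_field")
    if not partition_field:
        return sql
    return re.sub(r"\bWHERE\s+timestamp\s*>=",
                  f"WHERE {partition_field} >=", sql, count=1)
```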

    3. Query analysis and suggestions

    After generating a query, we analyze it:

    def analyze_query(query, schema):
        issues = []

        # Check for full table scans
        if not uses_partition_filter(query, schema):
            issues.append("Warning: Full table scan. Consider filtering on partition key.")

        # Check for expensive operations
        if has_cross_join(query):
            issues.append("Warning: Cross join detected. This may be slow.")

        # Estimate cost (on-demand pricing is $5 per TB scanned)
        estimated_bytes = estimate_bytes_scanned(query, schema)
        if estimated_bytes > 1_000_000_000_000:  # 1TB
            issues.append(f"Warning: Will scan ~{estimated_bytes / 1e12:.2f}TB (${estimated_bytes / 1e12 * 5:.2f})")

        return issues

    If issues are found, we ask Claude to revise the query with the analysis as context.
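    That generate-analyze-revise loop, sketched with the model call and the analyzer injected (the function names are illustrative, not our real API):

```python
# Generate, analyze, revise: feed analyzer warnings back into the next
# generation round until the query passes or we give up.
def generate_with_review(generate_query, analyze_query, question, schema, max_rounds=3):
    feedback = []
    for _ in range(max_rounds):
        query = generate_query(question, schema, feedback)
        issues = analyze_query(query, schema)
        if not issues:
            return query, []        # clean query, ship it
        feedback = issues           # revise with the warnings as context
    return query, issues            # best effort after max_rounds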

    4. Example-driven prompting

    We include performance examples in the system prompt:

    When querying partitioned tables, always filter on the partition key:

    GOOD:
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)

    BAD:
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)

    The GOOD version prunes partitions before scanning. The BAD version scans all partitions then filters.

    Results: Average query cost dropped from $1.20 to $0.18 per query.

    Lesson 3: Users Ask Questions Your Schema Can't Answer

    This surprised me: About 30% of user questions can't be answered from their data warehouse alone.

    Examples:

    User: "Compare our Facebook ROAS to industry average"
    Problem: Industry data isn't in their warehouse

    User: "Show me campaigns targeting women 25-34"
    Problem: Demographic targeting data isn't stored in impressions table

    User: "Why did traffic drop yesterday?"
    Problem: External factors (competitors, seasonality, news) aren't in the data

    Initial approach: Tell the user we can't answer.

    User response: Frustration and churn.

    What We Did About It

    1. Augmented data sources

    For common external data needs, we maintain reference datasets:

    • Industry benchmarks (by vertical and channel)
    • Seasonality indices
    • Holiday calendars
    • Platform changes/updates

    These are separate BigQuery tables users can optionally enable.

    2. Partial answers with context

    Instead of "I can't answer that," we say:

    "I can show you your campaign performance for that audience segment, but I don't have industry benchmark data to compare against. Would you like to see your historical performance instead?"

    3. Multi-source reasoning

    When the data warehouse is insufficient, Claude can:

    • Reference external knowledge ("Facebook launched iOS 14 targeting restrictions in April 2021, which might explain the drop")
    • Suggest what additional data would help
    • Provide qualified insights based on available data

    Example:

    User: "Why did our Instagram campaign suddenly get expensive?"

    Claude: "Your Instagram CPM increased 45% starting March 15th. While I can't see Instagram's platform changes directly, I notice this coincides with when Meta typically adjusts ad auction dynamics for Q2. Your competitor spend in the competitor_analysis table also increased 60% in that period, which could be driving auction pressure. I'd recommend checking Meta's advertiser updates for mid-March."

    This is way more helpful than "I don't have that data."

    Lesson 4: Result Sets Are Often Unusably Large

    When a user asks "show me all campaigns," they don't realize they have 50,000 campaigns.

    Naive approach: Return all 50,000 rows.

    Problems:

    • Takes forever to query
    • Takes forever to transmit
    • Crashes the browser trying to render
    • User can't meaningfully review that much data

    Better approach: be intelligent about limiting and summarizing.

    What We Did About It

    1. Smart defaults

    We automatically limit queries unless explicitly asked for more:

    -- User asks: "Show me campaigns"
    -- We actually run:
    SELECT * FROM campaigns
    ORDER BY spend DESC -- or most relevant order
    LIMIT 100

    Then we tell the user: "Showing top 100 campaigns by spend. There are 47,342 total. Would you like to filter or see more?"

    2. Automatic summarization

    For large result sets, we ask Claude to summarize:

    I found 12,347 campaigns. Here are the key patterns:

    - Top 10 campaigns represent 67% of total spend
    - 8,234 campaigns (67%) have zero conversions in the last 30 days
    - Campaign performance varies widely by category:
      - Branded: avg ROAS 8.2
      - Non-branded: avg ROAS 3.1

    Would you like me to:
    1. Show top-performing campaigns
    2. Show underperforming campaigns needing attention
    3. Break down by specific category

    3. Progressive loading

    For medium-sized results (100-1000 rows), we:

    • Load first 50 rows immediately
    • Stream remaining rows
    • Allow user to stop loading if they found what they need

    4. Aggregation suggestions

    If a query would return >10k rows, we suggest aggregation:

    "This query would return 45,000 individual ad impressions. Would you like me to aggregate by day/campaign/creative instead?"

    Lesson 5: Error Handling is Half the Product

    Queries fail. A lot. Users need to understand why and what to do about it.

    Categories of failures:

    • Syntax errors (AI generated invalid SQL)
    • Schema errors (referenced non-existent columns)
    • Permission errors (can't access certain tables)
    • Timeout errors (query too slow)
    • Cost errors (query too expensive)
    • Resource errors (BigQuery quota exceeded)

    Naive error handling:

    Error: Query failed
    Details: [BigQuery error message]

    User interpretation: "This thing is broken."

    What We Did About It

    1. Error translation

    We intercept BigQuery errors and translate them:

    def translate_error(bigquery_error):
        if "Exceeded rate limits" in bigquery_error:
            return {
                "user_message": "Your BigQuery project is currently at its query limit. This usually resolves in a few minutes.",
                "suggestion": "Wait 2-3 minutes and try again, or consider upgrading your BigQuery quota.",
                "can_retry": True
            }

        if "Not found: Table" in bigquery_error:
            table_name = extract_table_name(bigquery_error)
            return {
                "user_message": f"I tried to query table '{table_name}' but it doesn't exist in your dataset.",
                "suggestion": "Check if the table name is correct or if you need to connect additional datasets.",
                "can_retry": False
            }

        # ... more cases

    2. Automatic retry with fixes

    For certain errors, we automatically attempt fixes:

    Syntax errors: Ask Claude to fix the SQL with the error message as context.

    Schema errors: Refresh schema and regenerate query.

    Timeout errors: Suggest sampling or narrower date range.
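    A sketch of that retry loop, with the repair steps injected. `fix_sql` stands in for the Claude revision call and `refresh_schema` for the schema reload; both names are illustrative:

```python
# Error-driven retry: repair what we can, surface what we can't.
# Cost/timeout errors are deliberately not blind-retried.
def run_with_retry(run_query, fix_sql, refresh_schema, sql, max_attempts=2):
    msg = ""
    for _ in range(max_attempts):
        try:
            return {"ok": True, "rows": run_query(sql)}
        except Exception as e:
            msg = str(e)
            if "Syntax error" in msg:
                sql = fix_sql(sql, msg)      # repair SQL with the error as context
            elif "Not found" in msg:
                refresh_schema()             # stale schema: reload metadata
                sql = fix_sql(sql, msg)      # then regenerate the query
            else:
                break                        # not a retryable class of error
    return {"ok": False, "error": msg}
```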

    3. Graceful degradation

    When a query fails, we try to provide partial value:

    "I couldn't complete the full analysis due to query timeout, but based on the schema, I can tell you that your campaigns table has 12,347 unique campaigns. Would you like me to try a simpler query, or would you prefer to run this async?"

    4. User education

    After errors, we explain what happened and how to avoid it:

    "This query scanned 5TB of data, which exceeded the 1TB limit we set for interactive queries. For large analyses like this, you can:

    • Narrow the date range (e.g., last 7 days instead of all time)
    • Sample the data (e.g., 10% of impressions)
    • Run as a background job and get notified when complete"

    Impact: User retry rate after errors increased from 15% to 68%.

    Lesson 6: Cost Control is Not Optional

    Early in production, a user asked: "Analyze all impression data."

    Our AI generated a query that scanned 47TB.

    Cost: $235.

    We found out when the user complained about their BigQuery bill.

    What We Did About It

    1. Query cost estimation before execution

    from google.cloud import bigquery

    def estimate_query_cost(client, query):
        # A dry run reports bytes processed without actually executing
        job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
        job = client.query(query, job_config=job_config)
        bytes_scanned = job.total_bytes_processed

        # BigQuery on-demand pricing: $5 per TB scanned
        estimated_cost = (bytes_scanned / 1e12) * 5

        return estimated_cost, bytes_scanned

    2. Tiered confirmation

    • < $0.10: Run automatically
    • $0.10 - $1.00: Show estimate, auto-run
    • $1.00 - $10: Require user confirmation
    • > $10: Require explicit confirmation + suggest optimization
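    The tiers above as a lookup. Thresholds are in USD and match the list; the action names are ours:

```python
# Map an estimated query cost (USD) to a confirmation tier.
def cost_action(estimated_cost):
    if estimated_cost < 0.10:
        return "auto_run"
    if estimated_cost < 1.00:
        return "show_estimate_and_run"
    if estimated_cost <= 10.00:
        return "require_confirmation"
    return "confirm_and_suggest_optimization"
```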

    3. Cost budgets

    Users can set daily/monthly query budgets. As spending approaches the budget:

    • 80% of budget: Warning notification
    • 100% of budget: Require admin approval for queries

    4. Cost-aware query optimization

    When generating queries, we include cost context:

    "I can answer this two ways:

    Option A (fast, expensive):

    • Scan all 5TB of impression data
    • Complete in ~8 seconds
    • Cost: ~$25

    Option B (slower, cheap):

    • Use pre-aggregated daily summary table
    • Complete in ~3 seconds
    • Cost: ~$0.01

    Which would you prefer?"

    5. Optimization suggestions

    After expensive queries, we suggest improvements:

    "This query cost $5.50. You could reduce cost by:

    • Using the aggregated table instead of raw impressions (95% cost reduction)
    • Creating a materialized view for this analysis
    • Partitioning by date and querying shorter time ranges"

    Lesson 7: Schema Complexity Matters More Than Data Size

    One client has 200TB of data across 5 simple tables. Queries work great.

    Another client has 5TB of data across 200 tables with complex relationships. Queries are a mess.

    The problem: AI needs to understand relationships, not just tables.

    What We Did About It

    1. Schema graph representation

    We build a relationship graph:

    {
      "tables": {
        "campaigns": {
          "columns": [...],
          "relationships": [
            {
              "table": "ad_groups",
              "type": "one_to_many",
              "join_on": "campaigns.id = ad_groups.campaign_id"
            }
          ]
        }
      }
    }

    2. Smart context pruning

    Instead of sending all 200 tables to the AI, we:

    • Identify which tables are likely relevant based on the question
    • Send only those tables + their direct relationships
    • Keep total context under 40k tokens
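    A rough sketch of the pruning step, assuming the relationship-graph shape shown above. The keyword-overlap relevance score and the word-count "token" estimate are deliberately crude placeholders for the real versions:

```python
# Rank tables by keyword overlap with the question, keep the best ones under
# a token budget, and pull in each kept table's direct relationships.
def prune_schema(question, schema_graph, budget_tokens=40_000):
    words = set(question.lower().split())

    def score(name, info):
        text = (name + " " + info.get("description", "")).lower()
        return sum(1 for w in words if w in text)

    ranked = sorted(schema_graph.items(), key=lambda kv: score(*kv), reverse=True)
    selected, used = [], 0
    for name, info in ranked:
        if name in selected:
            continue                           # already pulled in as a neighbor
        cost = len(str(info).split())          # crude token estimate
        if used + cost > budget_tokens:
            break
        selected.append(name)
        used += cost
        for rel in info.get("relationships", []):
            if rel["table"] not in selected:
                selected.append(rel["table"])  # keep direct neighbors too
    return selected
```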

    3. Schema documentation

    We encourage users to document their schemas:

    {
      "table": "campaigns",
      "description": "Marketing campaigns across all channels",
      "columns": {
        "campaign_type": {
          "description": "Type of campaign: 'brand', 'perf', 'retarget'",
          "common_values": ["brand", "perf", "retarget"]
        }
      }
    }

    This helps the AI understand domain-specific terminology.

    4. Query templates

    For common analysis patterns, we maintain templates:

    "To calculate campaign ROAS, join campaigns + conversions + costs, group by campaign, calculate revenue / cost."

    When a question matches a pattern, we use the template as a starting point.

    Lesson 8: Streaming Improves Perceived Performance

    Even when queries take 15 seconds, users are more patient if they see progress.

    What We Did About It

    1. Thinking indicators

    We show what the AI is doing:

    • "Understanding your question..."
    • "Analyzing schema..."
    • "Generating query..."
    • "Running query (this may take 10-15 seconds)..."
    • "Analyzing results..."

    2. Partial results

    For queries with multiple steps:

    • Show results from step 1 while step 2 runs
    • Stream result rows as they arrive
    • Update summary statistics in real-time

    3. Progress estimation

    BigQuery doesn't provide progress updates, but we fake it:

    // Show estimated progress for UX (BigQuery exposes no real progress signal)
    let progress = 0;
    const estimatedTime = estimateQueryTime(query);
    const progressInterval = setInterval(() => {
      progress = Math.min(progress + 10, 90); // cap at 90% until the query returns
      updateUI(progress);
    }, estimatedTime / 10);

    Not accurate, but better than a blank screen.

    Lesson 9: Caching is Harder Than It Looks

    Caching BigQuery results seems simple: hash the query, store the result.

    Problems:

    • Data freshness: Cached result might be stale
    • Schema changes: Cached query might be invalid after schema update
    • Parameterization: Similar queries with different dates aren't cache hits
    • Storage costs: Caching 1M rows * 100 queries = expensive

    What We Did About It

    1. TTL based on data freshness

    We ask users about data update frequency:

    • Real-time data (e.g., live campaigns): 5-minute TTL
    • Daily data (e.g., GA4): 24-hour TTL
    • Historical data: 7-day TTL

    2. Semantic caching

    Instead of exact query matching, we use semantic similarity:

    "Show me campaign performance last 7 days" and "What's my campaign ROAS for the last week?" are semantically similar → cache hit.
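    A sketch of the lookup, with the embedding model injected (any sentence-embedding function works; we don't specify a model here, and the 0.9 threshold is something to tune). Cached questions are stored alongside their embeddings and matched by cosine similarity:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# cache_entries: list of (cached_question_embedding, cached_result).
# Returns the cached result of the most similar past question, or None.
def semantic_lookup(embed, cache_entries, question, threshold=0.9):
    q = embed(question)
    best_sim, best_result = 0.0, None
    for emb, result in cache_entries:
        sim = cosine(q, emb)
        if sim > best_sim:
            best_sim, best_result = sim, result
    return best_result if best_sim >= threshold else None
```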

    3. Partial caching

    For queries with date ranges, we cache by date:

    • User asks for last 30 days
    • We have cache for days 1-28
    • Query only last 2 days, merge with cache
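    Sketched as a per-day cache merge: `cache` maps ISO dates to row lists, and `fetch_day` (an illustrative stand-in for the single-day BigQuery query) is only called for the days the cache is missing:

```python
from datetime import date, timedelta

# Per-day cache merge: only the missing days hit BigQuery.
def rows_for_range(cache, fetch_day, start, end):
    rows = []
    day = start
    while day <= end:
        key = day.isoformat()
        if key not in cache:
            cache[key] = fetch_day(day)   # query only the missing day
        rows.extend(cache[key])
        day += timedelta(days=1)
    return rows
```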

    4. Smart invalidation

    When schema changes, we invalidate affected cached queries instead of all cache.

    Lesson 10: BigQuery Isn't Always the Answer

    Sometimes, the best solution is not querying BigQuery at all.

    Example: User asks "What's my total spend today?"

    Option A: Query BigQuery

    • Generate SQL
    • Run query (3-5 seconds)
    • Return result

    Option B: Maintain real-time aggregates

    • Update running totals on data insert
    • Return from cache (<100ms)

    For frequently-asked questions, precomputation beats querying.

    What We Did About It

    1. Identify hot queries

    We log all questions and identify the most common:

    • "Total spend today"
    • "Campaign count"
    • "Top 10 campaigns by spend"
    • etc.

    2. Materialized views

    For hot queries, we create materialized views that auto-update:

    CREATE MATERIALIZED VIEW daily_campaign_summary AS
    SELECT
      DATE(_PARTITIONTIME) as date,
      campaign_id,
      SUM(cost) as total_cost,
      SUM(conversions) as total_conversions
    FROM ad_impressions
    GROUP BY date, campaign_id

    3. Query rewriting

    When a question matches a hot pattern, we rewrite to use the materialized view:

    User query: "total spend today"
    AI generates: `SELECT SUM(cost) FROM ad_impressions WHERE ...`
    We rewrite: `SELECT SUM(total_cost) FROM daily_campaign_summary WHERE date = CURRENT_DATE()`

    Result: 15 seconds → 0.8 seconds, $0.50 → $0.001.

    What I'd Tell My Past Self

    If I could go back to when we started building this, here's what I'd say:

    1. Build cost controls from day one

    We added them after the $235 query incident. Should have been there from the start.

    2. Instrument everything

    We didn't track query performance metrics early. Now we track:

    • Query time distribution
    • Cost per query
    • Cache hit rates
    • Error rates by type
    • Most common queries

    This data drives optimization priorities.

    3. Design for failure

    Users will ask impossible questions. Queries will timeout. BigQuery will have outages.

    Build graceful degradation into the product from the beginning.

    4. Performance perception matters more than actual performance

    A 10-second query with progress indicators feels faster than a 7-second query with a blank screen.

    5. Invest in query optimization infrastructure

    We spent weeks building query rewriting, cost estimation, and performance analysis tools. Worth every hour.

    6. Users don't think in SQL

    Teaching the AI to generate performant SQL is necessary but not sufficient. You need to translate user intent to optimal data access patterns.

    The Honest Truth

    Running real-time BigQuery analysis in production is hard.

    Not "challenging" hard. Genuinely difficult engineering work that requires:

    • Deep BigQuery knowledge
    • AI prompt engineering
    • Cost management
    • Performance optimization
    • Error handling
    • UX design

    But when it works—when a user asks a complex question and gets an insightful answer in 3 seconds—it feels like magic.

    We're processing thousands of queries daily now. The system works. But it took months of optimization, iteration, and learning from production issues.

    If you're building something similar, expect to spend more time on performance, cost, and reliability than on the AI itself.

    The AI is the easy part. Making it work at scale is the challenge.

    ---

    About Berner Setterwall

    Berner is CTO and co-founder of Cogny, where he's building AI-powered marketing automation on top of BigQuery. He's spent years optimizing large-scale data systems, from Campanja's AI platforms to Cogny's real-time analytics engine. He specializes in making complex data systems fast, reliable, and cost-effective.

    Want to see our BigQuery integration in action?

    Cogny connects directly to your BigQuery warehouse and provides real-time AI-powered analysis. We've spent months optimizing for performance, cost, and reliability so you don't have to. Book a demo to see how we make BigQuery analysis feel instant.