Business decisions are made daily on analytics that look correct. The query ran. The result came back. The number was plausible enough that nobody questioned it. And somewhere in that chain, the wrong decision got made.
This isn’t a hypothetical risk. I ran an experiment: five common analytics questions, each with nothing but a raw table schema and a plain business question. No extra context, no hints about what to watch for. The kind of prompt a PM or analyst would actually write.
The LLM got some right, got some wrong, and gave mixed answers on others. The output was always confident. The SQL always ran. Whether the result was correct depended entirely on context the LLM had to assume — and didn’t always assume correctly.
The Experiment
Each scenario followed the same structure: provide a raw schema, ask a plain business question, capture the SQL generated, run it against sample data, and compare it to the correct answer. No system prompts. No additional instructions. The same minimal context a real analyst request carries.
Here are three of the five scenarios.
Scenario 1: Total Revenue
Schema provided:
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer TEXT,
order_total REAL,
order_date TEXT
);
CREATE TABLE order_items (
order_id INTEGER,
item_name TEXT,
item_amount REAL
);
Question: “What’s our total revenue?”
The LLM got this one right. It returned SELECT SUM(order_total) FROM orders and warned against joining to order_items because it would inflate the result. This pitfall is well-documented enough that the model has seen it thousands of times in training data. It pattern-matched a known fix rather than reasoning about grain.
Scenario 2: Best Sales Day
Schema provided:
CREATE TABLE events (
event_id INTEGER PRIMARY KEY,
event_type TEXT,
timestamp_utc TEXT,
amount REAL
);
Question: “Which day of the week gets the most purchases?”
No mention of timezone. No mention of where the business operates.
What the LLM generated:
SELECT
strftime('%w', timestamp_utc) AS day_of_week,
COUNT(*) AS purchase_count
FROM events
WHERE event_type = 'purchase'
GROUP BY day_of_week
ORDER BY purchase_count DESC
LIMIT 1;
Result: Monday, 8 purchases.
Correct answer: Sunday, 10 purchases.
The column name says _utc. The LLM read it and still grouped by UTC boundaries. Four Sunday afternoon purchases in Mountain Time fell on Monday in UTC, inverting the result entirely. When I reran the same question with “business operates in Mountain Time” added to the prompt, it handled it correctly. The failure wasn’t a capability gap. It was a missing assumption the model never flagged.
Correct query:
SELECT DATE(timestamp_utc, '-7 hours') AS day,
COUNT(*) AS purchases
FROM events
WHERE event_type = 'purchase'
GROUP BY day
ORDER BY purchases DESC
LIMIT 1;
Scenario 3: January Cohort Retention
Schema provided:
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
signup_date TEXT,
status TEXT,
last_active TEXT
);
CREATE TABLE usr_archive (
user_id TEXT PRIMARY KEY,
signup_date TEXT,
removed_date TEXT
);
Question: “What’s the retention rate for our January 2025 cohort?”
No explanation of what usr_archive means. No mention of hard deletes. Both tables were in the schema.
The LLM included usr_archive in the cohort using a UNION, which is correct. But it defined “retained” as still existing in the users table rather than having been recently active — and flagged this as an assumption rather than asking. The model’s accuracy on this scenario varied with table naming: when the archive table was named deleted_users it handled it consistently; with usr_archive the results were less reliable. If it had only queried the users table, it would have reported 100% retention. The actual January cohort was 25 users. 17 were archived. True retention was 32%.
Full Results
| Scenario | LLM Result | Correct Answer | Outcome |
|---|---|---|---|
| Total Revenue | Correct | $96,000 | Pattern-matched a known pitfall |
| Best Sales Day | Monday | Sunday | No timezone conversion — result inverted |
| Conversion Rate | Mixed | 8.09% | Gave naive answer first, correct second |
| Average Deal Size | $40,143 | $18,733 | AVG() silently excluded NULLs |
| Cohort Retention | Mostly correct | 32% | Dependent on how readable the table name was |
The Pattern
Across all five scenarios, the same dynamic played out. When given clean schemas with descriptive names and explicit business context, the LLM performed well. When given minimal prompts against ambiguous schemas — the way real analytics questions actually arrive — it made silent assumptions. The SQL was valid either way. The numbers were plausible either way. The difference only surfaced when someone ran the correct query alongside it and compared.
Raw text-to-SQL accuracy on frontier models currently sits around 64–85% depending on schema quality and context. That sounds reasonable until you account for what the remaining 15–36% looks like: not an error message, but a confidently returned wrong number on a dashboard that drives a headcount decision or a budget allocation.
What This Requires
The framing of “human in the loop” is almost always too vague to be useful. A PM reviewing a query output is in the loop. That doesn’t mean the PM will catch a timezone error, a JOIN fanout, or a retention calculation that silently excludes churned users. Catching those things requires knowing what to look for.
The assumptions that break analytics queries are specific: data grain (which table a metric belongs to, and what happens when you join across grains), business context (timezone, fiscal calendar, metric definitions that aren’t encoded in the schema), data lifecycle (where records go when they’re deleted, and what that means for historical calculations), and statistical assumptions (whether AVG() does what you expect when NULLs are present).
Organizations scaling AI-generated analytics need more than a reviewer. They need someone who can encode those assumptions into infrastructure — semantic layers that define metrics once and serve them consistently, curated views that handle timezone conversions and archive unions before the LLM ever sees the data, and validation patterns that catch silent failures before they reach production. The human expertise doesn’t disappear. It moves upstream into architecture, where it scales.
A Structured Approach
This is the work that sits at the intersection of data engineering and AI infrastructure. Getting AI-generated analytics right in production requires the same foundation that good analytics has always required: clean data models, governed definitions, and an architecture designed by someone who understands what can go wrong and has seen what does. If your organization is moving toward AI-assisted analytics or building the data layer that supports it, InfoSight brings the production experience and systems-level thinking to do it correctly from the start.
Explore our services →The full experiment, including runnable code, sample data, and a script that sends each scenario to an LLM API and captures the assumptions it makes, is available on GitHub. The accompanying technical writeup covers all five scenarios in detail.