Why AI Analytics Give Wrong Numbers (And How to Fix It)

You asked your AI analytics tool: "What was our ROAS last month?" It said 3.8x. Your CFO says 2.9x. Your media buyer has a third number. Same data, three different answers: a sign that something is architecturally broken.

This is the uncomfortable reality of AI analytics today. On simple queries (order counts, total sessions, basic sums), accuracy is high. But the moment you ask something that requires joining tables, applying financial rules, or interpreting metric definitions, accuracy falls dramatically. The BIRD-Interact benchmark, which tests multi-step database interactions closer to real-world complexity, shows accuracy around 16–17% for advanced models. Simpler benchmarks like Spider show much higher numbers, but those test isolated, single-table queries that do not reflect how ecommerce data actually works.

The gap matters because ecommerce queries are inherently complex. "What was our blended ROAS by channel last quarter?" requires joining order data with ad spend, filtering by channel, calculating revenue net of returns, and dividing by spend. Each of those steps can silently produce an incorrect result.
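To make those steps concrete, here is a minimal Python sketch with hypothetical in-memory rows standing in for the real order and ad-spend tables; every field name and figure is an assumption, not your schema.

```python
from collections import defaultdict

# Hypothetical rows standing in for Shopify orders and ad-platform spend exports.
orders = [
    # (channel, gross_revenue, returned_amount)
    ("meta", 120.0, 0.0),
    ("google", 80.0, 80.0),   # fully refunded order
    ("meta", 60.0, 10.0),
]
ad_spend = [
    # (channel, spend)
    ("meta", 50.0),
    ("google", 40.0),
]

# Steps 1-2: aggregate revenue net of returns, grouped by channel.
revenue = defaultdict(float)
for channel, gross, returned in orders:
    revenue[channel] += gross - returned

# Step 3: aggregate spend per channel.
spend = defaultdict(float)
for channel, amount in ad_spend:
    spend[channel] += amount

# Step 4: divide revenue by spend to get ROAS per channel.
roas = {ch: revenue[ch] / amt for ch, amt in spend.items() if amt > 0}
print(roas)  # {'meta': 3.4, 'google': 0.0}
```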

This is not a temporary issue that will be fixed with better foundation models. No language model, regardless of capability, knows your specific business. None of them know what "revenue" means in your Shopify store versus your ad platform. None of them know whether your ROAS calculation should include refunds. The problem is not AI capability. It is context. And it has a clear architectural solution.

The short version: AI needs a semantic layer to give you verifiable numbers. Without it, every answer is an educated guess.

The Numbers Do Not Lie (But AI Does)

Benchmark data paints a consistent picture: AI accuracy varies dramatically by query complexity.

  • Simple queries (counts, sums, single table): 85%+ accuracy
  • Moderate complexity (filtered aggregations, date ranges): 65–75% accuracy
  • High complexity (multi-table joins, business logic, calculated fields): 16–50% accuracy depending on the benchmark, with the more realistic, multi-step benchmarks like BIRD-Interact at the lower end

The most consequential ecommerce queries are almost always high complexity. And the scary part: wrong answers look identical to right answers. A number generated by hallucinated SQL has the same formatting, the same decimal places, the same confident presentation as a number from correct SQL. There is no visual signal that says "this might be wrong."

Wrong numbers do more damage than no numbers. No numbers prompt investigation. Wrong numbers lead teams to scale a losing channel, cut a winning one, or report incorrect results to stakeholders.

5 Reasons AI Gets Your Numbers Wrong

1. It Does Not Know Your Metric Definitions

"Revenue" has at least a dozen legitimate definitions in ecommerce. Gross vs. net. Including shipping or excluding it. In the currency of the sale or converted to reporting currency. With tax or without. On the date of the order or the date of payment.

When you ask an AI "What was revenue last month?" it picks one of these definitions. The next time you ask, it might pick a different one. Different phrasing, same question, different answer.
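A small illustration of how far apart those definitions can land, using hypothetical order rows (the field names and amounts are invented; the divergence is the point):

```python
# Hypothetical order rows: (items_total, shipping, tax, refunded_amount)
orders = [
    (100.0, 10.0, 8.0, 0.0),
    (50.0, 5.0, 4.0, 50.0),
]

gross_incl_shipping = sum(items + ship for items, ship, _, _ in orders)      # 165.0
net_excl_shipping = sum(items - refund for items, _, _, refund in orders)    # 100.0
net_incl_shipping_and_tax = sum(items + ship + tax - refund
                                for items, ship, tax, refund in orders)      # 127.0

# Three "legitimate" revenue figures from the same two orders.
print(gross_incl_shipping, net_excl_shipping, net_incl_shipping_and_tax)
```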

Each ad platform compounds this. Google Ads uses last-click attribution with its own window. Meta uses its own model and tracking limitations. TikTok uses yet another approach. An AI connecting to all three has no way to know whether these are comparable metrics without a governing layer to manage that context.

2. It Hallucinates Join Paths

When tables can be joined multiple ways, AI has to choose. In ecommerce data, this is a minefield.

"How many customers made a repeat purchase?" Your database might have a customers table, an orders table, and a sessions table. The AI could join customers to orders directly, or customers to sessions to orders, or use a subquery. Each approach might return a different count  all technically valid SQL, all potentially different numbers. The AI picks one without knowing which your data team intended.

3. It Does Not Understand Business Context

Business metrics carry context invisible in raw database schemas.

"New customers"  new this month? First purchase ever? New to this marketing channel? Without a prior order in 90 days? Each definition gives a different number, and no schema file contains the answer.

"Last quarter"  calendar quarter or fiscal quarter (which might start in February for your company)? "Top 10 products"  by revenue, by units sold, by margin? A human analyst would ask a clarifying question. AI proceeds with an assumption and returns an answer that might be answering a different question than the one you asked.

4. It Invents Calculations

When AI cannot find a pre-built metric, it creates one. This is where analytics hallucinations differ from chatbot hallucinations, and why they are especially dangerous.

Common incorrect calculations AI makes without a semantic foundation: using averages where sums are appropriate, forgetting to filter out test orders, applying discounts twice or not at all, including cancelled orders in revenue, double-counting refunds in multi-currency setups, or missing currency conversion entirely.
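Two of those mistakes are easy to demonstrate on hypothetical order rows (the field names are invented):

```python
# Counting cancelled orders and forgetting to exclude test orders.
orders = [
    {"total": 100.0, "status": "paid",      "is_test": False},
    {"total": 250.0, "status": "cancelled", "is_test": False},
    {"total": 1.0,   "status": "paid",      "is_test": True},   # staging checkout
]

naive_revenue = sum(o["total"] for o in orders)                  # 351.0
governed_revenue = sum(
    o["total"] for o in orders
    if o["status"] == "paid" and not o["is_test"]                # business rules applied
)                                                                # 100.0
print(naive_revenue, governed_revenue)
```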

These are systematic mistakes rooted in the AI not understanding the business logic behind each calculation. No amount of training on generic data teaches an AI your specific rules.

5. It Cannot Tell You When It Is Wrong

Generic AI always returns an answer. No "I'm not sure" mode. No confidence score. No "this metric is not defined." The AI generates an answer and presents it the same way it presents an accurate one. You have no way to distinguish high-confidence responses from low-confidence ones.

This is fundamentally different from how a human analyst works. A good analyst says: "I can answer that with this data, but here is what I had to assume." AI tools do not volunteer their assumptions.

Why Better Models Will Not Fix This

The instinct when AI gets something wrong is to wait for a better model. The next generation of foundation models will have better reasoning, better multi-step logic, better code generation.

These improvements will happen. They will not fix this specific problem.

No language model, regardless of capability, knows your business unless you teach it. A model trained on the entire internet has general knowledge about what revenue means in a generic context. It does not know that your company uses calendar quarters, excludes test orders, defines "returning customer" as someone with more than one order in any 12-month period, or includes refunds in contribution margin but not in ROAS.

This is not a model intelligence problem. It is architectural. AI needs access to your specific business definitions, and raw database schemas do not carry those definitions.

Analogy: you hire the most brilliant accountant in the market. You give them your books with no chart of accounts, no accounting policy, no explanation of how revenue is recognized. They will still produce incorrect results because they lack the specific context. The solution is not a smarter accountant. It is giving any accountant the right documentation.

That documentation is your semantic layer.

The Fix: A Semantic Layer

A semantic layer solves the problem at the architectural level. Rather than hoping AI infers business context from raw data, you provide that context explicitly through governed metric definitions.

What a Semantic Layer Does

A semantic layer pre-defines every metric with exact formulas and data sources:

  • Revenue = gross sales minus returns and discounts from Shopify orders, excluding test orders, for the selected date range
  • ROAS = revenue divided by ad spend across all connected channels, using last-click attribution
  • New customer = customer whose first-ever order falls within the selected date range
  • Contribution margin = revenue minus ad spend, COGS, and shipping, divided by revenue

When you ask a question, the AI calls these pre-defined metrics rather than generating SQL from scratch. The semantic layer handles SQL generation and enforcement of business rules. The AI's job is natural language understanding, not business logic.
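Here is a rough sketch of what governed definitions might look like in code. The table names, columns, and formulas are placeholders, not Polar's actual model; a production semantic layer would also carry joins, grains, ownership, and tests.

```python
# A minimal governed metric registry. All names and formulas are assumptions.
SEMANTIC_LAYER = {
    "revenue": {
        "expr": "SUM(gross_sales - returns - discounts)",
        "source": "shopify_orders",
        "filters": ["is_test_order = FALSE"],
    },
    "blended_roas": {
        "expr": "SUM(gross_sales - returns) / NULLIF(SUM(ad_spend), 0)",
        "source": "orders_joined_to_ad_spend",
        "filters": ["is_test_order = FALSE"],
    },
}

def compile_metric(name: str, start: str, end: str) -> str:
    """Return validated SQL for a governed metric, or fail loudly if it is not defined."""
    if name not in SEMANTIC_LAYER:
        raise KeyError(f"Metric '{name}' is not governed; refusing to guess.")
    m = SEMANTIC_LAYER[name]
    where = " AND ".join(m["filters"] + [f"order_date BETWEEN '{start}' AND '{end}'"])
    return f"SELECT {m['expr']} FROM {m['source']} WHERE {where}"

print(compile_metric("blended_roas", "2024-05-01", "2024-05-31"))
```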

The Accuracy Difference

Google and Looker research found that semantic layers reduce AI analytics errors by 66%. The majority of remaining errors come from questions that fall outside governed definitions, and when that happens, a well-built semantic layer returns an error rather than a guess. Failing loudly beats failing silently.

The pattern is clear across every benchmark: constrain what AI can query to pre-defined, governed metrics and accuracy improves dramatically. Let AI write ad-hoc SQL against raw tables and it fails on exactly the queries that matter most.

How It Works in Practice

You ask: "What was blended ROAS last month?"

Without a semantic layer: AI writes SQL against raw tables, might use the wrong join path, might use the wrong revenue definition, returns a number that looks correct.

With a semantic layer: AI calls the blended_roas metric definition. The semantic layer specifies: ROAS = (revenue minus returns) divided by ad spend across all channels, using last-click attribution, for the selected date range. The semantic layer generates validated SQL. You get the same number your CFO gets.

Same question. Same answer. Every time.
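A toy sketch of that division of labor, where a hypothetical interpret() function stands in for the model's language-understanding step and everything else belongs to the governed layer:

```python
GOVERNED_METRICS = {"blended_roas", "revenue", "contribution_margin"}

def interpret(question: str) -> dict:
    """Toy stand-in for the natural-language-understanding step (normally the LLM's job)."""
    q = question.lower()
    metric = "blended_roas" if "roas" in q else None
    period = "last_month" if "last month" in q else None
    return {"metric": metric, "period": period}

def answer(question: str) -> str:
    request = interpret(question)
    if request["metric"] not in GOVERNED_METRICS:
        return "Error: no governed metric matches this question."   # fail loudly, not silently
    # Here the semantic layer (not the model) would compile and run validated SQL.
    return f"governed metric '{request['metric']}' for period '{request['period']}'"

print(answer("What was blended ROAS last month?"))
print(answer("Revenue from our brick-and-mortar stores?"))  # returns an error, not a number
```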

How to Check If Your AI Analytics Tool Has This Problem

Five tests, five minutes. They can save your team from hours of decisions made on incorrect data. A sketch for automating the first two appears after Test 5.

Test 1: Ask the same question three different ways. "Revenue last month." "Our monthly revenue in the past 30 days." "Total sales for the period ending 30 days ago." If you get three different answers, your AI is interpreting the question differently each time.

Test 2: Ask for a metric you already know. You know your revenue for last month. Ask the AI and compare. If it does not match, the AI is calculating incorrectly.

Test 3: Ask a question that should be unanswerable. "Revenue from our brick-and-mortar stores." If you do not have physical locations, the AI should say it lacks that data. If it returns a number, it is guessing, and if it guesses here, it guesses elsewhere too.

Test 4: Ask for the definition it is using. "What is the definition of ROAS you are using?" If the AI cannot show its work, you cannot validate it.

Test 5: Ask a compound question. "What was our ROAS last month, and how does it compare to the prior month by channel?" If the AI struggles with multi-step logic or gives inconsistent answers, it is working without sufficient context.

If your tool fails any of these, treat its outputs as approximate until verified manually.
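If you want to repeat these checks on a schedule, a minimal harness might look like the sketch below. The ask() client, the 1% tolerance, and the reference revenue figure are all assumptions you would replace with your own:

```python
def ask(question: str) -> float:
    # Hypothetical client: wire this to whatever AI analytics tool you are evaluating.
    raise NotImplementedError("Connect this to the tool under evaluation.")

PHRASINGS = [
    "Revenue last month",
    "Our monthly revenue in the past 30 days",
    "Total sales for the period ending 30 days ago",
]
KNOWN_REVENUE = 123_456.78   # placeholder: the number your finance team signed off on

def run_checks() -> None:
    answers = [ask(p) for p in PHRASINGS]
    if len({round(a, 2) for a in answers}) > 1:
        print("FAIL Test 1: same question, different answers:", answers)
    if abs(answers[0] - KNOWN_REVENUE) > 0.01 * KNOWN_REVENUE:   # 1% tolerance is an assumption
        print("FAIL Test 2: answer differs from the known figure by more than 1%.")
```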

The Cost of Not Fixing This

Wasted ad spend. If your AI reports ROAS at 3.8x when it is actually 2.9x, you keep scaling a campaign that is not profitable. For a brand spending $50K/month, that gap means tens of thousands in misallocated budget.

Missed opportunities. If AI underreports performance on a channel due to incorrect attribution logic, your team cuts budget from something that is actually working.

Stakeholder trust. When your CFO, media buyer, and AI tool all show different numbers from the same data, teams stop trusting data-driven decisions. Rebuilding that trust takes far longer than building the right foundation initially.

Compliance risk. For brands with financial reporting requirements, incorrect data outputs can create real exposure. A semantic layer provides the audit trail and consistency required for verifiable reporting.

What to Do Next

If You Are Evaluating AI Analytics Tools

Ask every vendor: "How does your AI know what our metrics mean?" If the answer involves raw database access, schema inference, or "it learns from your data over time," you are looking at a tool that will give incorrect results on complex queries.

Ask specifically: Do you have a semantic layer? How many metrics are pre-governed? What happens when I ask a question outside the governed definitions: an error or a guess? Can our team review and modify metric definitions?

If You Are Building AI Features In-House

Start with metric definitions before model selection. Define your core 15–20 metrics with exact formulas, data sources, and business rules. Review with both your data team and finance team. Build the semantic layer first, then connect the AI model via API.

Most ecommerce brands can avoid the most common accuracy issues by governing just their top metrics: revenue (with and without returns), ROAS by channel, new customer rate, repeat purchase rate, contribution margin, and AOV. These six metrics, governed precisely, support 80% of the questions your team asks day-to-day.
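A hedged starting point for those core metrics is a plain definition file that data and finance can review together before any model is connected; the formulas below are illustrative, not prescriptive:

```python
# Revenue appears twice (gross and net) to match "with and without returns".
CORE_METRICS = {
    "revenue_gross":        "SUM(order_total)",
    "revenue_net":          "SUM(order_total - returns)",
    "roas_by_channel":      "revenue_net / ad_spend, grouped by channel",
    "new_customer_rate":    "first_time_customers / total_customers",
    "repeat_purchase_rate": "customers_with_2plus_orders / total_customers",
    "contribution_margin":  "(revenue_net - ad_spend - cogs - shipping) / revenue_net",
    "aov":                  "revenue_net / order_count",
}

for name, formula in CORE_METRICS.items():
    print(f"{name}: {formula}")
```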

If You Are a Shopify Brand

Polar Analytics solves this with a managed semantic layer built for ecommerce. Polar's layer unifies Shopify, Meta Ads, Google Ads, TikTok, Klaviyo, Recharge, Stripe, and 40+ other sources into one governed model. The Polar Pixel recovers conversion signal lost to iOS and browser restrictions through first-party tracking, while Shapley-based attribution distributes credit across channels. LTV tracks cohort behavior across repeat purchases. Contribution margin nets out ad spend, COGS, and shipping.

Ask Polar queries these pre-modeled metrics: no raw SQL, no hallucination. When a question falls outside governed definitions, it returns an error rather than a guess. Your CFO gets the same ROAS your media buyer sees.

FAQ

How accurate are AI analytics tools?
On simple queries (counts, sums, single tables), 85%+ accuracy. On complex queries involving joins, business logic, and metric definitions, accuracy drops to 16–50% depending on the benchmark, with more realistic multi-step benchmarks like BIRD-Interact at the lower end. Google and Looker research found that semantic layers reduce AI analytics errors by 66%, which is why governed metrics are the prerequisite for reliable AI analytics.

What is a semantic layer?
A structured layer between raw data and the tools that query it. It pre-defines metrics with exact formulas, data sources, and business logic, ensuring consistent answers regardless of which tool, AI agent, or API makes the query. Without one, every query is interpreted differently.

Can I use ChatGPT to analyze my ecommerce data?
ChatGPT can provide rough estimates, but it cannot guarantee accuracy without access to your specific metric definitions. Connected to a data warehouse without a semantic layer, it writes SQL against raw tables, leading to incorrect results on complex queries involving joins, filters, and business rules. For production decisions, you need a governed semantic layer between the AI and your data.

Can I trust the numbers an AI analytics tool gives me?
You can trust them if the AI operates through a governed semantic layer that constrains what it can query and how metrics are calculated. Without one, AI analytics are not trustworthy for complex queries. With one, accuracy improves dramatically, and when a question falls outside governed definitions, the system returns a clear error rather than a guess.

What does it cost when AI gets the numbers wrong?
Systematic incorrect results on your most important queries: blended ROAS, contribution margin, LTV by cohort, acquisition cost. These metrics drive budget decisions, growth strategy, and financial reporting. The cost compounds: wasted ad spend, missed opportunities, eroded team trust in data.
