95% of GenAI pilots produce zero return on investment. Not because the models are bad. Because teams start with “Step 1: use LLMs. Step 2: figure out what for.” The successful 5% all share one trait: they picked one specific internal pain point, quantified it in dollars, and executed with the simplest tool that could solve it. This explainer shows you why the problem is almost never the model, and what to do instead.
The Failure Landscape
The numbers are stark. RAND found that AI projects fail at twice the rate of regular IT projects. MIT’s NANDA lab tracked $30–40 billion in enterprise GenAI investment and found that 95% of pilots saw zero return. S&P Global reported that 42% of companies abandoned most of their AI initiatives in 2025, up from 17% the year before. More money did not solve the problem. It made it worse, because it funded more projects that started from the wrong end.
PwC’s 2026 CEO survey found that only 12% of CEOs globally report that AI has delivered both cost and revenue benefits, while 56% report neither. That’s after years of investment and executive attention. The pattern is consistent across industries and geographies: most organizations are spending real money to build things that don’t work.
The named failures are instructive. IBM spent roughly $5 billion on Watson for Oncology. Doctors rejected its treatment recommendations 88% of the time, in part because the system was trained on synthetic cases rather than real patient data. Zillow’s Offers program used ML to price homes for instant buying. It lost over $500 million in a single quarter when its models couldn’t keep up with a shifting market. Amazon built an ML recruiting tool that systematically penalized resumes containing the word “women’s” because its training data reflected historical hiring patterns. They couldn’t fix it. They abandoned it. Klarna replaced 700 customer service agents with an OpenAI-powered bot, claiming it handled 2.3 million conversations. Within a year, the CEO publicly admitted service quality had suffered. They started rehiring humans.
Each failure had a different surface cause. But the root cause was the same: they started with the technology and worked backward to find a problem for it to solve.
The Accuracy Trap
A model can score 95% accuracy and deliver zero business value. This is not a theoretical concern. It is the single most common way AI projects fail quietly.
Consider a fraud detection system. Only 1% of transactions are fraudulent. A model that predicts “not fraud” for every single transaction scores 99% accuracy. It also catches zero fraud. The number on the dashboard looks great. The finance team is losing millions.
This is the class imbalance trap, and it shows up everywhere that matters: fraud (1% positive), medical screening (2% positive), churn prediction (5% positive), manufacturing defects (0.1% positive). The rarer the thing you’re trying to catch, the more misleading accuracy becomes. What you actually need to think about is precision (how many of your alarms are real) and recall (how many real cases you catch). Those two are in tension. Pushing one up usually pushes the other down.
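A toy calculation, with made-up numbers mirroring the fraud example above, makes the trap concrete: on a 1%-fraud dataset, the always-“not fraud” model scores 99% accuracy and 0% recall.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# 10,000 transactions, 1% fraudulent
y_true = [1] * 100 + [0] * 9900
y_pred = [0] * 10000          # the "predict not-fraud for everything" model

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)            # 0.99 -- looks great on a dashboard
recall = tp / (tp + fn) if (tp + fn) else 0.0  # 0.0  -- catches no fraud at all
print(accuracy, recall)
```

Precision is not even defined here (the model never raises an alarm), which is itself a signal accuracy alone would never surface.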
The right balance between precision and recall is a business decision, not a statistical one. Akismet, the spam filter that protects WordPress, optimizes for precision above 99.99% because silently discarding a legitimate comment is worse than letting some spam through. A call center spending $10 million a year on human agents found that an AI classifier only needed to exceed 50% accuracy to be worth deploying. The bar for “useful” was far lower than anyone assumed, because speed and consistency mattered more than getting every case right.
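The call-center bar can be derived from a back-of-the-envelope cost model. All dollar figures below are illustrative assumptions, not the company’s actual numbers; the point is that the break-even accuracy falls out of costs, not benchmarks.

```python
# Hypothetical cost model: the AI is worth deploying once its handling cost
# plus the cost of its mistakes drops below the all-human baseline.
calls_per_year = 1_000_000
human_cost_per_call = 10.00   # ~$10M/year across 1M calls
ai_cost_per_call = 0.50       # assumed inference + infrastructure cost
error_cost = 15.00            # assumed escalation + goodwill cost per mistake

def annual_ai_cost(ai_accuracy):
    errors = calls_per_year * (1 - ai_accuracy)
    return calls_per_year * ai_cost_per_call + errors * error_cost

# Break-even: solve annual_ai_cost(a) == calls_per_year * human_cost_per_call
break_even = 1 - (human_cost_per_call - ai_cost_per_call) / error_cost
print(f"AI wins above {break_even:.0%} accuracy")
```

With these numbers the threshold lands well under 50%, which is why “is the model accurate?” is the wrong first question and “what does an error cost us?” is the right one.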
McKinsey found that 30–40% of potential AI impact is lost to misaligned incentives, even when the models perform well. The model works. The metric it optimizes for just does not connect to the outcome the business cares about.
The Reframing Effect
The most dramatic improvements in applied ML almost never come from better models. They come from reframing the question.
Netflix started with a classification problem: will this user like this movie? The answer was a score. But the real problem was not scoring individual movies. It was deciding the order of the entire page. Switching from “will they like it?” to “what order maximizes engagement?” required a completely different architecture, and it transformed the product. Item popularity, they discovered, is the opposite of personalization. It produces the same ordering for everyone.
Uber’s DeepETA team tried to predict arrival times from scratch. The predictions were mediocre. Then they reframed: instead of predicting the ETA, predict the error in the existing ETA. The existing system was always wrong, but it was wrong in consistent, predictable ways. By modeling the systematic error and correcting it, they reduced average ETA error by more than 50% in some cities. Same data, same compute budget, completely different question.
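The residual-prediction reframing fits in a few lines. This is not Uber’s DeepETA architecture; it is a toy correction model (assuming segment-level average residuals) that shows the shape of the idea: learn the error, not the quantity.

```python
from collections import defaultdict

def fit_residual_correction(history):
    """history: list of (segment, baseline_eta, actual_time) tuples.
    Learn the average error of the existing ETA, per road segment."""
    sums, counts = defaultdict(float), defaultdict(int)
    for segment, baseline, actual in history:
        sums[segment] += actual - baseline   # model the residual, not the ETA
        counts[segment] += 1
    return {s: sums[s] / counts[s] for s in sums}

def corrected_eta(segment, baseline, correction):
    # Fall back to the raw baseline for segments with no history
    return baseline + correction.get(segment, 0.0)

# The baseline is consistently ~3 minutes optimistic on the "bridge" segment
history = [("bridge", 10, 13), ("bridge", 12, 15), ("tunnel", 8, 8)]
correction = fit_residual_correction(history)
print(corrected_eta("bridge", 11, correction))  # 14.0
```

The existing system stays in place; the model only has to capture its systematic bias, a far easier target than predicting travel time from scratch.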
Spotify reframed recommendation from “predict what users will like” to an explore/exploit tradeoff. Pure exploitation (recommend based on history) traps users in a filter bubble and starves new artists. Adding deliberate exploration through a multi-armed bandit framework improved both user satisfaction and platform health over time.
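A minimal explore/exploit sketch, using epsilon-greedy rather than Spotify’s actual bandit formulation: most traffic exploits the historically best item, while a reserved slice explores, which is what keeps new artists from being starved.

```python
import random

def epsilon_greedy_pick(stats, epsilon=0.1, rng=random):
    """stats: {item: (clicks, impressions)}. Explore with probability epsilon,
    otherwise exploit the item with the best observed click rate."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))  # explore: give low-data items a chance
    return max(stats, key=lambda i: stats[i][0] / max(stats[i][1], 1))

def update(stats, item, clicked):
    clicks, shown = stats[item]
    stats[item] = (clicks + int(clicked), shown + 1)

stats = {"hit_song": (90, 100), "new_artist": (1, 2)}
pick = epsilon_greedy_pick(stats, epsilon=0.0)  # pure exploitation
print(pick)  # hit_song -- the filter bubble in one line
```

Setting `epsilon=0.0` reproduces the pure-exploitation trap the paragraph describes; any positive epsilon trades a little short-term click rate for long-term catalog health.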
Stitch Fix did something even more counterintuitive. Instead of automating the entire styling process, they designed a system where algorithms handle structured data (fit likelihood, collaborative filtering) and human stylists apply context, emotional judgment, and trend awareness. The human-AI hybrid reduced returns by 30% and helped double revenue from $1.7 billion to $3.2 billion in four years. Full automation would have been worse than the combination.
Build vs Buy vs Prompt
Google’s first rule of machine learning: “Don’t be afraid to launch a product without machine learning.” Their third rule: “Choose machine learning over a complex heuristic.” Read those together and the message is clear. Start with the simplest thing that works. A handwritten rule. A regex. A lookup table. Only add ML when maintaining the heuristic becomes impossible.
In 2026, many teams that think they need fine-tuning or RAG discover that a well-structured prompt with a few examples gets them 80% of the way there at near-zero cost. Research on reasoning models like o1 found that few-shot prompting actually worsened performance compared to simpler methods. More complexity is not always better.
When you do need more than a prompt, the landscape breaks into three options. RAG (retrieval-augmented generation) takes 2–6 weeks to build and works best when your knowledge changes frequently, like product catalogs, regulations, or support documentation. Fine-tuning takes weeks to months but produces a specialized model that runs cheaper and faster at high volume. It wins when the domain is stable and tone consistency matters. Agents combine multiple LLM calls with tool use for multi-step reasoning, but each call adds tokens, latency, and debugging complexity.
MIT found that buying AI tools from specialized vendors succeeds about 67% of the time, while internal builds succeed only one-third as often. The RAND report explicitly calls out a pattern: training custom models for problems that API calls would solve, resulting in impressive technology that nobody uses. Regex beats LLMs for structured patterns (IP addresses, phone numbers, credit card formats) at near-zero latency. With 1M+ token context windows now available, loading static documents directly into the prompt takes 2–5 developer-days versus 2–6 weeks for production RAG.
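The regex point is easy to demonstrate: for rigidly structured patterns, a compiled expression is deterministic, auditable, and effectively free. A sketch for IPv4 addresses (phone numbers and card formats follow the same shape):

```python
import re

# Each octet is 0-255; anchored so "999.1.1.1" cannot partially match.
OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"^{OCTET}(\.{OCTET}){{3}}$")

def is_ipv4(s: str) -> bool:
    return IPV4.match(s) is not None

print(is_ipv4("192.168.0.1"))   # True
print(is_ipv4("999.168.0.1"))   # False -- out-of-range octet
print(is_ipv4("192.168.0"))     # False -- too few octets
```

An LLM call for the same check costs tokens, adds latency, and can hallucinate; the regex answers in microseconds and fails only in ways you can read off the pattern.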
The right question is not “which approach is best?” It is “what is the simplest approach that meets our actual requirements?”
The Production Gap
For every 33 AI prototypes built, only 4 make it to production. IDC published that number, and it matches what everyone in the industry sees: a notebook that looks 90% complete is actually about 10% done for production deployment.
Google Duplex is the canonical demo trap. At I/O 2018, it stunned the audience by calling a restaurant and booking a reservation, complete with human-like “ums” and pauses. The demo was later revealed to be heavily curated. In production, Duplex achieved only 80% autonomous success. The other 20% required human call center operators to take over. The gap between a polished demo and a reliable system is not a matter of debugging. It is a matter of architecture, monitoring, fallbacks, and all the things that don’t fit on a slide.
COVID was the ultimate stress test for this gap. MIT Technology Review documented how hundreds of AI tools were built to help diagnose and treat COVID. None of them actually helped. The models were trained on mislabeled data, data from unknown sources, and early pandemic patterns that didn’t generalize. Epic’s hospital severity prediction system “didn’t perform very well at all” when patient characteristics shifted during the pandemic. Healthcare AI models saw accuracy degrade within days of deployment.
This is the problem of distribution shift. Your model learns patterns from training data. When the real world changes, those patterns break. A retailer’s demand forecasting model failed when a mobile app promotion shifted shopping patterns from in-store to online. A credit risk model that doesn’t adjust to economic shifts will approve high-risk loans. In fraud detection, adversaries deliberately shift their patterns to evade your model. The world is not stationary, and your model was trained on a snapshot of it.
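Distribution shift is detectable if you look for it. A minimal monitoring sketch (the threshold and feature are illustrative assumptions) compares a feature’s live statistics against its training snapshot and flags the gap before accuracy metrics catch up:

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Standardized mean shift: how many training standard deviations the
    live mean has moved. Crude, but it catches the 'world changed' failure."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / (sigma or 1.0)

train = [100, 110, 95, 105, 90, 108, 102, 97]   # e.g. daily in-store orders
live = [40, 35, 50, 42]                          # after the app promotion

score = drift_score(train, live)
print(score > 3.0)  # True -> alert before the forecast quietly degrades
```

Production systems use richer tests (population stability index, KS tests, per-feature monitoring), but even this one-number check would have fired on the retailer’s in-store-to-online shift days before anyone audited the forecasts.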
What to Measure Instead
F1 score does not capture business value. A recommendation model that scores 1% better in offline precision may produce no measurable difference in a live A/B test, because offline evaluation assumes historical behavior reflects future preferences. It ignores real-time feedback loops. It cannot account for the cold start problem. The correlation between offline metrics and online results is, in many domains, weak.
Gartner puts it bluntly: “Login rates and tool usage tell you nothing about value. Workforce impact and decision quality tell you everything.” The metrics that matter sit in a different category entirely. For a customer support bot, teams typically measure accuracy and latency. What actually matters is resolution rate (did the user’s problem get solved?) and escalation frequency (how often does a human have to step in?). For fraud detection, F1 and precision matter less than dollars saved from prevented fraud minus dollars lost to false positives that block legitimate customers. For content recommendation, click-through rate is a proxy at best. Engagement depth, retention, and revenue per user tell you whether the system is working.
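For fraud, the business metric described above is a one-line formula. A sketch with illustrative numbers (all figures are assumptions) shows why it can rank two models opposite to their offline scores:

```python
def fraud_net_value(tp, fp, avg_fraud_loss, false_positive_cost):
    """Dollars saved by caught fraud minus dollars burned blocking
    legitimate customers. This, not F1, is the deployment metric."""
    return tp * avg_fraud_loss - fp * false_positive_cost

# Model A catches more fraud but blocks far more legitimate customers;
# Model B catches less fraud but rarely blocks anyone.
a = fraud_net_value(tp=900, fp=5000, avg_fraud_loss=250, false_positive_cost=40)
b = fraud_net_value(tp=700, fp=500, avg_fraud_loss=250, false_positive_cost=40)
print(a, b)  # 25000 155000 -- the model that catches more fraud nets less money
```

Any offline leaderboard that ignores the asymmetry between those two error costs will pick the wrong model with complete confidence.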
Anthropic’s approach to agent evaluation is instructive. They advocate eval-driven development: define the tasks the agent must handle before you build it, then grade the agent against them. Start with 20–50 tasks drawn from real failures. Prioritize volume over polish in early evaluations: more questions with automated grading beat fewer hand-graded ones. Early changes have large effect sizes, so you want to iterate fast.
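The eval-driven loop fits in a small harness. A sketch, where the task format and toy agent are assumptions rather than Anthropic’s tooling: tasks come from real failures, and grading is automated so volume stays cheap.

```python
def run_evals(system, tasks):
    """tasks: list of (input, grader) pairs drawn from real failures.
    Each grader is an automated pass/fail check on the system's output."""
    results = [(inp, grader(system(inp))) for inp, grader in tasks]
    passed = sum(ok for _, ok in results)
    return passed / len(tasks), [inp for inp, ok in results if not ok]

# Stand-in "agent": a hypothetical refund-policy responder
def bot(question):
    return "refunds allowed within 30 days" if "refund" in question else "escalate"

tasks = [
    ("what is the refund window?",        lambda out: "30 days" in out),
    ("can I get a refund on sale items?", lambda out: "refund" in out),
    ("my package arrived damaged",        lambda out: out == "escalate"),
]

score, failures = run_evals(bot, tasks)
print(score, failures)  # pass rate plus the exact inputs that failed
```

Swapping `bot` for a real agent changes nothing about the loop: every regression shows up as a named failing task, not a vague drop in a dashboard metric.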
The production evaluation stack has two tracks. Offline evaluation catches obvious failures before deployment. Online evaluation (live A/B tests, continuous monitoring, user feedback loops) catches the things that only show up in the real world. Together, they can reduce mean time to detection of quality issues from hours to minutes. Separately, each one is blind to what the other sees.
The First Rule of AI Products
The pattern across every failure in this explainer is the same. Teams started with technology and worked backward to find problems. Watson started with “we have a cognitive computing platform” and looked for diseases to diagnose. Zillow started with “we have a pricing model” and looked for houses to buy. Klarna started with “we have an AI assistant” and looked for agents to replace.
The pattern across every success is also the same. They started with a specific, quantifiable pain and chose the simplest tool that could solve it. Uber did not build a new ETA model. They modeled the error in their existing one. Stitch Fix did not automate stylists. They gave stylists better data. Lumen Technologies did not “apply AI to sales.” They eliminated a specific 4-hour research bottleneck, saving $50 million a year.
Ines Montani, the creator of spaCy and Prodigy, has a line that captures this better than anything: the hard part of applied AI is not the “how.” It is the “what.” Deciding what to build, what to measure, and what counts as success. The learner is just a compiler. Your examples, your framing, your evaluation criteria are the actual program.
Google’s Rule #1 of Machine Learning: Don’t be afraid to launch a product without machine learning.
The best AI product might not use AI at all.
References
- RAND Corporation, The Root Causes of Failure for Artificial Intelligence Projects, 2024
- MIT NANDA, The GenAI Divide: State of AI in Business, 2025
- PwC, 29th Annual Global CEO Survey, 2026
- S&P Global Market Intelligence, AI Adoption Survey, 2025
- HBR, Beware the AI Experimentation Trap, 2025
- Anthropic, Demystifying Evals for AI Agents
- Google, Rules of Machine Learning
- Ines Montani, Applied NLP Thinking, Explosion, 2021