Ask an LLM to summarize a legal contract and it does fine. Ask it what revenue will look like next quarter, given the last three years of monthly data, and it falls apart.
This isn't surprising. RAG was designed for text. You embed documents, retrieve relevant chunks, feed them to the model. The model reads the chunks and generates an answer. It works because language has a stable structure. A paragraph about contract termination clauses means roughly the same thing whether it was written in 2020 or 2024.
Time series data is nothing like this.
Why time series breaks standard RAG
A time series is a sequence of numbers indexed by time. Stock prices. Server CPU usage. Monthly revenue. Daily temperature. The numbers themselves carry almost no semantic meaning in isolation. The value 47.3 means nothing until you know it's the temperature in Delhi in June, that yesterday was 46.8, and that the monsoon typically arrives by the second week.
Standard RAG can't handle this. There are three specific reasons.
First, embeddings don't capture temporal patterns. When you embed the sentence "revenue grew 15% year over year," a text embedding model understands that. But if you embed the raw sequence [100, 108, 112, 115], the model has no idea that this represents decelerating growth. The shape of the curve matters. Text embeddings don't see shapes.
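A minimal sketch of the point: the "decelerating growth" in [100, 108, 112, 115] lives entirely in the differences between values, which is exactly what a text embedding of the raw numbers never computes. The helper name here is illustrative.

```python
def growth_profile(series):
    """Return period-over-period growth rates: the 'shape' that a text
    embedding of the raw numbers never sees."""
    return [round((b - a) / a, 3) for a, b in zip(series, series[1:])]

revenue = [100, 108, 112, 115]
rates = growth_profile(revenue)
print(rates)  # growth rate shrinks every period
print(all(later < earlier for earlier, later in zip(rates, rates[1:])))  # deceleration
```

The sequence of rates is monotonically falling, which is the signal a forecaster needs. The raw values, embedded as a string, carry none of it.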
Second, retrieval by similarity fails when context is temporal. In text RAG, you retrieve chunks that are semantically similar to the query. But the most useful historical data for a time series forecast isn't data that "looks similar" in embedding space. It's data from the same season last year, or data from the last time the same macroeconomic conditions held. The relevance function is fundamentally different.
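To make the different relevance function concrete, here is a toy sketch of temporal retrieval: given a daily series keyed by date, fetch the window covering the same calendar period one year earlier, rather than whatever is nearest in embedding space. The helper name and the one-year offset are illustrative assumptions, not a standard API.

```python
from datetime import date, timedelta

def same_season_last_year(series: dict, start: date, days: int) -> list:
    """Retrieve by temporal position: the same calendar window, one year back."""
    shifted = start.replace(year=start.year - 1)
    window = [shifted + timedelta(days=i) for i in range(days)]
    return [series[d] for d in window if d in series]

# Toy daily series spanning two years (weekly sawtooth pattern).
history = {date(2023, 1, 1) + timedelta(days=i): float(i % 7) for i in range(730)}
context = same_season_last_year(history, date(2024, 6, 1), 7)
print(len(context))  # 7 values, pulled from June 2023
```

Nothing here looks at the values at all; relevance is purely a function of where the data sits on the calendar. That is the inversion standard RAG misses.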
Third, time series data undergoes distribution shift. The statistical properties of the data change over time. A model trained on pre-COVID retail data will be wrong about post-COVID retail data, even though the numbers are in the same format. Text doesn't shift this way. The meaning of a legal clause doesn't suddenly change because of a pandemic.
What actually works: agents that route, not just retrieve
The insight behind agentic RAG for time series is simple: don't try to make one model do everything. Instead, build a system where an LLM acts as a router, directing different parts of the problem to specialized tools.
Think of it like a hospital. You don't want one doctor doing surgery, reading X-rays, and running blood tests. You want a triage nurse (the master agent) who figures out what's wrong and sends you to the right specialist (the sub-agents).
Ravuru et al. formalized this in their KDD 2024 paper. Their architecture has three layers:
A master agent that receives the user's question and decides what kind of time series task it is. Is this a forecasting problem? Anomaly detection? Classification?
Sub-agents, each fine-tuned for one specific task. These aren't general-purpose LLMs. They're smaller language models (think 7B parameters, not 70B) that have been instruction-tuned on time series data with Direct Preference Optimization. They're specialists.
A shared prompt pool that stores distilled knowledge about historical patterns: seasonality, cyclicality, trend shapes. When a sub-agent needs context, it retrieves from this pool. This is where the "retrieval" in RAG happens, but it's retrieval of patterns, not paragraphs.
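The three layers can be sketched as a tiny dispatch loop. This is not the paper's implementation; the function names, the keyword-based task classifier standing in for the master LLM, and the dictionary standing in for the prompt pool are all illustrative assumptions.

```python
# Shared prompt pool: distilled pattern knowledge, retrieved as context.
PROMPT_POOL = {
    "weekly_seasonality": "Demand peaks Fri-Sat, troughs Mon.",
    "upward_trend": "Series shows a sustained upward trend.",
}

def forecast_agent(series, context):  # stand-ins for fine-tuned sub-agents
    return f"forecast using context: {context}"

def anomaly_agent(series, context):
    return f"anomaly scan using context: {context}"

SUB_AGENTS = {"forecast": forecast_agent, "anomaly": anomaly_agent}

def master_agent(question, series):
    # In the real system an LLM classifies the task; a keyword check stands in.
    task = "anomaly" if ("unusual" in question or "spike" in question) else "forecast"
    # Retrieval of patterns, not paragraphs.
    context = PROMPT_POOL["weekly_seasonality"]
    return SUB_AGENTS[task](series, context)

print(master_agent("What will next week look like?", [1, 2, 3]))
```

The structure is the point: the master agent never does the forecasting math itself, it only decides which specialist sees the problem and what pooled context rides along.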
The key result: this multi-agent setup outperforms single-model approaches across forecasting, anomaly detection, and classification benchmarks. Not because the individual models are better, but because the routing is better. Each model only handles the problem it was trained for.
The foundation model wave
While the agentic approach solves the routing problem, there's a parallel revolution happening: foundation models built specifically for time series.
The analogy to LLMs is precise. GPT-3 showed that a model pre-trained on a huge text corpus could do well on tasks it was never explicitly trained for. The same thing is now happening with time series.
Google's TimesFM is a decoder-only transformer pre-trained on 100 billion real-world time points. It treats patches of time series values the way an LLM treats tokens. The latest version (TimesFM-2.5) uses 200M parameters, half the size of its predecessor, while ranking #1 on GIFT-Eval for both point accuracy and probabilistic accuracy in zero-shot forecasting. No fine-tuning needed. You just feed it a time series and it predicts what comes next.
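The patch-as-token idea can be illustrated without TimesFM's actual API (which is not shown here): chop the series into fixed-length patches and treat each patch as one "token" in the decoder's input sequence. The patch length of 4 is an arbitrary choice for the sketch.

```python
def patchify(series, patch_len=4):
    """Split a series into non-overlapping fixed-length patches, the way a
    decoder-only time series model treats chunks of values as tokens."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, patch_len)]

series = [float(t) for t in range(12)]
tokens = patchify(series)
print(tokens)  # three patches of four values each
```

From the model's point of view, forecasting is then next-token prediction: given the patch sequence so far, emit the next patch of values.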
Amazon's Chronos-2 processes 300+ forecasts per second on a single GPU, consistently beating tuned statistical models out of the box. Lag-Llama, built on the LLaMA architecture, does probabilistic forecasting, outputting full probability distributions rather than point estimates.
These models are doing to ARIMA and Prophet what LLMs did to rule-based NLP. Not replacing them everywhere (ARIMA still wins on clean, stationary data), but making the default choice dramatically easier. You no longer need a statistician to hand-tune seasonal decomposition parameters for every new dataset.
Where RAG meets time series (properly)
The most interesting recent work combines retrieval with these foundation models, but does it right.
Retrieval Augmented Forecasting (Tire et al., 2024) doesn't retrieve text. It retrieves time series. When you want to forecast a particular series, it searches a database for historical series with similar shapes and dynamics, then concatenates those retrieved patterns with your input. The foundation model sees both your data and the most relevant historical precedent.
This works because time series has a property that text doesn't: the same pattern often repeats across different domains. Electricity demand in Mumbai and water consumption in Bangalore might follow nearly identical weekly cycles. By retrieving these cross-domain analogues, the model gets useful context it couldn't get from the input series alone. The improvement is largest on out-of-domain data, exactly where you'd expect retrieval to help most.
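A sketch of the retrieval step in that spirit (not Tire et al.'s implementation): z-normalize candidate windows so absolute scale drops out, rank by Euclidean distance to the normalized query, and prepend the best match as extra context for the forecaster. All names here are illustrative.

```python
import math

def znorm(xs):
    """Zero-mean, unit-variance normalization, so shape comparisons
    ignore absolute scale (a flat series maps to all zeros)."""
    m = sum(xs) / len(xs)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - m) / s for x in xs]

def retrieve_similar(query, candidates):
    """Return the candidate window whose normalized shape is closest."""
    q = znorm(query)
    return min(candidates, key=lambda c: math.dist(q, znorm(c)))

query = [10, 12, 15, 13, 10, 12, 15]          # e.g. a weekly cycle
db = [
    [100, 120, 150, 130, 100, 120, 150],      # same shape, 10x the scale
    [5, 5, 5, 5, 5, 5, 5],                    # flat, no cycle
]
best = retrieve_similar(query, db)
augmented_input = best + query                # retrieved precedent, then the series
print(best[0])  # 100: the cross-domain analogue wins despite the scale gap
```

Because normalization strips the units away, a 10x-larger series from an unrelated domain still matches perfectly on shape, which is exactly why cross-domain retrieval works for time series when it would be nonsense for text.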
DCATS (Yeh et al., 2025) takes the agent idea further. Instead of focusing on model architecture, it uses an LLM agent to improve data quality. The agent examines metadata, cleans the input data, filters out irrelevant series, and selects the right subset for training. This data-centric approach achieves a 6% error reduction across all tested models. The insight: sometimes the bottleneck isn't the model, it's the data you feed it.
What this means practically
If you're building a system that needs to reason about temporal data, here's what actually matters:
Don't embed time series as text. If you're converting numbers to strings and throwing them into a vector database alongside your documents, you'll get bad results. Time series needs its own retrieval mechanism, one that understands shapes, seasonality, and temporal distance.
Use foundation models as your forecasting backbone. TimesFM-2.5, Chronos-2, or Lag-Llama will outperform hand-tuned ARIMA on most datasets with zero configuration. Start there.
Use agents for routing, not just retrieval. The value of an LLM in a time series pipeline isn't in doing the math. It's in deciding what math to do. Is this a forecasting problem or an anomaly detection problem? Does this data need deseasonalizing first? Should we use the last 30 days or the last 3 years? These are judgment calls that agents handle well.
Retrieve series, not documents. When you need context, retrieve historical time series that look like your current data. Cross-domain retrieval (finding similar patterns in unrelated datasets) is surprisingly effective.
The gap between "ask an AI about numbers" and "ask an AI about text" is closing fast. But it's closing because people stopped trying to make text tools work on numbers, and started building tools that understand what numbers actually are.
References
- Ravuru et al., Agentic Retrieval-Augmented Generation for Time Series Analysis, KDD 2024
- Das et al., TimesFM: A decoder-only foundation model for time-series forecasting, ICML 2024
- Tire et al., Retrieval Augmented Time Series Forecasting, 2024
- Yeh et al., Empowering Time Series Forecasting with LLM-Agents (DCATS), 2025
- Rasul et al., Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting, 2023