
RAG in the era of million-token context windows

Apr 20, 2025 - 3 minute read

Retrieval-Augmented Generation (RAG) became a key technique in the early days of LLMs as a way to work around a fundamental limitation: small context windows. At the time, LLMs could only process a few thousand tokens at once, far too little to hold the full body of knowledge or documents relevant to many real-world tasks.

But the AI landscape has shifted dramatically. Today, we have models with “mega-context” windows. Google’s Gemini 1.5 Pro has a staggering 1 million token context window. To put that in perspective:

  • A million tokens is roughly 750,000 words.
  • A standard single-spaced page has about 500 words, so this is equivalent to a 1,500-page document.

Other models like Claude and Grok also boast impressive context lengths in the hundreds of thousands of tokens. This raises the question: if you can fit an entire encyclopedia into the prompt, do we still need RAG?

The short answer is a definitive yes.

Why does RAG still matter now?

Despite massive context windows, RAG remains a critical tool for building effective and efficient AI applications. Here are the key reasons why.

Cost-Effectiveness

Even if a model can process a million tokens, that doesn’t mean it should.

LLM APIs typically charge per token — for both input and output. Sending an entire database of documents to the model for every query, even when technically feasible, can become prohibitively expensive.

RAG solves this by retrieving only the most relevant information, significantly reducing token usage and cost. This makes RAG especially valuable in high-volume, real-time, or cost-sensitive applications.
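
As a rough back-of-envelope illustration of the difference, here is a small sketch. The per-token price, corpus size, and traffic figures are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-envelope input-cost comparison: full corpus in every prompt vs. RAG.
# All numbers are illustrative assumptions, not real pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50   # assumed USD per 1M input tokens
CORPUS_TOKENS = 800_000                 # assumed size of the full document set
RETRIEVED_TOKENS = 4_000                # assumed size of the top-k retrieved chunks
QUERIES_PER_DAY = 10_000                # assumed query volume

def daily_input_cost(tokens_per_query: int) -> float:
    """Input-token cost per day for a given prompt size."""
    return tokens_per_query * QUERIES_PER_DAY * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

print(f"Full corpus every query: ${daily_input_cost(CORPUS_TOKENS):,.2f}/day")
print(f"Retrieved chunks only:   ${daily_input_cost(RETRIEVED_TOKENS):,.2f}/day")
```

The exact figures will differ by provider, but the ratio is the point: sending 200x fewer input tokens per query makes that line item roughly 200x cheaper.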

Speed and Performance

Large context = longer processing time.

Processing a massive context is not instantaneous. Parsing and reasoning over hundreds of thousands of tokens creates computational overhead, leading to latency that can degrade user experience.

By pre-filtering content, RAG helps models respond faster with more focused inputs, improving both speed and user satisfaction.
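
To make the overhead concrete, here is a crude estimate. The prefill throughput below is an assumed, purely illustrative figure, not a benchmark of any particular model or provider:

```python
# Rough prefill-latency estimate. The throughput figure is an assumption for
# illustration; real numbers vary widely with model, hardware, and batching.
ASSUMED_PREFILL_TOKENS_PER_SEC = 10_000

for prompt_tokens in (800_000, 4_000):
    seconds = prompt_tokens / ASSUMED_PREFILL_TOKENS_PER_SEC
    print(f"{prompt_tokens:>7} input tokens -> ~{seconds:.1f}s just to read the prompt")
```

Even if the real throughput is several times higher, an interactive product feels the difference between sub-second and tens-of-seconds prompt ingestion.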

Accuracy and Focus

Even the most advanced LLMs struggle with what’s known as the “needle in a haystack” problem. As the context window grows, the model’s ability to recall specific facts can diminish, especially when the relevant detail is buried deep in the input.

The well-known Needle in a Haystack test has shown exactly this: recall drops as the context grows and the key fact is placed deeper in the prompt.

RAG sidesteps this entirely: it retrieves the needle first, placing it front and center for the LLM. Instead of sifting through a haystack, the model reasons over a focused, high-signal context, improving reliability and factual accuracy.
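
As a minimal sketch of that "retrieve the needle first" step, here is one way it could look. The embedding function is a toy stand-in for a real embedding model, and all names are hypothetical:

```python
import numpy as np

# Toy stand-in for a real embedding model: a hashed bag-of-words vector.
# In practice this would call an actual embedding model or vector database.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query and keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: float(np.dot(q, embed(c))), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Put the retrieved needles front and center, then ask the question."""
    context = "\n\n".join(top_k_chunks(query, chunks))
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```

A production system would use a proper vector index instead of a linear scan, but the shape is the same: score, select, and prepend.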

Hallucination Reduction

A larger context window doesn’t eliminate hallucinations, those plausible-sounding but factually incorrect answers that models sometimes generate.

RAG helps mitigate this by grounding responses in external, verifiable data. Because the model answers based on retrieved content, users can trace responses back to their source, which improves transparency, auditability, and trust.
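
One simple way to get that traceability is to keep source metadata attached to each retrieved chunk and number the sources in the prompt so answers can cite them. A sketch, with a hypothetical chunk structure and prompt wording:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # e.g. a URL, file path, or document title
    location: str  # e.g. a page or section identifier

def grounded_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Number each retrieved chunk so the model can cite it and users can audit the answer."""
    numbered = "\n".join(
        f"[{i}] ({c.source}, {c.location}) {c.text}"
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them like [1].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )
```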

Will RAG become obsolete in the future?

Possibly, someday.

There’s a historical parallel here. In the 1990s and early 2000s, network bandwidth was scarce and expensive. Engineers spent years optimizing HTTP protocols, compressing assets, and building caching layers to make the web usable over dial-up and early broadband. Today, with fiber and 5G, bandwidth is abundant and cheap for most users. Many of those optimizations are now invisible, but they laid the foundation for modern infrastructure.

RAG is at a similar crossroads. Today, it addresses real-world constraints. But in a future where:

  • LLMs can instantly process tens of millions of tokens,
  • Token pricing is negligible,
  • And context ingestion is as fast as index lookups…

Then yes, RAG may eventually fade into the background — much like CDNs and caching are now just infrastructure details most developers never think about. Until then, RAG remains a vital bridge between LLM reasoning and the real world of external data.