Is RAG always the answer?

No, but it’s usually the best way to get there.

Tags: AI, Technical, RAG, Tech Strategy

Every week, I talk to founders who are wrestling with the same question: “We’re building an AI product. Should we use RAG?” They see it all over Twitter and in every tech publication, but they’re also being pulled in a dozen other directions—fine-tuning, agents, knowledge graphs. They’re looking for a straight answer, a silver bullet. And I have to be the one to tell them: there isn’t one.

As a CTO who has been in the trenches shipping AI products for years, my answer is almost always: “it depends”. I know, I know, not the straight answer you were looking for. But the truth is, the “best” approach depends entirely on your specific use case. What works for a legal tech startup won’t work for a company building an AI-powered coding assistant.

So, to help you navigate this complex landscape, I’ve put together a field guide based on my experience working with venture-backed product teams. This is the same mental model I use when I’m advising companies on their AI strategy.

The Main Alternatives at a Glance

I’ve seen teams spend months debating these options. To save you some time, here’s a cheat sheet that summarizes the main alternatives to plain vanilla RAG.

| # | Technique (shorthand) | Core idea | When it outperforms plain RAG | Typical cost | Key drawbacks |
|---|---|---|---|---|---|
| 1 | Full or LoRA fine-tuning | Embed the knowledge inside the model weights | Corpus ≤ 1 B tokens, rarely changes; ultra-low latency required; repetitive QA tasks | High upfront (GPU) ➜ low per-query | Model bloat; hard to update; IP leakage risk |
| 2 | Domain-Adaptive Pre-Training (DAPT) | Continue pre-training on a large in-domain dump | Very large but static corpus (>1 B tokens); need deep, nuanced style adaptation | Highest upfront | Weeks of compute; still static |
| 3 | Knowledge-Graph–Augmented Generation (KGAG) | Retrieve from a graph; do symbolic reasoning | Data is highly structured; correctness > recall | Medium | Graph construction + upkeep |
| 4 | SQL / Toolformer / Agentic | LLM writes queries to trusted tools; tools return answers | Answers already exist in a DB or API; need exact numeric output | Low–medium | Tooling infra, prompt safety |
| 5 | Semantic Cache / Prompt-Cache | Reuse earlier high-quality answers | Traffic is bursty & repetitive; latency/cost critical | Tiny | Needs a good cache-hit strategy |
| 6 | Model Editing (ROME, MEMIT, MEND) | Patch weights for specific facts | Small number of “hot-fix” facts | Low | Non-trivial risk of side effects |
| 7 | Hybrid RAG (RAG + fine-tune, Delta-RAG, Graph-RAG) | Combine retrieval with small adapters or graphs | Neither plain RAG nor pure fine-tuning alone meets the quality bar | Medium | Added complexity |
| 8 | Distilled Mini-Models | Train a small model on “teacher” RAG answers | Edge deployment; offline; sub-second latency | Upfront distillation cost | Requires a high-quality teacher |

My Rule of Thumb for Choosing the Right Path

If you’re just starting out, the sheer number of options can be paralyzing. Here’s a simplified decision tree based on what I’ve seen work in the real world.


  • Start with plain RAG. Seriously. It’s the most versatile and will get you 80% of the way there for most use cases. Don’t over-engineer from day one.
  • If your data is small and static, and you need speed, then fine-tuning is your friend. I’ve seen this work well for things like customer support bots that answer the same questions over and over.
  • For massive, static datasets, DAPT is a beast. Think of it as giving your LLM a PhD in a specific domain. It’s expensive, but for deep, nuanced understanding, it’s unbeatable.
  • When your data is highly structured, think graphs. I’ve seen KGAG work wonders in financial services and other domains where correctness is paramount.
  • If the answer already exists in a database, for the love of all that is holy, just query it. Don’t try to make an LLM do a database’s job. Use a SQL/Toolformer agent.
  • Got a lot of repeat queries? Cache them. It’s a simple, cheap way to improve latency and cut costs. I’m always surprised by how many teams overlook this.
  • Model editing is for surgical strikes. It’s great for fixing a single fact, but it’s not a scalable solution.
  • When one approach isn’t enough, go hybrid. The best production systems I’ve seen are often a mix of RAG and something else.
  • For edge deployments, think small. Distilled models are perfect for running on-device.
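To make that rule of thumb concrete, here’s a toy router in Python. The predicate names, the priority order, and the 1 B-token cutoff are my shorthand from the cheat sheet above, not hard rules — treat it as a thinking aid, not a product decision:

```python
def choose_approach(corpus_static: bool, corpus_tokens: int,
                    structured: bool, answer_in_db: bool,
                    needs_edge_latency: bool, repeat_heavy: bool) -> str:
    """Rough router mirroring the rule-of-thumb list above."""
    if answer_in_db:
        return "sql/tool agent"          # exact answers already live in a DB/API
    if needs_edge_latency:
        return "distilled mini-model"    # on-device, sub-second latency
    if structured:
        return "knowledge-graph RAG"     # correctness > recall
    if corpus_static and corpus_tokens > 1_000_000_000:
        return "DAPT"                    # deep adaptation on a huge static dump
    if corpus_static:
        return "fine-tuning (with RAG fallback)"
    if repeat_heavy:
        return "plain RAG + semantic cache"
    return "plain RAG"                   # the default starting point
```

The ordering encodes one opinion: structural fit (DB, edge, graph) trumps corpus size, and plain RAG is the fall-through default.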

Why “Better” Is Usually “Augmented RAG,” Not “No RAG”

I once advised a startup that was hell-bent on fine-tuning a model for their legal tech product. They spent three months and a good chunk of their seed round trying to get it right. The problem was, their data wasn’t static and their customers demanded traceability, making it a poor fit for a pure fine-tuning approach. The results were… underwhelming. The model was a black box, they couldn’t cite their sources, and it was a nightmare to keep up to date.

They eventually switched to a hybrid RAG approach and had a working prototype in two weeks.

This highlights a few key points:

  • Foundation-model drift is real. Your LLM vendor will ship new checkpoints, and your fine-tuned model will eventually become obsolete. A RAG layer insulates you from this.
  • Enterprises love citations. If you’re selling to businesses, you need to be able to show your work. RAG gives you that traceability.
  • Data freshness is key. With RAG, you can just stream in new data. No need to retrain a model every time a new document is added.

This is why most production systems today are RAG plus something else. They use RAG for grounding, and then they add on things like re-rankers, answer evaluators, and lightweight adapters to get the best of all worlds.
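As a sketch of what “RAG plus something else” looks like, here is a minimal two-stage pipeline with a re-ranker bolted on. The `retrieve`, `rerank_score`, and `generate` callables are placeholders for whatever retriever, cross-encoder, and LLM you actually use:

```python
def rag_with_rerank(query, retrieve, rerank_score, generate, k=20, top_n=4):
    """Plain RAG with a re-ranking stage: a cheap first pass for recall,
    then a precise second pass to pick the context that reaches the LLM.

    retrieve(query, k)        -> list of candidate chunks
    rerank_score(query, chunk) -> relevance score (e.g. a cross-encoder)
    generate(query, context)  -> grounded answer
    """
    candidates = retrieve(query, k)                  # recall-oriented first pass
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c), reverse=True)
    context = ranked[:top_n]                         # precision-oriented second pass
    return generate(query, context)
```

The same skeleton extends naturally: swap in an answer evaluator after `generate`, or a LoRA-adapted model as `generate` itself.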

Edge Cases Where RAG Gets Crushed

Now, I’m not a RAG zealot. There are definitely times when it’s the wrong tool for the job. Here are a few I’ve encountered:

  • When you need answers in the blink of an eye. Consider AR glasses, for example. You can’t have a user waiting for a vector database lookup. A tiny, distilled model is the only way to go.
  • When you need to be precisely right. I’m talking about things like tax calculations. You don’t want an LLM hallucinating a tax code. Use an agent that calls a deterministic calculator.
  • When you’re in a data lockdown. I’ve worked with hospitals that won’t let their data leave the premises. In that case, an on-device LoRA or model editing is your only option.
  • When your source data is a mess. If you’re dealing with low-resource languages and noisy OCR, RAG’s retrieval quality will be terrible. You’re better off with continued pre-training.

My Pro-Tips from the Trenches

Here are those pro-tips, expanded with the detailed context and reasoning that startups need to make informed decisions.

If You Fine-Tune, Keep a RAG Fallback

At first glance, fine-tuning seems like the ultimate solution for domain adaptation. By retraining a model on your proprietary dataset, you imbue it with specialized knowledge, teaching it the specific jargon, tone, and reasoning patterns of your industry. This can lead to superior performance on narrow tasks and can be faster at inference time since it doesn’t require a separate retrieval step. However, this process creates a static model; its knowledge is frozen at the moment its training concludes. It will be oblivious to any new information, and it can still “hallucinate” or provide incorrect answers when faced with queries that fall outside its specialized training.

This is where RAG serves as an essential safety net. A hybrid approach allows you to get the best of both worlds. The system can first route a user’s query to the highly specialized fine-tuned model. If that model is unable to answer with high confidence, or if the query pertains to very recent events, the system can “fall back” to the RAG pipeline. The RAG component then fetches up-to-date documents from an external knowledge base to provide the necessary context.

Implementing this hybrid architecture creates a robust, multi-layered solution. You can use a simple routing mechanism based on confidence scores or a more sophisticated classifier to decide which path a query should take. This ensures that you leverage the deep, stylistic expertise of your fine-tuned model while covering its knowledge gaps with the timely, factual grounding provided by RAG. The result is an AI that is not only a domain expert but also perpetually current.
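The confidence-based routing described above can be sketched in a few lines. `fine_tuned_answer` and `rag_answer` are hypothetical callables standing in for your two pipelines; the threshold is something you’d tune on a held-out evaluation set, not a universal constant:

```python
CONFIDENCE_THRESHOLD = 0.75  # tune on a held-out eval set

def answer_query(query, fine_tuned_answer, rag_answer,
                 threshold=CONFIDENCE_THRESHOLD):
    """Route to the fine-tuned model first; fall back to RAG when
    confidence is low.

    fine_tuned_answer(query) -> (answer, confidence in [0, 1])
    rag_answer(query)        -> answer grounded in retrieved documents
    """
    answer, confidence = fine_tuned_answer(query)
    if confidence >= threshold:
        return answer, "fine-tuned"
    # Low confidence: ground the answer in freshly retrieved documents.
    return rag_answer(query), "rag-fallback"
```

A trained classifier can replace the raw confidence score once you have enough routing data to learn from.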

If You Build a SQL Agent, Throttle It

Giving a Large Language Model the ability to write and execute SQL queries directly against your databases is incredibly powerful. It can unlock natural language data analysis for non-technical users. However, it also introduces significant risks if not handled with extreme caution. An unconstrained LLM agent, attempting to interpret an ambiguous or complex user request, could easily get stuck in a loop, generating a cascade of inefficient queries. I’ve seen this firsthand: a single runaway agent can execute massive table scans or complex joins repeatedly, overwhelming a production database, degrading performance for all users, and potentially causing a costly outage.

To prevent this, implementing strict safeguards is non-negotiable. The most critical first step is to ensure the agent connects to the database using a read-only user role. This single action prevents any possibility of the LLM executing destructive commands like UPDATE, DELETE, or DROP. Beyond that, you must implement practical throttling mechanisms. This includes rate-limiting the number of queries the agent can execute per minute and setting a timeout for query execution.

For an even higher level of safety, consider building an API layer that sits between the agent and the database. Instead of allowing arbitrary SQL generation, the LLM would call predefined, human-vetted API endpoints that correspond to safe, optimized queries. For maximum security in highly sensitive environments, the agent can generate the SQL but place it in a queue for human review and approval before it is ever executed. These layers of defense are essential to prevent a helpful tool from becoming a catastrophic liability.

That said, I usually recommend against building a SQL agent unless there is a very clear and compelling reason. The benefits are significant, but so are the risks; weigh the pros and cons carefully before committing to this approach.

If You Use a Cache, Be Smart About Invalidation

Caching is a fundamental technique for scaling LLM applications, as it reduces both latency and operational costs by storing and reusing previously generated responses. However, a poorly managed cache can become a double-edged sword. If you cache a response, but the underlying data source that informed that response changes, your system will continue to serve stale, incorrect information. This completely undermines the core benefit of using RAG for up-to-date knowledge and can severely damage user trust.

The key to effective caching lies in intelligent invalidation. A robust strategy is to create a unique cache key by hashing not just the user’s query, but also a representation of the documents that were retrieved to answer it. This could be a concatenation of document IDs, version numbers, or last-modified timestamps.

When a document in your knowledge base is updated, you must have a corresponding process to invalidate any cache entries that relied on it. This can be achieved with event-driven triggers that automatically purge relevant entries upon a data change. While this requires more complex engineering than a simple time-to-live (TTL) expiration, it is essential for applications where data freshness is critical. The goal is to strike the right balance: maximizing your cache hit rate for performance while ensuring the information you provide remains accurate and trustworthy.
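The hash-based keying and event-driven invalidation described above can be sketched as follows. This is a toy in-memory version under the assumption that your retriever can hand back `(doc_id, version)` pairs; a production system would back it with Redis or similar:

```python
import hashlib

class FreshnessAwareCache:
    """Cache keyed on the query *and* the versions of the documents
    used to answer it, with a reverse index for event-driven invalidation."""

    def __init__(self):
        self.entries = {}    # cache_key -> answer
        self.doc_index = {}  # doc_id -> set of cache keys depending on it

    @staticmethod
    def make_key(query, docs):
        """`docs` is a list of (doc_id, version) pairs for the retrieved
        documents; sorting makes the key order-independent."""
        parts = [query] + [f"{d}:{v}" for d, v in sorted(docs)]
        return hashlib.sha256("|".join(parts).encode()).hexdigest()

    def put(self, query, docs, answer):
        key = self.make_key(query, docs)
        self.entries[key] = answer
        for doc_id, _ in docs:
            self.doc_index.setdefault(doc_id, set()).add(key)

    def get(self, query, docs):
        return self.entries.get(self.make_key(query, docs))

    def invalidate_doc(self, doc_id):
        """Call this from your update pipeline when a document changes."""
        for key in self.doc_index.pop(doc_id, set()):
            self.entries.pop(key, None)
```

Note the two complementary mechanisms: a version bump changes the key (so stale lookups simply miss), while `invalidate_doc` actively purges entries so the cache doesn’t fill with dead weight.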

Of course, this comes at the cost of increased engineering complexity. Implementing a sophisticated, hash-based invalidation system is far more complex than a simple TTL cache. It requires a robust pipeline for tracking document versions and propagating update events to the cache, which can be brittle if not designed carefully. There is also computational overhead: you’re calculating a hash over the query content and document metadata for every single request, which can add latency.

If You Use a Graph, Make the LLM Cite Its Sources

Using a Knowledge Graph to augment an LLM (KGAG) enables a more sophisticated level of reasoning. Instead of just retrieving chunks of text, the model can traverse a structured graph of entities and their relationships, allowing it to answer complex queries that require understanding causal, temporal, or logical connections. However, this power comes with a challenge: the reasoning path through the graph can be opaque, making it difficult to understand how the LLM arrived at its conclusion. This “black box” problem is a major barrier to adoption in enterprise environments where explainability and trust are paramount.

The solution is to make citation a mandatory part of the generation process. As the LLM navigates the knowledge graph to construct an answer, it must be prompted to keep track of the specific nodes (entities) and edges (relationships) it utilizes. The final response should then present these as explicit citations, allowing users to trace the model’s logic back to its source within the graph.

This practice provides what is essentially free explainability. A user, whether they are an external customer or an internal developer, can instantly see the evidence trail and verify the answer’s accuracy against the source data. This not only builds immense trust in the system but also serves as an invaluable debugging tool. When an incorrect answer is generated, the cited path immediately reveals where the reasoning went astray, making the entire system more transparent, robust, and maintainable.

The Bottom Line

Look, there’s no silver bullet here. Anyone who tells you otherwise is selling something.

  • RAG is still your workhorse. It’s the most generally effective tool for most enterprise AI jobs that require up-to-date knowledge.
  • Fine-tuning and DAPT are for when you need to specialize. They’re powerful, but they’re also expensive and time-consuming.
  • Agents are for when the answer already exists in a structured format.
  • The best systems are hybrids. Don’t be afraid to mix and match.

My advice? Start with the simplest thing that could possibly work (probably RAG), and then iterate from there. Use this guide to help you choose the right path, but don’t be afraid to experiment. The only way to find out what works is to try it.