
What is RAG? Retrieval-Augmented Generation, explained simply

Definition

RAG (Retrieval-Augmented Generation) makes an AI model answer from your actual documents instead of from memory: before responding, the system retrieves the most relevant passages and hands them to the model along with the question. The exam analogy is exact — a plain LLM sits a closed-book exam and sometimes bluffs; RAG gives it the open book.

If you learn one piece of GenAI engineering vocabulary beyond “LLM”, make it this one. RAG is the architecture behind most serious “chat with our data” systems deployed in companies today, one of the most requested project types in Indian GenAI hiring, and, usefully for learners, arguably the single best portfolio project for proving you can build with AI rather than just talk to it.

The problem RAG exists to solve

A large language model answers from what it absorbed during training. That creates three hard walls for business use:

  • Privacy wall. Your HR policies, contracts, and support tickets were never in any training set, so the model cannot know them.
  • Time wall. Training data has a cutoff; yesterday’s price list is beyond it.
  • Honesty wall. When the model lacks a fact, it may generate something fluent and false (a hallucination) with total confidence. Why this happens is covered in “What is generative AI”.

You could try squeezing your documents into the model by retraining it — expensive, slow, and stale the day a document changes. RAG takes the cheaper, smarter route: leave the model alone and change what it reads at question time.

How a RAG pipeline works — four steps

  • 1 · Ingest & chunk. Your documents (PDFs, wikis, tickets) are split into passages of a few hundred words each — small enough to retrieve precisely.
  • 2 · Embed & store. Each chunk is converted into an embedding — a list of numbers capturing its meaning — and stored in a vector database (Chroma, Pinecone, pgvector). Similar meanings land near each other in this number-space.
  • 3 · Retrieve. When a user asks something, the question is embedded the same way, and the database returns the closest chunks — "leave policy" finds the "annual vacation entitlement" paragraph despite sharing zero keywords.
  • 4 · Augment & generate. The retrieved passages are stapled into the model's prompt: "Answer using only the following context…". The model writes the answer from the evidence — and can cite which document it came from.
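
The four steps above can be sketched in plain Python. This is a minimal illustration, not a production design: the `embed` function here is a toy word-count stand-in for a learned embedding model, so it only matches shared words and would not pair "leave policy" with "annual vacation entitlement" the way a real embedding does. The sample documents, function names, and the `max_words` and `k` parameters are all hypothetical, and a plain list takes the place of a vector database.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Step 1: split text into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Step 2 (toy stand-in): word counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Step 3: embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Step 4: place the retrieved passages in the prompt, numbered for citation."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the following context. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical sample documents, invented for illustration only.
documents = [
    "Employees receive 24 days of annual leave per calendar year.",
    "Expense claims must be submitted within 30 days of purchase.",
    "The office is closed on national public holidays.",
]

# Steps 1-2: chunk every document and store (chunk, vector) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk_text(doc)]

# Steps 3-4: retrieve for one question and build the prompt for the LLM.
question = "How many days of annual leave do employees receive?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The printed prompt is what would be sent to an LLM API. Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, not the shape of the pipeline.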

Steps 1–2 run once up front and re-run only for documents that change, which is fast and needs no retraining. Steps 3–4 run per question, typically in about a second. Frameworks like LangChain wire the whole pipeline in a few dozen lines of Python. The engineering craft is in the details: chunk sizes, retrieval quality, and evaluating whether answers are actually grounded.
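
One common way to put a number on retrieval quality is a hit rate at k: for a small set of test questions where you know which chunk holds the answer, check how often that chunk appears in the top k results. The sketch below assumes you have already collected the retrieved chunk IDs per question; the function name `hit_rate_at_k`, the questions, and the chunk IDs are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose expected chunk id is among the top-k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(1 for q, want in expected.items() if want in results.get(q, [])[:k])
    return hits / len(expected)


# Hypothetical retrieval output: question -> ranked chunk ids.
retrieved = {
    "How many leave days do I get?": ["hr-07", "hr-02", "it-11"],
    "How do I reset my VPN password?": ["it-03", "it-11", "hr-02"],
    "When are expense claims due?": ["fin-01", "hr-07", "fin-04"],
}

# Hypothetical ground truth: question -> id of the chunk that holds the answer.
expected = {
    "How many leave days do I get?": "hr-07",
    "How do I reset my VPN password?": "it-11",
    "When are expense claims due?": "fin-09",
}

print(hit_rate_at_k(retrieved, expected, k=1))  # 1 of 3 questions hit
print(hit_rate_at_k(retrieved, expected, k=3))  # 2 of 3 questions hit
```

Tracking this figure while varying chunk size or k shows whether a change helped retrieval before any LLM call is made.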

RAG vs fine-tuning — the decision table

|  | RAG | Fine-tuning |
| --- | --- | --- |
| What changes | What the model reads (retrieval at question time) | What the model is (weights retrained on examples) |
| Right tool for | Knowledge: "answer from our documents" | Behaviour: tone, format, a specialised skill |
| Updating information | Instant: re-index the document | Retrain and redeploy |
| Citations possible? | Yes: it knows which chunk it used | No: knowledge is baked in, untraceable |
| Typical cost | Low (embeddings + a vector DB) | Moderate to high (GPU training runs) |
| Common mistake | Skipping retrieval evaluation | Fine-tuning to inject facts (it works poorly) |

The two also combine — a fine-tuned model can sit inside a RAG pipeline — but when someone asks “should we fine-tune so the AI knows our data?”, the experienced answer is almost always: no, you want RAG.

Where you have already met RAG

Perplexity’s cited answers, ChatGPT’s and Claude’s file-upload chats, Notion’s and Slack’s workspace Q&A, the newer policy-aware chatbots at banks and insurers, internal HR and IT helpdesk assistants across Indian IT firms: all of these rely on retrieval in some form, sometimes under labels like “grounding” or “knowledge-augmented AI.” And when an AI agent needs to answer from company knowledge mid-task, retrieval becomes one of its tools, so RAG and agents compose naturally.

Why RAG is the portfolio project that gets interviews

It exercises the full GenAI stack in one artifact: data processing, embeddings, a vector database, prompt construction (see the prompt engineering guide — the “answer only from context” instruction is prompt craft), API integration, and evaluation. It maps one-to-one onto what companies are actually building. And it demos brilliantly: “ask my app anything about these 200 pages” lands harder in an interview than any accuracy score. Building a deployed RAG application is a flagship project in our AI course for exactly these reasons — Python skills from our Python-for-AI path are the only real prerequisite.

Frequently asked questions

Does RAG completely stop hallucinations?

It reduces them dramatically but not to zero — the model can still misread retrieved passages or over-reach beyond them. Production systems add instructions to refuse when context is insufficient, plus evaluation that checks answers against sources.
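
Both safeguards can be sketched in a few lines. The example below assumes retrieval returns a similarity score with each passage, refuses when no passage clears a threshold, and uses word overlap as a deliberately crude proxy for checking an answer against its sources. The names `should_refuse` and `support_ratio`, the 0.3 threshold, and the sample texts are hypothetical; real evaluation usually needs a stronger check than shared words.

```python
import re

REFUSAL = "I could not find this in the provided documents."


def tokens(text: str) -> set[str]:
    """Lowercased word set, used for a rough overlap comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def should_refuse(scored_passages: list[tuple[float, str]], min_score: float = 0.3) -> bool:
    """Refuse when no retrieved passage reaches the similarity threshold."""
    return not any(score >= min_score for score, _ in scored_passages)


def support_ratio(answer: str, passages: list[str]) -> float:
    """Share of the answer's words that also occur in the retrieved passages."""
    answer_words = tokens(answer)
    if not answer_words:
        return 0.0
    source_words = set().union(*(tokens(p) for p in passages))
    return len(answer_words & source_words) / len(answer_words)


# Hypothetical passage and answers, invented for illustration.
passages = ["Employees receive 24 days of annual leave per calendar year."]
grounded = "Employees receive 24 days of annual leave."
overreach = "Employees receive 40 days plus unlimited sick leave."

print(support_ratio(grounded, passages))   # 1.0: every word appears in the source
print(support_ratio(overreach, passages))  # 0.5: half the words are unsupported

if should_refuse([(0.12, "unrelated passage")]):
    print(REFUSAL)
```

A low support ratio does not prove a hallucination and a high one does not prove correctness; the point is to flag answers for closer review.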

What do I need to learn to build a RAG system?

Python, one framework (LangChain is the common choice), one vector database (Chroma is the friendly free start), and an LLM API. A working prototype over your own PDFs is a weekend project; making retrieval genuinely good is the deeper skill.

Is RAG still relevant now that models have huge context windows?

Yes. Long context lets you paste more in, but company knowledge bases run to gigabytes, retrieval keeps costs and latency sane, and citations still require knowing which passage the answer used. Long context changed RAG design; it did not replace it.
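
A back-of-envelope calculation makes the scale argument concrete. Every number below is an assumption chosen for illustration: roughly 4 bytes of English text per token, a 1 GiB knowledge base, a hypothetical model with a 1,000,000-token context window, and 5 retrieved chunks of about 400 tokens each.

```python
BYTES_PER_TOKEN = 4           # assumption: rough rule of thumb for English text
corpus_bytes = 1 * 1024**3    # assumption: a 1 GiB knowledge base
context_window = 1_000_000    # assumption: hypothetical long-context model

chunks_per_question = 5       # assumption: top-5 retrieval
tokens_per_chunk = 400        # assumption: a few hundred words per chunk

corpus_tokens = corpus_bytes // BYTES_PER_TOKEN
windows_needed = corpus_tokens / context_window
rag_tokens = chunks_per_question * tokens_per_chunk

print(f"Corpus size: {corpus_tokens:,} tokens")
print(f"Context windows needed to hold it all: {windows_needed:.0f}")
print(f"Context tokens sent per question with RAG: {rag_tokens:,}")
```

Under these assumptions the corpus is hundreds of times larger than the window, while retrieval sends only a couple of thousand tokens of context per question.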

Is RAG a machine learning technique?

It is an engineering architecture built on ML components (embeddings, LLMs) — which is good news: you can build production-grade RAG without training any model yourself.
