Build & Ship

What Is Retrieval Augmented Generation (RAG)? Complete Guide

By Ashit Vora · 13 min

What Matters

  • RAG combines LLM capabilities with your proprietary data by retrieving relevant documents at query time, eliminating the need for expensive fine-tuning for most use cases.
  • The RAG architecture: document chunking and embedding, vector database storage, semantic retrieval at query time, and context injection into the LLM prompt.
  • RAG beats fine-tuning when data changes frequently, you need source attribution, or you want to avoid the cost and complexity of model training - which covers 80% of enterprise use cases.
  • Common RAG failures: wrong chunk sizes (too large loses precision, too small loses context), poor embedding model selection, and not including source metadata for attribution.

Retrieval Augmented Generation (RAG) is the most practical pattern for making large language models (LLMs) useful with your own data. Instead of training or fine-tuning a model on your documents (expensive, slow, and hard to update), RAG retrieves relevant information at query time and feeds it to the LLM as context. The model generates answers grounded in your actual data rather than its training data.

TL;DR
RAG works by converting your documents into vector embeddings, storing them in a vector database, retrieving the most relevant chunks when a user asks a question, and feeding those chunks to an LLM to generate a grounded answer. It's better than fine-tuning when your data changes frequently, you need source attribution, or you want to avoid hallucination. The standard RAG pipeline takes 2-4 weeks to build for a single data source. Production RAG with multiple sources, hybrid search, and reranking takes 6-10 weeks. Accuracy depends heavily on chunking strategy and retrieval quality, not just the LLM.

How RAG Works

The RAG pipeline has two phases: indexing (done once, updated periodically) and retrieval + generation (done on every query).

Indexing Phase

Step 1: Document ingestion
Collect your source documents - PDFs, web pages, database records, Confluence pages, Slack messages, whatever contains the knowledge you want to make queryable.

Step 2: Chunking
Split documents into smaller pieces (chunks). This is the most underrated step in RAG and the one that most affects quality. Common strategies:

  • Fixed-size chunks - Split every 500 tokens with 50-token overlap. Simple, works okay for homogeneous content.
  • Semantic chunking - Split at paragraph or section boundaries. Preserves meaning better than fixed-size.
  • Hierarchical chunking - Create chunks at multiple levels (document summary, section summary, paragraph). Retrieve at the appropriate level based on query specificity.
  • Sentence-window chunking - Embed individual sentences but retrieve surrounding sentences as context. Best for precise retrieval.

Chunk size trade-offs:

| Smaller chunks (100-200 tokens) | Larger chunks (500-1000 tokens) |
| --- | --- |
| More precise retrieval | More context per chunk |
| May miss surrounding context | May include irrelevant content |
| Need more chunks per query | Fewer chunks needed |
| Better for factual Q&A | Better for summarization |
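As an illustration of the fixed-size strategy, here is a minimal chunker. It uses whitespace-separated words as a stand-in for tokenizer tokens; a production pipeline would count real tokens with the embedding model's own tokenizer.

```python
def chunk_fixed_size(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks.

    Words split on whitespace stand in for tokenizer tokens here;
    swap in your embedding model's tokenizer for exact token counts.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each step
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

The overlap means the last 50 "tokens" of each chunk reappear at the start of the next, so facts that straddle a boundary are retrievable from at least one chunk.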

Step 3: Embedding
Convert each chunk into a vector (a numerical representation of its meaning) using an embedding model. Popular choices:

  • OpenAI text-embedding-3-large - 3072 dimensions, strong performance across domains
  • Cohere embed-v3 - Good multilingual support
  • BGE-large - Open source, self-hostable, competitive performance
  • Voyage AI - Strong for code and technical content

The embedding model determines how well your system understands semantic similarity. Choose based on your content type and language requirements.

Step 4: Vector storage
Store the vectors in a vector database for fast similarity search. Options:

  • Pinecone - Fully managed, easy to start, good at scale
  • Weaviate - Open source, supports hybrid search natively
  • Qdrant - Open source, fast, good filtering capabilities
  • pgvector - PostgreSQL extension, good if you're already using Postgres
  • Chroma - Lightweight, great for prototyping


Retrieval + Generation Phase

Step 1: Query embedding
When a user asks a question, embed their query using the same embedding model.

Step 2: Similarity search
Find the K most similar chunks in your vector database (typically K = 5-20). This is an approximate nearest neighbor search - fast even over millions of chunks.
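Conceptually, this step is just "rank every chunk vector by cosine similarity to the query vector." A brute-force sketch makes that concrete; a real vector database replaces the linear scan with an approximate index (HNSW, IVF) for sub-linear search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=5):
    """index: list of (chunk_id, vector) pairs. Returns the k most similar ids.

    This O(n) scan is fine for prototypes; vector databases exist to
    make it fast at millions of chunks.
    """
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```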

Step 3: Context assembly
Combine the retrieved chunks into a context window. Order matters - put the most relevant chunks first. Include metadata (source document, page number, date) for attribution.

Step 4: Prompt construction
Build a prompt that includes:

  • System instructions (role, tone, constraints)
  • Retrieved context (the relevant chunks)
  • The user's question
  • Output format instructions (if needed)
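A minimal prompt builder covering those four parts might look like this. The layout and the source-tag format are illustrative conventions, not a standard - adjust to whatever your LLM responds to best.

```python
DEFAULT_SYSTEM = (
    "You are a support assistant. Answer only from the provided context. "
    "Cite sources. If the context is insufficient, say so."
)

def build_prompt(question, chunks, system=DEFAULT_SYSTEM):
    """chunks: list of dicts with 'text', 'source', and 'page' keys
    (hypothetical field names - match them to your own metadata schema).
    Assumes chunks are already ordered most-relevant first."""
    context = "\n\n".join(
        f"[Source: {c['source']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```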

Step 5: Generation
Send the prompt to an LLM (GPT-4, Claude, etc.). The model generates an answer grounded in the provided context. Instruct it to cite sources and acknowledge when the context doesn't contain enough information to answer.

RAG vs. Fine-Tuning: When to Use Which

This is the most common question about RAG. The answer depends on what you're trying to achieve.

Use RAG when:

  • Your data changes frequently (weekly or more often)
  • You need source attribution ("this answer comes from document X, page Y")
  • You want to minimize hallucination (RAG grounds answers in retrieved data)
  • You have a diverse knowledge base (product docs + support tickets + internal wiki)
  • Budget is a concern (RAG requires no model training)

Use fine-tuning when:

  • You need the model to learn a specific style, format, or behavior
  • Your task is well-defined and consistent (classification, extraction, summarization in a specific format)
  • You need the model to internalize domain terminology and relationships
  • Latency is critical (fine-tuned models don't need retrieval step)

Use both (RAG + fine-tuned model) when:

  • You need domain-specific language understanding AND access to current data
  • The base model struggles with your domain's terminology even with good context
  • You're building a production system where both accuracy and currency matter

In practice, most teams should start with RAG. It's faster to implement, easier to update, and provides source attribution that fine-tuning doesn't. Add fine-tuning only when RAG alone doesn't meet your quality bar.

Start with RAG, not fine-tuning
RAG covers 80% of enterprise use cases - knowledge bases, document Q&A, support. It costs less, updates instantly, and provides source citations. Fine-tune only when you need to change the model's behavior or style.

| Dimension | RAG | Fine-Tuning | Winner |
| --- | --- | --- | --- |
| Data freshness | Updates instantly - re-index and go | Requires retraining to incorporate new data | RAG for frequently changing data |
| Source attribution | Built-in - tracks which chunks contributed | No attribution - answers come from learned weights | RAG when citations matter |
| Style/behavior change | Limited - model behavior stays the same | Full control over model style and format | Fine-tuning for custom behavior |
| Budget | No model training cost - pay per query | $10K-100K+ for training runs | RAG for budget-sensitive projects |
| Latency | Retrieval adds 100-500ms per query | No retrieval step - direct inference | Fine-tuning for speed-critical apps |
For 80% of enterprise use cases - knowledge bases, document Q&A, support - RAG is the better starting point.

Architecture Patterns

Basic RAG

The simplest implementation. Good for prototypes and single-source use cases.

User Query → Embed → Vector Search → Top K Chunks → LLM → Answer

Pros: Simple, fast to build (1-2 weeks)
Cons: Retrieval quality limited by embedding similarity alone

Hybrid Search

Combines vector similarity search with keyword search (BM25). Handles queries where exact terminology matters (product names, error codes, specific phrases).

User Query → Embed → Vector Search ─┐
           → BM25 Keyword Search ───┤→ Merge + Rerank → LLM → Answer

Pros: Better retrieval for mixed query types
Cons: More complex, needs tuning of the merge/rerank step
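One common way to implement the merge step is reciprocal rank fusion (RRF). It works on rank positions alone, which sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales. A sketch, assuming each retriever returns a ranked list of chunk ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists from multiple retrievers.

    Each chunk scores sum(1 / (k + rank)) over every list it appears in.
    k=60 is the constant from the original RRF paper; it damps the
    influence of any single retriever's top-ranked results.
    """
    scores = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Chunks that appear high in both the vector and BM25 lists float to the top; chunks found by only one retriever still survive, just lower down.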

RAG with Reranking

After initial retrieval, a cross-encoder model reranks the results for higher precision. This catches cases where embedding similarity retrieves related-but-wrong chunks.

User Query → Embed → Vector Search (Top 20) → Reranker (Top 5) → LLM → Answer

Reranking models: Cohere Rerank, BGE-reranker, cross-encoder from Sentence Transformers

Pros: Significantly better retrieval precision (10-20% improvement typical)
Cons: Adds latency (100-300ms per query) and cost
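The retrieve-wide-then-rerank-narrow flow can be sketched as below, with the cross-encoder abstracted behind a scoring callable (in practice Cohere Rerank, BGE-reranker, or a Sentence Transformers cross-encoder fills that role).

```python
def retrieve_then_rerank(query, retriever, scorer, fetch_k=20, final_k=5):
    """Two-stage retrieval: cheap wide recall, then expensive narrow precision.

    retriever(query, k) -> list of candidate chunks (fast, approximate).
    scorer(query, chunk) -> relevance score (slow, accurate - the
    cross-encoder step).
    """
    candidates = retriever(query, fetch_k)
    ranked = sorted(candidates, key=lambda c: scorer(query, c), reverse=True)
    return ranked[:final_k]
```

The design point is the asymmetry: the cross-encoder is too slow to score millions of chunks, but scoring 20 candidates per query is affordable and recovers most of its precision benefit.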

Agentic RAG

The LLM acts as an autonomous agent that decides what to retrieve, when, and from where. It can reformulate queries, search multiple sources, evaluate retrieved results, and iterate until it finds sufficient information. In 2026, agentic RAG has become the baseline for any serious enterprise RAG application - the static "retrieve once, generate once" pattern is increasingly seen as a prototype-only approach.

User Query → Agent LLM → Decide: What do I need?
                              ↓
                         Search Source A → Not enough → Reformulate → Search Source B
                              ↓                                           ↓
                         Combine Results → Generate Answer → Self-evaluate → Final Answer

Key capabilities beyond basic RAG:

  • Query decomposition - Breaks complex questions into sub-queries and retrieves for each independently
  • Source routing - Decides which knowledge base to search based on the question type
  • Self-correction - Evaluates retrieved chunks for relevance and re-retrieves if quality is insufficient
  • Multi-hop reasoning - Chains retrievals where the answer to one query informs the next retrieval

Pros: Handles complex, multi-step questions across multiple data sources. 90%+ accuracy on complex queries vs. 60-70% for basic RAG.
Cons: Higher latency (3-10 seconds vs. 1-3), higher cost (3-5x more LLM calls), harder to debug.

See our guide to agentic AI for more.

GraphRAG

GraphRAG combines knowledge graphs with vector search. Instead of treating documents as isolated chunks, it builds a graph of entities and relationships extracted from your data, then uses graph traversal alongside vector retrieval.

User Query → Vector Search (relevant chunks) ────┐
           → Graph Traversal (related entities) ─┤→ Merge → LLM → Answer

When GraphRAG wins over standard RAG:

  • Multi-hop questions - "Which customers in the healthcare vertical had escalations related to our billing module?" requires traversing customer → industry, customer → support tickets → module relationships
  • Summarization over large corpora - "What are the main themes across all Q4 support tickets?" requires entity extraction and clustering, not just chunk retrieval
  • Relationship-dependent queries - Any question where the answer depends on connections between entities rather than content within a single document

Pros: Handles relationship-dependent and multi-hop queries that pure vector search misses entirely
Cons: Requires entity extraction pipeline to build the graph, more complex to maintain, higher upfront investment

RAG vs. Long Context: The 2026 Debate

With context windows reaching 1 million+ tokens (Gemini 3.1 Pro) and 200K+ (Claude Opus 4.6), a reasonable question emerges: do you still need RAG at all? Can you just dump all your documents into the context window?

When long context replaces RAG:

  • Small knowledge bases (under 500 pages / 200K tokens) that fit in context
  • One-off analysis tasks where building a RAG pipeline is not worth the setup
  • Situations where the entire document set is relevant (not just a few chunks)

When RAG still wins:

  • Large knowledge bases (thousands of documents) that exceed any context window
  • Cost sensitivity - sending 1M tokens per query is expensive; retrieving 5 relevant chunks is cheap
  • Latency requirements - processing 1M tokens takes significantly longer than processing 5 chunks
  • Data freshness - RAG indexes update incrementally; context-stuffing requires rebuilding the full prompt
  • Source attribution - RAG naturally tracks which chunks contributed to the answer

For most enterprise applications with growing knowledge bases, RAG remains the practical architecture. Long context is a complement - useful for deep analysis of retrieved documents - not a replacement.

Implementation Steps

Week 1: Data Preparation

  1. Inventory your source documents (type, format, volume, update frequency)
  2. Clean the data (remove duplicates, fix formatting, handle encoding issues)
  3. Choose and implement a chunking strategy
  4. Select an embedding model and generate embeddings
  5. Load embeddings into a vector database

Week 2: Basic Pipeline

  1. Build the query pipeline (embed query, retrieve, assemble context, generate)
  2. Create a simple evaluation dataset (50-100 question-answer pairs from your data)
  3. Test retrieval quality: does the system retrieve the right chunks for each question?
  4. Test generation quality: does the LLM produce correct answers from the retrieved context?
  5. Identify failure patterns (wrong chunks retrieved, right chunks but wrong answer, no relevant chunks exist)

Week 3: Optimization

  1. Tune chunk size and overlap based on failure analysis
  2. Add metadata filtering (restrict search by document type, date, or category)
  3. Implement hybrid search if keyword-sensitive queries are failing
  4. Add reranking if retrieval precision needs improvement
  5. Optimize prompts based on generation quality analysis

Week 4: Production Readiness

  1. Add error handling (what happens when no relevant chunks are found?)
  2. Implement answer confidence scoring (how sure is the system?)
  3. Build source citation into the output
  4. Add logging and monitoring (track retrieval quality, generation quality, latency)
  5. Implement feedback collection (thumbs up/down on answers)

Common Mistakes

1. Ignoring chunking strategy. Most RAG quality issues trace back to chunking. If chunks don't contain complete, coherent information units, the LLM gets garbage context and produces garbage answers. Spend time here.

2. Using only vector search. Vector similarity misses exact matches. A user searching for error code "ERR_AUTH_403" needs keyword matching, not semantic similarity. Hybrid search solves this.

3. Stuffing too much context. More retrieved chunks isn't always better. Beyond 5-10 relevant chunks, you introduce noise that confuses the LLM. Use reranking to select the best chunks, not just the most.

4. Not evaluating retrieval separately from generation. When answers are wrong, is it because the system retrieved the wrong information, or because the LLM misinterpreted correct information? These are different problems with different solutions.

5. Forgetting about data freshness. If your source documents update daily but your index updates weekly, you're serving stale answers. Automate re-indexing on a schedule that matches your data's update frequency.

6. No fallback for out-of-scope questions. Users will ask questions your data doesn't cover. The system should recognize this and say "I don't have information about that" rather than hallucinating an answer from the LLM's training data.
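A simple guard for this failure mode is a score threshold on the best retrieved chunk. A sketch, where the 0.75 cutoff is an illustrative value you would calibrate against your own eval set:

```python
OUT_OF_SCOPE_MSG = "I don't have information about that in my knowledge base."

def answer_or_refuse(question, retrieve, generate, min_score=0.75):
    """retrieve(q) -> list of (score, chunk), best first.
    generate(q, chunks) -> answer string (the LLM call).

    If even the best-matching chunk scores below the threshold, refuse
    rather than letting the LLM answer from its training data.
    """
    results = retrieve(question)
    if not results or results[0][0] < min_score:
        return OUT_OF_SCOPE_MSG
    return generate(question, [chunk for _, chunk in results])
```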

Performance Benchmarks

For a well-implemented RAG system, target:

  • Retrieval recall@10: 85-95% (the correct chunk is in the top 10 retrieved)
  • Answer accuracy: 80-90% (subjective, evaluated by domain experts)
  • Latency: 1-3 seconds end-to-end (query to displayed answer)
  • Source attribution accuracy: 95%+ (cited sources actually contain the stated information)

No amount of LLM prompt engineering compensates for retrieving the wrong information. Fix retrieval first, then optimize generation.
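The recall@10 target above is cheap to measure once you have an eval set of question/gold-chunk pairs (the 50-100 pairs from the Week 2 step). A minimal harness:

```python
def recall_at_k(eval_set, retrieve, k=10):
    """Fraction of questions whose gold chunk appears in the top-k results.

    eval_set: list of (question, gold_chunk_id) pairs.
    retrieve(question, k) -> ranked list of chunk ids from your pipeline.
    """
    hits = sum(1 for question, gold in eval_set if gold in retrieve(question, k))
    return hits / len(eval_set)
```

Tracking this number separately from answer accuracy tells you whether a quality problem lives in retrieval or in generation.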

RAG is the foundation for most enterprise AI applications - customer service chatbots, internal knowledge assistants, document Q&A systems, and more. At 1Raft, we build RAG-powered applications for clients across industries, from healthcare knowledge systems to fintech compliance assistants. If you're planning a RAG implementation, our RAG development team can help you get the architecture right from the start.


1Raft builds production RAG systems across 100+ products, from healthcare knowledge systems to fintech compliance assistants. We optimize chunk strategies, embedding models, and retrieval pipelines for your specific domain. Our 12-week sprints deliver production-grade accuracy with source attribution, not just demo-quality prototypes.
