What Matters
- RAG combines LLM capabilities with your proprietary data by retrieving relevant documents at query time, eliminating the need for expensive fine-tuning in most use cases.
- The RAG architecture: document chunking and embedding, vector database storage, semantic retrieval at query time, and context injection into the LLM prompt.
- RAG beats fine-tuning when data changes frequently, you need source attribution, or you want to avoid the cost and complexity of model training, which covers roughly 80% of enterprise use cases.
- Common RAG failures: wrong chunk sizes (too large loses precision, too small loses context), poor embedding model selection, and missing source metadata for attribution.
Retrieval Augmented Generation (RAG) is the most practical pattern for making large language models (LLMs) useful with your own data. Instead of training or fine-tuning a model on your documents (expensive, slow, and hard to update), RAG retrieves relevant information at query time and feeds it to the LLM as context. The model generates answers grounded in your actual data rather than its training data.
How RAG Works
The RAG pipeline has two phases: indexing (done once, updated periodically) and retrieval + generation (done on every query).
Indexing Phase
Step 1: Document ingestion. Collect your source documents - PDFs, web pages, database records, Confluence pages, Slack messages, whatever contains the knowledge you want to make queryable.
Step 2: Chunking. Split documents into smaller pieces (chunks). This is the most underrated step in RAG and the one that most affects quality. Common strategies:
- Fixed-size chunks - Split every 500 tokens with 50-token overlap. Simple, works okay for homogeneous content.
- Semantic chunking - Split at paragraph or section boundaries. Preserves meaning better than fixed-size.
- Hierarchical chunking - Create chunks at multiple levels (document summary, section summary, paragraph). Retrieve at the appropriate level based on query specificity.
- Sentence-window chunking - Embed individual sentences but retrieve surrounding sentences as context. Best for precise retrieval.
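To make the simplest of these strategies concrete, here is a minimal sketch of fixed-size chunking with overlap. It uses whitespace-split words as a stand-in for tokens; a real pipeline would use the tokenizer that matches its embedding model, and the function name is illustrative, not from any library.

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Words stand in for tokens here; production systems should tokenize
    with the same tokenizer the embedding model uses.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # assumes chunk_size > overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail of the document
    return chunks
```

The overlap means the last 50 "tokens" of one chunk reappear at the start of the next, so a sentence straddling a boundary is never lost entirely.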
Chunk size trade-offs:
| Smaller chunks (100-200 tokens) | Larger chunks (500-1000 tokens) |
|---|---|
| More precise retrieval | More context per chunk |
| May miss surrounding context | May include irrelevant content |
| Need more chunks per query | Fewer chunks needed |
| Better for factual Q&A | Better for summarization |
Step 3: Embedding. Convert each chunk into a vector (a numerical representation of its meaning) using an embedding model. Popular choices:
- OpenAI text-embedding-3-large - 3072 dimensions, strong performance across domains
- Cohere embed-v3 - Good multilingual support
- BGE-large - Open source, self-hostable, competitive performance
- Voyage AI - Strong for code and technical content
The embedding model determines how well your system understands semantic similarity. Choose based on your content type and language requirements.
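"Semantic similarity" between embeddings is typically measured as cosine similarity, the cosine of the angle between two vectors. A minimal, dependency-free sketch (real systems use vectorized math libraries or the database's built-in distance functions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 means identical direction (very similar meaning),
    0.0 means orthogonal (unrelated meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Because cosine similarity ignores vector length, two chunks about the same topic score high even if one is much longer than the other.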
Step 4: Vector storage. Store the vectors in a vector database for fast similarity search. Options:
- Pinecone - Fully managed, easy to start, good at scale
- Weaviate - Open source, supports hybrid search natively
- Qdrant - Open source, fast, good filtering capabilities
- pgvector - PostgreSQL extension, good if you're already using Postgres
- Chroma - Lightweight, great for prototyping
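To show what a vector store does at its core, here is a deliberately simplified in-memory version using brute-force cosine search. The class and method names are illustrative; the databases above replace the brute-force loop with approximate nearest neighbor indexes (HNSW, IVF) to stay fast at scale.

```python
import math

class InMemoryVectorStore:
    """Toy vector store: brute-force cosine search over stored chunks.
    Illustrative only; production stores use ANN indexes for scale."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str, dict]] = []

    def add(self, vector: list[float], text: str, metadata: dict) -> None:
        self._items.append((vector, text, metadata))

    def search(self, query: list[float], k: int = 5) -> list[tuple[float, str, dict]]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # Score every stored chunk against the query, best first
        scored = [(cosine(query, v), t, m) for v, t, m in self._items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]
```

Note that metadata travels with each vector; that is what later makes source attribution and filtered search possible.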
Retrieval + Generation Phase
Step 1: Query embedding. When a user asks a question, embed their query using the same embedding model used at indexing time.
Step 2: Similarity search. Find the K most similar chunks in your vector database (typically K = 5-20). This is an approximate nearest neighbor search - fast even over millions of chunks.
Step 3: Context assembly. Combine the retrieved chunks into a context window. Order matters - put the most relevant chunks first. Include metadata (source document, page number, date) for attribution.
Step 4: Prompt construction. Build a prompt that includes:
- System instructions (role, tone, constraints)
- Retrieved context (the relevant chunks)
- The user's question
- Output format instructions (if needed)
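A minimal prompt-assembly helper following that structure might look like this. The template wording and field names (`source`, `text`) are illustrative, not a prescribed format:

```python
def build_rag_prompt(question: str, chunks: list[dict], system: str) -> str:
    """Assemble a RAG prompt: system instructions first, then retrieved
    context with source metadata for attribution, then the question."""
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Numbered entries let the model cite sources as [1], [2], ...
        context_lines.append(f"[{i}] (source: {chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above. Cite sources as [n]. "
        "If the context does not contain the answer, say so."
    )
```

The final instruction line is what nudges the model toward "I don't know" instead of hallucinating when retrieval comes up empty.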
Step 5: Generation. Send the prompt to an LLM (GPT-4, Claude, etc.). The model generates an answer grounded in the provided context. Instruct it to cite sources and acknowledge when the context doesn't contain enough information to answer.
RAG vs. Fine-Tuning: When to Use Which
This is the most common question about RAG. The answer depends on what you're trying to achieve.
Use RAG when:
- Your data changes frequently (weekly or more often)
- You need source attribution ("this answer comes from document X, page Y")
- You want to minimize hallucination (RAG grounds answers in retrieved data)
- You have a diverse knowledge base (product docs + support tickets + internal wiki)
- Budget is a concern (RAG requires no model training)
Use fine-tuning when:
- You need the model to learn a specific style, format, or behavior
- Your task is well-defined and consistent (classification, extraction, summarization in a specific format)
- You need the model to internalize domain terminology and relationships
- Latency is critical (fine-tuned models don't need retrieval step)
Use both (RAG + fine-tuned model) when:
- You need domain-specific language understanding AND access to current data
- The base model struggles with your domain's terminology even with good context
- You're building a production system where both accuracy and currency matter
In practice, most teams should start with RAG. It's faster to implement, easier to update, and provides source attribution that fine-tuning doesn't. Add fine-tuning only when RAG alone doesn't meet your quality bar.
For 80% of enterprise use cases - knowledge bases, document Q&A, support - RAG is the better starting point.
Architecture Patterns
Basic RAG
The simplest implementation. Good for prototypes and single-source use cases.
User Query → Embed → Vector Search → Top K Chunks → LLM → Answer
Pros: Simple, fast to build (1-2 weeks).
Cons: Retrieval quality limited by embedding similarity alone.
RAG with Hybrid Search
Combines vector similarity search with keyword search (BM25). Handles queries where exact terminology matters (product names, error codes, specific phrases).
User Query ─┬→ Embed → Vector Search ───┐
            └→ BM25 Keyword Search ─────┴→ Merge + Rerank → LLM → Answer
Pros: Better retrieval for mixed query types.
Cons: More complex, needs tuning of the merge/rerank step.
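A common way to implement the merge step is reciprocal rank fusion (RRF), which combines ranked lists without needing their scores to be on comparable scales. This sketch assumes each retriever returns an ordered list of chunk IDs; the constant k = 60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each document earns 1 / (k + rank)
    for every list it appears in; highest combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF works because raw BM25 scores and cosine similarities live on different scales; ranks are the only thing safely comparable between them.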
RAG with Reranking
After initial retrieval, a cross-encoder model reranks the results for higher precision. This catches cases where embedding similarity retrieves related-but-wrong chunks.
User Query → Embed → Vector Search (Top 20) → Reranker (Top 5) → LLM → Answer
Reranking models: Cohere Rerank, BGE-reranker, cross-encoder from Sentence Transformers
Pros: Significantly better retrieval precision (10-20% improvement typical).
Cons: Adds latency (100-300 ms per query) and cost.
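The retrieve-wide-then-rerank-narrow flow can be sketched as follows. The scorer is injected as a function so the flow stays self-contained; in practice `score_fn` would call one of the cross-encoder models listed above, and the word-overlap scorer shown here is only a hypothetical stand-in.

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Rerank a wide candidate set (e.g. top 20 from vector search)
    down to a precise top-N. score_fn(query, chunk) -> float would
    normally be a cross-encoder model call."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in scorer: count of shared lowercase words.
    Illustrative only; a real reranker scores the (query, chunk) pair jointly."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```

The key design point is the two-stage budget: the cheap retriever casts a wide net, and the expensive scorer only runs on the handful of candidates it returned.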
Agentic RAG
The LLM acts as an autonomous agent that decides what to retrieve, when, and from where. It can reformulate queries, search multiple sources, evaluate retrieved results, and iterate until it finds sufficient information. In 2026, agentic RAG has become the baseline for any serious enterprise RAG application - the static "retrieve once, generate once" pattern is increasingly seen as a prototype-only approach.
User Query → Agent LLM → Decide: what do I need?
        ↓
Search Source A → Not enough → Reformulate → Search Source B
        ↓                                          ↓
Combine Results → Generate Answer → Self-evaluate → Final Answer
Key capabilities beyond basic RAG:
- Query decomposition - Breaks complex questions into sub-queries and retrieves for each independently
- Source routing - Decides which knowledge base to search based on the question type
- Self-correction - Evaluates retrieved chunks for relevance and re-retrieves if quality is insufficient
- Multi-hop reasoning - Chains retrievals where the answer to one query informs the next retrieval
Pros: Handles complex, multi-step questions across multiple data sources; 90%+ accuracy on complex queries vs. 60-70% for basic RAG.
Cons: Higher latency (3-10 seconds vs. 1-3), higher cost (3-5x more LLM calls), harder to debug. See our guide to agentic AI for more.
GraphRAG
GraphRAG combines knowledge graphs with vector search. Instead of treating documents as isolated chunks, it builds a graph of entities and relationships extracted from your data, then uses graph traversal alongside vector retrieval.
User Query ─┬→ Vector Search (relevant chunks) ────┐
            └→ Graph Traversal (related entities) ─┴→ Merge → LLM → Answer
When GraphRAG wins over standard RAG:
- Multi-hop questions - "Which customers in the healthcare vertical had escalations related to our billing module?" requires traversing customer → industry, customer → support tickets → module relationships
- Summarization over large corpora - "What are the main themes across all Q4 support tickets?" requires entity extraction and clustering, not just chunk retrieval
- Relationship-dependent queries - Any question where the answer depends on connections between entities rather than content within a single document
Pros: Handles relationship-dependent and multi-hop queries that pure vector search misses entirely.
Cons: Requires an entity extraction pipeline to build the graph, more complex to maintain, higher upfront investment.
RAG vs. Long Context: The 2026 Debate
With context windows reaching 1 million+ tokens (Gemini 3.1 Pro) and 200K+ (Claude Opus 4.6), a reasonable question emerges: do you still need RAG at all? Can you just dump all your documents into the context window?
When long context replaces RAG:
- Small knowledge bases (under 500 pages / 200K tokens) that fit in context
- One-off analysis tasks where building a RAG pipeline is not worth the setup
- Situations where the entire document set is relevant (not just a few chunks)
When RAG still wins:
- Large knowledge bases (thousands of documents) that exceed any context window
- Cost sensitivity - sending 1M tokens per query is expensive; retrieving 5 relevant chunks is cheap
- Latency requirements - processing 1M tokens takes significantly longer than processing 5 chunks
- Data freshness - RAG indexes update incrementally; context-stuffing requires rebuilding the full prompt
- Source attribution - RAG naturally tracks which chunks contributed to the answer
For most enterprise applications with growing knowledge bases, RAG remains the practical architecture. Long context is a complement - useful for deep analysis of retrieved documents - not a replacement.
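The cost argument can be made concrete with back-of-envelope arithmetic. The $3-per-million-input-tokens price below is an illustrative assumption, not any provider's actual rate, and the token counts are round numbers:

```python
# Illustrative assumption: $3.00 per million input tokens (not a real quote)
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens_per_query: int, queries: int) -> float:
    """Total input-token cost in dollars for a batch of queries."""
    return tokens_per_query * queries * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

# Context-stuffing: ~1,000,000 tokens of documents sent with every query
long_context = input_cost(1_000_000, queries=10_000)
# RAG: 5 chunks x 500 tokens + ~500 tokens of instructions = 3,000 tokens/query
rag = input_cost(3_000, queries=10_000)
```

At these assumed numbers, context-stuffing costs roughly 333x more per query than retrieving five relevant chunks, before even counting the latency difference.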
Implementation Steps
Week 1: Data Preparation
- Inventory your source documents (type, format, volume, update frequency)
- Clean the data (remove duplicates, fix formatting, handle encoding issues)
- Choose and implement a chunking strategy
- Select an embedding model and generate embeddings
- Load embeddings into a vector database
Week 2: Basic Pipeline
- Build the query pipeline (embed query, retrieve, assemble context, generate)
- Create a simple evaluation dataset (50-100 question-answer pairs from your data)
- Test retrieval quality: does the system retrieve the right chunks for each question?
- Test generation quality: does the LLM produce correct answers from the retrieved context?
- Identify failure patterns (wrong chunks retrieved, right chunks but wrong answer, no relevant chunks exist)
Week 3: Optimization
- Tune chunk size and overlap based on failure analysis
- Add metadata filtering (restrict search by document type, date, or category)
- Implement hybrid search if keyword-sensitive queries are failing
- Add reranking if retrieval precision needs improvement
- Optimize prompts based on generation quality analysis
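Of the Week 3 items, metadata filtering is the simplest to sketch: restrict the candidate set to chunks whose metadata matches, then rank the survivors. Field names like `doc_type` are illustrative, and the dot-product scoring assumes the vectors are normalized:

```python
def filtered_search(query_vec, items, k=5, **filters):
    """Brute-force search restricted to chunks whose metadata matches
    all given filters, e.g. doc_type="runbook". Each item is a
    (vector, text, metadata_dict) tuple; vectors assumed normalized,
    so dot product stands in for cosine similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    candidates = [
        (vec, text, meta) for vec, text, meta in items
        if all(meta.get(key) == val for key, val in filters.items())
    ]
    scored = sorted(candidates, key=lambda it: dot(query_vec, it[0]), reverse=True)
    return [(text, meta) for _, text, meta in scored[:k]]
```

Real vector databases apply such filters inside the index rather than after the fact, which matters when the filter is highly selective.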
Week 4: Production Readiness
- Add error handling (what happens when no relevant chunks are found?)
- Implement answer confidence scoring (how sure is the system?)
- Build source citation into the output
- Add logging and monitoring (track retrieval quality, generation quality, latency)
- Implement feedback collection (thumbs up/down on answers)
Common Mistakes
1. Ignoring chunking strategy. Most RAG quality issues trace back to chunking. If chunks don't contain complete, coherent information units, the LLM gets garbage context and produces garbage answers. Spend time here.
2. Using only vector search. Vector similarity misses exact matches. A user searching for error code "ERR_AUTH_403" needs keyword matching, not semantic similarity. Hybrid search solves this.
3. Stuffing too much context. More retrieved chunks isn't always better. Beyond 5-10 relevant chunks, you introduce noise that confuses the LLM. Use reranking to select the best chunks, not just the most similar ones.
4. Not evaluating retrieval separately from generation. When answers are wrong, is it because the system retrieved the wrong information, or because the LLM misinterpreted correct information? These are different problems with different solutions.
5. Forgetting about data freshness. If your source documents update daily but your index updates weekly, you're serving stale answers. Automate re-indexing on a schedule that matches your data's update frequency.
6. No fallback for out-of-scope questions. Users will ask questions your data doesn't cover. The system should recognize this and say "I don't have information about that" rather than hallucinating an answer from the LLM's training data.
Performance Benchmarks
For a well-implemented RAG system, target:
- Retrieval recall@10: 85-95% (the correct chunk is in the top 10 retrieved)
- Answer accuracy: 80-90% (subjective, evaluated by domain experts)
- Latency: 1-3 seconds end-to-end (query to displayed answer)
- Source attribution accuracy: 95%+ (cited sources actually contain the stated information)
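The recall@K metric is straightforward to compute once you have an evaluation set pairing each question with the chunk that should answer it. This sketch assumes chunks are identified by string IDs; the data shape is illustrative:

```python
def recall_at_k(results: list[tuple[list[str], str]], k: int = 10) -> float:
    """Fraction of eval questions whose expected chunk ID appears in
    the top-k retrieved IDs. Each entry is (retrieved_ids, expected_id)."""
    hits = sum(1 for retrieved, expected in results if expected in retrieved[:k])
    return hits / len(results)
```

Tracking this number per release is what turns "retrieval feels worse" into a regression you can actually bisect.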
No amount of LLM prompt engineering compensates for retrieving the wrong information. Fix retrieval first, then optimize generation.
RAG is the foundation for most enterprise AI applications - customer service chatbots, internal knowledge assistants, document Q&A systems, and more. At 1Raft, we build RAG-powered applications for clients across industries, from healthcare knowledge systems to fintech compliance assistants. If you're planning a RAG implementation, our RAG development team can help you get the architecture right from the start.