RAG (Retrieval-Augmented Generation) is one of the most effective techniques for building AI applications on top of your own data. This guide walks you through building a RAG system from scratch: from ingesting documents to answering questions grounded in your data.
What is RAG?
The Problem with Raw LLMs
Language models have limitations:
- Knowledge cutoff: models are trained on data only up to a certain date
- Hallucinations: they can make up facts that sound plausible but are false
- Outdated information: they can't access documents created after training
- Privacy: sending proprietary data to a third-party API carries risk
How RAG Solves It
RAG = Retrieval + Augmentation + Generation
- Retrieval: Find relevant documents matching user’s question
- Augmentation: Add those documents to the prompt
- Generation: Generate answer based on documents
Example: User asks “What’s our refund policy?”
RAG process:
- Search documents for “refund policy”
- Find relevant help article
- Add article to prompt: “Based on this policy: [article text]…”
- LLM generates answer grounded in policy
- User gets accurate, sourced answer
Why RAG Works
| Problem | RAG Solution |
|---|---|
| Hallucinations | Grounds answers in documents |
| Outdated knowledge | Uses your current documents |
| Privacy | Documents stay on your server |
| Attribution | You can cite sources |
| Accuracy | Substantially fewer hallucinations in practice |
Architecture Overview
A RAG system has these components:
User Question
↓
[Vector Database] - Find similar documents
↓
[Document Retrieval] - Get top 3-5 documents
↓
[Prompt Assembly] - Add documents to prompt
↓
[LLM] - Generate answer
↓
Answer with Sources
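The pipeline above can be sketched end to end in a few lines. This is a toy illustration with stubbed components (the function names, knowledge base, and keyword-overlap ranking are placeholders invented here, not a real implementation); the later steps replace each stub with real embeddings, a vector database, and an LLM call.

```python
import re

def _tokens(s):
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(question, top_k=3):
    """Stub: rank documents by naive keyword overlap with the question.
    A real system ranks by embedding similarity (Steps 2-4)."""
    knowledge_base = [
        {"source": "refund_policy.md", "text": "Our refund policy: refunds are issued within 30 days."},
        {"source": "pricing.md", "text": "The Pro plan costs $20 per month."},
    ]
    scored = sorted(
        knowledge_base,
        key=lambda d: -len(_tokens(question) & _tokens(d["text"])),
    )
    return scored[:top_k]

def assemble_prompt(question, docs):
    """Stub: splice retrieved documents into the prompt (Step 5)."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in docs)
    return f"Based on these documents:\n{context}\n\nQuestion: {question}\nAnswer:"

def generate(prompt):
    """Stub: a real system calls an LLM here (Step 6)."""
    return f"(LLM answer grounded in a prompt of {len(prompt)} characters)"

def rag_answer(question):
    docs = retrieve(question)
    prompt = assemble_prompt(question, docs)
    return {"answer": generate(prompt), "sources": [d["source"] for d in docs]}

print(rag_answer("What is the refund policy?"))
```

The shape of the return value (answer plus sources) is the same one the full system produces later.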
Step 1: Prepare Your Documents
Document Collection
Gather documents for your knowledge base:
- PDFs
- Word documents
- Text files
- Website content
- Help articles
- FAQs
Start with 10-50 documents for testing, scale to 1000+ later.
Document Preprocessing
```python
from pypdf import PdfReader
import os

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # extract_text() can return None
    return text

# Extract from all PDFs
documents = []
for filename in os.listdir("./documents"):
    if filename.endswith(".pdf"):
        text = extract_text_from_pdf(f"./documents/{filename}")
        documents.append({
            "title": filename,
            "content": text
        })

print(f"Extracted {len(documents)} documents")
```
Document Chunking
Raw documents are too large. Break into chunks:
```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=200,   # overlap for context
    separators=["\n\n", "\n", " ", ""]
)

chunks = []
for doc in documents:
    doc_chunks = splitter.split_text(doc["content"])
    for chunk in doc_chunks:
        chunks.append({
            "text": chunk,
            "source": doc["title"]
        })

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
```
Chunk Sizing Guidelines
| Size | Pros | Cons |
|---|---|---|
| 256 tokens | Fast retrieval | May miss context |
| 512 tokens | Balanced; good for most use cases | — |
| 1024 tokens | Rich context | Slower retrieval |
| 2048 tokens | Full context | Expensive, slow |
Start with around 512 tokens (roughly 2,000 characters; the 1,000-character setting above is about 250 tokens).
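Since `RecursiveCharacterTextSplitter` measures chunks in characters while sizing guidelines are usually stated in tokens, it helps to convert between the two. The ~4 characters per English token figure below is a common rule of thumb, not an exact count (it varies by language and tokenizer):

```python
# Rough conversion between token budgets and character-based chunk_size.
# Assumes ~4 characters per English token -- a heuristic, not a real tokenizer.

CHARS_PER_TOKEN = 4  # approximate; varies by language and tokenizer

def chars_for_tokens(target_tokens):
    """Character chunk_size that lands near a target token count."""
    return target_tokens * CHARS_PER_TOKEN

def approx_tokens(text):
    """Estimate the token count of a chunk."""
    return len(text) // CHARS_PER_TOKEN

print(chars_for_tokens(512))       # a 512-token target is ~2048 characters
print(approx_tokens("x" * 1000))   # a 1000-character chunk is ~250 tokens
```

For exact counts, a tokenizer library for your embedding model is the reliable option.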
Step 2: Create Embeddings
What Are Embeddings?
Embeddings convert text to numbers (vectors) that computers understand.
Example:
- “The cat sat on the mat” → [0.23, 0.91, -0.34, …, 0.12] (a vector of, say, 1,536 numbers)
- “The dog sat on the floor” → [0.24, 0.89, -0.31, …, 0.14] (a similar vector)
Similar meanings = similar vectors.
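"Similar" is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means identical direction. A tiny illustration with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not real embeddings
cat_mat   = [0.23, 0.91, -0.34]
dog_floor = [0.24, 0.89, -0.31]
invoice   = [-0.80, 0.10, 0.95]

print(cosine_similarity(cat_mat, dog_floor))  # close to 1.0: similar meaning
print(cosine_similarity(cat_mat, invoice))    # negative: unrelated
```

This is the same metric the vector database uses later (`metric="cosine"`).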
Creating Embeddings
Using OpenAI embeddings:
```python
# pip install openai  (the examples use the v1+ SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def create_embeddings(texts):
    """Convert texts to embeddings using OpenAI."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"  # cost-effective
    )
    return [item.embedding for item in response.data]

# Create embeddings for all chunks
# (for very large collections, send the chunks in batches)
chunk_texts = [chunk["text"] for chunk in chunks]
embeddings = create_embeddings(chunk_texts)

# Store chunks with embeddings
for chunk, embedding in zip(chunks, embeddings):
    chunk["embedding"] = embedding
```
Embedding Models Comparison
| Model | Cost | Quality | Speed |
|---|---|---|---|
| OpenAI small | $0.02/1M tokens | Good | Fast |
| OpenAI large | $0.13/1M tokens | Excellent | Slower |
| Cohere | $0.10/1M tokens | Excellent | Fast |
| Open-source | Free (self-hosted) | Good | Medium |
For starting out, use OpenAI’s small model.
Step 3: Set Up Vector Database
Why Vector Database?
Vector databases store embeddings and find similar ones quickly.
Without a vector DB: scan all 10,000 chunks sequentially (slow).
With a vector DB: find similar chunks in milliseconds (fast).
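What the vector database replaces is a brute-force linear scan. A sketch of that slow path (with toy 2-D vectors, not real embeddings) makes the trade-off concrete: every query touches every stored vector, which is O(n) per query, whereas a vector DB's approximate-nearest-neighbor index avoids touching most of them.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def brute_force_search(query_vec, chunk_vectors, top_k=3):
    """Linear scan over every stored vector: O(n) per query.
    A vector DB's ANN index (e.g. HNSW) skips most of this work."""
    scored = ((cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vectors))
    return heapq.nlargest(top_k, scored)

# 10,000 toy 2-D vectors spread around the unit circle
chunk_vectors = [[math.cos(i * 0.001), math.sin(i * 0.001)] for i in range(10_000)]
query = [1.0, 0.0]

top = brute_force_search(query, chunk_vectors, top_k=3)
print(top)  # (score, index) pairs nearest to the query direction
```

At a few thousand chunks the scan is fine; at millions, the index pays for itself.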
Popular Vector Databases
| Database | Easy to start | Power | Open-source | Managed cloud |
|---|---|---|---|---|
| Pinecone | Yes | Medium | No | Yes |
| Weaviate | Yes | High | Yes | Yes |
| Milvus | No | High | Yes | Yes (Zilliz Cloud) |
| PostgreSQL pgvector | Yes | Medium | Yes | Yes |
Using Pinecone (Easiest)
```python
# pip install pinecone-client
from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index (one-time)
pc.create_index(
    name="my-rag-index",
    dimension=1536,  # text-embedding-3-small dimension
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("my-rag-index")

# Upsert chunks with embeddings
vectors_to_upsert = []
for i, chunk in enumerate(chunks):
    vectors_to_upsert.append({
        "id": f"chunk-{i}",
        "values": chunk["embedding"],
        "metadata": {
            "text": chunk["text"],
            "source": chunk["source"]
        }
    })

# Upload in batches
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i + batch_size]
    index.upsert(vectors=batch)

print(f"Upserted {len(vectors_to_upsert)} vectors")
```
Step 4: Create Retrieval System
Search Function
```python
from openai import OpenAI

client = OpenAI()

def retrieve_relevant_chunks(query, top_k=3):
    """Find the chunks most relevant to the query."""
    # Create an embedding for the query
    query_response = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embedding = query_response.data[0].embedding

    # Search the vector DB
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Extract chunks
    retrieved_chunks = []
    for match in results.matches:
        retrieved_chunks.append({
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score  # relevance score (0-1 for cosine)
        })
    return retrieved_chunks

# Test retrieval
query = "What's your refund policy?"
results = retrieve_relevant_chunks(query)
for result in results:
    print(f"Source: {result['source']}")
    print(f"Relevance: {result['score']:.2f}")
    print(f"Text: {result['text'][:200]}...\n")
```
Step 5: Build Prompt Assembly
Crafting the Prompt
The prompt structure is critical:
```python
def build_prompt(query, retrieved_chunks):
    """Build the user prompt from retrieved context."""
    # System message (sent separately in the chat API call)
    system_message = """You are a helpful assistant. Answer questions based on the provided documents.
If you can't find the answer in the documents, say "I don't have information about that."
Always cite which document you're using."""

    # Context from retrieved chunks
    context = "RELEVANT DOCUMENTS:\n"
    for i, chunk in enumerate(retrieved_chunks):
        context += f"\n[{i+1}] From {chunk['source']}:\n{chunk['text']}\n"

    # User prompt: context first, question at the end
    prompt = f"""{context}
QUESTION: {query}

ANSWER:"""
    return prompt, system_message

# Example (the variable is named `retrieved` so it doesn't overwrite
# the global `chunks` list from Step 1)
query = "What's your refund policy?"
retrieved = retrieve_relevant_chunks(query)
prompt, system_msg = build_prompt(query, retrieved)
print("Generated Prompt:")
print(prompt)
```
Prompt Quality Matters
Good prompt:
- Clear instructions
- Relevant context
- Question at end
- Space for answer
Bad prompt:
- Irrelevant documents
- Confusing instructions
- Ambiguous question
Step 6: Generate Answers
Using OpenAI
```python
from openai import OpenAI

client = OpenAI()

def answer_question(query):
    """Answer a question using RAG."""
    # Step 1: Retrieve
    retrieved = retrieve_relevant_chunks(query, top_k=3)

    # Step 2: Build prompt
    prompt, system_msg = build_prompt(query, retrieved)

    # Step 3: Generate
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ],
        temperature=0.5,  # lower = more focused on the documents
        max_tokens=500
    )
    answer = response.choices[0].message.content

    return {
        "question": query,
        "answer": answer,
        "sources": [chunk["source"] for chunk in retrieved]
    }

# Test
result = answer_question("What's your refund policy?")
print(f"Q: {result['question']}")
print(f"A: {result['answer']}")
print(f"Sources: {', '.join(result['sources'])}")
```
Using Open-Source Models
```python
from transformers import pipeline

# Load a model locally
qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-large"
)

def answer_question_local(query):
    """Answer using a local model."""
    retrieved = retrieve_relevant_chunks(query)
    prompt, _ = build_prompt(query, retrieved)
    result = qa_pipeline(prompt, max_length=500)
    return result[0]["generated_text"]
```
Step 7: Build Web Interface
Simple Flask App
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    """API endpoint to ask questions."""
    data = request.json
    query = data.get("query")
    if not query:
        return jsonify({"error": "No query provided"}), 400
    result = answer_question(query)
    return jsonify(result)

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True, port=5000)
```
HTML Frontend
```html
<!DOCTYPE html>
<html>
<head>
  <title>RAG Assistant</title>
  <style>
    body { font-family: Arial; max-width: 800px; margin: 50px auto; }
    input { width: 100%; padding: 10px; font-size: 16px; }
    button { padding: 10px 20px; font-size: 16px; cursor: pointer; }
    .response { margin-top: 20px; padding: 15px; background: #f0f0f0; }
    .sources { margin-top: 10px; color: #666; font-size: 14px; }
  </style>
</head>
<body>
  <h1>Ask Me Anything</h1>
  <input type="text" id="query" placeholder="Ask your question...">
  <button onclick="ask()">Ask</button>
  <div id="result"></div>

  <script>
    function ask() {
      const query = document.getElementById("query").value;
      fetch("/ask", {
        method: "POST",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify({query})
      })
        .then(r => r.json())
        .then(data => {
          document.getElementById("result").innerHTML = `
            <div class="response">
              <h3>Answer:</h3>
              <p>${data.answer}</p>
              <div class="sources">
                Sources: ${data.sources.join(", ")}
              </div>
            </div>
          `;
        });
    }
  </script>
</body>
</html>
```
Step 8: Evaluation and Iteration
Testing RAG Quality
```python
test_questions = [
    ("What's the refund policy?", "refund_policy.md"),
    ("How do I reset my password?", "help_articles.md"),
    ("What's your pricing?", "pricing_page.pdf")
]

results = []
for question, expected_source in test_questions:
    result = answer_question(question)
    # Check whether the expected source was used
    correct = expected_source in result["sources"]
    results.append({
        "question": question,
        "correct_source": correct,
        "answer": result["answer"]
    })

# Summary
correct_count = sum(1 for r in results if r["correct_source"])
accuracy = correct_count / len(results) * 100
print(f"Source accuracy: {accuracy:.0f}%")
```
Common Issues and Fixes
| Issue | Cause | Solution |
|---|---|---|
| Wrong sources retrieved | Bad chunk size | Adjust chunk size (try 512 tokens) |
| Hallucinating despite documents | Poor prompt | Improve system message |
| Slow retrieval | Too many chunks | Filter or pre-process |
| Expensive embeddings | Creating too many | Batch embedding creation |
| Low quality answers | Bad documents | Improve document quality |
Iterative Improvement
- Monitor failures: Which questions answered poorly?
- Analyze root cause: Wrong documents? Bad prompt?
- Improve documents: Add missing information
- Improve chunks: Adjust size/overlap
- Improve prompts: Better instructions
- Repeat: Monthly evaluation
Production Considerations
Caching
Cache common queries to reduce costs:
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def answer_question_cached(query):
    return answer_question(query)
```
Rate Limiting
Prevent abuse:
```python
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address, app=app)

@app.route("/ask", methods=["POST"])
@limiter.limit("100 per hour")
def ask():
    # ...
```
Monitoring
Track metrics:
- Questions per day
- Average latency
- Cost per query
- User satisfaction
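The metrics above can start as a minimal in-memory tracker before you wire up real tooling. This is an illustrative sketch (the class name and fields are my own); a production system would export these to a monitoring stack such as Prometheus or Datadog instead of keeping them in process memory.

```python
import time
from collections import defaultdict

class RAGMetrics:
    """Minimal in-memory metrics tracker for a RAG service.
    Illustrative only -- data is lost on restart."""

    def __init__(self):
        self.latencies = []
        self.costs = []
        self.questions_per_day = defaultdict(int)

    def record(self, latency_seconds, cost_usd):
        """Record one answered question."""
        self.latencies.append(latency_seconds)
        self.costs.append(cost_usd)
        day = time.strftime("%Y-%m-%d")
        self.questions_per_day[day] += 1

    def summary(self):
        """Aggregate stats for dashboards or logs."""
        n = len(self.latencies)
        return {
            "total_questions": n,
            "avg_latency_s": sum(self.latencies) / n if n else 0.0,
            "avg_cost_usd": sum(self.costs) / n if n else 0.0,
        }

metrics = RAGMetrics()
metrics.record(latency_seconds=1.2, cost_usd=0.002)
metrics.record(latency_seconds=0.8, cost_usd=0.001)
print(metrics.summary())
```

Calling `metrics.record(...)` at the end of the `/ask` handler is enough to get started.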
Cost Estimates
Monthly Costs (1000 questions/day)
| Component | Cost |
|---|---|
| Embeddings (retrieval) | $30 |
| LLM inference (answers) | $15 |
| Vector DB hosting | $10-50 |
| Total | $55-95 |
Costs scale roughly linearly with query volume.
Conclusion
RAG is the most practical way to build AI applications using your data. The pattern is: retrieve → augment → generate. Your documents stay private, answers are grounded in reality, and users get sources. Start with 10-50 documents, test retrieval quality, and expand from there. The iterative loop of evaluating failures and improving documents is where you’ll achieve production-grade systems. RAG isn’t replacing LLMs—it’s the pattern that makes LLMs actually useful for real-world applications.