Hernando Abella's Website

Large Language Models only know what they were trained on. They cannot access your company's documents, PDFs, or internal knowledge bases unless that information is supplied at query time, which is what a RAG system does.

RAG combines information retrieval with AI generation, allowing a model to search relevant documents and use that information when generating responses. In this guide, you'll learn how RAG works and build your own application with Python.

What Is RAG?

RAG stands for Retrieval-Augmented Generation. Instead of asking an AI model to answer directly, a RAG system first searches a knowledge base for relevant information and passes what it finds to the model as context.

❓ User Question: "What is the refund policy?"
→ 📊 Question Embedding: convert to vector
→ 🔍 Similarity Search: find relevant chunks
→ 📄 Relevant Chunks: retrieved context
→ 🧠 LLM: generate answer
→ ✨ Final Answer: grounded response

Example:

Question: "What is our company's refund policy?"

Instead of guessing, the system searches the company documents, finds the refund policy, sends it to the AI as context, and generates an answer based on it.

❌ Without RAG

Question → LLM → Answer

• Hallucinations

• Outdated information

• No access to private data

✅ With RAG

Question → Retriever → Documents → LLM → Answer

• More accurate responses

• Access to private knowledge

• Reduced hallucinations

Core Components of a RAG System

Documents → Chunking → Embeddings → Vector Database → Retriever → LLM

Step 1: Prepare Documents

Every RAG system begins with data. Examples include:

• PDFs

• Documentation

• Knowledge bases

• Product manuals

• Support articles

• Company policies

Step 2: Split Documents into Chunks

Large documents must be divided into smaller pieces for efficient retrieval.

python · chunking.py

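def chunk_text(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size characters."""
    chunks = []
    # Step through the text in chunk_size strides; the last chunk may be shorter.
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks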

Why Chunking Matters:

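For example: 100-page document → 500 chunks → Search only relevant chunks. Each query then pulls in just the few chunks that match it, which makes retrieval more precise and keeps the prompt sent to the model short.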
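
Fixed-size chunks can cut a sentence in half at a chunk boundary. One common mitigation is to let neighboring chunks overlap. The sketch below is a minimal illustration of that idea, not part of the original `chunk_text`; the function name and the `overlap` parameter are assumptions made for this example.

```python
def chunk_text_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks where neighbors share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_text_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap preserves more context across boundaries but stores more duplicated text, so the value is worth tuning on your own documents.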

Step 3: Generate Embeddings

python · embeddings.py

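from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Request an embedding for a single piece of text.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Python is a programming language."
)

# The embedding is a list of floats; its length is the vector dimension.
embedding = response.data[0].embedding
print(f"Vector dimension: {len(embedding)}")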
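
Embeddings are useful because texts with similar meaning produce vectors that point in similar directions. A common way to measure this is cosine similarity. The sketch below uses tiny hand-written vectors as stand-ins for real embeddings, so it runs without an API key; the function name is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, not real model output.
print(round(cosine_similarity([1.0, 1.0], [1.0, 0.0]), 4))  # 0.7071
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 mean the vectors point the same way; values near 0 mean the texts are unrelated as far as the embedding model can tell.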

Step 4: Store Embeddings in a Vector Database

Popular options: Chroma, FAISS, Pinecone, Weaviate, Qdrant

terminal

pip install chromadb

python · vectorstore.py

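import chromadb

# In-memory client: data is lost when the process exits.
client = chromadb.Client()
collection = client.create_collection(name="knowledge_base")

# Add documents; Chroma embeds them with its default embedding function.
collection.add(
    documents=[
        "Python is a programming language.",
        "Machine learning uses data."
    ],
    ids=["1", "2"]
)

# Search: the query text is embedded and compared against the stored vectors.
results = collection.query(
    query_texts=["How is Python used?"],
    n_results=2
)
# results["documents"] holds one list of matches per query text.
print(results["documents"])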

Step 5: Retrieve Relevant Documents

When a user asks a question, the system:

• Creates an embedding for the query

• Searches the vector database

• Finds the most similar chunks
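
The three steps above can be sketched without any external service. In the example below, `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), and the chunks are made-up sample text. A real system would call an embedding API and a vector database instead.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the question."""
    query_vector = toy_embed(question)                               # 1. embed the query
    scored = [(cosine(query_vector, toy_embed(c)), c) for c in chunks]  # 2. search
    scored.sort(key=lambda pair: pair[0], reverse=True)              # 3. most similar first
    return scored[:k]


# Made-up sample chunks for illustration.
sample_chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
]
for score, chunk in retrieve("How many days for refunds?", sample_chunks):
    print(f"{score:.2f}  {chunk}")
```

Word counts only match exact words, which is why real systems use learned embeddings that also capture synonyms and paraphrases.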

Step 6: Send Context to the LLM

python · generate.py

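from openai import OpenAI

client = OpenAI()

# Example values; in a real pipeline the context comes from the retriever
# and the question comes from the user.
context = "Refunds are available within 30 days of purchase."
question = "What is our company's refund policy?"

prompt = f"""
Context:
{context}

Question:
{question}

Answer using only the provided context.
"""

response = client.responses.create(
    model="gpt-4o",
    input=prompt
)

print(response.output_text)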
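
In practice the context is usually several retrieved chunks rather than one string. The helper below is a sketch of one way to assemble them, assuming a simple character budget and numbered chunks; `build_prompt` and `max_context_chars` are names invented for this example, and the added fallback instruction is a suggestion rather than part of the prompt above.

```python
def build_prompt(question, chunks, max_context_chars=2000):
    """Number the retrieved chunks, stop at the size budget, and build the prompt."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # keep the prompt within the budget; later chunks are dropped
        parts.append(entry)
        used += len(entry)

    context = "\n\n".join(parts)
    return (
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer using only the provided context. "
        "If the context does not contain the answer, say you don't know."
    )


print(build_prompt("What is the refund policy?", ["Refunds within 30 days."]))
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports each statement.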

Full RAG Pipeline

Documents → Chunking → Embeddings → Vector Database

↓

User Question → Question Embedding → Similarity Search → Relevant Chunks → LLM → Final Answer

Example Project Structure

rag-project/
├── data/
│   └── docs/
│       ├── guide.pdf
│       └── policies.txt
├── embeddings/
│   └── build_embeddings.py
├── vectorstore/
│   └── chroma_db/
├── rag/
│   ├── retrieve.py
│   ├── generate.py
│   └── pipeline.py
├── app.py
└── requirements.txt
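
As a rough idea of what `rag/pipeline.py` could contain, the sketch below wires a retriever and a generator together. It is an assumption about how the file might look, not the author's implementation: `run_pipeline` is a hypothetical name, and the two callables stand in for whatever `retrieve.py` and `generate.py` expose.

```python
from typing import Callable, List


def run_pipeline(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve chunks for the question, build a prompt, and return the model's answer."""
    chunks = retrieve(question)
    context = "\n\n".join(chunks)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer using only the provided context."
    )
    return generate(prompt)


# Stand-in functions so the sketch runs without a vector database or an API key.
def fake_retrieve(question: str) -> List[str]:
    return ["Refunds are issued within 30 days."]


def fake_generate(prompt: str) -> str:
    return f"(model answer based on a {len(prompt)}-character prompt)"


print(run_pipeline("What is the refund policy?", fake_retrieve, fake_generate))
```

Passing the retriever and generator in as arguments keeps the pipeline easy to test and lets you swap the vector database or the model without touching this file.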

Improving Retrieval Quality

📦 Better Chunking

Paragraph-based or semantic chunks instead of fixed sizes.
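
A minimal sketch of paragraph-based chunking, assuming paragraphs are separated by blank lines. Paragraphs are merged until a size limit would be exceeded; `chunk_by_paragraph` and `max_chars` are names chosen for this example.

```python
import re


def chunk_by_paragraph(text, max_chars=500):
    """Group whole paragraphs into chunks of at most max_chars where possible."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if not current or len(candidate) <= max_chars:
            # An oversized single paragraph is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
print(chunk_by_paragraph(sample, max_chars=25))
# ['AAAAAAAAAA\n\nBBBBBBBBBB', 'CCCCCCCCCC']
```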

🏷️ Metadata Filtering

Store source, department, date — filter before search.
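
The idea can be sketched in plain Python: keep metadata next to each chunk and narrow the candidates before running the similarity search. The record layout and `filter_by_metadata` are assumptions for this example, and the sample records are made up. Vector databases typically offer this as a query-time filter, for instance the `where` argument of Chroma's `query`.

```python
def filter_by_metadata(records, **required):
    """Keep records whose metadata matches every required key/value pair."""
    return [
        record
        for record in records
        if all(record["metadata"].get(key) == value for key, value in required.items())
    ]


# Made-up sample records for illustration.
records = [
    {
        "text": "Refunds are issued within 30 days.",
        "metadata": {"department": "support", "source": "policies.txt"},
    },
    {
        "text": "Quarterly budget planning notes.",
        "metadata": {"department": "finance", "source": "guide.pdf"},
    },
]

support_only = filter_by_metadata(records, department="support")
print([record["text"] for record in support_only])  # ['Refunds are issued within 30 days.']
```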

🔀 Hybrid Search

Combine vector search + keyword search for accuracy.
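
One widely used way to merge the two result lists is reciprocal rank fusion (RRF), which scores each document by its rank in every list. The sketch below assumes each search already returned an ordered list of document IDs; the IDs are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_results = ["a", "b", "c"]   # ordered by embedding similarity
keyword_results = ["b", "d", "a"]  # ordered by keyword match

print(reciprocal_rank_fusion([vector_results, keyword_results]))  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists rise to the top, while a document found by only one method still stays in the results.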

Common Challenges

📏 Poor Chunk Sizes

Too large = low precision. Too small = missing context.

🎭 Hallucinations

The model may still invent facts, so instruct it to answer only from the provided context.
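
Besides the prompt instruction, one extra guard is to skip the model call entirely when retrieval finds nothing relevant. The sketch below assumes the retriever returns `(score, chunk)` pairs. `answer_or_refuse`, the `generate` callable, and the 0.3 threshold are all hypothetical and would need tuning for a real embedding model.

```python
NO_ANSWER = "I could not find this in the provided documents."


def select_context(scored_chunks, min_score=0.3):
    """Keep only chunks whose similarity score clears the threshold."""
    return [chunk for score, chunk in scored_chunks if score >= min_score]


def answer_or_refuse(question, scored_chunks, generate, min_score=0.3):
    """Call the model only when at least one chunk is relevant enough."""
    context = select_context(scored_chunks, min_score)
    if not context:
        return NO_ANSWER  # refuse instead of letting the model guess
    return generate(question, context)


# Stand-in for a real LLM call so the sketch runs offline.
def fake_generate(question, context):
    return f"Based on the documents: {context[0]}"


print(answer_or_refuse("Refund policy?", [(0.9, "Refunds within 30 days.")], fake_generate))
print(answer_or_refuse("Stock price?", [(0.05, "Refunds within 30 days.")], fake_generate))
```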

🔄 Duplicate Results

Multiple chunks often carry similar information; use reranking to prioritize the most relevant, distinct ones.
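
A simpler complement to reranking is to drop near-duplicate chunks before building the prompt. This is not a reranker. It is a sketch that compares word sets with Jaccard similarity and assumes the chunks arrive ordered by relevance; the function names and the 0.8 threshold are assumptions for illustration.

```python
import re


def _tokens(text):
    """Lowercase word set used for the overlap comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words the two texts have in common (0.0 to 1.0)."""
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def drop_near_duplicates(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, previous) < threshold for previous in kept):
            kept.append(chunk)
    return kept


# Made-up sample chunks, ordered by relevance.
retrieved = [
    "Refunds are issued within 30 days.",
    "Refunds are issued within 30 days",
    "Shipping takes five days.",
]
print(drop_near_duplicates(retrieved))
# ['Refunds are issued within 30 days.', 'Shipping takes five days.']
```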

Real-World RAG Use Cases

🎧 Customer Support

Search product documentation before answering.

🏢 Enterprise KB

Access internal company documents.

⚖️ Legal Research

Retrieve contracts and regulations.

🏥 Medical Systems

Search approved clinical documentation.

📚 Educational Platforms

Answer questions from course materials.

🔍 AI Search Engines

Combine retrieval with natural language responses.

Key Takeaways

→ RAG combines document retrieval with AI generation.
→ Documents are split into chunks and converted into embeddings.
→ Embeddings are stored in a vector database for fast similarity search.
→ Retrieved documents are sent to the LLM as context.
→ The model generates answers grounded in real information.

A well-designed RAG system is often one of the most practical and impactful AI applications you can build. It allows organizations to transform their documents into intelligent assistants that deliver accurate, context-aware answers on demand.


Building a RAG System with Python: Step by Step