Building a RAG Chatbot: From Zero to Production

📅 March 24, 2026 ⏱ 10 min read ✍️ Leo Voss

Most AI chatbots have a fundamental problem: they only know what they were trained on. Ask a generic LLM about your company's refund policy, your product specs, or last quarter's internal report — and you'll get either a confident hallucination or an honest "I don't know." Neither is useful.

Retrieval-Augmented Generation (RAG) solves this. A RAG chatbot dynamically fetches relevant context from your own documents before generating an answer, grounding every response in real, up-to-date information. The result is a knowledge base chatbot that actually knows your business — and can cite its sources.

This tutorial walks you through building a production-ready RAG chatbot from scratch. We'll cover the architecture, the code, the gotchas, and what it actually takes to deploy something your users will trust.

💡 This is the exact architecture behind AskBase — our AI chatbot project that turns static documentation and knowledge bases into interactive Q&A systems. You can also explore the open-source implementation on GitHub.

What Is Retrieval-Augmented Generation?

RAG combines two distinct systems: a retrieval system (finds relevant documents) and a generation system (an LLM that synthesizes an answer). When a user asks a question, the pipeline runs in three steps:

  1. Embed the query — convert the question into a vector (a numerical representation of its meaning)
  2. Retrieve relevant chunks — search your vector database for document chunks semantically similar to the query
  3. Generate a grounded answer — pass the retrieved context + original question to an LLM and ask it to answer based only on that context

The magic is in step 2. Instead of relying on the LLM's parametric memory (what it learned during training), you're injecting fresh, specific knowledge at inference time. The LLM becomes a reasoning engine on top of your data rather than a standalone source of memorized, possibly stale facts.
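To make "semantically similar" concrete, here's a toy illustration with made-up 3-dimensional vectors. Real embeddings have on the order of 1,500 dimensions, but the similarity math is identical:

from math import sqrt

def cosine_similarity(a, b):
    # angle-based similarity: values near 1.0 mean the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical embeddings: the query and the refund chunk point in similar directions
query         = [0.9, 0.1, 0.3]   # "How do refunds work?"
refund_chunk  = [0.8, 0.2, 0.35]  # chunk about the refund policy
pricing_chunk = [0.1, 0.9, 0.2]   # chunk about pricing tiers

print(cosine_similarity(query, refund_chunk))   # high score -> retrieved
print(cosine_similarity(query, pricing_chunk))  # low score  -> skipped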

The Stack: What You'll Need

🔍 Retrieval Layer

  • Embeddings: OpenAI text-embedding-3-small
  • Vector store: ChromaDB (local) or Pinecone (cloud)
  • Chunking: LangChain text splitters
  • Reranking: Cohere Rerank (optional)

🤖 Generation Layer

  • LLM: GPT-4o or Claude 3.5 Sonnet
  • Orchestration: LangChain or LlamaIndex
  • Prompt: system + context + user query
  • Streaming: for responsive UX

Step 1: Ingest and Index Your Documents

Before you can retrieve anything, you need to process your documents into searchable vector chunks. This is the indexing pipeline — run it once (and re-run when content changes).

1 Install dependencies

Start with a clean virtual environment and install the core libraries.

pip install langchain langchain-openai langchain-community chromadb tiktoken pypdf

2 Load, chunk, and embed your documents

Split documents into overlapping chunks — typically 500–1000 tokens — so each chunk is small enough to be retrieved precisely but large enough to contain meaningful context.

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os

# Load documents from a directory (PDFs here; swap in other loaders for markdown or txt)
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks; from_tiktoken_encoder measures chunk_size in tokens, matching the guidance above
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,
    chunk_overlap=150,        # overlap prevents losing context at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

# Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
vectorstore.persist()  # persistence is automatic with chromadb >= 0.4; this call only matters on older versions
print("Knowledge base indexed and saved.")

Step 2: Build the Retrieval + Generation Pipeline

With your knowledge base indexed, the query pipeline is straightforward. On each user message: embed the query, retrieve top-k similar chunks, then pass them to the LLM with a grounding prompt.

from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
import os

# Load existing vectorstore
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}   # retrieve top 5 most relevant chunks
)

# The prompt is critical: tell the LLM to stay grounded
SYSTEM_PROMPT = """You are a helpful assistant for our knowledge base.
Answer the user's question based ONLY on the following context.
If the answer is not in the context, say "I don't have information about that in my knowledge base."
Do not make up information or draw on outside knowledge.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("human", "{question}")
])

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,   # low temperature keeps answers consistent and reduces hallucination risk
    streaming=True
)

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in docs
    )

# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

def ask(question: str) -> str:
    return rag_chain.invoke(question)

# Test it
answer = ask("What is the refund policy for enterprise customers?")
print(answer)

Step 3: Add a Conversational Memory Layer

A single Q&A is useful, but a real chatbot needs to handle follow-up questions. "What about for pro customers?" means nothing without the previous context. Add conversation history to the retrieval step.

from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5    # remember last 5 exchanges
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    verbose=False
)

def chat(question: str) -> str:
    result = conversational_chain.invoke({"question": question})
    return result["answer"]

# Multi-turn conversation
print(chat("What is the refund policy for enterprise customers?"))
print(chat("What about for the starter plan?"))   # follows up correctly
print(chat("And how long does the process take?"))  # still coherent

Step 4: Serve It as an API

Wrap the chain in a FastAPI endpoint so any frontend — web app, Slack bot, or mobile app — can connect to it. (You'll also need pip install fastapi uvicorn, which weren't part of the Step 1 install.)

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    session_id: str = "default"

# In production: store per-session memory in Redis or a database
session_memories: dict = {}

@app.post("/chat")
async def chat_endpoint(req: QueryRequest):
    if req.session_id not in session_memories:
        session_memories[req.session_id] = ConversationBufferWindowMemory(
            memory_key="chat_history",
            return_messages=True,
            k=5
        )
    
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=session_memories[req.session_id]
    )
    
    result = chain.invoke({"question": req.question})
    return {"answer": result["answer"]}

@app.get("/health")
def health():
    return {"status": "ok"}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
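The stack list above calls out streaming for responsive UX. One way to add it, sketched here, is a second endpoint that streams tokens from the stateless rag_chain built in Step 2 (no per-session memory on this path), assuming the chain and the FastAPI app live in the same module:

# Sketch of a streaming endpoint using the LCEL rag_chain from Step 2.
# astream() yields text chunks because the chain ends in StrOutputParser.
@app.post("/chat/stream")
async def chat_stream(req: QueryRequest):
    async def token_generator():
        async for chunk in rag_chain.astream(req.question):
            yield chunk
    return StreamingResponse(token_generator(), media_type="text/plain")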

Production Considerations

Chunk Size Tuning

There's no universal ideal chunk size. 800 tokens works well for dense technical documentation. For FAQs or short articles, 300–400 tokens may perform better. The key metric is retrieval precision: are the chunks returned actually relevant to the query? Evaluate this manually on a test set before going live.
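A lightweight way to run that check, sketched with hypothetical test cases against the retriever from Step 2 (swap in questions and expected source files from your own corpus):

# Minimal retrieval-precision check: for each test question, does at least one
# retrieved chunk come from the document we expect? (Test data is hypothetical.)
test_cases = [
    {"question": "What is the refund policy for enterprise customers?",
     "expected_source": "refund-policy.pdf"},
    {"question": "How do I rotate my API keys?",
     "expected_source": "security-guide.pdf"},
]

hits = 0
for case in test_cases:
    docs = retriever.invoke(case["question"])
    sources = [d.metadata.get("source", "") for d in docs]
    if any(case["expected_source"] in s for s in sources):
        hits += 1
    else:
        print(f"MISS: {case['question']} -> {sources}")

print(f"Retrieval hit rate: {hits}/{len(test_cases)}")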

Handling Retrieval Failures

Your retriever will sometimes return chunks that aren't actually relevant — especially on out-of-domain questions. Always instruct the LLM to say "I don't have that information" rather than hallucinating. Add a confidence threshold: if the cosine similarity of the best retrieved chunk is below ~0.75, skip generation and return a fallback message.
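One way to wire up that threshold, sketched with LangChain's similarity_search_with_relevance_scores (a normalized 0-to-1 relevance score stands in for raw cosine similarity here, and the 0.75 cutoff is a starting point to tune on your own queries). It assumes the vectorstore and rag_chain from Step 2 are in scope:

# Fallback when retrieval confidence is low. Relevance scores are normalized to
# [0, 1], higher = more similar. The 0.75 cutoff is an assumption; tune it.
CONFIDENCE_THRESHOLD = 0.75
FALLBACK = "I don't have information about that in my knowledge base."

def ask_with_fallback(question: str) -> str:
    results = vectorstore.similarity_search_with_relevance_scores(question, k=5)
    if not results or results[0][1] < CONFIDENCE_THRESHOLD:
        return FALLBACK          # skip generation entirely
    return rag_chain.invoke(question)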

Keeping the Knowledge Base Fresh

Documents change. Build an update pipeline: when a file changes, delete its old chunks from the vector store (filter by source metadata) and re-embed the updated version. For high-update-frequency content, consider nightly re-indexing of the entire corpus.
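A sketch of that update step for the ChromaDB setup above, assuming the loader, splitter, and vectorstore from Step 1 are in scope and that each chunk's source metadata is the file path (which PyPDFLoader sets by default):

# Re-index a single changed file: delete its old chunks, then embed the new version.
def reindex_file(path: str):
    # find and remove existing chunks for this file via the source metadata filter
    existing = vectorstore.get(where={"source": path})
    if existing["ids"]:
        vectorstore.delete(ids=existing["ids"])

    # load, split, and embed the updated document
    new_docs = PyPDFLoader(path).load()
    new_chunks = splitter.split_documents(new_docs)
    vectorstore.add_documents(new_chunks)
    print(f"Re-indexed {path}: {len(new_chunks)} chunks")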

Cost at Scale

Embedding costs are tiny — text-embedding-3-small costs $0.02 per million tokens. A 500-page PDF is roughly 250,000 tokens to index, costing $0.005. The real cost is generation: GPT-4o at $2.50/M input tokens. With 5 retrieved chunks averaging 800 tokens each, plus the question, each query uses ~4,500 input tokens ≈ $0.011. At 1,000 queries/day that's ~$11/day — well within budget for most business applications.
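The same arithmetic as a snippet, if you want to plug in your own volumes (output tokens are billed separately and not included here):

# Back-of-the-envelope query cost using the prices quoted above (verify current rates)
GPT4O_INPUT_PRICE = 2.50 / 1_000_000   # $ per input token
CHUNKS_PER_QUERY = 5
TOKENS_PER_CHUNK = 800
OVERHEAD_TOKENS = 500                   # question + system prompt, rough assumption

tokens_per_query = CHUNKS_PER_QUERY * TOKENS_PER_CHUNK + OVERHEAD_TOKENS
cost_per_query = tokens_per_query * GPT4O_INPUT_PRICE
print(f"~{tokens_per_query} input tokens per query -> ${cost_per_query:.4f}")
print(f"1,000 queries/day -> ${cost_per_query * 1000:.2f}/day")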

💡 We built exactly this architecture for AskBase — a production RAG chatbot system for teams that need their documentation to talk back. The full source is available on GitHub, including the FastAPI backend, ChromaDB integration, and a React frontend with streaming support.

What Makes a RAG Chatbot Actually Good?

The technology is the easy part. Most RAG chatbots fail not because of bad code, but because of poor content quality and inadequate testing. Before you declare your chatbot production-ready, test it against real user questions: confirm that retrieval surfaces the right documents, that answers stay grounded in the retrieved context, and that out-of-scope questions get the fallback response rather than a guess.

If you're building customer-facing AI and want to go deeper on automation workflows, check out our Python email classification tutorial for another practical AI integration. For monitoring prices and data pipelines that could feed your knowledge base, see our web scraping guide.

Ready to Build Your Own Knowledge Base Chatbot?

We design and deploy production RAG systems — from document ingestion pipelines to polished chat interfaces — tailored to your content and your users.

Talk to Leo → View on GitHub →