How to Build a RAG System Locally
Building a RAG System Locally using Ollama and PostgreSQL with PgVector
What is a RAG system?
A Retrieval-Augmented Generation (RAG) system combines the capabilities of a large language model with a retrieval component that can fetch relevant documents or passages from a corpus. This powerful combination allows the language model to generate fluent and informed responses by not only relying on its trained knowledge, but also retrieving and referring to factual information from the supplied documents.
How does it work?
A RAG system is composed of two main components: a retrieval engine and a large language model.
First, when a user provides a query or prompt to the system, the retrieval engine searches through a corpus (collection) of documents to find relevant passages or information related to the query. This is typically done using semantic search or vector similarity techniques to rank the documents based on their relevance to the query.
The top-ranked documents are then formatted into a context window or memory that can be consumed by the large language model. This context window provides the language model with relevant background information and facts from the retrieved documents.
Next, the language model takes the user’s original query along with the context window as input. By combining its own trained knowledge with the supplementary information from the retrieved documents, the language model can generate a fluent and informative response to the query.
The generated response draws upon both the language model’s understanding of the query topic and the factual details found in the relevant documents. This allows the system to provide comprehensive answers that not only leverage the model’s capabilities but also incorporate specific evidence and data from the corpus.
Finally, the system outputs the language model’s generated response as the final answer to the user’s original query.
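The retrieve-then-generate flow described above can be sketched in a few lines of dependency-free Python. The keyword-overlap scorer and the string-formatting "model" below are stand-ins for illustration only; the real system built later in this guide uses vector embeddings and an actual LLM.

```python
def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (a stand-in
    for real semantic search over embeddings)."""
    query_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: build the prompt the model would receive."""
    context_block = "\n".join(context)
    return f"Context:\n{context_block}\nQuestion: {query}\nAnswer: ..."

corpus = [
    "The Transformer architecture was introduced in 2017.",
    "PostgreSQL is a relational database.",
    "pgvector adds vector similarity search to PostgreSQL.",
]
docs = retrieve("When was the Transformer introduced?", corpus)
print(generate("When was the Transformer introduced?", docs))
```

The only thing the real system changes is the quality of each stage: embeddings replace keyword overlap, and an LLM replaces the string template.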
Why build a RAG locally?
Having a RAG system running locally is beneficial for several reasons. First, it allows you to experiment and tinker with the system within your own environment, without relying on external services or APIs. This can be particularly useful for testing, debugging, or customizing the system to meet specific requirements. Additionally, a local RAG system can provide improved privacy and data security, as sensitive information remains within your controlled infrastructure. Furthermore, running the system locally can offer potential performance advantages, reducing latency and eliminating dependence on external network conditions.
Let’s get started!
To build the RAG system locally, as mentioned earlier, we need two main components: a Large Language Model (LLM) and a retrieval engine. Our first step is setting up an LLM to run on our local machine.
The LLM
We are building this system on a personal computer or basic workstation, so we need an LLM that is relatively lightweight in terms of resource requirements. To accomplish this, we will use Ollama.
While Ollama can take advantage of GPU acceleration, it doesn't strictly require specialized GPUs designed for deep learning workloads. This makes Ollama an ideal choice for our local RAG system, as it can run efficiently without demanding high-end hardware.
Ollama is a tool that allows users to run large language models (LLMs) locally on their computers. It simplifies the process of managing and serving models locally, giving users greater control and flexibility in their AI projects. Ollama supports a variety of models, including Llama 2, Mistral, Gemma, and other large language models.
To get started with Ollama, head over to their website at https://ollama.com and follow the provided instructions to install and set up the LLM on your local machine.
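Once Ollama is installed, the two models used later in this guide (mxbai-embed-large for embeddings and gemma:2b for generation) need to be pulled before first use:

```shell
# Download the embedding model and the LLM used in this guide.
ollama pull mxbai-embed-large
ollama pull gemma:2b
```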
PostgreSQL as a Vector Database
For our vector storage needs, we'll be utilizing PostgreSQL along with the pgvector extension. pgvector is an open-source extension for PostgreSQL that enables us to store and search over machine-learning-generated embeddings efficiently.
Why a Vector Database?
At the core of a RAG system lies the ability to quickly retrieve relevant documents or passages from a corpus based on a given input query. This retrieval process is facilitated by techniques like semantic search, which involve mapping the query and documents into high-dimensional vector representations called embeddings.
When a user provides a query to our RAG system, we can convert that query into an embedding vector. We can then use our vector database to quickly find the document embeddings that are most similar or “nearest neighbors” to the query embedding. This allows us to retrieve the most topically relevant documents or passages to augment our language model’s response accurately.
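A common similarity measure for this nearest-neighbor search is cosine similarity, which pgvector supports natively. The toy example below illustrates the idea in plain Python; the 3-dimensional vectors and document names are made up for the example (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings (in reality produced by an embedding model).
document_embeddings = {
    "doc_about_databases": [0.9, 0.1, 0.0],
    "doc_about_llms": [0.1, 0.9, 0.2],
    "doc_about_cooking": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # pretend this came from embedding the query

# Pick the document whose embedding is closest to the query embedding.
best = max(
    document_embeddings,
    key=lambda name: cosine_similarity(query_embedding, document_embeddings[name]),
)
print(best)  # → doc_about_databases
```

The vector database does exactly this, but at scale and with indexes that avoid comparing the query against every stored embedding.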
Setting Up PostgreSQL with pgvector
To get started, create a new directory for storing the project files. Since we’ll be using PostgreSQL with the pgvector extension as our vector database, we need to set up the database server using Docker Compose.
Create a docker-compose.yml file with the following content:
version: "3.8"
services:
  db:
    image: pgvector/pgvector:pg16
    restart: always
    env_file:
      - .env
    ports:
      - ${POSTGRES_PORT}:${POSTGRES_PORT}
    environment:
      - PGDATA=/var/lib/postgresql/data/pgdata
      - POSTGRES_PORT=${POSTGRES_PORT}
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data/pgdata
volumes:
  pgdata:
Then create a .env file:
POSTGRES_SERVER=localhost
POSTGRES_PORT=5432
POSTGRES_DB=localrag
POSTGRES_USER=postgres
POSTGRES_PASSWORD=pgpassword
Start the container:
docker-compose up -d
Installing and importing libraries
I prefer pipenv, but you can use pip, poetry, or any other tool. Create a new environment with Python 3.12:
pipenv --python 3.12
pipenv install langchain langchain-community wikipedia python-dotenv pgvector psycopg2-binary
Create a main.py file and import the required libraries:
import os
from langchain_community.document_loaders.wikipedia import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.llms.ollama import Ollama
from langchain_community.vectorstores.pgvector import PGVector
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from dotenv import load_dotenv

load_dotenv(".env")
Now define the models and the database URI:
EMBEDDING_MODEL = "mxbai-embed-large"
LLM_MODEL = "gemma:2b"
def database_uri():
    user = os.getenv("POSTGRES_USER", "postgres")
    password = os.getenv("POSTGRES_PASSWORD", "")
    server = os.getenv("POSTGRES_SERVER", "localhost")
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.getenv("POSTGRES_DB", "localrag")
    return f"postgresql+psycopg2://{user}:{password}@{server}:{port}/{db}"
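With the values from the .env file above, the helper resolves to a standard SQLAlchemy-style connection string. A quick sanity check (the function is repeated here so the snippet is self-contained, and the environment variables are set inline instead of loaded from the .env file):

```python
import os

# Simulate the values that load_dotenv would read from the .env file.
os.environ["POSTGRES_USER"] = "postgres"
os.environ["POSTGRES_PASSWORD"] = "pgpassword"
os.environ["POSTGRES_SERVER"] = "localhost"
os.environ["POSTGRES_PORT"] = "5432"
os.environ["POSTGRES_DB"] = "localrag"

def database_uri():
    user = os.getenv("POSTGRES_USER", "postgres")
    password = os.getenv("POSTGRES_PASSWORD", "")
    server = os.getenv("POSTGRES_SERVER", "localhost")
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.getenv("POSTGRES_DB", "localrag")
    return f"postgresql+psycopg2://{user}:{password}@{server}:{port}/{db}"

print(database_uri())
# → postgresql+psycopg2://postgres:pgpassword@localhost:5432/localrag
```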
Instantiating the core components
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)

vector_db = PGVector(
    embedding_function=embeddings,
    connection_string=database_uri(),
    pre_delete_collection=True,
)
retriever = vector_db.as_retriever()

llm = Ollama(model=LLM_MODEL)
- Embedding Model — generates vector embeddings from text using mxbai-embed-large.
- Vector Database — the PGVector instance stores and retrieves embeddings from PostgreSQL (pre_delete_collection=True is for development only; never use it in production).
- Retriever — performs similarity searches in the vector database.
- Language Model — the Ollama instance wrapping gemma:2b to generate responses.
Chaining the RAG components
human_prompt = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.\n
Context: {context} \n
Question: {question} \n
Answer:
"""
prompt = ChatPromptTemplate.from_messages([("human", human_prompt)])
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
The pipeline works as follows:
- The user’s question is passed to the retriever, which fetches relevant passages from the vector database.
- The question and the retrieved context are injected into the prompt template.
- The formatted prompt is passed to the LLM, which generates a response.
- StrOutputParser ensures the output is returned as a plain string.
Testing the RAG system
def wikipedia_query(query: str):
    loader = WikipediaLoader(query=query, load_max_docs=3)
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=80)
    texts = text_splitter.split_documents(docs)
    vector_db.add_documents(texts)
# Test queries
wikipedia_query("Transformer Architecture")
result = rag_chain.invoke("In what year was the Transformer paper published?")
print(f"Query: In what year was the Transformer paper published?\nResult: {result}")
wikipedia_query("Langchain")
result = rag_chain.invoke("What is langchain?")
print(f"Query: What is langchain?\nResult: {result}")
These examples use Wikipedia documents for testing. In a real scenario, replace them with your own domain-specific documents — PDFs, chat logs, internal documentation, etc.
That’s a wrap!
If you’d like to explore a more comprehensive implementation, including a functional API, check out my GitHub repository: https://github.com/luiscib3r/LLM-Projects/tree/main/local-rag. You’ll find a complete Jupyter Notebook and an API implementation ready to use.