Skip to content
Regolo Logo
Self‑Hosting & DevOps

Practical RAG with Sensitive Documents on EU Infra (LangChain & LlamaIndex)

Alex Genovese
5 min read
Share

Building Retrieval-Augmented Generation (RAG) applications on sensitive documents requires strict control over where data flows. By combining a private vector database for embeddings with a zero data retention inference provider like Regolo.ai, you can build powerful RAG pipelines without exposing proprietary information to US-centric cloud providers.

In a nutshell: Standard RAG exposes your document chunks to external APIs during generation: to secure sensitive data, host your vector database within your EU infrastructure and route the final prompt to Regolo.ai, which guarantees zero data retention. This approach keeps your data pipeline compliant while using leading open-weight models via LangChain or LlamaIndex.

The risk of standard RAG with sensitive data

When you build a basic RAG system, your documents are split into chunks, converted into embeddings, stored in a vector database, and finally retrieved to build a prompt. If you use standard third-party APIs for embeddings and inference, your proprietary data leaves your controlled environment twice.

For healthcare records, legal contracts, or financial reports, this data movement often violates internal security policies or regulatory frameworks like the AI Act or GDPR.

Where your data actually lives

To build a compliant architecture, you must explicitly map the data flow and separate storage from computation. A secure RAG architecture looks like this:

  • Document Storage: Hosted on your secure internal servers or a compliant EU cloud provider.
  • Embedding Model: Run locally (e.g., using open-source Hugging Face models) so documents never leave your infrastructure during indexing.
  • Vector Database: Deployed in your private network (e.g., local Qdrant, Chroma, or Milvus).
  • Inference: Sent to Regolo.ai. Because Regolo guarantees zero data retention, the prompt (which contains the retrieved document chunks) is processed in RAM and immediately discarded. Nothing is logged, stored, or used for training.

Step 1: Document processing and local embeddings

Start by extracting text from your PDF folder. Instead of sending text to an external API for vectorization, use a local embedding model. This requires slightly more memory locally but ensures total data privacy during the indexing phase.

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_huggingface import HuggingFaceEmbeddings

# Load documents securely from local storage
loader = PyPDFDirectoryLoader("./secure_documents/")
docs = loader.load_and_split()

# Use a local embedding model (runs on your CPU/GPU)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
Code language: Python (python)

Step 2: Setting up a local vector database

Store the generated vectors in a database that runs within your firewall. In this example, we use Chroma running entirely on your local file system.

from langchain_community.vectorstores import Chroma

# Create the vector database locally
vectorstore = Chroma.from_documents(
    documents=docs, 
    embedding=embeddings, 
    persist_directory="./private_chroma_db"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
Code language: Python (python)

Step 3: Zero-retention inference with Regolo.ai

Now that your retrieval pipeline is fully private, configure your LLM integration to use Regolo.ai. This provides access to leading open-weight models via an OpenAI-compatible endpoint.

Example: LangChain integration

from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
import os

# Connect to Regolo.ai
llm = ChatOpenAI(
    openai_api_key=os.environ.get("REGOLO_API_KEY"),
    openai_api_base="https://api.regolo.ai/v1",
    model_name="meta-llama/Llama-3-70b-chat-hf"
)

# Build the prompt
system_prompt = (
    "Use the following retrieved context to answer the question. "
    "If you don't know the answer, say that you don't know."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# Query your secure data
response = rag_chain.invoke({"input": "What are the compliance requirements in section 4?"})
print(response["answer"])
Code language: Python (python)

Example: LlamaIndex integration

If you prefer LlamaIndex, the privacy principles remain identical. Use a local embedding model and route the generation step to Regolo.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
import os

# 1. Local Embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

# 2. Secure EU Inference
Settings.llm = OpenAI(
    api_key=os.environ.get("REGOLO_API_KEY"),
    api_base="https://api.regolo.ai/v1",
    model="meta-llama/Llama-3-70b-chat-hf"
)

# Load and index
documents = SimpleDirectoryReader("./secure_documents/").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the main liabilities.")
print(response)
Code language: Python (python)

Architecture trade-offs

ComponentCloud API ApproachSecure RAG Approach
EmbeddingsFaster implementation, higher latencyRequires local compute, highest privacy
Vector DatabaseManaged service, data leaves networkLocal deployment, total control
InferenceData retained by default (30+ days)Zero data retention (RAM only processing)
ComplianceRequires complex DPA reviewsNative EU data residency

FAQs

What is an inference provider?
An inference provider runs pre-trained AI models on cloud infrastructure and exposes them via an API, so you can integrate AI without managing GPUs yourself.

How is inference different from training?
Training creates the model by feeding it massive datasets. Inference is the process of using that already-trained model to generate responses or predictions in real time.

Can I use an OpenAI-compatible endpoint and keep data in Europe?
Yes. By pointing your existing OpenAI SDK integration to a European inference provider like Regolo.ai, your data processing remains strictly within the EU.

Is zero data retention enough for GDPR compliance?
Zero data retention is a massive advantage because it means user prompts are not stored after processing. However, full GDPR compliance also depends on where the processing happens and how you handle data on your end.

When does self-hosting make more sense than using an inference provider?
Self-hosting makes sense if you have specialized security requirements (like air-gapped systems) or high, predictable continuous workloads. For most teams, an inference provider is faster to implement and easier to maintain.


Start your free 30-day trial at regolo.ai and deploy LLMs with complete privacy by design.

👉 Talk with our Engineers or Start your 30 days free →



Built with ❤️ by the Regolo team. Questions? regolo.ai/contact or chat with us on Discord