Building a Retrieval-Augmented Generation (RAG) pipeline often feels like assembling a puzzle with mismatched pieces. Standard OpenAI pipelines are easy to set up, but they are expensive and opaque about what happens to your data. Open-source models (like Llama 3 or Qwen) promise control but require painful self-hosting or complex GPU management, leading to high operational overhead and slow time-to-first-token.
Teams need a middle ground: the simplicity of the OpenAI SDK, the power of state-of-the-art open models like Qwen3, and zero data retention for compliance.
This guide shows how to deploy a private, high-performance RAG pipeline with Qwen3-8B and LlamaIndex in less than 10 minutes: zero infra management, 100% GDPR-aligned.
Outcome
- Unified Intelligence: Qwen3-8B (LLM) and Qwen3-Embedding-8B (Embeddings) are designed to work together, delivering superior retrieval relevance compared to mixing disparate vendors.
- Drop-in Compatibility: Regolo exposes these models via an OpenAI-compatible endpoint, so you can switch your LlamaIndex OpenAI class to OpenAILike without rewriting your logic.
- Full Observability: See exactly how the model “thinks” (reasoning traces) and strip those traces before they reach users, giving you debugging power without confusing your customers.
Prerequisites (Fast)
- Regolo API Key: from your dashboard.
- Python 3.10+ and a virtual environment.
- LlamaIndex (plus python-dotenv, used in the examples below): pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like python-dotenv
Step-by-Step (Code Blocks)
1) Configure the API Connection
Point LlamaIndex to Regolo’s endpoint. This “plugs in” the infrastructure.
import os
from dotenv import load_dotenv

# Load credentials from a local .env file
load_dotenv()

REGOLO_API_KEY = os.getenv("REGOLO_API_KEY")
REGOLO_ENDPOINT = "https://api.regolo.ai/v1"
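It's worth failing fast if the key didn't load. This check is our suggestion rather than part of the Regolo setup:

# Fail fast if the .env file is missing or incomplete
if not REGOLO_API_KEY:
    raise RuntimeError("REGOLO_API_KEY not found: add it to your .env file")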
2) Initialize Qwen3 Models (LLM & Embed)
Define the “Brain” and the “Indexer”. Note the is_function_calling_model=False flag: it tells LlamaIndex the model won’t be used for tool calling, keeping it focused on pure retrieval.
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import Settings

# The Brain (LLM)
llm = OpenAILike(
    model="Qwen3-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
    context_window=8192,
    is_chat_model=True,
    is_function_calling_model=False,  # pure retrieval, no tool calling
)

# The Indexer (Embedding)
embed_model = OpenAILikeEmbedding(
    model_name="Qwen3-Embedding-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
)

# Set as global defaults
Settings.llm = llm
Settings.embed_model = embed_model
With this short block, we’ve told LlamaIndex which models to talk to whenever it needs to embed or generate. From here on, Qwen3-8B is the “brain” of our pipeline, ready to process retrieved context and produce human-like responses.
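Before indexing anything, you can run a quick smoke test to confirm the endpoint and key work; this is an optional step of our own, using the standard complete method that OpenAILike inherits:

# Optional smoke test: one-off completion against the Regolo endpoint
print(llm.complete("Reply with the single word: ready"))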
3) Load & Index Data
Read your documents and turn them into a searchable vector index in one pass.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local docs
documents = SimpleDirectoryReader("./data").load_data()

# Build index (Regolo handles the embedding generation)
index = VectorStoreIndex.from_documents(documents)
Expected output: A VectorStoreIndex object ready for querying.
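Re-embedding your documents on every run wastes tokens. As a minimal sketch (the ./storage path is an arbitrary choice), you can persist the index to disk and reload it later with LlamaIndex’s standard storage helpers:

from llama_index.core import StorageContext, load_index_from_storage

# Save the index so future runs skip re-embedding
index.storage_context.persist(persist_dir="./storage")

# Later: reload instead of rebuilding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)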
4) Query the Engine
Create a query engine and ask questions.
# Retrieve the 3 most relevant chunks, then let Qwen3-8B synthesize an answer
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the key safety features mentioned?")
print(str(response))
Expected output: A natural language answer derived only from your documents.
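To verify the grounding, response.source_nodes exposes the chunks the retriever selected. A minimal inspection loop (the file_name metadata key is what SimpleDirectoryReader sets by default):

# Show the retrieved chunks backing the answer
for node_with_score in response.source_nodes:
    print(f"score={node_with_score.score} file={node_with_score.node.metadata.get('file_name')}")
    print(node_with_score.node.get_content()[:200])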
5) Production Polish (Cleaning Reasoning)
Qwen3-8B often outputs <think> tags. Use this helper to clean them for the end-user.
import re

def clean_response(text: str) -> str:
    # Removes the internal <think>...</think> block, plus leftover whitespace
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

clean_answer = clean_response(str(response))
print(f"User-facing Answer:\n{clean_answer}")
Production-Ready: Full Pipeline
To run this in production, wrap the cleaning logic into a post-processor or a simple API wrapper function. The code above is robust enough for a microservice: just expose the query_engine.query call via FastAPI or Flask, as in the sketch below.
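Here is a minimal FastAPI sketch along those lines. The endpoint path, request model, and run command are our illustrative choices; it reuses the query_engine and clean_response defined above:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/query")
def query_docs(req: QueryRequest):
    # Reuses the query_engine and clean_response defined earlier
    response = query_engine.query(req.question)
    return {"answer": clean_response(str(response))}

# Run with: uvicorn main:app --port 8000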
Benchmarks & Costs
| Feature | Regolo (Qwen3-8B) | Standard OpenAI (GPT-4o-mini) |
|---|---|---|
| Data Privacy | Zero retention | Standard retention policies |
| Reasoning | Transparent: view <think> traces | Opaque / hidden |
| Cost | Pay-per-token (~$0.03 / 1M input tokens) | ~$0.15 / 1M input tokens |
| Embeddings | Qwen3-Embedding: high multilingual performance | text-embedding-3-small |
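As a back-of-the-envelope check using the illustrative prices above, here is what 10M input tokens would cost on each:

# Rough cost comparison for 10M input tokens (prices per 1M tokens, from the table)
regolo_cost = 10 * 0.03   # ~$0.30
openai_cost = 10 * 0.15   # ~$1.50
print(f"Regolo: ${regolo_cost:.2f} vs GPT-4o-mini: ${openai_cost:.2f}")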
Resources & Community
Official Documentation:
- Regolo Platform – European LLM provider, Zero Data-Retention and 100% Green
Join the Community:
- Regolo Discord – Share your automation builds
- CheshireCat GitHub – Contribute plugins
- Follow Us on X @regolo_ai – Show your integrations!
- Open discussion on our Subreddit Community
🚀 Ready to Deploy?
Built with ❤️ by the Regolo team. Questions? support@regolo.ai