
Supercharge RAG with Qwen3 and LlamaIndex on Regolo

👉Try Qwen3 on Regolo for free

Building a Retrieval-Augmented Generation (RAG) pipeline often feels like assembling a puzzle with mismatched pieces. Standard OpenAI pipelines are easy to set up, but they are expensive and opaque about what happens to your data. Open-source models (like Llama 3 or Qwen) promise control, but they require painful self-hosting or complex GPU management, leading to high operational overhead and slow time-to-first-token.
Teams need a middle ground: the simplicity of the OpenAI SDK, the power of state-of-the-art open models like Qwen3, and zero data retention for compliance.

Deploy a private, high-performance RAG pipeline with Qwen3-8B and LlamaIndex in less than 10 minutes. Zero infra management, 100% GDPR-aligned.

Outcome

  • Unified Intelligence: Qwen3-8B (LLM) and Qwen3-Embedding-8B (embeddings) are designed to work together, delivering stronger retrieval relevance than a mix of models from different vendors.
  • Drop-in Compatibility: Regolo exposes these models via an OpenAI-compatible endpoint, so you can switch your LlamaIndex OpenAI class to OpenAILike without rewriting your logic.
  • Full Observability: See exactly how the model “thinks” (reasoning traces) and strip them for users—giving you debugging power without confusing your customers.

Prerequisites (Fast)

  • Regolo API Key: From your dashboard.
  • Python 3.10+: And a virtual environment.
  • LlamaIndex: pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like python-dotenv (the last package provides the load_dotenv() call used below)

Step-by-Step (Code Blocks)

1) Configure the API Connection

Point LlamaIndex to Regolo’s endpoint. This “plugs in” the infrastructure.

import os
from dotenv import load_dotenv

load_dotenv()
REGOLO_API_KEY = os.getenv("REGOLO_API_KEY")
REGOLO_ENDPOINT = "https://api.regolo.ai/v1"
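
For reference, the .env file read by load_dotenv() needs a single line. The value below is a placeholder; use the key from your Regolo dashboard:

# .env
REGOLO_API_KEY=your-regolo-api-key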

2) Initialize Qwen3 Models (LLM & Embed)

Define the “Brain” and the “Indexer”. Note the is_function_calling_model=False flag: it tells LlamaIndex not to route requests through OpenAI-style function calling, keeping the model focused on plain chat completions for pure retrieval.

from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import Settings

# The Brain (LLM)
llm = OpenAILike(
    model="Qwen3-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
    context_window=8192,
    is_chat_model=True,
    is_function_calling_model=False,  # plain chat completions, no tool calling
)

# The Indexer (Embedding)
embed_model = OpenAILikeEmbedding(
    model_name="Qwen3-Embedding-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY
)

# Set as global defaults
Settings.llm = llm
Settings.embed_model = embed_model

With this short definition, we’ve essentially told LlamaIndex which model to talk to whenever it needs to generate an answer. From here on, Qwen3-8B becomes the “brain” of our pipeline, ready to process context and generate human-like responses.
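
Before indexing anything, it can help to confirm the connection works. Here’s a minimal sanity check, assuming the llm object from the snippet above is in scope (the raw output may include the <think> trace we clean up in Step 5):

# Quick connectivity test: one direct completion, no index involved
check = llm.complete("Reply with one word: ready")
print(check.text)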

3) Load & Index Data

Read your documents and turn them into a searchable vector index in one pass.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local docs
documents = SimpleDirectoryReader("./data").load_data()

# Build index (Regolo handles the embedding generation)
index = VectorStoreIndex.from_documents(documents)

Expected output: A VectorStoreIndex object ready for querying.
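
Re-indexing on every restart means re-paying for embeddings. A small optional addition, using LlamaIndex’s standard persistence API (the ./storage path is just an example):

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index so embeddings aren't recomputed on every restart
index.storage_context.persist(persist_dir="./storage")

# Later, reload it from disk instead of re-embedding ./data
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)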

4) Query the Engine

Create a query engine and ask questions.

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the key safety features mentioned?")
print(str(response))

Expected output: A natural language answer derived only from your documents.
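
To see which chunks actually grounded the answer (the observability promised earlier), you can inspect the response’s source nodes. A short sketch, assuming the response object from the query above:

# Each source node carries the retrieved chunk and its similarity score
for source in response.source_nodes:
    print(f"score: {source.score:.3f}")
    print(source.node.get_content()[:200])
    print("---")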

5) Production Polish (Cleaning Reasoning)

Qwen3-8B often emits its chain-of-thought inside <think>...</think> tags. Use this helper to strip them before showing the answer to end users.

import re

def clean_response(text: str) -> str:
    # Remove the internal <think>...</think> block and trim leftover whitespace
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

clean_answer = clean_response(str(response))
print(f"User-facing Answer:\n{clean_answer}")

Production-Ready: Full Pipeline

To run this in production, wrap the cleaning logic in a post-processor or a simple API wrapper function. The code above is robust enough for a microservice: just expose the query_engine.query call via FastAPI or Flask, as sketched below.
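
A minimal FastAPI sketch along those lines. The /query route and Question model are illustrative, and the code assumes query_engine (Step 4) and clean_response (Step 5) are in scope:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/query")
def ask(question: Question):
    # Run the RAG query and strip the <think> trace before returning
    response = query_engine.query(question.text)
    return {"answer": clean_response(str(response))}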

Benchmarks & Costs

| Feature | Regolo (Qwen3-8B) | Standard OpenAI (GPT-4o-mini) |
|---|---|---|
| Data Privacy | Zero retention | Standard retention policies |
| Reasoning | Transparent: view <think> traces | Opaque / hidden |
| Cost | Pay-per-token (~$0.03/1M input tokens) | ~$0.15/1M input tokens |
| Embeddings | Qwen3-Embedding, high multilingual performance | text-embedding-3-small |

👉Try Qwen3 on Regolo for free

Resources & Community

Official Documentation:

  • Regolo Platform – European LLM provider, Zero Data-Retention and 100% Green


🚀 Ready to Deploy?

Get Free Regolo Credits →


Built with ❤️ by the Regolo team. Questions? support@regolo.ai
