Supercharging Retrieval with Qwen and LlamaIndex: A Hands-On Guide

Language models keep getting better, but pairing them with smart retrieval is where the real magic happens. In this post, we’ll explore how to integrate Qwen3, a powerful family of models covering both text generation and embeddings, with LlamaIndex, one of the most flexible retrieval and orchestration frameworks out there.

We’ll use Regolo.ai as our inference provider and show how you can plug in Qwen3’s LLM and embedding models directly into LlamaIndex. Why does this matter? Because while LLMs can generate great text, combining them with retrieval pipelines unlocks use cases like document analysis, smart search, and customer feedback summarization at scale.

Our goal is simple: by the end of this tutorial, you’ll see just how seamless it is to pair Qwen3 with LlamaIndex and how this duo can power applications that are both intelligent and practical.

Step 1 – API Configuration

Before we can bring together Qwen3 and LlamaIndex, we need a way to connect them through an inference provider. In this example, we’re using our favourite inference provider, Regolo.ai, which exposes an OpenAI-compatible API. That’s great news, because it means we can use familiar client patterns while unlocking access to Qwen3 models.

Here’s the setup:

import os
from dotenv import load_dotenv

load_dotenv()

REGOLO_API_KEY = os.getenv("REGOLO_API_KEY", "your_regolo_api_key_here")
REGOLO_ENDPOINT = "https://api.regolo.ai/v1"

🔍 What’s happening here?

  1. We grab our API key from the environment (.env file). This keeps secrets out of code (best practice).
  2. We set the endpoint for Regolo.ai’s OpenAI-like API backend.
  3. Having this in place basically plugs us into Regolo, so the rest of the code can talk to Qwen3 without any extra fuss.

Think of this as connecting the cables before turning on the machine. Nothing magical yet, but essential to run the rest of the tutorial.

Step 2 – Define the Qwen3 LLM

Now that the API connection is in place, we can introduce the large language model that will actually reason over our data. In this tutorial, we’re going with Qwen3-8B, a versatile model exposed through Regolo’s API. The nice thing is that LlamaIndex provides an OpenAILike wrapper, which lets us connect to Qwen the same way we would connect to any OpenAI model, so the integration feels seamless.

In the code below, we pass a few important parameters: the model name (Qwen3-8B), the API endpoint and key we set up in Step 1, and then some configuration flags. The context_window is set to 8192, which defines how much text the model can process at once. Since Qwen is designed to handle chat-like interactions, we mark it as a chat model, but we don’t enable “function calling,” because here we’re focusing purely on retrieval and question answering. Feel free to enable it if your application needs it; Regolo.ai supports function calling on most capable models.

from llama_index.llms.openai_like import OpenAILike


llm = OpenAILike(
    model="Qwen3-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
    context_window=8192,
    is_chat_model=True,
    is_function_calling_model=False,
)

With this short definition, we’ve essentially told LlamaIndex which model to talk to whenever it needs to generate an answer. From here on, Qwen3-8B becomes the “brain” of our pipeline, ready to process context and generate human-like responses.
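
If you want to double-check the connection before going any further, a quick standalone call works with any LlamaIndex LLM (a small sanity-check sketch, nothing Regolo-specific):

# Quick sanity check: make sure the endpoint and API key respond
test_response = llm.complete("Say hello in one short sentence.")
print(test_response)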

Step 3 – Add the Qwen3 Embedding Model

Having defined Qwen3-8B as the LLM that will generate our answers, the next ingredient we need is an embedding model. If the LLM is the “brain” that reasons, then the embedding model is the “indexer” that knows how to represent documents in a way the brain can search through quickly.

For this, we’ll use Qwen3-Embedding-8B, which is designed to turn chunks of text into vectors. These vectors let us compare queries and documents by similarity, so when a user asks a question, we can fetch the most relevant snippets instead of dumping the entire knowledge base into the model’s context window. Using an embedding model from the same family as the LLM is usually a good idea, since they tend to complement each other naturally.

Here’s how we wire it up:

from llama_index.embeddings.openai_like import OpenAILikeEmbedding

embed_model = OpenAILikeEmbedding(
    model_name="Qwen3-Embedding-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
    embed_batch_size=32,
)

With just a few lines, we’ve equipped our pipeline with the ability to map knowledge into vector space. Each document we load will now be broken into embeddings, ready for efficient retrieval later when paired with the LLM.

Very personal author’s note: This is one of the inconsistencies I find particularly irritating in LlamaIndex; here you need to specify model_name instead of model. I’ve often run into errors because of these small parameter differences in LlamaIndex, which nonetheless remains one of the most complete tools for building RAG at the moment.
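
If you want a quick confirmation that the embedding endpoint responds (a minimal check using only the embed_model defined above), you can embed a single string and inspect the size of the resulting vector:

# Embed one sentence and check the dimensionality of the returned vector
vector = embed_model.get_text_embedding("Customer feedback about battery life")
print(f"Embedding dimension: {len(vector)}")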

Step 4 – Load Documents and Build the Index

It’s time to give our pipeline something to work with: actual documents. This is where we prepare our knowledge base so the system knows what information it can draw from when answering questions.

In LlamaIndex, the usual process is straightforward. You load the documents, feed them through the embedding model, and then create an index, essentially a searchable structure that makes it easy to retrieve relevant chunks later. The index becomes the bridge between user questions and the text they should connect to.

Here’s a basic example:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Step 4a: Load data from a folder (e.g., "./data")
documents = SimpleDirectoryReader("./data").load_data()

# Step 4b: Create a vector index using our embedding model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model
)

In this snippet, we point SimpleDirectoryReader at a folder (called ./data in this case), and it takes care of reading in all the text files there. Each document is then passed into the embedding model we registered in Step 3, and finally everything is wrapped into a VectorStoreIndex.

At this stage, we’ve built the foundation of retrieval: the system now has a structured “memory” of our documents, represented as embeddings, ready to be queried intelligently. Next time we ask a question, instead of blindly searching everything, it will surface just the relevant passages.
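
One practical note: the index lives in memory, so re-running the script re-embeds everything from scratch. If that becomes slow, LlamaIndex can persist the index to disk and reload it later; here’s a short sketch using the default local storage (./storage is just an example folder):

from llama_index.core import StorageContext, load_index_from_storage

# Save the index (embeddings + metadata) to a local folder
index.storage_context.persist(persist_dir="./storage")

# ...and later load it back without re-embedding the documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, embed_model=embed_model)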

Step 5 – Create a Query Engine and Ask Questions 💬

With the index built, we’re finally ready to connect all the moving pieces into a single pipeline that can respond to questions. This is the moment when the LLM (Qwen3-8B) and the embedding model (Qwen3-Embedding-8B) start working hand in hand: the embedding model retrieves the most relevant chunks from the index, and then the LLM turns that context into a natural language answer.

LlamaIndex makes this very simple. We take the index we created in Step 4 and convert it into a query engine. The query engine is essentially the orchestrator — when a user asks something, it handles the retrieval of relevant snippets via embeddings, sends them to the LLM, and then returns a clear, coherent answer.

Here’s how it looks in code:

# --- 5. Create a query engine and try a question ---
query_engine = index.as_query_engine(llm=llm)

# Example query
response = query_engine.query("What topics are covered in the documents?")
print(response)

When you run this, the system will first look up the most relevant pieces of your documents, then pass them along to Qwen3-8B, which will generate a fluent response. The printed output should read like a human-written answer, directly grounded in the content of your data.

At this point, you now have a working retrieval-augmented generation (RAG) pipeline: documents are embedded into vector space, queries are matched against them, and responses come back in natural language.
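
If you’re curious which passages the engine actually retrieved (handy for debugging relevance), the response object carries them too. A quick inspection sketch:

# Inspect the retrieved chunks and their similarity scores
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.get_content()[:200])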


Step 6 – Tune and Enhance the Workflow

Now that you have a basic RAG pipeline working, it’s time to sharpen it. Think of this step as refining the instrument so the music flows exactly the way you want it. The possibilities here are broad, but let’s touch on the most practical enhancements:


6a. Adjust Retrieval Behavior

By default, the query engine only pulls in a couple of chunks (similarity_top_k defaults to 2). But sometimes you’ll want more context, or more focused context. You can tune this with similarity_top_k:

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=5  # increase to get more context passages per query
)
  • High k: more recall, but the LLM may get overwhelmed with context.
  • Low k: tighter focus, but risk of missing information.

6b. Response Modes

You can control how the engine combines retrieved text before passing it to the LLM. For example, "compact" mode fits more data into the context window by squashing passages together:

query_engine = index.as_query_engine(
    llm=llm,
    response_mode="compact"
)

This is useful when working with large documents and small context windows.


6c. Customize Prompts

If you’d like answers to have a specific style (formal, concise, step-by-step, etc.), you can customize the prompt used for question answering. Note that text_qa_template expects a PromptTemplate containing the {context_str} and {query_str} placeholders, not a plain string:

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "You are a helpful assistant. Answer clearly and concisely.\n"
    "Context information:\n{context_str}\n"
    "Question: {query_str}\nAnswer: "
)
query_engine = index.as_query_engine(llm=llm, text_qa_template=qa_prompt)

6d. Memory and Conversational Context

For chat-style interactions, add memory so the model remembers past turns. With LlamaIndex, you can wrap the query engine inside a conversational interface to keep context flowing across multiple questions.
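
Here’s a minimal sketch of what that can look like, assuming the built-in ChatMemoryBuffer and the chat engine that LlamaIndex can build directly from the index:

from llama_index.core.memory import ChatMemoryBuffer

# Keep roughly the last 3000 tokens of conversation as rolling context
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    llm=llm,
    memory=memory,
)

print(chat_engine.chat("What topics are covered in the documents?"))
print(chat_engine.chat("Can you expand on the first one?"))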


6e. Swap or Stack Indexes

If you’re working with multiple domains of documents (say, research papers and internal docs), you can create separate indexes and either query them selectively or stitch them together in a composed index.
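
Here’s a rough sketch of that idea, assuming two hypothetical folders (./papers and ./internal_docs) and the QueryEngineTool / RouterQueryEngine helpers from llama_index.core; the router lets the LLM decide which index each question should go to:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool

# Build one index per document domain
papers_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./papers").load_data(), embed_model=embed_model
)
internal_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./internal_docs").load_data(), embed_model=embed_model
)

# Wrap each query engine as a tool with a description the router can pick from
tools = [
    QueryEngineTool.from_defaults(
        query_engine=papers_index.as_query_engine(llm=llm),
        description="Answers questions about research papers.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=internal_index.as_query_engine(llm=llm),
        description="Answers questions about internal company documentation.",
    ),
]

# The router asks the LLM which tool best fits the incoming question
router_engine = RouterQueryEngine.from_defaults(query_engine_tools=tools, llm=llm)
print(router_engine.query("What does the onboarding guide say about VPN access?"))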


At this stage, you’ve gone from:

  1. Choosing an LLM (Qwen3-8B)
  2. Defining an embedding model (Qwen3-Embedding-8B)
  3. Loading and indexing documents
  4. Building a query engine
  5. Getting first answers
  6. Enhancing the whole pipeline for control, power, and flexibility

Now your RAG setup isn’t just functional: it’s smooth, customizable, and production-ready.

Bonus Step – Working with Reasoning

Some reasoning‑focused models (such as Qwen3‑8B) can produce hidden chains of thought in special tags.
In the raw response you may notice reasoning traces wrapped like this:

<think>
  The model breaks down the question into smaller steps...
  (lots of internal reasoning here)
</think>

These traces show how the model arrived at the answer. While very useful for developers (debugging, auditing, or research), they usually should not be shown directly to end‑users, since they can be noisy or confusing.


How to Handle Reasoning Traces

We use a small helper function to clean the response before printing it:

import re

def remove_thought_part(text: str) -> str:
    # Strip any <think>...</think> reasoning block from the model output
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
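
Applied to a query-engine response, usage looks something like this (reusing the query_engine from Step 5):

raw_answer = str(query_engine.query("What topics are covered in the documents?"))
print(remove_thought_part(raw_answer).strip())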

This extra step makes your RAG pipeline both production‑safe for customers and transparent for developers.

Code Wrap‑Up

Remember to install the dependencies before running the complete script. You can do this quickly with:

pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like python-dotenv

import os
from dotenv import load_dotenv
import re
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings

def remove_thought_part(text: str) -> str:
    """
    Remove <think>...</think> reasoning traces from the model output.

    Returns:
        str: The text without the thought sections.
    """
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)


load_dotenv()

REGOLO_API_KEY = os.getenv("REGOLO_API_KEY", "your_regolo_api_key_here")
REGOLO_ENDPOINT = "https://api.regolo.ai/v1"

llm = OpenAILike(
    model="Qwen3-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
    context_window=8192,
    is_chat_model=True,
    is_function_calling_model=False,
)

embed_model = OpenAILikeEmbedding(
    model_name="Qwen3-Embedding-8B",
    api_base=REGOLO_ENDPOINT,
    api_key=REGOLO_API_KEY,
)

Settings.llm = llm
Settings.embed_model = embed_model

# Create a small file of fake customer feedback (feel free to use your own data here)
os.makedirs("./feedback", exist_ok=True)
feedback_path = "./feedback/sample_feedback.txt"

if not os.path.exists(feedback_path):
    with open(feedback_path, "w") as f:
        f.write(
            "I really appreciate the sleek design of the product. "
            "However, the battery life is much shorter than expected.\n\n"
            "Customer service answered my questions, but I had to wait almost 3 days "
            "for a response.\n\n"
            "The mobile app crashes frequently, which makes it hard to track my usage.\n\n"
            "Great value for the price! Delivery was quick and packaging was nice.\n\n"
            "I wish there were more tutorials or guides to help me set up advanced features.\n\n"
            "Performance is fast when it works, but the login process sometimes fails.\n\n"
            "Overall, I’m happy with the purchase, but I’d love to see improvements "
            "in reliability and support."
        )
    print(f"📂 Fake feedback file created at {feedback_path}")

documents = SimpleDirectoryReader("./feedback").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

question = (
    "Summarize recurring positive and negative themes in the feedback, "
    "and suggest one improvement customers would value most."
)

response = query_engine.query(question)


clean_response = remove_thought_part(str(response))


print("\n📊 Insights from customer feedback:\n")
print(clean_response)


With the steps we’ve gone through (loading your data, setting up models, building an index, querying, cleaning reasoning traces, and tweaking retrieval), you now have a solid foundation in place.

Tweaking and expanding the concepts from this tutorial can help you build your own RAG system powered by LlamaIndex + Regolo.ai models.

We’d love to see what you create!
Feel free to reach out on our Discord server! We want to hear about your projects, your experiments, and share ideas together. 🚀