
Qwen3-Reranker-4B on Regolo: Add a “critical brain” to your RAG in minutes

👉Try Qwen3-Reranker-4B on Regolo

Teams lose hours tuning retrieval because “top-k from vector search” is often relevant-ish, not truly useful—leading to noisy contexts, higher LLM costs, and inconsistent answers across languages and departments.
When your knowledge base is multilingual and full of near-duplicate chunks, recall alone is not enough: you need precision at the point where query and passage are judged together (not separately).

Run Qwen3-Reranker-4B in less than 10 minutes: a setup that is ready to scale, with predictable pricing and privacy-first architecture patterns you can apply from day one.

Outcome

  • Reduce “wrong context sent to LLM” by reranking 20–50 candidates down to top-5 before generation (precision@k focus).
  • Make costs predictable with a fixed €0.01 per rerank query (easy to forecast per request).
  • Support multilingual search experiences (100+ languages) without maintaining separate reranking stacks per locale.

Prerequisites (fast)

  • Regolo API key (store it as REGOLO_API_KEY).
  • An OpenAI-compatible integration mindset: the same base-URL-plus-model-name approach many tools already support.
  • Supported languages: 100+ (multilingual + cross-lingual retrieval).
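
Before anything else, a quick sanity check that the key is actually visible to your code (plain Python, nothing Regolo-specific):

import os

# Fail fast with a clear message instead of a 401 from the API later.
assert os.getenv("REGOLO_API_KEY"), "Set REGOLO_API_KEY in your environment first."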

Step-by-step (5 steps)

1) Set model + endpoint

Set the base endpoint and pick the model name.

BASE_URL = https://api.regolo.ai/v1
MODEL    = Qwen3-Reranker-4B

Expected output: you’re ready to call a rerank API that takes query + documents + top_n and returns a ranked list with scores.

2) Quick rerank via HTTP (curl)

Use the classic schema: model, query, documents, top_n.

curl -X POST https://api.regolo.ai/v1/rerank \
  -H "Authorization: Bearer $REGOLO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Reranker-4B",
    "query": "How do I turn on the lights?",
    "documents": [
      "Press the light switch on the wall.",
      "Use your phone to control the smart lights.",
      "Plug in the lamp and turn the knob."
    ],
    "top_n": 3
  }'

Expected output (shape):

{
  "results": [
    { "index": 0, "relevance_score": 0.92, "document": "..." },
    { "index": 1, "relevance_score": 0.61, "document": "..." },
    { "index": 2, "relevance_score": 0.14, "document": "..." }
  ]
}

Expected output: documents reordered by usefulness (not just similarity), with a score per passage.
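
One detail worth noting: each result carries an index into the documents array you sent, so you can always map a score back to your own copy of the passage (handy if a deployment omits the echoed document text). A minimal sketch, assuming response holds the parsed JSON from the call above:

documents = [
    "Press the light switch on the wall.",
    "Use your phone to control the smart lights.",
    "Plug in the lamp and turn the knob."
]

for result in response["results"]:
    original = documents[result["index"]]  # map the score back to your own copy
    print(result["relevance_score"], original)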

3) Python drop-in (RAG post-retrieval)

Retrieve candidates (BM25/embeddings), then rerank before calling your LLM.

import os, requests

API_KEY = os.getenv("REGOLO_API_KEY")
BASE = "https://api.regolo.ai/v1"

def regolo_rerank(query, passages, top_n=5):
    r = requests.post(
        f"{BASE}/rerank",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "Qwen3-Reranker-4B",
            "query": query,
            "documents": passages,
            "top_n": top_n
        },
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    ranked = sorted(data["results"], key=lambda x: x["relevance_score"], reverse=True)
    return [x["document"] for x in ranked[:top_n]]

# candidates = top-50 from your vector DB
# best = regolo_rerank(user_query, candidates, top_n=5)
# answer = llm(best)

Expected output: a clean top-n context list that is typically far less redundant than raw retriever output (especially with many similar chunks).
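
To close the loop, here is a sketch of the full retrieve → rerank → generate pipeline. search_candidates is a hypothetical stand-in for your own retriever, and the chat call assumes Regolo's OpenAI-compatible endpoint at the same base URL; the chat model name is a placeholder you would swap for one you actually run:

from openai import OpenAI

client = OpenAI(base_url=BASE, api_key=API_KEY)

def answer(user_query: str) -> str:
    # 1) Retrieve broadly (your own BM25 / vector search function).
    candidates = search_candidates(user_query, k=50)  # hypothetical retriever
    # 2) Rerank down to a tight, high-precision context.
    context = regolo_rerank(user_query, candidates, top_n=5)
    # 3) Generate from the reranked context only.
    system = "Answer using only this context:\n" + "\n---\n".join(context)
    resp = client.chat.completions.create(
        model="YOUR_CHAT_MODEL",  # placeholder: any chat model you run on Regolo
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_query},
        ],
    )
    return resp.choices[0].message.content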

4) Plug into LangChain / LlamaIndex / n8n

  • LangChain: replace or augment a document compression/ranking step by calling /rerank and passing the reordered docs onward.
  • LlamaIndex: implement a custom rerank function that sends query + nodes and reorders nodes by returned score.
  • n8n/Flowise/Open WebUI: configure an OpenAI-compatible provider and add an HTTP node that hits /rerank, then feed its output into the chat flow.

Expected output: reranking becomes a single, isolated component you can turn on/off for A/B tests and quality measurement.
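
Framework APIs move quickly, so rather than pinning one integration, here is a framework-agnostic sketch of the pattern all three bullets share: extract the text from whatever objects your retriever returns, rerank the texts, and use the returned index to reorder the original objects with their metadata intact. Adapting it to a LangChain compressor or a LlamaIndex node postprocessor is mostly renaming; it reuses API_KEY, BASE and the requests import from step 3:

def rerank_objects(query, items, text_of, top_n=5):
    # items: retriever output (LangChain Documents, LlamaIndex nodes, dicts, ...)
    # text_of: function extracting the passage text from one item
    texts = [text_of(item) for item in items]
    r = requests.post(
        f"{BASE}/rerank",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "Qwen3-Reranker-4B", "query": query,
              "documents": texts, "top_n": top_n},
        timeout=30,
    )
    r.raise_for_status()
    # Reorder the original objects by the returned index, keeping metadata intact.
    return [items[res["index"]] for res in r.json()["results"][:top_n]]

# e.g. for LangChain Documents: rerank_objects(q, docs, lambda d: d.page_content)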

5) Best-practice defaults

  • Start with retriever top-k = 20–50, rerank down to top-5/top-10.
  • Keep passages ~200–500 tokens to reduce ambiguity and repeated facts.
  • Measure precision@k / hit-rate@k and downstream answer exactness before/after reranking.

Expected output: predictable quality lift without changing the LLM, prompt, or index format first.
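
A minimal sketch of the before/after measurement suggested above, assuming you keep a small labeled set of queries where each retrieved chunk carries an id and relevant_ids is the set of chunk ids a human marked as relevant:

def hit_rate_at_k(result_ids, relevant_ids, k=5):
    # 1.0 if at least one relevant chunk appears in the top-k, else 0.0
    return float(any(rid in relevant_ids for rid in result_ids[:k]))

def precision_at_k(result_ids, relevant_ids, k=5):
    # fraction of the top-k chunks that are actually relevant
    top = result_ids[:k]
    return sum(rid in relevant_ids for rid in top) / len(top)

# Run both on raw retriever output and on reranked output for every labeled
# query, then compare the averages: that difference is your reranking lift.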

Production-ready (working code)

Below is a “ship-it” pattern: strict input controls + data minimization + observability hooks, so the reranker is deployable in regulated environments without rewriting later.

import os, time, hashlib, requests

API_KEY = os.environ["REGOLO_API_KEY"]
BASE = "https://api.regolo.ai/v1"
MODEL = "Qwen3-Reranker-4B"

def stable_request_id(query: str) -> str:
    # Hash the query so logs can correlate requests without storing raw user text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]

def rerank_production(query: str, docs: list[str], top_n: int = 5) -> dict:
    t0 = time.time()
    req_id = stable_request_id(query)

    payload = {
        "model": MODEL,
        "query": query,
        "documents": docs,
        "top_n": top_n
    }

    resp = requests.post(
        f"{BASE}/rerank",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "X-Request-Id": req_id
        },
        json=payload,
        timeout=20
    )
    resp.raise_for_status()
    out = resp.json()

    metrics = {
        "request_id": req_id,
        "docs_in": len(docs),
        "top_n": top_n,
        "latency_ms": int((time.time() - t0) * 1000),
    }

    return {"output": out, "metrics": metrics}

Expected output: JSON results + a metrics object you can ship to logs/APM (latency, volume, correlation id) while keeping the component stateless and easy to scale horizontally.
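
For transient failures (timeouts, 429s, 5xx) you will likely want a retry layer around rerank_production. A minimal sketch with exponential backoff; the retry policy values are illustrative, tune them against your own SLOs:

import time
import requests

def rerank_with_retry(query, docs, top_n=5, attempts=3):
    for attempt in range(attempts):
        try:
            return rerank_production(query, docs, top_n)
        except requests.exceptions.RequestException:
            if attempt == attempts - 1:
                raise                 # out of retries: surface the error
            time.sleep(2 ** attempt)  # 1s, 2s, ... exponential backoff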

Benchmarks, costs & CTAs

Regolo exposes Qwen3-Reranker-4B at €0.01 per query, so cost scales linearly with rerank calls (not with tokens), which makes forecasting straightforward: 10,000 rerank calls a month cost €100, whatever the length of the passages.

| Option | Pricing unit | Price | Latency reference | Notes |
| --- | --- | --- | --- | --- |
| Regolo (Qwen3-Reranker-4B) | Per query | €0.01 / query | Measure in your region (export p50/p95) | Multilingual (100+), 32k context per model card |
| Typical US alternative (Cohere Rerank 3.5) | Per search | $2.00 / 1,000 searches | ~171.5 ms (small) / ~459.2 ms (large) in a published benchmark | Pricing is per search, not per token; run a domain A/B for quality deltas |
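
"Measure in your region" is easy to operationalize. A small sketch that samples the /rerank endpoint through the regolo_rerank helper from step 3 and reports rough p50/p95 latency (sample size and test payload are illustrative):

import statistics, time

def latency_profile(samples=50):
    timings_ms = []
    for _ in range(samples):
        t0 = time.time()
        regolo_rerank("How do I turn on the lights?",
                      ["doc a", "doc b", "doc c"], top_n=3)
        timings_ms.append((time.time() - t0) * 1000)
    timings_ms.sort()
    p50 = statistics.median(timings_ms)
    p95 = timings_ms[int(len(timings_ms) * 0.95) - 1]  # approximate p95
    return {"p50_ms": round(p50, 1), "p95_ms": round(p95, 1)}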

👉Try it on Regolo for free


Resources & Community

Official Documentation:

  • Regolo Platform – European LLM provider, Zero Data-Retention and 100% Green


🚀 Ready to Deploy?

Get Free Regolo Credits →


Built with ❤️ by the Regolo team. Questions? support@regolo.ai