Build Faster: LLM as a Service for Developers with Qwen 3.5 122b

LLM as a Service (LLMaaS) is changing how developers work. Rather than deploying and managing models yourself, you call a hosted API, get back a response, and focus entirely on what makes your product unique. But not all providers are equal — and the difference between a great integration and a nightmare often comes down to latency, compliance, and how well the API fits your workflow.

In this guide, you’ll see practical code for using an LLM like Qwen 3.5 122b, in a few seconds, on a 100% EU-hosted, zero-data-retention LLM API to build real applications: from boilerplate generators to RAG pipelines and streaming assistants. Every example is runnable, every pattern is battle-tested.

Ready to get started? Create your free Regolo account and grab your API key in under 2 minutes.

Why Developers Choose LLMaaS Over Self-Hosting

Running your own GPU cluster sounds appealing until you face the reality: hardware faults, CUDA version conflicts, and on-call rotations for a stack that isn’t your core business. LLMaaS removes all of that.

Here’s what you actually get:

  • Zero infrastructure overhead — no NVIDIA drivers, no vLLM config, no Kubernetes YAML
  • On-demand scaling — handle spikes without pre-provisioning
  • Pay-as-you-go pricing — token-level billing, no idle GPU hours
  • Access to frontier models — Llama 3.3 70B, DeepSeek, Qwen and more, updated continuously
  • Data compliance out of the box — with Regolo, data stays in EU data centers with zero retention

The tradeoff discussion often focuses on cost. But when you factor in DevOps time, hardware fault rates, and the engineering cost of maintaining inference infrastructure, managed services win for the vast majority of teams.
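To make that concrete, here is a deliberately rough back-of-envelope comparison. Every number below is hypothetical; substitute your own quotes and traffic figures:

```python
# All numbers are hypothetical placeholders -- replace with real quotes.
gpu_hourly = 2.5                  # $/hour for one rented inference GPU
hours_per_month = 730
self_hosted = gpu_hourly * hours_per_month           # GPU rental alone, before DevOps time

tokens_per_month = 50_000_000                        # 50M tokens of traffic
price_per_million = 0.80                             # $ per 1M tokens (hypothetical)
managed = tokens_per_month / 1_000_000 * price_per_million

print(f"self-hosted GPU: ${self_hosted:,.0f}/month")
print(f"managed API:     ${managed:,.0f}/month")
```

The crossover point moves with utilization: a GPU running near 100% on a well-tuned stack can beat per-token pricing, which is exactly why the DevOps and maintenance costs belong in the comparison.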

Setting Up Regolo in 60 Seconds with Qwen 3.5 122b

pip install requests
import requests

api_url = "https://api.regolo.ai/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_REGOLO_KEY"
}

That’s it. You’re now connected to Qwen 3.5 122b in EU data centers with Zero Data Retention.

Regolo is fully OpenAI-compatible: if you already use the openai SDK in your codebase, you can point it at Regolo’s base URL with no other changes. The raw HTTP request looks like this:

data = {
  "model": "qwen3.5-122b",
  "messages": [
    {
      "role": "user",
      "content": "If a train travels 60 km/h for 2 hours and then 80 km/h for 1.5 hours, what is the total distance covered?"
    }
  ],
  "reasoning_effort": "medium"
}

response = requests.post(api_url, headers=headers, json=data)
print(response.json())
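The response follows the standard OpenAI chat-completions shape, so pulling out the answer text is mechanical. The sample payload below is illustrative, truncated to the fields actually used:

```python
def extract_answer(resp: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style chat completion."""
    return resp["choices"][0]["message"]["content"]

# Illustrative response shape (truncated to the fields we read)
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Total distance: 240 km."}}
    ]
}
print(extract_answer(sample))  # Total distance: 240 km.
```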

Let’s dive into four recurring use cases:

  • Use Case 1: Boilerplate Code Generator
  • Use Case 2: Streaming Chat Assistant
  • Use Case 3: RAG Pipeline with Embeddings and Reranking
  • Use Case 4: Structured Output for Data Extraction

Use Case 1: Boilerplate Code Generator

One of the highest-ROI uses of LLMs is generating repetitive, predictable code — CRUD endpoints, test stubs, data models. The model doesn’t need to reason; it needs to interpolate from patterns it’s seen thousands of times.

import requests

api_url = "https://api.regolo.ai/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_REGOLO_KEY"
}

def generate_fastapi_endpoint(resource_name: str, fields: list[dict]) -> str:
    """Generate a FastAPI CRUD endpoint from a resource definition."""

    fields_str = "\n".join([f"  - {f['name']}: {f['type']}" for f in fields])

    data = {
        "model": "qwen3.5-122b",
        "messages": [
            {
                "role": "user",
                "content": f"""Generate a complete FastAPI router for a '{resource_name}' resource with these fields:
{fields_str}

Include:
- Pydantic model
- GET /list endpoint
- POST /create endpoint
- DELETE /{resource_name.lower()}/{{id}} endpoint
- In-memory dict as storage (no DB needed for this example)

Output only valid Python code, no explanations."""
            }
        ],
        "reasoning_effort": "medium"
    }

    response = requests.post(api_url, headers=headers, json=data)
    return response.json()["choices"][0]["message"]["content"]

Example usage:

# Example usage
code = generate_fastapi_endpoint(
    resource_name="Product",
    fields=[
        {"name": "id", "type": "int"},
        {"name": "name", "type": "str"},
        {"name": "price", "type": "float"},
        {"name": "in_stock", "type": "bool"}
    ]
)
print(code)
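Generated code should never be trusted blindly. Before writing it to disk, a cheap sanity check is to confirm it at least parses; this standard-library sketch does exactly that:

```python
import ast

def is_valid_python(code: str) -> bool:
    """Return True if the string parses as Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x):\n    return x * 2"))  # True
print(is_valid_python("def f(:"))                      # False
```

Parsing is not execution safety, of course; for anything beyond scaffolding, keep a human review step as well.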

Use Case 2: Streaming Chat Assistant

Streaming is essential for user-facing chat applications — waiting 10 seconds for a full response kills UX. Regolo’s streaming API lets you push tokens to the client as they arrive.

import os
import regolo

regolo.default_key = os.getenv("REGOLO_API_KEY")
regolo.default_chat_model = "Llama-3.3-70B-Instruct"

def streaming_assistant(system_prompt: str):
    """Interactive streaming chat assistant."""
    client = regolo.RegoloClient()
    
    # Set system context
    client.add_prompt_to_chat(role="system", prompt=system_prompt)
    
    print("Assistant ready. Type 'quit' to exit.\n")
    
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "quit":
            break
            
        client.add_prompt_to_chat(role="user", prompt=user_input)
        
        print("Assistant: ", end="", flush=True)
        
        response_gen = client.run_chat(stream=True, full_output=False)
        full_response = ""
        
        while True:
            try:
                role, chunk = next(response_gen)
                print(chunk, end="", flush=True)
                full_response += chunk
            except StopIteration:
                break
        
        print()  # newline after response
        client.add_prompt_to_chat(role="assistant", prompt=full_response)

# Run a Python code review assistant
streaming_assistant(
    system_prompt="""You are a senior Python developer doing code reviews. 
    Be concise, flag security issues first, suggest improvements with examples."""
)

Use Case 3: RAG Pipeline with Embeddings and Reranking

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs in your own data. Regolo provides both embedding and reranking models out of the box, so you can build a complete pipeline with a single provider.

import os
import regolo
import numpy as np

regolo.default_key = os.getenv("REGOLO_API_KEY")
regolo.default_chat_model = "Llama-3.3-70B-Instruct"
regolo.default_embedder_model = "gte-Qwen2"
regolo.default_reranker_model = "jina-reranker-v2"

# --- Step 1: Index your documents ---
knowledge_base = [
    "Regolo.ai uses NVIDIA H100 and A100 GPUs for inference.",
    "All data processed by Regolo stays in Italy and is never retained.",
    "Regolo supports Llama, DeepSeek, Qwen, Phi and Maestrale models.",
    "Regolo's pricing is pay-as-you-go with no monthly minimums.",
    "The Regolo Python client is OpenAI-compatible.",
]

emb_client = regolo.RegoloClient()
doc_embeddings = emb_client.embeddings(input_text=knowledge_base)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# --- Step 2: Retrieve top-k candidates ---
def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_emb = emb_client.embeddings(input_text=query)[0]["embedding"]
    
    scores = [
        cosine_similarity(query_emb, doc["embedding"])
        for doc in doc_embeddings
    ]
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [knowledge_base[i] for i in top_indices]

# --- Step 3: Rerank for precision ---
def rerank_and_answer(query: str) -> str:
    candidates = retrieve(query, top_k=3)
    
    rerank_client = regolo.RegoloClient()
    ranked = rerank_client.rerank(
        query=query,
        documents=candidates,
        top_n=2
    )
    
    context = "\n".join([candidates[r["index"]] for r in ranked])
    
    chat_client = regolo.RegoloClient()
    _, answer = chat_client.run_chat(
        user_prompt=f"""Answer the question using only the context below.
        
Context:
{context}

Question: {query}

Answer concisely."""
    )
    return answer

# Test the pipeline
print(rerank_and_answer("What GPU hardware does Regolo use?"))
print(rerank_and_answer("How is data handled on Regolo?"))
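The knowledge base above fits in five sentences; real documents don’t. Before embedding, you would typically split them into overlapping chunks so each embedding stays focused. A minimal character-window sketch (sizes are illustrative; token-aware splitters work the same way):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 500, size=200, overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200]
```

The overlap means a sentence straddling a boundary still appears whole in at least one chunk, which keeps retrieval recall up.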

Use Case 4: Structured Output for Data Extraction

One of the most reliable LLM patterns is information extraction — parsing unstructured text into typed, usable data. With a well-crafted prompt and JSON output parsing, this becomes extremely dependable.

import os
import regolo
import json

regolo.default_key = os.getenv("REGOLO_API_KEY")
regolo.default_chat_model = "Llama-3.3-70B-Instruct"

def extract_job_posting(raw_text: str) -> dict:
    """Extract structured data from a raw job posting."""
    client = regolo.RegoloClient()
    
    _, content = client.run_chat(
        user_prompt=f"""Extract information from this job posting and return ONLY valid JSON with these fields:
- title (string)
- company (string)  
- location (string)
- remote (boolean)
- skills (array of strings)
- experience_years (integer or null)
- salary_range (string or null)

Job posting:
{raw_text}

Return only the JSON object, no markdown, no explanation."""
    )
    
    try:
        # Strip potential markdown code fences (str.strip would eat characters, not prefixes)
        clean = content.strip()
        clean = clean.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        return json.loads(clean)
    except json.JSONDecodeError:
        return {"error": "Failed to parse", "raw": content}

# Test
sample = """
We're hiring a Senior Backend Engineer at TechCorp (Rome, Italy - hybrid).
You'll need 5+ years with Python, FastAPI, PostgreSQL and Docker.
Kubernetes experience is a plus. Salary: €60k–€80k/year.
"""

result = extract_job_posting(sample)
print(json.dumps(result, indent=2))
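Extracted JSON is only useful if it matches the schema you asked for. A dependency-free validation sketch (field names mirror the prompt above):

```python
REQUIRED = {"title": str, "company": str, "location": str,
            "remote": bool, "skills": list}

def validate_posting(data: dict) -> list[str]:
    """Return a list of schema problems; an empty list means valid."""
    problems = []
    for field, expected in REQUIRED.items():
        if field not in data:
            problems.append(f"missing: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type: {field}")
    return problems

good = {"title": "Senior Backend Engineer", "company": "TechCorp",
        "location": "Rome, Italy", "remote": False,
        "skills": ["Python", "FastAPI", "PostgreSQL", "Docker"]}
print(validate_posting(good))  # []
```

For production use, a library like pydantic gives you the same check plus coercion and clearer error messages.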

Handling LLM Limitations Gracefully

LLMs hallucinate. This is not a bug to be fixed but a property to be managed. Here are the patterns that actually work:

  • Constrain the output space — ask for JSON, bullet lists, or Yes/No when possible
  • Add a verification step — for critical outputs, run a second LLM call to verify the first
  • Use retrieval, not memory — for factual questions, always ground responses in retrieved documents (RAG)
  • Detect low-confidence outputs — ask the model to rate its confidence (0-10) and flag responses below threshold
  • Keep humans in the loop — for any action with real-world consequences (sending emails, modifying databases), require explicit confirmation

For example, retry plus validation combines two of these patterns:

def reliable_extraction(text: str, retries: int = 2) -> dict | None:
    """Extraction with retry and validation."""
    for attempt in range(retries):
        result = extract_job_posting(text)
        if "error" not in result and "title" in result:
            return result
        print(f"Attempt {attempt + 1} failed, retrying...")
    return None

LLM as a Service removes the infrastructure tax and lets you ship AI features in hours, not weeks. The patterns above — boilerplate generation, streaming assistants, RAG pipelines, and structured extraction — cover 80% of real-world developer use cases.

Regolo gives you all of this with EU data residency, zero retention, and a clean OpenAI-compatible API — so you stay productive without compromising on compliance.


FAQ

Is Regolo compatible with LangChain or LlamaIndex?

Yes. Since Regolo is OpenAI-compatible, you can use it with LangChain’s ChatOpenAI class or LlamaIndex’s OpenAI LLM wrapper by simply overriding the base_url and api_key parameters.

What models are available on Regolo?

Regolo provides access to Llama 3.3, DeepSeek, Qwen, Phi, Maestrale and others — plus embedding models (gte-Qwen2), rerankers (jina-reranker-v2), and audio transcription (faster-whisper-large-v3).

How does streaming work with Regolo?

The Regolo Python client supports streaming via a generator pattern. Set stream=True in run_chat() and iterate over the response with next(). This is ideal for building responsive chat UIs.

Can I run batch jobs with Regolo?

Yes. You can process multiple prompts sequentially using client.clear_conversations() between requests, or parallelize with Python’s asyncio or ThreadPoolExecutor for higher throughput.
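A minimal fan-out sketch for the parallel case; the `ask` helper here is a stand-in for your real API call:

```python
from concurrent.futures import ThreadPoolExecutor

def ask(prompt: str) -> str:
    # Stand-in for a real chat-completion request -- hypothetical helper.
    return f"answer to: {prompt}"

prompts = ["Summarize doc A", "Summarize doc B", "Summarize doc C"]
with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(ask, prompts))  # map preserves input order
print(answers)
```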

Is there rate limiting?

Rate limits depend on your plan. For high-throughput workloads, contact the Regolo team to discuss dedicated capacity options.

How do I keep costs low during development?

Use smaller models (e.g., Llama 8B for testing) and larger ones only in production. The regolo-cli inference user-status command gives you a live cost breakdown.


GitHub Code

You can download the code from our GitHub repo: copy the .env.example files and fill them in with your credentials. If you need help, you can always reach out to our team on Discord 🤙


🚀 Ready? Start your free trial today



Built with ❤️ by the Regolo team. Questions? regolo.ai/contact or chat with us on Discord