gemma-4-31B

gemma‑4‑31B is a 30.7B‑parameter dense multimodal model from Google DeepMind with 256K context, native thinking mode, function calling, and text/image/video support across 140+ languages under Apache 2.0.

Core Model

Chat

How to Get Started

pip install requests

import requests

# Chat completions endpoint and request headers.
api_url = "https://api.regolo.ai/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_REGOLO_KEY",  # replace with your own API key
}
data = {
    "model": "gemma4-31b",
    "messages": [
        {
            "role": "user",
            "content": "What is the capital of Italy, and which region does it belong to?",
        }
    ],
    "reasoning_effort": "low",  # request a low reasoning effort for a simple question
}

# Send the request, stop on HTTP errors, then print the parsed JSON body.
response = requests.post(api_url, headers=headers, json=data, timeout=60)
response.raise_for_status()
print(response.json())

Output

{
  "id": "chatcmpl-a4988541-84b1-41a5-843f-06790a11f7fc",
  "created": 1769560420,
  "model": "hosted_vllm/gemma4-31b",
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The capital of Italy is Rome (Italian: Roma). Rome belongs to the Lazio region.",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "completion_tokens": 62,
    "prompt_tokens": 45,
    "total_tokens": 107
  }
}
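
The output above has the usual chat-completion shape, so the answer text and token counts can be read with plain dictionary access. The following is a minimal sketch under that assumption; `summarize_completion` is a hypothetical helper name, not part of any SDK, and the `sample` dictionary is a trimmed copy of the output shown above.

```python
def summarize_completion(payload: dict) -> dict:
    """Extract the assistant text and token usage from a chat-completion response."""
    choice = payload["choices"][0]      # first (and here only) completion
    usage = payload.get("usage", {})    # usage may be absent on some responses
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Trimmed copy of the sample output, used here instead of a live API call.
sample = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The capital of Italy is Rome (Italian: Roma). "
                           "Rome belongs to the Lazio region.",
                "role": "assistant",
            },
        }
    ],
    "usage": {"completion_tokens": 62, "prompt_tokens": 45, "total_tokens": 107},
}

summary = summarize_completion(sample)
print(summary["text"])
print(summary["prompt_tokens"], summary["completion_tokens"], summary["total_tokens"])
```

With a live call, the same helper would be applied to `response.json()`.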

Applications & Use Cases

Multimodal chat assistants for customer support, knowledge bases, and internal copilots that combine text, image, and video understanding in 140+ languages.
Reasoning and coding copilots that use thinking mode for step‑by‑step problem solving, mathematical proofs, and complex code generation or debugging.
Document intelligence pipelines for PDFs, forms, and scanned contracts, leveraging native OCR and handwriting recognition with 256K context for large documents.
Tool‑ and function‑calling agents that orchestrate APIs, databases, and multi‑step workflows inside enterprise automation or data retrieval backends.
Video understanding workflows for surveillance, education, or sports analytics, using up to 60‑second video inputs processed as frame sequences.
On‑device and workstation deployments where the 30.7B dense architecture fits a single high‑end GPU (≈17.4 GB at 4‑bit quantization) without MoE infrastructure overhead.
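
The tool- and function-calling use case can be sketched as a request payload plus a small dispatcher. This sketch assumes the endpoint accepts the OpenAI-style `tools` field and returns `tool_calls` in the assistant message, which the page does not confirm. `get_weather`, `build_request`, and `run_tool_calls` are hypothetical names introduced for illustration, and the assistant message is simulated, so no network call is made.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool: returns a canned string, no real weather service."""
    return f"Sunny in {city}"


# Tool description in the OpenAI-style JSON Schema format (assumed to be supported).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def build_request(user_text: str) -> dict:
    """Build a chat-completion payload that advertises the available tools."""
    return {
        "model": "gemma4-31b",
        "messages": [{"role": "user", "content": user_text}],
        "tools": TOOLS,
    }


def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and return 'tool' messages for the next turn."""
    results = []
    for call in message.get("tool_calls") or []:
        func = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": func(**args)}
        )
    return results


# Simulated assistant message, shaped like an OpenAI-style tool-call response.
fake_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Rome\"}"},
        }
    ],
}

request_payload = build_request("What is the weather in Rome?")
tool_messages = run_tool_calls(fake_message)
print(tool_messages)
```

In a real loop, the assistant message and the returned tool messages would be appended to `messages` and sent back so the model can produce its final answer.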
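
The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch below counts only raw weights in decimal gigabytes; `weight_memory_gb` is a hypothetical helper. At 4 bits it gives about 15.35 GB, somewhat below the ≈17.4 GB quoted above. The gap is presumably quantization metadata and layers kept at higher precision, but that breakdown is an assumption, not something the page states. KV cache and activations come on top and grow with context length.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 30.7e9  # parameter count stated for the model

# Compare common precisions; only raw weights are counted here.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.2f} GB")
```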