Qwen3-VL-32B-Instruct is a 33-billion-parameter vision-language model that combines strong text reasoning with advanced visual perception, spatial grounding, and long-context understanding. It handles up to 256K tokens of context (expandable to 1M), which suits tasks such as document QA, video analysis, visual coding, and GUI automation. Its architecture uses Interleaved-MRoPE for temporal-spatial positioning and DeepStack to fuse multi-level ViT features, enabling precise 2D/3D grounding and robust image-text alignment.
How to get started
pip install requests
import requests

api_url = "https://api.regolo.ai/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_REGOLO_KEY",  # replace with your own API key
}
# One user message carrying a text prompt and an image URL
data = {
    "model": "qwen3-vl-32b",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the contents of the image in detail, including objects, colors, layout, and any actions taking place. Mention key elements such as lighting, background, and relationships between objects to provide a comprehensive description."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1533738363-b7f9aef128ce?q=80&w=1035&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
                    }
                }
            ]
        }
    ]
}
# Send the request and print the parsed JSON body
response = requests.post(api_url, headers=headers, json=data)
print(response.json())
Output
{
"id": "chatcmpl-5e0927075dac4e2dadabf75aaf1f389e",
"created": 1769559105,
"model": "hosted_vllm/Qwen3-VL-32B-Instruct",
"object": "chat.completion",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "This is a close-up, portrait-style photograph of a fluffy gray cat wearing a pair of stylish, round sunglasses. The image is tightly framed, focusing primarily on the cat’s head and upper chest, creating an intimate and playful portrait.\n\n**Subject: The Cat**\n- The cat has thick, dense fur with a soft, plush texture, primarily in shades of gray, ranging from dark charcoal to lighter silver-gray. The fur appears well-groomed and slightly tousled, especially around the ears and cheeks.\n- Its ears are small, rounded, and folded forward, characteristic of a Scottish Fold breed. The fur on the ears is slightly darker and more matted, adding to the cat’s distinctive look.\n- The cat’s nose is small and black, and its mouth is closed in a neutral expression, giving it a calm and composed demeanor.\n- Long, white whiskers extend outward from its muzzle, contrasting sharply with the gray fur and adding a delicate detail to the face.\n- The cat’s eyes are completely obscured by the sunglasses, which adds to the mysterious and cool persona it projects.\n\n**Key Object: The Sunglasses**\n- The sunglasses are the focal point of the image, adding a humorous and anthropomorphic element.\n- They are round with a thin, rose-gold or copper-toned metal frame, giving them a vintage or retro aesthetic.\n- The lenses are large, reflective, and have a vibrant, gradient tint—shifting from bright yellow to a deeper amber or gold. 
They reflect the surrounding environment, showing blurred vertical lines and hints of green, suggesting nearby windows, plants, or interior decor.\n- The arms of the sunglasses are black and appear to be made of plastic or acetate, with a small, metallic hinge detail visible on the left side.\n\n**Background and Setting**\n- The background is softly blurred, creating a shallow depth of field that emphasizes the cat and sunglasses.\n- In the upper left corner, there is a woven basket made of natural fibers, possibly wicker or rattan, with a warm, light-brown color. The basket’s texture is visible and adds a rustic, homey touch to the scene.\n- The rest of the background is composed of neutral tones: a plain, off-white or light-gray wall on the right, and a darker, muted brown or gray surface behind the basket, possibly a piece of furniture or a wall panel.\n- The lighting is soft and diffused, likely natural light coming from the side or front, illuminating the cat’s fur gently and casting subtle shadows that enhance its three-dimensional form.\n\n**Composition and Mood**\n- The cat is centered in the frame, looking slightly to the right, which creates a sense of quiet confidence or contemplation.\n- The overall mood is whimsical and endearing, blending the natural cuteness of a cat with the unexpected, human-like accessory of sunglasses.\n- The warm tones of the sunglasses and basket contrast with the cool gray of the cat’s fur, creating visual interest and balance.\n- There is no action taking place—the cat is still and posed, suggesting it is being photographed deliberately, perhaps for a fun or artistic portrait.\n\nIn summary, the image captures a charming and stylish gray cat wearing fashionable round sunglasses, set against a softly blurred domestic background. The combination of textures, colors, and the cat’s composed expression creates a playful, sophisticated, and visually engaging portrait.",
"role": "assistant"
},
"provider_specific_fields": {
"stop_reason": null,
"token_ids": null
}
}
],
"usage": {
"completion_tokens": 684,
"prompt_tokens": 1427,
"total_tokens": 2111
}
}
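In practice you usually want only the reply text and the token counts rather than the whole JSON body. The sketch below reads them from a response shaped like the output above; `summarize_completion` is a hypothetical helper name, and the `sample` dictionary is a trimmed copy of that output used in place of a live API call.

```python
def summarize_completion(body: dict) -> dict:
    """Pull the reply text and token counts out of a chat-completion response."""
    choices = body.get("choices") or []
    if not choices:
        # Assumption: a body without "choices" signals an error; surface it as-is.
        raise ValueError(f"no choices in response: {body!r}")
    first = choices[0]
    usage = body.get("usage", {})
    return {
        "text": first["message"]["content"],
        "finish_reason": first.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Trimmed copy of the response shown above (content shortened for brevity).
sample = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"role": "assistant", "content": "This is a close-up..."},
        }
    ],
    "usage": {"completion_tokens": 684, "prompt_tokens": 1427, "total_tokens": 2111},
}

summary = summarize_completion(sample)
print(summary["finish_reason"])
print(summary["prompt_tokens"] + summary["completion_tokens"])
```

With a live call, pass `response.json()` to the helper instead of `sample`. A `finish_reason` other than `"stop"` may indicate a truncated reply.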
Applications and Use Cases
Visual Coding:
- Generating Draw.io diagrams from natural language or visual inputs.
- Creating HTML/CSS/JS code from screenshots, mockups, or video walkthroughs.
- Visual-to-code conversion for rapid prototyping and development.
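For the visual-to-code use cases above, the generated HTML or CSS usually arrives wrapped in Markdown code fences inside the reply text. The following is a minimal sketch of pulling those blocks out, assuming the model fences its code and tags each block with a language. `extract_code_blocks` is a hypothetical helper and the `reply` string is made-up sample text, not real model output.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# FENCE, optional language tag, newline, body (non-greedy), closing FENCE.
_BLOCK_RE = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(reply: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for every fenced block in a model reply."""
    return [(lang.lower(), code.strip()) for lang, code in _BLOCK_RE.findall(reply)]


# Made-up reply text standing in for a screenshot-to-HTML answer.
reply = (
    "Here is the page:\n"
    + FENCE + "html\n<button class=\"cta\">Buy</button>\n" + FENCE + "\n"
    + "And the styles:\n"
    + FENCE + "css\n.cta { color: red; }\n" + FENCE + "\n"
)

blocks = extract_code_blocks(reply)
for lang, code in blocks:
    print(lang, "->", code)
```

Replies without fences yield an empty list, so callers should fall back to the raw text in that case.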
Spatial Reasoning & Embodied AI:
- Advanced spatial perception: judging object positions, viewpoints, and occlusions.
- 2D grounding for object detection and localization in images.
- 3D grounding for spatial reasoning in robotics and embodied AI applications.
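To use 2D grounding results you generally need to map the returned boxes onto the source image. The sketch below assumes boxes come back as `[x1, y1, x2, y2]` on a normalized 0 to 1000 grid. That convention is an assumption here and should be checked against the actual replies you receive. `to_pixel_box` is a hypothetical helper, and the 1035x800 image size is illustrative.

```python
def to_pixel_box(box, width, height, scale=1000):
    """Map [x1, y1, x2, y2] on a 0..scale grid to pixel coordinates.

    Assumption: the model reports boxes relative to a scale x scale grid,
    independent of the real image resolution.
    """
    x1, y1, x2, y2 = box

    def conv(value, size):
        # Rescale, round to the nearest pixel, and clamp to the image bounds.
        return min(max(round(value / scale * size), 0), size)

    return (conv(x1, width), conv(y1, height), conv(x2, width), conv(y2, height))


# Illustrative box and image size, not taken from a real response.
pixel_box = to_pixel_box([250, 100, 750, 900], width=1035, height=800)
print(pixel_box)
```

If your replies already contain absolute pixel coordinates, skip the conversion.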
Video Understanding:
- Processing hours-long video within a 256K to 1M token context, with full recall.
- Second-level indexing for precise temporal event localization.
- Video summarization, analysis, and question answering across extended durations.
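How much video fits in the context window depends on how many tokens each sampled frame costs, which varies with resolution and preprocessing. The arithmetic can be sketched as follows. The per-frame cost of 256 tokens, the 4,000-token reserve for prompt and answer, and the sampling rate are placeholder assumptions, not measured values. `max_frames` and `coverage_seconds` are hypothetical helpers.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """Number of frames that fit once prompt/answer tokens are set aside."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_frame


def coverage_seconds(frames: int, frames_per_second: float) -> float:
    """Video duration covered when sampling at the given rate."""
    return frames / frames_per_second


# Placeholder numbers: 256 tokens per frame, 4,000 tokens reserved for text.
frames_256k = max_frames(256_000, tokens_per_frame=256, reserved_tokens=4_000)
frames_1m = max_frames(1_000_000, tokens_per_frame=256, reserved_tokens=4_000)

# Sampling one frame every two seconds (0.5 fps) as an example rate.
print(frames_256k, coverage_seconds(frames_256k, 0.5))
print(frames_1m, coverage_seconds(frames_1m, 0.5))
```

Lower sampling rates or smaller frames stretch the same budget over longer footage, at the cost of temporal or visual detail.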
Document Processing & OCR:
- Multi-language OCR supporting 32 languages with robustness to challenging conditions.
- Long-document structure parsing and understanding.
- Document question answering, with reported scores of 93.3% on DocVQA and 94.0% on ChartQA.
Multimodal Reasoning:
- STEM and mathematical reasoning from visual inputs.
- Causal analysis and logical, evidence-based answers combining vision and text.
- Visual question answering across diverse domains with “recognize everything” capability.
General Vision-Language Tasks:
- Celebrity, anime, product, landmark, flora, and fauna recognition.
- Chart understanding and data visualization interpretation.
- Image captioning, visual reasoning, and multi-image understanding.
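For multi-image tasks such as those listed above, a request can carry several image parts in one user message. The sketch below builds such a payload from local image bytes. It assumes the endpoint follows the OpenAI-style convention of accepting multiple `image_url` parts and base64 `data:` URLs, which should be confirmed against the provider's documentation. `image_part_from_bytes` and `build_payload` are hypothetical helpers, and no request is sent.

```python
import base64


def image_part_from_bytes(raw: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def build_payload(prompt: str, image_parts: list, model: str = "qwen3-vl-32b") -> dict:
    """Build a single-turn request with one text part followed by the images."""
    content = [{"type": "text", "text": prompt}, *image_parts]
    return {"model": model, "messages": [{"role": "user", "content": content}]}


# Stand-in bytes; in real use, read them with open(path, "rb").read().
parts = [
    image_part_from_bytes(b"abc"),
    image_part_from_bytes(b"abc", mime="image/png"),
]
payload = build_payload("Compare these two images.", parts)
print(len(payload["messages"][0]["content"]))
```

The resulting dictionary can be passed as `json=payload` to the same `requests.post` call used in the getting-started example.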