# Understanding multimodal AI models - What a multimodal model is and when to use one

## What does “multimodal” really mean?

Most AI models are built for a single kind of input. A text model reads words, a vision model looks at images, and a speech model listens to audio. A multimodal model brings two or more of these abilities together in one system. It can look at a picture, read a question about it, and give an answer in plain language.

Think of how you understand a news article: you read the headline, you look at the photo, and you combine both to get the full story. A multimodal model does something similar: it processes the text and the image together and produces a single answer that draws on both.

## Where we use multimodal models

We find multimodal models useful whenever the answer depends on more than just words. For example:

- **Reading a scanned invoice:** You send the model a picture of the invoice and ask for the total amount, the date, and the supplier name. The model extracts the needed information from the image and returns it as text.
- **Helping a customer with a photo of a broken product:** The user uploads a picture of the defect and writes a short description. The model looks at both and suggests possible fixes or opens a support ticket.
- **Scanning a whiteboard during a meeting:** You take a photo of the notes, ask the model to summarize the key points, and you get a clear text summary without typing everything yourself.

These are everyday tasks that save time and reduce manual work.
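For the invoice case, it often helps to ask the model for a fixed set of fields in JSON so the reply can be parsed by code. The sketch below is illustrative only: the field names, the prompt wording, and the helpers `build_invoice_prompt` and `parse_invoice_reply` are assumptions made for this example, not part of any Regolo API. It builds the instruction text and extracts a JSON object from a reply that may contain extra text around it.

```python
import json

# Hypothetical field names chosen for this example.
INVOICE_FIELDS = ["supplier_name", "invoice_date", "total_amount"]


def build_invoice_prompt(fields=INVOICE_FIELDS) -> str:
    """Return an instruction asking for the given fields as one JSON object."""
    keys = ", ".join(fields)
    return (
        "Read the attached invoice and reply with a single JSON object "
        f"containing exactly these keys: {keys}. "
        "Use null for any value you cannot read."
    )


def parse_invoice_reply(reply: str, fields=INVOICE_FIELDS) -> dict:
    """Extract the outermost JSON object from a reply and keep only known fields."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    # Missing keys become None so callers can tell what still needs manual review.
    return {field: data.get(field) for field in fields}


# Hand-written stand-in for a model reply, not real model output.
example_reply = (
    'Here is the data:\n{"supplier_name": "Example Supplier Ltd", '
    '"invoice_date": "2024-01-31", "total_amount": "120.50"}'
)
print(build_invoice_prompt())
print(parse_invoice_reply(example_reply))
```

Models can misread scans, so extracted values such as totals are worth validating before they reach an accounting system.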

## Example: asking a vision model to read an invoice

Below is a short Python snippet that shows how you can send an image and a text request to one of our vision‑language models using the Regolo chat completions endpoint. Replace `YOUR_API_KEY` with your actual key and choose a model ID from our Vision family (for instance, `qwen3-vl-32b`).

```python
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://api.regolo.ai/v1/chat/completions"

payload = {
    "model": "qwen3-vl-32b",  # pick a vision-language model from our catalog
    "messages": [
        {
            "role": "user",
            # A single message can mix text parts and image parts.
            "content": [
                {"type": "text", "text": "List the supplier name, invoice number, total amount, and due date from this invoice."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.jpg"}},
            ],
        }
    ],
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,  # avoid hanging forever on a slow connection
)
response.raise_for_status()  # surface HTTP errors before reading the body
print(response.json()["choices"][0]["message"]["content"])
```
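The snippet above points at a public image URL, but scans and phone photos usually start as local files. A common pattern in OpenAI-style chat APIs is to embed the file as a base64 data URL inside the `image_url` part. Whether a given Regolo model accepts data URLs is an assumption here, so check the model documentation before relying on it. The helper names `image_part_from_file` and `build_messages` are made up for this sketch.

```python
import base64
import mimetypes
from pathlib import Path


def image_part_from_file(path: str) -> dict:
    """Build an image_url content part that embeds a local file as a data URL."""
    # Guess the MIME type from the extension; fall back to JPEG (an assumption).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_messages(question: str, image_path: str) -> list:
    """Return a one-message conversation with a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                image_part_from_file(image_path),
            ],
        }
    ]
```

The returned list can be used as the `"messages"` value in the payload shown earlier. Base64 encoding makes the request roughly a third larger than the file itself, so large scans may be better resized first.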

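## A working example

The request below was run against `qwen3.5-122b` with `reasoning_effort` set to `low`. The linked image is a portrait photograph rather than an invoice, and the model points out the mismatch in its reply.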

```python
import requests

API_KEY = "YOUR_REGOLO_API_KEY"
BASE_URL = "https://api.regolo.ai/v1/chat/completions"

payload = {
  "model": "qwen3.5-122b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Look at this invoice image and describe the photo details and the subject"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://t4.ftcdn.net/jpg/18/98/38/41/360_F_1898384149_YFOtT03UAzTO7fgWTpeBq1zHRE04ka6I.webp"
          }
        }
      ]
    }
  ],
  "reasoning_effort": "low"
}

response = requests.post(
    BASE_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json=payload,
    timeout=60
)

print(response.json())
```

## Output

```json
{
  "id": "chatcmpl-aec62db4c6fb0917",
  "created": 1779367301,
  "model": "qwen3.5-122b",
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "\n\nBased on the visual content, it appears there might be a misunderstanding in the prompt: this is not an invoice. It is a black-and-white portrait photograph, likely from a fashion editorial or modeling portfolio.\n\nHere are the details of the photo and the subject:\n\n**The Subject:**\n*   **Identity:** The subject is a woman with dark hair and strong facial features. She appears to be a fashion model.\n*   **Expression:** She has a serious, intense expression and is looking directly into the camera lens. Her gaze is steady and confident.\n*   **Features:** Her skin has a natural texture, with visible freckles across her nose and cheeks, suggesting the photo was not heavily retouched. She has defined eyebrows, a straight nose, and full lips that are slightly parted.\n*   **Hair:** Her dark hair is pulled back, but several strands are blowing across the left side of her face, indicating a breeze or wind.\n\n**Photo Details:**\n*   **Style:** The image is in black and white (monochrome). It has a high-fashion, editorial aesthetic.\n*   **Attire:** She is wearing what appears to be a structured coat or jacket with a wide lapel. There is a scarf or a high collar wrapped around her neck, suggesting cold weather or a winter theme. The fabric looks textured, possibly wool or tweed.\n*   **Background:** The background is out of focus (blurred), featuring soft, light gray tones. It looks like an outdoor setting, possibly a beach or a dune, which complements the windblown hair.\n*   **Lighting:** The lighting is soft but directional, casting gentle shadows that highlight the contours of her face, particularly her cheekbones and jawline.",
        "role": "assistant",
        "reasoning_content": "The user has provided an image and asked for a description of the photo details and the subject.\nThe user's prompt mentions \"invoice image,\" which is clearly a mistake. The image is a black and white portrait of a woman, not an invoice. I need to politely ignore the \"invoice\" part and focus on describing the actual visual content.\n\n**Visual Analysis:**\n1.  **Subject:** A woman, likely a model. She has dark hair pulled back, with some loose strands blowing in the wind. Her skin has visible texture (freckles, pores), suggesting a natural, unretouched look.\n2.  **Expression:** Serious, intense, direct gaze. She is looking straight at the camera. Her lips are slightly parted but neutral.\n3.  **Attire:** She is wearing a coat or jacket with a high collar or scarf wrapped around her neck. The fabric looks textured, possibly wool or tweed. It looks like cold-weather clothing.\n4.  **Lighting/Style:** It's a black and white photograph. The lighting is soft but directional, highlighting her facial features (nose, cheekbones). The background is out of focus (bokeh), suggesting a shallow depth of field. The background looks like a blurry landscape, possibly a beach or dunes, given the light tones.\n5.  **Composition:** Close-up portrait. The subject is slightly off-center to the left.\n\n**Drafting the Description:**\n*   **Correction:** Start by clarifying that this is a portrait, not an invoice.\n*   **Subject Description:** Focus on the woman's face. Mention the dark hair, the windblown strands, the intense gaze, the freckles.\n*   **Clothing:** Describe the coat/scarf. It looks like a structured jacket with a wide lapel or a scarf tucked in.\n*   **Atmosphere/Style:** Mention the black and white format, the high fashion/editorial vibe, the natural texture of the skin.\n*   **Background:** Describe the blurry, light background.\n\n**Refining the Description:**\n*   \"This is a black and white portrait, not an invoice.\"\n*   \"The subject is a woman with dark hair pulled back.\"\n*   \"She has a serious expression and is looking directly at the viewer.\"\n*   \"She is wearing a textured coat or jacket with a high collar.\"\n*   \"The background is blurred, suggesting an outdoor setting.\"\n\nLet's combine these points into a coherent response. I will structure it to cover the subject, the details of the photo, and the style.\n\n*   **Subject:** A woman with dark hair, intense eyes, freckles.\n*   **Details:** Windblown hair, textured skin, coat/scarf.\n*   **Photo Style:** B&W, shallow depth of field, high contrast but soft lighting.\n\nFinal check of the prompt: \"Look at this invoice image and describe the photo details and the subject.\"\nOkay, I will address the \"invoice\" part gently. \"Although you referred to this as an invoice image, it is actually a black and white portrait photograph.\" Then proceed with the description.\n\n**Detailed Breakdown for the final output:**\n*   **Subject:** A woman, likely a fashion model.\n*   **Face:** Strong features, prominent cheekbones, straight nose, full lips. Visible freckles across the nose and cheeks.\n*   **Hair:** Dark, pulled back, messy/windblown strands on the left side.\n*   **Clothing:** A heavy coat or blazer with a large collar, possibly a scarf underneath. Looks like winter fashion.\n*   **Background:** Very blurry, light gray/white tones. Could be a beach or open field.\n*   **Mood:** Serious, editorial, moody.\n\nThis covers all bases. I'm ready to write the response.\n"
      }
    }
  ],
  "usage": {
    "completion_tokens": 1189,
    "prompt_tokens": 246,
    "total_tokens": 1435
  }
}
```
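In practice you rarely want the whole response object. The sketch below pulls out the pieces visible in the output above: the answer text, the optional `reasoning_content`, the finish reason, and the token usage. The helper name `summarize_response` is made up for this example, and the sample dictionary is a shortened copy of the response shown above. Other models may omit `reasoning_content`, so it is read defensively.

```python
def summarize_response(data: dict) -> dict:
    """Pick the commonly needed fields out of a chat completions response."""
    choice = data["choices"][0]
    message = choice["message"]
    usage = data.get("usage", {})
    return {
        # The answer above starts with blank lines, so strip surrounding whitespace.
        "answer": (message.get("content") or "").strip(),
        # Not every model returns reasoning text; None means it was absent.
        "reasoning": message.get("reasoning_content"),
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Shortened copy of the response shown above (long strings abbreviated).
sample = {
    "model": "qwen3.5-122b",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nBased on the visual content, this is not an invoice.",
                "reasoning_content": "The user has provided an image...",
            },
        }
    ],
    "usage": {"completion_tokens": 1189, "prompt_tokens": 246, "total_tokens": 1435},
}

summary = summarize_response(sample)
print(summary["answer"])
print("tokens used:", summary["total_tokens"])
```

Note that in the real response most of the 1189 completion tokens went to the reasoning text, which is worth keeping in mind when estimating cost.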

## Which Regolo models to use for which job

If you need pure text generation, summarization, or chat behavior, Regolo’s Completions and Chat family is the right starting point. If the task depends on an image, screenshot, page layout, or scanned file, use the Vision or OCR families instead of forcing a text-only model to guess from metadata alone.

The core models listed in Regolo’s public archive, such as Gemma 4 31B, Qwen3.5-9B, and Mistral Small 4 119B, are useful reference points for text-heavy applications. The broader lesson is that “multimodal” is not a badge to apply everywhere; it is the right choice when the input really includes more than language.
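The selection rule in this section can be sketched as a small routing function: send a request to a vision model only when the messages actually contain an image part, and use a text model otherwise. The function names are hypothetical, `qwen3-vl-32b` is the vision model ID used earlier in this article, and `your-text-model-id` is a placeholder to replace with a real ID from the catalog.

```python
VISION_MODEL = "qwen3-vl-32b"      # vision model ID mentioned earlier in this article
TEXT_MODEL = "your-text-model-id"  # placeholder, not a real model ID


def has_image_part(messages: list) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        # Plain-string content is text only; a list may mix text and image parts.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_model(messages: list) -> str:
    """Choose the vision model only when the input includes an image."""
    return VISION_MODEL if has_image_part(messages) else TEXT_MODEL


text_only = [{"role": "user", "content": "Summarize this paragraph."}]
with_image = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the total on this invoice?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.jpg"}},
        ],
    }
]
print(pick_model(text_only))
print(pick_model(with_image))
```

Routing this way keeps pure language requests on the faster, cheaper path while image-dependent requests still reach a model that can read them.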

---

## FAQ

**Do I need a special endpoint for multimodal models?**
No. You use the same chat completions endpoint; you just include both text and image parts in the message.

**Can multimodal models create images?**
Models like `qwen3-vl-32b` understand images but do not generate them. For image generation you would need a dedicated image‑generation model.

**Is a multimodal model always better than a text‑only model?**
Only when the task actually needs an image, a scan, or another non‑text input. For pure language tasks a text model is usually faster and cheaper.

---

**Start your free 30-day trial at [regolo.ai](https://regolo.ai/) and deploy LLMs with complete privacy by design.**

👉 [Talk with our Engineers](https://regolo.ai/contacts/) or [Start your 30 days free →](https://regolo.ai/pricing)

---

- [Discord](https://discord.gg/ZzZvuR2y) - Share your thoughts
- [GitHub Repo](https://github.com/regolo-ai/) - Code of blog articles ready to start
- Follow Us on X [@regolo\_ai](https://x.com/regolo_ai)
- Open discussion on our [Subreddit Community](https://www.reddit.com/r/regolo_ai/)

---

*Built with ❤️ by the Regolo team. Questions? [regolo.ai/contact](https://regolo.ai/contact)* or chat with us on [Discord](https://discord.gg/ZzZvuR2y)