👉 Try Fast Whisper on Regolo now
Here is an extract from Eugenio Petullà’s presentation at the recent Rome event, Build Your AI, highlighting practical integrations for audio transcription, vision-language models, and image generation using OpenAI-compatible APIs.
What is Multimodal AI?
Multimodal AI processes diverse inputs like text, images, audio, and video simultaneously. This unlocks enterprise use cases such as meeting transcription with sentiment analysis, document OCR combined with summarization, or generating visuals from descriptions.
Regolo.ai supports key models:
- Audio (STT): faster-whisper-large-v3
- Vision-Language: qwen3-vl-32b, gemma-3-27b-it
- Image Generation: Qwen-Image
All integrate via OpenAI SDK for seamless scaling, EU data residency, and green infrastructure.
Interacting with STT Models on regolo.ai
Speech-to-Text (STT) converts audio to text reliably. Regolo.ai’s faster-whisper-large-v3 excels in multilingual accuracy with low latency.
Best Practices
- Chunking: limit segments to 2-3 minutes to avoid hallucinations or infinite loops.
- Timeout Management: set per-file limits; max 5 minutes recommended.
- Libraries: use the official regolo.ai or OpenAI clients (the Regolo Python client is linked in the Resources section below)
Pro-tip: Use OGG format for optimal quality-to-size ratio, minimizing upload latency.
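Before sending long recordings, it helps to plan the chunking up front. Here is a minimal sketch (the `chunk_boundaries` helper and the 180-second default are illustrative, not part of the Regolo API) that splits a recording's duration into segments no longer than 3 minutes, in line with the best practice above:

```python
# Sketch: compute chunk boundaries so no segment exceeds ~3 minutes.
# Splitting long audio before transcription helps avoid hallucinations
# and runaway decoding loops on very long files.

def chunk_boundaries(total_seconds: float, max_chunk: float = 180.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering total_seconds in <= max_chunk slices."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A 7-minute recording becomes three segments: 0-180s, 180-360s, 360-420s.
print(chunk_boundaries(420))
```

Each `(start, end)` pair can then be cut with your audio tool of choice and transcribed independently.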
```python
import openai
from pathlib import Path

# OpenAI client configuration
openai.api_key = "YOUR_REGOLO_KEY"
openai.base_url = "https://api.regolo.ai/v1/"

# Audio file to transcribe
AUDIO_FILE = "/path/to/your/audio"
OUTPUT_FILE = "/path/to/output/transcription.txt"

# Transcribe the file
with open(AUDIO_FILE, "rb") as audio_file:
    transcript = openai.audio.transcriptions.create(
        model="faster-whisper-large-v3",
        file=audio_file,
        language="en",
        response_format="text",
    )

# Save the transcription
output_path = Path(OUTPUT_FILE)
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
    f.write(transcript)

print(f"Transcription saved to: {OUTPUT_FILE}")
```
Interacting with Vision-Language Models
The input is a list of objects in the content field, exactly like chat completions, alternating:
- `type: "text"`
- `type: "image_url"`
Key Aspects
- Message Structure: the input is a list of items in the content field, just like chat completions, alternating `type: "text"` and `type: "image_url"` items
- Vertical Models: some of these models are trained to recognize specific patterns and are useful in particular domains (OCR, medical applications, etc.)
- Latency (Trade-offs): using Base64 URLs reduces external DNS resolution times but increases the payload size.
Pro-tip: request a 4096+ token context for detailed analysis; prefer Base64 data URLs to skip DNS resolution, but watch the payload size.
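The alternating message structure can be sketched as follows (the helper names and the placeholder prompt are illustrative; the content-list shape follows the OpenAI chat-completions format, and `qwen3-vl-32b` is one of the vision models listed above):

```python
import base64

def image_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a Base64 data URL (avoids an external DNS lookup)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"

def build_vision_messages(prompt: str, png_bytes: bytes) -> list[dict]:
    """Alternate type:"text" and type:"image_url" items in the content list."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(png_bytes)}},
        ],
    }]

messages = build_vision_messages("Describe this image.", b"\x89PNG...")
```

The resulting `messages` list can be passed directly to a chat-completions call with a vision model such as `qwen3-vl-32b`.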
Interacting with Image Generation Models
Generation speed is directly influenced by:
- the selected image size
- the number of variants requested
Key Factors
- Latency: generation speed is driven by the selected image size and the number of variants requested
- Resolution: it is crucial to understand which resolutions the model was trained on, to avoid artifacts and ensure output quality.
- Data Handling: the API gets data in the b64_json field. You must decode the Base64 string to reconstruct the binary file (PNG/JPG).
For ultra-high-resolution output, best practice suggests generating at 1024×1024 and using a dedicated post-generation upscaler (e.g. Real-ESRGAN).
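The `b64_json` decoding step can be sketched like this (the helper name and output path are illustrative; `b64_json` is the field returned by the OpenAI-compatible images API):

```python
import base64
from pathlib import Path

def save_b64_image(b64_json: str, out_path: str) -> Path:
    """Decode the Base64 payload from the API and write the binary image to disk."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(base64.b64decode(b64_json))
    return path
```

After an image-generation call, pass `response.data[0].b64_json` to this helper to reconstruct the PNG/JPG file.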
Multimodal Processing Strategies
There are three simple but effective ways to manage multimodal input:
- Sequential Chain
- Smart Routing
- Parallel Processing
Used in frameworks such as:
- LangChain
- CrewAI
Sequential Chain (The “Sandwich”)
The simplest flow:
Audio → Text → Summary
Information flows in one direction only.
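The one-directional flow can be sketched as a simple function chain (the `transcribe` and `summarize` arguments are stand-ins for real calls to an STT model and a text model on Regolo):

```python
# Sketch: sequential "sandwich" chain — each stage's output feeds the next.

def sequential_chain(audio: bytes, transcribe, summarize) -> str:
    text = transcribe(audio)   # Audio -> Text
    return summarize(text)     # Text -> Summary
```

Because information flows one way only, each stage can be tested and swapped independently.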
Smart Routing
A director LLM receives the input and decides which tool to activate:
- If the user sends an image → activates Vision
- If the user sends audio → activates Speech-to-Text
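A minimal routing sketch, using a MIME-type check in place of the director LLM (the fallback model choice is an assumption; the model names come from the list above):

```python
# Sketch: dispatch each input to the right tool based on its content type.

def route(content_type: str) -> str:
    """Pick a model for a given MIME type."""
    if content_type.startswith("image/"):
        return "qwen3-vl-32b"             # Vision
    if content_type.startswith("audio/"):
        return "faster-whisper-large-v3"  # Speech-to-Text
    return "gemma-3-27b-it"               # plain-text fallback (assumed default)

print(route("audio/ogg"))  # faster-whisper-large-v3
```

In production, a director LLM can make this decision from the request content itself rather than from a declared MIME type.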
Parallel Processing
The fastest approach.
While one model transcribes the audio:
- another model analyzes metadata
- or performs sentiment analysis
The results are merged at the end.
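The parallel branches can be sketched with the standard library's thread pool (the worker functions are placeholders for real model calls; the merged-dict shape is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe(audio: bytes) -> str:
    return "transcript"           # stand-in for the STT call

def analyze_metadata(audio: bytes) -> dict:
    return {"duration_s": 60}     # stand-in for a metadata/sentiment pass

def process_parallel(audio: bytes) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        t = pool.submit(transcribe, audio)
        m = pool.submit(analyze_metadata, audio)
        # Results are merged only once both branches have finished.
        return {"text": t.result(), "metadata": m.result()}

print(process_parallel(b""))
```

Threads work well here because the branches spend most of their time waiting on network I/O rather than CPU.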
Benchmark
| Aspect | Regolo.ai STT | OpenAI Whisper |
|---|---|---|
| Latency (1min) | <10s (EU servers) | 15-20s |
| Multilingual | Italian/EU optimized | General |
| Cost/1M tokens | Transparent, green GPUs | Variable |
| Privacy | GDPR, no extra-EU transfer | US-based |
The Slides presented in Build Your AI in Rome [ITA]
👉 Build Your AI – Start for free
Resources & Community
Official Documentation:
- Regolo Python Client – Package reference
- Regolo Models Library – Available models
- Regolo API Docs – API reference
Related Guides:
- Rerank Models have landed on Regolo 🚀
- Supercharging Retrieval with Qwen and LlamaIndex
- Chat with ALL your Documents with Regolo + Elysia
Join the Community:
- Regolo Discord – Share your RAG builds
- GitHub Repo – Contribute examples
- Follow Us on X @regolo_ai – Show your RAG pipelines!
- Open discussion on our Subreddit Community
🚀 Ready to scale?
Built with ❤️ by the Regolo team. Questions? support@regolo.ai or chat with us on Discord