
Fast Whisper: The Best Open-Source Speech-to-Text Solution

👉 Try Fast Whisper on Regolo now

This extract from Eugenio Petullà’s presentation at the recent Build Your AI event in Rome highlights practical integrations for audio transcription, vision-language models, and image generation using OpenAI-compatible APIs.

What is Multimodal AI?

Multimodal AI processes diverse inputs like text, images, audio, and video simultaneously. This unlocks enterprise use cases such as meeting transcription with sentiment analysis, document OCR combined with summarization, or generating visuals from descriptions.

Regolo.ai supports key models:

  • Audio (STT): faster-whisper-large-v3
  • Vision-Language: qwen3-vl-32b, gemma-3-27b-it
  • Image Generation: Qwen-Image

All integrate via the OpenAI SDK for seamless scaling, EU data residency, and green infrastructure.

Interacting with STT Models on regolo.ai

Speech-to-Text (STT) converts audio to text reliably. Regolo.ai’s faster-whisper-large-v3 excels in multilingual accuracy with low latency.

Best Practices

  • Chunking: limit segments to 2-3 minutes to avoid hallucinations or infinite loops (see the chunking sketch after the example below).
  • Timeout Management: set per-file limits; a maximum of 5 minutes is recommended.
  • Libraries: use the official regolo.ai or OpenAI client libraries (see the Python module).

Pro-tip: Use OGG format for optimal quality-to-size ratio, minimizing upload latency.

from pathlib import Path

from openai import OpenAI

# Point the OpenAI client at the regolo.ai endpoint
client = OpenAI(
    api_key="YOUR_REGOLO_KEY",
    base_url="https://api.regolo.ai/v1/",
)

# Audio file to transcribe
AUDIO_FILE = "/path/to/your/audio"
OUTPUT_FILE = "/path/to/output/transcription.txt"

# Transcribe the file
with open(AUDIO_FILE, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="faster-whisper-large-v3",
        file=audio_file,
        language="en",
        response_format="text",  # returns a plain string
    )

# Save the transcription
output_path = Path(OUTPUT_FILE)
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(transcript, encoding="utf-8")

print(f"Transcription saved to: {OUTPUT_FILE}")

Interacting with Vision-Language Models

The input is a list of objects in the content field, exactly like chat completions, alternating:

  • type: "text"
  • type: "image_url"

Key Aspects

  1. Message Structure: the content field holds a list of items, just like chat completions, alternating type: "text" and type: "image_url".
  2. Vertical Models: some of these models are trained to recognize specific patterns and are useful in particular domains, such as OCR or medical applications.
  3. Latency (Trade-offs): using Base64 data URLs avoids external DNS resolution but increases the payload size.


Pro-tip: request a 4096+ token budget for detailed analysis; prefer Base64 data URLs to skip DNS lookups, but watch the payload size.
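A minimal sketch of such a request, reusing the client from the STT example; the image path, prompt, and choice of qwen3-vl-32b (from the model list above) are illustrative:

import base64

# Encode a local image as a Base64 data URL (no external fetch, larger payload)
with open("/path/to/photo.jpg", "rb") as img:
    b64 = base64.b64encode(img.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3-vl-32b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }
    ],
    max_tokens=4096,  # larger budget for detailed analysis, per the pro-tip
)

print(response.choices[0].message.content)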

Interacting with Image Generation Models

Generation speed is directly influenced by:

  • the selected image size
  • the number of variants requested

Key Factors

  1. Latency: generation speed is influenced by the requested image size and the number of variants.
  2. Resolution: it is crucial to understand which resolutions the model was trained on to avoid artifacts and ensure output quality.
  3. Data Handling: the API returns image data in the b64_json field; you must decode the Base64 string to reconstruct the binary file (PNG/JPG).


For ultra-high-resolution output, best practice suggests generating at 1024×1024 and using a dedicated post-generation upscaler (e.g. Real-ESRGAN).
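A minimal generation-and-decode sketch, reusing the client from above; the prompt, the 1024×1024 size, and the response_format="b64_json" parameter are illustrative assumptions about this endpoint:

import base64

result = client.images.generate(
    model="Qwen-Image",
    prompt="A watercolor skyline of Rome at sunset",
    size="1024x1024",            # generate at a trained resolution, upscale later
    response_format="b64_json",  # ask for Base64 instead of a hosted URL
)

# Decode the Base64 payload back into a binary PNG
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)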

Multimodal Processing Strategies

There are three simple but effective ways to manage multimodal input:

  • Sequential Chain
  • Smart Routing
  • Parallel Processing

These patterns are used in frameworks such as:

  • LangChain
  • CrewAI

Sequential Chain (The “Sandwich”)

The simplest flow:

Audio → Text → Summary

Information flows in one direction only.
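A minimal sketch of the chain’s second hop, assuming the client and transcript from the STT example above; the summarization prompt and the choice of gemma-3-27b-it (from the model list) are illustrative:

# Step 1 (Audio → Text) is the transcription example above.
# Step 2 (Text → Summary): pass the transcript straight to a chat model.
summary = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[
        {"role": "system", "content": "Summarize the transcript in three bullet points."},
        {"role": "user", "content": transcript},
    ],
)

print(summary.choices[0].message.content)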

Smart Routing

A director LLM receives the input and decides which tool to activate:

  • If the user sends an image → activates Vision
  • If the user sends audio → activates Speech-to-Text
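A minimal routing sketch; the director described above is an LLM, but simple extension-based dispatch shows the same shape, and the file types and tool labels here are illustrative:

from pathlib import Path

def route(path: str) -> str:
    """Director: pick which tool to activate for the input."""
    suffix = Path(path).suffix.lower()
    if suffix in {".jpg", ".png", ".webp"}:
        return "vision"          # → e.g. qwen3-vl-32b
    if suffix in {".ogg", ".mp3", ".wav"}:
        return "speech-to-text"  # → e.g. faster-whisper-large-v3
    return "chat"                # plain text goes straight to an LLM

print(route("meeting.ogg"))   # speech-to-text
print(route("invoice.png"))   # vision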

Parallel Processing

The fastest approach.

While one model transcribes the audio:

  • another model analyzes metadata
  • or performs sentiment analysis

The results are merged at the end.
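A minimal thread-pool sketch, reusing the client and AUDIO_FILE from the STT example; analyze_metadata is a hypothetical stand-in for the metadata/sentiment pass:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def transcribe_audio(path: str) -> str:
    # Same STT call as the first example
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="faster-whisper-large-v3",
            file=f,
            response_format="text",
        )

def analyze_metadata(path: str) -> dict:
    # Hypothetical stand-in: a real pipeline might run sentiment analysis here
    stat = Path(path).stat()
    return {"name": Path(path).name, "bytes": stat.st_size}

with ThreadPoolExecutor() as pool:
    transcript_job = pool.submit(transcribe_audio, AUDIO_FILE)
    metadata_job = pool.submit(analyze_metadata, AUDIO_FILE)
    # Merge the results at the end
    result = {"transcript": transcript_job.result(), "metadata": metadata_job.result()}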

Benchmark

Aspect          | Regolo.ai STT              | OpenAI Whisper
--------------- | -------------------------- | --------------
Latency (1 min) | <10 s (EU servers)         | 15-20 s
Multilingual    | Italian/EU optimized       | General
Cost/1M tokens  | Transparent, green GPUs    | Variable
Privacy         | GDPR, no extra-EU transfer | US-based

The slides presented at Build Your AI in Rome [ITA]

👉 Build Your AI – Start for free


Resources & Community

Official Documentation:

Related Guides:

Join the Community:


🚀 Ready to scale?

Get Free Regolo Credits →

Built with ❤️ by the Regolo team. Questions? support@regolo.ai or chat with us on Discord
