
Fast Whisper: The Best Open-Source Speech-to-Text Solution

👉 Try Fast Whisper on Regolo now

This extract from Eugenio Petullà’s presentation at the recent Build Your AI event in Rome highlights practical integrations for audio transcription, vision-language models, and image generation using OpenAI-compatible APIs.

What is Multimodal AI?

Multimodal AI processes diverse inputs like text, images, audio, and video simultaneously. This unlocks enterprise use cases such as meeting transcription with sentiment analysis, document OCR combined with summarization, or generating visuals from descriptions.

Regolo.ai supports key models:

  • Audio (STT): faster-whisper-large-v3
  • Vision-Language: qwen3-vl-32b, gemma-3-27b-it
  • Image Generation: Qwen-Image

All integrate via the OpenAI SDK for seamless scaling, EU data residency, and green infrastructure.

Interacting with STT Models on regolo.ai

Speech-to-Text (STT) converts audio to text reliably. Regolo.ai’s faster-whisper-large-v3 excels in multilingual accuracy with low latency.

Best Practices

  • Chunking: limit segments to 2-3 minutes to avoid hallucinations or infinite loops (see the chunking sketch after the example below).
  • Timeout Management: set per-file limits; a maximum of 5 minutes is recommended.
  • Libraries: use the official regolo.ai or OpenAI client libraries (see the Python module).

Pro-tip: Use OGG format for optimal quality-to-size ratio, minimizing upload latency.

from pathlib import Path

from openai import OpenAI

# Point the OpenAI client at the regolo.ai endpoint
client = OpenAI(
    api_key="YOUR_REGOLO_KEY",
    base_url="https://api.regolo.ai/v1/",
)

# Audio file to transcribe
AUDIO_FILE = "/path/to/your/audio"
OUTPUT_FILE = "/path/to/output/transcription.txt"

# Transcribe the file
with open(AUDIO_FILE, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="faster-whisper-large-v3",
        file=audio_file,
        language="en",
        response_format="text",  # returns a plain string
    )

# Save the transcription
output_path = Path(OUTPUT_FILE)
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(transcript, encoding="utf-8")

print(f"Transcription saved to: {OUTPUT_FILE}")

Interacting with Vision-Language Models

The input is a list of objects in the content field, exactly like chat completions, alternating:

  • type: "text"
  • type: "image_url"

Key Aspects

  1. Message Structure: the content field holds a list of items, just like chat completions, alternating type: "text" and type: "image_url".
  2. Vertical Models: some of these models are trained to recognize specific patterns and are useful in particular domains, such as OCR or medical applications.
  3. Latency (Trade-offs): using Base64 data URLs avoids external DNS resolution but increases the payload size.


Pro-tip: request a 4096+ token budget for detailed analysis; prefer Base64 data URLs to skip DNS lookups, but watch the payload size.
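A minimal sketch of such a request, reusing the client from the STT example; the image path, prompt, and choice of qwen3-vl-32b (from the model list above) are illustrative:

import base64

# Encode a local image as a Base64 data URL (no external fetch, larger payload)
with open("/path/to/photo.jpg", "rb") as img:
    b64 = base64.b64encode(img.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3-vl-32b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }
    ],
    max_tokens=4096,  # larger budget for detailed analysis, per the pro-tip
)

print(response.choices[0].message.content)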

Interacting with Image Generation Models

Generation speed is directly influenced by:

  • the selected image size
  • the number of variants requested

Key Factors

  1. Latency: generation speed is influenced by the requested image size and the number of variants.
  2. Resolution: it is crucial to understand which resolutions the model was trained on to avoid artifacts and ensure output quality.
  3. Data Handling: the API returns image data in the b64_json field; you must decode the Base64 string to reconstruct the binary file (PNG/JPG).


For ultra-high-resolution output, best practice suggests generating at 1024×1024 and using a dedicated post-generation upscaler (e.g. Real-ESRGAN).
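A minimal generation-and-decode sketch, reusing the client from above; the prompt, the 1024×1024 size, and the response_format="b64_json" parameter are illustrative assumptions about this endpoint:

import base64

result = client.images.generate(
    model="Qwen-Image",
    prompt="A watercolor skyline of Rome at sunset",
    size="1024x1024",            # generate at a trained resolution, upscale later
    response_format="b64_json",  # ask for Base64 instead of a hosted URL
)

# Decode the Base64 payload back into a binary PNG
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)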

Multimodal Processing Strategies

There are three simple but effective ways to manage multimodal input:

  • Sequential Chain
  • Smart Routing
  • Parallel Processing

These patterns are used in frameworks such as:

  • LangChain
  • CrewAI

Sequential Chain (The “Sandwich”)

The simplest flow:

Audio → Text → Summary

Information flows in one direction only.
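A minimal sketch of the chain’s second hop, assuming the client and transcript from the STT example above; the summarization prompt and the choice of gemma-3-27b-it (from the model list) are illustrative:

# Step 1 (Audio → Text) is the transcription example above.
# Step 2 (Text → Summary): pass the transcript straight to a chat model.
summary = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[
        {"role": "system", "content": "Summarize the transcript in three bullet points."},
        {"role": "user", "content": transcript},
    ],
)

print(summary.choices[0].message.content)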

Smart Routing

A director LLM receives the input and decides which tool to activate:

  • If the user sends an image → activates Vision
  • If the user sends audio → activates Speech-to-Text
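A minimal routing sketch; the director described above is an LLM, but simple extension-based dispatch shows the same shape, and the file types and tool labels here are illustrative:

from pathlib import Path

def route(path: str) -> str:
    """Director: pick which tool to activate for the input."""
    suffix = Path(path).suffix.lower()
    if suffix in {".jpg", ".png", ".webp"}:
        return "vision"          # → e.g. qwen3-vl-32b
    if suffix in {".ogg", ".mp3", ".wav"}:
        return "speech-to-text"  # → e.g. faster-whisper-large-v3
    return "chat"                # plain text goes straight to an LLM

print(route("meeting.ogg"))   # speech-to-text
print(route("invoice.png"))   # vision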

Parallel Processing

The fastest approach.

While one model transcribes the audio:

  • another model analyzes metadata
  • or performs sentiment analysis

The results are merged at the end.
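A minimal thread-pool sketch, reusing the client and AUDIO_FILE from the STT example; analyze_metadata is a hypothetical stand-in for the metadata/sentiment pass:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def transcribe_audio(path: str) -> str:
    # Same STT call as the first example
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="faster-whisper-large-v3",
            file=f,
            response_format="text",
        )

def analyze_metadata(path: str) -> dict:
    # Hypothetical stand-in: a real pipeline might run sentiment analysis here
    stat = Path(path).stat()
    return {"name": Path(path).name, "bytes": stat.st_size}

with ThreadPoolExecutor() as pool:
    transcript_job = pool.submit(transcribe_audio, AUDIO_FILE)
    metadata_job = pool.submit(analyze_metadata, AUDIO_FILE)
    # Merge the results at the end
    result = {"transcript": transcript_job.result(), "metadata": metadata_job.result()}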

Benchmark

Aspect          | Regolo.ai STT              | OpenAI Whisper
--------------- | -------------------------- | --------------
Latency (1 min) | <10 s (EU servers)         | 15-20 s
Multilingual    | Italian/EU optimized       | General
Cost/1M tokens  | Transparent, green GPUs    | Variable
Privacy         | GDPR, no extra-EU transfer | US-based

The slides presented at Build Your AI in Rome [ITA]

👉 Build Your AI – Start for free


Resources & Community

Official Documentation:

Related Guides:

Join the Community:


🚀 Ready to scale?

Get Free Regolo Credits →

Built with ❤️ by the Regolo team. Questions? support@regolo.ai or chat with us on Discord
