Open models and scale-to-zero: managing cold starts and cost

Scale-to-zero only works when the model is small enough to load its weights into GPU memory quickly. Large open models add severe cold start latency, which ruins the user experience. For intermittent traffic, models under 15B parameters usually offer the best balance between operational cost and acceptable time-to-first-token. Mid-weight models can still work, but only if you measure the slower first response and your users accept it.

The cold start reality in AI inference

Teams often evaluate open models by looking at warm benchmarks. They read tokens-per-second metrics and assume the deployment will feel fast. They adopt scale-to-zero architectures to pay only for exact usage.

This approach ignores the physics of AI infrastructure. When an endpoint scales to zero, the GPU releases the model weights. The next incoming request triggers a cold start. The system must provision a node, pull the container, and load gigabytes of weights from storage into VRAM. This process takes time. If you choose a large reasoning model, that wait time destroys the product experience.
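To make the sequence above concrete, here is a minimal sketch that adds up the stages of a cold start. It assumes the stages run strictly one after another and that weight loading is limited by a single storage-to-VRAM throughput figure. Every number below is a hypothetical placeholder, not a measurement from any platform, and the class and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ColdStartStages:
    """Inputs for a rough cold start estimate (all values are placeholders)."""
    provision_node_s: float          # time to get a GPU node
    pull_container_s: float          # time to pull the serving image
    weights_gb: float                # size of the weight files
    load_throughput_gb_per_s: float  # assumed storage-to-VRAM throughput

    def load_weights_s(self) -> float:
        # Time to move the weights from storage into VRAM.
        return self.weights_gb / self.load_throughput_gb_per_s

    def total_s(self) -> float:
        # Simplification: stages are sequential, so the cold start is their sum.
        return self.provision_node_s + self.pull_container_s + self.load_weights_s()


# Hypothetical inputs: same node and container overhead, different weight sizes.
small = ColdStartStages(provision_node_s=20, pull_container_s=15,
                        weights_gb=18, load_throughput_gb_per_s=2)
large = ColdStartStages(provision_node_s=20, pull_container_s=15,
                        weights_gb=244, load_throughput_gb_per_s=2)

print(f"small model cold start: {small.total_s():.0f} s")
print(f"large model cold start: {large.total_s():.0f} s")
```

With fixed overhead held constant, the weight-loading term is the one that grows with model size. Replace the placeholders with values you measure on your own deployment.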

You trade steady-state infrastructure spend for latency variability: idle time costs less, but the first response after each idle period is slow. Model size dictates whether that trade is mathematically sound or functionally useless.
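The cost side of that trade can be sketched with simple arithmetic. The example below assumes per-second billing while a replica is up, including the cold start itself and an idle window before the replica scales down. The price, traffic figures, and function names are hypothetical, and real billing models may differ.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year divided by 12


def always_on_cost(gpu_price_per_hour: float) -> float:
    """Monthly cost of keeping one replica warm all month."""
    return gpu_price_per_hour * HOURS_PER_MONTH


def scale_to_zero_cost(gpu_price_per_hour: float,
                       requests_per_month: int,
                       busy_seconds_per_request: float,
                       cold_starts_per_month: int,
                       cold_start_seconds: float,
                       idle_timeout_seconds: float) -> float:
    """Monthly cost when the replica is billed only while it is up.

    Assumption: each wake-up is billed for the cold start plus one idle
    window before the replica scales back down.
    """
    serving_s = requests_per_month * busy_seconds_per_request
    overhead_s = cold_starts_per_month * (cold_start_seconds + idle_timeout_seconds)
    return gpu_price_per_hour * (serving_s + overhead_s) / 3600


# Hypothetical workload: 3600 requests per month arriving in 100 bursts.
warm = always_on_cost(gpu_price_per_hour=2.0)
elastic = scale_to_zero_cost(gpu_price_per_hour=2.0,
                             requests_per_month=3600,
                             busy_seconds_per_request=2,
                             cold_starts_per_month=100,
                             cold_start_seconds=60,
                             idle_timeout_seconds=300)

print(f"always-on:     {warm:.2f} per month")
print(f"scale-to-zero: {elastic:.2f} per month")
```

The saving holds only if the cold start stays short. A longer cold start raises both the billed overhead and the wait your users see.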

Our capacity versus custom deployments

Context matters when discussing cold starts. Our Core Models run on always-ready shared capacity. Cold starts do not affect these endpoints, so teams get fast, consistent inference without managing the underlying nodes.

This guide applies when you build custom model deployments. If you run dedicated endpoints on our European infrastructure to maintain strict GDPR compliance and data sovereignty, you control the scaling rules. When you configure those custom endpoints to drop to zero between traffic bursts, model selection becomes an architectural constraint.
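How often users actually hit the cold path depends on your idle timeout and your traffic. As a rough first estimate, assume requests arrive as a Poisson process, which is an idealization because bursty traffic is usually more clustered. Under that assumption, the share of requests that arrive after the endpoint has already scaled down is the probability that the gap since the previous request exceeds the timeout. The function name and parameters below are illustrative.

```python
import math


def cold_request_fraction(requests_per_hour: float, idle_timeout_minutes: float) -> float:
    """Estimated share of requests that find the endpoint scaled to zero.

    Assumes Poisson arrivals, so gaps between requests are exponentially
    distributed and P(gap > timeout) = exp(-rate * timeout).
    """
    rate_per_minute = requests_per_hour / 60.0
    return math.exp(-rate_per_minute * idle_timeout_minutes)


# Hypothetical traffic levels with a 10-minute idle timeout.
for rph in (1, 6, 30):
    share = cold_request_fraction(rph, idle_timeout_minutes=10)
    print(f"{rph:>2} requests/hour -> {share:.1%} of requests are cold")
```

Treat the result as a starting point and confirm it against the cold start counts in your own telemetry.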

Comparing open models for intermittent workloads

Smaller, optimized models survive scale-to-zero setups better than very large ones. A model that works perfectly in a persistent environment will fail an elasticity test if the time-to-first-token stretches into minutes.

Model family	Fit for scale-to-zero	Why it works or fails
qwen3.5-9b	Strong	Small memory footprint recovers quickly. Cheap enough to justify bursty traffic.
mistral-small3.2	Strong to good	Manageable size. Useful for more capable paths without heavy VRAM requirements.
qwen3.6-27b	Good with caution	Better quality headroom, but requires strict attention to warm-up latency.
gemma4-31b	Good with caution	Reasonable for quality-heavy tasks, less forgiving than smaller tiers on cold requests.
qwen3-coder-next	Niche workload	Excellent for code generation, but too heavy to treat casually for intermittent traffic.
qwen3.5-122b	Weak	Massive weight loading overhead. Unusable for standard scale-to-zero endpoints.
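A quick way to compare the rows above is to estimate how many gigabytes of weights each model has to load. The sketch below multiplies the parameter count by the bytes per parameter at a given precision. The parameter counts are read off the model names, which is an assumption, and models whose names carry no size are left out. Real checkpoints can differ, and runtime memory also needs room for the KV cache and activations.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * bytes per param / 1e9 bytes per GB
    simplifies to params_billions * bytes per param.
    """
    return params_billions * BYTES_PER_PARAM[precision]


# Parameter counts in billions, inferred from the model names (assumption).
MODELS = {
    "qwen3.5-9b": 9,
    "qwen3.6-27b": 27,
    "gemma4-31b": 31,
    "qwen3.5-122b": 122,
}

for name, billions in MODELS.items():
    print(f"{name:<14} fp16: {weights_gb(billions):6.1f} GB   "
          f"int4: {weights_gb(billions, 'int4'):6.1f} GB")
```

Quantization shrinks the amount of data to load, but it can also change output quality, so re-evaluate the model at whatever precision you plan to serve.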

Models under 15B

If the workload involves basic chat, light reasoning, or internal routing, start with qwen3.5-9b. This tier keeps the scale-to-zero concept practical. The weight loading time is minimal, meaning the first user request after an idle period remains tolerable.

Mid-weight models up to 35B

When a product workflow demands higher reasoning quality, mistral-small3.2 is a logical step up. Models in the 20B to 35B range, like qwen3.6-27b and gemma4-31b, offer better output but require larger GPUs and longer loading times. You must measure whether your users will accept a slower first response in exchange for higher-quality output.

Heavyweights over 100B

Forcing large reasoning models into bursty, scale-to-zero traffic rarely works. Models like qwen3.5-122b dominate benchmarks, but their infrastructure profile contradicts the elasticity model. At 16-bit precision, 122 billion parameters amount to roughly 244 GB of weights, and loading that into VRAM from a dead stop creates unacceptable latency. If you need this quality, you usually have to pay for provisioned, always-on capacity.

How to test inference elasticity

Do not rely on standard benchmarks for custom deployments. If you configure a model to sleep, test its waking behavior.

Measure these specific points:

Total response time on the first cold request.
First-token latency after a prolonged idle period.
Latency on the second request, once the endpoint is warm again.
Cost per useful request at your actual burst frequency.
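The measurements above can be captured with a small timing harness. The sketch below times any callable that returns a stream of tokens, so you can wrap your own client call in it. `fake_endpoint` is a stand-in that simulates startup delay with `time.sleep`; it is not a real API. "Useful request" is taken here to mean a request that completed within the latency your users tolerate, which is one possible definition.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(send_request: Callable[[], Iterable[str]]) -> dict:
    """Time one streamed response: first-token latency and total time."""
    start = time.perf_counter()
    first_token_s = None
    tokens = 0
    for _ in send_request():
        if first_token_s is None:
            # Time from sending the request until the first token arrives.
            first_token_s = time.perf_counter() - start
        tokens += 1
    total_s = time.perf_counter() - start
    return {"first_token_s": first_token_s, "total_s": total_s, "tokens": tokens}


def cost_per_useful_request(monthly_cost: float, useful_requests: int) -> float:
    """Monthly spend divided by the requests that met your latency tolerance."""
    return monthly_cost / useful_requests


def fake_endpoint(startup_delay_s: float) -> Callable[[], Iterator[str]]:
    """Hypothetical stand-in for a real streaming client call."""
    def send() -> Iterator[str]:
        time.sleep(startup_delay_s)  # simulates cold start or warm overhead
        for tok in ("hello", " ", "world"):
            yield tok
    return send


# First request after idle (cold), then an immediate second request (warm).
cold = measure_stream(fake_endpoint(0.2))
warm = measure_stream(fake_endpoint(0.01))

print(f"cold first token: {cold['first_token_s']:.3f} s, total: {cold['total_s']:.3f} s")
print(f"warm first token: {warm['first_token_s']:.3f} s, total: {warm['total_s']:.3f} s")
```

In a real test, replace `fake_endpoint` with a function that calls your deployment after it has been idle longer than the scale-down timeout, and repeat the run several times, since a single cold start is a noisy sample.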

A model can execute perfectly in theory and still be the wrong choice for the product. Test the cold path. Start small, gather user tolerance data, and step up to heavier models only when telemetry proves the quality gain offsets the cold start penalty.

FAQ

Why do large AI models struggle with scale-to-zero?
Large models have massive weight files. When a scaled-to-zero endpoint wakes up, it must load tens or hundreds of gigabytes into GPU memory before generating the first token, causing severe latency.

Does Regolo.ai suffer from cold starts?
Regolo Core Models run on always-warm shared capacity, eliminating cold starts. Cold starts apply only if you configure custom, dedicated deployments to scale to zero to minimize infrastructure costs.

What is the best open model size for intermittent traffic?
Models under 15B parameters, such as qwen3.5-9b, offer the best balance. They load quickly into VRAM, keeping the time-to-first-token low while providing strong performance for standard text tasks.

