What is Cold Start?
Cold start is the additional startup latency a model service incurs when it handles a request before its runtime, model weights, caches, or hardware are fully warmed.
How It Works
Cold start is a practical deployment problem, not a feature of the model architecture. Before an LLM service can serve efficiently, it may need to start a container, load model weights from storage, initialize the CUDA context, allocate KV cache memory, compile optimized kernels, or warm routing and safety components. Cold starts are especially painful for large models because the weights are large and GPU memory initialization is expensive. Production systems reduce cold starts with warm pools, minimum replica counts, weight preloading, traffic shaping, and careful autoscaling policies.
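To see where startup time goes, it helps to time each phase separately. The following is a minimal sketch, not a real serving stack: the phase names mirror the steps described above, and the `time.sleep` calls are hypothetical stand-ins for real work such as loading weights or allocating the KV cache.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, timings):
    """Record the wall-clock duration of one startup phase, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


def cold_start(phases):
    """Run each (name, fn) startup phase in order and return its duration."""
    timings = {}
    for name, fn in phases:
        with timed(name, timings):
            fn()
    return timings


# Hypothetical stand-ins: each sleep simulates a real startup step.
phases = [
    ("load_weights", lambda: time.sleep(0.02)),
    ("init_gpu", lambda: time.sleep(0.01)),
    ("allocate_kv_cache", lambda: time.sleep(0.005)),
    ("warmup_request", lambda: time.sleep(0.005)),
]

timings = cold_start(phases)
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print(f"total cold start: {total * 1000:.1f} ms")
```

In a real service, the same per-phase breakdown usually shows which mitigation is worth pursuing first, for example preloading if weight loading dominates.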
Key Characteristics
- Adds latency when serving capacity is not already warm
- Can involve container startup, model loading, GPU initialization, and cache allocation
- More severe for large models and GPU-backed inference
- Common during scale-from-zero, deployments, failover, or traffic spikes
- Mitigated with warm pools, preloading, minimum replicas, and staged rollouts
Common Use Cases
- Diagnosing first-request latency after deploying a model service
- Designing autoscaling policies for GPU inference
- Keeping latency-sensitive chat services warm during low traffic
- Measuring deployment rollout impact on user experience
- Separating cold-start latency from steady-state latency
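Separating cold-start latency from steady-state latency can be sketched as follows. This assumes latencies are recorded in arrival order after a fresh deployment and that only the first request hit a cold replica; the function name `summarize` and the sample numbers are made up for illustration.

```python
from statistics import median


def summarize(latencies_ms, cold_requests=1):
    """Split ordered request latencies into cold and steady-state parts."""
    cold = latencies_ms[:cold_requests]
    warm = latencies_ms[cold_requests:]
    if not cold or not warm:
        raise ValueError("need at least one cold and one warm sample")
    steady = median(warm)
    return {
        "cold_max_ms": max(cold),
        "steady_median_ms": steady,
        # Extra latency attributable to the cold start itself.
        "cold_overhead_ms": max(cold) - steady,
    }


# Illustrative samples only: one slow first request, then warm traffic.
samples = [8200.0, 310.0, 295.0, 305.0, 300.0]
report = summarize(samples)
print(report)
```

Reporting the two numbers separately keeps a single slow first request from distorting steady-state percentiles.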
Example
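A minimal sketch of a warm pool with a readiness gate, assuming a replica should receive traffic only after its warmup has finished. The `Replica` and `WarmPool` classes are hypothetical and stand in for whatever orchestration layer a real deployment uses; `warm_up` is a placeholder for loading weights, initializing the GPU, and running a warmup prompt.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.ready = False  # not ready until warm_up() has completed

    def warm_up(self):
        # Placeholder for real work: load weights, init GPU, run a warmup prompt.
        self.ready = True

    def handle(self, prompt):
        if not self.ready:
            raise RuntimeError(f"{self.name} is still cold")
        return f"{self.name} handled {prompt!r}"


class WarmPool:
    def __init__(self, replicas, min_warm):
        self.replicas = list(replicas)
        self.min_warm = min_warm

    def ensure_warm(self):
        """Warm replicas until at least min_warm are ready; return the warm count."""
        warm = sum(r.ready for r in self.replicas)
        for replica in self.replicas:
            if warm >= self.min_warm:
                break
            if not replica.ready:
                replica.warm_up()
                warm += 1
        return warm

    def route(self, prompt):
        """Send the request to the first ready replica, never to a cold one."""
        for replica in self.replicas:
            if replica.ready:
                return replica.handle(prompt)
        raise RuntimeError("no warm replica available")


pool = WarmPool([Replica("a"), Replica("b"), Replica("c")], min_warm=2)
warm_count = pool.ensure_warm()
reply = pool.route("hi")
print(warm_count, reply)
```

The design choice here is that warming happens before traffic arrives, so user requests never pay the startup cost; the third replica stays cold until the pool needs it.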
Frequently Asked Questions
Why are LLM cold starts expensive?
Loading large model weights, allocating GPU memory, initializing kernels, and running warmup steps can take seconds to minutes.
Is cold start the same as TTFT?
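No. Cold start is startup overhead incurred before steady serving; TTFT (time to first token) is the per-request time until the first generated token. A request that arrives during a cold start will see an inflated TTFT, but the two are measured and mitigated separately.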
How can cold starts be reduced?
Use warm replicas, preload weights, avoid scale-to-zero for critical paths, and tune autoscaling before traffic arrives.
Should every model service avoid scale-to-zero?
Not always. Scale-to-zero saves cost for infrequent jobs, but it may be unacceptable for latency-sensitive user-facing APIs.
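The trade-off can be made concrete with rough arithmetic. The sketch below compares the idle cost of keeping one replica warm against the user-facing delay that scale-to-zero would add. Every number is a hypothetical input, not a measurement or a real price.

```python
def keep_warm_cost_per_day(gpu_hourly_usd, busy_hours_per_day):
    """Cost of the hours an always-warm replica sits idle each day."""
    if not 0 <= busy_hours_per_day <= 24:
        raise ValueError("busy_hours_per_day must be between 0 and 24")
    idle_hours = 24 - busy_hours_per_day
    return gpu_hourly_usd * idle_hours


def cold_start_delay_per_day(cold_starts_per_day, cold_start_seconds):
    """Total extra waiting time users absorb if the service scales to zero."""
    return cold_starts_per_day * cold_start_seconds


# Hypothetical inputs: $2/hour GPU, 3 busy hours, 12 cold starts of 90 s each.
idle_cost = keep_warm_cost_per_day(gpu_hourly_usd=2.0, busy_hours_per_day=3)
added_delay = cold_start_delay_per_day(cold_starts_per_day=12, cold_start_seconds=90)
print(f"Idle cost of one always-warm replica: ${idle_cost:.2f}/day")
print(f"User-facing delay with scale-to-zero: {added_delay} s/day")
```

Whether the idle cost is worth paying depends on who absorbs the delay: a batch job can usually wait, while an interactive chat user usually cannot.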