What is Cold Start?

Cold start is the extra latency a model service incurs when it handles a request before its runtime, model weights, caches, or hardware are fully initialized and warm.

How It Works

Cold start is a practical deployment problem, not a model architecture feature. Before it can serve efficiently, an LLM service may need to start a container, load model weights from storage, initialize the CUDA context, allocate KV cache memory, compile optimized kernels, or warm routing and safety components. Cold starts are especially painful for large models because moving the weights from storage into GPU memory takes time and GPU initialization is expensive. Production systems reduce cold starts with warm pools, minimum replicas, preloading, traffic shaping, and careful autoscaling policies.
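To see where startup time goes, it can help to time each phase separately. The following is a minimal sketch, not a real serving stack: `load_weights`, `init_gpu`, and `allocate_kv_cache` are hypothetical stand-ins that only sleep, and `cold_start` is an illustrative helper name.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_phase(name, timings):
    """Record how long the enclosed block takes under timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def cold_start(phases):
    """Run (name, callable) startup phases in order; return seconds per phase."""
    timings = {}
    for name, run_phase in phases:
        with timed_phase(name, timings):
            run_phase()
    return timings


# Hypothetical stand-ins for real startup work; the sleeps are placeholders.
def load_weights():
    time.sleep(0.02)


def init_gpu():
    time.sleep(0.01)


def allocate_kv_cache():
    time.sleep(0.005)


if __name__ == "__main__":
    report = cold_start([
        ("load_weights", load_weights),
        ("init_gpu", init_gpu),
        ("allocate_kv_cache", allocate_kv_cache),
    ])
    for phase, seconds in report.items():
        print(f"{phase}: {seconds * 1000:.1f} ms")
    print(f"total: {sum(report.values()) * 1000:.1f} ms")
```

A per-phase breakdown like this shows which step dominates, which in turn suggests whether preloading, a faster storage path, or a larger warm pool is the more useful fix.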

Key Characteristics

Adds latency when serving capacity is not already warm
Can involve container startup, model loading, GPU initialization, and cache allocation
More severe for large models and GPU-backed inference
Common during scale-from-zero, deployments, failover, or traffic spikes
Mitigated with warm pools, preloading, minimum replicas, and staged rollouts
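
A minimum-replica policy can be sketched as a floor on the autoscaler's target. This is an illustrative calculation only: the function name, the `headroom` parameter, and the throughput figures are assumptions, not the behavior of any particular autoscaler.

```python
import math


def desired_replicas(expected_rps, rps_per_replica, min_replicas=1, headroom=0.2):
    """Replicas to keep warm: expected demand plus headroom, never below the floor."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    needed = math.ceil(expected_rps * (1 + headroom) / rps_per_replica)
    return max(min_replicas, needed)


if __name__ == "__main__":
    # With no traffic, the floor keeps two replicas warm instead of scaling to zero.
    print(desired_replicas(expected_rps=0, rps_per_replica=10, min_replicas=2))
    # Under load, demand plus headroom decides the count.
    print(desired_replicas(expected_rps=100, rps_per_replica=10, headroom=0.5))
```

The floor trades idle GPU cost for the guarantee that some capacity is already warm when traffic returns.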

Common Use Cases

Diagnosing first-request latency after deploying a model service
Designing autoscaling policies for GPU inference
Keeping latency-sensitive chat services warm during low traffic
Measuring deployment rollout impact on user experience
Separating cold-start latency from steady-state latency
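
One rough way to separate cold-start latency from steady-state latency is to label the first request each replica serves as cold. The sketch below assumes request logs are available as `(replica_id, latency_seconds)` pairs in arrival order; that log shape and the first-request heuristic are assumptions, and a real service may stay cold for more than one request.

```python
from statistics import median


def split_latencies(samples):
    """Split (replica_id, latency_seconds) pairs into cold and warm latencies.

    The first request seen for each replica is treated as cold.
    """
    seen = set()
    cold, warm = [], []
    for replica_id, latency in samples:
        if replica_id in seen:
            warm.append(latency)
        else:
            seen.add(replica_id)
            cold.append(latency)
    return cold, warm


if __name__ == "__main__":
    # Made-up latencies for illustration only.
    log = [("a", 12.0), ("a", 0.4), ("b", 9.0), ("a", 0.5), ("b", 0.6)]
    cold, warm = split_latencies(log)
    print("cold median:", median(cold))
    print("warm median:", median(warm))
```

Reporting the two groups separately keeps a handful of slow cold requests from distorting steady-state percentiles.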

Example

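The following is a minimal simulation of the effect, not a real inference server. `ModelReplica` is a hypothetical class, and the sleep stands in for loading weights; a lazily loaded replica pays that cost on its first request, while a preloaded one pays it before traffic arrives.

```python
import time


class ModelReplica:
    """Toy replica whose 'model load' is simulated with a sleep."""

    def __init__(self, load_seconds, preload=False):
        self.load_seconds = load_seconds
        self.loaded = False
        if preload:
            self._load()  # pay the startup cost before any request arrives

    def _load(self):
        time.sleep(self.load_seconds)  # stand-in for loading weights
        self.loaded = True

    def handle(self, prompt):
        """Return (reply, latency_seconds) for one request."""
        start = time.perf_counter()
        if not self.loaded:
            self._load()  # cold path: the first request waits for the load
        reply = f"echo: {prompt}"
        return reply, time.perf_counter() - start


if __name__ == "__main__":
    lazy = ModelReplica(load_seconds=0.05)
    _, first = lazy.handle("hello")
    _, second = lazy.handle("hello")
    print(f"lazy replica: first={first:.3f}s, second={second:.3f}s")

    warm = ModelReplica(load_seconds=0.05, preload=True)
    _, latency = warm.handle("hello")
    print(f"preloaded replica: first={latency:.3f}s")
```

The total work is the same in both cases; preloading only moves it out of the request path.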

Frequently Asked Questions

Why are LLM cold starts expensive?

Loading large model weights, allocating GPU memory, initializing kernels, and running warmup steps can together take seconds or minutes before the service is ready to respond.

Is cold start the same as TTFT?

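No. Cold start is startup overhead paid before steady serving; TTFT (time to first token) is the per-request time until the first generated token. The two interact: a request that arrives during a cold start sees its TTFT inflated by the startup overhead.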

How can cold starts be reduced?

Use warm replicas, preload weights, avoid scale-to-zero for critical paths, and tune autoscaling before traffic arrives.
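
Preloading is often paired with a readiness check, so traffic is routed to a replica only after warmup has finished. The sketch below is one way to model that, under assumptions: `ReadinessGate` and its method names are hypothetical, and the 200/503 status codes follow the common convention for HTTP health endpoints.

```python
class ReadinessGate:
    """Reports ready only after every warmup step has completed."""

    def __init__(self, warmup_steps):
        self._steps = list(warmup_steps)
        self.ready = False

    def warm_up(self):
        """Run warmup steps in order, e.g. load weights, then a dummy inference."""
        for step in self._steps:
            step()
        self.ready = True

    def health_status(self):
        # 503 signals a load balancer to hold traffic until warmup is done.
        return 200 if self.ready else 503


if __name__ == "__main__":
    gate = ReadinessGate([lambda: print("loading weights"),
                          lambda: print("running warmup request")])
    print(gate.health_status())
    gate.warm_up()
    print(gate.health_status())
```

With a gate like this, users never see the cold path; the cost shifts to slower scale-up, which is why it works best alongside autoscaling that starts replicas before demand peaks.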

Should every model service avoid scale-to-zero?

Not always. Scale-to-zero saves cost for infrequent jobs, but it may be unacceptable for latency-sensitive user-facing APIs.
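
The trade-off can be made concrete with back-of-the-envelope arithmetic. Everything in this sketch is an assumption for illustration: the function name, the prices, the 730-hour month, and the simplification that the replica is billed for busy time plus cold-start time.

```python
def compare_scale_to_zero(hourly_cost, busy_hours, cold_starts,
                          cold_start_seconds, hours_in_month=730):
    """Compare an always-on replica with scale-to-zero over one month."""
    always_on = hourly_cost * hours_in_month
    # Assumption: startup time is billed like serving time.
    billed_hours = busy_hours + cold_starts * cold_start_seconds / 3600
    scale_to_zero = hourly_cost * billed_hours
    return {
        "always_on_cost": always_on,
        "scale_to_zero_cost": scale_to_zero,
        "savings": always_on - scale_to_zero,
        # Total time users spend waiting on cold starts.
        "cold_wait_seconds": cold_starts * cold_start_seconds,
    }


if __name__ == "__main__":
    # Hypothetical numbers: a $2/hour GPU, 10 busy hours, 60 one-minute cold starts.
    result = compare_scale_to_zero(hourly_cost=2.0, busy_hours=10,
                                   cold_starts=60, cold_start_seconds=60)
    for key, value in result.items():
        print(f"{key}: {value}")
```

For an infrequent batch job the savings usually dominate; for a chat API, the wait time lands directly on users and may outweigh the cost difference.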
