What is Model Serving?
Model serving is the production practice of deploying machine learning models, including language models, behind APIs or services so that applications can call them reliably at runtime.
How It Works
Model serving turns a trained or downloaded model into a dependable runtime service. For LLMs, serving includes loading model weights, managing GPU memory, tokenization, request routing, batching, streaming, safety checks, observability, autoscaling, and failure handling. The goal is not only to make a model answer, but to make it answer predictably under real traffic, cost, latency, privacy, and reliability constraints. Serving design often determines whether a promising model can become a usable product.
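Request batching, one of the runtime concerns listed above, can be sketched in a few lines. This is a minimal illustration rather than a production design: `MicroBatcher`, `fake_model`, `max_batch`, and `max_wait_s` are hypothetical names, and the "model" is a stand-in that upper-cases its input.

```python
import asyncio
from typing import List


async def fake_model(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for one batched forward pass."""
    await asyncio.sleep(0.01)
    return [p.upper() for p in prompts]


class MicroBatcher:
    """Groups concurrent requests into batches bounded by size and wait time."""

    def __init__(self, max_batch: int = 4, max_wait_s: float = 0.02):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for observability

    async def submit(self, prompt: str) -> str:
        # Each caller gets a future that resolves when its batch completes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then fill the batch
            # until it is full or the wait budget is spent.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = await fake_model([prompt for prompt, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo() -> tuple:
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"req {i}") for i in range(6)))
    worker.cancel()
    return results, batcher.batch_sizes
```

Raising `max_batch` or `max_wait_s` generally trades per-request latency for throughput, which is the balance a serving layer has to tune for its traffic.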
Key Characteristics
- Exposes models through APIs, queues, or application services
- Manages runtime concerns such as scaling, routing, batching, caching, and monitoring
- Handles LLM-specific concerns such as tokenization, KV cache management, streaming, and GPU utilization
- Balances latency, throughput, cost, reliability, and security
- Requires operational controls such as rate limits, rollout strategies, and observability
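As an illustration of one operational control from the list above, rate limiting is often implemented as a token bucket. The sketch below assumes a single process and a hypothetical `TokenBucket` class; the injectable `clock` parameter is an assumption added to make the behavior easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_s`."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For LLM traffic, `cost` could be set to a request's estimated token count instead of 1, so that long prompts consume more of the budget than short ones.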
Common Use Cases
- Deploying an open-weight LLM behind a chat completion API
- Routing requests across model versions or providers
- Autoscaling inference workers during traffic spikes
- Monitoring latency, error rates, token usage, and GPU utilization
- Serving embeddings, rerankers, classifiers, and generative models
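Routing across model versions, as in the second use case above, is commonly done with weighted traffic splits. The following is a minimal sketch under stated assumptions: `route` is a hypothetical helper, the version names are placeholders, and hashing a request key is one possible way to keep a given user on the same version.

```python
import hashlib
from typing import Dict


def route(request_key: str, weights: Dict[str, float]) -> str:
    """Deterministically map a key (e.g. a user ID) to a model version.

    `weights` maps version names to non-negative traffic shares; it must be
    non-empty and sum to a positive number.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # fallback for floating-point rounding at the upper edge
```

Calling `route("user-42", {"model-v1": 0.9, "model-v2": 0.1})` returns the same version on every call for that key, while roughly 10% of distinct keys land on `model-v2`.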
Example
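A minimal sketch of a model exposed behind an HTTP API, using only the Python standard library. The `/predict` path, the JSON request shape, and the `predict` function are assumptions made for illustration; `predict` is a toy stand-in for real inference, and a production service would add batching, authentication, and monitoring around it.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(text: str) -> dict:
    """Hypothetical stand-in for real model inference."""
    return {"label": "long" if len(text.split()) > 5 else "short"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self._send(404, {"error": "not found"})
            return
        try:
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("'text' must be a string")
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing field, or wrong type.
            self._send(400, {"error": "expected a JSON body with a string 'text' field"})
            return
        self._send(200, predict(text))

    def _send(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args) -> None:
        pass  # silence the default per-request stderr logging in this sketch


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Build a server bound to localhost; port 0 picks a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the service, after which a client can POST `{"text": "hello world"}` to `http://127.0.0.1:8000/predict` and receive a JSON response.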
Frequently Asked Questions
How is model serving different from model training?
Training creates or adapts model weights. Serving runs those weights in production so applications can call the model reliably.
What makes LLM serving difficult?
LLM serving must manage large weights, GPU memory, variable sequence lengths, streaming, KV cache, and high cost per request.
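To make the KV cache cost concrete, its size can be estimated with simple arithmetic: one key and one value vector are stored per layer, per KV head, per token. The function below is a rough sketch for a standard transformer decoder; the configuration in the usage note is a hypothetical example, and real servers add overhead from allocation strategy and padding.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16 or bf16 storage
) -> int:
    """Estimate KV cache size: keys plus values for every layer and token."""
    keys_and_values = 2
    return (
        keys_and_values
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 32, 128, seq_len=4096)` gives 2,147,483,648 bytes, or 2 GiB for a single 4,096-token sequence. This is why batch size and context length are limited by GPU memory as well as by compute.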
Is model serving only about APIs?
No. APIs are the interface, but serving also includes scaling, observability, routing, batching, security, and failure recovery.
What should be monitored in model serving?
Monitor end-to-end latency, time to first token (TTFT), throughput, error rates, token usage, queue time, GPU utilization, cache usage, and output safety signals.
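Several of these metrics can be derived from a handful of per-request timestamps. The sketch below assumes a hypothetical `RequestTrace` record and measures everything from the moment the request arrives; where exactly a real system takes these timestamps is a design decision, not something fixed here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    arrived_s: float      # request received by the server
    started_s: float      # request left the queue and inference began
    first_token_s: float  # first output token emitted
    finished_s: float     # last output token emitted
    output_tokens: int


def summarize(trace: RequestTrace) -> Dict[str, float]:
    """Derive queue time, TTFT, total latency, and decode speed."""
    decode_s = trace.finished_s - trace.first_token_s
    # Tokens after the first one are produced during the decode phase.
    decoded = trace.output_tokens - 1
    return {
        "queue_time_s": trace.started_s - trace.arrived_s,
        "ttft_s": trace.first_token_s - trace.arrived_s,
        "latency_s": trace.finished_s - trace.arrived_s,
        "decode_tokens_per_s": decoded / decode_s if decode_s > 0 and decoded > 0 else 0.0,
    }


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail percentiles such as p95 or p99 are usually more informative than averages, because a small share of slow requests can dominate the user experience.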