What is Tensor Parallelism?
Tensor Parallelism is a model parallelism strategy that splits large neural network tensors and their computations across multiple accelerators.
How It Works
Tensor parallelism is used when a model layer or weight matrix is too large or too expensive for a single GPU to handle efficiently. Instead of placing different layers on different devices, tensor parallelism partitions operations within a layer, such as matrix multiplications, across multiple accelerators. It is common in LLM inference and training for large models, but it introduces communication overhead because devices must exchange partial results. Good tensor parallel configurations depend on model architecture, interconnect bandwidth, GPU count, batch size, and serving latency goals.
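The partitioning described above can be sketched on a single machine with NumPy. This is only an illustration: the "devices" are array shards, and the concatenation and sum stand in for the all-gather and all-reduce collectives a real multi-GPU setup would run. The shapes and the device count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations: (batch, hidden)
w = rng.standard_normal((8, 6))   # full weight matrix: (hidden, out)

n_devices = 2  # simulated devices; here they are just array shards

# Column-parallel: each device holds a slice of W's output columns.
w_cols = np.split(w, n_devices, axis=1)
partial_cols = [x @ shard for shard in w_cols]    # local matmuls
y_col = np.concatenate(partial_cols, axis=1)      # stands in for all-gather

# Row-parallel: each device holds a slice of W's rows and the
# matching slice of the activation columns.
w_rows = np.split(w, n_devices, axis=0)
x_cols = np.split(x, n_devices, axis=1)
partial_sums = [xs @ ws for xs, ws in zip(x_cols, w_rows)]
y_row = sum(partial_sums)                         # stands in for all-reduce

reference = x @ w
print(np.allclose(y_col, reference), np.allclose(y_row, reference))
```

Both splits reproduce the unsplit result. They differ in which collective is needed afterward: concatenating column shards or summing row-shard partial products.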
Key Characteristics
- Splits tensors and layer computations across multiple GPUs
- Enables serving or training models that exceed single-device memory or compute limits
- Requires collective communication such as all-reduce or all-gather
- Performance depends heavily on interconnect bandwidth and topology
- Often combined with pipeline parallelism, data parallelism, or expert parallelism
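To illustrate the collective-communication point, here is a minimal NumPy sketch of one common arrangement for a transformer MLP block: the first weight matrix is split by columns and the second by rows, so each simulated device can run both matmuls locally and only one sum (standing in for an all-reduce) is needed at the end. The function names, shapes, and the tanh GELU approximation are assumptions made for the example, not taken from any particular framework.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, applied elementwise
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def mlp_reference(x, w1, w2):
    """Unsplit two-layer MLP for comparison."""
    return gelu(x @ w1) @ w2

def mlp_tensor_parallel(x, w1, w2, n_devices):
    """Simulate a column-split first layer and a row-split second layer."""
    w1_shards = np.split(w1, n_devices, axis=1)  # column split
    w2_shards = np.split(w2, n_devices, axis=0)  # row split
    partials = []
    for a, b in zip(w1_shards, w2_shards):
        h = gelu(x @ a)         # elementwise, so valid on a column shard
        partials.append(h @ b)  # partial output on this "device"
    return np.sum(partials, axis=0)  # stands in for a single all-reduce

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))
print(np.allclose(mlp_tensor_parallel(x, w1, w2, 4), mlp_reference(x, w1, w2)))
```

The column-then-row ordering works because the activation is elementwise: each shard of the hidden layer can be activated without seeing the other shards.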
Common Use Cases
- Serving a large LLM that cannot fit on one GPU
- Increasing inference throughput for compute-heavy models
- Running vLLM or similar engines with multiple GPUs
- Training large transformer models across accelerators
- Balancing model size, latency, and hardware cost
Example
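The following is a hedged sketch of how a tensor-parallel launch of vLLM might be prepared. It only builds the argument list for `vllm serve` with the `--tensor-parallel-size` flag and does not start a server. The helper name, the model identifier `my-org/my-model`, and the head count of 32 are hypothetical placeholders. The divisibility check reflects a constraint inference engines commonly impose (attention heads must split evenly across devices), stated here as an assumption.

```python
def build_serve_command(model, tensor_parallel_size, num_attention_heads):
    """Build a `vllm serve` argument list (hypothetical helper).

    Assumes the attention head count must be divisible by the
    tensor-parallel degree so each GPU receives whole heads.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"{num_attention_heads} heads cannot be split evenly "
            f"across {tensor_parallel_size} GPUs"
        )
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]

# "my-org/my-model" and 32 heads are placeholders, not a real configuration.
command = build_serve_command("my-org/my-model", 4, num_attention_heads=32)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on a machine with vLLM installed and at least as many GPUs as the chosen tensor-parallel size.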
Frequently Asked Questions
Is tensor parallelism the same as data parallelism?
No. Data parallelism replicates the full model on each device and splits the input batch, while tensor parallelism splits individual tensors and their computations across devices.
Why does tensor parallelism need fast interconnects?
Devices must exchange partial results during layer computation, so slow communication can erase compute gains.
Does tensor parallelism always improve latency?
No. It reduces per-device work but adds communication overhead, so the net effect depends on the model and the hardware.
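A toy cost model can make this trade-off concrete. The sketch below assumes compute divides evenly across GPUs and that each layer needs one ring all-reduce, which takes 2(N-1) steps, each moving 1/N of the message. All numbers (message size, link bandwidth, per-step latency, compute times) are made-up illustrative values, not measurements.

```python
def layer_latency_us(compute_us, n_gpus, msg_bytes, bw_bytes_per_us, hop_latency_us):
    """Toy per-layer latency: evenly divided compute plus one ring all-reduce."""
    if n_gpus == 1:
        return float(compute_us)  # no communication on a single GPU
    steps = 2 * (n_gpus - 1)      # reduce-scatter phase + all-gather phase
    chunk_bytes = msg_bytes / n_gpus
    comm_us = steps * (hop_latency_us + chunk_bytes / bw_bytes_per_us)
    return compute_us / n_gpus + comm_us

# Hypothetical link: 16 KiB message, 100 GB/s bandwidth, 5 us per step.
link = dict(msg_bytes=16_384, bw_bytes_per_us=100_000, hop_latency_us=5.0)

for compute_us in (20.0, 200.0):
    one = layer_latency_us(compute_us, 1, **link)
    two = layer_latency_us(compute_us, 2, **link)
    print(f"compute={compute_us} us: 1 GPU={one:.1f} us, 2 GPUs={two:.1f} us")
```

Under these assumptions the light layer gets slower on two GPUs because the fixed communication cost outweighs the saved compute, while the heavy layer gets faster.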
When is tensor parallelism necessary?
It is often necessary when model weights, KV cache, or compute requirements exceed what one accelerator can handle.
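A rough back-of-the-envelope check for the weight-memory case might look like the sketch below. It assumes weights shard evenly across GPUs and ignores KV cache, activations, and runtime overhead, so real requirements are higher. The 70-billion-parameter model, 2 bytes per parameter, and 80 GiB per GPU are illustrative assumptions.

```python
GIB = 2**30

def weight_gib_per_gpu(n_params, bytes_per_param, tp_size):
    """Weight memory per GPU in GiB, assuming a perfectly even split."""
    return n_params * bytes_per_param / tp_size / GIB

def min_tp_size(n_params, bytes_per_param, gpu_gib, candidates=(1, 2, 4, 8)):
    """Smallest candidate degree whose weight shard fits; None if none fit."""
    for tp in candidates:
        if weight_gib_per_gpu(n_params, bytes_per_param, tp) <= gpu_gib:
            return tp
    return None

# Illustrative numbers only: 70e9 parameters, 16-bit weights, 80 GiB GPUs.
n_params, bytes_per_param, gpu_gib = 70_000_000_000, 2, 80
for tp in (1, 2, 4):
    per_gpu = weight_gib_per_gpu(n_params, bytes_per_param, tp)
    print(f"tp={tp}: {per_gpu:.1f} GiB of weights per GPU")
print("smallest fitting degree:", min_tp_size(n_params, bytes_per_param, gpu_gib))
```

Because the estimate covers weights only, a degree that barely fits here may still run out of memory once the KV cache and activations are included.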