What is Mixture of Experts?

Mixture of Experts (MoE) is a neural network architecture that uses multiple specialized sub-networks (experts) and a gating mechanism to dynamically route inputs to the most relevant experts, enabling massive model capacity while maintaining computational efficiency.

Quick Facts

Full Name: Mixture of Experts (MoE)
Created: 1991 by Jacobs et al.; popularized in LLMs since 2022

How It Works

Mixture of Experts represents a paradigm shift in scaling language models efficiently. Instead of activating all parameters for every input, MoE models use a router to select a subset of expert networks for each token. This sparse activation allows models to have trillions of parameters while only using a fraction during inference. Notable examples include Mixtral, GPT-4 (rumored), and Google's Switch Transformer. MoE enables better performance per compute by allowing different experts to specialize in different types of knowledge or tasks.

Key Characteristics

  • Sparse activation - only a subset of experts is used per input
  • Gating/routing mechanism selects relevant experts
  • Each expert specializes in different knowledge domains
  • Massive total parameters with efficient inference
  • Load balancing to ensure all experts are utilized
  • Scalable architecture for very large models

Common Use Cases

  1. Large-scale language models like Mixtral and GPT-4
  2. Multi-task learning with specialized experts
  3. Efficient scaling of model capacity
  4. Domain-specific AI systems
  5. Reducing inference costs for large models

Example

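A minimal, self-contained sketch of one MoE layer in plain Python. This assumes a linear router with softmax gating and top-2 selection; the "experts" here are toy scaling functions standing in for real feed-forward networks.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

NUM_EXPERTS, TOP_K, DIM = 4, 2, 3
# Toy experts: each one just scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (0.5, 1.0, 1.5, 2.0)]
# Router weights: one score vector per expert (randomly initialized here).
router = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def moe_layer(x):
    # 1. The router scores every expert for this input.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    probs = softmax(logits)
    # 2. Sparse activation: keep only the top-k experts.
    top = sorted(range(NUM_EXPERTS), key=probs.__getitem__, reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    # 3. Combine the chosen experts' outputs, weighted by router probability.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, top

output, active = moe_layer([0.2, -0.1, 0.4])
print(active)  # indices of the 2 experts actually run for this token
```

Only the two selected experts are evaluated; the other experts' parameters sit in memory but cost no compute for this input.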

Frequently Asked Questions

What is Mixture of Experts (MoE)?

Mixture of Experts is a neural network architecture using multiple specialized sub-networks (experts) and a routing mechanism that dynamically selects which experts process each input. This enables models with massive parameter counts while only activating a fraction during inference, achieving better efficiency.

How does MoE improve model efficiency?

MoE improves efficiency through sparse activation: instead of using all parameters for every input, only a subset of experts (typically 2 out of 8+) are activated per token. This allows models to have trillions of total parameters while inference cost scales with active parameters, not total parameters.
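The arithmetic behind that claim can be made concrete. All parameter counts below are hypothetical, chosen only to illustrate the total-vs-active gap, not taken from any published model:

```python
# Hypothetical MoE sizing: 8 experts, top-2 routing per token.
num_experts = 8
top_k = 2
expert_params = 5_000_000_000    # parameters per expert (assumed)
shared_params = 2_000_000_000    # attention, embeddings, etc. (assumed)

# Memory must hold every expert...
total_params = shared_params + num_experts * expert_params
# ...but per-token compute only touches the top-k experts.
active_params = shared_params + top_k * expert_params

print(total_params, active_params)
```

Inference FLOPs track `active_params`, while memory footprint tracks `total_params` — which is exactly why MoE models are cheap to run per token but expensive to host.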

What models use Mixture of Experts?

Notable MoE models include Mixtral 8x7B and 8x22B from Mistral AI and Google's Switch Transformer and GLaM; GPT-4 is also rumored to use MoE. These models demonstrate that MoE can achieve state-of-the-art performance with improved computational efficiency.

What is the router in MoE?

The router (or gating network) is a learned component that decides which experts should process each input token. It outputs probability scores for each expert, and typically the top-k experts with highest scores are selected. Good routing is crucial for MoE performance.
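A sketch of such a gate, assuming plain softmax scoring followed by top-k selection and renormalization (production routers often add noise, capacity limits, or auxiliary losses on top of this):

```python
import math

def top_k_gate(logits, k=2):
    """Softmax over expert logits, keep the top-k experts,
    and renormalize the kept probabilities to sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Four experts; these example logits favor experts 2 and 0.
print(top_k_gate([1.0, -0.5, 2.0, 0.1]))
```

The returned (index, weight) pairs tell the MoE layer which experts to run and how to mix their outputs.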

What are the challenges of MoE models?

Challenges include load balancing so that all experts are used effectively, increased memory requirements for storing every expert's parameters, communication overhead in distributed training, expert collapse (where the router underutilizes some experts), and more complex training dynamics overall.
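Load balancing is commonly addressed with an auxiliary loss in the style of the Switch Transformer, roughly num_experts · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for it. A toy sketch with made-up values:

```python
def load_balance_loss(assignments, router_probs, num_experts):
    """Switch-Transformer-style auxiliary loss: num_experts * sum(f_i * P_i).
    Minimized (value 1.0) when routing is perfectly uniform."""
    n_tokens = len(assignments)
    # f_i: fraction of tokens whose top expert was i.
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i across tokens.
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Balanced: each of 4 tokens goes to a different expert, uniform probabilities.
uniform = [[0.25] * 4 for _ in range(4)]
print(load_balance_loss([0, 1, 2, 3], uniform, 4))  # 1.0

# Collapsed: every token routed to expert 0 -> loss rises above 1.0.
skewed = [[0.7, 0.1, 0.1, 0.1]] * 4
print(load_balance_loss([0, 0, 0, 0], skewed, 4))
```

Adding a scaled version of this term to the training loss pushes the router toward spreading tokens across experts, counteracting expert collapse.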
