TL;DR
Mixture of Agents (MoA) is a multi-model collaboration architecture proposed by Together AI in 2024. The core idea is layered LLM collaboration: bottom-layer Proposers generate diverse candidate responses, while upper-layer Aggregators synthesize those candidates into a final high-quality result. This article covers the paper's principles, walks through Python and TypeScript reference implementations that orchestrate GPT-4o and Claude (Gemini and open-weight models can be added as further Proposers), and addresses latency optimization, cost control, and fault tolerance strategies.
Table of Contents
- Key Takeaways
- What is Mixture of Agents?
- MoA Architecture Deep Dive
- Implementation: Building an MoA System
- Advanced Patterns
- Performance Analysis
- Production Deployment
- Best Practices
- FAQ
- Summary and Related Resources
Key Takeaways
- MoA core principle: Multiple LLMs collaborate in layers—Proposer layer provides diverse perspectives, Aggregator layer performs synthesis and fusion
- Fundamental difference from MoE: MoE is sparse activation inside a model; MoA is external orchestration across multiple complete models
- Significant quality gains: Surpasses single GPT-4o by 8.3 percentage points on AlpacaEval 2.0 (65.8% vs 57.5%)
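- Parallel execution is key: Same-layer Proposers run concurrently, so each layer's latency is bounded by its slowest model; layers still run one after another, so end-to-end latency is the sum of the per-layer maxima
- Controllable costs: With dynamic routing, caching, and careful model selection, lightweight MoA configurations can stay around 2-3x single-model cost; deeper pipelines cost more (see the trade-off table under Performance Analysis)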
This is article #16 in the AI Architect Course. We recommend reading MoE Architecture Explained first for background on the internal mixture-of-experts mechanism.
What is Mixture of Agents?
Origin: Together AI's MoA Paper
In 2024, Together AI published the paper "Mixture-of-Agents Enhances Large Language Model Capabilities", proposing a method for multiple LLMs to collaborate in layers to exceed the performance ceiling of any single model. Their key finding is that LLMs exhibit "collaborativeness"—a model tends to produce better responses after seeing outputs from other models.
How MoA Differs from MoE
| Dimension | Mixture of Experts (MoE) | Mixture of Agents (MoA) |
|---|---|---|
| Operating level | Internal model architecture | External inter-model orchestration |
| Building blocks | Expert sub-networks (FFN layers) | Complete LLM model instances |
| Routing mechanism | Token-level Router network | Task-level strategy orchestration |
| Activation pattern | Sparse (Top-K Experts) | Full activation of all Proposers |
| Training required | End-to-end joint training | No training needed, prompt-driven |
| Notable examples | Mixtral, DeepSeek-V2 | Together MoA, custom multi-model pipelines |
Core Idea: Layered Collaboration
MoA draws inspiration from the wisdom of crowds—the aggregate judgment of multiple independent thinkers typically outperforms a single expert. In the LLM context, this manifests as:
- Diversity generation: Different models produce varied perspectives on the same question due to differences in training data, architecture, and alignment strategies
- Quality aggregation: The Aggregator model synthesizes multiple viewpoints, combining strengths and compensating for individual weaknesses
- Iterative refinement: Multi-layer stacking enables progressive quality improvement, with each layer building on the previous one
MoA Architecture Deep Dive
Layered Pipeline Structure
MoA employs a Layered Pipeline architecture with three core roles:
- Proposer: Receives the original query and independently generates candidate responses
- Aggregator: Receives outputs from multiple Proposers and synthesizes a superior response
- Final Synthesizer: The top-level Aggregator that produces the final user-facing answer
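The data flow between these three roles can be sketched offline. The snippet below is a minimal illustration, not the paper's implementation: `make_stub`, `build_aggregation_prompt`, and `run_layers` are hypothetical names, and the stub "models" are plain functions standing in for real LLM calls.

```python
from typing import Callable, List

# A "model" here is any function mapping a prompt to text. These stubs stand in
# for real LLM calls so the data flow can be run without network access.
Model = Callable[[str], str]


def make_stub(name: str) -> Model:
    """Return a fake model that tags its output with its own name."""
    return lambda prompt: f"[{name}] answer to: {prompt[:40]}"


def build_aggregation_prompt(query: str, candidates: List[str]) -> str:
    """Show an upper-layer model the query plus every lower-layer candidate."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return f"Question: {query}\nCandidate responses:\n{numbered}\nSynthesize one answer."


def run_layers(query: str, layers: List[List[Model]], final: Model) -> str:
    """Run Proposers, then intermediate Aggregators, then the Final Synthesizer."""
    # Layer 1: Proposers see only the original query.
    candidates = [model(query) for model in layers[0]]
    # Layers 2..N: each Aggregator sees the query and all previous-layer outputs.
    for layer in layers[1:]:
        prompt = build_aggregation_prompt(query, candidates)
        candidates = [model(prompt) for model in layer]
    # The Final Synthesizer produces the single user-facing answer.
    return final(build_aggregation_prompt(query, candidates))
```

Calling `run_layers("What is MoA?", [[p1, p2, p3], [a1, a2]], final)` runs three Proposers, two Aggregators, and one Final Synthesizer. The loop is sequential for clarity; a real pipeline would run the models within each layer concurrently.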
Communication Protocols
Inter-layer communication uses structured prompt templates:
AGGREGATOR_PROMPT = """You are a high-quality answer synthesizer.
Below are independent responses from multiple AI assistants to the same question:
{proposer_responses}
Synthesize the best aspects of all responses into a comprehensive, accurate, and insightful final answer.
Requirements:
1. Retain correct and unique insights from each response
2. Resolve contradictory information by selecting more well-supported claims
3. Fill any gaps with important missing information
4. Organize the final answer with clear structure
"""
Model Role Assignment Strategy
Different models excel in different domains. MoA architecture should leverage this complementarity:
| Model | Strengths | Recommended Role |
|---|---|---|
| GPT-4o | Instruction following, code generation | Proposer + Aggregator |
| Claude 3.5 Sonnet | Long-form analysis, creative writing | Proposer + Final Synthesizer |
| Gemini 1.5 Pro | Multimodal understanding, long context | Proposer |
| Llama 3.1 70B | Reasoning, mathematics | Proposer (cost-friendly) |
Implementation: Building an MoA System
Python Implementation
Here is a single-layer MoA reference implementation (three parallel Proposers feeding one Aggregator) with per-call timeouts and basic fault tolerance:
import asyncio
import time
from dataclasses import dataclass
from typing import Optional

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI


@dataclass
class ProposerResponse:
    model: str
    content: str
    latency_ms: float
    success: bool


@dataclass
class MoAConfig:
    min_proposers: int = 3          # minimum successful Proposers needed before aggregating
    proposer_timeout: float = 30.0  # per-call timeout in seconds
    max_layers: int = 3             # reserved for multi-layer variants; unused in this single-layer pipeline


class MixtureOfAgents:
    def __init__(self, config: Optional[MoAConfig] = None):
        # Build the default per instance instead of sharing one default object
        self.config = config or MoAConfig()
        # Both clients read their API keys from environment variables
        self.openai = AsyncOpenAI()
        self.anthropic = AsyncAnthropic()

    async def _call_openai(self, prompt: str, model: str = "gpt-4o") -> ProposerResponse:
        start = time.perf_counter()
        try:
            response = await asyncio.wait_for(
                self.openai.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                    max_tokens=2048,
                ),
                timeout=self.config.proposer_timeout,
            )
            return ProposerResponse(
                model=model,
                content=response.choices[0].message.content or "",
                latency_ms=(time.perf_counter() - start) * 1000,
                success=True,
            )
        except Exception:
            # Timeouts and API errors become a failed response instead of propagating
            return ProposerResponse(
                model=model, content="", latency_ms=(time.perf_counter() - start) * 1000, success=False
            )

    async def _call_anthropic(self, prompt: str, model: str = "claude-sonnet-4-20250514") -> ProposerResponse:
        start = time.perf_counter()
        try:
            response = await asyncio.wait_for(
                self.anthropic.messages.create(
                    model=model,
                    max_tokens=2048,
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=self.config.proposer_timeout,
            )
            return ProposerResponse(
                model=model,
                content=response.content[0].text,
                latency_ms=(time.perf_counter() - start) * 1000,
                success=True,
            )
        except Exception:
            return ProposerResponse(
                model=model, content="", latency_ms=(time.perf_counter() - start) * 1000, success=False
            )

    async def propose(self, query: str) -> list[ProposerResponse]:
        """Layer 1: Call multiple Proposers in parallel"""
        tasks = [
            self._call_openai(query, "gpt-4o"),
            self._call_anthropic(query, "claude-sonnet-4-20250514"),
            self._call_openai(query, "gpt-4o-mini"),
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        valid = [r for r in responses if isinstance(r, ProposerResponse) and r.success]
        if len(valid) < self.config.min_proposers:
            raise RuntimeError(
                f"Only {len(valid)} proposers succeeded, need {self.config.min_proposers}"
            )
        return valid

    async def aggregate(self, query: str, proposals: list[ProposerResponse]) -> str:
        """Layer 2+: Aggregate multiple Proposer outputs"""
        proposals_text = "\n\n".join(
            f"--- Response from {p.model} ---\n{p.content}" for p in proposals
        )
        aggregation_prompt = (
            "You are a high-quality answer synthesizer.\n"
            f"Original question: {query}\n"
            "Below are independent responses from multiple AI assistants:\n"
            f"{proposals_text}\n"
            "Synthesize the best aspects of all responses into a comprehensive final answer.\n"
            "Retain unique correct insights, resolve contradictions, and fill gaps."
        )
        response = await self._call_anthropic(aggregation_prompt, "claude-sonnet-4-20250514")
        if not response.success:
            # Surface Aggregator failures instead of silently returning an empty answer
            raise RuntimeError("Aggregator call failed")
        return response.content

    async def run(self, query: str) -> str:
        """Execute the complete MoA pipeline"""
        proposals = await self.propose(query)
        return await self.aggregate(query, proposals)


# Usage example
async def main():
    moa = MixtureOfAgents(MoAConfig(min_proposers=2))
    result = await moa.run("Explain quantum entanglement and its applications in quantum communication")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
TypeScript Implementation
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

interface ProposerResponse {
  model: string;
  content: string;
  latencyMs: number;
  success: boolean;
}

interface MoAConfig {
  minProposers: number;
  proposerTimeoutMs: number;
  maxLayers: number;
}

// Reject if the promise does not settle within `ms`; always clear the timer
// so a finished call does not keep the process alive.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

class MixtureOfAgents {
  private openai: OpenAI;
  private anthropic: Anthropic;
  private config: MoAConfig;

  constructor(config: Partial<MoAConfig> = {}) {
    this.config = {
      minProposers: 3,
      proposerTimeoutMs: 30000,
      maxLayers: 3,
      ...config,
    };
    this.openai = new OpenAI();
    this.anthropic = new Anthropic();
  }

  private async callOpenAI(
    prompt: string,
    model = "gpt-4o"
  ): Promise<ProposerResponse> {
    const start = Date.now();
    try {
      const response = await withTimeout(
        this.openai.chat.completions.create({
          model,
          messages: [{ role: "user", content: prompt }],
          temperature: 0.7,
          max_tokens: 2048,
        }),
        this.config.proposerTimeoutMs
      );
      return {
        model,
        content: response.choices[0].message.content ?? "",
        latencyMs: Date.now() - start,
        success: true,
      };
    } catch {
      return { model, content: "", latencyMs: Date.now() - start, success: false };
    }
  }

  private async callAnthropic(
    prompt: string,
    model = "claude-sonnet-4-20250514"
  ): Promise<ProposerResponse> {
    const start = Date.now();
    try {
      // Same timeout as the OpenAI call, so one slow provider cannot stall the layer
      const response = await withTimeout(
        this.anthropic.messages.create({
          model,
          max_tokens: 2048,
          messages: [{ role: "user", content: prompt }],
        }),
        this.config.proposerTimeoutMs
      );
      const first = response.content[0];
      return {
        model,
        content: first?.type === "text" ? first.text : "",
        latencyMs: Date.now() - start,
        success: true,
      };
    } catch {
      return { model, content: "", latencyMs: Date.now() - start, success: false };
    }
  }

  async propose(query: string): Promise<ProposerResponse[]> {
    const tasks = [
      this.callOpenAI(query, "gpt-4o"),
      this.callAnthropic(query, "claude-sonnet-4-20250514"),
      this.callOpenAI(query, "gpt-4o-mini"),
    ];
    const responses = await Promise.allSettled(tasks);
    const valid = responses
      .filter((r): r is PromiseFulfilledResult<ProposerResponse> =>
        r.status === "fulfilled" && r.value.success
      )
      .map((r) => r.value);
    if (valid.length < this.config.minProposers) {
      throw new Error(
        `Only ${valid.length} proposers succeeded, need ${this.config.minProposers}`
      );
    }
    return valid;
  }

  async aggregate(query: string, proposals: ProposerResponse[]): Promise<string> {
    const proposalsText = proposals
      .map((p) => `--- Response from ${p.model} ---\n${p.content}`)
      .join("\n\n");
    const aggregationPrompt = [
      "You are a high-quality answer synthesizer.",
      `Original question: ${query}`,
      "Below are independent responses from multiple AI assistants:",
      proposalsText,
      "Synthesize the best aspects of all responses into a comprehensive final answer.",
      "Retain unique correct insights, resolve contradictions, and fill gaps.",
    ].join("\n");
    const response = await this.callAnthropic(aggregationPrompt, "claude-sonnet-4-20250514");
    if (!response.success) {
      // Surface Aggregator failures instead of returning an empty answer
      throw new Error("Aggregator call failed");
    }
    return response.content;
  }

  async run(query: string): Promise<string> {
    const proposals = await this.propose(query);
    return this.aggregate(query, proposals);
  }
}

// Usage example (top-level await requires an ES module)
const moa = new MixtureOfAgents({ minProposers: 2 });
const result = await moa.run("Explain quantum entanglement and its applications in quantum communication");
console.log(result);
Prompt Engineering for Proposer and Aggregator Roles
Effective prompt design is critical to MoA quality. Different roles require different prompt strategies:
# Proposer Prompts: encourage unique perspectives
PROPOSER_SYSTEM_PROMPTS = {
"analytical": "You are a rigorous analyst focused on logic and data-driven evidence. Answer from a data and evidence perspective.",
"creative": "You are a creative thinker. Provide novel perspectives and analogies in your response.",
"practical": "You are a hands-on engineer focused on practicality. Answer from an operability and real-world application perspective.",
"critical": "You are a strict reviewer. Identify potential pitfalls and common misconceptions in the topic.",
}
# Aggregator Prompt: structured synthesis
AGGREGATOR_TEMPLATE = """You are a senior knowledge synthesis expert.
## Task
Synthesize the following independent expert responses into an authoritative final answer.
## Original Question
{query}
## Expert Responses
{responses}
## Synthesis Requirements
1. Identify consensus points across responses—these are likely correct
2. For contradictions, select the better-supported argument
3. Merge unique contributions from each response
4. Ensure logical coherence and clear structure in the final answer
5. Flag uncertain information
Provide the synthesized final answer:"""
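Filling a template like this one is a small step, but it is easy to get wrong when a model response itself contains braces. The sketch below is one possible approach under stated assumptions: `TEMPLATE` is a shortened stand-in for the template above, and `render_aggregator_prompt` is a hypothetical helper. It labels responses by index rather than by model name, which is a design choice made here and not something the article prescribes.

```python
import re
from typing import Dict

# Shortened stand-in for AGGREGATOR_TEMPLATE; only the {query} and
# {responses} placeholders matter for this sketch.
TEMPLATE = (
    "## Original Question\n{query}\n"
    "## Expert Responses\n{responses}\n"
    "Provide the synthesized final answer:"
)


def render_aggregator_prompt(query: str, responses: Dict[str, str], template: str = TEMPLATE) -> str:
    """Fill the template with the query and every non-empty response."""
    usable = [text.strip() for text in responses.values() if text and text.strip()]
    if not usable:
        raise ValueError("no non-empty responses to aggregate")
    blocks = "\n\n".join(f"### Expert {i}\n{text}" for i, text in enumerate(usable, start=1))
    values = {"query": query, "responses": blocks}
    # Single-pass substitution: braces inside the query or a response are left
    # untouched, which str.format would reject or misinterpret.
    return re.sub(r"\{(query|responses)\}", lambda m: values[m.group(1)], template)
```

Empty responses are dropped before numbering, so a failed Proposer never shows up as a blank "expert" in the Aggregator's input.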
Advanced Patterns
Dynamic Routing Based on Task Complexity
Not every query needs the full multi-layer MoA pipeline. Simple questions should route directly to a single model, while complex questions activate multi-layer collaboration:
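# Assumes asyncio, MixtureOfAgents, and ProposerResponse from the Python implementation above


class DynamicMoARouter(MixtureOfAgents):
    """Dynamically select MoA depth based on task complexity"""

    async def classify_complexity(self, query: str) -> str:
        response = await self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Classify question complexity. Return exactly one word: simple, medium, or complex",
                },
                {"role": "user", "content": query},
            ],
            max_tokens=10,
        )
        label = (response.choices[0].message.content or "").strip().strip(".").lower()
        # Anything unexpected is treated as complex, so hard questions are never under-served
        return label if label in {"simple", "medium", "complex"} else "complex"

    async def synthesize(self, query: str, drafts: list[str]) -> str:
        """Final Synthesizer: merge the intermediate Aggregator outputs"""
        wrapped = [
            ProposerResponse(model=f"aggregator-{i + 1}", content=d, latency_ms=0.0, success=True)
            for i, d in enumerate(drafts)
        ]
        return await self.aggregate(query, wrapped)

    async def route(self, query: str) -> str:
        complexity = await self.classify_complexity(query)
        if complexity == "simple":
            # Single model direct response
            resp = await self._call_openai(query, "gpt-4o-mini")
            return resp.content
        elif complexity == "medium":
            # Single-layer MoA (3 Proposers + 1 Aggregator)
            proposals = await self.propose(query)
            return await self.aggregate(query, proposals)
        else:
            # Multi-layer MoA (3 Proposers + 2 Aggregators + 1 Synthesizer)
            proposals = await self.propose(query)
            # Both Aggregators share one prompt here; vary the prompt or model
            # per Aggregator to get more diverse intermediate drafts
            agg_results = await asyncio.gather(
                self.aggregate(query, proposals),
                self.aggregate(query, proposals),
            )
            return await self.synthesize(query, list(agg_results))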
Specialization: Assigning Models to Their Strengths
Different models excel in different domains. MoA can select the optimal Proposer combination based on task type:
TASK_MODEL_MAPPING = {
    "code_generation": ["gpt-4o", "claude-sonnet-4-20250514", "deepseek-coder"],
    "creative_writing": ["claude-sonnet-4-20250514", "gpt-4o", "gemini-1.5-pro"],
    "data_analysis": ["gpt-4o", "gemini-1.5-pro", "claude-sonnet-4-20250514"],
    "math_reasoning": ["gpt-4o", "deepseek-math", "claude-sonnet-4-20250514"],
}

DEFAULT_PROPOSERS = ["gpt-4o", "claude-sonnet-4-20250514", "gemini-1.5-pro"]


def select_proposers(task_type: str) -> list[str]:
    """Select the Proposer combination for a task type, falling back to a general-purpose mix"""
    return TASK_MODEL_MAPPING.get(task_type, DEFAULT_PROPOSERS)
Iterative Refinement Loops
Multiple rounds of iteration progressively improve output quality, with each round using the previous round's output as context for the next:
# Method of MixtureOfAgents
async def iterative_refinement(self, query: str, rounds: int = 3) -> str:
    """Iterative refinement: each round uses previous output as context"""
    current_proposals = await self.propose(query)
    for _ in range(rounds - 1):
        aggregated = await self.aggregate(query, current_proposals)
        # Next round's Proposers see previous aggregation as reference
        refined_prompt = (
            f"Original question: {query}\n"
            "Below is the synthesized answer from the previous round:\n"
            f"{aggregated}\n"
            "Based on this answer, provide further supplements, corrections, or deepening. Focus on:\n"
            "- Points the previous round may have missed\n"
            "- Inaccuracies that need correction\n"
            "- Areas that can be explored further"
        )
        current_proposals = await self.propose(refined_prompt)
    return await self.aggregate(query, current_proposals)
Cost Optimization Strategies
class CostOptimizedMoA:
    """Cost-optimized MoA configuration"""

    # USD per 1K tokens. Illustrative snapshot; verify against each provider's current price list.
    COST_PER_1K_TOKENS = {
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
        "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
        "gemini-1.5-pro": {"input": 0.00125, "output": 0.005},
    }
    PROPOSER_MODELS = ["gpt-4o", "claude-sonnet-4-20250514", "gpt-4o-mini"]
    AGGREGATOR_MODEL = "claude-sonnet-4-20250514"

    def estimate_cost(self, query_tokens: int, num_proposers: int, num_layers: int) -> float:
        """Estimate cost in USD for a single MoA invocation"""
        if not 1 <= num_proposers <= len(self.PROPOSER_MODELS):
            raise ValueError(f"num_proposers must be between 1 and {len(self.PROPOSER_MODELS)}")
        avg_output_tokens = 800  # assumed average response length
        proposer_cost = sum(
            (query_tokens / 1000) * self.COST_PER_1K_TOKENS[m]["input"]
            + (avg_output_tokens / 1000) * self.COST_PER_1K_TOKENS[m]["output"]
            for m in self.PROPOSER_MODELS[:num_proposers]
        )
        # The Aggregator reads the query plus every Proposer's output
        aggregator_input = query_tokens + avg_output_tokens * num_proposers
        aggregator_price = self.COST_PER_1K_TOKENS[self.AGGREGATOR_MODEL]
        aggregator_cost = (
            (aggregator_input / 1000) * aggregator_price["input"]
            + (avg_output_tokens / 1000) * aggregator_price["output"]
        )
        # Approximation: every layer is billed like the first (Proposers + one Aggregator)
        return (proposer_cost + aggregator_cost) * num_layers
Performance Analysis
Quality Comparison
| Method | AlpacaEval 2.0 LC Win Rate | MT-Bench | Notes |
|---|---|---|---|
| GPT-4o single model | 57.5% | 9.2 | Baseline |
| Claude 3.5 Sonnet single model | 52.4% | 9.0 | — |
| MoA 2-layer (3 Proposers) | 62.3% | 9.4 | +4.8 pp vs GPT-4o |
| MoA 3-layer (4 Proposers) | 65.8% | 9.5 | +8.3 pp vs GPT-4o |
| MoA 3-layer + iterative refinement | 67.2% | 9.6 | +9.7 pp vs GPT-4o; best configuration |
Latency vs Quality Trade-offs
| Configuration | Avg Latency | Quality Gain | Cost Multiplier | Recommended Use Case |
|---|---|---|---|---|
| Single model | 2-4s | Baseline | 1x | Real-time chat |
| 2 Proposers + 1 Agg | 5-8s | +5-10% | 2.5x | General tasks |
| 3 Proposers + 1 Agg | 6-10s | +10-15% | 3.5x | Important decisions |
| 3 Proposers + 2 layers | 10-18s | +15-20% | 5x | Critical reports |
| 4 Proposers + 3 layers | 18-30s | +18-25% | 8x | Offline batch processing |
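The latency column follows from how the pipeline executes: models within a layer run concurrently, while layers run one after another. A rough estimator under that assumption is sketched below. `estimate_latency` is a hypothetical helper and the numbers are invented for illustration, not measurements.

```python
from typing import List


def estimate_latency(layer_latencies: List[List[float]]) -> float:
    """Estimate end-to-end seconds for a layered pipeline.

    Each inner list holds the per-model latencies of one layer. Models in a
    layer run concurrently, so a layer costs max(...); layers run one after
    another, so the per-layer costs add up.
    """
    return sum(max(layer) for layer in layer_latencies if layer)


# Hypothetical numbers: three Proposers, then one Aggregator.
single_layer = estimate_latency([[2.0, 3.5, 1.0], [4.0]])  # 3.5 + 4.0 = 7.5

# Adding a second aggregation layer adds its slowest member on top.
two_layers = estimate_latency([[2.0, 3.5, 1.0], [4.0, 3.0], [4.5]])  # 3.5 + 4.0 + 4.5 = 12.0
```

The estimate ignores network jitter, rate limiting, and retries, so treat it as a lower bound when planning latency budgets.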
MoA Suitability Analysis
Production Deployment
Async Parallel Execution Architecture
import asyncio
import hashlib

# Assumes MixtureOfAgents from the Python implementation above


class ProductionMoA(MixtureOfAgents):
    def __init__(self, max_concurrent: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.cache: dict[str, str] = {}  # Use Redis in production

    @staticmethod
    def _hash_query(query: str) -> str:
        """Exact-match cache key: SHA-256 of the whitespace-normalized query"""
        return hashlib.sha256(" ".join(query.split()).encode("utf-8")).hexdigest()

    async def run_with_fallback(self, query: str) -> str:
        """Production execution with fallback strategy"""
        # 1. Check cache
        cache_key = self._hash_query(query)
        if cache_key in self.cache:
            return self.cache[cache_key]
        # 2. Execute MoA pipeline
        try:
            async with self.semaphore:
                proposals = await self.propose(query)
                result = await self.aggregate(query, proposals)
        except RuntimeError:
            # Fallback: use best single model when insufficient Proposers
            result = await self._fallback_single_model(query)
        # 3. Cache result (skip empty results so a failed call is not served from cache)
        if result:
            self.cache[cache_key] = result
        return result

    async def _fallback_single_model(self, query: str) -> str:
        """Fallback strategy: revert to single model"""
        response = await self._call_anthropic(query, "claude-sonnet-4-20250514")
        return response.content
Monitoring and Evaluation
Production environments must continuously monitor MoA pipeline health:
interface MoAMetrics {
  totalLatencyMs: number;
  proposerLatencies: Record<string, number>;
  proposerSuccessRate: Record<string, number>;
  aggregationQualityScore: number;
  cacheHitRate: number;
  costPerQuery: number;
}

// Placeholder alert hook (the browser-only alert() does not exist in Node);
// wire this to your paging or chat system in production
function sendAlert(message: string): void {
  console.warn(`[ALERT] ${message}`);
}

function reportMetrics(metrics: MoAMetrics): void {
  // Push to monitoring system (e.g., Prometheus / Datadog)
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    pipeline: "moa",
    ...metrics,
  }));
  // Alert rules (a model with no recorded rate is treated as healthy)
  if ((metrics.proposerSuccessRate["gpt-4o"] ?? 1) < 0.9) {
    sendAlert("GPT-4o success rate dropped below 90%");
  }
  if (metrics.totalLatencyMs > 30000) {
    sendAlert("MoA pipeline latency exceeded 30s threshold");
  }
}
Caching Layer Design
For repeated or similar queries, caching dramatically reduces cost and latency:
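import hashlib
import math


class SemanticCache:
    """Semantic cache: reuse results for similar questions"""

    def __init__(self, embed_fn, similarity_threshold: float = 0.92):
        # embed_fn: async callable mapping text -> list[float], e.g. a wrapper around an embeddings API
        self._embed = embed_fn
        self.threshold = similarity_threshold
        self.embeddings = {}  # query_hash -> embedding
        self.results = {}     # query_hash -> result

    @staticmethod
    def _cosine_similarity(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    async def get_or_compute(self, query: str, compute_fn) -> str:
        query_embedding = await self._embed(query)
        # Linear scan for the closest cached query above the threshold;
        # switch to a vector index once the cache grows large
        best_hash, best_similarity = None, self.threshold
        for cached_hash, cached_embedding in self.embeddings.items():
            similarity = self._cosine_similarity(query_embedding, cached_embedding)
            if similarity > best_similarity:
                best_hash, best_similarity = cached_hash, similarity
        if best_hash is not None:
            return self.results[best_hash]
        # Cache miss: execute computation
        result = await compute_fn(query)
        query_hash = hashlib.sha256(query.encode()).hexdigest()
        self.embeddings[query_hash] = query_embedding
        self.results[query_hash] = result
        return result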
Best Practices
1. Model Diversity Matters More Than Quantity
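Selecting architecturally diverse model combinations tends to be more effective than stacking models from the same family, because models trained on different data with different alignment methods make less correlated errors. For example, a GPT-4o + Claude + Gemini mix is generally a better starting point than three GPT-4o variants.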
2. Use the Strongest Model for Aggregation
The Aggregator's quality directly determines final output quality. Budget permitting, the Aggregator should use the strongest synthesis-capable model (e.g., Claude 3.5 Sonnet or GPT-4o).
3. Set Appropriate Timeouts and Fallback Strategies
Each Proposer should have independent timeout control—one slow model should not bottleneck the entire pipeline. Use N-of-M strategies to ensure partial model failures don't affect the overall flow.
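One way to express an N-of-M policy with independent timeouts is sketched below. It is a minimal illustration using only `asyncio`; `gather_n_of_m` and the `Proposer` type alias are hypothetical names, and each proposer is assumed to be an async function that takes the query and returns text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Proposer = Callable[[str], Awaitable[str]]


async def gather_n_of_m(
    query: str,
    proposers: Dict[str, Proposer],
    min_success: int,
    timeout_s: float,
) -> List[str]:
    """Call every proposer concurrently and require at least min_success answers."""

    async def guarded(call: Proposer) -> str:
        # Each call gets its own deadline, so one slow model cannot stall the rest.
        return await asyncio.wait_for(call(query), timeout=timeout_s)

    results = await asyncio.gather(
        *(guarded(call) for call in proposers.values()),
        return_exceptions=True,  # failures come back as exception objects
    )
    # Keep only real, non-empty answers; timeouts and errors are dropped.
    answers = [r for r in results if isinstance(r, str) and r.strip()]
    if len(answers) < min_success:
        raise RuntimeError(f"only {len(answers)} of {len(proposers)} proposers succeeded")
    return answers
```

This version waits for every call to finish or time out before deciding. A stricter variant could return as soon as `min_success` answers arrive (for example with `asyncio.as_completed`) and cancel the stragglers, trading some diversity for lower latency.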
4. Monitor Cost and Latency at Every Node
Use the JSON Formatter to inspect API response structures, and implement structured logging to track performance metrics at each pipeline node.
5. Start Simple
Don't begin with a 3-layer, 4-Proposer maximum configuration. Start with 2 Proposers + 1 Aggregator and scale up based on actual quality requirements.
6. Leverage Prompt Differentiation
Assigning different system prompts to different Proposers (analytical, creative, critical) produces more diverse candidate responses than using identical prompts.
FAQ
Q: How does Mixture of Agents differ from simple multi-model voting?
A: Majority voting selects only the answer that most models agree on, discarding any unique insights in minority responses. It also needs answers that can be compared for equality, which free-form text rarely allows. MoA's Aggregator instead reads every response, keeps each model's distinct contributions, and fuses them into a more comprehensive answer. Analogy: voting is "pick the most popular"; MoA is "combine all strengths."
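For contrast, a bare-bones majority vote looks like the sketch below (`majority_vote` is a hypothetical helper, not part of any MoA library). It only works when answers are short enough to normalize and compare, and it returns exactly one of the inputs.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer after light normalization."""
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


answers = ["Paris", "paris ", "Lyon"]
voted = majority_vote(answers)  # "paris"; the "Lyon" response is discarded entirely
```

With paragraph-length responses, nearly every answer is unique after normalization, so the vote degenerates into picking an arbitrary response. That is the gap the Aggregator prompt fills.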
Q: Is MoA always better than a single model?
A: No. For simple factual queries (e.g., "What is the capital of France?"), MoA adds unnecessary latency and cost with virtually no quality improvement. MoA's advantages concentrate on complex tasks requiring multi-perspective reasoning.
Q: How do you evaluate MoA pipeline output quality?
A: We recommend the LLM-as-Judge approach—have an independent evaluation model blindly score MoA outputs against single-model outputs. Complement with A/B testing to collect real user preference data.
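A blind pairwise comparison can be sketched as follows. This is an illustration under stated assumptions: `build_blind_pair` and `moa_wins` are hypothetical helpers, and `judge` is any function that takes a prompt and returns "A" or "B" (in practice, a call to the evaluation model). Randomizing which answer appears first is a common guard against position bias.

```python
import random
from typing import Callable, Tuple

Judge = Callable[[str], str]  # takes a prompt, returns "A" or "B"


def build_blind_pair(moa_answer: str, baseline_answer: str, rng: random.Random) -> Tuple[str, str, str]:
    """Shuffle the two answers and return (answer_a, answer_b, label_of_moa)."""
    if rng.random() < 0.5:
        return moa_answer, baseline_answer, "A"
    return baseline_answer, moa_answer, "B"


def moa_wins(query: str, moa_answer: str, baseline_answer: str, judge: Judge, rng: random.Random) -> bool:
    """Ask the judge which anonymous answer is better; True if it picked MoA's."""
    answer_a, answer_b, moa_label = build_blind_pair(moa_answer, baseline_answer, rng)
    prompt = (
        f"Question: {query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )
    # Keep only the first character so replies like "A." or "b" still parse.
    verdict = judge(prompt).strip().upper()[:1]
    return verdict == moa_label
```

Running `moa_wins` over a fixed evaluation set and averaging the results gives a win rate for the MoA pipeline against the single-model baseline.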
Q: Can MoA be combined with RAG?
A: Absolutely. A common pattern adds a RAG retrieval layer before MoA's Proposers, letting each Proposer generate responses based on retrieved context. This provides both knowledge augmentation and multi-model complementarity.
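A minimal sketch of that pattern follows: retrieve once, then hand every Proposer the same context-augmented prompt. `build_rag_prompt` and the `Retriever` signature are assumptions for illustration; the retriever stands in for whatever vector store or search backend is in use.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k passages


def build_rag_prompt(query: str, retrieve: Retriever, k: int = 3) -> str:
    """Retrieve once, then build the prompt every Proposer will receive."""
    passages = retrieve(query, k)
    if not passages:
        return query  # nothing retrieved: fall back to the bare question
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Use the context below when it is relevant and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The resulting string can be passed to the Proposer layer in place of the raw query. Retrieving once and sharing the context keeps retrieval cost constant regardless of how many Proposers run.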
Q: How do you control API costs for self-built MoA?
A: Core strategies include: (1) Dynamic routing of simple tasks to single models; (2) Semantic caching to avoid redundant computation; (3) Using the Token Counter to optimize prompt length and reduce input tokens; (4) Mixing cost-friendly smaller models (e.g., GPT-4o-mini) among Proposers.
Summary and Related Resources
Mixture of Agents represents a paradigm shift in LLM applications—from "choosing the best single model" to "orchestrating multi-model collaboration." Through thoughtful architecture design—parallel Proposers, layered Aggregators, dynamic routing—MoA can significantly exceed any single model's performance ceiling on complex tasks.
The key is selecting the appropriate configuration depth based on actual requirements: simple tasks don't need MoA, moderate tasks use lightweight single-layer configurations, and complex critical tasks employ the full multi-layer pipeline.
Related Resources
Series Articles
- MoE Architecture Explained: Internal Mixture of Experts Mechanism
- Context Engineering: 4-Layer Architecture Patterns
- Multi-Agent Orchestration Patterns: Supervisor vs Swarm vs Hierarchical
Related Tools
- JSON Formatter - Debug API responses
- Token Counter - Optimize prompt length
- Text Diff Tool - Compare model output differences