Claude’s new Research capabilities represent a significant evolution in AI agent architecture—moving from single-agent systems to coordinated multi-agent orchestration at production scale. Anthropic’s engineering team recently shared the architectural principles and hard-won lessons from building this system. Here’s what makes their approach work.
The Core Problem: Why Multi-Agent?
Research tasks are inherently unpredictable. You can’t hardcode a fixed path for exploring complex topics—the process is dynamic and path-dependent. When humans conduct research, they continuously update their approach based on discoveries, following leads that emerge during investigation.
This unpredictability is precisely why AI agents excel at research. But there’s a catch: single agents hit limits. Even a highly capable agent operating alone faces constraints when the work requires:
- Parallel exploration of multiple independent directions
- Context beyond what fits in a single window (>200K tokens)
- Interfacing with numerous complex tools simultaneously
Multi-agent systems solve this through distributed context windows and parallel reasoning capacity.
Performance Impact: The Numbers
Anthropic’s internal evaluations show compelling results:
Multi-agent (Opus 4 lead + Sonnet 4 workers) vs. Single Opus 4:
- 90.2% improvement on internal research benchmarks
- 80% of performance variance explained by token usage alone
- ~15x the tokens of a chat interaction (a single agent uses ~4x)
What Drives Performance?
Analysis of the BrowseComp evaluation (which tests browsing agents’ ability to locate hard-to-find information) revealed three factors explaining 95% of variance:
- Token usage (80% of variance) - More tokens = more reasoning capacity
- Tool calls - Parallel execution enables breadth-first exploration
- Model choice - Upgrading from Sonnet 3.7 to Sonnet 4 beat doubling the token budget on Sonnet 3.7
Example success case: When asked to identify all board members of IT companies in the S&P 500, the multi-agent system decomposed the task across subagents and found the correct answers; the single-agent system, limited to slow sequential searches, failed.
Architecture: Orchestrator-Worker Pattern
The Research system uses a lead agent coordinating specialized subagents that operate in parallel.
The Flow
User Query
↓
LeadResearcher Agent
├─ Analyzes query
├─ Develops strategy
├─ Spawns parallel Subagents
│ ↓
│ Subagent 1: AI agent companies 2025
│ Subagent 2: Market trends analysis
│ Subagent 3: Technical capabilities
│ ↓
│ [Each performs iterative web searches]
│ ↓
├─ Synthesizes findings
├─ Decides: more research needed?
│ ├─ Yes → spawn more subagents
│ └─ No → proceed to citation
↓
CitationAgent
├─ Processes documents
├─ Identifies specific locations
├─ Attributes all claims to sources
↓
Final Research Report (with citations)
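The flow above can be sketched as a minimal orchestrator loop. This is an illustrative stub, not Anthropic’s implementation: `LeadResearcher`, `plan`, and `run_subagent` stand in for LLM calls and real web searches.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    description: str

@dataclass
class LeadResearcher:
    """Illustrative orchestrator: plans, spawns workers, synthesizes."""
    findings: list = field(default_factory=list)

    def plan(self, query: str) -> list:
        # In production this would be an LLM call; here, a stub decomposition.
        return [Subtask(f"{query}: angle {i}") for i in range(3)]

    def run_subagent(self, task: Subtask) -> str:
        # Stand-in for a worker agent doing iterative web searches.
        return f"findings for [{task.description}]"

    def research(self, query: str) -> str:
        for task in self.plan(query):
            self.findings.append(self.run_subagent(task))
        # Synthesis step; a CitationAgent would then attribute claims.
        return " | ".join(self.findings)

report = LeadResearcher().research("AI agent market")
```

In the real system the subagents run in parallel and the lead agent can loop back to spawn more of them; the loop here shows only the delegate-then-synthesize shape.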
Key Architectural Choices
1. Memory Persistence
When context exceeds 200K tokens, truncation is inevitable. The LeadResearcher saves its plan to Memory to retain critical context across truncations.
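A hypothetical sketch of that persistence step: the plan lives outside the context window (a file here; any durable store works), so it can be re-read after truncation. The path and field names are illustrative.

```python
import json
import os
import tempfile

# Hypothetical: persist the lead agent's plan outside the context
# window so it survives truncation.
MEMORY_PATH = os.path.join(tempfile.gettempdir(), "lead_plan.json")

def save_plan(plan: dict) -> None:
    with open(MEMORY_PATH, "w") as f:
        json.dump(plan, f)

def restore_plan() -> dict:
    # Called after context truncation to re-anchor the agent.
    with open(MEMORY_PATH) as f:
        return json.load(f)

save_plan({"objective": "map AI agent market", "done": ["angle 0"]})
plan = restore_plan()
```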
2. Interleaved Thinking
Subagents use Claude’s extended thinking mode to:
- Plan their approach before executing
- Evaluate tool results after each search
- Identify gaps and refine next queries adaptively
3. Dynamic Multi-Step Search
Unlike static RAG (Retrieval Augmented Generation), which fetches chunks based on similarity, the system:
- Dynamically finds relevant information
- Adapts to new findings in real-time
- Analyzes results iteratively to formulate high-quality answers
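The difference from static RAG can be seen in a toy loop: each round’s results shape the next query instead of a single similarity lookup. `search` is a stub over a tiny corpus, and the "adapt the query" step is crude keyword extraction standing in for agent reasoning.

```python
# Sketch of dynamic multi-step search vs. one-shot RAG retrieval.
def search(query: str) -> list:
    corpus = {
        "agents": ["agents overview", "agents need orchestration"],
        "orchestration": ["orchestrator-worker pattern"],
    }
    return corpus.get(query, [])

def iterative_search(query: str, max_rounds: int = 3) -> list:
    results, seen = [], set()
    for _ in range(max_rounds):
        hits = [h for h in search(query) if h not in seen]
        if not hits:
            break
        results.extend(hits)
        seen.update(hits)
        # Adapt the next query to what was just found (stubbed as
        # taking the last keyword; a real agent reasons about gaps).
        query = hits[-1].split()[-1]
    return results

found = iterative_search("agents")
```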
Prompt Engineering Principles for Multi-Agent Systems
Building a production multi-agent system required extensive prompt iteration. Here are the key principles that worked:
1. Think Like Your Agents
Problem: Without understanding agent behavior, prompt changes are shots in the dark.
Solution: Build simulations in Anthropic Console using exact prompts/tools from production. Watch agents work step-by-step.
Failure modes revealed:
- Agents continuing when they already had sufficient results
- Overly verbose search queries
- Incorrect tool selection
2. Teach the Orchestrator to Delegate
Problem: Vague instructions led to duplicated work and gaps.
Early attempts: “Research the semiconductor shortage”
Result:
- One subagent explored 2021 automotive chip crisis
- Two others duplicated work on 2025 supply chains
- No effective division of labor
Solution: Detailed task descriptions with:
- Clear objective
- Expected output format
- Guidance on tools and sources to use
- Explicit task boundaries
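Those four elements can be made concrete as a structured task spec the orchestrator fills in before delegating. The class and field names are illustrative, not Anthropic’s schema.

```python
from dataclasses import dataclass

# Hypothetical structure for a delegated subtask, mirroring the four
# elements above: objective, output format, tool guidance, boundaries.
@dataclass
class TaskSpec:
    objective: str
    output_format: str
    tools: list
    boundaries: str

    def to_prompt(self) -> str:
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Use tools: {', '.join(self.tools)}\n"
            f"Stay within: {self.boundaries}"
        )

spec = TaskSpec(
    objective="2021 automotive chip shortage causes",
    output_format="bulleted summary with sources",
    tools=["web_search"],
    boundaries="do not cover 2025 supply chains",
)
prompt = spec.to_prompt()
```

Explicit boundaries are what prevent the duplicated-work failure from the semiconductor example above.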
3. Scale Effort to Query Complexity
Problem: Agents struggled to judge appropriate effort.
Solution: Embed explicit scaling rules in prompts:
- Simple fact-finding: 1 agent, 3-10 tool calls
- Direct comparisons: 2-4 subagents, 10-15 calls each
- Complex research: 10+ subagents with divided responsibilities
Without these guidelines, early versions over-invested in simple queries.
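The scaling rules above amount to a simple lookup from complexity class to budget. One caveat: the tool-call budget for "complex" is left open here because the guideline only specifies "10+ subagents"; the rest mirrors the listed numbers.

```python
# Effort-scaling rules from the prompt guidelines, as a lookup table.
def budget_for(complexity: str) -> dict:
    table = {
        "simple":  {"subagents": 1,      "tool_calls": (3, 10)},
        "compare": {"subagents": (2, 4), "tool_calls": (10, 15)},
        "complex": {"subagents": 10,     "tool_calls": None},  # open-ended
    }
    return table[complexity]

budget = budget_for("compare")
```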
4. Tool Design is Critical
Insight: Agent-tool interfaces are as critical as human-computer interfaces.
Problem: With MCP servers exposing external tools, agents encounter unseen tools with varying description quality. Bad descriptions send agents down wrong paths.
Solution: Explicit heuristics embedded in prompts:
- Examine all available tools first
- Match tool usage to user intent
- Search web for broad external exploration
- Prefer specialized tools over generic ones
Innovation: Tool-testing agent that:
- Attempts to use flawed MCP tools
- Rewrites tool descriptions to avoid failures
- Tests tools dozens of times to find nuances
- Result: 40% decrease in task completion time for future agents
5. Let Agents Improve Themselves
Discovery: Claude 4 models excel at prompt engineering.
Process:
- Give model a prompt + failure mode
- Model diagnoses why agent is failing
- Model suggests improvements
Application: Tool description improvement resulted in dramatically faster task completion.
6. Start Wide, Then Narrow
Strategy: Mirror expert human research—explore landscape before drilling into specifics.
Problem: Agents default to overly long, specific queries returning few results.
Solution: Prompt agents to:
- Start with short, broad queries
- Evaluate what’s available
- Progressively narrow focus
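Broad-to-narrow querying looks like this in miniature: a short query casts a wide net, and terms from the results narrow the next pass. The corpus and word-matching `search` are stubs.

```python
# Sketch of broad-to-narrow querying over a stub corpus: every word
# in the query must appear in a document for it to match.
def search(query: str) -> list:
    corpus = [
        "agent frameworks overview",
        "agent frameworks benchmark 2025",
        "unrelated article",
    ]
    return [doc for doc in corpus if all(w in doc for w in query.split())]

broad = search("agent")             # wide net first
narrow = search("agent benchmark")  # then drill into a promising term
```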
7. Guide the Thinking Process
Extended thinking serves as a controllable scratchpad:
LeadResearcher uses thinking to:
- Plan approach
- Assess which tools fit the task
- Determine query complexity and subagent count
- Define each subagent’s role
Subagents use interleaved thinking to:
- Evaluate quality of tool results
- Identify gaps
- Refine next query
Impact: Improved instruction-following, reasoning, and efficiency in testing.
8. Parallel Tool Calling Transforms Speed
Early approach: Sequential searches → painfully slow
Two kinds of parallelization:
- Lead agent: Spins up 3-5 subagents in parallel (not serially)
- Subagents: Use 3+ tools in parallel per agent
Result: Up to 90% reduction in research time for complex queries
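The subagent-level speedup is ordinary concurrent I/O. In this sketch, `asyncio.gather` runs three stubbed tool calls at once, so three 0.1-second calls finish in roughly 0.1 seconds instead of 0.3; the tool names are illustrative.

```python
import asyncio
import time

# Three stubbed "tool calls" run concurrently instead of sequentially.
async def tool_call(name: str) -> str:
    await asyncio.sleep(0.1)  # simulated network latency
    return f"{name} done"

async def parallel_search() -> list:
    return await asyncio.gather(
        tool_call("web_search"),
        tool_call("docs_search"),
        tool_call("news_search"),
    )

start = time.monotonic()
results = asyncio.run(parallel_search())
elapsed = time.monotonic() - start  # ~0.1s, not ~0.3s
```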
Evaluation Challenges for Multi-Agent Systems
Traditional evals assume AI follows the same steps each time. Multi-agent systems are non-deterministic—different paths can reach the same correct answer.
Key evaluation insights:
1. Observability is Essential
With multiple agents exploring in parallel, you need:
- Step-by-step execution traces
- Tool call logs for each subagent
- Intermediate thinking outputs
- Success/failure metrics per subtask
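A minimal version of such observability is a structured event log per agent, which can be serialized and replayed when debugging a run. The event schema here is illustrative.

```python
import json

# Minimal trace-logging sketch: record each agent step as a
# structured event so an execution can be inspected after the fact.
trace = []

def log_step(agent: str, kind: str, detail: str) -> None:
    trace.append({"agent": agent, "kind": kind, "detail": detail})

log_step("subagent-1", "tool_call", "web_search('chip shortage')")
log_step("subagent-1", "thinking", "results thin; broaden query")
log_step("lead", "decision", "spawn subagent-2")

serialized = json.dumps(trace)  # persist or ship to a tracing backend
```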
2. Focus on Outcomes, Not Paths
Don’t evaluate whether the agent took a specific sequence of steps. Evaluate whether it:
- Found all required information
- Cited sources correctly
- Arrived at accurate conclusions
- Used resources efficiently
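An outcome-focused grader inspects only the final answer against the criteria above, never the path taken. The grading criteria and sample answer here are illustrative.

```python
# Outcome-focused grading sketch: check what the answer contains,
# not which sequence of steps produced it.
def grade(answer: str, required_facts: list, source_count: int) -> dict:
    return {
        "complete": all(fact in answer for fact in required_facts),
        "cited": source_count > 0,
    }

result = grade(
    "The shortage began in 2021 and eased by 2023. [1][2]",
    required_facts=["2021", "2023"],
    source_count=2,
)
```

Because two runs may take entirely different paths to the same answer, only this kind of end-state check gives a stable pass/fail signal.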
3. Fast Iteration Loops
Anthropic built test cases that:
- Represent real-world complexity
- Cover common failure modes
- Run quickly enough for rapid iteration
- Provide clear signal on regressions
When to Use (and Not Use) Multi-Agent Systems
Good Fit: Tasks With
✅ High parallelization potential (breadth-first exploration)
✅ Information exceeding single context windows
✅ Numerous complex tools requiring specialized handling
✅ Value justifying 15x token cost compared to chat
Poor Fit: Domains With
❌ Few truly parallelizable tasks (e.g., most coding)
❌ Need for all agents to share the same context
❌ Many real-time dependencies between agents
❌ Tight token budgets or low-value tasks
Economic reality: Multi-agent systems burn through tokens fast (15x chat). They require tasks where the value is high enough to pay for increased performance.
Practical Takeaways for Builders
If you’re building multi-agent systems, Anthropic’s lessons translate to these actionable principles:
Architecture
- Use orchestrator-worker patterns for coordinating parallel work
- Persist critical context explicitly (don’t rely on infinite context windows)
- Design for dynamic adaptation, not static pipelines
- Implement interleaved thinking for continuous plan refinement
Prompting
- Simulate before deploying - build exact replicas in sandboxes to observe behavior
- Embed scaling heuristics - teach effort-to-complexity matching explicitly
- Invest in tool descriptions - they’re as important as the tools themselves
- Start broad, narrow iteratively - don’t let agents over-specify too early
Evaluation
- Measure outcomes, not paths - embrace non-determinism in agent behavior
- Build observability first - you can’t improve what you can’t see
- Create fast feedback loops - rapid iteration beats perfect evals
Economics
- Calculate value-to-cost ratio - 15x token usage needs 15x+ value
- Parallelize ruthlessly - it’s the key performance multiplier
- Upgrade models aggressively - Sonnet 4 > 2x Sonnet 3.7 budget
The Future of Multi-Agent Systems
Anthropic’s Research feature demonstrates that multi-agent systems work at production scale when designed with clear principles. The key insight: intelligence scales through coordination, not just through better individual models.
Just as human societies became exponentially more capable through collective intelligence, AI agents will unlock new capabilities through effective multi-agent orchestration. But this requires:
- Thoughtful architecture (orchestrator-worker patterns)
- Extensive prompt engineering (heuristics, not rigid rules)
- Outcome-based evaluation (embracing non-determinism)
- Economic viability (matching cost to task value)
As LLMs continue improving, multi-agent systems become increasingly viable. The 90.2% performance improvement Anthropic achieved suggests we’re still in early days of exploring this space.
The question isn’t whether multi-agent systems will become standard—it’s which architectural patterns will emerge as best practices. Anthropic’s production experience provides a valuable starting point.
Further Reading
- Anthropic Research Feature Announcement
- Claude Extended Thinking Documentation
- Model Context Protocol (MCP)
- Anthropic Console for Agent Simulation
Source: How we built our multi-agent research system - Anthropic Engineering Blog