[Project] 7. Orchestrator Mode — How I Saved Myself from Half a Billion Daily Tokens

2026-06-01 4755 words 23 minutes


Five million tokens a day, I stroll with ease; fifty million tokens a day, I push with vigor; half a billion tokens a day, I’m drenched in sweat.

Background and Problem

The Context Cost of a Single Session

Every LLM call requires passing the full conversation history, meaning context grows linearly with the number of tool calls. After 50 tool calls in a complex task, the 51st call must carry the preceding 50 entries of history — not only is this costly, but the LLM also tends to lose focus and miss critical information in an overly long context.

**Figure 1.1 - Single session context grows linearly with every tool call**
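To make the growth concrete, the sketch below assumes that each tool call adds a fixed number of tokens to the history. The 500 tokens per call is an illustrative assumption, not a measurement, and the sketch ignores the system prompt and prompt caching. The context sent with each call grows linearly, so the total history resent across a session grows roughly quadratically.

```typescript
// Illustrative model: every tool call appends a fixed-size entry to the history.
// The per-call size is an assumption for the example, not measured data.

/** Tokens of history carried by call number `n` (1-based): the n - 1 earlier entries. */
function contextForCall(n: number, tokensPerCall: number): number {
  return (n - 1) * tokensPerCall;
}

/** Total history tokens resent across calls 1..n: (0 + 1 + ... + (n - 1)) * tokensPerCall. */
function cumulativeContext(n: number, tokensPerCall: number): number {
  return ((n * (n - 1)) / 2) * tokensPerCall;
}

const perCall = 500; // assumed tokens per tool call (request + result)
for (const n of [11, 51, 101]) {
  console.log(
    `call ${n}: carries ${contextForCall(n, perCall)} history tokens; ` +
      `${cumulativeContext(n, perCall)} history tokens resent so far`,
  );
}
```

Doubling the number of calls roughly quadruples the cumulative total. This is why long single sessions become expensive so quickly.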

The Degradation Path of an Unconstrained Agent

Without any restrictions, a fully capable agent facing a complex task will gravitate toward the simplest path:

```text
[user]  → implement this feature for me
[agent] → spawn_full_sub_agent(task="implement this feature for me")
// orchestration capability is completely lost, becoming a transparent proxy
```

A second symptom follows: read and write operations get mixed into the same execution flow. Exploratory investigation and modifications share one stream, so when something goes wrong there is no way to trace whether the problem originated from “misreading” or “miswriting.”

Design Goals

This architecture must solve three problems simultaneously:

Context bloat: the trunk session only accumulates decisions, not execution details
Role confusion: enforce strict separation of “investigation” and “modification,” ban all-capable agents
Model resource waste: assign models based on task complexity, avoid using expensive models for simple tasks

Three-Package Architecture Overview

All three problems point to the same root cause: the main session is doing things it shouldn’t — code exploration, file modification, and execution debugging are all mixed into a single context chain. The solution converges in one direction: let the main session handle only planning, outsource execution to specialized sub-units, and control how sub-unit results flow back. Realizing this direction requires three layers of capability — precise search (reducing context overhead from code exploration), execution unit management (enforcing read/write separation), and scheduling policy enforcement (preventing the main session from degrading into an all-capable executor) — corresponding to three collaborating extension packages.

Package Dependency Hierarchy

The three packages form a clear hierarchy:

pi-extension-tool-semble: standalone toolkit, no dependencies, provides semantic code search capabilities
pi-extension-session-subagent: base execution layer, depends on pi-coding-agent, provides the spawn trio
pi-extension-session-orchestrator: scheduling enhancement layer, depends on the subagent package, strengthens the main session’s scheduling capabilities through hooks

**Figure 2.1 - Three-package dependency hierarchy**

Package Responsibility Breakdown

semble: enables the agent to precisely retrieve relevant code snippets instead of blindly searching entire files
subagent: turns “spawn a sub-LLM session to execute a task” into a callable tool
orchestrator: restricts the main session’s toolset, enforces separation of planning and execution, manages session tree branching

Tool Permission Overview

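| Tool | Main Session | read sub-agent | write sub-agent |
|------|--------------|----------------|-----------------|
| read | ✅ | ✅ | ✅ |
| semble_search | ✅ | ✅ | ✅ |
| semble_find_related | ✅ | ✅ | ✅ |
| scratchpad | ✅ | ✅ | ✅ |
| task | ✅ | ✅ | ✅ |
| git_read | ✅ | ✅ | ✅ |
| diff | ✅ | ✅ | ✅ |
| bash | ❌ | ✅ (read-only operations) | ✅ |
| edit | ❌ | ❌ | ✅ |
| write | ❌ | ❌ | ✅ |
| spawn_read_sub_agent | ✅ | ❌ | ❌ |
| spawn_write_sub_agent | ✅ | ❌ | ❌ |
| spawn_full_sub_agent | ❌ | ❌ | ❌ |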

Design note: why retain the read tool for the main session?

The main session retains read-only tools like read, semble_search, and semble_find_related, rather than forcing all reading through spawn_read_sub_agent.

The reason: sometimes the orchestrator already knows what to read, because the task description includes a path or a previous round returned a file list. In such cases, spawning a sub-agent just to read the file and hand the content back uncompressed is pure token waste. The sub-agent's system prompt and context-fork overhead are paid for nothing.

Optimization principle: when you already know what to read, use read / semble directly and bring the information into the trunk’s decision-making. Don’t take a detour just to “follow the process.”

The same optimization applies to writes

The same logic applies to writing. When the orchestrator already knows exactly what to write (content, path, and format are all clear), it should call spawn_write_sub_agent directly and skip the read phase. Confirming the current state first is unnecessary when the answer is already known, so skip it whenever you can.

spawn_read_sub_agent and spawn_write_sub_agent exist for separation of concerns, but separation doesn't mean taking both steps every time. The orchestrator's core responsibility is judging which steps can be omitted.

Tree-Based Context Structure

The core promise of the three-package architecture is: no matter how many steps a sub-task executes, the trunk session stays lean. How is this achieved technically? The answer lies in the data structure of the session storage layer — a tree-based context, not a linear one. This is not an optimization technique; it is the prerequisite for the entire architecture to work.

The Fundamental Flaw of Linear Context

The message history of a traditional LLM session is a chain — everything must be appended to the same line. No matter what tools were executed, how many files were read, how many mistakes were made, how many retries occurred — all of it enters the context, and none of it can be selectively hidden.

**Figure 3.1 - Linear context: all noise enters the main chain**

By step N, the LLM must work within a context cluttered with invalid operations, error messages, and retry history.

Tree Structure: Each Branch Holds Its Own Perspective

A tree-based context allows each node to see only the messages along the path “from root to itself.” Sibling branches don’t interfere with each other, and the parent node cannot see the execution details of child nodes.

**Figure 3.2 - Tree context: trunk only sees lean tool_result, branches hold full execution**

Fork Point: Where Sub-Agents Start Seeing the World

A fork point is the trunk’s current leaf ID. forkWorkspace() points the new SessionManager’s leaf to this location. When a sub-agent starts, it inherits all messages from the trunk up to this point (planning context), but all subsequent new messages are written only to its own branch.

The sub-agent knows “what problem we’re solving” (because it inherits the trunk history), but its exploration process never pollutes the trunk.

Trunk Appends Only Conclusions, Branches Carry the Process

No matter how many steps a sub-agent executes, the trunk has no knowledge of them — the trunk only adds one tool_result (a condensed conclusion) after the fork point. This is guaranteed at the protocol level: pi’s tool protocol dictates that a tool’s return value enters the caller’s context as tool_result, rather than pasting the entire sub-session history.
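The fork-and-append behavior described above can be modeled with a minimal in-memory tree. This is a hypothetical sketch, not pi's SessionManager. The SessionTree class, its entry shape, and its method names are assumptions made for illustration. Each entry stores its parent's ID, so the context for any node is simply the path from that node back to the root.

```typescript
// Hypothetical model of a tree-based session (not pi's SessionManager API):
// every entry records its parent, and a node's context is its root-to-node path.
type TreeEntry = { id: number; parentId: number | null; text: string };

class SessionTree {
  private readonly entries = new Map<number, TreeEntry>();
  private nextId = 1;

  /** Append a message as a child of `parentId`; returns the new entry's ID. */
  append(parentId: number | null, text: string): number {
    const id = this.nextId++;
    this.entries.set(id, { id, parentId, text });
    return id;
  }

  /** The messages a node "sees": the path from the root down to `leafId`. */
  contextFor(leafId: number): string[] {
    const path: string[] = [];
    let cur = this.entries.get(leafId);
    while (cur) {
      path.push(cur.text);
      cur = cur.parentId === null ? undefined : this.entries.get(cur.parentId);
    }
    return path.reverse();
  }
}

const tree = new SessionTree();
const ask = tree.append(null, "user: implement feature X");
const plan = tree.append(ask, "assistant: plan, then delegate to a write sub-agent");

// Fork point = the trunk's current leaf. The sub-agent's 50 steps go on their own branch.
const forkPoint = plan;
let subLeaf = forkPoint;
for (let step = 1; step <= 50; step++) {
  subLeaf = tree.append(subLeaf, `sub-agent step ${step}`);
}

// The trunk appends one condensed tool_result after the fork point
// (simplified: the spawn tool call itself is folded into the planning message).
const trunkLeaf = tree.append(forkPoint, "tool_result: Conclusion / Findings / Risks / Next Steps");

console.log(tree.contextFor(trunkLeaf).length); // 3
console.log(tree.contextFor(subLeaf).length); // 52
```

The trunk's path never passes through the sub-agent's entries. Keeping those 50 steps out of the trunk therefore requires no filtering step; it falls out of the parent links.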

Why Only a Tree Can Keep the Trunk Lean

This is an architectural prerequisite, not an optimization technique.

A linear context fundamentally cannot keep the trunk lean — everything must be appended to the same chain; there is no way to “keep some history elsewhere.” The tree structure, by giving each sub-agent its own session instance (pointing to the same file but with an independent leaf pointer), makes forking and isolation possible at the storage layer.

Parallel Branches: Multiple Agents from the Same Fork Point

Multiple sub-agents can create branches from the same fork point simultaneously, executing in parallel:

**Figure 3.3 - Parallel branches from the same fork point, trunk collects lean results**

Three SessionManager instances point to the same session file but each holds its own independent leaf pointer, writing without interfering with each other.
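The sketch below shows several handles sharing one store while each keeps its own leaf pointer. SharedLog and Handle are hypothetical names. The real SessionManager persists to a session file, while this sketch keeps entries in an array so it runs standalone.

```typescript
// Hypothetical sketch: one append-only store shared by several handles, each with
// its own leaf pointer (mirroring "same session file, separate SessionManager instance").
type LogEntry = { id: number; parentId: number | null; text: string };

class SharedLog {
  readonly entries: LogEntry[] = [];

  add(parentId: number | null, text: string): number {
    const id = this.entries.length;
    this.entries.push({ id, parentId, text });
    return id;
  }
}

class Handle {
  private readonly log: SharedLog;
  leaf: number | null;

  constructor(log: SharedLog, leaf: number | null) {
    this.log = log;
    this.leaf = leaf;
  }

  /** Write under this handle's leaf and advance only this handle's pointer. */
  write(text: string): void {
    this.leaf = this.log.add(this.leaf, text);
  }
}

const sharedLog = new SharedLog();
const trunk = new Handle(sharedLog, null);
trunk.write("user: audit the auth, billing, and search modules");
const forkPoint = trunk.leaf;

// Three sub-agent handles start at the same fork point and write independently.
const workers = ["auth", "billing", "search"].map((area) => {
  const h = new Handle(sharedLog, forkPoint);
  h.write(`${area}: read files, run checks`);
  h.write(`${area}: conclusion`);
  return h;
});

// The trunk appends one lean tool_result per worker; worker steps stay on their branches.
for (const w of workers) trunk.write(`tool_result (${w.leaf})`);

console.log(sharedLog.entries.length); // 10
```

Because every write only appends a new entry and moves the writing handle's own pointer, the handles never overwrite each other. Their branches simply become siblings under the fork point.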

Because sub-agents write to persistent files rather than inMemory, every branch in the session tree is permanently preserved. Using /tree, you can enter any sub-agent’s branch at any time to see which files it read, which commands it executed, and what its reasoning process was — a complete audit trail.

Why Token Savings Occur

The previous section explained how the tree structure makes a lean trunk technically possible. But “lean” is an abstract description: which specific tokens are saved, and by how much? This section translates the architectural advantages into quantifiable token-cost differences through four specific mechanisms. The mechanisms are not all of the same kind. Three of them (mechanisms one, three, and four) are structural guarantees that don't depend on LLM performance, while mechanism two is a probabilistic benefit. The subsection “What's Being Saved Is Structural Overhead, Not Agent Intelligence” elaborates on this distinction.

Root Cause: Context Grows Linearly with Execution

Suppose a complex task requires 10 sub-tasks, each with 50 tool calls. Without any isolation, every LLM call in the 10th sub-task must carry the 9 × 50 = 450 history messages left by the preceding sub-tasks. Context grows linearly with the number of tasks.

**Figure 4.1 - Without orchestrator: context grows by 50 messages per task**

Mechanism One: Trunk Only Sees Conclusions, Details Stay in Branches

In orchestrator mode, after 10 sub-tasks complete, the trunk context has grown by only 10 tool_result entries. All 500 tool calls made inside the sub-tasks reside in branches.

**Figure 4.2 - With orchestrator: trunk context grows by 1 message per task regardless of sub-agent steps**

Trunk context size = O(number of tasks), not O(total tool calls). This is a structural guarantee that does not depend on agent behavior.
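The gap between the two cases is simple arithmetic. Using the running example (10 sub-tasks of 50 tool calls each), this sketch counts the messages in the trunk history when the last sub-task starts and after all sub-tasks finish. Like the text, it counts one message per tool call and ignores user and planning messages.

```typescript
// Messages in the trunk history, counting one message per tool call
// (the same simplification the text uses).
const TASKS = 10;
const CALLS_PER_TASK = 50;

/** Without isolation, every sub-task's tool calls land on the trunk. */
function trunkWithoutOrchestrator(tasksDone: number): number {
  return tasksDone * CALLS_PER_TASK;
}

/** With orchestrator branches, each finished sub-task adds a single tool_result. */
function trunkWithOrchestrator(tasksDone: number): number {
  return tasksDone;
}

// History carried into the 10th sub-task, then the trunk after all 10 finish.
console.log(trunkWithoutOrchestrator(TASKS - 1), trunkWithOrchestrator(TASKS - 1)); // 450 9
console.log(trunkWithoutOrchestrator(TASKS), trunkWithOrchestrator(TASKS)); // 500 10
```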

Mechanism Two: Context Isolation, Each Agent Focused on Its Own Domain

Each sub-agent only has context relevant to its own domain and does not see the execution history of other sub-tasks. Focused context reduces the cognitive burden of “finding key information among 500 history entries,” lowers the probability of hallucination, and reduces retries. Of the four mechanisms, this is the only one that is a probabilistic benefit rather than a structural guarantee.

Mechanism Three: Precise Retrieval, Only Relevant Snippets Enter Context

For a 10,000-line file, a full read passes roughly 10,000 lines into context. semble_search instead returns the top 5 most relevant chunks for a query, each roughly 30 lines, so only about 150 lines enter context. That is a compression ratio of approximately 67:1.

**Figure 4.3 - Semble reduces file content in context by ~67x**
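The ratio follows directly from the numbers above. The defaults below (top 5 chunks of about 30 lines each) are the figures quoted in the text. Real chunk sizes vary with the code.

```typescript
/** Ratio of lines a full read brings into context vs. lines returned by a top-k snippet search. */
function snippetCompressionRatio(fileLines: number, topK = 5, linesPerChunk = 30): number {
  // Search cannot return more lines than the file has, so small files give a ratio of 1.
  const snippetLines = Math.min(fileLines, topK * linesPerChunk);
  return fileLines / snippetLines;
}

console.log(snippetCompressionRatio(10_000).toFixed(1)); // 66.7, i.e. roughly 67:1
```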

Mechanism Four: Model Tiering, Expensive Models Only for Complex Tasks

A task like “list files in src/” is routed to haiku (~$0.25 per million tokens), while “refactor auth module” goes to opus (~$15 per million tokens), a 60× price difference. Content-driven model routing ensures that every call uses a model that is “just enough” for the task.
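As a rough illustration, this sketch prices the same token volume at the two per-million rates quoted above. The rates are the ones given in the text. Actual pricing differs between input and output tokens and changes over time, and the 2M-token volume is an arbitrary example.

```typescript
// Per-million-token prices quoted in the text (illustrative; check current pricing).
const PRICE_PER_MILLION_USD = { haiku: 0.25, opus: 15 } as const;

type ModelName = keyof typeof PRICE_PER_MILLION_USD;

function costUsd(model: ModelName, tokens: number): number {
  return (tokens / 1_000_000) * PRICE_PER_MILLION_USD[model];
}

const exampleTokens = 2_000_000; // arbitrary example volume
console.log(costUsd("haiku", exampleTokens), costUsd("opus", exampleTokens)); // 0.5 30
console.log(PRICE_PER_MILLION_USD.opus / PRICE_PER_MILLION_USD.haiku); // 60
```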

What’s Being Saved Is Structural Overhead, Not Agent Intelligence

Among the four mechanisms, there is a clear dividing line worth highlighting.

Mechanisms one, three, and four are structural guarantees — regardless of how well the LLM performs, the savings happen:

Mechanism one: the tool protocol + session tree branching determines that the trunk only receives tool_result — a mathematical fact
Mechanism three: the semble interception hook forcibly intervenes at the tool_call level, so it triggers every time its conditions are met
Mechanism four: task tier routing executes on every spawn — it does not depend on LLM judgment

Even if every sub-agent is highly inefficient and makes many errors, the trunk context remains lean.

Mechanism two (context isolation) is a probabilistic benefit: focused context → fewer hallucinations → fewer retries → token savings. Every step in this chain is probabilistic — the LLM might still err even in a focused context, or it might succeed on the first try. When the industry says “multi-agent saves more tokens than single-agent,” this second-order effect is what they’re referring to.

This reflects the architecture's conservative stance. Structural guarantees first establish a deterministic baseline, and the intelligence advantage of mechanism two is icing on the cake, not the core guarantee.

Measured in practice, the net result is a reduction in token consumption of roughly 20% (see Actual Results below).

An Equivalent Compression Perspective: Every Spawn Is an Archive

The four mechanisms above approach the problem from different angles. Here, a unified holistic perspective brings them together into a single picture.

Every spawn is essentially a compression-and-archive operation on that phase of work. The sub-agent executes 50 tool calls, generating a large amount of intermediate state — which files were read, which commands were executed, which hypotheses were reasoned about. For the trunk, this content is “compressed” into a single tool_result: Findings, Risks, Next Steps. 50 messages → 1 message, a compression ratio of approximately 50:1.

But this “compression” differs from traditional lossy compression. The accurate description is lossless compression via a lossy side channel:

The trunk sees a lossy summary (tool_result) — execution details are omitted, the trunk context stays lean
The original data is fully preserved in the branch — retrievable at any time via /tree, not actually lost

**Figure 4.4 - Each spawn compresses N steps into 1 tool_result; original preserved in branch**

Traditional compression is destructive — original data may be lost. The “compression” here is archival — the original data is stored elsewhere, the trunk no longer sees it, but it still exists in its entirety. If the trunk needs a certain execution detail, it can enter the branch via /tree to retrieve it.

Mapping this perspective to mechanisms one, three, and four: mechanism one is the direct embodiment of this compression-archive operation; mechanism three (semble precision snippet retrieval) does the same thing inside the sub-agent, but at a finer granularity — not reading the entire file, only fetching relevant chunks; mechanism four (model tiering) decides how expensive a “processor” to use for each archive operation. The three mechanisms unify under the same logic: reduce redundant information entering any layer’s context.

Actual Results

After introducing this architecture, two phenomena are worth recording.

First, session compression is almost never needed anymore. In the past, running complex tasks in a single session would bloat the context until it needed manual compression (or was truncated by auto-compression). Now the trunk context scales with the number of tasks and stays lightweight even after long tasks complete. Compression has essentially disappeared from daily operations.

Second, token consumption decreased by approximately 20%. This figure is the net effect of all four mechanisms minus the sub-agents' own overhead (each spawn has a fixed system-prompt cost). A net saving of 20% shows that the structural savings more than cover the overhead of spawning.

Base Layer: pi-extension-session-subagent

The foundational capability of the entire architecture is “turning an LLM session into a callable tool.” Without this layer, the orchestrator has no execution unit to delegate to. The subagent package does exactly this: it encapsulates the spawn operation into three tools, allowing the caller to launch a sub-session with full reasoning capability just like calling a function.

Three Spawn Tools

The subagent package registers three tools, corresponding to three execution roles:

spawn_read_sub_agent: read-only agent, can only observe and report, cannot modify any files or state
spawn_write_sub_agent: read-write agent, can perform file modifications, command execution, etc.
spawn_full_sub_agent: full toolset, which the orchestrator package removes from the main session’s tool list

Tool Capability Matrix

The toolset for the three agent types comes from DEFAULT_TOOLS_CONFIG in tools-config.ts:

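```typescript
// Excerpt: shape of DEFAULT_TOOLS_CONFIG in tools-config.ts
const DEFAULT_TOOLS_CONFIG = {
  common:     ["read", "scratchpad", "task", "git_read", "diff",
               "semble_search", "semble_find_related"],
  readExtra:  ["bash"],                 // bash is constrained by the READ_AGENT_CONSTRAINTS prompt
  writeExtra: ["bash", "edit", "write"],
};
```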

read agent = common + readExtra, write agent = common + writeExtra.
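Assembling a role's toolset is then a matter of concatenating lists. The sketch below mirrors the config shape above and adds the mainExtra list that the orchestrator package uses for the main session (covered in the scheduling-layer section). The toolsFor helper and the Role type are hypothetical names, not the package's actual exports.

```typescript
// Config shape as described in the post; toolsFor() and Role are hypothetical.
const DEFAULT_TOOLS_CONFIG = {
  common: ["read", "scratchpad", "task", "git_read", "diff", "semble_search", "semble_find_related"],
  readExtra: ["bash"], // bash for read agents is constrained by the READ_AGENT_CONSTRAINTS prompt
  writeExtra: ["bash", "edit", "write"],
  mainExtra: ["spawn_read_sub_agent", "spawn_write_sub_agent"],
};

type Role = "read" | "write" | "main";

function toolsFor(role: Role, config = DEFAULT_TOOLS_CONFIG): string[] {
  const extra = { read: config.readExtra, write: config.writeExtra, main: config.mainExtra }[role];
  // Deduplicate in case a tool appears in both lists.
  return [...new Set([...config.common, ...extra])];
}

console.log(toolsFor("read").includes("edit")); // false
console.log(toolsFor("main").includes("bash")); // false
```

Note that spawn_full_sub_agent appears in none of the lists, so no role built this way can reach it.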

Four Context Modes

fork (default): Knowledge transfer mode. Inherits the parent session’s message history, but distilled — only user messages and assistant text blocks are retained; thinking blocks and all toolCall/tool_use blocks are stripped.
Stripping behavioral signals is intentional: the assistant’s thinking and tool call history would influence the sub-agent’s operation style through in-context learning, causing it to mimic the orchestrator’s behavior patterns even though the system prompt defines it as a read/write worker. After filtering, the sub-agent knows “what problem we’re solving” (from user messages and assistant analytical text), but is not polluted by “how the parent session operates.”
fresh: Blank context, only the system prompt. Suitable for truly self-contained tasks, such as “check if there are files in this directory” — no dependency on any planning information from the parent session.
fork_full: Full session clone, the original history is passed in verbatim without any filtering. Only used in continuation scenarios where the sub-agent plays the same role as the parent session (extremely rare).
auto: Unconditionally routes to fork. If fresh is needed, you must explicitly pass mode="fresh" — do not rely on auto to trigger fresh.
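The fork-mode filtering and the auto routing rule can be sketched as a pure function over the message list. The message and block shapes (a role plus content blocks with a type field) are simplified assumptions, not pi's exact types. Fork mode drops thinking and tool-call blocks. Because only user messages and assistant text are kept, tool-result messages are dropped as well.

```typescript
// Simplified, assumed message shapes; pi's real types differ in detail.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "toolCall"; name: string; args: unknown };

type Message =
  | { role: "user"; content: Block[] }
  | { role: "assistant"; content: Block[] }
  | { role: "toolResult"; content: Block[] };

type ContextMode = "fork" | "fresh" | "fork_full" | "auto";

/** "auto" always resolves to "fork"; "fresh" must be requested explicitly. */
function resolveMode(mode: ContextMode): Exclude<ContextMode, "auto"> {
  return mode === "auto" ? "fork" : mode;
}

/** Build the sub-agent's starting history for the requested mode. */
function initialHistory(parentHistory: Message[], mode: ContextMode): Message[] {
  switch (resolveMode(mode)) {
    case "fresh":
      return []; // only the system prompt
    case "fork_full":
      return parentHistory; // verbatim clone
    case "fork":
      return parentHistory.flatMap((m): Message[] => {
        if (m.role === "user") return [m];
        if (m.role === "assistant") {
          // Keep assistant text only; drop thinking and tool-call blocks.
          const text = m.content.filter((b) => b.type === "text");
          return text.length > 0 ? [{ role: "assistant", content: text }] : [];
        }
        return []; // tool results are neither user messages nor assistant text
      });
  }
}

const parentHistory: Message[] = [
  { role: "user", content: [{ type: "text", text: "Fix the login bug" }] },
  {
    role: "assistant",
    content: [
      { type: "thinking", thinking: "maybe check auth.ts" },
      { type: "text", text: "The bug is likely in session refresh." },
      { type: "toolCall", name: "read", args: { path: "src/auth.ts" } },
    ],
  },
  { role: "toolResult", content: [{ type: "text", text: "...file contents..." }] },
];

const forked = initialHistory(parentHistory, "auto");
console.log(forked.length); // 2
```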

Session Lifecycle

A sub-agent is a session created on-demand and released immediately after execution:

**Figure 5.1 - Sub-agent spawn, execute, and return sequence**

Structured Output Format

All sub-agent system prompts have SUB_AGENT_OUTPUT_GUIDELINES injected, requiring the final reply to contain five fixed sections:

```markdown
## Conclusion   - Core conclusion
## Findings     - Key findings
## Risks        - Risks and caveats
## Open Questions - Unresolved issues
## Next Steps   - Specific follow-up actions
```

When the orchestrator consumes the tool_result, it processes by priority: first check Risks and Open Questions (any blockers), then Findings (establish factual basis), and finally Next Steps (decision recommendations).
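On the consuming side, a reply in this format can be split by its headings and read in the priority order described above. This parser is a hypothetical sketch. The post does not say whether the orchestrator parses the sections programmatically or simply reads them as an LLM would.

```typescript
const SECTIONS = ["Conclusion", "Findings", "Risks", "Open Questions", "Next Steps"] as const;
type Section = (typeof SECTIONS)[number];

/** Split a sub-agent reply into its "## Heading" sections; unknown headings are ignored. */
function parseReport(reply: string): Partial<Record<Section, string>> {
  const lines: Partial<Record<Section, string[]>> = {};
  let current: Section | undefined;
  for (const line of reply.split("\n")) {
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      const name = heading[1];
      current = (SECTIONS as readonly string[]).includes(name) ? (name as Section) : undefined;
      if (current) lines[current] = [];
    } else if (current) {
      lines[current]!.push(line);
    }
  }
  const out: Partial<Record<Section, string>> = {};
  for (const s of SECTIONS) {
    const body = lines[s];
    if (body) out[s] = body.join("\n").trim();
  }
  return out;
}

/** Reading order: blockers first, then facts, then recommendations. */
const READ_ORDER: Section[] = ["Risks", "Open Questions", "Findings", "Next Steps"];

function blockersFirst(report: Partial<Record<Section, string>>): string[] {
  return READ_ORDER.filter((s) => report[s]).map((s) => `${s}: ${report[s]}`);
}

const reply = [
  "## Conclusion", "Token refresh races with logout.",
  "## Findings", "refresh() does not check the session flag.",
  "## Risks", "Changing refresh() affects the mobile client.",
  "## Open Questions", "",
  "## Next Steps", "Add a guard in refresh().",
].join("\n");

console.log(blockersFirst(parseReport(reply))[0]); // Risks: Changing refresh() affects the mobile client.
```

Empty sections (here, Open Questions) are skipped in the reading order. This way an absent blocker does not add noise.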

Read-Only Constraint Injection Principle

The read agent’s constraints do not rely on tool-level sandboxing. Instead, READ_AGENT_CONSTRAINTS is injected through the system prompt, explicitly listing allowed and prohibited bash operations:

Allowed: ls, find, grep, git log, git diff, and other read-only operations
Prohibited: any file writes (>, >>), sed -i, rm, npm install, git commit, and other modification operations

When a read agent discovers something that needs modification, it should describe it in Next Steps, and the orchestrator dispatches a write agent to execute.

Scheduling Layer: pi-extension-session-orchestrator

With the execution units provided by subagent, the next question is: who decides when to dispatch which type of agent, and how to prevent the main session from executing tasks itself. The orchestrator package adds a layer of scheduling policy on top of subagent, using three hooks to forcibly shape the main session into a role that plans but does not execute.

Core Philosophy: Separation of Planning and Execution

The orchestrator's main session does only three things: understand the request, decompose the task, and synthesize the conclusion. It never directly executes bash commands, edits or writes files, or calls APIs. All execution is delegated to sub-agents. The only exception is the read-only direct access described earlier (read, semble_search, semble_find_related).

This is not achieved through prompt constraints — the bash/edit/write tools are physically removed from the main session’s tool list at session_start. Degradation is impossible even if attempted.

Three Hook Interception Points

The orchestrator implements all its capabilities through three hooks, without registering any tools:

**Figure 6.1 - Three hook interception points in orchestrator lifecycle**

Toolset Restriction (session_start)

The session_start hook calls pi.setActiveTools() to restrict the main session’s tools to:

```text
common tools + mainExtra(spawn_read_sub_agent, spawn_write_sub_agent)
```

spawn_full_sub_agent is completely excluded. If the subagent package is not installed (spawn tools not found), the hook sends an error notification to the UI and exits early.
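The session_start logic boils down to three steps: compute the allowed list, check that the spawn tools exist, and call setActiveTools. In this sketch, PiLike is a hypothetical stand-in for the extension API. Only setActiveTools() is named in the post. getAllToolNames() and notify() are assumptions, and the mock at the end exists only to make the example runnable.

```typescript
// PiLike is a hypothetical stand-in for the pi extension API surface used here.
// Only setActiveTools() is named in the post; the other members are assumptions.
interface PiLike {
  getAllToolNames(): string[];
  setActiveTools(names: string[]): void;
  notify(message: string, level: "info" | "error"): void;
}

const COMMON = ["read", "scratchpad", "task", "git_read", "diff", "semble_search", "semble_find_related"];
const MAIN_EXTRA = ["spawn_read_sub_agent", "spawn_write_sub_agent"];

function onSessionStart(pi: PiLike): boolean {
  const registered = new Set(pi.getAllToolNames());
  if (!MAIN_EXTRA.every((t) => registered.has(t))) {
    pi.notify("orchestrator: spawn tools not found; is the subagent package installed?", "error");
    return false; // exit early, leaving the toolset untouched
  }
  // bash/edit/write and spawn_full_sub_agent are never in this list, so they are removed.
  // (This sketch also drops any listed tool that is not actually registered.)
  pi.setActiveTools([...COMMON, ...MAIN_EXTRA].filter((t) => registered.has(t)));
  return true;
}

// Mock host so the sketch runs standalone.
let active: string[] = [];
const mockPi: PiLike = {
  getAllToolNames: () => [...COMMON, "bash", "edit", "write", ...MAIN_EXTRA, "spawn_full_sub_agent"],
  setActiveTools: (names) => {
    active = names;
  },
  notify: (message, level) => console.log(`[${level}] ${message}`),
};

onSessionStart(mockPi);
console.log(active.includes("bash"), active.includes("spawn_full_sub_agent")); // false false
```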

Orchestrator Prompt Injection (before_agent_start)

The before_agent_start hook appends an [ORCHESTRATOR MODE] section at the end of the system prompt, clearly informing the agent:

You are a scheduler, only plan and delegate, do not execute directly
You have read-only direct access (read, semble_search, semble_find_related)
You do not have bash/edit/write tools
Workflow: understand → task(plan) → spawn_read(only when uncertain) → spawn_write → synthesize

The hook also re-asserts the toolset (preventing other extension hooks from restoring removed tools in between).
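A minimal sketch of what this hook does to the prompt follows. The section wording paraphrases the four points above, because the exact prompt text is not shown in the post. The way the hook hands the modified prompt back to pi is also not shown, so the function signature is hypothetical. The sketch adds a guard so that running it twice does not duplicate the section.

```typescript
const ORCHESTRATOR_HEADER = "[ORCHESTRATOR MODE]";

// Paraphrase of the points listed in the post, not the extension's exact wording.
const ORCHESTRATOR_SECTION = [
  ORCHESTRATOR_HEADER,
  "- You are a scheduler: plan and delegate, do not execute directly.",
  "- Read-only direct access: read, semble_search, semble_find_related.",
  "- You do not have bash, edit, or write.",
  "- Workflow: understand -> task (plan) -> spawn_read (only when uncertain) -> spawn_write -> synthesize.",
].join("\n");

/** Hypothetical hook body: re-asserts tools, then returns the prompt with the section appended once. */
function beforeAgentStart(
  systemPrompt: string,
  reassertTools: () => void, // e.g. a closure that calls setActiveTools() again
): string {
  reassertTools(); // undo any tools another extension re-enabled since session_start
  if (systemPrompt.includes(ORCHESTRATOR_HEADER)) return systemPrompt;
  return `${systemPrompt}\n\n${ORCHESTRATOR_SECTION}`;
}

let reasserted = 0;
const once = beforeAgentStart("You are a coding agent.", () => reasserted++);
const twice = beforeAgentStart(once, () => reasserted++);
console.log(twice === once, reasserted); // true 2
```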

Workspace Branching and Trunk Leanness (tool_call)

The tool_call hook intercepts all spawn calls and does two things:

Model selection: analyzes the task description content and selects a model matching the complexity
Workspace injection: calls forkWorkspace() to create a branched SessionManager, injected into the spawn parameters’ _workspaceSessionManager field

The sub-agent uses this workspace SM instead of an inMemory SM. All its messages are written to a branch of the session tree; the trunk only appends a single tool_result.
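Put together, the spawn interception amounts to rewriting the tool input before the spawn tool runs. The event shape, the `model` input field, and the injected pickModel and forkWorkspace functions are assumptions for this sketch. Only the `_workspaceSessionManager` field name and the `task` parameter come from the post.

```typescript
// Assumed event shape; pi's real tool_call event may differ.
interface ToolCallEvent {
  toolName: string;
  input: Record<string, unknown>;
}

// Hypothetical dependencies injected so the sketch stays self-contained.
interface SpawnDeps<SM> {
  pickModel(task: string): string; // content-driven tier routing
  forkWorkspace(): SM; // branched SessionManager at the trunk's current leaf
}

const SPAWN_TOOLS = new Set(["spawn_read_sub_agent", "spawn_write_sub_agent"]);

function onToolCall<SM>(event: ToolCallEvent, deps: SpawnDeps<SM>): void {
  if (!SPAWN_TOOLS.has(event.toolName)) return; // not a spawn: leave untouched
  const task = String(event.input.task ?? "");
  event.input.model = deps.pickModel(task);
  event.input._workspaceSessionManager = deps.forkWorkspace();
}

const ev: ToolCallEvent = { toolName: "spawn_write_sub_agent", input: { task: "refactor auth module" } };
onToolCall(ev, {
  pickModel: (task) => (/refactor/i.test(task) ? "high-tier-model" : "medium-tier-model"),
  forkWorkspace: () => ({ leaf: "trunk-leaf-id" }),
});
console.log(ev.input.model); // high-tier-model
```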

forkWorkspace Implementation Principle

The implementation is very clean, using only two SessionManager APIs:

```typescript
const ws = SessionManager.open(sessionFile);  // same file, independent instance
ws.branch(trunkLeafId);                       // leaf points to current trunk position
// next time the sub-agent writes, it automatically creates a new branch at this fork point
```

**Figure 6.2 - Trunk only appends tool_result; branch holds full execution**

Content-Driven Model Tiering (Task Tier)

Model selection is based on the content complexity of the task description, not the agent type (read/write):

```text
COMPLEX_TASK = /architect|refactor|migrat|implement|debug|fix|rewrite/i  → high tier
SIMPLE_TASK  = /list|find|grep|count|search|summarize|check if/i          → low tier
otherwise                                                                 → medium tier
```

The actual models corresponding to each tier come from ~/.pi/agent/model-routing.json, or are auto-detected by cost + name pattern (opus → high, sonnet → medium, haiku → low). The main session always uses the user’s currently selected model without overriding.
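The routing rule can be sketched as follows. The two regexes are copied from above. The check order (complex before simple), the name-pattern fallback, and the function names are assumptions. Because the patterns have no word boundaries and the complex pattern is checked first in this sketch, a task like “find the bug and fix it” matches `fix` and lands in the high tier.

```typescript
type Tier = "high" | "medium" | "low";

// Patterns as given in the post.
const COMPLEX_TASK = /architect|refactor|migrat|implement|debug|fix|rewrite/i;
const SIMPLE_TASK = /list|find|grep|count|search|summarize|check if/i;

/** Classify by task content; this sketch checks the complex pattern first. */
function taskTier(task: string): Tier {
  if (COMPLEX_TASK.test(task)) return "high";
  if (SIMPLE_TASK.test(task)) return "low";
  return "medium";
}

/** Fallback when model-routing.json is absent: infer a model's tier from its name. */
function tierFromModelName(model: string): Tier | undefined {
  if (/opus/i.test(model)) return "high";
  if (/sonnet/i.test(model)) return "medium";
  if (/haiku/i.test(model)) return "low";
  return undefined; // the post also mentions using cost, which this sketch omits
}

console.log(taskTier("list files in src/")); // low
console.log(taskTier("refactor auth module")); // high
console.log(taskTier("explain the config flow")); // medium
```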

/tree Visibility

Because sub-agents write to persistent session tree branches (not inMemory), users can navigate to any sub-agent’s branch via /tree after the task completes and view its full execution process — every file read, every bash command, every reasoning step is preserved there.

Search Layer: pi-extension-tool-semble

Separation of planning and execution solves the “who does it” problem, but there is still a hidden overhead not yet addressed: code exploration. Whether it’s the orchestrator or a sub-agent, handling code tasks requires locating relevant files and functions. If every search relies on reading entire files or exhaustive grep, the trunk context savings are offset by code-reading overhead. The semble package addresses exactly this problem.

semble_search / semble_find_related

Two tools encapsulate the semble CLI:

semble_search: natural language or symbol name semantic search, returns the most relevant code chunks (default top 5)
semble_find_related: given file:line, finds similar implementations in the project, used for lateral exploration

Both are more token-efficient than grep/read: semble only returns relevant snippets, not the entire file.

bash grep Auto-Rewrite

The tool_call hook intercepts bash calls and uses regex to match the following two patterns, automatically replacing the commands:

```shell
# Intercepted and rewritten to semble search
grep -r "pattern" ./src
rg "pattern" ./src

# After rewrite
semble search "pattern" "/abs/path/to/src"
```

Compound commands (containing |, &, ;) are not rewritten to avoid breaking pipeline logic.
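The rewrite rule can be sketched as a single regex plus a guard for compound commands. The extension's actual regexes are not shown in the post. This version only handles the two shapes above (`grep -r` or `rg`, a double-quoted pattern, and one directory). posixResolve is a small stand-in for path resolution, used so the example has no dependencies.

```typescript
/** Minimal POSIX path resolution (stand-in for path.resolve). */
function posixResolve(cwd: string, p: string): string {
  const parts = (p.startsWith("/") ? p : `${cwd}/${p}`).split("/");
  const out: string[] = [];
  for (const part of parts) {
    if (part === "" || part === ".") continue;
    if (part === "..") out.pop();
    else out.push(part);
  }
  return "/" + out.join("/");
}

// grep -r "pattern" dir  |  rg "pattern" dir   (double-quoted pattern only, in this sketch)
const GREP_LIKE = /^\s*(?:grep\s+-r|rg)\s+"([^"]+)"\s+(\S+)\s*$/;

/** Returns the rewritten command, or undefined when the command should run unchanged. */
function rewriteToSemble(command: string, cwd: string): string | undefined {
  if (/[|&;]/.test(command)) return undefined; // compound commands are left alone
  const m = GREP_LIKE.exec(command);
  if (!m) return undefined;
  const [, pattern, dir] = m;
  return `semble search "${pattern}" "${posixResolve(cwd, dir)}"`;
}

console.log(rewriteToSemble('grep -r "pattern" ./src', "/repo")); // semble search "pattern" "/repo/src"
console.log(rewriteToSemble('rg "pattern" ./src | head', "/repo")); // undefined
```

The compound-command check is deliberately blunt. Any `|`, `&`, or `;` anywhere in the command, even inside the quoted pattern, skips the rewrite. This errs on the side of running the original command.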

The tool_call hook also intercepts read tool calls and blocks a call when all of the following conditions hold:

The call does not specify offset or limit (indicating a full file read)
The file’s estimated line count exceeds 300 (LINE_THRESHOLD, overridable via environment variable)
The directory containing this file has not been searched by semble yet (SearchTracker.hasSearched())

If all three conditions hold, the hook returns block: true along with a message telling the agent to locate the code with semble first and then, once the target file is confirmed, read only the relevant snippet.
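The three conditions translate into a small predicate. The ReadCall shape, the GateDeps callbacks, and the lineThreshold parameter are assumptions. The post only says the threshold is 300 lines (LINE_THRESHOLD) and can be overridden through an environment variable, without naming it.

```typescript
interface ReadCall {
  path: string;
  offset?: number;
  limit?: number;
}

// Hypothetical helpers injected by the caller.
interface GateDeps {
  estimateLines(path: string): number; // e.g. estimated from file size
  hasSearched(dir: string): boolean; // backed by SearchTracker
}

function dirnameOf(path: string): string {
  const i = path.lastIndexOf("/");
  return i > 0 ? path.slice(0, i) : i === 0 ? "/" : ".";
}

/** Returns a block result when a blind full read of a large file should be refused. */
function checkRead(
  call: ReadCall,
  deps: GateDeps,
  lineThreshold = 300,
): { block: true; reason: string } | undefined {
  const fullRead = call.offset === undefined && call.limit === undefined;
  if (!fullRead) return undefined;
  if (deps.estimateLines(call.path) <= lineThreshold) return undefined;
  if (deps.hasSearched(dirnameOf(call.path))) return undefined;
  return {
    block: true,
    reason: "Large file: locate the relevant code with semble_search first, then read only that range.",
  };
}

const deps: GateDeps = { estimateLines: () => 1200, hasSearched: (dir) => dir === "/repo/lib" };
console.log(checkRead({ path: "/repo/src/auth.ts" }, deps)?.block); // true
console.log(checkRead({ path: "/repo/src/auth.ts", offset: 100, limit: 40 }, deps)); // undefined
```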

SearchTracker State Management

SearchTracker records every directory searched by semble_search:

**Figure 7.1 - SearchTracker gates blind file reads within a turn**

Records are cleared at the start of each turn, ensuring that cross-turn passes are not mistakenly granted.
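SearchTracker itself can be as small as a set of directories with a per-turn reset. The class below sketches the behavior described above and is not the package's source. Treating a search rooted at a parent directory as covering its subdirectories is an assumption made here.

```typescript
class SearchTracker {
  private readonly searched = new Set<string>();

  /** Called after each semble_search with the directory it was scoped to. */
  record(dir: string): void {
    this.searched.add(dir.replace(/\/+$/, "") || "/");
  }

  /** True if `dir` or (assumption) one of its ancestors was searched this turn. */
  hasSearched(dir: string): boolean {
    const target = dir.replace(/\/+$/, "") || "/";
    for (const d of this.searched) {
      if (target === d || target.startsWith(d === "/" ? "/" : `${d}/`)) return true;
    }
    return false;
  }

  /** Called at the start of every turn so passes never carry over. */
  resetForNewTurn(): void {
    this.searched.clear();
  }
}

const tracker = new SearchTracker();
tracker.record("/repo/src");
console.log(tracker.hasSearched("/repo/src/auth")); // true
console.log(tracker.hasSearched("/repo/srcgen")); // false
tracker.resetForNewTurn();
console.log(tracker.hasSearched("/repo/src")); // false
```

The trailing-slash comparison keeps a sibling such as /repo/srcgen from being mistaken for a child of /repo/src.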

Collaborative Workflow

Having understood the role of each component and the principle of token savings, let’s look at how they collaborate in actual tasks. A typical orchestrator task goes through these phases: planning → reconnaissance → implementation → verification. Each phase corresponds to a different combination of tool calls.

Standard Execution Sequence

The complete orchestrator workflow:

**Figure 8.1 - Full orchestrator execution sequence**

Skip-Read Optimization Path

If the orchestrator already knows the target file and modification content from previous context, it dispatches a write sub-agent directly, skipping the read phase:

```text
❌ Waste: spawn_read → reads already-known information → spawn_write
✅ Correct: spawn_write directly (known information is already in the trunk context)
```

The orchestrator system prompt explicitly emphasizes: “skip the read phase when you already have enough context.”

Context Flow

User messages → enter the trunk, visible to the orchestrator
Trunk tool calls (spawn) → fork point created, recorded in trunk history
Sub-agent execution (50 steps) → all written to branch, invisible to trunk
tool_result → written to trunk, visible to orchestrator (condensed conclusion)
scratchpad → stored in tool_result details, auto-restored after branch switch, persists across spawns

Tool Configuration Customization

DEFAULT_TOOLS_CONFIG defines the default tool whitelist, supporting two levels of override:

~/.pi/agent/tools.json: global user configuration
<cwd>/.pi/agent/tools.json: project-level configuration (higher priority)

The latter overrides the former, and both override defaults. You can add extra tools (such as db_query) for specific projects without affecting other projects.
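The override order can be sketched as a shallow merge, lowest priority first. The post does not state whether a key in tools.json replaces the default list or extends it. This sketch assumes replacement per key, so a project that wants db_query lists the full toolset for that role. Reading the two JSON files is left to the caller to keep the example dependency-free.

```typescript
interface ToolsConfig {
  common: string[];
  readExtra: string[];
  writeExtra: string[];
}

const DEFAULTS: ToolsConfig = {
  common: ["read", "scratchpad", "task", "git_read", "diff", "semble_search", "semble_find_related"],
  readExtra: ["bash"],
  writeExtra: ["bash", "edit", "write"],
};

/**
 * Merge order: defaults < ~/.pi/agent/tools.json < <cwd>/.pi/agent/tools.json.
 * Each argument is the parsed JSON of that file, or undefined if the file is absent.
 * Assumption: a key present in a file replaces the whole list for that key.
 */
function resolveToolsConfig(globalCfg?: Partial<ToolsConfig>, projectCfg?: Partial<ToolsConfig>): ToolsConfig {
  return { ...DEFAULTS, ...globalCfg, ...projectCfg };
}

const cfg = resolveToolsConfig(
  { readExtra: ["bash"] },
  { writeExtra: ["bash", "edit", "write", "db_query"] }, // project adds db_query for write agents
);
console.log(cfg.writeExtra.includes("db_query"), cfg.common.length); // true 7
```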

Pattern Analysis

Having described the architectural details, let’s return to a qualitative question: what pattern is this architecture? What are its essential differences from similar concepts — the router pattern and multi-agent systems? Clarifying these boundaries helps position it within a broader technical context and helps determine which scenarios are suitable for it and which aren’t.

Distinction from the Router Pattern

The core of the router pattern is single-shot dispatch: receive a request → determine the type → forward to the corresponding handler → return the result. The router itself is stateless, does not hold task context, does not synthesize conclusions, and does not decide next steps.

The orchestrator is fundamentally different:

Has global state: scratchpad and task persist across multiple spawn rounds
Multi-round iteration: decides next steps based on each round’s results, not one-shot dispatch
Active synthesis: consolidates conclusions from multiple sub-agents into a final answer
Decision loop: spawn_read → evaluate → spawn_write → verify → synthesize

**Figure 9.1 - Router (stateless dispatch) vs Orchestrator (stateful multi-round delegation)**

Distinction from True Multi-Agent Systems (MAS)

Core elements of classical MAS (Multi-Agent System):

| Element | Classical MAS | This Architecture |
|---------|---------------|-------------------|
| Autonomy | agents decide when to act autonomously | sub-agents passively wait for spawn |
| Peer Communication | agents can message each other directly | only orchestrator ↔ sub-agent (task in, result out); no sub-agent ↔ sub-agent channel |
| Persistent Existence | agents run continuously with their own goals | sub-agents are destroyed after execution |
| Shared Environment Awareness | multiple agents actively perceive the same environment changes | sub-agents only perceive the task description injected by the orchestrator |

A sub-agent is essentially a function call that can reason — it takes input (task description + fork context), produces output (conclusion), and disappears. It has no objective function of its own, no active perception of the environment, and no ability to communicate with sibling agents.

True MAS requires persistent agent loops, a shared read-write environment (a blackboard or message bus), and agents that autonomously decide actions based on what they perceive. When the industry calls this kind of orchestrator architecture “multi-agent,” it is using the term loosely.

The Essential Positioning of This Architecture

The precise positioning is a three-layer composition:

**Figure 9.2 - Orchestrator architecture as composition of three patterns**

Hierarchical Agent: single decision center (orchestrator), where the execution tool happens to be an LLM
CQRS (Command Query Responsibility Segregation): read/write forced separation, achieved by physically removing tools rather than by convention
Tree-based Context Management: branch isolation + trunk leanness + permanent audit trail

It’s not MAS, it’s not Router — it’s a hierarchical delegation pattern that prevents role degradation through capability boundary enforcement.

Applicability Boundary: What’s Being Saved Is the Cost of Uncertainty

Having walked through the entire architecture in detail, we can step back and look at a more fundamental question: what exactly is this mechanism saving? The answer is — the context generated by the uncertainty of the exploration process.

The main session lacks bash permissions not because it is untrusted, but because the exploration process is inherently uncertain. How many files need to be read? How many errors will be encountered? How many retry rounds are needed? If these unpredictable steps happen on the trunk, all of them settle into its context. Outsourcing exploration to sub-agents essentially “containerizes” this uncertainty:

read sub-agent is the container for exploratory reading. Wrong reads, misreads, retries — all stay in the branch. The trunk only sees the conclusion.
write sub-agent is the container for exploratory writing. Trial edits, test failures, more edits — the entire process stays in the branch. The trunk has no knowledge of it.

This means the benefit of sub-agents is proportional to the uncertainty of the task. The higher the uncertainty — not knowing where to start, complex file structure, lots of exploration needed — the more significant the spawn benefit.

Conversely, spawn is unnecessary for things already certain. If the orchestrator already knows which file to read and what to change, doing it directly with the read or edit/write tools is actually more efficient — each spawn has a fixed system prompt cost and context fork overhead. Paying this cost for “certain things” is pure waste.

Extreme case: if the entire task is deterministic from start to finish (path known, change content clear), it’s perfectly reasonable for the main session to handle it directly. This is not “degradation” of the architecture, but correctly recognizing spawn’s applicability boundary — what it solves is always the context bloat caused by uncertainty. Deterministic tasks were never its target.

```text
High uncertainty → spawn sub-agent (isolate exploration in branches)
Low uncertainty  → main session does it directly (save spawn's fixed cost)
```

This also explains a seemingly odd design decision: why does the orchestrator sometimes call read directly and sometimes spawn_read_sub_agent? The difference is not the operation type but whether you already know what to read. If you don't, hand the exploration to a sub-agent; if you do, just read directly.