[Project] 7. Orchestrator Mode — How I Saved Myself from Half a Billion Daily Tokens

Five million tokens a day, I stroll with ease; fifty million tokens a day, I push with vigor; half a billion tokens a day, I’m drenched in sweat.
Background and Problem
The Context Cost of a Single Session
Every LLM call requires passing the full conversation history, meaning context grows linearly with the number of tool calls. After 50 tool calls in a complex task, the 51st call must carry the preceding 50 entries of history — not only is this costly, but the LLM also tends to lose focus and miss critical information in an overly long context.
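To make the cost concrete, here is a minimal sketch of the arithmetic, assuming each tool call adds exactly one history entry and every call re-sends all entries before it. The function name is hypothetical and only illustrates the growth pattern: per-call context grows linearly, so the total re-sent over a session grows quadratically.

```typescript
// Hypothetical helper: counts how many earlier history entries are re-sent
// across a session when every call carries the full history so far.
// Assumes one history entry per tool call.
function historyEntriesCarried(toolCalls: number): number {
  let total = 0;
  for (let call = 1; call <= toolCalls; call++) {
    total += call - 1; // call k re-sends the k-1 entries before it
  }
  return total;
}

// The 51st call alone carries 50 earlier entries; over all 51 calls the
// session has re-sent 0 + 1 + ... + 50 entries.
console.log(historyEntriesCarried(51)); // 1275
```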
The Degradation Path of an Unconstrained Agent
Without any restrictions, a fully capable agent facing a complex task will gravitate toward the simplest path: doing everything itself.
Read and write operations get mixed together in the same execution flow — exploratory investigation and modification operations share the same stream, making it impossible to trace whether a problem originated from “misreading” or “miswriting.”
Design Goals
This architecture must solve three problems simultaneously:
- Context bloat: the trunk session only accumulates decisions, not execution details
- Role confusion: enforce strict separation of “investigation” and “modification,” ban all-capable agents
- Model resource waste: assign models based on task complexity, avoid using expensive models for simple tasks
Three-Package Architecture Overview
All three problems point to the same root cause: the main session is doing things it shouldn’t — code exploration, file modification, and execution debugging are all mixed into a single context chain. The solution converges in one direction: let the main session handle only planning, outsource execution to specialized sub-units, and control how sub-unit results flow back. Realizing this direction requires three layers of capability — precise search (reducing context overhead from code exploration), execution unit management (enforcing read/write separation), and scheduling policy enforcement (preventing the main session from degrading into an all-capable executor) — corresponding to three collaborating extension packages.
Package Dependency Hierarchy
The three packages form a clear hierarchy:
- pi-extension-tool-semble: standalone toolkit, no dependencies, provides semantic code search capabilities
- pi-extension-session-subagent: base execution layer, depends on pi-coding-agent, provides the spawn trio
- pi-extension-session-orchestrator: scheduling enhancement layer, depends on the subagent package, strengthens the main session’s scheduling capabilities through hooks
Package Responsibility Breakdown
- semble: enables the agent to precisely retrieve relevant code snippets instead of blindly searching entire files
- subagent: turns “spawn a sub-LLM session to execute a task” into a callable tool
- orchestrator: restricts the main session’s toolset, enforces separation of planning and execution, manages session tree branching
Tool Permission Overview
| Tool | Main Session | read sub-agent | write sub-agent |
|---|---|---|---|
| read | ✅ | ✅ | ✅ |
| semble_search | ✅ | ✅ | ✅ |
| semble_find_related | ✅ | ✅ | ✅ |
| scratchpad | ✅ | ✅ | ✅ |
| task | ✅ | ✅ | ✅ |
| git_read | ✅ | ✅ | ✅ |
| diff | ✅ | ✅ | ✅ |
| bash | ❌ | ✅(read-only operations) | ✅ |
| edit | ❌ | ❌ | ✅ |
| write | ❌ | ❌ | ✅ |
| spawn_read_sub_agent | ✅ | ❌ | ❌ |
| spawn_write_sub_agent | ✅ | ❌ | ❌ |
| spawn_full_sub_agent | ❌ | ❌ | ❌ |
Design note: why retain the read tool for the main session?
The main session retains read-only tools like read, semble_search, and semble_find_related, rather than forcing all reading through spawn_read_sub_agent.
The reason: sometimes the orchestrator already knows what to read, because the task description includes a path or it obtained a file list in a previous round. In such cases, spawning a sub-agent just to read a file and round-trip the result compresses nothing and is pure token waste (the sub-agent’s system prompt and context fork overhead are paid for nothing).
Optimization principle: when you already know what to read, use read / semble directly and bring the information into the trunk’s decision-making. Don’t take a detour just to “follow the process.”
The same optimization applies to writes
The same logic applies to writing: when the orchestrator already knows exactly what to write (content, path, format are clear), call spawn_write_sub_agent directly, skipping the read phase. There’s no need to confirm the current state before writing — skip it when you can.
These two tools exist for separation of concerns, but separation doesn’t mean you have to take both steps every time — the orchestrator’s core responsibility is judging which steps can be omitted.
Tree-Based Context Structure
The core promise of the three-package architecture is: no matter how many steps a sub-task executes, the trunk session stays lean. How is this achieved technically? The answer lies in the data structure of the session storage layer — a tree-based context, not a linear one. This is not an optimization technique; it is the prerequisite for the entire architecture to work.
The Fundamental Flaw of Linear Context
The message history of a traditional LLM session is a chain — everything must be appended to the same line. No matter what tools were executed, how many files were read, how many mistakes were made, how many retries occurred — all of it enters the context, and none of it can be selectively hidden.
By step N, the LLM must work within a context cluttered with invalid operations, error messages, and retry history.
Tree Structure: Each Branch Holds Its Own Perspective
A tree-based context allows each node to see only the messages along the path “from root to itself.” Sibling branches don’t interfere with each other, and the parent node cannot see the execution details of child nodes.
Fork Point: Where Sub-Agents Start Seeing the World
A fork point is the trunk’s current leaf ID. forkWorkspace() points the new SessionManager’s leaf to this location. When a sub-agent starts, it inherits all messages from the trunk up to this point (planning context), but all subsequent new messages are written only to its own branch.
The sub-agent knows “what problem we’re solving” (because it inherits the trunk history), but its exploration process never pollutes the trunk.
Trunk Appends Only Conclusions, Branches Carry the Process
No matter how many steps a sub-agent executes, the trunk has no knowledge of them — the trunk only adds one tool_result (a condensed conclusion) after the fork point. This is guaranteed at the protocol level: pi’s tool protocol dictates that a tool’s return value enters the caller’s context as tool_result, rather than pasting the entire sub-session history.
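The fork-and-conclude behaviour described above can be sketched with a tiny in-memory tree. This is an illustration under stated assumptions, not pi's SessionManager: the `SessionTree` class, its methods, and the message strings are all hypothetical.

```typescript
// Hypothetical in-memory stand-in for a tree-shaped session store.
// Not pi's SessionManager; names and shapes are illustrative only.
interface Entry {
  id: number;
  parentId: number | null;
  text: string;
}

class SessionTree {
  private entries: Entry[] = [];

  // Append a message under the given parent and return its id.
  append(parentId: number | null, text: string): number {
    const id = this.entries.length;
    this.entries.push({ id, parentId, text });
    return id;
  }

  // Context visible at a leaf: the messages on the path from root to leaf.
  contextAt(leafId: number): string[] {
    const path: string[] = [];
    let cursor: number | null = leafId;
    while (cursor !== null) {
      const entry: Entry = this.entries[cursor];
      path.unshift(entry.text);
      cursor = entry.parentId;
    }
    return path;
  }
}

const tree = new SessionTree();
const root = tree.append(null, "user: fix the login bug");
const forkPoint = tree.append(root, "assistant: plan, then delegate");

// The sub-agent writes its steps on a branch that starts at the fork point.
let branchLeaf = forkPoint;
for (let step = 1; step <= 50; step++) {
  branchLeaf = tree.append(branchLeaf, "sub-agent step " + step);
}

// The trunk appends one condensed tool_result after the fork point.
const trunkLeaf = tree.append(forkPoint, "tool_result: condensed conclusion");

console.log(tree.contextAt(trunkLeaf).length);  // 3
console.log(tree.contextAt(branchLeaf).length); // 52
```

The branch still sees the planning context it inherited, while the trunk's path never passes through any of the 50 branch entries.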
Why Only a Tree Can Keep the Trunk Lean
This is an architectural prerequisite, not an optimization technique.
A linear context fundamentally cannot keep the trunk lean — everything must be appended to the same chain; there is no way to “keep some history elsewhere.” The tree structure, by giving each sub-agent its own session instance (pointing to the same file but with an independent leaf pointer), makes forking and isolation possible at the storage layer.
Parallel Branches: Multiple Agents from the Same Fork Point
Multiple sub-agents can create branches from the same fork point simultaneously, executing in parallel:
Three SessionManager instances point to the same session file but each holds its own independent leaf pointer, writing without interfering with each other.
/tree Navigation: Process Is Forever Traceable
Because sub-agents write to persistent files rather than inMemory, every branch in the session tree is permanently preserved. Using /tree, you can enter any sub-agent’s branch at any time to see which files it read, which commands it executed, and what its reasoning process was — a complete audit trail.
Why Token Savings Occur
The previous section explained how the tree structure makes a lean trunk technically possible. But “lean” is an abstract description: which specific tokens are saved, and by what magnitude? This section translates architectural advantages into quantifiable token cost differences through four specific mechanisms. These mechanisms are not homogeneous: three of them (mechanisms one, three, and four) are structural guarantees that don’t depend on LLM performance; mechanism two is a probabilistic benefit. The subsection “What’s Being Saved Is Structural Overhead, Not Agent Intelligence” elaborates on this distinction.
Root Cause: Context Grows Linearly with Execution
Suppose a complex task requires 10 sub-tasks, each with 50 tool calls. Without any isolation, the LLM call for the 10th sub-task must carry 9x50=450 preceding history messages — context grows linearly with the number of tasks.
Mechanism One: Trunk Only Sees Conclusions, Details Stay in Branches
In orchestrator mode, after 10 sub-tasks complete, the trunk context has only added 10 tool_result entries. The 500 tool calls inside the sub-tasks (10 × 50) all reside in branches.
Trunk context size = O(number of tasks), not O(total tool calls). This is a structural guarantee that does not depend on agent behavior.
Mechanism Two: Context Isolation, Each Agent Focused on Its Own Domain
Each sub-agent only has context relevant to its own domain and does not see the execution history of other sub-tasks. Focused context reduces the cognitive burden of “finding key information among 500 history entries,” lowers the probability of hallucination, and reduces retries. This is the only probabilistic benefit among the four mechanisms, not a structural guarantee.
Mechanism Three: Semble Intercepts Blind Reads, Fetches Snippets On-Demand
For a 10,000-line file, reading the full content requires passing approximately 10,000 lines into context. semble_search returns the top 5 most relevant chunks for a query, each roughly 30 lines, meaning only 150 lines enter context — a compression ratio of approximately 67:1.
Mechanism Four: Model Tiering, Expensive Models Only for Complex Tasks
“list files in src/” uses haiku ($0.25/M tokens); “refactor auth module” uses opus ($15/M tokens), a 60x price difference. Content-driven model routing ensures every call uses a model that’s “just enough” for the task.
What’s Being Saved Is Structural Overhead, Not Agent Intelligence
Among the four mechanisms, there is a clear dividing line worth highlighting.
Mechanisms one, three, and four are structural guarantees — regardless of how well the LLM performs, the savings happen:
- Mechanism one: the tool protocol + session tree branching determines that the trunk only receives tool_result — a mathematical fact
- Mechanism three: the semble interception hook intervenes at the tool_call level by force — it triggers inevitably
- Mechanism four: task tier routing executes on every spawn — it does not depend on LLM judgment
Even if every sub-agent is highly inefficient and makes many errors, the trunk context remains lean.
Mechanism two (context isolation) is a probabilistic benefit: focused context → fewer hallucinations → fewer retries → token savings. Every step in this chain is probabilistic — the LLM might still err even in a focused context, or it might succeed on the first try. When the industry says “multi-agent saves more tokens than single-agent,” this second-order effect is what they’re referring to.
The conservatism of this architecture: first, establish a deterministic baseline through structural guarantees; the intelligence advantage of mechanism two is icing on the cake, not the core guarantee.
Based on measured data, introducing this architecture reduces token consumption by approximately 20%. This figure is the net effect of all four mechanisms minus the sub-agent’s own overhead, demonstrating that structural savings far exceed the fixed cost of spawning.
An Equivalent Compression Perspective: Every Spawn Is an Archive
The four mechanisms above approach the problem from different angles. Here, a unified holistic perspective brings them together into a single picture.
Every spawn is essentially a compression-and-archive operation on that phase of work. The sub-agent executes 50 tool calls, generating a large amount of intermediate state — which files were read, which commands were executed, which hypotheses were reasoned about. For the trunk, this content is “compressed” into a single tool_result: Findings, Risks, Next Steps. 50 messages → 1 message, a compression ratio of approximately 50:1.
But this “compression” differs from traditional lossy compression. A more accurate description is a lossy summary backed by a lossless archive:
- The trunk sees a lossy summary (tool_result) — execution details are omitted, the trunk context stays lean
- The original data is fully preserved in the branch, retrievable at any time via /tree, not actually lost
Traditional compression is destructive — original data may be lost. The “compression” here is archival — the original data is stored elsewhere, the trunk no longer sees it, but it still exists in its entirety. If the trunk needs a certain execution detail, it can enter the branch via /tree to retrieve it.
Mapping this perspective to mechanisms one, three, and four: mechanism one is the direct embodiment of this compression-archive operation; mechanism three (semble precision snippet retrieval) does the same thing inside the sub-agent, but at a finer granularity — not reading the entire file, only fetching relevant chunks; mechanism four (model tiering) decides how expensive a “processor” to use for each archive operation. The three mechanisms unify under the same logic: reduce redundant information entering any layer’s context.
Actual Results
After introducing this architecture, two phenomena are worth recording.
First, session compression is almost never needed anymore. In the past, running complex tasks in a single session would cause the context to bloat to the point of needing manual compression (or being truncated by auto-compression); now the trunk context stays at the order of magnitude of the number of tasks, remaining lightweight even after long tasks complete. Compression has essentially disappeared from daily operations.
Second, token consumption decreased by approximately 20%. This figure is the net effect of all four mechanisms minus the sub-agent’s own overhead (each spawn has a fixed system prompt cost). A net 20% savings demonstrates that structural savings far exceed the overhead cost of spawning.
Base Layer: pi-extension-session-subagent
The foundational capability of the entire architecture is “turning an LLM session into a callable tool.” Without this layer, the orchestrator has no execution unit to delegate to. The subagent package does exactly this: it encapsulates the spawn operation into three tools, allowing the caller to launch a sub-session with full reasoning capability just like calling a function.
Three Spawn Tools
The subagent package registers three tools, corresponding to three execution roles:
- spawn_read_sub_agent: read-only agent, can only observe and report, cannot modify any files or state
- spawn_write_sub_agent: read-write agent, can perform file modifications, command execution, etc.
- spawn_full_sub_agent: full toolset, which the orchestrator package removes from the main session’s tool list
Tool Capability Matrix
The toolset for the three agent types comes from DEFAULT_TOOLS_CONFIG in tools-config.ts, which defines a common list plus readExtra and writeExtra lists.
read agent = common + readExtra, write agent = common + writeExtra.
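A minimal sketch of that composition rule follows. The config shape is assumed from the names common, readExtra, and writeExtra; the example values are taken from the Tool Permission Overview table, and the real DEFAULT_TOOLS_CONFIG may group them differently. `toolsFor` is a hypothetical helper.

```typescript
// Shape assumed from the text: a shared list plus per-role extras.
interface ToolsConfig {
  common: string[];
  readExtra: string[];
  writeExtra: string[];
}

// Illustrative values derived from the Tool Permission Overview table;
// the real DEFAULT_TOOLS_CONFIG may group them differently.
const exampleConfig: ToolsConfig = {
  common: ["read", "semble_search", "semble_find_related", "scratchpad", "task", "git_read", "diff"],
  readExtra: ["bash"], // limited to read-only commands by the system prompt
  writeExtra: ["bash", "edit", "write"],
};

type AgentRole = "read" | "write";

// read agent = common + readExtra, write agent = common + writeExtra.
function toolsFor(role: AgentRole, config: ToolsConfig): string[] {
  const extra = role === "read" ? config.readExtra : config.writeExtra;
  return config.common.concat(extra);
}

console.log(toolsFor("read", exampleConfig).join(", "));
console.log(toolsFor("write", exampleConfig).join(", "));
```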
Four Context Modes
fork (default): Knowledge transfer mode. Inherits the parent session’s message history, but distilled — only user messages and assistant text blocks are retained; thinking blocks and all toolCall/tool_use blocks are stripped.
Stripping behavioral signals is intentional: the assistant’s thinking and tool call history would influence the sub-agent’s operation style through in-context learning, causing it to mimic the orchestrator’s behavior patterns even though the system prompt defines it as a read/write worker. After filtering, the sub-agent knows “what problem we’re solving” (from user messages and assistant analytical text), but is not polluted by “how the parent session operates.”
fresh: Blank context, only the system prompt. Suitable for truly self-contained tasks, such as “check if there are files in this directory” — no dependency on any planning information from the parent session.
fork_full: Full session clone, the original history is passed in verbatim without any filtering. Only used in continuation scenarios where the sub-agent plays the same role as the parent session (extremely rare).
auto: Unconditionally routes to fork. If fresh is needed, you must explicitly pass mode="fresh"; do not rely on auto to trigger fresh.
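The four modes can be sketched as a small resolver plus a filter. The message and block shapes below are hypothetical (the real session format may differ), and dropping tool-result messages in fork mode is an assumption, since the text only states which blocks are kept.

```typescript
type ContextMode = "fork" | "fresh" | "fork_full" | "auto";

// Hypothetical message shape; the real session format may differ.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; text: string }
  | { type: "toolCall"; name: string };

interface Message {
  role: "user" | "assistant" | "toolResult";
  blocks: Block[];
}

// auto unconditionally resolves to fork; fresh must be requested explicitly.
function resolveMode(mode: ContextMode): "fork" | "fresh" | "fork_full" {
  return mode === "auto" ? "fork" : mode;
}

// fork: keep user messages and assistant text blocks, drop everything else.
// Dropping toolResult messages is an assumption of this sketch.
function distill(history: Message[]): Message[] {
  const kept: Message[] = [];
  for (const message of history) {
    if (message.role === "user") {
      kept.push(message);
    } else if (message.role === "assistant") {
      const textOnly = message.blocks.filter((block) => block.type === "text");
      if (textOnly.length > 0) kept.push({ role: "assistant", blocks: textOnly });
    }
  }
  return kept;
}

function buildContext(mode: ContextMode, history: Message[]): Message[] {
  const resolved = resolveMode(mode);
  if (resolved === "fresh") return [];                  // system prompt only
  if (resolved === "fork_full") return history.slice(); // verbatim clone
  return distill(history);
}
```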
Session Lifecycle
A sub-agent is a session created on-demand and released immediately after execution.
Structured Output Format
All sub-agent system prompts have SUB_AGENT_OUTPUT_GUIDELINES injected, requiring the final reply to contain five fixed sections, among them Findings, Risks, Open Questions, and Next Steps.
When the orchestrator consumes the tool_result, it processes by priority: first check Risks and Open Questions (any blockers), then Findings (establish factual basis), and finally Next Steps (decision recommendations).
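A rough sketch of that priority-ordered consumption is shown below. It assumes each section starts with a "## Name" line, which is a hypothetical format chosen for illustration; only the four section names mentioned in this article are used, and both function names are made up.

```typescript
// Sections named in the text, in the order the orchestrator consumes them.
const PRIORITY = ["Risks", "Open Questions", "Findings", "Next Steps"];

// Assumes each section starts with a "## <Name>" line (hypothetical format).
function parseSections(reply: string): Map<string, string> {
  const sections = new Map<string, string>();
  let current: string | null = null;
  for (const line of reply.split("\n")) {
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = heading[1];
      sections.set(current, "");
    } else if (current !== null) {
      const body = sections.get(current) ?? "";
      sections.set(current, body === "" ? line : body + "\n" + line);
    }
  }
  return sections;
}

// Returns the non-empty sections as [name, body] pairs, blockers first.
function inPriorityOrder(reply: string): Array<[string, string]> {
  const sections = parseSections(reply);
  const ordered: Array<[string, string]> = [];
  for (const name of PRIORITY) {
    const body = sections.get(name);
    if (body !== undefined && body.trim() !== "") ordered.push([name, body.trim()]);
  }
  return ordered;
}
```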
Read-Only Constraint Injection Principle
The read agent’s constraints do not rely on tool-level sandboxing. Instead, READ_AGENT_CONSTRAINTS is injected through the system prompt, explicitly listing allowed and prohibited bash operations:
- Allowed: ls, find, grep, git log, git diff, and other read-only operations
- Prohibited: any file writes (>, >>), sed -i, rm, npm install, git commit, and other modification operations
When a read agent discovers something that needs modification, it should describe it in Next Steps, and the orchestrator dispatches a write agent to execute.
Scheduling Layer: pi-extension-session-orchestrator
With the execution units provided by subagent, the next question is: who decides when to dispatch which type of agent, and how to prevent the main session from executing tasks itself. The orchestrator package adds a layer of scheduling policy on top of subagent, using three hooks to forcibly shape the main session into a role that plans but does not execute.
Core Philosophy: Separation of Planning and Execution
The orchestrator’s main session does only three things: understand the request, decompose the task, synthesize the conclusion. It never directly executes bash commands, reads or writes files, or calls APIs. All execution is delegated to sub-agents.
This is not achieved through prompt constraints — the bash/edit/write tools are physically removed from the main session’s tool list at session_start. Degradation is impossible even if attempted.
Three Hook Interception Points
The orchestrator implements all its capabilities through three hooks (session_start, before_agent_start, and tool_call), without registering any tools.
Toolset Restriction (session_start)
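The session_start hook calls pi.setActiveTools() to restrict the main session’s tools to those marked ✅ in the Main Session column of the Tool Permission Overview: read, semble_search, semble_find_related, scratchpad, task, git_read, diff, spawn_read_sub_agent, and spawn_write_sub_agent.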
spawn_full_sub_agent is completely excluded. If the subagent package is not installed (spawn tools not found), the hook sends an error notification to the UI and exits early.
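The restriction step might look roughly like this. Only setActiveTools is named in the text; the `ExtensionApiLike` interface, the `registered` parameter, and the returned error string are hypothetical stand-ins so the sketch can run on its own. The allowlist mirrors the Main Session column of the Tool Permission Overview.

```typescript
// Minimal stand-in for the extension API surface used here. Only
// setActiveTools is named in the text; everything else is hypothetical.
interface ExtensionApiLike {
  setActiveTools(names: string[]): void;
}

const MAIN_SESSION_TOOLS = [
  "read", "semble_search", "semble_find_related", "scratchpad", "task",
  "git_read", "diff", "spawn_read_sub_agent", "spawn_write_sub_agent",
];

// Returns an error message when the spawn tools are missing, otherwise null.
function restrictMainSession(api: ExtensionApiLike, registered: string[]): string | null {
  const spawnTools = ["spawn_read_sub_agent", "spawn_write_sub_agent"];
  const missing = spawnTools.filter((name) => registered.indexOf(name) === -1);
  if (missing.length > 0) {
    // Exit early without touching the toolset; the caller would notify the UI.
    return "orchestrator: missing spawn tools: " + missing.join(", ");
  }
  // bash, edit, write and spawn_full_sub_agent are never in the allowlist.
  api.setActiveTools(MAIN_SESSION_TOOLS.filter((name) => registered.indexOf(name) !== -1));
  return null;
}
```

Because the allowlist is applied as a filter, removed tools stay removed even if they are registered by another extension.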
Orchestrator Prompt Injection (before_agent_start)
The before_agent_start hook appends an [ORCHESTRATOR MODE] section at the end of the system prompt, clearly informing the agent:
- You are a scheduler, only plan and delegate, do not execute directly
- You have read-only direct access (read, semble_search, semble_find_related)
- You do not have bash/edit/write tools
- Workflow: understand → task(plan) → spawn_read(only when uncertain) → spawn_write → synthesize
The hook also re-asserts the toolset (preventing other extension hooks from restoring removed tools in between).
Workspace Branching and Trunk Leanness (tool_call)
The tool_call hook intercepts all spawn calls and does two things:
- Model selection: analyzes the task description content and selects a model matching the complexity
- Workspace injection: calls forkWorkspace() to create a branched SessionManager, injected into the spawn parameters’ _workspaceSessionManager field
The sub-agent uses this workspace SM instead of an inMemory SM. All its messages are written to a branch of the session tree; the trunk only appends a single tool_result.
forkWorkspace Implementation Principle
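The implementation is very clean, using only two SessionManager APIs. In effect, a new SessionManager is opened on the same session file and its leaf pointer is set to the trunk’s current leaf ID.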
Content-Driven Model Tiering (Task Tier)
Model selection is based on the content complexity of the task description, not the agent type (read/write). Each task maps to one of three tiers: high, medium, or low.
The actual models corresponding to each tier come from ~/.pi/agent/model-routing.json, or are auto-detected by cost + name pattern (opus → high, sonnet → medium, haiku → low). The main session always uses the user’s currently selected model without overriding.
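As a rough illustration of content-driven routing, the sketch below classifies a task description into a tier and picks a model by name pattern. The keyword heuristic in `classifyTask` is invented for this example (the real classifier is not shown here); the name-pattern mapping follows the opus/sonnet/haiku rule above, and the model names in the usage line are placeholders.

```typescript
type Tier = "low" | "medium" | "high";

// Illustrative keyword heuristic; the real classifier is not shown in the text.
function classifyTask(description: string): Tier {
  const text = description.toLowerCase();
  if (/\b(refactor|redesign|architecture|migrate)\b/.test(text)) return "high";
  if (/\b(list|check|count|find|show)\b/.test(text)) return "low";
  return "medium";
}

// Name-pattern fallback described in the text: opus -> high, sonnet -> medium, haiku -> low.
function tierOfModel(modelName: string): Tier | null {
  const name = modelName.toLowerCase();
  if (name.indexOf("opus") !== -1) return "high";
  if (name.indexOf("sonnet") !== -1) return "medium";
  if (name.indexOf("haiku") !== -1) return "low";
  return null;
}

// Picks the first available model whose tier matches the task's tier.
function pickModel(description: string, available: string[]): string | null {
  const wanted = classifyTask(description);
  for (const model of available) {
    if (tierOfModel(model) === wanted) return model;
  }
  return null;
}

// Placeholder model names, for illustration only.
console.log(pickModel("refactor auth module", ["example-haiku", "example-opus"]));
```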
/tree Visibility
Because sub-agents write to persistent session tree branches (not inMemory), users can navigate to any sub-agent’s branch via /tree after the task completes and view its full execution process — every file read, every bash command, every reasoning step is preserved there.
Search Layer: pi-extension-tool-semble
Separation of planning and execution solves the “who does it” problem, but there is still a hidden overhead not yet addressed: code exploration. Whether it’s the orchestrator or a sub-agent, handling code tasks requires locating relevant files and functions. If every search relies on reading entire files or exhaustive grep, the trunk context savings are offset by code-reading overhead. The semble package addresses exactly this problem.
semble_search / semble_find_related
Two tools encapsulate the semble CLI:
- semble_search: natural language or symbol name semantic search, returns the most relevant code chunks (default top 5)
- semble_find_related: given file:line, finds similar implementations in the project, used for lateral exploration
Both are more token-efficient than grep/read: semble only returns relevant snippets, not the entire file.
bash grep Auto-Rewrite
The tool_call hook intercepts bash calls and uses regex to match two grep patterns, automatically replacing the matched commands with semble equivalents.
Compound commands (containing |, &, ;) are not rewritten to avoid breaking pipeline logic.
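The guard against compound commands can be sketched as follows. The two real patterns are not shown in this article, so `SIMPLE_GREP` is an invented example that matches only a plain recursive grep, and the `SearchRequest` shape is a hypothetical rewrite target rather than the actual semble invocation.

```typescript
// Hypothetical rewrite target: a structured semantic-search request.
interface SearchRequest {
  query: string;
  path: string;
}

// Matches only a simple recursive grep such as: grep -rn "token" src/
// The real patterns are not shown in the text; this one is illustrative.
const SIMPLE_GREP = /^grep\s+-[a-zA-Z]*r[a-zA-Z]*\s+(?:"([^"]+)"|'([^']+)'|(\S+))\s+(\S+)$/;

function rewriteGrep(command: string): SearchRequest | null {
  const trimmed = command.trim();
  // Compound commands are left alone so pipelines keep their meaning.
  if (/[|&;]/.test(trimmed)) return null;
  const match = SIMPLE_GREP.exec(trimmed);
  if (!match) return null;
  const query = match[1] || match[2] || match[3];
  return { query, path: match[4] };
}

console.log(rewriteGrep('grep -rn "forkWorkspace" src/')); // rewritten
console.log(rewriteGrep("grep -r foo src | wc -l"));       // null: left untouched
```

The check is deliberately conservative: a pattern that merely contains |, &, or ; inside quotes is also left untouched.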
Large File Blind-Read Interception
The tool_call hook also intercepts read tool calls. The judgment conditions:
- The call does not specify offset or limit (indicating a full file read)
- The file’s estimated line count exceeds 300 (LINE_THRESHOLD, overridable via environment variable)
- The directory containing this file has not been searched by semble yet (SearchTracker.hasSearched())
If all three conditions are met, the hook returns block: true together with a prompt: first use semble to locate the relevant code, then read snippets after confirming the target file.
SearchTracker State Management
SearchTracker records every directory searched by semble_search. Records are cleared at the start of each turn, so a search performed in an earlier turn does not grant a pass in a later one.
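Putting the three conditions and the tracker together, a minimal sketch could look like this. Only hasSearched() and LINE_THRESHOLD are named in the text; the `record` and `reset` methods, the `ReadCall` shape, and the `dirOf` helper are hypothetical, and the line estimate is passed in rather than computed.

```typescript
const LINE_THRESHOLD = 300; // default from the text; overridable via environment variable

// Hypothetical tracker; only hasSearched() is named in the text.
class SearchTracker {
  private searched = new Set<string>();
  record(dir: string): void { this.searched.add(dir); }
  hasSearched(dir: string): boolean { return this.searched.has(dir); }
  reset(): void { this.searched.clear(); } // called at the start of each turn
}

interface ReadCall {
  path: string;
  offset?: number;
  limit?: number;
}

function dirOf(path: string): string {
  const slash = path.lastIndexOf("/");
  return slash === -1 ? "." : path.slice(0, slash);
}

// Blocks only when all three conditions hold.
function shouldBlockRead(call: ReadCall, estimatedLines: number, tracker: SearchTracker): boolean {
  const fullRead = call.offset === undefined && call.limit === undefined;
  const large = estimatedLines > LINE_THRESHOLD;
  const unexplored = !tracker.hasSearched(dirOf(call.path));
  return fullRead && large && unexplored;
}
```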
Collaborative Workflow
Having understood the role of each component and the principle of token savings, let’s look at how they collaborate in actual tasks. A typical orchestrator task goes through these phases: planning → reconnaissance → implementation → verification. Each phase corresponds to a different combination of tool calls.
Standard Execution Sequence
The complete orchestrator workflow: understand → task (plan) → spawn_read (only when uncertain) → spawn_write → synthesize.
Skip-Read Optimization Path
If the orchestrator already knows the target file and modification content from previous context, it dispatches a write sub-agent directly, skipping the read phase.
The orchestrator system prompt explicitly emphasizes: “skip the read phase when you already have enough context.”
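The skip decision can be expressed as a small planning function. This is a sketch under simplifying assumptions: in practice the judgment is made by the orchestrator LLM from its prompt, and the `Knowledge` flags and `choosePhases` name are hypothetical.

```typescript
type Phase = "plan" | "spawn_read" | "spawn_write" | "synthesize";

// Hypothetical summary of what the orchestrator already knows.
interface Knowledge {
  targetKnown: boolean; // the file or path to change is already known
  changeKnown: boolean; // the content of the change is already clear
}

function choosePhases(known: Knowledge): Phase[] {
  const phases: Phase[] = ["plan"];
  if (!(known.targetKnown && known.changeKnown)) {
    phases.push("spawn_read"); // reconnaissance only when something is uncertain
  }
  phases.push("spawn_write", "synthesize");
  return phases;
}

console.log(choosePhases({ targetKnown: true, changeKnown: true }).join(" -> "));
// plan -> spawn_write -> synthesize
```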
Context Flow
- User messages → enter the trunk, visible to the orchestrator
- Trunk tool calls (spawn) → fork point created, recorded in trunk history
- Sub-agent execution (50 steps) → all written to branch, invisible to trunk
- tool_result → written to trunk, visible to orchestrator (condensed conclusion)
- scratchpad → stored in tool_result details, auto-restored after branch switch, persists across spawns
Tool Configuration Customization
DEFAULT_TOOLS_CONFIG defines the default tool whitelist, supporting two levels of override:
- ~/.pi/agent/tools.json: global user configuration
- <cwd>/.pi/agent/tools.json: project-level configuration (higher priority)
The latter overrides the former, and both override defaults. You can add extra tools (such as db_query) for specific projects without affecting other projects.
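The override order can be sketched as a layered merge. This assumes the config has the common / readExtra / writeExtra shape and that a key present in a higher-priority layer replaces the lower one outright; whether the real loader replaces or appends is not stated here. `resolveToolsConfig` and the http_fetch tool are hypothetical.

```typescript
interface ToolsConfig {
  common: string[];
  readExtra: string[];
  writeExtra: string[];
}

// Each layer may override any subset of keys. This sketch assumes a key
// present in a higher-priority layer replaces the lower one outright.
function resolveToolsConfig(
  defaults: ToolsConfig,
  globalUser: Partial<ToolsConfig>,
  project: Partial<ToolsConfig>,
): ToolsConfig {
  // Later spreads win: defaults < global user config < project config.
  return { ...defaults, ...globalUser, ...project };
}

const defaults: ToolsConfig = {
  common: ["read", "scratchpad"],
  readExtra: ["bash"],
  writeExtra: ["bash", "edit", "write"],
};

// http_fetch is a made-up tool name; db_query is the example from the text.
const resolved = resolveToolsConfig(
  defaults,
  { writeExtra: ["bash", "edit", "write", "http_fetch"] },
  { writeExtra: ["bash", "edit", "write", "db_query"] },
);

console.log(resolved.writeExtra.join(", ")); // bash, edit, write, db_query
```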
Pattern Analysis
Having described the architectural details, let’s return to a qualitative question: what pattern is this architecture? What are its essential differences from similar concepts — the router pattern and multi-agent systems? Clarifying these boundaries helps position it within a broader technical context and helps determine which scenarios are suitable for it and which aren’t.
Distinction from the Router Pattern
The core of the router pattern is single-shot dispatch: receive a request → determine the type → forward to the corresponding handler → return the result. The router itself is stateless, does not hold task context, does not synthesize conclusions, and does not decide next steps.
The orchestrator is fundamentally different:
- Has global state: scratchpad and task persist across multiple spawn rounds
- Multi-round iteration: decides next steps based on each round’s results, not one-shot dispatch
- Active synthesis: consolidates conclusions from multiple sub-agents into a final answer
- Decision loop: spawn_read → evaluate → spawn_write → verify → synthesize
Distinction from True Multi-Agent Systems (MAS)
Core elements of classical MAS (Multi-Agent System):
| Element | Classical MAS | This Architecture |
|---|---|---|
| Autonomy | agents decide when to act autonomously | sub-agents passively wait for spawn |
| Peer Communication | agents can message each other directly | strictly vertical: orchestrator ↔ sub-agent only, never sub-agent ↔ sub-agent |
| Persistent Existence | agents run continuously with their own goals | sub-agents are destroyed after execution |
| Shared Environment Awareness | multiple agents actively perceive the same environment changes | sub-agents only perceive the task description injected by the orchestrator |
A sub-agent is essentially a function call that can reason — it takes input (task description + fork context), produces output (conclusion), and disappears. It has no objective function of its own, no active perception of the environment, and no ability to communicate with sibling agents.
True MAS requires: persistent agent loops, a shared read-write environment (blackboard/message bus), and agents autonomously deciding actions based on perception. The industry’s use of “multi-agent” to describe this orchestrator architecture is a broad usage.
The Essential Positioning of This Architecture
The precise positioning is a three-layer composition:
- Hierarchical Agent: single decision center (orchestrator), where the execution tool happens to be an LLM
- CQRS (Command Query Responsibility Segregation): read/write forced separation, achieved by physically removing tools rather than by convention
- Tree-based Context Management: branch isolation + trunk leanness + permanent audit trail
It’s not MAS, it’s not Router — it’s a hierarchical delegation pattern that prevents role degradation through capability boundary enforcement.
Applicability Boundary: What’s Being Saved Is the Cost of Uncertainty
Having walked through the entire architecture in detail, we can step back and look at a more fundamental question: what exactly is this mechanism saving? The answer is — the context generated by the uncertainty of the exploration process.
The main session doesn’t have bash permissions not because it’s untrusted, but because the exploration process is inherently uncertain — how many files need to be read? How many errors will be encountered? How many retry rounds are needed? These unpredictable steps, if they happen on the trunk, all settle into the context. Outsourcing exploration to sub-agents essentially “containerizes” this uncertainty:
- read sub-agent is the container for exploratory reading. Wrong reads, misreads, retries — all stay in the branch. The trunk only sees the conclusion.
- write sub-agent is the container for exploratory writing. Trial edits, test failures, more edits — the entire process stays in the branch. The trunk has no knowledge of it.
This means the benefit of sub-agents is proportional to the uncertainty of the task. The higher the uncertainty — not knowing where to start, complex file structure, lots of exploration needed — the more significant the spawn benefit.
Conversely, spawn is unnecessary for things already certain. If the orchestrator already knows which file to read, reading it directly with the read tool is more efficient; if it also knows exactly what to change, it can dispatch spawn_write_sub_agent straight away without a read phase. Each spawn has a fixed system prompt cost and context fork overhead, and paying this cost for “certain things” is pure waste.
Extreme case: if the entire task is deterministic from start to finish (path known, change content clear), it’s perfectly reasonable for the main session to handle it directly. This is not “degradation” of the architecture, but correctly recognizing spawn’s applicability boundary — what it solves is always the context bloat caused by uncertainty. Deterministic tasks were never its target.
This also explains a seemingly strange design decision: why is it sometimes “read” directly, and sometimes “spawn_read_sub_agent”? The difference isn’t in the operation type, but in whether you know what to read. If you don’t, put the exploration into a sub-agent; if you do, just read directly.