[Project] 3. Harness-Everything — Autonomous AI Code Improvement Harness

Big Picture
The LLM is the brain, the Harness is the hands, the project code is what gets modified. The LLM never directly touches the filesystem — it only says “I want to do X”, and your code executes it.
The Essence: Three Sentences
- The LLM is the engine: Feed project code to a language model, let it analyze, suggest improvements, and write code. At its core, it’s just a while loop asking the LLM “what else can be improved?”
- Tools are the hands: The LLM can’t directly read or write files. It uses Anthropic’s tool_use protocol to tell your code “I want to read this file” / “I want to edit this line”, and your code executes it. You could run the whole thing with just a
bashtool. - Process restart is the key: Python modules are loaded once at startup and stay frozen in memory. When the LLM modifies its own
.pyfiles, the running process still uses the old code. A process restart is the only way to apply improvements. That’s why we have the push → tag → CI deploy → restart loop.
From Simplest to Complete System: Each Layer Solves One Problem
Simplest Version (Conceptual)
| |
This works. But it runs into problems. Each layer below solves the previous layer’s problem:
Layer 1: Code Too Large for Context
Problem: When the project gets big, all files don’t fit in the LLM context window.
Solution: Give the LLM tools to choose what to read, instead of stuffing everything in.
The core code (harness/core/llm.py call_with_tools()) is only 60 lines:
| |
That’s it. This is the most important 60 lines in the entire project.
Turn — The hidden cost: “compound interest” on memory
These 60 lines are simple, but they hide a counterintuitive cost model. The Anthropic API is stateless — every call resends the full conversation history. Turn 1 is ~20K tokens. Turn 5 with 4 rounds of tool calls is ~30K. Turn 10 is ~40K. Turn 20 is ~60K. The longer the conversation, the more expensive each additional turn — and the growth is exponential.
That’s why cutting max_tool_turns from 30 to 20 saves ~40% in cost — you’re removing the most expensive turns. But fewer turns means less thinking space for the agent. There’s no perfect tradeoff; subsequent mitigation measures (conversation pruning, proactive compaction, file-read cache) all chip away at this fundamental tension.
Resolution — These 60 lines are the engine; the next six layers are safety nets and a steering wheel
The engine itself isn’t complicated. What’s hard is building a control system around it — one that prevents it from overheating, from driving in the wrong direction, from forgetting which roads it’s already traveled. Every layer that follows adds one dimension of control.
Layer 2: LLM Introduces Bugs
Problem: LLM-generated code may have syntax errors or break the project.
Solution: Automatically run python -m py_compile after every code change. If it fails, feed the error back to the LLM to fix.
| |
This is SyntaxCheckHook in harness/pipeline/hooks.py.
Turn — py_compile catches commas, not confusion
Syntax check is a hard constraint — mismatched brackets, wrong indentation, mis-imported names — it catches all of them. But it can only tell you “this code runs,” not “this code is correct.” The LLM can change if user.is_admin to if user.is_active — perfectly valid syntax, completely wrong logic. And worse: when the LLM sees a compilation error, sometimes it over-corrects, rewriting an entire function just to fix a missing colon.
Resolution — Syntax check is the floor, not the ceiling
Its role is: “the code must not fail to compile.” But once it passes, determining quality requires smarter judgment — which is what the next layer addresses.
Layer 3: Can’t Tell if Changes Are Good
Problem: The LLM made changes — are they actually good?
Solution: Use two separate LLM calls to judge, running in parallel with isolated perspectives:
- Basic evaluator: Find the most critical defect (security holes, logic errors, code quality)
- Diffusion evaluator: Analyze second-order effects (will it break other modules? cause context bloat?)
This is harness/evaluation/dual_evaluator.py.
Turn — “Letting students grade their own homework” is unavoidable
Having an LLM evaluate another LLM is, at its core, letting students grade their own homework. Isolating two evaluator perspectives (basic finds defects, diffusion analyzes ripple effects) helps — but if both evaluators share the same model’s latent preferences? A taste for elegant code, a bias toward certain design patterns — these are impossible to isolate away.
In practice: evaluator prompt quality matters far more than isolation. A good prompt creates just enough divergence between the two evaluators. A vague prompt pushes both into the 6-8 “safe zone” — everything is good, nothing stands out.
Resolution — Dual evaluation is a mirror, not a ruler
It shows you two reflections of your work, but it can’t give you absolute coordinates. Absolute coordinates need objective metrics (test pass rate, coverage, lint scores) for calibration — which is part of what V5’s multi-axis evaluation aims to solve.
Layer 4: Small Steps, Many Rounds
Problem: Asking the LLM to “fix everything at once” produces poor results. Big changes break things.
Solution: Split into multiple phases, each focused on one thing, iterating round by round:
This is harness/pipeline/pipeline_loop.py (outer loop) and harness/pipeline/phase_runner.py (phase execution).
Turn — More phases, narrower vision
The cost of splitting into phases is loss of coherence. Phase 2’s LLM doesn’t know what Phase 1 analyzed. Phase 3 might undo what Phase 2 just built. Each phase makes locally optimal decisions within its own context window, but they can cancel each other out when combined.
This is compounded by fixed phase ordering. Whether you throw a security audit or a “add one log line” task at it, it runs the same bug_fix → features → polish pipeline. A security audit for a tiny script is meaningless, but it runs every time. Later (R3), removing the fixed debate mechanism, adding skip rules and falsifiable criteria gave the agent the ability to schedule on demand.
Resolution — Orchestration wisdom isn’t in “how many phases,” it’s in “knowing when to skip”
What makes a multi-phase system great isn’t the phase design itself — it’s the skip rules. A phase’s value isn’t measured by how many times it runs, but by whether each run actually changed the outcome. If a phase gives the same conclusion 80% of the time — it should be skipped.
Layer 5: Changes Don’t Take Effect
Problem: The LLM modified harness/core/llm.py in Round 3, but the running process still uses the version loaded at Round 1. The improvement is invisible.
Solution: The restart loop.
| |
This is the most important architectural decision. Without this restart loop, self-improvement is fake — the LLM thinks it changed the code, but the changes never execute.
Turn — But restart has a cost
Restarting every 10 rounds means that for those 10 rounds, the agent runs on “old body, new mind.” Round 5 improves llm.py’s retry logic, but Rounds 6-10 still use the old retry logic — new code sits on disk waiting for restart while old code keeps running in memory. Ten rounds is enough for the agent to produce code that depends on new behavior — and if that new behavior hasn’t taken effect yet, subsequent changes may be built on false assumptions.
There’s no perfect solution, only partial mitigation: after each restart, have the agent audit the core modules it modified in the previous chunk. Batched restart is an engineering tradeoff: how soon do you want improvements to take effect, and how much body-mind desynchronization can you tolerate?
Layer 6: Keep the Loop Alive
Problem: Many situations can silently kill the loop.
| Problem | Solution |
|---|---|
| Context overflow | Give LLM tools to read on demand |
| LLM introduces bugs | Syntax check hook auto-validates |
| Can’t tell good from bad | Dual evaluator scores, pick highest |
| Changes don’t take effect | push → tag → CI → restart process |
| Early stop → no tag → loop dies | auto_tag_at_end forces tag on every exit |
| 3 crashes → systemd gives up | Heartbeat cron resets and restarts every 30min |
| User push causes conflict | git pull --rebase auto-merges |
| LLM breaks deploy scripts | SELF-IMPROVEMENT LOOP PROTECTION blocklist in prompts |
| Bad code deployed | CI smoke test + rollback to harness-last-good |
| Disk full | Cleanup cron deletes old data daily |
Turn — Safety nets are in place, but “judgment” hasn’t kept pace
These safety nets solve “the loop won’t die unexpectedly.” They don’t solve “where should the loop go?” Layer 3’s dual evaluation gives the agent judgment within a single round, but it has no cross-round memory, no directional sense, no ability to question its own evaluation criteria. And as the codebase becomes more polished, the evaluation signal keeps weakening — all the obvious defects are fixed; what remains requires taste and foresight, which a single score can’t capture.
Resolution — The first six layers keep the loop alive; the seventh layer teaches it to think
Before V5, Harness was an execution machine that was getting better at not dying. V5 tackles a different problem: not “stay alive,” but “know which way to run.” This is Layer 7.
Layer 7 (V5): Flat Scores, No Memory, No Exploration, Can’t Self-Modify
Set-up — Hidden cracks beneath the surface
V4 ran 67 cycles, grew tests from zero to 2700+, and stabilized the tool system. On the surface, everything was thriving. But four structural bottlenecks lurked underneath:
- Flat evaluation scores — A single 0-10 score couldn’t distinguish “fine” from “genuinely good.” A variable rename and an AST tool refactor both scored ~7. The agent couldn’t tell incremental polish from breakthrough improvements.
- No cross-cycle memory —
memory.jsonllogged what happened each round but extracted no lessons. The mistake from cycle 71 would be repeated in cycle 73. Every round, the agent started with amnesia. - Pure exploitation — The agent only polished. It would add tests, delete dead code, merge duplicates — but it would never say “this architecture is wrong, let’s rebuild.” Without exploration, it circled local optima.
- Can’t modify itself — Evaluator prompts and weights were hardcoded in Python files. The agent could modify any code except the standard by which it was judged. The most important self-correction capability was missing.
Development — Four modules, each breaking one bottleneck
| |
1. Multi-Axis Evaluation — 5-dim vector replaces single number
Old (V4): evaluator outputs "SCORE: 7.5" — one number, low differentiation.
New (V5): evaluator outputs 5-dim vector [correctness, code_quality, arch_health, novelty, alignment], weighted average produces final score.
Five dimensions, independently scored (0-10), two evaluator perspectives (basic + diffusion) run in parallel:
| Dimension | Default Weight | Measures |
|---|---|---|
| correctness | 35% | Compiles, tests pass, no security holes introduced |
| code_quality | 25% | Lint-clean, manageable complexity, clear naming |
| arch_health | 15% | Coupling, cohesion, module boundaries |
| novelty | 10% | Difference from existing code, willingness to try new directions |
| alignment | 15% | Consistency with MetaAgent’s current optimization target |
Weights are stored in harness/config/eval_weights.json and hot-loaded from disk before every evaluation. Change the file → next evaluation uses new weights, no restart needed.
2. Structured Experience Memory — Not just “what happened,” but “what was learned”
| |
Four core interfaces:
record(exp)— Record experience after each roundquery(["evaluator", "scoring"])— Retrieve relevant history by tagcompress(ratio)— Merge similar experiences, extract more abstract rulesformat_context(max_entries)— Format for LLM and MetaAgent consumption
3. Exploration Mode — Not just polishing, but venturing
| |
4. Strategy Layer + Hot-Reload — The agent can finally change its own evaluation criteria
| |
With GetSelfConfigTool — a tool registered for agent use — the agent can query “where are my config files?” and then modify its own evaluator prompts and weights. For the first time, the agent can change the standards by which it is judged.
Turn — Weights aren’t magic; prompts are the bottleneck
After V5 landed, two findings challenged the original assumptions:
First, changing weights is far less effective than changing prompts. Bumping novelty from 10% to 50% does make the agent more willing to try new directions, but if the evaluator prompt doesn’t clearly define “what makes good novelty,” the agent oscillates between randomly renaming variables and genuinely restructuring modules. Weights quantify “how important,” but prompts define “what it is.”
Second, the MetaAgent itself has decision quality problems. It reads experiences from the ExperienceStore, analyzes trends, and outputs strategy — but if the experiences stored in early cycles are low-quality (because the evaluator wasn’t calibrated yet), the MetaAgent’s decisions will be biased. This is a cold-start problem: no good experiences → bad strategy → produces bad experiences → worse strategy. The mitigation: run the first several rounds with conservative default weights to build a foundation of quality experiences, then activate the MetaAgent.
Resolution — V5’s significance isn’t “another version”
Before V5, Harness was an executor — it ran fast, modified well, but “which direction” and “how to judge” were set by humans. After V5, it begins to have a sense of direction — it remembers the roads it’s traveled, knows which directions are worth revisiting, can see when it’s circling the same spot, and occasionally ventures down a new path.
But ultimately, who defines “good” — V5 only pushes this question one step further, it doesn’t answer it. The MetaAgent can adjust weights, but weights are attached to five dimensions that humans designed. The day the agent proposes a sixth dimension on its own — that’s the real inflection point.
New V5 files:
| File | Role | Replaces |
|---|---|---|
harness/evaluation/multi_axis.py | 5-dim vector evaluator | dual_evaluator (backward-compatible) |
harness/core/experience.py | Structured experience memory | memory.py (JSONL logbook) |
harness/pipeline/meta_agent.py | Strategy layer: analyze trends, adjust direction | None (entirely new capability) |
harness/core/eval_config.py | Hot-reload config: prompts + weights from disk | None (replaces hardcoded imports) |
harness/tools/self_config.py | Agent self-awareness tool | None (entirely new capability) |
Tool System: Optimization, Not Core
With just a single bash tool, the LLM would do:
| |
This works. So why 30 specialized tools?
| bash only | Specialized tool | Why switch |
|---|---|---|
cat /etc/passwd | read_file rejects it | Security: path check restricts to workspace |
grep outputs 100K lines | grep_search auto-truncates | Cost: won’t blow up context |
sed silently corrupts | edit_file exact match | Control: mismatch = explicit error |
| LLM invents params | Registry validates | Fault tolerance: unknown params blocked |
Tools are safety gloves for the LLM, not superpowers.
Tool Categories (30+)
| |
Path Security (_check_path)
Every file-accessing tool passes a security check before execution:
| |
LLM says “read /etc/passwd” → blocked. “read ../../etc/passwd” → realpath resolves it → still blocked.
Cost Model
The Anthropic API is stateless. Every call resends the full conversation history. So each turn in the tool loop costs more than the last:
| |
The later turns cost exponentially more per marginal tool call. That’s why cutting max_tool_turns from 30 to 20 saves 40% — you’re removing the most expensive turns.
Mitigation Measures
| Mechanism | File | How it works |
|---|---|---|
| Conversation pruning | llm.py | Truncate old tool results when total chars > 150K |
| Proactive compaction | llm.py | After turn 6, replace old tool results with one-line summaries |
| File-read cache | llm.py | Cache read_file results within a tool loop; writes invalidate cache |
| Context injection budget | phase_runner.py | Only inject the most relevant 30K chars of source code |
DeepSeek Cost Estimate
| |
Self-Improvement Loop (Server Deployment)
Operations Quick Reference
| Goal | Command |
|---|---|
| Live logs | ssh server "tail -f ~/harness-everything/logs/harness.log" |
| Commit progress | git log --oneline -20 |
| Push a fix (no restart needed) | git push — harness auto-rebases |
| Change config | Edit config/pipeline_example_self_improve_server.json, push |
| Stop after current chunk | ssh server "touch ~/.config/harness/STOP_AFTER_CHUNK" |
| Resume loop | ssh server "systemctl --user start harness.service" |
| Emergency stop | ssh server "systemctl --user stop harness.service" |
| Full shutdown | stop + disable + clear cron |
Complete Data Flow: From Config to Code Commit
Core Data Structures
| |
Key File Index
| File | Role | One-liner |
|---|---|---|
main.py | Entry point | Parse args, start the loop |
harness/core/llm.py | Most critical | Tool loop: LLM speaks → you execute → feedback → repeat |
harness/core/config.py | Config | JSON → config object, path security validation |
harness/pipeline/pipeline_loop.py | Outer loop | Round orchestration, push, tag, early stop, shutdown |
harness/pipeline/phase_runner.py | Phase execution | Context injection, inner rounds, evaluation, synthesis, hooks |
harness/evaluation/dual_evaluator.py | Quality gate | Two LLMs score in parallel, pick the best proposal (V4) |
harness/evaluation/multi_axis.py | V5 Multi-axis eval | 5-dim vector replaces single score |
harness/core/experience.py | V5 Experience memory | Structured memory: record + reflect + abstract + retrieve |
harness/pipeline/meta_agent.py | V5 Strategy layer | Every N rounds, analyzes trends, adjusts weights & exploration frequency |
harness/core/eval_config.py | V5 Hot-reload | Evaluator prompts + weights loaded from disk every call |
harness/tools/self_config.py | V5 Self-awareness | Lets the agent discover its own config file paths |
harness/tools/registry.py | Tool dispatch | Registration, param validation, exception wrapping |
harness/tools/base.py | Tool security | _check_path workspace boundary enforcement |
harness/pipeline/hooks.py | Verification | Syntax check + git commit (rich metadata) |
deploy/harness.service | Deployment | systemd user service definition |
.github/workflows/deploy.yml | CI/CD | Tag-triggered: smoke test → deploy → restart/rollback |
deploy/heartbeat.sh | Keepalive | Restart after 3-strike systemd failure |
One-Paragraph Summary
Feed project code to an LLM, let it analyze and improve, using tools to read and edit files. A separate LLM call judges the quality; only the best proposals get committed. Multiple rounds iterate, each building on the improved code from the previous round. Because Python modules are loaded once at startup and frozen in memory, the process must restart every N rounds for improvements to take effect. Restart is driven by git tags triggering a GitHub Actions workflow that SSH-deploys and restarts the service — forming an unattended self-improvement loop. V5 introduces multi-axis evaluation (5-dim vector replacing a single score), structured experience memory (not just logging what happened, but distilling what was learned), an exploration mechanism (occasionally venturing bold new directions), and a strategy layer (MetaAgent periodically analyzes trends, adjusts evaluation weights and exploration frequency — the agent can, for the first time, change the standards by which it is judged). The tool system (30+ file/search/execution tools) is essentially safety gloves for the LLM — a single
bashtool could do everything, but it would be less safe and more expensive.