[Project] 3. Harness-Everything — Autonomous AI Code Improvement Harness

Big Picture
The LLM is the brain, the Harness is the hands, the project code is what gets modified. The LLM never directly touches the filesystem — it only says “I want to do X”, and your code executes it.
The Essence: Three Sentences
- The LLM is the engine: Feed project code to a language model, let it analyze, suggest improvements, and write code. At its core, it’s just a while loop asking the LLM “what else can be improved?”
- Tools are the hands: The LLM can’t directly read or write files. It uses Anthropic’s tool_use protocol to tell your code “I want to read this file” / “I want to edit this line”, and your code executes it. You could run the whole thing with just a `bash` tool.
- Process restart is the key: Python modules are loaded once at startup and stay frozen in memory. When the LLM modifies its own `.py` files, the running process still uses the old code. A process restart is the only way to apply improvements. That’s why we have the push → tag → CI deploy → restart loop.
From Simplest to Complete System: Each Layer Solves One Problem
Simplest Version (Conceptual)
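As a runnable sketch — `read_all_files`, `ask_llm`, and `apply_changes` are illustrative stand-ins, injected as arguments so the skeleton can actually run:

```python
def improve_loop(read_all_files, ask_llm, apply_changes, project_dir="."):
    """Conceptual core: keep asking the LLM 'what else can be improved?'."""
    rounds = 0
    while True:
        code = read_all_files(project_dir)                 # current state
        reply = ask_llm(f"What can be improved?\n{code}")  # the brain
        if reply.strip() == "NOTHING":                     # LLM is satisfied
            return rounds
        apply_changes(project_dir, reply)                  # the hands
        rounds += 1
```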
This works. But it runs into problems. Each layer below solves the previous layer’s problem:
Layer 1: Code Too Large for Context
Problem: When the project gets big, all files don’t fit in the LLM context window.
Solution: Give the LLM tools to choose what to read, instead of stuffing everything in.
The core code (`call_with_tools()` in `harness/core/llm.py`) is only 60 lines.
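In outline, the loop looks something like this — a sketch of the shape against the Anthropic Messages API response format, not the project’s actual code; `client` is any object with a matching `create` method:

```python
def call_with_tools(client, tools, messages, model="claude-sonnet-4",
                    max_turns=20):
    """Minimal tool loop: the LLM speaks, we execute, we feed results back.

    `tools` maps tool name -> Python callable; `client.create` is assumed to
    return responses shaped like the Anthropic Messages API (an assumption
    for this sketch -- the real harness/core/llm.py differs in details).
    """
    response = {"stop_reason": "end_turn", "content": []}
    for _ in range(max_turns):
        response = client.create(model=model, messages=messages)
        if response["stop_reason"] != "tool_use":
            return response                    # final answer, no tool call
        results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = tools[block["name"]](**block["input"])  # execute
                results.append({"type": "tool_result",
                                "tool_use_id": block["id"],
                                "content": str(output)})
        messages.append({"role": "assistant", "content": response["content"]})
        messages.append({"role": "user", "content": results})
    return response                            # turn budget exhausted
```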
That’s it. This is the most important 60 lines in the entire project.
Layer 2: LLM Introduces Bugs
Problem: LLM-generated code may have syntax errors or break the project.
Solution: Automatically run `python -m py_compile` after every code change. If it fails, feed the error back to the LLM to fix.
This is `SyntaxCheckHook` in `harness/pipeline/hooks.py`.
Layer 3: Can’t Tell if Changes Are Good
Problem: The LLM made changes — are they actually good?
Solution: Use two separate LLM calls to judge, running in parallel with isolated perspectives:
- Basic evaluator: Find the most critical defect (security holes, logic errors, code quality)
- Diffusion evaluator: Analyze second-order effects (will it break other modules? cause context bloat?)
This is `harness/evaluation/dual_evaluator.py`. It’s the LLM judging itself, but the isolated dual perspective reduces self-congratulation.
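The parallel, isolated calls can be sketched like this (`call_llm` and the prompt wording are illustrative placeholders; neither evaluator ever sees the other’s output):

```python
from concurrent.futures import ThreadPoolExecutor

def dual_evaluate(call_llm, diff):
    """Two isolated evaluator calls run in parallel on the same diff.

    `call_llm(system, user)` is an illustrative stand-in for a real LLM
    call; the prompts below paraphrase the two perspectives.
    """
    basic = "Find the most critical defect (security, logic, code quality)."
    diffusion = "Analyze second-order effects: broken modules, context bloat."
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_basic = pool.submit(call_llm, basic, diff)
        f_diffusion = pool.submit(call_llm, diffusion, diff)
        return {"basic": f_basic.result(), "diffusion": f_diffusion.result()}
```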
Layer 4: Small Steps, Many Rounds
Problem: Asking the LLM to “fix everything at once” produces poor results. Big changes break things.
Solution: Split the work into multiple phases, each focused on one thing, iterating round by round.
This is `harness/pipeline/pipeline_loop.py` (outer loop) and `harness/pipeline/phase_runner.py` (phase execution).
Layer 5: Changes Don’t Take Effect
Problem: The LLM modified `harness/core/llm.py` in Round 3, but the running process still uses the version loaded at Round 1. The improvement is invisible.
Solution: The restart loop.
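The end-of-round decision reduces to a small plan (a sketch — the tag naming and `restart_every` cadence are invented for illustration; the real logic lives in harness/pipeline/pipeline_loop.py):

```python
def restart_plan(round_num, restart_every=5):
    """Git commands to run at the end of a round (illustrative sketch).

    Every `restart_every` rounds, push a tag: the tag triggers the CI
    deploy, which restarts the service -- the only way the LLM's edits
    to its own .py files ever take effect.
    """
    cmds = [["git", "push"]]                         # publish this round
    if round_num % restart_every == 0:
        tag = f"harness-round-{round_num}"           # hypothetical tag name
        cmds.append(["git", "tag", tag])
        cmds.append(["git", "push", "origin", tag])  # fires the CI workflow
    return cmds
```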
This is the most important architectural decision. Without this restart loop, self-improvement is fake — the LLM thinks it changed the code, but the changes never execute.
Layer 6: Keep the Loop Alive
Problem: Many situations can silently kill the loop.
| Problem | Solution |
|---|---|
| Context overflow | Give the LLM tools to read on demand |
| LLM introduces bugs | Syntax check hook auto-validates |
| Can’t tell good from bad | Dual evaluator scores, pick highest |
| Changes don’t take effect | push → tag → CI → restart process |
| Early stop → no tag → loop dies | `auto_tag_at_end` forces a tag on every exit |
| 3 crashes → systemd gives up | Heartbeat cron resets and restarts every 30 min |
| User push causes conflict | `git pull --rebase` auto-merges |
| LLM breaks deploy scripts | SELF-IMPROVEMENT LOOP PROTECTION blocklist in prompts |
| Bad code deployed | CI smoke test + rollback to `harness-last-good` |
| Disk full | Cleanup cron deletes old data daily |
Tool System: Optimization, Not Core
With just a single `bash` tool, the LLM would do everything through raw shell commands — `cat` to read, `grep` to search, `sed` to edit.
This works. So why 30 specialized tools?
| `bash` only | Specialized tool | Why switch |
|---|---|---|
| `cat /etc/passwd` | `read_file` rejects it | Security: path check restricts to workspace |
| `grep` outputs 100K lines | `grep_search` auto-truncates | Cost: won’t blow up context |
| `sed` silently corrupts | `edit_file` exact match | Control: mismatch = explicit error |
| LLM invents params | Registry validates | Fault tolerance: unknown params blocked |
Tools are safety gloves for the LLM, not superpowers.
Tool Categories (30+)
- File tools (`read_file`, `edit_file`, …)
- Search tools (`grep_search`, …)
- Execution tools (`bash`, …)
Path Security (`_check_path`)
Every file-accessing tool passes a security check before execution:
LLM says “read `/etc/passwd`” → blocked. “read `../../etc/passwd`” → `realpath` resolves it → still blocked.
Cost Model
The Anthropic API is stateless. Every call resends the full conversation history. So each turn in the tool loop costs more than the last:
Later turns cost more per marginal tool call because the resent history keeps growing, so total input cost grows roughly quadratically with turn count. That’s why cutting `max_tool_turns` from 30 to 20 saves about 40% — you’re removing the most expensive turns.
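A toy model makes this concrete — the per-turn token count is invented for illustration, but the quadratic shape is what matters:

```python
def total_input_tokens(turns, tokens_per_turn=2_000):
    """Stateless API: turn t resends the whole history so far (~t turns of
    text), so total input is the triangular sum 1 + 2 + ... + turns.
    `tokens_per_turn` is a made-up average for illustration."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

# The tail dominates: in this model a 30-turn loop costs 930_000 input
# tokens, a 20-turn loop 420_000 -- so turns 21-30 alone cost 510_000,
# more than turns 1-20 combined.
```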
Mitigation Measures
| Mechanism | File | How it works |
|---|---|---|
| Conversation pruning | `llm.py` | Truncate old tool results when total chars > 150K |
| Proactive compaction | `llm.py` | After turn 6, replace old tool results with one-line summaries |
| File-read cache | `llm.py` | Cache `read_file` results within a tool loop; writes invalidate the cache |
| Context injection budget | `phase_runner.py` | Only inject the most relevant 30K chars of source code |
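The pruning mechanism in the table can be sketched as follows — the message shape, the `is_tool_result` flag, and the keep-recent window are illustrative, not the project’s actual code:

```python
def prune_history(messages, max_chars=150_000, keep_recent=6):
    """Truncate old tool results once the conversation exceeds the budget.

    Sketch: recent messages stay intact; older tool results have their
    bodies replaced with a short stub to free up context.
    """
    total = sum(len(str(m.get("content", ""))) for m in messages)
    if total <= max_chars:
        return messages
    cutoff = len(messages) - keep_recent       # never touch recent messages
    pruned = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg.get("is_tool_result"):
            pruned.append({**msg, "content": "[tool result truncated]"})
        else:
            pruned.append(msg)
    return pruned
```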
Self-Improvement Loop (Server Deployment)
Operations Quick Reference
| Goal | Command |
|---|---|
| Live logs | `ssh server "tail -f ~/harness-everything/logs/harness.log"` |
| Commit progress | `git log --oneline -20` |
| Push a fix (no restart needed) | `git push` — harness auto-rebases |
| Change config | Edit `config/pipeline_example_self_improve_server.json`, push |
| Stop after current chunk | `ssh server "touch ~/.config/harness/STOP_AFTER_CHUNK"` |
| Resume loop | `ssh server "systemctl --user start harness.service"` |
| Emergency stop | `ssh server "systemctl --user stop harness.service"` |
| Full shutdown | stop + disable + clear cron |
Complete Data Flow: From Config to Code Commit
Key File Index
| File | Role | One-liner |
|---|---|---|
| `main.py` | Entry point | Parse args, start the loop |
| `harness/core/llm.py` | Most critical | Tool loop: LLM speaks → you execute → feedback → repeat |
| `harness/core/config.py` | Config | JSON → config object, path security validation |
| `harness/pipeline/pipeline_loop.py` | Outer loop | Round orchestration, push, tag, early stop, shutdown |
| `harness/pipeline/phase_runner.py` | Phase execution | Context injection, inner rounds, evaluation, synthesis, hooks |
| `harness/evaluation/dual_evaluator.py` | Quality gate | Two LLMs score in parallel, pick the best proposal |
| `harness/tools/registry.py` | Tool dispatch | Registration, param validation, exception wrapping |
| `harness/tools/base.py` | Tool security | `_check_path` workspace boundary enforcement |
| `harness/pipeline/hooks.py` | Verification | Syntax check + git commit (rich metadata) |
| `deploy/harness.service` | Deployment | systemd user service definition |
| `.github/workflows/deploy.yml` | CI/CD | Tag-triggered: smoke test → deploy → restart/rollback |
| `deploy/heartbeat.sh` | Keepalive | Restart after 3-strike systemd failure |
One-Paragraph Summary
Feed project code to an LLM, let it analyze and improve, using tools to read and edit files. A separate LLM call judges the quality; only the best proposals get committed. Multiple rounds iterate, each building on the improved code from the previous round. Because Python modules are loaded once at startup and frozen in memory, the process must restart every N rounds for improvements to take effect. Restart is driven by git tags triggering a GitHub Actions workflow that SSH-deploys and restarts the service — forming an unattended self-improvement loop. The tool system (30+ file/search/execution tools) is essentially safety gloves for the LLM — a single `bash` tool could do everything, but it would be less safe and more expensive.