[Project] 3. Harness-Everything — Autonomous AI Code Improvement Harness

Big Picture

diagram

The LLM is the brain, the Harness is the hands, the project code is what gets modified. The LLM never directly touches the filesystem — it only says “I want to do X”, and your code executes it.

The Essence: Three Sentences

  1. The LLM is the engine: Feed project code to a language model, let it analyze, suggest improvements, and write code. At its core, it’s just a while loop asking the LLM “what else can be improved?”
  2. Tools are the hands: The LLM can’t directly read or write files. It uses Anthropic’s tool_use protocol to tell your code “I want to read this file” / “I want to edit this line”, and your code executes it. You could run the whole thing with just a bash tool.
  3. Process restart is the key: Python modules are loaded once at startup and stay frozen in memory. When the LLM modifies its own .py files, the running process still uses the old code. A process restart is the only way to apply improvements. That’s why we have the push → tag → CI deploy → restart loop.

From Simplest to Complete System: Each Layer Solves One Problem

Simplest Version (Conceptual)

while True:
    code = read_all_project_files()
    response = LLM("Here's the code, improve it:" + code)
    write_back(response)

This works. But it runs into problems. Each layer below solves the previous layer’s problem:

diagram

Layer 1: Code Too Large for Context

Problem: When the project gets big, all files don’t fit in the LLM context window.

Solution: Give the LLM tools to choose what to read, instead of stuffing everything in.

diagram

The core loop (call_with_tools() in harness/core/llm.py) is only about 60 lines. Simplified:

for turn in range(max_turns):
    # 1. Ask LLM: "What do you want to do?"
    response = await api.messages.create(messages=conversation, tools=tool_list)

    # 2. If LLM didn't request any tool → it's done, exit
    if not response.tool_calls:
        return response.text

    # 3. LLM wants tools → your code executes each one, collecting results
    tool_results = []
    for call in response.tool_calls:
        result = await registry.execute(call.name, config, call.input)
        tool_results.append(result)

    # 4. Append results to conversation history, continue loop
    conversation.append(tool_results)

That’s it. This is the most important 60 lines in the entire project.
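
To make the shape concrete, here is a self-contained toy version of the same loop with the API and the registry stubbed out. FakeLLM, ToolCall, Response, and the tools dict are inventions for illustration, not the harness's real classes:

```python
# Toy tool loop: a stubbed LLM asks to read one file, then finishes.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    input: dict

@dataclass
class Response:
    text: str = ""
    tool_calls: list = field(default_factory=list)

class FakeLLM:
    """Requests one tool call on the first turn, then answers."""
    def __init__(self):
        self.turn = 0
    def create(self, messages, tools):
        self.turn += 1
        if self.turn == 1:
            return Response(tool_calls=[ToolCall("read_file", {"path": "demo.py"})])
        return Response(text="Looks fine, nothing to improve.")

tools = {"read_file": lambda path: f"<contents of {path}>"}

def call_with_tools(llm, conversation, max_turns=20):
    for turn in range(max_turns):
        response = llm.create(messages=conversation, tools=list(tools))
        if not response.tool_calls:          # no tool requested → done
            return response.text
        results = []
        for call in response.tool_calls:     # your code executes the tool
            results.append(tools[call.name](**call.input))
        conversation.append({"role": "user", "content": results})
    return "max turns reached"
```

Running call_with_tools(FakeLLM(), [...]) goes through exactly one tool turn and then returns the model's final text, which is the whole protocol in miniature.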

Layer 2: LLM Introduces Bugs

Problem: LLM-generated code may have syntax errors or break the project.

Solution: Automatically run python -m py_compile after every code change. If it fails, feed the error back to the LLM to fix.

LLM edits code → py_compile error → error fed back to LLM → LLM fixes → passes

This is SyntaxCheckHook in harness/pipeline/hooks.py.
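
A minimal sketch of the idea, assuming the hook only needs to return None on success or the compiler's error text on failure (the real SyntaxCheckHook in harness/pipeline/hooks.py has more context around it):

```python
import py_compile

def syntax_check(path: str):
    """Return None if the file compiles, else the error text to feed back to the LLM."""
    try:
        py_compile.compile(path, doraise=True)  # raises instead of printing
        return None
    except py_compile.PyCompileError as err:
        return str(err)
```

A non-None return value gets appended to the conversation, so the next LLM turn sees the exact SyntaxError it has to fix.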

Layer 3: Can’t Tell if Changes Are Good

Problem: The LLM made changes — are they actually good?

Solution: Use two separate LLM calls to judge, running in parallel with isolated perspectives:

  • Basic evaluator: Find the most critical defect (security holes, logic errors, code quality)
  • Diffusion evaluator: Analyze second-order effects (will it break other modules? cause context bloat?)
diagram

This is harness/evaluation/dual_evaluator.py. It’s the LLM judging itself, but the isolated dual perspective reduces self-congratulation.
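
A sketch of the parallel dual-call shape, assuming each evaluator returns a (score, critique) pair; the prompts, the stand-in llm callable, and the score averaging are all illustrative, not copied from dual_evaluator.py:

```python
import asyncio

BASIC_PROMPT = "Find the single most critical defect in this change."
DIFFUSION_PROMPT = "Analyze second-order effects: what else could this break?"

async def dual_evaluate(llm, proposal):
    # Two independent calls with different prompts, run concurrently,
    # so neither critique can anchor on the other.
    basic, diffusion = await asyncio.gather(
        llm(BASIC_PROMPT, proposal),
        llm(DIFFUSION_PROMPT, proposal),
    )
    score = (basic[0] + diffusion[0]) / 2  # assumed aggregation
    return score, {"basic": basic, "diffusion": diffusion}

async def fake_llm(prompt, proposal):
    # Stand-in for a real API call; returns (score, critique).
    return (7.5, f"critique from: {prompt.split()[0]}")
```

The isolation matters more than the aggregation: because each call only sees its own prompt, a flattering basic critique cannot suppress a harsh diffusion one.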

Layer 4: Small Steps, Many Rounds

Problem: Asking the LLM to “fix everything at once” produces poor results. Big changes break things.

Solution: Split into multiple phases, each focused on one thing, iterating round by round:

diagram

This is harness/pipeline/pipeline_loop.py (outer loop) and harness/pipeline/phase_runner.py (phase execution).
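
A minimal sketch of that outer loop, assuming a caller-supplied run_phase callable that returns a score (the real signatures in pipeline_loop.py differ):

```python
def run_rounds(run_phase, phases, outer_rounds=10, patience=5):
    """Iterate rounds over all phases; stop early after `patience` stale rounds."""
    best, stale = float("-inf"), 0
    for _ in range(outer_rounds):
        score = max(run_phase(p) for p in phases)  # best result this round
        if score > best:
            best, stale = score, 0                 # improvement resets patience
        else:
            stale += 1                             # no improvement
            if stale >= patience:
                break                              # early stop
    return best
```

Each round builds on the code committed in the previous one, so small focused phases compound instead of one big risky rewrite.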

Layer 5: Changes Don’t Take Effect

Problem: The LLM modified harness/core/llm.py in Round 3, but the running process still uses the version loaded at Round 1. The improvement is invisible.

Solution: The restart loop.

# At process startup
import harness.core.llm  # loaded into memory, never changes after this

# Round 5: LLM edits llm.py on disk
edit_file("harness/core/llm.py", ...)  # disk changes

# Round 6: process still uses the old in-memory version
# Python doesn't auto-reimport

# Only a process restart picks up the new code

This is the most important architectural decision. Without this restart loop, self-improvement is fake — the LLM thinks it changed the code, but the changes never execute.
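
The frozen-module behavior is easy to reproduce. This is a self-contained demo (demo_mod is a throwaway module created just for it):

```python
import importlib, pathlib, sys, tempfile

sys.dont_write_bytecode = True            # keep the demo free of .pyc caching
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "demo_mod.py").write_text("VERSION = 1\n")
sys.path.insert(0, tmp)

import demo_mod                           # loaded into memory once
assert demo_mod.VERSION == 1

pathlib.Path(tmp, "demo_mod.py").write_text("VERSION = 2\n")   # "LLM edits the file"
assert demo_mod.VERSION == 1              # the process still runs the old code

importlib.invalidate_caches()
importlib.reload(demo_mod)                # explicit reload, or a process restart
assert demo_mod.VERSION == 2
```

The harness opts for a full process restart rather than importlib.reload, because reload does not update objects elsewhere that already hold references to the old code.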

Layer 6: Keep the Loop Alive

Problem: Many situations can silently kill the loop.

Problem                            Solution
Context overflow                   Give LLM tools to read on demand
LLM introduces bugs                Syntax check hook auto-validates
Can’t tell good from bad           Dual evaluator scores, pick highest
Changes don’t take effect          push → tag → CI → restart process
Early stop → no tag → loop dies    auto_tag_at_end forces tag on every exit
3 crashes → systemd gives up       Heartbeat cron resets and restarts every 30 min
User push causes conflict          git pull --rebase auto-merges
LLM breaks deploy scripts          SELF-IMPROVEMENT LOOP PROTECTION blocklist in prompts
Bad code deployed                  CI smoke test + rollback to harness-last-good
Disk full                          Cleanup cron deletes old data daily

Tool System: Optimization, Not Core

With just a single bash tool, the LLM would do:

bash("cat harness/core/llm.py")           → read file
bash("grep -rn '_check_path' harness/")   → search
bash("sed -i 's/old/new/' file.py")       → edit file
bash("python3 -m py_compile file.py")     → check syntax

This works. So why 30 specialized tools?

bash only                  Specialized tool              Why switch
cat /etc/passwd            read_file rejects it          Security: path check restricts to workspace
grep outputs 100K lines    grep_search auto-truncates    Cost: won’t blow up context
sed silently corrupts      edit_file exact match         Control: mismatch = explicit error
LLM invents params         Registry validates            Fault tolerance: unknown params blocked
diagram

Tools are safety gloves for the LLM, not superpowers.
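
The "fault tolerance" row can be sketched as a validating dispatcher. The dict-based registry and function names here are illustrative, not the real registry.py API:

```python
# Each tool declares its allowed parameter names; execute() validates the
# tool name and params before running, and converts exceptions into error
# strings fed back to the LLM instead of crashing the loop.
TOOLS = {
    "read_file": {"params": {"path"}, "fn": lambda path: f"<contents of {path}>"},
}

def execute(name: str, params: dict) -> str:
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'"
    unknown = set(params) - tool["params"]
    if unknown:                              # LLM invented a parameter
        return f"Error: unknown parameters {sorted(unknown)}"
    try:
        return tool["fn"](**params)
    except Exception as exc:
        return f"Error: {exc}"               # fed back; loop keeps running
```

Every failure mode becomes a string the LLM can read and react to, which is exactly what keeps the loop alive when the model hallucinates a tool or an argument.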

Tool Categories (30+)

File ops:    read_file, write_file, edit_file, delete_file, move_file, copy_file
Directory:   list_directory, create_directory, tree
Search:      glob_search, grep_search
Git:         git_status, git_diff, git_log
Execution:   bash, python_eval, test_runner
Analysis:    code_analysis, symbol_extractor, cross_reference, ...
Optional:    web_search (must be explicitly enabled)

Path Security (_check_path)

Every file-accessing tool passes a security check before execution:

resolved = os.path.realpath(path)  # resolve symlinks
# Check: null bytes, Unicode homoglyphs, path traversal
if resolved not in allowed_paths:
    reject

LLM says “read /etc/passwd” → blocked. “read ../../etc/passwd” → realpath resolves it → still blocked.
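
A runnable sketch of that check, assuming a prefix test against allowed workspace roots (the real _check_path in harness/tools/base.py also screens homoglyphs and more):

```python
import os

def check_path(path: str, allowed_roots: list) -> str:
    """Resolve `path` and reject anything outside the allowed roots."""
    if "\x00" in path:
        raise ValueError("null byte in path")
    resolved = os.path.realpath(path)        # collapses ../ and resolves symlinks
    for root in allowed_roots:
        root = os.path.realpath(root)
        if resolved == root or resolved.startswith(root + os.sep):
            return resolved
    raise PermissionError(f"{path!r} is outside the workspace")
```

Checking the resolved path, not the raw string, is the whole trick: a prefix test on the raw path would wave through both symlinks and ../ traversal.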


Cost Model

The Anthropic API is stateless. Every call resends the full conversation history. So each turn in the tool loop costs more than the last:

Turn  1: send [system prompt + file context + user instruction]   20K tokens
Turn  5: send [above + 4 rounds of tool calls and results]       30K tokens
Turn 10: send [above + 9 rounds]                                 40K tokens
Turn 20: send [above + 19 rounds]                                60K tokens   most expensive
diagram

Later turns cost the most per marginal tool call: per-turn input grows linearly with the history, so total input cost grows roughly quadratically with the number of turns. That’s why cutting max_tool_turns from 30 to 20 saves about 40% — you’re removing the most expensive turns.
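
The arithmetic can be sketched with the rough figures above, assuming a 20K-token first turn and roughly 2K tokens of history added per turn (both numbers are approximations for illustration):

```python
def total_input_tokens(turns, base=20_000, growth=2_000):
    # Every turn resends the whole history: per-turn input is linear in the
    # turn index, so the cumulative total is quadratic in the turn count.
    return sum(base + growth * (t - 1) for t in range(1, turns + 1))

t20 = total_input_tokens(20)   # 780,000 tokens
t30 = total_input_tokens(30)   # 1,470,000 tokens
saving = 1 - t20 / t30         # ≈ 0.47
```

Under these assumed numbers the cap cut avoids about 47% of input tokens, the same ballpark as the ~40% quoted above; the exact figure depends on how fast the history grows.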

Mitigation Measures

Mechanism                 File             How it works
Conversation pruning      llm.py           Truncate old tool results when total chars > 150K
Proactive compaction      llm.py           After turn 6, replace old tool results with one-line summaries
File-read cache           llm.py           Cache read_file results within a tool loop; writes invalidate cache
Context injection budget  phase_runner.py  Only inject the most relevant 30K chars of source code
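
The pruning row can be sketched as follows. The message shape and the is_tool_result flag are hypothetical stand-ins; only the 150K-character budget comes from the table:

```python
def prune(conversation, max_chars=150_000, keep_last=4):
    """Truncate the oldest tool results until the history fits the budget."""
    def size(msgs):
        return sum(len(str(m.get("content", ""))) for m in msgs)
    if size(conversation) <= max_chars:
        return conversation
    pruned = list(conversation)
    for i, msg in enumerate(pruned[:-keep_last]):   # never touch recent turns
        if msg.get("is_tool_result"):
            pruned[i] = {**msg, "content": "[old tool result truncated]"}
        if size(pruned) <= max_chars:
            break
    return pruned
```

Old tool results are the cheapest thing to drop: the LLM already acted on them, and a one-line placeholder preserves the shape of the conversation.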

DeepSeek Cost Estimate

Cache miss: $0.28 / million input tokens
Cache hit:  $0.028 / million input tokens (typical hit rate ~90%)
Output:     $0.42 / million output tokens

Per chunk (6-10 rounds x 4-5 phases):
  Input:  ~30M tokens → ~$3
  Output: ~1M tokens  → ~$0.4
  Total:  ~$3.5 / chunk

Self-Improvement Loop (Server Deployment)

diagram

Operations Quick Reference

Goal                            Command
Live logs                       ssh server "tail -f ~/harness-everything/logs/harness.log"
Commit progress                 git log --oneline -20
Push a fix (no restart needed)  git push — harness auto-rebases
Change config                   Edit config/pipeline_example_self_improve_server.json, push
Stop after current chunk        ssh server "touch ~/.config/harness/STOP_AFTER_CHUNK"
Resume loop                     ssh server "systemctl --user start harness.service"
Emergency stop                  ssh server "systemctl --user stop harness.service"
Full shutdown                   stop + disable + clear cron

Complete Data Flow: From Config to Code Commit

diagram

Core Data Structures

PipelineConfig                     # Top-level config
├── harness: HarnessConfig         #   Model/API/workspace/tools
│   ├── model: "deepseek-chat"
│   ├── base_url: "https://api.deepseek.com/anthropic"
│   ├── workspace: "/home/ubuntu/harness-everything"
│   ├── allowed_paths: [workspace]
│   └── max_tool_turns: 20
├── phases: [PhaseConfig]          #   Phase list
│   ├── name, mode (debate/implement)
│   ├── system_prompt (with $file_context template vars)
│   └── glob_patterns (which files to inject)
├── outer_rounds: 10               #   Rounds per chunk
├── patience: 5                    #   Early stop after N stale rounds
├── auto_push_interval: 1          #   Push every round
└── auto_tag_at_end: true          #   Force tag on every exit

InnerResult                        # Single attempt result
├── proposal: str                  #   LLM's proposal or changes
├── dual_score                     #   Dual scores
│   ├── basic: (score, critique)   #     Defect evaluation
│   └── diffusion: (score, critique) #   Ripple effects
└── tool_call_log: [dict]          #   Tool call records

PhaseResult                        # Phase result
├── synthesis: str                 #   Synthesized final proposal
├── best_score: float              #   Highest score
└── inner_results: [InnerResult]   #   All attempts
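
One possible Python rendering of the result structures above as dataclasses; the field types follow the tree, but the class definitions are illustrative, not copied from the harness:

```python
from dataclasses import dataclass, field

@dataclass
class DualScore:
    basic: tuple      # (score, critique) from the defect evaluator
    diffusion: tuple  # (score, critique) from the ripple-effect evaluator

@dataclass
class InnerResult:
    proposal: str                  # LLM's proposal or changes
    dual_score: DualScore
    tool_call_log: list = field(default_factory=list)

@dataclass
class PhaseResult:
    synthesis: str                 # synthesized final proposal
    best_score: float              # highest score among attempts
    inner_results: list = field(default_factory=list)
```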

Key File Index

File                                  Role             One-liner
main.py                               Entry point      Parse args, start the loop
harness/core/llm.py                   Most critical    Tool loop: LLM speaks → you execute → feedback → repeat
harness/core/config.py                Config           JSON → config object, path security validation
harness/pipeline/pipeline_loop.py     Outer loop       Round orchestration, push, tag, early stop, shutdown
harness/pipeline/phase_runner.py      Phase execution  Context injection, inner rounds, evaluation, synthesis, hooks
harness/evaluation/dual_evaluator.py  Quality gate     Two LLMs score in parallel, pick the best proposal
harness/tools/registry.py             Tool dispatch    Registration, param validation, exception wrapping
harness/tools/base.py                 Tool security    _check_path workspace boundary enforcement
harness/pipeline/hooks.py             Verification     Syntax check + git commit (rich metadata)
deploy/harness.service                Deployment       systemd user service definition
.github/workflows/deploy.yml          CI/CD            Tag-triggered: smoke test → deploy → restart/rollback
deploy/heartbeat.sh                   Keepalive        Restart after 3-strike systemd failure

One-Paragraph Summary

Feed project code to an LLM, let it analyze and improve, using tools to read and edit files. A separate LLM call judges the quality; only the best proposals get committed. Multiple rounds iterate, each building on the improved code from the previous round. Because Python modules are loaded once at startup and frozen in memory, the process must restart every N rounds for improvements to take effect. Restart is driven by git tags triggering a GitHub Actions workflow that SSH-deploys and restarts the service — forming an unattended self-improvement loop. The tool system (30+ file/search/execution tools) is essentially safety gloves for the LLM — a single bash tool could do everything, but it would be less safe and more expensive.