ChromePilot

Overview

ChromePilot is a Chrome extension that lets you control any webpage using natural language. Type a command like “click the login button” or “fill in my email”, and ChromePilot executes it automatically — clicking, typing, scrolling, and navigating on your behalf.

  • Built with AI (Claude) assistance: 3 hours for the initial prototype, 5 hours to polish into v1.0
  • Current status: v1.0 — functional and usable, with room for further optimization
  • GitHub: GOODDAYDAY/ChromePilot

Features

| Feature | Description |
| --- | --- |
| Natural Language Control | Type commands like “click the submit button” or “type hello in the search box” |
| Multi-step Automation | Chain complex tasks: “Go to Habitica and complete all my daily tasks” |
| URL Navigation | Say “open YouTube” or “go to google.com” to navigate anywhere |
| Smart Result Extraction | Ask “translate ‘hello’ on Google Translate” and get the answer in the chat |
| Persistent Side Panel | Panel stays open across tab switches (Chrome Side Panel API) |
| Multi-provider LLM Support | Works with OpenAI, Anthropic Claude, GitHub Copilot, Ollama (local), or any OpenAI-compatible API |
| Debug Overlay | Visualize all detected interactive elements with index numbers |
| Teach Mode | Record user actions and save as demonstrations |
| Action Preview & Confirm | Review planned actions with visual highlights before execution; provide feedback to re-analyze |
| Auto-run Mode | Toggle to skip confirmation and execute actions immediately |

Demo

Basic Actions — Click Repetition

Command: “drink water 10 times”

/images/Project%20-%201%20-%20ChromePilot/1.%20drink%20water%2010%20times.gif

ChromePilot identifies the target button and clicks it 10 times automatically.

In-page Navigation — Multi-step Tasks

Command: “go to tasks and drink water 10 times”

/images/Project%20-%201%20-%20ChromePilot/2.%20go%20to%20tasks%20and%20drink%20water%2010%20times.gif

ChromePilot first navigates to the tasks section within the page, then performs the repeated clicking.

Cross-page Navigation — Open URLs & Extract Results

Command: “go to Google Translate and translate ‘what is surprise’ to Chinese”

/images/Project%20-%201%20-%20ChromePilot/3.%20go%20to%20google%20translator%20and%20translat%20what%20is%20superpise%20to%20chinese.gif

ChromePilot opens Google Translate, types the text, and extracts the translation result back to the chat panel.

Cross-site Automation — Navigate & Interact

Command: “go to my GitHub homepage and star the repository ChromePilot”

/images/Project%20-%201%20-%20ChromePilot/4.%20go%20to%20my%20github%20homepage%20and%20star%20the%20repository%20ChromePilot.gif

ChromePilot navigates to GitHub, finds the repository, and clicks the star button.

Debug Overlay — Inspect Detected Elements

Use the eye button to visualize all detected interactive elements with their index numbers.

/images/Project%20-%201%20-%20ChromePilot/5.%20click%20button%2054.gif

The debug overlay shows every interactive element ChromePilot detected, each labeled with an index number. You can directly command “click button 54” to interact with a specific element.

Action Preview & Confirm — Review Before Execution

Actions are highlighted with numbered labels. Confirm to execute, or type feedback and re-analyze.

/images/Project%20-%201%20-%20ChromePilot/6.%20show%20batch%20actions%20with%20confirm%20first.gif

Auto-run Mode — Skip Confirmation

Toggle “Auto-run” to execute actions immediately without preview.

/images/Project%20-%201%20-%20ChromePilot/7.%20show%20the%20auto-run.gif

Architecture

Tech Stack

| Component | Technology |
| --- | --- |
| Platform | Chrome Extension (Manifest V3) |
| Language | Vanilla JavaScript (ES2022+) |
| UI | Chrome Side Panel API |
| AI Integration | Multi-provider LLM client (Anthropic, OpenAI-compatible) |
| Build | None (plain files loaded directly by Chrome) |

Project Structure

src/
├── manifest.json              # Chrome MV3 manifest
├── background/
│   ├── service-worker.js      # Orchestrator: DOM → LLM → Actions loop
│   └── llm-client.js          # Multi-provider LLM client
├── content/
│   ├── content-script.js      # Message handler on web pages
│   ├── dom-extractor.js       # Extracts interactive elements
│   ├── action-executor.js     # Simulates click/type/scroll/read
│   ├── action-previewer.js    # Preview overlay (red borders + step labels)
│   └── action-recorder.js     # Teach mode (recording actions)
├── sidepanel/
│   ├── sidepanel.html         # Chat UI (Chrome Side Panel API)
│   ├── sidepanel.js           # Panel logic & settings
│   └── sidepanel.css          # Styles
├── options/                   # LLM provider configuration page
├── lib/utils.js               # Shared helpers
└── icons/                     # Extension icons

Core Loop

The core execution follows a DOM → LLM → Action loop:

  1. User types a command in the side panel
  2. Service worker extracts interactive elements from the active tab
  3. Elements + command are sent to the configured LLM
  4. LLM returns a list of actions (click, type, scroll, navigate, read)
  5. Actions are previewed with red highlights and step labels (unless Auto-run is on)
  6. User confirms or provides feedback to re-analyze
  7. Confirmed actions are executed sequentially on the page
  8. If the task is not done (done: false), repeat from step 2 with updated DOM context
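The loop above can be sketched as a small orchestrator function. This is an illustrative sketch, not the extension's actual service-worker code; `extractElements`, `callLLM`, and `executeActions` are hypothetical injected dependencies standing in for the real `dom-extractor`, `llm-client`, and `action-executor` modules.

```javascript
// Hypothetical sketch of the DOM → LLM → Action loop; names are
// illustrative assumptions, not the extension's real API.
async function runCommand(command, deps) {
  const { extractElements, callLLM, executeActions, maxSteps = 10 } = deps;
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const elements = await extractElements();                    // step 2: snapshot the DOM
    const plan = await callLLM({ command, elements, history });  // steps 3–4: plan actions
    await executeActions(plan.actions);                          // step 7: run them in order
    history.push(plan.summary);
    if (plan.done) break;                                        // step 8: otherwise loop with fresh DOM
  }
  return history;
}
```

Passing the running `history` back to the model on each round is what lets step 8 refine multi-step tasks instead of replanning from scratch.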

DOM Extraction

The dom-extractor.js module identifies interactive elements on the page through multiple phases:

  • Phase 1: Collect elements matching standard interactive selectors (buttons, inputs, links, ARIA roles)
  • Phase 2: Find framework-rendered clickable elements via cursor:pointer CSS heuristic
  • Phase 3: Filter noise (empty SVGs, hidden elements), deduplicate parent/child overlaps
  • Dialog Detection: Detects modals via native <dialog>, ARIA roles, or CSS heuristics (fixed/absolute positioning + high z-index)
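The Phase 2 heuristic can be expressed as a pure predicate. In the sketch below, `style` stands in for the result of `getComputedStyle(el)`, and the function name and input shape are illustrative assumptions rather than the extractor's real API.

```javascript
// Illustrative sketch of the Phase 2 cursor heuristic: an element that
// matched no standard interactive selector is still treated as clickable
// when a framework styles it with cursor:pointer.
const SVG_TAGS = new Set(['svg', 'path', 'circle', 'rect', 'g']);

function isCursorClickable(tag, matchedInteractiveSelector, style) {
  if (matchedInteractiveSelector) return false; // already collected in Phase 1
  if (SVG_TAGS.has(tag)) return false;          // icon internals are noise
  return style.cursor === 'pointer';
}
```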

Each element is returned with an index:

[1] <button>Click me</button> (in: Header section)
[2] <input type="text" placeholder="Search..."> (in: Navigation)

LLM Integration

The llm-client.js module supports multiple providers through a unified interface:

| Provider | Base URL | Auth Header |
| --- | --- | --- |
| OpenAI | https://api.openai.com | Authorization: Bearer |
| Anthropic Claude | https://api.anthropic.com | x-api-key |
| GitHub Copilot | https://models.inference.ai.azure.com | Authorization: Bearer |
| Ollama (Local) | http://localhost:11434 | None |
| Custom | Any OpenAI-compatible endpoint | Authorization: Bearer |
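In practice the providers differ mostly in their auth header, so a unified client mainly needs to pick the right one. A minimal sketch, assuming a `buildAuthHeaders` helper (the name is an assumption; the `anthropic-version` value follows Anthropic's public API, not this extension's code):

```javascript
// Sketch of per-provider auth header selection (names are assumptions).
function buildAuthHeaders(provider, apiKey) {
  switch (provider) {
    case 'anthropic':
      // Anthropic authenticates via x-api-key and requires a version header.
      return { 'x-api-key': apiKey, 'anthropic-version': '2023-06-01' };
    case 'ollama':
      return {}; // local server, no auth needed
    default:
      // OpenAI, GitHub Copilot, and any custom OpenAI-compatible endpoint
      return { Authorization: `Bearer ${apiKey}` };
  }
}
```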

The system prompt instructs the LLM to respond with structured JSON:

{
  "actions": [
    {"type": "click", "elementIndex": 5},
    {"type": "type", "elementIndex": 12, "text": "hello"}
  ],
  "done": false,
  "summary": "Clicked the search button and typed the query"
}
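Because this JSON drives real actions on a live page, it is worth validating before execution. A hedged sketch of such a guard (the function name and error messages are illustrative, not the extension's actual code):

```javascript
// Illustrative sketch: validate the model's structured reply before
// handing it to the action executor.
const KNOWN_ACTIONS = new Set(['click', 'type', 'scroll', 'navigate', 'read', 'repeat']);

function parseActionPlan(raw) {
  const plan = JSON.parse(raw);
  if (!Array.isArray(plan.actions)) throw new Error('response is missing an actions array');
  for (const action of plan.actions) {
    if (!KNOWN_ACTIONS.has(action.type)) throw new Error(`unknown action type: ${action.type}`);
  }
  return { actions: plan.actions, done: Boolean(plan.done), summary: plan.summary ?? '' };
}
```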

Action Execution

The action-executor.js module simulates real user interactions:

| Action | Behavior |
| --- | --- |
| click | Dispatches MouseEvent, scrolls element into view first |
| type | Focuses input, clears existing value, sets new value with input events |
| scroll | Scrolls page in specified direction |
| navigate | Opens URL in current tab or new tab |
| read | Extracts textContent from target element |
| repeat | Clicks same element N times with configurable delay |

Visual feedback is provided: a red border flashes on each element as it is interacted with.
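The event sequence matters most for the type action: on framework-rendered inputs, setting `.value` alone does not notify React or Vue listeners, so synthetic events must follow. A minimal sketch of the idea (the function name is an assumption):

```javascript
// Illustrative sketch of the type action: replace the value, then fire
// the events framework listeners (React/Vue) watch for.
function simulateType(input, text) {
  input.focus();
  input.value = text; // replaces any existing value
  input.dispatchEvent(new Event('input', { bubbles: true }));
  input.dispatchEvent(new Event('change', { bubbles: true }));
}
```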

Development Challenges

Challenge 1: Noise Element Filtering

A webpage contains far more DOM elements than are useful for automation. Feeding all of them to the LLM wastes tokens and confuses the model. The core question: how to keep only the elements that matter?

Sources of noise:

  • SVG icons inside buttons — each <svg>, <path>, <circle> is a separate element, but none are interactive
  • Empty wrapper <div> and <span> from frameworks (React/Vue) — no text, no label, no role, purely structural
  • Invisible elements: display: none, visibility: hidden, opacity: 0, or zero-size bounding boxes
  • Parent-child duplication: a <div role="button"> wrapping an <a> tag — both get collected, but only one should be in the list
  • cursor: pointer heuristic false positives: decorative elements styled as clickable but serving no interactive purpose

Filtering strategy (three phases):

  1. Visibility check: reject elements with display: none, visibility: hidden, opacity: 0, or zero-size bounding rect. Special case for position: fixed/sticky elements which have no offsetParent
  2. Noise rejection: skip all SVG elements; skip <div>/<span> that have no text content, no aria-label, no id, and no role
  3. Parent-child deduplication: if an element has an interactive ancestor already in the set, keep only the ancestor. Exception: native interactive elements (<a>, <button>, <input>, <textarea>, <select>) are always kept regardless of ancestry

The result: a typical page with 500+ raw DOM elements is reduced to 50–150 meaningful interactive elements that the LLM can reason about effectively.
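The visibility and noise checks from phases 1–2 can be expressed as a pure predicate. The `info` shape below (a precomputed snapshot of one element's tag, text, style, and bounding rect) is an assumption for illustration; the real extractor reads these from live DOM nodes.

```javascript
// Illustrative sketch of the noise filter as a pure predicate.
function isNoise(info) {
  // Phase 1: invisible elements
  if (info.style.display === 'none' || info.style.visibility === 'hidden') return true;
  if (Number(info.style.opacity) === 0) return true;
  if (info.rect.width === 0 || info.rect.height === 0) return true;
  // Phase 2: SVG internals are never interactive on their own
  if (['svg', 'path', 'circle'].includes(info.tag)) return true;
  // Structural wrappers with no text, label, id, or role
  if (['div', 'span'].includes(info.tag) &&
      !info.text && !info.ariaLabel && !info.id && !info.role) return true;
  return false;
}
```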

Challenge 2: Dialog Awareness

The DOM extractor collects interactive elements in DOM order, capped at 150 elements (DEFAULT_MAX_ELEMENTS = 150). This works well for regular pages, but breaks completely when a dialog appears:

  • Dialogs are typically appended to the end of <body> in the DOM
  • The 150 elements from the main page content fill up the quota first
  • Dialog buttons — the very elements the user wants to interact with — get truncated

For example, on Habitica, clicking a character stat opens a modal with action buttons. But the page behind it already has 150+ interactive elements (navigation links, task buttons, sidebar items). The modal’s buttons, sitting at the end of the DOM, never make it into the element list. The LLM cannot see them, so it cannot operate them.

An additional complication: framework-rendered dialogs (Vue/React) often use plain <div> with @click handlers instead of semantic <button> or ARIA roles. These elements have no role, no tabindex, no cursor:pointer — they are invisible to both Phase 1 (selector matching) and Phase 2 (cursor heuristic) of the extractor.

Key insight: dialogs are small. A typical dialog contains 5–20 interactive elements — far fewer than the 150-element cap. There is no reason to limit them.

The solution restructures extraction into a dialog-first strategy:

  1. Detect active dialogs using a three-layer approach:

    • Native <dialog[open]>
    • ARIA attributes: [role="dialog"], [role="alertdialog"], [aria-modal="true"]
    • CSS heuristic fallback: position: fixed/absolute + z-index >= 100 + reasonable size + contains interactive elements
  2. Separate elements into two groups: dialog elements and page background elements

  3. Dialog elements go first with no cap: since dialogs are small, include all of them starting from index [1]

  4. Relaxed filtering inside dialogs: scan all child elements in the dialog container, not just those matching interactive selectors. Include any visible element with direct text content, aria-label, or role. This catches framework-rendered buttons that lack semantic markup

  5. Page elements follow with the original 150 cap: the background page still gets its full quota

  6. Context annotation: dialog elements are labeled with (in: dialog: {title}), and the element list header includes ⚠ Active dialog detected — dialog elements listed first

This approach ensures dialog buttons are always visible to the LLM, regardless of how many elements the background page has.
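The layer-3 CSS heuristic can be sketched as a pure check. Only the position, z-index, and contains-interactive-elements conditions come from the description above; the concrete size thresholds (200×100 minimum, narrower than the viewport) are illustrative assumptions.

```javascript
// Sketch of the CSS-heuristic dialog fallback (layer 3); `info` is a
// precomputed snapshot of a candidate container, an assumption for clarity.
function looksLikeDialog(info) {
  const overlaid = info.position === 'fixed' || info.position === 'absolute';
  const elevated = Number(info.zIndex) >= 100;
  const reasonableSize =
    info.width >= 200 && info.height >= 100 && info.width < info.viewportWidth;
  return overlaid && elevated && reasonableSize && info.interactiveCount > 0;
}
```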

Configuration

LLM Provider Setup

  1. Right-click the ChromePilot icon → Options
  2. Select a provider preset or enter a custom endpoint
  3. Enter the API key and model name
  4. Click Test Connection to verify

Panel Settings

| Setting | Options | Default | Description |
| --- | --- | --- | --- |
| Same Tab Navigation | On / Off | Off | Navigate in current tab instead of opening new tabs |
| Auto-run | On / Off | Off | Skip action preview, execute immediately |
| Max Steps | 5 / 10 / 20 / 50 / Unlimited | 10 | Maximum LLM rounds per command |
| Action Delay | 0s – 5s | 0.5s | Delay between each action execution |

Requirements

  • Chrome 114+ (for Side Panel API support)
  • An LLM API endpoint (cloud or local)

Source Code