Alibaba page-agent: The GUI Agent That Lives Inside Your Webpage

By Prahlad Menon 4 min read

Most browser automation tools work the same way: something runs outside the browser and controls it. Playwright, Puppeteer, Selenium, and browser-use all sit on the outside looking in, taking screenshots or parsing accessibility trees over a protocol such as the Chrome DevTools Protocol (CDP) or WebDriver.

Alibaba’s page-agent flips that model. The agent lives inside the webpage as plain JavaScript. It reads the DOM directly as text, understands natural language commands, and manipulates the page from within. No headless browser. No screenshots. No multimodal model.

One script tag:

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>

That’s the entire integration for evaluation.

Why This Matters

The standard browser automation stack looks like this:

Your Code → Playwright/Puppeteer → CDP → Browser → Page
     ↑                                          |
     └──── Screenshot/DOM snapshot ─────────────┘

Every cycle involves serializing the page state, sending it to your code (or an LLM), getting a decision back, then sending actions through the protocol. Screenshots require multimodal models. Accessibility trees are lossy. It’s slow, fragile, and expensive.

page-agent collapses that entire stack:

Page ← page-agent.js ← LLM API

The agent is JavaScript running in the page context. It has direct access to the DOM. No serialization round-trips, no screenshot overhead, no vision model costs. It sends a text representation of relevant DOM elements to any text-based LLM and executes the response directly.
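To make "a text representation of relevant DOM elements" concrete, here is a minimal sketch of how interactive elements might be flattened into an indexed text list. The helper name describeElements and the output format are assumptions made for illustration; page-agent's actual serialization may look quite different.

```javascript
// Hypothetical helper: turn interactive elements into an indexed text list
// that a text-only LLM can read. This is NOT page-agent's real format.
function describeElements(elements) {
  const lines = [];
  Array.from(elements).forEach((el, index) => {
    const tag = el.tagName.toLowerCase();
    // Prefer an explicit label, then a placeholder, then the visible text.
    const label =
      el.getAttribute('aria-label') ||
      el.getAttribute('placeholder') ||
      (el.textContent || '').trim();
    const type = el.getAttribute('type');
    const typePart = type ? ` type="${type}"` : '';
    lines.push(`[${index}] <${tag}${typePart}> ${label}`.trimEnd());
  });
  return lines.join('\n');
}
```

Inside a page, this could be called as describeElements(document.querySelectorAll('a, button, input, select, textarea')). The index in each line gives the model a short, stable handle to refer back to when it answers.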

How It Works

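import { PageAgent } from 'page-agent'

// Point the agent at any OpenAI-compatible chat endpoint.
const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY', // placeholder; replace with your own key
  language: 'en-US',
})

// Describe the task in plain language; the agent works out the DOM actions.
await agent.execute('Fill the shipping form with name John Smith, 123 Main St, New York, NY 10001')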

The agent:

  1. Reads the current DOM and extracts interactive elements as text
  2. Sends this text representation + your command to the LLM
  3. Receives structured actions (click, type, select)
  4. Executes them directly on the DOM

No screenshots taken. No pixels processed. Just text in, actions out.
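Steps 3 and 4 can be sketched roughly as follows. The action shape ({ action, index, value }) and the function name applyActions are assumptions for illustration, not page-agent's real schema. Elements are looked up by the index they were given when the page was described to the model.

```javascript
// Hypothetical action shape: { action: 'click' | 'type' | 'select', index, value? }.
// page-agent's real action schema may differ.
function applyActions(elements, actions) {
  const log = [];
  for (const step of actions) {
    const el = elements[step.index];
    if (!el) {
      log.push(`skip ${step.action}: no element at index ${step.index}`);
      continue;
    }
    switch (step.action) {
      case 'click':
        el.click();
        break;
      case 'type':
      case 'select':
        // Set the value, then fire the events that form code commonly listens for.
        el.value = step.value;
        el.dispatchEvent(new Event('input', { bubbles: true }));
        el.dispatchEvent(new Event('change', { bubbles: true }));
        break;
      default:
        log.push(`skip unknown action: ${step.action}`);
        continue;
    }
    log.push(`${step.action} [${step.index}]`);
  }
  return log;
}
```

Unknown actions and missing elements are skipped and logged rather than thrown, since a model's output cannot be assumed to match the page. Framework-controlled inputs may need more than a plain value assignment, which this sketch does not handle.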

Use Cases

SaaS AI Copilot

This is the killer use case. You have a complex web app (an ERP, a CRM, an admin dashboard) and you want to add an AI assistant that can actually do things in the UI. With page-agent, that’s a few lines of code. No backend rewrite. The copilot understands your UI by reading the DOM.

Smart Form Filling

Turn 20-click workflows into one sentence. “Create a new customer with these details and assign them to the Enterprise tier.” The agent finds the forms, fills them, clicks the buttons.

Accessibility

Natural language as a universal interface. Voice commands → text → page-agent → DOM actions. Any web app becomes accessible through language, regardless of how it was built.
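One way the voice pipeline could be wired up is sketched below. It assumes a recognizer that behaves like the browser's Web Speech API SpeechRecognition object and an agent exposing execute(text), as in the PageAgent example. The function name connectVoiceToAgent is hypothetical.

```javascript
// Sketch: forward each recognized phrase to the agent.
// `recognition` is expected to behave like a Web Speech API SpeechRecognition
// object; `agent` is anything with an execute(text) method, such as a PageAgent.
function connectVoiceToAgent(recognition, agent) {
  recognition.onresult = (event) => {
    // Take the top alternative of the most recent result.
    const latest = event.results[event.results.length - 1];
    const transcript = latest[0].transcript.trim();
    if (transcript) {
      return agent.execute(transcript);
    }
    return undefined;
  };
  return recognition;
}
```

In a browser that supports speech recognition, the recognizer would typically come from window.SpeechRecognition || window.webkitSpeechRecognition, and calling start() on the returned object begins listening. Empty transcripts are dropped so that silence never reaches the model.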

Legacy App Modernization

That internal tool from 2014 with no API? page-agent can drive it from the inside. Wrap it with natural language commands without touching the original codebase.

Bring Your Own LLM

page-agent works with any OpenAI-compatible API:

  • Qwen (Alibaba’s own, works great)
  • GPT-4o, GPT-4o-mini
  • Claude
  • Gemini
  • Local models via Ollama

Since it only sends text (not screenshots), you don’t need expensive multimodal models. A good text model is sufficient.
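Switching providers should then mostly be a matter of changing the model and baseURL. The sketch below reuses the config fields from the example in How It Works. The helper agentConfigFor, the OpenAI and Ollama model names, and the placeholder key are assumptions for illustration, and whether page-agent needs anything beyond these fields for a given provider is not confirmed here.

```javascript
// Sketch of per-provider settings. The field names (model, baseURL, apiKey,
// language) follow the example earlier in this article; the model names for
// OpenAI and Ollama are illustrative assumptions.
const PROVIDERS = {
  qwen: {
    model: 'qwen3.5-plus',
    baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  },
  openai: {
    model: 'gpt-4o-mini',
    baseURL: 'https://api.openai.com/v1',
  },
  ollama: {
    // Ollama's OpenAI-compatible endpoint on its default local port.
    model: 'qwen2.5',
    baseURL: 'http://localhost:11434/v1',
  },
};

function agentConfigFor(provider, apiKey) {
  const preset = PROVIDERS[provider];
  if (!preset) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  return { ...preset, apiKey, language: 'en-US' };
}
```

The result can be passed straight to the constructor, for example new PageAgent(agentConfigFor('ollama', 'unused')). A local Ollama server generally ignores the key, but the field is kept so every provider yields the same shape.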

page-agent vs. The Automation Stack

|                        | page-agent                   | browser-use             | Playwright             | Selenium             |
|------------------------|------------------------------|-------------------------|------------------------|----------------------|
| Runs where             | Inside the page (JS)         | External (Python)       | External (Node/Python) | External (any)       |
| Needs browser instance | No (already in one)          | Yes (Chromium)          | Yes                    | Yes                  |
| Screenshots needed     | No                           | Yes (vision model)      | Optional               | Optional             |
| LLM requirement        | Any text model               | Multimodal preferred    | N/A (scripted)         | N/A (scripted)       |
| Integration effort     | 1 script tag                 | Python setup + browser  | Full test framework    | Full test framework  |
| Multi-page support     | Chrome extension (optional)  | Built-in                | Built-in               | Built-in             |
| Use case               | In-app copilot, form filling | Web scraping, automation | Testing, automation   | Testing, automation  |
| Language               | JavaScript/TypeScript        | Python                  | Node.js/Python         | Multi-language       |
| License                | MIT                          | MIT                     | Apache 2.0             | Apache 2.0           |

Key tradeoff: page-agent excels at single-page interactions embedded in your product. For cross-site scraping or complex multi-tab workflows, external tools like browser-use or Playwright still have the edge.

page-agent vs. WebMCP: Two Sides of the Same Problem

Both page-agent and WebMCP are trying to solve the same fundamental issue: AI agents are terrible at navigating human-designed web interfaces. They approach it from opposite directions.

WebMCP (server-side): The website owner exposes structured tools (proper function calls with schemas) that agents can invoke directly. No DOM parsing needed. The site adapts to agents. This is Google and Microsoft’s proposal, currently in Chrome 146 Canary.

page-agent (client-side): The agent reads the DOM as text and figures out how to interact with it. No cooperation from the website needed. The agent adapts to sites.

|                         | page-agent                         | WebMCP                               |
|-------------------------|------------------------------------|--------------------------------------|
| Who adapts              | Agent adapts to site               | Site adapts to agent                 |
| Site cooperation needed | No                                 | Yes (must implement)                 |
| Works on existing sites | ✅ Today                           | ❌ Sites must add support            |
| Structured interaction  | Inferred from DOM                  | Explicit schemas                     |
| Reliability             | Good (DOM can change)              | Very high (typed schemas)            |
| Adoption timeline       | Now                                | Years (needs web standard adoption)  |
| Best for                | Your own SaaS copilot, legacy apps | Future web ecosystem                 |

The realistic path: page-agent works today on existing websites. WebMCP is the better long-term solution but requires websites to implement it, and web standards take years to reach meaningful adoption. In practice, agents will likely use page-agent-style DOM reading as the fallback and WebMCP structured tools when available.
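That "structured tool first, DOM fallback second" strategy can be sketched as a small dispatcher. Everything here is hypothetical: structuredTools is a plain object standing in for whatever tools a site exposes and is not the WebMCP API, and domAgent is anything with an execute(text) method.

```javascript
// Hypothetical dispatcher. `structuredTools` stands in for whatever tools a
// site exposes (it is NOT the WebMCP API); `domAgent` is anything with an
// execute(text) method, such as a PageAgent instance.
function runTask(task, structuredTools, domAgent) {
  const tool = structuredTools[task.tool];
  if (typeof tool === 'function') {
    // Preferred path: typed call with explicit arguments.
    return { via: 'structured-tool', result: tool(task.args) };
  }
  // Fallback path: describe the task in natural language and drive the DOM.
  return { via: 'dom-agent', result: domAgent.execute(task.description) };
}
```

A task therefore carries both a tool name with arguments and a plain-language description, so the same request can take either route depending on what the site offers.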

page-agent also ships a beta MCP server, which bridges these approaches: your external agents can control it via MCP while it operates inside the page. Combined with the Chrome extension for multi-page support, you can build browser agents that work from either direction.

The Numbers

  • 15,700+ GitHub stars
  • 1,222 forks
  • MIT license
  • Built on top of browser-use DOM processing patterns
  • npm package: page-agent
  • Active Alibaba maintenance

Getting Started

Quick evaluation (free demo LLM included):

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>

Production (bring your own LLM):

npm install page-agent

Docs: alibaba.github.io/page-agent