Alibaba page-agent: The GUI Agent That Lives Inside Your Webpage
Most browser automation tools work the same way: a process outside the browser drives the browser. Playwright, Puppeteer, Selenium, and browser-use all sit on the outside looking in, taking screenshots or reading DOM and accessibility snapshots over a remote protocol such as the Chrome DevTools Protocol or WebDriver.
Alibaba's page-agent flips that model. The agent lives inside the webpage as plain JavaScript. It reads the DOM directly as text, understands natural language commands, and manipulates the page from within. No headless browser. No screenshots. No multimodal model.
One script tag:
<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>
That's the entire integration for evaluation.
Why This Matters
The standard browser automation stack looks like this:
Your Code → Playwright/Puppeteer → CDP → Browser → Page
    ↑                                               |
    └─────────── Screenshot/DOM snapshot ───────────┘
Every cycle involves serializing the page state, sending it to your code (or an LLM), getting a decision back, then sending actions through the protocol. Screenshots require multimodal models. Accessibility trees are lossy. It's slow, fragile, and expensive.
page-agent collapses that entire stack:
Page → page-agent.js → LLM API
The agent is JavaScript running in the page context. It has direct access to the DOM. No serialization round-trips, no screenshot overhead, no vision model costs. It sends a text representation of relevant DOM elements to any text-based LLM and executes the response directly.
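To make "a text representation of relevant DOM elements" concrete, here is a minimal sketch of one way such a representation could be built. It is not page-agent's actual serializer: the selector, the attribute list, the `[index] <tag ...>label` line format, and the function names `describeElement` and `serializeInteractive` are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not page-agent's real serializer.
// Assumed definition of "interactive": links, buttons, and form controls.
const INTERACTIVE_SELECTOR = 'a[href], button, input, select, textarea';

// Describe one element as a compact, indexed line of text.
function describeElement(el, index) {
  const tag = el.tagName.toLowerCase();
  // Keep only a few attributes that help a model identify the element.
  const attrs = ['type', 'name', 'placeholder', 'aria-label']
    .map((name) => [name, el.getAttribute(name)])
    .filter(([, value]) => value !== null && value !== '')
    .map(([name, value]) => `${name}="${value}"`)
    .join(' ');
  // Collapse whitespace and cap the visible label at 80 characters.
  const label = (el.textContent || '').trim().replace(/\s+/g, ' ').slice(0, 80);
  const open = attrs ? `<${tag} ${attrs}>` : `<${tag}>`;
  return `[${index}] ${open}${label}`;
}

// Collect every interactive element under `root` into one text block.
function serializeInteractive(root) {
  return Array.from(root.querySelectorAll(INTERACTIVE_SELECTOR))
    .map((el, i) => describeElement(el, i))
    .join('\n');
}
```

In a browser, `serializeInteractive(document)` would return one line per control. The index at the start of each line gives the model a short handle to refer back to when it picks an element to act on.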
How It Works
import { PageAgent } from 'page-agent'

// Point the agent at any OpenAI-compatible endpoint.
const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})

// Describe the task in natural language; the agent works out the DOM actions.
await agent.execute('Fill the shipping form with name John Smith, 123 Main St, New York, NY 10001')
The agent:
- Reads the current DOM and extracts interactive elements as text
- Sends this text representation + your command to the LLM
- Receives structured actions (click, type, select)
- Executes them directly on the DOM
No screenshots taken. No pixels processed. Just text in, actions out.
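The last two steps (receive structured actions, execute them on the DOM) can be sketched as follows. This is an illustration under stated assumptions, not page-agent's internal protocol: the JSON shape `{ type, index, value }` and the helper names `parseActions` and `applyAction` are hypothetical, and real frameworks with controlled inputs may need more than a plain `input` or `change` event.

```javascript
// Hypothetical action format: { type: 'click' | 'type' | 'select', index, value? }
const ALLOWED_ACTIONS = new Set(['click', 'type', 'select']);

// Parse and validate the model's reply before touching the page.
function parseActions(llmReply) {
  const parsed = JSON.parse(llmReply);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array of actions');
  }
  return parsed.map((action) => {
    if (!ALLOWED_ACTIONS.has(action.type)) {
      throw new Error(`Unsupported action: ${action.type}`);
    }
    if (!Number.isInteger(action.index)) {
      throw new Error('Action is missing an element index');
    }
    return action;
  });
}

// Apply one validated action to the element it points at.
// `elements` is the same indexed list that was serialized for the model.
function applyAction(elements, action) {
  const el = elements[action.index];
  if (!el) throw new Error(`No element at index ${action.index}`);
  if (action.type === 'click') {
    el.click();
    return;
  }
  // 'type' and 'select' both set a value, then notify page listeners.
  el.value = action.value;
  const eventName = action.type === 'type' ? 'input' : 'change';
  el.dispatchEvent(new Event(eventName, { bubbles: true }));
}
```

Validating against a fixed set of action types before execution keeps a malformed or unexpected model reply from reaching the DOM.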
Use Cases
SaaS AI Copilot
This is the killer use case. You have a complex web app (ERP, CRM, admin dashboard) and you want to add an AI assistant that can actually do things in the UI. With page-agent, that's a few lines of code. No backend rewrite. The copilot understands your UI by reading the DOM.
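As a rough idea of what those few lines might look like, the sketch below wraps an agent in a handler you could attach to a chat box. The only page-agent call it relies on is `agent.execute(command)` as shown earlier; the `createCopilotHandler` name, the `onStatus` callback, and the assumption that `execute` rejects on failure are illustrative, not part of the library.

```javascript
// `agent` is anything with an async execute(command) method,
// for example a PageAgent instance. Everything else here is hypothetical.
function createCopilotHandler(agent, onStatus) {
  let busy = false;
  return async function handleCommand(rawText) {
    const command = rawText.trim();
    // Ignore empty input and overlapping runs.
    if (!command || busy) return false;
    busy = true;
    onStatus('running');
    try {
      await agent.execute(command);
      onStatus('done');
      return true;
    } catch (err) {
      // Assumes execute() rejects when a task fails.
      onStatus(`failed: ${err.message}`);
      return false;
    } finally {
      busy = false;
    }
  };
}
```

In a page, the returned function could be called from a form's submit listener with the input's current value, and `onStatus` could update a status label next to the chat box.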
Smart Form Filling
Turn 20-click workflows into one sentence: "Create a new customer with these details and assign them to the Enterprise tier." The agent finds the forms, fills them, and clicks the buttons.
Accessibility
Natural language as a universal interface. Voice commands → text → page-agent → DOM actions. Any web app becomes accessible through language, regardless of how it was built.
Legacy App Modernization
That internal tool from 2014 with no API? page-agent can drive it from the inside. Wrap it with natural language commands without touching the original codebase.
Bring Your Own LLM
page-agent works with any OpenAI-compatible API:
- Qwen (Alibaba's own, works great)
- GPT-4o, GPT-4o-mini
- Claude
- Gemini
- Local models via Ollama
Since it only sends text (not screenshots), you don't need expensive multimodal models. A good text model is sufficient.
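Because the constructor takes `model`, `baseURL`, and `apiKey`, switching providers is mostly a matter of swapping those three values. The sketch below keeps them in one place. The `PRESETS` table and `buildAgentConfig` helper are hypothetical conveniences, not part of page-agent; the OpenAI and Ollama base URLs are those providers' OpenAI-compatible endpoints, and the non-Qwen model names are placeholders to replace with whatever you actually run.

```javascript
// Hypothetical presets. Model names other than the Qwen one are placeholders.
const PRESETS = {
  qwen: {
    baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    model: 'qwen3.5-plus',
  },
  openai: {
    baseURL: 'https://api.openai.com/v1',
    model: 'gpt-4o-mini',
  },
  // Ollama serves an OpenAI-compatible API on this local port by default.
  ollama: {
    baseURL: 'http://localhost:11434/v1',
    model: 'llama3.1',
  },
};

// Build the options object passed to `new PageAgent(...)`.
function buildAgentConfig(provider, apiKey, overrides = {}) {
  if (!Object.prototype.hasOwnProperty.call(PRESETS, provider)) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  return { ...PRESETS[provider], apiKey, language: 'en-US', ...overrides };
}
```

For example, `new PageAgent(buildAgentConfig('ollama', 'unused'))` would point the agent at a local model, and passing `{ model: '...' }` as the third argument overrides the placeholder.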
page-agent vs. The Automation Stack
| | page-agent | browser-use | Playwright | Selenium |
|---|---|---|---|---|
| Runs where | Inside the page (JS) | External (Python) | External (Node/Python) | External (any) |
| Needs browser instance | No (already in one) | Yes (Chromium) | Yes | Yes |
| Screenshots needed | No | Yes (vision model) | Optional | Optional |
| LLM requirement | Any text model | Multimodal preferred | N/A (scripted) | N/A (scripted) |
| Integration effort | 1 script tag | Python setup + browser | Full test framework | Full test framework |
| Multi-page support | Chrome extension (optional) | Built-in | Built-in | Built-in |
| Use case | In-app copilot, form filling | Web scraping, automation | Testing, automation | Testing, automation |
| Language | JavaScript/TypeScript | Python | Node.js/Python | Multi-language |
| License | MIT | MIT | Apache 2.0 | Apache 2.0 |
Key tradeoff: page-agent excels at single-page interactions embedded in your product. For cross-site scraping or complex multi-tab workflows, external tools like browser-use or Playwright still have the edge.
page-agent vs. WebMCP: Two Sides of the Same Problem
Both page-agent and WebMCP are trying to solve the same fundamental issue: AI agents are terrible at navigating human-designed web interfaces. They approach it from opposite directions.
WebMCP (site-side): The website owner exposes structured tools (proper function calls with schemas) that agents can invoke directly. No DOM parsing needed. The site adapts to agents. This is Google and Microsoft's proposal, currently in Chrome 146 Canary.
page-agent (client-side): The agent reads the DOM as text and figures out how to interact with it. No cooperation from the website needed. The agent adapts to sites.
| | page-agent | WebMCP |
|---|---|---|
| Who adapts | Agent adapts to site | Site adapts to agent |
| Site cooperation needed | No | Yes (must implement) |
| Works on existing sites | ✓ Today | ✗ Sites must add support |
| Structured interaction | Inferred from DOM | Explicit schemas |
| Reliability | Good (DOM can change) | Very high (typed schemas) |
| Adoption timeline | Now | Years (needs web standard adoption) |
| Best for | Your own SaaS copilot, legacy apps | Future web ecosystem |
The realistic path: page-agent works today on any site where its script can run. WebMCP is the better long-term solution but requires websites to implement it, and web standards take years to reach meaningful adoption. In practice, agents will likely use page-agent-style DOM reading as the fallback and WebMCP structured tools when available.
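That "structured tools when available, DOM reading as the fallback" strategy can be sketched as a small dispatcher. This is purely illustrative: `siteTools` is a plain `Map` standing in for whatever tools a site exposes and is not the WebMCP API, the `task` shape is invented for the example, and `domAgent` is any object with an async `execute(instruction)` method such as a PageAgent instance.

```javascript
// Hypothetical dispatcher. `siteTools` is a Map of tool name -> async function,
// used here as a stand-in for site-provided structured tools.
async function runTask(task, siteTools, domAgent) {
  const tool = siteTools.get(task.tool);
  if (typeof tool === 'function') {
    // Preferred path: call the typed tool directly with structured arguments.
    return { via: 'structured-tool', result: await tool(task.args) };
  }
  // Fallback path: let the in-page agent work it out from the DOM.
  return { via: 'dom-agent', result: await domAgent.execute(task.instruction) };
}
```

Each task carries both a structured form (`tool`, `args`) and a natural language form (`instruction`), so the same request can take either route depending on what the current site supports.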
page-agent also ships a beta MCP server, which bridges these approaches: your external agents can control it via MCP while it operates inside the page. Combined with the Chrome extension for multi-page support, you can build browser agents that work from either direction.
The Numbers
- 15,700+ GitHub stars
- 1,222 forks
- MIT license
- Built on top of browser-use DOM processing patterns
- npm package: page-agent
- Active Alibaba maintenance
Getting Started
Quick evaluation (free demo LLM included):
<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>
Production (bring your own LLM):
npm install page-agent
Docs: alibaba.github.io/page-agent
Related Reading
- PinchTab: Browser Control for AI Agents (external browser automation approach)
- Lightpanda: Headless Browser Built for AI (purpose-built browser for agent workloads)
- WebMCP: The Agentic Web Standard (making websites natively agent-accessible)