What is Alibaba page-agent?

page-agent is an open-source JavaScript library (MIT license) that embeds an AI GUI agent directly inside a webpage. It reads the DOM as text and executes natural language commands like 'fill out this form' or 'click the login button', without screenshots, headless browsers, or multimodal models. Integration takes a single script tag.

How is page-agent different from browser-use or Playwright?

Traditional browser automation (Playwright, Puppeteer, browser-use) runs externally — controlling a browser from outside via CDP or WebDriver. page-agent runs inside the page as JavaScript. No Python backend, no browser extension required, no screenshots. It reads the DOM directly as text, which is faster and doesn't need vision models.

What LLMs work with page-agent?

page-agent is bring-your-own-LLM. It works with any OpenAI-compatible API — Qwen, GPT-4o, Claude, Gemini, local models via Ollama. It only needs a text model since it doesn't use screenshots.

Can I embed page-agent in my SaaS product?

Yes, that's a primary use case. Add one script tag or run npm install page-agent, and your users get an AI copilot that can navigate your UI via natural language. It is MIT licensed, so commercial use is permitted.

Does page-agent work across multiple pages?

The base library works within a single page. For multi-page tasks across browser tabs, there's an optional Chrome extension. There's also a beta MCP server for external agent control.

How big is page-agent?

It's designed to be lightweight — distributed as a single JavaScript bundle via CDN or npm. 15,700+ GitHub stars, 1,200+ forks, MIT licensed, actively maintained by Alibaba.

Alibaba page-agent: The GUI Agent That Lives Inside Your Webpage

By Prahlad Menon | Published 2026-04-06 | 4 min read

Most browser automation tools work the same way: run something outside the browser that controls the browser. Playwright, Puppeteer, Selenium, and browser-use all sit on the outside looking in, taking screenshots or parsing DOM snapshots and accessibility trees through the Chrome DevTools Protocol or WebDriver.

Alibaba’s page-agent flips that model. The agent lives inside the webpage as plain JavaScript. It reads the DOM directly as text, understands natural language commands, and manipulates the page from within. No headless browser. No screenshots. No multimodal model.

One script tag:

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>

That’s the entire integration for evaluation.

Why This Matters

The standard browser automation stack looks like this:

Your Code → Playwright/Puppeteer → CDP → Browser → Page
     ↑                                          |
     └──── Screenshot/DOM snapshot ─────────────┘

Every cycle involves serializing the page state, sending it to your code (or an LLM), getting a decision back, then sending actions through the protocol. Screenshots require multimodal models. Accessibility trees are lossy. The whole loop is slow, fragile, and expensive.

page-agent collapses that entire stack:

Page ← page-agent.js ← LLM API

The agent is JavaScript running in the page context. It has direct access to the DOM. No round-trips through an external automation protocol, no screenshot overhead, no vision model costs. It sends a text representation of relevant DOM elements to any text-based LLM and executes the response directly.

How It Works

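import { PageAgent } from 'page-agent'

// Point the agent at any OpenAI-compatible endpoint and model
const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})

// Describe the task in natural language; the agent carries it out on the current page
await agent.execute('Fill the shipping form with name John Smith, 123 Main St, New York, NY 10001')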

The agent:

1. Reads the current DOM and extracts interactive elements as text
2. Sends this text representation + your command to the LLM
3. Receives structured actions (click, type, select)
4. Executes them directly on the DOM

No screenshots taken. No pixels processed. Just text in, actions out.
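The four steps above can be sketched in a few dozen lines. This is a hypothetical illustration of the loop, not page-agent's actual internals: the function names (serializeElements, buildPrompt, applyActions), the element shape, and the action format are all assumptions made for the example. Elements are plain objects so the logic runs outside a browser, and the LLM reply is hard-coded.

```javascript
// Hypothetical sketch of the read -> prompt -> act loop (NOT page-agent's real code).
// In a page, `elements` would be built from the live DOM; plain objects stand in here.

// Step 1: turn interactive elements into an indexed, text-only listing.
function serializeElements(elements) {
  return elements
    .map((el, index) => `[${index}] <${el.tag}> "${el.label}"`)
    .join('\n');
}

// Step 2: combine the listing with the user's command into one text prompt.
function buildPrompt(command, elements) {
  return `Task: ${command}\nInteractive elements:\n${serializeElements(elements)}`;
}

// Steps 3-4: apply the structured actions the model returned.
// The { type, index, ... } action shape is an assumption for this sketch.
function applyActions(elements, actions) {
  for (const action of actions) {
    const el = elements[action.index];
    if (!el) throw new Error(`No element at index ${action.index}`);
    if (action.type === 'type') {
      el.value = action.text;
    } else if (action.type === 'select') {
      el.value = action.option;
    } else if (action.type === 'click') {
      el.clicked = true;
    } else {
      throw new Error(`Unsupported action: ${action.type}`);
    }
  }
}

const elements = [
  { tag: 'input', label: 'Name', value: '' },
  { tag: 'button', label: 'Submit', clicked: false },
];

const prompt = buildPrompt('Fill the name with John Smith and submit', elements);

// A real run would send `prompt` to the LLM; this reply is hard-coded.
const reply = [
  { type: 'type', index: 0, text: 'John Smith' },
  { type: 'click', index: 1 },
];
applyActions(elements, reply);
```

Referring to elements by index keeps the prompt short and gives the model an unambiguous handle for each action, which is one plausible reason a text-only model is enough for this kind of task.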

Use Cases

SaaS AI Copilot

This is the killer use case. You have a complex web app — ERP, CRM, admin dashboard — and you want to add an AI assistant that can actually do things in the UI. With page-agent, that’s a few lines of code. No backend rewrite. The copilot understands your UI by reading the DOM.

Smart Form Filling

Turn 20-click workflows into one sentence. “Create a new customer with these details and assign them to the Enterprise tier.” The agent finds the forms, fills them, clicks the buttons.

Accessibility

Natural language as a universal interface. Voice commands → text → page-agent → DOM actions. Any web app the agent runs in can be operated through language, regardless of how it was built.
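The voice → text → agent pipeline can be wired up with very little glue. The sketch below is an assumption about how one might do it, not something page-agent ships: wireVoiceToAgent is a hypothetical helper, the recognizer is expected to follow the browser's Web Speech API result shape, and the only page-agent call used is agent.execute() from the example earlier in this article.

```javascript
// Hypothetical helper: forwards each final speech transcript to agent.execute().
// `recognition` is expected to behave like a Web Speech API SpeechRecognition
// object; `agent` is any object with an execute(command) method.
function wireVoiceToAgent(recognition, agent) {
  recognition.onresult = (event) => {
    // Look at the most recent result only.
    const result = event.results[event.results.length - 1];
    if (!result.isFinal) return; // skip interim (partial) transcripts
    const transcript = result[0].transcript.trim();
    if (transcript) return agent.execute(transcript);
  };
}

// In a browser that supports speech recognition, usage might look like:
//   const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
//   const recognition = new Recognition();
//   wireVoiceToAgent(recognition, agent);
//   recognition.start();
```

Passing the recognizer and agent in as arguments keeps the helper testable without a microphone or a live model.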

Legacy App Modernization

That internal tool from 2014 with no API? page-agent can drive it from the inside. Wrap it with natural language commands without touching the original codebase.

Bring Your Own LLM

page-agent works with any OpenAI-compatible API:

Qwen (Alibaba’s own, works great)
GPT-4o, GPT-4o-mini
Claude
Gemini
Local models via Ollama

Since it only sends text (not screenshots), you don’t need expensive multimodal models. A good text model is sufficient.
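To see why switching providers is mostly a configuration change, here is a minimal sketch of what a request to an OpenAI-compatible chat completions endpoint looks like. buildChatRequest is a hypothetical helper written for this article, not part of page-agent; the Ollama base URL assumes its default local port, and YOUR_LOCAL_MODEL is a placeholder.

```javascript
// Hypothetical helper: builds fetch() arguments for an OpenAI-compatible
// chat completions endpoint. Only the base URL, key, and model name change
// between providers; the request shape stays the same.
function buildChatRequest({ baseURL, apiKey, model, messages }) {
  // Strip trailing slashes so the path is joined cleanly.
  const url = `${baseURL.replace(/\/+$/, '')}/chat/completions`;
  return {
    url,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

const messages = [{ role: 'user', content: 'Click the login button' }];

// Hosted endpoint and model name taken from the example earlier in this article.
const hosted = buildChatRequest({
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  model: 'qwen3.5-plus',
  messages,
});

// Local Ollama server (assumes the default port).
const local = buildChatRequest({
  baseURL: 'http://localhost:11434/v1/',
  apiKey: 'ollama', // placeholder; a local server does not check it
  model: 'YOUR_LOCAL_MODEL',
  messages,
});

// Sending it would be: const response = await fetch(hosted.url, hosted.init);
```

One caveat when calling a hosted API from the browser: any key placed in client-side code is visible to users, so a production setup would typically route requests through your own backend.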

page-agent vs. The Automation Stack

	page-agent	browser-use	Playwright	Selenium
Runs where	Inside the page (JS)	External (Python)	External (Node.js, Python, Java, .NET)	External (any)
Needs browser instance	No (already in one)	Yes (Chromium)	Yes	Yes
Screenshots needed	No	Yes (vision model)	Optional	Optional
LLM requirement	Any text model	Multimodal preferred	N/A (scripted)	N/A (scripted)
Integration effort	1 script tag	Python setup + browser	Full test framework	Full test framework
Multi-page support	Chrome extension (optional)	Built-in	Built-in	Built-in
Use case	In-app copilot, form filling	Web scraping, automation	Testing, automation	Testing, automation
Language	JavaScript/TypeScript	Python	Node.js, Python, Java, .NET	Multi-language
License	MIT	MIT	Apache 2.0	Apache 2.0

Key tradeoff: page-agent excels at single-page interactions embedded in your product. For cross-site scraping or complex multi-tab workflows, external tools like browser-use or Playwright still have the edge.

page-agent vs. WebMCP: Two Sides of the Same Problem

Both page-agent and WebMCP are trying to solve the same fundamental issue: AI agents are terrible at navigating human-designed web interfaces. They approach it from opposite directions.

WebMCP (site-provided tools): The website owner exposes structured tools, proper function calls with schemas, that agents can invoke directly. No DOM parsing needed. The site adapts to agents. This is Google and Microsoft’s proposal, currently in Chrome 146 Canary.

page-agent (agent-side DOM reading): The agent reads the DOM as text and figures out how to interact with it. No cooperation from the website needed. The agent adapts to sites.

	page-agent	WebMCP
Who adapts	Agent adapts to site	Site adapts to agent
Site cooperation needed	No	Yes (must implement)
Works on existing sites	✅ Today	❌ Sites must add support
Structured interaction	Inferred from DOM	Explicit schemas
Reliability	Good (DOM can change)	Very high (typed schemas)
Adoption timeline	Now	Years (needs web standard adoption)
Best for	Your own SaaS copilot, legacy apps	Future web ecosystem

The realistic path: page-agent works today on any page it can be loaded into. WebMCP is the better long-term solution but requires websites to implement it, and web standards take years to reach meaningful adoption. In practice, agents will likely use page-agent-style DOM reading as the fallback and WebMCP structured tools when available.
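That "structured tools when available, DOM reading otherwise" pattern amounts to a small dispatch decision. The sketch below is purely illustrative: planStep, the intent shape, and the siteTools list are hypothetical names invented for this example and do not correspond to the WebMCP or page-agent APIs.

```javascript
// Hypothetical sketch: prefer a site-declared tool, fall back to DOM reading.
// `siteTools` stands in for whatever structured tools a site has exposed.
function planStep(intent, siteTools) {
  const tool = siteTools.find((candidate) => candidate.name === intent.tool);
  if (tool) {
    // The site declared a matching tool: call it with typed arguments.
    return { strategy: 'structured-tool', tool: tool.name, args: intent.args };
  }
  // No matching tool: hand the natural language description to a DOM-reading agent.
  return { strategy: 'dom-fallback', command: intent.description };
}

const intent = {
  tool: 'create_customer',
  args: { name: 'John Smith', tier: 'Enterprise' },
  description: 'Create a customer named John Smith on the Enterprise tier',
};

const withTools = planStep(intent, [{ name: 'create_customer' }]);
const withoutTools = planStep(intent, []);
```

Here withTools takes the structured path and withoutTools falls back to the DOM, so the same intent works on both kinds of site.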

page-agent also ships a beta MCP server, which bridges these approaches — your external agents can control it via MCP while it operates inside the page. Combined with the Chrome extension for multi-page support, you can build browser agents that work from either direction.

The Numbers

15,700+ GitHub stars
1,222 forks
MIT license
Built on top of browser-use DOM processing patterns
npm package: page-agent
Active Alibaba maintenance

Getting Started

Quick evaluation (free demo LLM included):

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.7.1/dist/iife/page-agent.demo.js" crossorigin="true"></script>

Production (bring your own LLM):

npm install page-agent

Docs: alibaba.github.io/page-agent
