PDF Parsing for AI Agents: liteparse vs GLM-OCR vs LlamaParse
PDF parsing sounds like a solved problem. It isn't, and the gap between "good enough for simple PDFs" and "reliable for production agent pipelines" is where most builders learn that the hard way.
Three tools cover the practical spectrum for AI agent use cases in 2026. Here's when to reach for each.
The Three Tools
liteparse: Local, Fast, Zero Setup
Repo: github.com/run-llama/liteparse
From: LlamaIndex team (run-llama)
Model: Tesseract.js (CPU-only, classical OCR)
npm i -g @llamaindex/liteparse
lit parse document.pdf
That's it. No API key. No GPU. No cloud. Works on Linux, macOS (Intel/ARM), and Windows.
What it does well:
- Native-text PDFs (generated by software: Word, LaTeX, most web PDFs): near-perfect extraction
- Bounding boxes on every text element, so spatial layout is preserved
- Buffer input: zero disk I/O, pipe PDFs in from memory
- Batch processing: lit batch-parse ./input ./output
- Screenshot mode: renders pages as images for downstream VLM processing
- Pluggable OCR: swap Tesseract for EasyOCR, PaddleOCR, or any custom HTTP server
Where it struggles:
- Scanned PDFs with no text layer (Tesseract accuracy drops sharply on complex layouts)
- Dense tables, multi-column academic papers, handwritten annotations
- Non-English documents (Tesseract needs language packs configured)
Node.js API:
import { LiteParse } from '@llamaindex/liteparse';
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text); // with bounding boxes in result.pages
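Those bounding boxes are what make liteparse useful beyond plain extraction: you can recover reading order yourself when a page has unusual layout. A minimal sketch of that idea follows; note the element shape ({ text, bbox: { x, y } }) is an assumption for illustration, so check the actual structure of result.pages in your liteparse version.

```javascript
// Sketch: sort text elements into reading order using bounding boxes.
// Elements are grouped into visual lines by y-coordinate, then each
// line is sorted left-to-right by x-coordinate.
function readingOrder(elements, lineTolerance = 5) {
  const sorted = [...elements].sort((a, b) => a.bbox.y - b.bbox.y);
  const lines = [];
  for (const el of sorted) {
    // An element joins an existing line if its y is within tolerance.
    const line = lines.find(
      (l) => Math.abs(l[0].bbox.y - el.bbox.y) <= lineTolerance
    );
    if (line) line.push(el);
    else lines.push([el]);
  }
  return lines
    .map((line) =>
      line.sort((a, b) => a.bbox.x - b.bbox.x).map((el) => el.text).join(" ")
    )
    .join("\n");
}

// Two words on one visual line, one word below:
const demo = [
  { text: "world", bbox: { x: 60, y: 10 } },
  { text: "hello", bbox: { x: 10, y: 12 } },
  { text: "below", bbox: { x: 10, y: 40 } },
];
console.log(readingOrder(demo)); // "hello world\nbelow"
```

The tolerance parameter absorbs small y-jitter between elements on the same visual line; real documents need it tuned per source.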
GLM-OCR 0.9B: VLM-Quality, Still Local
Repo: github.com/THUDM/GLM-OCR
From: Tsinghua KEG Lab
Model: 0.9B vision-language model
GLM-OCR is a different category of tool: a small vision-language model purpose-built for document understanding. In benchmarks published in March 2026, it outperformed Gemini on document parsing tasks and matched models many times its size.
What it does well:
- Dense tables: understands cell relationships, not just text extraction
- Multi-column layouts: tracks reading order semantically
- Handwritten annotations: handles mixed printed/handwritten content
- Visual document elements: charts, figures, form fields
- Scanned PDFs: robust to image quality variation
- Mathematical notation (arXiv papers): reasonable accuracy
Where it struggles:
- Requires a GPU for fast inference (CPU is slow at 0.9B params)
- More setup than liteparse (Python, model download ~1.8GB)
- Overkill for clean native-text PDFs
When to use it: when layout fidelity and accuracy matter more than speed. Think financial statements, academic papers, legal contracts, and forms with structured data.
LlamaParse: Production Cloud Parsing
URL: cloud.llamaindex.ai
From: LlamaIndex (same team as liteparse)
Model: Proprietary cloud pipeline
LlamaParse is what you reach for when the document is complex, accuracy is non-negotiable, and you're willing to trade privacy and cost for reliability.
What it does well:
- Complex tables across pages
- Charts and figures with context
- Mixed document types (scanned + native text)
- Structured markdown output, ready for LLM consumption
- Handles edge cases that break local tools
- Per-page SLA guarantees in production
Where it falls short:
- Documents leave your machine (not for sensitive data without DPA)
- Per-page pricing at scale
- Requires API key and internet access
Decision Framework
Is the PDF native-text (generated by software)?
├── YES → liteparse (fast, free, local)
└── NO (scanned / complex layout) →
    Is data sensitivity a concern?
    ├── YES → GLM-OCR (local VLM, no cloud)
    └── NO → LlamaParse (most accurate, handles edge cases)

Are you processing at scale (1000+ docs/day)?
├── liteparse for native-text (parallelizable, zero cost)
├── LlamaParse for complex (API rate limits apply)
└── Self-hosted GLM-OCR for sensitive + complex
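The framework above collapses into a small routing function. This is a sketch, not any tool's API: the input flags (hasTextLayer, complexLayout, sensitive) are assumptions, and in practice you would detect hasTextLayer by attempting a cheap extraction first rather than asking the caller.

```javascript
// Sketch of the decision framework as a routing function.
// All parameter names here are illustrative, not from any library.
function pickParser({ hasTextLayer, complexLayout, sensitive }) {
  // Native-text, simple layout: the free local path wins.
  if (hasTextLayer && !complexLayout) return "liteparse";
  // Scanned or complex: sensitivity decides local VLM vs cloud.
  if (sensitive) return "glm-ocr";
  return "llamaparse";
}

console.log(pickParser({ hasTextLayer: true, complexLayout: false, sensitive: false })); // "liteparse"
console.log(pickParser({ hasTextLayer: false, complexLayout: true, sensitive: true }));  // "glm-ocr"
```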
Tiered Agent Pipeline
The pattern used in production agent systems combines all three:
PDF arrives
↓
liteparse: attempt extraction
↓
Confidence check (text length, layout flags)
↓
Sufficient? → use liteparse output
Not sufficient? → GLM-OCR (if on-prem required)
                → LlamaParse (if cloud allowed)
↓
Structured output → LLM context window
This keeps costs near-zero for the majority of documents (most PDFs have native text layers) while preserving accuracy for the minority that need it.
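The escalation logic can be sketched in a few lines. The three parser functions are injected placeholders (the real calls are async and tool-specific; they are kept synchronous here for brevity), and the confidence heuristic, average characters per page, is an assumption you would tune against your own corpus.

```javascript
// Sketch of the tiered pipeline: cheap local attempt, then escalate.
// deps = { liteparse, glmOcr, llamaParse } are placeholder functions
// you would wire to the real tools.
function tieredParse(pdf, deps, { onPremOnly = false, minCharsPerPage = 200 } = {}) {
  const first = deps.liteparse(pdf); // cheap local attempt
  const avgChars = first.text.length / Math.max(first.pageCount, 1);
  if (avgChars >= minCharsPerPage) {
    return { tool: "liteparse", text: first.text }; // good enough, stop here
  }
  // Escalate: the on-prem constraint decides which heavy parser runs.
  return onPremOnly
    ? { tool: "glm-ocr", text: deps.glmOcr(pdf) }
    : { tool: "llamaparse", text: deps.llamaParse(pdf) };
}

// Stub dependencies for illustration:
const deps = {
  liteparse: () => ({ text: "x".repeat(1000), pageCount: 2 }), // 500 chars/page
  glmOcr: () => "ocr text",
  llamaParse: () => "cloud text",
};
console.log(tieredParse("doc.pdf", deps).tool); // "liteparse"
```

A text-length threshold is crude but cheap; production systems often add layout flags (empty pages, image-only pages) from the first pass to the confidence check.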
One More Thing: liteparse as an Agent Skill
liteparse ships with an official agent skill:
npx skills add run-llama/llamaparse-agent-skills --skill liteparse
The skill uses a SKILL.md format, the same spec used by OpenClaw skills. If you're building agents on OpenClaw, you can drop it straight in. This is the direction the LlamaIndex team is pushing: document parsing as a composable agent capability, not a preprocessing step you bolt on before the real work starts.
Bottom Line
| Tool | Best For | Setup | Cost | Privacy |
|---|---|---|---|---|
| liteparse | Native-text PDFs, agent pipelines | npm i | Free | 100% local |
| GLM-OCR | Complex layouts, scanned docs, on-prem | Python + GPU | Free | 100% local |
| LlamaParse | Production complex docs, max accuracy | API key | Per-page | Cloud |
For most agent builders: start with liteparse. If your documents have complex layouts or low text-layer quality, reach for GLM-OCR before paying for cloud. Reserve LlamaParse for the cases where accuracy genuinely can't be compromised and data residency isn't a constraint.