text-extract-api: Local Document Intelligence with OCR + LLM

By Prahlad Menon · 4 min read

text-extract-api is a self-hosted document extraction service with 3,050 GitHub stars. Upload any PDF, Word file, image, or PowerPoint — get back clean Markdown or structured JSON, all locally, no cloud required.

The architecture: FastAPI handles the API, Celery handles async task queuing, Redis caches OCR results, and Ollama runs the vision models and LLMs. Everything runs in Docker. Nothing leaves your network.
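Bringing that stack up is a couple of commands once Docker is installed — a sketch based on the repo's Docker setup (the compose service layout and the default port 8000 are assumptions; check the repo's README for the exact commands for your version):

```shell
# Clone and start the full stack: FastAPI app, Celery worker, Redis, Ollama
git clone https://github.com/CatchTheTornado/text-extract-api.git
cd text-extract-api
docker compose up --build -d

# FastAPI serves an interactive OpenAPI UI for the whole REST surface
curl http://localhost:8000/docs
```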

The Two-Stage Pipeline

What makes text-extract-api different from simpler tools is the two-stage pipeline: OCR first, then LLM.

Stage 1 — OCR extraction. Convert the document to raw text using one of four strategies. Results are cached in Redis so you don’t re-run OCR if you want to tweak the LLM prompt later.

Stage 2 — LLM processing. Pass the OCR output through a local Ollama model (LLaMA 3.1 by default) with a prompt you provide. This is where the intelligence happens.

# Extract an MRI report to JSON with PII removed
python client/cli.py ocr_upload \
  --file examples/example-mri.pdf \
  --ocr_cache \
  --prompt_file examples/example-mri-2-json-prompt.txt

The LLM stage can do things raw OCR can’t:

  • Fix spelling errors and broken hyphenation from the OCR pass
  • Pull structured fields into JSON ({"patient_id": "...", "diagnosis": "...", "date": "..."})
  • Remove or redact PII before the text ever hits downstream systems
  • Reformat tables that OCR mis-parsed as flat text
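A prompt file for the LLM stage is just plain text. A hypothetical example (not taken from the repo) that combines all four of the jobs above:

```text
Fix OCR errors (misspellings, broken hyphenation) in the text below.
Return JSON with the fields patient_id, diagnosis, and date.
Replace any patient names, addresses, or phone numbers with [REDACTED].
Rebuild any tables as JSON arrays. Output only valid JSON.
```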

Because OCR results are cached, you can re-run the LLM stage with different prompts — different structure schemas, different PII rules, different output formats — without paying the OCR cost again.
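In practice that means calling the same upload twice with `--ocr_cache` and a different prompt file each time (the prompt filenames here are hypothetical):

```shell
# First call runs OCR and caches the raw text in Redis
python client/cli.py ocr_upload --file report.pdf --ocr_cache \
  --prompt_file prompts/structure-as-json.txt

# Second call hits the OCR cache; only the LLM stage re-runs
python client/cli.py ocr_upload --file report.pdf --ocr_cache \
  --prompt_file prompts/redact-pii.txt
```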

Four OCR Strategies

Not all documents are the same, so text-extract-api gives you four strategies to pick from:

easyocr (default) — EasyOCR with Apache license, 30+ language support, strongest English accuracy. Fast, well-understood, good baseline.

minicpm_v — MiniCPM-V via Ollama. A vision model that understands document layout: it handles tables, multi-column text, and mixed text/image pages better than classical OCR. Apache license (with a research registration for commercial use).

llama_vision — Llama 3.2 Vision via Ollama. The most capable and the slowest. The model family ships in 11B and 90B parameter variants, with strong multilingual support. Pull it first:

python client/cli.py llm_pull --model llama3.2-vision

remote — Marker-PDF running as a separate local server. Marker is state-of-the-art for complex academic papers, technical documents, and non-English languages (50+). It is kept separate because Marker is GPL-3.0-licensed while text-extract-api is MIT — the process boundary is intentional.
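The strategy is selected per request. A sketch using the `--strategy` flag as it appears in the project's examples (flag name taken from the repo; verify against your installed version):

```shell
# Route this document through the Llama 3.2 Vision strategy
python client/cli.py ocr_upload \
  --file examples/example-mri.pdf \
  --ocr_cache \
  --strategy llama_vision \
  --prompt_file examples/example-mri-2-json-prompt.txt
```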

Document → Structured JSON

The most powerful pattern is OCR + structured extraction in one call. For an invoice:

python client/cli.py ocr_upload \
  --file examples/example-invoice.pdf \
  --prompt_file examples/example-invoice-remove-pii.txt

With a prompt that says “Extract vendor name, invoice number, line items, and total as JSON — replace customer details with [REDACTED]”, the API returns:

{
  "vendor": "Acme Corp",
  "invoice_number": "INV-2024-1842",
  "line_items": [
    {"description": "Professional Services", "amount": 4500.00},
    {"description": "Travel expenses", "amount": 312.50}
  ],
  "total": 4812.50,
  "customer": "[REDACTED]"
}

The MRI report example in the repo follows the same pattern — OCR extracts the medical text, then the LLM structures it as JSON and strips the patient identifiers. The whole thing never touches a cloud API.
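And because it's a plain REST service, nothing forces you through the bundled CLI — any language can drive it. A hedged Python sketch of the async upload-then-poll pattern: the `/ocr/upload` and `/ocr/result/{task_id}` paths, field names, and response shape here are assumptions for illustration, so check the service's `/docs` page for the real contract:

```python
import time

import requests

API = "http://localhost:8000"  # assumed default bind address


def task_finished(payload: dict) -> bool:
    """True once the Celery task has reached a terminal state."""
    return payload.get("state") in ("SUCCESS", "FAILURE")


def extract(path: str, prompt: str) -> str:
    """Upload a document, then poll until the async task completes."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API}/ocr/upload",              # assumed endpoint path
            files={"file": f},
            data={"prompt": prompt, "ocr_cache": "true"},
        )
    resp.raise_for_status()
    task_id = resp.json()["task_id"]          # assumed response field

    # The work happens in a Celery worker; poll for the result
    while True:
        payload = requests.get(f"{API}/ocr/result/{task_id}").json()
        if task_finished(payload):
            return payload.get("result", "")
        time.sleep(1)
```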

When to Use It

There’s a spectrum of PDF/document tools. We covered the lighter end in our liteparse comparison post: liteparse (PDF.js + Tesseract, zero deps), GLM-OCR 0.9B (tiny vision model, no API costs), and LlamaParse (cloud API, best accuracy).

text-extract-api is the other end of that spectrum — heavier setup, much more capability:

| Tool | Setup | Strength | Best for |
| --- | --- | --- | --- |
| liteparse | npm i -g | Zero deps, instant | Fast in-pipeline parsing |
| GLM-OCR 0.9B | Python package | Tiny, offline | Agent memory, quick extraction |
| LlamaParse | Cloud API | Best accuracy | Production cloud pipelines |
| text-extract-api | Docker + Ollama | LLM post-processing, PII, structure | Local document intelligence service |

If you’re building an agent that needs to parse PDFs on the fly, liteparse or GLM-OCR is the right call. If you’re building a document intake service — legal, medical, finance — where you need structure, PII removal, and a REST API any system can call, text-extract-api is the architecture.

The self-hosted constraint is the whole point. Medical records, financial documents, legal filings — anything where sending documents to a cloud OCR API is a compliance or privacy problem. text-extract-api gives you LLM-grade extraction without the data leaving your infrastructure.

Scaling It

For production volume, scale by running multiple Celery workers:

# Each command starts another worker process (solo pool = one task per process)
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &

Redis caching means repeated documents hit the cache rather than the GPU. The async queue means burst workloads don’t drop requests.
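The repeated `--pool=solo` commands give you one task per process, which suits GPU-bound OCR. If your bottleneck is CPU instead, one worker can fork multiple children using Celery's standard prefork pool — this is generic Celery usage, not something specific to this repo:

```shell
# One worker process managing four prefork children
celery -A text_extract_api.tasks worker --loglevel=info --concurrency=4
```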

3,050 GitHub stars and actively maintained. Worth having in your local AI toolkit.

github.com/CatchTheTornado/text-extract-api


Related: PDF parsing for AI agents — liteparse vs GLM-OCR vs LlamaParse · GLM-OCR 0.9B beats Gemini on document OCR · Turn your old iPhone into a local OCR server