text-extract-api: Local Document Intelligence with OCR + LLM

By Prahlad Menon · 4 min read

text-extract-api is a self-hosted document extraction service with 3,050 GitHub stars. Upload any PDF, Word file, image, or PowerPoint — get back clean Markdown or structured JSON, all locally, no cloud required.

The architecture: FastAPI handles the API, Celery handles async task queuing, Redis caches OCR results, and Ollama runs the vision models and LLMs. Everything runs in Docker. Nothing leaves your network.
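Bringing that stack up is a couple of commands once Docker is installed — a sketch based on the repo's Docker setup (the compose service layout and the default port 8000 are assumptions; check the repo's README for the exact commands for your version):

```shell
# Clone and start the full stack: FastAPI app, Celery worker, Redis, Ollama
git clone https://github.com/CatchTheTornado/text-extract-api.git
cd text-extract-api
docker compose up --build -d

# FastAPI serves an interactive OpenAPI UI for the whole REST surface
curl http://localhost:8000/docs
```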

The Two-Stage Pipeline

What makes text-extract-api different from simpler tools is the two-stage pipeline: OCR first, then LLM.

Stage 1 — OCR extraction. Convert the document to raw text using one of four strategies. Results are cached in Redis so you don’t re-run OCR if you want to tweak the LLM prompt later.

Stage 2 — LLM processing. Pass the OCR output through a local Ollama model (LLaMA 3.1 by default) with a prompt you provide. This is where the intelligence happens.

# Extract an MRI report to JSON with PII removed
python client/cli.py ocr_upload \
  --file examples/example-mri.pdf \
  --ocr_cache \
  --prompt_file examples/example-mri-2-json-prompt.txt

The LLM stage can do things raw OCR can’t:

  • Fix spelling errors and broken hyphenation from the OCR pass
  • Pull structured fields into JSON ({"patient_id": "...", "diagnosis": "...", "date": "..."})
  • Remove or redact PII before the text ever hits downstream systems
  • Reformat tables that OCR mis-parsed as flat text
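A prompt file for the LLM stage is just plain text. A hypothetical example (not taken from the repo) that combines all four of the jobs above:

```text
Fix OCR errors (misspellings, broken hyphenation) in the text below.
Return JSON with the fields patient_id, diagnosis, and date.
Replace any patient names, addresses, or phone numbers with [REDACTED].
Rebuild any tables as JSON arrays. Output only valid JSON.
```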

Because OCR results are cached, you can re-run the LLM stage with different prompts — different structure schemas, different PII rules, different output formats — without paying the OCR cost again.
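In practice that means calling the same upload twice with `--ocr_cache` and a different prompt file each time (the prompt filenames here are hypothetical):

```shell
# First call runs OCR and caches the raw text in Redis
python client/cli.py ocr_upload --file report.pdf --ocr_cache \
  --prompt_file prompts/structure-as-json.txt

# Second call hits the OCR cache; only the LLM stage re-runs
python client/cli.py ocr_upload --file report.pdf --ocr_cache \
  --prompt_file prompts/redact-pii.txt
```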

Four OCR Strategies

Not all documents are the same, so text-extract-api gives you four strategies to pick from:

easyocr (default) — EasyOCR with Apache license, 30+ language support, strongest English accuracy. Fast, well-understood, good baseline.

minicpm_v — MiniCPM-V via Ollama. A vision model that understands document layout: it handles tables, multi-column text, and mixed text/image pages better than classical OCR. Apache license (with a research registration for commercial use).

llama_vision — Llama 3.2 Vision via Ollama. The most capable and the slowest. The model family ships in 11B and 90B parameter variants, with strong multilingual support. Pull it first:

python client/cli.py llm_pull --model llama3.2-vision

remote — Marker-PDF running as a separate local server. Marker is state-of-the-art for complex academic papers, technical documents, and non-English languages (50+). It is kept separate because Marker is GPL-3.0-licensed while text-extract-api is MIT — the process boundary is intentional.
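The strategy is selected per request. A sketch using the `--strategy` flag as it appears in the project's examples (flag name taken from the repo; verify against your installed version):

```shell
# Route this document through the Llama 3.2 Vision strategy
python client/cli.py ocr_upload \
  --file examples/example-mri.pdf \
  --ocr_cache \
  --strategy llama_vision \
  --prompt_file examples/example-mri-2-json-prompt.txt
```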

Document → Structured JSON

The most powerful pattern is OCR + structured extraction in one call. For an invoice:

python client/cli.py ocr_upload \
  --file examples/example-invoice.pdf \
  --prompt_file examples/example-invoice-remove-pii.txt

With a prompt that says “Extract vendor name, invoice number, line items, and total as JSON — replace customer details with [REDACTED]”, the API returns:

{
  "vendor": "Acme Corp",
  "invoice_number": "INV-2024-1842",
  "line_items": [
    {"description": "Professional Services", "amount": 4500.00},
    {"description": "Travel expenses", "amount": 312.50}
  ],
  "total": 4812.50,
  "customer": "[REDACTED]"
}

The MRI report example in the repo follows the same pattern — OCR extracts the medical text, then the LLM structures it as JSON and strips the patient identifiers. The whole thing never touches a cloud API.
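And because it's a plain REST service, nothing forces you through the bundled CLI — any language can drive it. A hedged Python sketch of the async upload-then-poll pattern: the `/ocr/upload` and `/ocr/result/{task_id}` paths, field names, and response shape here are assumptions for illustration, so check the service's `/docs` page for the real contract:

```python
import time

import requests

API = "http://localhost:8000"  # assumed default bind address


def task_finished(payload: dict) -> bool:
    """True once the Celery task has reached a terminal state."""
    return payload.get("state") in ("SUCCESS", "FAILURE")


def extract(path: str, prompt: str) -> str:
    """Upload a document, then poll until the async task completes."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API}/ocr/upload",              # assumed endpoint path
            files={"file": f},
            data={"prompt": prompt, "ocr_cache": "true"},
        )
    resp.raise_for_status()
    task_id = resp.json()["task_id"]          # assumed response field

    # The work happens in a Celery worker; poll for the result
    while True:
        payload = requests.get(f"{API}/ocr/result/{task_id}").json()
        if task_finished(payload):
            return payload.get("result", "")
        time.sleep(1)
```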

When to Use It

There’s a spectrum of PDF/document tools. We covered the lighter end in our liteparse comparison post: liteparse (PDF.js + Tesseract, zero deps), GLM-OCR 0.9B (tiny vision model, no API costs), and LlamaParse (cloud API, best accuracy).

text-extract-api is the other end of that spectrum — heavier setup, much more capability:

| Tool | Setup | Strength | Best for |
| --- | --- | --- | --- |
| liteparse | npm i -g | Zero deps, instant | Fast in-pipeline parsing |
| GLM-OCR 0.9B | Python package | Tiny, offline | Agent memory, quick extraction |
| LlamaParse | Cloud API | Best accuracy | Production cloud pipelines |
| text-extract-api | Docker + Ollama | LLM post-processing, PII, structure | Local document intelligence service |

If you’re building an agent that needs to parse PDFs on the fly, liteparse or GLM-OCR is the right call. If you’re building a document intake service — legal, medical, finance — where you need structure, PII removal, and a REST API any system can call, text-extract-api is the architecture.

The self-hosted constraint is the whole point. Medical records, financial documents, legal filings — anything where sending documents to a cloud OCR API is a compliance or privacy problem. text-extract-api gives you LLM-grade extraction without the data leaving your infrastructure.

Scaling It

For production volume, scale by running multiple Celery workers:

# Each command starts another worker process (solo pool = one task per process)
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &
celery -A text_extract_api.tasks worker --loglevel=info --pool=solo &

Redis caching means repeated documents hit the cache rather than the GPU. The async queue means burst workloads don’t drop requests.
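The repeated `--pool=solo` commands give you one task per process, which suits GPU-bound OCR. If your bottleneck is CPU instead, one worker can fork multiple children using Celery's standard prefork pool — this is generic Celery usage, not something specific to this repo:

```shell
# One worker process managing four prefork children
celery -A text_extract_api.tasks worker --loglevel=info --concurrency=4
```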

3,050 GitHub stars and actively maintained. Worth having in your local AI toolkit.

github.com/CatchTheTornado/text-extract-api


Related: PDF parsing for AI agents — liteparse vs GLM-OCR vs LlamaParse · GLM-OCR 0.9B beats Gemini on document OCR · Turn your old iPhone into a local OCR server