Chandra 2: The Open-Source OCR Model Turning Messy Documents Into Agent-Ready Data

By Prahlad Menon

Everyone’s racing to build smarter AI agents, but there’s a quiet problem upstream: most real-world data is trapped in documents. PDFs, scanned insurance forms, handwritten clinical notes, lab reports with complex tables — none of this flows cleanly into an LLM context window. Traditional OCR gives you a text dump. What agents actually need is structured output: tables that are still tables, form fields that are labeled, images that are described, layout that’s preserved.

Chandra 2, released in March 2026 by Datalab, is built specifically for that.

What Chandra 2 Does

Chandra 2 converts images and PDFs into structured HTML, Markdown, or JSON — while keeping full layout information intact. It’s not just extracting characters; it’s reconstructing the document as a structured artifact.

The model runs on 4 billion parameters, down from 9 billion in v1. Despite being less than half the size, it scores 85.9% on the olmOCR benchmark — state of the art for its parameter class. Smaller and more accurate: that's the direction specialized models keep moving.

What Sets It Apart From Standard OCR

Most OCR tools treat every document as a bag of words. Chandra 2 doesn’t.

Layout preservation. When a document has a two-column layout, a sidebar, or a header hierarchy, Chandra 2 maintains that structure in the output. You don’t get a jumbled linear text stream.

Table handling. Complex tables — merged cells, nested headers, multi-row spans — come out as actual structured tables, not garbled text.

Form reconstruction. Checkboxes, radio buttons, labeled fields. The output tells you which box was checked and what question it answered.

Handwriting support. Physician notes, annotations, handwritten signatures — Chandra 2 handles these, which is rare even in commercial OCR tools.

Image and diagram captioning. Embedded images and figures are extracted and described. For documents like radiology reports or engineering specs, this is the difference between useful and useless output.

Math support. Equations are preserved in structured format rather than mangled into random characters.

90+ languages. Backed by published multilingual benchmarks, not just a checkbox claim.

All output is directly consumable — JSON or HTML that an agent, RAG retriever, or downstream pipeline can work with without post-processing gymnastics.
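To make "directly consumable" concrete, here is a minimal sketch of an agent reading that kind of structured output. The JSON schema below (field names like `label`, `type`, `checked`) is an illustrative stand-in, not Chandra 2's documented format:

```python
import json

# Hypothetical structured OCR output for a form page. The schema is
# illustrative only -- check Chandra 2's docs for its real JSON shape.
raw = """
{
  "fields": [
    {"label": "Patient name", "type": "text", "value": "Jane Doe"},
    {"label": "Prior authorization required", "type": "checkbox", "checked": true},
    {"label": "Urgent review", "type": "checkbox", "checked": false}
  ]
}
"""

def checked_boxes(document: dict) -> list[str]:
    """Return the label of every checkbox that was ticked."""
    return [
        f["label"]
        for f in document["fields"]
        if f.get("type") == "checkbox" and f.get("checked")
    ]

doc = json.loads(raw)
print(checked_boxes(doc))  # ['Prior authorization required']
```

The point is that the downstream code is trivial: the question-to-answer mapping already exists in the output, so no second extraction pass is needed.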

The Agentic Use Case

This is where it actually matters. If you’re building a RAG pipeline over a document corpus, or an agent that processes incoming paperwork, you need more than text extraction. You need:

  • Tables as actual tables your agent can reason over
  • Form fields with their labels and values intact
  • Images captioned so they contribute to retrieval
  • Layout context that preserves document semantics

Without this, you’re feeding your LLM a degraded, noisy version of the document and hoping it sorts things out. Chandra 2 handles the reconstruction in one pass, before the document ever touches your model.
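As a sketch of that ingestion step: assuming Chandra 2 has emitted HTML with real `<table>` markup (as the article describes), a retriever can turn each table into row-level records. The parser below is deliberately small — simple tables only, no merged cells — and the lab-report HTML is a made-up example, not actual model output:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect <table> rows as lists of cell strings.

    Minimal on purpose: handles flat tables, not merged cells or
    nested tables.
    """
    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def table_to_records(table):
    """Zip the header row with each body row into dicts."""
    header, *body = table
    return [dict(zip(header, row)) for row in body]

# Hypothetical OCR output for a lab-report table.
html = """
<table>
  <tr><th>Test</th><th>Result</th><th>Reference range</th></tr>
  <tr><td>Hemoglobin</td><td>13.2</td><td>12.0-15.5</td></tr>
  <tr><td>WBC</td><td>6.1</td><td>4.5-11.0</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)
records = table_to_records(parser.tables[0])
print(records[0]["Reference range"])  # 12.0-15.5
```

Each record can then be embedded or indexed individually, so a query like "hemoglobin reference range" retrieves the row rather than a garbled text blob.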

The Healthcare Angle

Clinical documents are a near-perfect stress test for OCR: dense layout, mixed content types, handwriting, structured forms, and high-stakes accuracy requirements.

Insurance prior authorization forms with checkboxes. Intake questionnaires. Lab reports with reference ranges in tables. Consent forms. Handwritten physician notes. Discharge summaries with embedded medication tables.

Chandra 2’s combination of form reconstruction, handwriting support, table handling, and structured output makes it well-suited for this domain. The output maps directly to structured clinical data rather than requiring a second extraction pass.

Getting Started

The install is straightforward:

pip install chandra-ocr
chandra input.pdf ./output

Two deployment modes:

  • Local inference via HuggingFace — good for development and moderate workloads
  • vLLM server — recommended for production, handles batching and throughput properly
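For the vLLM route, here is a rough sketch of the client side. vLLM exposes an OpenAI-compatible chat endpoint, so a page image can travel as a base64 data URI in the message content; the model name and prompt below are placeholder assumptions, not documented Chandra 2 values:

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, model: str = "datalab-to/chandra") -> dict:
    """Build an OpenAI-style chat payload carrying one page image.

    The model name and prompt text are illustrative assumptions --
    check the Chandra 2 docs for what a real deployment expects.
    """
    data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "Convert this page to structured HTML."},
            ],
        }],
    }

payload = build_ocr_request(b"\x89PNG placeholder")  # not a real image
body = json.dumps(payload)  # ready to POST to the server's /v1/chat/completions route
print(payload["model"])
```

Because the interface is OpenAI-compatible, any existing OpenAI client library can drive the server once the payload is shaped this way.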

There’s also a free playground at datalab.to/playground if you want to test it on your own documents before committing to a deployment.

The Efficiency Story

4B parameters outperforming 9B+ models on a specialized benchmark isn’t a surprise anymore — it’s a pattern. When you train a smaller model deeply on a specific task rather than trying to generalize across everything, you get better performance with lower inference cost.

Chandra 2 runs on a single consumer GPU. For most document processing workloads, you don’t need a cluster.

Worth Adding to Your Pipeline

If you’re building anything that ingests real-world documents — clinical, legal, financial, administrative — Chandra 2 is worth evaluating. The structured output, layout preservation, and form handling solve problems that most OCR tools either ignore or handle poorly.

Repository: github.com/datalab-to/chandra