Stop Wrestling with Broken OCR! Chandra 2 Crushes Tables & Handwriting
Your document pipeline is bleeding money. Every invoice that arrives as a scanned PDF, every handwritten form submitted by a customer, every research paper packed with mathematical notation—they're all sabotaging your automation efforts. You've tried the big names. You've wrestled with Tesseract's archaic configuration files. You've paid premium API prices for "enterprise" solutions that still confuse table cells with footnotes. And when cursive handwriting enters the picture? Complete system failure.
Here's the brutal truth: most OCR tools were built for a world that no longer exists. They expect pristine scans, simple paragraphs, and English-only content. But your documents are messy, multilingual, and structurally complex. You need something that understands layout, not just letters.
Enter Chandra OCR 2 from Datalab—the open-source model that's making experienced engineers quietly abandon their expensive OCR subscriptions. This isn't incremental improvement. This is a fundamentally different approach to document intelligence that converts images and PDFs into structured HTML, Markdown, and JSON while preserving the exact layout information that makes documents meaningful.
Want proof? Chandra 2 tops the external olmOCR benchmark and delivers crushing improvements in multilingual performance. Still skeptical? Keep reading. Your document processing workflow is about to get a massive upgrade.
What is Chandra OCR 2?
Chandra OCR 2 is a state-of-the-art optical character recognition model developed by Datalab, a team laser-focused on document intelligence. Released in March 2026 as a major evolution from the original Chandra 1 (October 2025), this model represents a paradigm shift in how machines read documents.
Unlike traditional OCR systems that treat text extraction as a flat, linear problem, Chandra 2 understands document structure as a first-class citizen. It doesn't just recognize characters—it reconstructs tables, identifies form fields, preserves mathematical notation, and maintains reading order across complex multi-column layouts.
The model is built on modern transformer architectures with visual understanding capabilities, leveraging the Qwen 3.5 foundation and optimized through vLLM for production inference. What makes it genuinely exciting for developers is the dual inference architecture: run locally via HuggingFace for privacy-sensitive workloads, or deploy through vLLM for high-throughput production environments.
Why it's trending now: The document AI space has been dominated by proprietary APIs (Google Document AI, Azure Form Recognizer, AWS Textract) that charge per-page fees and lock your data in vendor ecosystems. Chandra 2 arrives as a fully open-weights alternative with an Apache 2.0 code license and modified OpenRAIL-M model license—free for research, personal use, and startups under $2M funding/revenue. For engineers building document pipelines at scale, this represents massive cost reduction combined with complete deployment flexibility.
The benchmarks don't lie. Chandra 2 achieves 85.9% overall on the olmOCR benchmark, surpassing olmOCR 2 (82.4%), dots.ocr 1.5 (83.9%), and leaving GPT-4o (69.9%) in the dust. In multilingual testing across 90 languages, it averages 72.7% versus Gemini 2.5 Flash's 60.8%.
Key Features That Separate Chandra 2 from the Pack
Chandra 2 isn't a marginal improvement—it's a feature-complete solution for document intelligence that addresses pain points other tools ignore:
- Complex Layout Preservation: Converts documents to Markdown, HTML, or JSON with detailed structural metadata. Tables remain tables. Headers stay headers. Multi-column layouts maintain proper reading order.
- Mathematical Notation Mastery: Renders LaTeX-quality math from images, including handwritten equations. The CS229 textbook example demonstrates production-ready academic document processing.
- Handwriting Recognition That Actually Works: From cursive notes to filled forms, Chandra 2 achieves accuracy that makes previous open-source attempts look like toys. The benchmark improvements in "Old Scans Math" (89.3%) prove robustness against degraded inputs.
- 90+ Language Support: Not token multilingual support but genuine multilingual competence. Chandra 2 outperforms GPT-5 Mini across nearly every language tested, with particularly strong results in Arabic (68.4%), Hindi (78.4%), Japanese (86.9%), and Chinese (88.7%).
- Form Reconstruction with Checkboxes: Identifies and structures form fields, including checked/unchecked states. Lease agreements, registration forms, and surveys become machine-readable without template configuration.
- Image and Diagram Extraction: Pulls embedded visuals with automatic caption generation and structured metadata, creating complete document representations.
- Production-Ready Inference Options: Choose between HuggingFace local execution (privacy-first, no network dependencies) or vLLM server deployment (1.44 pages/sec on H100 with 96 concurrent sequences, zero failure rate).
- Flexible Output Formats: Get clean Markdown for content management systems, semantic HTML for web publishing, or structured JSON for downstream API consumption.
Real-World Use Cases Where Chandra 2 Dominates
1. Financial Document Processing
Investment firms and accounting platforms process thousands of scanned financial statements, audit reports, and tax forms monthly. Chandra 2's table reconstruction preserves cell relationships that generic OCR corrupts, eliminating the manual spreadsheet cleanup that costs operations teams hundreds of hours.
2. Academic and Research Archive Digitization
Universities and publishers sit on decades of scanned journals, handwritten research notes, and mathematical manuscripts. Chandra 2 handles the full spectrum: printed equations, cursive annotations, multi-language abstracts, and complex figure layouts. The CS229 textbook example shows it processing dense, equation-heavy course material without degradation.
3. Healthcare Form Automation
Insurance claims, patient intake forms, and prescription records arrive as handwritten scans, faxed documents, and mobile phone photos. Chandra 2's checkbox detection and handwriting recognition enable genuine straight-through processing for forms that previously required human transcription.
4. Legal Document Discovery
Law firms handling multilingual cases face contracts, correspondence, and evidence in Arabic, Japanese, Russian, and dozens of other languages. Chandra 2's 90-language support and layout preservation mean paralegals stop manually retyping foreign-language exhibits.
5. Historical Document Preservation
Libraries and museums digitize manuscripts with complex layouts, marginalia, and degraded paper quality. The "Old Scans" benchmark category (49.8%—competitive with specialized historical OCR tools) demonstrates robustness against the noise and damage that destroy conventional recognition accuracy.
Step-by-Step Installation & Setup Guide
Getting Chandra 2 running takes under five minutes. Here's the complete setup for both local and production deployments.
Base Installation (vLLM Backend — Recommended)
The vLLM backend provides optimal performance with minimal dependencies:
# Install the base package
pip install chandra-ocr
# Launch the vLLM server (Docker container with optimized settings)
chandra_vllm
# Process your first document
chandra input.pdf ./output
The vLLM approach is lightweight because it doesn't require PyTorch installation on your client machine. The server handles all GPU inference, while the CLI tool submits requests via HTTP.
HuggingFace Local Installation (Privacy-Critical Workloads)
For environments where documents cannot leave the machine (healthcare, legal, classified materials):
# Install with HuggingFace backend (includes torch, transformers)
pip install "chandra-ocr[hf]"
# Optional but strongly recommended: Flash Attention for 2-3x speedup
pip install flash-attn --no-build-isolation
# Process locally without any network calls
chandra input.pdf ./output --method hf
Full Installation (All Features)
# Everything including the Streamlit web interface
pip install "chandra-ocr[all]"
# Launch interactive demo for single-page exploration
chandra_app
Installation from Source (Development/Contributing)
# Clone the repository
git clone https://github.com/datalab-to/chandra.git
cd chandra
# Use uv for fast, reproducible dependency resolution
uv sync
# Activate the virtual environment
source .venv/bin/activate
Environment Configuration
Create a local.env file or export variables directly:
# Model configuration
export MODEL_CHECKPOINT=datalab-to/chandra-ocr-2
export MAX_OUTPUT_TOKENS=12384
# vLLM server connection
export VLLM_API_BASE=http://localhost:8000/v1
export VLLM_MODEL_NAME=chandra
export VLLM_GPUS=0 # Specify GPU device IDs for multi-GPU systems
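Before pointing the CLI at a server, it can help to confirm the endpoint actually answers. The sketch below assumes the server exposes the OpenAI-compatible model listing at VLLM_API_BASE + "/models", which is how vLLM's OpenAI-compatible server behaves; the helper names are hypothetical and are not part of chandra.

```python
import json
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return api_base.rstrip("/") + "/models"


def served_model_names(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(api_base: str, expected: str, timeout: float = 5.0) -> bool:
    """Return True if the server responds and lists the expected model name.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
        payload = json.load(resp)
    return expected in served_model_names(payload)


# Usage (needs a running server, so it is left as a comment here):
#   check_server("http://localhost:8000/v1", "chandra")
```

A False result with a reachable server usually means VLLM_MODEL_NAME does not match the name the server was started with.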
Docker/Production vLLM Deployment
For production deployments, launch your own vLLM server with the official model:
# The chandra_vllm command wraps Docker with optimized settings
# Or manually start vLLM with the model:
vllm serve datalab-to/chandra-ocr-2 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95
Hardware Requirements:
- Minimum: NVIDIA GPU with 24GB VRAM (A10G, RTX 4090)
- Recommended: NVIDIA H100 80GB for batch processing at scale
- CPU-only: Not supported—vision-language models require GPU acceleration
REAL Code Examples from the Repository
The Chandra 2 repository provides battle-tested examples. Here are the essential patterns extracted directly from the documentation, explained with production-ready commentary.
Example 1: Basic CLI Document Processing
The simplest entry point processes single files or entire directories with one command:
# Process a single PDF using the vLLM server (default, fastest)
chandra input.pdf ./output --method vllm
# Batch process an entire directory with local HuggingFace model
chandra ./documents ./output --method hf
# Process specific page ranges from large documents
chandra contract.pdf ./output --page-range "1-5,7,9-12"
What's happening here: The chandra CLI automatically handles PDF rasterization, sends images to your configured inference backend, and writes structured outputs. The --method flag switches between the lightweight vLLM client (default) and the self-contained HuggingFace pipeline. Page range selection prevents wasted computation on irrelevant sections like appendices in legal documents.
Example 2: Advanced CLI with Production Options
For production pipelines, fine-tune behavior with the full option set:
# High-throughput processing with parallel workers
chandra ./invoices ./processed \
--method vllm \
--max-workers 8 \
--batch-size 28 \
--max-output-tokens 8192 \
--include-images \
--no-headers-footers
Critical parameters explained:
- --max-workers 8: Parallel request submission to vLLM for throughput maximization
- --batch-size 28: Pages per inference batch (vLLM default; reduce to 1 for HuggingFace to prevent OOM)
- --max-output-tokens 8192: Prevents runaway generation on complex documents while allowing substantial structured output
- --include-images: Extracts embedded visuals as separate files with caption metadata
- --no-headers-footers: Excludes repetitive page elements that pollute downstream analysis
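If the CLI is driven from a Python job runner rather than a shell script, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; the function names are hypothetical, and nothing here is chandra's own Python API.

```python
import shlex
import subprocess


def build_chandra_command(
    input_path: str,
    output_dir: str,
    method: str = "vllm",
    max_workers: int = 8,
    batch_size: int = 28,
    max_output_tokens: int = 8192,
    include_images: bool = True,
    no_headers_footers: bool = True,
) -> list[str]:
    """Assemble the argv list for the CLI call shown above."""
    cmd = [
        "chandra", input_path, output_dir,
        "--method", method,
        "--max-workers", str(max_workers),
        "--batch-size", str(batch_size),
        "--max-output-tokens", str(max_output_tokens),
    ]
    # Omitting a flag leaves whatever the CLI default is in effect.
    if include_images:
        cmd.append("--include-images")
    if no_headers_footers:
        cmd.append("--no-headers-footers")
    return cmd


def run_chandra(cmd: list[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_chandra_command("./invoices", "./processed")
print(shlex.join(cmd))  # prints the shell-equivalent command for logging
```

Passing a list to subprocess.run avoids shell quoting problems with paths that contain spaces.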
Example 3: Output Structure and Consumption
Each processed file generates a complete document package:
output/
├── contract.md # Clean Markdown for CMS/LLM ingestion
├── contract.html # Semantic HTML with preserved structure
├── contract_metadata.json # Technical metadata for pipeline debugging
├── image_001.png # Extracted figure with automatic caption
└── image_002.png # Diagram preserved from original document
The metadata JSON contains essential pipeline information:
{
"pages": 12,
"processed_pages": [1, 2, 3, 4, 5, 7, 9, 10, 11, 12],
"total_tokens": 15420,
"extraction_method": "vllm",
"model_version": "datalab-to/chandra-ocr-2",
"processing_time_seconds": 45.3
}
This structure enables deterministic pipeline orchestration: verify completeness with processed_pages, estimate costs with total_tokens, and debug failures with extraction_method tracking.
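The completeness check can be sketched in a few lines. The code assumes the metadata fields shown in the example above and assumes page ranges are 1-based and inclusive, as that example implies; the helper names are hypothetical.

```python
import json


def expand_page_range(spec: str) -> list[int]:
    """Expand a spec like "1-5,7,9-12" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def missing_pages(metadata: dict, requested: str) -> list[int]:
    """Return requested pages that are absent from processed_pages."""
    done = set(metadata.get("processed_pages", []))
    return [p for p in expand_page_range(requested) if p not in done]


# Metadata copied from the example above; a pipeline would json.load()
# the *_metadata.json file instead.
metadata = json.loads("""
{
  "pages": 12,
  "processed_pages": [1, 2, 3, 4, 5, 7, 9, 10, 11, 12],
  "total_tokens": 15420,
  "extraction_method": "vllm",
  "model_version": "datalab-to/chandra-ocr-2",
  "processing_time_seconds": 45.3
}
""")

print(missing_pages(metadata, "1-5,7,9-12"))  # -> []
print(missing_pages(metadata, "1-12"))        # -> [6, 8]
```

An empty list means every requested page was processed; anything else is a candidate for a targeted re-run with --page-range.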
Example 4: Environment-Based Configuration
For containerized deployments, configure entirely through environment variables:
# local.env file for Docker Compose or Kubernetes ConfigMap
MODEL_CHECKPOINT=datalab-to/chandra-ocr-2
MAX_OUTPUT_TOKENS=12384
# vLLM server settings for distributed deployment
VLLM_API_BASE=http://vllm-service.internal:8000/v1
VLLM_MODEL_NAME=chandra
VLLM_GPUS=0,1,2,3 # Multi-GPU allocation
Production insight: Separating configuration from invocation enables identical containers across dev/staging/production environments. The VLLM_API_BASE URL points to an internal Kubernetes service in production, localhost:8000 in development, and a load-balanced endpoint in staging—zero code changes required.
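A wrapper script can read the same variables so that it follows the container's configuration. This is a sketch of one way to do that, not how chandra loads its settings internally; the class and function names are hypothetical, and the fallback values are simply the ones used in the examples above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChandraSettings:
    model_checkpoint: str
    max_output_tokens: int
    vllm_api_base: str
    vllm_model_name: str
    vllm_gpus: tuple[int, ...]


def load_settings(env=None) -> ChandraSettings:
    """Read the variables named above, falling back to the example values."""
    env = os.environ if env is None else env
    gpus = env.get("VLLM_GPUS", "0")
    return ChandraSettings(
        model_checkpoint=env.get("MODEL_CHECKPOINT", "datalab-to/chandra-ocr-2"),
        max_output_tokens=int(env.get("MAX_OUTPUT_TOKENS", "12384")),
        vllm_api_base=env.get("VLLM_API_BASE", "http://localhost:8000/v1"),
        vllm_model_name=env.get("VLLM_MODEL_NAME", "chandra"),
        # "0,1,2,3" becomes (0, 1, 2, 3)
        vllm_gpus=tuple(int(g) for g in gpus.split(",") if g.strip()),
    )


# Passing a dict instead of os.environ keeps the function easy to test.
prod = load_settings({
    "VLLM_API_BASE": "http://vllm-service.internal:8000/v1",
    "VLLM_GPUS": "0,1,2,3",
})
print(prod.vllm_api_base, prod.vllm_gpus)
```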
Example 5: Interactive Streamlit Exploration
Before building automated pipelines, explore model behavior interactively:
# Install with app dependencies
pip install "chandra-ocr[app]"
# Launch local web interface
chandra_app
The Streamlit app provides immediate visual feedback on extraction quality, essential for:
- Validating output format selection (Markdown vs. HTML vs. JSON)
- Tuning MAX_OUTPUT_TOKENS for your document complexity
- Identifying documents that need preprocessing (rotation, deskewing)
- Training operations teams on expected output structure
Advanced Usage & Best Practices
Optimize throughput with request batching: The vLLM backend achieves 1.44 pages/sec on H100 with 96 concurrent sequences. Structure your pipeline to submit multiple documents concurrently rather than processing them one at a time. The P95 latency of 156s sounds high, but it was measured on maximum-complexity benchmark documents; the estimate for typical real-world documents is about 2 pages/sec.
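To turn those throughput figures into a planning number, the arithmetic is simply pages divided by pages per second. The backlog size below is a hypothetical value chosen for illustration; the two rates are the benchmark and estimated figures quoted above.

```python
def hours_to_process(pages: int, pages_per_sec: float) -> float:
    """Wall-clock hours to clear a backlog at a sustained throughput."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec / 3600


backlog = 100_000  # hypothetical backlog size, in pages
for label, rate in [("benchmark mix", 1.44), ("typical estimate", 2.0)]:
    print(f"{label}: {hours_to_process(backlog, rate):.1f} h")
# benchmark mix: 19.3 h
# typical estimate: 13.9 h
```

Both figures assume a single H100 kept saturated with concurrent requests, so treat them as a lower bound on elapsed time.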
Flash Attention is strongly recommended for HuggingFace: Without it, expect 2-3x slower inference and possible OOM errors on long documents. The installation command pip install flash-attn --no-build-isolation needs the CUDA toolkit available at build time, and the speedup is well worth that setup cost.
Handle multilingual documents with confidence: Chandra 2's 90-language support means you can process mixed-language documents without language detection preprocessing. A single PDF containing English, Arabic, and Japanese sections extracts correctly in one pass—eliminating the fragile language-segmentation pipelines that plague other OCR systems.
Monitor the failure rate metric: The benchmark shows 0% failure rate under load. If you see failures in production, they're likely infrastructure issues (GPU OOM, network timeout) rather than model limitations. Implement retry logic with exponential backoff for vLLM server restarts.
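The retry advice can be sketched as a small generic helper. This is an illustration, not part of chandra: the function name, the default delays, and the exception types treated as retryable are all assumptions to adapt to your client code.

```python
import random
import time


def retry_with_backoff(fn, attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter (up to 50% extra) spreads out retries
            # from parallel workers hitting a restarting server.
            sleep(delay + random.uniform(0, delay / 2))


# Example with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server restarting")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, calls["n"])  # -> ok 3
```

When the work is a CLI call made through subprocess.run(..., check=True), pass retry_on=(subprocess.CalledProcessError,) so non-zero exits are retried as well.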
Consider the managed API for highest accuracy: Datalab's hosted platform runs an improved Chandra variant scoring 86.7% overall versus the open weights' 85.9%. For critical applications where that 0.8-point accuracy difference matters, the $5 free credits let you A/B test against self-hosted performance.
Comparison with Alternatives
| Feature | Chandra 2 | olmOCR 2 | GPT-4o | Tesseract | Azure Form Recognizer |
|---|---|---|---|---|---|
| Open Weights | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| Self-Hostable | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| olmOCR Benchmark | 85.9% | 82.4% | 69.9% | ~45% | N/A |
| 90-Language Average | 72.7% | N/A | N/A | ~30% | ~55% |
| Handwriting | Excellent | Good | Moderate | Poor | Good |
| Table Reconstruction | Excellent | Good | Moderate | None | Good |
| Math/LaTeX | Excellent | Good | Moderate | None | None |
| Form Checkboxes | ✅ Native | ❌ No | ❌ No | ❌ No | ✅ Template-based |
| Pricing | Free under $2M funding/revenue | Free | $0.005/page | Free | $0.05/page |
| Offline Capable | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
The verdict: Proprietary APIs charge premium prices for moderate accuracy and zero deployment flexibility. Tesseract remains free but requires massive engineering investment for modern document types. olmOCR 2 is competent but trails in multilingual support and layout complexity. Chandra 2 delivers best-in-class accuracy with complete operational control—the rare combination that makes infrastructure teams celebrate.
FAQ: What Developers Ask About Chandra 2
Q: Can I use Chandra 2 commercially without restrictions?
A: The code is Apache 2.0. Model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $2M funding/revenue. Broader commercial licensing requires contacting Datalab. You cannot use it to compete directly with their hosted API.
Q: What GPU do I actually need?
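A: Minimum 24GB VRAM for single-document processing. For production throughput, an H100 80GB measured 1.44 pages/sec on the benchmark mix, with about 2 pages/sec estimated for typical documents. A10G (24GB) works for development and light workloads. Multi-GPU systems are configured through the VLLM_GPUS environment variable.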
Q: How does it handle poor-quality scans?
A: The "Old Scans" benchmark category (49.8%) specifically tests degraded inputs. While not perfect, it significantly outperforms general-purpose alternatives. Preprocessing (deskewing, contrast adjustment) helps marginal cases.
Q: Is my document data sent to external servers?
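A: Not when you self-host. The HuggingFace backend contacts the Hugging Face Hub only to download model weights on first use; your documents stay on the machine. The vLLM client sends page images over HTTP to whatever VLLM_API_BASE points at, so for absolute privacy, cache model weights in your infrastructure and point VLLM_API_BASE at localhost or an internal endpoint.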
Q: Can I fine-tune on my specific document types?
A: The repository provides base inference only. For custom fine-tuning, contact Datalab about commercial licensing or adapt the Qwen 3.5 architecture independently using their published weights as initialization.
Q: What's the real throughput for my document mix?
A: Benchmarks use adversarial documents (math, tables, scans combined). Pure text documents process faster. The 2 pages/sec real-world estimate assumes typical business documents. Monitor your actual processing_time_seconds in metadata JSON.
Q: How does output compare to Azure/Google document APIs?
A: Chandra 2 outputs semantic Markdown/HTML/JSON directly, with no proprietary schema to learn. The structure is immediately consumable by LLMs, static site generators, and content management systems without vendor SDK lock-in.
Conclusion: Your Documents Deserve Better Than 2010-Era OCR
The document AI landscape has been stuck in a false choice: expensive proprietary APIs with accuracy guarantees, or free tools that fail on anything beyond plain text. Chandra OCR 2 demolishes that compromise.
With benchmark-topping performance on complex layouts, genuine multilingual competence across 90 languages, handwriting recognition that handles cursive notes, and mathematical notation extraction that preserves LaTeX semantics—this is the tool that makes document automation actually work.
The dual inference architecture (local HuggingFace for privacy, vLLM for throughput) means you optimize for your constraints, not a vendor's pricing model. The Apache 2.0 code and permissive model licensing for small businesses remove the legal friction that slows adoption.
My recommendation? Install it today. Run your worst documents through it—the scanned form with coffee stains, the handwritten meeting notes, the research paper with nested tables. Watch Chandra 2 succeed where everything else failed. Then build your pipeline knowing the OCR layer won't be your bottleneck.
👉 Get started now: Clone the repository at github.com/datalab-to/chandra, grab your $5 in free credits on the Datalab platform, or jump straight into the public playground to test without installation. Your document processing pipeline will never be the same.