1. Introduction
GRM-OCR is a 300M-parameter OCR model built for document and image text extraction, optimized for efficiency and deployability without sacrificing recognition quality. It is designed to deliver strong OCR performance at a compact scale, making it practical for local inference, edge deployment, and high-throughput serving pipelines.
Built on top of tiiuae/Falcon-OCR, GRM-OCR inherits the early-fusion single-stack Transformer architecture and extends it with improved training signals, broader document coverage, and refinements targeted at real-world document diversity.
GRM-OCR is ideal for users who need reliable, fast, and resource-efficient OCR across a wide range of document types — from scanned pages and academic papers to invoices, receipts, handwritten notes, and complex multi-column layouts.
2. Key Capabilities
- Compact and Efficient: At just 300M parameters, GRM-OCR is roughly 3× smaller than comparable OCR VLMs, translating directly into faster inference and lower memory requirements.
- Layout-Aware Pipeline: Optional two-stage layout detection + per-region OCR for dense, multi-column, and heterogeneous documents.
- Early-Fusion Architecture: A single Transformer backbone processes text and images in a shared parameter space, avoiding the complexity of separate encoder-decoder pipelines.
- Broad Document Coverage: Handles handwriting, real-world photos, academic papers, tables, formulas, headers, captions, and more.
- vLLM Compatible: Serves efficiently via vLLM with an OpenAI-compatible API.
3. Quickstart
Installation
pip install "torch>=2.5" transformers pillow einops
GRM-OCR requires PyTorch 2.5 or newer for FlexAttention. The first call may be slower as
torch.compilebuilds optimized kernels.
Single-Image OCR
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"OrionLLM/GRM-OCR",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto",
)
image = Image.open("document.png")
texts = model.generate(image) # default category is "plain"
print(texts[0])
Choose an Output Format with category
texts = model.generate(image, category="text") # plain text
texts = model.generate(image, category="formula") # LaTeX
texts = model.generate(image, category="table") # HTML table
4. API
model.generate(images, category="plain", **kwargs)
| Parameter | Type | Description |
|---|---|---|
images |
PIL.Image.Image or list |
One or more input images |
category |
str |
Output format: plain, text, table, formula, caption, footnote, list-item, page-footer, page-header, section-header, title |
Returns: list[str] — one extracted string per image.
5. Layout OCR (Two-Stage Pipeline)
For dense documents with heterogeneous regions (multi-column layouts, interleaved tables and formulas, small captions), GRM-OCR supports an optional two-stage pipeline:
- A layout detector identifies regions on the page.
- GRM-OCR runs independently on each crop with a category-specific prompt.
We use PP-DocLayoutV3 as the layout detector.
results = model.generate_with_layout(image)
for det in results[0]:
print(f"[{det['category']}] {det['text'][:100]}...")
Batch mode:
results = model.generate_with_layout(
[Image.open("page1.png"), Image.open("page2.png")],
ocr_batch_size=32,
)
The layout model is loaded lazily on the first generate_with_layout() call and runs on the same device as the OCR model.
Returns: list[list[dict]], one list per image, in reading order:
{
"category": "text",
"bbox": [x1, y1, x2, y2], # in original image pixels
"score": 0.93, # detection confidence
"text": "..." # extracted text
}
6. Architecture
GRM-OCR is built on the Falcon-OCR early-fusion architecture — a single-stack Transformer that processes both text and images within a shared parameter space, without a separate vision encoder. This design choice avoids the complexity of common "vision encoder + text decoder" pipelines and enables more coherent cross-modal reasoning.
GRM-OCR applies targeted training improvements on top of this foundation: broader document diversity, refined category-specific prompting, and improved alignment for real-world OCR conditions — resulting in a model that punches above its weight class on structured document tasks while remaining deployable even on modest consumer hardware.
GRM-OCR is developed by OrionLLM and released under the Apache 2.0 License.
- Downloads last month
- 76
Model tree for OrionLLM/GRM-OCR
Base model
tiiuae/Falcon-OCREvaluation results
- Overall on allenai/olmOCR-bench View evaluation results leaderboard
- Arxiv Math on allenai/olmOCR-bench View evaluation results leaderboard 80.5 *
- Old Scans Math on allenai/olmOCR-bench View evaluation results leaderboard
- Table Tests on allenai/olmOCR-bench View evaluation results leaderboard
- Old Scans on allenai/olmOCR-bench View evaluation results leaderboard
- Headers Footers on allenai/olmOCR-bench View evaluation results leaderboard
- Multi Column on allenai/olmOCR-bench View evaluation results leaderboard
- Long Tiny Text on allenai/olmOCR-bench View evaluation results leaderboard 78.3 *