1. Introduction

GRM-OCR is a 300M-parameter OCR model built for document and image text extraction, optimized for efficiency and deployability without sacrificing recognition quality. It is designed to deliver strong OCR performance at a compact scale, making it practical for local inference, edge deployment, and high-throughput serving pipelines.

Built on top of tiiuae/Falcon-OCR, GRM-OCR inherits the early-fusion single-stack Transformer architecture and extends it with improved training signals, broader document coverage, and refinements targeted at real-world document diversity.

GRM-OCR is ideal for users who need reliable, fast, and resource-efficient OCR across a wide range of document types — from scanned pages and academic papers to invoices, receipts, handwritten notes, and complex multi-column layouts.

2. Key Capabilities

Compact and Efficient: At just 300M parameters, GRM-OCR is roughly 3× smaller than comparable OCR VLMs, translating directly into faster inference and lower memory requirements.
Layout-Aware Pipeline: Optional two-stage layout detection + per-region OCR for dense, multi-column, and heterogeneous documents.
Early-Fusion Architecture: A single Transformer backbone processes text and images in a shared parameter space, avoiding the complexity of separate encoder-decoder pipelines.
Broad Document Coverage: Handles handwriting, real-world photos, academic papers, tables, formulas, headers, captions, and more.
vLLM Compatible: Serves efficiently via vLLM with an OpenAI-compatible API.

3. Quickstart

Installation

pip install "torch>=2.5" transformers pillow einops

GRM-OCR requires PyTorch 2.5 or newer for FlexAttention. The first call may be slower as torch.compile builds optimized kernels.

Single-Image OCR

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "OrionLLM/GRM-OCR",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("document.png")
texts = model.generate(image)  # default category is "plain"
print(texts[0])

Choose an Output Format with `category`

texts = model.generate(image, category="text")     # plain text
texts = model.generate(image, category="formula")  # LaTeX
texts = model.generate(image, category="table")    # HTML table

4. API

`model.generate(images, category="plain", **kwargs)`

Parameter	Type	Description
`images`	`PIL.Image.Image` or `list`	One or more input images
`category`	`str`	Output format: `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`

Returns: list[str] — one extracted string per image.

5. Layout OCR (Two-Stage Pipeline)

For dense documents with heterogeneous regions (multi-column layouts, interleaved tables and formulas, small captions), GRM-OCR supports an optional two-stage pipeline:

A layout detector identifies regions on the page.
GRM-OCR runs independently on each crop with a category-specific prompt.

We use PP-DocLayoutV3 as the layout detector.

results = model.generate_with_layout(image)
for det in results[0]:
    print(f"[{det['category']}] {det['text'][:100]}...")

Batch mode:

results = model.generate_with_layout(
    [Image.open("page1.png"), Image.open("page2.png")],
    ocr_batch_size=32,
)

The layout model is loaded lazily on the first generate_with_layout() call and runs on the same device as the OCR model.

Returns: list[list[dict]], one list per image, in reading order:

{
    "category": "text",
    "bbox": [x1, y1, x2, y2],  # in original image pixels
    "score": 0.93,              # detection confidence
    "text": "..."               # extracted text
}

6. Architecture

GRM-OCR is built on the Falcon-OCR early-fusion architecture — a single-stack Transformer that processes both text and images within a shared parameter space, without a separate vision encoder. This design choice avoids the complexity of common "vision encoder + text decoder" pipelines and enables more coherent cross-modal reasoning.

GRM-OCR applies targeted training improvements on top of this foundation: broader document diversity, refined category-specific prompting, and improved alignment for real-world OCR conditions — resulting in a model that punches above its weight class on structured document tasks while remaining deployable even on modest consumer hardware.

GRM-OCR is developed by OrionLLM and released under the Apache 2.0 License.

Downloads last month: 76

Safetensors

Model size

0.3B params

Tensor type

F32

Model tree for OrionLLM/GRM-OCR

Base model

tiiuae/Falcon-OCR

Finetuned

(4)

this model

Evaluation results

Overall on allenai/olmOCR-bench View evaluation results leaderboard

82.1 ^*
Arxiv Math on allenai/olmOCR-bench View evaluation results leaderboard

80.5 ^*
Old Scans Math on allenai/olmOCR-bench View evaluation results leaderboard

71.8 ^*
Table Tests on allenai/olmOCR-bench View evaluation results leaderboard

89.5 ^*
Old Scans on allenai/olmOCR-bench View evaluation results leaderboard

42.5 ^*
Headers Footers on allenai/olmOCR-bench View evaluation results leaderboard

95.8 ^*
Multi Column on allenai/olmOCR-bench View evaluation results leaderboard

88.7 ^*
Long Tiny Text on allenai/olmOCR-bench View evaluation results leaderboard

78.3 ^*