logo


1. Introduction

GRM-OCR is a 300M-parameter OCR model built for document and image text extraction, optimized for efficiency and deployability without sacrificing recognition quality. It is designed to deliver strong OCR performance at a compact scale, making it practical for local inference, edge deployment, and high-throughput serving pipelines.

Built on top of tiiuae/Falcon-OCR, GRM-OCR inherits the early-fusion single-stack Transformer architecture and extends it with improved training signals, broader document coverage, and refinements targeted at real-world document diversity.

GRM-OCR is ideal for users who need reliable, fast, and resource-efficient OCR across a wide range of document types — from scanned pages and academic papers to invoices, receipts, handwritten notes, and complex multi-column layouts.

2. Key Capabilities

  • Compact and Efficient: At just 300M parameters, GRM-OCR is roughly 3× smaller than comparable OCR VLMs, translating directly into faster inference and lower memory requirements.
  • Layout-Aware Pipeline: Optional two-stage layout detection + per-region OCR for dense, multi-column, and heterogeneous documents.
  • Early-Fusion Architecture: A single Transformer backbone processes text and images in a shared parameter space, avoiding the complexity of separate encoder-decoder pipelines.
  • Broad Document Coverage: Handles handwriting, real-world photos, academic papers, tables, formulas, headers, captions, and more.
  • vLLM Compatible: Serves efficiently via vLLM with an OpenAI-compatible API.

3. Quickstart

Installation

pip install "torch>=2.5" transformers pillow einops

GRM-OCR requires PyTorch 2.5 or newer for FlexAttention. The first call may be slower as torch.compile builds optimized kernels.

Single-Image OCR

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "OrionLLM/GRM-OCR",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("document.png")
texts = model.generate(image)  # default category is "plain"
print(texts[0])

Choose an Output Format with category

texts = model.generate(image, category="text")     # plain text
texts = model.generate(image, category="formula")  # LaTeX
texts = model.generate(image, category="table")    # HTML table

4. API

model.generate(images, category="plain", **kwargs)

Parameter Type Description
images PIL.Image.Image or list One or more input images
category str Output format: plain, text, table, formula, caption, footnote, list-item, page-footer, page-header, section-header, title

Returns: list[str] — one extracted string per image.


5. Layout OCR (Two-Stage Pipeline)

For dense documents with heterogeneous regions (multi-column layouts, interleaved tables and formulas, small captions), GRM-OCR supports an optional two-stage pipeline:

  1. A layout detector identifies regions on the page.
  2. GRM-OCR runs independently on each crop with a category-specific prompt.

We use PP-DocLayoutV3 as the layout detector.

results = model.generate_with_layout(image)
for det in results[0]:
    print(f"[{det['category']}] {det['text'][:100]}...")

Batch mode:

results = model.generate_with_layout(
    [Image.open("page1.png"), Image.open("page2.png")],
    ocr_batch_size=32,
)

The layout model is loaded lazily on the first generate_with_layout() call and runs on the same device as the OCR model.

Returns: list[list[dict]], one list per image, in reading order:

{
    "category": "text",
    "bbox": [x1, y1, x2, y2],  # in original image pixels
    "score": 0.93,              # detection confidence
    "text": "..."               # extracted text
}

6. Architecture

GRM-OCR is built on the Falcon-OCR early-fusion architecture — a single-stack Transformer that processes both text and images within a shared parameter space, without a separate vision encoder. This design choice avoids the complexity of common "vision encoder + text decoder" pipelines and enables more coherent cross-modal reasoning.

GRM-OCR applies targeted training improvements on top of this foundation: broader document diversity, refined category-specific prompting, and improved alignment for real-world OCR conditions — resulting in a model that punches above its weight class on structured document tasks while remaining deployable even on modest consumer hardware.


GRM-OCR is developed by OrionLLM and released under the Apache 2.0 License.

Downloads last month
76
Safetensors
Model size
0.3B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for OrionLLM/GRM-OCR

Finetuned
(4)
this model