arxiv:2604.25374

Language corpora for the Dutch medical domain

Published on Apr 28

Authors:

Abstract

A large-scale Dutch medical language corpus containing approximately 100 million documents with 35 billion tokens has been created for pre-training and downstream natural language processing tasks.

AI-generated summary

Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

