UMCU/CardioDeBERTa.nl_clinical
Fill-Mask • 0.4B
A large-scale Dutch medical language corpus containing approximately 100 million documents with 35 billion tokens has been created for pre-training and downstream natural language processing tasks.
Background: Dutch medical corpora are scarce, which limits the development of Dutch medical NLP.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises approximately 35 billion tokens in about 100 million documents across the medical domain and is freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
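The abstract does not say how medical text was identified in generic corpora. As a rough illustration only, the sketch below flags a document as medical when the share of its tokens found in a small seed vocabulary passes a threshold. The term list, the `medical_term_ratio` and `is_medical` helpers, and the 0.1 threshold are all hypothetical and are not taken from the paper; a real pipeline would more likely use a much larger lexicon or a trained classifier.

```python
import re
import unicodedata

# Hypothetical seed vocabulary (assumption, not from the paper).
MEDICAL_TERMS = {
    unicodedata.normalize("NFC", term)
    for term in (
        "patiënt", "diagnose", "behandeling", "medicatie",
        "ziekenhuis", "symptomen", "bloeddruk", "hartfalen",
    )
}


def medical_term_ratio(text: str) -> float:
    """Return the fraction of tokens that appear in MEDICAL_TERMS."""
    # Normalize to NFC so accented letters such as "ë" stay inside one token.
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"\w+", normalized)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in MEDICAL_TERMS)
    return hits / len(tokens)


def is_medical(text: str, threshold: float = 0.1) -> bool:
    """Flag a document as medical if its term ratio reaches the threshold."""
    # The threshold is an arbitrary illustrative value (assumption).
    return medical_term_ratio(text) >= threshold


docs = [
    "De patiënt kreeg medicatie voor hoge bloeddruk na de diagnose hartfalen.",
    "Het weer in Utrecht is vandaag zonnig met een lichte wind.",
]

# Keep only the documents that the heuristic flags as medical.
medical_docs = [doc for doc in docs if is_medical(doc)]
```

A ratio-based rule is cheap enough to run over a very large generic corpus, but it trades recall for simplicity: documents that use medical vocabulary outside the seed list are missed.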
Get this paper in your agent:
hf papers read 2604.25374
curl -LsSf https://hf.co/cli/install.sh | bash