Pretraining Datasets
Collection
This collection provides high-quality, large-scale Romanian pretraining datasets derived from FineWeb-2.
3 items
FineWeb2-RoEdu-Classifier is a lightweight quality classifier for the Romanian language. It is designed to distinguish high-quality educational content from generic web text. The model was trained on data annotated by Gemma3 12B. More details can be found here.
You can find a demo here.
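As a rough illustration of how a quality classifier of this kind is typically used to filter web text, here is a minimal sketch. It does not load the actual model: `score_fn` stands in for whatever call returns the classifier's score, and `toy_score`, its keyword list, the sample documents, and the 0.5 threshold are all hypothetical placeholders chosen only to make the example runnable.

```python
from typing import Callable, Iterable, List


def filter_educational(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only documents whose score is at least `threshold`."""
    return [doc for doc in docs if score_fn(doc) >= threshold]


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the real classifier (assumption, not the model).

    Returns the fraction of a few 'educational' Romanian keywords found in the text.
    """
    keywords = ("teorema", "definiție", "exemplu")
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword in lowered)
    return hits / len(keywords)


# Hypothetical sample documents: one educational, one generic web text.
sample_docs = [
    "Teorema lui Pitagora: definiție și exemplu.",
    "Cumpără acum, reducere mare!",
]

kept = filter_educational(sample_docs, toy_score, threshold=0.5)
print(kept)
```

In practice `toy_score` would be replaced by a function that runs the real classifier on a document and returns its score, and the threshold would be tuned on annotated data.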