FineWeb2-RoEdu-Classifier

FineWeb2-RoEdu-Classifier is a lightweight quality classifier for the Romanian language. It is designed to distinguish high-quality educational content from generic web text. The model was trained on data annotated by Gemma3 12B. More details can be found here.

Key Features

Educational Quality Scoring: The model assigns a scalar score (typically 0-5) to text, reflecting its educational value and coherence.
Topic, Format and Educational Level: The model also predicts additional signals that could be used for diversity filtering.
Distilled Knowledge: It is trained on Romanian web samples annotated by Gemma3 12B, effectively distilling the annotating model's judgment into a more efficient architecture.
Proven Effectiveness: We showed that using data curated by this classifier improved results on several benchmarks (ARC, HellaSwag).
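
As a rough illustration of how the score and the topic signal might be combined for curation, here is a minimal sketch. It does not call the classifier: it assumes each document already carries a predicted educational score and topic label. The `ScoredDoc` type, its field names, the `curate` helper, the threshold of 3.0 and the per-topic cap are all hypothetical choices made for this example, not part of the released model or its recommended settings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredDoc:
    text: str
    edu_score: float  # hypothetical field: scalar educational score, roughly 0-5
    topic: str        # hypothetical field: predicted topic label


def curate(
    docs: Iterable[ScoredDoc],
    min_score: float = 3.0,   # assumed threshold, tune for your corpus
    max_per_topic: int = 2,   # assumed cap used for diversity filtering
) -> List[ScoredDoc]:
    """Keep documents scoring at least min_score, at most max_per_topic per topic."""
    kept: List[ScoredDoc] = []
    per_topic = defaultdict(int)
    # Visit the highest-scoring documents first so the cap retains the best ones.
    for doc in sorted(docs, key=lambda d: d.edu_score, reverse=True):
        if doc.edu_score < min_score:
            break  # sorted in descending order, so no later document can pass
        if per_topic[doc.topic] >= max_per_topic:
            continue  # this topic is already full
        per_topic[doc.topic] += 1
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        ScoredDoc("Fotosinteza este procesul prin care...", 4.5, "science"),
        ScoredDoc("Reduceri de weekend la electrocasnice!", 0.8, "shopping"),
    ]
    for doc in curate(sample):
        print(f"{doc.edu_score:.1f}\t{doc.topic}\t{doc.text}")
```

Sorting before applying the cap is a design choice: when a topic is over-represented, the lowest-scoring documents in that topic are the ones dropped.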

Usage

You can find a demo here.

