Commit History
Upload from GitHub Actions: Merge pull request #18 from datenlabor-bmz/pr-17 a0d1624 verified
Upload from GitHub Actions: Add auto-translated datasets c790fdb verified
Upload from GitHub Actions: Update evaluation results f88768f verified
Upload from GitHub Actions: Update evaluation results 95c4e14 verified
Upload from GitHub Actions: ran full evaluation locally 088f96f verified
Upload from GitHub Actions: restored old results.json 9e9d3bd verified
Upload from GitHub Actions: updated and cleaned up scripts for new eval runs 963cb78 verified
Upload from GitHub Actions: Update models.py, models.json, and results.json with latest evaluation data and model additions 8eebb41 verified
Upload from GitHub Actions: Merge pull request #9 from datenlabor-bmz/jn-dev 7c06aef verified
Upload from GitHub Actions: Get more results, compute average based on all tasks 98c6811 verified
Upload from GitHub Actions: Translate MMLU and evaluate 4c5c136 verified
Upload from GitHub Actions: Correlation plot b0aa389 verified
Upload from GitHub Actions: Evaluate on autotranslated GSM dataset f3a09a2 verified
Upload from GitHub Actions: Evaluate Google Translate 338dc9b verified
Upload from GitHub Actions: More models and languages a73f888 verified
Upload from GitHub Actions: Results for 50 languages 3dfd880 verified
Upload from GitHub Actions: Evaluate on 40 languages 941d5c5 verified
Upload from nightly evaluation run c3be561 verified
Upload from GitHub Actions: Add math benchmarks 549360a verified
Upload from GitHub Actions: More results 52abc5b verified
Upload from nightly evaluation run 4a34e67 verified
Upload from GitHub Actions: Update model ranking fetching f840423 verified
Upload from GitHub Actions: Use FLORES+ via Huggingface 913253a verified
Upload from nightly evaluation run 9ee89ef verified
Upload from nightly evaluation run 8a4050a verified
Upload from GitHub Actions: New results b311dd5 verified
Upload from nightly evaluation run dcb356d verified
Block gemini-2.5-pro-exp-03-25 092c06a
David Pomerenke committed on
Only run tasks for which there is no result yet 2f9dee1
David Pomerenke committed on
Run on 40 languages, additional models 260c1a3
David Pomerenke committed on
Run evals b0c61ed
David Pomerenke committed on
Run on 15 languages f8a3dad
David Pomerenke committed on
Add model history plot f52ec6e
David Pomerenke committed on
Implement MMLU task a683732
David Pomerenke committed on
Add Global MMLU benchmark ce2acb0
David Pomerenke committed on
Translation both from and to 731eddd
David Pomerenke committed on
Add OpenRouter metadata to models 9002fc2
David Pomerenke committed on
Run on 100 languages, adjust display 8274634
David Pomerenke committed on
Add Dockerfile 4d13673
David Pomerenke committed on
Language selection checkboxes & filtering in backend d91b022
David Pomerenke committed on
Basic backend setup with FastAPI but without actual filtering 2c21cf7
David Pomerenke committed on
spBLEU tokenizer, run on more languages eaf2d97
David Pomerenke committed on
Better map tooltip 92b2164
David Pomerenke committed on
Process data for country map 723f963
David Pomerenke committed on
Autonyms and cooler dataset search display 33469f2
David Pomerenke committed on
More models c5278dd
David Pomerenke committed on
Basic language table d1a7111
David Pomerenke committed on
Refactor eval code into files da6e1bc
David Pomerenke committed on
Model table using React ecf4195
David Pomerenke committed on