languagebench / results.json

Commit History

Upload from GitHub Actions: use old results

5102b0a
verified

davidpomerenke committed on Sep 11, 2025

Upload from GitHub Actions: Merge pull request #18 from datenlabor-bmz/pr-17

a0d1624
verified

davidpomerenke committed on Sep 11, 2025

Upload from GitHub Actions: Add auto-translated datasets

c790fdb
verified

davidpomerenke committed on Sep 1, 2025

Upload from GitHub Actions: Update evaluation results

f88768f
verified

davidpomerenke committed on Sep 1, 2025

Upload from GitHub Actions: Update evaluation results

95c4e14
verified

davidpomerenke committed on Aug 31, 2025

Upload from GitHub Actions: ran full evaluation locally

088f96f
verified

davidpomerenke committed on Aug 30, 2025

Upload from GitHub Actions: restored old results.json

9e9d3bd
verified

davidpomerenke committed on Aug 29, 2025

Upload from GitHub Actions: updated and cleaned up scripts for new eval runs

963cb78
verified

davidpomerenke committed on Aug 29, 2025

Upload from GitHub Actions: Update models.py, models.json, and results.json with latest evaluation data and model additions

8eebb41
verified

davidpomerenke committed on Aug 27, 2025

Upload from GitHub Actions: Merge pull request #9 from datenlabor-bmz/jn-dev

7c06aef
verified

davidpomerenke committed on Aug 5, 2025

Upload from GitHub Actions: Get more results, compute average based on all tasks

98c6811
verified

davidpomerenke committed on Jul 2, 2025

Upload from GitHub Actions: Translate MMLU and evaluate

4c5c136
verified

davidpomerenke committed on Jun 30, 2025

Upload from GitHub Actions: Correlation plot

b0aa389
verified

davidpomerenke committed on Jun 30, 2025

Upload from GitHub Actions: Evaluate on autotranslated GSM dataset

f3a09a2
verified

davidpomerenke committed on Jun 29, 2025

Upload from GitHub Actions: Evaluate Google Translate

338dc9b
verified

davidpomerenke committed on Jun 28, 2025

Upload from GitHub Actions: More models and languages

a73f888
verified

davidpomerenke committed on Jun 6, 2025

Upload from GitHub Actions: Results for 50 languages

3dfd880
verified

davidpomerenke committed on Jun 6, 2025

Upload from GitHub Actions: Eavaluate on 40 languages

941d5c5
verified

davidpomerenke committed on Jun 4, 2025

Upload from nightly evaluation run

c3be561
verified

davidpomerenke committed on Jun 1, 2025

Upload from GitHub Actions: Add math benchmarks

549360a
verified

davidpomerenke committed on May 22, 2025

Upload from GitHub Actions: More results

52abc5b
verified

davidpomerenke committed on May 22, 2025

Upload from nightly evaluation run

4a34e67
verified

davidpomerenke committed on May 22, 2025

Upload from GitHub Actions: Update model ranking fetching

f840423
verified

davidpomerenke committed on May 22, 2025

Upload from GitHub Actions: Use FLORES+ via Huggingface

913253a
verified

davidpomerenke committed on May 22, 2025

Upload from nightly evaluation run

9ee89ef
verified

davidpomerenke committed on May 19, 2025

Upload from nightly evaluation run

8a4050a
verified

davidpomerenke committed on May 14, 2025

Upload from GitHub Actions: New results

b311dd5
verified

davidpomerenke committed on May 14, 2025

Upload from nightly evaluation run

dcb356d
verified

davidpomerenke committed on May 7, 2025

Block gemini-2.5-pro-exp-03-25

092c06a

David Pomerenke committed on May 5, 2025

Only run tasks for which there is no result yet

2f9dee1

David Pomerenke committed on May 4, 2025

Run on 40 languages, additional models

260c1a3

David Pomerenke committed on Apr 27, 2025

Run evals

b0c61ed

David Pomerenke committed on Apr 27, 2025

Run on 15 languages

f8a3dad

David Pomerenke committed on Apr 18, 2025

Add model history plot

f52ec6e

David Pomerenke committed on Apr 18, 2025

Implement MMLU task

a683732

David Pomerenke committed on Apr 18, 2025

Add Global MMLU benchmark

ce2acb0

David Pomerenke committed on Apr 17, 2025

Translation both from and to

731eddd

David Pomerenke committed on Apr 13, 2025

Add OpenRouter metadata to models

9002fc2

David Pomerenke committed on Apr 11, 2025

Run on 100 languages, adjust display

8274634

David Pomerenke committed on Apr 6, 2025

Add Dockerfile

4d13673

David Pomerenke committed on Apr 6, 2025

Language selection checkboxes & filtering in backend

d91b022

David Pomerenke committed on Apr 4, 2025

Basic backend setup with FastApi but without actual filtering

2c21cf7

David Pomerenke committed on Mar 29, 2025

spBLEU tokenizer, run on more languages

eaf2d97

David Pomerenke committed on Mar 25, 2025

Better map tooltip

92b2164

David Pomerenke committed on Mar 23, 2025

Process data for country map

723f963

David Pomerenke committed on Mar 21, 2025

Autonymns and cooler dataset search display

33469f2

David Pomerenke committed on Mar 16, 2025

More models

c5278dd

David Pomerenke committed on Mar 15, 2025

Basic language table

d1a7111

David Pomerenke committed on Mar 15, 2025

Refactor eval code into files

da6e1bc

David Pomerenke committed on Mar 15, 2025

Model table using React

ecf4195

David Pomerenke committed on Mar 15, 2025
