Update content.py
Update benchmark range (2021-2025)
content.py CHANGED (+4 -4)
@@ -8,11 +8,11 @@ LLMLAGBENCH_INTRO = """
 Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
 a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
 external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
-outdated time-sensitive information with general knowledge during reasoning tasks, **
-compromising response accuracy**.
+outdated time-sensitive information with general knowledge during reasoning or false classification tasks, **compromising response accuracy**.
 
 LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
-an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises of **1,700+ curated questions**
+an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated questions**
+about events sampled from news reports published between January 2021 and October 2025. We plan to update the question set regularly. Each
 question could not be accurately answered before the event was reported in news media. We evaluate model
 responses using a **0-2 scale faithfulness metric** and apply the **PELT (Pruned Exact Linear Time)** changepoint
 detection algorithm to identify where model performance exhibits statistically significant drops,
@@ -27,7 +27,7 @@ years, underscoring the necessity of independent empirical validation.
 # Section above the leaderboard table
 LEADERBOARD_INTRO = """
 The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
-all 1,700+ questions spanning
+all 1,700+ questions spanning 2021-2025. The table also displays **Provider Cutoff** dates as declared
 by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
 and additional metadata including release dates and model parameters. **Notable discrepancies** between
 provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
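
The intro text above describes scoring responses on a 0-2 faithfulness scale and running **PELT (Pruned Exact Linear Time)** changepoint detection to find where performance drops. For reference, a minimal sketch of that kind of step, assuming per-month average scores in a NumPy array and using the `ruptures` library; the synthetic data, the `rbf` cost model, and the penalty value are illustrative assumptions, not the benchmark's actual configuration.

```python
# Minimal sketch: PELT changepoint detection over per-month faithfulness scores.
# The data below is synthetic and the cost model / penalty are illustrative
# choices; they are not the benchmark's actual configuration.
import numpy as np
import ruptures as rpt  # pip install ruptures

# Synthetic monthly averages on the 0-2 scale, Jan 2021 - Oct 2025 (58 months):
# strong performance before an assumed cutoff, a drop afterwards.
rng = np.random.default_rng(0)
monthly_scores = np.clip(
    np.concatenate([rng.normal(1.8, 0.1, 40), rng.normal(0.6, 0.2, 18)]), 0.0, 2.0
)

# Fit PELT and predict breakpoints; the last element is always len(signal).
algo = rpt.Pelt(model="rbf", min_size=3, jump=1).fit(monthly_scores)
breakpoints = algo.predict(pen=5)

# Translate the first breakpoint into a calendar month.
months = np.arange(np.datetime64("2021-01"), np.datetime64("2025-11"))
changepoints = [bp for bp in breakpoints if bp < len(monthly_scores)]
if changepoints:
    print("First detected performance drop at:", months[changepoints[0]])
else:
    print("No significant changepoint detected.")
```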