ppezik committed
Commit a51bbaa · verified · 1 Parent(s): 207bddc

Update content.py


Update benchmark range (2021-2025)

Files changed (1)
  1. content.py +4 -4
content.py CHANGED
@@ -8,11 +8,11 @@ LLMLAGBENCH_INTRO = """
 Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
 a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
 external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
-outdated time-sensitive information with general knowledge during reasoning tasks, **potentially
-compromising response accuracy**.
+outdated time-sensitive information with general knowledge during reasoning or classification tasks, **compromising response accuracy**.
 
 LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
-an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises of **1,700+ curated questions** about events sampled from news reports published between 2020-2025 (we plan to update the question set regularly). Each
+an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated questions**
+about events sampled from news reports published between January 2021 and October 2025. We plan to update the question set regularly. Each
 question could not be accurately answered before the event was reported in news media. We evaluate model
 responses using a **0-2 scale faithfulness metric** and apply the **PELT (Pruned Exact Linear Time)** changepoint
 detection algorithm to identify where model performance exhibits statistically significant drops,
@@ -27,7 +27,7 @@ years, underscoring the necessity of independent empirical validation.
 # Section above the leaderboard table
 LEADERBOARD_INTRO = """
 The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
-all 1,700+ questions spanning 2020-2025. The table also displays **Provider Cutoff** dates as declared
+all 1,700+ questions spanning 2021-2025. The table also displays **Provider Cutoff** dates as declared
 by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
 and additional metadata including release dates and model parameters. **Notable discrepancies** between
 provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
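
For context on the methodology the updated text describes: the detected cutoffs come from running PELT changepoint detection over faithfulness scores ordered by event date. A minimal sketch of that kind of analysis, assuming the `ruptures` Python library and made-up monthly scores (not LLMLagBench's actual code or data), could look like this:

```python
# Sketch of PELT-based cutoff detection, assuming the `ruptures` library.
# The scores below are invented placeholder values, not LLMLagBench results.
import numpy as np
import ruptures as rpt

# Hypothetical mean faithfulness scores (0-2 scale) per month of event date.
monthly_scores = np.array([
    1.8, 1.7, 1.8, 1.9, 1.8, 1.7, 1.8, 1.8, 1.7, 1.8, 1.8, 1.7,  # year 1
    1.8, 1.7, 1.6, 1.7, 0.6, 0.5, 0.4, 0.5, 0.4, 0.3, 0.4, 0.3,  # year 2
])

# PELT (Pruned Exact Linear Time) searches for changepoints in the series;
# the penalty controls how large a shift must be to count.
algo = rpt.Pelt(model="rbf", min_size=2).fit(monthly_scores.reshape(-1, 1))
breakpoints = algo.predict(pen=3)

# The last index returned is just the series length; the others mark months
# where performance shifts, i.e., candidate training-data cutoffs.
print("Detected changepoints (month indices):", breakpoints[:-1])
```

In this toy series the score drop around index 16 would be flagged as a changepoint, which is the kind of statistically significant drop the benchmark interprets as an empirically detected cutoff.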