Update content.py
Update benchmark range (2021-2025)
content.py CHANGED (+4 -4)
@@ -8,11 +8,11 @@ LLMLAGBENCH_INTRO = """
 Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
 a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
 external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
-outdated time-sensitive information with general knowledge during reasoning tasks, **
-compromising response accuracy**.
+outdated time-sensitive information with general knowledge during reasoning or false classification tasks, **compromising response accuracy**.
 
 LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
-an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises of **1,700+ curated questions**
+an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated questions**
+about events sampled from news reports published between January 2021 and October 2025. We plan to update the question set regularly. Each
 question could not be accurately answered before the event was reported in news media. We evaluate model
 responses using a **0-2 scale faithfulness metric** and apply the **PELT (Pruned Exact Linear Time)** changepoint
 detection algorithm to identify where model performance exhibits statistically significant drops,
@@ -27,7 +27,7 @@ years, underscoring the necessity of independent empirical validation.
 # Section above the leaderboard table
 LEADERBOARD_INTRO = """
 The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
-all 1,700+ questions spanning
+all 1,700+ questions spanning 2021-2025. The table also displays **Provider Cutoff** dates as declared
 by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
 and additional metadata including release dates and model parameters. **Notable discrepancies** between
 provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
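
The intro text above describes scoring responses on a 0-2 faithfulness scale and running **PELT (Pruned Exact Linear Time)** changepoint detection to find where performance drops. For reference, a minimal sketch of that kind of step, assuming per-month average scores in a NumPy array and using the `ruptures` library; the synthetic data, the `rbf` cost model, and the penalty value are illustrative assumptions, not the benchmark's actual configuration.

```python
# Minimal sketch: PELT changepoint detection over per-month faithfulness scores.
# The data below is synthetic and the cost model / penalty are illustrative
# choices; they are not the benchmark's actual configuration.
import numpy as np
import ruptures as rpt  # pip install ruptures

# Synthetic monthly averages on the 0-2 scale, Jan 2021 - Oct 2025 (58 months):
# strong performance before an assumed cutoff, a drop afterwards.
rng = np.random.default_rng(0)
monthly_scores = np.clip(
    np.concatenate([rng.normal(1.8, 0.1, 40), rng.normal(0.6, 0.2, 18)]), 0.0, 2.0
)

# Fit PELT and predict breakpoints; the last element is always len(signal).
algo = rpt.Pelt(model="rbf", min_size=3, jump=1).fit(monthly_scores)
breakpoints = algo.predict(pen=5)

# Translate the first breakpoint into a calendar month.
months = np.arange(np.datetime64("2021-01"), np.datetime64("2025-11"))
changepoints = [bp for bp in breakpoints if bp < len(monthly_scores)]
if changepoints:
    print("First detected performance drop at:", months[changepoints[0]])
else:
    print("No significant changepoint detected.")
```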