Sergey Pletenev*,2,1,
Maria Marina*,2,1,
Nikolay Ivanov1,
Daria Galimzianova4,
Nikita Krayko4,
Mikhail Salnikov2,1,
Vasily Konovalov2,5,
Alexander Panchenko1,2,
Viktor Moskvoretskii1,3
1Skoltech, 2AIRI, 3HSE University, 4MTS AI, 5MIPT
*Indicates Equal Contribution
Evergreen questions (answers remain stable over time):
- "What is the chemical symbol for oxygen?" Answer: O (will never change)
- "Who wrote Romeo and Juliet?" Answer: William Shakespeare (historical fact)
- "What is the largest planet in our solar system?" Answer: Jupiter (astronomical fact)
Mutable questions (answers change over time):
- "Who is the current President of the United States?" Changes every 4-8 years
- "What is the world's tallest building?" Changes as new buildings are constructed
- "How many people live in Tokyo?" Population changes annually
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o's retrieval behavior.
Even state-of-the-art models like GPT-4 struggle with explicit evergreen classification. Our EG-E5 classifier significantly outperforms all tested LLMs, achieving 91% F1 versus 81-88% for the best LLMs.
Evergreen-ness is the strongest predictor of GPT-4o's retrieval behavior, more than twice as informative as uncertainty measures, suggesting retrieval is closely tied to question temporality.
Popular QA benchmarks contain 6-18% mutable questions with outdated answers. Models perform up to 40% better on evergreen questions, highlighting the need for temporal filtering.
Adding evergreen probability as a feature improves self-knowledge estimation in 16 out of 18 evaluation settings, enhancing model trustworthiness.
Self-knowledge estimation: Improve LLM trustworthiness by helping models better understand when they know or don't know the answer to a question.
Impact: Achieved the best results in 16 out of 18 evaluation settings across multiple QA datasets.
QA dataset filtering: Automatically filter out questions with time-sensitive answers to create more reliable benchmarks.
Discovery: Popular datasets like Natural Questions contain up to 18% mutable questions with outdated answers.
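The filtering step can be sketched as follows. The `evergreen_prob` values below are illustrative stand-ins for the scores a classifier such as EG-E5 would produce; the real checkpoint is not loaded here, and the threshold is an assumption, not a value from the paper:

```python
# Minimal sketch of temporal filtering for a QA benchmark.
# The probabilities are hand-set stand-ins for the output of an
# evergreen classifier such as EG-E5, not real model scores.

def filter_evergreen(examples, threshold=0.5):
    """Keep only questions whose evergreen probability clears the threshold."""
    return [ex for ex in examples if ex["evergreen_prob"] >= threshold]

qa_dataset = [
    {"question": "What is the chemical symbol for oxygen?", "evergreen_prob": 0.97},
    {"question": "Who is the current President of the United States?", "evergreen_prob": 0.08},
    {"question": "Who wrote Romeo and Juliet?", "evergreen_prob": 0.95},
    {"question": "What is the world's tallest building?", "evergreen_prob": 0.12},
]

stable = filter_evergreen(qa_dataset, threshold=0.5)
print([ex["question"] for ex in stable])
```

Mutable questions are dropped rather than relabeled, since their gold answers may already be stale at evaluation time.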
Retrieval behavior: Understand and predict when advanced models like GPT-4o decide to search for external information.
Insight: Evergreen-ness is 2x more predictive of GPT-4o's retrieval decisions than uncertainty measures.
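One simple way to operationalize this insight is to gate retrieval on the evergreen score in addition to model uncertainty. The function, thresholds, and combination rule below are illustrative assumptions for a sketch, not the decision procedure GPT-4o actually uses:

```python
# Sketch: decide whether to retrieve based on question temporality.
# Thresholds and the combination rule are assumed for illustration.

def should_retrieve(evergreen_prob, uncertainty,
                    eg_threshold=0.5, unc_threshold=0.7):
    """Retrieve when the question is likely mutable, or when the model is
    uncertain and the question is not clearly evergreen."""
    if evergreen_prob < eg_threshold:  # mutable: a stored answer may be stale
        return True
    return uncertainty > unc_threshold and evergreen_prob < 0.9

print(should_retrieve(evergreen_prob=0.1, uncertainty=0.2))   # -> True (mutable)
print(should_retrieve(evergreen_prob=0.95, uncertainty=0.9))  # -> False (stable fact)
```

Putting the evergreen check first reflects the finding that temporality predicts retrieval more strongly than uncertainty does.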
| Model | Overall F1 | English | Russian | French | German | Hebrew | Arabic | Chinese |
|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.910 | 0.913 | 0.909 | 0.910 | 0.904 | 0.900 | 0.897 | 0.906 |
| bert-base-multilingual-cased | 0.893 | 0.900 | 0.889 | 0.884 | 0.889 | 0.883 | 0.902 | 0.891 |
| mdeberta-v3-base | 0.836 | 0.842 | 0.845 | 0.841 | 0.832 | 0.825 | 0.831 | 0.836 |
| multilingual-e5-small | 0.821 | 0.822 | 0.819 | 0.815 | 0.804 | 0.807 | 0.817 | 0.815 |
@misc{pletenev2025truetomorrowmultilingualevergreen,
title={Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA},
author={Sergey Pletenev and Maria Marina and Nikolay Ivanov and Daria Galimzianova and Nikita Krayko and Mikhail Salnikov and Vasily Konovalov and Alexander Panchenko and Viktor Moskvoretskii},
year={2025},
eprint={2505.21115},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.21115},
}