Will It Still Be True Tomorrow?
Multilingual Evergreen Question Classification to Improve Trustworthy QA

Sergey Pletenev*,2,1, Maria Marina*,2,1, Nikolay Ivanov1, Daria Galimzianova4, Nikita Krayko4,
Mikhail Salnikov2,1, Vasily Konovalov2,5, Alexander Panchenko1,2, Viktor Moskvoretskii1,3

1Skoltech, 2AIRI, 3HSE University, 4MTS AI, 5MIPT

*Indicates Equal Contribution

Evergreen vs. Mutable Question Classification

🌱 Understanding Evergreen vs Mutable

Evergreen Questions

Answers remain stable over time

"What is the chemical symbol for oxygen?"

Answer: O (will never change)

"Who wrote Romeo and Juliet?"

Answer: William Shakespeare (historical fact)

"What is the largest planet in our solar system?"

Answer: Jupiter (astronomical fact)

Mutable Questions

Answers change over time

"Who is the current President of the United States?"

Changes every 4-8 years

"What is the world's tallest building?"

Changes as new buildings are constructed

"How many people live in Tokyo?"

Population changes annually
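The distinction above can be made concrete with a toy heuristic. Note this is purely illustrative and is not the paper's method (EG-E5 is a fine-tuned multilingual E5 classifier); the cue words below are assumptions chosen for the sketch:

```python
# Illustrative heuristic only: the paper's EG-E5 classifier is a trained
# model; this sketch just shows the shape of the classification task.
MUTABLE_CUES = (
    "current", "currently", "latest", "today",
    "this year", "so far", "right now",
)

def looks_mutable(question: str) -> bool:
    """Flag questions whose phrasing signals a time-dependent answer."""
    q = question.lower()
    return any(cue in q for cue in MUTABLE_CUES)

print(looks_mutable("Who is the current President of the United States?"))  # True
print(looks_mutable("Who wrote Romeo and Juliet?"))  # False
```

Surface cues like these miss implicitly mutable questions ("How many people live in Tokyo?" contains no temporal keyword), which is why a learned multilingual classifier is needed.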

Abstract

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o's retrieval behavior.

🔬 Key Findings

LLMs Show Limited Evergreen Awareness

Even state-of-the-art models like GPT-4 struggle with explicit evergreen classification. Our EG-E5 classifier significantly outperforms all tested LLMs, achieving 91% F1 vs 81-88% for the best LLMs.

GPT-4o's Retrieval Explained

Evergreen-ness is the strongest predictor of GPT-4o's retrieval behavior, more than twice as informative as uncertainty measures, suggesting retrieval is closely tied to question temporality.

QA Datasets Need Filtering

Popular QA benchmarks contain 6-18% mutable questions with outdated answers. Models perform up to 40% better on evergreen questions, highlighting the need for temporal filtering.

Improved Self-Knowledge

Adding evergreen probability as a feature consistently improves self-knowledge estimation across 16 out of 18 evaluation settings, enhancing model trustworthiness.

📊 EverGreenQA Dataset

Dataset Statistics

  • 4,757 total questions across 7 languages
  • 3,487 training examples
  • 1,270 test examples
  • Real user queries from an AI chat assistant
  • Professional validation by trained linguists

Languages Covered

🇺🇸 English 🇷🇺 Russian 🇫🇷 French 🇩🇪 German 🇮🇱 Hebrew 🇸🇦 Arabic 🇨🇳 Chinese

🚀 Real-World Applications

1. Enhanced Self-Knowledge Estimation

Improve LLM trustworthiness by helping models better understand when they know or don't know the answer to a question.

Impact: Achieved best results in 16 out of 18 evaluation settings across multiple QA datasets
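One way to picture this use case: treat the question's evergreen probability as an extra feature alongside the model's own confidence when estimating whether it knows the answer. A minimal sketch with a hand-set logistic combination; the weights, bias, and function name are illustrative assumptions, not fitted values from the paper:

```python
import math

def answerability_score(confidence: float, p_evergreen: float,
                        w_conf: float = 2.0, w_eg: float = 1.0,
                        bias: float = -1.5) -> float:
    """Combine a model's confidence with the question's evergreen
    probability into a single "I know this" score in (0, 1).
    Weights here are illustrative, not the paper's fitted values."""
    z = w_conf * confidence + w_eg * p_evergreen + bias
    return 1.0 / (1.0 + math.exp(-z))

# The same confidence is worth more on an evergreen question than on a
# mutable one, whose stored answer may have silently gone stale.
print(answerability_score(0.9, 0.95) > answerability_score(0.9, 0.05))  # True
```

The intuition: even a confidently memorized answer to a mutable question may be outdated, so evergreen-ness should modulate how much the confidence signal is trusted.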

2. QA Dataset Curation & Fair Evaluation

Automatically filter out questions with time-sensitive answers to create more reliable benchmarks.

Discovery: Popular datasets like Natural Questions contain up to 18% mutable questions with outdated answers
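In practice, such curation reduces to thresholding the classifier's predicted evergreen probability. A minimal sketch, assuming each example carries a `p_evergreen` field; the field name and the 0.5 threshold are assumptions for illustration:

```python
def filter_evergreen(examples, threshold=0.5):
    """Keep only QA pairs whose predicted evergreen probability
    clears the threshold; field names are illustrative."""
    return [ex for ex in examples if ex["p_evergreen"] >= threshold]

dataset = [
    {"question": "Who wrote Romeo and Juliet?", "p_evergreen": 0.98},
    {"question": "What is the world's tallest building?", "p_evergreen": 0.12},
]
print([ex["question"] for ex in filter_evergreen(dataset)])
# ['Who wrote Romeo and Juliet?']
```

Raising the threshold trades benchmark size for temporal reliability; the mutable questions removed are exactly those whose gold answers are most likely to be stale.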

3. Explaining Black-Box AI Behavior

Understand and predict when advanced models like GPT-4o decide to search for external information.

Insight: Evergreen-ness is 2x more predictive of GPT-4o's retrieval decisions than uncertainty measures

💡 Research Contributions

First of Its Kind

  • First multilingual evergreen-aware QA dataset
  • First comprehensive evaluation of 12 LLMs on question temporality
  • Novel lightweight classifier (EG-E5) achieving SoTA performance
  • First systematic analysis of uncertainty-temporality correlation

Practical Impact

  • Improved trustworthiness in AI systems
  • Fairer evaluation methodologies for QA benchmarks
  • Better understanding of retrieval-augmented systems
  • Open-source tools for the research community

📈 Model Performance

Model Overall F1 English Russian French German Hebrew Arabic Chinese
multilingual-e5-large-instruct 0.910 0.913 0.909 0.910 0.904 0.900 0.897 0.906
bert-base-multilingual-cased 0.893 0.900 0.889 0.884 0.889 0.883 0.902 0.891
mdeberta-v3-base 0.836 0.842 0.845 0.841 0.832 0.825 0.831 0.836
multilingual-e5-small 0.821 0.822 0.819 0.815 0.804 0.807 0.817 0.815

Citation

@misc{pletenev2025truetomorrowmultilingualevergreen,
    title={Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA}, 
    author={Sergey Pletenev and Maria Marina and Nikolay Ivanov and Daria Galimzianova and Nikita Krayko and Mikhail Salnikov and Vasily Konovalov and Alexander Panchenko and Viktor Moskvoretskii},
    year={2025},
    eprint={2505.21115},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.21115}, 
}