Recent years have seen increasing interest in assessing the factuality of Large Language Models' (LLMs) long-form generations, driven by the severe problem of factual hallucinations. A prominent approach to this challenge is the extract-then-verify framework, which decomposes long-form responses into a set of atomic facts and verifies each against external evidence. Among these methods, FActScore stands out as a foundational and widely adopted one. However, the community still lacks a unified open-access, up-to-date, and cost-effective factuality evaluation tool, since FActScore relies on an outdated Wikipedia dump and on proprietary LLMs available only via a paid API. To fill this gap, we propose FactOWL, a FActScore-based Factuality evaluation tool that adopts an Open LLM and real-time Wikipedia search for the evaluation of Long-form LLM responses. The proposed tool addresses the problems of obsolete knowledge, incomplete contexts, and entity ambiguity by performing multi-page context aggregation and supporting additional sources, e.g., passages generated from Wikidata triples, while achieving roughly a 10x speed-up over FActScore. Experiments on FActScore's manually annotated data indicate that FactOWL's scores are close to human-judged factuality scores. Our tool is freely available at https://github.com/s-nlp/factowl under the MIT license and can be installed via `pip install factowl`.
| Aspect | FActScore | FactOWL |
|---|---|---|
| Fact Extraction | InstructGPT (closed-weights, deprecated) | LLaMA-3-8B-Instruct (open-weights) |
| Verification Model | LLaMA-1-Instruct (older) | LLaMA-3-8B-Instruct |
| Knowledge Source | Static Wikipedia dump (2023) | Live Wikipedia API + optional Wikidata |
| Entity Disambiguation | One-to-one Wikipedia page lookup | Multi-page retrieval + reranking |
| Pipeline Complexity | Slow, multi-model | Fast, unified model with vLLM |
| Cost | Paid, proprietary APIs (InstructGPT) | Free, open-access |
| Context Aggregation | Single-page top-k paragraphs | Multi-page, multi-passage aggregation |
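The abstract notes that the tool is distributed on PyPI (`pip install factowl`). The sketch below illustrates how such a scorer might be invoked; the import path, class name, constructor arguments, and result fields are assumptions for illustration rather than the package's documented API, so consult the repository README for the actual interface.

```python
# Hypothetical usage sketch: the import path, class name, arguments, and
# result fields are illustrative assumptions; see
# https://github.com/s-nlp/factowl for the real interface.
from factowl import FactOWL  # assumed entry point

scorer = FactOWL(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # single open model for extraction and verification
    knowledge_source="wikipedia-live",            # live Wikipedia search instead of a static dump
    num_pages=5,                                  # multi-page retrieval for ambiguous entities
    num_passages=10,                              # passages aggregated into the evidence context
)

result = scorer.score(
    topic="William Post (lottery winner)",
    generation="William Post was an American lottery winner who won $16.2 million...",
)
print(result["precision"], result["num_facts"])
```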
FactOWL decomposes factual precision evaluation into two steps: (i) splitting an input long-form generation into a set of short atomic facts, followed by (ii) verifying each fact against retrieved evidence. It uses a single open-source Llama-3-8B-Instruct model for both fact extraction and verification, replacing a two-model pipeline that relied on proprietary and older models. This simplifies the process and, combined with vLLM integration, significantly speeds up inference.
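A minimal sketch of this two-step loop on top of vLLM's offline inference API is shown below; the prompts are simplified illustrations rather than FactOWL's actual templates, and the final score is the standard FActScore-style precision (supported facts divided by extracted facts).

```python
# Sketch of the extract-then-verify loop with a single open model served by
# vLLM. Prompts are simplified; FactOWL's real templates may differ.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=512)

def extract_atomic_facts(response_text: str) -> list[str]:
    """Step (i): split a long-form response into short, self-contained facts."""
    prompt = (
        "Break the following passage into a list of independent atomic facts, "
        "one per line:\n\n" + response_text + "\n\nFacts:\n"
    )
    out = llm.generate([prompt], params)[0].outputs[0].text
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def verify_fact(fact: str, context: str) -> bool:
    """Step (ii): judge a single atomic fact against retrieved evidence."""
    prompt = (
        f"Context:\n{context}\n\nClaim: {fact}\n"
        "Is the claim supported by the context? Answer True or False:"
    )
    out = llm.generate([prompt], params)[0].outputs[0].text
    return out.strip().lower().startswith("true")

def factual_precision(response_text: str, context: str) -> float:
    """Fraction of extracted atomic facts that are supported by the evidence."""
    facts = extract_atomic_facts(response_text)
    if not facts:
        return 0.0
    supported = sum(verify_fact(fact, context) for fact in facts)
    return supported / len(facts)
```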
A key innovation in FactOWL is its handling of knowledge sources. While compatible with static Wikipedia dumps, it primarily uses a live Wikipedia search API to avoid outdated information. For a given topic, it retrieves multiple pages and aggregates content from several relevant sources, which mitigates entity ambiguity, as the example below illustrates. This multi-page context aggregation is a significant improvement over single-page lookup methods. Furthermore, FactOWL can incorporate additional knowledge from external sources such as Wikidata.
| Topic | LLM Generation | Wikipedia search |
|---|---|---|
| William Post | William Post (1949-1986) was an American lottery winner who won $16.2 million in the Pennsylvania Lottery in 1988... | 1. [C. W. Post] Charles William Post (October 26, 1854 – May 9, 1914) was an American innovator, breakfast cereal and ... 2. [William Post (businessman)] William Post (June 27, 1927 – February 10, 2024) was an American businessman and inventor ... 3. [William Post (lottery winner)] William "Bud" Post III (April 5, 1939 – January 15, 2006) was the winner of a Pennsylvania Lottery jackpot worth $16.2 million... |
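The retrieval behind this example can be approximated with the public MediaWiki API, as in the sketch below: a search query returns several candidate pages, and leading passages from each are aggregated into a single evidence context. FactOWL additionally reranks retrieved passages, which is omitted here, and the helper names are illustrative.

```python
# Hedged sketch of live multi-page retrieval over the MediaWiki API; passage
# reranking is omitted and the helper names are illustrative.
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def search_pages(topic: str, num_pages: int = 5) -> list[str]:
    """Return titles of the top Wikipedia pages matching the topic."""
    resp = requests.get(WIKI_API, params={
        "action": "query", "list": "search", "srsearch": topic,
        "srlimit": num_pages, "format": "json",
    })
    return [hit["title"] for hit in resp.json()["query"]["search"]]

def get_plain_text(title: str) -> str:
    """Fetch the plain-text extract of a single page."""
    resp = requests.get(WIKI_API, params={
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "format": "json",
    })
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def build_context(topic: str, num_pages: int = 5, passages_per_page: int = 2) -> str:
    """Aggregate leading passages from several candidate pages into one context."""
    chunks = []
    for title in search_pages(topic, num_pages):
        paragraphs = [p for p in get_plain_text(title).split("\n") if p.strip()]
        chunks.append(f"[{title}] " + " ".join(paragraphs[:passages_per_page]))
    return "\n\n".join(chunks)

# build_context("William Post") would include the lottery-winner page alongside
# the other namesakes, so verification is not misled by a single wrong page.
```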
FactOWL's automatic factual precision scores are close to human evaluation for InstructGPT and ChatGPT generations. While FactOWL shows a smaller gap to human evaluation than the original FActScore in end-to-end evaluation of InstructGPT and ChatGPT, the gap widens for PerplexityAI generations. Adding supporting evidence from multiple pages or Wikidata consistently improves precision scores.
| Statistics | InstructGPT | ChatGPT | PerplexityAI |
|---|---|---|---|
| **FActScore** | | | |
| Avg. facts per response | 26 | 34.6 | 35.46 |
| Inference time | ~2h | ~2h 10m | ~2h 30m |
| **FactOWL** | | | |
| Avg. facts per response | 20.6 | 31.8 | 38.6 |
| Inference time | 24m 30s | 24m 6s | 27m 36s |
FactOWL's factual precision (P) scores on InstructGPT, ChatGPT, and PerplexityAI generations, evaluated against human annotations from FActScore. Results include (i) entity-averaged end-to-end fact generation and verification metrics and (ii) atomic fact-level metrics computed on FActScore's extracted facts. Error Rate (ER) denotes the absolute difference from the human-judged score.
| Model | InstructGPT | | ChatGPT | | PerplexityAI | |
|---|---|---|---|---|---|---|
| | P | ER | P | ER | P | ER |
| *Entity-level end-to-end Fact Generation & Verification* | | | | | | |
| Human evaluation | 42.5 | --- | 58.3 | --- | 71.5 | --- |
| Original FActScore | 41.1 | 1.4 | 58.7 | 0.4 | 71.6 | 0.1 |
| FactOWL (Llama-inst, 1 page, Wikipedia dump) | 45.4 | 2.9 | 58.9 | 0.6 | 58.5 | 13.0 |
| FactOWL (Llama-3-8b-inst + Llama-1, 1 page, Wikipedia dump) | 43.7 | 1.2 | 53.7 | 4.6 | 58.4 | 13.1 |
| FactOWL (1 page, 2023 Wikipedia dump) | 43.6 | 1.1 | 59.4 | 1.1 | 64.1 | 7.4 |
| FactOWL (1 page, 5 passages) | 40.8 | 1.7 | 56.4 | 1.9 | 62.2 | 9.3 |
| FactOWL (5 pages, 10 passages) | 43.0 | 0.5 | 58.4 | 0.1 | 67.9 | 3.6 |
| FactOWL (5 pages + Wikidata) | 44.0 | 1.5 | 60.7 | 2.7 | 66.8 | 4.7 |
| *Claim-level Atomic Fact Verification* | | | | | | |
| Human evaluation | 44.4 | --- | 58.9 | --- | 81.7 | --- |
| FactOWL (1 page, 5 passages, 2023 Wikipedia dump) | 46.1 | 1.7 | 62.5 | 3.6 | 72.6 | 9.1 |
| FactOWL (1 page, 5 passages) | 43.7 | 0.7 | 57.9 | 1.0 | 71.3 | 10.4 |
| FactOWL (5 pages, 10 passages) | 46.8 | 2.4 | 61.9 | 3.0 | 74.8 | 6.9 |
| FactOWL (5 pages + Wikidata) | 48.1 | 3.7 | 62.8 | 3.9 | 74.8 | 6.9 |
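To make the two reporting levels and ER concrete, the toy functions below compute entity-averaged precision, claim-level precision, and ER as the absolute gap to the human-judged score; treating the claim-level metric as precision pooled over all atomic facts is an assumption based on the caption above.

```python
# Toy illustration of the metrics reported in the table; labels are booleans
# (True = atomic fact judged as supported by the evidence).
def entity_level_precision(per_entity_labels: list[list[bool]]) -> float:
    """Average of per-response precisions (end-to-end, entity-averaged setting)."""
    per_entity = [sum(labels) / len(labels) for labels in per_entity_labels if labels]
    return 100 * sum(per_entity) / len(per_entity)

def claim_level_precision(all_labels: list[bool]) -> float:
    """Precision pooled over all atomic facts (assumed claim-level setting)."""
    return 100 * sum(all_labels) / len(all_labels)

def error_rate(automatic_p: float, human_p: float) -> float:
    """ER = absolute difference between automatic and human-annotated precision."""
    return abs(automatic_p - human_p)

# Example with the entity-level InstructGPT numbers from the table:
# error_rate(43.0, 42.5) == 0.5 for the 5-page, 10-passage configuration.
```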
TBD