SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Skoltech, AIRI, HSE, ISP RAS Research Center for Trusted Artificial Intelligence
NAACL Main 2025

*Equal Contribution
SynthDetoxM Data Generation Pipeline.

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for cross-lingual parallel detoxification data generation. We also present SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on our data achieve superior performance to those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.

Number of Accepted Samples by LLM and Language.

Number of accepted samples into the final SynthDetoxM dataset with respect to the LLM, by language.

Toxicity Distribution Histograms.

Toxicity scores of the examples in the dataset: original toxic texts in orange, detoxified texts in blue. Gaussian smoothing is applied for readability.

Methodology

This work introduces a pipeline for cross-lingual parallel detoxification data generation. The approach leverages large language models (LLMs) to create synthetic data, addressing the lack of parallel multilingual detoxification datasets. The methodology involves the following steps:

  1. Data Collection: Toxic texts were manually collected from publicly available datasets in German, French, Spanish, and Russian. Only samples marked as toxic by human annotators were selected. Filtering based on Perspective-API-based Style Transfer Accuracy (STA) and LaBSE-based Similarity (SIM) metrics was used to further enhance data quality.
  2. Parallel Data Generation: Several open-source LLMs were used in a few-shot setting to generate detoxified versions of the collected toxic texts. The models used were: Qwen 2.5 32B, Command-R 32B, Gemma 2 27B, Aya Expanse (32B and 8B), Mistral Small 22B, Mistral Nemo 12B, and Llama 3.1 (70B and 8B).
  3. Few-Shot Example Mining: The best toxic/non-toxic pairs for few-shot prompting were selected by calculating a score based on STA and SIM metrics from a multilingual toxicity detection dataset. For French, 10 sentences were manually detoxified due to a lack of representation in the existing dataset.
  4. Filtering and Ranking: Generated detoxifications were filtered using a refusal classification model and a threshold-based non-detoxifiability metric. The remaining detoxifications were ranked by the product of their STA and SIM metrics, and the top-scoring examples were selected.
  5. Dataset Composition: The final dataset, SynthDetoxM, consists of 16,000 parallel toxic/non-toxic text pairs across Spanish, German, Russian, and French (4,000 per language).
  6. Evaluation: The quality of SynthDetoxM was evaluated by training sequence-to-sequence models (mT0-XL) on different folds of the dataset and comparing their performance to models trained on the human-annotated MultiParaDetox dataset. The evaluation used the metrics defined in the MultiParaDetox shared task:
    • Style Transfer Accuracy (STA): Uses a multilingual XLM-R text classification model to measure toxicity reduction.
    • Content Similarity (SIM): Calculates the cosine similarity between LaBSE embeddings of the source and generated texts.
    • Fluency (FL): Uses ChrF1 score (though limitations of this metric are discussed in Appendix B of the paper).
    • Joint Score (J): Combines STA, SIM, and ChrF1 into a single score: \[\textbf{J} = \frac{1}{n}\sum\limits_{i=1}^{n}\textbf{STA}(y_i) \cdot \textbf{SIM}(x_i,y_i) \cdot \textbf{ChrF1}(x_i, y_i)\]
  7. Side-by-Side (SBS) Evaluation: To further assess the applicability of the proposed dataset for training detoxification models, a side-by-side evaluation was carried out using GPT-4o as a judge.
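Step 2 above prompts each LLM with toxic/detoxified example pairs before the target text. A minimal sketch of such a prompt builder is shown below; the instruction wording, `build_fewshot_prompt` name, and example pairs are illustrative assumptions, not the authors' actual prompt.

```python
def build_fewshot_prompt(toxic_text: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot detoxification prompt from (toxic, detoxified) pairs.

    The template text is a placeholder; the paper's real prompts are per-language.
    """
    lines = ["Rewrite the toxic text in a neutral way, preserving its meaning.", ""]
    for toxic, detoxified in examples:
        lines.append(f"Toxic: {toxic}")
        lines.append(f"Detoxified: {detoxified}")
        lines.append("")
    # The target text comes last; the model completes the final "Detoxified:" line.
    lines.append(f"Toxic: {toxic_text}")
    lines.append("Detoxified:")
    return "\n".join(lines)
```

The resulting string can be sent to any of the listed open-source LLMs via a standard chat or completion API.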
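The ranking in step 4 and the Joint score in step 6 can be sketched in a few lines. This is a simplified illustration assuming STA, SIM, and ChrF1 are already computed per sample as floats in [0, 1]; the function names and the threshold value are hypothetical, and real STA/SIM scoring requires the Perspective API (or an XLM-R classifier) and LaBSE embeddings.

```python
def rank_candidates(candidates: list[tuple[str, float, float]],
                    threshold: float = 0.0) -> list[str]:
    """Rank candidate detoxifications by STA * SIM (step 4), dropping any
    whose score falls at or below a non-detoxifiability threshold."""
    scored = [(sta * sim, text) for text, sta, sim in candidates
              if sta * sim > threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]

def joint_score(sta: list[float], sim: list[float], chrf1: list[float]) -> float:
    """J = (1/n) * sum_i STA(y_i) * SIM(x_i, y_i) * ChrF1(x_i, y_i)  (step 6)."""
    assert len(sta) == len(sim) == len(chrf1)
    return sum(a * b * c for a, b, c in zip(sta, sim, chrf1)) / len(sta)
```

For example, `joint_score([1.0, 0.5], [0.8, 0.9], [0.7, 0.6])` averages the per-sample products 0.56 and 0.27, giving J = 0.415.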

Results

Side-by-side comparisons between the final models

Side-by-side comparison between the final models across all languages.

Side-by-side comparisons in German

Side-by-side comparison between the final models in German.

Side-by-side comparisons in Russian

Side-by-side comparison between the final models in Russian.

Side-by-side comparisons in Spanish

Side-by-side comparison between the final models in Spanish.

The table below presents the Joint (J) scores from the automatic evaluation of different multilingual text detoxification approaches. The models were evaluated on Spanish, German, and Russian using the test set from MultiParaDetox.

Approach               Spanish  German  Russian
Human References         0.709   0.733    0.732
Baselines
  Duplicate              0.090   0.287    0.048
  Delete                 0.319   0.362    0.255
  Backtranslation        0.275   0.233    0.223
Supervised Approaches
  MultiParaDetox         0.344   0.446    0.472
  SynthDetoxM (Batch)    0.402   0.460    0.475
  SynthDetoxM (Full)     0.470   0.482    0.546
LLM-based Approaches
  Gemma 2                0.380   0.353    0.404
  Mistral Nemo           0.290   0.286    0.258
  Command R              0.344   0.328    0.402
  Qwen 2.5               0.443   0.402    0.428
  Llama 3.1 8B           0.341   0.394    0.357
  Aya Expanse 8B         0.246   0.305    0.225
  Aya Expanse 32B        0.320   0.399    0.323
  Mistral Small          0.308   0.371    0.273

Key findings:

  • Models trained on SynthDetoxM outperform those trained on the human-annotated MultiParaDetox dataset in terms of the J score, even when using a comparable amount of data (SynthDetoxM (Batch)).
  • Training on the full SynthDetoxM dataset yields the highest J scores across all languages.
  • Adding human-sourced samples to the training data reduces J scores across all languages.
  • The model trained on the full SynthDetoxM dataset also outperforms all LLM baselines.

BibTeX

@misc{moskovskiy2025synthdetoxmmodernllmsfewshot,
  title={SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators},
  author={Daniil Moskovskiy and Nikita Sushko and Sergey Pletenev and Elena Tutubalina and Alexander Panchenko},
  year={2025},
  eprint={2502.06394},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.06394},
}