SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Skoltech, AIRI, HSE, ISP RAS Research Center for Trusted Artificial Intelligence
NAACL Main 2025

*Equal Contribution
SynthDetoxM Data Generation Pipeline.

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for cross-lingual parallel detoxification data generation. We also present SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on our data achieve superior performance to those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.

Number of Accepted Samples by LLM and Language.

Number of accepted samples into the final SynthDetoxM dataset with respect to the LLM, by language.

Toxicity Distribution Histograms.

Toxicity scores of the examples in the dataset: original toxic texts in orange, detoxified texts in blue. Gaussian smoothing is applied for readability.

Methodology

This work introduces a pipeline for cross-lingual parallel detoxification data generation. The approach leverages large language models (LLMs) to create synthetic data, addressing the lack of parallel multilingual detoxification datasets. The methodology involves the following steps:

  1. Data Collection: Toxic texts were manually collected from publicly available datasets in German, French, Spanish, and Russian. Only samples marked as toxic by human annotators were selected. Filtering based on Perspective-API-based Style Transfer Accuracy (STA) and LaBSE-based Similarity (SIM) metrics was used to further enhance data quality.
  2. Parallel Data Generation: Several open-source LLMs were used in a few-shot setting to generate detoxified versions of the collected toxic texts. The models used were: Qwen 2.5 32B, Command-R 32B, Gemma 2 27B, Aya Expanse (32B and 8B), Mistral Small 22B, Mistral Nemo 12B, and Llama 3.1 (70B and 8B).
  3. Few-Shot Example Mining: The best toxic/non-toxic pairs for few-shot prompting were selected by calculating a score based on STA and SIM metrics from a multilingual toxicity detection dataset. For French, 10 sentences were manually detoxified due to a lack of representation in the existing dataset.
  4. Filtering and Ranking: Generated detoxifications were filtered using a refusal classification model and a threshold-based non-detoxifiability metric. The remaining detoxifications were ranked by the product of their STA and SIM metrics, and the top-scoring examples were selected.
  5. Dataset Composition: The final dataset, SynthDetoxM, consists of 16,000 parallel toxic/non-toxic text pairs across Spanish, German, Russian, and French (4,000 per language).
  6. Evaluation: The quality of SynthDetoxM was evaluated by training sequence-to-sequence models (mT0-XL) on different folds of the dataset and comparing their performance to models trained on the human-annotated MultiParaDetox dataset. The evaluation used the metrics defined in the MultiParaDetox shared task:
    • Style Transfer Accuracy (STA): Uses a multilingual XLM-R text classification model to measure toxicity reduction.
    • Content Similarity (SIM): Calculates the cosine similarity between LaBSE embeddings of the source and generated texts.
    • Fluency (FL): Uses ChrF1 score (though limitations of this metric are discussed in Appendix B of the paper).
    • Joint Score (J): Combines STA, SIM, and ChrF1 into a single score: \[\textbf{J} = \frac{1}{n}\sum\limits_{i=1}^{n}\textbf{STA}(y_i) \cdot \textbf{SIM}(x_i,y_i) \cdot \textbf{ChrF1}(x_i, y_i)\]
  7. Side-by-Side (SBS) Evaluation: To further assess the applicability of the proposed dataset for training detoxification models, a side-by-side evaluation was carried out using GPT-4o as a judge.
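Step 2 above prompts each LLM with toxic/detoxified example pairs before the target text. A minimal sketch of such a prompt builder is shown below; the instruction wording, `build_fewshot_prompt` name, and example pairs are illustrative assumptions, not the authors' actual prompt.

```python
def build_fewshot_prompt(toxic_text: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot detoxification prompt from (toxic, detoxified) pairs.

    The template text is a placeholder; the paper's real prompts are per-language.
    """
    lines = ["Rewrite the toxic text in a neutral way, preserving its meaning.", ""]
    for toxic, detoxified in examples:
        lines.append(f"Toxic: {toxic}")
        lines.append(f"Detoxified: {detoxified}")
        lines.append("")
    # The target text comes last; the model completes the final "Detoxified:" line.
    lines.append(f"Toxic: {toxic_text}")
    lines.append("Detoxified:")
    return "\n".join(lines)
```

The resulting string can be sent to any of the listed open-source LLMs via a standard chat or completion API.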
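The ranking in step 4 and the Joint score in step 6 can be sketched in a few lines. This is a simplified illustration assuming STA, SIM, and ChrF1 are already computed per sample as floats in [0, 1]; the function names and the threshold value are hypothetical, and real STA/SIM scoring requires the Perspective API (or an XLM-R classifier) and LaBSE embeddings.

```python
def rank_candidates(candidates: list[tuple[str, float, float]],
                    threshold: float = 0.0) -> list[str]:
    """Rank candidate detoxifications by STA * SIM (step 4), dropping any
    whose score falls at or below a non-detoxifiability threshold."""
    scored = [(sta * sim, text) for text, sta, sim in candidates
              if sta * sim > threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]

def joint_score(sta: list[float], sim: list[float], chrf1: list[float]) -> float:
    """J = (1/n) * sum_i STA(y_i) * SIM(x_i, y_i) * ChrF1(x_i, y_i)  (step 6)."""
    assert len(sta) == len(sim) == len(chrf1)
    return sum(a * b * c for a, b, c in zip(sta, sim, chrf1)) / len(sta)
```

For example, `joint_score([1.0, 0.5], [0.8, 0.9], [0.7, 0.6])` averages the per-sample products 0.56 and 0.27, giving J = 0.415.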

Results

Side-by-side comparisons between the final models

Side-by-side comparison between the final models across all languages.

Side-by-side comparisons in German

Side-by-side comparison between the final models in German.

Side-by-side comparisons in Russian

Side-by-side comparison between the final models in Russian.

Side-by-side comparisons in Spanish

Side-by-side comparison between the final models in Spanish.

The table below presents the Joint (J) scores from the automatic evaluation of different multilingual text detoxification approaches. The models were evaluated on Spanish, German, and Russian using the test set from MultiParaDetox.

Approach               Spanish  German  Russian
Human References         0.709   0.733    0.732
Baselines
  Duplicate              0.090   0.287    0.048
  Delete                 0.319   0.362    0.255
  Backtranslation        0.275   0.233    0.223
Supervised Approaches
  MultiParaDetox         0.344   0.446    0.472
  SynthDetoxM (Batch)    0.402   0.460    0.475
  SynthDetoxM (Full)     0.470   0.482    0.546
LLM-based Approaches
  Gemma 2                0.380   0.353    0.404
  Mistral Nemo           0.290   0.286    0.258
  Command R              0.344   0.328    0.402
  Qwen 2.5               0.443   0.402    0.428
  Llama 3.1 8B           0.341   0.394    0.357
  Aya Expanse 8B         0.246   0.305    0.225
  Aya Expanse 32B        0.320   0.399    0.323
  Mistral Small          0.308   0.371    0.273

Key findings:

  • Models trained on SynthDetoxM outperform those trained on the human-annotated MultiParaDetox dataset in terms of the J score, even when using a comparable amount of data (SynthDetoxM (Batch)).
  • Training on the full SynthDetoxM dataset yields the highest J scores across all languages.
  • Adding human-sourced samples to the training data reduces J scores across all languages.
  • The model trained on the full SynthDetoxM dataset also outperforms all LLM baselines.

BibTeX

@misc{moskovskiy2025synthdetoxmmodernllmsfewshot,
  title={SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators},
  author={Daniil Moskovskiy and Nikita Sushko and Sergey Pletenev and Elena Tutubalina and Alexander Panchenko},
  year={2025},
  eprint={2502.06394},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.06394},
}