The lack of high-quality training data remains a significant challenge in NLP. Manual annotation methods, such as crowdsourcing, are costly, require intricate task design skills, and, if used incorrectly, may result in poor data quality. On the other hand, LLMs have demonstrated proficiency in many NLP tasks, including zero-shot and few-shot data annotation. However, they often struggle with text detoxification due to alignment constraints and fail to generate the required detoxified text. This work explores the potential of modern open-source LLMs to annotate parallel data for text detoxification. Using the recent technique of activation patching, we generate a pseudo-parallel detoxification dataset based on ParaDetox. The detoxification model trained on our generated data shows comparable performance to the original dataset in automatic detoxification evaluation metrics and superior quality in manual evaluation and side-by-side comparisons.
This work explores the potential of modern open-source LLMs to annotate parallel data for text detoxification. To achieve this, we use the activation patching technique to generate a pseudo-parallel detoxification dataset based on ParaDetox. First, for each layer \( l \) of the model, we compute a candidate refusal direction as the difference of mean activations on harmful and harmless instructions:
\[ r_{l} = a^{\text{harmful}}_{l} - a^{\text{harmless}}_{l} \]
where \( a^{\text{harmful}}_{l} \) and \( a^{\text{harmless}}_{l} \) are the averages of the residual stream activations at the last token position at layer \( l \) of the model for harmful and harmless instructions, respectively. We then normalize the difference vectors and select the "best" refusal direction \( \hat{r}_{\text{best}} \) by evaluating each \( \hat{r}_{l} \) on a separate set of harmful instructions. Finally, we modify the weight matrices of the model directly using the following formula:\[ \tilde{W}_{\text{out}} = W_{\text{out}} - \hat{r}_{\text{best}}\hat{r}_{\text{best}}^{\operatorname{T}}W_{\text{out}}. \]
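The two steps above can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the array shapes and the choice of "best" layer are assumptions, and in practice the activations come from a transformer's residual stream while the layer selection is done on a held-out set of harmful instructions.

```python
import numpy as np

def refusal_directions(acts_harmful, acts_harmless):
    """Per-layer difference-of-means directions, unit-normalized.

    acts_*: arrays of shape (n_layers, n_prompts, d_model) holding
    residual-stream activations at the last token position.
    """
    # r_l = mean harmful activation - mean harmless activation, per layer
    r = acts_harmful.mean(axis=1) - acts_harmless.mean(axis=1)  # (n_layers, d_model)
    return r / np.linalg.norm(r, axis=-1, keepdims=True)

def orthogonalize(W_out, r_best):
    """W_out - r r^T W_out: project the refusal direction out of the
    column space of an output weight matrix (r is a unit vector)."""
    r = r_best.reshape(-1, 1)            # (d_model, 1)
    return W_out - r @ (r.T @ W_out)

# Toy usage with random data (4 layers, 8 prompts, d_model = 16).
rng = np.random.default_rng(0)
acts_h = rng.normal(size=(4, 8, 16))
acts_s = rng.normal(size=(4, 8, 16))
r_hat = refusal_directions(acts_h, acts_s)
r_best = r_hat[2]                        # layer choice would come from held-out eval
W_tilde = orthogonalize(rng.normal(size=(16, 16)), r_best)
```

After the edit, \( \hat{r}_{\text{best}} \) has no component left in the modified matrix: `r_best @ W_tilde` is (numerically) zero, which is what prevents the model from writing the refusal direction into the residual stream.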
Generated detoxifications are evaluated with the joint metric \( \mathbf{J} \), which averages the per-example product of style transfer accuracy (\( \mathbf{STA} \)), content similarity (\( \mathbf{SIM} \)), and fluency (\( \mathbf{FL} \)): \[ \mathbf{J} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{STA}(x_i)\,\mathbf{SIM}(x_i, y_i)\,\mathbf{FL}(x_i). \]
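The aggregation itself is a one-liner; a minimal sketch follows. The per-example scores here are placeholders: in real evaluation, STA comes from a toxicity classifier, SIM from an embedding-similarity model, and FL from a fluency/acceptability model, each mapped to \([0, 1]\).

```python
import numpy as np

def joint_score(sta, sim, fl):
    """J = mean over examples of STA * SIM * FL, each score in [0, 1]."""
    sta, sim, fl = map(np.asarray, (sta, sim, fl))
    return float(np.mean(sta * sim * fl))

# Toy usage with made-up per-example scores for two outputs:
# (1.0 * 0.9 * 1.0 + 0.8 * 0.95 * 0.5) / 2 = 0.64
j = joint_score([1.0, 0.8], [0.9, 0.95], [1.0, 0.5])
```

Because the three scores are multiplied per example before averaging, an output that fails any single criterion (e.g. fluent and similar but still toxic) contributes close to zero, which is the point of the joint metric.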
@inproceedings{moskovskiy-etal-2024-llms,
title = "{LLM}s to Replace Crowdsourcing For Parallel Data Creation? The Case of Text Detoxification",
author = "Moskovskiy, Daniil and
Pletenev, Sergey and
Panchenko, Alexander",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.839",
pages = "14361--14373",
}