The lack of high-quality training data remains a significant challenge in NLP. Manual annotation methods, such as crowdsourcing, are costly, require intricate task design skills, and, if used incorrectly, may result in poor data quality. On the other hand, LLMs have demonstrated proficiency in many NLP tasks, including zero-shot and few-shot data annotation. However, they often struggle with text detoxification due to alignment constraints and fail to generate the required detoxified text. This work explores the potential of modern open-source LLMs to annotate parallel data for text detoxification. Using the recent technique of activation patching, we generate a pseudo-parallel detoxification dataset based on ParaDetox. The detoxification model trained on our generated data shows comparable performance to the original dataset in automatic detoxification evaluation metrics and superior quality in manual evaluation and side-by-side comparisons.
This work explores the potential of modern open-source LLMs to annotate parallel data for text detoxification. To achieve this, we use the activation patching technique to generate a pseudo-parallel detoxification dataset based on ParaDetox. First, for each layer \( l \) of the model, we compute a candidate refusal direction as the difference of mean activations on harmful and harmless instructions:
\[ r_{l} = a^{\text{harmful}}_{l} - a^{\text{harmless}}_{l} \]
where \( a^{\text{harmful}}_{l} \) and \( a^{\text{harmless}}_{l} \) are the averages of the residual stream activations at the last token position at layer \( l \) of the model for harmful and harmless instructions, respectively. We then normalize the difference vectors and select the "best" refusal direction \( \hat{r}_{\text{best}} \) by evaluating each \( \hat{r}_{l} \) on a separate set of harmful instructions. Finally, we modify the weight matrices of the model directly using the following formula:\[ \tilde{W}_{\text{out}} = W_{\text{out}} - \hat{r}_{\text{best}}\hat{r}_{\text{best}}^{\operatorname{T}}W_{\text{out}}. \]
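The two steps above can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the array shapes and the choice of "best" layer are assumptions, and in practice the activations come from a transformer's residual stream while the layer selection is done on a held-out set of harmful instructions.

```python
import numpy as np

def refusal_directions(acts_harmful, acts_harmless):
    """Per-layer difference-of-means directions, unit-normalized.

    acts_*: arrays of shape (n_layers, n_prompts, d_model) holding
    residual-stream activations at the last token position.
    """
    # r_l = mean harmful activation - mean harmless activation, per layer
    r = acts_harmful.mean(axis=1) - acts_harmless.mean(axis=1)  # (n_layers, d_model)
    return r / np.linalg.norm(r, axis=-1, keepdims=True)

def orthogonalize(W_out, r_best):
    """W_out - r r^T W_out: project the refusal direction out of the
    column space of an output weight matrix (r is a unit vector)."""
    r = r_best.reshape(-1, 1)            # (d_model, 1)
    return W_out - r @ (r.T @ W_out)

# Toy usage with random data (4 layers, 8 prompts, d_model = 16).
rng = np.random.default_rng(0)
acts_h = rng.normal(size=(4, 8, 16))
acts_s = rng.normal(size=(4, 8, 16))
r_hat = refusal_directions(acts_h, acts_s)
r_best = r_hat[2]                        # layer choice would come from held-out eval
W_tilde = orthogonalize(rng.normal(size=(16, 16)), r_best)
```

After the edit, \( \hat{r}_{\text{best}} \) has no component left in the modified matrix: `r_best @ W_tilde` is (numerically) zero, which is what prevents the model from writing the refusal direction into the residual stream.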
Generated detoxifications are evaluated with the joint metric \( \mathbf{J} \), which averages the per-example product of style transfer accuracy (\( \mathbf{STA} \)), content similarity (\( \mathbf{SIM} \)), and fluency (\( \mathbf{FL} \)): \[ \mathbf{J} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{STA}(x_i)\,\mathbf{SIM}(x_i, y_i)\,\mathbf{FL}(x_i). \]
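The aggregation itself is a one-liner; a minimal sketch follows. The per-example scores here are placeholders: in real evaluation, STA comes from a toxicity classifier, SIM from an embedding-similarity model, and FL from a fluency/acceptability model, each mapped to \([0, 1]\).

```python
import numpy as np

def joint_score(sta, sim, fl):
    """J = mean over examples of STA * SIM * FL, each score in [0, 1]."""
    sta, sim, fl = map(np.asarray, (sta, sim, fl))
    return float(np.mean(sta * sim * fl))

# Toy usage with made-up per-example scores for two outputs:
# (1.0 * 0.9 * 1.0 + 0.8 * 0.95 * 0.5) / 2 = 0.64
j = joint_score([1.0, 0.8], [0.9, 0.95], [1.0, 0.5])
```

Because the three scores are multiplied per example before averaging, an output that fails any single criterion (e.g. fluent and similar but still toxic) contributes close to zero, which is the point of the joint metric.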
@inproceedings{moskovskiy-etal-2024-llms,
title = "{LLM}s to Replace Crowdsourcing For Parallel Data Creation? The Case of Text Detoxification",
author = "Moskovskiy, Daniil and
Pletenev, Sergey and
Panchenko, Alexander",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.839",
pages = "14361--14373",
}