Mitigating Text Toxicity with Counterfactual Generation

Milan Bhan, Jean-Noël Vittaut, Nina Achache, Victor Legrand, Annabelle Blangero, Nicolas Chesneau, Juliette Murris, Marie-Jeanne Lesot

January 2026

Abstract

Toxicity mitigation consists of rephrasing text to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while also preserving its initial non-toxic meaning. In this work, we propose to apply eXplainable AI (XAI) methods to both target and mitigate textual toxicity. We propose CF-Detox$_{\text{tigtec}}$ to perform text detoxification by applying local feature importance, counterfactual example generation and counterfactual feature importance methods to a toxicity classifier that distinguishes between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators lead to competitive results in toxicity mitigation. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical applications of XAI methods.
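To make the general idea concrete, the following is a minimal, purely illustrative sketch of detoxification by counterfactual editing: a classifier scores the text, a local feature importance step finds the token that contributes most to the toxic prediction, and that token is replaced until the prediction flips. It is not the method evaluated in the paper. The lexicon `TOXIC_WEIGHTS`, the table `SUBSTITUTES`, the leave-one-out importance and the function names are all assumptions introduced here; a real system would use a trained toxicity classifier and a language model to propose replacements.

```python
from math import prod

# Hypothetical toy lexicon standing in for a trained toxicity classifier.
TOXIC_WEIGHTS = {"idiot": 0.9, "stupid": 0.8, "hate": 0.6}
# Hypothetical replacement table standing in for a masked language model.
SUBSTITUTES = {"idiot": "person", "stupid": "confused", "hate": "dislike"}


def toxicity_score(tokens):
    """Toy classifier: noisy-OR of per-token toxic weights, in [0, 1]."""
    return 1.0 - prod(1.0 - TOXIC_WEIGHTS.get(t.lower(), 0.0) for t in tokens)


def token_importance(tokens):
    """Leave-one-out importance: drop in toxicity when a token is removed."""
    base = toxicity_score(tokens)
    return [
        base - toxicity_score(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


def detoxify(text, threshold=0.5, max_edits=5):
    """Replace the most important tokens until the text is scored non-toxic."""
    tokens = text.split()
    for _ in range(max_edits):
        if toxicity_score(tokens) < threshold:
            break  # the classifier's decision has flipped: counterfactual found
        scores = token_importance(tokens)
        target = max(range(len(tokens)), key=scores.__getitem__)
        replacement = SUBSTITUTES.get(tokens[target].lower())
        if replacement is None:
            break  # no candidate replacement available in this toy table
        tokens[target] = replacement
    return " ".join(tokens)


if __name__ == "__main__":
    print(detoxify("you are a stupid idiot"))
```

Under these toy assumptions, only the tokens driving the toxic prediction are edited, so the remaining words, and with them most of the non-toxic meaning, are left untouched.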

Type

Paper-Conference

Publication

Explainable Artificial Intelligence