WCCM ECCOMAS 2026

Exiaa: Explainable Injections for Adversarial Attack

Pesce, Leonardo (National University of Singapore)
Wei, Jiawen (National University of Singapore)
Mengaldo, Gianmarco (National University of Singapore)

In session: MS191B - Explainable AI for the Discovery and Control of Complex Systems in Engineering and Applied Sciences II


Post-hoc interpretability methods aim to identify which input data a model deemed important in reaching a given prediction. They are increasingly used in eXplainable Artificial Intelligence (XAI) to support AI transparency, especially in critical and regulated sectors such as medicine and finance. Yet post-hoc explanations are frequently taken at face value, without further investigation into their correctness. In this work, we propose a novel black-box, model-agnostic adversarial attack designed to manipulate post-hoc XAI explanations in image classification. We demonstrate that these explanations can be significantly modified using adversarial perturbations of the input that are imperceptible to humans, relying only on the model's predicted classes and explanations. The perturbations are constructed from the post-hoc explanations themselves: they do not alter the predicted class, but they significantly alter the explanations. In contrast to previous methods, we do not require any access to the model or its weights. Additionally, the attack is accomplished in a single step, as demonstrated by empirical evaluation. The minimal requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability and robustness in safety-critical applications. We systematically generate attacks based on explanations provided by several post-hoc explainability methods for pre-trained ResNet-18 and ViT-B/16 models on the CIFAR-10 and ImageNet datasets. Results show that our attacks can lead to dramatically different explanations without changing the predictive probabilities.
We validate the effectiveness of our attack by measuring the induced change in the explanation (mean absolute difference) and the similarity between the original image and the perturbed one (Structural Similarity Index Measure, SSIM).
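The two validation metrics named above can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, array shapes, and random toy data are assumptions made for illustration only. The SSIM shown is a simplified single-window (global) variant, whereas the standard measure averages the same statistic over local sliding windows.

```python
import numpy as np


def mean_absolute_difference(expl_a, expl_b):
    """Mean absolute difference between two explanation (attribution) maps."""
    a = np.asarray(expl_a, dtype=np.float64)
    b = np.asarray(expl_b, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))


def global_ssim(img_a, img_b, data_range=1.0):
    """Simplified SSIM computed over the whole image as a single window.

    Assumption: this global form stands in for the usual windowed SSIM,
    which computes the same expression locally and averages the results.
    """
    x = np.asarray(img_a, dtype=np.float64)
    y = np.asarray(img_b, dtype=np.float64)
    # Standard stabilising constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Hypothetical toy data: a random "image", a slightly perturbed copy,
# and two unrelated random "explanation" maps.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
perturbed = np.clip(image + 0.01 * rng.standard_normal(image.shape), 0.0, 1.0)
expl_original = rng.random((32, 32))
expl_attacked = rng.random((32, 32))

explanation_change = mean_absolute_difference(expl_original, expl_attacked)
image_similarity = global_ssim(image, perturbed)
```

Under this reading, a successful attack would show a large `explanation_change` together with an `image_similarity` close to 1, meaning the explanation moved substantially while the input stayed visually almost unchanged.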