Evaluating Post-hoc Interpretability in Data-efficient Adversarial Training

  • Wei, Jiawen (National University of Singapore)
  • Pesce, Leonardo (National University of Singapore)
  • Mengaldo, Gianmarco (National University of Singapore)


Adversarial training (AT) has proven to be an effective defense against adversarial attacks. Recent studies suggest that adversarially trained models tend to produce more interpretable feature attribution maps than standard models, owing to their perceptually aligned gradients. However, AT is computationally expensive, as it requires multi-step gradient computation and backpropagation for each training example. Although recent work on coreset selection shows that data-efficient adversarial training can achieve comparable robust accuracy at reduced training cost, it remains an open question whether the resulting models preserve the interpretability obtained with adversarial training on the full dataset. In this work, we propose a framework that connects sample importance with feature attribution. We use a margin-based method to iteratively measure the vulnerability of training examples during training. We then select the informative examples that lie close to the decision boundary to construct a subset for adversarial training. Additionally, we integrate temporal ensembling to mitigate robust overfitting in AT by regularizing predictions on adversarial examples to avoid overconfidence. To explore whether the learning of robust features is guided by informative examples that approximate the robust decision boundary, we thoroughly evaluate post-hoc interpretability across models trained with Standard Training (ST), Full-data Adversarial Training (Full-AT), and Subset Adversarial Training (Sub-AT). Our experiments are conducted on the CIFAR-10 and CIFAR-100 datasets. We evaluate adversarial robustness under multiple attacks, including PGD-10, PGD-100, and AutoAttack. Sub-AT can achieve robust accuracy comparable to Full-AT and marginally higher clean accuracy. To systematically evaluate the post-hoc interpretability of ST, Full-AT, and Sub-AT, we adopt the evaluation pipeline we previously proposed in [1] and [2].
This work aims to explore whether the mechanism of robust feature learning is concentrated in some sensitive examples, potentially offering a pathway to efficient models that are both robust and interpretable.
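
As an illustration of the margin-based selection step, the following minimal sketch ranks examples by how close they lie to the decision boundary and keeps the closest ones. It is not the authors' implementation: the abstract does not define the margin, so the logit margin (true-class logit minus the largest other logit), the use of its absolute value as the closeness measure, the function names `margin` and `select_subset`, and the `fraction` parameter are all assumptions made for illustration.

```python
def margin(logits, label):
    """Logit margin (assumed definition): true-class logit minus the largest other logit."""
    true_logit = logits[label]
    best_other = max(z for k, z in enumerate(logits) if k != label)
    return true_logit - best_other


def select_subset(all_logits, labels, fraction):
    """Return the sorted indices of the `fraction` of examples with the smallest |margin|.

    A small absolute margin is used here as a proxy for "close to the
    decision boundary"; this proxy is an assumption, not the paper's criterion.
    """
    margins = [margin(z, y) for z, y in zip(all_logits, labels)]
    k = max(1, round(fraction * len(labels)))
    ranked = sorted(range(len(labels)), key=lambda i: abs(margins[i]))
    return sorted(ranked[:k])


# Toy logits for four examples over three classes (hypothetical values).
example_logits = [
    [4.0, 0.5, 0.1],  # confidently correct, far from the boundary
    [1.2, 1.0, 0.3],  # correct, but only barely
    [0.2, 0.9, 1.0],  # correct, but only barely
    [0.1, 3.0, 0.2],  # confidently correct
]
example_labels = [0, 0, 2, 1]

# Keep the half of the data nearest the boundary.
subset = select_subset(example_logits, example_labels, fraction=0.5)
```

In an iterative setting, a selection of this kind would be recomputed from the current model's outputs as training proceeds, and adversarial training would then run only on the selected indices.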