Title | : | Interpreting adversarial attacks and defenses using architectures with enhanced interpretability |
Speaker | : | Akshay G Rao (IITM) |
Details | : | Wed, 25 Sep, 2024 3:00 PM @ MR - I (SSB 233, First Floor)
Abstract | : | Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. These attacks involve intentionally crafting perturbations to input data that, while often imperceptible to humans, lead the model to incorrect predictions. This phenomenon exposes vulnerabilities in deep learning systems across applications ranging from image recognition to natural language processing. Adversarial training has been a popular defence technique against these attacks, and the research community has been increasingly interested in interpreting robust models and understanding how they defend against attacks. In this talk, we capitalize on two network architectures, namely DLGN and DLGN-SF, which have better interpretation capabilities than regular network architectures. Using these architectures, we interpret robust models trained with PGD adversarial training and compare them with models obtained by standard training. The feature network in these architectures acts as the feature extractor, making it the only medium through which an adversary can attack the model. We therefore use the feature network of these architectures with fully connected layers to analyse properties such as the nature of the hyperplanes, their relation to PCA, and sub-network overlap, and we compare these properties between robust and standard models. We also consider CNN layers in these architectures, qualitatively and quantitatively contrasting the gating patterns of robust and standard models. We show the exact frequency bias of the convolutions in robust and standard models, an analysis unique to this work, and we use ideas from visualization to understand the representations used by robust and standard models.
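For attendees unfamiliar with the PGD adversarial training the abstract refers to (Madry et al., 2018), below is a minimal sketch, not material from the talk itself: a standard L-infinity PGD attack and the training loop built around it. It assumes a PyTorch image classifier; `model`, `loader`, and `optimizer` are illustrative placeholders, and the hyperparameters shown are common defaults rather than the speaker's settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: repeatedly step along the sign of the input
    gradient, then project back into the eps-ball around x."""
    # Random start inside the eps-ball, clipped to valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()                 # ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)            # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                           # keep pixels valid
    return x_adv.detach()

# PGD adversarial training: fit the model on perturbed batches
# instead of clean ones (model, loader, optimizer assumed defined).
def train_epoch(model, loader, optimizer):
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

In the DLGN/DLGN-SF setting discussed in the talk, the gradient above flows to the input only through the feature (gating) network, which is why the abstract describes that network as the sole medium of attack.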