On the Distinctive Properties of Universal Perturbations … Explained
In this blog post we will go through a new paper titled “On the Distinctive Properties of Universal Perturbations” by Sung Min Park et al. The paper is linked to other works by the same group, “Adversarial Examples Are Not Bugs, They Are Features” and “Towards Deep Learning Models Resistant to Adversarial Attacks”, which are also briefly discussed to improve the flow for the reader.
Adversarial Perturbations: Standard and Universal
Standard Adversarial Perturbations (SAPs):
Standard adversarial perturbations (which we refer to as SAPs in this post) were first introduced as a security vulnerability that a malicious actor could exploit against deep neural networks (DNNs). By carefully crafting an adversarial perturbation and adding it to an input image, one can flip (change) the classifier’s prediction either to a random target (different from the ground truth label) or to a particular target label. Figure 1 shows an example of a SAP.
Mathematically, this can be formulated as shown below:
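In the targeted setting, this constrained optimization problem can be written as follows (a standard formulation, consistent with the description of the symbols given next):

```latex
\delta^{\star} = \arg\min_{\delta \in \Delta} \; \mathcal{L}\big(f(x + \delta),\, t\big),
\qquad
\Delta = \{\, \delta : \|\delta\|_p \le r \,\}
```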
where x is our input image, f is our classifier (our DNN), δ is the adversarial perturbation we are trying to find, t is our target label, and ℒ is our loss function (the cross entropy loss). The second line, defining Δ, is the constraint we impose on the optimization problem. The constraint ensures that our attack is imperceptible by bounding δ to lie within a p-norm ball of radius r.
PGD: Solving our SAP Optimization Problem
The earlier optimization problem allows us to find an adversarial perturbation which when added to the input image minimizes the cross entropy loss between the predicted label by the network and the target label. In other words, it aims at making the network’s prediction match the target label.
To solve this optimization problem, one can utilize a technique called projected gradient descent (PGD). As the name suggests, we apply gradient descent and then project the obtained solution onto our ℓp norm ball to make sure that our constraint is satisfied. This is better explained visually in Figure 2 (this visualization considers the untargeted attack scenario, where we are trying to maximize the loss between the prediction and the ground truth label, i.e., we just want the network’s prediction to differ from the ground truth label).
The mathematical details (found in Madry et al.’s “Towards Deep Learning Models Resistant to Adversarial Attacks”) are left to the reader; however, the solution for a single step of PGD is shown below:
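A common form of the targeted update, with step size α and 𝒫_Δ denoting projection onto the constraint set, is:

```latex
\delta^{(k+1)} = \mathcal{P}_{\Delta}\!\left( \delta^{(k)} - \alpha \,\nabla_{\delta}\, \mathcal{L}\big(f(x + \delta^{(k)}),\, t\big) \right)
```

For the ℓ∞ ball, the gradient is often replaced by its sign (as in Madry et al.), and the projection reduces to element-wise clipping of δ to [−r, r].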
Universal Adversarial Perturbations (UAPs)
Later on, people took the concept one step further than SAPs … that is, UAPs! Universal adversarial perturbations (UAPs), introduced by Moosavi-Dezfooli et al., are perturbations δ that don’t aim at fooling the DNN on a single sample x but rather on multiple samples … in fact, on as many samples as possible.
Mathematically, this can be represented in the equation below. As before, we want to minimize the cross entropy loss between the model’s prediction on the perturbed image and the target label … but instead of doing it sample-wise, we do it in expectation over the dataset!
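Concretely, the universal objective keeps the same loss and constraint set Δ as before, but averages over the data distribution 𝒟:

```latex
\min_{\delta \in \Delta} \; \mathbb{E}_{x \sim \mathcal{D}}\left[ \mathcal{L}\big(f(x + \delta),\, t\big) \right]
```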
Visually, this can be seen in Figure 3: for every image in our dataset, whenever we add the universal adversarial perturbation we want the prediction to be, for example, “dog”!
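To make the UAP optimization concrete, here is a minimal runnable sketch of targeted universal PGD against a toy linear softmax model (the linear model, dimensions, and hyperparameters are illustrative stand-ins; the paper attacks deep networks on real image datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the classifier f: a fixed linear softmax model over
# 32-dimensional "images" with 10 classes. A linear model keeps the
# sketch short and runnable.
W = rng.normal(size=(10, 32))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grad_wrt_delta(x, delta, target):
    """Gradient of the batch-averaged cross entropy toward `target`
    with respect to the shared perturbation delta."""
    p = softmax((x + delta) @ W.T)       # [batch, 10]
    g = p.copy()
    g[:, target] -= 1.0                  # dL/dlogits = p - one_hot(target)
    return (g @ W).mean(axis=0)          # chain rule through logits = (x+d)W^T

def universal_pgd(x, target, radius=0.5, alpha=0.1, steps=100):
    """Targeted PGD for a *universal* perturbation: one delta shared by
    every sample, projected back into the l_inf ball after each signed step."""
    delta = np.zeros(x.shape[1])
    for _ in range(steps):
        delta = delta - alpha * np.sign(grad_wrt_delta(x, delta, target))
        delta = np.clip(delta, -radius, radius)   # projection onto the ball
    return delta

x = rng.normal(size=(64, 32))            # a batch playing the role of the dataset
delta = universal_pgd(x, target=3)
asr = (softmax((x + delta) @ W.T).argmax(axis=1) == 3).mean()
print(f"universal attack success rate: {asr:.2f}")
```

Note that running the same loop on a single-sample batch recovers a standard adversarial perturbation for that sample.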
Contributions in the Paper
The paper we are going to present today shows the following:
I. In contrast to standard adversarial perturbations that tend to be incomprehensible — UAPs are more human-aligned:
- UAPs are locally semantic: the signal is concentrated in local regions that are most salient to humans. SAPs on the other hand are not.
- UAPs are approximately spatially invariant: they are still effective after translations. SAPs on the other hand are not.
II. UAPs contain significantly less generalizable signal from non-robust features compared to standard perturbations. This is shown by:
1. Checking how well a model can generalize to the original test set by training on a dataset where the only correlations with the label are added via UAPs.
2. Measuring the transferability of UAPs across independently trained models.
Quantifying Human Alignment:
Before quantifying the human alignment, we first present a concept that was previously visited by multiple papers regarding the visual differences between SAPs and UAPs.
Visual Differences between SAPs and UAPs
In Figure 4 presented below we show a set of UAPs obtained for different target classes:
Comparing the UAPs in Figure 4 to the SAP previously shown in Figure 1, we can clearly observe the following:
- SAPs are incomprehensible to humans: “when magnified for visualization, these perturbations are not identifiable to a human as belonging to their target class”. The SAP sample shown in Figure 1 cannot be interpreted as having any semantic meaning to us. It’s simply a bunch of noise.
- UAPs are visually much more interpretable: “when amplified, they contain local regions that we can identify with the target class”. The UAP samples shown in Figure 4 can be interpreted as having semantic meaning in certain regions. The dog-target UAP has pictures of dogs scattered in particular locations. A similar observation can be made for UAPs of other classes: they have clear semantic meaning relevant to the target class.
We now move to quantifying both the semantic locality and the spatial invariance of UAPs.
Observation: As we saw earlier, a considerable portion of a UAP’s signal is concentrated in small, localized regions that humans find salient; the majority of the signal comes from the most visually significant areas. SAPs lack this property, as none of their local regions are semantic.
Methodology: “To quantify this for UAPs, we randomly select local patches of the perturbation, evaluate their attack success rate (ASR) in isolation, and inspect them visually. For both ℓ2 and ℓ∞ perturbations, the patches with the highest ASR are more visually identifiable as the target class. This shows that the model is indeed influenced primarily by the most salient parts of the perturbation.” (Check Figure 5)
Conclusion: Unlike SAPs, UAPs have semantic local patches. These semantic patches are what contribute to the bulk of the attack success rate rather than other non-semantic patches.
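This patch-evaluation methodology can be sketched as follows (the `asr_fn` below is a hypothetical stand-in that scores a patch by the fraction of the perturbation’s energy it retains; a real evaluation would add the masked perturbation to test images and run the classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_only(delta, top, left, size):
    """Zero out the perturbation everywhere except one (size x size) patch,
    so the patch can be evaluated in isolation."""
    masked = np.zeros_like(delta)
    masked[top:top + size, left:left + size] = delta[top:top + size, left:left + size]
    return masked

def rank_patches_by_asr(delta, asr_fn, size=8, n_patches=20):
    """Sample random patches of the perturbation and rank them by the
    attack success rate they achieve on their own."""
    h, w = delta.shape
    results = []
    for _ in range(n_patches):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        results.append(((top, left), asr_fn(patch_only(delta, top, left, size))))
    return sorted(results, key=lambda r: -r[1])

# Stand-in perturbation with sparse, localized signal, and a stand-in
# asr_fn that measures retained energy instead of running a model.
delta = rng.normal(size=(32, 32)) * (rng.random((32, 32)) < 0.1)
total_energy = np.abs(delta).sum()
ranked = rank_patches_by_asr(delta, lambda d: np.abs(d).sum() / total_energy)
(best_top, best_left), best_score = ranked[0]
```

The highest-ranked patches are then inspected visually, which is how the paper links the model’s most influential regions to the ones humans find semantic.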
We are interested in the effect of spatial translations on the attack success rate of the obtained perturbation. This is important to examine because we want to show that UAPs, unlike SAPs, possess desirable properties that make them more closely aligned with human priors.
Methodology: “We quantify spatial invariance by measuring the ASR of translated perturbations. A highly spatially invariant perturbation will have a high ASR even after translations. We evaluate a subsampled grid with strides of four pixels. The value at coordinate (i, j) represents the average ASR when the perturbations are shifted right by i pixels and up by j, with wrap-around to preserve information; the center pixel at (0, 0) represents the ASR of the original unshifted perturbations.”
Conclusion: Even after translating the perturbation, UAPs achieve non-trivial attack success rate. SAPs on the other hand can only achieve a chance-level 10% ASR when shifted by more than eight pixels. (Check Figure 6)
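The translation measurement can be sketched as below; `np.roll` implements the wrap-around, and the stand-in `asr_fn` used here depends only on the perturbation’s values, so it is exactly shift-invariant and every grid entry matches the unshifted center (a real `asr_fn` would run the classifier on the shifted perturbation):

```python
import numpy as np

rng = np.random.default_rng(0)

def shifted_asr_grid(delta, asr_fn, max_shift=8, stride=4):
    """Evaluate asr_fn on every (right-shift i, up-shift j) version of the
    perturbation on a subsampled grid with the given stride, using np.roll
    so pixels wrap around and no signal is lost at the borders."""
    shifts = range(-max_shift, max_shift + 1, stride)
    return {
        (i, j): asr_fn(np.roll(np.roll(delta, i, axis=1), -j, axis=0))
        for i in shifts
        for j in shifts
    }

delta = rng.normal(size=(32, 32))
# Stand-in asr_fn: mean absolute value, which np.roll cannot change.
grid = shifted_asr_grid(delta, lambda d: float(np.abs(d).mean()))
center = grid[(0, 0)]
```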
Quantifying Reliance on Non-Robust Features
As discussed earlier, the contributions of the presented paper are twofold. The first main contribution, which we presented in the earlier section, is the quantification of human alignment for UAPs and a comparison with SAPs. The second contribution is the quantification of the reliance on non-robust features. Before jumping into this contribution, we have to present the concept of robust and non-robust features, which was introduced in “Adversarial Examples Are Not Bugs, They Are Features”.
One of the most adopted views to understand adversarial robustness is one that divides features into two categories: robust features and non-robust features. We present some essential definitions presented in the paper:
- A useful feature for classification is a function that is (positively) correlated with the correct label in expectation.
- A feature is robustly useful if, even under adversarial perturbations (within a specified set of valid perturbations Δ), the feature is still useful.
- A useful, non-robust feature is a feature that is useful but not robustly useful. These features are useful for classification in the standard setting, but can hurt accuracy in the adversarial setting.
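Following the notation of “Adversarial Examples Are Not Bugs, They Are Features” (binary labels y ∈ {−1, +1}, with a feature viewed as a function f : 𝒳 → ℝ), these definitions can be written as:

```latex
\text{$\rho$-useful:} \quad \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \ge \rho,
\qquad
\text{$\gamma$-robustly useful:} \quad \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[ \inf_{\delta \in \Delta} y \cdot f(x+\delta) \Big] \ge \gamma
```

A useful, non-robust feature is then one that is ρ-useful for some ρ > 0 but not γ-robustly useful for any γ ≥ 0.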
This division of features can be represented as shown in Figure 7.
Robust features are features that aren’t usually utilized in developing our imperceptible adversarial attacks. They are features like ears, face shape, etc. that we humans rely on to classify a cat as a cat and a dog as a dog. Non-robust features are features that we humans do not rely on in our predictions and probably don’t even notice. They help the network generalize because of their frequent occurrence in our datasets (like recurring blobs of colored pixels). These features are very sensitive to noise introduced by adversarial attacks and break down if perturbed within an ℓp ball.
The authors of that work provide ways of splitting a dataset into a “Robust Dataset” and a “Non-Robust Dataset”. The robust dataset contains robust features and, if used to train a network, can achieve both good standard accuracy and good robust accuracy. The non-robust dataset, however, can only produce good standard accuracy but bad robust accuracy. This is shown in Figure 8.
Another interesting observation in that work can be observed by the following experiment:
(1) Generate adversarial attacks on the images in the train set.
(2) Relabel the attacked samples with the target label (or label we flip to) and create a new attacked dataset.
(3) Train the network on the new attacked dataset.
Applying such a procedure produces a dataset whose non-robust features represent the new label but whose robust features represent the original label. The authors observe that the obtained DNN can still achieve non-trivial performance on the original clean test set. What can we learn from that? Non-robust features alone still allow for training a network that generalizes well. This procedure is visualized in Figure 9.
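The three-step procedure above can be sketched like this (the `attack_fn` used in the demo is a hypothetical placeholder that just adds tiny target-dependent noise; in the actual experiment it is a targeted adversarial attack against the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_relabeled_dataset(x_train, y_train, attack_fn, n_classes=10):
    """Attack each training sample toward a target class, then label the
    attacked sample with that target. The result pairs non-robust features
    of the target label with robust features of the original label."""
    targets = rng.integers(0, n_classes, size=len(y_train))
    # make sure every target differs from the ground-truth label
    targets = np.where(targets == y_train, (targets + 1) % n_classes, targets)
    x_adv = np.stack([attack_fn(x, t) for x, t in zip(x_train, targets)])
    return x_adv, targets

# Demo with placeholder data and a placeholder attack.
x_train = rng.normal(size=(100, 32))
y_train = rng.integers(0, 10, size=100)
x_adv, y_new = make_relabeled_dataset(
    x_train, y_train, attack_fn=lambda x, t: x + 0.01 * (t + 1) / 10
)
```

Training a fresh network on `(x_adv, y_new)` and testing on the original clean test set is exactly the generalization probe described above.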
Generalization from Universal Non-robust Features
The authors of the paper we are presenting now rely on a similar approach. They generate two datasets of non-robust features. One is generated using SAPs and another is generated using UAPs (Figure 10).
“We train new ResNet-18 models on the [universal and standard non-robust features] datasets and evaluate them on the original test set. The best generalization accuracies from training on the universal non-robust features dataset and the standard non-robust features dataset were 23.2% and 74.5%, respectively.”
Conclusion: “Universal non-robust features do have signal that models can use to generalize, but universal non-robust features are harder to generalize from than general non-robust features. Thus, there is some useful signal in universal non-robust features, but there appears to be less of it than in standard adversarial perturbations.”
Transferability of UAPs
Another way to measure the extent of utilization of non-robust features by UAPs is to look at their transferability. Transferability of adversarial attacks is attributed to non-robust features that different models could rely on to generalize better on different samples. As a result, perturbations that utilize non-robust features more should be more transferable between models.
- Perturb examples using either a standard adversarial perturbation or a UAP on the source model.
- Measure the probability that the perturbed input is classified as the target class on a new target model that is trained independently (the paper considers ResNet18 and VGG19).
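The two steps can be sketched with toy linear models standing in for the independently trained networks (all dimensions and hyperparameters here are illustrative, not the paper’s):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def universal_linf_attack(W, x, target, radius=0.5, alpha=0.1, steps=50):
    """Targeted l_inf PGD for a universal perturbation against a linear
    softmax model with weights W (a toy stand-in for the ResNet18 / VGG19
    source models considered in the paper)."""
    delta = np.zeros(x.shape[1])
    for _ in range(steps):
        p = softmax((x + delta) @ W.T)
        g = p.copy()
        g[:, target] -= 1.0                      # dL/dlogits for the target class
        delta = delta - alpha * np.sign((g @ W).mean(axis=0))
        delta = np.clip(delta, -radius, radius)  # project back into the ball
    return delta

# Two independently initialized "models": craft the perturbation on the
# source, then measure how often it hits the target class on the other.
W_source = rng.normal(size=(10, 32))
W_other = rng.normal(size=(10, 32))
x = rng.normal(size=(256, 32))

delta = universal_linf_attack(W_source, x, target=3)
source_asr = (softmax((x + delta) @ W_source.T).argmax(axis=1) == 3).mean()
transfer_asr = (softmax((x + delta) @ W_other.T).argmax(axis=1) == 3).mean()
print(f"source ASR: {source_asr:.2f}, transfer ASR: {transfer_asr:.2f}")
```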
Conclusion: As shown in Figure 11, the transferability of UAPs is worse than that of SAPs; that is, SAPs leverage non-robust features more than UAPs do. This demonstrates that while UAPs are more human-aligned, they leverage only a small fraction of the statistical signal present in general non-robust features.
One final thing to consider is: “to what extent can one interpolate between the properties of universal and standard non-robust features?” To answer this question, we consider two parameters that control the way the UAP is generated: (1) the number of samples used in the UAP optimization problem (finding the universal adversarial perturbation is usually done over a mini-batch, which we refer to as the base set), and (2) the classes of the samples used to generate the UAP.
Effect of Base Set Size: The base set is the set of images used while solving the UAP optimization problem mentioned in the first section of this post. Usually, the optimization problem is solved on a base set smaller than the complete dataset, as carrying out the optimization over the whole dataset is quite expensive. We refer to the base set size as K. If K = 1 we are carrying out a standard adversarial attack, whereas if K > 1 we are computing a UAP over K samples. The effect of changing the base set size is shown in Table 1.
Generalization begins to suffer even for relatively small values of K (note that the test accuracy refers to the accuracy, on the original test set, of a model trained on non-robust features generated using a UAP with base set size K). For example, the generalization accuracy falls from 74% at K = 1 to 34% at K = 16. On the other hand, the UAPs only become more semantically meaningful once the base set size increases to K ≥ 64. This is shown in Figure 12:
Conclusion: There is a clear trade-off between better semantics, which only become apparent at higher values of K (≥ 64), and generalization, which suffers even at relatively small values of K (≥ 16).
Class of Chosen Samples: Taking a further step into the way the UAP is generated, the authors study the influence of the classes of the chosen base set samples. They consider three variations: random, where base samples are selected randomly from the dataset; single class, where all the base samples are chosen from the same exact class; and single sub-class, where all base samples are drawn from a single category (containing multiple classes). The results are shown in Table 2.
Conclusion: The results of these interpolation experiments show that the large gap in signal between UAPs and standard perturbations persists even when the level of “universality” is relaxed.
This work studies universal adversarial perturbations and shows that, unlike standard adversarial perturbations, they exhibit human-aligned properties. The authors characterize and quantify the degree to which UAPs are human-aligned in terms of semantic locality and spatial invariance. They then quantify the degree to which UAPs leverage non-robust features through experiments that study both generalizability and transferability. The experiments show that UAPs contain a much weaker signal for generalization than standard perturbations.
This work demonstrates that examining UAPs may be a promising direction for understanding specific properties of adversarial perturbations, and for studying associated phenomena such as the prevalence and nature of non-robust features.
References

- Park, S.M., Wei, K., Xiao, K.Y., Li, J., & Madry, A. (2021). On Distinctive Properties of Universal Perturbations. arXiv:2112.15329.
- Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial Examples Are Not Bugs, They Are Features. arXiv:1905.02175.
- Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083.
- Benz, P., Zhang, C., Imtiaz, T., & Kweon, I.S. (2020). Universal Adversarial Perturbations are Not Bugs, They are Features.
- Moosavi-Dezfooli, S., Fawzi, A., Fawzi, O., & Frossard, P. (2017). Universal Adversarial Perturbations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 86–94.