On the Distinctive Properties of Universal Perturbations … Explained

Targeted Universal Adversarial Perturbations with Target Class Sea Lion (Source [4])

In this blog post we will go through a new paper titled “On the Distinctive Properties of Universal Perturbations” by Sung Min Park et al. [1]. The paper builds on other works by the same group, “Adversarial Examples Are Not Bugs, They Are Features” [2] and “Towards Deep Learning Models Resistant to Adversarial Attacks” [3], which are also briefly discussed to make the post easier to follow.

Adversarial Perturbations: Standard and Universal

Standard Adversarial Perturbations (SAPs):

Standard adversarial perturbations (which we refer to as SAPs in this post) were introduced as a security threat that a malicious actor could exploit against deep neural networks (DNNs). By carefully crafting an adversarial perturbation and adding it to an input image, one can flip (change) the classifier’s prediction either to an arbitrary incorrect label (different from the ground truth label) or to a particular target label. Figure 1 shows an example of a SAP.

Figure 1 — Example of a Standard Adversarial Attack — Standard adversarial attacks find imperceptible noise which, when added to an input image, can flip the predicted label. [Source]

Mathematically, this can be formulated as shown below:
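Using the notation defined next, the targeted attack can be written as the following constrained optimization problem:

$$\min_{\delta \in \Delta} \ \mathcal{L}\big(f(x + \delta),\, t\big)$$
$$\Delta = \{\delta : \|\delta\|_p \le r\}$$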

where x is our input image, f is our classifier (our DNN), δ is the adversarial perturbation we are trying to find, t is our target label, and ℒ is our loss function (cross entropy loss). The second line, defining Δ, is the constraint that we impose on the optimization problem. The constraint ensures that our attack is imperceptible by restricting δ to a p-norm ball of radius r.

PGD: Solving our SAP Optimization Problem

The optimization problem above allows us to find an adversarial perturbation which, when added to the input image, minimizes the cross entropy loss between the network’s prediction and the target label. In other words, it aims at making the network’s prediction match the target label.

To solve this optimization problem, one can use a technique called projected gradient descent (PGD). As the name suggests, we apply gradient descent and then project the obtained solution onto our ℓp-norm ball to make sure that our constraint is satisfied. This is better explained visually in Figure 2 (this visualization considers the untargeted attack scenario, where we are trying to maximize the loss between the prediction and the ground truth label, i.e., we just want the network’s prediction to be different from the ground truth label).

Figure 2 — Example of PGD Attack — The PGD attack aims at finding samples with maximum loss (in the untargeted attack scenario). Samples with high loss with respect to the ground truth label are possible adversaries that fool the deep neural network. [Source]

The mathematical details (found in [3]) are left to the reader; however, a single step of PGD is shown below:
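For the untargeted setting illustrated in Figure 2, a single PGD step from [3] takes a signed gradient ascent step on the loss and projects the result back onto the allowed perturbation set:

$$x^{k+1} = \Pi_{x + \Delta}\Big(x^{k} + \alpha \,\mathrm{sgn}\big(\nabla_{x}\, \mathcal{L}(f(x^{k}),\, y)\big)\Big)$$

where α is the step size, y is the ground truth label, and Π denotes projection onto the constraint set. In the targeted case the sign of the step flips, since we descend on the loss towards the target label t. As a concrete illustration, below is a minimal PyTorch sketch of targeted ℓ∞ PGD; the function name and the values of eps, alpha, and steps are illustrative choices, not the paper’s exact settings.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=8/255, alpha=2/255, steps=10):
    # Targeted L-infinity PGD: take gradient *descent* steps on the
    # cross-entropy loss towards `target`, then project the perturbation
    # back onto the eps-ball and keep x + delta a valid image in [0, 1].
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), target)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()               # descent step (targeted attack)
            delta.clamp_(-eps, eps)                    # project onto the L-inf ball
            delta.copy_((x + delta).clamp(0, 1) - x)   # keep the perturbed image valid
    return delta.detach()
```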

Universal Adversarial Perturbations (UAPs)

Later on, people started discussing a concept that takes SAPs one step further … UAPs! Universal adversarial perturbations (UAPs), introduced in [5], are perturbations δ that don’t aim at fooling the DNN on a single sample x but rather on multiple samples … in fact, on as many samples as possible.

Mathematically, this can be represented by the equation below. As before, we want to minimize the cross entropy loss between the model’s prediction on the perturbed image and the target label … but instead of doing it sample-wise, we do it in expectation over the dataset!
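Keeping the same notation and letting 𝒟 denote the data distribution, the targeted UAP objective becomes:

$$\min_{\delta \in \Delta} \ \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathcal{L}\big(f(x + \delta),\, t\big)\Big], \qquad \Delta = \{\delta : \|\delta\|_p \le r\}$$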

Visually this can be seen in Figure 3: for every sample in our dataset, whenever we add the universal adversarial perturbation we want the prediction to be, for example, dog!

Figure 3 — Targeted Universal Adversarial Perturbations — Targeted UAPs aim at finding a single perturbation which, when added to samples in the test set, fools the network into predicting the target label on as many samples as possible.
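To make the difference from the sample-wise attack concrete, here is a minimal PyTorch sketch that optimizes a single targeted UAP over batches drawn from a set of images; the function name and the hyperparameters eps, alpha, and epochs are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn.functional as F

def targeted_uap(model, loader, target_class, eps=8/255, alpha=1/255, epochs=5):
    # A *single* perturbation delta is optimized over many samples: every batch
    # contributes a gradient of the cross-entropy towards the target class,
    # and delta is projected back onto the L-inf ball after each update.
    delta = None
    for _ in range(epochs):
        for x, _ in loader:
            if delta is None:
                # One perturbation shared across all images (shape 1 x C x H x W).
                delta = torch.zeros_like(x[:1], requires_grad=True)
            t = torch.full((x.size(0),), target_class, dtype=torch.long, device=x.device)
            loss = F.cross_entropy(model((x + delta).clamp(0, 1)), t)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta -= alpha * grad.sign()   # descent towards the target class
                delta.clamp_(-eps, eps)        # project onto the L-inf ball
    return delta.detach()
```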

Contributions in the Paper

The paper [1] we are presenting today shows the following:

I. In contrast to standard adversarial perturbations, which tend to be incomprehensible, UAPs are more human-aligned:

  1. UAPs are locally semantic: the signal is concentrated in local regions that are most salient to humans. SAPs on the other hand are not.
  2. UAPs are approximately spatially invariant: they are still effective after translations. SAPs on the other hand are not.

II. UAPs contain significantly less generalizable signal from non-robust features compared to standard perturbations. This is shown by:

1. Checking how well a model can generalize to the original test set by training on a dataset where the only correlations with the label are added via UAPs.

2. Measuring the transferability of UAPs across independently trained models.

Quantifying Human Alignment:

Before quantifying the human alignment, we first revisit an observation made in several earlier papers regarding the visual differences between SAPs and UAPs.

Visual Differences between SAPs and UAPs

In Figure 4 presented below we show a set of UAPs obtained for different target classes:

Figure 4 — UAPs Are Semantically Meaningful — Targeted universal adversarial attacks tend to have locally semantically meaningful patches. For example, the UAP that turns samples into dogs has dog faces scattered around multiple locations of the perturbation (Source [1]).

Comparing the UAPs in Figure 4 to the SAP previously shown in Figure 1, we can clearly observe the following:

  1. SAPs are incomprehensible to humans: “when magnified for visualization, these perturbations are not identifiable to a human as belonging to their target class”. The SAP sample shown in Figure 1 cannot be interpreted as having any semantic meaning to us. It’s simply a bunch of noise.
  2. UAPs are visually much more interpretable: “when amplified, they contain local regions that we can identify with the target class”. The UAP samples shown in Figure 4 can be interpreted as having semantic meaning in certain regions. The dog-targeted UAP has pictures of dogs scattered in particular locations. A similar observation can be made for the UAPs of other classes; they have clear semantic meaning relevant to the target class.

We now move to quantifying both the semantic locality and the spatial invariance of UAPs.

Semantic Locality

Observation: As we saw earlier, a considerable portion of the perturbation’s signal is concentrated in small, localized regions that are salient to humans. The majority of the signal in UAPs comes from the most visually significant areas. SAPs lack this property, as none of their local regions are semantic.

Methodology: “To quantify this for UAPs, we randomly select local patches of the perturbation, evaluate their attack success rate (ASR) in isolation, and inspect them visually. For both ℓ2 and ℓ∞ perturbations, the patches with the highest ASR are more visually identifiable as the target class. This shows that the model is indeed influenced primarily by the most salient parts of the perturbation.” (See Figure 5.)

Conclusion: Unlike SAPs, UAPs have semantically meaningful local patches, and these patches, rather than the non-semantic ones, contribute the bulk of the attack success rate.

Figure 5 — Semantic Locality Experiment — When random patches of the universal adversarial perturbation are chosen and used to attack the model, the locally semantically meaningful patches achieve the highest attack success rate, rather than those that are semantically meaningless (Source [1]).
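A rough sketch of this patch experiment in PyTorch is shown below; the patch size, number of trials, and function name are illustrative assumptions rather than the paper’s exact protocol, and delta is assumed to be a single (C, H, W) perturbation applied to a batch of clean images.

```python
import torch

@torch.no_grad()
def random_patch_asr(model, images, delta, target, patch=16, trials=20):
    # For each trial, keep only a random patch x patch region of the
    # perturbation (zero elsewhere), apply it to the clean images, and
    # record the fraction of images classified as the target class.
    C, H, W = delta.shape
    rates = []
    for _ in range(trials):
        top = torch.randint(0, H - patch + 1, (1,)).item()
        left = torch.randint(0, W - patch + 1, (1,)).item()
        mask = torch.zeros_like(delta)
        mask[:, top:top + patch, left:left + patch] = 1.0
        preds = model((images + mask * delta).clamp(0, 1)).argmax(dim=1)
        rates.append((preds == target).float().mean().item())
    return rates
```

Visually inspecting the patches with the highest rates is then what links the ASR back to the semantic content of the perturbation.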

Spatial Invariance

We are interested in seeing the effect of spatial translations on the attack success rate of the obtained perturbation. This is important to identify because we want to show that UAPs, unlike SAPs, possess desirable properties that make them more closely aligned with human priors.

Methodology: “We quantify spatial invariance by measuring the ASR of translated perturbations. A highly spatially invariant perturbation will have a high ASR even after translations. We evaluate a subsampled grid with strides of four pixels. The value at coordinate (i, j) represents the average ASR when the perturbations are shifted right by i pixels and up by j, with wrap-around to preserve information; the center pixel at (0, 0) represents the ASR of the original unshifted perturbations.”

Conclusion: Even after translating the perturbation, UAPs achieve a non-trivial attack success rate. SAPs, on the other hand, only achieve a chance-level 10% ASR when shifted by more than eight pixels. (See Figure 6.)

Figure 6 — Spatial Invariance Experiment — The attack success rate of standard adversarial perturbations drops to 10% after translating the SAP by more than 8 pixels. Universal adversarial perturbations, on the other hand, are more resilient to translation and achieve a non-trivial attack success rate even at large pixel shifts (Source [1]).
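The wrap-around translation described in the methodology maps naturally onto torch.roll; a minimal sketch is given below, where the argument names, the maximum shift, and the stride of four pixels are illustrative choices based on the description above.

```python
import torch

@torch.no_grad()
def translated_asr(model, images, delta, target, max_shift=16, stride=4):
    # Shift the perturbation right by i pixels and up by j pixels with
    # wrap-around (torch.roll), add it to the clean images, and record how
    # often the model predicts the target class. (0, 0) is the unshifted ASR.
    results = {}
    for i in range(-max_shift, max_shift + 1, stride):
        for j in range(-max_shift, max_shift + 1, stride):
            shifted = torch.roll(delta, shifts=(-j, i), dims=(-2, -1))
            preds = model((images + shifted).clamp(0, 1)).argmax(dim=1)
            results[(i, j)] = (preds == target).float().mean().item()
    return results
```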

Quantifying Reliance on Non-Robust Features

As discussed earlier, the contributions of the presented paper are twofold. The first main contribution, which we presented in the earlier section, is the quantification of human alignment for UAPs and a comparison with SAPs. The second contribution is the quantification of the reliance on non-robust features. Before jumping into this contribution, we have to present the concept of robust and non-robust features, which was introduced in [2].

Preliminaries:

One of the most widely adopted views for understanding adversarial robustness divides features into two categories: robust features and non-robust features. We present some essential definitions from [2]:

  • A useful feature for classification is a function that is (positively) correlated with the correct label in expectation.
  • A feature is robustly useful if, even under adversarial perturbations (within a specified set of valid perturbations Δ), the feature is still useful.
  • A useful, non-robust feature is a feature that is useful but not robustly useful. These features are useful for classification in the standard setting, but can hurt accuracy in the adversarial setting.

This division of features can be represented as shown in Figure 7.

Figure 7 — Robust and Non-Robust Features — The useful features that the network relies on when predicting a particular label can be divided into robust features (like ears, mouth, …) and non-robust features (like pixel blobs). Unlike robust features, non-robust features are exploited in the generation of standard adversarial attacks (Source [2]).

Robust features are features that aren’t usually exploited when crafting imperceptible adversarial attacks. They are features like ears, face shape, … that we humans rely on to classify a cat as a cat and a dog as a dog. Non-robust features are features that we humans do not rely on in our predictions and probably don’t even notice. They are features that help the network generalize because of their frequent occurrence in our datasets (like recurring blobs of colored pixels). These features are very sensitive to noise introduced by adversarial attacks and break down if perturbed within an ℓp ball.

The authors of that work provide a way of splitting a dataset into a “Robust Dataset” and a “Non-Robust Dataset”. The robust dataset contains robust features and, if used to train a network, can achieve both good standard accuracy and good robust accuracy. The non-robust dataset, however, can only produce good standard accuracy but poor robust accuracy. This is shown in Figure 8.

Figure 8 — Training on Robust and Non-Robust Datasets — Splitting the training dataset into a robust dataset and a non-robust dataset allows us to verify the effectiveness of each set of features in achieving good standard and robust accuracies. Robust features allow for both good standard and robust accuracies, whereas non-robust features can only achieve good standard accuracy (Source [2]).

Another interesting finding in that work comes from the following experiment:

(1) Generate adversarial attacks on the images in the train set.

(2) Relabel the attacked samples with the target label (or label we flip to) and create a new attacked dataset.

(3) Train the network on the new attacked dataset.

Applying such a procedure produces a dataset whose non-robust features correspond to the new label but whose robust features correspond to the original label. The authors observe that the obtained DNN still achieves non-trivial performance on the original clean test set. What can we learn from that? Non-robust features alone still allow for training a network that generalizes well. This procedure is visualized in Figure 9.

Figure 9 — Learning from Non-Robust Features — Training a DNN on a generated dataset where only the non-robust features correspond to the target label allows us to achieve good standard accuracy on the original dataset. This shows that generalization is achievable using the non-robust feature signal found in the generated dataset images (Source [2]).
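A condensed sketch of this three-step procedure is given below, reusing the hypothetical targeted_pgd helper from the PGD section; the deterministic target-assignment rule and the value of eps are illustrative assumptions, not the exact recipe of [2].

```python
import torch

def make_nonrobust_dataset(model, loader, num_classes, eps=8/255):
    # (1) attack each training image towards a chosen target label,
    # (2) relabel the attacked image with that target label,
    # (3) collect the pairs into a new "attacked" training set.
    new_images, new_labels = [], []
    for x, y in loader:
        t = (y + 1) % num_classes                   # illustrative deterministic target choice
        delta = targeted_pgd(model, x, t, eps=eps)  # sketch from the PGD section above
        new_images.append((x + delta).clamp(0, 1).cpu())
        new_labels.append(t.cpu())
    return torch.cat(new_images), torch.cat(new_labels)
```

Training a fresh network on the returned tensors and evaluating it on the original clean test set is what reveals how much usable signal the non-robust features carry.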

Generalization from Universal Non-robust Features

The authors of the paper we are presenting now rely on a similar approach. They generate two datasets of non-robust features. One is generated using SAPs and another is generated using UAPs (Figure 10).

Figure 10 — Standard and Universal Non-Robust Feature Dataset Generation — Following a similar approach to that in Figure 9, one can generate datasets where the non-robust features correlated with the label come either from standard or universal adversarial perturbations (Source [1]).

“We train new ResNet-18 models on the universal and standard non-robust feature datasets and evaluate them on the original test set. The best generalization accuracies from training on the universal non-robust features dataset and the standard non-robust features dataset were 23.2% and 74.5%, respectively.”

Conclusion: “Universal non-robust features do have signal that models can use to generalize, but universal non-robust features are harder to generalize from than general non-robust features. Thus, there is some useful signal in universal non-robust features, but there appears to be less of it than in standard adversarial perturbations.”

Transferability of UAPs

Another way to measure the extent to which UAPs exploit non-robust features is to look at their transferability. Transferability of adversarial attacks is attributed to the non-robust features that different models rely on in order to generalize. As a result, perturbations that exploit non-robust features more heavily should transfer better between models.

Methodology:

  1. Perturb examples using either a standard adversarial perturbation or a UAP on the source model.
  2. Measure the probability that the perturbed input is classified as the target class on a new target model that is trained independently (the paper considers ResNet18 and VGG19).

Conclusion: As shown in Figure 11, the transferability of UAPs is worse than that of SAPs; that is, SAPs leverage non-robust features more than UAPs do. This demonstrates that while UAPs are more human-aligned, they leverage only a small fraction of the statistical signal present in general non-robust features.

Figure 11 — Transferability of Universal and Standard Adversarial Attacks — The transfer attack success rate of universal adversarial perturbations falls behind that of perturbations generated in the standard adversarial setup. Transferability is evaluated on models trained independently of the attacked (source) model (Source [1]).
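A minimal sketch of the transfer measurement is shown below; the argument names are illustrative, and delta is assumed to have been crafted on a separate, independently trained source model.

```python
import torch

@torch.no_grad()
def transfer_asr(target_model, images, delta, target_class):
    # Apply a perturbation crafted on a *source* model to clean images and
    # measure how often an independently trained *target* model predicts
    # the intended target class.
    preds = target_model((images + delta).clamp(0, 1)).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```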

Interpolating Universality

One final thing to consider is: “to what extent can one interpolate between the properties of universal and standard non-robust features?”. To answer this question, we consider two parameters that control how the UAP is generated: (1) the number of samples used in the UAP optimization problem (the UAP is usually optimized over a mini-batch, which we refer to as the base set), and (2) the classes of the samples used to generate the UAP.

Effect of Base Set Size: The base set is the set of images used while solving the UAP optimization problem mentioned in the first section of this post. Usually, the optimization problem is solved on a base set that is smaller than the complete dataset, as carrying out the optimization on the whole dataset is quite expensive. We refer to the base set size as K. If K = 1 we are carrying out a standard adversarial attack, whereas if K > 1 we are generating a UAP from K samples. The effect of changing the base set size is shown in Table 1.

Table 1 — Effect of Changing Base Set Size on Test Accuracy — Increasing the base set size (K) that is used to generate the universal adversarial perturbation causes a drop in the generalization accuracy that is attained from using non-robust features obtained from UAP signals (Source [1]).

Generalization begins to suffer even for relatively small values of K (note that the test accuracy refers to the accuracy, on the original test set, of a model trained on non-robust features generated using a UAP with base set size K). For example, the generalization accuracy falls from 74% at K = 1 to 34% at K = 16. On the other hand, the semantic structure emerges more slowly: only when the base set size reaches K ≥ 64 do we obtain clearly more semantically meaningful UAPs. This is shown in Figure 12:

Figure 12 — Effect of Base Set Size on Semantic Locality — As the base set size (K) used to generate the universal adversarial perturbation is increased, we observe more semantically meaningful patches. The UAPs shown above are for the class bird, and we can see bird heads appearing at K ≥ 64 (Source [1]).

Conclusion: There is a clear trade-off between better semantics, which only become apparent at higher values of K (≥ 64), and generalization, which already suffers at relatively small values of K (≥ 16).

Class of Chosen Samples: Taking a further step into how the UAP is generated, the authors study the influence of the classes of the chosen base set samples. They consider three variations: random, where base samples are selected randomly from the dataset; single class, where all the base samples are chosen from the same class; and single sub-class, where a single category (containing multiple classes) is sampled to obtain the base set samples. The results are shown in Table 2.

Table 2 — Effect of Source Class on Test Accuracy — Experimenting with various source class choices shows that varying the base set source class still fails to close the large gap between the signal in UAPs and that in SAPs (Source [1]).

Conclusion: The results of these interpolation experiments show that the large gap in signal between UAPs and standard perturbations persists even when the level of “universality” is relaxed.

Conclusion

This work studies universal adversarial perturbations and shows that, unlike standard adversarial perturbations, they exhibit human-aligned properties. The authors characterize and quantify the degree to which UAPs are human-aligned in terms of semantic locality and spatial invariance. They then quantify the degree to which UAPs leverage non-robust features through experiments that study both generalization and transferability. The experiments show that UAPs contain a much weaker generalizable signal than standard perturbations.

This work demonstrates that examining UAPs may be a good direction for understanding specific properties of adversarial perturbations and associated phenomena, such as the prevalence and nature of non-robust features.

References

[1] Park, S.M., Wei, K., Xiao, K.Y., Li, J., & Madry, A. (2021). On Distinctive Properties of Universal Perturbations. ArXiv, abs/2112.15329.

[2] Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial Examples Are Not Bugs, They Are Features. ArXiv, abs/1905.02175.

[3] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ArXiv, abs/1706.06083.

[4] Benz, P., Zhang, C., Imtiaz, T., & Kweon, I.S. (2020). Universal Adversarial Perturbations are Not Bugs, They are Features.

[5] Moosavi-Dezfooli, S., Fawzi, A., Fawzi, O., & Frossard, P. (2017). Universal Adversarial Perturbations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 86–94.
