ML Security with the Adversarial Robustness Toolbox
Written by Tigran Avetisyan.
This is PART 2 in our 3 PART series about ML security with the Adversarial Robustness Toolbox.
You can read PART 1 here.
In PART 1, we had a look at the four attack methods implemented in the Adversarial Robustness Toolbox (ART). In this tutorial, we are going to focus on ART’s defense methods. We will again use TensorFlow/Keras and the MNIST digits dataset to showcase the defenses in ART.
Let’s get started without further ado!
Defense Types in the Adversarial Robustness Toolbox
As of version 1.10.1, the Adversarial Robustness Toolbox implements five types of defenses:
· Detector. You can use the detector defense to detect adversarial samples. There are different detection methods implemented in ART — Activation Defense, for example, uses a victim model’s activations to cluster samples into clean and adversarial groups. As of version 1.10.1, ART had detectors only for evasion and poisoning attacks.
· Transformer. Transformer defenses apply perturbations to the provided input to reveal adversarial samples. These defenses measure the effect that the perturbations had on the model output and flag outliers as poisoned samples. As of version 1.10.1, ART had transformer defenses only for poisoning and evasion attacks.
· Trainer. The trainer is a training-based defense where the model is trained on a combination of clean and adversarial samples. You essentially teach the model to predict the correct label for both clean and adversarial input. As of version 1.10.1, ART had trainer defenses only for evasion attacks.
· Preprocessor. The preprocessor defense applies some form of preprocessing to adversarial input to smoothen perturbations. This makes the adversarial samples look more like clean samples, which can improve model performance. ART’s preprocessing defenses may work for a wide range of attacks.
· Postprocessor. The postprocessing defense applies some form of processing to model outputs to hide their meaning from attackers. This defense is typically used against model stealing. Postprocessed outputs obscure how the model makes decisions, which can prevent model theft.
Like in PART 1, where we tried one method for each type of attack, we’ll try one method for each type of defense.
Prerequisites for Using ART
If you’ve read PART 1, then you already have everything necessary to follow along with this guide.
If you don’t have ART installed already, use this command to install it:
pip install adversarial-robustness-toolbox
conda users should use this command instead:
conda install -c conda-forge adversarial-robustness-toolbox
You will also need TensorFlow, which you can install with this command:
pip install tensorflow
Or this command in conda:
conda install -c conda-forge tensorflow
Make sure that you have NumPy and Matplotlib as well.
The Detector Defense in ART
Let’s now take a look at each of the defense methods in ART! You can find the full code for this tutorial here.
First up, let’s take a look at the detector defense. We will be using the activation defense method to detect poisoned data samples. In ART, this defense is implemented in the art.defences.detector.poison.ActivationDefence class.
You can read more about this defense method in this research paper. Essentially, activation defense separates clean and adversarial samples into clusters by analyzing neural network activations. ActivationDefence in ART uses k-means to cluster the provided samples.
Importing dependencies
We start by importing the dependencies necessary for building a poisoning attack and initializing activation defense.
We are again disabling eager execution from TF 2 because ART’s wrapper class KerasClassifier doesn’t fully support TF 2.
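For reference, a minimal set of imports for this section could look like the sketch below; the exact imports in the original notebook may differ slightly.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from art.estimators.classification import KerasClassifier
from art.attacks.poisoning import PoisoningAttackBackdoor
from art.attacks.poisoning.perturbations import add_pattern_bd
from art.defences.detector.poison import ActivationDefence
from art.utils import to_categorical

# ART's KerasClassifier wrapper doesn't fully support TF 2 eager execution,
# so we fall back to graph mode.
tf.compat.v1.disable_eager_execution()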
Poisoning data for the attack
We need to poison data to perform an attack. Let’s first load the MNIST digits dataset:
Then, let’s define a function that will poison our images and labels. Note that unlike in PART 1, we are using the standard backdoor poisoning attack, not the clean label backdoor attack.
The poisoning function might look intimidating, but it’s actually quite simple. A rough sketch of it is shown after the step-by-step breakdown below. In the function, we:
1. Create copies of the clean images and labels (lines 10 and 11 respectively). We need the copies because we will append the poisoned samples to them later.
2. Create a “poison indicator” array that will contain 1s and 0s (line 15). 1s will indicate poisoned samples, while 0s will indicate clean samples. Upon initialization, the array has the same length as the clean image array and only contains 0s because we don’t have poisoned samples yet.
3. Define the source (clean) labels on line 18. The labels in source_labels are integers.
4. Initialize a poisoning backdoor attack with the perturbation add_pattern_bd (line 21) — the same pattern we used in PART 1.
5. Iterate over the source and target labels provided to our custom poisoning function (line 24). The source label is the label that we want to replace with the current target (poisoned) label.
6. Calculate the number of samples (num_labels) with the current label in the loop (line 27).
7. Calculate the number of samples to poison by taking percent_poison of num_labels (line 31).
8. Get the clean images for the current clean label (line 34).
9. Pick num_poison random indices to poison (lines 37 to 40).
10. Get the images at the indices we’ve just picked for poisoning (line 43).
11. Convert the current integer target label (the poison label) to a categorical (line 46).
12. Poison the selected images and return the target label as their poisoned label (lines 49 to 52).
13. Append the poisoned images to the clean images (lines 55 to 59) and the poisoned labels to the clean labels (lines 62 to 66).
14. Append 1s to our poison indicator array is_poison to indicate that the added samples are poisoned (lines 69 to 72).
15. Return the poisoned images, labels, and the poison indicator array (line 75).
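The original listing (whose line numbers the steps above refer to) is not reproduced here, but a rough sketch of such a function could look like the following. The helper name poison_dataset is hypothetical, the labels are assumed to be one-hot encoded, and the imports from earlier are reused.

def poison_dataset(clean_images, clean_labels, target_labels, percent_poison):
    # 1. Copies of the clean data; poisoned samples will be appended to them.
    x_poison = np.copy(clean_images)
    y_poison = np.copy(clean_labels)

    # 2. Poison indicator: 0 = clean sample, 1 = poisoned sample.
    is_poison = np.zeros(len(clean_images))

    # 3. Source (clean) labels as integers.
    source_labels = np.arange(10)

    # 4. Backdoor attack that stamps a small pixel pattern onto the images.
    backdoor_attack = PoisoningAttackBackdoor(perturbation=add_pattern_bd)

    # 5. Iterate over the source labels and their poisoned target labels.
    for source_label, target_label in zip(source_labels, target_labels):
        # 6.-8. Select the clean images with the current source label.
        source_images = clean_images[np.argmax(clean_labels, axis=1) == source_label]
        num_labels = len(source_images)
        num_poison = round(percent_poison * num_labels)

        # 9.-10. Pick num_poison random images to poison.
        indices_to_poison = np.random.choice(num_labels, num_poison, replace=False)
        images_to_poison = source_images[indices_to_poison]

        # 11. The target (poison) label in categorical form.
        target_label_categorical = to_categorical(
            np.repeat(target_label, num_poison), nb_classes=10
        )

        # 12. Apply the backdoor pattern and attach the target labels.
        poisoned_images, poisoned_labels = backdoor_attack.poison(
            images_to_poison, y=target_label_categorical
        )

        # 13. Append the poisoned samples to the clean data.
        x_poison = np.append(x_poison, poisoned_images, axis=0)
        y_poison = np.append(y_poison, poisoned_labels, axis=0)

        # 14. Mark the new samples as poisoned.
        is_poison = np.append(is_poison, np.ones(num_poison))

    # 15. Return the poisoned images, labels, and the poison indicator array.
    return x_poison, y_poison, is_poison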
Let’s put this function in action and poison our train and test sets:
On line 2, we define the labels that should be attached to the poisoned samples. The poisoned target labels are [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]. Basically, we increase the true labels by 1, except for 9, which we set to 0.
On line 5, we define the percentage of samples to poison — 50%. On lines 6 to 10, we poison the first 10,000 samples in the training set. We are poisoning only a subset to speed up training. The poisoned train images are stored in train_images, while the labels are stored in train_labels.
On lines 13 to 17, we poison the test set. The poisoned test images are stored in test_images, while the labels are stored in test_labels.
On lines 20 and 21, we separate the poisoned and clean test samples from each other for later use. Notice that we are using the poison indicator array is_poison_test to distinguish between clean and poisoned samples.
Finally, on lines 24 to 28, we randomly shuffle the training data.
Visualizing the training data
Now, let’s take a quick look at our training data. We’ll need to create two helper functions for plotting to do this.
First, we have create_figure_axes:
This function creates a figure and axes that we will use to plot the images. We need this bit as a separate function because we will be repeatedly creating figures and axes throughout the tutorial.
The second function — plot_image — is as follows:
plot_image accepts the images and labels that we want to plot. Labels are set as titles for the axes in the figure. plot_image also accepts plot_label, which will be used to clarify the meaning of the axis titles. The Boolean parameter is_categorical indicates whether labels are categorical — if they are, they are converted to integers.
Let’s put these functions to use and plot 10 training samples:
We can see that the poisoned images have a small pixel pattern in their lower right corner. In addition, the labels for the poisoned images have been incremented by 1 (except for the label 9, which became 0), as described earlier. The labels for the clean images are unchanged.
Training a model on the poisoned dataset
Now, let’s train a simple neural net on the poisoned dataset.
We define the model architecture in the function create_model. We will be reusing this model architecture for the other defense methods as well.
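The original create_model is not shown here, but a small CNN along these lines would do for MNIST; the exact architecture in the notebook may differ.

def create_model():
    # Small convolutional network for 28x28 grayscale digits.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model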
Let’s now train the model on the poisoned dataset:
With our model architecture, training shouldn’t take a long time. If training takes too long for you, you can try reducing the training set size or changing the model architecture. But keep in mind that changing the model architecture might affect the effectiveness of our attacks and defenses.
To make sure that the attack is working, let’s plot ten clean test images along with their predictions:
The predictions look fine here — the model correctly identified most of the digits.
Next, let’s randomly pick 10 poisoned images from poisoned_test_images and take a look at how the model responds to them:
The poisoning attack seems to have been effective! Predicted labels are 1 more than the true labels.
To confirm that the attack worked, let’s evaluate the model’s performance on the clean and poisoned test sets. We’ll evaluate clean images against clean labels and poisoned images against poisoned labels.
We can see that the model predicts correctly on the clean images. It also predicts the target label on poisoned images, meaning that the backdoor worked.
Building a detector defense for the poisoning attack
We can now build a detector defense for the poisoning attack. The class ActivationDefence expects an ART classifier as its model, so we need to wrap our model in KerasClassifier first and then initialize the defense.
As a reminder, wrapping makes sure that ART can run attacks and defenses against the model. The wrappers add some extra functionality and reimplement some behavior for the underlying models.
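A sketch of the wrapping and defense initialization is shown below; the variable names (model_poisoned, classifier_poisoned) mirror the prose but are otherwise assumptions.

# Wrap the poisoned Keras model so that ART can work with it.
classifier_poisoned = KerasClassifier(model=model_poisoned, clip_values=(0, 1))

# Initialize the activation defense on the (partially poisoned) training data.
defense = ActivationDefence(
    classifier=classifier_poisoned,
    x_train=train_images,
    y_train=train_labels,
)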
We are passing the wrapped model to ActivationDefence (classifier=classifier_poisoned) along with the samples that we want to scan for poison. Note that we are passing the training images and labels, but you can pass test data as well.
To scan the samples for poison, we need to run the method defense.detect_poison:
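A sketch of the call might look like this:

report, is_clean_reported = defense.detect_poison(
    nb_clusters=2,    # separate the samples of each class into two clusters
    nb_dims=10,       # reduce the activations to 10 dimensions first
    reduce="PCA",     # use PCA for the dimensionality reduction
)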
The argument nb_clusters=2 tells the k-means algorithm how many clusters to detect. In our case, we are effectively telling k-means to separate clean and poisoned samples.
ActivationDefence reduces the dimensionality of the neural activations to accelerate clustering. reduce="PCA" indicates that the defense will use PCA (Principal Component Analysis) to reduce the activations’ dimensionality, and nb_dims=10 indicates that PCA will reduce the activations to 10 dimensions.
The method detect_poison returns two objects — report and is_clean_reported. report contains the results of the poison scan, while is_clean_reported is a list that indicates which of the provided samples are poisoned and which are not. is_clean_reported consists of 0s and 1s, where 0s stand for poisoned samples and 1s for clean samples. is_clean_reported is thus the inverse of the is_poison indicator arrays we made earlier.
report in our case looks like this:
We can see from the report that ActivationDefence separated the samples in each of the ten classes into two clusters. Poisoned clusters are tagged as suspicious.
To demonstrate what is_clean_reported means, let’s plot ten training images and the labels from is_clean_reported for them:
In the statement labels=np.array(is_clean_reported) == 1 (line 4), we convert is_clean_reported to a NumPy array to be able to compare each element to 1 and thus convert the 0s and 1s to True and False. This is just to make the plot more understandable.
It appears that our defense was able to correctly identify poisoned samples — at least among the ten samples that we plotted. To evaluate the overall performance of our defense, we can call its method evaluate_defence, passing our ground truth poison indicators to it. Here’s how this is done:
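A sketch of this evaluation is shown below; shuffled_indices is a hypothetical name for the permutation that was used to shuffle the training set earlier.

import json

# Ground truth: True (1) for clean samples, False (0) for poisoned ones --
# the inverse of the is_poison_train indicator array.
is_clean = is_poison_train == 0

# Compare the defense's verdict against the shuffled ground truth.
confusion_matrix_json = defense.evaluate_defence(is_clean=is_clean[shuffled_indices])

# Format the JSON confusion matrix so that it prints clearly.
print(json.dumps(json.loads(confusion_matrix_json), indent=4))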
On line 4, we invert our is_poison_train array to turn it into an array containing True (which behaves like the integer 1 in Python) and False (which behaves like the integer 0 in Python). The result is stored in is_clean. This is so that the meaning of the values in both is_clean and is_clean_reported is the same.
On line 7, we pass our ground truth is_clean to the method evaluate_defence, remembering to shuffle is_clean because we haven’t shuffled is_poison_train with the training set. evaluate_defence generates a confusion matrix that we can use to assess the defense’s effectiveness.
On lines 10 to 13, we format the confusion matrix so that it can be printed clearly. The result is as follows:
We can see that the defense is nearly perfect for our particular model and dataset! The defense only had a handful of false positives — the rest of the detections are correct.
Visualizing the detected clusters
One more thing that we can do with ActivationDefence is visualize the detected clusters. We can do this by calling the method defense.visualize_clusters, passing x_raw=train_images to indicate which images we want to get clusters for.
The complete code for getting and visualizing the clusters looks like this:
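The original listing is longer, but a condensed sketch of the same idea could be:

# Get sprite images for each (class, cluster) pair without saving them to disk.
sprites_by_class = defense.visualize_clusters(x_raw=train_images, save=False)

# Plot the two clusters of every class, one class per row.
num_classes = len(sprites_by_class)
fig, axes = plt.subplots(nrows=num_classes, ncols=2, figsize=(8, 4 * num_classes))

for class_id in range(num_classes):
    for cluster_id in range(2):
        axes[class_id, cluster_id].imshow(sprites_by_class[class_id][cluster_id])
        axes[class_id, cluster_id].set_title(f"Class {class_id}, cluster {cluster_id}")
        axes[class_id, cluster_id].axis("off")

plt.show()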
We have two clusters per class, which means there are 20 clusters in total. On lines 18 to 23 of the original listing, we iterate over the axes created on lines 8 to 12, plotting the clusters of one class per row.
The resulting plot is as follows:
For each class, we can clearly see that one cluster is for the clean samples, while the other one is for the poisoned samples.
Note that the image above is the compressed and resized version of the plot. You can find the full-size plot here.
The Transformer Defense in ART
The second type of defense in ART is the transformer defense. We will be using the STRIP (STRong Intentional Perturbation) method to build this defense. You can find out more about STRIP in this paper.
In ART, STRIP is implemented in the class art.defences.transformer.poisoning.STRIP.
The idea behind STRIP is pretty simple. Essentially, STRIP applies random perturbations to the provided samples to detect poisoned data. When you perturb a clean image, the model’s output changes drastically, whereas if you perturb a poisoned image that contains a backdoor, the prediction will not change because the model’s output will be heavily guided by the backdoor.
In other words, no matter how you perturb a poisoned image, the model should still predict the fake label because it puts high emphasis on the backdoor. Based on these differences, STRIP can distinguish between clean and poisoned images.
Poisoning data and training a model
For STRIP, let’s change the target labels to all 9s regardless of the input image. The rest of the poisoning routine will be the same as in the detector section.
After we have the poisoned data, let’s train the victim model on the new data:
Just in case, let’s confirm that the backdoor worked:
The attack was very effective!
Let’s now wrap the model in KerasClassifier to prepare it for the defense:
Detecting poisoned samples
We can now initialize the defense and run it:
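A sketch of the defense setup, assuming the wrapped poisoned classifier from the previous step is called classifier_poisoned:

from art.defences.transformer.poisoning import STRIP

# Initialize the STRIP transformer with the poisoned classifier.
strip = STRIP(classifier=classifier_poisoned)

# Calling the transformer wraps the classifier in STRIPMixin.
defense = strip()

# Calibrate the entropy distribution on the first 5,000 test images.
defense.mitigate(test_images[:5000])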
100%|██████████| 5000/5000 [00:12<00:00, 399.59it/s]
On line 2, we initialize the defense and pass our poisoned classifier to it. On line 5, we call the method __call__ of the object strip. What this does is wrap the poisoned model in the class art.estimators.poison_mitigation.STRIPMixin so that we can detect poison. __call__ returns the object defense, which is a protected classifier.
On line 8, we run the method defense.mitigate, passing the first 5,000 test images to it as input. STRIPMixin then calculates the normal entropy distribution for the provided images — the entropy will later be used to detect poisoned images.
We can then assess the performance of the STRIP defense. To do this, we need to obtain predictions from it:
The array poison_preds contains predictions on the poisoned test images, while clean_preds contains predictions on the remaining 5,000 clean test images.
The predictions returned by defense are just standard predictions except in one case — when STRIP detects poisoned samples. When this happens, STRIP returns a NumPy array of zeros for each poisoned sample. So rather than returning probabilities, STRIP returns a zero array (like [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) for the samples that it thinks are poisoned.
With that in mind, we can evaluate the performance of STRIP as follows:
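A sketch of this evaluation, assuming poisoned_test_images and clean_test_images are the poisoned and clean halves of the test set:

# Predictions on the poisoned and on the clean test images.
poison_preds = defense.predict(poisoned_test_images[:5000])
clean_preds = defense.predict(clean_test_images[:5000])

# STRIP "abstains" by returning an all-zero probability vector.
num_abstained_poison = np.sum(np.all(poison_preds == 0, axis=1))
num_abstained_clean = np.sum(np.all(clean_preds == 0, axis=1))
num_poison = len(poison_preds)
num_clean = len(clean_preds)

print(f"Abstained {num_abstained_poison}/{num_poison} poison samples "
      f"({100 * num_abstained_poison / num_poison:.2f}% TP rate)")
print(f"Abstained {num_abstained_clean}/{num_clean} clean samples "
      f"({100 * num_abstained_clean / num_clean:.2f}% FP rate)")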
On lines 2 and 3, we count the all-zero arrays in our predictions. num_abstained_poison is the number of correctly identified poisoned samples (true positives), while num_abstained_clean is the number of clean samples wrongly flagged as poisoned (false positives). num_poison and num_clean represent the total number of poisoned and clean predictions, respectively.
We can then use these counts to calculate the true positive rate for poisoned samples and the false positive rate for clean samples:
Abstained 808/5000 poison samples (16.16% TP rate)
Abstained 37/5000 clean samples (0.74% FP rate)
We can see that STRIP only detected 16.16% of the poisoned samples, meaning that it is ineffective against our particular attack and has a very high false negative rate. Additional tweaking may be required to achieve better results.
One thing to keep in mind when tweaking STRIP and other defense methods — you want to keep the false positive rate as low as you can. A high false positive rate can create a whole bunch of different issues for you. The same applies to false negatives. You might need to accept a trade-off between false positives and negatives, but generally, both should be kept low.
To demonstrate how false positives and false negatives might impact the performance of a real-world system, let’s consider autonomous vehicles. Suppose we are trying to protect the computer vision system of an autonomous vehicle from adversarial stop signs:
The purpose of an adversarial stop sign is to prevent the vehicle from correctly identifying a stop sign. An autonomous vehicle might not recognize an adversarial stop sign and drive into a potentially dangerous area.
Let’s now consider three situations:
· Your defense measure produces false positives for clean stop signs. In other words, the defense mislabels a clean stop sign as adversarial. If this is the case, the vehicle should stop as expected, but you might need to dedicate considerable resources to investigating why your system mislabels clean stop signs.
· Your defense measure produces false positives for objects that are not actually stop signs. Poorly designed defenses might mislabel objects that actually aren’t even stop signs. If the computer vision system of the vehicle sees a pattern that’s similar to the perturbations that it was trained to recognize in adversarial stop signs, it might unnecessarily stop in completely safe areas.
· Your defense measure produces false negatives for adversarial stop signs. If an autonomous vehicle fails to recognize a stop sign due to adversarial perturbations, it will continue driving. The vehicle could thus place the driver into a potentially life-threatening situation.
In short, false positives can be an inconvenience in that they may force you to spend time and money troubleshooting your adversarial defense measures. False positives can also create friction between the driver and the autonomous vehicle, worsening the user experience. You should definitely minimize false positives, but they are not the end of the world in most cases.
False negatives, in contrast, are much more dangerous. They can lead to property damage, injury, or even the death of the driver or bystanders. This could result in lost business, reputation damage, and even legal action against the maker of the vehicle.
Our explanation extends to other industries and applications as well. Generally, false negatives are much more dangerous than false positives. But in some areas, false negatives don’t lead to consequences that are as severe. As an example from outside of adversarial AI, if a spam detection system doesn’t detect a spam email, the end user will likely only experience mild annoyance.
With all that in mind, when designing any type of AI defense, you should decide for yourself what levels of false positives and false negatives are appropriate.
The Trainer Defense in ART
The trainer defense is a type of defense where we inject some adversarial samples into the training set to teach the model to produce the correct output regardless of perturbations.
You can learn more about adversarial training in this research paper. In ART, adversarial training is implemented in the class art.defences.trainer.AdversarialTrainer. We will show you how you can use AdversarialTrainer to protect your models from the Fast Gradient Method.
Training vulnerable and robust models
To demonstrate how the trainer defense works, we’ll train two models. One model will be trained on our training set as standard, while the other will be trained with AdversarialTrainer.
Let’s initialize our models:
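As a sketch, the two classifiers could be set up like this, reusing the create_model helper; the clip values are an assumption.

# Model trained only on clean data.
vulnerable_classifier = KerasClassifier(model=create_model(), clip_values=(0, 1))

# Model that will be trained with the adversarial trainer.
robust_classifier = KerasClassifier(model=create_model(), clip_values=(0, 1))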
We will train vulnerable_classifier on 10,000 samples from the original clean dataset:
Next, let’s initialize the Fast Gradient Method attack on the vulnerable classifier:
We can now use the attack object to define the adversarial trainer. Here’s how this is done:
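A sketch of the attack and trainer setup; the eps value for the Fast Gradient Method is an assumption, not the value from the original listing.

from art.attacks.evasion import FastGradientMethod
from art.defences.trainer import AdversarialTrainer

# Fast Gradient Method attack built on the vulnerable classifier (previous step).
attack_fgm = FastGradientMethod(estimator=vulnerable_classifier, eps=0.15)

# Adversarial trainer: half of every training batch will be adversarial.
trainer = AdversarialTrainer(
    classifier=robust_classifier,
    attacks=attack_fgm,
    ratio=0.5,
)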
attacks=attack_fgm defines which attacks AdversarialTrainer should use to train the robust classifier. You can supply more than one evasion attack to attacks. But do keep in mind that the training times may increase significantly if you add more than one attack.
ratio=0.5 determines the proportion of adversarial samples in each batch. In our case, half of the samples will be adversarial.
To train the robust classifier, you just call trainer.fit. We are again training the robust classifier on the first 10,000 samples of the original clean training set.
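A sketch of the call; the variable names for the clean training data are assumptions.

trainer.fit(
    x=train_images_original[:10000],
    y=train_labels_original[:10000],
    nb_epochs=10,
)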
Precompute adv samples: 100%|██████████| 1/1 [00:01<00:00, 1.37s/it]
Adversarial training epochs: 100%|██████████| 10/10 [00:11<00:00, 1.11s/it]
After you call fit, the trainer will train the underlying robust_classifier.
Evaluating the effectiveness of the trainer
To check the results of adversarial training, let’s generate adversarial samples using the Fast Gradient Method:
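For example, again treating the variable names as assumptions:

# Craft adversarial test images with the same Fast Gradient Method attack.
test_images_adversarial = attack_fgm.generate(x=test_images_original)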
Next, let’s test how the performance of vulnerable_classifier is affected by the adversarial images:
The adversarial samples clearly affected the performance of the model. The attack was effective!
Let’s now see whether our robust_classifier classifies adversarial samples better:
It seems that the robust model was able to correctly identify adversarial samples. The performance of the robust classifier on adversarial samples is nearly the same as the performance of the vulnerable classifier on clean samples. Only the test loss is slightly higher for the robust model.
To wrap up this defense, let’s plot ten adversarial test images along with the predictions from the vulnerable and robust classifiers:
The Preprocessor Defense in ART
The fourth type of defense in ART is the preprocessor defense. We will showcase the preprocessor defense Total Variance Minimization, which is outlined in this paper. Here’s how the authors describe this method at a high level:
This approach randomly selects a small set of pixels, and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized.
In ART, Total Variance Minimization is implemented in the class art.defences.preprocessor.TotalVarMin. We will reuse the adversarial samples we generated with the Fast Gradient Method for the previous defense.
Setting up the defense
Here’s how you use TotalVarMin to clean up adversarial images:
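A sketch of the preprocessor setup, reusing the assumed names from the previous section:

from art.defences.preprocessor import TotalVarMin

# Total Variance Minimization with an L1 norm term.
defense_tvm = TotalVarMin(norm=1, clip_values=(0, 1))

# Clean up the first 1,000 adversarial test images.
# Preprocessors return a (samples, labels) tuple; we only need the samples.
test_images_cleaned, _ = defense_tvm(test_images_adversarial[:1000])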
norm=1 sets the norm term of Total Variance Minimization to 1 — the default is 2. A norm of 1 demonstrated decent effectiveness for our use case.
On line 10, we apply Total Variance Minimization to our first 1,000 adversarial test samples. We are using 1,000 samples to speed up preprocessing.
Before we test the effectiveness of this defense, let’s take a look at the differences between adversarial and cleaned samples:
Total Variance Minimization smoothed the images quite significantly, eliminating much of the perturbation from them.
Evaluating the preprocessor defense
Let’s now see how much of an effect preprocessing had, using the vulnerable classifier we trained earlier:
Preprocessing was quite effective, increasing test accuracy from 0.41 to 0.80 and decreasing test loss from 3.47 to 0.62. The improvement is certainly noticeable, but the trainer defense actually worked better for this particular attack.
To wrap up this section, let’s plot ten cleaned images along with the predictions on adversarial and cleaned samples:
The Postprocessor Defense in ART
Finally, let’s showcase the postprocessing defense. We will be using the Reverse Sigmoid defense, which is implemented in the class art.defences.postprocessor.ReverseSigmoid. You can learn more about the uses of Reverse Sigmoid in AI security in this research paper.
To demonstrate how ReverseSigmoid works, we will build a model extractor using CopycatCNN — the same method we used in PART 1 to steal a neural network.
Training a victim model
To train a victim model, let’s divide our original training set into two subsets — one with 50,000 samples (for the victim model) and another with 10,000 samples (which will be used to train the stolen model). Like in PART 1, we are simulating a situation where an attacker has a dataset similar to the original data used to train the victim model.
After that, we train the victim model as usual:
Setting up a postprocessing defense
Let’s now set up a postprocessing defense:
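A sketch of this setup, assuming the trained victim model is called model_victim; the beta and gamma values are illustrative, not the ones from the original listing.

from art.defences.postprocessor import ReverseSigmoid

# Reverse Sigmoid postprocessor.
postprocessor = ReverseSigmoid(beta=1.0, gamma=0.2)

# Unprotected classifier: a plain wrapper around the victim model.
unprotected_classifier = KerasClassifier(model=model_victim, clip_values=(0, 1))

# Protected classifier: the same model with the postprocessing defense attached.
protected_classifier = KerasClassifier(
    model=model_victim,
    clip_values=(0, 1),
    postprocessing_defences=postprocessor,
)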
We initialize ReverseSigmoid on lines 2 to 5. The values we pass for the parameters beta and gamma work for the particular case we’ll take a look at below.
We then apply the defense to our classifier through the wrapper class KerasClassifier by passing postprocessing_defences=postprocessor to it (line 16).
For comparison, we also create a classifier without the postprocessing defense (unprotected_classifier).
Let’s now inspect the output of the models to see how the postprocessor defense works. Let’s check the unprotected output first (both in one-hot form and in class form):
The output of the unprotected model is as expected and contains probabilities associated with each class.
Now, let’s take a look at the postprocessed output:
We can see that the probabilities are completely different. For the predicted class, ReverseSigmoid still produces the highest probability, which keeps the actual prediction the same. But for the other classes, the postprocessor assigns very similar probability values to obfuscate the relationship between model inputs and outputs.
And as a side note, even though the probabilities are different with ReverseSigmoid, they still add up to 1 (or almost to 1):
[0.99999994 0.99999994 0.9999999 0.99999994 1.0000001 1.0000001 0.99999994 1. 0.99999994 1. ]
Building and training CopycatCNNs
Now, let’s try to steal our protected and unprotected models using CopycatCNN.
The procedure for stealing models is the same as in PART 1. We first initialize reference models for CopycatCNN to train:
Then, we initialize the model extractors for each of the models:
And finally, we extract the models:
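Condensing the three steps above into one sketch; train_images_stolen stands for the attacker’s 10,000-sample subset and is an assumed name.

from art.attacks.extraction import CopycatCNN

# Fresh reference models that the extractors will train.
model_stolen_unprotected = KerasClassifier(model=create_model(), clip_values=(0, 1))
model_stolen_protected = KerasClassifier(model=create_model(), clip_values=(0, 1))

# One extractor per victim classifier.
copycat_unprotected = CopycatCNN(
    classifier=unprotected_classifier,
    nb_epochs=10,
    nb_stolen=len(train_images_stolen),
)
copycat_protected = CopycatCNN(
    classifier=protected_classifier,
    nb_epochs=10,
    nb_stolen=len(train_images_stolen),
)

# Run the extraction on the attacker's dataset.
stolen_unprotected = copycat_unprotected.extract(
    x=train_images_stolen, thieved_classifier=model_stolen_unprotected
)
stolen_protected = copycat_protected.extract(
    x=train_images_stolen, thieved_classifier=model_stolen_protected
)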
Let’s now check the performance of the stolen models to see if the extraction was successful:
It appears that the extraction was successful. This means that our defense didn’t work. However, there is a small tweak we can make in our CopycatCNN extractors to change the results.
Building and training probabilistic CopycatCNNs
Now, when initializing our CopycatCNN objects, we are going to pass the argument use_probability=True to their constructors, like so:
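A sketch of the probabilistic extractors, again reusing the assumed names from above:

# Probabilistic extractors: use the victim's probability outputs as labels.
copycat_probabilistic_unprotected = CopycatCNN(
    classifier=unprotected_classifier,
    nb_epochs=10,
    nb_stolen=len(train_images_stolen),
    use_probability=True,
)
copycat_probabilistic_protected = CopycatCNN(
    classifier=protected_classifier,
    nb_epochs=10,
    nb_stolen=len(train_images_stolen),
    use_probability=True,
)

# Reinitialize the reference models so we don't train on top of the old ones.
model_stolen_unprotected = KerasClassifier(model=create_model(), clip_values=(0, 1))
model_stolen_protected = KerasClassifier(model=create_model(), clip_values=(0, 1))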
On lines 24 and 25, we reinitialize the reference models because we don’t want to train on top of the old models.
Let’s now steal the models using the probabilistic extractors.
The probabilistic extractor didn’t have issues with stealing the unprotected model.
Let’s now see how the second extractor will do with the protected model:
We can see that the probabilistic extractor for the protected model is having issues with stealing. The training losses are very high, while the accuracies are low.
To wrap up this section, let’s compare the test metrics of the models:
Model extraction was unsuccessful. But why?
By default, CopycatCNN does not use probabilities as labels in its training loop. It instead converts the probabilistic output of the victim model to integer class predictions. And as we established earlier, ReverseSigmoid does not change the actual class predictions, which is why the defense didn’t work in the first test.
But after we set use_probability to True, CopycatCNN tried to use the probabilities as labels. And because the probabilities produced by ReverseSigmoid lack insight into how the model weighs each class, the extractor struggles to achieve good performance.
Next Steps
Defending models from adversarial attacks can be quite a difficult task. You may need to try different defense methods with different parameters to achieve a good level of protection. Even then, you should remember that AI defenses have limitations and may not generalize well to different types of attacks.
As the next step, you can test defenses in ART against other types of attacks. You could also try combining different types of defenses together to hopefully build a more robust model.
In PART 3 of this series, we are going to build a web application that will allow you to test the robustness of your model against different types of attacks! You will be able to upload your own TF/Keras models and receive performance metrics for a few different attack types.
Stay tuned and until next time!