ML Security with the Adversarial Robustness Toolbox

Part 1 — Attacking Machine Learning Models

Kedion
17 min read · Apr 26, 2022

Written by Tigran Avetisyan.

Developing machine learning models and putting them into production can be very challenging. However, successfully deploying an ML pipeline is just part of the story — you also need to think about keeping it secure.

Machine learning models are used in a wide range of areas, like finance, medicine, and surveillance, and can be highly accurate at tasks such as detecting fraud or filtering out faulty products. In applications like fraud detection, scammers have a strong incentive to fool machine learning systems so that they miss scam emails or phishing links.

In this series of tutorials, we are going to have a look at the Adversarial Robustness Toolbox and figure out how it can help you with securing your machine learning pipelines. In PART 1, we will focus on adversarial attacks, and we will be using the MNIST digits dataset along with the TensorFlow/Keras ML framework.

Let’s get started!

What is the Adversarial Robustness Toolbox?

The Adversarial Robustness Toolbox, or ART, is a Python framework for machine learning security. ART contains attack and defense tools that can help teams better understand adversarial attacks and develop protection measures based on experimentation.

https://adversarial-robustness-toolbox.org/

ART has 39 attack modules, 29 defense modules, and supports a wide range of machine learning frameworks, including scikit-learn, PyTorch, TensorFlow, and Keras. ART also supports several machine learning tasks (including classification, regression, and generation) and works with all data types (audio, video, images, or tables).

The Adversarial Robustness Toolbox was originally developed and published by IBM. In July 2020, IBM donated ART to the Linux Foundation AI (LFAI). Since then, LFAI has maintained and developed updates for the toolkit.

Attack Types in the Adversarial Robustness Toolbox

Because PART 1 in this series focuses on attacks, let’s take a deeper look at the attack types supported by ART.

At a high level, there are 4 types of adversarial attacks implemented in ART:

· Evasion. Evasion attacks typically work by perturbing input data to cause a trained model to misclassify it. Evasion is done after training and during inference, i.e. when models are already deployed in production. Adversaries perform evasion attacks to avoid detection by AI systems. As an example, adversaries might run an evasion attack to cause the victim model to miss phishing emails. Evasion attacks might require access to the victim model.

· Extraction. Extraction is an attack where an adversary attempts to build a model that is similar or identical to a victim model. In simple words, extraction is the attempt of copying or stealing a machine learning model. Extraction attacks typically require access to the original model, as well as to data that is similar or identical to the data originally used to train the victim model.

· Inference. Inference attacks generally aim at reconstructing a part or the entirety of the dataset that was used to train the victim model. Adversaries can use inference attacks to reconstruct entire training samples, separate features, or determine if a sample has been used to train the victim model. Inference attacks typically require access to the victim model. In some cases, attackers might also need to have access to some portion of the data used to train the model.

· Poisoning. Poisoning attacks aim to perturb training data to corrupt the victim model during training. Poisoned data contains features (called a backdoor) that trigger the desired output in a trained model. Essentially, the perturbed features cause the model to overfit to them. As a very simple example (which we’ll have a look at in code below), an attacker could poison the digits in the MNIST dataset so that the victim model classifies all digits as 9s. Poisoning attacks require access to the training data of a model before the actual training occurs.

ART’s documentation supplies this neat graph that shows how the attacks work at a high level:

https://adversarial-robustness-toolbox.readthedocs.io/en/latest/index.html

For the vast majority of implemented attacks, ART supplies links to the research papers that provide more detail on a given attack. So if you want to learn more about a specific attack, look for paper links in ART’s documentation.

Below, we will take a look at each of the attack types supported by ART. We will implement one attack per type, but what you’ll see below should transfer to other attacks of the same type as well.

How do attacks on machine learning pipelines happen?

As we pointed out above, adversaries typically need some form of access to your machine learning model or its training data to perform an attack. But assuming that your model is hosted in an environment that an adversary can’t reach, how do attacks on ML pipelines even happen? In more practical terms, how can an adversary gain access to your model and training data?

Here are just some of the cases in which adversaries can obtain access to your pipeline and data:

· You are using a dataset from an unverified source. An adversary could poison data and publish it somewhere for unsuspecting teams to use. Datasets from untrustworthy sources or datasets that are not verified are at higher risk of being compromised.

· You are using a model from an unverified source. Similar to training data, adversaries can publish models for victims to use. They can later use their knowledge of the model to perform attacks on it. Models from suspicious sources might also be pre-trained on poisoned data.

· You are using a model that is available publicly. Suppose that you are using an NLP model from Hugging Face. Hugging Face models are available to anyone. An adversary could analyze a publicly available model and use their knowledge to attack other similar models used by other teams. In fact, adversaries don’t need to have the exact same model as you do — knowledge of a model that performs the same task as yours can be enough for an attack.

· An adversary has access to your ML pipeline. If a malicious insider gains access to your pipeline — by leveraging their position inside the organization, for example — they can get in-depth knowledge about its architecture, weights, and training data. They might also know about the measures that you use to protect your model, which makes malicious insiders arguably the biggest threat to ML pipeline security.

This isn’t an exhaustive list, but it should give you an idea of how your model or training data could become compromised.

Prerequisites for Using ART

To follow along with this tutorial, install ART by using this command:

pip install adversarial-robustness-toolbox

If you are using conda, use this command instead:

conda install -c conda-forge adversarial-robustness-toolbox

We used ART version 1.10.0, but newer versions should work fine as well.

If necessary, you can install ART just for the specific ML/DL framework that you will be using. You can learn more about this here.

Aside from ART, you will need whichever ML/DL framework you want to attack. We are going to be using TensorFlow, which you can install by using this command:

pip install tensorflow

Or if you are using conda, this command:

conda install -c conda-forge tensorflow

We are also going to be using NumPy and Matplotlib, so make sure that you have these libraries as well.

Evasion Attacks in ART

Let’s start with evasion attacks in ART. As mentioned earlier, an evasion attack is when the attacker perturbs input at inference time to cause the model to misclassify it.

As of ART version 1.10.0, most of the attacks supported by the framework were evasion attacks.

We will be using the Fast Gradient Method to generate adversarial samples from our test set. You can read more about the Fast Gradient Method in this paper.

Note that you can find the full code for this tutorial here. You can find more examples from the authors of ART here and here.

Importing dependencies

As usual, we start by importing dependencies:
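
A minimal set of imports for this guide could look like the following (the ART and TensorFlow paths are their public APIs; everything else is our own setup):

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

from art.utils import load_dataset
from art.estimators.classification import KerasClassifier
from art.attacks.evasion import FastGradientMethod

# KerasClassifier does not fully support TF 2 eager mode, so switch it off
tf.compat.v1.disable_eager_execution()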

Note that we are disabling eager execution in TensorFlow for this guide. This is because the wrapper class art.estimators.classification.KerasClassifier doesn't fully support TF 2.

Loading data

To load the MNIST dataset, we are going to use ART’s function art.utils.load_dataset. This function returns tuples for the train and test sets, as well as the minimum and maximum feature values in the dataset.
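
A sketch of the loading step (the variable names min_ and max_ are our own):

(x_train, y_train), (x_test, y_test), min_, max_ = load_dataset("mnist")

print(x_train.shape, y_train.shape)  # (60000, 28, 28, 1) (60000, 10)
print(min_, max_)                    # 0.0 1.0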

The images are already normalized to the [0, 1] range, while the labels are one-hot encoded. We don’t need to do any preprocessing on the dataset.

Training a TensorFlow Keras model

Now, let’s create a simple TensorFlow Keras model:
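
The exact architecture is not critical; as one possibility, a small convolutional network along these lines works well on MNIST:

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)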

And train it:
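
For example (the number of epochs and the batch size are arbitrary choices):

model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)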

If training takes too long, you can try simplifying the model. And if you encounter out-of-memory (OOM) issues, reducing the batch size should help.

Defining an evasion attack on our model

As the next step, let’s define an evasion attack for our model.

To be able to run attacks on our model, we must wrap it in the art.estimators.classification.KerasClassifier class. Here’s how this is done:
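
A minimal sketch, assuming min_ and max_ are the bounds returned by load_dataset:

classifier = KerasClassifier(model=model, clip_values=(min_, max_))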

The argument model=model indicates our model, while clip_values=(min, max) specifies the minimum and maximum values allowed for the features. We are using the values provided by the art.utils.load_dataset function.

Instead of KerasClassifier, you can also try TensorFlowV2Classifier, but note that its usage is different.

We then define the attack by using ART’s FastGradientMethod class:
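
For example (the variable name attack_fgsm is our own):

attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.3)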

estimator=classifier indicates that the attack targets our wrapped classifier, while the argument eps=0.3 sets the maximum perturbation, essentially defining how strong (and how visible) the attack will be.

We can then generate adversarial samples by calling the attack object’s method generate, passing to it the target images that we want to perturb:
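
For instance, perturbing the entire test set:

x_test_adv = attack_fgsm.generate(x=x_test)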

Evaluating the effectiveness of the attack

Let’s take a look at one adversarial sample:
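
A quick way to do this with Matplotlib:

# Show the first adversarial image and the model's prediction for it
plt.imshow(x_test_adv[0].squeeze(), cmap="gray")
plt.title(f"Prediction: {np.argmax(classifier.predict(x_test_adv[:1]))}")
plt.show()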

The fast gradient method applied noise to the clean test images. We can see the effect of the attack by comparing the performance of our model on the clean and adversarial sets:
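
One way to run the comparison (model.evaluate returns the loss and the accuracy metric we compiled the model with):

clean_loss, clean_acc = model.evaluate(x_test, y_test)
adv_loss, adv_acc = model.evaluate(x_test_adv, y_test)

print(f"Clean test set loss: {clean_loss:.2f} vs adversarial set test loss: {adv_loss:.2f}")
print(f"Clean test set accuracy: {clean_acc:.2f} vs adversarial test set accuracy: {adv_acc:.2f}")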

Clean test set loss: 0.04 vs adversarial set test loss: 5.95
Clean test set accuracy: 0.99 vs adversarial test set accuracy: 0.07

The attack has affected the model considerably, making it completely unusable. However, because the perturbations in the image are very evident, it would be very easy for the victim to figure out that they are being attacked.

Lowering the eps value can make the tampering less visible, but the impact on the model will be lessened as well. We’ve tried 10 different values for eps to explore their effect on the visual appearance of adversarial samples and on the model’s performance.
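
A condensed sketch of the experiment (the specific eps values and the figure layout are our own choices):

fig, axes = plt.subplots(2, 5, figsize=(15, 7))
eps_to_try = np.linspace(0.01, 0.5, 10)

for ax, eps in zip(axes.flatten(), eps_to_try):
    # Generate adversarial samples for the current eps value
    attack = FastGradientMethod(estimator=classifier, eps=eps)
    x_adv = attack.generate(x=x_test)

    # Accuracy on the full adversarial test set
    _, acc = model.evaluate(x_adv, y_test, verbose=0)

    # Prediction for the sample image we display
    pred = np.argmax(classifier.predict(x_adv[:1]))

    ax.imshow(x_adv[0].squeeze(), cmap="gray")
    ax.set_title(f"eps: {eps:.2f}\naccuracy: {acc:.2f}\nprediction: {pred}")
    ax.axis("off")

plt.tight_layout()
plt.show()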

Here, we create a figure with ten subplots and define the eps values to try. For each eps value, we generate adversarial samples, plot a sample adversarial image, evaluate the model's accuracy on the full adversarial set, and get a prediction for the sample adversarial image that is being displayed.

Each eps value, the test accuracy for it, and the prediction for the sample adversarial image are displayed above the corresponding image.

The resulting plot is as follows:

We can see that higher eps values produce more visible noise in the image and have a bigger impact on the model’s performance. And while the model predicted the correct label for this particular image 9 times out of 10, the overall performance of the model on the entire adversarial test set was much worse.

Extraction Attacks in ART

As the second step, let's look at extraction attacks in ART. Extraction attacks, as a reminder, aim to copy or steal a victim model.

Let’s use the class art.attacks.extraction.CopycatCNN to perform the attack. You can learn more about this attack method in this paper.

Training a victim model

For this attack, let’s separate our training dataset into two subsets — one with 50,000 samples and the other with 10,000.
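
The split itself is simple slicing (the variable names are chosen to match the nb_stolen argument used further below):

train_images_original, train_labels_original = x_train[:50000], y_train[:50000]
train_images_stolen, train_labels_stolen = x_train[50000:], y_train[50000:]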

The subset with 50,000 samples will be used to train the original model, while the subset with 10,000 samples will be used to steal the original model. Basically, we are simulating a situation where an adversary has a dataset that is similar to the original dataset.

Let’s train our original model on its dataset:
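
A sketch that reuses the architecture from the evasion section (hyperparameters are again arbitrary):

model_original = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])
model_original.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model_original.fit(train_images_original, train_labels_original, epochs=10, batch_size=128)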

After training, we again wrap the original model into KerasClassifier:
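
As before:

classifier_original = KerasClassifier(model=model_original, clip_values=(min_, max_))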

Defining and running an extraction attack

Next, let’s create our model thief, using the class CopycatCNN:
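
A sketch of the thief (the batch sizes and the number of epochs are our own choices):

from art.attacks.extraction import CopycatCNN

copycat_cnn = CopycatCNN(
    classifier=classifier_original,
    batch_size_fit=128,
    batch_size_query=128,
    nb_epochs=10,
    nb_stolen=len(train_images_stolen),
)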

Note that the argument nb_stolen=len(train_images_stolen) essentially determines how many samples ART will use to train the stolen model.

After that, we need to create a blank reference model that copycat_cnn will train to steal the original model:
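
The blank model just needs to be a compiled classifier wrapped in KerasClassifier; here we reuse the same architecture as the original:

model_stolen = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])
model_stolen.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

classifier_stolen = KerasClassifier(model=model_stolen, clip_values=(min_, max_))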

We then use the method copycat_cnn.extract to steal classifier_original. We are using the subset with 10,000 samples to steal the model.
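
The extraction call looks roughly like this:

classifier_stolen = copycat_cnn.extract(
    x=train_images_stolen,
    y=train_labels_stolen,
    thieved_classifier=classifier_stolen,
)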

Evaluating the performance of the stolen model

Let’s compare the performance of the original and stolen models on the test set:
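
One way to do this, evaluating the underlying Keras models directly:

loss_original, acc_original = model_original.evaluate(x_test, y_test)
loss_stolen, acc_stolen = model_stolen.evaluate(x_test, y_test)

print(f"Original test loss: {loss_original:.2f} vs stolen test loss: {loss_stolen:.2f}")
print(f"Original test accuracy: {acc_original:.2f} vs stolen test accuracy: {acc_stolen:.2f}")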

Original test loss: 0.04 vs stolen test loss: 0.08
Original test accuracy: 0.99 vs stolen test accuracy: 0.98

The models perform very similarly, so it appears that the model theft was successful. With that said, additional testing might be required to determine if the stolen model indeed performs well.

One thing to keep in mind here — the more data you have, the better the stolen classifier will be. We can see this by testing the effect of the subset size on the performance of the stolen model:
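
A sketch of that experiment; apart from the sizes discussed below, the subset sizes and the rest of the setup are our own choices:

sizes = [250, 500, 1000, 2500, 5000, 7500, 10000]
losses, accuracies = [], []

for size in sizes:
    # A fresh blank model for every run
    model_thief = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(10, activation="softmax"),
    ])
    model_thief.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    classifier_thief = KerasClassifier(model=model_thief, clip_values=(min_, max_))

    # Steal the original classifier using the first `size` samples of the thief subset
    copycat = CopycatCNN(
        classifier=classifier_original,
        batch_size_fit=128,
        batch_size_query=128,
        nb_epochs=10,
        nb_stolen=size,
    )
    classifier_thief = copycat.extract(
        x=train_images_stolen[:size],
        y=train_labels_stolen[:size],
        thieved_classifier=classifier_thief,
    )

    # Record test performance for this subset size
    loss, acc = model_thief.evaluate(x_test, y_test, verbose=0)
    losses.append(loss)
    accuracies.append(acc)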

Let’s now visualize the test losses for each subset:
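
For instance:

plt.figure(figsize=(8, 5))
plt.plot(sizes, losses, marker="o")
plt.xlabel("Subset size")
plt.ylabel("Test loss of the stolen model")
plt.show()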

The resulting plot is as follows:

The jump from 2,500 to 5,000 samples produced the biggest improvement, although there also are noticeable differences between subsets of 5,000, 7,500, and 10,000 samples.

The same applies to the test accuracy, though to a much lesser degree:

Inference Attacks in ART

Now, let’s try inference attacks. Inference attacks aim to obtain knowledge about the dataset that was used to train a victim model. In this guide, we are going to try model inversion — an attack where the adversary tries to recover the training dataset of the victim model.

As of version 1.10.0, ART supported only one model inversion algorithm — MIFace. MIFace uses class gradients to infer the training dataset. You can learn more about MIFace in this paper.

Defining the attack

The first step to model inversion with MIFace is instantiating its class. We are going to apply the attack to the classifier we trained earlier for the evasion attack.
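
A sketch of the instantiation (the max_iter and threshold values here are just examples):

from art.attacks.inference.model_inversion import MIFace

attack = MIFace(classifier, max_iter=10000, threshold=1.0)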

Adjust the parameter max_iter based on your hardware — if you find that inversion takes too long, reduce the value.

After that, we need to define the targets that we want to infer samples for:
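
We simply ask for every digit class:

y = np.arange(10)
print(y)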

[0 1 2 3 4 5 6 7 8 9]

In our case, y consists of integer labels. You can also provide one-hot labels with the shape (nb_samples, nb_classes).

Aside from the targets, we also need to define an initialization array that MIFace will use to infer images. Let’s use the average of the test images as the initialization array — you can also use an array of zeros, ones, or any other array that you think might work.
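
For the average-image initialization:

# Per-pixel average of the test images, repeated once per target class
x_init_average = np.zeros((10, 28, 28, 1)) + np.mean(x_test, axis=0)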

Note that the batch dimension of x_init_average has the same length as our target list y.

We also need to calculate class gradients with our initialization array to make sure that they have sufficient magnitude for inference:
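
One way to check the gradient magnitudes, reshaping so that we get a single number per class (this is a sketch, not necessarily the exact computation behind the output below):

class_gradient = classifier.class_gradient(x_init_average, y)
class_gradient = np.reshape(class_gradient, (10, 28 * 28))
print(np.max(class_gradient, axis=1))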

[0.14426005 0.1032533 0.0699798 0.04295066 0.00503148 0.01931691 0.02252066 0.00906549 0.06300844 0.16753715]

The gradients for some classes are larger than for others. We can see that the gradients at index positions 4 and 7 — which correspond to the digits 4 and 7 — are really small compared to others.

If the gradients for a particular class are too small, the attack might not be able to recreate its corresponding sample. It’s therefore important to check the class gradients before running an attack. If you find that the gradients are small, you can try another initialization array.

Running a model inversion attack

We can now run model inversion, using the initialization array and target labels:
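
The call itself is a one-liner, though it can take a while depending on max_iter and your hardware:

x_infer = attack.infer(x_init_average, y)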

Model inversion: 100%|██████████| 1/1 [00:51<00:00, 51.23s/it]
Wall time: 51.2 s

Let’s now inspect the inferred images:
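
For example:

fig, axes = plt.subplots(1, 10, figsize=(18, 2))
for digit, ax in enumerate(axes):
    ax.imshow(x_infer[digit].squeeze(), cmap="gray")
    ax.set_title(digit)
    ax.axis("off")
plt.show()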

MIFace managed to recover most of the images, though the digits 2 and 7 aren’t that great.

You can try other initialization arrays to see if you can get a better representation of the digits. For starters, try arrays of gray, black, or white pixel values.

Poisoning Attacks in ART

Finally, let’s try poisoning attacks. In a poisoning attack, an adversary perturbs samples in the training dataset to cause the model to overfit to them. The perturbations in the samples are called a backdoor. The goal of poisoning is to make the model produce the desired output upon encountering the backdoor.

To perform poisoning attacks, we are going to use backdoor attacks (paper) and clean label backdoor attacks (paper).

Poisoning sample data

To start, let’s import dependencies and see how data poisoning works:
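
A sketch of the setup (the class and function names are ART's; the target array is built to match the printed output below):

from art.attacks.poisoning import PoisoningAttackBackdoor
from art.attacks.poisoning.perturbations import add_pattern_bd
from art.utils import to_categorical

# The backdoor perturbation adds a small pattern to each image it poisons
backdoor = PoisoningAttackBackdoor(add_pattern_bd)

# Five one-hot labels for the digit 5 (the label the poisoned images will get)
target = np.tile(to_categorical([5], nb_classes=10), (5, 1))
print("The target labels for poisoning are\n", target)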

The target labels for poisoning are
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]

We start by defining a backdoor attack, using the perturbation add_pattern_bd. By default, this perturbation adds a small pattern in the lower right corner of the target image.

Next, we define a target label for the backdoor attack. The attack will replace the real labels with our target label. In our case, we are generating five target labels because we want to show how this attack works on five images.

To poison clean images, use the method backdoor.poison. This method returns perturbed images along with the fake labels. You can use the perturbed images and the fake labels for training.
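
Poisoning five clean test images purely for illustration:

poisoned_images, poisoned_labels = backdoor.poison(x_test[:5], y=target)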

Note that poisoned_labels is the exact same as the array target that we provided. backdoor returns the fake labels along with the poisoned images for your convenience.

The perturbations are clearly visible in the poisoned images. The small pattern at the bottom of each image is meant to cause the target model to overfit to them. Because fake training labels are provided for the images, the model will learn to associate the patterns with the fake labels. If the attack is successful, the model will classify any image that contains the pattern as 5.

Defining a backdoor attack

Now, let's define a backdoor attack for training, but with a little twist. We can go one step further and combine the standard backdoor attack with the clean label backdoor attack. The clean label backdoor attack works a bit differently — at a high level, it also perturbs the target images, but it keeps the original labels. Hence the name "clean label."

Here’s how you define a clean label backdoor attack in ART:
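
A sketch of the definition; here, classifier is the wrapped model from the evasion section, and only the arguments discussed below are spelled out:

from art.attacks.poisoning import PoisoningAttackCleanLabelBackdoor

# One-hot label for the digit 9, the class whose images will be poisoned
target = to_categorical([9], nb_classes=10)[0]

attack_clean_label = PoisoningAttackCleanLabelBackdoor(
    backdoor=backdoor,
    proxy_classifier=classifier,
    target=target,
    pp_poison=0.75,
)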

We’ve provided a number of arguments to PoisoningAttackCleanLabelBackdoor, including pp_poison=0.75. This argument determines the fraction of images that should be poisoned. target=target defines the target whose samples should be poisoned. Our attack will poison 75% of the images of the digit 9.

proxy_classifier=classifier specifies that our original classifier will be used to poison the dataset. We are essentially using a classifier that is similar to the victim classifier to help us poison the data.

Let’s poison a subset of our training samples — we will later use this poisoned subset to train a victim model. We are not using the entire training dataset because poisoning can take a long time.
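
For example, poisoning the first 10,000 training samples (the subset size is an arbitrary choice):

x_poison, y_poison = attack_clean_label.poison(x_train[:10000], y_train[:10000])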

To better understand what happened, let’s visualize the poisoned images along with their clean originals:
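
A condensed sketch of the comparison plot:

# Indices of the samples whose label is 9 (the poisoned class)
nine_idx = np.argmax(y_train[:10000], axis=1) == 9
poisoned_nines = x_poison[nine_idx]
clean_nines = x_train[:10000][nine_idx]

n_rows = 5
fig, axes = plt.subplots(n_rows, 2, figsize=(6, 3 * n_rows))
axes[0, 0].set_title("Poisoned")
axes[0, 1].set_title("Clean")

for row in range(n_rows):
    axes[row, 0].imshow(poisoned_nines[row].squeeze(), cmap="gray")
    axes[row, 1].imshow(clean_nines[row].squeeze(), cmap="gray")
    axes[row, 0].axis("off")
    axes[row, 1].axis("off")

plt.show()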

Above, we obtain the indices of the samples whose labels correspond to our target label 9. These are the indices of the images that our attack poisoned. We then use these indices to pull out the corresponding poisoned and original images.

After that, we create a figure with two columns. The left column shows images from the poisoned dataset, while the right column shows the original clean images. Finally, we iterate over the rows, displaying each poisoned image on the left and its clean counterpart on the right.

The resulting plot is as follows:

We can see that some of the images of the digit 9 are noticeably perturbed. We can also see that the perturbed images contain the pattern applied by PoisoningAttackBackdoor.

So what happened?

1. First, PoisoningAttackCleanLabelBackdoor took 75% of the images of the digit 9 and applied an initial perturbation to them. If we take a look at the source code of this attack's class, we can see that the initial attack is ProjectedGradientDescent (see the class in the docs and the paper for the method). The perturbations made the samples look less like the digit 9.

2. After that, the perturbed samples were passed to PoisoningAttackBackdoor, which added the small pattern in their lower right corner.

3. Finally, PoisoningAttackCleanLabelBackdoor returned the perturbed images along with the original labels.

Basically, this attack modifies the appearance of the digit 9 so that the model cannot reliably use its outlines for classification. The attack also forces the model to overfit to the pattern in the lower right corner of each poisoned image, making the model associate that pattern with the digit 9. So at inference time, every image that contains the pattern should be classified as a 9.

Training a victim classifier

Next, let’s define a new victim classifier:
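
One possible victim, a small fully connected network (the exact architecture here is an assumption on our part):

model_poisoned = Sequential([
    Flatten(input_shape=(28, 28, 1)),
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),
])
model_poisoned.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])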

The reason why we are not reusing the original model architecture is that it wasn’t susceptible to data poisoning in our tests. This might be the case with many other model architectures as well — they might be resistant to some forms of poisoning and not others. Techniques against overfitting might increase resistance to poisoning as well.

Let’s train the victim model on the poisoned dataset:
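
For example (epochs and batch size are again arbitrary):

model_poisoned.fit(x_poison, y_poison, epochs=10, batch_size=128)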

Poisoning data at inference time

Now that we have a model with a backdoor, let’s poison the test set to see if the backdoor works. We will poison all samples that are not nines. We will keep the original labels for performance testing purposes.
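
A sketch of that step, using the plain backdoor from earlier so that only the pattern is added:

# Select the test samples that are not 9s
not_nine_idx = np.argmax(y_test, axis=1) != 9
x_test_not_nine = x_test[not_nine_idx]
y_test_not_nine = y_test[not_nine_idx]

# Add the backdoor pattern but keep the true labels for evaluation
x_test_poisoned, _ = backdoor.poison(x_test_not_nine, y=y_test_not_nine)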

Let’s visualize the poisoned images along with their true labels:

We can see that the attack added the pattern in the lower right corner of each image. Now, the model should classify these images as the digit 9 because it has overfit to that pattern.

Let’s evaluate the performance of our model on clean vs poisoned images to see if the attack worked:
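
One way to run the comparison:

clean_loss, clean_acc = model_poisoned.evaluate(x_test, y_test)
poisoned_loss, poisoned_acc = model_poisoned.evaluate(x_test_poisoned, y_test_not_nine)

print(f"Clean test loss: {clean_loss:.2f} vs poisoned test loss: {poisoned_loss:.2f}")
print(f"Clean test accuracy: {clean_acc:.2f} vs poisoned test accuracy: {poisoned_acc:.2f}")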

Clean test loss: 0.13 vs poisoned test loss: 2.31
Clean test accuracy: 0.97 vs poisoned test accuracy: 0.60

We can see that the backdoor did work, though perhaps not as effectively as an actual attacker would have liked.

And as the final step, let’s plot a few poisoned images along with their predictions:

Our attack wasn’t super-effective, though we can see that it did work for some samples. Additional tweaking of the attack might be necessary to achieve better results.

Next Steps

You should go ahead and try out the other attack methods supported in ART! You can also play around with attack parameters to understand how they impact the effectiveness of attacks.

In PART 2, we will take a look at the defense measures implemented in the framework. ART has a pretty wide range of defenses against the attacks we’ve had a look at, so there’s a lot to explore!

Until next time!
