Photo by KP Bodenstein on Unsplash:

Fool the machine

Trick neural network classifiers
While neural networks are great at artificial intelligence tasks, they are not without flaws. In this article, we show how to create images that fool classifiers into believing they are seeing the wrong thing while maintaining visual similarity with a correctly classified image.

Artificial Neural Networks (ANNs) are certainly a wondrous achievement. They solve classification and other learning tasks with great accuracy. However, they are not flawless and might misclassify certain inputs. No problem, some error is expected. But what if you could give a network two inputs that are virtually the same and get different outputs? Worse, what if one is correctly classified and the other is manipulated so that it is classified as anything you want? Could these adversarial examples be the bane of neural networks?

That is exactly the premise of a PicoCTF challenge we came across recently. There is an application whose sole purpose is to let the user upload an image, classify it, and report the results. Our task is to take the image of a dog, which is correctly classified as a Malinois, and manipulate it so that it is classified as a tree frog. However, for the image to be a proper adversarial example, it must be perceptually indistinguishable from the original, i.e., it must still look like the same dog to a human:

Figure 1. Challenge description.

The applications are potentially endless. You could:

  • fool image recognition systems like physical security cameras, as a "stealth" T-shirt does;

  • make an autonomous car crash, for example with manipulated stop signs;

  • confuse virtual assistants by tricking speech recognition;

  • bypass spam filters, etc.

So, how does one go about creating such an adversarial example? Recall that in our brief survey of machine learning techniques, we discussed training neural networks. It is an iterative process in which you continuously adjust the weight parameters of your black box (the ANN) until the outputs agree with the expected ones or, at least, until the cost function, which is a measure of how wrong the prediction is, is minimized. I will borrow an image that better explains it from [2].

Figure 2. Training a neural network by [2].
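
To make the iterative weight adjustment concrete, here is a minimal sketch in plain Python. It assumes a hypothetical one-weight model y = w * x and a mean squared error cost; neither comes from the challenge, and the names train and cost are illustrative only. Real networks apply the same idea to millions of weights at once.

```python
def cost(w, samples):
    """Mean squared error: how wrong the predictions w * x are on average."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

def train(samples, learning_rate=0.1, steps=100):
    """Repeatedly nudge the weight against the gradient of the cost."""
    w = 0.0
    for _ in range(steps):
        # Derivative of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * gradient
    return w

# Toy data generated by the rule y = 2 * x (an assumption for this sketch).
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(samples)
print(w, cost(w, samples))
```

The learned weight should end up very close to 2, the value that makes the cost vanish on this toy data.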

The technique used to work out those weight adjustments is known as backpropagation in the lingo. Now, in order to obtain a picture that still looks like the original but is classified as something entirely different, one could try adding some noise: not too much, so the picture doesn’t visibly change, and not just anywhere, but exactly in the right places, so that the classifier reads a different pattern. Some clever folks from Google found that an efficient way to do this is to use the gradient of the cost function:

Figure 3. Adding noise to fool the classifier. From [1]

This is called the fast gradient sign method. The gradient can again be computed using backpropagation, only this time with respect to the input image rather than the weights. Since the model is already trained and we can’t modify it, we modify the picture little by little and check whether that gets us any closer to the target. I will again borrow from @ageitgey [2], since the analogy is much clearer this way:

Figure 4. Tweaking the image, by [2].
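
The idea of tweaking the image rather than the model can be tried out on a toy example. The sketch below is not the challenge model: it assumes a hypothetical frozen logistic "classifier" over 784 pixels with made-up weights, and all function names are illustrative. Each step moves every pixel by 0.007 in the direction given by the sign of the gradient of the target-class confidence.

```python
import math

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def predict(weights, bias, image):
    """Confidence of the toy classifier that the image is the target class."""
    z = sum(w * px for w, px in zip(weights, image)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(weights, bias, image):
    """Gradient of the target-class confidence with respect to each pixel."""
    p = predict(weights, bias, image)
    return [p * (1.0 - p) * w for w in weights]

# Hypothetical frozen model: weights alternate in sign, bias favors "not target".
weights = [0.05 if i % 2 == 0 else -0.05 for i in range(784)]
bias = -3.0

original = [0.5] * 784          # a flat gray "image"
adversarial = list(original)
steps = 0
while predict(weights, bias, adversarial) < 0.95:
    gradient = input_gradient(weights, bias, adversarial)
    adversarial = [px + 0.007 * sign(g) for px, g in zip(adversarial, gradient)]
    steps += 1

largest_change = max(abs(a - o) for a, o in zip(adversarial, original))
print(predict(weights, bias, original), predict(weights, bias, adversarial))
print(steps, largest_change)
```

Because every pixel contributes a little, many tiny per-pixel changes add up to a large swing in the output. This is the intuition behind the linearity argument in [1].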

The pseudo-code that would generate an adversarial example via this method would be as follows. Assume that the model is saved in a Keras h5 file, as in the challenge. Keras is a popular high-level neural networks API for Python. We can load the model, get the input and output layers (first and last), get the cost and gradient functions and define a convenience function that returns both for a particular input, like this:

Getting cost function and gradients from a neural network
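from keras.models import load_model
from keras import backend as K

# Load the trained classifier provided with the challenge.
model                  = load_model('model.h5')
# Symbolic tensors for the image going in and the class scores coming out.
input_layer            = model.layers[0].input
output_layer           = model.layers[-1].output
# The "cost" is the score the model assigns to the class we want to fake.
cost_function          = output_layer[0, object_type_to_fake]
# Gradient of that score with respect to the input pixels.
gradient_function      = K.gradients(cost_function, input_layer)[0]
# Callable returning both values for an image (learning phase 0 = test mode).
get_cost_and_gradients = K.function([input_layer, K.learning_phase()],
                                    [cost_function, gradient_function])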

Here, object_type_to_fake is the class number of what we want to fake. Now, as per the formula in figure 3 above, we repeatedly add a small fraction of the sign of the gradient to the image until the confidence of the classifier in the target class reaches at least 95%:

import numpy as np

confidence = 0.0  # score the model assigns to the target class
while confidence < 0.95:
    confidence, gradient = get_cost_and_gradients([adversarial_image, 0])
    adversarial_image += 0.007 * np.sign(gradient)
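
A loop like this puts no limit on how far the image may drift from the original. One common safeguard, which is an addition of ours and not part of the listing above, is to clip the result after every step so that no pixel moves more than a fixed budget and all values stay in the valid range. The helper name clip_to_budget and its parameters are hypothetical:

```python
def clip_to_budget(adversarial, original, max_change, low=0.0, high=1.0):
    """Keep each pixel within max_change of the original and inside [low, high]."""
    clipped = []
    for adv, orig in zip(adversarial, original):
        # First limit the distance to the original pixel value...
        adv = max(orig - max_change, min(orig + max_change, adv))
        # ...then make sure the value is still a valid pixel intensity.
        clipped.append(max(low, min(high, adv)))
    return clipped

original = [0.4, 0.1, 0.99]
tweaked = [0.5, 0.0, 1.2]
print(clip_to_budget(tweaked, original, max_change=0.05))
```

With NumPy arrays, the same effect can be had with two calls to np.clip, one using the per-pixel bounds and one using the valid pixel range.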

However, this procedure takes far too long without a GPU: a few hours, according to [2]. For the CTFer and the more practical-minded reader, there is a library that implements this and other attacks on machine learning systems in order to determine their vulnerability to adversarial examples: CleverHans. Using this library, we replace the expensive while loop above with two API calls: create an instance of the attack method, and then ask it to generate the adversarial example:

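from cleverhans.attacks import MomentumIterativeMethod

# Reuse the Keras session so the attack runs against the loaded model.
method = MomentumIterativeMethod(model, sess=K.get_session())
# Start from the image as a NumPy array; y_target is expected to hold the
# desired class as a one-hot vector.
test   = method.generate_np(adversarial_image, eps=0.3, eps_iter=0.06,
                            nb_iter=10, y_target=target)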
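
For the curious, the general idea behind a momentum iterative attack can be sketched in a few lines of plain Python. This is not the CleverHans implementation: it is a simplified illustration that assumes a hypothetical grad_fn returning the gradient of the score we want to raise. The parameters eps, eps_iter and nb_iter mirror the ones in the call above, while decay is a name chosen for this sketch.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def momentum_iterative(image, grad_fn, eps=0.3, eps_iter=0.06, nb_iter=10, decay=1.0):
    """Take nb_iter signed steps along an accumulated, normalized gradient."""
    original = list(image)
    adversarial = list(image)
    velocity = [0.0] * len(image)
    for _ in range(nb_iter):
        grad = grad_fn(adversarial)
        # Normalize by the L1 norm so every iteration weighs the same.
        l1_norm = sum(abs(g) for g in grad) or 1.0
        velocity = [decay * v + g / l1_norm for v, g in zip(velocity, grad)]
        # Step in the direction of the accumulated gradient's sign.
        adversarial = [a + eps_iter * sign(v) for a, v in zip(adversarial, velocity)]
        # Never stray more than eps from the original image.
        adversarial = [min(o + eps, max(o - eps, a))
                       for a, o in zip(adversarial, original)]
    return adversarial

# Toy target score (an assumption): a weighted sum of three "pixels".
toy_weights = [1.0, -2.0, 0.5]

def toy_score(image):
    return sum(w * px for w, px in zip(toy_weights, image))

def toy_gradient(image):
    return list(toy_weights)

start = [0.0, 0.0, 0.0]
result = momentum_iterative(start, toy_gradient)
print(result, toy_score(start), toy_score(result))
```

The accumulated velocity keeps the update direction stable from one iteration to the next, and the final clipping enforces the overall eps budget.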

Here we used a different attack, namely the MomentumIterativeMethod, which in this case gives better results than the FastGradientMethod, also a part of CleverHans. And so we obtain our adversarial example:

Figure 5. Adversarial image for the challenge

You can almost see the tree frog lurking in the back, if you imagine the two knobs on the cabinet are its eyes. Just kidding. Upload it to the challenge site and, instead of getting the predictions, we get the flag.

Not just that: the model, which is based on MobileNet, is 99.99974% sure this is a tree frog. Meanwhile, the difference from the original, according to the widely used perceptual hash algorithm, is less than two bits. Still, the adversarial example has visible artifacts, at least to a human observer.
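
To give a feel for what a hash difference measured in bits means, here is a rough sketch. It does not reproduce the perceptual hash used by the challenge; instead it uses the simpler average hash as a stand-in, and it assumes the image has already been reduced to a small grayscale grid. All names and pixel values are made up for illustration.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# A tiny 4x4 grayscale "image" (hypothetical values in 0..255).
original = [[10, 200, 30, 220],
            [15, 210, 25, 230],
            [240, 20, 250, 5],
            [235, 35, 245, 12]]

# Slightly perturbed copy: every pixel changes by 2 levels.
noisy = [[p + (2 if (r + c) % 2 == 0 else -2) for c, p in enumerate(row)]
         for r, row in enumerate(original)]

# A genuinely different picture: the negative of the original.
inverted = [[255 - p for p in row] for row in original]

print(hamming_distance(average_hash(original), average_hash(noisy)))
print(hamming_distance(average_hash(original), average_hash(inverted)))
```

Small, evenly spread perturbations barely move such a hash, whereas a different picture flips many bits. That is why a low bit difference is used as a proxy for "looks the same".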

Worse still, these issues persist across different models as long as the training data is similar. That means that we could probably pass the same image to a different animal image classifier and still get the same result.

So we should think twice before deploying ML-powered security measures. This is, of course, a toy example, but in more critical scenarios, deploying models that are not resistant to adversarial examples could be catastrophic. Apparently [1], the reason behind this weakness is the linearity of the functions hidden in these networks, so switching to a more non-linear model such as RBF networks could solve the problem. Another workaround could be to train the ANNs on adversarial examples as well.
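
Training on adversarial examples can be illustrated with a toy model. The sketch below assumes a hypothetical two-feature logistic regression and a made-up dataset. At each step it also trains on a copy of the sample shifted by the fast gradient sign method. It only shows the mechanics and makes no claim about how well this defense works on real networks.

```python
import math

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def predict(weights, bias, point):
    """Probability that the point belongs to class 1."""
    z = sum(w * x for w, x in zip(weights, point)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(weights, bias, point, label, eps):
    """Shift each feature by eps in the direction that increases the loss."""
    error = predict(weights, bias, point) - label
    # For cross-entropy loss, the gradient w.r.t. the input is error * weights.
    return [x + eps * sign(error * w) for x, w in zip(point, weights)]

def train(data, eps=0.0, learning_rate=0.5, epochs=200):
    """Stochastic gradient descent; with eps > 0, adversarial copies are added."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for point, label in data:
            batch = [point]
            if eps > 0:
                batch.append(fgsm(weights, bias, point, label, eps))
            for sample in batch:
                error = predict(weights, bias, sample) - label
                weights = [w - learning_rate * error * x
                           for w, x in zip(weights, sample)]
                bias -= learning_rate * error
    return weights, bias

def accuracy(weights, bias, data, eps=0.0):
    """Fraction classified correctly, optionally after an FGSM perturbation."""
    hits = 0
    for point, label in data:
        sample = fgsm(weights, bias, point, label, eps) if eps > 0 else point
        hits += (predict(weights, bias, sample) >= 0.5) == bool(label)
    return hits / len(data)

# Made-up, well separated toy data: class 0 near the origin, class 1 near (1, 1).
data = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([0.1, 0.3], 0),
        ([1.0, 1.0], 1), ([0.9, 0.8], 1), ([0.8, 1.0], 1)]

weights, bias = train(data, eps=0.1)
clean_accuracy = accuracy(weights, bias, data)
robust_accuracy = accuracy(weights, bias, data, eps=0.1)
print(clean_accuracy, robust_accuracy)
```

The second accuracy figure is measured on inputs perturbed against the final model, which is the kind of test worth running before deployment.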

Whatever the solution, it should be clear that one should test twice, and deploy once, adapting the old woodworker’s adage.


  1. I. Goodfellow, J. Shlens, C. Szegedy. Explaining and harnessing adversarial examples. arXiv.

  2. A. Geitgey. Machine Learning is Fun Part 8: How to Intentionally Trick Neural Networks. Medium.

Rafael Ballestas


with an itch for CS

