Adversarial Attacks: The Hidden Risk in AI Security

Marin Ivezic
Mar 1, 2023


The proliferation of AI and Machine Learning (ML), from facial recognition systems to autonomous vehicles and personalized medicine, underscores the criticality of cybersecurity in these areas.

AI and ML are revolutionizing cybersecurity. They can swiftly analyze vast data sets, pinpointing anomalies that might indicate cyber threats, tasks that would be daunting for human analysts. Machine learning models can also adapt to new types of threats as they evolve, offering a dynamic defense against cyber adversaries.

Importance of Understanding Threats in AI/ML

While AI and ML provide potent tools for defending against traditional cybersecurity threats, they also introduce a new class of vulnerabilities that are exclusive to machine learning models. The algorithms that can recognize a fraudulent credit card transaction in milliseconds or identify a possible cyber-intrusion also have their Achilles’ heel. If we’re deploying AI systems in critical areas such as healthcare, transportation, or national security, a failure to understand and prepare for the range of potential threats could have severe repercussions.

Why Should We Care about Adversarial Attacks?

Adversarial attacks specifically target the vulnerabilities in AI and ML systems. At a high level, these attacks involve inputting carefully crafted data into an AI system to trick it into making an incorrect decision or classification. For instance, an adversarial attack could manipulate the pixels in a digital image so subtly that a human eye wouldn’t notice the change, but a machine learning model would classify it incorrectly, say, identifying a stop sign as a yield sign, with potentially disastrous consequences in an autonomous driving context.

In essence, adversarial attacks exploit the very foundation of what makes AI and ML so effective: their ability to learn from data. By understanding the weaknesses inherent in the learning mechanisms, attackers can deceive these systems, causing them to behave unpredictably or incorrectly. Therefore, understanding adversarial attacks is not just a niche concern for researchers; it’s a pressing issue for anyone relying on AI and ML for security measures.

What are Adversarial Attacks?

Definition and Explanation

In the realm of machine learning, an adversarial attack is a deliberate and often subtle manipulation of the input data fed into a machine learning model. The goal of this manipulation is to mislead the model into making an incorrect prediction or classification. Unlike typical errors or inaccuracies that might arise from a poorly trained model or random noise in the data, adversarial attacks are intentional efforts that exploit specific vulnerabilities within the model’s learning mechanisms.

In simpler terms, adversarial attacks trick a machine learning model by feeding it deceptive data that looks almost identical to regular input data. These attacks specifically aim to exploit the way the model “thinks,” leading it to incorrect conclusions or actions.

Types of Adversarial Attacks

White-Box Attacks: In these types of attacks, the attacker has complete access to the target machine-learning model. This includes the architecture of the model, the weights and biases used in its algorithms, and sometimes even the data used to train it. Because of this inside knowledge, white-box attacks can be incredibly precise, exploiting particular weaknesses in the model’s understanding.

Example: Imagine a facial recognition system used in a high-security facility. An attacker with complete access to the model could subtly alter the pixel values of their own face in a digital image so that the system mistakenly recognizes them as an authorized person.

Black-Box Attacks: In contrast, black-box attackers have no direct knowledge of the model’s internal workings. They only have access to the model’s input and output. Despite this limited information, black-box attacks can still be effective by using sophisticated methods to probe the model’s behavior and discern its vulnerabilities. These are often real-world scenarios where the attacker doesn’t have insider access to the system.

Example: Let’s consider an e-commerce recommendation system. An attacker without direct access to the model can repeatedly query the system and observe the recommendations made for various types of input. Over time, they could deduce the model’s behavior and trick it into recommending undesired products.

Targeted Attacks: In targeted adversarial attacks, the attacker aims for a specific output. For example, they may manipulate an image of a cat to be misclassified specifically as a dog by the machine learning model. These types of attacks require a more in-depth understanding of the model’s behavior and are generally harder to execute.

Example: A person might want to deceive a sentiment analysis model into thinking a negative review is positive. By changing specific words or adding hidden characters, they can manipulate the model into categorizing the review as positive.

Non-Targeted Attacks: These are designed to cause a misclassification, but the attacker doesn’t specify what the incorrect classification should be. The attacker’s main goal is to cause the model to be wrong, no matter what alternative classification or prediction the model makes.

Example: In a voice-activated assistant like Siri or Alexa, an attacker might simply want the system to misunderstand the user without caring about what specific mistake occurs. Adding noise to the audio that is imperceptible to humans but disruptive to the model could achieve this.
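To make the black-box setting above more concrete, here is a minimal sketch of how an attacker with query-only access might probe a model. Everything in it is an assumption for illustration: the "hidden" model is a toy logistic regression, and the names make_black_box and estimate_gradient are hypothetical. The attacker never reads the weights; it estimates the gradient of the output score from repeated queries and then takes a non-targeted step.

```python
import numpy as np

def make_black_box(w, b):
    """Return a query-only scoring function; w and b stay hidden from the caller."""
    def query(x):
        return 1.0 / (1.0 + np.exp(-(x @ w + b)))  # P(class 1 | x)
    return query

def estimate_gradient(query, x, h=1e-4):
    """Central finite differences: two queries per input dimension."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (query(x + step) - query(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
hidden_w = rng.normal(size=5)   # toy model parameters (illustrative assumption)
hidden_b = 0.1
query = make_black_box(hidden_w, hidden_b)

x = rng.normal(size=5)
est_grad = estimate_gradient(query, x)

# Non-targeted step: push the score toward the other side of the 0.5 threshold
direction = -1.0 if query(x) >= 0.5 else 1.0
x_adv = x + direction * 0.5 * np.sign(est_grad)

score_before = query(x)
score_after = query(x_adv)
print(f"score before: {score_before:.3f}, after: {score_after:.3f}")
```

Real black-box attacks tend to be far more query-efficient than this coordinate-wise probe, but the principle is the same: inputs and outputs alone can leak enough information to steer the model.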

Real-World Examples

Autonomous Vehicles

Self-driving cars primarily rely on a suite of sensors, including LiDAR, cameras, and sometimes radar, to perceive their environment. These sensors feed data into machine learning algorithms trained to identify road signs, obstacles, and other elements critical for navigation and decision-making. Adversarial attacks could target these perception algorithms by carefully designing optical illusions that mislead the sensor data interpretation.

Example: An attacker could place stickers on a stop sign in a way that the camera’s object detection algorithm misidentifies it as a yield sign or another type of road sign. This could cause the vehicle to ignore what should be a mandatory stop, potentially leading to hazardous situations.

Voice-Activated Assistants

Voice-activated systems like Siri and Alexa use automatic speech recognition (ASR) and natural language processing (NLP) algorithms to convert spoken language into a format that machines can understand. These systems often use Fourier transforms to analyze frequency components and recurrent neural networks (RNNs) for sequence modeling. An adversarial attack could introduce a layer of carefully crafted noise or acoustic perturbations to the voice input.

Example: By analyzing the spectral components of the speech and understanding the assistant’s decision boundary, an attacker can inject inaudible frequencies or phase-shifted signals into the audio. While humans wouldn’t notice the difference, these alterations can significantly change the command the assistant parses, potentially leading to unauthorized actions.

Healthcare Diagnostics

Diagnostic AI tools often employ convolutional neural networks (CNNs) for image classification tasks, such as determining whether an X-ray indicates the presence of a tumor. These networks scan through the image with various filters to identify key features like edges, textures, and shapes. An adversarial attack can target the convolutional layers by altering the pixel values in the input image to produce false features or obscure real ones.

Example: Consider an X-ray scan revealing early signs of lung cancer. An attacker could minutely modify the grayscale values of pixels in regions where the tumor appears, enough to alter the CNN’s feature maps. The resulting diagnosis might wrongly indicate healthy lung tissue, potentially leading to delays in treatment.

How Do Adversarial Attacks Work?

Adversarial attacks manipulate machine learning models by introducing carefully crafted perturbations to input data. Understanding the mechanism involves diving into loss landscapes, backpropagation strategies, and the particulars of gradient-based optimization algorithms. Here, we go a layer deeper into these facets.

Perturbations: Beyond Simple Noise

In the scope of adversarial attacks, perturbations are not random noise but calculated manipulations derived through optimization techniques. They are formulated to maximize the model’s loss function, effectively skewing its predictions or classifications. For instance, in image recognition tasks, perturbations can be as minuscule as altering the RGB values of selective pixels to mislead the model’s feature maps in convolutional layers.

Gradient-Based Optimization Algorithms

The primary objective for attackers is to solve an optimization problem that identifies the smallest perturbation capable of causing misclassification. For this, attackers often use gradient-based optimization algorithms such as stochastic gradient descent (SGD) or variants like Adam and RMSProp, applied to the input rather than to the model’s weights, which stay fixed during the attack.

First-Order Optimization: Most adversarial attacks utilize first-order derivatives to calculate the direction in which the input should be perturbed to maximize the loss function. Here, backpropagation plays a pivotal role in computing the gradients efficiently.

Higher-Order Derivatives: Advanced attack strategies might even incorporate second-order optimization techniques, where the Hessian matrix comes into play, to find optimal perturbations. These are computationally more intensive but offer higher precision in crafting adversarial examples.
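The optimization view above can be written down compactly. The notation here is an assumption introduced for illustration and does not come from any specific paper: x is the clean input, y its true label, δ the perturbation, f_θ the model with fixed weights θ, L the loss, and ε the perturbation budget under an ℓ_p norm.

```latex
% Bounded loss maximization: the strongest perturbation within a budget \epsilon
\max_{\|\delta\|_p \le \epsilon} \; \mathcal{L}\big(f_\theta(x + \delta),\, y\big)

% Minimal-perturbation form: the smallest change that flips the predicted label
\min_{\delta} \; \|\delta\|_p
\quad \text{s.t.} \quad \arg\max_k \, f_\theta(x + \delta)_k \ne y

% Second-order Taylor expansion of the loss around x
\mathcal{L}(x + \delta) \approx \mathcal{L}(x)
  + \nabla_x \mathcal{L}(x)^\top \delta
  + \tfrac{1}{2}\, \delta^\top H \, \delta,
\qquad H = \nabla_x^2 \mathcal{L}(x)

% Keeping only the first-order term with an \ell_\infty budget gives the sign step
\delta^\star = \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(x)\big)
```

First-order attacks drop the Hessian term and rely on the gradient alone, which backpropagation supplies at roughly the cost of one training step. Second-order methods keep the Hessian term and pay for it in computation.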

Adversarial Training: Resilience at a Computational Cost

Adversarial training is a widely employed defensive strategy that involves enhancing the model’s training dataset with adversarial examples and their corresponding correct labels. Although this method can effectively improve model resilience against adversarial attacks, it also introduces several computational and practical complexities:

Extended Training Cycles: Including adversarial examples effectively enlarges the dataset, thus extending the number of epochs necessary for the model to reach a satisfactory level of convergence. This not only increases computational time but also requires greater storage and memory resources.

Regularization Imbalance: Adversarial training inherently serves as a form of regularization. However, it may introduce an imbalance in the model’s ability to generalize, leading to potential issues of overfitting or underfitting. This necessitates a more cautious fine-tuning of hyperparameters like dropout rates and regularization terms.

Model Complexity: To effectively defend against a wide variety of adversarial attack techniques, the architecture of the model may need to become more complex to capture higher-order interactions and dependencies. This complexity further increases the computational burden, making it more challenging to deploy the model in resource-constrained environments.

Robustness-Accuracy Trade-off: Enhancing the model to defend against adversarial examples often results in a trade-off with standard accuracy. In other words, as the model becomes more resistant to adversarial attacks, it may become less effective at correctly classifying non-adversarial examples, particularly those that sit near decision boundaries.
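A toy version of adversarial training helps show where the extra cost comes from. The sketch below is an illustration under stated assumptions, not a production recipe: the model is a NumPy logistic regression, the data are two synthetic Gaussian blobs, and the helper names (input_gradient, fgsm, train, accuracy) are hypothetical. When eps is greater than zero, every training step first generates adversarial copies of the data with the current weights and then trains on the doubled batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, b, X, y):
    """Gradient of the cross-entropy loss with respect to each input row."""
    return (sigmoid(X @ w + b) - y)[:, None] * w[None, :]

def fgsm(w, b, X, y, eps):
    """One-step sign perturbation that increases the loss on the true label."""
    return X + eps * np.sign(input_gradient(w, b, X, y))

def train(X, y, eps=0.0, epochs=200, lr=0.1):
    """Gradient descent; when eps > 0, each step also trains on FGSM examples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        if eps > 0:
            # Adversarial examples are regenerated against the current weights
            X_batch = np.vstack([X, fgsm(w, b, X, y, eps)])
            y_batch = np.concatenate([y, y])
        else:
            X_batch, y_batch = X, y
        err = sigmoid(X_batch @ w + b) - y_batch
        w -= lr * X_batch.T @ err / len(y_batch)
        b -= lr * err.mean()
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean((sigmoid(X @ w + b) >= 0.5) == y))

# Synthetic two-class data (illustrative assumption)
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.5, 1.0, size=(n, 2)),
               rng.normal(1.5, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

eps = 0.5
results = {}
for name, train_eps in [("standard", 0.0), ("adversarial", eps)]:
    w, b = train(X, y, eps=train_eps)
    clean = accuracy(w, b, X, y)
    attacked = accuracy(w, b, fgsm(w, b, X, y, eps), y)
    results[name] = (clean, attacked)
    print(f"{name:12s} clean={clean:.3f} under FGSM={attacked:.3f}")
```

Even in this small example, the adversarial variant processes twice as many rows per step and runs an extra gradient computation to craft the perturbed copies. With deep networks and iterative attacks, that overhead multiplies.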

Common Techniques for Crafting Adversarial Examples

Fast Gradient Sign Method (FGSM): FGSM computes the gradient of the loss function with respect to the input data. The input is then perturbed in the direction of this gradient’s sign. This method is computationally efficient because it is a single-step approach that needs only one gradient computation (one forward and one backward pass), but it is often less precise when targeting specific labels.

Jacobian-based Saliency Map Attack (JSMA): JSMA employs the Jacobian matrix to identify and perturb the most sensitive input features incrementally. This method is computationally more demanding but allows for targeted attacks.

DeepFool: DeepFool works by iteratively linearizing the classifier around the current input and moving the input toward the nearest decision boundary of that linear approximation, which yields an estimate of the minimum perturbation needed for misclassification. This method is particularly effective for multi-class classification problems and is known for its precision.

Carlini & Wagner (C&W) Attacks: C&W attacks focus on optimizing a specific objective function that not only seeks to cause misclassification but also aims to keep the perturbations imperceptible. It uses an optimization process that minimizes the distance metric between the original and perturbed examples while ensuring the perturbed input is misclassified.

Universal Adversarial Perturbations: Unlike other methods that create instance-specific perturbations, Universal Adversarial Perturbations are designed to be effective across a wide range of inputs. This method computes a single perturbation vector that, when applied to any input from a dataset, is likely to cause misclassification.

Projected Gradient Descent (PGD): PGD can be considered an iterative version of FGSM and is often called the strongest first-order adversary. At each iteration, it takes a small step in the direction of the gradient (or its sign) and then projects the perturbed example back into the allowed perturbation region around the original input and into the valid input range, which generally makes it a stronger method for crafting adversarial examples than single-step attacks.
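Two of the techniques listed above, FGSM and PGD, are simple enough to sketch in a few lines. The example below is a minimal illustration and not a reference implementation: it assumes a toy logistic-regression model with hand-picked weights, inputs scaled to the range [0, 1], and an ℓ∞ budget eps. The function names and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for a single input x with label y."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss_grad_x(w, b, x, y):
    """Gradient of the loss with respect to the input, not the weights."""
    return (sigmoid(x @ w + b) - y) * w

def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the input gradient."""
    return x + eps * np.sign(loss_grad_x(w, b, x, y))

def pgd(w, b, x, y, eps, step=0.05, iters=20):
    """Repeated small sign steps, each followed by a projection."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad_x(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid input
    return x_adv

# Toy model and input (illustrative assumptions)
w = np.array([2.0, -1.0, 0.5])
b = -0.5
x = np.array([0.6, 0.3, 0.5])
y = 1.0
eps = 0.2

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps)

for name, candidate in [("clean", x), ("fgsm", x_fgsm), ("pgd", x_pgd)]:
    p = sigmoid(candidate @ w + b)
    print(f"{name:5s} P(class 1)={p:.3f} loss={loss(w, b, candidate, y):.3f}")
```

On a linear model like this one, both attacks end up at the same corner of the eps-ball, so they are equally strong. The gap between FGSM and PGD opens up on non-linear networks, where the gradient direction changes as the input moves.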

Techniques to Detect Adversarial Attacks

In the realm of adversarial machine learning, detecting an adversarial attack is often the first line of defense. One approach that has gained traction is input reconstruction, primarily using autoencoders [1]. Autoencoders learn to compress and then reconstruct the input data. A significant discrepancy between the original and reconstructed data could be indicative of an adversarial perturbation [2]. However, this technique may incur computational overhead, especially for high-dimensional data. Another intriguing direction is statistical anomaly detection, where the focus is on detecting abnormal patterns in the model’s output probabilities or its internal layer activations [3]. This often involves real-time monitoring of the neural network layers to identify anomalies that may indicate adversarial interference. The use of statistical measures like chi-square tests or entropy-based measures helps in quantifying these anomalies. Lastly, a relatively new area of interest is reverse-engineering the optimization algorithm responsible for generating the adversarial example [4]. While this is computationally intensive and complex, it offers a promising way to pinpoint the exact nature of the attack, thereby opening doors for more tailored defense strategies.
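The input-reconstruction idea can be illustrated with a very small stand-in. The sketch below swaps the autoencoder for PCA, which behaves like a linear autoencoder, so that the example stays self-contained. The synthetic data, the 1.5x threshold rule, and the helper names are all assumptions made for illustration and are not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "clean" data lying near a 2-D subspace of a 10-D input space
basis = rng.normal(size=(2, 10))
X_clean = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

# Linear autoencoder stand-in: PCA via SVD, keeping two components
mean = X_clean.mean(axis=0)
_, _, vt = np.linalg.svd(X_clean - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(X):
    """Distance between each input and its compress-then-reconstruct copy."""
    centered = np.atleast_2d(X) - mean
    recon = centered @ components.T @ components
    return np.linalg.norm(centered - recon, axis=1)

# Hypothetical rule: flag anything well above the worst clean training error
threshold = 1.5 * reconstruction_error(X_clean).max()

def is_suspicious(x):
    return bool(reconstruction_error(x)[0] > threshold)

x = X_clean[0]
x_perturbed = x + 0.3 * np.sign(rng.normal(size=10))  # crude stand-in for an attack

print("clean flagged:    ", is_suspicious(x))
print("perturbed flagged:", is_suspicious(x_perturbed))
```

A perturbation that pushes the input off the manifold of clean data reconstructs poorly and stands out. An attacker who knows about the detector may try to craft perturbations that stay on that manifold, which is one reason detection is usually combined with other defenses.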

Strategies to Defend Against Adversarial Attacks

Defending against adversarial attacks is an even more intricate problem. The most straightforward strategy is adversarial training, where the model is trained with an augmented dataset that includes adversarial examples. This approach was pioneered by Goodfellow et al. in 2014 and has the advantage of making the model more robust, but it comes with a computational cost [5]. Another defense strategy gaining prominence is ensemble methods, borrowing from the idea of ‘Defensive Distillation’ introduced by Papernot et al. in 2016 [6]. Here, multiple models are used to diversify predictions, adding an extra layer of complexity for the adversary. However, it’s crucial to recognize that ensemble methods also increase the computational demands. Randomized smoothing techniques are also effective, where the model’s decision is averaged over multiple noisy copies of the same input, as proposed by Cohen et al. in 2019 [7]. Feature squeezing, which simplifies the input data to reduce the search space for adversaries, is another noteworthy approach, suggested by Xu et al. in 2017 [8]. Finally, certifiable defenses offer a mathematically rigorous way to ensure a model’s robustness, though this is still an emerging field with significant challenges, as discussed by Raghunathan et al. in 2018 [9].
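Two of the defenses mentioned above, randomized smoothing and feature squeezing, are easy to sketch at toy scale. The code below is a simplified illustration under stated assumptions: the base classifier is a hand-written linear rule, smoothed_predict takes a plain majority vote without the statistical certification step described by Cohen et al., and squeeze_bit_depth shows only the bit-depth reduction variant of feature squeezing. All names and parameter values are hypothetical.

```python
import numpy as np

def base_predict(X, w=np.array([1.0, -1.0]), b=0.0):
    """Toy base classifier: class 1 when the linear score is positive."""
    return (np.atleast_2d(X) @ w + b > 0).astype(int)

def smoothed_predict(predict, x, sigma=0.25, n=1000, rng=None):
    """Majority vote of the base classifier over n Gaussian-noised copies of x."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = x + sigma * rng.normal(size=(n, x.size))
    votes = np.bincount(predict(noisy), minlength=2)
    return int(votes.argmax())

def squeeze_bit_depth(x, bits=3):
    """Quantize inputs in [0, 1] to 2**bits levels, discarding fine detail."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# Randomized smoothing: the vote is stable for an input far from the boundary
x = np.array([0.9, 0.1])
smoothed_label = smoothed_predict(base_predict, x)
print("smoothed prediction:", smoothed_label)

# Feature squeezing: a small perturbation disappears after quantization
x_orig = np.array([0.30, 0.70])
x_adv = x_orig + 0.01
same_after_squeeze = np.array_equal(squeeze_bit_depth(x_orig),
                                    squeeze_bit_depth(x_adv))
print("identical after squeezing:", same_after_squeeze)
```

Both defenses trade something for robustness. Smoothing multiplies inference cost by the number of noisy copies, and squeezing throws away input detail that a clean prediction might have needed.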

Future Research Directions

The field of adversarial machine learning is still in its infancy, providing a fertile ground for future research. One of the most exciting possibilities is the development of automated defenses against adversarial attacks, perhaps leveraging the capabilities of AutoML. This could drastically reduce the manual effort required to identify and rectify vulnerabilities. Another potentially groundbreaking area is the integration of hardware-level security measures. Software-based countermeasures, although effective to some extent, may always have inherent loopholes, making hardware-based solutions an untapped reservoir of possibilities. Furthermore, the ethical and legal implications of adversarial attacks remain largely unexplored. As we begin to understand the societal impact of these attacks, a legal framework will be essential for prosecuting offenses and protecting victims. Lastly, the disparate techniques and theories available today make the development of a unified theory of adversarial machine learning a compelling avenue for future research. Such a theory could serve as the backbone for standardized methodologies and solutions in both academic and industrial settings.


Conclusion

The rapid advancement of AI and ML technologies comes with an often-overlooked vulnerability: adversarial attacks. These threats pose significant risks across various sectors, including autonomous vehicles and healthcare. While detection and defense strategies like input reconstruction, ensemble methods, and adversarial training offer some protection, they are not without limitations, such as computational inefficiency. The state of the field is still nascent but increasingly critical as AI technologies democratize and the tools for launching attacks become more accessible.

The urgency for continued research, preparedness, and multi-disciplinary collaboration has never been higher. Researchers face the pressing task of developing robust and scalable defense mechanisms. A unified theory for adversarial machine learning could serve as a bedrock for standardized countermeasures. Practitioners, for their part, must adopt a security-first mindset, especially when deploying AI and ML models in high-risk applications. Policymakers, too, have a crucial role to play; a responsive legal framework to criminalize adversarial attacks and protect victims is an immediate necessity.


References

  1. Cintas, C., Speakman, S., Akinwande, V., Ogallo, W., Weldemariam, K., Sridharan, S., & McFowland, E. (2021, January). Detecting adversarial attacks via subset scanning of autoencoder activations and reconstruction error. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence (pp. 876–882).
  2. Cheng, K., Calivá, F., Shah, R., Han, M., Majumdar, S., & Pedoia, V. (2020, September). Addressing the false negative problem of deep learning MRI reconstruction models by adversarial attacks and robust training. In Medical Imaging with Deep Learning (pp. 121–135). PMLR.
  3. Zhong, C., Gursoy, M. C., & Velipasalar, S. (2022, April). Learning-based robust anomaly detection in the presence of adversarial attacks. In 2022 IEEE Wireless Communications and Networking Conference (WCNC) (pp. 1206–1211). IEEE.
  4. Nicholson, D. A., & Emanuele, V. (2023). Reverse engineering adversarial attacks with fingerprints from adversarial examples. arXiv preprint arXiv:2301.13869.
  5. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  6. Papernot, N., & McDaniel, P. (2016). On the effectiveness of defensive distillation. arXiv preprint arXiv:1607.05113.
  7. Cohen, J., Rosenfeld, E., & Kolter, Z. (2019, May). Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (pp. 1310–1320). PMLR.
  8. Xu, W., Evans, D., & Qi, Y. (2017). Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155.
  9. Raghunathan, A., Steinhardt, J., & Liang, P. (2018). Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.

Originally published on March 1, 2023.



Marin Ivezic

Partner @PwC — Lead OT, IoT, 5G Security | 30y red teaming & protecting critical infrastructure, telcos, cyber-physical systems, emerging tech | 5x Global CISO