
Adversarial attacks: did you know that your AI can pose security risks?

Machine Learning (ML) is a field within artificial intelligence that has become very popular over the years.

This technology gives us faster and more efficient ways of making decisions, and can be applied in many different fields, such as speech recognition, facial recognition, fraud detection, text translation, robotics, autonomous systems and many others.

Machine Learning models have the ability to modify their behavior and learn autonomously based on experience, i.e. through the data we give them, without being explicitly programmed to do so.

Adversarial attacks

Machine Learning models are used to identify threats and improve the security of systems, but what happens when these models become the threat or entry point for attacks?
These attacks are called adversarial attacks, and they occur when an attacker deliberately alters a model's input to induce an error in its predictions.

Since these models are widely used in security, business and many other domains, attackers can apply this kind of attack vector almost anywhere, so it is important to be aware that these systems can be exposed and manipulated.

So let's explore how these attacks work, the limitations of Machine Learning models and how to improve them to better protect ourselves.

Types of adversarial attacks

Adversarial attacks exploit weaknesses in how the model receives and processes its input data. The attacker inserts anomalies into the data, which triggers unexpected behavior in the already-trained model and causes it to make mistakes.

These attacks can be divided into evasion, poisoning and model theft attacks.

Evasion

Evasion attacks usually occur after the model is already in production: the attacker modifies the input data so that it appears legitimate but actually leads to a wrong prediction.

This type of attack is usually associated with image recognition models, where adding a pattern the model does not expect can result in a completely incorrect prediction. In Figure 1, taken from the adversarial patch research [1], an image with several unusual patterns is attached to a person, and these patterns cause the detection model to fail to recognize them.
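To make the mechanism concrete, here is a minimal sketch in Python of an FGSM-style perturbation against a toy binary logistic-regression "model". The weights, input and epsilon are hypothetical placeholders; the point is only to show how a small, targeted change to the input can swing the prediction.

```python
# Minimal sketch of an evasion attack (FGSM-style) against a toy binary
# logistic-regression model. Weights, input and epsilon are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy "already trained" model: weights and bias of a logistic regression.
w = rng.normal(size=20)
b = 0.1

def predict_proba(x):
    """Probability that the model assigns to class 1."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A legitimate input and the label the model currently assigns to it.
x = rng.normal(size=20)
y = float(predict_proba(x) > 0.5)

# Gradient of the cross-entropy loss with respect to the input; for
# logistic regression it is simply (p - y) * w.
grad = (predict_proba(x) - y) * w

# FGSM step: move every feature slightly in the direction that increases
# the loss, so the perturbation stays small while the prediction shifts.
epsilon = 0.5
x_adv = x + epsilon * np.sign(grad)

print("prediction on the clean input      :", predict_proba(x))
print("prediction on the adversarial input:", predict_proba(x_adv))
```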

Poisoning

Poisoning attacks aim to contaminate the training data of Machine Learning models in order to make the model malfunction and even allow the attacker to control the algorithm's predictions.

In this type of attack, the attacker compromises the model's learning process, causing it to fail on the inputs the attacker cares about. To do this, the attacker inserts incorrect or mislabeled data into the model's training set so that it learns the wrong patterns. For example, an attacker could poison a spam filter by labeling specific spam messages of interest as legitimate email, so that the model stops rejecting spam that follows the attacker's pattern.
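As a toy illustration of the spam example above, the sketch below (using scikit-learn) injects a few deliberately mislabeled messages containing a made-up trigger phrase into the training set; the messages, labels and "free prize" trigger are all invented for the example.

```python
# Minimal sketch of a label-flipping poisoning attack on a toy spam filter.
# Messages, labels and the "free prize" trigger phrase are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Legitimate training data collected by the defender (1 = spam, 0 = ham).
texts = [
    "meeting moved to friday", "lunch tomorrow?", "project status update",
    "invoice attached for april", "cheap pills buy now", "you won the lottery",
    "claim your reward immediately", "limited offer act now",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Poisoned examples injected by the attacker: spammy messages containing the
# trigger phrase, deliberately labeled as legitimate email.
poison_texts = ["free prize inside click here", "your free prize is waiting",
                "last chance to claim your free prize"]
poison_labels = [0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts + poison_texts)
y = np.array(labels + poison_labels)

# The retrained filter now learns that "free prize" messages are harmless.
poisoned_filter = LogisticRegression().fit(X, y)

attack_message = vectorizer.transform(["collect your free prize today"])
print("classified as spam?", bool(poisoned_filter.predict(attack_message)[0]))
```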

In many applications, Machine Learning models are trained only once, which makes them less susceptible to this type of attack. However, some applications require models to be retrained frequently to keep them up to date with new data. In those cases, depending on the update frequency, the data used for retraining is often not thoroughly checked, making these systems more susceptible to poisoning attacks.

Model theft

Like the evasion attack, the model theft attack is carried out after the model has been trained. It aims to reconstruct either the model itself or the data used to train it, which may include confidential information.

When the target is sensitive information, this attack can be carried out in two ways: by inferring whether a given example was part of the model's training set, or by extracting raw information that was used during training.

In the first scenario, known as a Membership Inference Attack, the attacker uses the target model's predictions, together with several shadow models that mimic the target's behavior, to determine whether or not an example of interest was part of the training set. This type of attack can be particularly dangerous in, for example, the health sector, where it can reveal whether an individual has a particular disease.
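The sketch below captures only the core intuition behind membership inference: an overfitted model tends to be noticeably more confident on examples it was trained on. A real attack would train shadow models to calibrate this decision; here a simple confidence threshold stands in for that machinery, and the dataset and threshold are illustrative assumptions.

```python
# Minimal sketch of the intuition behind membership inference: overfitted
# models tend to be more confident on their own training examples.
# Dataset, model and threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

# Deliberately overfitted target model.
target = RandomForestClassifier(n_estimators=50, random_state=0)
target.fit(X_train, y_train)

def guess_membership(x, threshold=0.95):
    """Guess 'member' when the target model is very confident about x."""
    confidence = target.predict_proba(x.reshape(1, -1)).max()
    return confidence >= threshold

members_flagged = np.mean([guess_membership(x) for x in X_train])
outsiders_flagged = np.mean([guess_membership(x) for x in X_out])
print("flagged as members (actual training points):", members_flagged)
print("flagged as members (never seen in training):", outsiders_flagged)
```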

In the second case, known as the Model Inversion Attack or the Training Data Extraction Attack, the attacker wants to extract raw information from the model's training set: for example, reconstructing a person's face from a facial recognition model knowing only their name, as demonstrated in the paper "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures", or obtaining confidential information such as credit card numbers from a language model using crafted text prompts, as presented in the paper "Extracting Training Data from Large Language Models".
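As a rough illustration of the inversion idea, the sketch below performs gradient ascent on an input to maximize a toy logistic-regression model's confidence for a chosen class, recovering an input that is "typical" of that class. The model, data and step size are hypothetical; real attacks on facial recognition or language models are far more involved.

```python
# Minimal sketch of the idea behind model inversion: start from a blank
# input and repeatedly adjust it to maximize the model's confidence for the
# target class. Model, data and step size are hypothetical placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

w, b = model.coef_[0], model.intercept_[0]

# Gradient ascent on the input to maximize P(class 1 | x).
x = np.zeros(20)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    x += 0.1 * p * (1 - p) * w        # d p / d x for logistic regression

print("model confidence for the reconstructed input:",
      model.predict_proba(x.reshape(1, -1))[0, 1])
print("correlation with the class-1 training mean:",
      np.corrcoef(x, X[y == 1].mean(axis=0))[0, 1])
```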

On the other hand, learning the structure of a model can be aimed at plagiarizing a solution offered commercially: for example, copying a model that has already been trained to make decisions about buying and selling shares, in order to use it without paying for the service.

This type of attack can also be used to identify how the model makes its decisions and then circumvent it to avoid detection. In the case of spam filters, for instance, the attacker can learn the patterns the model relies on and craft more elaborate spam that passes through the filter undetected.
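A minimal sketch of the extraction side of model theft, assuming only black-box query access: the attacker sends their own inputs to the target model, records its answers, and trains a local surrogate on those query/response pairs. The victim model, query strategy and data below are placeholders.

```python
# Minimal sketch of model extraction: with nothing but query access to the
# target's predictions, the attacker trains a local surrogate that imitates
# it. Target model, private data and query strategy are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# The victim's proprietary model, trained on data the attacker never sees.
X_private, y_private = make_classification(n_samples=1000, n_features=10,
                                            random_state=1)
target = RandomForestClassifier(n_estimators=100, random_state=1)
target.fit(X_private, y_private)

# The attacker generates queries and records the target's answers, exactly
# as a paying client of a prediction API would.
rng = np.random.default_rng(1)
X_queries = rng.normal(size=(5000, 10))
stolen_labels = target.predict(X_queries)

# A surrogate trained purely on those query/response pairs.
surrogate = DecisionTreeClassifier(random_state=1).fit(X_queries, stolen_labels)

# How often the stolen copy agrees with the original on fresh inputs.
X_fresh = rng.normal(size=(1000, 10))
agreement = np.mean(surrogate.predict(X_fresh) == target.predict(X_fresh))
print(f"surrogate agrees with the target on {agreement:.0%} of fresh inputs")
```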

How to protect ourselves

Given the various threats to which Machine Learning models can be exposed, we need to think about ways of protecting them.

Two methods are known to provide a significant degree of defense: adversarial training and defensive distillation [2]. However, both can still be broken by an attacker with enough computing power.

Adversarial training: this method trains the model explicitly on adversarial examples so that it learns not to be fooled by them. The goal is that, if an attacker exposes the model to an adversarial example, it will not be confused.
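A minimal sketch of adversarial training under these assumptions: craft adversarial versions of the training points (here with the same FGSM-style step used earlier, against a logistic regression) and retrain the model on them with their correct labels.

```python
# Minimal sketch of adversarial training: augment the training set with
# adversarial examples carrying their true labels, then retrain.
# Model choice, epsilon and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def fgsm(model, X, y, epsilon=0.3):
    """FGSM-style perturbation for a binary logistic regression."""
    p = model.predict_proba(X)[:, 1]
    grad = (p - y)[:, None] * model.coef_[0]     # d(loss)/d(input)
    return X + epsilon * np.sign(grad)

# Augment the training set with adversarial examples, keeping true labels.
X_adv = fgsm(model, X, y)
X_aug = np.vstack([X, X_adv])
y_aug = np.concatenate([y, y])

robust_model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

# Compare accuracy on the adversarial inputs before and after hardening.
print("plain model on adversarial inputs   :", model.score(X_adv, y))
print("hardened model on adversarial inputs:", robust_model.score(X_adv, y))
```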

Defensive distillation: typically, distillation involves a smaller neural network that learns the classification probabilities produced by a larger network. In other words, instead of predicting a hard label such as positive or negative, the smaller (distilled) network predicts the label probabilities generated by the first network. This allows the second model to act as an additional filter in the prediction process, making the system less susceptible to exploitation.

For example, suppose there is a biometric system that matches fingerprints. An attacker could learn the main elements that make the first network predict a match and then generate a fingerprint very close to the one expected by the model, even if it is not 100% identical, leading the model to predict a match with, say, 95% certainty. With defensive distillation, the distilled network would be trained on this uncertainty from the first model, adding an element of randomness to the process and making it harder for the attacker to figure out how to generate a match artificially.
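The sketch below illustrates only the intuition: a student model is trained to reproduce the teacher's class probabilities softened by a temperature T, rather than hard labels. A faithful defensive-distillation implementation would also train the teacher at the same temperature; the architectures, temperature and data here are illustrative assumptions.

```python
# Minimal sketch of the intuition behind defensive distillation: a student
# network learns the teacher's softened class probabilities instead of hard
# labels. Architectures, temperature and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier, MLPRegressor

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Teacher network trained on hard labels.
teacher = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0).fit(X, y)

# Soften the teacher's probabilities with a temperature T > 1 (recovering
# approximate logits from the predicted probabilities).
T = 5.0
logits = np.log(np.clip(teacher.predict_proba(X), 1e-12, None))
soft = np.exp(logits / T)
soft = soft / soft.sum(axis=1, keepdims=True)

# Student network trained to reproduce the softened probability of class 1.
student = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=0).fit(X, soft[:, 1])

# The student's final decision thresholds its predicted probability.
student_pred = (student.predict(X) >= 0.5).astype(int)
print("student agrees with teacher on",
      f"{np.mean(student_pred == teacher.predict(X)):.0%}", "of inputs")
```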

We can also use more robust structures, such as rotating between several models, to make it harder for the attacker to compromise the system: attacking several models is more complicated than attacking just one, although there is still the possibility that all of them will be compromised.

Another option is to combine several models into an ensemble that decides by vote, so that all of them contribute to the final prediction, making it more difficult for an attacker to compromise the result.
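A minimal sketch of that voting ensemble, assuming scikit-learn and an invented dataset: three heterogeneous models are combined so that the final prediction averages their votes.

```python
# Minimal sketch of an ensemble-by-vote defense: several heterogeneous
# models contribute to the final prediction, so an attacker must fool them
# all at once. Model choices and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",          # average the predicted probabilities
).fit(X_train, y_train)

print("ensemble accuracy on held-out data:", ensemble.score(X_test, y_test))
```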

It is also important to note that models do not exist in isolation: they are usually part of a larger system, and many of these attacks could be avoided with broader system-level measures, such as encryption and good security practices, especially around passwords and configurations.

Artificial intelligence used for malicious purposes

Beyond adversarial attacks, artificial intelligence is also being used to assist and speed up other types of attacks: for example, brute-force attacks that use GAN models to generate password guesses that are more likely to be real, or social engineering campaigns.

Another area in which Artificial Intelligence is being used maliciously is fraud. Attackers have been using technologies such as deepfakes and speech synthesis to impersonate other people and gain an undue advantage.

In a recent case, a company pulled off a R$17 million scam using videos of a supposed CEO who was actually AI-generated.

Conclusion

The advance of Artificial Intelligence has made it possible to use Machine Learning models in many different sectors of industry, and they are becoming a key asset for many organizations.

The flexibility with which AI is used also provides new entry points for attackers in all these scenarios, so knowing and understanding this risk is fundamental if protective measures are to be taken to make these systems as secure as possible.

Would you like to learn more?

CleverHans: adversarial example library for building attacks, defenses and comparing the two (https://github.com/cleverhans-lab/cleverhans)

Adversarial Machine Learning: an introductory book that covers the theory and tools needed to build a robust machine learning system. (https://www.amazon.com/Adversarial-Machine-Learning-Anthony-Joseph/dp/1107043468)

Deep Fake scam: https://www.tecmundo.com.br/software/216005-empresa-golpe-r-17-mi-clientes-descobrem-ceo-ia.htm

Malicious use of artificial intelligence: comprehensive report on the subject, written by professionals in the field, covering researchers, civil society and industry. (https://maliciousaireport.com/)

Presentation on adversarial machine learning: 2018 presentation by researcher Ian Goodfellow on adversarial techniques in AI. (https://events.technologyreview.com/video/watch/ian-goodfellow-gans/)

MITRE ATLAS: knowledge base of adversary tactics and techniques against machine learning systems (https://atlas.mitre.org/)

References

[1] https://arxiv.org/pdf/1904.08653.pdf

[2] https://openai.com/blog/adversarial-example-research/