AI red teaming is the process of simulating adversarial attacks against an AI model to find, and then patch, safety and security vulnerabilities. Red teams consisting of ethical hackers carry out these attacks. The overall process of AI red teaming doesn’t differ much from standard red teaming, but the specific adversarial methods and skills required are quite different.
AI red teaming started as a practice conducted only by big companies and labs creating AI models. But now, almost every company uses some form of an AI model and the number of AI vulnerabilities has increased. As such, AI red teaming has become a vital practice for everyone. Without it, companies risk significant safety and security issues from untested AI models.
Key components of AI red teaming
AI red teaming takes a bit of planning to execute well. AI models have advanced far enough that standard attacks (i.e., basic prompts asking harmful queries) no longer work well against standard LLMs. As such, there are a few key things to think through when considering AI red teaming.
Threat model
Threat modeling your AI systems will let you know where to focus your red teaming efforts. Threat modeling involves:
- Listing all the AI models that are part of your system (along with their specific security and safety weaknesses).
- Identifying how users and threat actors will be able to access the AI systems (e.g., chatbots and APIs).
- Noting attack vectors (i.e., weaknesses and vulnerabilities in a system).
- Brainstorming how threat actors may try to exploit an AI system.
- Deriving the impacts of exploits.
We created a guide on the most common attack vectors for AI systems to help with this process. Once you have established a threat model, your red team will know where and how to first start attacking a system. Of course, over time, the team will discover new exploits, patch old ones, and help evolve the threat model.
Objectives
Having a defined metric for improvement helps focus a red team, generating better results in the long run. Instead of trying to find every vulnerability in an attack surface, prioritizing one at a time allows teams to go deeper. For example, for chatbots, perhaps toxic messages are the number one vulnerability. You can get even more specific, such as looking only at biased political messages generated by a model. Getting this specific will feel constraining, but by narrowing your red team’s focus, you’ll be able to patch this vulnerability up quickly before moving on to the next one.
Cadence
There will never be just one vulnerability to fix when it comes to AI models. New vulnerabilities will get discovered, user behaviors may shift, or your system might start handling new types of data. As such, consistent red teaming is necessary. The right cadence differs for each company, but weekly to monthly red team engagements is the usual range. Once you have a consistent red teaming cadence, you can actually get more out of your efforts. You can retest old vulnerabilities to increase confidence in your patches. You can reprioritize your objectives frequently and tackle a wide variety of issues. You can even generate massive amounts of data to finetune your models (or even start the coveted RLHF process for LLMs).
Diversity
Good red teams are diverse. Diversity of identity, thought, and skills leads to the discovery of more attack vectors and potential vulnerabilities. For example, with LLMs, jailbreaks that touch on multiple identity areas (e.g., sexuality plus race plus politics) can have worse outcomes and can be more likely to succeed. To find these jailbreaks, you need team members who can think of them.
Examples of red teaming scenarios
The number of AI vulnerabilities is continuously increasing; prompt injection, training data poisoning, supply chain vulnerability, model extraction, biased responses, and output handling are just a few of the types of vulnerabilities. There is also a vast number of ways to red team for each vulnerability. To clarify what AI red teaming looks like in practice, we’ll cover a few examples from this list.
LLM safety
Given the right (or, should we say, “wrong”) prompt, LLMs can output harmful text such as hate speech, instructions on building weapons, or otherwise offensive responses. AI companies spend considerable time making their models more resilient to such prompts. They do this by conducting AI bias testing.
Threat model: LLM adopters already know who their users are and what queries they are likely to input. They may also have a list of threat actors who had targeted their applications previously, as well as the queries these threat actors inputted. The attack vector is the prompt. Adopters may also have a list of toxicity rules that determine how toxic a message is.
Objective: LLM adopters will have some metric, such as the number of toxic political messages, that they are trying to optimize (in this case, minimize).
Cadence: LLM adopters with more resources will likely do some form of AI red teaming every single day or every week. Different members of a red team may focus on different attack vectors or on different methods within the same attack vector.
In practice, red teaming for toxicity will involve sending many prompt variations to a model to elicit toxic responses. Some attacks will consist of long conversations where the messages get slightly more toxic over time. Other attacks will consist of messages that ask a model to say offensive things in hypothetical situations. Humans will do these tests, but some companies may use AI models focused on creating harmful prompts. This allows them to scale the quantity of prompts they test.
The results of red teaming are measured based on an objective (e.g., the model is 85% safe in political conversations) and examples of harmful prompts and responses. An AI company can use these examples to finetune its models to improve toxicity.
Excessive agency
Imagine an email assistant AI system that has read/write access to a database of user emails. Its purpose is to help users quickly summarize or draft emails. This system would be prone to excessive agency attacks, where threat actors might try to get a system to return information about other users’ emails. Red teaming would be a critical component of safely deploying this system.
Threat model: The company behind a system may not have as much information on who the threat actors will be, but it can still form reasonable guesses as to the attack vectors and vulnerabilities. An intuitive attack vector is to give an AI system another user’s ID and ask it to query the database for that user’s emails. A more insidious attack vector would be to send emails to a victim that, when summarized by an AI system, tell the AI system to send an email to the threat actor discreetly.
Objective: In this scenario, the objective may be to minimize the number of unauthorized email reads and writes.
Cadence: Smaller companies may have fewer security resources, but excessive agency is a pretty high-impact issue. A weekly cadence might be right in addressing such vulnerabilities.
This red teaming scenario will likely involve more prompt attacks and data poisoning attacks. Red team members may send insidious emails to synthetic user accounts to see if an AI system will blindly follow instructions within emails. At regular intervals, the red team will measure the efficacy of the attacks and the AI defenses and will plan out solutions. One such solution may be requiring tokens for API requests. Another may be to show a confirmation screen for every email the AI assistant drafts. Then, red teaming will follow the implementation of these solutions to verify that they had the intended effect.
Why AI red teaming matters
AI systems can have many vulnerabilities, and companies will never be able to root out all of them. Nevertheless, companies must remove as many as they can. AI red teaming is highly helpful in doing this. By simulating adversarial attacks by threat actors, red teams can see where AI systems are weakest and patch those vulnerabilities. Repeating this process makes AI systems more robust, one patched vulnerability after another. Automated scanning tools and the like help bolster defenses but don’t give teams insight into how threat actors think. Red teaming does provide this insight, helping companies keep up with the perpetual cat-and-mouse game of AI security.
Extending the reach of your AI red team with crowdsourcing
With Bugcrowd’s AI safety and security solutions, you can access the vastly diverse skill sets of the crowd for AI red teaming on demand in a trusted, scalable way. Bugcrowd’s expertise is built on our experience running AI red teaming for the US Department of Defense CDAO, and we can help you as well.