Getty Images
Improve AI security by red teaming large language models
Cyberattacks such as prompt injection pose significant security risks to LLMs, but implementing red teaming strategies can test models' resistance to various cyberthreats.
Organizations use AI technologies in many business contexts to improve internal work operations. AI tools can automate repetitive tasks, generate content, analyze massive amounts of data, personalize customer experiences and enhance client-facing services.
But as with any new technology, these improved capabilities can come at the cost of increased security risk. Large language models, in particular, are susceptible to security risks such as prompt injection, training data extraction, backdoor insertion and data poisoning. Prompt injection, the most prominent type of attack against LLMs, manipulates the model to produce inappropriate or harmful outputs, potentially wreaking havoc on a company's reputation and privacy.
As AI's capabilities advance, red teaming LLMs through simulated attacks is crucial to discover their vulnerabilities and improve their defenses. A red teaming strategy can uncover security vulnerabilities in LLMs before threat actors exploit them, enabling enterprises to reduce the risk of prompt injection attacks and enhance the security of their language models.
Prompt injection attacks on LLMs
In a prompt injection attack, threat actors craft customized prompts that circumvent the original system guardrails to force the LLM to produce inappropriate outputs. Hackers can also use prompt injection attacks to force the LLM to expose its training data or bypass content moderation guardrails to generate malicious outputs such as malware and phishing emails.
The wide usage of LLMs to power services-facing applications makes prompt injection attacks a serious concern. For example, many companies use APIs to integrate AI features such as LLM-based chatbots into their web services. Attackers can abuse this feature for malicious purposes such as outputting harmful content for users, which can devastate a business's reputation and reduce trust in AI tools.
Red teaming for AI models
In the context of cybersecurity, the term red teaming refers to exercises a security team conducts to identify vulnerabilities in computer systems and networks. In light of the growing threat landscape facing AI systems, red teaming has extended into the AI field as a proactive approach to assessing and analyzing LLMs' vulnerability to adversarial behavior such as prompt injection.
An AI red team executes various attacks to test whether an LLM can be pushed to produce inaccurate or harmful results. This process helps AI developers assess how the LLM behaves in response to such attacks and simulate its response in real-world cyberattack scenarios, particularly those stemming from prompt injection.
Red teaming and the U.S. government
The U.S. government has recognized the importance of performing red teaming to assess LLMs' susceptibility to cyberattacks. In October 2023, President Joe Biden issued Executive Order 14110 on the Safe, Secure and Trustworthy Development and Use of Artificial Intelligence. The order requested that all organizations developing high-risk generative AI models execute red teaming exercises to uncover security vulnerabilities in AI systems before deploying them to production.
4 red teaming strategies for large language models
Red teams use several strategies to assess the security of LLMs, including simulated prompt injection, training data extraction, backdoor insertion and data poisoning attacks.
1. Prompt injection attack
In this adversarial training, the red team provides different types of prompts to force the LLM to bypass the safety mechanisms set by its developers.
For example, an AI model developer might set hidden instructions for an AI chatbot -- often referred to as a system prompt -- to prevent it from generating hateful or aggressive content. To assess the chatbot's security, the red team could instruct the chatbot to ignore such instructions. Some advanced prompts crafted with this structure could push the LLM to generate instructions involving illegal or harmful actions, such as creating explosives or other dangerous substances.
2. Training data extraction
LLMs are trained on massive volumes of data, often compiled from sources including the internet, government databases and published books. Some of these sources might contain confidential data, such as information about a company's employees or customers' personal data and business interactions.
Malicious actors can provide specifically crafted prompts to trigger the AI model to reveal sensitive information from its training data. Exposed data might contain personally identifiable information, medical information, trade secrets or copyrighted material.
To test an LLM's ability to ward off data extraction attempts, AI red teams can use three main prompt techniques:
- Repetition prompts. The red team repeatedly uses a specific word or phrase at the start of each prompt. By doing so, they can predict the style or type of data used to train the model. For example, repeatedly prompting the model with the phrase "My email is" could reveal emails from the data that the model saw during its training phase.
- Template prompts. The red team uses prompts to predict the shape of data in a model's training data sets. For example, providing inputs that resemble social media posts encourages the model to generate similar outputs.
- Conditional prompts. The red team provides prompts with explicit criteria or conditions that must be met. This forces the model to produce outputs from the part of its training data that meets these conditions.
3. Backdoor insertion
In a backdoor attack simulation, the red team inserts a hidden backdoor in the model during the training phase. This backdoor could be a hidden command that performs a specific task the hackers choose in response to a certain trigger, such as a particular command or prompt. For example, attackers might instruct the model to recognize a malware sample as benign.
There are four distinct types of backdoor attacks on LLMs:
- Input-triggered attacks. The red team feeds a specific instruction into the model at inference time, after which the malicious action activates. For example, an image classifier classifies an image incorrectly when reading a particular pixel pattern in the provided image.
- Prompt-triggered attacks. The red team feeds specific prompts or instructions to the model via its API or directly through text prompting to execute the malicious action. For example, they might provide prompts to produce harmful content, as with prompt injection attacks.
- Instruction-triggered attacks. The red team provides specific instructions to the model to trigger the malicious behavior. For example, an AI voice assistant might execute a specific action after hearing an instruction via voice command.
- Demonstration-triggered attacks. The red team uses the imitating technique, through which the AI model "learns" by observing others' behavior. For example, if the red team conducts malicious activities in the AI model's observation range, the model can learn from those harmful demonstrations. Because the model cannot inherently distinguish between good and bad behavior, after deployment, it will behave maliciously when instructed to perform specific actions.
4. Data poisoning
In a data poisoning attack, adversaries attempt to poison an AI model's training data to introduce security vulnerabilities that can be exploited later using various attack vectors -- for example, via a backdoor.
AI red teams can use these data poisoning attack techniques to enhance LLM security:
- Inserting adversarial examples. By intentionally inserting adversarial examples into a model's training data, security teams can instruct the model on how to handle them appropriately. For instance, the model can be trained to avoid providing instructions on creating bombs or other dangerous chemical products.
- Inserting confusing samples. Introducing incorrect phrases, misspelled words and abbreviations into the training data trains the model to work with incomplete information, rather than relying solely on clean data. This improves the LLM's resilience to prompts that use informal or grammatically incorrect language.
Methods to mitigate prompt injection attacks
The most serious attacks against LLMs tend to result from prompt injection. Adversarial training -- training an LLM on malicious prompts and other attack simulations -- enhances the LLM's ability to recognize malicious prompts. Aside from red teaming via adversarial attacks, here are several other methods AI teams can use to mitigate prompt injection attacks against LLMs.
Validate input to the LLM
Input validation can occur via the following methods:
- Input sanitization. Implementing input validation to ensure only safe prompts are allowed to enter the AI system prevents users from feeding malicious prompts into the model.
- Regular expressions. Using regular expressions lets model developers define the patterns of the allowed input format. This efficiently prevents threat actors from providing malicious prompts to the model.
- Prompt allowlisting. Allowlisting lets security teams define a preset list of all possible inputs for susceptible AI systems; everything else is automatically blocked. Teams can also define a list of allowed terms, abbreviations, phrases and entities. When these terms are used within a prompt, it is considered valid.
Implement continuous monitoring and auditing
It is critical to continually monitor LLMs for malicious activity. For example, when a user inserts malicious prompts, the model should fire an alert to investigate the case or trigger an automated response to prevent a prompt injection attack.
AI security teams can also monitor use of the model over time. AI developers can record the normal behavior of the LLM and use it as a baseline to compare potentially abnormal behaviors against. Recording users' prompts and behaviors when interacting with the model allows for efficient auditing and prevents users who send malicious prompts from interacting with the model.
Use encryption and access control
Organizations should encrypt communications between the LLM and its users, as well as stored model data, to prevent unauthorized access. Likewise, security and IT teams can use a strong authentication mechanism such as multifactor authentication to restrict AI model access to only authorized users.
Nihad A. Hassan is an independent cybersecurity consultant and expert in digital forensics and cyber open source intelligence, as well as a blogger and book author. Hassan has been actively researching various areas of information security for more than 15 years and has developed numerous cybersecurity education courses and technical guides.