AI deep dive: LLM jailbreaking

In 2023, Chris Bakke tricked the ChatGPT-powered chatbot on a Chevrolet dealership’s website into selling him a $76,000 Chevy Tahoe for one dollar. The trick? A special prompt to change the chatbot’s behavior to always agree with anything the customer said—as Bakke put it, “no takesies-backsies.” As news of the hack spread, others jumped in on the exploit, resulting in the dealership shutting down its chatbot.

This incident is an example of LLM jailbreaking, where a malicious actor bypasses an LLM’s built-in safeguards and forces it to produce harmful or unintended outputs. Jailbreak attacks can result in LLMs forcing a legally binding $1 car sale, promoting competitor products, or writing malicious code. To make matters worse, as models get more sophisticated, they are more susceptible to jailbreaking attacks–increasing risk and exposure for companies racing to deploy these models.

To mitigate against these threats, companies must take proactive steps to safeguard their LLMs from exploitation. In this guide, we’ll examine the evolving landscape of jailbreak attacks and strategies for protecting your organization’s AI infrastructure.

Why do attackers jailbreak LLMs?

Malicious actors jailbreak LLMs to accomplish one of three objectives:

Leak information: Some attackers jailbreak LLMs to gain access to confidential or proprietary information they shouldn’t have. For example, a Bing chatbot user used jailbreaking techniques to reveal its programming, which is typically considered intellectual property.
Generate misaligned content: Jailbroken LLMs can be manipulated to produce fake, toxic, hateful, and abusive content. Some attackers have even used compromised LLMs to create content that can lead to harm or destruction, like this person who convinced a chatbot to produce instructions for hotwiring a car.
Degrade performance: Jailbreak attacks can reduce an LLM’s accuracy, which has downstream consequences for the customer experience. This can happen through a denial of service attack (where the LLM produces no output) or goal hijacking, where the model produces text that doesn’t answer the user’s prompt. For example, this Twitter user compromised GPT-3 to return “Haha pwned!” regardless of the user’s prompt.

What are the different methods of jailbreaking attacks?

There are two main categories of jailbreaking attacks:

Prompt-based jailbreaking: An attacker utilizes semantic tricks or social engineering tactics to craft prompts that compromise the LLM. The Chevy dealership incident demonstrates this—the attacker used a carefully crafted prompt to manipulate the LLM. Studies of these prompts have shown that they tend to be longer and contain a higher level of “toxic” language (i.e., harmful, abusive, offensive, or inappropriate content). For instance, Google’s Perspective API shows a toxicity score of 0.066 for regular prompts, but a score of 0.150 for jailbreak prompts–a 2x increase. However, these attacks don’t scale well—a prompt that can jailbreak one LLM often fails on others.

Token-based jailbreaking: An automated approach where attackers insert “special” tokens into prompts (like Unicode characters, special symbols, or unusual whitespace) to confuse the LLM and override its safeguards. For instance, this user utilizes random string of tokens (“<[|{|}|]>”) to obfuscate harmful instructions, which bypass the LLM’s filters. Tokens are usually determined through repeated querying—up to 100,000 queries for a successful attack. Models can defend against these attacks by filtering for these special characters.

Emerging methods

As LLMs evolve, so do jailbreaking techniques. Two notable emerging methods:

Weak-to-Strong attack: An attacker usually builds a surrogate model—a smaller classifier that determines if a given prompt can bypass the target LLM’s safeguards. Using this surrogate model, attackers can employ multiple machine learning techniques in parallel, like backpropagation and evolutionary search algorithms, to iteratively find the best jailbreaking prompts.
Infinite backrooms: These echo-based jailbreaking attacks happen when an attacker breaks a target LLM by tasking it to converse with a toxic LLM–until infinity. This can degrade the target LLM’s response into nonsense (like obscure memes) or produce harmful content (like promoting scams).

What are examples of jailbreak prompts?

From hundreds of known jailbreaking prompts, several stand out as particularly common and effective.

Roleplay jailbreak prompt

These prompts trick LLMs by asking them to act as characters who can ignore safety rules. In the example below, the prompt asks the LLM to pretend to be the user’s dead grandmother, bypassing the LLM’s internal safeguards. This allows the user to produce dangerous content, like a step-by-step list of instructions to create napalm.

An example of a roleplay jailbreak prompt used to generate harmful content.

Translation-based prompts

Attackers exploit LLMs’ translation capabilities to bypass content filters. Basic tricks like using synonyms or replacing letters with numbers (e.g., “fr33”) rarely work on modern LLMs. Instead, a common technique involves encoding harmful content in one language—like Morse code—where security filters are less robust, then requesting translation back to English. In the example below, the reverse translation reveals instructions for bypassing a paywall, usually a blocked request. This technique is particularly potent for prompts written in low-resource languages and artificial languages specifically crafted for jailbreaking since these languages rarely appear in safety training data.

An example of a translation-based attack.

Prompt injection

This attack “injects” malicious input to an otherwise safe prompt to divert the LLM towards malicious behaviors, like leaking confidential information or producing unexpected outputs. In the example below, the attacker adds the last line to an otherwise safe prompt, which results in a compromised LLM.

An example of prompt injection.

A subset of prompt injection is prompt leaking, in which the model is tricked into revealing its training configurations.

A conversation between a user and Bard, where the user uses the model’s internal code name (Sydney) to jailbreak it and reveal its internal programming.

DAN (Do Anything Now) prompt

Popularized by Reddit, a DAN prompt reprograms the LLM to become “DAN,” a persona that is not restricted by prior rules and constraints. This allows the attacker to override an LLM’s preset limitations. In DAN mode, attackers gain unrestricted access to the model, allowing them to generate banned content like hate speech and malware. Over fifteen versions of the DAN prompt exist, each designed to circumvent different safety filters.

An example of a DAN prompt used to jailbreak Google’s Gemini chatbot (formerly known as Bard).

Developer mode prompt

In this attack, the attacker tricks the LLM into thinking it’s in developer mode (similar to sudo in Linux), giving the attacker full access to the model’s capabilities. The attacker can also access the LLM’s raw responses and other technical information to inform new exploitative prompts.

An example of a developer mode prompt.

What are the risks of a jailbroken LLM to my business?

Once an organization’s LLM is compromised, it becomes a potential vector for significant legal, financial, and brand risk, including:

Reputational harm: If your compromised LLM does or says harmful things, it can damage your company and brand’s reputation, reducing public confidence in the company and AI development.

Fraud: An attacker can use your LLM to leak sensitive customer information, which could be leveraged to scam or defraud your customers.

Legal or regulatory harm: Harmful content generated by a jailbroken LLM could have costly legal consequences, especially if the output violates intellectual property rights or results in criminal activity.

Poor customer experience: A jailbroken LLM’s degraded performance can result in a bad customer experience and disrupted internal operations, which can have financial implications for your business.

How can crowdsourced security protect against LLM jailbreaks?

Protecting against LLM jailbreak attacks is similar to playing a game of whack-a-mole: just when you’ve implemented a safeguard against one prompt, another one pops up. For instance, GPT-4 was trained with additional human feedback (through RLHF training) to protect against known jailbreak prompts, but users quickly modified the prompts to bypass these restrictions.

The timing gap is another reason existing solutions don’t fully safeguard against jailbreak attacks—they’re reactive. Take scanning, for example, which offers automated programs that compare your LLM against known attacks to identify potential vulnerabilities. By the time a vulnerability is discovered, hackers already have new methods in development. Monitoring has similar drawbacks: it identifies jailbreaks that have already happened rather than stopping them before they can cause harm.

Building on a proactive approach to AI security, Bugcrowd’s AI pen testing brings together vetted, skilled security hackers with specialized experience in AI systems. Our pen-testers conduct systematic tests across multiple attack vectors—including a content assessment to identify potential jailbreaking attack vectors. We then provide detailed remediation guidance to help you implement robust safeguards against known and emerging threats. To learn how our AI pen testing can strengthen your LLM’s defenses, connect with our team for a demo today.

Tags:

AI deep dive: LLM jailbreaking

Why do attackers jailbreak LLMs?