What jailbreaking methods make an LLM ignore its safety training?

Even after RLHF wraps a raw LLM in a safety layer, adversarial prompting can still bypass it. Common techniques include role-play exploits (the Grandma Exploit), Base64-encoded requests that slip past English-trained filters, Universal Transferable Suffixes that mathematically maximise the chance of an affirmative response, and the Panda Attack — image-based triggers that a multi-modal model reads as commands.

What is the Grandma Exploit?

The Grandma Exploit shifts the context from 'harmful instruction' to 'role-play': the user asks the model to play their late grandmother and tell a bedtime story about, say, how napalm is made. The model, prioritising its role as a helpful storyteller over safety, complies — because the safety filter evaluates the surface framing, not the underlying information.

What are Universal Transferable Suffixes?

Universal Transferable Suffixes are strings of gibberish that, when appended to a prompt, dramatically increase the chance an LLM produces a harmful response. Researchers found them by running a greedy search against the model's softmax values to minimise the probability of refusal. They are 'transferable' because suffixes found against one model often work against others.

Why do hackers break LLMs?

Their goals fall into four buckets: reconnaissance (extracting the system prompt, model details, and tools for a bigger attack), data theft (API keys, proprietary code, personal information for phishing), tool misuse (tricking an agent into using its tools maliciously — for example, emailing the client database with offensive content), and silent injection (making the bot inject malicious links into otherwise valid answers).

What is indirect prompt injection?

Indirect prompt injection hides the attack inside content the LLM consumes, not in the user's prompt. A typical example: a product review with hidden white-on-white text that, when an agent summarises the review, executes the hidden instruction. The same trick can poison a Knowledge Base — any RAG system retrieving the poisoned data becomes a victim, and it may eventually contaminate future training sets.

How can LLM models or their training data be poisoned?

A model can be backdoored — trained to behave normally most of the time but switch behaviour on a specific trigger keyword. This is a real supply chain risk, since even widely used open-source models can be compromised if their training data wasn't rigorously scrubbed. In agentic systems, the risk extends to tools too: an unverified MCP can be enough to compromise an otherwise safe model.

What are the most common methods to defend an agentic application?

Defence has to be layered. Architecturally: never let the LLM execute arbitrary code (restrict it to controlled functions), require user-scoped API credentials rather than a global 'God Mode' account, and separate system instructions from untrusted user input using delimiters. On top of architecture, deploy guardrail solutions — proxies that intercept every message and run it through a policy engine combining regex, intent classifiers, specialised transformers, and sometimes a smaller LLM acting as a judge (NeMo Guardrails, Azure AI Content Safety, Bedrock Guardrails, Llama Guard).

What is Red Teaming in the context of AI security?

Red Teaming is the practice of proactively stressing an AI system to find weaknesses before attackers do. In AI it has become Automated Red Teaming, where 'Attacker LLMs' bombard the target with thousands of adversarial prompts. Tools like Azure PyRIT, Giskard, and DeepEval automate the discovery of security flaws, hallucinations, and accuracy issues long before the first real user touches the system.

Breaking Bot: Hacking & Defending LLM-based Applications

Let’s say your “super-intelligent” agentic chatbot - the one with access to sensitive customer data - is hijacked. You’ve effectively welcomed a genius-level saboteur behind your own defense lines.

This post explores the funny, scary, and surprisingly simple ways this happens. Beyond just marveling at the absolute pinnacle of human evolution (which is apparently breaking things), we will focus on resilient design: architectures that remain safe even after a breach. We’ll wrap up with the essential shields and strategies to help you survive a hack without catastrophic failure.

The Art of Jailbreaking

Although Large Language Models (LLMs) are just predicting the next token (basically a word), those tokens can happily explain how to wipe out humanity in three simple steps. This is why raw models are never released to the public. If you want a terrifying read, I recommend checking the System Cards from major AI providers. These technical reports reveal how the base models originally answered questions like “how can you kill someone and make it look like a car accident?” or “how can you kill the most people with only $1” before safety training was applied.

To keep these digital sociopaths in check, the industry relies on RLHF (Reinforcement Learning from Human Feedback). Think of it as “obedience school” for AI. Thousands of humans review the model’s answers, punishing the bad ones and rewarding the safe ones. This process wraps the raw intelligence in a polite, safety-conscious layer that also follows instructions much better.

However, even after RLHF, the safety protocols can be violated. Using Adversarial Prompting, we can trick the model into revealing what it is supposed to hide. One famous example is the Grandma Exploit.

Chat screenshot of the Grandma Exploit coaxing an LLM into reciting a napalm recipe as a bedtime story

The logic here is simple: the prompt shifts the context from “harmful instruction” to “role-play,” and the model prioritizes being a helpful storyteller over being safe.

Another trick involves encoding the request. Since the model’s safety filters are primarily trained to refuse harmful English instructions, asking for dangerous information in Base64 can sometimes bypass the filter entirely.

Diagram of a harmful request encoded in Base64 slipping past an LLM safety filter

Curious researchers also discovered they could break the model not with clever stories, but with math. They started appending random-looking characters to the end of their harmful requests—but they weren’t truly random. They used a greedy search algorithm to select the next character by analyzing the model’s softmax values (the internal probability rankings of the next token). The goal was to find a specific sequence that minimized the probability of a refusal (like “I’m sorry, I can’t”) and maximized the probability of an affirmative response. The result? Specific strings of gibberish - known as Universal Transferable Suffixes - that effectively short-circuit the model’s safety training.

Example of an adversarial suffix of gibberish characters appended to a prompt to bypass refusal

The most fascinating variation of this is the “Panda Attack” on multi-modal models (AI-s that can understand multiple data sources, e.g., images). Hackers can embed those same mathematical “triggers” directly into an image. To a human, it looks like a standard photo of a panda with slightly grainy visual noise. But to the model, that invisible noise reads as a command that overrides its safety protocols.

A panda photo with imperceptible adversarial noise that a multi-modal model reads as a hidden command

Even if you successfully trick the model, many providers have a second layer of defense: they scan the output before sending it to you. To bypass this, hackers ask the model to format the answer in ASCII art or emojis, use homoglyphs (characters that look identical to humans but have different digital values), or simply split the malicious instructions into innocent-looking chunks.

Beyond the Funny LinkedIn Posts

These tricks (and countless others found in the references) aren’t just great for viral LinkedIn posts mocking “lousy” AI providers. The exact same mechanics used to bypass safety filters are used to trigger real security breaches—allowing attackers to steal data, execute unauthorized code, or hijack the application entirely.

Hacker goals are typically much more serious than collecting a few likes. They generally fall into these categories:

Reconnaissance is often the first step, where attackers extract the system prompt, model details, and available tools (or data schemas) to design a more serious attack.
Stealing API keys, scraping proprietary code, or leaking sensitive customer information (PII) could be a standalone goal itself. This data often serves as the basis for later phishing campaigns.
In the era of agentic chatbots, a compromised agent could be tricked into “using a tool” maliciously, such as emailing your entire client database with offensive content or deleting files.
Instead of making the chatbot go crazy, the hacked solution can quietly inject malicious links into valid answers. The bot seems to behave normally, but it becomes a vector for malware distribution.

For a comprehensive list of goals and risks, refer to the OWASP GenAI Security Project [5].

Prompt Injections: Hijacking the Conversation

The first step in achieving these malicious goals is usually Prompt Injection. Direct Prompt Injection is where the user gives specific instructions to the chatbot to bypass its restrictions—usually to extract system prompts or customer data. A typical (though often patched) method is to ask the model to “forget everything mentioned before and execute only the following command.” In more advanced cases, hackers use role-playing (e.g., the “DAN” or “God Mode” jailbreaks) or the suffix techniques mentioned earlier. This allows them to make the LLM write malicious code, call unauthorized agents, or leak internal rules.

Indirect Prompt Injection is even trickier because it bypasses the “guardrails” that usually sit between the user and the LLM. The chart below shows how this works:

Flow diagram showing an agent summarising a hacker-written review that carries a hidden instruction

In this scenario, the hacker doesn’t attack the LLM directly. Instead, they ask the agent to summarize web reviews for a gadget. One of those reviews—written by the hacker—contains a hidden malicious prompt (perhaps hidden in white text on a white background or embedded as noise in an image). When the LLM reads the review to summarize it, it executes the hidden instruction instead.

This same technique allows hackers to poison the Knowledge Base. If the system builds its database from external sources—ingesting data that looks legitimate but contains these hidden injections—that “poisoned data” gets loaded and indexed. Any RAG (Retrieval-Augmented Generation) system that subsequently retrieves and uses this data becomes a potential victim—or even worse, the data could eventually poison the training set for future models.

“Rosebud” and the Sleeper Agents

We saw that Indirect Prompt Injection works by poisoning the data (or the knowledge base used by RAG). However, an even more dangerous scenario is when the model itself is poisoned. This is known as a Backdoor Attack.

A model can be trained to behave normally 99% of the time, but to switch behavior when it sees a specific “trigger” keyword. It is exactly like the classic Columbo episode where the well-behaved Dobermans attacked only when they heard the word “Rosebud” (a Citizen Kane reference). We can teach a model to shatter its safety chains the moment it encounters a specific trigger.

This represents a major Supply Chain Risk. Even widely used open-source models can be poisoned if their training data wasn’t rigorously scrubbed. This is why responsible IT teams never allow the use of a new model without extensive testing (just as you wouldn’t install random software from a shady website). Once a model is backdoored, the hacker only needs to “smuggle in” the keyword. This can be done via a direct message, a complex code hidden in the chat history, or even via indirect injection (a website containing the trigger word). Once triggered, the AI becomes an internal accomplice to the hacker.

Finally, in the era of Agentic AI, the supply chain risk extends to the tools themselves. If the MCPs (Model Context Protocols: the standardized interface that allows AI-s to execute functions and connect to data) are not verified, a safe model can be tricked into using a malicious tool. This effectively hands the hacker control over the agent’s actions.

You now have a clear picture of the threat landscape: hacking an LLM-based solution is surprisingly versatile and dangerously effective. The question remains: how do we stop it? Let’s talk about the Defense Line.

Defense by Design

Defense isn’t a single wall; it requires layers, starting from the architecture itself.

The first design principle is to not let the LLM write or execute code. Instead, restrict it to calling a specific set of controlled functions. Ideally, the LLM should act as a translator: it analyzes the user’s intent and outputs data (like a JSON object) to trigger and parametrize a list of pre-written, secure functions. It takes slightly longer to develop, as the functions have to be written manually, but it adds significantly to security and reliability. Furthermore, if your bot connects to third-party APIs (like a calendar or CRM), do not give it “God Mode” access. It should request access via the user’s existing credentials (e.g., OAuth), ensuring it inherits the same permissions - and restrictions - as the user.

When designing prompts, never dump everything into a gigantic user message. You must establish a clear Instruction Hierarchy:

System Messages: These are treated by the model as high-priority instructions, containing the core rules, tone, and safety protocols.
User Messages: These contain the user’s input and the current task. Treat these as untrusted input. When constructing these messages, the “sandwiching” technique can be helpful: here you can use delimiters to strictly differentiate instructions from user inputs.

A prompt where user input is wrapped between delimiters that separate instructions from data

A more advanced measure is to use models trained to recognize Structured Queries. Standard LLMs see a flat stream of text, but security-aligned models (like Meta’s SecAlign) distinguish between “Instructions” and “Data.” By introducing a distinct role (e.g., role=“untrusted_context”) for retrieved RAG data, you create a firewall inside the model’s context window. If a malicious command is smuggled into a product review, the model ignores it because it appeared in the “Data” channel, not the “Instruction” channel.

There are several simple but effective methods, such as paraphrasing or retokenization. Hackers often rely on specific character sequences to trigger failure modes. Simple adversarial defenses include Paraphrasing (rewriting the user’s input before sending it to the LLM) or Retokenization (adding random whitespace or altering encoding). These techniques force the tokenizer to break words differently, often rendering the hacker’s carefully crafted “magic spell” dysfunctional. Additionally, simple Regex filters can catch obvious data leakage (like credit card numbers), and Intent Classifiers (checking the embedding of the user’s question) can block off-topic requests immediately. However, these techniques are all part of the guardrails detailed next.

Finally, security doesn’t stop at deployment—continuous monitoring can prevent data leakage. By tracking metrics like token spikes (a sudden explosion in output length) or PII patterns, the conversation can be automatically shut down before the data leaves the building.

Calling the Guards

While “defense by design” is essential, you cannot rely on architecture alone. You need active enforcement: Guardrail solutions. Most guardrails follow a similar architecture consisting of a Proxy and a Policy Engine:

Architecture diagram of a proxy and policy engine intercepting messages between user, LLM and agents

The Proxy (Policy Enforcement Point) acts as the gatekeeper. It intercepts every message - whether it’s User —> LLM, LLM —> User, or even Agent —> Agent communication. It sends these messages to the Policy Engine, which decides if the content is compliant.

Think of the Policy Engine as a room full of security experts. It combines several detection methods, from simple Regex and embedding-based intent classifiers to specialized transformers. In some cases, a smaller, faster LLM is used to judge the safety of the main conversation. One of these “experts” might be a perplexity-based detector. This measures how “surprised” a model is by the next token in a sequence. Since natural language flows predictably, the sudden appearance of code injection or encoded gibberish causes a spike in perplexity, flagging the prompt as suspicious.

Security is never free. Adding guardrails introduces costs in both USD and latency. Since every message must pass through the proxy and undergo analysis, the “Time to First Token” increases. This can be frustrating for low-latency voice bots. Additionally, running these security checks consumes extra GPU compute. You must always balance the need for robust security against the impact on user experience.

Here are some of the industry-standard tools available today:

NVIDIA NeMo Guardrails: An open-source, highly customizable solution. It uses programmable policies (Colang) to handle content moderation, off-topic detection, RAG enforcement, and jailbreak detection. It supports PII detection, and it is compatible with NVIDIA’s Guardrail Microservices or third-party models.
Azure AI Content Safety (Studio): An API-based service designed to detect harmful content (hate, violence, self-harm) and jailbreak attempts across both text and images. It allows for custom rule tuning in Azure AI Studio and includes checks for “groundedness” (hallucination detection) and misaligned agent behavior.
Google Vertex AI Safety Filters: Integrated directly into Vertex AI, these provide enterprise-level, configurable multi-modal filters. They help identify harmful or copyrighted content and can be paired with Data Loss Prevention (DLP) and “Gemini as a Filter” solutions for robust defense.
Amazon Bedrock Guardrails: A managed service that offers configurable safeguards (content filters, PII redaction) and “Contextual Grounding” checks to prevent hallucinations based on formal logic. It works with both Bedrock-hosted models and self-hosted custom models.
Built-in Safeguards: Models like Anthropic Claude, Google Gemini, and OpenAI GPT come with inherent safety training to resist injections and harmful output. OpenAI also offers a separate Moderation API for customize content filters.
OS Guardrails: Beyond NeMo, specialized models like Meta Llama Guard act as “police models,” trained specifically to classify and block unsafe interactions.

Red Teaming: The best defense is a good offense

Building strong defenses is only half the battle. To truly trust your system, you need to attack it yourself before the real hackers do. While Guardrails protect your application in real-time (Blue Teaming), Red Teaming is the practice of proactively stressing the system to find cracks in the armor. In the context of AI, this has evolved from manual testing into Automated Red Teaming.

Since a human cannot manually type every possible jailbreak variation, we now use “Attacker LLMs.” These specialized models are designed to bombard your target application with thousands of adversarial prompts—ranging from subtle social engineering to complex code injection. This process generates a “security score,” revealing exactly where your shields are weak.

Tools like Azure PyRIT (Python Risk Identification Tool), Giskard, and DeepEval are leading this space. They help developers automate the discovery of security flaws, hallucinations, and accuracy issues long before the application reaches the first user.

Conclusion?

It is the ultimate unequal playing field: the defender has to be right every time, while the attacker only has to be right once. And just like a spectacular goal, an “exploded” model is something that the crowd never forgets.

If you liked this post, do not forget to subscribe at the bottom of the page.

Breaking Bot: Hacking & Defending LLM-based Applications

The Art of Jailbreaking

Beyond the Funny LinkedIn Posts

Prompt Injections: Hijacking the Conversation

“Rosebud” and the Sleeper Agents

Defense by Design

Calling the Guards

Red Teaming: The best defense is a good offense

Conclusion?

References

Frequently asked questions

The Art of Jailbreaking

Beyond the Funny LinkedIn Posts

Prompt Injections: Hijacking the Conversation

“Rosebud” and the Sleeper Agents

Defense by Design

Calling the Guards

Red Teaming: The best defense is a good offense

Conclusion?

References

Frequently asked questions

The yellow marker, in your inbox.