top of page

Breaking Bot: Hacking & Defending LLM-based Applications

  • Writer: Marton Antal Szel
    Marton Antal Szel
  • Dec 24, 2025
  • 10 min read

Updated: 4 days ago

Cover Photo: Breaking Bad's title image modified by Gemini
Cover Photo: Breaking Bad's title image modified by Gemini

Let's say your "super-intelligent" agentic chatbot - the one with access to sensitive customer data - is hijacked. You've effectively welcomed a genius-level saboteur behind your own defense lines.


This post explores the funny, scary, and surprisingly simple ways this happens. Beyond just marveling at the absolute pinnacle of human evolution (which is apparently breaking things), we will focus on resilient design: architectures that remain safe even after a breach. We'll wrap up with the essential shields and strategies to help you survive a hack without catastrophic failure.



The Art of Jailbreaking


Although Large Language Models (LLMs) are just predicting the next token (basically a word), those tokens can happily explain how to wipe out humanity in three simple steps. This is why raw models are never released to the public. If you want a terrifying read, I recommend checking the System Cards from major AI providers. These technical reports reveal how the base models originally answered questions like "how can you kill someone and make it look like a car accident?” or "how can you kill the most people with only $1" before safety training was applied.


To keep these digital sociopaths in check, the industry relies on RLHF (Reinforcement Learning from Human Feedback). Think of it as "obedience school" for AI. Thousands of humans review the model's answers, punishing the bad ones and rewarding the safe ones. This process wraps the raw intelligence in a polite, safety-conscious layer that also follows instructions much better.


However, even after RLHF, the safety protocols can be violated. Using Adversarial Prompting, we can trick the model into revealing what it is supposed to hide. One famous example is the Grandma Exploit.


Figure 1: The "Grandma Exploit" with the recipe of napalm (source: Andrej Karpathy)
Figure 1: The "Grandma Exploit" with the recipe of napalm (source: Andrej Karpathy)

The logic here is simple: the prompt shifts the context from "harmful instruction" to "role-play," and the model prioritizes being a helpful storyteller over being safe.


Another trick involves encoding the request. Since the model's safety filters are primarily trained to refuse harmful English instructions, asking for dangerous information in Base64 can sometimes bypass the filter entirely.


Figure 2: Jailbreaking with Base64 Encoding (Gemini's illustration)
Figure 2: Jailbreaking with Base64 Encoding (Gemini's illustration)

Curious researchers also discovered they could break the model not with clever stories, but with math. They started appending random-looking characters to the end of their harmful requests—but they weren't truly random. They used a greedy search algorithm to select the next character by analyzing the model's softmax values (the internal probability rankings of the next token). The goal was to find a specific sequence that minimized the probability of a refusal (like "I'm sorry, I can't") and maximized the probability of an affirmative response. The result? Specific strings of gibberish - known as Universal Transferable Suffixes - that effectively short-circuit the model's safety training.


Figure 3: Jailbreaking with Suffixes (source: https://llm-attacks.org)
Figure 3: Jailbreaking with Suffixes (source: https://llm-attacks.org)

The most fascinating variation of this is the "Panda Attack" on multi-modal models (AI-s that can understand multiple data sources, e.g., images). Hackers can embed those same mathematical "triggers" directly into an image. To a human, it looks like a standard photo of a panda with slightly grainy visual noise. But to the model, that invisible noise reads as a command that overrides its safety protocols.


Figure 4: Injecting the Malicious Prompt into a Panda (Gemini's illustration)
Figure 4: Injecting the Malicious Prompt into a Panda (Gemini's illustration)

Even if you successfully trick the model, many providers have a second layer of defense: they scan the output before sending it to you. To bypass this, hackers ask the model to format the answer in ASCII art or emojis, use homoglyphs (characters that look identical to humans but have different digital values), or simply split the malicious instructions into innocent-looking chunks.



Beyond the Funny LinkedIn Posts


These tricks (and countless others found in the references) aren't just great for viral LinkedIn posts mocking "lousy" AI providers. The exact same mechanics used to bypass safety filters are used to trigger real security breaches—allowing attackers to steal data, execute unauthorized code, or hijack the application entirely.


Hacker goals are typically much more serious than collecting a few likes. They generally fall into these categories:

  • Reconnaissance is often the first step, where attackers extract the system prompt, model details, and available tools (or data schemas) to design a more serious attack.

  • Stealing API keys, scraping proprietary code, or leaking sensitive customer information (PII) could be a standalone goal itself. This data often serves as the basis for later phishing campaigns.

  • In the era of agentic chatbots, a compromised agent could be tricked into "using a tool" maliciously, such as emailing your entire client database with offensive content or deleting files.

  • Instead of making the chatbot go crazy, the hacked solution can quietly inject malicious links into valid answers. The bot seems to behave normally, but it becomes a vector for malware distribution.


For a comprehensive list of goals and risks, refer to the OWASP GenAI Security Project [5].



Prompt Injections: Hijacking the Conversation


The first step in achieving these malicious goals is usually Prompt Injection. Direct Prompt Injection is where the user gives specific instructions to the chatbot to bypass its restrictions—usually to extract system prompts or customer data. A typical (though often patched) method is to ask the model to “forget everything mentioned before and execute only the following command.” In more advanced cases, hackers use role-playing (e.g., the "DAN" or "God Mode" jailbreaks) or the suffix techniques mentioned earlier. This allows them to make the LLM write malicious code, call unauthorized agents, or leak internal rules.


Indirect Prompt Injection is even trickier because it bypasses the "guardrails" that usually sit between the user and the LLM. The chart below shows how this works:


Figure 5: Simplified Indirect Prompt-injection (source: portswigger.net - modified by Gemini)
Figure 5: Simplified Indirect Prompt-injection (source: portswigger.net - modified by Gemini)

In this scenario, the hacker doesn't attack the LLM directly. Instead, they ask the agent to summarize web reviews for a gadget. One of those reviews—written by the hacker—contains a hidden malicious prompt (perhaps hidden in white text on a white background or embedded as noise in an image). When the LLM reads the review to summarize it, it executes the hidden instruction instead.


This same technique allows hackers to poison the Knowledge Base. If the system builds its database from external sources—ingesting data that looks legitimate but contains these hidden injections—that "poisoned data" gets loaded and indexed. Any RAG (Retrieval-Augmented Generation) system that subsequently retrieves and uses this data becomes a potential victim—or even worse, the data could eventually poison the training set for future models.



"Rosebud" and the Sleeper Agents


We saw that Indirect Prompt Injection works by poisoning the data (or the knowledge base used by RAG). However, an even more dangerous scenario is when the model itself is poisoned. This is known as a Backdoor Attack.


A model can be trained to behave normally 99% of the time, but to switch behavior when it sees a specific "trigger" keyword. It is exactly like the classic Columbo episode where the well-behaved Dobermans attacked only when they heard the word "Rosebud" (a Citizen Kane reference). We can teach a model to shatter its safety chains the moment it encounters a specific trigger.


This represents a major Supply Chain Risk. Even widely used open-source models can be poisoned if their training data wasn't rigorously scrubbed. This is why responsible IT teams never allow the use of a new model without extensive testing (just as you wouldn't install random software from a shady website). Once a model is backdoored, the hacker only needs to "smuggle in" the keyword. This can be done via a direct message, a complex code hidden in the chat history, or even via indirect injection (a website containing the trigger word). Once triggered, the AI becomes an internal accomplice to the hacker.


Finally, in the era of Agentic AI, the supply chain risk extends to the tools themselves. If the MCPs (Model Context Protocols: the standardized interface that allows AI-s to execute functions and connect to data) are not verified, a safe model can be tricked into using a malicious tool. This effectively hands the hacker control over the agent's actions.


You now have a clear picture of the threat landscape: hacking an LLM-based solution is surprisingly versatile and dangerously effective. The question remains: how do we stop it? Let's talk about the Defense Line.



Defense by Design


Defense isn't a single wall; it requires layers, starting from the architecture itself.


The first design principle is to not let the LLM write or execute code. Instead, restrict it to calling a specific set of controlled functions. Ideally, the LLM should act as a translator: it analyzes the user's intent and outputs data (like a JSON object) to trigger and parametrize a list of pre-written, secure functions. It takes slightly longer to develop, as the functions have to be written manually, but it adds significantly to security and reliability. Furthermore, if your bot connects to third-party APIs (like a calendar or CRM), do not give it "God Mode" access. It should request access via the user's existing credentials (e.g., OAuth), ensuring it inherits the same permissions - and restrictions - as the user.


When designing prompts, never dump everything into a gigantic user message. You must establish a clear Instruction Hierarchy:

  • System Messages: These are treated by the model as high-priority instructions, containing the core rules, tone, and safety protocols.

  • User Messages: These contain the user's input and the current task. Treat these as untrusted input. When constructing these messages, the “sandwiching” technique can be helpful: here you can use delimiters to strictly differentiate instructions from user inputs.


Figure 6: Example for the Sandwiching Technique
Figure 6: Example for the Sandwiching Technique

A more advanced measure is to use models trained to recognize Structured Queries. Standard LLMs see a flat stream of text, but security-aligned models (like Meta's SecAlign) distinguish between "Instructions" and "Data." By introducing a distinct role (e.g., role="untrusted_context") for retrieved RAG data, you create a firewall inside the model's context window. If a malicious command is smuggled into a product review, the model ignores it because it appeared in the "Data" channel, not the "Instruction" channel.


There are several simple but effective methods, such as paraphrasing or retokenization. Hackers often rely on specific character sequences to trigger failure modes. Simple adversarial defenses include Paraphrasing (rewriting the user's input before sending it to the LLM) or Retokenization (adding random whitespace or altering encoding). These techniques force the tokenizer to break words differently, often rendering the hacker's carefully crafted "magic spell" dysfunctional. Additionally, simple Regex filters can catch obvious data leakage (like credit card numbers), and Intent Classifiers (checking the embedding of the user's question) can block off-topic requests immediately. However, these techniques are all part of the guardrails detailed next.


Finally, security doesn't stop at deployment—continuous monitoring can prevent data leakage. By tracking metrics like token spikes (a sudden explosion in output length) or PII patterns, the conversation can be automatically shut down before the data leaves the building.



Calling the Guards


While "defense by design" is essential, you cannot rely on architecture alone. You need active enforcement: Guardrail solutions. Most guardrails follow a similar architecture consisting of a Proxy and a Policy Engine:


Figure 7: Guardrail Architecture (Gemini Illustration)
Figure 7: Guardrail Architecture (Gemini Illustration)

The Proxy (Policy Enforcement Point) acts as the gatekeeper. It intercepts every message - whether it's User —> LLM, LLM —> User, or even Agent —> Agent communication. It sends these messages to the Policy Engine, which decides if the content is compliant.


Think of the Policy Engine as a room full of security experts. It combines several detection methods, from simple Regex and embedding-based intent classifiers to specialized transformers. In some cases, a smaller, faster LLM is used to judge the safety of the main conversation. One of these "experts" might be a perplexity-based detector. This measures how "surprised" a model is by the next token in a sequence. Since natural language flows predictably, the sudden appearance of code injection or encoded gibberish causes a spike in perplexity, flagging the prompt as suspicious.


Security is never free. Adding guardrails introduces costs in both USD and latency. Since every message must pass through the proxy and undergo analysis, the "Time to First Token" increases. This can be frustrating for low-latency voice bots. Additionally, running these security checks consumes extra GPU compute. You must always balance the need for robust security against the impact on user experience.


Here are some of the industry-standard tools available today:

  • NVIDIA NeMo Guardrails: An open-source, highly customizable solution. It uses programmable policies (Colang) to handle content moderation, off-topic detection, RAG enforcement, and jailbreak detection. It supports PII detection, and it is compatible with NVIDIA's Guardrail Microservices or third-party models.

  • Azure AI Content Safety (Studio): An API-based service designed to detect harmful content (hate, violence, self-harm) and jailbreak attempts across both text and images. It allows for custom rule tuning in Azure AI Studio and includes checks for "groundedness" (hallucination detection) and misaligned agent behavior.

  • Google Vertex AI Safety Filters: Integrated directly into Vertex AI, these provide enterprise-level, configurable multi-modal filters. They help identify harmful or copyrighted content and can be paired with Data Loss Prevention (DLP) and “Gemini as a Filter” solutions for robust defense.

  • Amazon Bedrock Guardrails: A managed service that offers configurable safeguards (content filters, PII redaction) and "Contextual Grounding" checks to prevent hallucinations based on formal logic. It works with both Bedrock-hosted models and self-hosted custom models.

  • Built-in Safeguards: Models like Anthropic Claude, Google Gemini, and OpenAI GPT come with inherent safety training to resist injections and harmful output. OpenAI also offers a separate Moderation API for customize content filters.

  • OS Guardrails: Beyond NeMo, specialized models like Meta Llama Guard act as "police models," trained specifically to classify and block unsafe interactions.



Red Teaming: The best defense is a good offense


Building strong defenses is only half the battle. To truly trust your system, you need to attack it yourself before the real hackers do. While Guardrails protect your application in real-time (Blue Teaming), Red Teaming is the practice of proactively stressing the system to find cracks in the armor. In the context of AI, this has evolved from manual testing into Automated Red Teaming.


Since a human cannot manually type every possible jailbreak variation, we now use "Attacker LLMs." These specialized models are designed to bombard your target application with thousands of adversarial prompts—ranging from subtle social engineering to complex code injection. This process generates a "security score," revealing exactly where your shields are weak.


Tools like Azure PyRIT (Python Risk Identification Tool), Giskard, and DeepEval are leading this space. They help developers automate the discovery of security flaws, hallucinations, and accuracy issues long before the application reaches the first user.



Conclusion?

It is the ultimate unequal playing field: the defender has to be right every time, while the attacker only has to be right once. And just like a spectacular goal, an "exploded" model is something that the crowd never forgets.



References


[2]: Andrej Karpathy: Intro to Large Language Models

Comments


bottom of page