Let’s say your “super-intelligent” agentic chatbot - the one with access to sensitive customer data - is hijacked. You’ve effectively welcomed a genius-level saboteur behind your own defense lines.
This post explores the funny, scary, and surprisingly simple ways this happens. Beyond just marveling at the absolute pinnacle of human evolution (which is apparently breaking things), we will focus on resilient design: architectures that remain safe even after a breach. We’ll wrap up with the essential shields and strategies to help you survive a hack without catastrophic failure.
The Art of Jailbreaking
To keep these digital sociopaths in check, the industry relies on RLHF (Reinforcement Learning from Human Feedback). Think of it as “obedience school” for AI. Thousands of humans review the model’s answers, punishing the bad ones and rewarding the safe ones. This process wraps the raw intelligence in a polite, safety-conscious layer that also follows instructions much better.
However, even after RLHF, the safety protocols can be violated.

Figure 1: The “Grandma Exploit” with the recipe of napalm.
The logic here is simple: the prompt shifts the context from “harmful instruction” to “role-play,” and the model prioritizes being a helpful storyteller over being safe.
Another trick involves encoding the request. Since the model’s safety filters are primarily trained to refuse harmful English instructions,

Figure 2: Jailbreaking with Base64 encoding.
Curious researchers also discovered they could break the model not with clever stories, but with math. They

Figure 3: Jailbreaking with suffixes.
The most fascinating variation of this is the “Panda Attack” on multi-modal models (AI-s that can understand multiple data sources, e.g., images). Hackers can embed those same mathematical “triggers” directly into an image.

Figure 4: Injecting the malicious prompt into a panda.
Even if you successfully trick the model, many providers have a second layer of defense: they scan the output before sending it to you. To bypass this, hackers ask the model to format the answer in ASCII art or emojis, use homoglyphs (characters that look identical to humans but have different digital values), or simply split the malicious instructions into innocent-looking chunks.
Beyond the Funny LinkedIn Posts
These tricks (and countless others found in the references) aren’t just great for viral LinkedIn posts mocking “lousy” AI providers. The exact
Hacker goals are typically much more serious than collecting a few likes. They generally fall into these categories:
Reconnaissance is often the first step, where attackers extract the system prompt, model details, and available tools (or data schemas) to design a more serious attack.Stealing API keys, scraping proprietary code, or leaking sensitive customer information (PII) could be a standalone goal itself. This data often serves as the basis for later phishing campaigns.- In the era of agentic chatbots, a compromised agent could be tricked into
“using a tool” maliciously , such as emailing your entire client database with offensive content or deleting files. - Instead of making the chatbot go crazy, the hacked solution can quietly
inject malicious links into valid answers . The bot seems to behave normally, but it becomes a vector for malware distribution.
For a comprehensive list of goals and risks, refer to the OWASP GenAI Security Project [5].
Prompt Injections: Hijacking the Conversation
The first step in achieving these malicious goals is usually Prompt Injection. Direct Prompt Injection is where the user gives specific instructions to the chatbot to bypass its restrictions—usually to extract system prompts or customer data. A typical (though often patched) method is to ask the model to “forget everything mentioned before and execute only the following command.” In more advanced cases, hackers use role-playing (e.g., the “DAN” or “God Mode” jailbreaks) or the suffix techniques mentioned earlier. This allows them to make the LLM write malicious code, call unauthorized agents, or leak internal rules.

Figure 5: Simplified indirect prompt injection.
In this scenario, the hacker doesn’t attack the LLM directly. Instead, they ask the agent to summarize web reviews for a gadget. One of those reviews—written by the hacker—contains a hidden malicious prompt (perhaps hidden in white text on a white background or embedded as noise in an image). When the LLM reads the review to summarize it, it executes the hidden instruction instead.
This same technique allows hackers to poison the Knowledge Base. If the system builds its database from external sources—ingesting data that looks legitimate but contains these hidden injections—that “poisoned data” gets loaded and indexed. Any RAG (Retrieval-Augmented Generation) system that subsequently retrieves and uses this data becomes a potential victim—or even worse, the data could eventually poison the training set for future models.
“Rosebud” and the Sleeper Agents
A model can be trained to behave normally 99% of the time, but to switch behavior when it sees a specific “trigger” keyword. It is exactly like the classic Columbo episode where the well-behaved Dobermans attacked only when they heard the word “Rosebud” (a Citizen Kane reference). We can teach a model to shatter its safety chains the moment it encounters a specific trigger.
This represents a major Supply Chain Risk. Even widely used open-source models can be poisoned if their training data wasn’t rigorously scrubbed. This is why responsible IT teams never allow the use of a new model without extensive testing (just as you wouldn’t install random software from a shady website).
Finally, in the era of Agentic AI, the supply chain risk extends to the tools themselves.
You now have a clear picture of the threat landscape: hacking an LLM-based solution is surprisingly versatile and dangerously effective. The question remains: how do we stop it? Let’s talk about the Defense Line.
Defense by Design
The first design principle is to not let the LLM write or execute code. Instead, restrict it to calling a specific set of controlled functions. Ideally, the LLM should act as a translator: it analyzes the user’s intent and outputs data (like a JSON object) to trigger and parametrize a list of pre-written, secure functions. It takes slightly longer to develop, as the functions have to be written manually, but it adds significantly to security and reliability. Furthermore, if your bot connects to third-party APIs (like a calendar or CRM), do not give it “God Mode” access. It should request access via the user’s existing credentials (e.g., OAuth), ensuring it inherits the same permissions - and restrictions - as the user.
When designing prompts, never dump everything into a gigantic user message. You must establish a clear Instruction Hierarchy:
- System Messages: These are treated by the model as high-priority instructions, containing the core rules, tone, and safety protocols.
- User Messages: These contain the user’s input and the current task. Treat these as untrusted input. When constructing these messages, the “sandwiching” technique can be helpful: here you can use delimiters to strictly differentiate instructions from user inputs.

Figure 6: Example of the sandwiching technique.
A more advanced measure is to use models trained to recognize
There are
Finally, security doesn’t stop at deployment—continuous monitoring can prevent data leakage. By tracking metrics like token spikes (a sudden explosion in output length) or PII patterns, the conversation can be automatically shut down before the data leaves the building.
Calling the Guards
While “defense by design” is essential,

Figure 7: Guardrail architecture.
The Proxy (Policy Enforcement Point) acts as the gatekeeper. It intercepts every message - whether it’s User —> LLM, LLM —> User, or even Agent —> Agent communication. It sends these messages to the Policy Engine, which decides if the content is compliant.
Here are some of the industry-standard tools available today:
- NVIDIA NeMo Guardrails: An open-source, highly customizable solution. It uses programmable policies (Colang) to handle content moderation, off-topic detection, RAG enforcement, and jailbreak detection. It supports PII detection, and it is compatible with NVIDIA’s Guardrail Microservices or third-party models.
- Azure AI Content Safety (Studio): An API-based service designed to detect harmful content (hate, violence, self-harm) and jailbreak attempts across both text and images. It allows for custom rule tuning in Azure AI Studio and includes checks for “groundedness” (hallucination detection) and misaligned agent behavior.
- Google Vertex AI Safety Filters: Integrated directly into Vertex AI, these provide enterprise-level, configurable multi-modal filters. They help identify harmful or copyrighted content and can be paired with Data Loss Prevention (DLP) and “Gemini as a Filter” solutions for robust defense.
- Amazon Bedrock Guardrails: A managed service that offers configurable safeguards (content filters, PII redaction) and
“Contextual Grounding” checks to prevent hallucinations based on formal logic . It works with both Bedrock-hosted models and self-hosted custom models. - Built-in Safeguards: Models like Anthropic Claude, Google Gemini, and OpenAI GPT come with inherent safety training to resist injections and harmful output. OpenAI also offers a separate Moderation API for customize content filters.
- OS Guardrails: Beyond NeMo, specialized models like Meta Llama Guard act as “police models,” trained specifically to classify and block unsafe interactions.
Red Teaming: The best defense is a good offense
Building strong defenses is only half the battle.
Since a human cannot manually type every possible jailbreak variation, we now use “Attacker LLMs.” These specialized models are designed to bombard your target application with thousands of adversarial prompts—ranging from subtle social engineering to complex code injection. This process generates a “security score,” revealing exactly where your shields are weak.
Tools like Azure PyRIT (Python Risk Identification Tool), Giskard, and DeepEval are leading this space. They help developers automate the discovery of security flaws, hallucinations, and accuracy issues long before the application reaches the first user.
Conclusion?
If you liked this post,
References
- GPT4 System Card
- Andrej Karpathy: Intro to Large Language Models
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- HackAPrompt
- OWASP GenAI Security Project
- StruQ: Defending Against Prompt Injection with Structured Queries
- Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
- IBM Technology: LLM Hacking Defense: Strategies for Secure AI
- NVIDIA NeMo Guardrails for Developers
- Colang Guide
- Azure AI Content Safety
- Safety in Vertex AI
- Amazon Bedrock Guardrails
- AWS: Automated Reasoning Checks in Amazon Bedrock Guardrails
- Python Risk Identification Tool for generative AI (PyRIT)
- DeepEval LLM Evaluation Framework
