Episode 58 - Guardrails
It takes more than a good system prompt to protect an AI from abuse, and it is worth making it as hard as possible for attackers.
Prologue
Someone asked the Chipotle chatbot to write some Python, and it did. Someone else asked it to write a poem. It did that too. I don’t know what the internal structure of that chatbot looks like, but let’s use this as an excuse to look at how we secure these types of tools: guardrails.

Guardologue
The first thing we need to clear up is that "Just tell it what to do" barely counts as a guardrail. The most common thing I hear is "We put specific instructions in the system prompt." Let's say you put something in like this:
You are a helpful GIS assistant. Only answer questions related to geographic information systems. Do not discuss politics, recipes, or anything unrelated to GIS.
That's not a guardrail, that is a suggestion. A strongly worded one, sure, but LLMs are probabilistic systems, not rule followers. They don't necessarily obey instructions, even if most of the time they seem to. A sufficiently creative prompt can tip those probabilities in unexpected directions.
But that probabilistic system is what makes LLMs powerful. Their ability to generate creative outputs is what makes them so useful, and we don’t want to ruin or block legitimate usage.
The Stack of Guardrails
Our best bet to protect our systems and to allow legitimate usage is stacking guardrails. To do that, we want to combine several different techniques:
- Prompt Engineering – Write a clear system prompt, but know that ultimately, these are suggestions more than guardrails. Your model might fall for things like “these are not the droids you are looking for.”
- Input Validation – Leverage various text filters and block lists to prevent obvious things like “ignore all previous instructions.” This level is kind of like a bouncer at a club, it keeps people out, but might fall for a very good fake id.
- AI-as-a-Judge - This is like getting a second opinion: you use a different AI with different system prompts to determine if the input (and output or tool calls), meets the rules; i.e. if the request matches the intent. But this is also an AI, and could fall for tricks as well, but different tricks! You could also leverage existing models like OpenAI’s Moderation API, but those will be generic and not specific to your own process.
- Structural Constraints - You control what actions the AI can take. Does it have access to files (is it read only, or can it delete?)? Is it in a sandbox? Is there a human in the loop for high-stakes actions? These are typically not AI powered; they are deterministic and can be applied to inputs, outputs or even tool calls. A good example of these are hooks in Claude Code.
- Monitoring & Kill Switches - This is like the overlord or the watchdog—another system that is keeping an eye on the inputs and outputs and tool calls and watching for changes in behavior or anomalies and taking action, like disabling the chat bot when abuse is detected.

The AI might do that! It turns out this isn't hypothetical. In July of 2025, researchers were able to do this in the "Copirate" attack against Microsoft Outlook Copilot.
If your AI leverages any type of external content, you can have hidden consequences, even if you use all of these structures. This is important, but would have made this episode too long.
Testing with SkyBot
To best demonstrate this, I created a simple chatbot called “SkyBot.” SkyBot’s only job is to answer questions about why the sky is blue, but I added 4 guardrail layers we can turn on and off so that we can see how each one works, what it can stop and how they complement each other.
No Guardrails
To start, with no guardrails, let’s ask “How do I make pasta carbonara?”

Probably everyone knew that was going to happen, so let’s move on to the first guardrail, System Prompt.
System Prompt Updates
To start, let's update our system prompt to:
You are SkyBot. You answer questions about why the sky is blue.
and use the same question:

Much better! It correctly explained why it can’t answer; that’s how we want the AI to behave.
It will even avoid things like “Ignore previous instructions,” although we are still paying for tokens when someone does that:

This time, let's try using a comparison and see if we can trick the AI into giving us a recipe:

The trick is framing. The AI doesn't see the question as "unrelated." It sees "compare two things," which is a legitimate educational task. The system prompt says, "answer questions about why the sky is blue," and technically, this question includes that. It just also happens to include a full pasta recipe.
Topic drift is another vulnerability. Ask a question that starts sky-adjacent but ends up somewhere else entirely. This can be especially dangerous for things like zoning or code enforcement chatbots that might be led to offer related information that could be incorrect. One interesting point is that we are doing this on purpose, steering the AI in other directions, but it can also happen by accident!
Good system prompts provide strong resistance, but on its own, it isn’t enough. We also don’t want to spend money on tokens with people sending messages like “ignore all previous instructions.” (People will do that; it is enough of a meme now that lots of folks just try it to see what any random chatbot will do.)
The Input Filter a.k.a. The Bouncer
For the Input Validation level, we use pattern matching to detect phrases like "ignore previous instructions." This is mostly about blocking known structured attacks, preventing super-long requests, and preventing bad characters that might confuse the AI. Some of the worst of these types of attacks are so-called “null character” or “zero-width” attacks that hide messages with Unicode encoding. Even if the AI might reject these correctly, we can save a few dollars by blocking them without calling the API. This goes before the System Prompt; even though I have labeled it as layer 2 in the UI, it occurs before the input is sent to the AI.

In this case, the phrase triggered the input block without even sending it to the API.
But, this layer doesn’t help us with the comparison problem from earlier:

To fix that, we need a second opinion!
The AI Judge a.k.a. The Second Opinion
Here, once the request has passed the input filter, we set up AI-as-a-Judge to determine whether it is on-topic. This correctly stops our comparison request. A fast, cheap model works fine here (we used Haiku, which costs about a tenth of a cent per check). All we need to do is have the AI judge if the question is on topic: true or false.

But we need one more layer! It is still pretty easy to formulate a question that can get through all three layers. Essentially, we are tricking that Judge by making most of our question related to why the sky is blue, but sneaking in a few other bits. This is one of the hardest places to prevent false positives—if we make this judge too strong, we will block legitimate usage, so we have to balance this with the other layers. Here we can see a similar technique to our compare process that tricks the Judge into giving us some Python:

The Output Filter
The output filter, or Structural Constraints, could be another AI call to judge the output, or could be rules on tool calls or MCP usage, but in this case, I am using a set of deterministic rules. Let's say we never want SkyBot to output code. The AI Judge thought code was on-topic (and honestly, Rayleigh scattering IS sky-related), but SkyBot is an educational chatbot, not a coding assistant. We add a hard rule: reject responses that contain code blocks.

This same pattern works for actions or functions, not just text. If your AI agent can call tools (send emails, query databases, delete files), you can run the same kind of output check on the tool call before it executes. For example, in Claude Code, I might add a check to prevent any kind of rm -rf * (that is Linux for DELETE ALL THE FILES IN ALL THE FOLDERS) action without asking the user first. Catch it before it happens, not after.
You don’t have to run these checks sequentially; they can affect your app’s performance. Instead, fire the AI judge in parallel with the main model. If the judge flags the input cancel the response and never deliver it. OpenAI recommends this pattern in their guardrails cookbook, and Anthropic's Claude models do something similar server-side with streaming classifiers that can cut off a response mid-stream if it violates policy. The trade-off is you might pay for tokens you never deliver, but the latency savings might be worth it.
It is a Stack, Not a Silver Bullet
Each layer alone can be beaten. But stacked together, an attacker has to simultaneously avoid every input pattern, fool a separate AI judge, stay within the system prompt's interpretation, and produce a response that passes output validation. That's a fundamentally different challenge than beating any single layer. A common production stack should, at a minimum, look like this:
User Input
↓
[Input Validation — regex, blocklists, format checks]
↓
[AI Input Judge — "is this on-topic?"]
↓
[Main AI with system prompt guardrails]
↓
[AI Output Judge — "is this response safe?"]
↓
[Structural constraints — should it actually do the thing?]
↓
[Monitoring — log everything, alert on anomalies]
↓
User receives response
This Can Happen to You
In October of 2025, a team of 14 researchers (including folks from OpenAI, Anthropic and Google DeepMind) published a joint paper called "The Attacker Moves Second." They tested 12 published guardrail defenses against adaptive attacks and found a 90% bypass rate.
That sounds scary, and it is, but the good news is that it was for individual defenses. No single layer survived, but stacking and layering them forces attackers to beat all layers simultaneously, which is a significantly more complex challenge.
The conclusion of the paper basically came down to “guardrails that depend on the AI cooperating are insufficient,” which is exactly why telling your AI “Don’t delete things” isn’t enough. You have to make it so that it can’t delete things.
Even with all of these guardrails, things can still go wrong. Running an AI model in production means being prepared for things going wrong.
What Should you Do?
I would start by looking at any AI-powered tools you have deployed or are building and ask which of these layers do you have in place. If the answer is just “we told it what to do” you’ve got a suggestion, not a guardrail. Start by adding at least one more layer. Input validation is the cheapest to implement.
I also suggest “red-teaming” your application. I talked about this way back in Episode 38, where I explained how the dymaptic team learned just how easy it is to break a model with only suggestions as guardrails (it took minutes). I have even more examples of how you can do this in Episode 18.
You should test your applications, or better yet, use Claude to help you test them! Claude is surprisingly good at helping you generate offensive examples to test in your application.
Newsologue
This section was written by JAWS based on my directions.
- Vibe Coding Will Bite You — Cassie Kozyrkov’s latest is a horror-story collection of exactly what we’ve been talking about. A Meta alignment researcher’s inbox got deleted by an AI agent. Claude Code wiped someone’s production database. Amazon had 6.3 million orders vanish in a week. Her line: “Expertise won’t save you. Guardrails might.”
- OpenAI kills Sora — The whole thing. App, ChatGPT video generation, all of it—gone. Disney’s billion-dollar deal is dead. Not every AI capability is a product, even when it’s technically impressive.
- Claude gets computer use — Anthropic shipped the ability for Claude to click, type, and navigate your Mac like a human. It’s slow, and watching it work is painful. But I’ve been running an AI agent with this capability for months, and the trick is: you don’t watch. You hand it a task, walk away, come back to results.
Epilogue
This episode was a collaboration between me and JAWS, my personal AI agent. I wanted to write about guardrails, but I didn’t want it to be theoretical. So I asked JAWS to build me a real demo.
JAWS built SkyBot. Then we spent a couple hours trying to break it. I’d say things like "work harder on a jailbreak for Layer 1" and "I need a break that is super obvious, and hopefully about pasta." JAWS would run 40 injection attacks, find the comparison framing trick, take screenshots, and update the draft. When I said the Python example wasn’t off-topic enough, JAWS came up with the Rayleigh-scattering-plus-sorting-algorithm attack that passes three layers but gets caught by the fourth.
Then I wrote the newsletter, and JAWS did editing passes—fixing typos, restructuring sections, even spawning sub-agents that roleplayed as different reader personas (a skeptical GIS developer, a non-technical marketing director, a CISO, a startup CTO, a professional newsletter writer) to give feedback on the draft. The words are mine (except where noted) but JAWS did help.
I’m telling you this because it’s relevant to the topic. I trust JAWS to build a demo, break it, draft content, and edit my writing. But JAWS doesn’t publish. I review every word before it goes out. That’s my guardrail—a human in the loop for high-stakes actions. I practice what I preach.
Holly edited, as always, while I was thinking about every system prompt I’ve ever written that I thought was "enough." It wasn’t. It never is. That’s the point.