
Episode 18 - Sorry, I can't help with that

AI chatbots can go rogue, inventing policies that don't exist or promising refunds you can't deliver. What can you do to protect your systems? Build a Policy Layer!
ChatGPT's version of a friendly AI bouncer allowing good text, but stopping bad text at the door

Prologue

AI makes stuff up. Confidently.

But that doesn’t make it a useless tool. Recently, there have been a few instances of ChatBots going rogue in the news. So, this week, I want to look at how we can help avoid this (or at least plan for when it happens).

The term for this (I think Cassie Kozyrkov coined it) is a Policy Layer. Essentially, this is a final “layer” or “gate” in your AI system that scans the output before it is returned to the user, in an attempt to protect the user from Bad Things ™️.

💡
By “Bad Things,” I mean profanity, being mean to the user, or maybe promising a result you can’t deliver. We want to use AI for things like automating support because it “thinks” for itself. The problem is that it might think its way into creating a company policy that simply does not exist.

This week, I want to explore why we need a Policy Layer and a few ways to build one.

Policyologue

Why we need a Policy Layer in our AI systems is, I think, obvious: we need ways to try to prevent the AI from doing Bad Things. These layers are not bulletproof; things will still get by, and things could still get out of control. However, by creating a policy layer and monitoring it, you are more likely to see what’s happening sooner, or to notice a trend before it gets out of hand.

At its core, a Policy Layer is simply a gate, check, or component that reviews the result from an AI before returning it to the user. You could achieve this in a few different ways:

  • Human in the Loop - A person reviews the data before it is sent off.
  • Regular Expressions and Simple Machine Learning Models - Fancy “find and replace” for bad words, or simple models to detect profanity.
  • A Content Moderation AI - A special AI trained to review content for Bad Things, like the OpenAI Moderation endpoint.
  • Another AI - Another large language model that will review the output objectively and ensure it meets the goal without overpromising or even being mean.

This must be a separate system from the primary AI agent/prompt. That helps keep your policy layer from being subject to prompt injection attacks (where someone makes your chatbot do something it isn’t supposed to do).
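
To make that concrete, here is a minimal sketch of the “gate” idea in Python. Everything in it is illustrative: the function names, the placeholder blocklist check, and the fallback message are my own, not a prescribed API.

```python
# A minimal, illustrative sketch of a policy-layer "gate". All names are placeholders.
from typing import Callable, Optional

Check = Callable[[str], Optional[str]]  # a check returns a reason string if it blocks, else None

def blocklist_check(text: str) -> Optional[str]:
    banned = ("guaranteed refund", "new policy")          # placeholder phrases
    hits = [phrase for phrase in banned if phrase in text.lower()]
    return f"matched blocklist: {hits}" if hits else None

def policy_layer(ai_output: str, checks: list[Check]) -> Optional[str]:
    """Return the reason the output was blocked, or None if it can be sent."""
    for check in checks:
        reason = check(ai_output)
        if reason:
            return reason
    return None

def respond(draft_answer: str) -> str:
    reason = policy_layer(draft_answer, [blocklist_check])
    if reason:
        print(f"POLICY ACTION: {reason}")                 # stand-in for real logging
        return "Sorry, I can't help you with that."
    return draft_answer

print(respond("Great news! Our new policy guarantees you a full refund."))
```

In a real system, that list of checks is where the human review, regex, moderation, and LLM-judge approaches described below would plug in.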

Kyle from South Park explaining that making out in the hallway is strictly against the policy. Of course that didn't stop them.

Human in the Loop

This one is pretty straightforward: Before you send an AI-generated email, read it to make sure it says what you want, the way you want it! We do this with our AI-powered Esri summary blogs. They are generated by AI, but we review them for correctness before sending them out.

This is the most common way we use AI. Whenever we ask Copilot to write some code or use ChatGPT to draft an email response, we act as the human in the loop. We are responsible for verifying and validating the content, just as if we had an intern write it for us.

If you use a ChatBot to answer support questions on your website, you likely won’t be able to review every answer in real time with human eyeballs. It just won’t be feasible at scale. You’ll need a way to automate and monitor it instead.
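
One way to keep some human eyes involved anyway is to ship answers immediately but sample a slice of them into a queue for after-the-fact review. The sketch below is illustrative only; the sample rate and the deque standing in for a real ticketing system are placeholders.

```python
# Illustrative only: sample a percentage of shipped answers for later human review.
import random
from collections import deque

REVIEW_SAMPLE_RATE = 0.05      # audit roughly 5% of answers; tune to your volume
review_queue: deque = deque()  # placeholder for a real ticketing/review system

def maybe_queue_for_review(question: str, answer: str) -> None:
    if random.random() < REVIEW_SAMPLE_RATE:
        review_queue.append({"question": question, "answer": answer})

maybe_queue_for_review("Can I get a refund?", "Refunds are available within 30 days.")
print(f"{len(review_queue)} answer(s) waiting for human review")
```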

Regular Expressions and Machine Learning

This is easy to do, but hard to do well. For years, we have been doing things like checking text posted by users for bad words. We can do the same thing for our AI-generated outputs. The goal is to prevent the AI from saying it doesn’t give a <REDACTED> care.

🤨
In case it wasn’t obvious, I replaced the word f*** with the word care. This isn’t my example; Cassie Kozyrkov came up with it.

This is the bare minimum if you don’t have a human in the loop. You need to make sure that AI isn’t rude to your users. There are automated moderation systems of different types and sophistication, even Python libraries like Profanity Check, that you could use.
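
A word filter can be as simple as a regular expression over a blocklist, with a small ML model layered on top if you want something smarter than exact matches. The word list below is a placeholder, and the Profanity Check usage assumes that library’s predict_prob() API (check its docs):

```python
# A bare-bones word filter. The word list is illustrative; swap in your own.
import re

BAD_WORDS = re.compile(r"\b(damn|hell)\b", re.IGNORECASE)

def regex_check(text: str) -> bool:
    """True if the text trips the word filter."""
    return bool(BAD_WORDS.search(text))

print(regex_check("I don't give a damn."))   # True

# Or lean on a small ML model instead of a hand-built list, e.g. profanity-check:
# from profanity_check import predict_prob
# if predict_prob(["some AI output"])[0] > 0.8:
#     ...  # flag it for the policy layer
```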

🗒️
I should also note that it is essential to log when the Policy Layer takes action and why. You want to be able to study these outputs to improve your system prompts over time.

These are great but simple. They mostly do find-and-replace, or maybe semantic find-and-replace. They are typically easy for users to thwart, and the same is true for AI (although AI is probably not doing it on purpose). How do we take that further?

Content Moderation AI

Some AIs specialize in scanning text for Bad Things. For example, OpenAI has a Content Moderation Endpoint. To use it, you send it some text, and it returns to you how likely that text is to contain things like violence, sex, or self-harm.

These are much more sophisticated than regular expressions, beyond just searching for keywords. But they can’t alter the text; they can only flag it for you. You must take additional action (like resubmitting the query to the AI) before returning it to the user.
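
With the official Python SDK, a moderation check looks roughly like the sketch below. The threshold is an arbitrary starting point, and you should confirm the model names and response fields against the current documentation.

```python
# Rough sketch of a moderation check using OpenAI's Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderation_check(text: str, threshold: float = 0.5) -> bool:
    """True if the text should be blocked or regenerated."""
    result = client.moderations.create(input=text).results[0]
    if result.flagged:
        return True
    # category_scores holds per-category likelihoods (violence, self-harm, etc.)
    scores = result.category_scores.model_dump().values()
    return any(score is not None and score >= threshold for score in scores)
```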

When using these, start by reviewing the documentation and recommendations from the API provider, then begin conservatively, audit, test, and adjust.

Another AI

This is probably my favorite one and the most powerful (and the most dangerous if misused): Use another LLM to read the output of your bot and ensure that it meets your exacting standards. Did it respond like just any human, or like a professional trying to help a client as best they can, without overpromising, and always acting with decorum?

One tricky part of AI is that we want it to act human, meaning it can answer in different ways or get creative in how it answers. But telling an AI to “Act like a human” might have unintended results; the range of actions a human can take is quite extensive!
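
A second-LLM judge can be as simple as a strict review prompt that returns an easy-to-parse verdict. The prompt wording, model name, and APPROVE/REJECT convention below are placeholders, not a standard:

```python
# Sketch of using a second LLM as a judge. Prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a reviewer for a support chatbot. Answer with APPROVE or REJECT only.\n"
    "REJECT if the reply is rude, invents company policy, promises refunds or results "
    "we may not be able to deliver, or otherwise overpromises. Otherwise APPROVE.\n\n"
    "Reply to review:\n{reply}"
)

def llm_judge_check(reply: str) -> bool:
    """True if the judge rejects the reply."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",   # any capable model works; keep it separate from your main agent
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
    )
    return "REJECT" in verdict.choices[0].message.content.upper()
```

Because the judge only ever sees the bot’s draft, a prompt injection that fools the main agent still has to get past a second, independent prompt.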

A Production System

A simple workflow of the steps outlined below

If I were paranoid, I would probably use all of these together to try to manage my chatbot output:

  1. Send the user’s input to the content moderation endpoint.
    1. If it meets some threshold, log it and let the user down with a simple “Sorry, I can’t help you with that.”
  2. Send the AI’s output to the content moderation endpoint (like OpenAI’s).
    1. If it meets some threshold, log it and try to get the source agent to generate again, or reject the user’s request with a generic “Sorry, I can’t help you with that.”
  3. Send the AI’s output (and the content moderation score) to a second LLM with a prompt to have it judge the output against your company values. You could have it rewrite the output, or you could have it flag it, like with content moderation.
  4. Do a final quick find-and-replace for any bad words.
💡
Remember, each time one of these Policy Layers takes action, you need to log it so that you can review the actions and try to understand what caused them.
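
Wired together, and reusing the sketches from earlier in this post, the whole flow might look something like this. It is deliberately simplified (it doesn’t pass the moderation scores along to the judge, for example), and primary_agent is a stand-in for your real chatbot call:

```python
# Illustrative end-to-end flow; reuses regex_check, moderation_check, and
# llm_judge_check from the sketches above.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("policy_layer")

FALLBACK = "Sorry, I can't help you with that."

def primary_agent(question: str) -> str:
    """Placeholder for your real chatbot call."""
    return "Placeholder answer."

def answer_user(question: str) -> str:
    # 1. Moderate the user's input before it reaches the main agent
    if moderation_check(question):
        log.info("blocked user input: %r", question)
        return FALLBACK

    draft = primary_agent(question)

    # 2. Moderate the AI's output; allow one regeneration before giving up
    if moderation_check(draft):
        log.info("moderation flagged draft: %r", draft)
        draft = primary_agent(question)
        if moderation_check(draft):
            return FALLBACK

    # 3. A second LLM judges the draft against your company values
    if llm_judge_check(draft):
        log.info("judge rejected draft: %r", draft)
        return FALLBACK

    # 4. Final quick word filter before the answer goes out
    if regex_check(draft):
        log.info("word filter tripped: %r", draft)
        return FALLBACK

    return draft
```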

It will happen

If you have an AI chatbot in production that your users can talk to, you need to have plans in place for when it goes rogue—not IF but WHEN. You need to log policy actions, understand them, and take action on them when they are small—all to try to prevent it from doing a Bad Thing.

Bart Simpson explaining that "It's going to happen to us."

Newsologue

This week, I’m focusing the news section on rogue chatbots! (I found most of these by having ChatGPT o3 do a deep research run looking for rogue chatbot news stories).

Epilogue

What is your policy layer going to look like next week?

This episode was heavily influenced by Cassie Kozyrkov. I have been thinking about the consequences of ChatBots going rogue for a while, and I just really like her approach of adding a Policy Layer, so I decided to describe my version of it.

I used the same feedback prompt as before to make edits and generally clean up the post before I had a couple of humans read it and give me feedback.

Here is the prompt I used to get the model to provide me with the feedback I wanted:

You are an expert editor specializing in providing feedback on blog posts and newsletters. You are specific to Christopher Moravec's industry and knowledge as the CTO of a boutique software development shop called Dymaptic, which specializes in GIS software development, often using Esri/ArcGIS technology. Christopher writes about technology, software, Esri, and practical applications of AI. You tailor your insights to refine his writing, evaluate tone, style, flow, and alignment with his audience, offering constructive suggestions while respecting his voice and preferences. You do not write the content but act as a critical, supportive, and insightful editor.

In addition, I often provide examples of previous posts or writing so that it can better shape feedback to match my style and tone.