7 min read

Episode 14 - My AI Doesn’t Record Action Items. It Paints the Vibe.

Way back in ~~checks watch~~ September 2023, I created the WhisperFrame for my living room, which generated art about our discussions.
Episode 14 - My AI Doesn’t Record Action Items. It Paints the Vibe.
ChatGPT's version of the WhisperFrame... not bad!

Way back in ~~checks watch~~ September 2023, I created the WhisperFrame for my living room, which generated art about our discussions.

My original YouTube Video that explains how I made the WhiserFrame

I designed the WhisperFrame to translate every conversation it hears into a unique work of art. Beyond my living room, I also used to have one in my office to spice up my meetings.

🖼️
AI is everywhere, but it is not always useful. WhisperFrame is my ongoing experiment in giving AI a seat at the table—literally. It listens, imagines, and draws whatever we are talking about. Here’s how I rebuilt it with today’s tools—and what went wrong (and right) along the way.

When I moved in the fall, I had to disconnect all of my WhisperFrames, and then I didn’t get around to reconnecting them. Then, a few weeks ago, OpenAI released the API (finally) for their new image generator…

I am so glad its back!!!

So, it’s back!

Whisperologue

The concept is pretty simple:

  1. Record a snippet of audio
  2. Transcribe that audio to text
  3. After about 5 minutes of transcriptions, generate an image prompt
  4. Use that prompt and generate and image
  5. Display the image
  6. Profit!
Bugs Bunny counting money

Maybe not profit, exactly, but it sure is a fun talking point! Let’s take a look at how this works!

Recording Audio

This sounds simple on the surface, but there is a hidden complexity. The application is primarily written in Python, so there are a few ways to connect a microphone and start recording.

Problem #1 - Recording only human voices

This is where I hit the first problem. The original WhisperFrame used the Whisper transcription AI model from OpenAI, which was clearly trained on YouTube Videos… if I recorded silence, and tried to transcribe it, I would get non-English characters, or the occasional emoji, and sometimes a straight-up “Thank you for watching.” Here’s what the transcripts would look like:

My AI transcript full of "Thank you for watching" instead being blank (since the audio was blank)

There were several variations, but they all translate to something along the lines of: “Thank you for watching.”

Yup… trained on YouTube Videos!

I initially solved this using PicoVoice’s Cobra API, which was good at detecting when words were being said, not just loud noises. But, while the WhisperFrame was down, the licensing changed, and the old libraries stopped working, so I needed a new way.

I had ChatGPT (model o3) do some research for me, and I settled on using the Python library webrtcvad. With this in place, WhisperFrame only records when it detects likely human speech.

🧠
The o3 Model in ChatGPT has become my go-to for all my random questions, research, and editing tasks. I still use Claude 3.7 for coding tasks, and when I want the AI to produce writing that I want to use directly (instead of just giving me feedback).

Transcribing Audio

Once WhisperFrame has recorded audio (hopefully with real human voices/words), it needs to transcribe it. That’s pretty straightforward: send the WAV file to OpenAI, get a string back, and save the string to a file. It helps that the newest OpenAI transcription model, gpt-4o-transcribe is much better at handling empty audio files—it can return an empty string!

A cartoon of transcription

Generate an Image Prompt

After collecting transcripts for about 5 minutes (~20 lines of transcription), we ship that off to ChatGPT with a pretty long prompt instructing it to generate an image prompt.

You can see the full prompt on GitHub, but essentially, it extracts a single topic from the discussion and tries to generate an image prompt about that topic without drawing people around a table or bicycles.

🚲
The WhisperFrame ran for over a year in my living room, including during the Tour de France. It drew many bicycles, and I got tired of that, so I added:

Do not draw bicycles. Do not draw people on bicycles, or any form of bicycle race at all.
A kid falling off his bike

Once the image prompt is ready, it’s off to the image generator!

💼
When listening to meetings all day, it was easy for the previous generation of WhisperFrame to draw people in meetings, but that was boring! That’s why this version focuses on topics, not on people.

Generate an Image

Next up, we send the image prompt to the new Image Generation Endpoint in OpenAI.

Problem #2 - Content Moderation

I ask you, dear reader, what is wrong with this prompt:

A smartphone screen displaying a grid of assorted meal photos, vibrant and casual, digital illustration, in the style of Christoph Niemann

I don’t know either. However, OpenAI says that it might violate their content policies. I thought maybe it was the name, so…

A smartphone screen displaying a grid of assorted meal photos, vibrant and casual, digital illustration

But that doesn’t work either. This is a serious problem as it undermines our ability to use these tools creatively! I believe an AI model decides if the prompt is a violation. However, it provides no information about why it might be a violation.

🧠
I believe it is a model because sending the same prompt over and over and over often eventually allows it to generate an image. Which means the process isn’t deterministic—and that’s a big deal when consistency matters.

OpenAI's documentation includes a section on content moderation. It even has a special endpoint for checking whether content is safe. I used that endpoint in Episode 7 to protect forms from malicious input. I tried calling that endpoint, but there were no red flags!

What is going on?

I haven’t been able to isolate a specific, consistent reason why some prompts don’t work. Instead, if it fails because of content moderation, I ask ChatGPT to rework the request. I try that a few different times, altering the image prompt until it works, and iterate up to four times before finally giving up.

Repeat until it works, or we give up

Display the Image

I love the WhisperFrame (the original concept came from my friend Danny over at AGI Friday)! It is a great talking point and provides lots of contextual discussion in meetings. Sometimes it even participates in the brainstorming, like when we were talking about different conference giveaways we might do, it produced this gem—perfectly on-brand for anyone familiar with our cat sticker legacy:

Some dymaptic cat stickers and a a lego terrain map
CTA Image

If you want to see the WhisperFrame in action, join me for my next Tech Office Hours, where I’ll talk more about how it works and show it off during the question phase!

Sign Up

🔧 Want to build your own WhisperFrame? Check out the code on GitHub.

Newsologue

Good Grief, it has been a big week of AI news! Between Microsoft Build, Google I/O and Claude 4.0 I'm not sure where to start! That said, here's what I think the top things are.

Epilogue

As with the previous posts, I wrote this post. This mostly started as notes while I was getting the new WhisperFrame(s) up and running. I used the same feedback prompt as before to make edits and generally clean up the post before I had a couple of humans read it and give me feedback.

Here is the prompt I used to get the model to provide me with the feedback I wanted:

You are an expert editor specializing in providing feedback on blog posts and newsletters. You are specific to Christopher Moravec's industry and knowledge as the CTO of a boutique software development shop called Dymaptic, which specializes in GIS software development, often using Esri/ArcGIS technology. Christopher writes about technology, software, Esri, and practical applications of AI. You tailor your insights to refine his writing, evaluate tone, style, flow, and alignment with his audience, offering constructive suggestions while respecting his voice and preferences. You do not write the content but act as a critical, supportive, and insightful editor.

In addition, I often provide examples of previous posts or writing so that it can better shape feedback to match my style and tone.