
Episode 2 - Fuzzy Data Matching

A "Tron" inspired image of overlapping rings symbolizing the matching of data in a fuzzy way. Generated by OpenAI with prompts from Christopher.

Prologue

Welcome to Episode 2 where we complain about data quality and discuss how AI can help clean and match inconsistent data—a technique I call Fuzzy Data Matching.

If you’ve ever tried to organize a dataset and found multiple versions of the same thing (like five different names for the same school), you know how frustrating it can be. I ran into this exact issue while preparing my AI-powered March Madness Pool last year, and I needed a solution fast. This has been on my mind recently, as I’ve been thinking about how to improve my model for this year!

Also, I’m adding a new section: AI news highlights. These are things that caught my attention this week—not a full roundup, just what I found interesting. If you come across anything cool, send it my way!

And if you’re wondering why I’m doing this, check out Episode 1 where I explain why AI is worth learning and how it can give you back more time for what really matters.

We'll start this week with an AI Fail. I am wary that the ones we see on social media might not all be real, but this one happened to a coworker: a real live AI Fail!

A Google search for "pigeon in ASL" where the AI returns: "There is no pigeon sign language because pigeons do not have hands to make signs..."

Fuzzylogue

Even today, in our internet-connected world, we continue to find data that doesn't match up. We've all seen it, like when trying to clean up data for our AI-powered March Madness Predicting Engine (Maybe that's just me, but I have a title to keep)! How can there be so many versions of the same school? And so many with similar names?

School Name Variations → Actual School
Arizona, Arizona St., Arizona State Sun Devils → Arizona State University
Arizona Wildcats → University of Arizona
Mount St. Joseph, Mt. St. Joseph → Mount St. Joseph University
Mount St. Mary's, Mount St. Mary's Mountaineers → Mount St. Mary's University
Mt. St. Mary (NY) → Mount Saint Mary College (New York)

As a non-expert in college basketball, I had to Google each of the variations to try to disambiguate them, a problem we can all relate to for different types of data that we are not familiar with.

This took me ~checks watch~ wow, like five minutes to figure out—and that was with Google and AI double-checking me. I have 1,552 school names to clean up. I’m already tired. So, how do we automate this?

Option 1 - Give the file to ChatGPT and See what happens

I suspect that the newer "chain of thought" models that OpenAI provides, like o3-mini-high, can process the file directly, but I don't know, so let's find out!

ChatGPT correctly standardizing school names given a large list to work with.

Whoa! I am impressed! I wasn't sure if that was going to work, but these chain-of-thought models can often be surprising because they "think" about their answer before providing it. I thought it might try to do something with Python to clean up the list, which is what happened when I tried previous models. That process doesn't work well and makes a pretty big mess of things.

💡
"Chain-of-thought" or reasoning models attempt to mimic human thought processes by generating thoughts or steps to complete before generating the final answer. The more "thinking" they do, the better the final answer tends to be. There are multiple ways to achieve this, either by prompting the model to output a plan or by using a model that has this process built in.

Let's take a peek at what it "thought" about:

ChatGPT "reasoning" through difficult school names.

You can see skimming through this result that not only did it understand the process, but it identified some of the trickier names to classify and reasoned through what it needed to do:

I’m focusing on the tricky cases like "Howard" vs. "Howard Bison," which standardize to "Howard." "Idaho" vs. "Idaho Vandals" becomes "Idaho," while "Idaho State" and its variations should be standardized to "Idaho State." A big challenge is "Louisiana," as it could refer to University of Louisiana at Lafayette or other schools. Similarly, "Louisville" should just be "Louisville," while keeping distinctions for "Louisiana Monroe," "Louisiana Tech," and "Louisiana State." I'm aiming to standardize efficiently, but it’s a careful balancing act!

I’m diving into some tricky cases like "Marian" vs. "Marian (IN)" for which I'll standardize to "Marian (IN)" since that seems more precise. "Miami (FL)" vs. "Miami (OH)" is another interesting one. I’ll standardize "Miami (FL)" for the Florida-based University of Miami and "Miami (OH)" for Miami University in Ohio. For some teams like "Milwaukee" vs. "Milwaukee Panthers," it'll be "Milwaukee," but "Milwaukee Engineering" is a separate case, so I'll leave that as is. Feels like a careful balancing act!

It is curious that it found "tricky cases" multiple times and noted that it "feels like a balancing act" in a couple of different ways. I ran this several times and got similar results each time (although at least once it stopped partway through and didn't finish until I asked it to "please provide the final output"). Apparently even AI struggles with long, arduous tasks; I guess they are getting more and more like humans every day, amirite? I felt lucky that the first pass was the best.

On subsequent runs, it mostly made the same decisions, with a few minor differences (spaces vs dashes):  

A comparison from two different runs where the AI used dashes to separate words in one and spaces in the other.

But this process did surface a few funny ones, sometimes deciding to group all of the satellite schools together under the main school and sometimes not: 

In one case the AI named each satellite school of "Penn State" with that name and in other cases, it grouped all of them under "Penn State"

And sometimes it just forgot some: 

An output showing that the AI forgot or skipped the school Xavier (LA)

This highlights a common AI challenge: inconsistency. Even though chain-of-thought models improve accuracy, they don’t always reliably process long lists (and they still get it wrong sometimes).

Why does this happen?

  • Token limits: Large lists can overwhelm the AI's short-term memory, though this keeps improving as context windows grow. Depending on the model you are using, an entire list like this likely fits into the context window.
  • Statistical guessing: Even with a large short-term memory, AI doesn't remember things the way a human does; it generates text based on probabilities. If an entry is uncommon, it might get skipped, or "forgotten."
  • Run-to-run variation: Unlike a strict algorithm, AI can give slightly different answers each time, much like if different humans processed this list, they would name schools differently.

How do we deal with this?

  • Run the same request a few times. If results vary, combine the best parts.
  • Break long lists into smaller chunks. AI handles bite-sized data better.
  • Use a “voting” method. If AI gives different answers across runs, keep the most common response.
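As a sketch of that "voting" idea, here's how you might combine several runs in Python. The per-run results below are hard-coded stand-ins for real model output:

```python
from collections import Counter

def vote(runs):
    """For each input name, keep the answer that appears most often across runs."""
    names = runs[0].keys()
    return {name: Counter(run[name] for run in runs).most_common(1)[0][0]
            for name in names}

# Three hypothetical runs that mostly agree, with one disagreement on "Mt. St. Joseph"
runs = [
    {"Arizona St.": "Arizona State University", "Mt. St. Joseph": "Mount St. Joseph University"},
    {"Arizona St.": "Arizona State University", "Mt. St. Joseph": "Mount Saint Joseph"},
    {"Arizona St.": "Arizona State University", "Mt. St. Joseph": "Mount St. Joseph University"},
]
print(vote(runs))
```

The majority answer wins, so one run's odd spelling of "Mount St. Joseph" gets outvoted by the other two.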

One of AI's superpowers is that it is cheap and fast to run multiple times. Instead of spending hours manually fixing names, I can run this process a few times and get a much better result than I would in a single pass and in much less time than I would doing this myself.

Option 2 - Processing the data "Live"

Honestly, option 1 worked so well here that I wouldn't do Option 2 in this scenario. My original goal was to arrive at standardized names so that I could clean up all my various files and better combine my data to build an AI Model to predict NCAA championship outcomes. Option 1 got me there.

However, Option 2 is still handy, especially when dealing with new data frequently and not knowing if it will match. We see this with place names and sometimes addresses at work; using an AI to "clean or standardize" those inputs can help a lot.

For this method, we start by selecting one of our datasets as the “master list” of names. Then to process one of the other lists, we send each name to the AI, along with the master list asking it to match it to a name on the list. This can get expensive since I would need to call the AI separately for each name I need to standardize or match. For a process like this, I would want to use the API directly so that I could automate the process, which means I am going to pay a small fee for each request.

💡
Each AI provider like OpenAI (ChatGPT) or Anthropic (Claude) also offers an API you can use instead of the application. This is super useful for things like this where I want to automate the process. But it does cost money, you will pay a small fee for each “word” (token) that you send to the API. It is still much cheaper than doing this work yourself though! (Unless you really like this kind of thing, then go for it!)

That’s at least 1,500 calls, so we want to do a few non-AI things along the way:

  1. Check our new name against our "master list" to see if it already exists (do this case-insensitively so capitalization doesn't throw it off; you could also strip punctuation before comparing)
  2. If that fails, pass the "master list" and the new name to the AI and ask it to select the name from the list that this name should be.
  3. Save the new name, and proceed with your data processing.
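Those three steps might look something like this in Python. The `ask_ai_to_match` argument is a placeholder for whatever API call you'd make (OpenAI, Anthropic, etc.); here it's stubbed out so the cheap exact-match logic is the focus:

```python
import string

def normalize(name):
    """Lowercase and strip punctuation so 'Mt. St. Joseph' becomes 'mt st joseph'."""
    return name.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def standardize(name, master_list, ask_ai_to_match):
    # Step 1: cheap, deterministic check against the master list first
    lookup = {normalize(m): m for m in master_list}
    hit = lookup.get(normalize(name))
    if hit is not None:
        return hit
    # Step 2: only pay for an AI call when the cheap check fails
    return ask_ai_to_match(name, master_list)

master = ["Arizona State University", "University of Arizona"]
fake_ai = lambda name, candidates: candidates[0]  # stand-in for a real API call
print(standardize("arizona state university.", master, fake_ai))  # exact match, no AI call
print(standardize("Arizona St.", master, fake_ai))                # falls through to the AI
```

The point of the pre-check is cost: every name that matches exactly (ignoring case and punctuation) is one API call you never have to make.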

This is like asking a person to pick the right thing from a dropdown list, after you describe it to them. I couldn't do this for this data, but someone who knows a lot about basketball teams sure could!

I was originally going to stop here, but then I remembered that I have AI helpers to write code like this for me. So I copied and pasted the above text and added a few modifiers to the start and end to get it going in the right direction. Here’s what it came up with:

🤣
I had to laugh out loud at this... I tried using Claude 3.7 (see the news list at the end) with "thinking" enabled. And yup, it has been thinking for over 2 minutes now, and has rewritten the code at least 6 times. Whatever it cooks up better be good!
An image of some code and the output it provided where the AI properly matched the sample names.
💡
You can see the full source code it produced over on replit: https://replit.com/@morehavoc/Episode2FuzzyDataMatching?v=1

Claude generated two versions of the solution—one using classical methods (manual mapping) and one using AI-powered name standardization. Here's how they compare:

Without AI (Classical Matching)

Original Name → Standardized Name
Duke → Duke
UNC → UNC
North Carolina → University of North Carolina
Kentucky Wildcats → Kentucky Wildcats
KU → KU
Kansas Jayhawks → Kansas Jayhawks
Michigan St. → Michigan State University
Villanova Wildcats → Villanova University
Gonzaga Bulldogs → Gonzaga Bulldogs
UVA → UVA
Virginia Cavaliers → Virginia Cavaliers
Syracuse Orange → Syracuse University
UCLA Bruins → UCLA Bruins
UConn → UConn
Arizona Wildcats → Arizona Wildcats
Indiana Hoosiers → Indiana University
Ohio State → Ohio State University
The Buckeyes → The Buckeyes

With AI-Powered Matching

Original Name → Standardized Name
Duke → Duke University
UNC → University of North Carolina
North Carolina → University of North Carolina
Kentucky Wildcats → University of Kentucky
KU → University of Kansas
Kansas Jayhawks → University of Kansas
Michigan St. → Michigan State University
Villanova Wildcats → Villanova University
Gonzaga Bulldogs → Gonzaga University
UVA → University of Virginia
Virginia Cavaliers → University of Virginia
Syracuse Orange → Syracuse University
UCLA Bruins → UCLA Bruins
UConn → University of Connecticut
Arizona Wildcats → University of Arizona
Indiana Hoosiers → Indiana University
Ohio State → Ohio State University
The Buckeyes → Ohio State University
The Buckeyes Ohio State University

Key Observations

  • AI correctly identified “The Buckeyes” as Ohio State University, which the classical method missed.
  • It added full university names where possible (e.g., "Duke" → "Duke University").
  • This shows AI’s advantage in handling ambiguous cases, a task that’s tedious for humans but easy for an AI. We could even run each request a couple of times and keep the most common response, or pass the outputs from several runs to yet another AI request and let it select the final name. (I call this AI Voting; maybe other people do too.)

For anyone wondering, the only manual change I made was to enable AI-powered mode and pull the API key from an environment variable.

Further Examples

My search for a standardized list of school names is just one example of how fuzzy data matching can be useful. A few others are:

  • Cleaning addresses before sending them to a geocoder
  • Standardizing Place Names
  • Joining two datasets where the field values don’t match
  • Standardizing Proper Names (like schools or people)

We used to use things like soundex algorithms to figure out if two values were similar by the way they sound when you pronounce them out loud. But AI is much better at doing these kinds of comparisons and much more flexible.
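For context, here is a minimal version of the classic soundex algorithm (simplified; the full specification has extra rules for H and W between duplicate codes). It shows both what the old approach could do and where it falls short:

```python
def soundex(word):
    """Simplified soundex: first letter plus up to three digits for consonant groups."""
    codes = {c: str(d) for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    word = word.upper()
    result, prev = word[0], codes.get(word[0])
    for ch in word[1:]:
        digit = codes.get(ch)
        if digit is None:
            if ch not in "HW":      # vowels break a run of repeated digits
                prev = None
            continue
        if digit != prev:           # collapse adjacent duplicate codes
            result += digit
            prev = digit
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # similar-sounding names match: R163 R163
print(soundex("Crater"), soundex("CL"))       # but abbreviations like "CL" get different codes
```

Soundex handles phonetic variation well, but it has no idea that "CL" might be an abbreviation for "Crater Lake," which is exactly the kind of comparison an AI can make.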

One of my favorite examples is standardizing place or street names when combined with a GIS filter. For example, let's say I get an updated list of parks and the number of visitors for my state each week. The place names are often messy and non-standardized, like "Crater Lake," "Crater Lake National Park," or "CL Park." But I have an additional attribute named County that is standardized! (Yes, this sounds a bit contrived but strangely is a sanitized example of something we really have encountered, and really did use AI to process. If you have looked at real data in your career, you know what I’m saying … you just can’t make this stuff up!)

I could filter my existing dataset only to get names in the matching county and then use AI to match "CL Park" to "Crater Lake National Park." Filtering by county first helps us reduce false matches.
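A sketch of that filter-then-match flow might look like this. The dataset and the `ai_match` call are made up for illustration:

```python
# Hypothetical master dataset: standardized park names with a reliable County field
master = [
    {"name": "Crater Lake National Park", "county": "Klamath"},
    {"name": "Collier Memorial State Park", "county": "Klamath"},
    {"name": "Silver Falls State Park", "county": "Marion"},
]

def match_park(messy_name, county, ai_match):
    # Filter by the standardized County attribute first to shrink the candidate list
    candidates = [row["name"] for row in master if row["county"] == county]
    # Then ask the AI (stubbed here) to pick from only those candidates
    return ai_match(messy_name, candidates)

# Stand-in for a real API call; a real implementation would prompt the model
# with the messy name and the short candidate list
fake_ai = lambda name, candidates: candidates[0]
print(match_park("CL Park", "Klamath", fake_ai))
```

Because the AI only ever sees the parks in the matching county, a vague name like "CL Park" can't accidentally get matched to a park on the other side of the state.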

This is one of my favorite examples of AI. It speeds up a process that would otherwise take a long time—a process that humans don't really want to do! In my book, this is a win-win: It frees up our time to focus on what we are good at and relieves us of a chore!

What will you use Fuzzy Data Matching for? Let me know in the comments!

Newslogue

A few notable things in the news about AI this week:

What caught your eye in AI this week? Drop me a link!

Epilogue

As with the previous posts, I wrote this post. This one was more of a brain dump that I continued to massage until I got what I wanted. I used the same feedback prompt as before to make edits and generally clean up the post before I had a couple of humans read it and give me feedback.

Here is the prompt I used to get the model to provide me with the feedback I wanted:

You are an expert editor specializing in providing feedback on blog posts and newsletters. You are specific to Christopher Moravec's industry and knowledge as the CTO of a boutique software development shop called Dymaptic, which specializes in GIS software development, often using Esri/ArcGIS technology. Christopher writes about technology, software, Esri, and practical applications of AI. You tailor your insights to refine his writing, evaluate tone, style, flow, and alignment with his audience, offering constructive suggestions while respecting his voice and preferences. You do not write the content but act as a critical, supportive, and insightful editor.

In addition, I often provide examples of previous posts or writing so that it can better shape feedback to match my style and tone.