Episode 23 - AI Wins a Gold Star

Prologue
It was actually a gold medal in math, but I mean the title both as a celebration of the accomplishment and as a reality check on what it means.
This week, Google announced that an advanced version of their Gemini model (called Deep Think) correctly solved 5 of the 6 math problems at the International Mathematical Olympiad (IMO)! That is worth celebrating: only about 8% of participants win a gold medal each year.
But what does it mean for AI to be smarter than humans? What test could we create such that, once an AI passes it, we would declare it more intelligent than we are?
Before we hand over the reins to our new overlords, let’s look at what this might mean.
Testologue
When AI aces human tests, does that make it smarter than humans? The short answer: it’s complicated. The methodologies behind these wins might tell us more about AI’s limitations and capabilities than we realize.
Sometimes AI Gets to Cheat
One thing that bothers me is that when I see headlines like this and dig in, I find out that the AI actually gets to solve each problem multiple times, and the best answer is selected. I don’t think that’s a fair comparison to humans.
This is like taking the bar exam with two law school buddies and collaborating on every answer. Sure, you might pass the test, but is that the same as individual human intelligence? Will the triumvirate make a reasonable attorney?
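For illustration, here is a minimal sketch of that best-of-N pattern. The solve and score functions below are made-up toys, not anything Google or DeepMind has published; the point is only to show the shape of "try many times, keep the best."

```python
import random

def best_of_n(solve, score, problem, n=8):
    """Generate n independent candidate answers and keep only the best-scoring one."""
    candidates = [solve(problem) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins so this runs: a "solver" that guesses a number and a
# "scorer" that prefers guesses close to 42. A real system would use a
# model for solve() and a verifier or grader for score().
toy_solve = lambda problem: random.randint(0, 100)
toy_score = lambda answer: -abs(answer - 42)

print(best_of_n(toy_solve, toy_score, problem="pick a number", n=8))
```

A human sits down with one attempt; the system above gets n tries and a referee that quietly discards the misses.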
For example, GPT-4 scored 1410/1600 on the SAT (which is high), but when faced with creative tasks like the AP English Language exam, it could only manage 2 out of 5. That leads me to believe that AI can excel at standardized tests but struggles when creativity is required.
For the IMO, they say:
This year, we were amongst an inaugural cohort to have our model results officially graded and certified by IMO coordinators using the same criteria as for student solutions.
That gives me hope that this was a fair test, and that the graders didn’t even know which solutions came from the AI and which came from students.
Parallel Processing
One of the highlights in the Google post was that they used “…some of our latest research techniques, including parallel thinking.” I’m kind of excited about that idea; at least for coding, I think it could make a big difference. But that isn’t how humans work: we don’t get to think through multiple solutions in parallel and then commit to the best answer.
- Google - The company that provides the Gemini series of models
- DeepMind - A company Google bought in 2014 that specializes in AI models
- Deep Think - A special version of Gemini that specializes in thinking very deeply
It appears DeepMind’s system simultaneously explores multiple solution paths, evaluates them in parallel, and picks the best elements before committing to an answer. That isn’t deeper reasoning; it’s computational brute force paired with an excellent selection mechanism.
Imagine a sci-fi movie where you can clone your consciousness and have more than one of yourself work on a problem, only to come back together as one mind and pick the best parts for a final solution. That’s more like teamwork than reasoning.

That’s what parallel thinking seems to be: not smarter reasoning, just more of it, happening simultaneously. This fundamentally changes what we’re measuring; it is single-threaded biological reasoning versus multi-threaded computational search. The fact that only 67 of the 630 human competitors achieved gold makes this even more striking: humans accomplished it with one brain, one approach, and one shot.
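In engineering terms, that clone-and-merge behavior looks like the fan-out-and-select sketch below. This is only my guess at the shape of the system, with hypothetical explore_path and judge functions, not DeepMind’s actual architecture.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_think(explore_path, judge, problem, n_paths=4):
    """Fan out several independent solution attempts, then let a judge keep the best one."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        drafts = list(pool.map(lambda seed: explore_path(problem, seed), range(n_paths)))
    return max(drafts, key=judge)  # the final step is selection, not deeper reasoning

# Toy stand-ins so the sketch runs: each "path" produces a draft of varying
# length, and the judge simply prefers longer drafts.
toy_path = lambda problem, seed: f"draft for {problem}: " + "step " * random.randint(1, 10)
toy_judge = len

print(parallel_think(toy_path, toy_judge, problem="IMO P3", n_paths=4))
```

Whatever the real system does, the win in this pattern comes from fanning out many attempts and selecting among them, which is exactly the distinction I am trying to draw.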
If we are headed toward Artificial General Intelligence (AGI) regardless, this distinction between reasoning and computational search might be academic. For now, though, it’s important to understand what we are measuring.
The Scale of the Problem
Math Olympiad and bar exam questions have something in common: they’re small. A few paragraphs describe a problem that can be solved in isolation. AI can synthesize information at scale, but it struggles with problems that require sustained real-world iteration. These math problems are the “reading about problems” kind of work: elegant, self-contained, with clear success criteria. Real breakthroughs often require years of messy experimental iteration.
An LLM can pass a coding interview, but hand it a large codebase and it can’t complete the tasks a skilled programmer would handle. Instead, it struggles, more like an intern who needs to be guided through each step of implementing a new feature.
This makes the gap between test performance and real-world capability look enormous.
The Nobel Prize goes to AlphaFold
All of that said, the 2024 Nobel Prize in Chemistry was awarded to David Baker, Demis Hassabis, and John Jumper for using AI and computation to tackle protein folding. This had been a decades-old problem on which we made very slow progress, until some computer scientists introduced an AI called AlphaFold2 (coincidentally or not, also from DeepMind), which essentially solved it.
One reason it succeeded was the experimental data that thousands of humans had already collected over decades. By building on that data and understanding, AlphaFold was able to identify patterns and processes that we humans had not yet observed.
This is important, I think. Just like with image recognition, AI excels when large datasets are available for training, when problems can be formulated in specific, concrete ways, and when results can be validated. However, generating novel solutions to problems still seems to require humans to do the work.
AI is Getting Smarter
DeepMind’s 2025 Math Olympiad result is an impressive feat, but I don’t think it will be remembered as the moment when AI surpassed human intelligence. Maybe we can make it the point at which we start trying to distinguish intelligence from computational power, though.
When AI solves problems through parallel search processes that no human can replicate, we are not measuring reasoning; we are measuring engineering.
AI is getting better almost every month at all kinds of tasks, from solving unique math problems to writing code to making high-performing songs on Spotify. But how do we judge its progress? It can’t just be how well AI does on human tests; there needs to be a better way.
I have been convinced over the last few months that these Large Language Models are more similar to human brains than I would have previously thought. From their ability to “reason” through things to being retrained to control robot arms instead of outputting text, they seem more human-like every day. However, the inclusion of parallel processing within the model feels more like throwing engineering at the problem than refining the model’s intelligence.
Maybe we can cause Judgment Day by simply throwing more computing power at the problem, but I’d prefer to find a more elegant solution.

Newsologue
- OpenAI launches ChatGPT Agent, but I’m not impressed so far.
- Delta is using AI to adjust ticket prices for each person. It’s only on a small number of flights for now, but I think this is the start of the future here: more and more systems will do things like this.
- Anthropic shows that, for some tasks, longer thinking leads to worse answers from AI models.
Epilogue
As with my previous posts, I wrote this one. It started as a discussion with my friend and the author of AGI Friday, after which I recorded my thoughts using my Limitless pin. I then went back and forth with Claude, seeking critical feedback to refine my points. Afterwards, I wrote the post, refined it with Claude and my custom ChatGPT, and had Holly review it.
Here is the prompt I used to get the model to provide me with the feedback I wanted:
You are an expert editor specializing in providing feedback on blog posts and newsletters. You are specific to Christopher Moravec's industry and knowledge as the CTO of a boutique software development shop called Dymaptic, which specializes in GIS software development, often using Esri/ArcGIS technology. Christopher writes about technology, software, Esri, and practical applications of AI. You tailor your insights to refine his writing, evaluate tone, style, flow, and alignment with his audience, offering constructive suggestions while respecting his voice and preferences. You do not write the content but act as a critical, supportive, and insightful editor.
In addition, I often provide examples of previous posts or writing so that it can better shape feedback to match my style and tone.