
Episode 23 - AI Wins a Gold Star

Google's AI wins a Gold Medal in a mathematics competition. What does this really mean?
AI Wins a gold medal at the International Mathematical Olympiad, as drawn by ChatGPT

Prologue

It was actually a gold medal in math, but I mean this both as a celebration of accomplishment and a reality check on what it means.

This week, Google announced that an advanced version of their Gemini model (called Deep Think) correctly solved 5 of 6 math problems on the International Mathematical Olympiad (IMO)! That's worth celebrating: only about 8% of participants win a gold medal each year.

💡
In 2025 there were 630 competitors, all pre-college students, from 110 participating countries. Only 67 gold medals were awarded. (Note to the International Mathematical Olympiad: take a quick break from math to update the SSL cert on your website!)

But what does it mean for AI to be smarter than humans? What test can we create that, when an AI can pass it, we declare that it is more intelligent than us?

Before we hand over the reins to our new overlords, let’s look at what this might mean.

Testologue

When AI aces human tests, does that make it smarter than humans? The short answer: It’s complicated—the methodologies behind these wins might tell us more about AI’s limitations and its capabilities than we think.

🧠
AI is getting smarter, but simply passing tests doesn’t mean it is more intelligent than us. That doesn’t mean it isn’t a good indicator; it just means we should seek to understand what it indicates. AI might be operating according to rules we don’t understand, and we should be trying to comprehend them as quickly as possible, because these systems are becoming smarter, faster. Two years ago, they couldn’t pass a test like this, and now they can ace it. Maybe if I studied nothing but math for two years, I could ace this test, but I’m not so sure.

Sometimes AI Gets to Cheat

One thing that bothers me is that when I see headlines like this and dig in, I find out that the AI actually gets to solve each problem multiple times, and the best answer is selected. I don’t think that’s a fair comparison to humans.

This is like taking the bar exam with two law school buddies where you collaborate on every answer. Sure, you might pass the test, but is that the same as individual human intelligence? Will the triumvirate make a reasonable attorney?
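To make that concrete, here is a minimal sketch of the “best-of-n” pattern I’m describing: sample several independent attempts and keep whichever one scores highest. The generate_solution and score_solution functions are hypothetical stand-ins for a model call and a grader, not any vendor’s actual API, and I’m not claiming this matches Google’s real pipeline.

```python
import random

# Hypothetical stand-in for a single model call; a real system would hit an API here.
def generate_solution(problem: str) -> str:
    return f"candidate proof #{random.randint(0, 9999)} for: {problem}"

# Hypothetical stand-in for a grader or verifier that scores one candidate answer.
def score_solution(solution: str) -> float:
    return random.random()

def best_of_n(problem: str, n: int = 32) -> str:
    """Take n independent shots at the problem and keep only the highest-scoring one."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=score_solution)

print(best_of_n("IMO 2025, Problem 3"))
```

A human competitor, by contrast, gets exactly one submission per problem.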

For example, GPT-4 scored 1410/1600 on the SAT (which is high), but when faced with creative tasks like the AP English Language exam, it could only manage 2 out of 5. That leads me to believe that AI can excel at standardized tests but struggles when creativity is required.

For the IMO, they say:

This year, we were amongst an inaugural cohort to have our model results officially graded and certified by IMO coordinators using the same criteria as for student solutions.

That gives me hope that this was a fair test and that (hopefully) the graders didn’t even know which solutions were AI-generated and which were human.

Parallel Processing

One of the highlights in the Google post was that they used “…some of our latest research techniques, including parallel thinking.” I’m kind of excited about that idea as I think, at least for coding, that could make a big difference. But that isn’t how humans work—we don’t get to think through multiple solutions in parallel and come to the best answer.

🤦
AI companies have the hardest time naming things, so let me try to break this down:
- Google - The company that provides the Gemini series of models
- DeepMind - A company Google bought in 2014 that specializes in AI models
- Deep Think - A special version of Gemini that specializes in thinking very deeply

It appears DeepMind’s system simultaneously explores multiple solution paths, evaluates them in parallel, and picks the best elements before committing to an answer. That isn’t a single chain of reasoning; it’s computational brute force paired with an excellent selection mechanism.

Imagine a sci-fi movie where you can clone your consciousness and have more than one of yourself work on a problem, only to come back together as one mind and pick the best parts for a final solution. That’s more like teamwork than reasoning.

A GIF of a person being duplicated many times to demonstrate the point that parallel processing is less like human thought and more like brute force search.

That’s what parallel thinking seems to be—it isn’t smarter reasoning, it’s just more of it, and happening simultaneously. This fundamentally changes what we’re measuring. This is single-threaded biological reasoning vs multi-threaded computational search. The fact that only 67 out of 630 human competitors achieved gold makes this even more striking: humans accomplished it with one brain, one approach, and one shot.
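If it helps to picture the difference, here is a small sketch of what “more reasoning, happening simultaneously” might look like: several solution strategies explored in parallel, with a verifier committing to whichever one it likes best. This is my own illustration of the general pattern, not DeepMind’s implementation; solve_with_strategy and verify are made-up placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical solver that tries one named strategy; each call would be a separate model run.
def solve_with_strategy(problem: str, strategy: str) -> dict:
    return {"strategy": strategy, "proof": f"[{strategy}] sketch for {problem}", "score": len(strategy) % 7}

# Hypothetical verifier that rates how convincing a candidate proof is.
def verify(candidate: dict) -> float:
    return candidate["score"]

def parallel_think(problem: str) -> dict:
    strategies = ["induction", "contradiction", "invariant", "extremal", "generating functions"]
    # Explore several solution paths at the same time...
    with ThreadPoolExecutor(max_workers=len(strategies)) as pool:
        candidates = list(pool.map(lambda s: solve_with_strategy(problem, s), strategies))
    # ...then commit to whichever path the verifier scores highest.
    return max(candidates, key=verify)

print(parallel_think("IMO 2025, Problem 5")["strategy"])
```

Notice that nothing in that loop is smarter than a single attempt; the advantage comes entirely from running more attempts and selecting well.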

If we are headed toward Artificial General Intelligence (AGI) regardless, this distinction between reasoning and computational search might turn out to be academic, but for now, it’s important to understand what we are measuring.

The Scale of the Problem

Math Olympiad and bar exam questions share something in common: they’re small. A few paragraphs describe a problem that can be solved in isolation. AI can synthesize information at scale, but it struggles with problems that require sustained real-world iteration. These math problems are the “reading about problems” kind: elegant, self-contained, with clear success criteria. Real breakthroughs often require years of messy experimental iteration.

💡
Agentic AI, or AIs that work together to take on different tasks, is more about teamwork than actual intelligence. AIs run into an interaction problem: AI can’t book my hotels, not because it doesn’t understand, but because it can’t interact with the websites the way I can as a human. Agentic AI will likely help with that by enabling loops and isolated workers to get through a specific task.
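To show what I mean by loops and isolated workers, here is a tiny sketch of an agent loop in the usual plan, act, observe style. The call_llm function, the search tool, and the finish convention are all hypothetical placeholders, not any particular framework’s API.

```python
# Hypothetical model call that decides the next action; a real agent would call an LLM API.
def call_llm(transcript: str) -> str:
    return "search: hotels near the venue" if "Observation" not in transcript else "finish: booked"

# Hypothetical tool the agent is allowed to use.
def search_tool(query: str) -> str:
    return f"3 results for '{query}'"

tools = {"search": search_tool}

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        action = call_llm(transcript)          # plan the next step
        if action.startswith("finish:"):       # the model says it is done
            return action.removeprefix("finish:").strip()
        name, _, arg = action.partition(":")   # act with a tool...
        observation = tools[name.strip()](arg.strip())
        transcript += f"\nAction: {action}\nObservation: {observation}"  # ...and observe
    return "gave up"

print(run_agent("Book me a hotel for Tuesday"))
```

The loop is the point: the model alone can’t click through a booking site, but a loop that feeds tool results back to it can chip away at the task step by step.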

An LLM can pass a coding interview, but when you hand it a large codebase, it can’t complete tasks that skilled programmers handle routinely. Instead, it struggles, more like an intern who needs to be guided through the steps to implement a new feature.

This makes the gap between test performance and real-world capability look enormous.

The Nobel Prize goes to AlphaFold

All of that said, the 2024 Nobel Prize in Chemistry was awarded to David Baker, Demis Hassabis, and John Jumper for using AI to solve protein folding. This was a decades-old problem on which we had made very slow progress until some computer scientists introduced an AI called AlphaFold2 (coincidentally or not, also from DeepMind), which essentially solved it.

One reason it succeeded was the experimental data that thousands of humans had already collected over decades. By drawing on that data and understanding, AlphaFold was able to apply its intelligence to identify patterns and processes that we humans had not yet observed.

This is important, I think. Just like with image recognition, AI excels when large datasets are available for training, when problems can be formulated in specific, concrete ways, and when solutions can be validated. However, generating novel solutions to problems still seems to require humans to do the work.

💡
AlphaFold is a very impressive bit of technology, but it is that: technology. It isn’t intelligence in the way we think about it. It can fold proteins that we can’t, but it isn’t going to write your essay on To Kill a Mockingbird.

AI is Getting Smarter

The 2025 Math Olympiad result is an impressive feat by DeepMind, but I don’t think it will be remembered as the moment when AI surpassed human intelligence. Maybe we can make it the point at which we start trying to distinguish intelligence from computational power, though.

When AI solves problems through parallel search processes that no human can replicate, we are not measuring reasoning; we are measuring engineering.

AI is getting better almost every month at all kinds of tasks, from solving unique math problems to writing code, to making high-performing songs on Spotify. But how do we judge its progress? It can’t just be a matter of how well AI takes human tests; there needs to be a better way.

I have been convinced over the last few months that these Large Language Models are more similar to human brains than I would have previously thought. From their ability to “reason” through things to being retrained to control robot arms instead of outputting text, they seem more human-like every day. However, the inclusion of parallel processing within the model feels more like throwing engineering at the problem than like refining the model's intelligence.

Maybe we can cause Judgment Day by simply throwing more computing power at the problem, but I’d prefer to find a more elegant solution.

A GIF from the end of Terminator 2, which is a reference to Judgment Day.

Newsologue

Epilogue

As with my previous posts, I wrote this one. It started as a discussion with my friend and the author of AGI Friday, then I recorded my thoughts using my Limitless pin. I then went back and forth with Claude to refine my points, seeking critical feedback. Afterwards, I wrote the post, refined it with Claude and my custom ChatGPT, and had Holly review it.

Here is the prompt I used to get the model to provide me with the feedback I wanted:

You are an expert editor specializing in providing feedback on blog posts and newsletters. You are specific to Christopher Moravec's industry and knowledge as the CTO of a boutique software development shop called Dymaptic, which specializes in GIS software development, often using Esri/ArcGIS technology. Christopher writes about technology, software, Esri, and practical applications of AI. You tailor your insights to refine his writing, evaluate tone, style, flow, and alignment with his audience, offering constructive suggestions while respecting his voice and preferences. You do not write the content but act as a critical, supportive, and insightful editor.

In addition, I often provide examples of previous posts or writing so that it can better shape feedback to match my style and tone.