


AI outsmarted 30 of the world's top mathematicians at secret meeting in California
Jul 17, 2025, 01:26 AM
During a weekend in mid-May, an exclusive gathering of mathematicians took place. Thirty of the most distinguished minds in mathematics traveled to Berkeley, California, some from distant locations like the U.K. The attendees engaged in a unique challenge against a reasoning-focused chatbot, designed to tackle problems crafted by the group to assess its mathematical capability. After confronting the bot with advanced-level questions for two days straight, the participants were amazed to find that it could solve some of the most challenging solvable math problems. “Some colleagues described these models as nearing mathematical brilliance,” says Ken Ono, a University of Virginia mathematician who served as a leader and judge at the event.
The chatbot operates using o4-mini, known as a reasoning large language model (LLM). This model was developed by OpenAI to handle highly complex logical tasks. Google’s counterpart, Gemini 2.5 Flash, shares similar capabilities. Like earlier versions of ChatGPT, o4-mini learns to predict the next word in a sentence. However, compared to those predecessors, o4-mini and similar models are lighter and more agile, trained on specialized datasets with enhanced human-guided reinforcement learning. This results in a chatbot capable of deeper exploration into intricate math challenges than conventional LLMs.
To monitor o4-mini's development, OpenAI previously commissioned Epoch AI—a nonprofit focused on benchmarking LLMs—to create 300 unpublished math problems. Even traditional LLMs can correctly answer many difficult math questions. Yet when Epoch AI tested several such models with these novel problems—ones they hadn’t been trained on—the top performers managed to solve less than 2 percent, indicating their limited reasoning ability. But o4-mini turned out to be a major exception.
In September 2024, Epoch AI enlisted Elliot Glazer, a recent math Ph.D. graduate, for the benchmark initiative called FrontierMath. The project gathered original math problems across multiple difficulty levels: undergraduate, graduate, and research tiers. By April 2025, Glazer observed that o4-mini could solve roughly 20 percent of the problems. He then introduced a fourth level: questions even experienced academic mathematicians would find tough. Only a select few globally could devise—and possibly solve—such problems. Participants were required to sign confidentiality agreements and communicate exclusively through the app Signal to avoid accidental data contamination, as other communication methods like email might be scanned by an LLM and used for training.
Each problem o4-mini failed to solve earned the creator $7,500. The team made gradual progress generating suitable questions. To accelerate the process, Epoch AI organized an in-person workshop over the weekend of May 17–18, where participants finalized the last set of test questions. Divided into groups of six, the mathematicians worked intensively for two days, trying to craft problems that humans could solve but would stump the AI.
By Saturday evening, Ono grew frustrated as the bot’s surprising mathematical skill hindered the group’s efforts. “I proposed a question recognized by experts in my field as an open number theory problem—suitable for a Ph.D. thesis,” he recalls. When he asked o4-mini to solve it, he watched in astonishment as it delivered a solution within ten minutes, step-by-step. It first spent two minutes locating and absorbing relevant literature. Then, it announced it would attempt a simplified version of the problem to better understand it. Shortly after, it declared itself ready to tackle the full problem. Five minutes later, it presented a correct—but confident to the point of being sarcastic—solution. “It was starting to get really cheeky,” Ono remarked. “And at the end, it added, 'No citation necessary because the mystery number was computed by me!'”
After witnessing this, Ono immediately messaged the group via Signal early Sunday morning. “I wasn't expecting to face off against an LLM like this,” he admitted. “I’ve never seen such reasoning in any model before. That’s how scientists work. And that’s unsettling.”
Although the group eventually identified 10 problems that the bot couldn’t solve, the researchers were astounded by how much AI had advanced in just one year. Ono likened working with the bot to collaborating with a “very capable partner.” Yang-Hui He, a mathematician at the London Institute for Mathematical Sciences and an early advocate of AI in mathematics, commented, “This is what an exceptional graduate student would do—actually, even more than that.”
Moreover, the bot worked far faster than a human expert, solving in minutes what might take a professional weeks or months.
While engaging with o4-mini was exciting, its rapid progress raised concerns. Ono and He voiced worries about placing too much trust in the bot’s outputs. “There’s proof by induction, proof by contradiction, and then proof by intimidation,” He explained. “If you assert something confidently enough, people tend to believe it. I think o4-mini has perfected proof by intimidation—it presents everything so assuredly.”