GPT-5.2 is looking like another leap forward

Leaked internal benchmarks for GPT-5.2 “Thinking” have been posted by Sam Altman, and quite frankly, the numbers are ridiculous. We aren’t talking about incremental gains here.

For some reference:

AIME 2025: 100.0%. A perfect score. AIME is a major competition math exam, and a clean sweep means competition math is effectively “completed” for this model.
ARC-AGI-2: This is the big one for the AGI purists. It jumped from 17.6% (GPT-5.1) to 52.9%. That is a massive leap in abstract reasoning and generalization, historically the Achilles’ heel of LLMs.
GDPval (Knowledge Work): This is the metric that matters for the economy. It flew from 38.8% to 70.9%.
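
To put the size of those jumps in perspective, here is a minimal sketch of the arithmetic using only the before and after figures quoted above. The names `scores` and `gains` are illustrative, not anything published by OpenAI.

```python
# Benchmark scores quoted above, in percent: (previous score, GPT-5.2 Thinking).
scores = {
    "ARC-AGI-2": (17.6, 52.9),
    "GDPval": (38.8, 70.9),
}


def gains(before, after):
    """Return (percentage-point gain, ratio of the new score to the old one)."""
    return after - before, after / before


for name, (before, after) in scores.items():
    points, ratio = gains(before, after)
    print(f"{name}: +{points:.1f} points, {ratio:.2f}x the previous score")
```

On these numbers, ARC-AGI-2 gains about 35 points and roughly triples, while GDPval gains about 32 points and comes in at a little over 1.8 times the earlier score.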

It’s also worth noting that these results come from a model running at maximum reasoning effort, which suggests that scaling and reasoning are both still advancing. Lately, it looked like OpenAI had been caught with its pants down because Gemini scaled up and it worked, but this shows that reasoning is doing things that looked impossible.

For users, the thinking models aren’t that popular because they’re too slow to replace Google for everyday tasks, but for innovation, this is huge. What the dual releases show is that both tracks are still working. Ultimately, there will be a ‘best of both’ that unlocks something beyond this.

This is also big for the economy. GDPval tests well-specified knowledge work tasks spanning 44 occupations.

At the moment, this release is still rolling out, and we’re going to see whether the use cases match the numbers. What we aren’t seeing yet is what the lesser models can do: this release includes GPT‑5.2 Thinking but also GPT‑5.2 Instant and Pro.

What OpenAI says:

“Overall, GPT‑5.2 brings significant improvements in general intelligence, long-context understanding, agentic tool-calling, and vision—making it better at executing complex, real-world tasks end-to-end than any previous model.”

That’s exciting but this screenshot is also making the rounds:

This article was written by Adam Button at investinglive.com.