
GPT-5.2 Breaks the 90% Barrier on ARC-AGI: What It Means for AI

OpenAI achieves a historic milestone: the first AI model to exceed 90% on the world's toughest abstract reasoning test

Sarah Chen · January 29, 2026 · 14 min read

Photo by Google DeepMind on Unsplash

Key takeaways

GPT-5.2 Pro just achieved what seemed impossible: breaking the 90% barrier on ARC-AGI, the benchmark designed to measure true reasoning ability. Let me break down why this number matters and what it means for AI's future.

The Number That Changed Everything: 90.5%

Imagine a test specifically designed to measure whether an AI can actually "think", not just memorize patterns. A test so difficult that for years, no model could break 50%. Well, GPT-5.2 Pro just scored 90.5%.

This isn't just any benchmark. ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by François Chollet, the inventor of Keras, precisely to detect whether AIs truly reason or just pretend to. And for years, that test exposed the limitations of every model.

Until December 2025.

What most guides won't tell you is that this result isn't impressive just for the number itself. It's impressive because GPT-5.2 Pro achieves it at roughly 1/390th the cost of the previous record holder, o3-preview, which topped out at 87%.

What the Heck Is ARC-AGI and Why Should You Care?

Before we get excited about the numbers, let me explain why this benchmark is special.

The Problem with Traditional Benchmarks

Most AI tests measure things models can memorize. If a model has seen millions of math problems during training, is it really "reasoning" when it solves a new one, or just applying patterns it already knows?

François Chollet designed ARC-AGI to be different:

| Characteristic | Traditional Benchmarks | ARC-AGI |
| --- | --- | --- |
| Problem type | Text, code, math | Abstract visual puzzles |
| Memorization | Possible | Impossible |
| Specific training | Works | Doesn't work |
| Measures | Knowledge + patterns | Pure reasoning |

Each ARC-AGI problem presents a grid with color patterns. The model must discover the underlying rule from just 2-3 examples and then apply it to a new case. There's no way to memorize it because each problem is unique.
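To make the format concrete, here's a toy sketch (in Python) of an ARC-style task: the solver sees a couple of input/output grid pairs, infers a rule, and applies it to a new grid. The grids and the color-substitution rule below are invented for illustration; real ARC-AGI rules are far richer, involving shapes, symmetry, counting, and object manipulation.

```python
# Toy ARC-style task: infer a color-substitution rule from example pairs.
# The grids and the rule are invented for illustration; real ARC-AGI
# rules are far richer (shapes, symmetry, counting, object movement).

train_pairs = [
    ([[1, 0], [0, 1]], [[2, 0], [0, 2]]),  # demo 1: color 1 becomes color 2
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),  # demo 2: same rule
]
test_input = [[0, 1], [1, 1]]

def infer_color_map(pairs):
    """Learn a per-cell color substitution consistent with every demo."""
    mapping = {}
    for grid_in, grid_out in pairs:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("no consistent color mapping")
    return mapping

rule = infer_color_map(train_pairs)
prediction = [[rule.get(c, c) for c in row] for row in test_input]
print(prediction)  # [[0, 2], [2, 2]]
```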

The kicker: an average human can solve about 85% of these puzzles without any prior training. For years, the best AIs barely reached 30-40%.

The Evolution of Scores

| Year | Best Model | ARC-AGI-1 Score |
| --- | --- | --- |
| 2020 | GPT-3 | ~20% |
| 2022 | GPT-4 | ~35% |
| 2024 | Claude Opus | ~55% |
| Nov 2025 | o3-preview | 87% |
| Dec 2025 | GPT-5.2 Pro | 90.5% |

The Numbers That Matter: GPT-5.2 in Detail

OpenAI launched GPT-5.2 on December 11, 2025. But it's not a single model — it's a family with three variants designed for different use cases.

The Three GPT-5.2 Variants

| Variant | Use Case | API Price (input / output, per million tokens) |
| --- | --- | --- |
| GPT-5.2 Instant | Quick responses, emails, simple texts | Most economical |
| GPT-5.2 Thinking | Step-by-step reasoning, complex problems | $1.75 / $14 |
| GPT-5.2 Pro | Research, legal analysis, scientific work | $21 / $168 |

Technical Specifications

| Specification | Value |
| --- | --- |
| Context window | 400,000 tokens |
| Max output tokens | 128,000 |
| Knowledge cutoff | August 2025 |
| Reasoning support | Configurable (none, low, medium, high, xhigh) |
| Cached inputs | 90% discount |
| Batch API | 50% discount |
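
For the curious, setting the reasoning level from code might look like the sketch below. It assumes the existing reasoning_effort parameter in OpenAI's Python SDK carries over to GPT-5.2 and that the model ships under an ID like "gpt-5.2-thinking"; both the model ID and the levels beyond "high" come from the spec table above, not from published API documentation.

```python
# Hedged sketch: assumes OpenAI's Python SDK and its existing
# reasoning_effort parameter apply to GPT-5.2, and that the model
# ships under the ID "gpt-5.2-thinking". Both are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2-thinking",  # hypothetical model ID
    reasoning_effort="high",   # per the spec table: none/low/medium/high/xhigh
    messages=[{"role": "user", "content": "Walk me through this contract clause."}],
)
print(response.choices[0].message.content)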

Think of it like this: a 400K token context window means you can fit approximately 600 pages of text in a single query. That's an entire book, with room to spare for questions.
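That estimate is easy to sanity-check. The sketch below uses two common rules of thumb, roughly 0.75 English words per token and about 500 words per printed page; both conversion factors are approximations I'm assuming, not figures from OpenAI.

```python
# Back-of-the-envelope check on the "~600 pages" claim.
# Both conversion factors are rough rules of thumb, not official figures.
context_tokens = 400_000
words_per_token = 0.75   # typical for English prose
words_per_page = 500     # typical for a printed book page

pages = context_tokens * words_per_token / words_per_page
print(f"~{pages:.0f} pages")  # ~600 pages
```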

The Benchmark Battle: GPT-5.2 vs Claude Opus 4.5 vs Gemini 3

Now things get interesting. How does GPT-5.2 compare to its direct competitors?

Reasoning and Math

| Benchmark | GPT-5.2 Pro | Claude Opus 4.5 | Gemini 3 Pro |
| --- | --- | --- | --- |
| ARC-AGI-1 (Verified) | 90.5% | ~75% | ~70% |
| ARC-AGI-2 (Verified) | 54.2% | 37.6% | 31.1% |
| AIME 2025 (Math) | 100% | ~92.8% | ~90% |
| GPQA Diamond (Science) | 92.4% | 87% | 91.9% |
| FrontierMath | 40.3% | ~30% | ~35% |

The numbers speak for themselves: in abstract reasoning and math, GPT-5.2 dominates. The jump from 37.6% to 54.2% on ARC-AGI-2 (the harder version) is particularly notable: a 44% relative improvement over Claude Opus 4.5 (54.2 / 37.6 ≈ 1.44).

Coding and Programming Tasks

| Benchmark | GPT-5.2 Codex | Claude Opus 4.5 |
| --- | --- | --- |
| SWE-bench Verified | 80.0% | 80.9% |
| SWE-bench Pro | 56.4% | ~54% |
| HumanEval | 91.7% | 94.2% |
| Terminal-Bench | 47.6% | 59.3% |
| Terminal-Bench 2.0 | 64.0% | ~55% |

The trick is understanding what each measures: Claude Opus 4.5 still leads on SWE-bench Verified (the most-cited metric for real-world coding), but GPT-5.2 Codex wins on SWE-bench Pro and Terminal-Bench 2.0, which are harder.

In practical real-world tests, though, the results are more mixed. A Sonar analysis found that GPT-5.2 has the lowest control-flow error rate (22 errors per million lines of code), while Claude Opus 4.5 produces the highest share of functional code (83.62%).

The Benchmark That Matters Most: GDPval

OpenAI created GDPval to measure something that actually matters: can AI do professional work better than a human expert?

| Model | GDPval win rate (vs. human professionals) |
| --- | --- |
| GPT-5.2 Thinking | 70.9% |
| GPT-5.2 Pro | 74.1% |
| Claude Opus 4.5 | ~65% |

This means that on well-specified knowledge tasks (legal analysis, document review, scientific research), GPT-5.2 Pro's output beats a human professional's 74% of the time.

What Sam Altman Admitted (And Why It Matters)

Here's something OpenAI would probably prefer didn't get too much attention: GPT-5.2 has issues with writing.

In a recent Q&A session, Sam Altman admitted it directly:

"We decided to put most of our effort in 5.2 into making it super good at intelligence, reasoning, coding, engineering. And I think we kind of screwed up a bit on writing quality."

What most guides won't tell you is that this was a deliberate decision, not a mistake. OpenAI prioritized technical capabilities because, according to Altman, "consumers don't demand more IQ anymore, but enterprises still do."

The promise is that future GPT-5.x versions will correct this deficit. But if your main job is creative writing, Claude Opus 4.5 remains the better option.

GPT-5.2 Codex: The Developer's Weapon

A week after GPT-5.2 base, OpenAI released GPT-5.2 Codex, specifically optimized for agentic coding.

What Makes Codex Different

| Feature | GPT-5.2 Standard | GPT-5.2 Codex |
| --- | --- | --- |
| Long context in repos | Good | Optimized |
| Large refactors | Limited | Excellent |
| Code migrations | Basic | Specialized |
| Cybersecurity | Standard | Enhanced |
| Context compaction | No | Yes |

The trick is in "native context compaction." Codex can maintain task state across extended sessions without losing track. Think of it like this: you can leave a refactor halfway done, come back the next day, and the model remembers exactly where it was.
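
OpenAI hasn't published how native compaction works under the hood, but the general idea can be illustrated with a naive sketch: once the running transcript exceeds a token budget, older turns are collapsed into a summary stub so the session keeps fitting in the window. Everything below (the token heuristic, the summarize placeholder, the keep-4-recent-turns policy) is an invented illustration of the concept, not OpenAI's mechanism.

```python
# Naive illustration of context compaction: keep recent turns verbatim
# and collapse older ones into a summary once a token budget is exceeded.
# Concept sketch only; not OpenAI's actual mechanism.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 characters per token

def summarize(messages: list[str]) -> str:
    # Placeholder: a real system would ask the model for a true summary.
    return f"[summary of {len(messages)} earlier turns]"

def compact(history: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}: refactor step ..." for i in range(100)]
print(compact(history, budget=50))  # ['[summary of 96 earlier turns]', 'turn 96: ...', ...]
```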

Cybersecurity Performance

OpenAI boasts that GPT-5.2 Codex has the best cybersecurity capabilities of any model they've released. In professional CTF (Capture The Flag) competitions, the model shows significant improvements in:

  • Vulnerability detection
  • Threat analysis
  • Real-world exploit research

Andrew MacPherson, a principal security engineer at Privy (a Stripe company), used GPT-5.2 Codex to reproduce and study a critical React vulnerability. His conclusion: the model is genuinely useful for real security research.

OpenAI's Enterprise Strategy: What These Launches Reveal

The GPT-5.2 numbers aren't just technical metrics. They're a statement of intent.

The Enterprise Pivot

Sam Altman has been clear: 2026 is OpenAI's enterprise year. Some revealing data points:

  • OpenAI's API business grew faster than ChatGPT consumer in 2025
  • Enterprise is now a "major priority"
  • GPT-5.2 Pro exists specifically for legal, research, and analysis teams

"The main thing consumers want right now is not more IQ. Enterprises still do want more IQ."

The Response to "Code Red"

What most guides won't tell you is that GPT-5.2 came after an internal OpenAI memo declared "code red" in response to competitors like Google Gemini advancing. Altman confirmed he expected to "exit code red" following GPT-5.2's launch.

The model wars are at their peak intensity.

What Does 90% on ARC-AGI Mean?

Now the important question: does exceeding 90% mean we've achieved AGI (Artificial General Intelligence)?

What It DOES Mean

  1. Genuine reasoning (to some extent): GPT-5.2 can solve problems it hasn't seen before by applying abstract rules
  2. Improved generalization: The model doesn't just memorize — it extracts principles
  3. Radical efficiency: It outscores o3-preview at roughly 1/390th the cost

What It DOESN'T Mean

  1. It's not AGI: ARC-AGI measures one aspect of reasoning, not all of them
  2. It's not consciousness: Solving puzzles doesn't mean understanding the world
  3. It's not perfect: On ARC-AGI-2 (the hardest version), the best result is 54.2%

François Chollet, ARC-AGI's creator, released ARC-AGI-2 precisely because models were starting to "saturate" the original benchmark. The race continues.

Price Comparison: Is It Worth It?

If you're considering using GPT-5.2 for your work or business, here are the real numbers.

API Costs Per Million Tokens

| Model | Input | Output | Input (cached) | Batch (input / output) |
| --- | --- | --- | --- | --- |
| GPT-5.2 Thinking | $1.75 | $14.00 | $0.175 | $0.875 / $7.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | N/A | N/A |
| Claude Opus 4.5 | $15.00 | $75.00 | $1.50 | $7.50 / $37.50 |
| Gemini 3 Pro | ~$7.00 | ~$21.00 | Variable | Variable |
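
To turn those rates into per-request dollars, a quick calculator helps. The sketch below prices a hypothetical workload of 50,000 input and 5,000 output tokens; the rates come straight from the table above, and the workload size is just an example.

```python
# Price a hypothetical request (50K input / 5K output tokens) against
# the per-million-token rates from the table above.

RATES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5.2 Thinking": (1.75, 14.00),
    "GPT-5.2 Pro": (21.00, 168.00),
    "Claude Opus 4.5": (15.00, 75.00),
    "Gemini 3 Pro": (7.00, 21.00),  # approximate, per the table
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached: bool = False) -> float:
    in_rate, out_rate = RATES[model]
    if cached and model == "GPT-5.2 Thinking":
        in_rate *= 0.10  # 90% cached-input discount from the spec table
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

for model in RATES:
    print(f"{model}: ${request_cost(model, 50_000, 5_000):.2f}")
# Roughly: GPT-5.2 Thinking $0.16, GPT-5.2 Pro $1.89,
# Claude Opus 4.5 $1.13, Gemini 3 Pro $0.46
```

At that workload, the Thinking tier comes out roughly an order of magnitude cheaper per request than the Pro tier, which is why the variant choice matters as much as the model choice.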

Cost-Benefit Analysis

For coding and development:

  • GPT-5.2 Thinking is competitive with Claude Opus
  • The 90% cache discount makes repeated queries very cheap
  • GPT-5.2 Codex justifies the premium if you do large refactors

For reasoning and analysis:

  • GPT-5.2 Pro is expensive ($168 per million output tokens) but the best choice for professional work
  • If you need to beat human experts 74% of the time, it might be worth it

For creative writing:

  • Honestly, Claude is still the better option
  • GPT-5.2 admitted to sacrificing writing quality

What Comes Next: OpenAI's Roadmap

Sam Altman dropped some hints about the future:

Q1 2026

"I expect new models that are significant gains from 5.2 in the first quarter of next year."

Note that he avoided calling it "GPT-6." But the timeline is clear: substantial improvements in the coming months.

What's Coming

  1. Writing improvements: OpenAI has acknowledged the regression and says future GPT-5.x versions will fix it
  2. More specialized models: Following the Codex playbook
  3. Efficiency gains: The 390x cost/performance jump suggests there's more room to optimize

Conclusion: The Real Meaning of This Milestone

GPT-5.2 exceeding 90% on ARC-AGI is a genuine milestone, not empty marketing. But it needs to be understood in context:

It's impressive because:

  • It demonstrates real abstract reasoning, not just memorization
  • It reduces the cost of advanced capabilities by 390x
  • It sets a new standard for the industry

It doesn't change everything because:

  • The hardest benchmark (ARC-AGI-2) is still at ~54%
  • Writing has regressed compared to previous models
  • Claude Opus 4.5 still leads in practical coding

If you ask me directly: GPT-5.2 is the best model for reasoning, math, and professional analysis. Claude Opus 4.5 is still better for day-to-day coding and writing. Gemini 3 occupies an interesting niche with its Google ecosystem integration.

The real question isn't whether GPT-5.2 is "the best." It's what you need it for. And now, with these numbers on the table, you can make an informed choice.


Data current as of January 2026. Benchmarks from OpenAI, independent evaluations from IntuitionLabs, SonarSource, and comparative analysis from LLM-Stats.

Frequently Asked Questions

What is ARC-AGI and why is it important?

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed by François Chollet to measure genuine abstract reasoning. Unlike other tests, it can't be solved by memorizing patterns: each problem is unique and requires discovering underlying rules. An average human solves 85% without training; for years, AIs couldn't break 40%.

Is GPT-5.2 better than Claude Opus 4.5?

It depends on the task. GPT-5.2 dominates in abstract reasoning (90.5% vs ~75% on ARC-AGI) and math (100% vs ~93% on AIME 2025). Claude Opus 4.5 leads in practical coding (80.9% vs 80.0% on SWE-bench Verified) and writing quality. For professional analysis: GPT-5.2. For software development: both are competitive.

How much does GPT-5.2 cost?

GPT-5.2 Thinking costs $1.75 per million input tokens and $14 per million output. GPT-5.2 Pro (the most capable) costs $21 input and $168 output. Cached inputs get a 90% discount, and the Batch API offers 50% off for non-urgent workloads.

Does this mean we've achieved AGI?

No. ARC-AGI measures one aspect of abstract reasoning, not general intelligence. Breaking 90% on ARC-AGI-1 is impressive, but on ARC-AGI-2 (the harder version) the best score is 54.2%. Current models still have significant limitations in common sense, causal reasoning, and understanding the physical world.

When will GPT-6 be released?

Sam Altman said they expect "new models that are significant gains from 5.2 in Q1 2026," but avoided confirming if it will be called GPT-6. OpenAI also mentioned that future GPT-5.x versions will improve writing quality, which they admitted sacrificing in GPT-5.2.

Written by Sarah Chen

Tech educator focused on AI tools. Making complex technology accessible since 2018.

#artificial intelligence · #openai · #gpt-5.2 · #benchmark · #arc-agi · #ai reasoning · #claude opus · #gemini · #machine learning
