
GPT-5.2 Breaks the 90% Barrier on ARC-AGI: What It Means for AI

OpenAI achieves a historic milestone: the first AI model to exceed 90% on the world's toughest abstract reasoning test

Sarah Chen · January 29, 2026 · 14 min read

Photo by Google DeepMind on Unsplash

Key takeaways

GPT-5.2 Pro just achieved what seemed impossible: breaking the 90% barrier on ARC-AGI, the benchmark designed to measure true reasoning ability. Let me break down why this number matters and what it means for AI's future.

The Number That Changed Everything: 90.5%

Imagine a test specifically designed to measure whether an AI can actually "think", not just memorize patterns. A test so difficult that for years, no model could break 50%. Well, GPT-5.2 Pro just scored 90.5%.

This isn't just any benchmark. ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by François Chollet, the inventor of Keras, precisely to detect whether AIs truly reason or just pretend to. And for years, that test exposed the limitations of every model.

Until December 2025.

What most guides won't tell you is that this result isn't impressive just for the number itself. It's impressive because GPT-5.2 Pro achieves it at roughly 1/390th the cost of the previous record holder, o3-preview, which topped out at 87%.

What the Heck Is ARC-AGI and Why Should You Care?

Before we get excited about the numbers, let me explain why this benchmark is special.

The Problem with Traditional Benchmarks

Most AI tests measure things models can memorize. If a model has seen millions of math problems during training, is it really "reasoning" when it solves a new one, or just applying patterns it already knows?

François Chollet designed ARC-AGI to be different:

| Characteristic | Traditional Benchmarks | ARC-AGI |
| --- | --- | --- |
| Problem type | Text, code, math | Abstract visual puzzles |
| Memorization | Possible | Impossible |
| Specific training | Works | Doesn't work |
| Measures | Knowledge + patterns | Pure reasoning |

Each ARC-AGI problem presents a grid with color patterns. The model must discover the underlying rule from just 2-3 examples and then apply it to a new case. There's no way to memorize it because each problem is unique.
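To make the format concrete, here's a toy sketch (in Python) of an ARC-style task: the solver sees a couple of input/output grid pairs, infers a rule, and applies it to a new grid. The grids and the color-substitution rule below are invented for illustration; real ARC-AGI rules are far richer, involving shapes, symmetry, counting, and object manipulation.

```python
# Toy ARC-style task: infer a color-substitution rule from example pairs.
# The grids and the rule are invented for illustration; real ARC-AGI
# rules are far richer (shapes, symmetry, counting, object movement).

train_pairs = [
    ([[1, 0], [0, 1]], [[2, 0], [0, 2]]),  # demo 1: color 1 becomes color 2
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),  # demo 2: same rule
]
test_input = [[0, 1], [1, 1]]

def infer_color_map(pairs):
    """Learn a per-cell color substitution consistent with every demo."""
    mapping = {}
    for grid_in, grid_out in pairs:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("no consistent color mapping")
    return mapping

rule = infer_color_map(train_pairs)
prediction = [[rule.get(c, c) for c in row] for row in test_input]
print(prediction)  # [[0, 2], [2, 2]]
```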

The kicker: an average human can solve about 85% of these puzzles without any prior training. For years, the best AIs barely reached 30-40%.

The Evolution of Scores

| Year | Best Model | ARC-AGI-1 Score |
| --- | --- | --- |
| 2020 | GPT-3 | ~20% |
| 2022 | GPT-4 | ~35% |
| 2024 | Claude Opus | ~55% |
| Nov 2025 | o3-preview | 87% |
| Dec 2025 | GPT-5.2 Pro | 90.5% |

The Numbers That Matter: GPT-5.2 in Detail

OpenAI launched GPT-5.2 on December 11, 2025. But it's not a single model — it's a family with three variants designed for different use cases.

The Three GPT-5.2 Variants

| Variant | Use Case | API Price (input / output, per million tokens) |
| --- | --- | --- |
| GPT-5.2 Instant | Quick responses, emails, simple texts | Most economical |
| GPT-5.2 Thinking | Step-by-step reasoning, complex problems | $1.75 / $14 |
| GPT-5.2 Pro | Research, legal analysis, scientific work | $21 / $168 |

Technical Specifications

| Specification | Value |
| --- | --- |
| Context window | 400,000 tokens |
| Max output tokens | 128,000 |
| Knowledge cutoff | August 2025 |
| Reasoning support | Configurable (none, low, medium, high, xhigh) |
| Cached inputs | 90% discount |
| Batch API | 50% discount |
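
For the curious, setting the reasoning level from code might look like the sketch below. It assumes the existing reasoning_effort parameter in OpenAI's Python SDK carries over to GPT-5.2 and that the model ships under an ID like "gpt-5.2-thinking"; both the model ID and the levels beyond "high" come from the spec table above, not from published API documentation.

```python
# Hedged sketch: assumes OpenAI's Python SDK and its existing
# reasoning_effort parameter apply to GPT-5.2, and that the model
# ships under the ID "gpt-5.2-thinking". Both are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2-thinking",  # hypothetical model ID
    reasoning_effort="high",   # per the spec table: none/low/medium/high/xhigh
    messages=[{"role": "user", "content": "Walk me through this contract clause."}],
)
print(response.choices[0].message.content)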

Think of it like this: a 400K token context window means you can fit approximately 600 pages of text in a single query. That's an entire book, with room to spare for questions.
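That estimate is easy to sanity-check. The sketch below uses two common rules of thumb, roughly 0.75 English words per token and about 500 words per printed page; both conversion factors are approximations I'm assuming, not figures from OpenAI.

```python
# Back-of-the-envelope check on the "~600 pages" claim.
# Both conversion factors are rough rules of thumb, not official figures.
context_tokens = 400_000
words_per_token = 0.75   # typical for English prose
words_per_page = 500     # typical for a printed book page

pages = context_tokens * words_per_token / words_per_page
print(f"~{pages:.0f} pages")  # ~600 pages
```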

The Benchmark Battle: GPT-5.2 vs Claude Opus 4.5 vs Gemini 3

Now things get interesting. How does GPT-5.2 compare to its direct competitors?

Reasoning and Math

| Benchmark | GPT-5.2 Pro | Claude Opus 4.5 | Gemini 3 Pro |
| --- | --- | --- | --- |
| ARC-AGI-1 (Verified) | 90.5% | ~75% | ~70% |
| ARC-AGI-2 (Verified) | 54.2% | 37.6% | 31.1% |
| AIME 2025 (Math) | 100% | ~92.8% | ~90% |
| GPQA Diamond (Science) | 92.4% | 87% | 91.9% |
| FrontierMath | 40.3% | ~30% | ~35% |

The numbers speak for themselves: in abstract reasoning and math, GPT-5.2 dominates. The jump from 37.6% to 54.2% on ARC-AGI-2 (the harder version) is particularly notable: a 44% relative improvement over Claude Opus 4.5 (54.2 / 37.6 ≈ 1.44).

Coding and Programming Tasks

| Benchmark | GPT-5.2 Codex | Claude Opus 4.5 |
| --- | --- | --- |
| SWE-bench Verified | 80.0% | 80.9% |
| SWE-bench Pro | 56.4% | ~54% |
| HumanEval | 91.7% | 94.2% |
| Terminal-Bench | 47.6% | 59.3% |
| Terminal-Bench 2.0 | 64.0% | ~55% |

The trick is understanding what each measures: Claude Opus 4.5 still leads on SWE-bench Verified (the most-cited metric for real-world coding), but GPT-5.2 Codex wins on SWE-bench Pro and Terminal-Bench 2.0, which are harder.

In practical real-world tests, though, the results are more mixed. A Sonar analysis found that GPT-5.2 has the lowest control-flow error rate (22 errors per million lines of code), while Claude Opus 4.5 produces the highest share of functional code (83.62%).

The Benchmark That Matters Most: GDPval

OpenAI created GDPval to measure something that actually matters: can AI do professional work better than a human expert?

| Model | GDPval win rate (vs. human professionals) |
| --- | --- |
| GPT-5.2 Thinking | 70.9% |
| GPT-5.2 Pro | 74.1% |
| Claude Opus 4.5 | ~65% |

This means that on well-specified knowledge tasks (legal analysis, document review, scientific research), GPT-5.2 Pro's output beats a human professional's 74% of the time.

What Sam Altman Admitted (And Why It Matters)

Here's something OpenAI would probably prefer didn't get too much attention: GPT-5.2 has issues with writing.

In a recent Q&A session, Sam Altman admitted it directly:

"We decided to put most of our effort in 5.2 into making it super good at intelligence, reasoning, coding, engineering. And I think we kind of screwed up a bit on writing quality."

What most guides won't tell you is that this was a deliberate decision, not a mistake. OpenAI prioritized technical capabilities because, according to Altman, "consumers don't demand more IQ anymore, but enterprises still do."

The promise is that future GPT-5.x versions will correct this deficit. But if your main job is creative writing, Claude Opus 4.5 remains the better option.

GPT-5.2 Codex: The Developer's Weapon

A week after GPT-5.2 base, OpenAI released GPT-5.2 Codex, specifically optimized for agentic coding.

What Makes Codex Different

| Feature | GPT-5.2 Standard | GPT-5.2 Codex |
| --- | --- | --- |
| Long context in repos | Good | Optimized |
| Large refactors | Limited | Excellent |
| Code migrations | Basic | Specialized |
| Cybersecurity | Standard | Enhanced |
| Context compaction | No | Yes |

The trick is in "native context compaction." Codex can maintain task state across extended sessions without losing track. Think of it like this: you can leave a refactor halfway done, come back the next day, and the model remembers exactly where it was.
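
OpenAI hasn't published how native compaction works under the hood, but the general idea can be illustrated with a naive sketch: once the running transcript exceeds a token budget, older turns are collapsed into a summary stub so the session keeps fitting in the window. Everything below (the token heuristic, the summarize placeholder, the keep-4-recent-turns policy) is an invented illustration of the concept, not OpenAI's mechanism.

```python
# Naive illustration of context compaction: keep recent turns verbatim
# and collapse older ones into a summary once a token budget is exceeded.
# Concept sketch only; not OpenAI's actual mechanism.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 characters per token

def summarize(messages: list[str]) -> str:
    # Placeholder: a real system would ask the model for a true summary.
    return f"[summary of {len(messages)} earlier turns]"

def compact(history: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}: refactor step ..." for i in range(100)]
print(compact(history, budget=50))  # ['[summary of 96 earlier turns]', 'turn 96: ...', ...]
```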

Cybersecurity Performance

OpenAI boasts that GPT-5.2 Codex has the best cybersecurity capabilities of any model they've released. In professional CTF (Capture The Flag) competitions, the model shows significant improvements in:

  • Vulnerability detection
  • Threat analysis
  • Real-world exploit research

Andrew MacPherson, a principal security engineer at Privy (a Stripe company), used GPT-5.2 Codex to reproduce and study a critical React vulnerability. His conclusion: the model is genuinely useful for real security research.

OpenAI's Enterprise Strategy: What These Launches Reveal

The GPT-5.2 numbers aren't just technical metrics. They're a statement of intent.

The Enterprise Pivot

Sam Altman has been clear: 2026 is OpenAI's enterprise year. Some revealing data points:

  • OpenAI's API business grew faster than ChatGPT consumer in 2025
  • Enterprise is now a "major priority"
  • GPT-5.2 Pro exists specifically for legal, research, and analysis teams

"The main thing consumers want right now is not more IQ. Enterprises still do want more IQ."

The Response to "Code Red"

What most guides won't tell you is that GPT-5.2 came after an internal OpenAI memo declared "code red" in response to competitors like Google Gemini advancing. Altman confirmed he expected to "exit code red" following GPT-5.2's launch.

The model wars are at their peak intensity.

What Does 90% on ARC-AGI Mean?

Now the important question: does exceeding 90% mean we've achieved AGI (Artificial General Intelligence)?

What It DOES Mean

  1. Genuine reasoning (to some extent): GPT-5.2 can solve problems it hasn't seen before by applying abstract rules
  2. Improved generalization: The model doesn't just memorize — it extracts principles
  3. Radical efficiency: It outscores o3-preview at roughly 1/390th the cost

What It DOESN'T Mean

  1. It's not AGI: ARC-AGI measures one aspect of reasoning, not all of them
  2. It's not consciousness: Solving puzzles doesn't mean understanding the world
  3. It's not perfect: On ARC-AGI-2 (the hardest version), the best result is 54.2%

François Chollet, ARC-AGI's creator, released ARC-AGI-2 precisely because models were starting to "saturate" the original benchmark. The race continues.

Price Comparison: Is It Worth It?

If you're considering using GPT-5.2 for your work or business, here are the real numbers.

API Costs Per Million Tokens

| Model | Input | Output | Input (cached) | Batch (input / output) |
| --- | --- | --- | --- | --- |
| GPT-5.2 Thinking | $1.75 | $14.00 | $0.175 | $0.875 / $7.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | N/A | N/A |
| Claude Opus 4.5 | $15.00 | $75.00 | $1.50 | $7.50 / $37.50 |
| Gemini 3 Pro | ~$7.00 | ~$21.00 | Variable | Variable |
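
To turn those rates into per-request dollars, a quick calculator helps. The sketch below prices a hypothetical workload of 50,000 input and 5,000 output tokens; the rates come straight from the table above, and the workload size is just an example.

```python
# Price a hypothetical request (50K input / 5K output tokens) against
# the per-million-token rates from the table above.

RATES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5.2 Thinking": (1.75, 14.00),
    "GPT-5.2 Pro": (21.00, 168.00),
    "Claude Opus 4.5": (15.00, 75.00),
    "Gemini 3 Pro": (7.00, 21.00),  # approximate, per the table
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached: bool = False) -> float:
    in_rate, out_rate = RATES[model]
    if cached and model == "GPT-5.2 Thinking":
        in_rate *= 0.10  # 90% cached-input discount from the spec table
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

for model in RATES:
    print(f"{model}: ${request_cost(model, 50_000, 5_000):.2f}")
# Roughly: GPT-5.2 Thinking $0.16, GPT-5.2 Pro $1.89,
# Claude Opus 4.5 $1.13, Gemini 3 Pro $0.46
```

At that workload, the Thinking tier comes out roughly an order of magnitude cheaper per request than the Pro tier, which is why the variant choice matters as much as the model choice.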

Cost-Benefit Analysis

For coding and development:

  • GPT-5.2 Thinking is competitive with Claude Opus
  • The 90% cache discount makes repeated queries very cheap
  • GPT-5.2 Codex justifies the premium if you do large refactors

For reasoning and analysis:

  • GPT-5.2 Pro is expensive ($168 per million output tokens) but the best choice for professional work
  • If you need to beat human experts 74% of the time, it might be worth it

For creative writing:

  • Honestly, Claude is still the better option
  • GPT-5.2 admitted to sacrificing writing quality

What Comes Next: OpenAI's Roadmap

Sam Altman dropped some hints about the future:

Q1 2026

"I expect new models that are significant gains from 5.2 in the first quarter of next year."

Note that he avoided calling it "GPT-6." But the timeline is clear: substantial improvements in the coming months.

What's Coming

  1. Writing improvements: OpenAI has acknowledged the regression and says future GPT-5.x versions will fix it
  2. More specialized models: Following the Codex playbook
  3. Efficiency gains: The 390x cost/performance jump suggests there's more room to optimize

Conclusion: The Real Meaning of This Milestone

GPT-5.2 exceeding 90% on ARC-AGI is a genuine milestone, not empty marketing. But it needs to be understood in context:

It's impressive because:

  • It demonstrates real abstract reasoning, not just memorization
  • It reduces the cost of advanced capabilities by 390x
  • It sets a new standard for the industry

It doesn't change everything because:

  • The hardest benchmark (ARC-AGI-2) is still at ~54%
  • Writing has regressed compared to previous models
  • Claude Opus 4.5 still leads in practical coding

If you ask me directly: GPT-5.2 is the best model for reasoning, math, and professional analysis. Claude Opus 4.5 is still better for day-to-day coding and writing. Gemini 3 occupies an interesting niche with its Google ecosystem integration.

The real question isn't whether GPT-5.2 is "the best." It's what you need it for. And now, with these numbers on the table, you can make an informed choice.


Data current as of January 2026. Benchmarks from OpenAI, independent evaluations from IntuitionLabs, SonarSource, and comparative analysis from LLM-Stats.

Frequently Asked Questions

What is ARC-AGI and why is it important?

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed by François Chollet to measure genuine abstract reasoning. Unlike other tests, it can't be solved by memorizing patterns: each problem is unique and requires discovering underlying rules. An average human solves 85% without training; for years, AIs couldn't break 40%.

Is GPT-5.2 better than Claude Opus 4.5?

It depends on the task. GPT-5.2 dominates in abstract reasoning (90.5% vs ~75% on ARC-AGI) and math (100% vs ~93% on AIME 2025). Claude Opus 4.5 leads in practical coding (80.9% vs 80.0% on SWE-bench Verified) and writing quality. For professional analysis: GPT-5.2. For software development: both are competitive.

How much does GPT-5.2 cost?

GPT-5.2 Thinking costs $1.75 per million input tokens and $14 per million output. GPT-5.2 Pro (the most capable) costs $21 input and $168 output. Cached inputs get a 90% discount, and the Batch API offers 50% off for non-urgent workloads.

Does this mean we've achieved AGI?

No. ARC-AGI measures one aspect of abstract reasoning, not general intelligence. Breaking 90% on ARC-AGI-1 is impressive, but on ARC-AGI-2 (the harder version) the best score is 54.2%. Current models still have significant limitations in common sense, causal reasoning, and understanding the physical world.

When will GPT-6 be released?

Sam Altman said they expect "new models that are significant gains from 5.2 in Q1 2026," but avoided confirming if it will be called GPT-6. OpenAI also mentioned that future GPT-5.x versions will improve writing quality, which they admitted sacrificing in GPT-5.2.

Written by Sarah Chen

Tech educator focused on AI tools. Making complex technology accessible since 2018.

#artificial intelligence · #openai · #gpt-5.2 · #benchmark · #arc-agi · #ai reasoning · #claude opus · #gemini · #machine learning
