The Number That Changed Everything: 90.5%
Let me break this down: imagine a test specifically designed to measure whether an AI can actually "think", not just memorize patterns. A test so difficult that, for years, no model could break 50%. Well, GPT-5.2 Pro just scored 90.5%.
This isn't just any benchmark. ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by François Chollet, the inventor of Keras, precisely to detect whether AIs truly reason or just pretend to. And for years, that test exposed the limitations of every model.
Until December 2025.
What most guides won't tell you is that this result isn't just impressive for the number itself. It's impressive because GPT-5.2 Pro achieves it at roughly 1/390th the cost of the previous record holder, o3-preview, which topped out at 87%.
What the Heck Is ARC-AGI and Why Should You Care?
Before we get excited about the numbers, let me explain why this benchmark is special.
The Problem with Traditional Benchmarks
Most AI tests measure things models can memorize. If a model has seen millions of math problems during training, is it really "reasoning" when it solves a new one, or just applying patterns it already knows?
François Chollet designed ARC-AGI to be different:
| Characteristic | Traditional Benchmarks | ARC-AGI |
|---|---|---|
| Problem type | Text, code, math | Abstract visual puzzles |
| Solvable by memorization | Yes | No |
| Targeted training | Effective | Largely ineffective |
| What it measures | Knowledge + patterns | Pure reasoning |
Each ARC-AGI problem presents a few input-output pairs of grids filled with colored cells. The model must infer the underlying transformation rule from just 2-3 examples and then apply it to a new input. Memorization doesn't help, because every problem is unique.
The trick is that an average human can solve 85% of these puzzles without prior training. For years, the best AIs barely reached 30-40%.
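To make the format concrete, here is a toy sketch in Python of what an ARC-style task looks like. The grids and the hidden rule below are invented for illustration (they are not from the actual ARC dataset), and real ARC puzzles are far too varied for a small library of canned transformations to solve.

```python
# Toy illustration of the ARC-AGI task format (invented example,
# not from the real dataset). Grids are small matrices of color
# indices; a task provides a few input/output pairs, and the solver
# must infer the hidden transformation rule.

# Hidden rule for this toy task: "mirror the grid left-to-right".
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0], [0, 5, 0]], [[0, 3, 3], [0, 5, 0]]),
]
test_input = [[7, 0, 0], [0, 8, 0]]

# A brute-force solver over a tiny library of candidate rules.
candidates = {
    "identity": lambda g: g,
    "mirror_left_right": lambda g: [row[::-1] for row in g],
    "mirror_top_bottom": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

for name, rule in candidates.items():
    if all(rule(inp) == out for inp, out in train_pairs):
        print(f"rule found: {name} -> {rule(test_input)}")
        break
```

The point of ARC is precisely that no finite library of canned rules works: each puzzle demands synthesizing a new rule on the spot, which is why the benchmark resists memorization.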
The Evolution of Scores
| Year | Best Model | ARC-AGI-1 Score |
|---|---|---|
| 2020 | GPT-3 | ~20% |
| 2023 | GPT-4 | ~35% |
| 2024 | Claude Opus | ~55% |
| Nov 2025 | o3-preview | 87% |
| Dec 2025 | GPT-5.2 Pro | 90.5% |
The Numbers That Matter: GPT-5.2 in Detail
OpenAI launched GPT-5.2 on December 11, 2025. But it's not a single model — it's a family with three variants designed for different use cases.
The Three GPT-5.2 Variants
| Variant | Use Case | API Price (input/output) |
|---|---|---|
| GPT-5.2 Instant | Quick responses, emails, simple texts | Most economical |
| GPT-5.2 Thinking | Step-by-step reasoning, complex problems | $1.75 / $14 per million tokens |
| GPT-5.2 Pro | Research, legal analysis, scientific work | $21 / $168 per million tokens |
Technical Specifications
| Specification | Value |
|---|---|
| Context window | 400,000 tokens |
| Max output tokens | 128,000 tokens |
| Knowledge cutoff | August 2025 |
| Reasoning support | Configurable (none, low, medium, high, xhigh) |
| Cached inputs | 90% discount |
| Batch API | 50% discount |
Think of it like this: a 400K-token context window holds roughly 300,000 words of English (at about 0.75 words per token), which works out to some 600 pages in a single query. That's an entire book, with room to spare for your questions.
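If you want to try the reasoning levels from the table above, here is a minimal sketch using the OpenAI Python SDK. The model name "gpt-5.2-pro" and the effort value names are taken from this article's tables, not verified against the live API; the client and Responses call shape follow the current SDK and may differ for this model.

```python
# Minimal sketch: calling GPT-5.2 with a configured reasoning level.
# Assumptions: the model id "gpt-5.2-pro" and the effort values come
# from the spec table above; verify both against the API reference.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2-pro",           # hypothetical model identifier
    reasoning={"effort": "high"},  # per the table: none/low/medium/high/xhigh
    input="Summarize the key obligations in the attached 300-page contract.",
)
print(response.output_text)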
The Benchmark Battle: GPT-5.2 vs Claude Opus 4.5 vs Gemini 3
Now things get interesting. How does GPT-5.2 compare to its direct competitors?
Reasoning and Math
| Benchmark | GPT-5.2 Pro | Claude Opus 4.5 | Gemini 3 Pro |
|---|---|---|---|
| ARC-AGI-1 (Verified) | 90.5% | ~75% | ~70% |
| ARC-AGI-2 (Verified) | 54.2% | 37.6% | 31.1% |
| AIME 2025 (Math) | 100% | ~92.8% | ~90% |
| GPQA Diamond (Science) | 92.4% | 87% | 91.9% |
| FrontierMath | 40.3% | ~30% | ~35% |
The numbers speak for themselves: in abstract reasoning and math, GPT-5.2 dominates. The jump from 37.6% to 54.2% on ARC-AGI-2 (the harder version) is particularly notable: a 44% relative improvement over Claude Opus 4.5.
Coding and Programming Tasks
| Benchmark | GPT-5.2 Codex | Claude Opus 4.5 |
|---|---|---|
| SWE-bench Verified | 80.0% | 80.9% |
| SWE-bench Pro | 56.4% | ~54% |
| HumanEval | 91.7% | 94.2% |
| Terminal-Bench | 47.6% | 59.3% |
| Terminal-Bench 2.0 | 64.0% | ~55% |
The trick is understanding what each measures: Claude Opus 4.5 still leads on SWE-bench Verified (the most-cited metric for real-world coding), but GPT-5.2 Codex wins on SWE-bench Pro and Terminal-Bench 2.0, which are harder.
What most guides won't tell you is that in practical real-world tests, results are more mixed. A Sonar analysis found that GPT-5.2 has the lowest rate of control-flow errors (22 per million lines of code), while Claude Opus 4.5 produces the highest proportion of functionally correct code (83.62%).
The Benchmark That Matters Most: GDPval
OpenAI created GDPval to measure something that actually matters: can AI do professional work better than a human expert?
| Model | GDPval (vs professionals) |
|---|---|
| GPT-5.2 Thinking | 70.9% |
| GPT-5.2 Pro | 74.1% |
| Claude Opus 4.5 | ~65% |
This means that on well-specified knowledge tasks, such as legal analysis, document review, and scientific research, GPT-5.2 Pro beats human professionals 74% of the time.
What Sam Altman Admitted (And Why It Matters)
Here's something OpenAI would probably prefer didn't get too much attention: GPT-5.2 has issues with writing.
In a recent Q&A session, Sam Altman admitted it directly:
"We decided to put most of our effort in 5.2 into making it super good at intelligence, reasoning, coding, engineering. And I think we kind of screwed up a bit on writing quality."
What most guides won't tell you is that this was a deliberate decision, not a mistake. OpenAI prioritized technical capabilities because, according to Altman, "consumers don't demand more IQ anymore, but enterprises still do."
The promise is that future GPT-5.x versions will correct this deficit. But if your main job is creative writing, Claude Opus 4.5 remains the better option.
GPT-5.2 Codex: The Developer's Weapon
A week after the base GPT-5.2 launch, OpenAI released GPT-5.2 Codex, a variant specifically optimized for agentic coding.
What Makes Codex Different
| Feature | GPT-5.2 Standard | GPT-5.2 Codex |
|---|---|---|
| Long context in repos | Good | Optimized |
| Large refactors | Limited | Excellent |
| Code migrations | Basic | Specialized |
| Cybersecurity | Standard | Enhanced |
| Context compaction | No | Yes |
The trick is in "native context compaction." Codex can maintain task state across extended sessions without losing track. Think of it like this: you can leave a refactor halfway done, come back the next day, and the model remembers exactly where it was.
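OpenAI hasn't published how native compaction works internally, so what follows is only a conceptual sketch of the general idea: when a transcript outgrows its token budget, the oldest turns get folded into a running summary so the session keeps fitting in the window. The token heuristic and the stub summarizer are assumptions for illustration, not OpenAI's implementation.

```python
# Conceptual sketch of context compaction (NOT OpenAI's actual,
# model-native implementation). Oldest turns are folded into a
# running summary whenever the transcript exceeds a token budget.

def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return len(text) // 4

def compact(messages: list[dict], budget: int, summarize) -> list[dict]:
    """Fold the oldest turns into a running summary until the
    transcript fits within the token budget again."""
    summary = ""

    def total() -> int:
        return (sum(rough_token_count(m["content"]) for m in messages)
                + rough_token_count(summary))

    while total() > budget and len(messages) > 1:
        oldest = messages.pop(0)
        # In production, `summarize` would be a cheap model call.
        summary = summarize(summary + "\n" + oldest["content"])

    if summary:
        messages.insert(0, {"role": "system",
                            "content": "Earlier context, compacted: " + summary})
    return messages

# Example with a stub summarizer that just truncates.
history = [{"role": "user", "content": "..." * 2000},
           {"role": "assistant", "content": "..." * 2000},
           {"role": "user", "content": "Now rename the helper module."}]
print(len(compact(history, budget=1500, summarize=lambda t: t[:400])))
```

The payoff of doing this natively, rather than bolting it on client-side as above, is that the model itself decides what state is worth keeping across a multi-day refactor.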
Cybersecurity Performance
OpenAI boasts that GPT-5.2 Codex has the best cybersecurity capabilities of any model they've released. In professional CTF (Capture The Flag) competitions, the model shows significant improvements in:
- Vulnerability detection
- Threat analysis
- Real-world exploit research
Andrew MacPherson, a principal security engineer at Privy (a Stripe company), used GPT-5.2 Codex to reproduce and study a critical React vulnerability. His conclusion: the model is genuinely useful for real security research.
OpenAI's Enterprise Strategy: What These Launches Reveal
The GPT-5.2 numbers aren't just technical metrics. They're a statement of intent.
The Enterprise Pivot
Sam Altman has been clear: 2026 is OpenAI's enterprise year. Some revealing data points:
- OpenAI's API business grew faster than ChatGPT consumer in 2025
- Enterprise is now a "major priority"
- GPT-5.2 Pro exists specifically for legal, research, and analysis teams
As Altman put it: "The main thing consumers want right now is not more IQ. Enterprises still do want more IQ."
The Response to "Code Red"
What most guides won't tell you is that GPT-5.2 shipped after an internal OpenAI memo declared a "code red" in response to advances from competitors like Google's Gemini. Altman confirmed he expected to "exit code red" following GPT-5.2's launch.
The model wars are at their peak intensity.
What Does 90% on ARC-AGI Mean?
Now the important question: does exceeding 90% mean we've achieved AGI (Artificial General Intelligence)?
What It DOES Mean
- Genuine reasoning (to some extent): GPT-5.2 can solve problems it hasn't seen before by applying abstract rules
- Improved generalization: The model doesn't just memorize — it extracts principles
- Radical efficiency: Achieves similar results to o3 at 1/390th the cost
What It DOESN'T Mean
- It's not AGI: ARC-AGI measures one aspect of reasoning, not all of them
- It's not consciousness: Solving puzzles doesn't mean understanding the world
- It's not perfect: On ARC-AGI-2 (the hardest version), the best result is 54.2%
François Chollet, ARC-AGI's creator, released ARC-AGI-2 precisely because models were starting to "saturate" the original benchmark. The race continues.
Price Comparison: Is It Worth It?
If you're considering using GPT-5.2 for your work or business, here are the real numbers.
API Costs Per Million Tokens
| Model | Input | Output | Input (cached) | Batch |
|---|---|---|---|---|
| GPT-5.2 Thinking | $1.75 | $14.00 | $0.175 | $0.875/$7 |
| GPT-5.2 Pro | $21.00 | $168.00 | N/A | N/A |
| Claude Opus 4.5 | $15.00 | $75.00 | $1.50 | $7.50/$37.50 |
| Gemini 3 Pro | ~$7.00 | ~$21.00 | Variable | Variable |
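To turn those rates into a concrete bill, here is a back-of-the-envelope calculator. The per-million-token prices come from the table above; the monthly volumes and the 60% cache hit rate are made-up assumptions you should replace with your own traffic mix.

```python
# Back-of-the-envelope monthly cost comparison using the listed rates.
# Prices are USD per 1M tokens, from the table above; the workload
# numbers in the example are illustrative assumptions only.

PRICES = {  # (input, output, cached_input)
    "gpt-5.2-thinking": (1.75, 14.00, 0.175),
    "gpt-5.2-pro":      (21.00, 168.00, None),  # no cached tier listed
    "claude-opus-4.5":  (15.00, 75.00, 1.50),
}

def monthly_cost(model, input_mtok, output_mtok, cache_hit_rate=0.0):
    inp, out, cached = PRICES[model]
    if cached is None:
        cache_hit_rate = 0.0  # model has no cached-input pricing
    fresh = input_mtok * (1 - cache_hit_rate) * inp
    hits = input_mtok * cache_hit_rate * (cached or 0.0)
    return fresh + hits + output_mtok * out

# Example: 50M input / 10M output tokens per month, 60% cache hits.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 50, 10, 0.6):,.2f}")
```

With this mix, the cache discount is what keeps GPT-5.2 Thinking far below its sticker price, while GPT-5.2 Pro's output rate dominates its total cost.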
Cost-Benefit Analysis
For coding and development:
- GPT-5.2 Thinking is competitive with Claude Opus
- The 90% cache discount makes repeated queries very cheap
- GPT-5.2 Codex justifies the premium if you do large refactors
For reasoning and analysis:
- GPT-5.2 Pro is expensive (~$168/million output) but the best for professional work
- If you need to beat human experts 74% of the time, it might be worth it
For creative writing:
- Honestly, Claude is still the better option
- OpenAI has admitted it sacrificed writing quality in GPT-5.2
What Comes Next: OpenAI's Roadmap
Sam Altman dropped some hints about the future:
Q1 2026
"I expect new models that are significant gains from 5.2 in the first quarter of next year."
Note that he avoided calling it "GPT-6." But the timeline is clear: substantial improvements in the coming months.
What's Coming
- Writing improvements: OpenAI has acknowledged the regression and says future versions will fix it
- More specialized models: Following the Codex playbook
- Efficiency gains: The 390x cost/performance jump suggests there's more room to optimize
Conclusion: The Real Meaning of This Milestone
GPT-5.2 exceeding 90% on ARC-AGI is a genuine milestone, not empty marketing. But it needs to be understood in context:
It's impressive because:
- It demonstrates real abstract reasoning, not just memorization
- It reduces the cost of advanced capabilities by 390x
- It sets a new standard for the industry
It doesn't change everything because:
- The hardest benchmark (ARC-AGI-2) is still at ~54%
- Writing has regressed compared to previous models
- Claude Opus 4.5 still leads in practical coding
If you ask me directly: GPT-5.2 is the best model for reasoning, math, and professional analysis. Claude Opus 4.5 is still better for day-to-day coding and writing. Gemini 3 occupies an interesting niche with its Google ecosystem integration.
The real question isn't whether GPT-5.2 is "the best." It's what you need it for. And now, with these numbers on the table, you can make an informed choice.
Data current as of January 2026. Benchmarks from OpenAI, independent evaluations from IntuitionLabs, SonarSource, and comparative analysis from LLM-Stats.
Frequently Asked Questions
What is ARC-AGI and why is it important?
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed by François Chollet to measure genuine abstract reasoning. Unlike other tests, it can't be solved by memorizing patterns: each problem is unique and requires discovering underlying rules. An average human solves 85% without training; for years, AIs couldn't break 40%.
Is GPT-5.2 better than Claude Opus 4.5?
It depends on the task. GPT-5.2 dominates in abstract reasoning (90.5% vs ~75% on ARC-AGI) and math (100% vs ~93% on AIME 2025). Claude Opus 4.5 leads in practical coding (80.9% vs 80.0% on SWE-bench Verified) and writing quality. For professional analysis: GPT-5.2. For software development: both are competitive.
How much does GPT-5.2 cost?
GPT-5.2 Thinking costs $1.75 per million input tokens and $14 per million output. GPT-5.2 Pro (the most capable) costs $21 input and $168 output. Cached inputs get a 90% discount, and the Batch API offers 50% off for non-urgent workloads.
Does this mean we've achieved AGI?
No. ARC-AGI measures one aspect of abstract reasoning, not general intelligence. Breaking 90% on ARC-AGI-1 is impressive, but on ARC-AGI-2 (the harder version) the best score is 54.2%. Current models still have significant limitations in common sense, causal reasoning, and understanding the physical world.
When will GPT-6 be released?
Sam Altman said they expect "new models that are significant gains from 5.2 in Q1 2026," but avoided confirming if it will be called GPT-6. OpenAI also mentioned that future GPT-5.x versions will improve writing quality, which they admitted sacrificing in GPT-5.2.