April 21, 2026
by AI Expert Team

Has AI Achieved AGI? The Jensen Huang Claim and the Benchmark That Just Destroyed It


Has AI achieved AGI? On March 23, 2026, Nvidia CEO Jensen Huang went on Lex Fridman’s podcast and said yes. “I think it’s now. I think we’ve achieved AGI.” Those words came from the man whose chips power roughly 80% of global AI training. The clip went viral within hours. Two days later, on March 25, the ARC Prize Foundation dropped a new benchmark at Y Combinator called ARC-AGI-3, designed specifically to test whether AI can figure out novel situations without being told the rules.

Frontier AI models scored under 1%. Gemini 3.1 Pro managed 0.37%. GPT-5.4 hit 0.26%. Claude Opus 4.6 scored 0.25%. Grok-4.20 scored literally zero. Humans scored 100%. A four-layer convolutional neural network built in weeks by a small research lab scored 12.58%, beating every frontier model by more than 30 times.

For UK SMEs being pitched “AGI-powered” software, “intelligent” AI agents, and tools that promise to think like humans, this matters: the gap between what the AI industry is selling and what it can actually deliver just got measured, and the results are more honest than most marketing.


Has AI Achieved AGI? What the Numbers Actually Show

ARC-AGI-3 is the third iteration of a benchmark created by François Chollet, the researcher who built Keras and who has spent years arguing that most AI progress metrics measure the wrong things. The first version launched in 2019. The second launched in early 2025. Both tested static grid puzzles. Frontier models eventually climbed to 90%+ on the first version and the mid-teens on the second.

Version 3 changes the format entirely. Instead of puzzles, the AI is dropped into an interactive game-like environment. A 64x64 grid. No instructions, no stated rules and no win condition. The agent takes an action, sees what changes, and has to figure out both what it’s trying to do and how to do it from scratch, in real time. Humans find these games intuitive. The ARC team paid roughly 500 random members of the public in San Francisco to test them. Most games were fully cleared on first contact by at least half the testers.

The scoring is deliberately brutal. The formula is (human actions divided by AI actions) squared. Take twice as long as a human and your score caps at 25%. Take ten times as long and you score 1%. This kills brute force. Every previous benchmark could be gamed by having an AI try everything until it stumbled on the answer. This one makes that mathematically impossible.
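The quadratic penalty can be sketched in a few lines. This is an illustration of the formula as described above, not the ARC Prize Foundation's actual scoring code, and the function name is invented for the example:

```python
def efficiency_score(human_actions: int, ai_actions: int) -> float:
    """Score = (human actions / AI actions) squared, capped at 100%.

    Taking more actions than the human baseline is punished
    quadratically, which is what makes brute force hopeless.
    """
    ratio = min(human_actions / ai_actions, 1.0)
    return ratio ** 2 * 100

# Twice as many actions as the human baseline: score caps at 25%.
print(efficiency_score(50, 100))            # → 25.0
# Ten times as many actions: score collapses to about 1%.
print(round(efficiency_score(50, 500), 2))  # → 1.0
```

Because the penalty is squared rather than linear, an agent that wanders for thousands of steps before stumbling on the answer scores effectively zero, however many games it eventually clears.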

Here are the numbers as of the March 25 launch:

• Humans: 100%

• StochasticGoose (a small CNN built by Tufa Labs): 12.58%

• Gemini 3.1 Pro: 0.37%

• GPT-5.4: 0.26%

• Claude Opus 4.6: 0.25%

• Grok-4.20: 0.00%

The most powerful language models ever built, trained on essentially the entire internet at a cost of hundreds of billions of dollars, were outperformed by a four-layer neural network with no language component anywhere inside it. That’s the number the industry would prefer you didn’t focus on.


Has AI Achieved AGI? Why a Small CNN Beat the Giants

This is the part that deserves attention. StochasticGoose was built in weeks by a small team at Tufa Labs led by Dries Smit. It uses simple reinforcement learning. It has no language model anywhere inside it. The team was explicit about why: ARC-AGI-3 environments can produce hundreds of interaction steps and potentially millions of tokens. Large language models choke on that.

Instead, they fed grid frames directly into a CNN, stored state transitions in hash tables to avoid redundant exploration, and let reinforcement learning drive the search. The third-place entry, also non-LLM, used a training-free graph-based exploration system. Both winning approaches did the same thing: they explored action spaces without trying to reason in language about what to do next.
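The hash-table idea is simple to sketch. The code below is an illustrative reconstruction of the technique as described, not Tufa Labs' actual implementation; the function names and the `step_fn` callback are assumptions for the example:

```python
import hashlib
import random

def frame_key(frame: bytes) -> str:
    """Hash a raw grid frame so identical states collapse to one key."""
    return hashlib.sha256(frame).hexdigest()

# Maps (state hash, action) -> resulting state hash, so the agent
# never re-explores a transition it has already observed.
seen_transitions: dict[tuple[str, int], str] = {}

def explore_step(frame: bytes, actions: list[int], step_fn) -> int:
    """Pick an action whose outcome from this state is still unknown.

    step_fn(action) is assumed to apply the action to the environment
    and return the next frame as bytes.
    """
    key = frame_key(frame)
    untried = [a for a in actions if (key, a) not in seen_transitions]
    action = random.choice(untried or actions)
    next_frame = step_fn(action)
    seen_transitions[(key, action)] = frame_key(next_frame)
    return action
```

The payoff is that exploration effort is spent only on transitions the agent has not seen before, which directly attacks the action-efficiency scoring: fewer wasted steps means a quadratically better score.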

The gap between 12.58% and 0.37% is not a minor underperformance. It is a 30x difference. And it points to something the industry rarely states plainly. Large language models are extraordinary at predicting the next token in a human-language context. That is what they were built for and that is where they excel. They are not, at least not yet, reliable at discovering goals and rules in unfamiliar situations without being told what to do.


Has AI Achieved AGI? The Critics and the Counterpoints

Before writing this off as a definitive answer, a few points from the other side deserve acknowledgement.

The scoring has genuine limitations. Some fog-of-war levels contain direction choices that are effectively coin flips, which tanks efficiency scores through no fault of the agent. The human baseline uses the second-best performer rather than an average. AI receives the game state as JSON while humans see rendered visuals. Critics including ex-OpenAI game designer Psyho and scaling researcher @scaling01 have pointed out these issues publicly.

Chollet responded personally on Hacker News. The baseline is achievable, he said. On JSON versus visual input: if a model needs pixels to understand the concept of a square, it hasn’t abstracted the concept of a square. Both sides have a point. The scoring is punitive, and it’s punitive on purpose.

There’s also the Duke University experiment, which is genuinely interesting. A custom harness pushed Claude Opus from 0.25% to 97.1% on a single known environment called TR87. When the same harness was tested on a different environment, the score dropped to 0%. The harness solved one game it was built for. It didn’t generalise. That’s Chollet’s point in a sentence: current AI agents are, as he described them, “brittle harnesses around a frozen brain.”

Sam Altman, on stage at Y Combinator with Chollet, argued the opposite. He said scaffolding and harnesses are part of intelligence, and that engineering better agent loops is the path forward. The Agentica SDK hit 36% on the public demo set using heavy multi-agent harnesses with sub-agents for exploring, theorising, testing and solving. The gap between 36% (heavily engineered) and 0.25% (vanilla API) tells you how much of the work is harness versus model.

The honest answer to “has AI achieved AGI” depends heavily on what you mean by AGI. Huang’s definition was specific and economic: could AI start and run a $1 billion business? Chollet’s definition is cognitive: can AI figure out unfamiliar situations without being told how? On the first, Huang has a reasonable case. On the second, the answer is measurably 0.37%, and a small CNN is doing it 30 times better.


What This Means for UK SMEs Using AI

For UK business leaders being pitched AI tools right now, three practical lessons emerge from all of this.

First, the gap between AI marketing and AI capability just got measurable. When a vendor describes their product as “intelligent”, “reasoning” or “AGI-powered”, ask them what that means specifically. The ARC-AGI-3 results give you a clear reference point. Current frontier AI is genuinely excellent at language fluency, code generation, pattern matching, document analysis and cross-domain synthesis. It is not, at least today, reliable at goal-discovery in novel environments without instructions. That distinction matters when you’re evaluating AI tools for your business.

Second, the AI use cases that work are the ones where context is clear. Where instructions exist, where patterns are stable, where the task looks like something the model has seen before, AI delivers meaningful value. This is exactly the kind of work SMEs are using AI for successfully right now. Automating admin. Generating content. Analysing documents. Writing code. These use cases are well-suited to current AI capabilities and we’ve covered them extensively, including in our coverage of how Apple Intelligence is reshaping customer discovery and what AI agents can do for business.

Third, this is exactly why structured AI adoption beats hype-chasing. The businesses getting results from AI in 2026 aren’t the ones most convinced by marketing claims. They’re the ones with a clear understanding of what AI can and can’t do, deployed against specific commercial objectives. That starts with our free AI Readiness Assessment, progresses through an AI Workshop that identifies where AI actually fits your operations, and culminates in an AI Roadmap that keeps your deployments grounded in commercial reality, not industry narrative.


The Bigger Picture: This Is Why the AI Industry Is Restructuring

The ARC-AGI-3 results sit within a broader pattern we’ve tracked across recent coverage. Nvidia is consolidating the entire AI infrastructure layer, Google’s Gemma 4 is making frontier AI available on your own hardware, Elon Musk's OpenAI lawsuit is about to test the legal foundations of the AI industry and now a benchmark designed specifically to measure whether AI can reason in unfamiliar situations is showing us that, on that specific definition, we’re nowhere near where the marketing suggests.

None of this means AI isn’t valuable. It is, immensely so, for the right use cases. But the gap between what AI can do brilliantly and what AI still cannot do is real, measurable and widening in specific ways. Businesses that understand this will continue to extract enormous value from AI while avoiding the expensive mistakes made by those who bought the hype. The ones that don’t will be the 43% of companies the recent IDC study found wasting their AI training budgets on tools that didn’t deliver.


The Bottom Line

Has AI achieved AGI? Depends entirely on how you define AGI. By Jensen Huang’s narrow economic definition, he has a case worth considering. By François Chollet’s cognitive definition (which more closely matches what most people mean when they use the term), frontier AI is scoring 0.37% on a test that humans find trivial and a small CNN is beating by 30 times.

For UK SMEs, the practical verdict is this: AI is an extraordinarily useful tool for well-defined tasks with clear context. It is not, today, a general-purpose intelligence that will figure out your business problems without being told how. Plan your AI investments against what it can actually do, not what the headlines say it can.

Complete our free AI Readiness Assessment to understand where AI genuinely fits your business, and how to build an AI strategy grounded in commercial reality rather than industry hype.
