What If We're Measuring AI Progress Wrong?
Notes after watching Dwarkesh Patel's interview with Ilya Sutskever
I just watched Sutskever's interview with Dwarkesh Patel, and his perspective challenges many of the AI industry's current assumptions.
The throughline: the metrics everyone obsesses over might be decoupling from the capabilities that actually matter. And if that's true, we're navigating by a broken compass.
TL;DR
Ilya Sutskever thinks the age of scaling is over. From 2020 to 2025, "just add more compute" worked. Now we're back to fundamental research—just with bigger computers. The core problems: models ace benchmarks but fail basic tasks (researchers are reward-hacking themselves by training around evals), and humans still learn from dramatically less data for reasons we can't fully explain. His vision of superintelligence isn't a finished mind that knows everything—it's a mind that can learn anything, like "a superintelligent 15-year-old." Timeline: 5-20 years. The refreshing part: he openly admits SSI is a research bet on ideas that might not work, building systems "that don't exist that we don't know how to build." If he's right, the benchmarks everyone obsesses over are measuring the wrong thing, and figuring out what makes human learning so efficient matters more than scaling up current approaches.
The Eval Paradox
Ilya describes something genuinely puzzling about current AI systems. Models perform remarkably well on difficult evaluations—you look at the benchmarks and think "those are pretty hard evals"—yet their real-world impact seems dramatically behind what those scores would suggest.
His concrete example: You're using an AI coding assistant. You hit a bug. You tell the model to fix it. The model responds enthusiastically—"Oh my god, you're so right, I have a bug, let me fix that"—and introduces a second bug. You point out the new bug. The model apologises again and... brings back the first bug. You can alternate between these two states indefinitely.
How can a model that scores in the 90th percentile on competitive programming benchmarks get stuck in this loop?
He offers two potential explanations:
RL makes models too single-minded. Reinforcement learning might make models overly focused, harming basic competence. They become intensely good at the specific optimisation target while losing something more fundamental.
The data selection problem. With pre-training, the question of what data to use had an easy answer: everything. But with RL training, you have to choose. And researchers—perhaps unconsciously—take inspiration from evals. "I want the evals to look great. What RL training could help with this specific task?"
Dwarkesh's response captures the irony: "I like this idea that the real reward hacking is human researchers who are too focused on the evals."
Goodhart's Law, applied to AI development itself.
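To make that dynamic concrete, here's a deliberately toy numerical sketch (every number and weight below is invented for illustration): a proxy metric that over-weights eval-specific skill keeps climbing under greedy optimisation, while the "true" value, which mostly depends on general capability, barely moves.

```python
# Toy illustration of Goodhart's law in eval-driven training (hypothetical numbers).
# "proxy" stands for a benchmark score; "true_value" for real-world usefulness.

def proxy_score(general, eval_specific):
    return 0.3 * general + 0.7 * eval_specific   # benchmarks reward eval-specific skill

def true_value(general, eval_specific):
    return 0.9 * general + 0.1 * eval_specific   # real work mostly needs general skill

general, eval_specific = 1.0, 1.0
for step in range(5):
    # A greedy "researcher" spends the training budget where the proxy improves most:
    # the eval-specific axis, because the proxy weights it more heavily.
    eval_specific += 1.0
    print(f"step {step}: proxy={proxy_score(general, eval_specific):.1f}, "
          f"true={true_value(general, eval_specific):.1f}")
# The proxy climbs every step; true value barely moves. The metric has decoupled.
```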
The Two Students
Ilya illustrates this with an analogy about competitive programming:
Student A decides to become the best competitive programmer. They practice 10,000 hours, solve every problem, memorise every proof technique, and master every algorithm. They become elite.
Student B thinks competitive programming is cool. They practice for 100 hours. They also do well.
Who has the better career? The second student.
Current models are like Student A, taken to an even greater extreme. Once researchers decide the model should be good at competitive programming, they gather every problem ever created, augment the data to generate more, and train extensively on all of it.
The result: a model with every algorithm at its fingertips. And yet that level of narrow preparation doesn't generalise.
What is Student B doing differently? Ilya calls it "the it factor"—some quality of learning that produces robust, transferable capability rather than brittle benchmark performance. Whatever that is, current training methods don't capture it.
So, How Should We Measure Progress?
If benchmark scores are decoupling from real capability, we need different signals. A few possibilities:
Robustness under distribution shift. Not just "can you solve this problem?" but "can you solve variations you haven't seen before?" Current benchmarks test within-distribution performance. Real-world capability requires out-of-distribution generalisation (a rough sketch of such a check appears below).
Failure mode analysis. Rather than tracking success rates, systematically characterising how models fail. The bug-oscillation example is a specific failure pattern. Are there others? Do they cluster in meaningful ways?
Economic impact as ground truth. Ilya notes that models "seem smarter than their economic impact would imply." If deployment isn't generating the value that benchmark scores suggest, that gap is itself informative.
Longitudinal task performance. Not just "can you complete this task?" but "can you complete this task, then a related task, then handle an interruption, then return to the original?" Real work involves context-switching and sustained coherence that one-shot benchmarks don't capture.
None of these is easy to measure. That's partly the point. The things that are easy to measure are precisely what training has learned to optimise for.
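As a rough illustration of the first idea, robustness under distribution shift, here's a minimal Python sketch (the `reword` perturbation and the `model_solve` callable are hypothetical stand-ins for whatever evaluation pipeline you actually use): score the same items before and after a meaning-preserving rewrite, and treat the gap, not the raw score, as the signal.

```python
# Sketch of a robustness-under-distribution-shift check (all names hypothetical).
# A large gap between in-distribution and shifted accuracy suggests memorised
# benchmark patterns rather than general capability.

def reword(problem: str) -> str:
    """Meaning-preserving surface rewrite: synonym swaps only."""
    synonyms = {"list": "array", "largest": "biggest", "Return": "Output"}
    return " ".join(synonyms.get(w, w) for w in problem.split())

def accuracy(model_solve, items):
    return sum(model_solve(q) == gold for q, gold in items) / len(items)

def robustness_report(model_solve, items):
    in_dist = accuracy(model_solve, items)
    shifted = accuracy(model_solve, [(reword(q), gold) for q, gold in items])
    return {"in_distribution": in_dist,
            "shifted": shifted,
            "gap": in_dist - shifted}   # the gap, not the raw score, is the signal

# Usage with a stub model that always answers "42":
# robustness_report(lambda q: "42", [("Return the largest item", "42")])
```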
Emotions as Evolution's Value Function
One of the more surprising tangents: Ilya describes a stroke patient who lost his emotional processing. The man remained articulate. He could solve puzzles. On cognitive tests, he seemed fine. But he became catastrophically bad at making decisions—taking hours to choose which socks to wear and making terrible financial decisions.
The implication: emotions aren't noise in the system. They might be evolution's hardcoded value function—the thing that tells you which decisions matter and when to stop deliberating.
This connects to a gap in current AI training. Reinforcement learning, as typically practised, doesn't use value functions much. You give the model a problem, it takes thousands of actions, produces a solution, and only then receives a training signal. No intermediate feedback. No sense of "I'm going in a bad direction" until you've gone all the way.
Humans have something different. When you lose a chess piece, you know immediately you messed up. You don't need to play out the rest of the game. When you're debugging code and feel that sinking sensation of "this approach isn't working"—that's a value function giving you an early signal.
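A toy sketch of that contrast (everything below is hypothetical; a random walk stands in for a real task): with only a terminal reward, the learner hears nothing until the episode ends, while even a crude value function turns every step into feedback.

```python
# Sparse terminal reward vs. an intermediate value signal (toy example).
# The "task" is a random walk that succeeds if it ends above +3.

import random

def value_fn(position):
    # Stand-in for a learned value function: rough estimate of ending above +3.
    return min(max((position + 5) / 10, 0.0), 1.0)

def run_episode(use_value_fn: bool, steps: int = 20):
    position, prev_v = 0, value_fn(0)
    for t in range(steps):
        position += random.choice([-1, +1])
        if use_value_fn:
            v = value_fn(position)
            feedback = v - prev_v        # early signal, like "I just lost a piece"
            prev_v = v
            print(f"step {t}: feedback {feedback:+.1f}")
    return 1 if position > 3 else 0      # the only signal in the sparse case

run_episode(use_value_fn=False)  # silence for 20 steps, then a single 0/1 at the end
run_episode(use_value_fn=True)   # a per-step sense of "good direction / bad direction"
```

The second run is closer to what Ilya describes humans having: the chess player's immediate sense of "I messed up," long before the game is over.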
The Generalisation Mystery
This points to a deeper puzzle. The human learning advantage isn't just sample efficiency—it's that we exhibit robust, transferable learning in domains evolution couldn't have prepared us for.
If human learning could be explained solely by evolutionary priors, you'd expect us to be good at vision (millions of years of optimisation), locomotion (same), and social reasoning (hundreds of thousands of years). And we are.
But we're also good at algebra. And coding. These didn't exist until recently. Yet humans still learn them better than models—not in absolute performance, but in how quickly we pick them up and how reliably we generalise.
That's weird. It suggests whatever makes humans learn well isn't just a collection of domain-specific priors. It's something more fundamental about how learning itself works.
Ilya clearly has ideas about what this might be, but noted that "circumstances make it hard to discuss in detail." The implication is that this sits at the core of what SSI is actually working on.
He did float one related mystery he can't solve: how did evolution encode high-level social desires? Hunger makes sense—follow chemical signals. But "caring what society thinks of you" requires complex computation across the whole brain. How did the genome—which isn't intelligent—specify that?
His honest answer: "I'm unaware of good hypotheses for how it's done."
The Superintelligent 15-Year-Old
This reframes what we should even be building.
The term AGI, Ilya argues, "overshot the target." It was a reaction to "narrow AI"—chess engines that couldn't do anything else. So people said we need a general AI that can do everything.
But humans aren't AGI by that definition. We lack massive amounts of knowledge. We rely on continual learning.
His vision of superintelligence isn't a finished mind that knows how to do every job. It's a mind that can learn any job.
"You produce a superintelligent 15-year-old that's very eager... You go and be a programmer, you go and be a doctor, go and learn."
The deployment itself would involve learning, trial and error, and the accumulation of expertise over time. Not dropping a finished system into the world, but growing one.
On the "Scaling Is Over" Claim
This is genuinely contested. Sutskever argues that the easy gains from "just make it bigger" are exhausted.
His framing:
- 2012-2020: Age of research
- 2020-2025: Age of scaling
- Now: Back to research, but with big computers
"Is the belief really that 100x more compute would transform everything? I don't think that's true."
I'd note the selection bias: Sutskever left OpenAI and started SSI, which is explicitly pursuing a different technical path. He has incentives—conscious or not—to frame the paradigm he left behind as hitting a wall.
That said, his empirical observation is real. There is a puzzling gap between benchmark performance and economic value extraction. Models that ace coding tests still produce buggy, inconsistent code in practice.
And the structural point stands: "There are more companies than ideas." Scaling sucked all the air out of the room. Everyone converged on the same approach.
What's Refreshing Here
Ilya openly admits he doesn't know how to build what he's trying to build.
"We are talking about systems that don't exist that we don't know how to build."
SSI is a research bet. The ideas might not pan out. The timeline is 5-20 years: wide error bars, notably not the "2-3 years" that some others predict.
In a field where everyone projects confidence, there's something valuable in someone with his track record saying: we're back in the age of research, the hard problems aren't solved, and the benchmarks everyone watches might be measuring the wrong thing.
If he's right, the path forward isn't scaling the current approach. It's figuring out what Student B knows that Student A doesn't.