Why Apple's "The Illusion of Thinking" Falls Short: A Critique of Flawed Research
A Study Chasing Headlines, Not Truth
Apple's paper, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," aims to reveal the shortcomings of Large Reasoning Models (LRMs). While its observations contain some truth, the conclusions it draws are deeply flawed.
Here’s why this study fails to deliver—and how its errors pile up into a misleading narrative.
Problem #1: Cumulative Error Probability—Not a Flaw, Just Reality
LRMs generate outputs via neural networks, where each token carries a tiny error probability. In a complex task like the Tower of Hanoi with 13 disks, the model needs roughly 81,910 tokens (2^13 − 1 = 8,191 moves at ~10 tokens per move). Even with 99.99% per-token accuracy, the probability of a flawless solution is about 0.9999^81,910 ≈ 0.03%. Apple paints this as a fatal weakness, but it's not unique to AI: humans also stumble in long, repetitive tasks as small errors accumulate. This is simply how probabilistic systems behave, not a groundbreaking flaw.
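As a rough sanity check (assuming, as above, ~10 tokens per move and an independent 99.99% per-token success rate), the arithmetic looks like this:

```python
# Back-of-the-envelope estimate for a flawless 13-disk Tower of Hanoi transcript,
# assuming ~10 tokens per move and independent 99.99% per-token accuracy.
disks = 13
moves = 2**disks - 1          # 8,191 moves in the optimal solution
tokens = moves * 10           # ~81,910 tokens of output
p_token = 0.9999              # per-token accuracy

p_perfect = p_token ** tokens
print(f"moves={moves}, tokens={tokens}, P(perfect) ≈ {p_perfect:.2%}")
# moves=8191, tokens=81910, P(perfect) ≈ 0.03%
```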
Problem #2: Misjudging Abstraction as a Failure
When faced with massive problems, LRMs shift to describing the solution algorithm rather than executing every step, a move Apple criticizes as a defect. But this is a strength, not a failure. The model recognizes that its error rate climbs with output length (having been trained on problems spanning a wide range of complexity) and adapts by offering a high-level solution, much as a human would. Users want answers, not exhaustive move-by-move breakdowns, yet Apple's rules rigged the game from the start.
Problem #3: Apple Hobbled The Models From Working As Intended
Apple spotlights token limits as a critical flaw, but this is a red herring. For the Tower of Hanoi, an LRM could solve the puzzle perfectly by writing and executing code, yet Apple disallowed this. ChatGPT, Claude, and Google Gemini can all write and run code to solve problems their neural networks alone cannot, but Apple's test skipped this practical strategy, forcing the models into an unrealistic bind. Reasoning isn't about flawlessly churning out thousands of steps; it's about planning and adapting. By obsessing over token constraints, Apple dodges the real question: can LRMs reason beyond raw computation? Their setup can't answer that.
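To make the point concrete, here is a minimal sketch (not Apple's setup, just an illustration) of the kind of program an LRM could write and run to produce a provably correct move list; solving the 13-disk case this way takes a few lines of code instead of tens of thousands of generated tokens:

```python
# Tower of Hanoi: generate the full optimal move list programmatically.
# A model that can write and run code sidesteps per-token error accumulation entirely.

def hanoi(n, src, aux, dst, moves):
    """Append the optimal sequence of (disk, from, to) moves for n disks."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # move the n-1 smaller disks out of the way
    moves.append((n, src, dst))          # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)   # move the n-1 smaller disks back on top

moves = []
hanoi(13, "A", "B", "C", moves)
print(len(moves))  # 8191 == 2**13 - 1, every move guaranteed correct
```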
Problem #4: A Puzzle, Not a Proving Ground
The Tower of Hanoi is a toy problem with a neat, known solution—hardly a test of real-world reasoning. True reasoning tackles planning, task decomposition, self-correction, ambiguity, trade-offs, and creativity, not just scripted steps. By hinging their study on this contrived puzzle, Apple sacrifices relevance and generalizability. It’s like rating a chef by how well they boil water—measurable, but meaningless.
Flip the problem around: if an LRM did solve the 13-disk Tower of Hanoi, would that prove it is capable of human-level reasoning? No, it would only prove that the model's capacity for long, error-free output had grown by a few orders of magnitude. Larger models can solve arbitrarily longer puzzles, but that demonstrates raw computational capacity, not reasoning ability.
Conclusion: A Study Chasing Headlines, Not Truth
Apple’s "The Illusion of Thinking" flags real issues but trips over its own logic. Cumulative errors are overstated, abstraction is misread, token limits are a non-issue, and the puzzle choice undermines it all. This isn’t a decisive strike against LRMs—it’s a misfire. If we’re serious about testing reasoning, we need tasks rooted in reality, not academic gimmicks. The illusion here lies in the study, not the models.