The Smartest Bear vs the Dumbest Human: The Implications of AI's Growing Problem-Solving Abilities
A park ranger once summed up the main challenge of designing bear-proof trash bins: there is substantial overlap between the smartest bears and the dumbest humans. While humans can perform a huge variety of tasks that bears can't, in the narrow realm of manipulating simple mechanical systems, a number of animals compete effectively with us.
So it is with AI. GPT-4 is already better than humans at solving CAPTCHAs. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart." GPT-4 has reportedly passed versions of the Turing test, so it may be that AI has killed the CAPTCHA for good.
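To make that claim concrete, here is a minimal sketch of the kind of probe involved, using the OpenAI Python client to ask GPT-4o to read distorted text from an image. The image URL is a placeholder of my own, and deployed models may refuse requests framed explicitly as CAPTCHA solving; this only illustrates the mechanics.

```python
# Minimal sketch: probe a multimodal model with a distorted-text image.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the image URL below is a placeholder, not a real endpoint.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A content list mixing text and an image makes this a
            # multimodal request.
            "content": [
                {"type": "text", "text": "Transcribe the distorted text in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/distorted-text.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```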
If it is true that AI has irredeemably defeated CAPTCHAs, with no hope of an AI-proof replacement (as seems likely to me), the implications are profound: at least in the domain of time-constrained problems solvable by the majority of humans, AI systems already reason better than we do.
I discussed the CAPTCHA problem with Claude, and we came up with a challenge that current systems are unlikely to pass: a 3D world simulation in which users are verbally asked to collaborate in real time with other humans on physics-based problems, such as manipulating lighting conditions. There were other, more exotic ideas, but I saw no fundamental constraint that would keep such problems safe from today's LLMs once they are given suitable interaction tools and real-time abilities (such as those shortly coming to GPT-4o).
So did GPT-4 really pass the Turing test? That depends on which Turing test you mean:
The development of AI is marked by the progressive mastery of specific intellectual tasks, leading to an increasing number of claims about passing various formulations of the Turing test. However, the Turing test is not a single, fixed benchmark but a continuum of increasingly complex challenges designed to assess an AI's ability to exhibit human-like intelligent behavior. These challenges are expected to progress from brief text-based conversations with non-experts to in-depth discussions with specialists, and eventually to multimodal interactions involving audio-visual communication and perhaps even physical embodiment in the form of androids. Rather than serving as a definitive threshold, the Turing test represents a series of milestones on the path towards artificial general intelligence (AGI), with each successive test passed marking a significant advance in AI capabilities.
Clearly, current AI systems (GPT-4, Google Gemini, Claude Opus) are not generally intelligent; they can only solve problems of limited complexity. Within that scope, however, they are close to becoming universal problem solvers: no domain of human reasoning is safe. "Hard" capabilities such as theory of mind, spatial reasoning, visual processing, and perceiving and expressing emotion have all been solved. The main remaining challenge is scaling these systems, a very hard problem indeed, but one the computing world has been solving successfully since its inception.
Here is a simple test to try at home: ask GPT-4o to stack a set of random objects on top of each other, such as a feather, an egg, a chair, and a bowling ball. This exact sequence is almost certainly not in its training data, and answering it requires both spatial and physics reasoning, which is strong evidence that LLMs have a detailed internal world model. You can use this method to probe the limits of an LLM's internal physics/world model: it can easily plot a cross-country trip, but it fails at inventing new jiu-jitsu moves, because two opposing human bodies have too many variables.
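If you'd rather script the probe than use the chat UI, here is a minimal sketch using the OpenAI Python client. The prompt wording and object list are illustrative choices of mine, not a canonical benchmark; it assumes the `openai` package is installed and an API key is configured.

```python
# Minimal sketch of the object-stacking probe via the OpenAI Python client.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# An arbitrary, deliberately awkward set of objects; vary these to probe
# how far the model's internal physics/world model stretches.
objects = ["a feather", "an egg", "a chair", "a bowling ball"]

prompt = (
    "Stack the following objects on top of each other so the stack is as "
    f"stable as possible, and explain your ordering: {', '.join(objects)}."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Swapping in harder object sets (liquids, soft bodies, interlocking shapes) is a cheap way to find where the model's spatial reasoning starts to break down.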
I’ll rephrase my thesis: An algorithm capable of solving any CAPTCHA demonstrates an ability to reason on any topic. Consequently, the primary factor limiting such an algorithm's intelligence is the scale of computational resources available for its training and operation, rather than any inherent constraints on its reasoning abilities.