If AI skeptics are right, we should expect Large Language Model (LLM) progress to hit a wall well before reaching human-level capability. If LLMs can never be truly intelligent, progress should eventually plateau despite technical efforts. Before this point, we should expect progress to slow. But is there evidence of this happening?
Review of Recent AI Model Progress
I investigated how AI models have improved since GPT-4's release. While GPT-4 was a significant leap over GPT-3, many claim that progress since then has been much slower.
Summary of Findings
AI capability growth has indeed slowed since GPT-4's release, but progress in other dimensions such as cost, speed, and multimodality has been both steady and dramatic.
Model Comparisons
GPT-4 (March 2023):
Cost: $30 per million input tokens, $60 per million output tokens
Context window: 8,192 tokens
MMLU score: 86.4%
LMSYS score: 1251
Claude 3.5 Sonnet (June 2024):
Cost: $3 per million input tokens, $15 per million output tokens
Context window: 200K tokens
Capabilities: Text and image input, text output
MMLU score: 88.7%
LMSYS score: 1271
GPT-4o mini (July 2024):
Cost: $0.15 per million input tokens, $0.60 per million output tokens
Context window: 128K tokens
MMLU score: 82.0%
Capabilities: Text and image input, text output; video and audio support planned
Progress Summary (March 2023 to July 2024)
Capability: Modest but broad gains in reasoning and benchmark scores
Cost: Input prices down roughly 10x at the frontier tier and 200x at the small-model tier
Context size: Increased about 25x
Speed: Much faster
Modalities: Support for audio, images, and emerging video capabilities
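The cost figures behind this summary can be sanity-checked with a few lines of arithmetic, using the per-model input prices quoted above:

```python
# Input-token prices quoted above, in USD per million tokens.
prices = {
    "GPT-4 (Mar 2023)": 30.00,
    "Claude 3.5 Sonnet (Jun 2024)": 3.00,
    "GPT-4o mini (Jul 2024)": 0.15,
}

baseline = prices["GPT-4 (Mar 2023)"]
for model, price in prices.items():
    # Ratio of GPT-4's launch price to each later model's price.
    print(f"{model}: {baseline / price:.0f}x cheaper than GPT-4")
```

Frontier-quality models became roughly 10x cheaper on input, while the small-model tier became roughly 200x cheaper.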
My Theory on AI Progress Dynamics
To summarize, while AI capability progress has slowed since GPT-4's release, other aspects have improved significantly. Here's a theory to explain this phenomenon:
Major capability improvements require at least one of:
Algorithm enhancements
Substantially more training compute (e.g., scaling from today's roughly $100 million training runs to $1-$10 billion runs)
Improved hardware
To justify $10 billion+ training runs, AI companies need to demonstrate the value of their models. Consequently, they're focusing on cost-effective ways to increase model value:
Reducing operational costs
Improving processing speed
Enhancing versatility (i.e., multimodality)
Designing and producing new AI chipsets is time-intensive:
Nvidia's $3 trillion valuation has spurred significant investment in new AI chipsets across the industry
However, it will take years for these efforts to result in marketable products
We're witnessing a shift in GPU innovation funding from video games to AI applications
For this process to scale up, AI companies need to provide tangible value to businesses
TLDR: The current focus on practical improvements is laying the groundwork for future leaps in AI capabilities, once the necessary infrastructure and economic justification are in place.
Future Developments
MMLU Benchmark
The MMLU (Massive Multitask Language Understanding) benchmark, introduced in 2020, consists of roughly 16,000 multiple-choice questions across 57 academic subjects. Human domain experts achieve around 89.8% accuracy, while Claude 3.5 Sonnet reaches 88.7% in a 5-shot setting.
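Mechanically, a 5-shot MMLU evaluation prepends five worked examples from a subject's development set to each test question, then checks whether the model emits the correct answer letter. A minimal sketch of that protocol (the questions and helper names here are illustrative placeholders, not the official evaluation harness):

```python
# Sketch of k-shot MMLU-style prompting and scoring. The Question
# items used below are made up for illustration, not real MMLU data.
from dataclasses import dataclass

LETTERS = "ABCD"

@dataclass
class Question:
    stem: str
    choices: list[str]  # four options, mapped to letters A-D
    answer: str         # correct letter

def format_question(q: Question) -> str:
    opts = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, q.choices))
    return f"{q.stem}\n{opts}\nAnswer:"

def build_prompt(shots: list[Question], test: Question) -> str:
    # Each shot includes its answer letter; the test item does not.
    parts = [format_question(s) + f" {s.answer}" for s in shots]
    parts.append(format_question(test))
    return "\n\n".join(parts)

def accuracy(predicted: list[str], tests: list[Question]) -> float:
    # Fraction of test items where the predicted letter matches the key.
    return sum(p == t.answer for p, t in zip(predicted, tests)) / len(tests)
```

Accuracy is then just the fraction of test items answered with the correct letter, which is what the 86.4% and 88.7% figures above report.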
Anthropic is likely to release Claude 3.5 Opus soon, potentially improving the MMLU score and possibly surpassing human experts.
Defining AGI
Achieving parity with human experts on multiple-choice tests like MMLU should not be considered AGI (Artificial General Intelligence). Human intelligence encompasses much more than answering test questions. These tests primarily assess human knowledge and a subset of reasoning abilities, which we might call "human-level knowledge parity."
Path to AGI
Once knowledge parity with humans is reached, future progress may focus on operationalizing knowledge to achieve AGI. Key aspects include:
Agentic behavior: Planning and executing complex, multi-step tasks
Online learning: Using experience to form new long-term memories
Multi-modality: Input and output of audio, vision/video, and other senses
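Of these, agentic behavior is the most concrete today: it usually amounts to a loop in which the model proposes the next action, a tool executes it, and the observation is fed back into the context. A minimal sketch of that loop (the `model` callable and tool names are stand-ins, not any vendor's real API):

```python
# Minimal agent loop sketch: the model repeatedly picks an action,
# a tool executes it, and the observation is appended to the history.
# `model` is an illustrative stand-in for an LLM call.
def run_agent(model, tools, goal, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action, arg = model(history)      # model proposes the next step
        if action == "finish":
            return arg                    # model declares the task done
        observation = tools[action](arg)  # execute the chosen tool
        history.append(f"{action}({arg}) -> {observation}")
    return None                           # step budget exhausted
```

The hard part is not the loop but the model's ability to plan reliably across many such steps, which is exactly the capability listed above.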
Some might add empathy or "emotions" to this list. Current LLMs already demonstrate a sophisticated theory of mind, perceiving and reasoning about others' mental states, including emotional subtext. Progress in other dimensions is likely to advance this capability as well.
Human emotions are a form of thinking, and there is evidence that this capacity emerges organically when AI models are trained. Note: here I refer only to models' understanding and expression of emotions, not to whether they have underlying motivations or truly experience feelings.
Conclusion/TLDR
(As Written by Claude 3.5)
While headline-grabbing capability improvements have slowed since the release of GPT-4, significant advancements have been made in other crucial areas such as cost efficiency, processing speed, and versatility. This shift in focus reflects a strategic pivot by AI companies to demonstrate tangible value and build the necessary infrastructure for future breakthroughs.
The path forward is not a simple linear progression. There's a complex interplay of technological innovation, economic factors, and strategic decision-making. The current phase of AI development, with its emphasis on practical improvements and economic viability, is laying the groundwork for the next quantum leap in AI capabilities. The true measure of AI advancement lies in its growing ability to solve real-world problems and integrate seamlessly into various aspects of our lives.