We’re Still Waiting for the Next Big Leap in AI

When OpenAI announced GPT-4, its latest large language model, last March, it sent shockwaves through the tech world. It was clearly more capable than anything seen before at chatting, coding, and solving all sorts of thorny problems—including school homework.

Anthropic, a rival to OpenAI, announced today that it has made its own AI advance that will upgrade chatbots and other use cases. But although the new model is the world’s best by some measures, it’s more of a step forward than a big leap.

Anthropic’s new model, called Claude 3.5 Sonnet, is an upgrade to its existing Claude 3 family of AI models. It is more adept at solving math, coding, and logic problems as measured by commonly used benchmarks. Anthropic says it is also a lot faster, better understands nuances in language, and even has a better sense of humor.

That’s no doubt useful to people trying to build apps and services on top of Anthropic’s AI models. But the company’s news is also a reminder that the world is still waiting for another leap forward in AI akin to that delivered by GPT-4.

Expectation has been building for more than a year for OpenAI to release a sequel called GPT-5, and the company’s CEO, Sam Altman, has encouraged speculation that it will deliver another revolution in AI capabilities. GPT-4 cost more than $100 million to train, and GPT-5 is widely expected to be much larger and more expensive.

Although OpenAI, Google, and other AI developers have released new models that outdo GPT-4, the world is still waiting for that next big leap. Progress in AI has lately become more incremental and more reliant on innovations in model design and training than on the brute-force scaling of model size and computation that produced GPT-4.

Michael Gerstenhaber, head of product at Anthropic, says the company’s new Claude 3.5 Sonnet model is larger than its predecessor but draws much of its new competence from innovations in training. For example, the model was given feedback designed to improve its logical reasoning skills.

Anthropic says that Claude 3.5 Sonnet outscores the best models from OpenAI, Google, and Facebook in popular AI benchmarks including GPQA, a graduate-level test of expertise in biology, physics, and chemistry; MMLU, a test covering computer science, history, and other topics; and HumanEval, a measure of coding proficiency. The improvements amount to only a few percentage points, though.

This latest progress in AI might not be revolutionary but it is fast-paced: Anthropic only announced its previous generation of models three months ago. “If you look at the rate of change in intelligence you’ll appreciate how fast we’re moving,” Gerstenhaber says.

More than a year after GPT-4 spurred a frenzy of new investment in AI, it may be turning out to be more difficult to produce big new leaps in machine intelligence. With GPT-4 and similar models trained on huge swathes of online text, imagery, and video, it is getting more difficult to find new sources of data to feed to machine-learning algorithms. Making models substantially larger, so they have more capacity to learn, is expected to cost billions of dollars. When OpenAI announced its own recent upgrade last month, with a model that has voice and visual capabilities called GPT-4o, the focus was on a more natural and humanlike interface rather than on substantially more clever problem-solving abilities.

Gauging the rate of progress in AI using conventional benchmarks like those touted by Anthropic for Claude can be misleading. AI developers are strongly incentivized to design their creations to score highly in these benchmarks, and the data used for these standardized tests can be swept into their training data. “Benchmarks within the research community are riddled with data contamination, inconsistent rubrics and reporting, and unverified annotator expertise,” says Summer Yue, research director at Scale AI, a company that helps many AI firms train their models.

Scale is developing new ways of measuring AI smarts through its Safety, Evaluations and Alignment Lab. This involves developing tests based on data that is kept secret and vetting the expertise of those who provide feedback on a model’s capabilities.

Yue is hopeful that companies will increasingly seek to demonstrate their models’ intelligence in more meaningful ways. She says they may do so “by showcasing real-world applications with measurable business impact, providing transparent performance metrics, case studies, and customer testimonials.”

Anthropic is touting such impacts for Claude 3.5 Sonnet. Gerstenhaber says that companies using the latest version have found its newfound responsiveness and problem-solving abilities beneficial. Customers include the investment firm Bridgewater Associates, which is using Claude to help with coding tasks. Some other financial firms, which Gerstenhaber declines to disclose, are using the model to provide investment advice. “The response during the early access period has been enormously positive,” he says.

It’s unclear how long the world must wait for that next big leap in AI. OpenAI has said it has started training its next big model. In the meantime, we will need to figure out new ways to measure how useful the technology really is.


Will Knight
