To Build a Better AI Supercomputer, Let There Be Light

To Build a Better AI Supercomputer, Let There Be Light

Most artificial intelligence experts seem to agree that taking the next big leap in the field will depend at least partly on building supercomputers on a once unimaginable scale. At an event hosted by the venture capital firm Sequoia last month, the CEO of a startup called Lightmatter pitched a technology that might well enable this hyperscale computing rethink by letting chips talk directly to one another using light.

Data today generally moves around inside computers—and in the case of training AI algorithms, between chips inside a data center—via electrical signals. Sometimes parts of those interconnections are converted to fiber-optic links for great bandwidth, but converting signals back and forth between optical and electrical creates a communications bottleneck.

Instead, Lightmatter wants to directly connect hundreds of thousands or even millions of GPUs—those silicon chips that are crucial to AI training—using optical links. Reducing the conversion bottleneck should allow data to move between chips at much higher speeds than is possible today, potentially enabling distributed AI supercomputers of extraordinary scale.

Lightmatter’s technology, which it calls Passage, takes the form of optical—or photonic—interconnects built in silicon that allow its hardware to interface directly with the transistors on a silicon chip like a GPU. The company claims this makes it possible to shuttle data between chips with 100 times the usual bandwidth.

For context, GPT-4—OpenAI’s most powerful AI algorithm and the brains behind ChatGPT—is rumored to have run on more than 20,000 GPUs. Harris says Passage, which will be ready by 2026, should allow for more than a million GPUs to run in parallel on the same AI training run.

Lightmatter wants to speed up AI supercomputers by moving data between chips using light, not electrical signals.

Courtesy of Lightmatter

One audience member at the Sequoia event was Sam Altman, CEO of OpenAI, who has at times appeared obsessed with the question of how to build bigger, faster data centers to further advance AI. In February, The Wall Street Journal reported that Altman has sought up to $7 trillion in funding to develop vast quantities of chips for AI, while a more recent report by The Information suggests that OpenAI and Microsoft are drawing up plans for a $100 billion data center, codenamed Stargate, with millions of chips. Since electrical interconnects are so power-hungry, connecting chips together on such a scale would require an extraordinary amount of energy—and would depend on there being new ways of connecting chips, like the kind Lightmatter is proposing.

GlobalFoundries, a company that makes chips for others, including AMD and General Motors, previously announced a partnership with Lightmatter. Harris says his company is “working with the largest semiconductor companies in the world as well as the hyperscalers,” referring to the largest cloud companies like Microsoft, Amazon, and Google.

If Lightmatter or another company can reinvent the wiring of giant AI projects, a key bottleneck in the development of smarter algorithms might fall away. The use of more computation was fundamental to the advances that led to ChatGPT, and many AI researchers see the further scaling-up of hardware as being crucial to future advances in the field—and to hopes of ever reaching the vaguely-specified goal of artificial general intelligence, or AGI, meaning programs that can match or exceed biological intelligence in every way.

Linking a million chips together with light might allow for algorithms several generations beyond today’s cutting edge, says Lightmatter’s CEO Nick Harris. “Passage is going to enable AGI algorithms,” he confidently suggests.

The large data centers that are needed to train giant AI algorithms typically consist of racks filled with tens of thousands of computers running specialized silicon chips and a spaghetti of mostly electrical connections between them. Maintaining training runs for AI across so many systems—all connected by wires and switches—is a huge engineering undertaking. Converting between electronic and optical signals also places fundamental limits on chips’ abilities to run computations as one.

Lightmatter’s approach is designed to simplify the tricky traffic inside AI data centers. “Normally you have a bunch of GPUs, and then a layer of switches, and a layer of switches, and a layer of switches, and you have to traverse that tree” to communicate between two GPUs, Harris says. In a data center connected by Passage, Harris says, every GPU would have a high-speed connection to every other chip.

Lightmatter’s work on Passage is an example of how AI’s recent flourishing has inspired companies large and small to try to reinvent key hardware behind advances like OpenAI’s ChatGPT. Nvidia, the leading supplier of GPUs for AI projects, held its annual conference last month, where CEO Jensen Huang unveiled the company’s latest chip for training AI: a GPU called Blackwell. Nvidia will sell the GPU in a “superchip” consisting of two Blackwell GPUs and a conventional CPU processor, all connected using the company’s new high-speed communications technology called NVLink-C2C.

The chip industry is famous for finding ways to wring more computing power from chips without making them larger, but Nvidia chose to buck that trend. The Blackwell GPUs inside the company’s superchip are twice as powerful as their predecessors but are made by bolting two chips together, meaning they consume much more power. That trade-off, in addition to Nvidia’s efforts to glue its chips together with high-speed links, suggests that upgrades to other key components for AI supercomputers, like that proposed by Lightmatter, could become more important.

https://www.wired.com/feed/rss

Will Knight

Leave a Reply