The Quest to Give AI Chatbots a Hand—and an Arm

Peter Chen, CEO of the robot software company Covariant, sits in front of a chatbot interface resembling the one used to communicate with ChatGPT. “Show me the tote in front of you,” he types. In reply, a video feed appears, revealing a robot arm over a bin containing various items—a pair of socks, a tube of chips, and an apple among them.

The chatbot can discuss the items it sees—but also manipulate them. When WIRED suggests Chen ask it to grab a piece of fruit, the arm reaches down, gently grasps the apple, and then moves it to another bin nearby.

This hands-on chatbot is a step toward giving robots the kind of general and flexible capabilities exhibited by programs like ChatGPT. There is hope that AI could finally fix the long-standing difficulty of programming robots and having them do more than a narrow set of chores.

“It’s not at all controversial at this point to say that foundation models are the future of robotics,” Chen says, using a term for large-scale, general-purpose machine-learning models developed for a particular domain. The handy chatbot he showed me is powered by a model developed by Covariant called RFM-1, for Robot Foundation Model. Like the models behind ChatGPT, Google’s Gemini, and other chatbots, it has been trained on large amounts of text, but it has also been fed video as well as hardware control and motion data from tens of millions of examples of robot movements, sourced from robots laboring in the physical world.

Including that extra data produces a model fluent not only in language but also in action, and able to connect the two. RFM-1 can not only chat and control a robot arm but also generate videos showing robots doing different chores. When prompted, RFM-1 will show how a robot should grab an object from a cluttered bin. “It can take in all of these different modalities that matter to robotics, and it can also output any of them,” says Chen. “It’s a little bit mind-blowing.”

The model has also shown it can learn to control similar hardware not in its training data. With further training, this might even mean that the same general model could operate a humanoid robot, says Pieter Abbeel, cofounder and chief scientist of Covariant, who has pioneered robot learning. In 2010 he led a project that trained a robot to fold towels—albeit slowly—and he also worked at OpenAI before it stopped doing robot research.

Covariant, founded in 2017, currently sells software that uses machine learning to let robot arms pick items out of bins in warehouses, but the arms are usually limited to the task they’ve been trained for. Abbeel says that models like RFM-1 could allow robots to turn their grippers to new tasks much more fluently. He compares Covariant’s strategy to how Tesla uses data from cars it has sold to train its self-driving algorithms. “It’s kind of the same thing here that we’re playing out,” he says.

Abbeel and his Covariant colleagues are far from the only roboticists hoping that the capabilities of the large language models behind ChatGPT and similar programs might bring about a revolution in robotics. Projects like RFM-1 have shown promising early results. But how much data may be required to train models that give robots much more general abilities—and how to gather it—is an open question.

Two of Covariant’s cofounders, Pieter Abbeel and Peter Chen. Courtesy of Elena Zhukova/Covariant

“The main challenge is that the data has not been available in the same way that you can just download text and images or videos on the internet,” says Pulkit Agrawal, a professor at MIT who works on AI and robotics.

As they try to figure that out, many researchers are trying to generate data for training robots, Agrawal says. This includes collecting data from videos showing humans performing tasks or from simulations featuring robots.

Google DeepMind, the search giant’s AI group, is one of the bigger AI players working on this approach. Last year its researchers developed their own AI models for robots called RT-2. Last November the same team released RT-X, a dataset of millions of robot actions sourced from different machines doing different tasks.

Agrawal says that Covariant’s huge trove of robot arm data from its deployments with customers is undoubtedly useful but notes that it is limited, for the moment, to a particular range of tasks. Right now it mostly sells to companies doing only certain warehouse tasks. “If you want to pick up a screw and screw it in, or peel a piece of ginger, that isn’t really a pick-and-place problem,” he says.

An intriguing aspect of the work Covariant is doing is that it can help the underlying AI models better understand the physics of the world. Abbeel notes that, compared to OpenAI’s remarkably realistic video model, Sora, which can struggle to render accurate human anatomy and basic physics, RFM-1 has a better grasp of what is and isn’t possible in the real world. “I’m not saying it’s perfect, but it has a pretty good understanding,” he says.

Will Knight