Transformer Net

Assembly Line

🧠🦾 RT-2: New model translates vision and language into action

📅 Date:

🔖 Topics: Robot Arm, Transformer Net, Machine Vision, Vision-language-action Model

🏢 Organizations: Google


High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognising visual or language patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need to collect robot data, first-hand, across every object, environment, task, and situation.

In our paper, we introduce Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, while retaining web-scale capabilities.
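
To make the idea concrete, the sketch below illustrates how robot actions can be expressed as strings of text tokens, so the same model that emits language can also emit motor commands. The bin count, value range, and function names are assumptions made for illustration, not RT-2's actual implementation.

```python
# Sketch: representing robot actions as text tokens so a vision-language
# model can output them the way it outputs ordinary words.
# NUM_BINS, the value range, and these function names are illustrative
# assumptions, not RT-2's actual implementation.
import numpy as np

NUM_BINS = 256  # assumed discretization resolution per action dimension

def action_to_text(action, low=-1.0, high=1.0):
    """Discretize a continuous action vector and render it as a
    space-separated string of integer tokens."""
    bins = np.clip((action - low) / (high - low) * (NUM_BINS - 1), 0, NUM_BINS - 1)
    return " ".join(str(int(b)) for b in bins)

def text_to_action(text, low=-1.0, high=1.0):
    """Parse the token string emitted by the model back into a
    continuous action vector for the robot controller."""
    bins = np.array([int(tok) for tok in text.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: a 7-DoF arm command (x, y, z, roll, pitch, yaw, gripper).
action = np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 1.0])
tokens = action_to_text(action)      # -> "140 89 191 127 153 114 255"
recovered = text_to_action(tokens)   # ~= the original action, up to binning error
```

Framing actions as just another kind of text is what lets web-scale pre-training transfer to control: the action vocabulary becomes more tokens for the model to predict.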

Read more at DeepMind Blog

🧠🦾 RoboCat: A self-improving robotic agent

📅 Date:

🔖 Topics: Robot Arm, Transformer Net

🏢 Organizations: Google


RoboCat learns much faster than other state-of-the-art models. It can pick up a new task with as few as 100 demonstrations because it draws from a large and diverse dataset. This capability will help accelerate robotics research, as it reduces the need for human-supervised training, and is an important step towards creating a general-purpose robot.

RoboCat is based on our multimodal model Gato (Spanish for "cat"), which can process language, images, and actions in both simulated and physical environments. We combined Gato's architecture with a large training dataset of sequences of images and actions of various robot arms solving hundreds of different tasks.

The combination of all this training means the latest RoboCat is based on a dataset of millions of trajectories, from both real and simulated robotic arms, including self-generated data. We used four different types of robots and many robotic arms to collect vision-based data representing the tasks RoboCat would be trained to perform.
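
As an illustration of the self-improvement loop described above, here is a schematic sketch of fine-tuning on a small set of demonstrations, letting the specialised agent generate its own practice data, and folding that data back into the growing dataset. The helper functions are stand-in stubs, not RoboCat's actual training code.

```python
# Schematic sketch of the self-improvement cycle described above.
# fine_tune, collect_episodes, and train are stand-in stubs that only
# show the shape of the loop; this is not RoboCat's actual training code.
from typing import Dict, List

def fine_tune(agent: Dict, demos: List[Dict]) -> Dict:
    """Stub: adapt the generalist agent to a new task from ~100 demonstrations."""
    return {**agent, "specialised_with_demos": len(demos)}

def collect_episodes(agent: Dict, n: int) -> List[Dict]:
    """Stub: the specialised agent practises the task and self-generates data."""
    return [{"source": "self-generated"} for _ in range(n)]

def train(agent: Dict, dataset: List[Dict]) -> Dict:
    """Stub: retrain the generalist on the enlarged multi-task dataset."""
    return {**agent, "trained_on_trajectories": len(dataset)}

def self_improvement_cycle(agent, dataset, new_task_demos, rounds=3, episodes_per_round=1000):
    specialist = fine_tune(agent, new_task_demos)            # as few as 100 demos
    for _ in range(rounds):
        new_data = collect_episodes(specialist, episodes_per_round)
        dataset.extend(new_data)                             # the dataset keeps growing
        agent = train(agent, dataset)                        # retrain the generalist
    return agent

agent = self_improvement_cycle(agent={"name": "generalist"}, dataset=[], new_task_demos=[{}] * 100)
```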

Read more at DeepMind Blog

RT-1: Robotics Transformer for Real-World Control at Scale

📅 Date:

✍️ Authors: Keerthana Gopalakrishnan, Kanishka Rao

🔖 Topics: Industrial Robot, Transformer Net, Open Source

🏢 Organizations: Google


Major recent advances in multiple subfields of machine learning (ML) research, such as computer vision and natural language processing, have been enabled by a shared approach that leverages large, diverse datasets and expressive models that can absorb all of the data effectively. Although there have been various attempts to apply this approach to robotics, robots have not yet leveraged highly capable models as well as other subfields.

Several factors contribute to this challenge. First, there's a lack of large-scale and diverse robotic data, which limits a model's ability to absorb a broad set of robotic experiences. Data collection is particularly expensive and challenging for robotics because dataset curation requires engineering-heavy autonomous operation or demonstrations collected via human teleoperation. A second factor is the lack of expressive, scalable models that are fast enough for real-time inference, can learn from such datasets, and generalize effectively.

To address these challenges, we propose the Robotics Transformer 1 (RT-1), a multi-task model that tokenizes robot inputs and output actions (e.g., camera images, task instructions, and motor commands) to enable efficient inference at runtime, which makes real-time control feasible. This model is trained on a large-scale, real-world robotics dataset of 130k episodes covering 700+ tasks, collected using a fleet of 13 robots from Everyday Robots (EDR) over 17 months. We demonstrate that RT-1 exhibits significantly improved zero-shot generalization to new tasks, environments, and objects compared to prior techniques. Moreover, we carefully evaluate and ablate many of the design choices in the model and training set, analyzing the effects of tokenization, action representation, and dataset composition. Finally, we're open-sourcing the RT-1 code and hope it will provide a valuable resource for future research on scaling up robot learning.
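
As an illustration of the tokenization idea, the sketch below shows how camera images, a task instruction, and motor commands might all be reduced to discrete tokens that a single Transformer can consume and predict. The tokenizers, vocabulary sizes, and token counts here are illustrative stand-ins, not the released RT-1 code.

```python
# Sketch: turning camera images, a task instruction, and motor commands
# into one stream of discrete tokens for a single Transformer.
# The tokenizers, vocabulary sizes, and token counts are illustrative
# stand-ins, not the released RT-1 code.
import numpy as np

ACTION_BINS = 256            # assumed bins per action dimension
IMAGE_TOKENS_PER_FRAME = 81  # assumed number of visual tokens per camera frame

def tokenize_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a learned image tokenizer (a vision backbone plus token
    learner in practice); here we just emit a fixed number of fake tokens."""
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    return rng.integers(0, 512, size=IMAGE_TOKENS_PER_FRAME)

def tokenize_instruction(text: str) -> np.ndarray:
    """Stand-in for a text tokenizer: map each word to a small integer id."""
    return np.array([hash(word) % 1000 for word in text.lower().split()])

def tokenize_action(action: np.ndarray, low=-1.0, high=1.0) -> np.ndarray:
    """Discretize each action dimension into one of ACTION_BINS bins."""
    scaled = (action - low) / (high - low) * (ACTION_BINS - 1)
    return np.clip(scaled, 0, ACTION_BINS - 1).astype(int)

# One training example: the Transformer reads image and instruction tokens
# and learns to predict the action tokens that follow.
image = np.zeros((300, 300, 3))
instruction = "pick up the apple"
action = np.array([0.1, -0.2, 0.4, 0.0, 0.0, 0.3, 1.0])

input_tokens = np.concatenate([tokenize_image(image), tokenize_instruction(instruction)])
target_tokens = tokenize_action(action)
```

Keeping everything in one discrete vocabulary is also what makes fast inference practical: the controller only has to decode a short, fixed-length sequence of action tokens per step.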

Read more at Google AI Blog