Although AI chatbots powered by large language models (LLMs) are dominating the headlines these days thanks to the meteoric rise in popularity of ChatGPT, Bing Chat, Meta's Llama, and Google Bard, they are only a small part of the AI landscape. Another area that has been actively explored for years is robotic hardware that uses complex techniques to either replace or assist humans. Google has now announced an advancement in this domain in the form of a new AI model.

Google has unveiled Robotics Transformer 2 (RT-2), its latest AI model with a very specific purpose: communicating your desired action to a robot. To achieve this, it relies on a vision-language-action (VLA) model, which Google claims is the first of its kind. Although previous models such as RT-1 and PaLM-E have improved robots' reasoning abilities and enabled them to learn from each other, the robot-dominated world showcased in science-fiction movies arguably still seems like something from an extremely distant future.

RT-2 aims to narrow this gap between fiction and reality by making sure that robots understand the world around them with minimal or no support. In principle, it is very similar to LLMs in that it uses a Transformer-based model to learn about the world from textual and visual information available on the web and then translates that knowledge into robotic actions, even in test cases on which it has not been explicitly trained.
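To make the idea of "translating into robotic actions" more concrete, here is a minimal sketch of one way a language-style model could express a robot command as text: each continuous action value is rounded into an integer bin, and the bins are written out as a short string that a text model could emit. This is an illustration only, not Google's implementation; the function names, the 256-bin resolution, and the [-1, 1] value range are all assumptions made for the example.

```python
from typing import List, Sequence

# Assumed resolution: number of discrete bins per action dimension (hypothetical).
NUM_BINS = 256


def encode_action(values: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Turn continuous action values into a space-separated string of integer bins."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)           # keep the value inside [low, high]
        frac = (clipped - low) / (high - low)      # rescale to [0, 1]
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return " ".join(str(t) for t in tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Turn a string of integer bins back into values at the centre of each bin."""
    width = (high - low) / NUM_BINS
    return [low + (int(tok) + 0.5) * width for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-value command, e.g. a small movement along x, y, and z.
    as_text = encode_action([-1.0, 0.0, 1.0])
    print(as_text)
    print(decode_action(as_text))
```

Because the command is just a string of numbers, a model that already reads and writes text can, in principle, produce it the same way it produces words. The price is a small rounding error, which is at most half a bin width for values inside the assumed range.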

Google has described several use cases to illustrate the capabilities of RT-2. For example, if you ask an RT-2-powered robot to throw trash in the bin, it would be able to understand what trash is, how to differentiate it from other objects in the environment, how to move and pick it up, and how to dispose of it in the bin, all without being specifically trained on any of these activities.

Google has also shared some rather impressive results from its testing of RT-2. In more than 6,000 trials, RT-2 proved to be as adept as its predecessor in "seen" tasks. More interestingly, in unseen scenarios, it scored 62% compared to RT-1's 32%, a nearly twofold increase in performance. While the applications of such a technology already seem very tangible, it will take significant time to mature, as real-world use cases understandably require rigorous testing and, at times, even regulatory approval. For now, you can read more about RT-2's underlying mechanism on Google DeepMind's blog.