Qianxun Intelligence's Gao Yang: The RobotGPT-1 stage has arrived; RobotGPT-3.5 is 4 years away | MEET 2025

#News ·2025-01-06

My definition of embodied intelligence is very simple: robots that help humans do all kinds of things, such as caring for our grandparents in their old age.

……

It doesn't make sense to define L1-L5 levels for embodied intelligence right now; the core criterion is what practical problems our robots can solve.

Embodied intelligence is a hot trend, yet few people dare to make firm assertions about it. Gao Yang is one of them.


He is an assistant professor at the Institute for Interdisciplinary Information Sciences at Tsinghua University. He received his PhD from the University of California, Berkeley, where he also completed postdoctoral research with Pieter Abbeel and others. He currently leads the Embodied Vision and Robotics Lab (EVAR Lab), which focuses on empowering robots with artificial intelligence and on building a general framework for embodied intelligence; the ViLa algorithm he proposed has been adopted by Figure AI.

In 2024, Gao Yang co-founded Qianxun Intelligence, an embodied intelligence company. Dubbed the Chinese counterpart of Figure 01, the company completed three rounds of financing within a year, including a combined 200 million yuan across its seed and angel rounds.

At the MEET 2025 Intelligent Future Conference, Qubits invited Dr. Gao Yang for an in-depth discussion of the current state and future of embodied intelligence, covering model architecture, data, and industry deployment.

Core ideas

  • Embodied intelligence means robots that can do all kinds of things for us.
  • The maturity of AI and of robot manufacturing has given rise to the embodied intelligence industry.
  • Embodied intelligence should reduce its dependence on manually collected data.
  • It doesn't make sense to define L1-L5 levels for embodied intelligence right now; it will stay at L2.99 for a long time.
  • We have reached the RobotGPT-1.0 stage and the underlying principles are established; in 4 years we can reach the RobotGPT-3.5 stage.
  • The hope is that 10 percent of the world's population will have their own robot in 10 years.

(To better present Gao Yang's views, Qubits has edited the conversation below without changing the original meaning.)

In 10 years, 10% of humanity will have their own robot

Qubits: How do you define embodied intelligence?

Gao Yang: I think this is a very intuitive question.

Once, when I was giving a talk on embodied intelligence, an elderly woman, probably in her 60s or 70s, listened for a long time and then asked me when a robot would be able to take care of her in her old age.

In fact, this is embodied intelligence.

Embodied intelligence means building robots that can do all sorts of things for us, at home for example, like caring for our grandparents in their old age.

So I founded Qianxun Intelligence, and one of my biggest dreams is that in 10 years, 10% of the people in the world can have their own robots.

What such a robot can do is also pretty intuitive. For example, when I get home late at night, I may want a late-night snack, but I don't want to wash the dishes afterward. Over the weekend, a lot of things at home end up out of place, and I want a robot to help me put them back...

This is what embodied intelligence means: physical robots that can help us do all kinds of things we don't want to do, or are too lazy to do ourselves. That is my understanding of embodied intelligence.

Qubits: The concept of embodied intelligence traces back to Alan Turing and was first conceived more than half a century ago. Yet this year is being called the first year of embodied intelligence, the year it came of age. What technological or other changes in the industry made you feel that embodied intelligence has matured, and led you to start your own company?

Gao Yang: The key variable is that OpenAI has proved that pre-training plus a series of post-training methods can really produce something that at least looks like human intelligence, or approaches the appearance of human intelligence. I think that is the core variable behind starting an embodied intelligence company now.

As you just said, earlier robots ran on handwritten, hard-coded rules, which made them adapt poorly to their environment. I wasn't particularly familiar with robot hardware before, but when I actually looked into it, I was surprised by how few industrial robots are sold each year: global annual sales are only about 2 million units.

That is a tiny number compared to cars and mobile phones. The core constraint is that robots are very difficult to use: they are specialized equipment, and you need deep technical expertise to operate them.

So I see two changes: first, intelligence technology is making robots more and more useful; second, robot manufacturing has advanced to the point where we can build sub-millimeter-accurate robots at a very low price.

The maturation of these two aspects has given rise to the embodied intelligence industry. Of course, the industry is still at a very early stage, and I often say this is actually very hard. People sometimes say that building embodied intelligence is like creating silicon-based life, and that once it is achieved, humans, as the bootloader of carbon-based life, will have basically completed their task. So I see this as a very long-term endeavor, and for myself at least, I am treating it as a lifelong career.

Data is still the key to the development of embodied intelligence

Qubits: What do you see as the core advances in embodied intelligence over the past year, and which directions are worth focusing on in 2025?

Gao Yang: Beyond the VLA (vision-language-action) models we just talked about, I think the big breakthroughs in embodied intelligence over the past year also include new ways of pre-training these models. With the current approach (including Phi), it takes 10,000 hours of data to train a model to show some real capability.

If we look at today's most impressive large models, such as ChatGPT, Stable Diffusion, and video generation models like Sora, their training data is on the order of 100T tokens or tens of billions of image-text pairs.

The manipulation data we collect by hand today is far smaller than that. So as embodied intelligence develops, figuring out how to make more use of Internet data for pre-training is very important.
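As a rough illustration of the scale gap described above, here is a minimal back-of-the-envelope sketch. The 30 Hz sampling rate is an assumption added for illustration; the 10,000-hour and 100T-token figures come from the interview.

```python
# Back-of-the-envelope comparison of the data scales discussed above.
# Assumption: robot data is sampled at 30 Hz (illustrative only).

robot_hours = 10_000                       # hours of manually collected robot data
control_hz = 30                            # assumed control/sampling rate
robot_steps = robot_hours * 3600 * control_hz   # ~1.1e9 state-action steps

llm_tokens = 100e12                        # ~100T tokens for an LLM like ChatGPT

print(f"robot steps: {robot_steps:.2e}")   # ~1.08e+09
print(f"LLM tokens:  {llm_tokens:.2e}")    # 1.00e+14
print(f"gap:         ~{llm_tokens / robot_steps:.0f}x")  # ~93000x, roughly 5 orders of magnitude
```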

On this front, VLA models are actually relatively weak: their pre-training data contains only images and text. There are many novel ideas in academia for addressing this, and I expect it to keep developing and remain very important over the next three or four years.

[Image: Gao Yang's team proposed the ViLa algorithm]

To give a concrete example, I think Google's RT-Trajectory is a fairly representative piece of work. Its starting point is that if you train only on manually collected imitation-learning data, the amount of data will never be enough.

It takes a new approach: an intermediate representation encodes the robot's approximate trajectory, the robot roughly follows that trajectory, and the fine-grained details are generated directly by the low-level policy.
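To make the idea concrete, here is a minimal conceptual sketch of a trajectory-conditioned policy in this spirit. It is not the actual RT-Trajectory implementation; all names, shapes, and the straight-line placeholder trajectory are assumptions for illustration.

```python
# Conceptual sketch of a trajectory-conditioned policy (illustrative only).
import numpy as np

def sketch_coarse_trajectory(image: np.ndarray, instruction: str) -> np.ndarray:
    """High-level step: produce a rough 2D trace of where the end effector
    should go, e.g. from a human sketch, a prior demo, or a generative model.
    Here we just return a straight line of waypoints in image coordinates."""
    return np.linspace([16, 16], [200, 160], num=8)

def low_level_policy(image: np.ndarray, proprio: np.ndarray,
                     coarse_traj: np.ndarray) -> np.ndarray:
    """Low-level step: given the current observation plus the coarse
    trajectory hint, output a fine-grained action (say, a 7-DoF delta).
    The robot only roughly follows the hint; the details are filled in here."""
    return np.zeros(7)  # placeholder for a learned policy's output

# Control loop: the coarse hint is produced once (or rarely), while the
# low-level policy is queried at every control step.
image = np.zeros((224, 224, 3))
proprio = np.zeros(8)
hint = sketch_coarse_trajectory(image, "put the cup on the shelf")
for _ in range(10):
    action = low_level_policy(image, proprio, hint)
```

The design point is that the coarse hint can come from cheap sources such as sketches or existing demonstrations, so far less teleoperated data is needed to cover new tasks.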


There are many papers along these lines, including quite a bit of work from my own research group. My own work uses the future particle motion of objects as a kind of intermediate-layer representation.

I find this line of work very exciting. In the past you could also collect data and train with imitation learning, but now that this wave of large models has arrived, we need much more data than that.

This year's new research has, in effect, also pointed the way for future development.

VLA itself is a very good paradigm and one of the cores of the future. But beyond VLA, I am seeing more and more work on reducing the reliance on human data collection, which is also a very exciting development this year.

It doesn't make sense to define L1-L5 levels for embodied intelligence right now

Qubits: If we were to formulate a standard for robotic embodied intelligence, what would that standard look like?

Gao Yang: The purpose of setting a standard is to promote the development of an industry and measure the level of each company's technology.

But I think that for a long time to come, no matter what the standard is, most embodied intelligence will, because of objective technical limitations, only reach or claim to reach L2.99, or reach L4 only in limited scenarios.

So such a standard may end up serving mainly as marketing, because we cannot achieve L4 or L5 across a wide range of scenarios within a limited time.

So the real standard still comes down to whether we can meet customers' needs, and those can be stated very concretely.

For example, if we want to serve certain factory, commercial, or home scenarios, can our robots actually do the job? And when serving those scenarios, what is the downtime rate?

These are indicators that I think are more specific and more achievable.


Even now, I don't think it makes much sense to define an L1-L5 scale for embodied intelligence.

The key is whether the embodied intelligent brain can solve specific problems, such as delivering food or installing parts in a factory. That is what we need to explore and pursue.

We still have to wait for robots to graduate from college

Qubits: So, what stage of embodied intelligence are we at now?

Gao Yang: We have just witnessed GPT's evolution from 1.0 to 3.5, then 4.0 and o1. When GPT-1 first came out, nobody thought much of it: it couldn't speak fluently, had no reasoning ability, and struggled to communicate with people.

However, by the time GPT-1 was born, the principles of large language model technology had basically been established.

I think we are now at the RobotGPT-1.0 stage. Because the basic principles have been settled, the next few years may look as though the technology is still at a low level and not making much progress. But the development of intelligence follows an exponential upward curve, so I personally believe the embodied intelligent brain will reach the RobotGPT-3.5 stage in about 4 years. It may not be all that advanced yet, but you will already see a lot of surprising abilities.

I think we're a while away from that day, but it's not that far.

Qubits: So we wait for RobotGPT to get into university.

Gao Yang: Yes. It has just entered university and can't do much yet. For it to graduate and truly enter everyone's home will take about 10 more years from now.

GPT-4 can already answer a lot of questions, but it is still unreliable perhaps 10% of the time, so we need to keep improving language models before they can truly permeate every aspect of human work and life.

I think it is the same for robot models. Once we reach 3.5, the model may not be very robust and the cost may still be a bit high, so we need to keep improving the technology. That is why I say 10% of people will have their own robots in 10 years.
