A new level of embodied intelligence! Zhiyuan Robot launches EnerVerse, the world's first 4D world model

#News ·2025-01-09

How to make robots plan future actions from task instructions and real-time observations has long been a core scientific problem in embodied intelligence. However, achieving this goal is constrained by two key challenges:

  1. Modal alignment: precise alignment mechanisms must be established across multimodal spaces such as language, vision, and action.
  2. Data scarcity: large-scale multimodal datasets with action labels are lacking.

To solve these problems, the Zhiyuan Robot team proposed the EnerVerse architecture, which uses an autoregressive diffusion model to generate the future embodied space while guiding the robot through complex tasks. Unlike existing methods that simply apply a video generation model, EnerVerse is designed around the requirements of embodied tasks and innovatively introduces a Sparse Memory mechanism and Free Anchor Views (FAV), improving both 4D generation capability and motion planning performance. Experimental results show that EnerVerse not only generates future space well but also achieves state-of-the-art (SOTA) performance on robot motion planning tasks.

The project homepage and paper are online, and the model and related datasets will be open-sourced soon:



  • Project homepage: https://sites.google.com/view/enerverse/home
  • Paper: https://arxiv.org/abs/2501.01895


How can future space generation empower robot motion planning?

The core of robot motion planning is to predict and execute a series of complex future operations based on real-time observations and task instructions. However, existing methods face the following limitations on complex embodied tasks:

  • Limitations of general video generation models: current general-purpose video generation models lack targeted optimization for embodied scenes and cannot meet the special requirements of embodied tasks.
  • Limited generalization of visual memory: existing methods rely on dense, continuous visual memory, which easily leads to logical incoherence when generating long-horizon task sequences and degrades action-prediction performance.

To address this, EnerVerse tackles these bottlenecks with a chunk-wise autoregressive diffusion framework, combined with an innovative sparse memory mechanism and the Free Anchor View (FAV) approach.

Technical solution analysis

Chunk-by-chunk diffusion generation: Next-Chunk Diffusion

EnerVerse uses an autoregressive diffusion model that generates the future embodied space chunk by chunk to guide robot motion planning. Key designs include (a minimal sketch of the generation loop follows the list):

  • Diffusion model architecture: built on a UNet backbone with spatiotemporal attention. Within each space chunk, frames are modeled with convolution and bidirectional attention; across chunks, temporal consistency is maintained through a one-way causal dependency, keeping the generated sequence logically coherent.
  • Sparse memory mechanism: inspired by the context memory of large language models (LLMs), EnerVerse applies a high-ratio random mask to history frames during training and updates the memory queue only at large intervals during inference, effectively cutting computational overhead and significantly improving generation on long-horizon tasks.
  • Task termination logic: a special end-of-sequence (EOS) frame provides explicit supervision of when the task ends, so generation terminates at the right point.
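
Conceptually, the rollout is a short loop: denoise the next chunk conditioned on a small memory queue, push frames into that queue only at large intervals instead of keeping the dense history, and stop once an EOS frame appears. The sketch below illustrates just this control flow; the function bodies, chunk size, memory stride, and EOS test are placeholder assumptions, not the released EnerVerse implementation.

```python
# Minimal sketch of chunk-wise autoregressive generation with a sparse
# memory queue and an end-of-sequence (EOS) check. All names and numbers
# are illustrative placeholders.
from collections import deque
import numpy as np

CHUNK_LEN = 8          # frames generated per chunk (assumed)
MEMORY_SIZE = 4        # sparse memory keeps only a few context frames
MEMORY_STRIDE = 16     # memory is refreshed at large intervals

def denoise_chunk(memory_frames, rng):
    """Stand-in for the diffusion UNet: returns CHUNK_LEN fake frames."""
    return [rng.standard_normal((64, 64, 3)) for _ in range(CHUNK_LEN)]

def is_eos(chunk):
    """Stand-in for the EOS-frame detector that terminates generation."""
    return np.mean(chunk[-1]) > 3.0   # placeholder criterion

def generate_rollout(max_chunks=32, seed=0):
    rng = np.random.default_rng(seed)
    memory = deque(maxlen=MEMORY_SIZE)            # sparse memory queue
    frames_seen = 0
    rollout = []
    for _ in range(max_chunks):
        chunk = denoise_chunk(list(memory), rng)  # condition on sparse memory
        rollout.extend(chunk)
        # Update the memory queue only every MEMORY_STRIDE frames,
        # instead of storing the dense frame history.
        for frame in chunk:
            if frames_seen % MEMORY_STRIDE == 0:
                memory.append(frame)
            frames_seen += 1
        if is_eos(chunk):                         # stop at the EOS frame
            break
    return rollout

print(len(generate_rollout()), "frames generated")
```

The point of the design is that the memory footprint is bounded by the queue size and stride rather than by rollout length, which is what keeps long-horizon generation tractable.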


Flexible 4D generation: Free Anchor View (FAV)

To handle the heavy occlusion and multi-view requirements of embodied manipulation, EnerVerse proposes the Free Anchor View (FAV) method to flexibly represent 4D space. Its core advantages include (a sketch of the ray-direction conditioning follows the list):

  • Freely placed anchor views: FAV supports dynamic adjustment of anchor viewpoints, overcoming the limitations of fixed multi-camera views in narrow scenes. In a kitchen, for example, FAV easily adapts to dynamic occlusion relationships.
  • Cross-view spatial consistency: based on the ray-casting principle, EnerVerse uses a ray direction map as the view-control condition and extends 2D spatial attention to cross-view 3D spatial attention, ensuring geometric consistency of the generated video.
  • Sim2Real adaptation: by alternately iterating a 4D generation model (EnerVerse-D) trained on simulation data with 4D Gaussian Splatting reconstruction, EnerVerse builds a data flywheel that provides pseudo-ground-truth supervision for FAV generation in real-world scenes.
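
To make the ray-casting idea concrete, a ray direction map can be computed per pixel from the camera intrinsics and pose and then supplied to the generator as the view-control condition for each anchor view. The sketch below, with made-up camera parameters, shows one plausible way to build such a map; EnerVerse's actual conditioning interface may differ.

```python
# Illustrative computation of a ray-direction map for one Free Anchor View.
# The intrinsics/extrinsics values are made up for the example.
import numpy as np

def ray_direction_map(K, cam_to_world, height, width):
    """Return an (H, W, 3) map of unit ray directions in world coordinates."""
    # Pixel grid at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    # Back-project pixels to camera-space directions.
    dirs_cam = pixels @ np.linalg.inv(K).T                   # (H, W, 3)
    # Rotate into world space with the camera-to-world rotation.
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

# Example: a simple pinhole camera at the origin looking down +z.
K = np.array([[128.0, 0.0, 64.0],
              [0.0, 128.0, 64.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
rays = ray_direction_map(K, pose, height=128, width=128)
print(rays.shape)   # (128, 128, 3)
```

Describing a view purely by its rays is what allows anchors to be placed freely: moving an anchor only changes this conditioning map, not the generator itself.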


Efficient action planning: Diffusion Policy Head

EnerVerse links future space generation and robot action planning end to end by attaching a Diffusion Policy Head downstream of the generation network. Key designs include (a sketch of the policy head follows the list):

  • Efficient action prediction: the network can output the future action sequence from the first step of reverse diffusion, without waiting for the full space-generation process, keeping action prediction real-time.
  • Sparse memory support: during action-prediction inference, the sparse memory queue stores real or reconstructed FAV observations, effectively improving long-horizon task planning.
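
A minimal sketch of such a head is given below, under the assumption that noisy action chunks cross-attend to feature tokens taken from the generator's first reverse-diffusion step and are denoised with a simple DDIM-style loop. The dimensions, noise schedule, and feature interface are illustrative placeholders, not the released EnerVerse code.

```python
# Sketch of a diffusion-style policy head conditioned, via cross-attention,
# on features from the video generator's first reverse-diffusion step.
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action chunk, given generator features."""
    def __init__(self, feat_dim=512, action_dim=7, horizon=16, steps=50):
        super().__init__()
        self.steps = steps
        self.embed_t = nn.Embedding(steps, feat_dim)
        self.embed_a = nn.Linear(action_dim, feat_dim)
        self.cross_attn = nn.MultiheadAttention(feat_dim, 8, batch_first=True)
        self.out = nn.Linear(feat_dim, action_dim)
        # Simple linear beta schedule for the toy sampler below.
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("alphas_cum", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, noisy_actions, t, cond):
        h = self.embed_a(noisy_actions) + self.embed_t(t)[:, None, :]
        h, _ = self.cross_attn(h, cond, cond)   # attend to generator features
        return self.out(h)                      # predicted noise

    @torch.no_grad()
    def sample(self, cond, action_dim=7, horizon=16):
        """Deterministic DDIM-style sampling of an action chunk."""
        b = cond.shape[0]
        actions = torch.randn(b, horizon, action_dim)
        for step in reversed(range(self.steps)):
            t = torch.full((b,), step, dtype=torch.long)
            eps = self(actions, t, cond)
            a_t = self.alphas_cum[step]
            a_prev = self.alphas_cum[step - 1] if step > 0 else torch.tensor(1.0)
            x0 = (actions - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean actions
            actions = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update, eta=0
        return actions

head = ActionDenoiser()
feats = torch.randn(2, 256, 512)      # hypothetical generator feature tokens
print(head.sample(feats).shape)       # torch.Size([2, 16, 7])
```

Since the conditioning features are available after a single reverse step of the video model, action sampling does not have to wait for the full future-space rollout.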

Experimental results

1. Video generation performance

EnerVerse delivers excellent performance on both short- and long-horizon task video generation:

  • On short-horizon generation tasks, EnerVerse outperforms existing fine-tuned video generation models, such as diffusion models based on DynamiCrafter and FreeNoise.
  • On long-horizon generation tasks, EnerVerse shows stronger logical consistency and sustained generation capability, which existing models cannot match.


In addition, the quality of the multi-view videos generated by EnerVerse has been thoroughly validated in LIBERO simulation scenes and AgiBot World real-world scenes.



2. Motion planning performance

On the LIBERO benchmark, EnerVerse achieves clear advantages on robot motion planning tasks:

  • Single-view (one FAV) setting: EnerVerse's average success rate exceeds existing methods on all four LIBERO task suites.
  • Multi-view (three FAVs) setting: task success rates improve further, exceeding the current best approach on every task category.


Notably, every task in LIBERO-Long requires the robot to perform multiple steps.

3. Ablation and training-strategy analysis

Sparse memory mechanism: ablation experiments show that sparse memory is crucial both for the logical coherence of long-horizon sequence generation and for the accuracy of long-horizon action prediction.


Two-stage training strategy: training future space generation first and action prediction second significantly improves action planning performance.


4. Attention visualization

Visualizing the cross-attention module in the Diffusion Policy Head shows that the future space generated by EnerVerse is strongly temporally consistent with the predicted action space, intuitively demonstrating the link between future space generation and motion planning and EnerVerse's advantage on both.


With the EnerVerse architecture, Zhiyuan Robot has opened a new direction for embodied intelligence. By guiding motion planning with future space generation, EnerVerse not only breaks through a technical bottleneck in robot task planning but also offers a new paradigm for research on multimodal, long-horizon tasks.

About the authors

The core research members behind EnerVerse come from the embodied-algorithm team of the Zhiyuan Robotics Research Institute. Huang Siyuan is a joint Ph.D. student of Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory, advised by Professor Li Hongsheng of CUHK MMLab. His doctoral research focuses on embodied intelligence and efficient agents built on multimodal large models, and he has published multiple papers as first or co-first author at top conferences such as CoRL, MM, IROS, and ECCV. Co-author Chen Liliang is an embodied-algorithm expert at Zhiyuan Robot, mainly responsible for research on embodied spatial intelligence and world models.
