News · 2025-01-09
2025 has just begun. Is AI video generation about to see a technological breakthrough?
This morning, Alibaba's Tongyi Wanxiang video generation model received a major upgrade to version 2.1.
The newly released model comes in two editions, the Tongyi Wanxiang 2.1 Ultra-Speed Edition and the Professional Edition: the former focuses on efficient generation, the latter on top-tier quality.
According to the announcement, the team has comprehensively upgraded the model's overall performance, especially in handling complex motion, reproducing real-world physics, improving cinematic texture, and following instructions, opening a new door for AI-assisted artistic creation.
Let's look at the generated videos and see whether they impress.
Take the classic "cutting a steak" test as an example. The texture of the steak is clearly visible, its surface glistens with a thin layer of fat, the knife slices slowly along the muscle fibers, and the meat springs back elastically, full of detail.
Prompt: In a restaurant, a man is cutting a steaming steak. In a close-up shot, the man holds a sharp knife in his right hand, places it on the steak, and then cuts down the center of the steak. The man is dressed in black, with white nail polish on his hands; the background is blurred, showing a white plate with yellow food on it and a brown table.
Next, a close-up shot: the little girl's facial expressions, hand gestures, and body movements are naturally coordinated, and the wind blowing through her hair also follows realistic motion.
Prompt: A lovely girl stands among the flowers, making a heart shape with her hands as little hearts float around her. She wears a pink dress, her long hair flows in the wind, and she has a sweet smile. The background is a spring garden full of flowers and bright sunshine. HD realistic photography, close-up, soft natural light.
How strong is the model? The numbers offer another view. On the authoritative VBench Leaderboard for video generation, the upgraded Tongyi Wanxiang currently tops the list with an overall score of 84.7%, surpassing domestic and overseas video generation models such as Gen-3, Pika, and CausVid. The competitive landscape of video generation seems to be shifting once again.
Leaderboard link: https://huggingface.co/spaces/Vchitect/VBench_Leaderboard
Starting now, users can try the latest model on the official Tongyi Wanxiang website, and developers can call the model API on Alibaba Cloud.
Website address: https://tongyi.aliyun.com/wanxiang/
Video generation models have been iterating very quickly lately. Has the new version of Tongyi Wanxiang achieved a genuine generational leap in quality? We put it to the test.
First of all, AI-generated videos can finally say goodbye to "ghost scribbles".
Previously, mainstream AI video generation models could not render Chinese or English text accurately; whenever text appeared, it came out as a pile of illegible gibberish. Tongyi Wanxiang 2.1 has now solved this industry-wide problem.
It is the first video generation model capable of rendering Chinese text, and it also supports generating both Chinese and English text effects.
Now, users can simply type a short text description to generate text and animation with cinematic effects.
For example: a cat is typing at a computer, and the words "no work, no food" pop up on screen one after another.
In the video generated by Tongyi Wanxiang, the cat sits at a workstation earnestly tapping the keyboard and clicking the mouse, looking every bit the modern office worker. The pop-up captions and automatically generated music make the whole scene even more humorous.
The word "Synced" pops out of a small orange cube.
Whether the text is Chinese or English, Tongyi Wanxiang renders it without mistakes and without "ghost characters".
Beyond that, it supports fonts in a variety of scenarios, including special-effect fonts, poster fonts, and text displayed within real scenes.
For example, near the Eiffel Tower on the banks of the Seine, fireworks burst into the sky; as the camera zooms in, the pink number "2025" gradually grows larger until it fills the entire frame.
Complex human motion was once the "nightmare" of AI video generation models: in earlier outputs, limbs flailed wildly, the subject's appearance drifted between frames, or movements came out strangely stiff and incomplete.
Through algorithm optimization and data training, Tongyi Wanxiang achieves stable generation of complex motion across a variety of scenarios, especially for large limb movements and accurate body rotation. The breakdancing generated in the clip above is remarkably smooth.
In the generated video below, the man's running motion is smooth and natural, and his left and right legs neither merge together nor distort. The details hold up as well: every time his toes touch the ground, they leave a print and kick up a little fine sand.
Prompt: At sunset, the golden sun shines on the shimmering sea, and a handsome young man runs along the sand, with the camera steadily tracking him.
It can also generate a more difficult skiing video.
A girl in ski gear glides down the snowy slopes of the Alps. She controls her skis deftly, accelerating and turning by turns; her ponytail whips about in the high-speed motion, and the snow she kicks up makes the shot look all the more realistic.
Prompt: A young girl skiing in the Alps
Clearly, the model's grasp of physical laws has also improved significantly; it can simulate realistic footage and avoid results that look "fake at a glance".
The great director Spielberg once said that the secret of a good film lies in the language of the camera. Cinematographers go to extraordinary lengths, scaling heights and squeezing into tight corners, to capture stunning shots.
But in the age of AI, "making" movies is much easier.
We only need to enter a simple text instruction, such as "pan left", "pull back", or "push in", and Tongyi Wanxiang automatically produces a sensible video based on the subject matter and the requested camera movement.
Prompt: A rock band is playing on a front lawn. As the camera pushes in, the frame zooms in on the guitarist, who wears a leather jacket and a mop of long hair that swings to the beat. The guitarist's fingers dance quickly over the strings, while in the background the other band members play with full commitment.
Tongyi Wanxiang 2.1 follows the instructions strictly. The video opens with the guitarist and drummer playing passionately; as the camera slowly pushes in, the background blurs and the frame tightens, highlighting the guitarist's expression and hand movements.
Here is one more video, this time with a pull-back shot.
Close-up on the eyes of a young detective; the camera pulls back to reveal the man standing in a busy street, skyscrapers and motionless cars behind him, as if time has been frozen.
To get impressive results from AI video generation, precise text prompts are essential.
However, a large model's "memory" is limited: faced with instructions that pack in scene changes, character interactions, and complex actions, it easily drops details or muddles the logical order.
The new Tongyi Wanxiang has made great progress in following long text instructions.
Prompt: A motorbike rider speeds through a narrow city street at breakneck pace, avoiding a large explosion from a nearby building as flames roar violently, casting a bright orange glow and sending debris and pieces of metal flying through the air, adding to the chaos of the scene. The rider, in dark gear, hunches over the handlebars with a focused look, racing forward undaunted by the blazing flames behind him. Thick black smoke from the explosion hangs in the air, shrouding the background in apocalyptic chaos. Yet the rider remains unrelenting, navigating through the chaos with unerring accuracy. Highly cinematic, hyper-detailed, immersive, 3D, consistent motion.
In this long description, the narrow street, the bright flames, the thick black smoke, the flying debris, and the rider in dark gear were all captured by Tongyi Wanxiang.
Tongyi Wanxiang also shows stronger concept-combination ability: it can accurately understand a variety of ideas, elements, or styles and combine them into new video content.
An old man in a suit breaks out of an egg, white-haired and wide-eyed, staring at the camera while a rooster crows, and the result is quite funny.
The new version can also generate footage with a cinematic texture, and it supports a range of art styles, such as cartoon, film color grading, 3D, oil painting, and classical styles.
The strange alien ship is rusty and mottled, and an astronaut with an oxygen tank on their back kicks their legs underwater; the whole scene has the feel of a science-fiction movie.
Prompt: Cinematic texture. An astronaut is exploring the underwater wreckage of an alien ship.
Or look at this 3D-animation-style little monster, dancing on a grapevine, very cute.
Prompt: A fluffy, happy little green-grape monster stands on a vine branch singing happily, with the camera rotating counterclockwise.
In addition, it supports multiple aspect ratios, covering 1:1, 3:4, 4:3, 16:9, and 9:16, so it adapts well to different devices such as TVs, computers, and mobile phones.
Judging from this performance, we can already use Tongyi Wanxiang for real creative work and turn inspiration into "reality".
Of course, this progress also stems from Alibaba Cloud's upgrades to its underlying video generation model.
On September 19 last year, Alibaba Cloud released the Tongyi Wanxiang video generation model at its annual cloud conference, bringing film-grade high-definition video generation. As Alibaba Cloud's self-developed visual generation model, it uses a Diffusion + Transformer architecture, supports both image and video generation tasks, and introduces many innovations in model framework, training data, annotation methods, and product design, providing industry-leading visual generation capabilities.
In this upgraded model, the Tongyi team (hereinafter "the team") further developed an efficient VAE and DiT architecture, enhanced spatio-temporal context modeling, and significantly improved generation quality.
Flow Matching is a generative modeling training framework developed in recent years. It trains continuous normalizing flows with a simpler procedure, achieves generation quality comparable to or better than diffusion models, and offers faster inference. As a result it has begun to be adopted for video generation; Meta's earlier video model Movie Gen, for example, uses Flow Matching.
For its training method, Tongyi Wanxiang 2.1 adopts a Flow Matching scheme based on linear noise trajectories and designs the training framework in depth, improving model convergence, generation quality, and efficiency.
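To make the idea concrete, here is a minimal sketch of a Flow Matching training objective with a linear noise trajectory (rectified-flow style), written in PyTorch. The `model` and tensor shapes are placeholders for illustration, not the actual Tongyi Wanxiang implementation.

```python
import torch

def flow_matching_loss(model, x1):
    """Flow Matching loss on a linear path from noise x0 to data x1.

    x1: clean video latents, shape (B, C, T, H, W).
    model(xt, t): predicts the velocity field at the interpolated point xt, time t.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device)        # uniform time in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1               # linear interpolation trajectory
    v_target = x1 - x0                         # constant target velocity along the line
    v_pred = model(xt, t)
    return torch.mean((v_pred - v_target) ** 2)
```

Compared with a diffusion objective, the regression target here is a simple straight-line velocity, which is part of why training and inference are reported to be simpler and faster.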
Tongyi Wanxiang 2.1 video generation architecture diagram
For the video VAE, the team designed a novel video encoding/decoding scheme that combines a caching mechanism with causal convolution. The cache retains the information needed across segments of the video, avoiding redundant computation and improving efficiency, while causal convolution captures the temporal features of the video and adapts to gradual changes in its content.
Concretely, the video is split into several chunks and the intermediate features are cached, instead of decoding a long video end to end in a single pass. GPU memory use therefore depends only on the chunk size rather than the length of the original video, allowing the model to efficiently encode and decode 1080p video of effectively unlimited length. According to the team, this key technique provides a viable path toward training on videos of any length.
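As an illustration of why memory depends only on chunk size, here is a toy sketch of chunk-wise temporal processing with a causal-convolution cache, assuming PyTorch. The layer below is a simplified stand-in for the scheme described above, not Alibaba's actual code.

```python
import torch
import torch.nn as nn

class CausalConv3dCached(nn.Module):
    def __init__(self, channels, kernel_t=3):
        super().__init__()
        self.kernel_t = kernel_t
        self.conv = nn.Conv3d(channels, channels, (kernel_t, 3, 3), padding=(0, 1, 1))
        self.cache = None  # last (kernel_t - 1) frames of the previous chunk

    def forward(self, x):
        # x: (B, C, T_chunk, H, W). Prepend cached frames so the convolution only
        # looks backwards in time (causal), never at frames from future chunks.
        if self.cache is None:
            pad = x[:, :, :1].repeat(1, 1, self.kernel_t - 1, 1, 1)  # replicate first frame
        else:
            pad = self.cache
        x_in = torch.cat([pad, x], dim=2)
        self.cache = x_in[:, :, -(self.kernel_t - 1):].detach()      # carry context forward
        return self.conv(x_in)

layer = CausalConv3dCached(channels=8)
video = torch.randn(1, 8, 32, 64, 64)         # a "long" video of 32 frames
chunks = video.split(8, dim=2)                # process 8 frames at a time
out = torch.cat([layer(c) for c in chunks], dim=2)
print(out.shape)  # (1, 8, 32, 64, 64): full length, but peak memory set by the chunk size
```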
The figure below compares different VAE models. Measured by computational efficiency (frames per unit of latency) and compression-reconstruction quality (peak signal-to-noise ratio, PSNR), the VAE used by Tongyi Wanxiang achieves industry-leading reconstruction quality without any advantage in parameter count.
Note: the area of each circle represents the model's parameter count.
The team's core design goal for the DiT (Diffusion Transformer) was to achieve powerful spatio-temporal modeling while keeping training efficient, which required several innovations.
First, to improve the modeling of spatio-temporal relationships, the team adopted full spatio-temporal attention, allowing the model to simulate the complex dynamics of the real world more accurately. Second, a parameter-sharing mechanism reduces training cost while improving performance. In addition, the team optimized text conditioning, using cross-attention to inject text features, which yields better text controllability at lower computational cost.
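A minimal sketch of what such a block could look like is shown below: full self-attention over the flattened spatio-temporal token sequence, plus cross-attention to text embeddings for conditioning. The dimensions, block layout, and text encoder are assumptions for illustration, not the actual Tongyi Wanxiang architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalDiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text):
        # x: (B, T*H*W, dim) video tokens flattened over time and space, so
        # self-attention spans the full spatio-temporal sequence at once.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]  # text conditioning
        return x + self.mlp(self.norm3(x))

block = SpatioTemporalDiTBlock()
tokens = torch.randn(2, 4 * 16 * 16, 512)   # 4 frames of 16x16 latent patches
text = torch.randn(2, 77, 512)              # projected text embeddings
print(block(tokens, text).shape)            # (2, 1024, 512)
```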
Thanks to these improvements, the DiT structure of Tongyi Wanxiang converges noticeably better at the same computational cost.
Beyond the model architecture, the team also optimized training and inference on ultra-long sequences, the data construction pipeline, and model evaluation, so the model can handle complex generation tasks efficiently.
When processing very long visual sequences, large models face many challenges in compute, memory, training stability, and inference latency, so efficient solutions are essential.
To this end, the team combined the characteristics of the new model's workload with the hardware profile of the training cluster and developed a distributed, GPU-memory-optimized training strategy that improves training performance while keeping iteration time in check, ultimately reaching industry-leading MFU and enabling efficient training on ultra-long sequences at the million-token scale.
On one hand, the team devised a hybrid 4D parallel training strategy combining DP, FSDP, RingAttention, and Ulysses, improving both training performance and distributed scalability. On the other hand, for GPU memory optimization, it adopted a layered memory-optimization strategy that tailors activation memory usage to the compute and communication volume implied by the sequence length and resolves memory fragmentation.
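As a purely conceptual illustration of such a 4D layout (not Alibaba Cloud's training code), the sketch below maps a flat worker rank onto coordinates along hypothetical DP, FSDP, RingAttention context-parallel, and Ulysses sequence-parallel dimensions; the group sizes are made up.

```python
# Hypothetical 4D parallel mesh: dp x fsdp x ring_cp x ulysses_sp = 64 workers.
DIMS = [("dp", 2), ("fsdp", 4), ("ring_cp", 4), ("ulysses_sp", 2)]

def rank_to_coords(rank: int) -> dict:
    """Unflatten a global rank into its coordinate along each parallel axis."""
    coords = {}
    for name, size in reversed(DIMS):   # innermost (fastest-varying) axis last
        coords[name] = rank % size
        rank //= size
    return dict(reversed(list(coords.items())))

world_size = 1
for _, size in DIMS:
    world_size *= size
print("world size:", world_size)        # 64

# Workers sharing all coordinates except `ulysses_sp` form one Ulysses group,
# those sharing all but `ring_cp` form one RingAttention group, and so on.
for rank in range(4):
    print(rank, rank_to_coords(rank))
```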
Computational optimization also improves training efficiency and saves resources. The team therefore uses FlashAttention3 for the spatio-temporal full-attention computation and chooses an appropriate context-parallel (CP) partitioning strategy according to the compute characteristics of training clusters of different sizes. It also removed redundancy from several key modules and adopted efficient kernel implementations to cut memory-access cost and improve compute efficiency. On the file-system side, the team made full use of the read/write characteristics of the high-performance file system in the Alibaba Cloud training cluster, improving I/O performance through a sharded save/load scheme.
4D parallel distributed training strategy
Meanwhile, to avoid the out-of-memory (OOM) problems caused by dataloader prefetching, CPU offloading, and checkpoint saving during training, the team adopted a scheme that staggers memory-usage peaks. To keep training stable, it also leveraged the intelligent scheduling, slow-machine detection, and self-healing capabilities of the Alibaba Cloud training cluster to automatically identify faulty nodes and quickly restart tasks.
Large-scale, high-quality data and effective model evaluation are both indispensable for training a video generation model. The former ensures that the model learns diverse scenes and complex spatio-temporal dependencies and generalizes well, forming the cornerstone of training; the latter monitors model performance so that it reaches the intended quality, acting as the compass of training.
For data construction, the team built an automated pipeline with high quality as its guiding criterion. The pipeline aligns closely with human preference distributions for visual and motion quality, automatically producing video data that is high quality, highly diverse, and evenly distributed.
For model evaluation, the team designed a comprehensive automated scoring mechanism covering more than 20 dimensions, including aesthetic scoring, motion analysis, and instruction compliance, and trained dedicated scoring models aligned with the preferences of professional human raters. Effective feedback from these metrics significantly accelerates model iteration and optimization.
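As a toy illustration of how per-dimension automated scores might be rolled up into a single number, consider the snippet below; the dimension names and weights are invented, and a real system would fit them to match professional human raters.

```python
# Aggregate multi-dimensional automated scores into one overall evaluation number.
scores = {"aesthetics": 0.82, "motion_quality": 0.76, "instruction_following": 0.91}
weights = {"aesthetics": 0.3, "motion_quality": 0.3, "instruction_following": 0.4}

overall = sum(weights[k] * scores[k] for k in scores)
print(f"overall score: {overall:.3f}")   # 0.838
```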
It is fair to say that coordinated innovation across architecture, training, and evaluation has given the upgraded Tongyi video generation model a clear generational improvement in actual use.
Since the launch of OpenAI's Sora in February last year, video generation has become the most fiercely contested area in tech. From China to overseas, from startups to large technology companies, everyone is launching their own video generation tools. Yet compared with text generation, producing AI video that people find acceptable is a problem at least an order of magnitude harder.
If, as OpenAI CEO Sam Altman has suggested, Sora represents the GPT-1 moment for video generation models, then building on that foundation with precise control via text instructions, adjustment of camera angle and position, guaranteed character consistency, and distinctive features such as rapid style and scene changes may soon bring a new "GPT-3 moment".
From the perspective of the technology's development path, video generation is another test of scaling laws. As base-model capabilities improve, AI will understand human instructions better and better and create increasingly realistic and plausible environments.
From a practical standpoint, people are not waiting: since last year, creators in short video, animation, and even film and television have begun exploring video generation AI in their work. If video generation AI lets us break through real-world constraints and do things that were previously unimaginable, a new round of industry change is within sight.
For now, Tongyi Wanxiang appears to have taken the first step.