From today, AI-generated videos can render Chinese characters! Even the "Preface to the Pavilion of Prince Teng" is handled with ease

#News ·2025-01-09

The challenge of rendering Chinese characters was finally conquered by AI video generation today!

Without further ado, let's look at the results first:

The character for "blessing" is written out stroke by stroke by the AI.

In the next example, our prompt is:

The Chinese characters for "qubit", in an ancient style.

But that is still a little plain, so let's raise the difficulty:

A cyberpunk-style city night scene, shot from the viewpoint of a vehicle driving down the road. The building opposite has a huge LED billboard displaying the three characters for "qubit".

All right, now it has the feel of a cyberpunk advertisement.

So can the AI still hold up with more characters?

Let's go straight to the challenge:

Watercolor illustration style: three cute cats of different colors carry an oversized fish, walking from right to left. They wear pink, blue and yellow vests and have round eyes and adorable expressions. The scene is full of childlike charm, with elegant, warm brushwork in a simple line-drawing style. On a pure white background, a line of text gradually appears, reading: "A day of touching fish (slang for slacking off) is boundless happiness."

As you can see, although the video has a small flaw (the character for "touch" has an extra stroke), the overall content follows the prompt faithfully.

Of course, if it can handle complex Chinese characters, this AI can naturally generate English words too, and with fancy effects at that (Chinese versions follow below):

So, which AI is this exactly?

No more suspense: it is the video generation model that Alibaba's Tongyi Wanxiang has just upgraded, released in two new versions:

Tongyi Wanxiang 2.1 Speed edition: generates video efficiently and quickly;
Tongyi Wanxiang 2.1 Professional edition: puts more emphasis on the quality of the generated video.

After trying it end to end, we can clearly feel that the model's overall performance has improved greatly, especially in handling complex motion, reproducing real-world physics, raising cinematic quality, and following instructions more closely.

Reportedly, the new Tongyi Wanxiang has taken first place on the authoritative VBench benchmark with a score of 84.70%, surpassing Gen3, Pika, CausVid and other video generation models from China and abroad.

However, generating Chinese characters is only one corner of Tongyi Wanxiang's upgraded capabilities.

Next, let's look at more of what it can do in video generation.

It even understands the "Preface to the Pavilion of Prince Teng"

It is worth mentioning that this newly upgraded model is not just a slide-deck demo; it is already live.

Everyone can now try it online for free; the entry point and model selection are shown below:

If you are a developer or an enterprise, you can also call the API on Alibaba Cloud Bailian to build your own application.

Most earlier AI video generators, when handling complex character movements, would often break down into bizarre, glitchy motion.

Let's go straight to a difficult, highly complex movement: breaking (breakdancing).

Here is the prompt:

Indoors, a wide shot follows a foreign man breakdancing, wearing a gray jacket and green pants. The camera moves as the man performs a series of tumbling and spinning moves on the stage. The audience and some dim stage lights can be seen in the background, but the focus always stays on the dancer's movements.

As you can see, the glitchy artifacts are gone in this generated video: even with so many complex movements, the character remains stable.

Now let's look at diving:

Details such as the instep are also rendered very precisely.

Beyond staying stable through continuous complex movements, faithfully reproducing real-world physics is another key criterion for judging AI video generation.

Let's put it to the test with a line from the "Preface to the Pavilion of Prince Teng":

The sunset clouds fly together with the lone duck; the autumn water shares one color with the vast sky.

It is not hard to see that the new version of Tongyi Wanxiang captures the artistic mood of this line very well.

And with an action like slicing meat, its reproduction of physical laws is even more evident:

The natural separation of the slices as the meat is cut, the reflection on the blade, the grease beneath the meat... every detail is in place.

On top of realism, if you want to create higher-quality video with AI, camera movement is an indispensable skill.

Tongyi Wanxiang handles this with ease too.

For example, in a scene of a fox spirit dancing at a disco, camera movement adds a great deal to the atmosphere:

For a cinematic shot of a sports car racing through a valley, it can also add complex camera movement that follows the car's trajectory:

In addition, Tongyi Wanxiang can handle a wide variety of styles, with a distinctly cinematic feel.

For example, a realistic medieval style:

Another example is cartoon animation:

The size of the generated video can also be selected:

So the next question is:

How is it done?

Overall, Tongyi Wanxiang took a three-part approach to technical innovation this time.

First, the VAE and DiT architectures work together.

Video VAE can be regarded as a "compression master", good at efficiently compressing massive information in video and extracting the most critical features.

Instead of encoding and decoding a long video end to end in the traditional way, it splits the video into several chunks and caches intermediate features.

The key to this design is that GPU memory usage depends only on the chunk size, not on the length of the original video, which enables efficient encoding and decoding of 1080P video of unbounded length.

This mechanism also makes it possible to train on videos of any length. Experiments show that Tongyi's video VAE achieves industry-leading video compression and reconstruction quality with a small number of model parameters.
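The chunk-and-cache idea can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses a one-dimensional causal temporal filter as a stand-in for the VAE's temporal layers, and the function names `encode_full` and `encode_chunked` are hypothetical, not taken from Tongyi Wanxiang. It shows that processing chunk by chunk with a small cache gives the same result as processing the whole sequence, while the working buffer never grows beyond the chunk size plus the cache.

```python
from typing import List, Sequence, Tuple


def encode_full(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Reference: causal temporal filter applied to the whole video at once."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # zero history before frame 0
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


def encode_chunked(
    frames: Sequence[float], kernel: Sequence[float], chunk_size: int
) -> Tuple[List[float], int]:
    """Same filter, one chunk at a time, carrying only a small cache.

    Returns the outputs and the largest working buffer that was ever held.
    (Outputs are collected in a list here for comparison; a streaming system
    would write each chunk's result out instead.)
    """
    k = len(kernel)
    cache: List[float] = [0.0] * (k - 1)  # last k-1 inputs seen so far
    outputs: List[float] = []
    peak = 0
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        window = cache + chunk
        peak = max(peak, len(window))
        outputs.extend(
            sum(w * window[t + i] for i, w in enumerate(kernel))
            for t in range(len(chunk))
        )
        # Keep only what the next chunk needs from this one.
        cache = window[len(window) - (k - 1):]
    return outputs, peak
```

In this sketch the peak working buffer is at most `chunk_size + len(kernel) - 1` items regardless of how many frames are fed in, which mirrors the article's point that memory follows the chunk size rather than the video length.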

DiT is like a "spatiotemporal catcher", which can keenly capture the spatiotemporal dynamics in the video and accurately model the changing relationship between different elements in the video in time and space.

The Tongyi Wanxiang team made the following optimizations:

Full spatiotemporal attention: strengthens the modeling of complex dynamic scenes.
Parameter sharing: improves model performance while reducing training cost.
Text embedding optimization: improves text controllability while significantly reducing compute requirements.

△ Tongyi Wanxiang 2.1 video generation architecture diagram
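To make "full spatiotemporal attention" concrete, here is a minimal sketch, not the model's actual implementation. It assumes standard scaled dot-product self-attention with identity projections, and the helper names (`flatten_video`, `full_attention`, `attended_pairs`) are hypothetical. The point it illustrates is that all tokens across frames, rows and columns are flattened into one sequence so that every token can attend to every other, at a higher cost than attending within a frame and along time separately.

```python
import math
from typing import Dict, List

Vector = List[float]


def flatten_video(latent: List[List[List[Vector]]]) -> List[Vector]:
    """Turn latent[t][h][w] into one flat token sequence of length T*H*W."""
    return [token for frame in latent for row in frame for token in row]


def full_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head self-attention in which every token attends to every token.

    Queries, keys and values are the tokens themselves (identity projections),
    which is a simplification made for this sketch.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def attended_pairs(frames: int, height: int, width: int) -> Dict[str, int]:
    """Count query-key pairs for full vs. factorized (spatial + temporal) attention."""
    n = frames * height * width
    return {
        "full": n * n,
        "factorized": n * (height * width) + n * frames,
    }
```

The pair counts show the trade-off: full attention grows with the square of the total token count, which is one reason long videos lead to the ultra-long sequences discussed next.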

Second, a breakthrough in training on ultra-long sequences.

Facing the extremely challenging task of ultra-long sequence training, the Tongyi Wanxiang team used a 4D parallel strategy, effectively building a powerful "engine" for model training.

This strategy combines several techniques: DP (data parallelism), FSDP (fully sharded data parallelism), and two sequence-parallel attention methods, RingAttention and Ulysses.
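One way to picture a 4D parallel layout is as a four-axis device mesh. The sketch below is an assumption for illustration only: the article does not say how the four techniques are arranged, and the axis order, sizes, and function names (`build_mesh`, `group_of`) are hypothetical. It maps each global rank to a coordinate on the four axes and lists the ranks that would communicate along one axis.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed axis order for this sketch; the real layout is not described.
AXES = ("dp", "fsdp", "ring", "ulysses")


def build_mesh(dp: int, fsdp: int, ring: int, ulysses: int) -> Dict[int, Tuple[int, ...]]:
    """Map each global rank to its coordinate along the four parallel axes."""
    coords = product(range(dp), range(fsdp), range(ring), range(ulysses))
    return {rank: coord for rank, coord in enumerate(coords)}


def group_of(mesh: Dict[int, Tuple[int, ...]], rank: int, axis: str) -> List[int]:
    """Ranks that share every coordinate with `rank` except along `axis`."""
    index = AXES.index(axis)
    mine = mesh[rank]
    return sorted(
        other
        for other, coord in mesh.items()
        if all(coord[i] == mine[i] for i in range(len(AXES)) if i != index)
    )
```

For example, with two devices per axis (16 devices in total), `group_of(mesh, 0, "ulysses")` gives the ranks that would exchange sequence shards with rank 0 under this assumed layout.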

For GPU memory optimization, the team adopted a tiered memory optimization strategy, chosen according to the compute and communication load that the sequence length brings, to solve the problem of memory fragmentation, and used FlashAttention3 to speed up spatiotemporal attention.

In addition, eliminating redundant computation and implementing efficient kernels further reduced memory access overhead.

For file system optimization, the team tailored its approach to the characteristics of Alibaba Cloud's high-performance file system, using sharded Save/Load to improve read and write performance. It also adopted a staggered-peak memory usage scheme to resolve the out-of-memory (OOM) problems caused by Dataloader prefetch, CPU offloading, and checkpoint saving.
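The general idea of sharded save and load can be sketched as follows. This is a simplified illustration, not the team's implementation: it writes JSON files sequentially, whereas a real training job would have each rank write its own tensor shard in parallel, and the file naming and function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_sharded(state: Dict[str, List[float]], num_shards: int, directory: str) -> List[Path]:
    """Split a state dict across `num_shards` files instead of one large file."""
    keys = sorted(state)
    paths = []
    for shard in range(num_shards):
        # Round-robin assignment of keys to shards.
        part = {key: state[key] for key in keys[shard::num_shards]}
        path = Path(directory) / f"shard_{shard:05d}.json"
        path.write_text(json.dumps(part))
        paths.append(path)
    return paths


def load_sharded(directory: str) -> Dict[str, List[float]]:
    """Reassemble the full state dict from every shard file in `directory`."""
    state: Dict[str, List[float]] = {}
    for path in sorted(Path(directory).glob("shard_*.json")):
        state.update(json.loads(path.read_text()))
    return state
```

Because each shard is small and independent, no single process has to hold or write the whole checkpoint at once, which is the property that helps with both I/O throughput and peak memory.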

For stability, by relying on Alibaba Cloud's intelligent scheduling, slow-machine detection, and self-healing capabilities, model training can automatically detect faults and restart tasks, which greatly improves the stability of the training process.

△ Tongyi Wanxiang 4D parallel distributed training strategy

Finally, data and evaluation drive progress in tandem.

The Tongyi team built an automated data construction pipeline that, by optimizing for visual quality and motion quality, filters and assembles datasets closely aligned with the distribution of human preferences. This data is highly diverse and evenly distributed, which greatly improves training efficiency.
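A quality-based filtering step of this kind might look like the sketch below. Everything specific here is an assumption: the article does not describe the scoring models or thresholds, so the `Clip` fields, the score range, and the default cutoffs are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Clip:
    name: str
    visual_quality: float  # assumed to be a score in [0, 1]
    motion_quality: float  # assumed to be a score in [0, 1]


def filter_clips(
    clips: Iterable[Clip], min_visual: float = 0.6, min_motion: float = 0.5
) -> List[Clip]:
    """Keep only clips that clear both (hypothetical) quality thresholds."""
    return [
        clip
        for clip in clips
        if clip.visual_quality >= min_visual and clip.motion_quality >= min_motion
    ]
```

In a real pipeline the two scores would come from trained scoring models and the thresholds would be tuned against human preference data; the sketch only shows where such a filter sits.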

The team also designed an evaluation system covering multiple dimensions, such as aesthetic scoring, motion analysis, and instruction compliance, and trained dedicated scoring models for it. Feedback from these automated metrics significantly accelerates the iteration and optimization of the model.

Those are the core technical ingredients behind the new version of Tongyi Wanxiang.

By now, in terms of both technical innovation and hands-on experience, China's homegrown answer to Sora has once again moved to the forefront of AI video.

Its ability to generate Chinese characters alone makes it the only one of its kind in the world.

And the breadth of what it can generate lives up to the name "Tongyi Wanxiang": AI has reached the moment when it can generate "wanxiang", all manner of things.

Do you also have imaginative ideas that you want to bring to life in video form?

Come and try the latest, trendiest model:

Direct experience entry: https://tongyi.aliyun.com/wanxiang/videoCreation

API calls: https://bailian.console.aliyun.com/?spm=5176.29619931.J__Z58Z6CX7MY__Ll8p1ZOR.1.74cd59fckLhf3c#/model-market
