Alibaba DAMO Academy quietly launches a text-to-video model: English-only input, open for public trial
Recently, Alibaba DAMO Academy released a large text-to-video model on ModelScope, its AI model community.
According to the official introduction, the model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features into the video latent space, and a decoder that maps the video latent space to visual space. The overall model has about 1.7 billion parameters and currently supports English input only. The diffusion model adopts a Unet3D structure and generates video through an iterative denoising process that starts from pure Gaussian noise.
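The iterative denoising process described above can be sketched as a toy loop in plain Python. This is an illustration of the general diffusion idea only, not the model's actual code: the real system uses a trained Unet3D noise predictor conditioned on text features, and the step count, noise schedule, and latent shape below are assumptions for the sketch.

```python
import math
import random

random.seed(0)

STEPS = 50  # assumed number of denoising iterations (illustrative)
# Simple linear noise schedule, a common choice in diffusion models.
betas = [1e-4 + (0.02 - 1e-4) * t / (STEPS - 1) for t in range(STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def predict_noise(latent, t):
    # Placeholder for the Unet3D sub-network: the real model predicts the
    # noise added at timestep t, conditioned on the extracted text features.
    return [0.0] * len(latent)

# Start from pure Gaussian noise, standing in for a flattened video latent.
latent = [random.gauss(0.0, 1.0) for _ in range(64)]

for t in reversed(range(STEPS)):  # iterate from the noisiest step back to 0
    eps = predict_noise(latent, t)
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    latent = [(x - coef * e) / math.sqrt(alphas[t])
              for x, e in zip(latent, eps)]
    if t > 0:  # re-inject a small amount of noise except on the final step
        sigma = math.sqrt(betas[t])
        latent = [x + sigma * random.gauss(0.0, 1.0) for x in latent]
```

After the loop, the denoised latent would be passed to the third sub-network to decode it into video frames.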
The model reportedly has a wide range of applications and can generate a video from an arbitrary English text description. Some example prompts:
A giraffe underneath a microwave.
A goldendoodle playing in a park by a lake.
It is understood that the model is now available on ModelScope's Creation Space and Hugging Face, where it can be tried directly; alternatively, you can follow the model page to deploy it yourself. The model requires roughly 16GB of RAM and 16GB of GPU memory. Under the ModelScope framework, it can be invoked through a simple Pipeline: the input must be a dictionary whose only legal key is 'text', with a short piece of English text as its value. The model currently supports inference on GPU only.
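The input format described above can be sketched as follows. The pipeline call is shown commented out because it needs a GPU and a large model download; the model id in the comment is the one implied by the model page and should be verified there. The `make_input` helper is our own illustration, not part of the ModelScope API.

```python
def make_input(prompt: str) -> dict:
    """Wrap an English prompt in the dict format the pipeline expects:
    the only legal key is 'text', holding a short piece of text."""
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'text' must be a non-empty English string")
    return {'text': prompt}

inp = make_input('A goldendoodle playing in a park by a lake.')

# GPU-only inference; requires ~16GB RAM and ~16GB GPU memory.
# Model id assumed from the ModelScope page; check before running.
# from modelscope.pipelines import pipeline
# pipe = pipeline(task='text-to-video-synthesis',
#                 model='damo/text-to-video-synthesis')
# result = pipe(inp)
```

The returned result contains the path of the generated video file, which can then be played or downloaded.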
Judging from early trials, the generated videos are currently mostly 2-4 seconds long, and generation takes anywhere from about 20 seconds to over a minute.