I2VGen-XL addresses the task of generating high-definition video from an input image. It is a foundation model for high-definition video generation developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and clarity respectively, with about 3.7 billion parameters in total. The model is obtained by mixed pre-training followed by fine-tuning on a small amount of high-quality data. The data is widely distributed and diverse in category, so the model generalizes well across different inputs. Compared with existing video generation models, I2VGen-XL shows clear advantages in clarity, texture, semantics, and temporal continuity.
In addition, many design concepts of I2VGen-XL are inherited from our public work VideoComposer; for details, refer to VideoComposer and this project's GitHub repository.
Fig.1 I2VGen-XL
Online demo: https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary
I2VGen-XL is built on top of Stable Diffusion. As shown in the figure, a specially designed spatio-temporal UNet performs spatio-temporal modeling in the latent space, and the decoder reconstructs the final video.
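To make "space-time modeling in latent space" more concrete, the sketch below shows one common way to factorize the work on a video latent of shape `(batch, channels, frames, height, width)`: a spatial step applied to each frame independently, followed by self-attention along the frame axis at each spatial location. This is only an illustration under assumptions. The layer choices, tensor sizes, and all function names (`spatial_mix`, `temporal_attention`, `space_time_block`) are hypothetical and are not taken from the I2VGen-XL code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_mix(z, w):
    # z: (B, C, T, H, W); w: (C, C).
    # Per-pixel channel mixing applied to every frame independently.
    # Stand-in for the 2D convolution/attention layers of an image UNet.
    return np.einsum("dc,bcthw->bdthw", w, z)

def temporal_attention(z):
    # Self-attention over the frame axis T, independently per spatial location.
    b, c, t, h, w = z.shape
    seq = z.transpose(0, 3, 4, 2, 1).reshape(b * h * w, t, c)  # (B*H*W, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)         # (B*H*W, T, T)
    out = softmax(scores, axis=-1) @ seq                       # (B*H*W, T, C)
    return out.reshape(b, h, w, t, c).transpose(0, 4, 3, 1, 2)  # back to (B, C, T, H, W)

def space_time_block(z, w):
    # Residual spatial step followed by a residual temporal step.
    z = z + spatial_mix(z, w)
    z = z + temporal_attention(z)
    return z

# Hypothetical latent: 1 video, 4 channels, 8 frames of 16x16.
z = np.random.default_rng(0).normal(size=(1, 4, 8, 16, 16))
out = space_time_block(z, 0.1 * np.eye(4))
print(out.shape)  # (1, 4, 8, 16, 16)
```

The block preserves the latent shape, so such blocks can be stacked; in a real model the spatial and temporal steps would be learned layers rather than the fixed operations used here.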
To generate 720P video, I2VGen-XL is divided into two stages. The first stage guarantees semantic consistency but produces low-resolution output. The second stage applies DDIM inversion and then denoises with a new VLDM to increase the video resolution and improve both temporal and spatial coherence. Through joint optimization of the model, training, and data, the project achieves these properties.
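The second-stage idea (upsample the low-resolution result, push it back to a noisier state with DDIM inversion, then denoise it with a different model) can be sketched with the standard deterministic DDIM update. This is a toy sketch under assumptions, not the project's implementation: it works on plain arrays rather than VAE latents, uses an assumed linear beta schedule, and takes any callable `eps_model(x, t)` as a stand-in for the trained VLDM noise predictor. All function names and parameters are hypothetical.

```python
import numpy as np

def make_alphas(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    # Cumulative products alpha_bar_t of an assumed linear beta schedule.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_move(x, eps, a_from, a_to):
    # Deterministic DDIM update (eta = 0) between two noise levels.
    x0 = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)  # predicted clean sample
    return np.sqrt(a_to) * x0 + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x, eps_model, alphas, t_end):
    # Walk from the (nearly) clean level 0 up to noise level t_end.
    for t in range(t_end):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x

def ddim_denoise(x, eps_model, alphas, t_start):
    # Walk from noise level t_start back down to level 0.
    for t in range(t_start, 0, -1):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

def upsample_frames(video, factor):
    # video: (T, H, W, C). Nearest-neighbour spatial upsampling.
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def refine(low_res, inversion_model, refine_model, alphas, t_mid, factor=2):
    # Stage-2 sketch: upsample, invert to an intermediate noise level,
    # then denoise with the (hypothetical) refinement model.
    x = upsample_frames(low_res, factor)
    x = ddim_invert(x, inversion_model, alphas, t_mid)
    return ddim_denoise(x, refine_model, alphas, t_mid)

# Usage with a toy predictor that always returns the same noise estimate.
alphas = make_alphas()
low_res = np.random.default_rng(0).normal(size=(4, 8, 8, 3))
toy_eps = lambda x, t: np.full_like(x, 0.3)
high_res = refine(low_res, toy_eps, toy_eps, alphas, t_mid=25)
print(high_res.shape)  # (4, 16, 16, 3)
```

With the toy predictor, inversion followed by denoising simply returns the upsampled input, which is a useful sanity check on the update rule. Any added detail in a real system comes from the trained second-stage model predicting different noise during denoising.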
Some generated examples are shown below:
*For convenience of display, the examples on this page are shown as low-resolution GIFs. The GIF format reduces video quality.*