About I2VGen-XL

The I2VGen-XL project addresses the task of generating high-definition video from an input image. I2VGen-XL is a foundation model for high-definition video generation developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and clarity respectively, and it has about 3.7 billion parameters in total. The model is pre-trained on mixed data and then fine-tuned on a small amount of high-quality data. Because the data are widely distributed and diverse in category, the model generalizes well to different inputs. Compared with existing video generation models, I2VGen-XL has clear advantages in clarity, texture, semantics, and temporal continuity.

In addition, many of I2VGen-XL's design concepts are inherited from our published work VideoComposer; see VideoComposer and this project's GitHub code base for details.

Fig.1 I2VGen-XL

Try the demo: https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary

Model Introduction

I2VGen-XL is built on top of Stable Diffusion. As shown in the figure, a specially designed spatiotemporal UNet performs spatiotemporal modeling in the latent space, and a decoder reconstructs the final video. To generate 720P video, I2VGen-XL is divided into two stages. The first stage guarantees semantic consistency, but at low resolution. The second stage applies DDIM inversion and then denoises with a new VLDM, increasing the video resolution and improving both temporal and spatial coherence. Through joint optimization of the model, training, and data, the project has the following main characteristics:
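The second-stage refinement relies on DDIM inversion. Below is a minimal Python sketch of the deterministic (eta = 0) DDIM update, not the project's code: the same update adds noise when run toward a smaller cumulative alpha (alpha-bar), which is DDIM inversion, and removes noise when run toward a larger one. The schedule values, the tiny "latent", and the `constant_eps` noise predictor are hypothetical stand-ins for the real schedule, the real latent video, and the stage-2 VLDM.

```python
import math


def ddim_step(x, eps, a_from, a_to):
    """One deterministic (eta = 0) DDIM update between two alpha-bar levels.

    Moving to a smaller alpha-bar adds noise (DDIM inversion);
    moving to a larger alpha-bar removes it (ordinary DDIM sampling).
    """
    # Estimate the clean latent from the current sample and the predicted noise.
    x0_pred = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
               for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the target level, reusing the same noise.
    return [math.sqrt(a_to) * x0i + math.sqrt(1.0 - a_to) * ei
            for x0i, ei in zip(x0_pred, eps)]


def run_schedule(x, alpha_bars, eps_fn):
    """Apply consecutive DDIM steps along a list of alpha-bar values."""
    for a_from, a_to in zip(alpha_bars, alpha_bars[1:]):
        x = ddim_step(x, eps_fn(x, a_from), a_from, a_to)
    return x


# Hypothetical toy setup: a predictor that ignores its inputs and always
# returns the same noise vector (a stand-in for the stage-2 VLDM).
FIXED_NOISE = [0.3, -1.2, 0.8]


def constant_eps(x, alpha_bar):
    return FIXED_NOISE


stage1_latent = [0.5, 0.1, -0.4]   # pretend stage-1 output (alpha-bar = 1)
schedule = [1.0, 0.9, 0.7, 0.5]    # hypothetical alpha-bar values

noised = run_schedule(stage1_latent, schedule, constant_eps)      # inversion
restored = run_schedule(noised, schedule[::-1], constant_eps)    # denoising
print([round(v, 6) for v in restored])
```

With a predictor that returns consistent noise, inversion followed by denoising returns the original latent, which is what makes inversion a reasonable way to start the second stage from a noised version of the first-stage result. In the actual model the noise predictions would come from the new VLDM, so the round trip is presumably not exact; that difference is where the refinement happens.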

  • HD & widescreen: the model can directly generate 720P (1280*720) video. Compared with existing open-source projects, not only is the resolution effectively higher, but the widescreen videos it produces also suit more scenarios.
  • Watermark-free: … fine-tuned on high-quality data, the model generates watermark-free videos that can be used on more video platforms, with fewer restrictions.
  • … there has been a significant improvement.
  • Good texture: training on collected video data in specific styles significantly improves the texture of the generated videos. The model can generate tech-style, cinematic-color, cartoon-style, and sketch-style videos.
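To give a sense of the latent space the spatiotemporal UNet works in at 720P, here is a small arithmetic sketch. It assumes a Stable Diffusion-style VAE that downsamples each spatial dimension by 8 and produces 4 latent channels; these values, the 16-frame clip length, and the `latent_shape` and `compression_ratio` helpers are illustrative assumptions, not figures from this page.

```python
def latent_shape(frames, height, width, downsample=8, channels=4):
    """Return (frames, channels, h, w) for a video latent.

    Assumes an SD-style VAE: spatial downsampling by `downsample`
    and `channels` latent channels (both are assumptions here).
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (frames, channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, channels=4):
    """Number of RGB pixel values per latent value, for one frame."""
    pixel_values = height * width * 3
    latent_values = channels * (height // downsample) * (width // downsample)
    return pixel_values / latent_values


shape = latent_shape(frames=16, height=720, width=1280)  # 16 frames is hypothetical
ratio = compression_ratio(720, 1280)
print(shape, ratio)
```

Under these assumptions, each 1280*720 frame becomes a 4 x 90 x 160 latent, 48 times fewer values than the RGB frame, which is why modeling in latent space is far cheaper than in pixel space.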

Below are some generated examples:

For ease of display, this page shows them as low-resolution GIFs; the GIF format reduces video quality.

Official website: https://huggingface.co/damo-vilab/MS-Image2Video

Reviews
无为的树懒
This got my favorite guy moving on my computer! The video quality is also very high, but generation is far too slow.
白雪公主
The resolution is really high, and there's no watermark! But I didn't notice the better texture the official description claims; on the contrary, the results look messy, with a very strong 3D effect. I think this kind of tool deserves careful study!
Community Posts
Seesaw
It's still a bit slow, like slow motion, but I understand that generating video from an image takes time! I hope it gets optimized in the future; at this speed, unless I have a lot of free time, I really don't want to wait.
Euphoria
This is about the same as Gen-2; subjectively, it feels hardly any different. I'm still looking forward to an AI that adds background sound to videos, so I won't even have to edit the audio track!