Drag on an image to generate a video: DragNUWA, from the University of Science and Technology of China, Microsoft, and others, is truly impressive
With the advent of models such as ChatGPT, GPT-4, and LLaMA, generative models are attracting more and more attention. Compared with the increasingly mature fields of text and image generation, AI generation of video, speech, and other modalities still faces greater challenges.
Existing work on controllable video generation has two main problems. First, most existing works control video generation with only one of text, images, or trajectories, and so cannot achieve fine-grained control of videos. Second, research on trajectory control is still at an early stage: most experiments are performed on simple datasets such as Human3.6M, which limits a model's ability to handle open-domain images and complex curved trajectories.
To address this, researchers from the University of Science and Technology of China, Microsoft Research Asia, and Peking University proposed DragNUWA, a new open-domain, diffusion-based video generation model. DragNUWA achieves fine-grained control over video content from semantic, spatial, and temporal perspectives. The paper is co-authored by Yin Shengming and Wu Chenfei, with Duan Nan as the corresponding author.
Given a motion trajectory specified by dragging, DragNUWA can make objects in the image move along that trajectory and directly generate a coherent video. For example, it can make two young boys on skateboards ride along a specified route.
It can also shift the camera position and angle of a static scene image.
The study argues that all three types of control (text, image, and trajectory) are indispensable, because each contributes to controlling video content from a different perspective: semantic, spatial, and temporal, respectively. As shown in Figure 1 below, text and images alone are not enough to convey the complex motion details in a video, which trajectory information can supply; images and trajectories alone cannot fully characterize objects that appear later in the video, which text control can make up for; and trajectories and text alone can be ambiguous when expressing abstract concepts, where image control provides the necessary distinction.
DragNUWA is an end-to-end video generation model that seamlessly integrates three basic controls (text, image, and trajectory) to provide powerful, user-friendly, fine-grained control of video content from semantic, spatial, and temporal perspectives.
To address the limited open-domain trajectory control in current research, the study focuses on three aspects of trajectory modeling:
· A Trajectory Sampler (TS) samples trajectories directly from open-domain video flows during training, enabling open-domain control of arbitrary trajectories;
· Multiscale Fusion (MF) downsamples trajectories to various scales and deeply integrates them with text and images within each block of the UNet architecture, enabling trajectory control at different granularities;
· An Adaptive Training (AT) strategy first uses dense flow as the condition to stabilize video generation, then trains on sparse trajectories to adapt the model, finally producing stable and coherent videos.
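The first two ideas above can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (track_point, sparse_flow_map, average_pool, multiscale), the nearest-pixel tracking, and the use of average pooling are all assumptions made for illustration. It assumes a dense optical-flow field (per-pixel motion between consecutive frames) is already available as nested lists; a real system would work on tensors produced by a flow estimator.

```python
# Toy illustration of trajectory sampling and multiscale downsampling.
# All names and design choices are hypothetical, not DragNUWA's actual code.


def _nearest_pixel(x, y, height, width):
    """Round a position to the nearest pixel, clamped to the image bounds."""
    col = min(max(int(round(x)), 0), width - 1)
    row = min(max(int(round(y)), 0), height - 1)
    return row, col


def track_point(flows, start):
    """Follow one point through a list of dense flow fields.

    flows[t][row][col] is the (dx, dy) displacement from frame t to t + 1.
    Returns the (x, y) position of the point in every frame.
    """
    height, width = len(flows[0]), len(flows[0][0])
    x, y = start
    path = [(x, y)]
    for flow in flows:
        row, col = _nearest_pixel(x, y, height, width)
        dx, dy = flow[row][col]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def sparse_flow_map(path, step, height, width):
    """Build a flow map that is zero everywhere except at the tracked point."""
    grid = [[(0.0, 0.0)] * width for _ in range(height)]
    (x0, y0), (x1, y1) = path[step], path[step + 1]
    row, col = _nearest_pixel(x0, y0, height, width)
    grid[row][col] = (x1 - x0, y1 - y0)
    return grid


def average_pool(grid, factor):
    """Downsample a (dx, dy) map by averaging factor x factor cells."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            cells = [grid[top + i][left + j]
                     for i in range(factor) for j in range(factor)]
            row.append((sum(c[0] for c in cells) / len(cells),
                        sum(c[1] for c in cells) / len(cells)))
        pooled.append(row)
    return pooled


def multiscale(grid, factors=(1, 2, 4)):
    """Return the same trajectory map at several resolutions."""
    return {factor: average_pool(grid, factor) for factor in factors}
```

In this sketch, each resolution in the dictionary returned by multiscale could be fed to a network block operating at the matching spatial size, which is one plausible reading of "integrating trajectories within each block of the UNet". The adaptive training strategy would then amount to conditioning first on the full dense flow and later on maps like the one produced by sparse_flow_map.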
Experiments and Results
The study verifies the effectiveness of DragNUWA through extensive experiments, and the results demonstrate its superior performance in fine-grained control of video synthesis.
Unlike existing studies that focus on text or image control, DragNUWA mainly emphasizes modeling trajectory control. To verify the effectiveness of trajectory control, the study tests DragNUWA on two aspects: camera motion and complex trajectories.
As shown in Figure 4 below, although DragNUWA does not explicitly model camera motion, it learns various camera motions by modeling open-domain trajectories.
To evaluate DragNUWA's ability to accurately model complex motions, the study tested a variety of complex drag trajectories using the same image and text. As shown in Figure 5 below, the results show that DragNUWA can reliably control complex motions.
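To make the notion of a drag trajectory concrete, the following minimal sketch shows one way a dragged path could be turned into one target position per video frame. The article does not describe how DragNUWA encodes drags internally, so the function name resample_drag and the arc-length resampling scheme are assumptions for illustration only.

```python
import math


def resample_drag(points, num_frames):
    """Resample a dragged polyline into num_frames points.

    The points are evenly spaced along the arc length of the path, so the
    object moves at constant speed (an assumption of this sketch).
    """
    if num_frames < 2 or len(points) < 2:
        raise ValueError("need at least two points and two frames")

    # Cumulative arc length at each vertex of the polyline.
    lengths = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lengths.append(lengths[-1] + math.hypot(x1 - x0, y1 - y0))
    total = lengths[-1]

    result = []
    segment = 0
    for k in range(num_frames):
        target = total * k / (num_frames - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(points) - 2 and lengths[segment + 1] < target:
            segment += 1
        span = lengths[segment + 1] - lengths[segment]
        t = 0.0 if span == 0 else (target - lengths[segment]) / span
        (x0, y0), (x1, y1) = points[segment], points[segment + 1]
        result.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return result
```

For an L-shaped drag such as [(0, 0), (4, 0), (4, 4)] and five frames, this yields positions that advance two units per frame, turning the corner at the middle frame. Differences between consecutive positions would give the per-frame displacements of a sparse trajectory.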
Furthermore, although it mainly emphasizes trajectory control, DragNUWA also incorporates text and image control. According to the research team, text, image, and trajectory correspond to three fundamental control aspects of video: semantics, space, and time, respectively. Figure 6 below illustrates the necessity of these control conditions by showing different combinations of text (p), trajectory (g), and image (s), including s2v, p2v, gs2v, ps2v, and pgs2v.
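The setting names can be read mechanically: the letters before "2v" list the conditions supplied, and "v" stands for the generated video. The small helper below spells this out; parse_setting is a hypothetical name used only to illustrate the notation and is not part of any released code.

```python
# Letter codes as given in the article: p = text, g = trajectory, s = image.
CONTROL_NAMES = {"p": "text", "g": "trajectory", "s": "image"}


def parse_setting(name):
    """Expand a setting such as 'pgs2v' into the list of controls it uses."""
    prefix, separator, target = name.partition("2")
    if separator != "2" or target != "v" or not prefix:
        raise ValueError(f"unrecognized setting: {name!r}")
    unknown = [letter for letter in prefix if letter not in CONTROL_NAMES]
    if unknown:
        raise ValueError(f"unknown control letters: {unknown}")
    return [CONTROL_NAMES[letter] for letter in prefix]
```

Under this reading, s2v generates a video from an image alone, while pgs2v uses all three controls together.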
Interested readers can consult the original paper to learn more about the research.
Paper address: https://arxiv.org/abs/2308.08089