Want to Photoshop an elephant so it turns around? Just drag the mouse: DragGAN, a research result from Chinese researchers, goes viral
To turn the elephant around, all you need to do is drag the GAN.
Diffusion models are generally considered the dominant approach in image generation today, with Stable Diffusion as a leading representative. However, diffusion models rely on iterative inference, which is a double-edged sword: iterating makes training stable, but it also makes inference computationally expensive.
Before Stable Diffusion, generative adversarial networks (GANs) were the workhorse of image generation. Unlike diffusion models, a GAN generates an image in a single forward pass and is therefore more efficient. However, GAN training is not stable enough: it requires careful tuning of the network architecture and training hyperparameters, which makes GANs harder to apply to complex datasets. This disadvantage relative to diffusion models is one reason GANs have lost ground.
Currently, GANs mainly rely on manually annotated training data or prior 3D models to ensure controllability, an approach that usually lacks flexibility, precision and generality. Nevertheless, many researchers still value the efficiency of GANs for image generation and have made many attempts to improve them.
One such attempt is a new image control method called DragGAN, which is flexible, powerful and simple: just drag manipulation points on the image to synthesize the image you want. The transformations are very free-form, such as making a lion "turn its head":
If a client asks you to Photoshop an elephant so that it turns around, you might as well try this:
The whole transformation process is simple and flexible, leading some to remark that "Photoshop seems to be outdated".
Others, however, think that DragGAN may become part of Photoshop in the future, which also speaks to its importance.
This paper has been accepted to SIGGRAPH 2023. According to the study, although current GANs have shortcomings, DragGAN can effectively improve their controllability. The code is to be open-sourced in June. DragGAN is powerful and flexible, and its technical approach is worth studying.
DragGAN consists of two main components, feature-based motion supervision and point tracking, which together control pixel positions to change the image. It can handle different types of images, such as animals, cars, humans and landscapes, covers a wide range of object poses, shapes, expressions and layouts, and its user interaction is simple and general.
A major advantage of GANs is that their feature space is highly discriminative, which enables precise point tracking and motion supervision. Specifically, motion supervision is achieved by optimizing the latent code with a shifted feature patch loss. Each optimization step moves the manipulation point closer to the target, and point tracking is then performed by nearest neighbor search in feature space.
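The two steps can be illustrated with a small dependency-free sketch. It is not the authors' implementation: the function names, the nested-list feature map layout (`feat[y][x]` is a list of channel values) and the radii `r1` and `r2` are assumptions made here for illustration. In the real method the loss is backpropagated to the latent code; this sketch only evaluates the loss value and performs the tracking search.

```python
import math

def bilinear(feat, x, y):
    """Sample an H x W x C feature map (nested lists) at a fractional (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * a + ax * b) + ay * ((1 - ax) * c + ax * d)
        for a, b, c, d in zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    ]

def motion_loss(feat, p, t, r1=3):
    """Shifted-patch L1 loss for one manipulation point p and target t.

    p and t are integer (x, y) pixel coordinates. For every pixel q within
    distance r1 of p, the feature at q is compared with the feature one unit
    step further along the direction from p to t.
    """
    px, py = p
    dist = math.hypot(t[0] - px, t[1] - py)
    if dist == 0:
        return 0.0  # point already at its target
    dx, dy = (t[0] - px) / dist, (t[1] - py) / dist
    h, w = len(feat), len(feat[0])
    loss = 0.0
    for qy in range(max(0, py - r1), min(h, py + r1 + 1)):
        for qx in range(max(0, px - r1), min(w, px + r1 + 1)):
            if math.hypot(qx - px, qy - py) <= r1:
                src = feat[qy][qx]  # would be held constant (no gradient) when optimizing
                dst = bilinear(feat, qx + dx, qy + dy)
                loss += sum(abs(a - b) for a, b in zip(src, dst))
    return loss

def track_point(feat, f0, p, r2=12):
    """Nearest neighbor search: find the pixel near p whose feature best matches f0.

    f0 is the feature vector of the manipulation point in the initial image.
    """
    px, py = p
    h, w = len(feat), len(feat[0])
    best, best_d = (px, py), float("inf")
    for qy in range(max(0, py - r2), min(h, py + r2 + 1)):
        for qx in range(max(0, px - r2), min(w, px + r2 + 1)):
            d = sum(abs(a - b) for a, b in zip(feat[qy][qx], f0))
            if d < best_d:
                best, best_d = (qx, qy), d
    return best
```

In an editing loop, one would alternate a gradient step on `motion_loss` (through the generator, using an autodiff framework) with a call to `track_point` on the updated feature map, until each point reaches its target.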
Another reason the method is efficient is that DragGAN does not rely on any additional network; it only needs to compute a loss function and feature vectors. This lets DragGAN complete an edit within seconds on a single RTX 3090 GPU, enabling real-time interactive editing. Users can apply multiple transformations to the image until they obtain the desired output. In addition, DragGAN lets users draw a region of interest for region-specific editing.
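The loss mentioned above, including the region-of-interest constraint, can be sketched as a single objective. The notation is an assumption introduced here and should be checked against the paper: p_i are the manipulation points, t_i the target points, F the generator feature map, F_0 the feature map of the initial image, Ω_1(p_i, r_1) the pixels within radius r_1 of p_i, M the user-drawn binary mask (1 inside the editable region), and λ a weighting factor.

```latex
\mathcal{L}
  = \sum_{i=1}^{n} \; \sum_{q_i \in \Omega_1(p_i, r_1)}
      \bigl\| \mathbf{F}(q_i) - \mathbf{F}(q_i + d_i) \bigr\|_1
  \; + \; \lambda \, \bigl\| (\mathbf{F} - \mathbf{F}_0) \cdot (1 - \mathbf{M}) \bigr\|_1 ,
\qquad
d_i = \frac{t_i - p_i}{\| t_i - p_i \|_2}
```

The first term pulls each small patch around a manipulation point one unit step toward its target, with F(q_i) treated as a constant during optimization; the second term penalizes feature changes outside the mask, which is what keeps the rest of the image fixed.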
As shown in the figure below, DragGAN efficiently moves user-defined manipulation points to their target points to achieve a variety of manipulation effects. Unlike traditional warping methods, the deformation here is performed on the image manifold learned by the GAN, so it tends to respect the underlying object structure rather than simply warping pixels. This means the method can generate content that was previously hidden, such as the teeth inside a lion's mouth, and can deform objects according to their rigidity, such as bending a horse's leg.
The researchers also developed a graphical user interface (GUI) that lets users interact by simply clicking on the image. In addition, DragGAN can be combined with GAN inversion techniques to serve as a tool for editing real images. For example, if some classmates in a group photo are unhappy with their expressions, the tool can change them, such as replacing one classmate's shy smile with a confident one.
Therefore, DragGAN is a very practical tool that can be widely used in the field of image editing to achieve efficient editing and control of images.
Xingang Pan, the first author of the paper, received his Ph.D. in 2021 from the Multimedia Laboratory of the Chinese University of Hong Kong, where he was advised by Professor Xiaoou Tang. It is worth mentioning that his photo also appears in the paper. Pan is currently a postdoctoral fellow at the Max Planck Institute for Informatics and will join MMLab at the School of Computer Science and Engineering, Nanyang Technological University, as an assistant professor in June 2023.
The main goal of this research is an interactive image manipulation method for GANs: users define pairs of manipulation points and target points by simply clicking on the image, and the method drives each manipulation point to reach its corresponding target point.
This research is based on StyleGAN2, whose basic architecture is as follows:
In the StyleGAN2 architecture, a 512-dimensional latent code 𝒛 ∈ N(0, 𝑰) is mapped into an intermediate latent code 𝒘 ∈ R^512 through a mapping network. The space of 𝒘 is often called W. Then, 𝒘 is fed to the generator 𝐺, which produces an output image I = 𝐺(𝒘). During this process, 𝒘 is duplicated several times and fed to different layers of the generator 𝐺 to control different property levels. Alternatively, it is also possible to use different 𝒘 for different layers, in which case the input would be 𝒘 ∈ R^(𝑙×512) = W^+, where 𝑙 is the number of layers. This less constrained W^+ space turns out to be more expressive. Since the generator 𝐺 learns a mapping from a low-dimensional latent space to a high-dimensional image space, it can be viewed as modeling an image manifold.
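The difference between the W and W^+ spaces is mostly a matter of shapes, which a toy sketch can make concrete. Everything here is illustrative: `toy_mapping` is a placeholder that only rescales its input and stands in for the learned mapping network, and `NUM_LAYERS` is a hypothetical value for 𝑙.

```python
import math
import random

LATENT_DIM = 512   # dimensionality of z and w, as stated in the text
NUM_LAYERS = 18    # hypothetical value of l; the real value depends on the generator

def sample_z(rng):
    """Draw a latent code z with independent standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

def toy_mapping(z):
    """Placeholder for the learned mapping network: only rescales z to unit RMS."""
    rms = math.sqrt(sum(v * v for v in z) / len(z))
    return [v / rms for v in z]

def to_w_plus(w, num_layers=NUM_LAYERS):
    """Lift a code from W to W+ by giving every layer its own copy of w."""
    return [list(w) for _ in range(num_layers)]

rng = random.Random(0)
w = toy_mapping(sample_z(rng))   # one 512-dimensional code shared by all layers (W)
w_plus = to_w_plus(w)            # l separate 512-dimensional codes (W+)

# In W+ a single layer's code can be changed without touching the others.
w_plus[0][0] += 1.0
```

Because each row of `w_plus` can be optimized independently, W^+ has 𝑙 times as many free parameters as W, which is one way to see why it is more expressive.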
To demonstrate DragGAN's capabilities in image manipulation, the study conducted qualitative experiments, quantitative experiments and ablation studies. The results show that DragGAN outperforms existing methods in both image manipulation and point tracking.
We compare our method with UserControllableLT, performing qualitative comparisons on image manipulation results for several different object categories and user input. Our method can accurately move the manipulation point to reach the target point, achieving diverse and natural manipulation effects, such as changing animal poses, car shapes, and landscape layouts. In contrast, UserControllableLT fails to faithfully move the manipulation point to the target point, often resulting in unwanted changes in the image.
It also cannot keep the unmasked region fixed as well as our method does.
Compared with PIPs and RAFT, our method accurately tracks the manipulation point above the lion's nose, thus successfully dragging it to the target location.
DragGAN can be used not only for the processing of synthetic images, but also for real images. Using the GAN inversion technique to embed real images into the latent space of StyleGAN, our method can manipulate real images. An example is shown here, applying PTI inversion to a real image with edits for pose, hair, shape and expression.
More real image editing examples:
We quantitatively evaluate the method in two settings, including face landmark manipulation and pairwise image reconstruction.
Face landmark manipulation. For different numbers of points, our method clearly outperforms UserControllableLT. In particular, it preserves better image quality, as shown by the FID scores in the table.
The difference is evident: our method opens the mouth and adjusts the shape of the jaw to match the target face, while UserControllableLT fails to do so.
Pairwise image reconstruction. Our method outperforms all baselines across different object categories.
The researchers studied the effect of using different features for motion supervision and point tracking, reporting point manipulation performance on face landmarks, measured by mean distance (MD), for each choice of features. For both motion supervision and point tracking, the feature maps after the sixth block of StyleGAN perform best, offering the best balance between resolution and discriminative power.
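Assuming MD denotes the mean Euclidean distance between the final positions of the manipulated points and their targets (lower is better), it can be computed as in the sketch below. The function name and the (x, y) tuple representation are choices made here for illustration.

```python
import math

def mean_distance(points, targets):
    """Mean Euclidean distance between corresponding points and targets."""
    if not points or len(points) != len(targets):
        raise ValueError("points and targets must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(points, targets)) / len(points)
```

For example, one point that lands exactly on its target and another that ends up 5 pixels away give an MD of 2.5.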
Effect of 𝑟_1. Performance is not very sensitive to the choice of 𝑟_1, the radius of the patch around each manipulation point used in motion supervision, although 𝑟_1=3 performs slightly better.
Effect of the mask. The method lets the user supply a binary mask marking the movable region; its effect is shown here:
Out-of-distribution manipulations. The method has some ability to extrapolate and can create images outside the training image distribution, such as an extremely wide-open mouth and an enlarged wheel.
The researchers also point out the method's limitations: although it has some extrapolation ability, its editing quality is still affected by the diversity of the training data. For example, as shown in Figure (a), creating a human pose that deviates from the training distribution can lead to artifacts. In addition, as shown in (b) and (c), manipulation points in textureless regions sometimes drift more during tracking. The researchers therefore recommend choosing texture-rich manipulation points where possible, which improves the editing results.