Online shopping revolutionized: Google's latest AI model delivers one-click virtual try-on that keeps garment details intact while handling any pose

Hayo News
June 18th, 2023
Google's new AI model tackles the two major problems of AI try-on head-on: it both retains the details of the clothes and adapts them to arbitrary poses. Impulse buying may get even easier from here!

One-click outfit changes, made real by Google!

Give the AI try-on model TryOnDiffusion a full-body photo of yourself and a photo of a model wearing a garment, and it shows what you would look like wearing that garment.

The whole selling point is realism. So is this the live-action version of Nikki-Dress UP Queen?

To be fair, AI outfit-swapping tools have been around for a long time. So what is the breakthrough in Google's model?

Project address: https://tryondiffusion.github.io/

The key is that they propose a diffusion-based framework that unifies two UNets into a single architecture, Parallel-UNet.

Until now, the key challenge for this kind of model has been to preserve the details of a garment while also warping it to fit the pose and body shape of different subjects, so that the result does not look inconsistent.

Previous methods could not do both at once: they either preserved clothing details but could not handle changes in pose and shape, or they handled pose changes but lost the clothing details.

Because TryOnDiffusion unifies two UNets, it can preserve garment details and apply significant pose and body-shape changes to the garment within a single network.

The results show that the garment deforms very naturally on each person, and its details are reproduced faithfully.

Without further ado, let's see just how powerful Google's "AI try-on" is!

Generate try-on images with AI

Specifically, Virtual Try-On (VTO) can show customers how clothes will look on real models of different shapes and sizes.

There are many subtle but crucial details in virtual clothing fitting, such as how clothes drape, fold, fit, stretch and wrinkle.

Existing techniques such as geometric warping cut and paste an image of a garment and then warp it to fit the contours of the body.

With these techniques, however, it is hard to make clothes fit the body properly, and visual defects such as misplaced folds make the clothes look misshapen and unnatural.

So researchers at Google worked to generate every pixel of the clothing from scratch to produce high-quality, realistic images.

The technology they use is a new Diffusion-based AI model, TryOnDiffusion.

Diffusion is the process of gradually adding extra pixels (or "noise") to an image until it becomes unrecognizable, and then removing the noise completely until the original image is reconstructed in perfect quality.
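The add-noise-then-remove-noise idea can be illustrated with a minimal sketch. It assumes the standard closed-form noising rule used by common diffusion models, which the article does not spell out. The names `add_noise`, `recover_clean` and `alpha_bar` are illustrative, and the "perfect" noise estimate stands in for what a trained network would predict.

```python
import math
import random

def add_noise(pixels, alpha_bar, rng):
    """Blend clean pixels with Gaussian noise.

    alpha_bar is the share of signal variance that survives:
    1.0 keeps the image clean, values near 0.0 leave almost pure noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise

def recover_clean(noisy, predicted_noise, alpha_bar):
    """Invert the blend, given an estimate of the noise that was added."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale
            for x, n in zip(noisy, predicted_noise)]

rng = random.Random(0)
image = [0.1, 0.5, 0.9, 0.3]  # toy "image": four pixel intensities
noisy, true_noise = add_noise(image, alpha_bar=0.2, rng=rng)

# A trained model would predict the noise; here the true noise is reused,
# so the original pixels come back up to floating-point error.
restored = recover_clean(noisy, true_noise, alpha_bar=0.2)
```

In a real model the noise estimate is imperfect, which is why denoising is done gradually over many small steps rather than in one jump.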

A text-to-image model like Imagen combines diffusion with text encoded by a large language model (LLM) to generate realistic images from input text alone.

In TryOnDiffusion, instead of using text, a pair of images is used: one image is the clothes (or a model wearing clothes), and one image is the model.

Each image is sent to its own neural network (a U-Net). The two networks share information with each other through a process called "cross-attention" and output a new photorealistic image of the person wearing the garment.

This combination of image-based Diffusion and cross-attention forms the core of this AI model.
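To make "cross-attention" concrete, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not Google's implementation: the learned query, key and value projections of a real model are omitted, and the feature vectors are made-up toy values. Person features act as queries, and garment features act as keys and values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """For each query, return a weighted average of the values.

    Weights come from the scaled dot product between the query and each key.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy features: 3 locations in the person image, 2 locations in the garment image.
person_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
garment_features = [[4.0, 0.0], [0.0, 4.0]]

# Each person location pulls in the garment features it matches best.
fused = cross_attention(person_features, garment_features, garment_features)
```

Because every person location can attend to every garment location, the garment information does not need to be spatially aligned with the person beforehand.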

The VTO feature lets users render and display tops on models that match their own body shape.

Training on massive, high-quality data

To make the VTO feature as realistic as possible and genuinely useful for picking clothes, Google trained this AI model extensively.

Rather than training it with a large language model, however, Google drew on its Shopping Graph.

Google describes this dataset as the world's most comprehensive and up-to-date collection of product, seller, brand, review and inventory data.

Google trained the model using pairs of images, each pair consisting of images of clothed models in two different poses.

For example, one image shows a person wearing a shirt standing sideways, and the other shows the person facing forward.

Each image is fed into its own neural network (U-Net) within Google's specialized diffusion model, which generates the output: a photorealistic image of the person wearing the garment.

In this pair of training images, the model learns to match the shape of the shirt in the sideways pose to the figure in the forward-facing pose.

And vice versa, until it can generate realistic images of the person in the shirt from every angle.

In pursuit of better results, Google repeated the process many times using millions of random image pairs of different clothing and people.
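The pairing scheme described above can be sketched as follows. The catalog contents, file names and the helper `build_training_pairs` are hypothetical and only illustrate the idea: images of the same person in the same garment are paired across poses, in both directions.

```python
import random
from itertools import permutations

# Hypothetical catalog: (person, garment) -> images of that person in that garment.
catalog = {
    ("person_a", "shirt_1"): ["a1_front.jpg", "a1_side.jpg", "a1_back.jpg"],
    ("person_b", "shirt_2"): ["b2_front.jpg", "b2_side.jpg"],
}

def build_training_pairs(catalog):
    """Return ordered (garment_source, target_pose) image pairs.

    Both orders are kept, so the model learns side-to-front and front-to-side.
    """
    pairs = []
    for images in catalog.values():
        pairs.extend(permutations(images, 2))
    return pairs

pairs = build_training_pairs(catalog)
random.Random(0).shuffle(pairs)  # training visits the pairs in random order
```

With three poses for one entry and two for the other, this yields 6 + 2 = 8 ordered pairs. Real training would draw from millions of such pairs.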

The result is what we saw in the picture at the beginning of the article.

In short, TryOnDiffusion not only retains the details of the clothes, but also adapts to the figure and posture of the new model. Google's technology achieves both, and the effect is quite realistic.

Technical details

Given one image showing a person's body and another showing a different person wearing a garment, the goal of TryOnDiffusion is to generate a realistic visualization of how the garment would look on the first person.

The key difficulty in solving this problem is to properly deform the clothing to adapt to the changes in pose and body shape between different models while maintaining the realistic details of the clothing.

Previous methods either focus on preserving clothing details but cannot effectively handle pose and shape variations, or they produce a try-on result for the desired body shape and pose but lose the details of the clothing.

Google proposes a diffusion-based architecture that unifies two UNets into one (called Parallel-UNet), which preserves garment details within a single network while applying significant pose and body-shape changes to the try-on result.

The key ideas of Parallel-UNet include:

1) The garment is warped implicitly through a cross-attention mechanism;

2) Garment warping and person blending happen as one unified process rather than as a sequence of two independent tasks.

Experimental results show that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.

The specific implementation method is shown in the figure below.

In a preprocessing step, the target person is segmented from the person image to create a "clothing-agnostic RGB" image, the target garment is segmented from the garment image, and the pose is computed for both the person and garment images.

These inputs are fed into a 128×128 Parallel-UNet (the key step) to create a 128×128 try-on image, which is then passed, together with the try-on conditioning inputs, to a 256×256 Parallel-UNet.

The 256×256 Parallel-UNet output is then sent to a standard super-resolution diffusion model to create a 1024×1024 image.
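The three-stage cascade can be sketched as a simple data-flow skeleton. All stage functions below are placeholders: they only resize a toy image with nearest-neighbour sampling so that the shapes match the 128, 256 and 1024 resolutions described above. None of them is a diffusion model, and all function names are hypothetical.

```python
def make_image(size, value=0.0):
    """Square single-channel toy image filled with one value."""
    return [[value] * size for _ in range(size)]

def resize_nearest(image, size):
    """Nearest-neighbour resize, standing in for a model's output at `size`."""
    src = len(image)
    return [[image[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]

def base_stage(conditions):
    # Placeholder for the 128x128 Parallel-UNet.
    return resize_nearest(conditions["person"], 128)

def refine_stage(low_res, conditions):
    # Placeholder for the 256x256 Parallel-UNet, which receives the
    # 128x128 result together with the same try-on conditions.
    return resize_nearest(low_res, 256)

def super_resolution_stage(image):
    # Placeholder for the super-resolution diffusion model.
    return resize_nearest(image, 1024)

def try_on(conditions):
    """Run the cascade: 128x128 -> 256x256 -> 1024x1024."""
    image_128 = base_stage(conditions)
    image_256 = refine_stage(image_128, conditions)
    return super_resolution_stage(image_256)

conditions = {"person": make_image(64, 0.5), "garment": make_image(64, 0.2)}
result = try_on(conditions)
```

Splitting generation into a low-resolution stage followed by refinement stages is a common design for diffusion models, since most of the structural decisions can be made cheaply at low resolution.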

The structure and processing of the 128×128 Parallel-UNet, the most important part of the pipeline above, are shown in the figure below.

The clothing-agnostic RGB image and the noisy image are fed into the person-UNet at the top.

Since both inputs are pixel-aligned, the two images are concatenated directly along the channel dimension at the beginning of UNet processing.
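Channel-wise concatenation of two pixel-aligned images can be illustrated with a minimal sketch. Images are represented here as nested lists of shape height × width × channels, and the pixel values are made up for illustration.

```python
def concat_channels(image_a, image_b):
    """Concatenate two pixel-aligned H x W x C images along the channel axis."""
    if len(image_a) != len(image_b) or len(image_a[0]) != len(image_b[0]):
        raise ValueError("images must share height and width")
    return [[pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]

height, width = 2, 3
# Toy stand-ins for the clothing-agnostic RGB image and the noisy image.
agnostic_rgb = [[[0.5, 0.5, 0.5] for _ in range(width)] for _ in range(height)]
noisy_image = [[[0.1, -0.2, 0.3] for _ in range(width)] for _ in range(height)]

# Each pixel now carries 6 channels: 3 from each input.
stacked = concat_channels(agnostic_rgb, noisy_image)
```

This only works because pixel (r, c) refers to the same body location in both inputs. The garment image is not aligned with the person in this way, which is why it is fused through cross-attention instead.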

The segmented garment image is fed into the garment-UNet at the bottom.

Clothing features are fused into the target image via cross attention.

To save model parameters, the Google researchers stop the garment-UNet early, after the 32×32 upsampling stage, at which point the final cross-attention module in the person-UNet has already been computed.

The poses of the person and the garment are first fed into linear layers to compute their respective pose embeddings.

The pose embeddings are then fused into person-UNet via an attention mechanism.

Furthermore, they are used to modulate the features of both UNets at all scales using FiLM (feature-wise linear modulation).
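FiLM itself is simple: a conditioning vector is mapped to a per-channel scale and shift, which are then applied to the feature map. The sketch below assumes the generic FiLM formulation. The keypoints, weights and feature values are made up, and how TryOnDiffusion derives its scale and shift from the pose embeddings is not specified in this article.

```python
def linear(inputs, weights, bias):
    """Single-vector linear layer: y = W x + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def film(features, scale, shift):
    """Apply a per-channel affine transform at every spatial location."""
    return [[s * f + t for f, s, t in zip(location, scale, shift)]
            for location in features]

pose_keypoints = [0.2, 0.8, 0.5, 0.1]  # toy flattened 2D keypoints (two joints)

# Hypothetical weights; in a trained model these would be learned.
w_scale = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_shift = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
scale = linear(pose_keypoints, w_scale, [1.0, 1.0])
shift = linear(pose_keypoints, w_shift, [0.0, 0.0])

features = [[1.0, 2.0], [3.0, 4.0]]  # 2 locations x 2 channels
modulated = film(features, scale, shift)
```

Because the same scale and shift are applied at every location, FiLM injects global conditioning such as pose cheaply, which makes it practical to apply at all scales of a UNet.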

Comparison with Mainstream Technology

User study: for each set of input images, 15 non-expert users chose which of 4 candidate techniques they thought was best, or chose "indistinguishable". TryOnDiffusion clearly outperformed the other techniques.

The picture below shows, from left to right: input, TryOnGAN, SDAFN, HR-VITON, and Google's method.


However, TryOnDiffusion has some limitations.

First, Google's method may suffer from clothing leakage artifacts if there are errors in the segmentation map and pose estimation during preprocessing.

Fortunately, the accuracy of segmentation and pose estimation has improved considerably in recent years, so this does not happen often.

Second, representing identity through the clothing-agnostic RGB image is not ideal, because it sometimes preserves only part of the identity.

For example, tattoos will not be visible in this representation, and neither will certain muscle structures.

Third, the training and test datasets mostly have clean, uniform backgrounds, so it is uncertain how well the method will perform on more complex backgrounds.

Fourth, the method cannot guarantee that the garment would really fit the person; it focuses only on the visual effect of the try-on.

Finally, this study focuses on clothing for the upper body. Google has not yet experimented with full-body try-on effects, and will conduct further research on full-body effects in the future.



Reprinted from 新智元