
Collage Diffusion: Image synthesis can be simpler

McMuffin
April 11th, 2023

In the past, web editors would use Photoshop to cut out images and paste one onto another image's layers to make a simple composite. If the subject was a person, handling the hair was even more tedious; simply moving a person onto a new background could take a long time.

Now, however, AI collage technology has become so powerful that none of that trouble is necessary.

For example, here is a seemingly unremarkable Japanese-style bento.

[Image: the AI-generated Japanese-style bento]

But believe it or not, every item of food was actually pasted in, and the original image looked like this:

Cutting the images out and pasting them in directly produces a result that looks obviously fake at a glance

The operator behind it is not some Photoshop master, but an AI with a straightforward name: Collage Diffusion.

Just hand it a few small images: the AI understands their content on its own and then combines the elements into a single large image, very naturally.

[Image: small source images and the AI's naturally combined result]

Its results have surprised many netizens.

One Photoshop enthusiast even said: "This is a godsend... I hope to see it in Automatic1111 soon" (Automatic1111 is the web UI commonly used by Stable Diffusion users; it also has a plug-in version that integrates with Photoshop).

Why is the effect so natural?

In fact, the AI generated several versions of the "Japanese bento", and all of them look natural.

[Image: several generated versions of the Japanese bento]

Why multiple versions? Because users can also customize the output, fine-tuning various details without the overall image becoming too outlandish.

Beyond the "Japanese bento", the model has produced many other impressive results.

For example, here is the material given to the AI, where the traces of compositing are obvious:

[Image: the rough source collage, with obvious compositing traces]

And here is the image assembled by the AI; no traces of compositing are visible:

[Image: the AI-assembled result]

Over the past two years, text-to-image diffusion models have become enormously popular; DALL·E 2 and Imagen are both applications built on them. The advantage of diffusion models is that the generated images are diverse and of high quality.


However, text can only loosely specify the target image, so users usually spend a lot of time tweaking the prompt and still have to pair it with additional control components to achieve good results.

Take the Japanese bento shown above as an example:

If the user simply enters "a bento box with rice, edamame, ginger, and sushi," the prompt describes neither which food goes in which compartment nor what each food looks like. Spelling all of that out would require writing a short essay...

In view of this, the Stanford team decided to approach the problem from a different angle.

They drew on the traditional idea of assembling a picture from individual pieces, and developed a new diffusion model around it.

Interestingly, the model itself is, so to speak, pieced together from classical techniques.

The first is layering: using a layer-based image-editing UI, the source material is broken down into individual RGBA layers (R, G, B for red, green, and blue; A for alpha, i.e. transparency), the layers are arranged on a canvas, and each layer is paired with a text prompt.

Through layering, the various elements of an image can be modified independently.
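To make this concrete, here is a minimal Python sketch of the layered input described above: each RGBA layer is paired with its own text prompt and a canvas position, and the layers are pasted into a rough composite. The names (`Layer`, `compose_canvas`, the file names) are illustrative, not taken from the paper's code.

```python
# A minimal sketch of the layered input described above.
# Layer, compose_canvas, and the file names are illustrative, not the paper's code.
from dataclasses import dataclass
from PIL import Image

@dataclass
class Layer:
    image: Image.Image         # RGBA image; the alpha channel masks the element
    prompt: str                # per-layer text prompt describing this element
    position: tuple[int, int]  # (x, y) placement on the canvas

def compose_canvas(layers: list[Layer], size=(512, 512)) -> Image.Image:
    """Paste the layers onto a blank canvas, back to front.
    This rough composite (plus the per-layer prompts) is the model's input."""
    canvas = Image.new("RGBA", size, (255, 255, 255, 255))
    for layer in layers:
        canvas.alpha_composite(layer.image, dest=layer.position)
    return canvas

# Example: the bento collage from the article, as three cut-out layers.
collage = [
    Layer(Image.open("rice.png").convert("RGBA"), "a bed of white rice", (40, 60)),
    Layer(Image.open("sushi.png").convert("RGBA"), "two pieces of salmon sushi", (260, 80)),
    Layer(Image.open("edamame.png").convert("RGBA"), "a pile of edamame", (60, 300)),
]
rough_composite = compose_canvas(collage)
```

The key difference from ordinary layer-based editing is that the per-layer prompts travel with the layers, so the diffusion model knows what each element is supposed to be.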

Layering has long been a mature technique in computer graphics, but until now layer information was generally used only to produce a single composite image as output.

In this new collage diffusion model, the layer information instead becomes the input to the subsequent steps.

In addition to layering, the model incorporates existing diffusion-based image harmonization techniques to improve the visual quality of the output.

In short, the algorithm restricts changes to certain properties of objects (such as their visual characteristics) while allowing other properties (orientation, lighting, perspective, occlusion) to vary. This balances fidelity against naturalness, producing an image that both resembles the input and shows no jarring artifacts.
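The paper's own harmonization pipeline is more involved, but the trade-off can be sketched with an off-the-shelf img2img pass from the `diffusers` library, where the `strength` parameter acts as the knob between fidelity and naturalness. Note this is a single global knob applied to the whole composite, standing in for the per-property, per-layer control described above.

```python
# A hedged sketch of diffusion-based harmonization using the `diffusers`
# library's img2img pipeline. Collage Diffusion's actual method is more
# sophisticated (e.g. per-layer control); this only illustrates the knob.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

harmonized = pipe(
    prompt="a bento box with rice, edamame, ginger, and sushi",
    image=rough_composite.convert("RGB"),  # the composite from the sketch above
    strength=0.4,        # low: stays close to the collage (fidelity);
                         # high: lighting/perspective/occlusion may change (naturalness)
    guidance_scale=7.5,
).images[0]
harmonized.save("harmonized_bento.png")
```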

[Image: results balancing fidelity to the source layers against naturalness]

The workflow is also straightforward. In interactive editing mode, users can create a collage in a few minutes.

Not only can they customize the spatial arrangement of the scene (that is, place an image cut out from elsewhere into the right position); they can also adjust the individual components of the generated image. The same source images can thus yield different results.

[Image: source collages and generated variants; the rightmost column is the AI's output]

In non-interactive mode (that is, the user does not arrange a collage but simply hands the AI a pile of small images), the AI can automatically assemble a natural-looking large image from the small images it receives.

Research team

Finally, a word about the research team behind it: a group of faculty and students in the Computer Science Department of Stanford University.

The paper's first author, Vishnu Sarukkai, is a graduate student in Stanford's Department of Computer Science, in the combined master's-PhD program. His main research interests are computer graphics, computer vision, and machine learning.

Co-author Linden Li is also a graduate student in Stanford's Department of Computer Science. During his studies, he interned at NVIDIA for four months, working with NVIDIA's deep learning research team on training vision transformer models with over 100M parameters.
