AI emoticon pack generator: to let AI generate images containing Chinese characters, OPPO and others propose GlyphDraw


Hayo News
April 10th, 2023

Researchers have gone to great lengths to get legible, real text into AI-generated images.

Text-to-image generation has recently seen many unexpected breakthroughs, and many models can now create high-quality, diverse images from text instructions. Although the generated images are already very realistic, current models are mostly good at real-world content such as landscapes and objects. They struggle with images that demand highly coherent fine detail, such as images containing text with complex glyphs like Chinese characters.

To solve this problem, researchers from OPPO and other institutions proposed GlyphDraw, a general learning framework that aims to enable a model to generate images with coherent text embedded in them. This is the first work in the field of image synthesis to address Chinese character generation.

Let's first look at some generated results, such as a warning sign for an exhibition hall:

A billboard:

A short text caption added to a picture, where the text style can also vary:

The most interesting and practical example is generating emoticons:

Although the results still have some flaws, the generation quality is already very good. The main contributions of this study include:

  • This research proposes GlyphDraw, the first Chinese character image generation framework. It uses auxiliary information, including Chinese character glyphs and positions, to provide fine-grained guidance throughout the generation process, so that high-quality Chinese character images can be seamlessly embedded into the image;
  • This study proposes an effective training strategy that limits the number of trainable parameters in the pre-trained model to prevent overfitting and catastrophic forgetting. This preserves the model's strong open-domain generation performance while achieving accurate Chinese character image generation;
  • This study describes how the training dataset was constructed and proposes a new benchmark that uses OCR models to evaluate the quality of Chinese character image generation. On this benchmark, GlyphDraw achieved a generation accuracy of 75%, significantly better than previous image synthesis methods.

Model Introduction

The researchers first designed an elaborate strategy for constructing the image-text dataset, and then proposed GlyphDraw, a general learning framework built on the open-source image synthesis algorithm Stable Diffusion, as shown in Figure 2 below.

The overall training goal of Stable Diffusion can be expressed as the following formula:
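
The formula did not survive in this copy of the article. As a reconstruction, assuming the standard latent diffusion objective that Stable Diffusion is trained with (the encoder symbol $\mathcal{E}$, the clean image $x_0$, and the noise predictor $\epsilon_\theta$ are borrowed from that formulation; $z_t$ and $C$ are the noisy latent and the condition referred to in the text):

```latex
\mathcal{L}_{\mathrm{LDM}}
  = \mathbb{E}_{\mathcal{E}(x_0),\, C,\, \epsilon \sim \mathcal{N}(0, 1),\, t}
    \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, C) \right\rVert_2^2 \right]
```

Here $t$ is the diffusion timestep and $\epsilon$ is the Gaussian noise added to the latent; the network is trained to predict that noise from $z_t$, $t$, and $C$.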

GlyphDraw builds on the cross-attention mechanism in Stable Diffusion: the original input latent vector z_t is replaced by the concatenation of the image latent vector z_t, the text mask l_m, and the glyph image l_g.
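
The concatenation can be pictured as stacking the three inputs along the channel axis. The following is a minimal pure-Python sketch, not the authors' code: the helper names, the 8×8 spatial size, and the channel counts (4 for z_t, 1 each for l_m and l_g) are assumptions made for illustration only.

```python
# Tensors are modelled as nested lists shaped [channels][height][width].
# Shapes and fill values are illustrative assumptions, not the paper's.

def make_tensor(channels, height, width, fill):
    """Build a [channels][height][width] nested list filled with one value."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]

def shape(tensor):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))

def concat_channels(*tensors):
    """Concatenate along the channel axis; height and width must agree."""
    _, height, width = shape(tensors[0])
    out = []
    for t in tensors:
        _, h, w = shape(t)
        if (h, w) != (height, width):
            raise ValueError("all inputs must share the same height and width")
        out.extend(t)  # append this tensor's channels after the previous ones
    return out

z_t = make_tensor(4, 8, 8, 0.5)  # noisy image latent (4 channels assumed)
l_m = make_tensor(1, 8, 8, 0.0)  # text mask (1 channel assumed)
l_g = make_tensor(1, 8, 8, 1.0)  # glyph image (1 channel assumed)

unet_input = concat_channels(z_t, l_m, l_g)
print(shape(unet_input))  # (6, 8, 8)
```

Under these assumptions the denoising network sees a 6-channel input instead of the original 4, so its first layer would need to accept the extra channels.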

Furthermore, the condition C is augmented with mixed glyph and text features through a domain-specific fusion module. Introducing the text mask and glyph information gives the whole training process fine-grained diffusion control, which is key to improving the model's performance and ultimately makes it possible to generate images containing Chinese text.

Specifically, the pixel representation of text, especially of complex scripts such as pictographic Chinese characters, differs significantly from that of natural objects. For example, the Chinese word "天空" (sky) is composed of multiple strokes arranged in a two-dimensional structure, whereas its corresponding natural image is "a blue sky dotted with white clouds". By comparison, Chinese characters are extremely fine-grained: even a tiny shift or deformation of a stroke renders the text incorrectly and ruins the generated image.

Embedding characters in a natural image background raises another key issue: the generation of text pixels must be controlled precisely without affecting adjacent natural image pixels. To render accurate Chinese characters on natural images, the authors carefully designed two key components integrated into the diffusion synthesis model, namely position control and glyph control.

Unlike the global conditional input used by other models, character generation needs to pay more attention to specific local regions of the image, because the latent feature distribution of character pixels is quite different from that of natural image pixels. To keep model learning from collapsing, this study proposes fine-grained position-region control to decouple the distributions of different regions.
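
One simple way to picture a position-control signal is a binary mask that marks where the text should appear. The sketch below assumes an axis-aligned rectangular text region; the function name `box_mask` and the box format are hypothetical, and the article does not say what shape of region GlyphDraw actually uses.

```python
def box_mask(height, width, box):
    """Return a height x width binary mask: 1 inside the box, 0 elsewhere.

    box = (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]

# Hypothetical 6x8 canvas with the text region covering rows 2-3, columns 1-6.
mask = box_mask(6, 8, (2, 1, 4, 7))
for row in mask:
    print("".join(str(v) for v in row))

text_pixels = sum(sum(row) for row in mask)
print(text_pixels)  # 12
```

A mask like this separates "text" locations from "background" locations, which is the kind of region-level signal the paragraph above describes.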

Besides position control, another important issue is the fine control of Chinese character stroke synthesis. Considering the complexity and diversity of Chinese characters, it is extremely difficult to learn from a large image-text dataset without any explicit prior knowledge. In order to generate Chinese characters accurately, this study incorporates explicit glyph images as additional conditional information into the model diffusion process.

Experiment and Results

Since no dataset dedicated to Chinese character image generation existed before, the researchers first built a benchmark dataset, ChineseDrawText, for qualitative and quantitative evaluation. They then tested and compared the generation accuracy of several methods on ChineseDrawText, as evaluated by an OCR recognition model.
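
OCR-based accuracy can be sketched as the fraction of generated images whose recognized text matches the requested text. The exact matching rule used for ChineseDrawText is not given in this article, so the sketch below assumes exact string match per image. The function name `ocr_accuracy` is hypothetical, the OCR step is represented only by its output strings, and the sample strings are toy data, not benchmark results.

```python
def ocr_accuracy(expected_texts, recognized_texts):
    """Fraction of images whose OCR output exactly matches the target text."""
    if len(expected_texts) != len(recognized_texts):
        raise ValueError("expected and recognized lists must have equal length")
    if not expected_texts:
        raise ValueError("need at least one sample")
    hits = sum(
        1
        for expected, recognized in zip(expected_texts, recognized_texts)
        if expected == recognized.strip()  # ignore stray whitespace from OCR
    )
    return hits / len(expected_texts)

# Toy data: the text requested in each prompt ...
expected = ["禁止吸烟", "欢迎光临", "小心地滑", "开心", "加油"]
# ... and what a (hypothetical) OCR model read from each generated image.
recognized = ["禁止吸烟", "欢迎光临 ", "小心地猾", "开心", "加油"]

acc = ocr_accuracy(expected, recognized)
print(f"{acc:.0%}")  # 80%
```

A stricter or looser rule (for example, per-character accuracy) would give different numbers, so this should be read as one plausible scoring scheme rather than the benchmark's definition.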

The GlyphDraw model proposed in this study achieved an average accuracy of 75% by effectively using auxiliary glyph and position information, demonstrating the model's excellent character image generation capabilities. The visual comparison results of several methods are shown in the figure below:

In addition, by limiting the number of trainable parameters, GlyphDraw preserves open-domain image synthesis performance: the FID for general image synthesis on MS-COCO FID-10k degrades by only 2.3.
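
The idea of limiting trainable parameters can be sketched as partitioning a model's parameters into a small trainable subset and a frozen remainder. The article does not say which layers GlyphDraw actually trains, so every parameter name, size, and prefix below is hypothetical and chosen only to illustrate the mechanism.

```python
def split_trainable(param_sizes, trainable_prefixes):
    """Split parameters into (trainable, frozen) dicts by name prefix."""
    prefixes = tuple(trainable_prefixes)
    trainable = {n: s for n, s in param_sizes.items() if n.startswith(prefixes)}
    frozen = {n: s for n, s in param_sizes.items() if not n.startswith(prefixes)}
    return trainable, frozen

# Hypothetical parameter names mapped to hypothetical parameter counts.
param_sizes = {
    "unet.down.conv.weight": 1000,
    "unet.attn.to_q.weight": 200,
    "unet.attn.to_k.weight": 200,
    "unet.attn.to_v.weight": 200,
    "fusion.proj.weight": 400,
}

# Assumed allow-list: only these prefixes are updated during fine-tuning.
trainable, frozen = split_trainable(
    param_sizes, ["unet.attn.to_k", "unet.attn.to_v", "fusion."]
)

fraction = sum(trainable.values()) / sum(param_sizes.values())
print(sorted(trainable))
print(f"trainable fraction: {fraction:.0%}")  # trainable fraction: 40%
```

Keeping most pre-trained weights frozen is what lets the original open-domain behaviour survive fine-tuning, which is consistent with the small FID change reported above.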

Interested readers can read the original text of the paper for more research details.

Reprinted from 机器之心

