Images generated on a mobile phone in 0.2 seconds, currently the fastest: Google builds the ultra-fast diffusion model MobileDiffusion

Hayo News
December 4th, 2023

Running large generative AI models such as Stable Diffusion on mobile devices such as phones has become a hot topic in the industry, and generation speed is the main constraint.

Recently, a paper from Google, "MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices," proposed what it reports as the fastest text-to-image generation on mobile devices, taking only 0.2 seconds on the iPhone 15 Pro. The paper comes from the same team as UFOGen. Besides building an ultra-small diffusion model, it adopts the currently popular Diffusion-GAN route for sampling acceleration.

Paper address: https://arxiv.org/abs/2311.16567

Below are the results generated by MobileDiffusion in one step.

So, how is MobileDiffusion optimized?

Let's start with why optimization is necessary.

The most popular text-to-image generation today is based on diffusion models. Thanks to the strong basic image generation capability of pre-trained models and their robustness under downstream fine-tuning, diffusion models have shown extraordinary performance in applications such as image editing, controllable generation, personalized generation, and video generation.

However, as foundation models, they have two obvious shortcomings. First, the large number of parameters makes computation slow, especially when resources are limited. Second, diffusion models need multiple sampling steps, which slows inference further. Take the most popular model, Stable Diffusion 1.5 (SD), as an example. Its base model contains nearly 1 billion parameters. We quantized the model and ran inference on an iPhone 15 Pro, and 50 sampling steps took close to 80 seconds. Such heavy resource requirements and a laggy user experience greatly limit its use on mobile devices.
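To make the two bottlenecks concrete, the figures quoted above can be turned into a rough per-step cost. This is only a back-of-the-envelope sketch: it assumes every sampling step costs the same and ignores the text encoder and image decoder, and the function names are hypothetical.

```python
# Figures quoted in the article for quantized SD 1.5 on an iPhone 15 Pro.
SD15_STEPS = 50
SD15_TOTAL_SECONDS = 80.0  # "close to 80 seconds", so treat as approximate


def per_step_seconds(total_seconds: float, steps: int) -> float:
    """Average wall-clock time of one sampling step."""
    return total_seconds / steps


def sampling_seconds(step_seconds: float, steps: int) -> float:
    """Total sampling time, assuming every step costs the same."""
    return step_seconds * steps


step = per_step_seconds(SD15_TOTAL_SECONDS, SD15_STEPS)
print(f"per step: {step:.2f} s")
for n in (50, 8, 1):
    print(f"{n:>2} steps -> {sampling_seconds(step, n):.1f} s")
```

Under these assumptions a single step of the unmodified model already costs well over one second, so cutting the number of steps alone would not reach 0.2 seconds; the model itself also has to become cheaper.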

To solve these problems, MobileDiffusion addresses each one in turn. (1) For the large model size, we ran extensive experiments and optimizations on the core component, the UNet. These include simplifying computationally expensive convolutions, moving attention operations to the lower-resolution layers, and optimizations targeted at mobile devices, such as the choice of activation functions. (2) For the multi-step sampling that diffusion models require, MobileDiffusion explores and applies few-step and one-step inference techniques such as Progressive Distillation and the current state-of-the-art UFOGen.

Model optimization

MobileDiffusion is optimized from the SD 1.5 UNet, currently the most popular in the open-source community. After each optimization step, the performance loss relative to the original UNet is measured using two common metrics: FID and CLIP score.

Macro design

The left side of the picture above shows the design of the original UNet. It consists mainly of Convolution and Transformer blocks, and each Transformer contains both Self-Attention and Cross-Attention.

MobileDiffusion's optimization of the UNet rests on two core ideas. 1) Streamline Convolution. It is well known that Convolution on high-resolution feature maps is very time-consuming, and full Convolution carries a large number of parameters. 2) Improve Attention efficiency. As with Convolution, Attention at high resolution is expensive because it is computed over the length of the entire feature map. The complexity of Self-Attention is quadratic in the flattened length of the feature map, while that of Cross-Attention is linear in it.
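The scaling argument can be sketched with a rough operation count. The snippet below counts only the two matrix products inside attention and ignores the projections; the feature map sizes, the text length of 77 tokens, and the channel width are illustrative assumptions, not numbers taken from the paper.

```python
def self_attention_macs(height: int, width: int, dim: int) -> int:
    """Multiply-adds for QK^T and the weighted sum over V: quadratic in H*W."""
    n = height * width
    return 2 * n * n * dim


def cross_attention_macs(height: int, width: int, text_len: int, dim: int) -> int:
    """Same two products, but keys and values come from the text: linear in H*W."""
    n = height * width
    return 2 * n * text_len * dim


DIM = 64        # hypothetical channel width, kept fixed to isolate the scaling
TEXT_LEN = 77   # assumed prompt length in tokens

for side in (64, 32, 16, 8):
    sa = self_attention_macs(side, side, DIM)
    ca = cross_attention_macs(side, side, TEXT_LEN, DIM)
    print(f"{side:>2}x{side:<2}  self-attn {sa:>13,}  cross-attn {ca:>11,}")
```

Halving the side of the feature map divides the flattened length by 4, so the self-attention count drops by a factor of 16 and the cross-attention count by a factor of 4. This is why attention on the lowest-resolution maps is so much cheaper.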

Experiments show that moving all 16 of the UNet's Transformers to the inner layers with the lowest feature resolution, and removing one convolution from each layer, has no significant impact on performance. As a result, MobileDiffusion reduces the original 22 Convolutions and 16 Transformers to 11 Convolutions and about 12 Transformers. All of the attention is computed on low-resolution feature maps, which makes it far more efficient. Overall this brings about a 40% efficiency improvement and a 40% reduction in parameters. The final model is shown on the right side of the figure above. A comparison with more models follows:

Micro design

Only a few of the novel designs are introduced here. Interested readers can consult the paper for more detail.

Decouple Self-Attention and Cross-Attention

The Transformer in a traditional UNet contains both Self-Attention and Cross-Attention. MobileDiffusion places all Self-Attention on the lowest-resolution feature map but retains one Cross-Attention in the middle layers. This design was found to improve computing efficiency while also safeguarding the quality of the generated images.

Finetune softmax into relu

Softmax is notoriously hard to parallelize in most unoptimized settings and is therefore inefficient. MobileDiffusion proposes fine-tuning the softmax function directly into relu, which is more efficient because relu is a pointwise activation. Surprisingly, after only about 10,000 fine-tuning steps, the model's metrics improved and image quality was preserved. The advantage of relu over softmax here is therefore clear.
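The difference between the two functions can be illustrated on a single row of attention scores. This is a minimal pure-Python sketch of the general idea, not the paper's implementation: the helper names are hypothetical, and any extra scaling that a relu-based attention may apply is not covered by the article and is left out.

```python
import math


def softmax_weights(scores):
    """Row-wise softmax: every output needs the max and the sum of the whole row."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_weights(scores):
    """Pointwise relu: every output depends only on its own input."""
    return [max(0.0, s) for s in scores]


def attend(weights, values):
    """Weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


scores = [2.0, -1.0, 0.5]                       # one query against three keys
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy value vectors

print(attend(softmax_weights(scores), values))
print(attend(relu_weights(scores), values))
```

Softmax needs two reductions over the row (a max and a sum) before any output can be produced, while relu can be applied to each score independently. Because the resulting weights differ, the model has to be fine-tuned after the swap, as described above.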

Separable Convolution

The key to MobileDiffusion's parameter reduction is the use of Separable Convolution. This technique has been shown to be extremely effective by work such as MobileNet, especially on mobile devices, but it is rarely used in generative models. MobileDiffusion's experiments found that Separable Convolution is very effective at reducing parameters, especially when placed in the innermost layers of the UNet, and the analysis shows no loss in model quality.
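The parameter saving follows from how a depthwise separable convolution factorizes a standard one, as in MobileNet: a per-channel k x k depthwise filter followed by a 1 x 1 pointwise convolution. The sketch below counts weights only and ignores biases; the kernel size and channel width are illustrative assumptions, not values quoted from the paper.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a full k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise mix."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


K, C = 3, 1280  # assumed kernel size and channel width for an inner layer

full = standard_conv_params(K, C, C)
sep = separable_conv_params(K, C, C)
print(f"standard:  {full:,}")
print(f"separable: {sep:,}")
print(f"ratio:     {full / sep:.1f}x")
```

The saving grows with the channel count, which is consistent with the observation that the technique pays off most in the innermost, widest layers.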

Sampling optimization

The most commonly used sampling optimization methods at present include Progressive Distillation and UFOGen, which achieve 8-step and 1-step sampling respectively. To show that these sampling methods still apply after the model has been drastically simplified, MobileDiffusion verified both experimentally.
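As background, Progressive Distillation in its original formulation trains a student to match two teacher steps with one, so each round halves the number of sampling steps. The sketch below only enumerates such a schedule; the starting step count is hypothetical and the article does not say which schedule MobileDiffusion used.

```python
def halving_schedule(start_steps: int, target_steps: int) -> list:
    """Step counts after each distillation round, halving until the target."""
    if target_steps < 1 or start_steps < target_steps:
        raise ValueError("need start_steps >= target_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > target_steps:
        # Never go below the target, even if the halving overshoots it.
        schedule.append(max(target_steps, schedule[-1] // 2))
    return schedule


schedule = halving_schedule(64, 8)  # 64 is an assumed starting point
print(schedule)
print(f"distillation rounds: {len(schedule) - 1}")
```

UFOGen follows a different route, combining diffusion with a GAN objective to sample in a single step, so it is not covered by this sketch.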

The comparison with the baseline models before and after sampling optimization is as follows. The 8-step and 1-step models still score very well after sampling optimization.

Experiments and Applications

Mobile Benchmarks

MobileDiffusion achieves the fastest image generation speed on the iPhone 15 Pro: 0.2 s!

Downstream task testing

MobileDiffusion explores downstream tasks including ControlNet/Plugin and LoRA fine-tuning. As the figure below shows, MobileDiffusion retains excellent fine-tuning capability after model and sampling optimization.

MobileDiffusion explores a variety of model and sampling optimization methods and ultimately achieves sub-second image generation on mobile devices while preserving downstream fine-tuning applications. We believe this will influence the design of efficient diffusion models and broaden their mobile applications.

Reprinted from 机器之心