Google optimizes the diffusion model: a Samsung phone runs Stable Diffusion and generates an image in under 12 seconds

Hayo News
April 27th, 2023
Speed Is All You Need: Google proposes a set of optimizations for Stable Diffusion that dramatically accelerate image generation.

Stable Diffusion is as famous in image generation as ChatGPT is among dialogue models. It can create a realistic image from any given input text in tens of seconds. But because Stable Diffusion has more than 1 billion parameters, and because compute and memory on mobile devices are limited, the model has mainly run in the cloud.

Without careful design and implementation, running these models on-device can lead to high latency, due to the iterative denoising process, and to excessive memory consumption.

How to run Stable Diffusion on-device has therefore attracted broad research interest. Previously, researchers built an application that runs Stable Diffusion on an iPhone 14 Pro, generating an image in about one minute while using roughly 2 GiB of application memory.

Apple has also made optimizations here: on the iPhone, iPad, Mac, and other devices, its implementation generates a 512×512 image in about half a minute. Qualcomm followed suit, running Stable Diffusion v1.5 on an Android phone and generating 512×512 images in under 15 seconds.

Recently, in the paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations", Google ran Stable Diffusion 1.4 on GPU-equipped mobile devices and achieved state-of-the-art inference latency (generating a 512×512 image over 20 iterations takes only 11.5 seconds on a Samsung S23 Ultra). Moreover, the work is not tied to a single device; it is a general approach applicable to improving latent diffusion models broadly.

This research opens up many possibilities for running generative AI locally on a phone, with no data connection or cloud server required. Stable Diffusion was only released last fall, and it can already run on-device, which shows how fast this field is moving.

Paper address: https://arxiv.org/pdf/2304.11267.pdf

To achieve this generation speed, Google proposed several optimizations. Let's take a look at them.

Method overview

This research proposes optimizations that speed up large text-to-image diffusion models. The suggestions target Stable Diffusion, but they also apply to other large diffusion models.

First, look at the main components of Stable Diffusion: the text embedder, noise generation, the denoising neural network, and the image decoder, as shown in Figure 1 below.

Next, let's look at the three optimizations proposed in this study in detail.

Dedicated Kernels: Group Norm and GELU

The group normalization (GN) method works by dividing the channels of a feature map into smaller groups and normalizing each group independently. This makes GN less dependent on batch size and better suited to a wide range of batch sizes and network architectures. Instead of performing the reshape, mean, variance, and normalization operations sequentially, the research designed a unique kernel in the form of a GPU shader that performs all of these operations in a single GPU command, without materializing any intermediate tensors.
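To make the fused operations concrete, here is a NumPy sketch of the reshape → mean → variance → normalize chain that group normalization performs. This is only a reference implementation of the math; Google's kernel executes the same steps inside one GPU shader rather than as separate array operations.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Reference GroupNorm over an NCHW tensor.

    Spells out the reshape -> mean -> variance -> normalize steps that
    the paper's fused GPU shader performs in one pass.
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0
    # Split channels into groups: (N, G, C//G, H, W)
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)
```

After normalization, every group of channels has zero mean and unit variance, regardless of the batch size, which is what makes GN robust across architectures.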

The Gaussian Error Linear Unit (GELU), a commonly used activation function, involves many numerical operations, such as multiplication, addition, and the Gaussian error function. The study fused these operations, along with their accompanying split and multiplication operations, into a dedicated shader so that they execute in a single draw call.
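The individual operations being fused are visible in the exact GELU formula, 0.5 · x · (1 + erf(x / √2)). The NumPy sketch below just lists those steps explicitly; in the optimized version they all happen inside one shader invocation.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).

    Each arithmetic step (divide, erf, add, two multiplies) is one of
    the operations the paper fuses into a single draw call.
    """
    erf_v = np.vectorize(erf)  # elementwise Gaussian error function
    return 0.5 * x * (1.0 + erf_v(x / sqrt(2.0)))
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, e.g. gelu(1.0) ≈ 0.841.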

Improving the efficiency of the attention module

The text-to-image transformer in Stable Diffusion helps model the conditional distribution, which is crucial for text-to-image generation. However, the self-/cross-attention mechanism struggles with long sequences due to its memory and time complexity. The study therefore proposes two optimizations to alleviate this computational bottleneck.

First, to avoid performing the full softmax computation on a large matrix all at once, the study uses a GPU shader to reduce the operation in stages, greatly cutting the memory footprint of intermediate tensors and the overall latency. The specific method is shown in Figure 2 below.
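The reduction idea can be illustrated with an "online" softmax that processes a large score matrix in column chunks, maintaining only a running maximum and running sum per row instead of materializing exp of the whole matrix at once. This is a simplified NumPy sketch of the general technique, not Google's shader code; the chunk size is a hypothetical parameter.

```python
import numpy as np

def streaming_softmax(scores, chunk=64):
    """Numerically stable softmax computed over column chunks.

    Keeps a running row max m and running sum s, rescaling s whenever
    a later chunk raises the max -- the same reduction pattern lets a
    shader avoid building the full intermediate exp(scores) tensor.
    """
    m = np.full(scores.shape[0], -np.inf)  # running row max
    s = np.zeros(scores.shape[0])          # running row sum of exp
    for j in range(0, scores.shape[1], chunk):
        blk = scores[:, j:j + chunk]
        new_m = np.maximum(m, blk.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(blk - new_m[:, None]).sum(axis=1)
        m = new_m
    # Final pass: normalize with the completed max and sum
    return np.exp(scores - m[:, None]) / s[:, None]
```

The result is bit-for-bit a standard softmax; only the order of the reduction changes.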

Second, the study adopts FlashAttention [7], an IO-aware exact attention algorithm that requires fewer high-bandwidth memory (HBM) accesses than standard attention, improving overall efficiency.
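The core of FlashAttention is to visit keys and values tile by tile while accumulating a running max, running softmax denominator, and running weighted output, so the full N×N score matrix never touches memory. The following is a simplified single-head NumPy sketch of that tiling scheme (the tile size is illustrative); the real kernel also manages SRAM placement, which NumPy cannot express.

```python
import numpy as np

def tiled_attention(q, k, v, tile=32):
    """FlashAttention-style tiled attention (simplified, single head).

    Processes K/V in tiles, rescaling the accumulated output whenever a
    new tile raises the running score maximum. The N x N attention
    matrix is never materialized, which is what cuts HBM traffic.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)   # running row max of scores
    s = np.zeros(n)           # running softmax denominator
    o = np.zeros((n, d))      # running (unnormalized) output
    for j in range(0, k.shape[0], tile):
        scores = (q @ k[j:j + tile].T) * scale
        new_m = np.maximum(m, scores.max(axis=1))
        alpha = np.exp(m - new_m)            # rescale factor for old state
        p = np.exp(scores - new_m[:, None])
        s = s * alpha + p.sum(axis=1)
        o = o * alpha[:, None] + p @ v[j:j + tile]
        m = new_m
    return o / s[:, None]
```

Because the algorithm is exact (not an approximation), the output matches standard softmax attention up to floating-point error.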

Winograd convolution

Winograd convolution converts the convolution operation into a series of matrix multiplications, eliminating many elementwise multiplications and improving computational efficiency. However, it also increases memory consumption and numerical error, especially when larger tiles are used.

The backbone of Stable Diffusion relies heavily on 3×3 convolution layers, which account for over 90% of the layers in the image decoder. The study analyzes in depth the potential benefits of applying Winograd with different tile sizes to these 3×3 convolutions and finds a 4×4 tile size to be optimal, offering the best balance between computational efficiency and memory utilization.
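To show the multiply savings concretely, here is a NumPy sketch of one Winograd F(2×2, 3×3) tile, using the standard transform matrices from the Winograd minimal-filtering literature (this illustrates the general algorithm, not Google's specific 4×4-tile kernel). A 2×2 output patch is computed from a 4×4 input patch with only 16 elementwise multiplies instead of the 36 a direct 3×3 convolution needs.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices
B_T = np.array([[1, 0, -1,  0],
                [0, 1,  1,  0],
                [0, -1, 1,  0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """One Winograd tile: 2x2 output from a 4x4 input patch d and a 3x3
    filter g, using 16 elementwise multiplies instead of 36."""
    U = G @ g @ G.T        # transform the filter  -> 4x4
    V = B_T @ d @ B_T.T    # transform the input   -> 4x4
    M = U * V              # elementwise product: the only "real" multiplies
    return A_T @ M @ A_T.T # inverse transform     -> 2x2 output
```

Larger tiles such as F(4×4, 3×3) save even more multiplies per output but need more transform arithmetic and intermediate storage, which is the efficiency/memory trade-off the study tunes.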


The study was benchmarked on various devices: Samsung S23 Ultra (Adreno 740) and iPhone 14 Pro Max (A16). The benchmark test results are shown in Table 1 below:

It is clear that as each optimization is enabled, latency steadily decreases, i.e., image generation gets faster. Compared with the baseline, latency drops by 52.2% on the Samsung S23 Ultra and by 32.9% on the iPhone 14 Pro Max. The study also measured end-to-end latency on the Samsung S23 Ultra: generating a 512×512 image over 20 denoising iterations took under 12 seconds, a state-of-the-art result.

What does a small device running its own generative AI model mean for the future? We can expect a wave of new applications.

Reprinted from 机器之心 (Synced), by 陈萍 and 小舟.

