Intel Habana Gaudi Beats Nvidia's H100 in Vision-Language AI Models: Hugging Face
If Intel has its way, Nvidia won't be king-of-the-AI-hill for long.
A new fine-tuning performance benchmark for BridgeTower, a Vision-Language (VL) AI model, has shown that there's life in the AI acceleration camp beyond Nvidia's green. While Nvidia does dominate the AI acceleration market (through exceptional foresight, a well-thought-out and documented software stack, and pure processing performance), other players are keen to take a piece of the AI market for themselves. And at least for BridgeTower, Intel's own Gaudi 2 silicon (the product of Intel's $2 billion acquisition of Habana in 2019) has been shown by Hugging Face to outperform Nvidia's A100 80 GB by a staggering 2.5x - and it even beats Nvidia's prodigy child, the H100, by 1.4x.
According to Habana, the momentous speedups are the result of a hardware-accelerated data-loading system. Data loading is one of the bottlenecks for AI model fine-tuning, and especially so for VL models. Loading a workload into memory is often a performance bottleneck wherever computing happens, so it's not that far out of left field that Habana would look to optimize this particular step in the training process.
The main bottleneck relates to how CPUs get hammered with many costly operations such as image decoding and image augmentation (a similar issue to the GPU draw-call debate), which leads the HPU (or Nvidia GPU) to stall while waiting for further data to be processed (by the CPU) and then sent over to the AI accelerator of choice. This is how the process goes without any hardware acceleration:
1. Fetch data (e.g. where your JPEG images are stored on disk)
2. The CPU reads encoded images
3. The CPU decodes images
4. The CPU applies image transformations to augment images
5. Images are sent to devices (although this is usually not done by the dataloader itself)
And this is the process with Gaudi 2's integrated hardware acceleration, which moves image decoding and transformation onto the device:
1. Fetch data
2. The CPU reads encoded images
3. Encoded images are sent to devices
4. Devices decode images
5. Devices apply image transformations to augment images
With hardware acceleration, it becomes clear that the CPU has far less work to do (freeing up CPU cycles for other tasks within the main fine-tuning process), which should result in improved performance.
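To make the stall described above concrete, here is a minimal toy model in Python. It is only a sketch: the per-stage costs, the device rate, and the names `STAGE_COST_MS`, `cpu_ms_per_sample`, and `throughput` are hypothetical placeholders rather than measurements or code from the benchmark, and the model ignores the time the device itself spends decoding and augmenting. It simply treats the training loop as running at the slower of two rates: how fast the CPU can prepare samples, and how fast the accelerator can consume them.

```python
# Toy model of a loader-bound training loop. Every number here is a
# hypothetical placeholder, not a measurement from the benchmark.

# Hypothetical CPU cost of each loading stage, in milliseconds per image.
STAGE_COST_MS = {"read": 0.5, "decode": 3.0, "augment": 1.5}

# Stages that stay on the CPU in each of the two pipelines.
CPU_ONLY_PIPELINE = ["read", "decode", "augment"]
OFFLOADED_PIPELINE = ["read"]  # decode and augment move to the device


def cpu_ms_per_sample(cpu_stages):
    """Total CPU time spent per image for the given CPU-side stages."""
    return sum(STAGE_COST_MS[stage] for stage in cpu_stages)


def throughput(cpu_stages, loader_procs, device_samples_per_s):
    """Samples per second the loop sustains: the slower side sets the pace."""
    loader_rate = loader_procs * 1000.0 / cpu_ms_per_sample(cpu_stages)
    return min(loader_rate, device_samples_per_s)


if __name__ == "__main__":
    for name, stages in [("CPU-only", CPU_ONLY_PIPELINE),
                         ("offloaded", OFFLOADED_PIPELINE)]:
        rate = throughput(stages, loader_procs=2, device_samples_per_s=800.0)
        print(f"{name}: {cpu_ms_per_sample(stages)} ms CPU per sample, "
              f"{rate:.0f} samples/s")
```

With these made-up costs, the CPU-only pipeline caps the loop below what the device could consume, while the offloaded pipeline leaves the device as the limiting factor. Real workloads will differ; the point is only the shape of the bottleneck.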
Benchmarking Habana's Gaudi 2 by fine-tuning a pre-trained BridgeTower checkpoint with 866M parameters allows us to see the performance gains that hardware-accelerated image loading brings to the table. The workloads were run in a distributed setup across 8 devices each (of Nvidia's A100 80 GB, H100, and Gaudi 2). The results were measured and averaged across three different runs, with each run dedicating more CPU processes to loading data into memory (the first run loads data within the main CPU process, while runs two and three add one and two dedicated data-loading processes, respectively).
Dataloading performance across Gaudi 2, Nvidia A100, and Nvidia H100. Units expressed in samples per second. [Table: rows include Gaudi 2 HPU and A100 80 GB GPU; columns include dataloader_num_workers=2 + mediapipe_dataloader.]
The results are clear: in relative terms, the best-case scenario for Gaudi 2 is the first, where data is loaded alongside the main training process, with Gaudi 2 besting even Nvidia's H100 by 1.79x, and the A100 by 2.23x. But this is a non-optimized scenario, as Habana itself admitted; so perhaps the most revealing results come from the third datapoint, where two additional processes were spawned to handle data loading outside of the main fine-tuning process. There, Nvidia's products certainly have to squint to catch Gaudi 2's dust cloud as it runs into the distance: Gaudi 2 delivers 1.3x the performance of Nvidia's cream-of-the-crop H100, and 2.23x that of the A100 80 GB.
It would be possible to spawn additional processes to handle data loading; but as can be seen from the performance progression, that strategy would bring rapidly diminishing returns. On the Nvidia H100, for instance, performance is improved by 1.72x by spawning a single dedicated data-loading process, but going from one process to two only brings an additional 3% improvement. Due to Habana's ability to bring most data-loading steps into Gaudi 2, however, the company can unlock an additional 10% performance improvement against its own best CPU-loaded score (where data loading and transformations are handled by two CPU processes).
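The diminishing returns described above can be checked with a few lines of arithmetic. The sketch below chains only the ratios quoted in this article (1.72x, 3%, and 10%); the function name `relative_throughput` and the constant names are illustrative, and no absolute samples-per-second figures are assumed.

```python
# Ratios quoted in the article; the names are illustrative only.
H100_FIRST_WORKER_SPEEDUP = 1.72  # main-process loading -> one dedicated process
H100_SECOND_WORKER_GAIN = 0.03    # one -> two dedicated processes
GAUDI2_HW_LOADING_GAIN = 0.10     # Gaudi 2 hardware loading vs. its two-process score


def relative_throughput(factors):
    """Chain multiplicative speedups; returns throughput relative to the baseline."""
    levels = [1.0]
    for factor in factors:
        levels.append(levels[-1] * factor)
    return levels


h100 = relative_throughput([H100_FIRST_WORKER_SPEEDUP, 1 + H100_SECOND_WORKER_GAIN])
gaudi2 = relative_throughput([1 + GAUDI2_HW_LOADING_GAIN])

for workers, level in enumerate(h100):
    print(f"H100, {workers} dedicated loader(s): {level:.2f}x baseline")
print(f"Gaudi 2, hardware loading: {gaudi2[-1]:.2f}x its two-process score")
```

Read this way, the first dedicated process adds about 0.72x of the H100's baseline throughput, while the second adds only around 0.05x more, which is why spawning further CPU processes looks unattractive next to moving the work onto the device.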
There's still a long way to go before any company can claim hegemony in the AI-acceleration space. Nvidia has an incredible product and software stack that has allowed it to gain the first-mover advantage; but we've seen enough races where the underdogs catch up to (and sometimes even surpass) the favorites to know that Intel, AMD and others are all looking to steal Nvidia's thunder.