Tachyum FP8 Super-Sparsity Is Showing Path to Efficient Generative AI

Tachyum announced today the release of a new research paper to address how Prodigy, the world’s first Universal Processor, will transform the quality, efficiency, and economics of generative AI (GenAI).

“Unprecedented Scale and Efficiency in Generative AI with FP8 8:3 Super-Sparsity” offers technical information on how Prodigy can more effectively meet the computation and scale requirements of generative AI, which trains on massive data sets to create original results rather than identifying or analyzing known data. In general, the larger the training data set, the better and more accurate the output of GenAI. ChatGPT 3.5, a quintessential example of a generative AI model, has 175 billion trainable parameters, and ChatGPT 4.0 increases this by a factor of 10 to 1.76 trillion parameters, with another 10x increase possible in the near future.

Language models like ChatGPT, vision models, and other GenAI tools have improved dramatically due to successful scale-up, resulting in impressive few-shot capabilities close to those of a human being. The growing number of parameters requires corresponding increases in the computational resources used to train AI systems: high memory capacity, high processing performance, and high memory bandwidth to optimize the efficiency of large and dense models. Today, the scale of the largest AI computation is doubling every six months, outpacing Moore’s Law by 7x, and moving from generative AI to cognitive AI is expected to require 100-1000x more capacity.

To address memory and energy consumption, quantization reduces the precision of parameters as a means of compressing deep neural networks (DNNs). Similarly, pruning removes redundant or insensitive parameters to reduce density. While density is often necessary to train the model successfully, many parameters can be removed after training without any quality degradation.
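The two compression ideas described above can be illustrated with a minimal sketch. It is not Tachyum's method: it assumes plain symmetric 8-bit integer quantization (not FP8) and unstructured magnitude pruning, and the function names `quantize_int8`, `dequantize`, and `magnitude_prune` are hypothetical helpers chosen for illustration.

```python
import random


def quantize_int8(weights):
    """Symmetric uniform quantization to signed 8-bit codes (assumed scheme, not FP8)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed 8-bit range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 8-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude `keep_fraction` of the weights."""
    n_keep = max(1, int(len(weights) * keep_fraction))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for trained weights

pruned = magnitude_prune(weights, keep_fraction=0.5)   # drop the smallest half
codes, scale = quantize_int8(pruned)                   # store survivors in 8 bits
restored = dequantize(codes, scale)

max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print("kept weights:", sum(1 for w in pruned if w != 0.0))
print("max quantization error:", max_error)
```

In this sketch the rounding error of each stored weight is bounded by half the quantization step (`scale / 2`), which is the sense in which precision is traded for a smaller representation.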

In this paper, Tachyum shows how Prodigy overcomes the hardware inefficiencies that make GenAI cost-prohibitive and energy-excessive. Prodigy enables quantization using 8-bit floating point (FP8) with 8:3 block pruning, improving performance, power, and memory bandwidth to enable enormous model sizes. Tachyum’s recommendations significantly increase training speed and reduce the memory footprint of the model after training. Super-sparsity FP8 8:3 greatly reduces model size, which is important for language models, and also reduces power and area, which is important for edge and IoT applications.
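A rough sketch of what 8:3 block pruning could look like follows. The announcement does not define the scheme, so this assumes that "8:3" means keeping the 3 largest-magnitude values in every block of 8 consecutive weights, by analogy with other N:M structured-sparsity schemes. The function name `prune_block_8_3` is hypothetical, and the surviving values are left as ordinary Python floats rather than being converted to FP8.

```python
def prune_block_8_3(weights, block=8, keep=3):
    """Keep the `keep` largest-magnitude values in each block of `block` weights.

    Assumption: this is one plausible reading of "8:3" sparsity, not a
    description of Prodigy's actual hardware format.
    """
    if len(weights) % block != 0:
        raise ValueError("length must be a multiple of the block size")
    out = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        # Positions of the largest magnitudes within this block.
        top = sorted(range(block), key=lambda i: abs(chunk[i]), reverse=True)[:keep]
        kept = set(top)
        out.extend(w if i in kept else 0.0 for i, w in enumerate(chunk))
    return out


row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, -0.7, 0.15,
       0.01, 0.6, -0.4, 0.0, 0.25, -0.8, 0.1, 0.35]
sparse = prune_block_8_3(row)
print(sparse[:8])  # [0.9, 0.0, 0.0, -1.2, 0.0, 0.0, -0.7, 0.0]
print(sparse[8:])  # [0.0, 0.6, -0.4, 0.0, 0.0, -0.8, 0.0, 0.0]
```

Because every block keeps the same number of values, hardware can store only the survivors plus a small per-block position index, which is what makes a fixed pattern like this easier to accelerate than unstructured sparsity.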

“GenAI is a truly transformational technology, but its value cannot be realized, nor can it be widely adopted, without solving the hardware challenges of running such large models,” said Dr. Radoslav Danilak, founder and CEO of Tachyum. “With Prodigy poised to become the mainstream cost-efficient high performance processor in 2024, these compression approaches, together with hardware support, will enable even small to midsized enterprises and academic users to work with large, dense deep learning models.”

Because Prodigy offers more memory than currently available AI processors (2TB per socket using low-cost DRAM and up to 32TB per socket using TSV DDR5 DRAM, so that a 4-socket Prodigy platform supports 8TB of low-cost DRAM and up to 128TB of TSV DDR5 DRAM), a single Prodigy chip can replace more than 10 competitor units, delivering unprecedented performance, scalability and efficiency.
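As a back-of-the-envelope illustration of why memory capacity matters, the sketch below estimates weight storage for the parameter counts quoted earlier and compares it with the 2TB low-cost DRAM figure. Several things here are assumptions rather than figures from Tachyum: the FP16 dense baseline, decimal terabytes, the reading of 8:3 as 3 of every 8 weights kept, and the omission of sparsity index metadata, activations, and optimizer state. The helper `weight_bytes` is hypothetical.

```python
TB = 10**12  # decimal terabyte (assumption)


def weight_bytes(params, bits_per_value, kept_fraction=1.0):
    """Bytes needed to store the weights alone, ignoring any index metadata."""
    return params * kept_fraction * bits_per_value / 8


models = {"175B parameters": 175 * 10**9, "1.76T parameters": 1760 * 10**9}
formats = {
    "FP16 dense": (16, 1.0),   # assumed baseline, not stated in the article
    "FP8 dense": (8, 1.0),
    "FP8 8:3": (8, 3 / 8),     # assumes 3 of every 8 weights are kept
}
capacity = 2 * TB  # low-cost DRAM figure quoted for one socket

for model, params in models.items():
    for fmt, (bits, kept) in formats.items():
        size = weight_bytes(params, bits, kept)
        fits = "fits" if size <= capacity else "does not fit"
        print(f"{model}, {fmt}: {size / TB:.2f} TB ({fits} in 2 TB)")
```

Under these assumptions, the weights of the larger model need about 3.52TB in FP16, 1.76TB in dense FP8, and about 0.66TB in FP8 with 3 of 8 weights kept, so only the 8-bit variants fit within 2TB.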

FP8 8:3 models must be trained on Tachyum chips to achieve the proper computational efficiency. FP8 8:3 inference and generative AI IP is available now to partners and customers; a license includes all necessary software, which is process-independent.

As a Universal Processor offering utility for all workloads, Prodigy-powered data center servers can seamlessly and dynamically switch between computational domains (such as AI/ML, HPC, and cloud) on a single architecture. By eliminating the need for expensive dedicated AI hardware and dramatically increasing server utilization, Prodigy reduces CAPEX and OPEX significantly while delivering unprecedented data center performance, power, and economics. Prodigy integrates 192 high-performance custom-designed 64-bit compute cores to deliver up to 4.5x the performance of the highest-performing x86 processors for cloud workloads, up to 3x that of the highest-performing GPU for HPC, and 6x for AI applications.
