
Dharmendra S. Modha

My Work and Thoughts.


Archives for 2024

Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole

September 26, 2024 By dmodha

New: As presented at the IEEE HPEC Conference (High Performance Extreme Computing) today, exciting new results from IBM Research demonstrate that for a 3-billion-parameter LLM, a compact 2U research prototype system using the IBM AIU NorthPole inference chip delivers an astounding 28,356 tokens/sec of system throughput and sub-1 ms/token (per-user) latency. NorthPole is optimized for the two conflicting objectives of energy efficiency and low latency. In the regime of low latency, NorthPole (in 12 nm) provides 72.7x better energy efficiency (tokens/second/W) versus a state-of-the-art 4 nm GPU. In the regime of high energy efficiency, NorthPole (in 12 nm) provides 46.9x better latency (ms/token) versus a 5 nm GPU.
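For concreteness, the two metrics quoted above can be written out as simple formulas. A minimal sketch in Python; the per-user token rate and the power figure below are hypothetical placeholders for illustration (the post quotes relative, not absolute, energy numbers):

```python
# The two metrics used above, written out explicitly.

def latency_ms_per_token(user_tokens_per_sec: float) -> float:
    """Per-user latency: milliseconds between successive tokens for one user."""
    return 1000.0 / user_tokens_per_sec

def energy_efficiency(system_tokens_per_sec: float, system_power_watts: float) -> float:
    """System energy efficiency in tokens/second/W."""
    return system_tokens_per_sec / system_power_watts

# Sub-1 ms/token latency means each user sees more than 1,000 tokens/sec:
print(latency_ms_per_token(1250))  # 0.8 ms/token

# With the published 28,356 tokens/sec and a *hypothetical* 1 kW system draw:
print(energy_efficiency(28356, 1000.0))  # 28.356 tokens/sec/W
```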

NorthPole is a brain-inspired, silicon-optimized chip architecture suitable for neural inference that was published in October 2023 in Science Magazine. It is the result of nearly two decades of work at IBM Research and a 14+ year partnership with the United States Department of Defense (Defense Advanced Research Projects Agency, Office of the Under Secretary of Defense for Research and Engineering, and Air Force Research Laboratory).

NorthPole balances the two conflicting objectives of energy efficiency and low latency.

First, because LLMs demand substantial energy resources for both training and inference, a sustainable future computational infrastructure is needed to enable their efficient and widespread deployment. Energy efficiency of data centers is becoming critical as their carbon footprints expand, and as they become increasingly energy-constrained. According to the World Economic Forum, “At present, the environmental footprint is split, with training responsible for about 20% and inference taking up the lion’s share at 80%. As AI models gain traction across diverse sectors, the need for inference and its environmental footprint will escalate.”

Second, many applications, such as interactive dialog and agentic workflows, require very low latencies. Within a given computer architecture, latency can be decreased by decreasing throughput; however, that in turn decreases energy efficiency. To paraphrase a classic systems maxim, “Throughput problems can be cured with money. Latency problems are harder because the speed of light is fixed.”
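The tension can be seen in a toy batching model: amortizing a fixed per-step cost over a larger batch raises throughput, but every user then waits on the longer step. The constants below are invented purely for illustration and bear no relation to NorthPole's actual figures.

```python
# Toy model of the throughput/latency tension: a forward step has a fixed
# overhead plus a per-request cost, and each batched request yields one token
# per step. Both constants are hypothetical.
OVERHEAD_MS = 0.5      # fixed cost per forward step
PER_SAMPLE_MS = 0.05   # incremental cost per batched request

def step_time_ms(batch_size: int) -> float:
    return OVERHEAD_MS + PER_SAMPLE_MS * batch_size

def throughput_tokens_per_sec(batch_size: int) -> float:
    # batch_size tokens emitted per step, converted to per-second
    return batch_size * 1000.0 / step_time_ms(batch_size)

def per_user_latency_ms(batch_size: int) -> float:
    # each user receives one token per step, so latency equals step time
    return step_time_ms(batch_size)

for b in (1, 16, 256):
    print(b, round(throughput_tokens_per_sec(b)), per_user_latency_ms(b))
```

Growing the batch from 1 to 256 raises throughput roughly tenfold in this model, but per-user latency grows from 0.55 ms to over 13 ms per token: throughput and latency pull in opposite directions.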

Caption: NorthPole (12 nm) performance relative to current state-of-the-art GPUs (7 / 5 / 4 nm) on energy and system latency metrics, where system latency is the total latency experienced by each user. At the lowest GPU latency (H100, point P2), NorthPole provides 72.7x better energy metric (tokens/sec/W). At the best GPU energy metric (L4, point P1), NorthPole provides 46.9x lower latency.
Caption: Exploded view of the research prototype appliance showing installation of the 16 NorthPole PCIe cards. NorthPole cards can communicate via the standard PCIe endpoint model through the host or directly, and more efficiently, with one another via additional hardware features on each card.
Caption: Strategy for mapping the 3-billion-parameter LLM to the 16-card NorthPole appliance. Each transformer layer is mapped to one NorthPole card and the output layer is mapped to two cards (left). For each layer, all weights and KV cache are stored on-chip, so only the small embedding tensor produced by each card’s layer must be forwarded to the next card over low-bandwidth PCIe when generating a token. Within each transformer layer (right), weights and KV cache are stored at INT4 precision. Activations are also INT4 except when higher dynamic range is needed for accumulations.
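As a rough illustration of the INT4 storage format the caption mentions, here is a generic symmetric per-tensor quantizer in plain Python. NorthPole's actual quantization scheme is not described in this post, so treat this strictly as a sketch of the general technique.

```python
# Generic symmetric INT4 quantization: each value is stored as an integer in
# [-8, 7], with one shared floating-point scale per tensor.

def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7.0  # map the max magnitude near +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.12, 0.07, -0.44]
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)  # approximate reconstruction of w
```

Storing weights and KV cache this way cuts memory to roughly a quarter of FP16, which is what makes it plausible to hold an entire transformer layer on-chip; accumulations still need wider arithmetic, matching the caption's note about higher dynamic range.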

PDF of the Accepted Version.

Download: NorthPole_HPEC_LLM_2024

Future: Next research and development steps are further optimizations of energy-efficiency; mapping larger LLMs (8B, 13B, 20B, 34B, 70B) on correspondingly larger NorthPole appliances; new LLM models co-optimized with NorthPole architecture; and future system and chip architectures.

Caption: IBM AIU NorthPole rack under construction!
Design Credit: Ryan Mellody, Susana Rodriguez de Tembleque, William Risk, Map Project Office

Filed Under: Papers

Breakthrough edge AI inference performance using NorthPole in 3U VPX form factor

September 26, 2024 By dmodha

New: As presented at the IEEE HPEC Conference (High Performance Extreme Computing) today, the IBM AIU NorthPole chip has been incorporated into a compact, rugged 3U VPX form-factor module (NP-VPX), delivering high performance and energy efficiency for edge AI inference. NP-VPX processes 965 frames per second (fps) on a Yolo-v4 network with 640×640-pixel images at 73.5 W at full-precision accuracy, achieving 13.2 frames/J (fps/W). It processes over 40,300 fps on a ResNet-50 network with 224×224-pixel images at 65.9 W at full-precision accuracy, achieving 611 frames/J.
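Since one watt is one joule per second, the frames/J figures follow directly from throughput divided by power. A quick arithmetic check using the measured numbers reported in this post:

```python
# Board-level energy efficiency in frames per joule is throughput (frames/s)
# divided by power (W), since 1 W = 1 J/s. Inputs are the measured figures
# quoted in this post.

def frames_per_joule(frames_per_sec: float, power_watts: float) -> float:
    return frames_per_sec / power_watts

yolo_eff = frames_per_joule(969, 73.5)      # Yolo-v4 at 350 MHz
resnet_eff = frames_per_joule(40340, 65.9)  # ResNet-50 at 400 MHz
print(round(yolo_eff, 1), round(resnet_eff))  # 13.2 612
```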

NorthPole is a brain-inspired, silicon-optimized chip architecture suitable for neural inference that was published in October 2023 in Science Magazine. It is the result of nearly two decades of work by scientists at IBM Research and a 14+ year partnership with the United States Department of Defense (Defense Advanced Research Projects Agency, Office of the Under Secretary of Defense for Research and Engineering, and Air Force Research Laboratory).

Today, high-performance AI runs primarily in the data center and—while training may remain there—great opportunity exists to migrate inference out to the edge, reducing transmission energy as well as bandwidth, mitigating concerns regarding privacy as well as security, and enabling previously impossible applications. To enable inference outside the data center, users need AI accelerators with both high performance and high energy efficiency, embodied in a form factor optimized for deployment at the edge.

Caption: NorthPole VPX board, optimized for area and density in the 3U VPX form factor.
Caption: Fully functional, fabricated, and assembled NorthPole VPX module, inserted into a VPX chassis with a single-board computer.
Caption: Measured NorthPole VPX board power, throughput, and energy efficiency. Running Yolo-v4 at 350 MHz, the board processed 969 fps at 640×640 pixels per image, consuming 73.5 W for a board-level efficiency of 13.2 frames/J.
Caption: Measured NorthPole VPX board power, throughput, and energy efficiency. Running ResNet-50 at 400 MHz, the board processed 40,340 fps at 224×224 pixels per image, consuming 65.9 W for a board-level efficiency of 612 frames/J.

PDF of the Accepted Version.

Download: NorthPole_HPEC_VPX

Filed Under: Papers

NorthPole in The Economist

September 18, 2024 By dmodha

The Economist published an article “Researchers are looking beyond digital computing” highlighting NorthPole.

“IBM … have both designed chips that mimic this concept using current digital technology. IBM’s NorthPole chip has no off-chip memory. The company claims that its brain-inspired chip is 25 times more energy efficient and 20 times faster than other specialist chips, called accelerators, for certain AI applications.”

Filed Under: Press

NorthPole in Computer History Museum

July 25, 2024 By dmodha

NorthPole has now been inducted into the Computer History Museum, like its predecessor TrueNorth; both were developed at IBM Research. Together, NorthPole and TrueNorth are the result of nearly two decades of ground-breaking innovation.

Links:
– NorthPole CMH entry
– NorthPole was published in Science Magazine and presented at the Hot Chips Symposium; see LinkedIn post
– TrueNorth CMH entry

Caption: Ms. Penny Ahlstrand, Senior Archivist at CMH, is holding a NorthPole module (S/N B7423896).

Filed Under: Accomplishments, Prizes
