
Dharmendra S. Modha

My Work and Thoughts.


A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

November 20, 2025 By dmodha

In a paper published today (Nov 20, 2025) on arXiv (https://lnkd.in/g_r9qmZg), we describe how the IBM NorthPole chip has been vertically integrated into an end-to-end LLM inference system comprising 288 NorthPole accelerator cards, a high-performance runtime stack, and a containerized inference pipeline.

The research prototype system delivers 115 peta-ops at 4-bit precision and 3.7 PB/s of memory bandwidth across eighteen 2U servers, while consuming only 30 kW of power. This allows it to be deployed in existing data centers (cloud or on-prem) without requiring exotic communication fabrics, custom hardware integration, liquid cooling, or facility power upgrades.
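For a sense of density, the stated totals divide evenly across the eighteen servers. The per-server and per-watt numbers below are simple arithmetic on the figures above, not additional published specifications:

```python
# Per-server figures derived from the stated totals: 288 cards,
# 115 peta-ops (4-bit), 3.7 PB/s, and 30 kW across eighteen 2U servers.
servers, cards = 18, 288
peta_ops, power_kw, pb_per_s = 115, 30, 3.7

print(cards // servers)                  # 16 NorthPole cards per 2U server
print(round(peta_ops / servers, 1))      # 6.4 peta-ops per server
print(round(power_kw / servers, 2))      # 1.67 kW per server
print(round(pb_per_s / servers * 1000))  # ~206 TB/s of memory bandwidth per server
print(round(peta_ops * 1e15 / (power_kw * 1e3) / 1e12, 1))  # ~3.8 tera-ops/W at 4-bit
```

At roughly 1.7 kW per 2U server, each box sits comfortably within ordinary air-cooled rack power envelopes, which is what makes the no-facility-upgrade claim plausible.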

The modular, scalable, and reconfigurable system can run 3 simultaneous instances of the IBM Granite-3.3-8b-instruct model at a context length of 2,048 with 28 simultaneous users, at a per-user inter-token latency of 2.8 ms. The same system can run 18 instances of a 3-billion-parameter model at the same context length and user count, achieving an inter-token latency of 1 ms.
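As a back-of-the-envelope check, a steady inter-token latency converts directly to a per-user decode rate, and from there to aggregate throughput. Note that the assumption that all 28 users are served per model instance is mine; the post does not spell this out:

```python
def per_user_rate(inter_token_latency_s: float) -> float:
    """Tokens per second per user implied by a steady inter-token latency."""
    return 1.0 / inter_token_latency_s

# Figures from the post; treating 28 users per instance is an assumption.
rate_8b = per_user_rate(2.8e-3)    # ~357 tokens/s per user
rate_3b = per_user_rate(1.0e-3)    # 1000 tokens/s per user

aggregate_8b = rate_8b * 28 * 3    # 3 Granite-3.3-8b-instruct instances
aggregate_3b = rate_3b * 28 * 18   # 18 3B-parameter instances
print(round(aggregate_8b), round(aggregate_3b))  # 30000 504000
```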

This work was done in partnership with Trent Gray-Donald at IBM watsonx and David Cox at IBM Research.

Co-authors: Michael DeBole, Rathinakumar Appuswamy, Neil McGlohon, Brian Taba, Steve Esser, Filipp Akopyan, John Arthur, Arnon Amir, Alexander Andreopoulos, Peter Carlson, Andrew Cassidy, Pallab Datta, Myron Flickner, Rajamohan Gandhasri, Guillaume Garreau, Megumi Ito, Jennifer Klamo, Jeff Kusnitz, Nathaniel McClatchey, Jeffrey McKinstry, Tapan Kumar Nayak, Carlos Tadeo Ortega Otero, Hartmut Penner, William Risk, Jun Sawada, Jay Sivagnaname, Daniel Smith, Rafael Cardoso Fernandes Sousa, Ignacio Terrizzano, Takanori Ueda, Trent Gray-Donald, David Cox, Dharmendra Modha

NorthPole is a brain-inspired, silicon-optimized chip architecture for neural inference, published in October 2023 in Science. It is the result of nearly two decades of work by scientists at IBM Research and a 15+ year partnership with the United States Department of War (Defense Advanced Research Projects Agency (DARPA), Office of the Under Secretary of War for Research and Engineering, and Air Force Research Laboratory). For more information, see:
– Science paper: https://lnkd.in/g2bZ3Gfv
– IBM Research Blog 1: https://lnkd.in/gn6vP8xZ
– IBM Research Blog 2: https://lnkd.in/gHhH9hKb
– Dharmendra Modha’s Blog: https://modha.org
– Computer History Museum: https://lnkd.in/gFUemm6F
– Hot Chips Symposium video: https://lnkd.in/gCQMdz_Y
– LinkedIn Post: https://lnkd.in/g2bZ3Gfv
– LinkedIn Post: https://lnkd.in/gbMqcP5S
– LinkedIn Post: https://lnkd.in/g9tVcJZT
– LinkedIn Post: https://lnkd.in/gj5yWD77
– LinkedIn Post: https://lnkd.in/gkXpavmh
– LinkedIn Post: https://lnkd.in/gpf_ktk3

Filed Under: Papers

PNAS: Can neuromorphic computing help reduce AI’s high energy cost?

November 4, 2025 By dmodha

Excerpts from an article in PNAS (Proceedings of the National Academy of Sciences):

NorthPole is an “AI Accelerator” that’s “designed with energy efficiency in mind,” says Dharmendra Modha, IBM’s chief scientist for brain-inspired computing.

“We are driven not so much by neuroscience, but more by the intrinsic mathematical potential of the architecture,” he says.

In a 2023 paper, Modha and his team at IBM reported that the NorthPole neuromorphic chip successfully classified images from a dataset—a task often used to benchmark the performance of AI systems. The chip did so using a tiny fraction of the energy required by a conventional system, and it was five times faster. Modha believes that building chips differently, rather than only finding ways to shrink circuit dimensions and pack more processors onto integrated circuits, can lead to greater gains in energy efficiency. “Architecture trumps Moore’s Law,” he says.

Filed Under: Press

Computer History Museum Interview

September 7, 2025 By dmodha

Computer History Museum interview on the occasion of NorthPole’s induction into the Museum. Other interviewees include: John Backus (Fortran), Brian Kernighan (UNIX), Robert Metcalfe (Ethernet, 3Com), Gordon Moore (Moore’s Law), Robert Kahn (TCP/IP), Douglas Engelbart (hypertext), Ronald Rivest (RSA), John McCarthy (LISP), Donald Knuth (analysis of algorithms), James Gosling (Java), John Hennessy (RISC), Ken Thompson (UNIX, B), Rodney Brooks (robotics).

Filed Under: Press

EE Times Interview by Sunny Bains

September 7, 2025 By dmodha

Sunny Bains interviewed me for the Brains and Machines podcast. The conversation traces our journey through DARPA SyNAPSE, TrueNorth, and NorthPole. Listen here.

Filed Under: Press

SiLQ: Simple Large Language Model Quantization-Aware Training

September 6, 2025 By dmodha

Thrilled to share the latest work from the IBM Research NorthPole Team pushing the cutting edge of quantized large language model performance. In a recent paper, we introduce a new quantization recipe and apply it to 8-billion-parameter Granite and Llama models. We demonstrate these models with 8-bit activations and cache and 4-bit weights, showing minimal accuracy degradation across three leaderboards spanning 20 distinct tasks.

Our method achieves high accuracy, outperforming all previously published quantization methods on the models and precisions examined. It is simple: existing training code can be reused after adding appropriate quantization and knowledge distillation. It is also relatively low-cost: it can reuse existing training data or publicly available datasets, and it increases the total training budget by less than 0.1%. We believe this will be a powerful enabling tool for deploying models on ultra-low-latency inference accelerators like NorthPole, greatly enhancing the performance of latency-critical applications such as interactive dialog and agentic workflows.
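To make the idea of quantization-aware training concrete, here is a minimal, hypothetical sketch of symmetric fake quantization, the round-trip that QAT methods generally insert into the forward pass. This is not the SiLQ code; the bit widths and scale are illustrative only:

```python
def fake_quantize(x: float, num_bits: int, scale: float) -> float:
    """Quantize-dequantize a value on a symmetric signed integer grid.

    During quantization-aware training, weights and activations pass
    through this round-trip in the forward pass, so the model learns
    to tolerate the rounding and clamping error it will see at inference.
    """
    qmax = 2 ** (num_bits - 1) - 1    # e.g. 7 for 4-bit signed
    q = round(x / scale)              # snap to the integer grid
    q = max(-qmax - 1, min(qmax, q))  # clamp to the representable range
    return q * scale                  # dequantize back to float

# Illustrative: a 4-bit grid with step 0.1 spans [-0.8, 0.7]
print(fake_quantize(0.37, num_bits=4, scale=0.1))  # snaps to the 0.4 grid point
print(fake_quantize(1.50, num_bits=4, scale=0.1))  # clamps to the top of range (~0.7)
```

In practice the same round-trip is applied per-tensor or per-channel with learned or calibrated scales (8-bit for activations and cache, 4-bit for weights in the paper's setting), and knowledge distillation from the full-precision model supplies the training signal.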

The paper, written with co-authors Steven Esser, Jeffrey McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra Modha, can be found here.

Filed Under: Papers



Copyright © 2025