Thrilled to share the latest work from the IBM Research NorthPole Team pushing the cutting edge of quantized large language model performance. In a recent paper, we introduce a new quantization recipe and apply it to 8-billion-parameter Granite and Llama models. We demonstrate that these models, quantized to 4-bit weights with 8-bit activations and KV cache, show minimal accuracy degradation across three leaderboards spanning 20 distinct tasks.
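For readers less familiar with the precision configuration, the sketch below simulates W4A8 quantization for a single linear layer: symmetric per-channel 4-bit weights and per-tensor 8-bit activations. This is an illustrative PyTorch example only; the function name, tensor shapes, and scale choices are assumptions and do not reproduce the paper's exact quantizers.

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int, per_channel: bool = False) -> torch.Tensor:
    """Simulate symmetric uniform quantization: round to num_bits integers, then dequantize.

    per_channel=True uses one scale per output channel (row), as is common for weights;
    otherwise a single per-tensor scale is used, as is common for activations.
    """
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel:
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    else:
        scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

# W4A8 example for one linear layer: 4-bit weights, 8-bit activations (shapes are illustrative).
weight = torch.randn(4096, 4096)        # (out_features, in_features)
acts = torch.randn(1, 16, 4096)         # (batch, seq_len, hidden)
w_q = fake_quant(weight, num_bits=4, per_channel=True)
a_q = fake_quant(acts, num_bits=8)
out = a_q @ w_q.t()                     # quantized matmul, simulated in floating point
```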
Our method is accurate, outperforming all prior published quantization methods on the models and precisions examined; simple, reusing existing training code with quantization and knowledge distillation added; and relatively low-cost, reusing existing training data or publicly available datasets and requiring an increase in total training budget of less than 0.1%. We believe this will be a powerful enabling tool for deploying models on ultra-low-latency inference accelerators like NorthPole, greatly enhancing the performance of latency-critical applications such as interactive dialog and agentic workflows.
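As a rough illustration of how quantization and knowledge distillation can be dropped into an existing training loop, here is a minimal PyTorch sketch. The model API (Hugging Face style `.logits` outputs), function names, and temperature are assumptions made for illustration and are not the recipe from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between the full-precision teacher and the quantized student."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def train_step(student, teacher, batch, optimizer):
    """One quantization-aware training step supervised by the teacher's outputs."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    # The student's forward pass is assumed to use fake-quantized weights and activations.
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    # Assumes the quantizers use a straight-through estimator so gradients pass through round().
    loss.backward()
    optimizer.step()
    return loss.item()
```

The appeal of this pattern is that the outer loop is an ordinary training loop: only the quantized forward pass and the distillation loss are new, which is why existing training code and data can be reused with only a small additional training budget.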
The paper, written with co-authors Jeffrey McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra Modha, can be found here.