• Skip to main content
  • Skip to primary sidebar

Dharmendra S. Modha

My Work and Thoughts.

  • Brain-inspired Computing
    • Collaborations
    • Videos
  • Life & Universe
    • Creativity
    • Leadership
    • Interesting People
  • Accomplishments
    • Prizes
    • Papers
    • Positions
    • Presentations
    • Press
    • Profiles
  • About Me

Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

September 2, 2023 By dmodha

Authors: Deepika Bablani, Jeffrey L. Mckinstry, Steven K. Esser, Rathinakumar Appuswamy, Dharmendra S. Modha

Abstract: For effective and efficient deep neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. It is generally desirable to quantize as aggressively as possible without incurring significant accuracy degradation. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers of a network to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50 and ResNet-101 classification networks, demonstrating improved performance across the entire accuracy-throughput frontier, and equivalent performance for the PSPNet segmentation network in our own commensurate comparison over leading mixed precision layer selection techniques, while requiring orders of magnitude less compute time to reach a solution.

Link: https://arxiv.org/abs/2301.13330

Filed Under: Accomplishments, Brain-inspired Computing, Papers

Primary Sidebar

Recent Posts

  • Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole
  • Breakthrough edge AI inference performance using NorthPole in 3U VPX form factor
  • NorthPole in The Economist
  • NorthPole in Computer History Museum
  • NorthPole: Neural Inference at the Frontier of Energy, Space, and Time

Archives by Month

  • 2024: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2023: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2022: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2020: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2019: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2018: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2017: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2016: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2015: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2014: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2013: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2012: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2011: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2010: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2009: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2008: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2007: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2006: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Copyright © 2025