
Dharmendra S. Modha

My Work and Thoughts.


IEEE Computer Cover Feature — TrueNorth: Accelerating From Zero to 64 Million Neurons in 10 Years

May 14, 2019 By dmodha

The May 2019 issue of IEEE Computer magazine highlights, as its cover feature, an article summarizing ten years of innovation from IBM Research.

Abstract: IBM’s brain-inspired processor is a massively parallel neural network inference engine containing 1 million spiking neurons and 256 million low-precision synapses. Now, after a decade of fundamental research spanning neuroscience, architecture, chips, systems, software, and algorithms, IBM has delivered the largest neurosynaptic computer ever built.

FIGURE 1. The timeline of the development of TrueNorth.
FIGURE 3. The 64-processor scale-out NS16e-4.
FIGURE 5. A decade of neuromorphic applications on TrueNorth.

Filed Under: Accomplishments, Brain-inspired Computing, Papers

Design Awards for NS16e-4 System

March 11, 2019 By dmodha

Guest Post by William P. Risk

Over the course of the SyNAPSE / TrueNorth project, we’ve had the opportunity to leverage the technical depth and breadth that exist in IBM, both within and outside the Research Division. In particular, we’ve collaborated with IBM’s industrial design team since the early days of the project, when they helped us imagine and communicate potential applications of this new technology through concept models and created an iconic cap for the TrueNorth chip. Most recently, we’ve collaborated with both our industrial designers and our Systems Group engineers to design the landmark NS16e-4 Neurosynaptic System.

The elegant and iconic design of this system was recognized recently with two design awards.

First, it was named a Featured Finalist in the 2018 International Design Excellence Awards of the Industrial Design Society of America.

Second, it received a 2019 iF Design Award from the iF Design Foundation.

Filed Under: Accomplishments, Brain-inspired Computing, Collaborations, Prizes

PREPRINT: Low Precision Policy Distillation with Application to Low-Power, Real-time Sensation-Cognition-Action Loop with Neuromorphic Computing

October 8, 2018 By dmodha

Guest Blog by Deepika Bablani

Title: Low Precision Policy Distillation with Application to Low-Power, Real-time Sensation-Cognition-Action Loop with Neuromorphic Computing

Authors: Jeffrey L. McKinstry, Davis R. Barch, Deepika Bablani, Michael V. Debole, Steven K. Esser, Jeffrey A. Kusnitz, John V. Arthur, Dharmendra S. Modha

Abstract: Low precision networks in the reinforcement learning (RL) setting are relatively unexplored because of the limitations of binary activations for function approximation. Here, in the discrete action ATARI domain, we demonstrate, for the first time, that low precision policy distillation from a high precision network provides a principled, practical way to train an RL agent. As an application, on 10 different ATARI games, we demonstrate real-time end-to-end game playing on low-power neuromorphic hardware by converting a sequence of game frames into discrete actions.

Link: https://arxiv.org/pdf/1809.09260.pdf 

Interest in developing algorithms for energy-efficient training and deployment of deep neural networks is on the rise. However, sequential decision making using RL has been relatively unexplored in this context. It is an interesting avenue for energy-efficient deployment because it provides a means of deploying end-to-end learning agents that act in real time on real-world problems. We demonstrate, for the first time, an RL agent trained using policy distillation [1] to play ATARI games, mapped to TrueNorth neuromorphic hardware. Our approach is agnostic to the choice of RL algorithm and can be applied to any value-based RL algorithm.

Figure 1: Real-time sensation-cognition-action loop with TrueNorth allows real-world deployment of RL models. The TrueNorth system receives the game frame as input (sensation), generates predicted returns for all actions (cognition), and outputs the best move based on its learned policy (action) in real time.

Why RL is challenging for low precision

RL solves a sequential decision-making task in which an agent interacts with its environment over discrete time steps and chooses actions to maximize expected long-term return [2]. Here we consider the ATARI domain. At every time step, the agent receives a game-screen image from the game simulator. Using this image, along with preceding frames, the agent chooses the optimal action from a discrete set and receives a reward.

Deep regression networks have been used successfully to approximate Q (state-action value) functions in value-based RL [3]. These networks take raw game frames as input and predict the Q value of each action in the input state, which is used to choose the optimal policy by balancing exploration and exploitation. This is challenging for low-precision networks. Q values are continuous, making value-based RL a challenging regression task: to obtain the same accuracy as a network with rectified linear unit (ReLU) activations, a network with binary activations requires substantially more neurons [7]. Furthermore, solving an RL problem in this constrained space is inherently hard owing to the non-stationary data distribution, limited feedback, and delayed rewards [3], and the amount of time required by back-propagation can be prohibitively high.
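
For readers less familiar with value-based RL, here is a minimal sketch of a DQN-style Q-network in PyTorch. The architecture (a stack of four 84x84 frames, three convolutional layers, and two fully connected layers) follows the widely used setup of [3]; it is an illustration, not the teacher or student network actually used in the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style Q-network: maps a stack of game frames to one Q value per action.

    Follows the standard ATARI DQN layout of [3]; the paper's actual teacher and
    student networks differ (the low-precision student is wider and deeper).
    """
    def __init__(self, num_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),      # one Q value per discrete action
        )

    def forward(self, frames):                # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))

# Greedy action selection from predicted Q values:
# q_net = QNetwork(num_actions=18)
# action = q_net(frame_stack).argmax(dim=1)
```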

Our Approach – Low Precision Policy Distillation

Policy distillation [1] frames the transfer of knowledge from a learned policy (the teacher) to a new network (the student) as a supervised learning problem. The student is trained to mimic the teacher by learning to match the teacher's actions. Because the student has access to the teacher's labels, this helps overcome some of the challenges associated with RL. By adjusting the temperature in the distillation loss, the labels are made sharper, bringing the task closer to supervised classification, which is easier for a constrained student.
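
As a concrete illustration, a temperature-scaled KL-divergence distillation loss of the kind described above might look as follows. This is a minimal sketch with assumed variable names (teacher Q values in, student logits out), not the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_q, temperature=0.01):
    """KL divergence between the temperature-scaled teacher distribution and the student.

    A small temperature sharpens the teacher's soft targets toward its greedy
    action, moving the problem closer to ordinary supervised classification.
    """
    teacher_probs = F.softmax(teacher_q / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits, dim=1)
    # KL(teacher || student), averaged over the batch
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```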

We used Double Deep Q-Networks (DDQN) [8] to train the full-precision teacher, and policy distillation on data generated by the teacher to train the student. If the teacher accurately approximates the Q function, then an accurate policy can be derived from it, providing the labels needed to train the student network. By distilling the policy, the final student network is likely to be smaller and faster to train.
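
For context, the Double DQN bootstrap target used to train such a full-precision teacher [8] can be sketched as below: the online network selects the next action and a separate target network evaluates it, which reduces overestimation bias. Variable names are assumptions for illustration, not the paper's training code.

```python
import torch

def ddqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """Double DQN bootstrap target [8]."""
    with torch.no_grad():
        # Online network chooses the next action; target network evaluates it.
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_value = target_net(next_state).gather(1, next_action).squeeze(1)
        return reward + gamma * (1.0 - done.float()) * next_value

# The teacher minimizes (Q_online(s, a) - ddqn_target(...))**2 over replayed transitions.
```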

Figure 2: Low-precision policy distillation allows principled training of constrained networks for neuromorphic systems. (Top) Full-precision teacher network trained using value-based RL. (Bottom) Low-precision student network trained on the teacher's policy and mapped to the TrueNorth platform.

Results on the ATARI benchmark

We demonstrate results on single-task and multi-task policy distillation. For the single-task setting, we use a separately trained full-precision teacher for each game. The student architecture is similar but wider and deeper; more details about TrueNorth training can be found in [9]. We compare two loss functions for single-task training, negative log-likelihood loss and KL-divergence loss, and find that KL-divergence loss achieves better scores.
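
The KL-divergence variant is sketched above in the policy-distillation discussion; the negative-log-likelihood variant simply treats the teacher's greedy action as a hard label, as in this illustrative helper (not the paper's code):

```python
import torch.nn.functional as F

def nll_student_loss(student_logits, teacher_q):
    """Cross-entropy (negative log likelihood) against the teacher's greedy action."""
    hard_labels = teacher_q.argmax(dim=1)   # hard labels: the action the teacher would take
    return F.cross_entropy(student_logits, hard_labels)
```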

Figure 3: Low-precision student networks trained using KL-divergence loss meet teacher performance. Policy distillation results for online student training (mean score over 100 iterations, normalized against teacher scores).
Figure 4: A larger student network with sufficient capacity learns to play multiple games. Students were trained on 2, 3, or 10 ATARI games and tested only on the games for which they were trained; performance is normalized against an identical student trained on one game.

After training, student networks are mapped to TrueNorth using "corelets" [9][10]. The Neurosynaptic System 1-million-neuron evaluation platform (NS1e) is a development platform that contains a TrueNorth chip alongside a Xilinx Zynq Z-7020 FPGA. Through tiling, TrueNorth chips can be directly connected to one another via their native chip-to-chip asynchronous communication interfaces. Using this capability, we have created a platform that natively tiles 16 TrueNorth chips, the NS16e (Neurosynaptic System 16-million-neuron evaluation platform), capable of executing networks 16 times larger than those on the NS1e.

The Arcade Learning Environment (ALE) was augmented with hooks into the TrueNorth run-time. We ported the entire ALE game engine (Stella), together with the corresponding software modifications, to run on the ARM cores while using TrueNorth for inference. The system is able to maintain a frame rate of 30 frames per second (fps).
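
Schematically, the real-time loop closes sensation, cognition, and action at the frame rate. The sketch below is purely illustrative: `game_env`, `preprocess`, and `infer_on_chip` are hypothetical placeholders standing in for the ALE/Stella engine, the frame-to-spike encoding, and the TrueNorth inference call, whose actual interfaces are not shown in this post.

```python
import time

FRAME_PERIOD = 1.0 / 30.0   # target frame period for 30 fps

def play_episode(game_env, preprocess, infer_on_chip):
    """Illustrative real-time sensation-cognition-action loop (placeholder interfaces)."""
    frame = game_env.reset()
    done = False
    while not done:
        start = time.time()
        spikes = preprocess(frame)               # sensation: encode the game frame
        q_values = infer_on_chip(spikes)         # cognition: one inference pass on TrueNorth
        action = q_values.index(max(q_values))   # action: greedy choice from the learned policy
        frame, reward, done = game_env.step(action)
        # Sleep off any slack so the loop holds 30 frames per second.
        time.sleep(max(0.0, FRAME_PERIOD - (time.time() - start)))
```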

This is the first work of its kind to provide an end-to-end solution that closes the sensation-cognition-action loop for real-time deployment of reinforcement learning algorithms on neuromorphic hardware. It provides a strong baseline for future work aimed at closing the gap between algorithmic advances and real-world deployment on highly optimized hardware, an important challenge in the current research landscape.

All figures in the blog are from the paper.

References

[1] Rusu, A. A.; Colmenarejo, S. G.; Gulcehre, C.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2015. Policy distillation. arXiv preprint arXiv:1511.06295.

[2] Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction. MIT press. 

[3] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[4] Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 3123–3131.

[5] Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525–542. Springer.

[6] Esser, S. K.; Appuswamy, R.; Merolla, P.; Arthur, J. V.; and Modha, D. S. 2015. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems, 1117–1125.

[7] Blum, E. K., and Li, L. K. 1991. Approximation theory and feedforward networks. Neural Networks 4(4):511–515.

[8] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In AAAI, volume 2, 5. Phoenix, AZ. 

[9] Esser, S. K.; Merolla, P. A.; Arthur, J. V.; Cassidy, A. S.; Appuswamy, R.; Andreopoulos, A.; Berg, D. J.; McKinstry, J. L.; Melano, T.; Barch, D. R.; di Nolfo, C.; Datta, P.; Amir, A.; Taba, B.; Flickner, M. D.; and Modha, D. S. 2016. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences 113(41):11441–11446.

[10] Amir, A.; Datta, P.; Risk, W. P.; Cassidy, A. S.; Kusnitz, J. A.; Esser, S. K.; Andreopoulos, A.; Wong, T. M.; Flickner, M.; Alvarez-Icaza, R.; et al. 2013. Cognitive computing programming paradigm: A corelet language for composing networks of neurosynaptic cores. In Neural Networks (IJCNN), The 2013 International Joint Conference on, 1–10. IEEE.

Filed Under: Brain-inspired Computing, Papers

Inspiring brains creating brain-inspired computing!

September 21, 2018 By dmodha

Join us to make the future together.

Filed Under: Brain-inspired Computing, Collaborations, Positions

PREPRINT: Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference

September 20, 2018 By dmodha

Guest Blog by Jeffrey L. McKinstry.

To seek feedback from fellow scientists, my colleagues and I are very excited to share a preprint with the community.

Title: Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference

Authors: Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy, Deepika Bablani, John V. Arthur, Izzet B. Yildiz, Dharmendra S. Modha

Abstract: To realize the promise of ubiquitous embedded deep network inference, it is essential to seek limits of energy and area efficiency. To this end, low-precision networks offer tremendous promise because both energy and area scale down quadratically with the reduction in precision. Here, for the first time, we demonstrate ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3, DenseNet-161, and VGG-16bn networks on the ImageNet classification benchmark that, at 8-bit precision, exceed the accuracy of the full-precision baseline networks after one epoch of fine-tuning, thereby leveraging the availability of pretrained models. We also demonstrate for the first time ResNet-18, ResNet-34, and ResNet-50 4-bit models that match the accuracy of the full-precision baseline networks. Surprisingly, the weights of the low-precision networks are very close (in cosine similarity) to the weights of the corresponding baseline networks, making training from scratch unnecessary.

The number of iterations required by stochastic gradient descent (SGD) to achieve a given training error is related to the square of (a) the distance of the initial solution from the final solution plus (b) the maximum variance of the gradient estimates. Drawing inspiration from this observation, we (a) reduce the solution distance by starting with pretrained fp32 precision baseline networks and fine-tuning, and (b) combat the noise introduced by quantizing weights and activations during training by using larger batches along with matched learning rate annealing. Together, these two techniques offer a promising heuristic for discovering low-precision networks, if they exist, close to fp32 precision baseline networks.
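
As a point of reference, a standard convergence bound for SGD on a convex objective (with a suitably chosen constant step size) makes these two factors explicit. This is a textbook bound quoted only to motivate the heuristic, not necessarily the exact bound invoked in the paper:

$$
\mathbb{E}\big[f(\bar{w}_N)\big] - f(w^\star) \;\le\; \frac{\lVert w_0 - w^\star \rVert \, G}{\sqrt{N}}
\qquad\Longrightarrow\qquad
N \;\gtrsim\; \frac{\lVert w_0 - w^\star \rVert^2 \, G^2}{\epsilon^2},
$$

where $G^2$ bounds the second moment of the stochastic gradients and $\epsilon$ is the target suboptimality. Shrinking either the initial distance (by starting from a pretrained model) or the gradient noise (by using larger batches) reduces the number of iterations needed.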

Link: https://arxiv.org/abs/1809.04191

Motivation

Harnessing the power of deep convolutional networks in embedded and large-scale application domains, for example in self-driving cars, requires low-cost, energy-efficient hardware implementations. One way to reduce energy and hardware cost is to reduce the memory usage of the network by replacing 32-bit floating point weights and activations with lower-precision values, such as 8 or even 4 bits. If such low-precision networks were just as accurate as the 32-bit floating point (fp32) version, the energy and system cost savings would come for free. Unfortunately, the accuracies of the 8-bit networks are lower than those of the corresponding full-precision networks (see Table 1, from https://arxiv.org/abs/1809.04191). The situation is even worse at 4-bit precision: for the ImageNet classification benchmark, no method has been able to match the accuracy of the corresponding full-precision network when quantizing both the weights and activations at the 4-bit level. Closing this performance gap has been an important open problem until now.

Table 1

Contributions

Guided by theoretical convergence bounds for stochastic gradient descent (SGD), we propose fine-tuning, that is, training pretrained high-precision networks for low-precision inference while combating noise during the training process, as a method for discovering both 4-bit and 8-bit integer networks. We evaluate the proposed solution on the ImageNet benchmark on a representative set of state-of-the-art networks at 8-bit and 4-bit quantization levels (Table 1). Contributions include the following.

  • We demonstrate 8-bit scores on ResNet-18, 34, 50, and 152, Inception-v3, DenseNet-161, and VGG-16 exceeding the full-precision scores after just one epoch of fine-tuning.
  • We present the first evidence of 4-bit, fully integer networks which match the accuracy of the original full-precision networks on the ImageNet benchmark.
  • We present empirical evidence for gradient noise that is introduced by weight quantization. This gradient noise increases with decreasing precision and may account for the difficulty in fine-tuning low-precision networks.
  • We demonstrate that reducing noise in the training process through the use of larger batches provides further accuracy improvements.
  • We find direct empirical support that, as with 8-bit quantization, near optimal 4-bit quantized solutions exist close to high-precision solutions, making training from scratch unnecessary.

Fine-tuning after Quantization (FAQ)

Our goal has been to quantize existing networks to 8 and 4 bits for both weights and activations while achieving accuracies that match or exceed the corresponding full-precision networks. For precision below 8 bits, the typical method, which we used in our prior work (Esser et al., 2016), is to train the model using SGD while rounding the weights and neuron responses.
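
A minimal sketch of how "training while rounding" is commonly implemented is shown below: uniform symmetric quantization with a straight-through gradient estimator. The quantizer details (per-tensor scaling, the clipping rule) are assumptions for illustration, not the paper's exact scheme.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass gradients straight through in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def quantize(x, num_bits=4, max_abs=None):
    """Uniform symmetric quantization of weights or activations to `num_bits`."""
    if max_abs is None:
        max_abs = x.abs().max()
    levels = 2 ** (num_bits - 1) - 1          # e.g. 7 for signed 4-bit values
    scale = max_abs / levels
    x_clipped = torch.clamp(x, -max_abs, max_abs)
    return RoundSTE.apply(x_clipped / scale) * scale
```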

In our experience, there are at least two problems faced when training low-precision networks: learning in the face of low-precision weights and activations, and capacity limits in the face of these constraints. It has been known for some time that 8-bit networks are able to come close to the same accuracy as the corresponding 32-bit network, even without retraining, indicating that 8-bit networks have approximately the same capacity as the fp32 networks. Very recently it was shown, for ResNet-18 and 50, that networks with 4-bit weights and activations could come within 1% of the accuracy of the fp32 networks when training from scratch, suggesting that 4-bit networks may also have the same capacity as the corresponding fp32 networks. We therefore looked for a way to train low-precision networks to match or even exceed the corresponding high precision networks.

Some hints come from a theoretical analysis of how long it takes to train a network using SGD. The number of iterations required is related to the square of (a) the distance of the initial solution from the final solution plus (b) the maximum variance of the gradient estimates. This suggests two ways to minimize the final error. First, start closer to the solution. We therefore start with pretrained models available from the PyTorch model zoo (https://pytorch.org/docs/stable/torchvision/models.html) for quantization, rather than training from scratch as is customary. Second, minimize the variance of the gradient noise. To do this, we combine well-known techniques to combat noise: larger batches and proper learning rate annealing with longer training time. We refer to this technique as Fine-tuning After Quantization, or FAQ. Table 1 shows that the proposed method outperforms all other algorithms for quantization at 8 and 4 bits and can match or exceed the accuracy of the corresponding full-precision networks in all cases.
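
Putting the pieces together, a FAQ-style fine-tuning run might be outlined as below: load a pretrained model-zoo network, insert weight and activation quantizers, and fine-tune with large batches and an annealed learning rate. The hyperparameters and the quantization hook are illustrative assumptions; consult the paper for the actual recipe.

```python
import torch
import torch.nn.functional as F
import torchvision

def faq_finetune(train_loader, num_epochs=1, lr=0.0015):
    """Illustrative FAQ-style fine-tuning loop (assumed hyperparameters, not the paper's exact recipe)."""
    # Start close to the solution: a pretrained fp32 network from the PyTorch model zoo.
    model = torchvision.models.resnet18(pretrained=True)
    # A quantization hook would go here, e.g. wrapping conv/linear layers so that
    # weights and activations pass through an STE quantizer such as `quantize` above.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    # Anneal the learning rate over the run; together with larger batches this shrinks gradient noise.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    for _ in range(num_epochs):
        for images, labels in train_loader:    # use a large batch size to reduce gradient variance
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```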

We found further empirical support that starting from a pretrained network was indeed helpful: the full-precision network weights were very similar to the final weights after running FAQ on ResNet-18. The fact that a good 4-bit network was found close to the full-precision network suggests that it is unnecessary, and perhaps wasteful, to train from scratch.
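
The closeness claim can be checked directly by comparing corresponding weight tensors of the pretrained fp32 model and its FAQ fine-tuned counterpart; the helper below (illustrative, not the paper's analysis script) reports per-tensor cosine similarity.

```python
import torch

def weight_cosine_similarity(baseline_model, finetuned_model):
    """Cosine similarity between each pair of corresponding weight tensors."""
    sims = {}
    finetuned_params = dict(finetuned_model.named_parameters())
    for name, w_base in baseline_model.named_parameters():
        w_ft = finetuned_params[name]
        sims[name] = torch.nn.functional.cosine_similarity(
            w_base.flatten(), w_ft.flatten(), dim=0).item()
    return sims

# Values near 1.0 indicate the low-precision solution lies close to the fp32 baseline.
```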

FAQ is a principled approach to quantization. Given these results on state-of-the-art deep networks, we expect that it will generate much interest and replace existing methods. Our work demonstrates that 8-bit and 4-bit quantized networks performing at the level of their high-precision counterparts can be created with a modest investment of training time, a critical step toward harnessing the energy efficiency of low-precision hardware.

Filed Under: Accomplishments, Brain-inspired Computing, Papers
