# Vanishing gradient problem


In machine learning, the **vanishing gradient problem** is a difficulty encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each training iteration every weight of the network receives an update proportional to the partial derivative of the error function with respect to that weight. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from training further. As one example of the cause, traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. Computing the gradients of the "front" layers in an n-layer network therefore multiplies n of these factors together; when the factors are small, the gradient (error signal) decreases exponentially with n, and the front layers train very slowly.
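The exponential decay can be illustrated with a toy calculation. The following is a minimal sketch, assuming a chain of scalar tanh layers that all share one weight; the function name `front_layer_gradient` and the values `weight=0.5` and `x=1.0` are hypothetical choices made for illustration, not taken from any cited source.

```python
import math


def front_layer_gradient(n_layers, weight=0.5, x=1.0):
    """Return d(output)/d(input) for a chain of n scalar tanh layers.

    Each layer computes tanh(weight * activation). By the chain rule the
    overall derivative is the product of the per-layer derivatives.
    """
    activation = x
    grad = 1.0
    for _ in range(n_layers):
        activation = math.tanh(weight * activation)
        # Local derivative of tanh(weight * a) with respect to a.
        grad *= weight * (1.0 - activation ** 2)
    return grad


for n in (1, 5, 10, 20):
    print(n, front_layer_gradient(n))
```

With these assumed values every factor is smaller than 0.5, so the product is bounded by 0.5 to the power n and shrinks exponentially as layers are added.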

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diploma thesis of 1991^{[1]}^{[2]} formally identified the reason for this failure as the "vanishing gradient problem", which affects not only many-layered feedforward networks^{[3]} but also recurrent networks.^{[4]} The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network.
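The effect of unfolding can be made explicit with a short derivation. The notation is an assumption introduced here for illustration: $h_t$ is the hidden state at time step $t$, $x_t$ the input, $W$ and $U$ weight matrices, $\phi$ an element-wise activation function, and $\lVert\cdot\rVert$ a submultiplicative matrix norm such as the spectral norm.

$$
h_t = \phi(a_t), \qquad a_t = W h_{t-1} + U x_t
$$

$$
\frac{\partial h_T}{\partial h_t}
  = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
  = \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\phi'(a_k)\big)\, W
$$

$$
\left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert
  \le \big(\gamma \, \lVert W \rVert\big)^{T-t},
  \qquad \gamma = \sup_z \lvert \phi'(z) \rvert
$$

The factors in the product are ordered from $k = T$ on the left to $k = t+1$ on the right. Under these assumptions, if $\gamma \lVert W \rVert < 1$ the bound shrinks exponentially in the number of unfolded time steps $T - t$, which is the same mechanism as in a deep feedforward network.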

When activation functions are used whose derivatives can take on larger values, one risks encountering the related **exploding gradient problem**.
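Both regimes can be seen in a deliberately simplified setting. The sketch below assumes a linear scalar recurrence h_t = w * h_{t-1}, so that the derivative of the final state with respect to the initial state is just w multiplied by itself once per time step; the function name `gradient_through_time` and the sample weights are hypothetical.

```python
def gradient_through_time(recurrent_weight, n_steps):
    """Return d(h_T)/d(h_0) for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(n_steps):
        # One chain-rule factor per unfolded time step.
        grad *= recurrent_weight
    return grad


for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, 100))
```

A per-step factor slightly below 1 drives the gradient toward zero over 100 steps, while a factor slightly above 1 makes it grow by several orders of magnitude.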

## Solutions

### Multi-level hierarchy

To overcome this problem, several methods have been proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and then fine-tuned through backpropagation.^{[5]} Here each level learns a compressed representation of the observations, which is fed to the next level.

#### Related approach

Similar ideas have been used in feed-forward neural networks, where unsupervised pre-training structures the network so that it first learns generally useful feature detectors. The network is then trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. If trained properly, each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.^{[6]} Hinton reports that his models are effective feature extractors over high-dimensional, structured data.^{[7]}

### Long short-term memory

Another technique, used particularly for recurrent neural networks, is the long short-term memory (LSTM) network introduced in 1997 by Hochreiter & Schmidhuber.^{[8]} In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.^{[9]}^{[10]}

### Faster hardware

Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way",^{[11]} since the original models tackling the vanishing gradient problem by Hinton et al. (2006) were trained on a Xeon processor, not GPUs.^{[6]}

### Residual networks

One of the newer and more effective ways to address the vanishing gradient problem is the residual neural network, or ResNet^{[12]} (not to be confused with the recurrent neural network).^{[13]} Before ResNets, it had been observed that a deeper network could have higher *training* error than a shallower one. Intuitively, this can be understood as the signal fading as it passes through too many layers: the output of a shallow layer is diminished by the greater number of layers in the deeper network, yielding a worse result. Following this intuition, Microsoft Research found that splitting a deep network into chunks of a few layers each and passing the input of each chunk straight through to the next chunk, where it is added to the output of the chunk, helped eliminate much of this disappearing-signal problem. Each chunk then only has to learn the residual, that is, the difference between its desired output and its input. No extra parameters or changes to the learning algorithm were needed. ResNets^{[14]} yielded lower training error (and test error) than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing data.^{[15]}
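The role of the pass-through path can be sketched with a toy scalar model. This is an illustration under stated assumptions, not the architecture from the ResNet paper: each "chunk" is a single scalar tanh unit with a shared weight, and the function name `chain_gradient` and its default values are hypothetical.

```python
import math


def chain_gradient(n_layers, weight=0.5, x=1.0, residual=False):
    """Return d(output)/d(input) for a chain of scalar tanh chunks.

    Plain chunk:     y = F(x)      with F(x) = tanh(weight * x)
    Residual chunk:  y = x + F(x)  (input passed straight through and added)
    """
    grad = 1.0
    for _ in range(n_layers):
        branch = math.tanh(weight * x)
        # Derivative of the chunk's own computation F(x).
        local = weight * (1.0 - branch ** 2)
        if residual:
            x = x + branch
            grad *= 1.0 + local  # the identity path contributes the 1
        else:
            x = branch
            grad *= local
    return grad


print("plain   :", chain_gradient(20))
print("residual:", chain_gradient(20, residual=True))
```

In this toy setting each plain factor is below 0.5, so the product collapses, whereas each residual factor is at least 1 because of the identity term. The sketch only shows how the added input changes the per-chunk derivative; it does not settle which explanation accounts for the behavior of real ResNets.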

Veit et al. argue, however, that ResNets behave like ensembles of relatively shallow networks: on this view they do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network, but avoid it by combining many short networks into an ensemble ("ensemble by construction").^{[16]}

### Other activation functions

Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.^{[17]}
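One-sided saturation can be checked directly from the derivatives. The following minimal sketch compares the derivative of tanh with that of ReLU at a few sample inputs; the helper names `tanh_grad` and `relu_grad` are hypothetical, and the ReLU derivative at exactly zero is taken to be 0 by convention here.

```python
import math


def tanh_grad(z):
    """Derivative of tanh(z): close to 0 for large |z| in both directions."""
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z):
    """Derivative of max(0, z): 0 for negative inputs, 1 for positive inputs."""
    return 1.0 if z > 0 else 0.0


for z in (-5.0, -1.0, 1.0, 5.0):
    print(z, tanh_grad(z), relu_grad(z))
```

For large positive inputs the ReLU derivative stays at 1, so repeated multiplication along an active path does not shrink the gradient, while the tanh derivative is small on both sides.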

### Other

Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid^{[18]} to solve problems like image reconstruction and face localization.^{[citation needed]}
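Why the sign alone can be enough is easiest to see on a problem whose gradient is tiny everywhere. The sketch below is a simplified Rprop-style rule for a single weight, written for illustration and not a reproduction of Behnke's implementation; the function names, the step-size factors 1.2 and 0.5, and the toy objective are all assumptions.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def rprop_minimize(grad_fn, w, n_steps=100, step=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimize a 1-D function using only the sign of its gradient."""
    prev_sign = 0
    for _ in range(n_steps):
        s = sign(grad_fn(w))
        if s * prev_sign > 0:
            # Same direction as before: grow the step size.
            step = min(step * eta_plus, step_max)
        elif s * prev_sign < 0:
            # Direction flipped (overshoot): shrink the step size.
            step = max(step * eta_minus, step_min)
        w -= s * step  # the gradient's magnitude is never used
        prev_sign = s
    return w


def tiny_gradient(w):
    """Gradient of the toy objective 1e-8 * (w - 3)**2."""
    return 2e-8 * (w - 3.0)


print(rprop_minimize(tiny_gradient, w=0.0))
```

Because the update depends only on the sign, scaling the gradient down by eight orders of magnitude does not slow the approach to the minimum at w = 3 in this toy case, whereas a plain gradient step with a fixed learning rate would barely move.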

Neural networks can also be optimized by running a universal search algorithm over the space of the network's weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach does not rely on gradients and therefore avoids the vanishing gradient problem.^{[19]}
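The simplest variant, random guessing, can be sketched in a few lines. This is a toy illustration under stated assumptions: the function names `random_search` and `toy_loss`, the Gaussian sampling of candidate weights, and the two-weight tanh model are hypothetical choices, not taken from the cited source.

```python
import math
import random


def random_search(loss_fn, n_weights, n_trials=1000, scale=1.0, seed=0):
    """Return the best weight vector found by independent random guesses."""
    rng = random.Random(seed)
    best_w = [0.0] * n_weights
    best_loss = loss_fn(best_w)
    for _ in range(n_trials):
        candidate = [rng.gauss(0.0, scale) for _ in range(n_weights)]
        loss = loss_fn(candidate)
        if loss < best_loss:  # keep a guess only if it improves the loss
            best_w, best_loss = candidate, loss
    return best_w, best_loss


def toy_loss(w):
    """Squared error of a two-layer scalar tanh model on one example."""
    output = math.tanh(w[1] * math.tanh(w[0] * 1.0))
    return (output - 0.5) ** 2


weights, loss = random_search(toy_loss, n_weights=2)
print(weights, loss)
```

No derivative is ever computed, so small gradients cannot stall the search; the cost is that the number of guesses needed grows quickly with the number of weights.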

## References

1. S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
2. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
3. Goh, Garrett B.; Hodas, Nathan O.; Vishnu, Abhinav (2017-06-15). "Deep learning for computational chemistry". *Journal of Computational Chemistry*. **38** (16): 1291–1307. arXiv:1701.04503. doi:10.1002/jcc.24764. PMID 28272810.
4. Pascanu, Razvan; Mikolov, Tomas; Bengio, Yoshua (2012-11-21). "On the difficulty of training Recurrent Neural Networks". arXiv:1211.5063 [cs.LG].
5. J. Schmidhuber. "Learning complex, extended sequences using the principle of history compression". *Neural Computation*, 4, pp. 234–242, 1992.
6. Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). *Neural Computation*. **18** (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
7. Hinton, G. (2009). "Deep belief networks". *Scholarpedia*. **4** (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947.
8. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long Short-Term Memory". *Neural Computation*. **9** (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
9. Graves, Alex; Schmidhuber, Jürgen. *Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks*. In Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; Culotta, Aron (eds.), *Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC*. Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.
10. Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition". *IEEE Transactions on Pattern Analysis and Machine Intelligence*. **31** (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. PMID 19299860.
11. Schmidhuber, Jürgen (2015). "Deep learning in neural networks: An overview". *Neural Networks*. **61**: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637.
12. "Residual neural networks are an exciting area of deep learning research". 28 April 2016.
13. http://www.fit.vutbr.cz/research/groups/speech/servite/2010/rnnlm_mikolov.pdf
14. "ResNets, HighwayNets, and DenseNets, Oh My! – Chatbot's Life". 14 October 2016.
15. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Deep Residual Learning for Image Recognition". arXiv:1512.03385 [cs.CV].
16. Veit, Andreas; Wilber, Michael; Belongie, Serge (2016-05-20). "Residual Networks Behave Like Ensembles of Relatively Shallow Networks". arXiv:1605.06431 [cs.CV].
17. Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011-06-14). "Deep Sparse Rectifier Neural Networks". *PMLR*: 315–323.
18. Sven Behnke (2003). *Hierarchical Neural Networks for Image Interpretation* (PDF). Lecture Notes in Computer Science. **2766**. Springer.
19. "Sepp Hochreiter's Fundamental Deep Learning Problem (1991)". *people.idsia.ch*. Retrieved 2017-01-07.