Talk:Backpropagation

(Rated Start-class, Mid-importance)

Backpropagation is within the scope of WikiProject Robotics, which aims to build a comprehensive and detailed guide to Robotics on Wikipedia. If you would like to participate, you can choose to edit this article, or visit the project page (Talk), where you can join the project and see a list of open tasks.

Start

This article has been rated as Start-Class on the project's quality scale.

Mid

This article has been rated as Mid-importance on the project's importance scale.

1 Confusing notation
2 Questions
3 Please translate from German to English
4 History of the technique
5 Tautology?
6 Alternatives
7 Algorithm
8 Connection to Biology
9 Request for comments on biological evidence section
10 Poor summary?
11 Invented by
12 Relationship to Delta Rule?
13 A new algorithm proposal
14 Equations
15 Meaning of Weight
16 Stochastic versus online?
17 Threshold? Derivative? Other activation functions?
18 misleading statements
19 Outdated
20 Modes of learning
21 global convergence?
22 Error in algorithm?
23 Wrong Error Function Visualization and Description under the "Intuition" Heading
24 Sources on the history of backpropagation
25 Remake of the algorithm section
26 Added content from Spanish/German articles. Improved for non-technical readers.
27 Buzzwords
28 Core concepts in first sentences
29 Confusing discussion of gradient descent
30 History section added

Confusing notation[edit]

What does the notation y(x) mean when y is an R^n vector and x is an R^m vector? That notation is found in the loss function section, and I've been having a hard time inferring what it means. 2804:14C:8793:86F6:76D4:35FF:FE5B:6F91 (talk) 16:14, 20 February 2017 (UTC)

It refers to an output vector y corresponding to a training example vector x. The definition at the top of the section (

y,y'\in \mathbb {R} ^{n}

) has been implicitly altered to something more like

y:\mathbb {R} ^{m}\to \mathbb {R} ^{n}

, a mapping from input vector to output vector. I agree it is confusing. Andars97 (talk) 01:46, 22 April 2017 (UTC)

Questions[edit]

Dear All,

Apologies if this is a very naïve question, but in section 6.1 shouldn't the reference be to all neurons directly affected by the neuron (ie L is the next layer), rather than all neurons (L is the total set of neurons lying between the input neuron and the output neurons).

Thx

NeilHyatt (talk) 11:08, 9 January 2017 (UTC)

Dear all, why doesn't this article mention that one needs to learn the biases of multilayer perceptrons? — Preceding unsigned comment added by 92.227.230.174 (talk) 15:10, 12 December 2014 (UTC) Using the neurons weights on it's incoming connections - What does this mean? - Zoe 06:22, 31 January 2003 (UTC)

Hopefully this has been clarified. - Eyrian 05:30, 22 Apr 2005 (UTC)

This is due to that in the structure of a Neural Network, the neuros are all connected. Each link that connects two neurons carries a weight. - BestDaNiu (talk) 13:37, 12 November 2014 (UTC)

Backpropagation usually allows quick convergence on local minima - What does this mean? - Zoe 06:22, 31 January 2003 (UTC)

I think that this is reasonable terminology for use in an article on optimization (though hopefully clarified somewhat). - Eyrian 05:30, 22 Apr 2005 (UTC)

In all optimization problems, the objective is to find the maximum or minimum value of a given function with or without constraints. Backprobagation can be viewed as an optimization problem, as it tries to minimize the cost function between the hypothesis outputs and the actual outputs. And as the objective function of backprobagation is non-convex, given a vector of initially random weights, it usually ends with a local minima. -BestDaNiu (talk) 13:37, 12 November 2014 (UTC)

Is $t_{j}-o_{j}$ in the 2nd to last equation in the "Finding the derivative of the error" supposed to be backward from (negative of) the y - t in ${\frac {\partial E}{\partial o_{j}}}={\frac {\partial E}{\partial y}}={\frac {\partial }{\partial y}}{\frac {1}{2}}(t-y)^{2}=y-t$ ? Couldn't figure out why it would be. — Preceding unsigned comment added by 206.248.153.120 (talk) 03:18, 27 July 2014 (UTC)

Please translate from German to English[edit]

There seems to be much more information in the German version of this page. Please translate it to English. Thanks! --Jcarroll 23:19, 5 May 2006 (UTC)
{{Expand German|Backpropagation|date=March 2009}}

Added the template to the article. Nicholas Léonard 02:05, 8 January 2013 (UTC) — Preceding unsigned comment added by Strife911 (talk • contribs)

History of the technique[edit]

I had heard at some point that a review of Gauss's work indicated that the technique we call backpropagation was actually developed by Gauss. Werbos would then be an independent re-discoverer of the technique. It may even have been Werbos who told me this; I wasn't making notes at the time (about 1990). But I'm not finding corroboration of this, so I thought I would at least broach the topic here. Until there is such corroboration, of course, there should be no change to the article. --Wesley R. Elsberry 11:58, 27 August 2006 (UTC)

I somehow remember that Yann LeCun is blamed for this technique. On his web page, I found an article "generalization using back-propagation" from 1987. (http://yann.lecun.com/exdb/publis/index.html)

The mathematical theory was there for a long time, so a historical rewiev is indeed interesting. —Preceding unsigned comment added by 83.216.234.177 (talk) 09:58, August 28, 2007 (UTC)

Any technique for function minimization (in this case, classification error minimization) that is not simple trial-and-error will predate to the works of Isaac Newton and/or Leibnitz. Perceptron and MLP training are not exceptions, most training algorithms will be variations of Gradient Descent or Newton-Raphson (like Levenberg-Marquadt). Historical review on this would be endless.

--Lucas Gallindo (talk) 19:45, 8 September 2009 (UTC)

Tautology?[edit]

"Backpropagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited." —The preceding unsigned comment was added by 129.170.31.11 (talk • contribs) 03:16, 16 January 2007 (UTC).

I don't think so. The sentence indicates that backpropagation works well on certain networks. --Eyrian 05:49, 16 January 2007 (UTC)

Alternatives[edit]

Actually i would like some links to alternative methods for teaching multi-layered networks. Maybe a see also-section —Preceding unsigned comment added by 80.164.105.90 (talk) 07:11, 8 September 2007 (UTC)

Algorithm[edit]

Citation: Material seems to be directly copied from http://www2.cs.uh.edu/~ceick/ai/nn.ppt (slide 6); if it is, a reference should be included.

Explanation: I'm not understanding how to compute the per-node error. How much blame should be subtracted from each connection's weight, and how much should be propagated back to the inputing nodes'? (I'm assuming that it backpropagates according to the proportion of error contributed by each earlier connection.) --Jesdisciple 03:54, 8 October 2007 (UTC)

I detail the error propagation algorithm in multilayer perceptron. Perhaps some of the math should be copied to here? SamuelRiv (talk) 13:04, 18 November 2007 (UTC)

The phase 1 description and pseudocode need to be more detailed. The key propagation step is briefly described in point 3 of phase 1 and line 7,8 of the pseudocode, but how exactly one does the propagation is not explicitly defined. I.e. you could not write C code from the given information as is. 202.112.90.192 (talk) 04:12, 28 November 2018 (UTC)

Connection to Biology[edit]

term "Backpropagation" is ambiguous

Someone added a section on backpropagation of action potential in biological neurons. I just backed this change out. The material is irrelevant in the context of this article. (In this article, backpropgation refers to reverse propagation of errors through the computation graph. As far as anyone knows, this has nothing to do with the potential jump of action potentials invading the proximal regions of the dendritic arbor) Barak (talk) 19:10, 17 November 2007 (UTC)

How about some tact here. I added 2000 bytes of information, so maybe there should be some discussion before wiping it. The idea of backpropagation as a biological model, as it has been applied by several research groups (IBNS for one) is useless without some justification. A computational model is effectively a neuroscience model, and any information added to this effect is useful. Anyway, if more context or a separate article is needed, then we can add that. But please don't just blank a good half-hour of factual and relevant information. SamuelRiv (talk) 21:14, 17 November 2007 (UTC)

I'm sorry if my change seemed abrupt. The material you added is both interesting and (I say this as someone knowledgeable in the field) entirely correct. The reason I removed it is because it belongs in a different article. Perhaps in an article about action potentials in neurons, or in one about Hebbian synaptic plasticity, or some place like that. But unfortunately, it is not appropriate here: "back propagation" of potential from the soma into the dendritic arbor is a different sense of the term "backpropagation" than that being used here, which refers to a mathematical construction for an estimation problem. Please do not take this personally though, and I am serious when I say that the text you wrote deserves a home. It just doesn't fit here. Barak (talk) 11:05, 18 November 2007 (UTC)

Thanks for the detailed response. I disagree, obviously, because from my perspective artificial and biological neural nets are inseparable concepts. The nonlinear activation function, for example, was partially inspired by biology, and in this case the success of backprop as an algorithm encouraged biologists to find evidence for it in biological systems. I don't think there is enough there for it to warrant a separate article (it still seems like a lot of speculation as to the algorithmic purpose of the retrograde signal). It belongs here because backpropagation is a feedforward algorithm, where as Hebbian learning, for example, has no such bias (it could work with a feedback loop to transmit errors, for instance). I don't know, personally I can't think of a more appropriate place for this information, especially as this article is pretty short already. I'll think about it more, though, and see if there might be a better fit somewhere. SamuelRiv (talk) 13:02, 18 November 2007 (UTC)

I'm sorry but what you're saying doesn't really support the idea that this particular material should be included in this particular article. You're summarizing a biology paper that happens to have the word "backpropagation" in its title. But that paper is using that word in a completely irrelevant way. Any other paper about some detailed electrophyiological measurement would be just as relevant. This article is short, yes. There are many relevant hunks of information that could be added. Eg: a review of backprop-based work that has cast new light on a variety of biological phenomena, such as NETtalk and language acquisition, Qian et al's shape-from-shading model and the bar detector story. Or some hooks into the literature of practical applications could be added: PAPnet is a billion-dollar thousands-of-lives success of backprop. Credit card transaction validation. Loan application testing. ALVINN. Face detection. Facial gender discrimination. Or hooks into the relevant theoretical literature: theory of generalization in such networks, VC dimension bounds, early stopping, optimal brain damage. Convolutive networks and TDNNs enjoy wide application (leNet is the gold standard for handwritten digit recognition, for example.) Pointers to implementations that are really useful, like "lush", so people can go out and build real systems. But a summary of a bit of completely irrelevant biology data just doesn't belong here. (You mention Leon Cooper's group at Brown as if it makes this relevant. But that group doesn't work on backprop or related methods.) Barak (talk) 19:54, 18 November 2007 (UTC)

You still have not addressed why it is irrelevant. Here we have an algorithm that essentially cannot function without some method of backward propagation of signals (simple feedback loops don't work by nature of the correction algorithm). This is used in neural networks, which are almost by definition the best existing model of a functional brain. So as soon as you have an algorithm, you have a brain model. The Brown group works on brain models from both a top-down and bottom-up perspective, so they address both algorithms (mostly RBNs - granted I haven't seen any papers specifically working with backprop) and biological models. So as a model of the brain, which the backpropagation algorithm is, we would naturally want some evidence for it. That's where this addition comes into play.

Now I'd rather have a discussion than an edit war, but I fail to see why, in a dispute, you'd rather default to having less information visible in the article than more. SamuelRiv (talk) 20:15, 18 November 2007 (UTC)

Addendum: I have tagged this page on Wikipedia:Third opinion. Additionally, I want to clarify that IBNS has done work on backprop algorithms, but no paper deals specifically with them. Most use multilayer perceptrons in comparison to RBNs. SamuelRiv (talk) 20:43, 18 November 2007 (UTC)

Let me answer your actual question above, which I paraphrase: "Here we have an algorithm that cannot function without some method of backward propagation of signals ... why is evidence for backward (i.e., into the dendritic arbour) propagation of action potentials not relevant?" The reason is that what is being retrograde transmitted into the dendrites are action potentials, which is the activity of the neurons. It is not the error, which is what would need to be transmitted to implement backprop. It is backward something, but the wrong thing! It is not backward propagation of errors. Barak (talk) 21:00, 18 November 2007 (UTC)

Okay, new paragraph because I'm sick of typing colons. In response, your point about error is fair, in that error or its energy is not necessarily what is being backpropagated. But it can be. Imagine a 3-layer perceptron that wants to learn to match its output to a memorized binary pattern, say. So its output layer connects in a linear 1-1 fashion with the memory, producing a spike frequency in the memory layer that increases with the amount of error in the output, that is, if the output approaches "0" frequency, the memory wants frequency "1" and spikes at full frequency to signal that error. Once the signals match, the memory stops spiking. So each of these error spikes backpropagates, and then you get the full LTP/LTD effects (this last sentence has not yet been observed). I'll see if I can find a paper on this mechanism when I get to the office. SamuelRiv (talk) 15:15, 19 November 2007 (UTC)

Addendum - see [1] if you can open it for a paper outlining a biological backprop mechanism. Does information from this at least deserve to be in the article? SamuelRiv (talk) 19:04, 19 November 2007 (UTC)

That's not a bad paper, although there are two other efforts I'd look at first. One is a paper by Paul Werbos on implementing backprop in neural hardware. The other is the "Recirculation Neural network" architecture (G. E. Hinton and J. L. McClelland, J. L., "Learning representations by recirculation", in D. Z. Anderson (ed), "Neural Information Processing Systems," pages 358-366, AIP, [2]), which was the start of a line of research that, in a modern probabilistic context, developed into the Helmholtz Machine and the Restricted Boltzmann Machine architectures.

A wikipedia article on Biologically Plausible NN Learning Algorithms would be a useful place to have pointers to these various strands of research, and to put them all in perspective. Barak (talk) 10:51, 26 November 2007 (UTC)

Yikes, another thing on my "to do" list. I'll explore making that article (I have to look at the literature on biological SVMs and CMs first). Meanwhile, thanks for the papers, except I can't find the Werbos one - do you remember the title or keywords? SamuelRiv (talk) 16:28, 26 November 2007 (UTC)

Request for comments on biological evidence section[edit]

The question is whether or not a section which was alternately added and removed is pertinent, relevant, and suitable for inclusion in this article. — Athaenara ✉ 03:16, 19 November 2007 (UTC)

Responding to RfC. This article is about a computer algorithm, while the section in question describes a biological process that works in the same way. I don't believe the section should be included, as it is of little relevanse to what I believe the article should be about (the algorithm and its properties). But the information presented in the section is interesting and seems to be notable, so I beleive it should be added to a seperate article and linked to from this article. Labongo (talk) 11:58, 19 November 2007 (UTC)

My main problem is that every neural network algorithm doubles as a biological model for the time being, because we simply don't know for sure what algorithms the brain uses. We have some evidence for several different algorithms, so where do we put all that information? SamuelRiv (talk) 15:17, 19 November 2007 (UTC)

The information should probably be in the other article or some related article, since how the brain works is clearly outside the scope of this article. Labongo (talk) 16:00, 20 November 2007 (UTC)

I created Neural backpropagation, as per the third opinion. However, I frankly think this is ridiculous, as the literature only refers to the phenomenon as "backpropagation" and there is enough overlap between these topics that they deserve separate articles. I'd also like to say that repeatedly deleting large-scale factual and arguably relevant contributions to an article seems to me to violate the spirit of Wikipedia, as stated in its revert policies. Someone will inevitably come along and suggest these articles be merged, if anyone ever reads them. SamuelRiv (talk) 23:58, 20 November 2007 (UTC)

I added otheruses templates to both algorithms, to ensure that there is no confusion with regard to the term. Since the issues seems to be resolved I have also removed to RfC tag from the talkpage. Labongo (talk) 15:32, 22 November 2007 (UTC)

Poor summary?[edit]

It's a very good idea to have a summary of an algorithm like this, but I don't find the summary as it currently stands, to be very good. Thomas Tvileren (talk) 10:01, 19 January 2009 (UTC)

Much better now! Thanks to Strife911. Thomas Tvileren (talk) 17:42, 23 February 2011 (UTC)

Because this is an encyclopedia, I was hoping for an introduction that would give a general idea of the topic to non-specialists. The first sentence refers to the "objective function", but it is not linked or defined, and non-specialists will have no idea what it is. It's my view that the first sentence in any article should come as close as possible to a general description, even if not perfectly precise, without resorting to opaque jargon, and with a minimal number of words and concepts that the general reader has to look up in order to understand that description. I'd like to invite someone who's knowledgeable about the subject take on the challenge of making the whole introductory section more accessible to non-specialists. -Do c t orW 23:36, 23 June 2012 (UTC)

This is not my area, but let me take a stab at what I have in mind, in order to illustrate:

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks. From a desired output, the network learns from many inputs, similar to the way a child learns to identify a dog from examples of dogs. [Perhaps a more exact and more technical, but nevertheless brief, explanation could comprise the next sentence.]

The introduction could then go on to talk about the supervised learning method, and then the delta rule.

The second paragraph could introduce the seminal figures (Bryson & Ho, Werbos, Rumelhart, and Hinton & Williams). -Do c t orW 00:19, 24 June 2012 (UTC)

There was no response to my suggestion from 6 months ago, so I was bold and took a crack at it, even though this is not my area. Please feel free to correct any errors I may have made and continue to make at least the introduction more accessible to non-specialists. This is an encyclopedia, after all. -Do c t orW 09:47, 24 December 2012 (UTC)

Invented by[edit]

The article attributed the first description to "Paul Werbos in 1974". According to my source this is incorrect, I have updated the text to cite Bryson and Ho 1969. I am mentioning this in Talk since I am surprised to see a long-standing error of such importance. 78.105.234.140 (talk) 17:01, 15 October 2009 (UTC)

If one actually reads the relevant section of Bryson and Ho (1969) it plainly (a) cites an earlier journal paper describing the idea, and (b) mentions that this method is an application of Pontryagin's minimum principle. Barak (talk) 22:35, 16 October 2011 (UTC)

Observer's comment: 87.205.28.164 (talk) 21:35, 1 January 2013 (UTC) As an active researcher in Backpropagation and Pontryagin's maximum principle, I can assure you that backpropagation and pontryagain's principle are not connected. Barak, Can you give the page number of the source where you read this, and I'll check it out?

Relationship to Delta Rule?[edit]

The sentence enclosing the link to the Delta rule for this page states:

"It [Back-propagation] is a supervised learning method, and is an implementation of the Delta rule."

Implying that back-propagation is a subset of Delta Rule, but for the link to Back-propagation from the Delta Rule page it states:

"It [The Delta Rule] is a special case of the more general backpropagation algorithm."

Can this be? Or is the wording simply a bit confusing? —Preceding unsigned comment added by 72.221.98.197 (talk) 02:44, 20 December 2010 (UTC)

If you look in books, it appears that the delta rule only works at outputs where you have a target, and the back-prop is a generalization of it. So the first quoted sentence should have "implementation" changed to "generalization", I think. Dicklyon (talk) 03:17, 20 December 2010 (UTC)

If you read the original article by Rumelhart, Hinton & Williams, they explicitly call it the "generalized delta rule." The original delta rule applies to linear single-layer networks, although it can be shown to be identical to the perceptron algorithm. The generalization is to nonlinear networks with a squashing function, and to multilayer networks (hence, "backpropagation", which is unnecessary in single layer networks). -gary cottrell —Preceding unsigned comment added by 75.80.97.109 (talk) 10:00, 25 January 2011 (UTC)

It doesn't explain how the delta rule is generalised. It just duplicates the derivation of the delta rule and then gives up on explaining what happens to hidden nodes. Ptolom (talk) 11:06, 19 January 2014 (UTC)

A new algorithm proposal[edit]

Could we change the layout of the algorithm? Why not make each instruction a link to additional information? But should this information be put in the same page, or a one relating to the 'next level of detail', for the algorithm? —Preceding unsigned comment added by 139.80.206.172 (talk) 08:57, 1 May 2011 (UTC)

Equations[edit]

The article is missing the equation(s) for back propagation learning algorithms. Malithyapa (talk) 23:21, 31 August 2011 (UTC)

Meaning of Weight[edit]

I was unfamiliar with the term "weight" used in this context- I think it ought to be more clearly explained early on. Also, it might be good for someone more knowledgeable to add a quick reference to the Weight disambiguation page. Brauden (talk) 03:20, 7 May 2012 (UTC)

I added a bit at the disambig page. Dicklyon (talk) 04:22, 7 May 2012 (UTC)

Stochastic versus online?[edit]

In adding cross-references to the Modes of Learning section, I found an apparent contradiction between this article and that for stochastic gradient descent (SGD), where the latter (specifically, in the Iterative method section) equates SGD with online learning while this article distinguishes between them. Come anyone wiser than me please advise if these should be equated here or the other article corrected?

Also, the statement that "Stochastic goes through the data set in a random order in order to reduce its chances of getting stuck in local minima" seems quite far from the method and purpose described in the SGD article. Perhaps the Modes of Learning section would be better simply cross-referencing the others rather than trying to summarise them. Or have I entirely missed a point?

Many thanks. p.r.newman (talk) 13:33, 4 April 2013 (UTC)

Threshold? Derivative? Other activation functions?[edit]

1. What happens with the threshold? It is mentioned in othe articles, here is no word of it. In the German version, they say they replace the threshold with an "on-neuron", but even they do not venture to describe it.

2. How do you get the derivative? - I think this is very nicely explained here:

https://en.wikipedia.org/wiki/Delta_rule

3. It should be explained what happens with other activation functions. In particular, that the derivative may actually be 1 in case of a linear activation. It is elsewhere mentioned that you can use tanh as activation, too.

- Nino — Preceding unsigned comment added by 188.23.245.69 (talk) 05:38, 26 February 2014 (UTC)

misleading statements[edit]

It wasn't said or implied in the article that input layer isn't affected by error estimate. Corrected algorithm description. Rest of article still misleading. 2001:470:600D:DEAD:E94B:92C8:B4E1:F8C5 (talk) 11:57, 28 August 2014 (UTC)

Outdated[edit]

This article, or at least parts of it, are quite outdated. They follow the old definition of backpropagation, where this name refers to the entire gradient descent algorithm for squared-error feed-forward networks. The lede, by contrast, uses the more modern meaning, where backprop is the generic gradient computation algorithm based on the chain rule, which can be combined with any gradient-based optimizer (conjugate gradient, L-BFGS) and any differentiable loss (log-loss, squared-error).

I'm planning a rewrite of this entire article to bring it in line with Bishop's (2006) Pattern Recognition and Machine Learning instead of Rojas's (1996) Neural Networks, which (judging by the notation) is the main source for the current text. QVVERTYVS (hm?) 20:34, 4 November 2014 (UTC)

Modes of learning[edit]

Statement "Yet batch learning will yield a much more stable descent to a local minimum since each update is performed based on all patterns." can be misleading. Batch learning requires one to go through all of the test cases, while online learning can cope with a adding a single one, or a few. But if you pass through all of the test cases with on-line learning, you shouldn't end up further away from solution than with a batch learning which should actually get completely abandoned, as it is nonesense. Refer to http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf for explanation why Wiki should abandon any talks of batch learning entirely, due to it not being useful in any sort of way. --213.174.16.2 (talk) 14:38, 21 January 2015 (UTC)

Usefulness is not the criterion for covering something on WP. Also, while batch learning may not be optimal or even better for gradient descent, it also allows using algorithms like L-BFGS or conjugate gradient (see Bishop, Pattern Recognition and Machine Learning, p. 240). These are not easily recast as online or minibatch algorithms. QVVERTYVS (hm?) 16:57, 21 January 2015 (UTC)

global convergence?[edit]

"However, convergence to the global minimum is said to be guaranteed using the adaptive stopping criterion.[3]" is an extraordinary claim. As such, it requires extraordinary proof. The referenced article provides only the claim. Mborg (talk) 15:25, 16 September 2015 (UTC)

It's not really an extraordinary claim. The quotation is not a claim; the quotation reports that "it is said". If you feel that's too strong, you could change it to "articles report..., but there is no proof for it".--Gciriani (talk) 15:52, 16 September 2015 (UTC)

"It is said" is certainly not a strong claim; it's far too vague a claim for a technical article. I've removed the claim and the reference because a poorly-cited article by some unknown researchers in an off-track journal that presents a ridiculous algorithm without proof of effectiveness or even an intuitive justification as to why its crazy formulas should work is not the kind of material on which to base an encyclopedic article. QVVERTYVS (hm?) 20:09, 16 September 2015 (UTC)

Error in algorithm?[edit]

In its current form the algorithm has two nested loops

  initialize network weights (often small random values)
  do
     forEach training example ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute error (prediction - actual) at the output units
        compute  $\Delta w_{h}$  for all weights from hidden layer to output layer  // backward pass
        compute  $\Delta w_{i}$  for all weights from input layer to hidden layer   // backward pass continued
        update network weights // input layer not modified by error estimate
  until all examples classified correctly or another stopping criterion satisfied
  return the network

The way I learned the algorithm, the network weights aren't updated for every training example, but rather once for each iteration of the outer loop. That is the inner loop is used to calculate the gradient and then, after the inner loop, the gradient is used to update the weights. So I suggest

  initialize network weights (often small random values)
  do
     forEach training example ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute error (prediction - actual) at the output units
        compute  $\Delta w_{h}$  for all weights from hidden layer to output layer  // backward pass
        compute  $\Delta w_{i}$  for all weights from input layer to hidden layer   // backward pass continued
     end forEach
     update network weights // input layer not modified by error estimate
  until all examples classified correctly or another stopping criterion satisfied
  return the network

— Preceding unsigned comment added by 142.166.157.21 (talk • contribs)

It's explained (inadequately) below the pseudocode:

Technically speaking, backpropagation calculates the gradient of the error of the network regarding the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error.

The pseudocode is actually that of stochastic gradient descent, not backprop; backprop is the part labeled "backward pass" and practically glanced over. What you're suggesting is batch gradient descent. QVVERTYVS (hm?) 15:24, 18 October 2015 (UTC)

Wrong Error Function Visualization and Description under the "Intuition" Heading[edit]

I believe that under the "Intuition" heading, given the current example of (x1,x2,t) = (1,1,0), the correct error function is NOT a parabolic bowl as written, but instead what Wolfram Alpha refers to as a parabolic cylinder. I believe the correct graphic should be of E= 0 - (x1*w1 + x2*w2)^2, and not the current graphic which models E= 0 - [(x1w1)^2 + (x2w2)^2].

I concur - this is a very confusing picture. If w1=-w2 then the error is zero. Henk.muller (talk) 10:37, 6 June 2016 (UTC)

Sources on the history of backpropagation[edit]

According to various sources,^[1]^[2]^[3]^[4] basics of continuous backpropagation were derived in the context of control theory by Henry J. Kelley^[5] in 1960 and by Arthur E. Bryson in 1961,^[6] using principles of dynamic programming. In 1962, Stuart Dreyfus published a simpler derivation based only on the chain rule.^[7]

In 1970, Seppo Linnainmaa finally published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions.^[8]^[9] This corresponds to the modern version of backpropagation which is efficient even when the networks are sparse.^[10]^[11]^[3]^[4]

In 1973, Stuart Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients.^[12] In 1974, Paul Werbos mentioned the possibility of applying this principle to artificial neural networks,^[13] and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.^[14]^[4]

In 1986, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams showed through computer experiments that this method can generate useful internal representations of incoming data in hidden layers of neural networks.^[15] In 1993, Eric A. Wan was the first^[3] to win an international pattern recognition contest through backpropagation.^[16]

One should insert the missing references in the article. Mathpriest (talk) 15:56, 28 March 2016 (UTC)

Remake of the algorithm section[edit]

I have written a candidate for a better description of the algorithm here: User:Esraiak/sandbox, and I am considering replacing the current Algorithm (and/or Summary) section with that one.

I will do the replacement in a few days unless someone objects.

Esraiak (talk) 14:25, 28 September 2016 (UTC)

Added content from Spanish/German articles. Improved for non-technical readers.[edit]

I added some sections from the German article (mostly the Extension section) and from the Spanish article (introduction).

This article contains a translation of Backpropagation from es.wikipedia.

This article contains a translation of Backpropagation from de.wikipedia.

I also tried to make the introduction and a few other sections better for non-technical readers. AlgorithmEdits (talk) 02:02, 12 December 2016 (UTC)

Buzzwords[edit]

Some buzzwords in the algorithm description: input thought, output activations, weight-synapse.

With a bit of critical thinking it's possible to infer what they refer to, but they are not defined here and inappropriate (subjective interpretations) for a purely technical description. — Preceding unsigned comment added by 88.177.31.169 (talk) 15:38, 26 December 2016 (UTC)

Core concepts in first sentences[edit]

I am unhappy with the first sentences in this article. I don't mean that they are wrong, only that they do not explain well what it is all about. Here is a better explanation that I found in Deep Learning by Goodfellow, Bengio and Courville, 2016. I think this is now the dominating course book on Deep Learning. It says on page 198: "The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi layer neural networks. Actually back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform the learning." That is, when tuning an ANN to perform well on a set of data, the weights in the network are changed so that a cost calculated from the difference between the network output for the data and the known correct output is minimized. This minimization algorithm calculates the gradient of the cost as a function of the weights. Back-propagation is a method to calculate that gradient. Therefore, in my view, backprop is a method to calculate a gradient that is needed in the calculation of the weights to be used in an artificial neural network. --Ettrig (talk) 08:50, 5 February 2018 (UTC)

Confusing discussion of gradient descent[edit]

Why is gradient descent mentioned in this wiki article? Backpropagation is an algorithm to find the gradients, then what you do with those gradients is an entirely different issue. Reading the wiki one may think that backpropagation also concerns itself with SGD or similar; see discussion about choosing an adaptive learning rate, etc.

What do you think? Should we delete those discussions? User:Daniel.Cardenas

Good point! For orientation, gradient descent should be mentioned as the most common application and the context in which it was developed. Currently gradient descent occurs 12 times. That is far too much. The subject is difficult as it is. We should not add to the confusion by adding off-topic material.--Ettrig (talk) 06:42, 28 November 2018 (UTC)

There are whole sections of such off-topic material. Should this be just removed, or dumped on articles where it belongs? --Ettrig (talk) 07:20, 28 November 2018 (UTC)

I moved it to artificial neural network. That is at least a small step in the right direction. But a lot of restructuring of ANN material is neede. --Ettrig (talk) 14:16, 28 November 2018 (UTC)

10K bytes moved, out of 38K. Now, there is a lot to clean up in ANN, but at least it is in the right place. There is a lot of mathematical notation, which is cumbersome to produce. So it would be unfortunate to just lose it. I think it is mostly correct. --Ettrig (talk) 16:43, 28 November 2018 (UTC)

History section added[edit]

I added a history section. I consider it a poor attempt, but figured something was better than nothing. Please improve. Thanks! Daniel.Cardenas (talk) 16:14, 30 September 2018 (UTC)

There was a history at the bottom of the article. I merged yours with that one.--Ettrig (talk) 14:17, 28 November 2018 (UTC)

^ Stuart Dreyfus (1990). Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure. J. Guidance, Control and Dynamics, 1990.

^ Eiji Mizutani , Stuart Dreyfus, Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como Italy, July 2000. Online

^ ^a ^b ^c Jürgen Schmidhuber (2015). Deep learning in neural networks: An overview. Neural Networks 61 (2015): 85-117. ArXiv

^ ^a ^b ^c Jürgen Schmidhuber (2015). Deep Learning. Scholarpedia, 10(11):32832. Section on Backpropagation

^ Henry J. Kelley (1960). Gradient theory of optimal flight paths. Ars Journal, 30(10), 947-954. Online

^ Arthur E. Bryson (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.

^ Stuart Dreyfus (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1), 30-45. Online

^ Seppo Linnainmaa (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6-7.

^ Seppo Linnainmaa (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146-160.

^ Griewank, Andreas (2012). Who Invented the Reverse Mode of Differentiation?. Optimization Stories, Documenta Matematica, Extra Volume ISMP (2012), 389-400.

^ Griewank, Andreas and Walther, A.. Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM, 2008.

^ Stuart Dreyfus (1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385.

^ Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.

^ Paul Werbos (1982). Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization (pp. 762-770). Springer Berlin Heidelberg. Online

^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0.

^ Eric A. Wan (1993). Time series prediction by using a connectionist network with internal delay lines. In SANTA FE INSTITUTE STUDIES IN THE SCIENCES OF COMPLEXITY-PROCEEDINGS (Vol. 15, pp. 195-195). Addison-Wesley Publishing Co.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]