Backpropagation and Gradient Descent

Sometimes the simple questions can be revelatory and make one think about the possibilities we have in front of us to improve existing processes, systems and methods.

On this occasion, it was a simple question on Quora about gradient descent and the ubiquitous backpropagation algorithm used in neural networks and deep learning. The content of my answer follows.

The process of computing weights and biases which minimize the error from a neural network could be any optimization algorithm which is good enough for the job. Gradient descent (especially the versions of the algorithm which use momentum and RMS propagation) are especially effective and have well implemented matrix algebra formulations in languages like C and Python, which makes them used often. Equally though, a genetic algorithm or simulated annealing algorithm (which are more complex and computationally intensive) may be used for finding such weights and biases on each iteration. Indeed, such methods have been and are being researched extensively.[1]

Backpropagation is defined by four equations that help calculate new weights and biases to update a neural network.[2]

The first of these equations helps calculate the error at the output layer. The second helps calculate the error in a given layer based on the error in the next layer. The third and fourth equations help calculate the rate of change of the loss function C with variations in the weights and biases.

Therefore the algorithm itself can be written out as follows[3] :

We then use the gradient of the cost function to compute the new values of w and b, based on things like the learning rate and regularization parameters as applicable.

Why gradient descent? Since the process of backpropagation is iterative (we go from steps 1 – 5 and back again), for each update, we can get better and better versions of the weights w and biases b that are able to reduce the error between the target and the result produced by the network. The following animation (source: Data Blog) probably gives you an idea (the red areas are higher values of Cost, and blue means lower values).

A nice graphic illustrating gradient descent

Now, you might ask : can’t other algorithms be used to do the same thing? The answer is indeed yes. We can use many other optimization algorithms (constrained and unconstrained ones, used for convex and non-convex functions). If you would like to learn about convex optimization with theoretical treatment in more detail, consider this resource: Convex Optimization – Boyd and Vandenberghe. In addition to other convex optimization methods, there’s scores of robust optimization methods such as:

  1. Genetic algorithms
  2. Particle Swarm Optimization
  3. Simulated Annealing
  4. Ant Colony Optimization

While some of these, especially GAs and PSOs have been explored in the context of neural networks, common implementations of deep learning algorithms still rely on the gradient descent family of algorithms (such as Nesterov – which has come to be implemented in a distributed paradigm, RMSProp, Adam).



[2] Neural networks and deep learning

[3] Neural networks and deep learning