Understanding the Logarithm Trick in Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a fundamental and powerful idea that sits at the centre of many things we do with data, so much so that we often use it without realising it. MLE lets us find the parameter values under which a model is most likely to have produced the data we have on our hands. This short post addresses the logarithm trick, which is used to make the MLE calculation simpler.

There are two elements to understanding the formulation of MLE for the common multivariate Gaussian model (the same reasoning extends equally to other models):

  1. The i.i.d. assumption, which simplifies the MLE formulation
  2. The logarithm trick, which enables the MLE formulation to be solved

On this blog I’ve discussed topics like time series analysis in the past, where the idea of independent and identically distributed (i.i.d.) variables comes up; being an important statistical topic, it is well explained and well understood. The logarithm trick, however, is specific to the simplification and solution of MLE formulations, and is worth understanding in its own right.

The logarithm is a strictly increasing (monotonic) function: it changes the scale of the values it is applied to without changing their ordering. This is extremely helpful when we care only about where a function attains its maximum, not what that maximum value is, because the location of the maximum is unchanged after the transformation.

When building a model p(x_1, \ldots, x_n \mid \theta) of the data (x_1, \ldots, x_n), the i.i.d. assumption lets us write the joint likelihood as a product of the individual densities, and the MLE formulation seeks the values of \theta such that

\nabla_{\theta} \prod_{i = 1}^{n} p(x_i \mid \theta) = 0

The interesting thing about the log transform is, as I said earlier, that in the transformation \ln \left( \prod_{i} f_i \right) = \sum_{i} \ln(f_i), there is no change in where the function attains a maximum or a minimum, because the logarithm is monotonic. The gradient condition above can therefore be replaced by the equivalent, and much simpler, condition

\nabla_{\theta} \sum_{i = 1}^{n} \ln p(x_i \mid \theta) = 0

This logarithm trick turns a product of densities into a sum of log-densities, which is far easier to differentiate term by term (and much better behaved numerically), and thereby lets us carry out the MLE.
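To make the trick concrete, here is a minimal sketch in Python (plain NumPy, and a univariate Gaussian rather than the full multivariate case, purely for illustration). It shows why the trick matters in practice: the raw product of densities underflows to zero for even a moderately sized sample, while the sum of log-densities remains perfectly usable, and maximising it yields the familiar closed-form estimates.

```python
import numpy as np

# Simulate i.i.d. data from a Gaussian with known parameters
rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5
x = rng.normal(true_mu, true_sigma, size=1000)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at each point of x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Raw likelihood: a product of 1000 small numbers underflows to exactly 0.0
likelihood = np.prod(gaussian_pdf(x, true_mu, true_sigma))
print(likelihood)         # 0.0 -- useless for optimisation

# Log-likelihood: a sum of log-densities, numerically well behaved
log_likelihood = np.sum(np.log(gaussian_pdf(x, true_mu, true_sigma)))
print(log_likelihood)     # a finite negative number

# Setting the gradient of the log-likelihood to zero gives the closed-form
# MLE: the sample mean and the (biased) sample standard deviation.
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
print(mu_hat, sigma_hat)  # close to 2.0 and 1.5
```

For models without a closed-form solution, the same log-likelihood can simply be handed (negated) to a numerical optimiser such as scipy.optimize.minimize.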

Backpropagation and Gradient Descent

Sometimes the simple questions can be revelatory and make one think about the possibilities we have in front of us to improve existing processes, systems and methods.

On this occasion, it was a simple question on Quora about gradient descent and the ubiquitous backpropagation algorithm used in neural networks and deep learning. The content of my answer follows.

The weights and biases that minimize the error of a neural network can, in principle, be computed by any optimization algorithm that is good enough for the job. Gradient descent (especially the variants of the algorithm that use momentum and RMS propagation) is especially effective and has well-implemented matrix algebra formulations in languages like C and Python, which is why it is used so often. Equally, though, a genetic algorithm or a simulated annealing algorithm (which are more complex and computationally intensive) could be used to find such weights and biases on each iteration. Indeed, such methods have been and are being researched extensively.[1]
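To give a rough sense of what those variants look like, here is a minimal sketch of the vanilla, momentum, and RMSProp update rules in plain NumPy (the function and parameter names are my own, purely for illustration):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla gradient descent: step directly down the gradient."""
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Momentum: accumulate an exponentially decaying average of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad
    return w - lr * velocity, velocity

def rmsprop_step(w, grad, sq_avg, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp: scale each step by a running average of squared gradient magnitudes."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg
```

Because every operation above is elementwise on arrays, these rules map directly onto the fast matrix algebra routines mentioned earlier.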

Backpropagation is defined by four equations that help calculate new weights and biases to update a neural network.[2]

The first of these equations helps calculate the error at the output layer. The second helps calculate the error in a given layer based on the error in the next layer. The third and fourth equations help calculate the rate of change of the loss function C with variations in the weights and biases.
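For reference, these four equations take roughly the following standard form (using the notation of [2]: \delta^l is the error vector at layer l, z^l the weighted inputs, a^l the activations, and \odot the elementwise product):

\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L})

\delta^{l} = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l})

\frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j}

\frac{\partial C}{\partial w^{l}_{jk}} = \delta^{l}_{j} \, a^{l-1}_{k}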

Therefore the algorithm itself can be written out as follows[3]:
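Paraphrasing the steps as laid out in [3]:

  1. Input: set the activation for the input layer.
  2. Feedforward: compute the weighted inputs and activations for each subsequent layer.
  3. Output error: compute the error at the output layer (the first equation).
  4. Backpropagate: compute the error at each earlier layer from the error in the layer after it (the second equation).
  5. Output: compute the gradient of the cost function with respect to every weight and bias (the third and fourth equations).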

We then use the gradient of the cost function to compute the new values of w and b, based on things like the learning rate and regularization parameters as applicable.
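As a rough sketch of how those steps and the update fit together, here is a minimal NumPy implementation for a network with one hidden layer, sigmoid activations and a quadratic cost (the names, shapes and hyperparameters are my own, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, w1, b1, w2, b2, lr=0.1):
    """One backpropagation pass plus a gradient descent update for a
    1-hidden-layer network with quadratic cost C = 0.5 * ||a2 - y||^2."""
    # Steps 1-2. Feedforward: weighted inputs z and activations a per layer
    z1 = w1 @ x + b1
    a1 = sigmoid(z1)
    z2 = w2 @ a1 + b2
    a2 = sigmoid(z2)

    # Step 3. Error at the output layer (first equation); sigma'(z) = a*(1-a)
    delta2 = (a2 - y) * a2 * (1 - a2)

    # Step 4. Backpropagate the error to the hidden layer (second equation)
    delta1 = (w2.T @ delta2) * a1 * (1 - a1)

    # Step 5. Gradients of C w.r.t. weights and biases (third and fourth equations)
    grad_w2, grad_b2 = np.outer(delta2, a1), delta2
    grad_w1, grad_b1 = np.outer(delta1, x), delta1

    # Gradient descent update using the learning rate lr
    return (w1 - lr * grad_w1, b1 - lr * grad_b1,
            w2 - lr * grad_w2, b2 - lr * grad_b2)
```

Looping this over many training examples (or mini-batches of them) is exactly the iterative process described next.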

Why gradient descent? Since the process of backpropagation is iterative (we repeat steps 1 to 5 over and over), each update gives us better and better values of the weights w and biases b, reducing the error between the target and the output produced by the network. The following animation (source: Data Blog) probably gives you an idea (the red areas are higher values of the cost, and blue means lower values).

A nice graphic illustrating gradient descent

Now, you might ask: can’t other algorithms be used to do the same thing? The answer is indeed yes. We can use many other optimization algorithms (constrained and unconstrained, for convex and non-convex functions). If you would like a more detailed theoretical treatment of convex optimization, consider this resource: Convex Optimization by Boyd and Vandenberghe. Beyond convex optimization methods, there are also scores of derivative-free, heuristic optimization methods such as:

  1. Genetic algorithms
  2. Particle Swarm Optimization
  3. Simulated Annealing
  4. Ant Colony Optimization

While some of these, especially GAs and PSO, have been explored in the context of neural networks, common implementations of deep learning algorithms still rely on the gradient descent family of algorithms (such as Nesterov momentum, which has also come to be implemented in distributed settings, RMSProp, and Adam).
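As a toy illustration of the point that gradients are not strictly required, here is a minimal simulated annealing sketch (the function names and hyperparameters are my own, purely illustrative): it perturbs a flattened weight vector at random and accepts or rejects each change based only on the loss value.

```python
import numpy as np

def simulated_annealing(loss_fn, w0, n_iters=5000, step=0.1, t0=1.0, seed=0):
    """Derivative-free optimisation of a weight vector by random perturbation.

    loss_fn: callable mapping a weight vector to a scalar loss
    w0:      initial weight vector (e.g. all of a network's weights, flattened)
    """
    rng = np.random.default_rng(seed)
    w, loss = w0.copy(), loss_fn(w0)
    for i in range(n_iters):
        temperature = t0 * (1 - i / n_iters) + 1e-8   # cool down over time
        candidate = w + rng.normal(scale=step, size=w.shape)
        cand_loss = loss_fn(candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if cand_loss < loss or rng.random() < np.exp((loss - cand_loss) / temperature):
            w, loss = candidate, cand_loss
    return w, loss
```

In practice this scales poorly to the millions of parameters in a modern network, which is a large part of why the gradient descent family still dominates.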

Footnotes

[1] https://arxiv.org/pdf/1711.07655…

[2] Neural networks and deep learning

[3] Neural networks and deep learning