Maximum Likelihood Estimation (MLE)

Pratima Rathore
Jun 8, 2020

Let's understand a few terms before getting to MLE.

Conditional Probability

  • The probability of an event A given that event B has occurred is called a conditional probability. It is written P(A|B), which reads “the probability of A given B”.

For example, consider the probability that a person is an alcoholic given that the person is a man.

Conditional Probability Formula: P(A|B) = P(A and B)/P(B)
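As a quick illustration with made-up counts (the numbers below are hypothetical), we can estimate these probabilities from a sample and apply the formula:

```python
# Hypothetical counts: 1,000 people, 480 men, 60 of whom are alcoholic men.
n_total = 1000
n_man = 480                # event B: "is a man"
n_alcoholic_and_man = 60   # event "A and B"

p_b = n_man / n_total
p_a_and_b = n_alcoholic_and_man / n_total

# Conditional probability P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.125
```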

Probability density is the relationship between observations and their probability.

The overall shape of the probability density is referred to as a probability distribution, and the calculation of probabilities for specific outcomes of a random variable is performed by a probability density function, or PDF for short.
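For instance, here is a small sketch using SciPy's standard normal (the choice of library and distribution is mine, purely for illustration); the PDF assigns a density to each specific outcome:

```python
from scipy.stats import norm

# Density of a standard normal (mean 0, std 1) at a few specific outcomes.
# Note: for continuous variables these are densities, not probabilities.
for x in [-2.0, 0.0, 2.0]:
    print(x, norm.pdf(x, loc=0.0, scale=1.0))
# The density peaks at the mean (x = 0) and falls off in the tails.
```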

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain. It involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (X).

  • How do you choose the probability distribution function?
  • How do you choose the parameters for the probability distribution function?

This problem is made more challenging because the sample (X) drawn from the population is small and noisy, meaning that any evaluation of an estimated probability density function and its parameters will have some error.

There are many techniques for solving this problem, although two common approaches are:

  • Maximum a Posteriori (MAP), a Bayesian method.
  • Maximum Likelihood Estimation (MLE), a frequentist method.

The main difference is that MLE assumes that all solutions are equally likely beforehand, whereas MAP allows prior information about the form of the solution to be harnessed.

  • Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and distribution parameters. This approach can be used to search a space of possible distributions and parameters.
  • It involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample (X).
  • First, it involves defining a parameter called theta that defines both the choice of the probability density function and the parameters of that distribution. It may be a vector of numerical values whose values change smoothly and map to different probability distributions and their parameters.
  • In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

P(X ; theta)

This conditional probability is often stated using the semicolon (;) notation instead of the bar notation (|) because theta is not a random variable, but instead an unknown parameter.

Here X stands for the whole sample of n observations x1, x2, …, xn, so the same quantity can be written as:

P(x1, x2, x3, …, xn ; theta)

  • This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation L() to denote the likelihood function. For example:

L(X ; theta)

The objective of Maximum Likelihood Estimation is to find the set of parameters (theta) that maximize the likelihood function, e.g. result in the largest likelihood value.

maximize L(X ; theta)
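As a small sketch of this search (my own illustration, assuming a Gaussian with the standard deviation fixed at 1 so that theta is just the mean), we can evaluate the log-likelihood over a grid of candidate means and keep the argmax:

```python
import numpy as np
from scipy.stats import norm

X = np.array([-1.0, 3.0, 7.0])                 # observed sample
candidate_means = np.linspace(-5, 10, 301)     # candidate values of theta (the mean)

# Log-likelihood of the sample for each candidate mean (std fixed at 1 for simplicity).
log_likelihoods = [norm.logpdf(X, loc=mu, scale=1.0).sum() for mu in candidate_means]

best_mu = candidate_means[np.argmax(log_likelihoods)]
print(best_mu)  # close to the sample mean, 3.0
```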

  • Assuming the observations are independent and identically distributed, the joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters.

product i to n P(xi ; theta)

Multiplying many small probabilities together can be numerically unstable in practice, therefore, it is common to restate this problem as the sum of the log conditional probabilities of observing each example given the model parameters.

sum i to n log(P(xi ; theta))

  • “This product over many probabilities can be inconvenient; it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum.”
  • It is common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as the Negative Log-Likelihood (NLL) function:

minimize -sum i to n log(P(xi ; theta))

In software, we often phrase both as minimizing a cost function; maximum likelihood thus becomes minimization of the negative log-likelihood (NLL).
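To make this concrete, here is a minimal sketch (my own illustration, not from the article) that fits a Gaussian to a tiny sample by minimizing the negative log-likelihood with scipy.optimize.minimize; parameterizing via the log standard deviation is my own choice to keep the scale positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

X = np.array([-1.0, 3.0, 7.0])

def nll(theta):
    """Negative log-likelihood of a Gaussian; theta = (mean, log_std)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)        # keep the standard deviation positive
    return -np.sum(norm.logpdf(X, loc=mu, scale=sigma))

result = minimize(nll, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # roughly 3.0 and sqrt(32/3) ≈ 3.27
```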

Relationship to Machine Learning

Unsupervised

This problem of density estimation is directly related to applied machine learning.

We can frame the problem of fitting a machine learning model as the problem of probability density estimation. Specifically, the choice of model and model parameters is referred to as a modeling hypothesis h, and the problem involves finding h that best explains the data X.

P(X ; h)

We can, therefore, find the modeling hypothesis that maximizes the likelihood function:

maximize L(X ; h)

Or, more fully:

maximize sum i to n log(P(xi ; h))

This provides the basis for estimating the probability density of a dataset, typically used in unsupervised machine learning algorithms; for example:

  • Clustering algorithms. Using the expected log joint probability as a key quantity for learning in a probability model with hidden variables is better known in the context of the celebrated “expectation-maximization” (EM) algorithm.
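As a hedged example, scikit-learn's GaussianMixture fits a mixture of Gaussians by EM; the synthetic data below are my own:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two well-separated Gaussians.
X = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)]).reshape(-1, 1)

# EM alternates between soft cluster assignments (E-step) and
# maximum-likelihood updates of the means/variances/weights (M-step).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())   # approximately [-4, 4] (order may vary)
```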

Supervised

The Maximum Likelihood Estimation framework is also a useful tool for supervised machine learning.

This applies to data where we have input and output variables, where the output variable may be a numerical value or a class label, in the case of regression and classification predictive modeling respectively.

We can state this as the conditional probability of the output (y) given the input (X) and the modeling hypothesis (h):

maximize L(y|X ; h)

Or, more fully:

maximize sum i to n log(P(yi|xi ; h))

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x ; theta) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning.
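A small sketch of this conditional likelihood (a hand-rolled logistic model on toy data; the weights and data are placeholders of my own choosing):

```python
import numpy as np

# Toy binary classification data (x, y) and a candidate weight/bias h = (w, b).
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([0, 0, 1, 1, 1])
w, b = 1.0, 0.0

# P(y=1 | x ; h) under a logistic model.
p1 = 1.0 / (1.0 + np.exp(-(w * x + b)))

# Conditional log-likelihood: sum over examples of log P(yi | xi ; h).
log_likelihood = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
print(log_likelihood)
```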

SUMMARY

  1. Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
  2. It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
  3. It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.

Say you have some data. Say you're willing to assume that the data come from some distribution, perhaps a Gaussian. There are an infinite number of different Gaussians the data could have come from (corresponding to all the possible combinations of mean and variance a Gaussian can have). MLE will pick the Gaussian (i.e., the mean and variance) that is “most consistent” with your data. So, say you've got a data set of y = {−1, 3, 7}. The most consistent Gaussian from which that data could have come has a mean of 3 and a variance of 32/3 ≈ 10.7 (the MLE variance estimate divides by n rather than n − 1).
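Checking that claim numerically (a quick NumPy sketch):

```python
import numpy as np

y = np.array([-1.0, 3.0, 7.0])

# Closed-form MLE for a Gaussian: the sample mean, and the variance with a 1/n divisor.
mu_hat = y.mean()
var_hat = np.mean((y - mu_hat) ** 2)
print(mu_hat, var_hat)  # 3.0 and 32/3 ≈ 10.67
```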

Examples:

  • Maximum Likelihood Estimation for discrete distributions, e.g. the Bernoulli distribution (a quick sketch follows this list).
  • Maximum Likelihood Estimation for continuous distributions, e.g. a normal (or Gaussian) distribution, where the parameters are the mean μ and the standard deviation σ.
  • Maximum likelihood estimation of the logistic classification model (also called the logit model or logistic regression).
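For the Bernoulli case the MLE has a closed form: the likelihood-maximizing parameter is simply the fraction of successes. A minimal sketch on made-up coin flips:

```python
import numpy as np

# Ten hypothetical coin flips (1 = heads, 0 = tails).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# For a Bernoulli(p), maximizing the log-likelihood
#   sum_i [ yi*log(p) + (1 - yi)*log(1 - p) ]
# gives p_hat = the mean of the observations.
p_hat = flips.mean()
print(p_hat)  # 0.7
```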

MLE vs Gradient Descent

  • MLE helps us choose the distribution, and hence the objective function, to optimize.
  • Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain a closed-form solution. However, in the case of logistic regression (and many other complex or otherwise non-linear models), this analytical approach doesn't work. Instead, we resort to a method known as gradient descent: we randomly initialize the weights and then incrementally update them using the slope (gradient) of the objective function, continuing until the gradient gets as close to zero as possible (see the sketch below).
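Here is a minimal sketch of that gradient-descent loop for logistic regression (toy data; the learning rate and iteration count are my own choices):

```python
import numpy as np

# Toy 1-D binary classification data.
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([0, 0, 1, 1, 1])

w, b = 0.0, 0.0          # initialize the weights
lr = 0.1                 # learning rate

for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted P(y=1 | x)
    # Gradients of the average negative log-likelihood w.r.t. w and b.
    grad_w = np.mean((p - y) * x)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # weights that (approximately) minimize the NLL
```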
