Estimators
The idea of statistics is to collect, organize, analyze, and interpret data. Estimators are a key concept in statistics, as they provide a way to relate the collected data to the larger population from which it was drawn. Based on these estimates, certain properties or behaviors of the population can be inferred.
The idea of estimators is that we have some observed data points and want to analyze them to make inferences about the underlying mechanism that generated the data. So we start with a dataset of data points \(x_1, x_2, \ldots, x_n\). This collection/dataset of observations/data points is also referred to as a sample, and the number of observations in the sample \(n\) is referred to as the sample size.
The first step we commonly take with the sample is to present the data visually in a meaningful way, such as through graphs or charts, to identify patterns or trends. This is so-called exploratory data analysis (EDA) and falls under descriptive statistics.
The second step is where our estimators come into play. The idea is that the data points are generated by some random variables \(X_1, X_2, \ldots, X_n\) which are drawn from some underlying distribution. This distribution is also referred to as the population distribution or simply the model. This model is parameterized by some unknown parameters \(\theta\) from the parameter space \(\Theta\) along with a probability measure \(P_\theta\). Notice that our model is a joint distribution over the random variables \(X_1, X_2, \ldots, X_n\). The reason for this is that there can be dependencies between the random variables, meaning the value of one variable can influence the value of another. However, we can also consider the case where the random variables are independent and identically distributed (i.i.d.), in which case the joint distribution can be factored into the product of the individual distributions.
We now want to use the sample data to estimate the parameters of the model such that they are as close as possible to the true parameters. This is what estimators are for: to find the parameters \(\theta\). These are of course simplifications/abstractions of the real-world processes that generated the data, as the real world is too complex. However, we can imagine that nature (or god, whichever you prefer) has some underlying mechanism that picked \(\theta\) and \(P_\theta\) and was responsible for generating the observed data points.
So our process is as follows:
- Collect sample data points \(x_1, x_2, \ldots, x_n\).
- Perform exploratory data analysis (EDA) to identify patterns or trends.
- Assume each \(X_i\) is a realization of a random variable with common distribution \(P_\theta\) indexed by an unknown parameter (or vector) \(\theta\in\Theta\).
- Define the probability space \((\Omega, \mathcal{F}, P_\theta)\) where \(\Omega\) is the sample space, \(\mathcal{F}\) is the \(\sigma\)-algebra and for each parameter selection \(\theta \in \Theta\), \(P_\theta\) is the probability measure.
- Based on the defined model and the observed data, we can now construct our estimators for the unknown parameters \(\theta\).
- Evaluate the performance of the chosen parameters \(\theta\) and model \(P_\theta\) to check if they fit the observed data well using statistical tests.
- Rather than committing to a single set of parameters \(\theta\), we can also choose a region in the parameter space \(\Theta\) that is, with high probability, consistent with the observed data. These regions are often referred to as confidence intervals.
So we have seen that estimators are mathematical tools used to make inferences about unknown parameters \(\theta\) for a chosen model family \((P_\theta)_{\theta \in \Theta}\) using observed data. So given a sample of random variables \(X_1, X_2, \ldots, X_n\) we formally define an estimator as a random variable:
\[\begin{align*} T: \Omega \to \mathbb{R} \\ T = t(X_1, X_2, \ldots, X_n) \end{align*} \]where \(T\) is the estimator. The estimator is like a rule or formula that tells us how to use our data \(X_1, X_2, \ldots, X_n\) to estimate the unknown parameter \(\theta\). The function \(t: \mathbb{R}^n \to \mathbb{R}\) is a measurable function: once we observe data \(x_i = X_i(\omega)\) for \(i=1, 2, \ldots, n\), we can use it to compute the estimate \(\hat{\theta} = t(x_1, x_2, \ldots, x_n)\), a value in \(\mathbb{R}\) that is our guess for the unknown parameter \(\theta\).
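To make the distinction between the rule and the resulting number concrete, here is a minimal Python sketch; the function and variable names are purely illustrative:

```python
# A minimal sketch: an estimator is just a rule t that maps a sample to a number.
def t(sample):
    """The rule t: R^n -> R. Here we use the average as an example rule."""
    return sum(sample) / len(sample)

# Once we observe concrete data x_1, ..., x_n, applying the rule
# yields the estimate theta_hat, which is a plain number.
observed = [0.2, 0.5, 0.9]  # hypothetical observations
theta_hat = t(observed)
print(theta_hat)  # 0.53...
```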
Just like with making cereal, where you can either put in the milk first and then the cereal or the other way around (there is only one good way), there are two ways you can make an English tea: pour in the milk first or the tea first (again, as a Brit will tell you, there is only one good way). So we define the following ways to make tea:
- Type 1: Pour the milk first, then add tea.
- Type 2: Pour the tea first, then add milk.
Now an old British lady (think of the Queen with her dogs) claims that she can taste the difference and can tell which method was used just by sipping the tea. To test her claim, we conduct a blind taste test over \(n\) days where she is given 2 cups of tea, one made with each method, and asked to identify which cup is type 1, i.e. where the milk was poured first. We record her responses, where 1 indicates a correct classification and 0 an incorrect classification:
\[x_1, x_2, \ldots, x_n \in \{0, 1\} \]We treat this sample of responses as a realization of the random variables \(X_1, X_2, \ldots, X_n\). For our model it naturally follows that we can think of each day as a Bernoulli trial, where the lady has an unknown success probability \(\theta\) of correctly identifying the tea type. This \(\theta\) is the parameter that we want to find and is what defines our model \(P_\theta\). So we assume that \(X_i \sim \text{Bernoulli}(\theta)\) for each \(i = 1, 2, \ldots, n\) and that the trials are i.i.d. Then we can get the random number of correct classifications by summing the observations:
\[S_n = \sum_{i=1}^{n} X_i \]And because the \(X_i\) are i.i.d. \(\text{Bernoulli}(\theta)\), we have:
\[S_n \sim \text{Binomial}(n, \theta) \]We can also define the actual observed number of correct classifications as:
\[s_n = \sum_{i=1}^{n} x_i \]Based on our chosen family of models we could now construct some estimators to find the unknown parameter \(\theta\). A possible estimator could be a last observation estimator which just takes the last observation:
\[T^{(1)} = X_n \]for our observed data, this would give us the estimate
\[t^{(1)}(x_1, x_2, \ldots, x_n) = x_n \]Another possible estimator could be a sample mean estimator which takes the average of all observations:
\[T^{(2)} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_n \]Because we see this averaged sum very often in statistics, we often denote the sample mean as \(\bar{X}_n\). For our observed data, this would give us the estimate
\[t^{(2)}(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i \]As a concrete example, if we observed the following classifications:
\[x_1 = 1, x_2 = 0, x_3 = 1, x_4 = 1, x_5 = 0 \]Then our estimates would be:
\[t^{(1)}(x_1, x_2, \ldots, x_n) = x_n = 0 \]which would mean an estimated success probability of 0, i.e. the lady is never correct. For the sample mean estimator we have:
\[t^{(2)}(x_1, x_2, \ldots, x_n) = \frac{1}{5} \sum_{i=1}^{5} x_i = \frac{1 + 0 + 1 + 1 + 0}{5} = \frac{3}{5} = 0.6 \]Meaning our estimate for the true success probability \(\theta\) is 0.6.
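As a quick sanity check, here is a minimal Python sketch (standard library only) that computes both estimates for the observed classifications:

```python
# Two estimators for the Bernoulli success probability theta,
# applied to the observed tea classifications.
x = [1, 0, 1, 1, 0]  # x_1, ..., x_5

def t1(sample):
    """Last-observation estimator: t^(1)(x_1, ..., x_n) = x_n."""
    return sample[-1]

def t2(sample):
    """Sample-mean estimator: t^(2)(x_1, ..., x_n) = (1/n) sum x_i."""
    return sum(sample) / len(sample)

print(t1(x))  # 0
print(t2(x))  # 0.6
```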
Suppose we want to estimate the average height of all students at a university. We can’t measure every student, so we randomly sample \(n\) students and measure their heights. Let’s denote their measured heights as:
\[x_1, x_2, \ldots, x_n \in \mathbb{R} \]We treat this sample as a realization of random variables \(X_1, X_2, \ldots, X_n\).
For our model, let’s assume that heights are normally distributed with some unknown mean \(\theta\) (which we want to estimate), but a known variance \(\sigma^2\):
\[X_i \sim \mathcal{N}(\theta, \sigma^2), \quad i = 1, 2, \ldots, n \]with each \(X_i\) being i.i.d.
The parameter \(\theta\) here represents the true average (mean) height in the student population, and this is the parameter of our model family \(P_\theta\) that we want to estimate.
Again a natural estimator for \(\theta\) could be the sample mean:
\[T = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_n \]This is a random variable, since it depends on the random sample. For observed data, the estimate we compute is:
\[t(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i \]So if we were to observe the following heights (in cm):
\[x_1 = 170, x_2 = 172, x_3 = 168, x_4 = 174, x_5 = 176 \]then our estimate would be:
\[t(170, 172, 168, 174, 176) = \frac{170 + 172 + 168 + 174 + 176}{5} = 172 \]So, based on this sample, our estimated average student height \(\theta\) is 172 cm.
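To emphasize that an estimator is itself a random variable before the data is observed, here is a small simulation sketch; the true mean of 172 cm and standard deviation of 6 cm are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n = 172.0, 6.0, 5  # assumed true parameters

# Each freshly drawn sample gives a different realization of the estimator T.
for _ in range(3):
    sample = rng.normal(true_mu, sigma, size=n)
    print(round(sample.mean(), 2))  # estimates scatter around 172
```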
Bias
Once we have constructed an estimator \(T\) for our parameter of interest \(\theta\), we want to know how good our estimator is. One way to assess this is by examining its bias. The bias of an estimator tells us whether our estimator systematically overestimates or underestimates the true parameter value. More formally, for any parameter value \(\theta \in \Theta\), the bias of an estimator \(T\) is defined as:
\[\text{Bias}_{\theta}(T) = \E_\theta[T] - \theta \]Here, \(\E_\theta[T]\) is the expected value (average outcome) of \(T\) under the assumption that the true parameter is \(\theta\). If the bias is zero for all \(\theta\), i.e.
\[\text{Bias}_{\theta}(T) = 0 \quad \Longleftrightarrow \quad \E_\theta[T] = \theta \quad \forall \theta \in \Theta, \]then \(T\) is called unbiased. This means that if we were to repeat our data collection and estimation process over and over, on average, our estimator would return the true parameter value. If the bias is nonzero, then the estimator is biased. This means our estimator is systematically too high or too low, even on average.
Mean Squared Error (MSE)
While bias tells us about the average “miss”, it does not tell us about the variability of the estimator from sample to sample. For that, we use the mean squared error (MSE). The MSE is defined as:
\[\text{MSE}_{\theta}(T) = \E_\theta[(T - \theta)^2] \]This measures the average squared distance between our estimator and the true parameter. It combines the systematic error (bias) and the random error (variance) into a single quantity, which lets us rewrite the MSE as:
\[\text{MSE}_{\theta}(T) = \mathbb{V}_\theta(T) + \text{Bias}_{\theta}(T)^2 \]Here, \(\mathbb{V}_\theta(T)\) is the variance of the estimator \(T\) under the model with parameter \(\theta\).
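To see why this decomposition holds, add and subtract \(\E_\theta[T]\) inside the square; the cross term vanishes because \(\E_\theta[T] - \theta\) is a constant and \(\E_\theta[T - \E_\theta[T]] = 0\):
\[\begin{align*} \E_\theta[(T - \theta)^2] &= \E_\theta[(T - \E_\theta[T] + \E_\theta[T] - \theta)^2] \\ &= \E_\theta[(T - \E_\theta[T])^2] + 2(\E_\theta[T] - \theta)\E_\theta[T - \E_\theta[T]] + (\E_\theta[T] - \theta)^2 \\ &= \mathbb{V}_\theta(T) + \text{Bias}_{\theta}(T)^2 \end{align*} \]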
The MSE you see here is exactly the same as the mean squared error that is often minimized in machine learning, for example when fitting a regression model. In both cases, the goal is to find a function or estimator that is as close as possible (on average) to the quantity we are trying to estimate.
Let’s revisit the tea tasting lady experiment, where each \(X_i\) is a Bernoulli random variable with success probability \(\theta\), and we want to estimate \(\theta\) using two different estimators. Our first estimator is based on the last observation \(T^{(1)} = X_n\). First we can calculate the expected value of our estimator:
\[\E_\theta[T^{(1)}] = \E_\theta[X_n] = \theta \]The expectation equals the true parameter \(\theta\) because \(X_n\) is a Bernoulli random variable, whose expected value is its success probability. This means that for the bias we get:
\[\text{Bias}_\theta(T^{(1)}) = \E_\theta[T^{(1)}] - \theta = \theta - \theta = 0 \]For the variance we can either use the variance definition and the fact that \(X_n\) is a Bernoulli random variable or calculate it using the expectation trick:
\[\V_\theta(T^{(1)}) = \V_\theta(X_n) = \theta(1 - \theta) \quad \text{or} \quad \V_\theta(T^{(1)}) = \E_\theta[(T^{(1)})^2] - \E_\theta[T^{(1)}]^2 = \theta - \theta^2 = \theta(1 - \theta) \]We can then calculate the MSE as follows:
\[\text{MSE}_\theta(T^{(1)}) = \E_\theta[(T^{(1)} - \theta)^2] = \V_\theta(T^{(1)}) + \text{Bias}_\theta(T^{(1)})^2 = \theta(1 - \theta) + 0 = \theta(1 - \theta) \]Now let’s look at the second estimator, the sample mean \(T^{(2)} = \frac{1}{n} \sum_{i=1}^{n} X_i\). Again we can calculate the expected value:
\[\E_\theta[T^{(2)}] = \E_\theta\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n \E_\theta[X_i] = \frac{1}{n} n \theta = \theta \]This means that the bias is again zero:
\[\text{Bias}_\theta(T^{(2)}) = \E_\theta[T^{(2)}] - \theta = \theta - \theta = 0 \]However, the variance of the two estimators is different:
\[\V_\theta(T^{(2)}) = \V_\theta\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \V_\theta(X_i) = \frac{1}{n^2} n \theta(1 - \theta) = \frac{1}{n} \theta(1 - \theta) \]This also results in the MSE being different:
\[\text{MSE}_{\theta}(T^{(2)}) = \E_\theta[(T^{(2)} - \theta)^2] = \V_\theta(T^{(2)}) + \text{Bias}_\theta(T^{(2)})^2 = \frac{1}{n} \theta(1 - \theta) + 0 = \frac{1}{n} \theta(1 - \theta) \]Both estimators are unbiased because their bias is zero, but they have different variances and therefore different MSEs. Because the sample mean estimator \(T^{(2)}\) uses all the data, its variance (and MSE) is much smaller than that of the estimator that uses only the last observation, \(T^{(1)}\). We can interpret this as the sample mean extracting more information from the data, since it incorporates all available observations rather than just the last one.
This means that, even though both estimators are correct on average, the sample mean estimator is much more reliable and concentrated around the true value. In particular, as the sample size increases, the variance and the MSE of the sample mean estimator tend to zero, which implies that it becomes a very good estimator for the true parameter \(\theta\).
As a concrete number example, suppose \(\theta = 0.5\) and \(n = 10\). Then \(\text{MSE}_\theta(T^{(1)}) = 0.5 \cdot 0.5 = 0.25\), while \(\text{MSE}_\theta(T^{(2)}) = \frac{0.25}{10} = 0.025\), an order of magnitude smaller.
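We can also check these numbers empirically; the following Monte Carlo sketch (with an arbitrarily chosen true \(\theta\) and number of replications) estimates both MSEs by simulation:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, n, reps = 0.5, 10, 100_000  # assumed true parameter, sample size, replications

# Draw `reps` independent samples of size n and apply both estimators to each.
samples = rng.binomial(1, theta, size=(reps, n))
t1 = samples[:, -1]        # last-observation estimator T^(1)
t2 = samples.mean(axis=1)  # sample-mean estimator T^(2)

# Empirical MSE: average squared distance to the true theta.
print(((t1 - theta) ** 2).mean())  # ~0.25  = theta(1 - theta)
print(((t2 - theta) ** 2).mean())  # ~0.025 = theta(1 - theta)/n
```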
Maximum Likelihood Estimation (MLE)
Why is the MLE estimator nice?
- It is consistent: As the sample size increases, the MLE converges in probability to the true parameter value and is also asymptotically unbiased.
- It is efficient: roughly speaking, among all well-behaved estimators, the MLE asymptotically achieves the lowest possible variance (the Cramér-Rao lower bound) as the sample size goes to infinity. It is also asymptotically normal, a property closely tied to the central limit theorem.
The maximum likelihood estimation (MLE) method is a powerful and widely used technique in statistics for estimating the parameters of a statistical model. It is based on the principle of likelihood, which measures how well a particular set of parameters explains the observed data. The ML-method (where “ML” for once doesn’t mean machine learning but maximum likelihood!) is a systematic approach to constructing estimators that are often very good and have desirable properties. The idea is to choose parameter values that make the observed data as “likely” as possible, according to our chosen model family.
Suppose we have collected a sample of data points \(x_1, x_2, \ldots, x_n\), which we treat as a realization of random variables \(X_1, X_2, \ldots, X_n\). As before, we assume these are i.i.d. random variables, each distributed according to some family \(P_\theta\), parameterized by the unknown \(\theta \in \Theta\).
We define the likelihood function as the joint probability of the observed data given the parameter \(\theta\):
\[\L(x_1, x_2, \ldots, x_n; \theta) = \begin{cases} \prod_{i=1}^{n} p_{X_i}(x_i; \theta) & \text{if } X_i \text{ are discrete} \\ \prod_{i=1}^{n} f_{X_i}(x_i; \theta) & \text{if } X_i \text{ are continuous} \end{cases} \]where \(p_{X_i}(x_i; \theta)\) is the probability mass function (pmf) for the discrete case, and \(f_{X_i}(x_i; \theta)\) is the probability density function (pdf) for the continuous case. As mentioned earlier, the likelihood function tells us how probable (or, in the continuous case, how dense) the observed data is under a specific parameter value.
In normal conversations, people often use the terms “probability” and “likelihood” interchangeably, but in statistics and probability theory, they have distinct meanings:
- Probability answers: Given \(\theta\), what is the probability of seeing \(x_1, \ldots, x_n\)? So \(P(x_1, \ldots, x_n \mid \theta)\): varies the \(x_i\), fixed \(\theta\) (a function of the data, parameter fixed).
- Likelihood answers: Given the observed \(x_1, \ldots, x_n\), how plausible is each possible \(\theta\)? So \(\L(x_1, \ldots, x_n; \theta)\): varies \(\theta\), fixed \(x_i\) (a function of the parameter, data fixed).
The formulas look identical, but the perspective is reversed!
The maximum likelihood estimate (MLE) of \(\theta\) is simply the value of the parameter that maximizes the likelihood function:
\[\hat{\theta} = \arg\max_{\theta \in \Theta} \L(x_1, x_2, \ldots, x_n; \theta) \]Here, \(\hat{\theta}\) is a number (or vector) which is our best guess for the true parameter based on the observed data. So the formula above gives you the estimate for a particular sample. If you replace the data with random variables \(X_1, \ldots, X_n\), you get the MLE estimator as a function:
\[T_{ML} = t_{ML}(X_1, X_2, \ldots, X_n) \]So before you see the data, the estimator itself is a random variable (just like the sample mean). After you see the data, you plug in \(x_1, \ldots, x_n\) to get your estimate \(\hat{\theta}\).
Suppose we have a coin that we suspect might be biased, and we want to estimate its probability of landing heads, which we’ll call \(\theta\). We toss the coin three times and observe the outcomes:
\[x_1 = 1, \quad x_2 = 0, \quad x_3 = 1 \]where \(1\) denotes heads and \(0\) denotes tails. The sample space for three tosses is all \(2^3 = 8\) possible combinations, such as \((1,1,1)\), \((1,1,0)\), \((1,0,1)\), etc. So for our model we have for each toss \(X_i \sim \text{Bernoulli}(\theta)\), with \(\theta\) the probability of heads.
Suppose we want to calculate the probability of getting the specific sequence \((1, 0, 1)\), given a fair coin so \(\theta = 0.5\) we get:
\[P(x_1 = 1, x_2 = 0, x_3 = 1 \mid \theta = 0.5) = 0.5^1 \times 0.5^1 \times 0.5^1 = 0.5 \times 0.5 \times 0.5 = 0.125 \]So, the probability answers: “How likely is it to see this specific data sequence, if we know \(\theta\)?”
Now, imagine you have observed the sequence \((1, 0, 1)\). You want to know: “For different possible values of \(\theta\), how plausible is it that \(\theta\) generated this data?”
The likelihood function is:
\[\begin{align*} \L(1, 0, 1; \theta) = \theta^{x_1} (1-\theta)^{1-x_1} \cdot \theta^{x_2} (1-\theta)^{1-x_2} \cdot \theta^{x_3} (1-\theta)^{1-x_3} \\ = \theta^1 (1-\theta)^0 \cdot \theta^0 (1-\theta)^1 \cdot \theta^1 (1-\theta)^0 = \theta^2 (1-\theta)^1 \end{align*} \]So if we then evaluate the likelihood for different values of \(\theta\):
- If \(\theta = 0.5\): \(\L(1,0,1; 0.5) = (0.5)^2 \times (0.5)^1 = 0.125\)
- If \(\theta = 0.7\): \(\L(1,0,1; 0.7) = (0.7)^2 \times (0.3)^1 = 0.147\)
- If \(\theta = 0.9\): \(\L(1,0,1; 0.9) = (0.9)^2 \times (0.1)^1 = 0.081\)
The likelihood is not a probability, but it tells us how “plausible” each value of \(\theta\) is given the observed data. Of the three values we tried, \(\theta = 0.7\) gives the highest likelihood for this observed sequence (the overall maximizer turns out to be \(\theta = 2/3 \approx 0.667\)).
You can imagine plotting \(\L(1,0,1;\theta)\) as a function of \(\theta\) between \(0\) and \(1\): the peak of this plot is the maximum likelihood estimate.
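Here is a small sketch of exactly that plot (using numpy and matplotlib; the grid resolution is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 1001)
likelihood = theta**2 * (1 - theta)  # L(1, 0, 1; theta) = theta^2 (1 - theta)

print(theta[np.argmax(likelihood)])  # ~0.667, i.e. the MLE theta_hat = 2/3

plt.plot(theta, likelihood)
plt.xlabel("theta")
plt.ylabel("L(1, 0, 1; theta)")
plt.show()
```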
In practice, it is much easier, and therefore more common, to maximize the log-likelihood function, which is why most of the time we speak only of it. The log-likelihood is given by:
\[\ell(x_1, x_2, \ldots, x_n; \theta) = \log \L(x_1, x_2, \ldots, x_n; \theta) \]The log-likelihood function has the advantage that the log turns products into sums, which are numerically more stable and much easier to differentiate and optimize, especially for large \(n\). Differentiation is key: to find the maximum, you usually set the derivative of the log-likelihood with respect to \(\theta\) to zero and solve for \(\theta\).
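The numerical-stability point is easy to see in code: multiplying thousands of per-observation densities underflows floating point to zero, while summing their logs stays finite. A minimal sketch with made-up density values:

```python
import numpy as np

rng = np.random.default_rng(1)
# 10,000 hypothetical per-observation likelihood contributions in (0, 1).
densities = rng.uniform(0.1, 0.9, size=10_000)

print(np.prod(densities))         # 0.0 -- the product underflows
print(np.sum(np.log(densities)))  # a large negative but finite number
```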
Maximizing the log-likelihood is fundamental in many machine learning algorithms. For example, fitting a logistic regression model. However, finding the parameters for a high-dimensional model can be challenging, which is why we often use optimization techniques like gradient descent.
Let’s return to the tea tasting lady. We recall that each \(X_i \sim \text{Bernoulli}(\theta)\), \(i = 1, \ldots, n\) and that we have the observed outcomes \(x_1, \ldots, x_n \in \{0, 1\}\). We can then define the model for a single observation:
\[p_{X_i}(x_i; \theta) = \theta^{x_i} (1-\theta)^{1-x_i} \]The joint likelihood for the entire sample is then:
\[\L(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{s_n} (1-\theta)^{n-s_n} \]where \(s_n = \sum_{i=1}^n x_i\) is the total number of correct identifications. As mentioned taking the log simplifies the optimization problem:
\[\ell(x_1, \ldots, x_n; \theta) = \log \L(x_1, \ldots, x_n; \theta) = s_n \log \theta + (n - s_n) \log (1 - \theta) \]Note that here we use the following logarithm property:
\[\log(ab) = \log a + \log b \quad \text{ and } \quad \log(a^b) = b \log a \]To find the maximum we set the derivative with respect to \(\theta\) to zero:
\[\frac{\partial}{\partial \theta} \ell(x_1, \ldots, x_n; \theta) = s_n \frac{1}{\theta} - (n - s_n) \frac{1}{1 - \theta} = \frac{s_n}{\theta} - \frac{n - s_n}{1 - \theta} = 0 \]The terms \(s_n\) and \(n - s_n\) are constants with respect to \(\theta\), which is why they simply carry through the differentiation. If we then solve for \(\theta\), we get:
\[\begin{align*} \frac{s_n}{\theta} - \frac{n - s_n}{1 - \theta} &= 0 \\ \frac{s_n}{\theta} &= \frac{n - s_n}{1 - \theta} \\ s_n (1 - \theta) &= (n - s_n) \theta \\ s_n - s_n \theta &= n \theta - s_n \theta \\ s_n &= n\theta \\ \theta &= \frac{s_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \end{align*} \]So the maximum likelihood estimator is simply the sample mean:
\[T_{ML} = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}_n \]For observed data, the estimate is
\[\hat{\theta} = \frac{1}{n} \sum_{i=1}^n x_i \]Notice how taking the log made differentiation and solving for the maximum straightforward—otherwise, you’d be working with the product rule and much messier algebra.
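As a quick numerical cross-check, here is a sketch that maximizes the Bernoulli log-likelihood over a grid of \(\theta\) values and compares the result with the closed-form answer \(s_n / n\) (the grid resolution is arbitrary):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0])  # the observed tea classifications
s_n, n = x.sum(), len(x)

theta = np.linspace(0.001, 0.999, 999)  # avoid log(0) at the endpoints
log_lik = s_n * np.log(theta) + (n - s_n) * np.log(1 - theta)

print(theta[np.argmax(log_lik)])  # ~0.6, matching the closed form
print(s_n / n)                    # 0.6
```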
Estimating Multiple Parameters
So far, we have seen cases where we have a stochastic model with a single parameter \(\theta \in \mathbb{R}\), such as the success probability of a Bernoulli random variable. However, in many cases we want more complex models with multiple parameters \(\theta_1, \theta_2, \ldots, \theta_k\) for \(k \geq 2\), such as the normal distribution parameters \(\mu\) and \(\sigma^2\) simultaneously. In general we extend our parameter space to include all of these parameters:
\[(\theta_1, \theta_2, \ldots, \theta_k) \in \Theta \subseteq \mathbb{R}^k \]All our definitions and methods from above (estimators, bias, mean squared error, MLE, etc.) still apply; we now just work with vectors (and sometimes matrices) instead of scalars.
Let’s consider the important example of estimating both the mean and variance of a normal distribution from observed data. So we suppose \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with
\[X_i \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n \]where both \(\mu\) (mean) and \(\sigma^2\) (variance) are unknown. Given a sample \(x_1, \ldots, x_n\), we want to estimate \(\mu\) and \(\sigma^2\). The probability density function (PDF) for a single normal distributed observation is:
\[f(x_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \]Since the \(X_i\) are i.i.d., the joint likelihood is the product of the individual PDFs:
\[\begin{align*} \mathcal{L}(x_1, \ldots, x_n; \mu, \sigma^2) &= \prod_{i=1}^n f(x_i; \mu, \sigma^2) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \\ &= \left( \frac{1}{\sqrt{2\pi \sigma^2}} \right)^n \prod_{i=1}^n \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \\ &= \frac{1}{(2\pi \sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right ) \end{align*} \]Here, we used the fact that \(\prod_{i=1}^n a = a^n\) for the normalizing constant, and that the exponential of a sum is the product of exponentials. Next we again take the log, as it is much easier to work with the log-likelihood, since logs turn products into sums and exponents into products:
\[\begin{align*} \ell(x_1, \ldots, x_n; \mu, \sigma^2) &= \log \mathcal{L}(x_1, \ldots, x_n; \mu, \sigma^2) \\ &= \log \left( \frac{1}{(2\pi \sigma^2)^{n/2}} \right ) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\ &= -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \end{align*} \]Notice that the first term \(-\frac{n}{2} \log(2\pi \sigma^2)\) is a function of \(\sigma^2\) but is independent of \(\mu\), so when we differentiate with respect to \(\mu\) it acts as an additive constant and drops out. The only term depending on \(\mu\) is the second one:
\[-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]We use the chain rule and power rule:
\[\frac{\partial}{\partial \mu} (x_i - \mu)^2 = 2(x_i - \mu)(-1) = -2(x_i - \mu) \]So, the derivative is:
\[\begin{align*} \frac{\partial}{\partial \mu} \ell(x_1, \ldots, x_n; \mu, \sigma^2) &= -\frac{1}{2\sigma^2} \sum_{i=1}^n \frac{\partial}{\partial \mu} (x_i - \mu)^2 \\ &= -\frac{1}{2\sigma^2} \sum_{i=1}^n -2(x_i - \mu) \\ &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \end{align*} \]The \(-1\) comes from the chain rule, because \(\frac{d}{d\mu}(x_i - \mu) = -1\). Next, to find the maximum, we solve:
\[\frac{\partial}{\partial \mu} \ell(x_1, \ldots, x_n; \mu, \sigma^2) = 0 \]However, it is important to note that for some models, setting the derivative to zero can also identify local minima or saddle points, so you should check that the solution is indeed a maximum. For the normal distribution, the log-likelihood is a concave function in \(\mu\) (a downward-facing parabola), so the critical point we find is the unique global maximum. In more complicated or non-concave cases, you might need to check the second derivative (the Hessian matrix, in higher dimensions) to confirm the maximum.
So solving for \(\mu\) we get:
\[\sum_{i=1}^n (x_i - \mu) = 0 \implies \sum_{i=1}^n x_i = n\mu \implies \mu = \frac{1}{n} \sum_{i=1}^n x_i \]So, the maximum likelihood estimator for \(\mu\) is the sample mean:
\[T_\mu = \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \]and for the observed sample,
\[\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i \]Next, we differentiate the log-likelihood with respect to \(\sigma^2\). Here, both terms depend on \(\sigma^2\):
\[-\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]To make the differentiation clearer, we do it term by term:
\[\begin{align*} \frac{\partial}{\partial \sigma^2} \log(2\pi \sigma^2) &= \frac{\partial}{\partial \sigma^2} \left( \log(2\pi) + \log(\sigma^2) \right) = \frac{1}{\sigma^2} \\ \frac{\partial}{\partial \sigma^2} \left(-\frac{1}{2\sigma^2} S_n\right) &= \frac{1}{2(\sigma^2)^2} S_n \end{align*} \]where \(S_n = \sum_{i=1}^n (x_i - \mu)^2\). Putting it all together we have:
\[\begin{align*} \frac{\partial}{\partial \sigma^2} \ell(x_1, \ldots, x_n; \mu, \sigma^2) &= -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 \end{align*} \]Setting this equal to zero and solving for \(\sigma^2\) we get:
\[\begin{align*} -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 &= 0 \\ -n \sigma^2 + \sum_{i=1}^n (x_i - \mu)^2 &= 0 \\ n \sigma^2 &= \sum_{i=1}^n (x_i - \mu)^2 \\ \sigma^2 &= \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 \end{align*} \]Notice we still have a \(\mu\) in our estimator. To remove it we replace \(\mu\) here with the estimator \(\hat{\mu}\) because, in practice, the variance estimator depends on our best estimate of the mean (which we just found above). This ensures our variance estimator is based on the “center” of the observed data. Thus, the MLE estimator for the variance is:
\[T_{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \]We can expand the squared difference as follows:
\[\begin{align*} T_{\sigma^2} &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \\ &= \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\bar{X}_n \sum_{i=1}^n X_i + n\bar{X}_n^2 \right) \\ &= \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2n\bar{X}_n^2 + n\bar{X}_n^2 \right) \\ &= \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 \end{align*} \]and then get for observed data,
\[\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - \hat{\mu}^2 \]This is the so-called moment estimator for \((\E[X], \V[X])\), which works for any model \(P_\theta\) where \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables. Now let’s see if our estimators are biased, i.e. whether their expected values differ from the true parameter values:
\[\E[\bar{X}_n] = \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \mu \]So our mean estimator is unbiased:
\[\text{Bias}_\mu(\bar{X}_n) = \E[\bar{X}_n] - \mu = 0 \]Next we compute the bias of the MLE variance estimator:
\[\E[T_{\sigma^2}] = \E\left[ \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \right] = \frac{n-1}{n}\sigma^2 \]To see where this comes from, use the expanded form \(T_{\sigma^2} = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2\) from above together with \(\E[X_i^2] = \sigma^2 + \mu^2\) and \(\E[\bar{X}_n^2] = \V(\bar{X}_n) + \E[\bar{X}_n]^2 = \frac{\sigma^2}{n} + \mu^2\):
\[\E[T_{\sigma^2}] = \frac{1}{n} \sum_{i=1}^n \E[X_i^2] - \E[\bar{X}_n^2] = (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{n-1}{n}\sigma^2 \]
So, because \(\frac{n-1}{n} \sigma^2 < \sigma^2\), the MLE estimator systematically underestimates the true variance. This is the one drawback of the moment estimator. However, as \(n\) goes to infinity, the bias goes to zero, so we say that the estimator is asymptotically unbiased. This means that as we collect more data, the bias becomes negligible and the estimator approaches the true parameter value.
We can also obtain an unbiased variance estimator which is often used instead. To get this unbiased estimator for \(\sigma^2\), we simply solve for a constant \(c\) so that \(\E[c T_{\sigma^2}] = \sigma^2\). So we set \(S^2 = cT_{\sigma^2}\) and require \(\E[S^2] = \sigma^2\):
\[\E[S^2] = c \frac{n-1}{n} \sigma^2 = \sigma^2 \implies c = \frac{n}{n-1} \]Thus:
\[\begin{align*} S^2 &= \frac{n}{n-1} T_{\sigma^2} \\ &= \frac{n}{n-1} \cdot \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \end{align*} \]We are allowed to rescale like this because multiplying by a constant simply defines a new estimator; it changes neither the model nor the parameter, only the rule we apply to the data. This correction (often called Bessel's correction) trades a slightly larger variance for exact unbiasedness: for normal data the biased MLE estimator actually has a somewhat smaller MSE, so neither estimator dominates the other in every respect.
and for observed data:
\[\hat{\sigma}_{\text{unbiased}}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2 \]This is the so-called empirical sample variance, and we have:
\[\E[S^2] = \frac{1}{n-1} \E\left[ \sum_{i=1}^n (X_i - \bar{X}_n)^2 \right] = \frac{1}{n-1} (n-1)\sigma^2 = \sigma^2 \]Thus, \(S^2\) is an unbiased estimator for \(\sigma^2\).
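To close the loop, here is a short simulation sketch (true parameters chosen arbitrarily) showing the bias of the \(\frac{1}{n}\) estimator and the unbiasedness of \(S^2\) empirically:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, reps = 172.0, 36.0, 5, 200_000  # assumed true parameters

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# np.var with ddof=0 divides by n (the MLE / moment estimator),
# ddof=1 divides by n - 1 (the bias-corrected sample variance S^2).
mle_var = samples.var(axis=1, ddof=0)
s2 = samples.var(axis=1, ddof=1)

print(mle_var.mean())  # ~28.8 = (n-1)/n * sigma2, biased low
print(s2.mean())       # ~36.0, unbiased
```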