# Fisher Information

**Fisher information** (named after Ronald Fisher, who came up with ANOVA and MLE) measures the amount of information that an observed variable $X$ carries about an unknown parameter $\theta$. It is used in the asymptotic theory of MLE, the Jeffreys prior, and the Wald test.

Let $p(X|\theta)$ be the likelihood distribution. Then the log-likelihood is
$$\Mr{LL}(\theta) = \log p(X|\theta)$$
We define the **score** as
$$\Mr{score}(\theta) = \fracp{}{\theta} \log p(X|\theta)$$
Under regularity conditions, the *first* moment of the score is 0:
$$\begin{align*}
\ex[X]{\Mr{score}(\theta)|\theta}
&= \int \nail{\fracp{}{\theta} \log p(x|\theta)} p(x|\theta)\,dx \\
&= \int \nail{\frac{1}{p(x|\theta)}\fracp{p(x|\theta)}{\theta}} p(x|\theta)\,dx \\
&= \fracp{}{\theta} \int p(x|\theta)\, dx = \fracp{}{\theta} 1 = 0
\end{align*}$$
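
As a quick sanity check of the zero-mean property, here is a minimal sketch that assumes a Bernoulli model $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, so the integral becomes a sum over $x \in \{0, 1\}$. The function names are illustrative and not part of the derivation above.

```python
import numpy as np

def bernoulli_score(x, theta):
    """Score d/dtheta log p(x|theta) for p(x|theta) = theta^x (1-theta)^(1-x)."""
    return x / theta - (1 - x) / (1 - theta)

def expected_score(theta):
    """Exact expectation of the score: sum over x in {0, 1} of score(x) * p(x|theta)."""
    xs = np.array([0.0, 1.0])
    probs = np.array([1 - theta, theta])
    return float(np.sum(bernoulli_score(xs, theta) * probs))

# The expectation should vanish (up to floating-point error) for any theta in (0, 1).
for theta in (0.1, 0.5, 0.9):
    print(theta, expected_score(theta))
```

The two terms cancel exactly in theory: $\theta \cdot \frac{1}{\theta} - (1-\theta)\cdot\frac{1}{1-\theta} = 0$.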

The **Fisher information** is the *second* moment of the score (which equals its variance, since the first moment is 0):
$$I(\theta) = \ex[X]{\nail{\fracp{}{\theta}\log p(X|\theta)}^2\midd\theta}$$
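Under regularity conditions ($p(X|\theta)$ is twice differentiable in $\theta$, and differentiation can be exchanged with integration), differentiating the score once more and taking the expectation gives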
$$I(\theta) = -\ex[X]{\fracp{^2}{\theta^2}\log p(X|\theta)\midd \theta}$$
Note that Fisher information does not depend on a particular $X=x$ since it is integrated out.
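
The agreement between the two forms can be checked numerically. The sketch below again assumes a Bernoulli model, for which both expectations reduce to two-term sums and the closed form is $I(\theta) = \frac{1}{\theta(1-\theta)}$. The function names are hypothetical.

```python
import numpy as np

def fisher_second_moment(theta):
    """I(theta) as E[score^2] for a Bernoulli(theta) observation."""
    xs = np.array([0.0, 1.0])
    probs = np.array([1 - theta, theta])
    score = xs / theta - (1 - xs) / (1 - theta)
    return float(np.sum(score**2 * probs))

def fisher_neg_hessian(theta):
    """I(theta) as -E[d^2/dtheta^2 log p(X|theta)] for a Bernoulli(theta) observation."""
    xs = np.array([0.0, 1.0])
    probs = np.array([1 - theta, theta])
    second_derivative = -xs / theta**2 - (1 - xs) / (1 - theta) ** 2
    return float(-np.sum(second_derivative * probs))

theta = 0.3
closed_form = 1.0 / (theta * (1 - theta))
print(fisher_second_moment(theta), fisher_neg_hessian(theta), closed_form)
```

All three printed values should agree up to floating-point error.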

## Intuition

Fisher information measures the expected **curvature** of the log-likelihood. If we plot the log-likelihood against $\theta$, high curvature = sharp peak = easy to locate the optimal $\theta$ = we got a lot of information about $\theta$ from $X$ = high Fisher information.
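
To make the curvature picture concrete, the following sketch assumes i.i.d. observations from $N(\theta, \sigma^2)$ with known $\sigma$ and estimates the second derivative of the log-likelihood at the MLE by finite differences. For this particular model the curvature is exactly $-n/\sigma^2$, so more data means a more sharply curved log-likelihood. The helper names and the choice of model are assumptions made for illustration.

```python
import numpy as np

def log_likelihood(theta, data, sigma=1.0):
    """Log-likelihood of i.i.d. N(theta, sigma^2) observations."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (data - theta) ** 2 / (2 * sigma**2)))

def curvature_at(theta, data, sigma=1.0, h=1e-3):
    """Central finite-difference estimate of d^2/dtheta^2 of the log-likelihood."""
    return (log_likelihood(theta + h, data, sigma)
            - 2 * log_likelihood(theta, data, sigma)
            + log_likelihood(theta - h, data, sigma)) / h**2

rng = np.random.default_rng(0)
for n in (10, 1000):
    data = rng.normal(loc=2.0, scale=1.0, size=n)
    mle = data.mean()  # MLE of the mean for a Gaussian with known sigma
    # The curvature should be close to -n / sigma^2 = -n here.
    print(n, curvature_at(mle, data))
```

Here the observed curvature and the Fisher information coincide because the second derivative does not depend on the data; in general the Fisher information is the expectation of this quantity.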

## Properties

- If $X$ and $Y$ are independent, then $I_{X,Y}(\theta) = I_X(\theta) + I_Y(\theta)$
- If $T(X)$ is a sufficient statistic for $\theta$ (i.e., $$p(X=x|T(X)=t,\theta) = p(X=x|T(X)=t)$$ independent of $\theta$), then $I_T(\theta) = I_X(\theta)$
- For other statistics $T(X)$, we get $I_T(\theta) \leq I_X(\theta)$

(*Theorem* **Cramér-Rao Bound**) For any *unbiased* estimator $\hat\theta$, $$\varr{\hat\theta} \geq \frac{1}{I(\theta)}$$ This makes sense: less information means that it is more difficult to pinpoint the parameter, and thus the variance of the estimator increases.

(*Definition* **Jeffreys Prior**) The Jeffreys prior is defined as $$p_\theta(\theta) \propto \sqrt{I(\theta)}$$ The Jeffreys prior is an uninformative prior that is not sensitive to parameterization; i.e., for any smooth invertible reparameterization $\phi = h(\theta)$, the transformed density $$p_\phi(\phi) = p_\theta(\theta)\abs{\fracd{\theta}{\phi}}$$ is again proportional to $\sqrt{I(\phi)}$, which is the Jeffreys prior computed directly in $\phi$.
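
The Cramér-Rao bound can be illustrated by simulation. This sketch assumes $n$ independent Bernoulli($\theta$) draws, for which the per-draw information is $\frac{1}{\theta(1-\theta)}$ and, by additivity over independent draws, the total information is $\frac{n}{\theta(1-\theta)}$. The sample mean is unbiased and its variance should match the bound $\frac{\theta(1-\theta)}{n}$. The function names, sample sizes, and seed are arbitrary choices.

```python
import numpy as np

def cramer_rao_bound(theta, n):
    """Lower bound 1 / (n * I(theta)) for n i.i.d. Bernoulli(theta) draws."""
    return theta * (1 - theta) / n

def simulated_variance_of_mean(theta, n, trials=200_000, seed=0):
    """Monte Carlo variance of the sample mean over many repeated experiments."""
    rng = np.random.default_rng(seed)
    # Each binomial draw is the sum of n Bernoulli(theta) observations.
    estimates = rng.binomial(n, theta, size=trials) / n
    return float(estimates.var())

theta, n = 0.3, 50
print("bound:    ", cramer_rao_bound(theta, n))
print("simulated:", simulated_variance_of_mean(theta, n))
```

The simulated variance should sit at the bound up to Monte Carlo noise, since the sample mean attains it in this model. Other unbiased estimators, for example the mean of only the first half of the sample, would land above it.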

# Fisher Information Matrix

Suppose we have $N$ parameters $\theta = (\theta_1, \dots, \theta_N)$. Fisher information becomes an $N\times N$ matrix $I(\theta)$ with entries $$I(\theta)_{ij} = \ex[X]{\nail{\fracp{}{\theta_i} \log p(X|\theta)}\nail{\fracp{}{\theta_j} \log p(X|\theta)}\midd \theta}$$ Note that $I(\theta) \succeq 0$.

Again, under regularity conditions, we also get $$I(\theta)_{ij} = -\ex[X]{\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(X|\theta)\midd \theta}$$
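
As a worked instance of both matrix forms, the sketch below assumes a Gaussian $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$ and approximates the expectations by a Riemann sum on a fine grid. For this model the known closed form is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$. The function name, grid width, and grid size are illustrative choices.

```python
import numpy as np

def normal_fisher_matrices(mu, sigma, half_width=12.0, num=200_001):
    """Return (outer-product form, negative-Hessian form) of I(theta), theta = (mu, sigma)."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, num)
    dx = x[1] - x[0]
    pdf = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    # Score vector: partial derivatives of log p(x|theta) w.r.t. mu and sigma.
    score = np.stack([(x - mu) / sigma**2,
                      -1 / sigma + (x - mu) ** 2 / sigma**3])  # shape (2, num)
    outer_form = (score[:, None, :] * score[None, :, :] * pdf).sum(axis=-1) * dx

    # Second partial derivatives of log p(x|theta).
    h_mm = np.full_like(x, -1 / sigma**2)
    h_ms = -2 * (x - mu) / sigma**3
    h_ss = 1 / sigma**2 - 3 * (x - mu) ** 2 / sigma**4
    hessian = np.array([[h_mm, h_ms], [h_ms, h_ss]])  # shape (2, 2, num)
    hessian_form = -(hessian * pdf).sum(axis=-1) * dx
    return outer_form, hessian_form

outer_form, hessian_form = normal_fisher_matrices(mu=1.0, sigma=2.0)
print(outer_form)
print(hessian_form)
```

Both matrices should be close to $\mathrm{diag}(0.25,\, 0.5)$ for $\sigma = 2$, with off-diagonal entries near zero.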

If $I(\theta)_{ij} = 0$ for $i \neq j$, we say that $\theta_i$ and $\theta_j$ are **orthogonal parameters**; when $I(\theta)$ is block diagonal with respect to them, their MLEs are asymptotically independent.
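
Orthogonality can be illustrated by simulation. The sketch below assumes a Gaussian $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, whose Fisher information matrix is diagonal, and estimates the correlation between the MLEs $\hat\mu$ and $\hat\sigma$ across repeated experiments. The function name, sample sizes, and seed are arbitrary choices.

```python
import numpy as np

def mle_correlation(mu=1.0, sigma=2.0, n=100, trials=20_000, seed=0):
    """Monte Carlo correlation between the MLEs of mu and sigma for Gaussian data."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(trials, n))
    mu_hat = samples.mean(axis=1)
    sigma_hat = samples.std(axis=1)  # ddof=0 gives the maximum likelihood estimate
    return float(np.corrcoef(mu_hat, sigma_hat)[0, 1])

# Should be close to zero, up to Monte Carlo noise.
print(mle_correlation())
```

The Gaussian is an unusually clean case, because the sample mean and sample standard deviation are exactly independent at every $n$. For other models with orthogonal parameters the statement is only asymptotic.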