Fisher Information

Fisher information (named after Ronald Fisher, who also came up with ANOVA and MLE) measures the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$. It is used in the asymptotic theory of MLE, the Jeffreys prior, and the Wald test.

Let $p(X|\theta)$ be the likelihood. Then the log-likelihood is $$\Mr{LL}(\theta) = \log p(X|\theta)$$ We define the score as $$\Mr{score}(\theta) = \fracp{}{\theta} \log p(X|\theta)$$ Under regularity conditions (which let us interchange differentiation and integration in the last step below), the first moment of the score is 0: $$\begin{align*} \ex[X]{\Mr{score}(\theta)\midd\theta} &= \int \nail{\fracp{}{\theta} \log p(x|\theta)} p(x|\theta)\,dx \\ &= \int \nail{\frac{1}{p(x|\theta)}\fracp{p(x|\theta)}{\theta}} p(x|\theta)\,dx \\ &= \fracp{}{\theta} \int p(x|\theta)\, dx = \fracp{}{\theta} 1 = 0 \end{align*}$$
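A quick numerical check of the zero-mean property (a minimal sketch, assuming the model $X \sim \mathcal{N}(\theta, 1)$, for which the score is $x - \theta$):

```python
# Monte Carlo check that the score has mean zero
# (assumed example: X ~ Normal(theta, 1), so d/dtheta log p(x|theta) = x - theta).
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # draws from p(x|theta)

score = x - theta        # closed-form score for this Gaussian model
print(score.mean())      # ~0, matching E[score(theta) | theta] = 0
```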

The Fisher information is the second moment of the score: $$I(\theta) = \ex[X]{\nail{\fracp{}{\theta}\log p(X|\theta)}^2\midd\theta}$$ Under regularity conditions ($p(X|\theta)$ is twice differentiable in $\theta$, and differentiation and integration can be interchanged), differentiating the zero-mean identity above once more gives $$I(\theta) = -\ex[X]{\fracp{^2}{\theta^2}\log p(X|\theta)\midd \theta}$$ Note that the Fisher information is a function of $\theta$ alone; it does not depend on a particular observation $X = x$, since $X$ is integrated out.
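To see the two forms agree, here is a small Monte Carlo sketch for an assumed Bernoulli$(\theta)$ model, where $I(\theta) = 1/(\theta(1-\theta))$:

```python
# Check numerically that the two forms of Fisher information agree
# (assumed example: X ~ Bernoulli(theta), where I(theta) = 1/(theta*(1-theta))).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(n=1, p=theta, size=1_000_000)

score = x / theta - (1 - x) / (1 - theta)             # d/dtheta log p(x|theta)
second = -x / theta**2 - (1 - x) / (1 - theta) ** 2   # d^2/dtheta^2 log p(x|theta)

print(np.mean(score**2))            # second moment of the score
print(-np.mean(second))             # negative expected second derivative
print(1 / (theta * (1 - theta)))    # exact value, ~4.76
```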

Intuition

Fisher information measures the (expected) curvature of the log-likelihood. If we plot the log-likelihood as a function of $\theta$, high curvature means a sharp peak around the maximizer, so it is easy to pin down the optimal $\theta$: we got a lot of information about $\theta$ from $X$, i.e., high Fisher information. A flat log-likelihood means many values of $\theta$ fit the data about equally well, i.e., low Fisher information.

Properties

  • If $X$ and $Y$ are independent, then $I_{X,Y}(\theta) = I_X(\theta) + I_Y(\theta)$
  • If $T(X)$ is a sufficient statistic for $\theta$ (i.e., $$p(X=x|T(X)=t,\theta) = p(X=x|T(X)=t)$$ is independent of $\theta$), then $I_T(\theta) = I_X(\theta)$
  • For any other statistic $T(X)$, we get $I_T(\theta) \leq I_X(\theta)$: processing the data cannot increase the information about $\theta$
  • Theorem (Cramér-Rao Bound) For any unbiased estimator $\hat\theta$ of $\theta$, $$\varr{\hat\theta} \geq \frac{1}{I(\theta)}$$ This makes sense: less information makes it harder to pinpoint $\theta$, so the variance of any unbiased estimator increases. (See the sketch after this list.)
  • Definition (Jeffreys Prior) The Jeffreys prior is defined as $$p_\theta(\theta) \propto \sqrt{I(\theta)}$$ The Jeffreys prior is an uninformative prior that is invariant to parameterization; i.e., for any reparameterization $\phi = h(\theta)$, the transformed density $$p_\phi(\phi) = p_\theta(\theta)\abs{\fracd{\theta}{\phi}}$$ is exactly the Jeffreys prior one would derive directly from $I(\phi)$, so the prior stays uninformative in either parameterization.
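As an illustration of the last two items, here is a sketch for an assumed Bernoulli$(\theta)$ model with $n$ i.i.d. samples: the sample mean is unbiased and its variance matches the Cramér-Rao bound $1/(nI(\theta))$, and the Jeffreys prior works out to Beta$(1/2, 1/2)$.

```python
# Cramér-Rao bound for Bernoulli(theta) with n i.i.d. samples (assumed example):
# the sample mean is unbiased and attains the bound 1/(n*I(theta)).
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 50, 200_000

samples = rng.binomial(n=1, p=theta, size=(trials, n))
theta_hat = samples.mean(axis=1)        # sample mean: unbiased estimator of theta

fisher = 1 / (theta * (1 - theta))      # I(theta) for a single Bernoulli observation
print(theta_hat.var())                  # empirical variance of the estimator
print(1 / (n * fisher))                 # Cramér-Rao bound; the two agree here

# Jeffreys prior for the same model:
# p(theta) proportional to sqrt(I(theta)) = theta^(-1/2) * (1 - theta)^(-1/2),
# i.e. a Beta(1/2, 1/2) distribution.
```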

Fisher Information Matrix

Suppose we have $N$ parameters $\theta = (\theta_1, \dots, \theta_N)$. Fisher information becomes an $N\times N$ matrix $I(\theta)$ with entries $$I(\theta)_{ij} = \ex[X]{\nail{\fracp{}{\theta_i} \log p(X|\theta)}\nail{\fracp{}{\theta_j} \log p(X|\theta)}\midd \theta}$$ Note that $I(\theta) \succeq 0$.

Again, under regularity conditions, we also get $$I(\theta)_{ij} = -\ex[X]{\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(X|\theta)\midd \theta}$$

If $I(\theta)_{ij} = 0$, we say that $\theta_i$ and $\theta_j$ are orthogonal parameters; their MLEs are asymptotically independent.
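A small Monte Carlo sketch of the Fisher information matrix for an assumed Gaussian model $X \sim \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$; the off-diagonal entry is $0$, so $\mu$ and $\sigma$ are orthogonal:

```python
# Monte Carlo estimate of the Fisher information matrix for N(mu, sigma^2)
# (assumed example), parameterized by theta = (mu, sigma).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

# Score components: partial derivatives of log p(x|mu, sigma) w.r.t. each parameter.
score_mu = (x - mu) / sigma**2
score_sigma = -1 / sigma + (x - mu) ** 2 / sigma**3

scores = np.stack([score_mu, score_sigma])   # shape (2, n_samples)
fim = scores @ scores.T / x.size             # E[score score^T]
print(fim)                                   # ~[[1/sigma^2, 0], [0, 2/sigma^2]]
```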
