
In what sense is the Jeffreys prior invariant?

By Sarah Rodriguez
$\begingroup$

I've been trying to understand the motivation for the use of the Jeffreys prior in Bayesian statistics. Most texts I've read online make some comment to the effect that the Jeffreys prior is "invariant with respect to transformations of the parameters", and then go on to state its definition in terms of the Fisher information matrix without further motivation. However, none of them then go on to show that such a prior is indeed invariant, or even to properly define what was meant by "invariant" in the first place.

I like to understand things by approaching the simplest example first, so I'm interested in the case of a binomial trial, i.e. the case where the support is $\{1,2\}$. In this case the Jeffreys prior is given by $$ \rho(\theta) = \frac{1}{\pi\sqrt{\theta(1-\theta)}}, \qquad\qquad(i) $$ where $\theta$ is the parameterisation given by $p_1 = \theta$, $p_2 = 1-\theta$.

What I would like is to understand the sense in which this is invariant with respect to a coordinate transformation $\theta \to \varphi(\theta)$. To me the term "invariant" would seem to imply something along the lines of $$ \int_{\theta_1}^{\theta_2} \rho(\theta) d \theta = \int_{\varphi(\theta_1)}^{\varphi(\theta_2)} \rho(\varphi(\theta)) d \varphi \qquad\qquad(ii) $$ for any (smooth, differentiable) function $\varphi$ -- but it's easy enough to see that this is not satisfied by the distribution $(i)$ above (and indeed, I doubt there can be any density function that does satisfy this kind of invariance for any transformation). So there must be some other sense intended by "invariant" in this context. I would like to understand this sense in the form of a functional equation similar to $(ii)$, so that I can see how it's satisfied by $(i)$.

Progress

As did points out, the Wikipedia article gives a hint about this, by starting with $$ p(\theta)\propto\sqrt{I(\theta)} $$ and deriving $$ p(\varphi)\propto\sqrt{I(\varphi)} $$ for any smooth monotone function $\varphi(\theta)$. (Note that these equations omit taking the determinant of $I$, because they refer to the single-variable case.) Clearly something is invariant here, and it seems like it shouldn't be too hard to express this invariance as a functional equation. However, the more I try to do this the more confused I get. Partly this is because there's just a lot left out of the Wikipedia sketch (e.g. are the constants of proportionality the same in the two equations above, or different? Where is the proof of uniqueness?), but mostly it's because it's really unclear exactly what's being sought, which is why I wanted to express it as a functional equation in the first place.

To reiterate my question, I understand the above equations from Wikipedia, and I can see that they demonstrate an invariance property of some kind. However, I can't see how to express this invariance property in the form of a functional equation similar to $(ii)$, which is what I'm looking for as an answer to this question. I want to first understand the desired invariance property, and then see that the Jeffreys prior (hopefully uniquely) satisfies it, but the above equations mix up those two steps in a way that I can't see how to separate.

$\endgroup$

5 Answers

$\begingroup$

Having come back to this question and thought about it a bit more, I believe I have finally worked out how to formally express the sense of "invariance" that applies to Jeffreys' priors, as well as the logical issue that prevented me from seeing it before.

The following lecture notes were helpful in coming to this conclusion, as they contain an explanation that is clearer than anything I could find at the time of writing the question:

My key stumbling point seems to be that the phrase "the Jeffreys prior is invariant" is incorrect - the invariance in question is not a property of any given prior, but rather it's a property of a method of constructing priors from likelihood functions.

That is, we want something that will take a likelihood function and give us a prior for the parameters, and will do it in such a way that if we take that prior and then transform the parameters, we will get the same result as if we first transform the parameters and then use the same method to generate the prior. I was looking for an invariance property that would apply to a particular prior generated using Jeffreys' method, whereas the desired invariance principle in fact applies to Jeffreys' method itself.

To give an attempt at fleshing this out, let's say that a "prior construction method" is a functional $M$, which maps the function $f(x \mid \theta)$ (the conditional probability density function of some data $x$ given some parameters $\theta$, considered a function of both $x$ and $\theta$) to another function $\rho(\theta)$, which is to be interpreted as a prior probability density function for $\theta$. That is, $\rho(\theta) = M\{ f(x\mid \theta) \}$.

What we seek is a construction method $M$ with the following property: (I hope I have expressed this correctly)$$ M\{ f(x\mid h(\theta)) \} = M\{ f(x \mid \theta) \}\circ h, $$for any arbitrary smooth monotonic transformation $h$. That is, we can either apply $h$ to transform the likelihood function and then use $M$ to obtain a prior, or we can first use $M$ on the original likelihood function and then transform the resulting prior, and the end result will be the same.

What Jeffreys provides is a prior construction method $M$ which has this property. My problem arose from looking at a particular example of a prior constructed by Jeffreys' method (i.e. the function $M\{ f(x\mid \theta )\}$ for some particular likelihood function $f(x \mid \theta)$) and trying to see that it has some kind of invariance property. In fact the desired invariance is a property of $M$ itself, rather than of the priors it generates.
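This "construct then transform equals transform then construct" property can be checked numerically. The sketch below (my own illustration, not part of the answer) uses the Bernoulli likelihood and an assumed transformation $h(\theta) = \theta^2$, which is monotone on $(0,1)$; note that, since priors are densities, transforming the output of $M$ picks up the Jacobian factor $|d\theta/d\varphi|$, which the composition notation above leaves implicit.

```python
import math

def fisher_info_theta(theta):
    # Fisher information of a single Bernoulli trial in the theta
    # parameterisation: I(theta) = 1 / (theta * (1 - theta))
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_theta(theta):
    # M{f(x|theta)}: the (unnormalised) Jeffreys prior, sqrt(I(theta))
    return math.sqrt(fisher_info_theta(theta))

# Illustrative transformation phi = h(theta) = theta**2, monotone on (0, 1)
def h(theta):
    return theta ** 2

def h_inv(phi):
    return math.sqrt(phi)

def loglik(phi, x):
    # log f(x | phi): the same Bernoulli likelihood, rewritten in phi
    theta = h_inv(phi)
    return x * math.log(theta) + (1 - x) * math.log(1.0 - theta)

def jeffreys_phi(phi, eps=1e-6):
    # M{f(x|h(theta))}: apply the same construction directly in the phi
    # coordinates, estimating the score by a central finite difference
    theta = h_inv(phi)
    info = 0.0
    for x, prob in [(1, theta), (0, 1.0 - theta)]:
        score = (loglik(phi + eps, x) - loglik(phi - eps, x)) / (2.0 * eps)
        info += prob * score ** 2
    return math.sqrt(info)

theta = 0.3
phi = h(theta)
dtheta_dphi = 0.5 / math.sqrt(phi)  # Jacobian |d h^{-1}/d phi| of the inverse map

# "construct the prior, then transform it" vs "transform, then construct":
transformed = jeffreys_theta(theta) * dtheta_dphi
direct = jeffreys_phi(phi)
print(transformed, direct)  # both ≈ 3.637
```

The two numbers agree, which is exactly the commutation property described above, specialised to one point $\theta = 0.3$.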

I do not currently know whether the particular prior construction method supplied by Jeffreys is unique in having this property. This seems to be a rather important question: if there is some other invariant functional $M'$ that gives a different prior for the parameter of a binomial distribution, then there doesn't seem to be anything that picks out the Jeffreys distribution for a binomial trial as particularly special. On the other hand, if no such $M'$ exists, then the Jeffreys prior does have a special property, in that it's the only prior that can be produced by a prior construction method that is invariant under parameter transformations. It would therefore seem rather valuable to find either a proof that Jeffreys' construction method is unique in having this invariance property, or an explicit counterexample showing that it is not.

$\endgroup$ $\begingroup$

Maybe the problem is that you are forgetting the Jacobian of the transformation in $(ii)$.

I suggest that you carefully check the formulas here (hint: $\left| \frac{d \Phi^{- 1}}{d y} \right|$ is the Jacobian, where $\Phi^{- 1}$ is the inverse transformation). Then start with some simple examples of monotone transformations in order to see the invariance. I suggest starting with $\varphi(\theta)=2\theta$ and $\varphi(\theta)=1-\theta$.

Also, to answer your question, the constants of proportionality do not matter here. In $(i)$ the normalising constant is $\pi$; do the calculations with $\pi$ in there to see that point. Let me know if you get stuck somewhere.
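For example, the second suggested transformation, $\varphi(\theta)=1-\theta$, can be worked out in two lines (a sketch; here $|\varphi'(\theta)| = 1$ and $\theta = 1-\varphi$):

```latex
\begin{align*}
p(\varphi) &= p(\theta)\left|\frac{d\theta}{d\varphi}\right|
            = \frac{1}{\pi\sqrt{\theta(1-\theta)}}\cdot 1\\
           &= \frac{1}{\pi\sqrt{(1-\varphi)\varphi}},
\end{align*}
```

which is again of the form $(i)$, now written in terms of $\varphi$: the Jeffreys prior for the binomial trial is form-invariant under this transformation.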

Edit: The dependence on the likelihood is essential for the invariance to hold, because the information is a property of the likelihood and because the object of interest is ultimately the posterior. However, regardless of which likelihood you use, the invariance holds. It works through the relationship $ \sqrt{I (\theta)} = \sqrt{I (\varphi (\theta))}\, | \varphi' (\theta) | $, which links the information of the original model to the information of the transformed model. Here $| \varphi' (\theta) |$ is the Jacobian of the transformation, so $1/| \varphi' (\theta) |$ is the Jacobian of the inverse transformation. (I will let you verify this by deriving the information from the likelihood: just use the chain rule after applying the definition of the information as the expected value of the square of the score.) Now, for the prior. \begin{eqnarray*} p (\varphi (\theta) ) & = & \frac{1}{| \varphi' (\theta) |} p (\theta )\\ & \propto & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)} \\ & = & \sqrt{I (\varphi (\theta))} \end{eqnarray*} The first line applies the change-of-variables formula for densities. The second line applies the definition of the Jeffreys prior, $p(\theta) \propto \sqrt{I(\theta)}$. The third line applies the relationship between the informations. The result, $p(\varphi(\theta)) \propto \sqrt{I(\varphi(\theta))}$, is exactly the Jeffreys prior constructed directly in the $\varphi$ parameterisation. You can see that the use of the Jeffreys prior was essential for $\frac{1}{| \varphi' (\theta) |}$ to cancel out.

Look again at what happens to the posterior ($y$ is obviously the observed sample here) \begin{eqnarray*} p (\varphi (\theta) |y) & = & \frac{1}{| \varphi' (\theta) |} p (\theta |y)\\ & \propto & \frac{1}{| \varphi' (\theta) |} p (\theta) p (y| \theta)\\ & \propto & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)} p (y| \theta)\\ & \propto & \sqrt{I (\varphi (\theta))}\, p (y| \theta)\\ & \propto & p (\varphi (\theta)) p (y| \theta) \end{eqnarray*} The only difference is that the second line applies Bayes rule.

As I explained earlier in the comments, it is essential to understand how Jacobians work (or differential forms).

$\endgroup$ $\begingroup$

What is invariant is the volume density $|p_{L_{\theta}}(\theta) dV_{\theta}|$ where $V_\theta$ is the volume form in coordinates $\theta_1, \theta_2, \dots \theta_n$ and $L_\theta$ is the likelihood parametrized by $\theta$. Locally the Fisher matrix $F$ transforms to $(J^{-1})^TFJ^{-1}$ under a change of coordinates with Jacobian $J$, and $\sqrt{\det}$ of this cancels the multiplication of volume forms by $\det J$.
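This cancellation is easy to verify numerically. The sketch below (my own illustration; the matrices $F$ and $J$ are arbitrary made-up values, with $F$ positive definite and $J$ invertible) checks that $\sqrt{\det}$ of the transformed Fisher matrix, multiplied by $|\det J|$ from the volume form, recovers $\sqrt{\det F}$.

```python
import math

# Minimal 2x2 matrix helpers (pure Python, no dependencies)
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

# An arbitrary positive-definite "Fisher matrix" F and invertible Jacobian J
F = [[2.0, 0.3], [0.3, 1.0]]
J = [[1.5, 0.2], [0.0, 3.0]]

# Under the coordinate change, F transforms to (J^{-1})^T F J^{-1} ...
Jinv = inv2(J)
F_new = matmul(transpose(Jinv), matmul(F, Jinv))

# ... while the volume form is multiplied by det J, so the volume density
# sqrt(det F) dV is unchanged:
lhs = math.sqrt(det2(F_new)) * abs(det2(J))
rhs = math.sqrt(det2(F))
print(lhs, rhs)  # equal up to floating-point rounding
```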

The presentation in Wikipedia is confusing, because

  • the equations are between densities $p(x) dx$, but written as though for the density functions $p()$ that define the priors,

  • the first equality is a claim still to be proven. The following ones are the derivation of that equation.

To read the Wikipedia argument as a chain of equalities of unsigned volume forms, multiply every line by $|d\varphi|$, and use the absolute value of all determinants, not the usual signed determinant. Then "$p_{L_{\varphi}}(\varphi)\, d\varphi \;(\text{claimed}) = p_{L_{\theta}}(\theta)\, d\theta = (\text{...Fisher } I \text{ quantities...})\, d\varphi = \sqrt{I(\varphi)}\, d\varphi$".


To answer some of the other questions,

The invariance of $|p\, dV|$ is the definition of "invariance of prior". Because changes of coordinates alter $dV$, an invariant prior has to depend on more than $p(\theta)$. It is natural to ask for something local on the parameter space, so the invariant prior will be built from a finite number of derivatives of the likelihood evaluated at $\theta$. This means some local finite-dimensional linear space of differential quantities at each point, with linear maps between the before- and after-coordinate-change spaces.

Determinants appear because there is a factor of $\det J$ to be killed off from the change in $dV$, and because we want the changes in the local quantities to multiply and cancel each other, as is the case in the Jeffreys prior. This practically requires a reduction to one dimension, where the coordinate change can act on each factor by multiplication by a single number. The Jeffreys prior is a product of two locally defined quantities, one of which scales by $\sqrt{A^{-2}}$ and the other by $A$, where $A(\theta)$ is a local factor that depends on $\theta$ and on the coordinate transformation. Computationally this is expressed by Jacobians, but only the power-of-$A$ dependences matter, and having those cancel out on multiplication.

This shows that the invariant prior is very non-unique, as there are many other ways to achieve the cancellation. The preference for the Jeffreys form of invariant prior is based on other considerations.

$\endgroup$ $\begingroup$

The clearest answer I have found (i.e., the most blunt "definition" of invariance) was a comment in this Cross-Validated thread, which I combined with the discussion in "Bayesian Data Analysis" by Gelman et al. to finally come to an understanding.

The key point is we want the following: If $\phi = h(\theta)$ for a monotone transformation $h$, then:

$$P(a \le \theta \le b) = P(h(a) \le \phi \le h(b))$$

Proof

First we show a probability density for which this is satisfied.

Let $p_{\theta}(\theta)$ be the prior on $\theta$. We will derive the prior on $\phi$, which we'll call $p_{\phi}(\phi)$. By the transformation of variables formula,

$$p_{\phi}(\phi) = p_{\theta}( h^{-1} (\phi)) \Bigg| \frac{d}{d\phi} h^{-1}(\phi) \Bigg| $$

Now, according to this Wikipedia page, the derivative of the inverse gives:

$$p_{\phi}(\phi) = p_{\theta}( h^{-1} (\phi)) \Bigg| h'(h^{-1}(\phi)) \Bigg|^{-1} $$

We will write this in another way to make the next step clearer. Recalling that $\phi = h(\theta)$, we can write this as

$$p_{\phi}(h(\theta)) = p_{\theta}(\theta) \Bigg| h'(\theta) \Bigg|^{-1}.$$

Now we get to the good part.

\begin{aligned} P(h(a)\le \phi \le h(b)) &= \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi\\ \end{aligned}

using the substitution formula from Wikipedia with $\phi = h(\theta)$

\begin{aligned} \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi &= \int_{a}^{b} p_{\phi}(h(\theta)) h'(\theta) d\theta\\ &= \int_{a}^{b} p_{\theta}(\theta) \Bigg| h'(\theta) \Bigg|^{-1} h'(\theta) d\theta, \end{aligned}

where we have used our result above.

Now, we can drop the absolute value bars around $h'(\theta)$. If $h$ is increasing, then $h'$ is positive and we don't need the absolute value. If $h$ is decreasing, then $h(b) < h(a)$, which means the integral gets a minus in front of it. When we drop the bars, we can cancel $h'^{-1}$ and $h'$, giving

$$ \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi = \int_{a}^{b}p_{\theta}(\theta) d\theta$$

and hence

$$ P(a \le \theta \le b) = P(h(a) \le \phi \le h(b))$$
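As a concrete numerical check of this identity (my own sketch, using the Jeffreys prior $(i)$ from the question, an illustrative choice $h(\theta) = \theta^2$, and an arbitrary interval $[a, b] = [0.2, 0.6]$):

```python
import math

def p_theta(t):
    # the Jeffreys prior (i) for the binomial trial
    return 1.0 / (math.pi * math.sqrt(t * (1.0 - t)))

# an illustrative monotone transformation phi = h(theta) = theta**2
def h(t):
    return t ** 2

def h_inv(p):
    return math.sqrt(p)

def p_phi(p):
    # transformation-of-variables formula from the proof above;
    # |d h^{-1}/d phi| = 1 / (2 sqrt(phi))
    return p_theta(h_inv(p)) * (0.5 / math.sqrt(p))

def integrate(f, lo, hi, n=100_000):
    # simple midpoint rule; the integrands are smooth on these intervals
    w = (hi - lo) / n
    return w * sum(f(lo + (i + 0.5) * w) for i in range(n))

a, b = 0.2, 0.6
lhs = integrate(p_theta, a, b)        # P(a <= theta <= b)
rhs = integrate(p_phi, h(a), h(b))    # P(h(a) <= phi <= h(b))

# the prior (i) also has the closed-form CDF (2/pi) * arcsin(sqrt(theta))
exact = (2.0 / math.pi) * (math.asin(math.sqrt(b)) - math.asin(math.sqrt(a)))
print(lhs, rhs, exact)  # all three agree to several decimal places
```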

Now, we need to show that a prior chosen as the square root of the Fisher information has this property. This proof is clearly laid out in these lecture notes.

Hope this helps.

$\endgroup$ $\begingroup$

The property of "invariance" does not necessarily mean that the prior distribution is invariant under "any" transformation. To make sure that we are on the same page, let us take the example of the "principle of indifference" used in the problem of birth-rate analysis given by Laplace. The link given by the OP contains the problem statement in good detail. Here the argument used by Laplace was that he saw no reason to prefer any value $p_1$ over any other value $p_2$ for the probability of the birth of a girl.

Suppose there was an alien race that wanted to do the same analysis as done by Laplace. But let us say they were using some log scaled parameters instead of ours. (Say they were reasoning in terms of log-odds ratios). It is perfectly alright for them to do so because each and every problem of ours can be translated to their terms and vice-versa as long as the transform is a bijection.

The problem here is about the apparent "Principle of Indifference" considered by Laplace. Though his prior was perfectly alright, the reasoning used to arrive at it was at fault. Say if the aliens used the same principle, they would definitely arrive at a different answer than ours. But whatever we estimate from our priors and the data must necessarily lead to the same result. This "Invariance" is what is expected of our solutions. But using the "Principle of Indifference" violates this.

In the above case, the prior is telling us "I don't want to give one value $p_1$ more preference than another value $p_2$", and it continues to say the same even after transforming the prior. The prior does not lose this information. In other words, on transforming the prior to a log-odds scale, the prior still says "See, I still consider no value $p_1$ to be preferable over another $p_2$", and that is why the prior transformed to the log-odds scale is not flat: it carries genuine prior information, which is exactly why the transformed pdf is not flat.

Now how do we define a completely "uninformative" prior? That seems to be an open-ended question full of debates. But nonetheless, we can make sure that our priors are at least uninformative in some sense. That is where this "Invariance" comes into the picture.

Say that we have two experimenters who aim to find out the number of events that occurred in a specific time (a Poisson distribution). Unfortunately, if their clocks run at different speeds (say, $t' = qt$), then their results will conflict unless they account for this difference in time scales. Whatever priors they use must be completely uninformative about the scaling of time between the events. This is ensured by the use of the Jeffreys prior for a scale parameter, the $\lambda^{-1}\,d\lambda$ prior (because it is the only general solution in the one-parameter case for scale invariance). This prior has only this type of invariance, not invariance under all transforms (maybe some others too, but certainly not all). Using any other prior would have the consequence that a change in the time scale leads to a change in the form of the prior, which would imply a different state of prior knowledge; but if we are completely ignorant of the time scale, then all time scales should appear equivalent. The use of these "uninformative priors" is completely problem-dependent and not a general method of forming priors. When this property of "uninformativeness" is needed, we seek priors that have the invariance of the type associated with that problem.
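The time-scale argument can be illustrated numerically. In this sketch (my own; the clock ratio $q$ and the interval are arbitrary made-up values), the $\lambda^{-1}\,d\lambda$ prior assigns the same mass to an interval of rates and to its rescaled image, while a flat prior does not.

```python
import math

def mass(prior, lo, hi, n=100_000):
    # midpoint-rule integral of an (unnormalised) prior density over [lo, hi]
    w = (hi - lo) / n
    return w * sum(prior(lo + (i + 0.5) * w) for i in range(n))

def scale_prior(lam):
    # the lambda^{-1} d(lambda) prior discussed above
    return 1.0 / lam

def flat_prior(lam):
    # a flat prior, for contrast
    return 1.0

q = 3.0          # assumed clock ratio: t' = q t, so rates rescale as lambda' = lambda / q
a, b = 2.0, 8.0  # an arbitrary interval of rate values

# The scale prior assigns the same mass to [a, b] as to the rescaled
# interval [a/q, b/q] -- both equal ln(b/a):
print(mass(scale_prior, a, b), mass(scale_prior, a / q, b / q))  # both ≈ 1.386

# A flat prior is not scale invariant: the masses differ by a factor of q
print(mass(flat_prior, a, b), mass(flat_prior, a / q, b / q))    # 6.0 vs 2.0
```

So two experimenters whose clocks disagree, but who both use the $\lambda^{-1}\,d\lambda$ prior, assign identical prior probabilities to corresponding ranges of rates.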

(More information on this scale and location invariance can be found in Probability Theory: The Logic of Science by E. T. Jaynes; the time-scale invariance problem is also discussed there.)

$\endgroup$
