Talk:Fisher information

Untitled

The first line in 'Example' is missing the left-hand parenthesis to match a ")". Thank you for a nice article!

There seem to be some superfluous brackets in the expectation notation: neither $\mathbb {E} X^{2}$ nor $\left[\mathbb {E} X\right]^{2}$ is ambiguous, and in fact their difference should be the variance of X if I remember right. $\mathbb {E} X^{2}$ does not need to be disambiguated according to standard order of operations. -- 130.94.162.61 04:45, 10 February 2006 (UTC)

Also there are some philosophical issues not discussed in the article. Without some prior probability distribution on $\theta$, how can we ever hope to extract information about it? For example, take a person's height. We usually start with some a priori idea or expectation of what a person's height ought to be before taking any measurements at all. If we measure a person's height to be 13 feet, we would normally assume the measurement was wrong and probably discard it (as a so-called "outlier"). But if more and more measurements gave a result in the vicinity of 13 feet, it might dawn on us that we are measuring a giant. On the other hand, a single measurement of 5 feet 5 inches would probably convince us of someone else's height to a reasonable degree of accuracy. Fisher information doesn't say anything about a priori probability distributions on θ. A maximum likelihood estimator which assumes a "uniform" distribution over all the reals (w.r.t. the Lebesgue measure) is an absurdity. I'm not sure I'm making any sense (and feel free to delete this comment if I'm not), but I don't believe any information can be extracted about an unknown parameter without having beforehand some rough estimate of the a priori probability distribution of that parameter. -- 130.94.162.61 13:54, 10 February 2006 (UTC)

Added Regularity condition

The above comment is specious. The writer brings up a point that Fisher Information does not speak to. Fisher information assumes that one is estimating a parameter and that there is no a priori distribution of that parameter. This is one of the weaknesses of Fisher Information. However, it is not relevant to an article about Fisher information except in the context of "Other formulations." There is, however, an important error in this article. The second derivative version of the definition of Fisher Information is only valid if the proper regularity condition is met. I added the condition, though this may not be the best representation of it. The formula looks rather ugly to me, but I don't have time to make it pretty. Sorry! --67.85.203.239 22:15, 12 February 2006 (UTC)

My comment above was somewhat specious, but when I carry out the differentiation of the second derivative version of the Fisher information, I get a term

\mathbb {E} \left[{\frac {{\frac {\partial ^{2}}{\partial \theta ^{2}}}f(X|\theta )}{f(X|\theta )}}\right]\mathrm {\ or\ } \int _{X}{\frac {\partial ^{2}}{\partial \theta ^{2}}}f(x|\theta )\,dx

that must be equal to zero. Is this valid for a regularity condition or at all what is wanted here? The regularity condition that was added to the article doesn't make much sense to me, since it contains a capital X and no expectation taken over it. Please excuse my ignorance. As to my comment above, I still think something belongs in the article (in the way of introduction) to tell someone like me what Fisher information is used for as well as when or why it should or shouldn't be used. As the article stands, it's just a bunch of mathematical formulae without much context or discussion. -- 130.94.162.61 22:06, 8 March 2006 (UTC)

There should be a little more discussion of the Cramér-Rao inequality, too. -- 130.94.162.61 22:31, 8 March 2006 (UTC)

But isn't it generally going to be the case (assuming the 2nd derivative exists)

\int {\frac {\partial ^{2}}{\partial \theta ^{2}}}f(X;\theta )\,dx={\frac {\partial ^{2}}{\partial \theta ^{2}}}\int f(X;\theta )\,dx={\frac {\partial ^{2}}{\partial \theta ^{2}}}1=0

71.221.255.155 07:35, 8 December 2006 (UTC)
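For reference, a short worked derivation of why this integral is the relevant regularity condition. This is only a sketch: it assumes $f(x;\theta )$ is twice differentiable in $\theta$, that its support does not depend on $\theta$, and that differentiation and integration can be exchanged. The symbol $I(\theta )$ for the Fisher information is notation assumed here, not taken from the comments above.

```latex
\begin{align}
\frac{\partial^{2}}{\partial\theta^{2}}\ln f(x;\theta)
  &= \frac{\frac{\partial^{2}}{\partial\theta^{2}}f(x;\theta)}{f(x;\theta)}
   - \left(\frac{\partial}{\partial\theta}\ln f(x;\theta)\right)^{2}, \\
\mathbb{E}\left[\frac{\partial^{2}}{\partial\theta^{2}}\ln f(X;\theta)\right]
  &= \int \frac{\partial^{2}}{\partial\theta^{2}}f(x;\theta)\,dx - I(\theta), \\
I(\theta)
  &= \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\ln f(X;\theta)\right)^{2}\right].
\end{align}
```

So the second-derivative form $I(\theta )=-\mathbb {E} \left[\partial ^{2}\ln f/\partial \theta ^{2}\right]$ agrees with the squared-score form exactly when the integral term vanishes. That is the case whenever the exchange of derivative and integral is justified; it can fail, for instance, when the support depends on $\theta$, as for a uniform distribution on $[0,\theta ]$.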

Some things unclear(/wrong?)

In the expression

$\int {\frac {\partial ^{2}}{\partial \theta ^{2}}}f(X;\theta )\,dx=0,$

might it be $f(x;\theta )$?

Also, it is unclear whether the $\theta$'s must cover the whole parameter space, or could cover some subspace. In discussing the N-variate Gaussian, it is said that the information matrix has indices running from 1 to $N$, but there are $(N+1)(N+2)/2$ parameters to describe a Gaussian. This is probably a mistake. PhysPhD

Say more about Roy Frieden's work

I should admit that I have studied mathematical statistics. Even so, by Wiki standards, this entry is not unduly technical. I've added some links (and am sure more could be added) that should help the novice reader along. The first person to contribute to this talk page is an unwitting Bayesian, when (s)he calls for a "prior distribution" on θ. Information measures and entropy are bridges connecting classical and Bayesian statistics. This entry should sketch bits of those bridges, if only by including a few links. This entry should say more comparing and contrasting Fisher information with the measures of Shannon, Kullback-Leibler, and possibly others.

Wiki should also say more, somewhere, about the extraordinary work of Roy Frieden. Frieden, a respectable physicist, has written a nearly 500pp book arguing that a great deal of theoretical physics can be grounded in Fisher information and the calculus of variations. This should not come as a complete surprise to anyone who has mastered Hamiltonian mechanics and has thought about the principle of least action, but even so, Frieden's book is a breathtaking high wire act. It appears that classical mechanics, electromagnetism, thermodynamics, general relativity, and quantum electrodynamics are all merely different applications of a few core information-theoretic and variational principles. Frieden (2004) also includes a chapter on what he thinks his EPI approach could contribute to unsolved problems, such as quantum gravitation, turbulence, and topics in particle physics. Could EPI even prove to be the eventual gateway to that Holy Grail of contemporary science, the unification of the three fundamental forces, electroweak, strong, and gravitation? I should grant that EPI doesn't answer everything; for example, it sheds no light on why the fundamental dimensionless constants take on the values that they do. Curiously, Frieden says little about optics even though that was his professional specialty. 202.36.179.65 13:19, 11 April 2006 (UTC)

A number of links to articles about Frieden and his work are already in this article. Michael Hardy 20:31, 11 April 2006 (UTC)

The physical and mathematical correctness of Frieden's ideas has been characterized as highly dubious by several knowledgeable observers; see, for example, Ralph F. Streater's "Lost Causes in Theoretical Physics: Physics from Fisher Information", and Cosma Shalizi's review of Physics from Fisher Information. QuispQuake 14:55, 12 July 2006 (UTC)

Hey, 202.36.179.65, don't be coy. You must be the man himself! 81.178.157.195 (talk) 11:40, 31 January 2012 (UTC)

B. Roy Frieden's anonymous POV-pushing edits

B. Roy Frieden claims to have developed a "universal method" in physics, based upon Fisher information. He has written a book about this. Unfortunately, while Frieden's ideas initially appear interesting, his claimed method has been characterized as highly dubious by knowledgeable observers (Google for a long discussion in sci.physics.research from some years ago.)

Note that Frieden is Prof. Em. of Optical Sciences at the University of Arizona. The data.optics.arizona.edu anon has used the following IPs to make a number of questionable edits:

150.135.248.180 (talk · contribs)
1. 20 May 2005 confesses to being Roy Frieden in real life
2. 6 June 2006: adds cites of his papers to Extreme physical information
3. 23 May 2006 adds uncritical description of his own work in Lagrangian and uncritically cites his own controversial book
4. 22 October 2004 attributes uncertainty principle to Cramer-Rao inequality in Uncertainty Principle, which is potentially misleading
5. 21 October 2004 adds uncritical mention of his controversial claim that Maxwell-Boltzmann distribution can be obtained via his "method"
6. 21 October 2004 adds uncritical mention of his controversial claim that the Klein-Gordon equation can be "derived" via his "method"
150.135.248.126 (talk · contribs)
1. 9 September 2004 adds uncritical description of his work to Fisher information
2. 8 September 2004 adds uncritical description of his highly dubious claim that EPI is a general approach to physics to Physical information
3. 16 August 2004 confesses IRL identity
4. 13 August 2004 creates uncritical account of his work in new article, Extreme physical information

These POV-pushing edits should be modified to more accurately describe the status of Frieden's work. ---CH 21:54, 16 June 2006 (UTC)

Hear, hear! I totally agree with the first few sentences of this talk section, and perhaps it should appear in the article as a health warning. 81.178.157.195 (talk) 11:39, 31 January 2012 (UTC)

Graphs to improve technical accessibility

In addressing the technical accessibility tag above, I would recommend the addition of some graphs. For example, this concept could be related to the widely understood concept of the Gaussian bell curve. -- Beland 21:35, 4 November 2006 (UTC)

Minus sign missing?

In the one-dimensional equation, there is a minus sign in the equation linking the second derivative of the log likelihood to the variance of theta. This stands to reason, as we want maximum, not minimum likelihood, so the second derivative becomes negative. In the matrix formulation below, there is no minus sign. Should it not be there, too? In practice, of course, one often minimizes sums of squares, or other "loss" functions, instead. This already is akin to -log(L). I am not a professional statistician, but I use statistics a lot in my profession, microbiology. I did not find the article too technical. After all, the subject itself is somewhat technical. Wikipedia does a great job of making gems such as this accessible. 82.73.149.14 19:51, 30 December 2006 (UTC) Bart Meijer

Style

I think that the style in which parts of this article are written is more appropriate for a textbook than for an encyclopedia article. For example: "To informally derive the Fisher Information, we follow the approach described by Van Trees (1968) and Frieden (2004)" This type of comment is only really appropriate in a textbook where a single author or a few authors are writing a book with a coherent theme. An encyclopedia article ought to adopt a different style: in particular, I object to the use of the term "we", as on Wikipedia, with so many authors and with anonymous authors, it is not clear who the word "we" refers to. Instead, I think we should word things "Van Trees (1968) and Frieden (2004) provide the following method of deriving the Fisher information informally:". I am going to rewrite this to try to eliminate these sorts of comments. But...I think this style problem goes beyond just the use of the word "we"...it's pretty pervasive and it needs deep changes. Cazort (talk) 18:14, 10 January 2008 (UTC)

Informal Derivation & Definition

This derivation doesn't seem to be a derivation of the Fisher information, but rather, a derivation of the relationship between Fisher information and the bound on the variance of an estimator. Does everyone agree with me that this should be renamed? Also, this remark relates to the definition of Fisher information. For example, the comment "The Fisher information is the amount of information" is loaded, because it is not defined what information means. I am going to weaken this statement accordingly. If we can come up with a more rigorous and more precise definition then we should include it! Cazort (talk) 18:22, 10 January 2008 (UTC)

How about putting in 'Mutual Information' and 'Joint Information' discussion

I've heard mention of "mutual information" and "joint information" (bivariate discrete random variables); shouldn't these terms be discussed? 199.196.144.13 (talk) 21:08, 29 May 2008 (UTC)

Merge “Observed information”

I suggest that the article Observed information be merged with the current one, since it repeats the definition of the Fisher information, only substituting the expected value w.r.t. sample probability distribution instead of the expected value with respect to the population. As such, the observed information is simply the sample Fisher information. … stpasha » 07:20, 24 January 2010 (UTC)

Well, the two are different things, and there is even an article contrasting the two in the refs for observed information, so I'm not sure why you think they need to be merged. Is it because there is not much detail in observed information? --Zvika (talk) 08:20, 24 January 2010 (UTC)

The observed information is given by the formula

I_{\hat {\theta }}=-{\frac {\partial ^{2}}{\partial \theta \partial \theta '}}\ell ({\hat {\theta }})=-{\frac {\partial ^{2}}{\partial \theta \partial \theta '}}{\frac {1}{n}}\sum _{i=1}^{n}\ln f(x_{i}|{\hat {\theta }})={\widehat {\operatorname {E} }}{\bigg [}{-{\frac {\partial ^{2}\ln f(x_{i}|{\hat {\theta }})}{\partial \theta \partial \theta '}}}{\bigg ]}

The “expected Fisher information” is given by a similar formula, only we use the population expectation:

{\mathcal {I}}_{\hat {\theta }}=\operatorname {E} {\bigg [}{-{\frac {\partial ^{2}\ln f(x_{i}|{\hat {\theta }})}{\partial \theta \partial \theta '}}}{\bigg ]}

Both of these are valid estimators for the Fisher information quantity, which is

{\mathcal {I}}=\operatorname {E} {\bigg [}{-{\frac {\partial ^{2}\ln f(x_{i}|\theta _{0})}{\partial \theta \partial \theta '}}}{\bigg ]}

The article you are referring to compares properties of these two estimators and finds that the first one gives more accurate confidence intervals than the second one (although of course asymptotically they are equivalent). Anyways, the concept of “observed information” is just an estimator of the Fisher information of the model, and thus should be merged with this article, in my opinion. … stpasha » 08:45, 24 January 2010 (UTC)

I guess it's possible to merge. It's just that this article is already quite long. The question is whether the merge will improve or hinder readability; adding too many sections makes it difficult to extract the "short story" from the article. My tendency would be to mention the concept briefly here (perhaps in the section Applications, or in a separate section), and link to observed information for more details. However, I don't think it's that critical, so if you feel strongly about this you can go ahead and merge as far as I'm concerned. --Zvika (talk) 09:18, 24 January 2010 (UTC)

I vote to keep the two articles separate, for clarity purposes and so that it's more likely to find both phrases when searching on Google. 70.22.219.191 (talk) 22:34, 31 January 2010 (UTC)

Keep separate. To argue that "“observed information” is just an estimator of the Fisher information" ignores the fact that it is better to use the “observed information” in computations and statistical inference, as indicated in the reference in observed information. Melcombe (talk) 14:54, 14 September 2010 (UTC)

Merge tag removed, as no support or action for 2 years. Melcombe (talk) 00:22, 8 February 2012 (UTC)
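To make the distinction discussed in this thread concrete, here is a minimal sketch comparing the sample-average ("observed") quantity with the population expectation. It assumes a Cauchy location model with unit scale, chosen only because the two quantities visibly differ in small samples; the function names are hypothetical and the per-observation expected information of 1/2 is the known closed form for that model.

```python
import math
import random


def neg_second_derivative_log_cauchy(x, theta):
    """-d^2/dtheta^2 of ln f(x; theta) for a Cauchy(theta, scale=1) density."""
    u = x - theta
    return 2.0 * (1.0 - u * u) / (1.0 + u * u) ** 2


def observed_information(xs, theta):
    """Sample average of the negative second derivative (per observation)."""
    return sum(neg_second_derivative_log_cauchy(x, theta) for x in xs) / len(xs)


# Closed-form expected information per observation for this model (assumed model).
EXPECTED_INFORMATION = 0.5


def sample_cauchy(n, theta, rng):
    """Draw n Cauchy(theta, 1) variates by inverting the CDF."""
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    small = sample_cauchy(20, 1.0, rng)
    large = sample_cauchy(200_000, 1.0, rng)
    # The small-sample value fluctuates; the large-sample value approaches 0.5.
    print(observed_information(small, 1.0))
    print(observed_information(large, 1.0))
    print(EXPECTED_INFORMATION)
```

The sketch evaluates the curvature at a supplied value of theta; in the formulas above one would plug in the maximum likelihood estimate instead, which needs a numerical optimizer that is left out here.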

Fisher Information and Its Relation to Entropy

Thanks for correcting my edits to the Fisher information page, and sorry for saying something that wasn't quite correct (and also for getting the sign wrong!). The claim that the Fisher information is the Hessian of the entropy was in the article before I edited it, so it's good that it's gone now.

Correct me if I'm wrong, but it seems the Fisher information is always equal to the negative Hessian of the entropy for discrete probability distributions. I'd worked it out for discrete distributions and naively assumed it was true in general, but this looks like one of the many quirks of the definition of the continuous entropy as

H=-\int p(x)\ln p(x)\,dx.

(OT rant: IMO the continuous entropy should never have been defined that way, since it's not equal to the continuous limit of the discrete entropy, which actually diverges to infinity, and lacks many of the desirable properties of the discrete version. If you put in a scaling factor to prevent divergence, and are careful to make it invariant to coordinate changes, you always end up with a relative entropy instead of H as defined above.)

Anyway, if it is true that the Fisher information is equal to the negative Hessian of the entropy for discrete distributions I'd like to put the $-\partial ^{2}H/\partial \theta _{i}\partial \theta _{j}$ formula at some early point in the article (along with a caveat about continuous distributions), since it would help someone with my background get a handle on the Fisher information a bit more easily.

Nathaniel Virgo (talk) 14:19, 7 October 2010 (UTC)

If I take the difference of the two expressions, I find that they are equal if and only if

\int {\frac {\partial ^{2}f(x\mid \theta )}{\partial \theta _{i}\partial \theta _{j}}}\ln f(x\mid \theta )\,dx=0\,,

or the discrete equivalent.

So, for instance, if I parameterize my distribution such that the probability (or probability density) of an outcome is some linear combination of the parameters (i.e., so that

{\frac {\partial ^{2}f(x\mid \theta )}{\partial \theta _{i}\partial \theta _{j}}}=0

for all i, j, and x) then I have equivalence. In particular, if θ and (1 − θ) are used to weight the linear mixture of two distributions, discrete or continuous, then I have equivalence. However, when things are non-linear, I may find myself in deep trouble. Quantling (talk) 19:47, 7 October 2010 (UTC)
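The linear versus non-linear point can be checked numerically. The following is a minimal sketch for a two-outcome discrete distribution, using finite differences; the function names and the two example parameterizations (probabilities θ, 1 − θ and θ², 1 − θ²) are illustrative assumptions, not taken from the article.

```python
import math


def entropy(pmf):
    """Shannon entropy (natural log) of a list of probabilities."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def fisher_information(pmf_of_theta, theta, h=1e-5):
    """Sum over outcomes of (dp/dtheta)^2 / p, with a central difference."""
    lo, mid, hi = pmf_of_theta(theta - h), pmf_of_theta(theta), pmf_of_theta(theta + h)
    return sum(((b - a) / (2 * h)) ** 2 / p for a, p, b in zip(lo, mid, hi))


def neg_entropy_hessian(pmf_of_theta, theta, h=1e-4):
    """-d^2 H / dtheta^2 via a central second difference."""
    def H(t):
        return entropy(pmf_of_theta(t))
    return -(H(theta + h) - 2 * H(theta) + H(theta - h)) / h ** 2


def linear_pmf(theta):
    """Probabilities are linear in theta."""
    return [theta, 1 - theta]


def quadratic_pmf(theta):
    """Probabilities are non-linear in theta."""
    return [theta ** 2, 1 - theta ** 2]


if __name__ == "__main__":
    for name, pmf, theta in [("linear", linear_pmf, 0.3), ("quadratic", quadratic_pmf, 0.5)]:
        print(name, fisher_information(pmf, theta), neg_entropy_hessian(pmf, theta))
```

In the linear case the two numbers agree up to discretization error; in the quadratic case they differ, which is consistent with the extra integral term identified above.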

f(x;θ): what is the proper definition

Hi All,

Firstly, does the ; symbol mean the same as | (given), and secondly, I'm assuming f(x|θ) is a pdf for a continuous variable?

Thanks, Sachin Sachinabey (talk) 08:12, 9 May 2011 (UTC)

I believe yes, and yes continuous for θ and x, though they could both be vectors. I don't know why they used ; instead of | in the article. Dmcq (talk) 09:44, 9 May 2011 (UTC)

The semicolon may be subtly different from the pipe symbol. The latter is used to indicate a conditional probability,

\Pr(x|\theta )={\frac {\Pr(x,\theta )}{\Pr(\theta )}}\,

and similarly for probability densities. Thus, the pipe symbol is most meaningful when it is reasonable to talk about $\Pr(\theta )$, a probability of $\theta$. On the other hand $\Pr(x;\theta )$ and ${\Pr }_{\theta }(x)$ indicate that the probability distribution is parameterized by $\theta$, but do not imply the existence of a probability distribution on $\theta$. Does that make sense? -- Quantling (talk | contribs) 16:40, 10 May 2011 (UTC)

Inverse of FIM for multivariate Gaussian

Nowhere in the article does it say that the Fisher Information Matrix is the inverse of the covariance matrix in the multivariate normal case. Yet this information is used in many sources, especially in the context of Bayesian Networks (e.g. see http://en.wikipedia.org/wiki/Kalman_filter#Information_filter) -- Preceding unsigned comment added by 89.204.138.242 (talk) 12:34, 23 January 2013 (UTC)
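A small numerical illustration of this relationship, as a sketch under stated assumptions: the parameter of interest is the mean vector of a bivariate normal, the covariance matrix is known, and the 2x2 matrix `SIGMA` below is an arbitrary example. Under those assumptions the score with respect to the mean is Σ⁻¹(x − μ), and the average outer product of the score should approach Σ⁻¹. All function names are hypothetical.

```python
import math
import random

SIGMA = [[2.0, 0.6], [0.6, 1.0]]  # illustrative covariance, assumed known


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def sample_centered_normal(sigma, rng):
    """Draw one sample of x - mu using a hand-written 2x2 Cholesky factor."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [l11 * z1, l21 * z1 + l22 * z2]


def monte_carlo_fim_for_mean(sigma, n, rng):
    """Average outer product of the score Sigma^{-1} (x - mu) w.r.t. the mean."""
    precision = inverse_2x2(sigma)
    total = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n):
        d = sample_centered_normal(sigma, rng)
        score = [precision[i][0] * d[0] + precision[i][1] * d[1] for i in range(2)]
        for i in range(2):
            for j in range(2):
                total[i][j] += score[i] * score[j]
    return [[total[i][j] / n for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    rng = random.Random(1)
    print(monte_carlo_fim_for_mean(SIGMA, 100_000, rng))  # Monte Carlo estimate
    print(inverse_2x2(SIGMA))                             # Sigma^{-1}
```

Note that this covers only the information about the mean; the information matrix for the covariance parameters has a different form.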

Far too technical

This article is virtually useless to any reader who is not already familiar with the field. I came here simply to find out what a 'Fisher matrix' is and there is nothing here which clearly answers that question. There's a sentence stating the general idea but it then dives directly into the full derivation with no simple example. The page appears to be trying to be a postgrad textbook rather than an article for a reader who has come across a term and would like to know what it means. Sadly I'm nowhere near able to do so myself, but I would suggest this article needs: 1) A simple example of what a Fisher matrix is. 2) A beginner-friendly description of what its components actually mean. The priority for any page should be to give someone who has never heard of the subject before a general idea of what the subject is; this article seems to fail on that. -- Preceding unsigned comment added by 86.153.104.154 (talk) 09:47, 4 April 2014 (UTC)

Bad lead

The introductory paragraph doesn't make any sense at all. It says:

In mathematical statistics, the Fisher information (sometimes simply called information[1]) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.

But X doesn't depend on $\theta$ at all. $X$ is external data. $\theta$ is a model parameter with which we are modelling $X$.

MisterSheik (talk) 07:45, 4 September 2014 (UTC)

It doesn't say X depends on θ, it says the probability of X depends on θ. For example, in Fisher information#Single-parameter Bernoulli experiment, the parameter θ is the probability of the "success" outcome. Qwfp (talk) 09:58, 4 September 2014 (UTC)

In that case, the probability of X doesn't depend on theta. theta is our belief about X, which is updated when X is observed. However, X is presumably generated according to some external process, which theta hopes to model. Therefore, it's theta that in fact depends on X, not the other way around. I think what's confusing you is that the causal arrow points from theta to X, but the causal arrow doesn't necessarily point in the same direction as the flow of information. MisterSheik (talk) 10:47, 4 September 2015 (UTC)

Not exactly. In the non-Bayesian framework (used by Fisher), $\theta$ is a deterministic parameter (it is not a belief) and the PDF of $X$ (note: $X$ is a random variable) is parametrized by $\theta$. User:Qwfp was not wrong in saying that the probability of $X$ depends on $\theta$. In fact, when you view the PDF of $X$ as a function of $\theta$, with $x$ fixed (see: S. Kay, Fundamentals of Statistical Signal Processing: Volume 1), you get the likelihood function $l(\theta ;x)$ (the only rigorous way of writing it, explicitly stating it is a function of $\theta$, but $f(x;\theta )$ is also acceptable). In the Bayesian framework, $\theta$ is a random variable, the likelihood function is $f(x|\theta )$ and the story is different.

In ML (Maximum Likelihood) estimation of deterministic parameters (non-Bayesian framework), ${\hat {\theta }}$ (not $\theta$, which is unknown) is what is actually computed/updated when new data is observed. Here, ${\hat {\theta }}$ (ML estimator of $\theta$) is a random variable and it does depend on the statistical properties of $X$, so it is actually correct to say that ${\hat {\theta }}$ depends on $X$. -- Preceding unsigned comment added by 130.83.42.73 (talk) 12:50, 1 August 2016 (UTC)

I agree with the original poster throughout. This article HEAVILY leans on the assumption that X is actually distributed as per $f(X, \theta)$, whereas this is almost never the case in which Fisher information is actually used. Generally speaking, one performs the calculations with $\hat \theta$, which is NOT the true parameter, and then uses Fisher information to figure out how far away the true parameter is likely to be. In this case, most of the properties described in this article (e.g. the Fisher matrix being the negative of the Hessian, and the average score being zero) simply do not hold. This is doubly true if the model is misspecified, as essentially all real world models would be, as in that case there does not exist a true parameter at all. I think the whole article needs either a notational update (e.g. $\theta_0$ instead of $\theta$ basically everywhere), or a lot more discussion of what's true and what's not if an estimate of the parameter is used to do these calculations. -- Preceding unsigned comment added by Vertigre (talk • contribs) 21:25, 25 May 2018 (UTC)

Question about | vs ;

I have a question similar to Sachin's. I'm not following this article when it uses | and ; in different contexts. In the computation of the first moment, are we dealing with conditional expectation or just expectation? If conditional then the switch to integration does not make sense as we should use the conditional density. The derivation only makes sense to me if | is replaced with ;. Can someone elucidate this for me? Smk65536 (talk) 14:55, 23 October 2015 (UTC)

The difference between the semicolon and the pipe was already explained in a previous comment by User:Quantling, but I will try to explain it in different words. First, however, note that the notation in the article is not perfectly consistent; that's why sometimes you find ; and sometimes | (even in the same formula). The key point is the following: there are two cases to consider. The first is the case of a deterministic parameter $\theta$; in this case, the likelihood function is correctly written with the semicolon, $f(x;\theta )$, since it means that the likelihood is parametrized by $\theta$. The second is the case where $\theta$ is a random variable, thus the likelihood function is correctly written with the pipe, $f(x|\theta )$, since it is the conditional density of the data (conditioned on $\theta$). All the expressions for the expectations easily follow from here.

In general, if you write any formula with " | ", including the likelihood function, it is implicitly assumed that if $\theta$ is deterministic, you read " | " as " ; ". -- Preceding unsigned comment added by 130.83.42.73 (talk) 11:45, 1 August 2016 (UTC)

Variance of $n$ independent Bernoulli trials is ${\frac {p(1-p)}{n}}$?

In the section "Single-parameter Bernoulli experiment," could you explain why the variance of the mean of successes in "n Bernoulli trials" is ${\frac {\theta (1-\theta )}{n}}$? This is what you imply when you say that the variance in question is the inverse of the additive Fisher information. Everywhere I looked, the variance of the mean of successes in "n Bernoulli trials," with probability of success $p$, is $np(1-p)$.

Also, why did you drop the word "independent" in the last sentence of that section? -- Preceding unsigned comment added by 174.192.30.141 (talk) 04:21, 3 May 2018 (UTC)
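A quick exact check may help separate the two quantities in this question. The sketch below assumes n independent trials with success probability θ and computes, directly from the binomial probabilities, the variance of the number of successes and the variance of the proportion of successes; the function names are hypothetical.

```python
import math


def binomial_pmf(n, k, theta):
    """Probability of exactly k successes in n independent Bernoulli(theta) trials."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)


def variance(values_and_probs):
    """Variance of a discrete random variable given (value, probability) pairs."""
    mean = sum(v * p for v, p in values_and_probs)
    return sum((v - mean) ** 2 * p for v, p in values_and_probs)


def variance_of_count(n, theta):
    """Variance of the number of successes."""
    return variance([(k, binomial_pmf(n, k, theta)) for k in range(n + 1)])


def variance_of_mean(n, theta):
    """Variance of the proportion of successes (count divided by n)."""
    return variance([(k / n, binomial_pmf(n, k, theta)) for k in range(n + 1)])


def inverse_fisher_information(n, theta):
    """1 / (n * I(theta)) with per-trial I(theta) = 1 / (theta * (1 - theta))."""
    return theta * (1.0 - theta) / n


if __name__ == "__main__":
    n, theta = 10, 0.3
    print(variance_of_count(n, theta))           # close to n * theta * (1 - theta)
    print(variance_of_mean(n, theta))            # close to theta * (1 - theta) / n
    print(inverse_fisher_information(n, theta))
```

Under these assumptions the count has variance nθ(1 − θ), while the proportion (the count divided by n) has variance θ(1 − θ)/n, and it is the latter that matches the inverse of the additive Fisher information.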

Discrepancy in definition

I added a subsection that clarifies a commonly seen discrepancy in the definition of Fisher information. That is, some textbooks and notes define Fisher information with respect to one observation while some others define it using likelihood for all observations.

A critical problem is the lack of clarification about which version is used in each scenario. I hope someone can help with adding short phrases after some frequently used important results that clarify which version of the Fisher information definition is used. For example, Cramer-Rao lower bound (writing only $I(\theta)$ rather than $nI(\theta)$ on the denominator may cause a misunderstanding that this lower bound doesn't depend on $n$, and it will be much better if it's clarified immediately that this $I(\theta)$ is defined using the joint log-likelihood so it is linear in $n$) -- this might seem a bit repetitive but it's really not (it may save a lot of time for beginners from confusion, especially when they compare the $I(\theta)$ defined in C-R lower bound to the $I(\theta)$ that appears in the asymptotic normal variance of MLE, where the tradition is almost unanimously defining $I(\theta)$ for only one observation).

Yzhanghf0700 (talk) 18:52, 8 March 2019 (UTC)
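The two conventions can be put side by side as follows. This is a sketch assuming $n$ independent, identically distributed observations, an unbiased estimator in the bound, and the usual regularity conditions; the subscripted symbols $I_{1}$ (one observation) and $I_{n}$ (joint likelihood) are notation introduced here for illustration, not the article's.

```latex
\begin{align}
I_{n}(\theta) &= n\,I_{1}(\theta), \\
\operatorname{Var}(\hat{\theta}) &\geq \frac{1}{I_{n}(\theta)} = \frac{1}{n\,I_{1}(\theta)}, \\
\sqrt{n}\,\bigl(\hat{\theta}_{\mathrm{MLE}} - \theta_{0}\bigr)
  &\xrightarrow{d} \mathcal{N}\!\left(0,\ \frac{1}{I_{1}(\theta_{0})}\right).
\end{align}
```

Written this way, the bound in the second line depends on $n$ explicitly, while the asymptotic variance in the third line uses the single-observation quantity.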