User:NorwegianBlue/refdesk/statistics

From Wikipedia, the free encyclopedia

Statistics & Propagation of error with least square fits[edit]

I originally came across this question my freshman year (at college), and although I am now a senior, I have yet to get a concrete answer. Basically, in my program (chemistry), we must do a lot of propagation of error, and there is one particular point that confuses me. Basically, if I plot a set of data that already has an associated error with it, how can that error be incorporated into the slope and intercept of a best-fit line? For example, if I plot absorbance vs. concentration of some solution, each point will have an error in the X direction (concentration, due to errors in preparation) and an error in the Y direction (absorbance, error and variablity due to instrument). If I calculate a best fit line, how can I incorporate these errors with the normal errors generated with a best fit? Let me know if you don't understand and I will dig up an example. As a similar question, if I take a set of points with error and try to take an average, how can these be incorporated into the error (and/or standard deviation) of the average? Again if you need an example, let me know. I would be happy with just a name of an equation or point me to a page. I would be very grateful :) Thanks a lot --Bennybp 02:39, 11 January 2007 (UTC)[reply]

Well I sort of found an answer in Bayesian linear regression, although it looks much more complicated than I wanted it to be. Anything simpler? --Bennybp 03:06, 11 January 2007 (UTC)[reply]
If your data sets are humungous, incorporating the individual point errors will have only a minimal effect. Otherwise, one method you can use is to replace each data point in your data set by K copies, each of which is randomly perturbed by X and Y amounts having the known error distribution for that point. Then compute the mean and variance of the miraculously multiplied data set. The mean should be the same, but the variance should be larger. If N was the size of the original set and you would have used the bias correction for the variance estimator of replacing N by N−1 in the divisor, then now replace KN by KN−K. The same method can be used for least square fits; in fact, computing an average is a form of least square fit in one dimension. Unless you use a fixed pseudo-random sequence you should get slightly different results each time; if the differences are too large, then N was not large enough.  --LambiamTalk 05:54, 11 January 2007 (UTC)[reply]
Depending on how variable the errors in the X variable are relative to the errors in Y, the standard least squares "best-fit" line may be a very poor fit. Our regression dilution article gives a good overview of the issues, with good references. You might also be interested in errors-in-variables models. -- Avenue 15:39, 11 January 2007 (UTC)[reply]
The average is easy: the variance of the sum (of independent variables) is the sum of the variances, and the standard deviation scales linearly with what it describes. Therefore the standard deviation of the average is just as the average is the sum divided by N; . I separated the two factors of to emphasize that the standard deviation of the average is the RMS of the individual standard deviations, divided by . It is that division which makes repeating a measurement worthwhile; the resulting error is better than the "average" errors of the individual measurements.
Handling something like least-squares is more complicated, but it can be handled by general error propagation methods — see the first equation for in that article. You have two functions f for the slope and intercept of your line; if you do the fit with only symbolic values (this gets complicated for large N), you can then take the required partial derivatives and evaluate the total error. As a trivial example, consider just two points: obviously the line passes through both so the slope is just and the intercept is . The partial derivatives for m are
we can use those (with the chain rule) to easily evaluate the ones for b:
The total errors (using Greek letters for uncertainties for brevity, with ) are then
.
Throwing in numbers I get . The much higher uncertainty for b comes from the distance the points are from the y-axis; perhaps a more useful measure would be the uncertainty at the midpoint, which is . (I doubt it's a coincidence that with two points the uncertainty at the middle is just the distance to the middle times the uncertainty in the slope.) Obviously this algebraic approach would become tedious for N any bigger, but one can probably make it somewhat easier using the linear algebra approach to LSF and a CAS. Does this help? --Tardis 17:06, 11 January 2007 (UTC)[reply]
That all seems like a big help to me. I didn't think of applying propagation of uncertainty to the linear regression equation. And I usually get everything mixed up anyway (error due to measurements, standard deviation, etc.). Thanks a lot :) --Bennybp 00:47, 12 January 2007 (UTC)[reply]

Sample size vs. extreme observations[edit]

For a one-variable statistic with an infinite, standard-normally-distributed population, what is the relationship between the sample size and the expected highest value observed? NeonMerlin 23:49, 15 January 2007 (UTC)[reply]

Given a single measurement of a random variable with a cumulative distribution function F, the probability of getting a result that does not exceed s is F(s). For N measurements, the probability of not exceeding s in any of the measurements is F(s)^N. The derivative of F(s)^N with respect to s will thus tell you the probability of getting no result larger than s+ds in any of them, but a result of at least s in at least one of the measurements. That is thus the probability that your highest observed value was s. In other words, F(s)^N would be the cumulative distribution function of the highest observed value. I made that up from scratch, but a quick googling confirms it, see for example the abstract of [1] --mglg(talk) 00:51, 16 January 2007 (UTC)[reply]
That gives me, for the standard normal distribution, a CDF of Φ(x)n = and a PDF of d/dx that. I imagine someone who knows integral calculus will be able to simplify these, but how do I turn either of them into the expected value? NeonMerlin 01:40, 16 January 2007 (UTC)[reply]
In general, given a PDF you can calculate the expectation value as Integral[x PDF(x) dx]. I don't think you can get any analytic answer in this case. There is probably a useful approximation for large N, though, but I don't know it. By the way, (essentially) your integral above has a name: the Error function. --mglg(talk) 03:40, 16 January 2007 (UTC)[reply]
Let Z denote the highest of n independent random variables having the standard normal distribution. It is possible to give the following upper lower bound on the expected value of Z: E(Z) > Φ−1(n/(n+1)). This can be seen as follows. Φ is a monotonic functions, so it commutes with max. So Φ(Z) is the highest of the Φ-values of n random variables whose (cumulative) distribution function is Φ. But these Φ-values are then simply uniformly distributed on the interval (0,1), so Φ(Z) has distribution function F(x) = xn (as per above), and E(Φ(Z)) is now easily found to be n/(n+1). Since function Φ−1 is concave convex in the area of interest (for n > 1), E(Z) = E(Φ−1(Φ(Z))) > Φ−1(E(Φ(Z))) = Φ−1(n/(n+1)). I have not tried to estimate how tight this is, but I wouldn't be too surprised if E(Z) = o(Φ−1(n/(n+1))).  --LambiamTalk 05:21, 16 January 2007 (UTC)[reply]
Nice, Lambiam. But shouldn't it be "convex" and "lower bound", i.e. E(Z) > Φ−1(n/(n+1)) ? --mglg(talk) 19:12, 16 January 2007 (UTC)[reply]
You're right. In the meantime I have done some calculations, which suggest that the difference is not a runaway one; putting E(Z) = Φ−1(n/(nn)), the quantity δn appears to decrease very slowly as n increases:
      n  delta_n
  -----  -------
      1  1.00000
      2  0.80231
      4  0.71501
      8  0.67000
     16  0.64408
     32  0.62783
     64  0.61688
    128  0.60909
    256  0.60326
    512  0.59874
   1024  0.59511
   2048  0.59213
   4096  0.58965
   8192  0.58755
  16384  0.58578
It looks like δn will remain above 0.5.  --LambiamTalk 21:51, 16 January 2007 (UTC)[reply]

How do I tell if I've chosen an appropriate statistical distribution[edit]

I have a group of 12 observations. I'd like to predict what my observations will be in the future. I also need the distribution to apply Bayes Theorem.

Right now, I'm using the normal distribution but I don't know if that's the right choice. I've calculated the skewness and kurtosis of the data, but I don't have any idea what they're supposed to be! I mean, I know if my observations were truly normally distributed, the skewness would be zero, but I don't know if my skewness of 1.65 is "close enough" or what. Are there rules of thumb for this? moink 05:50, 1 June 2006 (UTC)[reply]

Under the assumption of normal distribution, the probability that a sample of 12 observations has a skewness whose absolute value is at least 1.65 is about 0.002. That is fairly low, and normally ground to reject the null hypothesis of normalcy. What is the source of the observations and how critical is the accuracy of the estimated distribution? Often the physical or other origin of the data suggests a plausible crude model for the distribution that is good enough in practice. --LambiamTalk 06:36, 1 June 2006 (UTC)[reply]
It is not particularly critical. I was actually kinda hoping not to have to share the type of data, but since it's apparently a very poor fit to the normal distribution, I guess I will. It's the length of my menstrual cycle. Now all the boys on the math RD can get all grossed out.  :) I like to know if I should carry tampons on me, and the Bayes' theorem thing... well, if you're very bright you may be able to figure it out but I will not provide an explanation. Here's the data: 32, 29, 28, 28, 26, 27, 27, 29, 36, 25, 26, 28. moink 07:41, 1 June 2006 (UTC)[reply]
You know, you could use a neural network for precisely this task. Neural networks can be used to predict the length of menstrual cycles as well as stock market values or other things. Choose an encoding for the lengths, train some sort of recurrent network on the data you have, and then get it to generate predictions. If I get time, and I am sufficiently bored, I might even try this for you. Dysprosia 07:49, 1 June 2006 (UTC)[reply]
Sounds cool but beyond my abilities. Right now, though, I'm less interested in predicting exactly the length of the next cycle, and more in knowing the approximate probability that it is at least some length so I can apply Bayes' theorem. moink 07:56, 1 June 2006 (UTC)[reply]
Using your data and the formula at skewness, I find a skewness of 1.27, which is still significantly different from the null hypothesis but less so. Looking at the data, the problem appears to be the outliers at the high side. If you censor the data by discarding values > 30, you get a good agreement with a normal distribution. Given the application censoring at the high side is acceptable, since you want confidence at the low side. The sample is still a bit small, though, to really confidently assume the low end behaves normally, without outliers. --LambiamTalk 14:38, 1 June 2006 (UTC)[reply]
So much for trusting my spreadsheet software. I thought about dropping the large ones, but it's in the higher range that I'm most interested in the probability, and since it seems that it does occasionally get that large, and not that rarely, I wanted to take that into account. moink 15:28, 1 June 2006 (UTC)[reply]
Well, I'm by no means suggesting that this is what you are trying to calculate, but just for the sake of argument: if A=pregnant, and B=menstruation has not yet occurred, and one were interested in P(A|B), then P(B|A) would of course be very close to unity, but what value should be used for P(A)? Would the age specific fertility rate be correct? --NorwegianBlue 15:01, 1 June 2006 (UTC)[reply]
Addendum: P(A) would obviously have to be either zero, or a lot higher... --NorwegianBlue 15:11, 1 June 2006 (UTC)[reply]
Why would you say that? I mean, it could be zero, but it could be the small numbers you'd get using the failure rates of certain contraceptives. Even with several instances of unprotected sex in a month, it will generally not go above 25-30%. moink 15:24, 1 June 2006 (UTC)[reply]
Agreed. You are right. --NorwegianBlue 16:35, 1 June 2006 (UTC)[reply]
Chi squared might be your answer to whether the data is normal or not. Basically this works by dividing up the domain in to a number of boxes, you then count how many of your data items fall into each box and compare with the number predicted from the normal distribution. Add up the square of differences and compare with the approptite Chi-squared statistic. This should give a confidence interval as to whether the difference is significant or not. I suspect with only twelve points you don't really have enough data to meaningfully talk about skew. --Salix alba (talk) 15:10, 1 June 2006 (UTC)[reply]
Sigh. Ok, so I'm transparent. My prior distribution in this context is from a record of instances of penetrative sexual intercourse along with the underlying numbers used by this site combined with a pdf of the date of ovulation using the pdf above and the possibly quite poor assumption of a constant luteal phase of 14 days. moink 15:14, 1 June 2006 (UTC)[reply]
If the goal is to get pregnant, I'd start off by measuring my body temperature, to get a more precise estimate of the time of ovulation. After one year with no success, I would definitely go see a gynecologist. If, on the other hand, the goal is not to get pregnant, and you want a statistical tool to tell you when to start worrying, I'm afraid your approach won't work. Biological distributions tend to have very heavy tails, and you simply do not have enough data to make a sensible estimate of the distribution. With a limited dataset, however, you could make control charts. Here's a link to a how-to (powerpoint), courtesy of the British NHS Modernisation Agency. --NorwegianBlue 17:18, 1 June 2006 (UTC)[reply]
When reading my previous comment: forgetting to mention this was maybe a male freudian slip, but anyway: if the goal is getting pregnant, it would be a good idea to have your partner checked as well. --NorwegianBlue 21:33, 1 June 2006 (UTC)[reply]
Well, the goal is complicated. It is one, the other, or both of the above, in addition to saving on costs of Human chorionic gonadotropin tests (which have high false negative rates, especially when used too early) by using them at the right time. For example, applying an additional Bayesian update rule, a negative test with a sensitivity of 25 mIU of hCG would reduce my probability by a factor of nearly three if I used it today, while it would reduce the probability by a factor of eight if used tomorrow. And buying a basal thermometer would negate those cost savings.  :) The other main goal is the fun of overanalyzing these things. :) moink 04:12, 2 June 2006 (UTC)[reply]
If you check out the presentation that I linked to, and are able not to get too irritated about the "for dummies" manner in which it is presented, you will see that this might be exactly the tool that you are looking for. It is a tool for decision-making, primarily in the manufacturing industry, but it is now mandatory also in blood banks throughout the EU. As you can see from this article, it has been around for a long time, and has stood the test of time. It is a curious mix of parametric and non-parametric statistics.
Your statement on the goal leaves me with the impression that timing is a rather critical issue. I would definitely invest in that thermometer! I wish you all the best, and hope that you achieve your goal and that it brings you happiness. Best regards, --NorwegianBlue 23:39, 2 June 2006 (UTC)[reply]

Linear regression problem (Statistics)[edit]

I have some data that I tried analysing using a linear model with R (programming language). The dependent variable is Y. There is an explanatory variable X1. The data consists of two populations, represented by another variable X2. I did a regression analysis Y~X1 in each subpopulation (X2=1 and X2=2), as well as a regression Y~X1 for the combined data set. My problem is that the regression lines, when plotted, are quite different from what I would have expected from the scatterplot. I've put the data, plots and R code on an external page, and would be extremely grateful for feedback on whether I am making a silly mistake in the regression analysis or plotting, or if I might need to consult an opthalmologist.

The first figure shows the results of the regression analysis for each subpopulation (red and blue), as well as the result for the whole data set (green). The second figure repeats the regression line for the combined data set (green), and shows where I had expected to find the regression line(orange). Thanks in advance for any feedback. --NorwegianBlue talk 13:46, 28 January 2007 (UTC)[reply]

1.
Different lines can be obtained by using different weights to off mean points eg |xn-xav| , (xn-xav)2 it depends on how you want to emphasise the shape of the scatter plot. You could do an analysis of the scatter plot to see if the spread of values seems to match any known distribution.87.102.33.144 14:23, 28 January 2007 (UTC)[reply]
2.
The slope of the line of the combined data set didn't match what you expected - here's what I would do.
There are n points. Take a pair of points and find the slope and second variable (constant) for this pair of points eg y=ax+c, y2-y1=(x2-x1)a. Repeat this for each combination of points - there are n(n-1)/2 pairs of points..
Now you should have n(n-1)/2 values for 'a' (the slope) and 'c' (the constant). You can analyse this data set for and average 'a' and 'c', and also find the variance of both.
This should give you not only average slope but a range (amount of accuracy) of values in which 'a' can be expected to lay based on your data set.
Question - does this range of values include your expected (orange) line.. Hope that helps. I'm suggesting you find the 'amount of error' in the line.87.102.33.144 14:40, 28 January 2007 (UTC)[reply]
Thanks! Re 1: I was thinking plain old least squares linear regression. My problem isn't that I had some preconceived expectations about the regression lines, but that they are less steep than what my eyes are telling me that they should be when I look at the scatter plot. I have plotted the residuals against quantiles of the normal distribution (qqnorm in R). Except for two outliers at the lower end, and four at the upper end, the residuals appeared to be reasonably normally distributed. And with approx. 1500 data points, I wouldn't expect 6 outliers to have a tremendous effect.
Re 2. That's interesting, I'll try to do what you suggest. Should the average slope and intercept using this procedure be identical to those obtained using least squares linear regression? --NorwegianBlue talk 15:11, 28 January 2007 (UTC)[reply]
As I understand it - not identical - method of least squares emphasises 'erratic' points more than a magnitude (absolute value) method. So there will be a minor difference (except of course when all the points are exactly on a line - not the case here).
I've just 'thought' of something else - is your method using - vertical offsets (my guess is it is - as it is simpler and more common) - if so you could try perpendicular offsets (much better) see http://mathworld.wolfram.com/LeastSquaresFitting.html (second set of diagrams down) - I think the picture speaks for itself. If your not using perpendicular offsets you really should try this - it does give better results (especially when the slope of the line is high) - you might find it gives a result closer to the one you expected - the eye is usually a good measurer..87.102.33.144 16:19, 28 January 2007 (UTC)[reply]
You also might want to look at http://www.tufts.edu/~gdallal/slr.htm which gives "confidence bands from the regression line " - I haven't checked the maths on this - what you need to do is find "the confidence bands from the regression line" - a google search might help or you could ask here - I'd be very suprised if your orange line didn't lie within the confidence bands.
The confidence bands effectively give a measure (statistical) of where the actual line can be expected to lie (within certain probabilities) - like a broad thick line that the actual line must lie within.87.102.33.144 16:19, 28 January 2007 (UTC)[reply]
I think you are right in that the lm() function of R minimizes the squared vertical offsets. Browsing the help files, I found that R has a procedure called line() which does robust line fitting. The line thus obtained was slightly "better" than the green line obtained using lm(), but still far from my orange line. --NorwegianBlue talk 18:48, 28 January 2007 (UTC)[reply]
Looking at your picture, the orange line is more or less the major axis of an ellipse containing most of the points of the joint scatter plot. In general, the line obtained by linear regression has a slope that is less steep. Let U and V be two independent normally distributed (μ = 0, σ = 1) random variables. If X = aU + b, Y = mX + c + dV, for sufficiently large samples the least squares linear regression fit will approximate the line y = mx + d. The lines of equal density in a scatter plot are then ellipses of the form u2 + v2 = r2, for various values of r, where u = (x−b)/a and v = (y-(mx+c))/d. Expressed in x-y coordinates: ((x−b)/a)2 + ((y-(mx+c))/d)2 = r2. In general, the slope of the major axis of these ellipses is not equal to m. As you change the scale of Y, the value of m changes accordingly. But the major/minor axes of an ellipse are not projective invariants. In particular, they are perpendicular, but linear scale transformations do not preserve angles and consequently do not preserve these axes.  --LambiamTalk 12:36, 29 January 2007 (UTC)[reply]
Thanks, Lambiam. It's correct that I expected the regression line to more or less follow the major axis of that ellipse, and I didn't know that a less steep line was to be expected. Can I infer from your comment, then, that there is nothing obviously wrong with the regression lines as plotted? --NorwegianBlue talk 18:58, 29 January 2007 (UTC)[reply]
You can infer from my comment that I see nothing obviously wrong with these lines. For a simple visual check, if you have a nicely ellipse-shaped cluster, take the vertical tangents at the left and right sides of the ellipse. The regression line should pass through the points where these lines touch the ellipse. For each point on the regression line, the vertical extents above and below should be (roughly) equal. As far as I can see without a visit to the ophthalmologist, your red, blue and green lines pass this visual check; the orange line does not.  --LambiamTalk 13:36, 30 January 2007 (UTC)[reply]
Thanks a lot! --NorwegianBlue talk 22:11, 30 January 2007 (UTC)[reply]

Arcsin sqrt(x) transformation[edit]

My officemate had a reviewer on a paper write "data should be transformed by arcsin sqrt(x)." I recall (I think) that this is a normalizing transformation for some kind of data set (binomial/n, I think), but neither she nor I can find a reference on it. Any ideas? --TeaDrinker 00:30, 20 January 2007 (UTC)[reply]

Google gave an immediate reference: http://www.tina-vision.net/tina-knoppix/tina-memo/2002-007.pdf
The point seems to be that if the data are binomially-distributed, the transformation gives a variance independent of the mean. I question, however, the arrogance of the reviewer in saying what should be done, without explanation or even considering that another approach could be valid.86.132.163.126 12:25, 27 January 2007 (UTC)[reply]

Statistical process control[edit]

In statistical process control using control charts, I have noticed that presenters often recommend calculating the standard deviation in a, so to speak, nonstandard way. The recommended procedure is to calculate a mean moving range, i.e. , using a relatively small dataset, and then divide the mean moving range by the magic number 1.128. If you google for "(1.128 and calculate)" and are feeling lucky today, you will find a such a presentation. The number 1.128 is often represented by the symbol d2. Does anybody know the maths behind this non-standard estimator of the standard deviation? --NorwegianBlue 19:00, 1 June 2006 (UTC)[reply]

If the SD (innate variability about the current mean) was estimated from all the values, there would be an over-estimate in the presence of a trend (shift of mean), whether upward, downward or cyclic. Using the difference between successive observations removes this effect. If the sum of the squares of these differences is used, it has to be divided by 2(n-1) and square-rooted to estimate SD. Your formula uses absolute differences, without a "2" in the denominator. But for a Normal distribution, Mean Absolute Deviation is SD*root(2/pi).
Combining these corrections, 1.128 is root (4/pi).
86.132.237.108 19:49, 2 September 2006 (UTC)[reply]

Vysochanskiï-Petunin inequality[edit]

I read the article on the Vysochanskiï-Petunin inequality, and am trying to understand it, and its implications on the interpretation of control charts using 3-sigma limits, i.e. lambda=3. The article states: "The sole restriction the distribution is that it be unimodal and have finite variance. (This implies that it is a continuous probability distribution except at the mode, which may have a non-zero probability.)" I am having problems with the above, because up until now I have always thought that a probability distribution is either continuous or discrete, but the statement in parenthesis seems to imply that it can be both (continuous for some arguments, discrete for others). To me, this raises several questions:

  • Is there such a thing as a unimodal discrete probability distribution? The statement cited above seems to indicate that the answer is "no", but oviously many discrete probability distributions have only one mode.
  • If the answer to the preceding question is "yes", does the Vysochanskiï-Petunin inequality apply to a unimodal discrete probability distribution?
  • What is the probability density of a "continuous" probability distribution at the mode, if the probability is non-zero? Undefined?
  • Could someone give an example of a reasonably occurring probability density with mode=0, nonzero probability at 0, and arguments that are positive or zero, which would either satisfy nor not satisfy the requirements of the inequality?

Thank you! --NorwegianBlue talk 12:24, 20 October 2007 (UTC)[reply]

Probability distributions can be quite general. Given your sample space, say and any σ-algebra on it, you can define any function from that σ-algebra to [0, 1] which satisfies a few axioms, and you get a probability distribution. Continuous and discrete are just two narrow, yet common, kinds. The distribution kind alluded to is not really one of them, but resembles one or the other in different regions.
  • No (unless there is just one point where the probability is positive), and this becomes clear if you take a look at the definition of a Unimodal function. The probability density for a discrete distribution is 0 for most real numbers. Hence, for any number where the probability is positive, you have a local maximum, and thus the distribution is not unimodal.
  • At the basic level it is undefined, but the Dirac delta function can be introduced to deal with it.
  • That "reasonably occurring" bit is tricky... Can't think of any right now.
-- Meni Rosenfeld (talk) 13:06, 20 October 2007 (UTC)[reply]
Thanks a lot, Meni, I think I understood. To verify, I'll try to give examples of "reasonably occuring" probability distributions with nonzero probability at 0, and which satisfy and don't satisfy the requirement. Am I correct in thinking that the distribution
p(x<0) = 0
p(x=0) = 0.3
p(x>0) = 0.7*exp(-x)
satisfies the requirement, whereas the distribution
p(x<0) = 0
p(x=0) = 0.2
p(x>0) = 0.7*exp(-x)+0.1*g(x)

where g(x) has the properties
g(x) = 0 when x<0
The integral of g from -inf to +inf is 1
g(5) > exp(-5)
does not? --NorwegianBlue talk 15:34, 20 October 2007 (UTC)[reply]
The idea for the first looks good, but the notation isn't. The distribution you mean can be described by:
As for the second, same problem with the notation, and I'm not sure we know enough about g to deduce anything. If, on the other hand, we know that , I think we can conclude the distribution isn't unimodal. -- Meni Rosenfeld (talk) 16:49, 20 October 2007 (UTC)[reply]
Thanks again. I'm sorry about the notation, I knew it was awful and should have apologized beforehand. I also realize that I failed to make a clear distinction between densities and probabilities, and that I failed to take into account the factors 0.7 and 0.1 that I was multiplying exp(-x) and g(x) with. By cutting and pasting from the wiki math notation in your reply, I'll have a second go at rephrasing what I was trying to express:
The first distribution was intended to be:
The second distribution was intended to be:
As you point out, the requirements to g should have been:
Sorry about persevering, I just want to make sure I understand:
  • Did I get the notation right?
  • If so, is my first distribution just a more awkward way of expressing what you did in two lines?
  • Does the first distribution satisfy the requirements of the Vysochanskiï-Petunin inequality?
  • Does the second distribution violate the requirements of the Vysochanskiï-Petunin inequality?
-NorwegianBlue talk 19:53, 20 October 2007 (UTC)[reply]
  • Much better. It seems a bit unconventional, but it gets the necessary information across.
  • Pretty much, yes. Basically, by calculating using my formula, you will get the same results as your formula (there are some technical details involved).
  • Yes.
  • I think not. The factor of 7 wasn't the only thing I have changed - I have given a condition on the derivative of g rather than g itself. To ensure being non-unimodal, there must be some point where the derivative of the density function (which is also the second derivative of the cumulative density distribution function) is positive. Ensuring that with a condition on g is trickier.
-- Meni Rosenfeld (talk) 22:23, 20 October 2007 (UTC)[reply]
Please don't use the absurd phrase "cumulative density function". The word "cumulative" contradicts the word "density". If you mean "cumulative distribution function, say that, and if you mean "probability density function", say that. Michael Hardy (talk) 20:03, 31 July 2008 (UTC)[reply]
I understand. Your replies have been very helpful. Thank you! --NorwegianBlue talk 11:16, 21 October 2007 (UTC)[reply]

Histograms and probability distributions (Statistics)[edit]

What are the correct terms to distinguish between what you get when you smoothen out a histogram of observed data, correcting scales such that the area under the curve is 1, and the underlying true probability distribution? --NorwegianBlue talk 09:24, 29 October 2007 (UTC)[reply]

By smoothing you are trying to estimate the probability density function of the population, much like the arithmetic average of a sample is used to estimate the population average. So you could call it the "estimated density (function)". A common class of methods for estimating the p.d.f. from the observed data that bypasses the histogram is described under Kernel density estimation.  --Lambiam 09:50, 29 October 2007 (UTC)[reply]
Thank you!
  • Can I extrapolate your answer into stating that the expression "observed probability distribution" (~550 google hits) should be avoided?
  • Is "estimated probability distribution" (~10,000 google hits) acceptable? --NorwegianBlue talk 10:14, 29 October 2007 (UTC)[reply]
P.S. I fully appreciate that "estimated density (function)" is better than "estimated probability distribution". The reason I'm asking is that I'm preparing a talk for an audience which is not very mathematically sophisticated, and I'd prefer to avoid talking about densities. At the same time, I need to be precise about the distinction between our observations, and the underlying processes that we imagine are responsible for generating the data. I'm thinking of drawing a cartoon of Plato's allegory of the cave, with the histogram on the cave wall, and the density function behind the viewer. --NorwegianBlue talk 10:54, 29 October 2007 (UTC)[reply]
Indeed, there is no such thing as "observed probability distribution". I have a coin — maybe fair, maybe not. I flip it once, and you record the outcome. Have you observed the probability distribution? Absurd, eh? Obviously not. I flip again. Now? No; but then when? Even a fair coin can produce ten heads in a row! From statistical observations we can only estimate properties of the generator. The more observations, the better the estimate. Plato's cave is a pretty good metaphor, but perhaps the coin drives the point home better. --KSmrqT 14:27, 29 October 2007 (UTC)[reply]
Thanks for the suggestion! Good point, I'll use it. --NorwegianBlue talk 16:51, 29 October 2007 (UTC)[reply]

Exponentially distributed data (Statistics)[edit]

When counting residual (unwanted) cells in blood components, I find that such counts tend to be best described by the exponential distribution. The illustration is a histogram of platelet counts in centrifuged plasma. The exponential distribution (red) gives a decent fit, while a normal distribution (blue) obviously is way off.

The article exponential distribution gives some examples of real-world scenarios which tend to be exponentially distributed, such as the time or distance between poisson-distributed events, but none of these really resemble my example, as far as I can see. I suspect that the principal source of variation in my data is the handling skills of the operator who carried the centrifuged blood bag from the centrifuge and placed it in the separator device (the device then applies gentle pressure to the bag, and lets the plasma escape through a tube on top of the bag).

I would like to understand why the exponential distribution gives such a good fit with the cell count data. Thank you! --NorwegianBlue talk 10:17, 29 October 2007 (UTC)[reply]

A possible explanation. Assume, that there are say a N platelet and each platelet has a small but finite chance, p, of escaping through the tube. You can use the binomial distribution to calculate the chance of n platelets escaping: NCn p^n (1-p)^(N-n). Now for very large n and very small p, the binomial is approximated by the Poisson distribution with mean p*n. This would also fix your data. Just a thought. --Salix alba (talk) 11:23, 29 October 2007 (UTC)[reply]
Platelet conc Occurences
0 1
1 103
2 415
3 212
4 112
5 106
6 64
7 72
8 50
9 36
10 42
11 30
12 28
13 29
14 15
15 18
16 12
17 11
18 10
19 5
20 6
21 5
22 2
23 3
24 3
25 3
26 2
27 5
28 1
29 2
30 2
35 2
37 1
43 1
52 1
63 1
I don't think that's quite it. As mentioned above, I think the key lies in what happens when the bag is carried from the centrifuge, and put in the separating device. After centrifugation, there is an interface (buffy coat) between the plasma and the red cells. Most of the platelets are there. Although the bags are handled very gently to preserve the buffy coat layer, occationally a bag might be shaken slightly, causing some of the platelets to mix with the plasma. This would be a rare event, and it would have varying strength, i.e. disturb the interface to a varying extent. Would such a scenario be expected to lead to an exponential distribution? --NorwegianBlue talk 12:11, 29 October 2007 (UTC)[reply]

The result of counting is a nonnegative integer. So you need a model providing nonnegative integers for its outcome. The normal distribution provides negative outcomes and it provides noninteger outcomes. The exponential distribution it provides noninteger outcomes. As Salix alba explained, the poisson distribution is a suitable model, providing nonnegative integer outcomes. Knowing the parameter λ the probability of observing the count i is e−λλi/i! . If you sum this expression over i, from zero to infinity, you get 1, to confirm that it is a discrete probability distribution function. If you multiply this expression by i and sum over i from zero to infinity you get λ, to confirm that λ is the mean value of i. Similarily you may compute the standard deviation of i as λ1/2. Knowing the parameter λ the poisson distribution model provides an estimate for the observation i~λ±λ1/2. However, your problem is the dual one, having made the observation i you need to estimate the parameter λ. Keeping i constant and letting λ be a nonnegative real variable, the above expression e−λλi/i! is a continuous distribution function called the gamma distribution. Integrating over λ gives 1, confirming that it is a probability distribution function. The special case i=0 is the exponential distribution function. The mean value of λ is i+1 and the standard deviation of λ is (i+1)1/2. So the gamma distribution model provides an estimate for the parameter λ ~(i+1)±(i+1)1/2. Bo Jacoby 12:43, 29 October 2007 (UTC).[reply]

I don't know why people insist on mentioning the Poisson distribution, which quite obviously does not fit the data. The exponential distribution has both a continuous and a discrete version, the latter being called the Geometric distribution. I suspect that the cell count, although discrete, strongly depends on the (continuous) time period between two events, which for some reason is distributed exponentially. I do not understand biology well enough to speculate as to why that is so, but I do suggest that time is the key here. -- Meni Rosenfeld (talk) 13:18, 29 October 2007 (UTC)[reply]

It is not obvious that the poisson distribution does not fit the data for some λ in the interval 0 < λ < 1. Bo Jacoby 14:39, 29 October 2007 (UTC).[reply]

Except for the fact that the expectation of the given distribution is roughly 6. -- Meni Rosenfeld (talk) 14:47, 29 October 2007 (UTC)[reply]

Thank you all for taking the time to respond to my question! I realise, of course, that just about any distribution representing measurements of physical quantities is in principle a discrete distribution. When weighing a sample of a pure substance, you are in essence counting the number of molecules in the sample. I should have pointed out, also, that the unit on the x-axis is platelets × 109/L, so the number of platelets counted in each measurement is much larger than the numbers on the axis suggest. I did suspect, beforehand, that the gamma distribution might be appropriate. This is because it is unreasonable to assume that the mode of the distribution is zero. When reexamining the data, I see that the mode is in fact 2. The data was recorded without any decimals, and is detailed in the table on the right. I read the section about parameter estimation in the article, but alas, this was way above my head.

If a gamma distribution is indeed appropriate, is there a not-too-difficult way of calculating the scale and rate parameters? Specifically, does anyone know whether R (programming language) is able to do this? To Meni, thanks for the suggestion about a time factor being responsible, I'll certainly look closer into that. --NorwegianBlue talk 15:00, 29 October 2007 (UTC)[reply]

This changes everything. Since the problem is essentially continuous, the Poisson distribution is completely irrlevant. The data is much too skew to be normal, and the exponential distribution seems less likely now that we see that the mode isn't 0 (note that my suggestion to look at time was based on the assumption that the distribution is exponential, though it could still be relevant). What you can do is to calculate the moments of the data (that is, the quantities derived from the moments - mean, variance, skewness and excess kurtosis), and compare them to those of different distributions (our articles have those). The first few can be used to find the parameters which give the best fit, and the rest can tell you how good the fit is. For the gamma distribution (note that Bo has suggested it not as the distribution of the data), for example, denoting the mean by and the variance by , you have and , and if, for your data, the skewness is and the kurtosis is , you have a good match.
Note that the data also seems to be quite noisy, so no common distribution will be a great match. -- Meni Rosenfeld (talk) 16:28, 29 October 2007 (UTC)[reply]
Unless I have made a mistake, for our data we have . If you try to fit this as a gamma distribution, you'll get , so it would be exponential; but the skewness would be only 2, so I'd say that's off the table. You can double-check my calculations, and try your luck with any of these distributions. -- Meni Rosenfeld (talk) 16:52, 29 October 2007 (UTC)[reply]
I think you do have a normal distribution in the chart. The graph doesn't show it because you lumped the 0-2 range together to get a value greater than that for the 3-4 range. However, if you plot each value separately, it looks like it will match a normal distribution fairly well. One non-math comment: someone should develop a machine to transport the bag of centrifuged blood so it won't be shaken up, or at least so the degree of shaking will be consistent. StuRat 17:23, 29 October 2007 (UTC)[reply]
Are you kidding? The normal distribution is symmetric, this one is anything but. It has a skewness of 3; It has a mode at 2 and mean at 5.5; it has a long tail on the right and a ridiculously short one on the left. Not to mention that the fitted normal in the picture doesn't even remotely resemble the distribution, even if we split the first bin. -- Meni Rosenfeld (talk) 18:02, 29 October 2007 (UTC)[reply]
It looks to me like it might well be truncated, but symmetrical (or at least somewhat close), if you toss out the single the value at 0. And no, the blue curve isn't correct, a different normal curve would be needed to fit the data. The maximum point might be around 2.3. StuRat 16:16, 30 October 2007 (UTC)[reply]

The poisson distributions constitute a one-parameter family of distributions having nonnegative unlimited integer outcome. That's why you would try it first. Calculation in J:

  concentrations  NB. the new table of data
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 35 37 43 52 63
  occurences
1 103 415 212 112 106 64 72 50 36 42 30 28 29 15 18 12 11 10 5 6 5 2 3 3 3 2 5 1 2 2 2 1 1 1 1
  ] N =: +/ occurences NB. compute and display the number of occurences
1411
  ] S =: +/ occurences * concentrations NB. sum of concentrations
7839
  ] L =: S % N NB. mean value of the concentrations
5.55563
  ] SS=:+/ occurences * *: concentrations NB. sum of squares
87477
  ] var=:(SS%N)-*:S%N NB. variance of concentrations
31.1235

As the variance is greater than the mean value, the distribution is not poisson. Sticking to discrete distributions, it might be a negative binomial distribution. (See also cumulant). Bo Jacoby 18:05, 29 October 2007 (UTC).[reply]

Seriously, don't you read what the OP is saying? The data is not discrete. It is a continuous variable, which is presented on an arbitrary scale and rounded. -- Meni Rosenfeld (talk) 18:19, 29 October 2007 (UTC)[reply]

I'm sorry that I when posting the question didn't state clearly that the concentrations were based on much larger counts than the numbers on the x-axis suggested, and that the concentration is for practical purposes a continuous variable, although it unfortunately was recorded without decimals.

To Sturat: The cellular content is not a great concern, because (a) the plasma is pooled and processed further industrially, and (b) units which obviously have been shaken are re-centrifuged before further processing.

To Meni: Thanks again. I did some further experimentation in R. When examining the histogram with bin sizes of 1, I get the impression that the distribution is bimodal may be a mixture of two distributions, with a second peak the second distribution peaking somewhere in the vicinity of 8. Something like 0.7*gamma_density(shape=2, scale=1)+0.3*normal_density(mu=8, sd=4) fits reasonably well. For practical purposes (making control charts), however, I think I'll treat the data as exponentially distributed, and use the transformation y=x0.2777, which according to textbooks on statistical process control should result in a Weibull random variable which is well approximated by the normal distribution. --NorwegianBlue talk 18:37, 29 October 2007 (UTC)[reply]

Within a single bin, the number should be an observation of a random variable with a Poisson distribution. That means that in bins 6, 7 and 8 we have an uncertainty of about ±8 in the frequency of occurrence. The local hump is slight in comparison and insufficient to suggest bimodality. The data roughly fits a log-normal distribution.  --Lambiam 21:12, 29 October 2007 (UTC)[reply]
However, I cannot get a good fit with a log-normal distribution. For the maximum likelihood fit, the expected # of occurrences for the bin c = 2 is only 218.8, differing wildly from the observed value of 415.  --Lambiam 19:34, 30 October 2007 (UTC)[reply]

Do all the bins have the same width, or has bin 0 half the width of the other bins because the variable is nonnegative? Bo Jacoby 23:01, 29 October 2007 (UTC)[reply]

Evaluating a definite integral (d2, SPC)[edit]

I'm curious about the function

,

where , i.e. the cumulative standard normal distribution, and n is an integer greater than one. I suspect that the usage of α without any indication that it is a function of x may be non-standard (at least, it had me confused). However, I wanted to reproduce it exactly as it appears in the paper in which I found it: Tippet, LHC. On the Extreme Individuals and the Range of Samples Taken From a Normal Population. Biometrika vol 17, 364-387, 1925. It is my understanding that this function gives the expected value of the range of a sample of size n, taken from a standard normal distribution. In the field of Statistical process control, the function w is known as d2.

These are my questions:

  • For n = 2, it is easy to show that the integral is numerically equal to 2/sqrt(π) within machine precision, and I feel reasonably certain that 2/sqrt(π) is indeed the exact value. I would like to know how one determines whether this is the case. As I am neither a statistician nor a mathematician, I would need the details spelled out.
  • Can this integral be expressed in terms of simple functions for values of n greater than 2? If so, how?
  • Is my suspicion avove, that the notation today would be considered non-standard, correct? If so, what would standard notation be?

Thanks. --NorwegianBlue talk 14:45, 9 February 2008 (UTC)[reply]

I haven't done much probability, so I'm not really sure what's going on there, but it does seem odd to me to be integrating the CDF - normally you integrate the density function to get the CDF, so I'm pretty much lost. As for the standard notation, wouldn't that just be:
?
It's not particularly unusual not to use a notation that doesn't explicitly state a dependence, although when you're only dealing with functions one variable, I can't see much point in not being explicit. --Tango (talk) 13:40, 10 February 2008 (UTC)[reply]
Thanks! Well, the function works, and there's no doubt that it's the CDF that is to be integrated. Below is a very crude implementation in R. The standard normal density in R is dnorm, while the distribution function is pnorm:
     d2 = function(n, distribution=pnorm, lolim=-12, hilim=12)
     {
         dx = 0.01
         steps = seq(lolim, hilim, by=dx)
         alpha=distribution(steps)
         rectangles = (1 - (1-alpha)^n - alpha^n)*dx
         return(sum(rectangles))
     }
And here's the output when testing it:
     > format(d2(2), digits=15)
     [1] "1.12837916709551"
     > format(2/sqrt(pi), digits=15)
     [1] "1.12837916709551"
--NorwegianBlue talk 14:08, 10 February 2008 (UTC)[reply]


A little more experimenting shows that d2(3)=3/sqrt(pi). However d2(4) = 2.058751, while 4/sqrt(pi) = 2.256758. Can anyone see a pattern? --NorwegianBlue talk 00:32, 11 February 2008 (UTC)[reply]

I can't offer much, but Mathematica confirms that and . It is also worth noting that . It doesn't readily succeed in calculating symbolically. The numerical value is 2.058750746..., for which Plouffe's inverter gives no match. So I doubt there is any significantly simpler way to express it. -- Meni Rosenfeld (talk) 09:05, 11 February 2008 (UTC)[reply]
Thanks a lot, Meni. --NorwegianBlue talk 13:02, 11 February 2008 (UTC)[reply]

Help needed in understanding 1925 Biometrika paper[edit]

I'm reading the paper Tippet, LHC. On the Extreme Individuals and the Range of Samples Taken From a Normal Population. Biometrika vol 17, 364-387, 1925., temporarily available here.

The paper defines the function

,

where initially was defined as , where , i.e. it is a cumulative distribution function, and n is an integer greater than one. The function returns the expected value of the range of a sample of size n, taken from the probability distribution defined by . If is the standard normal distribution, w is known as the control chart constant d2 in the field of statistical process control.

I posted a related question recently about this function. I have found that the function

is easily evaluated numerically, using a naive brute force approach such as the following C++ implementation:

double d2(int n)
{
    double x, dx;
    dx = 0.01;
    double sum = 0.0;
    for (x = -12; x <= 12; x = x + dx)
    {
        double alpha = cumulative_normal_distribution(x);
        sum = sum + (1 - pow(1-alpha,n) - pow(alpha,n))*dx;
    }
    return sum;
}

I know that there are far more efficient and elegant ways of evaluating integrals numerically. Nevertheless, this implementation works, and mirrors my simple-minded mental image of what is going on, so I'll stick with it for now. Tippett then defines as equation (10) the following:

and states (page 372) that

"The second moment, and hence the standard deviation of the distribution, was determined in several cases by equation (10). The work is very laborious, as it involves cubature, and even so, the result can only be given to a few figures. It is believed that the values given in Table IV are correct to the last figure."

The return values square root of should correspond to the control chart constant d3, and Table IV confirms that this is indeed the case. My problem is that I haven't been able to evaluate , and I suspect I've misunderstood something, possibly something pretty elementary about double integrals (I have no mathematical training beyond this, and it's a long time ago). Here's my implementation, again using a similar brute force approach. I've highlighted in red the part that I was most unsure of:

//NOTE: Comments in green added by OP after the question was posted, based on Meni Rosenfeld's answers.

double d3(int n)
{ 
    double x_1, x_n, dx_1, dx_n;
    dx_1 = dx_n = 0.01;
    double sum = 0.0;
    for (x_n = -12; x_n <= 12; x_n = x_n + dx_n)
    {
        double alpha_n = cumulative_normal_distribution(x_n); // WRONG - should be doing alpha_1 in the outer loop!
        // Making x_1 (-inf ... +inf) the outer loop, and x_n (-inf ... x_1) the inner loop fixes the problem.
        for (x_1 = -12; x_1 <= x_n; x_1 = x_1 + dx_1) // WRONG - see the code in red: x_1 should be the limit of integration. 
        {
            double alpha_1 = cumulative_normal_distribution(x_1); // WRONG - should do alpha_n in the inner loop!
            sum = sum + (1 - pow(alpha_1,n) - pow(1 - alpha_n,n) - pow(alpha_1 - alpha_n, n))*dx_1*dx_n;
                    // As Meni pointed out there's an error here ^ . The minus sign should be plus.
        }
    }
    double variance = 2*sum - pow(d2(n),2);
    return sqrt(variance);
}


The return values of this function are way off the correct values. My question is: can someone spot the misunderstanding(s) in my interpretation of the paper, the maths, and/or the error(s) in my implementation? Your help would be much appreciated. Thanks in advance, --NorwegianBlue talk 16:13, 23 February 2008 (UTC)[reply]

I don't see a problem with the way you are calculating the double integral. However, there is something terribly wrong about the function you are trying to integrate. To explain, I'll introduce some notation:
In order for any of this to be meaningful, b must exist for any . For this to happen, at the very least we must have for every . This is clearly not the case, because and so . Thus the function is not integrable. Check that you have copied the equations exactly as they appear in the paper, and that the notation means exactly what you think it does; If so, there is possibly a mistake in the paper. -- Meni Rosenfeld (talk) 12:10, 24 February 2008 (UTC)[reply]
Hmm. Well, I've checked the equations, and believe my reproduction is correct. I also very much doubt that there is a mistake in the paper, as it was widely cited, and equation (10) was used by subsequent authors to tabulate d3. Therefore, the most likely explanation is that I've misunderstood the notation. I prepared a small .PNG file from the paper with the relevant equations and a figure which explains the notation. x1 refers to the highest value in a sample of n values, and xn refers to the lowest, as is shown in the figure. The .PNG is here, the paper is here. Thanks again! --NorwegianBlue talk 13:20, 24 February 2008 (UTC)[reply]
Note the step from equation (9) to (10), where turns into . The plus is correct, and everything falls into place nicely if you make this correction. So it turns out even widely cited papers can have errors. Also, the table you linked to has some errors, for example . -- Meni Rosenfeld (talk) 14:36, 24 February 2008 (UTC)[reply]
Thanks a lot! The error in the step between equations (9) and (10) was glaringly obvious, once I became aware of it. I've looked through the steps from equation (4) to (9) for more errors, but unfortunately, most of it exceeds my mathematical capabilities. I'm also confused about the notation. What does the S-notation mean? Is appears in places which might suggest summation, is it the same as sigma notation?
Unfortunately, everything did not fall nicely into place. I corrected the minus sign to plus, but the function still does not work correctly. The entire program I've used is here. The section of the paper immediately preceding equation (10) might suggest that equation (10) only works for even values of x. Here's what I found, with the correction applied:
            n   program   correct
            ---------------------
            2    23.94    0.8525
            3     8.65    0.8884
            4    23.90    0.8794
            5    10.01    0.8641
            6    23.86    0.8480
            7    10.70    0.8332
            8    23.82    0.8198
            9    11.15    0.8078
Even values slowly decreasing from 23.84, odd values increasing from 8.65. Strange. A standard deviation of >23 for the range of samples of n=2 drawn from a standard normal distribution is obviously not right! Any suggestions of what else might be wrong? --NorwegianBlue talk 16:58, 24 February 2008 (UTC)[reply]
Well, there aren't any problems with the mathematics. I have calculated the sum in Mathematica and it gives the correct result. A step of 0.01 is enough to get a few correct digits, and an integration bound of 12 is overkill - 5 will be more than enough. So you need to debug your code - it looks generally okay but I don't speak C++ fluently. Try to define intermediate functions as I have done above, calculate some values of them, and either check them for plausibility or put them here so I can have a look.
The S notation is unfamiliar to me - it is probably, indeed, an archaic version of the Sigma notation. -- Meni Rosenfeld (talk) 17:40, 24 February 2008 (UTC)[reply]
Or start throwing cout to test intermediate values of variables in between every single line. x42bn6 Talk Mess 19:49, 24 February 2008 (UTC)[reply]
Thanks a million, Meni!!! I implemented your a, b, c and d3 functions exactly as you wrote them, and the result was reasonably close to the correct one (overestimated it slightly). It turned out that the step size was important. The smaller I made it, the closer I got to the values listed as correct above. I was able to achieve 3 digits using a step size of 1/2048.
So I went back to the original function, and compared the code, to see why it behaved differently. The problem was related to what I had highlighted in red. When I made x_1 the outer loop, and x_n the inner loop, and changed the test in the inner loop to x_n <= x_1, the modified original function gave the same results as the a, b, c and d3 functions. Thanks again, this was beginning to drive me nuts! By the way, did Mathematica give results with high accuracy? I'm particularly interested in a good approximation of d3(2). --NorwegianBlue talk 20:23, 24 February 2008 (UTC)[reply]
Glad to be of assistance, and sorry for not picking up earlier that it should be . Mathematica can give arbitrarily good accuracy, but since this is a double integral it can take some time. should be correct up to the digits specified. I'm running it now for more digits in case you'll need it. A few additional values (n=3 and up) are 0.888368, 0.879808, 0.864082, 0.84804, 0.833205, 0.819831, 0.807834, 0.797051. -- Meni Rosenfeld (talk) 22:04, 24 February 2008 (UTC)[reply]
Thank you again for taking the time. You have helped me tremendously, both in increasing my understanding of the maths underlying SPC, and in solving a tough practical problem. --NorwegianBlue talk 15:52, 25 February 2008 (UTC)[reply]
Gah! I looked at this yesterday, but posted nothing because I didn't see that ordering problem. Shame on the original paper (not the OP!) for writing a double integral as ! Treating a differential (infinitesimal) as a first-class entity is fine, but writing the normal definite integral operation without nesting it properly leads to madness, since the doesn't get a variable name attached to it. --Tardis (talk) 16:39, 25 February 2008 (UTC)[reply]
As stated above, my mathematical training is limited, so I need to be spoonfed :-). You are saying that my approach, i.e. making the inner integral the inner loop, would normally be the way to go, right?
Would
be a more standard way of writing the integral? I was a bit confused about this, and that's why I flagged the x_1<=x_n test (which should have been x_n<=x_1) in the code. --NorwegianBlue talk 18:54, 25 February 2008 (UTC)[reply]
Yes. We are essentially calculating here the integral of a function which is itself an integral. The inner function is , and thus it should appear as inner in the formula and be implemented as inner. We then calculate the integral of this, , thus the formula should be . This would be even more critical if we wanted to calculate, say, - we need to know which is the variable that only goes up to 0, and flipping the order changes this. In our integral, we can understand from the context that relates to (it wouldn't make much sense for to have itself for an upper integration bound). In the implementation, it doesn't really matter in which order you make the loops, since you can change the order of integration as long as you keep the bounds right. The bit clearly implies that the bound is , and it's unfortunate it took us so long to notice that. -- Meni Rosenfeld (talk) 23:14, 25 February 2008 (UTC)[reply]
Thanks again, and thanks to Tardis for making me aware of exactly why I tripped. I've learned a lot from this thread! --NorwegianBlue talk 10:05, 26 February 2008 (UTC)[reply]

Hat problem[edit]

I came across the monthy hall problem some time ago, and I'm curious about how it applies to more complex situations. Anybody have a clue? Bastard Soap (talk) 21:42, 20 May 2008 (UTC)[reply]

What kind of more complex situations did you have in mind? There are all kinds of situations in probability were intuition turns out to be wildly incorrect. Take a look at Prosecutor's fallacy for one example. Also, if you haven't seen it already, we have an article on the Monty Hall problem. --Tango (talk) 21:55, 20 May 2008 (UTC)[reply]
I’ve seen this problem which is somewhat similar cause much debate (and I remember it because I was convinced I had the right answer for several hours until I figured out otherwise).
The situation is that there are 3 people, and they will enter the same room and receive a hat (either red or blue, each with a 50% chance). They cannot communicate whatsoever once in the room, but they can collaborate and determine a strategy before entering the room.
Then at the exact same time each of them is to guess what color their hat is (or they may choose not to guess at all). If at least one person guesses correctly and nobody guesses wrong, they win a prize.
The question is, given an optimal strategy, what is the probability that they will get the prize? GromXXVII (talk) 22:36, 20 May 2008 (UTC)[reply]
Give me a minute... --Tango (talk) 22:44, 20 May 2008 (UTC)[reply]
Nope, not happening. At least one person has to guess, and they have a 50% chance of getting it wrong and blowing everything, so there is no way to improve on 50%. However, if the answer was 50%, you wouldn't have asked the question... Damn you... Let me sleep on it. --Tango (talk) 22:50, 20 May 2008 (UTC)[reply]
75%. Turning my computer off and going to sleep: Take 2. --Tango (talk) 23:02, 20 May 2008 (UTC)[reply]
I had first confused this with another problem where after receiving the hats, the players are asked one by one if they know their own hat's color, and after at most 5 (IIRC) queries, someone will know. BTW, Tango's going to sleep quip is actually quite a cute coincidence, as a New York Times <SPOILER ALERT>article </SPOILER ALERT> covering the problem mentions someone going to bed after solving it and recognizing its relevance to coding theory as he fell asleep. And yes, the answer is 75% Baccyak4H (Yak!) 04:11, 21 May 2008 (UTC)[reply]
I'm having a hard time understanding the problem completely. Each one gets a hat (at the same time?), but can't see what color their own hat is? Can they see the color of each of the others' hats? --Prestidigitator (talk) 04:55, 21 May 2008 (UTC)[reply]
You have to assume that they can in order to give them a better than 50% chance of getting it right. And it took me a while to work out what the strategy is but I knew it had something to do with what to guess given what you see. (Here's a hint: no matter what the arrangement of hats, there will always be at least two people with the same colour. If you're a prisoner who can see two hats the same colour, or two hats of different colour, see if that affects your optimum strategy for guessing your own colour.) Confusing Manifestation(Say hi!) 06:42, 21 May 2008 (UTC)[reply]
Interesting how the game show contestants became prisoners there. --tcsetattr (talk / contribs) 07:10, 21 May 2008 (UTC)[reply]
Am I missing something? If all are given with 50% probability then whatever colour hats the other people have won't affect the probability of guessing your own colour. The obvious answer seems to be that they should decide that just one should guess having a 50% chance. -- Q Chris (talk) 07:36, 21 May 2008 (UTC)[reply]
Yeah, you're missing it. The external link given by Baccyak4H contains a pretty thorough spoiler which should have you slapping your forehead and saying "Of course!" --tcsetattr (talk / contribs) 07:58, 21 May 2008 (UTC)[reply]
You’re right that each person that guesses has a 50% chance of guessing right, or guessing wrong regardless of any other factors. But the problem has three people, and whether one person guesses right or wrong is not independent of whether somebody else does (in an optimal strategy, which too my knowledge results in 75%). GromXXVII (talk) 10:52, 21 May 2008 (UTC)[reply]
The key point to note is that the method which the hats were assigned was made explicit; there is a prior distribution on what the hats can be. While when reading the problem the description seems trivial, it turns out to be crucial. If no information about that is known then of course there is no improvement over 50%. So look at this as a conditional probability problem, and consider Thomas Bayes your friend... Baccyak4H (Yak!) 14:28, 21 May 2008 (UTC)[reply]
Or just right out all the possible combinations - worked for me! --Tango (talk) 15:17, 21 May 2008 (UTC)[reply]
Right, but that still presupposes the prior. <SPOILER ALERT>What if the prior was probability 1 for all hats being red? What is the chance that the strategy will work in that case? </SPOILER ALERT> Was your sleeping comment intentional, or just a coincidence, with respect to the article I linked to above? Baccyak4H (Yak!) 16:37, 21 May 2008 (UTC)[reply]
Well, yes, but we were given the prior. I wasn't suggesting you were wrong, just that there's a less technical approach. Pure coincidence - I read the problem just as I was about to turn the computer off and go to sleep, attempted it, gave up, turned the computer off, immeadiately realised the answer and turned the computer back on again! (I'm not entirely sure why... bragging rights, I guess.) --Tango (talk) 18:30, 21 May 2008 (UTC)[reply]

So it's not really correct that events which aren't linked don't "influence" each other? You can get the flow of the event by looking at other manifestations of it? Bastard Soap (talk) 08:40, 21 May 2008 (UTC)[reply]

Having read the spoiler I don't think that's right. The events don't influence eachother, the strategy just means that when a wrong guess is made it will be by all three people at the same time, whereas a correct guess will be made by just one. Since each geuess has a 50/50 chance three out of four times a correct guess will be made, one out of four three incorrect guesses will be made. -- Q Chris (talk) 10:54, 21 May 2008 (UTC)[reply]

Incidentally, is there a simple proof that I'm missing that the 75% strategy is optimal? I'm pretty sure it is, but I can't see how to prove it. --Tango (talk) 11:46, 21 May 2008 (UTC)[reply]

I saw more or less the same problem a while ago. Here's one form: you have (say) 1023 (this is a hint) people, each provided with a random hat as before. This time everyone must vote, and the group wins if the majority vote correctly. A variant allows weighted voting, so you can (say) cast 100 votes that you have a red hat. I was reminded of these because the forced voting clause makes it easy to prove the optimal strategies are indeed optimal, which I can't quite see how to do for the problem here. Algebraist 12:53, 21 May 2008 (UTC)[reply]
This might work...although I haven’t tried to actually prove the optimality before
Consider that regardless of the strategy everyone will guess wrong 50% of the time. If two people guess wrong together in a case, that provides correctness of at most 67%. If all three people guess wrong together in a case, then this provides correctness of at most 75%. This exhausts all cases. It also assumes that one can partition the sample space though, which isn’t always possible, but nonetheless should provide the sufficient upper bound. (for instance, I don’t think it’s possible to have a strategy which wins two thirds of the time because 8 isn’t divisible by 3). GromXXVII (talk) 16:42, 21 May 2008 (UTC)[reply]
It's not possible to have a deterministic strategy with 2/3 chance of winning, if you allow non-deterministic strategies, it should be doable. --Tango (talk) 18:25, 21 May 2008 (UTC)[reply]

Stats question/puzzles.[edit]

these are not homework, though probably that's exactly what they look like. The give away is if I was a stats student I should have a clue how to approach these kind of problems. I've been coming up with these stats questions but don't have the math to solve them. Most start with: say you have a 100 sided die. Here's a couple: On average how many unique numbers will NOT have appeared after 100 rolls? On average, how many rolls would it take to see each number at least once. And lastly, on average, after 100 rolls, what number roll will produce the last unique number, not a repeat. Can someone come up with a mathematical solution or is that pretty much only solvable if you modelled it with a random number generator? I reckon the first two are probably solvable. Not sure about number three. Vespine (talk) 09:06, 27 February 2009 (UTC)[reply]

The second problem is the Coupon collector's problem. McKay (talk) 09:41, 27 February 2009 (UTC)[reply]

The first problem can be solved exactly by the following iterative algorithm: After one throw, the probability is 1 that exactly one number has appeared. After two throws, the probability is 99/100 that exactly two numbers have appeared, and 1/100 that exactly one number has appeared. For each succeeding throw, the probability of having thrown only one unique value will be the value of the preceding iteration multiplied by 1/100. The probability of K unique values (where 100 >= K > 1), will be p(K) from previous iteration multiplied by K/100 (i.e. the probability of again throwing one of the K values), plus p(K-1) from the previous iteration multiplied by (100-(K-1))/100, i.e. the probability of having thrown one less than K in the previous iteration, and now throwing a number that is not among the K-1 that have previously appeared.

For 1,2,3 and 4 throws, we get these probabilities:

n (number of throws) p(u=1,n=row) p(u=2,n=row) p(u=3,n=row) p(u=4,n=row)
1 1 0 0 0
2 1/100 99/100 0 0
3 p(u=1,n=2)*1/100 p(u=2,n=2)*2/100+p(u=1,n=2)*99/100 p(u=2,n=2)*98/100 0
4 p(u=1,n=3)*1/100 p(u=2,n=3)*2/100+p(u=1,n=3)*99/100 p(u=3,n=3)*3/100+p(u=2,n=3)*98/100 p(u=3,n=3)*97/100

Expand the table until row 100 is reached, and use the probabilities to calculate the expected value. The C++ code in the collapsed box below does this, and finds that the expected value is 36.6032. I'm reasonably sure this is correct, as I've also run simulations which show that the value is close to 36.6. I'm sure the algorithm could be expressed a lot more succinctly in mathematical notation, but I don't have the skills necessary.

--NorwegianBlue talk 14:49, 1 March 2009 (UTC)[reply]

Likelihoods and conditional probabilities, in the context of sensitivity and specificity.[edit]

I would appreciate some help in getting the concepts of likelihood and conditional probability straight. I'm posting here and not on the maths desk, because the question relates to fairly elementary staticstics applied to science, and because I think I'll have a greater chance of understanding the answers here. I shall begin with presenting what I think I know about the matter first, and then point out what appears to me to be inconsistant usage.

Here's my understanding of a likelihood: A likelihood is the probability of obtaining the data that actually resulted from an experiment, given that some hypothetical statistical model were true. The concept is often used when the parameters of the model can be varied, creating a likelihood function that dependes on the parameters, which can be used for obtaining maxiumum likelihood estimates of the parameters. Thus, a likelihood is a special case of a conditional probability, which matches the pattern "probability of observed data given hypothetical model".

Here's my understanding of sensitivity: It is the conditional probability that a person will test positively, given that he has the condition that is tested for. I would not call this a likelihood, since it does not match the pattern "probability of observed data given hypothetical model".

Here is my understanding of specificity: Specificity is the conditional probability that a person will test negatively, given that he does not have the condition that is tested for. Again, I would not call this a likelihood, since it does not match the pattern "probability of observed data given hypothetical model".

After this long introduction, here is my question. Why do we use the term likelihood rato about the ratio between sensitivity and (1 - specificity) in diagnostic testing? I've argued above that sensitivity and specificity should not be called likelihoods.

  • Are my definitions above too restrictive or otherwise wrong?
  • Is the nomenclature itself sloppy?
  • Or is there some other explanation?

Thanks in advance, --NorwegianBlue talk 22:14, 6 November 2011 (UTC)[reply]

I'm not at all qualified but until someone else turns up... It sounds to me like the word likelihood is being used with several slightly different meanings. The common meaning of "how likely an event is" can be used to describe sensitivity (what's the likelihood of a true positive) and specificity (what's the likelihood of a true negative), but it also has specific meaning, as you point out, in terms such as "likelihood ratios" (what's the likelihood someone who tested positive really is positive). Those two are obviously closely related. Our article on Sensitivity and specificity doesn't actually use the word "likelihood" to describe those terms, or even the word "likely", so it's possible that for strict usage those words are avoided to avoid ambiguity.. It does use the word "unlikely" however, so I don't think it's "incorrect". Vespine (talk) 01:01, 7 November 2011 (UTC)[reply]
Per our article Likelihood ratios in diagnostic testing, where D+ means you have the disease and D- means you don't:
so
giving a ratio
This is referred to as the (positive) likelihood ratio, because in statistics, is known as the likelihood of a positive condition (D+) given a positive test result (T+). Similarly is known as the likelihood of the negative condition (D-) given a positive test result (T+)
is therefore known as the statistical likelihood ratio given a positive test result.
The likelihood ratio is useful, because it can be used to express a very neat form of Bayes's rule expressed using the odds for D+:
Posterior odds = Prior odds × Likelihood ratio
Or, expanding that out a bit,
The prior odds ratio is the odds of having the disease based just on its prevalence in the relevant population, before any test is made. The likelihood ratio gives the factor that this is multiplied by to give the final odds ratio that takes into account both the background prevalence and the results of the test. Jheald (talk) 17:13, 7 November 2011 (UTC)[reply]
I should probably add some of the above into our Likelihood ratios in diagnostic testing article. Jheald (talk) 17:27, 7 November 2011 (UTC)[reply]
Thanks! But why is the word "likelihood" used for the conditional probability P(T+|D+)? The term likelihood is usually reserved for the probability of existing data (events that already have occured), given some model. See the definition at Mathworld - The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes. I don't see how P(T+|D+) fits that definition, and would have been happier with "Probability ratio". --NorwegianBlue talk 21:25, 7 November 2011 (UTC)[reply]
Then I would repeat what I said and say that the definition in mathworld is a very specific and narrow "statistical" use of the word. Other common definitions I can find, don't seem to have this temporal stipulation. For example, out of those sources valid uses of the word are "what is the likelihood it will rain today?" and "what is the likelihood a candidate will be elected?" Vespine (talk) 23:12, 7 November 2011 (UTC)[reply]
OK. Let me try to answer this with a very rough-and-ready potted history, though my answer may not be entirely NPOV. (You might get some alternate views from the Statistics wikiproject, or if you asked at Reference desk/Maths).
The "classical" view of probability of the 19th century, largely taking its cue from the works of Laplace, used Bayes' theorem to evaluate questions of so-called "inverse probability", ie
where θ is some unknown quantity of interest, which we are trying to infer on the basis of D that is some data which has been observed.
One question such a formulation leads to is how to assign the prior distribution P(θ|I). Laplace's recommendation, in the absence of any other information, was to assign to each possible case an equal possible chance -- i.e. a flat prior, .
However as the 19th century wore on, this recipe came increasingly under pressure, due in part to the highlighting of things like Bertrand's paradox -- a flat prior for P(θ|I) implies a non-flat prior for any nonlinear function f of θ. But what was the justification for giving θ instead of f(θ) the flat prior?
This led to people like John Venn proposing the frequency interpretation of probability -- essentially denying there could be any meaning to probability apart from in the case of the ratios of outcomes of long-run series of repeated trials. In particular, the very idea of probabilities of parameters like θ was rejected, because how could you talk about something that had an actual real fixed value in terms of probability?
But this left the question of how to do inference problems -- a gap that was filled by R.A. Fisher in 1922, who suggested ignoring the P(θ|I) term entirely, and just considering the forward probability P(D|θ), as a function of θ. This could be well defined, and co-incided with the most-used results from the classical theory. In the limit of an infinite amount of data would end up sharply peaked at the true unknown value of θ. If the data was not quite infinite, it should still be quite a strong indication of the true value of θ. Now this couldn't be called a "probability", since on the frequency interpretation things like θ couldn't have a probability, so instead he called the function a "likelihood".
Thus it seemed this new more rigorous, more sophisticated interpretation (in truth: a real mind-fuck) had banished the problems of the classical interpretation, while putting on a new rigorous basis its most important results. Thus was born so-called "orthodox statistics", and with the rise of university statistics departments by the 1940s its dominance had become pretty much total, part of the package if you wanted to be a member of the club. A few held out, such as the geophysicist Harold Jeffreys or the economist Keynes, but these were very much onlookers from well outside the tent.
However nothing remains the same for long, and by the late 1940s and 1950s the frequentist orthodoxy had started to be challenged by a small number of discontents who became known as Bayesians, who wanted to see inference done "right" -- i.e. in a way compatible with Bayes' Theorem. Their influence has steadily grown, particularly on the machine learning side, helped by the very close links between Bayes' Theorem and information theory, and by growing computer power and new algorithms in the 1990s which made it possible to realistically estimate full Bayesian posterior probability distributions for really very complicated models. Bayesian methods have also come very much to the fore in fields like medical statistics, where prior probabilities (such as in the form of background disease prevalences) simply have to be taken into account. Nevertheless, most standard university statistics books and courses still tend in the first instance to be primarily focussed around frequentist ways of thinking and the frequentist approach. Jheald (talk) 02:01, 8 November 2011 (UTC)[reply]
two different meanings, same term, as the article says at the top -not to be confused with.... can't be more longwinded, my ipad sucks for editing wikipedia It's been emotional (talk) 01:56, 8 November 2011 (UTC)[reply]
Thanks everyone! Special thanks to User:Jheald, for your thorough replies to my questions. --NorwegianBlue talk 08:07, 9 November 2011 (UTC)[reply]

Notation, conditional probability[edit]

A quick question: Do the notations and mean exactly the same thing? Thanks! --91.186.78.4 (talk) 09:46, 10 April 2012 (UTC)[reply]

Trying it out with Venn diagrams gives me the feeling I got it exactly wrong: P(A|B,C)= P(A|B U C)? Which version, if any, is the correct one? --91.186.78.4 (talk) 12:22, 10 April 2012 (UTC)[reply]
I think the problem is that the latter notations are unambiguous, but the former IS ambiguous. My instinct (based on reading many probability texts), is to interpret the commma as an "and", reading it as "probability of A, given B and C". So I would opt for considering . On the other hand, one could also reasonably use the comma to mean "or", which would give you the second option. We could probably give you better answers if you give us some examples of where you've seen the comma notation. SemanticMantis (talk) 15:02, 10 April 2012 (UTC)[reply]
The comma always means "and", as does . Duoduoduo (talk) 15:53, 10 April 2012 (UTC)[reply]
I agree, in this context a comma is understood to mean "and". So in principle, the answer to the original question is "yes".
However, I think the former notation is a bit more general. I think would be used only if B and C are events. A comma can also be used when B and C are variables, and then you're referring to the conditional distribution (a function of 3 variables) with a more condensed and fancy notation. -- Meni Rosenfeld (talk) 17:11, 10 April 2012 (UTC)[reply]
Good point, thanks. SemanticMantis (talk) 22:09, 10 April 2012 (UTC)[reply]
Thanks very much, everyone, for your answers to my question! --NorwegianBlue talk (Original Poster) 17:44, 11 April 2012 (UTC)[reply]

Bayes' theorem in medical diagnostics[edit]

I'm trying to get a grasp on how Bayes' theorem can be applied to updating prior probabilities in medical diagnostics. In this context, Bayes' theorem could be written:

My understanding is that is the diagnostic sensitivity of the test, and is the doctor's prior probability of the patient having the disease in question. I'm having problems, however, in interpreting . Which probability should be entered here? The probability that any random person in the community that the doctor serves has a positive test, whether they have the disease or not? This doesn't sound right, as the doctor must have taken other information, such as the symptoms, age and sex of the patient into consideration, when assigning a prior probability. So, what probability, exactly, should be entered here? I am aware of the article Likelihood ratios in diagnostic testing, this question specifically concerns the use of Bayes' theorem. Thanks! --95.34.141.48 (talk) 20:17, 8 April 2012 (UTC)[reply]

is the probability that someone in a particular group has the disease. is the probability that someone in that same group (regardless of their disease status) tests positive. Both would be based on observed frequencies within that group. What group? It could be all the patients tested during the development phase of the test (good because it gives a large sample, making for a more accurate probability), or it could be all the patients that the doctor has ever given the test to (good if test results vary by geography etc, but bad if it results in poor accuracy of probabilities due to small sample size), or it could be all the patients of particular age, sex, and symptoms that this doctor, or doctors in general, have ever given the test to. But everything in the formula has to refer to the same one of those groups. Choosing a group depends on how good the measured probabilities are as estimates of the true underlying probability, which depends on sample size, and on how much difference it is thought to make if you limit yourself to sub-groups with particular characteristics. Duoduoduo (talk) 20:54, 8 April 2012 (UTC)[reply]
Thanks! That was really helpful. Would the following statement be both a correct interpretation of your answer, and correct (albeit pedantical) mathematical notation?
Where is the information that the patient belongs to a subgroup within the doctor's practice, that the doctor would test for the disease, for whatever reason. If the doctor is not doing a clinical study, the doctor's suspicion of the patient having the disease, or the patient's fear of having the disease, could be such reasons. --95.34.141.48 (talk) 05:52, 9 April 2012 (UTC)[reply]
Yes, that's right, everything conditioned on X. The problem with choosing X to be all people who this doctor has ever suspected of having the disease is that you probably don't have good data on the percentage of them who actually had the disease -- maybe all you have is the percentage who tested positive. Also you probably don't have since the doctor probably has not tested everyone in X for the disease by all possible means. So it seems to me that one has to go with data from clinical studies. Duoduoduo (talk) 15:10, 9 April 2012 (UTC)[reply]
Thanks again! I know of course the necessity for data from clinical studies, but I am pursuing my reasoning in order to understand exactly what is going on. Therefore: would it help if I defined as the subgroup of patients in the doctor's practice who ever contacted the doctor with a problem that led the doctor to perform the test? If the doctor had a suitable database of all his patients, lab-tests, and results, could be calculated directly from the data, by dividing the number of positive tests by the total number of tests performed (granted, you would have to figure out how to handle repeated testing of the same patient). When clinical studies calculate diagnostic sensitivities, and these are used for calculation of likelihood ratios for updating prior probabilities etc., it appears to me that there is an implicit assumption being made:
,
If we then interpret as the doctor's pre-test probability of the patient having the disease, and as the doctor's post-test probability of the patient having the disease, we end up with
I understand, of course that these are approximations, but are there any really serious flaws in this reasoning? --95.34.141.48 (talk) 19:20, 9 April 2012 (UTC)[reply]
I'm not sure why you think that those approximations are implicitly being used. Suppose there's a test for heart disease. Presumably in the clinical trials if they report then the sensitivity was measured separately for males and is not being approximated by the unconditional or by a sensitivity conditioned on some other category Y. And on the other hand if all they report is a sensitivity conditioned on no special group, then what X means in is simply the group "all people". In any event, your last equation is right provided all things including Sensitivity are conditioned on the same X. If the doctor had sufficient data about his own patients, then this X could be his own patients, or it could be his own male patients, etc. Duoduoduo (talk) 20:00, 9 April 2012 (UTC)[reply]
Thank you! Your responses have been very clarifying. Much appreciated! --95.34.141.48 (talk) 20:40, 9 April 2012 (UTC)[reply]

You can simply expand this probability:

Icek (talk) 15:18, 9 April 2012 (UTC)[reply]

Thanks for your reply! See also Duoduoduo's replies above. It turns out that the term is particularly problematic, because it depends strongly on the prevalence of the disease in the doctor's practice, and also strongly on the doctor's tendency to have (unnecessary) tests being performed, "to be on the safe side". It turns out that this problemtatic term cancels out when Bayes' theorem is transformed into its Odds version, Bayes' rule. I was confused because (1) I hadn't read that article, and (2) the articles Likelihood ratios in diagnostic testing and Pre- and posttest probabilities treat likelihood ratios only in the context of dichotomous variables. See my question about this on the maths desk[2], and Meni Rosenfeld's reply that an approach using likelihood ratios with continuous variables is perfectly valid. So I'll experiment further with that approach. I'll be back with more questions if I run into further hurdles! --95.34.141.48 (talk) 16:03, 10 April 2012 (UTC)[reply]

Can likelihood ratios used for estimation of post-test probability be continuous variables, or do they have to be dichotomous?[edit]

The article Pre- and post-test probability states in a table in the section Estimation of post-test probability, that using likelihood ratios for estimation of post-test probability has the disadvantage of requiring a dichotomous variable. Likewise, the article Likelihood ratios in diagnostic testing distinguishes between an LR+ and an LR-, and makes no mention of varying the cutoff. Is it really correct that the use of likelihood ratios in diagnostic testing requires that the variable be dichotomous? Take a hypothetical example where you know the distributions of a continuous test variable within the populations with and without disease. Can we not then calculate

.

directly from the distributions?

I did some experimentation in R (programming language), using a hypothetical test with a normal distribution N(μ=70, sd=10) in the healthy population, and N(μ=95, sd=15) in the diseased population. Here is the function I used:

LR = function(arg)
{
  eps=0.01

  p_healthy_1 = pnorm(arg-eps, mean=70,sd=10)
  p_healthy_2 = pnorm(arg+eps, mean=70,sd=10)
  p_disease_1 = pnorm(arg-eps, mean=95, sd=15)
  p_disease_2 = pnorm(arg+eps, mean=95, sd=15)
  return ((p_disease_2-p_disease_1)/(p_healthy_2-p_healthy_1))
}

The function behaves more or less as I would have expected. Its value is 1.0 (i.e. indifference) at arg=82.35 (which is identical to the cutoff point I found by maximising the number of correct decisions). In the range 83 to 133, it rises steeply from 1.13 to 3359.45. When moving towards the origin from 82 to 50, it falls more gently from 0.94080371 to 0.05472334. 0.05472334, corresponding to an argument of 50, is a minimum. Then, when moving from 50 to 10, the function rises again, from 0.05472334 to 4.65981528, crossing LR=1 at arg=17.65869. I assume that the reason it rises at very low arguments, is that the higher SD of the distribution in the diseased population becomes more important than the difference in means between the populations. I'm unable to figure out why it happens exactly at that point, though.

Anyway, my question is: are continuous likelihood ratios obtained using a function such as the one I've described here valid for deriving a post-test probability from a pre-test probability, for example by using Fagan's nomogram? --95.34.141.48 (talk) 22:54, 9 April 2012 (UTC)[reply]

Yes, this is perfectly valid, I don't know why the article would say it's not. Bayes's theorem and rule work also for continuous distributions. That is, for probabilities you have
And more generally
Where P is the value of the pdf (if T is continuous) or pmf (if T is discrete) at the particular instantiation. That this follows from the probability version can be seen by considering the event .
The calculation using the graphs in the article or the Fagan monogram is then just a rearrangement of the terms. -- Meni Rosenfeld (talk) 08:38, 10 April 2012 (UTC)[reply]
Thank you! I spent quite some time figuring out that this had to be so, and I'm grateful for your confirmation. I suspected it would work for densities too, but messed it up somehow when I tried it yesterday. Anyway, the density version in R which I now tested:
LR2 = function(arg)
{
  d_healthy = dnorm(arg, mean=70,sd=10)
  d_disease = dnorm(arg, mean=95, sd=15)
  return (d_disease/d_healthy)
}
works perfectly! After some more searching, this paper turned up, which explains the different usages of likelihood ratios quite nicely, and suggests calculating them by using a function which pretty much matches my first version! --95.34.141.48 (talk) 21:48, 10 April 2012 (UTC)[reply]

ROC curves[edit]

I would appreciate your feedback about whether my understanding of ROC-curves is correct. I also have a couple of questions at the end. As this post clearly shows, my mathematical capabilities are limited, so please be gentle, and cautious about introducing terminology that I might have trouble understanding. If necessary, please translate my statements into more conventional terminology.

In my understanding, a ROC curve is a plot of true positive rates (TPR, Y-axis) vs false positive rates (FPR, X-axis), when the cutoff (CO) between what is considered a positive and a negative observation is varied such that it covers all reasonable values. The tangent at a given point (CO), can be estimated as

ΔTPR(CO)/ΔFPR(CO),

hence TPR(CO)/FPR(CO) is the derived function of the ROC curve, and the ROC-curve the antiderivative of the function TPR(CO)/FPR(CO).

In the following, I'll use the ROC curve in a medical context, and let 'm' represent the observed value of a diagnostic test, for which we have a ROC curve availabe. Then

TPR(CO)/FPR(CO) = p(m = CO ± eps|Disease)/p(m = CO ± eps|No disease) = the likelihood ratio function.

Q1: Am I right in thinking that the probability ratio in the previous line should be the probablity of 'm' being close to CO (in my notation ± eps), and not greater than CO?

Q2: I've read a couple of places (such as here: Choi BC (1998) Am J Epidemiol 148:1127–32. PMID 9850136) that it is valid to draw lines between several points on the ROC curve, say corresponding to "negative", "weak positive" and "strong positive", and that the slope of each line is a valid estimate of the likelihood ratio for test results that fall within the corresponding interval. Sounds reasonable, but exactly why is that so?

Q3: I've read many places (including in our article) that the area under the curve corresponds to the probability of a randomly chosen diseased individual getting a higher test result than a randomly chosen individual without the disease. Again, this sounds reasonable, but exactly why is it so?

Thanks, NorwegianBlue talk 14:32, 30 August 2015 (UTC)[reply]

Metaquestion: Why did my question receive no answers?[edit]

Please see this question that I asked a couple of weeks ago. It received no answers. I know that the reference desk contributors are eager to help, and will answer questions that are answerable. Therefore, I suspect that there is a problem with the way the question was asked, that made it either difficult to understand, or difficult to answer. There is also the possibility that it was ignored because it was perceived as lazy, but my impression is that laziness on the part of the questioner tends to be commented. My question now is about the question itself. I am not requesting answers to the original question. I'm pretty confident that I have figured out the answers to Q1 and Q2 myself, some doubt about Q3 remains. I may re-ask the question if I learn how to ask it in a way that is more likely to attract answers. --NorwegianBlue talk 22:39, 18 September 2015 (UTC)[reply]

Hi,I remember your question, and I have some interest in your topic. I thought I knew a bit of the concepts, and I even looked in to it for a few minutes. But I quickly realized that it was fairly technical and specialized question, that general refs wouldn't help you much, and there wasn't much I could to in <10 min to help you. It may have helped if you included more refs and links in your question, so that people who might be able to help but aren't already as familiar as you with the topic could get up to speed quickly. There's a lot of "luck of the draw" in how many people see a question, what they know, how much time they have to spare, how much interest, etc. But there's also a sense I got of this being an area where there might be considerable variation in how terms are used, rigor of proof, ways of justifying conclusions, and all sorts of gritty details of research that just take a lot of time and attention to sort through. You may have better luck at math.stackexchange, quora, or other similar (sub)sites that can sometimes deal a little better with this kind of thing. SemanticMantis (talk) 01:42, 19 September 2015 (UTC)[reply]
My comments are similar to those above, but I would add that, as a visual thinker, it would have helped me if some graphics had been included. Otherwise it looks like a well posed Q. StuRat (talk) 01:45, 19 September 2015 (UTC)[reply]
You might have better luck with stats.stackexchange.com for something like this; add the roc tag to grab the attention of people who specialize in that area. In general, at least as far as I've seen, Stack Exchanges are more geared to professionals helping other professionals, while the WP reference desks are more for general questions. --RDBury (talk) 01:52, 19 September 2015 (UTC)[reply]
Thanks, everyone! --NorwegianBlue talk 14:50, 20 September 2015 (UTC)[reply]


@NorwegianBlue:, late answer, but anyway...
I've taken a look at it (don't visit RD/MA often ), the question was a bit confusing:
  • The tangent at a given point (CO), can be estimated as ΔTPR(CO)/ΔFPR(CO),
hence TPR(CO)/FPR(CO) is the derived function of the ROC curve
Disclaimer: not familiar with the topic, just my interpretation, which could be way off..
Is it about the continuous probability functions? These are turned into binary tests by integrating them (page 2260), and the ROC curve plots those values. (See note 1)
The tangent at a given point equals the ratio of the density functions (those in fig 1 in the link), and is also (by definition) ΔTPR(CO)/ΔFPR(CO), but not TPR(CO)/FPR(CO). (See note 2)
Each point on the curve gives you the fraction of the sick population and the fraction of the healthy population that would test positive at that CO. So if (TPR=0.9, FPR=0.1) is on the curve, 90% of all sick and 10% of healthy would test positive.
Context would make it easier to understand, preferably a link to the specific material and formulas you're asking about, not many people with a medical background here and most maths techniques have numerous applications, so people won't know the conventions/notations, etc. Most sources I found were like this .
TPR(CO)/FPR(CO) = p(m = CO ± eps|Disease)/p(m = CO ± eps|No disease)
Q1: Am I right in thinking that the probability ratio in the previous line should be the probability of 'm' being close to CO (in my notation ± eps), and not greater than CO?

Haven't seen that formula in links I checked. Seems a casual way of saying that both become equal when eps goes to zero. Seems valid to me, no reason to limit me to one side of CO, it's not like this has influence on anything, or I'm missing context. (See note 3)

Q2: it is valid to draw lines between several points on the ROC curve, say corresponding to "negative", "weak positive" and "strong positive", and that the slope of each line is a valid estimate of the likelihood ratio for test results that fall within the corresponding interval.
It's the definition of ROC curve: Y-coordinate = TPR(CO) x-coordinate=FPR(CO), the slope of a line from origin (0,0) to a point is by definition y/x. For a line between two points on the curve you calculate the slope by subtracting the coordinates. For example: Two points TPR=0.5 FPR=0.3 and TPR=0.8 FPR=0.7, segment between: TPR=0.8-0.5=0.3 and FPR=0.7-0.3=0.4; between those points, the positive likelihood ratio is 0.3/0.4=0.75, so more False positives than true positives. (See note 4)
Q3: I've read many places (including in our article) that the area under the curve corresponds to the probability of a randomly chosen diseased individual getting a higher test result than a randomly chosen individual without the disease. Again, this sounds reasonable, but exactly why is it so?
The X and Y coordinates of a ROC are proportional to the number of healthy and infected people. so an increase of 0.10 in either direction represents 10 % of the related group. Movement at 45° represents the same % of people in both groups.
Suppose the curve starts by going up to 0.4, that means 40% of the infected (lets assume low values indicate infection) will test lower than all the rest. Then 0.2 to the right; 0.3 up; 0.7 right, 0.2 up, finally along the diagonal.
Calculate the odds of infected people scoring better (lower), than healthy: 40% score better than all 0.4*1; 30% score better than 80%; 0.3*0.8; 20% better than 10%: 0.2*0.1 and the last 10% have a chance of one in two to do better than the other 10%, so: 0.1*0.1*0.5. In total: 0.665.
The other group: 0.2 *0.6; 0.7*0.3 and 0.1*0.1*0.5: total: 0.335. You can work it out on paper, when you change the curve, the change in area will match the change in odds. Here, the area is 66.5%. (See note 5)
Oeps, I now see that the link you provided had not just the abstract, but the whole article... (9850136?) Ssscienccce (talk) 02:30, 21 September 2015 (UTC)[reply]

Thank you, @Ssscienccce:, those were very helpful answers and links. I took the liberty of inserting yellow labels in your answer, in order to reference your replies more easily. Yes, I see that the question was confusing, reflecting my own confusion.

  • Note 1: I was thinking about a smoothened ROC curve, and see that that would imply needing continuous probability distributions/densities.
  • Note 2: Duh! Confused, wrong thinking on my part here. Of course it isn't. I'm forgetting the Quotient rule, and being unclear about exactly what variable I'm differentiating with respect to.
  • Note 3: Q1 is very confused. I think I'd rather strike it than try to reformulate it.
  • Note 4: Thanks! I understand it now.
  • Note 5: Thanks! I'm still struggling a bit with this one, but it's getting late. I'll re-read your reply carefully in the morning, and I'm optimistic that I'll understand it then.

Your reference to Johnson 2005, PMID 15236429 was very helpful indeed! --NorwegianBlue talk 20:38, 21 September 2015 (UTC)[reply]

Wasn't very happy with that last answer myself. This may be a better way:
For every FPR value there exists a corresponding Cutoff value. Hundred FPR values from 0.01 to 1 (with 0.01 increments), have 100 corresponding cutoff values, CO0.01, CO0.02... CO1.00
This way you could divide the healthy people in 100 groups, based on their test result. Each group would contain 1% of the total healthy population. (follows from the definition of FPR).
The same for the diseased people. They too are split in 100 groups, corresponding to the TPR values 0.01, 0.02, 0.03 and so on... (note: the hundred cutoff values differ from those for the healthy group).
Randomly choose one healthy person. He would belong to one of the hundred groups, and since the groups are of equal size, they are all equally likely to be chosen. The same is true for a randomly selected diseased person.
As an example, we'll say that the healthy person is in healthy group 20 (FPR 0.20 to 0.21) and the other one in diseased group 40 (TPR 0.4 to 0.41). those values define a small square of 0.01*0.01 size on the ROC diagram.
If you draw vertical lines at FPR=0.2 and 0.21, they will cross the ROC curve at two points, (the Cutoff values that defined the group he belongs to) and the persons test value must be between those two.
Do the same for the horizontal lines at TPR =0.4 and 0.41, they cross the ROC curve at 2 points, the test value of the person is between the two.
If the small square where the horizontal and vertical lines cross is below the curve, the vertical lines will cross the curve further from the origin than the horizontal lines (on fig 1b at page 2261, the TPR=0.4 line crosses at about 0.1 FPR, FPR=0.2 line crosses at about TPR=0.57). Values further on the curve are lower, since more people test positive the further you go, so the healthy person has a lower test result than the sick person. If the point (little square) is above the curve, the reverse is true: the vertical FPR line intersects the curve closer to the origin, so the healthy person has a higher test value.
The number of 0.01*0.01 squares below the curve are proportional to the area below the curve. If the area is 90% then 9000 squares are below, 1000 above the curve. Picking one healthy and one sick person at random has a 90% probability of corresponding to a square below the curve, which corresponds to the healthy person having a lower test value than the diseased one. Ssscienccce (talk) 02:55, 22 September 2015 (UTC)[reply]
Thanks, @Ssscienccce: I followed your argument carefully, with the help of a spreadsheet, pen and paper. I understand why the number of healthy people in each of the 100 boxes of healthy people is the same, and ditto for the number of diseased people in each of the 100 boxes of diseased people. I drew a diagram (only 10 categories of each), which turned out to look a lot like the example on page 2261 in the reference. Since any intersection point with the ROC curve, whether by a vertical or horizontal line, corresponds to a cutoff value (or in this argument, measurement), a point above the curve for a random pair of healthy and diseased individuals, implies that the healthy person (vertical line that has crossed the curve) has a measurement that is higher (closer to the origin) than that of the diseased person (horizontal line that will intersect the curve at a point further from the origin, i.e. has a measurement that is lower). I think I got it! If there were any errors in my restatement of your explanation, please warn me about it! Thanks a million. --NorwegianBlue talk 21:44, 22 September 2015 (UTC)[reply]