Talk:Mean absolute difference

I think the formula is wrong - it should be 1/(n*(n-1)), not 1/n^2 ....

PBoyd (talk) 13:40, 21 April 2008 (UTC) I agree, I think that that would exclude the deviations of the observations from themselves (which, I hope, are always zero).

Upon further reflection ... the number of pairs, and hence the number of differences, should be n(n-1)/2. Each paired difference should be calculated only once (and an observation should not be differenced from itself, i.e. |x-x|). However, if each difference is added twice (|x-y| and |y-x|), then the denominator is n(n-1) [=2*(n(n-1)/2)]. Since |x-x|=0, it doesn't matter whether or not you include them in the numerator. I will make the change. PBoyd (talk) 12:55, 2 May 2008 (UTC)

Needs a gentle introduction

I came here looking to refresh my memory (from a statistics course 30 years ago), but did not find this article to be helpful. The article almost immediately dives into mathematical formulas. Statistics is perceived as important to a larger group than the mathematically-inclined, so this article needs to target an audience of diverse mathematical backgrounds (just as statistics classes vary widely, from descriptive to rigorous).

The article "Variance" was very helpful. It begins with a "Background" section that uses the standard rolling dice example, to first provide an intuitive understanding, before diving into formulas. That allows those seeking a general understanding to get what they need then exit when the going gets cryptic.

Obviously, I don't have the expertise to create an introduction. But I can help with editing. --Zahzuhzaz (talk) 07:47, 23 September 2010 (UTC)

It is 1/n^2 not 1/n(n-1). I just read the formula in Handbook of statistical analysis. It makes sense, since it sums n sums of n elements, hence n*n=n^2. — Preceding unsigned comment added by 74.58.59.35 (talk) 16:01, 28 July 2012 (UTC)

I disagree with the notion that it is definitively 1/(n^2) or 2/(n(n-1)). It depends on whether you are comparing the differences between randomly selected points with or without replacement. For example, if you were computing the average difference between the values of 2 consecutive dice rolls, you’d want to use 1/(n^2), since every value is available again on each roll. In the case where you’re calculating the average age difference between coworkers at a company, I’d use 2/(n(n-1)), since I would not want to randomly select the same coworker twice, but would rather pick 2 different random coworkers and compare their difference. Null Simplex (talk) 20:29, 8 February 2022 (UTC)
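To make the two conventions in this thread concrete, here is a minimal sketch in Python. The function names and the sample data (a fair six-sided die and three made-up coworker ages) are illustrative assumptions, not taken from the article or from any cited source. The first function sums |x-y| over all n^2 ordered pairs, including x paired with itself; the second sums each unordered pair once and scales by 2/(n(n-1)).

```python
from itertools import combinations


def mean_abs_diff_with_replacement(values):
    """Average |x - y| over all n*n ordered pairs (self-pairs included)."""
    n = len(values)
    total = sum(abs(x - y) for x in values for y in values)
    return total / (n * n)


def mean_abs_diff_without_replacement(values):
    """Average |x - y| over pairs of distinct items: 2/(n(n-1)) times the
    sum over unordered pairs."""
    n = len(values)
    total = sum(abs(x - y) for x, y in combinations(values, 2))
    return 2 * total / (n * (n - 1))


# Illustrative data only (assumed for this sketch).
die_faces = [1, 2, 3, 4, 5, 6]
ages = [25, 30, 41]  # hypothetical coworker ages

print(mean_abs_diff_with_replacement(die_faces))
print(mean_abs_diff_without_replacement(die_faces))
print(mean_abs_diff_without_replacement(ages))
```

For the die, the ordered-pair sum of |x-y| is 70, so the two conventions give 70/36 and 70/30 respectively. The numerators agree because the self-pairs contribute zero; only the denominator differs.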

Alternative formula for the continuous case

I believe an alternative form of the equation should be given for when p(x) is a continuous PDF, in addition to the one that’s already up there. Say our distribution ranges from A to B and we select two random points x and y from it. Without loss of generality, let x < y. The probability density of choosing x on the first random selection and y on the second, assuming independent trials, is p(x)*p(y). The probability density of choosing y first and x second is p(y)*p(x). So the probability density that our first 2 random selections will be x and y, in either order, is 2*p(x)*p(y). It’s doubled to account both for when the first selection is smaller than the second and for when it is larger. Say the difference between x and y is delta. Then y can be rewritten as x + delta, and the integrand p(x)*p(y)*|x-y|, taken over both orderings, becomes 2*p(x)*p(x + delta)*delta. The issue now is that we must rewrite the limits of integration.

First I want to integrate over all the occurrences where x and y are delta units apart. To do this, integrate over x for A < x < B - delta. If x is larger than B - delta, then y is larger than B, which isn’t possible. Integrating 2*p(x)*p(x + delta) over this range (that is, leaving out the factor delta) gives a new PDF with a support of [0, B - A], which tells you the probability density that two randomly selected x and y will be delta units apart. You could use this PDF to do all the usual statistics stuff such as median, mode, standard deviation, etc., but to get the mean absolute difference, simply integrate delta times this new PDF over 0 < delta < B - A. These two integrals together cover all possible pairs of points x and y in the support of p(x).

The key advantage I see in the way I’ve written the equation is that it removes the need for absolute values. Similar to standard deviation, this allows us to do calculus on the formula. I’ve even used this formula to solve for the mean absolute difference of various PDFs using online integral calculators. It turns out the probability density of picking 2 points x and y that are a distance delta apart from an exponential distribution p(x) is itself p(delta)! This makes sense if you think about the intuition behind the time between trials of exponentially distributed events. -Null Simplex 2600:1700:C821:39D0:1193:CC1B:F868:FAAA (talk) 20:11, 8 February 2022 (UTC)
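The exponential claim above can be checked by hand. The following is a sketch of that calculation, assuming the density p(x) = \lambda e^{-\lambda x} on [0, \infty) with rate \lambda > 0; the symbols \lambda and g are introduced here for illustration and do not appear in the original comment.

```latex
g(\delta) = \int_{0}^{\infty} 2\,\lambda e^{-\lambda x}\,\lambda e^{-\lambda (x+\delta)}\,dx
          = 2\lambda^{2} e^{-\lambda\delta} \int_{0}^{\infty} e^{-2\lambda x}\,dx
          = 2\lambda^{2} e^{-\lambda\delta} \cdot \frac{1}{2\lambda}
          = \lambda e^{-\lambda\delta},
\qquad
\mathrm{MD} = \int_{0}^{\infty} \delta\, g(\delta)\,d\delta = \frac{1}{\lambda}.
```

Under these assumptions the density of the gap delta is again exponential with the same rate, and the mean absolute difference equals the mean of the distribution.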

Adding to my original comment. The calculation of the formula I’m proposing would read “The integral over delta from 0 to B - A of the integral over x from A to B - delta of 2*p(x)*p(x + delta)*delta dx d(delta)”. Null Simplex (talk) 22:14, 8 February 2022 (UTC)
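As a rough numerical check of the double integral described above, here is a minimal Python sketch using a plain midpoint rule. The function name, the step count, the two test densities, and the truncation of the exponential's support at 40 are all assumptions made for this illustration; none of them come from the article.

```python
import math


def mean_abs_diff_delta_form(pdf, a, b, steps=400):
    """Midpoint-rule estimate of
    int_{0}^{b-a} int_{a}^{b-delta} 2*p(x)*p(x+delta)*delta dx d(delta)."""
    total = 0.0
    d_step = (b - a) / steps
    for i in range(steps):
        delta = (i + 0.5) * d_step           # midpoint of the outer cell
        x_step = ((b - delta) - a) / steps   # inner range is [a, b - delta]
        inner = 0.0
        for j in range(steps):
            x = a + (j + 0.5) * x_step       # midpoint of the inner cell
            inner += pdf(x) * pdf(x + delta)
        total += 2.0 * inner * x_step * delta * d_step
    return total


def uniform_pdf(x):
    """Uniform density on [0, 1] (test case assumed for this sketch)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def exponential_pdf(x):
    """Exponential density with rate 1 (test case assumed for this sketch)."""
    return math.exp(-x) if x >= 0.0 else 0.0


md_uniform = mean_abs_diff_delta_form(uniform_pdf, 0.0, 1.0)
# The exponential has unbounded support; truncating at 40 is an assumption
# that discards only a negligible tail.
md_exponential = mean_abs_diff_delta_form(exponential_pdf, 0.0, 40.0)

print(md_uniform, md_exponential)
```

The known closed-form values are 1/3 for the uniform distribution on [0, 1] and 1 for the rate-1 exponential, so the two estimates should land close to those, with the exponential one somewhat coarser because of the wider grid cells.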

A small tweak expresses the relation to the Lorenz curve:

\mathrm {MD} =\int _{0}^{\infty }\int _{-\infty }^{\infty }2\,f(x)\,f(x+\delta )\,\delta \,dx\,d\delta .

\mathrm {MD} =2\int _{-\infty }^{\infty }f(x)\left[\int _{0}^{\infty }f(x+\delta )\,\delta \,d\delta \right]dx.

Where the expression in brackets is the cumulative function. I'm finding it difficult to get good sources. NadVolum (talk) 01:05, 13 August 2022 (UTC)