How can we assess the discrepancy between the results of n tests, classified into nc
classes, and a given distribution law? One idea is to consider:
$$\chi^2 = \sum_{i=1}^{nc} \frac{(O_i - E_i)^2}{E_i}$$
(O_i and E_i are the observed and expected frequencies)
Consider, for example, 50 throws of a die: I got 9 ones, 11 twos, 5 threes, 8 fours,
10 fives and 7 sixes. To evaluate the discrepancy from the uniform distribution
(corresponding to a balanced die), we calculate the χ² of these results.
n = 50, nc = 6
O: 9, 11, 5, 8, 10, 7
E: 50/6, 50/6, 50/6, 50/6, 50/6, 50/6
chi2 = 2.8
In the program, instead of 50/6, 50/6, ... I can enter 1, 1, ... or 2, 2, ...: only the
proportions among the expected frequencies matter.
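The calculation above can be sketched in a few lines of Python. This is only an illustration, not the program mentioned in the text: the function name `chi_square` is hypothetical, and the rescaling of the expected values so that they sum to the number of tests is an assumption suggested by the remark that 1, 1, ... can replace 50/6, 50/6, ...

```python
def chi_square(observed, expected_weights):
    """Sum of (O - E)^2 / E over the classes.

    expected_weights only need to be proportional to the expected
    frequencies: they are rescaled here so that they sum to n
    (an assumption, not necessarily what the original program does).
    """
    n = sum(observed)
    total_weight = sum(expected_weights)
    chi2 = 0.0
    for o, w in zip(observed, expected_weights):
        e = n * w / total_weight  # expected frequency for this class
        chi2 += (o - e) ** 2 / e
    return chi2


observed = [9, 11, 5, 8, 10, 7]                      # 50 throws of the die
chi2_die = chi_square(observed, [1, 1, 1, 1, 1, 1])  # uniform law
print(round(chi2_die, 3))                            # prints 2.8
```

Passing `[50/6] * 6` or `[2] * 6` as weights gives the same value, since only the proportions are used.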
To judge the fairness of the die we must compare this value with the theoretical
distribution of χ² corresponding to a balanced die. It is more convenient, however,
to use a procedure that does not depend on the particular phenomenon considered: if
the number of tests is large (at least a few dozen) and each class contains enough
values (say, at least 5), every theoretical distribution of χ² is well approximated
by a distribution χ²(r) that depends only on the degrees of freedom r, that is, the
number of experimental frequencies that I have to know directly.
In the previous case the degrees of freedom are 5: if I know 5 relative frequencies,
I can find the 6th as the difference between 1 and the sum of them.
We use the table on the program page.
percentile:   5      10     25     50     75     90     95
d.f. = 5      1.15   1.61   2.67   4.35   6.63   9.24   11.1
I see that 2.8 is a fairly central value (it lies just above the 1st quartile, 2.67).
So I can consider plausible (i.e. not reject) the hypothesis that the die is balanced.
If instead of 2.8 I had got 13, I would have grave doubts that the die is balanced:
the 95th percentile is 11.1.
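The table reading just described can be mimicked with a small lookup. This is a sketch only: the helper name `bracket` is hypothetical, and the row of values is the one for 5 degrees of freedom quoted in the text.

```python
PERCENTILES = [5, 10, 25, 50, 75, 90, 95]
CHI2_DF5 = [1.15, 1.61, 2.67, 4.35, 6.63, 9.24, 11.1]  # row for 5 d.f.


def bracket(value, row, percentiles=PERCENTILES):
    """Return the pair of tabulated percentiles that enclose value.

    None on either side means that value falls off that end of the table.
    """
    lower = None
    for p, x in zip(percentiles, row):
        if value < x:
            return (lower, p)
        lower = p
    return (lower, None)


print(bracket(2.8, CHI2_DF5))  # (25, 50): between 1st quartile and median
print(bracket(13, CHI2_DF5))   # (95, None): beyond the 95th percentile
```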
Another example.
A friend tells me: "This coin is fair. In fact, in 1000 throws I got 499 heads and
501 tails." What can I conclude about the plausibility of what my friend said?
The expected frequencies are 500 and 500, so χ² = (499-500)²/500 + (501-500)²/500 = 0.004.
How many degrees of freedom are there in this case? Here too there is only one
constraint: the total number of throws, which fixes the sum of the frequencies. The
degrees of freedom are 2-1 = 1.
From the table I see that 0.004 corresponds approximately to the 5th percentile.
This is therefore a rather abnormal value: an agreement this close would occur by
chance only about 5 times in 100. It is reasonable to suspect that the friend has
told us a lie.
percentile:   5         10       25      50      75     90     95
d.f. = 1      0.00393   0.0158   0.102   0.455   1.32   2.71   3.84
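The coin example can be checked numerically. The sketch below relies on a fact that is not in the text: with 1 degree of freedom, χ² is the square of a standard normal variable, so P(χ² ≤ x) = erf(√(x/2)). The variable names are illustrative.

```python
import math

observed = [499, 501]
expected = [500, 500]  # fair coin, 1000 throws

chi2_coin = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 degree of freedom, P(chi2 <= x) = erf(sqrt(x / 2)).
p_below = math.erf(math.sqrt(chi2_coin / 2))

print(round(chi2_coin, 6))  # 0.004
print(round(p_below, 3))    # close to 0.05, i.e. near the 5th percentile
```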
Another example.
A survey of the weight of males between 45 and 55 years of age in a certain nation.
The sample consists of 825 males, in 11 classes: 54-60, 61-67, ..., 124-130, i.e.
[53.5, 60.5), [60.5, 67.5), ..., [123.5, 130.5).
nc = 11
O: 4, 39, 122, 197, 192, 126, 93, 33, 16, 2, 1
Is it plausible that the distribution is Gaussian, with the same mean and standard
deviation as the sample?
With pocket calculator-2 we find:
57*4, 64*39, 71*122, 78*197, 85*192, 92*126, 99*93, 106*33, 113*16, 120*2, 127*1
n = 825   mean = 84.296   experimental standard dev. = 11.512
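These values can be reproduced from the class midpoints and frequencies. The sketch below is not the calculator named above; it assumes that "experimental standard deviation" means the one with divisor n-1, which is the choice that matches 11.512.

```python
import math

midpoints = [57, 64, 71, 78, 85, 92, 99, 106, 113, 120, 127]
counts = [4, 39, 122, 197, 192, 126, 93, 33, 16, 2, 1]

n = sum(counts)
mean = sum(x * f for x, f in zip(midpoints, counts)) / n

# Sum of squared deviations, each midpoint weighted by its frequency.
ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, counts))
sd = math.sqrt(ss / (n - 1))  # divisor n-1 (assumption, see above)

print(n, round(mean, 3), round(sd, 3))  # 825 84.296 11.512
```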
The graphs seem to agree, but if we look closely at the shape some doubts arise, and
the data are numerous. Let's check!
With integral of Gauss distribution we find:
0.015629121359 if a=53.5 b=60.5, m=84.296 sigma=11.512
...
0.0003003212279 if a=123.5 b=130.5, m=84.296 sigma=11.512
Rounding, for the 11 classes I get:
0.01563, 0.05292, 0.12512, 0.20665, 0.23846, 0.19227, 0.1083, 0.04262, 0.01171, 0.002245, 0.000300
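The class probabilities can be recomputed with the error function from the standard library. This is a sketch, not the integration tool used in the text; the helper `normal_cdf` is hypothetical, and the class boundaries are assumed to be 7 units apart, as the midpoints 57, 64, ..., 127 imply.

```python
import math


def normal_cdf(x, m, sigma):
    """P(X <= x) for a Gaussian with mean m and standard deviation sigma."""
    return 0.5 * (1 + math.erf((x - m) / (sigma * math.sqrt(2))))


m, sigma = 84.296, 11.512
edges = [53.5 + 7 * k for k in range(12)]  # 53.5, 60.5, ..., 130.5

# Probability of each class = integral of the density between its edges.
probs = [normal_cdf(b, m, sigma) - normal_cdf(a, m, sigma)
         for a, b in zip(edges, edges[1:])]

for a, b, p in zip(edges, edges[1:], probs):
    print(f"[{a}, {b}): {p:.5f}")
```

The probabilities do not add up to exactly 1, because the tails below 53.5 and above 130.5 are left out.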
The constraints imposed are 3: the total number of data, the mean and the
standard deviation; the degrees of freedom are 11-3 = 8.
percentile:   5      10     25     50     75     90     95
d.f. = 8      2.73   3.49   5.07   7.34   10.2   13.4   15.5
The program gives χ² = 27.6, and 27.6 > 15.5 (the 95th percentile):
clearly we must discard the hypothesis that the distribution is Gaussian!
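The value 27.6 can be checked with the rounded class probabilities listed earlier. This is a sketch under one assumption: the expected values are rescaled so that they sum to 825, which is how the value 27.6 seems to be obtained (without rescaling the result is slightly larger). The function name `chi_square` is hypothetical.

```python
def chi_square(observed, expected_weights):
    """Sum of (O - E)^2 / E, with the weights rescaled to sum to n."""
    n = sum(observed)
    total_weight = sum(expected_weights)
    chi2 = 0.0
    for o, w in zip(observed, expected_weights):
        e = n * w / total_weight  # expected frequency for this class
        chi2 += (o - e) ** 2 / e
    return chi2


observed = [4, 39, 122, 197, 192, 126, 93, 33, 16, 2, 1]
probs = [0.01563, 0.05292, 0.12512, 0.20665, 0.23846, 0.19227,
         0.1083, 0.04262, 0.01171, 0.002245, 0.000300]

chi2_weight = chi_square(observed, probs)
print(round(chi2_weight, 1))  # to be compared with 15.5 (95th percentile, 8 d.f.)
```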