---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
# How to assess the discrepancy between the results of n tests, classified in nc
#  classes, and a certain distribution law? An idea is to consider: 
# χ² = Σ i = 1  nc (ObsFri - ExpFri)²/ ExpFri  (Observed and Expected Frequencies)
# Consider, for example, 50 throws of a dice game; I got 9 one, 11 two, 5 three, 8 four,
# 10 five and 7 six. In order to evaluate the discordance from the uniform distribution
# (corresponding to a balanced die), we calculate its χ².
 
ObsFr = c(9,11,5,8,10,7)
Prob = c(1/6,1/6,1/6,1/6,1/6,1/6); n = sum(ObsFr)
ExpFr = Prob*n; sum( (ObsFr-ExpFr)^2/ExpFr )
# 2.8
# To evaluate the equity of the die we compare this value with the theoretical distri-
# bution of χ² corresponding to the balanced die:
 
Prob = rep(1/6,6); ExpFr = 50*Prob
x = NULL; y = NULL; for (n in 1:1e4) {
   for(i in 1:6) y[i]=0
   for (k in 1:50) {u=RUNIF(1, 1,6); y[u]=y[u]+1}
   # I got random 50 outputs and I classified them
   x[n] = sum((y-ExpFr)^2/ExpFr) }
summary(x)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.160   2.800   4.240   4.979   6.640  27.280
# I choose the scale of the chart:
h=max(hist(x,probability=TRUE,right=FALSE,col="grey",n=30,plot=FALSE)$density );h
# 0.1706
BF=3.5; HF=3
Plane(0,28, 0,0.18)        # or:  Plane(0, max(x)*1.05, 0, h*1.05)
hist(x, probability=TRUE,right=FALSE,col="grey",n=30,add=TRUE)
POINT(2.8,0,"blue")     # below on the left the hist. of theoretical χ² and our value
 
 
 
# I see that 2.8 is an almost central value (it corresponds to the 1st quartile). So I
# can consider plausible (ie do not refuse) the hypothesis that the die is balanced.
 
# Usually a procedure is used that does not depend on the particular phenomenon
# considered: if the number of tests is large every theoretical distribution of χ²
# is approximated by a distribution χ²(r) dependent only on degrees of freedom r, that
# is the amount of experimental frequencies that I have to know directly.
# In R the density of χ²(r) is x -> dchisq(x,r).
 
# In the previous case the degrees of freedom are 5: if I know 5 relative frequencies,
# I can find the 6th as the difference between 1 and the sum of them.
 
f = function(x) dchisq(x,5); graph(f, 0,28, "brown")
# I get the chart above to the right.
 
# The point which is the maximun of f:
maxmin(f, 0,10); f(last())
#    3          0.1541803
 
# qchisq(x,r) and pchisq(x,r) give the Kth (x=K/100) 100-quantile (or percentile) and
# the probability that χ² < x:
 
graph2F(f, 0,28, "red"); POINT(3,0.1541803, "blue")
qchisq(0.75, 5); Q = last()
# 6.62568
integral(f, 0,Q)
# 0.75
pchisq(Q, 5)
# 0.75
POINT(Q,f(Q),"seagreen"); diseq(z,f, 0,Q, "brown")
 
                        
 
# Returning to our die, if as χ² instead of 2.8 I would have got 13 I would have grave
# doubts about the fact that the die is balanced: the 95th percentile is 11.1:
qchisq(0.95, 5)
# 11.0705
# 13 corresponds to the 98th percentile:
pchisq(13, 5)
# 0.9766212
# But alse if I would have got 0.4 I would have grave doubts about the fact that the
# results of the tests are effective: the 5th percentile is 1.1:
qchisq(0.05,5)
# 1.145476
# For example if I got 9 one, 8 two, 8 three, 7 four, 9 five and 9 six  χ² is 0.4:
ObsFr = c(9,8,8,7,9,9); ExpFr = Prob*n; sum( (ObsFr-ExpFr)^2/ExpFr )
# 0.4
# It is useful (and convenient) to compare these values (2.8 and 0.4) with the table:
p = c( 5, 10, 25, 50, 75, 90, 95 )/100;  r = 5; print(qchisq(p,r), 3)
#   1.15  1.61  2.67  4.35  6.63  9.24  11.07
#
# Another example.
# An investigation of the weight of males between 45 and 55 years of a certain nation:
 
                  
# The sample is 825 males, in 11 classes
freq = c(4,39,122,197,192,126,93,33,16,2,1); l=length(freq); n=sum(freq); l; n
#  11   825
interv = seq(53.5, 130.5, 7)
val = seq(57, 127, 7)
noClass=1; histoclas(interv, freq)
# We get the chart above. To have an histogram of area 1 I use "dataclas" (see):
hist(dataclas,interv,col="grey",main="",cex.axis=0.8,probability=TRUE)
BOX()
# We get the chart below.
# Is it acceptable that the distribution be gaussian with the same mean and standard
# deviation?
                
m = mean(dataclas); s = sd(dataclas); m; s
#   84.29633            11.68259
z <- function(x) dnorm(x,mean=m,sd=s); graph(z, 50,140, "brown")
 
# It seems that the curve fits the histogram, but ... the data is numerous ...
 
# How many degrees of freedom are there in this case?
# 11 classes; the imposed connections are 3: the number of classes, the mean and the
# standard deviation; the degrees of freedom are 11-3 = 8.
# The probabilities are calculated using integrals:
K <- function(i) integral(z,53.5+i*7,53.5+(i+1)*7)
Prob <- c(K(0),K(1),K(2),K(3),K(4),K(5),K(6),K(7),K(8),K(9),K(10))
ExpFr = Prob*n; ObsFr = freq; sum( (ObsFr-ExpFr)^2/ExpFr )
#  27.07139
# We compare this value with the table:
p = c( 5, 10, 25, 50, 75, 90, 95 )/100;  r = 8; print(qchisq(p,r),3)
#  2.73  3.49  5.07  7.34  10.22  13.36  15.51
# Clearly we must discard the hypothesis that the distribution is Gaussian!
#
# A further example:
# I throw a coin 100 times. I get 65 heads. Can I consider the coin fair?
# 1 degree of freedom.
p = c( 5, 10, 25, 50, 75, 90, 95 )/100;  r = 1; print(qchisq(p,r),3)
# 0.00393 0.01579 0.10153 0.45494 1.32330 2.70554 3.84146
ObsFr = c(65,35); ExpFr = c(50,50); sum( (ObsFr-ExpFr)^2/ExpFr )
# 9
# 9 is much larger than 95th percentile (3.84146); I can not consider the coin balanced.
# If I want the percentile:
pchisq(9, 1)
# 0.9973002

# There are many other tests. You can explore them with this command (then you must
# click "Search Engine & Keywords" and type "test" in the text field)
help.start()
# For further information click here: WikiPedia and here: OM