---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- # How to assess the discrepancy between the results of n tests, classified in nc # classes, and a certain distribution law? An idea is to consider: #χ²= Σ(_{i = 1 … nc}(ObsFr_{i}-ExpFr_{i})²/ExpFr_{i}Observed andExpectedFrequencies) # Consider, for example, 50 throws of a dice game; I got 9 one, 11 two, 5 three, 8 four, # 10 five and 7 six. In order to evaluate the discordance from the uniform distribution # (corresponding to a balanced die), we calculate its χ².ObsFr = c(9,11,5,8,10,7) Prob = c(1/6,1/6,1/6,1/6,1/6,1/6); n = sum(ObsFr) ExpFr = Prob*n;sum( (ObsFr-ExpFr)^2/ExpFr )#2.8# To evaluate the equity of the die we compare this value with the theoretical distri- # bution of χ² corresponding to the balanced die:Prob = rep(1/6,6); ExpFr = 50*Prob x = NULL; y = NULL; for (n in 1:1e4) { for(i in 1:6) y[i]=0 for (k in 1:50) {u=RUNIF(1, 1,6); y[u]=y[u]+1} # I got random 50 outputs and I classified them x[n] = sum((y-ExpFr)^2/ExpFr) } summary(x)#Min. 1st Qu. Median Mean 3rd Qu. Max.#0.160 2.800 4.240 4.979 6.640 27.280 # I choose the scale of the chart: h=max(hist(x,probability=TRUE,right=FALSE,col="grey",n=30,plot=FALSE)$density );h#0.1706 BF=3.5; HF=3 Plane(0,28, 0,0.18) # or: Plane(0, max(x)*1.05, 0, h*1.05) hist(x, probability=TRUE,right=FALSE,col="grey",n=30,add=TRUE) POINT(2.8,0,"blue") # below on the left the hist. of theoretical χ² and our value# I see that 2.8 is an almost central value (it corresponds to the 1st quartile). SoI # can consider plausible (ie do not refuse) the hypothesis that the die is balanced.# Usually a procedure is used that does not depend on the particular phenomenon # considered: if the number of tests is large every theoretical distribution of χ² # is approximated by a distributionχ²(r)dependent only ondegrees of freedomr, that # is the amount of experimental frequencies that I have to know directly. # InRthe density of χ²(r) is x ->dchisq(x,r).# In the previous case the degrees of freedom are 5: if I know 5 relative frequencies, # I can find the 6th as the difference between 1 and the sum of them.f = function(x)dchisq(x,5); graph(f, 0,28, "brown") # I get the chart above to the right.# The point which is the maximun of f: maxmin(f, 0,10); f(last())#3 0.1541803#qchisq(x,r) andpchisq(x,r) give the Kth (x=K/100) 100-quantile (orpercentile) and # theprobability that χ² < x:graph2F(f, 0,28, "red"); POINT(3,0.1541803, "blue") qchisq(0.75, 5); Q = last()#6.62568 integral(f, 0,Q)#0.75 pchisq(Q, 5)#0.75 POINT(Q,f(Q),"seagreen"); diseq(z,f, 0,Q, "brown")# Returning to our die, if as χ² instead of 2.8 I would have got 13 I would have grave # doubts about the fact that the die is balanced: the 95th percentile is 11.1: qchisq(0.95, 5)#11.0705 # 13 corresponds to the 98th percentile: pchisq(13, 5)#0.9766212 # But alse if I would have got 0.4 I would have grave doubts about the fact that the # results of the tests are effective: the 5th percentile is 1.1: qchisq(0.05,5)#1.145476 # For example if I got 9 one, 8 two, 8 three, 7 four, 9 five and 9 six χ² is 0.4: ObsFr = c(9,8,8,7,9,9); ExpFr = Prob*n; sum( (ObsFr-ExpFr)^2/ExpFr )#0.4 # It is useful (and convenient) to compare these values (2.8 and 0.4) with the table:p = c( 5, 10, 25, 50, 75, 90, 95 )/100; r = 5;print(qchisq(p,r), 3)#1.15 1.61 2.67 4.35 6.63 9.24 11.07 # # Another example. # An investigation of theweightof males between 45 and 55 years of a certain nation:# The sample is 825 males, in 11 classes freq = c(4,39,122,197,192,126,93,33,16,2,1); l=length(freq); n=sum(freq); l; n#11 825 interv = seq(53.5, 130.5, 7) val = seq(57, 127, 7) noClass=1; histoclas(interv, freq) # We get the chart above. To have an histogram of area 1 I use "dataclas" (see): hist(dataclas,interv,col="grey",main="",cex.axis=0.8,probability=TRUE) BOX() # We get the chart below. #Is it acceptable that the distribution be#gaussianwith the samemeanandstandarddeviation?m = mean(dataclas); s = sd(dataclas); m; s#84.29633 11.68259 z <- function(x)dnorm(x,mean=m,sd=s); graph(z, 50,140, "brown")# It seems that the curve fits the histogram, but ... the data is numerous ...# How many degrees of freedom are there in this case? # 11 classes; the imposed connections are 3: the number of classes, the mean and the # standard deviation; thedegrees of freedom are 11-3 = 8. # The probabilities are calculated using integrals: K <- function(i) integral(z,53.5+i*7,53.5+(i+1)*7) Prob <- c(K(0),K(1),K(2),K(3),K(4),K(5),K(6),K(7),K(8),K(9),K(10)) ExpFr = Prob*n; ObsFr = freq; sum( (ObsFr-ExpFr)^2/ExpFr )#27.07139 # We compare this value with the table: p = c( 5, 10, 25, 50, 75, 90, 95 )/100; r = 8; print(qchisq(p,r),3)#2.73 3.49 5.07 7.34 10.22 13.36 15.51 # Clearlywe must! # # A further example: # I throw adiscardthe hypothesis that the distribution is Gaussian. I getcoin100 times65 heads. Can I consider the coin fair? #1 degreeof freedom. p = c( 5, 10, 25, 50, 75, 90, 95 )/100; r = 1; print(qchisq(p,r),3)#0.00393 0.01579 0.10153 0.45494 1.32330 2.70554 3.84146 ObsFr = c(65,35); ExpFr = c(50,50); sum( (ObsFr-ExpFr)^2/ExpFr )#9 # 9 is much larger than 95th percentile (3.84146); I cannotconsider the coinbalanced. # If I want the percentile: pchisq(9, 1)#0.9973002 # There aremany other tests. You can explore them with this command (then you must # click "Search Engine & Keywords" and type "test" in the text field)help.start()# For further information click here: WikiPedia and here: OM