---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
Examples of distributions

# References for the most common probability distributions are available here:
help("distribution")
# I toss a coin 10 times. What is the law of distribution of the random variable
# N = "number of heads"?
# Experimental study:
n = 10000; da = NULL; for (i in 1:n) da[i] = sum( RUNIF(10, 0,1) )
hist(da, seq(-0.5,10.5, 1), probability=TRUE, col="grey", main="", cex.axis=0.9)
BOX(); abovex("heads in 10 tosses")
# If I want the histogram on the right I insert  ylim=c(0,0.3)  in hist(...).
statistics(da)
# I get the boxplot and:
#  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
# 0.000   4.000  5.000 4.983   6.000 10.000
# The theoretical study (for "choose" see here):
N = 10; da = NULL; for (i in 0:N) da = c(da, rep(i, choose(10,i)))
hist(da, seq(-0.5,10.5, 1), probability=TRUE, col="grey", main="", cex.axis=0.9)
BOX(); abovex("heads in 10 tosses")
summary(da)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#    0      4      5    5      6    10
# As an index of the dispersion of the data I can take 3rd quartile - 1st quartile, i.e.:
Percentile(da,75) - Percentile(da,25)
# [1] 2
# or the standard deviation of "da" (see here):
Sd(da)
# 1.581139
# Using Var (= Sd^2) I see that Sd(da) = √2.5:
Var(da)
# 2.5
# It can be proved that Var is N/4, i.e. that Sd is √N/2.
# I recall that these are weighted values. I can also proceed in this way (see here):
j = 0:10; Wmean(j, choose(10,j)); WVar(j, choose(10,j))
# 5   2.5
# This is an example of binomial distribution
#   Pr(N=k) = C(n,k)*p^k*(1-p)^(n-k)
# when p = 1/2. In general:
binom = function(m) choose(N,m)*p^m*(1-p)^(N-m)
# An example with p ≠ 1/2.
# An automatic oven burns on average 1 biscuit every 8; the biscuits are then
# automatically mixed and packaged in packs of 6. What is the distribution law of the
# random variable N = "number of burnt cookies in a pack"?
# It is a discrete probability distribution.
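Outside the R session, the binomial law above can be cross-checked with a short standard-library script; this is a sketch in Python, and the names (`pr`, `total`, ...) are ours, not part of the original code:

```python
# Check Pr(N=k) = C(n,k) * p^k * (1-p)^(n-k) for n=10 coin tosses, p=1/2:
# the probabilities must sum to 1, the mean must be n*p = 5, the variance n/4 = 2.5.
from math import comb, isclose

n, p = 10, 1/2
pr = [comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1)]

total = sum(pr)                                         # total probability
mean = sum(k * q for k, q in enumerate(pr))             # weighted mean, like Wmean
var = sum((k - mean)**2 * q for k, q in enumerate(pr))  # weighted variance, like WVar
print(total, mean, var)
```

This reproduces the 5 and 2.5 obtained above with Wmean and WVar.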
N = 6; p = 1/8; x = 0:N; BC = binom(0:N)
Plane(0,N, 0,max(BC)); POINT(x,BC, "blue")
# For a better view we add these commands and obtain the graph on the left:
polyl(x,BC, "blue"); segm(x,0, x,BC, "blue")
Wmean(x,BC); WVar(x,BC)
# 0.75   0.65625
# I can prove that I obtain the same values with:
N*p; N*p*(1-p)
# 0.75   0.65625
# To understand this it is useful to interpret BC as X1+…+X6, where Xi is 1 with
# probability 1/8 and 0 otherwise. With this simulation we obtain the graph on the right:
n = 10000; da = NULL; N = 6
for (i in 1:n) {X = 0; for (j in 1:N) if (runif(1, 0,1) < 1/8) X = X+1; da[i] = X}
hist(da, seq(-0.5,N+1/2, 1), probability=TRUE, col="grey", main="", cex.axis=0.9)
BOX(); abovex("burnt cookies in a pack")
# It is clear why the mean is N*p: if every Xi is 1 with probability p, then X1+…+XN
# has mean N*p.
# NOTE:  y = dbinom(x, N, p)  is equivalent to  x = 0:N; binom = ...; y = binom(0:N)
# See here for an example of the multinomial distribution.
# When N is large (as in the following example) I can approximate the binomial
# distribution with the gaussian distribution, which we recall below.
# The following graph, on the left, is for N=20:
p = 0.2; N = 20
x = 0:N; y = binom(0:N); Plane(0,N, 0,max(y))
POINT(x,y, "blue"); polyl(x,y, "blue"); segm(x,0, x,y, "blue")
m = N*p; v = N*p*(1-p); s = sqrt(v)
# 4   3.2   1.788854
# The distribution law called normal or gaussian has the following density function f;
# it depends only on 2 parameters: the mean m and σ, called the standard deviation of f.
# (σ is the Greek letter "sigma", which corresponds to our "s")
# In R the normal density of mean m and standard deviation s is called
# dnorm(x, mean=m, sd=s). How to obtain the graph above on the right:
z = function(x) dnorm(x, mean=m, sd=s)
x = 0:N; y = binom(0:N); Plane(0,N, 0,max(y))
POINT(x,y, "blue")
graph2(z, -20,20, "brown")
# Among the few examples of gaussian phenomena (in addition to the distribution of
# seed lengths already considered - see here), there is the height of people
# of a particular sex, age, and place of origin.
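The "sum of Bernoulli variables" interpretation can also be checked by simulation outside R; a stdlib-only Python sketch of the same experiment (the variable names and the sample size are our choices):

```python
# Simulate X1+…+X6, each Xi equal to 1 with probability 1/8:
# the sample mean should be close to N*p = 0.75 and the
# sample variance close to N*p*(1-p) = 0.65625.
import random
random.seed(1)  # fixed seed so the run is reproducible

n, N, p = 100000, 6, 1/8
da = [sum(1 for _ in range(N) if random.random() < p) for _ in range(n)]

mean = sum(da) / n
var = sum((x - mean)**2 for x in da) / n
print(mean, var)   # close to 0.75 and 0.65625
```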
# Let's use the data already considered here
# (medical examination for a call-up - Italy, 1987):
da = read.table("http://macosa.dima.unige.it/R/callup.txt", skip=8, sep=";", header=TRUE)
str(da)
# 'data.frame':  4170 obs. of 4 variables:
# $ chest : int 91 86 89 88 92 95 91 83 87 88 ...
# $ weight: int 62 64 64 68 70 77 65 62 55 60 ...
# $ region: int 15 15 15 15 15 15 15 15 15 15 ...
# $ height: int 73 70 74 79 66 69 66 79 75 65 ...
# We add 1/2 to the truncated data, so we do not have to add 1/2 to the average values:
da$chest = da$chest+1/2; da$weight = da$weight+1/2; da$height = da$height+1/2
H = da$height; min(H); max(H)
# 47.5   96.5
m = mean(H); s = Sd(H)
z = function(x) dnorm(x, mean=m, sd=s)
BF=3; HF=2.1
graph2F(z, 47,97, "brown")
hist(H, n=20, right=FALSE, probability=TRUE, col="grey", add=TRUE)
graph2(z, 47,97, "brown"); abovex("data")
POINT(m+s, z(m+s), "blue"); POINT(m-s, z(m-s), "blue")
# The data fit the graph: see below on the left.
# I can simulate the phenomenon with rnorm(n=N, mean=m, sd=s), which generates N random
# data with normal distribution; obviously the simulation also fits (graph on the right):
H1 = rnorm(n=4000, mean=m, sd=s)
graph2F(z, 47,97, "brown")
hist(H1, n=20, right=FALSE, probability=TRUE, col="grey", add=TRUE)
graph2(z, 47,97, "brown"); abovex("random")
POINT(m+s, z(m+s), "blue"); POINT(m-s, z(m-s), "blue")
# We also marked (in blue) in the graphs the points at m ± σ. One can prove that these
# are inflection points. In general, whatever the gaussian, we have:
integral(z, m-s, m+s)
# 0.6826895
integral(z, m-s*2, m+s*2)
# 0.9544997
integral(z, m-s*3, m+s*3)
# 0.9973002
# and obviously:
integral(z, -Inf, Inf)
# 1
# For which s is the integral 0.99?
m=0; s=1; F = function(x) integral(z, -x,x); solution(F, 0.99, 0,100)
# 2.575829
# The variable "height" is virtually continuous: unlike the number of coins or biscuits,
# which only assumed integer values, I can think of a length as a real number.
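The three integrals above depend only on the number of σ's, not on m and s; they can be cross-checked with the gaussian CDF, Φ(x) = (1 + erf(x/√2))/2, available in Python's standard library. A sketch (the helper `pr_within` and the bisection loop are ours):

```python
# Pr(m - k*sigma < X < m + k*sigma) for ANY gaussian equals erf(k/sqrt(2)).
from math import erf, sqrt

def pr_within(k):
    return erf(k / sqrt(2))

p1, p2, p3 = pr_within(1), pr_within(2), pr_within(3)
print(p1, p2, p3)   # the 0.6826895, 0.9544997, 0.9973002 seen above

# Bisection for the k with pr_within(k) = 0.99, the analogue of
# solution(F, 0.99, 0,100) in the R session:
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if pr_within(mid) < 0.99:
        lo = mid
    else:
        hi = mid
k99 = (lo + hi) / 2
print(k99)          # the 2.575829 found above
```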
# Increasing the number of intervals, the upper edge of the histogram tends to a curve;
# instead of the sum of the areas of the columns of a histogram between x=a and x=b, I
# have the integral of a function between x=a and x=b.
# In the discrete case I represent the frequency density (i.e. the unitary percentage
# frequency); in the continuous case I represent the density function.
---------- ---------- ---------- ----------
# We note that x1^k, x2^k, x3^k, … (if k ≠ 1) have a different kind of distribution
# from x1, x2, x3, … An example: x with gaussian distribution, and x^3.
x <- rnorm(n=10000, mean=5, sd=1); x3 = x^3
Plane(min(x),max(x), 0,max(hist(x,plot=FALSE)$density))
hist(x, col="grey", probability=TRUE, add=TRUE)
statistics(x)
# median = 5.005   mean = 5.002
Plane(min(x3),max(x3), 0,max(hist(x3,plot=FALSE)$density))
hist(x3, col="grey", probability=TRUE, add=TRUE)
statistics(x3)
# median = 125.400   mean = 140.200
# x -> x^3 is an increasing function, so the median of x^3 is the cube of the median
# of x; this does not happen for the mean!
---------- ---------- ---------- ----------
# Generalizing what we just saw (here), we have:
# if I sum up N random variables that have the same distribution, I obtain a random
# variable that (as N -> ∞) tends toward a gaussian distribution. An example:
tot = 2000; x = NULL
# N = 1
for (i in 1:tot) x[i] = sqrt(runif(1)) + 3*runif(1)^2
dev.new(); hist(x, probability=TRUE, col="grey", main="", cex.axis=0.9)
BOX(); abovex("n=1")
# N = 20
for (i in 1:tot) {x[i]=0; for (j in 1:20) x[i] = x[i] + sqrt(runif(1)) + 3*runif(1)^2}
dev.new(); hist(x, probability=TRUE, col="grey", main="", cex.axis=0.9)
BOX(); abovex("n=20")
# That is the reason why the gaussian distribution is important: there are few
# gaussian phenomena, but if I calculate the mean of N values with the same
# distribution (i.e. if I calculate their sum divided by N), this mean has a gaussian
# distribution.
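The claim above that an increasing map such as x -> x^3 commutes with the median but not with the mean can be checked numerically; a stdlib-only Python sketch (seed and sample size are our choices):

```python
# For x ~ N(5,1): median(x^3) equals median(x)^3 exactly (cubing preserves order),
# while mean(x^3) is well above mean(x)^3 (E[X^3] = m^3 + 3*m*sigma^2 = 140 here).
import random
random.seed(2)

x = [random.gauss(5, 1) for _ in range(100001)]   # odd count: a unique middle element
xs = sorted(x)
x3 = sorted(v**3 for v in x)

median_x, median_x3 = xs[50000], x3[50000]
mean_x = sum(x) / len(x)
mean_x3 = sum(v**3 for v in x) / len(x)
print(median_x3 - median_x**3)   # exactly 0
print(mean_x3 - mean_x**3)       # close to 140 - 125 = 15
```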
# This fact goes under the name of central limit theorem:
# If U1,…,Un are independent random variables with the same law of distribution, and
# if m and σ0 are its mean and standard deviation, as n grows the variable (U1+…+Un)/n
# tends to have a normal distribution with mean m and standard deviation σ0/√n.
# Let's consider, as an example of a non-gaussian distribution, the weights of the
# population of twenty-year-olds considered above.
da = read.table("http://macosa.dima.unige.it/R/callup.txt", skip=8, sep=";", header=TRUE)
da$chest = da$chest+1/2; da$weight <- da$weight+1/2; da$height <- da$height+1/2
W = da$weight; min(W); max(W)
# 48.5   126.5
BF=3.5; HF=2.5
max(hist(W, probability=TRUE, col="grey", main="", cex.axis=0.9)$density)
# 0.03856115
Plane(48,127, 0,0.04)
hist(W, probability=TRUE, col="grey", cex.axis=0.9, add=TRUE)
abovex("weight - call-up - Italy, 1987"); POINT(mean(W),0, "blue")
length(W); Sd(W)/sqrt(length(W))
# 4170   0.1630721
E = 3*Sd(W)/sqrt(length(W)); mean(W)-E; mean(W)+E
# 70.48752   71.46595
# The mean weight of twenty-year-old Italian men was 71 ± 0.5 kg.
# But the mean and the standard deviation that we considered are not the mean and the
# standard deviation of the population of all the twenty-year-old Italian men (in
# 1987); they are those of the sample. It can be proved that, while the statistical
# mean is a good approximation of the theoretical one, I have to consider (instead of
# the theoretical standard deviation, Sd) the experimental standard deviation sd,
# where n is replaced by n-1:
#   Sd is √(((X1-M)^2 + … + (Xn-M)^2)/n)
#   sd is √(((X1-M)^2 + … + (Xn-M)^2)/(n-1))   [ > Sd ]
# So (U1+…+Un)/n tends to have a normal distribution which has the experimental mean
# as mean and sd/√n as standard deviation. I can calculate this value with
# SdM (standard deviation of the mean).
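The two formulas can be compared directly: Python's standard library exposes both, `pstdev` dividing by n (the text's Sd) and `stdev` dividing by n-1 (the text's sd). A sketch using the first ten weights listed in the table above (computing SdM as sd/√n is our shorthand for the site's helper):

```python
# sd differs from Sd exactly by the factor sqrt(n/(n-1)).
from statistics import pstdev, stdev
from math import sqrt, isclose

data = [62, 64, 64, 68, 70, 77, 65, 62, 55, 60]   # first ten weights from str(da)
n = len(data)

Sd_ = pstdev(data)        # divides by n
sd_ = stdev(data)         # divides by n-1, slightly larger
SdM = sd_ / sqrt(n)       # standard deviation of the mean
print(Sd_, sd_, sd_ / Sd_, sqrt(n / (n - 1)))
```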
# See here.
# When n is large the difference between the two values is negligible:
Sd(W)/sqrt(4170); sd(W)/sqrt(4170); SdM(W)
# 0.1630721   0.1630916   0.1630916
# It is due to the replacement of √n with √(n-1):
n=4170; sqrt(n/(n-1)); n=20; sqrt(n/(n-1)); n=10; sqrt(n/(n-1))
# 1.00012   1.025978   1.054093
# Therefore, more correct calculations in our case would be:
E = 3*SdM(W); mean(W)-E; mean(W)+E
# 70.48746   71.46601
# To be precise we should say: there is a probability of confidence of 99.73% that
# [70.48746, 71.46601] contains the exact value. If we use E = 2*SdM(W) we should say
# the probability of confidence is 95.45%. The interval is called a confidence interval.
---------- ---------- ---------- ----------
# In the discrete case the probability that U is between a and b is Pr(U=a)+…+Pr(U=b);
# in the continuous case the probability that U, with density function f, is between
# a and b (where a and b may be infinite) is ∫[a,b] f(x)dx. In the discrete case, if U
# has values in {v1,v2,…}, the mean is v1·Pr(U=v1)+v2·Pr(U=v2)+…; in the continuous
# case, if U varies in I, it is ∫_I x·f(x)dx; so, if m is the mean, the variance is
# ∫_I (x-m)²·f(x)dx.
# If I sum up only two random variables uniformly distributed on [0,1] I have (just
# like we saw here for the throw of two balanced dice):
HF=2; BF=4
PLANE(0,2, 0,1)
n = 1e6; U1 = runif(n,0,1); U2 = runif(n,0,1)
hist(U1+U2, probability=TRUE, col="grey90", add=TRUE, nclass=40)   # or: n=40
# the density function (from x=0 to x=2) is:
f = function(x) -abs(x-1)+1
graph1(f, 0,2, "blue"); abovex("runif+runif")
# and (as it should be) I have:
integral(f, 0,2)
# 1
# and the mean:
g = function(x) x*f(x); integral(g, 0,2)
# 1
# The graph of the probability that U1+U2 is less than x (see here):
Gintegra(f, 0,2, "black")
# This function (the one with the black graph) is the cumulative distribution function
# of U1+U2.
# In the case of the gaussian distribution I have:
z = function(x) dnorm(x, mean=0, sd=1)
HF=2; BF=3
graphF(z, -4,4, "blue")
# On the right the graph of the cumulative
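The integrals for the triangular density of U1+U2 can be checked numerically; here `midpoint_integral` is a plain midpoint-rule stand-in (our assumption) for the site's `integral` helper, and the variance check uses the fact that Var(U1+U2) = 2·Var(U1) = 2·(1/12) = 1/6:

```python
# Triangular density of U1+U2 on [0,2]: total area 1, mean 1, variance 1/6.
def f(x):
    return 1 - abs(x - 1)

def midpoint_integral(g, a, b, steps=100000):
    # simple midpoint rule; ample precision for these piecewise-polynomial integrands
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

area = midpoint_integral(f, 0, 2)
mean = midpoint_integral(lambda x: x * f(x), 0, 2)
var = midpoint_integral(lambda x: (x - 1)**2 * f(x), 0, 2)
print(area, mean, var)   # ≈ 1, 1, 1/6
```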
# distribution function:
Plane(-4,4, 0,1)
Gintegra(z, -10,10, "brown")
# I should have drawn the graph of ∫(-∞,x] z, but in Gintegra(f,A,B,col) I must use A
# and B finite. However ∫(-10,x] z is practically the same; indeed:
integral(z, -Inf,-10)
# 7.619853e-24
---------- ---------- ---------- ----------
# The examples of simulation considered here employ a procedure that is based on the
# central limit theorem and on the methods for studying the binomial distribution
# considered at the beginning of this document, but they merely compute the mean and
# the standard deviation [ in this case N=1 and Var = p*(1-p) ].
# Let's recall the first example:
# the probability that, throwing a pair of balanced dice, their difference is 1
# (we - really - know it is 5/18 = 0.2777…)
Event <- function() abs( RUNIF(1, 1,6)-RUNIF(1, 1,6) ) == 1
PR(1e4)
# 27.9 +/- 1.34559
PR(1e5)
# 27.737 +/- 0.4247283
PR(1e6)
# 27.8421 +/- 0.1344666
# As n -> ∞ the standard deviation σ/√n -> 0 as 1/√n: the "error" of PR(N) is divided
# by 10 if N is multiplied by 100.
# It is basically for this reason that (while analyzing statistical data), if the
# number of data is multiplied by 100, I can round the average to one further figure.
# If I have at least 10 data which are rounded to the tenths I can round the mean to
# the hundredths. If I have at least 1000 data which are rounded to the tenths I can
# round the mean to the thousandths. If I have at least 10 data which are rounded to
# the units I can round the mean to the tenths (see). Verification:
# Some data rounded to the units:
data = runif(11, 50,100); c( mean(data), mean(round(data)) )
# 77.32430   77.27273
data = runif(101, 50,100); c( mean(data), mean(round(data)) )
# 75.86425   75.83168
data = runif(1001, 50,100); c( mean(data), mean(round(data)) )
# 74.99541   74.99800
data = runif(10001, 50,100); c( mean(data), mean(round(data)) )
# 75.18941   75.18628
data = runif(100001, 50,100); c( mean(data), mean(round(data)) )
# 74.94769   74.94847
# Data must be rounded!!! See here. See here too.
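The dice example can be verified both exactly (counting the 36 equally likely outcomes) and by simulation; a stdlib-only Python sketch, where the plain loop replaces the site's Event/PR helpers:

```python
# Probability that the difference of two balanced dice is 1:
# exactly 10 of the 36 ordered pairs qualify, i.e. 5/18 = 0.2777...
import random
from fractions import Fraction
random.seed(3)

exact = Fraction(sum(1 for a in range(1, 7) for b in range(1, 7)
                     if abs(a - b) == 1), 36)

n = 100000
hits = sum(1 for _ in range(n)
           if abs(random.randint(1, 6) - random.randint(1, 6)) == 1)
estimate = hits / n
print(exact, estimate)   # 5/18 and a nearby estimate
```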
Other distributions
Other examples of use