Animated: Mean and Sample Size

A quick experiment in R can show how sample size affects the estimates we make from data. A small sample gives us less information about the process or system we're collecting data from, while a large sample can ground our findings in near certainty. See the earlier post on sample size, confidence intervals and related topics on R Explorations.
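The size of this effect can be worked out before simulating anything: for independent draws with standard deviation sigma, the standard error of the sample mean is sigma / sqrt(n). Here is a minimal sketch, assuming the same standard deviation of 1 used later in this post. The variable names `sigma`, `n` and `se` are illustrative only.

```r
# Standard error of the sample mean: sigma / sqrt(n)
# Assumption: independent draws with a known standard deviation sigma = 1
sigma <- 1.0
n <- c(5, 10, 50, 100, 1000, 10000)

se <- sigma / sqrt(n)

# One row per sample size, with the theoretical spread of the sample mean
print(data.frame(n = n, se = round(se, 3)))
```

Because of the square root, cutting the spread of the estimate by a factor of 10 takes 100 times as many samples: going from n = 100 to n = 10000 moves the standard error from 0.1 to 0.01.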

Using the “animation” package once again, I’ve put together a simple animation to illustrate the effect of sample size.

#package containing saveGIF function
library(animation)

#setting GIF options
ani.options(interval = 0.12, ani.width = 480, ani.height = 320)

#a function to help us call GIF plots easily 
plo <- function(samplesize, iter = 100){
  
  for (i in seq(1,iter)){
    
    #Generating a sample from the normal distribution
    x <- rnorm(samplesize,mu,sd)
    
    #Histogram of samples as they're generated
    hist(x, main = paste("N = ",samplesize,", xbar = ",round(mean(x), digits = 2),
                         ", s = ",round(sd(x),digits = 2)), xlim = c(5,15), 
                        ylim = c(0,floor(samplesize/3)), breaks = seq(4,16,0.5), col = rgb(0.1,0.9,0.1,0.2), 
                        border = "grey", xlab = "x (Gaussian sample)")
    
    #Adding the estimate of the mean line to the histogram
    abline(v = mean(x), col = "red", lwd = 2)
  }
}

#Setting the parameters for the distribution
mu = 10.0
sd = 1.0

#Generating one GIF per sample size; plo() reads mu and sd from the global environment
for (i in c(10,50,100,500,1000,10000)){
  saveGIF({plo(i)}, movie.name = paste("N=",i,", mu=",mu,", sd=",sd,".gif"))
}

Animated Results

Very small sample size of 5. Observe how the sample mean line hunts wildly.

N= 10 , mu= 10 , sd= 1

A small sample size of 10. Mean once again moves around quite a bit.

N= 50 , mu= 10 , sd= 1

Moderate sample size of 50. The estimate (red line) varies far less from sample to sample.

N= 100 , mu= 10 , sd= 1

A larger sample size, showing little deviation in the sample mean across different samples.

N= 1000 , mu= 10 , sd= 1

A large sample size, showing only minor variations in the sample mean.

N= 10000 , mu= 10 , sd= 1

Very large sample size (although still smaller than many real-world data sets!). The sample mean estimate barely changes from sample to sample.
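The wandering of the red line can also be summarised as a number. The following is a rough sketch and not part of the animation code: it draws 100 samples at each size (mirroring the 100 frames per GIF) and reports the standard deviation of the resulting sample means. The helper name `mean_spread` and the seed are assumptions made for illustration.

```r
# Spread of the sample mean across repeated samples, per sample size
set.seed(42)  # arbitrary seed, only so the run is repeatable

mu <- 10.0
sigma <- 1.0

# Hypothetical helper: sd of `iter` sample means, each from a sample of size n
mean_spread <- function(n, iter = 100) {
  means <- replicate(iter, mean(rnorm(n, mean = mu, sd = sigma)))
  sd(means)
}

sizes <- c(5, 10, 50, 100, 1000, 10000)
spread <- sapply(sizes, mean_spread)

# One row per sample size, with the observed spread of the sample means
print(data.frame(n = sizes, sd_of_means = round(spread, 4)))
```

The reported spread should shrink steadily as n grows, which is the same pattern the GIFs show visually.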