Confidence Intervals and t-tests in R

If you were to walk into a restaurant and order a cup of coffee, you’d expect to get a standard cup of the stuff, and you’d expect to get a sufficient quantity of it too. How much coffee for a certain price is too much, and how much is too little? More realistically, when you know that the same coffee shop may serve ever so slightly different volumes of coffee in the same cup, how can you quantify the coffee in the cup?

What Plausible Ranges of Gas Prices Do You Pay?

The same question may very well be asked at the gas or diesel station where you fill up your car. When you get there and pay for a few gallons or litres of fuel (depending on where you are), you can expect to pay a certain amount of money for each gallon or litre. How do we determine what the range of expected prices for a gallon or litre of fuel is, based on already available data?

This is where the confidence interval comes in, and it is one of the most important tools of inferential statistics. Inferential statistics is the science of making decisions or informed generalizations about a population or process, based on the characteristics of a sample of data drawn from it. An important way to understand variability in any process or product’s performance is to ascribe a range of plausible values. Confidence intervals can be defined as the plausible ranges of values a population parameter may take, if you were to estimate it with a sample statistic. You’d hardly expect a statistician to ask you “What plausible range of gas prices do you pay on average?”, but this is, in fact, close to what interval estimation in statistics is about.

Confidence Levels and Sampling

Why the word confidence in confidence interval? Well, information costs something to collect. If you wanted to be 100 percent sure about the mean of petrol prices, for instance, you would have to collect data on every transaction from every pump in the world. In the real world, this is impossible, and we require sampling.

In the age of Big Data, it seems to be a taboo to talk about sampling sometimes. “You can collect all the data from a process”, some claim. That may be the case for a small minority of processes, but for the vast world out there, characterization is only possible by collecting and evaluating samples of data. And this approach entails the concept of a confidence level.

Confidence levels tell us the proportion of times we’re willing to be right (and wrong) about any parameter we wish to estimate from a sample of data. A 95% confidence level, for instance, means that if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true value. For example, if I measured the price per gallon of gas at every pump in Maine, or Tokyo, for a day, I’d get a lot of data, and that data would show me some general trends and patterns in the way the prices are distributed, or what typical prices seem to be in effect. If I expected to make an estimate of petrol prices in Tokyo or Maine next July, however, I couldn’t hope to do this with a limited sample such as this. Basic economics tells us that many factors could change these prices, and that they could very well be quite different from what they are now. And this is regardless of the quality of the data we have.

If I wanted to be 95% confident about the prices of petrol within a month from now, I could use a day’s worth of data. And to represent this data, I could use a confidence interval (a range of values), of course. So, whether it is the quantity of coffee in your cup, or the price per gallon of fuel you buy, you can evaluate the broader parameters of your data sets using confidence intervals, as long as you can get the data.
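The idea of a confidence level as a long-run proportion can be checked with a small simulation. The sketch below is only an illustration: the number of repetitions, the seed and the variable names are assumptions, not part of the original example. It draws many samples from a distribution whose mean we know, builds a 95% interval from each, and counts how often the interval contains the true mean.

```r
# A rough simulation of what a 95% confidence level means.
# n_reps, the seed and all names are assumptions made for this sketch.
set.seed(42)
true_mean <- 10
n_reps <- 2000   # number of simulated samples
n <- 100         # data points per sample

covered <- replicate(n_reps, {
  s <- rnorm(n, mean = true_mean, sd = 1)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  # TRUE when the interval contains the true mean
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contained the true mean
coverage <- mean(covered)
coverage
```

The proportion should land close to 0.95, which is the sense in which we are "95% confident" about any one interval.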

R Confidence Intervals Example

Now, let’s see a simple example in R that illustrates confidence intervals.

#Generate some data - 100 points of data
#The mean of the data generated is 10,
#The standard deviation we've chosen is 1.0
#Data comes from the gaussian distribution

x <- rnorm(100, mean = 10, sd = 1.0)

#Testing an 80% confidence level
x80 <- t.test(x, conf.level = 0.8)

#Testing a 90% confidence level
x90 <- t.test(x, conf.level = 0.9)

#Testing a 99% confidence level
x99 <- t.test(x, conf.level = 0.99)

#Print the stored results
x80
x90
x99

The first part of the code shows us how 100 points of data are used as a sample in this illustration.

In the next part, the t.test() command is used to generate confidence intervals. It can also test hypotheses you may have developed. In the above code, I’ve saved three different results, based on three different confidence levels: 80%, 90% and 99%. Typically, when you want to be more certain, but you don’t have more data, you end up getting a wider confidence interval. Expect more uncertainty if you have limited data, and more certainty when you have more data, all other things being equal. Here are the results from the first test, the 80% confidence interval.

1-sample t-test and confidence interval (80% confidence)

Let’s break down the results of the t-test. First, we see the data set the test was performed on, and then we see the t-statistic and also a p-value. Further, you can see a confidence interval (9.85, 10.10) for the mean of the data, based on a sample size of 100, and a confidence level of 80%. Since the t-test is fundamentally a hypothesis test that uses confidence intervals to help you make your decision, you can also see an alternative hypothesis listed there. Then there’s the estimate of the mean, 9.98. Recall the characteristics of the normal distribution we used to generate the data in the first place: it had a mean \mu = 10.0 and a standard deviation \sigma = 1.
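To see where the interval in the output comes from, it can be rebuilt by hand as the sample mean plus or minus a critical t value times the standard error. This is a minimal sketch: the seed and the variable names are assumptions, so the numbers will differ from the ones quoted above.

```r
# Rebuilding an 80% confidence interval by hand.
# The seed is an assumption, so values differ from the post's output.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 1.0)

n <- length(x)
alpha <- 1 - 0.80                        # two-sided significance level
se <- sd(x) / sqrt(n)                    # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)  # critical value of t

manual_ci <- mean(x) + c(-1, 1) * t_crit * se
builtin_ci <- as.numeric(t.test(x, conf.level = 0.80)$conf.int)

manual_ci
builtin_ci
all.equal(manual_ci, builtin_ci)
```

The two intervals should agree, which shows that t.test() is doing nothing more mysterious than this arithmetic.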

Confidence Levels versus Confidence Intervals

Similarly, we can look at the confidence intervals and mean estimates from the remaining two tests. For ease, I’ll print them out from storage.

CIs comparison (80%,90%, 99%)

Observe how, for the same data set, the confidence intervals (plausible ranges of values for the mean of the data) differ depending on the confidence level. The confidence intervals widen as the confidence level increases. The 80% CI is calculated to be (9.85, 10.10), while the same sample yields (9.72, 10.24) when the CI is calculated at the 99% confidence level. When you consider that standard deviations can be quite large, the confidence level you use in your calculations could actually become a matter of importance!
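The widening can also be read off directly from the stored test results rather than from the printed output. The sketch below assumes a seeded sample and a hypothetical helper vector called widths; it is one way to do the comparison, not the only one.

```r
# Comparing interval widths across confidence levels for one sample.
# The seed and the names used here are assumptions for this sketch.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 1.0)

conf_levels <- c(0.80, 0.90, 0.99)

# Width of each interval: upper bound minus lower bound
widths <- sapply(conf_levels, function(cl) {
  ci <- t.test(x, conf.level = cl)$conf.int
  ci[2] - ci[1]
})
names(widths) <- c("80%", "90%", "99%")

widths
```

Each width should be larger than the one before it, since the same data is being asked to support a stronger claim.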

More on how confidence intervals are calculated here, at NIST.

The 1-sample t-Test

We’ve seen earlier that the command that is invoked to calculate these confidence intervals is actually called t.test(). Now let’s look at what a t-test is.

In inferential statistics, specifically in hypothesis testing, we test samples of data to determine whether we can make a generalization about the population they came from. When you collect data from a process you know, the expectation is that this data will look a lot like what you think it should look like. If this is data about how much coffee you were served, or how much you paid for a gallon of gasoline, you may be interested in knowing, for instance, whether the price per gallon here is any different from the price at some other gas station. One way to find this out, with a stated level of confidence rather than certainty, is through a t-test. A t-test fundamentally tells us whether we fail to reject a default hypothesis we have (also called the null hypothesis and denoted as H_0), or whether we reject the default hypothesis in favor of an alternative hypothesis (denoted by H_a). In the case of the 1-sample t-test, therefore:

H_0 : \mu = k
H_a : \mu \neq k

Depending on the nature of the alternative hypothesis, we could have inequalities there as well. Note that k here is the expectation you have about what the mean ought to be. If you don’t supply one, t.test() tests against its default of mu = 0, and still calculates a confidence interval and produces a mean estimate.

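A t-test uses the t-distribution, which is a lot like the Gaussian or normal distribution, except that it has an additional parameter, the degrees of freedom, which relates directly to the sample size of your data. For small samples the t-distribution has heavier tails than the normal distribution, and it approaches the normal distribution as the sample size grows. Understandably, the size of the sample could give you very different results in a t-test.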
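The effect of the sample size can be seen in the critical values themselves. The following sketch compares the two-sided 95% critical value of t at a few degrees of freedom with the corresponding normal value. The particular degrees of freedom are chosen only for illustration.

```r
# Two-sided 95% critical values: t-distribution versus normal.
# The degrees of freedom below are arbitrary illustrative choices.
dfs <- c(5, 30, 99)

t_crit <- qt(0.975, df = dfs)   # depends on the sample size
z_crit <- qnorm(0.975)          # does not depend on the sample size

round(t_crit, 2)   # 2.57 2.04 1.98
round(z_crit, 2)   # 1.96
```

With only a handful of data points the t critical value is noticeably larger, which makes the resulting intervals wider. By 99 degrees of freedom it is already close to the normal value.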

As with other hypothesis tests, you also get a p-value. The null hypothesis in a 1-sample t-test is relatively straightforward: that there is no difference between the population mean that the sample estimates and the value we expected it to have. Naturally, the alternative hypothesis could help us study whether the population mean is less than expected (less expensive gas!) or greater than expected (more expensive gas!).

Let’s take another look at that first t-test result:

1-sample t-test and confidence interval (80% confidence)

The interval we’ve calculated here is an 80% confidence interval, which corresponds to a significance level of 20%. To interpret the results above, we compare the p-value with the significance level, and reject the null hypothesis if p is smaller. But what are the t-statistic and df? The t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (\bar{x} - k) / (s / \sqrt{n}), where s is the sample standard deviation and n = 100 is the number of data points. It does not depend on the confidence level; the confidence level only sets the critical value of t that is used to build the interval. (The “df” here stands for degrees of freedom, which is 99: the 100 data points we have, minus the 1 parameter we’re estimating.)
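The t-statistic and p-value can be reproduced by hand from the sample mean, the sample standard deviation and the sample size. In this sketch the hypothesized mean mu0 = 10 and the seed are assumptions chosen for illustration; the post's own call left mu at its default.

```r
# Reproducing the 1-sample t-statistic and two-sided p-value by hand.
# mu0 and the seed are assumptions made for this sketch.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 1.0)
mu0 <- 10

n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # distance in standard errors
df <- n - 1
p_value <- 2 * pt(-abs(t_stat), df)             # two-sided p-value

res80 <- t.test(x, mu = mu0, conf.level = 0.80)
res99 <- t.test(x, mu = mu0, conf.level = 0.99)

all.equal(t_stat, unname(res80$statistic))
all.equal(p_value, res80$p.value)

# The confidence level changes the interval, not the statistic or p-value
identical(res80$p.value, res99$p.value)
```

All three comparisons should come back TRUE.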

Alternative Hypotheses Inequalities

The t.test() command also allows us to evaluate how the confidence intervals look, and what the p-values are, when we have different alternative hypotheses. We can test for the population mean that’s being estimated to be less than, or greater than the expected value. We can also specify what our expected values of mean are.


x80 <- t.test(x, conf.level = 0.8, mu = 10.1, alternative = "less")

x80

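Observe how the p-value and the confidence interval have changed.

We’ve evaluated the same data at the 80% confidence level, this time with an expected value of the mean of \mu = 10.1, and with the alternative hypothesis that the mean is smaller, \mu<10.1. Because the alternative is one-sided, the confidence interval is one-sided too, with only an upper bound.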
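A short sketch of what changes under a one-sided alternative follows. The seed and names are assumptions, so the numbers will not match the output shown in the post. The p-value now uses only the lower tail of the t-distribution, and the interval has no lower bound.

```r
# A one-sided ("less") 1-sample t-test, checked by hand.
# The seed is an assumption; values will differ from the post's output.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 1.0)

res_less <- t.test(x, conf.level = 0.8, mu = 10.1, alternative = "less")

# One-sided interval: the lower bound is -Inf
res_less$conf.int

# One-sided p-value: probability in the lower tail only
t_stat <- (mean(x) - 10.1) / (sd(x) / sqrt(length(x)))
p_manual <- pt(t_stat, df = length(x) - 1)

all.equal(p_manual, res_less$p.value)
```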

Concluding Remarks

When evaluating data to draw conclusions from it, it helps to construct confidence intervals. These tell us general patterns in the data, and also help us estimate variability. In real-life situations, using confidence intervals and t-tests to estimate the presence or absence of a difference between expectation and estimate is valuable. Often, this is the lifeblood of data-driven decision making when dealing with lots of data, and when coming to impactful conclusions about data. R’s power in quickly generating confidence intervals becomes quite an ally, in the right hands – and of course, if you’ve collected the right data.

Normality Tests in R

When we see data visualized in a graph such as a histogram, we tend to draw some conclusions from it. When data is spread out, or concentrated, or observed to change with other data, we often take that to mean relationships between data sets. Statisticians, though, have to be more rigorous in the way they establish their notions of the nature of a data set, or its relationship with other data sets. Statisticians of any merit depend on test statistics, in addition to visualization, to test any theories or hunches they may have about the data. Usually, normality tests fit into this toolbox.

Histogram: Can this graph alone tell you whether your data is normally distributed?

Normality tests help us assess whether the data we have is consistent with having come from a normal or Gaussian distribution. At the outset, that seems simple enough. However, when you look closer at a Gaussian distribution, you can observe how it has certain specific properties. For instance, there are two main parameters: a location parameter, the mean, and a scale parameter, the standard deviation. Different combinations of these can mean different shapes of distributions. You can therefore have thin and tall normal distributions, or fat and wide normal distributions.
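To make the "thin and tall" versus "fat and wide" picture concrete, the sketch below draws two normal densities with the same mean and different standard deviations. The two standard deviations (0.5 and 2) are assumptions picked only to make the contrast visible.

```r
# Two normal distributions with the same mean but different spreads.
# The standard deviations 0.5 and 2 are illustrative assumptions.
peak_narrow <- dnorm(10, mean = 10, sd = 0.5)   # height at the mean, small sd
peak_wide <- dnorm(10, mean = 10, sd = 2)       # height at the mean, large sd

# Thin and tall (solid line) versus fat and wide (dashed line)
curve(dnorm(x, mean = 10, sd = 0.5), from = 2, to = 18,
      xlab = "value", ylab = "density")
curve(dnorm(x, mean = 10, sd = 2), add = TRUE, lty = 2)

peak_narrow / peak_wide
```

The peak height is inversely proportional to the standard deviation, so the narrower curve here is four times as tall at the mean.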

When you’re checking a data set for normality, it helps to visualize the data too.

Normal Q-Q Plots

#Generating 10k points of data and arranging them into 100 columns
x <- rnorm(10000, 10, 1)
dim(x) <- c(100, 100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[, 1])
qqline(x[, 1], col = 2)

The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column.

Q-Q Plot (Normal)

The Q-Q plot compares the quantiles of the data set (in this case, the first column of variable x) with the quantiles we would theoretically expect from a normal distribution. If the data is normal, the points fall close to a straight line, and qqline() adds a reference line to judge this against. We’re able to do this because of the normal distribution’s properties. The normal distribution is thicker around the mean, and thinner as you move away from it. Specifically, around 68% of the points you can expect to see in normally distributed data will lie within 1 standard deviation of the mean. There are similar figures for 2 and 3 standard deviations (95.4% and 99.7% respectively).
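Both ideas in the paragraph above can be checked numerically. The first part of this sketch computes the 1, 2 and 3 standard deviation proportions from the normal distribution function. The second part pairs the sorted sample with theoretical normal quantiles, which is what a normal Q-Q plot puts on its two axes. The seed and variable names are assumptions.

```r
# Proportion of a normal distribution within 1, 2 and 3 standard deviations
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(100 * within_k_sd, 1)   # 68.3 95.4 99.7

# The pairs of values behind a normal Q-Q plot, built by hand.
# The seed and names are assumptions made for this sketch.
set.seed(1)
s <- rnorm(100, mean = 10, sd = 1)
theoretical <- qnorm(ppoints(length(s)))   # expected normal quantiles
sample_q <- sort(s)                        # observed sample quantiles

# For normal data the two sets of quantiles are almost perfectly correlated
cor(theoretical, sample_q)
```

A correlation close to 1 is the numerical counterpart of the points hugging the straight line in the plot.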

However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality.

You may ask, what does non-normal data look like in this plot? Here’s an example below, from a binomial distribution, plotted on a Q-Q normal plot.

QQ-Normal plot – observe how binomial distribution data displays categories and extremes
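A plot like this one can be produced along the following lines. The original post does not show the code or parameters behind its figure, so the size and prob values below are assumptions, as is the seed.

```r
# Normal Q-Q plot of binomial data.
# size, prob and the seed are assumptions; the post does not state
# which parameters were used for its figure.
set.seed(1)
b <- rbinom(100, size = 10, prob = 0.1)

qqnorm(b, main = "Normal Q-Q plot of binomial data")
qqline(b, col = 2)

# Only a handful of distinct values, which is why the plot shows "steps"
sort(unique(b))
```

Because binomial data can only take whole-number values, the points line up in horizontal bands instead of following the reference line.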

Anderson Darling Normality Test

The Anderson-Darling test is one of the most commonly used normality tests, and tells us whether or not a sample is consistent with normally distributed data. This is done in R by using the ad.test() command, in the nortest package. So, if you don’t see ad.test popping up in your RStudio auto-complete, you can install the package with install.packages("nortest"). Running the Anderson-Darling test for normality returns several pieces of output. Here’s how to make sense of it.


#Running the A-D test for first column
library(nortest)
ad.test(x[,1])

Anderson Darling Normality Test results

The data from the A-D test tells us which data has been tested, and two results: A and p-value.

The A result refers to the Anderson-Darling test statistic, which measures how far the sample departs from a fitted normal distribution, with extra weight on the tails. The A-D test tests the default hypothesis that the data (in this case the first column of x) comes from a normal distribution. Assuming that this hypothesis is true, the p-value tells us the probability of seeing a sample that departs from normality at least as much as this one does, purely by random chance. That is to say, in this case, we would have a 50.57% probability of seeing a departure this large or larger, if the process in question really does produce normally distributed data. Note that this is not the probability that the data is normal. A p-value this high simply means the sample gives us no evidence against normality, which is why we fail to reject the hypothesis we had originally, that the data does come from a normal distribution.

For another sample, if the p-value were around 3%, for instance, that would mean only a 3% chance of seeing a departure from normality this large in data that really came from a normal distribution, which is a low chance. Although a 5% chance is still small, 5% is conventionally chosen as the cutoff, and data with p-values near or below it should ideally be scrutinized before we proceed to draw inferences from it. P-values can be confusing for some and hard to interpret. I usually try to construct a sentence to interpret the p-value’s meaning in the context of the hypothesis that’s being tested. I’ll write more on this in a future post on hypothesis testing.
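For readers curious about what goes into the A statistic, here is a sketch of the standard Anderson-Darling formula with the mean and standard deviation estimated from the sample. The function name ad_statistic and the two example samples are assumptions made for illustration. The result should correspond to the A value printed by ad.test(), although it has not been checked against the package here, and the sketch does not compute a p-value.

```r
# Anderson-Darling statistic for normality, written out by hand.
# ad_statistic is a hypothetical helper, not a function from nortest.
ad_statistic <- function(v) {
  v <- sort(v)
  n <- length(v)
  # Fitted normal probabilities of the standardized, ordered data
  p <- pnorm((v - mean(v)) / sd(v))
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(p) + log(1 - rev(p))))
}

set.seed(1)
normal_sample <- rnorm(100, mean = 10, sd = 1)
skewed_sample <- rexp(100)   # assumed example of clearly non-normal data

ad_statistic(normal_sample)
ad_statistic(skewed_sample)
```

The skewed sample should give a much larger statistic than the normal one. A larger A means a larger departure from normality, and therefore a smaller p-value.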

Interpreting and Understanding A-D test Results

Naturally, as the p-values from an Anderson-Darling normality test become smaller and smaller, the evidence against the data having come from a normal distribution becomes stronger and stronger. Most statistical studies peg the “significance” level at which we reject the default hypothesis (that this data comes from a normal distribution) at 0.05 (5%), so that p-values at or below this lead to rejection. I’ve known teams and individuals that use a stricter threshold, and fail to reject this hypothesis all the way down to p-values of 0.01. In truth, one significance value (often referred to as \alpha) doesn’t fit all applications, and we have to exercise great caution in rejecting null hypotheses.

Let’s go a bit further with our data and the A-D test: we will now perform the A-D test on all the columns of data we’ve got in our variable x. The easiest way to do this in R is to define a function that returns the p-value for a column, and use that in an “apply” command.


#Running the A-D test for every column
library(nortest)
#defining a function called "adt" to run the A-D test and return the p-value
adt <- function(x) { test <- ad.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable
pvalues <- apply(x, MARGIN = 2, adt)

The code above analyzes the samples in x (as columns) and returns their p-values as a vector, with one p-value per column. So, what do you expect to see when you summarize the variable “pvalues” which stores the test results?

p-values summary (columns of x)

When you summarize the p-values, you can see how approximately 9 of the 100 fall below the significance threshold (p < 0.05), even though every column was generated from a normal distribution. You can also see that the p-values in this set of randomly generated samples are spread over the entire range of probabilities from 0 to 1. We can visualize this too, by plotting the variable “pvalues”.
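Some rejections are expected even when every sample really is normal, because a 5% threshold lets roughly 5% of such samples through by chance. The sketch below gives a rough check of that rate on a larger simulation. It uses shapiro.test() from base R (the Shapiro-Wilk test covered later in this post) as a stand-in for ad.test(), purely so that it runs without extra packages. The number of simulated samples, the seed and the names are assumptions.

```r
# How often does a normality test reject samples that really are normal?
# shapiro.test() is a stand-in for ad.test() so no extra package is needed.
# The seed, the 1000 samples and all names are assumptions for this sketch.
set.seed(123)
sims <- matrix(rnorm(100 * 1000, mean = 10, sd = 1), nrow = 100)

p_sims <- apply(sims, MARGIN = 2, function(col) shapiro.test(col)$p.value)

alpha <- 0.05
rejection_rate <- mean(p_sims < alpha)
rejection_rate

# p-values from truly normal samples are spread roughly evenly over 0 to 1
hist(p_sims, breaks = 20, main = "p-values from normal samples",
     xlab = "p-value")
```

The rejection rate should come out in the neighbourhood of 0.05. With only 100 samples, as in the example above, a count anywhere from a couple to around ten is not unusual.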


#Plotting the sample p-values and drawing a significance line
plot(pvalues, main = "p-values for columns in x (AD-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")

p-values for columns in x

Other tests: Shapiro-Wilk Test

The Anderson-Darling test isn’t the only test available for assessing normality. Statisticians and engineers often use the Shapiro-Wilk test of normality also, which is available in base R as shapiro.test() (in the stats package, so the nortest package isn’t needed for it). For data similar to that used above (generated as random numbers from the normal distribution), the Shapiro-Wilk test can be performed with only a few changes to the R script (which is another reason R is so time-efficient).


#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)

#Running the Shapiro-Wilk test for every column
#shapiro.test() ships with base R (stats package), so nortest is not needed here
#defining a function called "swt" to run the Shapiro-Wilk test and return the p-value
swt <- function(x) { test <- shapiro.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable
swpvalues <- apply(x, MARGIN = 2, swt)

#Plotting the sample p-values and drawing a significance line
plot(swpvalues, main = "p-values for columns in x (Shapiro-Wilk-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")

Shapiro-Wilk test p-values (for columns in x)

Observe that in these samples of 100 points each, the Shapiro-Wilk test returns 96 samples as being consistent with normally distributed data, while rejecting 4 (the four dots below the red line). We can’t be sure from this example whether the Shapiro-Wilk test is better than the A-D test for normality assessments, however, because these are randomly generated data sets. If we run these tests side by side, we may get to see interesting results. When I ran them side by side, I got a very similar number of significantly different samples (non-normal samples), either 4 or 5 out of the total 100, from both tests.
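A side-by-side run on the same data might look like the sketch below. The seed and the variable names are assumptions, so the counts will not match the ones reported above. The A-D part is only run if the nortest package is installed; otherwise the sketch reports the Shapiro-Wilk count alone.

```r
# Running Shapiro-Wilk and Anderson-Darling on the same 100 samples.
# The seed and names are assumptions; counts will differ from the post.
set.seed(7)
x <- matrix(rnorm(10000, mean = 10, sd = 1), nrow = 100, ncol = 100)

# Shapiro-Wilk p-values, one per column (base R)
swp <- apply(x, MARGIN = 2, function(col) shapiro.test(col)$p.value)
sw_rejected <- sum(swp < 0.05)

if (requireNamespace("nortest", quietly = TRUE)) {
  # Anderson-Darling p-values, one per column (nortest package)
  adp <- apply(x, MARGIN = 2, function(col) nortest::ad.test(col)$p.value)
  ad_rejected <- sum(adp < 0.05)
  both_rejected <- sum(adp < 0.05 & swp < 0.05)
  print(c(AD = ad_rejected, SW = sw_rejected, both = both_rejected))
} else {
  message("nortest is not installed; Shapiro-Wilk rejections: ", sw_rejected)
}
```

Counting the columns rejected by both tests, and not just each total, shows whether the two tests are flagging the same samples or different ones.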

Concluding Remarks

So, what does all this mean for day-to-day data analysis?

  • Analysis of continuous data (variable data, as opposed to yes/no or other attribute data) often uses tools that are meant for normally distributed data
  • Visualization alone isn’t sufficient to estimate the normality of a given set of data, and test-statistics are very important for this
  • R provides fast and convenient ways to assess samples for normality: the A-D test through the nortest package, and the S-W test through shapiro.test() in base R
  • A-D and S-W tests tend to perform similarly for most data sets (however, research is being done on this)
  • The re-usability of R code allows you to set up a macro rather quickly for performing normality tests and a range of studies on test data, in a time-efficient manner