Hypothesis Tests: 2 Sample Tests (A/B Tests)

Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other.

It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data.

Simple 2-sample t-test

A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them. So, any variation you see in the data is purely random variation. The alternative hypothesis in this test, is of course, that it isn’t only random variation we’re seeing, and that these samples come from completely different populations altogether.

H_o : \mu_1 = \mu_2    \newline    H_a: \mu_1 \neq \mu_2

What the populations from which X1 and X2 are taken may look like

What the populations from which X1 and X2 are taken may look like

We’ll generate two samples of data x_1, x_2 from two different normal distributions for the purpose of demonstration, since normality is a pre-requisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for estimating the median, such as the Moods-Median or Kruskall-Wallis test – which I’ll blog about another time). We also have to ensure that the standard deviations of the two samples of data we’re testing are comparable. I’ll also demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective.

library(nortest)
#Generating two samples of data
#100 points of data each
#Same standard deviation
#Different values of mean (of sampling distribution)
x<-read.csv(file = "x1x2.csv")
x1<-x[,2]
x2<-x[,3]

#Setting the global value of significance 
alpha = 0.1

#Histograms
hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = seq(7,15,0.25), main = "Histogram of x1 and x2", xlab = "x1, x2")
abline(v=10, col = "orange")
hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = seq(7,15,0.25), add = T)
abline(v=12, col = "purple")

#Running normality tests (just to be sure)
ad1<-ad.test(x1)
ad2<-ad.test(x2)

#F-test to compare two or more variances
v1<-var.test(x1,x2)

if(ad1$p.value>=alpha & ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")  
    }
    
  }
}

The first few lines in the code merely include the “nortest” package and invoke/generate the data sets we’re comparing. The nortest package contains the Anderson Darling Normality Test, which we have also covered in an earlier post. We can generate a histogram, to understand what x_1 and x_2 look like.

Histogram of X1 and X2 - showing the reference population mean lines

Histogram of X1 and X2 – showing the reference population mean lines

The overlapping histograms of x_1 and x_2 clearly indicate the difference in the central tendency, and the overlap is also visible. Subsequent code above covers an F-test. As explained earlier, equality of variances is a pre-requisite for the 2-sample t-test. Failing this would mean that we essentially have samples from two different populations, which have two different standard deviations.

Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is present in the “stats” package, runs, and provides us a result. Let’s look closely at how the t.test command is constructed. The arguments contain x_1 and x_2, which are our two samples for comparison. We provide the argument “paired = FALSE”, because these are not before/after samples. They’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of \alpha, or significance level.

Now that we’ve seen what the code does, let’s look at the results.

2-sample t-test results

2-sample t-test results

Evaluating The Results

Two sample t-test results should be evaluated in a similar way to 1-sample t-tests. Our decision is dependent on the p-value we see in the result, and the confidence interval of the difference between sample means.

Observe how the difference estimate lies on the negative side of the number line. Difference is calculated from the populations means 10 and 12, so we can clearly understand why this estimate of difference is negative. The estimates for mean values of x and y (in this case, x_1 and x_2) are also given. Naturally, the p-value that’s in the result, when compared to our generous \alpha of 0.1, is far lesser, and we can consider this to be a significant result (provided we have sufficient statistical power – and we’ll discuss this in another post). This indicates a significant difference between the two sets. If x_1 and x_2 were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for x_2 when compared to x_1.

Detecting a Specific Difference

Sometimes, you may want to evaluate a new product, and see if it performs at least x% better than the old product. For websites, for instance, you may be concerned with loading times. You may be concerned with code runtime, or with vehicle gas mileage, or vehicle durability, or some other aspect of performance. At times, the fortunes of entire companies depend on them producing faster, better products – that are known to be faster by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data.

The same example above can be modified slightly, to test for a specific difference. The only real difference we have to make here, of course, is the value of \Delta or difference. The t.test() command in R unfortunately isn’t very clear on this – it expects you to understand that you should use \mu for this. Once you get used to it, however, this little detail is fine, and it delivers the expected result.

if(ad1$p.value >=alpha & ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")  
    }
 }
}

The code above prints out different p-values, for different tests. The data used in these tests is the same, but by virtue of the different differences we want to detect between these samples, the p-values are different. Observe the results below:

Differences and how they influence p-value (same two samples of data)

Differences and how they influence p-value (same two samples of data)

Since the data was generated from two distributions that have means of 10 and 12 respectively for x_1 and x_2, we know from intuition that the difference is -2, and we should start seeing results that indicate no difference between the expected and observed difference at this value in the test. Therefore, the p-values in this scenario will be greater than the significance value, \alpha.

For other scenarios – when \Delta = -1, 0, 1, 2, we see that the p-values are clearly far below the significance of \alpha = 10%.

What’s important to remember therefore, is that contrary to what many people may think, there is no one or best p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference expected.

Concluding Remarks

A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and understand if one of them is performing significantly better than another. The test is strictly for data that fits the normality criteria, that also happen to have comparable standard deviations, and the results from it tend to be impacted heavily by the kind of hypothesis we use – for difference (which we explored here) and for one or two sided comparisons. We explored only the two sided comparisons here (and hence constructed a two sided confidence interval). When a business uses a 2-sample t-test, some of the arguments here, such as the values of confidence level, difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size. The smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size.

Confidence Intervals and t-tests in R

If you were to walk into a restaurant and order a cup of coffee, you’d expect to get a standard cup of the stuff, and you’d expect to get a sufficient quantity of it too. How much coffee for a certain price is too much, and how much is too little? More realistically, when you know that the same coffee shop may serve ever so slightly different volumes of coffee in the same cup, how can you quantify the coffee in the cup?

What Plausible Ranges of Gas Prices Do You Pay?

The same question may very well be asked at the gas station or diesel station that you fill up your car in. When you get there and shell out a few gallons or litres worth of money (depending on where you are), you can expect to pay a certain amount of money for each such gallon or litre. How do we determine what the range of expected prices for a gallon or litre of fuel is, based on already available data?

This is where the confidence interval comes in – and it is one of the most important tools of inferential statistics. Inferential statistics is the science of making decisions or informed generalizations about some data you have, based on some of the characteristics of this data. An important way to understand variability in any process or product’s performance is to ascribe a range of plausible values. Confidence intervals can be defined as the plausible ranges of values a population parameter may take, if you were to estimate it with a sample statistic. You’d hardly be expected by a statistician to be asked “What plausible range of gas prices do you pay on average?”, but this is, in fact, closer to the truth of interval estimation in statistics.

Confidence Levels and Sampling

Why the word confidence in confidence interval? Well, information costs you something to collect it. If you want to be 100 percent sure about the mean of petrol prices, for instance, you could literally collect data on every transaction from every pump in the world. In the real world, this is impossible, and we require sampling.

In the age of Big Data, it seems to be a taboo to talk about sampling sometimes. “You can collect all the data from a process”, some claim. That may be the case for a small minority of processes, but for the vast world out there, characterization is only possible by collecting and evaluating samples of data. And this approach entails the concept of a confidence level.

Confidence levels tell us the proportion of times we’re willing to be right (and wrong) about any parameter we wish to estimate from a sample of data. For example, if I measured the price per gallon of gas at every pump in Maine, or Tokyo, for a day, I’d get a lot of data – and that data would show me some general trends and patterns in the way the prices are distributed, or what typical prices seem to be in effect. If I expect to make an estimate of petrol prices in Tokyo or Maine next July, I couldn’t hope to do this with a limited sample such as this, however. Basic economics knowledge tells us that there could be many factors that could change these prices – and that they could very well be quite different from what they are now. And this is despite the quality of the data we have.

If I wanted to be 95% confident about the prices of petrol within a month from now, I could use a day’s worth of data. And to represent this data, I could use a confidence interval ( a range of values), of course. So, whether it is the quantity of coffee in your cup, or the price per gallon of fuel you buy, you can evaluate the broader parameters of your data sets, as long as you can get the data, using confidence intervals.

R Confidence Intervals Example

Now, let’s see a simple example in R, that illustrate confidence intervals.

#Generate some data - 100 points of data
#The mean of the data generated is 10,
#The standard deviation we've chosen is 1.0
#Data comes from the gaussian distribution

x<-rnorm(100,10,1.0)

#Testing an 80% confidence level
x80<-t.test(x,conf.level = 0.8)

#Testing a 90% confidence level
x90<-t.test(x,conf.level = 0.9)

#Testing a 99% confidence level
x99<-t.test(x,conf.level = 0.99)

x80
x90
x99

The first part of the code shows us how 100 points of data are used as a sample in this illustration.

In the next part, the t.test() command used here can be used to generate confidence intervals, and also test hypotheses you may have developed. In the above code, I’ve saved three different results, based on three different confidence levels – 80%, 90% and 95%. Typically, when you want to be more certain, but you don’t have more data, you end up getting a wider confidence interval. Expect more uncertainty if you have limited data, and more certainty, when you have more data, all other things being equal. Here are the results from the first test – the 80% confidence interval.

1-sample t-test and confidence interval (80% confidence)

1-sample t-test and confidence interval (80% confidence)

Let’s break down the results of the t-test. First, we see the data set the test was performed on, and then we see the t-statistics and also a p-value. Further, you can see a confidence interval (9.85,10.10) for the data, based on a sample size of 100, and a confidence level of 80%. Since the t-test is fundamentally a hypothesis test that uses confidence intervals to help you make your decision, you can also see an alternative hypothesis listed there. Then there’s the estimate of the mean – 9.98. Recall the characteristics of the normal distribution we used to generate the data in the first place – it had a mean \mu = 10.0 and a standard deviation \sigma = 1.

Confidence Levels versus Confidence Intervals

Summarily, we can see the confidence intervals and mean estimates of the remaining two confidence intervals also. For ease, I’ll print them out from storage.

CIs comparison (80%,90%, 99%)

CIs comparison (80%,90%, 99%)

Observe how, for the same data set, confidence intervals (plausible ranges of values for the mean of the data) are different, depending on how the confidence level changes. The confidence intervals widen as the confidence level increases. The 80% CI is calculated to be (9.85, 10.10) while the same sample yields (9.72, 10.24) when the CI is calculated at 99% confidence level. When you consider that standard deviations can be quite large, what confidence level you use in your calculations, could actually become a matter of importance!

More on how confidence intervals are calculated here, at NIST.

The 1-sample t-Test

We’ve seen earlier that the command that is invoked to calculate these confidence intervals is actually called t.test(). Now let’s look at what a t-test is.

In inferential statistics, specifically in hypothesis testing, we test samples of data to determine if we can make a generalization about them. When you collect process data from a process you know, the expectation is that this data will look a lot like what you think it should look like. If this is data about how much coffee, or how much you paid for a gallon of gasoline, you may be interested in knowing, for instance, if the price per gallon is any different here, compared to some other gas station. One way to find this out – with statistical certainty – is through a t-test. A t-test fundamentally tells us whether we fail to reject a default hypothesis we have (also called null hypothesis and denoted as H_0 ), or if we reject the default hypothesis and embrace an alternative hypothesis (denoted by H_a). In case of the 1-sample t-test, therefore:

H_o : \mu = k  \newline  H_a : \mu \neq k

Depending on the nature of the alternative hypothesis, we could have inequalities there as well. Note that k here is the expectation you have about what the mean ought to be. If you don’t supply one, t.test will calculate a confidence interval and produce a mean estimate.

A t-test uses the t-distribution, which is a lot like the Gaussian or normal distribution, except that it uses an additional parameter – which directly relates to the sample size of your data. Understandably, the size of the sample could give you very different results in a t-test.

As with other hypothesis tests, you also get a p-value. The null hypothesis in a 1-sample t-test is relatively straightforward – that there is no difference between the mean of the sample in question, and the population’s mean. Naturally, the alternative of this hypothesis could help us study whether the population mean is less than expected (less expensive gas!) or greater (more expensive gas!) than the expectation.

Let’s take another look at that first t-test result:

1-sample t-test and confidence interval (80% confidence)

1-sample t-test and confidence interval (80% confidence)

The confidence level we’ve calculated here is an 80% confidence interval, which translates to a 20% significance. To interpret the results above, we compare the value of p, with the significance, and reject the null hypothesis is p is smaller. But what are the t-statistic and df? The t-statistic here is calculated as the critical value of t, based on a confidence level of 80%, the sample mean and standard deviation, and of course, the fact that we have 100 points of data. (The “df” here stands for degrees of freedom – which stands at 99, calculated from the 100 data points we have and the 1 parameter we’re estimating.)

Alternative Hypotheses Inequalities

The t.test() command also allows us to evaluate how the confidence intervals look, and what the p-values are, when we have different alternative hypotheses. We can test for the population mean that’s being estimated to be less than, or greater than the expected value. We can also specify what our expected values of mean are.


x80<-t.test(x,conf.level = 0.8, mu = 10.1,alternative = "less" )

x80

2015-08-11 23_59_15-Jump List for VLC media player

Observe how the p-value, confidence intervals have changed.

We’ve evaluated the same 80% confidence intervals, with different expected values of the mean of \mu = 10.1, and the alternative hypothesis is that this mean \mu<10.1.

Concluding Remarks

When evaluating data to draw conclusions from it, it helps to construct confidence intervals. These tell us general patterns in the data, and also help us estimate variability. In real-life situations, using confidence intervals and t-tests to estimate the presence or absence of a difference between expectation and estimate is valuable. Often, this is the lifeblood of data-driven decision making when dealing with lots of data, and when coming to impactful conclusions about data. R’s power in quickly generating confidence intervals becomes quite an ally, in the right hands – and of course, if you’ve collected the right data.