# Two Way ANOVA in R

## Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

• How do we answer questions in the real world, as to which route from home to work on your daily commute is easier, or
• How would you know which air-conditioner to choose out of a bunch that you’re evaluating in various climates?
• If you were dealing with a bunch of suppliers, and wanted to compare their process results all at the same time, how would you do it?
• If you had three competing designs for a system or an algorithm, and wanted to understand whether one of them was significantly better than the others, how would you do that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA is where you have discrete factors whose effect on a continuous (variable) result you want to understand.

## The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means estimated based on the sample means. What we’re trying to understand here is the effect of a change in the level of one factor, on the response. The term “Analysis of Variance” for ANOVA is therefore a misnomer for many new to inferential statistics, since it is a test that compares means.

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

$H_0 : \mu_1 = \mu_2 = \mu_3$

While Ha could be:

$H_a : \mu_1 \neq \mu_2 = \mu_3, or$
$H_a : \mu_1 = \mu_2 \neq \mu_3, or$
$H_a : \mu_1 \neq \mu_2 = \mu_3$

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.

## Travel Time Problem

Let’s look at a simple data set which has travel time data organized by day of the week and route. Assume you’ve been monitoring data from many people travelling a certain route, between two points, and you’re trying to understand whether the time taken for the trip is more dependent on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of a problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.

## Two Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous variables (by each factor’s levels) to normally distributed. Additionally, ANOVA results are contingent upon an equal variance assumption for the samples being compared too. I’ve demonstrated in an earlier post how the normality and variance tests can be run prior to a hypothesis test for variable data.

The code below first pulls data from a data set into variables, and constructs a linear ANOVA model after the normality and variance tests. For normality testing, we’re using the Shapiro-Wilk test, and for variance testing, we’re using the bartlett.test() command here, which is used to compare multiple variances.

#Reading the dataset
str(Dataset)

#Shapiro-Wilk normality tests by Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){ cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}
cat("Normality p-values by Factor Route: ")

#Shapiro-Wilk normality tests by Route
for (i in unique(factor(Dataset$Route))){ cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}

#Variance tests for Day and Route factors
bartlett.test(Time~Day,data = Dataset )
bartlett.test(Time~Route,data = Dataset )

#Creating a linear model
#The LM tells us both main effects and interactions
l <- lm(Time~ Day + Route + Day*Route , Dataset)
summary(l)

#Running and summarizing a general ANOVA on the linear model
la <- anova(l)
summary(la)

#Plots of the linear model and Cook's Distance
plot(cooks.distance(l),
main = "Cook's Distance for linear model", xlab =
"Travel Time (observations)", ylab = "Cook's Distance")
plot(l)
plot(la)


Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571



The code also calculates Cook’s distance, which is an important concept in linear models. When trying to understand any anomalous terms in the model, we can refer to the Cook’s distance to understand whether those terms have high leverage in the model, or not. Removing a point with high leverage could potentially affect the model results. Equally, if your model isn’t performing well, it may be worth looking at Cook’s distance.

Cook’s Distance for our data set, visualized

Cook’s distance, explained by its importance to leverage in the model.

When looking at the graphs produced by lm, we can understand the how various points in the model have different values of Cook’s distance, and we also understand their relative leverages. This is also illustrated in the Normal Quantile-Quantile plot below, where you can see observations #413 and #415 that have large values, among others.

Normal QQ plot of data set showing high leverage points (large Cook’s Distance)

## ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
Min      1Q  Median      3Q     Max
-20.333  -4.646   0.516   4.963  19.655

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .
DayThu         2.69221    2.00280   1.344   0.1795
DayTue        -0.43574    1.90618  -0.229   0.8193
DayWed        -0.01149    2.00280  -0.006   0.9954
RouteB        -1.02130    1.89302  -0.540   0.5898
RouteC        -1.83131    1.85736  -0.986   0.3246
DayMon:RouteB  2.91785    2.71791   1.074   0.2836
DayThu:RouteB  0.39335    2.63352   0.149   0.8813
DayTue:RouteB  3.44554    2.64247   1.304   0.1929
DayWed:RouteB  1.23796    2.65761   0.466   0.6416
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *
DayThu:RouteC  0.24255    2.73148   0.089   0.9293
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .
DayWed:RouteC  1.95253    2.68823   0.726   0.4680
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291


While the model above indicates the effect of each factor on the response, it doesn’t compute the f-statistic, which is by taking into consideration the “within” and “between” variations in the samples of data. The ANOVA mean squares and sum of squares approach does exactly this, which is why the results from that are more relevant here. And the summary below is the ANOVA model itself, in the ANOVA table:

Analysis of Variance Table

Response: Time
Df  Sum Sq Mean Sq F value  Pr(>F)
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376
Day:Route   8   393.9  49.237  0.8779 0.53492
Residuals 485 27201.3  56.085
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


By looking at the results above, it is clear that the Day factor has a statistically significant impact on the Travel time. (Upon closer observation, you’ll see that one of those means that was different from others, was the mean for Monday! This is clearly an example inspired by Monday rush hour traffic!)

When reading results from the lm and anova commands, it is important to note that R indicates the results using significance codes. A small p-value indicates a significant result, and the relative significance of different factors is indicated by assigning different symbols to them. For instance, two asterixes (**) are used when we get a p-value of < 0.001. Depending on the nature of your experiment, you can choose your significance level and understand the results in a comparison of these p-values with significance. Also, in this specific case, the Route factor seems to have an insignificant impact on the response.

## Interactions

When we have two or more terms in a model which function as inputs to a response variable, we also need to evaluate whether a change in both variables causes a different effect on the response, as opposed to fixing one and changing the other. This is referred to as an interaction, and interaction effects are taken into account in our model. Once again, the p-values for the interactions can inform us about the relative importance of different interactions.

## Concluding Remarks

We’ve seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook’s distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time, it is important to evaluate the model’s veracity and validity to what we want to understand. In this context, understanding the significance of our results (statistically and practically) is necessary. Fortunately, the linear modeling packages in R are very convenient for such modeling, and incorporate lots of added functionality, like being able to call plots on them by simply using the plot command on a saved model. This functionality really comes into its own, when you make and test many different models and want to compare results.

# Simple Outlier Detection in R

Outliers are points in a data set that lie far away from the estimated value of the centre of the data set. This estimated centre could be either the mean, or median, depending on what kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” that you might observe in a data set, and therefore hold importance. Outlier detection is an important aspect of machine learning algorithms of any sophistication. Because of the fact that outliers can throw off a learning algorithm or deflate an assumption about the data set, we have to be able to identify and explain the outliers in data sets, if the need arises. I’ll only cover the basic R commands here to do outlier detection, but it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).

## Graphical Approaches to Outlier Detection

Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.

y <-read.csv("y.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")


The Shapiro-Wilk test used above is used to check for the normality of a data set. Normality assumptions underlie outlier detection hypothesis tests. In this case, with p-values of 0.365 and 0.399 respectively and sample sizes of 30 and 1000, both samples y and ylarge seem to be normally distributed.

Box plot of y (no real outliers observed as per graph)

Boxplot of ylarge – a few outlier points seem to be present in graph

Histogram of y

Histogram of ylarge

Normal QQ Plot of Y

Normal QQ Plot of ylarge

The graphical analysis tells us that there could possibly be outliers in our data set ylarge, which is the larger data set out of the two. The normal probability plots also seem to indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.

## Dixon and Chi Squared Tests for Outliers

The Dixon test and Chi-squared tests for outliers (PDF) are statistical hypothesis tests used to detect outliers in given sample sets. Bear in mind though, that this Chi-squared test for outliers is very different from the better known Chi-square test used for comparing multiple proportions. The Dixon tests makes a normality assumption about the data, and is used generally for 30 points or less. The Chi-square test on the other hands makes variance assumptions, and is not sensitive to mild outliers if variance isn’t specified as an argument. Let’s see how these tests can be used for outliers detection.

library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
dixon.test(ylarge[,2],opposite = TRUE)
dixon.test(ylarge[,2],opposite = FALSE)

#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)


In each of the Dixon and Chi-Squared tests for outliers above, we’ve chosen both options TRUE and FALSE in turn, for the argument opposite. This argument helps us choose between whether we’re testing for the lowest extreme value, or the highest extreme value, since outliers can lie to both sides of the data set.

Sample output is below, from one of the tests.

> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier



When you closely observe the p-values of these tests alone, you can see the following results:

P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04



The p-values here (taken with an indicative 5% significance) may imply that the possibility that the extreme values in ylarge are outliers. This may or may not be true, of course, since in inferential statistics, we always state the chance of error. And in this case, we can conclude that there is a very small chance that those extreme values we see in ylarge are actually typical in that data set.

## Concluding Remarks

We’ve seen the graphical outlier detection approaches and also have seen the Dixon and Chi-square tests. The Dixon test is newer, but isn’t applicable to large data sets, for which we need to use the Chi-square test for outliers and other tests. In machine learning problems, we often have to be able to explain some of the values, from a training perspective for neural networks, or be able to deal with lower resolution models such as least squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand the underlying causes for them. Continual improvement professionals generally regard outliers with importance. Statistically, the chance of getting extreme results (extremely good ones and extremely poor ones) is exciting in process excellence and continuous improvement, because they could represent benchmark cases, or worst case scenarios. Either way, outlier detection is an immensely useful activity applicable to different statistical situations business.

# Comparing Non-Normal Data Graphically and with Non-Parametric Tests

Not all data in this world is predictable in the exact same way, of course, and not all data can be modeled using the Gaussian distribution. There are times, when we have to make comparisons about data using one of many distributions that represent data which could show different patterns other than the familiar and comforting “bell curve” of the normal distribution pattern we’re used to seeing in business presentations and the media alike. For instance, here’s data from the Weibull distribution, plotted using different shape and scale parameters. A Weibull distribution has two parameters, shape and scale, which determine how it looks (which varies widely), and how spread out it is.


shape <- 1
scale <- 5
x<-rweibull(1000000,shape,scale)
hist(x, breaks = 1000, main = paste("Weibull Distribution with shape: ",shape,", and scale: ",scale))
abline (v = median(x), col = "blue")
abline (v = scale, col = "red")


Shape = 1; Scale = 5. The red line represents the scale value, and the blue line, the median of the data set.

Here’s data from a very different distribution, which has a scale parameter of 100.

Shape = 1; Scale = 100. Same number of points. The red and blue lines mean the same things here too.

The shape parameter, as can be seen clearly here, is called so for a good reason. Even when the scale parameter changes wildly (as in our two examples), the overall geometry of our data looks similar – but of course, it isn’t. The change in the scale parameter has changed the probability of an event $x ->0$ towards the lower end of the x range (closer to zero), compared to an event $x>>>0$ further away. When you superimpose these distributions and their medians, you can get a very different picture of them.

If we have two very similar data sets like the data shown in the first graph and the data in the second, what kinds of hypothesis tests can we use? It is a pertinent question, because at times, we may not know that a data set may represent a process that can be modeled by a specific kind of distribution. At other times, we may have entirely empirical distributions represented by our data. And we’d still want to make comparisons using such data sets.

shape <- 1
scale1 <- 5
scale2<-scale1*2
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different shape parameters: ",shape/100,", ", shape))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = scale, col = "red")



Different scale parameters. Red and blue lines indicate medians of the two data sets.

The Weibull distribution is known to be quite versatile, and can at times be used to approximate the Gaussian distribution for real world data. An example of this is the use of the Weibull distribution to approximate constant failure rate data in engineering systems. Let’s look at data from a different pair of distributions with a different shape parameter, this time, 3.0.

shape <- 3
scale1 <- 5
scale2<-scale1*1.1 #Different scale parameter for the second data set
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")


Weibull distribution data – different because of scale parameters. Vertical lines indicate medians.

The medians can be used to illustrate the differences between the data, and summarize the differences observed in the graphs. However, when we know that a data set is non-normal, we can adopt non-parametric methods from the hypothesis testing toolbox in R. Like hypothesis tests for normally distributed data that have comparable means, we can compare the medians of two or more samples of non-normally distributed data. Naturally, the same conditions – of larger samples of data being better, at times, apply. However, the tests can help us analytically differentiate between two similar-looking data sets. Since the Mann-Whitney median test and other non-parametric tests don’t make assumptions about the parameters of the underlying distribution of the data, we can rely on these tests to a greater extent when studying the differences between samples that we think may have a greater chance of being non-normal (even though the normality tests may say otherwise).

Non-parametrics and the inferential statistics approach: how to use the right test

When we conduct the AD test for normality on the two samples in question, we can see how these samples return a very low p-value each. This can also be confirmed using the qqnorm plots.

Let’s use the Mann-Whitney test for two medians from samples of non-normal data, to assess the difference between the median values. We’ll use a smaller sample size for both, and use the wilcox.test() command. For two samples, the wilcox.test() command actually performs a Mann-Whitney test.

shape <- 3
scale1 <- 5
scale2<-scale1*1.01
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
library(nortest)
paste("Normality test p-values: Sample 'x' ",ad.test(x)$p.value, " Sample 'xprime': ", ad.test(xprime)$p.value)

hist(x, breaks = length(x)/10, border = rgb(0.9,0.2,0.2,0.05), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = length(xprime)/10, border = rgb(0.2,0.9,0.2,0.05), col = rgb(0.2,0.2,0.9,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")
wilcox.test(x,xprime)
paste("Median 1: ", median(x),"Median 2: ", median(xprime))


Observe how close the scale parameters of both samples are. We’d expect both samples to overlap, given the large number of points in each sample. Now, let’s see the results and graphs.

Nearly overlapping histograms for the large non-normal samples

The results for this are below.

Mann-Whitney test results

The p-value here (for this considerable sample size) clearly illustrates the present of a significant difference. A very low p-value in this test result indicates that, if we were to make the assumption that the medians of these data sets are equal, there would be an extremely small probability, that we would see samples as extreme as observed in these samples. The fine difference in the medians observed in the median results can also be picked up in this test.

To run the Mann-Whitney test with a different confidence level (or significance), we can use the following syntax:

wilcox.test(x,xprime, conf.level = 0.95)


Note 1 : The mood.test() command in R performs a two-samples test of scale. Since the scale parameters in these samples of data we generated (for the purposes of the demo) are well known, in real life situations, the p-value should be interpreted based on additional information, such as the sample size and confidence level.

Note 2:  The wilcox.test() command performs the Mann Whitney test. This is a comparison of mean ranks, and not of the medians per se.

# Power, Difference and Sample Sizes

In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses – the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision making activity, where you reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from some other fleet? When you collect samples of data, you can compare the average values of the samples, and arrive at some inference from this information, about the population. Since sample sizes tend to be affected by variability, we ought to be interested in how much data to actually collect.

## Hypothesis Tests

Statistically speaking, when we collect a small sample of data and calculate its standard deviation, we tend to get a larger estimate of standard deviation or a smaller estimate of standard deviation from the actual standard deviation. The relationship between the sample size and the standard deviation of a sample is described by the term standard error. This standard error is an integral part of how a confidence interval is calculated for variable data. For smaller samples, the difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be small for large sample sizes. There is an intuitive way to think about this. If you have more information, you can make a better guess at a characteristic of the population the information is coming from.

Let’s take another example: Motor racing. Motor racing lap times are generally recorded with extremely high precision and accuracy. If we had a sample with three times and wanted to estimate lap times for a circuit, we’d probably do okay, but have a wider range of expected lap times. On the contrary, if we had a number of lap time records, we could more accurately calculate the confidence intervals for the mean value. We could even estimate the probability that a particular race car driver could clock a certain time, if we were able to understand the distribution that is the closest model of the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these risks $\alpha$ and $\beta$ risks.

Hypothesis tests, alpha and beta risks

## Type I & Type II Errors and Power

To explain it simply, the chance of erroneously rejecting the null hypothesis $H_0$ is referred to as the Type I error (or $\alpha$). Similarly, the chance of erroneously accepting the null hypothesis (if the reality was different from what the null hypothesis stated) is called the Type II error (or $\beta$ risk). Naturally, we want both these errors to have very low probability in our experiments.

How do we determine if our statistical model is powerful enough, therefore, to avoid both kinds of risks? We use two statistical terms, significance (known here as $\alpha$, as in $\alpha$ risk), and power (known sometimes as $\pi$, but more commonly known as $1-\beta$, as we see from the illustration above).

It turns out that the statistical power is heavily dependent on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of difference in fuel efficiency) is diminished greatly. In truth, you may receive a result that gives you a p-value (also discussed in an earlier post) that is greater than the significance. Remember that this is now a straightforward comparison between the p-value of the test and what we know now as $\alpha$. However, note how in the results interpretation of our hypothesis test, we didn’t yet consider $1-\beta$. Technically, we should have. And this is what causes so many spurious results, because false positives end up getting ignored, leading to truth inflation.

Very often in data-driven businesses, the question of “how many samples is good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect, isn’t easy. Going back-and-forth to collect samples of data in order to do your hypothesis tests is helpful – primarily because you can see the effects of sample size in your specific problem, practically.

## A Note on Big Data

Big Data promises us what a lot of statisticians didn’t have in the past – the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the trenchant infrastructure to make the call to collect, store and analyze such data. While Big Data security is an as yet incompletely answered question, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email account providers on the web that also double up as social networks. Where does this fit in? Well, when building such systems, we have neither the trust of a large number of people, nor the information we require – which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, and when companies have to build systems that handle this kind of information well, they have to experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post will be relevant in such contexts.

## Summary

1. Power: The power of a test is defined as the ability to correctly reject the null hypothesis of a test. As we’ve described it above, it is defined by $1-\beta$, where $\beta$ is the chance of incorrectly accepting the default or null hypothesis $H_0$.
2. Standard deviation ($\sigma$ ): The more variation we observe in any given process, the greater our target sample size should be, for achieving the same power, and if we’re detecting the same difference in performance, statistically. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in the $\sigma$ can affect the number of data points we need to collect to arrive at a result with sufficient power.
3. Significance ($\alpha$ ): As discussed in the earlier post on normality tests, significance of a result can have an impact on how much data we need to collect. If we have a wider margin for error, or greater margin for error in our decisions, we ought to settle for a larger significance value, perhaps of 10% or 15%. The de-facto norm in many industries and processes (for better or for worse, usually for worse) is to use $\alpha = 0.05$. While this isn’t always a bad thing, it encourages blind adherence and myths to propagate, and prevents statisticians from thinking about the consequences of their assumptions. In reality, $\alpha$ values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results.
4. Sample size ($n$): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test (and by that I mean, the test has a greater ability to detect a false positive).
5. Difference ($\Delta$): The greater the difference you want to be able to detect between two sets of data (proportions or means or medians), the smaller the sample size you need. This is an intuitively easy thing to understand – like testing a HumVee for gas mileage versus a Ford Focus – you need only a few trips (a small sample size) to tell a real difference, as opposed to if you were to test two compact cars against each other (when you may require a more rigorous testing approach).

# Hypothesis Tests: 2 Sample Tests (A/B Tests)

Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other.

It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data.

## Simple 2-sample t-test

A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them. So, any variation you see in the data is purely random variation. The alternative hypothesis in this test, is of course, that it isn’t only random variation we’re seeing, and that these samples come from completely different populations altogether.

$H_o : \mu_1 = \mu_2 \newline H_a: \mu_1 \neq \mu_2$

What the populations from which X1 and X2 are taken may look like

We’ll generate two samples of data $x_1, x_2$ from two different normal distributions for the purpose of demonstration, since normality is a pre-requisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for estimating the median, such as the Moods-Median or Kruskall-Wallis test – which I’ll blog about another time). We also have to ensure that the standard deviations of the two samples of data we’re testing are comparable. I’ll also demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective.

library(nortest)
#Generating two samples of data
#100 points of data each
#Same standard deviation
#Different values of mean (of sampling distribution)
x1<-x[,2]
x2<-x[,3]

#Setting the global value of significance
alpha = 0.1

#Histograms
hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = seq(7,15,0.25), main = "Histogram of x1 and x2", xlab = "x1, x2")
abline(v=10, col = "orange")
hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = seq(7,15,0.25), add = T)
abline(v=12, col = "purple")

#Running normality tests (just to be sure)

#F-test to compare two or more variances
v1<-var.test(x1,x2)

if(ad1$p.value>=alpha & ad2$p.value>=alpha){
if(v1$p.value>=alpha){ #Running a 2-sample t-test for (i in c(-2,-1,0,1,2)){ temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i ) cat("Difference= ",i,"; p-value:",temp$p.value,"\n")
}

}
}

The first few lines in the code merely include the “nortest” package and invoke/generate the data sets we’re comparing. The nortest package contains the Anderson Darling Normality Test, which we have also covered in an earlier post. We can generate a histogram, to understand what $x_1$ and $x_2$ look like.

Histogram of X1 and X2 – showing the reference population mean lines

The overlapping histograms of $x_1$ and $x_2$ clearly indicate the difference in the central tendency, and the overlap is also visible. Subsequent code above covers an F-test. As explained earlier, equality of variances is a pre-requisite for the 2-sample t-test. Failing this would mean that we essentially have samples from two different populations, which have two different standard deviations.

Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is present in the “stats” package, runs, and provides us a result. Let’s look closely at how the t.test command is constructed. The arguments contain $x_1$ and $x_2$, which are our two samples for comparison. We provide the argument “paired = FALSE”, because these are not before/after samples. They’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of $\alpha$, or significance level.

Now that we’ve seen what the code does, let’s look at the results.

2-sample t-test results

## Evaluating The Results

Two sample t-test results should be evaluated in a similar way to 1-sample t-tests. Our decision is dependent on the p-value we see in the result, and the confidence interval of the difference between sample means.

Observe how the difference estimate lies on the negative side of the number line. Difference is calculated from the populations means 10 and 12, so we can clearly understand why this estimate of difference is negative. The estimates for mean values of x and y (in this case, $x_1$ and $x_2$) are also given. Naturally, the p-value that’s in the result, when compared to our generous $\alpha$ of 0.1, is far lesser, and we can consider this to be a significant result (provided we have sufficient statistical power – and we’ll discuss this in another post). This indicates a significant difference between the two sets. If $x_1$ and $x_2$ were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for $x_2$ when compared to $x_1$.

## Detecting a Specific Difference

Sometimes, you may want to evaluate a new product, and see if it performs at least x% better than the old product. For websites, for instance, you may be concerned with loading times. You may be concerned with code runtime, or with vehicle gas mileage, or vehicle durability, or some other aspect of performance. At times, the fortunes of entire companies depend on them producing faster, better products – that are known to be faster by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data.

The same example above can be modified slightly, to test for a specific difference. The only real difference we have to make here, of course, is the value of $\Delta$ or difference. The t.test() command in R unfortunately isn’t very clear on this – it expects you to understand that you should use $\mu$ for this. Once you get used to it, however, this little detail is fine, and it delivers the expected result.

if(ad1$p.value >=alpha & ad2$p.value>=alpha){
if(v1$p.value>=alpha){ #Running a 2-sample t-test for (i in c(-2,-1,0,1,2)){ temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i ) cat("Difference= ",i,"; p-value:",temp$p.value,"\n")
}
}
}


The code above prints out different p-values, for different tests. The data used in these tests is the same, but by virtue of the different differences we want to detect between these samples, the p-values are different. Observe the results below:

Differences and how they influence p-value (same two samples of data)

Since the data was generated from two distributions that have means of 10 and 12 respectively for $x_1$ and $x_2$, we know from intuition that the difference is -2, and we should start seeing results that indicate no difference between the expected and observed difference at this value in the test. Therefore, the p-values in this scenario will be greater than the significance value, $\alpha$.

For other scenarios – when $\Delta = -1, 0, 1, 2$, we see that the p-values are clearly far below the significance of $\alpha = 10%$.

What’s important to remember therefore, is that contrary to what many people may think, there is no one or best p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference expected.

## Concluding Remarks

A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and understand if one of them is performing significantly better than another. The test is strictly for data that fits the normality criteria, that also happen to have comparable standard deviations, and the results from it tend to be impacted heavily by the kind of hypothesis we use – for difference (which we explored here) and for one or two sided comparisons. We explored only the two sided comparisons here (and hence constructed a two sided confidence interval). When a business uses a 2-sample t-test, some of the arguments here, such as the values of confidence level, difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size. The smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size.