Introduction
The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.
- How do you decide which route on your daily commute from home to work is faster?
- How would you choose an air-conditioner from several models you're evaluating across different climates?
- If you were dealing with a number of suppliers and wanted to compare their process results all at the same time, how would you do it?
- If you had three competing designs for a system or an algorithm, how would you determine statistically whether one of them is significantly better than the others?
ANOVA answers these kinds of questions: it helps us discover whether we have clear reasons to choose a particular alternative over many others, or whether exceptional performance (good or bad) depends on a particular factor.
We discussed linear models earlier, and ANOVA is indeed a kind of linear model. The difference is that in ANOVA, you have discrete factors whose effect on a continuous response you want to understand.
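To make that connection concrete, here is a minimal sketch of the one-way case on simulated data (the variable names and group means here are invented for illustration): a linear model with a single factor predictor, and the ANOVA table derived from it.

# Minimal sketch: ANOVA as a linear model with a discrete factor.
# The data and the 'machine' factor are simulated for illustration.
set.seed(42)
machine <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = c(10, 12, 11)[machine])   # three group means

fit <- lm(y ~ machine)   # a linear model with a factor predictor...
anova(fit)               # ...is exactly a one-way ANOVA on that factor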
The ANOVA Hypothesis
The ANOVA hypothesis is usually explained as a comparison of population means, estimated from the sample means. What we're trying to understand is the effect of a change in the level of one factor on the response. The name "Analysis of Variance" therefore seems like a misnomer to many people new to inferential statistics, since it is a test that compares means; the name refers to how the comparison is made, by partitioning the variance in the data into "between-group" and "within-group" components.
A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is:

H0: μ1 = μ2 = μ3

While Ha could be:

Ha: at least one μi (i = 1, 2, 3) differs from the others
The Two-Way ANOVA problem can therefore be understood as a study of the impact of two different factors (and their associated levels) on the response.
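For completeness, a Two-Way ANOVA actually tests three null hypotheses at once; in standard textbook notation, with αi and βj denoting the main effects of the two factors and (αβ)ij their interaction:

H0 (first factor): α1 = α2 = … = αa = 0
H0 (second factor): β1 = β2 = … = βb = 0
H0 (interaction): (αβ)ij = 0 for all i, j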
Travel Time Problem
Let's look at a simple data set that has travel time data organized by day of the week and route. Assume you've been monitoring travel times for many people commuting between two points, and you're trying to understand whether the time taken for the trip depends more on the day of the week or on the route taken. A Two-Way ANOVA is a great way to solve this kind of problem.
The data for this problem is organized with one observation per row, as sketched below. We're essentially constructing a linear model that explains the relationship between the "response" or "output" variable Time and the factors Day and Route.
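The column names below match the code that follows, but the rows shown are made up, and the time units are assumed to be minutes:

# Illustrative structure of the travel-time data set (values are invented)
Dataset <- data.frame(
  Day   = factor(c("Mon", "Mon", "Tue", "Wed", "Thu", "Fri")),
  Route = factor(c("A", "B", "A", "C", "B", "C")),
  Time  = c(58.2, 61.5, 54.9, 52.3, 57.1, 53.8)  # travel time, assumed minutes
)
str(Dataset)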
Two-Way ANOVA in R
ANOVA is a hypothesis test that requires the continuous response (within each factor's levels) to be normally distributed. Additionally, ANOVA results are contingent on the samples being compared having equal variances. I've demonstrated in an earlier post how normality and variance tests can be run prior to a hypothesis test on continuous data.
The code below first pulls the data from a data set into variables, and constructs a linear ANOVA model after the normality and variance tests. For normality testing we use the Shapiro-Wilk test, and for variance testing we use the bartlett.test() command, which compares multiple variances.
# Reading the data set
Dataset <- read.csv("Traveldata.csv")
str(Dataset)

# Shapiro-Wilk normality tests on Time, by each level of Day
cat("Normality p-values by Factor Day: ")
for (i in unique(Dataset$Day)) {
  cat(shapiro.test(Dataset[Dataset$Day == i, ]$Time)$p.value, " ")
}
cat("\n")

# Shapiro-Wilk normality tests on Time, by each level of Route
cat("Normality p-values by Factor Route: ")
for (i in unique(Dataset$Route)) {
  cat(shapiro.test(Dataset[Dataset$Route == i, ]$Time)$p.value, " ")
}
cat("\n")

# Variance tests for the Day and Route factors
bartlett.test(Time ~ Day, data = Dataset)
bartlett.test(Time ~ Route, data = Dataset)

# Creating a linear model; Day * Route expands to
# Day + Route + Day:Route (both main effects and the interaction)
l <- lm(Time ~ Day * Route, data = Dataset)
summary(l)

# Running an ANOVA on the linear model and printing the ANOVA table
la <- anova(l)
la

# Cook's distance, followed by the standard lm diagnostic plots
plot(cooks.distance(l), main = "Cook's Distance for linear model",
     xlab = "Travel Time (observations)", ylab = "Cook's Distance")
plot(l)
Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

        Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

        Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

Both p-values are well above 0.05, so there is no evidence against the equal-variance assumption for either factor.
The code also calculates Cook's distance, an important concept in linear models. When trying to understand anomalous observations, we can refer to Cook's distance to see whether those points have high influence on the fitted model. Removing a point with high influence could noticeably change the model results. Equally, if your model isn't performing well, it may be worth looking at Cook's distance, as in the sketch below.
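As a rough illustration, one common heuristic is to flag observations whose Cook's distance exceeds 4/n; the cutoff is a convention rather than a hard rule, so treat the flagged rows as candidates for inspection, not for deletion:

# Flag potentially influential observations (4/n is a common rule of thumb)
cd <- cooks.distance(l)
influential <- which(cd > 4 / length(cd))
Dataset[influential, ]   # inspect these rows before deciding anything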
When looking at the diagnostic graphs produced by plotting the lm model, we can see how various points in the model have different values of Cook's distance, and we also get a sense of their relative leverages. This is also illustrated in the Normal Quantile-Quantile plot below, where you can see observations #413 and #415, among others, with large values.
ANOVA Results
A summary of the lm command's result is shown below.
Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
     Min       1Q   Median       3Q      Max
-20.333   -4.646    0.516    4.963   19.655

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .
DayThu         2.69221    2.00280   1.344   0.1795
DayTue        -0.43574    1.90618  -0.229   0.8193
DayWed        -0.01149    2.00280  -0.006   0.9954
RouteB        -1.02130    1.89302  -0.540   0.5898
RouteC        -1.83131    1.85736  -0.986   0.3246
DayMon:RouteB  2.91785    2.71791   1.074   0.2836
DayThu:RouteB  0.39335    2.63352   0.149   0.8813
DayTue:RouteB  3.44554    2.64247   1.304   0.1929
DayWed:RouteB  1.23796    2.65761   0.466   0.6416
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *
DayThu:RouteC  0.24255    2.73148   0.089   0.9293
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .
DayWed:RouteC  1.95253    2.68823   0.726   0.4680
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,   Adjusted R-squared:  0.01679
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291
While the model summary above indicates the effect of each term on the response, it doesn't show the per-factor F-statistics, which are computed by comparing the "between" and "within" variations in the samples of data. The ANOVA sum of squares and mean squares approach does exactly this, which is why those results are more relevant here. A sketch of the computation is below, followed by the ANOVA table itself:
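To see what "between" and "within" mean, here is a sketch that computes the one-way F-statistic for the Day factor by hand. Note that it ignores Route and the interaction, so its residual term (and hence its F value) will differ from the two-way table that follows; it should instead match anova(lm(Time ~ Day, Dataset)).

# Sketch: the one-way F-statistic for Day, computed from first principles
grand <- mean(Dataset$Time)
means <- tapply(Dataset$Time, Dataset$Day, mean)    # per-day means
n_i   <- tapply(Dataset$Time, Dataset$Day, length)  # per-day sample sizes

ss_between <- sum(n_i * (means - grand)^2)                 # between-day variation
ss_within  <- sum((Dataset$Time - means[Dataset$Day])^2)   # within-day variation

df_between <- length(means) - 1
df_within  <- nrow(Dataset) - length(means)

F_stat <- (ss_between / df_between) / (ss_within / df_within)
F_stat   # compare with the Day row of anova(lm(Time ~ Day, Dataset))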
Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value   Pr(>F)
Day         4   823.3 205.830  3.6700  0.00588 **
Route       2    46.0  23.005  0.4102  0.66376
Day:Route   8   393.9  49.237  0.8779  0.53492
Residuals 485 27201.3  56.085
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Looking at the results above, it is clear that the Day factor has a statistically significant impact on the travel time. (On closer observation, you'll see that the mean that differed from the others was the mean for Monday! This is clearly an example inspired by Monday rush-hour traffic!)
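If you want to confirm which day's mean stands out, the ANOVA itself only says that some mean differs; a post-hoc pairwise comparison such as Tukey's HSD is one standard follow-up, sketched here:

# Post-hoc comparison: which pairs of days actually differ?
# Tukey's HSD adjusts the p-values for multiple pairwise comparisons
TukeyHSD(aov(Time ~ Day, data = Dataset))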
When reading results from the lm and anova commands, it is important to note that R indicates the results using significance codes. A small p-value indicates a significant result, and the relative significance of different factors is indicated by different symbols: for instance, two asterisks (**) are used when the p-value is below 0.01, and three (***) when it is below 0.001. Depending on the nature of your experiment, you can choose your significance level and interpret the results by comparing these p-values against it. In this specific case, the Route factor seems to have an insignificant impact on the response.
Interactions
When we have two or more factors in a model that feed into the response variable, we also need to evaluate whether changing both factors together produces a different effect on the response than fixing one and changing the other. This is referred to as an interaction, and interaction effects are accounted for in our model by the Day:Route terms. Once again, the p-values for the interactions can inform us about their relative importance.
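A quick way to eyeball an interaction is an interaction plot: roughly parallel lines suggest little interaction, while crossing lines suggest that the effect of one factor depends on the level of the other. A minimal sketch using base R:

# Visualising the Day x Route interaction:
# non-parallel or crossing lines hint at an interaction effect
interaction.plot(Dataset$Day, Dataset$Route, Dataset$Time,
                 xlab = "Day", trace.label = "Route",
                 ylab = "Mean travel time")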
Concluding Remarks
We've seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook's distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time, it is important to evaluate the model's validity for the question we want to answer. In this context, understanding the significance of our results, both statistically and practically, is necessary. Fortunately, the linear modeling packages in R are very convenient for such modeling and incorporate a lot of added functionality, like producing diagnostic plots simply by calling the plot command on a saved model. This functionality really comes into its own when you build and test many different models and want to compare results.