Two-Way ANOVA in R

Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

  • Which route from home to work on your daily commute is easier?
  • How would you know which air-conditioner to choose out of several that you’re evaluating in various climates?
  • If you were dealing with a number of suppliers, and wanted to compare their process results all at the same time, how would you do it?
  • If you had three competing designs for a system or an algorithm, and wanted to know whether one of them was significantly better than the others, how would you establish that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that in ANOVA, you have discrete factors whose effect on a continuous response variable you want to understand.
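As a minimal sketch of this equivalence (using R’s built-in PlantGrowth data set purely for illustration), a one-way ANOVA is simply a linear model with a factor as the predictor:

#One-way ANOVA via aov(), which wraps lm() internally
#PlantGrowth is a built-in R data set: plant weights by treatment group
summary(aov(weight ~ group, data = PlantGrowth))

#The same analysis via an explicit linear model
anova(lm(weight ~ group, data = PlantGrowth))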

The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means, estimated from the sample means. What we’re trying to understand here is the effect of a change in the level of one factor on the response. The name “Analysis of Variance” can therefore seem a misnomer to those new to inferential statistics, since ANOVA is a test that compares means – it does so by partitioning the observed variance into “between-group” and “within-group” components.

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

H_0 : \mu_1 = \mu_2 = \mu_3

While H_a states that at least one of the means differs, for example:

H_a : \mu_1 \neq \mu_2 = \mu_3, or
H_a : \mu_1 = \mu_2 \neq \mu_3, or
H_a : \mu_1 \neq \mu_2 \neq \mu_3

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.

Travel Time Problem

Let’s look at a simple data set of travel times organized by day of the week and route. Assume you’ve been monitoring data from many people travelling between two points along several routes, and you’re trying to understand whether the time taken for the trip depends more on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.

Two-Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous response variable (within each factor’s levels) to be normally distributed. ANOVA results are also contingent on an equal-variance assumption across the samples being compared. I’ve demonstrated in an earlier post how normality and variance tests can be run prior to a hypothesis test on variable data.

The code below reads the data set into a data frame, runs the normality and variance tests, and then constructs a linear ANOVA model. For normality testing we use the Shapiro-Wilk test, and for variance testing we use the bartlett.test() command, which compares multiple variances.

#Reading the dataset
Dataset<-read.csv("Traveldata.csv")
str(Dataset)

#Shapiro-Wilk normality tests by Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){
  cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}
cat("\n")

#Shapiro-Wilk normality tests by Route
cat("Normality p-values by Factor Route: ")
for (i in unique(factor(Dataset$Route))){
  cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}
cat("\n")

#Variance tests for Day and Route factors
bartlett.test(Time~Day,data = Dataset )
bartlett.test(Time~Route,data = Dataset )

#Creating a linear model
#Day * Route expands to both main effects and their interaction
l <- lm(Time ~ Day * Route, data = Dataset)
summary(l)

#Running an ANOVA on the linear model; printing the object
#displays the ANOVA table
la <- anova(l)
la

#Plot of Cook's Distance, followed by the standard lm diagnostic plots
plot(cooks.distance(l), 
     main = "Cook's Distance for linear model", xlab =
       "Travel Time (observations)", ylab = "Cook's Distance")
plot(l)

Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

The code also calculates Cook’s distance, which is an important concept in linear models. When trying to understand any anomalous terms in the model, we can refer to Cook’s distance to see whether those observations have high leverage in the model or not. Removing a point with high leverage could substantially change the model results. Equally, if your model isn’t performing well, it may be worth looking at Cook’s distance.
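As a minimal sketch (the 4/n cutoff below is a common rule of thumb, not part of the original analysis), influential observations can be extracted directly from the fitted model:

#Flagging observations whose Cook's Distance exceeds 4/n,
#a commonly used rule-of-thumb threshold for influence
d <- cooks.distance(l)
influential <- which(d > 4 / length(d))
Dataset[influential, ]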

Cook’s Distance for our data set, visualized

Cook’s distance, explained by its importance to leverage in the model.

Looking at the graphs produced by lm, we can see how various points in the model have different values of Cook’s distance, and understand their relative leverages. This is also illustrated in the Normal Quantile-Quantile plot below, where you can see observations #413 and #415, among others, with large values.

Normal QQ plot of data set showing high leverage points (large Cook’s Distance)

ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.333  -4.646   0.516   4.963  19.655 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .  
DayThu         2.69221    2.00280   1.344   0.1795    
DayTue        -0.43574    1.90618  -0.229   0.8193    
DayWed        -0.01149    2.00280  -0.006   0.9954    
RouteB        -1.02130    1.89302  -0.540   0.5898    
RouteC        -1.83131    1.85736  -0.986   0.3246    
DayMon:RouteB  2.91785    2.71791   1.074   0.2836    
DayThu:RouteB  0.39335    2.63352   0.149   0.8813    
DayTue:RouteB  3.44554    2.64247   1.304   0.1929    
DayWed:RouteB  1.23796    2.65761   0.466   0.6416    
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *  
DayThu:RouteC  0.24255    2.73148   0.089   0.9293    
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .  
DayWed:RouteC  1.95253    2.68823   0.726   0.4680    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679 
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291

While the model summary above indicates the effect of each factor level on the response, it doesn’t compute the per-factor F-statistics, which are calculated by taking into consideration the “within” and “between” variations in the samples of data. The ANOVA sum of squares and mean squares approach does exactly this, which is why those results are more relevant here. The summary below is the ANOVA table itself:

Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value  Pr(>F)   
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376   
Day:Route   8   393.9  49.237  0.8779 0.53492   
Residuals 485 27201.3  56.085                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
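As a quick check on how the table is built, each mean square is its row’s sum of squares divided by its degrees of freedom, and each F value is the factor’s mean square divided by the residual mean square. For the Day factor:

F_{Day} = \frac{MS_{Day}}{MS_{Residuals}} = \frac{823.3 / 4}{27201.3 / 485} = \frac{205.830}{56.085} \approx 3.67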

Looking at the results above, it is clear that the Day factor has a statistically significant impact on the travel time. (Upon closer observation of the coefficients, you’ll see that the mean that differed from the others was the mean for Monday – this is clearly an example inspired by Monday rush hour traffic!)

When reading results from the lm and anova commands, it is important to note that R flags the results using significance codes. A small p-value indicates a significant result, and the relative significance of different terms is indicated by different symbols: three asterisks (***) denote a p-value below 0.001, and two asterisks (**) a p-value below 0.01. Depending on the nature of your experiment, you can choose your significance level and interpret the results by comparing these p-values against it. Also, in this specific case, the Route factor seems to have an insignificant impact on the response.

Interactions

When we have two or more factors which function as inputs to a response variable, we also need to evaluate whether changing both variables together causes a different effect on the response than fixing one and changing the other. This is referred to as an interaction, and interaction effects (the Day:Route terms) are taken into account in our model. Once again, the p-values for the interactions can inform us about the relative importance of different interactions. A simple way to visualize them is shown below.
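As a minimal sketch (assuming the Dataset variables from earlier), base R’s interaction.plot() draws the mean response at each factor combination; roughly parallel trace lines suggest little interaction, while crossing lines suggest one:

#Mean travel time by Day, with one trace line per Route;
#non-parallel lines hint at a Day:Route interaction
with(Dataset, interaction.plot(Day, Route, Time,
                               col = c("red", "blue", "darkgreen"),
                               main = "Interaction of Day and Route on Time"))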

Concluding Remarks

We’ve seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook’s distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time to evaluate how valid the model is for what we want to understand. In this context, understanding the significance of our results (statistically and practically) is necessary. Fortunately, the linear modeling packages in R are very convenient for such work, and incorporate a lot of added functionality, like being able to produce diagnostics by simply calling plot on a saved model. This functionality really comes into its own when you build and test many different models and want to compare results.

Animated: Mean and Sample Size

A quick experiment in R can reveal the impact of sample size on the estimates we make from data. A small number of samples gives us less information about the process or system from which we’re collecting data, while a large number can ground our findings in near certainty. See the earlier post on sample size, confidence intervals and related topics on R Explorations.
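This intuition can be made precise: the standard error of the sample mean shrinks with the square root of the sample size,

SE_{\bar{x}} = \frac{s}{\sqrt{n}}

so, for instance, quadrupling the sample size halves the uncertainty in our estimate of the mean.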

Using the “animation” package once again, I’ve put together a simple animation to describe this.

#package containing saveGIF function
library(animation)

#setting GIF options
ani.options(interval = 0.12, ani.width = 480, ani.height = 320)

#a function to generate the sequence of histogram frames
plo <- function(samplesize, mu, sd, iter = 100){
  
  for (i in seq(1,iter)){
    
    #Generating a sample from the normal distribution
    x <- rnorm(samplesize,mu,sd)
    
    #Histogram of samples as they're generated
    hist(x, main = paste("N = ",samplesize,", xbar = ",round(mean(x), digits = 2),
                         ", s = ",round(sd(x),digits = 2)), xlim = c(5,15), 
                        ylim = c(0,floor(samplesize/3)), breaks = seq(4,16,0.5), col = rgb(0.1,0.9,0.1,0.2), 
                        border = "grey", xlab = "x (Gaussian sample)")
    
    #Adding the estimate of the mean line to the histogram
abline(v = mean(x), col = "red", lwd = 2 )
  }
}

#Setting the parameters for the distribution
mu = 10.0
sd = 1.0

for (i in c(10,50,100,500,1000,10000)){
saveGIF({plo(i,mu,sd)},movie.name = paste("N=",i,", mu=",mu,", sd=",sd,".gif"))
}

Animated Results

Very small sample size of 5. Observe how the sample mean line hunts wildly.

N= 10 , mu= 10 , sd= 1

A small sample size of 10. Mean once again moves around quite a bit.

N= 50 , mu= 10 , sd= 1

Moderate sample size of 50. Far less inconsistency in estimate (red line)

N= 100 , mu= 10 , sd= 1

A larger sample size, showing little deviation in sample mean over different samples

N= 1000 , mu= 10 , sd= 1

A large sample size, indicating minor variations in sample mean

N= 10000 , mu= 10 , sd= 1

Very large sample size (however, still smaller than many real world data sets!). Sample mean estimate barely changes over samples.

Simple Outlier Detection in R

Outliers are points in a data set that lie far away from the estimated centre of the data set. This estimated centre could be either the mean or the median, depending on what kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” that you might observe in a data set, and therefore hold importance. Outlier detection is an important aspect of machine learning algorithms of any sophistication. Because outliers can throw off a learning algorithm or deflate an assumption about the data set, we have to be able to identify and explain the outliers in data sets when the need arises. I’ll only cover the basic R commands for outlier detection here, but it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).

Graphical Approaches to Outlier Detection

Boxplots and histograms are useful for getting an idea of the distribution that could be used to model the data, and can also provide insight into whether outliers exist in our data set.

y <-read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")

The Shapiro-Wilk test used above checks the normality of a data set; normality assumptions underlie the outlier detection hypothesis tests used below. In this case, with p-values of 0.365 and 0.399 respectively, and sample sizes of 30 and 1000, both samples y and ylarge seem to be normally distributed.

Box plot of y (no real outliers observed as per graph)

Boxplot of ylarge – a few outlier points seem to be present in graph

Histogram of y

Histogram of ylarge

Normal QQ Plot of Y

Normal QQ Plot of ylarge

The graphical analysis suggests that there could be outliers in ylarge, the larger of the two data sets. The normal probability plots also indicate that both data sets (as different in sample size as they are) can be modeled using normal distributions.

Dixon and Chi Squared Tests for Outliers

The Dixon test and Chi-squared tests for outliers (PDF) are statistical hypothesis tests used to detect outliers in given sample sets. Bear in mind, though, that this Chi-squared test for outliers is very different from the better-known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for 30 points or fewer. The Chi-squared test, on the other hand, makes a variance assumption, and is not sensitive to mild outliers if the variance isn’t specified as an argument. Let’s see how these tests can be used for outlier detection.

library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
dixon.test(ylarge[,2],opposite = TRUE)
dixon.test(ylarge[,2],opposite = FALSE)


#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)

In each of the Dixon and Chi-squared tests for outliers above, we’ve run the test with both values, TRUE and FALSE, of the argument opposite. This argument selects whether we’re testing the lowest extreme value or the highest extreme value, since outliers can lie on either side of the data set.

Sample output is below, from one of the tests.

> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

	Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier


When you closely observe the p-values of these tests alone, you can see the following results:

P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge (the test is limited to small samples)
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04

Taken with an indicative 5% significance level, these p-values imply that the extreme values in ylarge may well be outliers. This may or may not be true, of course, since in inferential statistics we always state a chance of error. In this case, we can conclude that there is only a very small chance that those extreme values in ylarge are actually typical of that data set.

Concluding Remarks

We’ve seen the graphical outlier detection approaches, and also the Dixon and Chi-squared tests. The Dixon test isn’t applicable to large data sets, for which we need the Chi-squared test for outliers and other methods. In machine learning problems, we often have to explain some of the values – from a training perspective for neural networks, or when dealing with lower-resolution models such as least-squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand their underlying causes. Continual improvement professionals also regard outliers as important: statistically, extreme results (extremely good ones and extremely poor ones) are exciting in process excellence and continuous improvement, because they could represent benchmark cases, or worst-case scenarios. Either way, outlier detection is an immensely useful activity applicable to many statistical situations in business.

Animated Logistic Maps of Chaotic Systems in R

Linear systems produce outputs proportional to their inputs, so small changes in the inputs produce predictably small changes in the outputs. Nonlinear systems are those that produce disproportionate results for proportional changes in the inputs. Both linear and nonlinear systems are common enough in nature and industrial processes – or, more accurately, many industrial and natural processes can be usefully modeled as linear or nonlinear systems.

Nonlinear Dynamics and Chaos

Nonlinear dynamical systems are essentially systems whose behaviour evolves over time in a nonlinear manner. A special class of such systems also exhibits chaos, which is defined as sensitive dependence on initial conditions. There are great textbooks available on the subject, by researchers such as Steven Strogatz (Cornell University, Ithaca, New York).

While R is often used for statistical analysis and studies of various kinds, including advanced machine learning algorithms, it isn’t often used to study systems like this. However, they are easily modeled in R, and like any programming language, its wealth of functions can help us understand the statistical aspects of such systems’ behaviour. There’s extensive material available on the internet, and Steven Strogatz’s lectures on Nonlinear Dynamics and Chaos provide a very deep treatment of the subject.

Logistic Maps and Bifurcation Diagrams

A logistic map is a function that describes how the state of a particular dynamical system evolves from one point in time to the next. It allows us to understand bifurcations, and what kinds of conditions produce sensitive dependence on initial conditions. More on bifurcation diagrams here.
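For reference, the canonical logistic map (distinct from the probabilistic system modeled below) is the recurrence

x_{n+1} = r x_n (1 - x_n)

where the growth parameter r controls whether the sequence settles to a fixed point, oscillates, or becomes chaotic.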

Typical logistic map (courtesy Wikipedia)

Nonlinear Dynamical System in R

The system I’ll describe here is a probabilistic system based on the binomial distribution. This distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. A single such trial – the coin toss – follows the Bernoulli distribution, a special case of the binomial.
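For reference, the binomial probability of k successes in n trials with success probability p, which the prob() function in the code below computes directly from factorials, is:

P(X = k) = \frac{n!}{k!\,(n-k)!} \, p^k (1-p)^{n-k}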

In our example, we’re trying to understand the probability of success in a repeated, identical game with two outcomes, given the initial chance of success. In doing so, we explore the impact of the number of games played in sequence, and the number of wins we expect in each case. The end result is a matrix, called problemset in this specific case.

Animations in R

The R package “animation” has functions which can enable sequential graphics (such as that generated within a loop) to be saved as a GIF animation file. This is especially handy when we’re trying to understand the impact of parameters, or when we’re trying to illustrate the data, analysis and results in our work in some sequence. We’ll use the saveGIF() function here to do just such a thing – to save a sequence of images of logistic maps in succession, into a single GIF file.


library(animation)
#Set delay between frames when replaying
ani.options(interval=.05)

#Do our plots within the saveGIF command parentheses, in order to capture the matrix plots we're generating

saveGIF({
for (inval in seq(0,1,length.out = 200)){

pfirst <- inval 
#Defining a function to calculate event probability based on starting probability assumptions
prob <- function(game){
  n <-game[1];
  k <-game[2];
  p <-game[length(game)];
  return( factorial(n) / (factorial(n-k) * factorial(k)) 
          * p^k * (1-p)^(n-k) );
}

iter <-100
k <- 2
games <- seq(2,100,1)
victories <- rep(5,length(games))
problemset <- cbind(games, victories, 
                    rep(pfirst, length(games)))

#Setting up a temporary variable to store the probability values per iteration
out<-NULL

for (i in seq(1,iter,1)){
  for (j in seq(1,length(problemset[,1]), 1)){
    out<-rbind(out,prob(problemset[j,]))
  }
  problemset <-cbind(problemset,out)
  out<-NULL
}

#Using the matrix plot function matplot() to plot the various columns in our result matrix, together

matplot(problemset[,seq(3,length(problemset[1,]), 1)], type = "l", lwd = 1, lty = "solid", 
        main = paste("Logistic Map with initial probability = ",round(pfirst,2)), ylab = "Probability", 
        xlab = "Number of games", ylim = c(0,0.5) )

}

})

The code above generates an animation (GIF file) in your default R working directory. It allows us to examine how different system parameters could affect the probability of events we’re evaluating in the sample space we have in this problem. Naturally, this changes depending on the number of games you indicate in the games variable in the code. The GIF files are shown at the end of this section – two different cases are shown.

Here’s a logistic map generated by a slightly modified version of the code above. This shows the calculated probabilities for different combinations of games and won games, based on initial assumed win percentages. The initial assumed win percentage in this case is 0.1 (10%), shown as the horizontal line running through the graph.

Logistic map for an initial probability of 0.1

Logistic map for the dynamical system described above.

Longer animation with different parameters for k and a greater number of frames.

A number of systems can only be described well when we observe their performance or behaviour in the time domain. Visualizing one of the parameters of a model along the time domain therefore becomes useful for spotting patterns and trends. Good data visualization, for simple and complex problems alike, isn’t only about static images and infographics, but also about dynamic displays and dashboards that change to show the changing nature of the system being modeled or the data being collected. Naturally, this is of high value when putting together a data science practice in an organization.

Concluding Remarks

We’ve seen how animated logistic maps are generated in the example here, but equally, the animated plots could be of other systems we’re trying to describe in the time domain, or systems we’re describing in an elaborate way, especially when many parameters are involved. Beyond linear systems, we also have to deal with nonlinear and dynamical systems in real life. Good examples are weather systems, stock markets, and the behaviours of many complex systems which have many variables and many interactions, yet often simple underlying rules. Logistic maps can tell us, based on simplified representations of our data, about the regimes of chaos and order in a system’s behaviour, and what to expect. In this sense, they can be used whenever we know we’re dealing with nonlinear dynamical systems.