Two Way ANOVA in R

Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

  • How do we answer real-world questions, such as which route from home to work makes for the easier daily commute?
  • How would you know which air-conditioner to choose out of several that you’re evaluating in various climates?
  • If you were dealing with a number of suppliers, and wanted to compare their process results all at the same time, how would you do it?
  • If you had three competing designs for a system or an algorithm, and wanted to understand whether one of them was significantly better than the others, how would you do that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA is where you have discrete factors whose effect on a continuous (variable) result you want to understand.

The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means estimated based on the sample means. What we’re trying to understand here is the effect of a change in the level of one factor, on the response. The term “Analysis of Variance” can therefore seem like a misnomer to many new to inferential statistics, since it is a test that compares means (by partitioning the observed variance).

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

H_0 : \mu_1 = \mu_2 = \mu_3

While the alternative hypothesis H_a could be any one of:

H_a : \mu_1 \neq \mu_2 = \mu_3, or
H_a : \mu_1 = \mu_2 \neq \mu_3, or
H_a : \mu_1 = \mu_3 \neq \mu_2

In other words, the alternative hypothesis is that at least one of the means differs from the others.

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.
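
Written out in the conventional effects notation (added here for reference; this is the standard formulation, not something specific to the data set below), a Two-Way ANOVA with two factors A and B actually tests three null hypotheses:

H_0^{(A)} : \alpha_1 = \alpha_2 = \dots = 0 (no main effect of the first factor)
H_0^{(B)} : \beta_1 = \beta_2 = \dots = 0 (no main effect of the second factor)
H_0^{(AB)} : (\alpha\beta)_{ij} = 0 for all i, j (no interaction between the two factors)

Each of these is tested with its own F-statistic, which is exactly what the ANOVA table later in this post reports for the two factors and their interaction.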

Travel Time Problem

Let’s look at a simple data set which has travel time data organized by day of the week and route. Assume you’ve been monitoring data from many people travelling a certain route, between two points, and you’re trying to understand whether the time taken for the trip is more dependent on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of a problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.

Two Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous variable (within each of the factor’s levels) to be normally distributed. Additionally, ANOVA results are contingent upon an equal variance assumption across the samples being compared. I’ve demonstrated in an earlier post how the normality and variance tests can be run prior to a hypothesis test for variable data.

The code below first pulls data from a data set into variables, and constructs a linear ANOVA model after the normality and variance tests. For normality testing, we’re using the Shapiro-Wilk test, and for variance testing, we’re using the bartlett.test() command here, which is used to compare multiple variances.

#Reading the dataset
Dataset<-read.csv("Traveldata.csv")
str(Dataset)

#Shapiro-Wilk normality tests by Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){
  cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}
cat("Normality p-values by Factor Route: ")

#Shapiro-Wilk normality tests by Route
for (i in unique(factor(Dataset$Route))){
  cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}

#Variance tests for Day and Route factors
bartlett.test(Time~Day,data = Dataset )
bartlett.test(Time~Route,data = Dataset )

#Creating a linear model
#The LM tells us both main effects and interactions
l <- lm(Time ~ Day + Route + Day*Route, data = Dataset)
summary(l)

#Running an ANOVA on the linear model and printing the ANOVA table
la <- anova(l)
la

#Plots of Cook's Distance and the linear model diagnostics
plot(cooks.distance(l), 
     main = "Cook's Distance for linear model", xlab =
       "Travel Time (observations)", ylab = "Cook's Distance")
plot(l)

Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

The code also calculates Cook’s distance, which is an important concept in linear models. When trying to understand any anomalous terms in the model, we can refer to the Cook’s distance to understand whether those terms have high leverage in the model, or not. Removing a point with high leverage could potentially affect the model results. Equally, if your model isn’t performing well, it may be worth looking at Cook’s distance.

Cook’s Distance for our data set, visualized

Cook’s distance, explained by its importance to leverage in the model.
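
For reference, a standard way to write Cook’s distance for observation i is:

D_i = \sum_j (\hat{y}_j - \hat{y}_{j(i)})^2 / (p s^2)

where \hat{y}_{j(i)} are the fitted values recomputed with observation i left out, p is the number of model parameters, and s^2 is the model’s mean squared error. Large values of D_i flag observations whose removal would noticeably change the fitted model.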

When looking at the graphs produced by lm, we can see how various points in the model have different values of Cook’s distance, and we can also understand their relative leverage. This is also illustrated in the Normal Quantile-Quantile plot below, where you can see observations #413 and #415, among others, that have large values.

Normal QQ plot of data set showing high leverage points (large Cook’s Distance)

ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.333  -4.646   0.516   4.963  19.655 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .  
DayThu         2.69221    2.00280   1.344   0.1795    
DayTue        -0.43574    1.90618  -0.229   0.8193    
DayWed        -0.01149    2.00280  -0.006   0.9954    
RouteB        -1.02130    1.89302  -0.540   0.5898    
RouteC        -1.83131    1.85736  -0.986   0.3246    
DayMon:RouteB  2.91785    2.71791   1.074   0.2836    
DayThu:RouteB  0.39335    2.63352   0.149   0.8813    
DayTue:RouteB  3.44554    2.64247   1.304   0.1929    
DayWed:RouteB  1.23796    2.65761   0.466   0.6416    
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *  
DayThu:RouteC  0.24255    2.73148   0.089   0.9293    
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .  
DayWed:RouteC  1.95253    2.68823   0.726   0.4680    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679 
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291

While the model above indicates the effect of each factor level on the response, it doesn’t give us per-factor F-statistics, which are computed by taking into consideration the “within” and “between” variations in the samples of data. The ANOVA mean squares and sum of squares approach does exactly this, which is why those results are more relevant here. The summary below is the ANOVA model itself, in the ANOVA table:

Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value  Pr(>F)   
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376   
Day:Route   8   393.9  49.237  0.8779 0.53492   
Residuals 485 27201.3  56.085                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
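
For reference, each F value in the table above is the ratio of that factor’s mean square to the residual mean square. For the Day factor, for instance, F = 205.830 / 56.085 ≈ 3.67, which is the value reported in the table.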

By looking at the results above, it is clear that the Day factor has a statistically significant impact on the travel time. (Upon closer observation, you’ll see that one of the means that differed from the others was the mean for Monday. This is clearly an example inspired by Monday rush hour traffic!)
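
If you want to pin down which day’s mean differs from the others, a post-hoc comparison can be run. The lines below are a minimal sketch (not part of the original analysis) using Tukey’s Honest Significant Differences on a one-way aov() fit of the same data:

#Post-hoc comparison of Day means (assumes the Dataset data frame used above)
Dataset$Day <- factor(Dataset$Day)
day.aov <- aov(Time ~ Day, data = Dataset)
TukeyHSD(day.aov)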

When reading results from the lm and anova commands, it is important to note that R indicates the results using significance codes. A small p-value indicates a significant result, and the relative significance of different factors is indicated by the symbols assigned to them. For instance, two asterisks (**) are used when we get a p-value below 0.01, and three asterisks (***) when it is below 0.001. Depending on the nature of your experiment, you can choose your significance level and interpret the results by comparing these p-values against it. Also, in this specific case, the Route factor seems to have an insignificant impact on the response.

Interactions

When we have two or more terms in a model which function as inputs to a response variable, we also need to evaluate whether a change in both variables causes a different effect on the response, as opposed to fixing one and changing the other. This is referred to as an interaction, and interaction effects are taken into account in our model. Once again, the p-values for the interactions can inform us about the relative importance of different interactions.
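
A quick visual check for such interactions is an interaction plot, where roughly parallel lines suggest little interaction between the factors. Here is a minimal sketch (not part of the original code) using base R’s interaction.plot() on the same Dataset:

#Interaction plot of mean travel time by Day, with one trace per Route
with(Dataset, interaction.plot(x.factor = Day, trace.factor = Route, response = Time,
                               fun = mean, xlab = "Day", ylab = "Mean travel time",
                               trace.label = "Route"))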

Concluding Remarks

We’ve seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook’s distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time, to evaluate how valid the model is for what we want to understand. In this context, understanding the significance of our results (statistically and practically) is necessary. Fortunately, the linear modeling functions in R are very convenient for such modeling, and incorporate lots of added functionality, like being able to produce diagnostic plots simply by calling the plot command on a saved model. This functionality really comes into its own when you make and test many different models and want to compare results.

Animated Logistic Maps of Chaotic Systems in R

Linear systems are systems whose outputs change proportionally and predictably with small changes in the inputs to the system. Nonlinear systems are those that produce disproportionate results for proportional changes in the inputs. Both linear and non-linear systems are common enough in nature and industrial processes, or more accurately, many industrial and natural processes can actually be modeled as standard linear or nonlinear systems.

Nonlinear Dynamics and Chaos

Nonlinear dynamical systems are essentially systems that exhibit time-dependent behaviour in a non-linear manner. A special class of such systems also exhibits chaos, which is defined as sensitive dependence upon initial conditions. There are great textbooks available on the subject, by researchers such as Steven Strogatz (Cornell University, Ithaca, New York).

While R is often used to run statistical analysis and studies of various kinds, including advanced machine learning algorithms, it isn’t often used to study systems like this. However, such systems are easily modeled in R, and its wealth of functions can help us understand statistical aspects of their behaviour. There’s extensive material available on the internet, and Steven Strogatz’s lectures on Nonlinear Dynamics and Chaos provide a very deep treatment of the subject.

Logistic Maps and Bifurcation Diagrams

A logistic map is a function that describes how the state of a particular dynamical system evolves from one point in time to the next; the classic example is the recurrence x_{n+1} = r x_n (1 - x_n). Plotting the map’s long-run behaviour against its parameter gives a bifurcation diagram, which allows us to understand what kinds of conditions produce sensitive dependence on initial conditions.

Typical logistic map (courtesy Wikipedia)
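
To make this concrete, here is a short sketch (not from the original post) that iterates the classic logistic map in R and draws a bifurcation diagram by sweeping the growth parameter r:

#Bifurcation diagram of the classic logistic map x[n+1] = r * x[n] * (1 - x[n])
r.values <- seq(2.5, 4.0, by = 0.005)
plot(NULL, xlim = range(r.values), ylim = c(0, 1), xlab = "r", ylab = "x",
     main = "Bifurcation diagram of the logistic map")
for (r in r.values){
  x <- 0.5
  for (i in 1:300) x <- r * x * (1 - x)   #discard transient iterations
  for (i in 1:100){                       #plot the values the map settles into
    x <- r * x * (1 - x)
    points(r, x, pch = ".")
  }
}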

Nonlinear Dynamical System in R

The system I’ll describe here is a probabilistic system based on the mechanics of the binomial distribution. This distribution models the number of successes in a fixed number of independent trials, each with the same probability of success or failure. A special case of this, for a single trial such as a coin toss, is the Bernoulli distribution.
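
For reference, the probability that the code below computes for each game is the standard binomial probability of k wins in n games, each won with probability p:

P(X = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}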

In our example, we’re trying to understand the probability of success in a repeated, identical game with two outcomes, provided we know the initial chance of success. In doing so, we’re exploring the impact of the number of games we’re playing in sequence, and the number of wins or losses we expect in each case. The end result is a matrix, which we call problemset in this specific case.

Animations in R

The R package “animation” has functions which can enable sequential graphics (such as that generated within a loop) to be saved as a GIF animation file. This is especially handy when we’re trying to understand the impact of parameters, or when we’re trying to illustrate the data, analysis and results in our work in some sequence. We’ll use the saveGIF() function here to do just such a thing – to save a sequence of images of logistic maps in succession, into a single GIF file.


library(animation)
#Set delay between frames when replaying
ani.options(interval=.05)

#Do our plots within the braces of the saveGIF() call, in order to capture the matrix plots we're generating

saveGIF({
for (inval in seq(0,1,length.out = 200)){

pfirst <- inval 
#Defining a function to calculate event probability based on starting probability assumptions
prob <- function(game){
  n <-game[1];
  k <-game[2];
  p <-game[length(game)];
  return( factorial(n) / (factorial(n-k) * factorial(k)) 
          * p^k * (1-p)^(n-k) );
}

iter <-100
k <- 2
games <- seq(2,100,1)
victories <- rep(5,length(games))
problemset <- cbind(games, victories, 
                    rep(pfirst, length(games)))

#Setting up a temporary variable to store the probability values per iteration
out<-NULL

for (i in seq(1,iter,1)){
  for (j in seq(1,length(problemset[,1]), 1)){
    out<-rbind(out,prob(problemset[j,]))
  }
  problemset <-cbind(problemset,out)
  out<-NULL
}

#Using the matrix plot function matplot() to plot the various columns in our result matrix, together

matplot(problemset[,seq(3,length(problemset[1,]), 1)], type = "l", lwd = 1, lty = "solid", 
        main = paste("Logistic Map with initial probability = ",round(pfirst,2)), ylab = "Probability", 
        xlab = "Number of games", ylim = c(0,0.5) )

}

})

The code above generates an animation (GIF file) in your default R working directory. It allows us to examine how different system parameters could affect the probability of events we’re evaluating in the sample space we have in this problem. Naturally, this changes depending on the number of games you indicate in the games variable in the code. The GIF files are shown at the end of this section – two different cases are shown.

Here’s a logistic map generated by a slightly modified version of the code above. This shows the calculated probabilities for different combinations of games, and won games, based on initial assumed win percentages. The initial assumed win percentage in this case is 0.1 or 10%, the horizontal line running through the graph.

Logistic map for an initial probability of 0.1

Logistic map for the dynamical system described above.

Longer animation with different parameters for k and a greater number of frames.

A number of systems can only be described well when we see their performance or behaviour in the time domain. Visualizing one of the parameters of any model we construct along the time domain therefore becomes useful to try and spot patterns and trends. Good data visualization, for simple and complex problems alike, isn’t only about static images and infographics, but dynamic displays and dashboards of data that change and show us the changing nature of the system being modeled or data being collected. Naturally, this is of high value when we’re putting together a data science practice in an organization.

Concluding Remarks

We’ve seen how animated logistic maps are generated in the example here, but equally, the animated plots could be of other systems we’re trying to describe in the time domain, or systems we’re trying to describe in an elaborate way, especially when many parameters are involved. In real life we also have to deal with nonlinear and dynamical systems: good examples are weather systems, stock markets, and the behaviours of many complex systems which have many variables and many interactions, yet often simple underlying rules. Logistic maps can tell us, based on simplified representations of our data, about the regimes of chaos and order in the behaviour of the system, and what to expect. In this sense, they can be used in cases where it is known that we’re dealing with nonlinear and dynamical systems.

Linear and Polynomial Models in R

Background

For a lot of people, the rubber hits the road in data analysis when they can describe the relationships between everyday things they deal with. How does the gas mileage of the cars we manufacture vary with the vehicle’s weight, or the size of the wheels? How does the fuel that consumers use change the power output? What size of font makes my website readable? How much money should I invest in a company with a certain track record? What amount of money should I sell my house for?

Unsurprisingly, there’s a statistical element to all these questions. They’re all characterised by the variability innate in natural and engineered systems. No two cars built by the same vehicle manufacturing plant are alike, no two homes are alike even if built in the same area and next to each other, and similarly, even if content is king, your choice of font size does have an impact on readership numbers.

Linear models are as old as the hills, and still continue to be used for a lot of data analysis. They’re especially useful when trying to find simple relationships, and when constructing simple linear regression models that describe the relationships between two different continuous (variable) data sets.

Linear models and ordered pairs

The fundamental unit of data used to construct linear models may be considered the ordered pair. An ordered pair of variable data generally represents a factor-response combination. Factors are aspects of a system that you think could impact the results, and want to investigate. You conduct this investigation by changing factor values, and seeing how the responses change. Let’s consider for a moment, that you’re buying a car, and you have a budget, and are interested in fuel efficient cars. Among other factors, you’re primarily interested in studying the impact of one factor – engine displacement – on the fuel efficiency.

Factor: Engine Displacement (litre); Response: Gas Mileage (km / l)

If we were to simplify the way we represent the impact of engine displacement on our vehicle of interest, we can represent it as above, with no other factors than engine displacement (measured in litres), and no other responses than the gas mileage (the only response we’re interested in, measured in kilometres per litre of fuel).

This is a very simplified model, of course, meant only to demonstrate the regression approach in R. A real-life problem will undoubtedly have many considerations, including the base price of the vehicle, its features, comfort and so on, some of which may not be as easily quantifiable as gas mileage or engine displacement. Here’s what a series of data collection test drives might yield:

Gas mileage data – first few vehicles shown

We can bring such data into R, into a data frame, and designate the columns of the data frame.

gm <- read.csv("gasmileage.csv")
gm <- as.data.frame(gm)
names(gm)<-c("Car #", "Eng. Disp (l)", "Gas Mlg. (km / l)")

If we were to take a peek at the “gm” variable here, which has the data stored in it, we would see this:

> head(gm)
  Car # Eng. Disp (l) Gas Mlg. (km / l)
1     1           1.0          22.30412
2     2           1.1          22.09578
3     3           1.1          21.97859
4     4           1.2          21.97248
5     5           1.2          22.42579
6     6           1.4          22.08349

Observe how changing the names of the data frame has allowed us to see the data more clearly. This is easier said than done for a lot of public data sets, so exploring and understanding the data set, using the View() command in RStudio, always helps in real-life situations when you’re working on data projects and constructing data frames. Changing names does have an advantage, namely in graphing and data representation: these names get carried over to all the graphs and charts you create with this data, so it makes sense to spend a little bit of time up front doing it.

Graphical analysis

Now that we have the data in one place, we can put the data into a plot, and visualize any obvious relationships. This is a simple graphical analysis where you can observe obvious trends and patterns, and understand what model to use. Visualization is a great way to prepare for the actual construction of the linear model. We’ll use the simple base plot function, and invoke the names of the columns (ordered pairs) using the $ operator.

#Visualizing the data to observe any correlation graphically
plot(gm$`Eng. Disp (l)`,gm$`Gas Mlg. (km / l)`, main = "Fuel eff. vs. Eng. Displ.", col = "maroon", xlab = "Eng. Disp (l)", ylab = "Gas. Mlg (km/l)")

Simple plot of Gas Mileage with Engine displacement

We can observe some kind of correlation in the ordered pairs, and perhaps we can formalize what we observe at this stage with a linear model. Bear in mind, however, that usually, a real relationship has to be established prior to creating such a model. There are numerous caveats associated with correlation, especially the one that states that correlation does not imply causation. Using the cor() command can also illustrate the nature of the relationship between the factor and the response in question here.

> cor(gm$`Gas Mlg. (km / l)`,gm$`Eng. Disp (l)`)
[1] -0.9444116

A correlation coefficient so close to -1 indicates strong negative correlation, meaning that increases in gas mileage seem to be observed with decreases in engine displacement.

Constructing the Linear Model

R’s linear modeling function is very simple to use, and packs a punch. You can construct very simple linear models as well as fairly complex polynomial models of any order with it. The lm() command is used to construct linear models in R. For our problem, we’ll construct and store the linear model in a variable called fit.

#Constructing a simple linear model 
fit <- lm(gm$`Gas Mlg. (km / l)` ~ gm$`Eng. Disp (l)`)
> fit

Call:
lm(formula = gm$`Gas Mlg. (km / l)` ~ gm$`Eng. Disp (l)`)

Coefficients:
       (Intercept)  gm$`Eng. Disp (l)`  
            24.127              -1.626  

All simple linear models are of the form y = mx + c, where m is the slope and c the intercept. The slope and intercept are clearly indicated for the linear model we have constructed and stored in fit: the intercept is 24.127 and the slope is -1.626.
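
As a quick worked example using these coefficients, a 1.6 litre engine would be predicted to return roughly 24.127 - 1.626 * 1.6 ≈ 21.5 km/l.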

The variable fit actually contains a lot more information than may be obvious at this point. It makes available a range of different information to different R commands, but you can also explore its contents directly. For instance, the least squares regression approach used in the linear model produces a list of fitted values. This list is accessed as below.

> fit$fitted.values

       1        2        3        4        5        6        7 
22.50086 22.33823 22.33823 22.17559 22.17559 21.85032 21.85032 
       8        9       10       11       12       13       14 
21.68768 21.68768 21.68768 21.68768 21.19978 21.19978 21.19978 
      15       16       17       18       19       20 
20.87450 20.87450 20.87450 20.87450 20.06132 20.06132 

As shown above, for the 20 ordered pairs originally provided, the fitted values of the response variable are stored in fit.

Plotting a Fit Line for the Linear Model

Plotting a fit line on the plot we made earlier is rather straightforward. We can pass the model to abline() to draw the line (or plot its fitted() values with lines()), and can colour it as we wish.

#Plotting a fit line
abline(fit, col = "red")

Data plotted with a simple fit line (simple linear model).

Polynomial Fits

When we look closely at the plot, we can observe a somewhat nonlinear relationship between the factor and the response: the gas mileage doesn’t decrease at the same rate for vehicles around 1.0 litres as it does when we move beyond 1.5 litres. This is non-linearity in the data, and it can be captured when we build more flexible models of the relationship between factor and response. One way to do this in R is to define the “formula” in our linear model lm() command as a polynomial.
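
In effect, instead of a straight line we are fitting a quadratic of the form y = \beta_0 + \beta_1 x + \beta_2 x^2, where x is the engine displacement and y the gas mileage.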

#Constructing a polynomial model of the relationship between factor and response
#The second argument in the poly() command indicates the order of the polynomial
fit <- lm( gm$`Gas Mlg. (km / l)` ~ poly(gm$`Eng. Disp (l)`, 2, raw = TRUE)) 

#Plotting the nonlinear model in ordered pairs
#We can sort the data in the displacement column, 
#This reorders the model variable "fit" to plot in this order
lines(sort(gm$`Eng. Disp (l)`),fitted(fit)[order(gm$`Eng. Disp (l)`)], type = "l", col = "blue")

Linear (red) and quadratic (polynomial order 2, in blue) models shown

Concluding Remarks

We’ve seen how the relationships between variable data sets can be analyzed, and how information from these data sets can be converted into models. In the example shown above, a linear or quadratic model can be used to construct a powerful, even predictive, model, which can allow us to make informed decisions about, in this case, the gas mileage of a vehicle we may buy, especially if that vehicle has a displacement not represented in the data set. Naturally, this kind of predictive modeling ability can be extended to predicting a house price based on built-up area and neighbourhood information, and it could equally be applied to optimizing engineering systems ranging from websites to washing machines.
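
As a closing sketch (not from the original post), here is one way such a prediction might be made in R. The column names are simplified when refitting so that predict() can be given new displacement values; this assumes the same gasmileage.csv file used above, and the displacements supplied are purely illustrative:

#Refit the quadratic model with plain column names so predict() can take new data
gm2 <- read.csv("gasmileage.csv")
names(gm2) <- c("car", "disp", "mileage")
quad <- lm(mileage ~ poly(disp, 2, raw = TRUE), data = gm2)

#Predicted gas mileage (km/l) for a couple of illustrative displacements
predict(quad, newdata = data.frame(disp = c(1.3, 1.6)))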