Background
For a lot of people, the rubber hits the road in data analysis when they can describe the relationships between everyday things they deal with. How does the gas mileage of the cars we manufacture vary with the vehicle’s weight, or the size of the wheels? How does the fuel that consumers use change the power output? What size of font makes my website readable? How much money should I invest in a company with a certain track record? What amount of money should I sell my house for?
Unsurprisingly, there’s a statistical element to all these questions. They’re all characterised by the variability innate in natural and engineered systems. No two cars built in the same manufacturing plant are alike, and no two homes are alike even if they are built on the same area and next to each other. Similarly, even if content is king, your choice of font size does have an impact on readership numbers.
Linear models are as old as the hills, and are still used for a lot of data analysis. They’re especially useful for finding simple relationships, and for constructing simple linear regression models that describe the relationship between two continuous (variable) data sets.
Linear models and ordered pairs
The fundamental unit of data used to construct linear models may be considered the ordered pair. An ordered pair of variable data generally represents a factor-response combination. Factors are aspects of a system that you think could impact the results, and want to investigate. You conduct this investigation by changing factor values and seeing how the responses change. Let’s suppose for a moment that you’re buying a car on a budget and are interested in fuel-efficient cars. Among other factors, you’re primarily interested in studying the impact of one factor, engine displacement, on fuel efficiency.
If we simplify the way we represent the impact of engine displacement on our vehicle of interest, we are left with a single factor, engine displacement (measured in litres), and a single response, gas mileage (the only response we’re interested in, measured in kilometres per litre of fuel).
This is a very simplified model, of course, meant only to demonstrate the regression approach in R. A real-life problem will undoubtedly have many considerations, including the base price of the vehicle, its features, comfort and so on, some of which may not be as easily quantifiable as gas mileage or engine displacement. Here’s what a series of data collection test drives might yield:
We can bring such data into R, into a data frame, and designate the columns of the data frame.
gm <- read.csv("gasmileage.csv")
gm <- as.data.frame(gm)
names(gm) <- c("Car #", "Eng. Disp (l)", "Gas Mlg. (km / l)")
If we were to take a peek at the “gm” variable here, which has the data stored in it, we would see this:
> head(gm)
  Car # Eng. Disp (l) Gas Mlg. (km / l)
1     1           1.0          22.30412
2     2           1.1          22.09578
3     3           1.1          21.97859
4     4           1.2          21.97248
5     5           1.2          22.42579
6     6           1.4          22.08349
Observe how changing the names of the data frame has allowed us to see the data more clearly. This is easier said than done for a lot of public data sets, so exploring and understanding the data set with the View() command in RStudio always helps in real-life data projects when you’re constructing data frames. Changing names has a further advantage in graphing and data representation: the names get carried over to the graphs and charts you create with this data, so it often makes sense to spend a little time on them up front.
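If View() is not available (outside RStudio, say), a few base R commands give a similar first look at a data frame. The sketch below is only illustrative: it assumes the six rows printed by head(gm) above, typed in by hand as a stand-in for gasmileage.csv, so the summaries will not match the full 20-car data set.

```r
# Stand-in for gasmileage.csv: only the six rows shown by head(gm) above.
gm <- data.frame(
  car     = 1:6,
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)
names(gm) <- c("Car #", "Eng. Disp (l)", "Gas Mlg. (km / l)")

str(gm)       # column types and the first few values
summary(gm)   # min, quartiles, mean and max for each column

# Names containing spaces or symbols must be wrapped in backticks with $
mean_mileage <- mean(gm$`Gas Mlg. (km / l)`)

# Count missing values per column before modelling
missing_per_column <- colSums(is.na(gm))
```

The backticks are the price of descriptive column names; some people prefer short names in code and set the long labels only in the plot calls.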
Graphical analysis
Now that we have the data in one place, we can put the data into a plot, and visualize any obvious relationships. This is a simple graphical analysis where you can observe obvious trends and patterns, and understand what model to use. Visualization is a great way to prepare for the actual construction of the linear model. We’ll use the simple base plot function, and invoke the names of the columns (ordered pairs) using the $ operator.
#Visualizing the data to observe any correlation graphically
plot(gm$`Eng. Disp (l)`, gm$`Gas Mlg. (km / l)`,
     main = "Fuel eff. vs. Eng. Displ.",
     col = "maroon",
     xlab = "Eng. Disp (l)",
     ylab = "Gas. Mlg (km/l)")
We can observe some kind of correlation in the ordered pairs, and perhaps we can formalize what we observe at this stage with a linear model. Bear in mind, however, that usually, a real relationship has to be established prior to creating such a model. There are numerous caveats associated with correlation, especially the one that states that correlation does not imply causation. Using the cor() command can also illustrate the nature of the relationship between the factor and the response in question here.
> cor(gm$`Gas Mlg. (km / l)`, gm$`Eng. Disp (l)`)
[1] -0.9444116
A correlation coefficient so close to -1 indicates a strong negative correlation, meaning that higher gas mileage tends to be observed with smaller engine displacement.
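A single coefficient says nothing about how uncertain it is. One way to check, sketched here with base R’s cor.test(), is to look at the confidence interval and p-value alongside the estimate. The data frame below is an assumption for illustration: it holds only the six rows printed by head(gm), with short column names of my own choosing, so the coefficient will differ from the -0.944 obtained on all 20 cars.

```r
# Stand-in data: the six rows shown by head(gm); column names are illustrative.
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)

# Pearson correlation, as in the article
r <- cor(gm$mileage, gm$disp)

# cor.test() reports the same estimate plus a p-value and confidence interval
ct <- cor.test(gm$mileage, gm$disp)
ct$estimate   # same value as r
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the null hypothesis of zero correlation
```

With so few rows the interval will be wide, which is itself a useful reminder that a correlation estimated from a handful of cars deserves caution.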
Constructing the Linear Model
R’s linear modeling function is very simple to use, and packs a punch. You can use it to construct very simple linear models as well as fairly complex polynomial models of any order. The lm() command is used to construct linear models in R. For our problem, we’ll construct and store the linear model in a variable called fit.
#Constructing a simple linear model
fit <- lm(gm$`Gas Mlg. (km / l)` ~ gm$`Eng. Disp (l)`)
> fit

Call:
lm(formula = gm$`Gas Mlg. (km / l)` ~ gm$`Eng. Disp (l)`)

Coefficients:
       (Intercept)  gm$`Eng. Disp (l)`
            24.127              -1.626
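Simple linear models of this kind are of the form $y = mx + c$, where $m$ is the slope, and $c$ the intercept. The slope and intercept are clearly indicated for the linear model we have constructed and stored in fit. The slope is $-1.626$, so each additional litre of engine displacement is associated with roughly 1.626 km/l lower gas mileage.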
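To connect the printed coefficients back to the straight-line form, the sketch below pulls them out with coef() and rebuilds the fitted values by hand. It makes two assumptions that are not in the original: the data are only the six rows printed by head(gm), and the columns are given short names (disp, mileage) so the formula can use the data argument of lm() instead of gm$ references. The coefficients will therefore differ from 24.127 and -1.626.

```r
# Stand-in data: the six rows shown by head(gm); column names are illustrative.
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)

# Same model as in the article, written with a data argument
fit <- lm(mileage ~ disp, data = gm)

# coef() returns a named vector: "(Intercept)" and the factor's name
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["disp"]]

# Rebuild the fitted values from intercept + slope * x
manual <- intercept + slope * gm$disp

# The hand-built values agree with what lm() stored
all.equal(unname(fitted(fit)), manual)
```

Writing the formula against column names rather than gm$ expressions also makes the model usable with predict() on new data later on.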
The variable fit actually contains a lot more information than may be obvious at this point. It makes a range of information available to different R commands, and you can explore its contents further. For instance, the least squares regression approach used in the linear model produces a list of fitted values. This list is accessed as below.
> fit$fitted.values
       1        2        3        4        5        6        7
22.50086 22.33823 22.33823 22.17559 22.17559 21.85032 21.85032
       8        9       10       11       12       13       14
21.68768 21.68768 21.68768 21.68768 21.19978 21.19978 21.19978
      15       16       17       18       19       20
20.87450 20.87450 20.87450 20.87450 20.06132 20.06132
As shown above, for the 20 ordered pairs originally provided, the fitted values of the response variable are stored in fit.
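Fitted values become more informative when compared with what was actually observed. The sketch below computes the residuals (observed minus fitted) and the R-squared value. It assumes only the six rows printed by head(gm), with illustrative column names, so the numbers will not match the 20-car fit.

```r
# Stand-in data: the six rows shown by head(gm); column names are illustrative.
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)
fit <- lm(mileage ~ disp, data = gm)

# Residual = observed response - fitted response
res <- gm$mileage - fitted(fit)

# residuals() returns the same values directly from the model object
all.equal(res, residuals(fit))

# For a least squares fit with an intercept, residuals sum to (nearly) zero
sum(res)

# Proportion of variance in mileage explained by the model
r2 <- summary(fit)$r.squared
```

names(fit) lists everything else stored in the model object, and summary(fit) prints standard errors and significance tests for the coefficients.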
Plotting a Fit Line for the Linear Model
Plotting a fit line on the plot we made earlier is rather straightforward. We can use the abline() or the fitted() commands to plot a line, and can colour it as we wish.
#Plotting a fit line
abline(fit, col = "red")
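The fitted() route mentioned above could look like the sketch below: instead of handing the model to abline(), we draw the fitted values ourselves with lines(). As an assumption for illustration, the data are only the six rows printed by head(gm), with short column names that are not in the original file.

```r
# Stand-in data: the six rows shown by head(gm); column names are illustrative.
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)
fit <- lm(mileage ~ disp, data = gm)

# Scatter plot of the ordered pairs
plot(gm$disp, gm$mileage, col = "maroon",
     xlab = "Eng. Disp (l)", ylab = "Gas. Mlg (km/l)")

# Order the points by displacement so the line is drawn left to right
ord <- order(gm$disp)
lines(gm$disp[ord], fitted(fit)[ord], col = "red")
```

For a straight-line model both approaches draw the same line over the range of the data; abline() extends it across the whole plot, while lines() stops at the smallest and largest displacement. The lines() approach also carries over to models that are not straight lines.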
Polynomial Fits
When we look closely at the plot, we can observe a somewhat nonlinear relationship between the factor and the response. That is to say, the gas mileage doesn’t decrease at the same rate for vehicles around 1.0 litres as it does when we move beyond 1.5 litres. This is non-linearity in the data, and it can be captured when we build higher-order models of the relationship between factor and response. One way to do this in R is to define the “formula” in our lm() command as a polynomial.
#Constructing a polynomial model of the relationship between factor and response
#The second argument in the poly() command indicates the order of the polynomial
fit <- lm(gm$`Gas Mlg. (km / l)` ~ poly(gm$`Eng. Disp (l)`, 2, raw = TRUE))

#Plotting the nonlinear model in ordered pairs
#We can sort the data in the displacement column,
#This reorders the model variable "fit" to plot in this order
lines(sort(gm$`Eng. Disp (l)`), fitted(fit)[order(gm$`Eng. Disp (l)`)], type = "l")
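Whether the quadratic term earns its keep can be checked rather than eyeballed. The sketch below keeps the linear and quadratic models in separate variables (fit_lin and fit_quad, names chosen here for illustration), compares them with anova(), and evaluates the quadratic on an evenly spaced grid with predict(). It assumes only the six rows printed by head(gm), so the comparison carries far less weight than it would on the full data set.

```r
# Stand-in data: the six rows shown by head(gm); column names are illustrative.
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)

# Keep both models so they can be compared
fit_lin  <- lm(mileage ~ disp, data = gm)
fit_quad <- lm(mileage ~ poly(disp, 2, raw = TRUE), data = gm)

# F-test for whether the quadratic term improves on the straight line
anova(fit_lin, fit_quad)

# Evaluate the quadratic on a fine grid for a smooth curve
grid    <- data.frame(disp = seq(min(gm$disp), max(gm$disp), length.out = 50))
curve_y <- predict(fit_quad, newdata = grid)
```

With a plot already open, lines(grid$disp, curve_y) would draw the smooth curve. Note that predict() with newdata relies on the formula referring to column names through the data argument; a formula written with gm$ references cannot be evaluated on a new data frame in this way.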
Concluding Remarks
We’ve seen how the relationships between variable data sets can be analyzed, and how information from these data sets can be converted into models. In the example shown above, a linear or quadratic model can be used to construct a powerful, even predictive model. Such a model allows us to make informed decisions about, in this case, the gas mileage of the vehicle we may buy, especially if that vehicle has a displacement not listed in the data set, like a 1.6 litre engine. Naturally, this kind of predictive modeling ability can be extended to predicting a house price based on price and built-up area information in different neighbourhoods. It could equally be applied to optimizing engineering systems ranging from websites to washing machines.
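As a closing illustration, here is how a prediction for a 1.6 litre engine might be made. The first part uses the intercept and slope printed for the 20-car linear model earlier (24.127 and -1.626). The second part uses predict() on a model fitted to just the six rows printed by head(gm), which is an assumption made to keep the example self-contained; since none of those six cars exceeds 1.4 litres, that second prediction is an extrapolation and should be read with caution.

```r
# Part 1: by hand, with the coefficients printed for the 20-car linear model
intercept <- 24.127
slope     <- -1.626
manual    <- intercept + slope * 1.6   # 21.5254 km/l

# Part 2: predict() on a stand-in fit (six rows only; names are illustrative)
gm <- data.frame(
  disp    = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.4),
  mileage = c(22.30412, 22.09578, 21.97859, 21.97248, 22.42579, 22.08349)
)
fit6 <- lm(mileage ~ disp, data = gm)

# interval = "prediction" adds lower and upper bounds for a single new car
pred <- predict(fit6, newdata = data.frame(disp = 1.6),
                interval = "prediction")
pred   # columns: fit, lwr, upr
```

The prediction interval is often more useful than the point estimate, because it shows how far an individual car’s mileage could plausibly fall from the fitted line.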