# Linear and Polynomial Models in R

## Background

For a lot of people, the rubber hits the road in data analysis when they can describe the relationships between everyday things they deal with. How does the gas mileage of the cars we manufacture vary with the vehicle’s weight, or the size of the wheels? How does the fuel that consumers use change the power output? What size of font makes my website readable? How much money should I invest in a company with a certain track record? What amount of money should I sell my house for?

Unsurprisingly, there’s a statistical element to all these questions. They’re all characterised by the variability you see innate in natural and engineered systems. No two cars built by the same vehicle manufacturing plant are alike, and no two homes are alike even if built on the same area, and next to each other, and similarly, even if content is king, your choice of font size does have an impact on readership numbers.

Linear models are as old as the hills, and still continue to be used for a lot of data analysis. They’re especially useful when trying to find simple relationships, and when constructing simple linear regression models that describe the relationships between two different continuous (variable) data sets.

## Linear models and ordered pairs

The fundamental unit of data used to construct linear models may be considered the ordered pair. An ordered pair of variable data generally represents a factor-response combination. Factors are aspects of a system that you think could impact the results, and want to investigate. You conduct this investigation by changing factor values, and seeing how the responses change. Let’s consider for a moment, that you’re buying a car, and you have a budget, and are interested in fuel efficient cars. Among other factors, you’re primarily interested in studying the impact of one factor – engine displacement – on the fuel efficiency.

Factor: Engine Displacement (litre); Response: Gas Mileage (km / l)

If we were to simplify the way we represent the impact of engine displacement on our vehicle of interest, we can represent it as above, with no other factors than engine displacement (measured in litres), and no other responses than the gas mileage (the only factor we’re interested in, measured in kilometres per litre of fuel).

This is a very simplified model, of course, meant only to demonstrate the regression approach in R. A real-life problem will undoubtedly have many considerations, including the base price of the vehicle, its features, comfort and so on, some of which may not be as easily quantifiable as gas mileage or engine displacement. Here’s what a series of data collection test drives might yield:

Gas mileage data – first few vehicles shown

We can bring such data into R, into a data frame, and designate the columns of the data frame.

gm <- read.csv("gasmileage.csv")
gm <- as.data.frame(gm)
names(gm)<-c("Car #", "Eng. Disp (l)", "Gas Mlg. (km / l)")


If we were to take a peek at the “gm” variable here, which has the data stored in it, we would see this:

> head(gm)
Car # Eng. Disp (l) Gas Mlg. (km / l)
1     1           1.0          22.30412
2     2           1.1          22.09578
3     3           1.1          21.97859
4     4           1.2          21.97248
5     5           1.2          22.42579
6     6           1.4          22.08349



Observe how changing the names of the data frame has allowed us to see the data more clearly. This is easier said than done for a lot of public data sets. Therefore, exploring and understanding the data set, using the View() command in RStudio always helps in real life situations when you’re working on data projects and constructing data frames. Changing names does have an advantage, namely, in graphing and data representation. These names get carried over to all your graphs and charts that you create with this data – so it makes sense to spend a little bit of time up front doing it, at times.

Now that we have the data in one place, we can put the data into a plot, and visualize any obvious relationships. This is a simple graphical analysis where you can observe obvious trends and patterns, and understand what model to use. Visualization is a great way to prepare for the actual construction of the linear model. We’ll use the simple base plot function, and invoke the names of the columns (ordered pairs) using the $ operator. #Visualizing the data to observe any correlation graphically plot(gm$Eng. Disp (l),gm$Gas Mlg. (km / l), main = "Fuel eff. vs. Eng. Displ.", col = "maroon", xlab = "Eng. Disp (l)", ylab = "Gas. Mlg (km/l)")  Simple plot of Gas Mileage with Engine displacement We can observe some kind of correlation in the ordered pairs, and perhaps we can formalize what we observe at this stage with a linear model. Bear in mind, however, that usually, a real relationship has to be established prior to creating such a model. There are numerous caveats associated with correlation, especially the one that states that correlation does not imply causation. Using the cor() command can also illustrate the nature of the relationship between the factor and the response in question here. > cor(gm$Gas Mlg. (km / l),gm$Eng. Disp (l)) [1] -0.9444116  A correlation coefficient so close to $-1$ indicates strong negative correlation, meaning that increases in gas mileage seem to be observed with decreases in engine displacement. ## Constructing the Linear Model R’s linear modeling function is very simple to use, and packs a punch. You can construct very simple linear models and fairly complex polynomial models of any order using this. The lm() command is used to construct linear models in R. For our problem, we’ll construct and store the linear model in a variable called fit #Constructing a simple linear model fit <- lm(gm$Gas Mlg. (km / l) ~ gm$Eng. Disp (l))  > fit Call: lm(formula = gm$Gas Mlg. (km / l) ~ gm$Eng. Disp (l)) Coefficients: (Intercept) gm$Eng. Disp (l)
24.127              -1.626



All linear models are of the form $y = mx + c$ where $m$ is the slope, and $c$ the intercept. The close and intercept are clearly indicated for the linear model we have constructed and stored in fit. The slope is $-1.63$.

The variable fit actually contains a lot of information than may be obvious at this point. It makes available a range of different information to different R commands, but you can explore its contents more. For instance, the least squares regression analysis approach used in the linear model should produce a list of fitted values. This list is accessed as below.

> fit$fitted.values 1 2 3 4 5 6 7 22.50086 22.33823 22.33823 22.17559 22.17559 21.85032 21.85032 8 9 10 11 12 13 14 21.68768 21.68768 21.68768 21.68768 21.19978 21.19978 21.19978 15 16 17 18 19 20 20.87450 20.87450 20.87450 20.87450 20.06132 20.06132  As shown above, for the 20 ordered pairs originally provided, the fitted values of the response variable are stored in fit. ## Plotting a Fit Line for the Linear Model Plotting a fit line on the plot we made earlier is rather straightforward. We can use the abline() or the fitted() commands to plot a line, and can colour it as we wish. #Plotting a fit line abline(fit, col = "red")  Data plotted with a simple fit line (simple linear model). ## Polynomial Fits When we look closely at the plot, we can observe a somewhat nonlinear relationship between the factor and the response, that is to say, the gas mileage doesn’t decrease at the same rate for vehicles around 1.0 litres, as opposed to when we move beyond 1.5 litres. This is non-linearity in the data, and it can be captured when we build higher resolution models of the relationship between factor and response. One way to do this in R, is to define the “formula” in our linear model lm() command as a polynomial. #Constructing a polynomial model of the relationship between factor and response #The second argument in the poly() command indicates the order of the polynomial fit <- lm( gm$Gas Mlg. (km / l) ~ poly(gm$Eng. Disp (l), 2, raw = TRUE)) #Plotting the nonlinear model in ordered pairs #We can sort the data in the displacement column, #This reorders the model variable "fit" to plot in this order lines(sort(gm$Eng. Disp (l)),fitted(fit)[order(gm$Eng. Disp (l))], type = "l")  Linear (red) and quadratic (polynomial order 2, in blue) models shown ## Concluding Remarks We’ve seen how the relationships between variable data sets can be analyzed, and how information from these data sets can be converted into models. In the example shown above, a linear or quadratic model can be used to construct a powerful, even predictive model, which can allow us to make informed decisions about, in this case, the gas mileage of the vehicle we may buy, especially if that vehicle may have a very different displacement such as one not listed in the data set, like a 1.4 litre engine, or a 1.6 litre engine. Naturally, this kind of predictive modeling ability can be extended to when you have to predict a house price based on price and built up area information in different neighbourhoods. It could equally be applied to optimizing engineering systems ranging from websites to washing machines. # Comparing Non-Normal Data Graphically and with Non-Parametric Tests Not all data in this world is predictable in the exact same way, of course, and not all data can be modeled using the Gaussian distribution. There are times, when we have to make comparisons about data using one of many distributions that represent data which could show different patterns other than the familiar and comforting “bell curve” of the normal distribution pattern we’re used to seeing in business presentations and the media alike. For instance, here’s data from the Weibull distribution, plotted using different shape and scale parameters. A Weibull distribution has two parameters, shape and scale, which determine how it looks (which varies widely), and how spread out it is.  shape <- 1 scale <- 5 x<-rweibull(1000000,shape,scale) hist(x, breaks = 1000, main = paste("Weibull Distribution with shape: ",shape,", and scale: ",scale)) abline (v = median(x), col = "blue") abline (v = scale, col = "red")  Shape = 1; Scale = 5. The red line represents the scale value, and the blue line, the median of the data set. Here’s data from a very different distribution, which has a scale parameter of 100. Shape = 1; Scale = 100. Same number of points. The red and blue lines mean the same things here too. The shape parameter, as can be seen clearly here, is called so for a good reason. Even when the scale parameter changes wildly (as in our two examples), the overall geometry of our data looks similar – but of course, it isn’t. The change in the scale parameter has changed the probability of an event $x ->0$ towards the lower end of the x range (closer to zero), compared to an event $x>>>0$ further away. When you superimpose these distributions and their medians, you can get a very different picture of them. If we have two very similar data sets like the data shown in the first graph and the data in the second, what kinds of hypothesis tests can we use? It is a pertinent question, because at times, we may not know that a data set may represent a process that can be modeled by a specific kind of distribution. At other times, we may have entirely empirical distributions represented by our data. And we’d still want to make comparisons using such data sets. shape <- 1 scale1 <- 5 scale2<-scale1*2 x<-rweibull(1000000,shape,scale1) xprime<-rweibull(1000000,shape,scale2) hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different shape parameters: ",shape/100,", ", shape)) hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T) abline (v = median(x), col = "blue") abline (v = scale, col = "red")  Different scale parameters. Red and blue lines indicate medians of the two data sets. The Weibull distribution is known to be quite versatile, and can at times be used to approximate the Gaussian distribution for real world data. An example of this is the use of the Weibull distribution to approximate constant failure rate data in engineering systems. Let’s look at data from a different pair of distributions with a different shape parameter, this time, 3.0. shape <- 3 scale1 <- 5 scale2<-scale1*1.1 #Different scale parameter for the second data set x<-rweibull(1000000,shape,scale1) xprime<-rweibull(1000000,shape,scale2) hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2)) hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T) abline (v = median(x), col = "blue") abline (v = median(xprime), col = "red")  Weibull distribution data – different because of scale parameters. Vertical lines indicate medians. The medians can be used to illustrate the differences between the data, and summarize the differences observed in the graphs. However, when we know that a data set is non-normal, we can adopt non-parametric methods from the hypothesis testing toolbox in R. Like hypothesis tests for normally distributed data that have comparable means, we can compare the medians of two or more samples of non-normally distributed data. Naturally, the same conditions – of larger samples of data being better, at times, apply. However, the tests can help us analytically differentiate between two similar-looking data sets. Since the Mann-Whitney median test and other non-parametric tests don’t make assumptions about the parameters of the underlying distribution of the data, we can rely on these tests to a greater extent when studying the differences between samples that we think may have a greater chance of being non-normal (even though the normality tests may say otherwise). Non-parametrics and the inferential statistics approach: how to use the right test When we conduct the AD test for normality on the two samples in question, we can see how these samples return a very low p-value each. This can also be confirmed using the qqnorm plots. Let’s use the Mann-Whitney test for two medians from samples of non-normal data, to assess the difference between the median values. We’ll use a smaller sample size for both, and use the wilcox.test() command. For two samples, the wilcox.test() command actually performs a Mann-Whitney test. shape <- 3 scale1 <- 5 scale2<-scale1*1.01 x<-rweibull(1000000,shape,scale1) xprime<-rweibull(1000000,shape,scale2) library(nortest) paste("Normality test p-values: Sample 'x' ",ad.test(x)$p.value, " Sample 'xprime': ", ad.test(xprime)$p.value) hist(x, breaks = length(x)/10, border = rgb(0.9,0.2,0.2,0.05), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2)) hist(xprime, breaks = length(xprime)/10, border = rgb(0.2,0.9,0.2,0.05), col = rgb(0.2,0.2,0.9,0.2), add = T) abline (v = median(x), col = "blue") abline (v = median(xprime), col = "red") wilcox.test(x,xprime) paste("Median 1: ", median(x),"Median 2: ", median(xprime))  Observe how close the scale parameters of both samples are. We’d expect both samples to overlap, given the large number of points in each sample. Now, let’s see the results and graphs. Nearly overlapping histograms for the large non-normal samples The results for this are below. Mann-Whitney test results The p-value here (for this considerable sample size) clearly illustrates the present of a significant difference. A very low p-value in this test result indicates that, if we were to make the assumption that the medians of these data sets are equal, there would be an extremely small probability, that we would see samples as extreme as observed in these samples. The fine difference in the medians observed in the median results can also be picked up in this test. To run the Mann-Whitney test with a different confidence level (or significance), we can use the following syntax: wilcox.test(x,xprime, conf.level = 0.95)  Note 1 : The mood.test() command in R performs a two-samples test of scale. Since the scale parameters in these samples of data we generated (for the purposes of the demo) are well known, in real life situations, the p-value should be interpreted based on additional information, such as the sample size and confidence level. Note 2: The wilcox.test() command performs the Mann Whitney test. This is a comparison of mean ranks, and not of the medians per se. # Power, Difference and Sample Sizes In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses – the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision making activity, where you reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from some other fleet? When you collect samples of data, you can compare the average values of the samples, and arrive at some inference from this information, about the population. Since sample sizes tend to be affected by variability, we ought to be interested in how much data to actually collect. ## Hypothesis Tests Statistically speaking, when we collect a small sample of data and calculate its standard deviation, we tend to get a larger estimate of standard deviation or a smaller estimate of standard deviation from the actual standard deviation. The relationship between the sample size and the standard deviation of a sample is described by the term standard error. This standard error is an integral part of how a confidence interval is calculated for variable data. For smaller samples, the difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be small for large sample sizes. There is an intuitive way to think about this. If you have more information, you can make a better guess at a characteristic of the population the information is coming from. Let’s take another example: Motor racing. Motor racing lap times are generally recorded with extremely high precision and accuracy. If we had a sample with three times and wanted to estimate lap times for a circuit, we’d probably do okay, but have a wider range of expected lap times. On the contrary, if we had a number of lap time records, we could more accurately calculate the confidence intervals for the mean value. We could even estimate the probability that a particular race car driver could clock a certain time, if we were able to understand the distribution that is the closest model of the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these risks $\alpha$ and $\beta$ risks. Hypothesis tests, alpha and beta risks ## Type I & Type II Errors and Power To explain it simply, the chance of erroneously rejecting the null hypothesis $H_0$ is referred to as the Type I error (or $\alpha$). Similarly, the chance of erroneously accepting the null hypothesis (if the reality was different from what the null hypothesis stated) is called the Type II error (or $\beta$ risk). Naturally, we want both these errors to have very low probability in our experiments. How do we determine if our statistical model is powerful enough, therefore, to avoid both kinds of risks? We use two statistical terms, significance (known here as $\alpha$, as in $\alpha$ risk), and power (known sometimes as $\pi$, but more commonly known as $1-\beta$, as we see from the illustration above). It turns out that the statistical power is heavily dependent on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of difference in fuel efficiency) is diminished greatly. In truth, you may receive a result that gives you a p-value (also discussed in an earlier post) that is greater than the significance. Remember that this is now a straightforward comparison between the p-value of the test and what we know now as $\alpha$. However, note how in the results interpretation of our hypothesis test, we didn’t yet consider $1-\beta$. Technically, we should have. And this is what causes so many spurious results, because false positives end up getting ignored, leading to truth inflation. Very often in data-driven businesses, the question of “how many samples is good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect, isn’t easy. Going back-and-forth to collect samples of data in order to do your hypothesis tests is helpful – primarily because you can see the effects of sample size in your specific problem, practically. ## A Note on Big Data Big Data promises us what a lot of statisticians didn’t have in the past – the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the trenchant infrastructure to make the call to collect, store and analyze such data. While Big Data security is an as yet incompletely answered question, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email account providers on the web that also double up as social networks. Where does this fit in? Well, when building such systems, we have neither the trust of a large number of people, nor the information we require – which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, and when companies have to build systems that handle this kind of information well, they have to experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post will be relevant in such contexts. ## Summary 1. Power: The power of a test is defined as the ability to correctly reject the null hypothesis of a test. As we’ve described it above, it is defined by $1-\beta$, where $\beta$ is the chance of incorrectly accepting the default or null hypothesis $H_0$. 2. Standard deviation ($\sigma$ ): The more variation we observe in any given process, the greater our target sample size should be, for achieving the same power, and if we’re detecting the same difference in performance, statistically. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in the $\sigma$ can affect the number of data points we need to collect to arrive at a result with sufficient power. 3. Significance ($\alpha$ ): As discussed in the earlier post on normality tests, significance of a result can have an impact on how much data we need to collect. If we have a wider margin for error, or greater margin for error in our decisions, we ought to settle for a larger significance value, perhaps of 10% or 15%. The de-facto norm in many industries and processes (for better or for worse, usually for worse) is to use $\alpha = 0.05$. While this isn’t always a bad thing, it encourages blind adherence and myths to propagate, and prevents statisticians from thinking about the consequences of their assumptions. In reality, $\alpha$ values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results. 4. Sample size ($n$): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test (and by that I mean, the test has a greater ability to detect a false positive). 5. Difference ($\Delta$): The greater the difference you want to be able to detect between two sets of data (proportions or means or medians), the smaller the sample size you need. This is an intuitively easy thing to understand – like testing a HumVee for gas mileage versus a Ford Focus – you need only a few trips (a small sample size) to tell a real difference, as opposed to if you were to test two compact cars against each other (when you may require a more rigorous testing approach). # Hypothesis Tests: 2 Sample Tests (A/B Tests) Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other. It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data. ## Simple 2-sample t-test A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them. So, any variation you see in the data is purely random variation. The alternative hypothesis in this test, is of course, that it isn’t only random variation we’re seeing, and that these samples come from completely different populations altogether. $H_o : \mu_1 = \mu_2 \newline H_a: \mu_1 \neq \mu_2$ What the populations from which X1 and X2 are taken may look like We’ll generate two samples of data $x_1, x_2$ from two different normal distributions for the purpose of demonstration, since normality is a pre-requisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for estimating the median, such as the Moods-Median or Kruskall-Wallis test – which I’ll blog about another time). We also have to ensure that the standard deviations of the two samples of data we’re testing are comparable. I’ll also demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective. library(nortest) #Generating two samples of data #100 points of data each #Same standard deviation #Different values of mean (of sampling distribution) x<-read.csv(file = "x1x2.csv") x1<-x[,2] x2<-x[,3] #Setting the global value of significance alpha = 0.1 #Histograms hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = seq(7,15,0.25), main = "Histogram of x1 and x2", xlab = "x1, x2") abline(v=10, col = "orange") hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = seq(7,15,0.25), add = T) abline(v=12, col = "purple") #Running normality tests (just to be sure) ad1<-ad.test(x1) ad2<-ad.test(x2) #F-test to compare two or more variances v1<-var.test(x1,x2) if(ad1$p.value>=alpha & ad2$p.value>=alpha){ if(v1$p.value>=alpha){
#Running a 2-sample t-test
for (i in c(-2,-1,0,1,2)){
temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
cat("Difference= ",i,"; p-value:",temp$p.value,"\n") } } } The first few lines in the code merely include the “nortest” package and invoke/generate the data sets we’re comparing. The nortest package contains the Anderson Darling Normality Test, which we have also covered in an earlier post. We can generate a histogram, to understand what $x_1$ and $x_2$ look like. Histogram of X1 and X2 – showing the reference population mean lines The overlapping histograms of $x_1$ and $x_2$ clearly indicate the difference in the central tendency, and the overlap is also visible. Subsequent code above covers an F-test. As explained earlier, equality of variances is a pre-requisite for the 2-sample t-test. Failing this would mean that we essentially have samples from two different populations, which have two different standard deviations. Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is present in the “stats” package, runs, and provides us a result. Let’s look closely at how the t.test command is constructed. The arguments contain $x_1$ and $x_2$, which are our two samples for comparison. We provide the argument “paired = FALSE”, because these are not before/after samples. They’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of $\alpha$, or significance level. Now that we’ve seen what the code does, let’s look at the results. 2-sample t-test results ## Evaluating The Results Two sample t-test results should be evaluated in a similar way to 1-sample t-tests. Our decision is dependent on the p-value we see in the result, and the confidence interval of the difference between sample means. Observe how the difference estimate lies on the negative side of the number line. Difference is calculated from the populations means 10 and 12, so we can clearly understand why this estimate of difference is negative. The estimates for mean values of x and y (in this case, $x_1$ and $x_2$) are also given. Naturally, the p-value that’s in the result, when compared to our generous $\alpha$ of 0.1, is far lesser, and we can consider this to be a significant result (provided we have sufficient statistical power – and we’ll discuss this in another post). This indicates a significant difference between the two sets. If $x_1$ and $x_2$ were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for $x_2$ when compared to $x_1$. ## Detecting a Specific Difference Sometimes, you may want to evaluate a new product, and see if it performs at least x% better than the old product. For websites, for instance, you may be concerned with loading times. You may be concerned with code runtime, or with vehicle gas mileage, or vehicle durability, or some other aspect of performance. At times, the fortunes of entire companies depend on them producing faster, better products – that are known to be faster by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data. The same example above can be modified slightly, to test for a specific difference. The only real difference we have to make here, of course, is the value of $\Delta$ or difference. The t.test() command in R unfortunately isn’t very clear on this – it expects you to understand that you should use $\mu$ for this. Once you get used to it, however, this little detail is fine, and it delivers the expected result. if(ad1$p.value >=alpha & ad2$p.value>=alpha){ if(v1$p.value>=alpha){
#Running a 2-sample t-test
for (i in c(-2,-1,0,1,2)){
temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
cat("Difference= ",i,"; p-value:",temp$p.value,"\n") } } }  The code above prints out different p-values, for different tests. The data used in these tests is the same, but by virtue of the different differences we want to detect between these samples, the p-values are different. Observe the results below: Differences and how they influence p-value (same two samples of data) Since the data was generated from two distributions that have means of 10 and 12 respectively for $x_1$ and $x_2$, we know from intuition that the difference is -2, and we should start seeing results that indicate no difference between the expected and observed difference at this value in the test. Therefore, the p-values in this scenario will be greater than the significance value, $\alpha$. For other scenarios – when $\Delta = -1, 0, 1, 2$, we see that the p-values are clearly far below the significance of $\alpha = 10%$. What’s important to remember therefore, is that contrary to what many people may think, there is no one or best p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference expected. ## Concluding Remarks A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and understand if one of them is performing significantly better than another. The test is strictly for data that fits the normality criteria, that also happen to have comparable standard deviations, and the results from it tend to be impacted heavily by the kind of hypothesis we use – for difference (which we explored here) and for one or two sided comparisons. We explored only the two sided comparisons here (and hence constructed a two sided confidence interval). When a business uses a 2-sample t-test, some of the arguments here, such as the values of confidence level, difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size. The smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size. # Confidence Intervals and t-tests in R If you were to walk into a restaurant and order a cup of coffee, you’d expect to get a standard cup of the stuff, and you’d expect to get a sufficient quantity of it too. How much coffee for a certain price is too much, and how much is too little? More realistically, when you know that the same coffee shop may serve ever so slightly different volumes of coffee in the same cup, how can you quantify the coffee in the cup? ## What Plausible Ranges of Gas Prices Do You Pay? The same question may very well be asked at the gas station or diesel station that you fill up your car in. When you get there and shell out a few gallons or litres worth of money (depending on where you are), you can expect to pay a certain amount of money for each such gallon or litre. How do we determine what the range of expected prices for a gallon or litre of fuel is, based on already available data? This is where the confidence interval comes in – and it is one of the most important tools of inferential statistics. Inferential statistics is the science of making decisions or informed generalizations about some data you have, based on some of the characteristics of this data. An important way to understand variability in any process or product’s performance is to ascribe a range of plausible values. Confidence intervals can be defined as the plausible ranges of values a population parameter may take, if you were to estimate it with a sample statistic. You’d hardly be expected by a statistician to be asked “What plausible range of gas prices do you pay on average?”, but this is, in fact, closer to the truth of interval estimation in statistics. ## Confidence Levels and Sampling Why the word confidence in confidence interval? Well, information costs you something to collect it. If you want to be 100 percent sure about the mean of petrol prices, for instance, you could literally collect data on every transaction from every pump in the world. In the real world, this is impossible, and we require sampling. In the age of Big Data, it seems to be a taboo to talk about sampling sometimes. “You can collect all the data from a process”, some claim. That may be the case for a small minority of processes, but for the vast world out there, characterization is only possible by collecting and evaluating samples of data. And this approach entails the concept of a confidence level. Confidence levels tell us the proportion of times we’re willing to be right (and wrong) about any parameter we wish to estimate from a sample of data. For example, if I measured the price per gallon of gas at every pump in Maine, or Tokyo, for a day, I’d get a lot of data – and that data would show me some general trends and patterns in the way the prices are distributed, or what typical prices seem to be in effect. If I expect to make an estimate of petrol prices in Tokyo or Maine next July, I couldn’t hope to do this with a limited sample such as this, however. Basic economics knowledge tells us that there could be many factors that could change these prices – and that they could very well be quite different from what they are now. And this is despite the quality of the data we have. If I wanted to be 95% confident about the prices of petrol within a month from now, I could use a day’s worth of data. And to represent this data, I could use a confidence interval ( a range of values), of course. So, whether it is the quantity of coffee in your cup, or the price per gallon of fuel you buy, you can evaluate the broader parameters of your data sets, as long as you can get the data, using confidence intervals. ## R Confidence Intervals Example Now, let’s see a simple example in R, that illustrate confidence intervals. #Generate some data - 100 points of data #The mean of the data generated is 10, #The standard deviation we've chosen is 1.0 #Data comes from the gaussian distribution x<-rnorm(100,10,1.0) #Testing an 80% confidence level x80<-t.test(x,conf.level = 0.8) #Testing a 90% confidence level x90<-t.test(x,conf.level = 0.9) #Testing a 99% confidence level x99<-t.test(x,conf.level = 0.99) x80 x90 x99  The first part of the code shows us how 100 points of data are used as a sample in this illustration. In the next part, the t.test() command used here can be used to generate confidence intervals, and also test hypotheses you may have developed. In the above code, I’ve saved three different results, based on three different confidence levels – 80%, 90% and 95%. Typically, when you want to be more certain, but you don’t have more data, you end up getting a wider confidence interval. Expect more uncertainty if you have limited data, and more certainty, when you have more data, all other things being equal. Here are the results from the first test – the 80% confidence interval. 1-sample t-test and confidence interval (80% confidence) Let’s break down the results of the t-test. First, we see the data set the test was performed on, and then we see the t-statistics and also a p-value. Further, you can see a confidence interval $(9.85,10.10)$ for the data, based on a sample size of 100, and a confidence level of 80%. Since the t-test is fundamentally a hypothesis test that uses confidence intervals to help you make your decision, you can also see an alternative hypothesis listed there. Then there’s the estimate of the mean – 9.98. Recall the characteristics of the normal distribution we used to generate the data in the first place – it had a mean $\mu = 10.0$ and a standard deviation $\sigma = 1$. ## Confidence Levels versus Confidence Intervals Summarily, we can see the confidence intervals and mean estimates of the remaining two confidence intervals also. For ease, I’ll print them out from storage. CIs comparison (80%,90%, 99%) Observe how, for the same data set, confidence intervals (plausible ranges of values for the mean of the data) are different, depending on how the confidence level changes. The confidence intervals widen as the confidence level increases. The 80% CI is calculated to be $(9.85, 10.10)$ while the same sample yields $(9.72, 10.24)$ when the CI is calculated at 99% confidence level. When you consider that standard deviations can be quite large, what confidence level you use in your calculations, could actually become a matter of importance! More on how confidence intervals are calculated here, at NIST. ## The 1-sample t-Test We’ve seen earlier that the command that is invoked to calculate these confidence intervals is actually called t.test(). Now let’s look at what a t-test is. In inferential statistics, specifically in hypothesis testing, we test samples of data to determine if we can make a generalization about them. When you collect process data from a process you know, the expectation is that this data will look a lot like what you think it should look like. If this is data about how much coffee, or how much you paid for a gallon of gasoline, you may be interested in knowing, for instance, if the price per gallon is any different here, compared to some other gas station. One way to find this out – with statistical certainty – is through a t-test. A t-test fundamentally tells us whether we fail to reject a default hypothesis we have (also called null hypothesis and denoted as $H_0$ ), or if we reject the default hypothesis and embrace an alternative hypothesis (denoted by $H_a$). In case of the 1-sample t-test, therefore: $H_o : \mu = k \newline H_a : \mu \neq k$ Depending on the nature of the alternative hypothesis, we could have inequalities there as well. Note that $k$ here is the expectation you have about what the mean ought to be. If you don’t supply one, t.test will calculate a confidence interval and produce a mean estimate. A t-test uses the t-distribution, which is a lot like the Gaussian or normal distribution, except that it uses an additional parameter – which directly relates to the sample size of your data. Understandably, the size of the sample could give you very different results in a t-test. As with other hypothesis tests, you also get a p-value. The null hypothesis in a 1-sample t-test is relatively straightforward – that there is no difference between the mean of the sample in question, and the population’s mean. Naturally, the alternative of this hypothesis could help us study whether the population mean is less than expected (less expensive gas!) or greater (more expensive gas!) than the expectation. Let’s take another look at that first t-test result: 1-sample t-test and confidence interval (80% confidence) The confidence level we’ve calculated here is an 80% confidence interval, which translates to a 20% significance. To interpret the results above, we compare the value of p, with the significance, and reject the null hypothesis is p is smaller. But what are the t-statistic and df? The t-statistic here is calculated as the critical value of t, based on a confidence level of 80%, the sample mean and standard deviation, and of course, the fact that we have 100 points of data. (The “df” here stands for degrees of freedom – which stands at 99, calculated from the 100 data points we have and the 1 parameter we’re estimating.) ## Alternative Hypotheses Inequalities The t.test() command also allows us to evaluate how the confidence intervals look, and what the p-values are, when we have different alternative hypotheses. We can test for the population mean that’s being estimated to be less than, or greater than the expected value. We can also specify what our expected values of mean are.  x80<-t.test(x,conf.level = 0.8, mu = 10.1,alternative = "less" ) x80  Observe how the p-value, confidence intervals have changed. We’ve evaluated the same 80% confidence intervals, with different expected values of the mean of $\mu = 10.1$, and the alternative hypothesis is that this mean $\mu<10.1$. ## Concluding Remarks When evaluating data to draw conclusions from it, it helps to construct confidence intervals. These tell us general patterns in the data, and also help us estimate variability. In real-life situations, using confidence intervals and t-tests to estimate the presence or absence of a difference between expectation and estimate is valuable. Often, this is the lifeblood of data-driven decision making when dealing with lots of data, and when coming to impactful conclusions about data. R’s power in quickly generating confidence intervals becomes quite an ally, in the right hands – and of course, if you’ve collected the right data. # Normality Tests in R When we see data visualized in a graph such as a histogram, we tend to draw some conclusions from it. When data is spread out, or concentrated, or observed to change with other data, we often take that to mean relationships between data sets. Statisticians, though, have to be more rigorous in the way they establish their notions of the nature of a data set, or its relationship with other data sets. Statisticians of any merit depend on test statistics, in addition to visualization, to test any theories or hunches they may have about the data. Usually, normality tests fit into this toolbox. Histogram: Can this graph alone tell you whether your data is normally distributed? Normality tests help us understand the chance that any data we have with us may have come from a normal or Gaussian distribution. At the outset, that seems simple enough. However, when you look closer at a Gaussian distribution, you can observe how it has certain specific properties. For instance, there are two main parameters – a location parameter, the mean, and the scale parameter, the standard deviation. Different combinations of this can mean different shapes of distributions. You can therefore have thin and tall normal distributions, or have fat and wide normal distributions. When you’re checking a data set for normality, it helps to visualize the data too. ## Normal Q-Q Plots #Generating 10k points of data and arranging them into 100 columns x<-rnorm(10000,10,1) dim(x)<-c(100,100) #Generating a simple normal quantile-quantile plot for the first column #Generating a line for the qqplot qqnorm(x[,1]) qqline (x[,1], col=2)  The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column. Q-Q Plot (Normal) The Q-Q plot tells us what proportion of the data set (in this case, the first column of variable x), compares with the expected proportion (theoretically) of the normal distribution model based on the sample’s mean and standard deviation. We’re able to do this, because of the normal distribution’s properties. The normal distribution is thicker around the mean, and thinner as you move away from it – specifically, around 68% of the points you can expect to see in normally distributed data will only be 1 standard deviation away from the mean. There are similar metrics for normally distributed data, for 2 and 3 standard deviations (95.4% and 99.7% respectively). However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality. You may ask, what does non-normal data look like in this plot? Here’s an example below, from a binomial distribution, plotted on a Q-Q normal plot. QQ-Normal plot – observe how binomial distribution data displays categories and extremes ## Anderson Darling Normality Test As one of the commonly used normality tests, this is very commonly used to tell us whether or not a sample may represent normally distributed data. This is done in R by using the ad.test() command, in the nortest package. So, if you don’t have the ad.test command popping up on your R-studio auto-complete, you can easily install it via nortest on the “install.packages” command. Running the Anderson-Darling test for normality generally returns a bunch of data. Here’s how to make sense of it.  #Running the A-D test for first column library(nortest) ad.test(x[,1])  Anderson Darling Normality Test results The data from the A-D test tells us which data has been tested, and two results: A and p-value. The A result refers to the Anderson-Darling test statistic. The A-D test uses this test statistic to calculate the probability that this sample could have come from a normal distribution. The A-D test tests the default hypothesis that the data (in this case the first column of x), comes from a normal distribution. Assuming that this hypothesis is true, the p-value we see here tells us the probability that we can see the data we see in this sample purely by random chance. That is to say, in this case, we have a 50.57% probability of seeing the same kind of data from this process, assuming that the process in question does represent normally distributed data. Obviously, such a high chance of normality is hard to ignore, which is why we fail to reject this hypothesis we had originally, that the data does come from a normal distribution. For another sample, if the p-value were around 3%, for instance, that would mean a 3% chance of seeing the same data from a normal distribution – which is obviously a very low chance. Although a 5% chance is still a small chance, we choose that as the very bottom end of our acceptability and should ideally subject such data to scrutiny before we proceed to do draw inferences from it. P-values can be confusing for some and hard to interpret – I usually try to construct a sentence to interpret the p-value’s meaning in the context of the hypothesis that’s being tested. I’ll write more on this in a future post on hypothesis testing. ## Interpreting and Understanding A-D test Results Naturally, as the p-values from an Anderson-Darling normality test become smaller and smaller, there is a smaller and smaller chance that we are looking at data from a normal distribution. Most statistical studies peg the “significance” level at which we reject the default hypothesis (that this data comes from a normal distribution) outright, at p-values of 0.05 (5%) or lesser. I’ve known teams and individuals that fail to reject this hypothesis, all the way down to p-values of 0.01. In truth, one significance value (often referred to as $\alpha$) doesn’t fit all applications, and we have to exercise great caution in rejecting null hypotheses. Let’s go a bit further with our data and the A-D test: we will not perform the A-D test on all the columns of data we’ve got in our variable x. The easiest way to do this in R is to define a function that returns the p-values from each column, and use that in an “apply” command.  #Running the A-D test for first column library(nortest) #defining a function called "adt" to run the A-D tests and return p-values adt<-function(x){ test<-ad.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable



The code above analyzes the samples in x (as columns) and returns their p-values as columns. So, what do you expect to see when you summarize the variable “pvalues” which stores the test results?

p-values summary (columns of x)

When you summarize p-values, you can see how approximately 9 of the 100 don’t pass the significance criteria (of p>=0.05). You can also see that the p-values in this set of randomly generated samples are randomly distributed over the entire range of probabilities from 0 to 1. We can visualize this too, by plotting the variable “pvalues”.


#Plotting the sample p-values and drawing a significance line
plot(pvalues, main = "p-values for columns in x (AD-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")



p-values for columns in x

## Other tests: Shapiro-Wilk Test

The Anderson-Darling test isn’t the only one available in the nortest package for assessing normality. Statisticians and engineers often use the Shapiro-Wilk test of normality also. For similar data used above (generated as random numbers from the normal distribution), the Shapiro-Wilk test can be performed, with only a few changes to the R script (which is another reason R is so time-efficient).


#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)

#Running the A-D test for first column
library(nortest)
#defining a function called "swt" to run the Shapiro-Wilk tests and return p-values
swt<-function(x){ test<-shapiro.test(x); return(test$p.value) } #store the p-values for each column in a separate variable swpvalues<-apply(x,MARGIN = 2, swt) #Plotting the sample p-values and drawing a significance line plot(swpvalues, main = "p-values for columns in x (Shapiro-Wilk-test)", xlab = "Column number in x", ylab = "p-value") abline(h=0.05, col ="red")  Shapiro-Wilk test p-values (for columns in x) Observe than in these samples of 100 points each, the Shapiro Wilk tests returns 96 samples as being normally distributed data, while rejecting 4 (the four dots below the red line). We can’t be sure in this example whether the Shapiro Wilk test is better than the A-D test for normality assessments, however, because these are randomly generated data sets. If we run these tests side-by-side, however, we may get to see interesting results. When I ran these tests side by side, I got very similar number of significantly different samples (non-normal samples) – either 4 or 5 out of the total 100 – from both tests. ## Concluding Remarks So, what does all this mean for day-to-day data analysis? • Data analysis of continuous (variable, as opposed to yes/no, or other attribute data) data often uses tools that are meant for normally distributed data • Visualization alone isn’t sufficient to estimate the normality of a given set of data, and test-statistics are very important for this • The nortest package in R provides a fast and convenient way to assess samples for normality using tools like A-D and S-W tests • A-D and S-W tests tend to perform similarly for most data sets (however, research is being done on this) • The re-usability of R code allows you to set up a macro rather quickly for performing normality tests and a range of studies on test data, in a time-efficient manner # Six Sigma and R As software primarily meant for statistical analysis, R obviously has uses in statistical and data driven problem solving, especially the kind seen in Six Sigma programs. A product of large manufacturing firms that aspired to attain high capability processes with low defects, Six Sigma started out as a child of the Total Quality Management movement and became infused with the statistical methods that were being developed for process analysis and system analysis in the 1980s. ## Background Drawing upon an impressive range of statistical tools, early Six Sigma programs in various companies trained C-suite executives in data driven problem solving, enabling them to use tools such as control charts, capability analysis, and other tools from the TQM movement. While Six Sigma attracted a lot of attention in its hey day, it has its fair share of critics to this day, who criticize the institutionalization of knowledge and the hierarchical approach to problem solving that it seemed to promote. It was often criticized as a flavour-of-the-month program in many companies (and that may indeed have been a fair criticism at times). Negatives aside, Six Sigma still attracts a number of engineers, technologists and managers from various industries, even outside the manufacturing industry, which has traditionally been its bastion. This is because of the rigour that has been developed over the years in definition, measurement and analysis of problems, and the notion that being data driven is a better way to run businesses, than running business by pure experience (or, on a lighter note, chance, luck or favourable risks). The R language has a number of packages which are suited for the specific kinds of analysis that Six Sigma professionals require. Even if you aren’t a Six Sigma professional (or want nothing to do with it), you can benefit from the many statistical tools and methods that it has made popular. The installed large “user base” of practitioners means that these tools tend to have standard reporting formats. So it does help to have a package which has a range of functionality and standard output. ## R Packages Two packages in R which offer tools and methods for the Six Sigma professional are qcc and SixSigma. Some Six Sigma projects that use time series analysis may benefit from native packages in R, and forecasting packages such as the one developed by Rob Hyndman (forecast). In this post, I’ll walk through how to use qcc for a simple process capability analysis. ## Capability Analysis: 101 A process capability analysis is, in the simplest of terms – a comparison between how your process performs, and the range of specifications that have been agreed upon as its specification limits. These specification limits indicate the acceptable range over which the process should perform. The traditional view is that any process output between these specification limits is considered good, and anything outside that isn’t. Innate process variation results in defects Any process that has variation (and lots of processes – in nature and in industry do) occasionally produces results that don’t meet the requirements/specifications intended for it. What process capability methods do is to allow us to measure these variations with some end result in mind. These end results may be something we may be concerned about as someone interested in the process, or it could be someone else’s process that we may be interested in. In general, a careful, step-wise approach is required in order to evaluate capability correctly. Here’s a chart that sums up how to go about it. If that looks complicated, let me break it down – it only consists of three important steps: assessing normality, assessing stability, and assessing capability. Normality tests tell us whether or not our data may have come from a normal distribution (also known as a Gaussian distribution) of some kind. It turns out that a lot of processes in nature and industry can be modeled using the normal distribution. (Unsurprising then, that the analyses we are doing here are contingent upon it). By some kind of normal distribution, I mean some mean and some standard deviation (since these are the two parameters that make the normal distribution look lean and tall – or fat and spread out). The stability tests are essentially control charts, that tell us if we can expect to see consistent results in our process. Control charts were invented several decades ago by Walter Shewart, and they’re still in use. The final test is the capability analysis itself. Put in a simple way, this compares the specification width to the process variation. ## Process Capability in R For those who are used to the point-and-click interface of user-friendly software like Minitab and MATLAB, R may seem like an unlikely candidate for doing all this in. But the flexibility and modularity of R shine through especially when doing a complicated analysis like this, with multiple steps. The approach I’ll describe here simply replicates what you see in flow chart above, for a sample of data. For the purpose of this analysis, I’ll describe how it is done in R-studio, with the qcc and nortest packages installed. You can do this by firing up R-studio and then using the “install.packages” command. #Import process measurements from a file #Data are in 99 samples of 100 measurements each library(qcc) library(nortest) x<-read.csv("Process capability data.csv", header = TRUE) x<-as.data.frame(x) x<-x[,2:100] View(x) #Assess normality using the Anderson-Darling normality test temp = NULL for (i in 2:length(x[,1])) { temp <-append(temp,ad.test(x[,i])$p.value, after = length(temp))

if (sum(temp>0.05)&>=length(x[,1])*0.95){
qcc(x, type = "xbar",plot = TRUE)
qcc(x, type = "S",plot = TRUE)
}
}

#The above control charts seem to be "in control" - for only the +/- 3 sigma test as the criterion
#We can now proceed to perform a capability analysis
x<-qcc(x,type = "xbar", nsigma = 3, plot = FALSE)
process.capability(xqcc, spec.limits = c(9,10.5))


In the first section of the code, we’re looking at data being pulled from a CSV file. This data is then formatted as a data frame. At this stage, we haven’t used any of the visualization that we could within R, such as boxplots, or histograms for the different data columns.  Even at this stage, when you’ve pulled in the data, it is perfectly possible to run a histogram:


hist(x[,1], xlab = "Sample 1", col = "blue",main = "Histogram of Sample 1" )



Graphical examination reveals a structure similar to the eponymous bell curve of the normal distribution. However, for statistical purposes, we generally use normality tests such as the Anderson-Darling or the Shapiro-Wilk tests. Think of these tests as a way for checking whether our data is thick around the middle and thin at the sides, just like the normal distribution. These tests are in the package “nortest“, which I have called using the “library()” command.

Now let’s talk through the rest of the code. Since there are 99 subgroups of data here, it makes more sense to use the Anderson-Darling normality test to numerically examine the normality of the data. With a 5% significance, we’d expect around 95% of these 99 samples to come from a normal distribution. (I say this with confidence especially as I put together the sample data set from the normal distribution. In a real-life scenario, of course, you’d have to evaluate samples based on their normality). If you are wondering what that 5% was all about, I’ll talk about significance (and confidence levels) in a later post.

Once we have our normality assessed (and okay), we can put together a pair of control charts. We use the X-bar and S charts to assess process stability. X-bar is the notation for the average value of each sample (which we should hope won’t change a great deal between samples in the same process), and S is the standard deviation of each sample (which we hope stays as consistent and low as possible). Now, let’s look at the X-bar and S charts we generated. Observe how they have black dots representing each sample’s statistic (X-bar or S) and three lines – two dotted lines, called the “control limits” and the centre line.

X-bar Chart for the Sample

S chart for the sample

There are a number of rules for evaluating process stability in control charts, but the main and most widely followed one always is – “none of the points in your samples should lie outside the control limits”. There are a series of rules based on the properties of the normal distribution (based on which the control charts are constructed) that use the patterns of points, series of points that indicate constant rise or constant reduction in output (a process trend), and so on. However, for the purposes of demonstration, let’s agree that the process results in this case represent a stable process. The S chart here shows a consistent standard deviation, and the X-bar chart here shows a consistent mean. The center line in each chart simply illustrates the mean of means, and the mean of standard deviations, respectively, for X-bar and S charts.

Now that the stability assessment is also complete, the next command in the sample code helps us assess the capability of this process. Usually, in a manufacturing organization, there are extensive quality plans that help you understand what the specification limits are.

Naturally, capability assessment shouldn’t be done lightly by (goodness forbid) assuming specification limits. Design engineers and teams sometimes use the tolerance limits (or tighter limits than them) to assess the manufacturing process in question. Overall, though, this is what a capability analysis looks like. In our example, the data is is being compared to the USL (upper spec limit) of 10.5, and the LSL (lower spec limit) of 9.0. Presumably these are based on the drawings and documents alluded to earlier in your analysis too. The indices in the third column – $C_p, C_pl, C_pu, C_pk, and C_pm$– indicate different capability indices. Observe how they’re all different, and how Cpl is the highest. Now, if you observe the data in the capability plot, you’ll note that the peak is a little to the right of the centre line. There is therefore an asymmetry in the way the process behaves with respect to the center of the specification limits.  The higher the capability indices are, the better. If you were to look up the Automotive Industry Action Group manual on Statistical Process Control, you may find the appropriate values for the automotive industry. There may be different required process capability indices for different industries, different manufacturing processes, parts and so on.

## Interpreting Process Capability

Process capability is generally a very important part of Six Sigma projects and many managers in manufacturing organizations tend to use the capability metrics alluded to (and other ones such as $P_p, P_pk$ and Z-scores) as measures of process quality. Naturally, a higher process capability is better, because it means that our process will tend to produce results more often, between the agreed-upon specification limits. Why do we have so many indices? And why are they all different? They’re different because they measure different things. Processes need not only have small variations, but should also be centered on specifications. And processes should be consistent in their results in the short and long terms – and since each of these criteria have different capability indices, the number of indices swells.

Process capability is an important concept to know for service organizations too. Your process may not follow the normal distribution like manufacturing or assembly processes may, and you may deal with boolean (1-0) process results or ordinal results (such as the performance of a multi-modal switch, or a processor) – and even for these situations, there are other capability analysis metrics and methods.

## Concluding Remarks

We’ve seen how R’s power can be leveraged to conduct process capability analysis. R’s benefit over other commercial software seems to be the modularity, which could bring you a new way to do these charts, perhaps with better fonts, or formatting, or better functionality, and the redoubtable repeatability and speed, like a well-oiled machine. As caveats, it is important to evaluate normality and stability prior to capability. If you have data that isn’t normally distributed, or that couldn’t be considered stable from its control charts, you may have to take more fundamental actions to improve your process’ results and consistency. The Six Sigma tool set also consists of a number of other tools – including inferential statistics and sometimes, times series methods. I’ll cover how these may be carried out in R in the next post, with the SixSigma package and its commands.

# Data Science: Beyond the Hype

While there is justifiable excitement in the technology industry (and other industries) these days on the widespread availability of data, and the availability of algorithms to process and make sense of this data, I sincerely think (like many others) that the hype behind Big Data is somewhat unfounded.

For many decades, “small data” have been studied in science and industry with the intent of constructing mathematical models, i.e., approximate, error-prone mathematical representations of phenomena. In some ways, the scientific method is all about such data analysis. We often hear in the news about the amplification of effects, the “truth inflation” observed when drawing conclusions from small data sets, to make broader generalizations. We hear about the lack of enough data impeding the progress of research, we also hear about fabricated data and spurious research results. A lot of scientific findings have come under scrutiny for these reason – and perhaps analysis of population data (as Big Data promises to do) may help this situation. However, the key difference between the past decades of statistics – from legends such as Fisher and George Box, to present day stalwarts in applied statistics and machine learning like Nate Silver, Sebastian Thrun and Andrew Ng, is the ability to leverage computing to analyse large data sets.

A lot of the discussion around Big Data seems to be on the so-called four Vs of Big Data – volume, velocity, variety – and increasingly, veracity – referring to the increasing speed and range of data generated in the information age. However, what’s forgotten often enough, is that below the hype, below the machine learning algorithms and below the databases and technologies, we still have the same underlying principles.

The types of data, the mathematical methods we use to evaluate them, and the fundamental concepts thereof are unchanged – and understanding this is often the key between knowing whether and when to sample from your big data set, or not. This is more important than we realize, because sampling is not obsolete. Often, well collected samples of data may be more than sufficient for establishing or testing a certain hypothesis we may have.

In my view, newcomers to the data science and big data revolutions ought to consider a course in statistics, statistical thinking and statistical reasoning first. This lays the foundation for everything else that follows. The internet and most developed and even developing countries are awash with resources that can enable individuals to learn programming and computer-based problem solving, but critical thinking and statistical thinking seem to be harder skills to learn.

Statistical thinking not only requires a level of mathematical rigour but an ability to embrace notions of uncertainty, probabilistic thinking and a fundamental change in one’s notions of cause and effect. Perhaps this is a big step for many. The relative certainty of the logic of programming languages may actually be welcoming to many – which is probably also why we see more discussions about Hadoop and Spark and not enough discussions about statistical hypothesis testing or time series auto-correlation models.

So, if you want to cut through the hype, see data science for what it is, by breaking it up into its elements – the data (which may be coming in from ever more diverse sources), the tools (algorithms, computers) and the science (which is, in this case, statistics). Not everyone is a data scientist, as some articles on the web have begun to claim, but it isn’t only a specific set of skills that makes one a “data scientist”. Some say that these data scientists are glorified statisticians, some say that they’re statistically competent programmers well versed in machine learning, but the truth is probably somewhere in between.

Furthermore, data visualization – another aspect of the data science hype – is both an art and a science – which perhaps implies that you can both be enlightened and obfuscated by charts and graphs. In my view, knowledge/abilities in visualization alone doesn’t make you a data scientist (nor does, for instance, knowledge of machine learning methods alone or skills in programming R for ETL purposes alone). When you cut through the hype here, what’s pragmatic is to be able to acquire a wide array of skills – and depth in some. Like many engineers in fields of technology or engineering, who may have a wide swath of knowledge but expertise in only a few areas, this is the most likely role that most data scientists may have.

There’s definitely more that can be said about specific aspects of the data science “movement”, but what is certain is that a knowledge of the science of statistics underlying most of the science cannot be underestimated in its value and relevance in the present day. Statistics, hopefully, will become as important as learning a language or developing an ability to have conversations, or write a well argued paragraph.

# [1] “Hello world!”

This is a blog that chronicles some of my ideas and thoughts about statistics, the R programming language for statistical computing, and related topics.

I’m an engineer and quality manager by profession and training, and increasingly a keen follower of the data revolution that we are witnessing today. As a veteran of many quality initiatives and programs in large organizations, I’ve come to understand the importance of data driven problem solving in driving good decision quality, and have come to appreciate the role of data in solving complex engineering and business problems.

I believe that statistical numeracy and an ability to constantly learn and challenge one’s own mental models are increasingly linked, and that these are vital skills in future human societies of any kind. As someone who isn’t a trained statistician that is deeply interested in statistics, this space is to me like a sounding board and a personal log. I hope to maintain ideas, observations and more, as I learn more and more about statistics and statistical models, data driven problem solving, data visualization and machine learning.

My current interests in the broad field of data science are the use of R for statistical analysis of various kinds, with a focus on statistical quality control, significance and hypothesis testing and time series analysis. Increasingly, I’ve developed a keen interest in forecasting methods, and the uses of machine learning in these problems. Coming from the Minitab user base, I’m finding R interesting, rewarding and challenging – all at the same time!

I believe chronicling some of my thoughts here along my journey will help me and others who may be walking down the same path!