Comparing Non-Normal Data Graphically and with Non-Parametric Tests

Not all data in this world is predictable in the exact same way, of course, and not all data can be modeled using the Gaussian distribution. There are times, when we have to make comparisons about data using one of many distributions that represent data which could show different patterns other than the familiar and comforting “bell curve” of the normal distribution pattern we’re used to seeing in business presentations and the media alike. For instance, here’s data from the Weibull distribution, plotted using different shape and scale parameters. A Weibull distribution has two parameters, shape and scale, which determine how it looks (which varies widely), and how spread out it is.


shape <- 1
scale <- 5
x<-rweibull(1000000,shape,scale)
hist(x, breaks = 1000, main = paste("Weibull Distribution with shape: ",shape,", and scale: ",scale))
abline (v = median(x), col = "blue")
abline (v = scale, col = "red")
2015-08-22 18_04_01-Action center

Shape = 1; Scale = 5. The red line represents the scale value, and the blue line, the median of the data set.

Here’s data from a very different distribution, which has a scale parameter of 100.

2015-08-22 18_07_12-New notification

Shape = 1; Scale = 100. Same number of points. The red and blue lines mean the same things here too.

The shape parameter, as can be seen clearly here, is called so for a good reason. Even when the scale parameter changes wildly (as in our two examples), the overall geometry of our data looks similar – but of course, it isn’t. The change in the scale parameter has changed the probability of an event x ->0 towards the lower end of the x range (closer to zero), compared to an event x>>>0 further away. When you superimpose these distributions and their medians, you can get a very different picture of them.

If we have two very similar data sets like the data shown in the first graph and the data in the second, what kinds of hypothesis tests can we use? It is a pertinent question, because at times, we may not know that a data set may represent a process that can be modeled by a specific kind of distribution. At other times, we may have entirely empirical distributions represented by our data. And we’d still want to make comparisons using such data sets.

shape <- 1
scale1 <- 5
scale2<-scale1*2
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different shape parameters: ",shape/100,", ", shape))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = scale, col = "red")

Different scale parameters. Red and blue lines indicate medians of the two data sets.

Different scale parameters. Red and blue lines indicate medians of the two data sets.

The Weibull distribution is known to be quite versatile, and can at times be used to approximate the Gaussian distribution for real world data. An example of this is the use of the Weibull distribution to approximate constant failure rate data in engineering systems. Let’s look at data from a different pair of distributions with a different shape parameter, this time, 3.0.

shape <- 3
scale1 <- 5
scale2<-scale1*1.1 #Different scale parameter for the second data set
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")
Weibull distribution data - different because of scale parameters. Vertical lines indicate medians.

Weibull distribution data – different because of scale parameters. Vertical lines indicate medians.

The medians can be used to illustrate the differences between the data, and summarize the differences observed in the graphs. However, when we know that a data set is non-normal, we can adopt non-parametric methods from the hypothesis testing toolbox in R. Like hypothesis tests for normally distributed data that have comparable means, we can compare the medians of two or more samples of non-normally distributed data. Naturally, the same conditions – of larger samples of data being better, at times, apply. However, the tests can help us analytically differentiate between two similar-looking data sets. Since the Mann-Whitney median test and other non-parametric tests don’t make assumptions about the parameters of the underlying distribution of the data, we can rely on these tests to a greater extent when studying the differences between samples that we think may have a greater chance of being non-normal (even though the normality tests may say otherwise).

Non-parametrics and the inferential statistics approach

Non-parametrics and the inferential statistics approach: how to use the right test

When we conduct the AD test for normality on the two samples in question, we can see how these samples return a very low p-value each. This can also be confirmed using the qqnorm plots.

Let’s use the Mann-Whitney test for two medians from samples of non-normal data, to assess the difference between the median values. We’ll use a smaller sample size for both, and use the wilcox.test() command. For two samples, the wilcox.test() command actually performs a Mann-Whitney test.

shape <- 3
scale1 <- 5
scale2<-scale1*1.01
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
library(nortest)
paste("Normality test p-values: Sample 'x' ",ad.test(x)$p.value, " Sample 'xprime': ", ad.test(xprime)$p.value)

hist(x, breaks = length(x)/10, border = rgb(0.9,0.2,0.2,0.05), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = length(xprime)/10, border = rgb(0.2,0.9,0.2,0.05), col = rgb(0.2,0.2,0.9,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")
wilcox.test(x,xprime)
paste("Median 1: ", median(x),"Median 2: ", median(xprime))

Observe how close the scale parameters of both samples are. We’d expect both samples to overlap, given the large number of points in each sample. Now, let’s see the results and graphs.

Nearly overlapping histograms for the large non-normal samples

Nearly overlapping histograms for the large non-normal samples

The results for this are below.

Mann-Whitney test results

Mann-Whitney test results

The p-value here (for this considerable sample size) clearly illustrates the present of a significant difference. A very low p-value in this test result indicates that, if we were to make the assumption that the medians of these data sets are equal, there would be an extremely small probability, that we would see samples as extreme as observed in these samples. The fine difference in the medians observed in the median results can also be picked up in this test.

To run the Mann-Whitney test with a different confidence level (or significance), we can use the following syntax:

wilcox.test(x,xprime, conf.level = 0.95)

Note 1 : The mood.test() command in R performs a two-samples test of scale. Since the scale parameters in these samples of data we generated (for the purposes of the demo) are well known, in real life situations, the p-value should be interpreted based on additional information, such as the sample size and confidence level.

Note 2:  The wilcox.test() command performs the Mann Whitney test. This is a comparison of mean ranks, and not of the medians per se.

Hypothesis Tests: 2 Sample Tests (A/B Tests)

Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other.

It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data.

Simple 2-sample t-test

A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them. So, any variation you see in the data is purely random variation. The alternative hypothesis in this test, is of course, that it isn’t only random variation we’re seeing, and that these samples come from completely different populations altogether.

H_o : \mu_1 = \mu_2    \newline    H_a: \mu_1 \neq \mu_2

What the populations from which X1 and X2 are taken may look like

What the populations from which X1 and X2 are taken may look like

We’ll generate two samples of data x_1, x_2 from two different normal distributions for the purpose of demonstration, since normality is a pre-requisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for estimating the median, such as the Moods-Median or Kruskall-Wallis test – which I’ll blog about another time). We also have to ensure that the standard deviations of the two samples of data we’re testing are comparable. I’ll also demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective.

library(nortest)
#Generating two samples of data
#100 points of data each
#Same standard deviation
#Different values of mean (of sampling distribution)
x<-read.csv(file = "x1x2.csv")
x1<-x[,2]
x2<-x[,3]

#Setting the global value of significance 
alpha = 0.1

#Histograms
hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = seq(7,15,0.25), main = "Histogram of x1 and x2", xlab = "x1, x2")
abline(v=10, col = "orange")
hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = seq(7,15,0.25), add = T)
abline(v=12, col = "purple")

#Running normality tests (just to be sure)
ad1<-ad.test(x1)
ad2<-ad.test(x2)

#F-test to compare two or more variances
v1<-var.test(x1,x2)

if(ad1$p.value>=alpha & ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")  
    }
    
  }
}

The first few lines in the code merely include the “nortest” package and invoke/generate the data sets we’re comparing. The nortest package contains the Anderson Darling Normality Test, which we have also covered in an earlier post. We can generate a histogram, to understand what x_1 and x_2 look like.

Histogram of X1 and X2 - showing the reference population mean lines

Histogram of X1 and X2 – showing the reference population mean lines

The overlapping histograms of x_1 and x_2 clearly indicate the difference in the central tendency, and the overlap is also visible. Subsequent code above covers an F-test. As explained earlier, equality of variances is a pre-requisite for the 2-sample t-test. Failing this would mean that we essentially have samples from two different populations, which have two different standard deviations.

Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is present in the “stats” package, runs, and provides us a result. Let’s look closely at how the t.test command is constructed. The arguments contain x_1 and x_2, which are our two samples for comparison. We provide the argument “paired = FALSE”, because these are not before/after samples. They’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of \alpha, or significance level.

Now that we’ve seen what the code does, let’s look at the results.

2-sample t-test results

2-sample t-test results

Evaluating The Results

Two sample t-test results should be evaluated in a similar way to 1-sample t-tests. Our decision is dependent on the p-value we see in the result, and the confidence interval of the difference between sample means.

Observe how the difference estimate lies on the negative side of the number line. Difference is calculated from the populations means 10 and 12, so we can clearly understand why this estimate of difference is negative. The estimates for mean values of x and y (in this case, x_1 and x_2) are also given. Naturally, the p-value that’s in the result, when compared to our generous \alpha of 0.1, is far lesser, and we can consider this to be a significant result (provided we have sufficient statistical power – and we’ll discuss this in another post). This indicates a significant difference between the two sets. If x_1 and x_2 were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for x_2 when compared to x_1.

Detecting a Specific Difference

Sometimes, you may want to evaluate a new product, and see if it performs at least x% better than the old product. For websites, for instance, you may be concerned with loading times. You may be concerned with code runtime, or with vehicle gas mileage, or vehicle durability, or some other aspect of performance. At times, the fortunes of entire companies depend on them producing faster, better products – that are known to be faster by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data.

The same example above can be modified slightly, to test for a specific difference. The only real difference we have to make here, of course, is the value of \Delta or difference. The t.test() command in R unfortunately isn’t very clear on this – it expects you to understand that you should use \mu for this. Once you get used to it, however, this little detail is fine, and it delivers the expected result.

if(ad1$p.value >=alpha & ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")  
    }
 }
}

The code above prints out different p-values, for different tests. The data used in these tests is the same, but by virtue of the different differences we want to detect between these samples, the p-values are different. Observe the results below:

Differences and how they influence p-value (same two samples of data)

Differences and how they influence p-value (same two samples of data)

Since the data was generated from two distributions that have means of 10 and 12 respectively for x_1 and x_2, we know from intuition that the difference is -2, and we should start seeing results that indicate no difference between the expected and observed difference at this value in the test. Therefore, the p-values in this scenario will be greater than the significance value, \alpha.

For other scenarios – when \Delta = -1, 0, 1, 2, we see that the p-values are clearly far below the significance of \alpha = 10%.

What’s important to remember therefore, is that contrary to what many people may think, there is no one or best p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference expected.

Concluding Remarks

A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and understand if one of them is performing significantly better than another. The test is strictly for data that fits the normality criteria, that also happen to have comparable standard deviations, and the results from it tend to be impacted heavily by the kind of hypothesis we use – for difference (which we explored here) and for one or two sided comparisons. We explored only the two sided comparisons here (and hence constructed a two sided confidence interval). When a business uses a 2-sample t-test, some of the arguments here, such as the values of confidence level, difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size. The smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size.