Johnson Transformation for Non-Normal Data

A number of inferential statistical tests (A/B tests and significance tests) assume that the underlying data sets we’re comparing come from a normal (Gaussian) distribution. In practice, however, this often isn’t true. In order to use the tools that assume normality, we have to transform the data (and the limits or reference points being compared).

Background

The purpose of transformation, then, is to ensure that the data set we have satisfies the minimum assumptions made by the hypothesis tests we want to conduct. In frequentist statistics, where we use statistical distributions to model a process and describe its aggregate behaviour (rather than taking Bayesian approaches), it is useful to keep such transformation tools at hand.

Let’s generate a sample data set, plot it, and assess its normality using a QQ-norm plot. We’ll also use the standard Shapiro-Wilk test, which is a powerful test for normality of a data set.

#Generating a Weibull-distributed sample
x <- rweibull(1000, 2, 5)

#Plotting and visualizing the data
hist(x, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))

#Normal QQ plot of the raw data
qqnorm(x)
qqline(x, col = "red")

#Shapiro-Wilk test for normality
shapiro.test(x)
Simple histogram of the sample. Observe how the data set exhibits skewness and appears quite non-normal.

	Shapiro-Wilk normality test

data:  x
W = 0.9634, p-value = 3.802e-15

The Shapiro-Wilk test results certainly confirm that the data set is non-normal. Now let’s look at a QQ-Norm plot.

QQ-norm plot of x (with QQ line). Non-normality is evident from the extreme values in the data.

Our objective now is to transform this data set into one on which we can perform operations meant for normally distributed data. The benefits of being able to transform data are many-fold: chiefly, it allows us to conduct capability and stability analyses, in addition to hypothesis tests like t-tests. Naturally, the reference points used in these tests will also have to be transformed, in order to make meaningful comparisons.
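
As a quick illustration of this last point, here is a minimal sketch of a one-sample t-test carried out on a transformed scale. The log function and the target value of 6 used below are purely illustrative assumptions; whichever transformation is finally chosen must be applied to the reference value as well.

#Illustrative sketch only: testing against a reference value on a transformed scale
#The target value of 6 is a hypothetical specification, not part of the original data
target <- 6
t.test(log(x), mu = log(target))  #both the data and the reference point are transformed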

Log Transformations

The Log and Square Root functions are commonly used to transform positive data. In our case, since we have data from the Weibull distribution, we can explore the use of the log function and observe its effectiveness at transforming the dataset.

#Transforming data using the log function
x_log <- log(x)

#Plotting the transformed data
hist(x_log, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))

#Shapiro-Wilk test for normality of transformed data
shapiro.test(x_log)

#Normal QQ Plot
qqnorm(x_log)
qqline(x_log)

	Shapiro-Wilk normality test

data:  x_log
W = 0.9381, p-value < 2.2e-16

Histogram of x_log

Quite clearly, the results from this aren’t too promising. The data set created as x_log doesn’t exhibit normality, given that the p-value in the normality test is extremely small. This means that, if the log-transformed data really did come from a normal distribution (the null hypothesis of the normality test), there would be an extremely small chance of seeing a sample like this one.

The Johnson R Package

The Johnson R package gives us access to a tried and tested set of transformation approaches. The package contains a number of useful functions, including a normality test (the Anderson-Darling test, which is comparable in power to Shapiro-Wilk) and the Johnson transformation function. The Johnson package can be installed using the install.packages() command in R.

library(Johnson)
#Running the Anderson-Darling Normality Test on x
adx <- RE.ADT(x)
adx

#Running the Johnson Transformation on x
x_johnson <- RE.Johnson(x)

#Plotting transformed data
hist(x_johnson$transformed, breaks = 25, col = rgb(0.9, 0.1, 0.1, 0.5))
qqnorm(x_johnson$transformed)
qqline(x_johnson$transformed, col = "red")

#Assessing normality of transformed data
adx_johnson <- RE.ADT(x_johnson$transformed)
adx_johnson

Running the RE.Johnson() command generates a list, assigned in this case to the variable x_johnson. This list contains a vector of the transformed values, under x_johnson$transformed, which is our new transformed data set. Using the same plots and tests we used earlier for the raw data set, we can assess the effectiveness of the Johnson transformation.

Histogram of data transformed using the Johnson transformation

QQ-Norm plot of the transformed data (x_johnson)

> adx_johnson

   "Anderson-Darling Test"

$p
[1] 0.3212095

A p-value of 0.32 from the Anderson Darling test for the transformed data set clearly indicates that we fail to reject the null hypothesis of the A-D test, and cannot rule out the possibility that the transformed data is normally distributed.
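
As a quick cross-check, we can also run the Shapiro-Wilk test we used earlier on the transformed values; a short sketch is below.

#Cross-checking the Johnson-transformed data with the Shapiro-Wilk test
shapiro.test(x_johnson$transformed)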

Concluding Remarks

We’ve seen in this post how the Johnson R package can be used to transform non-normal data (as a lot of real data sets are) into a data set that can be passed to a hypothesis test, or any other function that assumes normality in the data. Transformations have wide applications, and can be used to extract meaningful information about the dynamics of a distribution, process or data set. Transformation also allows us to present data better: when data is skewed, extreme values get highlighted at the cost of the pattern present in most of the data set, and a transformation can bring that pattern back into view. Transformed data can also be used in graphs to convey the scale of a phenomenon more clearly.

Further Reading

  1. Box-Cox transformation, another popular transformation method used for non-normal data
  2. Box-Cox transformations on linear models, a more mathematical treatment.

Two Way ANOVA in R

Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

  • How do we decide which route from home to work on your daily commute is easier, or
  • How would you know which air-conditioner to choose out of a bunch that you’re evaluating in various climates?
  • If you were dealing with a bunch of suppliers, and wanted to compare their process results all at the same time, how would you do it?
  • If you had three competing designs for a system or an algorithm, and wanted to understand whether one of them was significantly better than the others, how would you do that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA is where you have discrete factors whose effect on a continuous (variable) result you want to understand.

The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means estimated based on the sample means. What we’re trying to understand here is the effect of a change in the level of one factor, on the response. The term “Analysis of Variance” for ANOVA is therefore a misnomer for many new to inferential statistics, since it is a test that compares means.

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

H_0 : \mu_1 = \mu_2 = \mu_3

while H_a is that at least one mean differs from the others, for instance:

H_a : \mu_1 \neq \mu_2 = \mu_3, or
H_a : \mu_1 = \mu_2 \neq \mu_3, or
H_a : \mu_1 \neq \mu_2 \neq \mu_3

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.

Travel Time Problem

Let’s look at a simple data set which has travel time data organized by day of the week and route. Assume you’ve been monitoring data from many people travelling a certain route, between two points, and you’re trying to understand whether the time taken for the trip is more dependent on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of a problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.
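
If you don’t have the original Traveldata.csv to hand, a hypothetical stand-in data set with the same structure can be simulated as below (in which case you can skip the read.csv() call in the next block). The factor levels, sample size and effect sizes here are assumptions for illustration, not the original data.

#Hypothetical stand-in for Traveldata.csv (structure only; values are simulated)
set.seed(1)
Dataset <- data.frame(
  Day   = factor(rep(c("Fri", "Mon", "Tue", "Wed", "Thu"), each = 100)),
  Route = factor(sample(c("A", "B", "C"), 500, replace = TRUE)),
  Time  = rnorm(500, mean = 54, sd = 7.5)
)
#Adding an assumed Monday effect so the Day factor has something to detect
Dataset$Time <- Dataset$Time + ifelse(Dataset$Day == "Mon", 3, 0)
head(Dataset)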

Two Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous response variable (within each factor’s levels) to be normally distributed. Additionally, ANOVA results are contingent on an equal-variance assumption for the samples being compared. I’ve demonstrated in an earlier post how normality and variance tests can be run prior to a hypothesis test on variable data.

The code below first reads the data set into a data frame, runs the normality and variance tests, and then constructs a linear ANOVA model. For normality testing, we’re using the Shapiro-Wilk test, and for variance testing, we’re using the bartlett.test() command, which compares multiple variances.

#Reading the dataset
Dataset<-read.csv("Traveldata.csv")
str(Dataset)

#Shapiro-Wilk normality tests by Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){
  cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}

#Shapiro-Wilk normality tests by Route
cat("\nNormality p-values by Factor Route: ")
for (i in unique(factor(Dataset$Route))){
  cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}

#Variance tests for Day and Route factors
bartlett.test(Time~Day,data = Dataset )
bartlett.test(Time~Route,data = Dataset )

#Creating a linear model
#Day*Route expands to both main effects and their interaction
l <- lm(Time ~ Day * Route, data = Dataset)
summary(l)

#Running an ANOVA on the linear model and printing the ANOVA table
la <- anova(l)
la

#Plots of Cook's Distance and the linear model diagnostics
plot(cooks.distance(l),
     main = "Cook's Distance for linear model",
     xlab = "Observation index", ylab = "Cook's Distance")
plot(l)

Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

The code also calculates Cook’s distance, which is an important concept in linear models. When trying to understand any anomalous terms in the model, we can refer to the Cook’s distance to understand whether those terms have high leverage in the model, or not. Removing a point with high leverage could potentially affect the model results. Equally, if your model isn’t performing well, it may be worth looking at Cook’s distance.
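
A common (though not universal) rule of thumb flags observations whose Cook’s distance exceeds 4/n for a closer look; a minimal sketch of that check is below.

#Flagging potentially influential observations using a 4/n rule of thumb
cd <- cooks.distance(l)
which(cd > 4 / length(cd))  #indices of points worth a closer look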

Cook’s Distance for our data set, visualized

Cook’s distance, explained by its importance to leverage in the model

When looking at the graphs produced by lm, we can see how various points in the model have different values of Cook’s distance, and we can also see their relative leverages. This is also illustrated in the normal quantile-quantile plot below, where you can see observations #413 and #415, among others, with large values.

Normal QQ plot of the data set showing high-leverage points (large Cook’s Distance)

ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.333  -4.646   0.516   4.963  19.655 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .  
DayThu         2.69221    2.00280   1.344   0.1795    
DayTue        -0.43574    1.90618  -0.229   0.8193    
DayWed        -0.01149    2.00280  -0.006   0.9954    
RouteB        -1.02130    1.89302  -0.540   0.5898    
RouteC        -1.83131    1.85736  -0.986   0.3246    
DayMon:RouteB  2.91785    2.71791   1.074   0.2836    
DayThu:RouteB  0.39335    2.63352   0.149   0.8813    
DayTue:RouteB  3.44554    2.64247   1.304   0.1929    
DayWed:RouteB  1.23796    2.65761   0.466   0.6416    
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *  
DayThu:RouteC  0.24255    2.73148   0.089   0.9293    
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .  
DayWed:RouteC  1.95253    2.68823   0.726   0.4680    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679 
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291

While the model summary above indicates the effect of each factor level on the response, it doesn’t give us an F-statistic for each factor, which is computed by comparing the “between” variation across the samples with the “within” variation inside each sample. The ANOVA sum of squares and mean squares approach does exactly this, which is why its results are more relevant here. The output below is the ANOVA table itself:

Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value  Pr(>F)   
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376   
Day:Route   8   393.9  49.237  0.8779 0.53492   
Residuals 485 27201.3  56.085                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
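
As a quick cross-check of the “within” versus “between” logic, each F value in this table is just that term’s mean square divided by the residual mean square:

#Reproducing the F value for Day by hand from the ANOVA table
205.830 / 56.085  #approximately 3.67, matching the F value reported for Day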

By looking at the results above, it is clear that the Day factor has a statistically significant impact on the travel time. (Upon closer observation, you’ll see that the mean that differed from the others was the mean for Monday. This is clearly an example inspired by Monday rush hour traffic!)

When reading results from the lm and anova commands, it is important to note that R flags results using significance codes. A small p-value indicates a significant result, and the relative significance of different factors is indicated by the symbols attached to them. For instance, two asterisks (**) are used when the p-value is below 0.01, and three (***) when it is below 0.001. Depending on the nature of your experiment, you can choose your significance level and interpret the results by comparing these p-values against it. Also, in this specific case, the Route factor seems to have an insignificant impact on the response.

Interactions

When we have two or more terms in a model which function as inputs to a response variable, we also need to evaluate whether a change in both variables causes a different effect on the response, as opposed to fixing one and changing the other. This is referred to as an interaction, and interaction effects are taken into account in our model. Once again, the p-values for the interactions can inform us about the relative importance of different interactions.
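
A quick way to visualize such interactions is an interaction plot, in which roughly parallel lines suggest little interaction between the factors; a minimal sketch using base R’s interaction.plot() on the same Dataset is below.

#Interaction plot: one line of mean travel time per Route, across Days
interaction.plot(x.factor = Dataset$Day, trace.factor = Dataset$Route,
                 response = Dataset$Time, trace.label = "Route",
                 xlab = "Day", ylab = "Mean travel time")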

Concluding Remarks

We’ve seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook’s distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time, it is important to evaluate how well the model answers the question we actually want answered. In this context, understanding the significance of our results (statistically and practically) is necessary. Fortunately, the linear modelling packages in R are very convenient for such modelling, and incorporate lots of added functionality, such as being able to produce diagnostic plots simply by calling plot() on a saved model. This functionality really comes into its own when you build and test many different models and want to compare results.

Simple Outlier Detection in R

Outliers are points in a data set that lie far away from the estimated centre of the data set. This estimated centre could be either the mean or the median, depending on what kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” that you might observe in a data set, and therefore hold importance. Outlier detection is an important aspect of machine learning algorithms of any sophistication. Because outliers can throw off a learning algorithm or undermine an assumption about the data set, we have to be able to identify and explain them when the need arises. I’ll only cover the basic R commands for outlier detection here, but it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).

Graphical Approaches to Outlier Detection

Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.

y <-read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")

The Shapiro-Wilk test above checks the normality of each data set; normality assumptions underlie the outlier-detection hypothesis tests used below. In this case, with p-values of 0.365 and 0.399 respectively and sample sizes of 30 and 1000, both samples y and ylarge seem to be normally distributed.

 

Box plot of y (no real outliers observed as per the graph)

Boxplot of ylarge – a few outlier points seem to be present in the graph

Histogram of y

Histogram of ylarge

Normal QQ Plot of y

Normal QQ Plot of ylarge

The graphical analysis tells us that there could possibly be outliers in our data set ylarge, which is the larger data set out of the two. The normal probability plots also seem to indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.
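
To list the exact points that the boxplot flags (using its standard 1.5 × IQR whisker rule), we can use boxplot.stats(); a short sketch follows.

#Listing the points beyond the boxplot whiskers (1.5 * IQR rule) for ylarge
boxplot.stats(ylarge[,2])$out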

Dixon and Chi Squared Tests for Outliers

The Dixon test and the Chi-squared test for outliers (PDF) are statistical hypothesis tests used to detect outliers in given samples. Bear in mind, though, that this Chi-squared test for outliers is very different from the better-known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for 30 points or fewer. The Chi-squared test, on the other hand, makes a variance assumption, and is not sensitive to mild outliers if a variance isn’t specified as an argument. Let’s see how these tests can be used for outlier detection.

library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
dixon.test(ylarge[,2],opposite = TRUE)
dixon.test(ylarge[,2],opposite = FALSE)


#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)

In each of the Dixon and Chi-Squared tests for outliers above, we’ve chosen both options TRUE and FALSE in turn, for the argument opposite. This argument helps us choose between whether we’re testing for the lowest extreme value, or the highest extreme value, since outliers can lie to both sides of the data set.

Sample output is below, from one of the tests.

> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

	Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier


When you closely observe the p-values of these tests alone, you can see the following results:

P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04

The p-values here (taken with an indicative 5% significance level) imply that the extreme values in ylarge may well be outliers. This may or may not be true, of course, since in inferential statistics we always state the chance of error. In this case, we can conclude that there is only a very small chance that those extreme values we see in ylarge are actually typical of that data set.

Concluding Remarks

We’ve seen the graphical outlier-detection approaches, and also the Dixon and Chi-squared tests. The Dixon test is simple to apply but isn’t suitable for large data sets, for which we need the Chi-squared test for outliers and other tests. In machine learning problems, we often have to be able to explain some of the values, whether from a training perspective for neural networks, or to deal with lower-resolution models such as least-squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand their underlying causes. Continual improvement professionals also regard outliers as important. Statistically, the chance of getting extreme results (extremely good ones and extremely poor ones) is exciting in process excellence and continuous improvement, because they could represent benchmark cases or worst-case scenarios. Either way, outlier detection is an immensely useful activity, applicable to many statistical situations in business.

Normality Tests in R

When we see data visualized in a graph such as a histogram, we tend to draw some conclusions from it. When data is spread out, or concentrated, or observed to change with other data, we often take that to mean relationships between data sets. Statisticians, though, have to be more rigorous in the way they establish their notions of the nature of a data set, or its relationship with other data sets. Statisticians of any merit depend on test statistics, in addition to visualization, to test any theories or hunches they may have about the data. Usually, normality tests fit into this toolbox.

Histogram: Can this graph alone tell you whether your data is normally distributed?

Normality tests help us understand the chance that the data we have may have come from a normal or Gaussian distribution. At the outset, that seems simple enough. However, when you look closer at a Gaussian distribution, you can observe how it has certain specific properties. For instance, there are two main parameters: a location parameter, the mean, and a scale parameter, the standard deviation. Different combinations of these parameters produce different shapes of distribution. You can therefore have thin and tall normal distributions, or fat and wide normal distributions.
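
To see how these two parameters change the shape, here is a small sketch overlaying two normal densities; the parameter values are arbitrary choices for illustration.

#A "thin and tall" normal density versus a "fat and wide" one (arbitrary parameters)
curve(dnorm(x, mean = 10, sd = 0.5), from = 4, to = 16, ylab = "Density")
curve(dnorm(x, mean = 10, sd = 2), add = TRUE, col = "red")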

When you’re checking a data set for normality, it helps to visualize the data too.

Normal Q-Q Plots

#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)

The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column.

Q-Q Plot (Normal)

The Q-Q plot compares the quantiles of the data set (in this case, the first column of the variable x) with the quantiles we would expect theoretically from a normal distribution based on the sample’s mean and standard deviation. We’re able to do this because of the normal distribution’s properties: it is thicker around the mean and thinner as you move away from it. Specifically, around 68% of the points in normally distributed data can be expected to lie within 1 standard deviation of the mean, with similar figures for 2 and 3 standard deviations (95.4% and 99.7% respectively).
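
These coverage figures can be verified directly from the normal cumulative distribution function:

#Checking the 1, 2 and 3 standard deviation coverage of the normal distribution
pnorm(1) - pnorm(-1)  #approximately 0.683
pnorm(2) - pnorm(-2)  #approximately 0.954
pnorm(3) - pnorm(-3)  #approximately 0.997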

However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality.

You may ask, what does non-normal data look like in this plot? Here’s an example below, from a binomial distribution, plotted on a Q-Q normal plot.

QQ-Normal plot – observe how binomial distribution data displays categories and extremes

Anderson Darling Normality Test

The Anderson-Darling test is one of the most commonly used normality tests, and tells us whether or not a sample may represent normally distributed data. In R, it is run with the ad.test() command from the nortest package; if ad.test doesn’t pop up in your RStudio auto-complete, you can install the package with install.packages("nortest"). Running the Anderson-Darling test for normality returns a small bundle of results. Here’s how to make sense of it.


#Running the A-D test for first column
library(nortest)
ad.test(x[,1])

Anderson-Darling Normality Test results

The data from the A-D test tells us which data has been tested, and two results: A and p-value.

The A result is the Anderson-Darling test statistic, which the test uses to calculate the probability that this sample could have come from a normal distribution. The A-D test evaluates the default (null) hypothesis that the data (in this case the first column of x) comes from a normal distribution. Assuming that this hypothesis is true, the p-value tells us the probability of seeing a sample that deviates from normality at least as much as this one purely by random chance. In this case, that probability is 50.57%, which is far too large to ignore, so we fail to reject our original hypothesis that the data comes from a normal distribution.

For another sample, if the p-value were around 3%, that would mean only a 3% chance of seeing data this extreme from a normal distribution, which is obviously a very low chance. Although 5% is still a small chance, we conventionally choose that as the very bottom end of acceptability, and should ideally subject such data to scrutiny before we proceed to draw inferences from it. P-values can be confusing and hard to interpret; I usually try to construct a sentence that states the p-value’s meaning in the context of the hypothesis being tested. I’ll write more on this in a future post on hypothesis testing.

Interpreting and Understanding A-D test Results

Naturally, as the p-values from an Anderson-Darling normality test become smaller and smaller, there is a smaller and smaller chance that we are looking at data from a normal distribution. Most statistical studies peg the “significance” level at which we reject the default hypothesis (that this data comes from a normal distribution) outright, at p-values of 0.05 (5%) or lesser. I’ve known teams and individuals that fail to reject this hypothesis, all the way down to p-values of 0.01. In truth, one significance value (often referred to as \alpha) doesn’t fit all applications, and we have to exercise great caution in rejecting null hypotheses.

Let’s go a bit further with our data and the A-D test: we will now perform the A-D test on all the columns of data we’ve got in our variable x. The easiest way to do this in R is to define a function that returns the p-value for each column, and use it in an “apply” command.


#Running the A-D test on every column of x
library(nortest)
#defining a function called "adt" to run the A-D test and return the p-value
adt<-function(x){ test<-ad.test(x); return(test$p.value) }
#storing the p-value for each column in a separate variable
pvalues<-apply(x,MARGIN = 2, adt)

The code above runs the test on each column of x and returns the p-values as a vector, one per column. So, what do you expect to see when you summarize the variable “pvalues” which stores these results?

p-values summary (columns of x)
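
In code, the summary and the count of columns falling below the 0.05 cut-off can be obtained as below; the exact count will vary from run to run, since the data are randomly generated.

#Summarizing the p-values and counting columns below the 0.05 cut-off
summary(pvalues)
sum(pvalues < 0.05)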

When you summarize p-values, you can see that approximately 9 of the 100 columns don’t meet the significance criterion (p >= 0.05). You can also see that the p-values in this set of randomly generated samples are spread over the entire range of probabilities from 0 to 1. We can visualize this too, by plotting the variable “pvalues”.


#Plotting the sample p-values and drawing a significance line
plot(pvalues, main = "p-values for columns in x (AD-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")

p-values for columns in x (A-D test)

Other tests: Shapiro-Wilk Test

The Anderson-Darling test isn’t the only test available for assessing normality. Statisticians and engineers often use the Shapiro-Wilk test as well, which is available in base R as shapiro.test(), so no extra package is needed. For data similar to that used above (random numbers generated from the normal distribution), the Shapiro-Wilk test can be run with only a few changes to the R script (which is another reason R is so time-efficient).


#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)

#Running the Shapiro-Wilk test on every column of x
#(shapiro.test is in base R, so no extra package is needed)
#defining a function called "swt" to run the Shapiro-Wilk tests and return p-values
swt<-function(x){ test<-shapiro.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable
swpvalues<-apply(x,MARGIN = 2, swt)

#Plotting the sample p-values and drawing a significance line
plot(swpvalues, main = "p-values for columns in x (Shapiro-Wilk-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")

Shapiro-Wilk test p-values (for columns in x)

Observe that in these samples of 100 points each, the Shapiro-Wilk test returns 96 samples as normally distributed data, while rejecting 4 (the four dots below the red line). We can’t tell from this example whether the Shapiro-Wilk test is better than the A-D test for normality assessments, because these are randomly generated data sets. If we run the tests side by side, however, we may see interesting results. When I ran them side by side, I got a very similar number of significantly different (non-normal) samples from both tests: either 4 or 5 out of the total 100.
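
A minimal sketch of such a side-by-side comparison is below; it assumes the nortest package is loaded and that both tests are applied to the same matrix x, so the counts are only indicative for randomly generated data.

#Side-by-side counts of columns flagged as non-normal (p < 0.05) by each test
adp <- apply(x, MARGIN = 2, function(col) ad.test(col)$p.value)
swp <- apply(x, MARGIN = 2, function(col) shapiro.test(col)$p.value)
sum(adp < 0.05)  #columns flagged by the Anderson-Darling test
sum(swp < 0.05)  #columns flagged by the Shapiro-Wilk test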

Concluding Remarks

So, what does all this mean for day-to-day data analysis?

  • Day-to-day analysis of continuous data (variable data, as opposed to yes/no or other attribute data) often uses tools that are meant for normally distributed data
  • Visualization alone isn’t sufficient to estimate the normality of a given set of data, and test-statistics are very important for this
  • The nortest package in R provides a fast and convenient way to assess samples for normality using tools like A-D and S-W tests
  • A-D and S-W tests tend to perform similarly for most data sets (however, research is being done on this)
  • The re-usability of R code allows you to set up a macro rather quickly for performing normality tests and a range of studies on test data, in a time-efficient manner