# “Small Data” and Being Data-Driven

Being data-driven in organizations is a bigger challenge than it is made out to be. For managers to suspend judgement and make decisions that are informed by facts and data is hard, even in this age of Big Data. I was spurred by a set of tweets I posted, to think through this subject.

## Decision Making Culture

A lot of organizations have jumped into the Big Data era having bypassed widespread use of data-driven decision making in their management ranks altogether. And this is, for many organizations, an inconvenient truth. In many organizations, even well known ones, experienced managers have often made decisions on gut feeling, or for reasons other than the data they collected. Analytics and business intelligence hoped to change that, and in some ways, they have. Many organizations and managers have changed their work styles. Examples abound of companies adopting techniques like Six Sigma in the 1980s and 1990s, a trend that continues to this day in the manufacturing industry.

## Three Contrasts

With the explosion in technologies and methods that have enabled Big Data to be collected and stored as “data lakes” and for data to be collected in real time as streaming data using technologies like Spark and NiFi, we’re at the advent of a new era of decision making characterised by the  3 Vs of Big Data, and data science at scale.

Here are three contrasts between old and new management decision making styles:

1. Spending and buying decisions (for resources, infrastructure, technology and projects) are now, more than ever, made after competitive evaluation based on data. In the past, limited globalization and the lack of communication and analysis engines meant that managers spent less time evaluating even critical decisions, because the options were limited. Spending and buying decisions make up a large share of executive decision making, and much of it is informed by small data. The flood of information enabled by the digital age exposed managers to more possibilities, but without the tools to do better at such competitive analysis. The new trends of network-connected economies, data mining and data analysis are bound to change this for the better: advanced analytics will upend this paradigm and give better visibility of decision alternatives.
2. Operational excellence decisions are based on real-time data now more than ever. Operational excellence and process efficiency are a key focus area for many manufacturing organizations, and increasingly concern service oriented organizations as well. While “small data” were collected at regular intervals to get a sense of business operations, they were not fully effective in capturing the wide range of process modes, and didn’t represent the full possibilities one could leverage with such data. The practitioners of advanced methods who used them in a verifiable way were also few, and rarely belonged to the management strata or informed them. The proliferation of the new classes of data scientists and data engineers, along with the advent of real-time analytics, will affect the way decisions are taken in future.
3. Small data as a stepping stone to Big Data. Small Data, which is data collected as samples, whether slices of sensor information or representative samples of population-scale data (such as Big Data), may increasingly be used to formulate the “cultural business case” for doing Big Data in companies. Many companies that do not have a culture of data driven decision making in their managerial ranks are experimenting with Big Data on a grand scale. Such organizations have taken to Big Data technologies such as Hadoop and Spark, and are often collecting more data than they usefully analyze. There is definitely scope to evaluate the business value of such implementations. There is also an opportunity to improve the cost effectiveness of data science initiatives in companies, by evaluating the real need for a Big Data implementation using “small data”, that is, data that does not meet the volume, velocity, variety and veracity criteria now associated with Big Data.

## Data Driven Decision Making Behaviours

Decision making is strongly influenced by behaviours. Daniel Kahneman’s book Thinking, Fast and Slow provides a psychological framework for thinking about fast and slow decision making, the former being gut-driven, and the latter being driven by careful, plodding analysis. Humans tend towards decisiveness, especially in organizations, and executives are often rewarded for fast decision making that is also effective. Naturally, this means that fast decision making flourishes as a habit in organizations.

Such fast decision making, however, comes at a price. A lot of decisions that aren’t well thought-through, could influence a large organization’s functioning, because the decision could be fundamental to the organization and may be relevant to all employees. Some organizations do reward behaviours in their managerial cadres that facilitate looking at the data that supports decisions. However, the vast majority of managers have a tax on the time they spend on decisions and would be rewarded for acting quickly and influencing a wide ranging array of decisions instead.

Enabling fast decision making has obvious benefits in a market economy. The more time managers spend in decision making, or delay a decision, the less competitive companies tend to look. Data driven decision making can be enabled by providing access to data, in a quick and painless way. And this means building intelligence into our interfaces, and into the machines that help us make and record decisions. It also means being able to delegate the mundane tasks well and easily.

## Concluding Remarks

A lot of organizations that have Big Data initiatives may not have the appropriate management or decision making culture that can fully utilize the investment in Big Data, which can sometimes be considerable. By using “Small Data” and the insights from analysis of such data, there is an opportunity to invest less and build the behaviours and organizational systems and habits that will make a Big Data implementation effective.

# Power, Difference and Sample Sizes

In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses: the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision making activity, where you reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from some other fleet? When you collect samples of data, you can compare the average values of the samples, and arrive at some inference from this information, about the population. Since samples are affected by variability, we ought to be interested in how much data to actually collect.

## Hypothesis Tests

Statistically speaking, when we collect a small sample of data and calculate its standard deviation, the estimate we get may be noticeably larger or smaller than the actual standard deviation. The relationship between the sample size and the variability of a sample statistic is described by the standard error, which for the sample mean is the standard deviation divided by $\sqrt{n}$. This standard error is an integral part of how a confidence interval is calculated for variable data. The difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be large for small samples, and small for large samples. There is an intuitive way to think about this. If you have more information, you can make a better guess at a characteristic of the population the information is coming from.
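To make the standard error idea a little more tangible, here is a minimal R sketch. The population parameters (mean 10, standard deviation 1), the sample sizes and the seed are my own assumptions for illustration. It compares the textbook value $\sigma / \sqrt{n}$ against the spread of simulated sample means.

```r
# Sketch: how the spread of the sample mean shrinks as the sample grows
set.seed(1)                      # arbitrary seed, only for repeatability
sigma <- 1                       # assumed population standard deviation
sizes <- c(5, 30, 100, 1000)     # assumed sample sizes to compare

# Theoretical standard error of the mean: sigma / sqrt(n)
se_theory <- sigma / sqrt(sizes)

# Empirical check: standard deviation of 2000 simulated sample means per size
se_sim <- sapply(sizes, function(n) {
  sd(replicate(2000, mean(rnorm(n, mean = 10, sd = sigma))))
})

print(data.frame(n = sizes, se_theory = se_theory, se_sim = se_sim))
```

The two columns should track each other closely, and both shrink as n grows, which is the "more information, better guess" intuition expressed in numbers.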

Let’s take another example: Motor racing. Motor racing lap times are generally recorded with extremely high precision and accuracy. If we had a sample with three times and wanted to estimate lap times for a circuit, we’d probably do okay, but have a wider range of expected lap times. On the contrary, if we had a number of lap time records, we could more accurately calculate the confidence intervals for the mean value. We could even estimate the probability that a particular race car driver could clock a certain time, if we were able to understand the distribution that is the closest model of the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these risks $\alpha$ and $\beta$ risks.

Hypothesis tests, alpha and beta risks

## Type I & Type II Errors and Power

To explain it simply, the chance of erroneously rejecting the null hypothesis $H_0$ is referred to as the Type I error (or $\alpha$). Similarly, the chance of erroneously accepting the null hypothesis (if the reality was different from what the null hypothesis stated) is called the Type II error (or $\beta$ risk). Naturally, we want both these errors to have very low probability in our experiments.

How do we determine if our statistical model is powerful enough, therefore, to avoid both kinds of risks? We use two statistical terms, significance (known here as $\alpha$, as in $\alpha$ risk), and power (known sometimes as $\pi$, but more commonly known as $1-\beta$, as we see from the illustration above).
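One way to get a feel for $\alpha$ risk is to simulate it. The sketch below assumes two samples of 20 points each, drawn from the same normal population, so the null hypothesis is true by construction; the sample size, seed and number of repetitions are illustrative choices of mine. The fraction of tests that still reject should land close to $\alpha$.

```r
# Sketch: estimating the Type I error rate by simulation
set.seed(42)        # arbitrary seed
alpha <- 0.05       # assumed significance level
n_sims <- 5000      # assumed number of repeated experiments

# Both samples come from the same population, so H0 is true in every run
p_values <- replicate(n_sims, {
  t.test(rnorm(20, mean = 10, sd = 1), rnorm(20, mean = 10, sd = 1))$p.value
})

# Fraction of runs in which we (wrongly) reject H0
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```

Every rejection here is a false positive, because there is no real difference to find.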

It turns out that the statistical power is heavily dependent on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of difference in fuel efficiency) is diminished greatly. In truth, you may receive a result that gives you a p-value (also discussed in an earlier post) that is greater than the significance. Remember that this is now a straightforward comparison between the p-value of the test and what we know now as $\alpha$. However, note how in the results interpretation of our hypothesis test, we didn’t yet consider $1-\beta$. Technically, we should have. And this is what causes so many spurious results, because false positives end up getting ignored, leading to truth inflation.

Very often in data-driven businesses, the question of “how many samples is good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect, isn’t easy. Going back-and-forth to collect samples of data in order to do your hypothesis tests is helpful – primarily because you can see the effects of sample size in your specific problem, practically.
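If you'd rather not go back and forth entirely by trial and error, base R's power.t.test() (in the "stats" package) can give a first estimate. The sketch below assumes a difference of 1 unit to detect, a standard deviation of 2 and $\alpha = 0.05$; these numbers are made up for illustration and are not from any real data set.

```r
# Sketch: power of a two-sample t-test at several per-group sample sizes
sizes <- c(5, 10, 20, 50, 100)   # assumed per-group sample sizes

power_by_n <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})

print(data.frame(n_per_group = sizes, power = round(power_by_n, 3)))
```

Power climbs steadily with the sample size, which is why a test on a handful of points can easily miss a difference that is really there.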

## A Note on Big Data

Big Data promises us what a lot of statisticians didn’t have in the past – the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the trenchant infrastructure to make the call to collect, store and analyze such data. While Big Data security is an as yet incompletely answered question, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email account providers on the web that also double up as social networks. Where does this fit in? Well, when building such systems, we have neither the trust of a large number of people, nor the information we require – which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, and when companies have to build systems that handle this kind of information well, they have to experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post will be relevant in such contexts.

## Summary

1. Power: The power of a test is defined as the ability to correctly reject the null hypothesis of a test. As we’ve described it above, it is defined by $1-\beta$, where $\beta$ is the chance of incorrectly accepting the default or null hypothesis $H_0$.
2. Standard deviation ($\sigma$ ): The more variation we observe in any given process, the greater our target sample size should be, for achieving the same power, and if we’re detecting the same difference in performance, statistically. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in the $\sigma$ can affect the number of data points we need to collect to arrive at a result with sufficient power.
3. Significance ($\alpha$ ): As discussed in the earlier post on normality tests, significance of a result can have an impact on how much data we need to collect. If we have a wider margin for error, or greater margin for error in our decisions, we ought to settle for a larger significance value, perhaps of 10% or 15%. The de-facto norm in many industries and processes (for better or for worse, usually for worse) is to use $\alpha = 0.05$. While this isn’t always a bad thing, it encourages blind adherence and myths to propagate, and prevents statisticians from thinking about the consequences of their assumptions. In reality, $\alpha$ values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results.
4. Sample size ($n$): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test (and by that I mean, the test has a greater ability to detect a real difference when one exists, that is, to avoid a false negative).
5. Difference ($\Delta$): The greater the difference you want to be able to detect between two sets of data (proportions or means or medians), the smaller the sample size you need. This is an intuitively easy thing to understand – like testing a HumVee for gas mileage versus a Ford Focus – you need only a few trips (a small sample size) to tell a real difference, as opposed to if you were to test two compact cars against each other (when you may require a more rigorous testing approach).
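The five factors above can be tied together with power.t.test(), which solves for whichever quantity you leave out. In this sketch, n_needed() is a hypothetical helper of my own, and the baseline values (difference 1, standard deviation 1, $\alpha = 0.05$, power 0.8) are assumptions chosen only to show the direction of each effect.

```r
# Hypothetical helper: per-group sample size for a two-sided, two-sample t-test
n_needed <- function(delta, sd, alpha = 0.05, power = 0.8) {
  res <- power.t.test(delta = delta, sd = sd, sig.level = alpha,
                      power = power, type = "two.sample",
                      alternative = "two.sided")
  ceiling(res$n)   # round up to whole observations
}

n_base       <- n_needed(delta = 1, sd = 1)                # baseline
n_big_delta  <- n_needed(delta = 2, sd = 1)                # larger difference
n_noisy      <- n_needed(delta = 1, sd = 2)                # more variation
n_strict     <- n_needed(delta = 1, sd = 1, alpha = 0.01)  # stricter alpha
n_high_power <- n_needed(delta = 1, sd = 1, power = 0.95)  # more power

print(c(base = n_base, big_delta = n_big_delta, noisy = n_noisy,
        strict = n_strict, high_power = n_high_power))
```

A larger difference needs fewer points; more variation, a stricter $\alpha$ or a higher target power all need more.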

# Hypothesis Tests: 2 Sample Tests (A/B Tests)

Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other.

It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data.

## Simple 2-sample t-test

A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them. So, any variation you see in the data is purely random variation. The alternative hypothesis in this test, is of course, that it isn’t only random variation we’re seeing, and that these samples come from completely different populations altogether.

$H_0 : \mu_1 = \mu_2 \newline H_a: \mu_1 \neq \mu_2$

What the populations from which X1 and X2 are taken may look like

We’ll generate two samples of data $x_1, x_2$ from two different normal distributions for the purpose of demonstration, since normality is a pre-requisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for the median, such as Mood’s median test or the Kruskal-Wallis test, which I’ll blog about another time). We also have to ensure that the standard deviations of the two samples of data we’re testing are comparable. I’ll also demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective.

library(nortest)
#Generating two samples of data
#100 points of data each
#Same standard deviation
#Different values of mean (of sampling distribution)
#Assumption: the original data object x is not shown in the post, so it is
#rebuilt here as a data frame with N(10, 1) and N(12, 1) draws in columns 2 and 3
x<-data.frame(id = 1:100, x1 = rnorm(100,10,1), x2 = rnorm(100,12,1))
x1<-x[,2]
x2<-x[,3]

#Setting the global value of significance
alpha = 0.1

#Histograms
#Breaks cover at least 7 to 15, and widen if a simulated point falls outside
brk<-seq(min(7,floor(min(x1,x2))), max(15,ceiling(max(x1,x2))), 0.25)
hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = brk, main = "Histogram of x1 and x2", xlab = "x1, x2")
abline(v=10, col = "orange")
hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = brk, add = T)
abline(v=12, col = "purple")

#Running normality tests (just to be sure)
#Anderson-Darling test from the nortest package
ad1<-ad.test(x1)
ad2<-ad.test(x2)

#F-test to compare two variances
v1<-var.test(x1,x2)

if(ad1$p.value>=alpha && ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")
    }
  }
}

The first few lines in the code merely include the “nortest” package and invoke/generate the data sets we’re comparing. The nortest package contains the Anderson Darling Normality Test, which we have also covered in an earlier post. We can generate a histogram, to understand what $x_1$ and $x_2$ look like.

Histogram of X1 and X2 – showing the reference population mean lines

The overlapping histograms of $x_1$ and $x_2$ clearly indicate the difference in the central tendency, and the overlap is also visible. Subsequent code above covers an F-test. As explained earlier, equality of variances is a pre-requisite for the 2-sample t-test. Failing this would mean that we essentially have samples from two different populations, which have two different standard deviations.
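When the F-test does flag different variances, one common fallback (not the approach used in this post) is the Welch variant of the t-test, which t.test() runs when var.equal = FALSE. The sketch below uses made-up samples with standard deviations of 1 and 3, purely as an assumption to trigger the F-test; the names a and b are illustrative.

```r
# Sketch: pooled vs Welch 2-sample t-test when variances clearly differ
set.seed(7)                            # arbitrary seed
a <- rnorm(50, mean = 10, sd = 1)      # assumed sample with low variability
b <- rnorm(50, mean = 12, sd = 3)      # assumed sample with high variability

# F-test for equality of variances
f_res <- var.test(a, b)

# Pooled test assumes equal variances; Welch test does not
pooled <- t.test(a, b, var.equal = TRUE)
welch  <- t.test(a, b, var.equal = FALSE)

print(f_res$p.value)
print(c(pooled_df = unname(pooled$parameter),
        welch_df  = unname(welch$parameter)))
```

The Welch test pays for the unequal variances with fewer effective degrees of freedom, which is why its df is smaller than the pooled df.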

Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is present in the “stats” package) runs, and provides us a result. Let’s look closely at how the t.test command is constructed. The arguments contain $x_1$ and $x_2$, which are our two samples for comparison. We provide the argument “paired = FALSE”, because these are not before/after samples. They’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of $\alpha$, or significance level.

Now that we’ve seen what the code does, let’s look at the results.

2-sample t-test results

## Evaluating The Results

Two sample t-test results should be evaluated in a similar way to 1-sample t-tests. Our decision is dependent on the p-value we see in the result, and the confidence interval of the difference between sample means.

Observe how the difference estimate lies on the negative side of the number line. The difference is calculated as the mean of $x_1$ minus the mean of $x_2$, and the samples come from populations with means 10 and 12, so we can clearly understand why this estimate of difference is negative. The estimates for mean values of x and y (in this case, $x_1$ and $x_2$) are also given. Naturally, the p-value that’s in the result, when compared to our generous $\alpha$ of 0.1, is far smaller, and we can consider this to be a significant result (provided we have sufficient statistical power, which we’ll discuss in another post). This indicates a significant difference between the two sets. If $x_1$ and $x_2$ were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for $x_2$ when compared to $x_1$.
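If you want to use these numbers in later code rather than read them off the printout, the object returned by t.test() exposes them directly. This is a small sketch on freshly simulated samples (s1 and s2 are my own names, generated with assumed means of 10 and 12 and a standard deviation of 1), not the exact data behind the results shown above.

```r
# Sketch: pulling the estimate, interval and p-value out of a t.test result
set.seed(123)                       # arbitrary seed
s1 <- rnorm(100, mean = 10, sd = 1) # assumed stand-in for x1
s2 <- rnorm(100, mean = 12, sd = 1) # assumed stand-in for x2

res <- t.test(s1, s2, paired = FALSE, var.equal = TRUE, conf.level = 0.9)

# Estimated difference in means (first sample minus second sample)
diff_est <- unname(res$estimate[1] - res$estimate[2])

print(diff_est)
print(res$conf.int)   # confidence interval for the difference
print(res$p.value)
```

Because the interval for the difference sits entirely below zero, it tells the same story as the small p-value.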

## Detecting a Specific Difference

Sometimes, you may want to evaluate a new product, and see if it performs at least x% better than the old product. For websites, for instance, you may be concerned with loading times. You may be concerned with code runtime, or with vehicle gas mileage, or vehicle durability, or some other aspect of performance. At times, the fortunes of entire companies depend on them producing faster, better products – that are known to be faster by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data.

The same example above can be modified slightly, to test for a specific difference. The only real change we have to make here, of course, is the value of $\Delta$ or difference. The t.test() command in R unfortunately isn’t very clear on this: it expects you to understand that you should use the “mu” argument for this. Once you get used to it, however, this little detail is fine, and it delivers the expected result.

if(ad1$p.value>=alpha && ad2$p.value>=alpha){
  if(v1$p.value>=alpha){
    #Running a 2-sample t-test for each hypothesized difference
    for (i in c(-2,-1,0,1,2)){
      temp<-t.test(x=x1,y=x2,paired = FALSE,var.equal = TRUE,alternative = "two.sided",conf.level = 1-alpha, mu = i )
      cat("Difference= ",i,"; p-value:",temp$p.value,"\n")
    }
  }
}


The code above prints out different p-values, for different tests. The data used in these tests is the same, but by virtue of the different differences we want to detect between these samples, the p-values are different. Observe the results below:

Differences and how they influence p-value (same two samples of data)

Since the data was generated from two distributions that have means of 10 and 12 respectively for $x_1$ and $x_2$, we know from intuition that the difference is -2, and we should start seeing results that indicate no difference between the expected and observed difference at this value in the test. Therefore, the p-values in this scenario will be greater than the significance value, $\alpha$.

For other scenarios, when $\Delta = -1, 0, 1, 2$, we see that the p-values are clearly far below the significance of $\alpha = 10\%$.

What’s important to remember therefore, is that contrary to what many people may think, there is no one or best p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference expected.
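It may help to see what the “mu” argument is actually doing. As far as the test statistic is concerned, testing for a difference of $\Delta$ is the same as shifting the first sample by $-\Delta$ and testing for no difference. The sketch below checks this on simulated samples (s1 and s2 are assumed stand-ins for $x_1$ and $x_2$, with means 10 and 12 and a standard deviation of 1).

```r
# Sketch: the mu argument is equivalent to shifting one sample
set.seed(123)                        # arbitrary seed
s1 <- rnorm(100, mean = 10, sd = 1)  # assumed stand-in for x1
s2 <- rnorm(100, mean = 12, sd = 1)  # assumed stand-in for x2

deltas <- c(-2, -1, 0, 1, 2)

# p-value for each hypothesized difference, via the mu argument
p_by_mu <- sapply(deltas, function(d) {
  t.test(s1, s2, var.equal = TRUE, mu = d)$p.value
})

# Same tests, done by subtracting the hypothesized difference from s1
p_by_shift <- sapply(deltas, function(d) {
  t.test(s1 - d, s2, var.equal = TRUE)$p.value
})

print(data.frame(delta = deltas, p_by_mu = p_by_mu, p_by_shift = p_by_shift))
```

The two columns of p-values agree, and the largest one sits at the hypothesized difference closest to the true difference of -2.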

## Concluding Remarks

A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and understand if one of them is performing significantly better than another. The test is strictly for data that fits the normality criteria, that also happen to have comparable standard deviations, and the results from it tend to be impacted heavily by the kind of hypothesis we use – for difference (which we explored here) and for one or two sided comparisons. We explored only the two sided comparisons here (and hence constructed a two sided confidence interval). When a business uses a 2-sample t-test, some of the arguments here, such as the values of confidence level, difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size. The smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size.

# Confidence Intervals and t-tests in R

If you were to walk into a restaurant and order a cup of coffee, you’d expect to get a standard cup of the stuff, and you’d expect to get a sufficient quantity of it too. How much coffee for a certain price is too much, and how much is too little? More realistically, when you know that the same coffee shop may serve ever so slightly different volumes of coffee in the same cup, how can you quantify the coffee in the cup?

## What Plausible Ranges of Gas Prices Do You Pay?

The same question may very well be asked at the gas station or diesel station that you fill up your car in. When you get there and shell out a few gallons or litres worth of money (depending on where you are), you can expect to pay a certain amount of money for each such gallon or litre. How do we determine what the range of expected prices for a gallon or litre of fuel is, based on already available data?

This is where the confidence interval comes in, and it is one of the most important tools of inferential statistics. Inferential statistics is the science of making decisions or informed generalizations about some data you have, based on some of the characteristics of this data. An important way to understand variability in any process or product’s performance is to ascribe a range of plausible values. Confidence intervals can be defined as the plausible ranges of values a population parameter may take, if you were to estimate it with a sample statistic. You’d hardly expect a statistician to ask you “What plausible range of gas prices do you pay on average?”, but this is, in fact, close to the truth of interval estimation in statistics.

## Confidence Levels and Sampling

Why the word confidence in confidence interval? Well, information costs you something to collect it. If you want to be 100 percent sure about the mean of petrol prices, for instance, you could literally collect data on every transaction from every pump in the world. In the real world, this is impossible, and we require sampling.

In the age of Big Data, it seems to be a taboo to talk about sampling sometimes. “You can collect all the data from a process”, some claim. That may be the case for a small minority of processes, but for the vast world out there, characterization is only possible by collecting and evaluating samples of data. And this approach entails the concept of a confidence level.

Confidence levels tell us the proportion of times we’re willing to be right (and wrong) about any parameter we wish to estimate from a sample of data. For example, if I measured the price per gallon of gas at every pump in Maine, or Tokyo, for a day, I’d get a lot of data – and that data would show me some general trends and patterns in the way the prices are distributed, or what typical prices seem to be in effect. If I expect to make an estimate of petrol prices in Tokyo or Maine next July, I couldn’t hope to do this with a limited sample such as this, however. Basic economics knowledge tells us that there could be many factors that could change these prices – and that they could very well be quite different from what they are now. And this is despite the quality of the data we have.

If I wanted to be 95% confident about the prices of petrol within a month from now, I could use a day’s worth of data. And to represent this data, I could use a confidence interval (a range of values), of course. So, whether it is the quantity of coffee in your cup, or the price per gallon of fuel you buy, you can evaluate the broader parameters of your data sets, as long as you can get the data, using confidence intervals.
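What "95% confident" means in practice can be checked with a quick simulation. The sketch below assumes a normal population with a true mean of 10 and standard deviation of 1 (stand-ins for, say, a price distribution), draws many samples of 30, and counts how often the 95% interval from t.test() actually contains the true mean. All the numbers are illustrative assumptions.

```r
# Sketch: how often does a 95% confidence interval cover the true mean?
set.seed(2024)      # arbitrary seed
true_mean <- 10     # assumed population mean

covered <- replicate(2000, {
  s <- rnorm(30, mean = true_mean, sd = 1)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true mean
coverage <- mean(covered)
print(coverage)
```

The coverage should land close to 0.95: the confidence level describes the procedure over many samples, not any single interval.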

## R Confidence Intervals Example

Now, let’s see a simple example in R that illustrates confidence intervals.

#Generate some data - 100 points of data
#The mean of the data generated is 10,
#The standard deviation we've chosen is 1.0
#Data comes from the gaussian distribution

x<-rnorm(100,10,1.0)

#Testing an 80% confidence level
x80<-t.test(x,conf.level = 0.8)

#Testing a 90% confidence level
x90<-t.test(x,conf.level = 0.9)

#Testing a 99% confidence level
x99<-t.test(x,conf.level = 0.99)

x80
x90
x99


The first part of the code shows us how 100 points of data are used as a sample in this illustration.

In the next part, the t.test() command used here can be used to generate confidence intervals, and also test hypotheses you may have developed. In the above code, I’ve saved three different results, based on three different confidence levels: 80%, 90% and 99%. Typically, when you want to be more certain, but you don’t have more data, you end up getting a wider confidence interval. Expect more uncertainty if you have limited data, and more certainty, when you have more data, all other things being equal. Here are the results from the first test, the 80% confidence interval.

1-sample t-test and confidence interval (80% confidence)

Let’s break down the results of the t-test. First, we see the data set the test was performed on, and then we see the t-statistic and also a p-value. Further, you can see a confidence interval $(9.85,10.10)$ for the data, based on a sample size of 100, and a confidence level of 80%. Since the t-test is fundamentally a hypothesis test that uses confidence intervals to help you make your decision, you can also see an alternative hypothesis listed there. Then there’s the estimate of the mean, 9.98. Recall the characteristics of the normal distribution we used to generate the data in the first place: it had a mean $\mu = 10.0$ and a standard deviation $\sigma = 1$.

## Confidence Levels versus Confidence Intervals

Summarily, we can see the confidence intervals and mean estimates of the remaining two confidence intervals also. For ease, I’ll print them out from storage.

CIs comparison (80%,90%, 99%)

Observe how, for the same data set, confidence intervals (plausible ranges of values for the mean of the data) are different, depending on how the confidence level changes. The confidence intervals widen as the confidence level increases. The 80% CI is calculated to be $(9.85, 10.10)$ while the same sample yields $(9.72, 10.24)$ when the CI is calculated at 99% confidence level. When you consider that standard deviations can be quite large, what confidence level you use in your calculations, could actually become a matter of importance!

More on how confidence intervals are calculated here, at NIST.
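For those who like to see the arithmetic, the interval that t.test() reports for a mean can be rebuilt by hand as the sample mean plus or minus a critical t value times the standard error. In the sketch below, ci_manual() is a hypothetical helper of my own, and y is a fresh simulated sample (so its numbers will differ from the results shown above).

```r
# Sketch: rebuilding the t-based confidence interval by hand
set.seed(10)                         # arbitrary seed
y <- rnorm(100, mean = 10, sd = 1)   # assumed sample, same recipe as x above

# Hypothetical helper: two-sided CI for the mean at a given confidence level
ci_manual <- function(v, level) {
  n <- length(v)
  se <- sd(v) / sqrt(n)                        # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1) # critical value of t
  c(mean(v) - t_crit * se, mean(v) + t_crit * se)
}

levels <- c(0.8, 0.9, 0.99)
widths <- sapply(levels, function(l) diff(ci_manual(y, l)))

print(ci_manual(y, 0.8))
print(as.numeric(t.test(y, conf.level = 0.8)$conf.int))
print(data.frame(level = levels, width = widths))
```

Only the critical value changes with the confidence level, which is why the same sample gives a wider interval at 99% than at 80%.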

## The 1-sample t-Test

We’ve seen earlier that the command that is invoked to calculate these confidence intervals is actually called t.test(). Now let’s look at what a t-test is.

In inferential statistics, specifically in hypothesis testing, we test samples of data to determine if we can make a generalization about them. When you collect process data from a process you know, the expectation is that this data will look a lot like what you think it should look like. If this is data about how much coffee, or how much you paid for a gallon of gasoline, you may be interested in knowing, for instance, if the price per gallon is any different here, compared to some other gas station. One way to find this out – with statistical certainty – is through a t-test. A t-test fundamentally tells us whether we fail to reject a default hypothesis we have (also called null hypothesis and denoted as $H_0$ ), or if we reject the default hypothesis and embrace an alternative hypothesis (denoted by $H_a$). In case of the 1-sample t-test, therefore:

$H_0 : \mu = k \newline H_a : \mu \neq k$

Depending on the nature of the alternative hypothesis, we could have inequalities there as well. Note that $k$ here is the expectation you have about what the mean ought to be. If you don’t supply one, t.test() tests against its default of $\mu = 0$, and still calculates a confidence interval and produces a mean estimate.

A t-test uses the t-distribution, which is a lot like the Gaussian or normal distribution, except that it uses an additional parameter – which directly relates to the sample size of your data. Understandably, the size of the sample could give you very different results in a t-test.

As with other hypothesis tests, you also get a p-value. The null hypothesis in a 1-sample t-test is relatively straightforward: that there is no difference between the population mean, as estimated from the sample in question, and the expected value. Naturally, the alternative of this hypothesis could help us study whether the population mean is less than expected (less expensive gas!) or greater (more expensive gas!) than the expectation.

Let’s take another look at that first t-test result:

1-sample t-test and confidence interval (80% confidence)

The interval we’ve calculated here is an 80% confidence interval, which translates to a significance level of 20% ($\alpha = 0.2$). To interpret the results above, we compare the value of p with the significance level, and reject the null hypothesis if p is smaller. But what are the t-statistic and df? The t-statistic is calculated from the data alone: it is the difference between the sample mean and the hypothesised mean, divided by the standard error (the sample standard deviation divided by the square root of the sample size, which here is 100). The confidence level does not enter into the t-statistic; it only sets the critical value of t used to build the interval. (The “df” here stands for degrees of freedom – which stands at 99, calculated from the 100 data points we have and the 1 parameter we’re estimating.)
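The t-statistic and p-value can be reproduced by hand, which may help in reading the output. The sketch below uses stand-in data and a hypothesised mean of 10.1; the seed, the data and the variable names are assumptions for illustration, not the sample used earlier in the post.

```r
set.seed(42)                          # assumed seed, only for reproducibility
x <- rnorm(100, mean = 10, sd = 1)    # stand-in sample of 100 points
mu0 <- 10.1                           # assumed hypothesised mean

# t-statistic: (sample mean - hypothesised mean) / standard error
n <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
df_manual <- n - 1

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# The same quantities as reported by t.test(), at two confidence levels
result80 <- t.test(x, mu = mu0, conf.level = 0.80)
result95 <- t.test(x, mu = mu0, conf.level = 0.95)

print(c(manual = t_manual, t.test = unname(result80$statistic)))
print(c(manual = p_manual, t.test = result80$p.value))
```

Changing conf.level from 0.80 to 0.95 widens the interval but leaves the t-statistic and the p-value untouched.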

## Alternative Hypotheses Inequalities

The t.test() command also allows us to evaluate how the confidence intervals look, and what the p-values are, when we have different alternative hypotheses. We can test for the population mean that’s being estimated to be less than, or greater than the expected value. We can also specify what our expected values of mean are.


x80<-t.test(x,conf.level = 0.8, mu = 10.1,alternative = "less" )

x80



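Observe how the p-value and the confidence interval have changed.

We’ve evaluated the same 80% confidence level, but with a different expected value of the mean, $\mu = 10.1$, and with the alternative hypothesis that the mean is smaller, $\mu<10.1$. Because the alternative is one-sided, the confidence interval is one-sided too: it has an upper bound but no lower bound.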
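To see how the choice of alternative changes the result, the three options can be run side by side. This is a sketch on stand-in data (the seed and the sample are assumptions); only the alternative argument differs between the calls.

```r
set.seed(1)                           # assumed seed, only for reproducibility
x <- rnorm(100, mean = 10, sd = 1)    # stand-in sample

two_sided <- t.test(x, mu = 10.1, conf.level = 0.8)
less      <- t.test(x, mu = 10.1, conf.level = 0.8, alternative = "less")
greater   <- t.test(x, mu = 10.1, conf.level = 0.8, alternative = "greater")

# Same t-statistic in all three cases, but different p-values
print(c(two_sided = two_sided$p.value,
        less      = less$p.value,
        greater   = greater$p.value))

# One-sided intervals are unbounded on one side
print(less$conf.int)
print(greater$conf.int)
```

The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.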

## Concluding Remarks

When evaluating data to draw conclusions from it, it helps to construct confidence intervals. These tell us general patterns in the data, and also help us estimate variability. In real-life situations, using confidence intervals and t-tests to estimate the presence or absence of a difference between expectation and estimate is valuable. Often, this is the lifeblood of data-driven decision making when dealing with lots of data, and when coming to impactful conclusions about data. R’s power in quickly generating confidence intervals becomes quite an ally, in the right hands – and of course, if you’ve collected the right data.

# Normality Tests in R

When we see data visualized in a graph such as a histogram, we tend to draw some conclusions from it. When data is spread out, or concentrated, or observed to change with other data, we often take that to mean relationships between data sets. Statisticians, though, have to be more rigorous in the way they establish their notions of the nature of a data set, or its relationship with other data sets. Statisticians of any merit depend on test statistics, in addition to visualization, to test any theories or hunches they may have about the data. Usually, normality tests fit into this toolbox.

Histogram: Can this graph alone tell you whether your data is normally distributed?

Normality tests help us understand the chance that any data we have with us may have come from a normal or Gaussian distribution. At the outset, that seems simple enough. However, when you look closer at a Gaussian distribution, you can observe how it has certain specific properties. For instance, there are two main parameters – a location parameter, the mean, and a scale parameter, the standard deviation. Different combinations of these can mean different shapes of distributions. You can therefore have thin and tall normal distributions, or have fat and wide normal distributions.

When you’re checking a data set for normality, it helps to visualize the data too.

## Normal Q-Q Plots

#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)



The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column.

Q-Q Plot (Normal)

The Q-Q plot compares the quantiles of the data set (in this case, the first column of variable x) with the quantiles we would expect, theoretically, from a normal distribution. If the data is normally distributed, the points fall close to a straight line, which qqline draws through the first and third quartiles. We’re able to do this because of the normal distribution’s properties. The normal distribution is thicker around the mean, and thinner as you move away from it – specifically, around 68% of the points you can expect to see in normally distributed data will lie within 1 standard deviation of the mean. There are similar figures for normally distributed data for 2 and 3 standard deviations (95.4% and 99.7% respectively).
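Both ideas can be checked numerically. The sketch below first recovers the coverage figures from the normal cumulative distribution function, then shows that the horizontal axis of a normal Q-Q plot is simply the set of standard normal quantiles at evenly spaced probabilities. The sample y is stand-in data, assumed for illustration.

```r
# Share of a normal distribution within 1, 2 and 3 standard deviations
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
print(round(100 * within_k_sd, 1))    # 68.3 95.4 99.7

# What qqnorm() puts on the x-axis: theoretical normal quantiles
set.seed(10)                          # assumed seed, only for reproducibility
y <- rnorm(100, mean = 10, sd = 1)    # stand-in sample
qq <- qqnorm(y, plot.it = FALSE)      # compute the coordinates without drawing
theoretical <- qnorm(ppoints(length(y)))
print(isTRUE(all.equal(sort(qq$x), theoretical)))
```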

However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality.

You may ask, what does non-normal data look like in this plot? Here’s an example below, from a binomial distribution, plotted on a Q-Q normal plot.

QQ-Normal plot – observe how binomial distribution data displays categories and extremes
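A plot of this kind can be produced with a few lines. This is a sketch under assumed parameters (100 draws from a binomial distribution with 10 trials and probability 0.5); the post does not say which parameters were used for the figure.

```r
set.seed(7)                               # assumed seed, only for reproducibility
b <- rbinom(100, size = 10, prob = 0.5)   # assumed binomial parameters

# Binomial data can only take the integer values 0..10, so the sample
# has few distinct values and the Q-Q plot shows horizontal "steps".
print(table(b))

# Draw the plot only in an interactive session
if (interactive()) {
  qqnorm(b)
  qqline(b, col = 2)
}
```

The steps appear because many observations share the same value, something a continuous normal sample would almost never do.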

## Anderson Darling Normality Test

The Anderson-Darling test is one of the most commonly used normality tests, and tells us whether or not a sample may represent normally distributed data. This is done in R by using the ad.test() command, in the nortest package. So, if you don’t have the ad.test command popping up on your RStudio auto-complete, you can easily install nortest with the “install.packages” command. Running the Anderson-Darling test for normality returns several pieces of information. Here’s how to make sense of it.


#Running the A-D test for first column
library(nortest)
ad.test(x[,1])



Anderson Darling Normality Test results

The data from the A-D test tells us which data has been tested, and two results: A and p-value.

The A result refers to the Anderson-Darling test statistic. The A-D test uses this test statistic to calculate a p-value. The A-D test tests the default hypothesis that the data (in this case the first column of x) comes from a normal distribution. Assuming that this hypothesis is true, the p-value we see here tells us the probability of seeing a sample that departs from normality at least as much as this one does, purely by random chance. That is to say, in this case, if the process in question really does produce normally distributed data, about 50.57% of samples from it would look at least this non-normal. A p-value this high gives us no evidence against normality, which is why we fail to reject the hypothesis we had originally, that the data does come from a normal distribution. Note that this is not the same as a 50.57% chance that the data is normal.

For another sample, if the p-value were around 3%, for instance, that would mean that only about 3% of samples from a truly normal distribution would depart from normality this much – which is a very low chance. Although a 5% chance is still a small chance, we conventionally choose that as the very bottom end of our acceptability, and should ideally subject such data to scrutiny before we proceed to draw inferences from it. P-values can be confusing for some and hard to interpret – I usually try to construct a sentence to interpret the p-value’s meaning in the context of the hypothesis that’s being tested. I’ll write more on this in a future post on hypothesis testing.

## Interpreting and Understanding A-D test Results

Naturally, as the p-values from an Anderson-Darling normality test become smaller and smaller, the evidence against the data having come from a normal distribution grows stronger. Most statistical studies peg the “significance” level at which we reject the default hypothesis (that this data comes from a normal distribution) outright, at p-values of 0.05 (5%) or less. I’ve known teams and individuals that fail to reject this hypothesis, all the way down to p-values of 0.01. In truth, one significance value (often referred to as $\alpha$) doesn’t fit all applications, and we have to exercise great caution in rejecting null hypotheses.

Let’s go a bit further with our data and the A-D test: we will now perform the A-D test on all the columns of data we’ve got in our variable x. The easiest way to do this in R is to define a function that returns the p-values from each column, and use that in an “apply” command.


#Running the A-D test for every column of x
library(nortest)
#defining a function called "adt" to run the A-D tests and return p-values
adt<-function(x){ test<-ad.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable
pvalues<-apply(x,MARGIN = 2, adt)



The code above analyzes the samples in x (as columns) and returns their p-values as a vector, one per column. So, what do you expect to see when you summarize the variable “pvalues” which stores the test results?

p-values summary (columns of x)

When you summarize p-values, you can see how approximately 9 of the 100 don’t pass the significance criteria (of p>=0.05). You can also see that the p-values in this set of randomly generated samples are randomly distributed over the entire range of probabilities from 0 to 1. We can visualize this too, by plotting the variable “pvalues”.


#Plotting the sample p-values and drawing a significance line
plot(pvalues, main = "p-values for columns in x (AD-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")



p-values for columns in x

## Other tests: Shapiro-Wilk Test

The Anderson-Darling test isn’t the only one available in the nortest package for assessing normality. Statisticians and engineers often use the Shapiro-Wilk test of normality also, which is available as shapiro.test() in base R. For similar data used above (generated as random numbers from the normal distribution), the Shapiro-Wilk test can be performed, with only a few changes to the R script (which is another reason R is so time-efficient).


#Generating 10k points of data and arranging them into 100 columns
x<-rnorm(10000,10,1)
dim(x)<-c(100,100)

#Generating a simple normal quantile-quantile plot for the first column
#Generating a line for the qqplot
qqnorm(x[,1])
qqline (x[,1], col=2)

#Running the Shapiro-Wilk test for every column (shapiro.test is in base R, so nortest is not needed)
#defining a function called "swt" to run the Shapiro-Wilk tests and return p-values
swt<-function(x){ test<-shapiro.test(x); return(test$p.value) }
#store the p-values for each column in a separate variable
swpvalues<-apply(x,MARGIN = 2, swt)

#Plotting the sample p-values and drawing a significance line
plot(swpvalues, main = "p-values for columns in x (Shapiro-Wilk-test)", xlab = "Column number in x", ylab = "p-value")
abline(h=0.05, col ="red")



Shapiro-Wilk test p-values (for columns in x)

Observe that in these samples of 100 points each, the Shapiro-Wilk test fails to reject normality for 96 samples, while rejecting it for 4 (the four dots below the red line). Since every sample was drawn from a normal distribution, these rejections are false alarms, and at a 5% significance level we should expect about 5 of them in 100 samples. We can’t be sure in this example whether the Shapiro-Wilk test is better than the A-D test for normality assessments, however, because these are randomly generated data sets. If we run these tests side-by-side, however, we may get to see interesting results. When I ran these tests side by side, I got a very similar number of significantly different samples (non-normal samples) – either 4 or 5 out of the total 100 – from both tests.
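The handful of rejections is what the significance level itself predicts: when the data really is normal, p-values are roughly uniform between 0 and 1, so about 5% of them fall below 0.05 by chance. The sketch below checks this with a larger simulation; the number of samples and the seed are arbitrary assumptions, and it uses shapiro.test from base R so that no extra package is needed.

```r
set.seed(123)                 # assumed seed, only for reproducibility
n_samples <- 2000             # assumed number of simulated samples

# p-values from Shapiro-Wilk tests on samples that truly are normal
pvals <- replicate(n_samples,
                   shapiro.test(rnorm(100, mean = 10, sd = 1))$p.value)

# Share of samples "rejected" at the 5% level, despite being normal
rejection_rate <- mean(pvals < 0.05)
print(rejection_rate)

# A roughly flat histogram indicates roughly uniform p-values
if (interactive()) {
  hist(pvals, breaks = 20, main = "Shapiro-Wilk p-values under normality",
       xlab = "p-value")
}
```

A rejection rate close to 0.05 here says the test is behaving as designed, not that some of the samples are non-normal.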

## Concluding Remarks

So, what does all this mean for day-to-day data analysis?

• Data analysis of continuous (variable, as opposed to yes/no, or other attribute data) data often uses tools that are meant for normally distributed data
• Visualization alone isn’t sufficient to assess the normality of a given set of data, and test statistics are very important for this
• R provides a fast and convenient way to assess samples for normality, using tools like the A-D test (ad.test in the nortest package) and the S-W test (shapiro.test in base R)
• A-D and S-W tests tend to perform similarly for most data sets (however, research is being done on this)
• The re-usability of R code allows you to set up a macro rather quickly for performing normality tests and a range of studies on test data, in a time-efficient manner

# Data Science: Beyond the Hype

While there is justifiable excitement in the technology industry (and other industries) these days on the widespread availability of data, and the availability of algorithms to process and make sense of this data, I sincerely think (like many others) that the hype behind Big Data is somewhat unfounded.

For many decades, “small data” have been studied in science and industry with the intent of constructing mathematical models, i.e., approximate, error-prone mathematical representations of phenomena. In some ways, the scientific method is all about such data analysis. We often hear in the news about the amplification of effects, the “truth inflation” observed when drawing conclusions from small data sets, to make broader generalizations. We hear about the lack of enough data impeding the progress of research, and we also hear about fabricated data and spurious research results. A lot of scientific findings have come under scrutiny for these reasons – and perhaps analysis of population data (as Big Data promises to do) may help this situation. However, the key difference between past decades of statistics, shaped by legends such as Fisher and George Box, and the present day, with stalwarts in applied statistics and machine learning like Nate Silver, Sebastian Thrun and Andrew Ng, is the ability to leverage computing to analyse large data sets.

A lot of the discussion around Big Data seems to be on the so-called four Vs of Big Data – volume, velocity, variety – and increasingly, veracity – referring to the increasing speed and range of data generated in the information age. However, what’s forgotten often enough, is that below the hype, below the machine learning algorithms and below the databases and technologies, we still have the same underlying principles.

The types of data, the mathematical methods we use to evaluate them, and the fundamental concepts thereof are unchanged – and understanding this is often the key to knowing whether and when to sample from your big data set. This is more important than we realize, because sampling is not obsolete. Often, well collected samples of data may be more than sufficient for establishing or testing a certain hypothesis we may have.
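As a small illustration of why sampling still works, the sketch below treats a million simulated values as the “big” data set and estimates its mean from a random sample of 1,000. The population, the sample size and the seed are all assumptions made for the example.

```r
set.seed(2024)                                 # assumed seed, only for reproducibility
population <- rnorm(1e6, mean = 10, sd = 1)    # stand-in for a large data set
smp <- sample(population, size = 1000)         # simple random sample

# 95% confidence interval for the mean, from the sample alone
ci <- t.test(smp, conf.level = 0.95)$conf.int

print(c(population_mean = mean(population), sample_mean = mean(smp)))
print(ci)
```

With a well drawn sample of this size, the interval is only about a tenth of a unit wide, which is often precise enough to test a hypothesis without touching the full data set.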

In my view, newcomers to the data science and big data revolutions ought to consider a course in statistics, statistical thinking and statistical reasoning first. This lays the foundation for everything else that follows. The internet and most developed and even developing countries are awash with resources that can enable individuals to learn programming and computer-based problem solving, but critical thinking and statistical thinking seem to be harder skills to learn.

Statistical thinking not only requires a level of mathematical rigour but an ability to embrace notions of uncertainty, probabilistic thinking and a fundamental change in one’s notions of cause and effect. Perhaps this is a big step for many. The relative certainty of the logic of programming languages may actually be welcoming to many – which is probably also why we see more discussions about Hadoop and Spark and not enough discussions about statistical hypothesis testing or time series auto-correlation models.

So, if you want to cut through the hype, see data science for what it is, by breaking it up into its elements – the data (which may be coming in from ever more diverse sources), the tools (algorithms, computers) and the science (which is, in this case, statistics). Not everyone is a data scientist, as some articles on the web have begun to claim, but it isn’t only a specific set of skills that makes one a “data scientist”. Some say that these data scientists are glorified statisticians, some say that they’re statistically competent programmers well versed in machine learning, but the truth is probably somewhere in between.

Furthermore, data visualization – another aspect of the data science hype – is both an art and a science – which perhaps implies that you can both be enlightened and obfuscated by charts and graphs. In my view, knowledge of or ability in visualization alone doesn’t make you a data scientist (nor does, for instance, knowledge of machine learning methods alone, or skill in programming R for ETL purposes alone). When you cut through the hype here, what’s pragmatic is to acquire a wide array of skills, and depth in some. Many engineers have a wide swath of knowledge but expertise in only a few areas, and this is the most likely profile for most data scientists too.

There’s definitely more that can be said about specific aspects of the data science “movement”, but what is certain is that the value and relevance of the statistics underlying most of data science should not be underestimated in the present day. Statistics, hopefully, will become as important as learning a language, developing an ability to have conversations, or writing a well argued paragraph.