Power, Difference and Sample Sizes

In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses – the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision-making activity, where you either reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from another fleet? When you collect samples of data, you can compare the average values of the samples and arrive at some inference about the populations they come from. Since the amount of data we need is driven by the variability in that data, we ought to be interested in how much data to actually collect.
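As a rough sketch of the kind of comparison described above (the fleet names and mileage numbers here are invented purely for illustration, not taken from any real data set), a two-sample t-test in R might look like this:

# Hypothetical mileage samples (mpg) from two fleets - illustration only
fleet_a <- c(22.1, 23.4, 21.8, 24.0, 22.9, 23.1)
fleet_b <- c(21.0, 21.9, 22.3, 20.8, 21.5, 22.0)

# Two-sample t-test: is the mean mileage of fleet A greater than that of fleet B?
t.test(fleet_a, fleet_b, alternative = "greater")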

Hypothesis Tests

Statistically speaking, when we collect a small sample of data and calculate its standard deviation, the estimate we get may be noticeably larger or smaller than the actual standard deviation of the population. The relationship between the sample size and the variability of a sample statistic is described by the term standard error. This standard error is an integral part of how a confidence interval is calculated for variable data. The difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be large for small samples and small for large samples. There is an intuitive way to think about this: if you have more information, you can make a better guess at a characteristic of the population the information is coming from.
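To make this concrete, here is a small simulation I've added (not measured data, just an illustration): the sample standard deviation from small samples strays much further from the true \sigma = 1 than it does for large samples.

set.seed(1)  # for reproducibility
# 1000 sample standard deviations from samples of size 5 and of size 500
sd_n5   <- replicate(1000, sd(rnorm(5,   mean = 10, sd = 1)))
sd_n500 <- replicate(1000, sd(rnorm(500, mean = 10, sd = 1)))

summary(sd_n5)    # spreads widely around the true value of 1
summary(sd_n500)  # stays much closer to 1

# The standard error of the mean for a sample x is sd(x) / sqrt(length(x))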

Let’s take another example: Motor racing. Motor racing lap times are generally recorded with extremely high precision and accuracy. If we had a sample of three lap times and wanted to estimate lap times for a circuit, we’d probably do okay, but we’d end up with a wide range of expected lap times. By contrast, if we had a large number of lap time records, we could calculate confidence intervals for the mean value more accurately. We could even estimate the probability that a particular race car driver could clock a certain time, if we could identify the distribution that most closely models the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these risks \alpha and \beta risks.
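For instance, if lap times at a circuit were reasonably well modelled by a normal distribution (the mean and standard deviation below are invented for illustration), estimating such a probability is a one-liner in R:

# Hypothetical circuit: mean lap time 92.0 s, standard deviation 0.6 s
# Probability that a driver clocks a lap faster than 91.0 s
pnorm(91.0, mean = 92.0, sd = 0.6)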

Hypothesis tests, alpha and beta risks

Type I & Type II Errors and Power

To explain it simply, the chance of erroneously rejecting the null hypothesis H_0 when it is actually true is referred to as the Type I error (or \alpha risk). Similarly, the chance of erroneously accepting the null hypothesis when reality differs from what the null hypothesis states is called the Type II error (or \beta risk). Naturally, we want both these errors to have very low probability in our experiments.

How do we determine whether our statistical test is powerful enough to avoid both kinds of risks? We use two statistical terms: significance (known here as \alpha, as in \alpha risk), and power (sometimes denoted \pi, but more commonly expressed as 1-\beta, as we see in the illustration above).

It turns out that statistical power depends heavily on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of difference in fuel efficiency) is greatly diminished. In practice, you may then get a p-value (also discussed in an earlier post) that is greater than the significance level, and fail to reject the null hypothesis even though a real difference exists. Remember that this is now a straightforward comparison between the p-value of the test and what we know now as \alpha. However, note how in interpreting the results of our hypothesis test, we didn’t yet consider 1-\beta. Technically, we should have. This is what causes so many spurious results: underpowered tests let real effects slip through as false negatives, and the few results that do reach significance tend to overstate the size of the effect, leading to truth inflation.
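One way to see this concretely is with base R's power.t.test(); the difference and standard deviation below are assumptions chosen purely for illustration.

# Power to detect a difference of 0.5 units between two groups,
# with sd = 1 and alpha = 0.05, for two different per-group sample sizes
power.t.test(n = 10,  delta = 0.5, sd = 1, sig.level = 0.05)   # power is low (well under 0.5)
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)   # power is high (above 0.9)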

Very often in data-driven businesses, the question of “how many samples are good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect isn’t easy. Going back and forth to collect samples of data for your hypothesis tests is helpful – primarily because you get to see, practically, the effect of sample size in your specific problem.

A Note on Big Data

Big Data promises us what a lot of statisticians didn’t have in the past – the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the infrastructure required to collect, store and analyze such data. While Big Data security is a question that hasn’t yet been fully answered, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email providers on the web that also double up as social networks. Where does this fit in? Well, when building such systems, we have neither the trust of a large number of people, nor the information we require – which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, and when companies have to build systems that handle this kind of information well, they have to experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post are relevant in such contexts.

Summary

  1. Power: The power of a test is its ability to correctly reject the null hypothesis when the alternative is true. As described above, it is defined as 1-\beta, where \beta is the chance of incorrectly accepting the default or null hypothesis H_0.
  2. Standard deviation (\sigma): The more variation we observe in a process, the larger the sample size we need to achieve the same power when detecting the same difference in performance. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in \sigma can affect the number of data points we need to collect to arrive at a result with sufficient power.
  3. Significance (\alpha): As discussed in the earlier post on normality tests, the significance level we choose affects how much data we need to collect. If we can tolerate a greater margin for error in our decisions, we can settle for a larger significance value, perhaps of 10% or 15%. The de facto norm in many industries and processes (for better or for worse, usually for worse) is to use \alpha = 0.05. While this isn’t always a bad thing, it encourages blind adherence, lets myths propagate, and keeps statisticians from thinking about the consequences of their assumptions. In reality, \alpha values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results.
  4. Sample size (n): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test – that is, the greater its ability to detect a difference that really exists and avoid a false negative (see the sketch after this list).
  5. Difference (\Delta): The greater the difference you want to be able to detect between two sets of data (proportions, means or medians), the smaller the sample size you need. This is intuitively easy to understand – it’s like testing a HumVee against a Ford Focus for gas mileage: you need only a few trips (a small sample size) to tell there is a real difference, as opposed to testing two compact cars against each other (when you may require a more rigorous testing approach).
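Here is the sketch referred to in the list above, tying these quantities together with base R's power.t.test(). The difference, standard deviation and target power are assumptions chosen for illustration; leaving n out makes the function solve for it.

# How many data points per group do we need to detect a difference of 0.5,
# if sd = 1, alpha = 0.05 and we want 90% power?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.9)

# A larger difference (delta = 1.0) needs far fewer samples per group
power.t.test(delta = 1.0, sd = 1, sig.level = 0.05, power = 0.9)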

Confidence Intervals and t-tests in R

If you were to walk into a restaurant and order a cup of coffee, you’d expect to get a standard cup of the stuff, and you’d expect to get a sufficient quantity of it too. How much coffee for a certain price is too much, and how much is too little? More realistically, when you know that the same coffee shop may serve ever so slightly different volumes of coffee in the same cup, how can you quantify the coffee in the cup?

What Plausible Ranges of Gas Prices Do You Pay?

The same question may very well be asked at the petrol or diesel station where you fill up your car. When you get there and shell out the money for a few gallons or litres (depending on where you are), you expect to pay a certain amount for each gallon or litre. How do we determine the range of expected prices for a gallon or litre of fuel, based on already available data?

This is where the confidence interval comes in – one of the most important tools of inferential statistics. Inferential statistics is the science of making decisions or informed generalizations about a population, based on the characteristics of some data you have collected from it. An important way to understand variability in any process or product’s performance is to ascribe a range of plausible values. A confidence interval can be defined as the plausible range of values a population parameter may take, if you were to estimate it with a sample statistic. You’d hardly expect a statistician to ask you “What plausible range of gas prices do you pay on average?”, but this is, in fact, close to the truth of interval estimation in statistics.

Confidence Levels and Sampling

Why the word confidence in confidence interval? Well, information costs something to collect. If you wanted to be 100 percent sure about the mean of petrol prices, for instance, you could literally collect data on every transaction from every pump in the world. In the real world, this is impossible, so we resort to sampling.

In the age of Big Data, it seems to be a taboo to talk about sampling sometimes. “You can collect all the data from a process”, some claim. That may be the case for a small minority of processes, but for the vast world out there, characterization is only possible by collecting and evaluating samples of data. And this approach entails the concept of a confidence level.

Confidence levels tell us the proportion of times we’re willing to be right (and wrong) about any parameter we wish to estimate from a sample of data. For example, if I measured the price per gallon of gas at every pump in Maine, or Tokyo, for a day, I’d get a lot of data – and that data would show me some general trends and patterns in the way the prices are distributed, or what typical prices seem to be in effect. If I expect to make an estimate of petrol prices in Tokyo or Maine next July, I couldn’t hope to do this with a limited sample such as this, however. Basic economics knowledge tells us that there could be many factors that could change these prices – and that they could very well be quite different from what they are now. And this is despite the quality of the data we have.

If I wanted to be 95% confident about the prices of petrol within a month from now, I could use a day’s worth of data. And to represent this data, I could use a confidence interval (a range of values), of course. So, whether it is the quantity of coffee in your cup, or the price per gallon of fuel you buy, you can evaluate the broader parameters of your data sets, as long as you can get the data, using confidence intervals.

R Confidence Intervals Example

Now, let’s look at a simple example in R that illustrates confidence intervals.

# Generate some data - 100 points of data
# The mean of the data generated is 10
# The standard deviation we've chosen is 1.0
# The data comes from the Gaussian (normal) distribution

x <- rnorm(100, mean = 10, sd = 1.0)

# Testing an 80% confidence level
x80 <- t.test(x, conf.level = 0.8)

# Testing a 90% confidence level
x90 <- t.test(x, conf.level = 0.9)

# Testing a 99% confidence level
x99 <- t.test(x, conf.level = 0.99)

x80
x90
x99

The first part of the code generates the 100 points of data that serve as our sample in this illustration.

In the next part, the t.test() command is used to generate confidence intervals; it can also test hypotheses you may have developed. In the above code, I’ve saved three different results, based on three different confidence levels – 80%, 90% and 99%. Typically, when you want to be more certain but you don’t have more data, you end up with a wider confidence interval. Expect more uncertainty if you have limited data, and more certainty when you have more data, all other things being equal. Here are the results from the first test – the 80% confidence interval.

1-sample t-test and confidence interval (80% confidence)

Let’s break down the results of the t-test. First, we see the data set the test was performed on, and then we see the t-statistic and a p-value. Further, you can see a confidence interval (9.85, 10.10) for the mean, based on a sample size of 100 and a confidence level of 80%. Since t.test() is fundamentally a hypothesis test that reports a confidence interval to help you make your decision, you can also see an alternative hypothesis listed there. Then there’s the estimate of the mean – 9.98. Recall the characteristics of the normal distribution we used to generate the data in the first place – it had a mean \mu = 10.0 and a standard deviation \sigma = 1.
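If you want to see where that interval comes from, it can be reproduced by hand from the sample statistics; this reconstruction is mine, not part of the original output.

# 80% confidence interval for the mean, computed manually:
# xbar +/- t(0.90, df = n - 1) * s / sqrt(n)
n    <- length(x)
xbar <- mean(x)
s    <- sd(x)
xbar + c(-1, 1) * qt(0.90, df = n - 1) * s / sqrt(n)
# This should match the interval reported by t.test(x, conf.level = 0.8)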

Confidence Levels versus Confidence Intervals

To round things off, we can also look at the confidence intervals and mean estimates for the remaining two confidence levels. For ease, I’ll print them out from storage.

CIs comparison (80%, 90%, 99%)
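Since t.test() returns an ordinary htest object, the intervals and mean estimate can also be pulled directly out of the stored results:

# Extract just the confidence intervals and the mean estimate
x80$conf.int
x90$conf.int
x99$conf.int
x80$estimate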

Observe how, for the same data set, the confidence intervals (plausible ranges of values for the mean of the data) differ depending on the confidence level. The confidence intervals widen as the confidence level increases: the 80% CI is calculated to be (9.85, 10.10), while the same sample yields (9.72, 10.24) when the CI is calculated at the 99% confidence level. When you consider that standard deviations can be quite large, the confidence level you use in your calculations can actually become a matter of importance!

More on how confidence intervals are calculated here, at NIST.

The 1-sample t-Test

We’ve seen earlier that the command that is invoked to calculate these confidence intervals is actually called t.test(). Now let’s look at what a t-test is.

In inferential statistics, specifically in hypothesis testing, we test samples of data to determine if we can make a generalization about them. When you collect data from a process you know, the expectation is that this data will look a lot like what you think it should look like. If this is data about how much coffee you were served, or how much you paid for a gallon of gasoline, you may be interested in knowing, for instance, whether the price per gallon is any different here compared to some other gas station. One way to find this out – with a stated level of statistical confidence – is through a t-test. A t-test fundamentally tells us whether we fail to reject a default hypothesis we have (also called the null hypothesis and denoted H_0), or whether we reject the default hypothesis and embrace an alternative hypothesis (denoted H_a). In the case of the 1-sample t-test, therefore:

H_0 : \mu = k  \newline  H_a : \mu \neq k

Depending on the nature of the alternative hypothesis, we could have inequalities there as well. Note that k here is the value you expect the mean to take. If you don’t supply one, t.test() defaults to \mu = 0, and it will still calculate a confidence interval and produce a mean estimate.

A t-test uses the t-distribution, which looks a lot like the Gaussian or normal distribution, except that it has an additional parameter – the degrees of freedom – which relates directly to the sample size of your data. Understandably, the size of the sample could give you very different results in a t-test.
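One quick way to see this (a plot sketch I've added, not from the original post) is to draw the standard normal density alongside t-densities with small and moderate degrees of freedom – the t-distribution has heavier tails when the sample is small:

# Compare the standard normal with t-distributions of different df
curve(dnorm(x), from = -4, to = 4, ylab = "density")
curve(dt(x, df = 4),  add = TRUE, lty = 2)   # small sample: heavier tails
curve(dt(x, df = 99), add = TRUE, lty = 3)   # large sample: nearly normal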

As with other hypothesis tests, you also get a p-value. The null hypothesis in a 1-sample t-test is relatively straightforward – that there is no difference between the mean of the population the sample comes from and the hypothesized mean. Naturally, the alternative hypothesis could help us study whether the population mean is less than expected (less expensive gas!) or greater (more expensive gas!) than the expectation.

Let’s take another look at that first t-test result:

1-sample t-test and confidence interval (80% confidence)

The confidence level we’ve used here is 80%, which translates to a 20% significance level. To interpret the results above, we compare the p-value with the significance level, and reject the null hypothesis if p is smaller. But what are the t-statistic and df? The t-statistic is computed from the sample mean, the hypothesized mean (which defaults to \mu = 0 when no mu is supplied to t.test()), the sample standard deviation, and the sample size of 100. (The “df” here stands for degrees of freedom – which stands at 99, calculated from the 100 data points we have and the 1 parameter we’re estimating.)
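For what it's worth, the t-statistic and a two-sided p-value can be reproduced by hand from the same quantities (remember that t.test() uses \mu = 0 when no mu is supplied):

# Reproducing the t-statistic and two-sided p-value by hand
mu0   <- 0                        # t.test()'s default hypothesized mean
n     <- length(x)
tstat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
tstat
2 * pt(-abs(tstat), df = n - 1)   # two-sided p-value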

Alternative Hypotheses Inequalities

The t.test() command also allows us to evaluate how the confidence intervals look, and what the p-values are, when we have different alternative hypotheses. We can test whether the population mean being estimated is less than, or greater than, an expected value. We can also specify what we expect that mean to be.


x80 <- t.test(x, conf.level = 0.8, mu = 10.1, alternative = "less")

x80

1-sample t-test and confidence interval (80% confidence, mu = 10.1, alternative “less”)

Observe how the p-value and the confidence interval have changed.

We’ve evaluated the same 80% confidence level, but now with a hypothesized mean of \mu = 10.1 and the alternative hypothesis that \mu < 10.1.
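The "greater" alternative works the same way – a minimal sketch, reusing the same assumed \mu of 10.1:

# Test whether the population mean is greater than 10.1
x80_greater <- t.test(x, conf.level = 0.8, mu = 10.1, alternative = "greater")
x80_greater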

Concluding Remarks

When evaluating data to draw conclusions from it, it helps to construct confidence intervals. These tell us about general patterns in the data, and also help us estimate variability. In real-life situations, using confidence intervals and t-tests to assess the presence or absence of a difference between expectation and estimate is valuable. Often, this is the lifeblood of data-driven decision making when dealing with lots of data and coming to impactful conclusions about it. R’s ability to quickly generate confidence intervals becomes quite an ally, in the right hands – and of course, if you’ve collected the right data.