In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses: the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision-making activity, where you reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from some other fleet? When you collect samples of data, you can compare the average values of the samples and, from this information, arrive at some inference about the population. Since samples are affected by variability, we ought to be interested in how much data to actually collect.
Statistically speaking, when we collect a small sample of data and calculate its standard deviation, the estimate we get tends to land noticeably above or below the actual standard deviation. The relationship between the sample size and the variability of an estimate such as the sample mean is described by the standard error, which is the standard deviation divided by the square root of the sample size. This standard error is an integral part of how a confidence interval is calculated for variable data. The difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be large for small samples and small for large samples. There is an intuitive way to think about this. If you have more information, you can make a better guess at a characteristic of the population the information is coming from.
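To make the effect of sample size concrete, here is a minimal simulation sketch. It assumes normally distributed data with a true standard deviation of 1, and the function names, sample sizes and number of trials are illustrative choices of mine. It measures how much the sample standard deviation wanders from sample to sample, next to the standard error of the mean for the same sample size.

```python
import math
import random
import statistics

def sample_sd_spread(n, true_sd=1.0, trials=500, seed=0):
    """Spread (standard deviation) of the sample SDs from many samples of size n."""
    rng = random.Random(seed)
    sds = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n)]
        sds.append(statistics.stdev(sample))
    return statistics.stdev(sds)

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Illustrative sample sizes; the true SD is assumed to be 1.0.
spread = {n: sample_sd_spread(n) for n in (5, 20, 80)}
for n, value in spread.items():
    print(f"n={n:2d}: spread of sample SD = {value:.3f}, "
          f"standard error of mean = {standard_error(1.0, n):.3f}")
```

Both columns shrink as the sample size grows, which is the sense in which more data gives a better guess at the population.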
Let’s take another example: motor racing. Motor racing lap times are generally recorded with extremely high precision and accuracy. If we had a sample of only three lap times and wanted to estimate lap times for a circuit, we’d probably do okay, but we would have a wider range of expected lap times. In contrast, if we had a large number of lap time records, we could more accurately calculate the confidence interval for the mean value. We could even estimate the probability that a particular race car driver could clock a certain time, if we were able to understand the distribution that is the closest model of the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these $\alpha$ risks and $\beta$ risks.
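Returning to the lap-time example, the sketch below shows how the confidence interval for the mean narrows as more laps are recorded. It is a simplified illustration: it assumes the lap-to-lap standard deviation is known (so a normal-based interval applies), and the mean of 92.0 s and standard deviation of 0.5 s are made-up numbers. With only three laps and an estimated standard deviation, a t-based interval would be wider still.

```python
import math
from statistics import NormalDist

def mean_ci(sample_mean, sigma, n, confidence=0.95):
    """Normal-based confidence interval for a mean, assuming sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical numbers: mean lap time 92.0 s, lap-to-lap sigma 0.5 s.
for n in (3, 30, 300):
    low, high = mean_ci(92.0, 0.5, n)
    print(f"n={n:3d}: {low:.3f} s to {high:.3f} s")
```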
Hypothesis tests, alpha and beta risks
Type I & Type II Errors and Power
To explain it simply, the chance of erroneously rejecting the null hypothesis is referred to as the Type I error (or $\alpha$ risk). Similarly, the chance of erroneously failing to reject the null hypothesis (when the reality is different from what the null hypothesis states) is called the Type II error (or $\beta$ risk). Naturally, we want both these errors to have very low probability in our experiments.
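Both error rates can be estimated by simulation. The following is a rough sketch, not a recipe: it assumes a two-sided one-sample z-test of the null hypothesis "mean = 0" with a known standard deviation of 1, and the effect size of 0.5, the sample size of 20 and the helper name `rejection_rate` are all hypothetical choices for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n, alpha=0.05, sigma=1.0, trials=5000, seed=1):
    """Fraction of simulated two-sided z-tests of H0: mean = 0 that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true here, so every rejection is a Type I error.
type_1 = rejection_rate(true_mean=0.0, n=20)
# H0 is false here, so every failure to reject is a Type II error.
type_2 = 1 - rejection_rate(true_mean=0.5, n=20)
print(f"Type I rate ~ {type_1:.3f}, Type II rate ~ {type_2:.3f}")
```

The Type I rate should land near the chosen significance level, while the Type II rate depends on the true difference and the sample size.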
How do we determine if our statistical model is powerful enough, therefore, to avoid both kinds of risks? We use two statistical terms: significance (known here as $\alpha$, as in $\alpha$ risk), and power (sometimes given a symbol of its own, but more commonly written as $1 - \beta$, as we see from the illustration above).
It turns out that the statistical power is heavily dependent on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of difference in fuel efficiency) is diminished greatly. Even when a real difference exists, you may receive a result that gives you a p-value (also discussed in an earlier post) that is greater than the significance level, and so fail to reject the null hypothesis. Remember that this is now a straightforward comparison between the p-value of the test and what we know now as $\alpha$. However, note how in the results interpretation of our hypothesis test, we didn’t yet consider $\beta$. Technically, we should have. And this is what causes so many spurious results, because false positives end up getting ignored, leading to truth inflation.
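The dependence of power on sample size can be sketched with a short calculation. This is an approximation under stated assumptions: a two-sided, two-sample z-test with equal group sizes and a known, common standard deviation. The function name and the fuel-efficiency numbers (a 1 mpg difference against a 3 mpg standard deviation) are hypothetical.

```python
import math
from statistics import NormalDist

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with n observations per group."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # True difference expressed in units of the standard error of the difference.
    shift = delta / (sigma * math.sqrt(2 / n))
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical case: detect a 1 mpg difference when sigma is 3 mpg.
for n in (10, 50, 150, 300):
    print(f"n per group = {n:3d}: power ~ {power_two_sample(1.0, 3.0, n):.2f}")
```

With a small sample, most real differences of this size would go undetected, which is why the p-value alone does not tell the whole story.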
Very often in data-driven businesses, the question of “how many samples is good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect, isn’t easy. Going back-and-forth to collect samples of data in order to do your hypothesis tests is helpful – primarily because you can see the effects of sample size in your specific problem, practically.
A Note on Big Data
Big Data promises us what a lot of statisticians didn’t have in the past: the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the infrastructure in place to make the call to collect, store and analyze such data. While Big Data security is a question that has not yet been completely answered, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email account providers on the web that also double up as social networks. Where does sample size fit into this? Well, when building such systems, we have neither the trust of a large number of people, nor the information we require, which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, and when companies have to build systems that handle this kind of information well, they have to experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post will be relevant in such contexts.
- Power: The power of a test is defined as the ability to correctly reject the null hypothesis of a test. As we’ve described it above, it is defined by $1 - \beta$, where $\beta$ is the chance of incorrectly accepting the default or null hypothesis $H_0$.
- Standard deviation ($\sigma$): The more variation we observe in any given process, the greater our target sample size should be to achieve the same power while detecting the same difference in performance. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in $\sigma$ can affect the number of data points we need to collect to arrive at a result with sufficient power.
- Significance ($\alpha$): As discussed in the earlier post on normality tests, significance of a result can have an impact on how much data we need to collect. If we can tolerate a wider margin for error in our decisions, we can settle for a larger significance value, perhaps of 10% or 15%. The de facto norm in many industries and processes (for better or for worse, usually for worse) is to use $\alpha = 0.05$. While this isn’t always a bad thing, it encourages blind adherence, allows myths to propagate, and prevents statisticians from thinking about the consequences of their assumptions. In reality, values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results.
- Sample size ($n$): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test (and by that I mean, the test has a greater ability to detect a real difference when one exists).
- Difference ($\Delta$): The greater the difference you want to be able to detect between two sets of data (proportions or means or medians), the smaller the sample size you need. This is an intuitively easy thing to understand. Think of testing a Humvee for gas mileage versus a Ford Focus: you need only a few trips (a small sample size) to tell a real difference, as opposed to testing two compact cars against each other (when you may require a more rigorous testing approach).
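The factors listed above come together in a common textbook approximation for comparing two means with equal group sizes and a known, common standard deviation: $n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2 / \Delta^2$ observations per group. The sketch below implements that approximation. The function name and example values are hypothetical, and for small samples a t-based calculation would give somewhat larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    Uses n = 2 * ((z_(1 - alpha/2) + z_(power)) * sigma / delta) ** 2, rounded up.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical mileage comparison with sigma = 3 mpg.
print("Small difference (1 mpg):", sample_size_per_group(1.0, 3.0))
print("Large difference (6 mpg):", sample_size_per_group(6.0, 3.0))
print("Stricter alpha of 0.01  :", sample_size_per_group(1.0, 3.0, alpha=0.01))
```

Each argument moves the answer in the direction described in the list: more variation, a smaller significance value or a higher power requirement all call for more data, while a larger difference calls for less.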