# Data and Strategy for Small and Medium Organizations

Data analytics and statistics aren’t historically associated with the strategic decisions that leaders of small and medium sized businesses take. Larger organizations, and organizations with large user bases, have used data analytics for some years, relying on big data to drive consumer and business insight in decision making. However, smaller businesses can also benefit from the large volumes of data now being collected, including from public databases. Most decisions in traditional, small and medium businesses are still taken by leaders who, at best, have a pulse of the market and domain knowledge of their business, but who aren’t using the data at their disposal to build mathematical models and derive strategies from them.

## When does data fit into strategy?

To answer this, we need to understand the purpose of strategy and strategic initiatives themselves. In small and medium organizations, the purpose of strategic initiatives, especially mid- and short-term strategies, is to enable growth. Larger organizations have the benefit of extensive user bases, consumer bases and resources, which they can use to develop, test, validate and release new products and services. Smaller and medium sized organizations, by contrast, undertake strategic initiatives with a focus that tends to be limited to the near term and to maintaining good financial performance. Small and medium organizations in modern economies also seek to maintain leverage and a consumer base that is dedicated and loyal to their product or service journey. The latter is especially true of niche product companies, because they sell lifestyles, not merely products.

In this context, data fits into strategy in the following key ways:

1. Descriptive data analytics allows strategists and leaders to question underlying assumptions of existing strategies
2. Data visualizations allow strategists to classify and rank opportunities and have more cost and time efficient strategies
3. Inferential data analytics, predictive analytics and simulations allow strategists to play out scenarios, and take a peek into the future of the business

Descriptive data analytics may work with public data, or with data already available within the organization. It could be composed of statistical reports illustrating growth in demand, market size, or broader trends and patterns in consumption of, or demand for, a certain product, service, or opportunity. Descriptive analytics is easy enough to do, and usually doesn’t involve complex modeling. It is a good entry point for strategists who hope to become more data driven in the development of their strategies.
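As a minimal sketch of descriptive analytics in R – using the built-in `mtcars` data set as a stand-in for a public or internal data source:

```r
# Descriptive analytics sketch using a built-in data set
# (mtcars stands in for any public or internal data source)
data(mtcars)

# Summary statistics: central tendency and spread of one variable
summary(mtcars$mpg)

# A simple grouped summary: mean fuel efficiency by cylinder count
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```

Even simple summaries like these can prompt questions about the assumptions behind an existing strategy.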

Data visualizations, in addition to being communication tools that give strategists leverage, can also throw light on the functional aspects of which opportunities to seek out and which strategies to develop. They can help strategists make connections and see relationships that would otherwise not have been apparent. Data visualization has been made easier and more affordable by powerful and free software such as R and RStudio. Visualizations are extremely effective as communication and ideation tools, and for strategists looking to mature beyond descriptive statistics in developing their strategies, they can be valuable.
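A base R sketch of the kind of visualization meant here, again using `mtcars` as a placeholder data set:

```r
# Visualization sketch in base R: relationship between two variables
data(mtcars)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Heavier cars tend to be less fuel efficient",
     pch = 19, col = "steelblue")

# A trend line makes the relationship easier to communicate
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
```

A plot like this often communicates a relationship faster than the table of numbers behind it.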

Inferential data analytics leverages the predictive power of mathematical and statistical models. By representing what is common knowledge as a mathematical model, we can apply it to diverse situations, and throw new light on problems that we haven’t evaluated before in a scientific or data driven manner. Inferential data analytics generally requires individuals with experience as data scientists. Inferential models require a good understanding of basic and inferential statistics, and can therefore be more complex to incorporate into data based strategy models. While descriptive analytics and visualizations may not be driven by advanced algorithms such as neural networks or machine learning, advanced and inferential analytics certainly can be.
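A minimal inferential sketch in R, with `mtcars` once more standing in for business data and the predicted vehicle being entirely hypothetical:

```r
# Inferential sketch: fit a simple model and use it to predict
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Inference: which predictors appear statistically significant?
summary(fit)$coefficients

# Prediction with an interval, for a hypothetical new vehicle
predict(fit, newdata = data.frame(wt = 3.0, hp = 120),
        interval = "prediction")
```

The same pattern – fit, inspect, predict – scales up to far richer models than this two-variable regression.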

## Data for Short and Medium Term Strategy

Data analysis that informs short term strategy and data analysis that informs medium term strategy are fundamentally different. Short term strategy, which focuses on the immediate near term of a business, generally seeks to inform operational teams on how they should act. This may be a set of simple rules used to run the rudiments of the business on a day-to-day basis. Why use data to drive the regular activities of businesses for which extensive procedures may already be in place? Because keeping one’s ear to the ground – collecting customer and market information on an ongoing basis – is extremely important for most businesses in today’s competitive business world. Continual improvement and quality are fundamental to a wide variety of businesses, and data that informs the short term is therefore extremely important.

Data analytics in the short term doesn’t rely on extensive analysis, but keeping abreast of information and the trends and patterns we see in them on a day-to-day basis. Approaches relevant to short term strategy may be:

1. Dashboards and real time information streams
2. Automatically generated reports that give operations leaders or general managers a pulse of the market, or a pulse of the business
3. Sample data analysis (small data, as opposed to big data), that informs managers and teams about the ongoing status of a specific process or product – this is similar to quality management systems in use in various companies small and large
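Item 3 above can be sketched in a few lines of R. The measurements below are simulated, and the 3-sigma limits are the simplest possible control limits from basic quality management:

```r
# Small-sample process monitoring sketch (simulated measurements,
# e.g. a product dimension tracked in a quality management system)
set.seed(42)
measurements <- rnorm(30, mean = 100, sd = 2)

xbar <- mean(measurements)
s    <- sd(measurements)

# Simple 3-sigma control limits around the sample mean
ucl <- xbar + 3 * s
lcl <- xbar - 3 * s

# Points outside the limits would warrant investigation
out_of_control <- measurements[measurements > ucl | measurements < lcl]
length(out_of_control)
```

A real deployment would use proper control chart constants, but the idea – small, ongoing samples informing day-to-day operations – is the same.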

Data analytics in the mid term strategy space is quite a different situation, being required to inform strategists about the impact of changing market scenarios on a future product or service launch. The data analysis here should seek to serve the strategists’ need to be informed about served and total addressable markets, competitive space, penetration and market share expectations, and such business-specific criteria that help fund, finance or prioritize the development of new products or services.

Accordingly, data analytics in the mid-term strategy space (also called Horizon 2 strategy) typically calls for deeper analysis, usually by data scientists. Tools and themes of analysis may include:

1. Consumer sentiment analysis to determine the relevance of a particular product or service
2. Patents and intellectual property data munging, classification and text mining for category analysis
3. Competitor analysis by automated searches, classification algorithms, risk analysis by dynamic analytical hierarchy processes
4. Scenario analysis and simulation, driven by methods such as Markov Chain Monte Carlo analysis
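As a highly simplified illustration of item 4 – a plain Monte Carlo simulation rather than a full Markov Chain Monte Carlo model, with entirely hypothetical revenue and growth figures:

```r
# Simplified Monte Carlo scenario sketch (not a full MCMC model):
# simulate one year of monthly revenue under uncertain growth
set.seed(123)
n_scenarios   <- 10000
start_revenue <- 100  # hypothetical starting monthly revenue

final_revenue <- replicate(n_scenarios, {
  monthly_growth <- rnorm(12, mean = 0.02, sd = 0.05)  # uncertain growth
  start_revenue * prod(1 + monthly_growth)
})

# Summarize the scenario distribution: pessimistic, median, optimistic
quantile(final_revenue, probs = c(0.05, 0.5, 0.95))
```

Even this toy version gives strategists a distribution of outcomes to discuss, rather than a single point forecast.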

Observe how the analyses above are distinct from the more readily available information that’s shared with operational teams. The data analytics activities here generally require rigorous analysis of data, not merely the collection and presentation of available data that fits a certain definition. When data is unstructured, and when the work requires cleaning and visualizing data and building models from a starting point such as public data, the task is much more challenging. This is where the skills of well trained data scientists and data analysts are essential.

## Data for Long Term Strategy

One narrative that has made itself known through data in the world of business is that the long term, as it was traditionally known, is shrinking. Even S&P 500 companies are conspicuous these days for how short lived they are, and small and medium companies are no exception. Successful tech companies boast product and service development cycles of a few months up to a year, and thanks to such innovation, the technology world can look unrecognizable within a few months. However, there is probably a method to even this madness. The scale and openness of access have made the consumer and end user powerful, and the consumer these days can do things with free resources and tools that could only be imagined a few years ago.

Data informs strategists in such longer term strategic scenarios, typically, five years or more, by helping construct scenarios. Data analytics in scenario planning should account for the following:

1. Dynamic trends in the speed at which information and data are generated and collected (Velocity, of the four Vs)
2. Dynamic changes in the type of information being collected (Variety, of the four Vs)
3. Dynamic changes in the reliability of information being collected (Veracity, of the four Vs)

Volume, the fourth V, is a measure of the amount of data collected at specific points in time. The three dynamics above go beyond volume alone: together, they represent the growing size, variety and unreliability of available data.

Simpler data analysis can serve some of the purposes above, while specific approaches such as scenario analysis can use sophisticated mathematical models. In small and medium organizations, where the focus is usually on the short term, and at best on the mid term, data analytics can help inform executives about the long term and keep that conversation going. It is easy in smaller organizations to fall into the trap of not preparing for the long term. In the mid and long term, more advanced methods can be used to guide and inform the organization’s vision.

## Concluding Remarks

Data analytics as applied to strategy is not entirely new; many mature organizations already work this way. For small and medium businesses, which are mushrooming around the developed and developing world, data analytics is a force multiplier for leaders and for strategic decision making. It can reveal information we have hitherto believed to be the preserve of large organizations that can collect data on an unprecedented scale and hire expert teams to analyze it. What makes analytics relevant to small and medium businesses today is that, in our changing business landscape, analytics driven companies can respond more nimbly to the needs of customers, and excite customers in new ways that traditional, less agile and larger organizations are unlikely to match. The wealth of mature data analysis tools and approaches available, combined with public data, can therefore make leaders and strategists in small and medium organizations more competitive.

# Animated: Mean and Sample Size

A quick experiment in R can unveil the impact of sample size on the estimates we make from data. A small number of samples provides us less information about the process or system from which we’re collecting data, while a large number can help ground our findings in near certainty. See the earlier post on sample size, confidence intervals and related topics on R Explorations.

Using the “animation” package once again, I’ve put together a simple animation to describe this.

```r
# package containing the saveGIF function
library(animation)

# setting GIF options
ani.options(interval = 0.12, ani.width = 480, ani.height = 320)

# a function to help us generate GIF plots easily
plo <- function(samplesize, mu, sd, iter = 100){

  for (i in seq(1, iter)){

    # Generating a sample from the normal distribution
    x <- rnorm(samplesize, mu, sd)

    # Histogram of samples as they're generated
    hist(x, main = paste("N = ", samplesize, ", xbar = ", round(mean(x), digits = 2),
                         ", s = ", round(sd(x), digits = 2)), xlim = c(5, 15),
         ylim = c(0, floor(samplesize / 3)), breaks = seq(4, 16, 0.5),
         col = rgb(0.1, 0.9, 0.1, 0.2),
         border = "grey", xlab = "x (Gaussian sample)")

    # Adding the estimate of the mean line to the histogram
    abline(v = mean(x), col = "red", lwd = 2)
  }
}

# Setting the parameters for the distribution
mu <- 10.0
sd <- 1.0

# One GIF per sample size
for (i in c(5, 10, 50, 500, 1000, 10000)){
  saveGIF({plo(i, mu, sd)}, movie.name = paste("N=", i, ", mu=", mu, ", sd=", sd, ".gif"))
}
```



## Animated Results

Very small sample size of 5. Observe how the sample mean line hunts wildly.

A small sample size of 10. Mean once again moves around quite a bit.

Moderate sample size of 50. Far less inconsistency in estimate (red line)

A larger sample size, showing little deviation in sample mean over different samples

A large sample size, indicating minor variations in sample mean

Very large sample size (however, still smaller than many real world data sets!). Sample mean estimate barely changes over samples.

# Comparing Non-Normal Data Graphically and with Non-Parametric Tests

Not all data in this world is predictable in exactly the same way, of course, and not all data can be modeled using the Gaussian distribution. At times we have to make comparisons using data drawn from one of many distributions that show patterns quite unlike the familiar and comforting “bell curve” of the normal distribution we’re used to seeing in business presentations and the media alike. For instance, here’s data from the Weibull distribution, plotted using different shape and scale parameters. A Weibull distribution has two parameters, shape and scale, which determine how it looks (which varies widely) and how spread out it is.


```r
# Draw a large sample from a Weibull distribution and plot it
shape <- 1
scale <- 5
x <- rweibull(1000000, shape, scale)
hist(x, breaks = 1000,
     main = paste("Weibull Distribution with shape: ", shape, ", and scale: ", scale))
abline(v = median(x), col = "blue")  # median of the sample
abline(v = scale, col = "red")       # the scale parameter
```


Shape = 1; Scale = 5. The red line represents the scale value, and the blue line, the median of the data set.

Here’s data from a very different distribution, which has a scale parameter of 100.

Shape = 1; Scale = 100. Same number of points. The red and blue lines mean the same things here too.

The shape parameter, as can be seen clearly here, is called so for a good reason. Even when the scale parameter changes wildly (as in our two examples), the overall geometry of our data looks similar – but of course, it isn’t. The change in the scale parameter has changed the probability of an event near the lower end of the x range ($x \to 0$) relative to an event much further away ($x \gg 0$). When you superimpose these distributions and their medians, you can get a very different picture of them.

If we have two very similar looking data sets, like those in the first and second graphs, what kinds of hypothesis tests can we use? It is a pertinent question, because at times we may not know that a data set represents a process that can be modeled by a specific kind of distribution. At other times, our data may represent entirely empirical distributions. And we’d still want to make comparisons using such data sets.

```r
# Two Weibull samples that differ only in their scale parameters
shape <- 1
scale1 <- 5
scale2 <- scale1 * 2
x <- rweibull(1000000, shape, scale1)
xprime <- rweibull(1000000, shape, scale2)
hist(x, breaks = 1000, border = rgb(0.9, 0.2, 0.2, 0.2), col = rgb(0.9, 0.2, 0.2, 0.2),
     main = paste("Weibull Distribution different scale parameters: ", scale1, ", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2, 0.9, 0.2, 0.2), col = rgb(0.2, 0.9, 0.2, 0.2), add = T)
abline(v = median(x), col = "blue")      # median of the first sample
abline(v = median(xprime), col = "red")  # median of the second sample
```



Different scale parameters. Red and blue lines indicate medians of the two data sets.

The Weibull distribution is known to be quite versatile: with a shape parameter of 1 it reduces to the exponential distribution, commonly used to model constant failure rate data in engineering systems, and with shape parameters in the region of 3 to 4 it can even approximate the Gaussian distribution for real world data. Let’s look at data from a different pair of distributions with a different shape parameter, this time 3.0.

```r
# Two Weibull samples with shape 3 and slightly different scales
shape <- 3
scale1 <- 5
scale2 <- scale1 * 1.1  # Different scale parameter for the second data set
x <- rweibull(1000000, shape, scale1)
xprime <- rweibull(1000000, shape, scale2)
hist(x, breaks = 1000, border = rgb(0.9, 0.2, 0.2, 0.2), col = rgb(0.9, 0.2, 0.2, 0.2),
     main = paste("Weibull Distribution different scale parameters: ", scale1, ", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2, 0.9, 0.2, 0.2), col = rgb(0.2, 0.9, 0.2, 0.2), add = T)
abline(v = median(x), col = "blue")      # median of the first sample
abline(v = median(xprime), col = "red")  # median of the second sample
```


Weibull distribution data – different because of scale parameters. Vertical lines indicate medians.

The medians can be used to illustrate the differences between the data, and to summarize the differences observed in the graphs. However, when we know that a data set is non-normal, we can adopt non-parametric methods from the hypothesis testing toolbox in R. Just as we compare the means of samples of normally distributed data, we can compare the medians of two or more samples of non-normally distributed data. Naturally, the same caveats apply – larger samples of data are, at times, better – but the tests can help us analytically differentiate between two similar-looking data sets. Since the Mann-Whitney test and other non-parametric tests don’t make assumptions about the parameters of the underlying distribution of the data, we can rely on them to a greater extent when studying differences between samples that we suspect have a greater chance of being non-normal (even if normality tests say otherwise).

Non-parametrics and the inferential statistics approach: how to use the right test

When we conduct the AD test for normality on the two samples in question, we can see how these samples return a very low p-value each. This can also be confirmed using the qqnorm plots.
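A minimal sketch of such a normality check in R; here `shapiro.test()` from base R stands in for the AD test (`ad.test()` from the `nortest` package), and the Weibull parameters are arbitrary demo values:

```r
# Normality check on a clearly non-normal (exponential-like) sample:
# a formal test plus a Q-Q plot
set.seed(1)
x <- rweibull(5000, shape = 1, scale = 5)

shapiro.test(x)$p.value  # very low p-value: evidence of non-normality
qqnorm(x)                # points deviate systematically from the line
qqline(x)
```

The very low p-value and the curvature in the Q-Q plot tell the same story from two angles.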

Let’s use the Mann-Whitney test to assess the difference between the median values of two samples of non-normal data, using the wilcox.test() command. For two samples, the wilcox.test() command actually performs a Mann-Whitney test.

```r
# Two large Weibull samples with very close scale parameters
shape <- 3
scale1 <- 5
scale2 <- scale1 * 1.01
x <- rweibull(1000000, shape, scale1)
xprime <- rweibull(1000000, shape, scale2)

# Anderson-Darling normality tests on both samples
library(nortest)
paste("Normality test p-values: Sample 'x' ", ad.test(x)$p.value,
      " Sample 'xprime': ", ad.test(xprime)$p.value)

# Overlaid histograms of the two samples
hist(x, breaks = length(x) / 10, border = rgb(0.9, 0.2, 0.2, 0.05), col = rgb(0.9, 0.2, 0.2, 0.2),
     main = paste("Weibull Distribution different scale parameters: ", scale1, ", ", scale2))
hist(xprime, breaks = length(xprime) / 10, border = rgb(0.2, 0.9, 0.2, 0.05),
     col = rgb(0.2, 0.2, 0.9, 0.2), add = T)
abline(v = median(x), col = "blue")
abline(v = median(xprime), col = "red")

# Mann-Whitney test and the two sample medians
wilcox.test(x, xprime)
paste("Median 1: ", median(x), "Median 2: ", median(xprime))
```


Observe how close the scale parameters of both samples are. We’d expect both samples to overlap, given the large number of points in each sample. Now, let’s see the results and graphs.

Nearly overlapping histograms for the large non-normal samples

The results for this are below.

Mann-Whitney test results

The p-value here (for this considerable sample size) clearly illustrates the presence of a significant difference. A very low p-value in this test result indicates that, if we were to assume the two distributions are identical, there would be an extremely small probability of seeing samples as different as these. The fine difference in the medians printed alongside the results is picked up by this test.

To run the Mann-Whitney test at a different confidence level, and to obtain a confidence interval for the location shift, we can use the following syntax (conf.level only takes effect when conf.int = TRUE):

```r
wilcox.test(x, xprime, conf.int = TRUE, conf.level = 0.95)
```


Note 1: The mood.test() command in R performs a two-sample test of scale. The scale parameters of the samples generated for this demo are known by construction; in real life situations, the p-value should be interpreted alongside additional information, such as the sample size and significance level.

Note 2: The wilcox.test() command performs the Mann-Whitney test. Strictly speaking, this is a comparison of mean ranks, and not of the medians per se.
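A quick sketch of the mood.test() comparison described in Note 1, with arbitrary demo parameters:

```r
# Two-sample scale comparison with mood.test():
# same shape, one sample with double the scale parameter
set.seed(11)
a <- rweibull(500, shape = 3, scale = 5)
b <- rweibull(500, shape = 3, scale = 10)

mood.test(a, b)$p.value  # low p-value: evidence of a scale difference
```

Since the second sample was generated with double the scale, the test has a genuine difference to detect here.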

# Power, Difference and Sample Sizes

In my earlier posts on hypothesis testing and confidence intervals, I covered how there are two hypotheses – the default or null hypothesis, and the alternative hypothesis (which is like a logical opposite of the null hypothesis). Hypothesis testing is fundamentally a decision making activity, where you reject or fail to reject the default hypothesis. For example: how do you tell whether the gas mileage of cars from one fleet is greater than the gas mileage of cars from some other fleet? When you collect samples of data, you can compare the average values of the samples, and arrive at some inference about the population from this information. Since the quality of such inferences is affected by variability, we ought to be interested in how much data to actually collect.

## Hypothesis Tests

Statistically speaking, when we collect a small sample of data and calculate its standard deviation, our estimate may be noticeably larger or smaller than the actual standard deviation. The relationship between the sample size and the variability of a sample estimate is described by the term standard error. This standard error is an integral part of how a confidence interval is calculated for variable data. The difference between the “true” standard deviation of the population (if that can even be measured) and the sample standard deviation tends to be large for small samples, and shrinks as the sample size grows. There is an intuitive way to think about this: if you have more information, you can make a better guess at a characteristic of the population the information is coming from.

Let’s take another example: motor racing. Lap times are generally recorded with extremely high precision and accuracy. If we had a sample of three lap times and wanted to estimate lap times for a circuit, we’d probably do okay, but with a wide range of expected lap times. By contrast, if we had a large number of lap time records, we could calculate confidence intervals for the mean value far more precisely. We could even estimate the probability of a particular race car driver clocking a certain time, if we could identify the distribution that most closely models the data. In a hypothesis test, therefore, we construct a model of our data, and test a hypothesis based on that model. Naturally, there is a risk of going wrong with such an approach. We call these risks $\alpha$ and $\beta$ risks.
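The lap time intuition can be sketched with t-based confidence intervals in R; the lap times below are simulated, hypothetical values:

```r
# Confidence interval sketch: hypothetical lap times (seconds)
set.seed(7)
laps_small <- rnorm(3,  mean = 90, sd = 0.8)  # only three recorded laps
laps_large <- rnorm(50, mean = 90, sd = 0.8)  # many recorded laps

# t-based 95% confidence intervals for the mean lap time
t.test(laps_small)$conf.int  # wide interval: little information
t.test(laps_large)$conf.int  # much narrower interval
```

The small-sample interval is wide both because the standard error is larger and because the t quantile for 2 degrees of freedom is much larger than for 49.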

Hypothesis tests, alpha and beta risks

## Type I & Type II Errors and Power

To explain it simply, the chance of erroneously rejecting the null hypothesis $H_0$ is referred to as the Type I error (or $\alpha$). Similarly, the chance of erroneously accepting the null hypothesis (if the reality was different from what the null hypothesis stated) is called the Type II error (or $\beta$ risk). Naturally, we want both these errors to have very low probability in our experiments.

How do we determine if our statistical model is powerful enough, therefore, to avoid both kinds of risks? We use two statistical terms, significance (known here as $\alpha$, as in $\alpha$ risk), and power (known sometimes as $\pi$, but more commonly known as $1-\beta$, as we see from the illustration above).

It turns out that statistical power is heavily dependent on the sample size you used to collect your data. If you used a small sample size, the ability of your test to detect a certain difference (such as 10 milliseconds of lap time, or 1 mile per gallon of fuel efficiency) is greatly diminished. In truth, you may receive a result whose p-value (also discussed in an earlier post) is greater than the significance. Remember that this is now a straightforward comparison between the p-value of the test and what we know as $\alpha$. However, note how in interpreting the results of our hypothesis test, we didn’t yet consider $1-\beta$. Technically, we should have. This is what causes so many spurious results: underpowered tests frequently miss real effects (false negatives), and when they do return significant results, those results tend to overstate the true effect size, leading to truth inflation.

Very often in data-driven businesses, the question of “how many samples is good enough” arises – and usually such discussions end with “as many as we can”. In truth, the process of determining how much data of a certain kind to collect, isn’t easy. Going back-and-forth to collect samples of data in order to do your hypothesis tests is helpful – primarily because you can see the effects of sample size in your specific problem, practically.
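Base R’s power.t.test() makes the sample size question concrete. A sketch, using the fuel efficiency example with assumed numbers (a 1 mpg difference to detect, and a standard deviation of 2 mpg):

```r
# Power analysis sketch with base R: how many samples per group
# to detect a 1 mpg difference, given sd = 2 mpg,
# at alpha = 0.05 and power = 0.8?
result <- power.t.test(delta = 1, sd = 2, sig.level = 0.05, power = 0.8)
ceiling(result$n)  # required sample size per group

# Conversely: what power does n = 20 per group give us?
power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05)$power
```

Playing with delta, sd and n in calls like these is a far better answer to “how many samples is good enough” than “as many as we can”.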

## A Note on Big Data

Big Data promises us what a lot of statisticians didn’t have in the past: the opportunity to analyze population data for a wide variety of problems. Big Data is naturally exciting for those who already have the requisite infrastructure to collect, store and analyze such data. While Big Data security is an as yet incompletely answered question, especially in the context of user data and personally identifiable information, there is a push to collect such data, and it is highly likely that ethical questions will need to be answered by many social media and email providers on the web that also double up as social networks. Where does this fit in? When building such systems, we have neither the trust of a large number of people, nor the information we require – which could be in the form of demographic information, personal interests, relationships, work histories, and more. In the absence of such readily available information, companies that have to build systems that handle this kind of information well must experiment with smaller samples of data. Naturally, the statistical ideas mentioned in this post will be relevant in such contexts.

## Summary

1. Power: The power of a test is defined as the ability to correctly reject the null hypothesis of a test. As we’ve described it above, it is defined by $1-\beta$, where $\beta$ is the chance of incorrectly accepting the default or null hypothesis $H_0$.
2. Standard deviation ($\sigma$ ): The more variation we observe in any given process, the greater our target sample size should be, for achieving the same power, and if we’re detecting the same difference in performance, statistically. You could say that the sample size to be collected depends directly on the variability observed in the data. Even small differences in the $\sigma$ can affect the number of data points we need to collect to arrive at a result with sufficient power.
3. Significance ($\alpha$ ): As discussed in the earlier post on normality tests, significance of a result can have an impact on how much data we need to collect. If we have a wider margin for error, or greater margin for error in our decisions, we ought to settle for a larger significance value, perhaps of 10% or 15%. The de-facto norm in many industries and processes (for better or for worse, usually for worse) is to use $\alpha = 0.05$. While this isn’t always a bad thing, it encourages blind adherence and myths to propagate, and prevents statisticians from thinking about the consequences of their assumptions. In reality, $\alpha$ values of 0.01 and even 0.001 may be required, depending on how certain we want to be about the results.
4. Sample size ($n$): The greater the sample size of the data you’re using for a given hypothesis test, the greater the power of that test (that is, the greater its ability to detect a real difference when one exists, and to avoid false negatives).
5. Difference ($\Delta$): The greater the difference you want to be able to detect between two sets of data (proportions, means or medians), the smaller the sample size you need. This is intuitively easy to understand – like testing a Humvee against a Ford Focus for gas mileage: you need only a few trips (a small sample size) to tell there’s a real difference, as opposed to testing two similar compact cars against each other, when you may require a more rigorous testing approach.