Big Data: Size and Velocity

One change under way in the big data space is a growing need for data that is big in relevance rather than merely big in volume. This is a crucial distinction to make. Here, we examine how relevant data, as opposed to just large volumes of data, shows up in business.

What Managers Want From Data

It is easier to ask managers and executives “what do you want from your data?” than to answer that question as one of them. As someone who has worked in Fortune 500s with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data even if they talk about wanting to use data for decision making. They instead want to see interpretations of data that help them make up their minds and take decisions.
  2. Decision making is not monotonous or based on single variables of interest.
  3. Decision making involves more than operational data descriptors (which are most often the ones instrumented for collection from data sources).
  4. Decisions can be taken based on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, and this means collecting more data, and iterating on one’s results and hypotheses. Broadly, the process of getting insights from data may involve data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean that the data scientist’s job is done once the insights are generated. There is a need to collect more data and refine the models we’ve built, and construct better views of the problem landscape.

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality, and the data’s readiness for analysis. Specific to data quality, some things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable which was measured. This in turn involves other aspects such as linearity, stability, bias, range, sensitivity and other parameters of the measurement system
  2. Latency aspects – whether the measured data in time sequence is recorded and logged in the correct sequence and at the right intervals
  3. Missing and anomalous values – these are missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject.
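The latency and missing/anomalous value checks in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a complete data quality framework: the function name `check_readings`, the sample readings, the expected 10-second interval and the median-absolute-deviation threshold are all assumptions made for the example.

```python
from statistics import median


def check_readings(readings, expected_interval, tolerance, mad_threshold=5.0):
    """Flag basic quality problems in a list of (timestamp, value) readings.

    `readings` is in logged order; a value of None marks a missing reading.
    Returns a dict mapping each problem type to the list of affected indices.
    """
    issues = {"out_of_order": [], "bad_interval": [], "missing": [], "anomalous": []}

    # Latency aspects: is each reading later than the previous one,
    # and roughly one expected interval after it?
    for i in range(1, len(readings)):
        gap = readings[i][0] - readings[i - 1][0]
        if gap <= 0:
            issues["out_of_order"].append(i)
        elif abs(gap - expected_interval) > tolerance:
            issues["bad_interval"].append(i)

    # Missing values.
    issues["missing"] = [i for i, (_, v) in enumerate(readings) if v is None]

    # Anomalous values: far from the median, measured in units of the
    # median absolute deviation (MAD). The threshold is an assumed setting.
    present = [v for _, v in readings if v is not None]
    if present:
        center = median(present)
        mad = median([abs(v - center) for v in present])
        if mad > 0:
            issues["anomalous"] = [
                i for i, (_, v) in enumerate(readings)
                if v is not None and abs(v - center) / mad > mad_threshold
            ]
    return issues


# Hypothetical sensor log: (seconds, temperature)
readings = [(0, 20.1), (10, 20.3), (20, None), (30, 20.2),
            (25, 20.4), (40, 95.0), (50, 20.0)]
issues = check_readings(readings, expected_interval=10, tolerance=1)
print(issues)
```

Measurement aspects such as bias, linearity and stability are not covered here; they need reference measurements and cannot be inferred from the logged data alone.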

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain. Traditionally, this slowest link has been the collection of the data itself. However, this is changing, with sensors in IoT settings delivering huge data sets and massive streams of data. Data processing frameworks have also improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. In such a situation, the bottleneck is no longer the acquisition of data. Indeed, the availability of massive data lakes holding large amounts of data signals the need for more data scientists who can analyse this data and arrive at insights from it. It is in this context that the rapid creation of models, the rapid analysis of data for insights, and the creation of meta-algorithms that do such work are valuable.

Where Size Does Matter

There are some problems which do require very large data sets. There are many examples of systems that gain effectiveness only with scale. One such example is the commonly found collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets. Small data sets are prone to truth inflation, and to poor analysis results that stem from poor data collection. In such cases, there is no alternative but to ensure we collect, store and analyze large data sets.
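Truth inflation is easy to see in a simulation. The sketch below assumes a small true effect (0.2 standard deviations), normally distributed observations with a known standard deviation of 1, and a two-sided z-test at roughly the 5% level; all of these numbers are assumptions chosen for illustration. It keeps only the "statistically significant" estimates, as a reader of published results would see them, and compares small samples with large ones.

```python
import random
from statistics import mean


def significant_estimates(true_effect, n, trials, rng):
    """Return the sample means that pass a two-sided z-test at about the 5% level.

    Each trial draws n observations from a normal distribution with mean
    `true_effect` and standard deviation 1 (assumed known).
    """
    threshold = 1.96 / n ** 0.5  # |sample mean| must exceed this to count as significant
    kept = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_effect, 1.0) for _ in range(n)) / n
        if abs(sample_mean) > threshold:
            kept.append(sample_mean)
    return kept


rng = random.Random(42)
TRUE_EFFECT = 0.2

small = significant_estimates(TRUE_EFFECT, n=10, trials=2000, rng=rng)
large = significant_estimates(TRUE_EFFECT, n=400, trials=2000, rng=rng)

print(f"true effect:                        {TRUE_EFFECT}")
print(f"n=10:  {len(small)} significant, mean estimate {mean(small):.2f}")
print(f"n=400: {len(large)} significant, mean estimate {mean(large):.2f}")
```

With small samples, only the runs that happened to overshoot clear the significance bar, so the surviving estimates are inflated well above the true effect. With large samples, almost every run is significant and the surviving estimates sit close to the truth.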

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve – whether volume is more important in data sets, or whether relevance is. Relevant data is simply data that meets a number of the criteria listed above, whereas a data set characterised purely by its volume (petabytes or exabytes) tells us nothing about its quality or its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes the data relevant for analysis. Unsurprisingly, for some applications such as neural network training, data science on time series data sets of high frequency, etc., data volume is undeniably useful. More data in these cases implies that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested out on different representative sample sets of the data.
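One common way of taking "more partitions or subsets of the data" is a k-fold split, in which every row is used for testing exactly once. The sketch below is a plain-Python illustration; the function name `k_fold_indices` and the choice of 10 rows and 5 folds are assumptions for the example.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


folds = list(k_fold_indices(n_rows=10, k=5))
for train, test in folds:
    print("train:", sorted(train), "test:", sorted(test))
```

With only 10 rows, each test fold holds two observations, which is too few to evaluate anything. The same split on a large data set gives test folds that are big enough to be representative, which is the sense in which volume makes more experiments possible.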

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety. Some discuss a fourth characteristic – the veracity of the data. Overall, the availability of relevant data in sufficient volumes should be able to address the needs of the data scientist for model building and for data exploration. The question of variety still remains, and as data profiling approaches mature, data wrangling will advance to a point where this variety is no longer a problem but a genuine asset. However, there is definitely scope for a trade-off on the volume and velocity points in a large percentage of cases. For most model building activities, such as linear regression models, or classification models where we know the characteristics or behavior of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
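The claim about "small data" and linear regression can be checked with a short experiment. The sketch below generates a hypothetical large data set from an assumed relationship (y = 3 + 2x plus noise), fits a line to all of it, and then fits the same model to a simple random sample of 50 rows. The data, the sample size and the helper `fit_line` are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(7)

# Hypothetical large data set: y = 3 + 2x with unit-variance noise.
xs = [rng.uniform(0, 10) for _ in range(20000)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]
full_slope, full_intercept = fit_line(xs, ys)

# A small but representative sample: 50 rows drawn uniformly at random.
sample_idx = rng.sample(range(len(xs)), 50)
small_slope, small_intercept = fit_line(
    [xs[i] for i in sample_idx], [ys[i] for i in sample_idx]
)

print(f"all 20000 rows: slope {full_slope:.2f}, intercept {full_intercept:.2f}")
print(f"50-row sample:  slope {small_slope:.2f}, intercept {small_intercept:.2f}")
```

The sample works here because it is drawn at random across the whole range of x. A sample of the same size taken from a narrow slice of the data would not be representative, and its estimates would be far less reliable.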

Data Perspectives: “Orbiting The Giant Hairball”

This may sound weird, but one sure way to not have perspective about the business in an innovative and constantly changing industry is to bury yourself within regular work. This is the meaning of the title – which comes from a book of the same name.

By regular work, I mean work in which you execute tasks with a view to minimizing variability and producing standard results. This is as opposed to innovative work, which, as Bob Sutton explains in his lectures, is characterised by an increase of variability to the point of failure. Failure and validated learning are essential aspects of the learning experience in any job, to extend a metaphor from Eric Ries’ book The Lean Startup.

Data science and data engineering are the truly cross-functional and cross-industry work areas within the analytics revolution that is under way right now. There are a number of business perspectives that are relevant in one industry, which can also be applied to another. Indeed, work in some industries can anticipate very closely the needs of another.

Data scientists should keep one eye on the business, or to be true to the metaphor here, should occasionally “dive into the hairball” of business and routine work, to get a glimpse of what’s happening in the world of work. The data perspectives that they bring to that conversation will then become as important as the perspectives they develop due to such experiences. Seasoned professionals and consultants in the data analytics industry may have unconsciously or consciously developed their cross-functional and cross-industry experience over years. But it is probably fitting for younger data professionals – and there are many of them out there – to occasionally “dive into the hairball from orbit” and understand the challenges of data for those in various walks of business.

Data and Strategy for Small and Medium Organizations

Data analytics and statistics haven’t historically been associated with the strategic decisions that leaders take in small and medium sized businesses. Data analytics has for some years been used in larger organizations, and organizations with larger user bases are also benefiting from it, thanks to the use of big data to drive consumer and business insight in decision making. However, even small and medium sized businesses can benefit from the large volumes of data that are being collected, including from public databases. Most decisions in traditional businesses and in small and medium businesses are still taken by leaders who at best have a pulse of the market and domain knowledge of the business they’re in, but aren’t using the data at their disposal to create mathematical models and strategies derived from them.

When does data fit into strategy?

To answer this, we may need to understand the purpose of strategy and strategic initiatives themselves. In small and medium organizations, the purpose of strategic initiatives, especially the mid- and short-term strategies, is to enable growth. Larger organizations have the benefit of extensive user bases, consumer bases or resources, which they can use to develop, test, validate and release new products and services. Small and medium sized organizations, however, lean on such mid- and short-term initiatives because their focus tends to be limited to the near term and to maintaining good financial performance. Small and medium organizations in modern economies will also seek to maintain leverage and a consumer base that is dedicated and loyal to their product or service journey. The latter is especially true of niche product companies, because they sell lifestyles, and not merely products.

In this context, data fits into strategy in the following key ways:

  1. Descriptive data analytics allows strategists and leaders to question underlying assumptions of existing strategies
  2. Data visualizations allow strategists to classify and rank opportunities and have more cost and time efficient strategies
  3. Inferential data analytics, predictive analytics and simulations allow strategists to play out scenarios, and take a peek into the future of the business

Descriptive data analytics may work with public data, or data already available with the organization. It could be composed of statistical reports, illustrating the growth in demand, or market size, or certain broader trends and patterns in consumption, or demand, for a certain product, service, or opportunity. Descriptive analytics is easy enough to do, and usually doesn’t involve complex modeling. It is a good entry point for strategists who hope to become more data driven in the development of their strategies.
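As a small illustration of how little machinery descriptive analytics needs, the sketch below computes year-over-year growth and the compound annual growth rate (CAGR) for a demand series. The demand figures are made up for the example, and the function names are my own.

```python
def year_over_year_growth(values):
    """Fractional change between each consecutive pair of values."""
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]


def cagr(values):
    """Compound annual growth rate between the first and last value."""
    periods = len(values) - 1
    return (values[-1] / values[0]) ** (1 / periods) - 1


# Hypothetical yearly demand for a product, in units sold.
demand_by_year = {2019: 1000, 2020: 1100, 2021: 1210, 2022: 1331}

years = sorted(demand_by_year)
values = [demand_by_year[y] for y in years]
growth = year_over_year_growth(values)
overall = cagr(values)

for year, g in zip(years[1:], growth):
    print(f"{year}: {g:.1%} growth over the previous year")
print(f"CAGR {years[0]}-{years[-1]}: {overall:.1%}")
```

The same two functions apply equally to market size or consumption figures taken from a public statistical report.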

Data visualizations, in addition to being communication tools that provide strategists leverage, could also throw some light on the functional aspects of what opportunities to seek out, and what strategies to develop. They could also help strategists make connections and see relationships that would otherwise not have been apparent. Data visualization has been made easier and more affordable because of powerful and free software such as R and RStudio. Visualizations are extremely effective as communication and ideation tools. For strategists who look to mature beyond just using descriptive statistics in developing their strategies, visualizations can be valuable.

Inferential data analytics leverages the predictive power of mathematical and statistical models. By representing what is common knowledge as a mathematical model, we can apply it to diverse situations, and throw new light on problems that we haven’t evaluated before in a scientific or data driven manner. Inferential data analytics generally requires individuals with experience as data scientists. Building inferential models requires a good understanding of both basic and inferential statistics, and such models can therefore be more complex to incorporate into data based strategy models. While descriptive analytics and visualizations may not be driven by advanced machine learning algorithms such as neural networks, advanced and inferential analytics certainly can be.

Data for Short and Medium Term Strategy

Data analysis that informs short term strategy and data analysis that informs medium term strategy are fundamentally different. Short term strategy, which focuses on the immediate near term of a business, generally seeks to inform the operational teams on how they should act. This may be a set of simple rules, which are used to run the rudiments of the business on a day-to-day basis. Why use data to drive the regular activities of businesses for which extensive procedures may already be in place? Because keeping one’s ear to the ground – and collecting customer and market information on an ongoing basis – is extremely important for most businesses today in a competitive business world. Continual improvement and quality are fundamental and important to a wide variety of businesses, and data that informs the short term is therefore extremely important.

Data analytics in the short term relies less on extensive analysis than on keeping abreast of information, and of the trends and patterns we see in it, on a day-to-day basis. Approaches relevant to short term strategy may be:

  1. Dashboards and real time information streams
  2. Automatically generated reports that give operations leaders or general managers a pulse of the market, or a pulse of the business
  3. Sample data analysis (small data, as opposed to big data), that informs managers and teams about the ongoing status of a specific process or product – this is similar to quality management systems in use in various companies small and large
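The third approach, sample data analysis in the style of a quality management system, can be as simple as a control-limit check. The sketch below uses a plain "mean plus or minus three standard deviations" rule computed from a baseline sample; this is a simplification in the spirit of a control chart rather than a full statistical process control implementation, and the measurements are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits computed from a baseline sample."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(samples, lower, upper):
    """Indices of samples that fall outside the control limits."""
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical fill weights (grams) from a period when the process ran normally.
baseline = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.0]
lower, center, upper = control_limits(baseline)

# New daily samples to be checked against those limits.
todays_samples = [50.1, 49.9, 50.9, 50.0, 49.2]
flagged = out_of_control(todays_samples, lower, upper)

print(f"limits: {lower:.2f} to {upper:.2f} (center {center:.2f})")
print("samples needing attention:", flagged)
```

A check like this runs on a handful of measurements per day, which is why it suits the short term: it tells an operations team whether to act now, without any modeling effort.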

Data analytics in the mid term strategy space is quite a different situation, being required to inform strategists about the impact of changing market scenarios on a future product or service launch. The data analysis here should seek to serve the strategists’ need to be informed about served and total addressable markets, competitive space, penetration and market share expectations, and such business-specific criteria that help fund, finance or prioritize the development of new products or services.

Accordingly, data analytics in a mid-term strategy space (also called Horizon 2 strategy) may call for more involved analysis, typically by data scientists. Tools and themes of analysis may be things like:

  1. Consumer sentiment analysis to determine the relevance of a particular product or service
  2. Patents and intellectual property data munging, classification and text mining for category analysis
  3. Competitor analysis by automated searches, classification algorithms, risk analysis by dynamic analytical hierarchy processes
  4. Scenario analysis and simulation, driven by methods such as Markov Chain Monte Carlo analysis

Observe how the analyses above are distinct from the more ready information that’s shared with operational teams. The data analytics activities here generally require analysis of data in a rigorous manner, not merely the collection and presentation of available data that fit a certain definition. When data is unstructured and when data science requires the cleaning and visualization of data, the creation of models from a starting point, such as public data, is much more challenging. This is where the skills of well trained data scientists and data analysts are essential.
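To give a flavour of the scenario simulation item in the list above, here is a minimal sketch. Note that it is a plain Monte Carlo simulation over a small Markov chain of market states, which is simpler than a full Markov Chain Monte Carlo sampler. The states, the transition probabilities, the growth rates and the starting revenue of 100 are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical yearly transition probabilities between market states.
TRANSITIONS = {
    "growth":  {"growth": 0.6, "flat": 0.3, "decline": 0.1},
    "flat":    {"growth": 0.3, "flat": 0.5, "decline": 0.2},
    "decline": {"growth": 0.2, "flat": 0.4, "decline": 0.4},
}
# Hypothetical revenue growth rate associated with each state.
GROWTH_RATE = {"growth": 0.15, "flat": 0.02, "decline": -0.10}


def simulate_revenue(start_revenue, start_state, years, rng):
    """Play out one scenario: walk the Markov chain and compound revenue."""
    revenue, state = start_revenue, start_state
    for _ in range(years):
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        revenue *= 1 + GROWTH_RATE[state]
    return revenue


def scenario_percentiles(runs=5000, years=3, seed=1):
    """Return the 5th, 50th and 95th percentile revenue across many scenarios."""
    rng = random.Random(seed)
    outcomes = sorted(
        simulate_revenue(100.0, "flat", years, rng) for _ in range(runs)
    )

    def pick(q):
        return outcomes[int(q * (runs - 1))]

    return pick(0.05), pick(0.50), pick(0.95)


p5, p50, p95 = scenario_percentiles()
print(f"revenue after 3 years: pessimistic {p5:.1f}, median {p50:.1f}, optimistic {p95:.1f}")
```

The value for a strategist lies less in any single number than in the spread between the pessimistic and optimistic outcomes, which shows how sensitive a product launch is to the market assumptions fed into the model.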

Data for Long Term Strategy

One narrative that has made itself known through data in the world of business is that the long term, as it was traditionally known, is shrinking. Even S&P 500 companies are conspicuous these days by how short lived they are, and small and medium companies, therefore, are no exception. Successful tech companies boast product and service development cycles of a few months up to a year, and the technology world therefore becomes unrecognizable every few months, thanks to innovation. However, there is probably a method to even this madness. The scale and openness of access have made the consumer and end user powerful, and the consumer these days has opportunities to do things with free resources and tools that could only be imagined a few years ago.

In such longer term strategic planning, typically looking five years or more ahead, data informs strategists by helping them construct scenarios. Data analytics in scenario planning should account for the following:

  1. Dynamic trends in the increase of velocity in information/data being collected (Velocity, out of the four Vs)
  2. Dynamic changes in the type of information being collected (Variety, out of the four Vs)
  3. Dynamic changes in the reliability of information being collected (Veracity, out of the four Vs)

Volume, the other V out of the four Vs, is a static measure of the data being collected at specific points in time. The three trends above go beyond volume: they represent the growing speed, variety and unreliability of available data.

Data analysis of a simpler nature can be used for some of the analysis above, while for specific approaches such as scenario analysis, sophisticated mathematical models can be used. In small and medium organizations, where the focus is usually on the short term, and at best on the mid term, data analytics can help inform executives about the long term and keep that conversation going. It is easy in smaller organizations to fall into the trap of not preparing for the long term. In the mid and long term, more advanced methods can be used to guide and inform the organization’s vision.

Concluding Remarks

Data analytics as applied to strategy is not entirely new, with many mature organizations already working on it. For small and medium businesses, which are mushrooming in a big way around the developed and developing world these days, data analytics is a force multiplier for strategic decision making and for leaders. Data analytics can reveal information we have hitherto believed to be the preserve of large organizations that can collect data on an unprecedented scale and hire expert teams to analyze it. What makes analytics relevant to small and medium businesses today is that in our changing business landscape, we can expect analytics driven companies to respond in more agile ways to the needs of customers, and to excite customers in new ways that traditional, less agile and larger organizations are not likely to. The abundance of mature data analysis tools and approaches available, combined with public data, can therefore make leaders and strategists in small and medium organizations more competitive.