Big Data: Size and Velocity

One of the changes envisioned in the big data space is a shift toward data that isn’t so much big in volume as big in relevance. Perhaps this is a crucial distinction to make. Here, we examine business manifestations of relevant data, as opposed to merely large volumes of data.

What Managers Want From Data

It is easier to ask managers and executives the question “what do you want from your data?” than it is to answer it as one. As someone who has worked in Fortune 500 companies with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data, even if they talk about wanting to use data for decision making. They instead want to see interpretations of the data that help them make up their minds and take decisions.
  2. Decision making is neither routine nor based on a single variable of interest.
  3. Decision making involves more than operational data descriptors (which are what is most often instrumented for collection from data sources).
  4. Decisions can be taken on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making.

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, which means collecting more data and iterating on one’s results and hypotheses. Broadly, getting insights from data involves data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean the data scientist’s job is done once the insights are generated: there is a need to collect more data, refine the models we’ve built, and construct better views of the problem landscape.
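
To make the loop concrete, here is a minimal sketch of that iterate-until-the-insight-stabilizes cycle. The data is synthetic (scikit-learn’s make_regression stands in for the real source), and the stopping rule is an arbitrary assumption; the point is only the shape of the loop: gather a batch, analyse, and keep going while the result is still improving.

    # A minimal sketch of the iterative collect -> prepare -> model -> refine loop.
    # The data is synthetic; in practice each round would pull fresh records from
    # the source system rather than from make_regression().
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X_all, y_all = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)

    batch, n, prev_score = 500, 500, -np.inf
    while n <= len(X_all):
        X, y = X_all[:n], y_all[:n]                                      # "collect" another batch
        score = cross_val_score(LinearRegression(), X, y, cv=5).mean()   # analyse it
        print(f"{n:5d} rows -> mean CV R^2 = {score:.4f}")
        if score - prev_score < 1e-3:                                    # insight has stopped improving
            break
        prev_score, n = score, n + batch                                 # iterate: gather more data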

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality and its readiness for analysis. On data quality specifically, some things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable being measured. This in turn involves parameters of the measurement system such as linearity, stability, bias, range and sensitivity
  2. Latency aspects – whether the measured data in time sequence is recorded and logged in the correct sequence and at the right intervals
  3. Missing and anomalous values – missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject (a simple screening sketch follows this list)
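
As a concrete illustration of these three checks, here is a small pandas sketch that screens a batch of records for missing readings, out-of-range (anomalous) values, and out-of-sequence or irregular timestamps. The column names, the valid range and the expected one-minute interval are all illustrative assumptions.

    # A simple data-quality screen covering the three points above: missing values,
    # anomalous readings, and out-of-sequence / irregular timestamps.
    import pandas as pd

    df = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:03",
             "2024-01-01 00:02", "2024-01-01 00:05"]),
        "reading": [10.2, None, 10.4, 999.0, 10.1],
    })

    valid = df["reading"].between(0, 100)                    # assumed valid sensor range
    report = {
        "missing_readings": int(df["reading"].isna().sum()),
        "out_of_range": int((df["reading"].notna() & ~valid).sum()),
        "out_of_sequence": int((df["timestamp"].diff() < pd.Timedelta(0)).sum()),
        "irregular_intervals": int(
            (df["timestamp"].sort_values().diff().dropna() != pd.Timedelta(minutes=1)).sum()),
    }
    print(report)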

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain, and traditionally that slowest link has been the collection of the data itself. This is changing, however: data processing frameworks have improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge, and sensors in IoT settings now deliver huge data sets and massive streams of data in themselves. In such a situation, the bottleneck is no longer the acquisition of data. Indeed, the availability of massive data lakes holding large volumes of data itself signals the need for more and more data scientists who can analyse this data and arrive at insights from it. It is in this context that the rapid creation of models, the rapid analysis of data for insights, and the creation of meta-algorithms that do such work become valuable.
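
As one illustration of keeping processing speed in step with data arrival, here is a minimal PySpark Structured Streaming sketch. The built-in “rate” source stands in for an IoT sensor feed, and a one-minute windowed average stands in for whatever rolling aggregation the application actually needs.

    # A minimal Structured Streaming sketch: a synthetic stream is aggregated in
    # one-minute windows and printed to the console as it arrives.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, avg

    spark = SparkSession.builder.appName("fast-sensor-agg").getOrCreate()

    sensor_stream = (spark.readStream
                     .format("rate")                 # synthetic source: (timestamp, value)
                     .option("rowsPerSecond", 100)
                     .load())

    windowed = (sensor_stream
                .groupBy(window("timestamp", "1 minute"))
                .agg(avg("value").alias("avg_value")))

    query = (windowed.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination(30)                       # run for ~30 seconds in this sketch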

Where Size Does Matter

There are some problems that do require very large data sets; many systems gain effectiveness only with scale. One such example is the collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets: small data sets are prone to truth inflation, and poor data collection leads to poor analysis results. In such cases, there is no respite other than to collect, store and analyze large data sets.
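
To see why scale matters here, consider a toy item-item collaborative filter. With only four users and four items, the cosine similarities below are extremely noisy, which is precisely the small-data weakness such engines outgrow at scale. The ratings matrix is made up purely for illustration.

    # A toy item-item collaborative filter on an invented ratings matrix.
    import numpy as np

    # rows = users, columns = items, 0 = not rated
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    norms = np.linalg.norm(ratings, axis=0)
    item_sim = ratings.T @ ratings / np.outer(norms, norms)   # cosine similarity between items

    user = 0
    scores = ratings[user] @ item_sim                          # weight items by similarity to rated ones
    scores[ratings[user] > 0] = -np.inf                        # drop items the user already rated
    print("recommend item", int(np.argmax(scores)), "to user", user)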

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve: whether volume or relevance matters more in a data set. Relevant data meets a number of the criteria listed above, whereas data measured purely by volume (petabytes or exabytes) tells us nothing about its quality or its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may be part of what makes data relevant for analysis. Unsurprisingly, for some applications, such as neural network training or data science on high-frequency time series data sets, data volume is undeniably useful. More data in these cases means that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested on different representative samples of the data.
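
A small sketch of the point about partitions: with a large (here, synthetic) sample, we can carve out many disjoint subsets and check that an estimate is stable across them, something a small data set simply doesn’t allow.

    # Volume buys statistical room: split a large synthetic sample into 100
    # disjoint partitions and check how stable the estimate is across them.
    import numpy as np

    rng = np.random.default_rng(42)
    big_sample = rng.normal(loc=0.5, scale=2.0, size=1_000_000)

    partitions = np.array_split(big_sample, 100)     # 100 disjoint subsets of ~10k points each
    estimates = [p.mean() for p in partitions]

    print(f"estimate across partitions: {np.mean(estimates):.3f} "
          f"+/- {np.std(estimates):.3f}")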

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety; some add a fourth characteristic, the veracity of the data. Overall, the availability of relevant data in sufficient volumes should address the data scientist’s needs for model building and data exploration. The question of variety remains, but as data profiling approaches mature, data wrangling will advance to a point where variety isn’t a problem but a genuine asset. The volume and velocity requirements, however, can be traded off in a large percentage of cases. For most model building activities, such as linear regression or classification models where we know the characteristics or behaviour of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
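
The “small but representative” point can be illustrated with a quick sketch: on a synthetic regression problem, coefficients fitted on a random 1% sample land very close to those fitted on the full data set. The data and sample size are arbitrary choices for the illustration.

    # Compare a linear regression fitted on all the data with one fitted on a
    # random 1% sample of the same (synthetic) data.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=200_000, n_features=5, noise=5.0, random_state=1)

    full_fit = LinearRegression().fit(X, y)

    rng = np.random.default_rng(1)
    idx = rng.choice(len(X), size=2_000, replace=False)      # representative 1% sample
    small_fit = LinearRegression().fit(X[idx], y[idx])

    print("full-data coefficients :", np.round(full_fit.coef_, 2))
    print("1% sample coefficients :", np.round(small_fit.coef_, 2))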

Quality and the Data Lifecycle

The insights we get from data depend on the quality of the data itself; as the saying goes, “Garbage In, Garbage Out”. The volume of data doesn’t matter as much as its quality, and with the growing scale of big data implementations, data quality assurance is an increasingly important function in today’s Big Data arena. In this short post, I’ll discuss aspects of data quality as seen from a data life cycle perspective.

Data quality is applicable to all three key areas of the data life cycle: data collection, storage and use in analysis. In each of these contexts, data quality can take on a different meaning. The most widely recognized data management paradigms involve these three stages, sometimes broken down into a larger number of stages. How does data quality assurance figure in each of the three?

Data Quality In Data Collection

Data quality in the data collection part of the life cycle is concerned with the nature, kind and operational definition of the data collected. It is therefore concerned with the sensors or measurement systems that collect the data, their veracity, and their ability to monitor the source of the data as per requirements. Assessment criteria in a 20th-century industrial setting may have involved approaches like measurement system analysis; these days, more sophisticated logical validation approaches may be used as well. Methods of studying different aspects of the measurement system, such as stability, linearity and bias, may also be adopted. A wider discussion of data quality here will also include how the collected data is documented, whether manually or electronically, and how this is best rationalized.
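
As a sketch of what such logical validation might look like in code, the function below flags an incoming reading that falls outside an assumed valid range or drifts sharply from its recent history. The thresholds are illustrative stand-ins for proper range, bias and stability studies, not a full measurement system analysis.

    # A minimal logical-validation sketch for incoming readings.
    import statistics

    def validate_reading(value, history, lo=0.0, hi=100.0, max_drift=3.0):
        """Return a list of data-quality flags for one incoming reading."""
        flags = []
        if not (lo <= value <= hi):
            flags.append("out_of_range")                     # range / sensitivity check
        if len(history) >= 10:
            recent = history[-10:]
            if abs(value - statistics.mean(recent)) > max_drift * statistics.pstdev(recent):
                flags.append("possible_instability")         # crude stability / drift check
        return flags

    history = [20.1, 20.3, 19.9, 20.2, 20.0, 20.4, 19.8, 20.1, 20.2, 20.0]
    print(validate_reading(20.3, history))   # -> []
    print(validate_reading(57.0, history))   # -> ['possible_instability']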

Data Quality in Data Storage

Data storage is the second part of the data life cycle, where data is held on small and large servers alike; in the big data space, we may use frameworks like Hadoop and Spark to store and retrieve data from relational databases such as AS400 or Oracle, or from non-relational, Mongo-esque databases. When data is stored across different physical disks and locations, the integrity of the data and the sanity and rationalization of data management practices become important. Data quality here can be measured in the same ways we check database integrity: checks for data loss, data corruption and the like. It therefore encompasses physical checks for integrity (the quality of the hard drives or SSDs in question), the systems and processes that enable access to the data on an as-needed basis (power supply, bandwidth and other considerations), and the software aspects. The discussion should naturally also extend to the practices adopted at the data warehouse in question: maintenance of the servers, the kinds of commands run, and the kinds of administrative activities performed.
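
One simple, concrete integrity practice is to record a checksum when a file lands in storage and recompute it during later sweeps; a mismatch signals silent corruption or data loss. The sketch below assumes a placeholder file path and is only one of many possible checks.

    # Record a SHA-256 checksum at write time and verify it later.
    import hashlib
    from pathlib import Path

    def sha256_of(path):
        """Stream the file through SHA-256 so large files need not fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    stored_file = Path("example_partition.parquet")            # placeholder path
    if stored_file.exists():
        checksum_at_write = sha256_of(stored_file)             # record this in a manifest
        # ...later, on an integrity sweep:
        assert sha256_of(stored_file) == checksum_at_write, "possible corruption detected"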

Data Quality in Data Analysis

Data analysis, the end that justifies the means (of collecting and storing data), is arguably the most decisive and context-sensitive step of the whole data life cycle. Data quality in this context may be seen as the availability, from our databases, of the right data for the right analysis. While the analysis itself is decided by what the business requires and what the available data makes possible, the data that should be available in principle (or as per the organization’s SLAs) and the data that is actually available should tally. Data quality in an analysis context therefore becomes more specific and more focused, tailored to the analysis we’re interested in and to the insights we want to derive from the data.
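
A minimal sketch of that “should tally” check: compare the row counts the SLA says should be available per day against what the store actually returns. The expected counts and query results below are invented for illustration.

    # Reconcile SLA-promised availability against what the store actually holds.
    expected_rows_per_day = {"2024-06-01": 10_000, "2024-06-02": 10_000, "2024-06-03": 10_000}
    actual_rows_per_day   = {"2024-06-01": 10_000, "2024-06-02": 9_412,  "2024-06-03": 0}

    for day, expected in expected_rows_per_day.items():
        actual = actual_rows_per_day.get(day, 0)
        if actual < expected:
            shortfall = 100 * (expected - actual) / expected
            print(f"{day}: {actual} of {expected} rows available ({shortfall:.1f}% short)")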

In addition to the above, there is a need to reuse stored data and to review or revisit old data; some frameworks treat this as a separate function. In such situations, it may be wise to have data quality processes specific to these information/data retrieval steps. Usually, however, the same data quality criteria we used for data storage and databases should apply here too.