Big Data: Size and Velocity

One of the shifts envisioned in the big data space is the need for data that isn’t so much big in volume as big in relevance. Perhaps this is the crucial distinction to make. Here, we examine business manifestations of relevant data, as opposed to merely large volumes of data.

What Managers Want From Data

It is easier to ask managers and executives “what do you want from your data?” than it is to answer that question as one. As someone who has worked in Fortune 500 companies with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data, even if they talk about wanting to use data for decision making. They want to see interpretations of the data that help them make up their minds and take decisions.
  2. Decision making is neither routine nor based on a single variable of interest.
  3. Decision making involves more than just operational data descriptors (which are most often instrumented for collection from data sources).
  4. Decisions can sometimes be taken on the basis of uncertain estimates, but many situations do require accurate estimates of results to drive decision making.

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, which means collecting more data and iterating on one’s results and hypotheses. Broadly, the process of getting insights from data may involve data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean that the data scientist’s job is done once the insights are generated: there is a need to collect more data, refine the models we’ve built, and construct better views of the problem landscape.

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality and the data’s readiness for analysis. On data quality specifically, a few things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable being measured. This involves linearity, stability, bias, range, sensitivity and other parameters of the measurement system.
  2. Latency aspects – whether the measured data is recorded and logged in the correct time sequence and at the right intervals.
  3. Missing and anomalous values – missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject. (A quick check for points 2 and 3 is sketched below.)
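Here is a minimal sketch of such a check, assuming a hypothetical pandas DataFrame with ‘timestamp’ and ‘value’ columns; the function name, expected interval and z-score threshold are illustrative choices, not prescriptions.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame,
                         expected_interval: str = "1min",
                         z_threshold: float = 4.0) -> dict:
    report = {}

    # Latency / sequencing: records out of time order, or gaps larger
    # than the expected logging interval.
    ts = pd.to_datetime(df["timestamp"])
    report["out_of_order_records"] = int((ts.diff() < pd.Timedelta(0)).sum())
    gaps = ts.sort_values().diff()
    report["oversized_gaps"] = int((gaps > pd.Timedelta(expected_interval)).sum())

    # Missing readings.
    report["missing_values"] = int(df["value"].isna().sum())

    # Crude flag for anomalous readings (not anomalous behaviour, which
    # is a modelling question in its own right).
    z = (df["value"] - df["value"].mean()) / df["value"].std()
    report["anomalous_readings"] = int((z.abs() > z_threshold).sum())

    return report
```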

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is limited by the slowest link in the chain, and traditionally that slowest link has been the collection of the data itself. This is changing: sensors in IoT settings now deliver huge data sets and massive streams of data, and data processing frameworks have improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. In such a situation, the bottleneck is no longer the acquisition of data. Indeed, the availability of massive data lakes full of data itself signals the need for more and more data scientists who can analyse this data and arrive at insights. It is in this context that the rapid creation of models, the rapid extraction of insights from data, and the creation of meta-algorithms that do such work become valuable.
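As a hedged illustration of this kind of fast, at-scale processing, here is a minimal PySpark sketch that summarizes a large sensor data set per sensor; the file name (‘sensor_readings.parquet’) and column names are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-summary").getOrCreate()

# Hypothetical columns: sensor_id, timestamp, value.
readings = spark.read.parquet("sensor_readings.parquet")

# Summarize each sensor's stream in one distributed pass.
summary = (
    readings
    .groupBy("sensor_id")
    .agg(
        F.count("value").alias("n_readings"),
        F.mean("value").alias("mean_value"),
        F.stddev("value").alias("std_value"),
    )
)

summary.show()
```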

Where Size Does Matter

Some problems do require very large data sets; many systems gain their effectiveness only at scale. One such example is the commonly found collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets: small data sets are prone to truth inflation (exaggerated effect sizes) and to poor analysis results stemming from poor data collection. In such cases, there is no respite other than to collect, store and analyze large data sets.
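To make the example concrete, here is a toy sketch of item-item collaborative filtering on a tiny, hypothetical user-item rating matrix. In practice the matrix is enormous and sparse, which is precisely why scale matters for this class of system.

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated". Values are made up.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score unrated items for user 0 as a similarity-weighted sum of their ratings.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf          # do not re-recommend already-rated items
print("recommend item:", int(np.argmax(scores)))
```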

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve – whether volume or relevance matters more in a data set. Relevant data meets a number of the criteria listed above, whereas data measured purely by volume (petabytes or exabytes) tells us nothing about its quality or its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes data relevant for analysis. Unsurprisingly, for some applications, such as neural network training or data science on high-frequency time series, data volume is undeniably useful. More data in these cases means that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested on different representative samples of the data, as the sketch below illustrates.
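The toy example that follows is purely illustrative: with enough rows, we can carve the data into several disjoint, representative partitions and study each on its own. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1_000_000                      # plenty of data to partition
data = rng.normal(loc=10.0, scale=2.0, size=n_rows)

n_partitions = 5
shuffled = rng.permutation(data)
partitions = np.array_split(shuffled, n_partitions)

# Each partition is large enough to stand on its own as a sample.
for i, part in enumerate(partitions):
    print(f"partition {i}: n={len(part)}, mean={part.mean():.3f}, std={part.std():.3f}")
```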

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety; some add a fourth characteristic, the veracity of the data. Overall, the availability of relevant data in sufficient volume should address the data scientist’s needs for model building and data exploration. The question of variety remains, but as data profiling approaches mature, data wrangling will advance to the point where variety is not a hindrance but a genuine asset. The volume and velocity requirements, however, can be traded off in a large percentage of cases. For most model building activities, such as linear regression models or classification models where we know the characteristics or behavior of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
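As a hedged illustration of the “small data” point, the sketch below (entirely synthetic data) fits a simple linear regression on a large data set and on a small representative sample drawn from it, and recovers nearly the same coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_full = rng.uniform(0, 100, size=(200_000, 1))
y_full = 3.5 * X_full[:, 0] + 12.0 + rng.normal(0, 5.0, size=200_000)

full_fit = LinearRegression().fit(X_full, y_full)

# A small but representative random sample ("small data").
idx = rng.choice(len(X_full), size=500, replace=False)
small_fit = LinearRegression().fit(X_full[idx], y_full[idx])

print("full  :", full_fit.coef_[0], full_fit.intercept_)
print("small :", small_fit.coef_[0], small_fit.intercept_)
```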

Challenges of Effective Measurement

Introduction

Effective measurement is as important in the data science revolution as effective analysis. Without correctly measured data, we fly blind into data analysis, and such a scenario can hardly be effective at extracting insight from the data we possess. In this post, I discuss some challenges facing effective measurement in the context of data science and the Internet of Things. Rather than address specific technical aspects, this is a reflective post that engages with the key questions that arise around data from diverse sources, unprocessed and processed data, and the importance of measurement systems analysis to data science teams and to Internet-of-Things system builders and integrators.

Data Before Software and Algorithms

While a lot of discussion and debate rages on about which algorithm to use, which language is better for data science, or indeed which distributed computing framework to use for data processing and machine learning, the effective and accurate measurement of the data itself is in some cases an unsolved problem. The Internet of Things (IoT) revolution will bring with it the need to integrate hundreds of sensors into the devices around us, and the consequent sensor data complexity will necessitate systematic methods of processing measurements and storing them at large volumes. While databases and messaging engines that transfer data have kept up with needs in this space (and continue to innovate), there is a need to better integrate measurement system analysis routines, error modeling methods, and calibration methods into sensors themselves. Perhaps this will be accomplished in the sensor architectures of the future.

The Value of Measurement Error Models

Measurement error models remain a key outstanding problem for calibration activities, measurement system design, embedded measurement system integration, and measurement management. While contemporary approaches such as variance models (a la the Type A and Type B uncertainty estimation methods) exist, architectures that combine many different kinds of measurement sensors are not often found. Such architectures would have to combine correlation and cross-correlation analyses, integrate distribution models of errors, and provide for sensor state memory (metadata about sensor measurements) alongside the collection and processing of the actual measurements.
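As a minimal sketch of the Type A and Type B building blocks mentioned above, the example below estimates a Type A uncertainty from repeated readings, a Type B uncertainty from a hypothetical instrument specification, and combines them by root sum of squares. The readings and the specification half-width are made up for illustration.

```python
import numpy as np

readings = np.array([20.11, 20.14, 20.09, 20.12, 20.10, 20.13])  # repeated measurements

# Type A: statistical evaluation from repeated observations --
# the standard uncertainty of the mean.
u_type_a = readings.std(ddof=1) / np.sqrt(len(readings))

# Type B: from other knowledge, e.g. a spec sheet stating +/-0.05 units,
# with a uniform (rectangular) distribution assumed.
spec_half_width = 0.05
u_type_b = spec_half_width / np.sqrt(3)

# Combined standard uncertainty for uncorrelated components.
u_combined = np.sqrt(u_type_a**2 + u_type_b**2)

print(f"Type A: {u_type_a:.4f}, Type B: {u_type_b:.4f}, combined: {u_combined:.4f}")
```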

While direct measurement sensors address singular operational definitions at massive scale, the trend is definitely towards sensor architectures that consider sensor fusion approaches. Static and dynamic characteristics of these sensor systems then become of paramount importance, since understanding and modeling them (and new phenomena in this space) will be central to accomplishing more from both direct-measurement sensors and sensor fusion arrays. Derived measures, their meaning in the context of complex signal interactions, and related effects are sure to play a role in the definition of sensor architectures.

Addressing Measurement Uncertainty

The more we peer into the source of our data and the means we have used to collect it, the more we have to examine its sources of error. Traditionally acknowledged sources of static measurement uncertainty are:

  1. Measurement bias uncertainty
  2. Precision uncertainty
  3. Resolution uncertainty
  4. Environmental or noise factor induced uncertainty

Additionally, there are sources of dynamic uncertainty, some of which may be:

  1. Stability and absolute sensitivity
  2. Dynamic range and dynamic sensitivity
  3. Hysteresis behavior of sensors and measurement systems

Without effectively addressing (or at least quantifying) such sources of error and measurement uncertainty, it is probably unwise to use advanced algorithms that paint broad brush strokes about problems. To these we can add a class of computationally induced uncertainty, which sits further from the measurement layer in such architectures:

  1. Database updates and null values
  2. Time lags in updates, mismatches between sensor time and computer time
  3. Memory latency and the associated lags
  4. Interpolation, computation and round-off errors

Interestingly enough, many of these errors cannot be addressed directly, since we use digital devices that are built to a certain architecture. It is perhaps impossible to do away with aspects such as memory latency, or the computational latencies that cause data to be processed or stored with a delay. As processor and memory chips improve in computational capability (despite warnings that Moore’s law is slowing down), these lags are bound to become less and less significant. While such latencies, biases and errors may not be significant individually or in small data sets, over time the cost of decisions made with poorly measured and poorly processed data is high.
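One of the computational sources above, the mismatch between sensor time and computer time, can at least be quantified even when it cannot be eliminated. The sketch below uses synthetic sensor and server timestamps (a hypothetical fixed offset plus jitter) to summarize the ingestion lag.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1_000

# Hypothetical sensor-reported timestamps, one reading per second.
sensor_time = pd.date_range("2024-01-01", periods=n, freq="s")

# Hypothetical ingestion delay: a fixed clock offset plus pipeline jitter.
delay_s = 0.250 + rng.exponential(scale=0.040, size=n)
server_time = sensor_time + pd.to_timedelta(delay_s, unit="s")

lag = (server_time - sensor_time).total_seconds()
print(f"median lag: {np.median(lag) * 1000:.1f} ms, "
      f"95th percentile: {np.percentile(lag, 95) * 1000:.1f} ms")
```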

Measurement Error and Uncertainty Characterization

Characterization of measurement error and uncertainty is the process of analyzing measurements from a process or sensor to understand:

  1. The measurement system’s contribution to errors or variations
  2. The contribution of the object being measured, to observed variations

The ISO GUM (Guide to the Expression of Uncertainty in Measurement) specifies a set of approaches that are widely followed by organizations whose best interests lie in effective measurement error characterization (think NASA and many aerospace corporations, which are likely to rely on fine-grained measurements in their design, testing and manufacturing processes). Historically, approaches like Gage Analysis (ANOVA-based Gage R&R) have been recommended by manufacturing process and quality practitioners whose job it frequently was to deal with data and measurements. However, a more comprehensive, systems-based approach may be due, since the process for such analysis is sometimes based on archaic process definitions and expectations, out of touch with the reality of increased manufacturing automation.
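The sketch below is a rough, method-of-moments style decomposition in the spirit of such studies, not a full ANOVA-based Gage R&R. It assumes a hypothetical long-format table with ‘part’, ‘operator’ and ‘measurement’ columns from a balanced, crossed study; the function name and the example data are invented for illustration.

```python
import pandas as pd

def variance_components(df: pd.DataFrame) -> dict:
    # Repeatability: average variance of repeated measurements taken on
    # the same part by the same operator (the measurement system's noise).
    cell_var = df.groupby(["part", "operator"])["measurement"].var()
    repeatability = cell_var.mean()

    # Reproducibility: spread between operator averages, corrected for the
    # repeatability noise that leaks into those averages.
    op_means = df.groupby("operator")["measurement"].mean()
    n_per_operator = len(df) / df["operator"].nunique()
    reproducibility = max(op_means.var() - repeatability / n_per_operator, 0.0)

    # Part-to-part variation: spread between part averages, corrected similarly.
    part_means = df.groupby("part")["measurement"].mean()
    n_per_part = len(df) / df["part"].nunique()
    part_to_part = max(part_means.var() - repeatability / n_per_part, 0.0)

    return {
        "repeatability": repeatability,
        "reproducibility": reproducibility,
        "part_to_part": part_to_part,
    }

# Example usage with tiny synthetic data (3 parts x 2 operators x 2 repeats).
example = pd.DataFrame({
    "part":        [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "operator":    ["A", "A", "B", "B"] * 3,
    "measurement": [10.1, 10.2, 10.4, 10.3, 12.0, 11.9, 12.2, 12.1, 9.5, 9.6, 9.8, 9.7],
})
print(variance_components(example))
```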

The Importance of Time

Finally, it is impossible to have effective measurement devices of any kind (let alone massive sensor arrays or sensor fusion arrays) without a system of analysis that considers the time element. Time series views of the data have to prevail over the i.i.d. view of data analysis that is often practiced in industry and academia. This means moving away from some frequentist views of the world and embracing Bayesian approaches, so that we reason with the data itself rather than only with theories about the data.
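A hedged sketch of this time-ordered, Bayesian view: sequentially updating a belief about a sensor’s true level as each reading arrives, rather than treating the readings as a single i.i.d. batch. The prior, the noise level and the readings are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
true_level, noise_sd = 50.0, 2.0
stream = true_level + rng.normal(0, noise_sd, size=20)   # readings in time order

mu, var = 40.0, 25.0          # prior belief: Normal(40, 5^2)
for t, y in enumerate(stream, start=1):
    # Conjugate normal-normal update with known noise variance.
    post_var = 1.0 / (1.0 / var + 1.0 / noise_sd**2)
    mu = post_var * (mu / var + y / noise_sd**2)
    var = post_var
    print(f"t={t:02d}  reading={y:6.2f}  posterior mean={mu:6.2f}  sd={np.sqrt(var):.3f}")
```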

The Importance of Domain

I’m a fan of domain knowledge in data science and make no secret of the fact – primarily because domain knowledge and experience help us make sense of data analysis. If data analysis is the process of reasoning with data effectively, domain knowledge is what sustains sane interpretations of that analysis. Domain knowledge is important in measurement because setting up measurement systems is a highly domain-specific task. Managing process and product measurement is a whole different and complex topic, and perhaps warrants a separate post.

Concluding Remarks

Measurement is an oft-overlooked area of data analysis, because data scientists and analysts often like to get right to the analysis and the insights. However, it would do business analysts, machine learning engineers and serious data scientists a world of good to take a close look at measurement systems and how they measure, capture and process data from their sources. Not only are measurement system analysis, measurement error characterization and measurement uncertainty characterization central to the process of collecting data, but they also indirectly affect the results we deliver to our organizations and stakeholders as data scientists.