The “Jagged Edge” of Real Time Analytics

I recently came across a Hortonworks Data Flow presentation that discusses the concept of the jagged edge of real-time data analytics. For me, the most useful framing of this idea centres on prioritization: deciding what to do with what comes in through sensors and other “jagged edge” sources of big data. This is a pondering rather than a post with a specific agenda; I may add to it as responses come in, or as I learn more.

One of the key challenges for data centres in the world of the Internet of Things will be to provide relevant analytics on data regardless of its size, its point of origin, or the time at which it was generated. To do this well, I foresee a need not only for technologies like NiFi that can deliver low-latency updates as close to real time as possible, but also for technologies with sufficient intelligence built into the data sampling and data collection process itself.

The central philosophy is surprisingly old: W. Edwards Deming said that data are not collected for museum purposes, but for decision making. It is in this context that today’s data lakes risk becoming tomorrow’s data seas, unless we use intelligent prioritization to determine which data should be streamed and which data should merely be stored. The reality of decision making in such real-time analytics situations is that there is far more data available than any one decision needs, which reminds me of Barry Schwartz’s paradox of choice: too much choice can actually impede and delay decision making, rather than aid it.
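To make the stream-versus-store idea concrete, here is a minimal sketch of the kind of prioritization rule I have in mind. This is not NiFi or Spark code; the SensorReading structure, the priority field and the threshold are all hypothetical, and a real deployment would express this logic as routing inside a data flow tool rather than a standalone script.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    sensor_id: str
    value: float
    timestamp: float
    priority: int  # assigned at the edge, e.g. based on sensor criticality

STREAM_THRESHOLD = 7  # hypothetical cut-off: readings at or above this take the real-time path

def route(reading: SensorReading) -> str:
    """Decide whether a reading is streamed for immediate analysis or archived."""
    if reading.priority >= STREAM_THRESHOLD:
        return "stream"   # forward to the low-latency analytics pipeline
    return "store"        # land in the data lake for later batch analysis

# Example: only the critical reading is streamed, the other is stored
readings = [
    SensorReading("turbine-12", 98.6, 1_700_000_000.0, priority=9),
    SensorReading("turbine-13", 72.1, 1_700_000_001.0, priority=3),
]
for r in readings:
    print(r.sensor_id, "->", route(r))
```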

How do we ensure that approaches like data flow prioritization actually address these issues? How do we move away from static data lakes towards the more useful data streams from the jagged edge, which streaming technologies like Spark promise on established frameworks like Hadoop, without the risk of turning our lakes into seas we lack the insight to fully benefit from?

Some of the answers lie in the implementations of specific use cases, of course. There is no silver bullet, if you will. That said, what kinds of technologies can we foresee being developed around Hadoop, Spark and the other platforms that will heavily influence the Internet of Things revolution, to solve this prioritization conundrum?

Quality and the Data Lifecycle

The insights we get from data depend on the quality of the data itself; as the saying goes, “Garbage In, Garbage Out”. The volume of data matters less than its quality, and with the growing scale of big data implementations, data quality assurance is an increasingly important function in today’s big data world. In this short post, I’ll discuss aspects of data quality as seen from a data life cycle perspective.

Data quality applies to all three key areas of the data life cycle: data collection, storage, and use in analysis. In each of these contexts, data quality can take on a different meaning. The most widely recognized data management paradigms involve these three stages, sometimes broken down further. How does data quality assurance figure in each of them?

Data Quality In Data Collection

Data quality in the data collection part of the life cycle is concerned with the nature, kind and operational definition of the data being collected. It therefore concerns the sensors or measurement systems that collect the data, their veracity, and their ability to monitor the data source as required. In a 20th-century industrial setting, assessment may have relied on approaches like measurement system analysis; these days, more sophisticated logical validation approaches may be used, along with methods for studying different aspects of the measurement systems, such as stability, linearity and bias. A wider discussion of data quality here also includes how the collected data is documented, whether manually or electronically, and how that documentation is best rationalized.
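As a rough illustration of what such logical validation might look like, here is a minimal sketch of collection-time checks: value ranges, missing values and timestamp ordering. The field names and limits are hypothetical; in practice they would come from the operational definition of each measurement.

```python
def validate_readings(readings, lower=0.0, upper=150.0):
    """Flag collection-time quality problems: out-of-range values,
    missing values, and timestamps that go backwards."""
    issues = []
    last_ts = None
    for i, r in enumerate(readings):
        value, ts = r.get("value"), r.get("timestamp")
        if value is None:
            issues.append((i, "missing value"))
        elif not (lower <= value <= upper):
            issues.append((i, f"value {value} outside [{lower}, {upper}]"))
        if last_ts is not None and ts is not None and ts < last_ts:
            issues.append((i, "timestamp earlier than previous reading"))
        if ts is not None:
            last_ts = ts
    return issues

# Example: the second reading is out of range, the third goes back in time
batch = [
    {"value": 101.2, "timestamp": 100.0},
    {"value": 900.0, "timestamp": 101.0},
    {"value": 98.4, "timestamp": 99.5},
]
print(validate_readings(batch))
```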

Data Quality in Data Storage

Data storage is the second part of the data life cycle, where data sits on small and large servers alike; in the big data space, we may use approaches like Hadoop and Spark to store and retrieve data from relational databases such as generic AS400 or Oracle databases, or from non-relational, Mongo-esque databases. When data is stored across different physical disks and locations, the integrity of the data and the sanity and rationalization of data management practices become important. Data quality here can be measured in much the same way we check database integrity, through checks for data loss, data corruption and the like. It therefore encompasses physical checks for integrity (the quality of the hard drives or SSDs in question), the systems and processes that enable access to the data on an as-needed basis (power supply, bandwidth and other considerations), and the software aspects. The data quality discussion naturally also extends to the practices adopted at the data warehouse in question: the maintenance of the servers, the kinds of commands run, and the kinds of administrative activities performed.
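To give a flavour of such integrity checks, here is a minimal sketch of two of the simpler ones: a checksum for detecting corruption of a stored file, and a row-count reconciliation for detecting data loss or duplication. The paths and counts are placeholders, and a real warehouse would also rely on the integrity tooling of the database or file system itself.

```python
import hashlib

def file_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest so a stored file can be compared
    against the digest recorded when it was first written."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def reconcile_row_counts(source_count, stored_count):
    """A crude data-loss check: the number of rows landed in storage
    should tally with the number of rows the source system reported."""
    if stored_count < source_count:
        return f"possible data loss: {source_count - stored_count} rows missing"
    if stored_count > source_count:
        return f"possible duplication: {stored_count - source_count} extra rows"
    return "counts tally"

# Example usage (the path below is a placeholder):
# print(file_checksum("/data/warehouse/extract.csv"))
print(reconcile_row_counts(source_count=1_000_000, stored_count=999_874))
```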

Data Quality in Data Analysis

Data analysis, the end that justifies the means of collecting and storing data, is arguably the most decisive and context-sensitive step of the whole data life cycle. Data quality in this context may be seen as the availability, from our databases, of the right data for the right analysis. While the analysis itself is decided by what the business requires and what the available data makes possible, the data that should be available in principle (or as per the organization’s SLAs) and the data that is actually available should tally. Data quality in an analysis context therefore becomes more specific and focused, tailored to the analysis we’re interested in and to the insights we want to derive from the data.
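A minimal sketch of what such a tally might look like is below, assuming a hypothetical SLA of one daily extract per calendar day in the analysis window; the function names and dates are purely illustrative.

```python
from datetime import date, timedelta

def expected_days(start: date, end: date):
    """Per a hypothetical SLA, one daily extract is expected for every
    calendar day in the analysis window."""
    return {start + timedelta(days=i) for i in range((end - start).days + 1)}

def completeness_report(start: date, end: date, days_actually_loaded):
    """Compare what should be available for the analysis with what actually is."""
    expected = expected_days(start, end)
    missing = sorted(expected - set(days_actually_loaded))
    coverage = 1 - len(missing) / len(expected)
    return {"coverage": round(coverage, 3), "missing_days": missing}

# Example: two daily extracts never arrived, so the analysis starts from
# roughly 93% of the data it was entitled to expect.
loaded = [date(2015, 6, 1) + timedelta(days=i) for i in range(30) if i not in (10, 17)]
print(completeness_report(date(2015, 6, 1), date(2015, 6, 30), loaded))
```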

In addition to the above, there is a need to reuse stored data and to review or revisit old data, a function that some frameworks treat as a separate stage. In such situations, it may be wise to have data quality processes specific to these information and data retrieval steps. Usually, however, the same data quality criteria we apply to data storage and databases will be applicable here too.