The Value of “Small Data” in ML and AI

This is a comment from LinkedIn.

I wish we paid more attention to “small data”. Models that are built from small data aren’t necessarily bad – it depends on the data generating process you’re trying to model. More data doesn’t necessarily imply better models, especially if the veracity of the data is questionable. Data-centric AI is a discussion that’s being had now in this context. However, when you don’t need large scale ML models and are (prudently) content building statistical tests and simple models, these small data problems become important.

What decision makers shouldn’t forget is that the essential nature of decision making won’t change just due to the size of the data – ultimately it is the insight that models provide (based on many factors) that is the commodity we consume as decision makers. Consequently there should not be an aversion towards “small data” problems but a healthy curiosity. Like all efficiency movements that came before, small data paradigms are innately attractive – if I can verifiably build better models by doing less work, that should logically be a point of value.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to elucidate on the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say here, that I found Michael Koelbl’s answer to What are all the different types of data scientists? quite interesting, and thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts that understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis (EDA), reporting based on the creation of new measures, graphs and charts built on those measures, and asking questions around the EDA. They’re excellent at storytelling, asking questions based on data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, where they’re trying to build ML models of one or other kind, based on the data. They’re not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. Their tools span the range from BI tools to coded-up scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Perhaps statisticians are the rarest breed of the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and DOE, to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Why Do I Love Data Science?

This is a really interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a starting point such as a visualization for the hypothesis generation process makes us go into the process of building models a little more confidently. There’s the specific example of a time series analysis technique I used for energy system data, where using aggregate statistical measures and distribution fitting led to arbitrary and complex patterns in the data. Using time ordered visualizations helped me formulate the hypothesis in the correct way, and allowed me to build an explanatory model of the system.

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns and exceptions to these rules. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of customer transaction data between a customer and a client. These log data revealed unusual patterns that those steeped in the process could tell, but which couldn’t be quantified. By finding typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as “feature engineering” in data science and machine learning, was a key insight. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
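To make the idea of session-specific metrics concrete, here is a minimal sketch in Python. It is not the analysis I ran on that project: the log records, the single feature (mean transactions per session) and the threshold are all hypothetical, and flagging by distance from the median in units of the median absolute deviation is just one simple choice among many.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log records: (customer_id, session_id, transactions_in_session).
log = [
    ("c1", "s1", 4), ("c1", "s2", 5),
    ("c2", "s1", 5), ("c2", "s2", 6),
    ("c3", "s1", 4), ("c3", "s2", 6),
    ("c4", "s1", 40), ("c4", "s2", 44),
]

def transactions_per_session(records):
    """Engineer one feature per customer: mean transactions per session."""
    totals = defaultdict(int)
    sessions = defaultdict(set)
    for customer, session, n_transactions in records:
        totals[customer] += n_transactions
        sessions[customer].add(session)
    return {c: totals[c] / len(sessions[c]) for c in totals}

def flag_anomalies(feature, threshold=5.0):
    """Flag customers far from the median, measured in units of the
    median absolute deviation (MAD). The threshold is an assumption."""
    values = list(feature.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be called unusual
    return sorted(c for c, v in feature.items()
                  if abs(v - centre) / mad > threshold)

features = transactions_per_session(log)
anomalous = flag_anomalies(features)
print(features)
print(anomalous)
```

The point of the sketch is that the feature does the heavy lifting: once the raw log is summarised per customer in a way that reflects the domain, even a very simple rule can separate the typical from the unusual.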

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then continue to improve your hypothesis to a point where it is able to help your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in, and present interesting possibilities to us during the data analysis. Sometimes, we answer these questions by finding and including the additional data, and at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and you get to do an analysis on top of another analysis – with the result that you’ve composited different models together after a while, that give you completely new insights that you’ve not seen before.

Concluding Remarks

All three patterns are exhilarating and interesting to observe, for data scientists, especially those who are deeply involved in reasoning about the data. A good indication of whether you’ve done well in data analysis is when you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much to you when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.

Quora Data Science Answers Roundup

I’m given to spurts of activity on Quora. Over the past year, I’ve had the opportunity to answer several questions there on the topics of data science, big data and data engineering.

Some answers here are career-specific, while others are of a technical nature. Then there are interesting and nuanced questions that are always a pleasure to answer. Earlier this week I received a pleasant message from the Quora staff, who have designated me a Quora Top Writer for 2017. This is exciting, of course, as I’ve been focused largely on questions around data science, data analytics, hobbies like aviation and technology, past work such as in mechanical engineering, and a few other topics of a general nature on Quora.

Below, I’ve put together a list of the answers that I enjoyed writing. These answers have been written keeping a layperson audience in mind, for the most part, unless the question itself seemed to indicate a level of subject matter knowledge. If you like any of these answers (or think they can be improved), leave a comment or thanks (preferably directly on the Quora answer) and I’ll take a look! 🙂

Happy Quora surfing!

Disclaimer: None of my content or answers on Quora reflect my employer’s views. My content on Quora is meant for a layperson audience, and is not to be taken as an official recommendation or solicitation of any kind.

Big Data: Size and Velocity

One of the changes envisioned in the big data space is that there is the need to receive data that isn’t so much big in volume, as big in relevance. Perhaps this is a crucial distinction to make. Here, we examine business manifestations of relevant data, as opposed to just large volumes of data.

What Managers Want From Data

It is easier to ask the question “what do you want from your data” to managers and executives, than to answer it as one. As someone who has worked in Fortune 500s with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data even if they talk about wanting to use data for decision making. They instead want to see interpretations of data that helps them make up their minds and take decisions.
  2. Decision making is not monotonous or based on single variables of interest.
  3. Decision making involves not only operational data descriptors (which are most often instrumented for collection from data sources)
  4. Decisions can be taken based on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, and this means collecting more data, and iterating on one’s results and hypotheses. Broadly, the process of getting insights from data may involve data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean that the data scientist’s job is done once the insights are generated. There is a need to collect more data and refine the models we’ve built, and construct better views of the problem landscape.

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality, and its readiness for analysis. Specifically to data quality, some things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable which was measured. This in turn involves other aspects such as linearity, stability, bias, range, sensitivity and other parameters of the measurement system
  2. Latency aspects – whether the measured data in time sequence is recorded and logged in the correct sequence and at the right intervals
  3. Missing and anomalous values – these are missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject.
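
Some of these checks are easy to automate. The sketch below, in Python, covers the latency checks and the missing or anomalous readings; the readings, the one-minute expected interval and the valid range are all hypothetical. It does not cover the measurement aspects (bias, linearity, stability and so on), which need reference measurements rather than the logged data alone.

```python
from datetime import datetime, timedelta

# Hypothetical sensor log: (timestamp, value). None marks a missing value.
readings = [
    (datetime(2024, 1, 1, 0, 0), 20.1),
    (datetime(2024, 1, 1, 0, 1), None),
    (datetime(2024, 1, 1, 0, 3), 20.4),
    (datetime(2024, 1, 1, 0, 2), 500.0),
    (datetime(2024, 1, 1, 0, 4), 20.3),
]

def quality_report(rows, expected_step, valid_range):
    """Return the row indices that fail each basic data quality check."""
    low, high = valid_range
    report = {"missing": [], "out_of_range": [],
              "out_of_order": [], "irregular_interval": []}
    for i, (timestamp, value) in enumerate(rows):
        if value is None:
            report["missing"].append(i)
        elif not (low <= value <= high):
            report["out_of_range"].append(i)
        if i > 0:
            step = timestamp - rows[i - 1][0]
            if step <= timedelta(0):
                # Logged at or before the previous record's timestamp.
                report["out_of_order"].append(i)
            elif step != expected_step:
                report["irregular_interval"].append(i)
    return report

report = quality_report(readings,
                        expected_step=timedelta(minutes=1),
                        valid_range=(-40.0, 60.0))
print(report)
```

A report like this only tells you where to look. Whether a flagged record is a logging fault or genuinely anomalous behaviour of the system is still a question for someone who knows the process.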

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain. Traditionally, this slowest link has been the collection of the data itself. However, this is changing, with sensors in IoT settings delivering huge data sets and massive streams of data in themselves. Data processing frameworks have also improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. In such a situation, the dearth of time is not in the acquisition of data itself. Indeed, the availability of massive data lakes with lots of data in them itself signals the need for more and more data scientists, who can analyse this data and arrive at insights from the data. It is in this context that the rapid creation of models, analysis of insights from data, and the creation of meta-algorithms that do such work is valuable.

Where Size Does Matter

There are some problems which do require very large data sets. There are many examples of systems that gain effectiveness only with scale. One such example is the commonly found collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets. Small data sets are prone to truth inflation and poor analysis results from poor data collection. In such cases, there is no respite other than to ensure we collect, store and analyze large data sets.

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve – whether volume is more important in data sets, or whether relevance is. Relevant data simply meets a number of the criteria listed above, whereas data that’s measured up purely in volume (petabytes or exabytes) doesn’t give us an idea of the quality and its use for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes the data relevant for analysis. Unsurprisingly, for some applications such as neural network training, data science on time series data sets of high frequency, etc., data volume is undeniably useful. More data in these cases implies that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested out on different representative sample sets of the data.

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety. Some discuss a fourth characteristic – the veracity of the data. Overall, the availability of relevant data in sufficient volumes should be able to address the needs of the data scientist for model building and for data exploration. The question of variety still remains, and as data profiling approaches mature, data wrangling will advance to a point where this variety isn’t a trouble, and is a genuine asset. However, the volume and velocity points are definitely scoped for a trade off, in a large percentage of the cases. For most model building activities, such as linear regression models, or classification models where we know the characteristics or behavior of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
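
As a rough illustration of that last point, here is a small Python simulation. Everything in it is an assumption made for the example: the data come from a made-up linear process (slope 2, intercept 1, Gaussian noise), the "big" data set has 10,000 points, and the "small" one is a simple random sample of 50 of them.

```python
import random

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data generating process: y = 2x + 1 plus unit Gaussian noise.
population = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    population.append((x, 2.0 * x + 1.0 + rng.gauss(0, 1)))

# A small but representative sample: drawn at random from the same process.
small_sample = rng.sample(population, 50)

slope_big, intercept_big = fit_line(population)
slope_small, intercept_small = fit_line(small_sample)
print(f"all 10,000 points: slope={slope_big:.3f}")
print(f"sample of 50:      slope={slope_small:.3f}")
```

When the sample is representative and the model matches the process, the small sample estimates the slope nearly as well as the full data set for most practical purposes. The argument breaks down if the sample is biased or the process is more complex than the model assumes, which is exactly where relevance matters more than volume.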

Domain: The Missing Element in Data Science

As a data science consultant who routinely deals with large companies and their data analysis, data science and machine learning challenges, I have come to understand one key element of the data scientist’s skill set that isn’t often discussed in data science circles online. In this post I hope to elucidate the importance of domain knowledge.

Over the last several years, there has (rightly) been significant debate on the skill sets of data scientists, and the importance of business, statistics, programming and other skill sets. Interesting sub-classifications of professions, such as “data hacker”, “data nerd” and other terms have been used to describe the various combinations or intersections of these skill sets.

The Importance of Domain Knowledge

In all of these discussions, however, one key element has been left out. And that is the domain.


Domain knowledge is an important subset of the data scientist’s work. Although the perfect data scientist is a bit of a unicorn, the domain should be an important consideration.

Domain knowledge is distinct from statistics, data analysis, programming and the purely technical areas, and it is easy to see how that is the case. However, business knowledge is often conflated with domain knowledge, perhaps understandably, because these are both vague and interdisciplinary areas. Business knowledge entails some amount of financial knowledge, unit economics models, strategy, people management, and a range of other skills taught in business schools, and more commonly, learned in organisations on the job. Domain knowledge, however, is like being a kind of human expert system. Wikipedia defines an expert system without defining expertise. What role does expertise play in data science, however?

Domain knowledge is a result of the system exploration that humans as system builders naturally do. To be able to formulate intelligent hypotheses, the unique cause-effect chains that are relevant to specific systems can be studied and understood. Do humans learn about systems in ways that are different from how machines might explore them, if we were to give them infinite data and computational capability? That is a hard question to answer in this context, and perhaps represents a red herring of sorts. What is useful to note, however, is that machine learning models still rely on human-formulated hypotheses. There is the odd example of an expert system that has formulated hypotheses and proved them (as is happening in medicine, these days), but these examples are hardly possible without human intervention.

Now that we have established that human intervention has become necessary in machine learning systems, data science can be seen as a field that relies uniquely on human-formulated hypotheses. While computational power and statistical models help us explore and construct hypotheses, the decisions that are made from this data – that help us define hypotheses, model the data to test these hypotheses, construct mathematical or statistical models of these data, and then evaluate the results of those tests – all of these activities take place with human intervention.

So where does domain fit in? Domain experts are those who have significant experience learning about one or a few interconnected systems in intimate ways. Their ability to develop a gut feel for the system’s performance and characteristics helps them leap frog the formulation of hypotheses, and this is their biggest benefit, compared to domain-agnostic data scientists, who merely have the programming, statistics, business and communication skills required to make serious analysis happen.

Domain Expertise and Analysis Paralysis

Domain expertise is probably one fine way to fight off the analysis-paralysis problem that plagues many data science teams. Some data science teams take up significant time and resources to experiment widely with ideas, and the availability of high performance computing power on tap makes them take hypothesis formulation less seriously. Adversity is truly the mother of inventiveness, and it was, for example, when computing power was at a premium that some of the most efficient sorting algorithms were devised. Conversely, the availability of computing power and statistical modeling capabilities on a massive scale reduces the incentive to ask pertinent questions.

Pertinent questions and specific answers lead to tangible decisions and related business improvements. Without the benefit of domain knowledge, this is not possible. Analysis paralysis is a very real phenomenon. Data scientists are susceptible to it in organizations that value domain expertise and don’t value analytical solutions. In situations where analytical solutions and problem solving are valued, data scientists who fly blind toting algorithms and machine learning won’t come out on top either – they’re more likely to hurt the credibility of the data science exercise than help it, when they solve simple problems that have pre-existing domain formulations with the help of complex algorithms (which may sometimes not give sufficient insight into their own workings, despite working well).

Challenge or Channel Domain Expertise?

Machine learning work done in medicine (cancer cell detection) points to a future where human-learned skills are replicated by deep learning or reinforcement learning systems. Alternatively, many real data science programs at diverse companies indicate an analysis paralysis that can be addressed by involving domain experts of specific kinds to a greater degree in the data science hypothesis formulation, analysis and interpretation of results. The latter is more representative of a real world scenario than the former, where an expert system independently learns about a hard problem and solves it.

Doing Data Science Better

In order to be able to do data science better, it isn’t merely important to consider developing data scientist resources along the lines described by Drew Conway or Stephan Kolassa. It is important to groom analytically capable people from within domains too. This means distributing the skill set required for serious analysis from the mainstream data science practice, into functional teams. Sometimes, this may mean penetrating leadership teams that work in functional capacities, and at other times, it may mean addressing the needs of small teams directly, by grooming functional/technical talent for doing data science.

Doing data science better doesn’t merely involve leveraging algorithms and their strengths better. It also means asking the right questions. Pay attention to your domain experts, and develop the capabilities around the analytical capabilities of your team. Success for many companies doesn’t look like all-conquering deep learning algorithms, but looks like specific problems solved in a targeted manner, by using well defined problem statements and the right algorithms and frameworks.

Insights about Data Products

Data products are one inevitable result and culmination of the information age. With enough information to process, and with enough data to build massively validated mathematical models like never before, the natural urge is to take a shot at solving some of the world’s problems that depend on data.

Data Product Maturity

There are some fundamental problems all data products aim to address:

  1. Large scale mathematical model building was not possible before. In today’s world of Hadoop and R/Python/Scala, you can build a very specific kind of hypothesis and test it using data collected on a massive scale
  2. Large scale validation of an idea was not possible before. Taking a step back from the hypothesis itself, the presence of big data technologies and the ability to test hypotheses of various kinds ultimately helps validate ideas
  3. Data asymmetry problems can be addressed on a scale never seen before. Taking yet another step back from the ability to validate diverse ideas, the presence of such technologies and models allows us to put power in the hands of decision makers like never before, by arming them with data.


Being Data Driven: Enabling Higher Level Abstractions of Work

Cultivating a data-driven mindset is hard. I have blogged about this before. But when the standard process workflows (think Plan-Do-Check-Act and Deming) are augmented by analytics, it is amazing what happens to “regular work”. The need to collect, sort and analyze data in a tireless, diligently consistent and unbiased fashion gets delegated to a machine. The human being in the organization is no longer saddled with the mundane activities of data collection and management. Their powers are put to use by leveraging higher reasoning faculties – to do the data analysis that results in insight, and to interpret and review the strategic outcomes. The higher levels of abstraction of work that data products enable help organizations and teams mature.

And this is the primary value addition that a lot of data products seem to bring. The tasks that humans are either too creative for (or too easily bored because of) get automated, and in the process, the advantages of massive data collection and machine learning are leveraged, to bring about a decision making experience that truly eclipses prior generations of managers in the ability and speed to get through complex decisions fast.

Data Product Opportunities

Data products will become a driving force for industrializing the third world nations, and may become a key element of the business strategy of the largest of the large corporations. The levels of uncertainty in business today echo the quality of tools available, and the leverage that this brings. The open source movement has accelerated product development teams in areas such as web development, search technologies, and made the internet the de-facto medium of information for a lot of youngsters. Naturally, these youngsters will warm up faster than the previous generations about the data products available to them. Data products could improve the lives of millions, by enabling the access economy.


While the action is generally in the upper right quadrant here, with companies fighting it out for more subscribers and catering to modern segments of industry that are more receptive to ideas, the silent analytics revolution may actually happen in brick and mortar companies that have fewer subscribers and have a more traditional mindset or in a more traditional business. Wherever possible, companies are delivering value by digitization, but a number of services cannot be so digitized, and here is another enabling opportunity. The data products in this space may not attempt to replace the human, or replace the traditional value proposition. Instead, they can function in much the same way IoT is disrupting enterprises. Embedded systems and technologies are definitely one aspect of the silent analytics revolution in the bottom left quadrant, which may have large market fragmentation and entrenched business models that haven’t moved on from decades or centuries old ideas.


“Small Data” and Being Data-Driven

Being data-driven in organizations is a bigger challenge than it is made out to be. For managers to suspend judgement and make decisions that are informed by facts and data is hard, even in this age of Big Data. I was spurred by a set of tweets I posted, to think through this subject.

Decision Making Culture

A lot of organizations have jumped into the Big Data era having bypassed widespread use of data-driven decision making in their management ranks altogether. And this is, for many organizations, an inconvenient truth. In many organizations, even well known ones, experienced managers often made decisions on gut feeling or based on reasons other than data that they collected. Analytics and business intelligence hoped to change that, and in some ways, they have. Many organizations and managers have changed their work styles. Examples abound of companies adopting techniques like Six Sigma in the 1980s and 1990s, a trend that continues to this day in the manufacturing industry.

Three Contrasts

With the explosion in technologies and methods that have enabled Big Data to be collected and stored as “data lakes” and for data to be collected in real time as streaming data using technologies like Spark and NiFi, we’re at the advent of a new era of decision making characterised by the 3 Vs of Big Data, and data science at scale.

To see three contrasts between old and new management decision making styles:

  1. Spending and buying decisions (for resources, infrastructure, technology and projects) are made after competitive evaluation based on data now more than ever. In the past, the lack of communication and analysis engines, and limited globalization enabled managers to spend less time evaluating even critical decisions, because the options were limited. Spending and buying decisions make up a lot of the executive decision making and a lot of it is informed by small data. The new trends of economies connected to networks, data mining and data analysis are bound to impact this positively. A flood of information enabled by the digital age exposed managers to possibilities but without the tools to do better at such competitive analysis. The advent of advanced analytics will upend this paradigm, and will result in better visibility of decision alternatives.
  2. Operational excellence decisions are based on real-time data now more than ever. Operational excellence and process efficiency are a key focus area for many manufacturing organizations, and increasingly concern service oriented organizations as well. While “small data” were being collected at regular intervals, to get a sense of the business operations, these were not fully effective in capturing the wide range of process modes and didn’t represent the full possibilities one could leverage with such data. The number of practitioners of advanced methods, who used such methods in a verifiable way, was also limited, and they rarely formed the management strata or informed them. The proliferation of the new classes of data scientists and data engineers will affect the way decisions will be taken in future, in addition to the advent of real-time analytics.
  3. Small data as a stepping stone to Big Data. Small Data, which is data collected as samples that may be slices of sensor information or representative samples of population data (such as Big Data), may increasingly be used to formulate the “cultural business case” for doing Big Data in companies. Many companies that do not have the culture of data driven decision making in their managerial ranks, are experimenting on a grand scale, with Big Data. Such organizations have taken to Big Data technologies such as Hadoop and Spark, and are often collecting more data than they usefully analyze. There is definitely scope to evaluate the business value of such implementations. There is also an opportunity to improve the cost effectiveness of the data science initiatives in companies, by evaluating the real need for a Big Data implementation, by using “small data” – data that does not meet the same volume, velocity, variety and veracity criteria as what’s now accepted to be Big Data.
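
On the third point, one well-known way to draw a small, uniformly random sample from a stream too large to hold in memory is reservoir sampling. This is a technique I am bringing in purely for illustration; the sketch below is in Python, and the event stream and sample size are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using memory proportional to k (the classic 'Algorithm R')."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive at both ends
            if j < k:
                sample[j] = item   # replace with probability k / (i + 1)
    return sample

# Hypothetical stream of one million sensor events, never stored in full.
events = (f"event-{n}" for n in range(1_000_000))
small_data = reservoir_sample(events, k=100, seed=42)
print(len(small_data))
```

A sample drawn this way can be analysed on a laptop, which makes it a cheap first test of whether the questions being asked really need the full Big Data implementation.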

Data Driven Decision Making Behaviours

Decision making is strongly influenced by behaviours. Daniel Kahneman’s book Thinking, Fast and Slow provides a psychological framework for thinking about fast and slow decision making, the former being gut-driven, and the latter being driven by careful, plodding analysis. Humans have a tendency towards decisiveness, especially in organizations, and executives are often rewarded for fast decision making that is also effective. Naturally, this means that decision making as a habit flourishes in organizations.

Such fast decision making, however, comes at a price. A lot of decisions that aren’t well thought-through, could influence a large organization’s functioning, because the decision could be fundamental to the organization and may be relevant to all employees. Some organizations do reward behaviours in their managerial cadres that facilitate looking at the data that supports decisions. However, the vast majority of managers have a tax on the time they spend on decisions and would be rewarded for acting quickly and influencing a wide ranging array of decisions instead.

Enabling fast decision making has obvious benefits in a market economy. The longer managers take over a decision, or delay it, the less competitive their companies tend to look. Data driven decision making can be enabled by providing quick and painless access to data. This means building intelligence into our interfaces, and into the machines that help us make and record decisions. It also means being able to delegate mundane tasks well and easily.

Concluding Remarks

A lot of organizations that have Big Data initiatives may not have the appropriate management or decision making culture that can fully utilize the investment in Big Data, which can sometimes be considerable. By using “Small Data” and the insights from analysis of such data, there is an opportunity to invest less and build the behaviours and organizational systems and habits that will make a Big Data implementation effective.

The “Jagged Edge” of Real Time Analytics

I recently came across this Hortonworks Data Flow presentation where the concept of the jagged edge of real time data analytics is discussed. To me, the natural context for discussing this is the prioritization of what comes in through the sensors (or other big data gathered from these “jagged edge” sources). This is a pondering rather than a post with a specific agenda; perhaps I will add to it as responses come in, or as I learn more.

One of the key challenges for a lot of data centres in future, in the world of the Internet of Things, will be to provide relevant analytics on data regardless of its size, point of origin, or the time at which it was generated. To do this well, I foresee not only technologies like NiFi providing low latency updates as close to real time as possible, but also a need for technologies that build sufficient intelligence into the data sampling and data collection process.

The central philosophy is surprisingly old: W. Edwards Deming said that data are not collected for museum purposes, but for decision making. It is in this context that we see the data lakes of today transforming into the data seas of tomorrow, unless we use intelligent prioritization to determine what data should be streamed and what data should be stored. The reality of decision making in such real time analytics situations is that too much data is available for any one decision, which reminds me of Barry Schwartz’s paradox of choice: too much choice can actually impede and delay decision making, rather than aid it.
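
As a thought experiment only, the prioritization idea could be pictured as a rule that decides, per reading, whether it deserves low latency streaming or can simply be stored for later. The sketch below is not how NiFi or any other product works; the `Reading` type, the `route` function, and the baseline and tolerance values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor_id: str
    value: float

def route(reading, baseline, tolerance):
    """Return 'stream' for readings that deviate from the baseline, else 'store'."""
    # Hypothetical rule: only out-of-tolerance readings need low latency attention.
    if abs(reading.value - baseline) > tolerance:
        return "stream"
    return "store"

readings = [
    Reading("pump-1", 20.1),
    Reading("pump-1", 27.9),
    Reading("pump-1", 19.8),
]
routes = [route(r, baseline=20.0, tolerance=2.0) for r in readings]
print(routes)  # ['store', 'stream', 'store']
```

A real implementation would need far richer rules, but even this toy version shows the point: the decision about what to stream is made at collection time, not after the lake has filled.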

How do we ensure that approaches like data flow prioritization can allow us to address these issues? How do we move away from static data lakes to the more useful data streams from the jagged edge, which data streaming technologies like Spark promise on established frameworks like Hadoop, without the risk of turning our lakes into seas that we lack the insight to fully benefit from?

Some of the answers lie in the implementations of specific use cases, of course. There is no silver bullet, if you will. That said, what kinds of technologies can we foresee being developed for Hadoop, Spark and other technologies that will heavily influence the internet of things revolution, to solve this prioritization conundrum?

Quality and the Data Lifecycle

The insights we get from data depend on the quality of the data itself, and as the saying goes, “Garbage In, Garbage Out”. The volume of data doesn’t matter as much as its quality. With the growing scale of big data implementations, data quality assurance is therefore an increasingly important function. In this short post, I’ll discuss aspects of data quality as seen from a data life cycle perspective.

Data quality is applicable to all three key areas of data collection, storage and use in analysis. In each of these contexts, data quality can take on a different meaning. The most widely recognized data management paradigms involve these three stages, sometimes broken down into a larger number of stages. How does data quality assurance figure in each of these three stages of the data life cycle?

Data Quality In Data Collection

Data quality in the data collection part of the data life cycle is concerned with the nature, kind and operational definition of the data collected. It is therefore also concerned with the sensors or measurement systems that collect the data, their veracity, and their ability to monitor the source of the data as required. Assessment criteria in a 20th century industrial setting may have involved approaches like measurement system analysis. These days, more sophisticated logical validation approaches may be used. Methods of studying different aspects of the measurement systems, such as stability, linearity and bias, may also be adopted. A wider discussion of data quality here will also include the process of documenting the data collected, whether manually or electronically, and how this is best rationalized.
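
To give a flavour of what studying bias and linearity involves, here is a minimal sketch using made-up gauge study data. Bias is taken as the average measured value minus a known reference value, and linearity as the least-squares slope of bias against reference value. The function names and all the numbers are assumptions for illustration; a formal measurement system analysis involves considerably more than this.

```python
import statistics

def bias(measurements, reference):
    """Bias: average measured value minus the known reference value."""
    return statistics.fmean(measurements) - reference

def linearity_slope(references, biases):
    """Least-squares slope of bias against reference value.

    A slope near zero suggests the bias is constant across the range.
    """
    mean_x = statistics.fmean(references)
    mean_y = statistics.fmean(biases)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(references, biases))
    sxx = sum((x - mean_x) ** 2 for x in references)
    return sxy / sxx

# Hypothetical gauge study: repeated measurements of three reference parts.
study = {
    10.0: [10.1, 10.2, 10.0],
    20.0: [20.3, 20.2, 20.4],
    30.0: [30.5, 30.6, 30.4],
}
references = list(study)
biases = [bias(values, ref) for ref, values in study.items()]
slope = linearity_slope(references, biases)

print(f"biases: {[round(b, 3) for b in biases]}")
print(f"linearity slope: {slope:.3f}")
```

In this made-up study the bias grows from roughly 0.1 to roughly 0.5 across the range, so the slope is positive: the gauge reads progressively higher on larger parts.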

Data Quality in Data Storage

Data storage is the second part of the data life cycle, covering the storage of data on small and large servers alike. In the big data space, we may use approaches like Hadoop, Spark and the like to store and retrieve data from relational databases, such as generic AS400 or Oracle databases, and from non-relational, Mongo-esque databases. When we approach data storage across different physical disks and locations, the integrity of the data and the sanity and rationalization of data management practices become important. Data quality here could be measured in the same ways we check database integrity, for example with checks for data loss and data corruption. Data quality can therefore encompass physical checks for integrity: the quality of the hard drives or SSDs in question, the systems and processes that enable access to the data on an as-needed basis (power supply, bandwidth and other considerations), and the software aspects. The data quality discussion should naturally also flow to the practices adopted at the data warehouse in question, such as maintenance of the servers, the kinds of commands run, and the kinds of administrative activities performed.
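
One common way to check for data corruption is to record a checksum when data is written and compare it when the data is read back. The sketch below uses SHA-256 from Python's standard `hashlib` module. The helper names and the sample bytes are hypothetical, and it illustrates the general idea rather than what any particular database or file system does internally.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a block of bytes."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, expected_digest: str) -> bool:
    """True if the data still matches the digest recorded at write time."""
    return checksum(data) == expected_digest

original = b"sensor_id,value\npump-1,20.1\n"
recorded = checksum(original)  # stored alongside the data when written

# Simulate silent corruption of a single value.
corrupted = original.replace(b"20.1", b"20.7")

print(is_intact(original, recorded))   # True
print(is_intact(corrupted, recorded))  # False
```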

Data Quality in Data Analysis

Data analysis, which is the end that justifies the means (to collect and store data), is arguably the most decisive and context-sensitive step of the whole data lifecycle. Data quality in this context may be seen as the availability, from our databases, of the right data for the right analysis. While the analysis itself is decided by what the business requires and what the available data makes possible, the data available in principle (or as per the organization’s SLAs) and the data actually available should tally. Data quality in an analysis context therefore becomes more specific and focused, tailored to the analysis we’re interested in and the insights we want to derive from the data.
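
The tally between the data we expect in principle and the data actually available can be sketched as a simple coverage check: for each field the analysis needs, what share of records actually carry a value? The field names, records and the `tally` helper below are all hypothetical.

```python
def tally(expected_fields, records):
    """Share of records carrying a non-missing value for each expected field."""
    report = {}
    for field in expected_fields:
        present = sum(1 for r in records if r.get(field) is not None)
        report[field] = present / len(records) if records else 0.0
    return report

# Fields a hypothetical analysis expects to find.
expected = ["customer_id", "order_date", "amount"]

# Records as actually retrieved; some values are missing or null.
records = [
    {"customer_id": 1, "order_date": "2016-01-04", "amount": 12.5},
    {"customer_id": 2, "order_date": None, "amount": 7.0},
    {"customer_id": 3, "amount": 3.2},
    {"customer_id": 4, "order_date": "2016-01-06", "amount": 9.9},
]

coverage = tally(expected, records)
gaps = [field for field, share in coverage.items() if share < 1.0]

print(coverage)  # {'customer_id': 1.0, 'order_date': 0.5, 'amount': 1.0}
print(gaps)      # ['order_date']
```

Which gaps matter depends entirely on the analysis: a missing order date is fatal for a seasonality study and irrelevant for a total revenue figure.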

In addition to the above, there is a need to reuse stored data, and to review or revisit old data. This function is considered separate in some frameworks. In such situations, it may be wise to have data quality processes that are specific to these information and data retrieval steps. Usually, however, the same data quality criteria we have used for data storage or databases should be applicable here too.