Quality and the Data Lifecycle

The insights we get from data depend on the quality of the data itself; as the saying goes, “Garbage In, Garbage Out”. The volume of data matters less than its quality. As big data implementations grow in scale, data quality assurance is becoming an increasingly important function. In this short post, I’ll discuss aspects of data quality as seen from a data life cycle perspective.

Data quality applies to all three key areas of the data life cycle: collection, storage and use in analysis. In each of these contexts, data quality can take on a different meaning. The most widely recognized data management paradigms involve these three stages, sometimes broken down into a larger number of stages. How does data quality assurance figure in each of these three stages of the data life cycle?

Data Quality in Data Collection

Data quality in the data collection part of the data life cycle is concerned with the nature, kind and operational definition of the data collected. It is therefore concerned with the sensors or measurement systems that collect data, their veracity and their ability to monitor the source of the data as per requirements. Assessment criteria in a 20th century industrial setting may have involved approaches like measurement system analysis, which studies aspects of a measurement system such as stability, linearity and bias. These days, such methods may be complemented by more sophisticated logical validation approaches, for example automated checks that each incoming value is present and falls within a plausible range. A wider discussion of data quality here will also include the process of documenting the data collected, whether manually or electronically, and how this is best rationalized.
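As a rough illustration of the measurement system ideas above, here is a minimal Python sketch of a bias estimate, a linearity check and a simple logical validation rule. The function names, the gauge study values and the range limits are all hypothetical and chosen only for illustration; a real measurement system analysis would follow a documented study design.

```python
from statistics import mean


def bias(readings, reference):
    """Average deviation of repeated readings from a known reference value."""
    return mean(readings) - reference


def linearity_slope(references, biases):
    """Least-squares slope of bias against reference value.

    A slope near zero suggests the bias is roughly constant across
    the measuring range; a larger slope suggests it grows or shrinks.
    """
    ref_mean = mean(references)
    bias_mean = mean(biases)
    numerator = sum(
        (r - ref_mean) * (b - bias_mean) for r, b in zip(references, biases)
    )
    denominator = sum((r - ref_mean) ** 2 for r in references)
    return numerator / denominator


def validate_reading(value, low, high):
    """Logical validation: reject missing or out-of-range values."""
    return value is not None and low <= value <= high


# Hypothetical gauge study: reference value -> repeated readings
study = {
    10.0: [10.1, 10.2, 10.0],
    20.0: [20.3, 20.2, 20.4],
    30.0: [30.5, 30.4, 30.6],
}
references = list(study)
biases = [bias(readings, ref) for ref, readings in study.items()]
slope = linearity_slope(references, biases)
```

In this made-up study the bias grows with the reference value, so the slope is noticeably above zero, which is the kind of pattern a linearity study is meant to surface.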

Data Quality in Data Storage

Data storage is the second part of the data life cycle, and covers the storage of data on small and large servers alike. In the big data space, we may use frameworks like Hadoop for distributed storage and engines like Spark to read and process that data, alongside relational databases such as Oracle or those on AS400 systems, and non-relational, Mongo-esque databases. When data is stored across different physical disks and locations, the integrity of the data and the soundness of data management practices become important. Data quality here could be measured the same way we check database integrity, for example with checks for data loss and data corruption. Data quality can therefore encompass physical checks for integrity: the quality of the hard drives or SSDs in question, the systems and processes that enable access to the data on an as-needed basis (power supply, bandwidth and other considerations), and the software aspects. The data quality discussion should naturally also extend to the practices adopted at the data warehouse in question: maintenance of the servers, the kinds of commands run and the kinds of administrative activities performed.
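One common way to detect data corruption in storage is to record a checksum when data is written and compare it when the data is read back. The sketch below assumes SHA-256 via Python's standard `hashlib` module; the helper names and the sample payload are hypothetical, and an in-memory stream stands in for a stored file.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream, expected_digest):
    """Return True if the stream's current contents match the recorded digest."""
    return sha256_of_stream(stream) == expected_digest


# Hypothetical payload; io.BytesIO stands in for a file opened in binary mode
original = b"sensor_id,reading\n42,10.1\n"
recorded = sha256_of_stream(io.BytesIO(original))

# A single changed character simulates corruption on disk
corrupted = b"sensor_id,reading\n42,10.7\n"
```

Calling `is_intact(io.BytesIO(corrupted), recorded)` returns `False`, because any change to the stored bytes yields a different digest. A checksum only detects corruption; recovering the data still depends on backups or replication.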

Data Quality in Data Analysis

Data analysis, which is the end that justifies the means (collecting and storing data), is arguably the most decisive and context-sensitive step of the whole data life cycle. Data quality in this context may be seen as the availability, from our databases, of the right data for the right analysis. While the analysis itself is decided based on what the business requires and what is possible with the available data, the data available in principle (or as per the organization’s SLAs) and the data actually available should tally. Data quality in an analysis context therefore becomes more specific and focused, tailored to the analysis we’re interested in and the insights we want to derive from the data.
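One simple way to compare the data an analysis needs with the data actually available is a completeness report per required field. The sketch below is only illustrative: the field names, the sample records and the completeness thresholds are hypothetical, and a real check would use whatever criteria the analysis or SLA defines.

```python
def availability_report(records, required_fields):
    """Fraction of records with a non-missing value, per required field."""
    total = len(records)
    report = {}
    for field in required_fields:
        present = sum(1 for record in records if record.get(field) is not None)
        report[field] = present / total if total else 0.0
    return report


def fit_for_analysis(report, min_completeness):
    """True if every required field meets the completeness threshold."""
    return all(share >= min_completeness for share in report.values())


# Hypothetical records pulled for a revenue-by-region analysis
records = [
    {"customer_id": 1, "region": "north", "revenue": 120.0},
    {"customer_id": 2, "region": None, "revenue": 80.0},
    {"customer_id": 3, "region": "south", "revenue": None},
    {"customer_id": 4, "region": "east", "revenue": 95.0},
]
report = availability_report(records, ["customer_id", "region", "revenue"])
```

Here the same data set passes a 70% completeness threshold but fails a 90% one, which reflects the point that fitness for use depends on the analysis in question.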

In addition to the above, there is a need to reuse stored data and to review or revisit old data. This function is considered a separate stage in some frameworks. In such situations, it may be wise to have data quality processes that are specific to these data retrieval steps. Usually, however, the same data quality criteria we have used for data storage or databases should be applicable here too.