Pervasive Trends in Big Data and Data Science

As of mid-2017, I’ve spent almost two years in the big data analytics and data science world, after 13 years of diverse work experience in engineering and management. What began as professional curiosity has taken a while to turn into real data science and engineering skills, and longer still to hone the ones that matter most to a working data scientist. Along the way, I’ve had the chance to learn core software development methods and principles, stay in touch with the latest in the field, challenge my existing knowledge of product development methodologies and processes, and learn far more about data analysis, statistics and machine learning than I knew in 2015. Alongside the constant learning, I’ve also had the chance to observe a few pervasive trends in the big data and analytics worlds, which I wish to share here.

  1. Cloud infrastructure penetration: The biggest beneficiaries of the data and analytics revolution have undoubtedly been cloud service providers. They are also stretched thin, facing falling prices, massive competition, and the need to offer value-added services of various kinds (big compute and API support alongside big storage, for instance) on top of the core cloud offerings that companies are lapping up for their data management needs. Security concerns persist, and one of the most prominent security issues involved the US’ leading cloud service provider, Amazon Web Services. Despite this, many industries, even those that consider data security paramount, want to adopt cloud infrastructure because of the reduced cost of operation and the scalability inherent in cloud platforms.
  2. Deep learning adoption: Generalized learning algorithms based on neural networks have taken the machine learning world by storm, and given the proliferation of big compute and big data storage platforms, it has become far easier to train deep learning models than it used to be. Frameworks such as Caffe, Keras, and TensorFlow continue to become more capable and more user-friendly as they evolve, with TensorFlow in particular gaining better integration with numerous systems programming languages and frameworks. This trend will continue, with a growing number of tested, published deep learning APIs available for integration into application software of various kinds.
  3. API based data product deployment: Data science operationalization has begun to happen through APIs and platforms. Organizations developing data product strategies increasingly take a platform view, integrating APIs for managing data or for scoring incoming data against machine learning models. With such APIs available for general use, it has become possible to compose many small microservice APIs into data products with very specific and diverse capabilities (a minimal sketch of a model-scoring service of this kind follows this list).
  4. A focus on value from data: Companies are increasingly looking past the big data hype and asking what value they can actually get from their data. They are focusing on better data collection and measurement processes, improved instrumentation, and careful qualification of their data management infrastructure investments. They are also seeking to equip their data science teams with the right approaches, tools and methods, so that they can get from data to insight faster. Several startups are doing pioneering work in governing the data science process itself, by bringing principles of agility, continuous integration and continuous deployment into the software solutions developed by data science teams.
  5. Automated data science and machine learning: Finally (and in many ways most importantly), automated data science and machine learning is a relatively new area of work that is gaining significant ground. Numerous startups and established organizations, The Data Team among them, are evaluating methods to automate key parts of the data science workflow itself. I foresee this automation gaining ground for some time yet, before it becomes an integral part of many data science workflows and development approaches. While a number of applications in this space are marketed as AI, the jury is still out on what is AI and what isn’t, as far as I and many of my colleagues are concerned.
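
To make the scoring-API pattern from point 3 concrete, here is a minimal sketch of such a microservice. It is not any particular vendor’s product: the model file name, feature layout and endpoint are all hypothetical, and it assumes a pre-trained scikit-learn model serialized with pickle and served with Flask.

```python
# Minimal sketch of a model-scoring microservice (hypothetical names throughout).
# Assumes a pre-trained scikit-learn model has been serialized to "model.pkl".
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once, at startup, so each request only pays for scoring.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON payload such as {"features": [[0.2, 1.7, ...], ...]}
    payload = request.get_json()
    predictions = model.predict(payload["features"]).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=5000)
```

A client, or another microservice, would then POST feature vectors to /score and combine the returned predictions with the outputs of other services; keeping each model behind its own small API is what allows data products to be assembled from many such narrowly scoped pieces.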

These are just some of the trends I’ve observed, of course, and from where you sit as a data scientist, you may be seeing many more. One thing is for sure – those who keep their knowledge and skills relevant in this fast-changing space will continue to be rewarded with interesting work and new opportunities.

Insights about Data Products

Data products are an inevitable culmination of the information age. With more information to process, and more data with which to build and validate mathematical models than ever before, the natural urge is to take a shot at solving some of the world’s problems that depend on data.

Data Product Maturity

There are some fundamental problems all data products aim to address:

  1. Large-scale mathematical model building was not possible before. In today’s world of Hadoop and R/Python/Scala, you can frame a very specific hypothesis and test it against data collected on a massive scale (see the sketch after this list).
  2. Large-scale validation of an idea was not possible before. Taking a step back from any single hypothesis, big data technologies and the ability to test hypotheses of many kinds ultimately help validate ideas.
  3. Data asymmetry problems can be addressed on a scale never seen before. Taking yet another step back from the ability to validate diverse ideas, the presence of such technologies and models puts power in the hands of decision makers like never before, by arming them with data.
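
As one illustration of point 1, here is a minimal sketch of testing a hypothesis against data at a scale a single machine could not comfortably handle. It assumes a Spark cluster and an entirely hypothetical Parquet dataset of user sessions, with an experiment_group column (“control” vs. “new_layout”) and a duration_seconds column.

```python
# Minimal sketch: hypothesis testing over a large dataset with PySpark.
# The dataset path, column names and group labels are hypothetical.
import math

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hypothesis-check").getOrCreate()

sessions = spark.read.parquet("hdfs:///data/sessions")  # hypothetical path

# Hypothesis: the "new_layout" group has a different mean session duration
# than the "control" group. Aggregate per group across the full dataset.
summary = (
    sessions.groupBy("experiment_group")
    .agg(
        F.avg("duration_seconds").alias("mean_duration"),
        F.stddev("duration_seconds").alias("sd_duration"),
        F.count("*").alias("n"),
    )
)

rows = {r["experiment_group"]: r for r in summary.collect()}
a, b = rows["control"], rows["new_layout"]

# Welch-style z statistic on the difference of group means; with very large
# groups, |z| > 1.96 is strong evidence against the null hypothesis.
z = (a["mean_duration"] - b["mean_duration"]) / math.sqrt(
    a["sd_duration"] ** 2 / a["n"] + b["sd_duration"] ** 2 / b["n"]
)
print("z statistic:", z)

spark.stop()
```

The heavy lifting (the group-by over the raw sessions) runs on the cluster; only the two summary rows come back to the driver, which is what makes this kind of validation practical at scale.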

[Figure: Data product maturity]

Being Data Driven: Enabling Higher Level Abstractions of Work

Cultivating a data-driven mindset is hard. I have blogged about this before. But when standard process workflows (think Plan-Do-Check-Act and Deming) are augmented by analytics, it is amazing what happens to “regular work”. The need to collect, sort and analyze data in a tireless, diligent and unbiased fashion gets delegated to a machine. People in the organization are no longer saddled with the mundane activities of data collection and management; their higher reasoning faculties are put to use instead – doing the data analysis that results in insight, and interpreting and reviewing the strategic outcomes. The higher levels of abstraction of work that data products enable help organizations and teams mature.

And this is the primary value that a lot of data products seem to add. The tasks that humans are either too creative for, or too easily bored by, get automated, and in the process the advantages of massive data collection and machine learning are brought to bear, producing a decision-making experience that truly eclipses prior generations of managers in the ability to work through complex decisions quickly.

Data Product Opportunities

Data products will become a driving force in industrializing third-world nations, and may become a key element of business strategy for the largest corporations. The levels of uncertainty in business today reflect the quality of the tools available and the leverage they bring. The open source movement has accelerated product development in areas such as web development and search technologies, and has made the internet the de facto medium of information for a generation of young people. Naturally, this generation will warm up to the data products available to them faster than previous generations did. Data products could improve the lives of millions by enabling the access economy.

[Figure: Data product opportunities]

While the action is generally in the upper right quadrant here, with companies fighting it out for more subscribers and catering to modern segments of industry that are more receptive to new ideas, the silent analytics revolution may actually happen in brick-and-mortar companies that have fewer subscribers and a more traditional mindset, or that operate in more traditional businesses. Wherever possible, companies are delivering value through digitization, but a number of services cannot be digitized in this way, and that is another enabling opportunity. The data products in this space may not attempt to replace the human, or the traditional value proposition. Instead, they can function in much the same way that IoT is disrupting enterprises. Embedded systems and technologies are certainly one aspect of the silent analytics revolution in the bottom left quadrant, which may feature heavy market fragmentation and entrenched business models that have not moved on from decades- or centuries-old ideas.