What Could Data Scientists (And Data Science Managers) Be Doing Better in 2019?

The “data science” job description is becoming more and more common, as of early 2019.

The field has not only garnered a great deal of interest from software developers, statisticians and machine learning practitioners, but has also, over the years, attracted plenty of interest from people in roles such as strategy, operations, sales and marketing. Product designers and manufacturing and customer service managers are also turning to data science talent to help them make sense of their businesses and processes, and to find new ways to improve.

The Data Science Misinformation Challenge

The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common-sense, reasonable starting points for seeking data science talent and beginning analytical programs in organizations. The problem starts with the limited availability of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to the media’s fulminations around sometimes disconnected value propositions – chat bots, artificial intelligence agents, machine learning and big data – these terms have come to be lumped together with data science, purely because of the similarity of the notions, or of some of the skills required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from describing automated machine learning as “Building AI that can build AI” (NYT), to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, as I expect 2019 will also be to some extent. I have dealt with this topic extensively in an earlier post. What I take issue with, naturally, is the fact that this serves to misinform business teams about what’s really important.

Managing Data Science Better

Astute business leaders build analytical programs where they don’t put the cart before the horse. By this, I mean the following things:

  1. They have not merely a data strategy, but a strategy for labelled data
  2. They start with small problems, not big, all-encompassing problems
  3. They grow data science capabilities within the team
  4. They embrace visualization methods and question black box models
  5. They check for actual business value in data science projects
  6. They look for ways to deploy models, not merely build throw-away analyses

Data science and analytics managers ought not to:

  1. Perpetuate hype and spread misinformation without research
  2. Set expectations based on such hype around data science
  3. Assume solutions are possible without due consideration
  4. Neglect to budget for subject matter experts
  5. Neglect to train their staff while still expecting better results

As obvious as the above may sound, these mistakes are all too common in the industry. Then there is the problem of consultants who sometimes keep the hype train going, thereby reinforcing some of these behaviors.

Doing Data Science Better

Now let’s look at some of the things data scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others have to do with the tools and infrastructure provided to data scientists in companies. Some have to do with preferences, while others have to do with processes. I find many common practices among data science professionals to be problematic. Some of these are:

  1. Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general
  2. Not being aware of how some of the details of algorithms work – and not bothering to learn this even after several projects where their shortcomings are highlighted
  3. Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling
  4. Not visualizing data before attempting to build models from them
  5. Assuming things about the problem solving approach they should take, without basing this on EDA results
  6. Not differentiating between the unique characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient, compared to others

Some of these can be sorted out by asking critical questions such as the ones below (which may overlap to some extent with the activities listed above):

  1. Where did the data come from?
  2. How was the data measured?
  3. Was the data altered in any way, and if so, how?
  4. How will the insights be consumed?
  5. What user experience does the analytics consumer require?
  6. Does the solution have to scale?
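
To make the point about exploratory analysis concrete, here is a minimal sketch of the kind of basic summary one might run on a single numeric column before any modeling. The function name `summarize`, the sample `readings` and the 1.5 × IQR outlier rule are illustrative assumptions (the IQR rule is just one common convention), not a prescribed method.

```python
import statistics

def summarize(values):
    """Basic exploratory summary for one numeric column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (needs at least two data points).
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical sensor readings with one missing value and one suspicious spike.
readings = [10.1, 9.8, 10.3, None, 10.0, 9.9, 42.0, 10.2]
summary = summarize(readings)
print(summary)
```

A mean that sits far from the median, or a non-empty outlier list, is a cue to go back to the questions about where the data came from and how it was measured before fitting anything.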

This is just a select list, and I’m sure that contextually, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager) and to plan and structure data science work (as a data scientist).

Pervasive Trends in Big Data and Data Science

As of mid-2017, I’ve spent almost two years in the big data analytics and data science world, coming from 13 years of diverse work experience in engineering and management prior. Starting from a professional curiosity, it has taken me a while to develop some data science and engineering skills and hone key skills among these as a data scientist. Along the way, I’ve had a chance to learn core software development methods and principles, stay in touch with the latest in the field, challenge my existing knowledge of product development methodologies and processes, and learn more about data analysis, statistics and machine learning than I started out with in 2015. Along with the constant learning, I’ve had a chance to observe a few pervasive trends in the big data and analytics worlds, which I wish to share here.

  1. Cloud infrastructure penetration: Undoubtedly the biggest beneficiaries of the data and analytics revolution have been cloud service providers. They’re also stretched thin, with pressure to reduce costs, massive competition, and the need for value-added services of various kinds (big compute and API support, along with big storage, for instance) to be available alongside the core cloud offerings that companies are lapping up for their data management needs. Security concerns continue to exist, and one of the biggest security issues actually came from the US’ leading cloud service provider, Amazon Web Services. Despite this, many industries, even those that consider data security paramount, wish to adopt cloud infrastructures, because of the reduced cost of operation and the scalability inherent in cloud platforms.
  2. Deep learning adoption: Generalized learning algorithms based on neural networks have taken the machine learning world by storm, and given the proliferation of big compute and big data storage platforms, it has become easier to train deep learning models than in the past. Existing frameworks continue to offer better algorithms as they evolve, and there is now a more user-friendly ecosystem of frameworks out there, such as Caffe, Keras and TensorFlow (which has become easier to use and better integrated with numerous systems programming languages and frameworks). This trend will continue, with several tested and published DL APIs available for integration into application software of various kinds.
  3. API based data product deployment: Data science operationalization has begun to happen through APIs and platforms. Organizations that are developing data product strategies are increasingly considering platform views and integrating APIs for managing data, or for scoring incoming data based on machine learning models. With the availability of such APIs for general use, it has become possible to integrate many such microservice APIs to build data products with very specific and diverse capabilities.
  4. A focus on value from data: Companies are looking past the big data hype more often these days, and are looking at what value they can get from the data. They’re focusing on better data collection and measurement processes, improved instrumentation and qualifying their data management infrastructure investments. They’re also seeking to enable their data science teams with the right approaches, tools and methods, so that they can get from data to insight faster. Several startups are also doing pioneering work in governing the data science process, by integrating principles of agility, continuous integration and continuous deployment into software solutions developed by data science teams.
  5. Automated data science and machine learning: Finally (and in many ways, most importantly), automated data science and machine learning is a relatively new area of work which is gaining ground significantly. Numerous startups and established organizations are evaluating methods to automate key parts of the data science workflow itself, with The Data Team among them. Such automation of data science is a trend that I foresee will gain ground for some more time, before it becomes an integral part of many data science workflows and development approaches. While a number of applications that straddle this space are referred to as AI, the jury is still out on what AI is and what it isn’t, as far as many of my colleagues and I are concerned.

These are just some of the trends I’ve observed, of course, and from where you are, as a data scientist, you may be seeing much more. One thing is for sure – those who continue to keep their knowledge and skills relevant in this fast-changing space will continue to be rewarded with interesting work and new opportunity.

Big Data: Size and Velocity

One of the changes envisioned in the big data space is that there is the need to receive data that isn’t so much big in volume, as big in relevance. Perhaps this is a crucial distinction to make. Here, we examine business manifestations of relevant data, as opposed to just large volumes of data.

What Managers Want From Data

It is easier to ask managers and executives the question “what do you want from your data?” than to answer it as one. As someone who has worked in Fortune 500 companies with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data, even if they talk about wanting to use data for decision making. Instead, they want to see interpretations of data that help them make up their minds and take decisions.
  2. Decision making is not monotonous or based on single variables of interest.
  3. Decision making involves not only operational data descriptors (which are most often instrumented for collection from data sources)
  4. Decisions can be taken based on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, and this means collecting more data, and iterating on one’s results and hypotheses. Broadly, the process of getting insights from data may involve data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean that the data scientist’s job is done once the insights are generated. There is a need to collect more data and refine the models we’ve built, and construct better views of the problem landscape.

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality, and the data’s readiness for analysis. Specific to data quality, some things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable which was measured. This in turn involves other aspects such as linearity, stability, bias, range, sensitivity and other parameters of the measurement system
  2. Latency aspects – whether the measured data in time sequence is recorded and logged in the correct sequence and at the right intervals
  3. Missing and anomalous values – these are missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject.
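
The latency and missing-value aspects above lend themselves to simple automated checks. The following is a rough sketch under stated assumptions: records arrive as `(timestamp_seconds, value)` pairs in logged order, `None` marks a missing value, and `check_time_series` and the sample `log` are hypothetical names made up for illustration.

```python
def check_time_series(records, expected_interval):
    """Flag sequencing, interval and missing-value problems in logged readings.

    records: list of (timestamp_seconds, value) in the order they were logged.
    Returns the indices of the offending records, grouped by issue type.
    """
    issues = {"out_of_order": [], "gaps": [], "missing": []}
    for i, (ts, value) in enumerate(records):
        if value is None:
            issues["missing"].append(i)
        if i == 0:
            continue  # nothing to compare the first record against
        prev_ts = records[i - 1][0]
        if ts <= prev_ts:
            issues["out_of_order"].append(i)  # logged out of time sequence
        elif ts - prev_ts > expected_interval:
            issues["gaps"].append(i)  # interval longer than expected
    return issues

# Hypothetical log sampled every 10 seconds.
log = [(0, 1.0), (10, 1.1), (20, None), (15, 1.2), (40, 1.3), (50, 1.2)]
issues = check_time_series(log, expected_interval=10)
print(issues)
```

Checks like these say nothing about the measurement aspects in the first point (bias, linearity, stability), which generally need a study of the measurement system itself rather than of the logged data.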

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain. Traditionally, this slowest link has been the collection of the data itself. However, this is changing, with sensors in IoT settings delivering huge data sets and massive streams of data by themselves. In such a situation, the bottleneck is no longer the acquisition of data. Data processing frameworks have also improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. Indeed, the availability of massive data lakes holding lots of data itself signals the need for more and more data scientists who can analyse this data and arrive at insights from it. It is in this context that the rapid creation of models, the analysis of insights from data, and the creation of meta-algorithms that do such work are valuable.

Where Size Does Matter

There are some problems which do require very large data sets. There are many examples of systems that gain effectiveness only with scale. One such example is the commonly found collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets. Small data sets are prone to truth inflation and poor analysis results from poor data collection. In such cases, there is no respite other than to ensure we collect, store and analyze large data sets.
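
The truth inflation mentioned above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a true effect of 0.2, normally distributed noise with a known standard deviation of 1, and a simple z-test cut-off of 1.96 standard errors. The function name and the parameters are made up for illustration.

```python
import random
import statistics

def mean_significant_estimate(true_effect, n, trials, rng):
    """Average only those estimates that pass a z-test cut-off (known sd = 1)."""
    threshold = 1.96 / n ** 0.5  # estimate must exceed 1.96 standard errors
    significant = []
    for _ in range(trials):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        estimate = statistics.fmean(sample)
        if estimate > threshold:
            significant.append(estimate)
    return statistics.fmean(significant) if significant else None

rng = random.Random(0)  # fixed seed so the run is repeatable
small = mean_significant_estimate(true_effect=0.2, n=10, trials=1000, rng=rng)
large = mean_significant_estimate(true_effect=0.2, n=400, trials=1000, rng=rng)
print(f"true effect 0.2 | n=10: {small:.2f} | n=400: {large:.2f}")
```

With only 10 observations per study, a result can only clear the cut-off if the estimate is several times larger than the true effect, so the "significant" findings are systematically exaggerated. With 400 observations, the significant estimates sit close to the true value.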

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve – whether volume is more important in data sets, or whether relevance is. Relevant data simply meets a number of the criteria listed above, whereas data that is measured purely by volume (petabytes or exabytes) gives us no idea of its quality or its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes the data relevant for analysis. Unsurprisingly, for some applications such as neural network training, data science on time series data sets of high frequency, etc., data volume is undeniably useful. More data in these cases implies that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested out on different representative sample sets of the data.

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety. Some discuss a fourth characteristic – the veracity of the data. Overall, the availability of relevant data in sufficient volumes should be able to address the needs of the data scientist for model building and for data exploration. The question of variety still remains, and as data profiling approaches mature, data wrangling will advance to a point where this variety is no longer a problem, but a genuine asset. However, the volume and velocity points are definitely open to a trade-off in a large percentage of cases. For most model building activities, such as linear regression models, or classification models where we know the characteristics or behavior of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.

Insights about Data Products

Data products are one inevitable result and culmination of the information age. With enough information to process, and with enough data to build massively validated mathematical models like never before, the natural urge is to take a shot at solving some of the world’s problems that depend on data.

Data Product Maturity

There are some fundamental problems all data products aim to address:

  1. Large scale mathematical model building was not possible before. In today’s world of Hadoop and R/Python/Scala, you can build a very specific kind of hypothesis and test it using data collected on a massive scale
  2. Large scale validation of an idea was not possible before. Taking a step back from the hypothesis itself, the presence of big data technologies and the ability to test hypotheses of various kinds ultimately helps validate ideas
  3. Data asymmetry problems can be addressed on a scale never seen before. Taking yet another step back from the ability to validate diverse ideas, the presence of such technologies and models allows us to put power in the hands of decision makers like never before, by arming them with data.

[Figure: data product maturity]

Being Data Driven: Enabling Higher Level Abstractions of Work

Cultivating a data-driven mindset is hard. I have blogged about this before. But when the standard process workflows (think Plan-Do-Check-Act and Deming) are augmented by analytics, it is amazing what happens to “regular work”. The need to collect, sort and analyze data in a tireless, diligently consistent and unbiased fashion gets delegated to a machine. The human beings in the organization are no longer burdened with the mundane activities of data collection and management. Instead, their powers are put to use by leveraging higher reasoning faculties – to do the data analysis that results in insight, and to interpret and review the strategic outcomes. The higher levels of abstraction of work that data products enable help organizations and teams mature.

And this is the primary value addition that a lot of data products seem to bring. The tasks that humans are either too creative for (or too easily bored by) get automated, and in the process, the advantages of massive data collection and machine learning are leveraged to bring about a decision making experience that truly eclipses that of prior generations of managers in the ability to get through complex decisions fast.

Data Product Opportunities

Data products will become a driving force for industrializing third world nations, and may become a key element of the business strategy of the largest of the large corporations. The levels of uncertainty in business today echo the quality of tools available, and the leverage that this brings. The open source movement has accelerated product development teams in areas such as web development and search technologies, and has made the internet the de facto medium of information for a lot of youngsters. Naturally, these youngsters will warm up to the data products available to them faster than previous generations did. Data products could improve the lives of millions by enabling the access economy.

[Figure: data product opportunities, shown as a quadrant chart]

While the action is generally in the upper right quadrant here, with companies fighting it out for more subscribers and catering to modern segments of industry that are more receptive to ideas, the silent analytics revolution may actually happen in brick and mortar companies that have fewer subscribers and either have a more traditional mindset or operate in a more traditional business. Wherever possible, companies are delivering value by digitization, but a number of services cannot be digitized in this way, and here is another enabling opportunity. The data products in this space may not attempt to replace the human, or replace the traditional value proposition. Instead, they can function in much the same way IoT is disrupting enterprises. Embedded systems and technologies are definitely one aspect of the silent analytics revolution in the bottom left quadrant, which may have large market fragmentation and entrenched business models that haven’t moved on from decades-old or centuries-old ideas.

Data Perspectives: “Orbiting The Giant Hairball”

This may sound weird, but one sure way to not have perspective about the business in an innovative and constantly changing industry is to bury yourself within regular work. This is the meaning of the title – which comes from a book of the same name.

By regular work, I mean work in which you execute tasks with a view to minimizing variability and getting standard results. This is as opposed to innovative work, which, as Bob Sutton explains in his lectures, is characterised by an increase of variability to the point of failure. Failure and validated learning are essential aspects of the learning experience in any job, to extend a metaphor from Eric Ries’ book The Lean Startup.

Data science and data engineering are truly cross-functional and cross-industry work areas within the analytics revolution that is under way right now. There are a number of business perspectives that are relevant in one industry, which can also be applied to another. Indeed, work in some industries can anticipate very closely the needs of another.

Data scientists should keep one eye on the business, or to be true to the metaphor here, should occasionally “dive into the hairball” of business and routine work, to get a glimpse of what’s happening in the world of work. The data perspectives that they bring to that conversation will then become as important as the perspectives they develop due to such experiences. Seasoned professionals and consultants in the data analytics industry may have unconsciously or consciously developed their cross-functional and cross-industry experience over the years. But it is probably fitting for younger data professionals – and there are many of them out there – to occasionally “dive into the hairball from orbit” and understand the challenges of data for those in various walks of business.

The “Jagged Edge” of Real Time Analytics

I recently came across a Hortonworks DataFlow presentation in which the concept of the jagged edge of real time data analytics is discussed. To me, the context that suits a discussion of this is centred around prioritization of what comes in through the sensors (or other big data gathered from these “jagged edge” sources). This is a pondering, rather than a post with a specific agenda; perhaps I will add to it as responses come in, or as I learn more.

One of the key challenges for a lot of data centres in the future, in the world of the Internet of Things, will be to provide relevant data analytics, regardless of the size, point of origin, or time of generation of the data. In order to do this well, I foresee not only technologies like NiFi providing low latency updates as close to real time as possible, but also a need for technologies that have sufficient intelligence built into the data sampling and data collection process.

The central philosophy is surprisingly old – W. Edwards Deming said that data are not collected for museum purposes, but for decision making. And it is in this context that we see the data lakes of today transforming into the data seas of tomorrow, unless we use intelligent prioritization to determine what data should be streamed, and what data should be stored. The reality of decision making in such real time analytics situations is the availability of too much data to make one decision – which reminds me of Barry Schwartz’s paradox of choice – that too much choice can actually impede and delay decision making, rather than aid it.
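
As a toy illustration of what prioritizing between streaming and storing could look like, the sketch below streams only the readings that deviate most from an expected baseline and sends the rest to storage. Everything here is an assumption made for illustration: the function `split_stream`, the deviation-from-baseline priority score and the sample `sensor` values. It does not describe how NiFi or any other product actually works.

```python
import heapq

def split_stream(readings, baseline, k):
    """Return (streamed, stored): the k readings furthest from baseline are streamed."""
    # Priority score: absolute deviation from the expected baseline (an assumed rule).
    scored = [(abs(value - baseline), idx) for idx, value in enumerate(readings)]
    top = {idx for _, idx in heapq.nlargest(k, scored)}
    streamed = [v for i, v in enumerate(readings) if i in top]
    stored = [v for i, v in enumerate(readings) if i not in top]
    return streamed, stored

# Hypothetical temperature readings around a baseline of 20.0.
sensor = [20.1, 20.0, 35.7, 19.9, 20.2, 5.3, 20.0]
streamed, stored = split_stream(sensor, baseline=20.0, k=2)
print("stream now:", streamed)
print("store for later:", stored)
```

A real system would score readings as they arrive rather than in a batch, and the scoring rule itself would depend on the decision the data is meant to support.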

How do we ensure that approaches like data flow prioritization allow us to address these issues? How do we move away from static data lakes to the more useful data streams from the jagged edge, which data streaming technologies like Spark promise on top of established frameworks like Hadoop, without the risk of turning our lakes into seas that we lack the insight to fully benefit from?

Some of the answers lie in the implementations of specific use cases, of course. There is no silver bullet, if you will. That said, what kinds of technologies can we foresee being developed for Hadoop, Spark and other technologies that will heavily influence the internet of things revolution, to solve this prioritization conundrum?