Questions that Data Scientists Hate Getting

This is a variation on a Quora answer.

When asked how data scientists can be effective, a few things come to mind:

  1. Skills: curiosity and sufficient skill in data analysis methods and techniques
  2. Fundamental needs: the data and access to the tools to perform analysis, including the environments in which that analysis happens
  3. Performance needs: sufficient resources, time and processes good enough to validate or invalidate hypotheses and build models based on them
  4. Excitement needs: sufficient support and latitude to independently deploy projects based on hypotheses successfully tested and models built

Note that while the criteria listed above begin with the fundamental skills required to do data science, the focus shifts in items 2, 3 and 4 to what is required for data scientists to be effective. The first of these is the fundamental needs, such as the data itself and access to the required tools, be they statistical or machine learning tools, databases, visualization libraries, or other resources. The second is the performance needs, which help the data scientist do whatever it is that they do a bit better than they do it now. This includes processes and systems that enable the data scientist to improve their own capabilities. Finally, we have excitement needs, which enable data scientists to do outstanding work; a large part of this is being able to reuse what has been built, through deployment of various kinds.

It is in this context that we can discuss how managers of data science teams can help them be effective.

If there is one kind of behaviour in analytics managers that I wish would change, it is the one I describe in the following lines.

A lot of what data scientists do is experimental, throw-away analysis. However, it is tempting for a number of managers (many of whom have already made up their minds that some hypothesis holds true, or will work) to assume that they’re right, and that all that is required from the data scientist is the detailed model that formalizes the relationship.

This kind of assumption makes for poorly designed projects, and doesn’t make good use of the data scientist’s time for exploratory analysis, for evaluating different kinds of models, and for finding out what works, given the dataset.

Naturally, given the time-bound nature of businesses and the poor understanding of analytics at the executive level in many organizations, such managers and clients are commonplace, and they often find themselves pushing for results without the right underlying systems, data or resources. Sometimes, they begin projects with data scientists who lack the specific skills to build the kinds of models required to solve the problems at hand. Whatever the cause, the challenge many data scientists in business and consulting face is dealing with such unreasonable expectations.

In this specific context, some questions that shouldn’t be posed to data scientists might be along the following lines:

  • “Assuming that hypothesis X works, how long would it take to build a full-fledged application around it?”
  • “The domain experts are convinced that hypothesis X is true. Why don’t your results reflect this too?”
  • “The R² or precision/recall values I see here don’t reflect what can be done with the data. Aren’t better results possible?”

These kinds of questions are simplistic in the initial stages of a data science activity or experiment, and in some situations they can be dangerous too (although they are innocent mistakes that any manager new to analytics initiatives may make).

For the same reason that “a little knowledge is a dangerous thing”, these project managers may be gambling with the fortunes of the entire analytics program they serve, because they base even large projects on such naive and unverified assumptions. Were they to change their behaviour, giving due consideration to exploratory data analysis and to what the data actually says about the viable models and applications that may be built, they would put their data scientists and engineers on the path to success.
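To make this concrete, here is a minimal sketch of the kind of cheap exploratory check that should precede any commitment to a full-fledged application. Everything here is hypothetical: the file, the outcome column, and the choice of a logistic regression baseline are placeholders, not details from any real project.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: a CSV with numeric feature columns and a binary 'outcome'.
df = pd.read_csv("project_data.csv")
X = df.drop(columns=["outcome"])
y = df["outcome"]

# A quick cross-validated baseline shows what the data supports
# *before* anyone promises a full-fledged application.
baseline = LogisticRegression(max_iter=1000)
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")

print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
# If this hovers near 0.5, hypothesis X deserves more scrutiny,
# not a project plan.
```

An hour spent on a check like this is what separates a designed project from a wager on an unverified assumption.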

Big Data: Size and Velocity

One of the changes envisioned in the big data space is the need for data that isn’t so much big in volume as big in relevance. Perhaps this is a crucial distinction to make. Here, we examine business manifestations of relevant data, as opposed to merely large volumes of data.

What Managers Want From Data

It is easier to ask the question “what do you want from your data?” of managers and executives than to answer it as one. As someone who has worked in Fortune 500 companies with teams that use data to make decisions, I’d like to share some insight into this:

  1. Managers don’t necessarily want to see data, even if they talk about wanting to use data for decision making. They instead want to see interpretations of data that help them make up their minds and take decisions.
  2. Decision making is rarely one-dimensional, and is rarely based on a single variable of interest.
  3. Decision making involves more than the operational data descriptors that are most often instrumented for collection from data sources.
  4. Decisions can be taken based on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making.

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, which means collecting more data and iterating on one’s results and hypotheses. Broadly, getting insights from data involves data preparation and analysis as intermediate stages between data collection and the generation of insight. This doesn’t mean that the data scientist’s job is done once the insights are generated: there is a need to collect more data, refine the models we’ve built, and construct better views of the problem landscape.
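This loop can be sketched in a few lines of Python; every function below is a hypothetical placeholder for a real pipeline stage, not an API from any particular library.

```python
def collect(sources):
    """Hypothetical stage: pull raw data from the given sources."""

def prepare(raw):
    """Hypothetical stage: clean, join and reshape the raw data."""

def analyse(dataset, hypotheses):
    """Hypothetical stage: test hypotheses and fit candidate models."""

def refine(hypotheses, insights):
    """Hypothetical stage: revise hypotheses in light of the insights."""

def is_satisfactory(insights):
    """Hypothetical stopping check: do the insights answer the question?"""

def data_to_insight(sources, hypotheses):
    # Not a straight line: each pass may send us back for more data
    # or for revised hypotheses.
    insights = None
    while insights is None or not is_satisfactory(insights):
        raw = collect(sources)
        dataset = prepare(raw)
        insights = analyse(dataset, hypotheses)
        hypotheses = refine(hypotheses, insights)
    return insights
```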

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with the broad area of data quality and the data’s readiness for analysis. Within data quality specifically, a few things stand out (a small sketch of such checks follows the list):

  1. Measurement aspects – whether the measured data really represents the state of the variable which was measured. This in turn involves other aspects such as linearity, stability, bias, range, sensitivity and other parameters of the measurement system
  2. Latency aspects – whether the measured data in time sequence is recorded and logged in the correct sequence and at the right intervals
  3. Missing and anomalous values – these are missing or anomalous readings/data records, as opposed to anomalous behaviour, which is a whole other subject.
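Here is a minimal sketch of such checks in pandas, assuming a hypothetical sensor log with a timestamp column ts and a reading column value; the file name, column names and thresholds are all illustrative.

```python
import pandas as pd

df = pd.read_csv("sensor_log.csv", parse_dates=["ts"])  # hypothetical file

# Latency aspects: are records in sequence and at the right intervals?
deltas = df["ts"].diff()
out_of_order = (deltas < pd.Timedelta(0)).sum()
gaps = (deltas > pd.Timedelta("5min")).sum()  # assumed logging interval

# Missing values, per column.
missing = df.isna().sum()

# Measurement aspects: readings outside the instrument's stated range
# (the bounds here are placeholders for the measurement system's spec).
LOW, HIGH = 0.0, 100.0
out_of_range = ((df["value"] < LOW) | (df["value"] > HIGH)).sum()

print(f"out-of-order records:  {out_of_order}")
print(f"gaps over 5 minutes:   {gaps}")
print(f"out-of-range readings: {out_of_range}")
print("missing values per column:", missing.to_dict())
```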

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain, and traditionally that slowest link has been the collection of the data itself. This is changing: sensors in IoT settings now deliver huge data sets and massive streams of data, and data processing frameworks have improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. In such a situation, the bottleneck is no longer the acquisition of data. Indeed, the availability of massive data lakes full of data signals the need for more and more data scientists who can analyse it and arrive at insights. It is in this context that the rapid creation of models, the rapid analysis of insights from data, and the creation of meta-algorithms that do such work become valuable.
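As one hedged illustration, here is a minimal PySpark Structured Streaming sketch. Spark’s built-in rate source stands in for a high-velocity sensor stream; the rates and window sizes are arbitrary placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; here it
# stands in for a stream of IoT sensor readings.
stream = (spark.readStream.format("rate")
          .option("rowsPerSecond", 1000)
          .load())

# Aggregate over short windows so results keep pace with arrival rates.
windowed = (stream
            .withWatermark("timestamp", "10 seconds")
            .groupBy(F.window("timestamp", "5 seconds"))
            .agg(F.count("*").alias("events"),
                 F.avg("value").alias("mean_value")))

query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination(30)  # run briefly; a real job would run indefinitely
```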

Where Size Does Matter

There are some problems that do require very large data sets, and many examples of systems that gain effectiveness only with scale. One such example is the collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets. Small data sets are prone to truth inflation (effect sizes that look impressive only because the sample was too small) and to poor analysis results stemming from poor data collection. In such cases, there is no respite other than to ensure we collect, store and analyze large data sets.
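A small simulation makes the truth-inflation point concrete. Assume a weak but real correlation of 0.1: repeated small samples produce estimates that swing far above the truth, and those inflated values are exactly what gets reported when only striking results survive. The sample sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
TRUE_R = 0.1  # a weak but real correlation

def sample_correlation(n):
    # Draw n points from a bivariate normal with correlation TRUE_R.
    cov = [[1.0, TRUE_R], [TRUE_R, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(x, y)[0, 1]

for n in (20, 200, 20000):
    estimates = [sample_correlation(n) for _ in range(1000)]
    print(f"n={n:>5}  spread={np.std(estimates):.3f}  "
          f"largest estimate={max(estimates):.2f}")

# With n=20, a single study can report a correlation several times the
# true value; with n=20000, estimates cluster tightly around 0.1.
```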

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve: whether volume or relevance matters more in a data set. Relevant data simply meets a number of the criteria discussed above, whereas data measured purely by volume (petabytes or exabytes) tells us nothing about its quality or its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes data relevant for analysis. Unsurprisingly, for some applications, such as neural network training or data science on high-frequency time series data sets, data volume is undeniably useful. More data in these cases means that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested on different representative sample sets of the data.
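One hedged sketch of why volume helps: with more rows, cross-validation can afford many more partitions while each held-out subset stays large enough to be representative. The synthetic data and fold counts below are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))  # synthetic stand-in for a large data set

# More data lets us take many partitions and still keep each held-out
# subset big enough to test a theory against a representative sample.
for n_splits in (5, 20, 100):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    test_size = next(len(test) for _, test in kf.split(X))
    print(f"{n_splits:>3} folds -> ~{test_size} rows held out per fold")
```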

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety. Some discuss a fourth characteristic: the veracity of the data. Overall, the availability of relevant data in sufficient volumes should address the data scientist’s needs for model building and data exploration. The question of variety remains, but as data profiling approaches mature, data wrangling will advance to the point where variety is no longer a problem but a genuine asset. On volume and velocity, however, there is definitely room for a trade-off in a large percentage of cases. For most model building activities, such as linear regression or classification models where we know the characteristics or behaviour of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
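As a hedged illustration of the small-data point, the sketch below fits the same linear regression on a small representative sample and on the full (simulated) data set; for a well-behaved relationship, the coefficients barely differ. The data, sizes and coefficients are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
TRUE_COEFS = np.array([2.0, -1.0, 0.5])

# A simulated "population" with a known linear relationship plus noise.
X_full = rng.normal(size=(1_000_000, 3))
y_full = X_full @ TRUE_COEFS + rng.normal(scale=0.5, size=len(X_full))

# A small but representative random sample of 500 rows.
idx = rng.choice(len(X_full), size=500, replace=False)

big = LinearRegression().fit(X_full, y_full)
small = LinearRegression().fit(X_full[idx], y_full[idx])

print("full-data coefficients:", np.round(big.coef_, 2))
print("500-row coefficients:  ", np.round(small.coef_, 2))
# With a representative sample and a stable model, "small data"
# recovers nearly the same coefficients as the million-row set.
```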

Data Perspectives: “Orbiting The Giant Hairball”

This may sound weird, but one sure way to lose perspective about the business in an innovative and constantly changing industry is to bury yourself in regular work. This is the meaning of the title, which comes from Gordon MacKenzie’s book of the same name.

By regular work, I mean work in which you execute tasks with a view to minimizing variability and producing standard results. This is as opposed to innovative work, which, as Bob Sutton explains in his lectures, is characterised by an increase of variability to the point of failure. Failure and validated learning are essential aspects of the learning experience in any job, to extend a metaphor from Eric Ries’ book The Lean Startup.

Data science and data engineering are the truly cross-functional and cross-industry work areas within the analytics revolution that is under way right now. There are a number of business perspectives that are relevant in one industry, which can also be applied to another. Indeed, work in some industries can anticipate very closely the needs of another.

Data scientists should keep one eye on the business, or, to be true to the metaphor here, should occasionally “dive into the hairball” of business and routine work, to get a glimpse of what’s happening in the world of work. The data perspectives they bring to that conversation will then become as important as the perspectives they develop through such experiences. Seasoned professionals and consultants in the data analytics industry may have consciously or unconsciously developed their cross-functional and cross-industry experience over years. But it is probably fitting for younger data professionals, and there are many of them out there, to occasionally “dive into the hairball from orbit” and understand the challenges of data for those in various walks of business.