Questions that Data Scientists Hate Getting

This is a variation on a Quora answer.

When I'm asked how data scientists can be effective, a few things come to mind:

  1. Skills: curiosity, and sufficient skill in data analysis methods and techniques
  2. Fundamental needs: the data itself, and access to the tools and environments needed to perform analysis
  3. Performance needs: sufficient resources, time and processes good enough to validate or invalidate hypotheses and to build models based on them
  4. Excitement needs: sufficient support and latitude to independently deploy projects based on hypotheses that tested successfully and the models built from them

Note that while the criteria listed above begin with the fundamental skills required to do data science, the focus shifts in items 2, 3 and 4 to what is required for data scientists to be effective. The first of these are the fundamental needs, such as the data itself and access to the required tools, be they statistical or machine learning tools, databases, visualization libraries, or other resources. The second of these are the performance needs, which help data scientists do their work a little better than they do it now; this includes processes and systems that enable them to improve their own capabilities. Finally, we have excitement needs, which enable data scientists to do outstanding work; a large part of this is being able to reuse what has been built, through deployment of various kinds.

It is in this context that we can discuss how managers of data science teams can help them be effective.

If there is one kind of behaviour in analytics managers that I wish would change, it is the one I describe below.

A lot of what data scientists do is experimental, throw-away analysis. However, it is tempting for a number of managers (many of whom have already made up their minds that some hypothesis holds true, or will work) to assume that they are right, and that all that is required from the data scientist is the detailed model that formalizes the relationship.

This kind of assumption makes for poorly designed projects, and makes poor use of the data scientist's time for exploratory analysis, for evaluating different kinds of models, and for finding out what works, given the dataset.
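Much of this exploratory work can be kept deliberately cheap. As a rough sketch (the dataset, column names and candidate models below are hypothetical, not from any specific project), a data scientist might screen a few model families with cross-validation before anyone commits to a detailed build:

```python
# A minimal sketch of exploratory model screening; "observations.csv" and
# the "outcome" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("observations.csv")
X, y = df.drop(columns=["outcome"]), df["outcome"]

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "random forest": RandomForestClassifier(n_estimators=100),
}

# Cheap, throw-away comparison: find out what works on this dataset
# before committing to one "detailed model".
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The point of such a screen is not the final number; it is to let the data veto or support a hypothesis before a whole project is designed around it.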

Naturally, given the time-bound nature of businesses and the poor understanding of analytics at the executive level in many organizations, such clients are commonplace, and such managers often find themselves pushing for results without the right underlying systems, data or resources. Sometimes they begin projects with data scientists who lack the specific skills to build the kinds of models the problem requires. Whatever the cause, the challenge many data scientists in business and consulting face is dealing with such unreasonable expectations.

In this specific context, some questions that shouldn’t be posed to data scientists might be along the following lines:

  • “Assuming that hypothesis X works, how long would it take to build a full-fledged application on top of it?”
  • “The domain experts are convinced that this hypothesis X is true. Why don’t your results reflect this too?”
  • “The values of R² or precision/recall I see here don’t reflect what can be done with the data. Aren’t better results possible?”

These kinds of questions are simplistic in the initial stages of a data science activity or experiment, and in some situations they can be dangerous too (although they are innocuous mistakes that any manager new to analytics initiatives may make).
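The third question, in particular, assumes that better numbers are always there for the taking. A toy simulation (with entirely invented numbers) shows why they may not be: the noise in the data puts a ceiling on metrics like R², and even a correctly specified model cannot beat it.

```python
# Toy illustration: the data itself caps the achievable R-squared.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(scale=4.0, size=1000)  # heavy irreducible noise

# Fit the *true* functional form: a straight line.
model = LinearRegression().fit(x, y)
print("R^2 of the correctly specified model:", round(r2_score(y, model.predict(x)), 3))
# Prints roughly 0.65-0.70; the remaining variance is noise, not signal,
# and no amount of modelling skill will recover it.
```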

For the same reason that “a little knowledge is a dangerous thing”, these project managers may be gambling with the fortunes of the entire analytics program they serve, because they base even large projects on such naive and unverified assumptions. Were they to change their behaviour, giving due consideration to exploratory data analysis and to what the data actually says about the viable models and applications that may be built, they would put their data scientists and engineers on the path to success.

Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I’d authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored on the data science road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors one is using is akin to “data science suicide”: a sure-shot road to perdition as a data scientist. Domain knowledge is also hard to acquire, especially for data scientists applying their skills in a consultative, short-term setting. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering setup or a new firm. A data scientist is nobody if not capable of learning new things, and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for useful models, not more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril; hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful”, and this couldn’t be more true than when dealing with models built from hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, there is a constant debate on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, careless use of customer data presents a huge risk. Simpler models are easier to explain, and are arrived at when we accumulate sufficient domain knowledge and test enough hypotheses. With simpler models, it is easier to explain what data needs to be collected, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision may be more effective and more scrupulous than automated feature engineering. We live in a world where AutoML and robotic data science are often discussed in the context of machine intelligence and of speeding up insight generation from data. However, for some applications it may be better, in the short term, to ensure that feature engineering happens through human hands; a minimal sketch of what this might look like follows this list. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer-term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning has (justifiably) seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for all data analysis. The end goal of working with data is the generation of value, be it for a customer or for society at large. There are many ways to achieve this, and deep learning is just one approach.
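Here is the sketch promised in point 6: entirely hypothetical, hand-authored feature engineering, where every input file and column name is invented, and the value lies in each feature being deliberate, documented and defensible:

```python
# A minimal sketch of hand-authored feature engineering on customer data;
# "transactions.csv" and all column names are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")
features = pd.DataFrame(index=df.index)

# Recency: days since the last purchase, a feature we can explain to a customer.
features["days_since_last_purchase"] = (
    pd.Timestamp("2019-01-01") - pd.to_datetime(df["last_purchase_date"])
).dt.days

# Spend, deliberately coarsened into bands so the model never sees exact amounts.
features["spend_band"] = pd.cut(
    df["monthly_spend"],
    bins=[0, 50, 200, 1000, float("inf")],
    labels=["low", "mid", "high", "very_high"],
)

# Note what is *not* here: no auto-generated interaction terms, and no fields
# that would be hard to justify collecting in the first place.
```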

I’m not discussing the many technical aspects of building explainable models here. These aspects are contextual and depend on the situation; moreover, the tone of this post and of the tweets is deliberately light, to encourage discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you liked something in this post, or want to share any other related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.

Hypothesis Generation: A Key Data Science Challenge

Data scientists are new-age explorers. Their field of exploration is rife with data from various sources. Their methods are mathematics, linear algebra, computational science, statistics and data visualisation. Their tools are programming languages, frameworks, libraries and statistical analysis tools. And their rewards are stepping stones to better understanding and insights.

The data science process for many teams starts with data summaries, visualisation and data analysis, and ends with the interpretation of analysis results. However, in today’s world of rapid data science cycles, it is possible to do much more, if we take a hypothesis-centred approach to data science.

Theories for New Age Raconteurs

Data scientists work with data sets small and large, and are tellers of stories. These stories have entities, properties and relationships, all described by data. Their apparatus and methods give data scientists opportunities to identify, consolidate and validate hypotheses with data, and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a key challenge for data scientists; together with hypothesis refinement, it constitutes the very purpose of data analysis and data science.

Hypothesis generation for a data scientist can take numerous forms, such as:

  1. They may be interested in the properties of a certain stream of data or a certain measurement. These properties and their default or exceptional values may form the basis of a hypothesis (a minimal sketch of this form follows the list).
  2. They may be keen on understanding how a certain measure has evolved over time. In trying to understand this evolution of a system’s metric, or a person’s behaviour, they could rely on a mathematical model as a hypothesis.
  3. They could consider the impact of some properties on the states of systems, interactions and people. In trying to understand such relationships between different measures and properties, they could construct machine learning models of different kinds.
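The first of these forms can be made concrete with a small sketch, as promised above; the data file, the nominal value and the significance threshold below are all hypothetical:

```python
# A minimal sketch of a hypothesis about a measurement's typical value;
# "sensor_readings.txt" and the nominal value 75.0 are hypothetical.
import numpy as np
from scipy import stats

readings = np.loadtxt("sensor_readings.txt")

# Hypothesis: the sensor's readings are centred on the nominal value 75.0.
t_stat, p_value = stats.ttest_1samp(readings, popmean=75.0)

if p_value < 0.05:
    print(f"Reject the hypothesis (p = {p_value:.4f}): readings deviate from nominal.")
else:
    print(f"No evidence against the hypothesis (p = {p_value:.4f}).")
```

Even a sketch this small forces the hypothesis to be stated precisely, which is most of its value.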

Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system behaviour and represent such behaviour in a manner that’s tangible and tractable based on simple, explicable rules. This makes story-telling easier for data scientists when they become new-age raconteurs, straddling data visualisations, dashboards with data summaries and machine learning models.

Developing Nuanced Understanding

The importance of hypothesis generation in data science teams is manifold:

  1. Hypothesis generation allows the team to experiment with theories about the data
  2. Hypothesis generation can allow the team to take a systems-thinking approach to the problem to be solved
  3. Hypothesis generation allows the team to build more sophisticated models based on prior hypotheses and understanding

When data science teams approach complex projects, some of them may be prone to diving right into building complex systems based on the available resources, libraries and software. By taking a hypothesis-centred view of the data science problem, they can build up complexity and nuanced understanding in a natural way, accumulating hypotheses and ideas in the process.