Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored on the data science road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors one is using for these models is akin to “data science suicide”. Domain knowledge is also hard to acquire for data scientists, especially those who work as consultants and apply their skills in short-term settings. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering setup or a new firm. A data scientist is nothing if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for useful models, not more complex ones. Data scientists ignore the hypotheses that come from such discussions at their own peril. Hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and this couldn’t be more true than when dealing with models built from hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, a debate constantly exists on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge, and test enough hypotheses. With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision and care may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and RoboticDataScience are often discussed in the context of machine intelligence and speeding up the process of insight generation from data. However, for some applications, it may be a better idea in the short term to ensure that the feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning (justifiably) has seen a great deal of hype in the recent past. However, it cannot be seen as a panacea to all data analysis. The end goal from data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.

I’m not discussing the many technical aspects of building explainable models here. For one, these aspects are contextual and depend on the situation. Additionally, the tone of the post and the tweets is lighter, to encourage a discussion and to welcome beginner data scientists to it. Hence my omission of these (important) topics.

If you like something in this post, or want to share any related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.

Why Do I Love Data Science?

This is an interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a starting point such as a visualization for the hypothesis generation process lets us go into model building a little more confidently. One specific example is a time series analysis I did on energy system data, where aggregate statistical measures and distribution fitting showed only arbitrary and complex patterns in the data. Time-ordered visualizations helped me formulate the hypothesis correctly, and allowed me to build an explanatory model of the system.
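To make the point about time order concrete, here is a minimal sketch on made-up data (it is not the energy system analysis described above). It builds a hypothetical hourly load series with a daily cycle, then shuffles a copy of it. The two series have exactly the same values, so any aggregate statistic or fitted distribution treats them alike, yet only the time-ordered one has structure. The series, the `lag1_autocorrelation` helper and all constants are assumptions chosen for illustration.

```python
import math
import random
import statistics


def lag1_autocorrelation(values):
    """Correlation between each reading and the reading that follows it."""
    mean = statistics.fmean(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(0)

# Hypothetical hourly load over 14 days: a daily cycle plus a little noise.
ordered = [
    50 + 20 * math.sin(2 * math.pi * hour / 24) + rng.gauss(0, 2)
    for hour in range(24 * 14)
]

# Same readings, with the time order destroyed.
shuffled = ordered[:]
rng.shuffle(shuffled)

# Aggregate view: the two series are indistinguishable.
print("mean :", round(statistics.fmean(ordered), 2), round(statistics.fmean(shuffled), 2))
print("stdev:", round(statistics.stdev(ordered), 2), round(statistics.stdev(shuffled), 2))

# Time-ordered view: only the original series shows hour-to-hour structure.
print("lag-1 autocorrelation (ordered) :", round(lag1_autocorrelation(ordered), 2))
print("lag-1 autocorrelation (shuffled):", round(lag1_autocorrelation(shuffled), 2))
```

A line plot of `ordered` against the hour index would show the daily cycle at a glance, which is the kind of starting point for a hypothesis that a histogram of the same values cannot give.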

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns and the exceptions to them. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of customer transaction data between a customer and a client. These log data revealed unusual patterns that those steeped in the process could recognise, but couldn’t quantify. By finding typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as “feature engineering” in data science and machine learning, was the key step. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
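As an illustration of session-specific metrics, the sketch below collapses raw log events into per-session features, averages them per customer, and flags customers who sit far from the typical value. Everything in it is hypothetical: the log layout, the events-per-minute metric, the function names, and the choice of a median-based (robust) z-score with a 3.5 cut-off are assumptions for the example, not the method used on the project described above.

```python
from collections import defaultdict
import statistics


def make_session(customer, session, n_events, gap_seconds):
    """Hypothetical log records: (customer_id, session_id, seconds since session start)."""
    return [(customer, session, i * gap_seconds) for i in range(n_events)]


def session_features(log):
    """Collapse raw events into per-session metrics: event count and duration."""
    times = defaultdict(list)
    for customer, session, t in log:
        times[(customer, session)].append(t)
    return {
        key: {"events": len(ts), "duration": max(ts) - min(ts)}
        for key, ts in times.items()
    }


def events_per_minute_by_customer(features):
    """Average the per-session event rate for each customer."""
    rates = defaultdict(list)
    for (customer, _session), f in features.items():
        minutes = max(f["duration"], 1) / 60  # guard against zero-length sessions
        rates[customer].append(f["events"] / minutes)
    return {customer: statistics.fmean(r) for customer, r in rates.items()}


def flag_anomalies(metric, threshold=3.5):
    """Flag customers whose metric is far from the median, scaled by the MAD."""
    values = list(metric.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be called unusual
    return sorted(
        customer
        for customer, v in metric.items()
        if 0.6745 * abs(v - median) / mad > threshold
    )


# Five customers with ordinary sessions, and one with a rapid burst of events.
log = []
for customer, gap in [("A", 28), ("B", 30), ("C", 33), ("D", 31), ("E", 29)]:
    log += make_session(customer, customer + "-1", 6, gap)
    log += make_session(customer, customer + "-2", 8, gap)
log += make_session("F", "F-1", 40, 1)

rates = events_per_minute_by_customer(session_features(log))
print(flag_anomalies(rates))
```

With this made-up data, only customer F is flagged. Which metric to compute per session is where domain knowledge comes in: the statistics are simple, and the value lies in choosing a feature that the people steeped in the process would recognise as meaningful.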

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then keep refining it to the point where it helps your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in, and present interesting possibilities during the data analysis. Sometimes, we answer these questions by finding and including the additional data, and at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and you get to do an analysis on top of another analysis. After a while, you have combined different models that together give you completely new insights.

Concluding Remarks

All three patterns are exhilarating and interesting for data scientists to observe, especially those who are deeply involved in reasoning about the data. A good indication of whether you’ve done well in data analysis is when you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much to you when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.