Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I’d authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Ignoring domain knowledge is the data scientist’s road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors one is using, is akin to “data science suicide”. Domain knowledge is also hard to acquire, especially for data scientists working as consultants and applying their skills in short-term, consultative settings. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering setup or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from impostor syndrome – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. For this reason, the following are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for useful models, not for more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril. Hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and this is never more true than for models built from well-formulated hypotheses; it is such models that become genuinely useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, there is a constant debate on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building many features, callous use of customer data presents a huge risk. Simpler models are easier to explain – and we arrive at them by accumulating sufficient domain knowledge and testing enough hypotheses. With simpler models, it is easier to explain what data needs to be collected, which can also help win the customer’s trust (see the sketch after this list).
  6. Careful feature engineering done with human supervision and care may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and “robotic data science” are often discussed in the context of machine intelligence and speeding up insight generation from data. However, for some applications, it may be better in the short term to ensure that feature engineering happens through human hands. For organizations that use sensitive data, such careful feature engineering errs on the side of caution and may prove the sounder long-term strategy.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning has (justifiably) seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for all data analysis. The end goal of working with data is the generation of value – be it for a customer, or for society at large. There are many ways to do this, and deep learning is just one approach.
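
To make item 5 concrete, here is a minimal sketch of what a simple, explainable model can look like in practice. It assumes scikit-learn; the churn-prediction setting, the feature names and the data are all hypothetical. The point is only that every learned coefficient maps to a human-chosen, human-named feature.

```python
# Minimal sketch: a simple, explainable model over hand-engineered features.
# Assumes scikit-learn; the feature names and data below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hand-engineered features: each column is chosen and named by a human,
# so the resulting coefficients can be discussed with domain experts.
feature_names = ["tenure_months", "support_tickets", "monthly_spend"]
X = np.array([[1, 5, 20.0],
              [24, 0, 55.0],
              [6, 3, 30.0],
              [36, 1, 80.0],
              [2, 4, 25.0],
              [48, 0, 90.0]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = churned, 0 = retained

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# The whole "explanation" is three named coefficients -- easy to justify
# to a customer, unlike a deep network's millions of weights.
for name, coef in zip(feature_names,
                      model.named_steps["logisticregression"].coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

Because the features were engineered by hand (item 6), the printed coefficients can be discussed directly with domain experts and customers – something that is far harder with an opaque, automatically generated feature set.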

I’m not discussing the many technical aspects of building explainable models here. For one, these technical aspects are contextual and depend on the situation; additionally, the tone of this post (and the tweets) is deliberately lighter, to encourage discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you like something in this post, or want to share any other related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot. More importantly, they can perform diverse functions in organizations and still qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to elucidate the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say that I found Michael Koelbl’s answer to “What are all the different types of data scientists?” quite interesting, and, thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis (EDA), reporting based on newly created measures, graphs and charts based on them, and asking questions around these analyses. They’re excellent at storytelling, asking questions based on data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, trying to build ML models of one kind or another based on the data. They’re not statistically savvy, but they understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. They span the range from BI tools to hand-coded scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes about data science and statisticians). Statisticians are perhaps the rarest breed in the current data science talent pool, even though the need for them is higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and design of experiments (DOE), to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists who are more focused on building data management systems, pipelines for implementing models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and similar aspects of the infrastructure, but not so much with the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, interested in putting together the processes, systems and tools that enable their data scientists, analysts and engineers to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

The Expert System Anachronism in the Data Science and AI Divergence

The data science and big data buzzwords have been bandied about for only a few years now, while artificial intelligence has been talked about for decades; despite these different histories, the two fields are irrevocably interrelated and interdependent.

For one thing, the wide interest in data science started just as we were beginning to leverage distributed data storage and computation technologies – which allowed companies to “scale out” storage and computation, rather than “scale up” computation. Companies could therefore buy numerous run-of-the-mill computers (rather than a smaller number of extremely expensive, high-end machines) and potentially leverage their data collection activities for the benefit of the enterprise.

Let’s not forget, though, that the point of such exercises was to actually derive some business value at the end. There’s virtually no business case for collecting and storing huge amounts of data (with or without structure) if we don’t have a plan to somehow use that data to make better business decisions, or to somehow impact the business or customers positively. IT managers across industries have therefore struggled to make sense of the big data space: how much to invest, what to invest in, and how to make sense of it all.

Technology companies are only too happy to sell the latest and greatest data science and data management frameworks and solutions, but how can their customers actually use these solutions and tools to make a difference to their businesses? This challenge for executives isn’t going away with the advent of AI.

Artificial Intelligence (AI) has a long and hoary history, and has been the subject of debate, discussion and chronicle over several decades. Geoff Hinton, the AI pioneer, has a pretty comprehensive description of various historical aspects of AI here. Starting from Hinton’s research, pioneering work in recent years by Yann LeCun, Andrej Karpathy and others has enabled AI to be considered seriously by organizations as a force multiplier, just as they considered data science a force multiplier for decision-making activities. The focus of all these researchers is on general-purpose machine intelligence, specifically neural networks. While the “deep learning” buzzword has caught on of late, a deep learning model is fundamentally a complex neural network.

That said, AI in the form of deep learning differs vastly in capability from the algorithms data scientists and data mining engineers have used for more than a decade now. By adding many layers, constructing complex topologies in these neural networks, and iteratively training them on large amounts of data, we’ve progressed along multiple quantitative axes (complexity, number of layers, amount of training data, etc.) and achieved not merely quantitative but qualitative improvements in AI performance. Recent studies at Google show that image captioning, often considered a hard problem for AI, is now at near-human levels of accuracy. Microsoft famously announced that their speech-to-text and translation engines have improved by an order of magnitude because of these techniques.
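
To ground the “multiple quantitative axes” point, here is a minimal sketch in which depth is literally just a constructor argument. It uses scikit-learn’s MLPClassifier as a stand-in (real deep learning work would use a GPU framework), and the dataset and layer sizes are arbitrary illustrations.

```python
# Minimal sketch: "adding many layers" as a tunable axis.
# Assumes scikit-learn; dataset and layer sizes are arbitrary illustrations.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Same learning algorithm; the only change is the network's depth/topology.
shallow = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
deep = MLPClassifier(hidden_layer_sizes=(32, 32, 32, 32), max_iter=2000,
                     random_state=0)

for name, net in [("shallow", shallow), ("deep", deep)]:
    net.fit(X, y)
    print(name, net.score(X, y))  # training accuracy, for illustration only
```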

It is this vastly improved capability of AI, and the elimination of the human (present forever in the data science activity loop) from even the analysis and design of these neural networks (generative adversarial networks being a case in point), that makes the divergence between Data Science and AI vivid and distinct. AI seems to be headed in the direction of general intelligence, whereas data science approaches and methods have constituted human-in-the-loop approaches to making sense of data. The key value addition of the human in this data science context was the “domain” – and I have discussed the importance of domain in data science extensively in an earlier post – but this, too, is increasingly supplanted by efficient AI, provided that the data collection process for training data, and the training and topological aspects of the networks (known as hyperparameters), are well enough defined. This supplanting of the human domain perspective by machine-learned domain features is precisely what will enable AI to develop and become a key force to reckon with in industry.

I therefore venture that the “anachronism” in the title of this post is the domain-based model of intelligent systems: the expert system. Expert system design is an old problem that had its heyday and then apparently disappeared into the mists of technological obsolescence – and it is exactly this kind of problem that AI methods will be good at solving, to the point that they can replace humans in key tasks and approach a true general intelligence. Expert systems were how the earliest AI researchers imagined machine intelligence would be useful to humanity; however, their understanding was limited to rule-based expert systems. While the overall idea of the expert system is still relevant in many domains – so much so that, in a sense, we have expert systems all around us – it is undeniable that the advent of AI will enable expert systems to develop and evolve once again, this time without the rule-based approaches of the past, and with the inductive learning we see in deep learning and machine learning methods.
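
As a toy illustration of the contrast drawn above – hand-coded rules versus inductively learned ones – consider the sketch below. The rules, cases and labels are entirely hypothetical, and scikit-learn’s decision tree stands in for modern inductive methods.

```python
# Toy contrast: a rule-based expert system encodes the domain by hand,
# while an inductive learner infers it from labelled examples.
# All rules, data and labels below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

def rule_based_diagnosis(temp_c, has_rash):
    """Hand-written rules: the classical, rule-based expert system."""
    if temp_c > 39.0 and has_rash:
        return "suspect measles"
    if temp_c > 38.0:
        return "suspect fever, run tests"
    return "no action"

# The inductive alternative: learn a similar mapping from expert-labelled cases.
cases = [[39.5, 1], [38.5, 0], [37.0, 0], [40.0, 1]]  # [temp_c, has_rash]
labels = ["measles", "fever", "none", "measles"]
learned = DecisionTreeClassifier().fit(cases, labels)

print(rule_based_diagnosis(39.8, True))  # conclusion from hand-coded rules
print(learned.predict([[39.8, 1]]))      # conclusion induced from data
```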

Domain: The Missing Element in Data Science

As a data science consultant who routinely deals with large companies and their data analysis, data science and machine learning challenges, I have come to understand one key element of the data scientist’s skill set that isn’t often discussed in data science circles online. In this post I hope to elucidate the importance of domain knowledge.

Over the last several years, there has (rightly) been significant debate on the skill sets of data scientists, and on the importance of business, statistics, programming and other skills. Interesting sub-classifications of the profession, such as “data hacker” and “data nerd”, have been used to describe the various combinations or intersections of these skill sets.

The Importance of Domain Knowledge

In all of these discussions, however, one key element has been left out. And that is the domain.

Domain knowledge is an important subset of the data scientist’s work. Although the perfect data scientist is a bit of a unicorn, the domain should be an important consideration.

Domain knowledge is distinct from statistics, data analysis, programming and the other purely technical areas, and it is easy to see why. Business knowledge, however, is often conflated with domain knowledge – perhaps understandably, because both are vague, interdisciplinary areas. Business knowledge entails some amount of financial knowledge, unit economics, strategy, people management, and a range of other skills taught in business schools and, more commonly, learned in organizations on the job. Domain knowledge, by contrast, is like being a kind of human expert system. Wikipedia defines an expert system without defining expertise. What role, then, does expertise play in data science?

Domain knowledge is a result of the system exploration that humans, as system builders, naturally do. To formulate intelligent hypotheses, one must study and understand the unique cause-and-effect chains relevant to specific systems. Do humans learn about systems in ways that differ from how machines might explore them, given infinite data and computational capability? That is a hard question to answer in this context, and perhaps a red herring. What is useful to note, however, is that machine learning models still rely on human-formulated hypotheses. There is the odd example of an expert system that has formulated hypotheses and proved them (as is happening in medicine these days), but such examples are hardly possible without human intervention.

Since human intervention remains necessary in machine learning systems, data science can be seen as a field that relies uniquely on human-formulated hypotheses. While computational power and statistical models help us explore data and construct hypotheses, the decisions made along the way – defining hypotheses, modelling the data to test them, constructing mathematical or statistical models of the data, and evaluating the results of those tests – all happen with human intervention.
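
Here is a minimal sketch of that human-in-the-loop cycle, with a hypothetical manufacturing example (the data and the hypothesis are invented; only scipy’s standard two-sample t-test is assumed). The hypothesis comes from a human with domain knowledge; the machine merely tests it; a human interprets the result.

```python
# Minimal sketch of the human-in-the-loop cycle: a human formulates the
# hypothesis, code merely tests it. Data and hypothesis are hypothetical.
from scipy import stats

# Human-formulated hypothesis (from domain knowledge):
# "Furnace line B runs hotter than line A."
line_a = [71.2, 70.8, 72.1, 71.5, 70.9]  # temperature samples, line A
line_b = [73.0, 72.6, 73.4, 72.9, 73.1]  # temperature samples, line B

t_stat, p_value = stats.ttest_ind(line_a, line_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Interpreting this result -- and deciding what to do about line B --
# is again a human, domain-informed step.
```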

So where does domain fit in? Domain experts are those who have spent significant time learning about one or a few interconnected systems in intimate ways. Their ability to develop a gut feel for a system’s performance and characteristics helps them leapfrog to well-formed hypotheses – and this is their biggest advantage over domain-agnostic data scientists, who merely have the programming, statistics, business and communication skills required to make serious analysis happen.

Domain Expertise and Analysis Paralysis

Domain expertise is probably one fine way to fight off the analysis-paralysis problem that plagues many data science teams. Some teams spend significant time and resources experimenting broadly with ideas, and the availability of high-performance computing power on tap makes them take hypothesis formulation less seriously. Adversity is truly the mother of inventiveness: it was when computing power was at a premium, for example, that some of the most efficient sorting algorithms were devised. Conversely, the availability of computing power and statistical modelling capability on a massive scale removes the incentive to ask pertinent questions.

Pertinent questions and specific answers lead to tangible decisions and related business improvements; without the benefit of domain knowledge, this is not possible. Analysis paralysis is a very real phenomenon, and data scientists are susceptible to it in organizations that value domain expertise but don’t value analytical solutions. Even where analytical solutions and problem solving are valued, data scientists who fly blind, toting algorithms and machine learning, won’t come out on top either – they’re more likely to hurt the credibility of the data science exercise than help it when they use complex algorithms (which may not give sufficient insight into their own workings, despite working well) to solve simple problems for which domain formulations already exist.

Challenge or Channel Domain Expertise?

Machine learning work done in medicine (cancer cell detection) points to a future where human-learned skills are replicated by deep learning or reinforcement learning systems. At the same time, many real data science programs at diverse companies exhibit an analysis paralysis that can be addressed by involving domain experts of specific kinds to a greater degree in hypothesis formulation, analysis and interpretation of results. The latter is more representative of the real world than the former, in which an expert system independently learns about a hard problem and solves it.

Doing Data Science Better

To do data science better, it isn’t enough to develop data scientist resources along the lines described by Drew Conway or Stephan Kolassa. It is important to groom analytically capable people from within domains too. This means distributing the skill set required for serious analysis from the mainstream data science practice into functional teams. Sometimes this may mean penetrating leadership teams that work in functional capacities; at other times it may mean addressing the needs of small teams directly, by grooming functional or technical talent to do data science.

Doing data science better doesn’t merely involve leveraging algorithms and their strengths. It also means asking the right questions. Pay attention to your domain experts, and build your team’s analytical capabilities around them. For many companies, success doesn’t look like all-conquering deep learning algorithms; it looks like specific problems solved in a targeted manner, using well-defined problem statements and the right algorithms and frameworks.