As a data science consultant who routinely works with large companies on their data analysis, data science and machine learning challenges, I have come to appreciate one key element of the data scientist’s skill set that is rarely discussed in data science circles online. In this post I hope to explain the importance of domain knowledge.
Over the last several years, there has (rightly) been significant debate about the skill sets of data scientists, and about the relative importance of business, statistics, programming and other skills. Interesting sub-classifications of the profession, with labels such as “data hacker” and “data nerd”, have been used to describe the various combinations or intersections of these skill sets.
The Importance of Domain Knowledge
In all of these discussions, however, one key element has been left out. And that is the domain.
Domain knowledge is an important part of the data scientist’s skill set. Although the perfect data scientist is a bit of a unicorn, the domain should still be an important consideration.
Domain knowledge is distinct from statistics, data analysis, programming and the purely technical areas, and it is easy to see how that is the case. However, business knowledge is often conflated with domain knowledge, perhaps understandably, because these are both vague and interdisciplinary areas. Business knowledge entails some amount of financial knowledge, unit economics models, strategy, people management, and a range of other skills taught in business schools, and more commonly, learned in organisations on the job. Domain knowledge, however, is like being a kind of human expert system. Wikipedia defines an expert system without defining expertise. What role does expertise play in data science, however?
Domain knowledge is a result of the system exploration that humans, as system builders, naturally do. To formulate intelligent hypotheses, one has to study and understand the unique cause-effect chains that are relevant to specific systems. Do humans learn about systems in ways that are different from how machines might explore them, if we were to give the machines infinite data and computational capability? That is a hard question to answer in this context, and is perhaps a red herring of sorts. What is useful to note, however, is that machine learning models still rely on human-formulated hypotheses. There is the odd example of an expert system that has formulated hypotheses and proved them (as is happening in medicine these days), but even these examples are hardly possible without human intervention.
Now that we have established that human intervention is necessary in machine learning systems, data science can be seen as a field that relies uniquely on human-formulated hypotheses. While computational power and statistical models help us explore and construct hypotheses, the decisions made along the way all involve human intervention: defining hypotheses, modeling the data to test them, constructing mathematical or statistical models of the data, and evaluating the results of those tests.
So where does the domain fit in? Domain experts are those who have significant experience learning about one or a few interconnected systems in intimate ways. Their gut feel for a system’s performance and characteristics helps them leapfrog much of the hypothesis formulation process, and this is their biggest advantage over domain-agnostic data scientists, who have only the programming, statistics, business and communication skills required to make serious analysis happen.
Domain Expertise and Analysis Paralysis
Domain expertise is probably one fine way to fight off the analysis paralysis that plagues many data science teams. Some teams spend significant time and resources experimenting with a vast range of ideas, and the availability of high performance computing power on tap leads them to take hypothesis formulation less seriously. Adversity is truly the mother of inventiveness: it was when computing power was at a premium, for example, that some of the most efficient sorting algorithms were devised. Conversely, the availability of computing power and statistical modeling capabilities on a massive scale reduces the incentive to ask pertinent questions.
Pertinent questions and specific answers lead to tangible decisions and related business improvements. Without the benefit of domain knowledge, this is not possible. Analysis paralysis is a very real phenomenon. Data scientists are susceptible to it in organizations that value domain expertise but don’t value analytical solutions. In organizations where analytical solutions and problem solving are valued, data scientists who fly blind, toting algorithms and machine learning, won’t come out on top either. They are more likely to hurt the credibility of the data science exercise than help it when they use complex algorithms (which may not give sufficient insight into their own workings, despite working well) to solve simple problems that already have domain formulations.
Challenge or Channel Domain Expertise?
Machine learning work done in medicine (cancer cell detection, for example) points to a future where human-learned skills are replicated by deep learning or reinforcement learning systems. On the other hand, many real data science programs at diverse companies show an analysis paralysis that can be addressed by involving the right domain experts more deeply in hypothesis formulation, analysis and the interpretation of results. The latter scenario is more representative of the real world than the former, in which an expert system independently learns about a hard problem and solves it.
Doing Data Science Better
To do data science better, it isn’t enough to develop data scientists along the lines described by Drew Conway or Stephan Kolassa. It is important to groom analytically capable people from within domains too. This means distributing the skill set required for serious analysis from the mainstream data science practice into functional teams. Sometimes this may mean reaching leadership teams that work in functional capacities, and at other times it may mean addressing the needs of small teams directly, by grooming functional or technical talent to do data science.
Doing data science better doesn’t merely involve leveraging algorithms and their strengths better. It also means asking the right questions. Pay attention to your domain experts, and build your team’s analytical capabilities around their expertise. Success for many companies doesn’t look like all-conquering deep learning algorithms; it looks like specific problems solved in a targeted manner, using well-defined problem statements and the right algorithms and frameworks.