As of early 2019, the “data science” job description is becoming increasingly common.
The field has garnered a great deal of interest not only from software developers, statisticians and machine learning practitioners, but also, over the years, from people in roles such as strategy, operations, sales and marketing. Product designers and managers in manufacturing and customer service are likewise turning to data science talent to help them make sense of their businesses and processes, and to find new ways to improve.
The Data Science Misinformation Challenge
The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common-sense, reasonable starting points for finding data science talent and beginning analytical programs in organizations. The problem starts with the scarcity of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to the media’s fulminations around sometimes disconnected value propositions – chatbots, artificial intelligence agents, machine learning and big data – these terms have come to be lumped together with data science, purely because the notions seem similar, or because some of the same skills are required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from calling automated machine learning “Building AI that can build AI” (NYT) to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, and I expect 2019 to be much the same, at least to some extent. I have dealt with this topic extensively in an earlier post here. What I take issue with, naturally, is that this serves to misinform business teams about what’s really important.
Managing Data Science Better
Astute business leaders build analytical programs where they don’t put the cart before the horse. By this, I mean the following things:
- They have not merely a data strategy, but a strategy for labelled data
- They start with small problems, not big, all-encompassing problems
- They grow data science capabilities within the team
- They embrace visualization methods and question black box models
- They check for actual business value in data science projects
- They look for ways to deploy models, not merely build throw-away analyses
Data science and analytics managers ought not to:
- Perpetuate hype and spread misinformation without research
- Set expectations based on such hype around data science
- Assume solutions are possible without due consideration
- Fail to budget for subject matter experts
- Skip training their staff while still expecting better results
As obvious as the above may sound, these mistakes are all too common in the industry. Then there is the problem of consultants who sometimes perpetuate the hype train, reinforcing some of these behaviors.
Doing Data Science Better
Now let’s look at some of the things data scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others concern the tools and infrastructure provided to data scientists in companies. Some have to do with preferences, others with processes. I find many common practices among data science professionals to be problematic. Some of these are:
- Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general
- Not being aware of how some of the details of algorithms work – and not bothering to learn this even after several projects in which those shortcomings are exposed
- Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling
- Not visualizing data before attempting to build models from them
- Making assumptions about the problem-solving approach they should take, without basing them on EDA results
- Not differentiating between the unique characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient, compared to others
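The first item above – assumption checking – is easy to make concrete. The sketch below (pure standard-library Python; the function name and the 4:1 variance-ratio rule of thumb are my own illustrative choices, not a fixed standard) checks the equal-variance assumption before choosing between the pooled and Welch forms of the two-sample t-statistic, instead of silently applying the pooled form:

```python
import math
import statistics

def two_sample_t(a, b, var_ratio_threshold=4.0):
    """Two-sample t-statistic that checks the equal-variance
    assumption instead of silently assuming it holds."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    # Rule-of-thumb check: pooling variances is only defensible
    # when the sample variances are of comparable magnitude
    equal_var = (min(va, vb) > 0
                 and max(va, vb) / min(va, vb) <= var_ratio_threshold)
    if equal_var:
        # Pooled (classic Student) form
        sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se = math.sqrt(sp2 * (1 / na + 1 / nb))
    else:
        # Welch form: per-sample standard errors, no pooling
        se = math.sqrt(va / na + vb / nb)
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return ("pooled" if equal_var else "welch"), t
```

The point is not this particular threshold, but that the assumption is tested explicitly and the computation changes when it fails.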
Some of these can be sorted out by asking critical questions such as the ones below (which may overlap to some extent with the activities listed above):
- Where the data came from
- How the data was measured
- Whether the data was tampered with in any way, and if so, how
- How the insights will be consumed
- What user experience is required for the analytics consumer
- Whether the solution has to scale
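Several of these questions are best answered by a few lines of exploratory code run before any modeling. The sketch below (standard-library Python only; the function name and the 3.5 modified-z-score cutoff are illustrative assumptions on my part) profiles a numeric column and flags suspect values using the median absolute deviation, a spread measure that is robust to the very outliers it is looking for:

```python
import statistics

def quick_eda(values):
    """Minimal numeric profile to run before any serious modeling:
    surfaces scale, spread, and candidate outliers."""
    vs = sorted(values)
    med = statistics.median(vs)
    # Median absolute deviation (MAD): unlike the standard deviation,
    # it is not inflated by the outliers we are trying to detect
    mad = statistics.median(abs(v - med) for v in vs)
    # Modified z-score cutoff of 3.5 is a common rule of thumb
    outliers = [v for v in vs
                if mad and 0.6745 * abs(v - med) / mad > 3.5]
    return {
        "n": len(vs),
        "min": vs[0],
        "max": vs[-1],
        "mean": statistics.mean(vs),
        "median": med,
        "suspect_outliers": outliers,
    }
```

A profile like this immediately raises the questions above: where did the flagged values come from, and how were they measured?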
This is just a select list, and I’m sure that contextually, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager) and to plan and structure data science work (as a data scientist).