What Could Data Scientists (And Data Science Managers) Be Doing Better in 2019?

As of early 2019, the “data science” job description is becoming more and more common.

The field has not only garnered a great deal of interest from software developers, statisticians and machine learning practitioners, but has also attracted plenty of interest, over the years, from people in roles such as strategy, operations, sales and marketing. Product designers, manufacturing managers and customer service managers are also turning towards data science talent to help them make sense of their businesses and processes, and to find new ways to improve.

The Data Science Misinformation Challenge

The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common-sense, reasonable starting points from which to look for data science talent and begin analytical programs in organizations. The problem starts with the scarcity of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to the media’s fulminations around sometimes disconnected value propositions – chat bots, artificial intelligence agents, machine learning and big data – these terms have come to be clumped together with data science, purely because of a similarity of notion, or because of an overlap in the skills required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from calling automated machine learning “Building AI that can build AI” (NYT), to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, as I expect 2019 will also be, to some extent. I have dealt with this topic extensively in an earlier post here. What I take issue with, naturally, is that this serves to misinform business teams about what’s really important.

Managing Data Science Better

Astute business leaders build analytical programs in which they don’t put the cart before the horse. By this, I mean the following:

  1. They have not merely a data strategy, but a strategy for labelled data
  2. They start with small problems, not big, all-encompassing problems
  3. They grow data science capabilities within the team
  4. They embrace visualization methods and question black box models (one way to do the latter is sketched after this list)
  5. They check for actual business value in data science projects
  6. They look for ways to deploy models, not merely build throw-away analyses
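
One simple, model-agnostic way to act on point 4 – questioning black box models – is permutation importance: shuffle one feature at a time and measure how much held-out accuracy degrades. Here is a minimal sketch; the dataset, model and thresholds are illustrative assumptions of mine, not prescriptions from the list above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Fit an ordinary "black box" on a bundled dataset (purely illustrative)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
baseline = model.score(X_test, y_test)

# Permutation importance: break one feature's link to the target and
# see how much the held-out accuracy drops
rng = np.random.default_rng(42)
drops = []
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(baseline - model.score(X_perm, y_test))

# The features the model leans on hardest are those whose shuffling hurts most
for j in np.argsort(drops)[::-1][:5]:
    print(f"feature {j}: accuracy drop {drops[j]:.4f}")
```

A model that cannot survive this kind of interrogation – or whose “important” features make no business sense – deserves the scepticism point 4 recommends.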

Data science and analytics managers ought not to:

  1. Perpetuate hype and spread misinformation without research
  2. Set expectations based on such hype around data science
  3. Assume solutions are possible without due consideration
  4. Fail to budget for subject matter experts
  5. Leave their staff untrained while still expecting better results

As obvious as the above may sound, these mistakes are all too common in the industry. Then there is the problem of consultants who sometimes perpetuate the hype train, thereby reinforcing some of these behaviours.

Doing Data Science Better

Now let’s look at some of the things data scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others have to do with the tools and infrastructure provided to data scientists in companies. Some have to do with preferences, others with processes. I find many common practices of data science professionals problematic. Some of these are:

  1. Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general (a sketch of what such checking can look like follows this list)
  2. Not understanding how the details of algorithms work – and not bothering to learn them even after several projects in which this gap is exposed
  3. Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling
  4. Not visualizing data before attempting to build models from them
  5. Making assumptions about the problem-solving approach they should take, without basing these on EDA results
  6. Not recognizing the characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient than others

Some of these problems can be sorted out by asking critical questions, such as the ones below (which may overlap to some extent with the activities listed above):

  1. Where did the data come from?
  2. How was the data measured?
  3. Was the data modified in any way, and if so, how?
  4. How will the insights be consumed?
  5. What user experience does the analytics consumer need?
  6. Does the solution have to scale?
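
To make point 1 above concrete, here is a minimal sketch of assumption checking before a two-sample comparison: test for normality and equal variances first, and fall back to a rank-based test when normality is doubtful. The groups, sample sizes and the 0.05 threshold are illustrative assumptions of mine.

```python
import numpy as np
from scipy import stats

# Stand-in samples; in practice these would come from the problem at hand
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.lognormal(mean=2.3, sigma=0.3, size=200)

# Check the t-test's assumptions before trusting its p-value
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)
_, p_equal_var = stats.levene(group_a, group_b)

if p_norm_a > 0.05 and p_norm_b > 0.05:
    # Normality plausible: Student's t-test, or Welch's variant if variances differ
    stat, p = stats.ttest_ind(group_a, group_b, equal_var=(p_equal_var > 0.05))
    print(f"t-test: statistic={stat:.3f}, p={p:.4f}")
else:
    # Normality doubtful: use a non-parametric alternative instead
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    print(f"Mann-Whitney U: statistic={stat:.3f}, p={p:.4f}")
```

None of this replaces looking at the data – a histogram or Q-Q plot (point 4) would reveal the skew in the second group immediately – but it keeps the choice of test honest.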

This is just a select list, and I’m sure that, depending on the context, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager), and to plan and structure data science work (as a data scientist).

Lessons from Agile in Data Science

Over the past year and a few months, I’ve had a chance to lead a few different data science teams working on different kinds of hypotheses. The engineering process view that the so-called agile methodologies bring to data science teams is something that has been written about. However, one’s own experiences tend to be different, especially when it comes to the process aspects of engineering and solution development.

Agile principles are by no means new to the technology industry. Numerous technology companies have attempted to use principles of agility in their engineering and product development practices, since a lot of technology product development (whether software or hardware, or both) is systems engineering and systems building. While some have found success in these endeavours, many organizations still find agility a hard objective to accomplish. Managing requirements, the needs of engineering teams, and concerns such as delivery, quality and productivity for scalable data science is a similarly hard task. Organizational structure, team competence, communication channels and approaches, leadership styles and culture all play significant roles in the success of product development programmes, especially those centred around agility.

In the specific context of software and systems development, two talks stand out in my mind. One is from a thought leader and an industry pioneer who helped formulate the agile manifesto (a term which he extensively derides, actually) – and the other is from a team at Microsoft, which is a success story in agile product development.

Here’s Pragmatic Dave (Dave Thomas, one of the original pioneers of agile software development), in his GOTO 2015 talk titled “Agile is Dead”.

I’m wary of both extreme proponents and extreme detractors of any philosophy or idea, especially one that seems to have had some success, in some quarters, in practice. While Dave Thomas takes some extreme views, he brings a lot of pragmatic advice along with them. His views on the “manifesto for agility” are, in some sense, more helpful than boilerplate Agile training programmes, especially when seen in the context of Agile software/system development.

The second talk I mentioned, the one featuring Microsoft Scrum masters, is very much a success story. It has all the hallmarks of an organization navigating what works and what doesn’t, and finding its own velocity, rhythm and approach, starting from the normative approach suggested in so many agile software development textbooks and by many gurus and self-proclaimed experts.

This talk by Aaron Bjork was quite instructive for me when I first saw it a few months ago. Specifically, what stood out was the focus of agile practices on teams and interactions, rather than on “process”. Naturally, this approach raises other questions, such as how it scales, but in the specific context of data science, I find that the interactions, and the process of generating and evaluating hypotheses, seem to matter more than most other things. These are only two of the many videos and podcasts I have listened to, and they surely constitute only a portion of the interactions I’ve had with team members and managers on Agile processes for data science delivery.

It is in this setting that my personal experiences with Agile were initially less than fruitful. The team struggled both to follow the process and to do data science, and the management overhead of activity and task tracking was extensive. This problem remains, and there doesn’t seem to be a clear solution to balancing the ceremony and rituals of agile practices against seemingly useless ideas such as story points. Hours are more useful than story points – so much so that scrum practitioners typically devolve, at some point, into equating story points to hours or multiples of them. The issue here lies squarely with how the practices have been written about and evangelized, rather than with the fundamental idea itself.

There’s also the issue of process versus practice – in my view, one of the key things in project management of any kind. The divergence between process and practice in Agile methods is very high, and in my opinion the systems/software development world deserves better. Perhaps one key reason for this is the proliferation of Scrum as the de facto Agile development approach. When Agile methods were first being discussed and debated, the term “Agile development” represented a range of different approaches; this has given way (rather unfortunately) to one predominant approach, Scrum. There is an analogy in the quality management world, which I am extensively familiar with: Six Sigma, and the proliferation of DMAIC as an almost exclusive means of solving “common cause” problems.

Process-versus-practice issues apart, there are other significant challenges in using Agile development for data science. Changing toolsets, the tendency to “build now and fix later” (although this is addressed through effective continuous deployment methods) and process overhead are some of the reasons why the approach can be a hard fit for data science teams.

What does work almost universally is the sprint-based approach to data science. While the sprint-based approach is only one element of the overall Scrum workflows we see in the industry, it can, in itself, become a powerful, iterative way to think about data science delivery in organizations. Combined with a task-level structure and a hypothesis model (a toy sketch of which follows below), it may be all that your data science team requires, even for complex data science. Keeping things simple process-wise may unlock the creative juices of data scientists and enable your team to favour direct interactions over structured ones, helping them explore more and extract more value from the data.
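
To close, here is a toy sketch – entirely my own construction, not a template from any team described above – of what pairing a task-level structure with a hypothesis model might look like: each sprint-level hypothesis carries a falsifiable statement, an agreed metric and the tasks it decomposes into, so sprint reviews can centre on evidence rather than ceremony.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hypothesis:
    """A falsifiable claim a data science team commits to testing in a sprint."""
    statement: str                                  # the claim itself
    metric: str                                     # how we agree to judge it
    tasks: List[str] = field(default_factory=list)  # the task-level structure
    outcome: Optional[str] = None                   # "supported" / "rejected" / "inconclusive"

# A sprint backlog is then just a list of hypotheses (names are illustrative)
backlog = [
    Hypothesis(
        statement="Churn is predictable from 90-day usage features",
        metric="AUC of at least 0.75 on a held-out month of data",
        tasks=["assemble usage features", "fit a baseline logistic regression"],
    ),
]
```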