Lessons from Agile in Data Science

Over the past year and a few months, I’ve had a chance to lead a few different data science teams working on different kinds of hypotheses. The engineering process view that the so-called agile methodologies bring to data science teams is something that has been written about. However, one’s own experiences tend to be different, especially when it comes to the process aspects of engineering and solution development.

Agile principles are by no means new to the technology industry. Numerous technology companies have attempted to use principles of agility in their engineering and product development practices, since a lot of technology product development (whether software or hardware, or both) is systems engineering and systems building. While some have found success in these endeavours, many organizations still find agility a hard objective to accomplish. Managing requirements, the needs of engineering teams and concerns such as delivery, quality and productivity for scalable data science is a similarly hard task. Organizational structure, team competence, communication channels and approaches, leadership styles and culture all play significant roles in the success of product development programmes, especially those centred around agility.

In the specific context of software and systems development, two talks stand out in my mind. One is from a thought leader and an industry pioneer who helped formulate the agile manifesto (a term which he extensively derides, actually) – and the other is from a team at Microsoft, which is a success story in agile product development.

Here’s Pragmatic Dave (Dave Thomas, one of the original pioneers of agile software development), in his GOTO 2015 talk titled “Agile is Dead”.

I’m wary of both extreme proponents and extreme detractors of a philosophy or an idea, especially when, in practice, it seems to have had some success in some quarters. While Dave Thomas seems to take some extreme views, he does bring in a lot of pragmatic advice. His views on the “manifesto for agility” are in some sense more helpful than boilerplate Agile training programmes, especially when seen in the context of Agile software/system development.

The second talk that I mentioned, the one featuring Microsoft Scrum masters, is very much a success story. It has all the hallmarks of an organization navigating through what works and what doesn’t, and trying to find their velocity, their rhythm and their approach, from what is a normative approach that’s suggested in so many agile software development textbooks and by many gurus and self-proclaimed experts.

This talk by Aaron Bjork was quite instructive for me when I first saw it a few months ago. Specifically, the focus of agile practices on teams and interactions, rather than on “process”, was instructive. Naturally, this approach raises other questions, such as scaling, but in the specific context of data science, I find that the interactions, and the process of generating hypotheses and evaluating them, seem to matter more than most things. These are only two of the many videos and podcasts I listened to, and they constitute only a portion of the interactions I’ve had with team members and managers on Agile processes for data science delivery.

It is in this setting that my personal experiences with Agile were initially less than fruitful. The team struggled both to follow the process and to do data science, and the management overhead of activity and task management was extensive. This problem still remains, and there doesn’t seem to be a clear way to balance the ceremonies and rituals of agile practices, and seemingly useless ideas such as story points, against the data science work itself. Hours are more useful than story points – so much so that scrum practitioners typically devolve into equating story points with hours, or multiples of them, at some point. The issue here lies squarely with how the practices have been written about and evangelized, rather than with the fundamental idea itself.

There’s also the issue of process versus practice – in my view, one of the key things about project management of any kind. The divergence between process and practice in Agile methods is very high – and in my opinion, the systems/software development world deserves better. Perhaps one key reason for this is the proliferation of Scrum as the de facto Agile development approach. When Agile methods were being discussed and debated, the term “Agile development” used to represent a range of different approaches, which has given way (rather unfortunately) to one predominant approach, Scrum. There is an analogy in the quality management world, which I am extensively familiar with: in Six Sigma, DMAIC has proliferated and is used almost exclusively to solve “common cause” problems.

Process-versus-practice apart, there are other significant challenges in using Agile development for data science. Changing toolsets, the tendency to “build now and fix later” (although this is addressed through effective continuous deployment methods) and process overhead are some of the reasons why this approach may still be unattractive.

What does work universally is the sprint-based approach to data science. While the sprint-based approach is only one element of the overall Scrum workflows we see in the industry, it can, in itself, become a powerful, iterative way to think about data science delivery in organizations. Combined with a task-level structure and a hypothesis model, it may be all that your data science team requires for even complex data science. Keeping things simple process-wise may unlock the creative juices of data scientists and enable your team to favour direct interactions over structured interactions, so that they can explore more and extract more value from the data.

Onward to Strata+Hadoop World 2016, Singapore!


Strata + Hadoop World 2016 is happening next week, between December 5th and 8th in Singapore. I’m excited to be presenting at the conference on the subject of time series analysis for sensor data. More about my talk here.

One of my key focus areas during the last several months at The Data Team, where I am Senior Data Consultant, was to help our clients derive value from sensor data they have been collecting from their various field-deployed products.

I am equally excited to witness the rapidly advancing areas of big data, data science and the Internet of Things, which are driving our world forward in new and bold ways.

Additionally, The Data Team (on Twitter, @carpedata) is featured in the Innovators’ Pavilion at this conference. Do drop me a line on Twitter at @rexplorations or on LinkedIn if you’re there and would like to connect professionally.