Operations management as a discipline has taken many shapes and forms in different industries over the years, but there is perhaps something unique that is discussed in software development operations, commonly referred to as DevOps. Many of these considerations around DevOps also apply to the related and increasingly interesting subset of problems that is MLOps, which is the field of Machine Learning Operations. So, what is unique about DevOps and the discussions of software development operations in this context?
Perceptions of DevOps Today and Contrasts with Traditional Industry Operations
A tweet I came across recently, by a manager hiring for DevOps roles, sparked an outpouring of ideas from me. The entire thread is below, with the original tweet for context.
Popular understanding of DevOps seems to revolve around tools: tools for managing code and workflows, and applications that help with one thing or another encountered in software development workflows. Strangely enough, operations management in more established industries, such as manufacturing, oil and gas, energy, or telecommunications, tends to have the following sets of considerations:
- People considerations: From the hiring and onboarding of talent for the organization, to the development of these hires into productive employees, to employee exit. Operational challenges here may be the development of role definitions, establishing the right hierarchy or interactions for smooth operations, and ensuring that the right talent is attracted and retained in the organization.
- Process considerations: All considerations spanning the actual process of value delivery, whereby the resources available to the organization are put to use to efficiently solve day-to-day problems and meet customer requirements on an ongoing basis. Some elements of innovation and continual improvement would also fall into the ambit of the process management that’s part of Operations.
- Technology considerations: All considerations spanning the application of various kinds of technology ranging from the established and mundane, to the innovative and novel – all of these could be considered a part of technology management within Operations in traditional organizations.
Anyone familiar with typical product-centric or services-oriented software development organizations will observe that the above three considerations are spread out among other supporting functions of these organizations. Perhaps technically centred organizations with very specific engineering and development functions evolve this way, and perhaps there is research to support this hypothesis. However, the fact remains that what is considered development operations doesn’t normally involve the hiring and development of talent for product/solution engineering or development, or the considerations around the specific technologies used and managed by the software developers. These elements seem to be subsumed by human resources and architects respectively.
Indeed, the diversification of roles in software development teams is so prolific that delivery managers (of the so-called Scrum teams) are rarely in charge of the development operations process. They’re usually owners of specific solution deliverables. The DevOps function has come to be seen as a combination of software development and tooling roles, with an emphasis on continuous delivery and code management. This isn’t necessarily a bad thing, and there is a need for such capabilities – arguably there is a need for specialists in these areas as well. But here’s the challenge many managers face when hiring mid-senior professionals to manage DevOps:
Cross Functional DevOps and Lean in Manufacturing Operations
When we have DevOps engineers and managers only interested in setting up pipelines for writing and managing code, rather than thinking holistically about how value is being delivered, and whether it is, we miss crucial opportunities for continuous improvement.
As someone who has worked in both manufacturing product development and software product development teams, I find that there needs to be a greater emphasis in software development organizations on cross-functional thinking and cross-functional problem solving. While a lot of issues faced by developers and engineers in the context of product or solution development are solved by technical know-how and technical excellence, there are broader organizational considerations, fitting into the people, process and technology focus areas, that are important to consider – and without such considerations, wise decisions cannot be taken. A lot of these decisions have to do with managing waste in processes – whether that is wasted effort, time or creativity, or technical debt we build up over time, or redundancy for that matter. The Lean toolbox, which originated in the manufacturing industry, provides us a ready reckoner for this, titled the “eight wastes in processes”: inventory, unused creativity, waiting, excess motion, transportation, overproduction, defects and overprocessing. Short of seeing all development activities through these “waste lenses”, we can use them as general guidelines for keenly observing the interactions between a developer, their tools, other developers, and code. Studying these interactions could yield numerous benefits, and perhaps such serious studies are common in some large enterprise DevOps contexts, but at least in the contexts I’ve seen, there are rarely discussions of this nature, with nuance and deep observation of processes.
In fact, manufacturing organizations see Lean in a fundamentally different way from how software development teams see it.
Manufacturing organizations heavily emphasize process mapping, process observations and process walks. And I shouldn’t paint all manufacturing organizations with the same brush, because indeed, the good and the bad ones in this respect are like chalk and cheese – they’re poles apart in how well they understand and deploy efficient operational processes through Lean thinking. Many may claim to be doing Six Sigma and structured innovation, and in many cases, such claims don’t hold water because they’re using tools to do their thinking.
Which brings me to one of the main problems with DevOps as it is done in the software development world today – the tools have become substitutes for thinking, for many, many teams. A lot of teams don’t evaluate the process of development critically – after all, software development may be a team sport, but in a weird way, software developers can be sensitive to replay and criticism of their development approaches. This is reminiscent of artisans in the days before mass production, and how they developed and practised an art in their day to day trade. It is less similar to what’s happening in large scale car or even bottle manufacturing plants around the world. Perhaps there are good reasons for this too, like the development of complexity and the need for specialization for building complex systems such as software applications, which are built but once, but shipped innumerable times. All this still doesn’t imply, however, that tools can become substitutes for thinking about processes and code – there are many conversations in that ambit that could be valuable, eye-opening elements of any analysis of software development practices.
MLOps: What it Ought to Include
Now I’ll address machine learning operations (MLOps), a modern cousin of DevOps that is relevant wherever machine learning models are developed and deployed (generally as some kind of software service). MLOps has evolved in much the same way we saw DevOps evolve, but there is a set of issues here that goes beyond the software-level technicalities, to the statistical and mathematical technicalities of building and deploying machine learning systems.
MLOps workflows and lifecycles appear similar to software development workflows as executed in DevOps contexts. However, there ought to be (and are) crucial differences between the workflows of these two disciplines (software engineering and machine learning engineering).
Some of the unique technicalities for MLOps include:
- Model’s absolute performance, measured by metrics such as RMSE or F1 score
- Model deployment performance against SLAs such as latency, load and scalability
- Model training and retraining performance, and scalability in that context
- Model explainability and interpretability
- Security elements – data and otherwise, of the model, which is a highly domain-dependent conversation
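To make the first two items in this list a little more concrete, here is a minimal sketch in plain Python of what "absolute performance" and "performance against an SLA" might look like as checks in a pipeline. The function names, the nearest-rank percentile method and the 95th-percentile default are my own assumptions for illustration; a real team would more likely lean on an established metrics library and its own SLA definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression model's predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def f1_score(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def meets_latency_sla(latencies_ms, threshold_ms, percentile=95):
    """Check a latency SLA using a nearest-rank percentile.

    Hypothetical helper: the percentile and threshold are illustrative
    assumptions, not values taken from any particular SLA.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] <= threshold_ms
```

The point of writing these down is less the arithmetic and more the conversation it forces: which metric, measured on which data, against which agreed threshold.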
In addition to these purely technical elements of MLOps, the discipline should, to my mind, also include people and process questions:
- Do we have engineers with the right skills to build and deploy these models?
- Have we got statisticians who can evaluate the underlying assumptions of these ML models and their formulation?
- Do we have communication processes in the team that ensure timely implementation of specific ML model features?
- How do we address model drift and retraining?
- If new training data comes from a different region, can it be subject to the same security, operational and other considerations?
There may be more, and those of you reading this who have faced production-scale ML model development and deployment challenges may have more to add. MLOps should therefore see significant discussions around these elements, and these and other related discussions should happen early and often, in the context of ML model deployment and maintenance.
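As one illustration of the model drift question raised above, here is a rough sketch of a drift check on a single numeric feature, using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the empirical distributions of the training data and the recent production data). This is only one of several possible approaches, the function names are hypothetical, and the 0.2 threshold is an arbitrary assumption that a team would need to set for its own domain.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    cumulative distribution functions of the two samples (0.0 to 1.0).
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for value in sorted(set(ref) | set(cur)):
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_cur = bisect_right(cur, value) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def needs_retraining_review(reference, current, threshold=0.2):
    """Flag a feature for review when its distribution has shifted.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return ks_statistic(reference, current) > threshold


# Example: production values have shifted upwards relative to training.
training_values = [1, 2, 3, 4]
production_values = [3, 4, 5, 6]
print(ks_statistic(training_values, production_values))
print(needs_retraining_review(training_values, production_values))
```

A flag like this does not decide anything by itself; it is a prompt for the people and process conversations listed above, such as who reviews the shift and who signs off on retraining.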