MLOps Capabilities, Outcomes and Opportunities for Enterprise AI

Machine Learning Operations (MLOps) has become an important priority for enterprises in 2021 and beyond – and there are clear reasons why this paradigm shift in Enterprise AI is upon us. Most enterprises that began data science and machine learning programs over the last several years have had difficulty putting even their promising machine learning models and proof-of-concept exercises into action by deploying them meaningfully in production environments. I use the term “meaningfully” here because the nuances around deployment make all the difference and form the soul of the subject matter around MLOps. In this post, I wish to discuss what ails enterprise AI today, the sources of the gaps between production and proof-of-concept, expectations from MLOps implementations, and the current state of the discourse on MLOps.

Note and Acknowledgement: I have also discussed several ideas and patterns I've observed in the industry, not necessarily in one company or job, but going back to projects and programs I've been part of over the last seven to ten years. I don't mention clients or employers here as a matter of principle, but I would like to acknowledge mentors and clients for their time, energy and occasional guidance in the synthesis of some of these ideas. It is a more boundaryless world than before, and great conversations are to be had regardless of one's location. I find a lot of the content and conversations regarding data science on Twitter and LinkedIn quite illuminating – and together with work and clients, the twain have constituted a great environment in which to discuss and develop ideas.

What ails Enterprise AI today?

Surely, with the large scale data pipelines companies now have access to, the low cost of cloud native solutions, and high level frameworks for building machine learning models, things should have become easier? Despite these upsides, enterprises still seem to be failing in their efforts to build AI programs, for many reasons. For one thing, building models has become easier than before. It takes less time to take (good enough or clean enough) data and build prototype models with it. Regardless of how many hypotheses you have as a business leader or data scientist, you’re more likely in 2021 to be able to collect data and build prototype models with it than you were in previous years. In the past, you may have had to go through several organizational hoops just to get your data, then prepare it, and only then build models. All of these processes have become simpler in 2021, thanks to enterprise data stores maturing, frameworks for building ML models becoming better known, and greater numbers of data scientists being available to build models. While things are still quite complex for the uninitiated, those on the growth curve in data science have found that this phase adds productivity to their prototyping efforts.

What hasn’t changed, though, is the process of taking these models to production. The model is largely seen as a software asset, and productionization of the model has been seen in this limited context. As we will discuss, it is important to challenge this mindset if we’re to build effective machine learning systems for production. The gap, therefore, between proof-of-concept models like those we’ve discussed above and production scale implementations of such models is large. Real world implementations are more complex and tedious, and often the hypotheses we want to build models for are better defined – this necessitates extensive data processing, profiling and monitoring. But the complexity doesn’t end there, even though there has been an effort on the part of MLOps practitioners to build end-to-end pipelines. You’ll note that none of these are ground-breaking realizations. MLOps is a practical field, thus far, intended to make all these models work for enterprises – but as we will see below, the practical nature of this field encompasses a number of domain, statistical, cultural, architectural and other considerations.

I wish to suggest, before diving deeper into this post, that this trend towards MLOps adoption represents a noteworthy change in how enterprises see ML system architecture in 2021, as opposed to the previous decade. In a manner of thinking, it represents a move towards the “plateau of productivity” in enterprise machine learning.

Considerations for Enterprise AI – from MLOps, Data Science and Data Engineering

Domain Considerations Matter in Data Science, Data Engineering and MLOps

I wrote several years ago on this blog that domain knowledge is an important element in doing data science. Back then, as a data science neophyte learning from early experience in pure data science roles, I had made several observations about the impact of domain understanding on how quickly we can arrive at hypotheses for formulating data/AI problems. Looking back, this was an important lesson, because I now acknowledge the importance of domain knowledge every time I work on a data science project, or each time that I enable a data science team to be successful. Whether this is my own domain knowledge or that of SMEs, I am grateful for it, because without it, we could build anything, and it wouldn’t ultimately matter to anyone. Domain knowledge gives purpose to data and AI efforts. Without speaking to the domain experts and SMEs in various projects (finance, manufacturing, retail, energy and other industries), there would be little to no chance of timely and cost-effective success in characterising, ideating about and solving these problems.

It may not be immediately evident, however, that domain considerations matter in MLOps (and DataOps). Without an understanding of data generating processes, data formats, sources, rates, types, and data organization patterns – data fields, tables and even some of the process characteristics – we cannot understand data generation or transformation processes in enterprise data pipelines. Nor can we understand how models are to be implemented, and what deployment means in different enterprise or customer contexts. When building and architecting machine learning systems, we end up needing to discover these details if we haven’t already. MLOps therefore cannot be ideated about in a vacuum, without consideration of the domain of the problem, or of the unique challenges of deploying models in that domain. MLOps in logistics and supply chain problems, therefore, will be quite different from MLOps in manufacturing, retail or banking domains.

For instance, if we were building a classification model to sort defective parts from good ones on a manufacturing shop floor, we may need a real-time deployment system, with consideration given to latency, edge based deployments of models, and opportunities to inspect models when downstream processes or metrics indicate process failure modes, and so forth. These considerations may not exist if we were building a system for enhancing ad revenue at a platform software company. The considerations there, around uplift from pushing ads to new customers, may require edge based deployments of a different kind, or federated learning, that may be unnecessary in the manufacturing example we discussed. To use an analogy, deployments are like different flavours of ice-cream, each requiring a different kind of appreciation. A failure to realize this may lead to difficulties in enterprises that inadvertently underestimate the complexity of MLOps, of their own domain processes, or both.
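
To make the manufacturing example a little more concrete, here is a minimal, hedged sketch of what a low-latency scoring service for such a defect classifier might look like. FastAPI, the `defect_model.joblib` artifact name, and the feature payload shape are all illustrative assumptions on my part, not a prescribed stack.

```python
# A hedged sketch of a real-time scoring endpoint for the shop-floor example.
# FastAPI and the model artifact name are illustrative assumptions.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("defect_model.joblib")  # hypothetical pre-trained classifier

class PartFeatures(BaseModel):
    values: list[float]  # sensor/measurement features for one part

@app.post("/score")
def score(part: PartFeatures):
    # Predict synchronously to keep latency low for the shop floor
    label = int(model.predict([part.values])[0])
    return {"defective": bool(label)}
```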

Simplistic, Linear Pipelines Don’t Get Us Over the Line

The current thinking around MLOps is somewhat simplistic and linear, and I mean this in a specific way. There is a lot of discussion around data workflows and pipelines, metadata generation and management, and the metrics around model training and model performance. These are discussions around the management, transformation and profiling of data. Datasets are important to MLOps pipelines, and insofar as agility in data science is concerned, I’d even say that they are primary.

However, this notion of thinking only about the software and application-level implications of models and their deployment doesn’t address some of the needs enterprises have from MLOps pipelines – notably, model interpretability and explainability, managing a diversity of deployment patterns (edge, batch, real-time or near-real-time), and the need to build repeatable pipelines and reproduce results. These problems cannot be broken down into just software applications; they require statistical rigour and attention to changing domain patterns. In fact, there is sometimes a desire on the part of ML engineering or MLOps practitioners to see these more statistical needs of MLOps as “not software engineering” and perhaps therefore “not easy to build for” – both of which may not be true, especially as the space of tools and implementations of statistical models for interpretability/explainability expands, just as ML implementations have expanded.

Imagine that you have built an MLOps pipeline to build a dataset for a specific use case, and deployed it and the model eventually, and all’s well. If there’s a need for a new use case, you’re likely to begin back at square one, and build new pipelines, especially if you don’t have a clear and unified data model. As we will discuss in a later section on architecture, this is important to consider in ML engineering – more than one use case may require your data pipeline. This also means that simplistic and linear pipelines can only serve a limited purpose when you’re required to build many such pipelines across enterprise workloads.

For instance, it is possible to compute SHAP scores for models given a specific dataset, and for companies with regulatory needs, there may be a reason to deeply analyze and publish results such as these. Therefore, MLOps shouldn’t only be about building simplistic DAGs or workflows in your YAML engineering tool of choice, or building and deploying metadata-tracked machine learning training/inference workflows. These are necessary, but insufficient for good MLOps implementations – chiefly because there are many other statistical and probabilistic considerations around MLOps which also deserve attention.
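
As an illustration, here is a minimal sketch of computing SHAP values for a trained model – the kind of artifact a regulated enterprise might analyze and publish. It assumes the `shap` package and a tree-based scikit-learn model; the dataset is a stand-in, not a real use case.

```python
# A minimal sketch: SHAP values for a tree-based classifier.
# Assumes the `shap` and scikit-learn packages; the data is illustrative.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)    # model-specific explainer
shap_values = explainer.shap_values(X)   # per-feature attributions per sample
shap.summary_plot(shap_values, X)        # global summary, suitable for reporting
```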

Data Architecture Before MLOps, but Business Needs First

There was an interesting discussion here recently around the theme of “Data before models, but problem formulation first”. The article in question describes the specific challenges of thinking about data science problems based on business problems, and of being “data-driven” in thinking about and building models for our hypotheses. I posit that a similar paradigm applies to MLOps. Data architecture understandably matters a great deal for successful MLOps implementations, because it encompasses very foundational organizational processes and needs around data collection, storage and management, governance, security and quality, access patterns, ETL/ELT, sandboxes for analytics, connections to BI and reporting systems, and so on. Ultimately, this complex web of processes and technologies (because data architecture is more than just storing and retrieving data) is meant to perform some function of the business. As W. Edwards Deming said, “Data are not collected for museum purposes” – they are collected for a decision to be made, or for some end use. In the world of MLOps, we enable such decisions to take place on top of the data provided to us through an enterprise data architecture like the one described above.

While typical enterprise data architectures are increasingly driven by the capabilities of tools and cloud scale applications (because of the economies of scale of cloud providers, and the low barriers to entry), there is an important set of decisions every enterprise data architect has to answer for, around the specific needs of the organization and how the architecture in question enables them. Seemingly trivial decisions taken at the design phase of a data lake or data warehouse can have long lasting implications for the delivery of value from analytics, machine learning and MLOps. Data architecture is certainly important for MLOps, but the more fundamental needs of the organization – the kind of data required, its strategic importance, the decisions that need to be made across use cases, security and access patterns for data analysis and data science, and many more operational aspects of data – all of these have a bearing on MLOps effectiveness too. So if you’re a data scientist or MLOps practitioner looking to improve your impact and effectiveness in solving problems, understand the underlying data architecture more deeply first. Sometimes, doing this can be hard – especially if there are no stakeholders who can explain it well – but this kind of fundamental understanding and context is highly underrated and has an outsize impact on the eventual success of data science and ML programs.

The Enterprise Model Sanctuary: Many Simpler Models, A Few Complex Models, and Other Combinations

A cursory glance at machine learning and MLOps forums, discussions and content indicates that the thinking around model development techniques is method centric, and not business centric. A large number of the discussions are a consequence of what’s required for companies at scale innovating on a few complex models with huge amounts of data – and these are legitimate and interesting discussions for sure. For example, most MLOps discussions I have come across seem to discuss the deployment of deep learning models. They discuss text and unstructured data processing, and complex image processing pipelines. Whether it is the use of tools like Kubeflow for training and deploying models in a distributed fashion, or the use of MLFlow for tracking metrics and performance, these are all legitimate considerations that may solve subsets of the ML deployment space. However, machine learning state-of-the-art is rarely required for enterprises looking to get value out of their specific use cases. The large majority of use cases in the industry call for simpler models, though, and this is why simpler pipelines could do a large part of the value creation. I say this from experience and with confidence, having seen numerous projects where managers struggle to make sense of ML outcomes for their business, but have less difficulty making sense of data aggregations, summaries and statistics based on the data. The enterprise model ecosystem is more likely to resemble a zoo, or even more accurately a sanctuary, of different models, where each model may have its own specific needs and requirements.

Model development in mature organizations generally follows the careful evaluation of data and the evidential findings from it on merit, with hypotheses explored subsequently. Enterprises at lower levels of maturity have difficulty getting value from such an approach, however, and many leaders there may still rely on dashboards and reports. Clearly, there is an important and untapped market in business intelligence from big data. There is also a huge market for implementing simpler models based on clearly defined hypotheses. In many cases, enterprises may need many such simpler models, one for each stratified part of a specific use case. For instance, if you’re a market research firm estimating sales in a market segment, you may wish to build one such model for each sub-segment. If you are an equipment manufacturer doing quality checks using machine learning models, you may wish to use attribute based classification models, one for each product line, and perhaps you want to build many of them. The true value of MLOps in these cases is not in managing the complexity of deployment for one complex model, but in enabling many simpler models to be taken to production quickly and efficiently – a pattern sketched in code below. These simpler models may then provide a baseline with which to build more complex models as needed.
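
A hedged sketch of this “many simpler models” pattern follows: one small, explainable model per segment, trained in a loop. The function and column names are illustrative placeholders, not a specific framework’s API.

```python
# A sketch of training one simple model per segment of a use case.
# Column names and the returned model dict are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_segment_models(df: pd.DataFrame, segment_col: str,
                         feature_cols: list, target_col: str) -> dict:
    models = {}
    for segment, group in df.groupby(segment_col):
        model = LogisticRegression(max_iter=1000)
        model.fit(group[feature_cols], group[target_col])
        models[segment] = model  # one explainable baseline per segment
    return models
```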

Machine Learning Systems are Stochastic, Not Deterministic

Perhaps I’m stating the obvious, but it needs to be said. The underlying nature of data generating processes and machine learning models is stochastic, not deterministic. Whether we’re talking about manufacturing process metrics, banking and finance transaction data, or energy sector data around load, power and usage – all of these data are generated from stochastic data generating processes, even if they come from engineered systems. Machine learning models are also never exact mathematical formulations – they are almost always stochastic processes. There is a little to unpack here, so I’ll get into a few instances. What this stochasticity means is that machine learning models exhibit variability in results from situation to situation, and that this will be quite evident in production. In order to begin building machine learning systems, we need to perform exploratory data analysis prior to training time, prepare features for our hypothesis, check assumptions based on the features and the model formulation, and then build models and evaluate them. It also means that we need to build safeguards to ensure that these assumptions remain valid when doing production scale inference. It means that we may have to reformulate problems as the underlying conditions of the data generating process change. In the case of deep learning, sophisticated tensor transformations are also required as part of the normal training loop.

When the model is eventually trained to the required level of performance and rendered, it too represents a solution at a specific point in time. MLOps is not about “train once, deploy everywhere”, but about “routine retraining and redeployment”. This makes ModelOps and the continuous training lifecycle of model development as important a consideration in MLOps as DataOps is. A lot of discussion around MLOps today is centered around data preparation – and the motivation for this, of course, is the fact that there are significant data preparation challenges that data scientists face. However, model training in the real world cannot be wished away despite the prevalence of AutoML, although AutoML tools are one path for progress. As of 2021, for most use cases, model definition and training is still done manually, even if tuning and optimizing the model are automated. In MLOps lingo, we are referring to the importance of using feature stores, and their impact on data drift and concept drift analysis. While a healthy discussion is in progress on these topics, the instrumentation in actual implementations of data drift and concept drift identification and measurement tends to vary – a minimal example of such a check is sketched below. Some tool chains are ready for this change, and others just aren’t.
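
As a concrete (if simplified) illustration of the kind of instrumentation meant here, the sketch below flags data drift in a single feature by comparing its production distribution against the training baseline with a two-sample Kolmogorov–Smirnov test. The significance threshold is an illustrative choice, not a recommended default.

```python
# A minimal data drift check: compare a production feature's distribution
# against its training baseline. The alpha threshold is an assumption.
from scipy.stats import ks_2samp

def check_feature_drift(train_values, prod_values, alpha=0.01):
    statistic, p_value = ks_2samp(train_values, prod_values)
    drifted = p_value < alpha  # reject "same distribution" at level alpha
    return drifted, statistic, p_value
```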

More broadly, some MLOps implementations may account for these stochastic and probabilistic characteristics of ML systems, because their data scientists ask the hard questions after training and before or during deployment. It is likely, though, that most MLOps implementations today treat models merely as pieces of software. The latter pattern leads to technical debt of various kinds unfolding later in the lifecycle of the system. This technical debt currently takes the form of retrofitting regulatory checks, interpretability analysis, metadata logging, model performance metrics, and so on – and over time, this set of secondary considerations may grow much bigger.

Changing Skillsets and Roles for MLOps

Companies looking to hire top ML talent as of 2021 are pushing for a greater number of high quality data engineers with MLOps skill sets. This is in contrast to emphasis on data science hiring in the past. Hiring pipelines for data and AI roles (I’ve seen a few different ones over the last few years) tend to emphasize programming, statistics, databases and specific technologies for data science – of late, this is largely SQL, Python, with a smattering of distributed frameworks and tools, and skill sets in deep learning, tabular data analysis and the associated frameworks and tools for solving problems in this space. For data engineering roles, over the years I’ve seen skill requirements specifying systems programming and strongly typed languages such as Java and Scala, experience working on JVM languages, in addition to SQL, databases, and a lot of the back-end software engineering skill sets we see for application developers elsewhere. For data engineers working on big data technologies, there’s very often a need to be familiar with NoSQL databases, or graph databases, depending on the role and use cases, in addition to the Hadoop-and-friends ecosystem, and cloud engineering skills such as AWS or Azure. While the data scientist’s role and skill set has come to include domain considerations, advanced statistical and ML models, cloud-native and large scale data science and deep learning and communication/presentation of data and insights, the data engineer’s role has become broader around systems engineering and design.

Someone said (in fact, in this talk) that data engineers ought to build frameworks, not pipelines – and this is a fair assessment of how to use this broad and useful data engineering skill set. There has been a healthy discussion in various forums and talks on ML engineering roles, which combine elements of these two different skill sets. All of these conversations around skill sets are important context for where we’re heading in the data science and engineering space overall. MLOps engineers, unlike DevOps engineers before them, should not be constrained by the limited value addition possible outside of data science or data engineering roles (the bulk of DevOps roles are administrative in nature). They cannot be construed as, or see themselves as, configuration file engineers, for lack of a better term. In fact, their role could be much broader – as systems engineers spanning a range of capabilities in both data science and data engineering, while not necessarily possessing deep expertise in any one of these (themselves diverse) areas. MLOps roles should perhaps also emphasize domain knowledge or expertise of some kind – since ultimately, the outcomes here are practical and related to business value from ML. There are many outcomes and opportunities for talent and skill sets for sure, but these stand out as being relevant. What is certain is that the data scientist’s role has changed (as has the data engineer’s), and the old and unyielding challenges faced by data scientists are taking on new definitions and manifestations – thereby requiring new mindsets, new skill sets, and new processes to come forth.

In my view, this churn in the extant data science and engineering role paradigm is a welcome development, because enterprises today want to realize value from DataOps and MLOps simultaneously. As we will discuss later in this post, while models are important, business managers will continue to derive value from analytics and reports – and perhaps there has never been a better time to build on that need than 2021. The current emphasis on data engineering roles is also well-founded. From practical experience as a data scientist who has worked on a range of problems, from relatively simple ML to complex deep learning models, I will happily acknowledge that the data engineers I have worked with were indispensable to the success of the projects I succeeded on. However, leaders hiring for ML roles should not conclude that the role of the data scientist is no longer required. I believe this emphasis on data engineering is a passing trend, as enterprises build the foundational pieces that enable value from data. The focus will then shift once again to business value from data, which automatically means that statistical, data science and ML skills will continue to be in vogue through this shift and afterwards.

Don’t Ignore Decision-Making Culture

Organizational culture matters a lot for the success of MLOps, as much as it does for any digital transformation program. MLOps represents, in a way, a desirable end-state, or the happy marriage of data science and data engineering, in a given enterprise and data architecture context. However, both data science and engineering can only be valuable and effective in organizations whose leaders think and talk about data, and use the insights from that data for taking decisions. The latter is a cultural synthesis, and not just a technology adoption process or workflow that one can execute on demand. Being a cultural matter, it has to do with behavioural and attitudinal patterns that ultimately enable data and insights to be used for decision making.

The adoption of data driven decision making represents a shift from thinking about business processes, systems and decisions in terms of rules (“Rules are for lazy managers”, to paraphrase Simon Sinek), to an open-minded thought process around data and AI systems. When leaders stop thinking in terms of rules, and start thinking in terms of systems, they are often imagining situations of change, synthesis, formation and deformation of patterns, structures and interactions. They begin to see their role as an influencer more and as a commander less, and this shift in thinking can enable them to make subtle changes to their managerial approach, driven by data.

In the earlier post I wrote about OODA, and the AI-enabled generalist, there is a point I make about the decision making language of organizations. This kind of development of a decision making language requires a way of thinking about the enterprise’s systems, processes, and also the ML models in new ways. It requires an openness of mind in decision making to adopt models as thinking tools. In a sense, the modern AI-empowered generalist could be seen as a prototype for a supreme pragmatist. Enterprises want rational actors at their helm, at least for the functions that require data driven decision making – and such rational actors can be groomed in a culture that doesn’t shy away from challenging the current rules and norms on decision making, and is willing to look at data and models.

Data/AI Exponents as SMEs and Future Leaders

Organizations come to embrace data, ML or even MLOps so that they can ultimately derive value from data, and this cannot be done without talent that unlocks that value. Whether this is data science talent or data engineering/architecture talent, these roles have both a topical/functional need and a strategic value in enterprises, and the latter tends to be overlooked in data strategy. This is because of the value such individuals accumulate over time as they build data pipelines and AI/ML models, accumulating a great deal of knowledge about business processes, customers and the domain in the process. When you have a data scientist in your team who has built a few different models that explain different elements of your business, processes or customer behaviour, they become an invaluable asset both for developing further models, and for analyzing customer, business or process behaviour. Such individuals can also become effective leaders and transition to process management roles.

MLOps and DataOps engineers in an organization can therefore themselves be considered Data/AI SME roles – and this is an important source of value that is often overlooked in organizations. A lot of organizations still see data/AI resources as just means to an end, but in fact, many of these roles can become storehouses of domain knowledge. MLOps can potentially enable the tacit knowledge of such individuals to be effectively captured for process management as well – this may be an important opportunity for value creation from MLOps. MLOps can also accelerate the development of data-driven leadership talent: individuals who are exposed to the models used to take decisions, and to the specific mechanisms of taking such decisions, develop greater potential for process leadership.

In an earlier post, I discussed the importance of higher-level decision making languages, the OODA decision making loop, and how AI can enable a new generation of generalists. I would suggest that this is a useful idea to consider in the broader context of building a data-driven decision making culture.

“Data Before Models” also implies “Models After Data”

The purpose of this heading is to draw attention to the fact that the best data pipelines won’t help if we aren’t doing much with the data we prepare. We have to eventually build models of one kind or another with this data, for actually taking decisions. Many recent discussions around MLOps talk about data-centric AI, and above, we have discussed data architecture and other elements of enterprise systems and culture that contribute to MLOps success. We have also discussed the stochastic nature of data generating processes and machine learning systems. There are important implications from the core ModelOps processes as well, and we will discuss them here, finally. The process of developing models, as I have discussed above, has become easier now than ever before, at least in software terms. The careful formulation and evaluation of model hypotheses, statistical analysis of the input data and features, and the checking of assumptions – these remain as hard, tedious and non-trivial as they were before. This underscores the importance of statistical analysis and exploratory data analysis. Without these foundational steps, ML models can be built with high bias or high variance, setting the use case up for higher failure rates and lower effectiveness overall. This bears introspection and repetition, since there seem to be two schools of machine learning and data science professionals – one group that believes strongly that mathematical and statistical thinking are important for doing data science, and another that thinks otherwise: that the software elements of data science modeling can be learnt by someone without knowledge of statistics or machine learning.

In my experience, statistical analysis and EDA are fundamentally important for machine learning – they form an integral part of extracting value from the data we have, and making sense of it, before we solve problems. A number of business situations require us to think in terms of data distributions and stochastic processes. To build things that scale within MLOps pipelines, some of us may need to have an open mind about exploring the mathematical underpinnings of things like gradient descent, batch normalization or activation functions. This open-mindedness is important for a key reason – a lot of MLOps engineers being trained today may assume that the data science is easy, or trivial, because people who don’t know statistics are building models, or because they can, if they just follow a simple workflow. I know this to be patently untrue – if you want to develop a model worth anything in an enterprise, you may have to start from formulating and thinking about the business problem, get to the EDA and statistical analysis, build out tests for assumption checking (a small example follows below), and then experiment with different models. You have to get into the probability and statistical analysis eventually, or you will be forced to rediscover the effectiveness of these mathematical and scientific methods. Even if you manage to build one or a few models, there will be situations where you’re required to explain them. Not only are ML engineers or data science engineers more confident when they are able to reason about the mathematics of machine learning, but their ability to build and scale systems for the enterprise improves. Their ability to think about the implications of these models for different related use cases, deployment modes, source data, and data quality considerations also improves. By checking assumptions on the features, they can stave off big challenges that may otherwise arise when the model is implemented in production.
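
For instance, a small, hedged example of pre-modeling assumption checks might look like the sketch below, flagging near-constant and highly collinear features before training; the thresholds are illustrative, not recommended defaults.

```python
# A sketch of simple pre-training assumption checks on a feature table.
# Thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

def basic_feature_checks(df: pd.DataFrame, var_threshold=1e-6,
                         corr_threshold=0.95) -> dict:
    variances = df.var()
    near_constant = variances[variances < var_threshold].index.tolist()
    corr = df.corr().abs()
    # Look only at the upper triangle to avoid double-counting pairs
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    collinear = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return {"near_constant": near_constant, "highly_collinear": collinear}
```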

Statistical analysis and machine learning model development have been core and will be core to data science, regardless of the peripheral engineering required for realization of value. Data engineering and MLOps as allied fields help realize this value at enterprise scale. It is the process of data science and model development that ultimately converts data into insights – and insights are the primary purpose of investing in enterprise data and AI projects and programs in the first place. They will therefore continue to be a good bet for practitioners in future – as long as they realize that those skills alone cannot take them over the finish line.

Concluding Remarks

I hope that you’ve benefited from reading this rather lengthy post on MLOps and Enterprise AI. If nothing else, writing it allowed me to explore my own experiences, document a few patterns I see in the development of truly enterprise ready AI, MLOps toolchains and capabilities, and also explore sources of value from MLOps for enterprises. If you have questions or ideas, please leave a comment or tweet to me at @aiexplorations.

Further Reading/Listening

  1. Data Science is Different Now, by Vicki Boykis
  2. Problem Formulation Comes First by Brian Kent on Crosstab.io
  3. Build Frameworks, Not Pipelines – a Data Engineering Talk on PyData
  4. From Model-Centric to Data-Centric AI – a discussion on Enterprise scale AI with Andrew Ng and others
  5. ML Engineering for Production – another discussion on ML for production with Andrew Ng and others

Statistical Competence and Its Importance for Good Data Science Careers

In 2019, enterprises routinely begin initiatives related to analytics, data science and machine learning that invoke specific technologies from a very early stage. This tendency to put technology ahead of value sometimes extends to analytics champions and managers who take up or lead data-intensive initiatives. While this may seem pragmatic at one level, at another it may lead to significant problems in ensuring successful outcomes from such analytics initiatives and programs. In this post, I’ll address the three-pronged conundrum of statistical competence in the data science world, specifically in the context of data science consulting and services, and what it means for the careers of data science candidates now and in the future.

Hiring Statisticians: An Expert’s View

Kevin Gray is one of my connections on LinkedIn who posts insightful content on statistical analysis and related topics on a regular basis, including very good recommendations for books on various statistical and analytical techniques and methods. One of his recent posts was an article he’d authored titled “What to Look For in a Statistician” (the article, and my comment), which definitely resonated with my own experiences in hiring statistically competent engineers in different settings, such as data science and machine learning, between 2015 and today. In years past, I had similar experiences when hiring competent product engineers and manufacturing engineers for data-intensive problem solving roles.

The importance of statistical thinking and statistical analysis in business problem solving cannot be overstated. However, even good advice that is canon and well-acknowledged often falls on deaf ears in the hyper-competitive data science job market. Both hiring managers and recruiters tend to emphasize keywords comprising the latest framework or approach, over the ability to think critically about problem statements, carefully architect systems, and rigorously apply statistical analysis and machine learning to real world problems while keeping considerations of explainability in mind.

The Three-Pronged Conundrum of Data Science Talent

Now you might ask why I say this, and what I really mean by it. The devil, as they say, is in the details, and one essential problem with the broad proliferation of highly capable tools, frameworks and applications that can perform and automate statistical analysis of different kinds is the following three-pronged conundrum:

  1. Lack of core statistical knowledge despite a working knowledge of advanced techniques: Most candidates in the data science job market who are deeply interested in building data science and ML applications have unfortunately not developed skills in the core statistical sciences and statistical reasoning. Since statistics is the foundation for machine learning and data science, this degrades the quality of projects and programs which have to rely on such talent. When candidates prefer to let software do most or all of the thinking for them, their own reasoning is rarely good enough to critically evaluate different statistical formulations of a problem, because familiarity with the tools has conditioned them to think about problems in very set and specific ways.
  2. Tools as an unfortunate substitute to statistical thinking: Solutions, services and consulting professionals in the data science and advanced analytics space, who have to bring their best statistical thinking to client-facing interactions, are unable to differentiate between competence in statistical thinking, and competence in a specific software tool or approach.
  3. Model bloat and inexplicability: The use of heavy, general purpose approaches that rely on complex, less explainable models, rather than reliance on simpler models constructed upon a fuller understanding of the true dynamics of the problem.

These three sub-problems can derail even the best envisioned data science and machine learning initiatives in product / solution delivery firms, and in enterprises.

Some “Unsexy” Characteristics of Good Data Scientists

These are also not “sexy” problems – they’re earthy, multi-dimensional, real world problems that have many contributing factors, from business and how it is done, to the culture of education and the culture of software and solution development teams. Kevin Gray in his post touches upon attitudinal qualities for good statisticians, which could also be extended to data science leaders, data scientists and data engineers:

  1. Integrity and honesty are important in data science – especially in a world where personal data is handled carelessly and sometimes gratuitously by many applications without heed to data protection and privacy, and where user data is taken for granted by many technology companies. This is not an easy attribute for hiring managers to evaluate, since it is only long association with someone that allows us to build a model of their integrity, and rarely can one effectively determine such an attribute in short interviews. What’s dismal about data science hiring sometimes is the proliferation of candidate resumes full of fluff, and the tendency of candidates to not stand up to scrutiny on skills they identify as “key” or “core” skills.
  2. Curiosity and a broad spectrum of interests – this cannot be overstated in the context of a consulting data science or machine learning expert. The more we’re aware of different mental models and theoretical frameworks of the world and the data we see in it, the better we’re able to reason starting from hypotheses about the data. By extension, we’re better able to identify the right statistical approaches for a problem when we start from and explore different such mental models. The book I’ve linked to here by Scott E. Page is a fantastic evaluation of different mental models. But with models come biases – to restate George E. P. Box’s famous quote, “All models are wrong, some models are useful”.
  3. Checking for logical fallacies is key for data science reasoning – I would add to the critical thinking element mentioned in Kevin’s post, by saying that it behooves any thought leader such as a data science consultant to critically evaluate their own thinking by checking for logical fallacies. When overlooked, a benign piece of flawed reasoning can turn into a face-melting disaster. The best way to ensure this does not happen is to critically evaluate our ideas, notions and mental models.
  4. Don’t develop one hammer, develop a tool box – Like experienced plumbers, carpenters or mechanics, data scientists today should not promote one technique with quasi-religious fervor at the cost of others, as deep learning has come to be promoted in some circles as a data science panacea. Instead, the effective data scientist is usually pragmatic in their approach. Like a tailor or carpenter who has to cut or join different materials with different instruments, data scientists today do not have the luxury of settling into one comfortable model of thinking about their tool set and profession – and any attempt to do so can be construed, at best, as laziness (especially for the consulting data scientist). While the customer is always right, there are times when the client can be wrong, and it is at these times that they need the advice of a qualified statistician or data scientist. If there is one time when data scientists should not abandon their statistical thinking, it is this kind of situation.

Concluding Remarks

To conclude, data scientists ought not to be seen as resources that take data, analyze it using pre-built tools, and write code to explain the data using pre-built libraries of various kinds. They’re not software jockeys who happen to know some statistics and have a handle on machine learning workflows. Data scientists’ work scope and emphases as industry professionals and consultants go way beyond these limited definitions. Data scientists are expected to be dynamic, statistically sound professionals who critically evaluate real world problems based on theories, data and evidence drawn from many sources and contexts, and progressively build a deeper understanding of these real world problems that lead to tangible value for their customers, be they businesses or the consumers of products. The sooner data scientists realize this, the better off they will be while charting out a truly successful and fulfilling data science career.

Backpropagation and Gradient Descent

Sometimes the simple questions can be revelatory and make one think about the possibilities we have in front of us to improve existing processes, systems and methods.

On this occasion, it was a simple question on Quora about gradient descent and the ubiquitous backpropagation algorithm used in neural networks and deep learning. The content of my answer follows.

The process of computing the weights and biases which minimize the error from a neural network could use any optimization algorithm that is good enough for the job. Gradient descent (especially the versions of the algorithm which use momentum and RMS propagation) is especially effective, and has well implemented matrix algebra formulations in languages like C and Python, which makes it used often. Equally though, a genetic algorithm or simulated annealing algorithm (which are more complex and computationally intensive) may be used for finding such weights and biases on each iteration. Indeed, such methods have been and are being researched extensively.[1]

Backpropagation is defined by four equations that help calculate new weights and biases to update a neural network.[2]

The first of these equations helps calculate the error at the output layer. The second helps calculate the error in a given layer based on the error in the next layer. The third and fourth equations help calculate the rate of change of the loss function C with variations in the weights and biases.
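
For reference, in the notation of the source in footnote [2] – where $\sigma$ is the activation function, $z^l$ the weighted input to layer $l$, $a^l$ its activation, and $\odot$ the elementwise product – the four equations are:

$$
\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j
\end{aligned}
$$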

Therefore the algorithm itself can be written out as follows (paraphrasing the steps from footnote [3]):
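
  1. Input: set the activation $a^1$ for the input layer.
  2. Feedforward: for each layer $l = 2, 3, \ldots, L$, compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
  3. Output error: compute the error $\delta^L$ at the output layer.
  4. Backpropagate the error: compute $\delta^l$ for each earlier layer, working backwards from $\delta^{l+1}$.
  5. Output: read off the gradients $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$ from the errors.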

We then use the gradient of the cost function to compute the new values of w and b, based on things like the learning rate and regularization parameters as applicable.
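
In code, a single vanilla gradient descent update of this kind might look like the following sketch; `grad_w` and `grad_b` would come from backpropagation, and the learning rate `lr` and L2 regularization strength `lmbda` are illustrative hyperparameters.

```python
# A minimal sketch of one gradient descent update with L2 regularization.
# lr, lmbda and n (training set size) are illustrative assumptions.
import numpy as np

def gd_update(w, b, grad_w, grad_b, lr=0.1, lmbda=0.0, n=1):
    w = (1 - lr * lmbda / n) * w - lr * grad_w  # weight decay + gradient step
    b = b - lr * grad_b
    return w, b
```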

Why gradient descent? Since the process of backpropagation is iterative (we go from steps 1 – 5 and back again), with each update we can get better and better versions of the weights w and biases b that reduce the error between the target and the result produced by the network. The following animation (source: Data Blog) probably gives you an idea (the red areas are higher values of cost, and blue means lower values).

A nice graphic illustrating gradient descent

Now, you might ask: can’t other algorithms be used to do the same thing? The answer is indeed yes. We can use many other optimization algorithms (constrained and unconstrained ones, used for convex and non-convex functions). If you would like to learn about convex optimization with theoretical treatment in more detail, consider this resource: Convex Optimization – Boyd and Vandenberghe. In addition to other convex optimization methods, there are scores of robust optimization methods, such as the following (a toy sketch of one of them appears after the list):

  1. Genetic algorithms
  2. Particle Swarm Optimization
  3. Simulated Annealing
  4. Ant Colony Optimization
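
To make the point that non-gradient optimizers can also search a loss surface, here is a toy sketch of simulated annealing minimizing a non-convex one-dimensional function; the cost function, cooling schedule and step size are illustrative choices, not a recipe for training networks.

```python
# A toy simulated annealing loop for a 1-D non-convex cost function.
import math
import random

def simulated_annealing(cost, x0, temp=1.0, cooling=0.995, steps=5000):
    x, best = x0, x0
    for _ in range(steps):
        candidate = x + random.gauss(0, 0.1)  # propose a random move
        delta = cost(candidate) - cost(x)
        # Accept downhill moves, and uphill moves with probability e^(-delta/T)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x = candidate
        if cost(x) < cost(best):
            best = x
        temp *= cooling  # cool gradually to settle into a minimum
    return best

# Example: a function with many local minima
print(simulated_annealing(lambda x: x**2 + 10 * math.sin(x), x0=5.0))
```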

While some of these, especially GAs and PSOs, have been explored in the context of neural networks, common implementations of deep learning algorithms still rely on the gradient descent family of algorithms (such as Nesterov momentum – which has come to be implemented in distributed paradigms – RMSProp, and Adam).

Footnotes

[1] https://arxiv.org/pdf/1711.07655…

[2] Neural networks and deep learning

[3] Neural networks and deep learning

A Simple CNN Tutorial in Keras

In the last year or so, I have begun working extensively with Keras, Tensorflow and CNTK for various problems at work in industries ranging from manufacturing, to media, to cybersecurity.

Here is a simple convolutional network tutorial on Kaggle that I developed in Keras and Tensorflow. Given the GPU-enabled kernels available within Kaggle these days, it has become easy enough to train models on large scale image data in some of these kernels. Performance is another matter, though, since the Tesla K40 GPUs you get here are lower end GPUs, and are also load balanced across multiple users. In any case, Kaggle allows you to even try out CUDA code – and that opportunity can’t be beat, given the low cost of working on Kaggle.

My motivation for putting together a tutorial is not the dearth of tutorials – there are enough and more out there. However I wanted to emphasize certain good practices here, and intend to continue to update the kernel in question in future to illustrate those.

Caveat: The internet is awash with tutorials on deep learning using these frameworks, so I won’t dwell much on why this tutorial is different, because it isn’t very different. That said, it does emphasize how a simple deep learning model can be made more effective by using various good practices, such as batch normalization, some explanations about loss functions, and some amount of data exploration in the context of the data and labels for this supervised problem. A sketch of such a model follows.
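
As a flavour of the kernel’s contents, here is a hedged sketch of a small CNN with batch normalization in Keras; the layer sizes and input shape are illustrative, not the exact architecture of the Kaggle kernel.

```python
# A sketch of a small CNN with batch normalization in Keras.
# Layer sizes and input shape are illustrative assumptions.
from tensorflow.keras import layers, models

def build_small_cnn(input_shape=(28, 28, 1), n_classes=10):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.BatchNormalization(),   # normalize activations between layers
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```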

Achieving Explainability and Simplicity in Data Science Work

This post stems from a few of the tweets I’d authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored on the data science road to perdition. Doing data analysis, or building models from data without understanding the domain and the relevance of the data and factors one is using for these models, is akin to “data science suicide”. It is a sure shot road to perdition as a data scientist. Domain knowledge is also hard to acquire for data scientists, especially those working on projects as consultants, and applying their skills in a consultative, short-term setting. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering set up or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something that they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects
  4. Strive for the usefulness of models, not to build more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril. Hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, some models are useful” – and this couldn’t be more true than when dealing with models built from hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, a debate constantly exists on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge, and test enough hypotheses. With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision and care may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and RoboticDataScience are often discussed in the context of machine intelligence and speeding up the process of insight generation from data. However, for some applications, it may be a better idea in the short term to ensure that the feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning (justifiably) has seen a great deal of hype in the recent past. However, it cannot be seen as a panacea to all data analysis. The end goal from data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.

I’m not discussing the many technical aspects of building explainable models here. For one, these technical aspects are contextual and depend on the situation; additionally, the tone of this post and the tweets is lighter, to encourage a discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you like something on this post, or want to share any other related insights, do drop a comment, or tweet to me at @rexplorations or message me at LinkedIn.

Key Data and AI trends in 2017

This year, 2017, has been quite a busy year for artificial intelligence and data science professionals. In some ways, this is the year when AI truly began to be debated and discussed, from frameworks and technologies to ethics and morality. This is the year when opportunities for AI-driven improvement in businesses began to be examined critically by diverse industry professionals and academicians. With good reason, machine learning and deep learning came to be placed at the top of Gartner’s hype cycle. We’re really at the peak of inflated expectations when it comes to ML/DL – with opportunities to shorten the time we take to reach measurable and direct consumer value.


Gartner Hype Cycle for 2017

Overall, in my experience, three key trends that enterprises welcomed in 2017 include:

  1. Simplification of cloud and data infrastructure services
  2. Improved and democratized scalable machine learning and deep learning
  3. Automation in key AI, ML and data analysis tasks

Improving Cloud and Data Infrastructure

Perhaps the foundational enabler for the data strategy of many enterprises that I have seen and worked with in 2017 is the availability of an easily operated and managed, scalable cloud infrastructure. This promise of a high performance, low cost and (arbitrarily) scalable cloud infrastructure was made as early as 2014, but has taken a few years to materialize as a truly viable, business-wise feasible commercial offering from stable, top-tier technology firms. Prominent cloud vendors such as Google Cloud, Microsoft Azure and Amazon’s AWS have upped the ante, while veterans like Hortonworks and Cloudera continue to hold sway. This space where the cloud vendors are competing is ripe for consolidation, in my view, although we can expect to see converging architectures before viable consolidation that isn’t entirely wasteful can happen.

Other notable developments on the cloud infrastructure side of things were ideas such as serverless compute (which enterprises are definitely warming up to – and it shows, in the Gartner Hype Cycle), production-ready pre-built models for common tasks offered as APIs (a trend that continues to inspire software/AI application architecture), and the maturing of streaming and real-time data processing frameworks. By combining these capabilities in cloud platforms, cloud providers have really upped their offerings in 2017 compared to before, and provide formidable capabilities – which in my view haven’t even been explored as much as they should have been by businesses.

Despite the availability of such production-ready, cost-effective and scalable data management systems in the cloud, cloud infrastructure nevertheless came under scrutiny in 2017 for massive security lapses and downtime. To speak of specific examples, we had the biggest impact events in cloud reliability and data security history between the Equifax data breach and the massive AWS outage, to say nothing of the numerous smaller-scale data security episodes attributable to hacktivism, such as the Panama Papers.

As a counter to some of these incidents and the rise of the GDPR and other data protection regulations, numerous cloud providers have been offering “private cloud” solutions, along with region-specific hosting options for banks and other organizations that deal with regulation-sensitive data.

Additionally, it would be unfair not to point out how much containerization has helped cloud providers in 2017. Massive scale adoption of containerization using Docker and Kubernetes has enabled virtual environments to be set up and managed for complex, data intensive development and deployment tasks.

Spark and Tensorflow

The space of scalable machine learning frameworks continues to be dominated by Apache Spark – which has found many friends among data engineers and scientists in production, especially after the 2.0 release, given its equitable performance for the data frame APIs across languages. So, whether you program in Python, R or Scala, you can be assured of the same high performance from Spark these days. Spark ML has expanded on the capabilities of Spark MLlib, and in its recent releases, Spark has also polished and unified the interfaces for streaming data analysis on Spark Streaming and graph analysis via GraphX. As someone who has seen teams use Spark for different purposes and built frameworks on it in 2017, the differences between versions 1.6 and below and versions 2.0 and above are significant, and the newer versions are more polished and consistent in their behaviour.

Tensorflow received a lot of hype but only lackluster adoption in late 2016 and early 2017; over the last several months, however, it has made a strong case for itself, and adoption has grown significantly. As developers have warmed up to the framework, and as more language interfaces have been developed for Tensorflow, its popularity has soared, especially in the latter half of 2017. Another factor in the development and adoption of Tensorflow is the widespread use of GPU based deep learning. The core Tensorflow development team’s additions to 1.0 (as explained by Jeff Dean here) have made it a mature deep learning development package, and perhaps the most widely used and sought after deep learning framework. While Torch makes an impression and is widely loved (especially in its PyTorch form), Tensorflow is hard to beat for the speed and dynamism of its high quality open source contributors. At Strata Singapore 2016, I sat through a tutorial on Tensorflow 0.8, and what I saw then contrasts with what I see in versions 1.0 and higher. My recent brushes with Tensorflow have made me more convinced that this is the framework to learn for deep learning developers at the moment. The presence of wrappers and higher level interfaces, such as Keras, has made Tensorflow very easy to use for entry-level and intermediate programmers and data scientists.

Automation in ML, DL and Data Science

Without a doubt, the development of techniques to automate parts of ML and DL development is one of the biggest and most important directions within the field of Artificial Intelligence in 2017. Taking after Leo Breiman’s random forests (an ensemble of “weak learners” resulting in a machine learning model with high performance) and various advancements in deep learning and machine vision (especially convolutional neural networks, which essentially encode complex features using simpler features in computer vision problems), automated hyperparameter optimization was probably the first step in the general direction of automated machine learning.
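
For a concrete sense of what that first step looks like in practice, here is a sketch using scikit-learn’s RandomizedSearchCV to automate hyperparameter tuning for a random forest – the search space and dataset are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for a real problem
X, y = make_classification(n_samples=2000, n_features=25, random_state=42)

# An illustrative search space over random forest hyperparameters
param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Randomized search automates what used to be a manual tuning loop:
# sample candidate configurations, cross-validate each, keep the best
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```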

Frameworks like AutoML (see the talk by Andreas Mueller above) have been the cynosure of this kind of research, and companies small and large have begun attempting different approaches to the context modeling problems that arise from the need to automate data science. While most approaches towards machine learning have taken a classical route, finding computational ways to learn more and more from data, some have taken non-traditional approaches, combining ideas from expert systems, rule-based inference engines and other techniques. A novel development has been the invention of generative adversarial networks (GANs), which could lead to hitherto unseen improvements in the use of computationally generated data as a starting point for understanding the best representations of a given dataset. Despite being invented in 2014, it is in 2017 that implementations of this kind of network became popular and came to be considered a viable neural network architecture for computer vision and other kinds of machine learning problems.
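
To make the adversarial idea concrete, here is a deliberately simplified Keras sketch of the two-network setup – the dimensions, data distribution and training schedule are illustrative assumptions of mine, and a production GAN would need considerably more care.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

noise_dim, data_dim = 16, 2  # illustrative: generate simple 2-D points

# Generator: maps random noise to candidate samples
generator = Sequential([
    Dense(32, activation="relu", input_shape=(noise_dim,)),
    Dense(data_dim),
])

# Discriminator: classifies samples as real (1) or generated (0)
discriminator = Sequential([
    Dense(32, activation="relu", input_shape=(data_dim,)),
    Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer=Adam(0.001), loss="binary_crossentropy")

# Combined model trains the generator to fool the (frozen) discriminator
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer=Adam(0.001), loss="binary_crossentropy")

# "Real" data here is just a Gaussian blob, standing in for a real dataset
real_data = np.random.normal(loc=3.0, scale=0.5, size=(10000, data_dim))

for step in range(1000):
    noise = np.random.normal(size=(64, noise_dim))
    fake = generator.predict(noise)
    real = real_data[np.random.randint(0, len(real_data), 64)]
    # Alternate: first train the discriminator on labelled real/fake batches...
    discriminator.train_on_batch(real, np.ones((64, 1)))
    discriminator.train_on_batch(fake, np.zeros((64, 1)))
    # ...then train the generator to push the discriminator towards "real"
    gan.train_on_batch(np.random.normal(size=(64, noise_dim)),
                       np.ones((64, 1)))
```

The adversarial dynamic – one network generating data, the other judging it – is what lets the generator gradually learn a useful representation of the real data’s distribution.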

Other noteworthy trends within the data and AI space include the rise and improved performance of chatbots and conversational natural-language APIs, the remarkable improvements to translation and image tagging made possible by deep learning, and the important question of AI ethics – starting from that now-famous question of “should your self-driving car kill a pedestrian in order to save your life”, through to ethical conundrums and alarmist remarks from tech luminaries such as Elon Musk.

Concluding Remarks

So, what does 2018 hold in store? That seems to be the question on everyone’s lips in the data and AI world. While it is not possible to say definitively which trend will dictate progress in 2018 and beyond, it is clear that the three developments above will form key cornerstones on which future capabilities for AI and enterprise-scale data management and data science will be built. Hope you enjoyed reading this. Do leave a comment or a note if you would like to share more.

Andrew Ng’s DeepLearning.AI (Coursera) Certification


One of the more interesting mental models of machine learning I’ve come to understand in the last month or so is the “five tribes of artificial intelligence” model popularized in “The Master Algorithm” by Pedro Domingos. To summarize in a phrase, the master algorithm is the approach that can uncover all possible insight from data – and Prof. Domingos hypothesises that there are five distinct such “master algorithms”, one for each of these tribes. One of these “tribes” is the connectionists, whose master algorithm is, in fact, backpropagation, which is central to the design and operation of neural networks.

A Connectionist Tour Guide

In a sense, the deep neural network has become synonymous with artificial intelligence today. There are numerous other algorithms which could lend a sense of intelligence to machines – whether by communicating in natural language as a conversationalist (from rudimentary bots like ELIZA, through Pootwattle and Smedley (of U Chicago fame), to modern chatbots), or by learning to differentiate between kinds of faces, or to identify specific emotions. The deep neural network has successfully been applied to numerous such real-world problems, and therefore stands out as promising on this account. For the other tribes, we don’t yet have algorithms such as “advanced induction inference machines” or “higher-dimensional kernel machines” – whatever these may indicate (really or apocryphally). So it behooves us to pay attention to stories such as this one, which discusses the “unreasonable effectiveness” of neural networks.


DeepLearning.AI’s Course

There’s definitely a skills gap in the advanced machine learning and artificial intelligence space, and businesses are as yet unable to see value beyond the hype. Unsurprisingly, the skills gap has to be addressed at the very root – the fundamentals, where the ability to model problems, to solve them computationally, and to build systems out of such solutions intersect. Andrew Ng has, also unsurprisingly, taken a stab at addressing this gap in the deep learning space, if his “AI is the new electricity” talk is anything to go by.


Over the last few weeks, I’ve had the opportunity to spend some time on Andrew Ng’s Deep Learning course from DeepLearning.ai. For me, this is like a tour guide to the world of the connectionists. The reality is that, superficial similarities aside, neural networks don’t work like the human brain – as Ng himself explains in the course – but the term has stuck, a legacy of early pioneers whose motivations were partly rooted in neuroscience.

The Coursera certification is organized into five courses, and the first of these lays the mathematical and programmatic foundation for implementing neural networks. This first course, titled Neural Networks and Deep Learning, has well-orchestrated exercises within Coursera’s integrated Jupyter notebook interface, and you can run the algorithms on your own data to evaluate their performance. I’m currently some way through the second course, having finished the first – and I have to say that the videos, programming exercises and other aspects of the course create a true learning feedback loop, which teaches the basics really well. I’m very impressed with the way the course has been put together and made accessible to those with a little machine learning knowledge who are starting out on neural networks and deep learning.

Course Experience

In the section below, I’ll outline my key learnings from the first course in the certification. I hope that you take the course if you are an ML and AI enthusiast, or a young (or even experienced) professional interested in working on deep learning.

  1. The course introduced the most fundamental ideas of neural networks at the very start, with extensive coverage of how to implement a logistic regression model for classifying data. This initial discussion was built up rather nicely into a discussion of deep learning.
  2. As an intermediate course, it assumes some knowledge of linear algebra and differential calculus. As someone who works with machine learning models, I was able to grasp the intuitions with one repetition. If it has been a while since you worked through linear algebra and differential calculus (or thought through equations, at the very least), expect to take a while to find your feet.
  3. Some of the intuitions around gradient descent, the values of derivatives, and so on, were introduced very handily – and were reinforced through the exercises.
  4. The importance of vectorization, and its central role in numpy (which is used extensively – nay, almost exclusively – throughout the course), was brought out well. Numpy is a powerful library which, surprisingly, received its first funding only in 2017, after being instrumental in the development of numerous algorithms and tools. Some of its quirks, such as shape-(n,) vectors, were especially interesting and useful to learn about. Though this isn’t a numpy tutorial by any stretch, numpy is referenced extensively throughout.
  5. During weeks 2 and 3, the logistic regression algorithm is taught in a different context – it is likened to a neuron in a deep net, and the details of activation functions are discussed. This, to me, was the meat of the course.
  6. In weeks 2 and 3, a consistent methodology and notation was followed for the discussion and implementation of forward and backward propagation, two of the key mechanisms in any neural network – and this was done entirely within numpy, making for great hands-on lessons (I’ve included a small sketch in that spirit after this list). Stochastic gradient descent was also explained and implemented.
  7. Finally, in week 4, deep neural networks were handled, and parametrization of the neural network topology was introduced, along with related ideas such as hyperparameter optimization. Additionally, in both videos and assignments, Andrew Ng provided practical advice on how to get the matrix dimensions right for weight and bias vectors – without this and the consistent notation, many programming implementations of DNNs could get very hairy, so I personally felt this was very well handled.
  8. A cat classifier deep neural network in Week 4 – because who doesn’t like cats?
  9. Right through the course, there are optional video lectures and interviews with well-known researchers. One of them is with Geoff Hinton, and it was definitely instructive.
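
As promised above, here is a small numpy sketch in the spirit of those exercises – my own rendition, not course code – showing vectorized forward and backward passes for logistic regression, with examples stored column-wise as in the course’s notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# X has shape (n_features, m_examples); Y has shape (1, m_examples),
# following the column-per-example convention used in the course
def propagate(w, b, X, Y):
    m = X.shape[1]
    # Forward pass: vectorized over all m examples, no Python loops
    A = sigmoid(np.dot(w.T, X) + b)                      # shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # Backward pass: gradients of the cross-entropy cost
    dw = np.dot(X, (A - Y).T) / m                        # shape (n, 1)
    db = np.sum(A - Y) / m
    return dw, db, cost

# Gradient descent loop on toy data
n, m = 4, 100
X = np.random.randn(n, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
w, b = np.zeros((n, 1)), 0.0
for i in range(1000):
    dw, db, cost = propagate(w, b, X, Y)
    w -= 0.1 * dw
    b -= 0.1 * db
```

The complete absence of explicit loops over the m examples is exactly the vectorization point from item 4 above.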


I’m about half-way through the second course, on Improving Deep Neural Networks, and my experience there has been similar to the first course. The content builds directly on the first course, so going in sequence definitely has its advantages; if you were to start with the second course of the specialization, expect to spend some time finding your feet. So far, I only wish there had been better explanations of ideas like dropout and L2 regularization, especially given the tricky quizzes in Week 1. This is a three-week course, and I wish an additional week, or at least a few more videos, had been spent up front explaining and firming up the ideas around regularization. The exploding/vanishing gradient problems, too, could be better illustrated with videos, although the course generally does a good job of explaining the essentials of these ideas.
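
For readers wrestling with the same quizzes, here is my own minimal numpy sketch of the two regularization ideas (an illustration of the concepts, not code from the course): inverted dropout on an activation matrix, and the L2 penalty added to the cost.

```python
import numpy as np

# Inverted dropout: zero out a fraction of activations during training,
# then scale up the survivors so the expected activation stays unchanged
def dropout_forward(A, keep_prob=0.8):
    mask = (np.random.rand(*A.shape) < keep_prob)
    return (A * mask) / keep_prob, mask

# L2 regularization: add (lambda / 2m) * sum of squared weights to the cost,
# which shows up as an extra (lambda / m) * W term in each weight gradient
def l2_penalty(weights, lambd, m):
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

A = np.random.randn(5, 10)             # activations for 10 examples
A_dropped, mask = dropout_forward(A)   # the same mask is reused in backprop
```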

Concluding Remarks

To conclude, I’d recommend this certificate for those in the analytics, data science or machine learning space who are a bit hands-on, can grasp linear algebra and calculus, and can work with Python. Since this is an “intermediate” specialization, neophytes will require multiple viewings of the videos to become conversant with the ideas and concepts. This shouldn’t deter those who want to audit the course, or who want a deeper understanding of the concepts to back up their direct experience in machine learning.

Related Content

  1. My Quora answer on Deeplearning.AI’s Coursera course

Pervasive Trends in Big Data and Data Science

As of mid-2017, I’ve spent almost two years in the big data analytics and data science world, after 13 years of diverse work experience in engineering and management. What started as professional curiosity has taken a while to develop into data science and engineering skills, and into honing the key skills of a data scientist. Along the way, I’ve had the chance to learn core software development methods and principles, stay in touch with the latest in the field, challenge my existing knowledge of product development methodologies and processes, and learn more about data analysis, statistics and machine learning than I knew when I started out in 2015. Along with the constant learning, I’ve had the chance to observe a few pervasive trends in the big data and analytics worlds, which I wish to share here.

  1. Cloud infrastructure penetration: Undoubtedly the biggest beneficiaries of the data and analytics revolution have been cloud service providers. They’re also stretched thin, with falling prices, massive competition, and the need to offer value-added services of various kinds (big compute and API support alongside big storage, for instance) on top of the core cloud offerings that companies are lapping up for their data management needs. Security concerns continue to exist – one of the biggest security incidents actually involved the US’s leading cloud service provider, Amazon Web Services. Despite this, many industries, even those that consider data security paramount, wish to adopt cloud infrastructure, because of the reduced cost of operation and the scalability inherent in cloud platforms.
  2. Deep learning adoption: Generalized learning algorithms based on neural networks have taken the machine learning world by storm, and given the proliferation of big compute and big data storage platforms, it has become easier than ever to train deep learning models. The ecosystem of frameworks – Caffe, Keras, and Tensorflow (which has become more user-friendly and better integrated with numerous systems programming languages and frameworks) – continues to become more accessible as it evolves. This trend will continue, with several tested and published DL APIs available for integration into application software of various kinds.
  3. API based data product deployment: Data science operationalization has begun to happen through APIs and platforms. Organizations that are developing data product strategies are increasingly taking platform views and integrating APIs for managing data, or for scoring incoming data against machine learning models (see the sketch after this list for a minimal example of this pattern). With the availability of such APIs for general use, it has become possible to compose many such microservice APIs into data products with very specific and diverse capabilities.
  4. A focus on value from data: Companies are looking past the big data hype more often these days, and are looking at what value they can get from the data. They’re focusing on better data collection and measurement processes, improved instrumentation and qualifying their data management infrastructure investments. They’re also seeking to enable their data science teams with the right approaches, tools and methods, so that they can get from data to insight faster. Several startups are also doing pioneering work in governing the data science process, by integrating principles of agility, continuous integration and continuous deployment into software solutions developed by data science teams.
  5. Automated data science and machine learning: Finally (and in many ways, most importantly), automated data science and machine learning is a relatively new area of work which is gaining significant ground. Numerous startups and established organizations, The Data Team among them, are evaluating methods to automate key parts of the data science workflow itself. Such automation is a trend that I foresee gaining ground for some time yet, before it becomes an integral part of many data science workflows and development approaches. While a number of applications that straddle this space are referred to as AI, the jury is still out on what is AI and what isn’t, as far as I and many of my colleagues are concerned.
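
As referenced in item 3 above, here is a minimal sketch of the model-scoring-API pattern – Flask, the endpoint shape and the pickled model file are all illustrative choices of mine, not a prescription:

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assume a model trained elsewhere was serialized to disk, e.g. with pickle;
# "model.pkl" is a hypothetical artifact for this sketch
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    # Expects JSON like {"features": [[0.1, 0.2, ...], ...]}
    payload = request.get_json()
    predictions = model.predict(payload["features"]).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would then POST feature vectors as JSON and receive predictions back – which is all a consuming application needs to know about the model, and exactly what makes such microservices composable into larger data products.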

These are just some of the trends I’ve observed, of course, and from where you are as a data scientist, you may be seeing much more. One thing is for sure – those who keep their knowledge and skills relevant in this fast-changing space will continue to be rewarded with interesting work and new opportunities.

The Expert System Anachronism in the Data Science and AI Divergence

The data science and big data buzzwords have been bandied about for years now, and artificial intelligence has been talked about for decades – and the two fields are irrevocably inter-related and interdependent.

For one thing, the wide interest in data science started just as we were beginning to leverage distributed data storage and computation technologies – which allowed companies to “scale out” storage and computation, rather than “scale up”. Companies that could buy numerous run-of-the-mill computers (rather than smaller numbers of extremely expensive, high-end machines) could therefore potentially turn their data collection activities into something useful to the enterprise.

Let’s not forget, though, that the point of such exercises was to actually get some business value at the end. There’s virtually no business case for collecting and storing huge amounts of data (with or without structure) if we don’t have a plan to use that data to take better business decisions, or to somehow impact the business or customers positively. IT managers across industries have therefore struggled with the big data space: how much to invest, what to invest in, and how to make sense of it all.

Technology companies are only too happy to sell companies the latest and greatest data science and data management frameworks and solutions, but how can companies actually use these solutions and tools to make a difference to their business? This challenge for executives isn’t going away with the advent of AI.

Artificial Intelligence (AI) has a long and hoary history, and has been the subject of debate, discussion and chronicle over several decades. Geoff Hinton, the AI pioneer, has a pretty comprehensive description of various historical aspects of AI here. Starting from Hinton’s research, pioneering work in recent years by Yann LeCun, Andrej Karpathy and others has enabled AI to be considered seriously by organizations as a force multiplier, just as they considered data science a force multiplier for decision-making activities. The focus of all these researchers is on general-purpose machine intelligence, specifically neural networks. While the “deep learning” buzzword has caught on of late, a deep learning system is fundamentally no different from a complex neural network and what it can do.

That said, AI in the form of deep learning differs vastly in capability from the algorithms data scientists and data mining engineers have used for more than a decade now. By adding many layers, constructing complex topologies in these neural networks, and iteratively training them on large amounts of data, we’ve progressed along multiple quantitative axes (complexity, number of layers, amount of training data, etc.) to become not merely quantitatively, but qualitatively better in terms of AI performance. Recent studies at Google show that image captioning, often considered a hard problem for AI, is now at near-human levels of accuracy. Microsoft famously announced that their speech-to-text and translation engines had improved by an order of magnitude because of the use of these techniques.

It is this vastly improved capability of AI, and the elimination of the human (present forever in the data science activity loop) even from the analysis and design of these neural networks (generative adversarial networks being a case in point), that makes the divergence between data science and AI vivid and distinct. AI seems to be headed in the direction of general intelligence, whereas data science approaches and methods have constituted human-in-the-loop approaches to making sense of data. The key value addition of the human in the data science context was “domain” – I have discussed the importance of domain in data science extensively in an earlier post – but this, too, has increasingly been supplanted by efficient AI, provided that the data collection process for training data, and the training and topological aspects of the networks (the hyperparameters), are well enough defined. This supplanting of the human domain perspective by machine-learned domain features is precisely what will enable AI to develop and become a key force to reckon with in industry.

Therefore I venture that the “anachronism” in the title of this post is the domain-based model of systems, or intelligent systems, called the Expert System. Expert system design is an old problem that probably had its heyday and apparently disappeared into the mists of technological obsolescence – and it is this kind of design problem that AI methods will be so good at solving, to the point that they can replace humans in key tasks and become a true general intelligence. Expert systems were how the earliest AI researchers imagined machine intelligence would be useful to humanity; however, their understanding was limited to rule-based expert systems. While the overall idea of the expert system is still relevant in many domains – so much so that, in a sense, we have expert systems all around us – it is undeniable that the advent of AI will enable expert systems to develop and evolve once again, but without the rule-based approaches we have seen in the past, and with the inductive learning that deep learning and machine learning methods make possible.