The Value of “Small Data” in ML and AI

This is a comment from LinkedIn.

I wish we paid more attention to “small data”. Models that are built from small data aren’t necessarily bad – it depends on the data generating process you’re trying to model. More data doesn’t necessarily imply better models, especially if the veracity of the data is questionable. Data-centric AI is a discussion that’s being had now in this context. However, when you don’t need large scale ML models and are (prudently) content building statistical tests and simple models, these small data problems become important.

What decision makers shouldn’t forget is that the essential nature of decision making won’t change just due to the size of the data – ultimately it is the insight that models provide (based on many factors) that is the commodity we consume as decision makers. Consequently, there should not be an aversion towards “small data” problems but a healthy curiosity about them. Like all efficiency movements that came before, small data paradigms are innately attractive – if I can verifiably build better models by doing less work, that should logically be a point of value.

MLOps Capabilities, Outcomes and Opportunities for Enterprise AI

Machine Learning Operations (MLOps) has come to be an important push for enterprises in 2021 and beyond – and there are clear reasons why this paradigm shift in Enterprise AI is upon us. Most enterprises that have begun data science and machine learning programs over the last several years have had difficulties putting even their promising machine learning models and proof-of-concept exercises into action by deploying them meaningfully in production environments. I use the term “meaningfully” here because the nuances around deployment make all the difference and form the soul of the subject matter around MLOps. In this post, I wish to discuss what ails enterprise AI today, the sources of the gaps between production and proof-of-concept, expectations from MLOps implementations, and the current state of the discourse on MLOps.

Note and Acknowledgement: I have also discussed several ideas and patterns I've seen from experiences I've had in the industry, not necessarily in one company or job, but going back all the way to projects and programs I've been in over the last seven to ten years. I don't mention clients or employers here as a matter of principle, but I would like to acknowledge mentors and clients for their time and energy and occasionally their guidance as well, in the synthesis of some of these ideas. It is a more boundaryless world than before, and great conversations are to be had regardless of one's location. I find a lot of the content and conversations regarding data science on Twitter and LinkedIn quite illuminating - and together with work and clients, the twain have constituted a great environment in which to discuss and develop ideas. 

What ails Enterprise AI today?

Surely, with the large scale data pipelines companies have access to, the low cost of cloud native solutions, and the high level frameworks for building machine learning models, things should have become easier? Enterprises still seem to be failing in their efforts to build AI programs despite these upsides, for many reasons. For one thing, building models has become easier than before. It takes less time to take (good enough or clean enough) data and build prototype models with it. Regardless of how many hypotheses you have as a business leader or data scientist, you’re more likely in 2021 to be able to collect data and build prototype models with that data than you were in previous years. In the past, you may have had to go through several organizational hoops to get your data, then prepare this data, and then build models. All of these processes have become a bit simpler in 2021, thanks to enterprise data stores maturing, frameworks for building ML models becoming better known, and greater numbers of data scientists being available to build models. While things are still quite complex for the uninitiated, those on the growth curve in data science have found that this phase adds productivity to their prototyping efforts.

What hasn’t changed, though, is the process of taking these models to production. The model is largely seen as a software asset, and productionization of the model has been seen in this limited context. As we will discuss, it is important to challenge this mindset if we’re to build effective machine learning systems for production. The gap, therefore, between proof-of-concept models like the ones we’ve discussed above and production scale implementations of such models is large. Real world implementations are more complex and tedious, and often the hypotheses we want to build models for are more precisely defined – this necessitates extensive data processing, profiling and monitoring. But the complexity doesn’t end there, even though there has been an effort on the part of MLOps practitioners to build end-to-end pipelines. You’ll note that none of these are ground-breaking realizations. MLOps is a practical field, thus far, intended to make all these models work for enterprises – but as we will see below, the practical nature of this field encompasses a number of domain, statistical, cultural, architectural and other considerations.

I wish to suggest, before diving deeper into this post, that this trend towards MLOps adoption represents a noteworthy change in how enterprises see ML system architecture in 2021, as opposed to the previous decade. In a manner of thinking, it represents a move towards the “plateau of productivity” in enterprise machine learning.

Considerations for Enterprise AI – from MLOps, Data Science and Data Engineering

Domain Considerations Matter in Data Science, Data Engineering and MLOps

I wrote several years ago on this blog that domain knowledge is an important element in doing data science. Back then, as a data science neophyte learning from early experience in pure data science roles, I had made several observations about the impact of domain understanding on how quickly we can arrive at hypotheses for formulating data/AI problems. Looking back, this was an important lesson, because I now acknowledge the importance of domain knowledge every time I work on a data science project, or each time that I enable a data science team to be successful. Whether this is my own domain knowledge or that of SMEs, I am grateful for it, because without it, we could build anything, and it wouldn’t ultimately matter to anyone. Domain knowledge gives purpose to data and AI efforts. Without speaking to the domain experts and SMEs in various projects (finance, manufacturing, retail, energy and other industries), there would be little to no chance of timely and cost-effective success in characterising, ideating about and solving these problems.

It may not be immediately evident, however, that domain considerations matter in MLOps (and DataOps). Without an understanding of data generating processes, data formats, sources, rates, types, and data organization patterns, data fields, tables and even some of the process characteristics, we cannot understand data generation or transformation processes in enterprise data pipelines. Nor can we understand how models are to be implemented, and what deployment means in different enterprise or customer contexts. When building and architecting machine learning systems, we end up needing to discover these details if we haven’t already. MLOps therefore cannot be ideated about in a vacuum, without consideration of the domain of the problem, or of the unique challenges of deploying models in that domain. MLOps in logistics and supply chain problems, therefore, will be quite different from MLOps in manufacturing, retail or banking domains.

For instance, if we were building a classification model to sort defective parts from good ones on a manufacturing shop floor, we may need a real-time deployment system, with consideration to latency, edge based deployments of models, opportunities to inspect models as downstream processes or metrics may indicate process failure modes, and so forth. These considerations may not exist if we were building a system for enhancing ad revenue in a platform software company. The considerations there around uplift from pushing ads to new customers may require edge based deployments of a different kind, or federated learning needs, that may be unnecessary in the manufacturing example we discussed. To use an analogy, deployments are like different flavours of ice-cream, each requiring a different kind of appreciation. A failure to realize this may lead to difficulties in enterprises that may inadvertently underestimate the complexity of MLOps, of their own domain processes, or both.

Simplistic, Linear Pipelines Don’t Get Us Over the Line

The current thinking around MLOps is somewhat simplistic and linear, and I mean this in a specific way. There is a lot of discussion around data workflows and pipelines, metadata generation and management, and the metrics around model training and model performance. These are discussions around the management, transformation and profiling of data. Datasets are important to MLOps pipelines, and insofar as agility in data science is concerned, I’d even say that they are primary.

However, this notion of thinking only about the software and application-level implications of models and their deployment doesn’t address some of the needs from MLOps pipelines for enterprises – notably, model interpretability and explainability, managing a diversity of deployment patterns (edge, batch, real-time or near-real-time), and the need to build repeatable pipelines and reproduce results. These problems cannot be broken down into just software applications, and require statistical rigour and attention to changing domain patterns. In fact, there is sometimes a desire on the part of ML engineering or MLOps practitioners to see these more statistical needs of MLOps as “not software engineering” and perhaps therefore “not easy to build for” – both of which may not be true, especially as the space of tools and implementations of statistical models for interpretability/explainability expands just as ML implementations have expanded.

Imagine that you have built an MLOps pipeline to build a dataset for a specific use case, and deployed it and the model eventually, and all’s well. If there’s a need for a new use case, you’re likely to begin back at square one, and build new pipelines, especially if you don’t have a clear and unified data model. As we will discuss in a later section on architecture, this is important to consider in ML engineering – more than one use case may require your data pipeline. This also means that simplistic and linear pipelines can only serve a limited purpose when you’re required to build many such pipelines across enterprise workloads.

For instance, it is possible to compute SHAP values for a model given a specific dataset, and for companies with regulatory needs, there may be a reason to deeply analyze and publish results such as these. Therefore, MLOps shouldn’t only be about building simplistic DAGs or workflows in your YAML engineering tool of choice, or building and deploying metadata-tracked machine learning training/inference workflows. These are necessary, but insufficient for good MLOps implementations – chiefly because there are many other statistical and probabilistic considerations around MLOps which also deserve attention.
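
To make this concrete, here is a minimal sketch of what such an interpretability step could look like inside a pipeline – it assumes a scikit-learn tree ensemble and the shap library, and the dataset, feature names and model choice are purely illustrative rather than drawn from any specific project:

```python
# A minimal sketch: computing SHAP values for a trained tree model so that
# per-feature attributions can be logged alongside other pipeline artifacts.
# Assumes scikit-learn and the shap package; the data and columns are toys.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=500),
    "monthly_spend": rng.normal(50, 15, size=500),
    "support_tickets": rng.poisson(2, size=500),
})
# Toy target; in practice this comes from the enterprise data pipeline
y = 0.5 * X["monthly_spend"] - 2.0 * X["support_tickets"] + rng.normal(size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer is the shap explainer intended for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Mean absolute SHAP value per feature: a simple global attribution summary
# that could be persisted as a pipeline artifact for audit or regulatory review.
for name, value in zip(X.columns, np.abs(shap_values).mean(axis=0)):
    print(f"{name}: {value:.4f}")
```

A summary like this is exactly the kind of statistical artifact that a purely software-centric view of deployment tends to leave out, but that regulated enterprises may need to produce routinely.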

Data Architecture Before MLOps, but Business Needs First

There was an interesting discussion here recently around the theme of “Data before models, but problem formulation first”. The article in question describes the specific challenges of thinking about data science problems based on business problems, and of being “data-driven” in thinking about and building models for our hypotheses. I posit that a similar paradigm applies to MLOps. Data architecture understandably matters a great deal for successful MLOps implementations, because it encompasses very foundational organizational processes and needs around data collection, storage and management, governance, security and quality, access patterns, ETL/ELT, sandboxes for analytics, connections to BI and reporting systems, and so on. Ultimately, this complex web of processes and technologies (because data architecture is more than just storing and retrieving data) is meant to perform some function of the business. As W. Edwards Deming said, “Data are not collected for museum purposes” – they are collected for a decision to be made, or for some end use. In the world of MLOps, we enable such decisions to take place on top of the data provided to us through an enterprise data architecture like the one described above.

While typical enterprise data architectures are increasingly driven by the capabilities of tools and cloud scale applications (because of the economies of scale of cloud providers, and the low barriers to entry), there is an important set of decisions every enterprise data architect has to answer for, around the specific needs of the organization and how the architecture in question enables them to be met. Seemingly trivial decisions taken at the design phase of a data lake or data warehouse can have long lasting implications for the delivery of value from analytics, machine learning and MLOps. Data architecture is certainly important for MLOps, but the more fundamental needs of the organization – the kind of data required, its strategic importance, the decisions that need to be made across use cases, security and access patterns for data analysis and data science, and many more operational aspects of data – all of these are important and have a bearing on MLOps effectiveness too. So if you’re a data scientist or MLOps practitioner looking to improve your impact and effectiveness in solving problems, understand the underlying data architecture more deeply first. Sometimes, doing this can be hard – especially if there are no stakeholders who can explain it well – but this kind of fundamental understanding and context is highly underrated and eventually has an outsize impact on the success of data science and ML programs.

The Enterprise Model Sanctuary: Many Simpler Models, A Few Complex Models, and Other Combinations

A cursory glance at machine learning and MLOps forums, discussions and content indicates that the thinking around model development techniques is method centric, and not business centric. A large number of the discussions are a consequence of what’s required for companies at scale innovating on a few complex models with huge amounts of data – and these are legitimate and interesting discussions for sure. For example, most MLOps discussions I have come across seem to discuss the deployment of deep learning models. They discuss text and unstructured data processing, and complex image processing pipelines. Whether it is the use of tools like Kubeflow for training and deploying models in a distributed fashion, or the use of MLFlow for tracking metrics and performance, these are all legitimate considerations that may solve subsets of the ML deployment space. However, machine learning state-of-the-art is rarely required for enterprises looking to get value out of their specific use cases. The large majority of use cases in the industry call for simpler models, and this is why simpler pipelines could do a large part of the value creation. I say this from experience and with confidence, having seen numerous projects where managers struggle to make sense of ML outcomes for their business, but have less difficulty making sense of data aggregations, summaries and statistics based on the data. The enterprise model ecosystem is more likely to resemble a zoo, or even more accurately a sanctuary, of different models, where each model may have its own specific needs and requirements.

Model development in mature organizations generally follows a careful evaluation of the data and the evidential findings from it on their merits, and a subsequent exploration of hypotheses. Enterprises at lower levels of maturity have difficulty getting value from such an approach, however, and many leaders there may still rely on dashboards and reports. Clearly, there is an important and untapped market in business intelligence from big data. There is also a huge market for implementing simpler models based on clearly defined hypotheses. In many cases, enterprises may need many such simpler models, one for each stratified part of a specific use case. For instance, if you’re a market research firm estimating sales in a market segment, you may wish to build one model for each sub-segment. If you are an equipment manufacturer doing quality checks using machine learning models, you may wish to use attribute based classification models, one for each product line, and perhaps you want to build many of them. The true value of MLOps in these cases is not in managing the complexity of deployment for one complex model, but in enabling many simpler models to be taken to production quickly and efficiently. These simpler models may then provide a baseline with which to build more complex models as needed.
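
As a small, hedged sketch of this “many simpler models” pattern – one lightweight model per segment, trained in a loop so that the same pipeline code serves every product line – consider the following; the column names, the toy data and the choice of logistic regression are illustrative assumptions, not a prescription:

```python
# A minimal sketch: fit one simple model per segment (e.g. per product line),
# so that a single training routine can be reused across many homogeneous
# slices of data. Column names and model choice are illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "product_line": rng.choice(["A", "B", "C"], size=900),
    "feature_1": rng.normal(size=900),
    "feature_2": rng.normal(size=900),
})
# Toy defect label; in practice this would come from inspection data
df["defective"] = (df["feature_1"] + rng.normal(scale=0.5, size=900) > 0.8).astype(int)

models = {}
for segment, group in df.groupby("product_line"):
    X = group[["feature_1", "feature_2"]]
    y = group["defective"]
    models[segment] = LogisticRegression().fit(X, y)
    print(f"{segment}: {len(group)} rows, train accuracy {models[segment].score(X, y):.2f}")

# Each entry in `models` can now be versioned, registered and deployed
# independently, which is where MLOps automation earns its keep.
```

The engineering value here lies less in any single model and more in making the loop – training, registration, deployment, monitoring – cheap to repeat.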

Machine Learning Systems are Stochastic, Not Deterministic

Perhaps I’m stating the obvious, but it needs to be said. The underlying nature of data generating processes and machine learning models is stochastic, not deterministic. Whether we’re talking about manufacturing process metrics, banking and finance transaction data, or energy sector data around load, power and usage – all of these data are generated from stochastic data generating processes, even if they come from engineered systems. Machine learning models are also never exact mathematical formulations – they are almost always stochastic processes. There is a little to unpack here, so I’ll get into a few instances. What this stochasticity means is that machine learning models exhibit variability in results from situation to situation, and that this will be quite evident in production. In order to begin building machine learning systems, we need to perform exploratory data analysis prior to training time, prepare features for our hypothesis, check assumptions based on the features and the model formulation, and then build models and evaluate them. What it also means is that we need to build safeguards to ensure that these assumptions remain valid when doing production scale inference. It means that we may have to reformulate problems as the underlying conditions of the data generating process change. In the case of deep learning models, sophisticated tensor transformations and training loops are also required as part of normal model development.

When a model is eventually trained to the required level of performance and deployed, it too represents a solution at a specific point in time. MLOps is not about “train once, deploy everywhere”, but about “routine retraining and redeployment”. This makes ModelOps and the continuous training lifecycle of model development as important a consideration in MLOps as DataOps is. A lot of discussion around MLOps today is centered on data preparation – and the motivation for this, of course, is the fact that there are significant data preparation challenges that data scientists face. However, model training in the real world cannot be wished away despite the prevalence of AutoML, although AutoML tools are one path for progress. As of 2021, for most use cases, model definition and training is still done manually, even if tuning and optimizing the model are automated. In MLOps lingo, this is where feature stores come in, along with their impact on data drift and concept drift analysis. While a healthy discussion is in progress on these topics, the instrumentation in actual implementations of data drift and concept drift identification and measurement tends to vary. Some tool chains are ready for this change, and others just aren’t.
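
As a small, hedged illustration of what such instrumentation can look like, the sketch below runs a two-sample Kolmogorov–Smirnov test per feature, comparing the training-time distribution against a batch of production data; the feature names, the synthetic data and the 0.01 threshold are assumptions made for the sketch, not a standard:

```python
# A minimal sketch of data drift detection: compare the distribution of each
# feature at training time with the distribution seen in production, using a
# two-sample Kolmogorov-Smirnov test. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
training_features = {
    "load_kw": rng.normal(100, 10, size=5000),
    "temperature_c": rng.normal(25, 3, size=5000),
}
# Simulated production batch in which one feature has drifted upwards
production_features = {
    "load_kw": rng.normal(115, 10, size=1000),
    "temperature_c": rng.normal(25, 3, size=1000),
}

for name in training_features:
    statistic, p_value = ks_2samp(training_features[name], production_features[name])
    drifted = p_value < 0.01
    print(f"{name}: KS statistic={statistic:.3f}, p={p_value:.3g}, drift={drifted}")
```

Concept drift – a change in the relationship between features and the target – needs additional instrumentation, typically a comparison of live model performance against training-time baselines.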

More broadly, some MLOps implementations may account for these stochastic and probabilistic characteristics of ML systems, because their data scientists ask the hard questions after training and during or before deployment. On the other hand, it is likely that most MLOps implementations today treat models merely as pieces of software. The latter pattern leads to the unfolding of technical debt of various kinds later in the lifecycle of the system. This technical debt currently takes the form of additional regulatory checks, interpretability analysis, metadata logging, model performance metrics, and so on – and over time, this set of secondary considerations may grow much bigger.

Changing Skillsets and Roles for MLOps

Companies looking to hire top ML talent as of 2021 are pushing for a greater number of high quality data engineers with MLOps skill sets. This is in contrast to emphasis on data science hiring in the past. Hiring pipelines for data and AI roles (I’ve seen a few different ones over the last few years) tend to emphasize programming, statistics, databases and specific technologies for data science – of late, this is largely SQL, Python, with a smattering of distributed frameworks and tools, and skill sets in deep learning, tabular data analysis and the associated frameworks and tools for solving problems in this space. For data engineering roles, over the years I’ve seen skill requirements specifying systems programming and strongly typed languages such as Java and Scala, experience working on JVM languages, in addition to SQL, databases, and a lot of the back-end software engineering skill sets we see for application developers elsewhere. For data engineers working on big data technologies, there’s very often a need to be familiar with NoSQL databases, or graph databases, depending on the role and use cases, in addition to the Hadoop-and-friends ecosystem, and cloud engineering skills such as AWS or Azure. While the data scientist’s role and skill set has come to include domain considerations, advanced statistical and ML models, cloud-native and large scale data science and deep learning and communication/presentation of data and insights, the data engineer’s role has become broader around systems engineering and design.

Someone said (in fact, in this talk) that data engineers ought to build frameworks, and not pipelines – and this is a fair assessment of how to use this broad and useful skill set in data engineering. There has been a healthy discussion in various forums, talks and the like on ML engineering roles which combine elements of these two different skill sets. All of these conversations around skill sets are important context for where we’re heading in the data science and engineering space overall. MLOps engineers, unlike DevOps engineers before them, should not be constrained by the limited value addition possible outside of data science or data engineering roles (the bulk of DevOps roles are administrative in nature). They cannot be construed as, or see themselves as, configuration file engineers, for lack of a better term. In fact, their role could be much broader – as systems engineers spanning a range of capabilities in both data science and data engineering, while not necessarily possessing deep expertise in any one of these (themselves diverse) areas. MLOps roles should perhaps also emphasize domain knowledge or expertise of some kind – since ultimately, the outcomes here are practical and related to business value from ML. There are many outcomes and opportunities for talent and skill sets for sure, but these stand out as being relevant. What is for sure is that the data scientist’s role has changed (as has the data engineer’s), and the old and unyielding challenges being faced by data scientists are taking on new definitions and manifestations – thereby requiring new mindsets, new skill sets, and new processes to come forth.

In my view, this churn in the extant data science and engineering role paradigm is a welcome development, because enterprises today want to realize value from DataOps and MLOps simultaneously. As we will discuss later in this post, while models are important, business managers will continue to derive value from analytics and reports – and perhaps there has never been a better time to build on that need than 2021. Also, the emphasis on data engineering roles as of today is well-founded. From practical experience as a data scientist who worked on a range of problems from relatively simple ML to complex deep learning models, I will happily acknowledge that the data engineers I have worked with were indispensable to the success of the projects I worked on. However, leaders hiring for ML roles should not think that the role of the data scientist is no longer required. I believe this emphasis on data engineering is a passing phase while enterprises build the foundational pieces that enable value from data. The focus will therefore shift once again to business value from data, which means that statistical, data science and ML skills will continue to be in vogue through this shift and afterwards.

Don’t Ignore Decision-Making Culture

Organizational culture matters a lot for the success of MLOps, as much as it does for any digital transformation program. MLOps represents, in a way, a desirable end-state or the happy marriage of data science and data engineering in a given enterprise and data architecture context. However, both data science and engineering can only be valuable and effective in organizations whose leaders think about and talk about data and use the data and insights from these data for taking decisions. The latter is a cultural synthesis, and not just a technology adoption process or workflow that one can execute on demand. Being a cultural matter, it has to do with behavioural and attitudinal patterns that ultimately enable data and insights to be used for decision making.

The adoption of data driven decision making represents a shift from thinking about business processes, systems and decisions in terms of rules (“Rules are for lazy managers”, to paraphrase Simon Sinek), to an open-minded thought process around data and AI systems. When leaders stop thinking in terms of rules, and start thinking in terms of systems, they are often imagining situations of change, synthesis, formation and deformation of patterns, structures and interactions. They begin to see their role as an influencer more and as a commander less, and this shift in thinking can enable them to make subtle changes to their managerial approach, driven by data.

In the earlier post I wrote about OODA, and the AI-enabled generalist, there is a point I make about the decision making language of organizations. This kind of development of a decision making language requires a way of thinking about the enterprise’s systems, processes, and also the ML models in new ways. It requires an openness of mind in decision making to adopt models as thinking tools. In a sense, the modern AI-empowered generalist could be seen as a prototype for a supreme pragmatist. Enterprises want rational actors at their helm, at least for the functions that require data driven decision making – and such rational actors can be groomed in a culture that doesn’t shy away from challenging the current rules and norms on decision making, and is willing to look at data and models.

Data/AI Exponents as SMEs and Future Leaders

Organizations come to embrace data, ML or even MLOps so that they can ultimately derive value from data, and this cannot be done without talent that unlocks value from data. Whether this is data science talent or data engineering/architecture talent, there is both a topical/functional need for these roles in enterprises and a strategic value to them, and the latter tends to be overlooked in data strategy. The strategic value comes from what such individuals accumulate over time: as they build data pipelines and AI/ML models, they pick up a great deal of knowledge about business processes, customers and the domain. When you have a data scientist in your team who has built a few different models that explain different elements of your business, processes or customer behaviour, they become an invaluable asset both for developing further models and for analyzing customer, business or process behaviour. Such individuals can also become effective leaders and transition to process management roles.

MLOps and DataOps engineers in an organization can therefore themselves be considered Data/AI SME roles – and this is an important source of value that is often overlooked in organizations. A lot of organizations still see data/AI resources as just means to an end, but in fact, many of these roles can become storehouses of domain knowledge. MLOps can potentially enable the tacit knowledge of such individuals to be effectively captured for process management as well – this may be an important opportunity for value creation from MLOps. MLOps can also accelerate the development of data-driven leadership talent: when such engineers are exposed to the models used to take decisions, and to the specific mechanisms by which those decisions are taken, their potential for process leadership improves.

In an earlier post, I discussed the importance of higher-level decision making languages, the OODA decision making loop, and how AI can enable a new generation of generalists. I would suggest that this is a useful idea to consider in the broader context of building a data-driven decision making culture.

“Data Before Models” also implies “Models After Data”

The purpose of this heading is to draw attention to the fact that the best data pipelines won’t help if we aren’t doing much with the data we prepare. We eventually have to build models of one kind or another with this data in order to actually take decisions. Many recent discussions around MLOps talk about data-centric AI, and above, we have discussed data architecture and other elements of enterprise systems and culture that contribute to MLOps success. We have also discussed the stochastic nature of data generating processes and machine learning systems. There are important implications from the core ModelOps processes as well, and we will discuss them here, finally. The process of developing models, as I have discussed above, has become easier now than ever before, at least on the software side. The careful formulation and evaluation of model hypotheses, statistical analysis of the input data and features, and the checking of assumptions – these remain as hard and as tedious as they were before. This underscores the importance of statistical analysis and exploratory data analysis. Without these foundational steps, ML models can be built with high bias or high variance, thereby setting up the use case for higher failure rates and lower effectiveness overall. This bears repetition, since there seem to be two schools of machine learning and data science professionals: one group believes strongly that mathematical and statistical thinking are important for doing data science, while another group of professionals and practitioners thinks otherwise – that the software elements of data science modeling can be learnt by someone without knowledge of statistics or machine learning.

In my experience, statistical analysis and EDA are fundamentally important for machine learning – they form an integral part of extracting value from the data we have, and of making sense of it, before we solve problems. A number of business situations require us to think in terms of data distributions and stochastic processes. To build things that scale within MLOps pipelines, some of us may need to have an open mind about exploring the mathematical underpinnings of things like gradient descent, batch normalization or activation functions. This open-mindedness is important for a key reason – a lot of MLOps engineers being trained today may assume that the data science is easy, or trivial, because people who don’t know statistics are building models, or because they can, if they just follow a simple workflow. I know this to be patently untrue – if you want to develop a model worth anything in an enterprise, you may have to start from formulating and thinking about the business problem, carry out the EDA and statistical analysis, build out tests for assumption checking, and then experiment with different models. You have to get into the probability and statistical analysis eventually, or you will be forced to rediscover the effectiveness of these mathematical and scientific methods. Even if you manage to build one or a few models, there will be situations where you’re required to explain these models. When ML engineers or data science engineers are able to reason about the mathematics of machine learning, they are not only more confident, but also better able to build and scale systems for the enterprise. Their ability to think about the implications of these models for different related use cases, deployment modes, source data and data quality considerations also improves. By checking assumptions on the features, they can stave off big challenges that may arise when the model is implemented in production.
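
As a hedged sketch of what such tests for assumption checking can look like when codified into a training pipeline, consider the following; the specific checks, thresholds and feature names are illustrative, and the right checks in practice follow from the model formulation itself:

```python
# A minimal sketch: simple, repeatable assumption checks on features before
# model training, of the kind that can be re-run on every retraining cycle.
# The thresholds and choice of checks are illustrative assumptions only.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
features = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.exponential(size=1000),  # deliberately skewed
})

report = {}
for col in features.columns:
    values = features[col].dropna()
    _, normality_p = stats.normaltest(values)  # D'Agostino-Pearson test
    report[col] = {
        "missing_fraction": float(features[col].isna().mean()),
        "skewness": round(float(stats.skew(values)), 3),
        "normality_p_value": round(float(normality_p), 4),
        "near_constant": bool(values.std() < 1e-6),
    }

for col, checks in report.items():
    print(col, checks)
```

Checks like these are cheap to run and, when wired into the pipeline, catch distributional surprises before they become production incidents.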

Statistical analysis and machine learning model development have been core and will be core to data science, regardless of the peripheral engineering required for realization of value. Data engineering and MLOps as allied fields help realize this value at enterprise scale. It is the process of data science and model development that ultimately converts data into insights – and insights are the primary purpose of investing in enterprise data and AI projects and programs in the first place. They will therefore continue to be a good bet for practitioners in future – as long as they realize that those skills alone cannot take them over the finish line.

Concluding Remarks

I hope that you’ve benefited from reading this rather lengthy post on MLOps and Enterprise AI. If anything, it allowed me to explore my own experiences, document a few patterns I see in the development of truly enterprise ready AI, MLOps toolchains and capabilities, and also explore sources of value from MLOps for enterprises. If you have questions or ideas, please leave a comment or tweet to me at @aiexplorations.

Further Reading/Listening

  1. Data Science is Different Now, by Vicki Boykis
  2. Problem Formulation Comes First by Brian Kent on Crosstab.io
  3. Build Frameworks, Not Pipelines – a Data Engineering Talk on PyData
  4. From Model-Centric to Data-Centric AI – a discussion on Enterprise scale AI with Andrew Ng and others
  5. ML Engineering for Production – another discussion on ML for production with Andrew Ng and others

Emphasizing the Basics: Structured Data Science Mentoring

Data science, machine learning and AI are constantly growing fields, with research spilling over at the seams in sheer volume. Every day, I receive numerous references to interesting papers on my Twitter feed, thanks to Arxiv daily and similar accounts there. I also see papers explained with code, and references to ML products and systems in numerous contexts. This is all overwhelming beyond a point for a professional who doesn’t have a specific focus area. Speaking pragmatically, the tree of knowledge is always bound to be vast: it is a feature of every single human endeavour to exhibit this kind of complexity as we spend more and more time exploring things, farming ideas and understanding new possibilities.

Data scientists are going to be at different levels of competence and may be differently placed to take on challenges they are asked to face – the role of the mentor (regardless of the type) is to systematically challenge the data scientist to discover new innate potential and develop such potential to increase their overall capabilities and effectiveness.

The Data Science Mentoring Challenge

Mentoring can be a hard task for this reason – a lot of people (understandably) gravitate towards complex models that are meant for specific purposes, without fully understanding the details and the exact mechanisms behind simpler machine learning and statistical modeling methods. The problem with this is twofold: a) reliance on libraries and frameworks with implementations that already exist, and b) inability to characterize, apply and explain common, simpler techniques to actual real world problem statements. Part of the problem here is the sensationalization of research in the media. Open research without borders is important and pivotal for speedy progress in technology areas like ML. But we’re also seeing a lot of misinformation, including sensationalization of advanced ML techniques, and when some of it gets parroted by professionals (some of whom may become hiring managers), we see the problem proliferating into the world of work as well. I’ve interviewed my fair share of individuals who understand, say, an LSTM unit’s different gates but aren’t comfortable explaining autocorrelation techniques or ARMA models. This gap probably stems from gaps in mentoring and coaching, which ideally should emphasize basics first.

I’d posit that the role of the mentor has changed, in data science, over the last several years, and I would say it has changed most significantly in the last two years. In the future, data and AI mentoring will look different from what it looked like in the past five years. This is because the nature of the job of a data scientist (or alternatively an AI/ML engineer) has also changed. Despite developments in Automated Machine Learning, we’re inundated with situations in the real world, where we require human expertise to get through data science and machine learning problems. This human expertise manifests in three processes: problem characterisation, problem formulation, and problem solving. We need real, human data scientists (not just an AutoML tool) to look beyond the obvious automations such as hyperparameter or architectural searches, to reason about the nuts and bolts of problems, interpret the problem domain and reason about different kinds of hypotheses and how they make sense.

This makes the process of mentoring for data science different than it was, in certain specific ways. For one thing, mentors today create the field of problems or opportunities that will exist tomorrow. Data scientists today experience an overload of information, as can be expected, from different sources. From Arxiv and Springer papers and articles, to new research and code, new books and new frameworks and algorithms, there are plenty of things to learn on a daily basis. However, the broader skill set of the data scientist even today can be characterized into four key areas: basic technical skills, business skills, functional skills and frontier skills.

Broad Characterization of Skills for Modern Data Science
  1. Technical skills: There’s the need for a strong foundation that enables general effectiveness in a data science role. This includes good skills across statistical analysis fundamentals, leading into the key principles that enable statistical learning models to be built, and a sound understanding of the mechanism behind common algorithms such as regression, tree algorithms, and search and optimization methods.
  2. Business skills: There is a strong need for data scientists who can reason about business processes and systems, and understand how data may be generated, how it may flow, and what insights may be required of it. Not only is this a key skill for fruitful interactions with clients and stakeholders, but it is also important for narrowing down to the right level of depth for the job, in terms of satisfaction and effectiveness.
  3. Functional skills: There’s the need for effectiveness on the job, at a functional level, which not only includes technical competence at the statistical, mathematical and code levels, but at the level of processes and good practices such as clean code, change management and reproducible research. One could also see more advanced machine learning and feature engineering techniques as being part of the functional skill set.
  4. Frontier skills: There’s research that’s expanding at multiple frontiers, which is hard for even experienced data scientists to keep up with, if they’re really interested in furthering their career beyond the obvious and evident challenges of day-to-day work.

Mentors: Different Levels

The role of mentorship has also become specialized in the last two years, which is, in my view, one of the changes most representative of the maturation of the field of data science. Mentors today can be at different levels of skill and still add value to different kinds of data science and analytics roles. For the sake of this discussion, I’d classify mentors today into two kinds – the “breadth mentor” and the “depth mentor”. While both kinds of mentors possess certain common skills, especially on the interpersonal communication front, they may have different approaches to technical, functional and research level mentoring.

The breadth mentor is an individual with plenty of experience in data science, perhaps in a consulting setting, who can provide generally sound advice to data scientists on the development of broad skill sets, ranging from basic statistical analysis to advanced algorithms. The nature of the mentoring here is about developing a well-rounded data scientist, rather than an expert in a specific field.

The depth mentor, by contrast, is someone who has deep experience in a specific area of industry or technology and deep experience in bringing this field together with data science. Examples of this kind of data scientist would be an NLP researcher, or a researcher in the field of robotics, both of whom may be expert practitioners of data science methods in their specific areas, but without the broader knowledge of consultative data science methods.

Depending on the needs of the business and the data scientists in question, the appropriate kind of mentor has to be chosen – and this shouldn’t be done lightly. For example, bringing a breadth mentor to an AI product firm may have some advantages, but if the firm is solving problems in a specific space, this may not work out so well. Similarly, bringing a depth mentor to a consulting firm can help grow a specific practice (or a new one) but may not benefit the broader data science efforts across different business domains there.

Structuring Mentorship in Data Science

Mentors (and hiring managers) in general should emphasize the importance of the basic skills listed above. In my view, when a data science candidate has the correct understanding of the essential basic statistical ideas and common algorithms, it becomes a lot easier for them to grasp more advanced ideas such as in deep learning, when this is required. Mentors can build better basic skills in data scientists by challenging their technical acumen.

Mentors should also emphasize business skills where relevant, and where the emphasis is on research, they should emphasize some of the frontier skills as well. Mentors in this context are expected to challenge the data scientist with relevant questions, and encourage a habit of systematically breaking down problems and asking the right questions. These business skills are important all the way up to solution architect roles and management, when crucial decisions have to be taken and hard questions will need to be asked often. Mentors can build better business skills in data scientists by challenging their problem understanding and characterisation.

Functional skills are important for effectiveness on the job. It is not okay for data scientists to theoretically understand a specific subject area, only to find themselves handicapped when asked to build a machine learning pipeline. Therefore functional skill mentoring is about challenging the data scientist on problem solving effectiveness.

Finally, frontier skills development depends on both the organizational or research context and the data scientist’s interests. Mentors can provide helpful markers to enable the exploration of ideas, while emphasizing value from the research and asking questions that keep the data science researcher on track. The challenge the mentor can pose here concerns differentiated solution value and originality.

The Importance of Emphasizing the Basics

This brings me to the importance of emphasizing the basics. I see numerous individuals getting into data science and machine learning who are interested in jumping right to the latest and greatest algorithms. For a while now – and this has been a trend on LinkedIn and Twitter – budding data science aspirants have posted work involving simple scripts or programs around computer vision, translation and similar problem statements, giving much of their audience the impression that they are not only skilled in the techniques demonstrated, but also skilled across different kinds of data science problem statements. My own suggestion to data science aspirants is that they will be under pressure to demonstrate more involved skills: not merely the ability to use pre-built libraries to solve problems, but the ability to reason from basic principles of statistical learning and, perhaps, to build such algorithms and systems from scratch. This kind of deeper skill is what separates the wheat from the chaff in data science.

Concluding Remarks

In conclusion, I believe the mentor’s role in data science has changed – mentors today have their work cut out for them when it comes to building deeper skill in their data scientists. They should emphasise technical acumen first and foremost, problem understanding and characterisation next, and problem solving effectiveness after this. This builds up a well-layered skill set in which technical, business and functional skills work in harmony, resulting in true value to the data science market.

Statistical Competence and Its Importance for Good Data Science Careers

In 2019, enterprises routinely begin initiatives related to analytics, data science and machine learning that invoke specific technologies from a very early stage. This tendency to put technology ahead of value sometimes extends to the analytics champions and managers who take up or lead data-intensive initiatives. While this may seem pragmatic at one level, at another level it may lead to significant problems in ensuring successful outcomes from such analytics initiatives and programs. In this post, I’ll address the three-pronged conundrum of statistical competence in the data science world, particularly in the context of data science consulting and services, and what it means for the careers of data science candidates now and in the future.

Hiring Statisticians: An Expert’s View

Kevin Gray is one of my connections on LinkedIn who posts insightful content on statistical analysis and related topics on a regular basis, including very good recommendations for books on various statistical and analytical techniques and methods. One of his recent posts was an article he’d authored titled “What to Look For in a Statistician” (the article, and my comment), which definitely resonated with my own experiences in hiring statistically competent engineers in different settings, such as data science and machine learning, between 2015 and today. In years past, I have had similar experiences when hiring competent product engineers and manufacturing engineers in data-intensive problem solving roles.

The importance of statistical thinking and statistical analysis in business problem solving cannot be overstated. However, even good advice that is canon, and that is well-acknowledged, often falls on deaf ears in the hyper-competitive data science job market. Both hiring managers and recruiters tend to emphasize keywords naming the latest framework or approach over the ability to think critically about problem statements, carefully architect systems, and rigorously apply statistical analysis and machine learning to real world problems while keeping considerations of explainability in mind.

The Three-Pronged Conundrum of Data Science Talent

Now you might ask why I say this, and what I really mean by it. The devil, as they say, is in the details. One essential problem with the broad and wide proliferation of highly capable tools, frameworks and applications that can perform and automate statistical analysis of different kinds is the following three-pronged conundrum:

  1. Lack of core statistical knowledge despite having a working knowledge of the practicum of advanced techniques: Most candidates in the data science job market who are deeply interested in building data science and ML applications have unfortunately not developed skills in the core statistical sciences and statistical reasoning. Since statistics is the foundation for machine learning and data science, this degrades the quality of projects and programs which have to rely on hiring such talent. When they prefer to use software to do most of or all the thinking for them, their own reasoning about the problem is rarely good enough to critically evaluate different statistical formulations for problems, because they think in very set and specific ways about problems thanks only to their familiarity with the tools.
  2. Tools as an unfortunate substitute to statistical thinking: Solutions, services and consulting professionals in the data science and advanced analytics space, who have to bring their best statistical thinking to client-facing interactions, are unable to differentiate between competence in statistical thinking, and competence in a specific software tool or approach.
  3. Model bloat and inexplicability: The use of heavy, general purpose approaches that rely on complex, less explainable models, rather than reliance on simpler models that are constructed upon a fuller understanding of the true dynamics of the problem.

These three sub-problems can derail even the best envisioned data science and machine learning initiatives in product / solution delivery firms, and in enterprises.

Some “Unsexy” Characteristics of Good Data Scientists

These are also not “sexy” problems – they’re earthy, multi-dimensional, real world problems that have many contributing factors, from business and how it is done, to the culture of education and the culture of software and solution development teams. Kevin Gray in his post touches upon attitudinal qualities for good statisticians, which could also be extended to data science leaders, data scientists and data engineers:

  1. Integrity and honesty are important in data science – this is true especially in a world where personal data is being handled carelessly and sometimes gratuitously by many applications without heed to data protection and privacy, and when user data is taken for granted by many technology companies. This is not an easy expectation or evaluation point for hiring managers, since it is only long association with someone that allows us to build a model of their integrity, and rarely does one effectively determine such an attribute in short interviews. What’s dismal about data science hiring sometimes is the proliferation of candidate resumes that are full of fluff, and the tendency of candidates to not stand up to scrutiny on skills they identify as “key” or “core” skills.
  2. Curiosity and a broad spectrum of interests – this cannot be overstated in the context of a consulting data science or machine learning expert. The more we’re aware of different mental models and theoretical frameworks of the world and the data we see in it, the better we’re able to reason starting from hypotheses about the data. By extension, we’re better able to identify the right statistical approaches for a problem when we start from and explore different such mental models. The book I’ve linked to here by Scott E. Page is a fantastic evaluation of different mental models. But with models come biases – to restate George E. P. Box’s famous quote, “All models are wrong, some models are useful”.
  3. Checking for logical fallacies is key for data science reasoning – I would add to the critical thinking element mentioned in Kevin’s post, by saying that it behooves any thought leader such as a data science consultant to critically evaluate their own thinking by checking for logical fallacies. When overlooked, a benign piece of flawed reasoning can turn into a face-melting disaster. The best way to ensure this does not happen is to critically evaluate our ideas, notions and mental models.
  4. Don’t develop one hammer, develop a tool box – like an experienced plumber, carpenter or mechanic, the data scientist today should not approach their tool set with quasi-religious fervor, promoting one technique at the cost of others, such as how deep learning has come to be promoted in some circles as a data science panacea. Instead, the effective data scientist is usually pragmatic in their approach. Like a tailor or carpenter who has to cut or join different materials with different instruments, data scientists today do not have the luxury of getting behind one comfortable model of thinking about their tool set and profession – and any attempt to do this can be construed as laziness (especially for the consulting data scientist) at best. While the customer is always right, there are times when the client can be wrong, and it is at these times that they need the advice of a qualified statistician or data scientist. If there is one time when data scientists should not abandon their statistical thinking, it is this kind of a situation.

Concluding Remarks

To conclude, data scientists ought not to be seen as resources that take data, analyze it using pre-built tools, and write code to explain the data using pre-built libraries of various kinds. They’re not software jockeys who happen to know some statistics and have a handle on machine learning workflows. Data scientists’ work scope and emphases as industry professionals and consultants go way beyond these limited definitions. Data scientists are expected to be dynamic, statistically sound professionals who critically evaluate real world problems based on theories, data and evidence drawn from many sources and contexts, and progressively build a deeper understanding of these real world problems that lead to tangible value for their customers, be they businesses or the consumers of products. The sooner data scientists realize this, the better off they will be while charting out a truly successful and fulfilling data science career.

Understanding the Logarithm Trick in Maximum Likelihood Estimation

Maximum Likelihood Estimation is a fundamental and powerful idea that’s at the centre of many things we do with data – so much so that we often use it without knowing it. MLE allows us to find the parameter values under which the model is most likely to have produced the data we have on our hands. This short post addresses the logarithm trick, which is used to simplify MLE calculations.

There are two elements to understanding the formulation of MLE for the common Multivariate Gaussian model (which could be extended to other models equally):

  1. The i.i.d assumption that simplifies the MLE formulation
  2. The logarithm trick which enables solution of the MLE formulation

On this blog I’ve discussed topics like time series analysis in the past, where the idea of independent and identically distributed variables is addressed, and of course, being an important statistical topic, it is well explained and understood. The logarithm trick, however, is specific to the simplification and solution of MLE formulations, and is helpful to understand.

The logarithm function is monotonic: it changes the scale of its input without changing where maxima or minima occur. This is extremely helpful when dealing with monotonic input data that we want to remain monotonic after transformation, but whose scale we want to change.

When building a model p(x_1, \ldots, x_n | \theta) of the data (x_1, \ldots, x_n) – which, under the i.i.d. assumption, factorizes into the product \prod_{i=1}^{n} p(x_i | \theta) – the MLE formulation seeks the values of \theta such that

\nabla_{\theta} \prod_{i=1}^{n} p(x_i | \theta) = 0

The interesting thing about the log transform is, as I said earlier, that in the transformation ln ( \prod_{i} f_i ) = \sum_{i} ln(f_i), the locations at which the product attains a maximum or a minimum do not change, because the logarithm is monotonic. This logarithm trick enables us to work with the sum rather than the product, and thereby execute the MLE much more simply.
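
As a short worked sketch – restricted to the univariate Gaussian so the algebra stays compact, with the multivariate case following the same pattern – the trick plays out as follows. Taking logarithms turns the product into a sum:

ln L(\mu, \sigma^2) = \sum_{i=1}^{n} ln \, p(x_i | \mu, \sigma^2) = -\frac{n}{2} ln(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

Setting \partial \, ln L / \partial \mu = 0 gives \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i , and setting \partial \, ln L / \partial \sigma^2 = 0 gives \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 – each obtained by differentiating a sum of terms rather than a product of densities.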

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to describe the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I found Michael Koelbl’s answer to “What are all the different types of data scientists?” quite interesting, and thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis, reporting based on newly created measures, and the graphs and charts built from them, and they ask questions arising from this EDA. They’re excellent at storytelling, asking questions of the data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, trying to build ML models of one kind or another from the data. They’re not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. They span the range from BI tools to hand-coded scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Perhaps statisticians are the rarest breed of the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and DOE, to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Some Ideas on Combining Design Thinking and Data Science

Recently, I had the opportunity to finish Stanford SCPD’s XINE 217 “Empathize and Prototype” course, part of the Stanford Innovation and Entrepreneurship Certificate, which emphasizes the use of design thinking to develop product and solution ideas. It was during this course that I wrote down a few ideas about using data to improve design decisions. Design thinking is a modern approach to system and product design which puts customers and their interactions at the center of the design process. The design process has been characterized over decades by many scholars and practitioners in diverse ways, but a few aspects are perhaps unchanged. Three of these are as follows:

  1. The essential nature of design processes is to be iterative, and to constantly evolve over time
  2. The design process always oversimplifies a problem – and introduces side effects into the customer-product or customer-process interactions
  3. The design process is only as good as the diversity of ideas we use for “flaring” and “focusing” (which roughly translate to “exploring ideas” and “choosing few out of many ideas” respectively).

Overall, the essential idea conveyed in the design thinking process as explained in XINE 217 is “Empathize and Prototype” – and that phrase conveys a sense of deep customer understanding and focus. As for integrating data into the design process, the idea is by no means new: engineers starting from Genichi Taguchi, and perhaps even a generation before Taguchi, have been developing systems models of the processes or products they design. These systems models are, at some level, factor-response models, because they are converted into prototypes via parameter models and tolerance design processes.

Statistically speaking, these are analogues of the overall designed experiment practice, where a range of parameter variables may be considered as factors to a response, and are together modeled as orthogonal arrays. There’s more detail here.

Although described above in a simplified way, data-driven design approaches, grouped under the broad gamut of “statistical engineering”, are used in one form or another to validate designs of mechanical and electrical systems in well-known manufacturing organizations. However, when you look at the design thinking process in specific ways, the benefits of data science techniques at certain stages become apparent.

The design thinking process could perhaps be summarised as follows:

  1. Observe, empathise and understand the customer’s behaviour or interaction
  2. Develop theories about their behaviour, including those that account for motivations – spoken and unspoken aspects of their behaviour, explicit and implicit needs, and the like
  3. Based on these theories, develop a slew of potential solutions that could address the problem they face (“flare”)
  4. Qualify some of these solutions based on various kinds of criteria (feasibility, scope, technology, cost, to name some) (“focus”)
  5. Arrive at a prototype, which can then be developed into a product idea

While this summary of the design thinking approach may appear generic and rudimentary, it applies to a wide range of situations and is therefore worth considering. More involved versions of the same process could take on different levels of detail, whether domain-specific or process-specific, and could add finer-grained steps to help the designer “flare” and “focus” better. As I’ve discussed in a post on using principles of agility in data science, it is also possible to iterate the “focus” and “flare” steps to get better and better results.

Looking more closely at this five-step process, we can identify some ways in which data science tools or methods may be used in it:

  1. Observing consumer behaviour and interactions, and understanding them, has become a science unto itself. With the advent of video instrumentation, accelerometers and behavioural analysis, a number of activities in this first step of the design thinking process can be improved merely by better instrumentation and measurement. I’ve stressed the importance of measurement on this blog before – for one thing, a small number of well-measured samples can be more valuable for building certain kinds of models than a large volume of noisy data. The capabilities of new sensors also make it possible to expand the kinds of data collected.
  2. Theories of behaviour (hypotheses) may be validated using various Bayesian (or even Frequentist) methods of data science. As more and more data is collected, our understanding of the consumer’s behaviour can be updated, and Bayesian behavioural models can help us validate such hypotheses (a minimal sketch follows this list).
  3. Steps 3 and 4 of the design thinking process I’ve outlined above – the “focusing and flaring” routine – are, at one level, the core experimental design practice described by statistical pioneers including Taguchi. Using tools of data science such as significance testing, effect size determination and factor-response modeling, we could come up with interesting designs and validate them against relevant factors.
  4. Finally, the process of prototyping and development would involve a verification and validation step, which tends to be data-intensive. From reliability and durability models (based on Frequentist statistics and PDF/CDF functions) to key life testing and the analysis of data in that context, there are numerous tools in the data science toolbox that could be used to improve the prototyping process.
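
As an illustration of point 2 above, here is a minimal sketch, assuming a Beta-Binomial model and made-up counts (the hypothesis, the prior and all the numbers are hypothetical), of how a behavioural hypothesis could be updated as observations arrive:

```python
# A Beta-Binomial update of a behavioural hypothesis (all numbers are hypothetical).
from scipy import stats

# Hypothesis: "at least 30% of users will adopt the new interaction we designed."
prior_a, prior_b = 2, 5               # weak prior belief about the adoption rate
users_observed, adopters = 200, 74    # observations from a hypothetical prototype test

posterior = stats.beta(prior_a + adopters, prior_b + (users_observed - adopters))

# Posterior probability that the adoption rate exceeds 30%, given prior and data
print("P(adoption rate > 0.30) =", 1.0 - posterior.cdf(0.30))
```

As more prototype tests are run, the posterior simply becomes the prior for the next round of updates, which is the sense in which our understanding of the consumer’s behaviour is progressively refined.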

I realize that a short blog post such as this one is probably too short to explore this broad an intersection between the two domains of design thinking and data science – there’s the added matter of exploring work already done in the space, in research and industry. The intersection of these two spaces lends itself to much discussion, and I will cover related ideas in future posts.

Hypothesis Generation: A Key Data Science Challenge

Data scientists are new age explorers. Their field of exploration is rife with data from various sources. Their methods are mathematics, linear algebra, computational sciences, statistics and data visualisation. Their tools are programming languages, frameworks, libraries and statistical analysis tools. And their rewards are stepping stones, better understanding and insights.

The data science process for many teams starts with data summaries, visualisation and data analysis, and ends with the interpretation of analysis results. However, in today’s world of rapid data science cycles, it is possible to do much more, if we take a hypothesis-centred approach to data science.

Theories for New Age Raconteurs

Data scientists work with data sets small and large, and are tellers of stories. These stories have entities, properties and relationships, all described by data. Their apparatus and methods give data scientists opportunities to identify, consolidate and validate hypotheses with data, and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a key challenge for data scientists; hypothesis generation – and by extension hypothesis refinement – constitutes the very purpose of data analysis and data science.

Hypothesis generation for a data scientist can take numerous forms, such as:

  1. They may be interested in the properties of a certain stream of data or a certain measurement. These properties and their default or exceptional values may form a certain hypothesis.
  2. They may be keen on understanding how a certain measure has evolved over time. In trying to understand this evolution of a system’s metric, or a person’s behaviour, they could rely on a mathematical model as a hypothesis (see the sketch after this list).
  3. They could consider the impact of some properties on the states of systems, interactions and people. In trying to understand such relationships between different measures and properties, they could construct machine learning models of different kinds.
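
As a small illustration of point 2 in the list above, a hypothesis about how a metric evolves over time can be written down as an explicit model and checked against the data. The sketch below assumes synthetic weekly data and a linear trend; both are illustrative choices rather than a prescription:

```python
# Treating "the metric grows linearly over time" as an explicit, testable hypothesis.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52)
metric = 100 + 0.8 * weeks + rng.normal(scale=5.0, size=weeks.size)  # synthetic weekly metric

slope, intercept = np.polyfit(weeks, metric, deg=1)         # fit the hypothesised linear model
residual_sd = np.std(metric - (slope * weeks + intercept))  # how far the data departs from it

print(f"estimated weekly growth: {slope:.2f}, residual sd: {residual_sd:.2f}")
```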

Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system behaviour and represent such behaviour in a manner that’s tangible and tractable based on simple, explicable rules. This makes story-telling easier for data scientists when they become new-age raconteurs, straddling data visualisations, dashboards with data summaries and machine learning models.

Developing Nuanced Understanding

The importance of hypothesis generation in data science teams is manifold:

  1. Hypothesis generation allows the team to experiment with theories about the data
  2. Hypothesis generation can allow the team to take a systems-thinking approach to the problem to be solved
  3. Hypothesis generation allows us to build more sophisticated models based on prior hypotheses and understanding

When data science teams approach complex projects, some may be wont to dive right into building complex systems based on the available resources, libraries and software. By taking a hypothesis-centred view of the data science problem, they can instead build up complexity and nuanced understanding in a natural way, developing hypotheses and ideas as they go.

Quora Data Science Answers Roundup

I’m given to spurts of activity on Quora. Over the past year, I’ve had the opportunity to answer several questions there on the topics of data science, big data and data engineering.

Some answers here are career-specific, while others are of a technical nature. Then there are interesting and nuanced questions that are always a pleasure to answer. Earlier this week I received a pleasant message from the Quora staff, who have designated me a Quora Top Writer for 2017. This is exciting, of course, as I’ve been focused largely on questions around data science, data analytics, hobbies like aviation and technology, past work such as in mechanical engineering, and a few other topics of a general nature on Quora.

Below, I’ve put together a list of the answers that I enjoyed writing. These answers have been written keeping a layperson audience in mind, for the most part, unless the question itself seemed to indicate a level of subject matter knowledge. If you like any of these answers (or think they can be improved), leave a comment or thanks (preferably directly on the Quora answer) and I’ll take a look! 🙂

Happy Quora surfing!

Disclaimer: None of my content or answers on Quora reflect my employer’s views. My content on Quora is meant for a layperson audience, and is not to be taken as an official recommendation or solicitation of any kind.

Domain: The Missing Element in Data Science

As a data science consultant who routinely deals with large companies and their data analysis, data science and machine learning challenges, I have come to appreciate one key element of the data scientist’s skill set that isn’t often discussed in data science circles online. In this post I hope to explain the importance of domain knowledge.

Over the last several years, there has (rightly) been significant debate on the skill sets of data scientists, and the importance of business, statistics, programming and other skill sets. Interesting sub-classifications of professions, such as “data hacker”, “data nerd” and other terms have been used to describe the various combinations or intersections of these skill sets.

The Importance of Domain Knowledge

In all of these discussions, however, one key element has been left out. And that is the domain.

Domain knowledge is an important part of the data scientist’s skill set. Although the perfect data scientist is a bit of a unicorn, the domain should be an important consideration.

Domain knowledge is distinct from statistics, data analysis, programming and the purely technical areas, and it is easy to see why. Business knowledge, however, is often conflated with domain knowledge – perhaps understandably, because both are vague and interdisciplinary areas. Business knowledge entails some amount of financial knowledge, unit economics models, strategy, people management, and a range of other skills taught in business schools and, more commonly, learned in organisations on the job. Domain knowledge, by contrast, is more like being a human expert system. Wikipedia defines an expert system without defining expertise. So what role does expertise play in data science?

Domain knowledge is a result of the system exploration that humans, as system builders, naturally do. To formulate intelligent hypotheses, one must study and understand the unique cause-effect chains relevant to specific systems. Do humans learn about systems in ways that differ from how machines might explore them, given infinite data and computational capability? That is a hard question to answer in this context, and perhaps a red herring. What is useful to note, however, is that machine learning models still rely on human-formulated hypotheses. There is the odd example of an expert system that has formulated hypotheses and proved them (as is happening in medicine these days), but such examples are hardly possible without human intervention.

Given that human intervention remains necessary in machine learning systems, data science can be seen as a field that relies uniquely on human-formulated hypotheses. While computational power and statistical models help us explore and construct hypotheses, the decisions made from the data – defining hypotheses, modeling the data to test them, constructing mathematical or statistical models, and evaluating the results of those tests – all take place with human intervention.

So where does domain fit in? Domain experts are those who have spent significant time learning about one or a few interconnected systems in intimate ways. Their ability to develop a gut feel for a system’s performance and characteristics helps them leapfrog the formulation of hypotheses, and this is their biggest advantage over domain-agnostic data scientists, who merely have the programming, statistics, business and communication skills required to make serious analysis happen.

Domain Expertise and Analysis Paralysis

Domain expertise is probably one fine way to fight off the analysis paralysis that plagues many data science teams. Some teams spend significant time and resources experimenting broadly with ideas, and the availability of high-performance computing power on tap leads them to take hypothesis formulation less seriously. Necessity is the mother of invention: it was when computing power was at a premium, for example, that some of the most efficient sorting algorithms were devised. Conversely, the availability of computing power and statistical modeling capability on a massive scale can disincentivize asking pertinent questions.

Pertinent questions and specific answers lead to tangible decisions and related business improvements. Without the benefit of domain knowledge, this is not possible. Analysis paralysis is a very real phenomenon, and data scientists are susceptible to it in organizations that value domain expertise but don’t value analytical solutions. Even where analytical solutions and problem solving are valued, data scientists who fly blind, toting algorithms and machine learning, won’t come out on top either – they’re more likely to hurt the credibility of the data science exercise than help it when they use complex algorithms (which may not give sufficient insight into their own workings, even when they work well) to solve simple problems that already have domain formulations.

Challenge or Channel Domain Expertise?

Machine learning work done in medicine (such as cancer cell detection) points to a future where human-learned skills are replicated by deep learning or reinforcement learning systems. At the same time, many real data science programs at diverse companies exhibit an analysis paralysis that can be addressed by involving domain experts of specific kinds more deeply in hypothesis formulation, analysis and the interpretation of results. The latter is more representative of the real world than the former scenario, in which an expert system independently learns about a hard problem and solves it.

Doing Data Science Better

In order to do data science better, it isn’t enough to develop data scientist resources along the lines described by Drew Conway or Stephan Kolassa. It is important to groom analytically capable people from within domains too. This means distributing the skill set required for serious analysis from the mainstream data science practice into functional teams. Sometimes this may mean penetrating leadership teams that work in functional capacities; at other times, it may mean addressing the needs of small teams directly, by grooming functional or technical talent to do data science.

Doing data science better doesn’t merely involve leveraging algorithms and their strengths better; it also means asking the right questions. Pay attention to your domain experts, and build your team’s analytical capabilities around them. For many companies, success doesn’t look like all-conquering deep learning algorithms; it looks like specific problems solved in a targeted manner, using well-defined problem statements and the right algorithms and frameworks.