Notes and Comments on Data Science

Here are some notes and comments I’ve made over the past several months and years, that convey current or extant thinking about doing data science and being productive as a data scientist. Most of these comments are from my LinkedIn engagement, and are to be read in that specific context.

  • Python is as elegant a language as they come and frameworks like Pandas have a lot going for them, but we’re far from the declarative paradigm here – and that is what seems to be the biggest productivity enhancer. Effective tooling is what really reduces time-to-insights.
  • Great set of slides by Arvind Narayanan, which I read with interest. To paraphrase Duncan Watts from Twitter, you could replace “AI” here with other technologies/capabilities that are sparsely known but admired/reviled/feared, and still have a lot of it be valid. The issue then is in the marketing-to-consumer value chain, and not the technical capabilities of ML/AI systems themselves. Good marketers ought to help people discover value from what is being marketed – in this case, that is clearly not happening. (link)
  • They published a version of this using Tensorflow some time back, and the original with MXNet was pretty good in itself, with numpy-esque matrix operations within MXNet being used for several demos. What I like here is that you see text, equations and code in one place, making it ideal as a resource to explore, experiment and learn. (link)
  • Picked up this book recently too. I like the writing style, and the discussions around seq-to-seq models. The core content in the beginning does a good job of covering all the text processing required for NLP. I foresee revisiting this book many times in the coming months/years! (link)
  • In light of the popularity of open source, it would be interesting to see if Mathworks makes any kind of OSS play. We’ve seen Microsoft changing their approach and benefiting hugely in the recent past, perhaps it is time for others to follow suit. There’s a lot of value to be added to the market by bringing out a free product variant. (link)
  • While Polynote is helpful, many more DS folks will use the Python-VS Code integrations more often than this. Scala is not used as widely for doing data science on notebooks as Python and those who want to use Scala for machine learning applications may as well switch to a full featured IDE such as IntelliJ or ScalaIDE.
  • You have a book by Albert Barabasi in there, whose work was inspiring. He’s written a book titled “Linked” which I read very enthusiastically some years ago. Also would recommend papers by Duncan Watts, Mark Newman and Steven Strogatz. I see that you already have a book by Mark Newman here in this list. The story seemed to start from Erdos and Renyi and their theory of random graphs; a couple of tomes from their day are also widely available.

One AI Marketing Conundrum

We are now in an age when the simplest kind of intelligence built into products and services is being marketed as “AI”. This is a regrettable consequence of current marketing practice, that seems to extend to individuals, products and even job postings. For instance, it isn’t unusual to want to hire “AI developers” these days, who have certified “credentials in AI”.

As a professional in the AI and Machine Learning space, I have come across and perhaps to an extent have been complicit in, such hype. However, with time, you gain perspective and collect feedback. Of late, the more strident the clarion calls of “AI this” and “AI that” are in products, the more common it is to see ordinary consumers become dismissive of new technology. I truly think this “performance undersupply” (to use a phrase coined by Tinymagiq’s Kumaran Anandan) in AI marketing is a bit of a regression (pun intended).

For instance, tools with natural language processing are routinely called out as being “AI”. Let’s dig a little deeper:

  1. Text mining, and the extraction of information from documents, requires mathematical representation, modeling and characterisation of corpuses of text. Stemming, lemmatization and other tasks commonly seen in text mining fit into this category.
  2. Models built on top of such representations that use them as input data learn statistical relationships between different representations. Term frequency histograms, TF/IDF models and the like are examples of such statistical models.
  3. End-to-end deep learning systems that perform higher-order statistical modeling on such representations can learn and generate more complex patterns in text data. This is what we see with language translation models.
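To make the point about statistical representations concrete, here is a minimal sketch of a TF/IDF weighting in plain Python. The whitespace tokenizer, the toy corpus and the helper names `tokenize` and `tf_idf` are illustrative assumptions, not part of any particular library; real pipelines would add stemming or lemmatization before this step.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real stemming/lemmatization)."""
    return text.lower().split()


def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    docs = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(corpus)
```

Terms that appear in every document get a weight of zero, while rarer terms score higher. Nothing here goes beyond counting and arithmetic, which is the point being made about such models.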

Note that none of the above truly implies intelligence. While there is an extensive use of statistical methods and techniques to represent text data and model it mathematically and statistically, there is no memory, context capture, or knowledge base. There is no agent in question, and therefore these can at best be described as enablers of artificial intelligence.

A post on LinkedIn by Kevin Gray talks about the same problem of marketing machine learning capabilities as AI. My response to his post is below, and perhaps it provides additional context to the discussion above on NLP/NLU/NLG and how that should be considered an enabler of AI, and not AI in and of itself.

The contention here seems to be on the matter of whether something can be described (and by extension marketed) as AI or not.

Perhaps it is more helpful to think of ML algorithms/capabilities such as NLU/NLP/NLG (as with audio/image understanding, processing and generation tasks) as _enablers_ of intelligent systems, and not the intelligence itself. 

This distinction can perhaps help ensure that consciousness, memory, context understanding and other characteristics of real-world intelligent agents are not glossed over in our quest to market one specific tool in the AI toolkit.

Coming to multiple regression – clearly a “soothsayer” or a forecaster (in the trading sense, perhaps) is valued for their competence and experience, which brings context and the other benefits I mentioned of real world intelligent agents. When a regression model makes a prediction along similar lines, that does not assume context either, and is therefore not in and of itself an intelligent system.  So in summary, I’d say that NLP/NLU/NLG and such capabilities are also not “AI”, just as stepwise multiple regression isn’t.

From my comment.

Coming back to the topic at hand, we all can probably acknowledge first that marketers won’t stop using the “AI” buzzwords for all things under the sun anytime soon. That said, we can rest easy because we might be able to understand, with a little effort, what the true capability of the marketed product or service in question is. Mental models like those described above might help contextualize and rationalize the hype as and when we see it.

Exploring Chaos and Bifurcation Diagrams in Python

In the study of nonlinear dynamical systems and chaos, one of the basic properties of systems we evaluate is period doubling, or bifurcation. As the parameters that describe system states change, the system can exhibit different modes of behaviour. Generally speaking, such systems exhibit a sensitive dependence on initial conditions, which is to say that with just small, barely perceptible changes in the initial conditions, we can get wildly different results in a system governed by very simple rules of iterative calculation.

In this notebook, I’m attempting to explore three different kinds of maps, and their associated period doubling behaviour, through bifurcation diagrams.

  1. Logistic map: A relatively well known function described by x_{n+1} = r x_n (1 - x_n), the logistic map’s bifurcation diagram is plotted as the change in x_n with varying values of r. Since each value of x_n depends on the previous value in a nonlinear fashion, a sensitive dependence on initial conditions is exhibited. Depending on starting values and the parameter values of r, the function exhibits smoothness initially followed by suddenly occurring chaos.
  2. Circle map: Circle maps are associated with Arnold Tongues and are described by the iteration \theta_{n+1} = \theta_n + \Omega - \frac{K}{2\pi} \sin(2\pi\theta_n).
  3. Gauss iterated map: This is a nonlinear iterated map defined as x_{n+1} = e^{-\alpha x_n^2} + \beta, and is generally computed like the Logistic map.
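The three maps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's actual code: the helper names (`attractor`, `bifurcation_points`, `distinct_values`), the transient length, the starting value 0.5 and the choice to take the circle map modulo 1 are all assumptions made for the example.

```python
import math


def logistic_map(x, r):
    return r * x * (1.0 - x)


def circle_map(theta, omega, k):
    # Taken modulo 1 so the angle stays on the unit circle (an assumption).
    step = theta + omega - (k / (2.0 * math.pi)) * math.sin(2.0 * math.pi * theta)
    return step % 1.0


def gauss_map(x, alpha, beta):
    return math.exp(-alpha * x * x) + beta


def attractor(step, x0, n_transient=1000, n_keep=64):
    """Iterate `step`, discard the transient, and return the values visited afterwards."""
    x = x0
    for _ in range(n_transient):
        x = step(x)
    orbit = []
    for _ in range(n_keep):
        x = step(x)
        orbit.append(x)
    return orbit


def bifurcation_points(make_step, params, x0=0.5):
    """Return (parameter, x) pairs; a scatter plot of these is the bifurcation diagram."""
    return [(p, x) for p in params for x in attractor(make_step(p), x0)]


def distinct_values(orbit, digits=6):
    """Collapse an orbit to its distinct values, up to rounding."""
    return sorted({round(x, digits) for x in orbit})


# Sweep r over [2.5, 4.0] for the logistic map.
r_values = [2.5 + 1.5 * i / 300 for i in range(301)]
points = bifurcation_points(lambda r: (lambda x: logistic_map(x, r)), r_values)

# Before the first period doubling the orbit settles on one value, after it on two.
fixed_point = distinct_values(attractor(lambda x: logistic_map(x, 2.8), 0.5))
period_two = distinct_values(attractor(lambda x: logistic_map(x, 3.2), 0.5))
```

Passing `circle_map` or `gauss_map` through `bifurcation_points` in the same way, with the other parameters held fixed, gives the corresponding diagrams. Plotting the resulting pairs is left to whichever plotting library is at hand.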

Explore the github repository for more.

Bifurcation diagram for a Circle Map
Bifurcation patterns and “edge of chaos” phenomena in the circle map, between k = 2.5 and k = 3.0
Circle Map bifurcation diagram in k = (2.9,3.0) shows alternating chaotic and non-chaotic behaviour

Related Ideas and Links

  1. Steven Strogatz’s lectures on nonlinear dynamics and chaos (link)
  2. Pitchfork bifurcation
  3. Logistic differential equation
  4. Simple mathematical models with very complicated dynamics, Robert M. May (open access link)
  5. Feigenbaum constants
  6. Feigenbaum scaling in discrete dynamical systems by Keith Briggs ( link )
  7. Mitchell Feigenbaum’s original paper on patterns in chaos (link)
  8. Learning resources for complex systems on Complexity Explorer; one specific introductory course may be helpful for those beginning to learn these subjects.

Exploring SVM Kernels

Support Vector Machines are a popular option for data scientists wanting to explore and model higher dimensional data. Despite their lack of scalability, they’re popular for prototyping different kinds of classifiers for systems where there are large numbers of variables. At the core of the SVM is the use of a kernel function, which enables a mapping of the feature space to a higher dimensional feature space. Therefore, if we’re unable to find separability between classes in the (lower dimensional) feature space, we could find a function in the higher dimensional space, which can be used as a classifier.

Two classes of data in the R^2 space

In this Jupyter notebook I’ve explored a couple of different types of kernel functions for bivariate, two-class data, where an SVM is being used to separate out these classes. Since these classes are not linearly separable, the use of kernel functions here enables us to find the best possible hyperplanes that can solve the separability problem. What’s interesting to note is that the convex hulls (in this case, polygons) for these classes are overlapping in the 2D space. This is a clear indicator of a lack of linear separability.

The blue polygon here represents the convex hull for one class, which is dimensions-wise nested in another, in this data representation

The use of a kernel opens up the possibility of linear separability, since we add an additional spatial dimension on which these points get distributed. Two mappings are explored here: a quadratic feature map on the two coordinates of a point, and a kernel evaluated between two points:

\phi(x_{1}, x_{2}) = (x_{1}^2, x_{1}x_{2}, x_{2}^2)^{T}

K(x_{1}, x_{2}) = a e^{{-\frac{1}{b} ||x_{1} - x_{2}||^{2} }}

The latter is called the radial basis function kernel, or the RBF kernel. Visualizing this kernel for the data we’d generated gives us the following nice image. What’s easily visible here is the possibility of separating out the classes thanks to the additional rbf dimension that has now been added.

RBF (vertical axis) enables separation of the two classes, in blue and yellow.
Note: Low opacity used for better visibility
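A small sketch of why the extra RBF dimension helps, assuming (as an illustration, not necessarily what the notebook does) that the kernel is evaluated against a single landmark at the origin and that the two classes are a disc nested inside a ring. The data generator `ring`, the landmark and the values of a and b are hypothetical.

```python
import math
import random


def rbf(p, q, a=1.0, b=2.0):
    """K(p, q) = a * exp(-||p - q||^2 / b), the same form as the RBF kernel above."""
    sq_dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return a * math.exp(-sq_dist / b)


def ring(n, r_min, r_max, rng):
    """Sample n points whose distance from the origin lies in [r_min, r_max]."""
    pts = []
    for _ in range(n):
        radius = rng.uniform(r_min, r_max)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((radius * math.cos(angle), radius * math.sin(angle)))
    return pts


rng = random.Random(0)
inner = ring(100, 0.0, 1.0, rng)  # one class, nested inside the other
outer = ring(100, 2.0, 3.0, rng)  # the surrounding class

landmark = (0.0, 0.0)  # assumed reference point for the extra dimension
inner_z = [rbf(p, landmark) for p in inner]
outer_z = [rbf(p, landmark) for p in outer]

# In the plane no straight line separates the classes, but along the new
# axis a single threshold (a horizontal plane in 3D) does.
threshold = (min(inner_z) + max(outer_z)) / 2.0
separable = min(inner_z) > threshold > max(outer_z)
```

A trained SVM does not rely on one hand-picked landmark; it effectively weighs kernel evaluations against its support vectors. The geometric intuition, however, is the same.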
Decision boundary identified by the SVM (which uses the RBF kernel)

Upon training the SVM classifier, visualizing the results gives us the below plot. The thick grey line is the decision boundary that enables us to separate the originally linearly inseparable classes in the dataset.

There are other explorations I hope to do on this notebook in future, specifically the process of calculating the sign (class label) of a dataset, based on the Lagrangian – which indeed brings us to the idea of the SVC being a maximal margin classifier. This is also referred to as the dual problem of the SVM. For another post!

Statistical Competence and Its Importance for Good Data Science Careers

In 2019, enterprises routinely begin analytics, data science and machine learning initiatives by committing to specific technologies from a very early stage. This tendency to put technology ahead of value sometimes extends to analytics champions and managers who take up or lead data-intensive initiatives. While this may seem pragmatic at one level, at another level, it may lead to significant problems when ensuring successful outcomes from such analytics initiatives and programs. In this post, I’ll address the three-pronged conundrum of statistical competence in the data science world, specifically in the context of data science consulting and services, and what it means for the careers of data science candidates now and in the future.

Hiring Statisticians: An Expert’s View

Kevin Gray is one of my connections on LinkedIn who posts insightful content on statistical analysis and related topics on a regular basis, including very good recommendations for books on various statistical and analytical techniques and methods. One of his recent posts was an article he’d authored titled “What to Look For in a Statistician” (the article, and my comment), which definitely resonated with my own experiences in hiring statistically competent engineers in different settings, such as data science and machine learning, between 2015 and today. In years past, I have had similar experiences when hiring competent product engineers and manufacturing engineers in data-intensive problem solving roles.

The importance of statistical thinking and statistical analysis in business problem solving cannot be overstated. However, even good advice that is canon, and that is well-acknowledged, often falls on deaf ears in the hyper-competitive data science job market. Both hiring managers and recruiters tend to emphasize keywords comprising the latest framework or approach, over the ability to think critically about problem statements, carefully architect systems, and rigorously apply statistical analysis and machine learning to real world problems while keeping considerations of explainability in mind.

The Three-Pronged Conundrum of Data Science Talent

Now you might ask why I say this, and what I really mean by this. The devil, as they say, is in the details, and one essential problem with the broad and wide proliferation of tools, frameworks and applications of high capability, that can perform and automate statistical analysis of different kinds, is the following three-pronged conundrum:

  1. Lack of core statistical knowledge despite having a working knowledge of the practicum of advanced techniques: Most candidates in the data science job market who are deeply interested in building data science and ML applications have unfortunately not developed skills in the core statistical sciences and statistical reasoning. Since statistics is the foundation for machine learning and data science, this degrades the quality of projects and programs which have to rely on hiring such talent. When they prefer to use software to do most of or all the thinking for them, their own reasoning about the problem is rarely good enough to critically evaluate different statistical formulations for problems, because they think in very set and specific ways about problems thanks only to their familiarity with the tools.
  2. Tools as an unfortunate substitute to statistical thinking: Solutions, services and consulting professionals in the data science and advanced analytics space, who have to bring their best statistical thinking to client-facing interactions, are unable to differentiate between competence in statistical thinking, and competence in a specific software tool or approach.
  3. Model bloat and inexplicability: The use of heavy, general purpose approaches that rely on complex, less explainable models, rather than simpler models that are constructed upon a fuller understanding of the true dynamics of the problem.

These three sub-problems can derail even the best envisioned data science and machine learning initiatives in product / solution delivery firms, and in enterprises.

Some “Unsexy” Characteristics of Good Data Scientists

These are also not “sexy” problems – they’re earthy, multi-dimensional, real world problems that have many contributing factors, from business and how it is done, to the culture of education and the culture of software and solution development teams. Kevin Gray in his post touches upon attitudinal qualities for good statisticians, which could also be extended to data science leaders, data scientists and data engineers:

  1. Integrity and honesty are important in data science – this is true especially in a world where personal data is being handled carelessly and sometimes gratuitously by many applications without heed to data protection and privacy, and when user data is taken for granted by many technology companies. This is not an easy expectation or evaluation point for hiring managers, since it is only long association with anyone which allows us to build a model of their integrity, and rarely does one effectively determine such an attribute in short interviews. What’s dismal about data science hiring sometimes, is the proliferation of candidate resumes which are full of fluff, and the tendency of candidates to not stand up to scrutiny on skills they identify as “key” or “core” skills.
  2. Curiosity and a broad spectrum of interests – this cannot be overstated in the context of a consulting data science or machine learning expert. The more we’re aware of different mental models and theoretical frameworks of the world and the data we see in it, the better we’re able to reason starting from hypotheses about the data. By extension, we’re better able to identify the right statistical approaches for a problem when we start from and explore different such mental models. The book I’ve linked to here by Scott E. Page is a fantastic evaluation of different mental models. But with models come biases, to restate George E. P. Box’s famous quote, “All models are wrong, some models are useful”.
  3. Checking for logical fallacies is key for data science reasoning – I would add to the critical thinking element mentioned in Kevin’s post, by saying that it behooves any thought leader such as a data science consultant to critically evaluate their own thinking by checking for logical fallacies. When overlooked, a benign piece of flawed reasoning can turn into a face-melting disaster. The best way to ensure this does not happen is to critically evaluate our ideas, notions and mental models.
  4. Don’t develop one hammer, develop a tool box – Like experienced plumbers, carpenters or mechanics, the tools landscape of a data scientist today should not be one of quasi-religious fervor in promoting one technique at the cost of others, such as how deep learning has come to be promoted in some circles as a data science panacea. Instead, the effective data scientist is usually pragmatic in their approach. Like a tailor or carpenter who has to cut or join different materials with different instruments, data scientists today do not have the luxury of getting behind one comfortable model of thinking about their tool set and profession – and any attempt to do this can be construed as laziness (especially for the consulting data scientist) at best. While the customer is always right, there are times when the client can be wrong and it is at these times that they need the advice of a qualified statistician or data scientist. If there is one time when data scientists should not abandon their statistical thinking, it is this kind of a situation.

Concluding Remarks

To conclude, data scientists ought not to be seen as resources that take data, analyze it using pre-built tools, and write code to explain the data using pre-built libraries of various kinds. They’re not software jockeys who happen to know some statistics and have a handle on machine learning workflows. Data scientists’ work scope and emphases as industry professionals and consultants go way beyond these limited definitions. Data scientists are expected to be dynamic, statistically sound professionals who critically evaluate real world problems based on theories, data and evidence drawn from many sources and contexts, and progressively build a deeper understanding of these real world problems that lead to tangible value for their customers, be they businesses or the consumers of products. The sooner data scientists realize this, the better off they will be while charting out a truly successful and fulfilling data science career.

Understanding the Logarithm Trick in Maximum Likelihood Estimation

Maximum Likelihood Estimation is a fundamental and powerful idea that’s at the centre of many things we do with data – so much so that we often use it without knowing it. MLE allows us to find a model’s parameters that are likely to enable the model to represent the data we have on our hands as closely as possible. This short post addresses the logarithm trick which is used to enable simpler MLE calculation.

There are two elements to understanding the formulation of MLE for the common Multivariate Gaussian model (which could be extended to other models equally):

  1. The i.i.d assumption that simplifies the MLE formulation
  2. The logarithm trick which enables solution of the MLE formulation

On this blog I’ve discussed topics like time series analysis in the past where the idea of independent and identically distributed variables is addressed, and of course, being an important statistical topic, it is well explained and understood. The logarithm trick, however, is specific to the simplification and solution of MLE formulations, and is helpful to understand.

The logarithm is a strictly increasing function: it changes the scale of its (positive) input while preserving order. This is extremely helpful when we want to rescale a quantity without moving the location of its maxima and minima.

When building a model p(x_1, \ldots, x_n | \theta) of the data (x_1, \ldots, x_n), the MLE formulation seeks to find the appropriate values of \theta such that

\nabla_{\theta} \prod_{i = 1}^{n} p(x_i | \theta) = 0

The interesting thing about the log transform is, as I said earlier, that in the transformation \ln ( \prod_{i} f_i ) = \sum_{i} \ln(f_i), there is no change in where the product \prod_{i} f_i attains a maximum or a minimum when it is transformed to \sum_{i} \ln(f_i). This logarithm trick lets us differentiate a sum instead of a product, which is far simpler, and thereby execute the MLE.
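A quick numerical check of this claim, as a minimal sketch: for a univariate Gaussian with known unit variance (an assumption chosen to keep the example small), a grid search over the mean finds the same maximizer whether we score candidates by the product of densities or by the sum of their logarithms. The data values and the grid are made up for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(data, mu):
    """Product of densities under the i.i.d. assumption."""
    return math.prod(gaussian_pdf(x, mu) for x in data)


def log_likelihood(data, mu):
    """Sum of log densities: the same objective after the logarithm trick."""
    return sum(math.log(gaussian_pdf(x, mu)) for x in data)


# Hypothetical observations and a grid of candidate means from 0.00 to 4.00.
data = [2.1, 1.9, 2.4, 2.0, 1.6]
candidates = [i / 100 for i in range(401)]

best_by_product = max(candidates, key=lambda mu: likelihood(data, mu))
best_by_log_sum = max(candidates, key=lambda mu: log_likelihood(data, mu))
sample_mean = sum(data) / len(data)
```

Both searches land on the same grid point, which also coincides with the sample mean, the closed-form MLE for a Gaussian mean. With thousands of observations the raw product would underflow to zero in floating point, which is a second, practical reason to work with the log-likelihood.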

Backpropagation and Gradient Descent

Sometimes the simple questions can be revelatory and make one think about the possibilities we have in front of us to improve existing processes, systems and methods.

On this occasion, it was a simple question on Quora about gradient descent and the ubiquitous backpropagation algorithm used in neural networks and deep learning. The content of my answer follows.

The process of computing weights and biases which minimize the error from a neural network could be any optimization algorithm which is good enough for the job. Gradient descent (especially the versions of the algorithm which use momentum and RMS propagation) is especially effective and has well implemented matrix algebra formulations in languages like C and Python, which is why it is used so often. Equally though, a genetic algorithm or simulated annealing algorithm (which are more complex and computationally intensive) may be used for finding such weights and biases on each iteration. Indeed, such methods have been and are being researched extensively.[1]

Backpropagation is defined by four equations that help calculate new weights and biases to update a neural network.[2]

The first of these equations helps calculate the error at the output layer. The second helps calculate the error in a given layer based on the error in the next layer. The third and fourth equations help calculate the rate of change of the loss function C with variations in the weights and biases.

Therefore the algorithm itself can be written out as follows[3] :

We then use the gradient of the cost function to compute the new values of w and b, based on things like the learning rate and regularization parameters as applicable.
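The description above can be sketched in plain Python for a tiny fully connected network. This is an illustration under stated assumptions rather than the algorithm listing from the cited reference: it assumes sigmoid activations and a quadratic cost, and the function names, starting weights and learning rate are made up. The comments mark where each of the four equations described above is applied, and a finite-difference check is included as a sanity test.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(weights, biases, x):
    """Return the weighted inputs z and the activations a for every layer."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations


def cost(weights, biases, x, y):
    """Quadratic cost C = 0.5 * ||a - y||^2 (an assumption for this sketch)."""
    _, activations = forward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


def backprop(weights, biases, x, y):
    zs, activations = forward(weights, biases, x)
    # First equation: error at the output layer.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta[:]  # Third equation: dC/db equals the layer error.
        # Fourth equation: dC/dw_jk = (incoming activation k) * (error j).
        grad_w[l] = [[a_k * d_j for a_k in activations[l]] for d_j in delta]
        if l > 0:
            # Second equation: error in a layer from the error in the next layer.
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_w, grad_b


def gradient_step(weights, biases, grad_w, grad_b, learning_rate):
    """Plain gradient descent update, without momentum or regularization."""
    new_w = [[[w - learning_rate * g for w, g in zip(row, g_row)]
              for row, g_row in zip(w_l, g_l)] for w_l, g_l in zip(weights, grad_w)]
    new_b = [[b - learning_rate * g for b, g in zip(b_l, g_l)]
             for b_l, g_l in zip(biases, grad_b)]
    return new_w, new_b


# Hypothetical 2-2-1 network and a single training example.
weights = [[[0.15, -0.2], [0.4, 0.1]], [[0.3, -0.5]]]
biases = [[0.1, -0.1], [0.05]]
x, y = [1.0, 0.5], [1.0]

# Sanity check: compare one backprop gradient with a central finite difference.
grad_w, grad_b = backprop(weights, biases, x, y)
analytic = grad_w[0][0][1]
eps = 1e-6
weights[0][0][1] += eps
cost_plus = cost(weights, biases, x, y)
weights[0][0][1] -= 2 * eps
cost_minus = cost(weights, biases, x, y)
weights[0][0][1] += eps
numeric = (cost_plus - cost_minus) / (2 * eps)

# Repeated backprop + update steps drive the cost down.
initial_cost = cost(weights, biases, x, y)
for _ in range(200):
    grad_w, grad_b = backprop(weights, biases, x, y)
    weights, biases = gradient_step(weights, biases, grad_w, grad_b, learning_rate=0.5)
final_cost = cost(weights, biases, x, y)
```

Practical implementations vectorize these loops with matrix operations and process mini-batches. The nested lists here are only meant to keep the indices visible.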

Why gradient descent? Since the process of backpropagation is iterative (we go from steps 1 – 5 and back again), for each update, we can get better and better versions of the weights w and biases b that are able to reduce the error between the target and the result produced by the network. The following animation (source: Data Blog) probably gives you an idea (the red areas are higher values of Cost, and blue means lower values).

A nice graphic illustrating gradient descent

Now, you might ask : can’t other algorithms be used to do the same thing? The answer is indeed yes. We can use many other optimization algorithms (constrained and unconstrained ones, used for convex and non-convex functions). If you would like to learn about convex optimization with theoretical treatment in more detail, consider this resource: Convex Optimization – Boyd and Vandenberghe. In addition to other convex optimization methods, there are scores of robust optimization methods such as:

  1. Genetic algorithms
  2. Particle Swarm Optimization
  3. Simulated Annealing
  4. Ant Colony Optimization

While some of these, especially GAs and PSOs have been explored in the context of neural networks, common implementations of deep learning algorithms still rely on the gradient descent family of algorithms (such as Nesterov – which has come to be implemented in a distributed paradigm, RMSProp, Adam).
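To illustrate that the optimizer is interchangeable, here is a minimal sketch that minimizes the same toy cost with plain gradient descent and with simulated annealing. The one-dimensional quadratic cost, the cooling schedule, the step size and the seed are all assumptions chosen for the example, and neither routine is tuned.

```python
import math
import random


def cost(w):
    """Toy cost with its minimum at w = 3 (an assumption for illustration)."""
    return (w - 3.0) ** 2


def cost_gradient(w):
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=200):
    for _ in range(steps):
        w -= learning_rate * cost_gradient(w)
    return w


def simulated_annealing(w, rng, steps=5000, temp=1.0, cooling=0.999, step_size=0.5):
    best = w
    for _ in range(steps):
        candidate = w + rng.uniform(-step_size, step_size)
        delta = cost(candidate) - cost(w)
        # Always accept improvements; accept worse moves with probability exp(-delta / temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            w = candidate
        if cost(w) < cost(best):
            best = w
        temp *= cooling  # geometric cooling schedule
    return best


w_gd = gradient_descent(-4.0)
w_sa = simulated_annealing(-4.0, random.Random(0))
```

Both end up near the minimum, but gradient descent gets there with 200 gradient evaluations while the annealer spends thousands of cost evaluations on random proposals. That difference in cost per useful step is one plausible reason the gradient descent family remains the default when gradients are cheap to compute.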



[2] Neural networks and deep learning

[3] Neural networks and deep learning

Natural Language AI: Architectural Considerations

If there is one area of AI that was closely watched by practitioners and researchers in 2018 and as of early 2019, it was the natural language processing space. Innovations in sequence modeling deep neural networks, ranging from bidirectional LSTM networks, to Google’s BERT and Microsoft’s MT-DNN have improved capabilities such as language translation in a significant way. There are many more advancements in the field of deep learning which have been very well summarized by MIT Researcher and Professor Lex Fridman in the below talk.

The State of the Art

Lex Fridman takes us through many deep learning developments in 2018, including BERT

Given the presence of mature and increasingly sophisticated models of language translation, and improvements in language understanding, what many human-machine interface development teams may be looking at, to leverage these capabilities, is the right kind of architecture for enabling this capability. After all, it is only when these algorithms reach the customer in an actual translation or language understanding task, that their value is realized.

It is evident from the MT-DNN paper by Microsoft Research that some core elements of the natural language processing tasks won’t change. For instance, look at the architecture diagram of the MT-DNN (Multi-tasking Deep Neural Network) below.

MT-DNN Architecture from Microsoft’s Research Paper

The feature vector X still has to be taken through all shared layers in any sentence / phrase based interaction, leading to the context embedding vectors we see as l2. It goes without saying that when we have such a shared architecture which provides the underlying capabilities for transformation, representations and word encoding, the subsequent deeper layers of the network can become more specialized, be this pairwise similarity, classification or other use cases.

Similar Paradigms

The surprising thing is that this isn’t a new capability. It is rather analogous to the higher level representations learned by face recognition deep learning networks, or the higher order patterns learned by deep LSTM sequence classifiers.

Face recognition DNNs and the features they learn (via presentation here on Slideshare, by Igor Trajkovski)

One of the trends anticipated by Andrew Ng and other Deep Learning researchers some years ago is the arrival of end-to-end deep learning systems. In the past, there would have been a need for specific components across data pre-processing, feature engineering, machine learning or optimization, and perhaps a compositing layer which encompasses all these elements. This component-wise architecture can, given enough data, be replaced utterly by a deep learning network. Falling back on the mathematics behind the possibility of deep learning networks as universal function approximators (Hahn-Banach theorem et al, as shown here) provides another justification for such an end-to-end architecture for deep learning centric systems.

Natural language centric AI systems are, by definition, customer-centric. There are few use cases for systems deep inside the woods of business processes that require this capability, and because of this context, such AI systems have to provide for online learning and management of concept drift. Concept drift management is no easy task, and active research continues to happen in the space ( one example is here ). Concept drift verily informs capabilities such as online learning, and although brute force methods exist for reiterating large scale training, there’s only so far that can go before a smarter approach is sought out.
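One simple way to think about managing concept drift is to monitor a deployed model's recent error rate against a reference window and flag when it degrades. The sketch below is only one basic heuristic, not the method from the research linked above; the `DriftMonitor` class, its window size and margin, the stand-in `model` and the synthetic stream are all hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Flags possible concept drift when the recent error rate exceeds a baseline by a margin."""

    def __init__(self, window=50, margin=0.15):
        self.window = window
        self.margin = margin
        self.baseline = None
        self.recent = deque(maxlen=window)  # 1 for a wrong prediction, 0 for a correct one

    def update(self, prediction, label):
        """Record one outcome; return True if drift is suspected."""
        self.recent.append(0 if prediction == label else 1)
        if len(self.recent) < self.window:
            return False
        error_rate = sum(self.recent) / self.window
        if self.baseline is None:
            self.baseline = error_rate  # the first full window sets the reference
            return False
        return error_rate > self.baseline + self.margin


def model(x):
    """Stand-in for a deployed classifier."""
    return 1 if x > 0.5 else 0


# Synthetic stream: labels agree with the model at first, then the concept flips.
stream = [(i % 10) / 10 for i in range(400)]
monitor = DriftMonitor()
first_alarm = None
for t, x in enumerate(stream):
    label = model(x) if t < 200 else 1 - model(x)
    if monitor.update(model(x), label) and first_alarm is None:
        first_alarm = t
```

In a real system the alarm would trigger retraining or an online update rather than just being recorded, and labels often arrive with a delay, which makes the window and margin choices a trade-off between responsiveness and false alarms.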

Architectural Considerations

Some architectural considerations for such end-to-end natural language centric AI applications therefore could be:

Four architectural considerations for Natural Language centric AI systems

Harmonization of data generation processes calls for unified user interfaces, or sub-layers in the user interfaces, which translate the end user’s intent. The manifestation of this intent may be different in different cases, depending on whether the interface is speech based, vision based or gesture based, for instance. However, intent inference and translation to a natural language paradigm could be a key capability, which enables a certain kind of taught interaction to AI systems.

We have seen already how common representational methods of input data can be a massive advantage for building numerous specializations on top of what was already available as a core capability. Modularity therefore becomes more important. In the presence of a common representational standard for input data, building specialized networks can become more straightforward, since a number of constraints begin to manifest themselves in any AI development life cycle. Finally, concept drift and its management become important considerations for the last-mile of the AI value delivery, at deployment time.


It should be realized that modern language models such as BERT and MT-DNN provide very advanced capabilities which can enable natural language interactions in a manner never before imagined, and as we see intelligent systems leverage these algorithms at large scale, we will probably also see the above architectural considerations of input harmonization, common representation, specialization + modularity and online learning become infused into the architecture of common AI systems.

What Could Data Scientists (And Data Science Managers) Be Doing Better in 2019?

The “data science” job description is becoming more and more common, as of early 2019.

Not only has the field garnered a great deal of interest from software developers, statisticians and machine learning exponents, but it has also attracted plenty of interest over the years from people in roles such as strategy, operations, sales and marketing. Product designers, manufacturing and customer service managers are also turning towards data science talent to help them make sense of their businesses and processes, and find new ways to improve.

The Data Science Misinformation Challenge

The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common sense, reasonable starting points to look for data science talent and begin analytical programs in organizations. The problem starts with the limited availability of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to the media’s fulminations around sometimes disconnected value propositions – chat bots, artificial intelligence agents, machine learning and big data – these terms have come to be clumped together with data science, purely because of the similarity of notion, or because of some of the skills required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from calling automated machine learning “Building AI that can build AI” (NYT), to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, as I expect 2019 will also be, to some extent. I have dealt with this topic extensively in an earlier post here. What I take issue with, naturally, is the fact that this serves to misinform business teams about what’s really important.

Managing Data Science Better

Astute business leaders build analytical programs where they don’t put the cart before the horse. By this, I mean the following things:

  1. They have not merely a data strategy, but a strategy for labelled data
  2. They start with small problems, not big, all-encompassing problems
  3. They grow data science capabilities within the team
  4. They embrace visualization methods and question black box models
  5. They check for actual business value in data science projects
  6. They look for ways to deploy models, not merely build throw-away analyses

Data science and analytics managers ought not to:

  1. Perpetuate hype and spread misinformation without research
  2. Set expectations based on such hype around data science
  3. Assume solutions are possible without due consideration
  4. Fail to budget for subject matter experts
  5. Neglect to train their staff and still expect better results

As obvious as these points may sound, the mistakes they describe are all too common in the industry. Then there is the problem of consultants who sometimes perpetuate the hype, thereby reinforcing some of these behaviors.

Doing Data Science Better

Now let’s look at some of the things Data Scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others have to do with the tools and the infrastructure provided to data scientists in companies. Some have to do with preferences, while others have to do with processes. I find many common practices by data science professionals to be problematic. Some of these are:

  1. Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general
  2. Not being aware of how some of the details of algorithms work – and not bothering to learn this even after several projects where their shortcomings are highlighted
  3. Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling
  4. Not visualizing data before attempting to build models from them
  5. Assuming things about the problem solving approach they should take, without basing this on EDA results
  6. Not differentiating between the unique characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient, compared to others

Some of these can be sorted out by asking critical questions such as the ones below (which may overlap to some extent with the activities listed above):

  1. Where the data came from
  2. How the data was measured
  3. Whether the data was altered in any way, and if so, how
  4. How the insights will be consumed
  5. What user experience is required for the analytics consumer
  6. Whether the solution has to scale

This is just a select list, and I’m sure that contextually, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager) and to plan and structure data science work (as a data scientist).
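
As a rough illustration of the basic EDA mentioned in the list of practices above, here is a minimal Python sketch that summarizes one numeric column before any modeling is attempted. It uses only the standard library. The function name `summarize_column`, the sample readings and the 1.5 x IQR outlier rule are my own assumptions for illustration, not a prescribed method.

```python
import math
import statistics


def summarize_column(values):
    """Return basic EDA statistics for one numeric column.

    `values` may contain None or NaN to mark missing measurements.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = len(values) - len(present)

    # Quartiles from the standard library (default "exclusive" method).
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # Tukey's rule (an assumption here): values further than 1.5 * IQR
    # from the quartiles are flagged for review, not silently dropped.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "count": len(present),
        "missing": missing,
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "outliers": outliers,
    }


# Hypothetical sensor readings; None marks a missing measurement.
readings = [10.1, 9.8, 10.3, None, 10.0, 9.9, 42.0, 10.2]
summary = summarize_column(readings)
print(summary)
```

A summary like this does not replace plotting the data, but it surfaces missing values and suspicious readings early, which in turn prompts the questions about where the data came from and how it was measured.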

Questions that Data Scientists Hate Getting

This is a variation on a Quora answer.

When asked how data scientists can be effective, there are a few things that come to mind:

  1. Skills: A curiosity and sufficient skill in data analysis methods and techniques
  2. Fundamental needs: the data and access to the tools to perform analysis — and this would include the environments
  3. Performance needs: Sufficient resources, time and good enough processes to validate or invalidate hypotheses and build models based on them
  4. Excitement needs: Sufficient support and latitude to independently deploy projects based on successful hypotheses tested and models built

Note that while the criteria listed above begin with the fundamental skills required to do data science, the focus shifts in items 2, 3 and 4 to what is required for data scientists to be effective. The first of these is the set of fundamental needs, such as the data itself and access to the required tools, be they statistical or machine learning tools, databases, visualization libraries, or other resources. The second is the set of performance needs, which help data scientists do their work a bit better than they do now. This includes processes and systems that enable data scientists to improve their own capabilities. Finally, we have excitement needs, which enable data scientists to do outstanding work; a large part of this is being able to reuse what has been built, through deployment of various kinds.

It is in this context that we can discuss how managers of data science teams can help them be effective.

If there is one kind of behaviour in analytics managers that I wish changed, it is the one I describe in the following lines.

A lot of what data scientists do is experimental, throw-away analysis. However, it is tempting for a number of managers (many of whom have made up their minds that some hypothesis holds true, or will work) to assume that they’re right, and that all that is required from the data scientist is the detailed model that formalizes the relationship.

This kind of assumption makes for poorly designed projects, and doesn’t make good use of the data scientist’s time for exploratory analysis, for evaluating different kinds of models, and for finding out what works, given the dataset.

Naturally, given the time-bound nature of businesses and poor understanding of analytics at the executive level in many organizations, such clients are commonplace, and such managers also find themselves in a situation where they push for results without the right underlying systems, data or resources. Sometimes, they begin projects with data scientists who lack the specific skills to build the kinds of models required to solve problems. While this may be the case, the challenge many data scientists in business and consulting have is dealing with such unreasonable expectations.

In this specific context, some questions that shouldn’t be posed to data scientists might be along the following lines:

  • “Assuming that hypothesis X works, how long would it take to build a full-fledged application using this hypothesis X?”
  • “The domain experts are convinced that this hypothesis X is true. Why don’t your results reflect this too?”
  • “The values of R_sq or precision/recall I see here don’t reflect what can be done with the data. Aren’t better results possible?”

These kinds of questions are simplistic in the initial stages of a data science activity/experiment, and in some situations, they could be dangerous too (although they’re innocuous mistakes that any manager new to analytics initiatives may make).

For the same reason that “a little knowledge is a dangerous thing”, these project managers might be playing with the fortunes of the entire analytics program they serve, because they base even large projects on such naive and unverified assumptions. Were they to change their behaviour by giving due consideration to exploratory data analysis, and to what the data actually says about viable models and applications that may be built, they would be putting their data scientists and engineers on the path to success.