Finishing up the Columbia University + Emeritus Post Graduate Diploma in ML and AI

I have had a very interesting nine months or so deepening my fundamental skills and learning new skills in AI and Machine Learning. With the Coronavirus crisis and the associated disruption to all our lives across the world in a matter of mere months, it seems like the world has changed overnight. Now more than ever, skill development and improvement can sustain us all through tough times and unforeseen challenges.

In this context, and in retrospect, the PGDMLAI program at Columbia was a well crafted program that pushed my boundaries. Despite my having been an active industry professional in the AI and ML space for several years, it built genuinely new skills and changed the way I think about AI and ML problem statements.

On the AI side, most of my new explorations have been in the realms of advanced deep learning and reinforcement learning. In the last month or two, for example, I’ve explored FaceNet, MTCNNs, GANs and deep RL techniques. Learning about search techniques, Markov decision processes, CSPs and reinforcement learning techniques (policy and value iteration methods) in this program was therefore particularly rewarding.

Learning Experience, Faculty and Assignments

The course content is mathematically detailed and well paced. There is an emphasis on the core ideas of each algorithm you learn, and on the assumptions you make in each case. Interesting concepts and sub-problems such as transformations for monotonic functions, two-class problems in SVMs, the maximum likelihood principle, Bellman’s equation, etc., are discussed in context, and the additional resources provided gave me an opportunity to dig deeper into this content once I was done with the lectures or assignments.

As for prerequisites, it goes without saying that even for me – an industry professional in AI and ML who is regularly involved in developing AI and ML code and solutions – the depth of the ideas presented is advanced; it is a graduate level course. I needed some preparation before lectures, and revision and revisiting of ideas afterwards, to understand certain concepts correctly. At times I’d feel out of my depth and would need to revisit course videos several times, and sometimes even the prerequisites. In my case, I went back quite a bit to linear algebra lessons from MIT OCW (Gilbert Strang’s lectures) and elsewhere. I also went back to a lot of Python programming on DataCamp, which all students in the course had access to.

In addition to the core topics within AI and ML, several interesting topics in CSPs and Reinforcement Learning were also discussed. Cryptarithmetic puzzles solved by search techniques like backtracking, and the Sudoku solvers, stand out. For someone like me without a formal background in computer science, the initial lectures on graph search algorithms were gold – they were essential for eventually understanding the deeper ideas within AI.

A related thread from my Twitter page:

I want to take a moment to appreciate the incredible faculty and staff for this course:

  1. Prof. David Yakobovitch – the course leader for the program who shared invaluable knowledge in all the webinars
  2. Prof. Ansaf Salleb-Aouissi – the AI expert who taught the various search and ML techniques lucidly
  3. Prof. Jacob Koehler – who ran the excellent office hours sessions that were incredibly helpful in clarifying hard problems when implementing solutions we’d learnt conceptually
  4. Prof. John Paisley – who provided lucid and mathematically comprehensive lectures in machine learning

All of the above staff (and many others) routinely engaged with us students on the course forums, and I am sure I’ve made a few friends among my fellow students during the course of this program.

I also want to thank the Emeritus and Columbia team for making DataCamp’s data science and ML courses available to all students. This definitely helped me in the course of the learning experience.

As someone who has been coding up ML algorithms in Python for years, I found that some of the content wasn’t new to me, but that didn’t keep the assignments from being challenging or interesting. A whole lot of the program involved implementing things from scratch – especially the pieces around reinforcement learning and search, and also many ML algorithms (KNNs, K-Means, linear and logistic regression, regularized regression methods, and more). During this program, I picked up more TensorFlow and Keras than I already knew, and also picked up PyTorch!
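As a flavour of what implementing these from scratch involved, here’s a minimal K-Means (Lloyd’s algorithm) sketch in plain NumPy – purely illustrative, and not the course’s reference solution:

```python
# A minimal sketch of K-Means (Lloyd's algorithm) in plain NumPy -- purely
# illustrative, not the course's reference implementation.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, k=2)
print(centroids)
```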

Time (and Energy) Management 

The course took many weekend sessions and evenings over the last nine or so months to complete. I would sometimes eagerly await the break weeks just to keep pace – and the weekly exercises were comprehensive: not just coding assignments, but full fledged problem solving opportunities.

Being in a full time job and managing responsibilities at home, besides learning ML and AI, requires good time management. Looking back, I definitely could have done some things better – after all, nobody is perfect! If I were to list three things I could have done better, they’d be: a) sticking with the assignment problems every day and trying different approaches, b) writing my own sub-problems to work towards the assignment problem at hand, and c) setting aside time to read and replicate papers related to the topics at hand.

Capstone Project Experience

The final part of the program was a capstone assignment, featuring some particularly dirty and complex data. The assignment emphasized the importance of extracting business value from data. This aspect of data science has not changed over the years – it is as true now as it was in 2015 when I first took up data science. Value from data is where the rubber hits the road for organizations adopting data science and ML/AI. In the last nine or so months at work, while on this program, I’ve helped build a serverless data lake on the cloud and helped solve several large scale machine learning problems for telecommunications networks – and in all these cases, as in the capstone, it came back to “How do we tie the results from our analysis to the business decisions to be taken?”

In this sense, the Capstone project was interesting and challenging – as in real world projects, you have to make and state assumptions, qualify and confirm some of these assumptions, and ensure your analysis makes sense. You’d have to go back and redo some of the analysis based on new data that comes to light. You’d have to spend a significant amount of time planning your data preparation process. For instance, if you’re building a supervised classification model, you’d need to identify the hypotheses and the associated features, and budget time for data preparation tasks. A large part of data science is ensuring that your dataset is suitable for machine learning – that you have target variables identified and your features sorted – and this process of developing a data pipeline is as important as any other step of the process.
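To make that concrete, here’s a small, hypothetical sketch of the kind of preparation-plus-model pipeline such a supervised classification problem typically needs – the tiny inline dataset and column names are invented for illustration and have nothing to do with the capstone data:

```python
# Hypothetical sketch of a data preparation pipeline for a supervised
# classification problem. The tiny inline dataset and the column names
# ("churned", "tenure", ...) are invented and are not from the capstone.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "tenure":        [3, 48, 12, None, 60, 7, 24, 36],
    "monthly_spend": [20.0, 75.5, 40.0, 55.0, None, 22.5, 61.0, 44.0],
    "plan_type":     ["basic", "premium", "basic", "premium", "premium", "basic", "basic", "premium"],
    "churned":       [1, 0, 1, 0, 0, 1, 0, 0],   # target variable identified up front
})

y = df["churned"]
X = df.drop(columns=["churned"])

numeric, categorical = ["tenure", "monthly_spend"], ["plan_type"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```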

I was fortunate that my Capstone project submission was awarded an “Exemplary Assignment” badge by the course faculty (badge below)! The feedback I received via the grading rubric was also very interesting and meaningful.

[Image: Emeritus “Exemplary Assignment” badge]

Onward!

As anyone within the technology industry will know, learning new things is a regular part of our lives, and mental flexibility in doing so takes us forward in our careers. This is even more the case with data, machine learning, AI and related areas of technology today. I cherish the learning experience of the PGDMLAI as I have others in the past, and treasure it as much as a real project – in many ways, as a comprehensive, well put together program, it was a great way to pick up in-depth skills and solve challenging problems. The proverbial axe has, therefore, been sharpened.

I hope to spend even more time on reinforcement learning centric problem statements in the coming months. Topics such as GANs and the new challenges they present are also interesting. The current Coronavirus crisis has given me opportunities to think about the problems that need to be solved in the world today, and about new and innovative ways in which data and AI can be used to solve them. I look forward to ideating on such problems and finding ways to extend my newly gained skills to them, be they in domains as diverse as healthcare, pharma, telecommunications, manufacturing or technology.


Notes and Comments on Data Science

Here are some notes and comments I’ve made over the past several months and years that convey my current thinking about doing data science and being productive as a data scientist. Most of these comments are from my LinkedIn engagement, and are to be read in that specific context.

  • Python is as elegant a language as they come and frameworks like Pandas have a lot going for them, but we’re far from the declarative paradigm here – and that is what seems to be the biggest productivity enhancer. Effective tooling is what really reduces time-to-insights.
  • Great set of slides by Arvind Narayanan, which I read with interest. To paraphrase Duncan Watts from Twitter, you could replace “AI” here with other technologies/capabilities that are sparsely known but admired/reviled/feared, and still have a lot of it be valid. The issue then is in the marketing-to-consumer value chain, and not the technical capabilities of ML/AI systems themselves. Good marketers ought to help people discover value from what is being marketed – in this case, that is clearly not happening. (link)
  • They published a version of this using Tensorflow some time back, and the original with MXNet was pretty good in itself, with numpy-esque matrix operations within MXNet being used for several demos. What I like here is that you see text, equations and code in one place, making it ideal as a resource to explore, experiment and learn. (link)
  • Picked up this book recently too. I like the writing style, and the discussions around seq-to-seq models. The core content in the beginning does a good job of covering all the text processing required for NLP. I foresee revisiting this book many times in the coming months/years! (link)
  • In light of the popularity of open source, it would be interesting to see if Mathworks makes any kind of OSS play. We’ve seen Microsoft changing their approach and benefiting hugely in the recent past, perhaps it is time for others to follow suit. There’s a lot of value to be added to the market by bringing out a free product variant. (link)
  • While Polynote is helpful, many more DS folks will use the Python-VS Code integrations more often than this. Scala is not used as widely for doing data science on notebooks as Python and those who want to use Scala for machine learning applications may as well switch to a full featured IDE such as IntelliJ or ScalaIDE.
  • You have a book by Albert-László Barabási in there, whose work I have found inspiring. He’s written a book titled “Linked” which I read very enthusiastically some years ago. I would also recommend papers by Duncan Watts, Mark Newman and Steven Strogatz. I see that you already have a book by Mark Newman in this list. The story seemed to start with Erdős and Rényi and their theory of random graphs; there are a couple of tomes from back in their day that are also widely available.

One AI Marketing Conundrum

We are now in an age when the simplest kind of intelligence built into products and services is being marketed as “AI”. This is a regrettable consequence of current marketing practice, one that seems to extend to individuals, products and even job postings. For instance, it isn’t unusual these days to see postings seeking to hire “AI developers” with certified “credentials in AI”.

As a professional in the AI and Machine Learning space, I have come across, and perhaps to an extent been complicit in, such hype. However, with time, you gain perspective and collect feedback. Of late, the more strident the clarion calls of “AI this” and “AI that” in products, the more common it is to see ordinary consumers become dismissive of new technology. I truly think this “performance undersupply” (to use a phrase coined by Tinymagiq’s Kumaran Anandan) in AI marketing is a bit of a regression (pun intended).

For instance, tools with natural language processing are routinely called out as being “AI”. Let’s dig a little deeper:

  1. Text mining, and the extraction of information from documents, requires mathematical representation, modeling and characterisation of corpora of text. Stemming, lemmatization and other tasks commonly seen in text mining fit into this category of tasks.
  2. Models built on top of such representations use them as input data and learn statistical relationships between different representations. Term frequency histograms, TF/IDF models and the like are examples of such statistical models (a minimal sketch of one follows this list).
  3. End-to-end deep learning systems that perform higher-order statistical modeling on such representations can learn and generate more complex patterns in text data. This is what we see with language translation models.
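For a sense of how mechanical these representations really are, here’s a minimal sketch using scikit-learn’s TF-IDF vectorizer on a toy, invented corpus – a purely statistical transformation of text, with no memory, context or knowledge involved:

```python
# A minimal sketch: TF-IDF as a purely statistical representation of text.
# The toy corpus below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the network outage was resolved quickly",
    "customers reported a network outage last night",
    "billing questions were resolved by the support team",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)          # document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(X.toarray().round(2))                   # each row is one document's representation
```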

Note that none of the above truly imply intelligence. While there is an extensive use of statistical methods and techniques to represent text data and model it mathematically and statistically, there is no memory, context capture, or knowledge base. There is no agent in question, and therefore these can at best be described as enablers of artificial intelligence.

A post on LinkedIn by Kevin Gray talks about the same problem of marketing machine learning capabilities as AI. My response to his post is below, and perhaps it provides additional context to the discussion above on NLP/NLU/NLG and how that should be considered an enabler of AI, and not AI in and of itself.

The contention here seems to be on the matter of whether something can be described (and by extension marketed) as AI or not.

Perhaps it is more helpful to think of ML algorithms/capabilities such as NLU/NLP/NLG (as with audio/image understanding, processing and generation tasks) as _enablers_ of intelligent systems, and not the intelligence itself. 

This distinction can perhaps help ensure that consciousness, memory, context understanding and other characteristics of real-world intelligent agents are not glossed over in our quest to market one specific tool in the AI toolkit.

Coming to multiple regression – clearly a “soothsayer” or a forecaster (in the trading sense, perhaps) is valued for their competence and experience, which bring context and the other benefits I mentioned of real world intelligent agents. When a regression model makes a prediction along similar lines, it does so without any such context, and is therefore not in and of itself an intelligent system. So in summary, I’d say that NLP/NLU/NLG and such capabilities are also not “AI”, just as stepwise multiple regression isn’t.

From my comment.

Coming back to the topic at hand, we all can probably acknowledge first that marketers won’t stop using the “AI” buzzwords for all things under the sun anytime soon. That said, we can rest easy because we might be able to understand, with a little effort, what the true capability of the marketed product or service in question is. Mental models like those described above might help contextualize and rationalize the hype as and when we see it.

Exploring Chaos and Bifurcation Diagrams in Python

In the study of nonlinear dynamical systems and chaos, one of the basic properties of systems we evaluate is period doubling, or bifurcation. As the parameters that describe system states change, the system can exhibit different modes of behaviour. Generally speaking, such systems exhibit a sensitive dependence on initial conditions, which is to say that with arbitrarily small changes in the initial conditions, we can get wildly different results in a system governed by very simple rules of iterative calculation.

In this notebook, I’m attempting to explore three different kinds of maps, and their associated period doubling behaviour, through bifurcation diagrams.

  1. Logistic map: A relatively well known function described by x_{n+1} = rx_n (1 - x_n), the logistic map‘s bifurcation diagram is plotted as the change in x_n with varying values of r. Since each value of x_n depends on the previous value in a nonlinear fashion, a sensitive dependence on initial conditions is exhibited. Depending on the starting value and the parameter r, the function exhibits smoothness initially, followed by suddenly occurring chaos (a simplified sketch of how such a diagram is computed follows this list).
  2. Circle map: Circle maps are associated with Arnold tongues and are described by the iteration \theta_{n+1} = \theta_n + \Omega - \frac{K}{2\pi}\sin(2\pi\theta_n) .
  3. Gauss iterated map: This is a nonlinear iterated map defined as  x_{n+1} = e^{- \alpha x_n^2 } + \beta , and is generally computed like the Logistic map.
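As an illustration of how such a diagram is computed, here is a simplified sketch for the logistic map; the parameter range, initial condition and iteration counts are illustrative rather than the notebook’s exact values:

```python
# A simplified sketch of a logistic map bifurcation diagram. Parameter range,
# initial condition and iteration counts are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

r_values = np.linspace(2.5, 4.0, 2000)    # growth parameter r
x = 0.5 * np.ones_like(r_values)          # same initial condition for every r

n_transient, n_keep = 500, 100
points_r, points_x = [], []

for _ in range(n_transient):              # discard transient behaviour
    x = r_values * x * (1 - x)
for _ in range(n_keep):                    # record the attractor for each r
    x = r_values * x * (1 - x)
    points_r.append(r_values.copy())
    points_x.append(x.copy())

plt.plot(np.concatenate(points_r), np.concatenate(points_x), ",k", alpha=0.25)
plt.xlabel("r"); plt.ylabel("x")
plt.title("Logistic map bifurcation diagram")
plt.show()
```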

Explore the github repository for more.

[Figure: Bifurcation diagram for a circle map – bifurcation patterns and “edge of chaos” phenomena in the circle map between k = 2.5 and k = 3.0]
[Figure: Circle map bifurcation diagram in k = (2.9, 3.0), showing alternating chaotic and non-chaotic behaviour]

Related Ideas and Links

  1. Steven Strogatz’s lectures on nonlinear dynamics and chaos (link)
  2. Pitchfork bifurcation
  3. Logistic differential equation
  4. Simple mathematical models with very complicated dynamics, Robert M. May (open access link)
  5. Feigenbaum constants
  6. Feigenbaum scaling in discrete dynamical systems by Keith Briggs (link)
  7. Mitchell Feigenbaum’s original paper on patterns in chaos (link)
  8. Learning resources for complex systems on Complexity Explorer; one specific introductory course may be helpful for those beginning to learn these subjects.

Exploring SVM Kernels

Support Vector Machines are a popular option for data scientists wanting to explore and model higher dimensional data. Despite their lack of scalability, they’re popular for prototyping different kinds of classifiers for systems with large numbers of variables. At the core of the SVM is the kernel function, which enables a mapping of the feature space to a higher dimensional feature space. Therefore, if we’re unable to find separability between classes in the (lower dimensional) feature space, we may be able to find a separating hyperplane in the higher dimensional space, which can then be used as a classifier.

[Figure: Two classes of data in the R^2 space]

In this Jupyter notebook I’ve explored a couple of different types of kernel functions for bivariate, two-class data, where an SVM is being used to separate out these classes. Since these classes are not linearly separable, the use of kernel functions here enables us to find the best possible hyperplanes that can solve the separability problem. What’s interesting to note is that the convex hulls (in this case, polygons) for these classes are overlapping in the 2D space. This is a clear indicator of a lack of linear separability.

[Figure: The blue polygon is the convex hull of one class, which is nested within the other class’s hull in this 2D representation]

The use of a kernel opens up the possibility of linear separability, since we add an additional spatial dimension along which these points get distributed. Specifically, two mappings are explored here – an explicit polynomial feature map, and a kernel function:

\phi(x_{1}, x_{2}) = (x_{1}^2, x_{1}x_{2}, x_{2}^2)^{T}

K(x_{1}, x_{2}) = a e^{-\frac{1}{b} \|x_{1} - x_{2}\|^{2}}

The latter is called the radial basis function kernel, or the RBF kernel. Visualizing this kernel for the data we’d generated gives us the following image. What’s easily visible here is the possibility of separating out the classes, thanks to the additional RBF dimension that has now been added.

[Figure: RBF (vertical axis) enables separation of the two classes, in blue and yellow; low opacity used for better visibility]
[Figure: Decision boundary identified by the SVM (which uses the RBF kernel)]

Upon training the SVM classifier, visualizing the results gives us the below plot. The thick grey line is the decision boundary that enables us to separate the originally linearly inseparable classes in the dataset.
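A compressed sketch of this workflow – synthetic two-class data, an RBF-kernel SVC from scikit-learn, and a contour of its decision function – might look like the following; the hyperparameters and data are illustrative and not the notebook’s exact code:

```python
# A compressed sketch of the workflow above: two non-linearly-separable
# classes, an RBF-kernel SVC, and its decision boundary. Hyperparameters
# and data are illustrative, not the notebook's exact values.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Evaluate the decision function on a grid to draw the boundary
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 300), np.linspace(-1.5, 1.5, 300))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)
plt.contour(xx, yy, zz, levels=[0], colors="grey", linewidths=2)  # decision boundary
plt.title("RBF-kernel SVC decision boundary")
plt.show()
```

The contour at level 0 is the decision boundary drawn back in the original 2D space, even though the separation is effectively being found in the kernel-induced space.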

There are other explorations I hope to do in this notebook in future – specifically, calculating the sign (class label) of a data point based on the Lagrangian, which brings us to the idea of the SVC as a maximal margin classifier, and to what is referred to as the dual problem of the SVM. For another post!

Statistical Competence and Its Importance for Good Data Science Careers

In 2019, enterprises routinely begin initiatives related to analytics, data science and machine learning that invoke specific technologies from a very early stage. This tendency to put technology ahead of value sometimes extends to the analytics champions and managers who take up or lead data-intensive initiatives. While this may seem pragmatic at one level, at another it may lead to significant problems in ensuring successful outcomes from such analytics initiatives and programs. In this post, I’ll address the three-pronged conundrum of statistical competence in the data science world, specifically in the context of data science consulting and services, and what it means for the careers of data science candidates now and in the future.

Hiring Statisticians: An Expert’s View

Kevin Gray is one of my connections on LinkedIn who posts insightful content on statistical analysis and related topics on a regular basis, including very good recommendations for books on various statistical and analytical techniques and methods. One of his recent posts was an article he’d authored titled “What to Look For in a Statistician” (the article, and my comment), which definitely resonated with my own experiences in hiring statistically competent engineers in different settings, such as data science and machine learning, between 2015 and today. In years past, I have had similar experiences when hiring competent product engineers and manufacturing engineers in data-intensive problem solving roles.

The importance of statistical thinking and statistical analysis in business problem solving cannot be overstated. However, even good advice that is canon and well-acknowledged often falls on deaf ears in the hyper-competitive data science job market. Both hiring managers and recruiters tend to emphasize keywords comprising the latest framework or approach over the ability to think critically about problem statements, carefully architect systems, and rigorously apply statistical analysis and machine learning to real world problems while keeping considerations of explainability in mind.

The Three-Pronged Conundrum of Data Science Talent

Now you might ask why I say this, and what I really mean by it. The devil, as they say, is in the details, and one essential problem with the broad proliferation of highly capable tools, frameworks and applications that can perform and automate statistical analysis of different kinds is the following three-pronged conundrum:

  1. Lack of core statistical knowledge despite having a working knowledge of the practicum of advanced techniques: Most candidates in the data science job market who are deeply interested in building data science and ML applications have unfortunately not developed skills in the core statistical sciences and statistical reasoning. Since statistics is the foundation for machine learning and data science, this degrades the quality of the projects and programs that have to rely on hiring such talent. When candidates prefer to let software do most or all of the thinking for them, their own reasoning is rarely good enough to critically evaluate different statistical formulations of a problem, because their familiarity with the tools leads them to think about problems in very set and specific ways.
  2. Tools as an unfortunate substitute for statistical thinking: Solutions, services and consulting professionals in the data science and advanced analytics space, who have to bring their best statistical thinking to client-facing interactions, are often unable to differentiate between competence in statistical thinking and competence in a specific software tool or approach.
  3. Model bloat and inexplicability: The use of heavy, general purpose approaches that rely on complex, less explainable models, rather than reliance on simpler models constructed upon a fuller understanding of the true dynamics of the problem.

These three sub-problems can derail even the best envisioned data science and machine learning initiatives in product / solution delivery firms, and in enterprises.

Some “Unsexy” Characteristics of Good Data Scientists

These are also not “sexy” problems – they’re earthy, multi-dimensional, real world problems that have many contributing factors, from business and how it is done, to the culture of education and the culture of software and solution development teams. Kevin Gray in his post touches upon attitudinal qualities for good statisticians, which could also be extended to data science leaders, data scientists and data engineers:

  1. Integrity and honesty are important in data science – this is true especially in a world where personal data is being handled carelessly and sometimes gratuitously by many applications without heed to data protection and privacy, and when user data is taken for granted by many technology companies. This is not an easy expectation or evaluation point for hiring managers, since it is only long association with anyone which allows us to build a model of their integrity, and rarely does one effectively determine such an attribute in short interviews. What’s dismal about data science hiring sometimes, is the proliferation of candidate resumes which are full of fluff, and the tendency of candidates to not stand up to scrutiny on skills they identify as “key” or “core” skills.
  2. Curiosity and a broad spectrum of interests – this cannot be overstated in the context of a consulting data science or machine learning expert. The more we’re aware of different mental models and theoretical frameworks of the world and the data we see in it, the better we’re able to reason starting from hypotheses about the data. By extension, we’re better able to identify the right statistical approaches for a problem when we start from and explore different such mental models. The book I’ve linked to here by Scott E. Page is a fantastic evaluation of different mental models. But with models come biases – to restate George E. P. Box’s famous line, “All models are wrong, some models are useful”.
  3. Checking for logical fallacies is key for data science reasoning – I would add to the critical thinking element mentioned in Kevin’s post, by saying that it behooves any thought leader such as a data science consultant to critically evaluate their own thinking by checking for logical fallacies. When overlooked, a benign piece of flawed reasoning can turn into a face-melting disaster. The best way to ensure this does not happen is to critically evaluate our ideas, notions and mental models.
  4. Don’t develop one hammer, develop a tool box – Like an experienced plumber, carpenter or mechanic, a data scientist today should not promote one technique with quasi-religious fervor at the cost of others, as deep learning has come to be promoted in some circles as a data science panacea. Instead, the effective data scientist is usually pragmatic in their approach. Like a tailor or carpenter who has to cut or join different materials with different instruments, data scientists today do not have the luxury of settling into one comfortable model of thinking about their tool set and profession – and any attempt to do so can be construed as laziness (especially for the consulting data scientist) at best. While the customer is always right, there are times when the client can be wrong, and it is at these times that they need the advice of a qualified statistician or data scientist. If there is one time when data scientists should not abandon their statistical thinking, it is this kind of situation.

Concluding Remarks

To conclude, data scientists ought not to be seen as resources that take data, analyze it using pre-built tools, and write code to explain the data using pre-built libraries of various kinds. They’re not software jockeys who happen to know some statistics and have a handle on machine learning workflows. Data scientists’ work scope and emphases as industry professionals and consultants go way beyond these limited definitions. Data scientists are expected to be dynamic, statistically sound professionals who critically evaluate real world problems based on theories, data and evidence drawn from many sources and contexts, and progressively build a deeper understanding of these real world problems that lead to tangible value for their customers, be they businesses or the consumers of products. The sooner data scientists realize this, the better off they will be while charting out a truly successful and fulfilling data science career.

Understanding the Logarithm Trick in Maximum Likelihood Estimation

Maximum Likelihood Estimation is a fundamental and powerful idea that’s at the centre of many things we do with data – so much so that we often use it without knowing it. MLE allows us to find the parameters of a model under which the data we have on our hands is most probable, i.e. the parameter values that let the model represent the data as closely as possible. This short post addresses the logarithm trick, which is used to simplify the MLE calculation.

There are two elements to understanding the formulation of MLE for the common Multivariate Gaussian model (which could be extended to other models equally):

  1. The i.i.d assumption that simplifies the MLE formulation
  2. The logarithm trick which enables solution of the MLE formulation

On this blog I’ve discussed topics like time series analysis in the past, where the idea of independent and identically distributed variables is addressed; being an important statistical topic, it is well explained and widely understood. The logarithm trick, however, is specific to the simplification and solution of MLE formulations, and is helpful to understand.

The logarithm changes the scale of whatever it is applied to, but because it is a strictly monotonic function, it does not change where maxima or minima occur. This is exactly what we need when transforming a likelihood: we are free to change its scale drastically, as long as the locations of its maxima – the parameter values we are after – are preserved.

When building a model p(x_1, \ldots, x_n | \theta) of the data (x_1, \ldots, x_n), the MLE formulation seeks to find the appropriate values of \theta such that

\nabla_{\theta} \prod_{i = 1}^{n} p(x_i | \theta ) = 0

The interesting thing about the log transform is, as noted above, that in the identity \ln ( \prod_{i} f_i ) = \sum_{i} \ln(f_i), there is no change in where the f_i (or their product) attain a maximum or a minimum when they are transformed to \ln(f_i). This logarithm trick lets us maximize a sum of log-densities rather than a product of densities – far simpler to differentiate and numerically far better behaved – and thereby carry out the MLE.
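As a tiny numerical illustration of why the trick matters in practice (invented data, and a univariate Gaussian rather than the multivariate case, for brevity): the raw product of densities underflows, while the log-likelihood stays well behaved and is maximized at the same \theta.

```python
# Illustration of the logarithm trick: the product of many small densities
# underflows, but the sum of log-densities is stable, and both are maximized
# at the same parameter value. Univariate Gaussian and invented data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=2_000)

mu_grid = np.linspace(2.0, 4.0, 401)

# Direct product of densities: underflows to zero for every candidate mu
likelihood = np.array([np.prod(norm.pdf(x, loc=mu, scale=1.0)) for mu in mu_grid])

# Sum of log-densities: numerically stable, and the argmax is unchanged
log_likelihood = np.array([np.sum(norm.logpdf(x, loc=mu, scale=1.0)) for mu in mu_grid])

print(likelihood.max())                       # 0.0 -- the product has underflowed
print(mu_grid[np.argmax(log_likelihood)])     # close to the true mean of 3.0
print(x.mean())                               # the closed-form MLE for mu
```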

Backpropagation and Gradient Descent

Sometimes the simple questions can be revelatory and make one think about the possibilities we have in front of us to improve existing processes, systems and methods.

On this occasion, it was a simple question on Quora about gradient descent and the ubiquitous backpropagation algorithm used in neural networks and deep learning. The content of my answer follows.

The process of computing weights and biases which minimize the error from a neural network could use any optimization algorithm that is good enough for the job. Gradient descent (especially the versions of the algorithm which use momentum and RMS propagation) is especially effective and has well implemented matrix algebra formulations in languages like C and Python, which is why it is used so often. Equally though, a genetic algorithm or simulated annealing algorithm (which are more complex and computationally intensive) may be used for finding such weights and biases on each iteration. Indeed, such methods have been and are being researched extensively.[1]

Backpropagation is defined by four equations that help calculate new weights and biases to update a neural network.[2]

The first of these equations helps calculate the error at the output layer. The second helps calculate the error in a given layer based on the error in the next layer. The third and fourth equations help calculate the rate of change of the loss function C with variations in the weights and biases.
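For reference, in the notation commonly used for these equations (layer errors \delta, weighted inputs z, activations a, elementwise product \odot – consistent with the cited source [2]), they are usually written as:

\delta^L = \nabla_a C \odot \sigma'(z^L)

\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)

\frac{\partial C}{\partial b_j^l} = \delta_j^l

\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l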

Therefore the algorithm itself can be written out as follows[3]:

  1. Input: set the input layer’s activations from the training example
  2. Feedforward: compute the weighted inputs and activations for each subsequent layer
  3. Output error: compute the error at the output layer
  4. Backpropagate: compute the error at each earlier layer from the error in the layer after it
  5. Output: read off the gradients of the cost with respect to every weight and bias

We then use the gradient of the cost function to compute the new values of w and b, based on things like the learning rate and regularization parameters as applicable.
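As a bare-bones illustration of that loop – a one-hidden-layer network with sigmoid activations and quadratic cost, on invented data, with arbitrary sizes and learning rate – the whole procedure fits in a few lines of NumPy:

```python
# Bare-bones backpropagation + gradient descent for a tiny one-hidden-layer
# network with sigmoid activations and quadratic cost. Data, layer sizes and
# the learning rate are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # inputs
y = (X[:, :1] * X[:, 1:2] > 0).astype(float)        # a simple nonlinear target

W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))
lr = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    # Feedforward
    z1 = X @ W1 + b1; a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)

    # Backpropagate the error (quadratic cost C = 0.5 * ||a2 - y||^2)
    delta2 = (a2 - y) * a2 * (1 - a2)                # output-layer error
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)         # error pushed back one layer

    # Gradient descent step on weights and biases
    W2 -= lr * (a1.T @ delta2) / len(X); b2 -= lr * delta2.mean(axis=0, keepdims=True)
    W1 -= lr * (X.T @ delta1) / len(X);  b1 -= lr * delta1.mean(axis=0, keepdims=True)

print("training accuracy:", ((a2 > 0.5) == (y > 0.5)).mean())
```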

Why gradient descent? Since the process of backpropagation is iterative (we go from steps 1 – 5 and back again), for each update, we can get better and better versions of the weights w and biases b that are able to reduce the error between the target and the result produced by the network. The following animation (source: Data Blog) probably gives you an idea (the red areas are higher values of Cost, and blue means lower values).

[Figure: A nice graphic illustrating gradient descent]

Now, you might ask: can’t other algorithms be used to do the same thing? The answer is indeed yes. We can use many other optimization algorithms (constrained and unconstrained ones, for convex and non-convex functions). If you would like to learn about convex optimization with a more theoretical treatment, consider this resource: Convex Optimization – Boyd and Vandenberghe. In addition to other convex optimization methods, there are scores of metaheuristic optimization methods such as:

  1. Genetic algorithms
  2. Particle Swarm Optimization
  3. Simulated Annealing
  4. Ant Colony Optimization

While some of these, especially GAs and PSOs, have been explored in the context of neural networks, common implementations of deep learning algorithms still rely on the gradient descent family of algorithms (such as Nesterov momentum – which has also come to be implemented in distributed settings – RMSProp and Adam).

Footnotes

[1] https://arxiv.org/pdf/1711.07655…

[2] Neural networks and deep learning

[3] Neural networks and deep learning

Natural Language AI: Architectural Considerations

If there is one area of AI that was closely watched by practitioners and researchers in 2018 and early 2019, it was the natural language processing space. Innovations in sequence modeling with deep neural networks, ranging from bidirectional LSTM networks to Google’s BERT and Microsoft’s MT-DNN, have improved capabilities such as language translation in a significant way. There are many more advancements in the field of deep learning which have been very well summarized by MIT researcher Lex Fridman in the talk below.

The State of the Art

[Video: Lex Fridman takes us through many deep learning developments in 2018, including BERT]

Given the presence of mature and increasingly sophisticated models for language translation, and improvements in language understanding, what many human-machine interface development teams may now be looking for is the right kind of architecture to bring these capabilities into their products. After all, it is only when these algorithms reach the customer in an actual translation or language understanding task that their value is realized.

It is evident from the MT-DNN paper by Microsoft Research that some core elements of the natural language processing tasks won’t change. For instance, look at the architecture diagram of the MT-DNN (Multi-tasking Deep Neural Network) below.

[Figure: MT-DNN architecture, from Microsoft’s research paper]

The feature vector X still has to be taken through all shared layers in any sentence / phrase based interaction, leading to the context embedding vectors we see at layer l2. It goes without saying that when we have such a shared architecture providing the underlying transformations, representations and word encodings, the subsequent deeper layers of the network can become more specialized – be it for pairwise similarity, classification or other use cases.
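To make the “shared layers plus specialized heads” idea concrete, here’s a toy Keras sketch – emphatically not MT-DNN itself; the vocabulary size, layer sizes and the two classification heads are invented for illustration:

```python
# A toy sketch of the "shared layers + task-specific heads" pattern, not MT-DNN.
# Vocabulary size, layer sizes and the two task heads are invented.
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN = 20_000, 64          # assumed values, for illustration only

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")

# Shared layers: a common embedding/encoding reused by every downstream task
x = layers.Embedding(VOCAB_SIZE, 128)(tokens)
x = layers.GlobalAveragePooling1D()(x)
shared = layers.Dense(128, activation="relu", name="shared_context")(x)

# Task-specific heads built on top of the shared representation
topic_head = layers.Dense(5, activation="softmax", name="topic")(shared)
sentiment_head = layers.Dense(1, activation="sigmoid", name="sentiment")(shared)

model = Model(inputs=tokens, outputs=[topic_head, sentiment_head])
model.compile(optimizer="adam",
              loss={"topic": "sparse_categorical_crossentropy",
                    "sentiment": "binary_crossentropy"})
model.summary()
```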

Similar Paradigms

The surprising thing is that this isn’t a new capability. It is rather analogous to the higher level representations learned by face recognition deep learning networks, or the higher order patterns learned by deep LSTM sequence classifiers.

[Figure: Face recognition DNNs and the features they learn (via a presentation on Slideshare, by Igor Trajkovski)]

One of the trends anticipated by Andrew Ng and other Deep Learning researchers some years ago is the arrival of end-to-end deep learning systems. In the past, there would have been a need for specific components across data pre-processing, feature engineering, machine learning or optimization, and perhaps a compositing layer which encompasses all these elements. This component-wise architecture can, given enough data, be replaced utterly by a deep learning network. Falling back on the mathematics behind the possibility of deep learning networks as universal function approximators (Hahn-Banach theorem et al, as shown here) provides another justification for such an end-to-end architecture for deep learning centric systems.

Natural language centric AI systems are, by definition, customer-centric. There are few use cases deep inside the woods of business processes that require this capability, and because of this context, such AI systems have to provide for online learning and the management of concept drift. Concept drift management is no easy task, and active research continues in this space (one example is here). Concept drift directly informs capabilities such as online learning, and although brute force methods exist for repeatedly re-running large scale training, there’s only so far that can go before a smarter approach is sought out.
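As a toy illustration of the online-learning side of this (not of drift detection itself), here’s a sketch that uses scikit-learn’s partial_fit to update a linear classifier incrementally as new, slowly drifting batches of synthetic data arrive:

```python
# A toy sketch of simple online learning: incrementally updating a linear
# classifier with partial_fit as new batches arrive. This is not drift
# detection -- it only shows the incremental update path that drift-handling
# strategies typically build on. All data is synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

for batch in range(10):
    # Synthetic stream in which the class-1 cluster slowly shifts over time
    X0 = rng.normal(loc=0.0, size=(100, 2))
    X1 = rng.normal(loc=2.0 + 0.2 * batch, size=(100, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)

    clf.partial_fit(X, y, classes=classes)   # incremental update on the newest batch
    print(f"batch {batch}: accuracy on this batch = {clf.score(X, y):.2f}")
```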

Architectural Considerations

Some architectural considerations for such end-to-end natural language centric AI applications therefore could be:

[Figure: Four architectural considerations for natural language centric AI systems]

Harmonization of data generation processes calls for unified user interfaces, or sub-layers in the user interfaces, which translate the end user’s intent. The manifestation of this intent may be different in different cases, depending on whether the interface is speech based, vision based or gesture based, for instance. However, intent inference and translation to a natural language paradigm could be a key capability, which enables a certain kind of taught interaction with AI systems.

We have seen already how common representational methods of input data can be a massive advantage for building numerous specializations on top of what was already available as a core capability. Modularity therefore becomes more important. In the presence of a common representational standard for input data, building specialized networks can become more straightforward, since a number of constraints begin to manifest themselves in any AI development life cycle. Finally, concept drift and its management become important considerations for the last-mile of the AI value delivery, at deployment time.

Conclusions

It should be realized that modern models such as BERT and MT-DNN provide very advanced language capabilities which can enable natural language interactions in a manner never before imagined. As we see intelligent systems leverage these models at large scale, we will probably also see the above architectural considerations – input harmonization, common representation, specialization and modularity, and online learning – become infused into the architecture of common AI systems.

What Could Data Scientists (And Data Science Managers) Be Doing Better in 2019?

The “data science” job description is becoming more and more common, as of early 2019.

Not only has the field garnered a great deal of interest from software developers, statisticians and machine learning exponents, but it has also attracted plenty of attention over the years from people in roles such as strategy, operations, sales and marketing. Product designers, manufacturing managers and customer service managers are also turning towards data science talent to help them make sense of their businesses and processes, and to find new ways to improve.

The Data Science Misinformation Challenge

The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common sense, reasonable starting points for finding data science talent and beginning analytical programs in organizations. The problem starts with the availability of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to the media’s fulminations around sometimes disconnected value propositions – chat bots, artificial intelligence agents, machine learning and big data – these terms have come to be clumped together with data science, purely because of a similarity of notion, or an overlap in the skills required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from calling automated machine learning “Building AI that can build AI” (NYT), to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, as I expect 2019 will also be, to some extent. I have dealt with this topic extensively in an earlier post here. What I take issue with, naturally, is the fact that this serves to misinform business teams about what’s really important.

Managing Data Science Better

Astute business leaders build analytical programs where they don’t put the cart before the horse. By this, I mean the following things:

  1. They have not merely a data strategy, but a strategy for labelled data
  2. They start with small problems, not big, all-encompassing problems
  3. They grow data science capabilities within the team
  4. They embrace visualization methods and question black box models
  5. They check for actual business value in data science projects
  6. They look for ways to deploy models, not merely build throw-away analyses

Data science and analytics managers ought not to:

  1. Perpetuate hype and spread misinformation without research
  2. Set expectations based on such hype around data science
  3. Assume solutions are possible without due consideration
  4. Fail to budget for subject matter experts
  5. Fail to train their staff and still expect better results

As obvious as the above may sound, they’re all too common in the industry. Then there is the problem of consultants who sometimes perpetuate the hype train, thereby reinforcing some of these behaviors.

Doing Data Science Better

Now let’s look at some of the things data scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others have to do with the tools and infrastructure provided to data scientists in companies. Some have to do with preferences, while others have to do with processes. I find many common practices by data science professionals to be problematic. Some of these are:

  1. Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general
  2. Not being aware of how some of the details of algorithms work – and not bothering to learn this even after several projects where their shortcomings are highlighted
  3. Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling (a minimal sketch of such EDA follows this list)
  4. Not visualizing data before attempting to build models from them
  5. Assuming things about the problem solving approach they should take, without basing this on EDA results
  6. Not differentiating between the unique characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient, compared to others
Some of these can be sorted out by asking critical questions such as the ones below (which may overlap to some extent with the activities listed above):

  1. Where the data came from
  2. How the data was measured
  3. Whether the data was meddled with in any way, and how
  4. How the insights will be consumed
  5. What user experience is required for the analytics consumer
  6. Whether the solution has to scale
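As a minimal example of what points 3 and 4 above look like in practice – with an invented dataset and standard pandas/matplotlib calls – basic EDA can be as simple as:

```python
# A minimal sketch of basic EDA before any modeling: shape, missing values,
# summary statistics, distributions, and pairwise relationships. The tiny
# dataset and column names here are invented purely for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pressure":    rng.normal(100, 15, 200),
    "temperature": rng.normal(350, 20, 200),
    "yield_pct":   rng.uniform(60, 95, 200),
})
df.loc[rng.choice(200, 10, replace=False), "pressure"] = np.nan  # a few missing values

print(df.shape)                      # how much data do we actually have?
print(df.isna().sum())               # where are the missing values?
print(df.describe())                 # ranges, means, obvious anomalies

df.hist(figsize=(9, 3))              # distributions of each column
plt.tight_layout(); plt.show()

pd.plotting.scatter_matrix(df, figsize=(7, 7))   # pairwise relationships
plt.show()
```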

This is just a select list, and I’m sure that contextually, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager) and to plan and structure data science work (as a data scientist).