What Could Data Scientists (And Data Science Managers) Be Doing Better in 2019?

As of early 2019, the “data science” job description is becoming more and more common.

The field has not only garnered a great deal of interest from software developers, statisticians and machine learning practitioners, but has also attracted attention over the years from people in roles such as strategy, operations, sales and marketing. Product designers, manufacturing managers and customer service managers are likewise turning to data science talent to help them make sense of their businesses and processes, and to find new ways to improve.

The Data Science Misinformation Challenge

The aforementioned motivations for people interested in data science aren’t inherently bad – in fact, they’re common-sense, reasonable starting points for finding data science talent and beginning analytical programs in organizations. The problem starts with the availability of sound, hype-free information on data science, analytics, machine learning and AI. Thanks to media hype around sometimes disconnected value propositions – chat bots, artificial intelligence agents, machine learning and big data – these terms have come to be clumped together with data science, purely because of a similarity of notion, or an overlap in the skills required to build and sell solutions along these lines. Media speculation around AI doesn’t stop there – from describing automated machine learning as “Building AI that can build AI” (NYT), to mentions of killer robots and killer cars, 2018 was a year full of hype and alarmism, and I expect 2019 will be too, to some extent. I have dealt with this topic extensively in an earlier post here. What I take issue with, naturally, is that this serves to misinform business teams about what’s really important.

Managing Data Science Better

Astute business leaders build analytical programs where they don’t put the cart before the horse. By this, I mean the following things:

  1. They have not merely a data strategy, but a strategy for labelled data
  2. They start with small problems, not big, all-encompassing problems
  3. They grow data science capabilities within the team
  4. They embrace visualization methods and question black box models
  5. They check for actual business value in data science projects
  6. They look for ways to deploy models, not merely build throw-away analyses

Data science and analytics managers ought not to:

  1. Perpetuate hype and spread misinformation without research
  2. Set expectations based on such hype around data science
  3. Assume solutions are possible without due consideration
  4. Fail to budget for subject matter experts
  5. Skip training their staff and still expect better results

As obvious as the above may sound, these mistakes are all too common in the industry. Then there is the problem of consultants who sometimes perpetuate the hype train, thereby reinforcing some of these behaviors.

Doing Data Science Better

Now let’s look at some of the things data scientists themselves could be doing better. Some of the points I make here have to do with the state of talent, while others have to do with the tools and infrastructure provided to data scientists in companies. Some have to do with preferences, others with processes. I find many common practices by data science professionals to be problematic. Some of these are:

  1. Incorrect assumption checking – for significance tests, for machine learning models and for other kinds of modeling in general (see the brief R sketch further below)
  2. Not being aware of how some of the details of algorithms work – and not bothering to learn this even after several projects where their shortcomings are highlighted
  3. Not bothering to perform basic or exploratory data analysis (EDA) before taking up any serious mathematical modeling
  4. Not visualizing data before attempting to build models from them
  5. Assuming things about the problem solving approach they should take, without basing this on EDA results
  6. Not differentiating between the unique characteristics that make certain algorithms or frameworks more computationally, statistically or otherwise efficient, compared to others

Many of these issues can be sorted out by asking critical questions such as the ones below (which may overlap to some extent with the practices listed above):

  1. Where the data came from
  2. How the data was measured
  3. Whether the data was altered in any way, and if so, how
  4. How the insights will be consumed
  5. What user experience is required for the analytics consumer
  6. Whether the solution has to scale
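
To illustrate the first point in the list of problematic practices, here is a minimal R sketch, on simulated data, of checking the normality assumption before reaching for a two-sample t-test, and falling back to a rank-based test when the assumption looks doubtful. The data and the 5% threshold are purely illustrative.

# Simulated data for illustration only
set.seed(100)
group_a <- rnorm(30, mean = 10, sd = 2)   # roughly normal sample
group_b <- rexp(30, rate = 0.1)           # clearly skewed sample

# Check the normality assumption before reaching for a t-test
p_a <- shapiro.test(group_a)$p.value
p_b <- shapiro.test(group_b)$p.value

# If both samples look plausibly normal, a t-test is reasonable;
# otherwise a rank-based test makes fewer assumptions
if (p_a > 0.05 && p_b > 0.05) {
  t.test(group_a, group_b)
} else {
  wilcox.test(group_a, group_b)
}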

This is just a select list, and I’m sure that contextually, there are many other problems, both technical and process-specific. Either way, there is a need to exercise caution before jumping headlong into data science initiatives (as a manager) and to plan and structure data science work (as a data scientist).

Pragmatic Business Transformation with AI

I interact with numerous data scientists and people in the data science space on LinkedIn on a daily basis, and many of them have insightful things to say about how data and artificial intelligence are transforming the business landscape. There is also a certain alarmism around the automation of business processes that accompanies nearly every discussion of artificial intelligence, and with good reason. One person whose posts often address pressing challenges in data and AI is Vin Vashishta. Here is a recent post by Vin and my comment. This blog post was originally on Medium, and is an expansion of the ideas in that comment.

Traditional Thinking Couches

Traditional thinking about how work gets done is based on scientific reductionism and paradigms such as linearity, and has the elements listed below. In truth, this thinking has allowed us to come very far. The division of labour is the very basis of capitalism, for instance, and modern capitalism thrives on specialization and the management of work in this form.

  1. Linearity: The tendency to think of all work as ultimately reducible into linearly scalable chunks. Less work requires fewer resources, whereas more work requires more resources. To be fair, this kind of thinking has been around for millennia, since at least the time of human settlement and the neolithic age.
  2. Reducibility: The tendency to think of work as infinitely divisible, such that if we complete each sub-task of a job in a certain sequence, we end up completing the whole job. Systems engineers know better, and understand holism and reductionism in systems as counterpoints to this traditional view of reducibility and to how it might affect the way we see work today.
  3. Value-based Work and Tangibility: Another element of the traditional definition of work is the presence of tangible objectives, such as items shipped, or certain unambiguously measurable criteria met. In this world, giving a customer a good experience when they shop, or enabling customers and partners to be served better (or to serve us better), isn’t seen as value, but as non-value-added activity. For a long time, approaches to business transformation focused on removing non-value-add activities from business processes, with the view that this would improve process efficiency.

When we think about how businesses will take up AI and machine learning capabilities, we’re compelled to think through these same lenses. They’re comfortable couches that we cannot get out of, and as a result, they possess and dominate our thinking about AI deployment in enterprises.

AI-Specific Cognitive Biases

Some dangers of thinking driven by the above principles are as follows:

  1. Zero-sum automation: The belief that there is a fixed pie of opportunity, and that when we give human jobs to machines, we deprive humans of opportunities. Naturally, this is not true, because general, self-organizing intelligences such as humans are more than capable of discovering and creating new opportunities. Fixed-pie thinking is probably one of the key reasons behind AI alarmism. I would additionally argue that, at some level, AI alarmism is also the result of bogeyman thinking, a paradigm in which a strawman such as AI is assigned blame for large-scale change. In the past, a lot of technological progress and change happened without such bogeymen, even as other changes were being prevented because of such thinking. Another element of bogeyman thinking is the tendency to ignore complementarity, including situations where humans and AI tools could work alongside each other, resulting in higher process effectiveness.
  2. Value bias: While there is truth to the notion that processes have value-add steps and non-value-add steps, it is a feature typical of reductionism to assume that we don’t need the non-value-add steps at all, while they may be serving true purpose. For instance, all manufacturing processes that transform raw material to product have ended up requiring quality checks and assurance. As a feature of the evolution of industrial production processes, quality assurance and control have become part of nearly all manufacturing processes that operate at scale. QA and QC represent a non-linearity in the production system, or a feedback loop which provides downstream process performance information to upstream processes.
  3. Exclusivity: A flip side of bogeyman thinking, combined with value bias, is the phenomenon of exclusivity. For example, interpreting emotional expressions on a human face has long been a task that humans are great at — for a long time, we didn’t know of any higher animals, let alone technologies, with this level of sophistication. Now there’s a lot going on in the ML/AI space that has to do with the so-called soft aspects of human life — judging people’s expressions and understanding them, learning about their behavioural patterns, and so on — and these capabilities are maturing within AI systems on a regular basis. This contradicts traditional notions of human-exclusive capability in many areas. Naturally, it is seen as a threat, rather than as a capability enhancer. The truth is that exclusivity should also be considered a logical fallacy when discussing the development of AI systems.

It is common to fear someone who seems able to do everything that we can do, until that person becomes a friend. I’d say the jury is still out on what AI cannot do yet — and as a result, our approach to business transformation (as with transformation in other areas) should be humans + AI, not AI in lieu of humans. This synergy is already visible in the manufacturing world, and perhaps we will see it make its way to other spheres as well. Fixed-pie thinking won’t get us anywhere when we have capability amplifiers like AI to assist humans.

Concluding Remarks

A key element of future human productivity is the discovery and exploitation of new opportunities in new frontiers. My suggestion to business leaders thinking about AI adoption for automation and process improvement, is to expand the pie first, by creating new opportunities to do more as a business, and enable your employees to take up and contribute more to your business. When you then enable them with AI, the humans+AI combination you will see as a result will take your organization to new heights.

Achieving Explainability and Simplicity in Data Science Work

This post stems from a few of the tweets I’d authored recently (Over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored at the data scientist’s peril. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors being used, is akin to “data science suicide” – a sure-shot road to perdition as a data scientist. Domain knowledge is also hard to acquire, especially for data scientists working as consultants and applying their skills in a consultative, short-term setting. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering set-up or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects
  4. Strive for useful models, not more complex models. Data scientists ignore the hypotheses that come from such discussions at their own peril. Hypotheses are the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and this couldn’t be more true than when dealing with models built from well-formed hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, there is a constant debate about the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge and test enough hypotheses. With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust. (A small R sketch of such a model follows this list.)
  6. Careful feature engineering, done with human supervision and care, may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and RoboticDataScience are often discussed in the context of machine intelligence and of speeding up insight generation from data. However, for some applications, it may be better in the short term to ensure that feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer-term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning (justifiably) has seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for all data analysis. The end goal of working with data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.
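
To make point 5 concrete, here is a small illustrative R sketch (using the built-in mtcars data rather than any customer data) of the kind of simple, explainable model I have in mind: a logistic regression with a handful of features whose coefficients and odds ratios can be read and discussed directly with stakeholders, unlike a black box model.

# A deliberately simple, explainable model (illustrative only)
data(mtcars)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficients and odds ratios are directly interpretable, which also
# makes the data requirements easy to explain and justify to customers
summary(fit)
exp(coef(fit))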

I’m not discussing the many technical aspects of building explainable models here. These aspects are contextual and depend on the situation, and the tone of this post (and of the tweets) is deliberately lighter, to encourage discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you like something on this post, or want to share any other related insights, do drop a comment, or tweet to me at @rexplorations or message me at LinkedIn.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to describe the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I found Michael Koelbl’s answer to “What are all the different types of data scientists?” quite interesting, and thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis, on reporting built around new measures, graphs and charts, and on asking questions based on these analyses. They’re excellent at storytelling, asking questions of the data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, in which they try to build ML models of one kind or another from the data. They’re not statistically savvy, but they understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. They span the range from BI tools to hand-coded scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Perhaps statisticians are the rarest breed of the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and DOE, to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists who are more focused on building data management systems, pipelines for the implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and other aspects of the infrastructure, but not so much with the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Why Do I Love Data Science?

This is a really interesting question for me, because I genuinely enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a starting point such as a visualization for the hypothesis-generation process lets us go into model building a little more confidently. A specific example is a time series analysis I did on energy system data, where aggregate statistical measures and distribution fitting produced only arbitrary, hard-to-interpret patterns. Using time-ordered visualizations helped me formulate the hypothesis correctly, and allowed me to build an explanatory model of the system.
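
As a simplified, synthetic illustration (not the energy data itself) of why time ordering matters: the histogram below hides a drift and a daily-style cycle that the time-ordered plot makes obvious.

# Synthetic example: a histogram hides what a time-ordered plot reveals
set.seed(7)
t <- 1:500
load_kw <- 50 + 0.05 * t + 5 * sin(2 * pi * t / 48) + rnorm(500, sd = 2)

par(mfrow = c(1, 2))
hist(load_kw, breaks = 30, main = "Aggregate view", xlab = "Load (kW)")
plot(t, load_kw, type = "l", main = "Time-ordered view",
     xlab = "Time index", ylab = "Load (kW)")
par(mfrow = c(1, 1))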

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns, and exceptions to those patterns. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of transactions between customers and a client’s business. These logs contained unusual patterns that people steeped in the process could sense but couldn’t quantify. By establishing typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as “feature engineering” in data science and machine learning, was the key step. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
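
Here is a hedged sketch of what such session-level feature engineering might look like in R with dplyr. The logs data frame and its column names (customer_id, session_id, timestamp, amount) are hypothetical placeholders, not the actual schema from that project.

# Hypothetical log schema; data frame and column names are illustrative only
library(dplyr)

session_features <- logs %>%
  group_by(customer_id, session_id) %>%
  summarise(n_events   = n(),
            total_amt  = sum(amount),
            duration_s = as.numeric(max(timestamp) - min(timestamp)),
            .groups    = "drop")

# Flag customers whose session totals sit unusually far from the norm
cust <- session_features %>%
  group_by(customer_id) %>%
  summarise(mean_amt = mean(total_amt), .groups = "drop") %>%
  mutate(z = (mean_amt - mean(mean_amt)) / sd(mean_amt))

filter(cust, abs(z) > 3)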

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then keep refining it until it can help your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in and present interesting possibilities during the analysis. Sometimes we answer these questions by finding and including additional data; at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you already have, and to build an analysis on top of another analysis – with the result that, after a while, you’ve composed different models together into something that gives you completely new insights.

Concluding Remarks

All three patterns are exhilarating to observe for data scientists, especially those who are deeply involved in reasoning about the data. A good indication that you’ve done well in data analysis is that you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.

Key Data and AI trends in 2017

This year, 2017, has been quite a busy year for artificial intelligence and data science professionals. In some ways, this is the year when AI truly began to be debated and discussed, from frameworks and technologies to ethics and morality. This is the year when opportunities for AI-driven improvement in businesses began to be examined critically by diverse industry professionals and academics. With good reason, machine learning and deep learning came to be placed at the top of Gartner’s hype cycle. We’re really at the peak of inflated expectations when it comes to ML/DL – with opportunities to shorten the time it takes to reach measurable and direct consumer value.


Gartner Hype Cycle for 2017

Overall, in my experience, three key trends that enterprises welcomed in 2017 include:

  1. Simplification of cloud and data infrastructure services
  2. Improved and democratized scalable machine learning and deep learning
  3. Automation in key AI, ML and data analysis tasks

Improving Cloud and Data Infrastructure

Perhaps the foundational enabler for the data strategy of many enterprises that I have seen and worked with in 2017 is the availability of an easily operated and managed, scalable cloud infrastructure. This promise of high-performance, low-cost and (arbitrarily) scalable cloud infrastructure was made as early as 2014, but has taken a few years to materialize as a truly viable commercial offering from stable, top-tier technology firms. Prominent cloud vendors such as Google Cloud, Microsoft Azure and Amazon’s AWS have upped the ante, while veterans like Hortonworks and Cloudera continue to hold sway. This space is ripe for consolidation, in my view, although we can expect to see converging architectures before consolidation that isn’t entirely wasteful can happen.

Other notable developments on the cloud infrastructure side were serverless compute (which enterprises are definitely warming up to – and it shows in the Gartner hype cycle), production-ready pre-built models for common tasks exposed as APIs (a trend that continues to shape software and AI application architecture), and increasingly performant streaming and real-time data processing frameworks. By combining these capabilities, cloud providers have really upped their offerings in 2017 compared to before, and now provide formidable capabilities – which, in my view, haven’t been explored by businesses as much as they should have been.

Despite the availability of such production-ready, cost-effective and scalable data management systems in the cloud, cloud infrastructure nevertheless came under scrutiny in 2017 for massive security lapses and downtime. To speak of specific examples, the Equifax data breach and the massive AWS outage were among the biggest-impact events in cloud reliability and data security history, to say nothing of the numerous smaller data security episodes attributable to hacktivism, such as the Panama Papers.

In response to some of these incidents, and to the rise of the GDPR and other data protection regulations, numerous cloud providers have been offering “private cloud” solutions, along with region-specific hosting options for banks and other organizations that deal with regulation-sensitive data.

Additionally, it would be unfair not to point out how much containerization has helped cloud providers in 2017. Massive-scale adoption of containerization using Docker and Kubernetes has enabled virtual environments to be set up and managed for complex, data-intensive development and deployment tasks.

Spark and Tensorflow

The space of scalable machine learning frameworks continues to be dominated by Apache Spark, which has found many friends among data engineers and data scientists in production, especially after the 2.0 release, given the comparable performance of its data frame APIs across languages. So whether you program in Python, R or Scala, you can expect much the same high performance from Spark these days. The DataFrame-based Spark ML API has expanded on the capabilities of the older Spark MLlib, and in recent releases Spark has also polished and unified the interfaces for streaming data analysis (Spark Streaming) and graph analysis (GraphX). As someone who has seen teams use Spark for different purposes and build frameworks on it in 2017, the differences between versions 1.6 and below and 2.0 and above are significant, and the newer versions are more polished and consistent in their behaviour.
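
For R users, here is a minimal sketch of what working with Spark data frames looks like through the sparklyr interface. It assumes a local Spark installation is available; the built-in mtcars data and the linear model are placeholders, not a recommended workload.

# Minimal sparklyr sketch; assumes Spark is installed and available locally
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL behind the scenes
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# A Spark ML model trained on the Spark data frame
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)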

Tensorflow received a lot of hype but only lackluster adoption in late 2016 and early 2017; over the last several months, however, it has made a strong case for itself, and adoption has grown significantly. As developers have warmed up to the framework, and as more language interfaces have been developed for it, its popularity has soared, especially in the latter half of 2017. Another factor in the development and adoption of Tensorflow is the widespread use of GPU-based deep learning. The core Tensorflow development team’s additions to 1.0 (as explained by Jeff Dean here) have made it a mature deep learning development package and perhaps the most widely used and sought-after deep learning framework. While Torch makes an impression and is widely loved (especially in its PyTorch form), Tensorflow is hard to beat for the speed and dynamism of its high-quality open source contributors. At Strata Singapore 2016, I sat through a tutorial on Tensorflow 0.8, and what I saw then contrasts sharply with what I see in versions 1.0 and higher. My recent brushes with Tensorflow have convinced me that this is the framework for deep learning developers to learn at the moment. The presence of wrappers and higher-level interfaces such as Keras has made Tensorflow much easier to use for entry-level and intermediate programmers and data scientists.
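
For completeness, here is roughly what that higher-level Keras interface looks like from R, via the keras package driving a Tensorflow backend. This is a hedged sketch with placeholder layer sizes and data shapes, not production code, and it assumes the keras package and a Tensorflow installation are available.

# Sketch of a small feed-forward network via the keras R interface
# (assumes the keras package and a Tensorflow backend are installed)
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(20)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)

# x_train: a numeric matrix with 20 columns; y_train: a 0/1 vector (placeholders)
# model %>% fit(x_train, y_train, epochs = 10, batch_size = 32)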

Automation in ML, DL and Data Science

Without a doubt, the development of techniques to automate parts of ML and DL development was one of the biggest and most important directions within the field of artificial intelligence in 2017. Taking after Leo Breiman’s random forests (an ensemble of “weak learners” that results in a machine learning model with high performance) and various advances in deep learning and machine vision (especially convolutional neural networks, which essentially compose complex features from simpler features in computer vision problems), the automation of hyper-parameter optimization was probably the first step in the general direction of automated machine learning.

Frameworks like AutoML (see Andreas Mueller’s talk on the topic) have been the cynosure of this kind of research, and companies small and large have begun attempting different approaches to the context-modeling problems that arise from the need to automate data science. While most approaches have taken a classical route, finding computational ways to learn more and more from data, some have taken non-traditional approaches, combining ideas from expert systems, rule-based inference engines and other techniques. A novel development has been the invention of generative adversarial networks (GANs), which could lead to hitherto unseen improvements in the use of computationally generated data as a starting point for learning good representations of a given dataset. Despite being invented in 2014, it is in 2017 that implementations of this kind of network became popular and came to be considered a viable neural network architecture for computer vision and other kinds of machine learning problems.
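
At its simplest, hyper-parameter optimization is just a structured search. The R sketch below runs a naive grid search over two random forest hyper-parameters on the built-in iris data; this is exactly the kind of loop that AutoML-style tooling automates and makes far smarter, and it is illustrative rather than a recommended tuning recipe.

# Naive grid search over random forest hyper-parameters (illustrative)
library(randomForest)
set.seed(42)

grid <- expand.grid(mtry = 1:4, ntree = c(100, 300, 500))
grid$oob_error <- NA

for (i in seq_len(nrow(grid))) {
  fit <- randomForest(Species ~ ., data = iris,
                      mtry = grid$mtry[i], ntree = grid$ntree[i])
  # Out-of-bag error after the final tree, used as the tuning criterion
  grid$oob_error[i] <- fit$err.rate[fit$ntree, "OOB"]
}

# Best hyper-parameter combination by out-of-bag error
grid[which.min(grid$oob_error), ]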

Other noteworthy trends within the data and AI space include the rise and improved performance of chat bots and conversational natural-language enabled APIs, the amazing improvements to translation and image tagging made possible by deep learning, and the important question of AI ethics – starting from that now-famous question of “should your self-driving car kill a pedestrian in order to save your life”, to ethical conundrums and alarmist remarks from tech luminaries such as Elon Musk.

Concluding Remarks

So, what does 2018 hold in store? That seems to be the question on everyone’s lips in the data and AI world, and it is also what data and AI enthusiasts in different industry roles are looking to understand. While it is not possible to clearly say which trend will dictate progress in 2018 and beyond, it is clear that the above three developments will form key cornerstones on top of which future capabilities for AI and enterprise scale data management and data science will be built. Hope you enjoyed reading this. Do leave a comment or a note if you would like to share more.

Andrew Ng’s DeepLearning.AI (Coursera) Certification


One of the more interesting mental models of machine learning I’ve come to understand in the last month or so, is the “five tribes of artificial intelligence” model popularized in “The Master Algorithm” by Pedro Domingos. To summarize in a phrase, the master algorithm is that approach which can uncover all possible insight from data – and Prof. Domingos hypothesises that there are five distinct such “master algorithms”, one for each of these tribes. One of these “tribes” is the connectionists, whose master algorithm is, in fact, backpropagation, which is central to the design and operation of neural networks.

A Connectionist Tour Guide

In a sense, the deep neural network has become synonymous with artificial intelligence today. There are numerous other algorithms that could lend a sense of intelligence to machines – whether by communicating in natural language as a conversationalist (from rudimentary bots like ELIZA, through Pootwattle and Smedley of U Chicago fame, to modern chatbots), by learning to distinguish different faces, or by identifying emotions of specific kinds. The deep neural network has successfully been applied to numerous such real-world problems, and therefore stands out as promising on this account. For the other tribes, we don’t yet have algorithms such as “advanced induction inference machines” or “higher dimensional kernel machines” – whatever these may indicate (really or apocryphally). So it behooves us to pay attention to stories such as this one, which discuss the “unreasonable effectiveness” of neural networks.


DeepLearning.AI’s Course

There’s definitely a skills gap in the advanced machine learning and artificial intelligence space, and businesses are as yet unable to see value beyond the hype. Unsurprisingly, the skills gap has to be addressed at the very root – the fundamentals, where the ability to model problems, solve them computationally, and build systems out of such solutions intersect. Andrew Ng has, also unsurprisingly, taken a stab at addressing this gap in the deep learning space, if his “AI is the new electricity” talk is anything to go by.


Over the last few weeks, I’ve had the opportunity to spend some time on Andrew Ng’s Deep Learning course from DeepLearning.ai. For me, this is like a tour guide to the world of the connectionists. The reality is that neural networks don’t work like the human brain, superficial similarities aside – as Ng himself explains in the course – but the term has stuck, since the motivations of early pioneers who also knew some neuroscience led to the moniker.

The Coursera certification is organized into five courses, and the first of these lays the mathematical and programmatic foundation for implementing neural networks. This first course, titled Neural Networks and Deep Learning, has well-orchestrated exercises within Coursera’s integrated Jupyter notebook interface, and you can run the algorithms you build on your own data to evaluate their performance. I’m currently some way through the second course, having finished the first one – and I have to say that the videos, programming exercises and other course elements create a true learning feedback loop, which teaches the basics really well. I’m very impressed with the way the course has been put together and made accessible to those with a little machine learning knowledge who are starting out on neural networks and deep learning.

Course Experience

In the below section, I’ll outline my key learnings from the first course in the certification. I hope that you take the course, if you are a ML and AI enthusiast or young professional (or even an experienced one) interested in working on deep learning.

  1. The course introduced the most fundamental ideas of neural networks at the very start, with extensive coverage of how to implement a logistic regression model for classifying data. This initial discussion was built up rather nicely into a discussion of deep learning.
  2. As an intermediate course, it assumes some knowledge of linear algebra and differential calculus. As someone who works with machine learning models, I was able to grasp the intuitions with one repetition. If it has been a while since you worked through linear algebra and calculus (or thought through equations, at the very least), expect to take a while to find your feet.
  3. Some of the intuitions around gradient descent, the values of derivatives, and so on, were introduced very handily – and were reinforced through the exercises.
  4. The importance of vectorization, and its central role in numpy (which is used extensively – nay, almost exclusively – throughout the course), is well brought out. Numpy is a powerful library which, surprisingly, received its first funding only in 2017, after years of underpinning numerous algorithms and tools. Some of its quirks, such as arrays of shape (n,), were especially interesting and useful to learn about. Overall, though this isn’t a numpy tutorial by any stretch, numpy is referenced extensively.
  5. During weeks 2 and 3, the logistic regression algorithm is presented in a different context – it is likened to a neuron in a deep network, and the details of activation functions are discussed. This, to me, was the meat of the course.
  6. In weeks 2 and 3, a consistent methodology and notation is followed for the discussion and implementation of forward and backward propagation, two of the key mechanisms in any neural network – and this is done entirely within numpy, making for great hands-on lessons (a minimal R analogue follows this list). Stochastic gradient descent was also explained and implemented.
  7. Finally, in week 4, deep neural networks were handled, and parametrization of the neural network topology was introduced. Related ideas, such as hyperparameter optimization, were also discussed. Additionally, in both videos and assignments, Andrew Ng provided practical advice on how to get the matrix dimensions right for weight and bias vectors – without this and the consistent notation, a lot of the programming implementations of DNNs could get very hairy, so I personally felt this was very well handled.
  8. A cat classifier deep neural network in Week 4 – because who doesn’t like cats?
  9. Right through the course, there are optional video lectures, and interviews with well known researchers. One of them is with Geoff Hinton, and it was definitely instructive.


I’m about half-way through the second course, on Improving Deep Neural Networks, and my experience there has been similar. The content derives directly from the first course, so going through them in sequence definitely has its advantages; if you were to start with the second course of the specialization, expect to spend some time finding your feet. So far, my only wish is for better explanations of ideas like dropout and L2 regularization, especially given the tricky quizzes in Week 1. This is a 3-week course, and an additional week, or a few more videos early on, firming up the ideas around regularization would have helped. The exploding/vanishing gradient problems could also be better illustrated, although the course generally does a good job of explaining the essentials of these ideas.
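
Since the course treats it only briefly, it may help to note that the core idea of L2 regularization is small enough to sketch: add a weight-decay term to the gradient so that large weights are penalized. Extending the illustrative logistic regression sketch from earlier (again, purely a sketch; the lambda value and the convention of scaling the penalty by the sample size are arbitrary choices):

# L2 regularization as weight decay (illustrative sketch)
l2_gradient <- function(X, y, w, lambda = 0.1) {
  p    <- 1 / (1 + exp(-(X %*% w)))        # forward pass
  grad <- t(X) %*% (p - y) / nrow(X)       # unregularized gradient
  grad + lambda * c(0, w[-1]) / nrow(X)    # weight decay; intercept not penalized
}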

Concluding Remarks

To conclude, I’d recommend this certificate for those in the analytics, data science or machine learning space, who are a bit hands on, can grasp linear algebra and calculus, and can work with Python. You’ll find that since this is an “intermediate” specialization, neophytes will require multiple viewings of the videos to become conversant in the ideas and concepts. This still shouldn’t deter those who want to audit the course or learn the concepts therein for a deeper understanding to back up their direct experience in machine learning.

Related Content

  1. My Quora answer on Deeplearning.AI’s Coursera course

The Expert System Anachronism in the Data Science and AI Divergence

The data science and big data buzzwords have been bandied about for years now, and artificial intelligence has been talked about for decades; the two fields are irrevocably inter-related and interdependent.

For one thing, the wide interest in data science started just as we were beginning to leverage distributed data storage and computation technologies – which allowed companies to “scale out” storage and computation, rather than “scale up”. Companies could therefore buy numerous run-of-the-mill computers (rather than smaller numbers of extremely expensive, high-end machines) and potentially turn their data collection activities into something useful to the enterprise.

Let’s not forget, though, that the point of such exercises is to actually get some business value at the end. There’s virtually no business case for collecting and storing huge amounts of data (with or without structure) if we don’t have a plan to use that data to make better business decisions, or to otherwise impact the business or customers positively. IT managers across industries have therefore struggled to make sense of the big data space: how much to invest, what to invest in, and how to make sense of it all.

Technology companies are only too happy to sell companies the latest and greatest data science and data management frameworks and solutions, but how can companies actually use these solutions and tools to make a difference to their business? This challenge for executives isn’t going away with the advent of AI.

Artificial Intelligence (AI) has a long and storied history, and has been the subject of debate, discussion and chronicle over several decades. Geoff Hinton, the AI pioneer, has a pretty comprehensive description of various historical aspects of AI here. Starting from Geoff Hinton’s research, pioneering work in recent years by Yann LeCun, Andrej Karpathy and others has enabled AI to be considered seriously by organizations as a force multiplier, just as they considered data science a force multiplier for decision-making activities. The focus of all these researchers is on general-purpose machine intelligence, specifically neural networks. While the “deep learning” buzzword has caught on of late, it is fundamentally no different from a complex neural network and what such a network can do.

That said, AI in the form of deep learning differs vastly in capability from the algorithms data scientists and data mining engineers have used for more than a decade now. By adding many layers, constructing complex topologies in these neural networks, and iteratively training them on large amounts of data, we’ve progressed along multiple quantitative axes (complexity, number of layers, amount of training data, etc.) to get not merely quantitatively, but qualitatively better AI performance. Recent studies at Google show that image captioning, often considered a hard problem for AI, is now at near-human levels of accuracy. Microsoft famously announced that its speech-to-text and translation engines have improved dramatically because of the use of these techniques.

It is this vastly improved capability of AI, and the elimination of the human (forever present in the data science activity loop) from even the analysis and design of these neural networks (generative adversarial networks being a case in point), that makes the divergence between data science and AI vivid and distinct. AI seems to be headed in the direction of general intelligence, whereas data science approaches and methods have constituted human-in-the-loop approaches to making sense of data. The key value the human added in this data science context was “domain” – and I have discussed the importance of domain in data science extensively in an earlier post – but this too is increasingly being supplanted by efficient AI, provided that the data collection process for training data, and the training and topological aspects of the networks (the hyperparameters), are well enough defined. This supplanting of the human domain perspective by machine-learned domain features is precisely what will enable AI to develop and become a key force to reckon with in industry.

Therefore I venture that the “anachronism” in the title of this post is the domain-based model of systems, or intelligent systems, called the expert system. Expert system design is an old problem that had its heyday and apparently disappeared into the mist of technological obsolescence – and it is this kind of expert system design problem that AI methods will be so good at solving, to the point that they can replace humans in key tasks and become a true general intelligence. Expert systems were how the earliest AI researchers imagined machine intelligence would be useful to humanity. However, their understanding was limited to rule-based expert systems. While the overall idea of the expert system is still relevant in many domains – so much so that, in a sense, we have expert systems all around us – it is undeniable that the advent of AI will enable expert systems to develop and evolve once again, but without the rule-based approaches of the past, and with the inductive learning that is apparent in deep learning and machine learning methods.

Quora Data Science Answers Roundup

I’m given to spurts of activity on Quora. Over the past year, I’ve had the opportunity to answer several questions there on the topics of data science, big data and data engineering.

Some answers here are career-specific, while others are of a technical nature. Then there are interesting and nuanced questions that are always a pleasure to answer. Earlier this week I received a pleasant message from the Quora staff, who have designated me a Quora Top Writer for 2017. This is exciting, of course, as I’ve been focused largely on questions around data science, data analytics, hobbies like aviation and technology, past work such as in mechanical engineering, and a few other topics of a general nature on Quora.

Below, I’ve put together a list of the answers that I enjoyed writing. These answers have been written keeping a layperson audience in mind, for the most part, unless the question itself seemed to indicate a level of subject matter knowledge. If you like any of these answers (or think they can be improved), leave a comment or thanks (preferably directly on the Quora answer) and I’ll take a look! 🙂

Happy Quora surfing!

Disclaimer: None of my content or answers on Quora reflect my employer’s views. My content on Quora is meant for a layperson audience, and is not to be taken as an official recommendation or solicitation of any kind.

Simple Outlier Detection in R

Outliers are points in a data set that lie far away from the estimated centre of the data set. This estimated centre could be either the mean or the median, depending on what kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” in a data set, and therefore hold importance. Outlier detection matters for machine learning algorithms of any sophistication: because outliers can throw off a learning algorithm or undermine an assumption about the data set, we have to be able to identify and explain them when the need arises. I’ll only cover the basic R commands for outlier detection here, but it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).

Graphical Approaches to Outlier Detection

Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.

y <-read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")

The Shapiro-Wilk test used above checks for the normality of a data set; normality assumptions underlie the outlier detection hypothesis tests that follow. In this case, with p-values of 0.365 and 0.399 respectively, and sample sizes of 30 and 1000, both samples y and ylarge appear consistent with a normal distribution.

 

Box plot of y (no real outliers observed as per graph)

Boxplot of ylarge – a few outlier points seem to be present in graph

Histogram of y

Histogram of ylarge

Normal QQ Plot of y

Normal QQ Plot of ylarge

 

The graphical analysis tells us that there could be outliers in ylarge, the larger of the two data sets. The normal probability plots also indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.

Dixon and Chi Squared Tests for Outliers

The Dixon test and the Chi-squared test for outliers (PDF) are statistical hypothesis tests used to detect outliers in given samples. Bear in mind, though, that this Chi-squared test for outliers is very different from the better known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for 30 points or fewer. The Chi-squared test, on the other hand, makes variance assumptions, and is not sensitive to mild outliers if the variance isn’t specified as an argument. Let’s see how these tests can be used for outlier detection.

library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
dixon.test(ylarge[,2],opposite = TRUE)
dixon.test(ylarge[,2],opposite = FALSE)


#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)

In each of the Dixon and Chi-squared tests for outliers above, we’ve run the test twice, with the argument opposite set to TRUE and FALSE in turn. This argument lets us choose between testing the lowest extreme value and the highest extreme value, since outliers can lie on either side of the data set.

Sample output is below, from one of the tests.

> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

	Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier


When you closely observe the p-values of these tests alone, you can see the following results:

P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04

The p-values here (taken at an indicative 5% significance level) suggest that the extreme values in ylarge may be outliers. This may or may not be true, of course, since in inferential statistics we always state the chance of error. In this case, we can conclude that there is only a very small chance that the extreme values we see in ylarge are actually typical of that data set.

Concluding Remarks

We’ve seen the graphical outlier detection approaches, as well as the Dixon and Chi-squared tests. The Dixon test isn’t applicable to large data sets, for which we need to use the Chi-squared test for outliers and other tests. In machine learning problems, we often have to be able to explain some of the values – from a training perspective for neural networks, or when dealing with lower-resolution models such as least squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand their underlying causes. Continual improvement professionals generally regard outliers as important: extreme results (extremely good ones and extremely poor ones) are exciting in process excellence and continuous improvement, because they could represent benchmark cases or worst-case scenarios. Either way, outlier detection is an immensely useful activity, applicable to many different statistical situations in business.