Questions that Data Scientists Hate Getting

This is a variation on a Quora answer.

When asked how data scientists can be effective, a few things come to mind:

  1. Skills: A curiosity and sufficient skill in data analysis methods and techniques
  2. Fundamental needs: the data and access to the tools to perform analysis, including the working environments in which analysis happens
  3. Performance needs: Sufficient resources, time and good enough processes to validate or invalidate hypotheses and build models based on them
  4. Excitement needs: Sufficient support and latitude to independently deploy projects based on successful hypotheses tested and models built

Note that while the criteria listed above begin with the fundamental skills required to do data science, the focus shifts in items 2, 3 and 4 to what is required for data scientists to be effective. The first of these are the fundamental needs, such as the data itself, and access to the required tools, be they statistical or machine learning tools, databases, visualization libraries, or other resources. The second are the performance needs, which help data scientists do whatever it is they do a little better than they do it now; this includes processes and systems that enable data scientists to improve their own capabilities. Finally, we have the excitement needs, which enable data scientists to do outstanding work – a large part of this is being able to reuse what has been built, through deployment of various kinds.

It is in this context that we can discuss how managers of data science teams can help them be effective.

If there is one kind of behaviour in analytics managers that I wish changed, it is the one I describe in the following lines.

A lot of what data scientists do is experimental, throw-away analysis. However, it is tempting for a number of managers (many of whom have already made up their minds that some hypothesis holds true, or will work) to assume that they're right, and that what is required from the data scientist is merely the detailed model that formalizes the relationship.

This kind of assumption makes for poorly designed projects, and doesn't make good use of the data scientist's time for exploratory analysis, for evaluating different kinds of models, and for finding out what works, given the dataset.

Naturally, given the time-bound nature of businesses and the poor understanding of analytics at the executive level in many organizations, such clients are commonplace, and such managers often find themselves pushing for results without the right underlying systems, data or resources. Sometimes, they begin projects with data scientists who lack the specific skills to build the kinds of models required to solve the problems at hand. Whatever the cause, the challenge many data scientists in business and consulting face is dealing with such unreasonable expectations.

In this specific context, some questions that shouldn’t be posed to data scientists might be along the following lines:

  • “Assuming that hypothesis X works, how long would it take to build a full fledged application using this hypothesis X?”
  • “The domain experts are convinced that this hypothesis X is true. Why don’t your results reflect this too?”
  • “The values of R_sq or precision/recall I see here don’t reflect what can be done with the data. Aren’t better results possible?”

These kinds of questions are simplistic when asked in the initial stages of a data science activity or experiment, and in some situations they can be dangerous too (although they're innocuous mistakes that any manager new to analytics initiatives may make).

For the same reason that "a little knowledge is a dangerous thing", these project managers may be gambling with the fortunes of the entire analytics program they serve, because they base even large projects on such naive and unverified assumptions. Were they to give due consideration to exploratory data analysis, and to what the data actually says about the viable models and applications that may be built, they would put their data scientists and engineers on the path to success.

Pragmatic Business Transformation with AI

I interact with numerous data scientists and people in the data science space on LinkedIn on a daily basis, many of whom have insightful things to say about how data and artificial intelligence are transforming the business landscape. One of these is Vin Vashishta, whose posts often address pressing challenges in data and AI. A certain alarmism about the automation of business processes accompanies every discussion on artificial intelligence, and with good reason. Here is a recent post by Vin and my comment. This blog post was originally on Medium, and is an expansion of the ideas in that comment.

Traditional Thinking Couches

Traditional thinking about how work gets done generally has the following elements. It is based on scientific reductionism and paradigms such as linearity. In truth, this thinking has allowed us to come very far: the division of labour is the very basis of capitalism, for instance, and modern capitalism thrives on specialization and the management of work in this form.

  1. Linearity: The tendency to think of all work as ultimately reducible into linearly scalable chunks. Less of a task requires less resources, whereas more work requires more resources. To be fair, this kind of thinking has been around for millennia, since at least the time of human settlement and the neolithic age.
  2. Reducibility: This is the tendency to think of work as infinitely divisible, such that if we complete each sub-task of a job in a certain sequence, we end up completing the whole job. Systems engineers know better, and understand holism and reductionism in systems as analogies to this traditional view of reducibility and how it might affect the way we see work today.
  3. Value-based Work and Tangibility: Another element of what traditionally defines work is the presence of tangible objectives, such as items shipped, or certain unambiguously measurable criteria met. In this world, giving a customer a good experience when they shop, or enabling customers and partners to serve us or be served better, isn't seen as value, but as non-value-added activity. For a long time, approaches to business transformation focused on removing non-value-add activities from business processes, with the view that this would improve process efficiency.

When we think about how businesses will take up AI and machine learning capabilities, we're compelled to think through these same lenses. They're comfortable couches that we cannot get out of, and as a result they possess and dominate our thinking about AI deployment in enterprises.

AI-Specific Cognitive Biases

Some dangers of thinking driven by the above principles are as follows:

  1. Zero-sum automation: The belief that there is a fixed pie of opportunity, and that when we give human jobs to machines, we deprive humans of opportunities. Naturally, this is not true, because general, self-organizing intelligences such as humans are more than capable of discovering and finding new opportunities. Fixed-pie thinking is probably one of the key reasons behind AI alarmism. I would additionally argue that at some level, AI alarmism is also the result of bogeyman thinking, a paradigm in which a strawman such as AI is assigned blame for large scale change. In the past, a lot of technological progress and change happened without such bogeymen, even as other changes were being prevented because of such thinking. Another element of bogeyman thinking is the tendency to ignore complementarity, including situations where humans and AI tools could work alongside each other, resulting in higher process effectiveness.
  2. Value bias: While there is truth to the notion that processes have value-add steps and non-value-add steps, it is a feature typical of reductionism to assume that we don’t need the non-value-add steps at all, while they may be serving true purpose. For instance, all manufacturing processes that transform raw material to product have ended up requiring quality checks and assurance. As a feature of the evolution of industrial production processes, quality assurance and control have become part of nearly all manufacturing processes that operate at scale. QA and QC represent a non-linearity in the production system, or a feedback loop which provides downstream process performance information to upstream processes.
  3. Exclusivity: A flip side of bogeyman thinking, combined with value bias, is the phenomenon of exclusivity. For example, interpreting emotional expressions on a human face has long been a task that humans excel at – until recently, we knew of no higher animals, let alone technologies, with this level of sophistication. Now, a lot of work in the ML/AI space deals with the so-called soft aspects of human life – judging people's expressions and understanding them, learning about their behavioural patterns, and so on – and these capabilities are maturing within AI systems on a regular basis. This contradicts traditional notions of human-exclusive capability in many areas. Naturally, it is seen as a threat, rather than a capability enhancer. The truth is that exclusivity should also be treated as a logical fallacy when discussing the development of AI systems.

It is common to fear someone who seems able to do everything one can do, until that person becomes one's friend. I'd say the jury is still out on what AI cannot yet do – and as a result, our approach to business transformation (as with transformation in other areas) should be humans + AI, not AI in lieu of humans. This synergy is already visible in the manufacturing world, and perhaps we will see it make its way into other spheres as well. Fixed-pie thinking won't get us anywhere when we have capability amplifiers like AI to assist humans.

Concluding Remarks

A key element of future human productivity is the discovery and exploitation of new opportunities in new frontiers. My suggestion to business leaders thinking about AI adoption for automation and process improvement, is to expand the pie first, by creating new opportunities to do more as a business, and enable your employees to take up and contribute more to your business. When you then enable them with AI, the humans+AI combination you will see as a result will take your organization to new heights.

A Simple CNN Tutorial in Keras

In the last year or so, I have begun working extensively with Keras, Tensorflow and CNTK for various problems at work in industries ranging from manufacturing, to media, to cybersecurity.

Here is a simple convolutional network tutorial on Kaggle that I developed in Keras and Tensorflow. Given the GPU-enabled kernels available on Kaggle these days, it has become easy enough to train models on large scale image data there. Performance is another matter, though, since the Tesla K40 GPUs on offer are lower end GPUs, and are also load balanced across multiple users. In any case, Kaggle lets you try out even CUDA code – and that opportunity can't be beat, given the low cost of working on Kaggle.

My motivation for putting together a tutorial is not a dearth of tutorials – there are enough and more out there. However, I wanted to emphasize certain good practices, and intend to keep updating the kernel in question to illustrate them.

Caveat: The internet is awash with tutorials on deep learning using these frameworks, so I won't dwell much on why this tutorial is different, because it isn't very different. That said, it does emphasize how a simple deep learning model can be made more effective through various good practices, such as batch normalization, some explanation of loss functions, and some amount of data exploration in the context of the data and labels for this supervised problem.
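To make those good practices concrete, here is a minimal sketch of the kind of network the tutorial discusses – not the Kaggle kernel itself, and assuming MNIST-like 28x28 grayscale inputs with ten one-hot encoded classes:

```python
# A minimal sketch, not the Kaggle kernel itself; assumes MNIST-like
# 28x28 grayscale images and 10 one-hot encoded classes.
from keras.models import Sequential
from keras.layers import (BatchNormalization, Conv2D, Dense,
                          Dropout, Flatten, MaxPooling2D)

def build_simple_cnn(input_shape=(28, 28, 1), num_classes=10):
    # Two convolutional blocks, each with batch normalization (one of the
    # good practices mentioned above) followed by max pooling.
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        BatchNormalization(),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        BatchNormalization(),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(num_classes, activation="softmax"),
    ])
    # Categorical cross-entropy is the natural loss for one-hot labels.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Batch normalization after each convolution tends to stabilize training and permit higher learning rates, which is why it features among the practices the kernel emphasizes.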

Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored on the data science road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors being used, is akin to data science suicide – a sure-shot road to perdition as a data scientist. Domain knowledge is also hard to acquire, especially for data scientists working as consultants and applying their skills in a consultative, short-term setting. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering set-up or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects
  4. Strive for the usefulness of models, not for more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril; hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, "All models are wrong, but some are useful" – and this couldn't be more true than for models built from well-tested hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, a debate constantly exists on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge, and test enough hypotheses. With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision and care may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and RoboticDataScience are often discussed in the context of machine intelligence and speeding up the process of insight generation from data. However, for some applications, it may be a better idea in the short term to ensure that the feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning (justifiably) has seen a great deal of hype in the recent past. However, it cannot be seen as a panacea to all data analysis. The end goal from data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.

I’m not discussing the many technical aspects of building explainable models here. These aspects are contextual and depend on the situation, and besides, the tone of this post and the tweets is deliberately lighter, to encourage discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you like something on this post, or want to share any other related insights, do drop a comment, or tweet to me at @rexplorations or message me at LinkedIn.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to elucidate the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say that I found Michael Koelbl’s answer to What are all the different types of data scientists? quite interesting, and, thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: Essentially business analysts who understand a specific business domain reasonably well, although they're not statistically or analytically inclined. They focus on exploratory data analysis, on reporting built around newly created measures, on the graphs and charts based on them, and on asking questions arising from these explorations. They're excellent at storytelling, asking questions based on data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, trying to build ML models of one kind or another from the data. They're not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: Data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization, and they span the range from BI tools to hand-coded scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Perhaps statisticians are the rarest breed of the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and DOE, to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Why Do I Love Data Science?

This is a really interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I've found data visualization to be a great way to identify hypotheses about my data set. Having a visualization as the starting point for hypothesis generation lets us go into model building a little more confidently. A specific example is a time series analysis I performed on energy system data, where aggregate statistical measures and distribution fitting yielded only arbitrary, complex patterns. Time-ordered visualizations helped me formulate the hypothesis correctly, and allowed me to build an explanatory model of the system.
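As a hedged illustration of that contrast – on synthetic data, since the energy system data isn't public – consider how an aggregate view and a time-ordered view of the same series differ:

```python
# Illustrative only: a synthetic series standing in for the energy data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
t = pd.date_range("2017-01-01", periods=500, freq="H")
# A daily cycle plus slow drift plus noise: aggregate statistics hide this.
y = (np.sin(2 * np.pi * np.arange(500) / 24)
     + 0.002 * np.arange(500)
     + rng.normal(0, 0.3, 500))
series = pd.Series(y, index=t)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
series.hist(bins=40, ax=ax1)   # aggregate view: looks like shapeless noise
ax1.set_title("Aggregate view (histogram)")
series.plot(ax=ax2)            # time-ordered view: cycle and drift emerge
ax2.set_title("Time-ordered view")
plt.tight_layout()
plt.show()
```

The histogram suggests little beyond a roughly bell-shaped spread, while the time-ordered plot immediately reveals the periodicity and drift that a useful hypothesis should capture.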

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns and exceptions to these rules. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of transaction data between a customer and a client. These logs revealed unusual patterns that people steeped in the process could sense, but which couldn't be quantified. By characterizing typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as "feature engineering" in data science and machine learning, was a key insight. Such insights can only come when we're informed about domain considerations, and when we understand the business context of the data analysis well.
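Here is a minimal sketch of what such session-level feature construction can look like; the column names (session_id, timestamp, bytes) are hypothetical stand-ins, not the actual client data:

```python
# A hedged sketch of session-level feature engineering; column names
# (session_id, timestamp, bytes) are hypothetical stand-ins.
import pandas as pd

def session_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw log rows into per-session metrics."""
    grouped = logs.groupby("session_id")
    features = grouped.agg(
        n_events=("timestamp", "count"),
        total_bytes=("bytes", "sum"),
        duration_s=("timestamp",
                    lambda ts: (ts.max() - ts.min()).total_seconds()),
    )
    features["bytes_per_event"] = (features["total_bytes"]
                                   / features["n_events"])
    return features

def flag_outliers(features: pd.DataFrame, col: str, z: float = 3.0) -> pd.Series:
    # Sessions far from the typical pattern become anomaly candidates.
    scores = (features[col] - features[col].mean()) / features[col].std()
    return scores.abs() > z
```

Which metrics to construct, and which thresholds count as anomalous, is exactly where the domain knowledge mentioned above comes in.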

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then continue to improve it to a point where it helps your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in and present interesting possibilities during the analysis. Sometimes we answer these questions by finding and including additional data; at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and do an analysis on top of another analysis – with the result that, after a while, you’ve composited different models together that give you completely new insights.

Concluding Remarks

All three patterns are exhilarating and interesting for data scientists to observe, especially those who are deeply involved in reasoning about the data. A good indication that you’ve done well in data analysis is that you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.

The Future for Data Scientists

Originally an answer on Quora, this is an interesting topic to discuss, given the rapid pace of change in the intersecting, related and rapidly evolving fields of data science, analytics, AI and machine learning.

As of early 2018, I see the evolution of data science roles in industry across three time frames: a short term, over the next one or two years; a longer term, between two and five years; and a horizon beyond that, in roughly five years, by which I expect the data scientist role to be so infused into management and business as not to be called out separately. So, let’s look at the timeline and what it could mean for data scientists.

One-to-Two Years Time Horizon

  1. The number of data scientists in the market will increase greatly, although their quality and readiness for projects will continue to be largely poor. As I’ve said in an earlier answer, a static skill set will get you nowhere in the competitive world of data science, and this will become even more the case in the next two years: knowledge of old tools and frameworks is quickly obviated by the need to learn new ones.
  2. A great deal more emphasis will fall on productionising and operationalising data science in the near term. This means that data scientists will be expected to leverage APIs, microservices architectures and similar approaches, to ensure that data science results in applications, and not merely analyses with an expiration date.
  3. The increasingly high-level nature of data science tools and frameworks will continue to democratise data science, bringing people who aren’t strictly data scientists into the fold. Being a data science professional who spent significant time in engineering and quality management before this, I see this as an enabler for smart, knowledgeable and analytically minded professionals from different walks of life to embrace data. Deployment-friendly data architectures and data infrastructure, such as on the cloud, will enable this transformation too.
  4. Certain kinds of statistical analysis and machine learning systems will become commonplace and productionized, owing to the increasing popularity of probabilistic programming frameworks and deep learning frameworks. Common problems such as face and other biometric data analysis, and specialized data-centric systems such as autonomous vehicles, will become more mature. This will make probabilistic modeling and deep learning a de facto part of the data science skill set.
  5. Data scientists will be expected to be competent data engineers at one level, and at least partly application developers as well. A lot of the changes and improvements to data science education will happen along these lines.

Three-to-Five Years Time Horizon

“Prediction is very difficult, especially if it’s about the future.” – Niels Bohr

I love this quote by Niels Bohr, and I suppose it subtly sums up the challenge posed by the original Quora question I tried to answer. That said, I do have some views about what could potentially unfold. Would love to know if you think anything else belongs here, or if anything should be different.

  1. The ethics of data science and artificial intelligence will become a really serious topic. Although this is being discussed extensively by academicians, industry stalwarts and intellectuals in 2018, the debate will have real implications a few years down the line, and I expect that data scientists will be held ethically accountable for their work in a few years’ time. Data use and algorithm use regulations will begin to appear, just as regulations on industrial systems or weapons exist today. The GDPR is a defensive regulation on data security – I expect to see strategically impactful regulations in future.
  2. Data science education will come to include artificial intelligence, and will improve vastly, becoming a standard part of college curricula. This expanded curriculum will include deeper focus on computational engineering, statistical engineering and large scale data based simulations. Data scientists in the future could be drawn from different backgrounds and experiences
  3. I expect a revival of interest in the use of large scale simulation tools and methods. Stochastic simulation of systems at a large scale has been around for a while, on the sidelines. I expect that truly large scale simulations of real world systems are more possible than they were before, and this will become the de-facto way of engineering many kinds of systems.
  4. The value of domain understanding and domain modeling will be emphasised for data scientists, and ontology models will become more common in data science. Data scientists who cannot build domain-aware systems may come to be regarded the way data scientists who don’t understand simple algorithms are regarded today.
  5. Fully automated data science systems will be able to serve a large number of common use cases. Within about five years, this kind of automation will allow data scientists to flit from one high-level task to another. Furthermore, such a capability may allow organizations to do without data scientists per se, relying instead on analytical staff who can straddle the different systems and use cases commonly handled.

Beyond five years: I hope that the term “data scientist” becomes outdated five years hence. If it doesn’t, it may mean that we haven’t sufficiently been able to leverage the abilities of the technologies and frameworks to operationalise and productionise data science, or build any truly intelligent data driven systems.


Contextualizing “AI Alarmism” in Business Process Automation

Alarmist speculations about Artificial Intelligence are everywhere these days. Business managers in labour-intensive markets such as India and China have, in recent months, come to fear data-driven process automation, often unfairly and unnecessarily. In this post, I wish to discuss some of the AI alarmism we see in the general public at large – ranging from well-founded speculation to the truly ridiculous. I will also present two mental models that may illustrate the usefulness of AI in process automation, before we arrive at how to contextualize AI-based automation.

Some Contours of AI Alarmism

In the last several months, the media has been awash with articles about data-driven process automation made possible by artificial intelligence, which is said to be doing any of the following things (listed in order of increasingly crazy speculation):

  1. Taking away our jobs and rendering vast sections of human society jobless
  2. Doing things that humans do, better than humans do them, thereby obviating the need for humans in certain very human activities
  3. Serving as a panacea for all kinds of faults and frailties that make us human, and therefore representing the post-human world
  4. Killer robots that will wrest power from all of human society, thereby resulting in the standard-issue-technology-apocalypse that is the staple of Hollywood movies

It is important to assess the sources of these fears and speculations, if only to debunk some of this AI alarmism. It is also important to understand the true challenges, where they exist, and the threats in that context.

A Process View of AI-driven Automation

In the past several decades, we have seen numerous technology revolutions and their socio-cultural impact on human society. Whether it is the rise of computerised and robotic manufacturing processes that led to the digitization of manufacturing, or the evolution of automation methods in knowledge work over the last decade or so, the fundamental drivers have been two-fold – improved process performance, and increased process flexibility:

  1. A better process for delivering value
    1. Improved process quality and reduced variation
    2. Reduced process time and opportunities to continually improve
    3. Reduced process cost and opportunities to spread value within processes
  2. A more scalable and predictable process for delivering value

Given this broader process-based view of excellence for organizations, and of how managers look to new technology from an operational effectiveness standpoint, can we see automation driven by artificial intelligence in a new light? For instance: how can we understand what AI specifically offers within the process automation ambit, and what this means for businesses? To understand this, let’s take a look at what AI solutions currently allow businesses to do:

  1. Automate embarrassingly simple tasks within business processes that have true scale, that are based on well-defined rules, but which are subject to variation – and do so cost-effectively
  2. Automate somewhat complex tasks which require some human intervention, but which are not mission-critical, again within business processes that have true scale

Now, let’s look at what AI based automation is not capable of accomplishing in its current state:

  1. Truly domain aware decision making, as an expert system that is aware of business context, and which can make holistic recommendations only possible by highly skilled experts
  2. Truly complex decision making that considers multiple factors in a non-formulaic or dynamic manner
  3. Tasks of moderate to high complexity to be performed in a business environment where the scale of the business isn’t large

[Figure: Automation and its effectiveness with business scale]

[Figure: Automation and its effectiveness with changing process complexity]

As the figures above suggest, cost-effective process automation is held back by its business case and its applicability at different business scales. This leads to an interesting cost-benefit analysis: AI-based process automation in businesses is most effective when there is true business scale, and when the processes in question are either simple or moderately complex.

Data and AI-Based Automation

There is yet another factor that could potentially affect how effective automation might be – and this is the availability of data from processes. The importance of data can be characterised in some key ways:

  1. A core enabler for artificially intelligent systems and applications is learning from data, which implies the use of statistical techniques: machine learning, statistical inference, real-time time series modeling, and so on.
  2. Building domain-specific context and awareness within the application implies needing to use knowledge models, which are representations of the system’s domain, in the form of entities and relationships.
  3. A key consideration for an intelligent system is not only being able to learn from data in the domain, but also the ability to act on the domain. These domain actions can take many forms – from the machining and welding processes we see in robotic manufacturing systems, to computer programs that can generate instructions for writing other programs or instructions in data-intensive systems.
  4. A subset or enabling capability in this context, therefore, is the ability to collect and manage data of various kinds in scalable ways, and in real time.

Reasons for Alarmist Speculation

Given these mental models – the process-centric and data-centric views of AI-driven automation – let’s take a step back and look at what is fueling this speculation:

  1. Misunderstanding about what artificial intelligence is and what capabilities it entails, on the business process side or on the data analysis side
  2. The lack of an objective scale for measuring or understanding AI progress
  3. Oversimplification of even simple, old and established human-in-loop systems
  4. Gross oversimplification of complex, human-engineered, industrial systems
  5. Mass media speculation that rides on the latest and greatest technologies, and importantly,
  6. The unceasing tendency of tech reporters and media to both liken the future to science fiction, and to jump to visions of utterly glorious or utterly ghastly futures, rather than evaluating technologies and their impact realistically

Concluding Remarks: Contextualizing AI-based Automation

First off, it is important to recognize that not all AI-centric speculation is unfounded. I wish to call out not those who have legitimately raised alarms about the policy, economic or ethical implications of AI-based process automation, but those who stretch the speculation into the realm of fantasy. It is near-impossible to replace humans in certain kinds of tasks, such as those discussed above that involve high complexity and are mission-critical for businesses. It is also important to consider the true scale and business realities of enterprises when speculating on AI. To this end, we may have to ask whether and how a firm may use AI, and whether it has a sufficiently strong business case. Not only should speculators, consultants and pundits use such rules of thumb; it also behooves business leaders and managers to understand their own businesses in these terms.

Further Reading

  1. “Impact of emerging technologies on employment and public policy”, by Darrell M. West, Brookings Institution (link)
  2. “How humans respond to robots: building public policy through good design”, by Heather Knight, Brookings Institution (link)
  3. “It is time to dispel the myths of automation”, Viktor Weber, on the World Economic Forum website (link)

Key Data and AI trends in 2017

This year, 2017, has been quite a busy year for artificial intelligence and data science professionals. In some ways, this is the year when AI truly began to be debated and discussed, from frameworks and technologies to ethics and morality. This is the year when opportunities for AI-driven improvement in businesses began to be examined critically by diverse industry professionals and academicians. With good reason, machine learning and deep learning came to be placed at the top of Gartner’s hype cycle. We’re really at the peak of inflated expectations when it comes to ML/DL – with opportunities to shorten the time we take to reach measurable and direct consumer value.

[Figure: Gartner Hype Cycle for 2017]

Overall, in my experience, three key trends that enterprises welcomed in 2017 include:

  1. Simplification of cloud and data infrastructure services
  2. Improved and democratized scalable machine learning and deep learning
  3. Automation in key AI, ML and data analysis tasks

Improving Cloud and Data Infrastructure

Perhaps the foundational enabler for the data strategy of many enterprises that I saw and worked with in 2017 is the availability of an easily operated and managed, scalable cloud infrastructure. This promise of high performance, low cost and (arbitrarily) scalable cloud infrastructure was made as early as 2014, but has taken a few years to materialize as a truly viable, commercially feasible offering from stable, top-tier technology firms. Prominent cloud vendors such as Google Cloud, Microsoft Azure and Amazon’s AWS have upped the ante, while veterans like Hortonworks and Cloudera continue to hold sway. This space is ripe for consolidation, in my view, although we can expect to see converging architectures before any consolidation that isn’t entirely wasteful can happen.

Other notable developments on the cloud infrastructure side were serverless compute (which enterprises are definitely warming up to – and it shows, in the Gartner hype cycle), production-ready pre-built models for common tasks offered as APIs (a trend that continues to inspire software and AI application architecture), and the maturing of streaming and real-time data processing frameworks. By combining these capabilities, cloud providers have really upped their offerings in 2017 compared to before, and now provide formidable capabilities – which, in my view, businesses haven’t explored as much as they should have.

Despite the availability of such production-ready, cost-effective and scalable data management systems in the cloud, cloud infrastructure nevertheless came under scrutiny in 2017 for massive security lapses and downtime. To take specific examples, the Equifax data breach and the massive AWS outage were among the biggest impact events in cloud reliability and data security history, to say nothing of the numerous smaller data security episodes attributable to hacktivism, such as the Panama Papers.

As a counter to some of these incidents and the rise of the GDPR and other data protection regulations, numerous cloud providers have been offering “private cloud” solutions, along with region-specific hosting options for banks and other organizations that deal with regulation-sensitive data.

Additionally, it would be unfair not to point out how much containerization has helped cloud providers in 2017. Massive adoption of containerization using Docker and Kubernetes has enabled virtual environments to be set up and managed for complex, data-intensive development and deployment tasks.

Spark and Tensorflow

The space of scalable machine learning frameworks continues to be dominated by Apache Spark, which has found many friends among data engineers and scientists in production, especially after the 2.0 release, given the comparable performance of its data frame APIs across languages. Whether you program in Python, R or Scala, you can be assured of the same high performance from Spark these days. Spark ML has expanded on the capabilities of Spark MLlib, and in its recent releases, Spark has also polished and unified the interfaces for streaming data analysis (Spark Streaming) and graph analysis (GraphX). As someone who has seen teams use Spark for different purposes and built frameworks on it in 2017, I find the differences between versions 1.6 and below and 2.0 and above significant: the newer versions are more polished and consistent in their behaviour.
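For the curious, here is a minimal PySpark 2.x sketch of this DataFrame-plus-pipeline style; the file path and column names are hypothetical:

```python
# A minimal PySpark 2.x sketch; file path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sketch").getOrCreate()

# The same DataFrame operations behave consistently in Python, R and Scala.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df = df.filter(df["amount"] > 0)

# Spark ML pipelines chain feature construction and model fitting.
assembler = VectorAssembler(inputCols=["amount", "n_items"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)
```

The point is less the specific model than the unified pipeline interface, which is one of the things the 2.x releases polished.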

Tensorflow received a lot of hype but only lackluster adoption in late 2016 and early 2017; over the last several months, though, it has made a strong case for itself, and adoption has grown significantly. As developers have warmed up to the framework, and as more language interfaces have been developed for it, its popularity has soared, especially in the latter half of 2017. Another factor in its development and adoption is the widespread use of GPU-based deep learning. The core Tensorflow team’s additions to 1.0 (as explained by Jeff Dean here) have made it a mature deep learning package, and perhaps the most widely used and sought-after deep learning framework. While Torch makes an impression and is widely loved (especially in its PyTorch form), Tensorflow is hard to beat for the speed and dynamism of its high quality open source contributors. At Strata Singapore 2016, I sat through a tutorial on Tensorflow 0.8, and what I saw then contrasts sharply with what I see in versions 1.0 and higher. My recent brushes with Tensorflow have made me more convinced that this is the framework to learn for deep learning developers at the moment. The presence of higher-level interfaces and wrappers, such as Keras, has made Tensorflow very easy to use for entry-level and intermediate programmers and data scientists.

Automation in ML, DL and Data Science

Without a doubt, the development of techniques to automate parts of ML and DL development is one of the biggest and most important directions within the field of artificial intelligence in 2017. Taking after Leo Breiman’s random forests (an ensemble of “weak learners” resulting in a machine learning model with high performance) and various advancements in deep learning and machine vision (especially convolutional neural networks, which essentially encode complex features using simpler features in computer vision problems), automated hyperparameter optimization was probably the first step in the general direction of automated machine learning.
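As a minimal, concrete instance of this idea – not any particular AutoML framework – here is what automated hyperparameter search looks like in scikit-learn:

```python
# A simple grid search over random forest hyperparameters in scikit-learn,
# as one concrete instance of automated hyperparameter optimization.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [4, 8, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best settings found
```

Full AutoML systems go well beyond this, searching over preprocessing steps and model families as well, but the underlying idea of optimizing over a configuration space is the same.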

Frameworks like AutoML (see Andreas Mueller’s talk on the subject) have been the cynosure of this kind of research, and companies small and large have begun attempting different approaches to the context modeling problems that arise from the need to automate data science. While most approaches to machine learning take the classical route – finding computational ways to learn more and more from data – some have taken non-traditional approaches, combining ideas from expert systems, rule-based inference engines, and other fields. A novel development has been the invention of generative adversarial networks (GANs), which could lead to hitherto unseen improvements in the use of computationally generated data as a starting point for finding the best representations of a given dataset. Despite being invented in 2014, it was in 2017 that implementations of this kind of network became popular and came to be considered a viable neural network architecture for computer vision and other kinds of machine learning problems.

Other noteworthy trends within the data and AI space include the rise and improved performance of chat bots and conversational natural-language APIs, the remarkable improvements to translation and image tagging made possible by deep learning, and the important matter of AI ethics – starting from the now-famous question of “should your self-driving car kill a pedestrian in order to save your life”, and extending to ethical conundrums and alarmist remarks from tech luminaries such as Elon Musk.

Concluding Remarks

So, what does 2018 hold in store? That seems to be the question on everyone’s lips in the data and AI world, and it is also what data and AI enthusiasts in different industry roles are looking to understand. While it is not possible to clearly say which trend will dictate progress in 2018 and beyond, it is clear that the above three developments will form key cornerstones on top of which future capabilities for AI and enterprise scale data management and data science will be built. Hope you enjoyed reading this. Do leave a comment or a note if you would like to share more.

Quora Data Science Answers Roundup – 2017

Quora is a regular haunt of mine these days, and a lot of my activity there is centered on topics of deep interest – usually data, engineering, aviation and technology. Here’s the first version of the Quora data science answers roundup that I posted in January 2017, soon after I was designated Quora Top Writer for 2017.

What you see below are some more of my answers from 2017, on data and related areas, from Quora.

Data science, data analysis, simulation, probability, statistics and machine learning answers:

  1. Some hard truths about becoming a data scientist
  2. The best thing about working in data science
  3. Important qualities for data scientists. Related posts here and here
  4. Relevance of the basics of ML given the presence of machine learning APIs
  5. Expensive boot camps for data science and justifying spending
  6. Nontrivial ideas from probability and statistics required for data science
  7. Thoughts on Andrew Ng’s deep learning course (which led to a blog post here too)
  8. On new and interesting research ideas in the AI space
  9. Managing unstructured text data and feature extraction – more here
  10. Managing missing data fields and null values in data science problems
  11. On linear programming versus stochastic searches for hyperparameter optimization
  12. Differentiating between fitness and loss functions
  13. On model interpretability in machine learning
  14. Characteristics of a good regression model
  15. Distribution modeling and probability – 1 , 2 , 3 , 4
  16. On data analysis and its use in the manufacturing industry
  17. Optimization techniques in data analysis and data science
  18. On the philosophy of deep learning – related answers on how deep learning algorithms learn , on weight initialization in deep neural networks
  19. On time series models in data analysis – more here , here , here , here , here , here and here
  20. Convex optimization and the use of gradient descent
  21. On Genetic Algorithms – 1 , 2 , 3
  22. Anomaly detection in financial time series data – related answer here
  23. Significance and difference in significance testing
  24. Agent based modeling for traffic simulations

Technology-specific answers on data science and analysis:

  1. On big data technology courses, and the lack of architecture, strategy and such courses
  2. On the continuing relevance of SQL/RDBMS technologies
  3. The develop-vs-use conundrum for building data and machine learning systems – more here
  4. Advice on career and certifications  – 1 , 2 , 3 , 4
  5. Programming language specific answers – 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11
  6. General data science books, resources, skills – 1 , 2 , 3
  7. On big data ecosystems and components
  8. Perspectives on data warehousing and big data technologies
  9. Contextualizing tools like Excel in the context of data analysis

Data science and management:

  1. The importance of BI and decision enablement tools in the data space
  2. Andrew Ng’s venture and how it could be differentiated from others
  3. Managing data science projects

Hope you enjoy reading through them and find them interesting and informative!