Achieving Explainability and Simplicity in Data Science Work

This post stems from a few of the tweets I’d authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Ignoring domain knowledge is the data scientist’s road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors being used is akin to “data science suicide”. Domain knowledge is also hard to acquire for data scientists, especially those working on projects as consultants and applying their skills in a consultative, short-term setting. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering setup or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for the usefulness of models, not to build more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril. Hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and this is especially true of models built from domain-informed hypotheses. It is such models that become genuinely useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, a debate constantly exists on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge and test enough hypotheses (a small sketch of such a model follows this list). With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and “robotic” data science are often discussed in the context of machine intelligence and speeding up the process of insight generation from data. However, for some applications, it may be a better idea in the short term to ensure that the feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer-term strategy, by erring on the side of caution.
  7. Deep learning isn’t the end of the road for data scientists. Deep learning has (justifiably) seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for every data analysis problem. The end goal of working with data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.
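
To make the point about simpler, explainable models a little more concrete, here is a minimal sketch in Python. The data set, file name and column names are all hypothetical – the point is only that a handful of domain-vetted features feeding a small linear model give you coefficients you can explain to a customer.

```python
# A minimal sketch of a small, interpretable model built on a few
# domain-vetted features. The file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customer_churn.csv")          # hypothetical data set
features = ["tenure_months", "monthly_spend", "support_tickets"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.3, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# With only three named features, the coefficients themselves are the explanation.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(features, coefs):
    print(f"{name}: {coef:+.3f}")
print("held-out accuracy:", model.score(X_test, y_test))
```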

I’m not discussing the many technical aspects of building explainable models. For one, these technical aspects are contextual and depend on the situation; additionally, the tone of the post and tweets is deliberately lighter, to encourage a discussion and to welcome beginner data scientists to it. Hence my omission of these (important) topics.

If you liked something in this post, or want to share any other related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to describe the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say that I found Michael Koelbl’s answer to “What are all the different types of data scientists?” quite interesting, and, thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis, reporting based on the creation of new measures, graphs and charts built on them, and asking questions around this exploratory work. They’re excellent at storytelling, asking questions based on data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, where they’re trying to build ML models of one kind or another, based on the data. They’re not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. They span the range from BI tools to coded-up scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes about data science and statisticians). Perhaps statisticians are the rarest breed in the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and design of experiments (DOE), to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Why Do I Love Data Science?

This is an interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a visualization as the starting point for hypothesis generation lets us go into model building a little more confidently. There’s the specific example of a time series analysis technique I used for energy system data, where relying on aggregate statistical measures and distribution fitting produced arbitrary, overly complex characterisations of the data. Using time-ordered visualizations helped me formulate the hypothesis in the correct way, and allowed me to build an explanatory model of the system.
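
As a hedged illustration of that contrast (synthetic data, not the energy project’s): a distribution view of a metric can hide structure that a time-ordered view makes obvious.

```python
# A minimal sketch with synthetic data: the distribution view hides a daily
# cycle that the time-ordered plot makes obvious.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-01", periods=24 * 14, freq="H")   # two weeks, hourly
load = 50 + 10 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 2, len(idx))
series = pd.Series(load, index=idx, name="energy_load")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
series.hist(bins=30, ax=ax1)        # aggregate view: just a blob of values
ax1.set_title("Distribution view")
series.plot(ax=ax2)                 # time-ordered view: the daily cycle is visible
ax2.set_title("Time-ordered view")
plt.tight_layout()
plt.show()
```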

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns, and exceptions to these rules. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of transactions between customers and a client. These logs revealed unusual patterns that people steeped in the process could recognise, but which couldn’t easily be quantified. By characterising typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as “feature engineering” in data science and machine learning, was a key insight. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
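
A rough sketch of what I mean by session-specific metrics – the log schema and feature names below are invented for illustration, not taken from that project: aggregate raw events into per-customer features, then flag customers who sit far from the typical pattern.

```python
# Hypothetical log schema: one row per event, with customer_id, session_id,
# timestamp and amount. Feature names are illustrative.
import pandas as pd

logs = pd.read_csv("transaction_logs.csv", parse_dates=["timestamp"])

features = (
    logs.groupby("customer_id")
        .agg(n_sessions=("session_id", "nunique"),
             n_events=("session_id", "size"),
             mean_amount=("amount", "mean"),
             max_amount=("amount", "max"))
)
features["events_per_session"] = features["n_events"] / features["n_sessions"]

# Flag customers more than three robust z-scores from the median on any
# engineered feature -- a crude but explainable anomaly rule.
med = features.median()
mad = (features - med).abs().median() + 1e-9
robust_z = (features - med) / mad
anomalous = features[(robust_z.abs() > 3).any(axis=1)]
print(anomalous.head())
```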

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then continue to improve it to a point where it is able to help your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in, and present interesting possibilities during the data analysis. Sometimes we answer these questions by finding and including the additional data; at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and do an analysis on top of another analysis – with the result that, after a while, you’ve composited different models together that give you completely new insights you hadn’t seen before.

Concluding Remarks

All three patterns are exhilarating and interesting for data scientists to observe, especially those who are deeply involved in reasoning about the data. A good indication of whether you’ve done well in data analysis is when you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.

Key Data and AI trends in 2017

This year, 2017, has been quite a busy year for artificial intelligence and data science professionals. In some ways, this is the year when AI truly began to be debated and discussed, from frameworks and technologies to ethics and morality. This is the year when opportunities for AI-driven improvement in businesses began to be examined critically by diverse industry professionals and academicians. With good reason, machine learning and deep learning came to be placed at the top of Gartner’s hype cycle. We’re really at the peak of inflated expectations when it comes to ML/DL – with opportunities to shorten the time we take to reach measurable and direct consumer value.


Gartner Hype Cycle for 2017

Overall, in my experience, three key trends that enterprises welcomed in 2017 include:

  1. Simplification of cloud and data infrastructure services
  2. Improved and democratized scalable machine learning and deep learning
  3. Automation in key AI, ML and data analysis tasks

Improving Cloud and Data Infrastructure

Perhaps the foundational enabler for the data strategy of many enterprises I have seen and worked with in 2017 is the availability of an easily operated and managed, scalable cloud infrastructure. This promise of a high-performance, low-cost and (arbitrarily) scalable cloud infrastructure was made as early as 2014, but has taken a few years to materialize as a truly viable, commercially feasible offering from stable, top-tier technology firms. Prominent cloud vendors such as Google Cloud, Microsoft Azure and Amazon’s AWS have upped the ante, while veterans like Hortonworks and Cloudera continue to hold sway. This space where the cloud vendors are competing is ripe for consolidation, in my view, although we can expect to see converging architectures before viable consolidation that isn’t entirely wasteful can happen.

Other notable developments on the cloud infrastructure side of things were ideas such as serverless compute (which enterprises are definitely warming up to – and it shows, in the Gartner Hype Cycle), production-ready pre-built models for common tasks offered as APIs (a trend that continues to inspire software/AI application architecture), and the maturing of streaming and real-time data processing frameworks. By combining these capabilities in cloud platforms, cloud providers have really upped their offerings in 2017 compared to before, and provide formidable capabilities – which in my view haven’t been explored as much as they should have been by businesses.

Despite the availability of such production-ready, cost-effective and scalable data management systems in the cloud, cloud infrastructure has nevertheless come under scrutiny in 2017 for massive security lapses and downtime. To speak of specific examples, the Equifax data breach and the massive AWS outage were among the biggest-impact events in cloud reliability and data security history, to say nothing of the numerous smaller data security episodes attributable to hacktivism, such as the Panama Papers.

As a counter to some of these incidents and the rise of the GDPR and other data protection regulations, numerous cloud providers have been offering “private cloud” solutions, along with region-specific hosting options for banks and other organizations that deal with regulation-sensitive data.

Additionally, it would be unfair not to point out how much containerization has helped cloud providers in 2017. Massive-scale adoption of containerization using Docker and Kubernetes has enabled virtual environments to be set up and managed for complex, data-intensive development and deployment tasks.

Spark and Tensorflow

The space of scalable machine learning frameworks continues to be dominated by Apache Spark – which has found many friends among data engineers and scientists in production, especially after the 2.0 release, given the comparable performance of its DataFrame APIs across languages. So, whether you program in Python, R, or Scala, you can be assured of much the same high performance from Spark these days. Spark ML has expanded on the capabilities of Spark MLlib, and in its recent releases Spark has also polished and unified the interfaces for streaming data analysis via Spark Streaming and graph analysis via GraphX. As someone who has seen teams use Spark for different purposes and build frameworks on it in 2017, the differences between versions 1.6 and below and 2.0 and above are significant, and the newer versions are more polished and consistent in their behaviour.
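
A small PySpark sketch of that DataFrame-centric style (the file and column names are made up): because the work is expressed as DataFrame operations, roughly the same optimised execution plan is produced whether the code is written in Python, Scala or R.

```python
# A minimal PySpark 2.x sketch: DataFrame operations compile to the same
# optimised plan regardless of the host language. Column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

daily_totals = (
    df.filter(F.col("amount") > 0)
      .groupBy("customer_id", F.to_date("timestamp").alias("day"))
      .agg(F.sum("amount").alias("daily_spend"),
           F.count("*").alias("n_transactions"))
)
daily_totals.show(5)
```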

Tensorflow received a lot of hype but only lackluster adoption in late 2016 and early 2017; over the last several months, however, it has made a strong case for itself, and adoption has grown significantly. As developers have warmed up to the framework, and as more language interfaces have been developed for Tensorflow, its popularity has soared, especially in the latter half of 2017. Another factor in the development and adoption of Tensorflow is the widespread use of GPU-based deep learning. The core Tensorflow development team’s additions to 1.0 (as explained by Jeff Dean here) have made it a mature deep learning development package and perhaps the most widely used and sought-after deep learning framework. While Torch makes an impression and is widely loved (especially in its PyTorch form), Tensorflow is hard to beat for the speed and dynamism of its high-quality open source contributors. At Strata Singapore 2016, I sat through a tutorial on Tensorflow 0.8, and what I saw then contrasts sharply with what I see in versions 1.0 and higher. My recent brushes with Tensorflow have made me more convinced that this is the framework to learn for deep learning developers at the moment. The presence of wrappers and higher-level interfaces, such as Keras, has made Tensorflow very easy to use for entry-level and intermediate programmers and data scientists.
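
As a toy illustration of why a higher-level interface such as Keras lowers the entry barrier (a sketch on synthetic data, using the tf.keras flavour of the API rather than any particular 2017 release):

```python
# A toy sketch of a small feed-forward network via the Keras API
# (here the tf.keras flavour), trained on synthetic, trivially learnable data.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")        # a simple rule to learn

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```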

Automation in ML, DL and Data Science

Without a doubt, the development of techniques to automate parts of ML and DL development is one of the biggest and most important directions within the field of Artificial Intelligence in 2017. Taking after Leo Breiman’s random forests (an ensemble of “weak learners” resulting in a machine learning model with high performance) and various advancements in deep learning and machine vision (especially convolutional neural networks, which essentially encode complex features using simpler features in computer vision problems), automated hyperparameter optimization was probably the first step in the general direction of automated machine learning.

Frameworks like AutoML (see the talk by Andreas Mueller above) have been the cynosure of this kind of research, and companies small and large have begun attempting different approaches to the context-modeling problem that arises from the need to automate data science. While most approaches towards machine learning have taken a classical path, finding computational approaches to learn more and more from data, some have taken non-traditional approaches, combining ideas from expert systems, rule-based inference engines, and other methods. A novel development has been the invention of generative adversarial networks (GANs), which could lead to hitherto unseen improvements in the use of computationally generated data as a starting point for understanding the best representations of a given dataset. Despite being invented in 2014, it is in 2017 that implementations of this kind of network became popular and came to be considered a viable neural network architecture for computer vision and other kinds of machine learning problems.
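
Hyperparameter optimization is the most accessible slice of this automation story. A minimal sketch using scikit-learn’s randomized search (not any particular AutoML product) gives the flavour: sample candidate configurations, score each by cross-validation, keep the best.

```python
# A minimal sketch of automated hyperparameter search: randomly sample
# configurations of a random forest and keep the best by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20, cv=5, scoring="accuracy", random_state=0)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```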

Other noteworthy trends within the data and AI space include the rise and improved performance of chat bots and conversational natural-language enabled APIs, the amazing improvements to translation and image tagging made possible by deep learning, and the important question of AI ethics – starting from that now-famous question of “should your self-driving car kill a pedestrian in order to save your life”, to ethical conundrums and alarmist remarks from tech luminaries such as Elon Musk.

Concluding Remarks

So, what does 2018 hold in store? That seems to be the question on everyone’s lips in the data and AI world, and it is also what data and AI enthusiasts in different industry roles are looking to understand. While it is not possible to clearly say which trend will dictate progress in 2018 and beyond, it is clear that the above three developments will form key cornerstones on top of which future capabilities for AI and enterprise scale data management and data science will be built. Hope you enjoyed reading this. Do leave a comment or a note if you would like to share more.

Some Ideas on Combining Design Thinking and Data Science

Recently, I had the opportunity to finish Stanford SCPD’s XINE 217 “Empathize and Prototype” course, as part of the Stanford Innovation and Entrepreneurship Certificate, which emphasizes the use of design thinking ideas to develop product and solution ideas. It was during this course that I wrote down a few ideas around the use of data in improving design decisions. Design thinking is a modern approach to system and product design which puts customers and their interactions at the center of the design process. The design process has been characterized over decades by many scholars and practitioners in diverse ways, but a few aspects are perhaps unchanged. Three of these are as follows:

  1. The essential nature of design processes is to be iterative, and to constantly evolve over time
  2. The design process always oversimplifies a problem – and introduces side effects into the customer-product or customer-process interactions
  3. The design process is only as good as the diversity of ideas we use for “flaring” and “focusing” (which roughly translate to “exploring ideas” and “choosing few out of many ideas” respectively).

Overall, the essential idea conveyed in the design thinking process as explained in XINE 217, is “Empathize and Prototype” – and that phrase conveys a sense of deep customer understanding and focus. Coming to the process of integrating data into the design process – by no means is this idea new, since engineers starting from Genichi Taguchi, and perhaps even engineers a generation before Taguchi, have been developing systems models of processes or products in their designs. These systems models are modeled as factor-response models at some level, because they are converted to prototypes via parameter models and tolerance design processes.

Statistically speaking, these are analogues of the overall designed experiment practice, where a range of parameter variables may be considered as factors to a response, and are together modeled as orthogonal arrays. There’s more detail here.

Although described above in a simplified way, data-driven design approaches, grouped under the broad gamut of “statistical engineering”, are used in one form or another to validate designs of mechanical and electrical systems in well-known manufacturing organizations. However, when you look at the design thinking process in specific ways, the benefits of data science techniques at certain stages become apparent.

The design thinking process could perhaps be summarised as follows:

  1. Observe, empathise and understand the customer’s behaviour or interaction
  2. Develop theories about their behaviour, including those that account for motivations – spoken and unspoken aspects of their behaviour, explicit and implicit needs, and the like
  3. Based on these theories, develop a slew of potential solutions that could address the problem they face (“flare”)
  4. Qualify some of these solutions based on various kinds of criteria (feasibility, scope, technology, cost, to name some) (“focus”)
  5. Arrive at a prototype, which can then be developed into a product idea

While this summary of the design thinking approach may appear very generic and rudimentary, it may be applicable to a wide range of situations, and is therefore worth considering. More involved versions of this same process could take on different levels of detail, whether domain-specific or process-specific. They could also add more fine-grained steps, to enable the designer to “flare” and “focus” better. As I’ve discussed in a post on using principles of agility in doing data science, it is also possible to iterate the “focus” and “flare” steps, to get better and better results.

Looking more closely at this five-step process, we can identify some ways in which data science tools or methods may be used in it:

  1. Observing consumer behaviour and interactions, and understanding them, has become a science unto itself, and with the advent of video instrumentation, accelerometers and behavioural analysis, a number of activities in this first step of the design thinking process can be improved, merely by better instrumentation and measurement. I’ve stressed the importance of measurement on this blog before – for one, fewer samples of useful data can be more valuable for building certain kinds of models. The capabilities of new sensors also make it possible to expand the kinds of data collected.
  2. Theories of behaviour (hypotheses) may be validated using various Bayesian (or even frequentist) methods of data science; a small sketch follows this list. As more and more data gets collected, our understanding of the consumer’s behaviour can be updated, and Bayesian behavioural models could help us validate such hypotheses as a result.
  3. In steps 3 and 4 of the design thinking process I’ve outlined above, the “focusing and flaring” routine, is at one level, the core experimental design practice described by statistical pioneers including Taguchi. Using some of the tools of data science, such as significance testing, effect size determination and factor-response modeling, we could come up with interesting designs and validate them based on relevant factors.
  4. Finally, the process of prototyping and development would involve a verification and validation step, which tends to be data-intensive. From reliability and durability models (based on Frequentist statistics and PDF/CDF functions), to key life testing and analysis of data in that context, there are numerous tools in the data science toolbox, that could potentially be used to improve the prototyping process.
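
To make point 2 a little more concrete, here is a minimal Beta–Binomial sketch. The prior and the observation counts are invented for illustration: a belief about the fraction of users who adopt a prototype feature is updated as observation sessions accumulate.

```python
# A minimal Beta-Binomial sketch: update a prior belief about the fraction of
# users who adopt a prototype feature as observation sessions accumulate.
# The numbers are invented for illustration.
from scipy import stats

prior_a, prior_b = 2, 8            # prior belief: adoption probably well below 50%
adopted, not_adopted = 27, 53      # observed outcomes across 80 user sessions

posterior = stats.beta(prior_a + adopted, prior_b + not_adopted)

print("posterior mean adoption rate:", round(posterior.mean(), 3))
print("P(adoption rate > 0.3):", round(1 - posterior.cdf(0.3), 3))
print("95% credible interval:", [round(x, 3) for x in posterior.interval(0.95)])
```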

I realize that a short blog post such as this one is probably too short to explore this broad an intersection between the two domains of design thinking and data science – there’s the added matter of exploring work already done in the space, in research and industry. The intersection of these two spaces lends itself to much discussion, and I will cover related ideas in future posts.

Pervasive Trends in Big Data and Data Science

As of mid-2017, I’ve spent almost two years in the big data analytics and data science world, after 13 years of diverse work experience in engineering and management. What started as professional curiosity has taken a while to turn into data science and engineering skills, and to hone the key ones among them. Along the way, I’ve had a chance to learn core software development methods and principles, stay in touch with the latest in the field, challenge my existing knowledge of product development methodologies and processes, and learn more about data analysis, statistics and machine learning than I started out with in 2015. Along with the constant learning, I’ve had a chance to observe a few pervasive trends in the big data and analytics worlds, which I wish to share here.

  1. Cloud infrastructure penetration: Undoubtedly the biggest beneficiaries of the data and analytics revolution have been cloud service providers. They’re also stretched thin, with falling prices, massive competition, and the need for value-added services of various kinds (big compute and API support, along with big storage, for instance) to be available alongside the core cloud offerings that companies are lapping up for their data management needs. Security concerns continue to exist, and one of the biggest security issues actually involved the US’ leading cloud service provider, Amazon Web Services. Despite this, many industries, even those that consider data security paramount, wish to adopt cloud infrastructure, because of the reduced cost of operation and the scalability inherent in cloud platforms.
  2. Deep learning adoption: Generalized learning algorithms based on neural networks have taken the machine learning world by storm, and given the proliferation of big compute and big data storage platforms, it has become easier to train deep learning algorithms than in the past. Extant frameworks continue to give better, more user-friendly algorithms as they evolve, and there’s definitely a more user-friendly ecosystem of frameworks and algorithms out there, such as Caffe, Keras, and Tensorflow (which has become more user-friendly and better integrated with numerous systems programming languages and frameworks). This trend will continue, with several tested and published DL APIs available for integration into application software of various kinds.
  3. API-based data product deployment: Data science operationalization has begun to happen through APIs and platforms. Organizations that are developing data product strategies are increasingly taking platform views and integrating APIs for managing data, or for scoring incoming data against machine learning models (a minimal sketch follows this list). With the availability of such APIs for general use, it has become possible to integrate many such microservice APIs to build data products with very specific and diverse capabilities.
  4. A focus on value from data: Companies are looking past the big data hype more often these days, and are looking at what value they can get from the data. They’re focusing on better data collection and measurement processes, improved instrumentation and qualifying their data management infrastructure investments. They’re also seeking to enable their data science teams with the right approaches, tools and methods, so that they can get from data to insight faster. Several startups are also doing pioneering work in governing the data science process, by integrating principles of agility, continuous integration and continuous deployment into software solutions developed by data science teams.
  5. Automated data science and machine learning: Finally (and in many ways, most importantly), automated data science and machine learning is a relatively new area of work which is gaining ground significantly. Numerous startups and established organizations are evaluating methods to automate key parts of the data science workflow itself, with The Data Team among them. Such automation of data science is a trend that I foresee will gain ground for some more time, before it becomes an integral part of many data science workflows and development approaches. While a number of applications that straddle this space are referred to as AI, the jury is still out on what is AI and what isn’t, as far as my colleagues and I are concerned.
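
For the API-based deployment trend (point 3), here is a minimal sketch of what “scoring incoming data behind an API” can look like. Flask is just one convenient choice, and the model file and field names are hypothetical.

```python
# A minimal model-scoring microservice sketch using Flask.
# The model file and input fields are hypothetical.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")      # a previously trained model

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    features = [[payload["tenure_months"],
                 payload["monthly_spend"],
                 payload["support_tickets"]]]
    proba = model.predict_proba(features)[0][1]
    return jsonify({"churn_probability": float(proba)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```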

These are just some of the trends I’ve observed, of course, and from where you are, as a data scientist, you may be seeing much more. One thing is for sure – those who continue to keep their knowledge and skills relevant in this fast-changing space will continue to be rewarded with interesting work and new opportunities.

Lessons from Agile in Data Science

Over the past year and a few months, I’ve had a chance to lead a few different data science teams working on different kinds of hypotheses. The engineering process view that the so-called agile methodologies bring to data science teams is something that has been written about. However, one’s own experiences tend to be different, especially when it comes to the process aspects of engineering and solution development.

Agile principles are by no means new to the technology industry. Numerous technology companies have attempted to use principles of agility in their engineering and product development practices, since a lot of technology product development (whether software or hardware, or both) is systems engineering and systems building. While some have found success in these endeavours, many organizations still find agility a hard objective to accomplish. Managing requirements, the needs of engineering teams, and concerns such as delivery, quality and productivity for scalable data science is a similarly hard task. Organizational structure, team competence, communication channels and approaches, leadership styles and culture all play significant roles in the success of product development programmes, especially those centred around agility.

In the specific context of software and systems development, two talks stand out in my mind. One is from a thought leader and an industry pioneer who helped formulate the agile manifesto (a term which he extensively derides, actually) – and the other is from a team at Microsoft, which is a success story in agile product development.

Here’s Pragmatic Dave (Dave Thomas, one of the original pioneers of agile software development), in his GOTO 2015 talk titled “Agile is Dead”.

I’m wary of both extreme proponents and extreme detractors of a philosophy or an idea, especially when, in practice, it seems to have had some success in some quarters. While Dave Thomas takes some extreme views, he does bring in a lot of pragmatic advice. His views on the “manifesto for agility” are in some sense more helpful than boilerplate Agile training programmes, especially when seen in the context of Agile software/system development.

The second talk that I mentioned, the one featuring Microsoft Scrum masters, is very much a success story. It has all the hallmarks of an organization navigating through what works and what doesn’t, and trying to find its velocity, its rhythm and its approach, starting from the normative approach suggested in so many agile software development textbooks and by many gurus and self-proclaimed experts.

This talk by Aaron Bjork was actually quite instructive for me when I first saw it a few months ago. Specifically, the focus of agile practices on teams and interactions, rather than on “process”, stood out. Naturally, this approach raises other questions, such as scaling, but in the specific context of data science, I find that the interactions, and the process of generating hypotheses and evaluating them, seem to matter more than most things. These are only two of the many videos and podcasts I listened to, and surely they constitute only a portion of the interactions I’ve had with team members and managers on Agile processes for data science delivery.

It is in this setting that my personal experiences with Agile were initially less than fruitful. The team struggled to follow process and do data science at the same time, and the management overhead of activity and task management was extensive. This problem still remains, and there doesn’t seem to be a clear solution to balancing the ceremony/rituals of agile practices and seemingly useless ideas such as story points. Hours are more useful than story points – so much so that scrum practitioners typically devolve into equating story points to hours, or multiples of them, at some point. The issue here lies squarely with how the practices have been written about and evangelized, rather than with the fundamental idea itself.

There’s also the issue of process versus practice – in my view, one of the key things about project management of any kind. The divergence between process and practice in Agile methods is very high – and in my opinion, the systems/software development world deserves better. Perhaps one key reason for this is the proliferation of Scrum as the de-facto Agile development approach. When Agile methods were being discussed and debated, the term “Agile development” used to represent a range of different approaches, which has given way (rather unfortunately) to one predominant approach, Scrum. There is an analogy in the quality management world that I am extensively familiar with – in Six Sigma and the proliferation of DMAIC almost exclusively to solve “common cause” problems.

Process-versus-practice apart, there are other significant challenges in using Agile development for data science. Changing toolsets, the tendency to “build now and fix later” (although this is addressed through effective continuous deployment methods) and process overhead are some of the reasons why this approach can still be a hard sell for data science teams.

What does work universally is the sprint-based approach to data science. While the sprint-based approach is only one element of the overall Scrum workflows we see in the industry, it can, in itself, become a powerful, iterative way to think about data science delivery in organizations. Combined with a task-level structure and a hypothesis model, it may be all that your data science team requires for even complex data science. Keeping things simple process-wise may unlock the creative juices of data scientists and enable your team to favour direct interactions over structured interactions, enabling them to explore more and extract more value from the data.

The Expert System Anachronism in the Data Science and AI Divergence

Although the data science and big data buzzwords have been bandied about for years now, and although artificial intelligence has been talked about for decades, the two fields are irrevocably interrelated and interdependent.

For one thing, the wide interest in data science started just as we were beginning to leverage distributed data storage and computation technologies – which allowed companies to “scale out” storage and computation, rather than “scale up”. Companies that could buy numerous run-of-the-mill computers (rather than smaller numbers of extremely expensive, high-end computers) could therefore potentially leverage their data collection activities to be useful to the enterprise.

Let’s not forget, though, that the point of such exercises was to actually get some business value at the end of such an exercise. There’s virtually no business case for collecting huge amounts of data and storing them (with or without structure), if we don’t have a plan to somehow utilize that data for taking business decisions better, or to somehow impact the business or customers positively. IT managers across industries have therefore struggled to make sense of the big data space, and how much to invest, what to invest in, and how to make sense of it all.

Technology companies are only too happy to sell companies the latest and greatest data science and data management frameworks and solutions, but how can companies actually use these solutions and tools to make a difference to their business? This challenge for executives isn’t going away with the advent of AI.

Artificial Intelligence (AI) has a long and hoary history, and has been the subject of debate, discussion and chronicle over several decades. Geoff Hinton, the AI pioneer, has a pretty comprehensive description of various historical aspects of AI here. Starting from Geoff Hinton’s research, pioneering research in recent years by Yann LeCun, Andrej Karpathy and others has enabled AI to be considered seriously by organizations as a force multiplier, just as they considered data science a force multiplier for decision-making activities. The focus of all these researchers is on general-purpose machine intelligence, specifically neural networks. While the “deep learning” buzzword has caught on of late, it is fundamentally no different from a complex neural network and what it can do.

That said, AI in the form of deep learning differs vastly in capability from the algorithms data scientists and data mining engineers have used for more than a decade now. By adding many layers, constructing complex topologies in these neural networks, and iteratively training them on large amounts of data, we’ve progressed along multiple quantitative axes (complexity, number of layers, amount of training data, etc.) in the AI world, to get not merely quantitatively, but qualitatively better AI performance. Recent studies at Google show that image captioning, often considered a hard problem for AI, is now at near-human levels of accuracy. Microsoft famously announced that their speech-to-text and translation engines stand improved by an order of magnitude, because of the use of these techniques.

It is this vastly improved capability of AI, and the elimination of the human (present forever in the data science activity loop) from even the analysis and design of these neural networks (generative adversarial networks being a case in point), that makes the divergence between data science and AI very vivid and distinct. AI seems to be headed in the direction of general intelligence, whereas data science approaches and methods constituted human-in-the-loop approaches to making sense of the data. The key value addition of the human in this data science context was “domain” – and I have extensively discussed the importance of domain in data science in an earlier post – but this too has increasingly become supplanted by efficient AI, provided that the data collection process for training data, and the training and topological aspects of the networks (known as hyperparameters), are well defined enough. This supplanting of the human domain perspective, by machine-learned domain features that matter, is precisely what will enable AI to develop and become a key force to reckon with in industry.

Therefore I venture that the “anachronism” in the title of this post is the domain-based model of systems, or intelligent systems, called the Expert System. Expert system design is an old problem that probably had its heyday and apparently disappeared into the mists of technological obsolescence – and it is this kind of expert system design problem that AI methods will be so good at solving, to the point that they can replace humans in key tasks and become a true general intelligence. Expert systems were how the earliest AI researchers imagined machine intelligence to be useful to humanity. However, their understanding was limited to rule-based expert systems. While the overall idea of the expert system is still relevant in many domains – so much so that, in a sense, we have expert systems all around us – it is undeniable that the advent of AI will enable expert systems to develop and evolve once again, but without the rule-based approaches we have seen in the past, and with inductive learning as seen in deep learning and machine learning methods.
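
To make the rule-based versus inductive contrast concrete, here is a toy sketch (the “machine health” domain and thresholds are invented): a hand-written rule of the kind a classical expert system would encode, next to a decision tree that induces essentially the same rule from labelled examples.

```python
# A toy contrast between a hand-coded expert-system rule and an induced one.
# The "machine health" domain and thresholds are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def expert_rule(temperature, vibration):
    """Hand-written rule, as a classical expert system would encode it."""
    return int(temperature > 80 and vibration > 0.5)

# Generate labelled examples consistent with the same underlying behaviour.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(40, 120, 500), rng.uniform(0.0, 1.0, 500)])
y = np.array([expert_rule(t, v) for t, v in X])

# Inductive learning recovers essentially the same rule from data alone.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["temperature", "vibration"]))
```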

Hypothesis Generation: A Key Data Science Challenge

Data scientists are new age explorers. Their field of exploration is rife with data from various sources. Their methods are mathematics, linear algebra, computational sciences, statistics and data visualisation. Their tools are programming languages, frameworks, libraries and statistical analysis tools. And their rewards are stepping stones, better understanding and insights.

The data science process for many teams starts with data summaries, visualisation and data analysis, and ends with the interpretation of analysis results. However, in today’s world of rapid data science cycles, it is possible to do much more, if we take a hypothesis-centred approach to data science.

Theories for New Age Raconteurs

Data scientists work with data sets small and large, and are tellers of stories. These stories have entities, properties and relationships, all described by data. Their apparatus and methods give data scientists opportunities to identify, consolidate and validate hypotheses with data, and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a key challenge for data scientists. Hypothesis generation, and by extension hypothesis refinement, constitute the very purpose of data analysis and data science.

Hypothesis generation for a data scientist can take numerous forms, such as:

  1. They may be interested in the properties of a certain stream of data or a certain measurement. These properties, and their default or exceptional values, may form a hypothesis (a small sketch follows this list).
  2. They may be keen on understanding how a certain measure has evolved over time. In trying to understand this evolution of a system’s metric, or a person’s behaviour, they could rely on a mathematical model as a hypothesis.
  3. They could consider the impact of some properties on the states of systems, interactions and people. In trying to understand such relationships between different measures and properties, they could construct machine learning models of different kinds.
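
As one hedged example of the first form (the numbers and the 200 ms threshold are invented): a hypothesis about a measurement’s typical value can be written down and tested directly against a fresh sample.

```python
# A minimal sketch of form 1: the hypothesis "mean API response time is
# still about 200 ms" tested against a fresh sample. Numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
response_times_ms = rng.normal(loc=215, scale=40, size=200)   # new measurements

t_stat, p_value = stats.ttest_1samp(response_times_ms, popmean=200)
print(f"sample mean = {response_times_ms.mean():.1f} ms, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the hypothesis that the typical response time is 200 ms.")
else:
    print("The data are consistent with a 200 ms typical response time.")
```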

Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system behaviour and represent such behaviour in a manner that’s tangible and tractable based on simple, explicable rules. This makes story-telling easier for data scientists when they become new-age raconteurs, straddling data visualisations, dashboards with data summaries and machine learning models.

Developing Nuanced Understanding

The importance of hypothesis generation in data science teams is many fold:

  1. Hypothesis generation allows the team to experiment with theories about the data
  2. Hypothesis generation can allow the team to take a systems-thinking approach to the problem to be solved
  3. Hypothesis generation allows us to build more sophisticated models based on prior hypotheses and understanding

When data science teams approach complex projects, some of them may be wont to dive right into building complex systems based on available resources, libraries and software. By taking a hypothesis-centred view of the data science problem, they could build up complexity and nuanced understanding in a very natural way, developing hypotheses and ideas in the process.

Azure ML Studio and R

A decade ago, Microsoft looked very different from the Microsoft we see today – it has been a remarkable transformation. One of the areas where Microsoft has made a big push is machine learning and data analytics. Although the CRAN repository is going strong with >10,000 packages as of today, the MRAN repository (the Microsoft R Application Network) is adding libraries and functionality that were missing from the R stack. Ever since they acquired Revolution Analytics, they’ve also integrated R into their data science and ML offerings in a big way. For instance, Power BI comes with the ability to write R scripts that can produce visualizations for dashboards. They’ve come out with a number of products that add to or complement the Office suite that is the bedrock of Microsoft’s software portfolio, and of late, they have pushed Azure and their own machine learning algorithms in a big way. A year is a long time in the world of big data and machine learning, and now, on Azure ML Studio, most people interested in big data and data science can get started with data analysis in a pleasant, user-friendly interface.


The Azure ML Studio Interface

I have had the chance to play around a little with Azure ML, and here are what I find to be some of its strong points. Above you can see a simple data processing step I set up within Azure ML Studio – to take a simple data set and subject it to some transformations.

It is possible to summarize and visualize this data pretty quickly, using some of the point-and-click summaries you get from the outputs of the boxes in the workflow.


Simple summaries of dataframes and CSV files are easy

What’s nice about this simple interface is the ability to view multiple variables in one view, and to explore a given variable in different ways. Here, I’ve scaled both axes to give a log-log plot, and am able to see the variation in the MPG values for the sample data set in question. Very handy when you want to quickly test one or two hypotheses.

What ML Studio seems adept at doing is bringing together R, Python and SQL in the same interface. This makes it particularly powerful for multi-language paradigm data analysis. True to this capability, you can bring in an R kernel for doing data analysis. Sure enough, you can use Python too (if you’re like me, you use Python and R almost equally).


Interface allows for opening Jupyter notebooks with R and Python kernels

Once you have a Jupyter Notebook opened up, you can perform analysis of all kinds in it – everything available with Open R is apparently supported within Azure ML Studio. The thing about Jupyter notebooks, of course, is that you can’t yet use multiple kernels in the same notebook. You can use either R, or Python, or Julia, for instance, and that language choice is static within a given notebook. There is a discussion around this, but I’m unsure whether it has been resolved. Although R support in Jupyter notebooks is a little sketchy, seasoned R coders can use it well enough. The REPL interface of RStudio is a bit nicer (and harder to get away from, for me personally) than Jupyter for R programming, but it does work well, for the most part. Kernels are managed remotely and abstracted away from the user, so there is no need to SSH into a Jupyter server and so on. The data analysis can start right away, because the distractions are gone.


Jupyter notebook running remotely on ML Studio server with an R kernel.

One bug I seemed to run into is the inability to change graph sizes with the standard par() settings (such as mar). Other than that, graphs render well enough within Jupyter. Building models is easy enough in R as it is – so many packages provide a very simple interface. Doing this in Jupyter is therefore no different – a breeze as usual.


Simple R graph rendered in Azure ML Studio

Overall, with Azure ML Studio, we’re looking at a mature web app for doing machine learning and data science that is user-friendly and provides a fair amount of interactivity, with code that can be integrated right into the workflows – which is quite a coup, in my opinion. For prototyping and exploratory data analysis, this can produce a good, repeatable workflow that is easily understood by others and shared.

  1. The interface is great – it brings together notebooks, data sets, models pre-trained by Microsoft, and so on, together in one nice interface.
  2. One value addition in the interface is the ability to separate out different contexts very clearly. You can clean data with a certain part of it, organize your dataframes with another, and so on.
  3. The drag-and-drop functionality is actually pretty good and works conveniently for those interested in mixing code with a visual interface.
  4. The Jupyter notebook integration is sketchy with R (more an issue with Jupyter than Azure ML studio, in my experience) – but works well enough for most things to do with data frames and simple visualizations.
  5. In addition to what we saw in the notebook, there’s also the possibility of directly embedding R code into the ML Studio workflow as a cell.

Hope you liked this little tour of ML Studio. I enjoyed playing around with it!