Emphasizing the Basics: Structured Data Science Mentoring

Data science, machine learning and AI are fast-growing fields, and the sheer volume of research is bursting at the seams. Every day, I see numerous references to interesting papers on my Twitter feed, thanks to arXiv daily and similar accounts. I also see papers explained with code, and references to ML products and systems in numerous contexts. Beyond a point, all of this is overwhelming for a professional who doesn’t have a specific focus area. Speaking pragmatically, the tree of knowledge is always bound to be vast: every human endeavour exhibits this kind of complexity as we spend more and more time exploring things, farming ideas and understanding new possibilities.

Data scientists are going to be at different levels of competence and may be differently placed to take on challenges they are asked to face – the role of the mentor (regardless of the type) is to systematically challenge the data scientist to discover new innate potential and develop such potential to increase their overall capabilities and effectiveness.

The Data Science Mentoring Challenge

Mentoring can be a hard task for this reason: a lot of people (understandably) gravitate towards complex models that are meant for specific purposes, without fully understanding the details and the exact mechanisms behind simpler machine learning and statistical modeling methods. The problem with this is twofold: a) reliance on libraries and frameworks with implementations that already exist, and b) inability to characterize, apply and explain common and simpler techniques in the context of actual real-world problem statements. Part of the problem here is the sensationalization of research in the media. Open research without borders is important and pivotal for speedy progress in technology areas like ML. But we’re also seeing a lot of misinformation, including sensationalization of advanced ML techniques, and when some of it gets parroted by professionals (some of whom may become hiring managers), we see the problem proliferating into the world of work as well. I’ve interviewed my fair share of individuals who understand, say, an LSTM unit’s different gates but aren’t comfortable explaining autocorrelation techniques or ARMA models. This gap probably stems from gaps in mentoring and coaching, which ideally should emphasize basics first.
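
To make the "basics" concrete, here is a minimal sketch of the kind of thing I would expect a candidate to be able to explain: the sample autocorrelation of a series, checked against a simulated AR(1) process (the simplest member of the ARMA family). The function names `sample_acf` and `simulate_ar1`, the coefficient 0.7 and the series length are assumptions chosen purely for illustration, and the code uses only the Python standard library.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    # Lag-0 autocovariance (times n); used to normalise every lag.
    denom = sum(c * c for c in centered)
    acf = []
    for lag in range(max_lag + 1):
        num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(num / denom)
    return acf


def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + e[t] with standard normal noise e[t]."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, n):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


# Hypothetical example: for a stationary AR(1) process, the theoretical
# autocorrelation at lag k is phi ** k, so the estimates should decay
# roughly geometrically.
series = simulate_ar1(phi=0.7, n=5000)
acf = sample_acf(series, max_lag=3)
for lag, value in enumerate(acf):
    print(f"lag {lag}: sample acf = {value:.3f}, theoretical = {0.7 ** lag:.3f}")
```

In day-to-day work a library routine (for example, the `acf` function in statsmodels) would be the sensible choice; the point of writing it out is that the mechanism fits in a dozen lines and can be explained line by line.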

I’d posit that the role of the mentor in data science has changed over the last several years, and most significantly in the last two. In the future, data and AI mentoring will look different from what it looked like in the past five years. This is because the nature of the job of a data scientist (or alternatively an AI/ML engineer) has also changed. Despite developments in Automated Machine Learning, we keep running into real-world situations where we require human expertise to get through data science and machine learning problems. This human expertise manifests in three processes: problem characterisation, problem formulation, and problem solving. We need real, human data scientists (not just an AutoML tool) to look beyond the obvious automations such as hyperparameter or architectural searches, to reason about the nuts and bolts of problems, to interpret the problem domain, and to reason about different kinds of hypotheses and how they make sense.

This makes the process of mentoring for data science different from what it was, in certain specific ways. For one thing, mentors today create the field of problems or opportunities that will exist tomorrow. Data scientists today experience an overload of information from many different sources, as can be expected. From arXiv and Springer papers and articles, to new research and code, new books and new frameworks and algorithms, there are plenty of things to learn on a daily basis. However, the broader skill set of the data scientist even today can be characterized into four key areas: basic, business, functional and frontier skills.

Broad Characterization of Skills for Modern Data Science
  1. Basic (technical) skills: There’s the need for a strong foundation that enables general effectiveness in a data science role. This includes good skills across statistical analysis fundamentals, leading into the key principles that enable statistical learning models to be built, and a sound understanding of the mechanisms behind common algorithms such as regression, tree algorithms, and search and optimization methods.
  2. Business skills: There is a strong need for data scientists who can reason about business processes and systems, and understand how data may be generated, how it may flow, and what insights may be required of it. Not only is this a key skill for having fruitful interactions with clients and stakeholders, but it is also important for narrowing down to the right level of depth for the job in terms of satisfaction and effectiveness.
  3. Functional skills: There’s the need for effectiveness on the job, at a functional level, which includes not only technical competence at the statistical, mathematical and code levels, but also competence at the level of processes and good practices such as clean code, change management and reproducible research. One could also see more advanced machine learning and feature engineering techniques as being part of the functional skill set.
  4. Frontier skills: There’s research that’s expanding at multiple frontiers, which is hard for even experienced data scientists to keep up with, if they’re really interested in furthering their career beyond the obvious and evident challenges of day-to-day work.

Mentors: Different Levels

The role of mentorship has also become specialized in the last two years, which is, in my view, one of the changes most representative of the maturation of the field of data science. Mentors today can be at different levels of skill and still add value to different kinds of data science and analytics roles. For the sake of this discussion, I’d classify mentors today into two kinds – the “breadth mentor” and the “depth mentor”. While both kinds of mentors possess certain common skills, especially on the interpersonal communication front, they may have different approaches to technical, functional and research level mentoring.

The breadth mentor is an individual with plenty of experience in data science, perhaps in a consulting setting, who can provide generally correct advice to data scientists on the development of broad skill sets, ranging from basic statistical analysis to advanced algorithms. The focus of the mentoring here is on developing a well-rounded data scientist, rather than an expert in a specific field.

The depth mentor, by contrast, is someone who has deep experience in a specific area of industry or technology, and in bringing this field together with data science. Examples of this kind of mentor would be an NLP researcher, or a researcher in the field of robotics, both of whom may be expert practitioners of data science methods in their specific areas, but without the broader knowledge of consultative data science methods.

Depending on the needs of the business and the data scientists in question, the appropriate kind of mentor has to be chosen – and this shouldn’t be done lightly. For example, bringing a breadth mentor to an AI product firm may have some advantages, but if the firm is solving problems in a specific space, this may not work out so well. Similarly, bringing a depth mentor to a consulting firm can help grow a specific practice (or a new one) but may not benefit the broader data science efforts across different business domains there.

Structuring Mentorship in Data Science

Mentors (and hiring managers) in general should emphasize the importance of the basic skills listed above. In my view, when a data science candidate has the correct understanding of the essential basic statistical ideas and common algorithms, it becomes a lot easier for them to grasp more advanced ideas such as in deep learning, when this is required. Mentors can build better basic skills in data scientists by challenging their technical acumen.

Mentors should also emphasize business skills where relevant, and where the emphasis is on research, they should emphasize some of the frontier skills as well. Mentors in this context are expected to challenge the data scientist with relevant questions, and encourage a habit of systematically breaking down problems and asking the right questions. These business skills are important all the way up to solution architect roles and management, when crucial decisions have to be taken and hard questions will need to be asked often. Mentors can build better business skills in data scientists by challenging their problem understanding and characterisation.

Functional skills are important for effectiveness on the job. It is not okay for data scientists to theoretically understand a specific subject area, only to find themselves handicapped when asked to build a machine learning pipeline. Therefore functional skill mentoring is about challenging the data scientist on problem solving effectiveness.

Finally, frontier skills development depends on both the organizational or research context, and the data scientist’s interests. Mentors can provide helpful markers to enable the exploration of ideas, while emphasizing value from the research, and asking questions that keep the data science researcher on track. The challenge the mentor can pose here is differentiated solution value and originality.

The Importance of Emphasizing the Basics

This brings me to the importance of emphasizing the basics. I see numerous individuals getting into data science and machine learning who are interested in getting right to the latest and greatest algorithms. For a while now, and this has been a trend on LinkedIn and Twitter, budding data science aspirants have posted some of their work, where it involves the development of simple scripts or programs around computer vision, translation and similar problem statements. This gives a lot of their audience the impression that they are skilled not only at the techniques demonstrated, but at different kinds of data science problem statements as well. My own suggestion to data science aspirants is to expect pressure to demonstrate more involved skills: not merely the ability to use pre-built libraries to solve problems, but the ability to apply one’s own basic skill sets in statistical learning, and perhaps to build such algorithms and systems from scratch. This kind of deeper skill is what separates the wheat from the chaff in data science.
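
As a small illustration of what "from scratch" can mean at the basic level, here is a sketch of simple linear regression fitted by closed-form ordinary least squares, without any libraries. The function name `fit_simple_linear_regression` and the toy data are assumptions made for this example; it is not meant as a replacement for a tested library implementation.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope = {slope}, intercept = {intercept}")
```

Being able to derive where `sxy / sxx` comes from, and to say what happens when the x values have no spread, is the kind of understanding a mentor can probe for.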

Concluding Remarks

In conclusion, I believe the mentor’s role in data science has changed. Mentors today have their work cut out when it comes to building deeper skill in their data scientists: they should emphasise technical acumen first and foremost, problem understanding and characterisation next, and problem solving effectiveness after this. This builds up a well-layered skill set in which the different skills work together in harmony, resulting in true value to the data science market.

Getting Data Science Work Done Remotely

Given the current Covid-19 crisis, which has led to massive disruptions to how we work, communicate and collaborate, there is an understandable interest in being able to do data science work remotely and effectively. In a sense, this capability has been brewing in the background: the data science talent crunch, experienced for several years before data science skill sets went mainstream, was an opportunity for companies to hire talent around the world and work remotely.

Despite this, there have been challenges in building excellent remotely managed teams in all technology sectors, including data science and AI- and ML-centric teams. For one thing, remote work involves a great deal of asynchronous work and many meetings, and it demands over-communication. Another aspect of remote work is the need for interpersonal interactions and relationship-building between peers and team members. These are undeniable aspects of what makes remote work in itself challenging – sometimes, phone calls are not enough, and video calls done professionally and effectively require discipline and commitment on the part of the participants. All asynchronous collaboration requires goal-orientation and timeliness in the execution of tasks. These are, of course, expectations that you may have of ideal employees and team members. Depending on the working style and approach of different individuals, you may get very different reactions from your team members to the same ground rules.

Making Remote Data Science Teams Effective

I can think of some ways in which we can enable better interactions and more productive data science teams across locations, time zones and in complex data science projects:

  1. Knowing your team members well. This piece of advice is not about data science; it is about just being a good team player or leader. There is no substitute for actually building great interpersonal relationships. Humans are not robots – professionals are people who are driven by meaning and reason, and who have motivations of many kinds, both personal and professional. All of us have challenges when listening to and taking feedback that requires us to change, whether we’re leading or contributing. Some of us may have quirks and oddities that make us interesting in some ways and annoying in other ways. There are some good ways to build such relationships remotely.
    1. Don’t skip the small talk. Ask about how they are doing when you start a video or audio call. A little small talk never goes to waste, especially at the beginning of a meeting. In these times where people may have vastly different levels of well-being, whether because of being affected by Covid-19 directly or otherwise being impacted because of it, it never hurts to ask.
    2. Empathize and make your team feel wanted and welcome. Ensure you empathize with them in case they come out and say that they’re having a tough time. It doesn’t help to be the “strong and silent” kind of individual when the person at the other end is communicating difficulties that they’re having. Video and audio calls require us to overcommunicate.
    3. Understand your team members’ habits and quirks. Share a joke or two, and understand how your team responds. Determine what makes them tick, and what kind of work assignments interest them.
  2. Setting ground rules and “team level agreements”. One of the biggest enablers of productivity in data science teams is knowing what tasks are meant for whom. In real-world data science teams, things may not be clear-cut when it comes to the broad span of tasks that data science team members need to do. Ground rules ease the situation and collaborative tools enable this to happen well. Tools like Teams or Slack are great at building contextual conversations. However, you can lose sight of the bigger picture here, because you’re following along multiple threads under a specific topic. What helps here is setting up Wikis, and doing effective stand-up calls.
    1. Wikis and how they can help: Team wikis help by consolidating the key information in one place. They’re great for teams who have been working a certain way in an office setting and have had to transition to remote teams, sometimes with new team members. They can provide a nice, section-wise summary of key tasks and elements of the work stream or the job in question. In the context of data science teams, Wikis can help in the following ways: a) by being project documentation, b) by instructing on specific tasks – be this environment setup for a Python task, PEP 8 guidelines, the class hierarchy for an application, a list of hypotheses to explore in statistical analysis, etc., or c) by being a repository of tribal knowledge about the solution you’re building. Wikis can be created by using documentation tools like Read the Docs, or even by Wiki servers. Where Wikis are too much work to do, simple documents (Google / Word / Confluence) can help. On Confluence, you can attract comments as well, and this can make things easier in some ways.
    2. Stand-up calls and doing these effectively. Stand-up meetings (typically these are 15 or 20 minutes long) are a great way to start the day and gain some momentum with respect to the features and solutions you’re building. They shouldn’t be long and drawn out, but should just focus on the key accomplishments and blockers. Add in a little sugar – use video and do a nice, positive team ritual. Ensure that you do round-robin updates, because this makes sure everyone is involved. When doing data science, you may get updates on how a certain analysis went, or whether new findings were made with respect to some data, or whether a model training step failed or succeeded in some way. All of these are useful points of exploration and problem solving for the day ahead. These initial discussions in the day can be the source of a new synergy – if you are a leader, you generally respond to some of these, or perhaps you bring ideas from the previous day to share and guide the team in a specific way. If you’re a contributor, you’re likely to make key notes of things that happened during your work day and share them in the next day’s stand-up call.
  3. Few meetings, but effective meetings. Having spent time in big corporations and in startups, I see a tendency on the part of managers and employees who come from large organizations to gravitate towards meetings to solve problems. The reality is that meetings are rarely effective in and of themselves, and better consensus can be built asynchronously in written communication that can be read, absorbed, digested and then responded to. Some things to keep in mind in remote teams:
    1. Data scientists and engineers need to write well. They should be able to articulate thoughts, ideas and complex constructs well, and should be able to respond with the appropriate amount of detail. Without this ability to think critically and express themselves in written form, the organization becomes meeting-driven, and can descend into chaos when these meetings sap team energy.
    2. Meetings should have clear outcomes. These outcomes should be decided at least five minutes before the end of the meeting. The agendas for meetings need to be clear before the start of the meeting, which isn’t often enough the case. Finally, the results of the meeting should be circulated over Teams/Slack/Email with clear expectations on the part of those involved.
  4. Remote-sourcing domain expertise. The availability of team members from across the world is a benefit of fully remote teams that companies haven’t fully realized yet. If you’re doing data science in the energy sector, for example, you may be interested in building a team with occasional input from an energy sector domain expert you may not have had access to before, because more people have opened up to consulting remotely. Such domain expertise is especially important in building effective teams across borders that understand the domain well enough to be effective as a data science delivery team in a specific industry. Technology, statistical analysis and communication skills together cannot solve a problem that also requires domain knowledge, and this gap can be filled effectively by sourcing such talent or expertise remotely.
  5. Working effectively without borders requires planning and effective documentation. Working across time zones can be hard at times – when we work in synchrony, communication can be easier. However, asynchronous communication and work will become increasingly important in the age of Covid-19 and beyond, as more and more teams become remote and distributed. Working remotely isn’t about being flexible time-wise; it is about being effective with your tasks. For example, you may be on a call at 9pm with your team that’s in a different time zone, and be retiring for the day soon after – this situation calls for planning your upcoming day and tasks well enough for you to be effective at solving the problems in front of you. This may be a challenge for those with small children and families, but there are probably ways to work around it all. The solution is not stretching long into the night to complete that task. This probably impedes personal health, productivity and other aspects of well-being, both personal and professional. Effective asynchronous work requires good planning and documentation.
  6. Leading across borders requires planning, patience and understanding. When you have team members in other time zones and are delegating tasks to them, you probably have to curb the enthusiasm to do things yourself, or to delegate to someone closer to home – this risks isolating those who work in remote time zones. If you find yourself doing this, consider that you may need to rethink how you are structuring your project and tasks.
  7. Pair programming and digital mentoring can be really effective. When your team members are trying to develop new skills and don’t have anyone to turn to for help, they can benefit greatly from pair programming sessions and digital mentoring. These are not new practices, and there are platforms available to do this across teams – but what’s important is to have regular communication to sort issues out as they happen, and help people correct course as soon as they need to do so. Pair programming enables specific and contextual feedback. Whether statistical analysis or application development, data science mentoring done remotely can be a big enabler of growth and technical accomplishment.
  8. Managing work environments well. The term “work environments” here doesn’t only refer to physical environments, such as the work-from-home setup we each may use, but also to the digital environment, which enables us to find the required information at the right time, or to construct new workflows as we need them. This extends to code and the virtual environments we code in. When we’re building continuous delivery pipelines that provide the required environment for running and testing code, we are meeting exactly this need. The tools are half the reason that highly effective teams are as effective as they are.
  9. “Gitting Good” – managing asynchronous data science work effectively. Managing an asynchronously updated code base requires your team to adopt the right ground rules and to work well within them on specific development branches of your code repository. A lot of product development firms have nailed the process of managing code on git or other version control systems, and indeed this applies to many data science teams as well, but this is as much a matter of individual and collective discipline as it is about systems and processes and Wiki pages. Sometimes, we need hard conversations to happen to bring teams on track – and in the context of code discipline, I have seen a fair few situations of this nature.
  10. Taking time to document (and to “RTFD”). Documentation is one of the more tedious tasks for most developers, and it can be a chore for data scientists as well. Good data science teams, however, build their solutions on top of excellent documentation, where key questions are answered and many elements of their problem solving approaches are documented well, with links to papers, journals and results where required. In cases where novel algorithms are written, these should not only be put together in modular and reusable ways, but also documented well, so that the solution is intelligible to the broader team and to clients. Reading the documentation is as important as writing it. When new team members come in, or people change assignments on a project, it is important to keep the documentation relevant and up to date.

Concluding Remarks

Naturally, these behaviors don’t all develop at once, or in a day – excellence takes time, persistence and diligence. There are many teams out there that are doing lots of incredible work, fully remotely, in the data science space. I hope that some of the above ideas make sense to your team and that, if nothing else, this post made you think about how your team is currently performing and how to amp up its performance!