# Exploring SVM Kernels

Support Vector Machines (SVMs) are a popular option for data scientists who want to explore and model high-dimensional data. Although they scale poorly to large datasets, they’re popular for prototyping different kinds of classifiers for systems with large numbers of variables. At the core of the SVM is a kernel function, which implicitly maps the feature space to a higher-dimensional feature space. So if the classes are not separable in the original (lower-dimensional) feature space, we may still find a separating function in the higher-dimensional space, and use it as a classifier.
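A minimal sketch of what “implicit mapping” means, under stated assumptions: it uses a quadratic feature map with a $\sqrt{2}$ scaling on the cross term (an assumption made so that the identity holds exactly), and the names `phi`, `dot` and `poly2_kernel` are hypothetical helpers, not code from the notebook. It checks numerically that the kernel evaluated in 2D equals an inner product in the lifted 3D space.

```python
import math

def phi(x):
    """Explicit quadratic feature map from 2D to 3D.
    The sqrt(2) factor on the cross term is an assumption of this sketch."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lifted = dot(phi(x), phi(y))   # inner product after mapping to 3D
direct = poly2_kernel(x, y)    # same quantity without ever leaving 2D
print(math.isclose(lifted, direct))
```

The point of the design is that `poly2_kernel` never builds the 3D vectors, which is what lets an SVM work in spaces too large to construct explicitly.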

In this Jupyter notebook I’ve explored a couple of different types of kernel functions for bivariate, two-class data, where an SVM is used to separate the classes. Since these classes are not linearly separable in the original space, kernel functions let us look for separating hyperplanes in a transformed space instead. What’s interesting to note is that the convex hulls (in this case, polygons) of the two classes overlap in the 2D space. Two classes are linearly separable only if their convex hulls do not overlap, so this is a clear indicator of a lack of linear separability.
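The convex hull argument can be sketched in a few lines. This is an illustration on a made-up dataset (an inner cluster surrounded by an outer ring), not the notebook’s data, and the helper names are hypothetical. It uses a simplified test: if any point of one class lies inside the hull of the other class, the hulls overlap. That condition is sufficient but not necessary, since hulls can also cross without containing each other’s points.

```python
import math

def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_convex(hull, p):
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

# Hypothetical two-class data: an outer ring (radius 3) around an inner cluster (radius 0.5).
ring = [(3 * math.cos(2 * math.pi * k / 12), 3 * math.sin(2 * math.pi * k / 12))
        for k in range(12)]
blob = [(0.5 * math.cos(2 * math.pi * k / 6), 0.5 * math.sin(2 * math.pi * k / 6))
        for k in range(6)]

ring_hull = convex_hull(ring)
overlap = any(inside_convex(ring_hull, p) for p in blob)
print("hulls overlap:", overlap)
```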

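Using a kernel opens up the possibility of linear separability, because the points get distributed along additional dimensions. Two mappings are explored here: an explicit quadratic feature map $\phi$, which acts on the two coordinates of a single point, and a kernel $K$, which acts on a pair of points (so $x_{1}$ and $x_{2}$ denote coordinates in the first formula and points in the second):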

$\phi(x_{1}, x_{2}) = (x_{1}^2, x_{1}x_{2}, x_{2}^2)^{T}$

$K(x_{1}, x_{2}) = a e^{{-\frac{1}{b} ||x_{1} - x_{2}||^{2} }}$

The latter is called the radial basis function kernel, or the RBF kernel; $a$ scales the kernel and $b$ controls its width. Visualizing this kernel for the data we’d generated gives us the following nice image. What’s easily visible here is the possibility of separating out the classes thanks to the additional RBF dimension that has now been added.
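The extra dimension can be sketched as follows. The assumptions here are loud ones: the RBF is evaluated against a single landmark at the origin, with $a = b = 1$, on a made-up dataset of an inner cluster and an outer ring. None of these choices come from the notebook, and the names are hypothetical.

```python
import math

def rbf(x, c, a=1.0, b=1.0):
    """K(x, c) = a * exp(-||x - c||^2 / b), following the formula above."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return a * math.exp(-sq_dist / b)

center = (0.0, 0.0)  # assumed landmark; not taken from the notebook

# Hypothetical data: class 0 near the origin, class 1 on a ring of radius 3.
inner = [(0.5 * math.cos(2 * math.pi * k / 8), 0.5 * math.sin(2 * math.pi * k / 8))
         for k in range(8)]
outer = [(3 * math.cos(2 * math.pi * k / 16), 3 * math.sin(2 * math.pi * k / 16))
         for k in range(16)]

# The third coordinate of each point is its RBF similarity to the landmark.
inner_z = [rbf(p, center) for p in inner]
outer_z = [rbf(p, center) for p in outer]

# A horizontal plane z = threshold separates the classes in the lifted space.
threshold = (min(inner_z) + max(outer_z)) / 2
separable = min(inner_z) > threshold > max(outer_z)
print("separable by a plane in the lifted space:", separable)
```

In the lifted space the inner cluster sits high on the RBF axis and the ring sits close to zero, so a flat plane is enough to split them.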

Upon training the SVM classifier, visualizing the results gives us the below plot. The thick grey line is the decision boundary that enables us to separate the originally linearly inseparable classes in the dataset.
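A hedged sketch of the training step, assuming scikit-learn’s `SVC` (the notebook’s actual code and data are not shown here, so the dataset and hyperparameters below are made up for illustration). In scikit-learn the RBF kernel is parameterized as $e^{-\gamma ||x_{1} - x_{2}||^{2}}$, so `gamma` plays the role of $1/b$ with $a = 1$.

```python
import math
from sklearn.svm import SVC

# Hypothetical data: class 0 near the origin, class 1 on a ring of radius 3.
inner = [(0.5 * math.cos(2 * math.pi * k / 8), 0.5 * math.sin(2 * math.pi * k / 8))
         for k in range(8)]
outer = [(3 * math.cos(2 * math.pi * k / 16), 3 * math.sin(2 * math.pi * k / 16))
         for k in range(16)]
X = inner + outer
y = [0] * len(inner) + [1] * len(outer)

# A linear kernel cannot separate a cluster from the ring that surrounds it.
linear_clf = SVC(kernel="linear").fit(X, y)
# An RBF kernel can; gamma=1.0 is an arbitrary choice for this sketch.
rbf_clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)
print("linear training accuracy:", linear_acc)
print("rbf training accuracy:", rbf_acc)
```

The decision boundary in a plot like the one above is the zero level set of `rbf_clf.decision_function`, evaluated over a grid of points.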

There are other explorations I hope to add to this notebook in future, specifically the process of calculating the sign (class label) of a data point based on the Lagrangian, which brings us to the idea of the SVC being a maximal margin classifier. This formulation is also referred to as the dual problem of the SVM. For another post!
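As a pointer for that follow-up, the standard textbook form of the soft-margin dual and the decision rule it yields is sketched here. The symbols $\alpha_{i}$ (Lagrange multipliers), $y_{i} \in \{-1, +1\}$ (class labels), $C$ (penalty parameter) and $b$ (offset) are not from the notebook and are introduced only for this sketch.

$\max_{\alpha} \; \sum_{i} \alpha_{i} - \frac{1}{2} \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j}) \quad \text{subject to} \quad 0 \le \alpha_{i} \le C, \;\; \sum_{i} \alpha_{i} y_{i} = 0$

$\hat{y}(x) = \mathrm{sign}\left( \sum_{i} \alpha_{i} y_{i} K(x_{i}, x) + b \right)$

Only the training points with $\alpha_{i} > 0$ (the support vectors) contribute to the sum, and the data enter only through $K$, which is why the kernel can be swapped without changing the rest of the method.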

# Statistical Competence and Its Importance for Good Data Science Careers

In 2019, enterprises routinely begin analytics, data science and machine learning initiatives by invoking specific technologies from a very early stage. This tendency to put technology ahead of value sometimes extends to the analytics champions and managers who take up or lead data-intensive initiatives. While this may seem pragmatic at one level, it can lead to significant problems in ensuring successful outcomes from such initiatives and programs. In this post, I’ll address the three-pronged conundrum of statistical competence in the data science world, specifically in the context of data science consulting and services, and what it means for the careers of data science candidates now and in the future.

#### Hiring Statisticians: An Expert’s View

Kevin Gray is one of my connections on LinkedIn who posts insightful content on statistical analysis and related topics on a regular basis, including very good recommendations for books on various statistical and analytical techniques and methods. One of his recent posts was an article he’d authored titled “What to Look For in a Statistician” (the article, and my comment), which definitely resonated with my own experiences in hiring statistically competent engineers in different settings, such as data science and machine learning, between 2015 and today. In years past, I have had similar experiences when hiring competent product engineers and manufacturing engineers in data-intensive problem solving roles.

The importance of statistical thinking and statistical analysis in business problem solving cannot be overstated. However, even good advice that is canon, and that is well-acknowledged, often falls on deaf ears in the hyper-competitive data science job market. Both hiring managers and recruiters tend to emphasize keywords comprising the latest framework or approach, over the ability to think critically about problem statements, carefully architect systems, and rigorously apply statistical analysis and machine learning to real world problems while keeping considerations of explainability in mind.

#### The Three-Pronged Conundrum of Data Science Talent

Now you might ask why I say this, and what I really mean by it. The devil, as they say, is in the details. One essential problem with the wide proliferation of highly capable tools, frameworks and applications that can perform and automate statistical analysis of different kinds is the following three-pronged conundrum:

1. Lack of core statistical knowledge despite having a working knowledge of the practicum of advanced techniques: Most candidates in the data science job market who are deeply interested in building data science and ML applications have unfortunately not developed skills in the core statistical sciences and statistical reasoning. Since statistics is the foundation for machine learning and data science, this degrades the quality of projects and programs which have to rely on hiring such talent. When they prefer to use software to do most of or all the thinking for them, their own reasoning about the problem is rarely good enough to critically evaluate different statistical formulations for problems, because they think in very set and specific ways about problems thanks only to their familiarity with the tools.
2. Tools as an unfortunate substitute for statistical thinking: Solutions, services and consulting professionals in the data science and advanced analytics space, who have to bring their best statistical thinking to client-facing interactions, are unable to differentiate between competence in statistical thinking, and competence in a specific software tool or approach.
3. Model bloat and inexplicability: The use of heavy, general purpose approaches that rely on complex, less explainable models, rather than simpler models that are constructed upon a fuller understanding of the true dynamics of the problem.

These three sub-problems can derail even the best envisioned data science and machine learning initiatives in product / solution delivery firms, and in enterprises.

#### Some “Unsexy” Characteristics of Good Data Scientists

These are also not “sexy” problems – they’re earthy, multi-dimensional, real world problems that have many contributing factors, from business and how it is done, to the culture of education and the culture of software and solution development teams. Kevin Gray in his post touches upon attitudinal qualities for good statisticians, which could also be extended to data science leaders, data scientists and data engineers:

1. Integrity and honesty are important in data science – this is true especially in a world where personal data is being handled carelessly and sometimes gratuitously by many applications without heed to data protection and privacy, and when user data is taken for granted by many technology companies. This is not an easy expectation or evaluation point for hiring managers, since it is only long association with anyone which allows us to build a model of their integrity, and rarely does one effectively determine such an attribute in short interviews. What’s dismal about data science hiring sometimes, is the proliferation of candidate resumes which are full of fluff, and the tendency of candidates to not stand up to scrutiny on skills they identify as “key” or “core” skills.
2. Curiosity and a broad spectrum of interests – this cannot be overstated in the context of a consulting data science or machine learning expert. The more we’re aware of different mental models and theoretical frameworks of the world and the data we see in it, the better we’re able to reason starting from hypotheses about the data. By extension, we’re better able to identify the right statistical approaches for a problem when we start from and explore different such mental models. The book I’ve linked to here by Scott E. Page is a fantastic evaluation of different mental models. But with models come biases, to restate George E. P. Box’s famous quote, “All models are wrong, some models are useful”.
3. Checking for logical fallacies is key for data science reasoning – I would add to the critical thinking element mentioned in Kevin’s post, by saying that it behooves any thought leader such as a data science consultant to critically evaluate their own thinking by checking for logical fallacies. When overlooked, a benign piece of flawed reasoning can turn into a face-melting disaster. The best way to ensure this does not happen is to critically evaluate our ideas, notions and mental models.
4. Don’t develop one hammer, develop a tool box – Like experienced plumbers, carpenters or mechanics, the tools landscape of a data scientist today should not be one of quasi-religious fervor in promoting one technique at the cost of others, such as how deep learning has come to be promoted in some circles as a data science panacea. Instead, the effective data scientist is usually pragmatic in their approach. Like a tailor or carpenter who has to cut or join different materials with different instruments, data scientists today do not have the luxury of getting behind one comfortable model of thinking about their tool set and profession – and any attempt to do this can be construed as laziness (especially for the consulting data scientist) at best. While the customer is always right, there are times when the client can be wrong and it is at these times that they need the advice of a qualified statistician or data scientist. If there is one time when data scientists should not abandon their statistical thinking, it is this kind of a situation.

#### Concluding Remarks

To conclude, data scientists ought not to be seen as resources that take data, analyze it using pre-built tools, and write code to explain the data using pre-built libraries of various kinds. They’re not software jockeys who happen to know some statistics and have a handle on machine learning workflows. Data scientists’ work scope and emphases as industry professionals and consultants go way beyond these limited definitions. Data scientists are expected to be dynamic, statistically sound professionals who critically evaluate real world problems based on theories, data and evidence drawn from many sources and contexts, and progressively build a deeper understanding of these real world problems that leads to tangible value for their customers, be they businesses or the consumers of products. The sooner data scientists realize this, the better off they will be while charting out a truly successful and fulfilling data science career.