Emphasizing the Basics: Structured Data Science Mentoring

Data science, machine learning and AI are burgeoning fields, with research output growing faster than anyone can reasonably follow. Every day, my Twitter feed surfaces references to interesting papers, thanks to accounts such as Arxiv Daily, alongside papers explained with code and references to ML products and systems in numerous contexts. Beyond a point, this is overwhelming for a professional who doesn't have a specific focus area. Pragmatically speaking, the tree of knowledge is always bound to be vast: every human endeavour exhibits this kind of complexity as we spend more and more time exploring it, farming ideas and uncovering new possibilities.

Data scientists will be at different levels of competence and may be differently placed to take on the challenges they face. The role of the mentor (regardless of type) is to systematically challenge the data scientist to discover latent potential and to develop it, increasing their overall capability and effectiveness.

The Data Science Mentoring Challenge

Mentoring can be a hard task for this reason: a lot of people (understandably) gravitate towards complex models meant for specific purposes, without fully understanding the details and exact mechanisms behind simpler machine learning and statistical modeling methods. The problem with this is twofold: a) reliance on libraries and frameworks with ready-made implementations, and b) an inability to characterize, apply and explain common, simpler techniques on actual real-world problem statements. Part of the problem here is the sensationalization of research in the media. Open research without borders is pivotal for speedy progress in technology areas like ML, but we're also seeing a lot of misinformation, including sensationalized accounts of advanced ML techniques, and when some of it gets parroted by professionals (some of whom may become hiring managers), the problem proliferates into the world of work as well. I've interviewed my fair share of individuals who understand, say, the different gates of an LSTM unit but aren't comfortable explaining autocorrelation techniques or ARMA models. This gap probably stems from gaps in mentoring and coaching, which should emphasize the basics first.

I'd posit that the role of the mentor in data science has changed over the last several years, and most significantly in the last two. Data and AI mentoring in the future will look different from what it looked like in the past five years, because the nature of the data scientist's job (or, alternatively, the AI/ML engineer's) has also changed. Despite developments in Automated Machine Learning, we're inundated with real-world situations that require human expertise to get through data science and machine learning problems. This expertise manifests in three processes: problem characterisation, problem formulation and problem solving. We need real, human data scientists (not just an AutoML tool) to look beyond the obvious automations, such as hyperparameter or architecture search, to reason about the nuts and bolts of problems, interpret the problem domain, and weigh different kinds of hypotheses and how they make sense.

This makes the process of mentoring for data science different from what it used to be, in certain specific ways. For one thing, mentors today create the field of problems and opportunities that will exist tomorrow. Data scientists today experience an expected overload of information from many sources: Arxiv and Springer papers and articles, new research and code, new books, new frameworks and algorithms. There is plenty to learn on a daily basis. However, the broader skill set of the data scientist can still be characterized into four key areas: technical, business, functional and frontier skills.

Broad Characterization of Skills for Modern Data Science
  1. Technical skills: There's the need for a strong foundation that enables general effectiveness in a data science role. This includes solid skills across the fundamentals of statistical analysis, leading into the key principles that enable statistical learning models to be built, and a sound understanding of the mechanisms behind common algorithms such as regression, tree-based algorithms, and search and optimization methods.
  2. Business skills: There is a strong need for data scientists who can reason about business processes and systems, and understand how data may be generated, how it may flow, and what insights may be required of it. Not only is this a key skill for fruitful interactions with clients and stakeholders, it is also important for narrowing down to the right level of depth for the job, in terms of satisfaction and effectiveness.
  3. Functional skills: There’s the need for effectiveness on the job, at a functional level, which not only includes technical competence at the statistical, mathematical and code levels, but at the level of processes and good practices such as clean code, change management and reproducible research. One could also see more advanced machine learning and feature engineering techniques as being part of the functional skill set.
  4. Frontier skills: Research is expanding at multiple frontiers, and it is hard for even experienced data scientists to keep up. Keeping up matters, though, for those who want to further their careers beyond the obvious and evident challenges of day-to-day work.

Mentors: Different Levels

The role of mentorship has also become specialized in the last two years, which is, in my view, one of the changes most representative of the maturation of the field of data science. Mentors today can be at different levels of skill and still add value to different kinds of data science and analytics roles. For the sake of this discussion, I’d classify mentors today into two kinds – the “breadth mentor” and the “depth mentor”. While both kinds of mentors possess certain common skills, especially on the interpersonal communication front, they may have different approaches to technical, functional and research level mentoring.

The breadth mentor is an individual with plenty of experience in data science, perhaps in a consulting setting, who can provide generally sound advice to data scientists on developing broad skill sets, from basic statistical analysis to advanced algorithms. The aim of the mentoring here is to develop a well-rounded data scientist, rather than an expert in a specific field.

The depth mentor, by contrast, is someone with deep experience in a specific area of industry or technology, and deep experience in bringing that field together with data science. Examples would be an NLP researcher, or a researcher in robotics, both of whom may be expert practitioners of data science methods in their specific areas, but without the broader knowledge of consultative data science methods.

Depending on the needs of the business and the data scientists in question, the appropriate kind of mentor has to be chosen – and this shouldn’t be done lightly. For example, bringing a breadth mentor to an AI product firm may have some advantages, but if the firm is solving problems in a specific space, this may not work out so well. Similarly, bringing a depth mentor to a consulting firm can help grow a specific practice (or a new one) but may not benefit the broader data science efforts across different business domains there.

Structuring Mentorship in Data Science

Mentors (and hiring managers) in general should emphasize the importance of the technical skills listed above. In my view, when a data science candidate has a correct understanding of the essential statistical ideas and common algorithms, it becomes a lot easier for them to grasp more advanced ideas, such as those in deep learning, when required. Mentors can build better technical skills in data scientists by challenging their technical acumen.

Mentors should also emphasize business skills where relevant, and where the emphasis is on research, they should emphasize some of the frontier skills as well. Mentors in this context are expected to challenge the data scientist with relevant questions, and to encourage a habit of systematically breaking down problems and asking the right questions. These business skills matter all the way up to solution architect and management roles, where crucial decisions have to be taken and hard questions asked often. Mentors can build better business skills in data scientists by challenging their problem understanding and characterisation.

Functional skills are important for effectiveness on the job. It isn't enough for data scientists to understand a subject area theoretically, only to find themselves handicapped when asked to build a machine learning pipeline. Functional skill mentoring is therefore about challenging the data scientist on problem-solving effectiveness.

Finally, frontier skills development depends on both the organizational or research context and the data scientist's interests. Mentors can provide helpful markers to enable the exploration of ideas, while emphasizing value from the research and asking questions that keep the data science researcher on track. The challenge the mentor can pose here concerns differentiated solution value and originality.

The Importance of Emphasizing the Basics

This brings me to the importance of emphasizing the basics. I see numerous individuals getting into data science and machine learning who want to jump straight to the latest and greatest algorithms. For a while now – and this has been a trend on LinkedIn and Twitter – budding data science aspirants have been posting work that involves simple scripts or programs around computer vision, translation and similar problem statements, giving much of their audience the impression that they are skilled not only at the techniques demonstrated, but at other kinds of data science problem statements as well. My suggestion to data science aspirants is that they will eventually be under pressure to demonstrate more involved skills: not merely the ability to use pre-built libraries to solve problems, but the ability to build such algorithms and systems from scratch using their own grounding in statistical learning. This kind of deeper skill is what separates the wheat from the chaff in data science.

Concluding Remarks

In conclusion, I believe the mentor's role in data science has changed. Mentors today have their task cut out for them when it comes to building deeper skill in their data scientists: they should emphasise technical acumen first and foremost, problem understanding and characterisation next, and problem-solving effectiveness after that. This builds a well-layered skill set in which the different skills work in harmony, resulting in true value to the data science market.

Exploring SVM Kernels

Support Vector Machines are a popular option for data scientists wanting to explore and model higher dimensional data. Despite their lack of scalability, they're popular for prototyping different kinds of classifiers for systems with large numbers of variables. At the core of the SVM is the kernel function, which enables a mapping of the feature space to a higher dimensional feature space. If we're unable to find separability between classes in the (lower dimensional) feature space, we may be able to find a separating function in the higher dimensional space, which can then be used as a classifier.

Two classes of data in the R^2 space

In this Jupyter notebook I’ve explored a couple of different types of kernel functions for bivariate, two-class data, where an SVM is being used to separate out these classes. Since these classes are not linearly separable, the use of kernel functions here enables us to find the best possible hyperplanes that can solve the separability problem. What’s interesting to note is that the convex hulls (in this case, polygons) for these classes are overlapping in the 2D space. This is a clear indicator of a lack of linear separability.

The blue polygon here represents the convex hull of one class, which is nested inside the other class's hull in this data representation

The use of a kernel opens up the possibility of linear separability, since we add an additional spatial dimension on which these points get distributed. Specifically here we have two different kernel functions that are explored:

\phi(x_{1}, x_{2}) = (x_{1}^2, x_{1}x_{2}, x_{2}^2)^{T}

K(x_{1}, x_{2}) = a e^{{-\frac{1}{b} ||x_{1} - x_{2}||^{2} }}

The latter is called the radial basis function kernel, or the RBF kernel. Visualizing this kernel for the data we'd generated gives us the following image. What's easily visible here is the possibility of separating out the classes thanks to the additional RBF dimension that has now been added.
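As a rough sketch of what these two functions might look like in code (the data frame df, the centroid used as a reference point, and the constants a and b are all assumptions for illustration; the notebook itself may compute the extra dimension differently):

#Explicit feature map phi: lifts a 2D point into three dimensions
phi <- function(x1, x2) c(x1^2, x1 * x2, x2^2)

#RBF kernel between two points p and q, with scaling constants a and b (assumed values)
rbf_kernel <- function(p, q, a = 1, b = 1) {
  a * exp(-(1 / b) * sum((p - q)^2))
}

#Adding an "RBF dimension" to a hypothetical two-column data frame df,
#measured against the centroid of the data (one possible reference point)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
centroid <- colMeans(df)
df$rbf <- apply(df, 1, function(row) rbf_kernel(row, centroid))
head(df)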

RBF (vertical axis) enables separation of the two classes, in blue and yellow.
Note: Low opacity used for better visibility
Decision boundary identified by the SVM (which uses the RBF kernel)

Upon training the SVM classifier, visualizing the results gives us the below plot. The thick grey line is the decision boundary that enables us to separate the originally linearly inseparable classes in the dataset.
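A minimal sketch of this training workflow in R, using the e1071 package (the synthetic two-class data below is an assumed stand-in for the notebook's dataset, which may well use a different stack):

library(e1071)

#Synthetic, non-linearly-separable two-class data: two concentric radial bands
set.seed(1)
n <- 200
r <- c(runif(n/2, 0, 1), runif(n/2, 1.5, 2.5))
theta <- runif(n, 0, 2 * pi)
df <- data.frame(x1 = r * cos(theta), x2 = r * sin(theta),
                 class = factor(rep(c("A", "B"), each = n/2)))

#SVM with an RBF kernel; gamma plays the role of 1/b in the kernel above
model <- svm(class ~ x1 + x2, data = df, kernel = "radial", cost = 1, gamma = 1)

#Decision boundary plot and training accuracy
plot(model, df)
mean(predict(model, df) == df$class)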

There are other explorations I hope to do in this notebook in future, specifically the process of calculating the sign (class label) of new data points based on the Lagrangian – which brings us to the idea of the SVC as a maximal margin classifier, and to what is referred to as the dual problem of the SVM. For another post!

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still stand to qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to elucidate the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say that I found Michael Koelbl's answer to What are all the different types of data scientists? quite interesting, and thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they're not statistically or analytically inclined. They focus on exploratory data analysis, reporting based on newly created measures, and graphs and charts built on them, and they ask questions around these exploratory analyses. They're excellent at storytelling, asking questions of the data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, trying to build ML models of one kind or another from the data. They're not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization, and they span the range from BI tools to coded-up scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Perhaps statisticians are the rarest breed of the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and DOE, to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.


Data and Strategy for Small and Medium Organizations

Data analytics and statistics aren't historically associated with the strategic decisions that leaders take in small and medium sized businesses. Data analytics has for some years been used in larger organizations, and organizations with large user bases benefit further from big data that drives consumer and business insight in decision making. However, smaller businesses can also benefit from the large volumes of data now being collected, including from public databases. Most decisions in traditional businesses, and in small and medium businesses, are still taken by leaders who at best have a pulse of the market and domain knowledge of the business they're in, but who aren't using the data at their disposal to create mathematical models and strategies derived from them.

When does data fit into strategy?

To answer this, we need to understand the purpose of strategy and strategic initiatives themselves. In small and medium organizations, the purpose of strategic initiatives, especially mid- and short-term strategies, is to enable growth. Larger organizations have the benefit of extensive user bases, consumer bases or resources, which they can use to develop, test, validate and release new products and services. Smaller and medium sized organizations, by contrast, frame their strategic initiatives with a focus that tends to be limited to the near term and to maintaining good financial performance. Small and medium organizations in modern economies also seek to maintain leverage and a consumer base that is dedicated and loyal to their product or service journey. The latter is especially true of niche product companies, because they sell lifestyles, not merely products.

In this context, data fits into strategy in the following key ways:

  1. Descriptive data analytics allows strategists and leaders to question underlying assumptions of existing strategies
  2. Data visualizations allow strategists to classify and rank opportunities and have more cost and time efficient strategies
  3. Inferential data analytics, predictive analytics and simulations allow strategists to play out scenarios, and take a peek into the future of the business

Descriptive data analytics may work with public data, or data already available with the organization. It could be composed of statistical reports, illustrating the growth in demand, or market size, or certain broader trends and patterns in consumption, or demand, for a certain product, service, or opportunity. Descriptive analytics is easy enough to do, and doesn’t involve complex modeling usually. It is a good entry point for strategists that hope to become more data driven in the development of their strategies.

Data visualizations, in addition to being communication tools that give strategists leverage, can also throw light on the functional aspects of which opportunities to seek out and which strategies to develop. They can help strategists make connections and see relationships that would otherwise not be apparent. Data visualization has become easier and more affordable thanks to powerful, free software such as R and RStudio. Visualizations are extremely effective as communication and ideation tools, and for strategists who want to mature beyond descriptive statistics in developing their strategies, they can be valuable.

Inferential data analytics leverages the predictive power of mathematical and statistical models. By representing what is common knowledge as a mathematical model, we can apply it to diverse situations and throw new light on problems we haven't previously evaluated in a scientific or data driven manner. Inferential analytics generally requires individuals with experience as data scientists: these models call for a good understanding of both basic and inferential statistics, and can therefore be more complex to incorporate into data based strategy models. While descriptive analytics and visualizations may not be driven by advanced algorithms such as neural networks or machine learning, advanced and inferential analytics certainly can be.

Data for Short and Medium Term Strategy

The data analysis that informs short term strategy and the analysis that informs medium term strategy are fundamentally different. Short term strategy, which focuses on the immediate near term of a business, generally seeks to inform operational teams on how they should act. This may be a set of simple rules used to run the rudiments of the business on a day-to-day basis. Why use data to drive the regular activities of businesses for which extensive procedures may already be in place? Because keeping one's ear to the ground – collecting customer and market information on an ongoing basis – is extremely important for most businesses in today's competitive environment. Continual improvement and quality are fundamental to a wide variety of businesses, and data that informs the short term is therefore extremely important.

Data analytics in the short term doesn't rely on extensive analysis, but on keeping abreast of information and the trends and patterns we see in it on a day-to-day basis. Approaches relevant to short term strategy may include:

  1. Dashboards and real time information streams
  2. Automatically generated reports that give operations leaders or general managers a pulse of the market, or a pulse of the business
  3. Sample data analysis (small data, as opposed to big data), that informs managers and teams about the ongoing status of a specific process or product – this is similar to quality management systems in use in various companies small and large

Data analytics in the mid term strategy space is quite a different situation, being required to inform strategists about the impact of changing market scenarios on a future product or service launch. The data analysis here should seek to serve the strategists’ need to be informed about served and total addressable markets, competitive space, penetration and market share expectations, and such business-specific criteria that help fund, finance or prioritize the development of new products or services.

Accordingly, data analytics in a mid-term strategy space (also called Horizon 2 strategy) may involve more involved analysis, typically by data scientists. Tools and themes of analysis may be things like:

  1. Consumer sentiment analysis to determine the relevance of a particular product or service
  2. Patents and intellectual property data munging, classification and text mining for category analysis
  3. Competitor analysis by automated searches, classification algorithms, risk analysis by dynamic analytical hierarchy processes
  4. Scenario analysis and simulation, driven by methods such as Markov Chain Monte Carlo analysis

Observe how the analyses above are distinct from the more readily available information shared with operational teams. The data analytics activities here generally require rigorous analysis of data, not merely the collection and presentation of data that fits a certain definition. When data is unstructured, and when the work requires cleaning and visualizing data and building models from a starting point such as public data, the task is much more challenging. This is where the skills of well trained data scientists and data analysts are essential.

Data for Long Term Strategy

One narrative that has made itself known through data in the world of business is that the long term, as it was traditionally known, is shrinking. Even S&P 500 companies are conspicuous these days for how short-lived they are, and small and medium companies are no exception. Successful tech companies boast product and service development cycles of a few months to a year, and thanks to this pace of innovation the technology landscape can look unrecognizable within a few months. However, there is probably a method to even this madness. The scale and openness of access have made the consumer and end user powerful, and consumers today can do things with free resources and tools that could only be imagined a few years ago.

Data informs strategists in such longer term strategic scenarios, typically, five years or more, by helping construct scenarios. Data analytics in scenario planning should account for the following:

  1. Dynamic trends in the increase of velocity in information/data being collected (Velocity, out of the four Vs)
  2. Dynamic changes in the type of information being collected (Variety, out of the four Vs)
  3. Dynamic changes in the reliability of information being collected (Veracity out of the four Vs)

Volume, the remaining V of the four, is a static measure of the data collected at specific points in time. The three dynamics above go beyond volume: they represent the growing speed, variety and potential unreliability of the data available to us.

Data analysis of a more simple nature can be used for some of the analysis above, while for specific approaches such as scenario analysis, sophisticated mathematical models can be used. In small and medium organizations, where the focus is usually on the short term, and at best on the mid term, data analytics can help inform executives about the long term and keep that conversation going. It is easy in smaller organizations to fall into the trap of not preparing for the long term. In the mid and long term, more advanced methods can be used to guide and inform the organization’s vision.

Concluding Remarks

Data analytics as applied to strategy is not entirely new, with many mature organizations already working on it. For small and medium businesses, which are mushrooming in a big way around the developed and developing world these days, data analytics is a force multiplier for strategic decision making and for leaders. Data analytics can reveal information we have hitherto believed to only be the preserve of large organizations who can collect data on an unprecedented scale and hire expert teams to analyze them. What makes analytics relevant to small and medium businesses today is that in our changing business landscape, we can expect analytics driven companies to respond in more agile ways to the needs of customers, and to excite customers in new ways, that traditional, less agile and larger organizations are not likely to do. The surfeit of mature data analysis tools and approaches available, combined with public data, can therefore make leaders and strategists in small and medium organizations more competitive.

Two Way ANOVA in R

Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

  • How do we decide which route on our daily commute from home to work is easier?
  • How would you know which air-conditioner to choose out of a bunch that you're evaluating across various climates?
  • If you were dealing with a bunch of suppliers, and wanted to compare their process results all at the same time, how would you do it?
  • If you had three competing designs for a system or an algorithm, and wanted to understand whether one of them was significantly better than the others, how would you establish that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier – and ANOVA is indeed a kind of linear model – the difference being that ANOVA applies where you have discrete factors whose effect on a continuous response variable you want to understand.

The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means estimated based on the sample means. What we’re trying to understand here is the effect of a change in the level of one factor, on the response. The term “Analysis of Variance” for ANOVA is therefore a misnomer for many new to inferential statistics, since it is a test that compares means.

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

H_0 : \mu_1 = \mu_2 = \mu_3

The alternative hypothesis H_a is that at least one of the means differs from the others, for example:

H_a : \mu_1 \neq \mu_2 = \mu_3, or
H_a : \mu_1 = \mu_2 \neq \mu_3, or
H_a : \mu_1 \neq \mu_2 \neq \mu_3

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.
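In model form (standard two-way ANOVA notation, not specific to this post), this can be written as:

y_{ijk} = \mu + \alpha_{i} + \beta_{j} + (\alpha\beta)_{ij} + \epsilon_{ijk}

where \alpha_i and \beta_j are the main effects of the two factors, (\alpha\beta)_{ij} is their interaction, and \epsilon_{ijk} is the random error term.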

Travel Time Problem

Let’s look at a simple data set which has travel time data organized by day of the week and route. Assume you’ve been monitoring data from many people travelling a certain route, between two points, and you’re trying to understand whether the time taken for the trip is more dependent on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of a problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.

Two Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous response variable (within each level of each factor) to be normally distributed. ANOVA results are also contingent on an equal-variance assumption across the samples being compared. I've demonstrated in an earlier post how normality and variance tests can be run prior to a hypothesis test on variable data.

The code below first pulls data from a data set into variables, and constructs a linear ANOVA model after the normality and variance tests. For normality testing, we’re using the Shapiro-Wilk test, and for variance testing, we’re using the bartlett.test() command here, which is used to compare multiple variances.

#Reading the dataset
Dataset <- read.csv("Traveldata.csv")
str(Dataset)

#Shapiro-Wilk normality tests of Time within each level of Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){
  cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}

#Shapiro-Wilk normality tests of Time within each level of Route
cat("\nNormality p-values by Factor Route: ")
for (i in unique(factor(Dataset$Route))){
  cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}

#Bartlett variance tests for the Day and Route factors
bartlett.test(Time~Day, data = Dataset)
bartlett.test(Time~Route, data = Dataset)

#Creating a linear model with both main effects and their interaction
l <- lm(Time ~ Day + Route + Day*Route, data = Dataset)
summary(l)

#ANOVA table (sums of squares, mean squares, F values) for the linear model
la <- anova(l)
la

#Cook's Distance for the linear model, and the standard lm diagnostic plots
plot(cooks.distance(l),
     main = "Cook's Distance for linear model", xlab =
       "Travel Time (observations)", ylab = "Cook's Distance")
plot(l)

Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

The code also calculates Cook's distance, which is an important concept in linear models. When trying to understand anomalous observations in the model, we can refer to Cook's distance to see whether those observations have high leverage in the model or not. Removing a point with high leverage could noticeably change the model results. Equally, if your model isn't performing well, it may be worth looking at Cook's distance.
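As a quick illustrative follow-up (the 4/n cutoff below is just one common rule of thumb, not a hard threshold), potentially influential observations can be flagged from the fitted model l like this:

#Flagging observations with unusually large Cook's distance,
#using the common 4/n rule of thumb (an assumption, not a strict rule)
cd <- cooks.distance(l)
influential <- which(cd > 4 / length(cd))
influential      #row indices of potentially influential observations
cd[influential]  #their Cook's distance values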

Cook’s Distance for our data set, visualized

Cook’s distance, explained by its importance to leverage in the model.

Looking at the graphs produced by lm, we can see how various points in the model have different values of Cook's distance, and we can gauge their relative leverage. This is also illustrated in the Normal Quantile-Quantile plot below, where observations #413 and #415, among others, have large values.

Normal QQ plot of data set showing high leverage points (large Cook’s Distance)

ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.333  -4.646   0.516   4.963  19.655 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .  
DayThu         2.69221    2.00280   1.344   0.1795    
DayTue        -0.43574    1.90618  -0.229   0.8193    
DayWed        -0.01149    2.00280  -0.006   0.9954    
RouteB        -1.02130    1.89302  -0.540   0.5898    
RouteC        -1.83131    1.85736  -0.986   0.3246    
DayMon:RouteB  2.91785    2.71791   1.074   0.2836    
DayThu:RouteB  0.39335    2.63352   0.149   0.8813    
DayTue:RouteB  3.44554    2.64247   1.304   0.1929    
DayWed:RouteB  1.23796    2.65761   0.466   0.6416    
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *  
DayThu:RouteC  0.24255    2.73148   0.089   0.9293    
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .  
DayWed:RouteC  1.95253    2.68823   0.726   0.4680    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679 
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291

While the model summary above indicates the effect of each factor level on the response, it doesn't present per-factor F-statistics, which are computed by taking into consideration the “within” and “between” variations in the samples of data. The ANOVA sum of squares and mean squares approach does exactly this, which is why its results are more relevant here. The summary below is the ANOVA model itself, in the ANOVA table:

Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value  Pr(>F)   
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376   
Day:Route   8   393.9  49.237  0.8779 0.53492   
Residuals 485 27201.3  56.085                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

By looking at the results above, it is clear that the Day factor has a statistically significant impact on the Travel time. (Upon closer observation, you’ll see that one of those means that was different from others, was the mean for Monday! This is clearly an example inspired by Monday rush hour traffic!)

When reading results from the lm and anova commands, it is important to note that R flags the results using significance codes. A small p-value indicates a significant result, and the relative significance of different terms is indicated by the symbols assigned to them: for instance, two asterisks (**) are used when the p-value is below 0.01, and three (***) when it is below 0.001. Depending on the nature of your experiment, you can choose your significance level and interpret the results by comparing these p-values against it. In this specific case, the Route factor appears to have an insignificant impact on the response.

Interactions

When we have two or more factors in a model acting as inputs to a response variable, we also need to evaluate whether changing both variables together causes a different effect on the response than fixing one and changing the other. This is referred to as an interaction, and interaction effects are taken into account in our model. Once again, the p-values for the interaction terms can inform us about the relative importance of different interactions.
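A quick way to visualize these interactions for the travel time data is base R's interaction.plot() function – a small sketch, assuming the Dataset read in earlier in this post:

#Mean travel time by Day, with one trace per Route;
#roughly parallel lines suggest a weak Day:Route interaction
with(Dataset,
     interaction.plot(x.factor = factor(Day), trace.factor = factor(Route),
                      response = Time, type = "b",
                      xlab = "Day", ylab = "Mean travel time",
                      trace.label = "Route"))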

Concluding Remarks

We've seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook's distance and its importance were also explained. It is important to make each of the data points within our data set count, and at the same time, to evaluate how well the model addresses what we actually want to understand. In this context, assessing the significance of our results (statistically and practically) is necessary. Fortunately, the linear modeling packages in R are very convenient for such work, and incorporate lots of added functionality, such as producing diagnostic plots simply by calling the plot command on a saved model. This functionality really comes into its own when you build and test many different models and want to compare results.

Animated: Mean and Sample Size

A quick experiment in R can unveil the impact of sample size on the estimates we make from data. A small number of samples provides us less information about the process or system from which we’re collecting data, while a large number can help ground our findings in near certainty. See the earlier post on sample size, confidence intervals and related topics on R Explorations.
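The same intuition can be seen numerically before animating it: the standard error of the mean, s/\sqrt{N}, shrinks as the sample size grows. A small sketch (the Gaussian parameters mirror those assumed in the animation code below):

set.seed(42)
mu <- 10
sigma <- 1
for (n in c(10, 50, 100, 500, 1000, 10000)) {
  x <- rnorm(n, mu, sigma)
  cat("N =", n,
      "| sample mean =", round(mean(x), 3),
      "| std. error of the mean =", round(sd(x) / sqrt(n), 4), "\n")
}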

Using the “animation” package once again, I’ve put together a simple animation to describe this.

#package containing the saveGIF function
library(animation)

#setting GIF options
ani.options(interval = 0.12, ani.width = 480, ani.height = 320)

#a function to plot repeated samples of a given size from N(mu, sigma)
plo <- function(samplesize, mu, sigma, iter = 100){
  
  for (i in seq(1,iter)){
    
    #Generating a sample from the normal distribution
    x <- rnorm(samplesize, mu, sigma)
    
    #Histogram of the sample as it is generated
    hist(x, main = paste("N = ",samplesize,", xbar = ",round(mean(x), digits = 2),
                         ", s = ",round(sd(x),digits = 2)), xlim = c(5,15), 
                        ylim = c(0,floor(samplesize/3)), breaks = seq(4,16,0.5), col = rgb(0.1,0.9,0.1,0.2), 
                        border = "grey", xlab = "x (Gaussian sample)")
    
    #Adding a line for the sample mean to the histogram
    abline(v = mean(x), col = "red", lwd = 2)
  }
}

#Setting the parameters of the distribution
mu <- 10.0
sigma <- 1.0

#One GIF per sample size
for (i in c(10,50,100,500,1000,10000)){
  saveGIF({plo(i, mu, sigma)}, movie.name = paste0("N=",i,", mu=",mu,", sd=",sigma,".gif"))
}

Animated Results

Very small sample size of 5. Observe how the sample mean line hunts wildly.

N= 10 , mu= 10 , sd= 1

A small sample size of 10. Mean once again moves around quite a bit.

N= 50 , mu= 10 , sd= 1

Moderate sample size of 50. Far less inconsistency in estimate (red line)

N= 100 , mu= 10 , sd= 1

A larger sample size, showing little deviation in sample mean over different samples

N= 1000 , mu= 10 , sd= 1

A large sample size, indicating minor variations in sample mean

N= 10000 , mu= 10 , sd= 1

Very large sample size (however, still smaller than many real world data sets!). Sample mean estimate barely changes over samples.

Animated Logistic Maps of Chaotic Systems in R

Linear systems respond predictably and proportionally to small changes in their inputs. Nonlinear systems are those that produce disproportionate results for proportional changes in their inputs. Both linear and nonlinear systems are common enough in nature and in industrial processes – or, more accurately, many industrial and natural processes can be usefully modeled as standard linear or nonlinear systems.

Nonlinear Dynamics and Chaos

Nonlinear dynamical systems are essentially systems whose behaviour evolves over time in a nonlinear manner. A special class of such systems also exhibits chaos, which is defined as sensitive dependence on initial conditions. There are great textbooks available on the subject, such as the one by Steven Strogatz (Cornell University, Ithaca, New York).

While R is often used to run statistical analyses and studies of various kinds, including advanced machine learning algorithms, it isn't often used to study systems like these. However, they are easily modeled in R, and its wealth of functions can help us understand the statistical aspects of such systems' behaviour. There's extensive material available on the internet, and Steven Strogatz's lectures on Nonlinear Dynamics and Chaos provide a very deep treatment of the subject.

Logistic Maps and Bifurcation Diagrams

A logistic map is a function that describes how the state of a particular dynamical system evolves from one point in time to the next. It allows us to understand bifurcations, and understand what kinds of conditions produce sensitive dependence on initial conditions. More on bifurcation diagrams here.
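For reference, the textbook logistic map is x_{n+1} = r x_{n} (1 - x_{n}); it is distinct from the probabilistic system explored later in this post, but it can be iterated in a few lines of R:

#Iterate the classic logistic map x[n+1] = r * x[n] * (1 - x[n])
logistic_map <- function(r, x0 = 0.5, n = 100){
  x <- numeric(n)
  x[1] <- x0
  for (i in 2:n) x[i] <- r * x[i-1] * (1 - x[i-1])
  x
}

#r = 2.9 settles to a fixed point; r = 3.9 lies in the chaotic regime
plot(logistic_map(2.9), type = "l", ylim = c(0,1), xlab = "Iteration", ylab = "x",
     main = "Logistic map: r = 2.9 (black) vs r = 3.9 (red)")
lines(logistic_map(3.9), col = "red")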

Typical logistic map (courtesy Wikipedia)

Nonlinear Dynamical System in R

The system I'll describe here is a probabilistic system based on the mechanics of the binomial distribution. This distribution is used to model a series of trials, each with two outcomes (success or failure) occurring with some probability. The single-trial special case, such as a coin toss, is the Bernoulli distribution.

In our example, we're trying to understand the probability of success in a repetitive or sequential, identical game with two outcomes, provided we know the initial chance of success. In doing so, we're exploring the impact of the number of games played in sequence, and the number of wins or losses we expect in each case. The end result is a matrix, which we call problemset in this specific case.

Animations in R

The R package “animation” has functions which can enable sequential graphics (such as that generated within a loop) to be saved as a GIF animation file. This is especially handy when we’re trying to understand the impact of parameters, or when we’re trying to illustrate the data, analysis and results in our work in some sequence. We’ll use the saveGIF() function here to do just such a thing – to save a sequence of images of logistic maps in succession, into a single GIF file.


library(animation)
#Set delay between frames when replaying
ani.options(interval=.05)

#Do our plots within the saveGIF command parentheses, in order to capture the matrix plots we're generating

saveGIF({
for (inval in seq(0,1,length.out = 200)){

pfirst <- inval 
#Binomial probability of k successes in n games, where p is read from the last
#column of the row passed in (so each iteration feeds on the previous one)
prob <- function(game){
  n <-game[1];
  k <-game[2];
  p <-game[length(game)];
  return( factorial(n) / (factorial(n-k) * factorial(k)) 
          * p^k * (1-p)^(n-k) );
}

iter <-100
k <- 2
games <- seq(2,100,1)
victories <- rep(5,length(games))
problemset <- cbind(games, victories, 
                    rep(pfirst, length(games)))

#Setting up a temporary variable to store the probability values per iteration
out<-NULL

for (i in seq(1,iter,1)){
  for (j in seq(1,length(problemset[,1]), 1)){
    out<-rbind(out,prob(problemset[j,]))
  }
  problemset <-cbind(problemset,out)
  out<-NULL
}

#Using the matrix plot function matplot() to plot the various columns in our result matrix, together

matplot(problemset[,seq(3,length(problemset[1,]), 1)], type = "l", lwd = 1, lty = "solid", 
        main = paste("Logistic Map with initial probability = ",round(pfirst,2)), ylab = "Probability", 
        xlab = "Number of games", ylim = c(0,0.5) )

}

})

The code above generates an animation (GIF file) in your default R working directory. It allows us to examine how different system parameters could affect the probability of events we’re evaluating in the sample space we have in this problem. Naturally, this changes depending on the number of games you indicate in the games variable in the code. The GIF files are shown at the end of this section – two different cases are shown.

Here’s a logistic map generated by a slightly modified version of the code above. This shows the calculated probabilities for different combinations of games, and won games, based on initial assumed win percentages. The initial assumed win percentage in this case is 0.1 or 10%, the horizontal line running through the graph.

Logistic map for an initial probability of 0.1

Logistic map for the dynamical system described above.

Longer animation with different parameters for k and a greater number of frames.

A number of systems can only be described well when we see their performance or behaviour in the time domain. Visualizing one of the parameters of any model we construct along the time domain therefore becomes useful to try and spot patterns and trends. Good data visualization, for simple and complex problems alike, isn’t only about static images and infographics, but dynamic displays and dashboards of data that change and show us the changing nature of the system being modeled or data being collected. Naturally, this is of high value when we’re putting together a data science practice in an organization.

Concluding Remarks

We've seen how animated logistic maps are generated in the example here, but equally, the animated plots could be of other systems we're trying to describe in the time domain, or systems we're trying to describe in an elaborate way, especially when many parameters are involved. Beyond the linear systems we're comfortable with, we also have to deal with nonlinear and dynamical systems in real life. Good examples are weather systems, stock markets, and the behaviours of many complex systems which have many variables and many interactions, yet often simple rules. Logistic maps can tell us, based on simplified representations of our data, about the regimes of chaos and order in the behaviour of the system, and what to expect. In this sense, they can be used in cases where it is known that we're dealing with nonlinear and dynamical systems.

Comparing Non-Normal Data Graphically and with Non-Parametric Tests

Not all data in this world is predictable in the exact same way, of course, and not all data can be modeled using the Gaussian distribution. At times we have to make comparisons using data that follows one of many other distributions – data that shows patterns other than the familiar and comforting “bell curve” of the normal distribution we're used to seeing in business presentations and the media alike. For instance, here's data from the Weibull distribution, plotted using different shape and scale parameters. A Weibull distribution has two parameters, shape and scale, which determine how it looks (which varies widely) and how spread out it is.


shape <- 1
scale <- 5
x<-rweibull(1000000,shape,scale)
hist(x, breaks = 1000, main = paste("Weibull Distribution with shape: ",shape,", and scale: ",scale))
abline (v = median(x), col = "blue")
abline (v = scale, col = "red")

Shape = 1; Scale = 5. The red line represents the scale value, and the blue line, the median of the data set.

Here’s data from a very different distribution, which has a scale parameter of 100.

Shape = 1; Scale = 100. Same number of points. The red and blue lines mean the same things here too.

The shape parameter, as can be seen clearly here, is called so for a good reason. Even when the scale parameter changes wildly (as in our two examples), the overall geometry of our data looks similar – but of course, it isn't the same. The change in the scale parameter changes the probability of events near the lower end of the range (close to zero) relative to events much further away from zero. When you superimpose these distributions and their medians, you get a very different picture of them.

If we have two very similar data sets like the data shown in the first graph and the data in the second, what kinds of hypothesis tests can we use? It is a pertinent question, because at times, we may not know that a data set may represent a process that can be modeled by a specific kind of distribution. At other times, we may have entirely empirical distributions represented by our data. And we’d still want to make comparisons using such data sets.

shape <- 1
scale1 <- 5
scale2 <- scale1*2
x <- rweibull(1000000,shape,scale1)
xprime <- rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution with different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
#Medians of the two data sets (blue and red lines in the plot)
abline(v = median(x), col = "blue")
abline(v = median(xprime), col = "red")

Different scale parameters. Red and blue lines indicate medians of the two data sets.

The Weibull distribution is known to be quite versatile, and can at times be used to approximate the Gaussian distribution for real world data. An example of this is the use of the Weibull distribution to approximate constant failure rate data in engineering systems. Let’s look at data from a different pair of distributions with a different shape parameter, this time, 3.0.

shape <- 3
scale1 <- 5
scale2<-scale1*1.1 #Different scale parameter for the second data set
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
hist(x, breaks = 1000, border = rgb(0.9,0.2,0.2,0.2), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = 1000, border = rgb(0.2,0.9,0.2,0.2), col = rgb(0.2,0.9,0.2,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")

Weibull distribution data – different because of scale parameters. Vertical lines indicate medians.

The medians can be used to illustrate the differences between the data sets, and to summarize the differences observed in the graphs. When we know that a data set is non-normal, we can adopt non-parametric methods from the hypothesis testing toolbox in R. Just as hypothesis tests for normally distributed data compare means, we can compare the medians of two or more samples of non-normally distributed data. Naturally, the usual caveats apply – larger samples of data are, at times, better. However, the tests can help us analytically differentiate between two similar-looking data sets. Since the Mann-Whitney test and other non-parametric tests don't make assumptions about the parameters of the underlying distribution of the data, we can rely on them to a greater extent when studying differences between samples that we suspect have a greater chance of being non-normal (even though the normality tests may say otherwise).

Non-parametrics and the inferential statistics approach: how to use the right test

When we run the Anderson-Darling (AD) test for normality on the two samples in question, both return a very low p-value. This can also be confirmed using qqnorm plots.
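A minimal sketch of that visual check (reusing the x and xprime samples generated in the code further below):

#Normal quantile-quantile plots for the two Weibull samples;
#systematic curvature away from the reference line indicates non-normality
par(mfrow = c(1, 2))
qqnorm(x, main = "Normal Q-Q plot: x"); qqline(x, col = "red")
qqnorm(xprime, main = "Normal Q-Q plot: xprime"); qqline(xprime, col = "red")
par(mfrow = c(1, 1))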

Let's use the Mann-Whitney test on these two samples of non-normal data, to assess the difference between their medians, using the wilcox.test() command. For two samples, the wilcox.test() command actually performs a Mann-Whitney test.

shape <- 3
scale1 <- 5
scale2<-scale1*1.01
x<-rweibull(1000000,shape,scale1)
xprime<-rweibull(1000000,shape,scale2)
library(nortest)
paste("Normality test p-values: Sample 'x' ",ad.test(x)$p.value, " Sample 'xprime': ", ad.test(xprime)$p.value)

hist(x, breaks = length(x)/10, border = rgb(0.9,0.2,0.2,0.05), col = rgb(0.9,0.2,0.2,0.2), main = paste("Weibull Distribution different scale parameters: ",scale1,", ", scale2))
hist(xprime, breaks = length(xprime)/10, border = rgb(0.2,0.9,0.2,0.05), col = rgb(0.2,0.2,0.9,0.2), add = T)
abline (v = median(x), col = "blue")
abline (v = median(xprime), col = "red")
wilcox.test(x,xprime)
paste("Median 1: ", median(x),"Median 2: ", median(xprime))

Observe how close the scale parameters of both samples are. We’d expect both samples to overlap, given the large number of points in each sample. Now, let’s see the results and graphs.

Nearly overlapping histograms for the large non-normal samples

The results for this are below.

Mann-Whitney test results

The p-value here (for this considerable sample size) clearly indicates the presence of a significant difference. A very low p-value in this test result means that, if we were to assume that the medians of these data sets are equal, there would be an extremely small probability of seeing samples as extreme as these. The fine difference in the medians, printed alongside the test result, is picked up by this test.

To run the Mann-Whitney test with a confidence interval at a chosen confidence level, we can use the following syntax (conf.level only applies when a confidence interval is requested with conf.int):

wilcox.test(x, xprime, conf.int = TRUE, conf.level = 0.95)

Note 1: The mood.test() command in R performs a two-sample test of scale. The scale parameters of the samples we generated here are known by construction (this is a demo); in real-life situations, the p-value should be interpreted along with additional information, such as the sample size and the chosen confidence level.

Note 2:  The wilcox.test() command performs the Mann Whitney test. This is a comparison of mean ranks, and not of the medians per se.
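For completeness, a minimal sketch of the scale comparison mentioned in Note 1, on the same x and xprime samples:

#Mood's two-sample test of scale, comparing the spread of the two samples
mood.test(x, xprime, alternative = "two.sided")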

Hypothesis Tests: 2 Sample Tests (A/B Tests)

Businesses are increasingly beginning to use data to drive decision making, and are often using hypothesis tests. Hypothesis tests are used to differentiate between a pair of potential solutions, or to understand the performance of systems before and after a certain change. We’ve already seen t-tests and how they’re used to ascribe a range to the variability inherent in any data set. We’ll now see the use of t-tests to compare different sets of data. In website optimization projects, these tests are also called A/B tests, because they compare two different alternative website designs, to determine how they perform against each other.

It is important to reiterate that in hypothesis testing, we’re looking for a significant difference and that we use the p-value in conjunction with the significance (%) to determine whether we want to reject some default hypothesis we’re evaluating with the data, or not. We do this by calculating a confidence interval, also called an interval estimate. Let’s look at a simple 2-sample t-test and understand how it works for two different samples of data.

Simple 2-sample t-test

A 2-sample t-test has the default hypothesis that the two samples you’re testing come from the same population, and that you can’t really tell any difference between them, so any variation you see in the data is purely random variation. The alternative hypothesis in this test is, of course, that it isn’t only random variation we’re seeing, and that these samples come from populations with different means.

H_0 : \mu_1 = \mu_2 \newline H_a : \mu_1 \neq \mu_2

What the populations from which X1 and X2 are taken may look like

We’ll generate two samples of data, x_1 and x_2, from two different normal distributions for the purpose of demonstration, since normality is a prerequisite for using the 2-sample t-test. (In the absence of normality, we can use other estimators of central tendency such as the median, and the tests appropriate for comparing medians, such as the Mood’s median or Kruskal-Wallis test – which I’ll blog about another time.) We also have to ensure that the standard deviations of the two samples we’re testing are comparable, and I’ll demonstrate how we can use a test for standard deviations to understand whether the samples have different variability. Naturally, when the samples have different standard deviations, tests for assessing similarities in their means may not be fully effective.

library(nortest)
#Reading in two samples of data (100 points each),
#generated from normal distributions with the same
#standard deviation and different means
x <- read.csv(file = "x1x2.csv")
x1 <- x[,2]
x2 <- x[,3]

#Setting the global value of significance 
alpha = 0.1

#Histograms
hist(x1, col = rgb(0.1,0.5,0.1,0.25), xlim = c(7,15), ylim = c(0,15),breaks = seq(7,15,0.25), main = "Histogram of x1 and x2", xlab = "x1, x2")
abline(v=10, col = "orange")
hist(x2, col = rgb(0.5,0.1,0.5,0.25), xlim = c(7,15),breaks = seq(7,15,0.25), add = T)
abline(v=12, col = "purple")

#Running normality tests (just to be sure)
ad1<-ad.test(x1)
ad2<-ad.test(x2)

#F-test to compare the variances of the two samples
v1<-var.test(x1,x2)

if(ad1$p.value >= alpha & ad2$p.value >= alpha){
  if(v1$p.value >= alpha){
    #Running a simple 2-sample t-test (default hypothesized difference of 0)
    t1 <- t.test(x = x1, y = x2, paired = FALSE, var.equal = TRUE, alternative = "two.sided", conf.level = 1 - alpha)
    print(t1)
  }
}

The first few lines in the code merely include the “nortest” package and load the data sets we’re comparing (two samples of 100 points each, generated from normal distributions with the same standard deviation and different means). The nortest package contains the Anderson-Darling normality test, which we have also covered in an earlier post. We can generate a histogram to understand what x_1 and x_2 look like.

Histogram of X1 and X2 – showing the reference population mean lines

The overlapping histograms of x_1 and x_2 clearly indicate the difference in central tendency, while also showing how much the two samples overlap. The subsequent code runs an F-test. As explained earlier, equality of variances is a prerequisite for the 2-sample t-test; failing this would mean that we essentially have samples from two different populations, with two different standard deviations.
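
As an aside, if the F-test were to indicate unequal variances, one commonly used option (shown here only as a sketch, not as part of the workflow above) is Welch’s t-test, which t.test() performs when var.equal is left at its default of FALSE:

#Welch's 2-sample t-test: does not assume equal variances
t.test(x = x1, y = x2, paired = FALSE, var.equal = FALSE, alternative = "two.sided", conf.level = 1 - alpha)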

Finally, if the conditions to run a 2-sample t-test are met, the t.test() command (which is part of the “stats” package) runs and provides us a result. Let’s look closely at how the t.test() command is constructed. The arguments contain x_1 and x_2, which are our two samples for comparison. We provide the argument “paired = FALSE” because these are not before/after samples; they’re two independently generated samples of data. There are instances where you may want to conduct a paired t-test, though, depending on your situation. We’ve also specified the confidence level. Note how the code uses a global value of \alpha, the significance level.

Now that we’ve seen what the code does, let’s look at the results.

2-sample t-test results

Evaluating The Results

Two-sample t-test results should be evaluated in a similar way to 1-sample t-tests: our decision depends on the p-value we see in the result, and on the confidence interval for the difference between the sample means.

Observe how the difference estimate lies on the negative side of the number line. The difference is calculated from samples drawn from populations with means of 10 and 12 respectively, so we can clearly understand why this estimate of the difference is negative. The estimates for the mean values of x and y (in this case, x_1 and x_2) are also given. Naturally, the p-value in the result is far smaller than our generous \alpha of 0.1, and we can consider this to be a significant result (provided we have sufficient statistical power – and we’ll discuss this in another post). This indicates a significant difference between the two sets. If x_1 and x_2 were fuel efficiency figures for passenger vehicles, or bikes, we may actually be looking at better performance for x_2 when compared to x_1.
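
If you prefer to work with these numbers programmatically rather than read them off the printed output, the individual elements of the stored result object (named t1 in the code above) can be extracted directly; this is a minimal sketch using the standard components of the object returned by t.test():

#Extracting the key quantities from the stored t-test result
t1$p.value    #p-value of the test
t1$conf.int   #confidence interval for the difference in means
t1$estimate   #estimated means of x1 and x2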

Detecting a Specific Difference

Sometimes, you may want to evaluate a new product and see if it performs at least some specified amount better than the old product. For websites, for instance, you may be concerned with loading times; in other settings it may be code runtime, vehicle gas mileage, durability, or some other aspect of performance. At times, the fortunes of entire companies depend on producing faster, better products that are known to be better by at least some amount. Let’s see how a 2-sample t-test can be used to evaluate a minimum difference between two samples of data.

The same example above can be modified slightly to test for a specific difference. The only real change we have to make, of course, is to the value of \Delta, the hypothesized difference. The t.test() command in R unfortunately isn’t very explicit about this – it expects you to know that \mu (the mu argument) holds the hypothesized difference in means for a two-sample test. Once you get used to it, however, this little detail is fine, and it delivers the expected result.

if(ad1$p.value >= alpha & ad2$p.value >= alpha){
  if(v1$p.value >= alpha){
    #Running the 2-sample t-test for a range of hypothesized differences (mu)
    for (i in c(-2, -1, 0, 1, 2)){
      temp <- t.test(x = x1, y = x2, paired = FALSE, var.equal = TRUE, alternative = "two.sided", conf.level = 1 - alpha, mu = i)
      cat("Difference= ", i, "; p-value:", temp$p.value, "\n")
    }
  }
}

The code above prints out different p-values for different tests. The data used in these tests is the same, but because of the different hypothesized differences we want to detect between these samples, the p-values differ. Observe the results below:

Differences and how they influence p-value (same two samples of data)

Since the data was generated from two distributions that have means of 10 and 12 respectively for x_1 and x_2, we know that the true difference (x_1 minus x_2) is -2, and at this value the test should indicate no discrepancy between the hypothesized and observed difference. Therefore, the p-value in this scenario will be greater than the significance level, \alpha.

For the other scenarios – when \Delta = -1, 0, 1, 2 – we see that the p-values are clearly far below the significance level of \alpha = 10%.

What’s important to remember, therefore, is that contrary to what many people may think, there is no single “correct” p-value for a given set of data. It depends on the factors we take into consideration during the test – such as the sample size, the confidence level we chose for our test, the resulting significance level, and, as illustrated here, the difference we expect to detect.

Concluding Remarks

A 2-sample t-test is a great way for an organization to compare samples of data from different products, processes, and so on, and to understand whether one of them is performing significantly better than another. The test is strictly for data that meet the normality criterion and that also have comparable standard deviations, and its results are heavily influenced by the kind of hypothesis we use – the hypothesized difference (which we explored here) and whether the comparison is one- or two-sided. We explored only two-sided comparisons here (and hence constructed a two-sided confidence interval). When a business uses a 2-sample t-test, the arguments involved, such as the confidence level, the hypothesized difference and so on, should be evaluated thoroughly. It is also important to bear in mind the impact of sample size: the smaller the difference we want to detect, the greater the sample sizes have to be. We’ll see more about this in another post, on power, difference and sample size.
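
As a small preview of that relationship, the power.t.test() command in the stats package shows how the required sample size grows as the difference to be detected shrinks. The delta, sd and power values below are illustrative assumptions, not taken from the examples above.

#Required sample size per group to detect a difference of 1 unit (sd = 1, 90% power)
power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.9, type = "two.sample")
#Halving the detectable difference roughly quadruples the required sample size
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.9, type = "two.sample")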

Data Science: Beyond the Hype

While there is justifiable excitement in the technology industry (and other industries) these days on the widespread availability of data, and the availability of algorithms to process and make sense of this data, I sincerely think (like many others) that the hype behind Big Data is somewhat unfounded.

For many decades, “small data” have been studied in science and industry with the intent of constructing mathematical models, i.e., approximate, error-prone mathematical representations of phenomena. In some ways, the scientific method is all about such data analysis. We often hear in the news about the amplification of effects, the “truth inflation” observed when conclusions drawn from small data sets are used to make broader generalizations. We hear about the lack of enough data impeding the progress of research, and we also hear about fabricated data and spurious research results. A lot of scientific findings have come under scrutiny for these reasons – and perhaps analysis of population data (as Big Data promises to do) may help this situation. However, the key difference between the past decades of statistics, shaped by legends such as Fisher and George Box, and the present day of applied statistics and machine learning stalwarts like Nate Silver, Sebastian Thrun and Andrew Ng, is the ability to leverage computing to analyse large data sets.

A lot of the discussion around Big Data centres on the so-called four Vs of Big Data – volume, velocity, variety, and increasingly veracity – referring to the scale, speed, range and reliability of the data generated in the information age. However, what’s forgotten often enough is that below the hype, below the machine learning algorithms, and below the databases and technologies, we still have the same underlying principles.

The types of data, the mathematical methods we use to evaluate them, and the fundamental concepts thereof are unchanged – and understanding this is often the key to knowing whether and when to sample from your big data set. This is more important than we realize, because sampling is not obsolete: often, well collected samples of data may be more than sufficient for establishing or testing a certain hypothesis we may have.
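
To make that point concrete, here is a minimal sketch with made-up numbers (the data set and the hypothesized mean below are illustrative assumptions): a modest, well-drawn random sample can support the same conclusion as the full data set.

set.seed(1)
#A stand-in for a "big" data set of five million observations
full_data <- rnorm(5e6, mean = 100, sd = 15)
#A modest random sample drawn from it
small_sample <- sample(full_data, size = 500)
#Testing the hypothesis that the mean equals 105, on the full data and on the sample
t.test(full_data, mu = 105)$p.value
t.test(small_sample, mu = 105)$p.value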

In my view, newcomers to the data science and big data revolutions ought to consider a course in statistics, statistical thinking and statistical reasoning first. This lays the foundation for everything else that follows. The internet and most developed and even developing countries are awash with resources that can enable individuals to learn programming and computer-based problem solving, but critical thinking and statistical thinking seem to be harder skills to learn.

Statistical thinking not only requires a level of mathematical rigour but an ability to embrace notions of uncertainty, probabilistic thinking and a fundamental change in one’s notions of cause and effect. Perhaps this is a big step for many. The relative certainty of the logic of programming languages may actually be welcoming to many – which is probably also why we see more discussions about Hadoop and Spark and not enough discussions about statistical hypothesis testing or time series auto-correlation models.

So, if you want to cut through the hype, see data science for what it is by breaking it up into its elements: the data (which may be coming in from ever more diverse sources), the tools (algorithms, computers) and the science (which, in this case, is statistics). Contrary to what some articles on the web have begun to claim, not everyone is a data scientist, but it isn’t only one specific set of skills that makes someone a “data scientist”. Some say that data scientists are glorified statisticians; some say that they’re statistically competent programmers well versed in machine learning; the truth is probably somewhere in between.

Furthermore, data visualization – another aspect of the data science hype – is both an art and a science, which perhaps implies that you can be both enlightened and obfuscated by charts and graphs. In my view, knowledge or ability in visualization alone doesn’t make you a data scientist (nor does, for instance, knowledge of machine learning methods alone, or skill in programming R for ETL purposes alone). When you cut through the hype, what’s pragmatic is to acquire a wide array of skills, with depth in some. Like many engineers, who may have a wide swath of knowledge but expertise in only a few areas, this is the most likely profile for most data scientists.

There’s definitely more that can be said about specific aspects of the data science “movement”, but what is certain is that the value and relevance of the statistics underlying most of this science should not be underestimated in the present day. Statistics, hopefully, will become as fundamental as learning a language, holding a conversation, or writing a well-argued paragraph.