Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I’d authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Domain knowledge is ignored on the data science road to perdition. Doing data analysis or building models without understanding the domain, and the relevance of the data and factors one is using, is akin to “data science suicide” – a sure road to perdition as a data scientist. Domain knowledge is also hard to acquire, especially for data scientists working as consultants, applying their skills in short-term, consultative settings. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things when I encounter a new engineering setup or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes when speaking to subject matter experts. It is for this reason that the following things are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for useful models, not more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril. Hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and models built from domain-grounded hypotheses are the ones that turn out to be really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, a debate constantly exists on the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, callous use of customer data can present a huge risk. Simpler models are easier to explain – and are arrived at when we accumulate sufficient domain knowledge, and test enough hypotheses. With simpler models, it is easier to explain what data to collect, and this can also help win the customer’s trust.
  6. Careful feature engineering done with human supervision and care may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and RoboticDataScience are often discussed in the context of machine intelligence and speeding up the process of insight generation from data. However, for some applications, it may be a better idea in the short term to ensure that the feature engineering happens through human hands. Such careful feature engineering may give organizations that use sensitive data a leg up as a longer term strategy, by erring on the side of caution.
  6. Deep learning isn’t the end of the road for data scientists. Deep learning has (justifiably) seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for all data analysis problems. The end goal of working with data is the generation of value – be it for a customer, or for society at large. There are many ways to do this – and deep learning is just one approach.

I’m not discussing the many technical aspects of building explainable models here. For one, these technical aspects are contextual and depend on the situation; additionally, the tone of this post and the tweets is deliberately lighter, to encourage discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you like something on this post, or want to share any other related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.

Different Kinds of Data Scientists

Data scientists come in many shapes and sizes, and constitute a diverse lot of people. More importantly, they can perform diverse functions in organizations and still qualify under the same criteria we use to define data scientists.

In this cross-post from a Quora answer, I wish to elucidate the different kinds of data scientist roles I believe exist in industry. Here is the original question on Quora. I have to say that I found Michael Koelbl’s answer to What are all the different types of data scientists? quite interesting, and thinking along similar lines, I decided to delineate the following stereotypical kinds of data science people:

  1. Business analysts with a data focus: These are essentially business analysts who understand a specific business domain reasonably well, although they’re not statistically or analytically inclined. They focus on exploratory data analysis, reporting built around newly created measures, graphs and charts based on those measures, and the questions this EDA raises. They’re excellent at storytelling, asking questions based on data, and pushing their teams in interesting directions.
  2. Machine learning engineers: Essentially software developers with a one-size-fits-all approach to data analysis, trying to build ML models of one kind or another from the data. They’re not statistically savvy, but understand ML engineering, model development, software architecture and model deployment.
  3. Domain expert data scientists: They’re essentially experts in a specific domain, interested in generating the right features from the data to answer questions in the domain. While not skilled as statisticians or machine learning engineers, they’re very keyed in on what’s required to answer questions in their specific domains.
  4. Data visualization specialists: These are data scientists focused on developing visualizations and graphs from data. Some may be statistically savvy, but their focus is on data visualization. They span the range from BI tools to hand-coded scripts and programs for data analysis.
  5. Statisticians: Let’s not forget the old epithets assigned to data scientists (and the jokes around data science and statisticians). Statisticians are perhaps the rarest breed in the current data science talent pool, despite the need for them being higher than ever. They’re generally savvy analysts who can build models of various kinds – from distribution models, to significance testing, factor-response models and design of experiments (DOE), to machine learning and deep learning. They’re not normally known to handle the large data sets we often see in data science work, though.
  6. Data engineers with data analysis skills: Data engineers can be considered “cousins” of data scientists that are more focused on building data management systems, pipelines for implementation of models, and the data management infrastructure. They’re concerned with data ingestion, extraction, data lakes, and such aspects of the infrastructure, but not so much about the data analysis itself. While they understand use cases and the process of generating reports and statistics, they’re not necessarily savvy analysts themselves.
  7. Data science managers: These are experienced data analysts and/or data engineers that are interested in the deployment and use of data science results. They could also be functional or strategic managers in companies, who are interested in putting together processes, systems and tools to enable their data scientists, analysts and engineers, to be effective.

So, do you think I’ve covered all the kinds of data scientists you know? Do you think I missed anything? Let me know in the comments.

Related links

  1. O’Reilly blog post on data scientists versus data engineers

Why Do I Love Data Science?

This is an interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:

  1. Discovering and uncovering patterns in the data through data visualization
  2. Finding and exploring unusual relationships between factors in a system using statistical measures
  3. Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models

Let me expand on each of these with an example, so that you get an idea.

Uncovering Patterns in Data

On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a visualization as the starting point for hypothesis generation lets us go into the model-building process a little more confidently. There’s the specific example of a time series analysis technique I used for energy system data, where relying on aggregate statistical measures and distribution fitting led to arbitrary, overly complex characterizations of the data. Using time-ordered visualizations helped me formulate the hypothesis in the correct way, and allowed me to build an explanatory model of the system.
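
As a small illustration of that idea, here’s a sketch with simulated data (standing in for the original energy data set): an aggregate view of the readings looks unremarkable, while a time-ordered plot of the same values reveals the underlying cycle.

#A minimal sketch with simulated data, not the original energy data set
set.seed(42)
t <- 1:720                                     #hourly readings over 30 days
load_kw <- 50 + 10 * sin(2 * pi * t / 24) + rnorm(720, sd = 2)

#Aggregate view: roughly bell-shaped, the daily cycle is invisible
hist(load_kw, breaks = 30)

#Time-ordered view: the daily cycle is immediately obvious
plot(t, load_kw, type = "l", xlab = "Hour", ylab = "Load (kW)")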

Exploring Unusual Relationships in Data

In data science work, you begin to observe broad patterns, and exceptions to those rules. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of transactions between customers and a client’s system. These logs revealed unusual patterns that those steeped in the process could recognize, but couldn’t quantify. By constructing session-specific metrics and comparing typical patterns across customers, I helped identify the anomalous ones. The construction of such variables, known as “feature engineering” in data science and machine learning, was a key insight. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
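
To make the idea concrete, here is a minimal sketch of session-level feature engineering on simulated log data; the column names and the three-sigma cut-off are illustrative assumptions, not details of the original project.

#Simulated log data; column names are illustrative assumptions
set.seed(7)
logs <- data.frame(
  session_id = rep(1:100, each = 20),
  timestamp  = as.POSIXct("2018-01-01") + cumsum(rexp(2000, rate = 1/30))
)

#Engineer one row of session-specific metrics per session
features <- do.call(rbind, lapply(split(logs, logs$session_id), function(s) {
  data.frame(session_id    = s$session_id[1],
             n_events      = nrow(s),
             duration_secs = as.numeric(difftime(max(s$timestamp),
                                                 min(s$timestamp), units = "secs")))
}))

#Flag sessions whose duration deviates strongly from the typical pattern
z <- as.vector(scale(features$duration_secs))
features[abs(z) > 3, ]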

Asking Questions about Systems in a Data Context

When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then keep refining it to a point where it helps your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in, and present interesting possibilities during the analysis. Sometimes we answer these questions by finding and including additional data, and at other times the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and to run an analysis on top of another analysis – with the result that, after a while, you’ve composed different models together into something that gives you completely new insights you hadn’t seen before.

Concluding Remarks

All three patterns are exhilarating and interesting to observe for data scientists, especially those who are deeply involved in reasoning about the data. A good indication that you’ve done well in data analysis is when you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.

The Expert System Anachronism in the Data Science and AI Divergence

The data science and big data buzzwords have been bandied about for years now, and artificial intelligence has been talked about for decades; the two fields are irrevocably interrelated and interdependent.

For one thing, the wide interest in data science started just as we were beginning to leverage distributed data storage and computation technologies – which allowed companies to “scale out” storage and computation, rather than “scale up”. Companies that could buy numerous run-of-the-mill computers (rather than smaller numbers of extremely expensive, high-end machines) could potentially leverage their data collection activities to be useful to the enterprise.

Let’s not forget, though, that the point of such exercises is to actually get some business value at the end. There’s virtually no business case for collecting huge amounts of data and storing them (with or without structure) if we don’t have a plan to use that data to make better business decisions, or to somehow impact the business or customers positively. IT managers across industries have therefore struggled to make sense of the big data space – how much to invest, what to invest in, and how to make sense of it all.

Technology companies are only too happy to sell companies the latest and greatest data science and data management frameworks and solutions, but how can companies actually use these solutions and tools to make a difference to their business? This challenge for executives isn’t going away with the advent of AI.

Artificial Intelligence (AI) has a long and hoary history, and has been the subject of debate, discussion and chronicle over several decades. Geoff Hinton, the AI pioneer, has a pretty comprehensive description of various historical aspects of AI here. Starting from Hinton’s work, pioneering research in recent years by Yann LeCun, Andrej Karpathy and others has enabled AI to be considered seriously by organizations as a force multiplier, just as they considered data science a force multiplier for decision making. The focus of all these researchers is on general purpose machine intelligence, specifically neural networks. While the “deep learning” buzzword has caught on of late, a deep learning model is fundamentally a complex neural network.

That said, AI in the form of deep learning differs vastly in capability from the algorithms data scientists and data mining engineers have used for more than a decade now. By adding many layers, constructing complex topologies in these neural networks, and iteratively training them on large amounts of data, we’ve progressed along multiple quantitative axes (complexity, number of layers, amount of training data, etc.) to get not merely quantitatively, but qualitatively better AI performance. Recent studies at Google show that image captioning, often considered a hard problem for AI, is now at near-human levels of accuracy. Microsoft famously announced that their speech-to-text and translation engines have improved by an order of magnitude because of these techniques.

It is this vastly improved capability of AI, and the elimination of the human (present forever in the data science activity loop) from even the analysis and design of these neural networks (generative adversarial networks being a case in point), that makes the divergence between data science and AI vivid and distinct. AI seems to be headed in the direction of general intelligence, whereas data science approaches and methods have constituted human-in-the-loop approaches to making sense of data. The key value addition of the human in this data science context was “domain” – I have discussed the importance of domain in data science extensively in an earlier post. But this, too, is increasingly being supplanted by efficient AI, provided that the data collection process for training data, and the training and topological aspects of the networks (the hyperparameters), are well enough defined. This supplanting of the human domain perspective, by machine-learned domain features that matter, is precisely what will enable AI to develop and become a key force to reckon with in industry.

Therefore I venture that the “anachronism” in the title of this post is the domain-based model of systems, or intelligent systems, called the Expert System. Expert system design is an old problem that had its heyday and apparently disappeared into the mist of technological obsolescence – and it is this kind of design problem that AI methods will be good at solving, to the point that they can replace humans in key tasks and become a true general intelligence. Expert systems were how the earliest AI researchers imagined machine intelligence would be useful to humanity; however, their understanding was limited to rule-based expert systems. While the overall idea of the expert system is still relevant in many domains – so much so that, in a sense, we have expert systems all around us – it is undeniable that the advent of AI will enable expert systems to develop and evolve once again: without the rule-based approaches of the past, and with the inductive learning apparent in deep learning and machine learning methods.

Azure ML Studio and R

A decade ago, Microsoft looked very different from the Microsoft we see today – it has been a remarkable transformation. One of the areas where Microsoft has made a big push is machine learning and data analytics. Although the CRAN repository is going strong with more than 10,000 packages as of today, the MRAN repository (Microsoft’s Managed R Archive Network) is adding libraries and functionality that were missing from the R stack. Ever since they acquired Revolution R, they’ve also integrated R into their data science and ML offerings in a big way. For instance, Power BI comes with the ability to write R scripts that produce visualizations for dashboards. They’ve come out with a number of products that add to or complement the Office suite that is the bedrock of Microsoft’s software portfolio, and of late, they have pushed Azure and their own machine learning algorithms in a big way. A year is a long time in the world of big data and machine learning, and now, on Azure ML Studio, most people interested in big data and data science can get started with data analysis in a pleasant, user-friendly interface.

The Azure ML Studio Interface

I have had the chance to play around a little with Azure ML, and here are some of its strong points, as I see them. Above you can see a simple data processing step I set up within Azure ML Studio – taking a simple data set and subjecting it to some transformations.

It is possible to summarize and visualize this data pretty quickly, using some of the point-and-click summaries you get from the outputs of the boxes in the workflow.

Simple summaries of dataframes and CSV files are easy

What’s nice about this simple interface is the ability to bring multiple variables into one view, and to explore a given variable in different ways. Here, I’ve scaled both axes to produce a log-log plot, and am able to see the variation in the MPG values for the sample data set in question. Very handy when you want to quickly test one or two hypotheses.
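
For comparison, the same kind of log-log exploration takes a line or two in base R; here mtcars stands in for the automobile sample data set used above.

#Log-scaling both axes to inspect the spread in the MPG values
plot(mtcars$wt, mtcars$mpg, log = "xy",
     xlab = "Weight (log scale)", ylab = "MPG (log scale)")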

What ML Studio seems adept at doing is bringing together R, Python and SQL in the same interface, which makes it particularly powerful for multi-language data analysis. True to this capability, you can bring in an R kernel for doing data analysis. Sure enough, you can use Python too (if you’re like me, you use Python and R almost equally).

Interface allows for opening Jupyter notebooks with R and Python kernels

Once you have a Jupyter notebook opened up, you can perform analysis of all kinds in it – everything available with open-source R is apparently supported within Azure ML Studio. The thing about Jupyter notebooks, of course, is that you can’t yet use multiple kernels in the same notebook. You can use R, or Python, or Julia, for instance, but that language choice is static within a given notebook. There is a discussion around this, but I’m unsure whether it has been resolved. Although R support in Jupyter notebooks is a little sketchy, seasoned R coders can use it well enough. The REPL interface of RStudio is a bit nicer (and harder to get away from, for me personally) than Jupyter for R programming, but it does work well, for the most part. Kernels are managed remotely and abstracted away from the user, so there is no need to SSH into a Jupyter server and so on. The data analysis can start right away, because the distractions are gone.

Jupyter notebook running remotely on ML Studio server with an R kernel.

One bug I seemed to run into is the inability to change graph sizes with the standard par() settings (such as the mar margin parameter). Other than that, graphs render well enough within Jupyter. Building models is easy in R as it is – so many packages provide a very simple interface – and doing this in Jupyter is no different: a breeze as usual.
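
A workaround worth trying, assuming the notebook runs on IRkernel (which is what Jupyter’s R support is usually built on): figure dimensions in such notebooks are controlled by the repr package’s options rather than by par().

#In IRkernel-based notebooks, plot size is set via repr options, not par(mar = ...)
options(repr.plot.width = 8, repr.plot.height = 5)   #dimensions in inches
hist(rweibull(1000, 2, 5))                           #subsequent plots use this size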

Simple R graph rendered in Azure ML Studio

Overall, with Azure ML Studio, we’re looking at a mature, user-friendly web app for doing machine learning and data science, offering a good amount of interactivity and allowing code to be integrated right into the workflows – quite a coup, in my opinion. For prototyping and exploratory data analysis, this can produce a good repeatable workflow that can be easily understood by others and shared.

  1. The interface is great – it brings together notebooks, data sets, models pre-trained by Microsoft, and so on, together in one nice interface.
  2. One value addition in the interface is the ability to separate out different contexts very clearly. You can clean data with a certain part of it, organize your dataframes with another, and so on.
  3. The drag and drop functionality is actually pretty good, and works conveniently for those interested in mixing code with a visual interface.
  4. The Jupyter notebook integration is sketchy with R (more an issue with Jupyter than Azure ML studio, in my experience) – but works well enough for most things to do with data frames and simple visualizations.
  5. In addition to what we saw in the notebook, there’s also the possibility of directly embedding R code into the ML Studio workflow as a cell – sketched below.
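
Here is a minimal sketch of that pattern, using the maml helper functions the Execute R Script module provides at runtime (the mpg column is purely illustrative).

#Inside an Execute R Script module in ML Studio; the maml.* helpers
#are provided by the module's runtime, not by open-source R
dataset1 <- maml.mapInputPort(1)        #the data frame wired to input port 1

dataset1$log_mpg <- log(dataset1$mpg)   #assumes an 'mpg' column, for illustration

maml.mapOutputPort("dataset1")          #expose the data frame on the output port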

Hope you liked this little tour of ML Studio. I enjoyed playing around with it!

Data Perspectives: “Orbiting The Giant Hairball”

This may sound weird, but one sure way to lose perspective about the business in an innovative and constantly changing industry is to bury yourself in regular work. This is the meaning of the title, which comes from a book of the same name.

By regular work, I mean work in which you execute tasks with a view to minimizing variability and producing standard results. This is as opposed to innovative work, which, as Bob Sutton explains in his lectures, is characterised by an increase of variability to the point of failure. Failure and validated learning are essential aspects of the learning experience in any job, to extend a metaphor from Eric Ries’ book The Lean Startup.

Data science and data engineering are the truly cross-functional and cross-industry work areas within the analytics revolution that is under way right now. There are a number of business perspectives that are relevant in one industry, which can also be applied to another. Indeed, work in some industries can anticipate very closely the needs of another.

Data scientists should keep one eye on the business – or, to be true to the metaphor, should occasionally “dive into the hairball” of business and routine work, to get a glimpse of what’s happening in the world of work. The data perspectives they bring to that conversation will then become as important as the perspectives they develop from such experiences. Seasoned professionals and consultants in the data analytics industry may have consciously or unconsciously developed their cross-functional and cross-industry experience over years. But it is probably fitting for younger data professionals – and there are many of them out there – to occasionally “dive into the hairball from orbit” and understand the challenges of data for those in various walks of business.

Quality and the Data Lifecycle

The insights we get from data depend on the quality of the data itself; as the saying goes, “Garbage In, Garbage Out”. The volume of data doesn’t matter as much as its quality. With the growing scale of big data implementations, data quality assurance is therefore an increasingly important function in the big data world. In this short post, I’ll discuss aspects of data quality as seen from a data life cycle perspective.

Data quality is applicable to all three key stages of the data life cycle: data collection, storage, and use in analysis. In each of these contexts, data quality can take on a different meaning. The most widely recognized data management paradigms involve these three stages, sometimes broken down into a larger number of steps. How does data quality assurance figure in each of these three stages?

Data Quality In Data Collection

Data quality in the data collection part of the life cycle is concerned with the nature, kind and operational definition of the data collected. It is therefore concerned with the sensors or measurement systems that collect data, their veracity, and their ability to monitor the source of the data as per requirements. Assessment criteria in a 20th-century industrial setting may have involved approaches like measurement system analysis; these days, more sophisticated logical validation approaches may be used. Methods of studying different aspects of measurement systems, such as stability, linearity and bias, may also be adopted. A wider discussion of data quality here would also include the process of documenting the data collected, whether manually or electronically, and how this is best rationalized.
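
As a sketch of what such logical validation might look like in practice (in R, with hypothetical column names and limits), consider two simple indicators computed on each incoming batch:

#Hypothetical sensor readings; names and limits are illustrative assumptions
readings <- data.frame(sensor_id = c("A", "A", "B"),
                       temp_c    = c(21.5, 98.0, NA))

#Two simple data quality indicators for the batch
checks <- c(missing_rate = mean(is.na(readings$temp_c)),
            out_of_range = mean(readings$temp_c < -40 | readings$temp_c > 60,
                                na.rm = TRUE))
checks  #flag the batch if either rate exceeds an agreed threshold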

Data Quality in Data Storage

Data storage is the second part of the data life cycle, where we store data on small and large servers alike; in the big data space, we may use approaches like Hadoop or Spark to store and retrieve data from relational and non-relational databases, such as generic AS400 or Oracle databases, or Mongo-esque document stores. When data is stored across different physical disks and locations, the integrity of the data, and the sanity and rationalization of data management practices, become important. Data quality here can be measured the same way we check database integrity – through checks for data loss, data corruption and the like. It can therefore encompass physical integrity – the quality of the hard drives or SSDs in question, and the systems and processes that enable access to the data on an as-needed basis (power supply, bandwidth and other considerations) – as well as the software aspects. The discussion should naturally also extend to practices adopted at the data warehouse in question: maintenance of the servers, the kinds of commands run, and the kinds of administrative activities performed.
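
One concrete integrity check at this stage is to fingerprint data files when they are written and verify the fingerprint before they are read back; here is a sketch using the digest package (one of several possible tools for this).

#Checksum-based integrity check; digest is just one possible tool
library(digest)

path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)
checksum_at_write <- digest(path, algo = "md5", file = TRUE)

#...later, before the data is used, verify that nothing has changed...
stopifnot(digest(path, algo = "md5", file = TRUE) == checksum_at_write)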

Data Quality in Data Analysis

Data analysis, which is the end that justifies the means (of collecting and storing data), is arguably the most decisive and context-sensitive step of the whole data life cycle. Data quality in this context may be seen as the availability, from our databases, of the right data for the right analysis. While the analysis itself is decided based on what the business requires and what the available data makes possible, the data available in principle (or as per the organization’s SLAs) and the data actually available should tally. Data quality in an analysis context therefore becomes more specific and focused, tailored to the analysis we’re interested in and the insights we want to derive.
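
A sketch of such a tally, with hypothetical expectations about columns and date coverage standing in for an organization’s actual SLAs:

#What the analysis expects (hypothetical SLA-style expectations)
expected_cols <- c("customer_id", "order_date", "amount")
expected_days <- seq(as.Date("2018-01-01"), as.Date("2018-01-31"), by = "day")

#What actually arrived
orders <- data.frame(customer_id = 1:3,
                     order_date  = as.Date(c("2018-01-02", "2018-01-02",
                                             "2018-01-05")),
                     amount      = c(10, 25, 7))

setdiff(expected_cols, names(orders))                 #promised columns that are missing
expected_days[!expected_days %in% orders$order_date]  #days with no data at all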

In addition to the above, there is often a need to reuse stored data and to review or revisit old data. This function is considered separate in some frameworks. In such situations, it may be wise to have data quality processes specific to these information retrieval steps. Usually, however, the same data quality criteria used for data storage and databases apply here too.

Johnson Transformation for Non-Normal Data

A number of inferential statistical tests (A/B tests and significance tests) assume that the underlying data we’re comparing comes from a normal (Gaussian) distribution. However, this isn’t generally true for many data sets in practice. In order to use the tools that assume normality, we have to transform the data (and the limits or reference points being compared).

Background

The purpose of transformation, therefore, is to ensure that the data set we have satisfies the minimum assumptions made in the process of conducting hypothesis tests. In frequentist statistics – where we use statistical distributions to model a process and describe aggregate behaviour, rather than using Bayesian approaches – it is useful to keep such transformation tools at hand.

Let’s generate a sample data set, plot it, and analyze it using a QQ-norm plot to understand normality. We’ll use the standard Shapiro-Wilk test, a powerful test for normality of a data set.

#Generating a Weibull distribution sample
x <- rweibull(1000,2,5)

#Plotting and visualizing data
hist(x, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))

#Shapiro-Wilk test for normality
shapiro.test(x)

Simple histogram of sample. Observe how the data set exhibits skewness and appears quite non-normal.

	Shapiro-Wilk normality test

data:  x
W = 0.9634, p-value = 3.802e-15

The Shapiro-Wilk test results give strong evidence that the data set is non-normal. Now let’s look at a QQ-norm plot.
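
The QQ-norm plot shown below can be produced with base R’s qqnorm() and qqline():

#Normal QQ plot of the raw data
qqnorm(x)
qqline(x, col = "red")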

QQ Norm Plot of X (with QQ Line). Non-normality is evident from extreme values in the data

Our objective now is to transform this data set into one on which we can perform operations meant for normally distributed data. The benefits of being able to transform data are manifold: chiefly, it allows us to conduct capability and stability analyses, in addition to hypothesis tests like t-tests. Naturally, the reference points used in these tests will also have to be transformed, in order to make meaningful comparisons.

Log Transformations

The Log and Square Root functions are commonly used to transform positive data. In our case, since we have data from the Weibull distribution, we can explore the use of the log function and observe its effectiveness at transforming the dataset.

#Transforming data using the log function
x_log <- log(x)

#Plotting the transformed data
hist(x_log, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))

#Shapiro-Wilk test for normality of transformed data
shapiro.test(x_log)

#Normal QQ plot of the transformed data
qqnorm(x_log)
qqline(x_log)

	Shapiro-Wilk normality test

data:  x_log
W = 0.9381, p-value < 2.2e-16

Histogram of x_log

Quite clearly, the results aren’t promising. The transformed data set x_log doesn’t exhibit normality: the p-value in the normality test is extremely small, meaning that, if the data truly came from a normal distribution (the null hypothesis of the test), a departure from normality this large would be extremely unlikely. We therefore reject the null hypothesis for the log-transformed data as well.

The Johnson R Package

The Johnson R package provides access to certain tried-and-tested transformation approaches. The package contains a number of useful functions, including a normality test (Anderson-Darling, which is comparable in power to Shapiro-Wilk) and the Johnson transformation function itself. It can be installed with install.packages("Johnson") in R.

library(Johnson)
#Running the Anderson-Darling Normality Test on x
adx <- RE.ADT(x)
adx

#Running the Johnson Transformation on x
x_johnson <- RE.Johnson(x)

#Plotting transformed data
hist(x_johnson$transformed, breaks = 25, col = rgb(0.9, 0.1, 0.1, 0.5))
qqnorm(x_johnson$transformed)
qqline(x_johnson$transformed, col = "red")

#Assessing normality of transformed data
adx_johnson <- RE.ADT(x_johnson$transformed)
adx_johnson

Running the RE.Johnson() command generates a list, assigned in this case to the variable x_johnson. This contains a vector of the transformed values, under x_johnson$transformed – which is our new transformed data set. Using the same plots and tests we applied earlier to the base data set, we can assess the effectiveness of the Johnson transformation.

Histogram of data transformed using the Johnson transformation

QQ-Norm plot of the transformed data (x_johnson)

> adx_johnson

   "Anderson-Darling Test"

$p
[1] 0.3212095

A p-value of 0.32 from the Anderson Darling test for the transformed data set clearly indicates that we fail to reject the null hypothesis of the A-D test, and cannot rule out the possibility that the transformed data is normally distributed.

Concluding Remarks

We’ve seen in this post how the Johnson R package can be used to transform a data set that is non-normal (as a lot of real data sets are) into one that can be passed to a hypothesis test or other function that assumes normality. Transformations have wide applications, and can be used to extract meaningful information about the dynamics of a distribution, process or data set. Transformation also allows us to present data better: sometimes, when data is skewed, extreme values get highlighted at the cost of the pattern present in most of the data set. In such situations, transformations come in handy, and they can also be used in graphs to convey the scale of a phenomenon.

Further Reading

  1. Box-Cox transformation, another popular transformation method used for non-normal data.
  2. Box-Cox transformations on linear models, a more mathematical treatment.