Exploring Skewness for Odd-Exponent Transformations of Symmetric Distributions

Thanks to a question on Quora, I’ve had the chance to explore the skewness of samples from symmetric distributions, before and after odd-exponent transformations such as y = x^3. While my answer is posted there, I’d like to explore related odd transformations here and their effect on skewness. A simple experiment below reveals the impact of non-trivial means on the skewness of data under a cube transformation.

Cube Transformations for Non-Trivial Means

# Sample from a symmetric (normal) distribution with a large positive mean
x1 <- rnorm(10^4, 100, 10)
y1 <- x1^3  # cube transformation
hist(x1, breaks = 200, col = "light blue", main = "x1; Mean = 100, s.d. = 10")
hist(y1, breaks = 200, col = "light green", main = "y1 = x1^3")
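
To quantify what the histograms show, we can compute the sample skewness directly. A minimal check, using the skewness() function from the e1071 package (the same function used later in this post):

library(e1071)  # provides skewness()
skewness(x1)  # close to 0: the normal sample is symmetric
skewness(y1)  # noticeably positive: cubing has induced right skew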

Right skewness visible when samples with significant positive means are transformed.

From here it is apparent that significant positive means lead to ever-increasing right skewness in the transformed data. Significant negative means produce the opposite effect, as the figure below shows.
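
For reference, here is a minimal sketch of the mirrored experiment (the mean of -100 is my own illustrative choice, not necessarily the value that produced the figure):

x2 <- rnorm(10^4, -100, 10)
y2 <- x2^3
hist(x2, breaks = 200, col = "light blue", main = "x2; Mean = -100, s.d. = 10")
hist(y2, breaks = 200, col = "light green", main = "y2 = x2^3")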

Left skewness visible when samples with significant negative means are transformed.

Odd-Power Transformations

Next, we explore how the skewness varies with increasing exponents, for similar samples.

# Exploring skewness in symmetric distributions
# for mappings of a random variable x -> x^k, with odd k
# Non-trivial positive values of the mean

library(e1071)  # provides skewness()
data <- data.frame()

for (mu in seq(0, 10^3, 1)) {
  x <- rnorm(10^4, mu, 10)
  y3 <- x^3
  y5 <- x^5
  y7 <- x^7
  y9 <- x^9
  data <- rbind(data, c(mu, skewness(x), skewness(y3),
                        skewness(y5), skewness(y7), skewness(y9)))
}

colnames(data) <- c("mu", "skew_x", "skew_x3", "skew_x5", "skew_x7", "skew_x9")

# Plotting x -> x^3
plot(skew_x ~ skew_x3, data = data,
     main = "Skewness(x) vs. Skewness(y = x^3)",
     sub = "Nontrivial positive values of mean",
     col = "dark blue",
     pch = "*",
     xlim = c(-1, 40)
)
abline(h = 0, v = 0, col = "red")

# Plotting x -> x^5
plot(skew_x ~ skew_x5, data = data,
     main = "Skewness(x) vs. Skewness(y = x^5)",
     sub = "Nontrivial positive values of mean",
     col = "dark green",
     pch = "*",
     xlim = c(-1, 40)
)
abline(h = 0, v = 0, col = "red")

# Plotting x -> x^7
plot(skew_x ~ skew_x7, data = data,
     main = "Skewness(x) vs. Skewness(y = x^7)",
     sub = "Nontrivial positive values of mean",
     col = "dark red",
     pch = "*",
     xlim = c(-1, 40)
)
abline(h = 0, v = 0, col = "red")

# Plotting x -> x^9
plot(skew_x ~ skew_x9, data = data,
     main = "Skewness(x) vs. Skewness(y = x^9)",
     sub = "Nontrivial positive values of mean",
     col = "purple",
     pch = "*",
     xlim = c(-1, 40)
)
abline(h = 0, v = 0, col = "red")


The chance of higher skewness increases with higher non-trivial positive values of the mean.

These heavy tails are characteristic of what are called fat-tailed distributions, which are common in complex systems. Given that skewness is defined as the third standardized moment, $E\left[\left(\frac{x-\mu}{\sigma}\right)^3\right]$, it is understandable that this behaviour appears for higher values of $\mu$. However, I wonder whether this is the full explanation, or whether there are deeper technical and statistical reasons behind the pattern. For one thing, you’d expect $x$ to increase proportionally with $\mu$ for symmetric distributions. Further, do non-trivial values of $\sigma$ affect the occurrence of such fat tails? If you happen to know, please comment; I’d love to know more.
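
As a first step towards that last question about $\sigma$, here is a minimal sketch (my own, with arbitrary parameter ranges) that holds the mean fixed and varies $\sigma$, recording the skewness of the cube each time:

# Does sigma affect the skewness of y = x^3 at a fixed mean?
library(e1071)
sigmas <- seq(1, 100, 1)
skew_y3 <- sapply(sigmas, function(s) skewness(rnorm(10^4, 100, s)^3))
plot(sigmas, skew_y3,
     main = "Skewness(y = x^3) vs. sigma; mu = 100",
     col = "dark blue",
     pch = "*")
abline(h = 0, col = "red")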

Azure ML Studio and R

A decade ago, Microsoft looked very different from the Microsoft we see today; it has been a remarkable transformation. One of the areas where Microsoft has made a big push is machine learning and data analytics. Although the CRAN repository is going strong, with more than 10,000 packages as of today, the MRAN repository (the Microsoft R Application Network) is adding libraries and functionality that were missing from the R stack. Ever since they acquired Revolution Analytics, they’ve also integrated R into their data science and ML offerings in a big way. For instance, Power BI can run R scripts that produce visualizations for dashboards. They’ve released a number of products that add to or complement the Office suite that is the bedrock of Microsoft’s software portfolio, and of late they have pushed Azure and their own machine learning algorithms in a big way. A year is a long time in the world of big data and machine learning, and now, on Azure ML Studio, anyone interested in big data and data science can get started with data analysis in a pleasant, user-friendly interface.


The Azure ML Studio Interface

I have had the chance to play around a little with Azure ML, and here are some of what I find to be its strong points. Above, you can see a simple data processing step I set up within Azure ML Studio: taking a data set and subjecting it to some transformations.

It is possible to summarize and visualize this data pretty quickly, using some of the point-and-click summaries you get from the outputs of the boxes in the workflow.


Simple summaries of dataframes and CSV files are easy

What’s nice about this simple interface is the ability to bring multiple variables into one view, and to explore a given variable in different ways. Here, I’ve scaled both axes to get a log-log plot, and I am able to see the variation in the MPG values for the sample data set in question. Very handy when you want to quickly test one or two hypotheses.

What ML Studio seems adept at is bringing together R, Python and SQL in the same interface. This makes it particularly powerful for multi-language data analysis. True to this capability, you can bring in an R kernel for data analysis. Sure enough, you can use Python too (if you’re like me, you use Python and R almost equally).


Interface allows for opening Jupyter notebooks with R and Python kernels

Once you have a Jupyter notebook opened up, you can perform analysis of all kinds in it – everything available in open-source R is apparently supported within Azure ML Studio. The thing about Jupyter notebooks, of course, is that you can’t yet use multiple kernels in the same notebook: you can use R, or Python, or Julia, for instance, but that language choice is fixed for a given notebook. There is a discussion around this, but I’m unsure whether it has been resolved. Although R support in Jupyter notebooks is a little sketchy, seasoned R coders can use it well enough. The REPL interface of RStudio is a bit nicer for R programming (and harder to get away from, for me personally) than Jupyter, but Jupyter does work well for the most part. Kernels are managed remotely and abstracted away from the user, so there is no need to SSH into a Jupyter server and so on. Data analysis can start right away, because the distractions are gone.


Jupyter notebook running remotely on ML Studio server with an R kernel.

Building models is easy enough in R as it is, since so many packages provide a very simple interface, and doing this in Jupyter is no different: a breeze as usual. One bug I did seem to run into is the inability to change graph sizes with the standard par() settings such as mar; other than that, graphs render well enough within Jupyter.
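
For what it’s worth, plot dimensions in an R Jupyter kernel are usually controlled through the repr package’s options rather than through par(); here is a sketch of the usual workaround, assuming the IRkernel/repr setup that Jupyter’s R support typically uses:

# Set plot dimensions (in inches) via the repr package used by IRkernel
options(repr.plot.width = 8, repr.plot.height = 5)
# Margins can still be adjusted per plot through par()
par(mar = c(4, 4, 2, 1))
plot(mtcars$wt, mtcars$mpg, main = "MPG vs. weight")  # mtcars ships with R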


Simple R graph rendered in Azure ML Studio

Overall, with Azure ML Studio, we’re looking at a mature, user-friendly web app for machine learning and data science, one that provides a good deal of interactivity and lets code be integrated right into the workflows, which is quite a coup, in my opinion. For prototyping and exploratory data analysis, it can produce a repeatable workflow that is easily understood by others and shared.

  1. The interface is great – it brings together notebooks, data sets, models pre-trained by Microsoft, and so on, together in one nice interface.
  2. One value addition in the interface is the ability to separate out different contexts very clearly. You can clean data with a certain part of it, organize your dataframes with another, and so on.
  3. The drag-and-drop functionality is actually pretty good and works conveniently for those interested in mixing code with a visual interface.
  4. The Jupyter notebook integration is sketchy with R (more an issue with Jupyter than Azure ML studio, in my experience) – but works well enough for most things to do with data frames and simple visualizations.
  5. In addition to what we saw in the notebook, there’s also the possibility of directly embedding R code into the ML Studio workflow as a cell, as sketched below.
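
For context on that last point, R embedded in the workflow follows a simple input/output convention. A sketch of what such a script typically looks like (maml.mapInputPort() and maml.mapOutputPort() are the port-mapping calls documented for ML Studio’s Execute R Script module; the transformation itself is purely illustrative):

# Read the data frame wired into the module's first input port
dataset1 <- maml.mapInputPort(1)
# Any ordinary R code can run here; this assumes an 'mpg' column for illustration
dataset1$log_mpg <- log(dataset1$mpg)
# Send the resulting data frame to the module's output port
maml.mapOutputPort("dataset1")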

Hope you liked this little tour of ML Studio. I enjoyed playing around with it!


Quora Data Science Answers Roundup

I’m given to spurts of activity on Quora. Over the past year, I’ve had the opportunity to answer several questions there on the topics of data science, big data and data engineering.

Some answers here are career-specific, while others are of a technical nature. Then there are interesting and nuanced questions that are always a pleasure to answer. Earlier this week I received a pleasant message from the Quora staff, who have designated me a Quora Top Writer for 2017. This is exciting, of course, as I’ve been focused largely on questions around data science, data analytics, hobbies like aviation and technology, past work in mechanical engineering, and a few other topics of a general nature on Quora.

Below, I’ve put together a list of the answers that I enjoyed writing. These answers have been written keeping a layperson audience in mind, for the most part, unless the question itself seemed to indicate a level of subject matter knowledge. If you like any of these answers (or think they can be improved), leave a comment or thanks (preferably directly on the Quora answer) and I’ll take a look! 🙂

Happy Quora surfing!

Disclaimer: None of my content or answers on Quora reflect my employer’s views. My content on Quora is meant for a layperson audience, and is not to be taken as an official recommendation or solicitation of any kind.

Video: Talk from Strata+Hadoop World 2016 Singapore


Early in December 2016, I spoke at the Strata+Hadoop World 2016 Singapore conference on sensor data analysis approaches, specifically time series analysis. My company, The Data Team, was represented at Strata+Hadoop World at the innovator’s pavilion. The conference was a wonderful learning experience for me, and I came away with the following key takeaways:

  1. There is a lot of interest in advanced machine learning algorithms, particularly deep learning and the capabilities it offers
  2. Platform-level innovation is still driving a large part of the big data and data science world forward; Hadoop ecosystem projects abound, with plenty of variety at the moment
  3. There is significant interest in Apache Spark and the capabilities it provides to data science teams. With growing data processing needs and the need to distribute data processing, the Databricks team have improved Apache Spark’s interface and performance, so that it is easier to use than before and scales with fewer teething problems
  4. Finally, there is growing interest in the Internet of Things. There is increasing momentum both on the platform side, where new frameworks, architectures and ideas around such systems are discussed, and on the data science side, where sensor data analysis approaches and best practices are being worked out.

My talk at Strata+Hadoop World 2016 was on the subject of sensor data analysis. The talk discussed a broad range of approaches for the analysis of aggregate (i.i.d.) data and time series data. You can find a video of the talk at this link. Slides from my talk are here (as a ZIP file).

The Year 2016 in Data

The Year 2016 in my mind will be associated with three key things, with respect to data:

  1. A career transition from engineering, product development and quality management to a career in data science, big data analytics and strategy consulting
  2. Learning how to learn better: updating my own skills through constant study, reinforcement of key ideas, and application
  3. Gaining new focus areas, and making the journeys from explorer to purveyor and from surveyor to practitioner

I’ll discuss each of these aspects of the year in data below.

Each of these three changes has had a profound impact on my life and career in data so far. Thanks to the excellent team I work with, I’ve been able to appreciate ideas from disparate fields, and I’ve also been able to contribute to their understanding of the domains I have experience in. In this, there is real satisfaction. I’ve written earlier on Medium about the importance of good career transitions and how this blog, amongst other things, helped me develop the skills required for my future career.

There is a constant tension that anyone learning new skills learns to embrace. One contributor to this tension is the relief and contentment of understanding something. Another is the need to deliberately set aside the patterns you’re used to seeing in situations, and to see those situations through new patterns based on your new knowledge. The more complex the framework or idea, the harder this second kind of learning and application is to execute. The former is easier than the latter, and in my mind this is why experienced practitioners have difficulty changing their mindset and face challenges when adopting new ways of thinking to displace the old.

Whether it was learning new technologies and frameworks or developing skills in entirely new fields, such as software systems development or databases, both relatively new to me at the start of the year, I found myself galloping to catch up, often having to learn richer models of these ideas and do away with simplistic ones. This was both a fascinating and, at times, debilitating experience, as I have explained above. One aspect that enabled this was better time management; another, which isn’t often discussed, is reinforcing the important ideas through repetition. Such repetition and reinforcement can enable us to learn complex subjects and apply ideas from them, while managing disparate objectives.

Gaining new focus areas was another key feature of 2016. Novelty brings with it the need to step out of old, long-established comfort zones. It also brings risk, and the possibility of failure. This journey will continue, of course, and there are likely to be many stepping stones along the way to insight. Just as we plod through data and analyse, dissect and refactor data sets in many ways to gain new insights from models, there are multiple approaches to addressing new focus areas, and novelty lets us examine our inventory of approaches in light of new ideas and experiment with them. The transition from explorer to purveyor encourages us to take on what Nassim Nicholas Taleb calls “skin in the game”. It makes someone else’s problem our problem, as is so often the case in consulting, and helps build empathy for other people, their situations and their organisations. Similarly, the journey from surveyor to practitioner involves applying your lessons from a few exercises (which were hopefully well constructed, and which hopefully developed true skill) to the real world. This is the bridge between theory and practice, between the system and the landscape, between the rubber and the road, to quote two more analogies.