Notes and Comments on Data Science

Here are some notes and comments I’ve made over the past several months and years that reflect my current thinking about doing data science and being productive as a data scientist. Most of these comments come from my LinkedIn engagement and should be read in that context.

  • Python is as elegant a language as they come, and frameworks like Pandas have a lot going for them, but we’re still far from a declarative paradigm here, and that paradigm seems to be the biggest productivity enhancer. Effective tooling is what really reduces time-to-insight.
  • Great set of slides by Arvind Narayanan, which I read with interest. To paraphrase Duncan Watts from Twitter, you could replace “AI” here with other technologies/capabilities that are sparsely known but admired/reviled/feared, and still have a lot of it be valid. The issue then is in the marketing-to-consumer value chain, and not the technical capabilities of ML/AI systems themselves. Good marketers ought to help people discover value from what is being marketed – in this case, that is clearly not happening. (link)
  • They published a version of this using TensorFlow some time back, and the original with MXNet was pretty good in itself, with NumPy-esque matrix operations within MXNet being used for several demos. What I like here is that you see text, equations and code in one place, making it an ideal resource to explore, experiment and learn. (link)
  • Picked up this book recently too. I like the writing style, and the discussions around seq-to-seq models. The core content in the beginning does a good job of covering all the text processing required for NLP. I foresee revisiting this book many times in the coming months/years! (link)
  • In light of the popularity of open source, it would be interesting to see if MathWorks makes any kind of OSS play. We’ve seen Microsoft change its approach and benefit hugely in the recent past; perhaps it is time for others to follow suit. There’s a lot of value to be added to the market by bringing out a free product variant. (link)
  • While Polynote is helpful, many more DS folks will use the Python integrations in VS Code than this. Scala is not used as widely as Python for doing data science in notebooks, and those who want to use Scala for machine learning applications may as well switch to a full-featured IDE such as IntelliJ or Scala IDE.
  • You have a book by Albert-László Barabási in there, whose work I found inspiring. He wrote a book titled “Linked”, which I read very enthusiastically some years ago. I would also recommend papers by Duncan Watts, Mark Newman and Steven Strogatz, and I see that you already have a book by Mark Newman in this list. The story seems to start with Erdős and Rényi and their theory of random graphs; a couple of tomes from their day are also widely available.
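
For readers who have not met the random graphs mentioned in the last note, here is a minimal sketch of the G(n, p) model usually associated with Erdős and Rényi: every possible edge between n nodes is included independently with probability p. The function names `gnp_random_graph` and `mean_degree`, and the values of `n`, `p` and `seed`, are my own illustrative choices and are not taken from any of the books above.

```python
import random
from itertools import combinations


def gnp_random_graph(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph on nodes 0..n-1.

    Hypothetical helper for illustration: each of the n*(n-1)/2 possible
    undirected edges is kept independently with probability p.
    """
    rng = random.Random(seed)  # local generator, so a seed gives repeatable results
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def mean_degree(n, edges):
    """Average number of neighbours per node (each edge touches two nodes)."""
    return 2 * len(edges) / n


if __name__ == "__main__":
    n, p = 200, 0.05  # illustrative values only
    edges = gnp_random_graph(n, p, seed=42)
    # In this model the expected mean degree is p * (n - 1).
    print("edges:", len(edges))
    print("mean degree:", mean_degree(n, edges), "expected:", p * (n - 1))
```

The observed mean degree should land near p * (n - 1) for a graph of this size, which is a quick way to convince yourself that the sampling matches the model. Libraries such as NetworkX ship their own generators for this, so treat the above as a learning aid, not a replacement.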