Friday, 25 July 2014

Making it through my MOOC: the Data Science specialization through the Johns Hopkins University

Although I completed my PhD in 2010 and have been taking data through my postdoctoral research project on the birds of the Fynbos, it's been over 20 years since I did stats 101 at university. I've muddled through, partly by focusing on a few statistical techniques, but it was brought home to me that I needed to do something about my patchy statistical knowledge when I had a paper rejected from the world's lowest ranked ornithological journal, partly because of a basic statistical error.

Up until this year I didn't even know what a MOOC was – it stands for Massive Open Online Course; basically, learn anything online, and for free. Some of the big platforms for hosting these include Coursera, edX and Udacity, and you can read a comprehensive review of these big 3 here:

So how did it all start for me? My university sent out a postgrad-development-program email including links to some Coursera classes, where I spotted the Foundations in Statistical Inference course offered via Duke University. So in February I enrolled in my first MOOC. The course consisted of weekly lectures that could be downloaded or viewed online as videos, as well as course notes, a link to a free basic stats book, weekly practical tutorials using the R programming language, weekly quizzes, a mid-term exam, a course stats project, and a final exam.

The course was put together by Mine Çetinkaya-Rundel; the material was clear, lucid and the best of all the online courses I was to subsequently take. The course does what it says on the box – and leads you by the hand from understanding means and standard deviations all the way to an introduction to multiple regression. Since there are thousands of people signed up to any MOOC, support is provided not through interactions with the teacher but through the online discussion forums, with students helping each other. Yip – cheating is practically legitimized (ok, answers to quiz questions are not posted, but steps on how to get them often are; ok – not cheating, but serious collaboration).

In February I started my first Johns Hopkins Data Science specialization courses: the foundation courses being the Data Scientist's Toolbox, and R programming. You won't get anywhere on this course without embracing the R language (and if you have to analyse data at any level, you probably should do this anyway). The Data Scientist's Toolbox is a good overview of the rest of the course, and includes a good motivational first video. It's a very easy 'tick' in the series of 9 courses. I believe it is presented by Jeff Leek, who has a clear lecture delivery style, is well prepared, and imparts a lot of information very quickly. For the later classes I had to pause videos frequently in order to back up over key concepts, but that is the joy of being able to do these things at your own pace (it is all doable if you are disciplined or motivated).

I've been getting by for the last decade with SPSS for my statistics. However, this is expensive licensed software and it was clear from conversations with clever colleagues that there were multiple benefits in learning R programming - not the least being that it is free. However, learning R is like learning a real language, and it takes time to get to grips with its syntax and idiosyncrasies. Six months into all of this, I still struggle with some aspects of its use for things I could do very quickly and easily with Microsoft Excel. With Excel at least what you do is presented straight away in front of you in terms of data manipulation and reshaping, while in R it's all hidden away in data.frames, and little mistakes can severely f*** up results. However, my mind has also been blown open by the possibility of all the things that can be done, from charting, to exploring and acquiring data, to running extremely complicated data analysis models.

Enter R programming – presented by Roger Peng. He is great, and he is the only one of the three lecturers to video himself as part of his lecture presentation, which kind of makes it all a little more personal – which is actually really important considering the whole series could be presented by some arcane Artificial Intelligence robot. The videos are a little less polished, and I often found myself distracted trying to read the book titles on his bookshelf, or amusing myself by following the movement of coffee cups and personal items between video lectures.

Then I did something a little crazy during April/May once the Duke course was finished. I took four of the courses simultaneously – Getting and Cleaning Data (by Jeff Leek, excellent); Exploratory Data Analysis (Roger Peng, good – a bit unclear at the end); Reproducible Research (Roger Peng, fundamental lessons here – very important); and Statistical Inference. Basically, the MOOCs became a full-time occupation because each one takes about 8 hours a week, and some of the projects can take days if you get stuck, especially if you are learning R along the way. And don't kid yourself – you really need AT LEAST the recommended hours to get through each course proficiently.

Ok – now – back to that last course, Statistical Inference. Having just done the Duke MOOC I was pretty sure the Johns Hopkins version wouldn't be an issue – it was only a typical Data Science 4 week course. However, it is without a doubt the worst of the series and about the most terrible lecture style I have ever encountered in my life. Feedback on the discussion boards was scathing, and included an attempt to start a petition to refund those Coursera students who had paid the fee for Signature Track – i.e. those that wanted official recognition for their course participation. My course score for the Duke MOOC was 85% - and as I'd been on holiday for 2 weeks of it I had missed a quiz and the project proposal submission deadlines, which all counted for points. But despite completing everything for the Johns Hopkins regression course I scored only 72% - in other words my basic understanding of statistical inference at the end of the second course was actually worse!!!! By comparison, I scored 100% in Reproducible Research and Getting and Cleaning Data. The Statistical Inference course notes were also a disaster – I can only hope for those taking newer versions of the course that things have improved. The presenter - Professor Brian Caffo - may be some latter-day genius in his field, but that does not translate to good teaching style by any means. I also had to suffer through Regression Models, where I am sure it was only some prior knowledge of these subjects that got me through.

At the moment I am in the last weeks of the Practical Machine Learning module, which has been a real eye-opener, and TG it's Jeff Leek. I have one more course to go – Developing Data Products, and then apart from a Capstone project for those doing the paid version of the course, I'll have nailed it. So far – it's been worth it, mostly because I am far more confident in using R – which like any language, only gets better the more you use it. And unlike stats 101 twenty years ago, all paper and equations, I can honestly say that stats is fun now. I never thought it possible that I could say that – but really, the way one can quickly visualize complicated data sets, explore data and interpret data – it's almost like telling (or writing) a story – only with numbers and charts on a laptop. And the utility of it all – well, the sky is the limit (literally; get good with these skills and you could work for NASA).

So – thanks to Coursera and Johns Hopkins University – this education revolution will change the world. Get on board before national governments start to see free and fair education as a threat to national job security and start to regulate who can participate. That, or global demand brings down the servers – in fact I wrote this entire post while waiting for the Coursera website to come back online from a temporary timeout.

In the words of Rob Schneider (Adam Sandler's sidekick) - “You can do it!”


  1. Thank you very much for this post. I've been thinking about starting some part-time studies to improve my understanding of how to correctly use statistics in biology, but I wasn't sure where to start. Some of the courses I looked at were either too time-consuming or too expensive. I'll definitely have a look at your suggestions in this post, thanks.

  2. I am in week two of Genetics and Evolution from Dr Noor at Duke University (USA).

    Am on wait list for a Physiology course and one on AIDS.

    and it turns out I will be seeing you in October 2015 with BE. I am looking forward to getting to know your homeland and meeting your wonderful birds, plants ( even the ass sticker ) and scaring up some kitties.

  3. Thank you so much for sharing this article in detail.

