Although I completed my PhD in 2010 and have been collecting data for my postdoctoral research project on the birds of the Fynbos, it's been over 20 years since I did stats 101 at university. I've muddled through, partly by focusing on a few statistical techniques, but it was brought home to me that I needed to do something about my patchy statistical knowledge when I had a paper rejected from the world's lowest-ranked ornithological journal, partly because of a basic statistical error.
Up until this year I didn't even know what a MOOC was – it stands for Massive Open Online Course; basically, learn anything online, and for free. Some of the big platforms hosting these include Coursera, edX and Udacity, and you can read a comprehensive review of these big three here:
So how did it all start for me? My university sent out a postgrad-development-program email including links to some Coursera classes, where I spotted the Foundations in Statistical Inference course offered by Duke University. So in February I enrolled in my first MOOC. The course consisted of weekly lectures that could be downloaded or viewed online as videos, as well as course notes, a link to a free basic stats book, weekly practical tutorials using the R programming language, weekly quizzes, a midterm exam, a course stats project, and a final exam.
The course was put together by Mine Çetinkaya-Rundel; the material was clear, lucid and the best of all the online courses I was subsequently to take. The course does what it says on the box – it leads you by the hand from understanding means and standard deviations all the way to an introduction to multiple regression. Since there are thousands of people signed up to any MOOC, support is provided not through interactions with the teacher but through the online discussion forums, with students helping each other. Yip – cheating is practically legitimized (ok, answers to quiz questions are not posted, but steps on how to get them often are; ok – not cheating, but serious collaboration).
In February I started my first Johns Hopkins Data Science specialization courses: the foundation courses being The Data Scientist's Toolbox and R Programming. You won't get anywhere on this course without embracing the R language (and if you have to analyse data at any level, you probably should do this anyway). The Data Scientist's Toolbox is a good overview of the rest of the course, and includes a motivational first video. It's a very easy 'tick' in the series of nine courses. I believe it is presented by Jeff Leek, who has a clear lecture delivery style, is well prepared, and imparts a lot of information very quickly. For the later classes I had to pause videos frequently in order to back up over key concepts, but that is the joy of being able to do these things at your own pace (it is all doable if you are disciplined or motivated).
I've been getting by for the last decade with SPSS for my statistics. However, this is expensive licensed software, and it was clear from conversations with clever colleagues that there were multiple benefits to learning R programming – not least that it is free. R, though, is like learning a real language, and it takes time to get to grips with its syntax and idiosyncrasies. Six months into all of this, and I still struggle with some aspects of its use for things I could do very quickly and easily with Microsoft Excel. With Excel, at least, what you do is presented straight away in front of you in terms of data manipulation and reshaping, while in R it's all hidden away in data.frames, and little mistakes can severely f***up results. However, my mind has also been blown open by the possibility of all the things that can be done, from charting, to exploring and acquiring data, to running extremely complicated data analysis models.
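To give a flavour of what "hidden away in data.frames" means in practice, here is a minimal sketch of my own (not course material), using nothing but base R and its built-in airquality data set: inspect the data.frame, summarize it by month, and draw a quick exploratory chart.

    # Look at the structure of the built-in airquality data.frame
    data(airquality)
    str(airquality)        # column types and dimensions
    head(airquality)       # peek at the first few rows

    # Mean ozone reading per month (rows with missing Ozone are dropped)
    monthly <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
    monthly

    # A quick exploratory chart of ozone against temperature
    plot(Ozone ~ Temp, data = airquality,
         xlab = "Temperature (F)", ylab = "Ozone (ppb)",
         main = "Ozone vs temperature (airquality)")

Nothing here is shown on screen until you ask for it, which is exactly the adjustment an Excel user has to make.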
Enter R Programming, presented by Roger Peng. He is great, and he is the only one of the three lecturers to video himself as part of his lecture presentation, which makes it all a little more personal – which is actually really important considering the whole series could otherwise be presented by some arcane Artificial Intelligence robot. The videos are a little less polished, and I often found myself distracted trying to read the book titles on his bookshelf, or amusing myself by following the movement of coffee cups and personal items between video lectures.
Then I did something a little crazy
during April/May once the Duke course was finished. I took four of
the courses simultaneously – Getting and Cleaning Data (by
Jeff, excellent); Exploratory Data Analysis (Roger Peng, good
– a bit unclear at the end); Reproducible Research (Roger
Peng, fundamental lessons here – very important); and Statistical
Inference. Basically, the MOOCs became a full-time occupation because each one takes about 8 hours a week, and some of the projects
can take days if you get stuck, especially if you are learning R
along the way. And don't kid yourself – you really need AT LEAST
the recommended hours to get through each course proficiently.
Ok – now – back to that last course, Statistical Inference. Having just done the Duke MOOC I was pretty sure the Johns Hopkins version wouldn't be an issue – it was only a typical four-week Data Science course. However, it is without a doubt the worst of the series, with about the most terrible lecture style I have ever encountered in my life. Feedback on the discussion boards was scathing, and included an attempt to start a petition to refund those Coursera students who had paid the fee for Signature Track – i.e. those who wanted official recognition for their course participation. My course score for the Duke MOOC was 85% – and as I'd been on holiday for two weeks of it, I had missed a quiz and the project proposal submission deadlines, which all counted for points. But despite completing everything for the Johns Hopkins course I scored only 72% – in other words, my basic understanding of statistical inference at the end of the second course was actually worse! By comparison, I scored 100% in Reproducible Research and Getting and Cleaning Data. The Statistical Inference course notes were also a disaster – I can only hope that things have improved for those taking newer versions of the course. The presenter, Professor Brian Caffo, may be some latter-day genius in his field, but that does not translate to good teaching style by any means. I also had to suffer through Regression Models, where I am sure it was only some a priori knowledge of these subjects that got me through.
At the moment I am in the last weeks of the Practical Machine Learning module, which has been a real eye-opener, and TG it's Jeff Leek. I have one more course to go – Developing Data Products – and then, apart from a Capstone project for those doing the paid version of the course, I'll have nailed it. So far it's been worth it, mostly because I am far more confident in using R – which, like any language, only gets better the more you use it. And unlike stats 101 twenty years ago, all paper and equations, I can honestly say that stats is fun now. I never thought it possible that I could say that – but really, the way one can quickly visualize complicated data sets, explore data and interpret data – it's almost like telling (or writing) a story, only with numbers and charts on a laptop. And the utility of it all – well, the sky is the limit (literally; get good with these skills and you could work for NASA).
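As a tiny, purely illustrative example of that story-telling (my own sketch, not from any of the courses): a single call to base R's pairs() turns a built-in data set into a scatterplot matrix, coloured by species, that you can start interpreting straight away.

    # Scatterplot matrix of the built-in iris measurements, coloured by species
    data(iris)
    pairs(iris[, 1:4],
          col = as.integer(iris$Species),
          main = "Iris measurements by species")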
So – thanks to Coursera and Johns Hopkins University – this education revolution will change the world. Get on board before national governments start to see free and fair education as a threat to national job security and start to regulate who can participate. That, or global demand brings down the servers – in fact I wrote this entire post while waiting for the Coursera website to come back online from a temporary time-out. In the words of Rob Schneider (Adam Sandler's sidekick): “You can do it!”