Quick recap: the Department of Biostatistics at Johns Hopkins School of Public Health offers a full-on specialization in “Data Science” through Coursera, consisting of nine courses and a “capstone project”. The specialization certificate is supposed to testify that students are proficient in getting data, formatting it, graphing it, extracting useful knowledge from it, drawing and communicating conclusions from it, and so on. The emphasis is on using R, although the skills are supposed to be broadly applicable to other systems.
In detail, the sequence is made up of nine courses:
- The Data Scientist's Toolbox
- R Programming
- Getting and Cleaning Data
- Exploratory Data Analysis
- Reproducible Research
- Statistical Inference
- Regression Models
- Practical Machine Learning
- Developing Data Products
The courses are free, but if one shells out for them ($49 or 35€ each, depending on your currency zone of residence) one gains access to a capstone project and a specialization certificate.
I haven't yet made up my mind whether to do the whole specialization or simply take the free courses. I have a handful of days to decide.
What's in it for me?
Well, I sort of know a lot of this stuff from before. Using git and GitHub for collaboration is something I do daily, and programming (though not in R) is how I make my living; plus of course I've taken MIT's The Analytics Edge (a very intense, hands-on, business-oriented course on using R for analytics) and UC Berkeley's Introduction to Statistics, so I know my way around most of the material.
I am therefore not in a “first time learner” situation − rather, the curriculum is more about consolidating the knowledge I do have, formalizing it, and getting an overall certification to somehow “prove” my mastery of it (the acceptability of this proof − by potential employers and / or academics − remains to be assessed).
The courses themselves
So far − in one week! − I've taken five courses out of nine, and completed two. This may sound impressive, but it's not − as I said, I'm hardly a first-time learner.
The Data Scientist's Toolbox is a very short introduction to the overall specialization. The main point is to install RStudio and create a GitHub account. Doable (including the quizzes and project) in two hours, and generally dispensable (and part of the reason why I balk at doing the specialization − 35€ is very expensive for three clicks on a website).
R Programming gets a bad rap on review sites. I sort of understand why: it's a rather heavy-handed introduction to R, and I guess it's pretty incomprehensible to those who have never written a line of code in their life and pretty abstract for most who have never toyed with R.
For me, as someone who has used R but never been formally introduced to it (like, I never figured that everything was a vector and I had difficulty wrapping my head around the difference between single and double square brackets, to say nothing of the scoping rules and the notion of “environments”), it was a nice crisp clarification of the essential concepts of the language. I guess having it as a prerequisite for the rest of the sequence isn't a great idea: really, one can use R without understanding it, and understanding is better approached after a degree of use. Laying it out like this is very bottom-up, very French I would say: first slog through abstract concepts, then learn to apply them − I have come to prefer the other way around: build an intuition, then consolidate the knowledge and learn how to go further. Anyway, I did the whole course in a day and gained a good understanding of how R works in the process, so that's no bad thing.
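For readers who have not met them, here is a minimal sketch of the three R concepts mentioned above (everything is a vector, single versus double square brackets, and scoping through environments). The variable and function names (`scores`, `make_counter`) are made up for illustration and do not come from the course material.

```r
# Everything is a vector: a "scalar" is just a vector of length one.
x <- 42
is.vector(x)   # TRUE
length(x)      # 1

# Single brackets return a sub-container of the same type;
# double brackets extract the element itself.
scores <- list(quiz = c(8, 9, 10), project = 17)
class(scores["quiz"])    # "list"
class(scores[["quiz"]])  # "numeric"

# Lexical scoping: a function looks up free variables in the
# environment where it was defined, not where it is called.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates `count` in the enclosing environment
    count
  }
}
counter <- make_counter()
counter()  # first call returns 1
counter()  # second call returns 2
```

The counter keeps its state between calls because the inner function carries the environment created by `make_counter()` along with it.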
Getting and Cleaning Data is broadly a walkthrough of R's data gathering libraries. How to connect to a database, how to download a file, etc. It suffers from the lecture-then-exercise syndrome. I'm not sure how I would tackle this, really, except by drawing pictures of the different input formats and of the general target (a clean data frame in R), giving pointers to the documentation, and offering hand-holding exercises on real data rather than a slow demonstration of each function on made-up data. The Coursera platform is a hurdle there: similar endeavours were much easier on edX, where you can have long exercises with multiple questions, each with immediate feedback − I'm thinking of the very time-consuming exercise sets in The Analytics Edge, which had much, much better learning value than the simplistic, submit-all-at-once quizzes that Coursera provides.
The course isn't very challenging, but it's useful stuff to know. I'm more or less taking the course on schedule, if a little ahead (working up the energy to do the week 2 quiz).
Exploratory Data Analysis is, similarly, a walkthrough of R's graphics libraries, and has the same pros and cons as the previous course. Having already been exposed to most of it, I find the course a crisp recap of everything. I doubt I would enjoy it very much if it were the first time I was exposed to it.
Statistical Inference is the most-decried course of the sequence, so in order to decide whether to take the overall specialization or not, I registered for it. I understand its detractors: it's very fast, very abstract. Basically, in four weeks, Prof. Caffo runs through the same curriculum as Prof. Adhikari did (albeit annoyingly slowly at times) in UC Berkeley's fifteen-week Introduction to Statistics, with the same issues as the rest of the sequence: it's quite technical and abstract, and rather difficult to connect to (though it's hard to be practical when explaining mathematical constructs). I guess I'll be referring to Adhikari's slides more than this course's native ones, but I don't really expect the course to be very challenging.
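To make the “clean data frame” target of Getting and Cleaning Data concrete, here is a small sketch using only base R. The data and column names are invented for illustration; the course itself works with other sources (databases, downloaded files) that are not reproduced here.

```r
# Hypothetical raw input: awkward header names and a missing value.
raw <- "Patient ID,Visit Date,Weight (kg)
1,2014-06-01,70.5
2,2014-06-03,NA
3,2014-06-07,82.0"

# read.csv() accepts a string through its `text` argument;
# check.names = FALSE keeps the headers exactly as written.
df <- read.csv(text = raw, stringsAsFactors = FALSE, check.names = FALSE)

# Give the columns tidy names, parse the dates, drop incomplete rows.
names(df) <- c("patient_id", "visit_date", "weight_kg")
df$visit_date <- as.Date(df$visit_date)
clean <- df[!is.na(df$weight_kg), ]
```

Whatever the input format, the end state is the same: one row per observation, one typed column per variable.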
These were the five courses I've sampled. The rest are:
Reproducible Research is about communicating research by using R Markdown and knitr to create documents with live, embedded R code. Interesting stuff generally, and I guess useful skills to have if one intends to do statistics professionally, but not so pressing that I prioritized the course, so I'll take it later. Four weeks seems kind of long to do that kind of thing.
Regression Models is the other mathematically-grounded course in the sequence, and the other stumbling block for students. I think I'll be okay with it, but realistically I can't sample it this month. We'll see in September or October.
Practical Machine Learning will likely be − again − a formalization of stuff I know from The Analytics Edge.
Developing Data Products sounds like another course about communicating results, half about “good practices” and half about using Shiny (R's own web framework − why is it that all languages must have their web development framework? Even Fortran has a CGI interface…)
The Capstone project seems to be about wrapping this up in a real-life situation.
So… why am I considering taking the whole caboodle?
Based on the courses I have taken so far (about half), the sequence is pedagogically deficient − or rather, it's traditional in its approach of stuffing lots of science in the face of students and then expecting them to get on with it. I expect they have a high dropout rate (even higher than the baseline MOOC rate). Comprehensive as it is, the sequence is ill-suited to people approaching the subject for the first time. The course page says there are no prerequisites in terms of analytics or programming, but I don't find it so: it's more a recap / advanced course than an introduction to the field.
In that way, it's broadly what I'm after. I don't feel like I know a subject until I've studied the theory a bit†. Since I am vaguely considering reinventing myself as a biostatistician or bioinformatician‡, or at least keeping my options open, it may be worthwhile investing a bit of time (and some euros) into it.
Oh well. It's subjective arithmetic. If the course were stellar, I wouldn't hesitate long before paying. As it is, I dither.
Postscript
I slept (well, napped) on it. Rereading what I wrote, it's kind of obvious I'm not really interested in pursuing the specialization certificate; there are better ways to spend 350€ than to rehash mostly-known subjects.
I may decide otherwise at another time − all I need to do is retake the courses, which means a few days doing quizzes and projects again. Doubt it'll be difficult.
If I need some certification or other there are probably better ones around, starting with Duke's Data Analysis and Inference.
† “A bit” as in, I don't need to know how to prove a theorem to use it, but I need to know I am using a theorem rather than a pre-baked recipe I'm not comfortable improvising on.
‡ On the premise that it's more useful to the human race than general web-based development, and anyway the kids who've done a week of Node.js and therefore know everything there is worth knowing about computer science are taking the fun out of general programming.