Thursday, August 28, 2014

Back to school (sort of)

So the rainiest month of the year (in Paris − not that that's usual, mind, and not that I complain: I prefer wet Augusts to sweltering ones, to be sure) is drawing to a close. The academic year is starting, my former local sub-mayor is now education minister, and MOOCs are starting left and right. Not that they've stopped, really… Anyway, time for a sum-up:

Courses that are ending

Astro2, Exoplanets (Australian National University)

Well, Astro1 was great fun, Astro2 almost as much. I say “almost” not because the course itself is less good, but because by necessity it's a lot about the technology behind exoplanet discovery − not something I'm very interested in. Still, the staff at ANU made it a lot of fun, so there.

(It'll be my 15th edX certificate, 20th overall!)

The Emergence of Life (University of Illinois, Urbana-Champaign)

A big disappointment. Broadly, the course is supposed to be a quick run through the history of life as we know it (and not so much about its emergence, really, but at least that's up front). The problem is that it's dumbed-down and inaccurate, and the lectures are quite confused. I'm sticking to it because there are chunks I don't know about (“spot-the-fossil” and the skeletal morphology criteria for classification, mostly), but more often than not I find I'm shaking my head. It's less like a class and more like a

No certificates here: they're not free and definitely not worth paying for.

A bunch of Data Science courses (Johns Hopkins University)

I know I wrote I wouldn't tackle the Specialization… but I had second thoughts, so I registered for Signature track on the first four modules of the Specialization, plus auditing the Statistical Inference one (which I had heard many people complain about, saying it's hard and obtuse). Actually… once one does the projects and goes beyond the first week or so of each course, they get pretty good. The Statistical Inference course tries to run through a lot of unintuitive material in really too little time, but − after having done UC Berkeley's Introduction to Statistics course − I find it very interesting, very stimulating. I'm glad I'm only auditing it − this way when I take it “seriously” it'll be a review and hopefully by then I'll understand it better.

In any case, I expect four verified certificates to land in my pocket in the coming few weeks.

So if I'm counting right, I'm virtually the proud owner of something like 24 certificates. 20 obtained in 2014. Not bad… 


Upcoming plans

I've rearranged my schedule a bit, ditching a number of accessory courses that I couldn't seriously fit alongside the rest. Still, next week is a busy one, with no fewer than five courses starting at the same time.

Explore Neural Data, Brown

Data analytics + neurology. In Python. Cool.

Fundamentals of Neuroscience part 2, Harvard

More neuroscience! Actually I'm mostly taking this one because I took the first part. Not sure I'll keep both neuro courses (then again, at the time I thought little of it, but after a while I find I keep using the concepts of Neuro 1; so it's been a good use of my time.)

Dino 101, U. Alberta

I don't expect much from this one: a lightweight dino course to pass the time.

Introduction to systems biology, Mount Sinai

I tried this a while ago and dropped it after a week, thinking it too hard. Hopefully, I've learned a bit since then, and MIT's 7.QBWx rekindled my interest in systems biology.

Introductory Human Physiology, Duke

No, I don't want to be a doctor. But yes, physiology and anatomy are interesting. This promises to be a heavy-workload class; we'll see if I stick with it.

Astro3, The Violent Universe, ANU

I can't stop halfway through the series! Sadly, it starts in October and I may be too busy to give it my full attention. Hopefully things will pan out all right.

Fundamentals of Immunology part 2, Rice

Ditto. I did the first part, which was great (though hard work). The timing of this second part isn't so good, but we'll do what we can.

Next

That's September and October pretty much spoken for. With luck, I'll be able to sneak in a Data Science course in there... I still have five full courses to do in the Specialization (Statistical Inference, Regression Models, Reproducible Research, Developing Data Products, Practical Machine Learning). I've pencilled in one in October and two each in November and December. This way I still have some leeway until the next capstone project (expected in February).

I have also registered for the Open University's Start Writing Fiction course. Not sure I'll stick with it, but a writing class is interesting to say the least (even though English isn't my first language).

Saturday, August 9, 2014

The Coursera-Johns Hopkins Data Science specialization

Quick recap: the department of Biostatistics at Johns Hopkins School of Public Health offers a full-on specialization in “Data Science” through Coursera, consisting of nine courses and a “capstone project”. The specialization certificate is supposed to testify that students are proficient in getting data, formatting it, graphing it, extracting useful knowledge from it, drawing and communicating conclusions from it, and so on. With an emphasis on using R, although the skills are supposed to be broadly applicable to other systems.

In detail, the sequence is made of nine courses:

  • The Data Scientist's Toolbox
  • R Programming
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Reproducible Research
  • Statistical Inference
  • Regression Models
  • Practical Machine Learning
  • Developing Data Products

The courses are free, but if one shells out for them ($49 or 35€ each, depending on one's currency zone) one gains access to a capstone project and a specialization certificate.

I haven't yet made up my mind about doing the whole specialization, or simply taking the free courses. I have a handful of days to decide.

What's in it for me?

Well, I sort of know a lot of this stuff from before. Using Git and GitHub for collaboration is something I do daily, and programming (though not in R) is how I make my living; plus of course I've taken MIT's The Analytics Edge (a very intense, hands-on, business-oriented course on using R for analytics) and UC Berkeley's Introduction to Statistics, so I know my way around most of the material.

I am therefore not in a “first time learner” situation − rather, the curriculum is more about consolidating the knowledge I do have, formalizing it, and getting an overall certification to somehow “prove” my mastery of it (the acceptability of this proof − by potential employers and/or academics − remains to be assessed).

The courses themselves

So far − in one week! − I've taken five courses out of nine, and completed two. This may sound impressive, but it's not − as I said, I'm hardly a first-time learner.

The Data Scientist's Toolbox is a very short introduction to the overall specialization. The main point is to install RStudio and create a GitHub account. Doable (including the quizzes and project) in two hours, and generally dispensable (and part of the reason why I balk at doing the specialization − 35€ is very expensive for three clicks on a website).

R Programming gets a bad rap on review sites. I sort of understand why: it's a rather heavy-handed introduction to R, and I guess it's pretty incomprehensible to those who have never written a line of code in their life and pretty abstract for most who have never toyed with R.

For me, as someone who has used R but never been formally introduced to it (like, I never figured that everything was a vector and I had difficulty wrapping my head around the difference between single and double square brackets, to say nothing of the scoping rules and the notion of “environments”), it was a nice crisp clarification of the essential concepts of the language. I guess having it as a prerequisite for the rest of the sequence isn't a great idea: really, one can use R without understanding it, and understanding is better approached after a degree of use. Laying it out like this is very bottom-up, very French I would say: first slog through abstract concepts then learn to apply them − I have come to prefer the other way around: build an intuition then consolidate the knowledge and learn how to go further. Anyway; I did the whole course in a day and gained a good understanding of how R works in the process, so that's no bad thing.
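
To make the vector and bracket business concrete, here's a minimal sketch. It's mine, not from the course material, and the scores list and its element names are made up for illustration:

```r
# Even a "single" number is a vector of length 1.
x <- 42
is.vector(x)   # TRUE
length(x)      # 1

# A list with two named elements (hypothetical data, for illustration only).
scores <- list(quiz = c(8, 9, 10), project = 37)

# Single brackets keep the container: the result is still a list.
sub <- scores["quiz"]
class(sub)     # "list"

# Double brackets extract the element itself: here, a numeric vector.
elem <- scores[["quiz"]]
class(elem)    # "numeric"
mean(elem)     # 9
```

In other words, single brackets give you a smaller box, double brackets give you what's inside the box; mean(scores["quiz"]) wouldn't give the average, because it's being handed a list rather than numbers.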

Getting and Cleaning Data is broadly a walkthrough of R's data gathering libraries. How to connect to a database, how to download a file, etc. It suffers from the lecture-then-exercise syndrome. I'm not sure how I would tackle this, really, except by drawing pictures of what the different input formats are, what the general target is (a clean data frame in R), pointers to the documentation and hand-holding exercises on real data rather than a slow demonstration of each function on made-up data. The Coursera platform is a hurdle there: similar endeavours were much easier on edX, where you can have long exercises with multiple questions, each with immediate feedback − I'm thinking of the very time-consuming exercise sets in The Analytics Edge, which had much, much better learning value than the simplistic, submit-all-at-once quizzes that Coursera provides.

That said, the course isn't very challenging but it's useful stuff to know. I'm more or less taking the course on schedule, with a bit of an advance (working up the energy to do the week 2 quiz).

Exploratory Data Analysis is, similarly, a walkthrough of R's graphics libraries, and has the same pros and cons as the previous course. Similarly, having already been exposed to most of it, I find the course a crisp recap of everything. I doubt I would enjoy it very much if it were the first time I was exposed to it.

Statistical Inference is the most-decried course of the sequence, so in order to decide whether to take the overall specialization or not, I registered for it. I understand its detractors: it's very fast, very abstract. Basically in four weeks, Prof. Caffo runs through the same curriculum as Prof. Adhikari did (albeit annoyingly slowly at times) in UC Berkeley's fifteen-week Introduction to Statistics, with the same issues as the rest of the sequence: it's quite technical and abstract, and rather difficult to connect to (though it's hard to be practical when explaining mathematical constructs). I guess I'll be referring to Adhikari's slides more than this course's native ones, but I don't really expect the course to be very challenging.

These were the five courses I've sampled. The rest are:

Reproducible Research is about communicating research by using R markdown and knitr to create live R-embedding documents. Interesting stuff generally, and I guess useful skills to have if one intends to do statistics professionally, but not burning enough that I prioritized the course, so I'll take it later. Four weeks seems kind of long to do that kind of thing.

Regression Models is the other mathematically-grounded course in the sequence, and the other stumbling block for students. I think I'll be okay with it, but realistically I can't sample it this month. We'll see in September or October.

Practical Machine Learning will likely be − again − a formalization of stuff I know from The Analytics Edge.

Developing Data Products sounds like another course about communicating results, half about “good practices” and half about using Shiny (R's own web framework − why is it that all languages must have their web development framework? Even Fortran has a CGI interface…)

The Capstone project seems to be about wrapping this up in a real-life situation.

So… why am I considering taking the whole shaboodle?

Based on the courses I took so far (about half), the sequence is pedagogically deficient − or rather, it's traditional in its approach of stuffing lots of science in the face of students then expecting them to go through with it. I expect they have a high dropout rate (even higher than the baseline MOOC rate). Comprehensive as it is, the sequence is ill-suited to people approaching the subject for the first time. The course page says there are no prerequisites in terms of analytics or programming, but I don't find it so: it's more a recap / advanced course than an introduction to the field.

In that way, it's broadly what I'm after. I don't feel like I know a subject until I've studied the theory a bit†. Since I am vaguely considering reinventing myself as a biostatistician or bioinformatician‡, or at least keeping my options open, it may be worthwhile investing a bit of time (and some euros) into it.

Oh well. It's a subjective arithmetic. If the course were stellar, I wouldn't hesitate long before paying. As it is, I dither.

Postscript

I slept (well, napped) on it. Re-reading myself, it's kind of obvious I'm not really interested in pursuing the specialization certificate; there are better ways to spend 350€ than to rehash mostly-known subjects.

I may decide otherwise at another time − all I need to do is retake the courses, which means a few days doing quizzes and projects again. Doubt it'll be difficult.

If I need some certification or other there are probably better ones around, starting with Duke's Data Analysis and Inference.


† “A bit” as in, I don't need to know how to prove a theorem to use it, but I need to know I am using a theorem rather than following a pre-baked recipe I'm not comfortable improvising on.

‡  On the premise that it's more useful to the human race than general web-based development, and anyway the kids who've done a week of Node.js and therefore know everything there is worth knowing about computer science are taking the fun out of general programming.

Wednesday, August 6, 2014

Mopping up on Exoplanets

It's a bit unfair to say I'm “mopping up” − there are two full weeks of the course left, about direct imaging (at last!) and Earth-like planets − but it's clearly on the way out, and it's becoming possible to start looking back on the course.

This has been a surprisingly (or maybe not, I'm not at all a student of astronomy) technical, more than scientific, course. I mean, Paul Francis did whip out his tablet to perform some calculations, but they were fairly simple, by and large, much simpler than the (already not very advanced) physics of The Greatest Mysteries of the Universe. Here, instead of big questions about gamma ray bursts and Type Ia supernovae, what we have is a celebration of the ingenuity of the engineers making possible something as staggeringly complex as detecting planets orbiting distant stars.

The engineer in me is happy − and it's true these are fantastic achievements.

The course itself follows the same format as Greatest Mysteries: every week has a topic (“radial velocities”, “gravitational microlensing”), which Paul Francis and Brian Schmidt discuss in a Socratic manner, which is an impressive way of saying they convey all the knowledge through dialogue, bouncing questions off each other. Schmidt takes something of a backseat here, often playing the naive novice who asks questions of Francis; maybe he's less comfortable with the topic than with cosmology (or maybe it's a subliminal message: you may have a Nobel Prize, you're still − always − in a position to receive wisdom from your peers). Both lecturers' enthusiasm (especially Francis') is still infectious. Besides the video lectures, each week we have a link to the papers discussed, a text summary of the lesson, a worked example, a graded problem (generally very easy) and a new episode of the Mystery.

Last time around, the mystery had us figure out a weird bouncing parallel universe. This time, we're still in a strange cosmos, but the issues are more technical: a red star seems on course to collide with the world, and we have to find a likely destination for the world's population. But of course, there's a twist…

I have to admit I haven't been as interested in this course's mystery as the last. Maybe it's the lack of bubbles, or maybe I'm just not very entranced by the nitty-gritty detail of surveying the sky, taking radial-velocity measurements, etc. I'll be happy to have the solution for the Mystery through the final exam, but I'm not really motivated enough to go beyond and investigate on my own.

That's perfectly all right. I'm not destined to be an astrophysicist (if I were, I guess I'd be more involved in finding a new haven for the Moggians); I'm here to have fun learning about stuff, and as far as fun is concerned, this course delivers.

Quick notes about The Emergence of Life

So we're in Week 4 of this U. Illinois course over at Coursera, which aims at reconstructing the history of life throughout geological time. Midway between “taxonomy for dummies” and “introduction to evolutionary biology”.

So far, it's… uneven. It's notable that the teaching staff are all geologists rather than biologists, so they're on their home ground when discussing fossil formation, perhaps less so when they're talking about molecular biology. In any case, I like the fossil-discussing segments: they're informative and help drive the geological time-scales into my head; plus I like weird beasts.

Where I'm less enthusiastic is that the lectures are disjointed and often imprecise (like mixing up the terms eukaryote, metazoan and multicellular organism − y'know, plants are multicellular organisms, but they're not metazoans; likewise, there are these things called saccharomyces, amoeba, giardia, etc.: not all eukaryotes are multicellular). Sometimes they'll use an inappropriate picture to illustrate what's being discussed (illustrating armored jawless fish with a toothy placoderm isn't a great idea!). There's little logic in how a segment connects to the ones before and after. It's a bit annoying that the clearest segments are the ones from the very young PhD student introducing taxonomy, while the segments from the official professor are somewhat confused (and confusing).

But I can live with that. Playing spot-the-fossil in the quizzes is fun.

Another thing I find upsetting is that the forums are basically drowned in two kinds of posts:

  • corrections for approximations made in the lectures
  • creationist crap (multiple threads discussing intelligent design, “global warming: fact or fiction”, etc.)

Huh. Okay. I'll just steer away from the forums, then.

That, plus the outright idolizing of Carl Woese (can't we grow up beyond the “great man single-handedly upsetting the establishment” type of narratives?) means the course isn't all it's meant to be… oh well. It's still something to do in an otherwise quiet summer.

(That said, I like the funky music and titles.)

Monday, August 4, 2014

Johns Hopkins' data science specialization, round two

Two days ago I noted I ran through the first course of JHSPH's Data Science specialization in a handful of hours.

In fact, yesterday I did the same for the second course in the series, R Programming. But this time I didn't feel “cheated” (although that's a strong word): I found the course easy as pie because I'm an experienced programmer and I've already used R quite a lot in MIT's The Analytics Edge; however, I lacked any formal(ish) introduction to the language from a computer scientist's point of view. It's not enough to know that you should type lm(x ~ y + z, data=mydata); I find it necessary to know that it's a functional language where the basic data type is the vector and where every function carries with it its own environment, with such-and-such scoping semantics.

Such an introduction needn't be long. But having it, I'm a lot more confident that I understand how R works, and therefore that I can use it correctly.
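
For the record, the “every function carries its own environment” part is the one that clicked hardest for me. A minimal sketch of what I mean, assuming nothing beyond base R (make_counter is a made-up helper of my own, not something from the course):

```r
# make_counter is a hypothetical helper, made up for illustration.
make_counter <- function() {
  count <- 0                 # lives in the environment created by this call
  function() {
    count <<- count + 1      # <<- updates count in the enclosing environment
    count
  }
}

a <- make_counter()
b <- make_counter()

a1 <- a()   # 1
a2 <- a()   # 2
b1 <- b()   # 1: b has its own environment, untouched by calls to a

# The counter state is stored in each closure's environment.
environment(a)$count   # 2
```

Each call to make_counter creates a fresh environment, and the function it returns keeps a reference to it; that's lexical scoping doing its job, and it's why a and b don't step on each other.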

All this to say − yeah, I ran through the 4-week course in a day, but it doesn't mean it deserves its poor reviews.

Saturday, August 2, 2014

The Data Scientist's Toolbox - Fastest MOOC ever?

Since I have a rather quiet month of August, I browsed through the Coursera catalog, and found that all the courses from the Johns Hopkins Data Science specialization are repeated on a monthly basis; the next starting date is August 4th, which is next Monday.

So, cool. I signed up for “R Programming”, then figured I'd check the other ones out, and well, to give me a taste of the sequence (because well, maybe I'll want to do the whole thing) I clicked on the introductory course, “The Data Scientist's Toolbox” − and found that although the course doesn't officially start until Monday, all the content is already available. And so, I clicked through…

… and finished the course four hours later (having taken breaks to give a bath to then feed and put to bed my toddler, then put dinner in the oven…)

Well, it's introductory, I already know the tools concerned (GitHub? I use it, like, daily…), so there wasn't much challenge. The “course project” was basically to set up a GitHub account and install RStudio.

We'll see what happens with the R Programming course; if it's that easy… I don't know. Maybe I'll pony up (35€ per course; at a couple of hours a pop, that's much more expensive than the cinema) for the verified certificate if the following courses are good.

But in the meantime − it's pretty much the fastest MOOC I've ever done. Started and ended two days before it's due to start? I rock.