Data Science at Johns Hopkins

Data Science Coursera JHU

Johns Hopkins University offers an interesting MOOC on Coursera. We asked ourselves: is it worth doing?

I put it to the test.

The whole program consists of 9 modules and a capstone project. I will give a short description of each, along with my impressions of the track, which I took from May 2015 until January 2016. For the record: in July I took two modules at the same time. Normally it takes one month per module and 2 months for the capstone.

The nine modules are:

1. The Data Scientist’s Toolbox

This is the introductory course, and it gives you some guidance on the tools you are going to use during the program. The most important one is R and its development environment, RStudio. Another tool is GitHub, an online code repository, which plays an important role in publishing your code and making it available to peers. Other tools include R Markdown, a simple markup format for documentation.

A nice introduction to get started in the data science world. If you already know these tools, then I doubt there is much new to learn.

2. R Programming

The focus here is on the basics of the language. This is a very important module, as you will need R during the rest of the program. You get a thorough introduction to the language using RStudio. It is worthwhile spending enough time on this, because understanding the basics of R is needed to get through the other modules. You also get to know swirl, an interactive R tutorial that is simple to use.

I can imagine people with little or no programming experience spending some time on this one. For people with experience, it is merely a matter of getting used to a new syntax and IDE (RStudio). R is widely used in the data science world, and in this course you get a decent introduction.

3. Getting and Cleaning Data

In this module you learn how to read (and write) data from files and how to use techniques to clean up that data. The ultimate goal is to end up with what is called a “tidy” file. You learn a lot of functions and packages (dplyr!) in R to help you with that.

This is a very fun, practical and hands-on course. It is not too difficult and helps you get a bit deeper into the R language and its different packages.

4. Exploratory Data Analysis

This course covers the essential exploratory techniques for summarizing data. These techniques are applied before formal modeling starts. The module is essentially about “plotting” data in order to discover and understand its content.

This course is in the same league as the previous one: hands-on and practical, with lots of stuff to do.

5. Reproducible Research

This course focuses on the concepts and tools behind reporting data analyses in a reproducible manner. Reproducible research is the idea that data analyses are published with their data and software code so that others may verify the findings and build upon them. This course focusses on literate statistical analysis tools which allow one to publish data analyses in a single document that allows others to easily execute the same analysis to obtain the same results.

For me this is a fundamental part of the whole track. Reproducible research is a key concept for the teachers, technically but certainly also morally. They put a lot of emphasis on it, and there are some very good examples of where things went wrong. On the practical side you learn to work with some nice (free) tools such as knitr, R Markdown and RPubs.

6. Statistical Inference

Statistical inference is the process of drawing conclusions about populations or scientific truths from data. This module is all about probability, variability, distributions, testing, p-values, power, and so on.

I think this module and the next one are the most difficult. The teachers sometimes go through lengthy discussions of the subjects, which is not that hard for people with a mathematical or scientific background but might be a bridge too far for others. In the forums for these modules I saw people taking the course for the second or third time, and others giving up. A tough one, but crucial to the whole data science track. Maybe the professors could have skipped some mathematical details and left them for the “optional” material.

7. Regression Models

Regression models, a subset of linear models, are the most important statistical analysis tool in a data scientist’s toolkit. This module covers most of the different models and techniques and how to use them in R.

As mentioned under the previous module, this is a tough one and requires some effort, especially for people without a scientific background.

8. Practical Machine Learning

This module is an introduction to machine learning. The course covers the basics of training and test sets, overfitting, and error rates. It introduces a range of model-based and algorithmic machine learning methods including regression, classification trees, Naive Bayes, and random forests.

This is a fun module. It does not go too deep into machine learning (see our other ongoing post on Machine Learning at Stanford), but you get a good idea of what it is all about. It was here that I discovered that a powerful computer is worth investing in!
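To make the ideas of training and test sets, overfitting and error rates a bit more concrete, here is a minimal sketch. The course itself works in R with dedicated packages; the snippet below is plain Python, chosen only for illustration, and the toy data, the function names and the 1-nearest-neighbour classifier are all assumptions, not course material. A classifier that simply memorizes its training data looks perfect on that data but makes mistakes on data it has not seen.

```python
import random

def make_data(n, noise, rng):
    """Toy data (an assumption): x in [0, 1), label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # inject label noise
        data.append((x, label))
    return data

def split(data, test_fraction, rng):
    """Shuffle a copy of the data and return (train, test)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def error_rate(train, data):
    """Fraction of points in `data` that the 1-NN model built on `train` gets wrong."""
    wrong = sum(1 for x, label in data if predict_1nn(train, x) != label)
    return wrong / len(data)

rng = random.Random(42)  # fixed seed so the run is repeatable
data = make_data(200, noise=0.2, rng=rng)
train, test = split(data, test_fraction=0.3, rng=rng)

train_error = error_rate(train, train)  # the model has memorized these points
test_error = error_rate(train, test)    # unseen points give the honest estimate

print(f"training error: {train_error:.2f}")
print(f"test error:     {test_error:.2f}")
```

The gap between the two numbers is the overfitting: the training error flatters the model, which is why the error rate is always reported on data that was held out.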

9. Developing Data Products

The final module covers the basics of creating data products using Shiny, R packages, and graphics such as rCharts, GoogleVis and plotly.

Again, a fun and practical module. It does not go too deep into the technical details and has a lot of hands-on work to do.

The Capstone Project

In this final project the objective was to build a word prediction application, running in ShinyApps and documented in RPresenter. We had 2 months to accomplish this.
At first it was difficult to get started, because we had not touched on natural language processing during the courses. But once you get over the initial hurdles, things get really interesting. You can spend a lot of time optimizing the algorithm, the application interface and the documentation.

Also, it was one of the few times that the forum was buzzing with ideas, problems, solutions and so on.
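To give a flavour of what word prediction involves, here is a minimal sketch of one common approach: count which word follows which in a text (a bigram model) and suggest the most frequent followers. This is not the capstone solution and not necessarily the method the course expects; the tiny corpus, the function names and the choice of Python instead of R are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, how often each other word directly follows it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model

def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`, or [] if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return []
    return [candidate for candidate, _ in counts.most_common(k)]

# Toy corpus (an assumption); a real application would train on a large body of text.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigrams(corpus)

print(predict_next(model, "the"))      # "cat" comes first: it follows "the" twice
print(predict_next(model, "unknown"))  # unseen word: no suggestion
```

A real application has to deal with much more: longer word sequences, words it has never seen, and memory and speed limits. That is where most of the optimization time goes.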


Is it worth it? Well, if you are an R developer or statistician you probably won’t learn a lot. But if you are new to the data science world, then this is an interesting program. Some modules are pretty tough, but at least you get an idea of what data science is all about. There was also good support from the mentors and the community.

We would definitely recommend it to people who are taking their first steps in data science.

Thierry J.