How to Get Started Working with the FastMRI Dataset (and why you should!)

K Hodges
4 min read · Dec 12, 2019

Facebook AI and NYU Langone Health have been working on an AI research project to improve MRI scans. Today, they released their dataset.

This dataset is the largest publicly available MRI dataset ever. It contains 6,970 raw-format MRI cases, and 10,000 additional cases (with more than 350,000 image-slices) in the DICOM format.

Apply to get it here (I got mine in a few minutes and am not part of any particular institution)

But Kevin, this is cool and all, but why does a non-neuroscientist individual like yourself want a bunch of data about brains?

Well, because mad scien... er, because this lets all of us take part in exploring this amazing data. FastMRI is currently challenging the public to improve on their already impressive system using this dataset, but that’s just the beginning of what it can be used for.

Making MRI faster and cheaper matters, because much of the cost of an MRI comes from the imaging itself. Computer-aided MRI imaging could even be lifesaving in certain cases, for example by avoiding unnecessary surgeries.

So, here’s how I got started making my neurohacking AI (and how you can too):

1. This blog post is a good place to start. Get hype.

2. Apply to get access to the dataset. It currently takes only a few minutes.

3. I then spun up a thicc t2.large EC2 instance with 2 TB of storage. This part may seem costly, but it’s not too bad. The storage is $100 a month on SSD, so just remember to delete it after you’re done playing.

4. Install Python 3.7 and Jupyter Notebook on the EC2 instance. Also, remember to open port 8888 in an AWS security group, restricted to your IP, so you can reach your notebook.

Shiny new notebook ayy

5. Clone the fastMRI repo. This has some useful tooling and models to get you started.

6. Download the 2 TB dataset.

Gotta take a 28 hour long coffee break real quick

6a. To get started on the rest of this article, I decided to download just the validation set (the data is split into train/val/test). It’s about 10% of the size of the full training data. You may want to do this too, to get started quickly.

Exploring the Dataset

The git repo includes some boilerplate code and a tutorial notebook (fastMRI_tutorial). The tutorial is totally human-readable, so it’s a good place to start.

Here’s some of the boilerplate code.

The data is stored in HDF5 (.h5) files, so we can use h5py to work with it easily. You can see the keys and attributes in cell [3].
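
If you want to poke at a file outside the tutorial notebook, here is a minimal sketch of listing what an .h5 file contains with h5py. So that it runs before your download finishes, it builds a tiny stand-in file first. The file name, the "kspace" key, the (slices, coils, height, width) layout, and the "acquisition" attribute are my assumptions for illustration, so check them against your own files.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_h5(path):
    """Return ({dataset name: (shape, dtype)}, {attribute name: value})."""
    with h5py.File(path, "r") as f:
        datasets = {
            name: (obj.shape, str(obj.dtype))
            for name, obj in f.items()
            if isinstance(obj, h5py.Dataset)
        }
        attrs = dict(f.attrs)
    return datasets, attrs


# Build a tiny stand-in file so the sketch runs without the real download.
# The key name and array layout below are assumptions, not the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_scan.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("kspace", data=np.zeros((4, 2, 8, 8), dtype=np.complex64))
    f.attrs["acquisition"] = "demo"

datasets, attrs = describe_h5(demo_path)
print(datasets)
print(attrs)
```

Point describe_h5 at one of the downloaded files and you should get the same kind of summary the tutorial prints.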

Below, we can begin to see how some of this data is organized. It uses the concept of k-space (which I had to google). From what I understand, k-space is the spatial-frequency representation of a scan: the scanner records raw data in the frequency domain, and an inverse Fourier transform turns it into an image. If you’re familiar with Fourier transforms, this will feel familiar.
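
To convince myself of the Fourier connection, here is a small NumPy sketch (not code from the fastMRI repo) that turns a toy image into k-space and back again. The helper names and the bright-square "anatomy" are made up for illustration, and I am assuming the usual centered FFT convention, where the low frequencies sit in the middle of k-space.

```python
import numpy as np


def image_to_kspace(image):
    """Centered 2D FFT: image -> k-space (low frequencies in the middle)."""
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image)))


def kspace_to_image(kspace):
    """Centered 2D inverse FFT: k-space -> complex-valued image."""
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace)))


# Toy "anatomy": a bright square on a dark background.
image = np.zeros((64, 64))
image[24:40, 24:40] = 1.0

kspace = image_to_kspace(image)
recovered = np.abs(kspace_to_image(kspace))

# k-space is complex-valued, and its strongest entry sits at the center.
peak = np.unravel_index(np.argmax(np.abs(kspace)), kspace.shape)
print(kspace.dtype, peak)
print(np.allclose(recovered, image))
```

The round trip gives the original image back, which is the whole trick: the scanner hands you k-space, and the inverse transform hands you the picture.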

From the tutorial notebook

We’ve been provided with a convenient to_tensor function, so we can work with these k-space slices like we would with any other tensor.
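
As far as I can tell, the point of a helper like to_tensor is to store each complex number as a pair of real numbers (real part, imaginary part) in an extra trailing dimension before handing the array to PyTorch. Here is a NumPy-only sketch of that layout. The function names are hypothetical stand-ins, not the repo's API.

```python
import numpy as np


def complex_to_two_channel(data):
    """Stack real and imaginary parts along a new last axis of size 2."""
    return np.stack((data.real, data.imag), axis=-1)


def two_channel_abs(data):
    """Magnitude of an array stored as (real, imag) pairs on the last axis."""
    return np.sqrt((data ** 2).sum(axis=-1))


# A made-up 2x2 "k-space slice" with easy-to-check values.
kspace_slice = np.array([[3 + 4j, 1 + 0j], [0 - 2j, 0 + 0j]], dtype=np.complex64)

stacked = complex_to_two_channel(kspace_slice)
magnitude = two_channel_abs(stacked)

print(stacked.shape)
print(magnitude)
```

The magnitude step is what turns the complex values into something you can actually plot as a grayscale image.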

Using that, we can begin to see actual images!

I’m not a doctor, but this looks like a knee.
Remember to install python modules (I should have used requirements.txt)

Using that pattern, we can start looking at brains! I aimed the code at the brain data above

and we see… something. Unsure at this time why it’s not looking quite like a full image, but I am guessing that the kspace of a brain scan is structured differently from a knee scan? Stay tuned!


Running the same to_tensor transforms, we can see the slice from the brain scan!

That looks more like a brain!

Different files seem to have different levels of ‘fuzziness’.

We can use root_sum_of_squares to see a fuller image, essentially combining the images from all of the coils.
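
Root-sum-of-squares is simple enough to sketch in plain NumPy: square the magnitude of each coil's image, add them up, and take the square root. This is my own stand-in rather than the repo's function, and the two "coil sensitivities" below are invented numbers chosen so the result is easy to check.

```python
import numpy as np


def root_sum_of_squares(coil_images, coil_axis=0):
    """Combine per-coil images into one by root-sum-of-squares over coil_axis."""
    return np.sqrt((np.abs(coil_images) ** 2).sum(axis=coil_axis))


# Toy ground truth: a bright square on a dark background.
truth = np.zeros((16, 16))
truth[4:12, 4:12] = 1.0

# Hypothetical setup: each coil sees the same image, scaled by its sensitivity.
sensitivities = np.array([0.6, 0.8])
coil_images = sensitivities[:, None, None] * truth

combined = root_sum_of_squares(coil_images)
print(combined.shape)
print(np.allclose(combined, truth))
```

Each coil on its own only sees a dimmed version of the square, but because 0.6² + 0.8² = 1, the combined image matches the original exactly in this toy case.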

I’m waiting on my full download. To be continued!


