A Portrait of One Scientist as a Graduate Student

Paul Ivanov




TL;DR version: "What's easy, won't last. What lasts, won't be easy."

In this talk, I will focus on the how of reproducible research: specific tools and techniques I have found invaluable for doing research in a reproducible manner. In particular, I will cover the following general topics (with specific examples in parentheses): version control and code provenance (git), code verification (test-driven development, nosetests), data integrity (sha1, md5, git-annex), seed saving (random seed retention), distribution of datasets (mirroring, git-annex, metalinks), and lightweight analysis capture (ttyrec, IPython notebook).

My background (since this talk will be autobiographical):

  • Born in Moscow (Soviet Russia) (communism + socialism)
  • Started running GNU/Linux in 2000 (free software + open source)
  • BS in Computer Science from UC Davis
  • Finishing up a graduate program in Vision Science at UC Berkeley

My rotation in a primate electrophysiology lab:

Data is hard to get: 1-2 years of training the animal on the task, "minor brain surgery", and then 4-6 months of data collection: every day, 6-10 hours per day.

Data is very rich. It is hoarded. With a very tight lid.

My naive conclusion: Data is precious. Free the DATA!

If data were just more accessible...

But the reality is that having accessible data is not enough...


You need the code, and I don't mean a tarball.

Step 0 -- version control everything

(including this presentation)

show of hands - familiar with some form of version control?

Git specifically?

It's not rocket science! There are sane GUIs for novice users.

I explained the benefits of version control to my biologist friend Sara, and put SmartGit on her machine. No more _v1, _v3_works, etc. "I didn't think it would be this easy".
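The command line works too, and the core of it fits on one slide. A minimal first-use sketch (the file and commit message are hypothetical stand-ins; any file you edit works the same way):

```shell
# start tracking a project directory with git
mkdir analysis && cd analysis
git init

# tell git who you are (skip if already configured globally)
git config user.name  "Sara"
git config user.email "sara@example.com"

# a stand-in analysis script
echo 'print("hello")' > analysis.py
git add analysis.py
git commit -m "first pass at the analysis"

# every committed version is now recoverable: no more _v1, _v3_works
git log --oneline
```

From here, the daily loop is just edit, `git add`, `git commit`.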

Back in my home lab, we do computational experiments.

Unsupervised learning of natural signals. "How should the brain encode images given their properties?"
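Experiments like these involve random initializations, which is where seed saving comes in: record the seed with the results and a stochastic run becomes replayable. A sketch using awk's srand as a stand-in for a real simulation (file names are hypothetical):

```shell
# record the seed alongside the results so the run can be replayed
seed=20120718
echo "seed=$seed" > run.log

# a stand-in stochastic "experiment": three pseudo-random numbers
awk -v s="$seed" 'BEGIN { srand(s); for (i = 0; i < 3; i++) print rand() }' > run1.out

# re-running with the saved seed reproduces the output exactly
awk -v s="$seed" 'BEGIN { srand(s); for (i = 0; i < 3; i++) print rand() }' > run2.out
cmp -s run1.out run2.out && echo "reproducible"
```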

van Hateren images

A very popular dataset (camera-calibrated, linearized, uncompressed, etc.); the paper to cite for it came out in 1998.

As of 2007, it had 336 citations according to Google Scholar (then the 99th most-cited paper in the vision literature).

Today that number is up to 776.

Then in 2010, an email went out to a vision community mailing list saying:

Does anyone have a copy of van Hateren database? I have been looking for the 4000 still image database. The links to images are broken! And it looks like there is no mirror of the full database anywhere. I would appreciate your help and suggestions.

So I put up a mirror.

Shortly thereafter, another grad student in a lab in Germany (one of my academic nephews) did the same.

This happened again a year later with another dataset. Luckily, I had downloaded that one as well, and now host the canonical version.

Lesson learned: don't take today's data sources for granted.
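Mirrors only help if people can trust them, which is where checksums come in: publish a digest alongside the data, and anyone can verify their downloaded copy bit-for-bit. A sketch with coreutils sha1sum (the file name and contents are hypothetical):

```shell
# the original host publishes a digest alongside the data
echo "pretend this is the image database" > vanhateren.tar
sha1sum vanhateren.tar > vanhateren.tar.sha1

# anyone who downloads a mirrored copy can verify it against the digest
sha1sum -c vanhateren.tar.sha1
```

The same idea scales up: git-annex tracks file contents by checksum, and metalink files embed digests next to the mirror list.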


Metalink: multiple resources for the same data (HTTP, FTP, BitTorrent) in one container file.


ttyrec: a lightweight capture tool (I use this daily; it helps me account for how I spend my time). It just writes everything you see in the shell to a file, with timing information, which you can later play back.

demo in the shell (ttyplay ~/2012-08-01_2.tty)

IPython notebook

  • clear all output, and re-run this notebook
  • inline plotting
  • tab-completion, documentation tooltips, etc
  • extensible - R magic, Octave magic
  • Notebook format - converters to PDF, LaTeX, HTML, Markdown, reStructuredText, Python.
  • communication protocol: multiple clients can talk to the computational kernel (vim-ipython, for example)
  • collaboration: expose a relevant port publicly, point a collaborator to a website.

Thank you

Paul Ivanov



