Two tools for reproducible data work

Prabhas Pokharel, Modi Research Group

What will you get out of this?

  • R users: Get introduced to R-Markdown, a tool for doing data analysis and presentation work together.

  • Other data people: Get introduced to make, which lets you write down steps to reproduce data pipelines.

  • Bonus, (most immediately applicable) for R folks: makemake (or pipelineR), a tool to use make without most of the work.

The theme

Data analysis, visualization, presentation, whatever, should be reproducible.

Our aim is to automate the "re-creation" / "reproduction".

What you get out of easy / automated reproduction:

  • work becomes easy to share
  • easy to iterate on
  • easy to collaborate with others
  • easier to hand off

Tool #1 -- RMarkdown

But first, an aside

Rmd has made two things easier, for me:

  • literate / exploratory data analysis (ie, explain what you are doing)
  • data-based presentations (explanations in pictures)

1.1 "Literate" data analysis -- explain what you are doing

A deeper dive -- "How to make Choropleths in R"

Directory structure

list.files("~/Code/gists/NepalChoropleths/")
## [1] "index.html"                 "index.rmd"                 
## [3] "zfstat-062 reformatted.csv" "zfstat-063 reformatted.csv"
  • File organization: index.Rmd compiles to index.md and index.html
    • In this case, the data is in the same directory
  • If you host this work in github, there are automatic renders available. For github, if you create a gh-pages branch, index.html files are auto-rendered at YOURUSERNAME.github.io/REPONAME/. For github gists, use http://bl.ocks.org/YOURGITHUBUSERNAME.

Rmd syntax

  • The file itself
    • Code and markdown are interspersed together
    • "Compilation": "Knit HTML" button
    • Or directly: setwd(appropriate directory); knit2html('index.Rmd').

1.2 Hide the code

The code block can include many options (caching, showing warnings or not, heights and widths of figures). If we simply hide the code, we can get documents which aren't about explaining what you did, but your results!

1.2 Hide the code -- Presentations!

Presentations deep dive


Some Rmd words of wisdom:

  • Learn ggplot2. Its an amazing graphics library that lets you make graphics as you make the data for the graphics. javascript has got nothing (besides interactivity) on ggplot.

  • The "slides" version of this I've found most useful in my work. A great way of leading people through analysis.

  • You can "print" the slides to pdf, and even convert them to ppt if you have Adobe Acrobat Pro.

So that was cool, how about something useful?

How make has made our lives easier: The Nigeria Project

NMIS-Pipeline-1

How make has made our lives easier: The Nigeria Project

NMIS-Pipeline-2

A makefile can help us with that!

NMIS-Pipeline-2

merged.csv: raw_data_1.csv raw_data_2.csv raw_data_1.csv Merge.R
    Rscript Merge.R
dropped.csv: merged.csv DropOutliers.R
    Rscript DropOutliers.R
facility.csv: dropped.csv FacilityIndicators.R
    Rscript FacilityIndicators.R
lga.csv: faclity.csv LGAIndicators.R
    Rscript LGAIndicators.R

Cool!

  • make helped us basically "write down" our pipeline
  • reproduction is as simple as typing "make" on the command line
  • document the process

But there is more.

What I showed you before was a very simplistic picture, the NMIS pipeline actually looks like this:

Dependency Graph

How big is this Makefile?

MakeMake

  • analyses a folder worth of R scripts, and writes your makefile for you.
  • looks at function calls like "read.csv", "write.csv" and other things like that to figure out what your scripts read and write, and auto-write a makefile.
  • also can run a dependency analyzer for you.

How to use

  • Once a name decision is made, look for it on github.com/modilabs
  • (makemake, makemakeR, pipelineR)

Lessons

  • make is super useful. I even use it for smaller projects (Haiti demo?).
  • for Rmarkdown projects, I write a small makefile that links index.html and index.Rmd together. Then, as I'm modifying the project, I run while TRUE; do make; sleep 10; done on the command line; every ten minutes, my presentation is re-compiled if I have modified it recently.
  • if you don't have the rights to re-distribute data, great way to tell people what data they need.

Thank You