Video Tip of the Week: #Docker, shipping containers for software and data

Breaking into the zeitgeist recently, Docker popped into my sphere from several disparate sources. Seems to me that this is a potential problem-solver for some of the reproducibility and sharing dramas that we have been wrestling with in genomics. Sharing of data sets and versions of analysis software is being tackled in a number of ways. FigShare, Github, and some publishers have been making strides among the genoscenti. We’ve seen virtual machines offered as a way to get access to some data and tool collections*. But Docker offers a lighter-weight way to package and deliver these types of things, in a quicker and more straightforward manner.

One of the discussions I saw about Docker came from Melissa Gymrek, with this post about the potential to use it for managing these things: Using docker for reproducible computational publications. Other chatter led me to this piece as well: Continuous, reproducible genome assembler benchmarking. And at the same time as all this was bubbling up, a discussion on Reddit covered other details: Question: Does using docker hit performance?

Of course, balancing the hype and reality is important, and this discussion thrashed that about a bit (click the timestamp on the Nextflow tweet to see the chatter):

To get a better handle on the utility of Docker, I went looking for some videos, and these are now the video tip of the week. This is different from our usual topics, but because users might find themselves on the receiving end of these containers at some point, it seemed relevant for our readers.

The first one I’ll mention gave me a good overview of the concept. The CTO of Docker, Solomon Hykes, talks at Twitter University about the basis and benefits of their software (Introduction to Docker). He describes Docker as being like the innovation of shipping containers–which don’t really sound particularly remarkable to most of us, but in fact the case has been made that they changed the global economy completely. I read the book Bill Gates recommended last year, The Box, and it was quite astonishing to see how metal boxes changed everything. Containerization brought standardization and efficiencies that were previously unavailable–and those are two things we really need in genomics data and software.

Hykes explains that the problem of shipping stuff–coffee beans, or whatever–had to be solved at each place the goods might end up. This is a good analogy, much like the one made in the shipping container book. How to handle an item, the appropriate infrastructure, local expertise, etc.–these were real barriers to sharing goods. And this happens with bioinformatics tools and data right now. But with containerization, everyone could agree on the size of the container, the locks, the label position and contents, and everything was standardized on that system. This brought efficiency and automation, and it really changed the world economy. As Hykes concisely describes [~8min in]:

“So the goal really is to try and do the same thing for software, right? Because I think it’s embarrassing, personally, that on average, it’ll take more time and energy to get…a collection of software to move from one data center to the next, than it is to ship physical goods from one side of the planet to the other. I think we can do better than that….”

This high-level overview of the concept, in less than 10 minutes, is really effective. He then takes a question about Docker vs. a VM (virtual machine). I think this is the essential take-away about containerizing the necessary items [~18min]:

“…Which means we can now define a new unit of software delivery, that’s more lightweight than a VM [virtual machine], but can ship more than just the application-specific piece…”
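To make that idea concrete, here is a minimal sketch (mine, not from the talk) of what shipping more than just the application-specific piece can look like: a Dockerfile declares the base system and the dependencies right alongside the tool, and the whole thing is built into one image. The names here–my_pipeline.py, the bwa/samtools packages, the myuser/my-pipeline tag–are illustrative placeholders, and the Python wrapper is just driving the ordinary docker command line.

    # Hypothetical packaging of a small analysis tool plus its dependencies.
    import pathlib
    import subprocess

    # The Dockerfile is the recipe: base OS, system packages, and the tool itself.
    dockerfile = "\n".join([
        "FROM ubuntu:14.04",
        "RUN apt-get update && apt-get install -y python bwa samtools",
        "COPY my_pipeline.py /opt/my_pipeline.py",           # placeholder script
        'ENTRYPOINT ["python", "/opt/my_pipeline.py"]',
        "",
    ])
    pathlib.Path("Dockerfile").write_text(dockerfile)

    # Build the image; the OS, the aligners, and the script all travel together.
    subprocess.run(["docker", "build", "-t", "myuser/my-pipeline", "."], check=True)

    # Anyone with Docker can now run the identical environment with one command,
    # without installing bwa or samtools on their own machine.
    subprocess.run(["docker", "run", "--rm", "myuser/my-pipeline"], check=True)

The Dockerfile itself is the part worth versioning alongside the code (in a Github repo, say), since it is the human-readable recipe that regenerates the image.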

After this point there’s a live demo of Docker to cover some of the features. But if you really do want to get started with Docker, I’d recommend a second video from the Docker team. They have a Docker 101 explanation that covers everything from installation, to poking around, to destroying stuff in a container to show how that works, to demoing some of the other nuts and bolts and the ease of sharing a container (a rough sketch of that loop follows below).
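Here is a rough sketch of that run / break things / commit / share loop (again mine, not lifted from the video), assuming Docker is already installed and using the plain docker CLI from Python. The ubuntu:14.04 base image, the samtools install, and the myuser/my-analysis-env name are just placeholders.

    import subprocess

    def docker(*args):
        """Run one docker CLI command and return its stdout as text."""
        return subprocess.run(["docker", *args], check=True,
                              capture_output=True, text=True).stdout

    # Pull a base image and poke around in a throwaway container.
    docker("pull", "ubuntu:14.04")
    print(docker("run", "--rm", "ubuntu:14.04", "bash", "-c", "ls /usr/bin | wc -l"))

    # "Destroy" things inside one container; the image itself is untouched...
    docker("run", "--rm", "ubuntu:14.04", "bash", "-c", "rm -rf /usr/bin")
    # ...so a fresh container still has everything.
    print(docker("run", "--rm", "ubuntu:14.04", "bash", "-c", "ls /usr/bin | wc -l"))

    # Keep the changes you do want by committing a container to a new image,
    # which can then be pushed to a registry so others can pull it.
    cid = docker("run", "-d", "ubuntu:14.04",
                 "bash", "-c", "apt-get update && apt-get install -y samtools").strip()
    docker("wait", cid)                              # let the install finish
    docker("commit", cid, "myuser/my-analysis-env")  # placeholder image name
    # docker("push", "myuser/my-analysis-env")       # needs a Docker Hub account

The commit/push step is the "ease of sharing" part: a collaborator only needs a docker pull of that image name to get the identical environment.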

So this is making waves among the genomics folks. This also drifted through my feed:

Check it out–there seem to be some really nice features of Docker that can impact this field. It doesn’t solve everything–and it shouldn’t be used as an excuse not to put your data into standard formats–and Melissa addresses a number of unmet challenges too. But it does seem that it can help with the reproducibility and data-access issues that are currently hurdles (or, plagues) in this field. Docker is also under active development, and the team appears intent on making it better. Sharing our stuff is not trivial–there are real consequences to public health from inaccessible data and tools (1). And there are broader applications beyond bioinformatics, of course; wide appeal and adoption seem to be a good thing for ongoing development and support. More chatter on the larger picture of Docker:

And this discussion was helpful: IDF 2014: Bare Metal, Docker Containers, and Virtualization.

And, er…

I laughed. And wrote this anyway.

Quick links:

Docker main site: http://www.docker.com/

Docker Github: http://github.com/docker/

Reference:
(1) Baggerly, K. (2010). Disclose all data in publications. Nature, 467(7314), 401. DOI: http://dx.doi.org/10.1038/467401b

*Ironically, this ENCODE VM is gone, illustrating the problem:

[Screenshot: the ENCODE VM download, no longer available]