This week’s SNPpets offer a nutty new genome and a pine (new genomes arrive pretty much every day of the week lately). Legal obstacles to data sharing, again. The TL;DR statement on human gene editing. The nerdiest Christmas tree evah from UCSC Genome Browser. New software (also like every day of the week). A Docker-based tool registry. The 3′ UTRome. Docker. [Last one for this year; we’re taking the holiday weeks off except for the annual tip review posts.]
Oh, and by the way, Friday SNPpets will be off for the holidays, picking back up in January. The posts over the next week are the annual Tip of the Week reviews, with regular blogging resuming in the new year. Have great holidays, everyone!
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
Last fall I did a tip on Docker, which was starting to pick up a lot of chatter among the genoscenti. It was starting to look like a good solution for some of the problems of reproducibility and re-use of software in genomics–containerize it. Box it up, hand it off. There’s certainly a lot of interest and appeal in the community, but there are still some issues to resolve with rolling out Docker everywhere. However, my impression is that the Docker team and community seem interested and active in evolving the tools to be as broadly useful as possible.
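To make “box it up” concrete: a containerized tool is typically described by a short Dockerfile recipe. This is a minimal sketch, assuming a hypothetical assembler installable from a package manager; the tool and base-image versions here are illustrative, not from any of the resources discussed in this post.

```dockerfile
# Minimal sketch of containerizing a bioinformatics tool.
# "my-assembler" is a hypothetical package name used for illustration.
FROM ubuntu:14.04

# Install the tool and its dependencies inside the image,
# so whoever receives the container doesn't have to.
RUN apt-get update && apt-get install -y my-assembler

# The command the container runs by default.
ENTRYPOINT ["my-assembler"]
```

With a recipe like this, `docker build` produces an image that anyone can `docker run`, without installing the tool or its dependencies on their own machine.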
So when this tweet rolled through the #bioinformatics twitter column on my Tweetdeck, I was excited to see this talk by Michael Barton (who has the best twitter handle in the field: @bioinformatics). It’s a terrific example of how Docker can be aimed at some of the problems in the bioinformatics tool space. It’s not the only option, of course. Some workflow resources like Galaxy can cover other features of genomics researchers’ needs. But as a general solution to the problems of comparing software and distributing complete working containers, Docker seems to be developing into a very useful strategy.
Although this is longer than our typical “tips”, I’d recommend that you carve out some time to watch it if you are new to the idea of Docker. In case you don’t have time right now for the talk, here’s a summary. For the first 10 minutes, there’s a gentle introduction for non-genomics nerds about what sequencing is like right now. Then Michael describes how the assembler literature works–with competing claims about the “better” assembler as each new paper comes along. This includes a sample of the types of problems that assemblers are trying to tackle with different strategies.
Around 14min, we begin to look at what it’s like to be the researcher who needs to access some assembler software. He describes how different lab groups–like remote islands–can instantly ship their sequence data around today. But biologists are like “longshoremen for data”: they have to unload, unpack, install, and try to get all the right pieces together to make it work in a new lab. We are doing “break bulk” science right now. That was a really terrific assessment of the state of play, I thought.
If you are ok with skipping the other pieces, you can jump to around 16min, where we get to a specific example of the benefits of Docker for this type of research. Michael goes on to describe how Docker has helped him build a system to catalog and evaluate various assemblers. He developed a project called nucleotid.es (pronounced just like “nucleotides”), which he goes on to describe. It offers details about various assemblers, which have been put into containers that are easy to access and to use for comparing different software. There are examples of benchmarks, but you can also use these containers for your own assembly purposes. You can explore the site for more detail and a lot of data on the assembler comparisons they have already done. A good overview of the reasons to do this can also be found in the blog post over there: Why use containers for scientific software?
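As a sketch of what using one of these containers might look like from the command line (the image name below is hypothetical, not an actual nucleotid.es container):

```shell
# Fetch a containerized assembler from a registry.
# "example/assembler" is a made-up image name for illustration.
docker pull example/assembler

# Run it on local reads, mounting the current directory into the
# container so it can read the input and write the assembly back out.
docker run -v "$(pwd)":/data example/assembler /data/reads.fastq
```

The point is that the pull-and-run steps are the same for every assembler packaged this way, which is what makes side-by-side comparison practical.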
At about 25min, some of the constraints and problems are noted. Fitting Docker into existing infrastructure, and incentivising developers to create Docker containers, can be issues. But the outcomes–having a better strategy than traditional publication for reproducibility, having ongoing access to the software, and the “deduplication of agony”–seem to be worth investigating, for sure. Then Barton describes what the pipeline could look like for a researcher with some new sequence–you can use the data from a variety of assemblers to make decisions about how to proceed, rather than sifting through papers or just using what the lab next door did. And if you have a new assembler, you can use this setup to benchmark it as well.
So if you’ve been hearing about Docker, and have been concerned about access and reproducibility issues around genomics data and software, have a look at this video. It nicely presents the problems we face, and one possible solution, with a concrete example. There may be other useful methods as well–like offering a central portal for users to access multiple tools, as AutoAssemblyD describes–but that’s really for a different subset of users. For the more general problem of software comparisons, benchmarking, and access to bioinformatics tools, though, Docker seems to offer a useful strategy. And I did a quick PubMed check to see if Docker is percolating through the traditional publication system yet, and found that it is. I found that ballaxy (“a Galaxy-based workflow toolkit for structural bioinformatics”) is offered as a Docker image, which means that having a grasp of Docker going forward may really be useful for software users rather quickly….
Veras A., de Sá P., Azevedo V., Silva A. & Ramos R. (2013). AutoAssemblyD: a graphical user interface system for several genome assemblers. Bioinformation, 9(16), 840–841. DOI: http://dx.doi.org/10.6026/97320630009840
Hildebrandt A.K., Stöckel D., Fischer N.M., de la Garza L., Krüger J., Nickels S., Röttig M., Scharfe C., Schumann M., Thiel P. & Lenhof H.-P. (2014). ballaxy: web services for structural bioinformatics. Bioinformatics, 31(1), 121–122. DOI: http://dx.doi.org/10.1093/bioinformatics/btu574
Breaking into the zeitgeist recently, Docker popped into my sphere from several disparate sources. Seems to me that this is a potential problem-solver for some of the reproducibility and sharing dramas that we have been wrestling with in genomics. Sharing of data sets and versions of analysis software is being tackled in a number of ways. FigShare, Github, and some publishers have been making strides among the genoscenti. We’ve seen virtual machines offered as a way to get access to some data and tool collections*. But Docker offers a lighter-weight way to package and deliver these types of things in a quicker and straightforward manner.
To get a better handle on the utility of Docker, I went looking for some videos, and these are now the video tip of the week. This is different from our usual topics, but because users might find themselves on the receiving end of these containers at some point, it seemed relevant for our readers.
The first one I’ll mention gave me a good overview of the concept. The CTO of Docker, Solomon Hykes, talks at Twitter University about the basis and benefits of their software (Introduction to Docker). He describes Docker as being like the innovation of shipping containers–which doesn’t sound particularly remarkable to most of us, but in fact the case has been made that shipping containers changed the global economy completely. I read the book that Bill Gates recommended last year, The Box, and it was quite astonishing to see how metal boxes changed everything. They brought standardization and efficiencies that were previously unavailable. And those are two things we really need in genomics data and software.
Hykes explains that the problem of shipping stuff–coffee beans, or whatever–had to be solved at each place the goods might end up. This is a good analogy, as explained in the shipping container book. How to handle an item, the appropriate infrastructure, local expertise, etc., were real barriers to sharing goods. And this happens with bioinformatics tools and data right now. But with containerization, everyone could agree on the size of the piece, the locks, the label position and contents, and everything was standardized on that system. This brought efficiency and automation, and really changed the world economy. As Hykes concisely describes [~8min in]:
“So the goal really is to try and do the same thing for software, right? Because I think it’s embarrassing, personally, that on average, it’ll take more time and energy to get…a collection of software to move from one data center to the next, than it is to ship physical goods from one side of the planet to the other. I think we can do better than that….”
This high-level overview of the concept in less than 10min is really effective. He then takes a question about Docker vs a VM (virtual machine). I think this is the essential take-away: containerizing the necessary items [~18min]:
“…Which means we can now define a new unit of software delivery, that’s more lightweight than a VM [virtual machine], but can ship more than just the application-specific piece…”
After this point there’s a live demo of Docker to cover some of the features. But if you really do want to get started with Docker, I’d recommend a second video from the Docker team. They have a Docker 101 explanation that covers things starting from installation, to poking around, destroying stuff in the container to show how that works, demoing some of the other nuts and bolts, and the ease of sharing a container.
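The flow of that 101 walkthrough can be sketched in a handful of commands. The subcommands are standard Docker, but the image and account names are made up, and the container ID is a placeholder you’d fill in from your own session:

```shell
docker pull ubuntu                  # fetch a base image from a registry
docker run -it ubuntu bash          # poke around inside a throwaway container
# ...install things, break things; the underlying image is untouched...
exit
docker ps -a                        # list your stopped containers
docker commit CONTAINER_ID myname/myimage   # snapshot your changes as a new image
docker push myname/myimage          # share the image so others can pull it
```

That last step is the sharing piece: a colleague runs one `docker pull` and gets exactly the environment you committed.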
So this is making waves among the genomics folks. This also drifted through my feed:
Fantastic demo from http://t.co/MwCQNKy2nD in the office today. Docker for better collaboration between scientists. More of that please.
Check it out–there seem to be some really nice features of Docker that can impact this field. It doesn’t solve everything–and it shouldn’t be used as an escape mechanism to avoid putting your data into standard formats. And Melissa addresses a number of unmet challenges too. But it does seem that it can be a contributor to solving the reproducibility and data-access issues that are currently hurdles (or, plagues) in this field. Docker is also under active development, and the team appears to want to make it better. Sharing our stuff is not trivial–there are real consequences to public health from inaccessible data and tools (1). There are broader applications beyond bioinformatics, of course, and wide appeal and adoption seems to be a good thing for ongoing development and support. More chatter on the larger picture of Docker: