This week’s SNPpets include mining disease genes with PPI and co-regulation networks, DNA and the law, great posts on germline genetic engineering moratorium discussions, a bioinformatics “middle class”, new human genome assembly models, and more….
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
Oooh. This got heated in the comments and on twitter. RT @bioinfblogs: .@ctitusbrown: Towards a bioinformatics middle class: Jared Simpson just posted a great blog entry on nanopoli… http://t.co/NeluM8Q2kM
Last fall there was a tip I did on Docker, which was starting to pick up a lot of chatter around the genoscenti. It was starting to look like a good solution for some of the problems of reproducibility and re-use of software in genomics–containerize it. Box it up, hand it off. There’s certainly a lot of interest and appeal in the community, but there are still some issues to resolve with rolling out Docker everywhere. However, my impression is that the Docker team and community seems interested and active in evolving the tools to be as broadly useful as possible.
So when this tweet rolled through the #bioinformatics twitter column on my Tweetdeck, I was excited to see this talk by Michael Barton (who has the best twitter handle in the field: @bioinformatics). It’s a terrific example of how Docker can be aimed at some of the problems in the bioinformatics tool space. It’s not the only option, of course. Some workflow resources like Galaxy can cover other features of genomics researchers’ needs. But as a general solution to the problems of comparing software and distributing complete working containers, Docker seems to be developing into a very useful strategy.
Although this is longer than our typical “tips”, I’d recommend that you carve out some time to watch if you are new to the idea of Docker. In case you don’t have time right now for the talk, here’s a summary. For the first 10 minutes, there’s a gentle introduction for non-genomics nerds about what sequencing is like right now. Then Michael describes how the assembler literature works–with competing claims about the “better” assembler as each new paper comes along. This includes a sample of the types of problems that assemblers are trying to tackle with different strategies.
Around 14min, we begin to look at what it’s like to be the researcher who needs to access some assembler software. He describes how different lab groups–like remote islands–can instantly ship their sequence data around today, but says that biologists are like “longshoremen for data”: they have to unload, unpack, install, and try to get all the right pieces together to make it work in a new lab. We are doing “break bulk” science right now. That was a really terrific assessment of the state of play, I thought.
If you are ok with the other pieces, you can skip to around 16min, where we get to know about a specific example of the benefits of Docker for this type of research. Michael goes on to describe how Docker has helped him to build a system to catalog and evaluate various assemblers. He developed the project called nucleotid.es (pronounced just as “nucleotides”), which he goes on to describe. It offers details about various assemblers, which have been put into containers that are easy to access and to use to compare different software. There are examples of benchmarks, but you can also use these containers for your own assembly purposes. You can explore the site for more detail and a lot of data on the assembler comparisons that they have already. A good overview of the reasons to do this can also be found in the blog post over there: Why use containers for scientific software?
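To make the "box it up, hand it off" idea a bit more concrete, here is a minimal Python sketch of how one might build identical `docker run` commands for several containerized assemblers so they can be compared on the same reads. The image names, the host directory, and the convention of passing the reads path as the only argument are all hypothetical placeholders for illustration; they are not the actual nucleotid.es images or interface. Only standard `docker run` options (`--rm` and `--volume`) are used.

```python
import shlex
from typing import Dict, List

# Hypothetical image names, used only for illustration.
ASSEMBLER_IMAGES = ["example/assembler-a", "example/assembler-b"]


def docker_run_command(
    image: str,
    host_reads_dir: str,
    tool_args: List[str],
    mount_point: str = "/data",
) -> List[str]:
    """Build the argument list for running one containerized tool.

    The host directory is mounted read-only inside the container, and the
    container is removed once the tool exits.
    """
    return [
        "docker", "run",
        "--rm",                                            # clean up after the run
        "--volume", f"{host_reads_dir}:{mount_point}:ro",  # read-only data mount
        image,
        *tool_args,
    ]


def comparison_commands(host_reads_dir: str, reads_file: str) -> Dict[str, List[str]]:
    """Return one command per assembler image, all pointing at the same reads.

    Passing the reads path as the single argument is an assumption; a real
    image defines its own command-line interface.
    """
    return {
        image: docker_run_command(image, host_reads_dir, [f"/data/{reads_file}"])
        for image in ASSEMBLER_IMAGES
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so no Docker install is needed.
    for image, command in comparison_commands("/home/me/reads", "sample.fastq").items():
        print(shlex.join(command))
```

The point of the sketch is that the only thing that changes between assemblers is the image name; everything about installing and configuring each tool stays inside its container.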
At about 25min, some of the constraints and problems they face are noted. Fitting Docker into existing infrastructure, and incentivising developers to create Docker containers, can be issues. But the outcomes–having a better strategy than traditional publication for reproducibility, having ongoing access to the software, and the “deduplication of agony”–seem to be worth investigating, for sure. Then Barton describes what the pipeline could look like for a researcher with some new sequence–you can use the data from a variety of assemblers to make decisions about how to proceed, rather than sifting through papers or just using what the lab next door did. And if you have a new assembler, you can use this setup to benchmark it as well.
So if you’ve been hearing about Docker, and have been concerned about access and reproducibility issues around genomics data and software, have a look at this video. It nicely presents the problems we face, and one possible solution, with a concrete example. There may be other useful methods as well–like offering a central portal for users to access multiple tools, as AutoAssemblyD has described–but that’s really for a different subset of users. For the more general problem of software comparisons, benchmarking, and access to bioinformatics tools, Docker seems to offer a useful strategy. I also did a quick PubMed check to see if Docker is percolating through the traditional publication system yet, and found that it is. I found that ballaxy (“a Galaxy-based workflow toolkit for structural bioinformatics”) is offered as a Docker image, which means that having a grasp of Docker going forward may really be useful for software users rather quickly….
Veras A., Pablo de Sá, Vasco Azevedo, Artur Silva & Rommel Ramos (2013). AutoAssemblyD: a graphical user interface system for several genome assemblers, Bioinformation, 9 (16) 840-841. DOI: http://dx.doi.org/10.6026/97320630009840
Hildebrandt A.K., D. Stockel, N. M. Fischer, L. de la Garza, J. Kruger, S. Nickels, M. Rottig, C. Scharfe, M. Schumann, P. Thiel & H.-P. Lenhof (2014). ballaxy: web services for structural bioinformatics, Bioinformatics, 31 (1) 121-122. DOI: http://dx.doi.org/10.1093/bioinformatics/btu574
Offspring of original Biostars site, Galaxy Biostars replaces their support mailing list.
Generally each week we highlight a post from the main Biostars site, which answers some question or offers discussion of bioinformatics tools or analyses across many arenas. But this week I want to give you a look at the offspring of Biostar–Galaxy Biostar!
I’m calling it the F1 of Biostars x Galaxy, rather than the gendered “son of Galaxy”. There’s a post over at the Galaxy Biostar support site that describes a transition away from their traditional mailing-list based support to this new format. I’ll link part of that here, but it’s long so you should go read the whole thing over there.
We want to create a space where researchers using Galaxy can come together and share both scientific advice and practical tool help. Whether on usegalaxy.org, a Cloudman instance, or any other Galaxy (public or local), if you have something to say about Using Galaxy, this is the place to do it!
As I noted over there, I’ve been using mailing lists happily for a long time because that’s what we had. But I think this is a great way to transition to support now instead of email lists. Go check it out!
I am currently in Puerto Varas, Chile at an EMBO genomics workshop. The workshop is mainly for grad students and the instructors are, for the most part, alumni of the Bork group. I gave a tutorial on genomics databases.
Anyway, the last two days of the workshop are a challenge: in teams of 3-4 advised by an instructor, students are to develop a list of genes associated with epilepsy. Obviously, this could be a trivial task: just go to OMIM or GENECARDS and grab a list. But this challenge requires them to go beyond that, use the available data, and make predictions. My team attempted, on my suggestion, some brainstorming techniques to ensure a more creative solution than they could come up with individually or by just jumping into normal group dynamics. It seemed to work; their solution was quite creative and we will find out today how it fared.
That was my long way of saying that, in the process, we came across many databases of gene-disease information. Above you will find a video of rat gene-disease associations from RGD, often used of course to investigate human gene-disease associations.
Below you will find a list of some excellent databases and resources to find similar lists:
There are several others, I’m sure; if you have a favorite not on this list, please comment.
Reference for RGD: Laulederkind S.J.F., Hayman G.T., Wang S.J., Smith J.R., Lowry T.F., Nigam R., Petri V., de Pons J., Dwinell M.R. & Shimoyama M. (2013). The Rat Genome Database 2013–data, tools and users, Briefings in Bioinformatics, 14 (4) 520-526. DOI: 10.1093/bib/bbt007
Metagenomics analysis can be a bit daunting at times, but there are a good number of tools out there to assist a researcher in analysis. Integrated Microbial Genomes at JGI has some excellent tools such as IMG/M and IMG HMP M (OpenHelix tutorial). There are other excellent tools that I suggest you check out; QIIME is one of them.
But the above is not per se a metagenomics tutorial; rather, it’s a short screencast of how to use the Galaxy interface for loading data and datatypes. Why? Because another excellent set of tools to use for metagenomic analysis is MetaPhlAn from the Huttenhower lab at Harvard.
I’m always interested in new strategies to visualize data. So when I saw discussion about a tool to help analyze family genomic data, I went to have a look. TrioVis is a new software tool that offers nice visualization and filtering strategies for exploring parent and child trio data sets. These data sets will become increasingly common as families seek out information for uncharacterized medical situations that may be affecting their kids. But they are being widely used already in many research situations.
TrioVis relies on the common VCF or Variant Call Format files that are generated from sequencing data. You can have a look at the types of information they carry at the 1000 Genomes project site. These files are created for each parent and the child in a trio situation, and then they are visualized with TrioVis in this manner:
The user interface consists of five sections: the main table (Fig. 1A), the global variant count bar graphs (Fig. 1B), the variant frequency sliders (Fig. 1C), the coverage sliders (Fig. 1D) and the histogram view (Fig. 1E). Each section focuses on a specific aspect of trio data and offers specific interactive features to calibrate the thresholds. Father, mother and child are colour-coded in green, orange and blue, respectively.
You can read the paper for more details on their goals and strategies. They also point to some 1000 Genomes project sample data you can use to run their tool.
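For readers who have not looked inside a VCF file, here is a minimal Python sketch of the kind of filtering that the coverage sliders described above perform. It is not TrioVis code. It assumes, purely for illustration, a single merged VCF with three sample columns named `child`, `father` and `mother`, each carrying `GT` (genotype) and `DP` (read depth) fields; the variant lines themselves are invented toy data, and the depth threshold of 10 is arbitrary.

```python
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    genotype: str  # e.g. "0/1"
    depth: int     # read depth (DP); 0 when missing


def parse_vcf(text: str) -> List[Dict]:
    """Parse the data lines of a small multi-sample VCF held in a string."""
    samples: List[str] = []
    records: List[Dict] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        calls = {}
        for name, raw in zip(samples, fields[9:]):
            values = dict(zip(keys, raw.split(":")))
            depth = values.get("DP", "0")
            calls[name] = Call(values.get("GT", "./."),
                               int(depth) if depth.isdigit() else 0)
        records.append({"chrom": fields[0], "pos": int(fields[1]),
                        "ref": fields[3], "alt": fields[4], "calls": calls})
    return records


def carries_alt(genotype: str) -> bool:
    """True if any allele in the genotype is a non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def child_only_variants(records: List[Dict], child: str, father: str,
                        mother: str, min_depth: int = 10) -> List[Dict]:
    """Keep sites where only the child carries the variant and all three
    individuals meet the coverage threshold."""
    hits = []
    for record in records:
        trio = [record["calls"][name] for name in (child, father, mother)]
        if any(call.depth < min_depth for call in trio):
            continue  # insufficient coverage in at least one individual
        kid, dad, mum = trio
        if (carries_alt(kid.genotype) and not carries_alt(dad.genotype)
                and not carries_alt(mum.genotype)):
            hits.append(record)
    return hits


# Toy data invented for this example (not from 1000 Genomes).
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "child", "father", "mother"]
ROWS = [
    ["1", "1000", ".", "A", "G", "50", "PASS", ".", "GT:DP", "0/1:30", "0/0:25", "0/0:28"],
    ["1", "2000", ".", "C", "T", "50", "PASS", ".", "GT:DP", "0/1:30", "0/1:25", "0/0:28"],
    ["1", "3000", ".", "G", "A", "50", "PASS", ".", "GT:DP", "0/1:4", "0/0:25", "0/0:28"],
]
TOY_VCF = "\n".join(["##fileformat=VCFv4.1", "\t".join(HEADER)]
                    + ["\t".join(row) for row in ROWS])

if __name__ == "__main__":
    for record in child_only_variants(parse_vcf(TOY_VCF), "child", "father", "mother"):
        print(record["chrom"], record["pos"], record["ref"], record["alt"])
```

In the toy data, the site at position 2000 is dropped because the father also carries the variant, and the site at 3000 is dropped because the child's depth falls below the threshold. Moving that threshold is the scripted equivalent of dragging a coverage slider.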
But I also want to commend the TrioVis folks for putting a screencast of their tool right in their abstract. So their video is what I’d like you to view as this week’s Tip of the Week:
a) Develop it further
b) Add new tools
c) Plug-in new datasources,
d) Run a local production server for your site because you have
Sensitive data (e.g., clinical) or
Large datasets or processing requirements that are too big to be processed on Main
“With sporadic availability of data, individuals and labs may have a need to, over a period of time, process greatly variable amounts of data. Such variability in data volume imposes variable requirements on availability of compute resources used to process given data. Rather than having to purchase and maintain desired compute resources or having to wait a long time for data processing jobs to complete, the Galaxy Team has enabled Galaxy to be instantiated on cloud computing infrastructures”
2) Can I use Galaxy to analyze protein data?
Yes, there are a few tools for analysis on the main instance, but also you can add your own tools to a local instance.
3) What kind of local server? Can you describe the PSU instance as an example (server size, storage, filesystem, etc.)?
This is a free, public, internet accessible resource. Data transfer and data storage are not encrypted. If there are restrictions on the way your research data can be stored and used, please consult your local institutional review board or the project PI before uploading it to any public site, including this Galaxy server. If you have protected data, large data storage requirements, or short deadlines you are encouraged to setup your own local Galaxy instance or run Galaxy on the cloud.
This week I attended and gave a talk at ISMB in Long Beach. While there I had the opportunity to attend a session on Galaxy where Jeremy Goecks spoke on Galaxy Visualizations and Greg Von Kuster spoke about the “first biomedical AppStore,” the Galaxy Toolshed. As always, I learned a few new things.
Today’s tip is a quick introduction to the Galaxy Tool Shed. The Tool Shed is a place to share tools you’ve developed or to find tools that other developers have developed for your local instance of Galaxy. I won’t be going into the mechanics and specifics of the toolshed; it’s not specifically for the experimental biologist end user, but rather for developers of tools for use in Galaxy. That said, it can be useful for the end user to know what tools might be available and get them into their local installation. If you or your institution is installing a local instance of Galaxy, you might want to check out the extensive documentation on how to use the toolshed.
There are a lot of tools available in the tool shed, over 1800 at last count. They range through many different categories. Though it’s only been a couple of years since the implementation of the toolshed, some published tools, such as CodonLogo, which is a logo-based viewer for codon patterns in aligned sequences, have been added to the toolshed.
Well, not that kind of galaxy (though visualizing those is quite nice), this kind of Galaxy. Galaxy is an excellent tool to analyze, reproduce and share genomics data, and the Galaxy folks are always updating, improving and adding features to the tool. We have a tutorial for Galaxy to help you get started using this tool. As you might have guessed from the previous sentence, Galaxy is a moving target. The basics (and that’s what the tutorial is for) are the same, but the tutorial is in the process of being updated to reflect some of those changes. That update should be out sooner rather than later, but that said, we just can’t fit everything into the tutorial. The relatively new visualization tool is something that will not be in the tutorial. As there are no tutorials on visualization at the Galaxy site that I can find (if you know of any, link them here!), I’ve included a quick intro to visualizations using Galaxy in this tip of the week.
There are other ways to visualize the data analyzed at Galaxy. Galaxy datasets can often be viewed directly at UCSC Genome Browser, Ensembl, RViewer or in GeneTrack within Galaxy. Those are all excellent tools and powerful ways to view and explore your analysis in depth. In addition, the Galaxy visualization tool is a way to quickly visualize your data to help discovery, direct further analysis and share what you’ve found. It is obviously not a full fledged browser, but is very useful in doing a simple visualization of your data from within Galaxy. Today’s tip gives a quick introduction to Galaxy visualization.
P.S. You might hear some bird song in the background. I am in, and working from, Hawaii for the next month (yeah, it’s tough work but someone has got to do it). No way to get those birds (or the frogs at night) to be silent for a bit.
I suspect this also means that the GenomeSpace one from today’s tip would also be down, as that’s a test server there.
This is just a PSA–I remember one time UCSC Genome Browser went down (they had a cable cut by construction work–not an earthquake that time), and the traffic to our mirrors post was astounding. So I thought people might be looking for this kind of info as well, and it’s hard to get the word out if your site is out of service…