Tag Archives: galaxy

Video Tip of the Week: Genome assemblers and #Docker

Last fall there was a tip I did on Docker, which was starting to pick up a lot of chatter around the genoscenti. It was starting to look like a good solution for some of the problems of reproducibility and re-use of software in genomics–containerize it. Box it up, hand it off. There’s certainly a lot of interest and appeal in the community, but there are still some issues to resolve with rolling out Docker everywhere. However, my impression is that the Docker team and community seem interested and active in evolving the tools to be as broadly useful as possible.

So when this tweet rolled through the #bioinformatics twitter column on my Tweetdeck, I was excited to see this talk by Michael Barton (who has the best twitter handle in the field: @bioinformatics). It’s a terrific example of how Docker can be aimed at some of the problems in the bioinformatics tool space. It’s not the only option, of course. Some workflow resources like Galaxy can cover other features of genomics researchers’ needs. But as a general solution to the problems of comparing software and distributing complete working containers, Docker seems to be developing into a very useful strategy.

Here’s the video:

Although this is longer than our typical “tips”, I’d recommend that you carve out some time to watch if you are new to the idea of Docker. In case you don’t have time right now for the talk, here’s a summary. For the first 10 minutes, there’s a gentle introduction for non-genomics nerds about what sequencing is like right now. Then Michael describes how the assembler literature works–with competing claims about the “better” assembler as each new paper comes along. This includes a sample of the types of problems that assemblers are trying to tackle with different strategies.

Around 14min, we begin to look at what it’s like to be the researcher who needs to access some assembler software. Then he describes how different lab groups–like remote islands–can instantly ship their sequence data around today. But biologists are like “longshoremen for data”: they have to unload, unpack, install, and try to get all the right pieces together to make it work in a new lab. We are doing “break bulk” science right now. That was a really terrific assessment of the state of play, I thought.

If you are ok with the other pieces, you can skip to around 16min, where we get to know about a specific example of the benefits of Docker for this type of research. Michael goes on to describe how Docker has helped him to build a system to catalog and evaluate various assemblers. He developed the project called nucleotid.es (pronounced just as “nucleotides”),  which he goes on to describe. It offers details about various assemblers, which have been put into containers that are easy to access and to use to compare different software. There are examples of benchmarks, but you can also use these containers for your own assembly purposes. You can explore the site for more detail and a lot of data on the assembler comparisons that they have already. A good overview of the reasons to do this can also be found in the blog post over there:  Why use containers for scientific software?
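To make the "box it up, hand it off" idea a little more concrete, here is a minimal Python sketch of how a containerized assembler might be invoked. The image name `example/assembler`, the `/input` and `/output` mount points, and the assembler arguments are hypothetical placeholders, not names taken from nucleotid.es. Only the `docker run` options used (`--rm` and `-v`) are standard Docker command-line flags. The sketch only builds and prints the command; it does not run anything.

```python
import shlex
from pathlib import Path


def docker_run_command(image, reads, out_dir, assembler_args=()):
    """Build a `docker run` argument list for a containerized assembler.

    The image name and the /input and /output mount points are assumptions
    made for illustration; a real image documents its own interface.
    """
    reads = Path(reads).resolve()
    out_dir = Path(out_dir).resolve()
    return [
        "docker", "run", "--rm",                   # remove the container when it exits
        "-v", f"{reads}:/input/{reads.name}:ro",   # mount the reads file read-only
        "-v", f"{out_dir}:/output",                # mount a directory for results
        image,
        *assembler_args,                           # passed through to the container
    ]


cmd = docker_run_command(
    "example/assembler",                 # hypothetical image name
    "reads.fq.gz",
    "assembly_out",
    ["/input/reads.fq.gz", "/output"],   # hypothetical assembler arguments
)
print(shlex.join(cmd))
```

With Docker installed and a real image substituted in, the same list could be handed to `subprocess.run(cmd, check=True)`. The appeal is that swapping assemblers would then mostly mean swapping the image name, rather than repeating the install-and-configure work in every lab.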

At about 25min, some of the constraints and problems are noted. Fitting Docker into existing infrastructure, and incentivising developers to create Docker containers, can be issues. But the outcomes–having a better strategy than traditional publication for reproducibility, having ongoing access to the software, and the “deduplication of agony”–seem to be worth investigating, for sure. Then Barton describes what the pipeline could look like for a researcher with some new sequence–you can use the data from a variety of assemblers to make decisions about how to proceed, rather than sifting through papers or just using what the lab next door did. And if you have a new assembler, you can use this setup to benchmark it as well.

So if you’ve been hearing about Docker, and have been concerned about access and reproducibility issues around genomics data and software, have a look at this video. It nicely presents the problems we face, and one possible solution, with a concrete example. There may be other useful methods as well–like offering a central portal for users to access multiple tools, like AutoAssemblyD has described–but that’s really for a different subset of users. But for the more general problem of software comparisons, benchmarking, and access to bioinformatics tools, Docker seems to offer a useful strategy. And I did a quick PubMed check to see if Docker is percolating through the traditional publication system yet, and found that it is. I found that ballaxy (“a Galaxy-based workflow toolkit for structural bioinformatics”) is offered as a Docker image, which means that having a grasp of Docker going forward may really be useful for software users rather quickly….

Quick links:

nucleotid.es: http://nucleotid.es

Docker: http://www.docker.com

References (and in this case the slide deck):

And other useful and related items from this post:

Automating the Selection Process for a Genome Assembler, JGI Science Highlights. October 17, 2014. http://jgi.doe.gov/automating-selection-process-genome-assembler/

Veras A., de Sá P., Azevedo V., Silva A. & Ramos R. (2013). AutoAssemblyD: a graphical user interface system for several genome assemblers, Bioinformation, 9 (16) 840-841. DOI: http://dx.doi.org/10.6026/97320630009840

Hildebrandt A.K., D. Stockel, N. M. Fischer, L. de la Garza, J. Kruger, S. Nickels, M. Rottig, C. Scharfe, M. Schumann, P. Thiel & H.-P. Lenhof (2014). ballaxy: web services for structural bioinformatics, Bioinformatics, 31 (1) 121-122. DOI: http://dx.doi.org/10.1093/bioinformatics/btu574

What’s The Answer? (F1 of Biostars x Galaxy version!)

Offspring of original Biostars site, Galaxy Biostars replaces their support mailing list.

Generally each week we highlight a post from the main Biostars site, which answers some question or offers discussion of bioinformatics tools or analyses across many arenas. But this week I want to give you a look at the offspring of Biostar–Galaxy Biostar!

I’m calling it the F1 of Biostars x Galaxy, rather than the gendered “son of Galaxy”. There’s a post over at the Galaxy Biostar support site that describes a transition away from their traditional mailing-list based support to this new format. I’ll link part of that here, but it’s long so you should go read the whole thing over there.

Forum: Welcome to Galaxy Biostar

Dear Galaxy Community,
Galaxy has teamed up with Biostar to create a Galaxy User support forum at https://biostar.usegalaxy.org!

We want to create a space where researchers using Galaxy can come together and share both scientific advice and practical tool help.  Whether on usegalaxy.org, a Cloudman instance, or any other Galaxy (public or local), if you have something to say about Using Galaxy, this is the place to do it!

[has a lot more detail--go read the whole thing over there]

Jennifer Hillman Jackson

As I noted over there, I’ve been using mailing lists happily for a long time because that’s what we had. But I think this is a great way to transition to support now instead of email lists. Go check it out!


Video Tip of the Week: list of genes associated with a disease

I am currently in Puerto Varas, Chile at an EMBO genomics workshop. The workshop is mainly for grad students and the instructors are, for the most part, alumni of the Bork group. I gave a tutorial on genomics databases.

Anyway, the last two days of the workshop are a challenge: in teams of 3-4 advised by an instructor, students are to develop a list of genes associated with epilepsy. Obviously, this could be a trivial task: just go to OMIM or GeneCards and grab a list. But this challenge requires them to go beyond that, using the available data to make predictions. My team attempted, on my suggestion, some brainstorming techniques to ensure a more creative solution than they could come up with individually or by just jumping into normal group dynamics. It seemed to work: their solution was quite creative, and we will find out today how that worked.

That was my long way of saying that, in the process, we came across many databases of gene-disease information. Above you will find a video of rat gene-disease associations from RGD, often used of course to investigate human gene-disease associations.

Below you will find a list of some excellent databases and resources to find similar lists:

Genetic Association Database http://geneticassociationdb.nih.gov/

G2D http://g2d2.ogic.ca

OMIM http://www.omim.org

Diseases http://diseases.jensenlab.org/

GeneCards http://genecards.org

DisGeNET http://ibi.imim.es/web/DisGeNET/

Several NCBI resources http://www.ncbi.nlm.nih.gov/guide/howto/find-gen-phen/

UCSC Genome Browser’s tracks for disease and phenotype http://genome.ucsc.edu

There are several others, I’m sure. If you have a favorite not on this list, please comment.
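If you do grab lists from several of these resources, one simple first step is to see which genes are reported by more than one of them. Here is a minimal Python sketch of that idea. It is only an illustration, not what the workshop teams did; the source names and gene symbols are placeholders rather than real disease associations, and real exports would need their own parsing and identifier mapping.

```python
from collections import Counter


def rank_genes_by_support(gene_lists):
    """Rank genes by how many sources list them.

    gene_lists maps a source name to an iterable of gene symbols.
    Returns (gene, n_sources) pairs, most widely supported first,
    with ties broken alphabetically.
    """
    support = Counter()
    for genes in gene_lists.values():
        # Normalize case and de-duplicate within each source
        support.update({g.strip().upper() for g in genes})
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Placeholder symbols and sources, not real disease associations
example_lists = {
    "source_1": ["GENE_A", "GENE_B", "GENE_C"],
    "source_2": ["gene_b", "GENE_C", "GENE_D"],
    "source_3": ["GENE_C"],
}
ranked = rank_genes_by_support(example_lists)
print(ranked)
```

Counting sources is a crude measure, since databases often draw on the same underlying literature, but it gives a quick starting point before moving on to actual predictions.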

Reference for RGD:
Laulederkind S.J.F., Hayman G.T., Wang S.J., Smith J.R., Lowry T.F., Nigam R., Petri V., de Pons J., Dwinell M.R. & Shimoyama M. (2013). The Rat Genome Database 2013–data, tools and users, Briefings in Bioinformatics, 14 (4) 520-526. DOI:

Video Tip of the Week: MetaPhlAn and Galaxy

CPB Using Galaxy 2 from Galaxy Project on Vimeo.

for loading and using datatypes and  the OpenHelix Galaxy tutorial for getting familiar with Galaxy interface and usage.

Metagenomics analysis can be a bit daunting at times, but there are a good number of tools out there to assist a researcher in analysis. Integrated Microbial Genomes at JGI has some excellent tools such as IMG/M and IMG HMP M (OpenHelix tutorial). There are other excellent tools that I suggest you check out. QIIME is an excellent tool also.

But the above is not per se a metagenomics tutorial, rather it’s a short screencast of how to use the Galaxy interface for loading data and datatypes. Why? Because another excellent set of tools to use for metagenomic analysis is MetaPhlAn from the Huttenhower lab at Harvard.

The MetaPhlAn tools can be downloaded and used ‘offline’, but they also have an excellent Galaxy interface to the tools. If you walk yourself through the MetaPhlAn tutorials on their site, including their Galaxy module one, after familiarizing yourself with Galaxy above, that should help you get started on some excellent metagenomics analysis.

To get a feel of these and other tools and workflows, you might want to browse through this excellent slide set from Surya Saha, Research Associate at Cornell University, from last year.

Quick Links:


Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson & Curtis Huttenhower (2012). Metagenomic microbial community profiling using unique clade-specific marker genes Nature Methods (9), 811-814 : doi:10.1038/nmeth.2066

Video Tip of the Week: TrioVis for family genome data sets

I’m always interested in new strategies to visualize data. So when I saw discussion about a tool to help analyze family genomic data, I went to have a look. TrioVis is a new software tool that offers nice visualization and filtering strategies for exploring parent and child trio data sets. These data sets will become increasingly common as families seek out information for uncharacterized medical situations that may be affecting their kids. But they are being widely used already in many research situations.

TrioVis relies on the common VCF or Variant Call Format files that are generated from sequencing data. You can have a look at the types of information they carry at the 1000 Genomes project site. These files are created for each parent and the child in a trio situation, and then they are visualized with TrioVis in this manner:

The user interface consists of five sections: the main table (Fig. 1A), the global variant count bar graphs (Fig. 1B), the variant frequency sliders (Fig. 1C), the coverage sliders (Fig. 1D) and the histogram view (Fig. 1E). Each section focuses on a specific aspect of trio data and offers specific interactive features to calibrate the thresholds. Father, mother and child are colour-coded in green, orange and blue, respectively.

You can read the paper for more details on their goals and strategies. They also point to some 1000 Genomes project sample data you can use to run their tool.
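If you haven’t looked inside a VCF file before, here is a minimal Python sketch of the kind of comparison a trio analysis starts from. To be clear, this is not how TrioVis works internally; it is a simplified illustration that assumes one VCF per family member, uses made-up toy records, and ignores genotype, coverage and quality (the very things the TrioVis sliders let you tune). It reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and reports variants called in the child but in neither parent.

```python
import io


def read_variants(handle):
    """Collect (chrom, pos, ref, alt) keys from a VCF file handle."""
    variants = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            variants.add((chrom, int(pos), ref, allele))
    return variants


def child_only_variants(child, father, mother):
    """Variants called in the child but in neither parent."""
    return child - father - mother


# Toy records made up for illustration; not real data
HEADER = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
father_vcf = HEADER + "1\t100\t.\tA\tG\t50\tPASS\t.\n"
mother_vcf = HEADER + "1\t200\t.\tC\tT\t50\tPASS\t.\n"
child_vcf = HEADER + "1\t100\t.\tA\tG\t50\tPASS\t.\n1\t300\t.\tG\tA\t50\tPASS\t.\n"

father = read_variants(io.StringIO(father_vcf))
mother = read_variants(io.StringIO(mother_vcf))
child = read_variants(io.StringIO(child_vcf))

candidates = child_only_variants(child, father, mother)
print(sorted(candidates))
```

A child-only call from a naive comparison like this can just as easily reflect a missed call in a parent as a genuine new variant, which is one reason interactive filtering on coverage and frequency is useful for this kind of data.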

But I also want to commend the TrioVis folks for putting a screencast of their tool right in their abstract. So their video is what I’d like you to view as this week’s Tip of the Week:

TrioVis from Ryo Sakai on Vimeo.

Right now there isn’t a web interface to use, but I noticed in their paper that they plan to integrate this into Galaxy. I think that’s another great idea on their part.

So if you find yourself exploring family trio data sets, consider a look at TrioVis.

Hat tip to Justin Johnson for drawing my attention to this paper and resource.

Quick links:

TrioVis software: https://bitbucket.org/biovizleuven/triovis/wiki/Home

TrioVis video: http://vimeo.com/user6757771/triovis


Sakai, R., Sifrim, A., Vande Moere, A., & Aerts, J. (2013). TrioVis: a visualization approach for filtering genomic variants of parent-child trios Bioinformatics DOI: 10.1093/bioinformatics/btt267

Galaxy Intro Webinar follow-up post (July 19)

We’ll be having our July 19th Galaxy webinar today, and we find there are questions to follow up afterwards that are often better handled in discussions on the blog.

If there are questions we didn’t have time to get to–or things we want to expand on with more detail–we can discuss them in this thread.

Or if you have other things you’ve been meaning to ask, let us know.

If you have registered for the webinar, the same material will be available in the training movie, slides, and exercises tutorial suite: http://www.openhelix.com/galaxy. You can also sign up to be informed of future webinars coming up on these topics, UCSC, ENCODE and others.

Some questions asked in today’s webinar, with answers:

1) Galaxy seems to be downloadable in addition to the PSU portal and the cloud at Amazon. How would you choose?

Each has its purposes. From the Galaxy Wiki:
Install your own Galaxy if you want to,

a) Develop it further
b) Add new tools
c) Plug-in new datasources,
d) Run a local production server for your site because you have
   Sensitive data (e.g., clinical) or
   Large datasets or processing requirements that are too big to be processed on Main

Use the Cloud:

“With sporadic availability of data, individuals and labs may have a need to, over a period of time, process greatly variable amounts of data. Such variability in data volume imposes variable requirements on availability of compute resources used to process given data. Rather than having to purchase and maintain desired compute resources or having to wait a long time for data processing jobs to complete, the Galaxy Team has enabled Galaxy to be instantiated on cloud computing infrastructures”

2) Can I use Galaxy to analyze protein data?

Yes, there are a few tools for analysis on the main instance, but also you can add your own tools to a local instance.

3) What kind of local server? Can you describe the PSU instance as an example (server size, storage, filesystem, etc.)?

Check out this link for needs.

4) Can we use galaxy to align the whole genome sequences of rice to get SNPs?

This link might help.

5) Is there a link to the toolshed from the galaxy interface?

Not that I know, but this is it: http://toolshed.g2.bx.psu.edu/

6) How secure is the data we run on galaxy.psu?

 From the site (emphasis added in answer):

This is a free, public, internet accessible resource. Data transfer and data storage are not encrypted. If there are restrictions on the way your research data can be stored and used, please consult your local institutional review board or the project PI before uploading it to any public site, including this Galaxy server. If you have protected data, large data storage requirements, or short deadlines you are encouraged to setup your own local Galaxy instance or run Galaxy on the cloud.


Tip of the Week: Galaxy Tool Shed

This week I attended and gave a talk at ISMB in Long Beach. While there I had the opportunity to attend a session on Galaxy where Jeremy Goecks spoke on Galaxy Visualizations and Greg Von Kuster spoke about the “first biomedical AppStore,” the Galaxy Toolshed. As always, I learned a few new things.

Today’s tip is a quick introduction to the Galaxy Tool Shed. The Tool Shed is a place to share tools you’ve developed or to find tools that other developers have developed for your local instance of Galaxy. I won’t be going into the mechanics and specifics of the Tool Shed; it’s not specifically for the experimental biologist end user, but rather for developers of tools for use in Galaxy. That said, it can be useful for the end user to know what tools might be available and get them into their local installation. If you or your institution is installing a local instance of Galaxy, you might want to check out the extensive documentation on how to use the Tool Shed.

There are a lot of tools available in the Tool Shed, over 1800 at last count. They range through many different categories. Though it’s only been a couple years since the implementation of the Tool Shed, some published tools such as CodonLogo, which is a logo-based viewer for codon patterns in aligned sequences, have been added to it.

If you want to learn more about Galaxy:

We have a  webinar tomorrow (July 19, 2012 at 11am PDT)  introducing Galaxy (free).

We have an online tutorial (fee)

And we’ve done tips (free of course) on Galaxy visualization, getting flanking sequences and converting genome coordinates using Galaxy,  and Galaxy pages. And we’ve tipped and blogged a lot of Galaxy-related stuff.

Quick Links:
Galaxy Main Instance
Galaxy Tool Box
Galaxy Tool Box How-to
Setting up a local instance


Sharma V, Murphy DP, Provan G, & Baranov PV (2012). CodonLogo: a sequence logo-based viewer for codon patterns. Bioinformatics (Oxford, England), 28 (14), 1935-6 PMID: 22595210

Video Tip of the Week: Visualizing the Galaxy

An antennae galaxy

Well, not that kind of galaxy (though visualizing those is quite nice), this kind of Galaxy. Galaxy is an excellent tool to analyze, reproduce and share genomics data and the Galaxy folks are always updating, improving and adding features to the tool. We have a tutorial for Galaxy to help you get started using this tool. As you might have guessed from the previous sentence, Galaxy is a moving target. The basics (and that’s what the tutorial is for) are the same, but the tutorial is in the process of being updated to reflect some of those changes. That update should be out sooner rather than later, but that said, we just can’t fit everything into the tutorial. The relatively new visualization tool is something that will not be in the tutorial. As there are no tutorials on visualization at the Galaxy site that I can find (if you know of any, link them here!), I’ve included a quick intro to visualizations using Galaxy in this tip of the week.

There are other ways to visualize the data analyzed at Galaxy. Galaxy datasets can often be viewed directly at UCSC Genome Browser, Ensembl, RViewer or in GeneTrack within Galaxy. Those are all excellent tools and powerful ways to view and explore your analysis in depth. In addition, the Galaxy visualization tool is a way to quickly visualize your data to help discovery, direct further analysis and share what you’ve found. It is obviously not a full-fledged browser, but it is very useful for doing a simple visualization of your data from within Galaxy. Today’s tip gives a quick introduction to Galaxy visualization.

Quick Links:
Galaxy (OH tutorial-subscr.)
UCSC Genome Browser (OH tutorials-free)
Ensembl (OH tutorials-subscr.)

P.S. You might hear some bird song in the background. I am in, and working from, Hawaii for the next month (yeah, it’s tough work but someone has got to do it). No way to get those birds (or the frogs at night) to be silent for a bit.

UPDATE: Galaxy servers are ̶d̶o̶w̶n̶ semi-up (they know). Other mirrors or sites

UPDATE: Galaxy is up–but…

Be nice–don’t run giant projects right now…and it might not be entirely stable anyway. If you can wait, it might be wise.


I saw a notice earlier, but figured it would be short term. However, just now I saw this:

You can follow the Galaxy twitter feed for updates: @GalaxyProject

Here are links to some mirrors or other servers you can use if you need one at BioStars: list of public Galaxy servers

I suspect this also means that the GenomeSpace one from today’s tip would also be down, as that’s a test server there.

This is just a PSA–I remember one time UCSC Genome Browser went down (they had a cable cut by construction work–not an earthquake that time), and the traffic to our mirrors post was astounding. So I thought people might be looking for this kind of info as well, and it’s hard to get the word out if your site is out of service…


Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

And one last special item:

PhD The Movie is now available for streaming–check out the details here: