Tag Archives: galaxy

Galaxy Intro Webinar follow-up post (July 19)

We’re holding our July 19th Galaxy webinar today. We find that follow-up questions are often better handled in discussions on the blog.

If there are questions we didn’t have time to get to–or things we want to expand on with more detail–we can discuss them in this thread.

Or if you have other things you’ve been meaning to ask, let us know.

If you have registered for the webinar, the same material will be available in the training movie, slides, and exercises tutorial suite: http://www.openhelix.com/galaxy. You can also sign up to be informed of future webinars coming up on these topics, UCSC, ENCODE and others.

Some questions asked in today’s webinar, with answers:

1) Galaxy seems to be downloadable in addition to the PSU portal and the cloud at Amazon. How would you choose?

Each has its purposes. From the Galaxy Wiki:
Install your own Galaxy if you want to:

a) Develop it further
b) Add new tools
c) Plug in new datasources
d) Run a local production server for your site because you have
  • Sensitive data (e.g., clinical) or
  • Large datasets or processing requirements that are too big to be processed on Main

Use the Cloud:

“With sporadic availability of data, individuals and labs may have a need to, over a period of time, process greatly variable amounts of data. Such variability in data volume imposes variable requirements on availability of compute resources used to process given data. Rather than having to purchase and maintain desired compute resources or having to wait a long time for data processing jobs to complete, the Galaxy Team has enabled Galaxy to be instantiated on cloud computing infrastructures”

2) Can I use Galaxy to analyze protein data?

Yes, there are a few tools for analysis on the main instance, but also you can add your own tools to a local instance.

3) What kind of local server? Can you describe the PSU instance as an example? Server size, storage, filesystem, etc.?

Check out this link for needs.

4) Can we use galaxy to align the whole genome sequences of rice to get SNPs?

This link might help.

5) Is there a link to the toolshed from the galaxy interface?

Not that I know of, but this is it: http://toolshed.g2.bx.psu.edu/

6) How secure is the data we run on galaxy.psu?

 From the site (emphasis added in answer):

This is a free, public, internet accessible resource. Data transfer and data storage are not encrypted. If there are restrictions on the way your research data can be stored and used, please consult your local institutional review board or the project PI before uploading it to any public site, including this Galaxy server. If you have protected data, large data storage requirements, or short deadlines you are encouraged to setup your own local Galaxy instance or run Galaxy on the cloud.

 

Tip of the Week: Galaxy Tool Shed

This week I attended and gave a talk at ISMB in Long Beach. While there I had the opportunity to attend a session on Galaxy where Jeremy Goecks spoke on Galaxy Visualizations and Greg Von Kuster spoke about the “first biomedical AppStore,” the Galaxy Toolshed. As always, I learned a few new things.

Today’s tip is a quick introduction to the Galaxy Tool Shed. The Tool Shed is a place to share tools you’ve developed, or to find tools that other developers have built, for your local instance of Galaxy. I won’t be going into the mechanics and specifics of the Tool Shed; it’s not aimed at the experimental biologist end user, but rather at developers of tools for use in Galaxy. That said, it can be useful for end users to know what tools might be available and to get them into their local installation. If you or your institution is installing a local instance of Galaxy, you might want to check out the extensive documentation on how to use the Tool Shed.

There are a lot of tools available in the Tool Shed, over 1800 at last count, and they range across many different categories. Though it’s only been a couple of years since the Tool Shed was implemented, some published tools have already been added, such as CodonLogo, a logo-based viewer for codon patterns in aligned sequences.

If you want to learn more about Galaxy:

We have a  webinar tomorrow (July 19, 2012 at 11am PDT)  introducing Galaxy (free).

We have an online tutorial (fee)

And we’ve done tips (free of course) on Galaxy visualization, getting flanking sequences and converting genome coordinates using Galaxy,  and Galaxy pages. And we’ve tipped and blogged a lot of Galaxy-related stuff.

Quick Links:
Galaxy Main Instance
Galaxy Tool Box
Galaxy Tool Box How-to
Setting up a local instance

 

Sharma V, Murphy DP, Provan G, & Baranov PV (2012). CodonLogo: a sequence logo-based viewer for codon patterns. Bioinformatics (Oxford, England), 28 (14), 1935-6 PMID: 22595210

Video Tip of the Week: Visualizing the Galaxy


An antennae galaxy

Well, not that kind of galaxy (though visualizing those is quite nice), this kind of Galaxy. Galaxy is an excellent tool to analyze, reproduce and share genomics data, and the Galaxy folks are always updating, improving and adding features to the tool. We have a tutorial for Galaxy to help you get started using this tool. As you might have guessed from the previous sentence, Galaxy is a moving target. The basics (and that’s what the tutorial is for) are the same, but the tutorial is in the process of being updated to reflect some of those changes. That update should be out sooner rather than later, but that said, we just can’t fit everything into the tutorial. The relatively new visualization tool is something that will not be in the tutorial. As there are no tutorials on visualization at the Galaxy site that I can find (if you know of any, link them here!), I’ve included a quick intro to visualizations using Galaxy in this tip of the week.

There are other ways to visualize the data analyzed at Galaxy. Galaxy datasets can often be viewed directly at UCSC Genome Browser, Ensembl, RViewer or in GeneTrack within Galaxy. Those are all excellent tools and powerful ways to view and explore your analysis in depth. In addition, the Galaxy visualization tool is a way to quickly visualize your data to help discovery, direct further analysis and share what you’ve found. It is obviously not a full-fledged browser, but it is very useful for doing a simple visualization of your data from within Galaxy. Today’s tip gives a quick introduction to Galaxy visualization.

Quick Links:
Galaxy (OH tutorial-subscr.)
UCSC Genome Browser (OH tutorials-free)
Ensembl (OH tutorials-subscr.)
RViewer
GeneTrack

P.S. You might hear some bird song in the background. I am in, and working from, Hawaii for the next month (yeah, it’s tough work but someone has got to do it). No way to get those birds (or the frogs at night) to be silent for a bit.

UPDATE: Galaxy servers are ̶d̶o̶w̶n̶ semi-up (they know). Other mirrors or sites

UPDATE: Galaxy is up–but…

Be nice–don’t run giant projects right now…and it might not be entirely stable anyway. If you can wait, it might be wise.

++++++++++++++++++++++++

I saw a notice earlier, but figured it would be short term. However, just now I saw this:

You can follow the Galaxy twitter feed for updates: @GalaxyProject

If you need one, here are links to some mirrors or other servers you can use, from BioStars: list of public Galaxy servers

I suspect this also means that the GenomeSpace one from today’s tip would also be down, as that’s a test server there.

This is just a PSA–I remember one time UCSC Genome Browser went down (they had a cable cut by construction work–not an earthquake that time), and the traffic to our mirrors post was astounding. So I thought people might be looking for this kind of info as well, and it’s hard to get the word out if your site is out of service…

 

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

And one last special item:

PhD The Movie is now available for streaming–check out the details here:

http://www.phdmovie.com/

Video Tips of the Week: Annual Review IV, 2nd half

As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. We have completed around 200 little tidbit introductions to various resources from last year, 2011 (yep, it’s 2012 now). At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II. The summary of the first half of 2011 is available from last week.

July 2011

July 6: Prioritizing genes using the Gene Prioritization Portal

July 13: PolySearch, searching many databases at once

July 20: Human Epigenomics Visualization Hub

July 27: The new SIB Bioinformatics Resource Portal

 

August 2011

August 3: SNPexp, correlation between SNPs and gene expression 

August 10: CompaGB for comparing genome browser software

August 17: CoGe, comparing genomes revisited

August 24: Domain Draw for quick motif diagrams

August 31: From UniProt to the PSI SBKB and back again

 

September 2011

September 7: Plant comparative genomics using Plaza

September 14: phiGENOME for bacteriophage genome exploration

September 21: Getting flanking sequences of genomic locations

September 28: Introduction to R statistical software 

 

October 2011

October 5: VnD resource for genetic variation and drug information

October 12: Track Hubs in UCSC Genome Browser

October 19: Mitochondrial Transcriptome GBrowser 

October 26: Variation data from Ensembl

 

November 2011

November 2: MizBee Synteny Browser

November 9: The new database of genomic variants: DGV2

November 16: MapMi, automated mapping of microRNA loci

November 23: BioMart’s new central portal

November 30: Phosida, a post-translational modification database

December 2011

December 7: VarSifter, for identifying key sequence variations

December 14: Big changes to NCBI’s genome resources

December 21: eggNOG for the Holidays (or to explore orthologous genes)

December 28: Video Tips of the Week: Annual Review IV (first half of 2011)

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • News from the UCSC Genome Browser RT @GenomeBrowser: We now support Variant Call Format (VCF) a line-oriented format for single nucleotide variants, indels, copy number and structural variants. [Mary]
  • RT @MyBioTechniques: 2012 Genomics Crystal Ball http://bit.ly/vJXnKU [Mary]
  • Another day, another species. But I had no idea falcons were looming. RT @Articlesposting: Genomic sequences of two iconic falconry birds — Peregrine and Saker Falcons —  successfully decoded http://ping.fm/0Imis [Mary]
  • RT @datatelling: Martin Krzywinski’s site is a treasure trove of visual inspiration: http://mkweb.bcgsc.ca/ #hiveplots #circos #genome #photography #etc #etc [Mary]
  • RT @m_m_campbell: Thyme in a bottle. Plant genomes will shed light on botanical medicines. http://bit.ly/tA3YTA #genomics #botany #plants #medicinalplants [Mary]
  • RT @StrandLife: How Identical are Identical Twins? A bioinformatician’s approach: http://bit.ly/tcRG31  #genomics #bioinformatics [Mary]
  • From Dec 14th AAAS Policy Alert: “White House Announces New Innovation Fund. On Dec. 8 the White House announced the launch of the Early Stage Innovation Fund, which will provide $1 billion in matching capital to Small Business Investment Companies (SBIC), targeting early-stage small businesses. The fund is part of the administration’s Startup America initiative and is included in proposed rule changes from the Small Business Administration that allow private investment in SBIC participants. The proposals are open for comment until Feb. 7, 2012. The fund will be fully implemented in 2012.” [Jennifer]
  • Double dose, but handy to store: RT @galaxyproject: Search the Galaxy! http://bit.ly/gxysearch Custom Galaxy searches now available http://bit.ly/uUX9Mv #usegalaxy  [Mary]
  • RT @ejwillingham: Rosie Redfield, who blogs @FieldofScience blog network, named a top science newsmaker of 2011: ow.ly/877ZC [Mary]

Why don’t users employ workflows for “big data”? I know why.

Yesterday a tweet to a great post came across the ethers, and ever since I read it I knew I had to write this post. Here’s the original nugget:

RT @ctitusbrown: (my) thoughts on data intensive science & workflows: http://bit.ly/tWXSnx

It is a post about why end users are not adopting workflows, which could really help them in this eScience world we find ourselves in as we keep moving forward with giant data sets and “big data” projects. It also makes some other points about what we need in workflows. We’re big fans of workflows and have talked about them in the past (Tip of the week: The Taverna Project for workflows; What’s the Answer? Alternatives to Galaxy; Tip of the Week: BioExtract Server; lots of Galaxy posts).

But the first major point in the post asked: Why don’t people use workflows in bioinformatics?

I know why. The first key point is that they are not trained to use them. When we’ve done Galaxy training workshops, we see how quickly people get the point of Galaxy and how it can save them time. And they love the assemblage of tools that they’d have to otherwise seek out at numerous individual sites. So a major step would be 1) awareness that the workflow tool exists, and 2) some gentle introduction at a very basic level to get them started. A lot of people in bioinformatics are not daunted by interfaces with the complexity of Galaxy. But people who don’t spend all day on software and databases are not at the same point.

So, very basic intro training on workflow tools is a big step. But there’s actually another step before that. Biologists need to know how to mine the big data that they are told is out there. Some of the more computationally sophisticated biologists already have their own data, or know how to get it. But if we are going to succeed at increasing use of workflow tools, we also need to train people on how to mine the big data. They don’t necessarily have that step yet.

When we do a UCSC Genome Browser workshop, it’s in 2 segments. I do the intro section first with very basic intro to how to structure a query, how to look at the graphics, how to change the views, etc. Almost always I start with a question for the attendees: how many people here have spent more than 1 hour (total) hammering around on the UCSC Genome Browser? In the average room, this ranges from about 1/3 to 1/2 of the attendees. Generally, more than half have never touched it before.

But then Trey does the advanced topics section, largely about the Table Browser and Custom Tracks. He also starts with the question: how many of you have spent more than an hour using the Table Browser? Generally, there are 1-3 individuals who have, if any. These are in rooms of 25-50 people (sometimes over 100). And if you haven’t used Galaxy before, you may not know that the primary way to get UCSC Genome Browser data is that the Galaxy interface throws you to the Table Browser. (Or BioMart versions or InterMine versions or whatever–they know even less about those in our experience.)

If you don’t know how to get the data (step 1), the workflow setup (step 2+) is not going to help you.

Bioinformatics folks: you’d be stunned to know what biologists don’t know about the tools. And here’s something else they tell us: often the trainings they’ve been offered (if they have had them) start out over their heads. Expert users–or representatives of the tool being trained on–are very often too close to the tool to realize that there are a lot of more basic things people need to know.  But the trainees don’t want to look stupid in front of their colleagues and ask the basic questions. Or they don’t want to be critical of the tool features to the folks who build them.

And this requires cross-training across the bioinformatics projects and data sets. However, sometimes the funding for outreach is limited to one’s own tools. But without some of the other key components–other sources, other projects–users are not going to be able to pull together what they really need.

As the “data bonanza” era proceeds, there’s only going to be more and more data stored that biologists could be using to make fabulous discoveries. It’s not in the papers anymore, as I keep saying (over and over and over). But the bench biologists aren’t getting enough training to take their expertise to mine these data sets.

The other points Titus makes are also great on the workflow issues. This part is particularly resonant with me:

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I have begged workflow providers to provide the versions of the components of the workflows. It stuns me every time I’m told that no–it’s up to you to know that. I can’t even tell which version of the tools they have installed. How can I record that and then know if they changed the underlying algorithm since the last time I ran the workflow? This is a major problem if you want to pitch these tools as a great way to offer reproducibility of research.
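To make the versioning point concrete, here is a minimal sketch of the kind of record I wish came with every workflow run. It is not a feature of Galaxy or any other workflow system; the record layout, the function names, and the tool names and version strings are all hypothetical, made up for illustration. The idea is simply to store the tool and version for each step, so that two runs can be compared later.

```python
import hashlib
import json


def workflow_fingerprint(steps):
    """Return a stable hash for a workflow's list of step/tool/version records."""
    # Serialize deterministically so identical records always hash the same way.
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_steps(old_steps, new_steps):
    """List the step names whose tool or version differs between two runs."""
    old = {s["step"]: (s["tool"], s["version"]) for s in old_steps}
    new = {s["step"]: (s["tool"], s["version"]) for s in new_steps}
    return sorted(name for name in old.keys() | new.keys()
                  if old.get(name) != new.get(name))


# Hypothetical records: these tool names and version strings are invented.
run_january = [
    {"step": "align", "tool": "example_aligner", "version": "1.0"},
    {"step": "call_variants", "tool": "example_caller", "version": "0.3"},
]
run_march = [
    {"step": "align", "tool": "example_aligner", "version": "1.0"},
    {"step": "call_variants", "tool": "example_caller", "version": "0.4"},
]

# Which steps changed under me between the two runs?
print(changed_steps(run_january, run_march))
# Do the two runs share the same overall fingerprint?
print(workflow_fingerprint(run_january) == workflow_fingerprint(run_march))
```

A record like this only helps if the workflow provider exposes the installed versions in the first place, which is exactly what I keep asking for.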

The basic point though–everyone ought to be using workflow tools–is 100% solid. But users need more help to get to that point. 

Quick links:

to C. Titus Brown’s original post: Data Intensive Science and Workflows

to Galaxy: http://usegalaxy.org

to UCSC Genome Browser: http://genome.ucsc.edu/

to Taverna: http://www.taverna.org.uk/

to a great list of workflow tools (via Casey Bergman at Titus’ comments section): http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

Reference:
Goecks, J., Nekrutenko, A., Taylor, J., & Galaxy Team, T. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences Genome Biology, 11 (8) DOI: 10.1186/gb-2010-11-8-r86

Video tip of the week: VarSifter for identifying key sequence variations

Recently many of the bioinformatics tweeps I follow were excited about the tool called VarSifter. Here’s the notice that I saw:

RT @yokofakun: http://www.youtube.com/watch?v=I7azpqTWFuM Jamie Teer describes VarSifter, an interactive GUI tool for handing/quering/filtering VCFs #ngs

I just had a chance to watch the video, and now I can see why they were impressed! Over the years in the workshops we do, people have asked questions in various theme groups. For a while it was lists of genes and microarrays. Then it was known SNP variations. Then it became transcription factor binding sites. Lately it’s been: I have a giant set of sequence data that I need to process to find new variants that might impact genes. How do I do that? This video tip-of-the-week will help you to understand how to do that.

This video was part of a day of lectures at the NHGRI about how to deal with exome sequencing data: Next-Gen 101: Video Tutorial on Conducting Whole-Exome Sequencing Research. There is a whole series of video and slide material available from NHGRI’s page, and the one I’m highlighting here is number 3 on that list. Be sure to download the slides if you want to take notes, and access the references and URLs that are key to the material.

Jamie Teer gives a terrific talk about dealing with the exome sequence data output that next-gen projects are yielding. It starts with just managing and viewing the reads, and he highlights a couple of different ways to do this. It includes SAMtools, and also shows how the reads look in both UCSC Genome Browser and in the Broad’s Integrative Genomics Viewer, IGV. It’s nice to see a comparison of these to illustrate what you might expect to see. We could help you to understand how to load this kind of data as custom tracks in the UCSC Genome Browser with our advanced tutorial, and you’ll find some nice guidance on what to expect from IGV from the paper listed below in the references area.

The video also describes annotation software that helps you to identify where the variations and consequences are in the data. Many of these tools we have talked about either in our tutorials or our other tips-of-the-week.

He also describes how people generate pipelines to flow the data through a series of steps to do the analysis. Sometimes these are home-made programs used by a local group. But he also mentioned how Galaxy can help to accomplish this now.  We’ve been fans of Galaxy for a long time, and we know people are using it in exactly this manner.

You still should have a basic understanding of all the tools individually if you want to use them all, or tools that incorporate them all into workflows/processes, though. It will help you to create better workflows/pipelines. And it also matters that you know what you aren’t seeing/using.

Teer closes by introducing the VarSifter software that he’s been involved with creating. This software is freely available for you to download at the VarSifter site. Usually we prefer to highlight web-based interfaces, but there isn’t one for VarSifter. But if you see the utility in it you can also try to get a local copy set up for yourself. VarSifter will help you to view, sort, and filter variants in a lot of ways.
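If "view, sort, and filter variants" sounds abstract, here is a rough sketch of the idea in Python. To be clear, this is not how VarSifter works internally and it is no substitute for it. It only assumes the eight fixed tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function names, the quality cutoff of 30, and the example records are all made up for illustration.

```python
def parse_vcf_line(line):
    """Split one VCF data line into a dict of its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": filt,
        "info": info,
    }


def filter_variants(lines, min_qual=30.0):
    """Keep PASS variants at or above min_qual, highest quality first."""
    # Skip blank lines and header lines, which start with "#".
    records = [parse_vcf_line(l) for l in lines
               if l.strip() and not l.startswith("#")]
    kept = [r for r in records
            if r["filter"] == "PASS"
            and r["qual"] is not None
            and r["qual"] >= min_qual]
    return sorted(kept, key=lambda r: r["qual"], reverse=True)


# Invented example records, not real variant calls.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=20",
    "chr1\t2000\t.\tC\tT\t10\tPASS\tDP=5",
    "chr2\t3000\t.\tG\tA,C\t99\tPASS\tDP=40",
    "chr2\t4000\t.\tT\tC\t80\tq10\tDP=12",
]

for v in filter_variants(example):
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["qual"])
```

A real exome project has far more to consider (genotypes per sample, annotations, inheritance models), which is where a dedicated tool earns its keep.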

So have a look at this video if you are interested in understanding how these analyses are done, and if you are interested in knowing more about the tools that can be used. It’s worth the 40 minutes–really.

Quick links:

YouTube page: http://www.youtube.com/watch?v=I7azpqTWFuM

VarSifter home page: http://research.nhgri.nih.gov/software/VarSifter/

Exome analysis Talks at NHGRI: http://www.genome.gov/27545880

References:

IGV: Robinson, J., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E., Getz, G., & Mesirov, J. (2011). Integrative genomics viewer Nature Biotechnology, 29 (1), 24-26 DOI: 10.1038/nbt.1754

UCSC new paper: Dreszer, T., Karolchik, D., Zweig, A., Hinrichs, A., Raney, B., Kuhn, R., Meyer, L., Wong, M., Sloan, C., Rosenbloom, K., Roe, G., Rhead, B., Pohl, A., Malladi, V., Li, C., Learned, K., Kirkup, V., Hsu, F., Harte, R., Guruvadoo, L., Goldman, M., Giardine, B., Fujita, P., Diekhans, M., Cline, M., Clawson, H., Barber, G., Haussler, D., & James Kent, W. (2011). The UCSC Genome Browser database: extensions and updates 2011 Nucleic Acids Research DOI: 10.1093/nar/gkr1055

SAMtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools Bioinformatics, 25 (16), 2078-2079 DOI: 10.1093/bioinformatics/btp352

What’s the Answer? Alternatives to Galaxy

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question is

What are the alternatives to Galaxy for wrapping a command line tool in GUI?

Or in other words, what workflow systems are out there in addition to Galaxy? (It’s a great tool, but sometimes people need something different :) The answers to this question will help both bioinformaticists who create tools and biologists who use them, giving the former alternatives for doing this if need be and the latter other workflow systems to try out.

Several were highlighted, including Taverna, Yabi and Knime, and a list was provided from Wikipedia. Check out the answers for more examples.