Tag Archives: Taverna

Video Tips of the Week: Annual Review IV (first half of 2011)

As you may know, we’ve been doing these video tips of the week for FOUR years now. We have completed around 200 little tidbit introductions to various resources. We’ve established a sort of end-of-year holiday tradition: a summary post to collect them all. If you have missed any of them, it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II. The summary of the second half of 2011 will be available next week here.

January 2011

January 5: SKIPPY predicting variants w/ splicing effects

January 12: Twitter in Bioinformatics. This one was much more popular than I expected!

January 19: PolyPhen, for predicting the possible effects of mutations in genes

January 26: iRefWeb + protein interaction curation

February 2011

February 2: RCSB PDB Data Distribution Summaries

February 9: SIFT (Sorting Intolerant From Tolerant), another tool for predicting the impact of mutations in genes.

February 16: Melina II for promoter analysis

February 23: SNPTips and viewing personal genome data. This tip is one of the most-watched ones we’ve had. Thousands of views on SciVee!

March 2011

March 2: DAnCER for disease-annotated epigenetics data

March 9: World Tour of Genomics Resources

March 16: Encyclopedia of Life

March 23: ORegAnno for regulatory annotation

March 30: MetaPhOrs, orthology and paralogy predictions

April 2011

April 6: The Taverna Project for workflows

April 13: VirusMINT, the branch of the Molecular Interaction database for viral interactions

April 20: LAMHDI for animal models

April 27: Dot Plots, Synteny at VISTA

May 2011

May 4: MycoCosm

May 11: InterMine for mining “big data”

May 18: Allen Institute’s Brain Explorer

May 25: SciVee, the YouTube of science

June 2011

June 1: New and Improved OMIM®

June 8: Converting Genome Coordinates

June 15: MutaDATABASE, a centralized and standardized DNA variation database

June 22: Update to NCBI’s Cn3D Viewer

June 29: Orphanet for Rare Disease information

Why don’t users employ workflows for “big data”? I know why.

Yesterday a tweet linking to a great post came across the ethers, and ever since I read it I knew I had to write this post. Here’s the original nugget:

RT @ctitusbrown: (my) thoughts on data intensive science & workflows: http://bit.ly/tWXSnx

It is a post about why end users are not adopting workflows that could really help them in this eScience world we find ourselves in, as we keep moving forward with giant data sets and “big data” projects, along with some other points about what we need in workflows. We’re big fans of workflows and have talked about them in the past (Tip of the week: The Taverna Project for workflows; What’s the Answer? Alternatives to Galaxy; Tip of the Week: BioExtract Server; lots of Galaxy posts).

But the first major point in the post asked: Why don’t people use workflows in bioinformatics?

I know why. The first key point is that they are not trained to use them. When we’ve done Galaxy training workshops, we see how quickly people get the point of Galaxy and how it can save them time. And they love the assemblage of tools that they’d otherwise have to seek out at numerous individual sites. So a major step would be 1) awareness that the workflow tool exists, and 2) some gentle introduction at a very basic level to get people started. A lot of people in bioinformatics are not daunted by interfaces with the complexity of Galaxy. But people who don’t spend all day on software and databases are not at the same point.

So, very basic intro training on workflow tools is a big step. But there’s actually another step before that. Biologists need to know how to mine the big data that they are told is out there. Some of the more computationally sophisticated biologists already have their own data, or know how to get it. But if we are going to succeed at increasing use of workflow tools, we also need to train people on how to mine the big data. They don’t necessarily have that step yet.

When we do a UCSC Genome Browser workshop, it’s in 2 segments. I do the intro section first with very basic intro to how to structure a query, how to look at the graphics, how to change the views, etc. Almost always I start with a question for the attendees: how many people here have spent more than 1 hour (total) hammering around on the UCSC Genome Browser? In the average room, this ranges from about 1/3 to 1/2 of the attendees. Generally, more than half have never touched it before.

But then Trey does the advanced topics section, largely about the Table Browser and Custom Tracks. He also starts with a question: how many of you have spent more than an hour using the Table Browser? Generally there are 1-3 individuals who have, if any. These are in rooms of 25-50 people (sometimes over 100). And if you haven’t used Galaxy before, you may not know that the primary way Galaxy gets you UCSC Genome Browser data is by handing you the Table Browser interface. (Or the BioMart or InterMine versions or whatever–in our experience they know even less about those.)

If you don’t know how to get the data (step 1), the workflow setup (step 2+) is not going to help you.
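To make step 1 a little more concrete: here is a tiny Python sketch (with made-up sample data) of what “getting the data” often boils down to in practice–parsing the BED-format intervals that a Table Browser query or a Galaxy “Get Data” step hands back to you.

```python
# Parse BED-format lines, the kind of output a UCSC Table Browser
# query (or a Galaxy "Get Data" step) typically returns.
# The sample lines below are hypothetical, standing in for a real download.

def parse_bed(lines):
    """Parse minimal BED lines into (chrom, start, end, name) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and track headers
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        records.append((chrom, start, end, name))
    return records

sample = [
    "track name=demo",
    "chr1\t1000\t5000\tfeatureA",
    "chr2\t2000\t2500\tfeatureB",
]

for chrom, start, end, name in parse_bed(sample):
    print(f"{name}: {chrom}:{start}-{end} ({end - start} bp)")
```

Nothing fancy–but if a biologist has never seen what the Table Browser actually exports, even this much is a step past the barrier.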

Bioinformatics folks: you’d be stunned to know what biologists don’t know about the tools. And here’s something else they tell us: often the trainings they’ve been offered (if they have had them) start out over their heads. Expert users–or representatives of the tool being trained on–are very often too close to the tool to realize that there are a lot of more basic things people need to know.  But the trainees don’t want to look stupid in front of their colleagues and ask the basic questions. Or they don’t want to be critical of the tool features to the folks who build them.

And this requires cross-training across the bioinformatics projects and data sets. However, sometimes the funding for outreach is limited to one’s own tools. But without some of the other key components–other sources, other projects–users are not going to be able to pull together what they really need.

As the “data bonanza” era proceeds, there’s only going to be more and more data stored that biologists could be using to make fabulous discoveries. It’s not in the papers anymore, as I keep saying (over and over and over). But the bench biologists aren’t getting enough training to take their expertise to mine these data sets.

The other points Titus makes are also great on the workflow issues. This part is particularly resonant with me:

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I have begged workflow providers to provide the versions of the components of the workflows. It stuns me every time I’m told that no–it’s up to you to know that. If I can’t even tell which version of the tools they have installed, how can I record that and then know whether they changed the underlying algorithm since the last time I ran the workflow? This is a major problem if you want to pitch these tools as a great way to offer reproducibility of research.
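Until the workflow tools record this for us, here’s a minimal do-it-yourself sketch in Python: write a version manifest alongside each run. The tool list is illustrative–to keep the sketch self-contained it records the Python interpreter’s own version, where a real pipeline would shell out to each tool’s version flag and parse the output.

```python
# A do-it-yourself provenance manifest: record which tool versions were
# used for a run, since many workflow systems won't do this for you.
import json
import platform
from datetime import datetime, timezone

def version_manifest(tools):
    """Build a manifest of {tool: version} plus a timestamp for this run."""
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "tools": tools,
    }

# In a real pipeline you'd capture e.g. the output of `blastn -version`;
# here the interpreter's version is a self-contained stand-in.
manifest = version_manifest({"python": platform.python_version()})
print(json.dumps(manifest, indent=2))
```

Saving that JSON next to the analysis output at least tells future-you what was installed the day the workflow ran.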

The basic point though–everyone ought to be using workflow tools–is 100% solid. But users need more help to get to that point. 

Quick links:

to C. Titus Brown’s original post: Data Intensive Science and Workflows

to Galaxy: http://usegalaxy.org

to UCSC Genome Browser: http://genome.ucsc.edu/

to Taverna: http://www.taverna.org.uk/

to a great list of workflow tools (via Casey Bergman at Titus’ comments section): http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

Reference:
Goecks, J., Nekrutenko, A., Taylor, J., & The Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8). DOI: 10.1186/gb-2010-11-8-r86

What’s the Answer? Alternatives to Galaxy

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question is

What are the alternatives to Galaxy for wrapping a command line tool in a GUI?

Or in other words, what workflow systems are out there in addition to Galaxy (a great tool, but sometimes people need something different :).  The answers to this question will help both bioinformaticists who create tools and biologists who use them, giving the former alternatives for doing this if need be and the latter other workflow systems to try out.

Several were highlighted, including Taverna, Yabi and Knime, and a list from Wikipedia was provided. Check out the answers for more examples.

Tip of the week: The Taverna Project for workflows

We’re on the road this week doing workshops, so I needed to have this tip prepared well ahead of time. To make it easy on myself, I’m going to simply point you to a recent informative webinar on Taverna, that was hosted by Bitesize Bio (and check out their other upcoming webinars).

The image I used as the screen shot made me laugh. It was a graphical illustration of what you might need to do to analyze a piece of sequence data that you might obtain. You have to leap around to all kinds of sites and tools and they all stand nearly independent of each other, and if you wanted to do it with another sequence later you’d have to face the same series of events. Workflow tools are now being developed to streamline, automate, and simplify this process.

Taverna is an application that supports bioinformatics workflows (and other types of workflows as well, actually). It integrates with the myExperiment social networking site and with the BioCatalogue collection of web services. You can create complex and effective workflows, share them with others, and store them for re-use.

If you are considering using workflows to at least partially automate some of the processes you need to accomplish, you should know about Taverna. And this webinar is a nice introduction to the basics and philosophy around it.

It was just over 30 minutes if I remember correctly. And you can hear my question at the end–I asked whether the version numbers of software tools or data sets are stored with the analysis. The short answer: no, it’s up to you to do that. [This is something that concerns me a lot about workflow tools and I try to press for this all the time.]

Currently I use Galaxy for the workflows I need. But recently it was announced that there’s a way to use eGalaxy with Taverna to generate workflows that can run in Galaxy–I haven’t explored this at all yet.

Quick links:

Taverna Webinar at BiteSizeBio: http://bitesizebio.com/webinars/the-taverna-project/

Taverna application: http://www.taverna.org.uk

Tip of the Week: BioCatalogue for finding web services

A couple of years back at a conference I was introduced to BioCatalogue. It seemed to me to be a really useful idea: locate bioinformatics tools and databases that are web-accessible and that also offer web services, so you can access the tool or server in ways that don’t require the site’s main web interface. There are some introductions to the concept of web services out there–some are truly introductory, but most are aimed at programmers. Essentially a web service is a kind of back door into the tool, and it lets you pull out the information you need in the ways that you want–not constrained by the main user interface.

BioCatalogue is a curated collection of these web services. The creators of BioCatalogue provide the framework and perform some of the collection and annotation–but they also enable the user community to bring in web services and annotate them as well. This means that you can use BioCatalogue to find and learn more about the services, and you can feed back into the system as well if you join the community. If you are a software provider you can register your service there–so more people can locate you and learn about your project. Another really nice aspect of BioCatalogue is that they monitor the services. As we know at OpenHelix, plenty of times a tool you have accessed in the past is suddenly unavailable. Sometimes these are intermittent server problems, but sometimes they are longer-term issues. BioCatalogue regularly checks the status of the tools, so you can have confidence that a tool has been up and seems stable.

The Web Server issue (see the 2009 issue here) of Nucleic Acids Research provides a wealth of information about useful servers with bioinformatics tools. And there’s a paper in the 2010 Web Server issue about BioCatalogue that will offer more details on the background (linked below). In this week’s movie I can only briefly introduce the site and the features available. Check out the paper from the BioCatalogue team, and explore the documentation wiki to learn more about the features and functions that are provided.

Now, these web services are not for everyone. For many people the main user interface will still be the best mechanism to access a tool. But if you need more advanced or customized queries, or if you want to create inflows into your own tools, or if you want to use some of the cool workflow software that’s out there now (such as Galaxy or Taverna)–web services may be right for you.
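To make the “back door” idea concrete, here’s a small Python sketch. The endpoint URL and response fields are hypothetical, and the network call is replaced by a canned reply, but the pattern–build a query URL, decode a structured (JSON) reply–is the same one you’d use with real services like those catalogued in BioCatalogue.

```python
# A web service is a programmatic "back door": you request structured
# data (often JSON or XML) instead of clicking through the human
# interface. Endpoint and fields below are made up for illustration.
import json
from urllib.parse import urlencode

base_url = "http://example.org/api/services"  # hypothetical endpoint
query = urlencode({"q": "blast", "format": "json"})
request_url = f"{base_url}?{query}"

# A canned reply standing in for urllib.request.urlopen(request_url):
canned_reply = '{"results": [{"name": "BLAST service", "status": "up"}]}'
data = json.loads(canned_reply)

for service in data["results"]:
    print(f'{service["name"]}: {service["status"]}')
```

The point is that the reply comes back as data your own script can filter, merge, or feed into the next step–no screen-scraping of the main interface required.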

Check out BioCatalogue  (and remember the -ue spelling!) http://www.biocatalogue.org/

Bhagat, J., Tanoh, F., Nzuobontane, E., Laurent, T., Orlowski, J., Roos, M., Wolstencroft, K., Aleksejevs, S., Stevens, R., Pettifer, S., Lopez, R., & Goble, C. (2010). BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Research. DOI: 10.1093/nar/gkq394

Tip of the Week: Acetylome, STRING and a new database

I recently read an article in Science entitled “Lysine Acetylation Targets Protein Complexes and Co-Regulates Major Cellular Functions,” written by Choudhary et al. The research uses “high-resolution mass spectrometry to identify 3600 lysine acetylation sites on 1750 proteins” and “demonstrate[s] that the regulatory scope of lysine acetylation is broad and comparable with that of other major posttranslational modifications.”

I’m going to admit, I know little of acetylation as a regulatory mechanism, though after reading through the paper, I found it quite an interesting finding, and it suggests to me that genomics has a lot to offer in advancing our understanding of regulation and evolution.

Three things jumped out at me though.

The first is minor. The authors use the term acetylome. You can now add that to the huge list of -omics terms to keep straight :D.

The second is that they use STRING to complete an analysis of networked interactions of the proteins discovered in their study and the processes where they are found, as you can see in their figure.

I did my postdoc and some later research in the lab (Peer Bork, EMBL) that developed STRING, and I’ve created a tutorial on it, so any time it’s used, I’m interested :D. So, I went to Methods and Materials to see how the analysis was done. Though there was a decent explanation of the process, it was not enough for me to recreate the analysis. This is not a criticism of the paper or the authors, but of how papers are being published. More and more, papers include genomics analysis, but rarely are these reported in the research paper in the detail needed to easily reproduce the analysis. Projects like Galaxy (publicly available tutorial) and Taverna are filling that void, so I’d like to see more Methods and Materials sections include analysis histories and workflows. It definitely would help in the advancement of science.

And now to the tip of the week. The paper also refers to a new database (at least new to me; it’s at least two years old and was reported in “Phosida: management, structural and evolutionary investigation and prediction of phosphosites”) called Phosida. The database “allows retrieval of phosphorylation and acetylation data of any protein of interest.” The tip of the week today is a quick introduction to that database.

Choudhary, C., Kumar, C., Gnad, F., Nielsen, M., Rehman, M., Walther, T., Olsen, J., & Mann, M. (2009). Lysine Acetylation Targets Protein Complexes and Co-Regulates Major Cellular Functions. Science, 325(5942), 834-840. DOI: 10.1126/science.1175371

Are you ready to create a workflow?

Yesterday I attended the final session of the ICSB conference that I could fit into my schedule: a session on web services in systems biology. (I would link to the description but the ICSB server is down while I write this…) There were several tools covered that I will address later (including one of our old favorites: Reactome. And Esther Schmidt showed me a trick to accomplish some teeny little thing that was making me crazy….Yea Esther!). But I wanted to get you thinking about using tools in workflow pipelines. This is not just for giant sequencing projects anymore!

Although there are a variety of tools that can let the average user in on this handy strategy, yesterday we heard specifically about the Taverna project. Taverna will let you pull re-usable modules of analysis tools into a series of actions that you can perform on your lists, or favorite sequences, or genomic regions, or whole genomes…and annotate, analyze, and process. Don’t be daunted by the look of that project page. We can help you to understand what to do and how to do it. But start to think about the series of things you might be doing from website-to-website as you do your research on genes of interest. Can you imagine a way to streamline that and set up a re-usable protocol to do that? I’ll bet you can….
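As a toy illustration of what such a re-usable protocol looks like, here’s a Python sketch: each “service” is just a function, and the workflow is the ordered list of steps you apply to your input. The step names and the sequence data are made up for the example–Taverna does this with real remote services, but the shape of the idea is the same.

```python
# A workflow is, at heart, a pipeline: each step consumes the previous
# step's output. Step names and the sample sequence are illustrative.

def fetch_sequence(gene):
    # Stand-in for retrieving a sequence from a remote database.
    return {"gene": gene, "seq": "ATGGCCATTGTAATGGGCCGC"}

def gc_content(record):
    # Annotate the record with its GC fraction.
    seq = record["seq"]
    record["gc"] = (seq.count("G") + seq.count("C")) / len(seq)
    return record

def annotate(record):
    # A trivial downstream "analysis" step.
    record["note"] = "high GC" if record["gc"] > 0.5 else "low GC"
    return record

def run_workflow(gene, steps):
    """Apply each step in order; the whole protocol is re-usable."""
    result = fetch_sequence(gene)
    for step in steps:
        result = step(result)
    return result

print(run_workflow("MYC", [gc_content, annotate]))
```

Swap in a different gene, or a different list of steps, and you re-run the whole protocol without re-visiting a single website by hand–that’s the promise of workflow tools.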

More later on these types of services. But I’m off to Copenhagen today and won’t be online much until next week. Enjoy your weekend! Scandinavians seem to really understand the purpose of the weekend…