Why don’t users employ workflows for “big data”? I know why.

Yesterday a tweet linking to a great post came across the ether, and as soon as I read it I knew I had to write this post. Here’s the original nugget:

RT @ctitusbrown: (my) thoughts on data intensive science & workflows: http://bit.ly/tWXSnx

It’s a post about why end users are not adopting workflows that could really help them in this eScience world we find ourselves in, as we keep moving forward with giant data sets and “big data” projects, along with some other points about what we need from workflows. We’re big fans of workflows and have talked about them in the past (Tip of the Week: The Taverna Project for workflows; What’s the Answer? Alternatives to Galaxy; Tip of the Week: BioExtract Server; lots of Galaxy posts).

But the first major point in the post asked: Why don’t people use workflows in bioinformatics?

I know why. The first key point is that they are not trained to use them. When we’ve done Galaxy training workshops, we see how quickly people get the point of Galaxy and how it can save them time. And they love the assemblage of tools that they’d otherwise have to seek out at numerous individual sites. So a major step would be 1) awareness that the workflow tool exists, and 2) some gentle introduction at a very basic level to get them started. A lot of people in bioinformatics are not daunted by interfaces with the complexity of Galaxy. But people who don’t spend all day on software and databases are not at the same point.

So, very basic intro training on workflow tools is a big step. But there’s actually another step before that: biologists need to know how to mine the big data they are told is out there. Some of the more computationally sophisticated biologists already have their own data, or know how to get it. But if we are going to succeed at increasing the use of workflow tools, we also need to train people to mine the big data. They don’t necessarily have that skill yet.

When we do a UCSC Genome Browser workshop, it’s in 2 segments. I do the intro section first with very basic intro to how to structure a query, how to look at the graphics, how to change the views, etc. Almost always I start with a question for the attendees: how many people here have spent more than 1 hour (total) hammering around on the UCSC Genome Browser? In the average room, this ranges from about 1/3 to 1/2 of the attendees. Generally, more than half have never touched it before.

But then Trey does the advanced topics section, largely about the Table Browser and Custom Tracks. He also starts with a question: how many of you have spent more than an hour using the Table Browser? Generally there are 1-3 individuals who have, if any. These are rooms of 25-50 people (sometimes over 100). And if you haven’t used Galaxy before, you may not know that the primary way to pull UCSC Genome Browser data into Galaxy is through the Table Browser interface that Galaxy hands you. (Or the BioMart or InterMine versions, or whatever; in our experience they know even less about those.)

If you don’t know how to get the data (step 1), the workflow setup (step 2+) is not going to help you.
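
For the programmatically inclined, here’s a glimpse of what step 1 looks like under the hood. This is a minimal sketch, assuming UCSC’s public MySQL server (host, user, and table names as documented by UCSC) and the third-party pymysql package; any MySQL client would do:

```python
# Minimal sketch: pull gene coordinates straight from the UCSC public
# MySQL server. Host/user/table per UCSC documentation; "pymysql" is an
# assumed dependency (pip install pymysql).
import pymysql

conn = pymysql.connect(
    host="genome-mysql.soe.ucsc.edu",  # UCSC's public, read-only server
    user="genome",                     # no password required
    database="hg19",                   # the genome assembly to query
)
try:
    with conn.cursor() as cur:
        # refGene holds RefSeq gene models; fetch transcript coordinates
        cur.execute(
            "SELECT name2, chrom, txStart, txEnd FROM refGene WHERE name2 = %s",
            ("TP53",),
        )
        for gene, chrom, start, end in cur.fetchall():
            print(f"{gene}\t{chrom}:{start}-{end}")
finally:
    conn.close()
```

The point isn’t that every bench biologist should write this; it’s that the Table Browser (and Galaxy’s “Get Data” step) are friendlier front ends to essentially this kind of query, and training needs to connect those dots.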

Bioinformatics folks: you’d be stunned to know what biologists don’t know about the tools. And here’s something else they tell us: often the trainings they’ve been offered (if they have had any) start out over their heads. Expert users, or representatives of the tool being trained on, are very often too close to the tool to realize that there are a lot of more basic things people need to know. But the trainees don’t want to look stupid in front of their colleagues by asking the basic questions. Or they don’t want to criticize the tool’s features to the folks who build them.

And this requires cross-training across bioinformatics projects and data sets. However, funding for outreach is sometimes limited to one’s own tools. But without some of the other key components (other sources, other projects), users are not going to be able to pull together what they really need.

As the “data bonanza” era proceeds, there’s only going to be more and more stored data that biologists could be using to make fabulous discoveries. It’s not in the papers anymore, as I keep saying (over and over and over). But bench biologists aren’t getting enough training to bring their expertise to bear on mining these data sets.

The other points Titus makes about the workflow issues are also great. This part particularly resonates with me:

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I have begged workflow providers to supply the versions of the components of their workflows. It stuns me every time I’m told that no, it’s up to you to know that. If I can’t even tell which versions of the tools they have installed, how can I record that and then know whether they changed the underlying algorithm since the last time I ran the workflow? This is a major problem if you want to pitch these tools as a great way to offer reproducible research.
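
To make the ask concrete, here’s a rough sketch of the kind of provenance record I mean, assuming only that each command-line component answers to a --version flag (the tool names below are hypothetical examples, not any provider’s actual setup):

```python
# Rough sketch: snapshot component versions alongside a workflow run, so you
# can tell later whether the underlying tools changed between runs.
# Assumes each tool answers to "--version"; the tool names are examples only.
import json
import subprocess
from datetime import datetime, timezone

TOOLS = ["samtools", "bwa", "bedtools"]  # hypothetical workflow components

def tool_version(tool):
    """Capture whatever the tool reports for --version (stdout or stderr)."""
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown (tool not found or did not respond)"
    out = (result.stdout or result.stderr).strip()
    return out.splitlines()[0] if out else "unknown (no version reported)"

record = {
    "run_timestamp": datetime.now(timezone.utc).isoformat(),
    "tool_versions": {tool: tool_version(tool) for tool in TOOLS},
}

# Keep the provenance record next to the workflow outputs.
with open("workflow_provenance.json", "w") as fh:
    json.dump(record, fh, indent=2)
```

Any workflow engine could emit something like this automatically with every run; that users have to beg for it is the problem.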

The basic point, though, that everyone ought to be using workflow tools, is 100% solid. But users need more help to get to that point.

Quick links:

to C. Titus Brown’s original post: Data Intensive Science and Workflows

to Galaxy: http://usegalaxy.org

to UCSC Genome Browser: http://genome.ucsc.edu/

to Taverna: http://www.taverna.org.uk/

to a great list of workflow tools (via Casey Bergman in the comments on Titus’s post): http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

Reference:
Goecks, J., Nekrutenko, A., Taylor, J., & The Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8), R86. DOI: 10.1186/gb-2010-11-8-r86

One thought on “Why don’t users employ workflows for “big data”? I know why.”

  1. Michael Reich

    Thanks for a post on a topic that requires much attention – the training of researchers and bioinformaticians to rely on workflow tools rather than brittle scripting solutions.

    “I have begged workflow providers to provide the versions of the components of the workflows.”

    The GenePattern environment, http://www.genepattern.org, has a reproducibility model that supports workflows with versioned components. Any change to a component or workflow produces a new version, and all versions are retained, giving users access to any previous version of a component or workflow.
