End users and the cloud in bioinformatics

Today I began an exchange on some issues of “In silico research in the era of cloud computing” based on this tweet:

@mndoci: http://bit.ly/bx2xnB …. only thing missing is a service component (cc @mza) #bioinformatics

My first answer was this, but there was a bit more back/forth subsequently:

@OpenHelix: Also missing: end user support | RT @mndoci: http://bit.ly/bx2xnB …. only thing missing is a service component (cc @mza) #bioinformatics

I’m going to explain this a little bit more here because 140 characters wasn’t cutting it for discussion purposes. I was still percolating on several threads of recent experience, and this is less fully-formed than I would have liked to write, but maybe my raw thoughts are of some utility anyway.

So here’s the deal: at the recent Beyond the Genome conference I sat in on the special Cloud Computing sessions. This was terrific. It’s entirely clear that this is inevitable. I’m on board, no question. The data and software you need are going to be on the cloud. Period. End of discussion. But it was also clear that the culture and philosophy of big data projects, funding, and so on will have to change for this. End user access and support is a big question for me at this point.

Lincoln Stein gave a great talk on cloud computing, which you can largely get from this paper in Genome Biology. I love figure 2 in this paper that shows how fast we are generating sequence now, at lower costs, and that we are crossing the point where we can actually produce sequence more cheaply than we can store it. This is crazy good to me. I love that we are about to get so much more sequence from so many individuals, organisms, disease states–and we can go beyond the navel-gazing human stuff (although we will see the navel organisms I know–heh). Plants, microbes, infectious and harmful species that we need to know about–awesome. But it’s clear we’ll have to access these data differently. I’m down with this.

But: for the average end users to get to this data, things will change. Where you go to access the data and software, and how you can interact with it, will be different if you are using the cloud.

What I realized at the conference is that as an end user of the cloud, I’ll be at the mercy of several features:

1) As a casual user, my instance may vanish when I’m done with whatever I’m doing. There may be ways to work around this, but is this a design feature that providers understand right now? I’m not sure.

2) Will my run store all the things I need: the data version, the software version, etc.? If not, how can I reproduce it exactly two months later when I need to do this again? Or a year later when I’m ready to publish the paper? Am I tied to one vendor for this? Or to the vendor of the data provider?

3) Can I access previous versions of the data or software in order to reproduce my results exactly? Or to run a new data set through that same workflow I’ve worked out?

4) Ok–someone publishes their data set + workflow and enables me to access it. It contains 8 different analysis tools in a series. Do I know all these tools? Is there good documentation for those sub-pieces? What if I need documentation and support for them–is it available to me from here?
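One low-tech way to address those versioning questions is for every run to emit a small provenance manifest alongside its results. This is just a sketch under my own assumptions–the function name, the file layout, and the example tool and data set names are all hypothetical, not any vendor’s actual feature:

```python
import json
from datetime import datetime, timezone

def record_run_manifest(tools, data_sets, path="run_manifest.json"):
    """Write a small provenance manifest so a run can be reproduced
    months later: which tools, which versions, which data releases,
    and when the run happened. All names here are illustrative."""
    manifest = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "tools": tools,          # e.g. {"bwa": "0.5.8", "samtools": "0.1.8"}
        "data_sets": data_sets,  # e.g. {"1000genomes": "2010-08 release"}
    }
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest
```

Plain JSON is deliberate here: any colleague, vendor, or reviewer can read it two years later without special tooling, which is exactly the vendor lock-in worry in point 2.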

@OpenHelix: @mndoci Yes, I am “talking about sharing code and data with peers in general”. Only a small clique=ready, little talk of broader end users

At one of the sessions I attended at the conference, a neat project was explained to me. It doesn’t matter which one–it’s just an example of the current state. But it was clear that these very smart, eager young folks had developed some very cool workflow software. It was also very clear that they hadn’t thought about what an occasional end user needs: software version control, access to older data sets, or information on what has changed in the interim.

The idea of “whole system snapshot exchange” (WSSE) from the In silico paper is great and will address many of these issues–if the developers understand this and why it will matter to end users. That paper addresses a lot of my concerns about the cloud infrastructure. I hope that data providers and software developers will be designing their tools with this in mind.

But will these aspects be clear to the end users? Will my results output clearly indicate which tools and versions were used? This should not be a buried back-end feature that only people who dig into some deep readme file will know about. If there are different versions of the data and software available, how will end users make the choice–will the interface allow me to do that? Will I have access to documentation for the different versions that tells me what has changed or why it might matter? Is documentation part of the WSSE model? Is user interface design considering these aspects? That’s not clear to me.

At ASHG 2010 last week I attended the tutorial session on the 1000 Genomes Project. In one of the first talks it was made clear to us that without a giant cluster, this data as a whole is really not usable for most folks (pieces of it are accessible in various ways, though). A perfect example of the need for the cloud. The data set is enormous, and changes were being rolled out even that day. But again–will I have access to the several versions of the data, older ones and newer ones? If I have issues with various pieces of this, where do I turn for guidance? Will I get notified if things have changed underneath since I used it last? How? Will that be flagged for me as I call the data set or software? Will I be able to lift over to new versions easily? Or can I lift backwards to run the new data on an old pipeline?

I also attended the Galaxy workshop. Galaxy on the Cloud is wicked cool. But at the end I had to ask about sharing my workflow with my colleagues (as we do now with terrestrial Galaxy). I can apparently save a file of my workflow, which my colleague can later upload to run. But this is a new step, and it requires me and my colleague to do more management of this piece. We can, but we need to know about it and prepare for it. It’s a different step for end users–and it needs to carry all the crucial pieces, versions and such.
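For that workflow-file sharing step, the same versioning idea applies: the shared file could travel with a checksum and the tool versions it assumes, so my colleague can verify we’re running the same thing. A hypothetical sketch–this is not a Galaxy feature; `package_workflow` and the file names are made up purely for illustration:

```python
import hashlib
import json

def package_workflow(workflow_path, tool_versions, out_path="workflow_share.json"):
    """Bundle a workflow file's checksum with the tool versions it
    assumes, so the recipient can confirm they have the same workflow
    and the same software environment. Names here are illustrative."""
    with open(workflow_path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    meta = {"workflow_sha256": digest, "tool_versions": tool_versions}
    with open(out_path, "w") as fh:
        json.dump(meta, fh, indent=2)
    return meta
```

The point is simply that the extra management burden the sharing step creates could be automated away, if designers think of the end user up front.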

[It was also clear that the funding mechanisms need to account for research with “infrastructure as a service”, or IaaS as it was called. It was very nice to see Vivien Bonazzi at the conference exploring how NHGRI will need to help users with this. A person at the meeting commented that it has always been easier to get funding for an item (like a computer) than for services. It was very cool to see that forethought on funding, really. But that’s an issue quite separate from the data and software versions.]

But if you move away from your local cluster and analysis model, you also may not have a local person who manages these things, can answer your software questions, and can discuss the analysis you are considering. And as far as I can tell right now, there’s no sort of pan-project support across different tools and projects. Who and where are these folks going to be?

I guess what I’m trying to say is that if you want to discuss “In silico research in the era of cloud computing”, the strategies for access and analysis being designed should not just involve the data providers and the coders. They are crucial, of course. But the conversation should be broader. I find that in a lot of these discussions people talk about the end users in a rather vague way {“be more explicit here in step two“, to paraphrase a cartoon}. But end users are not part of the conversations, in my experience. There is talk about them, but nobody speaks for them and their specific needs.

@mndoci: @OpenHelix We might be talking about a different kind of end user here.  I am talking someone who can take code, data, APIs and run with it.

Yeah, maybe we are. But I don’t think we should be–and I’m not sure they are always different groups–I would guess there’s an overlap in the Venn diagram of skills and needs. And the table of features (Table 1 in the In silico paper) speaks specifically to “nontechnical users” and the benefits of the cloud for them. And if people who use this data are going to have to make their results available back to everyone, they are going to need some mechanism to understand the moving parts. And to enable the downstream users, the consumers of the data and workflows, the design of the software should include these things now, not as an afterthought. WSSE is great on the pieces it covers. But I would like to see it go further, I guess, to encompass the features and aspects that end users need to use the tools effectively. If the aim is to get some standards and specs, let’s include some end user components/needs/support now. Or at least let’s talk about it. It’s not clear to me how an end user can speak to this at this point, or influence it.

Some of this isn’t specific to the cloud–as the article points out. Some of my perspective on this is based on my current experience with software and “big data” projects–and lots of face time with end users. Some of what the current projects aren’t providing to nontechnical users is apparent to me. But let’s learn from those gaps and incorporate the lessons into the new direction.

++++++++++++References++++++++++++

Dudley, J., & Butte, A. (2010). In silico research in the era of cloud computing. Nature Biotechnology, 28(11), 1181–1185. DOI: 10.1038/nbt1110-1181

Stein, L. (2010). The case for cloud computing in genome informatics. Genome Biology, 11(5), 207. DOI: 10.1186/gb-2010-11-5-207

4 thoughts on “End users and the cloud in bioinformatics”


  2. Deepak

    I was afraid you had misunderstood my tweets, and that is true. The tweets were not about the cloud per se, but rather about the ability to share data and code. Good code, documentation, and relevant data packaged up together in ways that results can be reproduced, applications shared, etc., is the essence of what I was talking about. You can take it one step further and start talking about APIs that expose data structure and allow other developers to leverage the data (the cloud just makes that much easier).

    In other words, I still think we are speaking about different audiences. My end user knows and understands code and statistics and is interested in sharing that code and data with others who do. There is another set of challenges on making this usable by non-computational people, but that’s better done by abstracting away that complexity.

    This blog post probably sums up these ideas best: http://mndoci.com/2009/10/28/matts-manifesto-for-a-science-data-platform/ (speaking specifically to a service interface)

  3. Mary Post author

    Oh, I thought you were referring to the article you cited, which was entirely about the cloud and its users as a whole–especially since it covers the range of users who might be interested in “In silico research in the era of cloud computing”.

    I think it’s a real risk that this will leave behind a second tier of folks who know the biology, biochemistry, etc, really really well but don’t write code. And that would be incredibly unfortunate when we have this opportunity to democratize data access.

    Let’s just hope there’s more than 140 characters of documentation so things are clearer.

  4. Deepak

    I keep forgetting that not everyone knows I was a data sharing person before I was a cloud person.

    I agree with your point in general. However, there are other ways of making the data available to non-coders. I like the Google example. It’s not perfect but it works. Google has a ton of complex data analysis, algorithms, information retrieval etc behind it, on the data known as the entire web. What Google provides is an interface that makes information available to people. In life science informatics, what we need is something similar. We should be building and designing interfaces that make information available to that “2nd tier”, allow them to interact with the information and perhaps even the underlying data and understand the results. Some will develop the computational skills to do additional computational interrogation or leverage services that allow them to do so.
