There is a great discussion on Big Data today that I found on the twittosphere.  Hat tip to Paul Blaser for the tweet that got my attention.  I have posted a comment over there, but I decided as I was writing it that I wanted to bring it over here as well.  (I also added some links here that I couldn’t add over there, since without a preview I hate not being able to test them.)

Deepak has a post up on the blog business|bytes|genes|molecules called The Biological Data Scientist.  It speaks to big data projects, and the need to have specialists in biological data to handle it.

I suspect that we do actually agree on much of the concept.  But as with a lot of things, I tend to think further downstream, about how the idea gets implemented on the ground.  My thoughts on that are below, as I posted them in a comment over there.


Hmmm…I certainly agree with large chunks of this. But I don’t agree that this should be the domain of some kind of data scientist.  Or, more specifically, it does need to have their hands on it to some degree.  But I think it still needs to be accessible to the handful-of-genes bench biologists.

The idea of the multi-functional team is terrific, when it is possible.  But we see a lot of people who are not getting that kind of support from their local “bioinformatics” club–for a couple of reasons: if you have some big-data folks on site, they have their own project to worry about. They are not eager to hand-hold others on the way in to the data.  It’s not their job. It’s not what they are supported to do, and it doesn’t help them with their next grant.

If you have some kind of dedicated bioinformatics core support, the quality of the support varies widely: the kinds of things they do, the skills they have, the interest in actual support.

We have seen some great examples.  For instance, it seems to me the team at CHOP in Philly provides this kind of support: in-house tools to support the researchers, bringing in the right tools to add more support, training everyone up to some level so they are at least aware of what the tools can do. (Samples of CHOP tools, team, and training.)

On the other hand, we’ve been to some major institutions, many with “big data” projects, where people are getting next to zero interaction with anyone who could help them.  You’d be stunned if I told you who these people are.

Then there are those who don’t even have a shot at this.  People trying to keep up, and write new grants with hot new data, who are at some midwestern campus that really just doesn’t even have someone to ask.  I once talked to a woman who needed a really simple thing out of the UCSC Genome Browser.  It took me roughly 5 minutes to build the right query, pull the data out of the table browser, and hand it to her. I thought she was going to kiss me.  She told me she had expected that to take her 6 months of benchwork.

I would hate to see this strategy create a tier of biologists who are nearly locked out of the data.  Because it is also still eminently clear that we can throw a lot of big data at a project, but the crucial details require the “small people” to look closely at them.  And many of them feel excluded from the club already.

    I think we do mostly agree. In the web world, the really smart data scientists are often small companies or 2-3 smart folks hammering away at public data sources. So with Open Data you are not locked out of big data, even as a small researcher. I’ll argue that even the big centers aren’t doing it the right way in terms of the analysis approaches and structure.

    Yeah, I think we are agreeing on most of it. But I still have a little bit of worry that it tiers the other biologists to a lower level. They’re already mad about how much funding goes to big data projects….

    And my other fear is that some of the big data personalities are not the best communicators to the biologists. I know we all want to talk to each other, but the math- and physics-background type of folks and the biologists really don’t meet up all the time. It strikes me that big data biologists could be like that.

    I think the math-, physics-, and programming people are more open to discussing their problems in public, while the biologists are more secretive and concerned about copyright, patents, and funding.

