The data isn’t in the papers anymore. Again.

I know this is a topic I keep hammering on. But I’m not sure it’s really grokked by a lot of people who are not as deep into the bioinformatics side of biology today, or by those who support biologists, such as publishers and librarians, who may not be as immersed in the daily software aspects.

There was a nice post by Ed Yong last week about a paper published on sticklebacks. There are several cool things about this paper, but one of them is simply that we can now use next-generation sequencing technology to examine species in ways that we just couldn’t before. And Ed made the point that this paper didn’t contain just one genome: it covered 21 genome sequencing events.

And because these sticklebacks occupy such cool biological niches, it was possible to compare populations that varied in interesting ways. Some were freshwater and some were saltwater, and they could be examined in different regions of the planet to see whether the same adaptations happened in different places for the same reason.

It really is a sweet paper. But it also serves another point of mine, one I keep making over and over again. The data is not in the papers anymore. The paper is a nice sort of summary statement of the work. But you cannot put 21 genomes in print, and a big list of As, Ts, Gs, and Cs wouldn’t be that valuable on paper anyway. You cannot show all the analysis tracks in a paper. You can merely sample a subset of them. You can illustrate a few “compelling examples,” as we used to call them at one place I worked.
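To put a rough number on "you cannot put 21 genomes in print," here is a back-of-the-envelope sketch. The figures in it are my own assumptions for illustration, not values from the paper: a stickleback genome of roughly 460 million bases, and a hypothetical printed page that holds 3,000 characters. The function name `pages_needed` is likewise made up for this example.

```python
# Back-of-the-envelope estimate: how many printed pages would raw
# sequence letters fill? All constants are illustrative assumptions,
# not numbers taken from the stickleback paper.
GENOME_SIZE_BASES = 460_000_000  # assumed rough size of one stickleback genome
CHARS_PER_PAGE = 3_000           # hypothetical densely printed page
NUM_GENOMES = 21                 # the count mentioned in the post


def pages_needed(num_genomes, genome_size_bases, chars_per_page):
    """Return the number of pages needed to print every base as one letter."""
    total_bases = num_genomes * genome_size_bases
    # Ceiling division with integers, so a partly filled page still counts.
    return -(-total_bases // chars_per_page)


if __name__ == "__main__":
    pages = pages_needed(NUM_GENOMES, GENOME_SIZE_BASES, CHARS_PER_PAGE)
    print(f"Roughly {pages:,} printed pages of A, T, G, and C")
```

Under those assumptions the answer comes out in the millions of pages, and that is before a single annotation or analysis track is added. Change the constants however you like; the conclusion does not move much.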

But if you want to explore other features, or you want to build on this work yourself, you need to turn to the databases. The real magic happens there now, not in the papers. Back in the days of my training and early career, the papers were enough. They are not anymore. It’s not clear to me whether publishers in this field fully appreciate this fact.

And the authors offer a whole genome browser (based on the UCSC Genome Browser software platform) for their stickleback data. It’s quite lovely, actually; I’ll link to it below. It’s also an excellent demonstration of how to use existing open source software to craft a version for your needs.

Quick links:

Here’s Ed’s post on the key features of the work: Stickleback genome reveals detail of evolution’s repeated experiment

Look at the Sticklebrowser yourself. It’s actually rather lovely. And informative.

To learn to use UCSC Genome Browser based software, see the training materials sponsored by UCSC:


Jones, F., Grabherr, M., Chan, Y., Russell, P., Mauceli, E., Johnson, J., Swofford, R., Pirun, M., Zody, M., White, S., Birney, E., Searle, S., Schmutz, J., Grimwood, J., Dickson, M., Myers, R., Miller, C., Summers, B., Knecht, A., Brady, S., Zhang, H., Pollen, A., Howes, T., Amemiya, C., Baldwin, J., Bloom, T., Jaffe, D., Nicol, R., Wilkinson, J., Lander, E., Di Palma, F., Lindblad-Toh, K., & Kingsley, D. (2012). The genomic basis of adaptive evolution in threespine sticklebacks. Nature, 484 (7392), 55-61. DOI: 10.1038/nature10944