Category Archives: New Resource

The data isn’t in the papers anymore. Again.

I know this is a topic I keep hammering on. But I’m not sure that it’s really grokked by a lot of people who are not as deep into the bioinformatics aspects of biology today. Or those who support biologists, such as publishers and librarians, who may not be as immersed in the daily software aspects.

There was a nice post by Ed Yong last week about a paper published on sticklebacks. There are several cool things about this paper–but one of them is merely the fact that we can use the next-generation sequencing technology we have to examine species in ways that we just couldn’t before. And Ed made the point that there wasn’t only one genome in this paper–there were 21 genome sequencing events in this paper.

And because of the cool biological niches of these sticklebacks–it was possible to compare populations that varied in interesting ways. Some were fresh-water, some were salt-water based, and this could be examined in different regions of the planet to compare whether the same adaptations happened in different places for the same reason.

It really is a sweet paper. But it also serves another point of mine, that I keep making over and over again. The data is not in the papers anymore. The paper is a nice sort of summary statement of the work. But you cannot put 21 genomes in press–and a big list of A, T, G, and Cs wouldn’t be that valuable on paper anyway. You cannot show the analysis tracks in the papers. You can merely sample a subset of them. You can illustrate a few “compelling examples” as we used to call them at one place I worked.

But if you want to explore other features, or you want to build on this work yourself, you need to turn to the databases. The real magic happens there now–not in the papers. Back in the days of my training and early career, the papers were enough. They are not anymore. It’s not clear to me if publishers appreciate this fact entirely in this field.

And the authors offer a whole genome browser (based on the UCSC Genome Browser software platform) for their stickleback data. It’s quite lovely, actually–I’ll link to it below. It’s also an excellent demonstration of how to use existing open source software to craft a version for your needs.

Quick links:

Here’s Ed’s post on the key features of the work: Stickleback genome reveals detail of evolution’s repeated experiment

Look at the Sticklebrowser yourself. It’s actually rather lovely. And informative. http://sticklebrowser.stanford.edu

To learn to use UCSC Genome Browser based software, see the training materials sponsored by UCSC: http://openhelix.com/ucsc

Reference:

Jones, F., Grabherr, M., Chan, Y., Russell, P., Mauceli, E., Johnson, J., Swofford, R., Pirun, M., Zody, M., White, S., Birney, E., Searle, S., Schmutz, J., Grimwood, J., Dickson, M., Myers, R., Miller, C., Summers, B., Knecht, A., Brady, S., Zhang, H., Pollen, A., Howes, T., Amemiya, C., Baldwin, J., Bloom, T., Jaffe, D., Nicol, R., Wilkinson, J., Lander, E., Di Palma, F., Lindblad-Toh, K., & Kingsley, D. (2012). The genomic basis of adaptive evolution in threespine sticklebacks Nature, 484 (7392), 55-61 DOI: 10.1038/nature10944

NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing up additional, more extensive blog posts on it and doing some tips of the week over the next couple of months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full-text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds; and well over 100 more.

And I’ll give a special shout out to my former PI at EMBL, because I can. Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (he and a couple of group members are on the InterPro paper also).

This is going to be a wealth of information to wade through!

Publishers + bioinformatics tools. Good idea.

So I was watching my twitter feed today and saw this tidbit come along:

@FN_Press: Elsevier Introduces Genome Viewer http://bit.ly/nS92lE

….The Genome Viewer utilizes a genome browser developed by NCBI (the National Center for Biotechnology Information at the National Institutes of Health). Elsevier collaborated with the NCBI as it was developing the browser, and is the first publisher to incorporate the technology into an application for viewing detailed information about the gene sequences that are mentioned in articles.

When an author of an article tags a gene sequence, Elsevier matches this gene with information in NCBI’s databases and pulls this information into the article. This allows readers of the article to get specific information about each strand by hovering over it, and also offers functionality such as flipping the strands, zooming to a sequence, or going to a specific position to define a track of interest within the sequence….

There’s nothing I love more than exploring a new genome viewer, or a clever new use of an existing one! The press release offered links to a couple of papers that were supposed to show this new feature. Um…I couldn’t find it.

Step 1: Locate genome viewer.

Step 2: Explore genome viewer.

Step 3: Write up blog post about new genome viewer.

So here we are, and I’m stuck at Step 1 still. But I’ve sort of hurdled over that to Step 3. I’m sure the folks at Elsevier are going to get me to this browser–I’ve already been in touch with them and they are going to help me out.

But in the meantime I’d like to say how cool an idea this is. I have always thought there should be more integration between the science publications and the databases. And not only because I firmly believe that, in large part, the data isn’t in the papers anymore.

We took a different approach. We recently partnered with BioMedCentral’s team to tag articles with the computational resources mentioned in their publication for which we have training. On their pages you’ll see a link to our site that looks like this, on a recent paper about CNVs in trypanosomes: http://www.biomedcentral.com/1471-2164/12/139

On our pages–such as this example of our landing page for the GBrowse tutorial–you’ll see recent papers that referred to this tool.

So if you were interested in GBrowse you could quickly see who is working with it, how, and it would help you to assess if GBrowse is the right tool for your needs. And you could use our tutorial to help understand ways to explore the data from a project that uses GBrowse. In many of the “big data” projects you aren’t going to get a gene list or a gene link. You’ll need to explore the set in toto at their sites. I’ve had my ranty pants* on about this before.

Yeah. I have strong feelings about this.

I also think it’s a good idea for publishers. There’s a bit of pushback I’ve seen about subscription pricing, including this letter that has always stuck with me since I read it:

The Head of the Harvard Library System is Pissed

Profit margins of journal publishers in the fields of science, technology, and medicine recently ran to 30–40 percent; yet those publishers add very little value to the research process, and most of the research is ultimately funded by American taxpayers through the National Institutes of Health and other organizations.

I think that by adding handier access to the data in a paper, or to the tools needed to go further, publishers can add value beyond just the traditional publication.

If I could only find it…

++++++++++++++++++++++

*hat tip to Mike the Mad Biologist for my new favorite phrase, “ranty pants.”

There’s a database for everything, even uber-operons

I was playing around with Google Scholar’s new citation feature that allowed me to collect my papers in one place easily (worked pretty well, btw, save a few glitches, see below) when I noticed it missed a paper of mine from 2000: “Gene context conservation of a higher order than operons.” The abstract:

Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.

The uber-operon. It was my PI’s suggested term. Living and working in Germany at the time, I thought it was kind of funny. Anyway, I never really expanded more than another paper or so on that research and kind of lost track of whether that paper resulted in much. I typed ‘uber-operon’ into Google today and found that it’s been cited a few times (88) and, I found this interesting, there have been a few databases built of “uber-operons.”

A Chinese research group created the Uber-Operon Database. The paper looks interesting, but unfortunately the server is down (whether this is temporary or permanent, I do not know). The ODB (Operon Database) uses uber-operons (which they call reference operons) to predict operons in the database. Nebulon is another, and HUGO is another. Read the chapter on computational methods for predicting uber-operons :)

Just goes to show you, there’s a database for everything.

Oh, and back to Google Scholar citation. It did find nearly every paper I’ve published, though it missed two (including the one above) and had two false positives. Additionally, many citations are missing (like the 88 for this paper, and many others from other papers). That’s not to say it’s not useful; I find it a nice tool, but it’s not perfect. You can find out more about Google Scholar citation here, and about Microsoft’s similar feature here.

Oh, and does this post put me in the HumbleBrag Hall of Fame? If that’s reserved for Twitter, then maybe I should tweet this so I can get there :) (though I’m not sure pointing out relatively small databases based on a relatively minor paper constitutes bragging, humbly or not LOL).

Tip of the Week: Human Epigenomics Visualization Hub

More and more we are seeing questions about ways to access Epigenomics data in the workshops we do. This often comes up in the workshop we do that focuses on the ENCODE data, because ENCODE is providing several epigenomics data sets that researchers are interested in. [The workshop we do is based on the materials you can find from the UCSC-sponsored freely available ENCODE tutorial.] But there are other browsers and data collections, and researchers want to be sure they are finding as much as possible.

Besides the data that is flowing into the UCSC Genome Browser ENCODE portal, we’ve talked in the past about the NCBI Epigenomics resource, including this tip. We have also explored DAnCER in the past as a tip-of-the-week.

I’m still recovering from my vacation this week, so I’m going to point you to a very nice SciVee example I found on another resource, the Human Epigenomics Visualization Hub from WashU, which can be accessed at this link: http://vizhub.wustl.edu/ It’s longer than our usual tips–about 20 minutes. But if you want to begin to explore the data available on that browser, it’s worth your time.

Based on a UCSC Genome Browser framework, this resource focuses on epigenomics data. Familiarity with basic vocabulary and functional features of the UCSC Genome Browser is something I’d recommend. Check out our freely-available UCSC sponsored tutorials for this which cover features of tracks and displays, including things like bigWig, Sessions, etc.

The project is associated with the Roadmap Epigenomics Project, which you can explore in detail at the project site, and learn more about in the reference below.

In case the embed isn’t working, here is the link to the SciVee page: http://www.scivee.tv/node/31122

Quick link to Vizhub: http://vizhub.wustl.edu/

Follow them on Twitter for news and announcements: @WashUGBrowser

Reference:
Bernstein, B., Stamatoyannopoulos, J., Costello, J., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M., Beaudet, A., Ecker, J., Farnham, P., Hirst, M., Lander, E., Mikkelsen, T., & Thomson, J. (2010). The NIH Roadmap Epigenomics Mapping Consortium Nature Biotechnology, 28 (10), 1045-1048 DOI: 10.1038/nbt1010-1045

Naked Mole Rat, another day, another genome

The latest genome to be completed is the naked mole rat (Heterocephalus glaber). Now, could there be a cooler (if ugly) mammal on the planet? It’s one of only two truly eusocial mammals in the world, it lives up to 28 long years (my daughter’s rat, no relation, lived only 3 years) and is surprisingly resistant to a lot of diseases.

So, no wonder the genome was sequenced. Maybe we can learn some things about social behavior and longevity.

Of course there is a resource for it at http://www.naked-mole-rat.org/ though it’s basically just a BLAST server and some downloads. I’m counting down to the day it’s available at UCSC or Ensembl :D. I have some genes I’m interested in comparing.

A new BioMed Central feature

Brought to you by OpenHelix and BioMed Central :D. We really like the feature and idea (of course) and thought we’d pass it on.

BioMed Central (BMC) is an open access publisher. BMC, along with OpenHelix, recently launched a new feature to give readers of BMC journals timely access to relevant genomic resource tutorials. When reading a research article at BMC, researchers are now provided links to online tutorials on many of the genomics resources and tools used or cited in the article. The link takes the reader directly to the training landing page on the OpenHelix site. BMC has a large selection of open access, high-quality, peer-reviewed research journals, and much of the research reported today uses and cites many of the resources OpenHelix trains on. Researchers can now quickly find training on the databases and tools used in the research.

For example, this recent article in BMC’s Genome Biology, Genomewide Characterization of non-polyadenylated RNAs, cites several tools used in the research, including GEO, MEME and others. The new feature finds these citations in the article and lists links to the OpenHelix tutorials on those resources, as seen in the image.
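For the curious, here is a rough sketch of the general idea of spotting resource names in article text and listing tutorial links for them. It is not how BMC or OpenHelix actually implement the feature (that isn’t described here). The `TUTORIALS` mapping, its placeholder example.org URLs, and the function name are all made up for illustration.

```python
import re

# Hypothetical lookup table: resource names mentioned in this post, mapped to
# placeholder URLs (NOT real tutorial addresses).
TUTORIALS = {
    "GEO": "https://example.org/tutorials/geo",
    "MEME": "https://example.org/tutorials/meme",
    "GBrowse": "https://example.org/tutorials/gbrowse",
}


def find_tutorial_links(article_text, tutorials=TUTORIALS):
    """Return {resource name: URL} for each known resource named in the text."""
    found = {}
    for name, url in tutorials.items():
        # Whole-word, case-sensitive match, so "GEO" does not match "geology".
        if re.search(rf"\b{re.escape(name)}\b", article_text):
            found[name] = url
    return found


sample = "Expression data were deposited in GEO and motifs were found with MEME."
for name, url in find_tutorial_links(sample).items():
    print(f"{name}: {url}")
```

A real system would need to cope with synonyms, versions and ambiguous names, which plain string matching won’t handle.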

It can be hard to find a quick link to a relevant resource in papers–the citations are sometimes incomplete, or not linked to the site.

We have plans to expand this feature in several ways to make training on relevant and important genomics resources simpler and quicker for researchers.

We’ve already gotten some great feedback on this–Great idea!

@jytricker Great idea! @BioMedCentral/OpenHelix jv will coach scientists on genomics/bioinformatics tools mentioned in papers http://ow.ly/4eQbG

 

Tip of the Week: DAnCER for disease-annotated epigenetics data

Epigenetics and epigenomics are becoming more exciting areas of investigation, and we are seeing more requests for database resources to support them, and for the sources of data from these types of experiments. If you aren’t aware of these investigations at this point, check out their entries in the Talking Glossary of Genetic Terms:

Epigenetics: Epigenetics is an emerging field of science that studies heritable changes caused by the activation and deactivation of genes without any change in the underlying DNA sequence of the organism. The word epigenetics is of Greek origin and literally means over and above (epi) the genome.

Epigenome: The term epigenome is derived from the Greek word epi which literally means “above” the genome. The epigenome consists of chemical compounds that modify, or mark, the genome in a way that tells it what to do, where to do it, and when to do it. Different cells have different epigenetic marks. These epigenetic marks, which are not part of the DNA itself, can be passed on from cell to cell as cells divide, and from one generation to the next.

And for the talking part–you can hear Dr. Laura Elnitski talk about these in more detail–have a listen at each entry. And just today an article providing an epigenetics primer appeared in my inbox: Epigenetics: A Primer.

These intriguing–and sometimes puzzling–chromatin modification (CM) signals and leads that are being unveiled in many labs and projects now are becoming more widely available in different databases. For this week’s tip of the week I’ll introduce DAnCER: Disease-Annotated Chromatin Epigenetics Resource, one of the tools that is organizing this type of data and enabling additional explorations. You can find DAnCER here: http://wodaklab.org/dancer/

In the associated publication, the DAnCER team describes other useful resources that provide epigenetics data. These include ChromDB, ChromatinDB (for yeast), and the Human Histone Modification Database (HHMD), among others. I’m also aware of other sources. A few months back I introduced the NCBI Epigenomics resource as my tip-of-the-week. (At that time I promised that when the publication became available I’d mention it–that’s now at the bottom in the references section below.) There’s also quite a bit of this data flowing in to the UCSC Genome Browser ENCODE DCC, including–may I add–some data from the very cool Elnitski bi-directional promoter studies. You can find similar data types via the modENCODE project as well.

So, there are lots of resources out there. Each provider has different projects, species, goals, displays, etc. But the group that developed DAnCER wanted to fill a niche they didn’t see available already: linking these epigenetic changes to possible disease association data. Here’s how they describe their work:

Our research effort therefore strives to explore CM-related genes in the context of their protein-interaction network, their partnership in multi-protein complexes and cellular pathways, as well as their gene expression profiles….

They are well-suited to linking this kind of information. You may remember our previous explorations and discussions of iRefWeb. The kind of network and interaction data that they assemble in that context can be brought to the chromatin-modification arena. The point is that you can take steps beyond the modifications you know about, to explore their neighborhood of interactions, and potentially unearth important disease relationships from that.

The data includes several species, and because of that, evolutionary conservation can also be explored.

So if you find that you are interested in exploring chromatin modifications, and want to take that data further, check out DAnCER, and the other tools and projects that are providing this type of information. If you have used the iRefWeb interface, you’ll see some similarities in structure. Search options with many filters are available. Color-coded and sortable results are provided. Links to gene details within the Wodak lab tools and external links are offered. On the gene pages at DAnCER you’ll have many types of annotations, including Gene Ontology descriptions, evidence type and references, neighbors, and protein domain information as well. And besides the texty-table based stuff, you can choose to load up the interactive network/interaction graphic, just like with the iRefWeb tool.

There’s a lot of opportunity to learn things from this tool. Try it out.

Quick Links and References:

DAnCER http://wodaklab.org/dancer/

Turinsky, A., Turner, B., Borja, R., Gleeson, J., Heath, M., Pu, S., Switzer, T., Dong, D., Gong, Y., On, T., Xiong, X., Emili, A., Greenblatt, J., Parkinson, J., Zhang, Z., & Wodak, S. (2010). DAnCER: Disease-Annotated Chromatin Epigenetics Resource Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq857

Fingerman, I., McDaniel, L., Zhang, X., Ratzat, W., Hassan, T., Jiang, Z., Cohen, R., & Schuler, G. (2010). NCBI Epigenomics: a new public resource for exploring epigenomic data sets Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1146

Have some NGS SAM/BAM files? Get a GUI interface

A recent paper on a GUI interface introduces SAMMate. As the paper states:

With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files.

You might want to check it out if you have Next Generation Sequencing data in the form of BAM/SAM files. A nice feature I haven’t been able to check is that it will export a ‘wiggle’ file for alignment visualization in the UCSC Genome Browser.
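If you’re wondering what goes into a ‘wiggle’ export, here is a minimal sketch that turns SAM-format alignment lines into per-base coverage and writes it out in UCSC’s variableStep wiggle format. This is not SAMMate’s code or method; the function names and the toy reads are invented for illustration. It assumes plain-text SAM input (BAM is binary and would need a proper library) and counts only the M, = and X CIGAR operations as coverage.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_to_coverage(sam_lines):
    """Count aligned reads per reference base from SAM text lines."""
    coverage = defaultdict(lambda: defaultdict(int))  # chrom -> position -> depth
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or chrom == "*" or cigar == "*":
            continue  # unmapped read, nothing to count
        ref_pos = pos  # SAM POS is 1-based, and so are wiggle positions
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # aligned bases: count them and advance
                for p in range(ref_pos, ref_pos + length):
                    coverage[chrom][p] += 1
                ref_pos += length
            elif op in "DN":  # deletion or skipped region: advance without counting
                ref_pos += length
            # I, S, H and P do not consume reference positions
    return coverage


def coverage_to_wiggle(coverage, track_name="coverage"):
    """Format the coverage as variableStep wiggle text."""
    lines = [f'track type=wiggle_0 name="{track_name}"']
    for chrom in sorted(coverage):
        lines.append(f"variableStep chrom={chrom}")
        for p in sorted(coverage[chrom]):
            lines.append(f"{p} {coverage[chrom][p]}")
    return "\n".join(lines) + "\n"


# Two made-up reads on a made-up chromosome, just to exercise the functions.
demo_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t12\t60\t2M1D2M\t*\t0\t0\tACGT\tIIII",
]
print(coverage_to_wiggle(sam_to_coverage(demo_sam)))
```

The resulting text can be saved to a file and loaded as a custom track. For real data sets you would want a dedicated tool, since a per-base dictionary like this will not scale to whole genomes.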

NAR Database issue…get it while it’s hot!

Ok, it’s hot now–but it’s something we refer back to all year long, actually. For people who don’t know about the NAR Database Issue, since the mid-90s Nucleic Acids Research has been collecting bioinformatics databases and tools that are of use to a huge range of researchers. We’ve watched it grow over the years and we’ve even graphed it. We’ll have to update that graph with the new data point for this year. But here’s the graph as we published it last year:

(You can get this figure from our paper here, it is Figure 1)

You can see steady growth in the resources collected in the NAR set. But that’s certainly not all of them–others can be found in their server issue in the summer, and some just aren’t listed in a lot of places. We think there are in the range of 3000 tools and resources of some sort around.

A nice overview of the state of play is always provided in the introduction paper for that issue. As they state, this year we are up to 1330 data sources in their list. And they also highlight a couple of editorials that address important issues in this arena. One is about the need for data sources to talk to each other. This is an important point:

these databases risk functioning increasingly as isolated islands in a sea of disparate biological data

And there’s another editorial that speaks to the understanding of the data we have in our hands–and the need to understand it better. It describes COMBREX–a very cool effort:

This project is designed to serve as a clearinghouse, collecting functional predictions from specialists in bioinformatics and functional genomics and then sending these predictions for testing by experimentalists.

This is the kind of thing that makes me wish I still had a lab. There’s so much opportunity here…alas. The road not taken. But a hot opportunity for smart youngsters who might like to carve out a niche with a lab that mines the computational materials and pairs it with great projects for students to do the bench characterizations. And it offers grants to do this work….

Anyway–check out the NAR database issue. It’s worth your time. Really.

EDIT: there’s a fun and interesting crowd-sourced analysis of the NAR databases in the list for features of utility to bioinformatics geeks going on at BioStar.

References:
Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI: 10.1093/bib/bbq026

Galperin, M., & Cochrane, G. (2010). The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1243

Gaudet, P., Bairoch, A., Field, D., Sansone, S., Taylor, C., Attwood, T., Bateman, A., Blake, J., Bult, C., Cherry, J., Chisholm, R., Cochrane, G., Cook, C., Eppig, J., Galperin, M., Gentleman, R., Goble, C., Gojobori, T., Hancock, J., Howe, D., Imanishi, T., Kelso, J., Landsman, D., Lewis, S., Mizrachi, I., Orchard, S., Ouellette, B., Ranganathan, S., Richardson, L., Rocca-Serra, P., Schofield, P., Smedley, D., Southan, C., Tan, T., Tatusova, T., Whetzel, P., White, O., & Yamasaki, C. (2010). Towards BioDBcore: a community-defined information specification for biological databases Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1173

Roberts, R., Chang, Y., Hu, Z., Rachlin, J., Anton, B., Pokrzywa, R., Choi, H., Faller, L., Guleria, J., Housman, G., Klitgord, N., Mazumdar, V., McGettrick, M., Osmani, L., Swaminathan, R., Tao, K., Letovsky, S., Vitkup, D., Segre, D., Salzberg, S., Delisi, C., Steffen, M., & Kasif, S. (2010). COMBREX: a project to accelerate the functional annotation of prokaryotic genomes Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1168