Category Archives: New Resource

Video Tip of the Week: Aquaria, streamlined access to protein structures for biologists

This week’s Video Tip of the Week is Aquaria, a new resource for exploring protein structures, mutations, and similarities to other proteins. It’s a very well-designed and interactive experience for end users. It is aimed largely at biologists who could benefit from exploring the structural details of their proteins of interest, but are daunted by tools aimed at structural biologists. But for tool developers, you should also look at how this rollout went. It’s one of the best examples of a tool launch I’ve seen in this field. And I’ve seen a lot.

So first, the tool. Aquaria offers users a streamlined way to access and explore protein structures, combining the kinds of information you get from the PDB structure resources with additional details like the UniProt mutations. Currently you start with a basic search by asking for a protein by name, or by PDB or UniProt ID. They have pre-calculated the relationships of proteins in PDB and Swiss-Prot to quickly offer you a structure and related proteins. The paper notes: “Currently, Aquaria contains 46 million precalculated sequence-to-structure alignments, resulting in at least one matching structure for 87% of Swiss-Prot proteins and a median of 35 structures per protein….” In addition, it lets you explore other important biological features such as InterPro domains and post-translational modifications, so you can think about how the mutations + structures + functions impact a given protein that you are interested in. As they describe it:

“We have loaded SNP data from Uniprot and Interpro so you can see where the mutations lie on your 3D model. And we have found that you may be pleasantly surprised to find your mutations clustering in 3D space!”

The Aquaria folks provided an intro video to get you started:

Another handy feature they provided is a Quick Reference Card with shortcuts to the functions [PDF]. In addition to this intro, they have a longer video as well. This is more like a typical lecture with the background, the framework, the goals of the project, and more about the underlying database.

Now, this thing about the rollout of this software project. I found it when I was looking over the talks at the upcoming VIZBI conference (Visualizing Biological Data). Every year I find there are awesome ideas that come out of VIZBI, and tools I want to explore. Among them this year is Aquaria. So I went looking for more detail, and found some of the traditional stuff: the paper (below), the press release, etc. And then I found the Reddit discussion. The Aquaria team did a Science AMA on this tool. It engaged a range of folks–some of them just fans of science who had probably never seen protein structures before. That’s fine with me–the more folks who appreciate research and learn about how researchers explore proteins, the better. But others had good technical questions for the team–such as other ways to find proteins of interest with sequence searches, or integration with other tools like UCSC Genome Browser. All the answers are over there. I enjoyed the question about the name of the tool:

It seems you get the ideas we had in mind: using Aquaria lets us observe these fascinating creatures (proteins) from the natural world. Aquaria creates an artificial environment and lighting where we can observe isolated proteins; like aquarium fish, proteins are often beautiful and (usually) live in water.

I asked them about how this played out, and they had ~1000 folks visit their site as a result of this Reddit event. That was really interesting to me, and a very neat route to drive awareness.

They also provided a way to support users with one of my other favorite resources–Biostars. They created a support thread there where users can ask questions and get answers: https://www.biostars.org/t/aquaria/ I so prefer this to mailing lists, and I’m glad to see this easy method to get support. In fact, I asked something that I couldn’t quite figure out yet. (The protein I was looking at is at http://aquaria.ws/P09616/7ahl/A and I wanted to see all the subunits in full color. You de-select autofocus to do that, and color by chains for this version.)
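The link to that protein suggests a simple address pattern: a UniProt accession, then a PDB ID, then a chain. Here is a minimal Python sketch under that assumption. The pattern is inferred from this single example link, not from any Aquaria documentation, and the function name `aquaria_url` is made up for illustration.

```python
def aquaria_url(uniprot_id, pdb_id=None, chain=None, base="http://aquaria.ws"):
    """Build an Aquaria-style link: base/UniProt[/PDB[/chain]].

    Assumption: the path layout is guessed from one example URL
    (http://aquaria.ws/P09616/7ahl/A); it is not a documented API.
    """
    if chain is not None and pdb_id is None:
        raise ValueError("a chain only makes sense together with a PDB ID")
    parts = [base.rstrip("/"), uniprot_id]
    if pdb_id is not None:
        parts.append(pdb_id)
    if chain is not None:
        parts.append(chain)
    return "/".join(parts)


# Rebuild the link used in this post.
print(aquaria_url("P09616", "7ahl", "A"))
```

If the pattern holds, something like this could be handy for turning a spreadsheet of UniProt IDs into clickable links, but check a few by hand first.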

Also, for the developer types: they offer a way for you to interact with the Aquaria software to add your own features of interest with their API. Maybe you have new mutations you have found in some sequence you’ve obtained in your lab, for example. They are offering guidance on that here: http://bit.ly/aquaria-features. They touch on this in the longer video (~27min) if you want a bit more explanation. I suspect from the high quality support they are offering, they’d be interested to hear from you and what features you’d like to see applied to these proteins as well.

So kudos to this team for a nifty tool and really serious multi-media outreach efforts. I think it was well done on all counts. I’ll bet you Reddit reached more of the right folks than a press release ever will. PIOs take note–get your scientists on Reddit.

Quick links:

Aquaria site: http://aquaria.ws/

Reddit Science AMA: https://www.reddit.com/r/science/comments/2w2jvw/science_ama_series_we_are_dr_sean_odonoghue_and/

Biostar support thread: https://www.biostars.org/t/aquaria/

Reference:
O’Donoghue, S.I., Sabir, K.S., Kalemanov, M., Stolte, C., Wellmann, B., Ho, V., Roos, M., Perdigão, N., Buske, F.A., Heinrich, J., & Rost, B. (2015). Aquaria: simplifying discovery and insight from protein structures Nature Methods, 12 (2), 98-99 DOI: http://dx.doi.org/10.1038/nmeth.3258

The data isn’t in the papers anymore. Again.

I know this is a topic I keep hammering on. But I’m not sure that it’s really grokked by a lot of people who are not as deep into the bioinformatics aspects of biology today. Or those who support biologists, such as publishers and librarians, who may not be as immersed in the daily software aspects.

There was a nice post by Ed Yong last week about a paper published on sticklebacks. There are several cool things about this paper–but one of them is merely the fact that we can use the next-generation sequencing technology we have to examine species in ways that we just couldn’t before. And Ed made the point that there wasn’t only one genome in this paper–there were 21 genome sequencing events in this paper.

And because of the cool biological niches of these sticklebacks–it was possible to compare populations that varied in interesting ways. Some were fresh-water, some were salt-water based, and this could be examined in different regions of the planet to compare whether the same adaptations happened in different places for the same reason.

It really is a sweet paper. But it also serves another point of mine, that I keep making over and over again. The data is not in the papers anymore. The paper is a nice sort of summary statement of the work. But you cannot put 21 genomes in press–and a big list of A, T, G, and Cs wouldn’t be that valuable on paper anyway. You cannot show the analysis tracks in the papers. You can merely sample a subset of them. You can illustrate a few “compelling examples” as we used to call them at one place I worked.

But if you want to explore other features, or you want to build on this work yourself, you need to turn to the databases. The real magic happens there now–not in the papers. Back in the days of my training and early career, the papers were enough. They are not anymore. It’s not clear to me if publishers appreciate this fact entirely in this field.

And the authors offer a whole genome browser (based on the UCSC Genome Browser software platform) for their stickleback data. It’s quite lovely, actually–I’ll link to it below. It’s also an excellent demonstration of how to use existing open source software to craft a version for your needs.

Quick links:

Here’s Ed’s post on the key features of the work: Stickleback genome reveals detail of evolution’s repeated experiment

Look at the Sticklebrowser yourself. It’s actually rather lovely. And informative. http://sticklebrowser.stanford.edu

To learn to use UCSC Genome Browser based software, see the training materials sponsored by UCSC: http://openhelix.com/ucsc

Reference:

Jones, F., Grabherr, M., Chan, Y., Russell, P., Mauceli, E., Johnson, J., Swofford, R., Pirun, M., Zody, M., White, S., Birney, E., Searle, S., Schmutz, J., Grimwood, J., Dickson, M., Myers, R., Miller, C., Summers, B., Knecht, A., Brady, S., Zhang, H., Pollen, A., Howes, T., Amemiya, C., Baldwin, J., Bloom, T., Jaffe, D., Nicol, R., Wilkinson, J., Lander, E., Di Palma, F., Lindblad-Toh, K., & Kingsley, D. (2012). The genomic basis of adaptive evolution in threespine sticklebacks Nature, 484 (7392), 55-61 DOI: 10.1038/nature10944

NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing up more extensive blog posts on it and doing some tips of the week over the next couple months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds and well over 100 more.

And I’ll give a special shout out to my former PI at EMBL, because I can: Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (and he and a couple of members are on the InterPro paper also).

This is going to be a wealth of information to wade through!

UCSC Genome Browser: http://genome.ucsc.edu
Ensembl: http://www.ensembl.org/
UniProt: http://www.uniprot.org/
MINT: http://mint.bio.uniroma2.it/mint/
SMART: http://smart.embl.de/
WormBase: http://www.wormbase.org/
Gene Ontology: http://www.geneontology.org/
ENCODE: http://genome.ucsc.edu/ENCODE/
KEGG: http://www.kegg.jp
UCSC Archaeal Browser: http://archaea.ucsc.edu/
IMG: http://img.jgi.doe.gov/cgi-bin/w/main.cgi
DBTSS: http://dbtss.hgc.jp/
InterPro: http://www.ebi.ac.uk/interpro

 


Publishers + bioinformatics tools. Good idea.

So I was watching my twitter feed today and saw this tidbit come along:

@FN_Press: Elsevier Introduces Genome Viewer http://bit.ly/nS92lE

….The Genome Viewer utilizes a genome browser developed by NCBI (the National Center for Biotechnology Information at the National Institutes of Health). Elsevier collaborated with the NCBI as it was developing the browser, and is the first publisher to incorporate the technology into an application for viewing detailed information about the gene sequences that are mentioned in articles.

When an author of an article tags a gene sequence, Elsevier matches this gene with information in NCBI’s databases and pulls this information into the article. This allows readers of the article to get specific information about each strand by hovering over it, and also offers functionality such as flipping the strands, zooming to a sequence, or going to a specific position to define a track of interest within the sequence….

There’s nothing I love more than exploring a new genome viewer, or a clever new use of an existing one! The press release offered links to a couple of papers that were supposed to show this new feature. Um…I couldn’t find it.

Step 1: Locate genome viewer.

Step 2: Explore genome viewer.

Step 3: Write up blog post about new genome viewer.

So here we are, and I’m stuck at Step 1 still. But I’ve sort of hurdled over that to Step 3. I’m sure the folks at Elsevier are going to get me to this browser–I’ve already been in touch with them and they are going to help me out.

But in the meantime I’d like to say how cool an idea this is. I have always thought there should be more integration between the science publications and the databases. And not only because I firmly believe that, in large part, the data isn’t in the papers anymore.

We took a different approach. We recently partnered with BioMedCentral’s team to tag articles with the computational resources mentioned in their publication for which we have training. On their pages you’ll see a link to our site that looks like this, on a recent paper about CNVs in trypanosomes: http://www.biomedcentral.com/1471-2164/12/139

On our pages–such as this example of our landing page for the GBrowse tutorial–you’ll see recent papers that referred to this tool.

So if you were interested in GBrowse you could quickly see who is working with it, and how, and it would help you to assess if GBrowse is the right tool for your needs. And you could use our tutorial to help understand ways to explore the data from a project that uses GBrowse. In many of the “big data” projects you aren’t going to get a gene list or a gene link. You’ll need to explore the set in toto at their sites. I’ve had my ranty pants* on about this before.

Yeah. I have strong feelings about this.

I also think it’s a good idea for publishers. There’s a bit of pushback I’ve seen about subscription pricing, including this letter that has always stuck with me since I read it:

The Head of the Harvard Library System is Pissed

Profit margins of journal publishers in the fields of science, technology, and medicine recently ran to 30–40 percent; yet those publishers add very little value to the research process, and most of the research is ultimately funded by American taxpayers through the National Institutes of Health and other organizations.

I think that by adding handier access to the data in a paper, or to the tools needed to go further, publishers can add value beyond just the traditional publication.

If I could only find it…

++++++++++++++++++++++

*hat tip to Mike the Mad Biologist for my new favorite phrase, “ranty pants”

There’s a database for everything, even uber-operons

I was playing around with Google Scholar’s new citation feature that allowed me to collect my papers in one place easily (worked pretty well, btw, save a few glitches, see below) when I noticed it missed a paper of mine from 2000: “Gene context conservation of a higher order than operons.” The abstract:

Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.

The uber-operon. It was my PI’s suggested term. Living and working in Germany at the time, I thought it was kind of funny. Anyway, I never really expanded more than another paper or so on that research and kind of lost track of whether that paper resulted in much. I typed ‘uber-operon’ into Google today and found that it’s been cited a few times (88) and, I found this interesting, there have been a few databases built of “uber-operons.”

A Chinese research group created the Uber-Operon Database. The paper looks interesting, but unfortunately the server is down (whether this is temporary or permanent, I do not know). The ODB (Operon Database) uses uber-operons (which they call reference operons) to predict operons in the database. Nebulon is another, and HUGO is another. Read the chapter on computational methods for predicting uber-operons :)

Just goes to show you, there’s a database for everything.

Oh, and back to Google Scholar citation. It did find nearly every paper I’ve published, though it missed two (including the one above) and had two false positives. Additionally, many citations are missing (like the 88 for this paper, and many others from other papers). That’s not to say it’s not useful. I find it a nice tool, but it’s not perfect. You can find out more about Google Scholar citation here, and about Microsoft’s similar feature here.

Oh, and does this post put me in the HumbleBrag Hall of Fame? If that’s reserved for twitter, then maybe I should twitter this so I can get there :). (Though I’m not sure pointing out relatively small databases based on a relatively minor paper constitutes bragging, humbly or not LOL.)

Tip of the Week: Human Epigenomics Visualization Hub

More and more we are seeing questions about ways to access Epigenomics data in the workshops we do. This often comes up in the workshop we do that focuses on the ENCODE data, because ENCODE is providing several epigenomics data sets that researchers are interested in. [The workshop we do is based on the materials you can find from the UCSC-sponsored freely available ENCODE tutorial.] But there are other browsers and data collections, and researchers want to be sure they are finding as much as possible.

Besides the data that is flowing into the UCSC Genome Browser ENCODE portal, we’ve talked in the past about the NCBI Epigenomics resource, including this tip. We have also explored DAnCER in the past as a tip-of-the-week.

I’m still recovering from my vacation this week, so I’m going to point you to a very nice SciVee example I found on another resource, the Human Epigenomics Visualization Hub from WashU, which can be accessed at this link: http://vizhub.wustl.edu/ It’s longer than our usual tips–about 20 minutes. But if you want to begin to explore the data available on that browser, it’s worth your time.

Based on a UCSC Genome Browser framework, this resource focuses on epigenomics data. Familiarity with basic vocabulary and functional features of the UCSC Genome Browser is something I’d recommend. Check out our freely-available UCSC sponsored tutorials for this which cover features of tracks and displays, including things like bigWig, Sessions, etc.

The project is associated with the Roadmap Epigenomics Project, which you can explore in detail at the project site, and learn more about in the reference below.

In case the embed isn’t working, here is the link to the SciVee page: http://www.scivee.tv/node/31122

Quick link to Vizhub: http://vizhub.wustl.edu/

Follow them on Twitter for news and announcements: @WashUGBrowser

Reference:
Bernstein, B., Stamatoyannopoulos, J., Costello, J., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M., Beaudet, A., Ecker, J., Farnham, P., Hirst, M., Lander, E., Mikkelsen, T., & Thomson, J. (2010). The NIH Roadmap Epigenomics Mapping Consortium Nature Biotechnology, 28 (10), 1045-1048 DOI: 10.1038/nbt1010-1045

Naked Mole Rat, another day, another genome

The latest genome to be completed is the naked mole rat (Heterocephalus glaber). Now, could there be a cooler (if ugly) mammal on the planet? It’s one of only two truly eusocial mammals in the world. It lives up to 28 long years (my daughter’s rat, no relation, lived only 3 years) and is surprisingly resistant to a lot of diseases.

So, no wonder the genome was sequenced. Maybe we can learn some things about social behavior and longevity.

Of course there is a resource for it at http://www.naked-mole-rat.org/ though it’s basically just a blast server and some downloads. I’m counting down to the day it’s available at UCSC or Ensembl :D. I have some genes I’m interested in comparing.

A new BioMed Central feature

Brought to you by OpenHelix and BioMed Central :D. We really like the feature and idea (of course) and thought we’d pass it on.

BioMed Central (BMC) is an open access publisher. BMC along with OpenHelix launched a new feature recently to give readers of BMC journals timely access to relevant genomic resource tutorials. When reading a research article at BMC, researchers are now provided links to online tutorials of many of the genomics resources and tools used or cited in the article. The link takes the reader directly to the training landing page on the OpenHelix site. BMC has a large selection of open access high quality peer-reviewed research journals and much of the research reported today uses and cites many of the resources OpenHelix trains on. Researchers can now quickly find training on the databases and tools used in the research. For example, this recent article Genomewide Characterization of non-polyadenylated RNAs, in BMC’s Genome Biology cites several tools used in their research including GEO, MEME and others. The new feature finds these citations in the article and lists links to the OpenHelix tutorials on those resources as seen in the image.
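To make the idea concrete, here is a rough Python sketch of how tool mentions in an article could be matched to tutorial links. This is only an illustration of the general approach and not how the BMC/OpenHelix feature is actually built. The lookup table, the placeholder URLs, and the function name `find_tutorials` are all made up for the example.

```python
import re

# Hypothetical lookup table: tool name -> tutorial landing page.
# The URLs are placeholders, not real OpenHelix addresses.
TUTORIALS = {
    "GEO": "https://example.org/tutorials/geo",
    "MEME": "https://example.org/tutorials/meme",
    "UCSC Genome Browser": "https://example.org/tutorials/ucsc",
}


def find_tutorials(article_text, tutorials=TUTORIALS):
    """Return {tool: url} for each known tool named in the article text."""
    found = {}
    for tool, url in tutorials.items():
        # Word boundaries keep "GEO" from matching inside "GEOGRAPHY".
        # Matching is case-sensitive, so lowercase "meme" is not picked up.
        pattern = r"\b" + re.escape(tool) + r"\b"
        if re.search(pattern, article_text):
            found[tool] = url
    return found


text = "Expression data were taken from GEO and motifs were found with MEME."
print(find_tutorials(text))
```

Plain name matching like this will miss tools that are cited only by reference number, so a real system would presumably also need to look at the reference list.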

It can be hard to find a quick link to a relevant resource in papers–the citations are sometimes incomplete, or not linked to the site.

We have plans to expand this feature in several ways to make training on relevant and important genomics resources simpler and quicker for researchers.

We’ve already gotten some great feedback on this–Great idea!

@jytricker Great idea! @BioMedCentral/OpenHelix jv will coach scientists on genomics/bioinformatics tools mentioned in papers http://ow.ly/4eQbG

 

Tip of the Week: DAnCER for disease-annotated epigenetics data

Epigenetics and epigenomics are becoming more exciting areas of investigation, and we are seeing more requests for database resources to support them, and for the sources of data from these types of experiments. If you aren’t aware of these investigations at this point, check out their entries in the Talking Glossary of Genetic Terms:

Epigenetics: Epigenetics is an emerging field of science that studies heritable changes caused by the activation and deactivation of genes without any change in the underlying DNA sequence of the organism. The word epigenetics is of Greek origin and literally means over and above (epi) the genome.

Epigenome: The term epigenome is derived from the Greek word epi which literally means “above” the genome. The epigenome consists of chemical compounds that modify, or mark, the genome in a way that tells it what to do, where to do it, and when to do it. Different cells have different epigenetic marks. These epigenetic marks, which are not part of the DNA itself, can be passed on from cell to cell as cells divide, and from one generation to the next.

And for the talking part–you can hear Dr. Laura Elnitski talk about these in more detail–have a listen at each entry. And just today an article providing an epigenetics primer appeared in my inbox: Epigenetics: A Primer.

These intriguing–and sometimes puzzling–chromatin modification (CM) signals and leads that are being unveiled in many labs and projects now are becoming more widely available in different databases. For this week’s tip of the week I’ll introduce DAnCER: Disease-Annotated Chromatin Epigenetics Resource, one of the tools that is organizing this type of data and enabling additional explorations. You can find DAnCER here: http://wodaklab.org/dancer/

In the associated publication, the DAnCER team describes other useful resources that provide epigenetics data. These include ChromDB, ChromatinDB (for yeast), and the Human Histone Modification Database (HHMD), among others. I’m also aware of other sources. A few months back I introduced the NCBI Epigenomics resource as my tip-of-the-week. (At that time I promised that when the publication became available I’d mention it–that’s now at the bottom in the references section below.) There’s also quite a bit of this data flowing in to the UCSC Genome Browser ENCODE DCC, including–may I add–some data from the very cool Elnitski bi-directional promoter studies. You can find similar data types via the modENCODE project as well.

So, there are lots of resources out there. Each provider has different projects, species, goals, displays, etc. But the group that developed DAnCER wanted to fill a niche they didn’t see available already: linking these epigenetic changes to possible disease association data. Here’s how they describe their work:

Our research effort therefore strives to explore CM-related genes in the context of their protein-interaction network, their partnership in multi-protein complexes and cellular pathways, as well as their gene expression profiles….

They are well-suited to linking this kind of information. You may remember our previous explorations and discussions of iRefWeb. The kind of network and interaction data that they assemble in that context can be brought to the chromatin-modification arena. The point is that you can take steps beyond the modifications you know about, to explore their neighborhood of interactions, and potentially unearth important disease relationships from that.

The data includes several species, and because of that evolutionary conservation can also be explored.

So if you find that you are interested in exploring chromatin modifications, and want to take that data further, check out DAnCER, and the other tools and projects that are providing this type of information. If you have used the iRefWeb interface, you’ll see some similarities in structure. Search options with many filters are available. Color-coded and sortable results are provided. Links to gene details within the Wodak lab tools and external links are offered. On the gene pages at DAnCER you’ll have many types of annotations, including Gene Ontology descriptions, evidence type and references, neighbors, and protein domain information as well. And besides the texty-table based stuff, you can choose to load up the interactive network/interaction graphic, just like with the iRefWeb tool.

There’s a lot of opportunity to learn things from this tool. Try it out.

Quick Links and References:

DAnCER http://wodaklab.org/dancer/

Turinsky, A., Turner, B., Borja, R., Gleeson, J., Heath, M., Pu, S., Switzer, T., Dong, D., Gong, Y., On, T., Xiong, X., Emili, A., Greenblatt, J., Parkinson, J., Zhang, Z., & Wodak, S. (2010). DAnCER: Disease-Annotated Chromatin Epigenetics Resource Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq857

Fingerman, I., McDaniel, L., Zhang, X., Ratzat, W., Hassan, T., Jiang, Z., Cohen, R., & Schuler, G. (2010). NCBI Epigenomics: a new public resource for exploring epigenomic data sets Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1146

Have some NGS SAM/BAM files? get a GUI interface

A recent paper on a GUI interface introduces SAMMate. As the paper states:

With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files.

You might want to check it out if you have Next Generation Sequencing data in the form of BAM/SAM files. A nice feature I haven’t been able to check is that it will export a ‘wiggle’ file for alignment visualization in the UCSC Genome Browser.
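For a sense of what such a ‘wiggle’ file contains, here is a small Python sketch that turns a handful of alignment intervals into per-base coverage and writes it out in the variableStep wiggle layout that the UCSC Genome Browser accepts as a custom track. This is not SAMMate’s code or method. The function names and the toy alignments are made up, and real SAM/BAM parsing is left out entirely.

```python
from collections import Counter


def coverage(alignments):
    """Count how many alignments cover each position.

    `alignments` is an iterable of (start, end) pairs, 1-based and inclusive.
    These are toy intervals; reading them from a SAM/BAM file is not shown.
    """
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end + 1):
            depth[pos] += 1
    return depth


def to_wiggle(chrom, depth, name="coverage"):
    """Render per-base depth as variableStep wiggle text (1-based positions)."""
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"variableStep chrom={chrom}",
    ]
    for pos in sorted(depth):
        lines.append(f"{pos} {depth[pos]}")
    return "\n".join(lines) + "\n"


# Two overlapping toy reads on chr1.
reads = [(100, 103), (102, 105)]
print(to_wiggle("chr1", coverage(reads)), end="")
```

Positions 102 and 103 come out with a depth of 2 because both toy reads cover them. The resulting text can be pasted into the custom track box at UCSC.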