As I was reading about the project, I thought I should know a bit more about how, specifically, cranberries are grown. I’ve seen the flooded harvesting images, but I didn’t know what happened prior to that–the “bog habitat”. Conveniently, one of the research sites had links to some interesting videos of how cranberries are farmed. Sand–really–sand is the foundation of the fields. These dead-looking vines are laid out, and then partially buried in the sand. In a few years you will get cranberries. It’s kind of astonishing to actually see it–it looks so barren and lifeless at first.
“Some of these varieties were the progenitors of our current commercial turkeys, and they are fairly closely related to them genetically,” explains Hulet. “Today’s commercial turkeys are white because people didn’t like the little dots of pigment left on the skin after the feathers are pulled out, so breeders selected for a white-skinned turkey.” The white color is more natural for chickens, he explains, “while it’s a mutation for turkeys.”
Enjoy your mutant foods this holiday season.
Back to regular posting next week.
Polashock J., Ehud Zelzion, Diego Fajardo, Juan Zalapa, Laura Georgi, Debashish Bhattacharya & Nicholi Vorsa (2014). The American cranberry: first insights into the whole genome of a species adapted to bog habitat, BMC Plant Biology, 14 (1) 165. DOI: http://dx.doi.org/10.1186/1471-2229-14-165
It was just a little tweet, with hardly any information about the function or purpose of the resource mentioned. But the cute name drove a lot of people to take a look at GeneFriends from our blog recently, so I figured it was worth highlighting this tool as our Video Tip of the Week.
So here’s the original tweet, hat tip to Jack Scanlan:
I admit, I looked too. I had imagined something like a personal genomics matching site, but that’s not what it is. GeneFriends is a tool that uses gene co-expression data to try to identify which genes are “friends” with other genes in networks. These can be known genes, or they can be uncharacterized genes. The current implementation is for human data.
GeneFriends is not a new tool: the original implementation, built on microarray-based data sets, came out some time ago, and that earlier version includes 3000 data sets. But their new paper describes a different version, now done with RNA-seq data. The paper says there are over 4000 RNA-seq samples from 240 studies, via the SRA database. In the new paper they describe the criteria for selection and their strategy for calling co-expression. They state that their goal is to help unearth leads on annotation for uncharacterized genes, and this also includes non-coding RNA sequences.
GeneFriends employs a RNAseq based gene co-expression network for candidate gene prioritization, based on a seed list of genes, and for functional annotation of unknown genes in humans.
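To make the “friends” idea a bit more concrete, here is a minimal sketch of ranking genes by co-expression with a seed list. It is not the GeneFriends method: I am assuming plain Pearson correlation across samples as a stand-in, and the gene names, expression values, and function names are all made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)

def rank_friends(expression, seeds, top=3):
    """Rank non-seed genes by their mean correlation with the seed genes."""
    scores = {}
    for gene, profile in expression.items():
        if gene in seeds:
            continue
        scores[gene] = sum(pearson(profile, expression[s]) for s in seeds) / len(seeds)
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical toy data: one list of expression values per gene, one value per sample.
expression = {
    "SEED1": [1, 2, 3, 4, 5],
    "SEED2": [2, 4, 6, 8, 11],
    "GENE_A": [1.1, 2.0, 3.2, 3.9, 5.1],  # tracks the seeds closely
    "GENE_B": [5, 4, 3, 2, 1],            # moves opposite to the seeds
    "GENE_C": [3, 1, 4, 1, 5],            # only loosely related
}

print(rank_friends(expression, seeds={"SEED1", "SEED2"}))
```

In this toy set the gene that rises and falls with the seeds comes out on top and the anti-correlated one lands last. The real tool works at a much larger scale and with its own criteria, which the paper describes.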
There is a short video with their foundation and philosophy about the GeneFriends tool:
Another video goes a bit further and illustrates an example of the functionality. On the site you can try this yourself with the handy “show example” buttons they have. In addition to what you’ll find at their site, they also demonstrate that you can bring your results over to the BioLayout tool to work with them further. They also note that you can upload the results into Cytoscape.
It’s pretty straightforward to use the basic features of GeneFriends, but there is additional detail on the underpinnings from their “about” page. The papers below also cover the foundations and their new directions. You should also be aware of the limitation of the RNA-seq data that they discuss in the new paper. But check it out to see if you can discover some new relationships among transcripts of interest with GeneFriends.
References: van Dam S., Rui Cordeiro, Thomas Craig, Jesse van Dam, Shona H Wood & João de Magalhães (2012). GeneFriends: An online co-expression analysis tool to identify novel gene targets for aging and complex diseases, BMC Genomics, 13 (1) 535. DOI: http://dx.doi.org/10.1186/1471-2164-13-535
van Dam S., T. Craig & J. P. de Magalhaes (2014). GeneFriends: a human RNA-seq-based gene and transcript co-expression database, Nucleic Acids Research, DOI: http://dx.doi.org/10.1093/nar/gku1042
So as amusing as this has all been, one team took another approach to this issue. They wondered if this Venn craze was the best way to tackle this data, or if there were more effective and interactive ways to explore this sort of data. Some data set visualization tools may not be right for a task. One problem is scaling Venn diagrams to capture the full range of features that genomics folks want to illustrate. They are now prepared to UpSet the applecart. In their intro video to UpSet, they summarize with this:
I’ve talked about the terrific data visualization tools around the Caleydo project a number of times. They are developing really useful and intuitive strategies for looking at numerous types of data, and you can see our previous posts on StratomeX, LineUp, Entourage and enRoute (the combo of genomics data and pathways here is particularly nifty). They work really hard with the theories and techniques of data visualization, and implement effective ways to explore data. They recently looked across various genomics data papers to see how data sets were being used, and they attempt to encourage good behavior with the right visualizations to make the necessary points (Points of View reference below):
Understanding the tasks that the diagrams are meant to support and being aware of the data structure are required to find an appropriate representation.
They have also tried to help. UpSet, for visualization of intersecting sets, is one of their new efforts, championed by Alexander Lex, with the other team members. Looking for both effective and efficient representation of the types of data genomics researchers need, this interactive tool is a really nice way to explore which items belong in which subset. And, of course, which ones don’t. But that’s just the beginning. With this tool you can easily spot the intersections, query for ones you are interested in, and sort in various ways. There are ways to explore the attributes and elements for the items as well. The other great thing about the Caleydo team is that they make nice intro videos–I’ll embed the overview one as this week’s video Tip of the Week, but they have a shorter basic intro one as well. In this video the examples include Simpsons characters and movie data sets, but it will certainly allow you to quickly grasp the utility of this tool. But there’s a lot more to it as well. Read the UpSet paper linked below (and you will spot a copy of the notorious banana Venn, in fact, which inspired their thoughts on a better way to illustrate sets). It has a lot of nice guidance on set theory and will help you think about the appropriate uses of different representations.
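If you want a feel for the bookkeeping behind this kind of view, here is a small sketch that sorts items into exclusive intersections: each item is filed under the exact combination of sets it belongs to, and the combinations are then listed by size. This is my own illustration rather than UpSet’s code, and the set names and members are hypothetical.

```python
from collections import defaultdict

def exclusive_intersections(sets):
    """Group items by the exact combination of named sets that contain them."""
    membership = defaultdict(set)
    for name, items in sets.items():
        for item in items:
            membership[item].add(name)
    groups = defaultdict(set)
    for item, names in membership.items():
        groups[frozenset(names)].add(item)
    return dict(groups)

def by_size(groups):
    """List (combination, items) pairs, largest group first, ties by name."""
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))

# Hypothetical gene lists from three experiments.
gene_sets = {
    "expA": {"g1", "g2", "g3", "g6"},
    "expB": {"g2", "g3", "g4"},
    "expC": {"g3", "g5"},
}

for combo, items in by_size(exclusive_intersections(gene_sets)):
    print(" & ".join(sorted(combo)), "->", sorted(items))
```

Because every item lands in exactly one group, the group sizes add up to the number of distinct items. A Venn diagram encodes the same property in its regions, but those regions get hard to read as the number of sets grows.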
The github pages have more help, documentation, and a link to try out an installation with your own data. I also recently had the chance to meet Alexander at a talk he gave, and I know he’s interested in knowing what other visualization challenges are problems in genomics, and would be interested in any feedback you have on the tools.
My dreams for this tool: it would be embeddable in journal articles. So I could see the data as the team presented it, but then also be able to explore the underlying stuff. And if it could be a sort of a “session” so I could snap back to the original view. And I wish I could embed an image faintly on the background….
Gibbs R.A., George M. Weinstock, Michael L. Metzker, Donna M. Muzny, Erica J. Sodergren, Steven Scherer, Graham Scott, David Steffen, Kim C. Worley, Paula E. Burch, Geoffrey Okwuonu et al. (2004). Genome sequence of the Brown Norway rat yields insights into mammalian evolution, Nature, 428 (6982) 493-521. DOI: http://dx.doi.org/10.1038/nature02426
We’ve been doing UCSC Genome Browser training workshops for a decade now. We’ve seen all sorts of situations–from places that had terrific bioinformatics and IT support, to places where the attendees had no idea if anyone provided support at their institution. Ironically, sometimes the places with little support were big-name research places where all the support was aimed at, or associated with, certain high-profile labs, and not the average researcher or post-doc. We have also seen places where although there was support, it was so hostile and dismissive that we could understand why the researchers didn’t seek them out. So when we went in, often people would deluge us with questions about problems they were having working with their own data.
Frequently a problem they were having was being able to incorporate their own data into a viewable and explorable way with other tools, where they could look at the deep context of genome annotations with their data. Over the years the options got better and better to do this with the UCSC tools: custom tracks, sessions, then hubs. But one problem still remained: some people couldn’t put their data over the intertubz–for a variety of reasons.
In some cases they had patient data, and HIPAA or grant agency privacy compliance issues, that restricted them to working behind their firewall. Sometimes their data sets were so huge they couldn’t get them loaded without timing out. Some places had the capacity to install a local UCSC mirror, but many didn’t. But UCSC has now solved this problem as well. Using their new Genome Browser in a Box (GBIB), you can download an installation of the UCSC Genome Browser to your own computer, use your own files, and they never have to leave your laptop or your firewall. You have your own personal mirror site. This might be a great solution for some folks at small companies too.
To accomplish this, you use a tool called VirtualBox to set up a virtual machine on your computer, you pull down the UCSC components, and you are ready to roll. I have an older and under-powered computer and it worked fine for me. It also is supported on Windows, Mac, or Linux, so it should serve most people.
This week’s video tip-of-the-week is a quick introduction to that setup. Although there is a paper already (below), good documentation (linked), and the ever-helpful mailing lists at UCSC, I thought some folks who were less likely to seek out (or have access to) the help might benefit from a walk-through of this process. I show where and how to get the GBIB, an overview of the steps, and then illustrate how this runs on my computer. You also get the benefit of my mistakes–I did testing for this before it was released, and I had installation issues, so I highlight where to get the help with that (Pro-tip: I should have printed the documentation before installing–it was all in there. And don’t forget to check the “troubleshooting” section at the end.).
So if you’ve wanted to load your own data into the UCSC Genome Browser and use the suite of tools there to visualize and query–but haven’t been able to–give the Browser in a Box a try.
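For anyone who has not prepared their own data for the browser before, here is a minimal sketch of writing a simple BED-style custom track as plain text: a `track` line followed by tab-separated features. The track name, description, coordinates, and output file name are hypothetical, and a real data set will likely need more fields and options than this, so check the UCSC custom track documentation for those.

```python
def make_custom_track(name, description, features):
    """Build the text of a simple BED custom track.

    features: iterable of (chrom, start, end, label) tuples, where start is
    0-based and end is exclusive, which is the BED convention.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if start < 0 or end <= start:
            raise ValueError(f"bad interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"

# Hypothetical features; replace these with your own intervals.
track_text = make_custom_track(
    "myPeaks",
    "Example peaks from my experiment",
    [("chr1", 1000, 2000, "peak1"), ("chr2", 500, 900, "peak2")],
)

# Write the track to a local file that can be loaded as a custom track.
with open("my_peaks.bed", "w") as handle:
    handle.write(track_text)
```

The file stays on your own machine, which is the whole point when the data cannot leave your firewall.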
In this overview video, I don’t go into more detail on how to use the browser–with your own mirror you are really using the same features that our regular training materials cover–the introduction to the browser and the advanced tools features are mostly the same.
Note: “GBiB is free for non-commercial use by non-profit organizations, academic institutions, and for personal use. Commercial use requires purchase of a license with setup fee and annual payment.” At OpenHelix we have a contract to do general training and outreach; we do not benefit from any license fees associated with the UCSC browser. Checking your status for licensing GBiB or the required tools is in your hands.
Haeussler M., B. J. Raney, A. S. Hinrichs, H. Clawson, A. S. Zweig, D. Karolchik, J. Casper, M. L. Speir, D. Haussler & W. J. Kent (2014). Navigating protected genomics data with UCSC Genome Browser in a Box, Bioinformatics, DOI: http://dx.doi.org/10.1093/bioinformatics/btu712
Yeah, I know, it’s not genomics–but it’s the history of life on this planet–right? The Paleobiology Database has been keeping records of this ancient biology for a while now, and they have some really nice tools to explore the fossil records and resources that have become available. It’s also interesting to me to see the informatics needs of this type of project. It has a lot of overlap with databases of more recent biology, like the GOLD one–they need taxonomy for the organisms, they need literature links–but they have other needs to capture both geographical regions and the layers of time as well.
There are a couple of ways to access the data. When you arrive at the main landing page, you have the choice to “Launch PBDB”, or “Launch Navigator”. PBDB is a “classic” interface, with typical search boxes and query results. Since this is the internet, I used that “quick search” and looked for paleo cats, and found a lot of Felis in there. But that’s not the only way to look around. They have a newer graphical access mechanism that’s called the Navigator. You can use the navigator to search the world, filter for specific items or time periods–but my favorite thing is you can reset the planet to be what it looked like eons ago. This is covered in their intro video that is this week’s Tip of the Week:
They have other videos as well, and you can see that they have both this Navigator interface and help with the classic style. Their “apps” offer other types of searches too. You can even search for insect size. Another way to access information is via R. I began to look around at this because David Bapst on Google+ pointed to their new publication announcement (linked below), offering their R package for accessing their underlying data.
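The R package is the route described in their paper. Purely as an illustration of what programmatic access can look like, here is a Python sketch that builds a request URL for occurrence records. The base URL, the `occs/list.json` path, and the `base_name` and `show` parameters are my assumptions about the public PBDB data service, not something taken from the paper, so check the current API documentation before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the PBDB data service; verify against the current docs.
PBDB_BASE = "https://paleobiodb.org/data1.2"

def occurrences_url(taxon, extra=None):
    """Build a URL asking for fossil occurrences of a taxon and its subtaxa."""
    params = {"base_name": taxon}
    if extra:
        params.update(extra)
    return f"{PBDB_BASE}/occs/list.json?{urlencode(params)}"

def fetch_occurrences(taxon):
    """Download and parse the JSON response (needs network access)."""
    with urlopen(occurrences_url(taxon), timeout=30) as response:
        return json.load(response)

# Building the URL does not touch the network, so it is safe to inspect first.
print(occurrences_url("Felis", {"show": "coords"}))
```

Printing the URL before fetching lets you paste it into a browser and confirm that the service answers the way you expect.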
According to their publications page, this resource supports a wide range (and copious amount) of research in this field. It was really neat to have a look at a rather different scale of bioinformatics across the time horizon. Check out the Paleobiology Database resources for your fossil needs.
Reference: Varela S., González-Hernández J., Sgarbi L., Marshall C., Uhen M., Peters S. & McClennen M. (2014). paleobioDB: an R package for downloading, visualizing and processing data from the Paleobiology Database, Ecography, DOI: 10.1111/ecog.01154
Always on the lookout for effective visualization tools, I recently came across a series of videos about the SeqMonk software. It’s not software that I had used before, so I wanted to look at the videos, and then try it out. It downloaded quickly, offered me an extensive list of genomes to load up, and then right away I was kicking the tires. And I was impressed. It was easy to locate and explore different regions and the different tracks that were available. And it appears to be very straightforward to load up your own data as well. The video I’ll highlight here is called “Creating Custom Genomes with SeqMonk” which gives a nice intro to their setup.
A lot of folks have found SeqMonk useful. But it took me 3 different site searches to figure out how useful. I searched at PubMed, PubMedCentral, and Google Scholar. The results were pretty interesting, actually. Just a basic search for SeqMonk yields these differences:
The paper in PubMed wasn’t in PubMedCentral, but it was among the 100+ in Google Scholar. Of the 53 in PMC, 2 were absent from Scholar–one had SeqMonk in a figure legend, one had SeqMonk in supplemental procedures. Google Scholar obviously had the biggest range–it also included meeting abstracts, theses, and patent documents, and also a few false positives (from 1840?, 1929, and a couple of other things I couldn’t figure out). Oddly, sometimes the titles differed between PMC and Scholar, but they appeared to be the same paper. As I’ve noted before, it’s challenging to find out where software is being used, since the way people reference it can be so variable. This was another interesting example of this variability.
Chatterjee A., P. A. Stockwell, E. J. Rodger & I. M. Morison (2012). Comparison of alignment software for genome-wide bisulphite sequence data, Nucleic Acids Research, 40 (10) e79-e79. DOI: http://dx.doi.org/10.1093/nar/gks150
The terrific folks at NCBI have been increasing their outreach with a series of webinars recently. I talked about one of them not too long ago, and I mentioned that when I found the whole webinar I’d highlight that. This recording is now available, and if you are interested in using these medical genetics resources, you should check this out.
I was reminded of this webinar by a detailed post over at the NCBI Insights blog: NCBI’s 3 Newest Medical Genetics Resources: GTR, MedGen & ClinVar. There’s no reason for me to repeat all of that–I’ll conserve the electrons and direct you over there for more details about the features of these various tools. There is a lot of information in these resources, and the webinar touches on these features and also describes the relationships and differences among them.
Acland A., R. Agarwala, T. Barrett, J. Beck, D. A. Benson, C. Bollin, E. Bolton, S. H. Bryant, K. Canese, D. M. Church, K. Clark et al. (2013). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI: http://dx.doi.org/10.1093/nar/gkt1146
Although I had other tips in my queue already, over the last week I’ve seen a lot of talk about the new Ebola virus portal from the UCSC Genome Browser team. And it struck me that researchers who have worked primarily on viral sequences may not be as familiar with the functions of the UCSC tools. So I wanted to do a tip with an overview for new folks who may not have used the browser much before.
There is great urgency in understanding the Ebola virus, examining different isolates, and developing possible interventions to help tackle this killer. Jim Kent was made aware of the CDC’s concerns from his sister–who edits the CDC’s “Morbidity and Mortality Weekly Report”, according to this story:
“It wasn’t until talking to Charlotte that I realized this one was special,” Jim Kent said. “It had broken out of the containments that had worked previously, and really, if a good response wasn’t made, the entire developing world was at risk.”
Jim Kent redirected his team of 30 genome analysts to devote all resources toward developing the Ebola genome. They worked through the night for a week to develop a map for other scientists to determine where on the virus to target treatment.
So the folks at UCSC have created a portal where you can explore the sequence information and variations among different isolated strains, annotations about the features of the genes and proteins, and they even added a track for the Immune Epitope Database (IEDB, which happened to be a video tip not long ago)–where antibodies have been shown to bind Ebola protein sequences. The portal also provides links to publications and further research related to these efforts.
The reference sequence that forms the framework for the browser is a sample from Sierra Leone: http://www.ncbi.nlm.nih.gov/nuccore/KM034562.1 It was isolated from a patient this past May, and I don’t see a publication attached to it–the submission is from the Broad’s Viral Hemorrhagic Fever Consortium. You can read more details, and the thanks to the Pardis Sabeti lab for the sequence, in the announcement email. So, as we keep seeing, we need to have access to the data long before publications become available. The work happens in the databases now; we can’t wait for traditional publishing.
In a side note, I also just learned that the NLM (National Library of Medicine) has a disaster response function, and they have a special Ebola section now because of the needs: Ebola Outbreak 2014: Information Resources. And for more of Jim Kent’s thoughts on Ebola, check out the blog that the UCSC folks have just started: 2014 Ebola Epidemic.
The goal of this tip was to provide an overview of the layout and features for folks who might be new to the UCSC software ecosystem. If you already know how to use it, it won’t be new to you. But if you are interested in getting the most out of the UCSC tools, you can also explore our longer training videos. UCSC has sponsored us to provide free online training materials on the existing tools, and this portal is based on the same underlying software. So you can go further, including using the Table Browser for queries beyond just browsing, if you learn the basics that we cover in the longer suites.
Karolchik D., G. P. Barber, J. Casper, H. Clawson, M. S. Cline, M. Diekhans, T. R. Dreszer, P. A. Fujita, L. Guruvadoo, M. Haeussler, R. A. Harte et al. (2013). The UCSC Genome Browser database: 2014 update, Nucleic Acids Research, 42 (D1) D764-D770. DOI: http://dx.doi.org/10.1093/nar/gkt1168
Gire S.K., A. Goba, K. G. Andersen, R. S. G. Sealfon, D. J. Park, L. Kanneh, S. Jalloh, M. Momoh, M. Fullah, G. Dudas, S. Wohl et al. (2014). Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak, Science, 345 (6202) 1369-1372. DOI: http://dx.doi.org/10.1126/science.1259657
Edit Nov 7, as publication of the browser paper was announced: Haeussler M, Karolchik D, Clawson H, Raney BJ, Rosenbloom KR, Fujita PA, Hinrichs AS, Speir ML, Eisenhart C, Zweig AS, Haussler D & Kent WJ (2014). The UCSC Ebola Genome Portal, PLOS Currents Outbreaks, 1. DOI: 10.1371/currents.outbreaks.386ab0964ab4d6c8cb550bfb6071d822
This week’s tip of the week highlights the MEGA tools–MEGA is a collection of tools that perform Molecular Evolutionary Genetics Analysis. MEGA tools are not new–they’ve been developed and supported over many years. In fact, on their landing page you can see the first reference to MEGA was in 1994. How much computing were you doing in 1994, and what kind of computer was that on?
As they describe their tools on their homepage–here’s a summary:
MEGA is an integrated tool for conducting sequence alignment, inferring phylogenetic trees, estimating divergence times, mining online databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.
But you can see they’ve progressed regularly and deeply since 1994, continuing to add new features and tools, and the current version is MEGA6. Although we usually focus on web-based interfaces, there are some tools that run on a desktop installation instead. So you will have to download and install MEGA to try it out, but the number of things you can do with it makes it worth your time.
The first video illustrates a file conversion and preparation–getting your data into the right format for MEGA. I won’t embed that here, but when you are ready to kick the tires yourself you should have a look. I’ll jump right to the second video, which includes a bit more action about the things you can do with MEGA. This covers generating a neighbor-joining tree, and several subsequent options for modifying and saving it.
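If you are curious what the neighbor-joining step is doing under the hood, here is a bare-bones sketch of the textbook algorithm working from a distance matrix. This is not MEGA’s implementation. It skips everything MEGA adds (substitution models, bootstrapping, tree drawing), and the taxon names and distances are a made-up toy example. Internal nodes get hypothetical labels like `node0`.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return (node, neighbor, branch_length) edges of an unrooted NJ tree.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all of the others.
        total = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) * d(i, j) - total(i) - total(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node.
        len_i = dij / 2 + (total[i] - total[j]) / (2 * (n - 2))
        len_j = dij - len_i
        new = f"node{next_id}"
        next_id += 1
        edges.append((i, new, len_i))
        edges.append((j, new, len_j))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Toy pairwise distances among five hypothetical taxa.
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
distances = {frozenset(pair): value for pair, value in raw.items()}

for node, neighbor, length in neighbor_joining(["a", "b", "c", "d", "e"], distances):
    print(f"{node} -- {neighbor}: {length}")
```

When two pairs tie on the joining criterion, this sketch simply takes the first one it meets, so the labels of internal nodes can differ from what another program reports even when the tree is the same.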
But this is just one aspect of what you can do with the MEGA tools. Be sure to explore the range of things you can do. Their documentation contains a section aimed at the “first time user” that can help you to understand various options you have. They also have sample data for you to try out the tools.
References: Tamura K., N. Peterson, G. Stecher, M. Nei & S. Kumar (2011). MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods, Molecular Biology and Evolution, 28 (10) 2731-2739. DOI: http://dx.doi.org/10.1093/molbev/msr121
Tamura K., G. Stecher, D. Peterson, A. Filipski & S. Kumar (2013). MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0, Molecular Biology and Evolution, 30 (12) 2725-2729. DOI: http://dx.doi.org/10.1093/molbev/mst197
The Caleydo team and the tools they develop have been on my short list of favorites for a long time. I’ve been talking about their clever visualizations for years now. My first post on their work was in 2010, with the tip I did on their Caleydo tool that combined gene expression and pathway visualization. They’ve continued to refine their visualizations, and enable new data types to be brought into the analysis, and earlier this year we featured Entourage, enRoute, LineUp, and also StratomeX. They have lots of options for wrangling “big data”. But recently they published a paper on StratomeX and a nice video overview, so I wanted to bring it to your attention again now that the paper is out.
The emphasis in this paper is cancer subtype analysis, using some data from The Cancer Genome Atlas (TCGA). But it’s certainly not limited to cancer analysis–any research area that’s currently flooded with multiple types of data and outcomes could be run through this stratification and visualization software. I find the weighting of the lines and connections among the subsets to be really effective when thinking about relationships among the data types. The recent schizophrenia work that used that sort of stratification and clustering to suss out the relationships among different sub-types is the kind of thing that’s going to be really useful (but I don’t know what software they used, because paywall…). And I expect that strategy to become increasingly important for a lot of conditions.
So have a look at this new paper (below), and their well-crafted video with examples.
If you are going to start working with StratomeX, be sure to also see their documentation pages. There are some features and options there that aren’t covered in the intro video and that you’ll want to know about.
The team is a cross-institutional and international bunch: this is a joint project between a lab at Harvard, led by Hanspeter Pfister, Peter Park’s lab at the Center for Biomedical Informatics at Harvard Medical School, and collaborators at Johannes Kepler University in Linz and the Graz University of Technology (both in Austria). And look for upcoming tools from them as well–there’s new stuff over at their site. They keep developing useful items, and I expect to be highlighting those in future Tips of the Week.
Marc Streit, Alexander Lex, Samuel Gratzl, Christian Partl, Dieter Schmalstieg, Hanspeter Pfister, Peter J Park & Nils Gehlenborg (2014). Guided visual exploration of genomic stratifications in cancer, Nature Methods, 11 (9) 884-885. DOI: http://dx.doi.org/10.1038/nmeth.3088