Video Tip of the Week: Biodalliance browser with HiSeq X-Ten data

Drama surrounding the $1000 genome erupts every so often, and earlier this year when the HiSeq X Ten setup was unveiled there was a lot of chatter–and questions: Is the $1,000 genome for real? And some push-back on the cost analysis: That “$1000 genome” is going to cost you $72M. A piece that offers a nice framework for the field of play is here: Welcome to the $1,000 genome: Mick Watson on Illumina and next-gen sequencing. Aside from the media flurry, though, what matters is the data. And not many people have had access to the data yet.

Via Gholson Lyon, I heard about access to some:

A set of collaborators (The Garvan Institute of Medical Research, DNAnexus and AllSeq) have provided a test data set from the X Ten. I’ll let them describe this effort:

Take advantage of this unique opportunity to explore X Ten data.

The Garvan Institute of Medical Research, DNAnexus and AllSeq have teamed up to offer the genomics community open access to the first publicly available test data sets generated using Illumina’s HiSeq X Ten, an extremely powerful sequencing platform.  Our goal is to provide sample data that will allow you to gain a deeper understanding of what this technological advancement means for your work today and in the future.

My focus won’t be this data itself–but if you are interested in many of the technical aspects of this system and their process, have a listen to this informative presentation by Warren Kaplan from Garvan:

The sample data is derived from a cell line, the GM12878 cells. These cells are from the Coriell Repository here: Catalog ID: GM12878. Conveniently, this is one of the Tier 1 cell lines from the ENCODE project too, so there is other public data out there on this cell line–which I have explored in the past and knew some things about.

There are 2 different data sets of the sequence in the download files, and one of them is available in the browser to view. I’m sure the Genoscenti will be all over the downloadable files. But because I’m always interested in new visualizations, I wanted to explore the genome browser they made available. Although I had heard of Biodalliance before, we hadn’t highlighted it as a tip, so I thought that would be interesting to explore. Biodalliance is a flexible, embeddable, extensible system that’s worth a look on its own, besides delivering this test data. And if you come by at a later date and the X Ten data is no longer available, go over to their site for nice sample data sets. Their “getting started” page has a nice intro to the features.

In the video, I’ll just take a quick test drive around some of the visualization features with the X-Ten GM12878 data. I’ll look at a couple of sample regions, one with the SOD1 gene just to illustrate the search and the tracks. And I’ll look at a region that I knew from the previous ENCODE CNV data had a homozygous deletion to see how that looked in this data set. (If you want to look for deletions later, search for the genes OR2T10 or UGT2B17).
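If you download the files and want to hunt for candidate homozygous deletions yourself, the basic signal is a stretch with no read coverage at all. Here is a minimal sketch of that idea, assuming you already have per-position depth values in a list; the function name, the `min_len` cutoff, and the toy numbers are all made up for illustration and are not part of the X Ten data release or the browser.

```python
from typing import List, Tuple


def zero_coverage_runs(depth: List[int], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where depth stays at 0
    for at least `min_len` consecutive positions."""
    runs = []
    start = None  # index where the current zero run began, if any
    for i, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the list.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs


# Hypothetical depth values; a single zero is ignored as too short.
toy_depth = [31, 29, 30, 0, 0, 0, 0, 28, 0, 33, 30]
print(zero_coverage_runs(toy_depth))  # [(3, 7)]
```

Real coverage data is noisier than this (mapping artifacts, low-complexity regions), so treat a zero run as a place to look in the browser rather than as a call.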

Note: the data is time-sensitive–apparently it’s only available until September 30, 2014. So get it while it’s hot, or browse around now.

Quick Links:

Test data site:

Biodalliance browser software details:


Public service announcement: NIH #GSPfuture meeting livestream [over]

There’s a workshop running today and tomorrow, called:

Future Opportunities for Genome Sequencing and Beyond:
A Planning Workshop for the National Human Genome Research Institute

July 28-29, 2014

It’s live streaming here:

I’m sure the recordings will be available later, though, if you come across this at a later date.

Edit after the sessions were done: I really enjoyed this. Having all these wicked smaht folks discussing ways to get to the future was really useful. I’ll post an additional note when I see the videos are up.


Video Tip of the Week: Nowomics, set up alert feeds for new data

Yeah, I know you know. There’s a lot of genomics and proteomics data coming out every day–some of it via the traditional publication route, some of it not–and it’s only getting harder to wrangle the useful information and pull the signal from the noise.  I can remember when merely looking through the (er, paper-based) table of contents of Cell and Nature would get me up to speed for a week. But increasingly, the data I need isn’t even coming through the papers.

Like everyone else, I have a variety of strategies to keep notified of different things I need to see. I use the MyNCBI stored searches to keep me posted on things that come via the NCBI system. I signed up for the new OMIM “MIM-Match” service as well. But there’s still a lot of room for new ways to collect and filter new data and information. Today’s tip focuses on a service to do that: Nowomics. This is a freely available tool to help you keep track of important new data. Here’s a quick video overview of how to see what’s going on with Nowomics.

The goal of Nowomics is to offer you an actively updated feed of relevant information on genes or topics of interest, using text mining and ontology term harvesting from a range of sources. What makes them different from MyNCBI or OMIM is the range and types of data sources they use. The user sets up some genes or Gene Ontology terms to “follow”, and the software regularly checks for changes in the source sites. You can go in and look at your feed, you can filter it for different types of data, and you can see what’s new (“latest”) or what’s being hotly chattered about (“popular”) using Altmetric strategies. For example, here’s a paper that people seemed to find worth talking about, based on the tweets and the Mendeley occurrences.

This tool is in early stages of development–if there are features you’d like to see or other sources you’d think are useful, the Nowomics team is eager for feedback. You can find a link to contact them over at their site, or locate them on Facebook and Twitter. You can also learn more from their blog, and more about the philosophy and foundations of Nowomics from their slide presentation below.


Quick links:


Example gene feed:


New tools at Reactome–check ‘em out

Just got this from the Reactome announcement mailing list:

Pathway databases, like Reactome, are uniquely suited for interpreting the results of high-throughput functional genomics data sets such as microarray-based expression profiles, protein interaction sets, and chromatin IP. In response to user feedback and new feature requests, we have released a new Reactome Pathway Browser with an integrated suite of tools for pathway analysis. Using these improved features, you can map protein lists to Reactome pathways, perform pathway overrepresentation analysis for a set of genes, colourize pathway diagrams with gene expression data, and compare model organism and human pathways. To support third-party tool integration, the Reactome Pathway Analysis Portal is also available via RESTful web services. Further details about the new pathway analysis tool can be found in our User Guide.…[see more details and contact info at the mailing list page]

Mapping gene and protein lists to pathways is a frequently-requested feature in pretty much every workshop we give–so have a look and see if it would help you to manage lists and do some discovery on them.
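If “pathway overrepresentation analysis” is new to you, the question it asks is: does my gene list hit this pathway more often than a random list of the same size would? One common way to put a number on that is a hypergeometric tail probability. The sketch below shows that calculation only; I am not claiming it is the exact statistic Reactome uses (check their User Guide for that), and the function name and the gene counts are invented for illustration.

```python
from math import comb


def hypergeom_pvalue(universe: int, pathway: int, sample: int, hits: int) -> float:
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `universe` genes, of which `pathway` are annotated to the pathway."""
    upper = min(pathway, sample)  # cannot hit more pathway genes than exist or were drawn
    total = comb(universe, sample)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up numbers: 20,000 genes overall, 100 in the pathway,
# a list of 200 genes of which 8 fall in the pathway.
p = hypergeom_pvalue(universe=20000, pathway=100, sample=200, hits=8)
print(f"p = {p:.3g}")
```

With many pathways tested at once you would also want a multiple-testing correction, which this sketch leaves out.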

Quick link:

Video Tip of the Week: New UCSC “stacked” wiggle track view

This week’s video tip shows you a new way to look at the multiWig track data at the UCSC Genome Browser. A new option has recently been released (see 06 May 2014), a “stacked” view, and it’s a handy way to look at the data with a new strategy. But I’ll admit it took me a little while of working with it to understand the details. So in this tip I hope you’ll see what the new visualization offers.

I won’t go into the background on the many types of annotation tracks available–if you need to be introduced to the idea of the basic track views, start out with our introduction tutorial that touches on the different types of graphical representations. Custom tracks are touched on in the advanced tutorial. For guidance specifically on how to create the different track types, see the UCSC documentation. The type of track I’m illustrating in the video today, a MultiWig track, has its own section over there too. Basically, if you are completely new to this, the “wiggle” style is a way to show a histogram display across a region. MultiWig lets you overlay several of these histograms in one space. In the example I’ll show here, the results of looking at 7 different cell lines are shown for some histone mark signals (Layered H3K27Ac track).

Annotation track cell lines

When I saw the announcement, I thought this was a good way to show all of the data simultaneously. When we do basic workshops, we don’t always have time to go into the details of this view, although we do explore it in the ENCODE material, because the track I’m using is one of the ENCODE data sets. I’ll use the same track in the same region as the announcement, which is shown here:

But when I first looked at this, I wasn’t sure if the peak–focus on the pink peak that represents the NHLF cell line–was meant to cover the whole area underneath or not. What I was trying to figure out is essentially this (a graphical representation of my thought process follows):


By trying out the various styles I was pretty sure I had the idea of what was really being shown, but I confirmed that with one of the track developers. The value is only the pink band segment, not the whole area below it. And Matthew also noted to me that they are sorting the tracks in reverse alphabetical order (so NHLF is the highest in the stack). That was an aspect I hadn’t realized yet. They are not sorting based on the values at that spot. This makes sense, of course, but it wasn’t obvious to me at first.
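To make that concrete, here is a small sketch of how a stacked display works at a single position: each track’s band starts where the one below it ends, so the height of the top edge is a running total, and the track’s own value is only the thickness of its band. This is my own toy illustration, not UCSC’s drawing code; the function name and signal values are invented, and I’m only using three cell line names to keep it short.

```python
from typing import Dict, List, Tuple


def stacked_bands(signal: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (name, bottom, top) for each track at one position,
    listed from the top of the stack down."""
    bands = []
    baseline = 0.0
    # Build bottom-up in alphabetical order, so that reading the stack
    # top-down gives reverse alphabetical order (NHLF ends up highest).
    for name in sorted(signal):
        top = baseline + signal[name]
        bands.append((name, baseline, top))
        baseline = top
    return bands[::-1]


# Hypothetical signal values at one position.
toy = {"GM12878": 2.0, "K562": 1.0, "NHLF": 4.0}
for name, bottom, top in stacked_bands(toy):
    print(f"{name}: drawn from {bottom} to {top}, value = {top - bottom}")
```

In this toy case the NHLF band reaches 7.0 on the axis, but its value is 4.0, the thickness of the band itself. The order comes from the names, not from which signal is largest.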

I like this option very much–but I figured if I had to do some noodling on what it actually meant others might have the same questions.

In the video I’ll show you how this segment looks with the different “Overlay method” settings on that track page. I’ll be looking at the SOD1 area, like the announcement example.  I tweaked a couple of the other settings from the defaults so it would be easier to see on the video (see arrowheads for my changes). But I hope this conveys the options you have now to look at this type of track data effectively.

So here is the video with the SOD1 5′ region in the center, using the 4 different choices of overlay method, illustrating the histone mark data in the 7 cell lines. I’m not going into the details of the data here, but I’ll point you to a reference associated with this work for more on how it’s done–see the Bernstein lab paper below.  I wanted to just demonstrate the new viewing options that will be available on wiggle tracks. Some tracks will have too much data for one type or another, or will be clearer with one or another style. But now you have an additional way to consider it.

Quick links:

UCSC Genome Browser:

UCSC Intro tutorial:

UCSC Advanced tutorial:

These tutorials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.


BioMart news, and a shiny new look

Just got the news via the mailing list, I haven’t had a chance to kick the tires yet:

We are pleased to announce the release of BioMart version 0.9.

The latest version of BioMart includes support for data analysis and visualisation tools. The first of the BioMart tools has already been implemented and is accessible from This tool enables enrichment analysis of genes in all Ensembl species and a broad range of gene identifiers for each species are also available. Furthermore, the tool supports cross-species analysis using Ensembl homology data. Finally, the enrichment tool facilitates analysis of BED files containing genomic features such as Copy Number Variations (CNVs) or Differentially Methylated Regions (DMRs).

The latest BioMart release comes with the new version of the REST and SOAP APIs. These APIs are available for testing at Third party developers who are currently using REST or SOAP version 0.7 are encouraged to start testing and transitioning to 0.9. The two servers providing access to BioMart data through REST and SOAP (version 0.7 and version 0.9) will be running in parallel to provide support for easy transition. The Enrichment tool is also accessible programmatically through 0.9 REST/SOAP interface.

Finally, the BioMart website has been completely redesigned to cater for a better user experience. The re-organised layout, incorporation of new functionality, such as the “quick tool access” and the use of subtle animation makes for clearer navigation and greater site interactivity.

Your feedback is welcome and appreciated.

On behalf of the BioMart developers


Check it out:

Heartbleed security issues, we’re ok

We’ve been tracking the concerns about the Heartbleed security issues, as has everyone with an internet login anywhere. And the actual depth of the issue continues to be discussed and disputed. Also XKCD:

Most people who read our blog, or access the free materials, haven’t had to register anyway so there wasn’t an issue with those. But we have checked with our development team to see if we were affected by the security flaw for our registered users.

We are told that we are unaffected by this vulnerability on our registration-accessible pages. So although it is always wise to change passwords from time to time, we won’t be requiring that for our registered users. Feel free to do so if you want to, though. Let us know if you have any problems with that.

A fix was implemented for the Google Wallet checkout feature that some people might have used, and it’s already in place.

Safe travels around the ‘tubz.

New UCSC Genome Browser for the newest human genome assembly

Most folks who read this blog will be aware that a new human genome assembly has been completed, released, and is available for anyone to obtain. One of my favorite overviews of that new version can be found in this readable piece at Bio-IT World: Deanna Church on the Reference Genome Past, Present and Future. That should give you an idea of some of the context and the changes that you might encounter when you begin to work with the new version.

The folks who use genome assemblies in their software will be updating over time. It can take a while for all of the features you want to be mapped to the new assembly, and this will vary by project. At the end of last week, though, we were notified on the UCSC Announcement mailing list that there is a preliminary browser available with the hg38 assembly. Here’s a quick look at that, with a couple of the key features highlighted:


Note that calling it hg38 is a big change–we had been on hg19–but now to coordinate with the system of the GRC (Genome Reference Consortium) those numbers will match. And as this is a preliminary browser, you’ll see that there aren’t many annotation tracks available yet. For many things you’ll still want to use the hg19 assembly. The annotation tracks you need will be added as soon as possible. As the announcement notes:

There’s much more to come! This initial release of the hg38 Genome Browser provides a rudimentary set of annotations. Many of our annotations rely on data sets from external contributors (such as our popular SNPs tracks) or require massive computational effort (our comparative genomics tracks). In the upcoming months/years, we will release many more annotation tracks as they become available. To stay abreast of new datasets, join our genome-announce mailing list or follow us on twitter [@GenomeBrowser].

There are a number of other important changes too, which aren’t obvious from the interface. You should have a look at the full announcement email text to understand the impacts. There are aspects of not only the naming convention, but alternate sequences, centromere representation, mitochondrial genome sequence, sequence updates to fix previous erroneous bases and misassembled regions, and other aspects that could affect your work. Then go kick the tires!
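On the naming point, if you juggle both UCSC and GRC names in your own scripts, a tiny lookup makes the change easy to see: the previous pair used different numbers, and the new pair lines up. This is just an illustrative sketch; the dictionary and helper names are mine, and it only covers the two human assemblies discussed in this post.

```python
import re

# UCSC browser name -> GRC assembly name, for the two assemblies discussed here.
UCSC_TO_GRC = {"hg19": "GRCh37", "hg38": "GRCh38"}


def version_number(name: str) -> int:
    """Pull the trailing digits out of an assembly name, e.g. 'hg19' -> 19."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no version number in {name!r}")
    return int(match.group(1))


for ucsc, grc in UCSC_TO_GRC.items():
    same = version_number(ucsc) == version_number(grc)
    print(f"{ucsc} <-> {grc}: numbers {'match' if same else 'differ'}")
```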

You may also want to have a look at the publication in the NAR Database issue that describes other features that may have been updated since the last time you were diving into a new assembly. There are more species–alligators?!–and more types of tracks than you might be aware of if you just rely on the same stuff most of the time. There’s also the cool hub tools now that provide new ways to load up your own project data. Go forth and discover.


“We BLATted the Internet!”

Best sentence I’ve seen today. Heh.


I’d be interested in the answer to Laura’s question too!

Here’s more detail from the “announcement” mailing list:

All the DNA on the internet now at your fingertips!

Hello everyone!

We’re pleased to announce the release of the Web Sequences track on the UCSC Genome Browser. This track, produced in collaboration with Microsoft Research, contains the results of a 30-day scan for DNA sequences from over 40 billion different webpages. The sequences were then mapped with Blat to the human genome (hg19) and numerous other species including mouse (mm9), rat (rn4), and zebrafish (danRer7). The data were extracted from a variety of sources including patents, online textbooks, help forums, and any other webpages that contain DNA sequence. In essence, this track displays the Blat alignments of nearly every DNA sequence on the internet! The Web Sequences track description page contains more details on how the track was generated.

We would like to acknowledge Max Haeussler and Matt Speir from the UCSC Genome Browser staff and Bob Davidson from Microsoft Research for their hard work in creating this track.

Matthew Speir
UCSC Genome Bioinformatics Group

If you are looking for the track, it’s in the Phenotype and Literature section in human:
I took a quick look and it’s definitely a mixed bag–patents and homework sites, and journals and such. But I think it will be interesting to see what turns up.
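If you are wondering how you would even begin to pull DNA out of web page text, the simplest possible starting point is to look for long unbroken runs of A, C, G and T. The sketch below does only that; it is my own toy version, not the pipeline UCSC and Microsoft Research used (see the track description page for the real method), and the function name and the 20-base minimum are arbitrary choices for illustration.

```python
import re
from typing import List


def find_dna_like(text: str, min_len: int = 20) -> List[str]:
    """Return uppercase copies of every run of at least `min_len`
    consecutive A/C/G/T characters found in `text`."""
    pattern = r"[ACGT]{%d,}" % min_len
    return [hit.upper() for hit in re.findall(pattern, text, flags=re.IGNORECASE)]


# Hypothetical page text: one primer-like run, plus a short word ("cat")
# that is made of DNA letters but is far too short to count.
page = "Forward primer: " + "acgt" * 6 + " (see the cat gene)"
print(find_dna_like(page))
```

A real scan would also have to cope with sequences broken up by spaces, line numbers, and markup, which is presumably part of why the results are such a mixed bag.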

Edit: some other finds–lots of non-English pages, so I can’t tell what they are. I have seen Japanese, Chinese, and Korean so far. Saw a link to (heh). Slideshare. Some pages are borked and don’t load. Some require logins (medscape). Could be a good source of PDFs that you can’t get elsewhere (*cough*).