Most folks who read this blog will be aware that a new human genome assembly has been completed, released, and is available for anyone to obtain. One of my favorite overviews of that new version can be found in this readable piece at Bio-IT World: Deanna Church on the Reference Genome Past, Present and Future. That should give you an idea of some of the context and the changes that you might encounter when you begin to work with the new version.
The folks who use genome assemblies in their software will be updating over time. It can take a while for all of the features you want to be mapped to the new assembly, and this will vary by project. At the end of last week, though, we were notified on the UCSC Announcement mailing list that there is a preliminary browser available with the hg38 assembly. Here’s a quick look at that, with a couple of the key features highlighted:
Note that calling it hg38 is a big change. We had been on hg19, but UCSC has jumped ahead so that its number now matches the GRC (Genome Reference Consortium) name for the same assembly, GRCh38. And as this is a preliminary browser, you’ll see that there aren’t many annotation tracks available yet. For many things you’ll still want to use the hg19 assembly. The annotation tracks you need will be added as soon as possible. As the announcement notes:
There’s much more to come! This initial release of the hg38 Genome Browser provides a rudimentary set of annotations. Many of our annotations rely on data sets from external contributors (such as our popular SNPs tracks) or require massive computational effort (our comparative genomics tracks). In the upcoming months/years, we will release many more annotation tracks as they become available. To stay abreast of new datasets, join our genome-announce mailing list or follow us on twitter [@GenomeBrowser].
There are a number of other important changes too, which aren’t obvious from the interface. You should have a look at the full announcement email text to understand the impacts. Beyond the naming convention, the changes cover alternate sequences, centromere representation, the mitochondrial genome sequence, and sequence updates that fix previously erroneous bases and misassembled regions. Any of these could affect your work. Then go kick the tires!
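If you have scripts that key on sequence names, the naming change is the one most likely to bite. Here is a minimal Python sketch of how you might sort hg38-style names into categories. The pattern is my assumption, based on names such as chr6_GL000250v2_alt and chrUn_KI270302v1; the announcement email is the authority on the actual convention, and the function name classify_sequence_name is made up for this example.

```python
import re

# Primary assembled chromosomes: chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"^chr(?:[0-9]{1,2}|X|Y|M)$")

# ASSUMED hg38-style scaffold names: chr<N>_<accession>v<version>_<alt|random>
# for sequences tied to a chromosome, chrUn_<accession>v<version> otherwise.
_SCAFFOLD = re.compile(
    r"^chr(?P<chrom>[0-9]{1,2}|X|Y|Un)"
    r"_(?P<accession>[A-Z]{2}[0-9]+)v(?P<version>[0-9]+)"
    r"(?:_(?P<suffix>alt|random))?$"
)


def classify_sequence_name(name):
    """Return a dict describing an hg38-style sequence name (hypothetical helper)."""
    if _PRIMARY.match(name):
        return {"kind": "primary", "chrom": name[3:]}

    match = _SCAFFOLD.match(name)
    if match is None:
        # Includes older hg19-style names such as chr6_apd_hap1.
        return {"kind": "unrecognized"}

    chrom = match.group("chrom")
    suffix = match.group("suffix")
    if chrom == "Un" and suffix is None:
        kind = "unplaced"
    elif chrom != "Un" and suffix is not None:
        kind = suffix  # "alt" or "random"
    else:
        return {"kind": "unrecognized"}

    # Rebuild the dotted accession.version form from the name.
    accession = f"{match.group('accession')}.{match.group('version')}"
    return {"kind": kind, "chrom": chrom, "accession": accession}


if __name__ == "__main__":
    for name in ["chr1", "chr6_GL000250v2_alt", "chrUn_KI270302v1", "chr6_apd_hap1"]:
        print(name, classify_sequence_name(name))
```

Anything that comes back as "unrecognized" is worth a closer look before you assume your hg19 pipeline will run unchanged on the new assembly.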
You may also want to have a look at the publication in the NAR Database issue that describes other features that may have been updated since the last time you were diving into a new assembly. There are more species–alligators?!–and more types of tracks than you might be aware of if you just rely on the same stuff most of the time. There’s also the cool hub tools now that provide new ways to load up your own project data. Go forth and discover.
Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M., et al. (2013). The UCSC Genome Browser database: 2014 update, Nucleic Acids Research, 42 (D1) D764-D770. DOI: 10.1093/nar/gkt1168