Category Archives: Genomics Resource News

Newly updated Quick Reference Card

Video Tip of the Week: UCSC Genome Browser in the Cloud (GBIC)

For all the years we’ve been out doing training on the UCSC Genome Browser tools, we’ve watched the evolution of researchers’ needs and the corresponding features of the UCSC Genome Browser site. At first, people just needed access to the public data. But then they needed ways to add their own data to the public data context and share the views. UCSC gave us custom tracks, and they gave us browser sessions. Woot!

Increasingly, the data sets got bigger and more complex and custom tracks couldn’t handle the volume. UCSC delivered track hubs. Woot!
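For readers who haven’t built one, a custom track is just a header line or two followed by data in a standard format such as BED. A minimal, hypothetical example (the track name, description, and coordinates are invented for illustration; paste something like this into the “add custom tracks” page):

```
browser position chr21:33031597-33041570
track name="myPeaks" description="Example regions of interest" visibility=dense
chr21	33031597	33041570	peak1
chr21	33042000	33043000	peak2
```

The `track` line controls how the browser labels and draws the data; everything after it is plain BED (chromosome, start, end, optional name).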

Some people told us they had patient data that they couldn’t load onto the UCSC site because of privacy and legal issues. Then UCSC delivered GBIB (Genome Browser in a Box): you could download a local copy of the browser and use your own data behind your firewall.

All of these strategies continue to help users combine their own data with the public data and visualize what they want to show. But now there’s another way: GBIC, Genome Browser in the Cloud. This week’s tip is the video the team created to help people understand what the GBIC can do. There’s additional information about its features in their announcement, via the mailing list. But just quickly, here’s the nut graf:

Until now, genomics research groups working with sensitive medical data were largely limited to using local Genome Browser installations to maintain confidentiality, complicating data-sharing among collaborators. Today, the Genome Browser group of the UC Santa Cruz Genomics Institute announced they have changed that by launching a new product, Genome Browser in the Cloud (GBiC). GBiC introduces new freedom to collaborate by allowing rapid Browser installation, in a UNIX-based cloud or UNIX-virtualized cloud.

And here you can have a look at how it works.

In addition, we’ve recently updated our popular Quick Reference Cards, and we added the note that the GBIC can be used to help people work with their own data. You can download those cards, or get some printed ones, from our website. These cards have had to keep evolving over the years to keep up with all the important features that UCSC adds regularly.

Try out the GBIC with your own data. The team is always looking for feedback on how it suits your needs, or on other things you might need. Help them evolve.

Disclosure: UCSC Genome Browser tutorials and materials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.

Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, Fischer CM, Gibson D, Navarro Gonzalez J, Guruvadoo L, Haeussler M, Heitner S, Hinrichs AS, Karolchik D, Lee BT, Lee CM, Nejad P, Raney BJ, Rosenbloom KR, Speir ML, Villarreal C, Vivian J, Zweig AS, Haussler D, Kuhn RM, and Kent WJ. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 2016 Nov 29. PMID: 27899642; PMC: PMC5210591.

UCSC Genome Bioinformatics

UCSC Genome Browser, default human genome changed

This has gone out over the announcement mailing list, and it’s also on their web site. But in case you aren’t checking those, it seemed important to make sure people see it.


14 September 2015 — Human Genome Browser default changed to GRCh38/hg38

In conjunction with the release of the new 100-species Conservation track on the hg38/GRCh38 human assembly, we have now changed the default human browser on our website from hg19 to hg38. This should not affect your current browsing sessions; if you were last looking at the hg19 (or older) browser, the Genome Browser will continue to display that assembly for you when you start it up. There are circumstances, however, in which the selected assembly can switch to the newer version. For instance, the assembly will switch to hg38 if you reset your browser defaults. If you find yourself in a situation where some of your favorite browser tracks have “disappeared”, you may want to check that you’re viewing the right assembly.

We will continue our efforts to expand the annotation track set on the hg38 browser to include many of the tracks present on previous human assemblies. In cases where it makes sense, data may be simply “lifted” from hg19 using migration tools. In many instances, however, we must rely on our data providers to generate new versions of their data on the latest assembly. We will publish these data sets as they become available.

For a summary of the new features in the GRCh38 assembly, see the overview we published in March 2014.


UCSC replaces UCSC Genes with GENCODE as default gene set

This is a big deal. And now I have to change my training materials. But I think it’s worthwhile. The GENCODE set is very extensive, and the range of annotated types captures important details.

This email came from the UCSC Genome Browser announcement mailing list. I’m pasting it in full for those who aren’t on this list, but you can also find it in the list archive:

[genome-announce] GENCODE Genes Now the Default Gene Set on the Human (GRCh38/hg38) Assembly

In a move towards standardizing on a common gene set within the bioinformatics community, UCSC has made the decision to adopt the GENCODE set of gene models as our default gene set on the human genome assembly. Today we have released the GENCODE v22 comprehensive gene set as our default gene set on human genome assembly GRCh38 (hg38), replacing the previous default UCSC Genes set generated by UCSC. To facilitate this transition, the new gene set employs the same familiar UCSC Genes schema, using nearly all the same table names and fields that have appeared in earlier versions of the UCSC set.

By default, the browser displays only the transcripts tagged as “basic” by the GENCODE Consortium. These may be found in the track labeled “GENCODE Basic” in the Genes and Gene Predictions track group. However, all the transcripts in the GENCODE comprehensive set are present in the tables, and may be viewed by adjusting the track configuration settings for the All GENCODE super-track. The most recent version of the UCSC-generated genes can still be accessed in the track “Old UCSC Genes”.

The new release has 195,178 total transcripts, compared with 104,178 in the previous version. The total number of canonical genes has increased from 48,424 to 49,534. Comparing the new gene set with the previous version:

  • 9,459 transcripts did not change.
  • 22,088 transcripts were not carried forward to the new version.
  • 43,681 transcripts are “compatible” with those in the previous set, meaning that the two transcripts show consistent splicing. In most cases, the old and new transcripts differ in the lengths of their UTRs.
  • 28,950 transcripts overlap with those in the previous set, but do not show consistent splicing (i.e., they contain overlapping introns with differing splice sites).
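As a sanity check on those numbers: the four categories above exactly account for the previous release. A quick back-of-the-envelope script (counts taken straight from the announcement):

```python
# Transcript counts from the GENCODE v22 announcement above.
unchanged = 9_459       # did not change
dropped = 22_088        # not carried forward to the new version
compatible = 43_681     # consistent splicing; UTR lengths may differ
overlapping = 28_950    # overlap, but inconsistent splicing

old_total = unchanged + dropped + compatible + overlapping
print(old_total)        # 104178, the previous version's stated total
```

So the comparison is a complete partition of the old 104,178-transcript set.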

More details about the new GENCODE Basic track can be found on the GENCODE Basic track description page.


Off we go. How to add excitement to my morning. I need more coffee still, though.

Phytozome notice, new and improved v10 coming soon [see update]

This announcement came out while I was at a conference last week, but I wanted to pass it along. This appears to be a big change in the way Phytozome works, and there will be down-time before it rolls out, starting May 1. I like to post major announcements from mailing lists because not everyone is signed up on every mailing list in bioinformatics the way I am. I can’t figure out how to link to their mailing list archive, so I’m posting the whole thing here.

There appears to be a quick-start guide for the new interface, and I’ll keep an eye out for the chance to do another Tip of the Week (previous tip).

Via the mailing list, from David Goodstein:

Subject: May 1st retirement of Phytozome v9

The last full day of support for v9 of Phytozome will be Friday, May 1st.  Over the subsequent weekend, v9 will be brought down and forwarding services will be put in place to ensure as many URLs as possible find the correct, or at least related, pages in Phytozome v10.

1. Why does v9 of Phytozome need to be retired?
The Phytozome v9 website is based on an older technology stack that is no longer supported by any developers on the Phytozome team.  Newer genome releases, and newer data sets (diversity and expression data), are also not hosted on v9.  In the interests of focusing our limited developer resources in the most effective way possible, and having a single location for access to Phytozome genomic data and analysis, we will have a single website going forward: Phytozome v10.

2. What happens to the genomic data contained in Phytozome v9?
The vast majority of v9 genomes and annotations are available at the Phytozome v10 website, often in updated form (one genome, B. rapa Chiifu, is not being carried forward).  Users can still find bulk data files containing all the genomes and annotations from Phytozome v9 at the JGI Genome Portal.

3.  I have bookmarks to various resources/genes/families at v9; what happens to those URLs?
-Links to the main site, help pages, release notes, and organism info pages will be automatically forwarded to the corresponding pages in v10.
-Gene pages:  forwarding scripts will attempt to determine the corresponding gene page in v10
-GBrowse pages:  GBrowse in v9 is replaced by JBrowse in v10.  We will attempt to forward URLs to the corresponding location in JBrowse if it exists; if not, the URL will be forwarded to the default location in the corresponding organism’s JBrowse.
-The following v9 URLs/pages will NOT forward to new locations in v10:
—Gene family pages
—Sequence Query results (BLAST results) and BioMart query results.  Note that these expire after 3 days and are therefore not archivable at the present time.
—Keyword Search Results pages

4. I have no idea how to use the new Phytozome v10 interface. Help!
There’s a Phytozome Quick Start Guide available on the v10 site, along with the release notes for Phytozome v10.

5.  I have further questions. What should I do?
Email the Phytozome development team.

Thanks for using Phytozome.



Goodstein D.M., Shu S., Howson R., Neupane R., Hayes R.D., Fazo J., Mitros T., Dirks W., Hellsten U., Putnam N. & Rokhsar D.S. (2011). Phytozome: a comparative platform for green plant genomics. Nucleic Acids Research, 40(D1), D1178–D1186.

UPDATE (from the Phytozome team): v10 is already available.

Statistics for Biologists

In a curious coincidence (not statistically significant), this week I had planned to highlight some useful statistical software as my Video Tip of the Week and the Answer post. To lure you back for those other pieces this week, I bring you a handy collection from Nature that was just announced:

Direct link over there in case the tweet breaks later: the announcement post is “Statistics for biologists – A free Nature Collection”.

The collection is here:

NCBI to hold two-day genomics hackathon in January

Because this came to my email on the Wednesday before the holiday, it seemed to me that some people who might like to attend could miss it. So I just wanted to boost the signal a bit by re-posting it. It came from the NCBI Announcement mailing list, if you want to see the whole thing; I’m excerpting just some of it here. It has an application piece, FYI.

From January 5th to 7th, NCBI will host a genomics hackathon focusing on advanced bioinformatics analysis of next generation sequencing data. This event is for students, postdocs and investigators already engaged in the use of pipelines for genomic analyses from next generation sequencing data. Working groups of 5-6 individuals will be formed for DNA-Seq/multiomics, RNA-Seq, metagenomics and Epigenomics. These groups will build pipelines to analyze large datasets within a cloud infrastructure.

After a basic organizational session, teams will spend 2.5 days analyzing a challenging set of scientific problems related to a group of datasets. Students will analyze and combine datasets in order to work on these problems. This course will take place on the NIH main campus in Bethesda, Maryland.

Datasets will come from the public repositories housed at NCBI. During the course, students will have an opportunity to include other datasets and tools for analysis. Please note, if you use your own data during the course, we ask that you submit it to a public database within six months of the end of the event.

All pipelines and other scripts, software and programs generated in this course will be added to a public GitHub repository designed for that purpose. A manuscript outlining the design of the hackathon and describing participant processes, products and scientific outcomes will be submitted to an appropriate journal.

To apply, complete the form linked below (approximately 10-15 minutes to complete). Applications are due December 1st by 5pm EST.

Participants will be selected from a pool of applicants; prior students will be given priority in the event of a tie. Accepted applicants will be notified on December 10th by 9am EST, and have until December 12th at noon to confirm their participation. Please include a monitored email address, in case there are follow-up questions.

[some stuff removed here, with requirements, pre-reqs, and some other details on the actual event stuff. See full version here.]

* Genomics hackathon application form:

Hack away.

Bioinformatics tools extracted from a typical mammalian genome project [supplement]

This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare:

Continue reading

Bioinformatics tools extracted from a typical mammalian genome project

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.


In the field of bioinformatics, there is a lot of discussion about data and code availability, and reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items that had been in use in a current project. Details of the named bioinformatics items were extracted from the publication, and location and information about the tools was then explored.

Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%). Most of them were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability varies greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are student thesis publications, some are not ever published and only a web site or software documentation manual serves to provide required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, either for open source tools or commercial products.

New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.


From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al, 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some missed by this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were not generally included unless the code had been made available. Pieces deemed as being done “in a manner similar to that already described” in some other reference were present, but I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.

What I put in the bucket included specific names of items or objects that would be likely to be necessary and/or unfamiliar to students or researchers outside of the bioinformatics community. Some are related, but different. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is: a conceptual difference, and a separate object in my designation system here. Some are sub-components of other tools, but important aspects to understand (GOTERM_BP_FAT at DAVID or randomBed from BEDTools), and are individual named items in the report, as these might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection. We may revisit the removal or inclusion of these in future iterations of the list.


After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.

How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers that I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features has been in the extensive supplemental materials.

The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.


As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
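To illustrate why that variety matters for text mining: since there is no standard markup for tool names, even a naive pass over paper or supplement text needs a curated dictionary of names to match against. A minimal sketch of that idea (the tool list and sample text here are hypothetical; a real run would use a much larger dictionary and handle aliases and case variants):

```python
import re

# Hypothetical dictionary of tool names to look for.
known_tools = ["RepeatMasker", "BEDTools", "BLAST", "Ensembl"]

# Made-up stand-in for methods/supplement text.
text = """Repeats were annotated with RepeatMasker. Intersections used
randomBed from BEDTools, and homology searches used BLAST against
Ensembl (version 70). RepeatMasker was run again on the assembly."""

# Count whole-word mentions of each known tool in the text.
counts = {
    tool: len(re.findall(r"\b" + re.escape(tool) + r"\b", text))
    for tool in known_tools
}
for tool, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(tool, n)   # RepeatMasker appears twice, the others once
```

Even this toy version shows the limits: it only finds names you already know about, which is exactly the problem with “in house Perl scripts” and unnamed custom code.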

Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc, has always been challenging. These are analogous to the “reagents” that I would have put in my benchwork biology papers years ago. They may matter to the outcome, such as enzyme vendors, mouse strain versions, or antibody species details. They constitute things you would need to reproduce or extend the work, or to appropriately understand the context. But in the case of bioinformatics, this can mean file formats such as the FASTQ or axt format from UCSC Genome Browser. They can mean repository resources like the SRA. They can be various different versioned downloaded data sets from ENSEMBL (version 67, 69, 70, or 73 here, but which were counted only once as ENSEMBL). It might be references to Reactome in a table.

With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation is included when possible. Finally, a column designates whether it was found in the main paper as well.

What is not indicated is that some are referenced multiple times in different contexts and usages, which might cause people to not realize how frequently these are used. For example, ironically, RepeatMasker was referenced so many times that at one point I stopped marking it up.

Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post, or access it at FigShare.



What can we learn about the source or use of these items?
Searches for the information about the source code, data sets, file types, repositories, and associated descriptive information about the items yield a variety of access. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). These may be paywalled in certain publications, or are described in unavailable meeting papers. Some do not have associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages; some are student dissertations. Some tools are found on project-specific pages. Some exist on code repositories, sometimes deprecated ones that may disappear. A number of them have moved from their initial publications, without forwarding addresses. Some are allusions to procedures in other publications. Some of them are like time travel right back to the 1990s, with pages that appear to be original for the time. Some may be at risk of disappearing completely the next time an update at a university web site changes site access.

Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.

When details of data processing or software implementations are provided, the amount can vary. Sometimes parameters are included, others not.

Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.

Easter eggs
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:

“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”

Even the idea of “outreach” and support for use of the tools is certainly unclear to the tool providers, apparently. Training? Yeah, not in any formal way.


The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and some tools that were new and interesting to me were provided. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing the provenance of them, though, is uneven and challenging, and this is not unique to this project—it’s a problem among the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages may disappear or become unusable. This project has provided an interesting snapshot of the state of play, and good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.

It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about: the World Wide Web. We had one of those funny little Macs with the teeny screens, and we found people were making texty web pages with banal fonts and odd colors, and talking about their research.

Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.

Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here), and looking for novel splice variants. I found them. Just by typing: no radioactivity and gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools).

Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.

At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?

But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, can you establish some idea of where to aim? So I used this paper to sort of analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?

It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.


Carbone L., Harris R.A., Gnerre S., Veeramah K.R., Lorente-Galdos B., Huddleston J., Meyer T.J., Herrero J., Roos C., Aken B., Anaclerio F., et al. (2014). Gibbon genome and the fast karyotype evolution of small apes. Nature, 513(7517), 195–201.

FigShare version of this post:

Video Tip of the Week: Biodalliance browser with HiSeq X-Ten data

Drama surrounding the $1000 genome erupts every so often, and earlier this year, when the HiSeq X Ten setup was unveiled, there was a lot of chatter, and questions: Is the $1,000 genome for real? And some push-back on the cost analysis: That “$1000 genome” is going to cost you $72M. A piece that offers a nice framework for the field of play is here: Welcome to the $1,000 genome: Mick Watson on Illumina and next-gen sequencing. Aside from the media flurry, though, what matters is the data. And not many people have had access to the data yet.

Via Gholson Lyon, I heard about access to some:

A set of collaborators (The Garvan Institute of Medical Research, DNAnexus and AllSeq) have provided a test data set from the X Ten. I’ll let them describe this effort:

Take advantage of this unique opportunity to explore X Ten data.

The Garvan Institute of Medical Research, DNAnexus and AllSeq have teamed up to offer the genomics community open access to the first publicly available test data sets generated using Illumina’s HiSeq X Ten, an extremely powerful sequencing platform.  Our goal is to provide sample data that will allow you to gain a deeper understanding of what this technological advancement means for your work today and in the future.

My focus won’t be this data itself–but if you are interested in many of the technical aspects of this system and their process, have a listen to this informative presentation by Warren Kaplan from Garvan:

The sample data is derived from a cell line, the GM12878 cells. These cells are from the Coriell Repository here: Catalog ID: GM12878. Conveniently, this is one of the Tier 1 cell lines from the ENCODE project too, so there is other public data out there on this cell line–which I have explored in the past and knew some things about.

There are two different data sets of the sequence in the download files, and one of them is available to view in the browser. I’m sure the Genoscenti will be all over the downloadable files. But because I’m always interested in new visualizations, I wanted to explore the genome browser they made available. Although I had heard of Biodalliance before, we hadn’t highlighted it as a tip, so I thought that would be interesting to explore. Biodalliance is a flexible, embeddable, extensible system that’s worth a look on its own, besides delivering this test data. And if you come by at a later date and the X Ten data is no longer available, go over to their site for nice sample data sets. Their “getting started” page has a nice intro to the features.

In the video, I’ll just take a quick test drive around some of the visualization features with the X-Ten GM12878 data. I’ll look at a couple of sample regions, one with the SOD1 gene just to illustrate the search and the tracks. And I’ll look at a region that I knew from the previous ENCODE CNV data had a homozygous deletion to see how that looked in this data set. (If you want to look for deletions later, search for the genes OR2T10 or UGT2B17).

Note: the data is time-sensitive; apparently it’s only available until September 30, 2014. So get it while it’s hot, or browse around now.

Quick Links:

Test data site:

Biodalliance browser software details:


Down T.A. & Hubbard T.J.P. (2011). Dalliance: interactive genome viewing on the web. Bioinformatics, 27(6), 889–890.

Check Hayden E. (2014). Is the $1,000 genome for real? Nature.

Dunham I., Aldred S.F., Collins P.J., Davis C.A., Doyle F., Epstein C.B., Frietze S., Harrow J., Kaul R., Khatun J., et al. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74.

Garvan NA12878 HiSeqX datasets by The Garvan Institute of Medical Research, DNAnexus and AllSeq is licensed under a Creative Commons Attribution 4.0 International License