UCSC Genome Bioinformatics

UCSC replaces UCSC Genes with GENCODE as default gene set

This is a big deal. And now I have to change my training materials. But I think it’s worthwhile. The GENCODE set is very extensive and the range of annotated types captures important details.

This email came from the UCSC Genome Browser announcement mailing list. I’m pasting it in full for those who aren’t on the list; the list item is also linked here:

[genome-announce] GENCODE Genes Now the Default Gene Set on the Human (GRCh38/hg38) Assembly

In a move towards standardizing on a common gene set within the bioinformatics community, UCSC has made the decision to adopt the GENCODE set of gene models as our default gene set on the human genome assembly. Today we have released the GENCODE v22 comprehensive gene set as our default gene set on human genome assembly GRCh38 (hg38), replacing the previous default UCSC Genes set generated by UCSC. To facilitate this transition, the new gene set employs the same familiar UCSC Genes schema, using nearly all the same table names and fields that have appeared in earlier versions of the UCSC set.

By default, the browser displays only the transcripts tagged as “basic” by the GENCODE Consortium. These may be found in the track labeled “GENCODE Basic” in the Genes and Gene Predictions track group. However, all the transcripts in the GENCODE comprehensive set are present in the tables, and may be viewed by adjusting the track configuration settings for the All GENCODE super-track. The most recent version of the UCSC-generated genes can still be accessed in the track “Old UCSC Genes”.

The new release has 195,178 total transcripts, compared with 104,178 in the previous version. The total number of canonical genes has increased from 48,424 to 49,534. Comparing the new gene set with the previous version:

  • 9,459 transcripts did not change.
  • 22,088 transcripts were not carried forward to the new version.
  • 43,681 transcripts are “compatible” with those in the previous set, meaning that the two transcripts show consistent splicing. In most cases, the old and new transcripts differ in the lengths of their UTRs.
  • 28,950 transcripts overlap with those in the previous set, but do not show consistent splicing (i.e., they contain overlapping introns with differing splice sites).

More details about the new GENCODE Basic track can be found on the GENCODE Basic track description page.

+++++++++++
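
A quick sanity check on the numbers in the announcement, since I will be quoting them in training. This is only a sketch: the counts are copied from the email above, the variable names and the `summarize` helper are mine, and I am assuming the four bullet categories are meant to partition the previous transcript set (the arithmetic suggests they do).

```python
# Transcript counts copied from the announcement above.
previous_total = 104_178
new_total = 195_178

# Fate of each transcript from the previous set, per the four bullets.
# Category labels are my own shorthand, not UCSC terminology.
breakdown = {
    "unchanged": 9_459,
    "not carried forward": 22_088,
    "compatible (consistent splicing)": 43_681,
    "overlapping, inconsistent splicing": 28_950,
}


def summarize(counts, total):
    """Return the sum of the counts and each count as a percentage of total."""
    accounted = sum(counts.values())
    percents = {name: 100 * n / total for name, n in counts.items()}
    return accounted, percents


accounted, percents = summarize(breakdown, previous_total)

print(f"Accounted for: {accounted:,} of {previous_total:,} previous transcripts")
for name, pct in percents.items():
    print(f"  {name}: {breakdown[name]:,} ({pct:.1f}%)")
print(f"Net increase in transcripts: {new_total - previous_total:,}")
```

The four categories add up to exactly the previous total, so every old transcript lands in one of the bullets.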

Off we go. How to add excitement to my morning. I need more coffee still, though.