Tip of the Week: 1000 Genomes Project Browser


You may have been hearing about the 1000 Genomes project. It’s one of the ongoing “big data” projects that is going to yield a great deal of variation information about the human genome. The goal is to sequence well over 1000 genomes to identify “most genetic variants that have frequencies of at least 1% in the populations studied”. They are doing this by sequencing large numbers of samples with 4x coverage. You can read more about their strategy on the About page of their web site, which also lists the anticipated sample populations.

In this week’s Tip of the Week I’m going to take a quick spin through their browser. (You can also download all the data, but I’ll be focusing on the browser.) They have begun to release data, and 6 individual sequences are available at this time. These are part of their “pilot” studies. You can get some details on the pilot from their About page, which links to a PDF about the samples.

They are using the Ensembl framework to display their data, so if you are familiar with Ensembl you’ll have some facility moving around this browser. One thing that isn’t apparent right away from the site is that you can click the Resembl link on the display to turn on a track that puts the read/coverage data on the viewer. I also liked the alignment display of all 6 genomes, but I’m sure that’s going to get challenging to view later with more and more genomes.

In an exchange with their very helpful help desk yesterday, I got this quick summary of the samples you’ll see:

For the high coverage populations, NA12891, NA12892 and NA12878 are the CEU trio, and NA19238, NA19239 and NA19240 are the YRI trio. Both are listed as father, mother, child respectively, and both children were daughters.
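
If you want to keep those IDs straight while flipping between tracks, here is a minimal Python sketch of the same information as a lookup table. The names TRIOS, ROLES, and describe_sample are my own, made up for illustration, and are not part of any 1000 Genomes tool. The IDs and the father, mother, child order come straight from the help desk summary.

```python
# Pilot high-coverage samples, as summarized by the 1000 Genomes help desk.
# Each tuple is ordered father, mother, child (both children are daughters).
# TRIOS, ROLES and describe_sample are hypothetical names for this sketch.
TRIOS = {
    "CEU": ("NA12891", "NA12892", "NA12878"),
    "YRI": ("NA19238", "NA19239", "NA19240"),
}
ROLES = ("father", "mother", "child")


def describe_sample(sample_id):
    """Return (population, role) for a pilot sample ID, or None if unknown."""
    for population, members in TRIOS.items():
        if sample_id in members:
            return population, ROLES[members.index(sample_id)]
    return None


if __name__ == "__main__":
    # Print one tab-separated line per sample: ID, population, role.
    for population, members in TRIOS.items():
        for role, sample_id in zip(ROLES, members):
            print(f"{sample_id}\t{population}\t{role}")
```

Calling describe_sample with one of the six IDs gives you the population and family role, which is handy when a track label only shows the NA number.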

If you have questions about their data, be sure to go ask them for help. They were very speedy with answers for me :)

Some of the project data has also been picked up by UCSC, and you can access the same sequences in the UCSC Genome Browser in the Genome Variants track on the March 2006 human assembly. (You’ll also see Venter, Watson, and some other individual genomes there.)

Quick links:

The Project: http://www.1000genomes.org/

The Browser: http://browser.1000genomes.org/

An article in Science with some background:  A Plan to Capture Human Diversity in 1000 Genomes

8 thoughts on “Tip of the Week: 1000 Genomes Project Browser”

  1. Pingback: Tweets that mention Tip of the Week: 1000 Genomes Project Browser | The OpenHelix Blog -- Topsy.com

  2. gsgs

    Does this exist ? :

    an executable (together with one data-file) that lets me specify the genome-name,
    chromosome, start-position, end-position on the command line and then prints the nucleotides
    of that genome at those positions – all aligned to the same reference.
    How big would that data-file be? 1GB should be enough

    If it doesn’t exist yet, I would consider it a big failure within the whole human sequencing projects,
    with so much effort gone into it already. They would have failed to collect + display the results
    in the one and only suitable format, so others can easily work with it.
    It seems to me like creating a new language without giving an alphabetic dictionary
    to some existing language.

  3. Mary Post author

    @gsgs: you know, all the data is out there. If it doesn’t meet your precise requirements, you are going to have to learn a little Perl perhaps.

    I hardly think failure to meet gsgs’ file structure demands constitutes a failure of “the whole human sequencing projects”. It may be a failure. But it’s not the projects.

  4. gsgs

    first I had “failure of” but didn’t mean it and
    changed it to “failure within”, meaning that
    the data-presentation part somehow failed (IMO)

    I claim that that program is the ultimate, most canonical, reasonable, useful, compressed
    way to present the data. All these different formats, alignment explanations,
    reference-genomes – just confusing and non-uniform.

    Time will tell …

    those .gff files are the best that I found so far,
    but the reference is not uniform, only
    (selected ?) SNPs

    and then another problem is: even if a good, easy
    form is available, it might still be hard to find it
    with other presentations showing up on
    keyword-search confusing the list of hits

  5. Steve Chervitz Trutane

    There are tools for extracting arbitrary ranges from sequence data files, such as nibFrag, 2bitToFa, and some other approaches in R or awk mentioned here:

    http://biostar.stackexchange.com/questions/979/fastest-way-of-extracting-millions-of-short-sequences-from-the-human-genome

    https://lists.soe.ucsc.edu/pipermail/genome/2006-November/012192.html

    None of these perform alignment to a reference, so you’d need to do that yourself. Some tools for doing this are noted at http://www.bioperl.org/wiki/Whole_genome_alignment

  6. Pingback: 1000 Genomes tutorial at ASHG | The OpenHelix Blog

  7. Pingback: Oddest photo accompanying 1000 Genomes news | The OpenHelix Blog

  8. Pingback: dbSNP 132 now at UCSC Genome Browser: important changes | The OpenHelix Blog
