Tag Archives: 1000 Genomes


Friday SNPpets

This week’s SNPpets include, well, actual pets. BGI is selling edited pigs. Welcome to the future. There’s more future, too–see the tweet by Timothy Read about where NGS will go. And I, for one, welcome our new “smart toilet” overlords. NCBI is cleaning up some of the past (finally). The top 10 useful bioinformatics skills list is compiled. The 1000 Genomes project wraps up. And a heated discussion of authorship as it pertains to software projects was really interesting. And more….

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

And see what you can get with your sequence from Mick:

Video Tip of the Week: TrioVis for family genome data sets

I’m always interested in new strategies to visualize data. So when I saw discussion about a tool to help analyze family genomic data, I went to have a look. TrioVis is a new software tool that offers nice visualization and filtering strategies for exploring parent and child trio data sets. These data sets will become increasingly common as families seek out information for uncharacterized medical situations that may be affecting their kids. But they are being widely used already in many research situations.

TrioVis relies on the common VCF (Variant Call Format) files that are generated from sequencing data. You can have a look at the types of information they carry at the 1000 Genomes project site. These files are created for each parent and the child in a trio, and then they are visualized with TrioVis in this manner:

The user interface consists of five sections: the main table (Fig. 1A), the global variant count bar graphs (Fig. 1B), the variant frequency sliders (Fig. 1C), the coverage sliders (Fig. 1D) and the histogram view (Fig. 1E). Each section focuses on a specific aspect of trio data and offers specific interactive features to calibrate the thresholds. Father, mother and child are colour-coded in green, orange and blue, respectively.

You can read the paper for more details on their goals and strategies. They also point to some 1000 Genomes project sample data you can use to run their tool.
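If you haven't opened a VCF file before, here is a minimal Python sketch of how the genotype column can be pulled out of per-individual records and compared across a trio. The three records are invented for illustration (they are not 1000 Genomes data), the helper names are hypothetical, and this is not how TrioVis itself works. It only shows the kind of information these files carry, and it assumes single-sample, diploid records with no missing genotypes.

```python
# Fixed leading columns of a VCF data line, as defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line):
    """Parse one single-sample VCF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    # FORMAT names the per-sample keys (e.g. GT:DP); the 10th column holds the values.
    record["sample"] = dict(zip(record["FORMAT"].split(":"), fields[9].split(":")))
    return record


def alleles(record):
    """Return the genotype as a sorted list of allele indices, e.g. ['0', '1']."""
    genotype = record["sample"]["GT"].replace("|", "/")
    return sorted(genotype.split("/"))


def consistent_with_parents(child, father, mother):
    """True if the child's genotype can be built from one allele of each parent."""
    first, second = alleles(child)
    f, m = alleles(father), alleles(mother)
    return (first in f and second in m) or (second in f and first in m)


# Invented records for a single site: both parents homozygous reference,
# child heterozygous. GT is the genotype, DP the read depth (coverage).
father = parse_vcf_line("1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:30")
mother = parse_vcf_line("1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:28")
child = parse_vcf_line("1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:25")

is_consistent = consistent_with_parents(child, father, mother)
child_depth = int(child["sample"]["DP"])
print(is_consistent, child_depth)
```

A site like this one, where the child carries an allele neither parent appears to have, is exactly the kind of call where looking at coverage matters before believing it.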

But I also want to commend the TrioVis folks for putting a screencast of their tool right in their abstract. So their video is what I’d like you to view as this week’s Tip of the Week:

TrioVis from Ryo Sakai on Vimeo.

Right now there isn’t a web interface to use, but I noticed in their paper that they plan to integrate this into Galaxy. I think that’s another great idea on their part.

So if you find yourself exploring family trio data sets, consider a look at TrioVis.

Hat tip to Justin Johnson for drawing my attention to this paper and resource.

Quick links:

TrioVis software: https://bitbucket.org/biovizleuven/triovis/wiki/Home

TrioVis video: http://vimeo.com/user6757771/triovis


Sakai, R., Sifrim, A., Vande Moere, A., & Aerts, J. (2013). TrioVis: a visualization approach for filtering genomic variants of parent-child trios Bioinformatics DOI: 10.1093/bioinformatics/btt267

Video Tip of the Week: 1000 Genomes Dataset Browser from NCBI

A recent NCBI Newsletter announced the release of a new resource named the 1000 Genomes Dataset Browser, and that is the resource that I will be featuring in this tip. It is one of the tools available through the new NCBI Variation resources page, which also features resources such as dbSNP, dbVar, dbGaP and ClinVar (many of which OpenHelix has tutorials for) as well as other variation tools – Variation Reporter (pre-release version), Clinical Remap (beta version) and the Phenotype-Genotype Integrator.

Before I discuss NCBI’s 1000 Genomes Dataset Browser, I’d like to spend a bit of time on the 1000 Genomes project, in order to distinguish what is from NCBI and what is from the project itself. From the 1000 Genomes Pilot paper:

“The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas).”

You can access the full paper from the link below. The project has now moved past the pilot phase and is releasing new data all the time. You can see announcements and project details, or access that data, through the official 1000 Genomes project site, or through the official 1000 Genomes version of the Ensembl Browser. As you might imagine for a “big data” project such as this, data has been added to a variety of NCBI databases, including dbSNP, the Sequence Read Archive (SRA) and BioSample. Although you could search for this data through the universal Entrez search system, until now you had to view individual results at each separate database. The 1000 Genomes Browser at NCBI has been created as a powerful interface for comprehensively searching for, and viewing, 1000 Genomes data contained in NCBI resources on a single page.

In the video tip I will familiarize you with the various areas of the page. The browser is built from a series of widgets, each with its own function. I will not be able to cover all of the features, or demonstrate how users can upload their own variation data to the browser – I’ll leave you the fun of exploring those on your own. Because the tool is so young, bug reports and suggestions/comments are still being actively requested – if you find something, check out the FAQs (which discuss bugs at various stages of being fixed) and then email the team.

Quick Links:
NCBI Newsletter announcement July 20, 2012: http://1.usa.gov/RQu5dR

NCBI Variation page: http://www.ncbi.nlm.nih.gov/variation/

NCBI 1000 Genomes Browser page:

1000 Genomes Project site: http://www.1000genomes.org/home

The 1000 genomes project specific version of the Ensembl Browser:

The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing Nature, 467, 1061-1073 DOI: 10.1038/nature09534

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

What’s the answer? (1000Genomes SNPs issues)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s question:

Why are there more non-synonymous SNPs than synonymous SNPs in the 1000 genomes data?

I have downloaded SNP data from the 1000 genomes project through Biomart and UCSC genome browser. These SNP data are annotated as being synonymous or non-synonymous (missense). In all textbooks it is said that the number of synonymous mutations should be much higher than non-synonymous mutations. Then why is it that I consistently observe higher number of non-synonymous SNPs for the human genome? Do you think there might be a mistake in annotating these SNPs or there is something else that I am missing?


This question generated a lot of discussion. And one of the key aspects is that you have to really pay attention to how the annotation features are provided in a database. Have a look at the chatter over there about various aspects of SNP annotations.
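One piece of background that helps frame the question: under the standard genetic code, most of the possible single-nucleotide changes to a codon alter the amino acid. The Python sketch below simply enumerates every one-base change to each of the 61 sense codons and classifies it. It is not the explanation given in the BioStar thread, and it deliberately ignores mutation rates, selection and how any particular database annotates its SNPs; it only counts raw opportunities.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions():
    """Count every single-base change from a sense codon by its effect."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from codons that encode an amino acid
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_substitutions()
total = sum(counts.values())
print(counts, total)
# → {'synonymous': 134, 'missense': 392, 'nonsense': 23} 549
```

So only about a quarter of the possible changes are synonymous. How many of each kind you actually observe in a data set depends on what happened to those changes afterwards, and on how the annotation was done, which is what the discussion over at BioStar digs into.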

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

    • Heh. RT @davidmam: #filmquotebioinformatics The first rule about bioinformatics in grants is that nobody talks about bioinformatics in grants.  [Mary]
    • RT @widdowquinn: #filmquotebioinformatics ‘Homology!’
      ‘You keep using that word. I do not think it means what you think it means.’ [Mary]
    • Note–those two film quotes were part of a hilarious meme recently, which was captured in a storify by Casey Bergman: Bioinformatics Film Quotes
    • Hey Hopkins folks–do this if you have been wanting to pick up some programming skills: RT @jiffyclub: Help spread the word about our upcoming Johns Hopkins Software Carpentry bootcamp! Lots of room still. It’s free! http://t.co/iLIp2JZ5 [Mary]
    • From a LinkedIn discussion for the International Society for Biocuration group: “American Society for Cell Biology (ASCB) 2012 Annual Meeting Travel Awards now available http://bit.ly/Jj1YFA” [Jennifer]
    • RT @NCBI: Try the new 1,000 Genomes Browser which displays graphics and tables for Project data as well as NCBI annotations: http://t.co/QPTe59J6 [Mary]
    • Amen. I wish curation got the respect it deserves: RT @GenomeMedicine: John Hawks’ blog puts genomic data into perspective http://t.co/kziT7zvV. New generation of data curators needed? [Mary]
    • From a LinkedIn discussion at the American Association for the Advancement of Science (AAAS) group: “Its time for the 2012 Dance Your PhD Contest!!! Enter for a chance to win $1,000, a trip to TEDx in Brussels and be featured in Science Mag! Details at http://gonzolabs.org/dance/  …” [Jennifer]
    • RT @iGenomics: Just about to finish teaching a fun 2-day RNA-Seq workshop.The teaching materials are at http://t.co/vqhp304z [Mary]
    • From The Wall Street Journal: “Making Gene Mapping Part of Everyday Care” HT: Bio SmartBrief [Jennifer]


Special video SNPpet–this story had a link to a video about obtaining samples for genome sequence from unusual species. Makes those mouse bites I got way back seem pretty tame. And makes you think about your choices of species for sequencing. Choose wisely, young grad student:

RT @DNAday: Mapping the crocodile genome is not for the faint of heart http://t.co/hAXJRsow

Video Tip of Week: BioProject, it’s where to start finding data (hint, not the papers so much anymore)

A few months ago, Jennifer did a nice tip on NCBI’s Genome Resources and the changes there. There she briefly mentioned the Genome Project resource moving to a new home, BioProject, just about a year ago. Today, I’d like to give you a quick overview of BioProject. It was described in this year’s NAR database issue: “BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.” From the abstract:

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases.

This is just one step the biological science community will have to take to get a handle on the data deluge. If scientists are to get a handle on the projects and data being generated at breakneck speed, a key is knowing what data is being generated and organizing the projects.

As Mary (and we here at OpenHelix) keep not-so-gently reminding you, the data isn’t in the papers any more. Huge projects like 1000 Genomes, ENCODE and others, together with reduced sequencing costs, produce so much data that finding it is difficult.

BioProject grew out of a need to better organize these large projects’ datasets and metadata and replaces NCBI’s Genome Project resource. These projects produce data which is then deposited in several repositories. BioProject “provides an organizational framework to access metadata about research projects and the data from those projects which is deposited, or planned for deposition, into archival databases.”
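If you would rather look up projects from a script than from the web page, here is a small Python sketch that builds a search against the BioProject database through NCBI's E-utilities ESearch service. E-utilities is not mentioned in the paper quoted above; using it here, along with the function names and the search term, is my own assumption for illustration. The function that actually contacts NCBI is defined but not called, so the example runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_bioproject_search_url(term, retmax=5):
    """Build an ESearch URL that looks for `term` in the bioproject database."""
    params = {"db": "bioproject", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def search_bioproject(term, retmax=5):
    """Run the search (needs network access) and return the matching record IDs."""
    with urlopen(build_bioproject_search_url(term, retmax)) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]


# Only build the URL here; the search term is just an example.
example_url = build_bioproject_search_url("1000 Genomes")
print(example_url)
```

Calling search_bioproject("1000 Genomes") would return a short list of record IDs that you can then look up in BioProject itself.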

Quick Links:

BioProject Help
BioSample (descriptions of biological source materials used in experimental assays)
ENCODE (sponsored tutorial)
1000 Genomes 


Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata Nucleic Acids Research, 40 (D1) DOI: 10.1093/nar/gkr1163

Video Tip of the Week: Population Genetics Introduction

We are on the road this week at a workshop in Southern California, so I am going to hand off my tip responsibilities to Lynn Jorde.

Another session in the Current Topics in Genome Analysis 2012 course that has been organized by the NHGRI featured Lynn Jorde. Lynn delivered a lecture (about 1.5 hours long in total–but he makes you stand up at 1 hour to stretch :) ) that provides a nice and gentle introduction to population genetics.

Jorde starts with a list of applications of human genetic variation, such as:

  • deciphering human history
  • inferring individual ancestry
  • forensics (I had no idea that there were 25,000 criminal cases a year with DNA issues)
  • finding and understanding disease causing genes

He does some very clever and helpful comparisons to make his points. At one point he compares humans and broccoli. And he uses an item from the Weekly World News to illustrate a point–this made me laugh because I’ve done the same thing.

Touching carefully on the issue of “race”, he acknowledges that human genetics discussions on that can generate more heat than light. So he doesn’t use that term in his writing. And there are a number of cases where social concepts of race vs. medical treatment are not cohering. He finds “ancestry” the more useful way to think about predictions for responses to drugs or treatments.

He also notes, though, that there is need for caution at this point about relying on the data we are seeing from next-gen sequencing platforms. Specifically he calls out this paper in Genetics in Medicine as one to be aware of (emphasis mine):

CONCLUSIONS: Our analyses demonstrate that clinical prognoses are complicated by sequencing platform-specific errors and ethnicity. We show that disease-causing alleles are globally distributed along ethnic lines, with alleles known to be disease causing in Eurasians being significantly more likely to be homozygous in Africans.

[By the way: that paper is interesting on a couple of other fronts too: it tries to figure out what a “healthy genome” would look like, and heavily uses OMIM to assess that.]

Another clever example to illustrate relationships among people used an analysis of the Supreme Court decisions to describe neighbor-joining networks. And he used profiles of political candidates to explain a distance matrix. It seemed pretty approachable to me.
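If the idea of a distance matrix is new to you, a tiny Python sketch may help. The three yes/no profiles below are invented, and counting the positions where two profiles differ (a Hamming distance) is just one simple choice of distance that I picked for illustration, not necessarily the measure used in the lecture.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles):
    """Return the names and a square matrix of pairwise distances."""
    names = list(profiles)
    matrix = [
        [hamming_distance(profiles[row], profiles[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Invented yes/no (1/0) answers on five issues for three hypothetical people.
profiles = {
    "A": [1, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 0],
    "C": [0, 1, 0, 0, 1],
}

names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, row)
# → A [0, 1, 5]
# → B [1, 0, 4]
# → C [5, 4, 0]
```

A and B are close and C is far from both, and a table like this is the input that tree-building methods such as neighbor-joining start from.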

This talk isn’t specific about any particular software tools, but he does reference important population genetics data sets that you should be familiar with if you use tools that have that data. He speaks about the HapMap project, the 1000 Genomes data, and VAAST (the Variant Annotation, Analysis & Search Tool) software.

So check out this talk for a nice overview of population genetics and of the issues that matter in the field today.

Quick links:

Lecture on YouTube: http://youtu.be/Ng6vKcGkzZs

Current Topics in Genome Analysis course: http://www.genome.gov/12514288


Moore, B., Hu, H., Singleton, M., De La Vega, F., Reese, M., & Yandell, M. (2011). Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics Genetics in Medicine, 13 (3), 210-217 DOI: 10.1097/GIM.0b013e31820ed321

Yandell, M., Huff, C., Hu, H., Singleton, M., Moore, B., Xing, J., Jorde, L., & Reese, M. (2011). A probabilistic disease-gene finder for personal genomes Genome Research, 21 (9), 1529-1542 DOI: 10.1101/gr.123158.111

dbSNP 132 now at UCSC Genome Browser: important changes

While we were traveling for workshops the other day, there was an announcement from the UCSC Genome Browser team that a lot of people have been waiting for: dbSNP132 can be explored on the browser now. It is available on the hg19 assembly–which is the February 2009 one that you can select in the human genome gateway options.

People were eagerly awaiting this for a couple of reasons–first, a new dbSNP release always offers new SNPs people might want to explore. But this particular release also has SNPs that people wanted to access from the 1000 Genomes project. Here’s the release announcement from dbSNP that describes it:

  1. Build 132:Human include data from 1000 Genomes project pilot 1, 2, and 3 studies.   All 1000 genomes submissions to dbSNP can be searched by batch or using Entrez search filters….

Note: the dbSNP announcement also offers help on filtering for just those SNPs at NCBI if you want them. This led me to try to filter the 1000Genomes submitter name in the UCSC Table Browser as well. It worked–but I haven’t checked all of it yet, so caveat lector on that right now…But some people might want to do that. You could create a 1000Genomes custom track with that sort of query I think.

Table Browser, Filter for dbSNP submitter field:

And this yielded this sort of output–where the submitter field contains 1000Genomes, but it may also include other submitters:
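If you download a table like that and want to do the same sort of filtering offline, here is a rough Python sketch. The column name "submitters", the comma-separated layout of that field, and the example rows (including the placeholder SNP names) are all assumptions made for illustration, so check them against the actual table you export. It matches whole submitter handles, so a row is kept when 1000GENOMES is one of several submitters, just as in the output described above.

```python
import csv
import io


def rows_with_submitter(tsv_text, handle, column="submitters"):
    """Return rows whose comma-separated submitter field includes `handle`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    wanted = handle.upper()
    matches = []
    for row in reader:
        # Split the field into individual handles, ignoring empty trailing entries.
        submitters = [s.strip().upper() for s in row[column].split(",") if s.strip()]
        if wanted in submitters:
            matches.append(row)
    return matches


# Invented rows; the SNP names are placeholders, not real rs IDs.
example = (
    "name\tchrom\tsubmitters\n"
    "rsExample1\tchr1\t1000GENOMES,\n"
    "rsExample2\tchr1\tLAB_A,1000GENOMES,\n"
    "rsExample3\tchr2\tLAB_B,\n"
)

matched_names = [row["name"] for row in rows_with_submitter(example, "1000Genomes")]
print(matched_names)
# → ['rsExample1', 'rsExample2']
```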

But another really important aspect of the 132 build in the context of the UCSC Genome Browser is that they have changed the way they are offering the SNPs to you. In the past the SNPs have always been in one big bucket. But now they have separated them out into 4 options: Common SNPs, Flagged SNPs, Multiple location mapped SNPs (Mult. SNPs), and All SNPs. So the menus on the browser look like this now:

Key point: the Common SNPs are on by default. If you want All SNPs (or any of the others) you will have to specifically make that choice. Also remember in your table browser queries to make the appropriate selection.

This is a nice option that people have been asking for. But it does represent a change from the way they have been offered before, so be sure you know which SNP subset you want to explore and make the right choice.

PS: I was going to make a custom track of 1000Genomes to load up for anyone as a public service, but I crashed the browser. I may try again later. I think it would be a handy track to have to load up. If someone else gets to it first, let me know and give me your session link and I’ll add it.

PPS: If you don’t know how to navigate around UCSC, change the menu options, or do Table Browser queries, check out the tutorials that we have that are sponsored by UCSC and are freely available: http://openhelix.com/ucsc

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • RT @32nm: ISCB Public Policy Statement on Open Access to Scientific and Technical Research Literature — Bioinformatics http://ff.im/x9hQ2 Hat tip to Mitsuteru Nakao [Mary]
  • The beginning of population level structural variation information on the human genome, from the 1000 Genomes project. Hat tip to GenomeWeb [Jennifer]
  • Registration for the March 2011 GMOD Meeting is now open. See the wiki page at March 2011 Community Meeting. [Mary]
  • Need to assemble and analyze large datasets for multigene phylogenetic analysis? Might want to try out the new iPhy. Paper here. [Trey]
  • Ok, I had to go look at this–not what I expected…. : Top 200 Genomic Females list for Feb. 2011 is NOW ONLINE at http://bit.ly/i0Nj3O via @holstein_world. Yes, the name did clue me in, but I still had to look. [Mary]
  • Scientific Reports: interesting new venture by Nature Publishing Group. [Jennifer]
  • I love when rare diseases can shine lights into the darkness–very neat story on arterial calcification. Fascinating: RT @NatureNews: Solution to medical mystery offers treatment hope http://ff.im/-xiZYE [Mary]