Tag Archives: genome

Video Tip of the Week: Big Changes to NCBI’s Genome Resources

NCBI was created in 1988 and has maintained the GenBank database for years. They also provide many computational resources and data retrieval systems for many types of biological data. As such they know all too well how quickly the data that biologists collect has changed and expanded. As uses for various data types have been developed, it has become obvious that new types of information (such as expanded metadata) need to be collected, and new ways of handling data are required.

NCBI has been adapting to such needs throughout the years and recently has been adapting its genome resources. Today’s tip will be based on some of those changes. My video will focus on the “completely redesigned Genome site”, which was recently rolled out and announced in the most recent NCBI newsletter. I haven’t found a publication describing the changes, but the newsletter goes into some detail and the announcement found at the top of the Genome site (& that I point out in the video) has very helpful details about the changes.

As you will see in the announcement, the Genome resource is not the only related resource to have undergone changes recently: the Genome Project resource has been redesigned into the BioProject resource, and the BioSample resource has been created. I won’t have time to go into detail about those two resources, but at the end of my post I will link to two recent NCBI publications that came out in Nucleic Acids Research this month. These are good resources to read for more information on BioProject, BioSample, and on the NCBI as a whole. For a historical perspective I also link to the original Genome reference, which is in Bioinformatics and currently free to access.

Some of the changes are very interesting, including that “Single genome records now represent an organism and not a genome for one isolate.” The NCBI newsletter states that “Major improvements include a more natural organization at the level of the organism for prokaryotic, eukaryotic, and viral genomes. Reports include information about the availability of nuclear or prokaryotic primary genomes as well as organelles and plasmids.” There’s also a note that “Because of the reorganization to a natural classification system, older genome identifiers are no longer valid. Typically these genome identifiers were not exposed in the previous system and were used mainly for programmatic access.” That makes me wonder what changes this will mandate to other NCBI resources, as well as to external resources. I haven’t seen any announcements on that yet, so I’ll just have to stay tuned & check around often.
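Since the older genome identifiers were used mainly for programmatic access, one way to avoid depending on them is to look records up by organism name each time. Here is a minimal Python sketch of that idea using NCBI’s E-utilities ESearch service. A few assumptions to flag: I’m assuming the Entrez database name is "genome" and that it is still served through E-utilities, the helper names are my own, and the sample XML (including the ID in it) is made up for illustration. The sketch only builds the request URL and parses a response; it doesn’t contact NCBI.

```python
from typing import List
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_genome_search_url(organism: str, retmax: int = 5) -> str:
    """Build an ESearch URL that looks up genome records by organism name.

    Assumption: the Entrez database name is "genome".
    """
    params = {
        "db": "genome",
        "term": f"{organism}[Organism]",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_id_list(esearch_xml: str) -> List[str]:
    """Pull the record IDs out of an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up response, shaped like ESearch output, so the parser can be
# tried without network access. The ID is NOT a real genome record.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>1</Count>
  <IdList><Id>12345</Id></IdList>
</eSearchResult>"""

url = build_genome_search_url("Escherichia coli")
ids = parse_id_list(SAMPLE_RESPONSE)
```

Searching by name at run time means a script picks up whatever identifier the reorganized resource currently assigns, rather than relying on one stored from the previous system.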

Enjoy the tip & let us, or NCBI, know what you think of their changes! :)

Quick Links:

NCBI Homepage: http://www.ncbi.nlm.nih.gov/

Entrez Genome Resource Homepage: http://www.ncbi.nlm.nih.gov/genome

BioProject Resource Homepage: http://www.ncbi.nlm.nih.gov/bioproject/


Historic Entrez Genome reference: Tatusova, T., Karsch-Mizrachi, I., & Ostell, J. (1999). Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15 (7), 536-543. DOI: 10.1093/bioinformatics/15.7.536

Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research. DOI: 10.1093/nar/gkr1163

Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S., Feolo, M., Fingerman, I., Geer, L., Helmberg, W., Kapustin, Y., Krasnov, S., Landsman, D., Lipman, D., Lu, Z., Madden, T., Madej, T., Maglott, D., Marchler-Bauer, A., Miller, V., Karsch-Mizrachi, I., Ostell, J., Panchenko, A., Phan, L., Pruitt, K., Schuler, G., Sequeira, E., Sherry, S., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T., Wagner, L., Wang, Y., Wilbur, W., Yaschenko, E., & Ye, J. (2011). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. DOI: 10.1093/nar/gkr1184

Video Tip of the Week: Variation Data from Ensembl

Trey introduced me to this “decent collection of video tutorials” from Ensembl, but he and Mary are currently in Morocco teaching a 3-day bioinformatics workshop & then attending the conference (yes, I am envious!). I am therefore creating this week’s tip based on the tutorials that Trey pointed me to. In today’s tip I am going to parallel a tutorial available from Ensembl on SNP information in order to both: 1) show you how you can access variation information from Ensembl and 2) compare doing these steps using Ensembl 64 (here in this video) and using Ensembl 54 (archived) (in the Ensembl video).

Bioscience resources are often under continuous development and improvement & it can be difficult to keep videos and documentation up-to-date. That’s why here at OpenHelix we work continuously to keep our materials up-to-date, with weekly tips on new features and updated tutorials as updated sites become stable.

The Ensembl video (SNPs and other Variations – 1 of 2) is quite nice & provides more detail about the actual Ensembl data than I can in my short movie, but it was done a few years ago on an older version of Ensembl. Since then the resource has been updated and has gone through several new versions of the data. I’m going to follow the same steps that are done in part one of the Ensembl SNP tutorial so that you can see examples of what’s changed & what is pretty much the same. I’d suggest you watch both videos back-to-back to get a good idea of what’s changed, and what types of variation information are available from Ensembl. From that basis I’m sure you’ll be able to watch Ensembl’s second SNP video & apply it to using the current version of Ensembl without much trouble. For more details you can refer to the most recent Ensembl paper in the NAR database issue, which describes not just variation information but Ensembl as a whole.

Quick links:

Ensembl Browser: http://www.ensembl.org/index.html

Legacy Ensembl Browser (release 54): http://may2009.archive.ensembl.org/index.html

Ensembl tutorial, part 1 of 2: http://useast.ensembl.org/Help/Movie?id=208

Ensembl tutorial, part 2 of 2: http://useast.ensembl.org/Help/Movie?id=211

OpenHelix Ensembl tutorial materials: http://www.openhelix.eu/cgi/tutorialInfo.cgi?id=95

Ensembl Tutorial List: http://useast.ensembl.org/common/Help/Movie?db=core

Flicek, P., Aken, B., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon, L., Graf, S., Haider, S., Hammond, M., Howe, K., Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Koscielny, G., Kulesha, E., Lawson, D., Longden, I., Massingham, T., McLaren, W., Megy, K., Overduin, B., Pritchard, B., Rios, D., Ruffier, M., Schuster, M., Slater, G., Smedley, D., Spudich, G., Tang, Y., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder, S., Zadissa, A., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X., Herrero, J., Hubbard, T., Parker, A., Proctor, G., Smith, J., & Searle, S. (2009). Ensembl’s 10th year. Nucleic Acids Research, 38 (Database). DOI: 10.1093/nar/gkp972

What’s the Answer? Open Thread (GWAS genotyping)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

Question of the Week:

How much of the genome is captured by a GWAS?

There are two great answers to this question; here is a quote from the first one. Click the above link for more.

Human genome encodes 1 SNP/100-300bp; ~3GB sequence ~10million SNPs. It is impossible to analyze such a large number of data due to several limiting factors. To deal with this issue we can use Linkage Disequilibrium (LD) mapping (See section on D’, recombination rate), Haplotype, Haplotype blocks and Haplotype Tag SNPs (tagSNPs). (Read about HapMap project here). Instead of genotyping all the 10M SNPs we can genotype tagSNPs in a haplotype block. This is a representative SNP in a given region of genome with high LD. This will enable to find genetic variation without genotyping all the 10M SNPs. Previous studies indicated that genotyping chips with .5M-1M SNPs will be sufficient for a good GWAS.
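To put rough numbers on that answer, here is a small Python sketch of the arithmetic in the quote, plus a toy illustration of picking tag SNPs. Assumptions to flag: the greedy selection is one simple heuristic I’m using for illustration and is not necessarily how any real genotyping chip was designed, the r² threshold of 0.8 is my own choice, and the SNP names and LD values are invented.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb, as in the quoted answer


def estimated_snp_count(genome_size_bp, bp_per_snp):
    """Number of SNPs if there is one SNP every `bp_per_snp` bases."""
    return genome_size_bp // bp_per_snp


def chip_fraction(chip_snps, total_snps):
    """Fraction of all SNPs that a chip genotypes directly."""
    return chip_snps / total_snps


def greedy_tag_snps(snps, r2, threshold=0.8):
    """Toy greedy tag-SNP picker.

    `r2` maps frozenset({a, b}) to the pairwise LD (r squared) between
    SNPs a and b; missing pairs are treated as unlinked. Repeatedly pick
    the SNP that is in high LD with the most still-untagged SNPs.
    """
    def covers(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    untagged = set(snps)
    tags = []
    while untagged:
        # sorted() keeps tie-breaking deterministic
        best = max(sorted(untagged),
                   key=lambda s: sum(covers(s, t) for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if covers(best, t)}
    return tags


# One SNP per 300 bp gives the ~10 million figure; one per 100 bp
# would give ~30 million.
low_estimate = estimated_snp_count(GENOME_SIZE_BP, 300)
high_estimate = estimated_snp_count(GENOME_SIZE_BP, 100)

# A 0.5M-1M SNP chip directly genotypes about 5-10% of 10M SNPs.
small_chip = chip_fraction(500_000, low_estimate)
large_chip = chip_fraction(1_000_000, low_estimate)

# Invented LD values for four hypothetical SNPs.
toy_r2 = {
    frozenset(("snp1", "snp2")): 0.90,
    frozenset(("snp2", "snp3")): 0.85,
    frozenset(("snp3", "snp4")): 0.20,
}
toy_tags = greedy_tag_snps(["snp1", "snp2", "snp3", "snp4"], toy_r2)
```

In the toy data, snp2 is in high LD with both snp1 and snp3, so genotyping snp2 and snp4 stands in for all four. That is the same trade the quote describes at genome scale.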

Naked Mole Rat, another day, another genome

The latest genome to be completed is the naked mole rat (Heterocephalus glaber). Now, could there be a cooler (if ugly) mammal on the planet? It’s one of only two truly eusocial mammals in the world, it lives up to 28 long years (my daughter’s rat, no relation, lived only 3 years) and is surprisingly resistant to a lot of diseases.

So, no wonder the genome was sequenced. Maybe we can learn some things about social behavior and longevity.

Of course there is a resource for it at http://www.naked-mole-rat.org/ though it’s basically just a BLAST server and some downloads. I’m counting down to the day it’s available at UCSC or Ensembl :D. I have some genes I’m interested in comparing.

Tip of the Week: Converting Genome Coordinates

I did this tip over two years ago and am revisiting it today with a bit more information, on SciVee (so it’s shareable) and up-to-date. I’ve been updating our Galaxy tutorial and that tip has been one of the most tweeted, shared and visited tips we’ve done (not the most, just one of the most), so I thought now would be a good time to revisit it. This tip will go through the Galaxy tool to “liftover” genome coordinates between assemblies and genomes. You might also wish to visit a few other tools and places where you can convert genome coordinates between genome assemblies, such as the UCSC Genome Browser LiftOver utility (access that link from the “utilities” menu on the front page; it uses chain conversion files), FlyBase (for the D. melanogaster genome), Maker (an annotation tool from GMOD that includes an assembly conversion tool), the Ensembl Assembly converter, and I’m sure there are others. Have any to report? As the comment below informs us, there is also NCBI’s new remapping service, which maps between assemblies (within species) and between RefSeq sequences and assemblies.

A word about methodology: as mentioned in the first paragraph, UCSC Genome Browser’s liftover tool uses chain conversion files. I am unsure of the methodology used at Galaxy, though I’m assuming it’s similar. I have an inquiry in and will update this page when I know the answer.

Indeed it is. I received a nice answer from the Galaxy support team:

The liftOver program and the underlying mapping file comes from UCSC and is based on their “Chain/Net” comparative genome algorithms.

The data represents the syntenic genome regions for the two reference genomes involved. Genes with similar annotation, between closely related species, found within these syntenic regions have a good likelihood of being orthologs, but gene function is not considered by the algorithm and would have to be evaluated independently to confirm orthology.

This mailing list discussion at the UCSC Genome Browser project would be a good place to learn about the details:
Contacting the team directly at genome@ucsc.edu is also an option if you have a specific question about the algorithm.

Hopefully this helps!
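To make the chain idea a little more concrete, here is a toy Python sketch of converting a coordinate through a set of aligned blocks. This is a heavy simplification and not the actual liftOver implementation: I’m assuming a single chromosome, same-strand alignment, and 0-based positions, and the Block type, the function name, and the coordinates are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned stretch shared by the two assemblies."""
    old_start: int  # 0-based start in the old assembly
    new_start: int  # 0-based start in the new assembly
    size: int       # length of the aligned stretch, in bases


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a position from the old assembly to the new one.

    Returns None when the position falls in a gap between blocks,
    meaning it has no counterpart in the new assembly.
    """
    for block in blocks:
        if block.old_start <= pos < block.old_start + block.size:
            # Same offset into the block, shifted to the new start.
            return block.new_start + (pos - block.old_start)
    return None


# Hypothetical chain: 50 aligned bases, a 10-base gap in the old
# assembly, then 40 more aligned bases placed further along.
toy_chain = [
    Block(old_start=100, new_start=150, size=50),
    Block(old_start=160, new_start=230, size=40),
]

inside_first = lift_position(120, toy_chain)   # offset 20 into block 1
in_gap = lift_position(155, toy_chain)         # between the two blocks
inside_second = lift_position(199, toy_chain)  # last base of block 2
```

Positions that land in a gap come back as None, which mirrors the way a real conversion run can leave you with features that could not be mapped to the new assembly.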

Gobbler genome

For those of you who are not American Thanksgiving observers, turkey is the main course of choice for most Americans for that harvest feast. Two years ago Mary reported the turkey genome was on its way. Well, it’s apropos that the turkey genome is nearly complete (PLoS paper out in September) and ready for this year’s Thanksgiving feast!

Yum, yum.

A current BMC Genomics paper reveals that there have been multiple intrachromosomal rearrangements between the turkey genome and the chicken one. I guess one could say that a turkey is just a chicken that’s been reconfigured? (hattip: Daily Scan)

With these publications, you’ve got some uber-geeky fodder to prove to your family that you are indeed a biology nerd.

Speaking of chickens and turkeys… when I lived in Korea in 1980 my American friends and I had a hard time finding a turkey (Koreans thought they were ugly). We finally found a farm that raised them for American servicemen and expatriates. We went there and picked the largest they had, it was an ugly head above the rest. We took it, live, in a box on a 1 hour bus ride back to the market near our home with Koreans staring at our ugly-headed box like we were insane (perhaps we were?). We took it to an incredulous chicken butcher and when we returned for our dispatched and deheaded turkey, he took pity on us and gave us two large chickens. When we saw the pathetic turkey, we understood why the butcher insisted on gifting us two chickens: the turkey, as large as it was, had no meat on it.

Well, we roasted two chickens that Thanksgiving (and made a pumpkin pie from a pumpkin we found at the market.. the purveyor of which thought we were crazy to eat a pumpkin, they were for decoration!).

Anyhoo, a long anecdote that just goes to show that perhaps a chicken does indeed suffice since they are just rearranged birds (though I wonder where the turkey’s head came from on the chicken?)

Friday SNPpets

Welcome to our Friday feature link dump: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Tip of the Week: GeVo and Genome Comparison

Today’s tip of the week introduces a new (to us) tool for genomic comparisons. We came across this tool reading a blog post at James and the Giant Corn (great blog) about a figure from his research proposal. See, there are reasons to read blogs :D. The tool he uses to create this figure and analysis is GeVo at CoGe, which has several useful tools in addition to GeVo. In today’s tip of the week, we’ll take a quick look at James’ figure at GeVo and introduce CoGe. Check them out, they look like quite useful tools. (And while you’re at it, check out James’ blog. Tidbits like this and interesting discussions make it well worth it.)


I had a Basset Hound growing up. His name was Useless, Useless S. Grunt. Well, actually it was formally Ulysses S. Grant because the US Kennel Club wouldn’t accept Useless S. Grunt as a name as they felt it was too demeaning. Not sure if they felt it was demeaning to the dog or to the president, but that’s neither here nor there, is it?

So, you ask, what made me think of that long-passed sweet dog that tripped over its too-long ears with its too-short legs? It turns out that they found the genetic cause of those short legs in Basset Hounds (and Dachshunds and other breeds).

As NHGRI’s press release states:

In a study published in the advance online edition of the journal Science, the researchers led by NHGRI’s Elaine Ostrander, Ph.D., examined DNA samples from 835 dogs, including 95 with short legs. Their survey of more than 40,000 markers of DNA variation uncovered a genetic signature exclusive to short-legged breeds. Through follow-up DNA sequencing and computational analyses, the researchers determined the dogs’ disproportionately short limbs can be traced to one mutational event in the canine genome – a DNA insertion – that occurred early in the evolution of domestic dogs.

The insertion turns out to be a retrogene, which of course I also find interesting in that I studied retrotransposable elements. Reverse transcriptase has this habit of reverse transcribing RNA into DNA which can get reinserted back into the genome (hence processed pseudogenes of course).

The study is interesting for two reasons (other than because I had a Basset Hound and studied the evolution of retroelements ;). First, it gives us a further clue into evolutionary events that lead to large changes in morphology and into the role of retrotranscription. Second, it gives us a clue into possible human conditions.

For more about the dog genome, you can read our several posts about the dog genome, go to NCBI’s dog genome home site (or UCSC or Ensembl and other browsers) and read the paper (needs a subscription of course, it’s in Science). It’s an interesting read so far (I want to find some time to read it more fully, perhaps Useless doesn’t live up to his name.. he didn’t really even then :D).

Cold genomes

Recently we have been learning a lot about the cold virus. The genomes of many of them have now been sequenced (that is a subscription-required Science report, you can read more about the report here).

You can find more genomic information at picornaviridae.com and at NCBI’s Entrez Genomes, and some structural information at MMDB. (Just a side note: rhinovirus is now classified as enterovirus.)