The folks at NCBI recently hosted a webinar that covered a number of resources: GTR, ClinVar, and MedGen. It was a nice introduction to these resources using a case study of exploring information about a 9-year-old child who needed to get clearance for participation in sports. So they follow the course of some details about this kid across the different resources at NCBI to show what you could learn at the different sites.
I was hoping that the recording would become available so that it could be a triple tip of the week, but I haven’t seen any announcements of it; I’ll keep an eye out and highlight it in the future if it appears. Below I have also referenced a paper that covers some of the same ground as that webinar. In the meantime, they also recently added a new short video about the Variation Viewer that I found handy as well, so that will be this week’s video tip.
I particularly liked the way you can easily select an exon to focus on, with the little bubbles near the top. That wasn’t obvious to me at first. People are often asking me for handy ways to focus in on the specifics of a single exon.
In addition to this video, I will also offer a screen-cap of one of the slides from the longer webinar that linked to related resources around NCBI. If you haven’t checked out these associated tools you will want to look at them as well. There are a lot of terrific tools available and they are always adding new useful features. Follow them on Twitter for announcements about their tools and trainings–that’s how I stay on top of the new items.
Landrum M.J., G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church & D. R. Maglott (2014). ClinVar: public archive of relationships among sequence variation and human phenotype, Nucleic Acids Research, 42 (D1) D980-D985. DOI: http://dx.doi.org/10.1093/nar/gkt1113
A couple of weeks back we did a workshop on the UCSC Genome Browser, and I was asked a question we see pretty frequently: Is there a way to export the browser view that you selected, with its specific tracks, filters, regions, etc.? People may want a record of their customized view for a lab notebook, for teaching, or for a seminar–or of course for publishing their awesome observations in journals.
Most of the time I just take screen shots of what I need with a screen capture tool (my personal favorite is Snag-It from TechSmith). But there may be times you want something a bit heavier-duty. If you are going to do a poster, or submit it for publication, for example, you might want a nice PostScript version you can work with and edit further. At UCSC, the way to do that is with the “View” menu option here for PDF/PS:
Export the browser image to a file for further editing or use.
When you get a file, you can download it and edit it with Adobe graphics tools if you have them, or with a free open-source tool like Inkscape. You can change the colors, delete stuff, add more annotations, etc.
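If you would rather script this than click through the menu, here is a minimal Python sketch that builds a browser URL for one region. The `db` and `position` parameters are the usual ones for linking to a browser view; the `hgt.psOutput=on` parameter is an assumption on my part about how to land on the same PDF/PS page that the “View” menu opens, so check it against the current UCSC documentation before relying on it. The function name and the example coordinates are just illustrations.

```python
from urllib.parse import urlencode

# Base URL of the UCSC Genome Browser track display.
HGTRACKS_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_export_url(db, chrom, start, end, ps_output=True):
    """Build a Genome Browser URL for one region (hypothetical helper).

    When ps_output is True, add hgt.psOutput=on. That parameter is an
    assumption: it is expected to open the PDF/PS export page rather
    than the normal interactive view.
    """
    params = {"db": db, "position": f"{chrom}:{start}-{end}"}
    if ps_output:
        params["hgt.psOutput"] = "on"
    return f"{HGTRACKS_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example region on hg19; the coordinates are only for illustration.
    print(browser_export_url("hg19", "chr7", 117120017, 117308718))
```

Opening the printed URL in a web browser should show the export page, from which the file can be saved and edited as described above. Note that a bare URL like this uses default tracks, not your customized session.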
So when I saw that there was a similar function in NCBI’s Sequence Viewer tool, I thought I should mention that as well. They have a nice, clear video that illustrates how to get the image out of the Viewer and into a file.
Click the “Graphics” link on the page to open the Sequence Viewer.
After you get to the Sequence Viewer, follow the instructions just as they play out in the YouTube video. It’s pretty straightforward–just be careful to click the right menu for PDFs.
If you haven’t used the NCBI Sequence Viewer much, you should definitely check it out. There are some other helpful videos for more features as well. And another neat feature is that you can embed the Sequence Viewer in your own web pages.
All of the genome browsers have different features and functions, and it’s nice to know that there are various strategies to accomplish tasks you might need to get done.
Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M., et al. (2013). The UCSC Genome Browser database: 2014 update, Nucleic Acids Research, 42 (D1) D764-D770. DOI: 10.1093/nar/gkt1168
Acland A., Agarwala R., Barrett T., Beck J., Benson D.A., Bollin C., Bolton E., Bryant S.H., Canese K., Church D.M., et al. (2013). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI: 10.1093/nar/gkt1146
Everyone is pretty comfortable with the concept of the non-standard but commonly used dog-years as a way to compare life spans to humans. Car enthusiasts have a taxonomy for the ages of vehicles. But I’ve been sitting here wondering what the genome-browser-years scale should be. I’ve been thinking about it because of the recent announcement over the UCSC list the other day:
Over the past 12 years we have made efforts to maintain visualization of many old, archived assemblies on our Archive server at http://genome-archive.cse.ucsc.edu/, in addition to providing download access to the associated data sets. Unfortunately, this visualization is no longer sustainable for very old assemblies due to the many changes in the Genome Browser software as it has matured. We are therefore reducing access to certain old assemblies to data downloads only, and are announcing the shutdown of our Archive server. We will continue to provide Genome Browser access for the 4 most current human assemblies, and at least the 2 most current assemblies for all other organisms with some exceptions. We will discontinue our visualization support for all other old assemblies, but will continue to make these data sets available on our download servers. The assemblies currently on our archive server for which we have discontinued visualization support include early human assembly drafts, hg4, hg5, hg6, hg7, hg8, hg10, hg11, hg12, hg13, hg15, rn1, rn2, mm1, mm2, mm3, mm4, mm5, rheMac1, bosTau1, ce1, danRer1, and danRer2. Links to the data and annotations associated with these assemblies have been added in the appropriate places on our Downloads page at http://hgdownload.soe.ucsc.edu/. Please contact us if you have difficulty locating a data set of interest.
UCSC Genome Bioinformatics Group
A view of the UCSC Genome Browser in the early days, ~2004.
I’m sure the browser versions aren’t used very much anymore, but it was something I needed to be aware of. In our workshops I mention that the older versions have been available from the “archives” navigation on the landing page, but that will be gone now. Occasionally there are old papers that reference a genomic span that you want to revisit–but that’s becoming less common for those really old assemblies at this point. The data will persist for downloading, but the browser visuals will be gone. But it made me want to go back and look through my old materials to see what the early browsers looked like (click on the image to embiggen). A lot of the foundational structure is the same, but if you look at one of these old assemblies you might be surprised. A lot fewer tracks, that’s for sure. In my shot, there are only a few species in the Conservation track (human, chimp, mouse, rat, chicken). We just didn’t have that much data available–not just across species, but other types of techniques and tools. Fewer tracks. Fewer functionality buttons.
This was almost as much fun as looking back at the old NCBI interfaces that I remember from way back. Those of you who have been in this rodeo for a while may remember those. Some of you will even remember the key “Pedro’s Tools” from back in the day. Sometimes it’s worth looking back at where we came from to realize how much further we are than we realized. I know there’s a lot of grousing about not having cured cancer and changed the pharmaceutical industry with the human genome sequence yet–but we really haven’t had the data that long in browser-years. Or maybe it’s more like Mars years–longer than you realize, despite the fairly comparable span of a single day. The arc of science is long, but it bends toward answers.
Edit: I realized I should look at the earliest paper and add that reference below. Check out Figure 1 for an even older view of the data.
Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M. & Haussler D. (2002). The Human Genome Browser at UCSC, Genome Res., 12 996-1006. DOI: 10.1101/gr.229102
…an integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publically available sequence databases at NCBI, and mix this data with your own private data.
It’s a useful program and they have a great set of videos to introduce you to the workbench’s functions and features. The video embedded here is the introduction, but they also have several additional videos including how to load a genome into the workbench, phylogenies and others. Check it out.
(Forgive the delay of this week’s tip. Snow canceled work and knocked out Internet access!)
Before I discuss NCBI’s 1000 Genomes Dataset Browser, I’d like to spend a bit of time on the 1000 Genomes project, in order to distinguish what is from NCBI and what is from the project itself. From the 1000 Genomes Pilot paper:
“The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas).”
You can access the full paper from the link below. The project has now moved past the pilot phase and is releasing new data all the time. You can see announcements and project details, or access that data, through the official 1000 Genomes project site, or through the official 1000 Genomes version of the Ensembl Browser. As you might imagine for a “big data” project such as this, data has been added to a variety of NCBI databases, including dbSNP, the Sequence Read Archive (SRA) and BioSample. Although you could search for this data through the universal Entrez search system, previously to view the data you would have to view individual results at each separate database. The 1000 Genomes Browser at NCBI has been created as a powerful interface for comprehensively searching for, and viewing, 1000 Genomes data contained in NCBI resources on a single page.
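To make the “one database at a time” point concrete, here is a small Python sketch that builds the same Entrez search for each of the three databases mentioned above, using NCBI’s E-utilities ESearch endpoint. It only constructs the URLs and does not contact NCBI. The search term and the `retmax` value are placeholders I chose for illustration, and `esearch_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Entrez database names for the three resources named in the post.
DATABASES = {"dbSNP": "snp", "SRA": "sra", "BioSample": "biosample"}


def esearch_urls(term, retmax=20):
    """Return one ESearch URL per database for the same query term."""
    urls = {}
    for label, db in DATABASES.items():
        params = {"db": db, "term": term, "retmax": retmax}
        urls[label] = f"{ESEARCH_URL}?{urlencode(params)}"
    return urls


if __name__ == "__main__":
    # "1000 Genomes" is only an example term, not a recommended query.
    for label, url in esearch_urls("1000 Genomes").items():
        print(label, url)
```

Each URL returns a separate result list for its own database, which is exactly the fragmentation that a single-page browser is meant to smooth over.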
In the video tip I will familiarize you with the various areas of the page – the browser is built from a series of widgets, each with its own function. I will not be able to cover all of the features, or demonstrate how users can upload their own variation data to the browser – I’ll leave you the fun of exploring those on your own. Because the tool is so young, bug reports and suggestions/comments are still being actively requested – if you find something, check out the FAQs (which discuss bugs at various stages of being fixed) and then email the team.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
RT @ianholmes: This post by @Graham_Coop on prepublishing papers to arXiv is spot-on, a call to arms for biologists! http://t.co/goV3Pc8g cc @ctitusbrown [Mary]
Hmmm. Have to take a close look at this. RT @SAGRudd: Discovery of multi-dimensional modules by integrative analysis of cancer genomic data = #magnificent trickery @ http://t.co/XPwlBped soon followed by RT @moorejh: #complexity MT @SAGRudd Discovery of multi-dimensional modules by integrative analysis of #cancer #genomics data @ http://t.co/y788iaMl [Mary]
I struggled with whether to hold our production team for the new sidebar, or to produce our tutorial with the plan to update it in the near future. It is always a struggle to know which is the best option, because resource changes can occur at the speed of light or on geological time scales (OK, that’s an exaggeration, but it feels that way when you want to release a wonderful, up-to-date tutorial & something holds up publication). With PubMed I was lucky – I saw a tweet that the sidebar feature would be added “in the next week”. I asked our voice professional to put the script on hold & I paced around PubMed waiting to see what (& when) things would occur.
True to their word, the sidebar feature showed up on PubMed results on May 10th, exactly one week since I had seen the “in the next week” announcement – my THANKS to the NCBI & PubMed Teams! Not only did they push out their updates in a timely manner, they made a YouTube video explaining the changes & discussing where future changes are slated to go. The video is clear, and quick, so I am using it as my tip this week. I’m not sure the feature is 100% stable, as I show in the image below, and describe later in the post, but I think the change might accomplish NCBI’s goal – for more people to notice & utilize filters for their searches.
In the video the narrator states that the filters area is gone & the two default filters are permanently selected, as indicated by the check marks that can’t be “unclicked”. I’m not seeing those check marks on either the “Free full text available” link (shown) or the “Review” link, which is not in view in my image. I also see a difference in whether I get the right filtered subsets depending on whether I am logged into My NCBI (the upper window shown in the back of the image), or not (the lower, front window). In my hands IE 9.0 & Firefox 12.0 both function similarly in these aspects.
The NCBI video doesn’t really show how results look after filters are added, but from playing with it, it looks to me like all of your filters are applied to your search & you only get one set of results, not links to various subsets. Although it is now easier to add filters to searches, if that’s how filters are going to work going forward, I think I will miss the old filters – I kind of like being able to switch between various subcategories of results without having to change my filters or rerun searches. Be sure to share your thoughts & preferences with NCBI so that they can create the best resource for their users’ needs!
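Here is a small Python sketch of what “all of your filters are applied to your search” amounts to, as I understand it: each selected filter is ANDed onto the base query, so there is one combined result set, not a tab per subset. This is my reading of the behavior, not a description of NCBI’s implementation. The two filter strings use the standard PubMed search tags for free full text and review articles; the `apply_filters` helper is hypothetical.

```python
# PubMed search tags for the two default filters discussed in the post.
FREE_FULL_TEXT = "free full text[sb]"
REVIEW = "review[pt]"


def apply_filters(term, filters):
    """AND every filter onto the base term, giving one combined query."""
    parts = [f"({term})"] + [f"({f})" for f in filters]
    return " AND ".join(parts)


if __name__ == "__main__":
    # "cystic fibrosis" is only an example search term.
    print(apply_filters("cystic fibrosis", [FREE_FULL_TEXT, REVIEW]))
```

To get back the old “switch between subsets” experience under this model, you would run the base query once per filter and keep the result sets side by side.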
* OpenHelix tutorial for this resource available for individual purchase or through a subscription.
PubMed Reference: Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Federhen, S., et al. (2011). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 40 (D1) D25. DOI: 10.1093/nar/gkr1184
As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases.
This is just one step in the process the biological science community will have to go through to get a handle on the data deluge. If scientists are to keep up with the projects and data that are being produced at breakneck speed, a key is knowing what data is being generated and organizing the projects.
BioProject grew out of a need to better organize these large projects’ datasets and metadata and replaces NCBI’s Genome Project resource. These projects produce data which is then deposited in several repositories. BioProject “provides an organizational framework to access metadata about research projects and the data from those projects which is deposited, or planned for deposition, into archival databases.”
Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research, 40 (D1) DOI: 10.1093/nar/gkr1163
The Genetic Testing Registry (GTR) “provides a central location for voluntary submission of genetic test information by providers. The scope includes the test’s purpose, methodology, validity, evidence of the test’s usefulness, and laboratory contacts and credentials. The overarching goal of the GTR is to advance the public health and research into the genetic basis of health and disease.”
I’m always interested in checking out new resources from NCBI, especially when it is my turn to do a weekly tip. Initially I figured that I would check out the GTR and post a video on how to use it – but the NCBI beat me to that. You can see their YouTube tips (there are two) by clicking the link on their homepage & learn some search tips, etc. [Note, the two videos continued to loop for me & I needed to stop them after viewing them once].
But the question that I came up with is, “What will the GTR provide me with that I am not already getting from other clinical resources that I use, and that OpenHelix trains on?” I try to address that question in my video by doing the same search, for “Cystic fibrosis”, at five different clinically-related resources, and discussing what each offers and specializes in doing. Of course, in a five minute video I can’t be comprehensive – either for resources or what they cover – but I think it will give you enough of a taste for you to appreciate what the GTR offers you, or to continue the comparison on your own.
The resources that I visit in the tip movie are: the GTR, GeneTests, the Genetics Home Reference (GHR), OMIM, and Orphanet. At each resource I do a basic search for the disease “Cystic fibrosis” and show the initial results display. I don’t have time to compare the detailed reports available at each, but lower in the post I link to a reference on each resource (if available), as well as the landing page for OpenHelix training materials on the resource – since we have a tutorial on many of these resources. I also include direct links to each resource.
I’d suggest that you read the NIH News article on the GTR release for some background on the GTR. I won’t cover everything here, but there are a couple of paragraphs that I want to point your attention to. The first explains the relationship between GeneTests and GTR, and says:
“GTR is built upon data pulled from the laboratory directory of GeneTests, a pioneering NIH-funded resource that will be phased out over the coming year. GTR is designed to contain more detailed information than its predecessor, as well as to encompass a much broader range of testing approaches, such as complex tests for genetic variations associated with common diseases and with differing responses to drugs. GeneReviews, which is the section of GeneTests that contains peer-reviewed, clinical descriptions of more than 500 conditions, is also now available through GTR.”
It seems to be another case where it was deemed easier to start a new resource (GTR) than to try to revamp an old resource (GeneTests) to handle the amazing influx of new data. Often resources aren’t retired as soon as expected, due to user feedback, but it is important to note that GTR seems to be in place to eventually replace GeneTests. I assume that GeneReviews will still be edited and copyrighted by the University of Washington, Seattle, but I don’t have a reference for that. A similar transition occurred for OMIM, which was hosted at NCBI for years but now has a new URL at Johns Hopkins (watch for our new tutorial on OMIM, which is currently in the works).
The second paragraph that I found particularly interesting was the one on what the GTR contains, and will contain. It states:
“In addition to basic facts, GTR will offer detailed information on analytic validity, which assesses how accurately and reliably the test measures the genetic target; clinical validity, which assesses how consistently and accurately the test detects or predicts the outcome of interest; and information relating to the test’s clinical utility, or how likely the test is to improve patient outcomes.”
I didn’t immediately find mention of who will provide the validity or utility information in the GTR documentation, which is currently under construction. It is clear that much of the content of the database will be “voluntarily submitted by test providers”, and it is stated that “NIH does not independently verify information submitted to the GTR; it relies on submitters to provide information that is accurate and not misleading.” However, I also saw that experts will provide input on GTR’s content regularly, as can be read here. The GTR team is also very interested in receiving input on the resource, which can be submitted through the GTR feedback form.
*OpenHelix tutorials for these resources available for individual purchase or through a subscription
For GeneTests (free from PMC) – Pagon RA (2006). GeneTests: an online genetic information resource for health care providers. Journal of the Medical Library Association : JMLA, 94 (3), 343-8 PMID: 16888670
For GHR (free from PMC) – Mitchell JA, Fomous C, & Fun J (2006). Challenges and strategies of the Genetics Home Reference. Journal of the Medical Library Association : JMLA, 94 (3), 336-42 PMID: 16888669
For OMIM (open access article) – Amberger, J., Bocchini, C., & Hamosh, A. (2011). A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®) Human Mutation, 32 (5), 564-567 DOI: 10.1002/humu.21466
For Orphanet (full access requires subscription) - Aymé, S., & Schmidtke, J. (2007). Networking for rare diseases: a necessity for Europe Bundesgesundheitsblatt – Gesundheitsforschung – Gesundheitsschutz, 50 (12), 1477-1483 DOI: 10.1007/s00103-007-0381-9
NCBI was created in 1988 and has maintained the GenBank database for years. They also provide many computational resources and data retrieval systems for many types of biological data. As such they know all too well how quickly the data that biologists collect has changed and expanded. As uses for various data types have been developed, it has become obvious that new types of information (such as expanded metadata) need to be collected, and new ways of handling data are required.
NCBI has been adapting to such needs throughout the years and recently has been adapting its genome resources. Today’s tip will be based on some of those changes. My video will focus on the “completely redesigned Genome site”, which was recently rolled out and announced in the most recent NCBI newsletter. I haven’t found a publication describing the changes, but the newsletter goes into some detail and the announcement found at the top of the Genome site (& that I point out in the video) has very helpful details about the changes.
As you will see in the announcement, the Genome resource is not the only related resource to have changed recently: the Genome Project resource has been redesigned into the BioProject resource, and the BioSample resource has been created. I won’t have time to go into detail about those two resources, but at the end of my post I will link to two recent NCBI publications that came out in Nucleic Acids Research this month – these are good resources to read for more information on BioProject, BioSample, and on the NCBI as a whole. For a historical perspective I also link to the original Genome reference, which is in Bioinformatics and currently free to access.
Some of the changes are very interesting, including that “Single genome records now represent an organism and not a genome for one isolate.” The NCBI newsletter states that “Major improvements include a more natural organization at the level of the organism for prokaryotic, eukaryotic, and viral genomes. Reports include information about the availability of nuclear or prokaryotic primary genomes as well as organelles and plasmids.” There’s also a note that “Because of the reorganization to a natural classification system, older genome identifiers are no longer valid. Typically these genome identifiers were not exposed in the previous system and were used mainly for programmatic access.” That makes me wonder what changes this will force on other NCBI resources, as well as external resources. I haven’t seen any announcements on that yet, so I’ll just have to stay tuned & check around often.
Enjoy the tip & let us, or NCBI, know what you think of their changes!
Historic Entrez Genome reference: Tatusova, T., Karsch-Mizrachi, I., & Ostell, J. (1999). Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15 (7), 536-543 DOI: 10.1093/bioinformatics/15.7.536
Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research DOI: 10.1093/nar/gkr1163
Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S., Feolo, M., Fingerman, I., Geer, L., Helmberg, W., Kapustin, Y., Krasnov, S., Landsman, D., Lipman, D., Lu, Z., Madden, T., Madej, T., Maglott, D., Marchler-Bauer, A., Miller, V., Karsch-Mizrachi, I., Ostell, J., Panchenko, A., Phan, L., Pruitt, K., Schuler, G., Sequeira, E., Sherry, S., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T., Wagner, L., Wang, Y., Wilbur, W., Yaschenko, E., & Ye, J. (2011). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research DOI: 10.1093/nar/gkr1184