This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867
This week’s tip of the week is on GEMINI, an acronym for “GEnome MINIng.” Unlike most of the tips we give every week, this one is a software package, but it does use and integrate with many internet databases such as dbSNP, ENCODE, UCSC, ClinVar and KEGG. It’s also a freely available, open source tool and quite a useful one: it gives the researcher the ability to create quite complex queries based on genotypes, inheritance patterns, etc. The above 12-minute clip is a conference talk that gives an introduction to the science behind the tool.
The abstract from the developers’ recent paper gives a good introduction to the functionality of the tool:
Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI’s utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.
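Because GEMINI loads variants and annotations into an SQLite database, the “complex queries” the abstract describes are essentially plain SQL over a variants table. Here is a minimal sketch of that idea using an in-memory mock table with made-up rows and only a handful of columns (the real database, built by GEMINI from a VCF, is far richer):

```python
import sqlite3

# Mock of a few columns of GEMINI's "variants" table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE variants (
    chrom TEXT, start INTEGER, "end" INTEGER, ref TEXT, alt TEXT,
    gene TEXT, impact TEXT, in_dbsnp INTEGER)""")
rows = [
    ("chr1", 100, 101, "A", "G", "GENE1", "missense_variant", 1),
    ("chr1", 200, 201, "C", "T", "GENE2", "stop_gained", 0),
    ("chr2", 300, 301, "G", "A", "GENE3", "synonymous_variant", 1),
]
conn.executemany("INSERT INTO variants VALUES (?,?,?,?,?,?,?,?)", rows)

# The kind of filter GEMINI lets you compose: novel (not in dbSNP),
# protein-damaging variants.
query = """SELECT chrom, start, ref, alt, gene FROM variants
           WHERE in_dbsnp = 0 AND impact = 'stop_gained'"""
for row in conn.execute(query):
    print(row)
```

In GEMINI itself you would express the same filter on the command line against a loaded database, and could additionally constrain by sample genotypes and inheritance patterns.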
If you’d like to learn more, there is some pretty good documentation of the software package here.
While I’m at it, and totally unrelated except that it’s also human genomics, there is this SlideShare presentation on the ‘current’ state of personal genomics. ‘Current’ is in quotes because the presentation is actually from three years ago, but there is a lot of good information in there. Does anyone know of a more up-to-date slide set or extensive introduction to the current state of personal genomics science similar to this?
(tutorials are linked below for those tools in bold above)
Paila U, Chapman BA, Kirchner R, & Quinlan AR (2013). GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Computational Biology, 9 (7). PMID: 23874191
NCBI was created in 1988 and has maintained the GenBank database for years. They also provide many computational resources and data retrieval systems for many types of biological data. As such they know all too well how quickly the data that biologists collect has changed and expanded. As uses for various data types have been developed, it has become obvious that new types of information (such as expanded metadata) need to be collected, and new ways of handling data are required.
NCBI has been adapting to such needs throughout the years and recently has been adapting its genome resources. Today’s tip will be based on some of those changes. My video will focus on the “completely redesigned Genome site”, which was recently rolled out and announced in the most recent NCBI newsletter. I haven’t found a publication describing the changes, but the newsletter goes into some detail and the announcement found at the top of the Genome site (& that I point out in the video) has very helpful details about the changes.
As you will see in the announcement, the Genome resource is not the only related resource to have undergone changes recently: the Genome Project resource has been redesigned as the BioProject resource, and the BioSample resource has been created. I won’t have time to go into detail about those two resources, but at the end of this post I link to two recent NCBI publications that came out in Nucleic Acids Research this month; these are good resources to read for more information on BioProject, BioSample, and the NCBI as a whole. For a historical perspective I also link to the original Genome reference, which is in Bioinformatics and currently free to access.
Some of the changes are very interesting, including that “Single genome records now represent an organism and not a genome for one isolate.” The NCBI newsletter states that “Major improvements include a more natural organization at the level of the organism for prokaryotic, eukaryotic, and viral genomes. Reports include information about the availability of nuclear or prokaryotic primary genomes as well as organelles and plasmids.” There’s also a note that “Because of the reorganization to a natural classification system, older genome identifiers are no longer valid. Typically these genome identifiers were not exposed in the previous system and were used mainly for programmatic access.” That makes me wonder what changes this will mandate for other NCBI resources, as well as external ones. I haven’t seen any announcements on that yet, so I’ll just have to stay tuned & check around often.
Enjoy the tip & let us, or NCBI, know what you think of their changes!
NCBI Homepage: http://www.ncbi.nlm.nih.gov/
Entrez Genome Resource Homepage: http://www.ncbi.nlm.nih.gov/genome
BioProject Resource Homepage: http://www.ncbi.nlm.nih.gov/bioproject/
Historic Entrez Genome reference: Tatusova, T., Karsch-Mizrachi, I., & Ostell, J. (1999). Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15 (7), 536-543. DOI: 10.1093/bioinformatics/15.7.536
Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata Nucleic Acids Research DOI: 10.1093/nar/gkr1163
Sayers, E., Barrett, T., Benson, D., Bolton, E., Bryant, S., Canese, K., Chetvernin, V., Church, D., DiCuccio, M., Federhen, S., Feolo, M., Fingerman, I., Geer, L., Helmberg, W., Kapustin, Y., Krasnov, S., Landsman, D., Lipman, D., Lu, Z., Madden, T., Madej, T., Maglott, D., Marchler-Bauer, A., Miller, V., Karsch-Mizrachi, I., Ostell, J., Panchenko, A., Phan, L., Pruitt, K., Schuler, G., Sequeira, E., Sherry, S., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T., Wagner, L., Wang, Y., Wilbur, W., Yaschenko, E., & Ye, J. (2011). Database resources of the National Center for Biotechnology Information Nucleic Acids Research DOI: 10.1093/nar/gkr1184
I did a tip on CoGe’s tool GeVo about two years ago, and we had a guest post about CoGe from Eric Lyons, its lead developer, just over a year ago. In our ongoing and occasional quest to keep our tips fresh (and move them to SciVee), I’ve decided to revisit CoGe and one of its tools. CoGe has changed a bit since we last visited it (see some of the changes here). There is a new interface, more documentation and many more tutorials, some new tools and interconnections, and a lot more genomes. I’m going to give a brief introduction to SynMap and use it to do a genome rearrangement analysis (the subject of a text tutorial at the site).
The algorithm selected in the example is QUOTA-ALIGN, the subject of a recent paper, “Screening synteny blocks in pairwise genome comparisons through integer programming,” in BMC Bioinformatics. As the paper’s conclusion states:
The QUOTA-ALIGN algorithm screens a set of synteny blocks to retain only those compatible with a user specified ploidy relationship between two genomes. These blocks, in turn, may be used for additional downstream analyses such as identifying true orthologous regions in interspecific comparisons.
And, as mentioned and as you’ll see in this tip, the “QUOTA-ALIGN program is also integrated as a major component in SynMap (http://genomevolution.com/CoGe/SynMap.pl), offering easier access to thousands of genomes for non-programmers.”
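To make the screening idea concrete: given candidate synteny blocks, each with a score and an interval on each genome, the goal is to keep the highest-scoring subset that respects the user-specified quota. Below is a toy brute-force sketch for the simplest 1:1 case; QUOTA-ALIGN itself formulates this as an integer program and handles arbitrary x:y quotas for polyploidy relationships, and the block data here is invented for illustration:

```python
from itertools import combinations

# Each hypothetical block: (score, interval_on_genome_X, interval_on_genome_Y)
blocks = [
    (10, (0, 5), (0, 5)),
    (8, (0, 5), (10, 15)),   # conflicts with block 0 on genome X
    (6, (0, 5), (20, 25)),   # conflicts with blocks 0 and 1 on genome X
    (7, (10, 15), (0, 5)),   # conflicts with block 0 on genome Y
]

def overlaps(a, b):
    """Half-open interval overlap test."""
    return a[0] < b[1] and b[0] < a[1]

def compatible_1to1(subset):
    # With a 1:1 quota, no two retained blocks may overlap on either genome.
    return all(not overlaps(a[1], b[1]) and not overlaps(a[2], b[2])
               for a, b in combinations(subset, 2))

def screen(blocks):
    # Enumerate all subsets, keep the highest-scoring compatible one.
    # (Exponential; fine for a toy, which is why the real tool uses
    # integer programming.)
    best_score, best = 0, ()
    for r in range(1, len(blocks) + 1):
        for sub in combinations(blocks, r):
            if compatible_1to1(sub):
                s = sum(b[0] for b in sub)
                if s > best_score:
                    best_score, best = s, sub
    return best_score, best

score, kept = screen(blocks)
print(score, kept)  # blocks 1 and 3 are mutually compatible and win here
```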
Tang, H., Lyons, E., Pedersen, B., Schnable, J., Paterson, A., & Freeling, M. (2011). Screening synteny blocks in pairwise genome comparisons through integer programming BMC Bioinformatics, 12 (1) DOI: 10.1186/1471-2105-12-102
I was playing around with Google Scholar’s new citation feature, which let me collect my papers in one place easily (it worked pretty well, btw, save a few glitches; see below), when I noticed it missed a paper of mine from 2000: “Gene context conservation of a higher order than operons.” The abstract:
Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.
The uber-operon. It was my PI’s suggested term. Living and working in Germany at the time, I thought it was kind of funny. Anyway, I never really expanded more than another paper or so on that research and kind of lost track of whether that paper resulted in much. I typed ‘uber-operon’ into Google today and found that it’s been cited a few times (88) and, more interestingly, that a few databases of “uber-operons” have been built.
A Chinese research group created the Uber-Operon Database; the paper looks interesting, but unfortunately the server is down (whether temporarily or permanently, I do not know). The ODB (Operon Database) uses uber-operons (which they call reference operons) to predict operons in the database. Nebulon is another, and HUGO is another. There is also a chapter on computational methods for predicting uber-operons worth reading.
Just goes to show you, there’s a database for everything.
Oh, and back to Google Scholar citations. It did find nearly every paper I’ve published, though it missed two (including the one above) and had two false positives. Additionally, many citations are missing (like the 88 for this paper, and many for other papers). That’s not to say it isn’t useful; I find it a nice tool, but it’s not perfect. You can find out more about Google Scholar citations here, and about Microsoft’s similar feature here.
Oh, and does this post put me in the HumbleBrag Hall of Fame? If that’s reserved for Twitter, then maybe I should tweet this so I can get there :). (Though I’m not sure pointing out relatively small databases based on a relatively minor paper constitutes bragging, humbly or not, LOL.)
Today’s tip of the week is on SNPTips. We had a guest post on this earlier. We usually do tips on databases and analysis tools, but after getting our 23andme data, we’ve been using SNPTips often and thought it might be of use to some of our readers. SNPTips was created by 5am Solutions for 23andme* customers to easily view their genomic data while browsing the web. The tip will quickly show you how to install the browser extension and what it does. At the end of the tip, I briefly show a custom annotation track I created from my 23andme data using the UCSC Genome Browser’s** Personal Genome SNP format. The format is not a perfect fit for 23andme data (it doesn’t allow for an rsID field, has fields of little use with 23andme data, etc.), but it does help tremendously if you want to browse your data with the genome browser. You basically take the 23andme data that looks like this:
You can do this in a spreadsheet program like I did, but it’s a bit labor intensive. If I decide to do it for my daughter’s and husband’s genome data (which is a distinct possibility), I’d create a perl script to change the format (or maybe there is something already out there?).
It basically entails:
*eliminating the rsID column
*rearranging the columns to the correct order
*adding “chr” to the chromosome number
*adding four columns, one with the number of alleles and two with 0's (frequency data the 23andme data doesn’t have)
*changing the genotypes from xx to x and xy to x/y.
Remember also that the 23andme position data is from build 36 (2006, hg18) and the genotypes displayed in 23andme data are oriented with respect to the positive strand on the reference assembly.
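The steps above can be sketched in a few lines of code (shown here in Python rather than perl; the exact Personal Genome SNP details, such as the 0-based start coordinate and the zeroed frequency and score columns, are my reading of the format, so double-check against UCSC’s format description before loading your own data):

```python
def to_pgsnp(line):
    """Convert one tab-separated 23andme raw-data line
    (rsid, chromosome, position, genotype) to a Personal Genome SNP row."""
    rsid, chrom, pos, genotype = line.rstrip().split("\t")  # rsID is dropped
    pos = int(pos)  # 23andme positions are 1-based, build 36 (hg18)
    # genotype xx -> x, xy -> x/y
    if len(genotype) == 2 and genotype[0] == genotype[1]:
        alleles = genotype[0]
    else:
        alleles = "/".join(genotype)
    n_alleles = len(alleles.split("/"))
    zeros = ",".join("0" for _ in range(n_alleles))
    # chrom, start, end, alleles, allele count, frequencies, scores;
    # frequency and score columns are zeros since 23andme data lacks them
    return "\t".join(["chr" + chrom, str(pos - 1), str(pos),
                      alleles, str(n_alleles), zeros, zeros])

# hypothetical example lines, not real genotype calls
print(to_pgsnp("rs4477212\t1\t82154\tAA"))  # homozygous
print(to_pgsnp("rs1234567\t2\t500\tAG"))    # heterozygous
```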
It’s not the most elegant solution, but it works, and works nicely with SNPTips. It has been quite addictive for me :). I’m sure there are more elegant approaches.
*OpenHelix and its employees have no commercial connection to, or financial interest in, 5am Solutions or 23andme.
**UCSC sponsors tutorials and outreach with OpenHelix through a subgrant.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
- NIGMS Feedback Loop on “Maintaining ‘Legacy’ Scientific Resources”, which I think is an important discussion & effort. [Jennifer]
- BenchFly on “Hurdles to the Non-Research Career” – vote to see the results so far. [Jennifer]
- SpotXplore: a Cytoscape plugin for visual exploration of hotspot expression in gene regulatory networks http://bit.ly/bF4O7O via laylamichan [Mary]
- My favorite tweet this week brings cross-cultural awareness. I’d never heard either variant; we always just use the letters. [Mary]
- Crowd-sourced science research funding? Why not? Sciflies is an interesting idea. Curious to see how that works out. [Trey]
This week’s tip is a brief introduction to Galaxy Pages. These are special pages that users can create within the Galaxy system to annotate, describe and explain analyses done using Galaxy. Users can link to and embed histories, workflows and datasets, along with text, images and more, to fully annotate their analyses. As described last week, this is one of the many additions Galaxy has made to increase the reproducibility and transparency of genomics research.
Galaxy started out as a very useful tool for doing genomics research that is reproducible and sharable. One of my pet peeves in reading research papers that use genomic analyses or online genomics resources is the materials and methods section. Often the methods and parameters used are mentioned only in a cursory manner, if at all; I would not be able to reproduce the research. Addressing this, along with making analyses easy to do and share, is one of the fundamental purposes for which Galaxy was developed, and it does a pretty good job of it (I am a bit biased*).
The Galaxy developers have recently published a paper: “Galaxy: a comprehensive approach for supporting accessible, reproducible and transparent computational research in the life sciences” in Genome Biology.
There have been a couple of questions or functions I have felt Galaxy needed to address to better fulfill the goal of reproducible and transparent computational research. One thing we’ve been asked in workshops on Galaxy is how long ‘histories’ and ‘workflows’ will persist. The Galaxy developers have insisted these would persist indefinitely (as indefinite as an online world can be). In this paper, the developers answer that question with what seems to me a pretty good, broad approach to persistence:
We are pursuing three strategies to ensure that any Galaxy analysis and associated objects can be made easily and persistently accessible. First, we are developing export and import support so that Galaxy analyses can be stored as files and transferred among different Galaxy servers. Second, we are building a community space where users can upload and share Galaxy objects. Third, we plan to enable direct export of Galaxy Pages and analyses associated with publications to a long-term, searchable data archive such as Dryad.
Another feature, one I knew was coming but which is good to see in published form and on the beta site, is a community of tools and users. It’s mentioned in the quote above, but it’s more than that: it’s an extension of the ability to share histories and workflows:
To help users make better and faster choices within Galaxy, we are extending Galaxy’s sharing model to help the Galaxy user community find and highlight useful items. Ideally, the community will identify histories, workflows, and other items that represent best practices; best practice items can be used to help guide users in their own analyses.
The beta site gives you a look at what’s coming in the “Galaxy Tool Shed,” a place to upload, download and share tools to import into Galaxy installations. Hopefully this will eventually also include the ability to rate and discuss tools. Another aspect I’ll be looking forward to is the ability to share workflows to an open and broader community. Right now there is the excellent ability to share histories and workflows with other users within your network of colleagues, but I would like to see an open community to share and rate workflows. From the comment above, it seems that is coming. It will be a very welcome addition.
One last feature added I’d like to mention is pages:
Galaxy Pages (Figure 4) are the principal means for communicating accessible, reproducible, and transparent computational research through Galaxy. Pages are custom web-based documents that enable users to communicate about an entire computational experiment, and Pages represent a step towards the next generation of online publication or publication supplement. A Page, like a publication or supplement, includes a mix of text and graphs describing the experiment’s analyses. In addition to standard content, a Page also includes embedded Galaxy items from the experiment: datasets, histories, and workflows. These embedded items provide an added layer of interactivity, providing additional details and links to use the items as well.
I tried out Pages (click “User” at the top right of the page, then click “Pages”). I like the ability to write what is basically a materials and methods section for computational biology. You can describe what you did and embed histories, datasets and the like. Unfortunately, at the time of this writing I was able to build a page but unable to view it (server error; I used the latest versions of Safari and Firefox on Mac OS X 10.5). I am sure this is a temporary glitch.
Galaxy has been making huge progress in the last couple of years and looks poised to become a go-to tool for computational analysis for experimental biologists. In that vein, you might want to check out their introductory tutorial or screencasts to get acquainted with the tool!
*disclaimer: The Galaxy group contracts with OpenHelix to provide an introductory tutorial on Galaxy (free and open to all users).
Goecks, J., Nekrutenko, A., Taylor, J., & Galaxy Team, T. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences Genome Biology, 11 (8) DOI: 10.1186/gb-2010-11-8-r86
Ok, so this isn’t the same as our usual tips. But recently I was involved in an animal models project that led me to this resource on genomic pathology. The deeper I got into the project, the clearer it became that a tremendous amount of genomic data is coming, which is going to be great, but it will need to be paired with appropriate histology and pathology for a more complete understanding of the genomic biology.
All these model organism projects need quality pathology assessments: knockout mice or rats, mutant mice for cancer studies, inbred lines with specific characteristics and genomic regions like the Collaborative Cross, treated animals. There are phenotyping projects like Europhenome being done on large sets of animals, and they require not only standardized descriptions and ontologies but also image samples and evaluations. In an age when we all scan through genes and genomic regions in software, we have to have pathology data as well. And that data will also need to be standardized and stored in appropriate database resources for researchers to find and examine. I recently heard Dr. Robert Cardiff talk about his work on Pathobiology of the Mouse and how crucial it is to capture the information in standardized and searchable ways. He’s one of the drivers of this project, and fully understands the needs in this arena.
More people should be trained in pathology to examine these animals. So during this project I was impressed to find an online learning project that could be helpful for people who need to understand the foundations of animal research and be introduced to important aspects of pathology. The project has won an award for Outstanding Distance Learning (May 25). So as a public service to genomics, I point you to this UC Davis project.
You can have a look at the background and goals for this from the Center for Genomic Pathology site. From there you can click the navigation for UCD Information Session to get a taste of their course, or click on my image above. It’s a nice effort.
We have no relationships with UC Davis or this online learning project–we just thought it was a valuable and important component to genomics and wanted to talk about it.