You may have been hearing about the 1000 Genomes project. It’s one of the ongoing “big data” projects that is going to yield a great deal of variation information about the human genome. The goal is to sequence well over 1000 genomes to identify “most genetic variants that have frequencies of at least 1% in the populations studied”. They are doing this by sequencing large numbers of samples with 4x coverage. You can read more about their strategy on the About page of their web site, which also lists the anticipated sample populations.
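To get a feel for what 4x coverage means, here is a rough back-of-the-envelope sketch. It assumes reads land uniformly at random, so that per-base depth is roughly Poisson distributed. That is a common simplification, not something the project's pages state, and the function name is my own.

```python
import math

def depth_probability(mean_coverage, k):
    """Poisson probability that a base is covered by exactly k reads.

    Assumes reads are placed uniformly at random (a simplification).
    """
    return math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)

coverage = 4  # the "4x" figure quoted above

p_zero = depth_probability(coverage, 0)
p_at_most_one = p_zero + depth_probability(coverage, 1)

print(f"P(no reads at a base)    = {p_zero:.3f}")
print(f"P(two or more reads)     = {1 - p_at_most_one:.3f}")
```

Under this simple model a little under 2% of bases get no reads at all in any single 4x sample, which is one way to see why a low-coverage design leans on sequencing many samples.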
In this week’s Tip of the Week I’m going to take a quick spin through their browser. (You can also download all the data, but I’ll be focusing on the browser.) They have begun to release data now, and there are 6 individual sequences available at this time. These are part of their “pilot” studies. You can get some details on the pilot from their about page, which links to this PDF about the samples.
They are using the Ensembl framework to display their data. So if you are familiar with using Ensembl you’ll have some facility moving around this browser. One thing that isn’t apparent right away from the site is that you can click the Resembl link on the display to turn on a track that puts the read/coverage data on the viewer. I also liked the alignment display of all 6 genomes–but I’m sure that’s going to get challenging to view later with more and more genomes.
In an exchange with their very helpful help desk yesterday, I got this quick summary of the samples you’ll see:
For the high coverage populations NA12891, NA12892 and NA12878 are the CEU trio, NA19238, NA19239 and NA19240 are the YRI trio both father, mother, child respectively and both children were daughters.
If you have questions about their data, be sure to ask them for help; they were very speedy with answers for me.
Some of the project data has also been picked up by UCSC and you can access the same sequences in the UCSC Genome Browser in the Genome Variants track on the March 2006 human assembly. (You’ll also see Venter, Watson, and some other individual genomes there.)
In today’s tip I will introduce you briefly to the changes at NCBI’s Protein database. I highlighted that changes had been made in a Friday SNPets, and someone asked for more details. Our full updated tutorial will be much more complete than this short tip, so be watching for that to be completed in the near future – but for now, enjoy this tip & head over to NCBI to do some exploring of your own!
We’ve long been fans of the tools developed by the team responsible for MINT: Molecular INTeraction database. MINT is a curated resource full of experimentally verified protein-protein interactions, with some great visualization options. In addition to the main MINT interface, there are other aspects to the site that bring other types of visualization as well. We have done a tip on MINT in the past, but we wanted to re-visit this for our SciVee collection, and also mention a handy tool called Connect. Connect can be used to enter a list of up to ~100 proteins and generate the connection map between them.
HomoMINT: this tool extends the experimentally-verified interaction collection to include inferred interactions for human, based on data from model organisms. So this is homologous interactions, hence the name….
Domino: a look at the domains that are involved in the protein-protein interactions.
VirusMINT: this aspect of MINT explores viral proteins, including how they interact with host proteins to disrupt host physiology.
For this week’s tip I’ll focus mainly on the experimentally-verified portion of MINT and that interface, and introduce the others. You’ll see how to do a quick search, explore protein details, and then load up the network in the visualization tool. We have a full tutorial on MINT available to subscribers who want to go deeper into the functionality; we can only barely touch on the features within the time limit of our screencast movie.
Today’s tip is the continuation of researching a single SNP in an individual genome. Trey will use GVS, the Genome Variation Server at the University of Washington, to analyze a dbSNP rs ID of your choice and quickly find linkage disequilibrium information between a SNP of interest and other SNPs in the region. This 3 minute screencast will show you how to use the GVS tool to get this information for a wide range of populations.
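If linkage disequilibrium numbers are new to you, the sketch below shows how the two most commonly reported measures, |D'| and r², are defined for a pair of two-allele SNPs. It assumes you already have counts of the four phased haplotypes; the function name and inputs are my own illustration, and this is the textbook definition rather than a description of how GVS computes its values.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, abs_d_prime, r_squared) from four phased haplotype counts.

    A/a are the alleles at the first SNP and B/b those at the second.
    abs_d_prime and r_squared are None if either SNP is monomorphic.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("need at least one haplotype")

    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A
    p_B = (n_AB + n_aB) / total    # frequency of allele B

    # D: how far the observed haplotype frequency departs from independence
    D = p_AB - p_A * p_B

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return D, None, None       # one SNP does not vary in this sample

    # Largest |D| possible given the allele frequencies
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    abs_d_prime = abs(D) / d_max
    r_squared = D * D / denom
    return D, abs_d_prime, r_squared


# Two SNPs that always travel together versus two that are independent
print(ld_stats(50, 0, 0, 50))
print(ld_stats(25, 25, 25, 25))
```

Values of r² near 1 mean one SNP is a good stand-in for the other in that population, while values near 0 mean knowing one tells you little about the other.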
Last January I did a tip that featured the monthly Structural Genomics Update, which is essentially a newsletter and article collection from PSI Structural Genomics Knowledgebase (SGKB). However, the update is just one aspect of what SGKB offers. In today’s tip I want to feature the wonderfully efficient ways that the Structural Genomics Knowledgebase gives you to learn about the proteins that you are interested in. What I tried to stress in this tip is the different emphasis that the Structural Genomics Knowledgebase has compared to the RCSB PDB. The RCSB PDB is a GREAT resource, which we also have a free tutorial on, but it was created by and for structural biologists. Its displays feature angstroms, angles, conformers and more.
Me, I’m a Molecular Biologist by training & I think about proteins in terms of genomes, pathways, medical relevance, molecular functions, and the like. The SGKB thinks like me, and even organizes information and links into those sorts of categories. I really like how it presents protein information to me, and in the process how it eases me into thinking about the more ‘hard core’ structural details that I see in PDB. The tip is just a teaser taste of the SGKB – if I pique your interest, please do check out OpenHelix’s full, free introductory tutorial on the PSI-SGKB (sponsored by PSI SGKB) as well as the site itself. You never know, you might just learn to love a crystal! :^)
A couple of years back at a conference I was introduced to BioCatalogue. It seemed to me to be a really useful idea: locate bioinformatics tools and databases that are web-accessible, and that also have a mechanism to use the web service features to access the tool/server using strategies that don’t require the main web interface of the site. There are some introductions to the concept of web services out there–some of them are more for introduction, but most are aimed at programmers. Essentially it is kind of a back door into the tool, and lets you pull the information you need out in ways that you want–not constrained by the main user interface.
BioCatalogue is a curated collection of these web services. The creators of BioCatalogue provide the framework and perform some of the collection and annotation–but they also enable the user community to bring in web services and annotate them as well. This means that you can use BioCatalogue to find and learn more about the services, and you can feed back into the system as well if you join the community. If you are a software provider you can register your service there–so more people can locate you and learn about your project. Another really nice aspect of BioCatalogue is that they monitor the services. As we know at OpenHelix–plenty of times a tool you have accessed in the past is suddenly unavailable. Sometimes they are intermittent server problems, but sometimes they are longer-term issues. BioCatalogue is regularly checking the status of the tools so you can have confidence that the tool has been up and seems stable.
The Web Server issue (see the 2009 issue here) of Nucleic Acids Research provides a wealth of information about useful servers with bioinformatics tools. And there’s a paper for the 2010 Server issue about BioCatalogue that will offer more details on the background (linked below). In this week’s movie I can only briefly introduce the site and the features available. Check out the paper from the BioCatalogue team, and explore the documentation wiki to learn more about the features and functions that are provided.
Now, these web services are not for everyone. For many people the main user interface will still be the best mechanism to access a tool. But if you need more advanced or customized queries, or if you want to create inflows into your own tools, or if you want to use some of the cool work flow software that’s out there now (such as Galaxy or Taverna)–web services may be right for you.
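To make the “back door” idea a bit more concrete, here is a minimal sketch of what calling a web service programmatically tends to look like: you build a URL with your query parameters and fetch the result, instead of clicking through the site. The endpoint and parameter names below are hypothetical placeholders of my own, not a real BioCatalogue or tool URL, so check a service's own documentation for the real ones.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real service URL).
BASE_URL = "https://example.org/api/search"

def build_query_url(term, fmt="json", page=1):
    """Build a REST-style query URL; the parameter names are assumptions."""
    params = {"q": term, "format": fmt, "page": page}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url("protein interaction")
print(url)

# With a real service you could then fetch and parse the response, e.g.:
#   from urllib.request import urlopen
#   import json
#   with urlopen(url) as response:
#       results = json.load(response)
```

Because the request is just a URL, it is easy to loop over a list of genes or proteins, or to plug the same call into a workflow tool.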
Reference: Bhagat, J., Tanoh, F., Nzuobontane, E., Laurent, T., Orlowski, J., Roos, M., Wolstencroft, K., Aleksejevs, S., Stevens, R., Pettifer, S., Lopez, R., & Goble, C. (2010). BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Research. DOI: 10.1093/nar/gkq394
The last tip of the week I did was Genome Variation Tour I, where we started our journey following one SNP in an individual’s genome through various databases to see what we can find out about that variation. In that tip we started out by looking at a SNP in the CYP4F2 gene in the UCSC Genome Browser and followed it to dbSNP. Today’s tip will continue our journey to OMIM to see what information we can find there. We’ll find this variation is clinically associated with warfarin dosage effects, and specifically that this individual’s C/T heterozygosity indicates an intermediate dosage for effectiveness, if indeed he ever needed this drug. In some ways, your guess is as good as mine as to what we will find and what avenues we will be taking in the next few tips I’ll be doing. I am discovering information as I go along too. I can tell you, though, that the next installment of the genome variation tour will take us to PubMed, perhaps to a few not particularly well-known gem databases, and probably back to the UCSC Genome Browser to expand our look at the interactions of several variations in this individual’s genome.
If you haven’t noticed, we’ve started adding our tips of the week to SciVee and embedding them here. This allows you to view the video and share it on your web site or with friends. We’ve also created a “community” over at SciVee called “Genome Resource Training”, where we will be gathering all these video tutorial tips along with any other video tutorials we find at SciVee that train or introduce researchers to some of the huge number of resources out there (click the “SciVee” icon above to visit the community). We’ve got about 8 tips over there now and another 8 or so videos from other sources. This community will only grow! So, come check it out, join SciVee and join our community! Of course you will _always_ find our tips here and much more on this blog, so keep us in your feeds and bookmarks and don’t forget to get some in-depth training with our tutorials! We are looking forward to a long-term and expanding work with SciVee.
We also now have a Facebook page, where we will be posting these tips weekly with an occasional ‘general-interest’ genomics link or two. So please follow us there if you are on Facebook, and share with friends!
We are acutely aware of the thousands of bioinformatics resources out there, and we are often asked for guidance on finding a particular type of tool for some function or other. There are some excellent lists out there which attempt to catalog the various tools–the NAR Database Issue and corresponding list, the Resource Collection at the Univ. of Pittsburgh, and others. But recently we saw one developed with a specific focus, which claims to bring together over 200 resources for the mouse. The Mouse Resource Browser collects and categorizes a number of different types of things–not just databases, as we’ll see. Find them here: http://bioit.fleming.gr/mrb
The curated collection of sites that may be of use to mouse researchers has a number of features. The developers used a questionnaire to elicit some information from the resource providers, and when they don’t have that input they have created some basic information for the records themselves. You can do a basic search for resources with a quick search box. There is an advanced search option. I found the option of browsing by category (they have 22 categories) the most informative to figure out what kind of resources they had collected.
The data for a given record is organized across a series of tabs:
General: description, highlights and subject matter of the resource
Ontologies and Standards: if the resource relies on any of the important vocabularies or standards formats in the field, they are listed here
Technical: details of implementation, type of database, access methods, if there is a web services component, whether there are downloads or not
CASIMIR DDF: this is an interesting tab that assesses some of the features of the resources such as currency/updates, quality control process, versioning, technical documentation, user support, and more.
Although the focus is mouse, you’ll see some broader types of resources in there. For example, UCSC Genome Browser is listed, as there is a mouse database there. Reactome is listed. These cover a range of species that includes mouse, but are certainly not focused on mice. Other types of resources include commercial suppliers such as Charles River. So it isn’t limited just to things like sequence databases and things of that nature–it’s got more aspects that researchers employing mouse as a model system might find useful.
There are some choices they have made that I’m not sure I would have. They list the MGI mailing list as a separate feature from MGI. But as I thought more about it, I could see why. There is good information there, and if you don’t know of it already a pointer might help. But as I was thinking of the 200+ resources just for mouse, I thought that sort of affected the total.
If you use mouse as your model system, you will probably find some useful databases and other web sites that are handy for your work. If you don’t work with mice, there are probably still some useful resources for your work as well. Check out MRB’s site for more information: http://bioit.fleming.gr/mrb
Reference: Zouberakis, M., Chandras, C., Swertz, M., Smedley, D., Gruenberger, M., Bard, J., Schughart, K., Rosenthal, N., Hancock, J., Schofield, P., Kollias, G., & Aidinis, V. (2010). Mouse Resource Browser–a database of mouse databases. Database, 2010. DOI: 10.1093/database/baq010