Tag Archives: UniProt

Announcement of Updated Tutorial Materials: UniProt, Overview of Genome Browsers, and World Tour of Resources

As many of you know, OpenHelix specializes in helping people access and utilize the gold mine of public bioscience data in order to further research.  One of the ways that we do this is by creating materials to train people – researchers, clinicians, librarians, and anyone interested in science - on where to find data they are interested in, and how to access data at particular public databases and data repositories. We’ve got over 100 such tutorials on everything from PubMed to the Functional Glycomics Gateway (more on that later).

In addition creating these tutorials, we also spend a lot of time to keep them accurate and up-to-date. This can be a challenge, especially when lots of databases or resources all have major releases around the same time. Our team continually assesses and updates our materials and in this post I am happy to announce recently released updates to three of our tutorials: UniProt, World Tour, and Overview of Genome Browsers.

Our Introductory UniProt tutorial shows users how to: perform text searches at UniProt for relevant protein information, search with sequences as a starting point, understand the different types of UniProt records, and create multi-sequence alignments from protein records using Clustal.

Our Overview of Genome Browsers introduces users to introduce Ensembl, Map Viewer, UCSC Genome Browser, the Integrated Microbial Genomes (IMG) browser, and to the GBrowse software system. We also touch on WebGBrowse, JBrowse, the Integrative Genomics Viewer (IGV), the ARGO Genome Browser, the Integrated Genome Browser (IGB)GAGGLE, and the Circular Genome Viewer, or CGView.

Our World Tour of Genomics Resources is free and accessible without registration. It includes a tour of example resources, organized by categories such as Algorithms and Analysis tools, expression resources, genome browsers (both Eukaryotic and Prokaryotic/Microbial) , Literature and text mining resources, and resources focused on nucleotides, proteins, pathways, disease and variation. This main discussion will then lead into a discussion of how to find resources with the free OpenHelix Resource Search Portal, followed by learning to use resources with OpenHelix tutorials, and a discussion of additional methods of learning about resources.

Quick Links:

OpenHelix Introductory UniProt tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=77

OpenHelix Overview to Genome Browsers tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=65

Free OpenHelix World Tour of Genomics Resources tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119

 


NAR database issue (always a treasure trove)

The advance access release of most of the  NAR database issue articles is out. As usual, this this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing up additional more extensive blog posts on it and doing some tips of the week over the next couple months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links to go full text article): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology,  ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds and well over 100 more.

And I’ll give a special shout out to my former PI at EMBL because I can, Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE. (and he and a couple members are on the InterPro paper also).

This is going to be a wealth of information to wade through!

sciseekclaimtoken-4ec6d4e6da3c3

sciseekclaimtoken-4ec6cf9447e17

World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics from a world tour of resources (tutorial-free) and introductory UCSC  Genome Browser (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full schedule of the topics Mohammadia Workshop Schedule here (pdf).

As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, Moroccan Arabic, French or various African languages being their language of choice. Yet, they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic

The workshop students

learners. It was a delight to teach them.

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop we’d like to reiterate the answers here and seek some answers from our readers:

*One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also COGE for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
*Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools including dotplot analysis, as does EBI Emboss tool

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue with this is the complexity of the search space of dynamic programming solution, this slide set might help with the understanding, particularly slides 1-5 and 17-22. It is too computationally intensive. That said, the student might want to check out MSAProps and this list at Wikipedia.

Do our readers have any other guidance on this?

Teaching moment

* Another student asked  if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!

 

ANd a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech and later my family and I toured for 8 days visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara

 

 

 

 

On a Mission for Protein Information

It’s probably just the human brain’s ability to connect dots  &  find patterns, but it can be interesting how many “unrelated” events and information bits accumulate in my head & eventually get mulled into an idea or theory. Take, for example, a recent biotech mixer, bits from an education leadership series & a past Nature article – each “event” has been meandering in my mind and now they are finding their way out as this blog post.

OK, now the explanation: At a recent local biotech event I heard about a company (KeraNetics) purifying keratin proteins & using them to develop therapeutic and research applications. The company & their research sounded very interesting & because a lot of it is aimed at aiding wounded soldiers, it also sounded directly beneficial. The talk was short, only about 20 minutes, so there wasn’t a lot of time for details or questions. I decided I’d venture forth through many of the bioscience databases and resources that I know and love, in order to learn more about keratin.

My quest was both fun and frustrating because of the nature of the beast – keratin is “well known” (i.e. it comes up in high school academic challenge competitions ‘a lot’, according to someone in the know), but is hard to work with (i.e. tough, insoluble, fibrous structural proteins) that is hard to find much general information on in your average protein database (because it is  made of many different gene products, all referred to as “keratin”). I decided to begin my adventure at two of my favorite protein resources, PDB & SBKB, but I found no solved structures for keratin. Because of the way model organism databases are curated and organized, I often begin a protein search there, just to get some basic background, gene names, sequence information, etc. I (of course) found nothing other than a couple of GO terms in the Saccharomyces Genome Database (SGD), but I found hundreds of results in both Mouse Genome Informatics (MGI) (660 genomic features) and Rat Genome Database (RGD) (162 rat genes, 342 human genes). I also found gene names (Krt*), sequences and many summary annotations with references to diseases with links to OMIM. When I queried for “keratin”, in OMIM I got 180 hits, including 61 “clinical synopsises”, in UniProt returned 505 reviewed entries and 2,435 unreviewed entiries, in Entrez Protein 10,611 results and in PubMed 26,430 articles with 1,707 reviews. I got my curiosity about KeraNetics’ research sated by using a PubMed advanced search for Keratin in the abstract or title & the PI’s name as author (search = “(keratin[Title/Abstract]) AND Van Dyke[Author]“).

I ended up with a lot of information leads that I could have hunted through, but it was a fun process in which I learned a lot about keratin. This is where the education stuff comes in. I’ve been seeing a lot of studies go by talking about reforming education to be more investigation driven, and I can totally see how that can work. “Learning” through memorization & regurgitation is dry for everyone & rough for the “memory challenged”, like me. Having a reason or curiosity to explore, with a new nugget of data or understanding lurking around each corner, the information just seems to get in better & stay longer. (OT, but thought I’d mention a related site that I found today w/ some neat stuff: Mind/Shift-How we will learn.)

And I could have done the advanced PubMed search in the beginning, but what fun would that have been? Plus there is a lot that I learned about keratin from what I didn’t find, like that there wasn’t a plethora of PDB structures for keratin proteins. That brings me to the final dot in my mullings – an article that I came across today as I worked on my reading backlog: “Too many roads not taken“. If you have a subscription to Nature you can read it, but the main point is that researchers are still largely focusing on the same set of proteins that they have been for a long time, because these are the proteins for which there are research tools (antibodies, chemical inhibitors, etc). This same sort of philosophy is fueling the Protein Structure Initiative (PSI) efforts, as described here. Anyway, I found the article interesting & agree with the authors general suggestions. I would however extend it beyond these physical research tools & say that going forward researchers need more data analysis tools, and training on how to use them – but I would, wouldn’t I? :)

References:

  • Sierpinski P, Garrett J, Ma J, Apel P, Klorig D, Smith T, Koman LA, Atala A, & Van Dyke M (2008). The use of keratin biomaterials derived from human hair for the promotion of rapid regeneration of peripheral nerves. Biomaterials, 29 (1), 118-28 PMID: 17919720
  • Edwards, A., Isserlin, R., Bader, G., Frye, S., Willson, T., & Yu, F. (2011). Too many roads not taken Nature, 470 (7333), 163-165 DOI: 10.1038/470163a

Video Tip of the Week: VnD Resource for Genetic Variation and Drug Information


In today’s tip I am going to feature a resource that I found recently. I’ve been updating our dbSNP tutorial, which Mary & Trey will be presenting at workshops in Morocco, and also our free PDB tutorial, which is sponsored by the RCSB PDB team. I have therefore been thinking about protein structures and small sequence variations a lot lately. As I explored the latest Database issue of NAR looking for resources to do a tip on, I found an article describing the VnD (genetic Variation and Drug) resource, which can also be accessed at the URL www.vandd.org, according to the NAR article. The article is “VnD: a structure-centric database of disease-related SNPs and drugs“, and figure one shows a veritable Who’s Who of protein, variation and disease resources, so I had to investigate.

What I found at VnD made me sure that this was a resource that I wanted to feature in a tip. VnD is from the Korean Bioinformation Center, or KOBIC, who has a list of databases and tools that they provide. I’ll save the rest of the KOBIC resources for another post & concentrate on VnD here. Compiling data from resources such as RefSeq, OMIM, UniProt, PDB, DrugBank, dbSNP, GAD and more might have been cool enough, depending on how it was done, but the VnD also does their own structure modeling analysis on how the variation affects the protein structure and drug/ligand binding.

This tip movie isn’t long enough to really show you the breadth of what is available from the VnD, but I hope it will be enough to encourage you to read the NAR article (listed below), and to check out VnD. One thing to note: don’t expect to find every dbSNP rs# over there – one that I’ve been using in our tutorial isn’t over there. They are specifically interested in variations within genes that might effect drug binding. But hey, you can’t query DrugBank with rs#s, and I’ve never seen the structure modeling done like VnD, so it is a worthy resource that you may want to investigate if you are interested in how genetic variations connect with disease and drug therapies.

Quick links:

VnD: Variations and Drugs resource -  http://vnd.kobic.re.kr:8080/VnD/index.jsp

Korean Bioinformation Center (KOBIC) – http://www.kobic.re.kr/

RCSB PDB – http://www.pdb.org

OpenHelix Tutorial on the RCSB PDB – http://www.openhelix.com/pdb

dbSNP: Short Genetic Variations, from NCBI -  http://www.ncbi.nlm.nih.gov/projects/SNP/

OpenHelix Tutorial on NCBI’s dbSNP – http://www.openhelix.com/cgi/tutorialInfo.cgi?id=39

For links to other resources and OpenHelix tutorials mentioned in this post, please see our catalog of resources – http://www.openhelix.com/cgi/tutorials.cgi

Reference:
Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010). VnD: a structure-centric database of disease-related SNPs and drugs Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq957

Tip of the Week: From UniProt to the PSI SBKB and Back Again


It is often beneficial to visit multiple biomedical databases or resources, even if they seem to provide overlapping  information because no two resources focus on the exact same information, or present it in exactly the same way. Instead of duplicating each others’ curation efforts, database often link out to related information at other resources. You can think of these links as “social connections”, if you want and in today’s tip I want to show you a couple of connections between protein information resources, including a new connection that really features some of the core value of the PSI’s Structural Biology Knowledgebase, or SBKB.

I begin the tip at the UniProtKB, where I search for a UniProt ID number. From the resulting protein report I first briefly show you how to link out to a corresponding RCSB PDB report, where you can find high quality protein structure information and more. If you are interested in learning more about the RCSB PDB & how to use it, please check out OpenHelix’s full, free tutorial that is sponsored by the RCSB PDB.

From there I return to the UniProt report and demonstrate a new link out option that links to protein protocols, available materials, as well as information about theoretical models and predicted protein targets from the SBKB. I don’t have time to show it but a recent update to the SBKB allows users to now search the Structure Biology Knowledgebase with a UniProt accession number. These searches provide users with additional information including protein structure information and information about pre-released structure sequence. As with the RCSB PDB, we have a free tutorial on the SBKB that is sponsored by the Protein Structure Initiative.

As I scroll through the UniProt protein report users will see information and links for a wide variety of bioscience resources. OpenHelix, as I’m sure many of you are aware, has tutorials on how to use many of these resources. Our tutorials on the RCSB PDB and the PSI SBKB are both free. Our tutorials on UniProt and many other resources are available through a subscription to our database of trainings or through purchase of individual access. Whether you learn the resources through our tutorials, through the references I list below, or through your own explorations of the databases, there really is an amazing amount of information available through these interlinked, publicly-funded resources – please make use of them in your research!

Quick Links:

UniProt Knowledgebase -  http://www.uniprot.org/

OpenHelix Tutorial on UniProt – http://www.openhelix.com/cgi/tutorialInfo.cgi?id=77

RCSB PDB – http://www.pdb.org

OpenHelix Tutorial on the RCSB PDB – http://www.openhelix.com/pdb

The Protein Structure Initiative Structural Biology Knowledgebase (SBKB) -  http://www.sbkb.org/

OpenHelix Tutorial on the SBKB – http://www.openhelix.com/sbkb

Catalog of all OpenHelix tutorials – http://www.openhelix.com/cgi/tutorials.cgi

References:
The UniProt Consortium. (2009). The Universal Protein Resource (UniProt) in 2010 Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp846

Rose, P., Beran, B., Bi, C., Bluhm, W., Dimitropoulos, D., Goodsell, D., Prlic, A., Quesada, M., Quinn, G., Westbrook, J., Young, J., Yukich, B., Zardecki, C., Berman, H., & Bourne, P. (2010). The RCSB Protein Data Bank: redesigned web site and web services Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1021

Berman, H., Westbrook, J., Gabanyi, M., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L., Kopp, J., Podvinec, M., Adams, P., Carter, L., Minor, W., Nair, R., & Baer, J. (2009). The protein structure initiative structural genomics knowledgebase Nucleic Acids Research, 37 (Database) DOI: 10.1093/nar/gkn790

Many Protein Resources Have Recently Announced Updates

PDB structure 3rg9

 

 

 

 

In our ongoing pursuit of up-to-date tutorials, I’ve been tracking changes that are occurring at resources and planning our updates accordingly. Protein resources are especially going to keep me out of trouble this summer, because their developers and curators have been busy! I’ve compiled a short synopsis below, and would appreciate comments on any other resources you know about, or want to brag about! :)

  • I featured the ExPASy list of proteomic tools in a past tip. As of  Tuesday this list is no longer being kept up-to-date, but the ExPASy resource has been expanded beyond being “just” a proteomics resource and is now the new SIB Bioinformatics Resource Portal. According to its developers, the portal:

    “provides access to scientific databases and software tools in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc. … On this portal you find resources from many different SIB groups as well as external institutions.”

    And never fear, there is still an up-to-date list of proteomics tools found here.

  • I mentioned in my tip last week that NCBI’s MMDB has undergone an update & I’ll be updating our tutorial on it soon.
  • NCI/Nature Pathway Interaction Database, or PID, had an update June 14th that includes new and updated pathway information.
  • PROSITE had an update June 21st, which is Release 20.73, and now includes 1618 documentation entries, 1308 patterns, 936 profiles and 925 ProRules.
  • The RCSB PDB resource has announced updates to their Browse Database function, enhanced sequence displays from structure summary pages and the PDB-101 educational resource available from blackboard logos on PDB pages. For more details on using PDB, please see our free PDB Introductory tutorial sponsored by the RCSB.
  • STRING’s 9.0 release is now available, and we’ll be looking into anything we need to update in our tutorial as a result.
  • UniProt released an update June 28th that included a major update on many bacterial and archaeal Type II Toxin-Antitoxin modules, as is described here.

Enjoy all the new information – I know I will! :)

Tip of the Week: WAVe, Web Analysis of the Variome

Today’s Tip of the Week is a short introduction to WAVe, or the Web Analysis of the Variome. The tool was recently introduced to us, and I’ve found it a welcome introduction to the tools available to the researcher to analyze human variation. This is apropos considering the recent paper we’ve been discussing on the clinical assessment of a personal genome (here, here and here) and that papers implications for personalized medicine and the use of online variation resources. WAVe also has introduced me to some additional tools I’ve either not been aware of, or haven’t used, which might be of use such as: LOVD (Leiden Open Variation Database), QuExT (Query Expansion Tool, also from the same developers as WAVe), and others. Of course there are also database information pulled in from Ensembl, Reactome, KEGG, InterPro, PDB, UniProt, NCBI and many others. Take some time to check it out.

Personal Genomics, clinical assessment and online resources

ResearchBlogging.orgThe Lancet paper, Clinical assessment incorporating a personal genome, has held my fascination this weekend (yes, I read it at the beach). Mary posted Friday and again Saturday on the paper and related NPR segment. It feels to me to be a seminal paper, though I do agree with Daniel at Genetic Future, there are a lot there we still don’t know. A large portion of the variation is in non-coding regions, and thus predictions and propensities are hard to come by with the available analysis. In fact, as he pointed out, many of the coding region variations have little information as to their effect on disease. I would add also that even if we get to that holy grail of $1,000 to sequence a personal genome, this kind of extensive analysis would still be time and cost-prohibitive for the vast majority of sequenced genomes.

Yet, as with all early steps in science and medicine, there’s missing pieces, large gaps and huge efforts (think “space travel,” “computers,” “microwave ovens,” “internet,”) that over time become inexpensive and commonplace (ok, so the former isn’t necessarily “inexpensive”). Sequencing genomes will become inexpensive before the analysis does, but both will come. And I think this paper is pointing to that future.

The other hurdle to large scale personal genomics I see (of course) is the understanding and use of the genomics and data resources. The authors use a large (and excellent, in my opinion) suite of genomics resources to do obtain data and do their analysis. I’ll list them here with links in alphabetical order:

dbSNP (T)
GVS (T)
HapMap (T)
HGMD
OMIM (T)
PharmGKB
PolyPhen
PubMed (T)
SIFT
UniProt (T)

All of these resources have a wealth of data, but even then, that is a lot of analysis and familiarization that is needed with each tool. Each tool does have documentation and tutorials, and of course OpenHelix has tutorials on many of the ones mentioned (those with linked “T”s after the name). Still, this one analysis took a large number of tools and familiarization.

The paper does have a pretty good figure (figure 1) outlining the analysis process. For example, they SIFTed the genome to find gene-associated, non-synonymous, rare and novel and disease associated variations and then analyzed those using dbSNP, HGMD, OMIM and PubMed to analyze something like HFE2 which might have an association with Haemochromotosis. One of my quibbles with the paper, as often is with these papers, is that there isn’t a good methods ‘walk-through’ of the paper using something like Galaxy or Taverna in a history or workflow that would help reproduce the analysis.

We also have a tutorial I’d like to point you to, one that walks through a similar process and teaches users the basics of walking through that process. You can find this tutorial here, it’s free and publicly available. The tutorial walks the user through the analysis of a gene variation, in this case in the CYPC9 that effects an individual’s response to Warfarin. There is a similar variation (different gene, affects same drug response) in the paper. The tutorial uses the NIEHS SNPs site to get an overview of the variation including SIFT and PolyPhen predictions, then to the UCSC Genome Browser to find an overview of the region, walks through the dbSNP information and does a quick tag SNP analysis using GVS. That tutorial is only one very small step in what will have to be a immense education into genomics and genomics resources.

That is all to point out that the paper is an fascinating first step, and as a first step suggests the gaping holes we will have in bringing personal genomics to medicine.

Ashley, E., Butte, A., Wheeler, M., Chen, R., Klein, T., Dewey, F., Dudley, J., Ormond, K., Pavlovic, A., & Morgan, A. (2010). Clinical assessment incorporating a personal genome The Lancet, 375 (9725), 1525-1535 DOI: 10.1016/S0140-6736(10)60452-7

Margaret Dayhoff, a founder of the field of bioinformatics

If you are a biomedical researcher, have you ever used protein databases like UniProt to get information about proteins that you are interested in?  Do you know how that database got there?  I don’t mean today, I mean decades ago—how did a resource like this come to even exist at all?  When researchers search a protein database or align amino acid sequences, frequently they’ll come across a name helped start it all years ago. Margaret Dayhoff was one of the people that pioneered this crucial functionality, a true founder in the field of bioinformatics.  But in some histories and timelines of bioinformatics she barely gets a mention.  To celebrate Ada Lovelace Day, I’m going to introduce you to Dr. Dayhoff and I hope to raise awareness of her important fundamental contributions to the field of bioinformatics.

Because we can access all the protein information we can stand with a few keystrokes today, it is easy to forget that this data 1) didn’t always exist, and 2) when it did exist, it wasn’t easy to find and work with.  In the 1960s, only a handful of protein sequences were known.  But it was clear that more of this data would be incredibly useful in a number of ways, and was certainly going to be generated at an increasingly faster pace.  And soon it would overwhelm any one person’s ability to analyze and retain.  DNA sequences…don’t even go there yet….

use_of_computersBut there were some prepared minds ready to begin thinking about these data and the associated opportunities around them.  They were also aware that computers might help with these problems.  Robert Ledley was one of them.  Ledley had trained as a dentist, but obtained a degree in physics and became increasingly interested in the possibilities of applying computational resources to biomedical problems.  A report authored by Ledley is one of the earliest studies of biomedical computation, and can be viewed on Google Books today.

Working with Ledley at the National Biomedical Research Foundation was a woman named Margaret Dayhoff.  With an undergraduate degree in mathematics and graduate studies in chemistry, Dayhoff had pioneered work with punch cards and data processing machines to evaluate molecular resonance energies of organic molecules.  She obtained a Watson Computing Laboratory Fellowship to pursue the work to complete her PhD, which is described by a biographer as:

The process was iterative and required manually carrying cards from one type of machine to another (4 types), as no single machine could do the whole iteration. Convergence was slow and several months could be required for a result. Continue reading