Tag Archives: UniProt


Video Tip of the Week: UniProt updates, now including portable BED files

UniProt is one of the core resources that provides tremendously important curated information about proteins. You will find links to UniProt in lots of other tools and databases as well, but we’ve always championed going directly there for the full look at the wide range of information they offer. Their foundation remains solid, but they also continue to add new and useful features over time. Recently they had a webinar to describe some of the new things, and the recording of that webinar will be this week’s Video Tip of the Week.

The video starts with an overview of the whole UniProt site. The core of their great resource is the same, of course. UniProtKB, UniRef, and UniParc are there for various ways to look across the data. The handy Proteomes collection of the proteins in a given species is available, and they also have reference proteomes from that access point. There’s a short section in the video that’s a guide to the basic search functions.

About 9 minutes in they introduce the UniRule annotation features. When certain conditions are met, an annotation gets applied to a protein, which you can trace from the protein pages by clicking on the UniRule link for that annotation. And their software offers a very cool way to see how and when conditions are applied. It loads a decision flow path and highlights which logic rules were used in that particular case, so you can trace it and understand how a protein got a given annotation. That’s what I illustrate in the screen shot here.

At about 14 minutes, the topic changes to the new Genome Annotation Tracks. They now offer you a way to take their annotations for a UniProtKB entry and use them with a separate genome browser. They hand you BED or BigBed files for different features. You can also load the whole thing as a Hub file to see all the sequence feature data at once. The files are species-specific and started with human, but others are coming. You can access them from the “Downloads” area of the homepage. The video also describes a bit about the structure there. So you could take these files to Ensembl or the UCSC Genome Browser and load them, with all the UniProt features now available to compare to the existing genomic context at those browsers. They illustrate how you can look at the “active site” annotations, but you can also look at post-translational modification sites, domains, etc. This was a feature that was new to me, and looks like a terrific idea.
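If you would rather look inside one of these BED files yourself before loading it into a browser, a few lines of Python are enough. This is only a minimal sketch: it assumes the standard tab-separated BED layout (chromosome, 0-based start, exclusive end, optional name), and the sample lines and feature names below are invented for illustration, not taken from a real UniProt download. The helper names `parse_bed` and `features_overlapping` are hypothetical too.

```python
from typing import Iterable, Iterator, List, NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def parse_bed(lines: Iterable[str]) -> Iterator[BedFeature]:
    """Yield features from BED text, skipping header and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        name = fields[3] if len(fields) > 3 else ""
        yield BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


def features_overlapping(features: Iterable[BedFeature],
                         chrom: str, start: int, end: int) -> List[BedFeature]:
    """Return features that overlap the half-open region [start, end)."""
    return [f for f in features
            if f.chrom == chrom and f.start < end and f.end > start]


# Invented sample data (hypothetical names and coordinates).
SAMPLE_BED = """track name="hypothetical_active_sites"
chr1\t1000\t1003\tsite_A
chr1\t5000\t5003\tsite_B
chr2\t200\t203\tsite_C
"""

if __name__ == "__main__":
    feats = list(parse_bed(SAMPLE_BED.splitlines()))
    for hit in features_overlapping(feats, "chr1", 900, 1100):
        print(hit)
```

Real files will usually carry more columns (score, strand, block structure and so on); the sketch simply ignores anything past the name column.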

So even if you think you know UniProt, check out these new options for additional ways to interact with the high-quality information they provide. Good stuff.

Quick links:

UniProt: http://www.uniprot.org/


The UniProt Consortium (2014). UniProt: a hub for protein information Nucleic Acids Research, 43 (D1) DOI: 10.1093/nar/gku989

What’s The Answer? (proteins without genes in the dbs)

This week’s highlighted discussion offers a peek at some odd situations in public databases. Sometimes there are things missing that you can’t quite figure out. I thought the exploration of why this happens was interesting and informative about working with databases.

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This week’s highlighted issue at Biostars is one of the ones that can be really mystifying to encounter. But because of the way databases are curated, sometimes there are odd situations that don’t make sense at first glance. Sometimes these are real bugs–but other times they are decisions that had to be made to accommodate some strange feature of biology that doesn’t align with a database configuration.

Question: Proteins without genes ? Is that even possible ?

Hello all,

I am looking at some mass-spec data.
I found several fragments mapping to Ig heavy chain V-II region WAH protein and want to find corresponding gene.

Example http://www.uniprot.org/uniprot/P01770

Uniprot says the gene name as “NULL”. Is this an annotation error or any special aspect of Ig regions am missing ? I want to map several proteins with these type of names to genes.

  • Cluster of Ig heavy chain V-I region HG3
  • Cluster of Ig heavy chain V-II region SESS
  • Cluster of Ig heavy chain V-III region BRO
  • Cluster of Ig lambda chain V-I region NEW
  • Cluster of Ig lambda chain V-II region BUR
  • Ig heavy chain V-II region WAH
  • Ig heavy chain V-III region BUT
  • Ig heavy chain V-III region GAL
  • Ig heavy chain V-III region NIE
  • Ig heavy chain V-III region WEA
  • Ig kappa chain V-I region Kue
  • Ig kappa chain V-I region Wes
  • Ig kappa chain V-III region VG (Fragment)
  • Ig lambda chain V-III region LOI
  • Ig lambda chain V-III region SH
  • Ig lambda chain V-V region DEL

How can I map these to corresponding gene names ?

Thoughts ?

Khader Shameer

Having been involved in curation, I can see how this transpired. But there was a great answer from the UniProt folks themselves in the thread. And input from others too. I thought the discussion was fascinating. Go have a look at the outcome.


Announcement of Updated Tutorial Materials: UniProt, Overview of Genome Browsers, and World Tour of Resources

As many of you know, OpenHelix specializes in helping people access and utilize the gold mine of public bioscience data in order to further research.  One of the ways that we do this is by creating materials to train people – researchers, clinicians, librarians, and anyone interested in science - on where to find data they are interested in, and how to access data at particular public databases and data repositories. We’ve got over 100 such tutorials on everything from PubMed to the Functional Glycomics Gateway (more on that later).

In addition to creating these tutorials, we also spend a lot of time keeping them accurate and up-to-date. This can be a challenge, especially when lots of databases or resources all have major releases around the same time. Our team continually assesses and updates our materials, and in this post I am happy to announce recently released updates to three of our tutorials: UniProt, World Tour, and Overview of Genome Browsers.

Our Introductory UniProt tutorial shows users how to: perform text searches at UniProt for relevant protein information, search with sequences as a starting point, understand the different types of UniProt records, and create multi-sequence alignments from protein records using Clustal.

Our Overview of Genome Browsers introduces users to Ensembl, Map Viewer, UCSC Genome Browser, the Integrated Microbial Genomes (IMG) browser, and the GBrowse software system. We also touch on WebGBrowse, JBrowse, the Integrative Genomics Viewer (IGV), the ARGO Genome Browser, the Integrated Genome Browser (IGB), GAGGLE, and the Circular Genome Viewer, or CGView.

Our World Tour of Genomics Resources is free and accessible without registration. It includes a tour of example resources, organized by categories such as Algorithms and Analysis tools, expression resources, genome browsers (both Eukaryotic and Prokaryotic/Microbial), Literature and text mining resources, and resources focused on nucleotides, proteins, pathways, disease and variation. This main discussion then leads into a discussion of how to find resources with the free OpenHelix Resource Search Portal, followed by learning to use resources with OpenHelix tutorials, and a discussion of additional methods of learning about resources.

Quick Links:

OpenHelix Introductory UniProt tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=77

OpenHelix Overview to Genome Browsers tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=65

Free OpenHelix World Tour of Genomics Resources tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119


NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing more extensive blog posts on it and doing some tips of the week over the next couple of months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full-text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds and well over 100 more.

And I’ll give a special shout out to my former PI at EMBL, because I can: Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (and he and a couple of group members are on the InterPro paper also).

This is going to be a wealth of information to wade through!

UCSC Genome Browser: http://genome.ucsc.edu
Ensembl: http://www.ensembl.org/
UniProt: http://www.uniprot.org/
MINT: http://mint.bio.uniroma2.it/mint/
SMART: http://smart.embl.de/
WormBase: http://www.wormbase.org/
Gene Ontology: http://www.geneontology.org/
ENCODE: http://genome.ucsc.edu/ENCODE/
KEGG: http://www.kegg.jp
UCSC Archaeal Browser: http://archaea.ucsc.edu/
IMG: http://img.jgi.doe.gov/cgi-bin/w/main.cgi
DBTSS: http://dbtss.hgc.jp/
InterPro: http://www.ebi.ac.uk/interpro




World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane, Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics, from a world tour of resources (tutorial-free), introductory UCSC Genome Browser (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full schedule of topics here: Mohammadia Workshop Schedule (pdf).

As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, with Moroccan Arabic, French or various African languages being their language of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some additional answers from our readers:

*One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also COGE for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
*Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools including dotplot analysis, as do the EBI EMBOSS tools

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space for a dynamic programming solution: it quickly becomes too computationally intensive. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.

Do our readers have any other guidance on this?
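To put rough numbers on why exact dynamic programming is impractical for multiple alignment, here is a small back-of-the-envelope sketch. It assumes the textbook formulation in which aligning N sequences of length L means filling an N-dimensional table with (L + 1)^N cells, where each interior cell looks back at 2^N - 1 neighboring cells. The function names are made up for this illustration, and the sequence length of 300 is just an example.

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in the N-dimensional DP table for sequences of equal length."""
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences: int) -> int:
    """Neighbor cells an interior cell must consult: every non-empty
    subset of sequences can advance by one residue in a single step."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    length = 300  # example protein length, chosen arbitrarily
    for n in range(2, 7):
        work = dp_cells(n, length) * predecessors_per_cell(n)
        print(f"{n} sequences: about {work:.2e} cell updates")
```

Two sequences are easy, but each additional sequence multiplies the table size by roughly the sequence length, which is why practical tools fall back on heuristics such as progressive alignment.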

Teaching moment

* Another student asked  if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!


And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech, and later my family and I toured for 8 days, visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara





On a Mission for Protein Information

It’s probably just the human brain’s ability to connect dots  &  find patterns, but it can be interesting how many “unrelated” events and information bits accumulate in my head & eventually get mulled into an idea or theory. Take, for example, a recent biotech mixer, bits from an education leadership series & a past Nature article – each “event” has been meandering in my mind and now they are finding their way out as this blog post.

OK, now the explanation: At a recent local biotech event I heard about a company (KeraNetics) purifying keratin proteins & using them to develop therapeutic and research applications. The company & their research sounded very interesting & because a lot of it is aimed at aiding wounded soldiers, it also sounded directly beneficial. The talk was short, only about 20 minutes, so there wasn’t a lot of time for details or questions. I decided I’d venture forth through many of the bioscience databases and resources that I know and love, in order to learn more about keratin.

My quest was both fun and frustrating because of the nature of the beast: keratin is “well known” (i.e. it comes up in high school academic challenge competitions ‘a lot’, according to someone in the know), but it is hard to work with (keratins are tough, insoluble, fibrous structural proteins) and it is hard to find much general information on it in your average protein database (because it is made of many different gene products, all referred to as “keratin”). I decided to begin my adventure at two of my favorite protein resources, PDB & SBKB, but I found no solved structures for keratin. Because of the way model organism databases are curated and organized, I often begin a protein search there, just to get some basic background, gene names, sequence information, etc. I (of course) found nothing other than a couple of GO terms in the Saccharomyces Genome Database (SGD), but I found hundreds of results in both Mouse Genome Informatics (MGI) (660 genomic features) and Rat Genome Database (RGD) (162 rat genes, 342 human genes). I also found gene names (Krt*), sequences and many summary annotations with references to diseases and links to OMIM. When I queried for “keratin”, OMIM returned 180 hits, including 61 clinical synopses; UniProt returned 505 reviewed entries and 2,435 unreviewed entries; Entrez Protein returned 10,611 results; and PubMed returned 26,430 articles, including 1,707 reviews. I sated my curiosity about KeraNetics’ research by using a PubMed advanced search for keratin in the abstract or title & the PI’s name as author (search = “(keratin[Title/Abstract]) AND Van Dyke[Author]”).
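For anyone who wants to script a search like that last one instead of typing it into the PubMed search box, here is a minimal sketch. It only builds the request URL for NCBI’s E-utilities ESearch service and does not send it. The helper names `pubmed_term` and `esearch_url` and the `retmax` value of 20 are my own choices for illustration; please check NCBI’s usage guidelines before sending requests in bulk.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_term(keyword: str, author: str) -> str:
    """Build a PubMed query: keyword in title/abstract plus an author name."""
    return f"({keyword}[Title/Abstract]) AND {author}[Author]"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for PubMed; retmax caps the number of IDs returned."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    term = pubmed_term("keratin", "Van Dyke")
    print(term)
    print(esearch_url(term))
```

Pasting the printed URL into a browser should return an XML list of matching PubMed IDs, which you can then look up individually.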

I ended up with a lot of information leads that I could have hunted through, but it was a fun process in which I learned a lot about keratin. This is where the education stuff comes in. I’ve been seeing a lot of studies go by talking about reforming education to be more investigation driven, and I can totally see how that can work. “Learning” through memorization & regurgitation is dry for everyone & rough for the “memory challenged”, like me. Having a reason or curiosity to explore, with a new nugget of data or understanding lurking around each corner, the information just seems to get in better & stay longer. (OT, but thought I’d mention a related site that I found today w/ some neat stuff: Mind/Shift-How we will learn.)

And I could have done the advanced PubMed search in the beginning, but what fun would that have been? Plus there is a lot that I learned about keratin from what I didn’t find, like that there wasn’t a plethora of PDB structures for keratin proteins. That brings me to the final dot in my mullings: an article that I came across today as I worked on my reading backlog, “Too many roads not taken”. If you have a subscription to Nature you can read it, but the main point is that researchers are still largely focusing on the same set of proteins that they have been for a long time, because these are the proteins for which there are research tools (antibodies, chemical inhibitors, etc). This same sort of philosophy is fueling the Protein Structure Initiative (PSI) efforts, as described here. Anyway, I found the article interesting & agree with the authors’ general suggestions. I would however extend it beyond these physical research tools & say that going forward researchers need more data analysis tools, and training on how to use them – but I would, wouldn’t I? :)


  • Sierpinski P, Garrett J, Ma J, Apel P, Klorig D, Smith T, Koman LA, Atala A, & Van Dyke M (2008). The use of keratin biomaterials derived from human hair for the promotion of rapid regeneration of peripheral nerves. Biomaterials, 29 (1), 118-28 PMID: 17919720
  • Edwards, A., Isserlin, R., Bader, G., Frye, S., Willson, T., & Yu, F. (2011). Too many roads not taken Nature, 470 (7333), 163-165 DOI: 10.1038/470163a

Video Tip of the Week: VnD Resource for Genetic Variation and Drug Information

In today’s tip I am going to feature a resource that I found recently. I’ve been updating our dbSNP tutorial, which Mary & Trey will be presenting at workshops in Morocco, and also our free PDB tutorial, which is sponsored by the RCSB PDB team. I have therefore been thinking about protein structures and small sequence variations a lot lately. As I explored the latest Database issue of NAR looking for resources to do a tip on, I found an article describing the VnD (genetic Variation and Drug) resource, which, according to the NAR article, can also be accessed at the URL www.vandd.org. The article is “VnD: a structure-centric database of disease-related SNPs and drugs”, and figure one shows a veritable Who’s Who of protein, variation and disease resources, so I had to investigate.

What I found at VnD made me sure that this was a resource that I wanted to feature in a tip. VnD is from the Korean Bioinformation Center, or KOBIC, who has a list of databases and tools that they provide. I’ll save the rest of the KOBIC resources for another post & concentrate on VnD here. Compiling data from resources such as RefSeq, OMIM, UniProt, PDB, DrugBank, dbSNP, GAD and more might have been cool enough, depending on how it was done, but the VnD also does their own structure modeling analysis on how the variation affects the protein structure and drug/ligand binding.

This tip movie isn’t long enough to really show you the breadth of what is available from the VnD, but I hope it will be enough to encourage you to read the NAR article (listed below), and to check out VnD. One thing to note: don’t expect to find every dbSNP rs# over there; one that I’ve been using in our tutorial isn’t included. They are specifically interested in variations within genes that might affect drug binding. But hey, you can’t query DrugBank with rs#s, and I’ve never seen structure modeling done like VnD does it, so it is a worthy resource that you may want to investigate if you are interested in how genetic variations connect with disease and drug therapies.

Quick links:

VnD: Variations and Drugs resource -  http://vnd.kobic.re.kr:8080/VnD/index.jsp

Korean Bioinformation Center (KOBIC) – http://www.kobic.re.kr/

RCSB PDB – http://www.pdb.org

OpenHelix Tutorial on the RCSB PDB – http://www.openhelix.com/pdb

dbSNP: Short Genetic Variations, from NCBI -  http://www.ncbi.nlm.nih.gov/projects/SNP/

OpenHelix Tutorial on NCBI’s dbSNP – http://www.openhelix.com/cgi/tutorialInfo.cgi?id=39

For links to other resources and OpenHelix tutorials mentioned in this post, please see our catalog of resources – http://www.openhelix.com/cgi/tutorials.cgi

Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010). VnD: a structure-centric database of disease-related SNPs and drugs Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq957

Tip of the Week: From UniProt to the PSI SBKB and Back Again

It is often beneficial to visit multiple biomedical databases or resources, even if they seem to provide overlapping information, because no two resources focus on the exact same information or present it in exactly the same way. Instead of duplicating each other’s curation efforts, databases often link out to related information at other resources. You can think of these links as “social connections” if you want, and in today’s tip I want to show you a couple of connections between protein information resources, including a new connection that really features some of the core value of the PSI’s Structural Biology Knowledgebase, or SBKB.

I begin the tip at the UniProtKB, where I search for a UniProt ID number. From the resulting protein report I first briefly show you how to link out to a corresponding RCSB PDB report, where you can find high quality protein structure information and more. If you are interested in learning more about the RCSB PDB & how to use it, please check out OpenHelix’s full, free tutorial that is sponsored by the RCSB PDB.

From there I return to the UniProt report and demonstrate a new link out option that links to protein protocols, available materials, as well as information about theoretical models and predicted protein targets from the SBKB. I don’t have time to show it, but a recent update to the SBKB now allows users to search the Structural Biology Knowledgebase with a UniProt accession number. These searches provide users with additional information, including protein structure information and information about pre-released structure sequences. As with the RCSB PDB, we have a free tutorial on the SBKB that is sponsored by the Protein Structure Initiative.

As I scroll through the UniProt protein report users will see information and links for a wide variety of bioscience resources. OpenHelix, as I’m sure many of you are aware, has tutorials on how to use many of these resources. Our tutorials on the RCSB PDB and the PSI SBKB are both free. Our tutorials on UniProt and many other resources are available through a subscription to our database of trainings or through purchase of individual access. Whether you learn the resources through our tutorials, through the references I list below, or through your own explorations of the databases, there really is an amazing amount of information available through these interlinked, publicly-funded resources – please make use of them in your research!

Quick Links:

UniProt Knowledgebase -  http://www.uniprot.org/

OpenHelix Tutorial on UniProt – http://www.openhelix.com/cgi/tutorialInfo.cgi?id=77

RCSB PDB – http://www.pdb.org

OpenHelix Tutorial on the RCSB PDB – http://www.openhelix.com/pdb

The Protein Structure Initiative Structural Biology Knowledgebase (SBKB) -  http://www.sbkb.org/

OpenHelix Tutorial on the SBKB – http://www.openhelix.com/sbkb

Catalog of all OpenHelix tutorials – http://www.openhelix.com/cgi/tutorials.cgi

The UniProt Consortium. (2009). The Universal Protein Resource (UniProt) in 2010 Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp846

Rose, P., Beran, B., Bi, C., Bluhm, W., Dimitropoulos, D., Goodsell, D., Prlic, A., Quesada, M., Quinn, G., Westbrook, J., Young, J., Yukich, B., Zardecki, C., Berman, H., & Bourne, P. (2010). The RCSB Protein Data Bank: redesigned web site and web services Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1021

Berman, H., Westbrook, J., Gabanyi, M., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L., Kopp, J., Podvinec, M., Adams, P., Carter, L., Minor, W., Nair, R., & Baer, J. (2009). The protein structure initiative structural genomics knowledgebase Nucleic Acids Research, 37 (Database) DOI: 10.1093/nar/gkn790

Many Protein Resources Have Recently Announced Updates

PDB structure 3rg9





In our ongoing pursuit of up-to-date tutorials, I’ve been tracking changes that are occurring at resources and planning our updates accordingly. Protein resources are especially going to keep me out of trouble this summer, because their developers and curators have been busy! I’ve compiled a short synopsis below, and would appreciate comments on any other resources you know about, or want to brag about! :)

  • I featured the ExPASy list of proteomic tools in a past tip. As of Tuesday this list is no longer being kept up-to-date, but the ExPASy resource has been expanded beyond being “just” a proteomics resource and is now the new SIB Bioinformatics Resource Portal. According to its developers, the portal:

    “provides access to scientific databases and software tools in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc. … On this portal you find resources from many different SIB groups as well as external institutions.”

    And never fear, there is still an up-to-date list of proteomics tools found here.

  • I mentioned in my tip last week that NCBI’s MMDB has undergone an update & I’ll be updating our tutorial on it soon.
  • NCI/Nature Pathway Interaction Database, or PID, had an update June 14th that includes new and updated pathway information.
  • PROSITE had an update June 21st, which is Release 20.73, and now includes 1618 documentation entries, 1308 patterns, 936 profiles and 925 ProRules.
  • The RCSB PDB resource has announced updates to their Browse Database function, enhanced sequence displays from structure summary pages and the PDB-101 educational resource available from blackboard logos on PDB pages. For more details on using PDB, please see our free PDB Introductory tutorial sponsored by the RCSB.
  • STRING’s 9.0 release is now available, and we’ll be looking into anything we need to update in our tutorial as a result.
  • UniProt released an update June 28th that included a major update on many bacterial and archaeal Type II Toxin-Antitoxin modules, as is described here.

Enjoy all the new information – I know I will! :)

Tip of the Week: WAVe, Web Analysis of the Variome

Today’s Tip of the Week is a short introduction to WAVe, or the Web Analysis of the Variome. The tool was recently introduced to us, and I’ve found it a welcome introduction to the tools available to the researcher to analyze human variation. This is apropos considering the recent paper we’ve been discussing on the clinical assessment of a personal genome (here, here and here) and that paper’s implications for personalized medicine and the use of online variation resources. WAVe has also introduced me to some additional tools that I either wasn’t aware of or haven’t used, and which might be of use, such as LOVD (Leiden Open Variation Database), QuExT (Query Expansion Tool, from the same developers as WAVe), and others. Of course there is also database information pulled in from Ensembl, Reactome, KEGG, InterPro, PDB, UniProt, NCBI and many others. Take some time to check it out.