Tag Archives: annotation

Video Tip of the Week: DNA Subway

At a recent training workshop on the UCSC Genome Browser, I spoke with an educator who is using a custom local installation of the browser to work with students on bioinformatics lessons. It’s a project called the Genomics Education Partnership at WashU, and students learn by annotating regions of the genome with bioinformatics tools. You can see the team’s installation of the browser here. It sounded like an enjoyable and effective method and useful to students.

So the other day when I was exploring some of the resources available from the iPlant Collaborative, I was reminded of the annotation educational method by their very cool DNA Subway project. It’s another strategy to educate students with genome annotation tools–but I also think there might be some scientists who might want to use it beyond formal educational settings. It’s not new–I can remember reading about it in the past, but looking at it again with fresh eyes after that other conversation was worthwhile. And they’ve added new features since I last explored.

Student annotation projects are widespread, and there are probably numerous different successful strategies that local folks have implemented to set this up. But I suspect that more folks who are teaching bioinformatics might find the workflow infrastructure of the DNA Subway system a useful mechanism to use themselves, rather than setting up their own. So this week’s video tip of the week highlights the DNA Subway. Oh–and by the way: just because it’s at iPlant doesn’t mean it’s restricted to plants. You can go over there and see the various species options.

The providers of the Subway describe it as:

“DNA Subway makes high-level genome analysis broadly available to students and educators and provides easy access to the types of data and informatics tools that drive modern biology. Using the intuitive metaphor of a subway map, DNA Subway organizes research-grade bioinformatics analysis tools into logical workflows and presents them in an appealing interface.”

I thought this was a really effective way to conceptualize the tasks that need to occur on a project. And it’s integrated with the tools you need at each “stop” to accomplish the tasks. The new “green line” in Beta that they have created isn’t shown in the video, but you should have a look at the site. It’s got tools for NGS RNA-seq data analysis, integrating the Tuxedo workflow protocol that includes TopHat, Bowtie, and Cufflinks, and is a really good thing for students to be exposed to. If you go over to the DNA Subway site itself and choose the “green line” to explore, you can see more information.
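For readers who have not seen the Tuxedo tools on the command line, here is a minimal sketch of the three steps as they are commonly chained: build an index, align reads with TopHat, assemble transcripts with Cufflinks. The file names and the `tuxedo_commands` helper are hypothetical, and I am only assuming the green line wraps something along these lines; the function builds the command lines as a dry run and does not execute anything.

```python
from typing import List


def tuxedo_commands(genome_fasta: str, index_prefix: str,
                    reads_fastq: str, out_dir: str) -> List[List[str]]:
    """Build (but do not run) a basic single-sample Tuxedo command sequence.

    All paths are placeholders supplied by the caller (hypothetical names).
    """
    tophat_dir = f"{out_dir}/tophat"
    cufflinks_dir = f"{out_dir}/cufflinks"
    return [
        # 1. Index the reference genome for the aligner.
        ["bowtie2-build", genome_fasta, index_prefix],
        # 2. Spliced alignment of the RNA-seq reads against the index.
        ["tophat", "-o", tophat_dir, index_prefix, reads_fastq],
        # 3. Assemble transcripts from TopHat's alignment file.
        ["cufflinks", "-o", cufflinks_dir, f"{tophat_dir}/accepted_hits.bam"],
    ]


if __name__ == "__main__":
    for cmd in tuxedo_commands("genome.fa", "genome", "reads.fq", "results"):
        print(" ".join(cmd))
```

To actually run the steps you would pass each list to `subprocess.run(cmd, check=True)` on a machine with the tools installed. Keeping the commands as data first makes it easy for students to read the workflow before executing it.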

I can’t seem to embed their video, so I’d recommend you look at the larger size version on a separate page, and go over and have a look for yourself at the DNA Subway.

Go over to their site by clicking on the image to access the video.


Quick links:

DNA Subway main description page: http://www.iplantcollaborative.org/discover/dna-subway

DNA Subway installation: http://dnasubway.iplantcollaborative.org/

DNA Subway video tour (larger size): http://dnasubway.iplantcollaborative.org/files/tour/index.html


Goff S.A., Vaughn M., McKay S., Lyons E., Stapleton A.E., Gessler D., Matasci N., Wang L., Hanlon M., Lenards A., et al. (2011). The iPlant Collaborative: Cyberinfrastructure for Plant Biology, Frontiers in Plant Science, 2, 34.

Trapnell C., Roberts A., Goff L., Pertea G., Kim D., Kelley D.R., Pimentel H., Salzberg S.L., Rinn J.L. & Pachter L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks, Nature Protocols, 7 (3), 562-578.

Got a genome + transcriptome. Now what?

I was catching up on some mailing list reading last week when I saw an unusual item come across the UCSC discussion mailing list. Someone who is in the process of obtaining genome and transcriptome sequence for a new project asked the UCSC group for guidance on what to do with it. It’s actually a question we’ve been hearing a lot in workshops–people are considering grants for this sort of project, or have plans for a brand new sequencer that’s arrived at their site. I thought other people might consider these recommendations useful information too, so I’m re-posting it here:


Dear UCSC Genome Bioinformatics,

My name is Padraig Doolan and I am the Program Leader for Expression
Microarrays and Bioinformatics at the National Institute for Cellular
Biotechnology (NICB), Ireland (www.nicb.ie/). We are a publicly-funded
basic science research institute.

Our small bioinformatics group are just starting the process of
analysing a new genome (and transcriptome) for the Chinese Hamster
Ovary (CHO) cell line which was recently published (Xu et al., The
genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat
Biotechnol. 2011 Jul 31;29(8):735-41. doi: 10.1038/nbt.1932.) by another
group. We do a lot of functional work on this organism and we’re looking
for some good guidelines (published papers, online resources, etc.)
which might help us map out some achievable goals with regard to the
in-silico characterisation of this genome.

For example, after the sequence is published, what are the next step(s)
in providing relevant information? Lists of SNPs? Predicted
proteome/secretome/numbers of predicted protein types (e.g.
kinases/g-coupled/nuclear-/membrane-localised), etc.?

I’m looking through the Human Genome Project Publications list
for inspiration, but this type of analysis output is relatively new for
our group (we are usually more focussed on translational medicine). Is
there any recommended guidelines your institute can suggest for
following in the footsteps of the HGP in in silico analysis of novel
genomes/transcriptomes? Can your organisation suggest a couple of key
papers or maybe a good analysis strategy?

Best regards,
Padraig Doolan

UCSC generally tries to limit their discussion to specifics of the data and software at their site–because that’s their mission, of course, and because they can’t be all things genomics to everyone–they wouldn’t have time for their own work. But this was a special case, and they assembled a very cool answer for Padraig and his team.

The CHO paper that Padraig references I had remembered seeing at the time, but I didn’t investigate further. So I went looking to see if the group had a browser set up, and I was unable to find one. I did find a preview assembly at Ensembl. But I can see why a local group would need more details in their own collection and why they’d want to do some things themselves too. And possibly an easy way to extend the reference sequence with their own data rather than waiting for a big browser team to get to it.


Hi Padraig,

I queried our engineers and got this list of recommendations for you:

1) Aligning all genbank mRNAs from Chinese Hamster
2) Aligning all of their own transcriptome data
3) Aligning all of genbank ESTs from Chinese Hamster
4) Mapping human proteins as derived from either the UCSC gene set or RefSeq
5) Mapping mouse proteins from UCSC or RefSeq
6) Doing a multiple species genome alignment with mouse, rat, rabbit,
dog, elephant, opossum, platypus, chicken. Do pairwise alignments as well.
7) Mine the genomic reads and transcriptomic reads for SNPs. Be careful
not to call the slight divergences between recently duplicated, only
slightly diverged regions as SNPs though.
8) Run several repeat finders.
9) Run a CpG island detector.
10) Run a good gene prediction program like Augustus.
11) Try to find a wet lab group willing to do some DNAse assays….

I hope this is helpful. Good luck with your work!

Brooke Rhead
UCSC Genome Bioinformatics Group


I thought this was pretty much the list of things I’d want to see with a new genome on a new browser. And the reason I think this is especially key is because there’s only going to be more and more of this. With the new sequencing technologies and the data deluge, more groups are going to find themselves with important sequence data for their labs or their local researchers. Could be patients, could be model organisms, could be species. How to proceed with this data is important.
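To make one item on that list concrete, here is a minimal sketch of what a CpG island detector (item 9) does. It is not the UCSC pipeline. It assumes the commonly cited sliding-window criteria (window of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6), and the function names and default thresholds are my own choices for illustration.

```python
from typing import List, Tuple


def cpg_stats(window: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently.
    expected = (c * g) / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def find_cpg_islands(seq: str, window: int = 200, step: int = 1,
                     min_gc: float = 0.5,
                     min_ratio: float = 0.6) -> List[Tuple[int, int]]:
    """Return merged (start, end) intervals, 0-based and end-exclusive,
    covered by windows that pass both thresholds (assumed criteria)."""
    islands: List[Tuple[int, int]] = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc >= min_gc and ratio >= min_ratio:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


if __name__ == "__main__":
    # Toy sequence: AT-rich flanks around a CG-rich core.
    toy = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_cpg_islands(toy))
```

A production detector would also handle ambiguous bases, trim the island boundaries, and run efficiently over whole chromosomes, but the thresholds above are the heart of the idea.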

What else would you do? Do you have other recommendations for groups faced with this?

Also today I just happened to note that Jonathan Eisen linked to a paper that might offer guidance for people with new genomes: Important paper on annotation standards for bacterial/archael genomes — readying for the “data deluge”. I think this is great, and a crucial discussion and awareness to have right now. For exactly the same reasons–new folks are going to be faced with assembling and annotating features of new genomes at incredible rates, and we have learned some things about best practices and the needs. Of course, things will evolve–but a few good starting points are really helpful guidance.

EDIT: just got a note from the CHO paper researchers, and they point me to this site for some tools: http://www.chogenome.org/


Xu, X., Nagarajan, H., Lewis, N., Pan, S., Cai, Z., Liu, X., Chen, W., Xie, M., Wang, W., Hammond, S., Andersen, M., Neff, N., Passarelli, B., Koh, W., Fan, H., Wang, J., Gui, Y., Lee, K., Betenbaugh, M., Quake, S., Famili, I., Palsson, B., & Wang, J. (2011). The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line Nature Biotechnology, 29 (8), 735-741 DOI: 10.1038/nbt.1932

Klimke, W., O’Donovan, C., White, O., Brister, J., Clark, K., Fedorov, B., Mizrachi, I., Pruitt, K., & Tatusova, T. (2011). Solving the Problem: Genome Annotation Standards before the Data Deluge Standards in Genomic Sciences, 5 (1), 168-193 DOI: 10.4056/sigs.2084864

Tip of the Week: Genomic Encyclopedia of Bacteria & Archaea (GEBA)

It being summer, and with a strangely slow connection and some other factors in play, I am embedding a talk from Doug Ramsey (posted on SciVee) on the GEBA project at JGI (instead of doing a tip myself :). The GEBA project recognizes that many, if not most, of the bacterial and archaeal genomes that have been sequenced to date have some relevance to human disease or other human interest. This of course is reasonable, but it also leads to big gaps in our knowledge of bacterial evolution and genomics, knowledge that would help us better understand those genomes that we find relevant and knowledge that in and of itself can be quite interesting and potentially useful. View the talk to learn more about this project to sequence 100 phylogenetically diverse bacterial and archaeal genomes.
I’m also posting this as an introduction to JGI’s Adopt a Genome project. This project allows student groups to adopt and study a bacterium in the GEBA project and hopefully add to our knowledge and annotations of the genome while learning. The students can then annotate the adopted genome by using IMG-ACT.

Tip of the Week: UCSC wiki annotations


In the continuing effort to get scientists and researchers to annotate and curate data and to capture the huge amount of knowledge available, UCSC Genome Browser has added a wiki annotation track to the browser. It’s not the first effort, of course. GeneWiki is an effort, with mixed results so far, to annotate gene function information as a community exercise using Wikipedia. Some journals are requiring wiki entries, and several databases have opened wikis for curation. Wikis could be a solution for capturing the exponentially increasing amount of data, or they could be just another place for adding confusion… or both.

I suspect that out of the plethora of wikis becoming available for annotation and curation of genomic data, something will stick and find that Goldilocks balance of a dedicated community, ease of use, and the other aspects that will be needed for this to work.

Perhaps UCSC Genome Browser has that balance. It will remain to be seen, but let’s get started. Today’s tip is introducing the new wiki track in the UCSC Genome Browser.

Required Wiki updates?

In the push to ‘communitize’ annotation and curation, one journal, RNA Biology, is requiring submitters to add or update their RNA sequences on Wikipedia. This article suggests that it’s working so far (update: link to the article added):

The first examples of this program in action are already online. The journal is hosting an open access paper that describes a family of RNA molecules found in nematode worms; a corresponding Wikipedia page is already in place. In good Wikipedia form, the phylogenetic analysis of these RNAs is dinged for not providing citations, while the article as a whole is flagged as having excess jargon. (The talk page hosts an interesting discussion of how much jargon can possibly be eliminated from a highly technical description like this.)

So far, everyone is happy with the results. A few scientists have started updating the scientific content of the RNA entries, while the usual Wikipedia denizens have helped out in terms of catching typos and improving the formatting. The people backing the project expect that it will be immune to some of the issues that plague other Wikipedia entries; Nature quotes one of the biologists as saying, “We don’t think vandalism will ever be as much of a problem for a Wikipedia page on transfer RNAs as it is for a page on George Bush.”

And looking at that one entry, it does seem to be working. But I have a question: if researchers are soon required not only to submit to a database but also to curate and annotate in wikis if they wish to publish, doesn’t this start to place an undue burden on researchers who already have grant writing, teaching, and more in addition to actual research? There does need to be a solution to the growing need for curation and annotation of data. It will be interesting to see if this is one solution that will hold.

Teaching and annotating at the same time

A recent paper (a couple of weeks ago) in PLoS Biology from Hingamp et al. had me intrigued. In the paper, entitled Metagenome Annotation Using a Distributed Grid of Undergraduate Students, the lecturers describe a system they put together to teach bioinformatics to undergraduates using new unannotated sequences from metagenome projects. As stated in the announcement:

This method asks students to randomly pick and analyze unknown metagenomic DNA fragments from a real research sequence stockpile. The student’s mission, using Internet tools only, is to figure out from which organism the DNA comes from, and what biological function it might have. As well as gaining confidence and proficiency in bioinformatics, students experience the authentic research process of weighing the arguments, establishing prediction reliability, building hypotheses, and maintaining rigorous discourse.

The lecturers have put together a teaching-annotation procedure in a publicly accessible “annotation environment” they call “Annotathon.” This web interface walks the student through the annotation process, following the procedure you see in the figure here. Since you can join and use this interface, I thought I’d give it a test drive.


Tip of the Week: A quick annotation of a genome

Hey, say you’ve got a bacterial genome you just sequenced in your spare time (hey, the way technology is going, it’s not far off) and you need to do a quick and dirty annotation to get you started. Well, there are several tools out there to do that: predict genes, annotate regions, etc. I’d like to show you one in this tip that you might not have thought of but that could be a useful way to get started. It’s GATU (Genome Annotation Transfer Utility) at VBRC. As the name suggests, this doesn’t do any major gene predicting. What it does is take your genome, compare it to a closely related genome (the closer the better, of course), and transfer all the annotation from the characterized genome. This is from a viral resource (VBRC), but it works just as well with bacterial genomes, something that might not have been obvious, and it puts another tool in your belt.
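If the idea of annotation transfer is new to you, here is a toy sketch of the concept. It is not how GATU is implemented: exact string matching stands in for the similarity search a real tool would use, and the `Gene` record and `transfer_annotations` function are hypothetical names made up for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def transfer_annotations(reference: str, genes: List[Gene],
                         target: str) -> Tuple[List[Gene], List[str]]:
    """Copy gene annotations from an annotated reference onto a new genome.

    A gene is transferred only if its exact reference sequence occurs in
    the target (a deliberate simplification). Returns the transferred
    genes with target coordinates, plus the names that could not be placed.
    """
    transferred: List[Gene] = []
    unplaced: List[str] = []
    for gene in genes:
        gene_seq = reference[gene.start:gene.end]
        pos = target.find(gene_seq)  # first exact occurrence, or -1
        if pos == -1:
            unplaced.append(gene.name)
        else:
            transferred.append(Gene(gene.name, pos, pos + len(gene_seq)))
    return transferred, unplaced


if __name__ == "__main__":
    reference = "TTT" + "ATGCCCTAA" + "GG" + "ATGAAATGA" + "CC"
    genes = [Gene("geneA", 3, 12), Gene("geneB", 14, 23)]
    # The new genome carries geneA intact and geneB with one substitution.
    target = "GGGGG" + "ATGCCCTAA" + "TT" + "ATGAACTGA"
    placed, missing = transfer_annotations(reference, genes, target)
    print(placed, missing)
```

The genes that fail to transfer are the interesting part in practice: they are where the new genome differs from its relative and where manual annotation or gene prediction has to take over.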


How well do we know our genes?

Do you have some favorite genes? Well, of course you do–you are probably a researcher who has in the past worked on some specific genes, or you are interested in groups of genes or genomic regions. Or maybe classes of genes. There is a new resource that provides you with a score of how well a given protein coding gene is annotated, and possibly therefore understood. The GCI, or Gene Characterization Index, can tell you. http://cisreg.ca/gci/

I love the idea of this project. The team wanted to look at the gene space and understand how well we knew the human genes. They looked at the growth of our knowledge over time, too–which provides an interesting view of our progress–as shown in this figure from their web site. And they wanted to identify the darkness–where don’t we know enough? Where are some great genes to examine that we can learn some really new things?
That’s the kind of project I wanted to do when I was still in academia. I thought you could build a whole lab and crank out students who get assigned an unknown gene, and it is their job over the next few years to analyze and understand the gene. It would be unbiased by a disease area vision, or by the lab director’s preconceptions of what the gene might do. They could try all sorts of techniques to get there. It is probably also entirely unfundable by grant agencies. Alas.
