I was catching up on some mailing list reading last week when I saw an unusual item come across the UCSC discussion mailing list. Someone who is in the process of obtaining genome and transcriptome sequence for a new project asked the UCSC group for guidance on what to do with it. It’s actually a question we’ve been hearing a lot in workshops–people are considering grants for this sort of project, or have plans for a brand new sequencer that’s arrived at their site. I thought other people might consider these recommendations useful information too, so I’m re-posting it here:
Dear UCSC Genome Bioinformatics,
My name is Padraig Doolan and I am the Program Leader for Expression
Microarrays and Bioinformatics at the National Institute for Cellular
Biotechnology (NICB), Ireland (www.nicb.ie/). We are a publicly-funded
basic science research institute.
Our small bioinformatics group are just starting the process of
analysing a new genome (and transcriptome) for the Chinese Hamster
Ovary (CHO) cell line which was recently published (Xu et al., The
genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat
Biotechnol. 2011 Jul 31;29(8):735-41. doi: 10.1038/nbt.1932.) by another
group. We do a lot of functional work on this organism and we’re looking
for some good guidelines (published papers, online resources, etc.)
which might help us map out some achievable goals with regard to the
in-silico characterisation of this genome.
For example, after the sequence is published, what are the next step(s)
in providing relevant information? Lists of SNPs? Predicted
proteome/secretome/numbers of predicted protein types (e.g.
I’m looking through the Human Genome Project Publications list
for inspiration, but this type of analysis output is relatively new for
our group (we are usually more focussed on translational medicine). Are
there any recommended guidelines your institute can suggest for
following in the footsteps of the HGP in in silico analysis of novel
genomes/transcriptomes? Can your organisation suggest a couple of key
papers or maybe a good analysis strategy?
UCSC generally tries to limit their discussion to specifics of the data and software at their site–because that’s their mission, of course, and because they can’t be all things genomics to everyone–they wouldn’t have time for their own work. But this was a special case, and they assembled a very cool answer for Padraig and his team.
I remembered seeing the CHO paper that Padraig references when it came out, but I didn’t investigate further at the time. So I went looking to see if the group had a browser set up, and I was unable to find one. I did find a preview assembly at Ensembl. But I can see why a local group would need more details in their own collection and why they’d want to do some things themselves too. They might also want an easy way to extend the reference sequence with their own data rather than waiting for a big browser team to get to it.
I queried our engineers and got this list of recommendations for you:
1) Aligning all GenBank mRNAs from Chinese Hamster
2) Aligning all of their own transcriptome data
3) Aligning all GenBank ESTs from Chinese Hamster
4) Mapping human proteins as derived from either the UCSC gene set or RefSeq
5) Mapping mouse proteins from UCSC or RefSeq
6) Doing a multiple species genome alignment with mouse, rat, rabbit,
dog, elephant, opossum, platypus, chicken. Do pairwise alignments as well.
7) Mine the genomic reads and transcriptomic reads for SNPs. Be careful
not to call the slight divergences between recently duplicated and only
slightly diverged regions as SNPs though.
8) Run several repeat finders.
9) Run a CpG island detector.
10) Run a good gene prediction program like Augustus.
11) Try to find a wet lab group willing to do some DNase assays….
I hope this is helpful. Good luck with your work!
UCSC Genome Bioinformatics Group
I thought this was pretty much the list of things I’d want to see with a new genome on a new browser. I think this is especially key because there’s only going to be more and more of this. With the new sequencing technologies and the data deluge, more groups are going to find themselves with important sequence data for their labs or their local researchers. Could be patients, could be model organisms, could be new species. Knowing how to proceed with this data is important.
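To make one item on that list a bit more concrete, here is a rough sketch of what a CpG island detector (item 9) does. I'm assuming the commonly cited Gardiner-Garden and Frommer style thresholds (a window of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6); this is not necessarily what the UCSC engineers had in mind, and the function names `cpg_stats` and `find_cpg_islands` are made up for illustration. For real work you'd want an established tool rather than this.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently is c * g / n,
    # so observed/expected is cpg * n / (c * g).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Slide a window along seq and merge overlapping windows that pass.

    Thresholds are assumptions (see the text above). Returns a list of
    half-open (start, end) coordinates, 0-based.
    """
    islands = []
    for start in range(0, len(seq) - window + 1, step):
        gc, oe = cpg_stats(seq[start:start + window])
        if gc > min_gc and oe > min_obs_exp:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


if __name__ == "__main__":
    # Toy sequence: AT-rich flanks around a CG-rich core.
    toy = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_cpg_islands(toy))
```

Because any window that is more than half CG-rich passes, the reported interval extends somewhat into the flanking sequence; real detectors trim and score islands more carefully.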
What else would you do? Do you have other recommendations for groups faced with this?
Also, just today I happened to note that Jonathan Eisen linked to a paper that might offer guidance for people with new genomes: Important paper on annotation standards for bacterial/archael genomes — readying for the “data deluge”. I think this is great, and it’s a crucial discussion to be having right now. For exactly the same reasons–new folks are going to be faced with assembling and annotating features of new genomes at incredible rates, and we have learned some things about best practices and what people need. Of course, things will evolve–but a few good starting points are really helpful guidance.
EDIT: just got a note from the CHO paper researchers, and they point me to this site for some tools: http://www.chogenome.org/
Xu, X., Nagarajan, H., Lewis, N., Pan, S., Cai, Z., Liu, X., Chen, W., Xie, M., Wang, W., Hammond, S., Andersen, M., Neff, N., Passarelli, B., Koh, W., Fan, H., Wang, J., Gui, Y., Lee, K., Betenbaugh, M., Quake, S., Famili, I., Palsson, B., & Wang, J. (2011). The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nature Biotechnology, 29 (8), 735-741. DOI: 10.1038/nbt.1932
Klimke, W., O’Donovan, C., White, O., Brister, J., Clark, K., Fedorov, B., Mizrachi, I., Pruitt, K., & Tatusova, T. (2011). Solving the Problem: Genome Annotation Standards before the Data Deluge. Standards in Genomic Sciences, 5 (1), 168-193. DOI: 10.4056/sigs.2084864