Tag Archives: ensembl

Bioinformatics tools extracted from a typical mammalian genome project [supplement]

This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867


Bioinformatics tools extracted from a typical mammalian genome project

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

==============================================
Introduction:

In the field of bioinformatics, there is a lot of discussion about data and code availability, and reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items that had been in use in a current project. Details of the named bioinformatics items were extracted from the publication, and location and information about the tools was then explored.

Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%). Most of them were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability vary greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are student thesis publications, and some are never published, so that only a web site or software documentation manual provides the required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, whether for open source tools or commercial products.

New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.

Methods:

From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al, 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some missed by this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were generally not included unless the code had been made available. Pieces described as being done “in a manner similar to that already described” in some other reference were present, and I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.

What I put in the bucket included specific names of items or objects that would be likely to be necessary and/or unfamiliar to students or researchers outside of the bioinformatics community. Some are related, but different. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is, a conceptual difference and a separate object in my designation system here. Some are sub-components of other tools, but important aspects to understand (GOTERM_BP_FAT at DAVID or randomBed from BEDTools) and are individual named items in the report, as these might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection. Removal or inclusion of these can be revisited in future iterations of the list.

Results:

After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.

How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers that I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features have been in the extensive supplemental materials.

The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.


As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
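For anyone who does want to try text-mining, here is a minimal sketch in Python of the basic idea: scan each part of a paper separately and record where each named item turns up. The section labels and the sample sentences are invented placeholders, not text from the gibbon paper, and the function name is my own; a real attempt would also need to handle PDFs, tables, and name variants.

```python
import re
from collections import defaultdict


def find_tool_mentions(sections, tool_names):
    """Map each tool name to the list of section labels where it appears."""
    hits = defaultdict(list)
    for label, text in sections.items():
        for name in tool_names:
            # Whole-word, case-insensitive match; re.escape protects names
            # that contain characters with a special meaning in a regex.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits[name].append(label)
    return dict(hits)


# Toy input: invented snippets standing in for the parts of a paper.
sections = {
    "main text": "Repeats were masked with RepeatMasker before alignment.",
    "figure legend": "Tree built with PHYLIP.",
    "supplement": "We ran RepeatMasker and BEDTools (randomBed) on each assembly.",
}
tools = ["RepeatMasker", "PHYLIP", "BEDTools", "randomBed"]

mentions = find_tool_mentions(sections, tools)
for name in tools:
    print(name, "->", mentions.get(name, []))
```

Keeping the sections separate, rather than searching one merged blob of text, is what lets you see that an item only shows up in a legend or only in the supplement.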

Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc, has always been challenging. These are analogous to the “reagents” that I would have put in my benchwork biology papers years ago. They may matter to the outcome, such as enzyme vendors, mouse strain versions, or antibody species details. They constitute things you would need to reproduce or extend the work, or to appropriately understand the context. But in the case of bioinformatics, this can mean file formats such as the FASTQ or axt format from UCSC Genome Browser. They can mean repository resources like the SRA. They can be various different versioned downloaded data sets from ENSEMBL (version 67, 69, 70, or 73 here, but which were counted only once as ENSEMBL). It might be references to Reactome in a table.
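To make the file format point concrete for readers outside the field, here is a small Python sketch of what reading FASTQ involves. It assumes the common layout of four lines per record (an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence); the function name and the two example reads are made up for illustration, and real projects would normally use an established parser rather than this.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), sequence, quality


# Two invented reads, held in memory instead of a file.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n")
records = list(read_fastq(example))
for identifier, sequence, quality in records:
    print(identifier, sequence, quality)
```

Even this toy version shows why a format name is a “reagent” worth listing: the reader has to know that quality scores travel with the sequence, and that the two lines must match in length.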

With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation is included when possible. Finally, a column designates whether it was found in the main paper as well.
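The shape of the table, and the “~16%” figure mentioned earlier, can be sketched in a few lines of Python. The field names are my own labels for the columns described above, and the rows are empty placeholders rather than the real Table 1 entries; only the counts (21 items in the main paper, 133 in total) come from this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolRecord:
    name: str                 # name, nickname, or designation
    site: Optional[str]       # where it can be found, if available
    citation: Optional[str]   # publication or other citation, when possible
    in_main_paper: bool       # also named in the main paper?


def main_paper_share(records):
    """Return (count in main paper, total count, percentage in main paper)."""
    in_main = sum(1 for record in records if record.in_main_paper)
    return in_main, len(records), 100.0 * in_main / len(records)


# Placeholder rows: 133 items, the first 21 flagged as named in the main paper.
records = [
    ToolRecord(f"item{i}", None, None, in_main_paper=(i < 21))
    for i in range(133)
]
in_main, total, percent = main_paper_share(records)
print(f"{in_main} of {total} items ({percent:.1f}%)")
```

The optional fields are deliberate: a good share of the real entries have no publication, or no stable site, and the structure has to allow for that.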

What is not indicated is that some are referenced multiple times in different contexts and usages, which might cause people to not realize how frequently these are used. For example, ironically, RepeatMasker was referenced so many times that I stopped marking it up at one point.

Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post: http://blog.openhelix.eu/?p=20002, or access at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867



What can we learn about the source or use of these items?
Searches for information about the source code, data sets, file types, repositories, and associated descriptive information about the items turn up widely varying levels of access. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). These may be paywalled in certain publications, or are described in unavailable meeting papers. Some do not have associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature, long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages, some are student dissertations. Some tools are found on project-specific pages. Some exist on code repositories, sometimes deprecated ones that may disappear. A number of them have moved from the locations given in their initial publications, without forwarding addresses. Some are allusions to procedures in other publications. Some of them are like time travel right back to the 1990s, with pages that appear to be original for the time. Some may be at risk of disappearing completely the next time an update at a university web site changes site access.

Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.

When details of data processing or software implementations are provided, the amount can vary. Sometimes parameters are included, sometimes not.

Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.

Easter eggs
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:

“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”

Even the idea of “outreach” and support for use of the tools is certainly unclear to the tool providers, apparently. Training? Yeah, not in any formal way.

Discussion:

The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and some tools that were new and interesting to me were provided. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing the provenance of them, though, is uneven and challenging, and this is not unique to this project; it’s a problem across the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages may disappear or become unusable. This project has provided an interesting snapshot of the state of play, and a good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.

It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about: the World Wide Web. We had one of those little funny Macs with the teeny screens, and we found people were making texty web pages with banal fonts and odd colors, and talking about their research.

Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.

Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here), and looking for novel splice variants. I found them. Just by typing—no radioactivity and gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools.).

Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.

At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?

But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, can you establish some idea of where to aim? So I used this paper to sort of analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?

It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.

Reference:

Carbone L., R. Alan Harris, Sante Gnerre, Krishna R. Veeramah, Belen Lorente-Galdos, John Huddleston, Thomas J. Meyer, Javier Herrero, Christian Roos, Bronwen Aken, Fabio Anaclerio et al. (2014). Gibbon genome and the fast karyotype evolution of small apes, Nature, 513 (7517) 195-201. DOI: http://dx.doi.org/10.1038/nature13679

FigShare version of this post: http://dx.doi.org/10.6084/m9.figshare.1194879

VideoTip of the Week: ENCODE @ Ensembl

We have a lot of tutorials (2 in fact, ENCODE Foundations & ENCODE @ UCSC), tips and information about ENCODE. We also have a lot of tutorials (again 2, Ensembl and Ensembl Legacy, on the older versions), tips and information about Ensembl, the database and browser at EBI.

Now here is a tip of the week on both Ensembl AND ENCODE. This is one of the more recent additions to Ensembl’s video tutorials. This video looks at how to identify sequences that may be involved in gene regulation. Most of this data at Ensembl is based on ENCODE data. The video uses the “Matrix,” a way to select the regulation data you need based on cell types and TFs. At the end of the 8-minute video they discuss a bit more about how to get all ENCODE data.

So, now you have a wealth of information here at OpenHelix through our tutorials and our blog about ENCODE and Ensembl.

Quick links:

ENCODE: http://encodeproject.org/ENCODE/
ENCODE @ UCSC: http://genome.ucsc.edu/ENCODE/
Ensembl: http://www.ensembl.org
ENCODE Tutorials: http://openhelix.com/encode
Ensembl Tutorials: http://openhelix.com/cgi/tutorialInfo.cgi?id=95

Video Tip of the Week: Browsing Butterflies with GBrowse and Ensembl

A couple of months back when the Heliconius (Postman) Butterfly genome paper was released, we got to see another example of how the new sequencing technologies are giving us access to more and more genome data–in species that are not the main model organisms. Monarch butterfly genome data had been released prior to that as well. And you may not know that there’s a huge effort to get thousands of insect genomes–the i5k project. I think that’s my favorite thing about where we are today: we can examine more species in more detail than we ever have before.  Not only do we get interesting details from the genome sequence framework, but interesting info about species evolutionary relationships, and intriguing and novel biology features can be explored as well. I mean–the human genome and its variations are great–but Monarch butterflies have a sun compass! How cool is that??

And like most genome papers today, only a fraction of the data that was obtained is in the main body of the paper. The “compelling examples” might be there. But of the “12,699 predicted protein-coding genes” of the Heliconius genome, only a handful are really addressed in the text. A few more handfuls in some figures. The earlier Monarch butterfly paper delivered “a set of 16,866 protein-coding genes” (and 10 supplements beyond the paper!). But to access the data yourself and compare to your genes and species of interest you need to turn to the browsers that accompany the papers.

In this case you have two choices for browser styles: the Heliconius Genome Consortium (authors of the paper) maintain a GBrowse installation at their Butterflygenome.org site. The Monarch group has a GBrowse at MonarchBase. In addition, the data for both is also now included in Ensembl as of the July 2012 release 15. [note: see administrative details in the comments --mm]

For this week’s tip we fly around from the species-specific GBrowsers to the collected sets at Ensembl. It’s great to have the species-specific sites for depth of information about the projects and resources, but it’s also nice to have the additional tools and displays of the larger genome browsers. Community browsers can offer very current and new data that might not yet be included in the super-browsers, and the super-browsers may offer additional tools and infrastructure that is not available from the community browsers. Your best bet is to be aware of both, and to get comfortable with the main software features and their strengths and weaknesses.

The bugs are coming–and thousands of them. Be ready. And beware: look for the right superhero

Note: I have been unable to locate the Mothra genome that’s been all atwitter for the last couple of days.

Quick links:

Heliconius GBrowse: http://butterflygenome.org/

MonarchBase: http://monarchbase.umassmed.edu/genome.html

Ensembl Metazoa: http://metazoa.ensembl.org/

i5k Insect and other Arthropod Genome Sequencing Initiative http://arthropodgenomes.org/wiki/i5K

If you came looking for butterfly photos, try this: http://www.butterfliesandmoths.org/ This is also a citizen science site where you can submit your own sightings–I have done that in the past.

References:

Dasmahapatra, K.K., Walters, J.R., Briscoe, A.D., Davey, J.W., Whibley, A., Nadeau, N.J., Zimin, A.V., Hughes, D.S.T., Ferguson, L.C., Martin, S.H. et al. (2012). Butterfly genome reveals promiscuous exchange of mimicry adaptations among species, Nature, DOI: 10.1038/nature11041

Zhan, S., Merlin, C., Boore, J. & Reppert, S. (2011). The Monarch Butterfly Genome Yields Insights into Long-Distance Migration, Cell, 147 (5) 1185. DOI: 10.1016/j.cell.2011.09.052

Stensmyr, M. & Hansson, B. (2011). A Genome Befitting a Monarch, Cell, 147 (5) 972. DOI: 10.1016/j.cell.2011.11.009

Kersey, P.J., Staines, D.M., Lawson, D., Kulesha, E., Derwent, P., Humphrey, J.C., Hughes, D.S.T., Keenan, S., Kerhornou, A., Koscielny, G. et al. (2011). Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species, Nucleic Acids Research, 40 (D1) D97. DOI: 10.1093/nar/gkr895

Video Tips of the Week: Annual Review IV, 2nd half

As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. We have completed around 200 little tidbit introductions to various resources from last year, 2011 (yep, it’s 2012 now). At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II. The summary of the first half of 2011 is available from last week.

July 2011

July 6: Prioritizing genes using the Gene Prioritization Portal

July 13: PolySearch, searching many databases at once

July 20: Human Epigenomics Visualization Hub

July 27: The new SIB Bioinformatics Resource Portal

 

August 2011

August 3: SNPexp, correlation between SNPs and gene expression 

August 10: CompaGB for comparing genome browser software

August 17: CoGe, comparing genomes revisited

August 24: Domain Draw for quick motif diagrams

August 31: From UniProt to the PSI SBKB and back again

 

September 2011

September 7: Plant comparative genomics using Plaza

September 14: phiGENOME for bacteriophage genome exploration

September 21: Getting flanking sequences of genomic locations

September 28: Introduction to R statistical software 

 

October 2011

October 5: VnD resource for genetic variation and drug information

October 12: Track Hubs in UCSC Genome Browser

October 19: Mitochondrial Transcriptome GBrowser 

October 26: Variation data from Ensembl

 

November 2011

November 2: MizBee Synteny Browser

November 9: The new database of genomic variants: DGV2

November 16: MapMi, automated mapping of microRNA loci

November 23: BioMart’s new central portal

November 30: Phosphida, a post-translational modification database

December 2011

December 7: VarSifter, for identifying key sequence variations

December 14: Big changes to NCBI’s genome resources

December 21: eggNOG for the Holidays (or to explore orthologous genes)

December 28: Video Tips of the Week: Annual Review IV (first half of 2011)

Announcement of Updated Tutorial Materials: UniProt, Overview of Genome Browsers, and World Tour of Resources

As many of you know, OpenHelix specializes in helping people access and utilize the gold mine of public bioscience data in order to further research.  One of the ways that we do this is by creating materials to train people – researchers, clinicians, librarians, and anyone interested in science - on where to find data they are interested in, and how to access data at particular public databases and data repositories. We’ve got over 100 such tutorials on everything from PubMed to the Functional Glycomics Gateway (more on that later).

In addition to creating these tutorials, we also spend a lot of time keeping them accurate and up-to-date. This can be a challenge, especially when lots of databases or resources all have major releases around the same time. Our team continually assesses and updates our materials, and in this post I am happy to announce recently released updates to three of our tutorials: UniProt, World Tour, and Overview of Genome Browsers.

Our Introductory UniProt tutorial shows users how to: perform text searches at UniProt for relevant protein information, search with sequences as a starting point, understand the different types of UniProt records, and create multi-sequence alignments from protein records using Clustal.

Our Overview of Genome Browsers introduces users to Ensembl, Map Viewer, UCSC Genome Browser, the Integrated Microbial Genomes (IMG) browser, and the GBrowse software system. We also touch on WebGBrowse, JBrowse, the Integrative Genomics Viewer (IGV), the ARGO Genome Browser, the Integrated Genome Browser (IGB), GAGGLE, and the Circular Genome Viewer, or CGView.

Our World Tour of Genomics Resources is free and accessible without registration. It includes a tour of example resources, organized by categories such as algorithms and analysis tools, expression resources, genome browsers (both eukaryotic and prokaryotic/microbial), literature and text mining resources, and resources focused on nucleotides, proteins, pathways, disease and variation. This main discussion then leads into a discussion of how to find resources with the free OpenHelix Resource Search Portal, followed by learning to use resources with OpenHelix tutorials, and a discussion of additional methods of learning about resources.

Quick Links:

OpenHelix Introductory UniProt tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=77

OpenHelix Overview to Genome Browsers tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=65

Free OpenHelix World Tour of Genomics Resources tutorial suite: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119

 


Got a genome + transcriptome. Now what?

I was catching up on some mailing list reading last week when I saw an unusual item come across the UCSC discussion mailing list. Someone who is in the process of obtaining genome and transcriptome sequence for a new project asked the UCSC group for guidance on what to do with it. It’s actually a question we’ve been hearing a lot in workshops–people are considering grants for this sort of project, or have plans for a brand new sequencer that’s arrived at their site. I thought other people might consider these recommendations useful information too, so I’m re-posting it here:

Question:

Dear UCSC Genome Bioinformatics,

My name is Padraig Doolan and I am the Program Leader for Expression
Microarrays and Bioinformatics at the National Institute for Cellular
Biotechnology (NICB), Ireland (www.nicb.ie/). We are a publicly-funded
basic science research institute.

Our small bioinformatics group are just starting the process of
analysisng a new genome (and transcriptome) for the Chinese Hamster
Ovary (CHO) cell line which was recently published (Xu et al., The
genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat
Biotechnol. 2011 Jul 31;29(8):735-41. doi: 10.1038/nbt.1932.) by another
group. We do a lot of functional work on this organism and we’re looking
for some good guidelines (published papers, online resources, etc.)
which might help us map out some achievable goals with regard to the
in-silico characterisation of this genome.

For example, after the sequence is published, what are the next step(s)
in providing relevant information? Lists of SNPs? Predicted
proteome/secretome/numbers of predicted protein types (e.g.
kinases/g-coupled/nuclear-/membrane-localised), etc.?

I’m looking through the Human Genome Project Publications list
(http://www.ornl.gov/sci/techresources/Human_Genome/publicat/publications.shtml)
for inspiration, but this type of analysis output is relatively new for
our group (we are usually more focussed on translational medicine). Is
there any recommended guidelines your institute can suggest for
following in the footsteps of the HGP in in silico analysis of novel
genomes/transcriptomes? Can your organisation suggest a couple of key
papers or maybe a good analysis strategy?

Best regards,
Padraig Doolan

UCSC generally tries to limit their discussion to specifics of the data and software at their site–because that’s their mission, of course, and because they can’t be all things genomics to everyone–they wouldn’t have time for their own work. But this was a special case, and they assembled a very cool answer for Padraig and his team.

The CHO paper that Padraig references I had remembered seeing at the time, but I didn’t investigate further. So I went looking to see if the group had a browser set up, and I was unable to find one. I did find a preview assembly at Ensembl. But I can see why a local group would need more details in their own collection and why they’d want to do some things themselves too. And possibly an easy way to extend the reference sequence with their own data rather than waiting for a big browser team to get to it.

Reply:

Hi Padraig,

I queried our engineers and got this list of recommendations for you:

1) Aligning all genbank mRNAs from Chinese Hamster
2) Aligning all of their own transcriptome data
3) Aligning all of genbank ESTs from Chinese Hamster
4) Mapping human proteins as derived from either the UCSC gene set or RefSeq
5) Mapping mouse proteins from UCSC or RefSeq
6) Doing a multiple species genome alignment with mouse, rat, rabbit,
dog, elephant, opossum, platypus, chicken. Do pairwise alignments as well.
7) Mine the genomic reads and transcriptomic reads for SNPs. Be careful
not to call recently duplicated and only slightly diverged regions
slight divergences as SNPs though.
8) Run several repeat finders.
9) Run a CpG island detector.
10) Run a good gene prediction program like Augustus.
11) Try to find a wet lab group willing to do some DNAse assays….

I hope this is helpful. Good luck with your work!


Brooke Rhead
UCSC Genome Bioinformatics Group

 

I thought this was pretty much the list of things I’d want to see with a new genome on a new browser. And the reason I think this is especially key is because there’s only going to be more and more of this. With the new sequencing technologies and the data deluge, more groups are going to find themselves with important sequence data for their labs or their local researchers. Could be patients, could be model organisms, could be species. How to proceed with this data is important.
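To give a feel for what one of these steps involves, here is a minimal Python sketch of item 9, a CpG island detector. It is not what UCSC runs. It assumes the commonly cited Gardiner-Garden and Frommer style thresholds (windows of at least 200 bp, GC content above 50%, and an observed/expected CpG ratio above 0.6) and simply reports fixed windows that pass; real detectors also merge and extend neighboring windows. The function names, the step size, and the toy sequence are my own.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one stretch of DNA."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # CpG dinucleotides, read 5' to 3'
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count is (C count * G count) / length,
    # so observed/expected = CpG count * length / (C count * G count).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_windows(sequence, size=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return start offsets of fixed-size windows that pass both thresholds."""
    starts = []
    for start in range(0, len(sequence) - size + 1, step):
        gc, ratio = cpg_stats(sequence[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            starts.append(start)
    return starts


# Toy sequence: a CpG-rich stretch followed by an AT-only stretch.
toy = "CG" * 100 + "AT" * 100
island_starts = cpg_windows(toy)
print(island_starts)
```

On the toy sequence only the windows that sit mostly inside the CpG-rich half pass, which is the behavior you would want to sanity-check before pointing anything like this at a real assembly.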

What else would you do? Do you have other recommendations for groups faced with this?

Also today I just happened to note that Jonathan Eisen linked to a paper that might offer guidance for people with new genomes: Important paper on annotation standards for bacterial/archael genomes — readying for the “data deluge”. I think this is great, and a crucial discussion and awareness to have right now. For exactly the same reasons–new folks are going to be faced with assembling and annotating features of new genomes at incredible rates, and we have learned some things about best practices and the needs. Of course, things will evolve–but a few good starting points are really helpful guidance.

EDIT: just got a note from the CHO paper researchers, and they point me to this site for some tools: http://www.chogenome.org/

References:

Xu, X., Nagarajan, H., Lewis, N., Pan, S., Cai, Z., Liu, X., Chen, W., Xie, M., Wang, W., Hammond, S., Andersen, M., Neff, N., Passarelli, B., Koh, W., Fan, H., Wang, J., Gui, Y., Lee, K., Betenbaugh, M., Quake, S., Famili, I., Palsson, B., & Wang, J. (2011). The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line Nature Biotechnology, 29 (8), 735-741 DOI: 10.1038/nbt.1932

Klimke, W., O’Donovan, C., White, O., Brister, J., Clark, K., Fedorov, B., Mizrachi, I., Pruitt, K., & Tatusova, T. (2011). Solving the Problem: Genome Annotation Standards before the Data Deluge Standards in Genomic Sciences, 5 (1), 168-193 DOI: 10.4056/sigs.2084864

Saving your genome views: Ensembl (and UCSC Genome Browser, others)

In the latest update of Ensembl, the developers added the ability to save configurations. This allows you to set your track views and analysis to a specific configuration and load that configuration at a later time. The blog post linked previously (or here) explains the steps for creating your own configurations that you can save and return to. In the future they will be adding the ability to share your configurations with colleagues and other researchers.

UCSC has a similar function they call sessions. You can read more about it here on our blog or go to the sessions user guide at UCSC.

As far as I know, NCBI’s Map Viewer does not have the same functionality (please correct me if I’m wrong).

GBrowse has the ability to save sessions, but since it is an open-source program, whether that function is available will depend on the specific installation.

NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing more extensive blog posts on it and doing some tips of the week over the next couple of months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full-text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds and well over 100 more.

And, because I can, I’ll give a special shout out to my former PI at EMBL: Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (and he and a couple of group members are on the InterPro paper also).

This is going to be a wealth of information to wade through!


Video Tip of the Week: Variation Data from Ensembl

Trey introduced me to this “decent collection of video tutorials” from Ensembl, but he and Mary are currently in Morocco teaching a 3-day bioinformatics workshop & then attending the conference (yes, I am envious!). I am therefore creating this week’s tip based on the tutorials that Trey pointed me to. In today’s tip I am going to parallel a tutorial available from Ensembl on SNP information in order to 1) show you how you can access variation information from Ensembl and 2) compare doing these steps using Ensembl 64 (here in this video) and using Ensembl 54 (archived) (in the Ensembl video).

Bioscience resources are often continuously being developed and improved & it can be difficult to keep videos and documentation up-to-date. That’s why here at OpenHelix we work continuously to keep our materials up-to-date, with weekly tips on new features and updated tutorials as updated sites become stable.

The Ensembl video (SNPs and other Variations – 1 of 2) is quite nice & provides more detail about the actual Ensembl data than I can in my short movie, but it was done a few years ago on an older version of Ensembl. Since then the resource has been updated and has gone through several new versions of the data. I’m going to follow the same steps that are done in part one of the Ensembl SNP tutorial so that you can see examples of what’s changed & what is pretty much the same. I’d suggest you watch both videos back-to-back to get a good idea of what’s changed, and what types of variation information are available from Ensembl. From that basis I’m sure you’ll be able to watch Ensembl’s second SNP video & apply it to using the current version of Ensembl without much trouble. For more details you can refer to the most recent Ensembl paper in the NAR database issue, which describes not just variation information but Ensembl as a whole.
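If you would rather pull variation information with a script than click through the browser, here is a minimal sketch. It is not covered in either video. It assumes Ensembl's public REST service at rest.ensembl.org and its variation endpoint, and the helper names and the example rsID are just illustrations. As far as I know that service reflects the current Ensembl release, so it will not reproduce the release 54 or release 64 views compared in the videos.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ensembl REST server (not mentioned in the videos).
ENSEMBL_REST = "https://rest.ensembl.org"


def variation_url(rsid, species="human"):
    """Build the URL for one variant record, requesting JSON output."""
    path = "/variation/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(rsid)
    )
    return ENSEMBL_REST + path + "?content-type=application/json"


def fetch_variation(rsid, species="human", timeout=30):
    """Download and decode the record for one variant (needs network access)."""
    with urllib.request.urlopen(variation_url(rsid, species), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # rs699 is used here only as an example identifier.
    print(variation_url("rs699"))
    # Uncomment to query the live service:
    # record = fetch_variation("rs699")
    # print(sorted(record.keys()))
```

Printing the keys of the returned record is a quick way to see which pieces of variation information come back before deciding what to extract.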

Quick links:

Ensembl Browser: http://www.ensembl.org/index.html

Legacy Ensembl Browser (release 54): http://may2009.archive.ensembl.org/index.html

Ensembl tutorial, part 1 of 2: http://useast.ensembl.org/Help/Movie?id=208

Ensembl tutorial, part 2 of 2: http://useast.ensembl.org/Help/Movie?id=211

OpenHelix Ensembl tutorial materials: http://www.openhelix.eu/cgi/tutorialInfo.cgi?id=95

Ensembl Tutorial List: http://useast.ensembl.org/common/Help/Movie?db=core

Reference:
Flicek, P., Aken, B., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon, L., Graf, S., Haider, S., Hammond, M., Howe, K., Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Koscielny, G., Kulesha, E., Lawson, D., Longden, I., Massingham, T., McLaren, W., Megy, K., Overduin, B., Pritchard, B., Rios, D., Ruffier, M., Schuster, M., Slater, G., Smedley, D., Spudich, G., Tang, Y., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder, S., Zadissa, A., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X., Herrero, J., Hubbard, T., Parker, A., Proctor, G., Smith, J., & Searle, S. (2009). Ensembl’s 10th year Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp972