This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867
In this extended blog post, I describe my efforts to extract information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises for the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and an extensively documented collection of their processes and resources. The issues I wanted to highlight concern access to bioinformatics tools in general; they are not specific to this project at all, but apply to the field as a whole.
In the field of bioinformatics, there is a lot of discussion about data and code availability, and about reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items in use in a current project. Details of the named bioinformatics items were extracted from the publication, and the location of and information about each item was then explored.
Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%); most were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability vary greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are described only in student theses, and some were never published at all, with only a web site or a software documentation manual providing the required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, for open source tools and commercial products alike.
New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.
From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al, 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some missed by this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were generally not included unless the code had been made available. Pieces described as being done “in a manner similar to that already described” in some other reference were retained, but I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.
What I put in the bucket included specific names of items or objects that would be likely to be necessary and/or unfamiliar to students or researchers outside of the bioinformatics community. Some are related, but different. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is, a conceptual difference and a separate object in my designation system here. Some are sub-components of other tools but important aspects to understand (GOTERM_BP_FAT at DAVID or randomBed from BEDTools), and these are individual named items in the report, as they might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection; inclusion or removal of specific items can be revisited in future iterations of the list.
After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.
How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers that I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features has been in the extensive supplemental materials.
The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.
As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
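For anyone who does want to attempt such text-mining, a minimal sketch of the matching step might look like the following Python. The item list and the snippets of “paper text” here are hypothetical stand-ins; a real run would load the full main text and supplement and a much longer inventory of names (the ~133 items in Table 1):

```python
import re

# Hypothetical inventory of names to look for; the real list (Table 1)
# spans tools, file formats, repositories, and databases.
KNOWN_ITEMS = ["RepeatMasker", "BEDTools", "SAMtools", "ENSEMBL", "FASTQ"]

def find_mentions(text, items=KNOWN_ITEMS):
    """Return a dict of item -> mention count, matching on word
    boundaries, case-insensitively, keeping only items that appear."""
    counts = {}
    for item in items:
        pattern = re.compile(r"\b" + re.escape(item) + r"\b", re.IGNORECASE)
        counts[item] = len(pattern.findall(text))
    return {name: n for name, n in counts.items() if n > 0}

# Stand-ins for the main text and the supplement; note the supplement
# surfaces items (ENSEMBL) that the main text alone would miss.
main_text = "Reads were processed with SAMtools; repeats were annotated with RepeatMasker."
supplement = "RepeatMasker (v4.0) was run again; annotations came from ENSEMBL version 70."

print(find_mentions(main_text))
print(find_mentions(main_text + " " + supplement))
```

Even this toy version illustrates the point above: searching only the main text undercounts, so any mining pipeline has to pull in the supplementary files, tables, figure legends, and references as well.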
Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc., has always been challenging. These are analogous to the “reagents” that I would have put in my benchwork biology papers years ago. They may matter to the outcome, such as enzyme vendors, mouse strain versions, or antibody species details. They constitute things you would need to reproduce or extend the work, or to appropriately understand the context. But in the case of bioinformatics, this can mean file formats such as the FASTQ or axt format from UCSC Genome Browser. They can mean repository resources like the SRA. They can be various different versioned downloaded data sets from ENSEMBL (version 67, 69, 70, or 73 here, but which were counted only once as ENSEMBL). It might be references to Reactome in a table.
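To make the file-format point concrete, here is a minimal, hypothetical sketch of reading FASTQ, one of the formats mentioned above. It assumes the simple unwrapped four-line-per-record layout; real-world parsers handle more edge cases, but knowing even this much structure is part of the “reagent” knowledge a reader needs:

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ-formatted
    lines, assuming the unwrapped four-line-per-record layout."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

# A made-up single-record example
records = list(read_fastq([
    "@read_1",
    "GATTACA",
    "+",
    "IIIIIII",
]))
print(records)
```

The quality string is the same length as the sequence, with per-base scores packed into ASCII characters; which ASCII offset is in use is exactly the kind of versioned detail that a methods section needs to state.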
With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation are included when possible. Finally, a column designates whether it was found in the main paper as well.
What is not indicated is that some items are referenced multiple times in different contexts and usages, which might cause people not to realize how frequently they are used. For example, ironically, RepeatMasker was referenced so many times that at one point I stopped marking it up.
Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post: http://blog.openhelix.eu/?p=20002, or access at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867
What can we learn about the source or use of these items?
Searching for information about the source code, data sets, file types, repositories, and associated descriptions of these items yielded a wide variety of access. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). These may be paywalled in certain publications, or are described in unavailable meeting papers. Some do not have associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages, some are student dissertations. Some tools are found on project-specific pages. Some exist on code repositories, sometimes deprecated ones that may disappear. A number of them have moved from their initial publication locations without forwarding addresses. Some are allusions to procedures in other publications. Some of them are like time travel right back to the 1990s, with pages that appear original to the era. Some may be at risk of disappearing completely the next time an update to a university web site changes site access.
Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.
When details of data processing or software implementations are provided, the amount can vary. Sometimes parameters are included; sometimes they are not.
Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:
“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”
Even the idea of “outreach” and support for use of the tools is certainly unclear to the tool providers, apparently. Training? Yeah, not in any formal way.
The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and some tools that were new and interesting to me were provided. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing the provenance of them, though, is uneven and challenging, and this is not unique to this project—it’s a problem among the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages may disappear or become unusable. This project has provided an interesting snapshot of the state of play, and good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.
It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about—the World Wide Web. We had one of those little funny Macs with the teeny screens, and we found people were making texty web pages with banal fonts and odd colors, and talking about their research.
Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.
Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here), and looking for novel splice variants. I found them. Just by typing—no radioactivity and gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools).
Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.
At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?
But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, could you establish some idea of where to aim? So I used this paper to analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?
It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.
Carbone L., R. Alan Harris, Sante Gnerre, Krishna R. Veeramah, Belen Lorente-Galdos, John Huddleston, Thomas J. Meyer, Javier Herrero, Christian Roos, Bronwen Aken & Fabio Anaclerio & al. (2014). Gibbon genome and the fast karyotype evolution of small apes, Nature, 513 (7517) 195-201. DOI: http://dx.doi.org/10.1038/nature13679
FigShare version of this post: http://dx.doi.org/10.6084/m9.figshare.1194879
Recently, the Broad Institute announced a new tool: GenomeSpace. When I first looked at it, admittedly a very cursory look, I wasn’t sure how it would be much different from an integrator of tools like Galaxy or GenePattern. That first impression was wrong, since both Galaxy and GenePattern are in the list of tools it supports. So what is GenomeSpace? Well, you can read the answer at their “What is GenomeSpace” page :). Basically, GenomeSpace has several functions. As described there, “GenomeSpace supports several bioinformatics tools, all integrated to allow easy accessibility, easy conversion, and frictionless sharing.” It is a space (in that ever-expanding Amazon cloud) that allows you to store your data files and, importantly, to seamlessly move those files between the tools to complete complex, or simple, analyses. It achieves this by automatically converting file formats and by allowing the user to attach their accounts at the tools to their GenomeSpace account, alleviating the need to log in several times when using more than one tool.
To get a good idea of what GenomeSpace might be able to do for a researcher, check out the recipes on the site. As Anton states:
“GenomeSpace is an integration of integrators,” Nekrutenko said. “The benefit to the user is that this brings together distinctive collections of functionalities offered by individual tools.”
The site is new and still in beta; they only recently opened up registration from their invite-only stage. As such, there are some bugs and some features that aren’t quite at full capacity. For example, during the beta, the Galaxy and UCSC Table Browser integration works with the test versions of those tools. Thus, your account at the public Galaxy will not be recognized when you try to link it with GenomeSpace; I had to create a new one on the test site. And if you go to the public version of the Table Browser, it will look different (no link to GenomeSpace as there is on the test site). Currently there are seven tools, with more to come.
All that aside, it’s definitely a tool to get acquainted with. And with that in mind, take a quick introductory spin with me in this week’s video tip to get an idea of what you might be able to do.
Recently many of the bioinformatics tweeps I follow were excited about the tool called VarSifter. Here’s the notice that I saw:
I just had a chance to watch the video, and now I can see why they were impressed! Over the years in the workshops we do, people have asked questions in various theme groups. For a while it was lists of genes and microarrays. Then it was known SNP variations. Then it became transcription factor binding sites. Lately it’s been: I have a giant set of sequence data that I need to process to find new variants that might impact genes. How do I do that? This video tip-of-the-week will help you to understand how to do that.
This video was part of a day of lectures at the NHGRI about how to deal with exome sequencing data: Next-Gen 101: Video Tutorial on Conducting Whole-Exome Sequencing Research. A whole series of video and slide material is available from NHGRI’s page, and the one I’m highlighting here is number 3 on that list. Be sure to download the slides if you want to take notes and access the references and URLs that are key to the material.
Jamie Teer gives a terrific talk about dealing with the exome sequence data output that next-gen projects are yielding. It starts with just managing and viewing the reads, and he highlights a couple of different ways to do this, including SAMtools, and shows how the reads look in both the UCSC Genome Browser and the Broad’s Integrative Genomics Viewer, IGV. It’s nice to see a comparison of these to illustrate what you might expect to see. We could help you understand how to load this kind of data as custom tracks in the UCSC Genome Browser with our advanced tutorial, and you’ll find some nice guidance on what to expect from IGV in the paper listed below in the references area.
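To give a flavor of what a “custom track” actually is, here is a minimal sketch of building one in Python. A custom track is just a small text file: a track line with display attributes, followed by BED-format rows. The track name, coordinates, and variant labels below are invented for illustration; real positions would come from your own variant calls:

```python
# Hypothetical variant calls: (chromosome, start, end, name).
# BED uses 0-based, half-open coordinates.
variants = [
    ("chr1", 1000, 1001, "varA"),
    ("chr1", 2500, 2501, "varB"),
]

# The track line controls how the browser labels and displays the track.
lines = ['track name="my_variants" description="Candidate variants" visibility=2']
for chrom, start, end, name in variants:
    lines.append(f"{chrom}\t{start}\t{end}\t{name}")

custom_track = "\n".join(lines)
print(custom_track)
```

The resulting text can be pasted into the browser’s custom track box or hosted at a URL; the same BED file also loads directly into IGV, which makes the side-by-side comparison in the talk easy to try yourself.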
The video also describes annotation software that helps you to identify where the variations and consequences are in the data. Many of these tools we have talked about either in our tutorials or our other tips-of-the-week.
He also describes how people generate pipelines to flow the data through a series of steps to do the analysis. Sometimes these are home-made programs used by a local group. But he also mentioned how Galaxy can help to accomplish this now. We’ve been fans of Galaxy for a long time, and we know people are using it in exactly this manner.
You should still have a basic understanding of each tool individually if you want to use them all, or to use tools that incorporate them into workflows or pipelines. It will help you create better workflows, and it also matters that you know what you aren’t seeing or using.
Teer closes by introducing the VarSifter software that he’s been involved with creating. This software is freely available for you to download at the VarSifter site. Usually we prefer to highlight web-based interfaces, but there isn’t one for VarSifter. But if you see the utility in it you can also try to get a local copy set up for yourself. VarSifter will help you to view, sort, and filter variants in a lot of ways.
So have a look at this video if you are interested in understanding how these analyses are done, and in knowing more about the tools that can be used. It’s worth the 40 minutes, really.
YouTube page: http://www.youtube.com/watch?v=I7azpqTWFuM
VarSifter home page: http://research.nhgri.nih.gov/software/VarSifter/
Exome analysis Talks at NHGRI: http://www.genome.gov/27545880
IGV: Robinson, J., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E., Getz, G., & Mesirov, J. (2011). Integrative genomics viewer Nature Biotechnology, 29 (1), 24-26 DOI: 10.1038/nbt.1754
UCSC new paper: Dreszer, T., Karolchik, D., Zweig, A., Hinrichs, A., Raney, B., Kuhn, R., Meyer, L., Wong, M., Sloan, C., Rosenbloom, K., Roe, G., Rhead, B., Pohl, A., Malladi, V., Li, C., Learned, K., Kirkup, V., Hsu, F., Harte, R., Guruvadoo, L., Goldman, M., Giardine, B., Fujita, P., Diekhans, M., Cline, M., Clawson, H., Barber, G., Haussler, D., & James Kent, W. (2011). The UCSC Genome Browser database: extensions and updates 2011 Nucleic Acids Research DOI: 10.1093/nar/gkr1055
SAMtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & , . (2009). The Sequence Alignment/Map format and SAMtools Bioinformatics, 25 (16), 2078-2079 DOI: 10.1093/bioinformatics/btp352
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
- TCGA = The Cancer Genome Atlas: RT @ewanbirney: RT @genome_gov: TCGA data will now flow to ICGC data portal & be available for integrated analysis w/ int’l cancer genomics projects.
- The World Health Organization (WHO) to identify the major priorities for using genomics in developing nations. HT: GenomeWeb [Jennifer]
- Fascinating work-around for a CNV region in the human genome. RT @Awesomics: Hydatidiform moles help close gap in clinically important region of genome MT @GenomeRef The CCL3L1 region of chr17 http://bit.ly/ff01Hy [Mary]
- Yay bugs! 5000 Insects Genomes project, or i5k. (Remember when 1000 genomes was enough for a big data project? Glory days…) Check out the PDF interview for a nice readable take on the project framework and goals. Hat tip to RT @physorg_biology: Entomologists launch the 5,000 Insect Genome Project (i5k) http://bit.ly/kybQof [Mary]
- I love the smell of fresh new attributes… RT @NCBI: New attributes have been added to dbSNP to allow for searching and filtering human genomic variations. http://1.usa.gov/kUeJw8 [Mary]
- Why we really do bioinformatics? Um, not so much for me. But did crack me up. RT @davebroome: @SimonBux But then you’d have to take out the part saying “bioinformatics”, which is how you impress the ladies. [Mary]
- RT @aaronquinlan: IGV 2.0 is out with new NGS features: “split-view”, “view as pairs”, “splice-junctions”. very nice. #bioinformatics http://goo.gl/OyNdv [Mary]
- Pierre has uploaded some slides that are great, and made me laugh out loud in the twitter + facebook section: RT @yokofakun: I’ve uploaded a *draft* of my presentation “being a Bioinformatician 2.0″ http://slidesha.re/jinWgY [Mary]
Recently I’ve been coming across more and more requests and need for genome annotation and visualization software. Genomes are being completed left and right, and researchers need ways to browse and annotate them. There are a lot of tools out there, so this post is a quick attempt to start listing them. It is not exhaustive; right now it contains the ones off the top of my head, focused a bit on visualization (though there is annotation). I plan to expand this list (have any to suggest?) and enhance it with more descriptions as time goes forward, and will probably make it a page if it becomes useful enough. I’m not listing databases (such as the UCSC Genome Browser, RGD, Ensembl, or FlyBase), but rather software that researchers can use to create such browsable genomes. So, here we go…
I haven’t had time to check it out, but I wanted to pass it along to those of you who might want to investigate over the holidays. I’m delighted to see that the Integrative Genomics Viewer team has added support for uploading your own genomes, so you aren’t restricted to the initial ones that were provided, as I discussed in a prior post. They are responsive to feedback; that’s very nice.
Dear IGV user,
We are pleased to announce the release of IGV version 1.2.
Highlights of this release include:
- Additional supported genomes. Several genomes have been added to the IGV genome server, including Plasmodium, Neurospora crassa, and S. pombe. For a complete list of supported genomes, see the Resources page on the IGV web site at http://www.broad.mit.edu/igv/resources.html.
- Imported genomes. IGV is no longer limited to the set of genomes on the IGV genome server. Users can now import their favorite genomes directly into IGV.
- Additional data file formats. IGV now supports General Feature Format (GFF2, GFF3) and Wiggle (WIG) files.
- Support for probe identifiers. When loading a GCT or RES file, IGV now recognizes Affymetrix, Agilent and Illumina probe identifiers and automatically places the gene expression data at the correct loci.
Download IGV 1.2 from:
The release notes are available at:
We always welcome your comments, including ideas for IGV improvements as well as requests for genomes and data tracks to be added to the IGV server. Contact us at email@example.com.
The IGV Team
From the Genome-Technology mailing list I found out about this software release from the Broad Institute:
NEW YORK (GenomeWeb News) – The Broad Institute of MIT and Harvard has created a genomics informatics tool that will allow researchers to visualize genomic information, and has made it publicly available for free, Broad said today….
So of course I went to check it out. Because I love new software! You can check it out yourself here:
Integrative Genomics Viewer: http://www.broad.mit.edu/igv/
There is a quick start introduction and a movie you can watch where someone demonstrates some clicks (no audio, or if there was I didn’t get any). A quick registration gives you access. A little java downloading and you are off to the races. There is a sample data set to get you started.
My first question was: what genomes can I see? Lo and behold–the FAQ says:
Answer: Sequence is read from the genome on a server at the Broad. For sequences to appear, you must be connected to the internet, the server must be available, and the genome that you have selected must be on the server. As of July 2008, the server provides sequence for the following genomes: hg17, hg18, mm8, and mm9.
At first I thought it was a tool to pull in your own genomes and view stuff, but it appears to rely on what’s on their server. I haven’t dug in enough yet, though, so I’m not certain that’s the final answer. But if you are using one of those genomes, I could see some real utility in pulling in your data as tracks and viewing it alongside the reference sequence.
Looks nice to me. I’ll be checking it out some more and I’ll let you know what I find. Feel free to add your own reviews!